US20010056347A1 - Feature-domain concatenative speech synthesis - Google Patents

Feature-domain concatenative speech synthesis

Info

Publication number
US20010056347A1
Authority
US
United States
Prior art keywords
segments
feature vectors
speech
speech signal
output
Prior art date
Legal status
Granted
Application number
US09/901,031
Other versions
US7035791B2
Inventor
Dan Chazan
Ron Hoory
Current Assignee
Nuance Communications Inc
Original Assignee
International Business Machines Corp
Priority date
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US09/901,031
Publication of US20010056347A1
Application granted
Publication of US7035791B2
Assigned to NUANCE COMMUNICATIONS, INC.: ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: INTERNATIONAL BUSINESS MACHINES CORPORATION
Assigned to CERENCE INC.: INTELLECTUAL PROPERTY AGREEMENT. Assignors: NUANCE COMMUNICATIONS, INC.
Assigned to CERENCE OPERATING COMPANY: CORRECTIVE ASSIGNMENT TO CORRECT THE ASSIGNEE NAME PREVIOUSLY RECORDED AT REEL: 050836 FRAME: 0191. ASSIGNOR(S) HEREBY CONFIRMS THE INTELLECTUAL PROPERTY AGREEMENT. Assignors: NUANCE COMMUNICATIONS, INC.
Assigned to BARCLAYS BANK PLC: SECURITY AGREEMENT. Assignors: CERENCE OPERATING COMPANY
Assigned to CERENCE OPERATING COMPANY: RELEASE BY SECURED PARTY (SEE DOCUMENT FOR DETAILS). Assignors: BARCLAYS BANK PLC
Assigned to WELLS FARGO BANK, N.A.: SECURITY AGREEMENT. Assignors: CERENCE OPERATING COMPANY
Adjusted expiration
Assigned to CERENCE OPERATING COMPANY: CORRECTIVE ASSIGNMENT TO REPLACE THE CONVEYANCE DOCUMENT WITH THE NEW ASSIGNMENT PREVIOUSLY RECORDED AT REEL: 050836 FRAME: 0191. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT. Assignors: NUANCE COMMUNICATIONS, INC.
Expired - Lifetime

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/06 Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L13/07 Concatenation rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Definitions

  • the present invention relates generally to computerized speech synthesis, and specifically to methods and systems for efficient, high-quality text-to-speech conversion.
  • TTS: text-to-speech
  • TTS systems for synthesis of arbitrary speech typically perform three essential functions:
  • MFCCs: mel-frequency cepstral coefficients
  • the synthesizer applies a cost function to the feature vectors of the speech segments, based on a measure of vector distance.
  • the synthesizer then concatenates the selected segments, while adjusting their prosody and pitch to provide a smooth, natural speech output.
  • Pitch Synchronous Overlap and Add (PSOLA) algorithms are used for this purpose, such as the Time Domain PSOLA (TD-PSOLA) algorithm described in the above-mentioned thesis by Donovan. This algorithm breaks speech segments into many short-term (ST) signals by Hanning windowing.
  • ST signals are altered to adjust their pitch and duration, and are then recombined using an overlap-add scheme to generate the speech output.
  • Although PSOLA schemes give generally good speech quality, they require a large database of carefully chosen speech segments.
  • One of the reasons for this requirement is that PSOLA is very sensitive to prosody changes, especially pitch modification. Therefore, in order to minimize the prosody modifications at synthesis time, the database must contain segments with a large variety of pitch and duration values.
  • Other problems with PSOLA schemes include:
  • U.S. Pat. No. 5,751,907, to Moebius et al. whose disclosure is incorporated herein by reference, describes a speech synthesizer having an acoustic element database that is established from phonetic sequences occurring in an interval of natural speech. The sequences are chosen so that perceptible discontinuities at junction phonemes between acoustic elements are minimized in the synthesized speech.
  • U.S. Pat. No. 5,913,193, to Huang et al. whose disclosure is also incorporated herein by reference, describes a concatenative speech synthesis system that stores multiple instances of each acoustic unit during a training phase. The synthesizer chooses the instance that most closely resembles a desired instance, so that the need to alter the stored instance is reduced, while also reducing spectral distortion between the boundaries of adjacent instances.
  • U.S. Pat. No. 6,041,300 to Ittycheriah et al., whose disclosure is incorporated herein by reference, describes a speech recognition system that synthesizes and replays words that are spoken into the system so that the speaker can confirm that the word is correct.
  • the system uses a waveform database, from which appropriate waveforms are selected, followed by acoustic adjustment and concatenation of the waveforms.
  • the component phonemes in the spoken words are divided into sub-units, known as lefemes, which are the beginning, middle and ending portions of the phoneme.
  • the lefemes are modeled and analyzed using Hidden Markov Models (HMMs). HMM-modeling of lefemes can also be used in speech synthesis, as described in the above-mentioned U.S. Pat. No. 5,913,193 and in Donovan's thesis.
  • HMMs: Hidden Markov Models
  • complex line spectrum refers to the sequence of respective sine-wave amplitudes, phases and frequencies in a sinusoidal speech representation.
  • the sequences of feature vectors corresponding to successive speech output segments are concatenated in the feature domain, rather than in the time domain as in TD-PSOLA and related techniques known in the art. Only after concatenation and spectral reconstruction is the spectrum converted to the time domain (preferably by short-term inverse Discrete Fourier Transform) for output as a speech signal. This method is further described by Chazan et al. in “Speech Reconstruction from Mel Frequency Cepstral Coefficients and Pitch Frequency,” Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) , June, 2000, which is incorporated herein by reference.
  • Preferred embodiments of the present invention provide methods and devices for speech synthesis, based on storing feature vectors corresponding to speech segments, and then synthesizing speech by selecting and concatenating the feature vectors. These methods are useful particularly in the context of feature-domain speech synthesis, as described in the above-mentioned U.S. patent application and in the article by Chazan et al. They enable high-quality speech to be synthesized from a text input, while using a much smaller database of speech segments than is required by speech synthesis systems known in the art.
  • the segment database is constructed by recording natural speech, partitioning the speech into phonetic units, preferably lefemes, and analyzing each unit to determine corresponding segment data.
  • these data comprise, for each segment, a corresponding sequence of feature vectors, a segment lefeme index, and segment duration, energy and pitch values.
  • the feature vectors comprise spectral coefficients, such as MFCCs, along with voicing information, and are compressed to reduce the volume of data in the database.
  • a TTS front end analyzes the input text to generate phoneme labels and prosodic parameters.
  • the phonemes are preferably converted into lefemes, represented by corresponding HMMs, as is known in the art.
  • a segment selection unit chooses a series of segments from the database corresponding to the series of lefemes and their prosodic parameters by computing and minimizing a cost function over the candidate segments in the database.
  • the cost function depends both on a distance between the required segment parameters and the candidate parameters and on a distance between successive segments in the series, based on their corresponding feature vectors.
  • the selected segments are adjusted based on the prosodic parameters, preferably by modifying the sequences of feature vectors to accord with the required duration and energy of the segments.
  • the adjusted sequences of feature vectors for the successive segments are then concatenated to generate a combined sequence, which is processed to reconstruct the output speech, preferably as described in the above-mentioned U.S. patent application.
  • a method for speech synthesis including:
  • a segment inventory including, for a plurality of speech segments, respective sequences of feature vectors, by estimating spectral envelopes of input speech signals corresponding to the speech segments in a succession of time intervals during each of the speech segments, and integrating the spectral envelopes over a plurality of window functions in a frequency domain so as to determine vector elements of the feature vectors;
  • providing the segment inventory includes providing segment information including respective phonetic identifiers of the segments, and selecting the sequences of feature vectors includes finding the segments whose phonetic identifiers are close to the received phonetic information.
  • the segments include lefemes, and the phonetic identifiers include lefeme labels.
  • the segment information further includes one or more prosodic parameters with respect to each of the segments, and selecting the sequences of feature vectors includes finding the segments whose one or more prosodic parameters are close to the received prosodic information.
  • the one or more prosodic parameters are selected from a group of parameters consisting of a duration, an energy level and a pitch of each of the segments.
  • the feature vectors include auxiliary vector elements indicative of further features of the speech segments, in addition to the elements determined by integrating the spectral envelopes of the input speech signals.
  • the auxiliary vector elements include voicing vector elements indicative of a degree of voicing of frames of the corresponding speech segments, and computing the complex line spectra includes reconstructing the output speech signal with the degree of voicing indicated by the voicing vector elements.
  • receiving the prosodic information includes receiving pitch values, and reconstructing the output speech signal includes adjusting a frequency spectrum of the output speech signal responsive to the pitch values.
  • selecting the sequences of feature vectors includes selecting candidate segments from the inventory, computing a cost function for each of the candidate segments responsive to the phonetic and prosodic information and to the feature vectors of the candidate segments, and selecting the segments so as to minimize the cost function.
  • concatenating the selected sequences of feature vectors includes adjusting the feature vectors responsive to the prosodic information.
  • the prosodic information includes respective durations of the segments to be incorporated in the output speech signal, and adjusting the feature vectors includes removing one or more of the feature vectors from the selected sequences so as to shorten the durations of one or more of the segments, or adding one or more further feature vectors to the selected sequences so as to lengthen the durations of one or more of the segments.
  • the prosodic information includes respective energy levels of the segments to be incorporated in the output speech signal, and adjusting the feature vectors includes altering one or more of the vector elements so as to adjust the energy levels of one or more of the segments.
  • processing the selected sequences includes adjusting the vector elements so as to provide a smooth transition between the segments in the time domain signal.
  • a method for speech synthesis including:
  • receiving the input speech signal includes dividing the input speech signal into the segments and determining segment information including respective phonetic identifiers of the segments, and reconstructing the output speech signal includes selecting the segments whose feature vectors are to be concatenated responsive to the segment information determined with respect to the segments.
  • dividing the input speech signal into the segments includes dividing the signal into lefemes, and wherein the phonetic identifiers include lefeme labels.
  • determining the segment information further includes finding respective segment parameters including one or more of a duration, an energy level and a pitch of each of the segments, responsive to which parameters the segments are selected for use in reconstructing the output speech signal, and reconstructing the output speech signal includes modifying the feature vectors of the selected segments so as to adjust the segment parameters of the segments in the output speech signal.
  • the window functions are non-zero only within different, respective spectral windows and have variable values over their respective windows
  • integrating the spectral envelopes includes calculating products of the spectral envelopes with the window functions, and calculating integrals of the products over the respective windows of the window functions.
  • the method includes applying a mathematical transformation to the integrals in order to determine the elements of the feature vectors.
  • the frequency domain includes a Mel frequency domain
  • applying the mathematical transformation includes applying log and discrete cosine transform operations in order to determine Mel Frequency Cepstral Coefficients to be used as the elements of the feature vectors.
  • a device for speech synthesis including:
  • a memory arranged to hold a segment inventory including, for a plurality of speech segments, respective sequences of feature vectors having vector elements determined by estimating spectral envelopes of input speech signals corresponding to the speech segments in a succession of time intervals during each of the speech segments, and integrating the spectral envelopes over a plurality of window functions in a frequency domain;
  • a speech processor arranged to receive phonetic and prosodic information indicative of an output speech signal to be generated, to select the sequences of feature vectors from the inventory responsive to the phonetic and prosodic information, to process the selected sequences of feature vectors so as to generate a concatenated output series of feature vectors, and to compute a series of complex line spectra of the output signal from the series of the feature vectors and transform the complex line spectra to a time domain speech signal for output.
  • a device for speech synthesis including:
  • a memory arranged to hold a segment inventory determined by processing an input speech signal containing a set of speech segments so as to estimate spectral envelopes of the input speech signal in a succession of time intervals during each of the speech segments, and integrating the spectral envelopes over a plurality of window functions in a frequency domain so as to determine elements of feature vectors corresponding to the speech segments;
  • a speech processor arranged to reconstruct an output speech signal by concatenating the feature vectors corresponding to a sequence of the speech segments.
  • a computer software product including a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to access a segment inventory including, for a plurality of speech segments, respective sequences of feature vectors having vector elements determined by estimating spectral envelopes of input speech signals corresponding to the speech segments in a succession of time intervals during each of the speech segments, and integrating the spectral envelopes over a plurality of window functions in a frequency domain, and in response to phonetic and prosodic information indicative of an output speech signal to be generated, cause the computer to select the sequences of feature vectors from the inventory responsive to the phonetic and prosodic information, to process the selected sequences of feature vectors so as to generate a concatenated output series of feature vectors, and to compute a series of complex line spectra of the output signal from the series of the feature vectors and transform the complex line spectra to a time domain speech signal for output
  • a computer software product including a computer-readable medium in which a segment inventory is stored, the inventory having been determined by processing an input speech signal containing a set of speech segments so as to estimate spectral envelopes of the input speech signal in a succession of time intervals during each of the speech segments, and integrating the spectral envelopes over a plurality of window functions in a frequency domain so as to determine elements of feature vectors corresponding to the speech segments, so that a speech processor can reconstruct an output speech signal by concatenating the feature vectors corresponding to a sequence of the speech segments.
  • FIG. 1 is a block diagram that schematically illustrates a device for synthesis of speech signals, in accordance with a preferred embodiment of the present invention
  • FIG. 2 is a block diagram that schematically shows details of the device of FIG. 1, in accordance with a preferred embodiment of the present invention.
  • FIG. 3 is a flow chart that schematically illustrates a method for generating a speech segment inventory, in accordance with a preferred embodiment of the present invention.
  • FIG. 1 is a block diagram that schematically illustrates a speech synthesis device 20 , in accordance with a preferred embodiment of the present invention.
  • Device 20 typically comprises a general-purpose or embedded computer processor, which is programmed with suitable software for carrying out the functions described hereinbelow.
  • Although device 20 is shown in FIG. 1 as comprising a number of separate functional blocks, these blocks are not necessarily separate physical entities, but rather represent different computing tasks. These tasks may be carried out in software running on a single processor, or on multiple processors.
  • the software may be provided to the processor or processors in electronic form, for example, over a network, or it may be furnished on tangible media, such as CD-ROM or non-volatile memory.
  • device 20 may comprise a digital signal processor (DSP) or hard-wired logic.
  • DSP: digital signal processor
  • Device 20 typically receives its input in the form of a stream of text characters.
  • a TTS front end 22 of the processor analyzes the text to generate phoneme labels and prosodic information, as is known in the art.
  • the prosodic information preferably comprises pitch, energy and duration associated with each of the phonemes.
  • An adapter 24 converts the phonetic labels and prosodic information into a form required by a segment selection and concatenation block 26 .
  • Although front end 22 and adapter 24 are shown for the sake of clarity as separate functional units, the functions of these two units may easily be combined.
  • Preferably, for each phoneme, adapter 24 generates three lefeme labels, each comprising a HMM, as is known in the art.
  • the duration and energy of each phoneme are likewise converted into a series of three lefeme durations and lefeme energies. This conversion can be carried out using simple interpolation methods or, alternatively, by following a decision tree from its roots down to the leaves associated with the appropriate HMMs. The decision tree method is described by Donovan in the above-mentioned thesis.
  • Adapter 24 preferably interpolates the pitch values output by front end 22 , most preferably so that there is a pitch value for every 10 ms frame of output speech.
  • Segment selection and concatenation block 26 receives the lefeme labels and prosodic parameters generated by adapter 24 , and uses these data to produce a series of feature vectors for output to a feature reconstructor 32 .
  • Block 26 generates the series of feature vectors based on feature data extracted from a segment inventory 28 held in a memory associated with device 20 .
  • Inventory 28 contains a database of speech segments, along with a corresponding sequence of feature vectors for each segment. The inventory is preferably produced using methods described hereinbelow with reference to FIG. 3. Each speech segment in the inventory is identified by segment information, including a corresponding lefeme label, duration and energy.
  • the feature vectors comprise spectral coefficients, most preferably MFCCs, along with a voicing parameter, indicating whether the corresponding speech frame is voiced or unvoiced.
  • the feature vectors are held in the memory in compressed form, and are decompressed by a decompression unit 30 when required by block 26 . Further details of the operation of block 26 are described hereinbelow with reference to FIG. 2.
  • Feature reconstructor 32 processes the series of feature vectors that are output by block 26 , together with the associated pitch information from adapter 24 , so as to generate a synthesized speech signal in digital form.
  • Reconstructor 32 preferably operates in accordance with the method described in the above-mentioned U.S. patent application Ser. No. 09/432,081. Further aspects of this method are described in the above-mentioned article by Chazan et al., as well as in U.S. patent application Ser. No. 09/410,085, which is assigned to the assignee of the present patent application, and whose disclosure is incorporated herein by reference.
  • FIG. 2 is a block diagram that schematically shows details of segment selection and concatenation block 26 , in accordance with a preferred embodiment of the present invention.
  • a segment selector 40 in block 26 is responsible for selecting the segments from inventory 28 that correspond to the segment information received from adapter 24 .
  • a candidate selection block 46 finds the segments in the inventory whose segment parameters (lefeme label, duration, energy and pitch) are closest to the parameters specified by adapter 24 .
  • a distance between the specified parameters and the parameters of the candidate segments in inventory 28 is determined as a weighted sum of the differences of the corresponding parameters. Certain parameters, such as pitch, may have little or no weight in this sum.
  • the segments in inventory 28 whose respective distances from the specified parameter set are smallest are chosen as candidates.
  • for each of the candidate segments, block 46 determines a cost function.
  • the cost function is based on the distance between the specified parameters and the segment parameters, as described above, and on a distance between the current segment and the preceding segment in the series chosen by selector 40 . This distance between successive segments in the series is computed based on the respective feature vectors of the segments.
  • a dynamic programming unit 48 uses the cost function values to select the series of segments that minimizes the cost function. Methods for cost function computation and dynamic programming of this sort are known in the art. Exemplary methods are described by Donovan in the above-mentioned thesis and by Huang et al. in the above-mentioned U.S. Pat. No. 5,913,193.
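  • The dynamic-programming search performed by unit 48 can be illustrated by the following minimal sketch, which picks one candidate per lefeme slot so as to minimize the sum of target costs and join costs. The cost weights, the dictionary-based candidate representation, and the use of the last and first frames for the join distance are illustrative assumptions, not the specific formulation of the patent or of Donovan's thesis.

```python
import numpy as np

def select_segments(candidates_per_slot, target_params, w_target=1.0, w_join=1.0):
    """Viterbi-style search: pick one candidate per slot minimizing the total cost.

    candidates_per_slot: one list of candidates per lefeme slot; each candidate is a
        dict with 'params' (array of duration/energy/... values) and 'features'
        (array of shape (n_frames, n_coeffs), the segment's feature-vector sequence).
    target_params: one target parameter array per slot, as supplied by the front end.
    """
    def target_cost(cand, target):
        # sum of absolute parameter differences (per-parameter weights omitted)
        return float(np.sum(np.abs(cand['params'] - target)))

    def join_cost(prev, cand):
        # distance between the last frame of the previous segment and the
        # first frame of the current one, computed on the feature vectors
        return float(np.linalg.norm(prev['features'][-1] - cand['features'][0]))

    n_slots = len(candidates_per_slot)
    cost = [np.full(len(c), np.inf) for c in candidates_per_slot]
    back = [np.zeros(len(c), dtype=int) for c in candidates_per_slot]

    for j, cand in enumerate(candidates_per_slot[0]):
        cost[0][j] = w_target * target_cost(cand, target_params[0])

    for i in range(1, n_slots):
        for j, cand in enumerate(candidates_per_slot[i]):
            joins = [cost[i - 1][k] + w_join * join_cost(prev, cand)
                     for k, prev in enumerate(candidates_per_slot[i - 1])]
            back[i][j] = int(np.argmin(joins))
            cost[i][j] = w_target * target_cost(cand, target_params[i]) + min(joins)

    # trace back the minimum-cost path of candidate indices, one per slot
    path = [int(np.argmin(cost[-1]))]
    for i in range(n_slots - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return list(reversed(path))
```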
  • the segments chosen by selector 40, along with their corresponding sequences of feature vectors and other segment parameters, are passed to a segment adjuster 42.
  • Adjuster 42 alters the segment parameters that were read from inventory 28 so that they match the prosodic information received from adapter 24 .
  • the duration and energy adjustment is carried out by modifying the feature vectors. For example, for each 10 ms by which the duration of a segment needs to be shortened, one feature vector is removed from the series. Alternatively, feature vectors may be duplicated or interpolated as necessary to lengthen the segment. As a further example, the energy of the segment may be altered by increasing or decreasing the lowest-order mel-cepstral coefficient for the MFCC feature vectors.
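  • As an illustrative sketch only, the duration and energy adjustments just described might be applied to a segment's MFCC sequence as follows; uniform resampling of the 10 ms frames and a log-domain shift of the lowest-order coefficient are assumed choices, two of several possibilities the text allows.

```python
import numpy as np

def adjust_segment(frames, target_n_frames, energy_gain=1.0):
    """Adjust a segment's feature-vector sequence to a target length and energy.

    frames: array of shape (n_frames, n_coeffs), MFCC vectors at 10 ms spacing.
    target_n_frames: desired number of 10 ms frames after adjustment.
    energy_gain: linear gain applied to the segment energy.
    """
    n = len(frames)
    # Shorten or lengthen by keeping (approximately) evenly spaced frames;
    # indices repeat when the segment must be lengthened.
    idx = np.round(np.linspace(0, n - 1, target_n_frames)).astype(int)
    adjusted = frames[idx].copy()
    # Energy change: shift c0, the lowest-order mel-cepstral coefficient,
    # which carries the frame's overall log energy.
    adjusted[:, 0] += np.log(energy_gain)
    return adjusted
```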
  • the adjusted feature vectors are input to a segment concatenator 44 , which generates the combined series of feature vectors that is output to reconstructor 32 .
  • FIG. 3 is a flow chart that schematically illustrates a method for generating segment inventory 28 , in accordance with a preferred embodiment of the present invention.
  • a recording is made of the speaker whose voice is to be synthesized, at a recording step 50 .
  • the speaker reads a list of sentences, which have been prepared in advance.
  • the speech is digitized and divided into frames, each preferably of 10 ms duration, at a frame analysis step 52 .
  • for each frame, a feature vector is computed by estimating the spectral envelope of the signal, multiplying the estimate by a set of frequency-domain window functions, and integrating the product of the multiplication over each of the windows.
  • the elements of the feature vector are given either by the integrals themselves or, preferably, by a set of predetermined functions applied to the integrals.
  • the vector elements are MFCCs, as described, for example, in the above-mentioned article by Davis et al. and in U.S. patent application Ser. No. 09/432,081.
  • the analysis at step 52 also estimates the pitch of the frame and thus determines whether the frame is voiced or unvoiced.
  • a preferred method of pitch estimation is described in U.S. patent application Ser. No. 09/617,582, filed Jul. 14, 2000, which is assigned to the assignee of the present patent application and is incorporated herein by reference.
  • the voicing parameter, indicating whether the frame is voiced or unvoiced, is then added to the feature vector.
  • the voicing parameter may indicate a degree of voicing, with a continuous value between 0 (purely unvoiced) and 1 (purely voiced). Further analysis may be carried out, and additional auxiliary information may be added to the feature vector in order to enhance the synthesized speech quality.
  • the digitized speech is further analyzed to partition it into segments, at a segmentation step 54 .
  • Each segment is classified, preferably using HMMs, as described by Donovan in the above-mentioned thesis, and in U.S. Pat. Nos. 5,913,193 and 6,041,300.
  • This classification yields segment parameters including a lefeme label (or lefeme index), energy level, duration, segment pitch and segment location in the database.
  • the energy level and pitch are computed based on the parameters of the frames in the present segment, which were determined at step 52 .
  • preferably, training of statistical models on the available recordings is performed first, in order to improve the classification.
  • such training involves retraining the HMMs and the decision trees using the database samples, so that they are adapted to the specific speaker and database contents. Prior to such retraining, it is assumed that a general, speaker-independent model is used for classification. A training procedure of this sort is described by Donovan in the above-mentioned thesis.
  • preferably, some of the segments are discarded, at a segment pre-selection step 56.
  • a suitable method for such preselection is described by Donovan in an article entitled “Segment Pre-selection in Decision-Tree Based Speech Synthesis Systems,” Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) , June, 2000, which is incorporated herein by reference.
  • the feature vectors are preferably compressed, at a compression step 58 .
  • An exemplary compression scheme is illustrated in Table I, below.
  • This scheme operates on a 24-dimensional MFCC feature vector by grouping the vector elements into sub-vectors, and then quantizing each sub-vector using a separate codebook.
  • the codebook is generated by training on the actual feature vector data that are to be included in inventory 28 , using training methods known in the art.
  • One training method that may be used for this purpose is K-means clustering, as described by Rabiner et al., in Fundamentals of Speech Recognition (Prentice-Hall, 1993), pages 125-128, which is incorporated herein by reference.
  • the codebook is then used by decompression unit 30 in decompressing the feature vectors as they are recalled from the inventory by block 26.
  • the compression scheme shown in Table I above relates to the MFCC elements of the feature vector.
  • Other elements of the vector, such as the voicing parameter and other auxiliary data, are preferably compressed separately from the MFCCs, typically by scalar or vector quantization.
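  • As a concrete illustration of this kind of split-vector quantization (the actual sub-vector grouping of Table I is not reproduced here), the sketch below trains one K-means codebook per sub-vector and stores only codebook indices; the group boundaries, codebook sizes and iteration count are assumptions made for the example. At synthesis time, decompression unit 30 would perform only the decompress step.

```python
import numpy as np

def train_codebook(data, n_codewords, n_iter=20, seed=0):
    """Plain K-means: returns an (n_codewords, dim) codebook for one sub-vector."""
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), n_codewords, replace=False)]
    for _ in range(n_iter):
        # assign each training vector to its nearest codeword, then re-estimate
        dist = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for k in range(n_codewords):
            if np.any(labels == k):
                codebook[k] = data[labels == k].mean(axis=0)
    return codebook

def compress(frames, groups, codebooks):
    """Replace each sub-vector of each frame by the index of its nearest codeword."""
    indices = []
    for (lo, hi), cb in zip(groups, codebooks):
        dist = np.linalg.norm(frames[:, None, lo:hi] - cb[None, :, :], axis=2)
        indices.append(dist.argmin(axis=1))
    return np.stack(indices, axis=1)            # shape (n_frames, n_groups)

def decompress(indices, groups, codebooks, dim=24):
    """Rebuild approximate MFCC frames from the stored codebook indices."""
    frames = np.zeros((len(indices), dim))
    for g, ((lo, hi), cb) in enumerate(zip(groups, codebooks)):
        frames[:, lo:hi] = cb[indices[:, g]]
    return frames

# Illustrative grouping of a 24-dimensional MFCC vector into four sub-vectors.
groups = [(0, 1), (1, 6), (6, 14), (14, 24)]
```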
  • the data for each of the segments selected at step 56 are stored in inventory 28 , at a storage step 60 .
  • these data preferably include the segment lefeme index, the segment duration, energy and pitch values, and the compressed series of feature vectors (including MFCCs, voicing information and possibly other auxiliary information) for the series of 10 ms frames that make up the segment.

Abstract

A method for speech synthesis includes receiving an input speech signal containing a set of speech segments, and estimating spectral envelopes of the input speech signal in a succession of time intervals during each of the speech segments. The spectral envelopes are integrated over a plurality of window functions in a frequency domain so as to determine elements of feature vectors corresponding to the speech segments. An output speech signal is reconstructed by concatenating the feature vectors corresponding to a sequence of the speech segments.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • This application is a continuation-in-part of U.S. patent application Ser. No. 09/432,081, which is assigned to the assignee of the present patent application and whose disclosure is incorporated herein by reference.[0001]
  • FIELD OF THE INVENTION
  • The present invention relates generally to computerized speech synthesis, and specifically to methods and systems for efficient, high-quality text-to-speech conversion. [0002]
  • BACKGROUND OF THE INVENTION
  • Effective text-to-speech (TTS) conversion requires not only that the acoustic TTS output be phonetically correct, but also that it faithfully reproduce the sound and prosody of human speech. When the range of phrases and sentences to be reproduced is fixed, and the TTS converter has sufficient memory resources, it is possible simply to record a collection of all of the phrases and sentences that will be used, and to recall them as required. This approach is not practical, however, when the text input is arbitrarily variable, or when speech is to be synthesized by a device having only limited memory resources, such as an embedded speech synthesizer in a handheld computing or communication device, for example. [0003]
  • TTS systems for synthesis of arbitrary speech typically perform three essential functions: [0004]
  • 1. Division of text into synthesis units, or segments, such as phonemes or other subdivisions. [0005]
  • 2. Determination of prosodic parameters, such as segment duration, pitch and energy. [0006]
  • 3. Conversion of the synthesis units and prosodic parameters into a speech stream. [0007]
  • A useful survey of these functions and of different approaches to their implementation is presented by Robert Edward Donovan in Trainable Speech Synthesis (Ph.D. dissertation, University of Cambridge, 1996), which is incorporated herein by reference. The present invention is concerned primarily with the third function, i.e., generation of a natural, intelligible speech stream from a sequence of phonetic and prosodic parameters. [0008]
  • In order to synthesize high-quality speech from an arbitrary text input, a large database is created, containing speech segments in a variety of different phonetic contexts. For any given text input, the synthesizer then selects the optimal segments from the database. Typically, the selection is based on a feature representation of the speech, such as mel-frequency cepstral coefficients (MFCCs). These coefficients are computed by integration of the spectrum of the recorded speech segments over triangular bins on a mel-frequency axis, followed by log and discrete cosine transform operations. Computation of MFCCs is described, for example, by Davis et al. in “Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences,” IEEE Transactions on Acoustics, Speech and Signal Processing ASSP-28 (1980), pp. 357-366, which is incorporated herein by reference. Other types of feature representations are also known in the art. [0009]
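  • The conventional MFCC computation described above can be sketched as follows; the 16 kHz sample rate, Hanning analysis window and 24 triangular filters are illustrative assumptions rather than values fixed by the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular bins spaced uniformly on the mel-frequency axis."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, ctr):
            fbank[i, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):
            fbank[i, k] = (hi - k) / max(hi - ctr, 1)
    return fbank

def mfcc(frame, sample_rate=16000, n_filters=24, n_coeffs=24):
    """Integrate the spectrum over triangular mel bins, then take log and DCT."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(n_fft)))
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ spectrum
    log_energies = np.log(np.maximum(energies, 1e-10))
    # Type-II DCT, written out directly to keep the example dependency-free.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), n + 0.5) / n_filters)
    return dct @ log_energies
```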
  • In order to dynamically choose the optimal segments from the database in real time, the synthesizer applies a cost function to the feature vectors of the speech segments, based on a measure of vector distance. The synthesizer then concatenates the selected segments, while adjusting their prosody and pitch to provide a smooth, natural speech output. Typically, Pitch Synchronous Overlap and Add (PSOLA) algorithms are used for this purpose, such as the Time Domain PSOLA (TD-PSOLA) algorithm described in the above-mentioned thesis by Donovan. This algorithm breaks speech segments into many short-term (ST) signals by Hanning windowing. The ST signals are altered to adjust their pitch and duration, and are then recombined using an overlap-add scheme to generate the speech output. [0010]
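  • Purely to illustrate the prior-art recombination step mentioned above, the sketch below shows how Hanning-windowed short-term signals are overlap-added at target pitch marks (given in samples); the pitch- and duration-modification logic of TD-PSOLA is omitted, and all names are illustrative.

```python
import numpy as np

def overlap_add(st_signals, target_marks, out_len):
    """Place each windowed short-term signal at its target pitch mark and sum."""
    out = np.zeros(out_len)
    for st, mark in zip(st_signals, target_marks):
        start = mark - len(st) // 2                  # centre the ST signal on the mark
        lo, hi = max(start, 0), min(start + len(st), out_len)
        out[lo:hi] += st[lo - start:hi - start]
    return out
```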
  • Although PSOLA schemes give generally good speech quality, they require a large database of carefully chosen speech segments. One of the reasons for this requirement is that PSOLA is very sensitive to prosody changes, especially pitch modification. Therefore, in order to minimize the prosody modifications at synthesis time, the database must contain segments with a large variety of pitch and duration values. Other problems with PSOLA schemes include: [0011]
  • Frequent mismatch between the selection process, which is based on spectral features extracted from the speech, and the concatenation process, which is applied to the ST signals. The result is audible discontinuities in the synthesized signal (typically resulting from phase mismatches). [0012]
  • High computational complexity of the segment selection process, caused by a complex cost function usually introduced to overcome the limitations mentioned above. [0013]
  • Large additional overhead to the speech data in the database (for example, pitch marking and features for segment selection) and a complex database generation (training) process. [0014]
  • There is therefore a need for a speech synthesis technique that can provide high-quality speech output without the large memory requirements and computational cost that are associated with PSOLA and other concatenative methods known in the art. [0015]
  • Various methods of concatenative speech synthesis are described in the patent literature. For example, U.S. Pat. No. 4,896,359, to Yamamoto et al., whose disclosure is incorporated herein by reference, describes a speech synthesizer that operates by actuating a voice source and a filter, which processes the voice source output based on a succession of short-interval feature vectors. U.S. Pat. No. 5,165,008, to Hermansky et al., whose disclosure is likewise incorporated herein by reference, describes a method for speech synthesis using perceptual linear prediction parameters, based on a speaker-independent set of cepstral coefficients. U.S. Pat. No. 5,740,320, to Itoh, whose disclosure is also incorporated herein by reference, describes a method of text-to-speech synthesis by concatenation of representative phoneme waveforms selected from a memory. The representative waveforms are chosen by clustering phoneme waveforms recorded in natural speech, and selecting the waveform closest to the centroid of each cluster as the representative waveform for the cluster. [0016]
  • Similarly, U.S. Pat. No. 5,751,907, to Moebius et al., whose disclosure is incorporated herein by reference, describes a speech synthesizer having an acoustic element database that is established from phonetic sequences occurring in an interval of natural speech. The sequences are chosen so that perceptible discontinuities at junction phonemes between acoustic elements are minimized in the synthesized speech. U.S. Pat. No. 5,913,193, to Huang et al., whose disclosure is also incorporated herein by reference, describes a concatenative speech synthesis system that stores multiple instances of each acoustic unit during a training phase. The synthesizer chooses the instance that most closely resembles a desired instance, so that the need to alter the stored instance is reduced, while also reducing spectral distortion between the boundaries of adjacent instances. [0017]
  • U.S. Pat. No. 6,041,300, to Ittycheriah et al., whose disclosure is incorporated herein by reference, describes a speech recognition system that synthesizes and replays words that are spoken into the system so that the speaker can confirm that the word is correct. The system uses a waveform database, from which appropriate waveforms are selected, followed by acoustic adjustment and concatenation of the waveforms. For the purpose of speech recognition, the component phonemes in the spoken words are divided into sub-units, known as lefemes, which are the beginning, middle and ending portions of the phoneme. The lefemes are modeled and analyzed using Hidden Markov Models (HMMs). HMM-modeling of lefemes can also be used in speech synthesis, as described in the above-mentioned U.S. Pat. No. 5,913,193 and in Donovan's thesis. [0018]
  • SUMMARY OF THE INVENTION
  • The above-mentioned U.S. patent application Ser. No. 09/432,081 describes an improved method for synthesizing speech based on spectral reconstruction of the speech from feature vectors, such as vectors of MFCCs or other cepstral parameters. In accordance with this method, a complex line spectrum of the output signal is computed as a non-negative linear combination of basis functions, derived from the feature vector elements. (In the context of the present patent application and in the claims, the term “complex line spectrum” refers to the sequence of respective sine-wave amplitudes, phases and frequencies in a sinusoidal speech representation.) The sequences of feature vectors corresponding to successive speech output segments are concatenated in the feature domain, rather than in the time domain as in TD-PSOLA and related techniques known in the art. Only after concatenation and spectral reconstruction is the spectrum converted to the time domain (preferably by short-term inverse Discrete Fourier Transform) for output as a speech signal. This method is further described by Chazan et al. in “Speech Reconstruction from Mel Frequency Cepstral Coefficients and Pitch Frequency,” Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), June, 2000, which is incorporated herein by reference. [0019]
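  • To make the final stage concrete: once a complex line spectrum (sine-wave amplitudes, phases and frequencies) has been reconstructed for a frame, it can be converted to a time-domain frame as sketched below. The patent prefers a short-term inverse Discrete Fourier Transform; the direct sinusoidal summation shown here is an equivalent but simpler illustration, with an assumed 10 ms frame at 16 kHz.

```python
import numpy as np

def synthesize_frame(amps, phases, freqs_hz, frame_len=160, sample_rate=16000):
    """Sum the sinusoids of one complex line spectrum over a single output frame."""
    t = np.arange(frame_len) / sample_rate
    frame = np.zeros(frame_len)
    for a, phi, f in zip(amps, phases, freqs_hz):
        frame += a * np.cos(2.0 * np.pi * f * t + phi)
    return frame

# Successive frames would then be windowed and overlap-added to form the
# continuous output speech signal.
```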
  • Preferred embodiments of the present invention provide methods and devices for speech synthesis, based on storing feature vectors corresponding to speech segments, and then synthesizing speech by selecting and concatenating the feature vectors. These methods are useful particularly in the context of feature-domain speech synthesis, as described in the above-mentioned U.S. patent application and in the article by Chazan et al. They enable high-quality speech to be synthesized from a text input, while using a much smaller database of speech segments than is required by speech synthesis systems known in the art. [0020]
  • In preferred embodiments of the present invention, the segment database is constructed by recording natural speech, partitioning the speech into phonetic units, preferably lefemes, and analyzing each unit to determine corresponding segment data. Preferably, these data comprise, for each segment, a corresponding sequence of feature vectors, a segment lefeme index, and segment duration, energy and pitch values. Most preferably, the feature vectors comprise spectral coefficients, such as MFCCs, along with voicing information, and are compressed to reduce the volume of data in the database. [0021]
  • To synthesize speech from text, a TTS front end analyzes the input text to generate phoneme labels and prosodic parameters. The phonemes are preferably converted into lefemes, represented by corresponding HMMs, as is known in the art. A segment selection unit chooses a series of segments from the database corresponding to the series of lefemes and their prosodic parameters by computing and minimizing a cost function over the candidate segments in the database. Preferably, the cost function depends both on a distance between the required segment parameters and the candidate parameters and on a distance between successive segments in the series, based on their corresponding feature vectors. The selected segments are adjusted based on the prosodic parameters, preferably by modifying the sequences of feature vectors to accord with the required duration and energy of the segments. The adjusted sequences of feature vectors for the successive segments are then concatenated to generate a combined sequence, which is processed to reconstruct the output speech, preferably as described in the above-mentioned U.S. patent application. [0022]
  • There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for speech synthesis, including: [0023]
  • providing a segment inventory including, for a plurality of speech segments, respective sequences of feature vectors, by estimating spectral envelopes of input speech signals corresponding to the speech segments in a succession of time intervals during each of the speech segments, and integrating the spectral envelopes over a plurality of window functions in a frequency domain so as to determine vector elements of the feature vectors; [0024]
  • receiving phonetic and prosodic information indicative of an output speech signal to be generated; [0025]
  • selecting the sequences of feature vectors from the inventory responsive to the phonetic and prosodic information; [0026]
  • processing the selected sequences of feature vectors so as to generate a concatenated output series of feature vectors; [0027]
  • computing a series of complex line spectra of the output signal from the series of the feature vectors; and [0028]
  • transforming the complex line spectra to a time domain speech signal for output. [0029]
  • Preferably, providing the segment inventory includes providing segment information including respective phonetic identifiers of the segments, and selecting the sequences of feature vectors includes finding the segments whose phonetic identifiers are close to the received phonetic information. Most preferably, the segments include lefemes, and the phonetic identifiers include lefeme labels. Additionally or alternatively, the segment information further includes one or more prosodic parameters with respect to each of the segments, and selecting the sequences of feature vectors includes finding the segments whose one or more prosodic parameters are close to the received prosodic information. Preferably, the one or more prosodic parameters are selected from a group of parameters consisting of a duration, an energy level and a pitch of each of the segments. [0030]
  • In a preferred embodiment, the feature vectors include auxiliary vector elements indicative of further features of the speech segments, in addition to the elements determined by integrating the spectral envelopes of the input speech signals. Preferably, the auxiliary vector elements include voicing vector elements indicative of a degree of voicing of frames of the corresponding speech segments, and computing the complex line spectra includes reconstructing the output speech signal with the degree of voicing indicated by the voicing vector elements. Further preferably, receiving the prosodic information includes receiving pitch values, and reconstructing the output speech signal includes adjusting a frequency spectrum of the output speech signal responsive to the pitch values. [0031]
  • Preferably, selecting the sequences of feature vectors includes selecting candidate segments from the inventory, computing a cost function for each of the candidate segments responsive to the phonetic and prosodic information and to the feature vectors of the candidate segments, and selecting the segments so as to minimize the cost function. [0032]
  • Further preferably, concatenating the selected sequences of feature vectors includes adjusting the feature vectors responsive to the prosodic information. Most preferably, the prosodic information includes respective durations of the segments to be incorporated in the output speech signal, and adjusting the feature vectors includes removing one or more of the feature vectors from the selected sequences so as to shorten the durations of one or more of the segments, or adding one or more further feature vectors to the selected sequences so as to lengthen the durations of one or more of the segments. Additionally or alternatively, the prosodic information includes respective energy levels of the segments to be incorporated in the output speech signal, and adjusting the feature vectors includes altering one or more of the vector elements so as to adjust the energy levels of one or more of the segments. [0033]
  • Preferably, processing the selected sequences includes adjusting the vector elements so as to provide a smooth transition between the segments in the time domain signal. [0034]
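  • One simple way of realizing such a smooth transition, offered only as an illustrative sketch, is to pull the feature vectors on either side of each segment joint toward their common boundary value; the patent text does not prescribe this particular smoothing scheme.

```python
import numpy as np

def smooth_joint(left_frames, right_frames, n_blend=2):
    """Blend the n_blend feature vectors on each side of a joint toward its average.

    left_frames, right_frames: arrays of shape (n_frames, n_coeffs) for the two
    segments being concatenated; both must contain at least n_blend frames.
    """
    left, right = left_frames.copy(), right_frames.copy()
    joint = 0.5 * (left[-1] + right[0])              # average feature vector at the joint
    for i in range(n_blend):
        w = (n_blend - i) / (n_blend + 1.0)          # stronger pull closer to the joint
        left[-1 - i] = (1.0 - w) * left[-1 - i] + w * joint
        right[i] = (1.0 - w) * right[i] + w * joint
    return left, right
```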
  • There is also provided, in accordance with a preferred embodiment of the present invention, a method for speech synthesis, including: [0035]
  • receiving an input speech signal containing a set of speech segments; [0036]
  • estimating spectral envelopes of the input speech signal in a succession of time intervals during each of the speech segments; [0037]
  • integrating the spectral envelopes over a plurality of window functions in a frequency domain so as to determine elements of feature vectors corresponding to the speech segments; and [0038]
  • reconstructing an output speech signal by concatenating the feature vectors corresponding to a sequence of the speech segments. [0039]
  • Preferably, receiving the input speech signal includes dividing the input speech signal into the segments and determining segment information including respective phonetic identifiers of the segments, and reconstructing the output speech signal includes selecting the segments whose feature vectors are to be concatenated responsive to the segment information determined with respect to the segments. Most preferably, dividing the input speech signal into the segments includes dividing the signal into lefemes, and wherein the phonetic identifiers include lefeme labels. Additionally or alternatively, determining the segment information further includes finding respective segment parameters including one or more of a duration, an energy level and a pitch of each of the segments, responsive to which parameters the segments are selected for use in reconstructing the output speech signal, and reconstructing the output speech signal includes modifying the feature vectors of the selected segments so as to adjust the segment parameters of the segments in the output speech signal. [0040]
  • Preferably, the window functions are non-zero only within different, respective spectral windows and have variable values over their respective windows, and integrating the spectral envelopes includes calculating products of the spectral envelopes with the window functions, and calculating integrals of the products over the respective windows of the window functions. Further preferably, the method includes applying a mathematical transformation to the integrals in order to determine the elements of the feature vectors. Most preferably, the frequency domain includes a Mel frequency domain, and applying the mathematical transformation includes applying log and discrete cosine transform operations in order to determine Mel Frequency Cepstral Coefficients to be used as the elements of the feature vectors. [0041]
  • There is additionally provided, in accordance with a preferred embodiment of the present invention, a device for speech synthesis, including: [0042]
  • a memory, arranged to hold a segment inventory including, for a plurality of speech segments, respective sequences of feature vectors having vector elements determined by estimating spectral envelopes of input speech signals corresponding to the speech segments in a succession of time intervals during each of the speech segments, and integrating the spectral envelopes over a plurality of window functions in a frequency domain; and [0043]
  • a speech processor, arranged to receive phonetic and prosodic information indicative of an output speech signal to be generated, to select the sequences of feature vectors from the inventory responsive to the phonetic and prosodic information, to process the selected sequences of feature vectors so as to generate a concatenated output series of feature vectors, and to compute a series of complex line spectra of the output signal from the series of the feature vectors and transform the complex line spectra to a time domain speech signal for output. [0044]
  • There is further provided, in accordance with a preferred embodiment of the present invention, a device for speech synthesis, including: [0045]
  • a memory, arranged to hold a segment inventory determined by processing an input speech signal containing a set of speech segments so as to estimate spectral envelopes of the input speech signal in a succession of time intervals during each of the speech segments, and integrating the spectral envelopes over a plurality of window functions in a frequency domain so as to determine elements of feature vectors corresponding to the speech segments; and [0046]
  • a speech processor, arranged to reconstruct an output speech signal by concatenating the feature vectors corresponding to a sequence of the speech segments. [0047]
  • There is moreover provided, in accordance with a preferred embodiment of the present invention, a computer software product, including a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to access a segment inventory including, for a plurality of speech segments, respective sequences of feature vectors having vector elements determined by estimating spectral envelopes of input speech signals corresponding to the speech segments in a succession of time intervals during each of the speech segments, and integrating the spectral envelopes over a plurality of window functions in a frequency domain, and in response to phonetic and prosodic information indicative of an output speech signal to be generated, cause the computer to select the sequences of feature vectors from the inventory responsive to the phonetic and prosodic information, to process the selected sequences of feature vectors so as to generate a concatenated output series of feature vectors, and to compute a series of complex line spectra of the output signal from the series of the feature vectors and transform the complex line spectra to a time domain speech signal for output. [0048]
  • There is furthermore provided, in accordance with a preferred embodiment of the present invention, a computer software product, including a computer-readable medium in which a segment inventory is stored, the inventory having been determined by processing an input speech signal containing a set of speech segments so as to estimate spectral envelopes of the input speech signal in a succession of time intervals during each of the speech segments, and integrating the spectral envelopes over a plurality of window functions in a frequency domain so as to determine elements of feature vectors corresponding to the speech segments, so that a speech processor can reconstruct an output speech signal by concatenating the feature vectors corresponding to a sequence of the speech segments. [0049]
  • The present invention will be more fully understood from the following detailed description of the preferred embodiments thereof, taken together with the drawings in which:[0050]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram that schematically illustrates a device for synthesis of speech signals, in accordance with a preferred embodiment of the present invention; [0051]
  • FIG. 2 is a block diagram that schematically shows details of the device of FIG. 1, in accordance with a preferred embodiment of the present invention; and [0052]
  • FIG. 3 is a flow chart that schematically illustrates a method for generating a speech segment inventory, in accordance with a preferred embodiment of the present invention. [0053]
  • DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
  • FIG. 1 is a block diagram that schematically illustrates a speech synthesis device 20, in accordance with a preferred embodiment of the present invention. Device 20 typically comprises a general-purpose or embedded computer processor, which is programmed with suitable software for carrying out the functions described hereinbelow. Thus, although device 20 is shown in FIG. 1 as comprising a number of separate functional blocks, these blocks are not necessarily separate physical entities, but rather represent different computing tasks. These tasks may be carried out in software running on a single processor, or on multiple processors. The software may be provided to the processor or processors in electronic form, for example, over a network, or it may be furnished on tangible media, such as CD-ROM or non-volatile memory. Alternatively or additionally, device 20 may comprise a digital signal processor (DSP) or hard-wired logic. [0054]
  • Device 20 typically receives its input in the form of a stream of text characters. A TTS front end 22 of the processor analyzes the text to generate phoneme labels and prosodic information, as is known in the art. The prosodic information preferably comprises pitch, energy and duration associated with each of the phonemes. An adapter 24 converts the phonetic labels and prosodic information into a form required by a segment selection and concatenation block 26. Although front end 22 and adapter 24 are shown for the sake of clarity as separate functional units, the functions of these two units may easily be combined. [0055]
  • Preferably, for each phoneme, [0056] adapter 24 generates three lefeme labels, each comprising an HMM, as is known in the art. The duration and energy of each phoneme are likewise converted into a series of three lefeme durations and lefeme energies. This conversion can be carried out using simple interpolation methods or, alternatively, by following a decision tree from its root down to the leaves associated with the appropriate HMMs. The decision tree method is described by Donovan in the above-mentioned thesis. Adapter 24 preferably interpolates the pitch values output by front end 22, most preferably so that there is a pitch value for every 10 ms frame of output speech.
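  • As a minimal sketch (not the decision-tree method of the Donovan thesis), the two conversions above can be illustrated as follows: each phoneme's duration and energy are divided evenly over three lefemes, and the pitch contour is resampled to one value per 10 ms frame. The even split and the linear interpolation are illustrative assumptions.

    import numpy as np

    def phoneme_to_lefemes(duration_ms: float, energy: float):
        """Divide a phoneme's duration and energy evenly over three lefemes."""
        return [(duration_ms / 3.0, energy) for _ in range(3)]

    def interpolate_pitch(anchor_times_ms, anchor_pitch_hz, total_ms, frame_ms=10.0):
        """Produce one pitch value per 10 ms output frame by linear interpolation."""
        frame_times = np.arange(0.0, total_ms, frame_ms)
        return np.interp(frame_times, anchor_times_ms, anchor_pitch_hz)

    # Example: a 120 ms phoneme with pitch anchors at its start and end.
    lefemes = phoneme_to_lefemes(120.0, 1.0)                        # three (40 ms, energy) targets
    pitch_track = interpolate_pitch([0.0, 120.0], [110.0, 118.0], 120.0)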
  • Segment selection and [0057] concatenation block 26 receives the lefeme labels and prosodic parameters generated by adapter 24, and uses these data to produce a series of feature vectors for output to a feature reconstructor 32. Block 26 generates the series of feature vectors based on feature data extracted from a segment inventory 28 held in a memory associated with device 20. Inventory 28 contains a database of speech segments, along with a corresponding sequence of feature vectors for each segment. The inventory is preferably produced using methods described hereinbelow with reference to FIG. 3. Each speech segment in the inventory is identified by segment information, including a corresponding lefeme label, duration and energy. The feature vectors comprise spectral coefficients, most preferably MFCCs, along with a voicing parameter, indicating whether the corresponding speech frame is voiced or unvoiced. The above-mentioned U.S. patent application Ser. No. 09/432,081 gives a detailed specification of a preferred structure and method of computation of such feature vectors. Preferably, the feature vectors are held in the memory in compressed form, and are decompressed by a decompression unit 30 when required by block 26. Further details of the operation of block 26 are described hereinbelow with reference to FIG. 2.
  • Feature reconstructor [0058] 32 processes the series of feature vectors that are output by block 26, together with the associated pitch information from adapter 24, so as to generate a synthesized speech signal in digital form. Reconstructor 32 preferably operates in accordance with the method described in the above-mentioned U.S. patent application Ser. No. 09/432,081. Further aspects of this method are described in the above-mentioned article by Chazan et al., as well as in U.S. patent application Ser. No. 09/410,085, which is assigned to the assignee of the present patent application, and whose disclosure is incorporated herein by reference.
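  • The reconstruction itself is the subject of the referenced applications and is not repeated here. Purely to convey the flavor of feature-domain synthesis, the deliberately simplified sketch below inverts a cepstrum-like vector to a spectral envelope, samples the envelope at harmonics of the frame pitch, and sums sinusoids for one 10 ms frame; mel warping, phase continuity, voicing mixing and overlap-add are all omitted, and nothing here should be read as the method of those applications.

    import numpy as np
    from scipy.fftpack import idct

    SAMPLE_RATE = 16000
    FRAME_LEN = 160  # 10 ms at 16 kHz

    def frame_from_features(cepstrum: np.ndarray, pitch_hz: float) -> np.ndarray:
        """Very rough voiced-frame synthesis from a cepstrum-like feature vector."""
        log_env = idct(cepstrum, n=257, norm="ortho")    # coarse log envelope (mel warp ignored)
        env = np.exp(log_env)                            # linear amplitude envelope
        freqs = np.linspace(0.0, SAMPLE_RATE / 2, 257)
        t = np.arange(FRAME_LEN) / SAMPLE_RATE
        frame = np.zeros(FRAME_LEN)
        for k in range(1, int((SAMPLE_RATE / 2) // pitch_hz) + 1):
            f = k * pitch_hz
            amp = np.interp(f, freqs, env)               # envelope value at the k-th harmonic
            frame += amp * np.cos(2.0 * np.pi * f * t)
        return frame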
  • FIG. 2 is a block diagram that schematically shows details of segment selection and [0059] concatenation block 26, in accordance with a preferred embodiment of the present invention. A segment selector 40 in block 26 is responsible for selecting the segments from inventory 28 that correspond to the segment information received from adapter 24. As a first stage in this process, a candidate selection block 46 finds the segments in the inventory whose segment parameters (lefeme label, duration, energy and pitch) are closest to the parameters specified by adapter 24. Typically, a distance between the specified parameters and the parameters of the candidate segments in inventory 28 is determined as a weighted sum of the differences of the corresponding parameters. Certain parameters, such as pitch, may have little or no weight in this sum. The segments in inventory 28 whose respective distances from the specified parameter set are smallest are chosen as candidates.
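  • A minimal sketch of this candidate-selection stage is given below; the particular weights, the zero weight given to pitch, and the field names are illustrative assumptions rather than values taken from the disclosure.

    WEIGHTS = {"duration": 1.0, "energy": 1.0, "pitch": 0.0}  # pitch may carry little or no weight

    def segment_distance(target: dict, candidate: dict) -> float:
        """Weighted sum of absolute parameter differences; unusable if the labels differ."""
        if target["lefeme"] != candidate["lefeme"]:
            return float("inf")
        return sum(w * abs(target[key] - candidate[key]) for key, w in WEIGHTS.items())

    def select_candidates(target: dict, inventory: list, n_best: int = 5) -> list:
        """Return the n_best inventory segments closest to the specified parameters."""
        return sorted(inventory, key=lambda seg: segment_distance(target, seg))[:n_best]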
  • For each candidate segment, block [0060] 46 determines a cost function. The cost function is based on the distance between the specified parameters and the segment parameters, as described above, and on a distance between the current segment and the preceding segment in the series chosen by selector 40. This distance between successive segments in the series is computed based on the respective feature vectors of the segments. A dynamic programming unit 48 uses the cost function values to select the series of segments that minimizes the cost function. Methods for cost function computation and dynamic programming of this sort are known in the art. Exemplary methods are described by Donovan in the above-mentioned thesis and by Huang et al. in U.S. Pat. No. 5,913,193, as well as by Hoory et al., in “Speech Synthesis for a Specific Speaker Based on a Labeled Speech Database,” Proceedings of the International Conference on Pattern Recognition (1994), pp. C145-148, which is incorporated herein by reference.
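  • The sketch below illustrates a Viterbi-style dynamic-programming search of the kind referred to above: the target cost of each candidate is combined with a join cost computed from the feature vectors at the segment boundaries, and the lowest-cost sequence is traced back. The Euclidean join cost and the equal weighting of the two cost terms are illustrative assumptions only.

    import numpy as np

    def join_cost(prev_feats: np.ndarray, next_feats: np.ndarray) -> float:
        """Distance between the last frame of one segment and the first of the next."""
        return float(np.linalg.norm(prev_feats[-1] - next_feats[0]))

    def best_path(candidates_per_slot, target_costs):
        """candidates_per_slot[i][j]: feature matrix (frames x dims) of candidate j at slot i;
        target_costs[i][j]: its target cost. Returns the indices of the cheapest sequence."""
        n = len(candidates_per_slot)
        trellis = [[dict(cost=c, back=None) for c in target_costs[0]]]
        for i in range(1, n):
            row = []
            for j, cand in enumerate(candidates_per_slot[i]):
                costs = [trellis[i - 1][k]["cost"]
                         + join_cost(candidates_per_slot[i - 1][k], cand)
                         for k in range(len(candidates_per_slot[i - 1]))]
                k_best = int(np.argmin(costs))
                row.append(dict(cost=costs[k_best] + target_costs[i][j], back=k_best))
            trellis.append(row)
        # Trace back the minimum-cost sequence of candidate indices.
        j = int(np.argmin([node["cost"] for node in trellis[-1]]))
        path = [j]
        for i in range(n - 1, 0, -1):
            j = trellis[i][j]["back"]
            path.append(j)
        return path[::-1]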
  • The segments chosen by [0061] selector 40, along with their corresponding sequences of feature vectors and other segment parameters, are passed to a segment adjuster 42. Adjuster 42 alters the segment parameters that were read from inventory 28 so that they match the prosodic information received from adapter 24. Preferably, the duration and energy adjustment is carried out by modifying the feature vectors. For example, for each 10 ms by which the duration of a segment needs to be shortened, one feature vector is removed from the series. Alternatively, feature vectors may be duplicated or interpolated as necessary to lengthen the segment. As a further example, the energy of the segment may be altered by increasing or decreasing the lowest-order mel-cepstral coefficient for the MFCC feature vectors. The adjusted feature vectors are input to a segment concatenator 44, which generates the combined series of feature vectors that is output to reconstructor 32.
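  • The duration and energy adjustments described above can be pictured as follows; the uniform frame-dropping/duplication strategy and the treatment of the lowest-order coefficient as a log-energy offset are illustrative assumptions.

    import numpy as np

    FRAME_MS = 10.0

    def adjust_duration(feats: np.ndarray, target_ms: float) -> np.ndarray:
        """Drop or duplicate 10 ms feature vectors so the segment spans the target duration."""
        target_frames = max(1, int(round(target_ms / FRAME_MS)))
        idx = np.linspace(0, len(feats) - 1, target_frames).round().astype(int)
        return feats[idx]

    def adjust_energy(feats: np.ndarray, log_energy_offset: float) -> np.ndarray:
        """Shift the lowest-order cepstral coefficient to raise or lower the segment energy."""
        out = feats.copy()
        out[:, 0] += log_energy_offset
        return out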
  • FIG. 3 is a flow chart that schematically illustrates a method for generating [0062] segment inventory 28, in accordance with a preferred embodiment of the present invention. To begin, a recording is made of the speaker whose voice is to be synthesized, at a recording step 50. Preferably, the speaker reads a list of sentences, which have been prepared in advance. The speech is digitized and divided into frames, each preferably of 10 ms duration, at a frame analysis step 52. For each frame, a feature vector is computed, by estimating the spectral envelope of the signal; multiplying the estimate by a set of frequency-domain window functions; and integrating the product of the multiplication over each of the windows. The elements of the feature vector are given either by the integrals themselves or, preferably, by a set of predetermined functions applied to the integrals. Most preferably the vector elements are MFCCs, as described, for example, in the above-mentioned article by Davis et al. and in U.S. patent application Ser. No. 09/432,081.
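  • The sketch below illustrates this per-frame analysis: a spectral envelope is estimated, weighted by a bank of frequency-domain window functions, each product is integrated, and log and DCT operations yield MFCC-like vector elements. Using the windowed frame's magnitude spectrum as the envelope estimate and triangular mel-spaced windows are simplifying assumptions, not the estimation method of the referenced application.

    import numpy as np
    from scipy.fftpack import dct

    def mel(f):  # Hz -> mel
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_windows(n_windows, n_bins, sample_rate):
        """Triangular window functions evenly spaced on the mel scale."""
        freqs = np.linspace(0.0, sample_rate / 2, n_bins)
        edges = np.linspace(mel(0.0), mel(sample_rate / 2), n_windows + 2)
        centers = 700.0 * (10 ** (edges / 2595.0) - 1.0)
        bank = np.zeros((n_windows, n_bins))
        for i in range(n_windows):
            lo, mid, hi = centers[i], centers[i + 1], centers[i + 2]
            bank[i] = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                         (hi - freqs) / (hi - mid)), 0.0, None)
        return bank

    def frame_to_mfcc(frame, sample_rate=16000, n_windows=24):
        envelope = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))   # crude envelope estimate
        bank = mel_windows(n_windows, len(envelope), sample_rate)
        integrals = bank @ envelope                                      # integrate over each window
        return dct(np.log(integrals + 1e-10), norm="ortho")              # 24 MFCC-like elements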
  • The analysis at [0063] step 52 also estimates the pitch of the frame and thus determines whether the frame is voiced or unvoiced. A preferred method of pitch estimation is described in U.S. patent application Ser. No. 09/617,582, filed Jul. 14, 2000, which is assigned to the assignee of the present patent application and is incorporated herein by reference. The voicing parameter, indicating whether the frame is voiced or unvoiced, is then added to the feature vector. Alternatively, the voicing parameter may indicate a degree of voicing, with a continuous value between 0 (purely unvoiced) and 1 (purely voiced). Further analysis may be carried out, and additional auxiliary information may be added to the feature vector in order to enhance the synthesized speech quality.
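  • A voicing element might be attached to the feature vector along the following lines; the normalized-autocorrelation measure is a generic illustration and not the pitch-estimation method of the referenced application.

    import numpy as np

    def voicing_degree(frame: np.ndarray, lag: int) -> float:
        """Normalized autocorrelation at the pitch lag, clipped to [0, 1]."""
        a, b = frame[:-lag], frame[lag:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-10
        return float(np.clip(np.dot(a, b) / denom, 0.0, 1.0))

    def augment_with_voicing(mfcc: np.ndarray, frame: np.ndarray, pitch_hz: float,
                             sample_rate: int = 16000) -> np.ndarray:
        lag = max(1, int(round(sample_rate / pitch_hz)))
        return np.append(mfcc, voicing_degree(frame, lag))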
  • The digitized speech is further analyzed to partition it into segments, at a [0064] segmentation step 54. Each segment is classified, preferably using HMMs, as described by Donovan in the above-mentioned thesis, and in U.S. Pat. Nos. 5,913,193 and 6,041,300. This classification yields segment parameters including a lefeme label (or lefeme index), energy level, duration, segment pitch and segment location in the database. The energy level and pitch are computed based on the parameters of the frames in the present segment, which were determined at step 52. Optionally, statistical models are first trained on the available recordings in order to improve the classification. Typically, such training involves retraining the HMMs and the decision trees using the database samples, so that they are adapted to the specific speaker and database contents. Prior to such retraining, it is assumed that a general, speaker-independent model is used for classification. A training procedure of this sort is described by Donovan in the above-mentioned thesis.
  • Preferably, in order to limit the size of [0065] inventory 28, some of the segments and their corresponding feature vectors are discarded, at a preselection step 56. A suitable method for such preselection is described by Donovan in an article entitled "Segment Pre-selection in Decision-Tree Based Speech Synthesis Systems," Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), June, 2000, which is incorporated herein by reference. To reduce the size of the inventory still further, the feature vectors are preferably compressed, at a compression step 58. An exemplary compression scheme is illustrated in Table I, below. This scheme operates on a 24-dimensional MFCC feature vector by grouping the vector elements into sub-vectors, and then quantizing each sub-vector using a separate codebook. Preferably, for maximal coding efficiency, the codebook is generated by training on the actual feature vector data that are to be included in inventory 28, using training methods known in the art. One training method that may be used for this purpose is K-means clustering, as described by Rabiner et al., in Fundamentals of Speech Recognition (Prentice-Hall, 1993), pages 125-128, which is incorporated herein by reference. The codebook is then used by decompression unit 30 in decompressing the feature vectors as they are recalled from the inventory by block 26.
    TABLE I
    FEATURE VECTOR COMPRESSION
    Component index    Number of bits    Codebook size
    0                  5                 32
    1-2                9                 512
    3-5                10                1024
    6-8                9                 512
    9-12               9                 512
    13-17              8                 256
    18-23              6                 64
  • As noted above, the compression scheme shown in Table I relates to the MFCC elements of the feature vector. Other elements of the vector, such as the voicing parameter and other auxiliary data, are preferably compressed separately from the MFCCs, typically by scalar or vector quantization. [0066]
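  • Purely as an illustration of the sub-vector scheme of Table I, the sketch below splits the 24 MFCC elements into the listed groups, trains a separate codebook of the listed size for each group (here with SciPy's k-means as a stand-in for the clustering described above), and encodes each vector as seven indices totaling 56 bits. Only the group boundaries and codebook sizes come from Table I; everything else is assumed.

    import numpy as np
    from scipy.cluster.vq import kmeans2, vq

    GROUPS = [(0, 1, 32), (1, 3, 512), (3, 6, 1024), (6, 9, 512),
              (9, 13, 512), (13, 18, 256), (18, 24, 64)]   # (start, end, codebook size)

    def train_codebooks(vectors: np.ndarray):
        """vectors: (n_frames, 24) matrix of MFCC vectors drawn from the inventory."""
        books = []
        for start, end, size in GROUPS:
            centroids, _ = kmeans2(vectors[:, start:end], size, minit="points")
            books.append(centroids)
        return books

    def compress(vector: np.ndarray, books) -> list:
        """Encode one 24-element vector as seven codebook indices (56 bits total)."""
        return [int(vq(vector[start:end][None, :], book)[0][0])
                for (start, end, _), book in zip(GROUPS, books)]

    def decompress(indices: list, books) -> np.ndarray:
        """Rebuild an approximate 24-element vector from the stored indices."""
        return np.concatenate([book[i] for i, book in zip(indices, books)])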
  • The data for each of the segments selected at [0067] step 56 are stored in inventory 28, at a storage step 60. As noted above, these data preferably include the segment lefeme index, the segment duration, energy and pitch values, and the compressed series of feature vectors (including MFCCs, voicing information and possibly other auxiliary information) for the series of 10 ms frames that make up the segment.
  • Although embodiments described herein make use of certain preferred methods of spectral representation (such as MFCCs) and phonetic analysis (such as lefemes and HMMs), it will be appreciated that the principles of the present invention may similarly be applied using other such methods, as are known in the art of speech analysis and synthesis. Furthermore, although these embodiments are described in the context of TTS conversion, the principles of the present invention can also be used in other speech synthesis applications that are not text-based. [0068]
  • It will thus be understood that the preferred embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and subcombinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art. [0069]

Claims (75)

1. A method for speech synthesis, comprising:
providing a segment inventory comprising, for a plurality of speech segments, respective sequences of feature vectors, by estimating spectral envelopes of input speech signals corresponding to the speech segments in a succession of time intervals during each of the speech segments, and integrating the spectral envelopes over a plurality of window functions in a frequency domain so as to determine vector elements of the feature vectors;
receiving phonetic and prosodic information indicative of an output speech signal to be generated;
selecting the sequences of feature vectors from the inventory responsive to the phonetic and prosodic information;
processing the selected sequences of feature vectors so as to generate a concatenated output series of feature vectors;
computing a series of complex line spectra of the output signal from the series of the feature vectors; and
transforming the complex line spectra to a time domain speech signal for output.
2. A method according to
claim 1
, wherein providing the segment inventory comprises providing segment information comprising respective phonetic identifiers of the segments, and wherein selecting the sequences of feature vectors comprises finding the segments whose phonetic identifiers are close to the received phonetic information.
3. A method according to
claim 2
, wherein the segments comprise lefemes, and wherein the phonetic identifiers comprise lefeme labels.
4. A method according to
claim 2
, wherein the segment information further comprises one or more prosodic parameters with respect to each of the segments, and wherein selecting the sequences of feature vectors comprises finding the segments whose one or more prosodic parameters are close to the received prosodic information.
5. A method according to
claim 4
, wherein the one or more prosodic parameters are selected from a group of parameters consisting of a duration, an energy level and a pitch of each of the segments.
6. A method according to
claim 1
, wherein the feature vectors comprise auxiliary vector elements indicative of further features of the speech segments, in addition to the elements determined by integrating the spectral envelopes of the input speech signals.
7. A method according to
claim 6
, wherein the auxiliary vector elements comprise voicing vector elements indicative of a degree of voicing of frames of the corresponding speech segments, and wherein computing the complex line spectra comprises reconstructing the output speech signal with the degree of voicing indicated by the voicing vector elements.
8. A method according to
claim 7
, wherein receiving the prosodic information comprises receiving pitch values, and wherein reconstructing the output speech signal comprises adjusting a frequency spectrum of the output speech signal responsive to the pitch values.
9. A method according to
claim 1
, wherein selecting the sequences of feature vectors comprises:
selecting candidate segments from the inventory;
computing a cost function for each of the candidate segments responsive to the phonetic and prosodic information and to the feature vectors of the candidate segments; and
selecting the segments so as to minimize the cost function.
10. A method according to
claim 1
, wherein concatenating the selected sequences of feature vectors comprises adjusting the feature vectors responsive to the prosodic information.
11. A method according to
claim 10
, wherein the prosodic information comprises respective durations of the segments to be incorporated in the output speech signal, and wherein adjusting the feature vectors comprises removing one or more of the feature vectors from the selected sequences so as to shorten the durations of one or more of the segments.
12. A method according to
claim 10
, wherein the prosodic information comprises respective durations of the segments to be incorporated in the output speech signal, and wherein adjusting the feature vectors comprises adding one or more further feature vectors to the selected sequences so as to lengthen the durations of one or more of the segments.
13. A method according to
claim 10
, wherein the prosodic information comprises respective energy levels of the segments to be incorporated in the output speech signal, and wherein adjusting the feature vectors comprises altering one or more of the vector elements so as to adjust the energy levels of one or more of the segments.
14. A method according to
claim 1
, wherein processing the selected sequences comprises adjusting the vector elements so as to provide a smooth transition between the segments in the time domain signal.
15. A method according to
claim 1
, wherein the vector elements comprise Mel Frequency Cepstral Coefficients of the speech segments, determined based on the integrated spectral envelopes.
16. A method for speech synthesis, comprising:
receiving an input speech signal containing a set of speech segments;
estimating spectral envelopes of the input speech signal in a succession of time intervals during each of the speech segments;
integrating the spectral envelopes over a plurality of window functions in a frequency domain so as to determine elements of feature vectors corresponding to the speech segments; and
reconstructing an output speech signal by concatenating the feature vectors corresponding to a sequence of the speech segments.
17. A method according to
claim 16
, wherein receiving the input speech signal comprises dividing the input speech signal into the segments and determining segment information comprising respective phonetic identifiers of the segments, and wherein reconstructing the output speech signal comprises selecting the segments whose feature vectors are to be concatenated responsive to the segment information determined with respect to the segments.
18. A method according to
claim 17
, wherein dividing the input speech signal into the segments comprises dividing the signal into lefemes, and wherein the phonetic identifiers comprise lefeme labels.
19. A method according to
claim 17
, wherein determining the segment information further comprises finding respective segment parameters including one or more of a duration, an energy level and a pitch of each of the segments, responsive to which parameters the segments are selected for use in reconstructing the output speech signal.
20. A method according to
claim 19
, wherein reconstructing the output speech signal comprises modifying the feature vectors of the selected segments so as to adjust the segment parameters of the segments in the output speech signal.
21. A method according to
claim 16
, and comprising determining respective degrees of voicing of the speech segments, and incorporating the degrees of voicing as elements of the feature vectors for use in reconstructing the output speech signal.
22. A method according to
claim 16
, wherein concatenating the feature vectors comprises concatenating the vectors to form a series in a frequency domain, and wherein reconstructing the output speech signal comprises computing a series of complex line spectra of the output signal from the series of feature vectors, and transforming the complex line spectra to a time domain signal.
23. A method according to
claim 16
, wherein the window functions are non-zero only within different, respective spectral windows and have variable values over their respective windows, and wherein integrating the spectral envelopes comprises calculating products of the spectral envelopes with the window functions, and calculating integrals of the products over the respective windows of the window functions.
24. A method according to
claim 23
, and comprising applying a mathematical transformation to the integrals in order to determine the elements of the feature vectors.
25. A method according to
claim 24
, wherein the frequency domain comprises a Mel frequency domain, and wherein applying the mathematical transformation comprises applying log and discrete cosine transform operations in order to determine Mel Frequency Cepstral Coefficients to be used as the elements of the feature vectors.
26. A device for speech synthesis, comprising:
a memory, arranged to hold a segment inventory comprising, for a plurality of speech segments, respective sequences of feature vectors having vector elements determined by estimating spectral envelopes of input speech signals corresponding to the speech segments in a succession of time intervals during each of the speech segments, and integrating the spectral envelopes over a plurality of window functions in a frequency domain; and
a speech processor, arranged to receive phonetic and prosodic information indicative of an output speech signal to be generated, to select the sequences of feature vectors from the inventory responsive to the phonetic and prosodic information, to process the selected sequences of feature vectors so as to generate a concatenated output series of feature vectors, and to compute a series of complex line spectra of the output signal from the series of the feature vectors and transform the complex line spectra to a time domain speech signal for output.
27. A device according to
claim 26
, wherein the segment inventory comprises segment information comprising respective phonetic identifiers of the segments, and wherein the processor is arranged to select the sequences of feature vectors by finding the segments in the inventory whose phonetic identifiers are close to the received phonetic information.
28. A device according to
claim 27
, wherein the segments comprise lefemes, and wherein the phonetic identifiers comprise lefeme labels.
29. A device according to
claim 27
, wherein the segment information further comprises one or more prosodic parameters with respect to each of the segments, and wherein the processor is arranged to select the sequences of feature vectors by finding the segments whose one or more prosodic parameters are close to the received prosodic information.
30. A device according to
claim 29
, wherein the one or more prosodic parameters are selected from a group of parameters consisting of a duration, an energy level and a pitch of each of the segments.
31. A device according to
claim 26
, wherein the feature vectors comprise auxiliary vector elements indicative of further features of the speech segments, in addition to the elements determined by integrating the spectral envelopes of the input speech signals.
32. A device according to
claim 31
, wherein the auxiliary vector elements comprise voicing vector elements indicative of a degree of voicing of frames of the corresponding speech segments, and wherein the processor is arranged to reconstruct the output speech signal with the degree of voicing indicated by the voicing vector elements.
33. A device according to
claim 32
, wherein the prosodic information comprises pitch values, and wherein the processor is arranged to adjust a frequency spectrum of the output speech signal responsive to the pitch values.
34. A device according to
claim 26
, wherein the processor is arranged to select the sequences of feature vectors by selecting candidate segments from the inventory, computing a cost function for each of the candidate segments responsive to the phonetic and prosodic information and to the feature vectors of the candidate segments, and selecting the segments so as to minimize the cost function.
35. A device according to
claim 26
, wherein the processor is arranged to adjust the feature vectors in the combined output series responsive to the prosodic information.
36. A device according to
claim 35
, wherein the prosodic information comprises respective durations of the segments to be incorporated in the output speech signal, and wherein the processor is arranged to adjust the feature vectors by removing one or more of the feature vectors from the selected sequences so as to shorten the durations of one or more of the segments.
37. A device according to
claim 35
, wherein the prosodic information comprises respective durations of the segments to be incorporated in the output speech signal, and wherein the processor is arranged to adjust the feature vectors by adding one or more further feature vectors to the selected sequences so as to lengthen the durations of one or more of the segments.
38. A device according to
claim 35
, wherein the prosodic information comprises respective energy levels of the segments to be incorporated in the output speech signal, and wherein the processor is arranged to adjust the energy levels of one or more of the segments by altering one or more of the vector elements.
39. A device according to
claim 26
, wherein the processor is arranged to adjust the vector elements so as to provide a smooth transition between the segments in the time domain signal.
40. A device according to
claim 26
, wherein the vector elements comprise Mel Frequency Cepstral Coefficients of the speech segments, determined based on the integrated spectral envelopes.
41. A device for speech synthesis, comprising:
a memory, arranged to hold a segment inventory determined by processing an input speech signal containing a set of speech segments so as to estimate spectral envelopes of the input speech signal in a succession of time intervals during each of the speech segments, and integrating the spectral envelopes over a plurality of window functions in a frequency domain so as to determine elements of feature vectors corresponding to the speech segments; and
a speech processor, arranged to reconstruct an output speech signal by concatenating the feature vectors corresponding to a sequence of the speech segments.
42. A device according to
claim 41
, wherein the input speech signal is processed by dividing the input speech signal into the segments and determining segment information comprising respective phonetic identifiers of the segments, and wherein the processor is arranged to reconstruct the output speech signal by selecting the segments whose feature vectors are to be concatenated responsive to the segment information determined with respect to the segments.
43. A device according to
claim 42
, wherein the input speech signal is divided into lefemes, and the phonetic identifiers comprise lefeme labels.
44. A device according to
claim 42
, wherein the segment information further comprises respective segment parameters including one or more of a duration, an energy level and a pitch of each of the segments, responsive to which parameters the segments are selected by the processor for use in reconstructing the output speech signal.
45. A device according to
claim 44
, wherein the processor is arranged to modify the feature vectors of the selected segments so as to adjust the segment parameters of the segments in the output speech signal.
46. A device according to
claim 41
, wherein the feature vectors comprise respective degrees of voicing of the speech segments, for use by the processor in reconstructing the output speech signal.
47. A device according to
claim 41
, wherein the processor is arranged to concatenate the feature vectors to form a series in a frequency domain, and to reconstruct the output speech signal by computing a series of complex line spectra of the output signal from the series of feature vectors, and transforming the complex line spectra to a time domain signal.
48. A device according to
claim 41
, wherein the window functions are non-zero only within different, respective spectral windows and have variable values over their respective windows, and wherein the feature vector elements are determined by calculating products of the spectral envelopes with the window functions, and calculating integrals of the products over the respective windows of the window functions.
49. A device according to
claim 48
, wherein a mathematical transformation is applied to the integrals in order to determine the elements of the feature vectors.
50. A device according to
claim 48
, wherein the frequency domain comprises a Mel frequency domain, and wherein the mathematical transformation comprises log and discrete cosine transform operations, which are applied so as to determine Mel Frequency Cepstral Coefficients to be used as the elements of the feature vectors.
51. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to access a segment inventory comprising, for a plurality of speech segments, respective sequences of feature vectors having vector elements determined by estimating spectral envelopes of input speech signals corresponding to the speech segments in a succession of time intervals during each of the speech segments, and integrating the spectral envelopes over a plurality of window functions in a frequency domain, and in response to phonetic and prosodic information indicative of an output speech signal to be generated, cause the computer to select the sequences of feature vectors from the inventory responsive to the phonetic and prosodic information, to process the selected sequences of feature vectors so as to generate a concatenated output series of feature vectors, and to compute a series of complex line spectra of the output signal from the series of the feature vectors and transform the complex line spectra to a time domain speech signal for output.
52. A product according to
claim 51
, wherein the segment inventory comprises segment information comprising respective phonetic identifiers of the segments, and wherein the instructions cause the computer to select the sequences of feature vectors by finding the segments in the inventory whose phonetic identifiers are close to the received phonetic information.
53. A product according to
claim 52
, wherein the segments comprise lefemes, and wherein the phonetic identifiers comprise lefeme labels.
54. A product according to
claim 52
, wherein the segment information further comprises one or more prosodic parameters with respect to each of the segments, and wherein the instructions cause the computer to select the sequences of feature vectors by finding the segments whose one or more prosodic parameters are close to the received prosodic information.
55. A product according to
claim 54
, wherein the one or more prosodic parameters are selected from a group of parameters consisting of a duration, an energy level and a pitch of each of the segments.
56. A product according to
claim 54
, wherein the feature vectors comprise auxiliary vector elements indicative of further features of the speech segments, in addition to the elements determined by integrating the spectral envelopes of the input speech signals.
57. A product according to
claim 56
, wherein the auxiliary vector elements comprise voicing vector elements indicative of a degree of voicing of frames of the corresponding speech segments, and wherein the instructions cause the computer to reconstruct the output speech signal with the degree of voicing indicated by the voicing vector elements.
58. A product according to
claim 57
, wherein the prosodic information comprises pitch values, and wherein the instructions cause the computer to adjust a frequency spectrum of the output speech signal responsive to the pitch values.
59. A product according to
claim 51
, wherein the instructions cause the computer to select the sequences of feature vectors by selecting candidate segments from the inventory, computing a cost function for each of the candidate segments responsive to the phonetic and prosodic information and to the feature vectors of the candidate segments, and selecting the segments so as to minimize the cost function.
60. A product according to
claim 51
, wherein the instructions cause the computer to adjust the feature vectors in the combined output series responsive to the prosodic information.
61. A product according to
claim 60
, wherein the prosodic information comprises respective durations of the segments to be incorporated in the output speech signal, and wherein the instructions cause the computer to adjust the feature vectors by removing one or more of the feature vectors from the selected sequences so as to shorten the durations of one or more of the segments.
62. A product according to
claim 60
, wherein the prosodic information comprises respective durations of the segments to be incorporated in the output speech signal, and wherein the instructions cause the computer to adjust the feature vectors by adding one or more further feature vectors to the selected sequences so as to lengthen the durations of one or more of the segments.
63. A product according to
claim 60
, wherein the prosodic information comprises respective energy levels of the segments to be incorporated in the output speech signal, and wherein the instructions cause the computer to adjust the energy levels of one or more of the segments by altering one or more of the vector elements.
64. A product according to
claim 51
, wherein the instructions cause the computer to adjust the vector elements so as to provide a smooth transition between the segments in the time domain signal.
65. A product according to
claim 51
, wherein the vector elements comprise Mel Frequency Cepstral Coefficients of the speech segments, determined based on the integrated spectral envelopes.
66. A computer software product, comprising a computer-readable medium in which a segment inventory is stored, the inventory having been determined by processing an input speech signal containing a set of speech segments so as to estimate spectral envelopes of the input speech signal in a succession of time intervals during each of the speech segments, and integrating the spectral envelopes over a plurality of window functions in a frequency domain so as to determine elements of feature vectors corresponding to the speech segments, so that a speech processor can reconstruct an output speech signal by concatenating the feature vectors corresponding to a sequence of the speech segments.
67. A product according to
claim 66
, wherein the input speech signal is processed by dividing the input speech signal into the segments and determining segment information comprising respective phonetic identifiers of the segments, and wherein to reconstruct the output speech signal, the processor selects the segments whose feature vectors are to be concatenated responsive to the segment information determined with respect to the segments.
68. A product according to
claim 66
, wherein the input speech signal is divided into lefemes, and the phonetic identifiers comprise lefeme labels.
69. A product according to
claim 66
, wherein the segment information further comprises respective segment parameters including one or more of a duration, an energy level and a pitch of each of the segments, responsive to which parameters the segments are selected by the computer for use in reconstructing the output speech signal.
70. A product according to
claim 69
, wherein to reconstruct the output speech signal, the instructions cause the computer to modify the feature vectors of the selected segments so as to adjust the durations and energy levels of the segments in the output speech signal.
71. A product according to
claim 66
, wherein the feature vectors comprise respective degrees of voicing of the speech segments, for use by the computer in reconstructing the output speech signal.
72. A product according to
claim 66
, wherein to reconstruct the output speech signal, the instructions cause the computer to concatenate the feature vectors to form a series in a frequency domain, to compute a series of complex line spectra of the output signal from the series of feature vectors, and to transform the complex line spectra to a time domain signal.
73. A product according to
claim 66
, wherein the window functions are non-zero only within different, respective spectral windows and have variable values over their respective windows, and wherein the feature vector elements are determined by calculating products of the spectral envelopes with the window functions, and calculating integrals of the products over the respective windows of the window functions.
74. A product according to
claim 73
, wherein a mathematical transformation is applied to the integrals in order to determine the elements of the feature vectors.
75. A product according to
claim 74
, wherein the frequency domain comprises a Mel frequency domain, and wherein the mathematical transformation comprises log and discrete cosine transform operations, which are applied so as to determine Mel Frequency Cepstral Coefficients to be used as the elements of the feature vectors.
US09/901,031 1999-11-02 2001-07-10 Feature-domain concatenative speech synthesis Expired - Lifetime US7035791B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US09/901,031 US7035791B2 (en) 1999-11-02 2001-07-10 Feature-domain concatenative speech synthesis

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/432,081 US6725190B1 (en) 1999-11-02 1999-11-02 Method and system for speech reconstruction from speech recognition features, pitch and voicing with resampled basis functions providing reconstruction of the spectral envelope
US09/901,031 US7035791B2 (en) 1999-11-02 2001-07-10 Feature-domain concatenative speech synthesis

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US09/432,081 Continuation-In-Part US6725190B1 (en) 1999-11-02 1999-11-02 Method and system for speech reconstruction from speech recognition features, pitch and voicing with resampled basis functions providing reconstruction of the spectral envelope

Publications (2)

Publication Number Publication Date
US20010056347A1 true US20010056347A1 (en) 2001-12-27
US7035791B2 US7035791B2 (en) 2006-04-25

Family

ID=23714693

Family Applications (2)

Application Number Title Priority Date Filing Date
US09/432,081 Expired - Lifetime US6725190B1 (en) 1999-11-02 1999-11-02 Method and system for speech reconstruction from speech recognition features, pitch and voicing with resampled basis functions providing reconstruction of the spectral envelope
US09/901,031 Expired - Lifetime US7035791B2 (en) 1999-11-02 2001-07-10 Feature-domain concatenative speech synthesis

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US09/432,081 Expired - Lifetime US6725190B1 (en) 1999-11-02 1999-11-02 Method and system for speech reconstruction from speech recognition features, pitch and voicing with resampled basis functions providing reconstruction of the spectral envelope

Country Status (2)

Country Link
US (2) US6725190B1 (en)
IL (1) IL135192A (en)

Cited By (138)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040049375A1 (en) * 2001-06-04 2004-03-11 Brittan Paul St John Speech synthesis apparatus and method
FR2846457A1 (en) * 2002-10-25 2004-04-30 France Telecom Speech processing acoustic unit division procedure classifies units using acoustic distance with regrouping using neural network weightings into limited size classes
US20040260551A1 (en) * 2003-06-19 2004-12-23 International Business Machines Corporation System and method for configuring voice readers using semantic analysis
US20050137862A1 (en) * 2003-12-19 2005-06-23 Ibm Corporation Voice model for speech processing
USRE39336E1 (en) * 1998-11-25 2006-10-10 Matsushita Electric Industrial Co., Ltd. Formant-based speech synthesizer employing demi-syllable concatenation with independent cross fade in the filter parameter and source domains
US20060229876A1 (en) * 2005-04-07 2006-10-12 International Business Machines Corporation Method, apparatus and computer program providing a multi-speaker database for concatenative text-to-speech synthesis
US20070198511A1 (en) * 2006-02-23 2007-08-23 Samsung Electronics Co., Ltd. Method, medium, and system retrieving a media file based on extracted partial keyword
US20080059184A1 (en) * 2006-08-22 2008-03-06 Microsoft Corporation Calculating cost measures between HMM acoustic models
US20080059190A1 (en) * 2006-08-22 2008-03-06 Microsoft Corporation Speech unit selection using HMM acoustic models
US20080177548A1 (en) * 2005-05-31 2008-07-24 Canon Kabushiki Kaisha Speech Synthesis Method and Apparatus
US20090048836A1 (en) * 2003-10-23 2009-02-19 Bellegarda Jerome R Data-driven global boundary optimization
US20090055162A1 (en) * 2007-08-20 2009-02-26 Microsoft Corporation Hmm-based bilingual (mandarin-english) tts techniques
US20090177473A1 (en) * 2008-01-07 2009-07-09 Aaron Andrew S Applying vocal characteristics from a target speaker to a source speaker for synthetic speech
US20100145691A1 (en) * 2003-10-23 2010-06-10 Bellegarda Jerome R Global boundary-centric feature extraction and associated discontinuity metrics
US20110218804A1 (en) * 2010-03-02 2011-09-08 Kabushiki Kaisha Toshiba Speech processor, a speech processing method and a method of training a speech processor
JP2013057735A (en) * 2011-09-07 2013-03-28 National Institute Of Information & Communication Technology Hidden markov model learning device for voice synthesis and voice synthesizer
US20130268275A1 (en) * 2007-09-07 2013-10-10 Nuance Communications, Inc. Speech synthesis system, speech synthesis program product, and speech synthesis method
US8682670B2 (en) * 2011-07-07 2014-03-25 International Business Machines Corporation Statistical enhancement of speech output from a statistical text-to-speech synthesis system
US8892446B2 (en) 2010-01-18 2014-11-18 Apple Inc. Service orchestration for intelligent automated assistant
US20140350940A1 (en) * 2009-09-21 2014-11-27 At&T Intellectual Property I, L.P. System and Method for Generalized Preselection for Unit Selection Synthesis
US20150025891A1 (en) * 2007-03-20 2015-01-22 Nuance Communications, Inc. Method and system for text-to-speech synthesis with personalized voice
US8977584B2 (en) 2010-01-25 2015-03-10 Newvaluexchange Global Ai Llp Apparatuses, methods and systems for a digital conversation management platform
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US9300784B2 (en) 2013-06-13 2016-03-29 Apple Inc. System and method for emergency calls initiated by voice command
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US9368114B2 (en) 2013-03-14 2016-06-14 Apple Inc. Context-sensitive handling of interruptions
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US9483461B2 (en) 2012-03-06 2016-11-01 Apple Inc. Handling speech synthesis of content for multiple languages
US9495129B2 (en) 2012-06-29 2016-11-15 Apple Inc. Device, method, and user interface for voice-activated navigation and browsing of a document
US9502031B2 (en) 2014-05-27 2016-11-22 Apple Inc. Method for supporting dynamic grammars in WFST-based ASR
US9535906B2 (en) 2008-07-31 2017-01-03 Apple Inc. Mobile device having human language translation capability with positional feedback
US9576574B2 (en) 2012-09-10 2017-02-21 Apple Inc. Context-sensitive handling of interruptions by intelligent digital assistant
US9582608B2 (en) 2013-06-07 2017-02-28 Apple Inc. Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
US9620104B2 (en) 2013-06-07 2017-04-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9620105B2 (en) 2014-05-15 2017-04-11 Apple Inc. Analyzing audio input for efficient speech and music recognition
US9626955B2 (en) 2008-04-05 2017-04-18 Apple Inc. Intelligent text-to-speech conversion
US9633660B2 (en) 2010-02-25 2017-04-25 Apple Inc. User profiling for voice input processing
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
US9633674B2 (en) 2013-06-07 2017-04-25 Apple Inc. System and method for detecting errors in interactions with a voice-based digital assistant
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US9646614B2 (en) 2000-03-16 2017-05-09 Apple Inc. Fast, language-independent method for user authentication by voice
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US9697822B1 (en) 2013-03-15 2017-07-04 Apple Inc. System and method for updating an adaptive speech recognition model
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
US9711141B2 (en) 2014-12-09 2017-07-18 Apple Inc. Disambiguating heteronyms in speech synthesis
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US9734193B2 (en) 2014-05-30 2017-08-15 Apple Inc. Determining domain salience ranking from ambiguous words in natural speech
US9760559B2 (en) 2014-05-30 2017-09-12 Apple Inc. Predictive text input
US9785630B2 (en) 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US9798393B2 (en) 2011-08-29 2017-10-24 Apple Inc. Text correction processing
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US9842101B2 (en) 2014-05-30 2017-12-12 Apple Inc. Predictive conversion of language input
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US9886432B2 (en) 2014-09-30 2018-02-06 Apple Inc. Parsimonious handling of word inflection via categorical stem + suffix N-gram language models
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US9899019B2 (en) 2015-03-18 2018-02-20 Apple Inc. Systems and methods for structured stem and suffix language models
US9922642B2 (en) 2013-03-15 2018-03-20 Apple Inc. Training an at least partial voice command system
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9953088B2 (en) 2012-05-14 2018-04-24 Apple Inc. Crowd sourcing information to fulfill user requests
US9959870B2 (en) 2008-12-11 2018-05-01 Apple Inc. Speech recognition involving a mobile device
US9966065B2 (en) 2014-05-30 2018-05-08 Apple Inc. Multi-command single utterance input method
US9966068B2 (en) 2013-06-08 2018-05-08 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10057736B2 (en) 2011-06-03 2018-08-21 Apple Inc. Active transport based notifications
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US10078631B2 (en) 2014-05-30 2018-09-18 Apple Inc. Entropy-guided text prediction using combined word and character n-gram language models
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US10089072B2 (en) 2016-06-11 2018-10-02 Apple Inc. Intelligent device arbitration and control
US10101822B2 (en) 2015-06-05 2018-10-16 Apple Inc. Language input correction
US10127220B2 (en) 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
US10185542B2 (en) 2013-06-09 2019-01-22 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US10186254B2 (en) 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10199051B2 (en) 2013-02-07 2019-02-05 Apple Inc. Voice trigger for a digital assistant
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US10255907B2 (en) 2015-06-07 2019-04-09 Apple Inc. Automatic accent detection using acoustic models
US10269345B2 (en) 2016-06-11 2019-04-23 Apple Inc. Intelligent task discovery
US10276170B2 (en) 2010-01-18 2019-04-30 Apple Inc. Intelligent automated assistant
US10283110B2 (en) 2009-07-02 2019-05-07 Apple Inc. Methods and apparatuses for automatic speech recognition
US10289433B2 (en) 2014-05-30 2019-05-14 Apple Inc. Domain specific language for encoding assistant dialog
US10297253B2 (en) 2016-06-11 2019-05-21 Apple Inc. Application integration with a digital assistant
EP3152752A4 (en) * 2014-06-05 2019-05-29 Nuance Communications, Inc. Systems and methods for generating speech of multiple styles from text
US10318871B2 (en) 2005-09-08 2019-06-11 Apple Inc. Method and apparatus for building an intelligent automated assistant
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US10446141B2 (en) 2014-08-28 2019-10-15 Apple Inc. Automatic speech recognition based on user feedback
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10496753B2 (en) 2010-01-18 2019-12-03 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10521466B2 (en) 2016-06-11 2019-12-31 Apple Inc. Data driven natural language event detection and classification
US10552013B2 (en) 2014-12-02 2020-02-04 Apple Inc. Data detection
US10553209B2 (en) 2010-01-18 2020-02-04 Apple Inc. Systems and methods for hands-free notification summaries
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US10568032B2 (en) 2007-04-03 2020-02-18 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10592095B2 (en) 2014-05-23 2020-03-17 Apple Inc. Instantaneous speaking of content on touch devices
US10659851B2 (en) 2014-06-30 2020-05-19 Apple Inc. Real-time digital assistant knowledge updates
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10679605B2 (en) 2010-01-18 2020-06-09 Apple Inc. Hands-free list-reading by intelligent automated assistant
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10705794B2 (en) 2010-01-18 2020-07-07 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US10706373B2 (en) 2011-06-03 2020-07-07 Apple Inc. Performing actions associated with task items that represent tasks to perform
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US10762293B2 (en) 2010-12-22 2020-09-01 Apple Inc. Using parts-of-speech tagging and named entity recognition for spelling correction
US10791216B2 (en) 2013-08-06 2020-09-29 Apple Inc. Auto-activating smart responses based on activities from remote devices
US10789041B2 (en) 2014-09-12 2020-09-29 Apple Inc. Dynamic thresholds for always listening speech trigger
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US20210366463A1 (en) * 2017-03-29 2021-11-25 Google Llc End-to-end text-to-speech conversion
US11217255B2 (en) 2017-05-16 2022-01-04 Apple Inc. Far-field extension for digital assistant services
US11227579B2 (en) * 2019-08-08 2022-01-18 International Business Machines Corporation Data augmentation by frame insertion for speech data
US11430423B1 (en) * 2018-04-19 2022-08-30 Weatherology, LLC Method for automatically translating raw data into real human voiced audio content
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification

Families Citing this family (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5621852A (en) * 1993-12-14 1997-04-15 Interdigital Technology Corporation Efficient codebook structure for code excited linear prediction coding
US6910011B1 (en) * 1999-08-16 2005-06-21 Haman Becker Automotive Systems - Wavemakers, Inc. Noisy acoustic signal enhancement
US6725190B1 (en) * 1999-11-02 2004-04-20 International Business Machines Corporation Method and system for speech reconstruction from speech recognition features, pitch and voicing with resampled basis functions providing reconstruction of the spectral envelope
US20030182106A1 (en) * 2002-03-13 2003-09-25 Spectral Design Method and device for changing the temporal length and/or the tone pitch of a discrete audio signal
US7376553B2 (en) * 2003-07-08 2008-05-20 Robert Patel Quinn Fractal harmonic overtone mapping of speech and musical sounds
US7610196B2 (en) * 2004-10-26 2009-10-27 Qnx Software Systems (Wavemakers), Inc. Periodic signal enhancement system
US8543390B2 (en) * 2004-10-26 2013-09-24 Qnx Software Systems Limited Multi-channel periodic signal enhancement system
US7949520B2 (en) 2004-10-26 2011-05-24 QNX Software Sytems Co. Adaptive filter pitch extraction
US7680652B2 (en) * 2004-10-26 2010-03-16 Qnx Software Systems (Wavemakers), Inc. Periodic signal enhancement system
US7716046B2 (en) * 2004-10-26 2010-05-11 Qnx Software Systems (Wavemakers), Inc. Advanced periodic signal enhancement
US8306821B2 (en) 2004-10-26 2012-11-06 Qnx Software Systems Limited Sub-band periodic signal enhancement system
US8170879B2 (en) * 2004-10-26 2012-05-01 Qnx Software Systems Limited Periodic signal enhancement system
US8520861B2 (en) * 2005-05-17 2013-08-27 Qnx Software Systems Limited Signal processing system for tonal noise robustness
US20070118361A1 (en) * 2005-10-07 2007-05-24 Deepen Sinha Window apparatus and method
GB2433150B (en) * 2005-12-08 2009-10-07 Toshiba Res Europ Ltd Method and apparatus for labelling speech
US7783488B2 (en) * 2005-12-19 2010-08-24 Nuance Communications, Inc. Remote tracing and debugging of automatic speech recognition servers by speech reconstruction from cepstra and pitch information
US20080058607A1 (en) * 2006-08-08 2008-03-06 Zargis Medical Corp Categorizing automatically generated physiological data based on industry guidelines
US7805308B2 (en) * 2007-01-19 2010-09-28 Microsoft Corporation Hidden trajectory modeling with differential cepstra for speech recognition
JP5233986B2 (en) * 2007-03-12 2013-07-10 Fujitsu Limited Speech waveform interpolation apparatus and method
US20080231557A1 (en) * 2007-03-20 2008-09-25 Leadis Technology, Inc. Emission control in aged active matrix oled display using voltage ratio or current ratio
US8904400B2 (en) * 2007-09-11 2014-12-02 2236008 Ontario Inc. Processing system having a partitioning component for resource partitioning
US8850154B2 (en) 2007-09-11 2014-09-30 2236008 Ontario Inc. Processing system having memory partitioning
US8694310B2 (en) 2007-09-17 2014-04-08 Qnx Software Systems Limited Remote control server protocol system
DE602007004504D1 (en) * 2007-10-29 2010-03-11 Harman Becker Automotive Sys Partial speech reconstruction
JP5159279B2 (en) * 2007-12-03 2013-03-06 Kabushiki Kaisha Toshiba Speech processing apparatus and speech synthesizer using the same
KR101235830B1 (en) * 2007-12-06 2013-02-21 Electronics and Telecommunications Research Institute Apparatus for enhancing quality of speech codec and method therefor
US8209514B2 (en) * 2008-02-04 2012-06-26 Qnx Software Systems Limited Media processing system having resource partitioning
US8374873B2 (en) * 2008-08-12 2013-02-12 Morphism, Llc Training and applying prosody models
US8321225B1 (en) 2008-11-14 2012-11-27 Google Inc. Generating prosodic contours for synthesized speech
US8620643B1 (en) * 2009-07-31 2013-12-31 Lester F. Ludwig Auditory eigenfunction systems and methods
EP2363852B1 (en) * 2010-03-04 2012-05-16 Deutsche Telekom AG Computer-based method and system of assessing intelligibility of speech represented by a speech signal
CN102237081B (en) * 2010-04-30 2013-04-24 International Business Machines Corporation Method and system for estimating prosody of speech
US8595005B2 (en) * 2010-05-31 2013-11-26 Simple Emotion, Inc. System and method for recognizing emotional state from a speech signal
US10026407B1 (en) 2010-12-17 2018-07-17 Arrowhead Center, Inc. Low bit-rate speech coding through quantization of mel-frequency cepstral coefficients
US8620646B2 (en) * 2011-08-08 2013-12-31 The Intellisis Corporation System and method for tracking sound pitch across an audio signal using harmonic envelope
US9076446B2 (en) * 2012-03-22 2015-07-07 Qiguang Lin Method and apparatus for robust speaker and speech recognition
CN103366737B (en) * 2012-03-30 2016-08-10 Kabushiki Kaisha Toshiba Apparatus and method for applying tone features in automatic speech recognition
US9082401B1 (en) * 2013-01-09 2015-07-14 Google Inc. Text-to-speech synthesis
CN103528968B (en) * 2013-11-01 2016-01-20 University of Shanghai for Science and Technology Reflectance spectrum reconstruction method based on an iterative method
JP2017508188A (en) 2014-01-28 2017-03-23 Simple Emotion, Inc. A method for adaptive spoken dialogue
US9348812B2 (en) 2014-03-14 2016-05-24 Splice Software Inc. Method, system and apparatus for assembling a recording plan and data driven dialogs for automated communications
JP6499305B2 (en) * 2015-09-16 2019-04-10 Kabushiki Kaisha Toshiba Speech synthesis apparatus, speech synthesis method, speech synthesis program, speech synthesis model learning apparatus, speech synthesis model learning method, and speech synthesis model learning program
US10726826B2 (en) 2018-03-04 2020-07-28 International Business Machines Corporation Voice-transformation based data augmentation for prosodic classification
US11935539B1 (en) * 2019-01-31 2024-03-19 Alan AI, Inc. Integrating voice controls into applications
US11955120B1 (en) 2019-01-31 2024-04-09 Alan AI, Inc. Systems and methods for integrating voice controls into applications

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE3779351D1 (en) * 1986-03-28 1992-07-02 American Telephone And Telegraph Co., New York, N.Y., Us
US4797926A (en) * 1986-09-11 1989-01-10 American Telephone And Telegraph Company, At&T Bell Laboratories Digital speech vocoder
US5384891A (en) * 1988-09-28 1995-01-24 Hitachi, Ltd. Vector quantizing apparatus and speech analysis-synthesis system using the apparatus
CA1321645C (en) * 1988-09-28 1993-08-24 Akira Ichikawa Method and system for voice coding based on vector quantization
ZA948426B (en) * 1993-12-22 1995-06-30 Qualcomm Inc Distributed voice recognition system
US5528516A (en) 1994-05-25 1996-06-18 System Management Arts, Inc. Apparatus and method for event correlation and problem reporting
US5787387A (en) * 1994-07-11 1998-07-28 Voxware, Inc. Harmonic adaptive speech coding method and system
US5774837A (en) * 1995-09-13 1998-06-30 Voxware, Inc. Speech coding system and method using voicing probability determination
US5839098A (en) * 1996-12-19 1998-11-17 Lucent Technologies Inc. Speech coder methods and systems
TW358925B (en) * 1997-12-31 1999-05-21 Ind Tech Res Inst Improved excitation coding for a low-bit-rate sinusoidal transform speech coder

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4896359A (en) * 1987-05-18 1990-01-23 Kokusai Denshin Denwa, Co., Ltd. Speech synthesis system by rule using phonemes as synthesis units
US5485543A (en) * 1989-03-13 1996-01-16 Canon Kabushiki Kaisha Method and apparatus for speech analysis and synthesis by sampling a power spectrum of input speech
US5165008A (en) * 1991-09-18 1992-11-17 U S West Advanced Technologies, Inc. Speech synthesis using perceptual linear prediction parameters
US5940795A (en) * 1991-11-12 1999-08-17 Fujitsu Limited Speech synthesis system
US5740320A (en) * 1993-03-10 1998-04-14 Nippon Telegraph And Telephone Corporation Text-to-speech synthesis by concatenation using or modifying clustered phoneme waveforms on basis of cluster parameter centroids
US5774855A (en) * 1994-09-29 1998-06-30 Cselt-Centro Studi E Laboratori Tellecomunicazioni S.P.A. Method of speech synthesis by means of concentration and partial overlapping of waveforms
US5528518A (en) * 1994-10-25 1996-06-18 Laser Technology, Inc. System and method for collecting data used to form a geographic information system database
US5751907A (en) * 1995-08-16 1998-05-12 Lucent Technologies Inc. Speech synthesizer having an acoustic element database
US6076083A (en) * 1995-08-20 2000-06-13 Baker; Michelle Diagnostic system utilizing a Bayesian network model having link weights updated experimentally
US5913193A (en) * 1996-04-30 1999-06-15 Microsoft Corporation Method and system of runtime acoustic unit selection for speech synthesis
US6366883B1 (en) * 1996-05-15 2002-04-02 Atr Interpreting Telecommunications Concatenation of speech segments by use of a speech synthesizer
US6041300A (en) * 1997-03-21 2000-03-21 International Business Machines Corporation System and method of using pre-enrolled speech sub-units for efficient speech synthesis
US6334106B1 (en) * 1997-05-21 2001-12-25 Nippon Telegraph And Telephone Corporation Method for editing non-verbal information by adding mental state information to a speech message
US6134528A (en) * 1997-06-13 2000-10-17 Motorola, Inc. Method device and article of manufacture for neural-network based generation of postlexical pronunciations from lexical pronunciations
US6101470A (en) * 1998-05-26 2000-08-08 International Business Machines Corporation Methods for generating pitch and duration contours in a text to speech system
US6266637B1 (en) * 1998-09-11 2001-07-24 International Business Machines Corporation Phrase splicing and variable substitution using a trainable speech synthesizer
US6665641B1 (en) * 1998-11-13 2003-12-16 Scansoft, Inc. Speech synthesis using concatenation of speech waveforms
US6195632B1 (en) * 1998-11-25 2001-02-27 Matsushita Electric Industrial Co., Ltd. Extracting formant-based source-filter data for coding and synthesis employing cost function and inverse filtering
US6697780B1 (en) * 1999-04-30 2004-02-24 At&T Corp. Method and apparatus for rapid acoustic unit selection from a large speech corpus
US6725190B1 (en) * 1999-11-02 2004-04-20 International Business Machines Corporation Method and system for speech reconstruction from speech recognition features, pitch and voicing with resampled basis functions providing reconstruction of the spectral envelope
US6587816B1 (en) * 2000-07-14 2003-07-01 International Business Machines Corporation Fast frequency-domain pitch estimation

Cited By (195)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
USRE39336E1 (en) * 1998-11-25 2006-10-10 Matsushita Electric Industrial Co., Ltd. Formant-based speech synthesizer employing demi-syllable concatenation with independent cross fade in the filter parameter and source domains
US9646614B2 (en) 2000-03-16 2017-05-09 Apple Inc. Fast, language-independent method for user authentication by voice
US20040049375A1 (en) * 2001-06-04 2004-03-11 Brittan Paul St John Speech synthesis apparatus and method
US7062439B2 (en) * 2001-06-04 2006-06-13 Hewlett-Packard Development Company, L.P. Speech synthesis apparatus and method
FR2846457A1 (en) * 2002-10-25 2004-04-30 France Telecom Speech processing acoustic unit division procedure classifies units using acoustic distance with regrouping using neural network weightings into limited size classes
US20040260551A1 (en) * 2003-06-19 2004-12-23 International Business Machines Corporation System and method for configuring voice readers using semantic analysis
US20070276667A1 (en) * 2003-06-19 2007-11-29 Atkin Steven E System and Method for Configuring Voice Readers Using Semantic Analysis
US20100145691A1 (en) * 2003-10-23 2010-06-10 Bellegarda Jerome R Global boundary-centric feature extraction and associated discontinuity metrics
US8015012B2 (en) 2003-10-23 2011-09-06 Apple Inc. Data-driven global boundary optimization
US7930172B2 (en) * 2003-10-23 2011-04-19 Apple Inc. Global boundary-centric feature extraction and associated discontinuity metrics
US20090048836A1 (en) * 2003-10-23 2009-02-19 Bellegarda Jerome R Data-driven global boundary optimization
US20050137862A1 (en) * 2003-12-19 2005-06-23 Ibm Corporation Voice model for speech processing
US7702503B2 (en) 2003-12-19 2010-04-20 Nuance Communications, Inc. Voice model for speech processing based on ordered average ranks of spectral features
US7412377B2 (en) 2003-12-19 2008-08-12 International Business Machines Corporation Voice model for speech processing based on ordered average ranks of spectral features
US20060229876A1 (en) * 2005-04-07 2006-10-12 International Business Machines Corporation Method, apparatus and computer program providing a multi-speaker database for concatenative text-to-speech synthesis
US7716052B2 (en) * 2005-04-07 2010-05-11 Nuance Communications, Inc. Method, apparatus and computer program providing a multi-speaker database for concatenative text-to-speech synthesis
US20080177548A1 (en) * 2005-05-31 2008-07-24 Canon Kabushiki Kaisha Speech Synthesis Method and Apparatus
US10318871B2 (en) 2005-09-08 2019-06-11 Apple Inc. Method and apparatus for building an intelligent automated assistant
US20070198511A1 (en) * 2006-02-23 2007-08-23 Samsung Electronics Co., Ltd. Method, medium, and system retrieving a media file based on extracted partial keyword
US8356032B2 (en) * 2006-02-23 2013-01-15 Samsung Electronics Co., Ltd. Method, medium, and system retrieving a media file based on extracted partial keyword
US8234116B2 (en) 2006-08-22 2012-07-31 Microsoft Corporation Calculating cost measures between HMM acoustic models
US20080059190A1 (en) * 2006-08-22 2008-03-06 Microsoft Corporation Speech unit selection using HMM acoustic models
US20080059184A1 (en) * 2006-08-22 2008-03-06 Microsoft Corporation Calculating cost measures between HMM acoustic models
US9117447B2 (en) 2006-09-08 2015-08-25 Apple Inc. Using event alert text as input to an automated assistant
US8942986B2 (en) 2006-09-08 2015-01-27 Apple Inc. Determining user intent based on ontologies of domains
US8930191B2 (en) 2006-09-08 2015-01-06 Apple Inc. Paraphrasing of user requests and results by automated digital assistant
US9368102B2 (en) * 2007-03-20 2016-06-14 Nuance Communications, Inc. Method and system for text-to-speech synthesis with personalized voice
US20150025891A1 (en) * 2007-03-20 2015-01-22 Nuance Communications, Inc. Method and system for text-to-speech synthesis with personalized voice
US10568032B2 (en) 2007-04-03 2020-02-18 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US20090055162A1 (en) * 2007-08-20 2009-02-26 Microsoft Corporation HMM-based bilingual (Mandarin-English) TTS techniques
US8244534B2 (en) * 2007-08-20 2012-08-14 Microsoft Corporation HMM-based bilingual (Mandarin-English) TTS techniques
US9275631B2 (en) * 2007-09-07 2016-03-01 Nuance Communications, Inc. Speech synthesis system, speech synthesis program product, and speech synthesis method
US20130268275A1 (en) * 2007-09-07 2013-10-10 Nuance Communications, Inc. Speech synthesis system, speech synthesis program product, and speech synthesis method
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
US10381016B2 (en) 2008-01-03 2019-08-13 Apple Inc. Methods and apparatus for altering audio output signals
US20090177473A1 (en) * 2008-01-07 2009-07-09 Aaron Andrew S Applying vocal characteristics from a target speaker to a source speaker for synthetic speech
US9865248B2 (en) 2008-04-05 2018-01-09 Apple Inc. Intelligent text-to-speech conversion
US9626955B2 (en) 2008-04-05 2017-04-18 Apple Inc. Intelligent text-to-speech conversion
US9535906B2 (en) 2008-07-31 2017-01-03 Apple Inc. Mobile device having human language translation capability with positional feedback
US10108612B2 (en) 2008-07-31 2018-10-23 Apple Inc. Mobile device having human language translation capability with positional feedback
US9959870B2 (en) 2008-12-11 2018-05-01 Apple Inc. Speech recognition involving a mobile device
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US10475446B2 (en) 2009-06-05 2019-11-12 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US11080012B2 (en) 2009-06-05 2021-08-03 Apple Inc. Interface for a virtual digital assistant
US10795541B2 (en) 2009-06-05 2020-10-06 Apple Inc. Intelligent organization of tasks items
US10283110B2 (en) 2009-07-02 2019-05-07 Apple Inc. Methods and apparatuses for automatic speech recognition
US9564121B2 (en) * 2009-09-21 2017-02-07 At&T Intellectual Property I, L.P. System and method for generalized preselection for unit selection synthesis
US20140350940A1 (en) * 2009-09-21 2014-11-27 At&T Intellectual Property I, L.P. System and Method for Generalized Preselection for Unit Selection Synthesis
US10706841B2 (en) 2010-01-18 2020-07-07 Apple Inc. Task flow identification based on user intent
US10705794B2 (en) 2010-01-18 2020-07-07 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US8892446B2 (en) 2010-01-18 2014-11-18 Apple Inc. Service orchestration for intelligent automated assistant
US10276170B2 (en) 2010-01-18 2019-04-30 Apple Inc. Intelligent automated assistant
US10496753B2 (en) 2010-01-18 2019-12-03 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US9548050B2 (en) 2010-01-18 2017-01-17 Apple Inc. Intelligent automated assistant
US10553209B2 (en) 2010-01-18 2020-02-04 Apple Inc. Systems and methods for hands-free notification summaries
US8903716B2 (en) 2010-01-18 2014-12-02 Apple Inc. Personalized vocabulary for digital assistant
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US10679605B2 (en) 2010-01-18 2020-06-09 Apple Inc. Hands-free list-reading by intelligent automated assistant
US9424861B2 (en) 2010-01-25 2016-08-23 Newvaluexchange Ltd Apparatuses, methods and systems for a digital conversation management platform
US8977584B2 (en) 2010-01-25 2015-03-10 Newvaluexchange Global Ai Llp Apparatuses, methods and systems for a digital conversation management platform
US9424862B2 (en) 2010-01-25 2016-08-23 Newvaluexchange Ltd Apparatuses, methods and systems for a digital conversation management platform
US9431028B2 (en) 2010-01-25 2016-08-30 Newvaluexchange Ltd Apparatuses, methods and systems for a digital conversation management platform
US9633660B2 (en) 2010-02-25 2017-04-25 Apple Inc. User profiling for voice input processing
US10049675B2 (en) 2010-02-25 2018-08-14 Apple Inc. User profiling for voice input processing
US9043213B2 (en) * 2010-03-02 2015-05-26 Kabushiki Kaisha Toshiba Speech recognition and synthesis utilizing context dependent acoustic models containing decision trees
US20110218804A1 (en) * 2010-03-02 2011-09-08 Kabushiki Kaisha Toshiba Speech processor, a speech processing method and a method of training a speech processor
US10762293B2 (en) 2010-12-22 2020-09-01 Apple Inc. Using parts-of-speech tagging and named entity recognition for spelling correction
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US10102359B2 (en) 2011-03-21 2018-10-16 Apple Inc. Device access using voice authentication
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US10706373B2 (en) 2011-06-03 2020-07-07 Apple Inc. Performing actions associated with task items that represent tasks to perform
US10057736B2 (en) 2011-06-03 2018-08-21 Apple Inc. Active transport based notifications
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US8682670B2 (en) * 2011-07-07 2014-03-25 International Business Machines Corporation Statistical enhancement of speech output from a statistical text-to-speech synthesis system
US9798393B2 (en) 2011-08-29 2017-10-24 Apple Inc. Text correction processing
JP2013057735A (en) * 2011-09-07 2013-03-28 National Institute Of Information & Communication Technology Hidden Markov model learning device for voice synthesis and voice synthesizer
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US9483461B2 (en) 2012-03-06 2016-11-01 Apple Inc. Handling speech synthesis of content for multiple languages
US9953088B2 (en) 2012-05-14 2018-04-24 Apple Inc. Crowd sourcing information to fulfill user requests
US10079014B2 (en) 2012-06-08 2018-09-18 Apple Inc. Name recognition system
US9495129B2 (en) 2012-06-29 2016-11-15 Apple Inc. Device, method, and user interface for voice-activated navigation and browsing of a document
US9576574B2 (en) 2012-09-10 2017-02-21 Apple Inc. Context-sensitive handling of interruptions by intelligent digital assistant
US9971774B2 (en) 2012-09-19 2018-05-15 Apple Inc. Voice-based media searching
US10199051B2 (en) 2013-02-07 2019-02-05 Apple Inc. Voice trigger for a digital assistant
US10978090B2 (en) 2013-02-07 2021-04-13 Apple Inc. Voice trigger for a digital assistant
US9368114B2 (en) 2013-03-14 2016-06-14 Apple Inc. Context-sensitive handling of interruptions
US9922642B2 (en) 2013-03-15 2018-03-20 Apple Inc. Training an at least partial voice command system
US9697822B1 (en) 2013-03-15 2017-07-04 Apple Inc. System and method for updating an adaptive speech recognition model
US9966060B2 (en) 2013-06-07 2018-05-08 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9582608B2 (en) 2013-06-07 2017-02-28 Apple Inc. Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
US9620104B2 (en) 2013-06-07 2017-04-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9633674B2 (en) 2013-06-07 2017-04-25 Apple Inc. System and method for detecting errors in interactions with a voice-based digital assistant
US9966068B2 (en) 2013-06-08 2018-05-08 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10657961B2 (en) 2013-06-08 2020-05-19 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10185542B2 (en) 2013-06-09 2019-01-22 Apple Inc. Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
US9300784B2 (en) 2013-06-13 2016-03-29 Apple Inc. System and method for emergency calls initiated by voice command
US10791216B2 (en) 2013-08-06 2020-09-29 Apple Inc. Auto-activating smart responses based on activities from remote devices
US9620105B2 (en) 2014-05-15 2017-04-11 Apple Inc. Analyzing audio input for efficient speech and music recognition
US10592095B2 (en) 2014-05-23 2020-03-17 Apple Inc. Instantaneous speaking of content on touch devices
US9502031B2 (en) 2014-05-27 2016-11-22 Apple Inc. Method for supporting dynamic grammars in WFST-based ASR
US10497365B2 (en) 2014-05-30 2019-12-03 Apple Inc. Multi-command single utterance input method
US9760559B2 (en) 2014-05-30 2017-09-12 Apple Inc. Predictive text input
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US10083690B2 (en) 2014-05-30 2018-09-25 Apple Inc. Better resolution when referencing to concepts
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US10078631B2 (en) 2014-05-30 2018-09-18 Apple Inc. Entropy-guided text prediction using combined word and character n-gram language models
US9842101B2 (en) 2014-05-30 2017-12-12 Apple Inc. Predictive conversion of language input
US9785630B2 (en) 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US11257504B2 (en) 2014-05-30 2022-02-22 Apple Inc. Intelligent assistant for home automation
US9966065B2 (en) 2014-05-30 2018-05-08 Apple Inc. Multi-command single utterance input method
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US10169329B2 (en) 2014-05-30 2019-01-01 Apple Inc. Exemplar-based natural language processing
US9734193B2 (en) 2014-05-30 2017-08-15 Apple Inc. Determining domain salience ranking from ambiguous words in natural speech
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US10289433B2 (en) 2014-05-30 2019-05-14 Apple Inc. Domain specific language for encoding assistant dialog
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
EP3152752A4 (en) * 2014-06-05 2019-05-29 Nuance Communications, Inc. Systems and methods for generating speech of multiple styles from text
US9668024B2 (en) 2014-06-30 2017-05-30 Apple Inc. Intelligent automated assistant for TV user interactions
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US10659851B2 (en) 2014-06-30 2020-05-19 Apple Inc. Real-time digital assistant knowledge updates
US10904611B2 (en) 2014-06-30 2021-01-26 Apple Inc. Intelligent automated assistant for TV user interactions
US10446141B2 (en) 2014-08-28 2019-10-15 Apple Inc. Automatic speech recognition based on user feedback
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10789041B2 (en) 2014-09-12 2020-09-29 Apple Inc. Dynamic thresholds for always listening speech trigger
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US9886432B2 (en) 2014-09-30 2018-02-06 Apple Inc. Parsimonious handling of word inflection via categorical stem + suffix N-gram language models
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US10552013B2 (en) 2014-12-02 2020-02-04 Apple Inc. Data detection
US11556230B2 (en) 2014-12-02 2023-01-17 Apple Inc. Data detection
US9711141B2 (en) 2014-12-09 2017-07-18 Apple Inc. Disambiguating heteronyms in speech synthesis
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US11087759B2 (en) 2015-03-08 2021-08-10 Apple Inc. Virtual assistant activation
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US9899019B2 (en) 2015-03-18 2018-02-20 Apple Inc. Systems and methods for structured stem and suffix language models
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US10127220B2 (en) 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10101822B2 (en) 2015-06-05 2018-10-16 Apple Inc. Language input correction
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US10186254B2 (en) 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US10255907B2 (en) 2015-06-07 2019-04-09 Apple Inc. Automatic accent detection using acoustic models
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US11069347B2 (en) 2016-06-08 2021-07-20 Apple Inc. Intelligent automated assistant for media exploration
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10521466B2 (en) 2016-06-11 2019-12-31 Apple Inc. Data driven natural language event detection and classification
US10089072B2 (en) 2016-06-11 2018-10-02 Apple Inc. Intelligent device arbitration and control
US10269345B2 (en) 2016-06-11 2019-04-23 Apple Inc. Intelligent task discovery
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US10297253B2 (en) 2016-06-11 2019-05-21 Apple Inc. Application integration with a digital assistant
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10553215B2 (en) 2016-09-23 2020-02-04 Apple Inc. Intelligent automated assistant
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US11862142B2 (en) * 2017-03-29 2024-01-02 Google Llc End-to-end text-to-speech conversion
US20210366463A1 (en) * 2017-03-29 2021-11-25 Google Llc End-to-end text-to-speech conversion
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US11217255B2 (en) 2017-05-16 2022-01-04 Apple Inc. Far-field extension for digital assistant services
US11430423B1 (en) * 2018-04-19 2022-08-30 Weatherology, LLC Method for automatically translating raw data into real human voiced audio content
US11227579B2 (en) * 2019-08-08 2022-01-18 International Business Machines Corporation Data augmentation by frame insertion for speech data

Also Published As

Publication number Publication date
IL135192A0 (en) 2001-05-20
US7035791B2 (en) 2006-04-25
IL135192A (en) 2004-06-20
US6725190B1 (en) 2004-04-20

Similar Documents

Publication Publication Date Title
US7035791B2 (en) Feature-domain concatenative speech synthesis
US8280724B2 (en) Speech synthesis using complex spectral modeling
US5905972A (en) Prosodic databases holding fundamental frequency templates for use in speech synthesis
JP4176169B2 (en) Runtime acoustic unit selection method and apparatus for speech synthesis
Zen et al. An overview of Nitech HMM-based speech synthesis system for Blizzard Challenge 2005
US9368103B2 (en) Estimation system of spectral envelopes and group delays for sound analysis and synthesis, and audio signal synthesis system
Malfrère et al. High-quality speech synthesis for phonetic speech segmentation
US10692484B1 (en) Text-to-speech (TTS) processing
US11763797B2 (en) Text-to-speech (TTS) processing
EP2109096B1 (en) Speech synthesis with dynamic constraints
Lee Statistical approach for voice personality transformation
RU2427044C1 (en) Text-dependent voice conversion method
KR20180078252A (en) Method of forming excitation signal of parametric speech synthesis system based on gesture pulse model
JP2898568B2 (en) Voice conversion speech synthesizer
JP6330069B2 (en) Multi-stream spectral representation for statistical parametric speech synthesis
Lee et al. A segmental speech coder based on a concatenative TTS
JP3281266B2 (en) Speech synthesis method and apparatus
Narendra et al. Time-domain deterministic plus noise model based hybrid source modeling for statistical parametric speech synthesis
Wen et al. Pitch-scaled spectrum based excitation model for HMM-based speech synthesis
EP1589524B1 (en) Method and device for speech synthesis
Baudoin et al. Advances in very low bit rate speech coding using recognition and synthesis techniques
JPH1185193A (en) Phoneme information optimization method in speech database and phoneme information optimization apparatus therefor
Khan et al. Singing Voice Synthesis Using HMM Based TTS and MusicXML
JPH10143196A (en) Method and device for synthesizing speech, and program recording medium
Phan et al. Extracting MFCC, F0 feature in Vietnamese HMM-based speech synthesis

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCF Information on status: patent grant

Free format text: PATENTED CASE

AS Assignment

Owner name: NUANCE COMMUNICATIONS, INC., MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:INTERNATIONAL BUSINESS MACHINES CORPORATION;REEL/FRAME:022354/0566

Effective date: 20081231

FPAY Fee payment

Year of fee payment: 4

FPAY Fee payment

Year of fee payment: 8

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553)

Year of fee payment: 12

AS Assignment

Owner name: CERENCE INC., MASSACHUSETTS

Free format text: INTELLECTUAL PROPERTY AGREEMENT;ASSIGNOR:NUANCE COMMUNICATIONS, INC.;REEL/FRAME:050836/0191

Effective date: 20190930

AS Assignment

Owner name: CERENCE OPERATING COMPANY, MASSACHUSETTS

Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE ASSIGNEE NAME PREVIOUSLY RECORDED AT REEL: 050836 FRAME: 0191. ASSIGNOR(S) HEREBY CONFIRMS THE INTELLECTUAL PROPERTY AGREEMENT;ASSIGNOR:NUANCE COMMUNICATIONS, INC.;REEL/FRAME:050871/0001

Effective date: 20190930

AS Assignment

Owner name: BARCLAYS BANK PLC, NEW YORK

Free format text: SECURITY AGREEMENT;ASSIGNOR:CERENCE OPERATING COMPANY;REEL/FRAME:050953/0133

Effective date: 20191001

AS Assignment

Owner name: CERENCE OPERATING COMPANY, MASSACHUSETTS

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:BARCLAYS BANK PLC;REEL/FRAME:052927/0335

Effective date: 20200612

AS Assignment

Owner name: WELLS FARGO BANK, N.A., NORTH CAROLINA

Free format text: SECURITY AGREEMENT;ASSIGNOR:CERENCE OPERATING COMPANY;REEL/FRAME:052935/0584

Effective date: 20200612

AS Assignment

Owner name: CERENCE OPERATING COMPANY, MASSACHUSETTS

Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE REPLACE THE CONVEYANCE DOCUMENT WITH THE NEW ASSIGNMENT PREVIOUSLY RECORDED AT REEL: 050836 FRAME: 0191. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT;ASSIGNOR:NUANCE COMMUNICATIONS, INC.;REEL/FRAME:059804/0186

Effective date: 20190930