# Full text of "NASA Technical Reports Server (NTRS) 19880016369: The design and performance of a real-time self excited vocoder"

N88-25753

The Design and Performance of a Real-Time Self Excited Vocoder¹

Richard C. Rose, T. P. Barnwell III, S. McGrath

Georgia Institute of Technology, School of Electrical Engineering, Digital Signal Processing Laboratory, Atlanta, Georgia 30332, United States

## Abstract

This paper is concerned with a generic class of predictive speech coders that includes the newly proposed Self Excited Vocoder (SEV) [5] and the well-known Code Excited Linear Predictive Coder (CELPC) [6]. All members of this class form an excitation sequence for a linear predictive model filter using the same general model for the excitation signal. The general excitation model is based on a block coding technique in which each sequence is drawn from an ensemble of sequences. This paper reports on two developments related to this general model. The first is a new type of excitation ensemble that can, in general, be populated by many different types of sequences. The second is a means of populating this new type of ensemble based on a vector quantizer design procedure using a new distortion measure.

## 1 Introduction

A general model for the excitation signal in linear predictive speech coders was originally presented in [5]. Formal subjective tests, summarized in [4], characterized the performance of selected coders in this general class of predictive speech coders. A Self Excited Vocoder has been implemented in real time on a single circuit board using AT&T DSP32 floating-point digital signal processing devices [1]. This implementation will serve as a prototype vocoder in the NASA-sponsored Mobile Satellite Communications Project.

This paper presents a new approach to the excitation modeling problem in self excited and code excited vocoders. The paper begins by reviewing the general model for the excitation signal in this class of predictive speech coders, and introduces a new type of excitation ensemble.
Then a new procedure for populating the excitation ensemble, based on an iterative vector quantizer design algorithm, is discussed. Finally, in the last section, a new distance measure for the vector quantization procedure is introduced.

## 2 A New Class of Excitation Ensembles

The general model for the excitation signal in this class of coders is described by the block diagram in Figure 1a. The excitation signal, $e[n]$, is a linear combination of component excitation sequences, $e_k[n]$, where the $k$th sequence is chosen from the associated excitation ensemble, $\mathcal{F}_k$. An excitation ensemble is simply a collection of discrete functions, $f_\gamma[n]$, indexed in sample space by $\gamma$ and indexed in time by $n$. The optimum ensemble index, $\gamma_k$, and gain, $\beta_k$, associated with the $k$th excitation sequence are found by exhaustively searching through the excitation ensemble, $\mathcal{F}_k$, for the ensemble function that minimizes a weighted mean squared error [5].

¹ This work was supported by the Jet Propulsion Laboratory under contract number 957074.

Figure 1: a) Model of the excitation signal for a generic class of predictive speech coders. b) Ensemble search interpretation of a single-tap long-term predictor.

Examples of some existing predictive coders can be identified if the excitation ensemble is constrained to contain a particular class of sequences. For example, the well-known CELPC chooses an optimum excitation sequence from a stochastic ensemble, where each ensemble sequence is populated by Gaussian random variates [6]. Figure 1b shows how a simple long-term predictor can be interpreted as a time-varying excitation ensemble. In this case, the ensemble is the memory of a long-term predictor whose predictor delay can vary over the expected range of a pitch period in speech. Each ensemble sequence is formed by sliding an $N$-point rectangular window along the memory of the long-term predictor.
The optimum ensemble sequence corresponds to an $N$-point sequence beginning at sample $-\gamma$ in the memory of the long-term predictor. This type of ensemble, referred to here as the "self excitation" ensemble, forms the basis for the SEV. After a brief period of initialization, the SEV derives its excitation signal, $e[n] = \beta e[n - \gamma]$, solely from this type of ensemble.

The flexibility of the most general model of the excitation signal derives from the fact that it imposes no structure on the functions contained in the excitation ensemble. From the model definition, there is no fundamental requirement that an excitation ensemble be homogeneous. Thus, a single excitation ensemble can contain more than one class of sequences. For example, an ensemble can be formed by combining a set of time-varying sequences chosen from the memory of a long-term predictor with a set of fixed Gaussian random sequences. Figure 2a illustrates an interpretation of a simple coder whose excitation is derived from this type of ensemble. While the figure suggests that a hard classification procedure is taking place, this is not the case. The ensemble search procedure chooses a single sequence from the entire ensemble, so the class of sequence that is used is determined simply by which single sequence results in the least measured distortion. This type of excitation ensemble will be referred to as a nonhomogeneous ensemble, and can, in general, contain many different classes of sequences.

The particular ensemble illustrated by the block diagram in Figure 2a is described by the excitation signal $e[n] = \beta z_\gamma[n]$, where

$$
z_\gamma[n] =
\begin{cases}
v_\gamma[n], & 1 \le \gamma \le C \\
e[n - \gamma], & C < \gamma \le \Gamma
\end{cases}
\tag{1}
$$

and the fixed sequences, $v_\gamma[n]$, may be populated in many different ways.

Figure 2: a) An interpretation of a simple nonhomogeneous predictive speech coder. b) Block diagram illustrating a procedure for determining the fixed ensemble sequences in a nonhomogeneous excitation ensemble.
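
The single search over a nonhomogeneous ensemble can be illustrated with a small sketch. It is not the authors' implementation: the function names, the tuple layout, and the toy data are hypothetical, and the perceptual weighting filter is omitted (identity weighting is assumed), so the error below is a plain squared error with the least-squares gain for each candidate.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def build_ensemble(memory, fixed, n_frame, delays):
    """Collect candidate sequences from both classes into one ensemble.

    memory : past excitation samples, most recent sample last
    fixed  : list of fixed N-point sequences (e.g. Gaussian or trained)
    delays : candidate predictor delays, n_frame <= d <= len(memory)
    """
    ensemble = [("fixed", idx, list(seq)) for idx, seq in enumerate(fixed)]
    for d in delays:
        start = len(memory) - d             # sample -d relative to frame start
        ensemble.append(("self", d, memory[start:start + n_frame]))
    return ensemble


def search_ensemble(target, ensemble):
    """Exhaustive search: return (error, kind, index, gain) of the best entry."""
    best = None
    target_energy = dot(target, target)
    for kind, index, seq in ensemble:
        energy = dot(seq, seq)
        if energy == 0.0:
            continue                        # an all-zero sequence cannot help
        corr = dot(target, seq)
        gain = corr / energy                # least-squares optimal gain
        err = target_energy - gain * corr   # squared error at that gain
        if best is None or err < best[0]:
            best = (err, kind, index, gain)
    return best


# Toy data (assumed): an 8-sample excitation memory and two fixed sequences.
memory = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0]
fixed = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, 1.0, 1.0]]
ensemble = build_ensemble(memory, fixed, n_frame=4, delays=range(4, 9))

pulse_like = search_ensemble([0.0, 0.0, 3.0, 0.0], ensemble)
flat = search_ensemble([2.0, 2.0, 2.0, 2.0], ensemble)
```

In this toy case the pulse-like target is matched by a self excitation sequence and the flat target by a fixed sequence, without any explicit classification step: the class falls out of the minimum-error choice.
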
## 3 Populating Nonhomogeneous Ensembles

This section describes a technique for determining the fixed sequences, $v_\gamma[n]$, in Equation 1 using the vector quantizer design procedure of Linde et al. [2]. Following the reasoning of Davidson et al., the distance measure used for the vector quantization procedure can be the same weighted mean squared distance used for coding the excitation signal in this class of coders [7]. The following discussion describes the vector quantizer design procedure as it applies to populating the fixed sequences of the nonhomogeneous excitation ensemble.

The generalized Lloyd algorithm, originally introduced in [2], is an iterative algorithm for designing an optimum vector quantizer by a method of successive approximation. The vector quantizer design procedure determines the fixed sequences, $v_\gamma[n]$, of Equation 1 from the training vectors, $z_i$, $i = 1, \ldots, n$, derived from the original speech. At each iteration of the algorithm the training vectors are partitioned into clusters, and cluster centroids are computed based on the partitioning of the data. The splitting algorithm of Linde et al. is used here to provide the initial cluster centroids. The cluster centroids that exist upon termination of the algorithm form the resulting excitation ensemble.

Figure 3 is a block diagram illustrating the computation of the distortion, $d(z_i, v_\gamma)$, that is used for the vector quantizer data set partitioning and clustering procedures. For each excitation analysis frame, $i$, the coder represents the residual vector, $r_i$, with an ensemble vector, $v_\gamma$. The coder also computes the short-term predictor, $A_i(z)$, and the excitation gain, $\beta_i$. The Atal LPC-based weighting filter, $W_i(z)$ [6], is used to compute the weighted Euclidean distance.
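
The partition-then-recompute loop of the design procedure can be sketched as follows. This is a minimal illustration, not the procedure used in the paper: the names are hypothetical, the initial codebook is passed in directly instead of being produced by the splitting algorithm, and a plain squared Euclidean distance with a mean centroid stands in for the weighted measure. The `distance` and `centroid` arguments are left pluggable so that another measure could be substituted.

```python
def squared_distance(z, v):
    """Plain squared Euclidean distance (stand-in for the weighted measure)."""
    return sum((a - b) ** 2 for a, b in zip(z, v))


def mean_vector(cluster):
    """Centroid under squared Euclidean distance: the component-wise mean."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def lloyd(training, codebook, distance, centroid, max_iter=50, tol=1e-9):
    """Alternate nearest-codeword partitioning and centroid updates."""
    previous = float("inf")
    for _ in range(max_iter):
        clusters = [[] for _ in codebook]
        total = 0.0
        for z in training:
            dists = [distance(z, v) for v in codebook]
            j = dists.index(min(dists))      # nearest codeword
            clusters[j].append(z)
            total += dists[j]
        # Recompute centroids; an empty cluster keeps its old codeword.
        codebook = [centroid(c) if c else v
                    for c, v in zip(clusters, codebook)]
        average = total / len(training)
        if previous - average <= tol:        # distortion stopped improving
            break
        previous = average
    return codebook


training = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
trained = lloyd(training, [[0.0, 0.0], [10.0, 10.0]],
                squared_distance, mean_vector)
```

With the toy data above, the two codewords settle at the means of the two obvious clusters.
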
The distance between training vector $z_i$ and ensemble vector $v_\gamma$ can be expressed as

$$
d(z_i, v_\gamma) = \sum_{n=0}^{N+L-2} \left( y_i[n] - \beta_i \sum_{l=0}^{N-1} v_\gamma[l]\, h_i[n - l] \right)^2, \tag{2}
$$

where $h_i[n]$ is a finite-length impulse response approximation to the cascaded synthesis and error weighting filters in Figure 3. The length of this impulse response is approximated as $L$ samples ($L \approx 10$).

The distance calculation in Equation 2 suggests the form of the training data required for each excitation frame. To compute this distance for the $i$th excitation frame, the weighted speech $y_i$, the impulse response $h_i$, and the ensemble gain $\beta_i$ must all be derived from the input speech. The form of each training vector is then given as $z_i = (y_i, h_i, \beta_i)$. Therefore, the training data is derived from the original speech using the predictive speech coder itself. The specification of an initial excitation ensemble for this coder is necessary for the generation of the training data.

Figure 3: Block diagram illustrating the distortion measure computation for Equation 2. (Blocks: residual sequence, ensemble sequence, short-term predictor, error weighting $W_i(z)$, initial filter state; output $d(z_i, v_\gamma)$.)

In this research, a nonhomogeneous ensemble was generally divided into self excitation sequences and alternate sequences. The procedure for populating the alternate sequences of the nonhomogeneous ensemble shown in Figure 2a is illustrated by the block diagram shown in Figure 2b. The procedure begins by generating the training data, $z_i$, using a predetermined set of signals for the alternate sequences. In this research, the alternate sequences are populated by independent Gaussian random variates. Once the training data has been generated, a classification procedure is used to select a subset of the training data to be used as input to the vector quantizer design procedure.
This classification procedure simply chooses those training vectors for which the predictive speech coder provides a poor representation of the original speech. Finally, the vector quantizer design procedure produces sequences that are used to populate the alternate sequences in the nonhomogeneous ensemble. This procedure is described in the next section.

## 4 A New Vector Quantizer Distortion Measure

This section describes a new distance measure for use in the iterative excitation vector quantization procedure. The new distance measure follows immediately from Equation 2, and results in circularly defined excitation ensemble sequences. The discussion is broken into three parts. First, the centroid calculation following from the weighted Euclidean distance measure of Equation 2 is described. Second, the shortcomings of this distance measure when applied to vector quantization of the excitation signal are discussed. Finally, the new distance measure is introduced.

Determining the centroid for a given cluster of training vectors corresponds to finding the sequence, $v_\gamma$, that minimizes an average distortion for the distance measure in Equation 2. This average distortion represents an average over all of the training vectors belonging to the cluster. Minimizing the average error for a cluster containing $M$ training vectors with respect to $v_\gamma[k]$, $k = 0, \ldots, N - 1$, yields the matrix equation

$$
\sum_{i=1}^{M} q_i = \sum_{i=1}^{M} R_i\, v_\gamma. \tag{3}
$$

The vector $q_i$ is an $N$-length vector whose $k$th element is the crosscorrelation between the weighted speech and the impulse response for excitation frame $i$,

$$
q_i[k] = \beta_i \sum_{n=0}^{N+L-2} y_i[n]\, h_i[n - k], \qquad k = 0, \ldots, N - 1. \tag{4}
$$

The matrix $R_i$ is an $N \times N$ Toeplitz matrix whose element in the $l$th row and $k$th column is determined by the impulse response for excitation frame $i$,

$$
R_i[l, k] = \beta_i^2 \sum_{n=0}^{N+L-2} h_i[n - l]\, h_i[n - k]. \tag{5}
$$

The matrix order, $N$, corresponds to the length of the excitation analysis frame, which is typically about twenty samples.
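
As a check on the centroid equations, the sketch below builds the crosscorrelation vector and correlation matrix for each training vector, sums them over the cluster, and solves the resulting linear system. It is an illustration under stated assumptions: the names are hypothetical, each training vector is taken to be a tuple `(y, h, beta)` with `len(y) == N + len(h) - 1`, and plain Gaussian elimination is used instead of a Toeplitz-specific solver.

```python
def convolve(v, h):
    """Full linear convolution of v and h."""
    out = [0.0] * (len(v) + len(h) - 1)
    for l, vl in enumerate(v):
        for m, hm in enumerate(h):
            out[l + m] += vl * hm
    return out


def distortion(z, v):
    """Squared error between weighted speech and the scaled, filtered v."""
    y, h, beta = z
    return sum((yn - beta * sn) ** 2 for yn, sn in zip(y, convolve(v, h)))


def cross_correlation(z, n_frame):
    """q[k] = beta * sum_n y[n] h[n - k], written with m = n - k."""
    y, h, beta = z
    return [beta * sum(y[k + m] * hm for m, hm in enumerate(h))
            for k in range(n_frame)]


def correlation_matrix(z, n_frame):
    """R[l][k] = beta^2 * r_h(|l - k|), a symmetric Toeplitz matrix."""
    _, h, beta = z
    r = [sum(h[m] * h[m + lag] for m in range(len(h) - lag))
         for lag in range(len(h))]
    return [[beta * beta * (r[abs(l - k)] if abs(l - k) < len(h) else 0.0)
             for k in range(n_frame)] for l in range(n_frame)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def centroid(cluster, n_frame):
    """Sum q_i and R_i over the cluster, then solve for the centroid."""
    q = [0.0] * n_frame
    big_r = [[0.0] * n_frame for _ in range(n_frame)]
    for z in cluster:
        q_i = cross_correlation(z, n_frame)
        r_i = correlation_matrix(z, n_frame)
        for l in range(n_frame):
            q[l] += q_i[l]
            for k in range(n_frame):
                big_r[l][k] += r_i[l][k]
    return solve(big_r, q)


# Synthetic cluster (assumed data): every y is generated from the same v_true.
v_true = [1.0, -2.0, 0.5, 0.0]
cluster = []
for h, beta in [([1.0, 0.5], 2.0), ([1.0, -0.3, 0.2], 0.5)]:
    cluster.append(([beta * s for s in convolve(v_true, h)], h, beta))
v_hat = centroid(cluster, n_frame=4)
```

Because every training vector here was synthesized from the same sequence, the solved centroid should reproduce that sequence and give near-zero distortion. A production implementation would more likely use a Levinson-type solver that exploits the Toeplitz structure.
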
Because the matrix order is only about twenty, computing the cluster centroid is not a computationally expensive procedure, requiring only the solution of a twentieth-order Toeplitz matrix equation.

A major shortcoming of the above algorithm concerns the weighted Euclidean distance given in Equation 2. By this measure, the distance between two training vectors that represent very similar excitation signals may actually be very large. This is because the excitation analysis window is placed asynchronously with respect to any significant events that may occur in the excitation signal.

About 24,000 training vectors derived from isolated words uttered by a single speaker were used as training data for this algorithm. Ensemble sequences derived from this algorithm were used to code a short utterance from the same speaker. There was a significant improvement in segmental signal-to-noise ratio using this new ensemble over that of a Gaussian ensemble. However, the improvement in subjective performance was not significant when judged by the authors in informal listening tests.

A modification to this procedure is proposed here that reduces the dependency of the training vectors on the position of the associated excitation analysis frame. The modification to the design procedure results in a redefinition of the distance measure and centroid calculation of the vector training algorithm. The modification is based on simple permutations of the weighted speech that is used to form the training vector $z_i$. The vector-valued permutation $\pi_k$ is a $k$-sample circular right shift,

$$
\pi_k(y) = \left( y[N - k],\, y[N - k + 1],\, \ldots,\, y[0],\, y[1],\, \ldots,\, y[N - k - 1] \right). \tag{6}
$$

By applying one of the permutations, $\{\pi_k : k = 0, \ldots, N - 1\}$, to the training data, similar events occurring in different excitation frames may be aligned in time. The distance measure and centroid calculation can be modified to exploit this behavior.
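
The effect of the circular right shift on frame alignment can be seen in a small sketch. The helper names are hypothetical, and a plain squared Euclidean distance is used as a stand-in for the weighted distance of Equation 2, purely to show how searching over shifts aligns a similar event that falls at different positions in two frames.

```python
def circ_right_shift(y, k):
    """k-sample circular right shift: (y[N-k], ..., y[N-1], y[0], ..., y[N-k-1])."""
    n = len(y)
    k %= n
    return y[n - k:] + y[:n - k]


def squared_distance(a, b):
    """Plain squared Euclidean distance (stand-in for the weighted measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def min_shift_distance(y, ref):
    """Smallest distance over all circular shifts of y, with the best shift."""
    candidates = [(squared_distance(circ_right_shift(y, k), ref), k)
                  for k in range(len(y))]
    return min(candidates)


# Two frames holding the same pulse at different positions (assumed data).
frame_a = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
frame_b = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]

unaligned = squared_distance(frame_a, frame_b)
aligned, best_k = min_shift_distance(frame_a, frame_b)
```

Without alignment the two frames look maximally different for a single pulse, while a two-sample shift makes them identical.
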
To exploit this alignment, the $k$th permutation of the $i$th training vector is first defined as $\pi_k(z_i) = (\pi_k(y_i), h_i, \beta_i)$. The weighted Euclidean distance of Equation 2 is then restated as

$$
d(z_i, v_\gamma) = \min_{k = 0, \ldots, N-1} d\left( \pi_k(z_i), v_\gamma \right), \tag{7}
$$

where the distance on the right-hand side is that of Equation 2. The distance between a training vector and a cluster centroid is therefore defined as the minimum weighted Euclidean distance across all possible permutations of the input data. Having found the optimum partition by minimizing the average distortion, the centroid vector, $v_\gamma$, for centroid $\gamma$ can be determined by solving the matrix equation

$$
\sum_{i=1}^{M} \pi_{k_i}(q_i) = \sum_{i=1}^{M} R_i\, v_\gamma. \tag{8}
$$

In this equation, $\pi_{k_i}$ is the optimum permutation for training vector $z_i$, and $M$ is the total number of training vectors in the cluster.

## 5 Conclusions

This paper has introduced the nonhomogeneous excitation ensemble as a new type of ensemble used in a generic class of predictive speech coders. An iterative vector quantization procedure using a newly defined distance measure has been discussed as a means for populating the sequences for a specific nonhomogeneous ensemble. In this new procedure, the optimum choice of ensemble sequence is less dependent on the alignment of the excitation analysis frame with the original speech waveform. The procedure involves applying a set of circular permutations to the training data in order to time-align similar events in different training vectors.

The ensemble search procedure for this newly defined ensemble involves an exhaustive search, computing the weighted mean squared coding error for each circular permutation of each $N$-point ensemble sequence. This is essentially equivalent to increasing the number of sequences in the ensemble from $\Gamma$ sequences to $\Gamma N$ sequences. However, the number of operations required to search this ensemble can be considerably reduced by using the recursive ensemble search procedure introduced in [3].

## References

[1] T. P. Barnwell III, R. C. Rose, and S. McGrath. A real-time implementation of a 4800 bps
self excited vocoder using the AT&T WE-DSP32 signal processing microprocessor. Proc. Speech Technology Conf., Apr. 1987.

[2] Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantizer design. IEEE Trans. Commun., COM-28:84-95, Jan. 1980.

[3] R. C. Rose and T. P. Barnwell III. The design and performance of an effective class of predictive speech coders. Submitted for review, IEEE Trans. Acoust., Speech, Signal Processing, June 1987.

[4] R. C. Rose and T. P. Barnwell III. Quality comparison of low complexity 4800 bps self excited and code excited vocoders. Proc. Inter. Conf. on Acoustics, Speech, and Signal Proc., April 1987.

[5] R. C. Rose and T. P. Barnwell III. The self excited vocoder: an alternate approach to toll quality at 4800 bps. Proc. Inter. Conf. on Acoustics, Speech, and Signal Proc., 453-456, April 1986.

[6] M. R. Schroeder and B. S. Atal. Code excited linear prediction: high quality speech at very low bit rates. Proc. Inter. Conf. on Acoustics, Speech, and Signal Proc., 937-940, April 1985.

[7] G. Davidson, M. Yong, and A. Gersho. Real-time vector excitation coding of speech at 4800 bps. Proc. Inter. Conf. Acoust., Speech, Sig. Processing, 51.4.1-51.4.4, 1987.