ITU-T G.719: A New Low-Complexity Full-Band (20KHZ ... - Polycom

2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 18-21, 2009, New Paltz, NY 

ITU-T G.719: A NEW LOW-COMPLEXITY FULL-BAND (20 KHZ) AUDIO CODING 

STANDARD FOR HIGH-QUALITY CONVERSATIONAL APPLICATIONS 

Minjie Xie∗ and Peter Chu Anisse Taleb and Manuel Briand 

Polycom, Inc. 

10 Mall Road, #110, Burlington, MA 01803, USA 

Peter.Chu@polycom.com 

ABSTRACT 

This paper describes a new low-complexity full-band (20 kHz) 

audio coding algorithm which has been recently standardized 

by ITU-T as Recommendation G.719. The algorithm is 

designed to provide 20 Hz - 20 kHz audio bandwidth using a 48 

kHz sample rate, operating at 32 - 128 kbps. This codec 

features very high audio quality and low computational 

complexity and is suitable for use in applications such as 

videoconferencing, teleconferencing, and streaming audio over 

the Internet. Subjective test results from the Optimization/ 

Characterization phase of G.719 are also presented in the paper. 

Index Terms— Audio coding, full-band, low 

complexity, adaptive time-frequency transform, fast 

lattice vector quantization 

1. INTRODUCTION 

In hands-free videoconferencing and teleconferencing markets, 

there is strong and increasing demand for audio coding 

providing the full human auditory bandwidth of 20 Hz to 20 

kHz. This is because: 

• Conferencing systems are increasingly used for more 

elaborate presentations, often including music and sound 

effects (i.e. animal sounds, musical instruments, vehicles or 

nature sounds, etc.) which occupy a wider audio band than 

speech. Presentations involve remote education of music, 

playback of audio and video from DVDs and VCRs, 

audio/video clips from PCs, and elaborate audio-visual 

presentations from, for example, PowerPoint. 

• Users perceive the bandwidth of 20 Hz to 20 kHz as 

representing the ultimate goal for audio bandwidth. The 

resulting market pressures are causing a shift in this 

direction, now that sufficient IP bitrate and audio coding 

technology are available to deliver this. 

As with any audio codec for hands-free videoconferencing 

use, the requirements include: 

• Low latency (support natural conversation) 

• Low complexity (free cycles for video processing and other 

audio processing; reduce cost) 

• High quality on all signal types 

To meet the market need for such a full-band audio coding 

standard, ITU-T Q10/SG16 launched the standardization of 

Low-Complexity Full-Band Audio Coding Extension to G.722.1 

(G.722.1FB) in November 2006. The ITU-T G.722.1FB 

Ericsson Research – Multimedia Technologies 

Färögatan 6, Kista, Sweden 

{anisse.taleb, manuel.briand}@ericsson.com 

Qualification phase was completed in June 2007 and two 

candidates submitted respectively by Ericsson AB and Polycom, 

Inc. were qualified. In September 2007, the two proponents 

announced their collaboration in an Optimization/ 

Characterization phase and agreed to jointly develop a candidate 

codec. The joint candidate codec met all the requirements in the 

subjective quality tests for the G.722.1FB Optimization/ 

Characterization phase in April 2008. In June 2008, the joint 

codec was adopted as ITU-T Recommendation G.719 which is 

the first full-band audio coding standard of ITU-T [1]. 

In this paper we first give an overview of the ITU-T G.719 

codec, followed by a description of two important modules of 

the codec - adaptive time-frequency transform and quantization 

of transform coefficients. We also present the evaluation of the 

computational complexity and memory requirements of the 

codec. Finally, subjective test results from the Optimization/ 

Characterization phase are summarized. 

2. OVERVIEW OF THE G.719 CODEC 

The G.719 codec is a low-complexity transform-based audio 

codec and can provide an audio bandwidth of 20 Hz to 20 kHz 

at 32 - 128 kbps. The codec operates on frames of 20 ms 

corresponding to 960 samples at a sampling rate of 48 kHz and 

has an algorithmic delay of 40 ms. The codec features very high 

audio quality and extremely low computational complexity 

compared to other state-of-the-art audio coding algorithms. 

2.1 Encoder 

Input signal 

Fs = 48 kHz 

Transient 

detector 

Transform 

Norm 

estimation 

∗ Minjie Xie was with Polycom, Inc., when this work was done. He is now with Huawei Technologies (USA). 

978-1-4244-3679-8/09/$25.00 ©2009 IEEE 265 

Transient 

signaling 

Quantization 

indices 

Huffman 

code 

MUX 

Norm 

quantization 

and coding 

Spectrum 

normalization 

Norm 

adjustment 

Bit 

allocation 

FLVQ 

and coding 

Noise level 

adjustment 

Noise 

level 

adjustment 

index 

Figure 1: Block diagram of the G.719 encoder.


A block diagram of the G.719 encoder is shown in Figure 1. 

The input signal sampled at 48 kHz is processed through a 

transient detector. Depending on the detection of a transient, 

indicated by a flag IsTransient, a high frequency resolution or a 

low frequency resolution transform is applied on the input signal 

frame. The adaptive transform is based on a modified discrete 

cosine transform (MDCT) [2] in case of stationary frames. For 

transient frames, the MDCT is modified to obtain a higher 

temporal resolution without a need for additional delay and with 

very little overhead in complexity. Transient frames have a 

temporal resolution equivalent to 5 ms frames. 

The obtained transform coefficients are grouped into bands 

of unequal lengths as 8, 16, 24, or 32. As the bandwidth is 20 

kHz, only 800 transform coefficients are used. The 160 

transform coefficients representing frequencies above 20 kHz 

are ignored. The norm or power of each band is estimated and 

the resulting spectral envelope consisting of the norms of all 

bands is scalar quantized and encoded. The coefficients are 

then normalized by the quantized norms. The quantized norms 

are further adjusted based on adaptive spectral weighting and 

used as input for bit allocation. An adaptive bit-allocation 

scheme based on the quantized norms of the bands is used to 

assign the available bits in a frame among the bands. The 

number of bits assigned to each transform coefficient can be as 

large as 9 bits depending on the input signal. The normalized 

transform coefficients are lattice vector quantized according to 

the allocated bits for each band. Huffman coding is applied to 

the quantization indices for both the coded norms and transform 

coefficients. The norm of the non-coded transform coefficients 

is estimated, coded, and transmitted to the decoder. 

2.2 Decoder 

Transient signaling 

FLVQ indices 

FLVQ 

Decoding 

Noise level 

adjustment index 

Spectral-fill 

Generator 

DEMUX 

+ 

Norm indices 

Envelope 

Shaping 

Transient signaling 

Inverse 

Transform 

Figure 2: Block diagram of the G.719 decoder. 

Audio signal 

Fs = 48 kHz 

A block diagram of the decoder is shown in Figure 2. The 

transient flag is first decoded which indicates the frame 

configuration, i.e. stationary or transient. The spectral envelope 

is then decoded and the same, bit-exact, norm adjustment and 

bit-allocation algorithms are used at the decoder to re-compute 

the bit-allocation which is essential for decoding quantization 

indices of the normalized transform coefficients. 

After decoding the transform coefficients, the non-coded 

transform coefficients (allocated zero bits) in low frequencies 

are regenerated by using a spectral-fill codebook built from the 

decoded transform coefficients. The noise level adjustment 

index is used to adjust the level of the regenerated coefficients. 

The non-coded transform coefficients in high frequencies are 

regenerated using bandwidth extension. The decoded transform 

266 

coefficients and regenerated transform coefficients are mixed 

and lead to normalized spectrum. The decoded spectral 

envelope is applied to the decoded full-band spectrum. Finally, 

the inverse transform is applied to recover the time-domain 

decoded signal. This is performed by applying either the 

inverse MDCT for stationary mode, or the inverse of the higher 

temporal resolution transform for transient mode. 

3. ADAPTIVE TIME-FREQUENCY TRANSFORM 

The adaptive time-frequency transform is based on the detection 

of a transient. In the case of stationary signals, the transform 

has a high frequency resolution which is able to efficiently 

represent stationary sounds. In the case of transient sounds, the 

time-frequency transform will increase its time resolution and 

allows a better representation of the rapid changes in the input 

signal characteristics. These two modes of operations share a 

common buffering and windowing module and the switching 

between one mode of operation to the other is instantaneous. 

Thus no additional look-ahead in the transient detection is 

needed. This allows the codec to have selectable time resolution 

at low complexity and with zero additional delay. 

If the input signal is detected as stationary, a type IV 

discrete cosine transform (DCTIV) [2] is applied on the output 

~ 

x ( n) 

of the time domain aliasing (TDA) operation. The DCTIV 

of the time aliased signal is defined by the following equation: 

959 

⎡⎛ 

⎞⎛ 

⎞ ⎤ 

= 

~ 

1 1 π 

y( 

k) 

∑ x( 

n) 

cos⎢⎜ 

n + ⎟⎜ 

k + ⎟ ⎥ 

= ⎣⎝ 

2 ⎠⎝ 

2 ⎠ 960 

n 0 

⎦ 

, k = 0, 1, …, 959 (1) 

The resulting signal y(k) represents the transform coefficients of 

the input frame. It should be noted that in stationary mode, the 

cascade of windowing, TDA, and DCTIV is equivalent to 

applying the modulated lapped transform (MLT) [2]. 

If the signal is detected as a transient, a further re-ordering 

of the time-domain aliased signal is performed. The basis 

functions of the resulting filter-bank would have an incoherent 

time and frequency responses without re-ordering. The reordering 

operation consists of shuffling the upper and lower half 

of the TDA output signal 

~ 

x ( n) 

as follows: 

v( n) 

= 

~ 

x( 

959 − n) 

, n = 0, 1, …, 959. (2) 

This re-ordering is only conceptual and in reality no 

computations are involved. Higher time resolution is obtained 

by zero padding the signal v(n) and dividing the resulting signal 

into four overlapped equal length sub-frames. The amount of 

zero-padding is equal to 120 on each side of the signal. The 

segments are 50% overlapped and each segment has a length 

equal to 480. The two inner segments are post-windowed using 

a sine window of length 480. The windows for outer segments 

are constructed using half a sine window. 

Each resulting post-windowed segment is further processed 

by applying the MDCT, i.e. time aliasing followed by DCT IV. 

The output of the MDCT for each segment represents the signal 

spectrum at different time instants, thus in case of transients, a 

higher time resolution is used. The length of the output of each 

of the four MDCT is half the length of the input segment, i.e. 

240. Therefore, this operation does not introduce additional 

redundancy.


4. TRANSFORM COEFFICIENT QUANTIZATION 

Each band consists of one or more vectors of 8-dimensional 

transform coefficients and the coefficients are normalized by the 

quantized norm. All 8-dimensional vectors belonging to one 

band are assigned the same number of bits for quantization. A 

fast lattice vector quantization (FLVQ) scheme [3] is used to 

quantize the normalized coefficients in 8 dimensions. In FLVQ 

the quantizer comprises two sub-quantizers: a D8-based higherrate 

lattice vector quantizer (HRQ) and an RE8-based lower-rate 

lattice vector quantizer (LRQ). 

HRQ is a multi-rate quantizer designed to quantize the 

transform coefficients at rates of 2 up to 9 bit/coefficient and its 

codebook is based on the so-called Voronoi code for the D8 lattice [4]. D8 is a well-known lattice and defined as: 

D8 = {(y1, y2, y3, y4, y5, y6, y7, y8) ∈ Z8 | ∑ 

i= 

8 

1 

y i = even} (3) 

where Z 8 is the lattice which consists of all points with integer 

coordinates. It can be seen that D 8 consists of the points having 

integer coordinates with an even sum. The codebook of HRQ is 

constructed from a finite region of the D 8 lattice and is not 

stored in memory. The codewords are generated by a simple 

algebraic method and a fast quantization algorithm is used. 

To minimize the distortion for a given rate, the D 8 lattice 

should be truncated and scaled. Actually, the input vectors 

instead of the lattice codebook are scaled in order to use the fast 

searching and indexing algorithms proposed by Conway and 

Sloane [4]. However, the fast searching algorithm introduced in 

[4] assumes an infinite lattice which can not be used as the 

codebook in real-time audio coding systems. In other words, for 

a given rate the algorithm can not be used to quantize the input 

vectors lying outside the truncated lattice region. To resolve 

this problem, a fast method for quantizing “outliers” xo is 

developed and described as follows: 

1) Scale down the vector x o by 2: x o = x o / 2. 

2) Find the nearest D 8 point u to x o and then compute the 

index vector j of u using the indexing algorithm in [4]. 

3) Find the codeword y from the index vector j and then 

compare y with u. If y is different from u, repeat Steps 1) 

to 3). Otherwise, compute w = x o / 16. Note that due to 

the normalization of transform coefficients, a few 

iterations are required to find the best codeword. 

4) Compute x o = x o + w. 

5) Find the nearest D 8 point u to x o and then compute the 

index vector j of u using the indexing algorithm in [4]. 

6) Find the codeword y from the index vector j and then 

compare y with u. If y and u are exactly same, k = j and 

repeat Steps 4) to 6). Otherwise, k is the index of the best 

codeword to x o and stop. 

It has been observed that the 8-dimensional coefficient 

vectors have a high concentration of probability around the 

origin. Therefore, Huffman coding is optionally tried for the 

quantization indices of HRQ to increase efficiency of 

quantization when the rate is smaller than 6 bit/coefficient. 

LRQ is a 1-bit quantizer and uses spherical codes based on 

the rotated Gosset lattice RE8 [5] as the codebook. The RE8 lattice is defined as 

RE8 = 2 D8 U {2 D8 + (1, 1, 1, 1, 1, 1, 1, 1)} (4) 

267 

and consists of the points falling on concentric spheres of radius 

2 2r 

centered at the origin, where r = 0, 1, 2, ... . The set of 

points on a sphere constitutes a spherical code and can be used 

as a quantization codebook. The LRQ codebook of 256 

codewords is stored in a structured look-up table so that a fast 

searching method and a fast indexing method are developed to 

reduce the computational complexity [3]. 

5. COMPLEXITY OF THE G.719 CODEC 

The computational complexity of the G.719 codec in 16/32-bit 

fixed-point is estimated by encoding and decoding the source 

material used for the subjective tests of the G.719 Optimization/ 

Characterization phase. The observed average and worst-case 

complexity in units of WMOPS [6] are shown in Table 1 for 

different bitrates. Storage requirements are reported in Table 2. 

Table 1: Computational complexity of the G.719 codec. 

Bitrate 

(kbps) 

Encoder Decoder Encoder + Decoder 

Avg. Max Avg. Max Avg. Max 

32 6.663 7.996 6.876 7.413 13.539 15.397 

48 7.427 9.073 7.230 7.806 14.657 16.861 

64 7.912 9.899 7.554 8.161 15.466 18.060 

80 8.303 10.564 7.865 8.496 16.169 19.026 

96 8.555 10.796 8.122 8.757 16.677 19.536 

112 8.787 10.881 8.368 9.009 17.155 19.890 

128 8.980 11.775 8.610 9.225 17.590 21.000 

Table 2: Memory usage of the G.719 codec. 

Memory type Encoder Decoder Encoder + Decoder 

Static RAM 1 1.0 3.9 4.9 

Scratch RAM 1 12.2 12.2 24.4 

Data ROM 1 8.3 8.9 10.7 

Program ROM 2 1.2 1.2 1.8 

1 In units of 16-b kWords 2 In units of 1000’s basic operators 

6. SUBJECTIVE PERFORMANCE OF G.719 

Subjective tests for the ITU-T G.719 (ex. G.722.1FB) 

Optimization/Characterization phase were performed from mid 

February through early April 2008 by independent listening 

laboratories in American English, French, and Spanish. 

According to a test plan designed by ITU-T Q7/SG12 experts, 

the joint candidate codec conducted two experiments as follows: 

• Experiment 1: Speech (clean, reverberant, and noisy) 

• Experiment 2: Mixed content and music 

Mixed content items are representative of advertisement, film 

trailers, news with jingles, music with announcements, etc., and 

contain speech, music, and noise. Each experiment used the 

“triple stimulus/hidden reference/double blind test method” 

described in ITU-R Recommendation BS.1116-1. A standard 

MPEG audio codec, LAME MP3 version 3.97 as found on the 

LAME website as of September 2006 [7] was used as the 

reference codec in the subjective tests. The ITU-T requirement


was that the G.719 candidate codec at 32, 48, and 64kbps be 

proven “Not Worse Than” the reference codec at 40, 56, and 64 

kbps, respectively, with a 95% statistical confidence level. In 

addition, the G.719 candidate codec at 64 kbps was also tested 

against the G.722.1C codecs at 48 kbps for Experiment 2. 

The subjective test results for the G.719 codec are shown in 

Figures 3-6. Statistical analysis of the results showed that the 

G.719 codec met all performance requirements specified for the 

subjective Optimization/Characterization test. For experiment 1 

the G.719 codec was better than the reference codec at all bit 

rates in both languages. For experiment 2 the G.719 codec is 

better than the reference codec at the lowest bit rate for all the 

items and at the two other bitrates for most of the items. 

An additional subjective listening test for the G.719 codec 

was conducted later to evaluate the quality of the codec at rates 

higher than those described in the ITU-T test plan. Because the 

quality expectation of the codec at these high rates is high, a 

pre-selection of critical items, for which the quality at the lower 

bitrate range was most degraded, was conducted prior to testing. 

The test results are shown in Figure 7. It has been proven that 

transparency was reached for critical material at 128 kbps. 

DMOS 

5.0 

4.5 

4.0 

3.5 

3.0 

2.5 

2.0 

1.5 

1.0 

Ref 40kbps 

G.719 32kbps 

Ref 56kbps 


Ref 64kbps 

Speech (North American English) 


Ref 40kbps 


Ref 56kbps 


Ref 64kbps 

Clean Reverberant Reverberant and Noisy 

Figure 3: Subjective test results – Experiment 1 (English). 

DMOS 

5.0 

4.5 

4.0 

3.5 

3.0 

2.5 

2.0 

1.5 

1.0 

Ref 40kbps 


Ref 56kbps 


Ref 64kbps 


Speech (French) 

Ref 40kbps 


Ref 56kbps 


Ref 64kbps 

Clean Reverberant Reverberant and Noisy 

Figure 4: Subjective test results – Experiment 1 (French). 

DMOS 

5.0 

4.5 

4.0 

3.5 

3.0 

2.5 

2.0 

1.5 

1.0 

Ref 

40kbps 



Ref 40kbps 

Ref 40kbps 

Mixed-content and Music (North American English) 

G.719 

32kbps 

Ref 

56kbps 


48kbps 

Ref 

64kbps 


64kbps 



Ref 56kbps 

Ref 56kbps 



G.722.1C 

48kbps 

Ref 64kbps 

Ref 64kbps 




64kbps 

Figure 5: Subjective test results – Experiment 2 (English). 

268 

DMOS 

5.0 

4.5 

4.0 

3.5 

3.0 

2.5 

2.0 

1.5 

1.0 

Ref 

40kbps 


32kbps 

Mixed-content and Music (Spanish) 

Ref 

56kbps 


48kbps 

Ref 

64kbps 


64kbps 

G.722.1C 

48kbps 


64kbps 

Figure 6: Subjective test results – Experiment 2 (Spanish). 

5.5 

5.0 

4.5 

4.0 

3.5 

3.0 

2.5 

2.0 

1.5 

1.0 

Source 


32kbps 

Source 

Additional Subjective Test Results (DMOS) 


48kbps 

Source 


64kbps 

Source 


80kbps 

Source 


96kbps 

Source 


112kbps 

Figure 7: Additional subjective test results. 

7. CONCLUSION 

This paper has presented the 20 kHz full-band audio coding 

algorithm and subjective Optimization/Characterization test 

results of ITU-T Recommendation G.719. The G.719 codec 

features very high audio quality, extremely low computational 

complexity, and low algorithmic delay compared to other stateof-the-art 

audio coding codecs. Subjective test results show that 

the G.719 codec achieves transparent audio quality at 128 kbps. 

The main intended applications for this codec are 

videoconferencing and teleconferencing, as well as Internet 

streaming. 

8. REFERENCES 

[1] ITU-T Recommendation G.719, “Low-complexity fullband 

audio coding for high-quality conversational 

applications,” June 2008. 

[2] H. S. Malvar, Signal Processing with Lapped Transforms. 

Norwood, MA: Artech House, 1992. 

[3] M. Xie, “Lattice Vector Quantization and its Applications 

in Speech and Audio Coding,” in Speech Processing in 

Modern Communication: Challenges and Perspectives. 

Springer, To appear. 

[4] J. H. Conway and N. J. A. Sloane, Sphere Packings, 

Lattices and Groups. Springer-Verlag, New York, 1988. 

[5] C. Lamblin and J.-P. Adoul, “Algorithme de quantification 

vectorielle sphérique à partir du réseau de Gosset d’ordre 

8,” Annales des Télécommunications, pp. 172-186, 1988. 

[6] ITU-T Recommendation G.191, “Software tools for speech 

and audio coding standardization,” Sept. 2005. 

[7] http://sourceforge.net/project/showfiles.php?group_id=290 

&package_id=309. 

Source 


128kbps

ITU-T G.719: A New Low-Complexity Full-Band (20KHZ ... - Polycom

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?