CN104022967A - Voice decoding apparatus

Info

Publication number: CN104022967A
Application number: CN201410058259.1A
Authority: CN
Inventors: 伏见涉; 铃木茂明; 山浦正
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2013-02-28
Filing date: 2014-02-20
Publication date: 2014-09-03
Also published as: KR101516113B1; TW201434039A; JP2014167525A; KR20140108119A

Abstract

An audio decoding device capable of reducing deterioration in call quality even when silent compression is applied. The device includes: a jitter absorbing buffer that temporarily accumulates received packets and outputs them at a predetermined output timing; a background noise generation unit that generates audio data of background noise based on background noise data contained in packets output from the jitter absorbing buffer; an audio decoding unit that decodes encoded audio data contained in packets output from the jitter absorbing buffer to generate audio data of speech; a speech rate conversion unit that performs speech rate conversion, which converts the playback speed of the audio data decoded by the audio decoding unit; and a control unit that, based on the accumulation status of packets in the jitter absorbing buffer, controls the time length of the background noise generated by the background noise generation unit and controls the playback speed converted by the speech rate conversion unit.

Description

Audio decoding device

Technical field

The present invention relates to an audio decoding device that decodes encoded audio and is used in Internet telephony and similar applications.

Background art

Voice calls such as Internet telephony are carried out by encoding speech, forming it into packets, and sending and receiving the packets over a network. In packet communication, the interval at which packets are received is usually not constant, and the reception interval often varies (jitter). One technique for absorbing such jitter and continuously outputting the decoded audio obtained by decoding the audio codes contained in received packets is described, for example, in Patent Document 1.

In the technique described in Patent Document 1, the playback speed is increased or decreased according to the amount of received packets accumulated in a jitter absorbing buffer that temporarily stores received packets. This keeps the accumulated amount in the jitter absorbing buffer at an appropriate level while the decoded audio is output continuously. Compared with keeping the accumulated amount at an appropriate level by discarding or duplicating received packets in the jitter absorbing buffer, this reduces the degradation of audio quality.

[Patent Document 1] Japanese Patent No. 3796240

However, control in conventional audio decoding devices presupposes that packets, produced by encoding and packetizing speech at fixed time intervals, are stored in the jitter absorbing buffer at the position corresponding to each packet's packet number. Consequently, in a system that applies silent compression, where packets are not necessarily sent at fixed time intervals (for example, the sending interval becomes longer during silent periods), appropriate processing cannot be performed and call quality deteriorates.

Summary of the invention

The present invention was made to solve the problem described above, and its object is to obtain an audio decoding device capable of reducing deterioration in call quality even when silent compression is applied.

The audio decoding device of the present invention includes: a jitter absorbing buffer that temporarily accumulates received packets and outputs them at a predetermined output timing; a background noise generation unit that generates audio data of background noise based on background noise data contained in packets output from the jitter absorbing buffer; an audio decoding unit that decodes encoded audio data contained in packets output from the jitter absorbing buffer to generate audio data of speech; a speech rate conversion unit that performs speech rate conversion, which converts the playback speed of the audio data decoded by the audio decoding unit; and a control unit that, based on the accumulation status of packets in the jitter absorbing buffer, controls the time length of the background noise generated by the background noise generation unit and controls the playback speed converted by the speech rate conversion unit.

According to the present invention, the device includes: a jitter absorbing buffer that temporarily accumulates received packets and outputs them at a predetermined output timing; a background noise generation unit that generates audio data of background noise based on background noise data contained in packets output from the jitter absorbing buffer; an audio decoding unit that decodes encoded audio data contained in packets output from the jitter absorbing buffer to generate audio data of speech; a speech rate conversion unit that performs speech rate conversion, which converts the playback speed of the audio data decoded by the audio decoding unit; and a control unit that, based on the accumulation status of packets in the jitter absorbing buffer, controls the time length of the background noise generated by the background noise generation unit and controls the playback speed converted by the speech rate conversion unit. As a result, deterioration of call quality can be prevented even when silent compression is applied.

Brief description of the drawings

FIG. 1 is a functional block diagram of an audio decoding device according to Embodiment 1 of the present invention.

FIG. 2 is an explanatory diagram showing the relationship between the timestamps of packets and their accumulation in the jitter absorbing buffer.

FIG. 3 is a functional block diagram of an audio decoding device according to Embodiment 2 of the present invention.

FIG. 4 is a functional block diagram of an audio decoding device according to Embodiment 3 of the present invention.

FIG. 5 is a functional block diagram of an audio decoding device according to Embodiment 4 of the present invention.

FIG. 6 is a functional block diagram of an audio decoding device according to Embodiment 5 of the present invention.

FIG. 7 is an explanatory diagram showing the relationship between the timestamps of packets and their accumulation in the jitter absorbing buffer.

Description of reference numerals

1: jitter absorbing buffer; 2: background noise generation unit; 3: audio decoding unit; 4: speech rate conversion unit; 5: output buffer; 6: output buffer monitoring unit; 7: control unit; 71: buffer remaining amount monitoring unit; 72: control signal output unit; 73: arrival speed monitoring unit; 8: high-precision silent compression unit; 9: audio detection unit; 10: audio encoding unit; 11: silent compression control unit; 12: background noise data detection/insertion unit; 20: audio decoding device; 21: audio encoding device.

Detailed description of embodiments

Embodiments of the present invention are described below. The following embodiments are examples of the present invention, and the present invention is not limited to them.

Embodiment 1.

FIG. 1 is a functional block diagram showing an audio decoding device according to one embodiment of the present invention.

In FIG. 1, the jitter absorbing buffer 1 temporarily accumulates received packets and outputs them at a predetermined output timing. The background noise generation unit 2 generates audio data of background noise based on the background noise data contained in packets output from the jitter absorbing buffer 1. The audio decoding unit 3 decodes the encoded audio data contained in packets output from the jitter absorbing buffer 1 to generate audio data of speech. The speech rate conversion unit 4 performs speech rate conversion, which converts the playback speed of the audio data decoded by the audio decoding unit 3. The output buffer 5 temporarily accumulates the audio data of background noise generated by the background noise generation unit 2 and the audio data of speech generated by the audio decoding unit 3. The output buffer monitoring unit 6 monitors the amount of audio data accumulated in the output buffer 5 and, according to that amount, instructs the jitter absorbing buffer 1 on the output timing of the temporarily accumulated packets. The control unit 7, based on the accumulation status of packets in the jitter absorbing buffer 1, controls the time length of the background noise generated by the background noise generation unit 2 and controls the playback speed converted by the speech rate conversion unit 4.

In the present embodiment, the control unit 7 has a buffer remaining amount monitoring unit 71 and a control signal output unit 72. The buffer remaining amount monitoring unit 71 monitors the remaining amount of the jitter absorbing buffer 1 as the accumulation status of packets in the jitter absorbing buffer 1. Based on the remaining amount monitored by the buffer remaining amount monitoring unit 71, the control signal output unit 72 outputs a time length control signal, which controls the time length of the background noise generated by the background noise generation unit 2, and a playback speed control signal, which controls the playback speed converted by the speech rate conversion unit 4.

Next, the operation is described.

The present embodiment describes the operation for a voice call between two parties, a user and the user's call partner, but the present invention is not limited to this case.

First, when the user's call partner speaks, the speech is encoded and packetized on the partner's side and received on the user's side through the network. When the user's side receives a packet sent from the partner's side, the jitter absorbing buffer 1 temporarily accumulates it. After accumulating packets corresponding to a predetermined initial delay, the jitter absorbing buffer 1 outputs the accumulated packets one after another. This absorbs jitter, that is, fluctuation in packet arrival delay, so that packets can be output at smoothed timings. The output timing of the jitter absorbing buffer 1 follows instructions from the output buffer monitoring unit 6.
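The initial-delay behavior described above can be sketched as follows. This is a minimal illustration, not the device's implementation: the class name `JitterBuffer`, the parameter `initial_packets`, and the choice to reorder packets by sequence number are all assumptions made for the example.

```python
import heapq


class JitterBuffer:
    """Holds packets until an initial backlog has built up, then releases
    them in sequence-number order (hypothetical sketch)."""

    def __init__(self, initial_packets=3):
        self.initial_packets = initial_packets  # assumed unit: packets
        self._heap = []        # entries are (sequence number, payload)
        self._primed = False   # becomes True once the initial delay is filled

    def push(self, seq, payload):
        """Store a received packet."""
        heapq.heappush(self._heap, (seq, payload))
        if len(self._heap) >= self.initial_packets:
            self._primed = True

    def pop(self):
        """Return the oldest packet, or None if output has not started
        or the buffer is empty."""
        if not self._primed or not self._heap:
            return None
        return heapq.heappop(self._heap)

    def __len__(self):
        return len(self._heap)
```

In this sketch, `pop` would be called whenever the output side asks for one more packet, which corresponds to the instruction from the output buffer monitoring unit 6.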

Packets output from the jitter absorbing buffer 1 are processed as one of two types: background noise packets, which contain background noise data, and audio packets, which contain encoded audio data. An audio packet is input to the audio decoding unit 3, and a background noise packet is input to the background noise generation unit 2. Together with a background noise packet, the jitter absorbing buffer 1 passes the background noise generation unit 2 the time difference between that packet and the next packet as the background noise generation time length. This time difference is, for example, the difference between the timestamp values that indicate the sending times assigned to the background noise packet and to the next packet.

The detailed operation is explained with reference to a drawing. FIG. 2 is an explanatory diagram showing the relationship between the timestamps of packets and their accumulation in the jitter absorbing buffer.

In FIG. 2, audio packets #1, #2, and #4, each containing encoded audio data of duration t, and background noise packet #3, containing background noise data, arrive in the order #1, #2, #3, #4 and are temporarily accumulated in the jitter absorbing buffer 1.

Suppose the background noise packet #3 is assigned sequence number N and timestamp value M. Then packets #1, #2, and #4 have sequence numbers N-2, N-1, and N+1, respectively, and packets #1 and #2 have timestamp values M-2t and M-t. The timestamp value of packet #4 is the time after the noise interval length T has elapsed, that is, M+T. The background noise generation time length is the difference between the timestamp values of the background noise packet #3 and the next packet #4, namely (M+T)-M = T.
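As a worked illustration of the timestamp arithmetic above, the sketch below computes the background noise generation time from two timestamps. The function name, the 8000 Hz clock rate, and the 32-bit wraparound handling are assumptions for the example; the text only specifies that the length is the timestamp difference.

```python
def noise_duration_s(noise_timestamp, next_timestamp, clock_rate_hz=8000):
    """Background noise generation time in seconds: the timestamp of the
    packet that follows the background noise packet, minus the timestamp
    of the background noise packet itself.

    Assumes timestamps are 32-bit tick counters (hypothetical), so the
    difference is taken modulo 2**32 to survive a counter wraparound.
    """
    ticks = (next_timestamp - noise_timestamp) % 2**32
    return ticks / clock_rate_hz


# With M = 16000 and T = 4000 ticks at 8000 Hz, the noise lasts 0.5 s.
M, T = 16000, 4000
duration = noise_duration_s(M, M + T)
```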

On receiving a background noise packet and the background noise generation time length, the background noise generation unit 2 generates background noise from the background noise data stored in the packet, continues generating it for the background noise generation time length, and outputs the result to the output buffer 5 as audio data of background noise.

On receiving an audio packet, the audio decoding unit 3 decodes the encoded audio data stored in the packet to generate audio data of speech and outputs it to the speech rate conversion unit 4. The audio data of speech processed by the speech rate conversion unit 4 is input to the output buffer 5.

The output buffer monitoring unit 6 monitors whether audio data is accumulated in the output buffer 5 (that is, the amount of accumulated audio data). When it determines that there is no input from the background noise generation unit 2 or the speech rate conversion unit 4 (the amount is smaller than a predetermined amount), it instructs the jitter absorbing buffer 1 on the packet output timing so that one packet accumulated in the jitter absorbing buffer 1 is output.

The buffer remaining amount monitoring unit 71 monitors the amount of packets temporarily accumulated in the jitter absorbing buffer 1. It notifies the control signal output unit 72 of "small" when the remaining amount is below a threshold A, of "large" when it is above a threshold B, and of "medium" when it is at or above threshold A and at or below threshold B.
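The three-way classification can be sketched as a small function. The function and parameter names are illustrative only; the thresholds A and B are left as arguments because the text gives no concrete values.

```python
def classify_remaining(packets, threshold_a, threshold_b):
    """Classify the jitter buffer's remaining amount.

    Below A is "small", above B is "large", and anything from A to B
    inclusive is "medium", matching the boundaries given in the text.
    """
    if packets < threshold_a:
        return "small"
    if packets > threshold_b:
        return "large"
    return "medium"
```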

On receiving the notification from the buffer remaining amount monitoring unit 71, the control signal output unit 72 outputs a time length control signal and a playback speed control signal. The time length control signal instructs that the background noise generation time length be made shorter as the remaining amount of the jitter absorbing buffer 1 becomes larger. The playback speed control signal instructs that speech be played back faster as the remaining amount of the jitter absorbing buffer 1 becomes larger.

For example, following the control shown in Table 1, when "small" is notified, the background noise generation unit 2 is instructed to extend the background noise generation time length, for example by a factor of 1.1, and the speech rate conversion unit 4 is instructed to play back more slowly, for example at 0.8 times normal speed. When "large" is notified, the background noise generation unit 2 is instructed to shorten the background noise generation time length, for example to 0.9 times, and the speech rate conversion unit 4 is instructed to play back faster, for example at 1.2 times normal speed. When "medium" is notified, the background noise generation unit 2 is instructed to use the normal background noise generation time length, for example a factor of 1.0, and the speech rate conversion unit 4 is instructed to play back at normal speed, for example a factor of 1.0.

[Table 1]
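The control described for Table 1 can be expressed as a lookup, using the example factors quoted in the text (1.1/0.8, 1.0/1.0, 0.9/1.2). The type and function names below are hypothetical, and the factors are the text's examples rather than fixed values.

```python
from typing import NamedTuple


class ControlSignals(NamedTuple):
    noise_length_scale: float  # multiplier for the background noise generation time
    playback_speed: float      # multiplier for the speech playback speed


# Example factors quoted in the description of Table 1.
CONTROL_BY_LEVEL = {
    "small": ControlSignals(noise_length_scale=1.1, playback_speed=0.8),
    "medium": ControlSignals(noise_length_scale=1.0, playback_speed=1.0),
    "large": ControlSignals(noise_length_scale=0.9, playback_speed=1.2),
}


def control_signals(level):
    """Return the pair of control signals for a notified buffer level."""
    return CONTROL_BY_LEVEL[level]
```

Both signals move together: a fuller buffer shortens noise and speeds up speech, while an emptier buffer lengthens noise and slows speech.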

As described above, according to the present embodiment, the control unit 7 issues coordinated instructions to the background noise generation unit 2 and the speech rate conversion unit 4. That is, based on the accumulation status of packets in the jitter absorbing buffer 1, it controls the time length of the background noise generated by the background noise generation unit 2 and controls the playback speed converted by the speech rate conversion unit 4. Background noise (silent intervals) and speech (voiced intervals), whose sending intervals differ, are thus controlled separately, so deterioration of call quality can be prevented even when silent compression, which does not necessarily send packets at fixed intervals, is applied.

The remaining amount of the jitter absorbing buffer 1 is used as the accumulation status of packets in the jitter absorbing buffer 1. Based on it, the device outputs a time length control signal that controls the time length of the background noise generated by the background noise generation unit 2 and a playback speed control signal that controls the playback speed converted by the speech rate conversion unit 4. This allows appropriate jitter buffer control according to the remaining amount of the jitter absorbing buffer 1, and deterioration of call quality can be prevented even when silent compression is applied.

The remaining amount of the jitter absorbing buffer has been described as divided into three classes, "small", "medium", and "large", by thresholds A and B, but finer control is possible by subdividing the classes further.

In addition, the control changes as the remaining amount changes. By using different values for the thresholds that separate "small", "medium", and "large" depending on the direction in which the remaining amount is changing, the control can be kept from switching frequently when the remaining amount rises and falls near a threshold, which provides better call quality. For example, thresholds C and D can be set for the case where the remaining amount of the jitter absorbing buffer is increasing, and thresholds E and F for the case where it is decreasing.
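One way to realize direction-dependent thresholds is a stateful classifier with hysteresis, sketched below. The mapping is an assumption: C and D are taken as the boundaries crossed while the remaining amount increases (small to medium, medium to large), E and F as the boundaries crossed while it decreases (medium to small, large to medium), with E ≤ C and F ≤ D. The text does not fix these roles or values.

```python
LEVELS = ("small", "medium", "large")


class HysteresisClassifier:
    """Three-level classifier whose boundaries depend on the direction of
    the transition (hypothetical interpretation of thresholds C to F)."""

    def __init__(self, up_c, up_d, down_e, down_f):
        if down_e > up_c or down_f > up_d:
            raise ValueError("downward thresholds must not exceed upward ones")
        self._up = (up_c, up_d)        # small->medium, medium->large
        self._down = (down_e, down_f)  # medium->small, large->medium
        self._state = 1                # start at "medium"

    def update(self, packets):
        """Feed the current remaining amount and return the level."""
        # Step up while the amount exceeds the upward boundary of the state.
        while self._state < 2 and packets > self._up[self._state]:
            self._state += 1
        # Step down while the amount is below the downward boundary.
        while self._state > 0 and packets < self._down[self._state - 1]:
            self._state -= 1
        return LEVELS[self._state]
```

With boundaries of 4 (up) and 2 (down) between "small" and "medium", a remaining amount that hovers around 3 keeps whichever level it last had instead of flipping on every sample.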

Furthermore, when the background noise generation unit 2 shortens the background noise generation time length, better call quality can be provided by ensuring that the generation time does not become shorter than a certain fixed length.

In the above description, the instruction from the control unit 7 to the background noise generation unit 2 is expressed as extending by a factor of 1.1 or shortening to 0.9 times, but it may instead specify an amount of time to add or subtract, such as extending by 100 ms or shortening by 200 ms.
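The two forms of instruction (a scale factor or an absolute time offset) and the lower bound on shortened noise can be combined in one helper, sketched here. The function name, the parameter names, and the default floor of 40 ms are assumptions for illustration; the text does not give a value for the fixed minimum length.

```python
def adjust_noise_duration(base_ms, scale=1.0, offset_ms=0.0, min_ms=40.0):
    """Apply a scale factor and/or a time offset to the background noise
    generation time, in milliseconds.

    When the result is shorter than the original, it is clamped so that it
    does not fall below min_ms. The clamp never lengthens noise beyond its
    original duration.
    """
    adjusted = base_ms * scale + offset_ms
    if adjusted < base_ms:
        adjusted = max(adjusted, min(base_ms, min_ms))
    return adjusted
```

For example, shortening a 100 ms noise interval by 200 ms would give a negative length, so the sketch returns the 40 ms floor instead.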

Although the case with the output buffer 5 and the output buffer monitoring unit 6 has been described, both may be omitted. For example, the jitter absorbing buffer 1 may be configured to output packets at output timings with a predetermined time interval. Alternatively, it may be configured to output packets at output timings that follow the control of the control unit 7, according to the accumulation status of packets in the jitter absorbing buffer.

Embodiment 2.

FIG. 3 is a functional block diagram showing an audio decoding device according to one embodiment of the present invention.

In FIG. 3, parts that are the same as or correspond to those in the embodiment described above are given the same reference numerals, and their description is omitted.

In FIG. 3, the control unit 7 has a buffer remaining amount monitoring unit 71, a control signal output unit 72, and an arrival speed monitoring unit 73. The arrival speed monitoring unit 73 monitors the arrival speed of the packets accumulated in the jitter absorbing buffer 1. In the present embodiment, the accumulation status of packets in the jitter absorbing buffer consists of the remaining amount monitored by the buffer remaining amount monitoring unit 71 and the arrival speed monitored by the arrival speed monitoring unit 73. Based on both, the control signal output unit 72 outputs a time length control signal that controls the time length of the background noise generated by the background noise generation unit 2 and a playback speed control signal that controls the playback speed converted by the speech rate conversion unit 4.

Next, the operation is described.

The present embodiment describes the operation for a voice call between two parties, a user and the user's call partner, but the present invention is not limited to this case.

Packets output from the jitter absorbing buffer 1 are processed as one of two types: background noise packets, which contain background noise data, and audio packets, which contain encoded audio data. An audio packet is input to the audio decoding unit 3, and a background noise packet is input to the background noise generation unit 2. Together with a background noise packet, the jitter absorbing buffer 1 passes the background noise generation unit 2 the time difference between that packet and the next packet as the background noise generation time length. This time difference is, for example, the difference between the timestamp values that indicate the sending times assigned to the background noise packet and to the next packet.

On receiving a background noise packet and the background noise generation time length, the background noise generation unit 2 generates background noise from the background noise data stored in the packet, continues generating it for the background noise generation time length, and outputs the result to the output buffer 5 as audio data of background noise.

On receiving an audio packet, the audio decoding unit 3 decodes the encoded audio data stored in the packet to generate audio data of speech and outputs it to the speech rate conversion unit 4. The audio data of speech processed by the speech rate conversion unit 4 is input to the output buffer 5.

The arrival speed monitoring unit 73 monitors the speed at which packets are input to (arrive at) the jitter absorbing buffer 1. It notifies the control signal output unit 72 of "low speed" when packets arrive more slowly than a threshold α, of "high speed" when they arrive faster than a threshold β, and of "medium speed" when the arrival speed is neither below threshold α nor above threshold β.
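A possible way to measure the arrival speed is to count packets within a sliding time window, as sketched below. The text does not say how the speed is measured, so the sliding window, its one-second default, and all names are assumptions for illustration; `alpha` and `beta` stand for the thresholds α and β in packets per second.

```python
from collections import deque


class ArrivalRateMonitor:
    """Estimates the packet arrival rate over a sliding window and
    classifies it against thresholds alpha and beta (hypothetical sketch)."""

    def __init__(self, alpha, beta, window_s=1.0):
        self.alpha = alpha        # below this rate: "low"
        self.beta = beta          # above this rate: "high"
        self.window_s = window_s
        self._arrivals = deque()  # arrival times in seconds, oldest first

    def on_packet(self, now_s):
        """Record the arrival time of one packet."""
        self._arrivals.append(now_s)

    def classify(self, now_s):
        """Return "low", "medium", or "high" for the rate at time now_s."""
        # Drop arrivals that have fallen out of the window.
        while self._arrivals and self._arrivals[0] <= now_s - self.window_s:
            self._arrivals.popleft()
        rate = len(self._arrivals) / self.window_s
        if rate < self.alpha:
            return "low"
        if rate > self.beta:
            return "high"
        return "medium"
```

A burst that follows a stall, the situation this embodiment targets, shows up as a sudden jump to "high" even before the buffer's remaining amount crosses its own threshold.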

On receiving notifications from the buffer remaining amount monitoring unit 71 and the arrival speed monitoring unit 73, the control signal output unit 72 outputs a time length control signal and a playback speed control signal. The time length control signal instructs that the background noise generation time length be made shorter as the remaining amount of the jitter absorbing buffer 1 becomes larger, and shorter as packets arrive at the jitter absorbing buffer 1 faster. The playback speed control signal instructs that speech be played back faster as the remaining amount of the jitter absorbing buffer 1 becomes larger, and faster as packets arrive at the jitter absorbing buffer 1 faster.

For example, instructions are issued to the background noise generation unit 2 and the speech rate conversion unit 4 following the control shown in Table 2. For the background noise generation unit 2, "extend" corresponds to an instruction of, for example, 1.1 times; "extend further" to, for example, 1.3 times; "shorten" to, for example, 0.9 times; "shorten further" to, for example, 0.5 times; and "normal" to, for example, 1.0 times. For the speech rate conversion unit 4, "slow" corresponds to an instruction of, for example, 0.8 times; "slower" to, for example, 0.6 times; "fast" to, for example, 1.2 times; "faster" to, for example, 1.4 times; and "normal" to, for example, 1.0 times.

[Table 2]
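The contents of Table 2 are not reproduced in this text, so the sketch below is only one plausible way to combine the two inputs. It assumes an additive rule: the buffer level and the arrival speed each contribute -1, 0, or +1, and the sum selects one of the five example factors quoted above. The additive rule and all names are hypothetical; only the factor values and the monotonic direction (fuller buffer or faster arrival means shorter noise and faster speech) come from the description.

```python
LEVEL_RANK = {"small": -1, "medium": 0, "large": 1}
RATE_RANK = {"low": -1, "medium": 0, "high": 1}

# Example factors from the description: extend further, extend, normal,
# shorten, shorten further.
NOISE_SCALE = {-2: 1.3, -1: 1.1, 0: 1.0, 1: 0.9, 2: 0.5}
# Example factors from the description: slower, slow, normal, fast, faster.
SPEED_SCALE = {-2: 0.6, -1: 0.8, 0: 1.0, 1: 1.2, 2: 1.4}


def combined_control(level, rate):
    """Return (noise length factor, playback speed factor) for a buffer
    level and an arrival speed class, using the assumed additive rule."""
    score = LEVEL_RANK[level] + RATE_RANK[rate]
    return NOISE_SCALE[score], SPEED_SCALE[score]
```

Under this assumed rule, a large remaining amount combined with high-speed arrival yields the strongest response (noise at 0.5 times, speech at 1.4 times), while opposing inputs cancel out to normal playback.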

As described above, according to the present embodiment, the control unit 7 issues coordinated instructions to the background noise generation unit 2 and the speech rate conversion unit 4. That is, based on the accumulation status of packets in the jitter absorbing buffer 1, it controls the time length of the background noise generated by the background noise generation unit 2 and controls the playback speed converted by the speech rate conversion unit 4. Background noise (silent intervals) and speech (voiced intervals), whose sending intervals differ, are thus controlled separately, so deterioration of call quality can be prevented even when silent compression, which does not necessarily send packets at fixed intervals, is applied.

The remaining amount of the jitter absorbing buffer 1 and the speed at which packets arrive at it are used as the accumulation status of packets in the jitter absorbing buffer 1. Based on them, the device outputs a time length control signal that controls the time length of the background noise generated by the background noise generation unit 2 and a playback speed control signal that controls the playback speed converted by the speech rate conversion unit 4. This allows appropriate jitter buffer control according to the remaining amount of the jitter absorbing buffer 1. Moreover, even when packet reception stalls temporarily and a large number of packets arrive at once after the stall clears, monitoring the arrival speed enables appropriate jitter buffer control that prevents buffer overflow before it occurs. Deterioration of call quality can therefore be prevented even when silent compression is applied.

In the above description, the remaining amount of the jitter absorbing buffer is classified into three classes, "small", "medium", and "large", using thresholds A and B, and the arrival speed is classified into three classes, "low speed", "medium speed", and "high speed", using thresholds α and β. Finer control is possible by subdividing these classes further.

The control changes as the remaining amount of the jitter absorbing buffer and the arrival speed change. By setting different values for the thresholds that separate "small", "medium", and "large" and "low speed", "medium speed", and "high speed" depending on the direction in which the remaining amount or the speed is changing, the control can be kept from switching frequently because of small increases and decreases near a threshold. For example, thresholds C and D are set for the case where the remaining amount of the jitter absorbing buffer is increasing, and thresholds E and F for the case where it is decreasing. Likewise, thresholds γ and δ are set for the case where the arrival speed is increasing, and thresholds ε and ζ for the case where it is decreasing. Better call quality can thereby be provided.
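The direction-dependent thresholds described above amount to hysteresis. The following is a minimal sketch of one way to realize it, not the method of the specification: it assumes the current class is kept as state, that the rising thresholds (C, D) are at least as large as the falling thresholds (E, F), and that the class only moves up past a rising threshold and only moves down past a falling one. The class name `HysteresisClassifier` and all numeric values are hypothetical.

```python
LEVELS = ("small", "medium", "large")


class HysteresisClassifier:
    """Three-level classifier with separate rising and falling thresholds."""

    def __init__(self, up, down, initial="medium"):
        # up = (C, D): thresholds applied while the value is increasing.
        # down = (E, F): thresholds applied while the value is decreasing.
        # Assumption of this sketch: E <= C and F <= D.
        if not (down[0] <= up[0] and down[1] <= up[1]):
            raise ValueError("falling thresholds must not exceed rising ones")
        self.up = up
        self.down = down
        self.index = LEVELS.index(initial)

    def update(self, value):
        idx = self.index
        # Move up only when the rising threshold of the current level is exceeded.
        while idx < 2 and value > self.up[idx]:
            idx += 1
        # Move down only when the value drops below the falling threshold.
        while idx > 0 and value < self.down[idx - 1]:
            idx -= 1
        self.index = idx
        return LEVELS[idx]
```

With `up=(5, 10)` and `down=(3, 8)`, a value that wanders between 3 and 5 keeps whatever class it already had, which is the behavior the paragraph is after. The same class could be reused for the arrival speed with thresholds γ, δ, ε, and ζ.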

Furthermore, when the background noise generation unit 2 shortens the background noise generation time length, better call quality can be provided by keeping the generation time from becoming shorter than a certain fixed time length.

In the above description, the instruction from the control unit 7 to the background noise generation unit 2 is expressed as a factor such as 1.1 times or 0.9 times, but it may instead specify an amount of time to add or remove, such as lengthening by 100 ms or shortening by 200 ms.
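The two forms of instruction (a factor or an amount of time) and the lower bound on a shortened noise interval can be sketched together as follows. This is an illustrative sketch only: the function name `adjust_noise_duration`, its parameters, and the default minimum of 20 ms are assumptions and do not appear in the specification.

```python
def adjust_noise_duration(base_ms, *, factor=None, delta_ms=None, min_ms=20.0):
    """Return the adjusted background noise duration in milliseconds.

    Exactly one of `factor` (e.g. 1.1 or 0.9) or `delta_ms` (e.g. +100 or -200)
    must be given. `min_ms` is a hypothetical fixed minimum length.
    """
    if (factor is None) == (delta_ms is None):
        raise ValueError("specify exactly one of factor or delta_ms")
    adjusted = base_ms * factor if factor is not None else base_ms + delta_ms
    if adjusted < base_ms:
        # Shortening: do not go below the fixed minimum, and never lengthen
        # an interval that was already shorter than that minimum.
        adjusted = max(adjusted, min(base_ms, min_ms))
    return adjusted
```

For instance, `adjust_noise_duration(500, delta_ms=-200)` gives 300, while a request that would shorten a 100 ms interval to 50 ms with `min_ms=80` is held at 80.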

Although the control unit 7 has been described as having the buffer remaining amount monitoring unit 71 and the arrival speed monitoring unit 73, the buffer remaining amount monitoring unit 71 may be omitted, in which case the control signal output unit 72 outputs the time length control signal and the reproduction speed control signal based on the arrival speed at the jitter absorbing buffer monitored by the arrival speed monitoring unit 73.

Although the case of having the output buffer 5 and the output buffer monitoring unit 6 has been described, the output buffer 5 and the output buffer monitoring unit 6 may be omitted. For example, the jitter absorbing buffer 1 may be configured to output packets at output timings having a predetermined time interval. Alternatively, for example, it may be configured to output packets at output timings determined by the control of the control unit 7 according to the accumulation status of packets in the jitter absorbing buffer.

Embodiment 3.

FIG. 4 is a functional block diagram showing an audio decoding device according to an embodiment of the present invention.

In FIG. 4, parts that are the same as or correspond to those in the above embodiments are denoted by the same reference numerals, and their description is omitted.

In FIG. 4, the high-precision silent compression unit 8 analyzes a received packet. When it detects a silent/noise interval in the encoded audio data contained in the packet, it replaces the packet with a background noise packet containing background noise data; when it detects no silent/noise interval, it outputs the packet without replacement.

Next, the operation will be described.

This embodiment describes the operation for a voice call between a user and the user's call partner, but the present invention is not limited to this case.

First, when the user's call partner speaks, the speech is encoded and packetized on the call partner's side, and the packets are received on the user's side via the network. When silent compression is performed in the encoding on the call partner's side, background noise packets are output in background noise intervals and audio packets are output in speech intervals, and these reach the audio decoding device on the user's side. If the silent compression function of the audio encoding device on the call partner's side has poor accuracy, packets are output as audio packets regardless of whether the interval is actually a background noise interval. Alternatively, the audio encoding device on the call partner's side may not implement silent compression at all and may output every packet as an audio packet. The high-precision silent compression unit 8 is provided so that appropriate jitter absorbing buffer control can be achieved in the audio decoding device on the user's side in any of these cases.

When the user's side receives a packet sent from the call partner's side, the high-precision silent compression unit 8 analyzes the received packet and identifies noise intervals with higher accuracy from the encoded data stored in the received audio packet. When it detects a silent/noise interval in the encoded audio data contained in the packet, it replaces the packet with a background noise packet containing background noise data and outputs that packet to the jitter absorbing buffer 1. When it detects no silent/noise interval, it outputs the packet to the jitter absorbing buffer 1 without replacement. The subsequent operation is the same as in the above embodiments.
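The replacement step performed in front of the jitter absorbing buffer can be sketched as below. The specification does not say how the silent/noise interval is detected or how the background noise data is derived, so both are passed in as hypothetical callables (`is_silence`, `make_noise_payload`). The `Packet` type and the choice to keep the original sequence number and time stamp are also assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Packet:
    sequence: int
    timestamp: int
    kind: str       # "audio" or "noise"
    payload: bytes


def recompress(packet: Packet,
               is_silence: Callable[[bytes], bool],
               make_noise_payload: Callable[[bytes], bytes]) -> Packet:
    """Replace an audio packet with a background noise packet when its
    payload is judged to be a silent/noise interval; otherwise pass it on."""
    if packet.kind == "audio" and is_silence(packet.payload):
        return Packet(packet.sequence, packet.timestamp, "noise",
                      make_noise_payload(packet.payload))
    return packet
```

The result of `recompress` is what would be handed to the jitter absorbing buffer, so everything downstream sees the same two packet kinds whether or not the far end applied silent compression.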

As described above, according to the present embodiment, a received packet is analyzed; when a silent/noise interval is detected in the encoded audio data contained in the packet, the packet is replaced with a background noise packet containing background noise data, and when no silent/noise interval is detected, the packet is output without replacement. Background noise (silent intervals) and speech (voiced intervals) are therefore controlled separately regardless of whether the other party's audio encoding device has a silent compression function or how well that function performs, so appropriate jitter absorbing buffer control can be achieved and deterioration of call quality can be further prevented.

This embodiment describes the case where the arrival speed monitoring unit 73 monitors the arrival speed of packets input to the high-precision silent compression unit 8, but the arrival speed of packets may instead be monitored between the high-precision silent compression unit 8 and the jitter absorbing buffer 1.

Although the control unit 7 has been described as having both the buffer remaining amount monitoring unit 71 and the arrival speed monitoring unit 73, it may be configured with only one of them and still output the time length control signal and the reproduction speed control signal.

Although the case of having the output buffer 5 and the output buffer monitoring unit 6 has been described, the output buffer 5 and the output buffer monitoring unit 6 may be omitted. For example, the jitter absorbing buffer 1 may be configured to output packets at output timings having a predetermined time interval. Alternatively, for example, it may be configured to output packets at output timings determined by the control of the control unit 7 according to the accumulation status of packets in the jitter absorbing buffer.

Embodiment 4.

FIG. 5 is a functional block diagram showing an audio decoding device according to an embodiment of the present invention.

In FIG. 5, parts that are the same as or correspond to those in the above embodiments are denoted by the same reference numerals, and their description is omitted.

In FIG. 5, the audio decoding device 20 decodes encoded audio data received on the user's side. The audio encoding device 21 encodes speech to be transmitted from the user's side. The audio detection unit 9 detects whether the user is speaking. In this embodiment, it determines at fixed intervals whether the input audio data is "speech" or non-speech "noise". When the audio data is "speech", it determines that the user is speaking; when the audio data is "noise", it determines that the user is not speaking.

The audio encoding unit 10 encodes audio data and outputs encoded audio data. When the audio detection unit 9 determines "speech", the silent compression control unit 11 outputs the encoded audio data from the audio encoding unit 10; when the audio detection unit 9 determines "noise", the silent compression control unit 11 intermittently outputs the background noise data from the audio encoding unit 10.

In this embodiment, the jitter absorbing buffer 1 is configured to return its contents to the initial state when the audio detection unit 9 detects that the user is speaking.

Next, the operation will be described.

In the audio encoding device 21, audio data is input to the audio detection unit 9 and the audio encoding unit 10. The audio detection unit 9 determines at fixed intervals whether the input audio data is "speech" or non-speech "noise", and outputs the result to the audio encoding unit 10, the silent compression control unit 11, and the jitter absorbing buffer 1 in the audio decoding device 20. The audio encoding unit 10 outputs encoded data of the input audio data when notified of "speech", and outputs background noise data when notified of "noise". The silent compression control unit 11 outputs the encoded audio data from the audio encoding unit 10 when notified of "speech", and intermittently outputs the background noise data from the audio encoding unit 10 when notified of "noise". The determination result of the audio detection unit 9 is also notified to the jitter absorbing buffer 1. The jitter absorbing buffer 1 continues normal processing when notified of "noise", but when notified of "speech" it discards the audio packets accumulated in the jitter absorbing buffer 1 and restarts processing from the initial state.
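The way the jitter absorbing buffer reacts to the local detection result can be sketched as follows. This is a simplified illustration, assuming that the "initial state" is an empty buffer; the class and method names are hypothetical.

```python
from collections import deque


class JitterBuffer:
    """Toy jitter absorbing buffer that resets when the local user speaks."""

    def __init__(self):
        self.packets = deque()

    def push(self, packet):
        self.packets.append(packet)

    def on_local_detection(self, result):
        """Handle the audio detection result ("speech" or "noise")
        for the audio being sent from the user's side."""
        if result == "speech":
            # Discard accumulated packets and restart from the initial
            # state (assumed here to be an empty buffer).
            self.packets.clear()
        # "noise": continue normal processing; nothing to do in this sketch.
```

A real buffer would also reset whatever playout timing state it holds; that detail is omitted because the specification only describes returning to the initial state.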

When "speech" audio data is input to the audio encoding device 21, the user is speaking, and the user's call partner is usually not speaking at that time. In this case it is therefore likely that no decoding is needed on the user's side. By discarding the audio packets accumulated in the jitter absorbing buffer 1 and returning to the initial state, jitter absorbing buffer control can start from the initial state, rather than from a state close to buffer depletion or overflow, when the call partner begins to speak and decoding starts on the user's side.

As described above, according to the present embodiment, when "speech" audio data is input to the audio encoding device 21, the audio packets accumulated in the jitter absorbing buffer 1 are discarded and the buffer returns to the initial state. When the user's call partner begins to speak and decoding starts on the user's side, jitter absorbing buffer control can therefore start from the initial state rather than from a state close to buffer depletion or overflow, so more appropriate control can be achieved and deterioration of call quality can be further prevented.

The audio encoding device 21 does not necessarily need to apply silent compression; it is sufficient that it has the audio detection unit 9 and that the jitter absorbing buffer 1 obtains the determination result.

Embodiment 5.

FIG. 6 is a functional block diagram showing an audio decoding device according to an embodiment of the present invention.

In FIG. 6, parts that are the same as or correspond to those in the above embodiments are denoted by the same reference numerals, and their description is omitted.

In FIG. 6, the background noise data detection/insertion unit 12 detects whether a received packet contains background noise data. When it detects background noise data, it inserts into the jitter absorbing buffer 1 a number of packets corresponding to the time length of the silent/noise interval of the background noise data, each of which has the same per-packet time length as a packet containing encoded audio data.

Next, the operation will be described.

First, when the user's call partner speaks, the speech is encoded and packetized on the call partner's side, and the packets are received on the user's side via the network.

The background noise data detection/insertion unit 12 detects whether a received packet is a background noise packet containing background noise data. When it detects a background noise packet, it inserts into the jitter absorbing buffer 1 a number of packets corresponding to the time length of the silent/noise interval of the background noise data, each of which has the same per-packet time length as a packet containing encoded audio data.

The detailed operation is described with reference to the drawing. FIG. 7 is an explanatory diagram showing the relationship between packet time stamps and accumulation in the jitter absorbing buffer.

In FIG. 7, audio packets #1, #2, and #4, each containing encoded audio data of time length t, and background noise packet #3, containing background noise data, arrive in the order #1, #2, #3, #4 and are temporarily accumulated in the jitter absorbing buffer 1. If packet #3, the background noise packet, has sequence number N and time stamp value M, then packet #1 has sequence number N-2, packet #2 has sequence number N-1, and packet #4 has sequence number N+1. Packet #1 has time stamp value M-2t and packet #2 has time stamp value M-t. The time stamp value of packet #4 is the time after the noise interval length T has elapsed, namely M+T.

When the background noise data detection/insertion unit 12 detects packet #3, a background noise packet, it stores the packet's sequence number N and time stamp value M, outputs packet #3 to the jitter absorbing buffer 1, and waits for the arrival of the next packet, which has sequence number N+1. When the packet with sequence number N+1, namely packet #4, arrives, the unit reads its time stamp value M+T and calculates the time length T of the noise interval between packet #2 and packet #4. So that background noise packets exist at intervals of t just like the audio packets, X background noise packets of time length t, corresponding to the noise interval of time length T, are inserted after packet #2 in the jitter absorbing buffer 1, and packet #4 is then output to the jitter absorbing buffer 1. As a result, an audio packet or a background noise packet exists in the jitter absorbing buffer 1 for every interval of t.
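The time stamp arithmetic above can be sketched as follows. The text does not give a formula for X, so this sketch assumes X = T / t with T an exact multiple of t, and it returns the time stamps of all t-length slots that cover the noise interval starting at M. Whether the original packet #3 counts as the first of those slots is not stated in the text and is left to the caller. The function name is hypothetical, and time stamp wraparound is not handled.

```python
def noise_fill_timestamps(noise_ts: int, next_ts: int, t: int) -> list[int]:
    """Time stamps of the t-length slots covering the noise interval
    [noise_ts, next_ts), where noise_ts = M and next_ts = M + T."""
    if t <= 0:
        raise ValueError("t must be positive")
    interval = next_ts - noise_ts      # T, the noise interval length
    if interval < 0:
        raise ValueError("next packet must not precede the noise packet")
    count = interval // t              # X, assuming T is a multiple of t
    return [noise_ts + i * t for i in range(count)]
```

For example, with M = 1000, t = 160, and packet #4 stamped 1800, the interval T is 800 and five slots result, stamped 1000, 1160, 1320, 1480, and 1640, so the buffer holds one packet per interval t up to packet #4.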

The buffer remaining amount monitoring unit 71 monitors the amount of packets temporarily accumulated in the jitter absorbing buffer 1 as the buffer remaining amount. It notifies the control signal output unit 72 of "small" when the remaining amount is less than a threshold A, of "large" when it is greater than a threshold B, and of "medium" when it is at least threshold A and at most threshold B.

The arrival speed monitoring unit 73 monitors the arrival speed of packets input to (arriving at) the jitter absorbing buffer 1. It notifies the control signal output unit 72 of "low speed" when packets are input at a speed slower than a threshold α, of "high speed" when they are input at a speed faster than a threshold β, and of "medium speed" when the speed is neither lower than threshold α nor higher than threshold β.

On receiving the notifications from the buffer remaining amount monitoring unit 71 and the arrival speed monitoring unit 73, the control signal output unit 72 outputs a time length control signal that shortens the background noise generation time length as the buffer remaining amount of the jitter absorbing buffer 1 becomes larger and as the arrival speed of packets input to (arriving at) the jitter absorbing buffer 1 becomes higher. It also outputs a reproduction speed control signal that speeds up the speech rate for reproduction as the buffer remaining amount becomes larger and as the arrival speed becomes higher.

On receiving the notifications from the buffer remaining amount monitoring unit 71 and the arrival speed monitoring unit 73, the control signal output unit 72 issues instructions to the jitter absorbing buffer 1 and the speech rate conversion unit 4, for example according to the control contents listed in Table 2. To the jitter absorbing buffer 1 it instructs, for example, insertion of one background noise packet for "lengthen", insertion of three background noise packets for "lengthen further", deletion of one background noise packet for "shorten", deletion of three background noise packets for "shorten further", and no insertion or deletion for "normal". To the speech rate conversion unit 4 it instructs, for example, a factor of 0.8 for "slow", 0.6 for "slower", 1.2 for "fast", 1.4 for "faster", and 1.0 for "normal".
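The three-way classification by the monitoring units and the example instruction values can be sketched as below. The mapping from a (remaining amount, arrival speed) pair to an instruction is given by Table 2 and is deliberately not reproduced here. The label strings are this sketch's English renderings, the numeric values are the examples quoted in the text, and the function names are hypothetical.

```python
def classify(value, low, high, labels):
    """Return labels[0] below `low`, labels[2] above `high`, else labels[1].

    Matches the description: values equal to either threshold fall in the
    middle class.
    """
    if value < low:
        return labels[0]
    if value > high:
        return labels[2]
    return labels[1]


# Example instruction values quoted in the text (illustrative, not normative).
NOISE_PACKET_DELTA = {
    "lengthen further": 3,
    "lengthen": 1,
    "normal": 0,
    "shorten": -1,
    "shorten further": -3,
}
SPEECH_RATE = {
    "slower": 0.6,
    "slow": 0.8,
    "normal": 1.0,
    "fast": 1.2,
    "faster": 1.4,
}


def apply_noise_instruction(noise_packets, label):
    """Background noise packet count after an insert/delete instruction,
    never dropping below zero (an assumption of this sketch)."""
    return max(0, noise_packets + NOISE_PACKET_DELTA[label])
```

Because the noise interval is already held as a run of equal-length packets, lengthening or shortening it reduces to adding or removing whole packets, which is the simplification this embodiment relies on.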

As described above, according to the present embodiment, the control unit 7 issues coordinated instructions to the jitter absorbing buffer 1 and the speech rate conversion unit 4 based on the remaining amount of the jitter absorbing buffer and the arrival speed. That is, based on the accumulation status of packets in the jitter absorbing buffer 1, it controls the time length of the background noise generated by the background noise generation unit 2 and also controls the reproduction speed converted by the speech rate conversion unit 4. Background noise (silent intervals) and speech (voiced intervals), whose transmission intervals differ, are thus controlled separately, so deterioration of call quality can be prevented even when silent compression, in which packets are not necessarily sent at fixed intervals, is applied.

When background noise data is detected, the time length of the background noise generated by the background noise generation unit 2 is controlled by inserting into the jitter absorbing buffer 1 a number of packets corresponding to the time length of the silent/noise interval of the background noise data, each with the same per-packet time length as a packet containing encoded audio data. Control can thus be performed in terms of the number of packets accumulated in the jitter absorbing buffer 1, which simplifies the processing of the background noise generation unit 2.

Furthermore, even when packet reception temporarily stalls and a large number of packets arrive at once after the stall is resolved, monitoring the arrival speed enables appropriate jitter buffer control that prevents buffer overflow before it occurs.

In the above description, the remaining amount of the jitter absorbing buffer is classified into three classes, "small", "medium", and "large", using thresholds A and B, and the arrival speed is classified into three classes, "low speed", "medium speed", and "high speed", using thresholds α and β. Finer control is possible by subdividing these classes further.

The control also changes as the remaining amount of the jitter absorbing buffer and the arrival speed change. By setting different values for the thresholds that separate "small", "medium", and "large" and "low speed", "medium speed", and "high speed" depending on the direction in which the remaining amount or the speed is changing, the control can be kept from switching frequently because of small increases and decreases near a threshold. For example, thresholds C and D are set for the case where the remaining amount of the jitter absorbing buffer is increasing, and thresholds E and F for the case where it is decreasing. Likewise, thresholds γ and δ are set for the case where the arrival speed is increasing, and thresholds ε and ζ for the case where it is decreasing. Better call quality can thereby be provided.

This embodiment has been described in terms of the packetization period, but when one packet contains a plurality of encoded audio frames, control may instead be based on the time length of an encoded audio frame.

As another operation of the background noise data detection/insertion unit 12, background noise packets may be inserted into the jitter absorbing buffer 1 one at a time each time t elapses, during the period after packet #3, the background noise packet, arrives and before packet #4, an audio packet, arrives.

Furthermore, when the background noise generation unit 2 shortens the background noise generation time length, better call quality can be provided by keeping the generation time from becoming shorter than a certain fixed time length.

Although the control unit 7 has been described as having the buffer remaining amount monitoring unit 71 and the arrival speed monitoring unit 73, the arrival speed monitoring unit 73 may be omitted, in which case the time length control signal and the reproduction speed control signal are output according to the monitoring result of the buffer remaining amount monitoring unit 71.

Although the case of having the output buffer 5 and the output buffer monitoring unit 6 has been described, the output buffer 5 and the output buffer monitoring unit 6 may be omitted. For example, the jitter absorbing buffer 1 may be configured to output packets at output timings having a predetermined time interval. Alternatively, for example, it may be configured to output packets at output timings determined by the control of the control unit 7 according to the accumulation status of packets in the jitter absorbing buffer.

Claims

1. An audio decoding device, characterized in that it has:

a jitter absorbing buffer temporarily accumulating received packets and outputting the packets at a prescribed output timing;

a background noise generation unit that generates audio data of background noise based on the background noise data included in the packet output from the jitter absorbing buffer;

an audio decoding unit that decodes encoded audio data included in the packet output from the jitter absorbing buffer to generate audio data of speech;

a speech rate conversion unit that performs speech rate conversion for converting the reproduction speed of the audio data decoded by the audio decoding unit; and

a control unit that controls, based on the accumulation status of packets in the jitter absorbing buffer, the time length of the background noise generated by the background noise generation unit and the reproduction speed converted by the speech rate conversion unit.

2. The audio decoding device according to claim 1, characterized in that

the control unit has:

a buffer remaining amount monitoring unit that monitors the remaining amount of the jitter absorbing buffer as the accumulation status; and

a control signal output unit that outputs, based on the remaining amount monitored by the buffer remaining amount monitoring unit, a time length control signal for controlling the time length of the background noise generated by the background noise generation unit and a reproduction speed control signal for controlling the reproduction speed converted by the speech rate conversion unit.

3. The audio decoding device according to claim 1, characterized in that

the control unit has:

an arrival speed monitoring unit that monitors the arrival speed of the received packets at the jitter absorbing buffer as the accumulation status; and

a control signal output unit that outputs, based on the arrival speed monitored by the arrival speed monitoring unit, a time length control signal for controlling the time length of the background noise generated by the background noise generation unit and a reproduction speed control signal for controlling the reproduction speed converted by the speech rate conversion unit.

4. The audio decoding device according to claim 1, characterized in that

the audio decoding device has a high-precision silent compression unit that analyzes the received packet, replaces the packet with a background noise packet containing background noise data when a silent/noise interval is detected in the encoded audio data contained in the packet, and outputs the packet without replacement when no silent/noise interval is detected, and

the jitter absorbing buffer temporarily accumulates the packets output from the high-precision silent compression unit.

5. The audio decoding device according to claim 1, characterized in that

the audio decoding device has an audio detection unit that detects the presence or absence of a user's utterance, and

the jitter absorbing buffer returns to an initial state when the audio detection unit detects a user's utterance.

6. The audio decoding device according to claim 1, characterized in that

the audio decoding device has a background noise data detection/insertion unit that detects whether the received packet contains background noise data and, when background noise data is detected, inserts into the jitter absorbing buffer a number of packets corresponding to the time length of the silent/noise interval of the background noise data, the time length of each inserted packet being equal to the time length of each packet containing encoded audio data.

7. The audio decoding device according to claim 1, characterized by having:

an output buffer that temporarily accumulates the audio data of the background noise and the audio data of the speech; and

an output buffer monitoring unit that monitors the accumulation amount of the audio data accumulated in the output buffer and, based on the accumulation amount, instructs the jitter absorbing buffer on the output timing of the temporarily accumulated packets,

wherein the jitter absorbing buffer outputs the temporarily accumulated packets according to the instruction from the output buffer monitoring unit.