WO2022094702A1 - System, device and method for non-intrusively estimating in real-time the quality of audio communication signals

Info

Publication number: WO2022094702A1
Application number: PCT/CA2021/051551
Authority: WO
Inventors: Lokman SBOUI; Francois Gagnon
Original assignee: MEDIA5 Corp
Current assignee: MEDIA5 Corp
Priority date: 2020-11-03
Filing date: 2021-11-02
Publication date: 2022-05-12
Anticipated expiration: 2023-05-03

Abstract

A method, network device, system and storage memory for non-intrusively estimating in real-time the quality of an audio communication signal at an endpoint device of a network are provided. The method comprises determining packet loss (PL) and delay (T), and extracting codec type, from IP packets of the audio communication signal passing through a port of the endpoint device, the IP packets used for determining the packet loss (PL) and the delay (T) being non-payload IP packets; and determining the quality of the audio communication signal using a simplified E-model based on the packet loss (PL), the delay (T), and the codec type at the endpoint device.

Description

SYSTEM, DEVICE AND METHOD FOR NON-INTRUSIVELY ESTIMATING IN REAL-TIME THE QUALITY OF AUDIO COMMUNICATION SIGNALS

TECHNICAL FIELD

The present application generally relates to the monitoring and quality assessment of audio signals, and more specifically relates to the monitoring of the quality of Voice-over-IP (VoIP) communications. The method, network device and system described herein are particularly adapted to the non-intrusive monitoring, evaluating and troubleshooting of VoIP communications from endpoints of a network.

BACKGROUND

Nowadays, VoIP is one of the most adopted means of communication in both residential and corporate environments, in which it is used as a mode of communication over the internet, replacing telephone lines. However, even with advances in network speed and capacity, VoIP, as a real-time application, suffers from network impairments affecting speech quality.

Given the particularity of VoIP applications, as being related to human speech and perception, assessing VoIP quality is usually a subjective task. A standard way of assessing the quality of a call is through human intervention, where users are asked to rate the quality of VoIP calls. The resulting subjective ratings are averaged to obtain a mean opinion score (MOS) that is used to determine the quality of the VoIP service of a network. However, this method is time-consuming and costly, and requires the participation of a large number of users to provide reliable results.

To avoid these issues, methods have been proposed that either present objective call quality measures or that measure the MOS indirectly. Objective quality measures are algorithms that estimate the call degradation without the need for human involvement. As a result, quality assessment can be fast and low cost. However, such algorithms are in general intrusive, meaning that they require both original and distorted speech signals in order to assess the corresponding quality.

Alternatively, different models have been developed for estimating the MOS. One such model, called the E-model, was standardized by the Telecommunication Standardization Sector (ITU-T) of the International Telecommunication Union (ITU) in ITU-T Recommendation G.107 (June 2015). However, the G.107 E-model has some drawbacks, such as the complexity of the impairment calculations, as noted in ITU-T Rec. G.107, and the intrusive extraction of these impairments from the VoIP traffic.

With the increasing traffic load on current networks, maintaining a reliable VoIP quality is mandatory for service providers, as users expect reliable service similar to the traditional switched phone system. Consequently, there is a need for methods and systems for generating a practical, accurate, real-time and non-intrusive tool that estimates the quality of VoIP calls in a decentralized manner.

SUMMARY

According to a first aspect, a method for non-intrusively estimating in real-time the quality of an audio communication signal at an endpoint device of a network is provided. The method comprises determining packet loss (PL) and delay (T), and extracting codec type, from IP packets of the audio communication signal passing through a port of the endpoint device. The delay (T) can involve the Round-Trip Delay (RTD) and the jitter (J). The IP packets used for determining the packet loss (PL) and the delay (T) are non-payload IP packets. The method also comprises determining the quality of the audio communication signal using a simplified E-model, which uses the packet loss (PL), delay (T), and the codec type at the endpoint device.

In preferred implementations, the audio communication signal is a voice communication, and the IP packets are VoIP packets. Determining the quality of the audio communication preferably entails determining a transmission rating factor (R factor) obtained by the simplified E-model, where the transmission rating factor can be used to estimate or predict a voice signal quality rating, as it would have been made by an average user.

In a possible implementation of the method, the simplified E-model used to determine the R factor solely requires determining the packet loss (PL), the delay (T), and the codec type at the endpoint device. The determination of the packet loss (PL), and of the delay (T) is inferred from RTP Control Protocol (RTCP) packets only. In this preferred implementation, the codec type is extracted from Real-Time Transport Protocol (RTP) packets. While the proposed method is adapted to codec types having a narrowband, such as about 200 to 3,300Hz, the method can also be implemented with wideband codecs, such as between 50 and 7,000Hz.

In yet other possible implementations, the codec type can be extracted from Session Initiation/Description Protocol (SIP/SDP) packets of the VoIP traffic.

In possible implementations, the simplified E-model comprises a first constant component representative of a maximum value of the R factor when there is no impairment on the network (R_max), a second component which is solely dependent on the determined delay (F1(T)), and a third component which is solely dependent on both the determined packet loss (PL) and the extracted codec type (codec), referred to as F2(PL, codec). Preferably, the second component (F1(T)) follows a first linear relationship for a first delay interval; a second linear relationship for a second delay interval; and a third non-linear relationship for a third delay interval. Preferably, the second component (F1(T)) of the simplified E-model follows first and second linear relationships for delays (T) that are less than 400ms, and preferably equal to or less than 380ms. The calculation of F1(T) is thus greatly simplified, as in most cases the delay (T) is less than 380ms. As for the third component F2(PL, codec), it follows a distinct non-linear, rational function of degree one, for each codec type.

In a preferred implementation of the proposed method, the simplified E-model for determining the R factor is based on:

wherein:

Rmax corresponds to the first constant component representing a maximum value of R,

F1 corresponds to the second component which is a first function dependent on the delay (T), wherein the delay (T) captures: simultaneous impairments of the voice signal (I_s) and delay impairment (I_d) due to the network;

F2 corresponds to the third component which is a second function dependent on the packet loss (PL) and the codec type, the packet loss (PL) capturing an impairment factor I_e and a robustness factor B_pl of the network; and all constants in the simultaneous impairments of the voice signal (I_s) and in the delay impairment (I_d), when T=0, being included in R_max.

Preferably, R_max can be approximated to a value between 80 and 96, and still preferably to about 93, and further preferably to 93.2.

In possible implementations of the method, the first component or function is provided by:

The second component or function can be provided by:

where G.729, G.711, G.722 and G.723 correspond to first, second, third and fourth codec types, respectively.

In a possible implementation, the proposed method comprises calculating, in real time, a Mean Opinion Score (MOS) based on the determined R factor. The MOS calculation from the R factor is performed according to the ITU-T Recommendation G.107 standard (June 2015), wherein the relationship between the R factor and the MOS is provided by:

$$\mathrm{MOS} = \begin{cases} 1, & R \le 0 \\ 1 + 0.035\,R + 7 \times 10^{-6}\,R\,(R - 60)(100 - R), & 0 < R < 100 \\ 4.5, & R \ge 100 \end{cases}$$

According to another aspect, a method for monitoring VoIP packets of a VoIP communication at an endpoint device and estimating therefrom a quality of the VoIP communication is provided. The proposed method comprises the steps of: extracting a codec type from Real-Time Transport Protocol (RTP) packets or from Session Initiation/Description Protocol (SIP/SDP) packets, the RTP and SIP/SDP packets being application layer packets of ongoing VoIP traffic; determining packet loss (PL) from the Cumulative Number of Packet Loss (CPL) field and from the Sender’s Packet Count (SPC) field of the RTP Control Protocol (RTCP) packets; determining delay (T) based on: the one-way delay, calculated as half of the Round-Trip Delay (RTD) extracted from the RTCP sender report (SR) packets; jitter extracted from the interarrival jitter field of the RTCP packets; and a fixed offset parameter δ; and estimating the quality of the VoIP communication by calculating the Mean Opinion Score (MOS) from a rating factor obtained from a simplified E-model, the simplified E-model being a function only of the extracted codec type, the determined delay (T) and the determined packet loss (PL).

In possible implementations of the method, the packet loss (PL) is calculated according to:

In possible implementations of the method, the delay (T) is calculated according to:

wherein D is obtained by subtracting the Last Sender Report (LSR) timestamp (rtcp.ssrc.lsr) and the delay since the last SR (DLSR) value (rtcp.ssrc.dlsr) from the time of reception (A) of the RTCP packet (A-LSR-DLSR); wherein J corresponds to the value of the interarrival jitter field (rtcp.ssrc.jitter); and wherein the fixed offset parameter δ is about 10 ms.
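The delay terms described above can be sketched as follows. The round-trip computation (A-LSR-DLSR in units of 1/65536 s, with 32-bit wrap-around) and the jitter unit conversion follow RFC 3550. The way the three terms are combined into T is an assumption made for illustration (a plain sum of D/2, J and δ), since the equation itself is not reproduced in this text; the function names and the 8000 Hz default clock rate are also illustrative.

```python
def round_trip_delay_ms(a: int, lsr: int, dlsr: int) -> float:
    """D = A - LSR - DLSR, all inputs in 1/65536-second units (RFC 3550)."""
    ticks = (a - lsr - dlsr) & 0xFFFFFFFF  # 32-bit wrap-around arithmetic
    return ticks * 1000.0 / 65536.0


def jitter_ms(jitter_ts_units: int, clock_rate_hz: int = 8000) -> float:
    """Convert the RTCP interarrival jitter field (RTP timestamp units) to ms."""
    return jitter_ts_units * 1000.0 / clock_rate_hz


def delay_t_ms(rtd_ms: float, j_ms: float, offset_ms: float = 10.0) -> float:
    """ASSUMPTION: T is the plain sum of one-way delay (D/2), jitter and offset."""
    return rtd_ms / 2.0 + j_ms + offset_ms


# Example: A = 2 s, LSR = 1 s, DLSR = 0.5 s, jitter = 80 timestamp units at 8 kHz
d = round_trip_delay_ms(0x00020000, 0x00010000, 0x00008000)
t = delay_t_ms(d, jitter_ms(80))
```

Keeping the unit conversions separate from the combination step makes it easy to swap in a different weighting of the jitter term if the deployed formula differs from the assumed sum.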

In possible implementations of the method, network impairments are detected and flagged when the delay (T) varies from 0ms to 2000ms. In possible implementations of the method, use of high bandwidth applications is reduced at the endpoint device when the network impairment detected is determined as being mainly attributed to packet loss (PL).

In possible implementations of the method, the following remedies can be applied when the detected network impairment is determined as being attributed to both packet loss (PL) and to jitter: increasing priority of VoIP traffic; rerouting traffic through less congested endpoints or adapting jitter buffer.

In possible implementations, the method comprises one of issuing a notice/alert for hardware replacement or issuing a notice for contacting the internet service provider when the detected network impairment is mainly attributed to the delay (D).
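The remedy selection described in the preceding paragraphs can be sketched as a small rule table. The text does not state how an impairment is judged to be "mainly attributed" to a given cause, so the sketch below assumes simple per-metric limits that the caller must supply; the function name, parameter names and limit-based logic are all hypothetical.

```python
from typing import List


def suggest_remedies(pl_percent: float, jitter_ms: float, rtd_ms: float, *,
                     pl_limit: float, jitter_limit_ms: float,
                     rtd_limit_ms: float) -> List[str]:
    """Map measured impairments to the remedies listed in the text.

    ASSUMPTION: a metric is treated as a cause when it exceeds its limit;
    the limits are caller-supplied because the text does not specify them.
    """
    loss_high = pl_percent > pl_limit
    jitter_high = jitter_ms > jitter_limit_ms
    remedies: List[str] = []
    if loss_high and jitter_high:
        remedies += [
            "increase priority of VoIP traffic",
            "reroute traffic through less congested endpoints",
            "adapt jitter buffer",
        ]
    elif loss_high:
        remedies.append("reduce use of high-bandwidth applications at the endpoint")
    if rtd_ms > rtd_limit_ms:
        remedies.append("notify: replace hardware or contact the internet service provider")
    return remedies
```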

In possible implementations, the method is executed or performed by a VoIP gateway.

In possible implementations, a plurality of VoIP gateways individually performs the method described above. The VoIP gateways are configured to periodically transmit the quality of the VoIP transmission determined as explained previously, to a centralized network monitoring system, where the combination of the service quality estimations provided by each VoIP gateway provides an indication of the overall QoS of the network.

According to another aspect, a network node that can send and receive audio communication signals through ports is provided. The network node comprises one or more processors and non-transitory storage means having stored thereon instructions causing the one or more processors to perform the method defined above. In possible implementations, the network node is a VoIP gateway.

In possible implementations, the network node is configured and adapted to display a Graphical User Interface (GUI), which can be generated locally by the network node or accessed remotely. The GUI displays at least the quality of service determined in real time, such as the R-factor and/or the MOS, and preferably also at least one of the following parameters of interest: the codec type, the jitter (J), the packet loss (PL), the Sender Packets count and the round-trip delay. Preferably, the GUI can also display alerts when the rate factor and/or MOS are below an associated rate factor or MOS threshold, and can also display alerts relating to the parameters of interest.

In yet another aspect of the invention, a non-transitory storage memory is provided. The non-transitory storage memory has stored thereon instructions for causing one or more processors to perform the method according to any one of the previous claims.

In yet another aspect, a system for assessing the QoS of VoIP communications is provided. The system comprises a plurality of network nodes as described above and a network monitoring system. The network nodes are configured and adapted to each periodically send the quality of audio communication determined to the network monitoring system (such as the R factor or MOS). The network monitoring system receives the QoS estimated by the different network nodes and determines an overall QoS of the network based on the quality of audio communication signals individually determined by each of the plurality of network nodes.

Other features and advantages of the embodiments of the present invention will be better understood upon reading of preferred embodiments thereof with reference to the appended drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a graph of the R-factor as a function of the one-way delay (T), which compares the variation curves associated to two E-model standards (G.711 and G.729) with the variation curve as per the proposed method, according to a possible implementation of the invention.

FIGs. 2A and 2B are graphs of ΔR and ΔMOS as a function of the one-way delay (T), each comparing a prior art approximation with the proposed approximation according to an implementation of the invention.

FIG. 3 is a graph comparing prior art approximations of the variations of effective equipment impairment factor (I_e-eff) related to the codecs as a function of the Probability of Packet Loss (P_pl) with the proposed approximation according to an implementation of the invention, and with the actual measurements of the effective equipment impairment factor related to the codecs.

FIG. 4A is a flow chart of the possible steps of the proposed method, wherein the monitoring includes non-intrusively estimating in real-time the quality of audio communication signals, according to a possible implementation.

FIG. 4B is a block diagram of a monitoring system that estimates the quality of VoIP communications at end points, according to a possible implementation.

FIG. 4C is a simplified algorithm of a possible implementation of steps of the method for non-intrusively estimating the quality of VoIP communications.

FIGs. 5A-5B are graphs illustrating the variation of the MOS as a function of Packet Loss (PL in %) and of one-way delay T (in ms) for two different types of codecs, obtained from possible implementations of the proposed method.

FIG. 6 is a schematic representation of an experimental setup used for testing the proposed system and method, according to one possible implementation.

FIGs. 7-8 are graphs of experimentation results obtained from applying the proposed monitoring method using the setup illustrated in FIG. 6, the graphs also include MOS measured using intrusive Perceptual Evaluation of Speech Quality (PESQ) measures, for comparison purposes.

FIGs. 9A-9D are graphs showing the performance of the proposed method, compared to the measured MOS for various network impairments for PCMU codecs.

FIGs. 10A-10B are examples of graphical user interfaces showing the R-factor and MOS obtained from the proposed non-intrusive estimation method, according to one embodiment.

FIGs. 11A-11B are examples of a terminal-based interface according to one embodiment, in which alerts are displayed when a given parameter exceeds a predetermined threshold.

It should be noted that the appended drawings illustrate only exemplary embodiments of the invention and are therefore not to be construed as limiting of its scope, for the invention may admit to other equally effective embodiments.

DETAILED DESCRIPTION

In the following description, similar features in the drawings have been given similar reference numerals and to not unduly encumber the drawings, some elements may not be indicated in some figures if they were already introduced in a preceding figure. The elements of the drawings are not necessarily depicted to scale, since emphasis is placed on illustrating the elements and the interactions between elements.

The system, network device and method described hereinbelow, for non-intrusively estimating in real-time the quality of audio communication signals, can be used for an efficient and low cost monitoring of VoIP communications and for the troubleshooting of networks used for VoIP communications. The proposed method, device and system allow for a lightweight, non-intrusive and real-time monitoring of VoIP communications from the various endpoint devices across a network. The system, device and method may also be configured and adapted for monitoring other types of media communications over the internet, such as audio communication signals, for which the quality of service is important.

In VoIP communications, voice data is transmitted through a network using data packets. Since a VoIP call is a real-time transmission, User Datagram Protocol (UDP) packets are typically used for the audio payload transmission. Contrary to the Transmission Control Protocol (TCP), UDP is a connectionless communication model and has no guarantee of delivery or mitigation mechanisms, such as retransmission, in case of packet loss. Consequently, a VoIP communication is prone to network degradation and may benefit from quality evaluation.

Different ways of evaluating the quality of a VoIP call exist. Some objective quality measuring algorithms estimate a call degradation without the need for human involvement but are in general intrusive, meaning that they require both original and distorted speech signals for assessing the corresponding quality. In contrast, non-intrusive algorithms rely only on distorted speech signals to estimate a call quality. However, existing methods involve a complex modeling of the impairment models and the extraction of these impairments from the VoIP traffic. Other methods, such as the E-model, have been developed for estimating the subjective MOS measure of a communication. The E-model was introduced by the ITU-T Rec. G.107 to quantify, for a given call, the rating of an average user. The E-model was originally designed as a planning tool to assess a point-to-point voice quality using the subjective MOS measure. The aim is to decide whether or not to carry a VoIP call on a given link. An advantage of a subjective measure, e.g. MOS, is spotting unforeseen impairments that might not be considered in objective measures. The E-model as defined by the ITU-T Rec. G.107 standard involves complex calculations and is therefore not well adapted for the real-time evaluation of QoS at endpoint devices, which often have limited resources allocated to monitoring applications.

Non-intrusive models are referred to as such since they do not compare original and received calls but rather consider impairments and distortions on the voice call. The user call quality satisfaction is measured by the Mean Opinion Score (MOS) which has values between 1 and 5; 1 for dissatisfied, and 5 for satisfied. The primary E-model output is the rating factor, R, that scales from 0 to 100 depending on the quality level.

Different levels of user satisfaction and corresponding MOS values and R factor values are presented in the following table as a reference, where the R factor is a value derived from network-derived metrics that helps evaluate the quality of the experience (or quality of service) for VoIP calls:

Table I: MOS values and user satisfaction levels
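For reference, ITU-T Rec. G.107 (Annex B) groups R values into user-satisfaction bands, and the sketch below encodes those bands as a lookup. The band boundaries and labels are taken from G.107 and are assumed, not confirmed, to match the rows of Table I; the names `G107_BANDS` and `satisfaction_from_r` are illustrative.

```python
# Lower R bounds and user-satisfaction labels as listed in ITU-T G.107, Annex B.
# ASSUMPTION: these are the G.107 bands, not necessarily the rows of Table I.
G107_BANDS = [
    (90.0, "Very satisfied"),
    (80.0, "Satisfied"),
    (70.0, "Some users dissatisfied"),
    (60.0, "Many users dissatisfied"),
    (50.0, "Nearly all users dissatisfied"),
]


def satisfaction_from_r(r: float) -> str:
    """Return the G.107 satisfaction band for a rating factor R."""
    for lower_bound, label in G107_BANDS:
        if r >= lower_bound:
            return label
    # G.107 does not recommend connections with R below 50.
    return "Not recommended"
```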

According to one aspect of the invention, a non-intrusive real-time QoS estimation method is provided. In a possible implementation, the proposed method provides a simple and efficient estimation of the Mean Opinion Score (MOS) reflecting the call quality, based on an accurate simplification of the ITU-T G.107 E-model, using data extracted from VoIP traffic, where said data is, for the most part, extracted from non-payload packets. The proposed method can be used to give an accurate real-time status of the VoIP network. Optionally, troubleshooting options may be provided when impairments are detected. The real-time quality feedback as estimated by the proposed method can be useful in avoiding or limiting service outage by adapting VoIP transmissions accordingly, such as by adopting a lower rate transmission, using a codec adapted to the network conditions, etc.

The proposed system may be deployed at various types of endpoint devices of a network. The proposed system thus makes it possible to decentralize the VoIP quality monitoring to endpoints of a network. Reporting the perceived quality determined by endpoint devices to the core of the network, such as to a centralized network monitoring system, results in frequent and accurate monitoring and rapid troubleshooting. The network devices (or nodes) of the proposed invention are configured and adapted to extract data from the VoIP traffic and can assess the quality of the VoIP communication, and may further provide troubleshooting guidance if estimated quality measures are outside of predefined quality ranges.

The system, device and method proposed herein present several advantages, as will become obvious from the explanations provided below. Such advantages include providing a system that (a) is light-weight and easy to implement in all VoIP network endpoints, (b) provides an accurate approximation of the E-model which is the main standardized method of quality estimation used in research and industry environments, and (c) is reliable and highly sensitive to network impairments.

According to a possible implementation, an accurate approximation of the E-model is used to estimate a MOS value from real-time packet loss and delay measurements collected from VoIP traffic in a non-intrusive way.

In the present application, “endpoint device” refers to any device that may be part of a network carrying the VoIP communication. Such endpoint devices may include, without being limited to, media gateways, servers, IP phones, software-based VoIP endpoints, and mobile devices running voice calling applications.

Approximating the E-model

According to a possible implementation, the proposed system and method provide an estimation of the quality of service of a VoIP communication, by estimating the R factor, and also preferably the MOS, which is derived from the R factor. Given the complexity and the large number of equation series involved in calculating the R factor in the standard E-model, the proposed method approximates the R factor using mostly, and in some embodiments solely, non-payload packets, collected from VoIP communication sessions. Thus, the R factor estimation process can be performed entirely at the different end points of the network, which renders the QoS estimation lightweight and effective.

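The approximation for calculating the R factor (R) is based on parameters available in the RTP Control Protocol (RTCP) packets. Once calculated, the R factor may be converted to MOS, based on the following relationship, as defined in ITU-T Rec. G.107:

$$\mathrm{MOS} = \begin{cases} 1, & R \le 0 \\ 1 + 0.035\,R + 7 \times 10^{-6}\,R\,(R - 60)(100 - R), & 0 < R < 100 \\ 4.5, & R \ge 100 \end{cases}$$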
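The R-to-MOS conversion of ITU-T Rec. G.107 is cheap enough to run per RTCP report at an endpoint. The following is a minimal sketch of that mapping; the function name `r_to_mos` is illustrative, and the clamp to a minimum of 1 in the middle branch is an added safeguard rather than part of G.107 (the cubic dips marginally below 1 for very small positive R).

```python
def r_to_mos(r: float) -> float:
    """Convert a rating factor R to an estimated MOS (ITU-T G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    mos = 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6
    # Safeguard (not in G.107): keep the result on the 1..4.5 scale.
    return max(1.0, mos)


# Example: an unimpaired narrowband call with R = 93.2 maps to a MOS of about 4.41
mos_best = r_to_mos(93.2)
```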

While the ITU recommendation G.107 relies on more than 20 parameters to establish the R factor, it has been found that for VoIP communications, most of these parameters are constant, and that the following parameters are sufficient to accurately approximate the R factor: the codec type, packet loss-related variables, and delay-related variables.

Codec-related variables may be associated to the following parameters: I_e and B_pl, respectively defined as the codec impairment and robustness factors.

Packet loss might occur due to no packet arrival because of network congestion or packet drop due to large delay in packet arrival. Packet loss may also be related to several factors such as the codec type, and packet size. While the E-model uses probability of packet loss as a parameter, the proposed method uses the actual packet loss of captured packets, denoted PL, for approximating the E-model.
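As an illustration of using measured rather than probabilistic loss, the sketch below derives a loss percentage from the two RTCP counters named in the summary: the cumulative number of packets lost (CPL) and the sender's packet count (SPC). The exact formula is not reproduced in this text, so the ratio CPL/SPC used here is an assumption, as are the function name and the clamping behaviour.

```python
def packet_loss_percent(cpl: int, spc: int) -> float:
    """Estimate packet loss in percent from RTCP counters.

    ASSUMPTION: PL = 100 * CPL / SPC. The source names the two fields
    but its formula is not available in this text.
    """
    if spc <= 0:
        return 0.0  # nothing sent yet, so no loss can be measured
    lost = max(cpl, 0)  # CPL is signed in RTCP; duplicates can make it negative
    return min(100.0, 100.0 * lost / spc)
```

Because both counters are cumulative over the session, a monitor that needs loss per reporting interval would apply the same ratio to the differences between two consecutive reports.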

Delay, denoted by T, may be defined as the time elapsed between transmission and reception of a packet from the sender to the receiver. The E-model identifies multiple parameters related to delay:

T, defined as the mean one-way delay of the echo path;

Tr, defined as the round-trip delay in a 4-wire loop; and

Ta, defined as the absolute delay in echo-free connections.

However, in general T and Ta are assumed to be the same and equal to half of Tr. In a preferred embodiment, T is defined by:

In a preferred embodiment, an approximation of the R factor may be calculated based on the impairment variables hereinabove, with R expressed as follows:

$$R = R_o - I_s - I_d - I_{e\text{-}eff} + A \tag{3}$$

where R_o is the signal-to-noise ratio, I_s reflects the simultaneous impairments of the voice signal, I_d reflects the delay impairments, I_e-eff is the effective equipment impairment factor related to the codecs, and A is an advantage factor depending on the communication system application. In some embodiments, A may be equal to zero.

The proposed method is based on the finding of direct dependencies between R and the codec type, packet loss and delay. In the preceding equation, R_o does not depend on any of the impairment factors, I_s and I_d depend only on T, and I_e-eff depends only on the packet loss and the codec variables PL, I_e and B_pl.

According to an implementation of the method, the expression of R of the proposed simplified E-model comprises a first constant component representative of a maximum value of the R factor when there is no impairment on the network (R_max), a second component which is solely dependent on the determined delay (F₁(T)), and a third component which is solely dependent on the determined packet loss (PL) and the extracted codec type (F₂(PL, codec)). Equation (3) may be simplified because none of the components of R is a function of T and PL simultaneously, such that the equation becomes:

where R_max is the constant representing the maximum value of the R factor, where F₁(T) is a function of the delay T based on I_s and I_d, and where F₂(PL, codec) is a function of PL, I_e, and B_pl, which are parameters of I_e-eff associated with the codec. It will be noted that all the constants in I_s and I_d, obtained when T = 0, are included in R_max. The formulation of R in equation (4) reflects the direct impact of the parameters of interest on the VoIP quality. FIG. 1 shows the different components in the R factor (R_max, F₁(T), and F₂(PL, codec)), corresponding to R_max and to the dotted sections 100, 102 and 104, where sections 100 and 102 are linear functions. As can be appreciated, the proposed approximation is very close to the standard E-model, for each of the two codec types. The accuracy of the approximation holds even for delays above 380ms.

Based on the direct numerical application of the detailed formulas in the E-model in ITU-T Rec. G.107, the following values are obtained:

Hence, R_max is given by

In a proposed implementation, F₁(T) of equation (4) may be evaluated by fitting a curve of the R factor as a function of the delay T (ms), where such a curve is known in the prior art. Such an approach is taken because the variation of R with respect to T in the E-model is complex and cannot be explicitly expressed as it involves multiple levels of nonlinear equations. Additionally, the variation of R with T is independent from its variation with PL and the codec. We thus provide the following approximation for the second component F₁(T) of the simplified E-model:

wherein the second component F₁(T) follows a first linear relationship for a first interval (such as between 0 and 170ms); a second linear relationship for a second interval (such as between 170 and 380ms); and a third non-linear relationship for a third interval (above 380ms). It will be noted that this approximation is simple for T < 380ms, which represents the majority of the delay cases. Also, F₁(T) remains the same regardless of the codec as shown in FIG.2A. The second component of the proposed simplified E-model thus follows first and second linear relationships when the delay (T) is less than 400ms, and preferably equal to or less than 380ms.
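The three-interval structure described above can be sketched as a small function factory. Only the structure is taken from the text: the 170ms and 380ms breakpoints, and F₁(0) = 0 (since all constants at T = 0 are included in R_max). The coefficients of equation (9) are not reproduced in this text, so the slopes and the tail function are caller-supplied placeholders, and continuity at the first breakpoint is an assumption. The name `make_f1` is illustrative.

```python
from typing import Callable


def make_f1(slope1: float, slope2: float, tail: Callable[[float], float],
            t1_ms: float = 170.0, t2_ms: float = 380.0) -> Callable[[float], float]:
    """Build F1(T): linear on [0, t1], linear on (t1, t2], `tail` above t2.

    ASSUMPTIONS: slope1, slope2 and tail are placeholders supplied by the
    caller, and the two linear pieces are joined continuously at t1.
    """
    def f1(t_ms: float) -> float:
        if t_ms <= t1_ms:
            return slope1 * t_ms
        if t_ms <= t2_ms:
            return slope1 * t1_ms + slope2 * (t_ms - t1_ms)
        return tail(t_ms)
    return f1
```

For the common case of T below 380ms the evaluation is a single multiply-add, which is the simplification the text points to.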

FIG. 2B illustrates a comparison of the MOS variation as a function of the delay between an approximation well known in the prior art and the approximation according to a preferred embodiment of the proposed method. The accuracy of the approximations is evaluated by defining and comparing the R and MOS differences, ΔR = R_approx - R_E-model and ΔMOS = MOS_approx - MOS_E-model. It becomes evident from FIGs. 2A-2B that the approximation proposed in the present application is generally closer to the E-model R factor, with |ΔR| < 0.8, which is below 1% and corresponds to less than 0.05 MOS units. An advantage of the proposed approximation is that it supports values of T beyond 350ms.

In a preferred embodiment, F₂(PL, codec) of equation (4) is evaluated based on the component of R that depends on the packet loss, namely the effective equipment impairment factor, detailed in the E-model by the following formula:

$$I_{e\text{-}eff} = I_e + (95 - I_e)\,\frac{P_{pl}}{\dfrac{P_{pl}}{BurstR} + B_{pl}} \tag{10}$$

where P_pl is the probability of packet loss and BurstR is the burst ratio, equal to 1 when packet loss is random. The evaluation of the present application is based on equation (10), wherein the probability of packet loss is replaced by the actual packet loss PL. In some embodiments, the proposed expressions of F₂ are given by

According to this proposed implementation, the third component (F₂) of the simplified E-model follows a distinct non-linear, rational function of degree one, for each codec type. While the examples of codec types are all narrowband codec types (about 200 to 3,300Hz), the proposed method and system is also applicable to wideband codec types (such as 50 to 7,000 Hz).
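The effective equipment impairment factor of ITU-T Rec. G.107, on which F₂ is based, can be sketched as below, with the measured packet loss PL (in percent) substituted for the loss probability as the text describes. The function name is illustrative. The example values I_e = 11 and B_pl = 19 for G.729A are quoted from ITU-T G.113 Appendix I as an assumption and are not taken from Table II of this application.

```python
def ie_eff(ie: float, bpl: float, pl_percent: float, burst_r: float = 1.0) -> float:
    """Effective equipment impairment factor (ITU-T G.107).

    ie:         codec impairment factor I_e
    bpl:        codec packet-loss robustness factor B_pl (must be positive)
    pl_percent: packet loss in percent (measured PL used in place of P_pl)
    burst_r:    burst ratio, 1.0 for random loss
    """
    return ie + (95.0 - ie) * pl_percent / (pl_percent / burst_r + bpl)


# Example with assumed G.729A values (I_e = 11, B_pl = 19, per G.113 App. I):
no_loss = ie_eff(11.0, 19.0, 0.0)      # reduces to I_e when there is no loss
heavy_loss = ie_eff(11.0, 19.0, 19.0)  # loss equal to B_pl halves the headroom
```

With random loss (BurstR = 1) the expression is a ratio of two first-degree polynomials in PL, which matches the "rational function of degree one" description for each codec.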

FIG.3 presents a comparison between the proposed method, prior art methods, and actual measurements, for two standards G.729 and G.711. Maximum R factor and MOS values for various codec types are presented in the following table:

Table II: I_e and B_pl values for various codecs

Consequently, in a preferred embodiment, equation (4), the R factor model, becomes

where F₁(T) is given in equation (9) and F₂(PL, codec) = I_e-eff given in equation (10). The obtained result makes it possible to model the MOS variation with T, PL and the codec.

VoIP Monitoring

Now referring to FIG.4A which shows a flowchart of the method for non-intrusively estimating in real-time the quality of audio communication 400, and to FIG.4B, which shows a detailed flowchart of the monitoring section 500, preferred embodiments of the proposed system include steps for monitoring VoIP traffic (410) and analyzing the MOS (420), involving a real-time assessment of the VoIP traffic quality, impairment detection (430), and potential troubleshooting of the network (440).

All VoIP traffic is encapsulated in UDP packets in the transport layer of the TCP/IP protocol. In the application layer, the VoIP traffic involves three types of packets:

- Session Initiation/Description Protocol (SIP/SDP) packets, used for signaling and call initiation/description. The SDP is mainly related to the media description and initial negotiation of the call endpoints.

- RTP (Real-time Transport Protocol) packets, used for delivering the actual audio in the packet’s payload. RTP packets represent over 99.5% of the VoIP traffic.

- RTCP packets, used for providing statistics and control information about the RTP packets. RTCP packets allow the call endpoints to exchange call statistics such as delay, jitter, and packet loss. RTCP packets are sent periodically, report on the preceding RTP stream, and carry no audio payload.

In preferred embodiments, estimating the MOS comprises extracting and analyzing at least the codec (530), the delay (534), the jitter (536), and the packet loss (532) parameters from the VoIP traffic (510 and 520) in a real-time and non-intrusive way. For the proof of concept of the proposed method, the application Wireshark™, an open-source packet analyzer tool, was used to extract these parameters, but other analyzer tools can be used as well.

According to one possible embodiment of the proposed method, the codec type of an ongoing call may be extracted from the “Payload Type” field of the RTP packets. Each codec type is represented in this field by a number, as shown in the following table:

Extracting the codec type from RTP packets allows the codec type to be determined at any moment during the VoIP call, since RTP packets convey the call data. However, RTP packets are relatively large, so extracting the codec from them may require more processing. Also, the “Payload Type” field is dynamic and might provide erroneous values for new codec types. Furthermore, RTP packets may be encrypted, in which case extracting any information from them may become highly complicated.
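As an illustration of the RTP-based option, the following Python sketch reads the 7-bit payload type from the second byte of the fixed RTP header and maps it to a codec name. The mapping lists a few static payload types from RFC 3551; the function names and the fallback label are hypothetical, and dynamic payload types (96 to 127) cannot be resolved this way without the SDP negotiation.

```python
# A few static RTP payload types (RFC 3551); not an exhaustive list.
STATIC_PAYLOAD_TYPES = {
    0: "PCMU",   # G.711 mu-law
    3: "GSM",
    4: "G723",
    8: "PCMA",   # G.711 A-law
    9: "G722",
    18: "G729",
}


def rtp_payload_type(packet: bytes) -> int:
    """Return the payload type carried in the fixed RTP header."""
    if len(packet) < 12:
        raise ValueError("the fixed RTP header is 12 bytes long")
    if packet[0] >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    # Byte 1 holds the marker bit (MSB) followed by the 7-bit payload type.
    return packet[1] & 0x7F


def codec_from_rtp(packet: bytes) -> str:
    """Map the payload type to a codec name, flagging unknown values."""
    pt = rtp_payload_type(packet)
    return STATIC_PAYLOAD_TYPES.get(pt, f"dynamic-or-unknown({pt})")


if __name__ == "__main__":
    header = bytes([0x80, 18]) + bytes(10)   # version 2, payload type 18
    print(codec_from_rtp(header))
```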

Consequently, according to another possible embodiment of the proposed method, the codec type 530 can be extracted from the “MIME Type” field (sdp.mime.type) of the SIP/SDP packets 510. This approach reduces the amount of data to analyze, since SIP/SDP packets account for less than 0.05% of VoIP traffic, making it a lightweight implementation. However, this possible implementation of the method requires capturing packets before the start of the call, since SDP packets are only available in the initialization phase.
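A minimal Python sketch of the SDP-based option follows. It assumes that the codec name is announced in `a=rtpmap` attributes of the SDP body, which is where the encoding name normally appears; the function name and the sample SDP text are illustrative and not taken from the application.

```python
import re

# Matches lines such as "a=rtpmap:18 G729/8000".
_RTPMAP = re.compile(r"^a=rtpmap:(\d+)\s+([^/\s]+)/(\d+)", re.MULTILINE)


def codecs_from_sdp(sdp: str) -> dict:
    """Return {payload_type: (encoding_name, clock_rate_hz)} from an SDP body."""
    return {int(pt): (name, int(rate)) for pt, name, rate in _RTPMAP.findall(sdp)}


if __name__ == "__main__":
    sample = (
        "v=0\r\n"
        "m=audio 49170 RTP/AVP 0 18\r\n"
        "a=rtpmap:0 PCMU/8000\r\n"
        "a=rtpmap:18 G729/8000\r\n"
    )
    print(codecs_from_sdp(sample))
```

Because only the handful of signaling packets at call setup are parsed, this path avoids touching the RTP stream at all.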

The system may then extract the Packet Loss (PL) information 532 from the VoIP communication. The exact value of PL is not available in the VoIP traffic. However, PL may be calculated using two fields of RTCP packets 520: Cumulative number of packets lost (rtcp.ssrc.cum_nr) or CPL, and the Sender’s packet count (rtcp.sender.packetcount) or SPC. CPL is defined as the difference between the expected number of packets to be received and the number of packets actually received. SPC is defined as the total number of transmitted packets since the beginning of the call. The system may then calculate PL percentage as a ratio of CPL over SPC as presented in the following equation:

$$PL = \frac{CPL}{SPC} \times 100\% \qquad (14)$$
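The ratio of CPL over SPC can be sketched in Python as follows. The function name is hypothetical, and the guards are assumptions added for robustness: the result is 0 before any packet has been sent, and it is clamped to the range 0 to 100 because the cumulative loss field can be negative when duplicate packets are received (RFC 3550).

```python
def packet_loss_percent(cpl: int, spc: int) -> float:
    """Packet loss percentage from two RTCP fields.

    cpl -- cumulative number of packets lost (rtcp.ssrc.cum_nr)
    spc -- sender's packet count (rtcp.sender.packetcount)
    """
    if spc <= 0:
        return 0.0  # nothing sent yet, so no loss can be reported
    return min(100.0, max(0.0, 100.0 * cpl / spc))


if __name__ == "__main__":
    print(packet_loss_percent(cpl=50, spc=1000))
```

Since both fields are cumulative, the value describes the elapsed part of the call rather than the most recent reporting interval.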

It will be noted that PL is a measure of the number of lost packets from the beginning of the call. That is, at any given moment of the call, PL gives an indication about the quality of the elapsed part of the call.

The system may further extract delay information 534 and jitter 536 from the VoIP communication. It should be noted that the jitter is not considered as a parameter when computing R in the original E-model but is considered in extensions of the E-model. The system may account for the jitter, denoted by J, in the expression of the delay T as follows:

where

D is the one-way delay calculated as half of the round-trip delay (RTD) that is extracted from the RTCP sender report (SR) packets as follows:

where A is the time of reception of the RTCP packet, LSR is the last SR timestamp (rtcp.ssrc.lsr), defined as the middle 32 bits of the NTP timestamp, and DLSR is the delay since the last SR (rtcp.ssrc.dlsr).
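The round-trip computation can be sketched in Python under the RFC 3550 conventions, which are assumed here: LSR and DLSR are expressed in units of 1/65536 second, A is expressed in the same 32-bit format, and the subtraction is performed modulo 2^32 so that timestamp wraparound is handled. The function names are illustrative.

```python
UNITS_PER_SECOND = 65536  # LSR and DLSR use 16.16 fixed-point seconds


def round_trip_delay_s(arrival: int, lsr: int, dlsr: int) -> float:
    """Round-trip delay in seconds: A - LSR - DLSR, modulo 2**32."""
    rtd_units = (arrival - lsr - dlsr) & 0xFFFFFFFF
    return rtd_units / UNITS_PER_SECOND


def one_way_delay_ms(arrival: int, lsr: int, dlsr: int) -> float:
    """One-way delay D in milliseconds, taken as half of the round-trip delay."""
    return 1000.0 * round_trip_delay_s(arrival, lsr, dlsr) / 2.0


if __name__ == "__main__":
    # Worked example from RFC 3550: A = 0xb7108000, LSR = 0xb7052000,
    # DLSR = 0x00054000 gives a round-trip delay of 6.125 s.
    print(round_trip_delay_s(0xB7108000, 0xB7052000, 0x00054000))
```

Halving the round-trip delay assumes a symmetric path, which is the same simplification the description makes when defining D.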

J is extracted from the interarrival jitter field (rtcp.ssrc.jitter) of the RTCP packets, defined as a smoothed estimate of the variation in the interarrival time of the RTP packets. Parameter δ is an offset related to the processing and buffering time after the packets are captured, which is on average 10 ms (δ ≈ 10 ms).

A summary of the relevant parameters, corresponding modeling and VoIP field sources is presented in the following table:

Table III: Factors conversion from E-model to the proposed monitoring method.

Referring to FIG.4B, once the codec type 530, packet loss 532, delay 534 and jitter 536 information are extracted and analyzed, the system may then calculate the R factor (538) and corresponding MOS (540) of the VoIP communication, as explained in detail in the previous section.

In preferred embodiments, the proposed method extracts data from SDP packets 510 and RTCP packets 520. An advantage of this method over analyzing RTP packets is that the data in each RTCP packet already summarizes the RTP stream preceding that RTCP packet, making the method fast and accurate. Additionally, SDP packets 510 and RTCP packets 520 represent less than 0.5% of the VoIP traffic. Therefore, the proposed monitoring method has the advantage of reduced computing complexity and reduced processing time, further contributing to delivering a real-time MOS.

FIG. 4C presents an exemplary algorithm that may be used by the system to extract and calculate the R factor and the MOS of the VoIP communication.

Network Impairments Detection and Troubleshooting

In preferred embodiments, once the MOS is calculated, network impairments can be detected by the system. Referring to FIG. 4A, this corresponds to steps 420 and 430 of the proposed method.

At step 420, the system may diagnose network impairments according to thresholds of the MOS value or, alternatively, according to thresholds associated with the parameters extracted from the VoIP communication. In some embodiments, a network impairment may be detected when a measure is outside a threshold for a certain period of time. As an example only, the MOS threshold may correspond to the second level of user satisfaction mentioned in Table I, i.e., R > 80 and MOS > 4.03.
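A minimal Python sketch of such a persistence rule is given below. It assumes that the MOS is sampled at regular intervals and that an impairment is flagged once the value has stayed below the threshold for a fixed number of consecutive samples; the function name and the default of three samples are hypothetical choices, while the 4.03 threshold is the example value given above.

```python
def sustained_breaches(mos_samples, threshold=4.03, min_consecutive=3):
    """Return the sample indices at which an impairment is flagged.

    A flag is raised once per episode, at the sample where the MOS has been
    below `threshold` for `min_consecutive` samples in a row.
    """
    alerts = []
    run = 0
    for index, mos in enumerate(mos_samples):
        run = run + 1 if mos < threshold else 0
        if run == min_consecutive:
            alerts.append(index)
    return alerts


if __name__ == "__main__":
    samples = [4.2, 4.0, 3.9, 3.8, 4.1, 3.0, 3.0, 3.0, 3.0]
    print(sustained_breaches(samples))
```

Requiring several consecutive samples keeps a single low reading from triggering a troubleshooting recommendation.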

In some embodiments, quality zones bounding both the delay and the packet loss may be defined, outside of which the system may suggest a troubleshooting recommendation with regard to the network. It will be noted that parameter values and thresholds for diagnosing impairments may change depending on the codec type, since different codecs give different MOS values for the same impairments. FIGs. 5A-5B illustrate the variations of the MOS with T and PL for G.711 and G.729. It will be noted from those figures that when PL = 0% and T = 120 ms, the G.711 MOS is in the best rating category (above 4.34), whereas for G.729 a delay of 120 ms decreases the MOS to the middle of the third category (between 3.6 and 4.03).

In addition, it may be observed that PL strongly affects the call quality, which becomes unsatisfactory for G.729 and G.711 at 10% and 20% packet loss, respectively. It was also shown, during experiments for testing the proposed method, that the call quality remains relatively acceptable even for delays exceeding 250 ms, suggesting that a constant delay alone does not severely degrade the call quality. Hence, detecting network impairments depends strongly on the codec type as well as on the packet loss and the delay, and the proposed method adequately captures this phenomenon.

For each network impairment detected in step 430, the system may, in preferred embodiments, suggest at step 440 a recommended counteracting troubleshooting procedure. Examples of such counteracting troubleshooting procedures are presented in the following table:

Table IV: Troubleshooting recommendations for the major

In preferred embodiments, a user interface for the proposed system and method is provided. In some embodiments, the user interface may be a terminal-based interface. Examples of graphical user interfaces (GUIs) are shown in FIGs. 10A-10B and 11A-11B. The terminal-based interface, which displays the output in a terminal, may be deployed, in some embodiments, in any or every node participating in the VoIP network. As an example, it may be deployed in media gateways, servers, and IP phones. The GUI may be intended for network administrators to visually follow the monitoring procedure. The GUIs are configured and adapted to display alerts when the rate factor and/or MOS are below an associated rate factor or MOS threshold, and preferably also display alerts relating to the parameters of interest.

Real-time information and data may be displayed in the user interface, including, but not limited to:

- Call branch: including the source and destination IP addresses of the analyzed traffic, reflecting the part of the network being monitored;

- Date and time: to time the monitoring results;

- Codec type: captured from SIP/SDP or RTP;

- Jitter, cumulative packet loss, and sender’s packet count, directly extracted from RTCP packets;

- Round trip delay, computed as the double of T in equation (15);

- Percentage of lost packets: computed as in equation (14);

- R factor and MOS: computed as in equations (13) and (1), respectively.

Additionally, the proposed method may provide warning alerts whenever performance falls outside certain pre-specified thresholds.

In some embodiments, the proposed system and method store data related to previous monitoring activities in a separate database.

Experimental Setup and Performance Evaluation

The proposed system, device and method were implemented in a real experimental setup to evaluate their performance. In order to assess the accuracy of the resulting MOS of the proposed method, two reference metrics were introduced:

- A measured MOS based on a third-party library. The corresponding MOS value is calculated from the RTP traffic and requires high processing power, meaning that it may not be implemented at every endpoint of a VoIP network.

- The perceptual evaluation of speech quality (PESQ) described in ITU-T Recommendation P.862, which is highly correlated with MOS. The PESQ score is based on analyzing two signals, an original signal and a corresponding distorted one. Because PESQ is intrusive, it cannot be used to monitor calls in real time. PESQ is only used as a reference to measure the accuracy of the proposed method.

It will be noted that MOS is a subjective measure reflecting the real quality, whereas PESQ is an objective measure based on an algorithm designed to assess the degradation of the call quality.

A setup used for testing the system 600, shown in FIG. 6, comprised two analog phones 602, 602’ connected to the VoIP server 610 via VoIP gateways 604 and a switch 606. In such a setup, VoIP calls are performed between the two analog phones for evaluating the proposed method for three different codecs: G.711 PCM A-Law (PCMA), G.711 PCM μ-Law (PCMU), and G.729. It can thus be understood that an application program running on network devices, such as VoIP gateways, can be used to implement the proposed method. A network emulator 608 called NetEm, linked to the server 610, was used for the impairment generation. When implementing the proposed system in an operational network, the network emulator would be replaced by the real IP network. NetEm 608 can emulate real network behavior by adding delay, packet loss and jitter to the traffic. For each of the three codecs under evaluation, three types of network impairments generated using NetEm were introduced:

- A packet loss of 0, 1, 5, 10, 20, 40, 80, and 100%.

- A progressive packet loss with three cases: ascending, ascending-descending, and descending.

- A delay (in ms) of 250±25, 500±50, 750±75, 1000±100, and 1250±125.

A dedicated testing call was conducted for each impairment. Each call had a duration of 3 minutes and included four voice samples repeated twice. In order to capture the two-way distorted traffic, the proposed method was implemented close to the VoIP server 610. Data was then collected during the VoIP calls using the packet tracing and analyzing tool Wireshark™.

Referring to FIG. 7, the MOS of the proposed system, device and method, along with the measured MOS and the intrusive PESQ measure of the tests, for PL values between 0% and 100%, are presented for the G.729 and PCMA codec types.

As the experimental testing demonstrated, the proposed method yields accurate MOS values, which are close to both the PESQ and the measured MOS over the entire PL range. The proposed method provides MOS values closer to PESQ than the measured MOS does.

Furthermore, it was observed that packet loss had a significant negative impact on the call quality. In fact, when PL increases from 0% to 20%, the MOS drops from about 4 to 2, whereas beyond that point the decrease is slower and the MOS converges to 1 as PL approaches 100%.

Lastly, the codec type has a meaningful impact on the call quality and on the MOS. The G.729 MOS is around 1 MOS unit below the PCMA MOS, which is related to the values of I_e and B_pl for these codecs. Referring to FIGs. 8A-8B, a direct comparison between the MOS of the proposed method and the measured MOS for G.729 and PCMA is presented. It will be noted that the proposed method presents accurate results, as a majority of the points are close to the first bisector (the line along which both values are equal). Additionally, the proposed method is robust, as its accuracy is preserved for a MOS ranging from 1 to 4.5, indicating that the effectiveness of the proposed method is ensured for all quality levels.

In FIGs. 9A-9B, the effects of the different network impairments on the performance of the proposed method can be visualized and compared to the measured MOS. As can be appreciated, the proposed method provides high real-time sensitivity compared to the measured MOS. In the same plot, it is shown that the variation of the measured MOS does not reflect the impairment variation and introduces abrupt MOS drops. For instance, in calls 1 and 2, the proposed method reflects exactly the variation of the delay, whereas the measured MOS remains constant, which constitutes a missed detection event. In addition, in the second part of call 2, the proposed method reflects the smooth variation of the packet loss, whereas the measured MOS presents abrupt drops to 1, which constitute false alarm events.

Moreover, in call 3, where only a packet loss is introduced, the proposed method captures the existence of jitter (which was not introduced as an impairment). This jitter is related to the high packet loss percentage (beyond 60%) that affects the jitter buffer. In fact, the packet loss causes a gradual MOS decrease to a low value greater than 1, while the jitter causes the MOS to be 1. In call 4, where a complete traffic cut was introduced (by physically unplugging the network cable), it is shown that the quality during the rest of the call depends on the cut event, after which the quality recovers. This observation shows that the MOS is related to the entire call. In fact, given the subjectivity of the MOS measure, a call cut, once perceived, affects the user’s opinion during the rest of the call. Hence, a real-time MOS estimation is related to the history of the call. The history here is represented by the cumulative packet loss. The same behavior is observed in call 5, where three separate traffic cuts are introduced.

According to an aspect of the invention, the method can be implemented by a network node that can send and receive audio communication signals through its ports. The network node, such as for example the VoIP gateway of FIG. 6, comprises one or more processors and non-transitory storage means having stored thereon instructions causing the one or more processors to perform the method as described above.

The network node can be configured and adapted to display Graphical User Interfaces (GUI) as described above, which can be generated locally by the network node or accessed remotely, the GUI displaying at least the quality of service determined in real time, such as the R-factor and/or the MOS, and preferably also at least one of the following parameters of interest: the codec type, the jitter (J), the packet loss (PL), the Sender Packets count and the round-trip delay.

In yet other implementations, instructions can be stored on a storage memory, such as non-transitory storage memory, for causing one or more processors of a network device to execute the different steps and calculations described above.

In yet other implementations, a VoIP monitoring system can include a plurality of network nodes and a network monitoring system. The network nodes each periodically send the quality of audio communication (R factor or MOS) determined to the network monitoring system, and the network monitoring system determines in turn an overall quality of service of the network based on the quality of audio communication signals individually determined by each of the plurality of network nodes.

In other possible implementations, the proposed method includes predicting the QoS of audio communications, and more particularly VoIP communications. The proposed method may comprise data collection, with the data stored to track the QoS history. The historic values of the delay (T), the jitter (J), the packet loss (PL), the R factor and the MOS can be used to train machine learning models to predict upcoming/future QoS, i.e. predicted values of MOS and/or R factor. The trained algorithms can be based on, but are not limited to, autoregression algorithms or long short-term memory (LSTM) neural networks, as examples only.
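As one simple illustration of the autoregression option, the Python sketch below fits a first-order autoregressive model, x_t = c + φ·x_{t-1}, to a stored MOS history by ordinary least squares and predicts the next value. The model order, the function names and the sample history are assumptions made for the example; the application does not specify them.

```python
def fit_ar1(series):
    """Least-squares fit of x_t = c + phi * x_(t-1); returns (c, phi)."""
    prev, nxt = series[:-1], series[1:]
    n = len(prev)
    if n < 2:
        raise ValueError("at least three samples are needed")
    mean_prev = sum(prev) / n
    mean_next = sum(nxt) / n
    sxx = sum((a - mean_prev) ** 2 for a in prev)
    if sxx == 0:
        return mean_next, 0.0  # constant history: predict its mean
    sxy = sum((a - mean_prev) * (b - mean_next) for a, b in zip(prev, nxt))
    phi = sxy / sxx
    return mean_next - phi * mean_prev, phi


def predict_next(series):
    """One-step-ahead prediction from the fitted AR(1) model."""
    c, phi = fit_ar1(series)
    return c + phi * series[-1]


if __name__ == "__main__":
    mos_history = [4.0, 3.0, 2.5, 2.25, 2.125]  # illustrative values only
    print(round(predict_next(mos_history), 4))
```

The same history could equally be fed to a higher-order autoregressive model or an LSTM network; the first-order model is only the smallest instance of the idea.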

In summary, the proposed real-time non-intrusive VoIP quality monitoring and troubleshooting method provides, in possible implementations, an accurate approximation of the E-model and a real-time estimation of the MOS using the VoIP traffic. The proposed method therefore lends itself to a lightweight implementation on the various network nodes, contributing to a decentralized VoIP monitoring solution. In the experimental performance evaluation, the proposed method has been evaluated for different network impairments such as packet loss, delay, and jitter. It was demonstrated that the proposed method is accurate (based on the PESQ and measured MOS), robust to network impairments, and highly sensitive to impairment variations. Hence, the proposed method presents a high level of reliability and confidence. Advantageously, the proposed method (and associated devices and system) allows the speech quality in a VoIP call to be estimated based on limited information, including the variation of the codec, the packet loss and the delay through time.

Of course, numerous modifications could be made to the implementations and embodiments described above without departing from the scope of the present disclosure.

Claims

1. A method for non-intrusively estimating in real-time the quality of an audio communication signal at an endpoint device of a network, the method comprising: determining packet loss (PL) and delay (T), and extracting codec type, from IP packets of the audio communication signal passing through a port of the endpoint device, the IP packets used for determining the packet loss (PL) and the delay (T) being non-payload IP packets; and determining the quality of the audio communication signal using a simplified E-model being based on the packet loss (PL), delay (T), and the codec type at the endpoint device.

2. The method according to claim 1, wherein the audio communication signal is a voice communication, and the IP packets are VoIP packets.

3. The method according to claim 1 or 2, wherein determining the quality of the audio communication comprises determining a transmission rating factor, referred to as R factor, obtained by the simplified E-model, the transmission rating factor quantifying a voice signal quality rating by an average user.

4. The method according to claim 3, wherein the simplified E-model used to determine the R factor solely requires determining the packet loss (PL), the delay (T), jitter (J) and the codec type at the endpoint device, and wherein determining the packet loss (PL), jitter (J), and delay (T) is inferred from RTP Control Protocol (RTCP) packets only.

5. The method according to any one of claims 1 to 4, wherein the codec type is extracted from Real-Time Transport Protocol (RTP) packets.

6. The method according to any one of claims 1 to 4, wherein the codec type is extracted from Session Initiation/Description Protocol (SIP/SDP) packets of the VoIP traffic.

7. The method according to any one of claims 1 to 6, wherein the codec type is one of a narrowband (about 200 to 3,300Hz) or wideband codec (about 50 to 7,000Hz).

8. The method according to any one of claims 3 to 7, wherein the simplified E-model comprises a first constant component representative of a maximum value of the R factor when there is no impairment on the network, a second component which is solely dependent on the determined delay (T), and a third component which is solely dependent on the determined packet loss (PL) and the extracted codec type (codec).

9. The method according to claim 8, wherein the second component follows a first linear relationship for a first interval; a second linear relationship for a second interval; and a third non-linear relationship for a third interval.

10. The method according to claim 8, wherein the second component of the simplified E-model follows the first and second linear relationships with the delay (T) when less than 400ms.

11. The method according to any one of claims 8 to 10, wherein the third component of the simplified E-model follows a distinct non-linear, rational function of the delay (T), for each codec type.

12. The method according to any one of claims 8 to 11, wherein the simplified E-model for determining the R factor is based on:

wherein:

Rmax corresponds to the first constant component representing the maximum value of R,

13. The method according to claim 12, wherein R_max can be approximated to a value between 80 and 96.

14. The method according to claim 12 or 13, wherein the first function is provided by:

15. The method according to any one of claims 12 to 14, wherein the second function is provided by:

16. The method according to any one of claims 3 to 15, further comprising calculating, in real-time, a Mean Opinion Score (MOS) based on the R factor determined.

17. The method according to claim 16, wherein the MOS calculation from the R factor is performed according to the ITU-T Recommendation G.107 standard.

18. The method according to claim 16 or 17, wherein the relationship between the R factor and the MOS is provided by:

19. A method for monitoring VoIP packets of a VoIP communication at an endpoint device, and estimating therefrom a quality of the VoIP communication, the method comprising the steps of: extracting a codec type from Real-Time Transport Protocol (RTP) packets or from Session Initiation/Description Protocol (SIP/SDP) packets, the RTP and SIP/SDP packets being application layer packets of ongoing VoIP traffic; determining packet loss (PL) from the Cumulative Number of Packet Loss (CPL) field and from the Sender’s Packet Count (SPC) field of the RTP Control Protocol (RTCP) packets; determining delay (T) based on: the one-way delay, calculated as half of the Round-Trip Delay (RTD) extracted from the RTCP sender report (SR) packets; jitter extracted from the interarrival jitter field of the RTCP packets; and a fixed offset parameter δ; and estimating the quality of the VoIP communication by calculating the Mean Opinion Score (MOS) from a rating factor obtained from a simplified E-model, the simplified E-model being a function only of the extracted codec type, the determined delay (T) and the determined packet loss (PL).

20. The method according to claim 19, wherein determining the packet loss (PL) is calculated according to:

21. The method according to claim 19 or 20, wherein the delay (T) is calculated according to:

wherein D is obtained by subtracting the Last Sender Report (LSR) timestamp (rtcp.ssrc.lsr) and the delay since the last SR (DLSR) timestamp (rtcp.ssrc.dlsr) from the time of reception (A) of the RTCP packet (A-LSR-DLSR); wherein J corresponds to the value of the interarrival jitter field (rtcp.ssrc.jitter); and wherein the fixed offset parameter δ is about 10 ms.

22. The method according to any one of claims 19 to 21, comprising a step of detecting network impairments when the delay (T) varies from 0ms to 2000ms.

23. The method according to claim 22, comprising reducing use of high bandwidth applications at the endpoint device when the network impairment detected is mainly attributed to packet loss (PL).

24. The method according to claim 22 or 23, comprising one of: increasing priority of VoIP traffic; rerouting traffic through less congested endpoints or adapting jitter buffer when the detected network impairment is attributed to both packet loss (PL) and to jitter.

25. The method according to any one of claims 22 to 24, comprising one of issuing a notice for hardware replacement or for contacting internet service provider when the detected network impairment is mainly attributed to the delay (D).

26. The method according to any one of claims 19 to 25, wherein the method is implemented on a VoIP gateway.

27. The method according to claim 26, wherein the VoIP gateway periodically transmits the quality of the VoIP transmission determined to a centralized network monitoring system.

28. The method according to any one of claims 19 to 27, comprising collecting data including historic values of MOS and/or R factor, and predicting QoS based on future values of MOS and/or R factor obtained by prediction algorithms and/or trained machine learning models.

29. A network node for sending and receiving audio communication signals through ports thereof, the network node comprising: one or more processors; and non-transitory storage means having stored thereon instructions causing the one or more processors to perform the method according to any one of claims 1 to 28.

30. The network node according to claim 29, wherein the network node is a VoIP gateway.

31. The network node according to claim 29 or 30, configured and adapted to display a Graphical User Interface (GUI), generated locally by the network node or accessed remotely, the GUI displaying at least the quality of service determined in real time, including the R-factor and/or the MOS, and at least one of the following parameters of interest: the codec type, the jitter (J), the packet loss (PL), the Sender Packets count and the round-trip delay.

32. The network node according to claim 30, wherein the GUI displays alerts when the rate factor and/or MOS are below an associated rate factor or MOS threshold.

33. A system comprising: a plurality of network nodes according to any one of claims 29 to 32, and a network monitoring system, the network nodes each periodically sending the quality of audio communication determined to the network monitoring system, the network monitoring system determining an overall quality of service of the network based on the quality of audio communication signals individually determined by each of the plurality of network nodes.