WO2023202522A1

WO2023202522A1 - Playing speed control method and electronic device

Info

Publication number: WO2023202522A1
Application number: PCT/CN2023/088647
Authority: WO
Inventors: 程戈
Original assignee: Vivo Mobile Communication Co Ltd
Current assignee: Vivo Mobile Communication Co Ltd
Priority date: 2022-04-21
Filing date: 2023-04-17
Publication date: 2023-10-26
Anticipated expiration: 2024-10-21
Also published as: CN114979798A; CN114979798B

Abstract

The present application belongs to the technical field of audio. Disclosed are a playback speed control method and an electronic device. The solution comprises: acquiring target streaming media; segmenting, according to a preset frame length, the voice frames of the target streaming media that have undergone short-time framing, so as to obtain a plurality of long-term frames; determining a long-term spectral energy difference characteristic value of each long-term frame, and determining a first streaming media segment and a second streaming media segment according to the long-term spectral energy difference characteristic values; and outputting the first streaming media segment in the target streaming media at a first speed and the second streaming media segment in the target streaming media at a second speed, wherein the first streaming media segment is streaming media data that contains voice information, the second streaming media segment is streaming media data that does not contain voice information, and the first speed is less than the second speed.

Description

Playback speed control method and electronic device

Cross-reference to related applications

This application claims priority to Chinese patent application No. 202210425500.4, titled "Playback Speed Control Method and Electronic Device" and filed on April 21, 2022, which is incorporated herein by reference in its entirety.

Technical field

This application belongs to the field of audio technology, and specifically relates to a playback speed control method and an electronic device.

Background

With the development of short video platforms, video content has become increasingly rich and varied. High-quality video content is the foundation of a short video platform, yet videos shot by many short video creators are often slow-moving, with a drawn-out playback rhythm. As a result, users of the platform can easily become bored and switch to another video or close the platform.

In the related art, when a user of a short video platform finds the currently playing video segment tedious, the user can advance playback by dragging the progress bar in the video interface. With this approach, however, it is difficult to judge how far to drag, and the user usually has to drag the progress bar repeatedly to reach the desired position in the video. This not only makes it easy to miss key content but also makes operation more cumbersome for the user.

Summary of the invention

The purpose of the embodiments of the present application is to provide a playback speed control method and an electronic device that can solve the problem in the related art that the way of controlling playback speed not only makes it easy for users to miss key content but also makes operation more cumbersome.

In a first aspect, embodiments of the present application provide a playback speed control method. The method includes: obtaining target streaming media; segmenting, according to a preset frame length, the voice frames of the target streaming media that have undergone short-time framing, to obtain multiple long-term frames; determining a long-term spectral energy difference characteristic value of each long-term frame, and determining a first streaming media segment and a second streaming media segment according to the long-term spectral energy difference characteristic values; and outputting the first streaming media segment in the target streaming media at a first speed and the second streaming media segment in the target streaming media at a second speed; wherein the first streaming media segment is streaming media data that contains voice information, the second streaming media segment is streaming media data that does not contain voice information, and the first speed is less than the second speed.

In a second aspect, embodiments of the present application provide a playback speed control apparatus, including an acquisition module, a processing module, and an output module. The acquisition module is configured to obtain target streaming media. The processing module is configured to segment, according to a preset frame length, the voice frames of the target streaming media that have undergone short-time framing, to obtain multiple long-term frames. The processing module is further configured to determine a long-term spectral energy difference characteristic value of each long-term frame, and to determine a first streaming media segment and a second streaming media segment according to the long-term spectral energy difference characteristic values. The output module is configured to output the first streaming media segment in the target streaming media at a first speed and the second streaming media segment in the target streaming media at a second speed; wherein the first streaming media segment is streaming media data that contains voice information, the second streaming media segment is streaming media data that does not contain voice information, and the first speed is less than the second speed.

In a third aspect, embodiments of the present application provide an electronic device. The electronic device includes a processor and a memory, the memory stores programs or instructions that can be run on the processor, and the programs or instructions, when executed by the processor, implement the steps of the method described in the first aspect.

In a fourth aspect, embodiments of the present application provide a readable storage medium. Programs or instructions are stored on the readable storage medium, and the programs or instructions, when executed by a processor, implement the steps of the method described in the first aspect.

In a fifth aspect, embodiments of the present application provide a chip. The chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run programs or instructions to implement the method described in the first aspect.

In a sixth aspect, embodiments of the present application provide a computer program product. The program product is stored in a storage medium and is executed by at least one processor to implement the method described in the first aspect.

In a seventh aspect, embodiments of the present application provide an electronic device configured to implement the steps of the method described in the first aspect.

In the embodiments of the present application, target streaming media can be obtained; the voice frames of the target streaming media that have undergone short-time framing are segmented according to a preset frame length to obtain multiple long-term frames; a long-term spectral energy difference characteristic value of each long-term frame is determined, and a first streaming media segment and a second streaming media segment are determined according to the long-term spectral energy difference characteristic values; the first streaming media segment in the target streaming media is output at a first speed, and the second streaming media segment in the target streaming media is output at a second speed; wherein the first streaming media segment is streaming media data that contains voice information, the second streaming media segment is streaming media data that does not contain voice information, and the first speed is less than the second speed. With this solution, the voice frames of the target streaming media that have undergone short-time framing can be divided into multiple long-term frames, the first and second streaming media segments in the target streaming media can be determined by analyzing the long-term spectral energy difference characteristic values of the long-term frames, and the first streaming media segment can be output at the first speed while the second streaming media segment is output at the second speed. On the one hand, because long-term features are smoother and more stable than short-term features, analyzing the long-term spectral energy difference characteristic values of the long-term frames can improve the accuracy of the analysis result. On the other hand, because the first speed is less than the second speed, that is, the second streaming media segment is played faster than the first streaming media segment, the time wasted on playing segments that contain no voice information is reduced while the user is kept from missing key content in the second streaming media segment; and because the user does not need to make any input while the target streaming media is being output, operation becomes less cumbersome for the user.
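As a rough illustration of the time saved, the total playback time can be estimated by dividing each segment's duration by the speed assigned to it. The following is only a sketch: the function name `playback_duration`, the segment durations, and the speed values are hypothetical and do not come from the application.

```python
def playback_duration(segments, first_speed, second_speed):
    """Return the total playback time in seconds.

    segments: list of (duration_seconds, contains_voice) tuples.
    Segments with voice play at first_speed, the others at second_speed.
    """
    if not 0 < first_speed < second_speed:
        raise ValueError("first speed must be positive and less than second speed")
    total = 0.0
    for duration, contains_voice in segments:
        speed = first_speed if contains_voice else second_speed
        total += duration / speed
    return total


# Hypothetical clip: 20 s of speech, 10 s without speech, 30 s of speech.
clip = [(20.0, True), (10.0, False), (30.0, True)]
print(playback_duration(clip, first_speed=1.0, second_speed=2.0))  # 55.0
```

With these assumed values, the 60-second clip plays in 55 seconds, and only the segment without voice information is shortened.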

Brief description of the drawings

Figure 1 is a schematic flowchart of a playback speed control method provided by an embodiment of the present application;

Figure 2 is the first schematic interface diagram of the playback speed control method provided by an embodiment of the present application;

Figure 3 is the second schematic interface diagram of the playback speed control method provided by an embodiment of the present application;

Figure 4 is the third schematic interface diagram of the playback speed control method provided by an embodiment of the present application;

Figure 5 is a schematic structural diagram of a playback speed control apparatus provided by an embodiment of the present application;

Figure 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;

Figure 7 is a schematic hardware diagram of an electronic device provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application are described clearly below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application fall within the scope of protection of this application.

The terms "first", "second", and the like in the description and claims of this application are used to distinguish similar objects and are not used to describe a specific order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the present application can be practiced in orders other than those illustrated or described herein. Objects distinguished by "first", "second", and the like are usually of one type, and the number of objects is not limited; for example, there may be one first object or multiple first objects. In addition, "and/or" in the description and claims indicates at least one of the connected objects, and the character "/" generally indicates that the objects before and after it are in an "or" relationship.

The playback speed control method provided by the embodiments of the present application is described in detail below with reference to the accompanying drawings, through specific embodiments and their application scenarios.

The playback speed control method provided by the embodiments of the present application may be executed by an electronic device, or by a functional module or functional entity in the electronic device that can implement the method. The electronic devices mentioned in the embodiments of the present application include, but are not limited to, mobile phones, tablets, computers, cameras, and wearable devices. The method is described below using an electronic device as the executing entity.

As shown in Figure 1, an embodiment of the present application provides a playback speed control method, which may include steps 101 to 104:

Step 101: Obtain the target streaming media.

Optionally, the target streaming media is streaming media that includes audio data. For example, it can be a voice message, a recording, music, or an audiobook; it can also be a video with sound.

For example, when the target streaming media is a voice message, the electronic device may obtain it as follows: while a chat interface with other contacts is displayed, the user can perform a click input on a target voice message received by the electronic device, and the electronic device can obtain the target voice message in response to the click input. When the target streaming media is a video with sound, the electronic device may obtain it as follows: if the user wants to watch a target video, the user can perform a click input on the target video, and the electronic device can, in response to the click input, display the video playback interface of the target video and obtain the audio data in the target video.

Optionally, before obtaining the target streaming media, the electronic device may receive a first input from the user while displaying a playback speed setting interface, and determine the first speed and the second speed in response to the first input. The first speed is less than the second speed.

Specifically, a piece of audio may include a first streaming media segment and a second streaming media segment, where the first streaming media segment is streaming media data that contains voice information and the second streaming media segment is streaming media data that does not contain voice information; for example, the second streaming media segment can be a noise segment or a silent segment. During audio playback, what the user wants is the voice information in the first streaming media segment, whereas the second streaming media segment can be passed over. The user can therefore separately set the playback speed of the first streaming media segment, that is, the first speed, and the playback speed of the second streaming media segment, that is, the second speed.

For example, as shown in Figure 2, if the user wants to adjust the audio playback speed of the electronic device, the user can first trigger the electronic device to display a playback speed setting interface. The interface can include two playback speed setting options, "First streaming media segment playback speed adjustment" and "Second streaming media segment playback speed adjustment", where the former corresponds to switch 111 and the latter corresponds to switch 112. To adjust the playback speed of the first streaming media segment, the user can perform a click input on switch 111; in response, the electronic device can set switch 111 to the on state, stop displaying the playback speed setting interface, and display the first streaming media segment playback speed adjustment interface. To adjust the playback speed of the second streaming media segment, the user can perform a click input on switch 112; in response, the electronic device can set switch 112 to the on state, stop displaying the playback speed setting interface, and display the second streaming media segment playback speed adjustment interface.

Figure 3 shows the first streaming media segment playback speed adjustment interface. This interface includes a playback interface 121 for a pre-stored first video, an acceleration control 122, a deceleration control 123, and a confirmation control 124. While the first video is playing, the user can adjust its playback speed by clicking the acceleration control 122 or the deceleration control 123. Once the speech rate sounds comfortable, the user can perform a click input on the confirmation control 124, and the electronic device can, in response to the click input, determine the final adjusted playback speed as the first speed.

Figure 4 shows the second streaming media segment playback speed adjustment interface. This interface includes multiple speed multiplier controls and a confirmation control 131. The user can perform a click input on any one of the speed multiplier controls according to the desired playback speed, and the electronic device can, in response to the click input, highlight the control that the user clicked. The user can then perform a click input on the confirmation control 131, and the electronic device can, in response to that click input, determine the second speed from the multiplier value corresponding to the highlighted control.

It should be noted that the first input may include multiple sub-inputs. For example, it may include an input that triggers the electronic device to display the first streaming media segment playback speed adjustment interface or the second streaming media segment playback speed adjustment interface, and may also include touch inputs on the controls in either of those interfaces.

Based on the above solution, because the first speed and the second speed can be determined according to the first input, the user can customize both playback speeds according to their own needs, which satisfies the varied playback speed requirements of different users.
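The two user-selected speeds and the constraint that the first speed is less than the second speed could be held in a small settings object. This is a minimal sketch; the class name `PlaybackSpeeds` and the example values are hypothetical and are not part of the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaybackSpeeds:
    """Hypothetical container for the speeds chosen through the first input."""

    first_speed: float   # speed for segments that contain voice information
    second_speed: float  # speed for segments that contain no voice information

    def __post_init__(self):
        if self.first_speed <= 0 or self.second_speed <= 0:
            raise ValueError("playback speeds must be positive")
        if not self.first_speed < self.second_speed:
            raise ValueError("first speed must be less than second speed")


# Example: the user settles on 1.25x in the Figure 3 interface and
# selects the 2x multiplier control in the Figure 4 interface.
speeds = PlaybackSpeeds(first_speed=1.25, second_speed=2.0)
print(speeds)
```

Validating the pair at construction time means that later playback code can rely on the ordering of the two speeds without rechecking it.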

Step 102: Segment, according to the preset frame length, the voice frames of the target streaming media that have undergone short-time framing, to obtain multiple long-term frames.

Optionally, the electronic device can use a voice activity detection (VAD) algorithm to detect whether a human voice signal is present in the target streaming media. However, VAD algorithms in the related art may classify part of the noise in non-speech segments as speech. Because long-term features are smoother and more stable than short-term features, the electronic device can use a long-term window to re-segment the voice frames of the target streaming media that have already undergone short-time framing, and then analyze the speech characteristics of the re-segmented frames.

For example, suppose the target streaming media includes 100 short-time frames and the long-term window is 10 frames long. Re-segmenting these 100 short-time frames yields long-term frame 1 to long-term frame 91, where long-term frame 1 includes short-time frames 1 to 10, long-term frame 2 includes short-time frames 2 to 11, and so on.
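The re-segmentation in this example amounts to a sliding window that advances one short-time frame at a time. The sketch below reproduces the 100-frame example; the function name `long_term_frames` is hypothetical, and the short-time frames are represented only by their 1-based indices.

```python
def long_term_frames(short_frames, window_length):
    """Group short-time frames into overlapping long-term frames.

    Each long-term frame holds window_length consecutive short-time
    frames, and successive long-term frames are shifted by one frame.
    """
    if window_length <= 0:
        raise ValueError("window_length must be positive")
    count = len(short_frames) - window_length + 1
    return [short_frames[i:i + window_length] for i in range(max(count, 0))]


short = list(range(1, 101))          # short-time frames 1..100
frames = long_term_frames(short, 10)
print(len(frames))                   # 91
print(frames[0][0], frames[0][-1])   # 1 10
print(frames[1][0], frames[1][-1])   # 2 11
```

In general, M short-time frames and a window of W frames give M - W + 1 long-term frames, which is 91 for M = 100 and W = 10.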

Step 103: Determine the long-term spectral energy difference characteristic value of each long-term frame, and determine the first streaming media segment and the second streaming media segment according to the long-term spectral energy difference characteristic values.

Optionally, after re-segmenting the voice frames, the electronic device can determine the long-term spectral energy difference characteristic value of each long-term frame in the target streaming media.

Specifically, the electronic device can determine the long-term spectral energy difference characteristic value of the l-th long-term frame [formula not reproduced], where the N-order long-term spectral envelope of the l-th frame is [formula not reproduced]. Here x(n) denotes a speech segment that contains noise; X(k, l+j) denotes the amplitude spectrum of the speech at frequency k for the frame offset by j from the l-th frame; N(k) denotes the amplitude spectrum of the background noise at frequency k; and NFFT denotes the number of sampling points in the fast Fourier transform (FFT).

According to the above formulas, when calculating the N-order long-term spectral envelope, the electronic device applies the long-term principle: on top of the short-time amplitude spectrum, it adds a long-term window with a length of 2N+1 frames for the analysis. Because the long-term window enlarges the difference between the long-term spectral envelope (LTE) and the noise amplitude spectrum, speech and noise in the target streaming media can be detected more accurately.
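Since the formulas themselves are not reproduced in this text, the following sketch only illustrates one plausible way of combining the quantities defined above. It assumes the envelope is the per-frequency maximum of the short-time amplitude spectra over the 2N+1 frames around frame l, and that the characteristic value is the average ratio of squared envelope to squared noise spectrum, expressed in decibels. Both assumptions, along with the names `long_term_envelope` and `long_term_difference`, are illustrative; the application's actual definition may differ, including in whether larger or smaller values indicate speech.

```python
import math


def long_term_envelope(spectra, l, order):
    """Per-frequency maximum over frames l-order .. l+order.

    spectra[j][k] is the amplitude spectrum of short-time frame j at
    frequency bin k. The window is clipped at the start and end of the
    signal (an assumption; the text does not say how edges are handled).
    """
    lo = max(0, l - order)
    hi = min(len(spectra), l + order + 1)
    n_bins = len(spectra[0])
    return [max(spectra[j][k] for j in range(lo, hi)) for k in range(n_bins)]


def long_term_difference(spectra, noise, l, order):
    """Assumed form: 10*log10 of the mean of envelope^2 / noise^2 over all bins."""
    if any(n <= 0 for n in noise):
        raise ValueError("noise amplitudes must be positive")
    envelope = long_term_envelope(spectra, l, order)
    ratio = sum((e * e) / (n * n) for e, n in zip(envelope, noise)) / len(noise)
    return 10.0 * math.log10(ratio)


# Toy data: five frames with four bins each, flat noise of amplitude 1.
noise = [1.0, 1.0, 1.0, 1.0]
quiet = [[1.0] * 4 for _ in range(5)]
loud = [row[:] for row in quiet]
loud[2] = [10.0] * 4                              # one energetic frame in the window

print(long_term_difference(quiet, noise, 2, 2))   # 0.0
print(long_term_difference(loud, noise, 2, 2))    # 20.0
```

Taking the maximum over the window is what lets a single energetic frame raise the value for every long-term frame whose window covers it, which is the smoothing effect attributed to long-term features above.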

Optionally, the electronic device may determine the first streaming media segment and the second streaming media segment according to the long-term spectral energy difference characteristic values as follows: when the long-term spectral energy difference characteristic value of a target long-term frame is less than a first threshold, the target long-term frame is determined to be the first streaming media segment; when the long-term spectral energy difference characteristic value of the target long-term frame is greater than the first threshold, the target long-term frame is determined to be the second streaming media segment. The target long-term frame is any one of the multiple long-term frames, and the first threshold is related to a noise estimate and the signal-to-noise ratio.

Optionally, the first threshold is given by [formula not reproduced], where E_h(k) denotes the noise estimate obtained at the current moment, that is, the latest noise estimate, and SNR denotes the signal-to-noise ratio.

Based on the above solution, because the long-term spectral energy difference characteristic value of the target long-term frame can be used to determine whether the target long-term frame is the first streaming media segment or the second streaming media segment, the first and second streaming media segments in the target streaming media can be determined from these characteristic values, which provides the basis for playing different streaming media segments at different speeds.
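The threshold comparison described above can be sketched as follows, with consecutive long-term frames of the same kind merged into runs. The first threshold is taken as a plain parameter because its formula is not reproduced here. The function names, the labels "first" and "second", and the handling of a value exactly equal to the threshold (treated as "second") are assumptions made for illustration.

```python
from itertools import groupby


def label_frame(value, first_threshold):
    """Apply the rule from the text: below the threshold means a first
    (voice) segment, above it means a second (non-voice) segment.
    Equality is not covered by the text; it is treated as "second" here."""
    return "first" if value < first_threshold else "second"


def group_segments(feature_values, first_threshold):
    """Return (label, start_index, end_index) runs of long-term frames."""
    labels = [label_frame(v, first_threshold) for v in feature_values]
    segments = []
    index = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, index, index + length - 1))
        index += length
    return segments


# Hypothetical characteristic values for five long-term frames.
values = [1.0, 2.0, 6.0, 7.0, 3.0]
print(group_segments(values, first_threshold=5.0))
# [('first', 0, 1), ('second', 2, 3), ('first', 4, 4)]
```

Merging per-frame labels into runs gives the contiguous segments to which the first and second speeds can then be applied.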

Optionally, the electronic device may determine the first streaming media segment and the second streaming media segment according to the long-term spectral energy difference characteristic values as follows: when the long-term spectral energy difference characteristic value of a target long-term frame is less than the first threshold, the number of frames containing pitch in the target long-term frame is determined; when the ratio of the number of frames containing pitch to the total number of frames in the target long-term frame is greater than a second threshold, the target long-term frame is determined to be the first streaming media segment; when that ratio is less than the second threshold, the target long-term frame is determined to be the second streaming media segment. The target long-term frame is any one of the multiple long-term frames.

Specifically, for burst noise such as keyboard strokes or knocks on the microphone, the long-term spectral energy difference (LTD) feature of the noise is very similar to that of speech, so misclassification often occurs. A pitch proportion feature can therefore be introduced to assist the decision. The pitch frequency is the vibration frequency of the vocal cords in speech. By performing pitch detection on the target long-term frame, the electronic device can determine the number of frames that contain pitch, M_pitch. The electronic device can then determine the ratio θ_pitch of M_pitch to the total number of frames M in the target long-term frame. When θ_pitch is greater than the second threshold θ_v, the electronic device can determine the target long-term frame to be the first streaming media segment; when θ_pitch is less than the second threshold θ_v, the electronic device can determine the target long-term frame to be the second streaming media segment.

Based on the above solution, whether the target long-term frame is the first streaming media segment or the second streaming media segment can be judged from the ratio of the number of frames containing pitch to the total number of frames in the target long-term frame. On the one hand, the first and second streaming media segments in the target streaming media can be determined based on the number of frames containing pitch, which provides the basis for playing different streaming media segments at different speeds; on the other hand, the accuracy of the judgment can be improved.
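The pitch proportion check can be sketched as below. Pitch detection itself is outside the scope of the sketch: each short-time frame is assumed to already carry a flag saying whether pitch was found. The function names, the flag representation, and the example threshold of 0.5 are hypothetical.

```python
def pitch_ratio(pitch_flags):
    """Ratio of short-time frames with pitch (M_pitch) to all frames (M)."""
    total = len(pitch_flags)
    if total == 0:
        raise ValueError("a long-term frame must contain at least one frame")
    return sum(1 for flag in pitch_flags if flag) / total


def is_first_segment(pitch_flags, second_threshold):
    """True when the pitch ratio exceeds the second threshold (theta_v)."""
    return pitch_ratio(pitch_flags) > second_threshold


# A speech-like long-term frame: 7 of 10 short-time frames contain pitch.
speech_like = [True] * 7 + [False] * 3
# A keyboard-click-like long-term frame: 1 of 10 frames is flagged.
click_like = [True] + [False] * 9

print(is_first_segment(speech_like, second_threshold=0.5))  # True
print(is_first_segment(click_like, second_threshold=0.5))   # False
```

This second check is applied only to frames that already passed the first-threshold comparison, so it filters out burst noise that the long-term feature alone would mistake for speech.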

可选地，电子设备根据长时谱能量差异特征值确定第一流媒体片段和第二流媒体片段，具体可以包括：在目标长时帧的长时谱能量差异特征值小于第一阈值的情况下，将所述目标长时帧的时域语音信号转化为频域能量信号，得到目标流媒体片段；确定所述目标流媒体片段中多个频域能量采样点的平均能量值和最大能量值；在所述最大能量值与所述平均能量值的差处于第一预设范围内的情况下，将所述目标流媒体片段确定为所述第一流媒体片段；在所述最大能量值与所述平均能量值的差未处于所述第一预设范围内的情况下，将所述目标流媒体片段确定为所述第二流媒体片段；其中，所述目标长时帧为所述多个长时帧中的任一个。Optionally, the electronic device determines the first streaming media segment and the second streaming media segment based on the long-term spectrum energy difference characteristic value, which may specifically include: when the long-term spectrum energy difference characteristic value of the target long-term frame is less than the first threshold. , convert the time domain speech signal of the target long time frame into a frequency domain energy signal to obtain the target streaming media segment; determine the average energy value and the maximum energy value of multiple frequency domain energy sampling points in the target streaming media segment; When the difference between the maximum energy value and the average energy value is within a first preset range, determine the target streaming media segment as the first streaming media segment; when the maximum energy value and the If the difference in average energy values is not within the first preset range, the target streaming media segment is determined to be the second streaming media segment; wherein the target long-term frame is the plurality of long-term frames. any time frame.

There are many types of noise, of which the two most common are floor noise and background noise. The signal energy of floor noise is concentrated in the low-frequency part (0 to 1000 Hz), with almost none in the middle and high frequencies, whereas the signal energy of background noise is spread across the entire frequency domain, with energy at every frequency point and a relatively even distribution. The spectral characteristics of a speech signal differ from both types of noise: a speech signal usually has energy across the whole spectrum, but the amount of energy varies irregularly from one frequency point to another. For floor noise, because the signal energy is essentially concentrated in the low-frequency band, when frequency points are selected uniformly across the whole band at the same sampling point to calculate the average energy and the maximum energy, the gap between the two is usually large. For background noise, because its distribution across the whole band is relatively even, the signal energy differs little between frequency points, so when frequency points are selected uniformly across the whole band to calculate the average energy and the maximum energy, the gap between the two is usually very small. For a speech signal, because its energy follows neither pattern, the difference between the average energy and the maximum energy usually falls within a range that is larger than the difference for background noise and smaller than the difference for floor noise. Therefore, these spectral energy characteristics of floor noise, background noise, and speech signals can be used to determine whether a streaming media segment is speech or noise.

Specifically, the electronic device may use the Audition tool to convert the time-domain speech signal in a target long-term frame whose long-term spectral energy difference characteristic value is less than the first threshold into a frequency-domain energy signal, obtaining the target streaming media segment. After the conversion, the electronic device can display the energy at different frequency points, where a deeper color indicates greater energy at a frequency point and a dimmer color indicates smaller energy. Frequency-domain energy sampling points can then be selected from a target streaming media segment with a preset time length and a preset frequency range. For example, 40 frequency-domain energy sampling points can be selected from a target streaming media segment whose preset time length is t₀ and whose preset frequency range is (0, f₀): in the time domain, 5 points are selected uniformly starting from 0.1t₀, namely t = (0.1t₀, 0.3t₀, 0.5t₀, 0.7t₀, 0.9t₀); in the frequency domain, 8 points are selected uniformly from a starting frequency (the starting frequency and the 8 frequency points are given by formulas that are missing from this text). Then, for each time-domain sampling point t, the average energy value and the maximum energy value of its 8 corresponding frequency points are calculated.

After the average energy value and the maximum energy value have been obtained for each of the 5 time-domain sampling points, the 5 average energy values are averaged to give the average energy value E^mean of all frequency-domain energy sampling points, and the 5 maximum energy values are averaged to give the maximum energy value E^max.
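The sampling and averaging steps can be illustrated with a short sketch. It assumes the frequency-domain energy is available as a callable `energy_at(t, f)` (a hypothetical interface, not something the source defines). The time fractions come from the text; the 8 frequency fractions are an assumption, chosen here as evenly spaced bin centres in (0, f₀), because the source formula for the frequency points is not reproduced.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# Time sampling points from the text: 0.1*t0, 0.3*t0, ..., 0.9*t0.
TIME_FRACTIONS = (0.1, 0.3, 0.5, 0.7, 0.9)

# ASSUMPTION: 8 evenly spaced frequency points at the centres of 8 equal bins.
# The source's actual starting frequency and spacing are not available.
ASSUMED_FREQ_FRACTIONS = tuple((2 * k + 1) / 16 for k in range(8))


def sampled_energy_stats(
    energy_at: Callable[[float, float], float],
    t0: float,
    f0: float,
    freq_fractions: Sequence[float] = ASSUMED_FREQ_FRACTIONS,
) -> Tuple[float, float]:
    """Return (E_mean, E_max) over the 5 x 8 grid of energy sampling points."""
    per_time_means = []
    per_time_maxes = []
    for tf in TIME_FRACTIONS:
        # Energies of the 8 frequency points at this time sampling point.
        column = [energy_at(tf * t0, ff * f0) for ff in freq_fractions]
        per_time_means.append(mean(column))
        per_time_maxes.append(max(column))
    # Average the 5 per-time averages and the 5 per-time maxima.
    return mean(per_time_means), mean(per_time_maxes)


# Hypothetical usage: energy grows linearly with frequency.
e_mean, e_max = sampled_energy_stats(lambda t, f: f, t0=1.0, f0=16.0)
```

Passing `freq_fractions` explicitly lets a reader substitute the actual frequency points once they are known.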

The electronic device can then calculate the difference between the maximum energy value E^max and the average energy value E^mean, and determine whether the difference is within the first preset range (a, b):

If E^max - E^mean ≥ b, the electronic device can determine that the target streaming media segment is a floor noise signal; that is, when the difference between the maximum energy value and the average energy value is not within the first preset range, the electronic device can determine the target streaming media segment as the second streaming media segment;

If E^max - E^mean ≤ a, the electronic device can determine that the target streaming media segment is a background noise signal; that is, when the difference between the maximum energy value and the average energy value is not within the first preset range, the electronic device can determine the target streaming media segment as the second streaming media segment;

If a < E^max - E^mean < b, the electronic device can determine that the target streaming media segment is a speech signal; that is, when the difference between the maximum energy value and the average energy value is within the first preset range, the electronic device can determine the target streaming media segment as the first streaming media segment.
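The three cases above amount to a range test on the gap E^max - E^mean. A minimal sketch follows; the function names, the string labels, and the numeric bounds in the usage lines are illustrative assumptions, since the source gives no concrete values for a and b.

```python
def classify_by_energy_gap(e_max: float, e_mean: float, a: float, b: float) -> str:
    """Classify a target segment from the gap between maximum and average energy."""
    if not a < b:
        raise ValueError("the first preset range (a, b) requires a < b")
    gap = e_max - e_mean
    if gap >= b:
        return "floor_noise"       # energy concentrated in a few low-frequency points
    if gap <= a:
        return "background_noise"  # energy spread evenly across frequency points
    return "speech"                # a < gap < b


def segment_kind(label: str) -> str:
    """Map the signal label to the first (voice) or second (non-voice) segment."""
    return "first" if label == "speech" else "second"


# Hypothetical bounds a=3, b=20 in arbitrary energy units.
kind = segment_kind(classify_by_energy_gap(e_max=25.0, e_mean=15.0, a=3.0, b=20.0))
```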

Based on the above solution, when the long-term spectral energy difference characteristic value of the target long-term frame is less than the first threshold, the time-domain speech signal of the target long-term frame can be converted into a frequency-domain energy signal to obtain the target streaming media segment, the average energy value and the maximum energy value of multiple frequency-domain energy sampling points in the target streaming media segment can be determined, and whether the target streaming media segment is the first streaming media segment or the second streaming media segment can be determined from the difference between the maximum energy value and the average energy value. Therefore, on the one hand, the first streaming media segment and the second streaming media segment in the target streaming media can be determined based on frequency-domain energy, which provides a basis for playing different streaming media segments at different speeds; on the other hand, the accuracy of the determination result can be improved.

Step 104: Output the first streaming media segment in the target streaming media at the first speed, and output the second streaming media segment in the target streaming media at the second speed.

Specifically, the electronic device may output the first streaming media segment of the target streaming media at the first speed and output the second streaming media segment of the target streaming media at the second speed. That is, while playing the target streaming media, the electronic device can switch the playback speed between different streaming media segments.
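The effect of switching speeds per segment can be shown with a small calculation of total playback time. This is a sketch under stated assumptions: segments are represented as hypothetical (duration in seconds, contains voice) pairs, and the speeds in the usage lines are example values, not values from the source.

```python
from typing import Iterable, Tuple


def playback_duration(
    segments: Iterable[Tuple[float, bool]],
    first_speed: float,
    second_speed: float,
) -> float:
    """Total wall-clock time when voice segments play at first_speed
    and non-voice segments play at second_speed."""
    # The method requires the first speed to be less than the second speed.
    if not 0 < first_speed < second_speed:
        raise ValueError("expected 0 < first_speed < second_speed")
    total = 0.0
    for duration, contains_voice in segments:
        speed = first_speed if contains_voice else second_speed
        total += duration / speed
    return total


# Hypothetical clip: 60 s speech, 30 s non-speech, 60 s speech.
clip = [(60.0, True), (30.0, False), (60.0, True)]
seconds = playback_duration(clip, first_speed=1.0, second_speed=2.0)
```

With these example values the 150 s clip plays in 135 s, because only the non-voice segment is shortened.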

In the embodiment of the present application, the speech frames of the target streaming media that have undergone short-time framing can be divided into multiple long-term frames, the first streaming media segment and the second streaming media segment in the target streaming media can be determined by analyzing the long-term spectral energy difference characteristic value of each long-term frame, and the first streaming media segment is output at the first speed while the second streaming media segment is output at the second speed. Therefore, on the one hand, because long-term features are smoother and more stable than short-term features, analyzing the long-term spectral energy difference characteristic value of the long-term frames can improve the accuracy of the analysis result. On the other hand, because the first speed is less than the second speed, that is, the second streaming media segment is played faster than the first streaming media segment, the time wasted in playing streaming media segments that contain no voice information is reduced while the user is kept from missing key content in the second streaming media segment; moreover, because the user does not need to provide any input while the target streaming media is output, user operation is simplified.

The playback speed control method provided by the embodiment of the present application may be executed by a playback speed control device. In the embodiment of the present application, the playback speed control device is described by taking as an example the case in which the playback speed control device executes the playback speed control method.

As shown in Figure 5, the embodiment of the present application further provides a playback speed control device 500, including an acquisition module 501, a processing module 502, and an output module 503. The acquisition module 501 is configured to acquire target streaming media. The processing module 502 is configured to divide, according to a preset frame length, the speech frames of the target streaming media that have undergone short-time framing, to obtain multiple long-term frames. The processing module 502 is further configured to determine the long-term spectral energy difference characteristic value of each long-term frame, and to determine the first streaming media segment and the second streaming media segment according to the long-term spectral energy difference characteristic value. The output module 503 is configured to output the first streaming media segment in the target streaming media at the first speed and to output the second streaming media segment in the target streaming media at the second speed. The first streaming media segment is streaming media data that contains voice information, the second streaming media segment is streaming media data that does not contain voice information, and the first speed is less than the second speed.

Optionally, the processing module 502 is specifically configured to: when the long-term spectral energy difference characteristic value of the target long-term frame is less than a first threshold, determine the target long-term frame as the first streaming media segment; and when the long-term spectral energy difference characteristic value of the target long-term frame is greater than the first threshold, determine the target long-term frame as the second streaming media segment; wherein the target long-term frame is any one of the multiple long-term frames, and the first threshold is related to the noise estimate and the signal-to-noise ratio.
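The threshold comparison performed by the processing module can be sketched as below. The function names and numeric values are hypothetical; how the characteristic value and the first threshold are computed (from the noise estimate and signal-to-noise ratio) is not shown here. The source does not specify the case where the value equals the threshold, so the sketch assumes it is labeled "second". Merging adjacent frames with the same label into runs is an added convenience, not a step stated in the source.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def classify_long_frames(feature_values: Sequence[float], first_threshold: float) -> List[str]:
    """Label each long-term frame from its spectral energy difference value."""
    # Below the first threshold: "first" (contains voice information).
    # ASSUMPTION: a value equal to the threshold is labeled "second".
    return ["first" if v < first_threshold else "second" for v in feature_values]


def merge_runs(labels: Sequence[str]) -> List[Tuple[str, int]]:
    """Collapse consecutive equal labels into (label, frame_count) runs."""
    return [(label, len(list(group))) for label, group in groupby(labels)]


# Hypothetical characteristic values for five long-term frames.
labels = classify_long_frames([0.2, 0.4, 1.5, 1.7, 0.3], first_threshold=1.0)
runs = merge_runs(labels)
```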

Optionally, the processing module 502 is specifically configured to: when the long-term spectral energy difference characteristic value of the target long-term frame is less than the first threshold, determine the number of frames containing pitch in the target long-term frame; when the ratio of the number of frames containing pitch to the total number of frames of the target long-term frame is greater than a second threshold, determine the target long-term frame as the first streaming media segment; and when the ratio of the number of frames containing pitch to the total number of frames of the target long-term frame is less than the second threshold, determine the target long-term frame as the second streaming media segment; wherein the target long-term frame is any one of the multiple long-term frames.

Optionally, the processing module 502 is specifically configured to: when the long-term spectral energy difference characteristic value of the target long-term frame is less than the first threshold, convert the time-domain speech signal of the target long-term frame into a frequency-domain energy signal to obtain a target streaming media segment; determine the average energy value and the maximum energy value of multiple frequency-domain energy sampling points in the target streaming media segment; when the difference between the maximum energy value and the average energy value is within a first preset range, determine the target streaming media segment as the first streaming media segment; and when the difference between the maximum energy value and the average energy value is not within the first preset range, determine the target streaming media segment as the second streaming media segment; wherein the target long-term frame is any one of the multiple long-term frames.

Optionally, with continued reference to Figure 5, the device 500 further includes a receiving module 504. The receiving module 504 is configured to receive a first input from the user while a playback speed setting interface is displayed. The processing module 502 is further configured to determine the first speed and the second speed in response to the first input.

In the embodiment of the present application, the speech frames of the target streaming media that have undergone short-time framing can be divided into multiple long-term frames, the first streaming media segment and the second streaming media segment in the target streaming media can be determined by analyzing the long-term spectral energy difference characteristic value of each long-term frame, and the first streaming media segment is output at the first speed while the second streaming media segment is output at the second speed. Therefore, on the one hand, because long-term features are smoother and more stable than short-term features, analyzing the long-term spectral energy difference characteristic value of the long-term frames can improve the accuracy of the analysis result. On the other hand, because the first speed is less than the second speed, that is, the second streaming media segment is played faster than the first streaming media segment, the time wasted in playing streaming media segments that contain no voice information is reduced while the user is kept from missing key content in the second streaming media segment; moreover, because the user does not need to provide any input while the target streaming media is output, user operation is simplified.

The playback speed control device in the embodiment of the present application may be an electronic device, or may be a component of an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal or a device other than a terminal. By way of example, the electronic device may be a mobile phone, a tablet computer, a notebook computer, a handheld computer, a vehicle-mounted electronic device, a mobile Internet device (MID), an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA); it may also be a server, network attached storage (NAS), a personal computer (PC), a television (TV), a teller machine, or a self-service machine. The embodiments of the present application impose no specific limitation.

The playback speed control device in the embodiment of the present application may be a device with an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system; the embodiments of the present application impose no specific limitation.

The playback speed control device provided by the embodiment of the present application can implement each process implemented by the method embodiments of Figures 1 to 4. To avoid repetition, details are not described again here.

Optionally, as shown in Figure 6, the embodiment of the present application further provides an electronic device 600, including a processor 601 and a memory 602. The memory 602 stores a program or instructions that can run on the processor 601. When the program or instructions are executed by the processor 601, each step of the above playback speed control method embodiment is implemented and the same technical effect can be achieved. To avoid repetition, details are not described again here.

It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic devices and non-mobile electronic devices described above.

Figure 7 is a schematic diagram of the hardware structure of an electronic device that implements an embodiment of the present application.

The electronic device 1000 includes, but is not limited to, components such as a radio frequency unit 1001, a network module 1002, an audio output unit 1003, an input unit 1004, a sensor 1005, a display unit 1006, a user input unit 1007, an interface unit 1008, a memory 1009, and a processor 1010.

Those skilled in the art will understand that the electronic device 1000 may also include a power supply (such as a battery) that supplies power to the various components. The power supply may be logically connected to the processor 1010 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system. The structure of the electronic device shown in Figure 7 does not constitute a limitation on the electronic device: the electronic device may include more or fewer components than shown, combine certain components, or arrange the components differently, which will not be described in detail here.

The processor 1010 is configured to acquire target streaming media. The processor 1010 is further configured to divide, according to a preset frame length, the speech frames of the target streaming media that have undergone short-time framing, to obtain multiple long-term frames. The processor 1010 is further configured to determine the long-term spectral energy difference characteristic value of each long-term frame, and to determine the first streaming media segment and the second streaming media segment according to the long-term spectral energy difference characteristic value. The audio output unit 1003 or the display unit 1006 is configured to output the first streaming media segment in the target streaming media at the first speed and to output the second streaming media segment in the target streaming media at the second speed. The first streaming media segment is streaming media data that contains voice information, the second streaming media segment is streaming media data that does not contain voice information, and the first speed is less than the second speed.

Optionally, the processor 1010 is specifically configured to: when the long-term spectral energy difference characteristic value of the target long-term frame is less than a first threshold, determine the target long-term frame as the first streaming media segment; and when the long-term spectral energy difference characteristic value of the target long-term frame is greater than the first threshold, determine the target long-term frame as the second streaming media segment; wherein the target long-term frame is any one of the multiple long-term frames, and the first threshold is related to the noise estimate and the signal-to-noise ratio.

In the embodiment of the present application, whether the target long-term frame is the first streaming media segment or the second streaming media segment can be determined from the long-term spectral energy difference characteristic value of the target long-term frame. Therefore, the first streaming media segment and the second streaming media segment in the target streaming media can be determined based on the long-term spectral energy difference characteristic value, which provides a basis for playing different streaming media segments at different speeds.

Optionally, the processor 1010 is specifically configured to: when the long-term spectral energy difference characteristic value of the target long-term frame is less than the first threshold, determine the number of frames containing pitch in the target long-term frame; when the ratio of the number of frames containing pitch to the total number of frames of the target long-term frame is greater than a second threshold, determine the target long-term frame as the first streaming media segment; and when the ratio of the number of frames containing pitch to the total number of frames of the target long-term frame is less than the second threshold, determine the target long-term frame as the second streaming media segment; wherein the target long-term frame is any one of the multiple long-term frames.

In the embodiment of the present application, whether the target long-term frame is the first streaming media segment or the second streaming media segment can be determined from the ratio of the number of frames containing pitch in the target long-term frame to the total number of frames in the target long-term frame. Therefore, on the one hand, the first streaming media segment and the second streaming media segment in the target streaming media can be determined based on the number of pitch-containing frames, which provides a basis for playing different streaming media segments at different speeds; on the other hand, the accuracy of the determination result can be improved.

Optionally, the processor 1010 is specifically configured to: when the long-term spectral energy difference characteristic value of the target long-term frame is less than the first threshold, convert the time-domain speech signal of the target long-term frame into a frequency-domain energy signal to obtain a target streaming media segment; determine the average energy value and the maximum energy value of multiple frequency-domain energy sampling points in the target streaming media segment; when the difference between the maximum energy value and the average energy value is within a first preset range, determine the target streaming media segment as the first streaming media segment; and when the difference between the maximum energy value and the average energy value is not within the first preset range, determine the target streaming media segment as the second streaming media segment; wherein the target long-term frame is any one of the multiple long-term frames.

In the embodiment of the present application, when the long-term spectral energy difference characteristic value of the target long-term frame is less than the first threshold, the time-domain speech signal of the target long-term frame can be converted into a frequency-domain energy signal to obtain the target streaming media segment, the average energy value and the maximum energy value of multiple frequency-domain energy sampling points in the target streaming media segment can be determined, and whether the target streaming media segment is the first streaming media segment or the second streaming media segment can be determined from the difference between the maximum energy value and the average energy value. Therefore, on the one hand, the first streaming media segment and the second streaming media segment in the target streaming media can be determined based on frequency-domain energy, which provides a basis for playing different streaming media segments at different speeds; on the other hand, the accuracy of the determination result can be improved.

Optionally, the user input unit 1007 is configured to receive a first input from the user while a playback speed setting interface is displayed; the processor 1010 is further configured to determine the first speed and the second speed in response to the first input.

In the embodiment of the present application, because the first speed and the second speed can be determined according to the first input, the user can customize both playback speeds as needed, which meets the varied playback speed requirements of different users.

It should be understood that, in the embodiment of the present application, the input unit 1004 may include a graphics processing unit (GPU) 10041 and a microphone 10042. The graphics processing unit 10041 processes image data of still pictures or video obtained by an image capture device (such as a camera) in video capture mode or image capture mode. The display unit 1006 may include a display panel 10061, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 1007 includes at least one of a touch panel 10071 and other input devices 10072. The touch panel 10071 is also called a touch screen and may include two parts: a touch detection device and a touch controller. The other input devices 10072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, which are not described in detail here.

The memory 1009 may be used to store software programs and various data. The memory 1009 may mainly include a first storage area for storing programs or instructions and a second storage area for storing data, where the first storage area may store an operating system and the application programs or instructions required for at least one function (such as a sound playback function or an image playback function). In addition, the memory 1009 may include volatile memory or non-volatile memory, or the memory 1009 may include both volatile and non-volatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or flash memory. The volatile memory may be random access memory (RAM), static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), synchronous dynamic random access memory (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDRSDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (Synch link DRAM, SLDRAM), or direct Rambus random access memory (Direct Rambus RAM, DRRAM). The memory 1009 in the embodiment of the present application includes, but is not limited to, these and any other suitable types of memory.

The processor 1010 may include one or more processing units. Optionally, the processor 1010 integrates an application processor and a modem processor, where the application processor mainly handles operations involving the operating system, user interface, and application programs, and the modem processor, such as a baseband processor, mainly handles wireless communication signals. It can be understood that the modem processor may alternatively not be integrated into the processor 1010.

An embodiment of the present application further provides a readable storage medium storing a program or instructions. When the program or instructions are executed by a processor, each process of the above playback speed control method embodiment is implemented and the same technical effect can be achieved. To avoid repetition, details are not repeated here.

The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a computer read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

An embodiment of the present application further provides a chip. The chip includes a processor and a communication interface coupled to the processor. The processor is used to run a program or instructions to implement each process of the above playback speed control method embodiment, and the same technical effect can be achieved. To avoid repetition, details are not repeated here.

It should be understood that the chip mentioned in the embodiments of the present application may also be called a system-level chip, a system chip, a chip system, or a system-on-chip.

An embodiment of the present application provides a computer program product. The program product is stored in a storage medium and is executed by at least one processor to implement each process of the above playback speed control method embodiment, and the same technical effect can be achieved. To avoid repetition, details are not repeated here.

An embodiment of the present application further provides an electronic device configured to implement each process of the above method embodiment, and the same technical effect can be achieved. To avoid repetition, details are not repeated here.

It should be noted that, in this document, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the statement "includes a..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element. In addition, it should be pointed out that the scope of the methods and devices in the embodiments of the present application is not limited to performing functions in the order shown or discussed; functions may also be performed in a substantially simultaneous manner or in the reverse order, depending on the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.

From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the related art, can be embodied in the form of a computer software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.

The embodiments of the present application have been described above with reference to the accompanying drawings. However, the present application is not limited to the specific implementations described above, which are merely illustrative and not restrictive. Inspired by the present application, those of ordinary skill in the art can devise many other forms without departing from the purpose of the present application and the scope protected by the claims, all of which fall within the protection of the present application.

Claims

A playback speed control method, comprising:

Acquire target streaming media;

Segment, according to a preset frame length, the voice frames of the target streaming media that have undergone short-time framing, to obtain multiple long-term frames;

Determine the long-term spectrum energy difference characteristic value of each long-term frame respectively, and determine the first streaming media segment and the second streaming media segment based on the long-term spectrum energy difference characteristic value;

Output the first streaming media segment in the target streaming media at a first speed, and output the second streaming media segment in the target streaming media at a second speed;

Wherein, the first streaming media segment is streaming media data containing voice information, the second streaming media segment is streaming media data not containing voice information, and the first speed is lower than the second speed.
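The steps recited above can be illustrated with a minimal Python sketch. It is not the claimed implementation: it assumes the preset frame length is expressed as a number of short-time frames per long-term frame, keeps a shorter trailing group, and takes the voice decision as a caller-supplied callback. The names `group_into_long_frames`, `assign_speeds`, and `contains_voice` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]  # one short-time frame of audio samples


def group_into_long_frames(short_frames: Sequence[Frame],
                           frames_per_long_frame: int) -> List[List[Frame]]:
    """Group consecutive short-time frames into long-term frames.

    Assumption: the preset frame length is a count of short-time frames,
    and a shorter trailing group is kept rather than dropped.
    """
    if frames_per_long_frame <= 0:
        raise ValueError("frames_per_long_frame must be positive")
    return [list(short_frames[i:i + frames_per_long_frame])
            for i in range(0, len(short_frames), frames_per_long_frame)]


def assign_speeds(long_frames: Sequence[List[Frame]],
                  contains_voice: Callable[[List[Frame]], bool],
                  first_speed: float,
                  second_speed: float) -> List[Tuple[int, float]]:
    """Return (long-term frame index, playback speed) pairs.

    Frames judged to contain voice get the first (slower) speed,
    all other frames get the second (faster) speed.
    """
    if not first_speed < second_speed:
        raise ValueError("first_speed must be lower than second_speed")
    return [(index, first_speed if contains_voice(frame) else second_speed)
            for index, frame in enumerate(long_frames)]
```

Passing the voice decision in as a callback keeps the sketch independent of how the long-term spectrum energy difference characteristic value is actually computed.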

The playback speed control method according to claim 1, wherein the determining the first streaming media segment and the second streaming media segment according to the long-term spectrum energy difference characteristic value includes:

When the long-term spectrum energy difference characteristic value of a target long-term frame is less than a first threshold, determine the target long-term frame as the first streaming media segment;

When the long-term spectrum energy difference characteristic value of the target long-term frame is greater than the first threshold, determine the target long-term frame as the second streaming media segment;

Wherein, the target long-term frame is any one of the plurality of long-term frames, and the first threshold is related to the noise estimate value and the signal-to-noise ratio.
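The threshold rule above can be sketched as a small Python function. It assumes the characteristic value and the first threshold are already available as numbers; how the threshold is derived from the noise estimate and the signal-to-noise ratio is not shown here. The function name and the string labels are hypothetical, and the equality case, which the text does not address, is treated as the second segment purely as an assumption.

```python
def classify_by_threshold(feature_value: float, first_threshold: float) -> str:
    """Label one long-term frame from its characteristic value.

    Returns "first" (contains voice) when the value is below the threshold
    and "second" (no voice) when it is above the threshold.
    """
    if feature_value < first_threshold:
        return "first"
    # Assumption: a value exactly equal to the threshold is not covered
    # by the rule above, so this sketch groups it with "second".
    return "second"
```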

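The playback speed control method according to claim 1, wherein the determining the first streaming media segment and the second streaming media segment according to the long-term spectrum energy difference characteristic value includes:

When the long-term spectrum energy difference characteristic value of the target long-term frame is less than a first threshold, determine the number of frames containing pitch in the target long-term frame;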

When the ratio of the number of frames containing pitch to the total number of frames of the target long-term frame is greater than the second threshold, determine the target long-term frame as the first streaming media segment;

When the ratio of the number of frames containing pitch to the total number of frames of the target long-term frame is less than the second threshold, determine the target long-term frame as the second streaming media segment;

Wherein, the target long-term frame is any one of the plurality of long-term frames.
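The pitch-ratio check above can be sketched as follows. The sketch assumes pitch detection has already been run on each short-time frame and is passed in as a list of booleans, since no pitch detector is specified here. The function name is hypothetical, and a ratio exactly equal to the second threshold is grouped with the second segment as an assumption.

```python
from typing import Sequence


def classify_by_pitch_ratio(frame_has_pitch: Sequence[bool],
                            second_threshold: float) -> str:
    """Label a long-term frame by the share of its frames that contain pitch.

    frame_has_pitch holds one flag per short-time frame in the long-term frame.
    """
    total = len(frame_has_pitch)
    if total == 0:
        raise ValueError("a long-term frame must contain at least one frame")
    pitched = sum(1 for flag in frame_has_pitch if flag)
    ratio = pitched / total
    # Assumption: equality with the threshold is treated as "second".
    return "first" if ratio > second_threshold else "second"
```

For example, three pitched frames out of four give a ratio of 0.75, which exceeds a second threshold of 0.5 and yields the first streaming media segment.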

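The playback speed control method according to claim 1, wherein the determining the first streaming media segment and the second streaming media segment according to the long-term spectrum energy difference characteristic value includes:

When the long-term spectrum energy difference characteristic value of the target long-term frame is less than a first threshold, convert the time-domain speech signal of the target long-term frame into a frequency-domain energy signal to obtain the target streaming media segment;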

Determine the average energy value and the maximum energy value of multiple frequency domain energy sampling points in the target streaming media segment;

If the difference between the maximum energy value and the average energy value is within a first preset range, determine the target streaming media segment as the first streaming media segment;

If the difference between the maximum energy value and the average energy value is not within the first preset range, determine the target streaming media segment as the second streaming media segment;

Wherein, the target long-term frame is any one of the plurality of long-term frames.
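The peak-versus-average energy check above can be sketched in Python. This is an illustration under stated assumptions: a naive discrete Fourier transform stands in for whatever time-to-frequency conversion is actually used, the energy at each bin is taken as the squared magnitude, and the first preset range is treated as an inclusive `(low, high)` pair whose values are not given in the text. The function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def energy_spectrum(samples: Sequence[float]) -> List[float]:
    """Return squared DFT magnitudes for bins 0..N//2 (naive O(N^2) DFT)."""
    n = len(samples)
    if n == 0:
        raise ValueError("samples must not be empty")
    energies = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(samples))
        energies.append(abs(acc) ** 2)
    return energies


def classify_by_peak_to_average(samples: Sequence[float],
                                preset_range: Tuple[float, float]) -> str:
    """Label a segment by the gap between its peak and mean spectral energy."""
    energies = energy_spectrum(samples)
    average = sum(energies) / len(energies)
    peak = max(energies)
    low, high = preset_range  # assumption: inclusive bounds
    return "first" if low <= peak - average <= high else "second"
```

A segment whose energy is concentrated in a few bins produces a large peak-to-average gap, while silence or flat noise produces a small one, which is why the difference can separate the two kinds of segment.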

The playback speed control method according to claim 1, wherein before obtaining the target streaming media, the method further includes:

When the playback speed setting interface is displayed, receive the user's first input;

The first speed and the second speed are determined in response to the first input.

A playback speed control device, including: an acquisition module, a processing module and an output module;

The acquisition module is used to acquire target streaming media;

The processing module is configured to segment, according to a preset frame length, the voice frames of the target streaming media that have undergone short-time framing, to obtain multiple long-term frames;

The processing module is further configured to determine the long-term spectrum energy difference characteristic value of each long-term frame, and determine the first streaming media segment and the second streaming media segment based on the long-term spectrum energy difference characteristic value;

The output module is configured to output the first streaming media segment in the target streaming media at a first speed, and output the second streaming media segment in the target streaming media at a second speed;

Wherein, the first streaming media segment is streaming media data containing voice information, the second streaming media segment is streaming media data not containing voice information, and the first speed is lower than the second speed.

The playback speed control device according to claim 6, wherein the processing module is specifically configured to: when the long-term spectrum energy difference characteristic value of a target long-term frame is less than a first threshold, determine the target long-term frame as the first streaming media segment; and when the long-term spectrum energy difference characteristic value of the target long-term frame is greater than the first threshold, determine the target long-term frame as the second streaming media segment; wherein the target long-term frame is any one of the plurality of long-term frames, and the first threshold is related to the noise estimate value and the signal-to-noise ratio.

The playback speed control device according to claim 6, wherein the processing module is specifically configured to: when the long-term spectrum energy difference characteristic value of a target long-term frame is less than a first threshold, determine the number of frames containing pitch in the target long-term frame; when the ratio of the number of frames containing pitch to the total number of frames of the target long-term frame is greater than a second threshold, determine the target long-term frame as the first streaming media segment; and when the ratio of the number of frames containing pitch to the total number of frames of the target long-term frame is less than the second threshold, determine the target long-term frame as the second streaming media segment; wherein the target long-term frame is any one of the plurality of long-term frames.

The playback speed control device according to claim 6, wherein the processing module is specifically configured to: when the long-term spectrum energy difference characteristic value of a target long-term frame is less than a first threshold, convert the time-domain speech signal of the target long-term frame into a frequency-domain energy signal to obtain a target streaming media segment; determine the average energy value and the maximum energy value of multiple frequency-domain energy sampling points in the target streaming media segment; when the difference between the maximum energy value and the average energy value is within a first preset range, determine the target streaming media segment as the first streaming media segment; and when the difference between the maximum energy value and the average energy value is not within the first preset range, determine the target streaming media segment as the second streaming media segment; wherein the target long-term frame is any one of the plurality of long-term frames.

The playback speed control device according to claim 6, wherein the device further includes a receiving module;

The receiving module is configured to receive, before the target streaming media is acquired, a first input from the user while the playback speed setting interface is displayed;

The processing module is further configured to determine the first speed and the second speed in response to the first input.

An electronic device, characterized in that it includes a processor and a memory, wherein the memory stores a program or instructions that can be run on the processor, and when the program or instructions are executed by the processor, the playback speed control method according to any one of claims 1-5 is implemented.

A readable storage medium, characterized in that the readable storage medium stores a program or instructions, and when the program or instructions are executed by a processor, the playback speed control method according to any one of claims 1-5 is implemented.

A chip, including a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run a program or instructions to implement the playback speed control method according to any one of claims 1-5.

A computer program product, which is executed by at least one processor to implement the playback speed control method according to any one of claims 1-5.

An electronic device configured to perform the playback speed control method according to any one of claims 1-5.