WO2025152573A1

WO2025152573A1 - Information display method based on dynamic digital human avatar, and electronic device

Info

Publication number: WO2025152573A1
Application number: PCT/CN2024/130678
Authority: WO
Inventors: 胡经川; 陈志文; 吕承飞
Original assignee: Taobao China Software Co Ltd
Current assignee: Taobao China Software Co Ltd
Priority date: 2024-01-17
Filing date: 2024-11-08
Publication date: 2025-07-24
Anticipated expiration: 2026-07-17
Also published as: CN117596373A; CN117596373B

Abstract

Disclosed in the embodiments of the present application are an information display method based on a dynamic digital human avatar, and an electronic device. The method comprises: obtaining video content in a plurality of real angles of view, wherein the video content in the plurality of real angles of view is obtained by means of a plurality of camera devices synchronously photographing, from different angles of view, the process of a real person in target clothes showing the target clothes; respectively extracting, from a plurality of video frames included in the video content, person images included in the plurality of video frames; using image-based rendering technology to generate person images at corresponding time points for a plurality of intermediate angles of view between adjacent real angles of view; and generating a multi-angle-of-view video on the basis of the person images at a plurality of time points in the plurality of real angles of view and the plurality of intermediate angles of view, such that the multi-angle-of-view video is matched on a client into a preset virtual 3D space scene model. By means of the embodiments of the present application, a more realistic visual experience can be provided for a user at a lower cost.

Description

Method and electronic device for information display based on dynamic digital human image

交叉引用Cross-references

本申请引用于2024年01月17日递交的名称为“基于动态数字人形象进行信息展示的方法及电子设”的第202410069436X号中国专利申请，其通过引用被全部并入本申请。This application refers to Chinese Patent Application No. 202410069436X, filed on January 17, 2024, entitled “Method and electronic device for information display based on dynamic digital human image”, which is incorporated into this application in its entirety by reference.

Technical Field

本申请涉及信息处理技术领域，特别是涉及基于动态数字人形象进行信息展示的方法及电子设备。The present application relates to the field of information processing technology, and in particular to a method and electronic device for displaying information based on a dynamic digital human image.

Background Art

随着科技的快速发展，尤其是计算机视觉、增强现实(AR)和虚拟现实(VR)技术的成熟，虚实融合技术正在吸引越来越多的关注。这种技术通过整合真实世界的元素和计算机生成的虚拟元素为用户提供了一个沉浸式的体验，它为各种产业开辟了新的应用领域和机会。尤其是在娱乐、广告和电商等行业，这些技术的兴起，不仅为消费者带来了新的体验，还为企业开辟了新的市场和营销机会。例如，在电商平台中，针对服饰行业，可以提供“模特走秀”、“试衣间”等功能，给消费者带来沉浸式的视觉体验。With the rapid development of science and technology, especially the maturity of computer vision, augmented reality (AR) and virtual reality (VR) technologies, virtual-reality fusion technology is attracting more and more attention. This technology provides users with an immersive experience by integrating real-world elements and computer-generated virtual elements, and it has opened up new application areas and opportunities for various industries. Especially in industries such as entertainment, advertising and e-commerce, the rise of these technologies has not only brought new experiences to consumers, but also opened up new markets and marketing opportunities for companies. For example, in e-commerce platforms, for the clothing industry, functions such as "model catwalk" and "fitting room" can be provided to bring consumers an immersive visual experience.

然而，在实现上述功能的过程中也提出了新的技术和设计挑战，比如，在“模特走秀”场景中，包括如何提供更具有真实感的虚拟环境以及“数字人”，等等。其中，关于“数字人”，在现有技术中，通常有以下几种实现方式：However, in the process of realizing the above functions, new technical and design challenges are also raised. For example, in the "model catwalk" scene, how to provide a more realistic virtual environment and "digital human", etc. Among them, regarding "digital human", in the existing technology, there are usually the following implementation methods:

方式一，传统的3D人像建模技术：使用高精度的3D扫描设备对真人模特进行扫描，然后手动或半自动地进行纹理贴图和细节雕刻，以生成人体模型。但是，这种方式下，通常需要昂贵的扫描设备和高技能的专家来处理扫描数据，从扫描到完成的模型需要较长的时间，并且，主要适用于静态建模，动态表情和动作捕捉需要额外的工作。Method 1: Traditional 3D portrait modeling technology: Use high-precision 3D scanning equipment to scan real models, and then manually or semi-automatically perform texture mapping and detail carving to generate a human body model. However, this method usually requires expensive scanning equipment and highly skilled experts to process the scanned data. It takes a long time from scanning to the completed model. In addition, it is mainly suitable for static modeling. Dynamic expressions and motion capture require additional work.

方式二，基于AI的人像合成：利用神经网络和大量的训练数据，直接合成或转换人像视角和姿态。但是，这种方式需要大量的标注数据进行训练，在某些复杂的场景和角度下，生成的结果可能不够真实，并且，实时应用可能需要高性能的硬件支持，计算需求高。Method 2: AI-based portrait synthesis: Using neural networks and a large amount of training data, directly synthesize or convert portrait perspectives and postures. However, this method requires a large amount of labeled data for training, and in some complex scenes and angles, the generated results may not be realistic enough. In addition, real-time applications may require high-performance hardware support and high computing requirements.

方式三，基于深度图像的立体图像渲染方案：使用颜色图像和对应的深度图像来生成新的视点图像。通过深度信息，该方法可以估计场景的3D结构，并从新的视点渲染场景。但是，这种方式的输出质量严重依赖于深度图像的质量，而低质量或不准确的深度信息还可能导致渲染图像中的伪影或失真，另外，深度图像的获取成本比较高，需要通过一些硬件，激光雷达等来获取，这可能增加成本和复杂性。Method 3, stereo image rendering solution based on depth image: Use color image and corresponding depth image to generate new viewpoint image. Through depth information, this method can estimate the 3D structure of the scene and render the scene from a new viewpoint. However, the output quality of this method is heavily dependent on the quality of the depth image, and low-quality or inaccurate depth information may also cause artifacts or distortion in the rendered image. In addition, the acquisition cost of depth image is relatively high, and it needs to be obtained through some hardware, lidar, etc., which may increase cost and complexity.

因此，如何在基于数字人进行信息展示的过程中，以更低成本为用户提供更真实的视觉体验，成为需要本领域技术人员解决的技术问题。 Therefore, how to provide users with a more realistic visual experience at a lower cost in the process of information display based on digital humans has become a technical problem that needs to be solved by technical personnel in this field.

发明内容Summary of the invention

本申请提供了基于动态数字人形象进行信息展示的方法及电子设备，能够以更低成本为用户提供更真实的视觉体验。The present application provides a method and electronic device for displaying information based on a dynamic digital human image, which can provide users with a more realistic visual experience at a lower cost.

本申请提供了如下方案：This application provides the following solutions:

一种基于动态数字人形象进行信息展示的方法，包括：A method for displaying information based on a dynamic digital human image, comprising:

获得多个真实视角的视频内容，所述多个真实视角的视频内容是通过多个相机设备从不同视角、对真实人物在穿着目标服饰状态下对所述目标服饰进行展示的过程进行同步拍摄得到的；Obtaining video contents from multiple real perspectives, wherein the video contents from multiple real perspectives are obtained by synchronously shooting a process in which a real person wears a target garment and displays the target garment from different perspectives using multiple camera devices;

从所述视频内容包含的多个视频帧中分别提取出其中包含的人物图像；Extracting the character images contained in the plurality of video frames respectively from the video content;

利用基于图像的渲染技术，根据相邻真实视角在同一时间点的人物图像之间的像素偏移关系信息，为所述相邻真实视角之间的多个中间视角生成对应时间点的人物图像；Using image-based rendering technology, based on pixel offset relationship information between character images at adjacent real perspectives at the same time point, character images at corresponding time points are generated for multiple intermediate perspectives between the adjacent real perspectives;

根据所述多个真实视角以及所述多个中间视角在多个时间点上的人物图像生成多视角视频，以便在客户端将所述多视角视频匹配到预置的虚拟3D空间场景模型中，以提供动态数字人形象在所述虚拟3D空间场景中对所述目标服饰进行展示的内容，并提供用于模拟连续视角切换的互动效果。A multi-perspective video is generated based on the character images of the multiple real perspectives and the multiple intermediate perspectives at multiple time points, so that the multi-perspective video can be matched to a preset virtual 3D space scene model on the client, so as to provide content in which a dynamic digital human image displays the target clothing in the virtual 3D space scene, and provide an interactive effect for simulating continuous perspective switching.

其中，所述多个真实视角的视频内容是通过多个相机设备从不同视角、对真实人物在穿着目标服饰状态下在目标空间场所中行走以对所述目标服饰进行展示的过程进行同步拍摄得到的。The video contents of the multiple real perspectives are obtained by synchronously shooting a process in which a real person wearing target clothing walks in a target space to display the target clothing through multiple camera devices from different perspectives.

其中，所述为所述相邻真实视角之间的多个中间视角生成对应时间点的人物图像，包括：The step of generating character images at corresponding time points for a plurality of intermediate perspectives between the adjacent real perspectives includes:

利用基于图像的渲染技术，根据相邻真实视角在同一时间点的人物图像之间的像素偏移关系信息，对所述相邻真实视角之间的多个中间视角在对应时间点的像素位置进行估计，以生成所述中间视角在多个时间点上的人物图像。By using image-based rendering technology, according to the pixel offset relationship information between the character images of adjacent real perspectives at the same time point, the pixel positions of multiple intermediate perspectives between the adjacent real perspectives at corresponding time points are estimated to generate the character images of the intermediate perspectives at multiple time points.

其中，所述对所述相邻真实视角之间的多个中间视角在对应时间点的像素位置进行估计，包括：The estimating pixel positions of a plurality of intermediate perspectives between the adjacent real perspectives at corresponding time points includes:

以相邻真实视角在同一时间点分别对应的人物图像为输入，通过深度学习模型拟合出所述相邻真实视角的人物图像之间的稠密光流场，并利用所述稠密光流场，对所述相邻真实视角之间的多个中间视角在对应时间点的人物图像像素位置进行估计。Taking the character images corresponding to adjacent real perspectives at the same time point as input, a dense optical flow field between the character images of adjacent real perspectives is fitted through a deep learning model, and the dense optical flow field is used to estimate the pixel positions of the character images of multiple intermediate perspectives between the adjacent real perspectives at corresponding time points.

其中，还包括：Among them, it also includes:

对所述多视角视频进行压缩处理，以用于向客户端进行传输。The multi-view video is compressed for transmission to a client.

其中，所述对所述多视角视频进行压缩处理，包括：The step of compressing the multi-view video includes:

以时间点为单位对多个视角对应的视频帧进行拼接处理，得到由多个组合帧形成的帧序列；其中，将该时间点的多个视角对应的多个视频帧划分为多个集合，每个集合中的多个视频帧拼接为一个组合帧，且使得相邻视角的视频帧位于相邻组合帧的相同位置；The video frames corresponding to the multiple perspectives are spliced in units of time points to obtain a frame sequence formed by multiple combined frames; wherein the multiple video frames corresponding to the multiple perspectives at the time point are divided into multiple sets, and the multiple video frames in each set are spliced into a combined frame, and the video frames of adjacent perspectives are located at the same position of adjacent combined frames;

利用通用的视频编码器对所述多个组合帧形成的帧序列进行编码，并对所述多个组合帧进行帧间压缩处理。 A general video encoder is used to encode a frame sequence formed by the plurality of combined frames, and inter-frame compression processing is performed on the plurality of combined frames.

其中，每个组合帧的分辨率低于终端设备可支持的最大分辨率。The resolution of each combined frame is lower than the maximum resolution supported by the terminal device.

其中，还包括：Among them, it also includes:

在对多个组合帧形成的帧序列进行编码及帧间压缩处理之后，还对帧序列进行切片处理，以便以切片后得到的片段为单位进行传输，在接收端以片段为单位进行独立的解码播放。After encoding and inter-frame compression processing are performed on a frame sequence formed by a plurality of combined frames, the frame sequence is further sliced so as to be transmitted in units of the obtained segments after slicing and independently decoded and played in units of the segments at the receiving end.

其中，所述对多个组合帧形成的帧序列进行编码时，还包括：Wherein, when encoding a frame sequence formed by a plurality of combined frames, the method further includes:

根据每个切片中包括的组合帧的数量，控制帧间编码过程中的关键帧间隔，以便减少同一切片中被编码成关键帧的帧数。According to the number of combined frames included in each slice, the key frame interval in the inter-frame coding process is controlled so as to reduce the number of frames encoded as key frames in the same slice.

对于关键帧之外的组合帧，通过调低对双向参考帧的判断阈值，增加同一切片中被编码成双向参考帧的帧数。For combined frames other than key frames, the number of frames encoded as bidirectional reference frames in the same slice is increased by lowering the judgment threshold for bidirectional reference frames.

响应于用户发起的查看请求，获取多视角视频，所述多视角视频通过以下方式生成：通过多个相机设备从不同视角、对真实人物在穿着目标服饰状态下对所述目标服饰进行展示的过程进行同步拍摄得到多个真实视角的视频内容，对所述多个真实视角的视频内容包含的多个视频帧中分别提取出其中包含的人物图像，利用基于图像的渲染技术，为相邻真实视角之间的多个中间视角生成人物图像，并根据所述多个真实视角以及所述多个中间视角在多个时间点上的人物图像生成所述多视角视频；In response to a viewing request initiated by a user, a multi-view video is obtained, wherein the multi-view video is generated in the following manner: a process of a real person wearing a target clothing and displaying the target clothing is synchronously photographed by multiple camera devices from different perspectives to obtain video contents of multiple real perspectives, character images contained in multiple video frames contained in the video contents of the multiple real perspectives are respectively extracted, character images are generated for multiple intermediate perspectives between adjacent real perspectives using an image-based rendering technology, and the multi-view video is generated according to the multiple real perspectives and the character images of the multiple intermediate perspectives at multiple time points;

对所述多视角视频进行解码；Decoding the multi-view video;

将所述多视角视频匹配到预置的虚拟3D空间场景模型中，以提供动态数字人形象在所述虚拟3D空间场景中对所述目标服饰进行展示的内容；Matching the multi-view video to a preset virtual 3D space scene model to provide a dynamic digital human image displaying the target clothing in the virtual 3D space scene;

响应于连续视角切换的交互操作，通过切换到其他视角的人物图像视频内容，提供模拟连续视角切换的互动效果。In response to the interactive operation of continuous perspective switching, an interactive effect simulating continuous perspective switching is provided by switching to character image video content of other perspectives.

一种生成动态数字人形象的方法，包括：A method for generating a dynamic digital human image, comprising:

获得多个真实视角的视频内容，所述多个真实视角的视频内容是通过多个相机设备从不同视角对真实人物执行目标动作的过程进行同步拍摄得到的；Obtaining video contents from multiple real perspectives, wherein the video contents from multiple real perspectives are obtained by synchronously shooting a process in which a real person performs a target action from different perspectives using multiple camera devices;

根据所述多个真实视角以及所述多个中间视角在多个时间点上的人物图像，生成多视角视频，以便通过所述多视角视频进行对应的动态数字人形象的展示。A multi-perspective video is generated according to the multiple real perspectives and the character images of the multiple intermediate perspectives at multiple time points, so as to display the corresponding dynamic digital human image through the multi-perspective video.

一种基于动态数字人形象进行服饰信息展示的方法，包括：A method for displaying clothing information based on a dynamic digital human image, comprising:

响应于通过动态数字人对目标服饰进行展示的请求，获得虚拟3D空间场景模型，以及通过多视角视频的形式表达的动态数字人形象，其中，所述多视角视频用于从多个视角对目标人物在穿着目标服饰状态下对所述目标服饰进行展示的过程进行展示；In response to a request for displaying the target clothing through a dynamic digital human, a virtual 3D space scene model and a dynamic digital human image expressed in the form of a multi-view video are obtained, wherein the multi-view video is used to display the process of the target person displaying the target clothing when wearing the target clothing from multiple perspectives;

将所述多视角视频匹配到所述虚拟3D空间场景模型中，以提供动态数字人形象在所述虚拟3D空间场景中对所述目标服饰进行展示的内容，并提供用于模拟连续视角切换的互动效果。Match the multi-view video to the virtual 3D space scene model to provide a dynamic digital human image in the virtual 3D space scene model. The target clothing is displayed in the virtual 3D space scene, and an interactive effect for simulating continuous perspective switching is provided.

一种基于动态数字人形象进行信息展示的装置，包括：A device for displaying information based on a dynamic digital human image, comprising:

视频内容获得单元，用于获得多个真实视角的视频内容，所述多个真实视角的视频内容是通过多个相机设备从不同视角、对真实人物在穿着目标服饰状态下对所述目标服饰进行展示的过程进行同步拍摄得到的；A video content obtaining unit, configured to obtain video contents of multiple real perspectives, wherein the video contents of the multiple real perspectives are obtained by synchronously shooting a process in which a real person wears a target garment and displays the target garment from different perspectives using multiple camera devices;

人物图像提取单元，用于从所述视频内容包含的多个视频帧中分别提取出其中包含的人物图像；A character image extraction unit, used to extract character images contained in the plurality of video frames contained in the video content respectively;

视角合成单元，用于利用基于图像的渲染技术，根据相邻真实视角在同一时间点的人物图像之间的像素偏移关系信息，为所述相邻真实视角之间的多个中间视角生成对应时间点的人物图像；A perspective synthesis unit, configured to generate character images at corresponding time points for a plurality of intermediate perspectives between adjacent real perspectives based on pixel offset relationship information between character images at the same time point from adjacent real perspectives by using an image-based rendering technology;

多视角视频生存和单元，用于根据所述多个真实视角以及所述多个中间视角在多个时间点上的人物图像生成多视角视频，以便在客户端将所述多视角视频匹配到预置的虚拟3D空间场景模型中，以提供动态数字人形象在所述虚拟3D空间场景中对所述目标服饰进行展示的内容，并提供用于模拟连续视角切换的互动效果。A multi-perspective video storage and unit is used to generate a multi-perspective video based on the multiple real perspectives and the character images of the multiple intermediate perspectives at multiple time points, so as to match the multi-perspective video to a preset virtual 3D space scene model on the client, to provide content of a dynamic digital human image displaying the target clothing in the virtual 3D space scene, and to provide an interactive effect for simulating continuous perspective switching.

多视角视频获取单元，用于响应于用户发起的查看请求，获取多视角视频，所述多视角视频通过以下方式生成：通过多个相机设备从不同视角、对真实人物在穿着目标服饰状态下对所述目标服饰进行展示的过程进行同步拍摄得到多个真实视角的视频内容，对所述多个真实视角的视频内容包含的多个视频帧中分别提取出其中包含的人物图像，利用基于图像的渲染技术，为相邻真实视角之间的多个中间视角生成人物图像，并根据所述多个真实视角以及所述多个中间视角在多个时间点上的人物图像生成所述多视角视频；A multi-view video acquisition unit is used to acquire a multi-view video in response to a viewing request initiated by a user, wherein the multi-view video is generated in the following manner: a process of a real person wearing a target clothing and displaying the target clothing is synchronously photographed by multiple camera devices from different perspectives to obtain video contents of multiple real perspectives, character images contained in multiple video frames contained in the video contents of the multiple real perspectives are respectively extracted, character images are generated for multiple intermediate perspectives between adjacent real perspectives using an image-based rendering technology, and the multi-view video is generated according to the multiple real perspectives and the character images of the multiple intermediate perspectives at multiple time points;

解码单元，用于对所述多视角视频进行解码；A decoding unit, configured to decode the multi-view video;

添加单元，用于将所述多视角视频匹配到预置的虚拟3D空间场景模型中，以提供动态数字人形象在所述虚拟3D空间场景中对所述目标服饰进行展示的内容；An adding unit, used for matching the multi-view video to a preset virtual 3D space scene model, so as to provide a content in which a dynamic digital human image displays the target clothing in the virtual 3D space scene;

视角切换交互单元，用于响应于连续视角切换的交互操作，通过切换到其他视角的人物图像视频内容，提供模拟连续视角切换的互动效果。The perspective switching interaction unit is used to respond to the interactive operation of continuous perspective switching and provide an interactive effect simulating continuous perspective switching by switching to character image video content of other perspectives.

一种生成动态数字人形象的装置，包括：A device for generating a dynamic digital human image, comprising:

视频内容获得单元，用于获得多个真实视角的视频内容，所述多个真实视角的视频内容是通过多个相机设备从不同视角对真实人物执行目标动作的过程进行同步拍摄得到的；A video content obtaining unit, used to obtain video contents of multiple real perspectives, wherein the video contents of the multiple real perspectives are obtained by synchronously shooting a process in which a real person performs a target action from different perspectives using multiple camera devices;

动态数字人资产生成单元，用于根据所述多个真实视角以及所述多个中间视角在多个时间点上的人物图像，生成多视角视频，以便通过所述多视角视频进行对应的动态数字人形象的展示。A dynamic digital human asset generation unit is used to generate a plurality of digital human assets according to the plurality of real perspectives and the plurality of intermediate perspectives. The character image at a certain time point generates a multi-view video so as to display the corresponding dynamic digital human image through the multi-view video.

一种基于动态数字人形象进行服饰信息展示的装置，包括：A device for displaying clothing information based on a dynamic digital human image, comprising:

请求接收单元，用于响应于通过动态数字人对目标服饰进行展示的请求，获得虚拟3D空间场景模型，以及通过多视角视频的形式表达的动态数字人形象，其中，所述多视角视频用于从多个视角对目标人物在穿着目标服饰状态下对所述目标服饰进行展示的过程进行展示；A request receiving unit, configured to obtain, in response to a request for displaying a target garment through a dynamic digital human, a virtual 3D space scene model and a dynamic digital human image expressed in the form of a multi-view video, wherein the multi-view video is used to display a process of displaying the target garment by the target person in a state of wearing the target garment from multiple viewpoints;

展示单元，用于将所述多视角视频匹配到所述虚拟3D空间场景模型中，以提供动态数字人形象在所述虚拟3D空间场景中对所述目标服饰进行展示的内容，并提供用于模拟连续视角切换的互动效果。A display unit is used to match the multi-view video to the virtual 3D space scene model to provide content in which a dynamic digital human image displays the target clothing in the virtual 3D space scene, and to provide an interactive effect for simulating continuous viewpoint switching.

一种计算机可读存储介质，其上存储有计算机程序，该程序被处理器执行时实现前述任一项所述的方法的步骤。A computer-readable storage medium stores a computer program, which, when executed by a processor, implements the steps of any of the methods described above.

一种电子设备，包括：An electronic device, comprising:

一个或多个处理器；以及one or more processors; and

与所述一个或多个处理器关联的存储器，所述存储器用于存储程序指令，所述程序指令在被所述一个或多个处理器读取执行时，执行前述任一项所述的方法的步骤。A memory associated with the one or more processors, the memory being used to store program instructions, wherein the program instructions, when read and executed by the one or more processors, execute the steps of any of the aforementioned methods.

根据本申请提供的具体实施例，本申请公开了以下技术效果：According to the specific embodiments provided in this application, this application discloses the following technical effects:

通过本申请实施例，在需要为用户提供“模特秀场”等功能时，可以通过多个相机设备从不同视角、对真实人物在穿着目标服饰状态下对所述目标服饰进行展示的过程进行同步拍摄，得到多个真实视角的视频内容，之后，可以从所述视频内容包含的多个视频帧中分别提取出其中包含的人物图像，再利用基于图像的渲染技术，根据相邻真实视角在同一时间点的人物图像之间的像素偏移关系信息，为所述相邻真实视角之间的多个中间视角生成对应时间点的人物图像。然后，可以根据所述多个真实视角以及所述多个中间视角在多个时间点上的人物图像生成多视角视频，以便在客户端将其中一视角的人物图像视频内容添加到预置的虚拟3D空间场景模型中进行渲染展示，以提供动态数字人在所述虚拟3D空间场景中对所述目标服饰进行展示的内容，并提供用于模拟连续视角切换的互动效果。通过这种方式，无需对模特人物进行显式建模，只需利用多视角拍摄的图像即可实现类3D的真人数字人效果，避免了复杂的建模流程和高昂的建模成本，从而可以获得低成本、高保真、可实时互动的动态真人数字人效果。并且，由于本申请实施例中建立起的真人数字人资产的格式为普通的HEVC等格式的视频，因此，可被大多数移动设备解析，并且避免了复杂的渲染流程和较大的计算开销。Through the embodiment of the present application, when it is necessary to provide users with functions such as "model show", multiple camera devices can be used to synchronously shoot the process of real people wearing target clothing and displaying the target clothing from different perspectives to obtain video content from multiple real perspectives. After that, the character images contained in the multiple video frames contained in the video content can be extracted respectively, and then the image-based rendering technology is used to generate character images at corresponding time points for multiple intermediate perspectives between the adjacent real perspectives according to the pixel offset relationship information between the character images of adjacent real perspectives at the same time point. Then, a multi-perspective video can be generated based on the multiple real perspectives and the character images of the multiple intermediate perspectives at multiple time points, so that the character image video content of one of the perspectives is added to the preset virtual 3D space scene model on the client for rendering and display, so as to provide the content of the dynamic digital human displaying the target clothing in the virtual 3D space scene, and provide an interactive effect for simulating continuous perspective switching. In this way, there is no need to explicitly model the model, and only the images taken from multiple perspectives can be used to achieve a 3D-like real-life digital human effect, avoiding the complex modeling process and high modeling cost, so as to obtain a low-cost, high-fidelity, real-time interactive dynamic real-life digital human effect. In addition, since the format of the real-life digital human asset established in the embodiment of the present application is a video in a common HEVC format, it can be parsed by most mobile devices, and avoids the complex rendering process and large computing overhead.

另外，对于多个视角对应的多份视频内容，具体在进行视频编码时，可以以时间点为单位对所述多个视角对应的多个视频帧进行拼接处理，得到由多个组合帧形成的帧序列，并且，将该时间点的多个视角对应的多个视频帧划分为多个集合，每个集合中的多个视频帧拼接为一个组合，且使得相邻视角的视频帧位于相邻组合帧的相同位置。之后，可以利用通用的视频编码器对所述多个组合帧形成的帧序列进行编码，并通过对所述多个组合帧进行帧间压缩处理，消除或减少相邻视角的视频帧间的冗余信息。这样，由于对多个视角的视频帧进行了分组拼接，因此，使得拼接后的组合帧的分辨率不至于过高，便于在大部分终端设备中进行实时解码；另外，由于在进行分组拼接时，对分组方式以及排列方式进行了控制，使得相邻视角的视频帧位于相邻组合帧的相同位置，也就是说，使得相邻视角的视频帧位于不同但是相邻的组合帧中，且在不同组合帧中的位置相同，而相邻视角的视频帧之间具有相似度比较高的特点，因此，通过这种方式拼接成的相邻组合帧之间具有比较高的相似度，进而通过通用的帧间压缩算法即可通过消除或减少相邻视角的视频帧间的冗余信息，而获得较高的压缩率。换言之，在本申请实施例中，通过通用的视频编码器即可获得理想的压缩率，相应的，在解码端利用通用的解码器即可完成解码，从而可以在更多的终端设备上得到支持。In addition, for multiple video contents corresponding to multiple perspectives, when performing video encoding, multiple video frames corresponding to the multiple perspectives can be spliced in units of time points to obtain a frame sequence formed by multiple combined frames, and the multiple video frames corresponding to the multiple perspectives at the time point are divided into multiple sets, and multiple video frames in each set are spliced into a combination, and the video frames of adjacent perspectives are located at the same position of adjacent combined frames. Afterwards, a general video encoder can be used to encode the frame sequence formed by the multiple combined frames, and the multiple combined frames can be encoded by encoding the multiple combined frames. Inter-frame compression processing is performed to eliminate or reduce redundant information between video frames of adjacent perspectives. In this way, since the video frames of multiple perspectives are grouped and spliced, the resolution of the spliced combined frame is not too high, which is convenient for real-time decoding in most terminal devices; in addition, since the grouping method and the arrangement method are controlled when performing grouping and splicing, the video frames of adjacent perspectives are located in the same position of adjacent combined frames, that is, the video frames of adjacent perspectives are located in different but adjacent combined frames, and the positions in different combined frames are the same, and the video frames of adjacent perspectives have the characteristic of high similarity. Therefore, the adjacent combined frames spliced in this way have a relatively high similarity, and then a general inter-frame compression algorithm can be used to eliminate or reduce the redundant information between the video frames of adjacent perspectives, and a higher compression rate can be obtained. In other words, in the embodiment of the present application, an ideal compression rate can be obtained by a general video encoder, and accordingly, decoding can be completed by using a general decoder at the decoding end, so that it can be supported on more terminal devices.

当然，实施本申请的任一产品并不一定需要同时达到以上所述的所有优点。Of course, any product implementing the present application does not necessarily need to achieve all of the advantages described above at the same time.

BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本申请实施例或现有技术中的技术方案，下面将对实施例中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for use in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application. For ordinary technicians in this field, other drawings can be obtained based on these drawings without paying creative work.

图1是本申请实施例提供的系统架构的示意图；FIG1 is a schematic diagram of a system architecture provided in an embodiment of the present application;

图2是本申请实施例提供的第一方法的流程图；FIG2 is a flow chart of a first method provided in an embodiment of the present application;

图3是本申请实施例提供的帧重排方式的示意图；FIG3 is a schematic diagram of a frame reordering method provided in an embodiment of the present application;

图4是本申请实施例提供的第二方法的流程图；FIG4 is a flow chart of a second method provided in an embodiment of the present application;

图5是本申请实施例提供的第三方法的流程图；FIG5 is a flow chart of a third method provided in an embodiment of the present application;

图6是本申请实施例提供的第四方法的流程图；FIG6 is a flow chart of a fourth method provided in an embodiment of the present application;

图7是本申请实施例提供的电子设备的示意图。FIG. 7 is a schematic diagram of an electronic device provided in an embodiment of the present application.

DETAILED DESCRIPTION

下面将结合本申请实施例中的附图，对本申请实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员所获得的所有其他实施例，都属于本申请保护的范围。The following will be combined with the drawings in the embodiments of the present application to clearly and completely describe the technical solutions in the embodiments of the present application. Obviously, the described embodiments are only part of the embodiments of the present application, not all of the embodiments. All other embodiments obtained by ordinary technicians in this field based on the embodiments in the present application belong to the scope of protection of this application.

首先需要说明的是，在本申请实施例中，主要可以针对商品信息服务系统等应用系统中为用户提供“模特秀场”等场景，在该场景中需要达到的效果通常是：用户可以通过客户端查看到“数字人”在穿着某个服饰的状态下，在某个空间场景中行走，或者做出某些展示动作，并且，通常需要为用户提供连续视角切换的互动效果。其中，数字人(Digital Human/Meta Human)，是运用数字技术创造出来的、与人类形象接近的数字化人物形象。也就是说，用户在通过手机等终端设备查看这种“模特秀场”界面的过程中，可以通过在手机屏幕上执行滑动操作的方式，切换到“任意”视角进行观看，就像用户亲自在“走秀”现场，并且可以走到不同位置对模特走秀过程进行观看一样。First of all, it should be noted that in the embodiments of the present application, it is mainly possible to provide users with scenes such as "model show" in application systems such as commodity information service systems. The effect to be achieved in this scene is usually: users can view through the client that "digital human" is wearing certain clothes, walking in a certain space scene, or making certain display actions, and it is usually necessary to provide users with an interactive effect of continuous perspective switching. Among them, digital human (Digital Human/Meta Human) is a digital human image close to human image created by digital technology. That is to say, when users are viewing this "model show" interface through mobile phones and other terminal devices, they can switch to an "arbitrary" perspective by sliding on the mobile phone screen, just like the user is personally at the "show" scene and can walk to different positions to watch the model show process.

在现有技术中，一些商品信息服务系统中也会存在“模特秀场”相关的产品，但是，主要是采用建模的方式来生成“3D数字人”，然后将3D服饰模型匹配到这种“3D数字人”身上，并模拟出真实人物“走秀”的效果。在此过程中，可以提供多视角查看等互动效果，但是，其中的“3D数字人”由于是通过3D建模的方式生成的，因此，看上去会有明显的“卡通感”，不够真实，行走、姿势等动作也会不够自然；另外，通过3D服饰模型也难以完全还原服饰的特点，等等。In the prior art, some commodity information service systems also have products related to "model shows", but they mainly use modeling to generate "3D digital people", and then match 3D clothing models to this "3D digital people" to simulate the effect of real people "walking on the catwalk". In this process, interactive effects such as multi-view viewing can be provided, but because the "3D digital people" are generated by 3D modeling, they look obviously "cartoon-like", not realistic enough, and their movements such as walking and posture are not natural enough; in addition, it is difficult to fully restore the characteristics of clothing through 3D clothing models, etc.

本申请实施例中，为了能够在商品信息服务系统中的“模特秀场”、“试衣间”等场景中，为用户提供更真实的视觉体验，让用户看到的是更真实的人物形象(而不是看上去比较缺乏相对的真实感的卡通形象)，提供了采用“真人数字人”的方式为用户提供“模特秀场”等功能的实现方案。In the embodiments of the present application, in order to provide users with a more realistic visual experience in scenes such as the "model show" and "fitting room" in the product information service system, so that users can see more realistic character images (rather than cartoon images that appear to lack relative realism), an implementation solution is provided for providing users with functions such as "model show" by using "real digital people".

具体的，可以以多视角视频来代替现有技术中的3D建模，也就是说，可以通过多相机拍摄等方式得到多个视角的视频，然后，可以通过这种多视角视频为用户提供“模特秀场”功能，从而使得“模特”秀场中的人物以及服饰状态都更真实。并且，由于实际生产出的资产是以多视角视频的形式存在，而不是模型，因此，可以支持在更多的终端设备中进行渲染展示。Specifically, multi-view videos can be used to replace 3D modeling in the prior art. That is, videos with multiple viewpoints can be obtained by shooting with multiple cameras, etc. Then, the multi-view videos can be used to provide users with a "model show" function, making the figures and clothing in the "model" show more realistic. In addition, since the assets actually produced exist in the form of multi-view videos rather than models, they can be rendered and displayed in more terminal devices.

在通过上述多视角视频来实现“模特秀场”的过程中，还存在以下问题：如前文所述，在“模特秀场”功能中通常需要为用户提供连续视角切换的互动效果，例如，让用户可以通过滑动屏幕等方式进行连续的视角切换。但是，本申请实施例中，由于是通过多视角视频来表示具体的“数字人”，而视角是离散的，因此，在客户端展示过程中，如果需要切换视角，则只能在这些固定的视角之间进行切换，例如，从视角1切换到视角2，如果视角之间的距离比较大，则用户可能会感觉到明显的跳变现象。当然，理论上讲，如果相机设备的数量足够多，分布足够密集，使得可以获得更多视角上的视频内容，则在播放端进行播放的过程中，即使是在离散的视角之间进行切换，也可以从一定程度上模拟出连续视角切换的互动效果。但是，相机设备数量越多，则意味着成本会越高。In the process of realizing the "model show" through the above-mentioned multi-view video, there are still the following problems: As mentioned above, in the "model show" function, it is usually necessary to provide users with an interactive effect of continuous perspective switching, for example, allowing users to switch perspectives continuously by sliding the screen. However, in the embodiment of the present application, since a specific "digital person" is represented by a multi-view video, and the perspective is discrete, if the perspective needs to be switched during the client display process, it can only be switched between these fixed perspectives, for example, from perspective 1 to perspective 2. If the distance between the perspectives is relatively large, the user may feel an obvious jump phenomenon. Of course, in theory, if the number of camera devices is large enough and the distribution is dense enough so that video content from more perspectives can be obtained, then during the playback process at the playback end, even if switching between discrete perspectives, the interactive effect of continuous perspective switching can be simulated to a certain extent. However, the more camera devices there are, the higher the cost.

因此，为了能够以更低成本实现对连续视角切换的模拟，本申请实施例还采用了基于图像的渲染技术，通过该技术，可以利用相邻两个视角实际拍摄到的图像，估计出多个中间视角上的像素位置，并生成中间视角的图像。也即，不需要实际增加相机设备的数量，而是可以通过基于图像的渲染技术，对并未真正部署相机设备的多个中间视角上的图像进行补充，以此生成更多视角上的图像内容，进而达到更好的模拟连续视角切换的目的。Therefore, in order to achieve the simulation of continuous perspective switching at a lower cost, the embodiment of the present application also adopts an image-based rendering technology, through which the images actually captured by two adjacent perspectives can be used to estimate the pixel positions at multiple intermediate perspectives and generate images at intermediate perspectives. In other words, there is no need to actually increase the number of camera devices, but the image-based rendering technology can be used to supplement the images at multiple intermediate perspectives where camera devices are not actually deployed, so as to generate image content at more perspectives, thereby achieving the purpose of better simulating continuous perspective switching.

当然，具体实现时，还可能需要在“模特秀场”功能提供多种不同的场景，例如，室内场景，室外场景，科技场景，等等，如果直接在真实的空间中搭建出这些场景，则成本可能仍然比较高，并且需要同一模特人物分别在多个不同的场景中重新完成走秀，并重新进行拍摄等等，因此，时间成本也会比较高。 Of course, in the specific implementation, it may be necessary to provide a variety of different scenes in the "Model Show" function, such as indoor scenes, outdoor scenes, technological scenes, etc. If these scenes are built directly in the real space, the cost may still be relatively high, and the same model character will be required to re-complete the show in multiple different scenes and re-shoot, etc. Therefore, the time cost will also be relatively high.

为此，在本申请实施例中，真人数字人的生成与场景的创建可以独立进行，在通过相机设备拍摄得到多视角的视频内容后，可以对视频中的模特人物进行“抠图”，后续的对中间视角的图像合成等都可以是基于抠图后得到的人物图像来进行。另外还可以提供多种不同的3D虚拟空间模型，在客户端进行展示时，可以将多视角的人物图像视频放入到3D虚拟空间场景模型中进行展示。这样，同一个多视角人物图像视频可以放到不同的3D虚拟空间场景模型中进行展示。To this end, in the embodiment of the present application, the generation of real digital people and the creation of scenes can be performed independently. After obtaining multi-perspective video content through camera equipment, the model characters in the video can be "cut out", and the subsequent image synthesis of the intermediate perspective can be performed based on the character image obtained after cutting out. In addition, a variety of different 3D virtual space models can be provided. When displayed on the client, the multi-perspective character image video can be placed in the 3D virtual space scene model for display. In this way, the same multi-perspective character image video can be placed in different 3D virtual space scene models for display.

另外，由于本申请实施例中涉及到多视角视频，并且，为了实现对连续视角切换的模拟，视角(包括真实视角，以及通过算法补充的中间视角)的数量可能会非常多(可能是每隔3度或5度一个视角，则共有120或者72个视角，等等)，此时，对于这种多视角视频的压缩、传输等也是需要考虑的问题。针对该问题，本申请实施例中也提供了相应的解决方案，对于该方案，后文中会有详细介绍。In addition, since the embodiments of the present application involve multi-view video, and in order to simulate continuous view switching, the number of view angles (including real view angles and intermediate view angles supplemented by algorithms) may be very large (may be one view angle every 3 degrees or 5 degrees, so there are 120 or 72 view angles, etc.), at this time, the compression and transmission of such multi-view video are also issues that need to be considered. In response to this problem, the embodiments of the present application also provide a corresponding solution, which will be described in detail later.

从系统架构角度而言，参见图1，本申请实施例可以涉及到采集端、服务端以及客户端。其中，在采集端，可以通过多个相机设备对模特任务走秀等过程进行拍摄，得到多个真实视角的视频内容。在服务端，可以进行人像抠图、中间视角的图像合成、多视角视频压缩等处理。压缩处理后的多视角视频可以传输到客户端，在传输时还可以进行切片处理，以缩短客户端的等待时延。在客户端则可以对多视角视频进行下载，并渲染出3D空间场景模型，根据多视角视频渲染出“数字人”(某个默认视角的人物图像视频内容)，然后将“数字人”添加到3D空间场景模型中，在此过程中，还可以进行一些光影融合等处理，例如，包括实现不同视角下的阴影一致性，等等，以提升展示效果。在展示过程中，则可以响应用户的视角切换操作，通过切换到相邻的其他视角的视频内容，来模拟出连续视角切换的效果。另外，还可以响应用户执行的缩放操作，等等。From the perspective of system architecture, referring to FIG1 , the embodiment of the present application may involve an acquisition end, a service end, and a client end. Among them, at the acquisition end, multiple camera devices may be used to shoot processes such as model task catwalks to obtain video content from multiple real perspectives. At the service end, portrait cutouts, image synthesis of intermediate perspectives, multi-perspective video compression, and other processes may be performed. The compressed multi-perspective video may be transmitted to the client, and slice processing may also be performed during transmission to shorten the waiting delay of the client. At the client end, the multi-perspective video may be downloaded and a 3D space scene model may be rendered, and a "digital human" (a character image video content of a certain default perspective) may be rendered according to the multi-perspective video, and then the "digital human" may be added to the 3D space scene model. In this process, some light and shadow fusion processing may also be performed, for example, including achieving shadow consistency under different perspectives, etc., to enhance the display effect. During the display process, the user's perspective switching operation may be responded to, and the effect of continuous perspective switching may be simulated by switching to the video content of other adjacent perspectives. In addition, the zoom operation performed by the user may also be responded to, etc.

下面对本申请实施例提供的具体实现方案进行详细介绍。The specific implementation scheme provided in the embodiments of the present application is described in detail below.

实施例一Embodiment 1

首先，该实施例一从服务端的角度，提供了一种基于动态数字人形象进行信息展示的方法，参见图2，该方法具体可以包括：First, from the perspective of the server, this embodiment 1 provides a method for displaying information based on a dynamic digital human image. Referring to FIG. 2 , the method may specifically include:

S201：获得多个真实视角的视频内容，所述多个真实视角的视频内容是通过多个相机设备从不同视角、对真实人物在穿着目标服饰状态下对所述目标服饰进行展示的过程进行同步拍摄得到的。S201: Obtain video contents from multiple real perspectives, where the video contents from multiple real perspectives are obtained by synchronously shooting a process in which a real person wears a target garment and displays the target garment from different perspectives using multiple camera devices.

其中，所谓的真实视角，就是实际部署有相机设备以进行拍摄的视角。具体实现时，针对“模特秀场”功能，首先可以在摄影棚等真实空间场景中，环绕真实人物(也即真人模特人物)180/360度布设多台相机设备，确保每个视角都能得到清晰的拍摄，在真人模特人物穿着具体所需展示的服饰的状态下，利用所有相机设备进行同步拍摄，使得在同一时间捕获到人物的动作和姿态。这样，具体的模特人物可以在穿着具体所需展示的服饰的状态下，在上述空间场景中行走或者做出某些展示动作，在此过程中，上述多台相机设备就可以进行同步拍摄，从而得到多个视角的视频内容。在本申请实施例中，具体的相机设备数量可以不必过多，例如，可以十几台的数量级，等等。 Among them, the so-called real perspective is the perspective at which camera equipment is actually deployed for shooting. In specific implementation, for the "model show" function, first, multiple camera equipment can be arranged 180/360 degrees around real people (that is, real model characters) in real space scenes such as studios to ensure that each perspective can be clearly photographed. When the real model characters are wearing the specific clothing required to be displayed, all camera equipment are used to perform synchronous shooting, so that the movements and postures of the characters are captured at the same time. In this way, the specific model characters can walk or make certain display actions in the above-mentioned space scene while wearing the specific clothing required to be displayed. In this process, the above-mentioned multiple camera devices can be synchronously shot to obtain video content from multiple perspectives. In an embodiment of the present application, the number of specific camera devices does not need to be too large, for example, it can be on the order of more than a dozen, and so on.

S202：从所述视频内容包含的多个视频帧中分别提取出其中包含的人物图像。S202: Extracting character images contained in the plurality of video frames respectively from the plurality of video frames contained in the video content.

在获取到多个视角的视频内容之后，由于需要生成“数字人”，以便于与多个不同的场景模型进行融合，因此，首先可以从多个视角的视频帧中，分别提取出其中包含的人物图像。具体的，由于每个视角对应一份视频内容，每份视频内容中包括多个视频帧，每个视频帧中都可以包括人物图像，可以利用“抠图”技术从中将人物图像抠取出来。这样，每个视角都可以抠取出多份人物图像，对应的是同一人物，这里的“多份人物图像”是指每个视角在多个不同时间点上的视频帧中分别抠取出的人物图像。由于视频帧具有时间信息，因此，这种抠取出的人物图像也会具有对应的时间点信息，可以按照时间点对这些人物图像进行排列，行程人物图像序列，每个视角都可以得到一个人物图像序列。After acquiring video content from multiple perspectives, since it is necessary to generate a "digital human" in order to facilitate integration with multiple different scene models, the character images contained therein can first be extracted from the video frames of multiple perspectives. Specifically, since each perspective corresponds to a video content, each video content includes multiple video frames, and each video frame can include a character image, the character image can be extracted from it using the "cutout" technology. In this way, multiple character images can be extracted from each perspective, corresponding to the same person. The "multiple character images" here refer to the character images extracted from the video frames at multiple different time points in each perspective. Since the video frame has time information, the extracted character images will also have corresponding time point information. These character images can be arranged according to the time point to form a character image sequence, and a character image sequence can be obtained from each perspective.

S203：利用基于图像的渲染技术，根据相邻真实视角在同一时间点的人物图像之间的像素偏移关系信息，为所述相邻真实视角之间的多个中间视角生成对应时间点的人物图像。S203: Using image-based rendering technology, according to pixel offset relationship information between character images at the same time point in adjacent real perspectives, character images at corresponding time points are generated for multiple intermediate perspectives between the adjacent real perspectives.

在获得每个视角对应的人物图像序列之后，为了更好的模拟连续视角切换效果，还可以通过插值的方式，合成出更多视角上的人物图像序列。具体的，该视角合成过程可以在每两个相邻的真实视角之间来进行，例如，假设共有10个真实视角，视角1与视角2相邻，则可以对视角1与视角2之间的多个中间视角进行人物图像合成，也即，估计出假设中间视角处设有相机设备，则拍摄到的人物图像会是怎样的。另外，还可以对视角2与视角3之间，视角3与视角4之间等其他的相邻视角之间的中间视角进行图像合成。这样，使得每两个相邻的真实视角之间都可以估计出多个中间视角上的图像。After obtaining the character image sequence corresponding to each perspective, in order to better simulate the continuous perspective switching effect, character image sequences at more perspectives can be synthesized by interpolation. Specifically, the perspective synthesis process can be performed between every two adjacent real perspectives. For example, assuming that there are 10 real perspectives, and perspective 1 is adjacent to perspective 2, then character images can be synthesized for multiple intermediate perspectives between perspective 1 and perspective 2, that is, it is estimated that if a camera device is provided at the intermediate perspective, what the captured character image will look like. In addition, image synthesis can also be performed for intermediate perspectives between perspective 2 and perspective 3, between perspective 3 and perspective 4, and other adjacent perspectives. In this way, multiple images at intermediate perspectives can be estimated between every two adjacent real perspectives.

其中，关于中间视角的选取，可以根据实际需要而定，例如，经测试，如果每隔3度设置一个视角，则在视角之间切换的过程中，用户将不会明显地感觉到跳变现象，因此，就可以将中间视角的间隔设置为3度；当然，如果为了更好地模拟连续视角切换，则还可以将中间视角的间隔设置地更小，或者，如果为了避免传输的码率过高，则还可以将中间视角的间隔设置地更大一些，等等。Among them, the selection of the intermediate viewing angle can be determined according to actual needs. For example, after testing, if a viewing angle is set every 3 degrees, the user will not obviously feel the jump phenomenon during the switching between viewing angles. Therefore, the interval of the intermediate viewing angle can be set to 3 degrees; of course, if in order to better simulate continuous viewing angle switching, the interval of the intermediate viewing angle can be set smaller, or, if in order to avoid too high a transmission bit rate, the interval of the intermediate viewing angle can be set larger, and so on.

在确定出中间视角的间隔之后，就可以根据相邻两个视角对应的人物图像的像素位置，以及像素偏移关系信息，估计出各个中间视角的像素位置，进而可以合成出具体中间视角上的人物图像。其中，由于每个视角上的人物图像为多个，分别对应不同的时间点，因此，在进行中间视角的图像合成时，也可以以时间点为单位来进行，也即，为具体的中间视角生成多个时间点分别对应的人物图像，这样每个中间视角都可以对应一个人物图像序列。After determining the interval of the intermediate perspectives, the pixel positions of each intermediate perspective can be estimated based on the pixel positions of the character images corresponding to the two adjacent perspectives and the pixel offset relationship information, and then the character images at the specific intermediate perspective can be synthesized. Among them, since there are multiple character images at each perspective, each corresponding to a different time point, when synthesizing the images of the intermediate perspectives, it can also be performed in units of time points, that is, multiple character images corresponding to each time point are generated for the specific intermediate perspective, so that each intermediate perspective can correspond to a character image sequence.

具体实现时，根据相邻两个视角的人物图像对中间视角的像素位置进行估计时，可以有多种方式。例如，一种方式下，可以以相邻真实视角在同一时间点分别对应的人物图像为输入，通过深度学习模型拟合出所述相邻视角对应的人物图像之间的稠密光流场，之后，可以利用这种稠密光流场，对所述相邻真实视角之间的多个中间视角在对应时间点的像素位置进行估计。其中，光流(optical flow)是空间运动物体在观察成像平面上的像素运动的瞬时速度，光流法是利用图像序列中像素在时间域上的变化以及相邻帧之间的相关性，来找到上一帧跟当前帧之间存在的对应关系，从而计算出相邻帧之间物体的运动信息的一种方法。在空间中，运动可以用运动场描述，而在一个图像平面上，物体的运动往往是通过图像序列中不同图像灰度分布的不同体现的，从而，空间中的运动场转移到图像上就表示为光流场。光流场是一个二维矢量场，它反映了图像上每一点灰度的变化趋势，可看成是带有灰度的像素点在图像平面上运动而产生的瞬时速度场。它包含的信息即是各像点的瞬时运动速度矢量信息。其中，稠密光流是一种针对图像或指定的某一片区域进行逐点匹配的图像配准方法，它计算图像上所有的点的偏移量，从而形成一个稠密的光流场。通过这个稠密的光流场，可以进行像素级别的图像配准。因此，在本申请实施例中，就可以首先估计出这种光流场信息，然后，基于光流场，估计出各个中间视角在对应时间点的人物图像像素位置，进而可以为中间视角生成对应的人物图像。In specific implementation, there can be multiple ways to estimate the pixel position of the intermediate perspective based on the character images of two adjacent perspectives. For example, in one method, the character images corresponding to the adjacent real perspectives at the same time point can be used as input, and the dense optical flow field between the character images corresponding to the adjacent perspectives can be fitted through a deep learning model. After that, this dense optical flow field can be used to estimate the pixel positions of multiple intermediate perspectives between the adjacent real perspectives at corresponding time points. Among them, optical flow is the instantaneous speed of pixel movement of a spatial moving object on the observation imaging plane. The optical flow method is a method that uses the changes in pixels in the time domain in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, thereby calculating the motion information of the object between adjacent frames. In space, motion can be described by a motion field, while on an image plane, the motion of an object is often described by a motion field. The different manifestations of the grayscale distribution of different images in the image sequence, thus, the motion field in the space is transferred to the image and represented as an optical flow field. The optical flow field is a two-dimensional vector field, which reflects the changing trend of the grayscale of each point on the image, and can be regarded as the instantaneous velocity field generated by the movement of grayscale pixels on the image plane. The information it contains is the instantaneous motion velocity vector information of each image point. Among them, dense optical flow is an image registration method that performs point-by-point matching on an image or a specified area. It calculates the offset of all points on the image to form a dense optical flow field. Through this dense optical flow field, pixel-level image registration can be performed. Therefore, in an embodiment of the present application, this optical flow field information can be first estimated, and then, based on the optical flow field, the pixel position of the character image of each intermediate perspective at the corresponding time point can be estimated, and then the corresponding character image can be generated for the intermediate perspective.

S204：根据所述多个真实视角以及所述多个中间视角在多个时间点上的人物图像生成多视角视频，以便在客户端将所述多视角视频匹配到预置的虚拟3D空间场景模型中进行渲染展示，以提供动态数字人形象在所述虚拟3D空间场景中对所述目标服饰进行展示的内容，并提供用于模拟连续视角切换的互动效果。S204: Generate a multi-perspective video based on the character images of the multiple real perspectives and the multiple intermediate perspectives at multiple time points, so that the multi-perspective video can be matched to a preset virtual 3D space scene model on the client for rendering and display, so as to provide content of a dynamic digital human image displaying the target clothing in the virtual 3D space scene, and provide an interactive effect for simulating continuous perspective switching.

在获得多个中间视角的人物图像序列之后，可以将多个真实视角以及所述多个中间视角在多个时间点上的人物图像分别组成的序列，分别组织成视频格式，从而得到多视角视频。也就是说，在本申请实施例中，多视角视频可以包括多个真实视角对应的视频，以及多个中间视角对应的视频。可以通过这种多视角视频，来表达一个“数字人”，该“数字人”是通过对真实人物进行拍摄、抠图、以及中间视角合成等方式生成的，而不是通过3D建模的方式生成，因此，在保持3D效果的同时，还可以使得“数字人”看上去更真实，减少卡通感，并且对于服饰等商品也可以实现更真实更自然的展示效果。After obtaining a sequence of character images of multiple intermediate perspectives, the sequences composed of multiple real perspectives and the character images of the multiple intermediate perspectives at multiple time points can be organized into video formats, respectively, to obtain multi-perspective videos. That is to say, in an embodiment of the present application, a multi-perspective video may include videos corresponding to multiple real perspectives, and videos corresponding to multiple intermediate perspectives. This multi-perspective video can be used to express a "digital person", which is generated by shooting, cutting out, and synthesizing intermediate perspectives of real people, rather than by 3D modeling. Therefore, while maintaining the 3D effect, the "digital person" can also look more real and reduce the cartoon feeling, and a more realistic and natural display effect can be achieved for goods such as clothing.

通过上述方式得到多视角视频之后，还可以对这种多视角视频进行压缩编码处理，以便提供给客户端进行展示。在客户端展示时，就可以解码出具体的多视角视频，由于之前已经进行了抠图等处理，因此，使得这种多视角视频可以具有透明背景等特点，因此，可以放置到预先生成的3D空间场景模型中，从而呈现出具体的“数字人”位于该3D空间场景模型中的视觉效果。After obtaining the multi-view video in the above manner, the multi-view video can also be compressed and encoded so as to be provided to the client for display. When the client displays it, the specific multi-view video can be decoded. Since the image has been cut out before, the multi-view video can have features such as a transparent background. Therefore, it can be placed in a pre-generated 3D space scene model, thereby presenting a visual effect of a specific "digital human" in the 3D space scene model.

其中，具体在进行压缩编码处理时，由于涉及到多视角视频，因此，本申请实施例中还可以对压缩编码的方式进行特殊处理。这是因为，相对于普通视频，多视角视频通常具有以下特点：在分辨率方面：多视角视频可能包括数十甚至上百个视角分别对应的视频数据，如果每个视角都对应的是高清的视频，那么整体的视频数据将是巨大的。假设有120个视角，整体的分辨率将会超过32K甚至更高，这对于大多数的设备来说都是很大的负载；视频码率方面，高分辨率意味着更高的视频码率，这将导致实时传输和流畅播放变得更为困难。普通的720P视频的码率可能是2-5Mbps，但多视角视频的码率会成几十倍甚至上百倍的增加。视频体积方面，多视角视频中包含了多个视角的数据，这导致了视频文件的体积迅速增长。一个小时的多视角视频可能需要数十甚至上百GB的存储空间。这些技术挑战使得多视角视频的存储、压缩和传输变得尤为困难。Specifically, when performing compression coding, since multi-view video is involved, the compression coding method can also be specially processed in the embodiment of the present application. This is because, compared with ordinary videos, multi-view videos usually have the following characteristics: In terms of resolution: multi-view videos may include video data corresponding to dozens or even hundreds of viewpoints. If each viewpoint corresponds to a high-definition video, the overall video data will be huge. Assuming there are 120 viewpoints, the overall resolution will exceed 32K or even higher, which is a large load for most devices; in terms of video bit rate, high resolution means higher video bit rate, which will make real-time transmission and smooth playback more difficult. The bit rate of ordinary 720P video may be 2-5Mbps, but the bit rate of multi-view video will increase by dozens or even hundreds of times. In terms of video volume, multi-view video contains data from multiple viewpoints, which leads to a rapid increase in the volume of video files. One hour of multi-view video may require tens or even hundreds of GB of storage space. These technical challenges make the storage, compression and transmission of multi-view videos particularly difficult.

现有技术中，存在一些对多视角视频进行压缩、传输的方案。例如：In the prior art, there are some solutions for compressing and transmitting multi-view videos. For example:

方式一，可以对多视角视频进行简单拼接，具体的，可以将同一时间点的所有视角的图像都拼在同一帧里，然后再进行压缩传输。但是，这会导致拼接后的视频分辨率过高，难以实时解码播放，也很难传输。Method 1: You can simply stitch multi-view videos together. Specifically, you can stitch all the views at the same time point together. The images are stitched together into the same frame and then compressed for transmission. However, this will result in a spliced video with a resolution that is too high, making it difficult to decode and play in real time and difficult to transmit.

方式二，通过流媒体方式进行传输，但是，一方面压缩的效率不够理想，另一方面，为了使得客户端能够切换视角，还需要对多视角视频分别进行切片处理之后，再进行流式传输，当用户在某时刻需要从视角A切换到视角B时，可以拉取视角B在对应时间片的数据进行播放；但是，分片的切流延迟比较大，并且，是否能够进行流畅的视角切换还依赖切片的大小，因为必须等上一个切片播完了才能切换下一个视角的下一切片进行播放。Method 2 is to transmit through streaming media. However, on the one hand, the compression efficiency is not ideal. On the other hand, in order to enable the client to switch perspectives, the multi-perspective videos need to be sliced and processed separately before streaming. When the user needs to switch from perspective A to perspective B at a certain moment, the data of perspective B in the corresponding time slice can be pulled for playback. However, the delay of segmented streaming is relatively large, and whether smooth perspective switching can be performed depends on the size of the slice, because the previous slice must be played before switching to the next slice of the next perspective for playback.

方式三，采用专为多视角视频设计的编解码器进行视频编解码，这种方式提供了更好的压缩性能，但它增加了编码和解码的复杂性，需要更高的计算能力，此外，由于MV-HEVC是相对较新的标准，需要专用的解码器才可以完成解码，因此，不所有设备都支持这种格式，尤其是普通的手机或者电脑等终端设备通常都是无法支持的，即使能够支持，在打开及播放视频的过程中也会出现卡顿等现象。Method three, use a codec designed for multi-view video for video encoding and decoding. This method provides better compression performance, but it increases the complexity of encoding and decoding and requires higher computing power. In addition, since MV-HEVC is a relatively new standard, a dedicated decoder is required to complete the decoding. Therefore, not all devices support this format, especially ordinary mobile phones or terminal devices such as computers usually cannot support it. Even if they can support it, there will be freezes and other phenomena when opening and playing the video.

针对上述情况，在本申请实施例中，还针对多视角视频的存储、压缩等问题提供了相应的解决方案。具体的，在对具体多视角视频进行编码之前，首先可以对多个视角分别对应的视频内容进行拼接，也即，将同一时间点不同视角对应的视频内容进行拼接，但是，并不是简单地将所有视角的视频内容全部拼接在同一帧中，而是可以对同一时间点多个视角对应的多个视频帧进行分组，得到多个集合，将同一集合内的多个视频帧拼接到同一帧中(为便于区分，可以将这种拼接后得到的帧称为“组合帧”)，也就是说，同一时间点可以对应多个不同的组合帧，通过这种方式，可以使得组合帧的分辨率不至于过高。In view of the above situation, in the embodiments of the present application, corresponding solutions are also provided for the storage and compression of multi-view videos. Specifically, before encoding a specific multi-view video, the video contents corresponding to the multiple viewpoints can be spliced first, that is, the video contents corresponding to different viewpoints at the same time point can be spliced, but it is not simply to splice all the video contents of all viewpoints into the same frame, but to group the multiple video frames corresponding to the multiple viewpoints at the same time point to obtain multiple sets, and splice the multiple video frames in the same set into the same frame (for the convenience of distinction, the frame obtained after such splicing can be called a "combined frame"). In other words, the same time point can correspond to multiple different combined frames. In this way, the resolution of the combined frame can be made not too high.

其中，具体在对多视角的视频帧进行分组时，可以根据具体视角的数量、每个视角下单个视频帧的分辨率以及终端设备可支持的最大分辨率等信息，来确定出需要分成多少个组，以及每组包括多少个视角，以使得每个组合帧的分辨率低于终端设备可支持的最大分辨率。例如，假设共有72个视角，每个视角的视频帧的分辨率为720P，目前市面上大部分终端设备通常能够支持4K分辨率的实时解码，此时，就可以将上述72个视角分成12组，每个组合帧中会包括6个视角在同一时间点对应的视频帧，每个组合帧的分辨率为720P×6＝4320P，与一般4K图像的分辨率接近，因此，大部分的终端设备可以实现对这种分辨率的图像帧的实时解码。Specifically, when grouping multi-view video frames, the number of groups to be divided and the number of view angles included in each group can be determined based on the number of specific view angles, the resolution of a single video frame at each view angle, and the maximum resolution supported by the terminal device, so that the resolution of each combined frame is lower than the maximum resolution supported by the terminal device. For example, assuming that there are 72 view angles in total, and the resolution of the video frame of each view angle is 720P, most of the terminal devices on the market can usually support real-time decoding of 4K resolution. At this time, the above 72 view angles can be divided into 12 groups, and each combined frame will include 6 video frames corresponding to the view angles at the same time point. The resolution of each combined frame is 720P×6＝4320P, which is close to the resolution of a general 4K image. Therefore, most of the terminal devices can achieve real-time decoding of image frames of this resolution.

在确定出具体的分组数之后，为了能够在编码过程中实现更高的压缩效率，还可以确定不同视角的分组方式，以及在具体组合帧中的排列方式等。其中，具体的分组方式以及排列方式可以有多种，例如，在最简单的方式下，可以是第1到6视角为一组，第7到12视角为一组，等等，在具体的组合帧中，可以划分为3×2(三排两列)的块，各个视角可以在这些块中按编号顺序排列，等等。但是，考虑到不同视角在同一时间点拍摄到的视频帧在内容上往往具有比较高的相似度，尤其是相邻视角，两者之间的相似度会更高，从信息编码角度而言，这种相邻视角之间存在的高度相似的内容中会存在大量的冗余信息，在编码的过程中是属于可以被压缩的内容。也就是说，冗余信息的存在有利于提升编码的压缩效率，因此，在编码过程中，如果能够充分利用到这种冗余信息，则对于提升压缩效率会有很大的帮助。After determining the specific number of groups, in order to achieve higher compression efficiency during the encoding process, the grouping method of different perspectives and the arrangement method in the specific combined frame can also be determined. There are many specific grouping methods and arrangement methods. For example, in the simplest way, the 1st to 6th perspectives can be grouped as a group, the 7th to 12th perspectives can be grouped as a group, and so on. In the specific combined frame, it can be divided into 3×2 (three rows and two columns) blocks, and each perspective can be arranged in these blocks in order of number, and so on. However, considering that the video frames shot at the same time point from different perspectives often have a relatively high similarity in content, especially the adjacent perspectives, the similarity between the two will be higher. From the perspective of information encoding, there will be a lot of redundant information in the highly similar content between adjacent perspectives, which is content that can be compressed during the encoding process. In other words, the existence of redundant information is conducive to improving the compression efficiency of encoding. Therefore, in the encoding process, if this redundant information can be fully utilized, it will be helpful to improve the compression efficiency. It will be of great help.

在视频编码过程中，具体的信息压缩技术可以分为帧内压缩与帧间压缩两种，帧内压缩是在空域(空间XY轴上)进行压缩，压缩过程中主要参考本帧数据之间的相似性；而帧间压缩则是利用视频序列中不同视频帧之间的帧间冗余，例如前后帧间的相似性，通过预测方法来减少数据量。通常，相对于帧内压缩而言，帧间压缩通常可以获得更高的压缩率。In the process of video encoding, specific information compression technologies can be divided into two types: intra-frame compression and inter-frame compression. Intra-frame compression is compression in the spatial domain (on the XY axis of space), and the similarity between the data of the current frame is mainly referred to during the compression process; while inter-frame compression uses the inter-frame redundancy between different video frames in the video sequence, such as the similarity between the previous and next frames, to reduce the amount of data through a prediction method. Generally, compared with intra-frame compression, inter-frame compression can usually achieve a higher compression rate.

但是，如果按照前述例子中所述的简单的视角分组以及排列方式，则具有最高冗余度的相邻视角的视频帧是在同一组合帧中，因此，在对组合帧进行压缩时，只能在帧内压缩过程中使用到这种冗余信息，而无法在帧间编码过程中得到充分利用。However, if the simple perspective grouping and arrangement method described in the above example is followed, the video frames of adjacent perspectives with the highest redundancy are in the same combined frame. Therefore, when the combined frame is compressed, this redundant information can only be used in the intra-frame compression process and cannot be fully utilized in the inter-frame encoding process.

为此，在本申请实施例中，还提供更优的视角分组以及排列方式，具体的，可以使得相邻视角的视频帧位于相邻组合帧的相同位置，也就是说，相邻视角的视频帧会被分到不同但是相邻的组中，并且，会位于相邻组合帧的相同位置。To this end, in an embodiment of the present application, a more optimal perspective grouping and arrangement method is also provided. Specifically, the video frames of adjacent perspectives can be located at the same position of adjacent combined frames. That is to say, the video frames of adjacent perspectives will be divided into different but adjacent groups, and will be located at the same position of adjacent combined frames.

例如，假设共有36个视角(为了便于描述，减少了视角的数量)，被分为6个组，每个组内有6个视角的视频帧，也即，每6个视角的视频帧组成一个组合帧。如图3所示，假设每个组合帧中包括3×2个块，每个块用于放置一个视角的视频帧，每个块的位置编号分别为0，1，2，3，4，5；另外假设36个视角分别用A1、A2、A3……A36来表示。则通过图3所示可以看出，视角A1、A2、A3、A4、A5、A6位于组合帧1至6的第0号位置，视角A7、A8、A9、A10、A11、A12位于组合帧1至6的第1号位置，以此类推。也就是说，A1、A7、A13、A19、A25、A31这些视角为第一组，拼接成组各帧1；A2、A8、A14、A20、A26、A32为第二组，拼接成组合帧2，以此类推。可见，每个组合帧内，视角的编号之间形成等差数列，视角编号之间的差值就是分组的数量，在该例子中为6。通过这种方式，可以使得相邻的组合帧之间，在同样位置处，视角也是相邻的，这就使得相邻组各帧之间在相同位置处的图像内容会具有很高的相似度，在进行帧间压缩编码时，就能够充分利用到相邻视角之间由于内容相似度较高而产生的冗余信息，从而有利于获得较高的压缩率。For example, suppose there are 36 views (the number of views is reduced for ease of description), which are divided into 6 groups, each group has 6 video frames of views, that is, every 6 video frames of views form a combined frame. As shown in FIG3, suppose each combined frame includes 3×2 blocks, each block is used to place a video frame of a view, and the position number of each block is 0, 1, 2, 3, 4, 5 respectively; in addition, suppose that the 36 views are represented by A1, A2, A3...A36 respectively. It can be seen from FIG3 that the views A1, A2, A3, A4, A5, A6 are located at the 0th position of the combined frames 1 to 6, and the views A7, A8, A9, A10, A11, A12 are located at the 1st position of the combined frames 1 to 6, and so on. That is to say, the perspectives A1, A7, A13, A19, A25, and A31 are the first group, which are spliced into group frame 1; A2, A8, A14, A20, A26, and A32 are the second group, which are spliced into group frame 2, and so on. It can be seen that in each combined frame, the perspective numbers form an arithmetic progression, and the difference between the perspective numbers is the number of groups, which is 6 in this example. In this way, the perspectives at the same position between adjacent combined frames can be adjacent, which makes the image content at the same position between adjacent groups of frames have a high degree of similarity. When performing inter-frame compression coding, the redundant information generated by the high content similarity between adjacent perspectives can be fully utilized, which is conducive to obtaining a higher compression rate.

当然，上述例子中仅示出了其中一个时间点上各个视角的视频帧拼接情况，其他时间点上各个视角的视频帧也可以按照上述方式进行分组及排列。这样，每个时间点都可以拼接出6个组合帧。各个时间点分别按照上述方式完成拼接后，可以将得到的组合帧形成帧序列，例如，如果将各个组合帧表示为“组合帧mn”，其中，m代表时间点的编号，n代表同一时间点对应的各个组合帧的编号，则形成的帧序列可以是：(组合帧11，组合帧12，组合帧13，组合帧14，组合帧15，组合帧16，组合帧21，组合帧22，组合帧23，组合帧24，组合帧25，组合帧26，组合帧31，组合帧32……)。Of course, the above example only shows the splicing of video frames of each perspective at one time point, and the video frames of each perspective at other time points can also be grouped and arranged in the above manner. In this way, 6 combined frames can be spliced at each time point. After the splicing is completed at each time point in the above manner, the obtained combined frames can be formed into a frame sequence. For example, if each combined frame is represented as "combined frame mn", where m represents the number of the time point and n represents the number of each combined frame corresponding to the same time point, the formed frame sequence can be: (combined frame 11, combined frame 12, combined frame 13, combined frame 14, combined frame 15, combined frame 16, combined frame 21, combined frame 22, combined frame 23, combined frame 24, combined frame 25, combined frame 26, combined frame 31, combined frame 32...).

通过上述分组以及排列方式，将这种相邻视角的视频帧分散到不同但是相邻的组合帧中，并且，位于相邻帧的相同位置，而相邻视角的视频帧之间通常会存在很高的相似度，因此，可以使得相邻的组合帧之间，至少在每个位置处的视频帧之间都具有很高的相似度，也即存在大量的冗余信息，这些冗余信息就是在进行帧间编码过程中，可以被压缩优化的对象。因此，在对组合帧进行编码时，就可以通过帧间编码的方式，充分利用相邻视角的视频帧之间的冗余信息，以获得更高的压缩效率。并且，由于这种高压缩率是通过帧间编码技术实现的，而通用的视频编码器就具有帧间编码的能力，因此，利用通用的视频编码器即可实现编码，而不需要依赖专用于多视角编码的编码器，相应的，在播放端利用通用的视频解码器进行解码即可，因此，进一步支持了在更多的终端设备中的解码播放。Through the above grouping and arrangement, the video frames of adjacent perspectives are dispersed into different but adjacent combined frames, and are located at the same position of adjacent frames. There is usually a high degree of similarity between video frames of adjacent perspectives. Therefore, the adjacent combined frames, at least the video frames at each position, have a high degree of similarity, that is, there is a large amount of redundant information, which can be compressed and optimized in the process of inter-frame coding. object. Therefore, when encoding the combined frame, the redundant information between the video frames of adjacent perspectives can be fully utilized through inter-frame coding to obtain higher compression efficiency. Moreover, since this high compression rate is achieved through inter-frame coding technology, and the general video encoder has the ability of inter-frame coding, the encoding can be achieved using a general video encoder without relying on an encoder dedicated to multi-perspective coding. Correspondingly, the decoding can be performed using a general video decoder at the playback end, thereby further supporting decoding and playback in more terminal devices.

也就是说，通过本申请实施例，采用了对多视角的视频帧进行分组拼接的方式，相对于直接将全部视角拼接到同一帧中的方式，可以减少每个组合帧的分辨率，降低终端设备的解码压力；另外，还对不同视角在不同组合帧中的排列方式进行了特殊处理，以使得不同的组合帧之间的内容相似度会比较高，这样，可以通过对组合帧进行帧间压缩的方式来消除或减少相邻视角的视频帧间的冗余信息，以此获得较高的压缩率，从而使得通用的编码器(例如，HEVC(High Efficiency Video Coding，高效率视频编码)等)即可完成编码过程。相应的，也可以使用通用的解码器来进行解码，这样，可以使得具体的视频可以在多数终端设备中进行解码播放，避免了复杂的渲染流程和较大的计算开销以及对专用编解码器的依赖。That is to say, through the embodiment of the present application, a method of grouping and splicing video frames of multiple perspectives is adopted. Compared with the method of directly splicing all perspectives into the same frame, the resolution of each combined frame can be reduced, and the decoding pressure of the terminal device can be reduced; in addition, the arrangement of different perspectives in different combined frames is specially processed so that the content similarity between different combined frames will be relatively high. In this way, the redundant information between video frames of adjacent perspectives can be eliminated or reduced by compressing the combined frames between frames, so as to obtain a higher compression rate, so that a general encoder (for example, HEVC (High Efficiency Video Coding, high-efficiency video coding) etc.) can complete the encoding process. Correspondingly, a general decoder can also be used for decoding, so that a specific video can be decoded and played in most terminal devices, avoiding complex rendering processes and large computing overheads as well as dependence on dedicated codecs.

在对多个组合帧形成的帧序列进行编码及帧间压缩处理之后，可以用于向客户端进行传输，以便在客户端进行解码展示。其中，在可选的实现方式下，还具体在进行传输之前，还可以对帧序列进行切片处理，以便以切片后得到的片段为单位进行传输，在接收端可以以片段为单位进行独立的解码播放。这样，接收端只需要收到第一个片段即可进行解码播放，而不需要等待所有的帧序列全部传输完成，因此，可以缩短等待延迟。After encoding and inter-frame compression of a frame sequence formed by multiple combined frames, it can be used for transmission to a client for decoding and display at the client. In an optional implementation, the frame sequence can be sliced before transmission so that it can be transmitted in units of fragments obtained after slicing, and the receiving end can decode and play the fragments independently. In this way, the receiving end only needs to receive the first fragment to decode and play, without waiting for all frame sequences to be transmitted, thereby shortening the waiting delay.

其中，具体的切片时长可以根据实际需求而定，如果切片越小，则接收端的延迟会越小。例如，每个切片的时长可以为1S，或者也可以是0.5S，等等。其中，在本申请实施例中，由于对多个视角的视频帧进行了分组拼接处理，因此，在确定出切片时长之后，每个切片中需要包括的组合帧的数量可以根据播放端的播放帧率而定。例如，仍然以72个视角并分为12组为例，每个时间点对应12个组合帧，另外假设播放端的播放帧率是30帧/S，在对组合帧进行分片处理时的分片时长为1S，则每个片段中需要包含30×72/6＝360个组合帧。也就是说，每个片段中包括的组合帧需要满足播放端在1S的时长内所需播放的帧数，其中，播放端具体在播放时，是需要将组合帧进行解码，从中选择某个视角对应的视频帧，并进行播放，并且，播放端在1S内播放的30帧通常是同一视角下的30个视频帧，而具体对哪个视角进行播放都是可能的，因此，在对组合帧进行切片时，如果每个片段为1S，就需要使得同一个片段中每个视角都存在30个视频帧。如果视角数量是72，则视频帧的数量为30×72个，由于这些视频帧进行了分组，拼接成了组合帧，因此，组合帧的数量就是30×72/6＝360。当然，在上述假设条件不变的情况下，如果将分片时长改为0.5S，则每个片段中包括180个组合帧即可，等等。Among them, the specific slice duration can be determined according to actual needs. If the slice is smaller, the delay at the receiving end will be smaller. For example, the duration of each slice can be 1S, or it can also be 0.5S, and so on. Among them, in the embodiment of the present application, since the video frames of multiple perspectives are grouped and spliced, after determining the slice duration, the number of combined frames that need to be included in each slice can be determined according to the playback frame rate of the playback end. For example, still taking 72 perspectives divided into 12 groups as an example, each time point corresponds to 12 combined frames. In addition, assuming that the playback frame rate of the playback end is 30 frames/S, the slice duration when the combined frame is sliced is 1S, then each segment needs to contain 30×72/6=360 combined frames. That is to say, the combined frames included in each segment need to meet the number of frames required to be played by the playback end within a duration of 1S, wherein the playback end specifically needs to decode the combined frames, select the video frames corresponding to a certain perspective, and play them, and the 30 frames played by the playback end within 1S are usually 30 video frames under the same perspective, and it is possible to play any perspective. Therefore, when slicing the combined frames, if each segment is 1S, it is necessary to make 30 video frames exist in each perspective in the same segment. If the number of perspectives is 72, the number of video frames is 30×72. Since these video frames are grouped and spliced into combined frames, the number of combined frames is 30×72/6=360. Of course, under the above assumptions, if the segment duration is changed to 0.5S, each segment can include 180 combined frames, and so on.

另外，如果进行了上述分片传输，则在进行帧间编码时，还可以通过控制其中的关键帧及双向参考帧的数量，来进一步提升压缩率。具体的，编码器将多张图像进行编码后生产成一段一段的GOP(Group of Pictures)，解码器在播放时则是读取一段一段的GOP 进行解码后读取画面再渲染显示。GOP是一组连续的画面，由一张I帧和数张B/P帧组成，是视频图像编码器和解码器存取的基本单位，它的排列顺序将会一直重复到影像结束。其中，I帧是内部编码帧(也称为关键帧)，P帧是前向预测帧(前向参考帧)，B帧是双向内插帧(双向参考帧)。具体的，I帧通常是一个完整的画面，而P帧和B帧记录的是相对于I帧的变化，其中，P帧和B帧中没有完整的画面数据，P帧中只有与前一帧的画面差别的数据，B帧记录的是本帧与前后帧的差别。其中，B帧所需记录的信息量比较少，因此，通常具有更高的压缩率。如果一个GOP中包括的I帧越少、B帧越多，则整体上的压缩率会比较高。In addition, if the above-mentioned segmented transmission is performed, the compression rate can be further improved by controlling the number of key frames and bidirectional reference frames during inter-frame coding. Specifically, the encoder encodes multiple images into GOPs (Group of Pictures) one by one, and the decoder reads GOPs one by one during playback. After decoding, the picture is read and then rendered for display. GOP is a group of continuous pictures, consisting of an I frame and several B/P frames. It is the basic unit for video image encoders and decoders to access. Its arrangement order will be repeated until the end of the image. Among them, I frame is an internal coding frame (also called a key frame), P frame is a forward prediction frame (forward reference frame), and B frame is a bidirectional interpolation frame (bidirectional reference frame). Specifically, I frame is usually a complete picture, while P frame and B frame record the changes relative to I frame. Among them, P frame and B frame do not have complete picture data. P frame only has the data of the picture difference with the previous frame, and B frame records the difference between this frame and the previous and next frames. Among them, the amount of information required to be recorded by B frame is relatively small, so it usually has a higher compression rate. If a GOP includes fewer I frames and more B frames, the overall compression rate will be higher.

在实际应用中，具体哪些帧会被编码为I帧或者P帧、B帧等，通常是由编码器根据算法来确定的，而在本申请实施例中，为了进一步控制视频的压缩率，还可以通过对编码器进行干预，来减少I帧的数量，增加B帧的数量。具体实现时，可以根据每个切片中包括的组合帧的数量，控制帧间编码过程中的关键帧间隔，以便减少同一切片中被编码成关键帧的帧数。例如，假设每个切片中包括360个组合帧，则可以将关键帧间隔设为360帧或者180帧，也即，使得同一切片中仅有一帧或两帧会被编码为I帧。另外，对于关键帧之外的组合帧，还可以通过调低对双向参考帧的判断阈值，来增加同一切片中被编码成双向参考帧的帧数。也就是说，对于B帧而言，编码器通常会通过计算当前帧与前后帧之间的相似度，并与某个阈值进行比较之后，确定当前帧是否可以被编码为B帧，在本申请实施例中，可以将该阈值调低，则可以将更多的帧编码为B帧，以提升压缩率。In practical applications, which frames will be encoded as I frames, P frames, B frames, etc. are usually determined by the encoder according to the algorithm. In the embodiment of the present application, in order to further control the compression rate of the video, the number of I frames can be reduced and the number of B frames can be increased by intervening in the encoder. In specific implementation, the key frame interval in the inter-frame encoding process can be controlled according to the number of combined frames included in each slice, so as to reduce the number of frames encoded as key frames in the same slice. For example, assuming that each slice includes 360 combined frames, the key frame interval can be set to 360 frames or 180 frames, that is, only one or two frames in the same slice will be encoded as I frames. In addition, for combined frames other than key frames, the number of frames encoded as bidirectional reference frames in the same slice can also be increased by lowering the judgment threshold of bidirectional reference frames. That is to say, for B frames, the encoder usually determines whether the current frame can be encoded as a B frame by calculating the similarity between the current frame and the previous and next frames and comparing it with a certain threshold. In the embodiment of the present application, the threshold can be lowered, and more frames can be encoded as B frames to improve the compression rate.

这里需要说明的是，由于P帧、B帧的解码依赖于I帧，而B帧的解码会依赖于前一帧及后一帧，因此，理论上而言，如果I帧数量较少，B帧的数量比较多，虽然在压缩率上会有更好的表现，但是，在解码时可能会影响图像质量。然而，在本申请实施例中，由于对多视角的视频帧进行了分组拼接，并将相邻视角的视频帧位于相邻组合帧的相同位置，因此，使得每两个相邻组合帧之间都会具有相似度比较高的特点，在该前提下，即使通过上述方式控制了I帧以及B帧的数量，通常也不会影响解码端的图像质量。经测试，本申请实施例提供的方案与简单拼接后进行编码的方案相比，分辨率、码率都得到了明显降低，PSNR(Peak Signal-to-Noise Ratio，峰值信噪比，表示信号最大可能功率和影响它的表示精度的破坏性噪声功率的比值，是衡量图像质量的指标之一)反而得到了提升，具体可以如表1所示：It should be noted here that since the decoding of P frames and B frames depends on I frames, and the decoding of B frames depends on the previous frame and the next frame, theoretically, if the number of I frames is small and the number of B frames is large, although there will be better performance in terms of compression rate, it may affect the image quality during decoding. However, in the embodiment of the present application, since the multi-perspective video frames are grouped and spliced, and the video frames of adjacent perspectives are located in the same position of adjacent combined frames, each two adjacent combined frames have a high degree of similarity. Under this premise, even if the number of I frames and B frames is controlled in the above manner, it usually does not affect the image quality of the decoding end. After testing, the solution provided by the embodiment of the present application has significantly reduced resolution and bit rate compared with the solution of encoding after simple splicing, and PSNR (Peak Signal-to-Noise Ratio, which represents the ratio of the maximum possible power of the signal to the destructive noise power that affects its representation accuracy, and is one of the indicators for measuring image quality) has been improved, as shown in Table 1:

表1
Table 1

当然，在实际应用中，如果需要更高的画面质量，则也可以适当增加I帧的数量，减少B帧的数量，例如，每个分片中可以包括2个或者更多的I帧，等等。Of course, in practical applications, if higher picture quality is required, the number of I frames can be appropriately increased to reduce the number of frames. Reduce the number of B frames, for example, each slice can include 2 or more I frames, and so on.

以上对多视角视频的压缩、传输等方面进行了详细介绍。在实际应用中，假设某用户通过客户端发起对“模特秀场”的访问，则可以将具体的多视角视频传输到客户端，客户端可以进行下载，并且可以使用通用的视频解码器进行解码。之后，可以将其中某个视角(通常可以将其中一视角设为默认视角)下的视频帧添加到预置的3D空间场景模型中。其中，3D空间场景模型可以是预先通过3D建模的方式生成的，也即，客户端可以对3D空间场景模型以及多视角视频分别进行渲染，并将其中一视角对应的视频内容添加到3D空间场景模型中展示。The above provides a detailed introduction to the compression and transmission of multi-view videos. In actual applications, assuming that a user initiates access to the "model show" through a client, the specific multi-view video can be transmitted to the client, which can download it and use a general video decoder to decode it. Afterwards, the video frames from one of the perspectives (usually one of the perspectives can be set as the default perspective) can be added to a preset 3D space scene model. Among them, the 3D space scene model can be pre-generated by 3D modeling, that is, the client can render the 3D space scene model and the multi-view video separately, and add the video content corresponding to one of the perspectives to the 3D space scene model for display.

具体实现时，为了使得视频内容与3D空间场景模型之间具有更好的相互融合的效果，还可以将场景视角与人物视角进行同步渲染，视角切换时也可以进行同步切换，另外，还可以实现人物与场景的光影融合，等等。例如，可以在3D空间场景中添加光照信息，然后根据光照信息计算出人物的影子位置、光线位置等，并在3D空间场景中添加影子信息、光线信息，等等，通过这种方式使得人物与场景更好的融合。In specific implementation, in order to achieve a better mutual integration effect between the video content and the 3D space scene model, the scene perspective and the character perspective can be rendered synchronously, and the perspective can also be switched synchronously. In addition, the light and shadow integration of the character and the scene can be achieved, etc. For example, lighting information can be added to the 3D space scene, and then the character's shadow position, light position, etc. can be calculated based on the lighting information, and shadow information, light information, etc. can be added to the 3D space scene, so that the character and the scene can be better integrated in this way.

总之，通过本申请实施例，在需要为用户提供“模特秀场”等功能时，可以通过多个相机设备从不同视角、对真实人物在穿着目标服饰状态下对所述目标服饰进行展示的过程进行同步拍摄，得到多个真实视角的视频内容，之后，可以从所述视频内容包含的多个视频帧中分别提取出其中包含的人物图像，再利用基于图像的渲染技术，根据相邻真实视角在同一时间点的人物图像之间的像素偏移关系信息，为所述相邻真实视角之间的多个中间视角生成对应时间点的人物图像。然后，可以根据所述多个真实视角以及所述多个中间视角在多个时间点上的人物图像生成多视角视频，以便在客户端将其中一视角的人物图像视频内容添加到预置的虚拟3D空间场景模型中进行渲染展示，以提供动态数字人在所述虚拟3D空间场景中对所述目标服饰进行展示的内容，并提供用于模拟连续视角切换的互动效果。通过这种方式，无需对模特人物进行显式建模，只需利用多视角拍摄的图像即可实现类3D的真人数字人效果，避免了复杂的建模流程和高昂的建模成本，从而可以获得低成本、高保真、可实时互动的动态真人数字人效果。并且，由于本申请实施例中建立起的真人数字人资产的格式为普通的HEVC等格式的视频，因此，可被大多数移动设备解析，并且避免了复杂的渲染流程和较大的计算开销。In summary, through the embodiments of the present application, when it is necessary to provide users with functions such as "model show", multiple camera devices can be used to synchronously shoot the process of real people wearing target clothing and displaying the target clothing from different perspectives to obtain video content from multiple real perspectives. After that, the character images contained in the multiple video frames contained in the video content can be extracted respectively, and then the image-based rendering technology is used to generate character images at corresponding time points for multiple intermediate perspectives between the adjacent real perspectives according to the pixel offset relationship information between the character images of adjacent real perspectives at the same time point. Then, a multi-perspective video can be generated based on the multiple real perspectives and the character images of the multiple intermediate perspectives at multiple time points, so that the character image video content of one of the perspectives can be added to the preset virtual 3D space scene model on the client for rendering and display, so as to provide the content of the dynamic digital human displaying the target clothing in the virtual 3D space scene, and provide an interactive effect for simulating continuous perspective switching. In this way, there is no need to explicitly model the model, and only the images taken from multiple perspectives can be used to achieve a 3D-like real-life digital human effect, avoiding the complex modeling process and high modeling cost, so as to obtain a low-cost, high-fidelity, real-time interactive dynamic real-life digital human effect. In addition, since the format of the real-life digital human asset established in the embodiment of the present application is a video in a common HEVC format, it can be parsed by most mobile devices, and avoids the complex rendering process and large computing overhead.

另外，对于多个视角对应的多份视频内容，具体在进行视频编码时，可以以时间点为单位对所述多个视角对应的多个视频帧进行拼接处理，得到由多个组合帧形成的帧序列，并且，将该时间点的多个视角对应的多个视频帧划分为多个集合，每个集合中的多个视频帧拼接为一个组合帧，且使得相邻视角的视频帧位于相邻组合帧的相同位置。之后，可以利用通用的视频编码器对所述多个组合帧形成的帧序列进行编码，并通过对所述多个组合帧进行帧间压缩处理，消除或减少相邻视角的视频帧间的冗余信息。这样，由于对多个视角的视频帧进行了分组拼接，因此，使得拼接后的组合帧的分辨率不至于过高，便于在大部分终端设备中进行实时解码；另外，由于在进行分组拼接时，对分组方式以及排列方式进行了控制，使得相邻视角的视频帧位于相邻组合帧的相同位置，也就是说，使得相邻视角的视频帧位于不同但是相邻的组合帧中，且在不同组合帧中的位置相同，而相邻视角的视频帧之间具有相似度比较高的特点，因此，通过这种方式拼接成的相邻组合帧之间具有比较高的相似度，进而通过通用的帧间压缩算法即可通过消除或减少相邻视角的视频帧间的冗余信息，而获得较高的压缩率。换言之，在本申请实施例中，通过通用的视频编码器即可获得理想的压缩率，相应的，在解码端利用通用的解码器即可完成解码，从而可以在更多的终端设备上得到支持。In addition, for multiple video contents corresponding to multiple perspectives, specifically when performing video encoding, the multiple video frames corresponding to the multiple perspectives can be spliced in units of time points to obtain a frame sequence formed by multiple combined frames, and the multiple video frames corresponding to the multiple perspectives at the time point are divided into multiple sets, and the multiple video frames in each set are spliced into a combined frame, and the video frames of adjacent perspectives are located at the same position of adjacent combined frames. Afterwards, a general video encoder can be used to encode the frame sequence formed by the multiple combined frames, and the redundant information between the video frames of adjacent perspectives can be eliminated or reduced by performing inter-frame compression processing on the multiple combined frames. In this way, since the video frames of multiple perspectives are grouped and spliced, the resolution of the spliced combined frame is not too high, which is convenient for real-time decoding in most terminal devices; in addition, since the grouping method and the arrangement method are controlled when performing grouping and splicing, the video frames of adjacent perspectives are located at the same position of adjacent combined frames, that is, the adjacent perspectives are The video frames of the two angles are located in different but adjacent combined frames, and have the same position in different combined frames, and the video frames of adjacent viewing angles have a relatively high similarity. Therefore, the adjacent combined frames spliced in this way have a relatively high similarity, and then a general inter-frame compression algorithm can be used to eliminate or reduce the redundant information between the video frames of adjacent viewing angles, thereby obtaining a higher compression rate. In other words, in the embodiment of the present application, an ideal compression rate can be obtained by a general video encoder, and correspondingly, decoding can be completed by using a general decoder at the decoding end, so that it can be supported on more terminal devices.

实施例二Embodiment 2

该实施例二是与实施例一相对应的，从客户端的角度，提供了一种基于动态数字人形象进行信息展示的方法，参见图4，该方法可以包括：The second embodiment corresponds to the first embodiment and provides a method for displaying information based on a dynamic digital human image from the perspective of a client. Referring to FIG. 4 , the method may include:

S401：响应于用户发起的查看请求，获取多视角视频，所述多视角视频通过以下方式生成：通过多个相机设备从不同视角、对真实人物在穿着目标服饰状态下对所述目标服饰进行展示的过程进行同步拍摄得到多个真实视角的视频内容，对所述多个真实视角的视频内容包含的多个视频帧中分别提取出其中包含的人物图像，利用基于图像的渲染技术，为相邻真实视角之间的多个中间视角生成人物图像，并根据所述多个真实视角以及所述多个中间视角在多个时间点上的人物图像生成所述多视角视频；S401: In response to a viewing request initiated by a user, a multi-view video is obtained, wherein the multi-view video is generated in the following manner: a process of a real person wearing a target clothing and displaying the target clothing is synchronously photographed by multiple camera devices from different perspectives to obtain video contents of multiple real perspectives, character images contained in multiple video frames contained in the video contents of the multiple real perspectives are respectively extracted, character images are generated for multiple intermediate perspectives between adjacent real perspectives using an image-based rendering technology, and the multi-view video is generated according to the multiple real perspectives and the character images of the multiple intermediate perspectives at multiple time points;

S402：对所述多视角视频进行解码；S402: Decoding the multi-view video;

S403：将所述多视角视频匹配到预置的虚拟3D空间场景模型中，以提供动态数字人形象在所述虚拟3D空间场景中对所述目标服饰进行展示的内容；S403: Matching the multi-view video to a preset virtual 3D space scene model to provide a dynamic digital human image displaying the target clothing in the virtual 3D space scene;

S404：响应于连续视角切换的交互操作，通过切换到其他视角的人物图像视频内容，提供模拟连续视角切换的互动效果。S404: In response to the interactive operation of continuous perspective switching, an interactive effect simulating continuous perspective switching is provided by switching to character image video content of other perspectives.

实施例三Embodiment 3

以上实施例一、二主要针对商品信息服务系统中需要为用户提供“模特秀场”等功能为例，对具体的实现方案进行了介绍。而在实际应用中，本申请实施例提供的生产动态真人数字人的方式也可以在其他应用场景中来使用。为此，该实施例三还提供了一种生成动态数字人形象的方法，参见图5，该方法可以包括：The above embodiments 1 and 2 mainly take the example of providing users with functions such as "model show" in the commodity information service system, and introduce the specific implementation scheme. In actual applications, the method of producing dynamic real-life digital people provided in the embodiment of this application can also be used in other application scenarios. To this end, this embodiment 3 also provides a method for generating a dynamic digital human image, see Figure 5, the method may include:

S501：获得多个真实视角的视频内容，所述多个真实视角的视频内容是通过多个相机设备从不同视角对真实人物执行目标动作的过程进行同步拍摄得到的。S501: Obtain video contents from multiple real perspectives, where the video contents from multiple real perspectives are obtained by synchronously shooting a process in which a real person performs a target action from different perspectives using multiple camera devices.

具体的目标动作可以根据实际场景中的需求而定，例如，可以是表演某个舞蹈动作，等等。The specific target action can be determined according to the needs of the actual scene, for example, it can be performing a certain dance move, etc.

S502：从所述视频内容包含的多个视频帧中分别提取出其中包含的人物图像。S502: Extracting character images contained in the plurality of video frames respectively from the plurality of video frames contained in the video content.

S503：利用基于图像的渲染技术，根据相邻真实视角在同一时间点的人物图像之间的像素偏移关系信息，为所述相邻真实视角之间的多个中间视角生成对应时间点的人物图像。S503: Using image-based rendering technology, according to pixel offset relationship information between character images at the same time point in adjacent real perspectives, character images at corresponding time points are generated for multiple intermediate perspectives between the adjacent real perspectives.

S504：根据所述多个真实视角以及所述多个中间视角在多个时间点上的人物图像，生成多视角视频，以便通过所述多视角视频进行对应的动态数字人形象的展示。S504: Generate a multi-perspective video according to the multiple real perspectives and the character images of the multiple intermediate perspectives at multiple time points, so as to display the corresponding dynamic digital human image through the multi-perspective video.

通过上述方式生产出的动态数字人形象的相关数据，可以以多视角视频的形式存在，多视角视频则是由多个视角对应的普通视频格式的文件组成，因此，这种真人数字人具有端侧友好的特点，可以在多种场景中进行展示，包括与某种3D空间场景模型进行融合后展示，从而呈现出具体的数字人在具体的3D空间场景中做出对应动作的视角效果，并且，用户可以进行连续的视角切换或者视角缩放，等等。The data related to the dynamic digital human image produced by the above method can exist in the form of multi-view video, which is composed of files in ordinary video format corresponding to multiple viewpoints. Therefore, this kind of real digital human has The terminal-side friendly feature allows for display in a variety of scenarios, including display after integration with a certain 3D space scene model, thereby presenting the perspective effect of a specific digital person performing corresponding actions in a specific 3D space scene. In addition, users can perform continuous perspective switching or perspective zooming, etc.

实施例四Embodiment 4

在该实施例四中，主要对将多视角视频匹配到虚拟3D空间场景模型中的方式进行保护，具体的，该实施例四提供了一种基于动态数字人形象进行服饰信息展示的方法，参见图6，该方法可以包括：In the fourth embodiment, the method of matching a multi-view video to a virtual 3D space scene model is mainly protected. Specifically, the fourth embodiment provides a method for displaying clothing information based on a dynamic digital human image. Referring to FIG. 6 , the method may include:

S601：响应于通过动态数字人对目标服饰进行展示的请求，获得虚拟3D空间场景模型，以及通过多视角视频的形式表达的动态数字人形象，其中，所述多视角视频用于从多个视角对目标人物在穿着目标服饰状态下对所述目标服饰进行展示的过程进行展示；S601: In response to a request for displaying a target garment through a dynamic digital human, a virtual 3D space scene model and a dynamic digital human image expressed in the form of a multi-view video are obtained, wherein the multi-view video is used to display a process of displaying the target garment by the target person in a state of wearing the target garment from multiple viewpoints;

S602：将所述多视角视频匹配到所述虚拟3D空间场景模型中，以提供动态数字人形象在所述虚拟3D空间场景中对所述目标服饰进行展示的内容，并提供用于模拟连续视角切换的互动效果。S602: Matching the multi-view video to the virtual 3D space scene model to provide content in which a dynamic digital human image displays the target clothing in the virtual 3D space scene, and providing an interactive effect for simulating continuous viewpoint switching.

在该实施例四中，具体的多视角视频可以通过前述实施例中提供的方式生成，或者，也可以通过其他方式生成，例如，可以直接通过密集部署更多摄像机的方式生成多视角视频，等等。其中，在具体在将多视角视频匹配到虚拟3D空间场景模型中时，可以通过3D渲染引擎来实现，当然，在具体实现时，可以在通用的3D渲染引擎基础上进行一些功能定制，例如，实现多个不同视角的阴影一致性，等等。In the fourth embodiment, the multi-view video can be generated by the method provided in the above embodiment, or can be generated by other methods, for example, the multi-view video can be generated directly by densely deploying more cameras, etc. Specifically, when matching the multi-view video to the virtual 3D space scene model, it can be achieved through a 3D rendering engine. Of course, in the specific implementation, some functional customization can be performed on the basis of a general 3D rendering engine, for example, achieving shadow consistency of multiple different perspectives, etc.

关于该实施例二至四中的未详述部分内容，可以参见实施例一以及本说明书其他部分的记载，这里不再赘述。For the contents not described in detail in Embodiments 2 to 4, please refer to Embodiment 1 and other parts of this specification, which will not be repeated here.

需要说明的是，本申请实施例中可能会涉及到对用户数据的使用，在实际应用中，可以在符合所在国的适用法律法规要求的情况下(例如，用户明确同意，对用户切实通知，等)，在适用法律法规允许的范围内在本文描述的方案中使用用户特定的个人数据。It should be noted that the embodiments of the present application may involve the use of user data. In actual applications, user-specific personal data can be used in the scheme described herein within the scope permitted by applicable laws and regulations, subject to the requirements of applicable laws and regulations of the country where the user is located (for example, with the user's explicit consent, effective notification to the user, etc.).

与实施例一相对应，本申请实施例还提供了一种基于动态数字人形象进行信息展示的装置，该装置可以包括：Corresponding to the first embodiment, the embodiment of the present application further provides a device for displaying information based on a dynamic digital human image, which may include:

多视角视频生存和单元，用于根据所述多个真实视角以及所述多个中间视角在多个时间点上的人物图像生成多视角视频，以便在客户端将所述多视角视频匹配到预置的虚拟3D空间场景模型中，以提供动态数字人形象在所述虚拟3D空间场景中对所述目标服饰进行展示的内容，并提供用于模拟连续视角切换的互动效果。A multi-view video storage and unit is used to generate a multi-view video according to the multiple real view angles and the character images of the multiple intermediate view angles at multiple time points, so as to match the multi-view video to a preset virtual 3D space scene model on the client side, so as to provide a dynamic digital human image to perform the target clothing in the virtual 3D space scene. The content displayed and provide interactive effects for simulating continuous perspective switching.

具体的，所述视角合成单元具体可以用于：Specifically, the perspective synthesis unit may be used for:

更为具体的，所述视角合成单元具体可以用于：More specifically, the perspective synthesis unit may be used for:

另外，该装置还可以包括：In addition, the device may also include:

视频压缩单元，用于对所述多视角视频进行压缩处理，以用于向客户端进行传输。The video compression unit is used to compress the multi-view video for transmission to a client.

其中，所述视频压缩单元具体可以包括：The video compression unit may specifically include:

拼接处理子单元，用于以时间点为单位对多个视角对应的视频帧进行拼接处理，得到由多个组合帧形成的帧序列；其中，对于同一时间点，将该时间点的多个视角对应的多个视频帧划分为多个集合，每个集合中的多个视频帧拼接为一个组合帧，且使得相邻视角的视频帧位于相邻组合帧的相同位置；A splicing processing subunit is used to splice video frames corresponding to multiple viewpoints in units of time points to obtain a frame sequence formed by multiple combined frames; wherein, for the same time point, the multiple video frames corresponding to the multiple viewpoints at the time point are divided into multiple sets, and the multiple video frames in each set are spliced into a combined frame, and the video frames of adjacent viewpoints are located at the same position of adjacent combined frames;

帧间压缩子单元，用于利用通用的视频编码器对所述多个组合帧形成的帧序列进行编码，并对所述多个组合帧进行帧间压缩处理。The inter-frame compression subunit is used to encode the frame sequence formed by the multiple combined frames using a universal video encoder, and perform inter-frame compression processing on the multiple combined frames.

另外，该装置还可以包括：In addition, the device may also include:

分片处理单元，用于在对多个组合帧形成的帧序列进行编码及帧间压缩处理之后，还对帧序列进行切片处理，以便以切片后得到的片段为单位进行传输，在接收端以片段为单位进行独立的解码播放。The slicing processing unit is used to slice the frame sequence formed by multiple combined frames after encoding and inter-frame compression, so that the frame sequence can be transmitted in units of the obtained segments after slicing, and independently decoded and played in units of segments at the receiving end.

再者，还可以包括：Furthermore, it may also include:

关键帧数量控制单元，用于根据每个切片中包括的组合帧的数量，控制帧间编码过程中的关键帧间隔，以便减少同一切片中被编码成关键帧的帧数。The key frame quantity control unit is used to control the key frame interval in the inter-frame coding process according to the quantity of combined frames included in each slice, so as to reduce the number of frames encoded as key frames in the same slice.

双向参考帧数量控制单元，用于对于关键帧之外的组合帧，通过调低对双向参考帧的判断阈值，增加同一切片中被编码成双向参考帧的帧数。The bidirectional reference frame quantity control unit is used to increase the number of frames encoded as bidirectional reference frames in the same slice by lowering the judgment threshold of the bidirectional reference frames for the combined frames other than the key frames.

与实施例二相对应，本申请实施例还提供了一种基于动态数字人形象进行信息展示的装置，该装置可以包括：Corresponding to the second embodiment, the embodiment of the present application further provides a device for displaying information based on a dynamic digital human image, which may include:

多视角视频获取单元，用于响应于用户发起的查看请求，获取多视角视频，所述多视角视频通过以下方式生成：通过多个相机设备从不同视角、对真实人物在穿着目标服饰状态下对所述目标服饰进行展示的过程进行同步拍摄得到多个真实视角的视频内容，对所述多个真实视角的视频内容包含的多个视频帧中分别提取出其中包含的人物图像，利用基于图像的渲染技术，为相邻真实视角之间的多个中间视角生成人物图像，并根据所述多个真实视角以及所述多个中间视角在多个时间点上的人物图像生成所述多视角视频；The multi-view video acquisition unit is used to respond to a viewing request initiated by a user and acquire a multi-view video, wherein the multi-view video is generated by: synchronously shooting a process in which a real person wears a target garment and displays the target garment from different perspectives through multiple camera devices to obtain video contents of multiple real perspectives, and Extracting character images contained in multiple video frames contained in video contents of multiple real perspectives respectively, generating character images for multiple intermediate perspectives between adjacent real perspectives by using image-based rendering technology, and generating the multi-perspective video according to the multiple real perspectives and the character images of the multiple intermediate perspectives at multiple time points;

与实施例三相对应，本申请实施例还提供了一种生成动态数字人形象的装置，该装置可以包括：Corresponding to the third embodiment, the embodiment of the present application further provides a device for generating a dynamic digital human image, which may include:

动态数字人资产生成单元，用于根据所述多个真实视角以及所述多个中间视角在多个时间点上的人物图像，生成多视角视频，以便通过所述多视角视频进行对应的动态数字人形象的展示。The dynamic digital human asset generation unit is used to generate a multi-view video based on the character images of the multiple real perspectives and the multiple intermediate perspectives at multiple time points, so as to display the corresponding dynamic digital human image through the multi-view video.

与实施例四相对应，本申请实施例还提供了一种基于动态数字人形象进行服饰信息展示的装置，该装置可以包括：Corresponding to the fourth embodiment, the embodiment of the present application further provides a device for displaying clothing information based on a dynamic digital human image, and the device may include:

另外，本申请实施例还提供了一种计算机可读存储介质，其上存储有计算机程序，该程序被处理器执行时实现前述方法实施例中任一项所述的方法的步骤。In addition, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of any one of the methods in the aforementioned method embodiments are implemented.

以及一种电子设备，包括：And an electronic device, comprising:

一个或多个处理器；以及one or more processors; and

与所述一个或多个处理器关联的存储器，所述存储器用于存储程序指令，所述程序指令在被所述一个或多个处理器读取执行时，执行前述方法实施例中任一项所述的方法的步骤。 A memory associated with the one or more processors, the memory being used to store program instructions, wherein the program instructions, when read and executed by the one or more processors, execute the steps of the method described in any one of the aforementioned method embodiments.

其中，图7示例性的展示出了电子设备的架构，具体可以包括处理器710，视频显示适配器711，磁盘驱动器712，输入/输出接口713，网络接口714，以及存储器720。上述处理器710、视频显示适配器711、磁盘驱动器712、输入/输出接口713、网络接口714，与存储器720之间可以通过通信总线730进行通信连接。7 exemplarily shows the architecture of the electronic device, which may include a processor 710, a video display adapter 711, a disk drive 712, an input/output interface 713, a network interface 714, and a memory 720. The processor 710, the video display adapter 711, the disk drive 712, the input/output interface 713, the network interface 714, and the memory 720 may be communicatively connected via a communication bus 730.

其中，处理器710可以采用通用的CPU(Central Processing Unit，处理器)、微处理器、应用专用集成电路(Application Specific Integrated Circuit，ASIC)、或者一个或多个集成电路等方式实现，用于执行相关程序，以实现本申请所提供的技术方案。Among them, the processor 710 can be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc., to execute relevant programs to realize the technical solution provided in this application.

存储器720可以采用ROM(Read Only Memory，只读存储器)、RAM(Random Access Memory，随机存取存储器)、静态存储设备，动态存储设备等形式实现。存储器720可以存储用于控制电子设备700运行的操作系统721，用于控制电子设备700的低级别操作的基本输入输出系统(BIOS)。另外，还可以存储网页浏览器723，数据存储管理系统724，以及信息展示处理系统725等等。上述信息展示处理系统725就可以是本申请实施例中具体实现前述各步骤操作的应用程序。总之，在通过软件或者固件来实现本申请所提供的技术方案时，相关的程序代码保存在存储器720中，并由处理器710来调用执行。The memory 720 can be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc. The memory 720 can store an operating system 721 for controlling the operation of the electronic device 700, and a basic input and output system (BIOS) for controlling the low-level operation of the electronic device 700. In addition, a web browser 723, a data storage management system 724, and an information display processing system 725, etc. can also be stored. The above-mentioned information display processing system 725 can be an application program that specifically implements the operations of the aforementioned steps in the embodiment of the present application. In short, when the technical solution provided by the present application is implemented by software or firmware, the relevant program code is stored in the memory 720 and is called and executed by the processor 710.

输入/输出接口713用于连接输入/输出模块，以实现信息输入及输出。输入输出/模块可以作为组件配置在设备中(图中未示出)，也可以外接于设备以提供相应功能。其中输入设备可以包括键盘、鼠标、触摸屏、麦克风、各类传感器等，输出设备可以包括显示器、扬声器、振动器、指示灯等。The input/output interface 713 is used to connect the input/output module to realize information input and output. The input/output module can be configured in the device as a component (not shown in the figure), or it can be externally connected to the device to provide corresponding functions. The input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output device may include a display, a speaker, a vibrator, an indicator light, etc.

网络接口714用于连接通信模块(图中未示出)，以实现本设备与其他设备的通信交互。其中通信模块可以通过有线方式(例如USB、网线等)实现通信，也可以通过无线方式(例如移动网络、WIFI、蓝牙等)实现通信。The network interface 714 is used to connect to a communication module (not shown) to realize communication interaction between the device and other devices. The communication module can realize communication through a wired mode (such as USB, network cable, etc.) or a wireless mode (such as mobile network, WIFI, Bluetooth, etc.).

总线730包括一通路，在设备的各个组件(例如处理器710、视频显示适配器711、磁盘驱动器712、输入/输出接口713、网络接口714，与存储器720)之间传输信息。The bus 730 comprises a pathway for transmitting information between the various components of the device (eg, the processor 710, the video display adapter 711, the disk drive 712, the input/output interface 713, the network interface 714, and the memory 720).

需要说明的是，尽管上述设备仅示出了处理器710、视频显示适配器711、磁盘驱动器712、输入/输出接口713、网络接口714，存储器720，总线730等，但是在具体实施过程中，该设备还可以包括实现正常运行所必需的其他组件。此外，本领域的技术人员可以理解的是，上述设备中也可以仅包含实现本申请方案所必需的组件，而不必包含图中所示的全部组件。It should be noted that, although the above device only shows a processor 710, a video display adapter 711, a disk drive 712, an input/output interface 713, a network interface 714, a memory 720, a bus 730, etc., in the specific implementation process, the device may also include other components necessary for normal operation. In addition, it can be understood by those skilled in the art that the above device may also only include components necessary for implementing the solution of the present application, and does not necessarily include all the components shown in the figure.

通过以上的实施方式的描述可知，本领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件平台的方式来实现。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品可以存储在存储介质中，如ROM/RAM、磁碟、光盘等，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本申请各个实施例或者实施例的某些部分所述的方法。It can be known from the description of the above implementation methods that those skilled in the art can clearly understand that the present application can be implemented by means of software plus a necessary general hardware platform. Based on such an understanding, the technical solution of the present application can be essentially or partly contributed to the prior art in the form of a software product, which can be stored in a storage medium such as ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in the various embodiments of the present application or certain parts of the embodiments.

本说明书中的各个实施例均采用递进的方式描述，各个实施例之间相同相似的部分互相参见即可，每个实施例重点说明的都是与其他实施例的不同之处。尤其，对于系统或系统实施例而言，由于其基本相似于方法实施例，所以描述得比较简单，相关之处参见方法实施例的部分说明即可。以上所描述的系统及系统实施例仅仅是示意性的，其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性劳动的情况下，即可以理解并实施。The various embodiments in this specification are described in a progressive manner. The same or similar parts between the various embodiments can be referred to each other. Each embodiment focuses on the differences from other embodiments. As for the system embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant parts can refer to the partial description of the method embodiment. The system and system embodiment described above are only schematic, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment. Ordinary technicians in this field can understand and implement it without creative work.

以上对本申请所提供的基于动态数字人形象进行信息展示的方法及电子设备，进行了详细介绍，本文中应用了具体个例对本申请的原理及实施方式进行了阐述，以上实施例的说明只是用于帮助理解本申请的方法及其核心思想；同时，对于本领域的一般技术人员，依据本申请的思想，在具体实施方式及应用范围上均会有改变之处。综上所述，本说明书内容不应理解为对本申请的限制。 The above is a detailed introduction to the method and electronic device for information display based on dynamic digital human images provided by the present application. This article uses specific examples to illustrate the principles and implementation methods of the present application. The description of the above embodiments is only used to help understand the method and core ideas of the present application. At the same time, for those skilled in the art, according to the ideas of the present application, there will be changes in the specific implementation methods and application scope. In summary, the content of this specification should not be understood as limiting the present application.

Claims

A method for displaying information based on a dynamic digital human image, characterized by comprising:

Obtaining video contents from multiple real perspectives, wherein the video contents from multiple real perspectives are obtained by synchronously shooting a process in which a real person wears a target garment and displays the target garment from different perspectives using multiple camera devices;

Extracting the character images contained in the plurality of video frames respectively from the video content;

Using image-based rendering technology, based on pixel offset relationship information between character images at adjacent real perspectives at the same time point, character images at corresponding time points are generated for multiple intermediate perspectives between the adjacent real perspectives;

A multi-perspective video is generated based on the character images of the multiple real perspectives and the multiple intermediate perspectives at multiple time points, so that the multi-perspective video can be matched to a preset virtual 3D space scene model on the client, so as to provide content in which a dynamic digital human image displays the target clothing in the virtual 3D space scene, and provide an interactive effect for simulating continuous perspective switching.

The method according to claim 1, characterized in that

The video contents of the multiple real perspectives are obtained by synchronously shooting with multiple camera devices from different perspectives a process in which a real person walks in a target space wearing target clothing to display the target clothing.

The method according to claim 1, characterized in that

The step of generating character images at corresponding time points for a plurality of intermediate perspectives between the adjacent real perspectives includes:

By using image-based rendering technology, according to the pixel offset relationship information between the character images of adjacent real perspectives at the same time point, the pixel positions of multiple intermediate perspectives between the adjacent real perspectives at corresponding time points are estimated to generate the character images of the intermediate perspectives at multiple time points.

The method according to claim 3, characterized in that

The estimating pixel positions of a plurality of intermediate perspectives between the adjacent real perspectives at corresponding time points includes:

Taking the character images corresponding to adjacent real perspectives at the same time point as input, a dense optical flow field between the character images of adjacent real perspectives is fitted through a deep learning model, and the dense optical flow field is used to estimate the pixel positions of the character images of multiple intermediate perspectives between the adjacent real perspectives at corresponding time points.

The method according to claim 1, further comprising:

The multi-view video is compressed for transmission to a client.

The method according to claim 5, characterized in that

The compressing the multi-view video includes:

The video frames corresponding to the multiple perspectives are spliced in units of time points to obtain a frame sequence formed by multiple combined frames; wherein the multiple video frames corresponding to the multiple perspectives at the time point are divided into multiple sets, and the multiple video frames in each set are spliced into a combined frame, and the video frames of adjacent perspectives are located at the same position of adjacent combined frames;

A general video encoder is used to encode a frame sequence formed by the plurality of combined frames, and inter-frame compression processing is performed on the plurality of combined frames.

The method according to claim 6, characterized in that

The resolution of each combined frame is lower than the maximum resolution supported by the end device.

The method according to claim 6, further comprising:

After encoding and inter-frame compression processing are performed on a frame sequence formed by a plurality of combined frames, the frame sequence is further sliced so as to be transmitted in units of the obtained segments after slicing and independently decoded and played in units of the segments at the receiving end.

In response to a viewing request initiated by a user, a multi-view video is obtained, wherein the multi-view video is generated in the following manner: a process of a real person wearing a target clothing and displaying the target clothing is synchronously photographed by multiple camera devices from different perspectives to obtain video contents of multiple real perspectives, character images contained in multiple video frames contained in the video contents of the multiple real perspectives are respectively extracted, character images are generated for multiple intermediate perspectives between adjacent real perspectives using an image-based rendering technology, and the multi-view video is generated according to the multiple real perspectives and the character images of the multiple intermediate perspectives at multiple time points;

Decoding the multi-view video;

Matching the multi-view video to a preset virtual 3D space scene model to provide a dynamic digital human image displaying the target clothing in the virtual 3D space scene;

In response to the interactive operation of continuous perspective switching, an interactive effect simulating continuous perspective switching is provided by switching to character image video content of other perspectives.

A method for generating a dynamic digital human image, characterized by comprising:

Obtaining video contents from multiple real perspectives, wherein the video contents from multiple real perspectives are obtained by synchronously shooting a process in which a real person performs a target action from different perspectives using multiple camera devices;

A multi-perspective video is generated according to the multiple real perspectives and the character images of the multiple intermediate perspectives at multiple time points, so as to display the corresponding dynamic digital human image through the multi-perspective video.

A method for displaying clothing information based on a dynamic digital human image, characterized by comprising:

In response to a request for displaying the target clothing through a dynamic digital human, a virtual 3D space scene model and a dynamic digital human image expressed in the form of a multi-view video are obtained, wherein the multi-view video is used to display the process of the target person displaying the target clothing when wearing the target clothing from multiple perspectives;

The multi-view video is matched to the virtual 3D space scene model to provide content in which a dynamic digital human image displays the target clothing in the virtual 3D space scene, and to provide an interactive effect for simulating continuous viewpoint switching.

A computer-readable storage medium having a computer program stored thereon, characterized in that when the program is executed by a processor, the steps of the method described in any one of claims 1 to 11 are implemented.

An electronic device, comprising:

one or more processors; and

A memory associated with the one or more processors, the memory being used to store program instructions, wherein when the program instructions are read and executed by the one or more processors, the steps of the method according to any one of claims 1 to 11 are executed.