JP7538574B1 - Video creation device, video creation method, video creation program, and video creation system

Info

Publication number: JP7538574B1
Application number: JP2024068392A
Authority: JP
Inventors: 史睦川口
Original assignee: Individual
Current assignee: Individual
Priority date: 2024-04-19
Filing date: 2024-04-19
Publication date: 2024-08-22
Anticipated expiration: 2044-04-19
Also published as: WO2025220279A1; JP2025164419A

Abstract

The present invention provides a technique for generating video in a format desired by a user.
SOLUTION
The video generation device comprises an acquisition unit that acquires a video file, an analysis unit that acquires an analysis result including time-series speech information, which is text information obtained by transcribing speech included in the video file in chronological order, an editing plan unit that transmits the time-series speech information and command information that instructs an editing plan to be output for editing the time-series speech information in accordance with desired editing policy information to an interactive AI using a language model and receives the editing plan from the interactive AI, and a video editing unit that edits the video file in accordance with the editing plan and generates an edited video.
[Selected Figure] Figure 1

Description

本発明は、動画生成装置、動画生成方法、動画生成プログラムおよび動画生成システムに関するものである。 The present invention relates to a video generation device, a video generation method, a video generation program, and a video generation system.

特許文献１には、「ユーザの要求に応じて、ユーザが所望する長さで、かつ、ユーザが視聴したい部分の内容を含む編集後の動画データを生成する編集動画生成部、を備える動画編集装置」が記載されている。 Patent document 1 describes a video editing device that includes an edited video generation unit that generates edited video data in accordance with a user's request, the edited video data having a length desired by the user and including the content of the portion that the user wishes to view.

JP 2021-087180 A

上記技術は、動画データを所定の長さで分割した各区間を識別するためのインデックス情報を入力することで、ユーザの視聴したい部分を特定し、その内容を含む編集後の動画データを生成するというものであるため、生成される動画の構成や見栄えについては、考慮されるとは限らない。 The above technology involves inputting index information to identify each section of video data divided into predetermined lengths, identifying the part the user wants to watch, and generating edited video data that includes that content, so the structure or appearance of the generated video is not necessarily taken into consideration.

本発明の目的は、ユーザが望む態様の動画を生成することにある。 The objective of the present invention is to generate videos in the format desired by the user.

本願は、上記課題の少なくとも一部を解決する手段を複数含んでいるが、その例を挙げるならば、以下のとおりである。本発明の一態様に係る動画生成装置は、動画ファイルを取得する取得部と、前記動画ファイルに含まれる発話を時系列に書き起こしたテキスト情報である時系列発話情報を含む解析結果を取得する解析部と、前記時系列発話情報と、前記時系列発話情報を所望の編集方針情報に従って編集する編集計画を出力するように指示する命令情報とを、言語モデルを用いた対話型ＡＩに送信し、前記対話型ＡＩから前記編集計画を受信する編集計画部と、前記編集計画に沿って前記動画ファイルを編集し、編集動画を生成する動画編集部と、を有することを特徴とする。 The present application includes multiple means for solving at least part of the above problems, examples of which are as follows: A video generation device according to one aspect of the present invention is characterized by having an acquisition unit that acquires a video file, an analysis unit that acquires an analysis result including time-series speech information, which is text information obtained by transcribing speech included in the video file in a time series, an editing plan unit that transmits the time-series speech information and command information that instructs an editing plan to be output for editing the time-series speech information according to desired editing policy information to an interactive AI using a language model and receives the editing plan from the interactive AI, and a video editing unit that edits the video file according to the editing plan and generates an edited video.

また、上記の動画生成装置において、前記編集計画部は、前記命令情報に、前記編集計画により得られる前記編集動画についての制約条件を含めるものであってもよい。 In addition, in the above video generation device, the editing plan unit may include, in the command information, constraints on the edited video obtained by the editing plan.

また、上記の動画生成装置において、前記編集計画部は、前記命令情報に、前記編集計画により得られる前記編集動画についての構成情報を含めるものであってもよい。 In addition, in the video generation device described above, the editing plan unit may include, in the command information, configuration information about the edited video obtained by the editing plan.

また、上記の動画生成装置において、前記編集計画部は、前記命令情報に、前記編集計画により得られる前記編集動画に付加すべき動画、静止画または音声の指定を含めるものであってもよい。 In addition, in the video generation device described above, the editing plan unit may include in the command information a specification of a video, still image, or audio to be added to the edited video obtained by the editing plan.

また、上記の動画生成装置において、前記編集計画部は、前記命令情報に、前記編集計画により得られる前記編集動画において用いる視覚効果の指定を含めるものであってもよい。 In addition, in the video generation device described above, the editing plan unit may include in the command information a specification of visual effects to be used in the edited video obtained by the editing plan.

また、上記の動画生成装置において、前記編集計画には、前記動画ファイル内の経過時間軸上の開始位置と終了位置を指定した部分的な動画をつなぎ合わせて前記編集動画を構成する情報が含まれ、前記動画編集部は、前記部分的な動画を前記動画ファイルから切り出してつなぎ合わせることで前記編集動画を生成するものであってもよい。 In addition, in the above video generation device, the editing plan may include information for constructing the edited video by connecting together partial videos whose start and end positions on a time axis within the video file are specified, and the video editing unit may generate the edited video by extracting the partial videos from the video file and connecting them together.

また、上記の動画生成装置において、前記編集計画には、前記動画ファイル内の経過時間軸上の開始位置と終了位置を指定した部分的な動画をつなぎ合わせて前記編集動画を構成する情報、および前記部分的な動画の前後に付加すべき動画、静止画または音声の指定が含まれ、前記動画編集部は、前記部分的な動画を前記動画ファイルから切り出してつなぎ合わせ、付加すべき前記動画、静止画または音声を付加することで前記編集動画を生成するものであってもよい。 In the video generation device described above, the editing plan may include information for linking together partial videos, each of which has a specified start position and end position on a time axis within the video file, to create the edited video, and designation of video, still images, or audio to be added before and after the partial videos, and the video editing unit may generate the edited video by cutting out the partial videos from the video file, linking them together, and adding the video, still images, or audio to be added.

また、上記の動画生成装置において、前記編集計画には、前記動画ファイル内の経過時間軸上の開始位置と終了位置を指定した部分的な動画をつなぎ合わせて前記編集動画を構成する情報、および前記部分的な動画のつなぎ目に用いる視覚効果の指定が含まれ、前記動画編集部は、前記部分的な動画を前記動画ファイルから切り出してつなぎ合わせ、該つなぎ目に指定された前記視覚効果を適用することで前記編集動画を生成するものであってもよい。 In the video generation device described above, the editing plan may include information for stitching together partial videos, each of which has a specified start position and end position on a time axis within the video file, to create the edited video, and a specification of visual effects to be used at the seams of the partial videos, and the video editing unit may generate the edited video by cutting out the partial videos from the video file, stitching them together, and applying the specified visual effects to the seams.

また、上記の動画生成装置において、前記解析部は、前記動画ファイルに含まれる発話音声を時系列を維持しながら早送り編集し、所定の音声テキスト変換部に受け渡して前記時系列発話情報を得るものであってもよい。 In addition, in the above video generation device, the analysis unit may fast-forward edit the speech included in the video file while maintaining the time series, and pass it to a predetermined speech-to-text conversion unit to obtain the time series speech information.

In addition, in the video generation device described above, the analysis unit may identify the speakers of the speech audio included in the video file, extract the speech audio of each speaker while maintaining the time series, pass the extracted audio to a predetermined speech-to-text conversion unit, and integrate the resulting text information to obtain the time-series speech information.

また、上記の動画生成装置において、前記編集計画は、所定のフォーマット言語により記述され、前記編集計画部は、前記命令情報に、前記編集計画を記述する前記フォーマット言語についての定義情報を含めるものであってもよい。 In the above video generation device, the editing plan may be described in a predetermined format language, and the editing plan unit may include definition information about the format language that describes the editing plan in the command information.

また、本発明の別の態様にかかる動画生成方法は、動画生成装置を用いた動画生成方法であって、前記動画生成装置は、プロセッサを備え、前記プロセッサは、動画ファイルを取得する取得ステップと、前記動画ファイルに含まれる発話を時系列に書き起こしたテキスト情報である時系列発話情報を含む解析結果を取得する解析ステップと、前記時系列発話情報と、前記時系列発話情報を所望の編集方針情報に従って編集する編集計画を出力するように指示する命令情報とを、言語モデルを用いた対話型ＡＩに送信し、前記対話型ＡＩから前記編集計画を受信する編集計画ステップと、前記編集計画に沿って前記動画ファイルを編集し、編集動画を生成する動画編集ステップと、を実施することを特徴とする。 In addition, a video generation method according to another aspect of the present invention is a video generation method using a video generation device, the video generation device having a processor, the processor is characterized by carrying out the following steps: an acquisition step of acquiring a video file; an analysis step of acquiring an analysis result including time-series speech information, which is text information obtained by transcribing speech included in the video file in a time series; an editing plan step of transmitting the time-series speech information and command information instructing an editing plan for editing the time-series speech information according to desired editing policy information to an interactive AI using a language model and receiving the editing plan from the interactive AI; and a video editing step of editing the video file according to the editing plan and generating an edited video.

また、本発明の別の態様にかかる動画生成プログラムは、情報処理装置に動画を生成させる動画生成プログラムであって、前記情報処理装置は、プロセッサを備え、前記プロセッサに、動画ファイルを取得する取得ステップと、前記動画ファイルに含まれる発話を時系列に書き起こしたテキスト情報である時系列発話情報を含む解析結果を取得する解析ステップと、前記時系列発話情報と、前記時系列発話情報を所望の編集方針情報に従って編集する編集計画を出力するように指示する命令情報とを、言語モデルを用いた対話型ＡＩに送信し、前記対話型ＡＩから前記編集計画を受信する編集計画ステップと、前記編集計画に沿って前記動画ファイルを編集し、編集動画を生成する動画編集ステップと、を実施させることを特徴とする。 In addition, a video generation program according to another aspect of the present invention is a video generation program that causes an information processing device to generate a video, the information processing device having a processor, and causes the processor to execute the following steps: an acquisition step of acquiring a video file; an analysis step of acquiring an analysis result including time-series speech information, which is text information obtained by transcribing speech included in the video file in a time series; an editing plan step of transmitting the time-series speech information and command information instructing the processor to output an editing plan for editing the time-series speech information according to desired editing policy information to an interactive AI using a language model and receiving the editing plan from the interactive AI; and a video editing step of editing the video file according to the editing plan and generating an edited video.

In addition, a video generation system according to another aspect of the present invention is a video generation system including a user terminal and a video generation device communicatively connected to the user terminal, the video generation device having an acquisition unit that acquires a video file from the user terminal via communication, an analysis unit that acquires an analysis result including time-series speech information, which is text information obtained by transcribing speech included in the video file in a time series, an editing plan unit that transmits, to an interactive AI using a language model, the time-series speech information and command information instructing the interactive AI to describe and output an editing plan for editing the time-series speech information according to desired editing policy information, and receives the editing plan from the interactive AI, and a video editing unit that edits the video file according to the editing plan and generates an edited video.

本発明によると、利用者が望む態様の動画を生成する技術を提供することができる。 The present invention provides technology that allows users to generate videos in the format they desire.

上記した以外の課題、構成および効果は、以下の実施形態の説明により明らかにされる。 Problems, configurations and advantages other than those mentioned above will become clear from the description of the embodiments below.

FIG. 1 is a diagram showing an overview of a video generation system according to an embodiment.
FIG. 2 is a configuration diagram of the video generation system according to the embodiment.
FIG. 3 is a diagram showing an example of a data structure of material information.
FIG. 4 is a diagram showing an example of a data structure of time-series speech information.
FIG. 5 is a diagram showing an example of a data structure of editing policy information.
FIG. 6 is a diagram showing an example of a data structure of command information.
FIG. 7 is a diagram showing an example of a data structure of an editing plan.
FIG. 8 is a diagram showing an example of a hardware configuration of the video generation device.
FIG. 9 is a diagram showing an example of a video generation flow (video material registration).
FIG. 10 is a diagram showing an example of a video generation flow (editing policy registration).
FIG. 11 is a diagram showing an example of a video material registration screen.
FIG. 12 is a diagram showing an example of a new material registration screen.
FIG. 13 is a diagram showing an example of an editing policy registration screen.
FIG. 14 is a diagram showing an example of a new editing policy registration screen.

以下に、本発明の一態様に係る実施形態を適用した動画生成システム１について、図面を参照して説明する。以下の実施の形態においては便宜上その必要があるときは、複数のセクションまたは実施の形態に分割して説明するが、特に明示した場合を除き、それらはお互いに無関係なものではなく、一方は他方の一部または全部の変形例、詳細、補足説明等の関係にある。 A video generation system 1 to which an embodiment according to one aspect of the present invention is applied will be described below with reference to the drawings. In the following embodiment, when necessary for convenience, the description will be divided into multiple sections or embodiments, but unless otherwise specified, they are not unrelated to each other, and one is a partial or complete modification, detail, supplementary explanation, etc. of the other.

また、以下の実施の形態において、要素の数等（個数、数値、量、範囲等を含む）に言及する場合、特に明示した場合および原理的に明らかに特定の数に限定される場合等を除き、その特定の数に限定されるものではなく、特定の数以上でも以下でもよい。 In addition, in the following embodiments, when referring to the number of elements (including the number, numerical value, amount, range, etc.), unless otherwise specified or clearly limited in principle to a specific number, the number is not limited to that specific number and may be more than or less than the specific number.

さらに、以下の実施の形態において、その構成要素（要素ステップ等も含む）は、特に明示した場合および原理的に明らかに必須であると考えられる場合等を除き、必ずしも必須のものではないことは言うまでもない。 Furthermore, it goes without saying that in the following embodiments, the components (including element steps, etc.) are not necessarily essential unless specifically stated otherwise or considered to be clearly essential in principle.

Similarly, in the following embodiments, a reference to the shapes, positional relationships, etc. of components includes shapes and the like that substantially approximate or resemble them, unless otherwise specified or unless this is clearly not the case in principle. The same applies to the numerical values and ranges mentioned above.

また、実施の形態を説明するための全図において、同一の部材には原則として同一の符号を付し、その繰り返しの説明は省略する。 In addition, in all drawings used to explain the embodiments, the same components are generally given the same reference numerals, and repeated explanations will be omitted.

近年では、ネットワークや各種電子デバイス（パーソナルコンピュータ、タブレットデバイス、スマートフォン等）の普及により、いつでもどこでも動画を作成し公開する環境が構築されつつある。例えば、誰でも簡単にスマートフォン等により撮影し、誰でもアクセス可能なＳＮＳ（ＳｏｃｉａｌＮｅｔｗｏｒｋｉｎｇＳｅｒｖｉｃｅ）や動画共有サイト等に場所・時間を問わずに投稿できるようになりつつある。しかし、衆目を集めるような質の高い動画は、専門的知識を備える編集者が時間と労力をかけて作り出したものであることが多い。 In recent years, with the spread of networks and various electronic devices (personal computers, tablet devices, smartphones, etc.), an environment is being created in which videos can be created and published anywhere, anytime. For example, it is becoming possible for anyone to easily shoot videos using a smartphone or the like and post them anywhere, anytime to SNS (Social Networking Services) or video sharing sites that anyone can access. However, high-quality videos that attract attention are often the result of the time and effort of editors with specialized knowledge.

Therefore, an embodiment of the present invention makes available a video generation system 1 that accepts a video editing policy desired by a user and automatically generates a video in line with that policy. With the video generation system 1, a video can be generated in the form desired by the user even if the user lacks video editing skills or lacks the equipment and environment required for video generation.

Figure 1 is a diagram showing an overview of the video generation system according to this embodiment. In the video generation system 1, a user uses their own user terminal 400 and a group of devices communicatively connected to the user terminal 400 via a communication path. The group of devices includes a video generation device 100, a group of devices that provide an interactive AI service 200, and a group of devices that provide a voice analysis service 300.

For example, the video generation device 100, the group of devices providing the interactive AI service 200, and the group of devices providing the voice analysis service 300 may each be a cloud computer connected via the Internet or a server device managed by the owner of the respective device or device group. Furthermore, without being limited to this, a wearable device such as the user's smartwatch may be used as the user terminal 400.

なお、ユーザ端末４００と装置群（動画生成装置１００と、対話型ＡＩサービス２００を提供する装置群と、音声解析サービス３００を提供する装置群を含む）とが通信する際には、ＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）、ＷＡＮ（ＷｉｄｅＡｒｅａＮｅｔｗｏｒｋ）、インターネット、携帯電話網等、Ｂｌｕｅｔｏｏｔｈ（登録商標）等の近距離無線通信あるいはこれらが複合した通信網である通信路を介して接続される。なお、当該通信路５０は、携帯電話通信網等の無線通信網上のＶＰＮ（ＶｉｒｔｕａｌＰｒｉｖａｔｅＮｅｔｗｏｒｋ）等であってもよい。 When the user terminal 400 and the device group (including the video generating device 100, the device group providing the interactive AI service 200, and the device group providing the voice analysis service 300) communicate with each other, they are connected via a communication path such as a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, a mobile phone network, or a short-range wireless communication such as Bluetooth (registered trademark), or a combination of these. The communication path 50 may be a VPN (Virtual Private Network) on a wireless communication network such as a mobile phone network.

By using the video generation system 1, a video can be generated in the form desired by the user. Specifically, the user uses the user terminal 400 to register, in the video generation system 1, a video material file containing recorded speech and environmental sounds and a video editing policy (1 and 2), and then requests video generation. The video generation device 100 extracts only the audio component from the material video and sends it to the voice analysis service 300 as an audio file of the material video (3) for analysis. The voice analysis service 300 analyzes the audio file of the material video and returns to the video generation device 100 an analyzed text file (4) that associates utterance timings with the spoken words and sentences.

The video generation device 100 sends the analyzed text file obtained from the voice analysis service 300, together with command information (5) incorporating the video editing policy, to the interactive AI service 200 to request the creation of an editing plan. At this time, the video generation device 100 does not send the actual video material file or audio file to the interactive AI service 200; it sends the analyzed text file instead. The interactive AI service 200 creates an editing plan that satisfies the constraints specified in the command information, including the constraints on the edited video obtained by the editing plan, the configuration information, the designation of video, still images or audio to be added, and the designation of visual effects, and returns it to the video generation device 100 as an editing plan (6).

動画生成装置１００は、対話型ＡＩサービス２００から編集計画書を受け取ると、編集計画に従って動画編集処理を行い、編集動画（７）を作成してユーザ端末４００に提供する。これにより、ユーザは、提供された編集動画を利活用可能となる。 When the video generating device 100 receives the editing plan from the interactive AI service 200, it performs video editing processing according to the editing plan, creates an edited video (7), and provides it to the user terminal 400. This allows the user to utilize the provided edited video.
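The flow of steps (3) to (7) above can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function `generate_video` and its four injected callables are hypothetical names standing in for audio extraction, the voice analysis service 300, the interactive AI service 200, and the video editing process.

```python
from typing import Callable


def generate_video(
    video_path: str,
    policy_text: str,
    extract_audio: Callable[[str], str],
    transcribe: Callable[[str], str],
    ask_ai: Callable[[str], str],
    edit_video: Callable[[str, str], str],
) -> str:
    """Run steps (3) to (7) with injected, hypothetical services."""
    # (3) Extract only the audio component of the material video.
    audio_path = extract_audio(video_path)
    # (4) Obtain the analyzed text (timestamps plus spoken words).
    analyzed_text = transcribe(audio_path)
    # (5) Only text goes to the AI; the video and audio files are not sent.
    command_information = f"{policy_text}\n\nTranscript:\n{analyzed_text}"
    # (6) Receive the editing plan.
    editing_plan = ask_ai(command_information)
    # (7) Edit the original video file according to the plan.
    return edit_video(video_path, editing_plan)
```

Passing the services in as callables keeps the sketch independent of any particular speech-to-text or language-model provider, none of which the source names.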

図２は、実施形態に係る動画生成システムの構成図である。動画生成システム１には、動画生成装置１００と、通信路５０を介して動画生成装置１００と通信可能な対話型ＡＩサービス２００と、音声解析サービス３００と、ユーザ端末４００と、が含まれる。 Figure 2 is a configuration diagram of a video generation system according to an embodiment. The video generation system 1 includes a video generation device 100, an interactive AI service 200 capable of communicating with the video generation device 100 via a communication path 50, a voice analysis service 300, and a user terminal 400.

動画生成装置１００は、記憶部１１０と、処理部１２０と、入出力部１４０と、通信部１５０と、が互いにバス等で通信可能に接続される。 The video generating device 100 includes a memory unit 110, a processing unit 120, an input/output unit 140, and a communication unit 150, which are connected to each other via a bus or the like so that they can communicate with each other.

記憶部１１０には、素材情報１１１と、時系列発話情報１１２と、編集方針情報１１３と、命令情報１１４と、編集計画書１１５と、編集動画１１６と、が含まれる。 The memory unit 110 includes material information 111, time-series speech information 112, editing policy information 113, command information 114, editing plan 115, and edited video 116.

図３は、素材情報のデータ構造例を示す図である。素材情報１１１は、動画生成に用いるための素材動画の情報を複数記憶する。素材情報１１１には、ユーザ１１１Ａと、動画タイトル１１１Ｂと、動画ファイルパス１１１Ｃと、説明１１１Ｄと、解析済フラグ１１１Ｅと、解析結果１１１Ｆと、が含まれる。 Figure 3 is a diagram showing an example data structure of material information. Material information 111 stores information on multiple material videos to be used for generating videos. Material information 111 includes user 111A, video title 111B, video file path 111C, description 111D, analyzed flag 111E, and analysis result 111F.

ユーザ１１１Ａは、ユーザを、他のユーザから区別する情報である。動画タイトル１１１Ｂは、素材として登録する動画のタイトルである。動画ファイルパス１１１Ｃは、素材として登録する動画のファイルシステム上の格納場所、あるいはＵＲＩ（ＵｎｉｆｏｒｍＲｅｓｏｕｒｃｅＩｄｅｎｔｉｆｉｅｒ）である。説明１１１Ｄは、素材として登録する動画の内容を自然言語で説明する情報である。解析済フラグ１１１Ｅは、音声解析サービス３００による解析を終えたか否かを示す情報である。解析結果１１１Ｆは、音声解析サービス３００による解析の結果情報である解析済テキストである。 User 111A is information that distinguishes a user from other users. Video title 111B is the title of the video to be registered as material. Video file path 111C is the storage location on the file system of the video to be registered as material, or a URI (Uniform Resource Identifier). Description 111D is information that explains in natural language the content of the video to be registered as material. Analyzed flag 111E is information that indicates whether or not analysis by the voice analysis service 300 has been completed. Analysis result 111F is analyzed text that is information on the results of analysis by the voice analysis service 300.
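One way to picture a single record of the material information 111 is as a small data class. The field names and the `store_analysis` helper below are illustrative assumptions; only the six items 111A to 111F come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MaterialInfo:
    user: str                               # 111A: identifies the user
    video_title: str                        # 111B: title of the material video
    video_file_path: str                    # 111C: file system path or URI
    description: str                        # 111D: natural-language description
    analyzed: bool = False                  # 111E: analyzed flag
    analysis_result: Optional[str] = None   # 111F: analyzed text

    def store_analysis(self, analyzed_text: str) -> None:
        """Record the analyzed text and set the analyzed flag (hypothetical helper)."""
        self.analysis_result = analyzed_text
        self.analyzed = True
```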

図４は、時系列発話情報のデータ構造例を示す図である。時系列発話情報１１２は、動画内での経過時間を時系列として、動画内でなされた発話のテキストを順に格納する情報である。時系列発話情報１１２には、発話開始時刻１１２Ａと、発話終了時刻１１２Ｂと、発話テキスト（単語）１１２Ｃと、が含まれる。 Figure 4 is a diagram showing an example data structure of time-series speech information. Time-series speech information 112 is information that stores the text of utterances made within a video in order, with the time that has elapsed within the video being a time series. Time-series speech information 112 includes utterance start time 112A, utterance end time 112B, and utterance text (words) 112C.

発話開始時刻１１２Ａと、発話終了時刻１１２Ｂとは、動画内でなされた発話の開始タイミングと、終了タイミングとをそれぞれ動画の開始時刻からの経過時間（動画内時刻）によって特定する情報である。発話テキスト（単語）１１２Ｃは、発話開始時刻１１２Ａと、発話終了時刻１１２Ｂとの間に発話された単語である。ただし、単語に限られず、一定の長さの文や節であってもよい。 The speech start time 112A and the speech end time 112B are information that specifies the start and end timings of speech made within the video by the time that has elapsed since the start time of the video (time within the video). The speech text (words) 112C are words that are spoken between the speech start time 112A and the speech end time 112B. However, they are not limited to words, and may be sentences or clauses of a certain length.
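As a rough sketch, the time-series speech information 112 could be held as a list of records and serialized to plain text before being passed on. The class name, the `HH:MM:SS.mmm` time format, and the `-->` separator are assumptions made for illustration; the source specifies only the three fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    start: float  # 112A: utterance start, seconds from the start of the video
    end: float    # 112B: utterance end, seconds from the start of the video
    text: str     # 112C: a word, or a clause or sentence of some length


def format_time(seconds: float) -> str:
    """Format an in-video time as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    s, ms = divmod(total_ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def to_transcript(utterances: List[Utterance]) -> str:
    """Serialize utterances in chronological order, one per line."""
    ordered = sorted(utterances, key=lambda u: u.start)
    return "\n".join(
        f"{format_time(u.start)} --> {format_time(u.end)} {u.text}"
        for u in ordered
    )
```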

図５は、編集方針情報のデータ構造例を示す図である。編集方針情報１１３は、生成したい動画の編集方針の情報である。編集方針情報１１３には、タイトル１１３Ａと、コンテンツの目標１１３Ｂと、制約条件１１３Ｃと、コンテンツの構成１１３Ｄと、リソースファイル１１３Ｅと、編集計画書フォーマット１１３Ｆと、が含まれる。 Figure 5 is a diagram showing an example data structure of editing policy information. Editing policy information 113 is information on the editing policy of the video to be generated. Editing policy information 113 includes a title 113A, a content goal 113B, constraints 113C, a content structure 113D, a resource file 113E, and an editing plan format 113F.

Title 113A is the title of the editing policy or the title of the video to be generated. Content goal 113B is information such as the image that the video to be generated aims for and the intended psychological effect on the viewer (for example, that watching it is enjoyable or calming). Constraints 113C is information on constraints for video creation, such as the length (playback time) of the generated video. Content structure 113D is information on the structure of the video to be generated, for example, joining three consecutive videos with visual effect transitions. Resource file 113E is information on the video materials to be used for the video to be generated. Editing plan format 113F is information specifying the format of the editing plan for video generation. The format of the editing plan may be a known format, or may be defined in an extension language conforming to SGML (Standard Generalized Markup Language) or the like.

図６は、命令情報のデータ構造例を示す図である。命令情報１１４は、対話型ＡＩサービス２００に処理をさせるための命令（プロンプト）である。本実施形態に係る動画生成の命令は、例えば、編集方針情報１１３を指定して、該編集方針に従って編集計画書を作成するよう指示するものであり、自然言語にて記述される。 Figure 6 is a diagram showing an example of the data structure of command information. Command information 114 is a command (prompt) for causing the interactive AI service 200 to perform processing. A command for generating a video according to this embodiment, for example, specifies editing policy information 113 and instructs the creation of an editing plan in accordance with the editing policy, and is written in natural language.
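A minimal sketch of how command information might be assembled from the editing policy information 113 and a transcript is shown below. The class, field names, and prompt wording are all assumptions; the source states only that the command is written in natural language and points to the editing policy.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EditingPolicy:
    title: str                                                # 113A
    content_goal: str                                         # 113B
    constraints: List[str] = field(default_factory=list)      # 113C
    content_structure: str = ""                               # 113D
    resource_files: List[str] = field(default_factory=list)   # 113E
    plan_format: str = ""                                     # 113F


def build_command_information(policy: EditingPolicy, transcript: str) -> str:
    """Compose a natural-language prompt; the wording is illustrative only."""
    parts = [
        "Create an editing plan for the transcript below.",
        f"Title: {policy.title}",
        f"Goal: {policy.content_goal}",
    ]
    if policy.constraints:
        parts.append("Constraints:")
        parts.extend(f"- {c}" for c in policy.constraints)
    if policy.content_structure:
        parts.append(f"Structure: {policy.content_structure}")
    if policy.resource_files:
        parts.append("Resource files: " + ", ".join(policy.resource_files))
    if policy.plan_format:
        parts.append("Write the plan in this format:")
        parts.append(policy.plan_format)
    parts.append("Transcript (start, end, text):")
    parts.append(transcript)
    return "\n".join(parts)
```

Optional items are included only when present, mirroring the description's statement that constraints, structure, added materials, and format definitions may each be included in the command information.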

図７は、編集計画書のデータ構造例を示す図である。編集計画書１１５は、生成する動画の動画内時刻に割り当てられる構成要素をタグ指定する等により、編集情報を所定のフォーマットにて記述して動画作成の計画情報とするものである。 Figure 7 shows an example of the data structure of an editing plan. The editing plan 115 describes editing information in a specified format, such as by tagging components to be assigned to times within the video to be generated, to create planning information for video production.

The format of the editing plan according to this embodiment is outlined below. The editing plan can include three main types of elements: "shot", "view", and "attach". A "shot" groups multiple "view" elements. A "view" specifies a material file and related information. Material files include videos (with the start and end times of the portion of the video material to be used) and images (with the magnification and on-screen placement of the image file), and the related information includes color and gradient designations. An "attach" specifies an element displayed in addition to the material specified by a "view" (for an image, this includes its size, placement, and start and end times within the generated video; for audio, this includes its volume and start and end times within the generated video).

For example, the editing plan describes cut editing information such as the following: a clip extracted from the material video by specifying times within the material video is assigned to one of the components ("views"); multiple such clips are played back in succession with transitions between them; and then a period for displaying a QR code (registered trademark) for accessing other videos in the channel is attached ("attaches").
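To make the "shot" / "view" / "attach" structure concrete, the sketch below parses a small plan written as XML. The element names come from the description; the `plan` root element, the `src`, `start`, and `end` attributes, and the sample values are assumptions, since the source does not give the concrete syntax.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical sample plan; attribute names are illustrative only.
SAMPLE_PLAN = """
<plan>
  <shot>
    <view src="material.mp4" start="12.0" end="20.5"/>
    <view src="material.mp4" start="95.0" end="110.0"/>
    <attach src="logo.png" start="0.0" end="3.0"/>
  </shot>
</plan>
"""


def parse_views(plan_xml: str) -> List[Tuple[str, float, float]]:
    """Return (source, start, end) for every view, in document order."""
    root = ET.fromstring(plan_xml)
    clips = []
    for view in root.iter("view"):
        start = float(view.get("start"))
        end = float(view.get("end"))
        if end <= start:
            raise ValueError(f"view ends before it starts: {start} >= {end}")
        clips.append((view.get("src"), start, end))
    return clips
```

Validating the time range while parsing is a design choice of this sketch: a plan produced by a language model may contain inconsistent values, and rejecting them early is simpler than failing during editing.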

例えば、編集計画書には、編集計画として、素材となる動画ファイル内の経過時間軸上の開始位置と終了位置を指定した部分的な動画をつなぎ合わせて編集動画を構成する情報が含まれてもよい。また、編集計画書には、編集計画として、素材となる動画ファイル内の経過時間軸上の開始位置と終了位置を指定した部分的な動画をつなぎ合わせて編集動画を構成する情報、および部分的な動画の前後に付加すべき動画、静止画または音声の指定が含まれてもよい。また、編集計画書には、編集計画として、素材となる動画ファイル内の経過時間軸上の開始位置と終了位置を指定した部分的な動画をつなぎ合わせて編集動画を構成する情報、および部分的な動画のつなぎ目に用いる視覚効果の指定が含まれてもよい。 For example, the editing plan may include, as an editing plan, information for constructing an edited video by joining partial videos with specified start and end positions on the elapsed time axis in the video file to be used as the material. The editing plan may also include, as an editing plan, information for constructing an edited video by joining partial videos with specified start and end positions on the elapsed time axis in the video file to be used as the material, and specification of videos, still images, or audio to be added before and after the partial videos. The editing plan may also include, as an editing plan, information for constructing an edited video by joining partial videos with specified start and end positions on the elapsed time axis in the video file to be used as the material, and specification of visual effects to be used at the joins of the partial videos.
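The cut-and-join step described above could, for instance, be carried out with the FFmpeg command-line tool. The source does not name any editing tool, so FFmpeg, the `part000.mp4` naming scheme, and the function names below are all assumptions. The sketch only builds the argument lists and the concat list file contents; it does not run anything.

```python
from typing import List, Tuple

Clip = Tuple[str, float, float]  # (source path, start seconds, end seconds)


def build_cut_commands(clips: List[Clip], prefix: str = "part") -> List[List[str]]:
    """One ffmpeg invocation per partial video, using stream copy."""
    commands = []
    for i, (src, start, end) in enumerate(clips):
        commands.append([
            "ffmpeg", "-y",
            "-ss", f"{start:.3f}", "-to", f"{end:.3f}",  # range on the source time axis
            "-i", src,
            "-c", "copy",
            f"{prefix}{i:03d}.mp4",
        ])
    return commands


def build_concat_list(count: int, prefix: str = "part") -> str:
    """Contents of the list file read by ffmpeg's concat demuxer."""
    return "".join(f"file '{prefix}{i:03d}.mp4'\n" for i in range(count))


def build_concat_command(list_path: str, output: str) -> List[str]:
    """Join the partial videos listed in list_path into one output file."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output]
```

With stream copy, cuts generally snap to keyframes, so frame-accurate cuts or transitions at the joins would require re-encoding with filters instead; that is outside this sketch.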

図２の説明に戻る。処理部１２０には、取得部１２１と、解析部１２２と、編集計画部１２３と、動画編集部１２４と、が含まれる。 Returning to the explanation of FIG. 2, the processing unit 120 includes an acquisition unit 121, an analysis unit 122, an editing planning unit 123, and a video editing unit 124.

取得部１２１は、動画ファイルを取得する。解析部１２２は、動画ファイルに含まれる発話音声を時系列に書き起こしたテキスト情報である時系列発話情報を含む解析結果を音声解析サービスから取得する。また、解析部１２２は、動画ファイルに含まれる発話音声を時系列を維持しながら早送り編集し、所定の音声テキスト変換部（音声解析サービス３００）に受け渡して時系列発話情報を得てもよい。あるいはまた、解析部１２２は、動画ファイルに含まれる発話音声の話者を識別して話者ごとに時系列を維持しながら抽出し、所定の音声テキスト変換部（音声解析サービス３００）に受け渡して得たテキスト情報を統合して時系列発話情報を得てもよい。 The acquisition unit 121 acquires a video file. The analysis unit 122 acquires an analysis result including time-series speech information, which is text information obtained by transcribing speech included in the video file in time series, from the voice analysis service. The analysis unit 122 may also fast-forward edit the speech included in the video file while maintaining the time series, and pass it on to a specified voice-to-text conversion unit (voice analysis service 300) to obtain time-series speech information. Alternatively, the analysis unit 122 may identify the speaker of the speech included in the video file, extract it while maintaining the time series for each speaker, and integrate the text information obtained by passing it on to a specified voice-to-text conversion unit (voice analysis service 300) to obtain time-series speech information.
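
The speaker-by-speaker variant described above can be sketched as follows, assuming each speaker's audio has already been transcribed separately into records carrying start and end times. The `Utterance` record and the `merge_by_time` helper are hypothetical names, not part of the embodiment.

```python
from typing import Dict, List, NamedTuple


class Utterance(NamedTuple):
    start_s: float   # utterance start time within the video, in seconds
    end_s: float     # utterance end time within the video, in seconds
    speaker: str
    text: str


def merge_by_time(per_speaker: Dict[str, List[Utterance]]) -> List[Utterance]:
    """Integrate per-speaker transcripts into one list ordered by start time."""
    merged = [u for utterances in per_speaker.values() for u in utterances]
    return sorted(merged, key=lambda u: (u.start_s, u.end_s))


per_speaker = {
    "A": [Utterance(0.0, 4.0, "A", "Welcome."), Utterance(9.0, 12.0, "A", "Exactly.")],
    "B": [Utterance(4.5, 8.5, "B", "Thanks for having me.")],
}
timeline = merge_by_time(per_speaker)
for u in timeline:
    print(f"{u.start_s:6.1f}-{u.end_s:6.1f} [{u.speaker}] {u.text}")
```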

編集計画部１２３は、時系列発話情報と、時系列発話情報を所望の編集方針情報に従って編集する編集計画を出力するように指示する命令情報とを、言語モデルを用いた対話型ＡＩ（対話型ＡＩサービス２００）に送信し、対話型ＡＩから編集計画書を受信する。また、編集計画部は、命令情報に、編集計画により得られる編集動画についての制約条件を含めるようにしてよい。また、編集計画部１２３は、命令情報に、編集計画により得られる編集動画についての構成情報を含めるようにしてもよい。編集計画部１２３は、命令情報に、編集計画により得られる編集動画に付加すべき動画、静止画または音声の指定を含めるようにしてもよい。また、編集計画部１２３は、命令情報に、編集計画により得られる編集動画において用いる視覚効果の指定を含めるようにしてもよい。また、編集計画部１２３は、命令情報に、編集計画を記述するフォーマット言語についての定義情報を含めるようにしてもよい。 The editing plan unit 123 transmits the time-series speech information and command information instructing the output of an editing plan for editing the time-series speech information according to desired editing policy information to an interactive AI (interactive AI service 200) using a language model, and receives an editing plan from the interactive AI. The editing plan unit may also include, in the command information, constraint conditions for the edited video obtained by the editing plan. The editing plan unit 123 may also include, in the command information, configuration information for the edited video obtained by the editing plan. The editing plan unit 123 may also include, in the command information, a designation of a video, a still image, or a sound to be added to the edited video obtained by the editing plan. The editing plan unit 123 may also include, in the command information, a designation of a visual effect to be used in the edited video obtained by the editing plan. The editing plan unit 123 may also include, in the command information, definition information for a format language for describing the editing plan.

動画編集部１２４は、編集計画書に含まれる編集計画に沿って動画ファイルを編集し、編集動画を生成する。具体的には、動画編集部１２４は、部分的な動画を動画素材のファイルから切り出してつなぎ合わせることで前記編集動画を生成する。また、動画編集部１２４は、さらに、付加すべき動画、静止画または音声を付加することで編集動画を生成するようにしてもよい。また、動画編集部１２４は、部分的な動画を動画ファイルから切り出してつなぎ合わせ、該つなぎ目に指定された視覚効果を適用することで編集動画を生成するようにしてもよい。 The video editing unit 124 edits the video file in accordance with the editing plan contained in the editing plan document and generates an edited video. Specifically, the video editing unit 124 generates the edited video by cutting partial videos out of the video material file and joining them together. The video editing unit 124 may further add the videos, still images, or audio designated for addition when generating the edited video. The video editing unit 124 may also cut partial videos out of the video file, join them together, and apply the designated visual effect at each join to generate the edited video.
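
One way to picture the cut-and-join step is as a list of command lines for an external tool. The sketch below assumes the ffmpeg command-line tool, which the embodiment does not name, and only builds the argument lists without running them. The helper names (`cut_command`, `concat_list`, `concat_command`, `plan_commands`) and the file names are hypothetical.

```python
from typing import List, Tuple


def cut_command(src: str, start_s: float, end_s: float, dst: str) -> List[str]:
    # -ss before -i seeks in the input; -t limits the clip to the given duration
    duration = end_s - start_s
    return ["ffmpeg", "-ss", f"{start_s:.3f}", "-i", src, "-t", f"{duration:.3f}", dst]


def concat_list(clips: List[str]) -> str:
    # Contents of the list file read by ffmpeg's concat demuxer
    return "".join(f"file '{clip}'\n" for clip in clips)


def concat_command(list_path: str, dst: str) -> List[str]:
    # Joins the clips listed in list_path without re-encoding
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", dst]


def plan_commands(src: str, segments: List[Tuple[float, float]], dst: str):
    """Return (cut commands, list-file text, concat command) for the segments."""
    clips = [f"clip{i}.mp4" for i in range(len(segments))]
    cuts = [cut_command(src, s, e, c) for (s, e), c in zip(segments, clips)]
    return cuts, concat_list(clips), concat_command("clips.txt", dst)


cuts, list_text, join = plan_commands("talk.mp4", [(15.0, 27.0), (60.0, 90.0)], "edited.mp4")
for cmd in cuts + [join]:
    print(" ".join(cmd))
```

Visual effects at the joins would need filter-based processing rather than the stream copy shown here; that part is left out of the sketch.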

入出力部１４０は、動画生成装置１００に対する入出力を制御する。例えば、入出力部１４０は、受け付けたタイピングやタッチ、フリック入力等の各種の接触入力、あるいは視線入力等の各種の入力を受け付ける。また、入出力部１４０は、ユーザへの出力を行う。出力される情報は、画面、プレゼンテーション情報、広告、動画等の各種出力情報である。 The input/output unit 140 controls input to and output from the video generating device 100. For example, the input/output unit 140 accepts various kinds of input, including contact input such as typing, touch, and flick input, as well as gaze input. The input/output unit 140 also outputs information to the user. The output information includes screens, presentation information, advertisements, videos, and the like.

通信部１５０は、通信路５０を介して対話型ＡＩサービス２００を提供する装置群、音声解析サービス３００を提供する装置群、ユーザ端末４００およびその他インターネットを介して通信を行う他の端末との間で通信を行う。 The communication unit 150 communicates via the communication path 50 with a group of devices that provide the interactive AI service 200, a group of devices that provide the voice analysis service 300, a user terminal 400, and other terminals that communicate via the Internet.

対話型ＡＩサービス２００は、例えば、ＧＰＴ、Ｇｅｍｉｎｉ等のいわゆる生成ＡＩの機能をＡＰＩ（ＡｐｐｌｉｃａｔｉｏｎＰｒｏｇｒａｍｍｉｎｇＩｎｔｅｒｆａｃｅ）等を介して提供するサービスである。対話型ＡＩサービス２００は、自然言語による命令（プロンプト）を生成ＡＩに与えて、望む結果を生成させて得る。本実施形態では、生成ＡＩに動画を生成するための編集計画書を生成させる。 The interactive AI service 200 is a service that provides the functions of so-called generative AI, such as GPT or Gemini, via an API (Application Programming Interface) or the like. The interactive AI service 200 passes a natural-language instruction (prompt) to the generative AI and obtains the desired result that the AI generates. In this embodiment, the generative AI is made to generate an editing plan used to generate a video.

音声解析サービス３００は、例えば、ＧｏｏｇｌｅＴＴＳＡＰＩ等の公知の技術を用いて音声解析を行う。音声解析サービス３００は、音声ファイルを受け付けると、音声ファイル内での発話をテキストに起こし、その音声ファイルに含まれる発話ごとに発話内容のテキストと、発話の開始時刻と終了時刻を特定する情報を解析済テキストとして出力する。 The voice analysis service 300 performs voice analysis using known technology such as the Google TTS API. When the voice analysis service 300 receives an audio file, it transcribes the speech in the audio file into text, and for each utterance contained in the audio file, it outputs the text of the utterance and information specifying the start and end times of the utterance as analyzed text.

ユーザ端末４００は、ユーザが利用する端末である。ユーザ端末４００としては、ユーザのスマートフォン端末、ＰＣ（ＰｅｒｓｏｎａｌＣｏｍｐｕｔｅｒ）等を用いるようにしてもよい。さらには、これに限られず、ユーザのスマートウォッチ等のウェアラブル装置をユーザ端末４００として用いるようにしてもよい。 The user terminal 400 is a terminal used by the user. The user terminal 400 may be a smartphone terminal of the user, a PC (Personal Computer), or the like. Furthermore, without being limited to this, a wearable device such as a smart watch of the user may be used as the user terminal 400.

図８は、動画生成装置のハードウェア構成例を示す図である。動画生成装置１００は、いわゆるサーバー装置、ワークステーション、パーソナルコンピュータ、スマートフォンあるいはタブレット端末の筐体により実現されるハードウェア構成を備える。動画生成装置１００は、プロセッサ１０１と、メモリ１０２と、ストレージ１０３と、入力装置１０４と、表示装置１０５と、通信装置１０６と、各装置をつなぐバスと、を備える。 Figure 8 is a diagram showing an example of the hardware configuration of a video generation device. The video generation device 100 has a hardware configuration realized by the housing of a so-called server device, workstation, personal computer, smartphone, or tablet terminal. The video generation device 100 has a processor 101, memory 102, storage 103, input device 104, display device 105, communication device 106, and a bus connecting each device.

プロセッサ１０１は、例えばＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）、ＧＰＵ（ＧｒａｐｈｉｃｓＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）等の演算装置である。 The processor 101 is a computing device such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).

メモリ１０２は、例えばＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）などのメモリ装置である。 Memory 102 is a memory device such as a RAM (Random Access Memory).

ストレージ１０３は、デジタル情報を記憶可能な、いわゆるハードディスク（ＨａｒｄＤｉｓｋＤｒｉｖｅ）やＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）あるいはフラッシュメモリなどの不揮発性記憶装置である。 Storage 103 is a non-volatile storage device capable of storing digital information, such as a hard disk drive, a solid state drive (SSD), or a flash memory.

入力装置１０４は、キーボードやマウス、タッチパネル、マイクのいずれかまたは複数の入力を受け付ける装置である。表示装置１０５は、有機ＥＬ（Ｅｌｅｃｔｒｏ－Ｌｕｍｉｎｅｓｃｅｎｃｅ）ディスプレイ等の各種出力装置のいずれかまたは複数の表示を行う装置である。 The input device 104 is a device that accepts input from one or more of a keyboard, mouse, touch panel, and microphone. The display device 105 is a device that performs display using one or more of various output devices, such as an organic EL (Electro-Luminescence) display.

通信装置１０６は、ネットワークを介して他の装置と通信するネットワークインターフェースカード（ＮＩＣ）等である。 The communication device 106 is a network interface card (NIC) or the like that communicates with other devices via a network.

なお、対話型ＡＩサービス２００を提供する装置、音声解析サービス３００を提供する装置、ユーザ端末４００についても、動画生成装置１００と略同様のハードウェア構成を備える。 The device providing the interactive AI service 200, the device providing the voice analysis service 300, and the user terminal 400 also have substantially the same hardware configuration as the video generation device 100.

上記した動画生成装置１００の処理部１２０と、取得部１２１と、解析部１２２と、編集計画部１２３と、動画編集部１２４とは、プロセッサ１０１に処理を行わせるプログラムによって実現される。このプログラムは、メモリ１０２、ストレージ１０３または図示しないＲＯＭ装置内に記憶され、実行にあたってメモリ１０２上にロードされ、プロセッサ１０１により実行される。 The processing unit 120, acquisition unit 121, analysis unit 122, editing planning unit 123, and video editing unit 124 of the video generating device 100 described above are realized by a program that causes the processor 101 to perform processing. This program is stored in the memory 102, storage 103, or a ROM device (not shown), and is loaded onto the memory 102 for execution and executed by the processor 101.

また、動画生成装置１００の記憶部１１０は、メモリ１０２及びストレージ１０３により実現される。また、入出力部１４０は、入力装置１０４および表示装置１０５により実現される。通信部１５０は、通信装置１０６により実現される。以上が、動画生成装置１００のハードウェア構成例である。 The memory unit 110 of the video generating device 100 is realized by the memory 102 and the storage 103. The input/output unit 140 is realized by the input device 104 and the display device 105. The communication unit 150 is realized by the communication device 106. The above is an example of the hardware configuration of the video generating device 100.

動画生成装置１００の構成は、処理内容に応じて、さらに多くの構成要素に分類することもできる。また、１つの構成要素がさらに多くの処理を実行するように分類することもできる。 The configuration of the video generating device 100 can also be divided into a larger number of components according to the processing content. It can also be organized so that a single component executes more of the processing.

また、各処理部（処理部１２０と、取得部１２１と、解析部１２２と、編集計画部１２３と、動画編集部１２４）は、それぞれの機能を実現する専用のハードウェア（ＡＳＩＣ、ＧＰＵなど）により構築されてもよい。また、各処理部の処理が一つのハードウェアで実行されてもよいし、複数のハードウェアで実行されてもよい。 In addition, each processing unit (processing unit 120, acquisition unit 121, analysis unit 122, editing planning unit 123, and video editing unit 124) may be constructed using dedicated hardware (ASIC, GPU, etc.) that realizes each function. In addition, the processing of each processing unit may be executed by a single piece of hardware, or may be executed by multiple pieces of hardware.

次に、本実施形態における動画生成システム１の動作を説明する。 Next, the operation of the video generation system 1 in this embodiment will be described.

図９は、動画生成フロー（動画素材登録）の例を示す図である。動画生成フロー（動画素材登録）は、ユーザがユーザ端末４００のウェブブラウザあるいはアプリケーションソフトウェア（以後、単にブラウザと表記することもある）において開始を要求すると開始される。 Figure 9 shows an example of a video generation flow (video material registration). The video generation flow (video material registration) starts when a user requests the start of the flow in the web browser or application software (hereinafter sometimes simply referred to as the browser) of the user terminal 400.

動画生成装置１００の取得部１２１は、動画素材登録画面を生成し、ユーザ端末４００に表示させる（ステップＳ００１）。具体的には、取得部１２１は、ユーザが過去に登録済の動画の一覧を管理する動画素材登録画面を生成する。そして、取得部１２１は、生成した動画素材登録画面の表示情報をユーザ端末４００に送信する。 The acquisition unit 121 of the video production device 100 generates a video material registration screen and displays it on the user terminal 400 (step S001). Specifically, the acquisition unit 121 generates a video material registration screen that allows the user to manage a list of videos that have been previously registered. The acquisition unit 121 then transmits display information for the generated video material registration screen to the user terminal 400.

そして、ユーザ端末４００のブラウザは、動画素材登録画面を表示させ、登録する動画素材ファイルと、動画タイトルと、説明の情報を含む情報を付帯させて動画素材登録依頼を動画生成装置１００に送信する（ステップＳ００２）。 Then, the browser of the user terminal 400 displays a video material registration screen and sends a video material registration request to the video generating device 100 together with information including the video material file to be registered, the video title, and descriptive information (step S002).

取得部１２１は、動画素材ファイル等を取得する（ステップＳ００３）。具体的には、取得部１２１は、素材情報１１１に、ユーザと、動画タイトルと、動画ファイルと、説明と、を登録する。 The acquisition unit 121 acquires video material files, etc. (step S003). Specifically, the acquisition unit 121 registers the user, video title, video file, and description in the material information 111.

そして、解析部１２２は、動画解析（音声部分抽出）を行う（ステップＳ００４）。具体的には、解析部１２２は、取得した動画ファイルからオーディオ成分を分離取得する。 Then, the analysis unit 122 performs video analysis (audio portion extraction) (step S004). Specifically, the analysis unit 122 separates and acquires audio components from the acquired video file.

そして、解析部１２２は、動画解析（早送音声生成）を行う（ステップＳ００５）。具体的には、解析部１２２は、取得した動画ファイルに含まれるオーディオ成分を時系列を維持しながら早送り編集する。例えば、解析部１２２は、動画内時間で発話開始０分１５秒時点から発話終了０分２７秒時点までの発話（発話継続時間が１２秒）の動画ファイルについて処理する場合、４倍速に編集して、発話開始から発話終了までの時間が３秒となるよう音声ファイルのデータ量を小さく作成する。 Then, the analysis unit 122 performs video analysis (fast-forward audio generation) (step S005). Specifically, the analysis unit 122 fast-forward edits the audio component contained in the acquired video file while maintaining its chronological order. For example, when processing a video file containing an utterance that starts at 0 min 15 s and ends at 0 min 27 s in video time (an utterance lasting 12 seconds), the analysis unit 122 speeds the audio up by a factor of four so that the time from the start to the end of the utterance becomes 3 seconds, thereby producing an audio file with a smaller amount of data.
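
The arithmetic of the 4x example can be checked with a short sketch. The use of ffmpeg's `atempo` audio filter is an assumption made for illustration (the embodiment does not specify a tool), and the helper names `compressed_span`, `atempo_chain`, and `fast_forward_command` are hypothetical. The sketch only builds the command and assumes a speed factor of 1 or greater.

```python
from typing import List, Tuple


def compressed_span(start_s: float, end_s: float, speed: float) -> Tuple[float, float]:
    """Map a span in the original audio to its position in the sped-up audio."""
    return start_s / speed, end_s / speed


def atempo_chain(speed: float) -> str:
    """Build an ffmpeg audio filter string for a speed factor >= 1.

    Each atempo stage is kept at 2x or below and stages are chained,
    which is a conservative choice that also suits older ffmpeg builds.
    """
    stages = []
    remaining = speed
    while remaining > 2.0:
        stages.append("atempo=2")
        remaining /= 2.0
    stages.append(f"atempo={remaining:g}")
    return ",".join(stages)


def fast_forward_command(src: str, dst: str, speed: float) -> List[str]:
    # -vn drops the video stream so that only the sped-up audio is written
    return ["ffmpeg", "-i", src, "-vn", "-filter:a", atempo_chain(speed), dst]


start, end = compressed_span(15.0, 27.0, 4.0)
print(start, end, end - start)
print(" ".join(fast_forward_command("talk.mp4", "talk_x4.wav", 4.0)))
```

In this example the 12-second utterance occupies the span from 3.75 s to 6.75 s of the sped-up audio, which matches the 3 seconds stated in the text.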

そして、解析部１２２は、動画解析（音声解析依頼）を行う（ステップＳ００６）。具体的には、解析部１２２は、音声解析サービス３００に、ステップＳ００５にて作成した早送音声の音声ファイルをＡＰＩ等を通じて送信して解析を依頼する。 Then, the analysis unit 122 performs video analysis (request for audio analysis) (step S006). Specifically, the analysis unit 122 requests the audio analysis service 300 to analyze the audio file of the fast-forwarded audio created in step S005 by sending it via an API or the like.

音声解析サービス３００は、送信された早送音声の音声ファイルについて、音声解析処理を行う（ステップＳ００７）。具体的には、音声解析サービス３００は、素材動画の発話タイミングと発話内容を対応付けて記録した素材動画の発話タイミング解析済テキストを生成し、動画生成装置１００に送信する。 The audio analysis service 300 performs audio analysis processing on the transmitted fast-forwarded audio file (step S007). Specifically, the audio analysis service 300 generates utterance-timing-analyzed text for the source video, in which the timing and the content of each utterance are recorded in association with each other, and transmits the text to the video generating device 100.

そして、解析部１２２は、動画解析（時系列情報作成）を行う（ステップＳ００７）。具体的には、解析部１２２は、受信した解析済みテキストを時系列発話情報１１２に格納し、解析済フラグ１１１Ｅを「済」に設定して、解析結果１１１Ｆに当該時系列発話情報１１２への参照情報を格納する。その際、解析部１２２は、解析済みテキストと時系列発話情報１１２と、のデータ構造が異なる場合には、解析済みテキストの情報について、時刻情報を早送状態から通常速度状態に戻すよう変換して時系列発話情報１１２として格納してもよいし、時刻情報を早送状態から通常速度状態に戻すよう変換した上で時系列発話情報１１２のデータ構造に変換して格納してもよい。 Then, the analysis unit 122 performs video analysis (creation of time-series information) (step S007). Specifically, the analysis unit 122 stores the received analyzed text in the time-series speech information 112, sets the analyzed flag 111E to "Done", and stores reference information to the time-series speech information 112 in the analysis result 111F. At that time, if the data structures of the analyzed text and the time-series speech information 112 are different, the analysis unit 122 may convert the information of the analyzed text so that the time information is returned from the fast-forward state to the normal speed state and store it as the time-series speech information 112, or may convert the time information so that the time information is returned from the fast-forward state to the normal speed state and then convert it into the data structure of the time-series speech information 112 and store it.
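
Returning the time information from the fast-forward state to normal speed amounts to multiplying each timestamp by the speed factor. A minimal sketch follows, assuming the analyzed text arrives as a list of records with `start_s`, `end_s`, and `text` keys; those key names and the `to_normal_speed` helper are hypothetical.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[float, str]]


def to_normal_speed(analyzed: List[Record], speed: float) -> List[Record]:
    """Rescale timestamps measured on sped-up audio back to video time."""
    return [
        {
            "start_s": float(u["start_s"]) * speed,
            "end_s": float(u["end_s"]) * speed,
            "text": u["text"],
        }
        for u in analyzed
    ]


# Timestamps as they would be reported for audio sped up by a factor of four
analyzed = [{"start_s": 3.75, "end_s": 6.75, "text": "Let's get started."}]
time_series = to_normal_speed(analyzed, 4.0)
print(time_series)
```

The text content is carried over unchanged; only the time axis is converted.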

以上が、動画生成フロー（動画素材登録）の例である。動画生成フロー（動画素材登録）によれば、動画素材として登録された動画について、発話のテキスト情報と動画上のその発話タイミングを解析した時系列発話情報を得ることができる。 The above is an example of the video generation flow (video material registration). According to the video generation flow (video material registration), for videos registered as video material, it is possible to obtain text information of utterances and time-series utterance information that analyzes the timing of the utterances in the video.

図１０は、動画生成フロー（編集方針登録）の例を示す図である。動画生成フロー（編集方針登録）は、ユーザがユーザ端末４００のブラウザにおいて開始を要求すると開始される。 Figure 10 is a diagram showing an example of a video creation flow (editing policy registration). The video creation flow (editing policy registration) starts when a user requests the start of the flow in the browser of the user terminal 400.

動画生成装置１００の編集計画部１２３は、編集方針登録画面を生成し、ユーザ端末４００に表示させる（ステップＳ１０１）。具体的には、編集計画部１２３は、ユーザが過去に登録済の編集方針の一覧を管理する編集方針登録画面を生成する。そして、編集計画部１２３は、生成した編集方針登録画面の表示情報をユーザ端末４００に送信する。 The editing planning unit 123 of the video generating device 100 generates an editing policy registration screen and displays it on the user terminal 400 (step S101). Specifically, the editing planning unit 123 generates an editing policy registration screen that manages a list of editing policies that the user has previously registered. Then, the editing planning unit 123 transmits display information for the generated editing policy registration screen to the user terminal 400.

そして、ユーザ端末４００のブラウザは、編集方針登録画面を表示させ、登録する編集方針タイトルと、登録動画素材と、オーダーを含む情報を付帯させて編集方針登録依頼を動画生成装置１００に送信する（ステップＳ１０２）。 Then, the browser of the user terminal 400 displays an editing policy registration screen and transmits an editing policy registration request to the video production device 100 together with information including the editing policy title to be registered, the video material to be registered, and the order (step S102).

編集計画部１２３は、編集方針等を受け付ける（ステップＳ１０３）。具体的には、編集計画部１２３は、編集方針情報１１３に、編集方針タイトルと、オーダーに基づいてコンテンツの目標、制約条件、コンテンツの構成、編集計画書フォーマットと、登録動画素材に基づいてリソースファイルと、を登録する。なお、編集計画部１２３は、オーダーに記載されている自然言語を解釈して、オーダーに含まれているコンテンツの目標、制約条件、コンテンツの構成、編集計画書フォーマットを特定する。 The editing planning unit 123 accepts the editing policy and the like (step S103). Specifically, the editing planning unit 123 registers, in the editing policy information 113, the editing policy title, the content goal, constraint conditions, content structure, and editing plan format based on the order, and resource files based on the registered video material. The editing planning unit 123 interprets the natural language written in the order to identify the content goal, constraint conditions, content structure, and editing plan format included in the order.

そして、編集計画部１２３は、編集準備（命令情報作成）を行う（ステップＳ１０４）。具体的には、編集計画部１２３は、命令情報１１４を作成する。例えば、編集計画部１２３は、上述した命令情報１１４の編集方針データの指定部分を、編集方針情報１１３の内容に置き換えて、対話型ＡＩサービス２００に受け渡すプロンプトを生成する。 Then, the editing planning unit 123 performs editing preparation (creating command information) (step S104). Specifically, the editing planning unit 123 creates command information 114. For example, the editing planning unit 123 replaces the specified portion of the editing policy data in the command information 114 described above with the contents of the editing policy information 113, and generates a prompt to be passed to the interactive AI service 200.
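
The replacement of the designated portion of the command information with the registered editing policy can be sketched as template substitution. The template wording and the placeholder names (`goal`, `constraints`, `structure`, `plan_format`) are assumptions for illustration and do not reproduce the actual command information 114.

```python
from string import Template

# Hypothetical command-information template with editing-policy placeholders
COMMAND_TEMPLATE = Template(
    "Output an editing plan for the transcript that follows.\n"
    "Goal of the content: $goal\n"
    "Constraints: $constraints\n"
    "Structure of the content: $structure\n"
    "Editing plan format: $plan_format\n"
)


def build_command_information(policy: dict) -> str:
    """Fill the template from editing-policy fields; a missing field raises KeyError."""
    return COMMAND_TEMPLATE.substitute(policy)


policy = {
    "goal": "introduce the channel to new viewers",
    "constraints": "total length within 5 minutes",
    "structure": "three scenes, with a QR code shown for 10 seconds at the end",
    "plan_format": "shot / view / attach elements",
}
prompt = build_command_information(policy)
print(prompt)
```

Using `substitute` rather than `safe_substitute` makes an incomplete editing policy fail early instead of sending a prompt with unfilled placeholders.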

そして、編集計画部１２３は、編集準備（計画依頼）を行う（ステップＳ１０５）。具体的には、編集計画部１２３は、ステップＳ１０４にて作成した命令情報１１４と、素材動画の発話タイミング解析済テキストと、を対話型ＡＩサービス２００にＡＰＩ等を通じて送信する。 Then, the editing planning unit 123 performs editing preparation (planning request) (step S105). Specifically, the editing planning unit 123 transmits the command information 114 created in step S104 and the speech timing analyzed text of the raw video to the interactive AI service 200 via an API or the like.

そして、対話型ＡＩサービス２００は、送信された命令情報に従って、編集計画処理を行う（ステップＳ１０６）。具体的には、対話型ＡＩサービス２００は、素材動画の発話タイミング解析済テキストを用いて、発話内容（意味）と発話タイミングを考慮して重要な部分や面白い、興味深い等と評価される発言を中心にカット編集を行い、オーダーに従ってトランジションやアタッチメントを組み込んで指定された尺を満たすよう編集する計画を立てる。対話型ＡＩサービス２００は、計画した編集内容を編集計画書として指定されたフォーマットで生成し、動画生成装置１００に送信する。 Then, the interactive AI service 200 performs editing planning processing according to the transmitted command information (step S106). Specifically, using the utterance-timing-analyzed text of the source video and taking the content (meaning) and timing of the utterances into account, the interactive AI service 200 plans cut edits centered on important parts and on remarks judged to be amusing, interesting, or the like, and plans the edit so that it incorporates transitions and attachments according to the order and meets the specified length. The interactive AI service 200 generates the planned edits as an editing plan in the specified format and transmits it to the video generating device 100.
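
Because the plan is expected to meet the specified length, the receiving side could check a returned plan before editing. The sketch below is an assumption layered on the embodiment, which does not describe such a check. The function names and the representation of the plan as (start, end) pairs in seconds are hypothetical.

```python
from typing import List, Tuple


def planned_length_s(segments: List[Tuple[float, float]], tail_s: float = 0.0) -> float:
    """Total length of the selected clips plus any time appended at the end."""
    return sum(end - start for start, end in segments) + tail_s


def meets_length_limit(segments: List[Tuple[float, float]], limit_s: float,
                       tail_s: float = 0.0) -> bool:
    """True if every segment is well formed and the total fits within the limit."""
    if any(end <= start for start, end in segments):
        return False
    return planned_length_s(segments, tail_s) <= limit_s


# Three scenes plus 10 seconds for a QR code, checked against a 5-minute limit
segments = [(15.0, 27.0), (60.0, 150.0), (200.0, 380.0)]
print(planned_length_s(segments, tail_s=10.0))
print(meets_length_limit(segments, limit_s=300.0, tail_s=10.0))
```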

そして、動画編集部１２４は、該編集計画書１１５に従って動画編集（編集動画作成）を行う（ステップＳ１０７）。具体的には、動画編集部１２４は、送信された編集計画書を受信すると、記憶部１１０の編集計画書１１５に格納する。そして、動画編集部１２４は、該編集計画書１１５に従って動画編集（編集動画作成）を行い、動画編集の結果得られた編集動画を、記憶部１１０の編集動画１１６に格納するとともにユーザ端末４００へ送信する。なお、動画編集部１２４は、動画編集の結果得られた編集動画を、ダウンロード可能にウェブサイトに掲示してユーザ端末４００にそのリンクを送信するようにしてもよいし、あるいはユーザ端末４００から予め指定された動画共有サイトにアップロードするようにしてもよい。 Then, the video editing unit 124 performs video editing (edited video creation) according to the editing plan 115 (step S107). Specifically, when the video editing unit 124 receives the transmitted editing plan, it stores it in the editing plan 115 in the storage unit 110. Then, the video editing unit 124 performs video editing (edited video creation) according to the editing plan 115, stores the edited video obtained as a result of the video editing in the edited video 116 in the storage unit 110, and transmits it to the user terminal 400. Note that the video editing unit 124 may post the edited video obtained as a result of the video editing on a website so that it can be downloaded and transmit a link to the user terminal 400, or may upload the edited video obtained as a result of the video editing to a video sharing site designated in advance from the user terminal 400.

以上が、動画生成フロー（編集方針登録）の例である。動画生成フロー（編集方針登録）によれば、動画素材として登録された動画を解析して得た時系列発話情報と、編集方針を用いて作成した編集計画に従い、動画素材を編集して編集動画を得ることができる。したがって、ユーザが望む態様の動画を生成することができるといえる。 The above is an example of the video generation flow (editing policy registration). According to the video generation flow (editing policy registration), the video material can be edited to obtain an edited video according to the time-series speech information obtained by analyzing the video registered as video material and the editing plan created using the editing policy. Therefore, it can be said that a video in the format desired by the user can be generated.

図１１は、動画素材登録画面の画面例を示す図である。動画素材登録画面の画面例６００には、少なくとも、登録された動画素材ファイル６１０ごとに、動画タイトル６１１と、説明の情報６１５と、を含む情報を表示させる。その他、動画素材登録画面の画面例６００には、編集方針一覧画面へ遷移する指示を受け付ける編集方針表示ボタン６０１と、動画素材を新規登録する指示を受け付ける新規登録ボタン６０２と、登録された動画素材ファイル６１０ごとに、動画ファイル名６１２と、コンテンツ解析ステータス６１３と、動画素材の登録を解除する削除ボタン６１４と、が含まれる。 Figure 11 is a diagram showing an example of a video material registration screen. The example video material registration screen 600 displays information including at least a video title 611 and explanatory information 615 for each registered video material file 610. In addition, the example video material registration screen 600 includes an editing policy display button 601 that accepts an instruction to transition to an editing policy list screen, a new registration button 602 that accepts an instruction to newly register a video material, and, for each registered video material file 610, a video file name 612, a content analysis status 613, and a delete button 614 that cancels the registration of the video material.

コンテンツ解析ステータス６１３は、登録された動画素材について、発話のテキスト情報と動画上のその発話タイミングを解析した時系列発話情報を得たか否かを示す情報である。編集方針表示ボタン６０１は、入力を受け付けると、後述する編集方針登録画面の画面例に画面を遷移させる。新規登録ボタン６０２は、入力を受け付けると、後述する新規素材登録画面の画面例に画面を遷移させる。 The content analysis status 613 is information indicating whether or not time-series speech information obtained by analyzing the speech text information and the timing of the speech in the video for the registered video material has been obtained. When an input is received from the editing policy display button 601, the screen transitions to an example of an editing policy registration screen, which will be described later. When an input is received from the new registration button 602, the screen transitions to an example of a new material registration screen, which will be described later.

図１２は、新規素材登録画面の画面例を示す図である。新規素材登録画面の画面例６５０には、少なくとも、ユーザが登録する動画素材について、動画タイトル６５１と、動画ファイル名６５２と、動画ファイル名６５２にて特定される動画ファイルの格納位置を示すファイルパスを参照入力する参照ボタン６５３と、素材ファイルの説明入力欄６５４と、動画素材登録画面へ遷移する指示を受け付ける閉じるボタン６５５と、動画素材を登録する指示を受け付ける登録ボタン６５６と、が含まれる。 Figure 12 is a diagram showing an example of a new material registration screen. The example new material registration screen 650 includes at least a video title 651, a video file name 652, a reference button 653 for referencing and inputting a file path indicating the storage location of the video file specified by the video file name 652, a material file description input field 654, a close button 655 for receiving an instruction to transition to the video material registration screen, and a registration button 656 for receiving an instruction to register the video material.

素材ファイルの説明入力欄６５４は、フリーテキストにて素材の内容の説明を受け付ける。例えば、素材ファイルの説明入力欄６５４は、動画素材の場合、あらすじや、動画内時刻ごとのシーンの説明を受け付ける。登録ボタン６５６は、動画素材を登録する指示を受け付けると、動画生成フロー（動画素材登録）のステップＳ００３の登録処理を実施する。 The material file description input field 654 accepts a description of the content of the material in free text. For example, in the case of video material, the material file description input field 654 accepts a synopsis and a description of the scene for each time in the video. When the registration button 656 accepts an instruction to register video material, it carries out the registration process of step S003 of the video generation flow (video material registration).

図１３は、編集方針登録画面の画面例を示す図である。編集方針登録画面の画面例７００には、少なくとも、登録された編集方針７１０ごとに、編集方針名７１１と、編集方針の登録を解除する削除ボタン７１２と、編集方針の具体的な内容であるオーダー７１３と、編集計画書の作成の指示を受け付ける編集計画書作成ボタン７１４と、編集計画書により作成される動画のあらすじの説明の情報７１５と、編集計画書に従って編集動画を生成する指示を受け付ける動画生成ボタン７１６と、を表示させる。 Figure 13 is a diagram showing an example of an editing policy registration screen. The example 700 of the editing policy registration screen displays at least, for each registered editing policy 710, an editing policy name 711, a delete button 712 for canceling the registration of the editing policy, an order 713 that is the specific content of the editing policy, an editing plan creation button 714 for receiving an instruction to create an editing plan, information 715 explaining the plot of the video to be created according to the editing plan, and a video creation button 716 for receiving an instruction to generate an edited video according to the editing plan.

オーダー７１３は、編集方針（制約条件や構成条件を含む）を自然言語で記述したテキスト情報である。例えば、オーダー７１３には、生成する動画の尺の制限や目安、編集動画に付加すべき動画・静止画・音声の指定、あるいは編集動画において用いる視覚効果の指定が含まれる。 Order 713 is text information that describes the editing policy (including constraints and composition conditions) in natural language. For example, order 713 may include restrictions or guidelines for the length of the video to be generated, designation of video, still images, and audio to be added to the edited video, or designation of visual effects to be used in the edited video.

編集計画書作成ボタン７１４は、入力を受け付けると、編集計画書の作成の指示として受け付け、動画生成フロー（編集方針登録）のステップＳ１０４からステップＳ１０７を実施させる。あらすじ７１５には、編集計画書により示される編集動画のあらすじ（例えば、章立てや動画の再生時間等）が表示される。動画生成ボタン７１６は、入力を受け付けると、作成された編集計画書に従った動画作成の指示として受け付け、動画生成フロー（編集方針登録）のステップＳ１０７を実施させる。 When input is received by the edit plan creation button 714, it accepts the input as an instruction to create an edit plan, and executes steps S104 to S107 of the video creation flow (editing policy registration). Synopsis 715 displays the synopsis of the edited video indicated by the edit plan (e.g., chapter structure, video playback time, etc.). When input is received by the video creation button 716, it accepts the input as an instruction to create a video in accordance with the created edit plan, and executes step S107 of the video creation flow (editing policy registration).

また、編集方針登録画面の画面例７００には、登録動画素材一覧表示ボタン７０１と、新規登録ボタン７０２と、が含まれる。登録動画素材一覧表示ボタン７０１は、入力を受け付けると、動画素材登録画面の画面例６００に画面を遷移させる。新規登録ボタン７０２は、入力を受け付けると、後述する新規編集方針登録画面の画面例に画面を遷移させる。 The example editing policy registration screen 700 also includes a registered video material list display button 701 and a new registration button 702. When input is received from the registered video material list display button 701, the screen transitions to the example video material registration screen 600. When input is received from the new registration button 702, the screen transitions to the example new editing policy registration screen described below.

図１４は、新規編集方針登録画面の画面例を示す図である。新規編集方針登録画面の画面例７５０には、少なくとも、ユーザが登録する編集方針について、編集方針名７５１と、編集対象とする動画素材の動画ファイル名７５２と、動画ファイル名７５２にて特定される動画ファイルの格納位置を示すファイルパスを参照入力する参照ボタン７５３と、編集方針の具体的な内容を受け付けるオーダー入力欄７５４と、編集方針登録画面の画面例７００へ遷移する指示を受け付ける閉じるボタン７５５と、編集方針を登録する指示を受け付ける登録ボタン７５６と、が含まれる。 Figure 14 is a diagram showing an example of a new editing policy registration screen. The example 750 of the new editing policy registration screen includes at least an editing policy name 751 for the editing policy to be registered by the user, a video file name 752 of the video material to be edited, a reference button 753 for referencing and inputting a file path indicating the storage location of the video file specified by the video file name 752, an order input field 754 for receiving specific details of the editing policy, a close button 755 for receiving an instruction to transition to the example 700 of the editing policy registration screen, and a register button 756 for receiving an instruction to register the editing policy.

オーダー入力欄７５４は、フリーテキストにて編集方針の内容の指示（プロンプトへの追加情報）を受け付ける。具体的には、オーダー入力欄７５４は、生成する動画の尺の制限や目安、編集動画に付加すべき動画・静止画・音声の指定、あるいは編集動画において用いる視覚効果の指定を受け付ける。例えば、オーダー入力欄７５４は、「３つのシーンから構成され、それぞれのシーンの変遷には視覚効果を付けて急な被写体、明度の変化を避ける。ＢＧＭは明るい感じの曲で、動画の最後にはＱＲコードを表示する時間を１０秒設けて。動画全体の尺は５分以内で。」のようなフリーテキストを編集方針の内容として指示を受け付ける。 The order input field 754 accepts free text instructions (additional information to the prompt) regarding the content of the editing policy. Specifically, the order input field 754 accepts restrictions or guidelines for the length of the video to be generated, designation of video, still images, and audio to be added to the edited video, or designation of visual effects to be used in the edited video. For example, the order input field 754 accepts free text instructions as the content of the editing policy, such as "It will consist of three scenes, and visual effects will be added to the transitions between each scene to avoid sudden changes in subject and brightness. The background music will be a bright song, and there will be 10 seconds at the end of the video to display a QR code. The entire video should be within 5 minutes."

登録ボタン７５６は、編集方針を登録する指示を受け付けると、動画生成フロー（編集方針登録）のステップＳ１０３の登録処理を実施する。 When the registration button 756 receives an instruction to register an editing policy, it performs the registration process of step S103 of the video generation flow (editing policy registration).

以上が、本発明に係る実施形態の一つとしての動画生成システム１である。以上の実施形態のように、動画生成システム１によれば、ユーザ自身に動画編集のスキルが無い場合や、動画生成のための設備環境がない場合であっても、ユーザが望む態様の動画を生成することができる。 The above is an explanation of the video creation system 1 as one embodiment of the present invention. As in the above embodiment, the video creation system 1 allows a user to create a video in the format desired by the user, even if the user does not have video editing skills or does not have the equipment and environment required for video creation.

本発明は、上記の実施形態に制限されない。上記の実施形態は、本発明の技術的思想の範囲内で様々な変形が可能である。例えば、上記の実施形態においては、動画生成装置１００は、対話型ＡＩサービス２００を利用して動画の編集計画書を得ているが、これに限られず、例えば、動画生成装置１００自身にて動画生成に特化した生成ＡＩを稼働させ、編集動画を生成するものであってもよい。 The present invention is not limited to the above-described embodiment. The above-described embodiment can be modified in various ways within the scope of the technical concept of the present invention. For example, in the above-described embodiment, the video generating device 100 obtains a video editing plan using the interactive AI service 200, but the invention is not limited to this. For example, the video generating device 100 may itself run a generative AI specialized for video generation to generate the edited video.

あるいは、上記の実施形態においては、動画生成装置１００は、音声解析サービス３００を利用して素材動画の発話の解析を行っているが、これに限られず、例えば、動画生成装置１００自身にて音声解析に特化した生成ＡＩを稼働させ、時系列発話情報を生成するものであってもよい。 Alternatively, in the above embodiment, the video generating device 100 uses the audio analysis service 300 to analyze the speech in the source video, but the invention is not limited to this. For example, the video generating device 100 may itself run a generative AI specialized for audio analysis to generate the time-series speech information.

The functions of the video generation device 100 may also be realized by a cloud service composed of one or more computers.

The technical elements of the above embodiment may be applied individually, or may be applied after being divided into multiple parts, such as program components and hardware components.

The present invention has been described above with a focus on the embodiments.

1: video generation system, 50: communication path, 100: video generation device, 110: storage unit, 111: material information, 112: time-series speech information, 113: editing policy information, 114: command information, 115: editing plan, 116: edited video, 120: processing unit, 121: acquisition unit, 122: analysis unit, 123: editing plan unit, 124: video editing unit, 140: input/output unit, 150: communication unit, 200: interactive AI service, 300: audio analysis service, 400: user terminal.

Claims

A video generation device comprising:
an acquisition unit that acquires a video file;
an analysis unit that acquires an analysis result including time-series speech information, which is text information obtained by transcribing, in time series, speech included in the video file;
an editing plan unit that transmits, to an interactive AI using a language model, the time-series speech information and command information instructing output of an editing plan for editing the time-series speech information according to desired editing policy information, and receives the editing plan from the interactive AI; and
a video editing unit that edits the video file in accordance with the editing plan and generates an edited video.

The video generation device according to claim 1, wherein
the editing plan unit includes, in the command information, a constraint condition for the edited video obtained according to the editing plan.

The video generation device according to claim 1, wherein
the editing plan unit includes, in the command information, configuration information about the edited video obtained according to the editing plan.

The video generation device according to claim 1, wherein
the editing plan unit includes, in the command information, a designation of a video, a still image, or audio to be added to the edited video obtained according to the editing plan.

The video generation device according to claim 1, wherein
the editing plan unit includes, in the command information, a designation of a visual effect to be used in the edited video obtained according to the editing plan.

The video generation device according to claim 1, wherein
the editing plan includes information for constructing the edited video by connecting partial videos, each defined by a start position and an end position on the time axis of the video file, and
the video editing unit generates the edited video by extracting the partial videos from the video file and connecting the extracted partial videos.
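The partial-video structure described in this claim can be sketched as follows. This is an illustrative Python example, not the claimed implementation: `PartialVideo`, `validate_plan`, and `output_timeline` are hypothetical names, and positions are assumed to be expressed in seconds. The sketch only checks the spans and computes where each part lands after joining; the actual extraction and joining would be delegated to a video processing tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PartialVideo:
    """Hypothetical plan entry: one span on the source video's time axis."""
    start: float  # seconds from the start of the source video
    end: float    # seconds from the start of the source video


def validate_plan(parts: List[PartialVideo], source_duration: float) -> None:
    """Reject spans that are empty, reversed, or outside the source video."""
    for i, p in enumerate(parts):
        if not (0 <= p.start < p.end <= source_duration):
            raise ValueError(f"partial video {i} has an invalid span: {p}")


def output_timeline(parts: List[PartialVideo]) -> List[Tuple[float, float]]:
    """Return where each partial video lands once the parts are joined in order."""
    timeline = []
    cursor = 0.0
    for p in parts:
        length = p.end - p.start
        timeline.append((cursor, cursor + length))
        cursor += length
    return timeline


# Parts may be listed in any order; the plan order decides the output order.
plan = [PartialVideo(30.0, 45.0), PartialVideo(120.0, 150.0), PartialVideo(10.0, 20.0)]
validate_plan(plan, source_duration=600.0)
timeline = output_timeline(plan)
```

Validating the spans before any extraction is a design choice for this sketch: a plan returned by a language model may contain positions outside the source video, and catching that early avoids a failed edit later.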

The video generation device according to claim 1, wherein
the editing plan includes information for constructing the edited video by joining partial videos, each defined by a start position and an end position on the time axis of the video file, and information designating a video, still image, or audio to be added before and after the partial videos, and
the video editing unit extracts the partial videos from the video file, joins them, and generates the edited video by adding the designated video, still image, or audio.

The video generation device according to claim 1, wherein
the editing plan includes information for constructing the edited video by joining partial videos, each defined by a start position and an end position on the time axis of the video file, and a designation of a visual effect to be used at the joins between the partial videos, and
the video editing unit extracts the partial videos from the video file, joins them, and applies the designated visual effect to the joins to generate the edited video.

The video generation device according to claim 1, wherein
the analysis unit applies fast-forward editing to the speech audio included in the video file while maintaining its time series, and passes the result to a predetermined speech-to-text conversion unit to obtain the time-series speech information.
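One consequence of transcribing fast-forwarded audio is that any timestamps returned by the speech-to-text step refer to the shortened audio. The Python sketch below shows one way to map them back to the source video, under the assumption (not stated in the claim) that the audio was sped up by a single constant factor. `TimedText` and `to_source_time` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimedText:
    """Hypothetical transcript entry with times in seconds."""
    start: float
    end: float
    text: str


def to_source_time(entries: List[TimedText], speed: float) -> List[TimedText]:
    """Map times measured on sped-up audio back to the source time axis.

    Assumes a constant speed factor, so one second of sped-up audio
    corresponds to `speed` seconds of the source video.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [TimedText(e.start * speed, e.end * speed, e.text) for e in entries]


# Times as reported for audio played at double speed.
fast = [TimedText(1.5, 4.0, "Welcome to the demo."), TimedText(4.0, 6.5, "Let's begin.")]
restored = to_source_time(fast, speed=2.0)
```

Because the scaling is uniform, the order of the entries is preserved, which is consistent with the requirement that the time series be maintained.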

The video generation device according to claim 1, wherein
the analysis unit identifies the speakers of the speech audio included in the video file, extracts the speech audio of each speaker while maintaining the time series, passes it to a predetermined speech-to-text conversion unit, and obtains the time-series speech information by integrating the resulting text information.
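The integration step in this claim can be illustrated with a short Python sketch that merges per-speaker transcripts into a single time-ordered list. It assumes each speaker's transcript is already sorted by start time and that entries carry a start time in seconds; the function name `integrate` and the tuple layout are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

SpeakerLine = Tuple[float, str]        # (start time in seconds, text)
MergedLine = Tuple[float, str, str]    # (start time in seconds, speaker, text)


def integrate(per_speaker: Dict[str, List[SpeakerLine]]) -> List[MergedLine]:
    """Merge per-speaker transcripts into one list ordered by start time.

    Each speaker's list must already be sorted by start time, which holds
    when the time series is maintained during extraction.
    """
    streams = [
        [(start, speaker, text) for start, text in lines]
        for speaker, lines in per_speaker.items()
    ]
    # heapq.merge lazily merges already-sorted inputs without re-sorting them.
    return list(heapq.merge(*streams, key=lambda row: row[0]))


per_speaker = {
    "A": [(0.0, "Hello."), (5.0, "Let's begin.")],
    "B": [(2.5, "Hi.")],
}
merged = integrate(per_speaker)
```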

The video generation device according to claim 1, wherein
the editing plan is written in a predetermined format language, and
the editing plan unit includes, in the command information, definition information for the format language in which the editing plan is written.
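As a hedged illustration of including a format definition in the command information, the Python sketch below appends a definition to the prompt and checks that a reply conforms to it. The claim does not name the format language; JSON is assumed here purely for illustration, and `FORMAT_DEFINITION`, `with_format_definition`, and `parse_editing_plan` are hypothetical names.

```python
import json
from typing import Dict, List

# Assumed format definition; the actual format language is not specified.
FORMAT_DEFINITION = (
    "Return the editing plan as JSON only, in this form:\n"
    '{"parts": [{"start": <seconds>, "end": <seconds>}]}'
)


def with_format_definition(command_information: str) -> str:
    """Append the format definition so the reply can be parsed mechanically."""
    return command_information.rstrip() + "\n\n" + FORMAT_DEFINITION


def parse_editing_plan(reply: str) -> List[Dict[str, float]]:
    """Parse a reply and check that it follows the assumed format."""
    data = json.loads(reply)  # raises a ValueError subclass on invalid JSON
    if not isinstance(data, dict) or not isinstance(data.get("parts"), list):
        raise ValueError("editing plan must be an object with a 'parts' list")
    for part in data["parts"]:
        if not isinstance(part, dict):
            raise ValueError("each part must be an object")
        for key in ("start", "end"):
            if not isinstance(part.get(key), (int, float)):
                raise ValueError(f"each part needs a numeric '{key}'")
    return data["parts"]


prompt = with_format_definition("Output an editing plan for the transcript below.")
parts = parse_editing_plan('{"parts": [{"start": 30, "end": 45}]}')
```

Stating the format in the prompt and validating the reply against the same definition lets the video editing step reject malformed plans before touching the video file.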

A video generation method using a video generation device, wherein
the video generation device includes a processor, and
the processor performs:
an acquisition step of acquiring a video file;
an analysis step of acquiring an analysis result including time-series speech information, which is text information obtained by transcribing, in time series, speech included in the video file;
an editing planning step of transmitting, to an interactive AI using a language model, the time-series speech information and command information instructing output of an editing plan for editing the time-series speech information according to desired editing policy information, and receiving the editing plan from the interactive AI; and
a video editing step of editing the video file in accordance with the editing plan to generate an edited video.

A video generation program for causing an information processing device to generate a video, wherein
the information processing device includes a processor, and
the program causes the processor to perform:
an acquisition step of acquiring a video file;
an analysis step of acquiring an analysis result including time-series speech information, which is text information obtained by transcribing, in time series, speech included in the video file;
an editing planning step of transmitting, to an interactive AI using a language model, the time-series speech information and command information instructing output of an editing plan for editing the time-series speech information according to desired editing policy information, and receiving the editing plan from the interactive AI; and
a video editing step of editing the video file in accordance with the editing plan to generate an edited video.

A video generation system including a user terminal and a video generation device communicably connected to the user terminal, wherein
the video generation device comprises:
an acquisition unit that acquires a video file from the user terminal via communication;
an analysis unit that acquires an analysis result including time-series speech information, which is text information obtained by transcribing, in time series, speech included in the video file;
an editing plan unit that transmits, to an interactive AI using a language model, the time-series speech information and command information instructing output of an editing plan for editing the time-series speech information according to desired editing policy information, and receives the editing plan from the interactive AI; and
a video editing unit that edits the video file in accordance with the editing plan and generates an edited video.