
TWI900067B - Method and system for generating meeting minutes - Google Patents

Method and system for generating meeting minutes

Info

Publication number
TWI900067B
TWI900067B TW113122571A
Authority
TW
Taiwan
Prior art keywords
text
recognition results
identities
meeting
record
Prior art date
Application number
TW113122571A
Other languages
Chinese (zh)
Inventor
陳維超
黃柏瑄
彭啓人
陳瑞章
Original Assignee
英業達股份有限公司
Filing date
Publication date
Application filed by 英業達股份有限公司 filed Critical 英業達股份有限公司
Application granted granted Critical
Publication of TWI900067B publication Critical patent/TWI900067B/en

Abstract

A method for generating meeting minutes includes the following steps. A video signal, an audio signal, and source localization information of a video conference are obtained. Face recognition is performed on a plurality of image frames of the video signal to obtain a plurality of face recognition results. Voiceprint recognition is performed on a plurality of audio segments of the audio signal to obtain a plurality of voiceprint recognition results at a plurality of timestamps. The voiceprint recognition results are matched to the face recognition results according to the source localization information, in order to obtain a plurality of speaker identities. Speech-to-text recognition is performed on the audio segments of the audio signal to obtain a transcript. The speaker identities are marked in the transcript according to the timestamps to obtain a context. Context understanding is performed on the context to obtain a meeting minutes report.

Description

會議記錄生成的方法及系統 Method and system for generating meeting minutes

本案係關於一種會議記錄生成的方法及系統,特別係關於一種適用於多人會議的會議記錄生成的方法及系統。 This application relates to a method and system for generating meeting minutes, and more particularly, to a method and system for generating meeting minutes suitable for multi-person meetings.

現行之會議系統能透過人工智慧語音轉文字辨識等功能自動產出會議逐字稿,接著有些會把它直接視作文章,直接利用文字摘要技術對逐字稿作閱讀理解產生會議記錄。 Current conferencing systems can automatically generate meeting transcripts through features like artificial intelligence speech-to-text recognition. Some systems then treat these transcripts as written text and use text summarization technology to comprehend the transcripts and generate meeting records.

然而多人討論互動的會議與單一文章並不相同,純粹語音轉文字的逐字稿並無法真實地呈現多人會議的實際情境,進而限制了此種功能的實際應用。一個完善的會議記錄,需要考量到整個多人會議的情境,在進行會議摘要的時候,需要考量多種因素產出適當的會議記錄,比方說與會者的身份及互動情緒等,這些往往並非單一文字語意理解可以涵蓋的。 However, a meeting involving multiple interacting participants is different from a single article. A purely speech-to-text transcript cannot faithfully capture the actual context of a multi-person meeting, which limits the practical application of this feature. A comprehensive meeting record must consider the entire context of the multi-person meeting. When summarizing a meeting, multiple factors must be considered to produce an appropriate meeting record, such as the identities of the participants and the emotional dynamics of their interactions. These factors often cannot be covered by textual semantic understanding alone.

另一方面,在現實生活中一個會議具有各種型態,有可能會是實體、線上、混合的方式進行,實際會議的互動過程也常常是多模態的組成,比如與會者可能會使用相機、麥克風、文字等各模態的裝置。這些影音裝置在一場會議中未必會使用齊全,諸如相機無法開啟、多人共用同一裝置之麥克風或帳號加入會議等皆是常見之使用情景。 On the other hand, in real life a meeting can take many forms, such as in-person, online, or hybrid, and the interaction in an actual meeting is often multimodal; for example, participants may use devices of various modalities such as cameras, microphones, and text. These audio and video devices may not all be available in a given meeting: cameras that cannot be turned on, or multiple participants joining the meeting through the microphone or account of a shared device, are common scenarios.

更進一步地,參與會議的人員也隨時有可能有所變動,比如受邀者臨時有事而無法參與或會議臨時加入不在原本會議與會清單裡的人員。在多人共用麥克風情境裡,與會者難以判斷此時說話的人的身份,或是多人會議中,與會者常常難以認得所有人的身分,這些進一步導致了多人會議中資訊的混亂且難以記錄。 Furthermore, meeting participants can change at any time, for example, if an invitee becomes unavailable due to an emergency, or someone not originally on the attendee list is added to the meeting. In scenarios where multiple people share a microphone, it can be difficult for attendees to identify the speaker. In multi-person meetings, participants often struggle to recognize everyone, leading to confusion and difficulty recording information.

因此,如何提出一種會議記錄生成的方法及系統以解決上述問題為本領域中重要的議題。 Therefore, how to propose a method and system for generating meeting minutes to solve the above problems is an important issue in this field.

於本揭示文件的一些實施例中,為提昇多人會議之體驗及完善會議情境記錄,提出一多模態會議系統架構。本揭示文件的方法及系統藉由同時接收影像、聲音、文字輸入、帳號等多模態輸入,進行與會者的身份與會議內容解析,進行即時身份辨識標示及考慮到完整會議情境下的會議重點記錄發送。 In some embodiments of this disclosure, a multimodal conferencing system architecture is proposed to enhance the multi-person conferencing experience and improve the recording of meeting context. By simultaneously receiving multimodal inputs such as video, audio, text input, and accounts, the disclosed method and system analyze participant identities and meeting content, perform real-time identity recognition and labeling, and deliver key-point meeting records that take the complete meeting context into account.

本揭示文件提供一種會議記錄生成方法。會議記錄生成方法包含下列步驟。獲取視訊會議中的視訊訊號、一語音訊號以及聲源定位資訊。對視訊訊號中的多個影像幀進行臉部辨識,以獲取多個臉部辨識結果。對該些語音訊號中的多個語音段進行聲紋辨識,以獲取在多個時間標記下的多個聲紋辨識結果。根據聲源定位資訊,將此些聲紋辨識結果與此些臉部辨識結果匹配,以獲取在此些時間標記下的多個發言者身分。對語音訊號中的此些語音段進行語音轉文本識別,以獲取文字記錄。根據此些時間標記,將此些發言者身分標註至文字記錄,以獲取文本。對文本進行情境理解,以獲取會議記錄。 This disclosure provides a method for generating a meeting record. The method comprises the following steps: obtaining a video signal, a voice signal, and sound source localization information of a video conference; performing facial recognition on multiple image frames in the video signal to obtain multiple facial recognition results; performing voiceprint recognition on multiple voice segments in the voice signal to obtain multiple voiceprint recognition results at multiple time stamps; matching the voiceprint recognition results with the facial recognition results based on the sound source localization information to obtain multiple speaker identities at the time stamps; performing speech-to-text recognition on the voice segments in the voice signal to obtain a transcript; annotating the speaker identities to the transcript based on the time stamps to obtain a text; and performing contextual understanding on the text to obtain a meeting record.

於一些實施例中,此些聲紋辨識結果以及此些臉部辨識結果包含多個已知身分以及至少一未知身分。 In some embodiments, these voiceprint recognition results and these facial recognition results include multiple known identities and at least one unknown identity.

於一些實施例中,此些聲紋辨識結果包含多個未知身分,並且其中此些臉部辨識結果包含多個已知身分。 In some embodiments, the voiceprint recognition results include multiple unknown identities, and the facial recognition results include multiple known identities.

於一些實施例中,會議記錄生成方法,更包含下列步驟。根據聲源定位資訊以及臉部辨識結果,判斷此些已知身分中的此些發言者身分。根據此些時間標記,根據此些發言者身分更新此些聲紋辨識結果的此些未知身分。 In some embodiments, the conference record generation method further includes the following steps: Determining the identities of the speakers among the known identities based on the sound source localization information and facial recognition results. Based on the time stamps, updating the unknown identities in the voiceprint recognition results with the speaker identities.

於一些實施例中,此些聲紋辨識結果包含複數個已知身分,並且其中此些臉部辨識結果包含複數個未知身分。 In some embodiments, the voiceprint recognition results include a plurality of known identities, and the facial recognition results include a plurality of unknown identities.

於一些實施例中,聲源定位資訊包含多個聲源各自的角度以及方向中至少一者。 In some embodiments, the sound source localization information includes at least one of the angle and direction of each of the multiple sound sources.

於一些實施例中,臉部辨識結果包含此些影像幀中的臉部邊界框的位置以及此些臉部邊界框各自對應的身分。 In some embodiments, the facial recognition results include the locations of facial bounding boxes in these image frames and the identities corresponding to these facial bounding boxes.

於一些實施例中,會議記錄生成方法更包含下列步驟。獲取至少一用戶配置的一文字輸入。根據時間序列,將文字輸入插入文字記錄,以產生更新後文字記錄。根據此些時間標記,將此些發言者身分標註至更新後文字記錄,以獲取文本。對文本進行情境理解,以獲取會議記錄。 In some embodiments, the meeting transcript generation method further includes the following steps: obtaining text input configured by at least one user; inserting the text input into the transcript based on a time sequence to generate an updated transcript; tagging the speakers' identities to the updated transcript based on the time stamps to obtain text; and performing contextual understanding on the text to obtain a meeting transcript.

於一些實施例中,會議記錄生成方法更包含下列步驟。對文本進行情境理解,以獲取文本中多個語句的多個情緒語意。根據此些情緒語意,移除文本中的部分內容,以產生更新後文本。對更新後文本進行摘要提取,以獲取該會議記錄。 In some embodiments, the meeting record generation method further includes the following steps: performing contextual understanding on the text to obtain multiple emotional semantics of multiple sentences in the text; removing portions of the text based on these emotional semantics to generate an updated text; and performing summary extraction on the updated text to obtain the meeting record.

本揭示文件提供一種會議記錄生成系統。會議記錄生成系統包含記憶體以及處理器。記憶體用以儲存多個指令以及資料。處理器電性連接記憶體。處理器用以存取記憶體儲存的此些指令以及該資料以執行下列步驟。獲取視訊會議中的視訊訊號、一語音訊號以及聲源定位資訊。對視訊訊號中的多個影像幀進行臉部辨識,以獲取多個臉部辨識結果。對該些語音訊號中的多個語音段進行聲紋辨識,以獲取在多個時間標記下的多個聲紋辨識結果。根據聲源定位資訊,將此些聲紋辨識結果與此些臉部辨識結果匹配,以獲取在此些時間標記下的多個發言者身分。對語音訊號中的此些語音段進行語音轉文本識別,以獲取文字記錄。根據此些時間標記,將此些發言者身分標註至文字記錄,以獲取文本。對文本進行情境理解,以獲取會議記錄。 This disclosure provides a conference record generation system. The conference record generation system includes a memory and a processor. The memory is used to store multiple instructions and data. The processor is electrically connected to the memory. The processor is used to access these instructions and data stored in the memory to execute the following steps. A video signal, a voice signal, and sound source localization information in a video conference are obtained. Facial recognition is performed on multiple image frames in the video signal to obtain multiple facial recognition results. Voiceprint recognition is performed on multiple voice segments in the voice signals to obtain multiple voiceprint recognition results at multiple time stamps. Based on the sound source localization information, the voiceprint recognition results are matched with the facial recognition results to obtain the identities of multiple speakers at these time stamps. Speech-to-text recognition is performed on the speech segments in the voice signal to obtain a text transcript. Based on the time stamps, the speaker identities are annotated to the text transcript to obtain a text. Contextual understanding is performed on the text to obtain a meeting transcript.

為使本揭露之上述和其他目的、特徵、優點與實施例能更明顯易懂,所附符號之說明如下: To make the above and other objects, features, advantages and embodiments of the present disclosure more clearly understood, the accompanying symbols are explained as follows:

100:會議記錄生成系統 100: Meeting Minutes Generation System

110:外部裝置 110: External device

111:鍵盤 111:Keyboard

112:滑鼠 112: Mouse

113:陣列式麥克風 113: Microphone Array

114:攝影機 114: Camera

120:電子裝置 120: Electronic devices

121:處理器 121: Processor

122:記憶體裝置 122: Memory device

123:(觸控)顯示器 123:(Touch) Display

124:聲音感測器 124: Sound Sensor

125:影像感測器 125: Image sensor

126:鍵盤 126:Keyboard

130:網路 130: Network

140:伺服器 140: Server

200:會議記錄生成方法 200:Meeting record generation method

410:聲音特徵與臉部特徵 410: Voice and facial features

420:文字記錄 420: Text Records

421~426:語句 421~426: Sentences

430:文本 430: Text

431~436:情緒語意 431~436: Emotional Semantics

440:會議記錄 440: Meeting Minutes

441:會議標題 441: Meeting Title

442:行動項目 442: Action Items

TRX:文字記錄 TRX: Text Record

S210,S220,S230,S240,S250,S260,S270:步驟 S210, S220, S230, S240, S250, S260, S270: Steps

S281,S282,S290,S291,S292,S293,S294:步驟 S281, S282, S290, S291, S292, S293, S294: Steps

S212,S214,S216:步驟 S212, S214, S216: Steps

S222,S224,S226,S228,S229:步驟 S222, S224, S226, S228, S229: Steps

S230,S242,S244,S246,S248,S249,S250:步驟 S230, S242, S244, S246, S248, S249, S250: Steps

S272,S274,S276,S278,S281,S282:步驟 S272, S274, S276, S278, S281, S282: Steps

為使本揭露之上述和其他目的、特徵、優點與實施例能更明顯易懂,所附圖式之說明如下:第1圖為依據本揭露一些實施例之會議記錄生成系統的示意圖。 To make the above and other objects, features, advantages, and embodiments of the present disclosure more clearly understood, the accompanying drawings are described as follows: Figure 1 is a schematic diagram of a meeting record generation system according to some embodiments of the present disclosure.

第2圖為依據本揭露一些實施例之會議記錄生成方法的示意圖。 Figure 2 is a schematic diagram of a method for generating meeting minutes according to some embodiments of the present disclosure.

第3圖為依據本揭露一些實施例之會議記錄生成方法的示意圖。 Figure 3 is a schematic diagram of a method for generating meeting minutes according to some embodiments of the present disclosure.

第4圖為依據本揭露一些實施例之生成會議記錄的操作的示意圖。 Figure 4 is a schematic diagram of the process of generating a meeting record according to some embodiments of the present disclosure.

下列係舉實施例配合所附圖示做詳細說明,但所提供之實施例並非用以限制本揭露所涵蓋的範圍,而結構運作之描述非用以限制其執行順序,任何由元件重新組合之結構,所產生具有均等功效的裝置,皆為本揭露所涵蓋的範圍。另外,圖示僅以說明為目的,並未依照原尺寸作圖。為使便於理解,下述說明中相同元件或相似元件將以相同之符號標示來說明。 The following examples are illustrated in detail with accompanying diagrams. However, these examples are not intended to limit the scope of this disclosure, and the description of the structural operation is not intended to limit the order of execution. Any device with equivalent functionality produced by recombining these components is within the scope of this disclosure. Furthermore, the diagrams are for illustrative purposes only and are not drawn to scale. To facilitate understanding, identical or similar components will be labeled with the same reference numerals in the following description.

在全篇說明書與申請專利範圍所使用之用詞(terms),除有特別註明除外,通常具有每個用詞使用在此領域中、在此揭露之內容中與特殊內容中的平常意義。此外,在本文中所使用的用詞『包含』、『包括』、『具有』、『含有』等等,均為開放性的用語,即意指『包含但不限於』。此外,本文中所使用之『及/或』,包含相關列舉項目中一或多個項目的任意一個以及其所有組合。 Unless otherwise noted, terms used throughout this specification and the claims generally have the ordinary meaning of each term as used in this field, in the content disclosed herein, and in the specific context. Furthermore, the terms "comprise," "include," "have," "contain," and the like used herein are open-ended terms, meaning "including but not limited to." In addition, "and/or" as used herein includes any one of and all combinations of one or more of the associated listed items.

請參閱第1圖,第1圖為依據本揭露一些實施例之會議記錄生成系統100的示意圖。 Please refer to Figure 1, which is a schematic diagram of a meeting record generation system 100 according to some embodiments of the present disclosure.

如第1圖所示,會議記錄生成系統100包含外部裝置110、電子裝置120以及伺服器140。於一些實施例中,外部裝置110包含鍵盤111、滑鼠112、陣列式麥克風113以及攝影機114。於一些實施例中,陣列式麥克風113具有聲源定位技術的功能。 As shown in FIG. 1, the meeting record generation system 100 includes an external device 110, an electronic device 120, and a server 140. In some embodiments, the external device 110 includes a keyboard 111, a mouse 112, a microphone array 113, and a camera 114. In some embodiments, the microphone array 113 is capable of sound source localization.

於一些實施例中,陣列式麥克風113通過到達時間差法(Time Difference of Arrival;TDOA)和波束成形法(Beamforming)等聲音定位算法,定位聲音的來源進而得到聲源定位資訊。於一些實施例中,所述聲源定位資訊包含多個聲源各自的一角度以及一方向中至少一者。於一些實施例中,聲源定位資訊包含一會議場地內在會議期間中的發言者各自的角度及/或方向。 In some embodiments, the microphone array 113 locates the source of a sound through sound localization algorithms such as Time Difference of Arrival (TDOA) and beamforming, and thereby obtains the sound source localization information. In some embodiments, the sound source localization information includes at least one of an angle and a direction of each of multiple sound sources. In some embodiments, the sound source localization information includes the angle and/or direction of each speaker in a conference venue during the meeting.
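
To make the idea behind TDOA-based localization concrete, the following is a minimal illustrative sketch (not the actual implementation claimed by this disclosure) that estimates the arrival angle of a sound source from a two-microphone array using GCC-PHAT. The sample rate, microphone spacing, and function names are assumptions made for illustration only.

```python
# Illustrative sketch only: two-microphone TDOA angle estimation via GCC-PHAT.
# Sample rate, microphone spacing, and signals are hypothetical assumptions.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
MIC_SPACING = 0.1        # meters between the two microphones (assumed)
SAMPLE_RATE = 16000      # Hz (assumed)

def gcc_phat(sig_a: np.ndarray, sig_b: np.ndarray) -> float:
    """Estimate the delay (in samples) of sig_b relative to sig_a via GCC-PHAT."""
    n = len(sig_a) + len(sig_b)
    spec = np.fft.rfft(sig_a, n) * np.conj(np.fft.rfft(sig_b, n))
    spec /= np.abs(spec) + 1e-12                     # phase transform weighting
    cross_corr = np.fft.irfft(spec, n)
    max_shift = n // 2
    cross_corr = np.concatenate((cross_corr[-max_shift:], cross_corr[:max_shift + 1]))
    return float(np.argmax(np.abs(cross_corr)) - max_shift)

def estimate_angle(sig_a: np.ndarray, sig_b: np.ndarray) -> float:
    """Convert the inter-microphone delay into a source angle in degrees."""
    delay_sec = gcc_phat(sig_a, sig_b) / SAMPLE_RATE
    # Clamp to the physically valid range before applying arcsin.
    ratio = np.clip(delay_sec * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))
```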

於一些實施例中,電子裝置120與外部裝置110電性連接,以接收來自外部裝置110的聲音、影像及/或文字輸入。於一些實施例中,電子裝置120包含處理器121、記憶體裝置122、(觸控)顯示器123、聲音感測器124、影像感測器125以及鍵盤126。於一些實施例中,使用者通過外部裝置110的攝影機114或電子裝置120的影像感測器125拍攝會議場地內的影像,並通過陣列式麥克風113或聲音感測器124接收會議場地內的聲音並進行聲源定位,所述影像經由網路130傳送至伺服器140並通過串流產生視訊會議的影像訊號以及語音訊號。 In some embodiments, the electronic device 120 is electrically connected to the external device 110 to receive audio, video, and/or text input from the external device 110. In some embodiments, the electronic device 120 includes a processor 121, a memory device 122, a (touch) display 123, a sound sensor 124, an image sensor 125, and a keyboard 126. In some embodiments, a user captures images of the conference venue through the camera 114 of the external device 110 or the image sensor 125 of the electronic device 120, and the sound in the conference venue is received and localized through the microphone array 113 or the sound sensor 124; the images are transmitted to the server 140 via the network 130 and streamed to generate the video signal and the voice signal of the video conference.

於一些實施例中,使用者通過外部裝置110的鍵盤111、滑鼠112及/或電子裝置120的觸控顯示器123、鍵盤126進行文字輸入。 In some embodiments, the user inputs text via the keyboard 111, mouse 112 of the external device 110 and/or the touch display 123, keyboard 126 of the electronic device 120.

需要注意的是,第1圖僅繪示一個電子裝置,但可理解的是視訊會議是通過各個會議場地所使用的電子裝置進行。於一些實施例中,各會議場地所使用的設備對應於第1圖中的電子裝置120及/或外部裝置110。於一些實施例中,伺服器140經由網路130自視訊會議中各個會議場地的設備(例如,電子裝置120及/或外部裝置110)接收各個會議場地的影像、聲音及/或聲源資料等資料,從而根據各個會議場地的影像、聲音及/或聲源資料等資料,識別會議期間的音訊記錄中的發言者,進而基於標記有發言者的文本內容生成會議記錄。於一些實施例中,本揭示的會議記錄指稱的是會議紀要。於一些實施例中,本揭示指稱的會議記錄是會議摘要,本揭示不以此為限。 It should be noted that FIG. 1 only depicts one electronic device, but it can be understood that the video conference is conducted through the electronic devices used at each conference venue. In some embodiments, the equipment used at each conference venue corresponds to the electronic device 120 and/or the external device 110 in FIG. 1. In some embodiments, the server 140 receives the images, sound, and/or sound source data of each conference venue from the equipment at each conference venue (for example, the electronic device 120 and/or the external device 110) via the network 130, identifies the speakers in the audio recording of the meeting based on such data, and generates the meeting record based on the text content tagged with the speakers. In some embodiments, the meeting record referred to in this disclosure is meeting minutes. In some embodiments, the meeting record referred to in this disclosure is a meeting summary; this disclosure is not limited thereto.

於一些實施例中,伺服器140包含記憶體及處理器,記憶體用以儲存一些資料以及計算機可執行指令,處理器電性耦接記憶體,並且處理器自記憶體存取資料或指令以執行第2圖以及第3圖之會議記錄生成方法200中的步驟。在一些實施例中,記憶體可以包含動態記憶體、靜態記憶體、硬碟及/或快閃記憶體。在一些實施例中,處理器包含中央處理單元(Central Processing Unit;CPU)、圖形處理單元(Graphic Processing Unit;GPU)、張量處理單元(Tensor Processing Unit;TPU)、專用集成電路(Application Specific Integrated Circuit;ASIC)或任何等效的處理電路。因此,本揭示不以此為限。 In some embodiments, the server 140 includes a memory and a processor. The memory is used to store data and computer-executable instructions. The processor is electrically coupled to the memory, and the processor accesses the data or instructions from the memory to execute the steps of the meeting record generation method 200 in FIG. 2 and FIG. 3. In some embodiments, the memory may include dynamic memory, static memory, a hard drive, and/or flash memory. In some embodiments, the processor includes a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), an application-specific integrated circuit (ASIC), or any equivalent processing circuit. Therefore, the present disclosure is not limited thereto.

請參閱第1圖以及第2圖。第2圖為依據本揭露一些實施例之會議記錄生成方法200的示意圖。如第2圖所示,會議記錄生成方法200包含步驟S210~S294。於一些實施例中,步驟S210~S294可以由第1圖的會議記錄生成系統100執行。於一些實施例中,步驟S210~S294可以由第1圖的伺服器140執行。於一些實施例中,伺服器140接收各會議場地的裝置/設備所收集的影像、聲音、文字輸入、帳號,利用四種模態互相進行加權關聯與整合,進行會議情境記錄之分析。最後會議結束後,系統會根據上述資訊寄送會議記錄給與會者,同時更新配置資料庫(Profile Database),所述資料庫存有人員的資料及歷史資料,包括聲紋、影像、帳號、行為分析等資料,以供後續會議持續使用,持續自動增進系統的完善性。於一些實施例中,所述四種模態包含影像分支、聲音分支、文字分支以及帳號分支。於一些實施例中,步驟S220對應於影像分支,並且步驟S230~S240對應於聲音分支。於一些實施例中,步驟S250~S260對應於文字分支,並且步驟S270對應於帳號分支。 Please refer to Figure 1 and Figure 2. Figure 2 is a schematic diagram of a meeting record generation method 200 according to some embodiments of the present disclosure. As shown in Figure 2, the meeting record generation method 200 includes steps S210 to S294. In some embodiments, steps S210 to S294 can be performed by the meeting record generation system 100 in Figure 1. In some embodiments, steps S210 to S294 can be performed by the server 140 in Figure 1. In some embodiments, the server 140 receives images, sounds, text inputs, and accounts collected by devices/equipment at each conference venue, and uses the four modalities to perform weighted correlation and integration to analyze the meeting situation record. After the meeting concludes, the system will send a meeting record to participants based on the above information and simultaneously update the profile database. This database stores personnel data and historical data, including voiceprints, images, accounts, behavioral analysis, and other data, for continued use in subsequent meetings, continuously and automatically improving the system's integrity. In some embodiments, the four modalities include an image branch, an audio branch, a text branch, and an account branch. In some embodiments, step S220 corresponds to the image branch, and steps S230-S240 correspond to the audio branch. In some embodiments, steps S250-S260 correspond to the text branch, and step S270 corresponds to the account branch.
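
To make the role of the profile database more concrete, the following is a minimal in-memory sketch of what such a store could look like. The class and field names (accounts, face/voice embeddings, history) are assumptions for illustration and are not the schema claimed by this disclosure.

```python
# Illustrative sketch only: a possible in-memory profile record and database.
# Field and method names are assumed for illustration; the actual schema is
# not specified in this disclosure.
from __future__ import annotations
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Profile:
    person_id: str
    accounts: set[str] = field(default_factory=set)          # conferencing accounts / emails
    face_embeddings: list[np.ndarray] = field(default_factory=list)
    voice_embeddings: list[np.ndarray] = field(default_factory=list)
    history: list[str] = field(default_factory=list)          # references to past meetings

class ProfileDatabase:
    def __init__(self) -> None:
        self._profiles: dict[str, Profile] = {}

    def get(self, person_id: str) -> Profile | None:
        return self._profiles.get(person_id)

    def upsert(self, profile: Profile) -> None:
        self._profiles[profile.person_id] = profile

    def add_voice_embedding(self, person_id: str, emb: np.ndarray) -> None:
        self._profiles.setdefault(person_id, Profile(person_id)).voice_embeddings.append(emb)

    def add_face_embedding(self, person_id: str, emb: np.ndarray) -> None:
        self._profiles.setdefault(person_id, Profile(person_id)).face_embeddings.append(emb)
```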

於步驟S210中,開始視訊會議。於一些實施例中,在各個會議場地中的與會者加入視訊會議後,開始視訊會議。 In step S210 , the video conference begins. In some embodiments, the video conference begins after participants in each conference venue join the video conference.

於步驟S220中,進行臉部辨識。於一些實施例中,臉部辨識任務可以由臉部辨識神經網路進行。於一些實施例中,臉部辨識任務可以通過主要架構為YOLOv5的臉部辨識神經網路進行。於另一些實施例中,臉部辨識任務可以由其他能辨識臉部特徵的神經網路進行,本揭示不以此為限。對視訊會議的影像訊號中的多個影像幀幅進行臉部偵測,以獲取各會議場地中的所有臉部邊界框,並且基於資料庫中的資料辨識這些臉部邊界框中的與會者身分,進而獲取臉部辨識結果。舉例而言,若一或多位與會者的臉部邊界框被辨識為已知身分,可在該些臉部邊界框標上這些對應與會者的身分。另一方面,若一或多位與會者的臉部邊界框辨識為未知身分,可在該些臉部邊界框標上身分未知。於一些實施例中,上述的部分與會者可以是在相同會議場地通過相同帳號進行視訊會議,並且另一部份與會者是在相異會議場地通過相異帳號進行視訊會議。 In step S220, facial recognition is performed. In some embodiments, the facial recognition task can be performed by a facial recognition neural network. In some embodiments, the facial recognition task can be performed by a facial recognition neural network whose main architecture is YOLOv5. In other embodiments, the facial recognition task can be performed by other neural networks that can recognize facial features, but the present disclosure is not limited thereto. Face detection is performed on multiple image frames in the video signal of the video conference to obtain all facial bounding boxes in each conference venue, and the identities of the participants in these facial bounding boxes are identified based on the data in the database to obtain facial recognition results. For example, if the facial bounding boxes of one or more participants are recognized as known identities, the identities of the corresponding participants may be labeled on those facial bounding boxes. On the other hand, if the facial bounding boxes of one or more participants are recognized as unknown identities, the identities of those participants may be labeled as unknown. In some embodiments, some of the aforementioned participants may be participating in a video conference at the same conference venue using the same account, while other participants may be participating in a video conference at different conference venues using different accounts.
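
A minimal sketch of the matching part of this step is shown below: detected face bounding boxes are labeled as a known identity or as unknown by comparing face embeddings against those stored in the database. The embedding extractor, the cosine-similarity metric, and the threshold value are assumptions for illustration; the disclosure itself only names YOLOv5 as one possible detector architecture.

```python
# Illustrative sketch only: label detected faces as known or unknown by cosine
# similarity against embeddings stored in the profile database. The detector,
# the embedding function, and the threshold are assumptions for illustration.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def identify_faces(face_boxes, face_embeddings, database, threshold=0.6):
    """face_boxes: list of (x1, y1, x2, y2); face_embeddings: matching list of vectors.
    database: iterable of (person_id, stored_embedding). Returns one result per box."""
    results = []
    for box, emb in zip(face_boxes, face_embeddings):
        best_id, best_score = "unknown", threshold
        for person_id, stored in database:
            score = cosine(emb, stored)
            if score > best_score:
                best_id, best_score = person_id, score
        results.append({"box": box, "identity": best_id, "score": best_score})
    return results
```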

於步驟S230中,進行聲源定位。於一些實施例中,同一會議場地中的多位與會者通過相同帳號與其他場地的與會者進行視訊會議,則該多個與會者的身分無法直接通過會議帳號得知。因此,在這樣多人的會議場地中,能夠通過陣列式麥克風113進行聲源定位,從而獲得聲源定位資訊,所述聲源定位資訊包含一會議場地內在會議期間中的發言者各自的角度及/或方向。 In step S230, sound source localization is performed. In some embodiments, multiple participants in the same conference venue use the same account to conduct the video conference with participants at other venues, so the identities of these participants cannot be directly determined from the conference account. Therefore, in such a multi-person conference venue, sound source localization can be performed through the microphone array 113 to obtain the sound source localization information, which includes the angle and/or direction of each speaker in the conference venue during the meeting.

於步驟S240中,進行聲音辨識。於一些實施例中,聲音辨識可以是聲紋辨識任務,所述聲紋辨識任務可以由聲紋辨識神經網路進行。於一些實施例中,聲紋辨識任務可以通過基於PyTorch的機器學習架構的聲紋辨識神經網路的運作進行。於另一些實施例中,聲紋辨識可以通過基於pyannote.audio的神經建構模塊或其他能夠辨識聲紋的神經網路進行,本揭示不以此為限。於一些實施例中,基於資料庫資料對視訊會議的語音訊號進行聲紋辨識,以辨識語音訊號中的各語音片段的發言者的身分為已知或未知,並基於時間序列/時間標記,將辨識出的身分標記至相應的語音片段,進而獲取聲紋辨識結果。 In step S240, voice recognition is performed. In some embodiments, the voice recognition may be a voiceprint recognition task performed by a voiceprint recognition neural network. In some embodiments, the voiceprint recognition task may be performed by a voiceprint recognition neural network built on a PyTorch-based machine learning framework. In other embodiments, voiceprint recognition may be performed through neural building blocks based on pyannote.audio or other neural networks capable of recognizing voiceprints; the present disclosure is not limited thereto. In some embodiments, voiceprint recognition is performed on the voice signal of the video conference based on the database data to identify whether the speaker of each speech segment in the voice signal is known or unknown, and the identified identity is tagged to the corresponding speech segment based on the time sequence/time stamps, thereby obtaining the voiceprint recognition results.
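
In the same spirit, a minimal sketch of tagging each speech segment with a known or unknown identity by comparing voiceprint embeddings against enrolled ones is shown below. The embedding extractor (for example, a pyannote.audio or PyTorch model) and the threshold are assumptions for illustration.

```python
# Illustrative sketch only: tag each speech segment with a known identity or
# "unknown" by comparing its voiceprint embedding against enrolled embeddings.
# The embedding extractor and the threshold are assumptions for illustration.
import numpy as np

def identify_segments(segments, enrolled, threshold=0.7):
    """segments: list of dicts {"start": float, "end": float, "embedding": np.ndarray}.
    enrolled: dict mapping person_id -> enrolled voiceprint embedding."""
    results = []
    for seg in segments:
        emb = seg["embedding"]
        best_id, best_score = "unknown", threshold
        for person_id, ref in enrolled.items():
            score = float(np.dot(emb, ref) /
                          (np.linalg.norm(emb) * np.linalg.norm(ref) + 1e-12))
            if score > best_score:
                best_id, best_score = person_id, score
        results.append({"start": seg["start"], "end": seg["end"],
                        "identity": best_id, "score": best_score})
    return results
```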

於一些實施例中,於步驟S240中所獲取的聲紋辨識結果以及於步驟S220中所獲取的臉部辨識結果包含多個已知身分以及至少一未知身分。 In some embodiments, the voiceprint recognition results obtained in step S240 and the facial recognition results obtained in step S220 include multiple known identities and at least one unknown identity.

於一些實施例中,於步驟S240中所獲取的聲紋辨識結果包含多個未知身分,並且於步驟S220中所獲取的臉部辨識結果包含多個已知身分。於一些實施例中,於步驟S240中所獲取的聲紋辨識結果為多個未知身分,並且於步驟S220中所獲取的臉部辨識結果為多個已知身分。 In some embodiments, the voiceprint recognition results obtained in step S240 include multiple unknown identities, and the facial recognition results obtained in step S220 include multiple known identities. In some embodiments, the voiceprint recognition results obtained in step S240 are multiple unknown identities, and the facial recognition results obtained in step S220 are multiple known identities.

於一些實施例中,於步驟S240中所獲取的聲紋辨識結果包含多個已知身分,並且於步驟S220中所獲取的臉部辨識結果為多個未知身分。於一些實施例中,於步驟S240中所獲取的聲紋辨識結果為多個已知身分,並且於步驟S220中所獲取的臉部辨識結果為多個未知身分。 In some embodiments, the voiceprint recognition results obtained in step S240 include multiple known identities, and the facial recognition results obtained in step S220 include multiple unknown identities. In some embodiments, the voiceprint recognition results obtained in step S240 are multiple known identities, and the facial recognition results obtained in step S220 are multiple unknown identities.

於步驟S281中,進行發言者匹配。於一些實施例中,利用於步驟S230中獲取的聲源角度,將聲紋辨識結果與臉部辨識結果匹配,從而獲取在一些時間標記下的發言者身分。於一些實施例中,麥克風陣列的位置為已知,通過聲源角度能夠將聲紋辨識結果與臉部辨識結果比對並連結。於一些實施例中,臉部辨識結果包含視訊會議中的影像幀中的臉部邊界框的大小、位置以及該些臉部邊界框各自對應的已知身分或未知身分。 In step S281, speaker matching is performed. In some embodiments, the sound source angle obtained in step S230 is used to match the voiceprint recognition results with the facial recognition results to obtain the speaker's identity at certain time stamps. In some embodiments, the position of the microphone array is known, and the voiceprint recognition results can be compared and linked with the facial recognition results based on the sound source angle. In some embodiments, the facial recognition results include the size and position of the facial bounding boxes in the video frame of the video conference, as well as the known or unknown identities corresponding to each facial bounding box.
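
A minimal sketch of this matching step is given below: the face whose horizontal position in the frame best agrees with the estimated sound source angle is selected. The camera field of view and the linear pixel-to-angle mapping are simplifying assumptions for illustration.

```python
# Illustrative sketch only: pick the face whose horizontal position best matches
# the sound source angle. The camera field of view and the linear mapping from
# pixel position to angle are simplifying assumptions for illustration.
def match_speaker_to_face(source_angle_deg, face_results, frame_width, fov_deg=90.0):
    """face_results: output of identify_faces(); returns the best-matching face or None."""
    best_face, best_err = None, None
    for face in face_results:
        x1, _, x2, _ = face["box"]
        center_x = (x1 + x2) / 2.0
        # Map the pixel column linearly onto [-fov/2, +fov/2] degrees (assumption).
        face_angle = (center_x / frame_width - 0.5) * fov_deg
        err = abs(face_angle - source_angle_deg)
        if best_err is None or err < best_err:
            best_face, best_err = face, err
    return best_face
```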

在一些實施例中,若在一時間標記下的聲紋辨識結果為未知身分且在該時間標記下的臉部辨識結果為已知身分,則可通過聲源角度比對與會者的臉部邊界框的位置,從而識別發言者身分,並且將該發言者的聲紋特徵新增至資料庫。舉例而言,若資料庫不包含一位與會者的聲紋特徵,但包含這位與會者的臉部特徵及其對應身分,則通過聲源角度可比對影像中的臉部邊界框的位置,進而根據標記在臉部邊界框上的身分將這位與會者的聲紋資料/特徵新增至資料庫,以更新這位與會者的個人資料。在另一些實施例中,若在一時間標記下的聲紋辨識結果為已知身分且在該時間標記下的臉部辨識結果為未知身分,則可通過聲源角度比對影像中之發言者的位置,從而識別發言者的臉部邊界框,並且將該發言者的臉部特徵新增至資料庫。舉例而言,若資料庫不包含一位與會者的臉部特徵,但包含這位與會者的聲紋特徵及對應身分,則通過聲源角度可比對影像中的臉部邊界框位置,進而根據標記在語音片段上的身分及時間標記,將這位與會者的臉部資料/特徵新增至資料庫,以更新這位與會者的個人資料。在再一些實施例中,若在一時間標記下的聲紋辨識結果以及臉部辨識結果都為已知身分,則可通過聲源角度,對聲紋辨識結果以及臉部辨識結果進行加權關聯,進而提高發言者識別的準確度。 In some embodiments, if the voiceprint recognition result at a time stamp indicates an unknown identity and the facial recognition result at the same time stamp indicates a known identity, the speaker's identity can be identified by comparing the position of the participant's facial bounding box with the sound source angle, and the speaker's voiceprint features can be added to the database. For example, if the database does not contain a participant's voiceprint features but does contain the participant's facial features and their corresponding identity, the sound source angle can be used to compare the position of the facial bounding box in the image, and the participant's voiceprint data/features can be added to the database according to the identity marked on the facial bounding box, thereby updating the participant's personal profile. In other embodiments, if the voiceprint recognition result at a time stamp indicates a known identity and the facial recognition result at the same time stamp indicates an unknown identity, the speaker's facial bounding box can be identified by comparing the speaker's position in the image with the sound source angle, and the speaker's facial features can be added to the database. For example, if the database does not contain a participant's facial features but does contain the participant's voiceprint features and the corresponding identity, the sound source angle can be used to compare the facial bounding box position in the image, and the participant's facial data/features can be added to the database according to the identity and time stamp tagged to the speech segment, thereby updating the participant's personal profile. In still other embodiments, if both the voiceprint recognition result and the facial recognition result at a time stamp are of known identities, the voiceprint recognition result and the facial recognition result can be weighted and correlated based on the sound source angle, thereby improving the accuracy of speaker identification.
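
The three cases described above could be folded into a single resolution routine such as the following sketch, which reuses the hypothetical ProfileDatabase from the earlier sketch. The helper names and the simple score comparison used in place of the weighted association are assumptions for illustration, not the claimed implementation.

```python
# Illustrative sketch only: update the profile database according to the three
# cases described above. Helper names and the score comparison are assumptions.
def resolve_and_update(voice_result, face_result, voice_embedding, face_embedding, db):
    voice_id, face_id = voice_result["identity"], face_result["identity"]
    if voice_id == "unknown" and face_id != "unknown":
        # Face is known: adopt its identity and enroll the new voiceprint.
        db.add_voice_embedding(face_id, voice_embedding)
        return face_id
    if voice_id != "unknown" and face_id == "unknown":
        # Voiceprint is known: adopt its identity and enroll the new face.
        db.add_face_embedding(voice_id, face_embedding)
        return voice_id
    if voice_id != "unknown" and face_id != "unknown":
        # Both known: keep the hypothesis with the stronger recognition score.
        return voice_id if voice_result["score"] >= face_result["score"] else face_id
    return "unknown"
```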

於步驟S250,進行語音轉文本識別。於一些實施例中,對語音訊號中的多個語音段進行語音轉文本識別,以獲取文字記錄。於一些實施例中,語音轉文本識別任務可以由卷積神經網路進行。於一些實施例中,語音轉文本識別任務可以由主要架構為Whisper的神經網路進行。於另一些實施例中,語音轉文本識別任務可以通過Maestro或其他能夠進行語音轉文本識別的神經網路進行,本揭示不以此為限。 In step S250, speech-to-text recognition is performed. In some embodiments, speech-to-text recognition is performed on multiple speech segments in the voice signal to obtain a transcript. In some embodiments, the speech-to-text recognition task may be performed by a convolutional neural network. In some embodiments, the speech-to-text recognition task may be performed by a neural network whose main architecture is Whisper. In other embodiments, the speech-to-text recognition task may be performed by Maestro or other neural networks capable of speech-to-text recognition; the present disclosure is not limited thereto.
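
A minimal sketch of this step using the open-source openai-whisper package is shown below; choosing this particular package and model size is an assumption for illustration, since the disclosure only names Whisper as one possible architecture. Each returned segment keeps its start and end times so that it can later be matched to a speaker.

```python
# Illustrative sketch only: segment-level speech-to-text with openai-whisper
# (an assumed choice; the disclosure names Whisper only as one possible model).
import whisper  # pip install openai-whisper

def transcribe(audio_path: str):
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    # Keep start/end so that each sentence can later be matched to a speaker.
    return [{"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
            for seg in result["segments"]]
```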

於步驟S260,獲取文字輸入。於一些實施例中,獲取與會者通過鍵盤111、126、滑鼠112或(觸控)顯示器123輸入的文字。 In step S260, text input is obtained. In some embodiments, text input by the participant via keyboard 111, 126, mouse 112, or (touch) display 123 is obtained.

於步驟S270,獲取用戶配置。於一些實施例中,可獲取視訊會議中各用戶配置的文字輸入。 In step S270, user configuration is obtained. In some embodiments, text input configured by each user in the video conference may be obtained.

於步驟S282,進行上下文匹配。於一些實施例中,依據時間序列,將文字輸入插入文字記錄以獲取更新後文字記錄。 In step S282, context matching is performed. In some embodiments, the text input is inserted into the text record based on the time sequence to obtain an updated text record.

於步驟S290,進行情境理解。於一些實施例中,將發言者身分標註至文字記錄/更新後文字記錄,以獲取文本,並且對文本進行情境理解,藉此於步驟S293產生並獲取會議記錄。於一些實施例中,步驟S290可接續步驟S291或者是步驟S293。 In step S290, context understanding is performed. In some embodiments, the speaker identities are tagged to the transcript/updated transcript to obtain a text, and context understanding is performed on the text, whereby the meeting record is generated and obtained in step S293. In some embodiments, step S290 may be followed by step S291 or step S293.

於步驟S291,更新配置資料庫。於一些實施例中,可利用於步驟S281獲得的發言者匹配結果更新資料庫中的個人資料(例如,臉部特徵、聲紋特徵、帳號資料及/或行為分析)。於一些實施例中,可利用步驟S290獲得的文本更新資料庫中的會議資料,以供後續會議持續使用。 In step S291, the configuration database is updated. In some embodiments, the speaker matching results obtained in step S281 can be used to update the personal data in the database (e.g., facial features, voiceprint features, account data, and/or behavioral analysis). In some embodiments, the text obtained in step S290 can be used to update the meeting data in the database for continued use in subsequent meetings.

於步驟S292,視訊會議結束。 In step S292, the video conference ends.

於步驟S294,將會議記錄發送給所有與會者。 In step S294, the meeting minutes are sent to all participants.

請參閱第1圖、第2圖以及第3圖,第3圖為依據本揭露一些實施例之會議記錄生成方法200的示意圖。會議記錄生成方法200包含步驟S210。於一些實施例中,第3圖中的所有步驟可以由第1圖的會議記錄生成系統100執行。於一些實施例中,第3圖中的所有步驟可以由第1圖的伺服器140執行。於一些實施例中,步驟S212以及S222~S229對應於影像分支,並且步驟S214、S230及S242~S249對應於聲音分支。於一些實施例中,步驟S250及S262對應於文字分支,並且步驟S216、S272~S278對應於帳號分支。於一些實施例中,第2圖的步驟S220中的臉部辨識包含步驟S222~S229,並且第2圖的步驟S240中的聲音辨識包含步驟S242~S249。於一些實施例中,第2圖的步驟S260包含步驟S262,並且第2圖的步驟S270包含步驟S272~S278。 Please refer to FIG. 1, FIG. 2, and FIG. 3. FIG. 3 is a schematic diagram of the meeting record generation method 200 according to some embodiments of the present disclosure. The meeting record generation method 200 includes step S210. In some embodiments, all the steps in FIG. 3 can be performed by the meeting record generation system 100 of FIG. 1. In some embodiments, all the steps in FIG. 3 can be performed by the server 140 of FIG. 1. In some embodiments, steps S212 and S222-S229 correspond to the image branch, and steps S214, S230, and S242-S249 correspond to the audio branch. In some embodiments, steps S250 and S262 correspond to the text branch, and steps S216 and S272-S278 correspond to the account branch. In some embodiments, the facial recognition in step S220 of FIG. 2 includes steps S222-S229, and the voice recognition in step S240 of FIG. 2 includes steps S242-S249. In some embodiments, step S260 of FIG. 2 includes step S262, and step S270 of FIG. 2 includes steps S272-S278.

於步驟S212中,判斷攝影機是否開啟。於一些實施例中,若攝影機開啟,則能拍攝一會議場地的影像並傳送至伺服器進行串流,進而獲取視訊會議的影像訊號SVID。若攝影機未開啟,則於臉部辨識分支中判定該會議場地的人員身分為未知。 In step S212 , it is determined whether the camera is turned on. In some embodiments, if the camera is turned on, it can capture images of the conference room and transmit them to the server for streaming, thereby obtaining the video conference image signal S VID . If the camera is not turned on, the identity of the person at the conference room is determined to be unknown in the facial recognition branch.

於步驟S222,進行臉部偵測。於一些實施例中,對影像訊號SVID的影像幀幅進行臉部偵測,以產生臉部邊界框以及臉部邊界框中的臉部特徵FTFACEIn step S222 , face detection is performed. In some embodiments, face detection is performed on the image frame of the image signal S VID to generate a face bounding box and facial features FT FACE within the face bounding box.

於步驟S224中,進行特徵匹配。於一些實施例中,將臉部邊界框中的臉部特徵FTFACE與資料庫中的臉部特徵資料進行匹配。 In step S224, feature matching is performed. In some embodiments, the facial features FT FACE in the face bounding box are matched with facial feature data in a database.

於步驟S226中,判斷臉部邊界框中的臉部特徵FTFACE是否與資料庫中的臉部特徵資料匹配。若能匹配,執行步驟S228,顯示身分。若不能匹配,執行步驟S229,於臉部辨識分支中,判定身分為未知。藉此,獲取臉部邊界框中的臉部辨識結果RFACE。臉部辨識結果RFACE為各臉部邊界框的已知身分及/或未知身分。 In step S226, it is determined whether the facial features FT FACE within the facial bounding box match the facial feature data in the database. If they match, step S228 is executed to display the identity. If they do not match, step S229 is executed, and in the face recognition branch the identity is determined to be unknown. Thereby, the facial recognition result R FACE for the facial bounding box is obtained. The facial recognition result R FACE is the known identity and/or unknown identity of each facial bounding box.

於步驟S230中,進行聲源定位,以獲取聲源角度ANGSDIn step S230, sound source localization is performed to obtain the sound source angle ANG SD .

於步驟S214中,判斷麥克風是否開啟。於一些實施例中,若麥克風開啟,則能對一會議場地的聲音進行收音並傳送至伺服器進行串流,進而獲取視訊會議的語音訊號SAUD。若麥克風未開啟,則於聲紋辨識分支中判定該會議場地的人員身分為未知。 In step S214, it is determined whether the microphone is on. In some embodiments, if the microphone is on, it can pick up audio from the conference venue and transmit it to the server for streaming, thereby obtaining the voice signal S AUD of the video conference. If the microphone is not on, the identities of the people at the conference venue are determined to be unknown in the voiceprint recognition branch.

於步驟S242中,提取聲紋。於一些實施例中,提取語音訊號SAUD中語音片段的聲紋特徵FTSDIn step S242, a voiceprint is extracted. In some embodiments, a voiceprint feature FT SD of a speech segment in the speech signal S AUD is extracted.

於步驟S244中,進行發言者的身分辨識。於一些實施例中,將語音片段的聲紋特徵FTSD與儲存在資料庫中的聲紋特徵進行匹配。 In step S244, the speaker's identity is identified. In some embodiments, the voiceprint features FTSD of the speech segment are matched with voiceprint features stored in a database.

於步驟S246中,判斷語音片段的聲紋特徵FTSD是否與儲存在資料庫中的聲紋特徵匹配。若能匹配,執行步驟S248,顯示身分。若不能匹配,執行步驟S249,於聲紋辨識分支中,判定身分為未知。藉此,獲取語音片段的聲紋辨識結果RSD。聲紋辨識結果RSD為各語音片段的已知身分及/或未知身分。 In step S246, it is determined whether the voiceprint features FT SD of the speech segment match the voiceprint features stored in the database. If they match, step S248 is executed to display the identity. If they do not match, step S249 is executed, and in the voiceprint recognition branch the identity is determined to be unknown. Thereby, the voiceprint recognition result R SD of the speech segment is obtained. The voiceprint recognition result R SD is the known identity and/or unknown identity of each speech segment.

於步驟S250中,進行語音轉文字識別。於一些實施例中,對語音訊號中的語音片段進行語音轉文字識別,以獲取文字記錄TRX。 In step S250, speech-to-text recognition is performed. In some embodiments, speech-to-text recognition is performed on speech segments in the speech signal to obtain a text record TRX.

於步驟S216中,判斷是否接收輸入帳號。於一些實施例中,獲取加入視訊會議的帳號。 In step S216, it is determined whether the input account number is received. In some embodiments, the account number for joining the video conference is obtained.

於步驟S272中,判斷加入視訊會議的帳號是否能與儲存在資料庫中的帳號資料匹配。若使用者使用已知身份之帳號加入會議,如視訊會議軟體的帳號或已知信箱,此分支便會利用帳號資訊與資料庫資料進行比對,若是資料庫裡有該帳戶資料,則將帳號身份標記為資料庫裡顯示之身份,若沒有則標記為未知。識別出的帳號身份傳至文字處理分支以便後續標示。若匹配,執行步驟S274,顯示身分。若不匹配,執行步驟S276,判定身分為未知。 In step S272, it is determined whether the account joining the video conference matches the account data stored in the database. If the user joins the conference with an account of known identity, such as a video conferencing software account or a known e-mail address, this branch compares the account information with the database data: if the account data exists in the database, the account identity is marked as the identity shown in the database; otherwise, it is marked as unknown. The identified account identity is passed to the text-processing branch for subsequent labeling. If they match, step S274 is executed to display the identity. If they do not match, step S276 is executed and the identity is determined to be unknown.
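
A minimal sketch of this account branch is shown below: the joining account is looked up in the stored profile data and either resolved to a known identity or marked as unknown, and the resolved identity is attached to that user's typed text entries so the text-processing branch can label them. The dictionary-based lookup and field names are assumptions for illustration.

```python
# Illustrative sketch only: resolve joining accounts against stored profile data
# and attach the resolved identity to typed text entries. The lookup structure
# and field names are assumptions for illustration.
def resolve_account(account: str, account_directory: dict) -> str:
    """account_directory maps an account or e-mail address to a stored person_id."""
    return account_directory.get(account, "unknown")

def label_text_inputs(text_inputs: list, account_directory: dict) -> list:
    """text_inputs: [{"time": float, "account": str, "text": str}, ...]."""
    return [{**entry, "identity": resolve_account(entry["account"], account_directory)}
            for entry in text_inputs]
```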

於步驟S278,進行配置識別。識別加入視訊會議的帳號(例如,視訊會議軟體的帳號或信箱)的帳號資料RPROF(例如,人員身分、公司名稱等相關資料)。 In step S278, configuration identification is performed to identify the account information R PROF (e.g., personal identity, company name, etc.) of the account (e.g., video conferencing software account or mailbox) that joins the video conference.

於步驟S262中,判斷是否接收文字輸入。若否,直接將文字記錄TRX輸出且接續步驟S293。若有接收文字輸入,執行步驟S282。 In step S262, determine whether text input has been received. If not, directly output the text record TRX and proceed to step S293. If text input has been received, execute step S282.

於步驟S282,進行上下文匹配。於一些實施例中,根據時間序列,將對應於帳號資料RPROF的文字輸入插入文字記錄TRX,以產生更新後文字記錄,並且根據時間標記,將發言者身分標註至更新後文字記錄,以獲取文本。 In step S282, context matching is performed. In some embodiments, the text input corresponding to the account data R PROF is inserted into the text record TRX according to the time sequence to generate an updated text record, and the speaker identity is annotated to the updated text record according to the time stamp to obtain text.
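
A minimal sketch of step S282 is shown below: speech segments from the audio branch and typed entries from the text/account branch are interleaved by timestamp, and each line is prefixed with its speaker identity, producing the text handed to context understanding. Field names follow the earlier sketches and are assumptions for illustration.

```python
# Illustrative sketch only: interleave typed text entries with the spoken
# transcript by timestamp, then prefix every entry with its speaker identity.
def build_context(transcript, text_inputs):
    """transcript: [{"start", "end", "text", "identity"}, ...] — speech segments
    already tagged with speaker identities by the earlier matching step.
    text_inputs: [{"time", "text", "identity"}, ...] from the account/text branch."""
    merged = ([{"time": seg["start"], "identity": seg["identity"], "text": seg["text"]}
               for seg in transcript] +
              [{"time": t["time"], "identity": t["identity"], "text": t["text"]}
               for t in text_inputs])
    merged.sort(key=lambda entry: entry["time"])
    return "\n".join(f'[{e["time"]:.1f}s] {e["identity"]}: {e["text"]}' for e in merged)
```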

於步驟S293,進行情境理解。於一些實施例中,對文本進行情境理解,以獲取會議記錄。於一些實施例中,對文本進行情境理解,以獲取文本中多個語句的多個情緒語意,並且根據該些情緒語意,移除文本中的部分內容以產生一更新後文本。於一些實施例中,對更新後文本進行摘要提取,以獲取會議記錄。 In step S293, context understanding is performed. In some embodiments, context understanding is performed on the text to obtain a meeting transcript. In some embodiments, context understanding is performed on the text to obtain multiple emotional meanings of multiple sentences in the text, and based on these emotional meanings, some content in the text is removed to generate an updated text. In some embodiments, summary extraction is performed on the updated text to obtain a meeting transcript.
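
A minimal sketch of the filtering part of this step is shown below: sentences whose emotion/semantic label marks them as irrelevant (for example, joking) are dropped before the remaining text is handed to a summarizer. The label names and the summarize() hook are assumptions for illustration; the disclosure leaves the concrete context-understanding and summarization models open.

```python
# Illustrative sketch only: drop sentences labeled as irrelevant, then summarize.
# The label set and the summarize() callable are assumptions for illustration.
IRRELEVANT_LABELS = {"joking", "small_talk"}  # assumed label names

def filter_and_summarize(sentences, summarize):
    """sentences: [{"identity": str, "text": str, "emotion": str}, ...].
    summarize: any callable that maps the cleaned text to meeting minutes."""
    kept = [s for s in sentences if s["emotion"] not in IRRELEVANT_LABELS]
    cleaned_text = "\n".join(f'{s["identity"]}: {s["text"]}' for s in kept)
    return summarize(cleaned_text)
```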

請參閱第4圖。第4圖為依據本揭露一些實施例之生成會議記錄的操作的示意圖。於一些實施例中,步驟S290的情境理解操作可由情境處理模組實施。情境處理模組結合先前辨識出的身分、情緒語意及上下文,對會議內容進行一次整體情境的會議重點摘要,並標示對應的發言人身分。產生之會議記錄在會議後可利用自動寄送給所有 識別出身分之與會者。於一些實施例中,情境處理模組可以由摘要提取神經網路模型實施。於一些實施例中,情境處理模組可以由語境理解神經網路模型實施。於一些實施例中,情境處理模組可以由影像與聲音的情緒偵測的跨模態神經網路實施。於一些實施例中,情境處理模組可以由能夠實現多人對話語言解析任務的神經網路模型實施。因此,本揭示不以此為限。 Please refer to Figure 4. Figure 4 is a schematic diagram illustrating the process of generating a meeting transcript according to some embodiments of the present disclosure. In some embodiments, the context understanding operation in step S290 may be implemented by a context processing module. The context processing module combines previously identified identities, emotional semantics, and context to create a comprehensive summary of the meeting's key points and identify the corresponding speakers. The generated meeting transcript can be automatically emailed to all identified attendees after the meeting. In some embodiments, the context processing module may be implemented by a summary extraction neural network model. In some embodiments, the context processing module may be implemented by a context understanding neural network model. In some embodiments, the context processing module may be implemented by a cross-modal neural network that detects emotions from images and sounds. In some embodiments, the context processing module can be implemented by a neural network model capable of performing multi-person conversation language parsing tasks. Therefore, the present disclosure is not limited thereto.

如第4圖所示,於步驟S290中,通過與會者的聲音特徵與臉部特徵410,可辨識文字記錄420中的語句421~426的發言者及情緒語意431~436,以產生文本430。於步驟S293中,會議記錄440包含會議標題441以及對應於發言者身分的行動項目442。並且開玩笑的語意可自文字記錄420中移除,使得會議記錄440中不重要的內容減少,以改善會議記錄440的品質。 As shown in FIG. 4 , in step S290 , the speaker and emotional meaning 431 - 436 of sentences 421 - 426 in the text transcript 420 are identified using the participant's voice and facial features 410 to generate text 430 . In step S293 , the meeting transcript 440 includes a meeting title 441 and action items 442 corresponding to the speaker's identity. Furthermore, joking content is removed from the text transcript 420 , reducing unimportant content and improving the quality of the transcript 440 .

綜上所述,本揭示的會議記錄生成系統100以及會議記錄生成方法200通過將臉部特徵與聲紋特徵匹配,能夠改善資料庫中不具有聲紋特徵資料的情況。如此,在有臉部資料且不具有聲紋資料的情況下仍能辨識同一會議場域中的不同發言者身分。進一步而言,通過將發言者身分標記至文字記錄所產生的文本,能夠更容易基於多人對話進行情境理解,進而產生品質較佳的會議記錄。 In summary, the disclosed meeting transcript generation system 100 and method 200 can improve the situation where voiceprint data is not available in the database by matching facial features with voiceprint features. This allows different speakers in the same meeting to be identified even when facial data is available but voiceprint data is not. Furthermore, by tagging speaker identities to the generated text, contextual understanding of multi-person conversations is facilitated, resulting in higher-quality meeting transcripts.

雖然本揭露已以實施方式揭露如上,然其並非用以限定本揭露,任何本領域具有通常知識者,在不脫離本揭露之精神和範圍內,當可作各種之更動與潤飾,因此本揭露之保護範圍當視後附之申請專利範圍所界定者為準。 Although the present disclosure has been disclosed above in terms of embodiments, it is not intended to limit the present disclosure. Anyone having ordinary skill in the art may make various modifications and refinements without departing from the spirit and scope of the present disclosure. Therefore, the scope of protection of the present disclosure shall be defined by the appended claims.

100:會議記錄生成系統 100: Meeting Minutes Generation System

110:外部裝置 110: External device

111:鍵盤 111:Keyboard

112:滑鼠 112: Mouse

113:陣列式麥克風 113: Microphone Array

114:攝影機 114: Camera

120:電子裝置 120: Electronic devices

121:處理器 121: Processor

122:記憶體裝置 122: Memory device

123:(觸控)顯示器 123:(Touch) Display

124:聲音感測器 124: Sound Sensor

125:影像感測器 125: Image sensor

126:鍵盤 126:Keyboard

130:網路 130: Network

140:伺服器 140: Server

Claims (9)

1. 一種會議記錄生成方法,包含:獲取一視訊會議中的一視訊訊號、一語音訊號以及一聲源定位資訊;對該視訊訊號中的複數個影像幀進行臉部辨識,以獲取複數個臉部辨識結果;對該語音訊號中的複數個語音段進行聲紋辨識,以獲取在複數個時間標記下的複數個聲紋辨識結果;根據該聲源定位資訊,將該些聲紋辨識結果與該些臉部辨識結果匹配,以獲取在該些時間標記下的複數個發言者身分;對該語音訊號中的該些語音段進行語音轉文本識別,以獲取一文字記錄;獲取至少一用戶配置的一文字輸入;根據時間序列,將該文字輸入插入該文字記錄,以產生一更新後文字記錄;根據該些時間標記,將該些發言者身分標註至該更新後文字記錄,以獲取一文本;以及對該文本進行情境理解,以獲取一會議記錄。 A method for generating a meeting record, comprising: obtaining a video signal, a voice signal, and sound source localization information of a video conference; performing facial recognition on a plurality of image frames in the video signal to obtain a plurality of facial recognition results; performing voiceprint recognition on a plurality of voice segments in the voice signal to obtain a plurality of voiceprint recognition results at a plurality of time stamps; matching the voiceprint recognition results with the facial recognition results according to the sound source localization information to obtain a plurality of speaker identities at the time stamps; performing speech-to-text recognition on the voice segments in the voice signal to obtain a transcript; obtaining a text input configured by at least one user; inserting the text input into the transcript according to a time sequence to generate an updated transcript; tagging the speaker identities to the updated transcript according to the time stamps to obtain a text; and performing context understanding on the text to obtain a meeting record.

2. 如請求項1所述之會議記錄生成方法,其中該些聲紋辨識結果以及該些臉部辨識結果包含複數個已知身分以及至少一未知身分。 The method for generating a meeting record as described in claim 1, wherein the voiceprint recognition results and the facial recognition results include a plurality of known identities and at least one unknown identity.

3. 如請求項1所述之會議記錄生成方法,其中該些聲紋辨識結果包含複數個未知身分,並且其中該些臉部辨識結果包含複數個已知身分。 The method for generating a meeting record as described in claim 1, wherein the voiceprint recognition results include a plurality of unknown identities, and wherein the facial recognition results include a plurality of known identities.

4. 如請求項3所述之會議記錄生成方法,更包含:根據該聲源定位資訊以及該些臉部辨識結果,判斷該些已知身分中的該些發言者身分;以及根據該些時間標記,根據該些發言者身分更新該些聲紋辨識結果的該些未知身分。 The method for generating a meeting record as described in claim 3, further comprising: determining the speaker identities among the known identities according to the sound source localization information and the facial recognition results; and updating the unknown identities of the voiceprint recognition results with the speaker identities according to the time stamps.

5. 如請求項1所述之會議記錄生成方法,其中該些聲紋辨識結果包含複數個已知身分,並且其中該些臉部辨識結果包含複數個未知身分。 The method for generating a meeting record as described in claim 1, wherein the voiceprint recognition results include a plurality of known identities, and wherein the facial recognition results include a plurality of unknown identities.

6. 如請求項1所述之會議記錄生成方法,其中該聲源定位資訊包含複數個聲源各自的一角度以及一方向中至少一者。 The method for generating a meeting record as described in claim 1, wherein the sound source localization information includes at least one of an angle and a direction of each of a plurality of sound sources.

7. 如請求項1所述之會議記錄生成方法,其中該臉部辨識結果包含該些影像幀中的臉部邊界框的位置以及該些臉部邊界框各自對應的一身分。 The method for generating a meeting record as described in claim 1, wherein the facial recognition results include positions of facial bounding boxes in the image frames and an identity corresponding to each of the facial bounding boxes.
8. 如請求項1所述之會議記錄生成方法,更包含:對該文本進行情境理解,以獲取該文本中複數個語句的複數個情緒語意;根據該些情緒語意,移除該文本中的部分內容,以產生一更新後文本;以及對該更新後文本進行摘要提取,以獲取該會議記錄。 The method for generating a meeting record as described in claim 1, further comprising: performing context understanding on the text to obtain a plurality of emotional semantics of a plurality of sentences in the text; removing part of the content in the text according to the emotional semantics to generate an updated text; and performing summary extraction on the updated text to obtain the meeting record.

9. 一種會議記錄生成系統,包含:一記憶體,用以儲存複數個指令以及資料;以及一處理器,電性連接該記憶體,該處理器用以存取該記憶體儲存的該些指令以及該資料以執行下列步驟:獲取一視訊會議中的一視訊訊號、一語音訊號以及一聲源定位資訊;對該視訊訊號中的複數個影像幀進行臉部辨識,以獲取複數個臉部辨識結果;對該語音訊號中的複數個語音段進行聲紋辨識,以獲取在複數個時間標記下的複數個聲紋辨識結果;根據該聲源定位資訊,將該些聲紋辨識結果與該些臉部辨識結果匹配,以獲取在該些時間標記下的複數個發言者身分;對該語音訊號中的該些語音段進行語音轉文本識別,以獲取一文字記錄;獲取至少一用戶配置的一文字輸入;根據時間序列,將該文字輸入插入該文字記錄,以產生一更新後文字記錄;根據該些時間標記,將該些發言者身分標註至該更新後文字記錄,以獲取一文本;以及對該文本進行情境理解,以獲取一會議記錄。 A meeting record generation system, comprising: a memory configured to store a plurality of instructions and data; and a processor electrically connected to the memory, the processor configured to access the instructions and the data stored in the memory to execute the following steps: obtaining a video signal, a voice signal, and sound source localization information of a video conference; performing facial recognition on a plurality of image frames in the video signal to obtain a plurality of facial recognition results; performing voiceprint recognition on a plurality of voice segments in the voice signal to obtain a plurality of voiceprint recognition results at a plurality of time stamps; matching the voiceprint recognition results with the facial recognition results according to the sound source localization information to obtain a plurality of speaker identities at the time stamps; performing speech-to-text recognition on the voice segments in the voice signal to obtain a transcript; obtaining a text input configured by at least one user; inserting the text input into the transcript according to a time sequence to generate an updated transcript; tagging the speaker identities to the updated transcript according to the time stamps to obtain a text; and performing context understanding on the text to obtain a meeting record.
TW113122571A 2024-06-18 Method and system for generating meeting minutes TWI900067B (en)

Publications (1)

Publication Number Publication Date
TWI900067B true TWI900067B (en) 2025-10-01

Family


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240073322A1 (en) 2020-06-20 2024-02-29 Science House LLC Systems, methods, and apparatus for virtual meetings

Similar Documents

Publication Publication Date Title
US12165653B2 (en) Matching speakers to meeting audio
Tao et al. End-to-end audiovisual speech recognition system with multitask learning
CN114667726B (en) Privacy-aware conference room transcription from audio-visual streams
US8791977B2 (en) Method and system for presenting metadata during a videoconference
CN112075075B (en) Method and computerized intelligent assistant for facilitating teleconferencing
US8630854B2 (en) System and method for generating videoconference transcriptions
US20110131144A1 (en) Social analysis in multi-participant meetings
CN110415704A (en) Trial record data processing method, device, computer equipment and storage medium
US20160329050A1 (en) Meeting assistant
WO2024032159A1 (en) Speaking object detection in multi-human-machine interaction scenario
CN117854507A (en) Speech recognition method, device, electronic equipment and storage medium
CN114065720A (en) Conference summary generation method and device, storage medium and electronic equipment
JP2019192092A (en) Conference support device, conference support system, conference support method, and program
TWI900067B (en) Method and system for generating meeting minutes
WO2021135140A1 (en) Word collection method matching emotion polarity
US20240430118A1 (en) Systems and Methods for Creation and Application of Interaction Analytics
US20250384406A1 (en) Method and system for generating meeting minutes
US20230230588A1 (en) Extracting filler words and phrases from a communication session
CN121151523A (en) Methods and systems for generating meeting minutes
US11799679B2 (en) Systems and methods for creation and application of interaction analytics
US20230066829A1 (en) Server device, conference assistance system, and conference assistance method
CN114339132A (en) Intelligent meeting minutes method, device and computer equipment for video conferencing
KR102896711B1 (en) System For Automating Parliamentary Broadcasting Based On NDI Using Situation-Aware Artificial Intelligence
CN120186144B (en) File transmission method and system based on artificial intelligence
CN115225849B (en) A resource scheduling method for video conferencing system based on cloud computing