WO2025147678A1 - Systems and methods for generating standardized reports using large language models - Google Patents
- Publication number
- WO2025147678A1 (PCT/US2025/010330)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- report
- standardized report
- standardized
- language model
- settings
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/55—Rule-based translation
- G06F40/56—Natural language generation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/186—Templates
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H15/00—ICT specially adapted for medical reports, e.g. generation or transmission thereof
Definitions
- the present invention relates generally to using large language models for generating standardized reports, and more specifically, to automating scribing in standardized formats from unstructured inputs using large language models.
- a method includes receiving one or more inputs from a client device; generating a transcript based on the one or more inputs from the client device; probing a large language model using first settings and the transcript to determine a procedure class; based on the procedure class, determining second settings for the large language model, the second settings including a standardized report format; probing the large language model with the second settings and the transcript; and providing a standardized report based on results from the large language model.
- a system includes one or more data processors; and a non-transitory computer-readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform operations including: receiving one or more inputs from a client device; generating a transcript based on the one or more inputs from the client device; probing a large language model using first settings and the transcript to determine a procedure class; based on the procedure class, determining second settings for the large language model, the second settings including a standardized report format; probing the large language model with the second settings and the transcript; and providing a standardized report based on results from the large language model.
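The claimed two-stage probing flow (classify the procedure with first settings, then re-probe with class-specific second settings that include a standardized report format) can be sketched as a minimal Python function. The `llm` callable, the `settings_by_class` mapping, and all names below are illustrative stand-ins for the claimed components, not the actual implementation:

```python
def generate_standardized_report(transcript, first_settings, settings_by_class, llm):
    """Two-stage probe: classify the procedure from the transcript,
    then re-probe with class-specific settings (which include the
    standardized report format) to produce the report."""
    # First probe: determine the procedure class.
    procedure_class = llm(first_settings, transcript).strip()
    # Select second settings based on the determined class.
    second_settings = settings_by_class[procedure_class]
    # Second probe: generate the report in the standardized format.
    return llm(second_settings, transcript)
```

In a real deployment `llm` would wrap a chat-completion API call; here it is injected so the control flow stands on its own.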
- FIG. 1 is a block diagram of a system for standardized automated scribing, according to certain aspects of the present disclosure.
- FIG. 2 is a process for standardized automated scribing, according to certain aspects of the present disclosure.
- FIG. 3 is a second block diagram of a system for standardized automated scribing, according to certain aspects of the present disclosure.
- FIG. 4A is a second process for standardized automated scribing, according to certain aspects of the present disclosure.
- FIG. 5 is an example implementation of the process of FIG. 4A, according to certain aspects of the present disclosure.
- FIG. 6 is an example implementation of the F n block 600 in FIG. 5 when generating a report, according to certain aspects of the present disclosure.
- FIG. 7 is an example implementation for revising a report in FIG. 5, according to certain aspects of the present disclosure.
- FIG. 8 is a second example implementation for the process of FIG. 4A, according to certain aspects of the present disclosure.
- FIG. 9A is a first page of an example report generated, according to certain aspects of the present disclosure.
- FIG. 9B is a second page of the example report of FIG. 9A.
- FIG. 10 is an example note generated for a referring physician, according to certain aspects of the present disclosure.
- FIG. 11 is an example note generated for a patient, according to certain aspects of the present disclosure.
- Standardized reports are used in many industries.
- the present disclosure will use the medical industry as an example to illustrate some advantages of some implementations of the present disclosure.
- summarizing findings of endoscopic procedures will be used as an example.
- endoscopists perform a single or multiple procedures on a single patient and are required to report their findings after completion in the form of an operative report.
- Generating a report involves either using a program that requires many clicks to complete the report or dictating findings via a voice-capture program after the procedure, which generates a report at a later time.
- the conventional methods of generating endoscopy reports have multiple limitations.
- the conventional methods require extensive memory and mental bandwidth of endoscopists.
- Endoscopists are tasked with memorizing endoscopic findings and maneuvers, which can be difficult to achieve with complex, long and multiple procedures on the same patient. This may also be difficult when an endoscopist has performed many procedures in the same day on different patients. Differentiating findings for each patient may be difficult.
- the endoscopist may sometimes not have time to complete the report after each procedure, requiring completion of multiple reports for different patients at a later time or the end of the day.
- Embodiments of the present disclosure use artificial intelligence (AI) for automatic generation of postoperative reports for endoscopic procedures.
- an audio recording program will record audio in real time during an endoscopic procedure.
- the endoscopist will verbally announce endoscopic findings and maneuvers.
- the endoscopist can also have regular conversations and communicate with staff in the room just like a usual endoscopic procedure.
- the endoscopist can stop the recording, for example, using a voice command like “stop”, “generate report”, etc.
- the audio capturing program will know to stop recording and send the recording to a large language model for generating a transcript of the audio recording and further generating an operative report based on the transcript along with a pre-filled prompt.
- the transcript can be generated by OpenAI's Whisper API.
- this transcript along with a pre-determined prompt can be sent to ChatGPT via OpenAI’s chat completion API to generate an operative report.
- the process from the “generate report” voice command to having a full report can take seconds to minutes.
- the report will be in the preferred format that is currently accepted in the gastroenterology field. In some implementations, it may also include the majority of the components needed to make a final report.
- Embodiments of the present disclosure provide improvements to computing systems. These improvements to computing systems allow computing systems to generate and provide more reliable reports when compared to conventional computing systems. These reports can be provided in the same format, regardless of the input method (e.g., text, voice, handwriting, etc.) and regardless of the format and organization of the input method. These reports can also be generated in real-time, incorporating edits provided in real-time.
- words of direction such as “top,” “bottom,” “left,” “right,” “above,” and “below” are intended to relate to the equivalent direction as depicted in a reference illustration; as understood contextually from the object(s) or element(s) being referenced, such as from a commonly used position for the object(s) or element(s); or as otherwise described herein.
- the system 100 includes a client device 104, a server 102, and a database 106.
- the server 102 has access to parameters of a large language model (LLM) (e.g., an LLM 108).
- the parameters of the LLM 108 are stored in the database 106.
- Each of these components can be realized by one or more computer devices and/or networked computer devices.
- the one or more computer devices include at least one processor with at least one non-transitory computer readable medium.
- the client device 104 is a smartphone.
- the client device 104 is a smart speaker. In some implementations, the client device 104 is a computer. In some implementations, the server 102 stores an AI algorithm or has access to an AI algorithm for analyzing captured audio and/or captured text. In some implementations, the AI algorithm can analyze written notes to extract text for further processing. In some implementations, the AI algorithm is the LLM 108.
- the LLM 108 can be a general knowledge LLM that can be used for text analysis (e.g., any version of OpenAI’s Generative Pre-Trained Transformer (GPT) language models).
- the database 106 includes files, computational models, etc., used by the server 102 for analyzing the inputs from the client device 104.
- LLMs are used here as an example because LLMs have potential to offer AI support throughout healthcare, with a growing body of literature demonstrating versatility in the ability of AI systems to address clinical questions across multiple medical specialties.
- Training AI models can be a time-intensive and resource-intensive process, requiring extensive man-hours, computing, and energy resources.
- Generally trained AI is not limited to a specific domain during training. Because it is not limited to a specific domain, the performance of generally trained AI can be heavily dependent on how it is used.
- ChatGPT, which uses one of the GPT models, can be probed using different questions and prompts to learn how such a generally trained AI will perform at a specific task.
- GPT is used here as an example; other LLMs (e.g., Meta's Llama 2, Google's Med-PaLM 2 and Bard, Anthropic's Claude, etc.) can be used for generating reports.
- a process 200 for standardized automated scribing is provided, according to some implementations of the present disclosure.
- the client device 104 receives the standardized report, and the server 102 performs the steps associated with the process 200.
- the client device 104 and the server 102 cooperatively perform one or more steps associated with the process 200.
- the server 102 receives one or more inputs from the client device 104 and/or from the database 106.
- the client device 104 captures the one or more inputs using a microphone associated with the client device 104.
- the one or more inputs are audio files that can be transcribed to text.
- the client device 104 captures the one or more inputs using a camera associated with the client device 104.
- the one or more inputs are images with text that can be deciphered using optical character recognition (OCR) or some other text recognition method.
- the one or more inputs are text documents provided by the client device 104.
- the one or more inputs are obtained during an endoscopy procedure.
- Table 1 provides an example transcript of a typical conversation that may take place during the endoscopy procedure.
- the audio from the endoscopy procedure may be captured by the client device 104.
- the client device 104 may provide the audio transcript to the server 102.
- the server 102 can generate a text transcript (e.g., the transcript of Table 1) for further processing.
- recording of the audio may be started using voice commands (e.g., “start”, “okay”, “begin”, or some other trigger word).
- recording of the audio may be stopped using voice commands (e.g., “generate report”, “stop”, “done”, or some other stop command).
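The voice-command handling described above (trigger words to start recording, stop commands to end recording and kick off report generation) can be sketched as simple utterance matching. The command sets below mirror the examples in the text; the function name and return values are hypothetical:

```python
START_COMMANDS = {"start", "okay", "begin"}
STOP_COMMANDS = {"generate report", "stop", "done"}

def detect_command(utterance):
    """Map a spoken utterance to a recorder action, if any.
    Returns "start_recording", "stop_and_send" (which would trigger
    transcription and report generation), or None for ordinary speech."""
    text = utterance.lower().strip().rstrip(".!?")
    if text in START_COMMANDS:
        return "start_recording"
    if text in STOP_COMMANDS:
        return "stop_and_send"
    return None
```

In practice the matching would run on the live speech-to-text stream, and fuzzier matching may be needed to tolerate transcription errors such as the "January report" example mentioned above.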
- the text transcript includes typographical errors (e.g., “January report” instead of the expected “generate report”) based on the algorithm used to obtain the transcript.
- the text transcript includes punctuation errors.
- the text transcript includes multiple side conversations that are not relevant to the subject matter or content to be included in the report.
- the text transcript includes statements.
- the text transcript includes questions.
- the server 102 receives instructions from the client device 104.
- the client device 104 provides a prompt to the server 102.
- the prompt instructs the server 102 to use the LLM 108 to closely examine the inputs from step 202.
- the prompt can further instruct the server 102 to use the LLM 108 to produce the standardized report.
- the prompt can further instruct the server 102 to provide specific headers based on the standards associated with the specific field (e.g., the field of endoscopy).
- the prompt can further instruct the server 102 to use a specific font or mode of organization (e.g., numbered lists, bullet points or unnumbered lists, etc.).
- the prompt can further instruct the server 102 to specify meanings of the specific headers.
- the prompt can further instruct the server 102 on a topic of focus for generating the standardized report.
- the instructions are stored in the database 106 and/or the server 102 such that the client device 104 does not provide the instructions to the server 102.
- the instructions may be remotely stored so that instructions can be updated by an administrator for better performance by the generalized LLM 108.
- the instructions are pre-stored on the client device 104 and can be updated by the server 102.
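The instruction-sourcing options above (client-provided, pre-stored on the server, or stored remotely in the database for administrator updates) suggest a simple resolution order. The following is a sketch under the assumption that client-supplied prompts take precedence over server-side defaults; the function and parameter names are hypothetical:

```python
def resolve_instructions(client_prompt=None, server_default=None, database_default=None):
    """Return the instructions to use when probing the LLM.
    A client-supplied prompt wins; otherwise fall back to the server's
    pre-stored instructions, then to the database copy that an
    administrator can update centrally for better performance."""
    for candidate in (client_prompt, server_default, database_default):
        if candidate:
            return candidate
    raise ValueError("no instructions available")
```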
- the server 102 probes the LLM 108 using the inputs from step 202 and the instructions from step 204.
- the LLM 108 is probed using APIs and chatbots (e.g., Whisper and ChatGPT).
- the server 102 provides a standardized report to the client device 104 based on results from the LLM 108.
- the standardized results can be provided via email, displayed on a screen of the client device 104, etc.
- Table 2 provides an example of a standardized report for the conversations captured in Table 1.
- Embodiments of the present disclosure provide a system that uses LLMs for clinical applications.
- the versatility of LLMs allows them to leverage their broad knowledge base to accomplish many clinical tasks, e.g., in the realm of endoscopy.
- In addition to assisting endoscopists with the diagnosis and assessment of disease, LLMs have the potential to reduce the workload of physicians by streamlining clinical tasks and enhancing clinical efficiency. Furthermore, by optimizing prompts, resources do not have to be wasted on further training a general LLM for this task.
- some implementations of the present disclosure allow for automatic generation of reports simply from having an AI listen to conversations taking place during the medical procedure. This removes the significant documentation burden on physicians and may improve accuracy of reports by eliminating the need for physicians to remember minute details of what was observed or said during the medical procedure. The alleviated burden increases the efficiency of physicians, allowing them to perform other clinical duties rather than being burdened by documentation. Some implementations of the present disclosure also increase the accuracy of reporting and can improve patient outcomes by minimizing errors or omissions in generated reports. Furthermore, generated reports can be easily integrated into electronic medical records and knowledge bases. In some implementations, the AI can provide recommendations for populating certain fields under specific headers.
- the AI can seek more information in knowledge bases to populate a recommendations section in the standardized report. Furthermore, these reports can be edited in real-time, and the AI can insert the specific additions or edits in appropriate locations in the report based on the specific organization or specific headers in the report.
- Embodiments of the present disclosure can be used in various situations. For example, all physicians and health systems who perform procedures or operations requiring the generation of an operative report can benefit. Operative reports ranging from inpatient endoscopy and general surgery to outpatient ophthalmology and dermatology can be generated. Embodiments of the present disclosure can be applied across the globe given AI language models’ ability to operate in different languages. For example, ChatGPT and Whisper have been shown to understand and summarize many languages other than English.
- a system 300 for standardized automated scribing is provided, according to certain aspects of the present disclosure.
- the system 300 is the same as or similar to the system 100 (FIG. 1).
- the system 300 includes the client device 104, the database 106, a server 302, and in some instances, parameters of the LLM 108.
- the server 302 is the same as or similar to the server 102.
- the server 302 may include a speech to text engine 310.
- the speech to text engine 310 is configured to generate a text transcript from an audio file.
- the client device 104 can provide an audio file to the server 302, and the server 302 can generate a text transcript from the audio file.
- the client device 104 directly provides the text transcript to the server 302, and the speech to text engine 310 is not used in generating the text transcript. That is, the client device 104 can generate the text transcript directly.
- the server 302 includes a procedure settings engine 314.
- the procedure settings engine 314 includes logic for selecting or obtaining system and/or user settings.
- the procedure settings engine 314 can include prompt(s) that are to be provided to the LLM engine 312.
- the procedure settings engine 314 can receive prompts from the client device 104 for the LLM engine 312.
- the procedure settings engine 314 can include parameters for fine-tuning a generally trained AI model.
- the procedure settings engine 314 includes system prompts for priming the LLM engine 312.
- the system prompts can include a role assigned to the LLM engine 312.
- the assigned role can be, for example, a personal assistant.
- the assigned role can be, for example, a personal assistant in an endoscopy room.
- the system prompts can include a level of expertise assigned to the LLM engine 312.
- the assigned expertise can include an assumed number of years that the LLM engine 312 has performed a certain task or a number of times that the LLM engine 312 has supposedly performed a certain task.
- the assigned expertise can indicate that the LLM engine 312 has written ten reports, hundreds of reports, thousands of reports, millions of reports, etc.
- the assigned expertise can indicate that the LLM engine 312 has read tens, hundreds, thousands, or millions of reports.
- the system prompts can include personality traits associated with the LLM engine 312.
- Example personality traits include detail-oriented, verbose, succinct, humorous, a particular tone, etc.
- the procedure settings engine 314 includes user prompts or receives at least some of the user prompts from the client device 104.
- the user prompts include a description of an audio file or a description of a transcript associated with an audio file.
- the description can be “I am providing you with a transcript of an audio recorded during a medical procedure.”
- the description can be “I am providing you with a transcript of an audio recorded during an endoscopy procedure.”
- the description can be “Here is an audio from a business meeting today.”
- the description can be “Here is a transcript from a meeting on supply chains.”
- the description can introduce a subject or topic associated with a text transcript or audio.
- the user prompt can provide further details on participants in the room where the transcript was generated. For example, a number of participants can be provided, or an indication that more than one participant is involved with the generation of the transcript.
- the user prompt can further include commands or instructions associated with tasks assigned to the LLM engine 312.
- the user prompt can ask the LLM engine 312 to categorize a procedure being performed.
- when the LLM engine 312 is a personal assistant in the endoscopy room, the LLM engine 312 can be asked to categorize whether the transcript indicates that a colonoscopy was performed, an endoscopic retrograde cholangiopancreatography was performed, an endoscopic ultrasound was performed, etc.
- the user prompt can include limitations on the categories provided by the LLM engine 312. For example, the user prompt can firmly assert, “do not provide any other procedures besides the ones mentioned.”
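The category-restricted classification described above, including the firm assertion that no other procedures be offered, can be sketched as a prompt plus a validation check on the model's answer. The `llm` callable and all names here are illustrative; the allowed categories come from the examples in the text:

```python
ALLOWED_PROCEDURES = [
    "colonoscopy",
    "endoscopic retrograde cholangiopancreatography",
    "endoscopic ultrasound",
]

def classify_procedure(transcript_head, llm):
    """Probe the LLM with a category-restricted user prompt and
    validate that the answer stays within the allowed set."""
    prompt = (
        "Categorize the procedure in this transcript as one of: "
        + ", ".join(ALLOWED_PROCEDURES)
        + ". Do not provide any other procedures besides the ones mentioned.\n\n"
        + transcript_head
    )
    answer = llm(prompt).strip().lower()
    if answer not in ALLOWED_PROCEDURES:
        raise ValueError(f"unexpected category: {answer!r}")
    return answer
```

The server-side validation is a belt-and-suspenders check: even with the firm prompt assertion, an LLM can occasionally stray outside the list.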
- the user prompt further provides a format associated with the standardized report to be generated.
- the user prompt can include one or more section headers and description of items that should be included under each section header.
- the one or more section headers can include (i) history of present illness, (ii) team and equipment, (iii) procedural details, (iv) findings, (v) impressions, (vi) recommendations, (vii) or any combination thereof.
- the description for history of present illness can include, for example, a summary of a patient’s past medical history and demographics information like age and sex.
- the description for history of present illness can further limit a number of sentences, for example, by limiting the number of sentences to one sentence, two sentences, three sentences, and so on.
- the description for team and equipment can include, for example, any referring providers, physicians and technicians performing the procedure, settings associated with the procedure including medication, medical classifications, in-patient or out-patient, etc.
- the procedural description can include, for example, an instruction to summarize important details of the procedure.
- the description of the findings can include, for example, an instruction to summarize important details of the procedural details section including further details from the text transcript that were not included in the procedural details section.
- the description of the procedural details can also include defaults to address items not reported in the transcript, for example, by including a “N/A” for an expected item not addressed in the transcript.
- the description of the findings can further include one or more rules for making the findings section concise, for example, by indicating a default of “normal” for parts of an anatomy not addressed in the transcript.
- the description of the impressions section can include, for example, an instruction to summarize the findings section in bullet points adding medical explanations if those explanations are found in the transcript. Specific formatting on how the findings and actions associated with findings can be specified in the description of the impressions.
- the description of the recommendations section can include an instruction for the LLM engine 312 to extract recommendations from the transcript and/or suggest recommendations based on an understanding of the findings and impressions.
- the description of the recommendations can further include an expertise level and/or medical or knowledge databases to reference. For example, guidelines from the American College of Gastroenterology, the American Society for Gastrointestinal Endoscopy, etc., can be provided as example authorities to reference when the LLM engine 312 generates recommendations.
- the user prompt can further include one or more examples of how to word sentences or bullet points. Furthermore, in some implementations, the user prompt can further include clean-up rules at the end. Clean-up rules include, for example, uniform rules that govern a look and feel of the standardized reports including “no text should appear prior to the first section header” and/or “no text should appear after contents under the last section header.”
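The user-prompt structure described above (section headers with per-section descriptions, plus clean-up rules at the end) can be sketched as a simple prompt builder. The headers and clean-up rules are taken from the text; the function name and exact wording are illustrative assumptions:

```python
SECTION_HEADERS = [
    "History of Present Illness",
    "Team and Equipment",
    "Procedural Details",
    "Findings",
    "Impressions",
    "Recommendations",
]

CLEANUP_RULES = [
    "No text should appear prior to the first section header.",
    "No text should appear after contents under the last section header.",
]

def build_user_prompt(transcript, descriptions):
    """Assemble a user prompt listing each section header with its
    description, append the clean-up rules, then the transcript."""
    lines = [
        "I am providing you with a transcript of an audio recorded "
        "during an endoscopy procedure. Produce a report with these sections:"
    ]
    for header in SECTION_HEADERS:
        lines.append(f"- {header}: {descriptions.get(header, '')}")
    lines.extend(CLEANUP_RULES)
    lines.append("Transcript:\n" + transcript)
    return "\n".join(lines)
```

The `descriptions` mapping would carry the per-section instructions described above, such as defaults of "N/A" for unaddressed items or "normal" for unaddressed anatomy.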
- the procedure settings engine 314 includes multiple system prompts and multiple user prompts.
- the multiple system prompts can be geared towards different roles and/or different procedures.
- the multiple user prompts can be geared towards different roles and/or different procedures and customization.
- the LLM engine 312 fine-tunes the LLM used in generating the reports. For example, a generally trained AI model is selected. For domain-level fine-tuning (e.g., in endoscopy), the AI model is selected for the ability to perform well in natural language understanding and generation tasks, specifically one suitable for extracting information from a large text.
- the dataset can be split into 80% training, 10% validation, and 10% test subsets. This split ensures that the model will be evaluated on unseen examples during the validation and testing phases.
- the data in the dataset can be cleaned and preprocessed to ensure consistent input data formatting (e.g., consistent spacing/newlines, identifying markers/indicators, etc.).
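The 80/10/10 split described above can be sketched as follows; the seeded shuffle is an assumed detail added for reproducibility, and the function name is hypothetical:

```python
import random

def split_dataset(examples, seed=0):
    """Shuffle and split examples into 80% train / 10% validation /
    10% test subsets, so validation and testing use unseen examples."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```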
- the fine-tuning process can be conducted using either open-source Python libraries like PyTorch and Transformers or API-based tools. Data processing and modeling tasks can be performed using Python and Excel®.
- a process 400 for standardized automated scribing is provided, according to certain aspects of the present disclosure.
- the process 400 is performed, for example, by the system 100 or the system 300.
- the client device 104 and the server 102 cooperatively perform one or more steps associated with the process 400.
- the client device 104 and the server 302 cooperatively perform one or more steps associated with the process 400.
- the system prompts can include different tones compared to the tone of the first settings.
- the system prompts can include a prompt specifying a tone of a physician communicating with other physicians, a prompt specifying a tone of a physician communicating with patients, a prompt specifying a tone of a person communicating with a ninth-grade comprehension or vocabulary, etc.
- the user prompts can include different formats for the reports, for example, a letter format for communicating with a patient, a medical j ournal format, a medical report format, a letter format for communicating with another physician or expert, etc.
- hyperparameters are selected for fine-tuning.
- Table 3 summarizes the list of baseline hyperparameters that can be adjusted for a report generation application. Table 3 also includes example expected values and the purpose each parameter serves. Mixed-precision training (FP16) can be implemented for training efficiency.
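Table 3 itself is not reproduced in this extract. Purely as an illustrative sketch, a baseline hyperparameter set for such a fine-tuning run might look like the following; every value below is an assumed common default, not taken from the patent, except the FP16 mixed-precision flag mentioned above:

```python
# Illustrative baseline hyperparameters for LLM fine-tuning.
# All values are assumed defaults for a report-generation task;
# the actual Table 3 values are not available in this extract.
BASELINE_HYPERPARAMETERS = {
    "learning_rate": 2e-5,   # optimizer step size
    "batch_size": 8,         # examples per gradient update
    "num_epochs": 3,         # passes over the training subset
    "warmup_ratio": 0.1,     # fraction of steps spent warming up
    "weight_decay": 0.01,    # regularization strength
    "fp16": True,            # mixed-precision training for efficiency
}
```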
- the LLM is trained to optimize metrics in order to obtain a fine-tuned LLM.
- the following metrics can be monitored at regular intervals: (i) training loss, (ii) validation loss, (iii) generation quality metrics (e.g., ROUGE-L and METEOR scores), or (iv) any combination of (i) to (iii).
- Validation loss can be monitored carefully early in the training process to determine a patience period.
- Model checkpoints can be saved regularly so that top-performing models can be retained based on validation metrics (e.g., validation loss and generation quality metrics).
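The patience-based monitoring and checkpoint retention described above can be sketched as two small helpers; the function names and the exact stopping rule are illustrative assumptions:

```python
def should_stop(val_losses, patience):
    """Early-stopping rule: stop when validation loss has not improved
    on its previous best for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

def retain_top_checkpoints(checkpoints, k=3):
    """Keep the k checkpoints with the lowest validation loss, so the
    top-performing models survive regular checkpointing."""
    return sorted(checkpoints, key=lambda c: c["val_loss"])[:k]
```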
- post-processing is performed to enforce report structure.
- Post-processing will enforce report structure via a rule-based system to ensure consistent formatting of all sections.
- Each generated report can be processed through a quality filter to check for critical findings and completeness of all sections.
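The rule-based structure enforcement and quality filter described above can be sketched as a presence check over required sections; the section list mirrors the headers given earlier, and the function name and return shape are hypothetical:

```python
REQUIRED_SECTIONS = [
    "History of Present Illness",
    "Team and Equipment",
    "Procedural Details",
    "Findings",
    "Impressions",
    "Recommendations",
]

def check_report(report_text):
    """Rule-based quality filter: every required section header must
    appear in the generated report."""
    missing = [s for s in REQUIRED_SECTIONS if s not in report_text]
    return {"complete": not missing, "missing_sections": missing}
```

A fuller filter would also scan for critical findings and flag empty sections, as the text suggests.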
- the fine-tuned LLM performance is evaluated using the dataset.
- the dataset can be split into 80% training, 10% validation, and 10% test subsets.
- the validation set, employing automated metrics (e.g., ROUGE-L and METEOR) and/or expert endoscopist review, can be used to evaluate the fine-tuned LLM’s performance.
- report accuracy of over 90% is considered acceptable for completeness.
- the threshold of 90% is merely provided as an example, but other report accuracy thresholds can be considered. For example, in a report that has a “critical findings” section, over 90% can be acceptable for both critical findings and absolute completeness. That is, some sections of the report can be more important than other sections such that accuracy in that section can be used as a floor for accuracy of the entire report.
- the fine-tuned LLM is tested for ability to generalize.
- the fine-tuned LLM is tested using the test dataset to evaluate the fine-tuned LLM’s ability to generalize to new inputs.
- the same explicit prompt structure used during training is provided, ensuring consistency between training and test runs.
- Test performance can be assessed based on accuracy, calculated as the percentage of endoscopy reports generated by the model that contained at least 90% of desired procedure information documented by the expert endoscopist.
- FIG. 5 is an example implementation 500 of the process of FIG. 4A, according to certain aspects of the present disclosure.
- Steps 402 and 404 are represented with Audio 1 being converted to Transcript 1.
- a script operates on Transcript 1 to extract sentences in the Transcript 1, for example, implementing step 406 by extracting the first 10 sentences.
- Step 408 involves determining a procedure class, for example, indicated by the number n.
- the LLM engine 312 takes as input first settings and the first 10 sentences to determine the procedure class n.
- Step 410 involves determining second settings based on the procedure class n.
- Step 412 involves using the second settings along with the Transcript 1 in the F n block 600 to generate Report 1.0.
- Report 2.0 and Report 3.0 can be generated in a similar manner. But as provided in FIG. 5, Report 1.0 is used along with second settings y1 and y2 to generate Report 2.0 and Report 3.0, respectively.
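The FIG. 5 flow (steps 406 through 412) can be summarized as a skeleton. The `llm` callable and the settings values are stand-ins; the actual prompts and the LLM engine 312 are not reproduced here.

```python
def generate_report(transcript, llm, first_settings, second_settings_by_class):
    # Step 406: extract the first 10 sentences of the transcript.
    head = ". ".join(transcript.split(". ")[:10])
    # Step 408: probe the LLM with first settings to determine the
    # procedure class n.
    n = llm(first_settings, head)
    # Step 410: select second settings based on the procedure class n.
    settings = second_settings_by_class[n]
    # Step 412: probe the LLM with the second settings and the full
    # transcript (the F_n block) to produce Report 1.0.
    return llm(settings, transcript)
```

A mock `llm` that classifies on one call and drafts on the next is enough to exercise this skeleton end to end.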
- FIG. 6 is an example implementation of the F n block 600 in FIG. 5 when generating a report, according to certain aspects of the present disclosure.
- FIG. 6 provides a flow diagram for the F n block 600 in FIG. 5.
- Second settings x1 ... xm are used by the LLM engine 312 to generate Note 2.m.
- Note 2.m is provided as Report 1.0 in FIG. 5.
- second settings x1 can include system and user prompts such that Note 2.0 includes findings, impressions, and recommendations.
- Second settings x2 can include system and user prompts for a reviewer of Note 2.0 such that the findings, impressions, and recommendations of Note 2.0 are adjusted by the LLM engine 312 to obtain Note 2.1.
- Second settings x3 can include system and user prompts for procedural details such that procedural details of Note 2.1 are updated to provide Note 2.2.
- Second settings x4 can include system and user prompts for a reviewer of Note 2.2 such that the procedural details are adjusted by the LLM engine 312 to obtain Note 2.3. Note 2.3 can then be provided as Report 1.0.
- Some implementations of the present disclosure can be used to generate and standardize reports regardless of the initial state of such a report.
- Using a transcript (e.g., Transcript 1), Note 2.0 can be generated based on the perspective and experience of a certain type of expert. Note 2.0 can then be used, along with the transcript, to generate Note 2.1. In some cases, omissions of items from the transcript in Note 2.0 are corrected in Note 2.1 due to Note 2.1 having a different perspective and experience compared to the role and experience of the prompt used to generate Note 2.0.
- Note 2.2 can be generated to add additional items based on a different experience and role. Once all the different perspectives are added, at the m-th iteration, a Note 2.m is generated which will become Report 1.0.
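The F_n iteration described above reduces to a short loop: each pass applies one expert "perspective" (settings x1 ... xm) to the prior note, and the final Note 2.m becomes Report 1.0. The `llm` callable below is a stand-in for the LLM engine 312.

```python
def refine_notes(transcript, llm, settings_list):
    """Iterate expert perspectives over a draft note (FIG. 6 sketch)."""
    # Note 2.0: initial draft from the transcript alone.
    note = llm(settings_list[0], transcript, None)
    # Notes 2.1 ... 2.m: each reviewer sees the transcript and prior note.
    for settings in settings_list[1:]:
        note = llm(settings, transcript, note)
    return note  # Note 2.m, provided as Report 1.0
```

Because every reviewer receives both the transcript and the previous note, an omission introduced by one perspective can be caught by a later one.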
- FIG. 7 is an example implementation for revising a report in FIG. 5, according to certain aspects of the present disclosure.
- FIG. 7 provides a flow diagram for the REV block 700 in FIG. 5.
- Revision settings z1, Audio 2 (e.g., revision inputs), Report #.0, and any other accompanying settings (e.g., user_rev_1_#) are used by the LLM engine 312 to generate a revised report (e.g., Report 1.1, Report 2.1, and Report 3.1 of FIG. 5).
- the REV block can be used by the different expertise highlighted in FIG. 6 to add revisions to sections of generated reports. That is, a first expertise role can be used with a first revision, and a second expertise role can be used with a second revision, and so on.
- Note 2.m is a derivative of the Notes 2.i that came before it.
- the difference with the REV block 700 is that the transcript of the procedure is no longer involved; only the revision audio or revision transcripts are provided along with the revision settings.
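The REV block 700 can be sketched in the same style. Note what is absent: the procedure transcript. Only the report under revision, the accumulated revision transcripts, and the revision settings reach the LLM (again a stand-in callable; the function name is hypothetical).

```python
def revise_report(report, revision_transcripts, llm, revision_settings):
    """Produce Report #.1 from Report #.0 plus all revision instructions.

    Unlike the F_n block, the original procedure transcript is not an
    input; only revision transcripts accompany the report.
    """
    instructions = "\n".join(revision_transcripts)
    return llm(revision_settings, report, instructions)
```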
- Healthcare providers can be bogged down preparing reports for procedures or coordinating updates to patients' electronic health records. Furthermore, these healthcare providers can prepare reports independently, using different formats. The formats may not be compatible with each other, and information provided can be hard to decipher due to the difference in formats.
- Some implementations of the present disclosure provide methods and systems for generating standardized reports such that these reports can be generated once audio of a procedure is available.
- the system 100 or 300 can record the audio in real-time and generate the report right after the procedure is completed. Any revisions to the report can be conducted using voice commands such that the system 100 or 300 makes edits to generated reports in real time.
- FIG. 8 is an example process flow 800 based on the process of FIG. 4A, according to certain aspects of the present disclosure.
- the process flow 800 will be described for endoscopy. To enhance readability and aid in explanation, a legend is provided highlighting “user-defined input”, “predefined prompt”, “agent and/or RAG”, “fixed logic”, and “refined output”.
- User-defined inputs include items customizable by the user, for example, audio inputs (e.g., Audio 1 801a and Audio 2 801b) and a revision button or action detector (e.g., B rev).
- Predefined prompts include items customizable for specific applications, for example, first settings (e.g., prompt 806), second settings (e.g., prompts 808a, 808b, 808c, 808d, 808e, 810a, 810b, 810c, 810d, 810e), and revision settings (e.g., prompt 810f).
- Agent and/or RAG are specific engines that provide computer-generated responses, for example, audio-to-text conversion (e.g., Whisper 802a, 802b), LLMs (e.g., GPT 807a, 807b, 807c, 807d, 807e, 807f, 807g), and LLMs with retrieval-augmented generation (RAG) (e.g., GPT+RAG 809a, 809b, 809c, 809d, 809e).
- Table 4 provides agent role summaries applied to the endoscopy setting.
- Fixed logic includes scripts (e.g., python script 804a, 804b, 804c, 804d, 804e).
- Refined outputs are revisions and outputs derived from the REV block 701. These include Report 2.1, Report 3.1, Report 4.1, Report 5.1, Report 6.1, and Report 7.
- the process flow 800 begins with generation of Audio 1 801a.
- Generating Audio 1 801a is similar to that for Audio 1 as described above in connection with FIG. 5 and step 402.
- Whisper 802a is used to convert Audio 1 801a to Transcript 1 803.
- Generation of Transcript 1 803 is similar to generation of Transcript 1 as described above in connection with FIG. 5 and step 404.
- Transcript 1 803 is processed with python script 804a to extract a set number of sentences (e.g., first 10 sentences 805), similar to the process described above in connection with the First 10 sentences of FIG. 5.
- the first 10 sentences 805 are combined with prompt 806 and processed by GPT 807a to obtain a procedure class indicated by n.
- the procedure class n is used with python script 804b to select the appropriate function block (e.g., F n block 601) and second settings (e.g., prompts 808a, 808b, ...).
- GPT 807a corresponds to Agent 1 in Table 4.
- GPT 807b combines Transcript 1 803 and prompt 808a to generate Note 2.0.
- GPT 807b corresponds to Agent 2.0 in Table 4.
- Prompt 808a can provide role and experience as described above in connection with system and user prompts.
- prompt 808a can provide a role of a personal assistant in the endoscopy room that generates initial drafts of endoscopy reports.
- GPT 807c combines Transcript 1 803 and prompt 808b to generate Note 2.1.
- GPT 807c corresponds to Agent 2.1 in Table 4.
- GPT 807c is a first reviewer of the generated Note 2.0. The role of the reviewer can be specified as a reviewer of medical documentation.
- GPT+RAG 809a combines Transcript 1 803 and prompt 808c to generate Note 2.2.
- GPT+RAG 809a corresponds to Agent 2.2 in Table 4 and performs quality check on Note 2.1.
- obscure information can be retrieved and provided to agents to enhance reliability and accuracy.
- the Agent 2.2 reviewer can ensure standardized reporting systems are included in notes. An example of this is using the Mayo scoring system in a patient presenting with ulcerative colitis.
- data retrieval from RAG datasets can involve an iterative retrieval process.
- An iterative retrieval process ensures sufficient contextual information is retrieved from the RAG dataset by the agent.
- An iterative retrieval process involves querying the RAG dataset and comparing an embedded compilation of the retrieved chunks to the original agent request to get a similarity score. This process is repeated until the similarity score decreases, indicating that the relevance of additional retrieved chunks is diminishing.
- stopping when the similarity score decreases helps ensure all relevant information is extracted from the RAG dataset in every agent call.
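The iterative retrieval loop above can be expressed compactly. This is a sketch under stated assumptions: the `retrieve_next` and `similarity` helpers stand in for the embedding and RAG-query machinery, which the disclosure does not specify.

```python
def iterative_retrieve(request, retrieve_next, similarity):
    """Accumulate RAG chunks until the similarity score stops improving.

    retrieve_next(request, chunks) -> next chunk or None when exhausted.
    similarity(request, chunks)    -> score comparing the embedded
                                      compilation of chunks to the request.
    """
    chunks, best = [], float("-inf")
    while True:
        chunk = retrieve_next(request, chunks)
        if chunk is None:
            return chunks
        score = similarity(request, chunks + [chunk])
        if score <= best:  # relevance of additional chunks is diminishing
            return chunks
        chunks.append(chunk)
        best = score
```

The loop admits each new chunk only while the compiled-context similarity keeps rising, which is one way to realize the stopping criterion described in the text.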
- Note 2.2a and Note 2.2b can be any set of sections that, when combined, form the complete endoscopy report (findings, impressions, recommendations, etc.). Notes 2.2a and Notes 2.2b are merely provided as examples.
- the process 800 can be expanded to Notes 2.2a, Notes 2.2b, Notes 2.2c, . . . Notes 2.2#.
- Different medical expertise roles captured by different agents and predefined prompts can be involved in reviewing the different sections to obtain corresponding Notes 2.3a, Notes 2.3b, Notes 2.3c, . . . Notes 2.3#.
- FIGS. 9A and 9B provide an example report that can be generated for endoscopy.
- FIG. 9A is a first page of the report
- FIG. 9B is a second page of the report.
- Splitting into Note 2.2a and Note 2.2b differs slightly from the approach described above in step 412 of FIG. 4A and the associated description in FIGS. 5 and 6.
- FIG. 8 provides that different sections of a generated note can have different reviewers focused on specific portions of the note. To ensure such separation and that changes are made to relevant portions, Note 2.2a and Note 2.2b are generated.
- FIG. 6 shows entire notes iterated over by different reviewers to generate Note 2.x.
- REV block 701 is used to make revisions.
- REV block 701 is similar to REV block 700 discussed above in connection with FIGS. 5 and 7.
- Audio 2 801b, including the voiced revisions, is transcribed using Whisper 802b to obtain transcript User_rev_3.
- Python script 804e selects the report to be revised, here indicated as Report #.0, and any previous revision instructions, here indicated as User_rev_l_#.
- the report to be revised and all revision instructions are combined with prompt 810f by GPT 807d to generate a revised report.
- FIG. 8 indicates that revised reports have a #.1 revision number. For example, a revision of Report 2.0 is Report 2.1.
- Revision of Report 3.0 is Report 3.1
- revision of Report 4.0 is Report 4.1
- the revision naming convention continues by appending revision numbers (e.g., #.2, #.3, and so on) for subsequent revisions.
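The fixed-logic revision numbering is simple enough to state directly; the function name is hypothetical, and the `Report #.N` naming follows the convention in FIG. 8.

```python
def next_revision(report_name):
    """Increment the revision suffix: 'Report 2.0' -> 'Report 2.1',
    'Report 2.1' -> 'Report 2.2', and so on."""
    base, rev = report_name.rsplit(".", 1)
    return f"{base}.{int(rev) + 1}"
```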
Abstract
A method includes receiving one or more inputs from a client device, generating a transcript based on the one or more inputs from the client device, probing a large language model using first settings and the transcript to determine a procedure class, based on the procedure class, determining second settings for the large language model, the second settings including a standardized report format, probing the large language model with the second settings and the transcript, and providing a standardized report based on results from the large language model.
Description
SYSTEMS AND METHODS FOR GENERATING STANDARDIZED REPORTS USING LARGE LANGUAGE MODELS
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application includes a claim of priority under 35 U.S.C. §119(e) to U.S. provisional patent application No. 63/617,525, filed January 4, 2024, the entirety of which is hereby incorporated by reference.
FIELD OF THE INVENTION
[0002] The present invention relates generally to using large language models for generating standardized reports, and more specifically, to automating scribing in standardized formats from unstructured inputs using large language models.
BACKGROUND
[0003] Sometimes patients undergo medical procedures for diagnostic or treatment purposes. Medical procedures, for example, endoscopic procedures, surgical procedures, etc., are typically performed by one or more healthcare professionals. These procedures must sometimes be documented or memorialized to specific standards. Beyond medical procedures, in office environments or meetings, takeaways or results must sometimes be memorialized to specific standards. In some cases, merely transcribing spoken words or providing a transcript of a meeting is insufficient because meetings may contain meandering conversations and other superfluous discussions not relevant to the subject matter of the meeting. Systems and methods for memorializing pertinent information from various events are provided in the present disclosure. These systems and methods reduce storage requirements and improve translation of memorialized information into readily available standardized formats.
SUMMARY OF THE INVENTION
[0004] The term embodiment and like terms, e.g., implementation, configuration, aspect, example, and option, are intended to refer broadly to all of the subject matter of this disclosure and the claims below. Statements containing these terms should be understood not to limit the subject matter described herein or to limit the meaning or scope of the claims below. Embodiments of the present disclosure covered herein are defined by the claims below, not this summary. This summary is a high-level overview of various aspects of the disclosure and
introduces some of the concepts that are further described in the Detailed Description section below. This summary is not intended to identify key or essential features of the claimed subject matter. This summary is also not intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification of this disclosure, any or all drawings, and each claim.
[0005] According to certain aspects of the present disclosure, a method includes receiving one or more inputs from a client device; generating a transcript based on the one or more inputs from the client device; probing a large language model using first settings and the transcript to determine a procedure class; based on the procedure class, determining second settings for the large language model, the second settings including a standardized report format; probing the large language model with the second settings and the transcript; and providing a standardized report based on results from the large language model.
[0006] According to certain aspects of the present disclosure, a system includes one or more data processors; and a non-transitory computer-readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform operations including: receiving one or more inputs from a client device; generating a transcript based on the one or more inputs from the client device; probing a large language model using first settings and the transcript to determine a procedure class; based on the procedure class, determining second settings for the large language model, the second settings including a standardized report format; probing the large language model with the second settings and the transcript; and providing a standardized report based on results from the large language model.
[0007] The above summary is not intended to represent each embodiment or every aspect of the present disclosure. Rather, the foregoing summary merely provides an example of some of the novel aspects and features set forth herein. The above features and advantages, and other features and advantages of the present disclosure, will be readily apparent from the following detailed description of representative embodiments and modes for carrying out the present invention, when taken in connection with the accompanying drawings and the appended claims. Additional aspects of the disclosure will be apparent to those of ordinary skill in the art in view of the detailed description of various embodiments, which is made with reference to the drawings, a brief description of which is provided below.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] The disclosure, and its advantages and drawings, will be better understood from the following description of representative embodiments together with reference to the accompanying drawings. These drawings depict only representative embodiments, and are therefore not to be considered as limitations on the scope of the various embodiments or claims.
[0009] FIG. 1 is a block diagram of a system for standardized automated scribing, according to certain aspects of the present disclosure.
[0010] FIG. 2 is a process for standardized automated scribing, according to certain aspects of the present disclosure.
[0011] FIG. 3 is a second block diagram of a system for standardized automated scribing, according to certain aspects of the present disclosure.
[0012] FIG. 4A is a second process for standardized automated scribing, according to certain aspects of the present disclosure.
[0013] FIG. 4B is an example fine-tuning process for improving model performance, according to certain aspects of the present disclosure.
[0014] FIG. 5 is an example implementation of the process of FIG. 4A, according to certain aspects of the present disclosure.
[0015] FIG. 6 is an example implementation for generating a report in FIG. 5, according to certain aspects of the present disclosure.
[0016] FIG. 7 is an example implementation for revising a report in FIG. 5, according to certain aspects of the present disclosure.
[0017] FIG. 8 is a second example implementation for the process of FIG. 4A, according to certain aspects of the present disclosure.
[0018] FIG. 9A is a first page of an example report generated, according to certain aspects of the present disclosure.
[0019] FIG. 9B is a second page of the example report of FIG. 9A.
[0020] FIG. 10 is an example note generated for a referring physician, according to certain aspects of the present disclosure.
[0021] FIG. 11 is an example note generated for a patient, according to certain aspects of the present disclosure.
DETAILED DESCRIPTION
[0022] Standardized reports are used in many industries. The present disclosure will use the medical industry as an example to illustrate some advantages of some implementations of the
present disclosure. Specifically, summarizing findings of endoscopic procedures will be used as an example. In current practice, endoscopists perform one or multiple procedures on a single patient and are required to report their findings after completion in the form of an operative report. Generating a report involves either using a program that requires many clicks to complete the report or dictating findings via a voice capture program after the procedure, which generates a report at a later time. The conventional methods of generating endoscopy reports have multiple limitations.
[0023] For example, the conventional methods require extensive memory and mental bandwidth of endoscopists. Endoscopists are tasked with memorizing endoscopic findings and maneuvers, which can be difficult to achieve with complex, long and multiple procedures on the same patient. This may also be difficult when an endoscopist has performed many procedures in the same day on different patients. Differentiating findings for each patient may be difficult. The endoscopist may sometimes not have time to complete the report after each procedure, requiring completion of multiple reports for different patients at a later time or the end of the day.
[0024] Furthermore, the conventional methods put a documentation burden on practitioners. Physician burnout from documentation is a well-documented issue. Even if the endoscopist is able to memorize and complete the documentation on time, tools are necessary to alleviate the mental processing burden of documentation and help improve physician efficiency. These tools may allow endoscopists to perform more procedures, complete more clinical duties, or improve their overall quality of life. These improvements may improve patient outcomes and revenue generated by the physician.
[0025] Embodiments of the present disclosure use artificial intelligence (AI) for automatic generation of postoperative reports for endoscopic procedures. In an implementation, after the identity of the patient is confirmed, an audio recording program will record audio in real time during an endoscopic procedure. During the procedure, the endoscopist will verbally announce endoscopic findings and maneuvers. In some implementations, the endoscopist can also have regular conversations and communicate with staff in the room just like a usual endoscopic procedure. Once the procedure is complete, the endoscopist can stop the recording, for example, using a voice command like “stop”, “generate report”, etc. The audio capturing program will know to stop recording and send the recording to a large language model for generating a transcript of the audio recording and further generating an operative report based on the transcript along with a pre-filled prompt. In some implementations, the transcript can
be generated by OpenAI's Whisper API. In some implementations, this transcript along with a pre-determined prompt can be sent to ChatGPT via OpenAI’s chat completion API to generate an operative report. The process from the “generate report” voice command to having a full report can take seconds to minutes. Furthermore, the report will be in the preferred format that is currently accepted in the gastroenterology field. In some implementations, it may also include the majority of the components needed to make a final report. The endoscopist can later make edits as needed before finalizing the report. Embodiments of the present disclosure provide improvements to computing systems. These improvements allow computing systems to generate and provide more reliable reports when compared to conventional computing systems. These reports can be provided in the same format, regardless of the input method (e.g., text, voice, handwriting, etc.) and regardless of the format and organization of the input. These reports can also be generated in real-time, incorporating edits provided in real-time.
[0026] Various embodiments are described with reference to the attached figures, where like reference numerals are used throughout the figures to designate similar or equivalent elements. The figures are not necessarily drawn to scale and are provided merely to illustrate aspects and features of the present disclosure. Numerous specific details, relationships, and methods are set forth to provide a full understanding of certain aspects and features of the present disclosure, although one having ordinary skill in the relevant art will recognize that these aspects and features can be practiced without one or more of the specific details, with other relationships, or with other methods. In some instances, well-known structures or operations are not shown in detail for illustrative purposes. The various embodiments disclosed herein are not necessarily limited by the illustrated ordering of acts or events, as some acts may occur in different orders and/or concurrently with other acts or events. Furthermore, not all illustrated acts or events are necessarily required to implement certain aspects and features of the present disclosure.
[0027] For purposes of the present detailed description, unless specifically disclaimed, and where appropriate, the singular includes the plural and vice versa. The word “including” means “including without limitation.” Moreover, words of approximation, such as “about,” “almost,” “substantially,” “approximately,” and the like, can be used herein to mean “at,” “near,” “nearly at,” “within 3-5% of,” “within acceptable manufacturing tolerances of,” or any logical combination thereof. Similarly, terms “vertical” or “horizontal” are intended to additionally include “within 3-5% of’ a vertical or horizontal orientation, respectively. Additionally, words
of direction, such as “top,” “bottom,” “left,” “right,” “above,” and “below” are intended to relate to the equivalent direction as depicted in a reference illustration; as understood contextually from the object(s) or element(s) being referenced, such as from a commonly used position for the object(s) or element(s); or as otherwise described herein.
[0028] Referring to FIG. 1, a system 100 for standardized automated scribing is provided, according to some implementations of the present disclosure. The system 100 includes a client device 104, a server 102, and a database 106. In some implementations, the server 102 has access to parameters of a large language model (LLM) (e.g., an LLM 108). In some implementations, the parameters of the LLM 108 are stored in the database 106. Each of these components can be realized by one or more computer devices and/or networked computer devices. The one or more computer devices include at least one processor with at least one non-transitory computer readable medium. In some implementations, the client device 104 is a smartphone. In some implementations, the client device 104 is a smart speaker. In some implementations, the client device 104 is a computer. In some implementations, the server 102 stores an AI algorithm or has access to an AI algorithm for analyzing captured audio and/or captured text. In some implementations, the AI algorithm can analyze written notes to extract text for further processing. In some implementations, the AI algorithm is the LLM 108. The LLM 108 can be a general knowledge LLM that can be used for text analysis (e.g., any version of OpenAI’s Generative Pre-Trained Transformer (GPT) language models). In some implementations, the database 106 includes files, computational models, etc., used by the server 102 for analyzing the inputs from the client device 104.
[0029] LLMs are used here as an example because LLMs have the potential to offer AI support throughout healthcare, with a growing body of literature demonstrating the versatility of AI systems in addressing clinical questions across multiple medical specialties. Training AI models can be a time-intensive and resource-intensive process, requiring extensive man-hours, computing, and energy resources. Typically, once an AI is trained, it can perform the specific tasks it was trained for very well. Tasks outside its domain are typically not performed well. Generally trained AI, on the other hand, is not limited to a specific domain during training. Due to not being limited to a specific domain, the performance of generally trained AI can be heavily dependent on how it is used. For example, ChatGPT, which uses one of the GPT models, can be probed using different questions and prompts to learn how such a generally trained AI will perform at a specific task. Although GPT is used here as an example, other LLMs (e.g., Meta's Llama 2, Google's Med-PaLM 2 and Bard, Anthropic's Claude, etc.)
can be used for generating reports.
[0030] Referring to FIG. 2, a process 200 for standardized automated scribing is provided, according to some implementations of the present disclosure. In some implementations, the client device 104 receives the standardized report, and the server 102 performs the steps associated with the process 200. In some implementations, the client device 104 and the server 102 cooperatively perform one or more steps associated with the process 200.
[0031] At step 202, the server 102 receives one or more inputs from the client device 104 and/or from the database 106. In some implementations, the client device 104 captures the one or more inputs using a microphone associated with the client device 104. The one or more inputs are audio files that can be transcribed to text. In some implementations, the client device 104 captures the one or more inputs using a camera associated with the client device 104. The one or more inputs are images with text that can be deciphered using optical character recognition (OCR) or some other text recognition method. In some implementations, the one or more inputs are text documents provided by the client device 104.
[0032] In some implementations, the one or more inputs are obtained during an endoscopy procedure. Table 1 provides an example transcript of a typical conversation that may take place during the endoscopy procedure. The audio from the endoscopy procedure may be captured by the client device 104. The client device 104 may provide the captured audio to the server 102. Using one or more algorithms, the server 102 can generate a text transcript (e.g., the transcript of Table 1) for further processing. In some implementations, recording of the audio may be started using voice commands (e.g., “start”, “okay”, “begin”, or some other trigger word). In some implementations, recording of the audio may be stopped using voice commands (e.g., “generate report”, “stop”, “done”, or some other stop command).
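The voice-command gating described above can be illustrated with a small sketch. The trigger words and stop phrases are the examples from the text; the function name is hypothetical, and a real system would act on streaming speech-to-text output rather than single utterances.

```python
START_WORDS = {"start", "okay", "begin"}
STOP_PHRASES = {"generate report", "stop", "done"}

def update_recording_state(utterance, recording):
    """Return the new recording state after hearing one utterance."""
    u = utterance.lower().strip()
    if not recording and u in START_WORDS:
        return True   # trigger word heard: begin recording
    if recording and u in STOP_PHRASES:
        return False  # stop command heard: end recording, generate report
    return recording  # otherwise, state is unchanged
```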
TABLE 1 : Example Transcript of Voice Inputs to the Server 102. The Conversation Captured Involves Four Individuals.
[0033] As provided in Table 1, in some implementations, the text transcript includes typographical errors (e.g., “January report” instead of the expected “generate report”) based on the algorithm used to obtain the transcript. In some implementations, the text transcript includes punctuation errors. In some implementations, the text transcript includes multiple side conversations that are not relevant to the subject matter or content to be included in the report. In some implementations, the text transcript includes statements. In some implementations, the text transcript includes questions.
[0034] At step 204, the server 102 receives instructions from the client device 104. For example, the client device 104 provides a prompt to the server 102. The prompt instructs the server 102 to use the LLM 108 to closely examine the inputs from step 202. The prompt can further instruct the server 102 to use the LLM 108 to produce the standardized report. The prompt can further instruct the server 102 to provide specific headers based on the standards associated with the specific field (e.g., the field of endoscopy). The prompt can further instruct the server 102 to use a specific font or mode of organization (e.g., numbered lists, bullet points or unnumbered lists, etc.). The prompt can further instruct the server 102 to specify meanings of the specific headers. The prompt can further instruct the server 102 on a topic of focus for generating the standardized report.
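One plausible way to assemble the step-204 instructions into a single prompt is sketched below. The headers, list style, and focus wording are illustrative assumptions; the disclosure does not fix exact prompt text.

```python
def build_instructions(headers, list_style="numbered",
                       focus="endoscopic findings and maneuvers"):
    """Assemble an example instruction prompt for the LLM (hypothetical
    wording; real prompts would be tuned for the target specialty)."""
    lines = [
        "Closely examine the transcript and produce a standardized report.",
        f"Use these section headers, in order: {', '.join(headers)}.",
        f"Organize items within each section as {list_style} lists.",
        f"Focus on {focus}; omit unrelated conversation.",
    ]
    return "\n".join(lines)
```

Storing such a builder server-side matches the remark below that instructions may be held in the database 106 and updated by an administrator.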
[0035] In some implementations, the instructions are stored in the database 106 and/or the server 102 such that the client device 104 does not provide the instructions to the server 102. The instructions may be remotely stored so that instructions can be updated by an administrator for better performance by the generalized LLM 108. In some implementations, the instructions are pre-stored on the client device 104 and can be updated by the server 102.
[0036] At step 206, the server 102 probes the LLM 108 using the inputs from step 202 and the instructions from step 204. In some implementations, the LLM 108 is probed using APIs and chat interfaces (e.g., Whisper and ChatGPT).
[0037] At step 208, the server 102 provides a standardized report to the client device 104 based on results from the LLM 108. For example, the standardized report can be provided via email, displayed on a screen of the client device 104, etc. Table 2 provides an example of a standardized report for the conversations captured in Table 1.
[0038] Embodiments of the present disclosure provide a system that uses LLMs for clinical applications. The versatility of LLMs allows them to leverage their broad knowledge base to accomplish many clinical tasks, e.g., in the realm of endoscopy. In addition to assisting endoscopists with the diagnosis and assessment of disease, LLMs have the potential to reduce the workload of physicians by streamlining clinical tasks and enhancing clinical efficiency. Furthermore, by optimizing prompts, resources do not have to be wasted in further training a general LLM for this task.
[0039] Conventional solutions from medical documentation companies allowed for standardization of operative reports while also allowing for quicker completion of these reports. These programs allowed users to generate a standardized report by clicking from provided options and guiding the user step by step through the report. Humans needed to be in the loop to navigate the menus and options for generating the standardized report.
[0040] In contrast, some implementations of the present disclosure allow for automatic generation of reports simply from having an AI listen to conversations being had during the medical procedure. This removes the significant documentation burden on physicians and may improve accuracy of reports by eliminating the need for physicians to remember minute details of what was observed or said during the medical procedure. The alleviated burden on physicians increases the efficiency of physicians, allowing them to perform other clinical duties rather than being burdened by documentation. Some implementations of the present disclosure also increase the accuracy of reporting and can improve patient outcomes by minimizing errors or any omissions in generated reports. Furthermore, generated reports can be easily integrated into electronic medical records and knowledge bases. In some implementations, the AI can provide recommendations for populating certain fields in specific headers. For example, based on the prompt asking for recommendations, the AI can seek more information in knowledge bases to populate a recommendations section in the standardized report. Furthermore, these reports can be edited in real-time and the AI can insert the specific additions or edits in appropriate locations in the report based on the specific organization or specific headers in the report.
[0041] Embodiments of the present disclosure can be used in various situations. For example, all physicians/health systems who perform procedures/operations which require the generation of an operative report can benefit. Operative reports can be generated for settings ranging from inpatient endoscopy and general surgery to outpatient ophthalmology and dermatology. Embodiments of the present disclosure can be applied across the globe given AI language models’ ability to operate in different languages. For example, ChatGPT and Whisper have been shown to understand and summarize many languages other than English.
[0042] Referring to FIG. 3, a system 300 for standardized automated scribing is provided, according to certain aspects of the present disclosure. The system 300 is the same as or similar to the system 100 (FIG. 1). The system 300 includes the client device 104, the database 106, a server 302, and in some instances, parameters of the LLM 108. The server 302 is the same
as or similar to the server 102.
[0043] The server 302 may include a speech to text engine 310. The speech to text engine 310 is configured to generate a text transcript from an audio file. For example, the client device 104 can provide an audio file to the server 302, and the server 302 can generate a text transcript from the audio file. In some implementations, the client device 104 directly provides the text transcript to the server 302, and the speech to text engine 310 is not used in generating the text transcript. That is, the client device 104 can generate the text transcript directly.
[0044] The server 302 includes an LLM engine 312. The LLM engine 312 is, for example, LLM software that performs calculations using a processor of the server 302. The processor can include a neural processor, a central processing unit, a graphical processing unit, or any combination thereof. The LLM engine 312 can choose one or more models provided in the LLM 108 to respond to queries. The LLM engine 312 is configured to generate standardized reports and perform revisions on the standardized reports based on system settings, user settings, or both system and user settings. The LLM engine 312 can also be used to fine-tune a generally trained AI model for better performance.
[0045] The server 302 includes a procedure settings engine 314. The procedure settings engine 314 includes logic for selecting or obtaining system and/or user settings. For example, the procedure settings engine 314 can include prompt(s) that are to be provided to the LLM engine 312. In another example, the procedure settings engine 314 can receive prompts from the client device 104 for the LLM engine 312. The procedure settings engine 314 can include parameters for fine-tuning a generally trained AI model.
[0046] In some implementations, the procedure settings engine 314 includes system prompts for priming the LLM engine 312. For example, the system prompts can include a role assigned to the LLM engine 312. The assigned role can be, for example, a personal assistant. In other implementations, the assigned role can be, for example, a personal assistant in an endoscopy room. The system prompts can include a level of expertise assigned to the LLM engine 312. For example, the assigned expertise can include an assumed number of years that the LLM engine 312 has performed a certain task or a number of times that the LLM engine 312 has supposedly performed a certain task. For example, the assigned expertise can indicate that the LLM engine 312 has written ten reports, hundreds of reports, thousands of reports, millions of reports, etc. The assigned expertise can indicate that the LLM engine 312 has read tens, hundreds, thousands, or millions of reports. The system prompts can include personality traits associated with the LLM engine 312. Example personality traits include, for example, detail-oriented, verbose, succinct, humorous, a specified tone, etc.
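The system-prompt composition described above (role, expertise level, personality traits) can be sketched as a simple string-building function. All literal wording below is an illustrative assumption, not the disclosed prompt text.

```python
# Sketch of composing a system prompt from a role, an assigned level of
# expertise, and personality traits. The phrasing is a hypothetical example.

def build_system_prompt(role: str, reports_written: int, traits: list[str]) -> str:
    trait_text = ", ".join(traits)
    return (f"You are a {role}. You have written {reports_written:,} reports. "
            f"Your style is {trait_text}.")

prompt = build_system_prompt("personal assistant in an endoscopy room",
                             10_000, ["detail-oriented", "succinct"])
```

A prompt built this way would be supplied to the LLM engine 312 as a priming message before any transcript is provided.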
[0047] In some implementations, the procedure settings engine 314 includes user prompts or receives at least some of the user prompts from the client device 104. In some implementations, the user prompts include a description of an audio file or a description of a transcript associated with an audio file. For example, the description can be “I am providing you with a transcript of an audio recorded during a medical procedure.” In some implementations, the description can be “I am providing you with a transcript of an audio recorded during an endoscopy procedure.” In some implementations, the description can be “Here is an audio from a business meeting today.” In some implementations, the description can be “Here is a transcript from a meeting on supply chains.” The description can introduce a subject or topic associated with a text transcript or audio.
[0048] In some implementations, the user prompt can provide further details on participants in the room where the transcript was generated. For example, a number of participants can be provided, or an indication that more than one participant is involved with the generation of the transcript. The user prompt can further include commands or instructions associated with tasks assigned to the LLM engine 312. For example, the user prompt can ask the LLM engine 312 to categorize a procedure being performed. In the example where the LLM engine 312 is a personal assistant in the endoscopy room, the LLM engine 312 can be asked to categorize whether the transcript indicates that a colonoscopy was performed, an endoscopic retrograde cholangiopancreatography was performed, an endoscopic ultrasound was performed, etc. In some implementations, the user prompt can include limitations on the categories provided by the LLM engine 312. For example, the user prompt can firmly assert, “do not provide any other procedures besides the ones mentioned.”
[0049] In some implementations, the user prompt further provides a format associated with the standardized report to be generated. For example, the user prompt can include one or more section headers and description of items that should be included under each section header. In an example, the one or more section headers can include (i) history of present illness, (ii) team and equipment, (iii) procedural details, (iv) findings, (v) impressions, (vi) recommendations, (vii) or any combination thereof. The description for history of present illness can include, for example, a summary of a patient’s past medical history and demographics information like age and sex. The description for history of present illness can further limit a number of sentences, for example, by limiting the number of sentences to one sentence, two sentences, three sentences, and so on.
[0050] The description for team and equipment can include, for example, any referring providers, physicians and technicians performing the procedure, and settings associated with the procedure including medication, medical classifications, in-patient or out-patient status, etc. The procedural description can include, for example, an instruction to summarize important details of the procedure. The description of the findings can include, for example, an instruction to summarize important details of the procedural details section including further details from the text transcript that were not included in the procedural details section. The description of the procedural details can also include defaults to address items not reported in the transcript, for example, by including an “N/A” for an expected item not addressed in the transcript. The description of the findings can further include one or more rules for making the findings section concise, for example, by indicating a default of “normal” for parts of an anatomy not addressed in the transcript.
[0051] The description of the impressions section can include, for example, an instruction to summarize the findings section in bullet points, adding medical explanations if those explanations are found in the transcript. Specific formatting for how the findings, and the actions associated with the findings, are presented can be specified in the description of the impressions. The description of the recommendations section can include an instruction for the LLM engine 312 to extract recommendations from the transcript and/or suggest recommendations based on an understanding of the findings and impressions. The description of the recommendations can further include an expertise level and/or medical or knowledge databases to reference. For example, guidelines from the American College of Gastroenterology, the American Society for Gastrointestinal Endoscopy, etc., can be provided as example authorities to reference when the LLM engine 312 generates the recommendations section.
[0052] For each of the above-mentioned sections, or for at least one of the above-mentioned sections, the user prompt can further include one or more examples of how to word sentences or bullet points. Furthermore, in some implementations, the user prompt can include clean-up rules at the end. Clean-up rules include, for example, uniform rules that govern a look and feel of the standardized reports, including “no text should appear prior to the first section header” and/or “no text should appear after contents under the last section header.”
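The user-prompt assembly described in paragraphs [0047]–[0052] can be sketched as follows. The section headers come from paragraph [0049]; the per-section instructions and clean-up wording are paraphrased assumptions for illustration.

```python
# Sketch of assembling a user prompt from section headers, per-section
# instructions, and trailing clean-up rules. Wording is illustrative.

SECTIONS = {
    "History of Present Illness": "Summarize past medical history and demographics in one sentence.",
    "Team and Equipment": "List referring providers, physicians, technicians, and medications; use N/A if absent.",
    "Procedural Details": "Summarize important details of the procedure.",
    "Findings": "Summarize findings; default to 'normal' for anatomy not addressed.",
    "Impressions": "Summarize the findings in bullet points with medical explanations from the transcript.",
    "Recommendations": "Extract or suggest recommendations based on the findings and impressions.",
}

CLEANUP_RULES = [
    "No text should appear prior to the first section header.",
    "No text should appear after contents under the last section header.",
]

def build_user_prompt(description: str) -> str:
    lines = [description, ""]
    for header, instruction in SECTIONS.items():
        lines.append(f"{header}: {instruction}")
    lines.append("")
    lines.extend(CLEANUP_RULES)
    return "\n".join(lines)

prompt = build_user_prompt("I am providing you with a transcript of an audio "
                           "recorded during an endoscopy procedure.")
```

The resulting prompt introduces the transcript's subject, enumerates the expected sections, and ends with the clean-up rules.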
[0053] In some implementations, the procedure settings engine 314 includes multiple system prompts and multiple user prompts. For example, the multiple system prompts can be geared towards different roles and/or different procedures. The multiple user prompts can be geared towards different roles and/or different procedures and customization.
[0054] As mentioned above, in some implementations, the LLM engine 312 fine-tunes the LLM used in generating the reports. For example, a generally trained AI model is selected. For domain-level fine-tuning (e.g., in endoscopy), the AI model is selected for the ability to perform well in natural language understanding and generation tasks, specifically one suitable for extracting information from a large text. A labeled dataset of endoscopy transcripts and their corresponding endoscopy reports, which are drafted by expert endoscopists, can be used in fine-tuning. This large dataset will then be used to fine-tune the LLM to improve performance in accuracy and comprehensiveness.
[0055] In some implementations, the dataset can include a task-specific prompt as well as a number of transcribed audio recordings from endoscopic procedures and their corresponding expert endoscopist human-generated endoscopy reports. The number can be around 2500 transcribed audio recordings and their corresponding expert endoscopist human-generated endoscopy reports. The dataset can be structured into a list of prompt-completion pairs (e.g., {prompt, completion}) compatible with fine-tuning requirements. A prompt in the context of the dataset is an explicit instruction followed by a full transcript extracted from all speech spoken in the endoscopy room during the procedure. A completion in the context of the dataset is a detailed endoscopy report containing findings, impressions, recommendations, etc. An example dataset record in JSON Lines (JSONL) format is as follows:
{"prompt": "{instruction}{transcript}", "completion": "{report}"}
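Serializing one such prompt-completion pair to a JSON Lines record can be sketched as below; the field contents are placeholders standing in for a real instruction, transcript, and expert report.

```python
import json

# Sketch of writing one {prompt, completion} training record in JSONL form,
# matching the record shape shown above. Contents are placeholders.

def make_record(instruction: str, transcript: str, report: str) -> str:
    record = {"prompt": instruction + transcript, "completion": report}
    return json.dumps(record)

line = make_record("Generate a standardized endoscopy report from: ",
                   "Doctor: scope advanced to the cecum ...",
                   "Findings: ...")
```

One such line per procedure, appended to a `.jsonl` file, yields a dataset compatible with common fine-tuning tooling.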
[0056] The dataset can be split into 80% training, 10% validation, and 10% test subsets. This split ensures that the model will be evaluated on unseen examples during the validation and testing phases. The data in the dataset can be cleaned and preprocessed to ensure consistent input data formatting (e.g., consistent spacing/newlines, identifying markers/indicators, etc.).
[0057] In some implementations, the fine-tuning process can be conducted using either open-source Python libraries like PyTorch and Transformers or API-based tools. Data processing and modeling tasks can be performed using Python and Excel®.
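The 80/10/10 split described above can be sketched as a small helper. The fixed seed is an assumption added here so the split is reproducible; it is not specified in the disclosure.

```python
import random

# Sketch of an 80% training / 10% validation / 10% test split.
# The seed value is an illustrative assumption for reproducibility.

def split_dataset(records: list, seed: int = 0):
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)   # shuffle a copy, leave input intact
    n = len(shuffled)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # remainder, ~10%
    return train, val, test

train, val, test = split_dataset(list(range(2500)))
```

For the roughly 2,500-record dataset mentioned above, this yields 2,000 training, 250 validation, and 250 test records.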
[0058] Referring to FIG. 4A, a process 400 for standardized automated scribing is provided, according to certain aspects of the present disclosure. In some implementations, the process 400 is performed, for example, by the system 100 or the system 300. In some implementations, the client device 104 and the server 102 cooperatively perform one or more steps associated with the process 400. In some implementations, the client device 104 and the server 302 cooperatively perform one or more steps associated with the process 400.
[0059] Optionally, at step 402, the server 302 receives audio input from the client device 104
and/or from the database 106. The audio input can be a recording associated with a medical procedure, a meeting, an interview, a social event, and so on.
[0060] Optionally, at step 404, the server 302 generates an audio transcript from the audio input. In some implementations, the speech to text engine 310 generates the audio transcript from the audio input. In some implementations, the LLM engine 312 can receive multi-modal input including the audio input to generate the audio transcript.
[0061] At step 406, the server 302 extracts a first number of sentences from the audio transcript. For example, the server 302 can extract the first three, four, six, ten, or twenty sentences from the audio transcript. For example, medical examinations can be structured such that a physician announces a specific procedure being performed at the beginning of the procedure. By extracting the first number of sentences for analysis, the first number of sentences can contain or identify the procedure. In some implementations, the physician does not need to announce the procedure, and the LLM engine 312 can determine the procedure based on context of information included in the first number of sentences.
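Extracting the first N sentences (step 406) can be sketched with a naive punctuation-based splitter. The splitting rule is an illustrative assumption; a production system would need a more robust sentence segmenter (abbreviations like "Dr." would break this one).

```python
import re

# Sketch of step 406: extract the first N sentences from a transcript.
# The period/question/exclamation split is a simplifying assumption.

def first_sentences(transcript: str, n: int) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", transcript.strip())
    return [s for s in sentences if s][:n]

head = first_sentences("We are starting a colonoscopy. Patient is comfortable. "
                       "Scope inserted.", 2)
```

The extracted sentences are then passed, with the first settings, to the classification of step 408.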
[0062] At step 408, the server 302 provides a procedure class based on evaluating the first number of sentences and first settings with an LLM model. For example, the LLM engine 312 can receive as inputs the first number of sentences and the first settings. The first settings can include a first system prompt and a first user prompt. The first system prompt can include a role and a level of expertise. The first user prompt can include a description of the audio transcript and an instruction to categorize a procedure into a procedure class.
[0063] In some implementations, steps 402 to 408 are repeated several times using different numbers of first sentences to determine that the LLM engine 312 is providing a consistent categorization into the procedure class. In some cases, a larger number of first sentences is trusted more than a smaller number of first sentences. For example, if ten sentences and four sentences yield different procedure classes, the server 302 chooses the procedure class associated with the ten sentences.
[0064] In some implementations, the audio transcript is scanned for a set number of sentences (e.g., ten sentences at a time) to determine the procedure class. For example, a procedure class is determined using the first ten sentences, a procedure class is determined using the next ten sentences, and so on. The mode of the procedure classes from this scanning is determined to be the procedure class. In some cases, this scanning procedure is performed for only three sets of ten sentences and the procedure class associated with at least two of the three sets of ten sentences is selected as the procedure class.
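The scanning scheme of paragraph [0064] can be sketched as classifying consecutive windows of sentences and taking the mode of the resulting labels. The classifier below is a stub standing in for the LLM engine 312; the window size and three-window limit follow the example in the text.

```python
from collections import Counter

# Sketch of windowed procedure classification: classify up to max_windows
# consecutive windows of sentences, then return the most common label.

def classify_by_windows(sentences, classify, window=10, max_windows=3):
    votes = []
    for i in range(0, min(len(sentences), window * max_windows), window):
        votes.append(classify(sentences[i:i + window]))
    return Counter(votes).most_common(1)[0][0]   # mode of the votes

# Stub classifier: pretend windows mentioning "scope" indicate a colonoscopy.
stub = lambda chunk: "colonoscopy" if any("scope" in s for s in chunk) else "other"
label = classify_by_windows(["scope in"] * 10 + ["talk"] * 10 + ["scope out"] * 10,
                            stub)
```

With the three windows above voting colonoscopy/other/colonoscopy, the mode selects the procedure class agreed on by at least two of the three sets.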
[0065] At step 410, the procedure settings engine 314 determines, based on the procedure class of step 408, second settings for generating one or more report types. Second settings can include, for example, system prompts and user prompts. The system prompts can include different tones compared to the tone of the first settings. For example, the system prompts can include a prompt specifying a tone of a physician communicating with other physicians, a prompt specifying a tone of a physician communicating with patients, a prompt specifying a tone of a person communicating with a ninth-grade comprehension or vocabulary, etc. The user prompts can include different formats for the reports, for example, a letter format for communicating with a patient, a medical journal format, a medical report format, a letter format for communicating with another physician or expert, etc.
[0066] At step 412, the LLM engine 312 generates the one or more report types by evaluating the audio transcript and the second settings. For example, if at step 410, three report types are to be generated, the first report type being a medical report, the second report type being a communication from a treating physician to a referring physician, and the third report type being a communication from the treating physician to the patient, then the LLM engine 312 uses different system prompts and/or user prompts from step 410 to generate the three report types. In some implementations, the one or more report types include the standardized report of FIG. 2.
[0067] In some implementations, two of the three report types are based on a first report type. For example, the medical report can be the first report type generated. The medical report can then be combined with system prompts and/or user prompts to generate the second report type (e.g., the communication from the treating physician to the patient) and the third report type (e.g., the communication from the treating physician to the referring physician).
[0068] In some implementations, the first report type is based on two roles. For example, a first role can be a same role as the role in the first settings, and a second role can be a role of a reviewer. For example, the first role can be a personal assistant in an endoscopy room, and the second role can be that of a reviewer of medical documentation. That is, the first report type is a first report generated using the role of the personal assistant and then modified using the role of the reviewer of medical documentation. From here, the first report type can be used to generate the second report type and the third report type.
[0069] At step 414, the server 302 receives one or more revision inputs from the client device 104. For example, the one or more revision inputs can include audio, text, one-liners, etc. In some implementations, the one or more revision inputs can be provided in real time when an
error is noticed - “I said 4.1 not 4.0.” In some implementations, the one or more revision inputs are initiated with a button push on the client device 104. The button push can be communicated to the server 302.
[0070] At step 416, the LLM engine 312 updates the generated one or more report types of step 412 by evaluating the one or more revision inputs of step 414, one or more revision settings, and the one or more report types. The one or more revision settings are system and user prompts. For example, the user prompt can include instructions identifying the one or more report types.
[0071] In some implementations, one or more fine-tuning steps can be performed after step 412 such that the LLM engine 312 can select a more appropriate model or can adjust parameters of the specific LLM to generate more detailed and accurate results for the specific industry of focus. FIG. 4B is an example fine-tuning process 450 for improving model performance, according to certain aspects of the present disclosure. Fine-tuning allows pre-trained models to specialize and adapt to specific tasks via domain-specific training. Most open-source or API-available pre-trained models can be fine-tuned, but as an example, the fine-tuning process 450 will be described using a medium-sized model like GPT-2. GPT-2 includes 355 million parameters and is merely used as an example. Other LLMs can be fine-tuned using the fine-tuning process 450.
[0072] At step 452, the LLM engine 312 can be used to tokenize a dataset used for fine-tuning. The entire dataset is preprocessed with a tokenizer per model-specific requirements. For GPT-2 fine-tuning, OpenAI's GPT-2 tokenizer can be used.
[0073] At step 454, hyperparameters are selected for fine-tuning. Table 3 summarizes the list of baseline hyperparameters that can be adjusted for a report generation application. Table 3 also includes example expected values and the purpose each parameter serves. Mixed-precision training (FP16) can be implemented for training efficiency.
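A baseline hyperparameter configuration of the kind summarized in Table 3 can be sketched as a plain mapping. Table 3 itself is not reproduced in this excerpt, so the names and values below are illustrative assumptions typical of small-model fine-tuning, not the table's contents.

```python
# Hypothetical baseline hyperparameters for step 454. Values are illustrative
# assumptions; Table 3 of the disclosure would supply the actual baselines.

HYPERPARAMS = {
    "learning_rate": 5e-5,   # optimizer step size
    "batch_size": 8,         # sequences per optimization step
    "num_epochs": 3,         # passes over the fine-tuning dataset
    "warmup_steps": 100,     # gradual learning-rate ramp-up
    "weight_decay": 0.01,    # regularization strength
    "fp16": True,            # mixed-precision (FP16) training for efficiency
}
```

Such a mapping can be passed to a training framework (e.g., a Transformers-style trainer) or varied during hyperparameter search.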
[0074] At step 456, the LLM is trained to optimize metrics in order to obtain a fine-tuned LLM. During training, the following metrics can be monitored at regular intervals: (i) training loss, (ii) validation loss, (iii) generation quality metrics (e.g., ROUGE-L and METEOR scores), or (iv) any combination of (i) to (iii). Validation loss can be monitored carefully early in the training process to determine a patience period. Model checkpoints can be saved regularly so that top-performing models can be retained based on validation metrics (e.g., validation loss and generation quality metrics).
[0075] At step 458, post-processing is performed to enforce report structure. Post-processing will enforce report structure via a rule-based system to ensure consistent formatting of all sections. Each generated report can be processed through a quality filter to check for critical findings and completeness of all sections.
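The rule-based post-processing of step 458 can be sketched as a quality filter that checks each generated report for the expected section headers. The header names and the `Header:` convention are illustrative assumptions.

```python
# Sketch of a rule-based quality filter for step 458: flag reports that are
# missing expected sections. Section names are illustrative assumptions.

EXPECTED_SECTIONS = ["Findings", "Impressions", "Recommendations"]

def quality_filter(report: str) -> dict:
    missing = [s for s in EXPECTED_SECTIONS if f"{s}:" not in report]
    return {"complete": not missing, "missing_sections": missing}

result = quality_filter("Findings: normal mucosa.\nImpressions: no lesions.")
```

A report flagged incomplete could be routed back through generation or surfaced for human review; a fuller filter would also check section ordering and critical-findings content.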
[0076] At step 460, the fine-tuned LLM performance is evaluated using the dataset. For example, as described above, the dataset can be split into 80% training, 10% validation, and 10% test subsets. The validation set, employing automated metrics (e.g., ROUGE-L and METEOR) and/or expert endoscopist review, can be used to evaluate the fine-tuned LLM’s performance. In some implementations, report accuracy of over 90% is considered acceptable for completeness. The threshold of 90% is merely provided as an example, but other report accuracy thresholds can be considered. For example, in a report that has a “critical findings” section, over 90% can be acceptable for both critical findings and absolute completeness. That is, some sections of the report can be more important than other sections such that accuracy in that section can be used as a floor for accuracy of the entire report.
[0077] At step 462, the fine-tuned LLM is tested for ability to generalize. Upon completion of training and evaluation, the fine-tuned LLM is tested using the test dataset to evaluate the fine-tuned LLM’s ability to generalize to new inputs. During testing, the same explicit prompt structure used during training is provided, ensuring consistency between training and test runs. Test performance can be assessed based on accuracy, calculated as the percentage of endoscopy reports generated by the model that contained at least 90% of desired procedure information documented by the expert endoscopist.
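The test-time accuracy metric of paragraph [0077] can be sketched as follows: a generated report counts as accurate if it contains at least 90% of the expert-documented items, and overall accuracy is the fraction of reports meeting that floor. Representing reports as sets of extracted items is a simplifying assumption for illustration.

```python
# Sketch of the accuracy metric: a report passes if it covers >= 90% of the
# expert-documented items; overall accuracy is the fraction of passing reports.

def report_accuracy(generated_items: set, expert_items: set,
                    floor: float = 0.9) -> bool:
    if not expert_items:
        return True
    return len(generated_items & expert_items) / len(expert_items) >= floor

def overall_accuracy(pairs) -> float:
    hits = sum(report_accuracy(g, e) for g, e in pairs)
    return hits / len(pairs)

score = overall_accuracy([
    ({"polyp", "cecum", "biopsy"}, {"polyp", "cecum", "biopsy"}),  # 3/3 items
    ({"polyp"}, {"polyp", "cecum"}),                               # 1/2 items
])
```

In practice, the per-report item sets would come from the expert endoscopist's documentation and an extraction pass over the model output.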
[0078] FIG. 5 is an example implementation 500 of the process of FIG. 4A, according to certain
aspects of the present disclosure. Steps 402 and 404 are represented with Audio 1 being converted to Transcript 1. A script operates on Transcript 1 to extract sentences in the Transcript 1, for example, implementing step 406 by extracting the first 10 sentences.
[0079] Step 408 involves determining a procedure class, for example, indicated by the number n. The LLM engine 312 takes as input first settings and the first 10 sentences to determine the procedure class n. Step 410 involves determining second settings based on the procedure class n. Step 412 involves using the second settings along with the Transcript 1 in the Fn block 600 to generate Report 1.0. In some implementations, Report 2.0 and Report 3.0 can be generated in a similar manner. But as provided in FIG. 5, Report 1.0 is used along with second settings y1 and y2 to generate Report 2.0 and Report 3.0, respectively.
[0080] When revision inputs are received in step 414, the REV block 700 takes as input Report 1.0, Report 2.0, Report 3.0, revision settings, and/or the revision inputs to generate Report 1.1, Report 2.1, and Report 3.1. Report 1.0, Report 2.0, and Report 3.0 can be revised independently of each other. For example, the REV block 700 can be run by different individuals on specific reports.
[0081] FIG. 6 is an example implementation of the Fn block 600 in FIG. 5 when generating a report, according to certain aspects of the present disclosure. FIG. 6 provides a flow diagram for the Fn block 600 in FIG. 5. Second settings x1...xm are used by the LLM engine 312 to generate Note 2.m. Note 2.m is provided as Report 1.0 in FIG. 5.
[0082] In an example, second settings x1 can include system and user prompts such that Note 2.0 includes findings, impressions, and recommendations. Second settings x2 can include system and user prompts for a reviewer of Note 2.0 such that the findings, impressions, and recommendations of Note 2.0 are adjusted by the LLM engine 312 to obtain Note 2.1. Second settings x3 can include system and user prompts for procedural details such that procedural details of Note 2.1 are updated to provide a Note 2.2. Second settings x4 can include system and user prompts for a reviewer of Note 2.2 such that the procedural details are adjusted by the LLM engine 312 to obtain Note 2.3. Note 2.3 can then be provided as Report 1.0.
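The iterative refinement in the Fn block can be sketched as a fold over the settings list, where each pass turns the previous Note into the next. The LLM call is a stub standing in for the LLM engine 312; the pipeline shape, not the stub, is the point of the sketch.

```python
# Sketch of the Fn block: each second-settings entry drives one LLM pass
# that refines the previous Note, yielding Note 2.m as Report 1.0.

def fn_block(transcript, settings_list, llm):
    note = ""
    for settings in settings_list:   # e.g., draft, review, details passes
        note = llm(transcript, settings, note)
    return note                      # final Note, returned as the report

# Stub LLM: record which settings pass produced each revision.
stub = lambda transcript, settings, prev: (prev + " " if prev else "") + f"[{settings}]"
report = fn_block("Transcript 1", ["draft", "review", "details"], stub)
```

Because every pass receives both the transcript and the prior Note, a later reviewer role can correct omissions left by an earlier drafting role, as described above.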
[0083] Some implementations of the present disclosure can be used to generate and standardize reports regardless of the initial state of such a report. For example, as provided in FIG. 6, based on a transcript (e.g., Transcript 1), an intermediate report designated as Note 2.0 can be generated. Note 2.0 can be generated based on a perspective and experience of a certain type of expert. Note 2.0 can then be used, along with the transcript, to generate Note 2.1. In some cases, omissions of items in the transcript in Note 2.0 are corrected in Note 2.1 due to Note 2.1
having a different perspective and experience compared to the role and experience of the prompt used to generate Note 2.0. Furthermore, Note 2.2 can be generated to add additional items based on a different experience and role. Once all the different perspectives are added, at the m-th iteration, a Note 2.m is generated which will become Report 1.0.
[0084] FIG. 7 is an example implementation for revising a report in FIG. 5, according to certain aspects of the present disclosure. FIG. 7 provides a flow diagram for the REV block 700 in FIG. 5. Revision settings Z1, Audio 2 (e.g., revision inputs), Report #.0, and any other accompanying settings (e.g., user_rev_1_#) are used by the LLM engine 312 to generate a revised report (e.g., Report 1.1, Report 2.1, and Report 3.1 of FIG. 5). The REV block can be used by the different expertise highlighted in FIG. 6 to add revisions to sections of generated reports. That is, a first expertise role can be used with a first revision, a second expertise role can be used with a second revision, and so on. This is similar to FIG. 6 above, where multiple iterations with different perspectives can be used to generate Note 2.m, each iteration being a derivative of the Note that came before. The difference with the REV block 700 is that the transcript of the procedure is no longer at play and only the revision audio or revision transcripts are provided with the revision settings.
[0085] Healthcare providers can be bogged down preparing reports for procedures or coordinating to update the electronic health records of patients. Furthermore, these healthcare providers can prepare reports independently, using different formats. The formats may not be compatible with each other, and information provided can be hard to decipher due to the difference in formats. Some implementations of the present disclosure provide methods and systems for generating standardized reports such that these reports can be generated once audio of a procedure is available. In some implementations, the system 100 or 300 can record the audio in real-time and generate the report right after the procedure is completed. Any revisions to the report can be conducted using voice commands such that the system 100 or 300 makes edits to generated reports in real time.
[0086] By generating reports in this manner, different physicians responsible for different parts of the report can revise according to expertise. Any updates made to the report can be reflected in real-time such that all physicians and caretakers with access to the report can view changes as they occur. By using system and user prompts with clear instructions including formatting standards and expertise level, accurate reports including viewpoints from different simulated experts can be generated in seconds. Furthermore, any updates from a physician can be incorporated in the specified format.
[0087] FIG. 8 is an example process flow 800 based on the process of FIG. 4A, according to certain aspects of the present disclosure. The process flow 800 will be described for endoscopy. To enhance readability and aid in explanation, a legend is provided highlighting “user-defined input”, “predefined prompt”, “agent and/or RAG”, “fixed logic”, and “refined output”.
[0088] User-defined input includes items customizable by the user, for example, audio inputs (e.g., Audio 1 801a and Audio 2 801b) and a revision button or action detector (e.g., Brev). Predefined prompt includes items customizable for specific applications, for example, first settings (e.g., prompt 806), second settings (e.g., prompts 808a, 808b, 808c, 808d, 808e, 810a, 810b, 810c, 810d, 810e), and revision settings (e.g., 810f). Agent and/or RAG are specific engines that provide computer-generated responses, for example, audio-to-text conversion (e.g., Whisper 802a, 802b), LLMs (e.g., GPT 807a, 807b, 807c, 807d, 807e, 807f, 807g), and LLMs with retrieval augmented generation (RAG) (e.g., GPT+RAG 809a, 809b, 809c, 809d, 809e). Table 4 provides agent role summaries applied to the endoscopy setting. Fixed logic includes scripts (e.g., python script 804a, 804b, 804c, 804d, 804e). Refined output includes revisions and outputs derived from the REV block 701. These include Report 2.1, Report 3.1, Report 4.1, Report 5.1, Report 6.1, and Report 7.
[0089] The process flow 800 begins with generation of Audio 1 801a. Generating Audio 1 801a is similar to that for Audio 1 as described above in connection with FIG. 5 and step 402. Whisper 802a is used to convert Audio 1 801a to Transcript 1 803. Generation of Transcript 1 803 is similar to generation of Transcript 1 as described above in connection with FIG. 5 and step 404. Transcript 1 803 is processed with python script 804a to extract a set number of sentences (e.g., first 10 sentences 805), similar to the process described above in connection with the First 10 sentences of FIG. 5.
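The extraction performed by python script 804a can be sketched as follows. The patent does not specify the tokenization rules, so the regex-based sentence split below is a minimal illustrative assumption, not the actual implementation:

```python
import re

def first_sentences(transcript: str, n: int = 10) -> str:
    """Return the first n sentences of a transcript as a single string.

    A naive split on '.', '!', or '?' followed by whitespace stands in for
    whatever sentence tokenizer the script actually uses (an assumption).
    """
    sentences = re.split(r"(?<=[.!?])\s+", transcript.strip())
    return " ".join(sentences[:n])

opening = first_sentences(
    "The scope was inserted. Cecum reached. Polyp found. Removed.", n=2
)
# opening == "The scope was inserted. Cecum reached."
```

A production system would likely substitute a clinical-grade sentence tokenizer, since dictated transcripts contain abbreviations that break naive punctuation splitting.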
[0090] The first 10 sentences 805 are combined with prompt 806 and processed by GPT 807a to obtain a procedure class indicated by n. The procedure class n is used with python script 804b to select the appropriate function block (e.g., Fn block 601) and second settings (e.g., prompts 808a, 808b, ...). GPT 807a corresponds to Agent 1 in Table 4. GPT 807b combines Transcript 1 803 and prompt 808a to generate Note 2.0. GPT 807b corresponds to Agent 2.0 in Table 4. Prompt 808a can provide role and experience as described above in connection with system and user prompts. For example, prompt 808a can provide a role of a personal assistant in the endoscopy room that generates initial drafts of endoscopy reports.
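The classification and settings-selection steps can be sketched as below. The `llm` callable, the prompt keys, and the `SECOND_SETTINGS` table are hypothetical placeholders standing in for prompt 806, GPT 807a, and the selection logic of python script 804b:

```python
def classify_procedure(opening_sentences: str, first_settings: dict, llm) -> int:
    """Probe an LLM with the first settings plus the opening sentences to
    obtain the procedure class n (the role of Agent 1 in Table 4)."""
    messages = [
        {"role": "system", "content": first_settings["system_prompt"]},
        {"role": "user",
         "content": first_settings["user_prompt"] + "\n\n" + opening_sentences},
    ]
    return int(llm(messages).strip())  # class index n selects the Fn block

# Hypothetical mapping from class n to second settings (python script 804b).
SECOND_SETTINGS = {
    0: {"format": "endoscopy_report", "prompts": ["prompt 808a", "prompt 808b"]},
    1: {"format": "colonoscopy_report", "prompts": ["prompt 808a"]},
}
```

In practice `llm` would wrap a chat-completion API call; here it is any callable that takes a message list and returns text, which also makes the logic testable without network access.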
[0091] GPT 807c combines Transcript 1 803 and prompt 808b to generate Note 2.1. GPT 807c corresponds to Agent 2.1 in Table 4. GPT 807c is a first reviewer of the generated Note 2.0. The role of the reviewer can be specified as a reviewer of medical documentation.
[0092] GPT+RAG 809a combines Transcript 1 803 and prompt 808c to generate Note 2.2. GPT+RAG 809a corresponds to Agent 2.2 in Table 4 and performs a quality check on Note 2.1. Retrieval augmented generation (RAG) can be used to improve performance of agents whose outputs contain highly specific information such as names, equipment models, etc. By using RAG, obscure information can be retrieved and provided to agents to enhance reliability and accuracy. For example, the Agent 2.2 reviewer can ensure standardized reporting systems are included in notes. An example of this is using the MAYO scoring system for a patient presenting with ulcerative colitis. Agent 2.2 reads the endoscopic report (Note 2.1), deduces the MAYO score, and suggests the deduced MAYO score as an addition (i.e., to be included in Note 2.2). Agent 2.2 uses an endoscopic metric RAG dataset organized as disease-metric chunks. Table 4 identifies the RAG datasets used by specific agents in FIG. 8.
[0093] Datasets for RAG can be preprocessed as plain-text key-value pairs and organized into text chunks. Each chunk contains a variable number of key-value pairs, depending on the structure of the RAG dataset. For example, referring to Table 4, a single part entry in the RAG dataset for Agent 3 can contain several key-value pairs (e.g., part number, part description, etc.), while each person in the RAG dataset for Agent 2.3a can be represented by a single key-value pair: role-name. Table 4 also includes the chunk structure for each organized RAG dataset. In some implementations, the chunks are embedded using the OpenAI embedding model text-embedding-ada-002 to convert the chunks into vectors that can be used to quantify and compare semantic meaning. This embedding allows relevant data to be efficiently identified and retrieved via mathematical comparison. The OpenAI embedding model is used here merely as an example because GPT was selected as the LLM example; other embedding models can be used to convert the dataset into vectors.
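The key-value chunking described above can be sketched as follows. The `embed` callable stands in for text-embedding-ada-002 or any other embedding model, and the example entries are illustrative, not taken from the patent's datasets:

```python
def chunk_key_value_pairs(entries: list) -> list:
    """Serialize each dataset entry as one plain-text chunk of key-value lines."""
    return ["\n".join(f"{k}: {v}" for k, v in entry.items()) for entry in entries]

def embed_chunks(chunks: list, embed) -> list:
    """Pair each chunk with its embedding vector for later similarity search."""
    return [(chunk, embed(chunk)) for chunk in chunks]

# A multi-pair part entry (Agent 3) vs. a single role-name pair (Agent 2.3a):
parts = chunk_key_value_pairs(
    [{"part number": "EN-123", "part description": "Biopsy forceps"}]
)
staff = chunk_key_value_pairs([{"nurse": "Jane Doe"}])
# parts[0] == "part number: EN-123\npart description: Biopsy forceps"
# staff[0] == "nurse: Jane Doe"
```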
[0094] In some implementations, data retrieval from RAG datasets can involve an iterative retrieval process. An iterative retrieval process ensures sufficient contextual information is retrieved from the RAG dataset by the agent. An iterative retrieval process involves querying the RAG dataset and comparing an embedded compilation of the retrieved chunks to the original agent request to obtain a similarity score. This process is repeated until the similarity score decreases, indicating that the relevance of additional retrieved chunks is diminishing. Stopping when the similarity score decreases ensures all relevant information is extracted from the RAG dataset in every agent call.
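The stopping rule can be sketched as follows, with cosine similarity assumed as the similarity score (the patent does not name the metric):

```python
import math

def cosine(a, b) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def iterative_retrieve(query_vec, indexed_chunks, embed):
    """Add chunks in relevance order; stop once the embedded compilation's
    similarity to the original request starts to decrease."""
    ranked = sorted(indexed_chunks,
                    key=lambda cv: cosine(query_vec, cv[1]), reverse=True)
    selected, best = [], -1.0
    for chunk, _vec in ranked:
        candidate = selected + [chunk]
        score = cosine(query_vec, embed("\n".join(candidate)))
        if score < best:  # relevance of additional chunks is diminishing
            break
        selected, best = candidate, score
    return selected
```

With a real embedding model, `indexed_chunks` would hold the precomputed chunk vectors and `embed` would re-embed the growing compilation on each iteration, as the paragraph above describes.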
[0095] Note 2.2 can be used by python script 804c to generate Note 2.2a and Note 2.2b. In some implementations, Note 2.2a is a first section of Note 2.2, and Note 2.2b is a second section of Note 2.2. In this embodiment, the first section and the second section, when combined, comprise all of Note 2.2. Note 2.2a is used with prompt 808e and Transcript 1 803 by GPT+RAG 809c to generate Note 2.3a, and Note 2.2b is used with prompt 808d and Transcript 1 803 by GPT+RAG 809b to generate Note 2.3b. Although FIG. 8 provides Note 2.2a and Note 2.2b, Note 2.2# can be any set of sections that, when combined, form the complete endoscopy report (findings, impressions, recommendations, etc.). Notes 2.2a and 2.2b are merely provided as examples. The process 800 can be expanded to Notes 2.2a, 2.2b, 2.2c, ... 2.2#. Different medical expertise roles captured by different agents and predefined prompts can be involved in reviewing the different sections to obtain corresponding Notes 2.3a, 2.3b, 2.3c, ... 2.3#.
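The sectioning performed by python script 804c can be sketched as a heading-based split. The heading set is an assumption; the patent mentions findings, impressions, and recommendations only as examples of possible sections:

```python
def split_note(note: str, headings: set) -> dict:
    """Split a note into sections keyed by heading so each section (Note
    2.2a, Note 2.2b, ...) can be routed to its own reviewer agent."""
    sections, current = {}, None
    for line in note.splitlines():
        label = line.strip().rstrip(":")
        if label in headings:
            current = label
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {h: "\n".join(body).strip() for h, body in sections.items()}

sections = split_note(
    "Findings:\nOne 5 mm polyp in the cecum.\nRecommendations:\nRepeat in 3 years.",
    {"Findings", "Recommendations"},
)
# sections["Findings"] == "One 5 mm polyp in the cecum."
```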
[0096] GPT+RAG 809c corresponds to Agent 2.3a in Table 4, and GPT+RAG 809b corresponds to Agent 2.3b in Table 4. For example, Note 2.2a can be a section of Note 2.2 associated with a specific heading. The specific section can deal with procedure attendance, and GPT+RAG 809c is a reviewer responsible for ensuring all names are spelled correctly in the report. GPT+RAG 809c reads the transcript and isolates the names mentioned. Using a RAG dataset with all possible names in an institution, GPT+RAG 809c ensures correct spelling of names. GPT+RAG 809c uses a RAG dataset of a hospital staff and patient database with chunks having role-name pairs.
[0097] Similarly, Note 2.2b can be a section of Note 2.2 having to do with procedure details, findings, and recommendations. GPT+RAG 809b is a reviewer that ensures recommendations provided in the note are in line with society guidelines from the AGA, ASGE, AASLD, ACG, etc. Other guidelines relevant to endoscopy can be added as well based on guidelines in the RAG dataset. The RAG dataset used by Agent 2.3b is a GI guidelines dataset with information organized as single-guideline entry information chunks. Python script 804d can be used to combine Note 2.3a and Note 2.3b to generate Report 2.0. Report 2.0 is analogous to Report 1.0 in FIG. 5. Report 2.0 can be generated and ready for human review after the endoscopy procedure. FIGS. 9A and 9B provide an example report that can be generated for endoscopy. FIG. 9A is a first page of the report, and FIG. 9B is a second page of the report. The splitting into Note 2.2a and Note 2.2b differs slightly from step 412 of FIG. 4A and the associated description in FIGS. 5 and 6. FIG. 8 provides that different sections of a generated note can have different reviewers focused on specific portions of the note. To ensure such separation and that changes are made to relevant portions, Note 2.2a and Note 2.2b are generated. In contrast, FIG. 6 shows entire notes iterated over by different reviewers to generate Note 2.x.
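The merge performed by python script 804d can be sketched as re-joining the reviewed sections in a fixed heading order. The heading names and section text below are illustrative assumptions:

```python
def combine_sections(reviewed: dict, order: list) -> str:
    """Merge per-section reviewer outputs (Note 2.3a, Note 2.3b, ...) back
    into a single report (Report 2.0), preserving a fixed heading order."""
    return "\n\n".join(f"{h}:\n{reviewed[h]}" for h in order)

report_2_0 = combine_sections(
    {"Attendance": "Dr. Smith; Nurse Jane Doe.", "Findings": "Polyp removed."},
    ["Attendance", "Findings"],
)
# report_2_0 == "Attendance:\nDr. Smith; Nurse Jane Doe.\n\nFindings:\nPolyp removed."
```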
[0098] Other reports may need to be generated in a hospital setting by other staff members. For example, inventory management may need to track the quantity of items consumed during a procedure, or nurses may need to provide documentation of activities such as scope-in and scope-out times, cecum time, etc. Report 3.0 can be generated using Transcript 1 803 and
prompt 810a by GPT+RAG 809d. GPT+RAG 809d corresponds to Agent 3 in Table 4 and can be used to generate Report 3.0 used for inventory and utilization tracking.
[0099] GPT+RAG 809d can use a RAG dataset including hospital equipment information organized in single-part entry information chunks. Agent 3 can keep track of materials used during endoscopies for the purpose of tracking inventory. Agent 3 can use a RAG dataset of all possible materials present in the endoscopy room and can ensure accurate reporting. The RAG dataset can be updated regularly as new materials are purchased. Tracking such information can help endoscopy leadership understand utilization rates and potential reasons for utilization trends. Agent 3 can automatically understand and keep track of the context of use for all materials, providing a standardized report (e.g., Report 3.0).
[0100] Report 4.0 can be generated using Transcript 1 803 and prompt 810b by GPT 807e. Report 4.0 includes nursing documentation. Report 5.0 is analogous to Report 2.0 of FIG. 5, and Report 6.0 is analogous to Report 3.0 of FIG. 5. Report 5.0 can be a note generated for a referring physician and can take the form provided in FIG. 10. Report 6.0 can be a note generated for a patient and can take the form provided in FIG. 11.
[0101] If edits are necessary or desired, then REV block 701 is used to make revisions. REV block 701 is similar to REV block 700 discussed above in connection with FIGS. 5 and 7. Audio 2 801b, including the voiced revisions, is transcribed using Whisper 802b to obtain transcript User_rev_3. Python script 804e selects the report to be revised, here indicated as Report #.0, and any previous revision instructions, here indicated as User_rev_1_#. The report to be revised and all revision instructions are combined with prompt 810f by GPT 807d to generate a revised report. FIG. 8 indicates that revised reports have a #.1 revision number. For example, a revision of Report 2.0 is Report 2.1. A revision of Report 3.0 is Report 3.1, a revision of Report 4.0 is Report 4.1, and so on. Only a single iteration of review is shown in FIG. 8, but subsequent revisions follow the same naming convention with incremented revision numbers (e.g., #.2, #.3, and so on).
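The revision step can be sketched as below. The `llm` callable and the message layout are assumptions standing in for GPT 807d and prompt 810f; the patent does not specify how the report and instructions are concatenated:

```python
def revise_report(report: str, prior_revisions: list, new_revision: str,
                  revision_prompt: str, llm) -> str:
    """Combine the selected report (Report #.0), all previous revision
    instructions, and the newly dictated revision, then ask the LLM for
    the revised report (Report #.1, #.2, ...)."""
    instructions = prior_revisions + [new_revision]
    messages = [
        {"role": "system", "content": revision_prompt},
        {"role": "user",
         "content": report + "\n\nRevisions:\n" + "\n".join(instructions)},
    ]
    return llm(messages)
```

Passing the full revision history on every call, rather than only the latest instruction, matches the paragraph above: the model always regenerates from the base report plus all accumulated instructions.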
[0102] In FIG. 8, Report 5.0 and Report 6.0 are generated using Report 2.1. In some implementations, Report 5.0 and Report 6.0 are generated using Report 2.0 when no revisions are performed on Report 2.0. GPT 807f uses Report 2.1 and prompt 810d to generate Report 5.0, and GPT 807g uses Report 2.1 and prompt 810e to generate Report 6.0.
[0103] In a hospital setting, billing coordinators may need to properly bill insurance companies and patients for services provided. Therefore, Report 7 can be generated. Report 7 can be generated using Report 2.1 and Report 3.1. GPT+RAG 809e takes as input Report 2.1 and
Report 3.1 along with prompt 810c to generate Report 7. GPT+RAG 809e corresponds to Agent 7 in Table 4. Agent 7 can generate billing codes based on the final note (e.g., Report 2.1) and what tools were utilized (e.g., information in Report 3.1). Billing code generation is an essential aspect of the medical procedure workflow given that billing is how health professionals and hospitals get paid. Agent 7 can use a RAG dataset of all possible ICD-10 and CPT codes in order to generate billing codes. In some implementations, Agent 7 is optimized to bill the maximum amount based on procedures performed. The RAG dataset can be organized as code-value pair chunks of the ICD-10/CPT codes.
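A naive keyword lookup over code-value chunks illustrates the idea. The two codes shown are real ICD-10/CPT examples, but the matching strategy is a simplification: a deployed Agent 7 would use embedding-based RAG retrieval rather than substring matching:

```python
def lookup_codes(report_terms: list, code_value_chunks: dict) -> list:
    """Match report terms against code-value chunks (code -> description)
    to propose candidate billing codes for human review."""
    return [
        code
        for code, description in code_value_chunks.items()
        if any(term.lower() in description.lower() for term in report_terms)
    ]

codes = lookup_codes(
    ["colitis"],
    {"K51.90": "Ulcerative colitis, unspecified",
     "45378": "Colonoscopy, diagnostic"},
)
# codes == ["K51.90"]
```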
[0104] In FIG. 8, any of the LLMs (e.g., GPT 807a, 807b, 807c, 807d, 807e, 807f, 807g) and LLMs with RAG (e.g., GPT+RAG 809a, 809b, 809c, 809d, 809e) can be fine-tuned according to the process 450 to improve performance. Some implementations of the present disclosure provide systems and methods for generating accurate reports that may stem from facts present in the same medical procedure. Some implementations of the present disclosure can enhance accuracy across the different reports generated and/or can enhance accuracy within sections of the same report. Some implementations of the present disclosure can generate a report with collaborative inputs from different agent types (or roles) with specialty or expertise in specific domains. Some implementations of the present disclosure provide for users (e.g., doctors or other medical professionals) to dictate corrections for revising the automatically generated reports.
[0105] Although the disclosed embodiments have been illustrated and described with respect to one or more implementations, equivalent alterations and modifications will occur or be known to others skilled in the art upon the reading and understanding of this specification and the annexed drawings. In addition, while a particular feature of the invention may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application.
[0106] While various embodiments of the present disclosure have been described above, it should be understood that they have been presented by way of example only, and not limitation. Numerous changes to the disclosed embodiments can be made in accordance with the disclosure herein, without departing from the spirit or scope of the disclosure. Thus, the breadth and scope of the present disclosure should not be limited by any of the above described embodiments. Rather, the scope of the disclosure should be defined in accordance with the following claims and their equivalents.
Claims
1. A method for generating reports in a collaborative medical setting, comprising: receiving one or more inputs from a client device; generating a transcript based on the one or more inputs from the client device; probing a large language model using first settings and the transcript to determine a procedure class; based on the procedure class, determining second settings for the large language model, the second settings including a standardized report format for at least one medical field; probing the large language model with the second settings and the transcript; and providing a standardized report based on results from the large language model.
2. The method of claim 1, wherein the large language model is GPT-3.5.
3. The method of claim 1 or claim 2, wherein the first settings include prompts optimized for generating the standardized report.
4. The method of any one of claims 1 to 3, wherein the first settings include a system prompt and a user prompt.
5. The method of claim 4, wherein the system prompt includes one or more of an assigned role, an assigned experience, a personality trait, or any combination thereof.
6. The method of claim 4 or claim 5, wherein the user prompt includes a description of the transcript, participants, instructions to categorize a procedure, a limit on categories, or any combination thereof.
7. The method of any one of claims 1 to 6, wherein the standardized report includes three report types.
8. The method of claim 7, wherein two of the three report types are generated using a first one of the report types.
9. The method of any one of claims 1 to 8, further comprising: receiving one or more revision inputs from the client device; and updating the standardized report by evaluating the one or more revision inputs, one or more revision settings, and the standardized report.
10. The method of claim 9, wherein the one or more revision inputs are obtained in real-time and the updating is performed in real-time such that the large language model edits the standardized report based on the one or more revision inputs.
11. The method of claim 9 or claim 10, wherein the standardized report and the updated standardized report use the standardized report format.
12. The method of any one of claims 1 to 11, further comprising: fine-tuning the large language model based on a labeled dataset; and updating the standardized report using the fine-tuned large language model.
13. The method of any one of claims 1 to 12, further comprising: splitting the standardized report into a first section of the standardized report and a second section of the standardized report; verifying the first section of the standardized report with a first RAG dataset and the large language model; verifying the second section of the standardized report with a second RAG dataset and the large language model, the first RAG dataset and the second RAG dataset being different RAG datasets; and generating an updated standardized report using the verified first section of the standardized report and the verified second section of the standardized report.
14. A system for generating reports in a collaborative medical setting, comprising: one or more data processors; and a non-transitory computer-readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform operations including: receiving one or more inputs from a client device;
generating a transcript based on the one or more inputs from the client device; probing a large language model using first settings and the transcript to determine a procedure class; based on the procedure class, determining second settings for the large language model, the second settings including a standardized report format for at least one medical field; probing the large language model with the second settings and the transcript; and providing a standardized report based on results from the large language model.
15. The system of claim 14, wherein the large language model is GPT-3.5.
16. The system of claim 14 or claim 15, wherein the first settings include prompts optimized for generating the standardized report.
17. The system of any one of claims 14 to 16, wherein the first settings include a system prompt and a user prompt.
18. The system of claim 17, wherein the system prompt includes one or more of an assigned role, an assigned experience, a personality trait, or any combination thereof.
19. The system of claim 17 or claim 18, wherein the user prompt includes a description of the transcript, participants, instructions to categorize a procedure, a limit on categories, or any combination thereof.
20. The system of any one of claims 14 to 19, wherein the standardized report includes three report types.
21. The system of claim 20, wherein two of the three report types are generated using a first one of the report types.
22. The system of any one of claims 14 to 21, further configured to: receive one or more revision inputs from the client device; and
update the standardized report by evaluating the one or more revision inputs, one or more revision settings, and the standardized report.
23. The system of claim 22, wherein the one or more revision inputs are obtained in real-time and the updating is performed in real-time such that the large language model edits the standardized report based on the one or more revision inputs.
24. The system of claim 22 or claim 23, wherein the standardized report and the updated standardized report use the standardized report format.
25. The system of any one of claims 14 to 24, further configured to: fine-tune the large language model based on a labeled dataset; and update the standardized report using the fine-tuned large language model.
26. The system of any one of claims 14 to 25, further configured to: split the standardized report into a first section of the standardized report and a second section of the standardized report; verify the first section of the standardized report with a first RAG dataset and the large language model; verify the second section of the standardized report with a second RAG dataset and the large language model, the first RAG dataset and the second RAG dataset being different RAG datasets; and generate an updated standardized report using the verified first section of the standardized report and the verified second section of the standardized report.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463617525P | 2024-01-04 | 2024-01-04 | |
| US63/617,525 | 2024-01-04 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025147678A1 (en) | 2025-07-10 |
Family
ID=96300650
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2025/010330 (WO2025147678A1, pending) | Systems and methods for generating standardized reports using large language models | 2024-01-04 | 2025-01-03 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025147678A1 (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220253592A1 (en) * | 2021-02-11 | 2022-08-11 | Enlitic, Inc. | System with report analysis and methods for use therewith |
| US20230320642A1 (en) * | 2022-04-08 | 2023-10-12 | The Trustees Of Columbia University In The City Of New York | Systems and methods for techniques to process, analyze and model interactive verbal data for multiple individuals |
| US20230341950A1 (en) * | 2022-02-03 | 2023-10-26 | Keys, Inc. | Intelligent Keyboard |
| US20230368879A1 (en) * | 2022-05-13 | 2023-11-16 | Cipherome, Inc. | Health and medical history visualization and prediction using machine-learning and artificial intelligence models |
| US20230386646A1 (en) * | 2022-05-26 | 2023-11-30 | Verily Life Sciences Llc | Combined vision and language learning models for automated medical reports generation |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 25736563 Country of ref document: EP Kind code of ref document: A1 |