CN118942543B

CN118942543B - Plant genome sequencing data analysis method and analysis system based on artificial intelligence

Info

Publication number: CN118942543B
Application number: CN202410961338.7A
Authority: CN
Inventors: 闫硕; 任金威; 徐彬; 徐启江
Original assignee: Beijing Guoke Bencao Biotechnology Co ltd
Current assignee: Beijing Guoke Bencao Biotechnology Co ltd
Priority date: 2024-07-17
Filing date: 2024-07-17
Publication date: 2025-03-18
Anticipated expiration: 2044-07-17
Also published as: CN118942543A

Abstract

The present application relates to the field of artificial intelligence technology, and in particular provides a plant genome sequencing data analysis method and analysis system Plantsyn based on artificial intelligence, which accurately constructs a genome expression description vector relationship network of a target plant object through a long short-term memory network, and is adapted to a variety of feature sets. The introduction of AI sliding sampling kernel technology realizes the accurate extraction of local description vector relationship networks from complex data, and constructs vector relationship network tuples based on data distribution characteristics, thereby improving the precision of data analysis. By integrating vector relationship network tuples, a genome sequencing integrated vector set is formed, thereby improving the sensitivity and accuracy of noise detection. In this way, the noise disturbance of the sequencing data set can be comprehensively evaluated, important data quality information can be provided to scientific researchers, and subsequent data cleaning and experimental design can be assisted, significantly improving the accuracy and reliability of genome sequencing data analysis. The method and system can also be applied to plant transcriptome data analysis.

Description

Plant genome sequencing data analysis method and analysis system based on artificial intelligence

Technical Field

The application relates to the technical field of artificial intelligence, in particular to a plant genome sequencing data analysis method and system based on artificial intelligence.

Background

The use of artificial intelligence and neural network models has become increasingly popular in the field of genomic sequencing data analysis. These advanced techniques are gradually changing our way of understanding and mining genomic data with their powerful data processing capabilities and efficient algorithmic accuracy.

Artificial intelligence techniques, particularly deep learning, have been widely used for the classification, prediction and interpretation of genomic data. They can extract useful biological information from massive sequencing data, helping researchers reveal complex relationships between genes and phenotypes. By virtue of their pattern recognition and feature extraction capabilities, neural network models can find hidden rules and associations in genome data that are difficult to capture with traditional methods.

However, while artificial intelligence and neural network models have shown great potential in genomic sequencing data analysis, some challenges remain. Traditional sequencing data analysis methods, even when combined with these advanced techniques, are still limited by data processing capability and algorithm accuracy, and have difficulty fully mining the deep information in genomic data. In particular, when dealing with complex feature sets such as genome expression levels, transcript structures or epigenetic markers, the limitations of traditional methods become more obvious, and the requirements of modern scientific research on the depth and breadth of the data are often not met.

Even more serious is the noise disturbance problem that is prevalent in sequencing data. The noise is not only derived from technical limitations in the sequencing process, but also can be introduced by a plurality of links such as sample processing, data reading and the like. The existence of noise seriously affects the accuracy and reliability of data, which is a major problem that scientific researchers have to face. Unfortunately, while artificial intelligence and neural network models perform well in terms of data processing and analysis, current noise detection approaches remain to be improved in terms of accurately identifying and quantifying noise disturbances in sequencing data.

Disclosure of Invention

In order to solve the problems, the application provides an artificial intelligence-based plant genome sequencing data analysis method and an analysis system.

The embodiment of the application provides a plant genome sequencing data analysis method based on artificial intelligence, which is applied to an artificial intelligence analysis system, and comprises the following steps:

Determining a first genomic expression description vector relationship network of a first initial genomic sequencing dataset and a second genomic expression description vector relationship network of a second initial genomic sequencing dataset by using a long short-term memory network, wherein the first initial genomic sequencing dataset and the second initial genomic sequencing dataset correspond to a target plant object, and each of the first genomic expression description vector relationship network and the second genomic expression description vector relationship network is any one of a genome expression level feature set, a transcript structural feature set or an epigenetic marker feature set;

Determining, based on an AI sliding sampling kernel, at least one local description vector relation network from each of the first genome expression description vector relation network and the second genome expression description vector relation network, and determining at least one vector relation network tuple based on the data distribution feature of each local description vector relation network within its corresponding genome expression description vector relation network, wherein the first local description vector relation network and the second local description vector relation network in each vector relation network tuple belong to different genome expression description vector relation networks and have the same data distribution feature within their respective genome expression description vector relation networks;

Integrating the first local description vector relation network and the second local description vector relation network in each vector relation network binary group to obtain genome sequencing integrated vector sets, and determining sequencing noise disturbance results of the first local description vector relation network and the second local description vector relation network in the corresponding vector relation network binary group based on each genome sequencing integrated vector set;

and determining sequencing noise disturbance results of the first initial genome sequencing data set and the second initial genome sequencing data set based on the sequencing noise disturbance results of the vector relation network tuples, wherein each sequencing noise disturbance result is either the presence of sequencing noise disturbance or the absence of sequencing noise disturbance.

Further, for the first local description vector relation network and the second local description vector relation network in each vector relation network binary group, integrating the first local description vector relation network and the second local description vector relation network to obtain a genome sequencing integrated vector set, including:

Generating a genomic sequencing integrated vector set based on sequencing attention variables of respective feature elements of the first and second local description vector relationship networks, each feature element of the genomic sequencing integrated vector set comprising at least one attention dimension, the sequencing attention variable of each attention dimension being determined by at least one of the sequencing attention variables of the first local description vector relationship network corresponding to the feature element or the sequencing attention variables of the second local description vector relationship network corresponding to the feature element;

wherein the sequencing attention variable of each feature element of the genome expression level feature set is a genome expression level mapping parameter, the sequencing attention variable of each feature element of the transcript structural feature set is a transcript structural mapping parameter, each feature element of the epigenetic marker feature set corresponds to a plurality of sequencing attention variables and each sequencing attention variable is an epigenetic mapping parameter of one epigenetic attention dimension.
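The integration step above can be sketched as follows, under stated assumptions. Each local description vector relationship network is modeled as a list of feature elements, and each feature element as a list of sequencing attention variables (one value for expression level or transcript structural feature sets, several for epigenetic marker feature sets). The function name `integrate_local_networks` and the choice of per-element concatenation are illustrative assumptions; the application only states that each attention dimension is determined by at least one of the two source variables.

```python
from typing import List

# A feature element holds one or more sequencing attention variables:
# one value for expression-level or transcript-structure feature sets,
# several values (one per epigenetic attention dimension) for epigenetic sets.
FeatureElement = List[float]
LocalNetwork = List[FeatureElement]


def integrate_local_networks(first: LocalNetwork, second: LocalNetwork) -> LocalNetwork:
    """Concatenate, element by element, the attention variables of two local networks.

    Assumption (not stated in the application): both local networks cover the
    same number of feature elements, and concatenation is one admissible way
    to build the attention dimensions of the integrated vector set.
    """
    if len(first) != len(second):
        raise ValueError("local networks must have the same number of feature elements")
    return [list(a) + list(b) for a, b in zip(first, second)]


# Hypothetical toy data: an expression-level local network (one variable per
# element) paired with an epigenetic local network (two variables per element).
expression_local = [[5.6], [3.2], [8.9]]
epigenetic_local = [[0.8, 0.1], [0.3, 0.7], [0.9, 0.2]]

integrated = integrate_local_networks(expression_local, epigenetic_local)
print(integrated)
```

In this sketch every feature element of the integrated set ends up with three attention dimensions: one taken from the first local network and two from the second.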

Further, the determining, based on the AI sliding sampling kernel, of at least one local description vector relationship network from the first and second genome expression description vector relationship networks, respectively, includes:

Determining at least one first target data distribution feature in the first genome expression description vector relationship network, determining the data distribution feature which is the same as each first target data distribution feature in the second genome expression description vector relationship network as a second target data distribution feature, wherein each first target data distribution feature is used for reflecting a core sequencing embedded vector of the first genome expression description vector relationship network;

At least one local description vector relationship network is determined from each of the first target data distribution features and each of the second target data distribution features based on the AI sliding sampling kernel and the selected sampling strategy, respectively.

Further, the determining the sequencing noise disturbance result of the first initial genome sequencing dataset and the second initial genome sequencing dataset based on the sequencing noise disturbance result of each vector relation network binary group comprises:

Determining at least one target data distribution feature cluster, wherein each target data distribution feature cluster comprises a first target data distribution feature and a second target data distribution feature corresponding to the first target data distribution feature;

determining, for each target data distribution feature cluster, a first number of vector relation network tuples corresponding to the target data distribution feature cluster and a second number of those vector relation network tuples whose sequencing noise disturbance result is the absence of sequencing noise disturbance;

For each target data distribution feature cluster, determining sequencing noise disturbance results of a first target data distribution feature and a second target data distribution feature in the target data distribution feature cluster based on a first number and a second number corresponding to the target data distribution feature cluster;

determining sequencing noise disturbance results of the first initial genome sequencing dataset and the second initial genome sequencing dataset based on sequencing noise disturbance results of each of the target data distribution feature clusters.

Further, for each of the target data distribution feature clusters, determining a sequencing noise disturbance result of the first target data distribution feature and the second target data distribution feature in the target data distribution feature cluster based on the first number and the second number corresponding to the target data distribution feature cluster includes:

determining a first proportion, namely the ratio of the second number corresponding to the target data distribution feature cluster to the first number corresponding to the target data distribution feature cluster;

if the first proportion corresponding to the target data distribution feature cluster is smaller than a first threshold value, determining a noise disturbance analysis weight of the target data distribution feature cluster based on the first proportion and the second number corresponding to the target data distribution feature cluster, and determining sequencing noise disturbance results of the first target data distribution feature and the second target data distribution feature in the target data distribution feature cluster based on the noise disturbance analysis weight;

and if the first proportion corresponding to the target data distribution feature cluster is not smaller than the first threshold value, determining that the sequencing noise disturbance results of the first target data distribution feature and the second target data distribution feature in the target data distribution feature cluster are the absence of sequencing noise disturbance.
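The cluster-level decision above can be illustrated with a minimal sketch. The counting and the first-threshold branch follow the text. The application does not give a formula for the noise disturbance analysis weight, so the expression used for `weight`, the `weight_threshold` parameter and both default threshold values are purely hypothetical placeholders.

```python
from typing import List

NOISE = "sequencing noise disturbance present"
NO_NOISE = "sequencing noise disturbance absent"


def cluster_result(tuple_results: List[str],
                   first_threshold: float = 0.5,
                   weight_threshold: float = 0.2) -> str:
    """Decide the result for one target data distribution feature cluster."""
    first_number = len(tuple_results)  # all tuples in the cluster
    if first_number == 0:
        raise ValueError("a cluster needs at least one vector relation network tuple")
    # Tuples whose own result is "no sequencing noise disturbance".
    second_number = sum(1 for result in tuple_results if result == NO_NOISE)
    first_proportion = second_number / first_number

    if first_proportion >= first_threshold:
        return NO_NOISE

    # HYPOTHETICAL weight: the source only says it depends on the first
    # proportion and the second number; this formula is an assumption.
    weight = (1.0 - first_proportion) / (1 + second_number)
    return NOISE if weight >= weight_threshold else NO_NOISE


mostly_noisy = [NOISE] * 8 + [NO_NOISE] * 2
mostly_clean = [NO_NOISE] * 3 + [NOISE]
borderline = [NOISE] * 6 + [NO_NOISE] * 4

print(cluster_result(mostly_noisy))
print(cluster_result(mostly_clean))
print(cluster_result(borderline))
```

The `borderline` case shows why a second criterion matters in this sketch: its proportion falls below the first threshold, yet the hypothetical weight is too low to declare noise.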

Further, the determining the sequencing noise disturbance result of the first initial genome sequencing dataset and the second initial genome sequencing dataset based on the sequencing noise disturbance result of each target data distribution feature cluster comprises:

If the sequencing noise disturbance result of at least one target data distribution feature cluster is that the sequencing noise disturbance does not exist, determining that the sequencing noise disturbance result of the first initial genome sequencing data set and the second initial genome sequencing data set is that the sequencing noise disturbance does not exist;

If the sequencing noise disturbance result of each target data distribution feature cluster is the presence of sequencing noise disturbance, determining the sequencing noise disturbance result of the first initial genome sequencing data set and the second initial genome sequencing data set as the presence of sequencing noise disturbance.
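The two dataset-level conditions above reduce to a simple aggregation rule, sketched here. The function name `dataset_result` and the string labels are illustrative assumptions; the logic mirrors the two stated conditions and nothing more.

```python
from typing import List

NOISE = "sequencing noise disturbance present"
NO_NOISE = "sequencing noise disturbance absent"


def dataset_result(cluster_results: List[str]) -> str:
    """Aggregate per-cluster results into one result for the two datasets.

    As stated in the text: a single cluster without noise disturbance is
    enough to report absence; presence is reported only when every cluster
    shows noise disturbance.
    """
    if not cluster_results:
        raise ValueError("at least one target data distribution feature cluster is required")
    if any(result == NO_NOISE for result in cluster_results):
        return NO_NOISE
    return NOISE


print(dataset_result([NOISE, NO_NOISE, NOISE]))
print(dataset_result([NOISE, NOISE]))
```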

Further, the determining of the sequencing noise disturbance results of the first local description vector relation network and the second local description vector relation network in the corresponding vector relation network tuple based on each genome sequencing integrated vector set is implemented by a long short-term memory network, and the long short-term memory network is obtained by training through the following steps:

Determining a plurality of genome expression vector relationship sample tuples, each genome expression vector relationship sample tuple comprising a first genome expression description vector relationship sample of a first initial genome sequencing dataset sample and a second genome expression description vector relationship sample of a second initial genome sequencing dataset sample, the first initial genome sequencing dataset sample and the second initial genome sequencing dataset sample corresponding to a target plant object, each of the first genome expression description vector relationship sample and the second genome expression description vector relationship sample being any one of a genome expression level feature set, a transcript structural feature set, or an epigenetic marker feature set;

integrating the first genome expression description vector relationship sample and the second genome expression description vector relationship sample in each genome expression vector relationship sample tuple to obtain genome sequencing integrated vector set samples, inputting each genome sequencing integrated vector set sample into an initial recurrent neural network to obtain a discrimination viewpoint corresponding to each genome sequencing integrated vector set sample, and determining, based on the discrimination viewpoint, a sequencing noise disturbance prediction result for the first genome expression description vector relationship sample and the second genome expression description vector relationship sample in the genome expression vector relationship sample tuple corresponding to each genome sequencing integrated vector set sample;

Determining a training error function based on the actual sequencing noise disturbance result and the sequencing noise disturbance prediction result of the first genome expression description vector relationship sample and the second genome expression description vector relationship sample in each genome expression vector relationship sample tuple, training the initial recurrent neural network based on the training error function and each genome expression vector relationship sample tuple until the training error function meets the training termination requirement, and determining the recurrent neural network obtained on completion of training as the long short-term memory network.
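The loop described above (compute an error function from actual and predicted results, update the model, stop once the error function meets a termination requirement) can be sketched in a few lines. To stay short and dependency-free, this sketch swaps the recurrent network for a single-layer logistic model, which is NOT the network of the application. The binary cross-entropy error function, the learning rate, the tolerance and the toy samples are likewise assumptions made only for illustration.

```python
import math
from typing import List, Tuple

# (flattened integrated vector set sample, actual result: 1 = noise present)
Sample = Tuple[List[float], int]


def predict(weights: List[float], bias: float, features: List[float]) -> float:
    """Predicted probability that sequencing noise disturbance is present."""
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def error_function(weights: List[float], bias: float, samples: List[Sample]) -> float:
    """Mean binary cross-entropy between actual and predicted results."""
    eps = 1e-12
    total = 0.0
    for features, label in samples:
        p = predict(weights, bias, features)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps)
    return total / len(samples)


def train(samples: List[Sample], learning_rate: float = 0.5,
          tolerance: float = 0.1, max_epochs: int = 5000):
    """Gradient descent until the error function drops below `tolerance`."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    history = [error_function(weights, bias, samples)]
    for _ in range(max_epochs):
        if history[-1] < tolerance:  # termination requirement (assumed form)
            break
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in samples:
            diff = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += diff * x
            grad_b += diff
        n = len(samples)
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
        history.append(error_function(weights, bias, samples))
    return weights, bias, history


# Hypothetical labeled samples, not data from the application.
samples = [([1.0, 0.9], 1), ([0.8, 1.0], 1), ([0.1, 0.0], 0), ([0.0, 0.2], 0)]
weights, bias, history = train(samples)
print(len(history), history[0], history[-1])
```

A real implementation would replace `predict` with a recurrent network and would normally hold out validation data before accepting the trained model.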

Further, when the sequencing noise disturbance result is that there is a sequencing noise disturbance, the method further includes:

and denoising the first initial genome sequencing data set and the second initial genome sequencing data set based on the sequencing noise disturbance result.

Further, the method further comprises:

updating the genome sequencing system of the target plant object.

The embodiment of the application provides an artificial intelligence analysis system which comprises at least one processor and a memory, wherein the memory stores computer-executable instructions, and the at least one processor executes the computer-executable instructions stored in the memory so that the at least one processor executes the method.

Embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when run, implements the method described above.

In the embodiment of the application, first, by using the long short-term memory network, a first genome expression description vector relation network corresponding to a first initial genome sequencing dataset of a target plant object and a second genome expression description vector relation network corresponding to a second initial genome sequencing dataset can be accurately constructed. The network model is suitable not only for genome expression level feature sets but can also effectively process transcript structural feature sets or epigenetic marker feature sets, showing wide applicability.

Secondly, the embodiment of the application innovatively introduces an AI sliding sampling kernel technology. Through this technology, local description vector relation networks can be accurately extracted from the complex genome expression description vector relation network, and vector relation network tuples are then constructed based on the similarity of data distribution features. This process not only improves the granularity of data analysis, but also provides support for subsequent noise disturbance analysis.

Furthermore, the embodiment of the application forms a genome sequencing integrated vector set by integrating the local description vector relation networks in each vector relation network tuple. This step allows the sequencing noise disturbance result to be determined more accurately, thereby effectively improving the sensitivity and accuracy of noise detection.

Finally, by integrating the sequencing noise disturbance results of each vector relation network tuple, the embodiment of the application can comprehensively evaluate the sequencing noise disturbance conditions of the first initial genome sequencing data set and the second initial genome sequencing data set. The evaluation result not only provides scientific researchers with important information about data quality, but also helps guide subsequent data cleaning and experimental design, thereby remarkably improving the accuracy and reliability of genome sequencing data analysis.

In conclusion, the embodiment of the application combines the long short-term memory network with the AI sliding sampling kernel technology and brings innovation to genome sequencing data analysis. Through the long short-term memory network, a genome expression description vector relation network of a target plant object is accurately constructed, and the method is suitable for various feature sets. The introduction of the AI sliding sampling kernel technology realizes the accurate extraction of local description vector relation networks from complex data and builds vector relation network tuples based on data distribution features, thereby improving the granularity of data analysis. Further, by integrating the vector relation network tuples, a genome sequencing integrated vector set is formed, so that the sensitivity and accuracy of noise detection are improved. Therefore, the noise disturbance condition of the sequencing data set can be comprehensively evaluated, important data quality information is provided for scientific researchers, subsequent data cleaning and experimental design are assisted, and the accuracy and reliability of genome sequencing data analysis are remarkably improved.

Drawings

FIG. 1 is a flowchart of a plant genome sequencing data analysis method based on artificial intelligence according to an embodiment of the present application.

FIG. 2 is a schematic structural diagram of an artificial intelligence analysis system according to an embodiment of the present application.

Detailed Description

In order to better understand the above technical solutions, the technical solutions of the present application are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the embodiments of the present application and the specific features therein are detailed descriptions of the technical solutions of the present application and do not limit them, and that the embodiments and the technical features therein may be combined with each other where no conflict arises.

FIG. 1 shows an artificial intelligence based plant genome sequencing data analysis method, applied to an artificial intelligence analysis system, comprising the following steps 110-140.

Step 110, determining a first genomic expression description vector relationship network for a first initial genomic sequencing dataset and a second genomic expression description vector relationship network for a second initial genomic sequencing dataset using a long short-term memory network.

Wherein the first initial genome sequencing dataset and the second initial genome sequencing dataset correspond to a target plant object, and the first genome expression description vector relationship net and the second genome expression description vector relationship net are any one of a genome expression quantity feature set, a transcript structure feature set or an epigenetic marker feature set respectively.

In embodiments of the present application, the long short-term memory (LSTM) network is a special recurrent neural network (RNN) that is capable of learning long-term dependencies. In genomic sequencing data analysis, an LSTM can be trained to recognize complex patterns in sequence data, such as specific structural and functional regions in a gene sequence. Through its gating mechanism, the LSTM can effectively address long-term dependency problems in sequences, which makes it well suited to handling large-scale sequences such as genomic data.
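To make the gating mechanism concrete, the following is a minimal sketch of a standard LSTM cell written in plain Python. It is the textbook formulation (forget, input and output gates plus a candidate cell state), not the specific network of the application. The helper names, the constant untrained weights produced by `make_params` and the toy input sequence are all assumptions for illustration.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(W: Matrix, v: Vector, b: Vector) -> Vector:
    """Compute W @ v + b for small dense lists."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def make_params(input_size: int, hidden_size: int, value: float = 0.1) -> Dict[str, list]:
    """Constant, untrained weights; a real model would learn these."""
    cols = input_size + hidden_size
    params: Dict[str, list] = {}
    for gate in "fiog":  # forget, input, output, candidate
        params["W" + gate] = [[value] * cols for _ in range(hidden_size)]
        params["b" + gate] = [0.0] * hidden_size
    return params


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, list]) -> Tuple[Vector, Vector]:
    z = x + h_prev  # concatenation [x_t, h_{t-1}]
    f = [sigmoid(a) for a in affine(params["Wf"], z, params["bf"])]    # forget gate
    i = [sigmoid(a) for a in affine(params["Wi"], z, params["bi"])]    # input gate
    o = [sigmoid(a) for a in affine(params["Wo"], z, params["bo"])]    # output gate
    g = [math.tanh(a) for a in affine(params["Wg"], z, params["bg"])]  # candidate state
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def encode(sequence: List[Vector], params: Dict[str, list], hidden_size: int) -> List[Vector]:
    """Run the cell over a sequence and collect one hidden vector per step."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    states = []
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
        states.append(h)
    return states


# Hypothetical sequence of per-position values, one feature per step.
states = encode([[5.6], [3.2], [8.9]], make_params(1, 2), hidden_size=2)
print(states)
```

The cell state `c` is what carries information across many steps: the forget gate scales what is kept, and the input gate scales what is added.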

The first initial genome sequencing dataset refers to the dataset resulting from a first round of genome sequencing of a plant object of interest. These data contain detailed information about the plant genome and are the basis for subsequent analysis and comparison. The quality and integrity of the first initial genome sequencing dataset is critical to accurately resolving the structure and function of the plant genome.

The first genome expression description vector relationship network is a data structure obtained by processing the first initial genome sequencing dataset with a specific algorithm (e.g., LSTM). It converts the complex genome data into vector form, which facilitates mathematical operations and comparative analysis. The relationship network captures the expression level features, transcript structural features or epigenetic marker features in the genome, and provides researchers with an intuitive and quantitative way to understand and compare the differences and similarities between genomes.

The second initial genome sequencing dataset is similar to the first initial genome sequencing dataset, but may be obtained by sequencing different parts of the plant, at different time points or under different conditions. By comparing the first and second initial genome sequencing datasets, researchers can explore changes in gene expression of plants in different environments or at different developmental stages.

The second genomic expression description vector relationship network is a vector relationship network constructed based on the second initial genomic sequencing dataset. Similar to the first genome expression description vector relationship network, it also captures specific characteristics of the genome, but reflects information in the second initial genome sequencing dataset.

The target plant object refers to a specific plant species or individual selected by researchers, and is the focus of genome sequencing and analysis. By intensively studying the genome of the target plant object, scientists can reveal the genetic characteristics, physiological functions, environmental suitability and the like thereof.

Further, a genome expression level feature set is a set of features describing the expression level of genes under specific conditions. The expression level refers to the amount of mRNA transcribed from a gene and reflects the activity of the gene in a cell. The genome expression level feature set helps to understand which genes are activated or inhibited under specific conditions. The transcript structural feature set focuses on structural information of gene transcripts, such as the boundaries and lengths of exons and introns. Transcript structure is critical to understanding the function and regulatory mechanisms of genes, as it determines the composition of the protein coding sequence and the manner of splicing. The epigenetic marker feature set is a set of features that describe epigenetic modifications on the genome (e.g., DNA methylation and histone modification). These modifications can affect gene expression without altering the DNA sequence. The epigenetic marker feature set is important for analyzing the complex regulatory network of gene expression.

In some examples, the first genome expression description vector relationship network is a high-dimensional data structure in which each gene or gene expression product is represented as a numerical feature vector. For example, if there are 1000 genes and the expression level of each gene is quantified as a specific value, a 1000-dimensional feature vector is obtained. Each element in this vector represents the expression level of the corresponding gene, and the magnitude of the value reflects the expression intensity of the gene. The vectors form a network of relationships in space, showing the expression correlations and interactions between genes.

Similar to the first genome expression description vector relationship network, the second genome expression description vector relationship network is also composed of numerical feature vectors, but these vectors are derived from the second initial genome sequencing dataset. Thus, while its dimensions and structure may be similar to those of the first relationship network, its values reflect gene expression under different conditions or at different time points. By comparing the two relationship networks, changes in gene expression under different environments or conditions can be analyzed.

The genome expression level feature set is a set comprising a plurality of numerical features, each representing the expression level of a gene. For example, a feature vector may be [5.6, 3.2, 8.9, ..., 1.2], where each value represents the expression level of a particular gene. This feature set helps identify which genes are up-regulated or down-regulated under different conditions.

The transcript structural feature set concerns structural features of transcripts, such as exon length and the number of introns. These features may also be quantized into values that form a feature vector. For example, a structural feature vector of a transcript may be [300, 4, 200, ...], where the values represent the length of the first exon, the number of introns, the length of the second exon, and so on. These features help in understanding the complexity of transcripts and their possible functional impact.

The epigenetic marker feature set comprises a series of numerical features describing epigenetic modifications, such as DNA methylation level and histone modification status. These features can also be represented as numerical feature vectors. For example, a DNA methylation feature vector may be [0.8, 0.3, 0.9, ...], where the values represent the methylation levels of different genetic loci. These features are critical to understanding the epigenetic regulatory mechanisms of gene expression.

The above examples thus provide a means of quantifying and comparing genomic data, supporting a deeper understanding of the complexity and function of the genome.
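As a rough illustration of how such feature vectors might be held and compared in code, the sketch below wraps a vector in a small `FeatureSet` record and computes the element-wise change between two conditions. The class, the `kind` labels, the restriction to same-kind comparison and the second-condition values are all assumptions; only the first three expression values echo the example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FeatureSet:
    kind: str            # "expression", "transcript_structure" or "epigenetic" (assumed labels)
    values: List[float]  # one numerical feature per gene, exon/intron property or locus


def elementwise_change(first: FeatureSet, second: FeatureSet) -> List[float]:
    """Return second - first per feature; positive means the value went up."""
    if first.kind != second.kind:
        raise ValueError("only feature sets of the same kind are compared in this sketch")
    if len(first.values) != len(second.values):
        raise ValueError("feature sets must have the same length")
    # Rounding only hides floating-point noise in the printed result.
    return [round(b - a, 6) for a, b in zip(first.values, second.values)]


# First three values follow the text's example; the second condition is hypothetical.
condition_a = FeatureSet("expression", [5.6, 3.2, 8.9])
condition_b = FeatureSet("expression", [6.1, 3.2, 4.4])

changes = elementwise_change(condition_a, condition_b)
print(changes)
```

Here the first gene is higher in the second condition, the second is unchanged and the third is lower, which is the kind of up-regulation or down-regulation reading the text describes.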

Based on the foregoing, in step 110, the artificial intelligence analysis system begins its genomic data analysis process. First, the system uses a deep learning tool, the long short-term memory (LSTM) network, to process the first initial genome sequencing dataset. Through its memory units and gating mechanism, the LSTM network can capture long-term dependencies and complex patterns in the genome sequence, thereby generating the first genome expression description vector relationship network. Depending on the purpose and requirements of the analysis, this relationship network represents, in vector form, the expression level features, transcript structural features or epigenetic marker features of the genome. Similarly, the artificial intelligence analysis system performs the same processing on the second initial genome sequencing dataset. This dataset may originate from different tissues of the same plant, from different developmental stages or from samples under different environmental conditions. Through processing by the LSTM network, the system generates a second genome expression description vector relationship network that captures specific features of the genome, again in vector form. These two vector relationship networks provide the basis for subsequent comparison and analysis. By comparing them, researchers can explore changes in gene expression, differences in transcript structure, or the distribution of epigenetic markers in plants under different conditions. The accuracy and effectiveness of this step are critical to the overall genome sequencing data analysis process, as it provides high-quality data input and an analysis basis for subsequent steps.

Step 120, determining at least one local description vector relation network from each of the first genome expression description vector relation network and the second genome expression description vector relation network based on an AI sliding sampling kernel, and determining at least one vector relation network tuple based on the data distribution feature of each local description vector relation network in the corresponding genome expression description vector relation network.

The first local description vector relation network and the second local description vector relation network in each vector relation network binary group belong to different genome expression description vector relation networks, and the data distribution characteristics in the corresponding genome expression description vector relation networks are the same.

In the embodiment of the application, the AI sliding sampling kernel is a data processing technique that performs sliding sampling over a large, complex data set. In genomics analysis, the AI sliding sampling kernel is used to efficiently extract local data segments, i.e., local description vector relationship networks, from a large genome expression description vector relationship network. By sliding and sampling, the technique can capture fine changes and important characteristics in the data set, and supports subsequent data analysis and comparison.

The local description vector relationship network is a portion of data extracted from the whole genome expression description vector relationship network by the AI sliding sampling kernel. It represents the characteristics and relationships of a particular region or segment in the original dataset. The local description vector relationship network reflects the local characteristics and variation patterns of the genome more precisely, and helps researchers explore the complex structure and function of the genome in depth.

Data distribution characteristics refer to the distribution and characteristics of data within a particular space or range. In genomic analysis, data distribution characteristics can reveal key information such as patterns of gene expression, transcript abundance, and distribution of epigenetic markers. By comparing the data distribution characteristics of different local description vector relation networks, researchers can find the similarity and the difference in genome, so that the functions and the regulation mechanism of genes can be understood more deeply.

A vector relationship network tuple is a pairing of two local description vector relationship networks from different genome expression description vector relationship networks. Although the two local description vector relationship networks come from different original data sets, they have the same or highly similar data distribution characteristics in their respective data sets. Constructing such a tuple allows researchers to compare and analyze across data sets and to find commonalities and differences between different genomes.

The first local description vector relationship network is a local data segment extracted from the first genome expression description vector relationship network by the AI sliding sampling kernel. It represents the features and relationships of a particular region in the first dataset, providing a data basis for subsequent comparison and analysis. Similarly, the second local description vector relationship network is a local data segment extracted from the second genome expression description vector relationship network. It reflects the features and relationships of the corresponding region in the second dataset and, together with the first local description vector relationship network, forms a vector relationship network tuple for comparison and analysis across the datasets.

In step 120, the artificial intelligence analysis system applies the AI sliding sampling kernel to further analyze the previously constructed first and second genome expression description vector relationship networks. The goal of this step is to determine at least one local description vector relationship network in each and to discover similarities and differences between the data. First, the system performs sliding sampling in the first genome expression description vector relationship network through the AI sliding sampling kernel, identifying and extracting representative local data segments; these segments form the first local description vector relationship networks. This process captures local features and slight changes in the genome, and provides a data basis for subsequent comparative analysis. The system then performs the same sliding sampling operation on the second genome expression description vector relationship network to obtain the second local description vector relationship networks. The two kinds of local description vector relationship networks represent specific regions or features in the two different data sets, respectively. Next, the system compares the data distribution characteristics of these local description vector relationship networks in their respective original data sets. The system identifies local description vector relationship networks whose data distributions are the same or highly similar and pairs them to form vector relationship network tuples. Each vector relationship network tuple comprises a first local description vector relationship network from the first genome expression description vector relationship network and a second local description vector relationship network from the second genome expression description vector relationship network. 
The key to this step is to extract valuable local information from complex genomic data using the intelligence and accuracy of the AI sliding sampling kernel technique and to discover commonalities and differences between different data sets by comparing data distribution characteristics. This lays a solid foundation for subsequent genome sequencing data integration and noise disturbance analysis.
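The sampling and pairing of step 120 can be sketched as follows. This is only an illustration under stated assumptions: the source does not define how a data distribution characteristic is computed, so the sketch uses a coarse histogram of the values in each window as a hypothetical stand-in, assumes all values lie in [0, 1], and pairs windows whose histograms are identical. The function names, window size, and stride are likewise assumptions.

```python
from collections import defaultdict


def sliding_windows(network, size, stride=1):
    """Yield (start_index, window); each window holds `size` consecutive vectors."""
    for start in range(0, len(network) - size + 1, stride):
        yield start, network[start:start + size]


def distribution_signature(window, n_bins=4, low=0.0, high=1.0):
    """Hypothetical data distribution characteristic: histogram of all window values."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for vector in window:
        for value in vector:
            index = int((value - low) / width)
            counts[min(max(index, 0), n_bins - 1)] += 1  # clamp out-of-range values
    return tuple(counts)


def build_tuples(first_network, second_network, size, stride=1):
    """Pair local networks from the two inputs whose signatures are identical."""
    by_signature = defaultdict(list)
    for start, window in sliding_windows(second_network, size, stride):
        by_signature[distribution_signature(window)].append((start, window))
    tuples = []
    for start, window in sliding_windows(first_network, size, stride):
        for match in by_signature.get(distribution_signature(window), []):
            tuples.append(((start, window), match))
    return tuples


# Hypothetical toy description vector networks.
first_network = [[0.1, 0.9], [0.2, 0.8], [0.6, 0.6]]
second_network = [[0.7, 0.6], [0.15, 0.85], [0.22, 0.95]]
pairs = build_tuples(first_network, second_network, size=2)
```

Each element of `pairs` holds one local network from the first input and one from the second, together with their start positions. A practical system would likely match on similarity rather than exact equality of the signature.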

Step 130, integrating the first local description vector relationship network and the second local description vector relationship network in each vector relationship network tuple to obtain genome sequencing integrated vector sets, and determining, based on each genome sequencing integrated vector set, the sequencing noise disturbance results of the first local description vector relationship network and the second local description vector relationship network in the corresponding vector relationship network tuple.

In the present embodiment, the genome sequencing integrated vector set is a comprehensive data set formed by integrating genome sequencing data from different sources or under different conditions. In this data set, each gene or gene expression product is represented as a numerical feature vector, and these vectors together form a data set that comprehensively reflects genomic features. The integration not only helps researchers obtain more comprehensive gene expression information, but can also reveal dynamic changes and regulation mechanisms of gene expression by comparing data under different conditions.

The sequencing noise disturbance result refers to the deviation between the sequencing data and the real situation caused by various factors (such as sequencing instrument errors, sample processing differences and the like) in the genome sequencing process. Such disturbance results may be represented by abnormal fluctuations in gene expression levels, erroneous recognition of transcript structures, inaccurate measurement of epigenetic marks, or the like. By analyzing the disturbance result of the sequencing noise, scientific researchers can evaluate the accuracy and reliability of the sequencing data, thereby providing important references for subsequent data analysis and interpretation.

In detail, step 130 is a key element of the whole genome sequencing data analysis process. It involves integrating the vector relationship network tuples identified and paired in the previous step, and determining sequencing noise disturbance based on the integration result. First, the system traverses each vector relationship network tuple. Each tuple pairs a first local description vector relationship network with a second local description vector relationship network whose data distribution characteristics in their respective original data sets are identical or highly similar. The system then integrates the two networks of each tuple. The integration is not a simple combination of data: it takes the differences between the data sets into account and, through methods such as weighted averaging and data standardization, ensures that the integrated data faithfully reflects the characteristics of the original data. After integration is completed, the system obtains a genome sequencing integrated vector set. This vector set combines information from the different data sets and describes genomic features more comprehensively and accurately. Next, based on the integrated vector set, the system analyzes the sequencing noise disturbance of the first and second local description vector relationship networks in each tuple. In this analysis, the system uses statistical methods and machine learning algorithms to compare the integrated vector set with the original data. The differences found may result from various noise factors in the sequencing process, such as instrument errors and sample preparation differences. The system evaluates these differences quantitatively and generates a sequencing noise disturbance result. 
The result not only can help scientific researchers to know the accuracy and reliability of sequencing data, but also can provide important references for subsequent data interpretation and experimental result verification. It can be seen that step 130 is a comprehensive data processing and analysis process that reveals noise disturbance in the genome sequencing data by integrating and comparing information in different data sets, providing more accurate and comprehensive data support for researchers.
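The integration and comparison in step 130 can be sketched as below. The source names weighted averaging and data standardization but does not fix the formulas, so the z-score standardization, the element-wise weighted average, the mean absolute deviation used as the disturbance score, and the `threshold` value are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def standardize(local_network):
    """Z-score all values of one local network (assumed standardization)."""
    values = [v for vector in local_network for v in vector]
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero for constant data
    return [[(v - mu) / sigma for v in vector] for vector in local_network]


def integrate(first_local, second_local, weight_first=0.5):
    """Element-wise weighted average of the two standardized local networks."""
    a, b = standardize(first_local), standardize(second_local)
    w = weight_first
    return [[w * x + (1.0 - w) * y for x, y in zip(va, vb)]
            for va, vb in zip(a, b)]


def noise_disturbance(first_local, second_local, threshold=0.5, weight_first=0.5):
    """Return (score, disturbed): mean absolute deviation from the integrated set."""
    integrated = integrate(first_local, second_local, weight_first)
    deviations = []
    for local in (standardize(first_local), standardize(second_local)):
        for vector, reference in zip(local, integrated):
            deviations.extend(abs(x - r) for x, r in zip(vector, reference))
    score = mean(deviations)
    return score, score > threshold  # threshold is a hypothetical cut-off


# Two local networks that agree, and two that disagree strongly.
same_score, same_flag = noise_disturbance([[0.1, 0.9], [0.2, 0.8]],
                                          [[0.1, 0.9], [0.2, 0.8]])
diff_score, diff_flag = noise_disturbance([[0.1, 0.9], [0.2, 0.8]],
                                          [[0.9, 0.1], [0.8, 0.2]])
```

Identical local networks deviate from their integrated set by zero, while networks with opposite patterns produce a large score and are flagged as disturbed.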

Step 140, determining sequencing noise disturbance results of the first initial genome sequencing dataset and the second initial genome sequencing dataset based on the sequencing noise disturbance results of the vector relationship network tuples, wherein each sequencing noise disturbance result is either the presence of sequencing noise disturbance or the absence of sequencing noise disturbance.

In the embodiment of the application, the system aims to judge comprehensively whether sequencing noise disturbance exists in the first initial genome sequencing dataset and the second initial genome sequencing dataset, based on the sequencing noise disturbance results of the vector relationship network tuples obtained in the previous step.

First, the system aggregates and analyzes the sequencing noise disturbance results of all vector relation network tuples. These results contain detailed information about the differences between the local description vector relationship network and the raw data during the integration process. The system scrutinizes these perturbation results, particularly with respect to those tuples that exhibit significant noise perturbations.

Next, the system performs statistics and analysis on these noise disturbance results. It calculates the proportion of the tuples where noise disturbances are present, as well as the magnitude and distribution of these disturbances. Such information will assist the system in comprehensively evaluating the sequencing quality of the entire dataset.

Based on these detailed noise disturbance analyses, the system then makes a comprehensive decision. If most vector relational network tuples exhibit significant noise disturbance, or if some critical region tuples have severe noise problems, the system may determine that sequencing noise disturbance exists across the first or second initial genomic sequencing dataset. Conversely, if only a few tuples exhibit a slight noise disturbance, while most of the data remains stable and consistent, the system may determine that there is no significant sequencing noise disturbance in the data set.

Finally, the system generates a detailed report containing the analysis result of the sequencing noise disturbance and the comprehensive judgment of the system. This report will provide valuable information to the researchers, helping them to understand the quality and reliability of the sequencing data, and thus guiding subsequent experimental design and data analysis efforts.

It can be seen that step 140 is a key step in comprehensively evaluating the noise disturbance of the sequencing data. By analyzing the noise disturbance results of the vector relationship network tuples, it gives researchers a judgment of the quality of the whole sequencing dataset, which helps ensure the accuracy and reliability of subsequent research.
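The aggregation described for step 140 can be sketched as follows. The decision rule here is one possible reading of the description (disturbance if most tuples are disturbed, or if a tuple in a critical region is severely disturbed). The `TupleResult` record, the `critical` flag, and the `majority` and `severe_score` cut-offs are hypothetical and not given by the source.

```python
from dataclasses import dataclass


@dataclass
class TupleResult:
    disturbed: bool         # disturbance result of one vector relationship network tuple
    score: float            # magnitude of the disturbance
    critical: bool = False  # hypothetical marker for a critical genomic region


def dataset_noise_result(results, majority=0.5, severe_score=1.0):
    """Combine per-tuple results into one dataset-level conclusion."""
    if not results:
        raise ValueError("at least one tuple result is required")
    proportion = sum(r.disturbed for r in results) / len(results)
    severe_critical = any(r.critical and r.score >= severe_score for r in results)
    disturbed = proportion > majority or severe_critical
    return {
        "proportion_disturbed": proportion,
        "severe_critical_region": severe_critical,
        "result": "disturbance" if disturbed else "no disturbance",
    }


results = [
    TupleResult(False, 0.1),
    TupleResult(True, 0.6),
    TupleResult(False, 0.2),
    TupleResult(False, 0.0),
]
report = dataset_noise_result(results)
```

With one disturbed tuple out of four and no severely disturbed critical region, the sketch reports no disturbance for the dataset.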

In summary, in a complete plant genome study case, an artificial intelligence analysis system analyzes and processes two sets of genome sequencing data to detect potential sequencing noise disturbance. First, the artificial intelligence analysis system obtains a first initial genome sequencing dataset and a second initial genome sequencing dataset of a target plant object. These datasets may be derived from different tissue samples of the same plant, or from sequencing of the same tissue at different time points or under different conditions. The system uses a long short-term memory network (LSTM), a deep learning model, to construct a first and a second genome expression description vector relationship network, respectively. Depending on the needs and purpose of the analysis, these relationship networks capture genome expression level features, transcript structure features, or epigenetic mark features. Next, the artificial intelligence analysis system uses the AI sliding sampling kernel to sample locally in the two genome expression description vector relationship networks. The purpose of this step is to identify and extract local description vector relationship networks with similar data distribution characteristics and to form vector relationship network tuples. The two local description vector relationship networks in each tuple come from different initial genome sequencing datasets, but they exhibit the same data distribution characteristics in their respective datasets. Then, the system integrates the local description vector relationship networks in each vector relationship network tuple to generate a genome sequencing integrated vector set. This process helps to compare the corresponding portions of the two initial datasets comprehensively and thus to identify sequencing noise more accurately. 
By comparing and analyzing these integrated vector sets, the artificial intelligence analysis system determines the sequencing noise disturbance result of each vector relationship network tuple. Finally, based on the sequencing noise disturbance results of all vector relationship network tuples, the artificial intelligence analysis system obtains an overall sequencing noise disturbance conclusion for the first initial genome sequencing dataset and the second initial genome sequencing dataset. This conclusion indicates whether sequencing noise disturbance is present in the datasets, helping researchers evaluate data quality and guiding subsequent experimental design and data analysis. In this way, the artificial intelligence analysis system not only improves the efficiency and accuracy of genome sequencing data processing, but also assists in understanding and exploring the complexity and dynamics of plant genomes.

The embodiment of the application provides an accurate and efficient analysis method for genome sequencing data of a target plant object by combining a long short-term memory network with the AI sliding sampling kernel. Specifically, through the long short-term memory network, a first genome expression description vector relationship network of a first initial genome sequencing dataset and a second genome expression description vector relationship network of a second initial genome sequencing dataset can be accurately constructed, and the genome expression level feature set, the transcript structure feature set, and the epigenetic mark feature set can be effectively expressed. Further, by means of the AI sliding sampling kernel, local description vector relationship networks can be accurately extracted from a complex vector relationship network, and vector relationship network tuples are constructed based on the similarity of data distribution characteristics, which not only improves the accuracy of data analysis but also lays a foundation for subsequent noise disturbance analysis.

In addition, another advantage of embodiments of the present application is that, by integrating the local description vector relationship networks in the vector relationship network tuples, a genome sequencing integrated vector set is formed, thereby enabling a more accurate determination of sequencing noise disturbance results. This approach not only improves the sensitivity of noise detection, but also provides a comprehensive assessment of the quality of the sequencing data. Finally, by combining the sequencing noise disturbance results of the vector relationship network tuples, whether sequencing noise disturbance exists in the first initial genome sequencing dataset and the second initial genome sequencing dataset can be accurately judged, which provides data support for researchers and improves the accuracy and reliability of genome sequencing data analysis.

In some alternative embodiments, integrating the first local description vector relationship network and the second local description vector relationship network in each vector relationship network tuple to obtain a genome sequencing integrated vector set comprises: generating the genome sequencing integrated vector set based on the sequencing attention variables of the feature elements of the first local description vector relationship network and the second local description vector relationship network. Each feature element of the genome sequencing integrated vector set comprises at least one attention dimension, and the sequencing attention variable of each attention dimension is determined by at least one of the sequencing attention variable of the corresponding feature element in the first local description vector relationship network or the sequencing attention variable of the corresponding feature element in the second local description vector relationship network. The sequencing attention variable of each feature element of the genome expression level feature set is a genome expression mapping parameter; the sequencing attention variable of each feature element of the transcript structure feature set is a transcript structure mapping parameter; and each feature element of the epigenetic mark feature set corresponds to a plurality of sequencing attention variables, each of which is a mapping parameter of one epigenetic attention dimension.

Based on this embodiment, the artificial intelligence analysis system integrates the first local description vector relationship network and the second local description vector relationship network in each vector relationship network tuple in a specific way, so that a genome sequencing integrated vector set is obtained. The key to this process is the use of sequencing attention variables, which indicate the importance of each feature element in the genome sequencing process.

First, the system examines each feature element in the first and second local description vector relationship networks. These characteristic elements may represent the amount of expression of the genome, the structure of transcripts, or epigenetic markers, etc., which constitute an important part of biological genetic information.

Next, the system calculates one or more sequencing attention variables for each feature element. Each variable is essentially a mapping parameter that reflects which feature elements deserve more attention and which are relatively minor in the sequencing process. Specifically, for the genome expression level feature set, the sequencing attention variable of each feature element is a genome expression mapping parameter. For the transcript structure feature set, the sequencing attention variable of each feature element is a transcript structure mapping parameter. For the epigenetic mark feature set, each feature element corresponds to a plurality of sequencing attention variables because of its complexity, and each variable is a mapping parameter of one epigenetic attention dimension.

After determining these sequencing attention variables, the artificial intelligence analysis system generates a set of genomic sequencing integration vectors from them. Each feature element of the integrated vector set comprises at least one attention dimension, and the sequencing attention variable for each attention dimension is determined by the sequencing attention variable for the corresponding feature element in the first or second local description vector relationship network. Such processing not only preserves the richness of the raw data, but also, by introducing attention mechanisms, allows the system to be more focused on those important feature elements when processing genomic sequencing data.

In this way, the artificial intelligence analysis system can more effectively mine deep information in genome sequencing data, and improve the processing capacity of complex genome feature sets. At the same time, due to the introduction of the attention mechanism, the system can also show greater robustness against noise disturbances, as it can more accurately identify and focus on those characteristic elements that have a greater impact on the analysis results. Thus, the accuracy and efficiency of the artificial intelligence analysis system for processing genome sequencing data are improved, and the stability and reliability of the system in the face of complex data and noise disturbance are also enhanced. The method is an important technological breakthrough of artificial intelligence in the field of genome sequencing data analysis, is expected to provide more accurate and more efficient data analysis tools for scientific researchers, and promotes the deep development of genomics research.
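The attention-based integration above can be sketched as follows. The representation is hypothetical: each local network is a dictionary mapping a feature element name to its list of sequencing attention variables (one entry for expression or transcript structure elements, several for epigenetic mark elements). The source only states that each integrated variable is determined by at least one of the two corresponding variables. Averaging when both are present and copying when only one is present is an assumed rule.

```python
def integrate_attention(first_local, second_local):
    """Build the integrated vector set from per-element attention variables."""
    integrated = {}
    for element in sorted(set(first_local) | set(second_local)):
        a = first_local.get(element)
        b = second_local.get(element)
        if a is None or b is None:
            # Only one local network carries this element: use its variables as-is.
            integrated[element] = list(a if a is not None else b)
            continue
        if len(a) != len(b):
            raise ValueError(f"attention dimensions differ for {element!r}")
        # Assumed rule: average the two variables of each attention dimension.
        integrated[element] = [(x + y) / 2 for x, y in zip(a, b)]
    return integrated


# Hypothetical elements: a gene (one dimension) and a methylation site (two dimensions).
first_local = {"geneA": [0.75], "siteM": [0.25, 0.5]}
second_local = {"geneA": [0.25], "siteM": [0.75, 0.5], "geneB": [0.5]}
integrated_set = integrate_attention(first_local, second_local)
```

Other rules that satisfy the same wording, such as taking the larger of the two variables, could be substituted in the marked line without changing the rest of the sketch.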

In other preferred embodiments, determining at least one local description vector relationship network from each of the first genome expression description vector relationship network and the second genome expression description vector relationship network based on the AI sliding sampling kernel comprises: determining at least one first target data distribution feature in the first genome expression description vector relationship network; determining, in the second genome expression description vector relationship network, the data distribution feature that is the same as each first target data distribution feature as a second target data distribution feature, wherein each first target data distribution feature reflects a core sequencing embedded vector of the first genome expression description vector relationship network; and determining at least one local description vector relationship network from each first target data distribution feature and each second target data distribution feature, respectively, based on the AI sliding sampling kernel and a selected sampling strategy.

Under this design, the artificial intelligence analysis system adopts a refined method to determine at least one local description vector relationship network from each of the first genome expression description vector relationship network and the second genome expression description vector relationship network. This process involves identifying and matching data distribution features and extracting local description vector relationship networks based on the AI sliding sampling kernel.

First, the system further analyzes the first genome expression description vector relationship network to determine at least one first target data distribution feature therein. These data distribution features reflect core information and key patterns in the genome sequencing data, similar to "fingerprints" or "markers" in the data. The system identifies these features algorithmically; they represent the core sequencing embedded vectors in the vector relationship network and are an important basis for subsequent analysis.

Next, the system searches the second genomic expression description vector relationship network for a data distribution feature that matches the first target data distribution feature, and determines it as a second target data distribution feature. This process is similar to finding a common "signature" or "fingerprint" in both sets of data to ensure that the local description vector relationship networks extracted from the two vector relationship networks have similar or identical data characteristics, allowing for efficient comparison and analysis.

After determining these target data distribution features, the artificial intelligence analysis system uses the AI sliding sampling kernel and the selected sampling strategy to extract local description vector relationship networks from these features. The AI sliding sampling kernel slides across the data set and captures local features of the data according to the selected sampling strategy. In this way, the system can accurately extract the local description vector relationship networks that match the target data distribution features, providing an accurate data basis for subsequent noise detection and data analysis.

In this way, not only is the accuracy of data analysis improved, but the system can also process complex genome sequencing data more efficiently. The method ensures that the local description vector relationship networks extracted from the two different genome expression description vector relationship networks are highly similar and comparable, thereby improving the accuracy and reliability of subsequent noise detection and data analysis. By this method, the artificial intelligence analysis system can mine hidden information and association rules in genome sequencing data more deeply, and provide comprehensive and accurate data analysis results for researchers.
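The feature matching and sampling of this embodiment can be sketched as follows. The source does not say how a target data distribution feature is computed or which sampling strategies are available, so this sketch assumes that values lie in [0, 1], that the feature of a position is the bin of its mean value, that the "core" features are the most frequent ones in the first network, and that the sampling strategy is a fixed window size and stride. All function names are hypothetical.

```python
from collections import Counter


def position_feature(vector, n_bins=4):
    """Hypothetical distribution feature of one position: bin of its mean in [0, 1]."""
    m = sum(vector) / len(vector)
    return min(int(m * n_bins), n_bins - 1)


def target_features(first_network, second_network, top_k=2):
    """First target features (most frequent in the first network) also found in the second."""
    first_counts = Counter(position_feature(v) for v in first_network)
    first_targets = [f for f, _ in first_counts.most_common(top_k)]
    second_present = {position_feature(v) for v in second_network}
    return [f for f in first_targets if f in second_present]


def sample_local_networks(network, feature, size=2, stride=1):
    """Slide a window and keep those whose positions all carry the target feature."""
    kept = []
    for start in range(0, len(network) - size + 1, stride):
        window = network[start:start + size]
        if all(position_feature(v) == feature for v in window):
            kept.append((start, window))
    return kept


first_network = [[0.1, 0.1], [0.2, 0.0], [0.9, 0.9], [0.8, 1.0], [0.1, 0.2]]
second_network = [[0.9, 0.8], [1.0, 0.9], [0.5, 0.5]]
shared = target_features(first_network, second_network)
first_locals = {f: sample_local_networks(first_network, f) for f in shared}
second_locals = {f: sample_local_networks(second_network, f) for f in shared}
```

Local networks extracted under the same shared feature from the two inputs are then candidates for pairing into vector relationship network tuples.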

Further, determining the sequencing noise disturbance results of the first initial genome sequencing dataset and the second initial genome sequencing dataset based on the sequencing noise disturbance results of the vector relationship network tuples comprises: determining at least one target data distribution feature cluster, each target data distribution feature cluster comprising one first target data distribution feature and the second target data distribution feature corresponding to it; determining, for each target data distribution feature cluster, a first number of vector relationship network tuples corresponding to the cluster and a second number of those vector relationship network tuples whose sequencing noise disturbance result is the absence of sequencing noise disturbance; determining the sequencing noise disturbance results of the first target data distribution feature and the second target data distribution feature in each target data distribution feature cluster based on the first number and the second number of that cluster; and determining the sequencing noise disturbance results of the first initial genome sequencing dataset and the second initial genome sequencing dataset based on the sequencing noise disturbance results of the target data distribution feature clusters.

Based on this embodiment, the artificial intelligence analysis system takes a series of fine operations to determine sequencing noise perturbation results for the first initial genomic sequencing dataset and the second initial genomic sequencing dataset. The process involves the determination of a target data distribution feature cluster, statistics of vector relation network tuples, and determination of sequencing noise disturbance results based on the data.

First, the system determines at least one target data distribution feature cluster. The clusters of features are made up of first target data distribution features and their corresponding second target data distribution features. In brief, each feature cluster contains similar or matched data distribution features from the two initial genome sequencing datasets, which provide the basis for subsequent comparison and analysis.

Next, the system counts the number of vector relationship network tuples corresponding to each target data distribution feature cluster, namely the first number. It also counts, among those tuples, the number whose sequencing noise disturbance result is the absence of sequencing noise disturbance, namely the second number. These two counts are the basis for the subsequent judgment of sequencing noise disturbance.

Then, for each target data distribution feature cluster, the system determines the sequencing noise disturbance results of the first and second target data distribution features in the feature cluster based on its first and second numbers. Specifically, a high ratio of the second number to the first number indicates that most vector relationship network tuples in the feature cluster are not disturbed by sequencing noise, so the feature cluster can be judged to have little or no sequencing noise disturbance. Conversely, a low ratio indicates that the sequencing noise disturbance is more severe.

Finally, based on the sequencing noise disturbance results of each target data distribution feature cluster, the system comprehensively judges the sequencing noise disturbance results of the first initial genome sequencing data set and the second initial genome sequencing data set. The process is similar to summary scoring, and the system comprehensively considers the conditions of each feature cluster to give an overall sequencing noise disturbance evaluation result.
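The counting described above can be sketched as follows. The input format (one `(cluster_id, disturbed)` pair per vector relationship network tuple) and the function names are assumptions made for the example.

```python
from collections import defaultdict


def cluster_counts(tuple_results):
    """Return {cluster_id: (first_number, second_number)}.

    first_number  = tuples belonging to the cluster
    second_number = those tuples whose result is absence of disturbance
    """
    counts = defaultdict(lambda: [0, 0])
    for cluster_id, disturbed in tuple_results:
        counts[cluster_id][0] += 1
        if not disturbed:
            counts[cluster_id][1] += 1
    return {cluster: tuple(pair) for cluster, pair in counts.items()}


def undisturbed_ratio(first_number, second_number):
    """Ratio of the second number to the first number for one cluster."""
    return second_number / first_number


# Hypothetical per-tuple results for two feature clusters.
tuple_results = [
    ("c1", False), ("c1", False), ("c1", True),
    ("c2", True), ("c2", True),
]
counts = cluster_counts(tuple_results)
ratios = {c: undisturbed_ratio(n1, n2) for c, (n1, n2) in counts.items()}
```

In this toy input, two of the three tuples in cluster `c1` are undisturbed, while neither tuple in `c2` is, so `c2` would be judged the more severely disturbed cluster.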

By this method, the artificial intelligence analysis system can comprehensively and accurately evaluate the sequencing noise disturbance of the two initial genome sequencing datasets. The method not only helps researchers understand the quality and reliability of the data, but also provides a reference for subsequent data cleaning, experimental design, and data analysis. Meanwhile, the evaluation method based on target data distribution feature clusters and vector relationship network tuples is general and flexible, and can be applied to noise detection and quality control for various kinds of genome sequencing data. In other words, this series of refined operations and analyses not only improves the accuracy and sensitivity of sequencing noise disturbance detection, but also provides a new solution for quality control of genome sequencing data.

In some exemplary embodiments, for each target data distribution feature cluster, determining the sequencing noise disturbance results of the first and second target data distribution features in the target data distribution feature cluster based on the first number and the second number corresponding to the cluster comprises: determining a first duty cycle, namely the ratio of the second number to the first number corresponding to the target data distribution feature cluster; if the first duty cycle of the target data distribution feature cluster is less than a first threshold, determining a noise disturbance analysis weight of the cluster based on the first duty cycle and the second number corresponding to the cluster, and determining the sequencing noise disturbance results of the first and second target data distribution features in the cluster based on the noise disturbance analysis weight; and if the first duty cycle of the target data distribution feature cluster is not less than the first threshold, determining that the sequencing noise disturbance results of the first and second target data distribution features in the cluster are the absence of sequencing noise disturbance.

Based on this embodiment, the artificial intelligence analysis system employs a specific method to determine sequencing noise perturbation results for the first target data distribution feature and the second target data distribution feature in each target data distribution feature cluster. This process combines statistics with preset threshold values to ensure accuracy and reliability of the results.

First, the system determines, for each target data distribution feature cluster, the ratio of the corresponding second number to the corresponding first number; this ratio is referred to as the first duty cycle. It reflects the proportion of vector relation network binary groups in the feature cluster that are not disturbed by sequencing noise, and is an important index for evaluating sequencing noise disturbance.

Next, the system determines whether the first duty cycle is less than a predetermined first threshold. The first threshold is set according to experience or actual requirements and is used to distinguish whether sequencing noise disturbance is present.

If the first duty cycle is less than the first threshold, a comparatively large share of the vector relation network binary groups in the target data distribution feature cluster are disturbed by sequencing noise, so the cluster cannot simply be regarded as free of sequencing noise disturbance. In this case, the system determines a noise disturbance analysis weight based on the first duty cycle and the second number. This weight reflects the degree to which the feature cluster is disturbed by sequencing noise: the greater the weight, the more severe the disturbance. The system then uses this weight to determine the sequencing noise disturbance result of the first target data distribution feature and the second target data distribution feature in the feature cluster. The result is a quantified evaluation index that takes the weight and other relevant factors into account.

If the first duty cycle is not less than the first threshold, most vector relation network binary groups in the target data distribution feature cluster are not disturbed by sequencing noise, so the system determines that the sequencing noise disturbance result of the first target data distribution feature and the second target data distribution feature in the feature cluster is that sequencing noise disturbance is absent.
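The per-cluster decision described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not give the formula for the noise disturbance analysis weight, so the expression in `evaluate_cluster` is a hypothetical placeholder that merely depends on the first duty cycle and the second number and grows as disturbance becomes more severe. The names `ClusterResult`, `evaluate_cluster` and the default threshold of 0.8 are likewise assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterResult:
    first_duty_cycle: float
    weight: Optional[float]  # None when the cluster is judged undisturbed
    disturbed: bool


def evaluate_cluster(first_number: int, second_number: int,
                     first_threshold: float = 0.8) -> ClusterResult:
    """first_number: all binary groups in the cluster.
    second_number: binary groups judged free of sequencing noise disturbance."""
    if first_number <= 0 or not 0 <= second_number <= first_number:
        raise ValueError("expected first_number > 0 and 0 <= second_number <= first_number")

    duty = second_number / first_number  # the "first duty cycle"
    if duty >= first_threshold:
        # Most binary groups are undisturbed: report no disturbance.
        return ClusterResult(duty, None, False)

    # Hypothetical weight (assumption): equals 1 when no group is undisturbed
    # and decreases as the duty cycle and the second number grow.
    weight = 1.0 - duty * second_number / (second_number + 1)
    return ClusterResult(duty, weight, True)
```

For instance, a cluster with 10 binary groups of which 9 are undisturbed is reported as undisturbed under the assumed threshold, while a cluster with only 4 undisturbed groups is reported as disturbed together with a severity weight.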

By this method, the artificial intelligence analysis system can give different sequencing noise disturbance evaluation results for different situations, which better matches actual conditions. The method is also flexible and extensible, and can be adjusted and optimized for different requirements. The evaluation method based on the first duty cycle and the noise disturbance analysis weight therefore improves the accuracy and reliability of the evaluation, helps researchers better understand the quality and reliability of genome sequencing data, and supports subsequent data analysis and research.

In some possible embodiments, determining the sequencing noise disturbance result of the first and second initial genomic sequencing data sets based on the sequencing noise disturbance result of each target data distribution feature cluster includes: determining that the sequencing noise disturbance result of the first and second initial genomic sequencing data sets is that sequencing noise disturbance is absent if the sequencing noise disturbance result of at least one target data distribution feature cluster is that sequencing noise disturbance is absent; and determining that the sequencing noise disturbance result of the first and second initial genomic sequencing data sets is that sequencing noise disturbance is present if the sequencing noise disturbance result of every target data distribution feature cluster is that sequencing noise disturbance is present.

In this embodiment, the artificial intelligence analysis system comprehensively determines the sequencing noise disturbance conditions of the first initial genome sequencing dataset and the second initial genome sequencing dataset according to the sequencing noise disturbance results of each target data distribution feature cluster. This process is based on a comprehensive and careful evaluation method to ensure the accuracy and reliability of the final conclusions.

First, the system examines the sequencing noise perturbation results of all target data distribution feature clusters. These feature clusters were previously derived through a series of complex analyses, each representing a particular data distribution pattern in the data set. Thus, the noise disturbance analysis of these feature clusters can directly reflect the noise condition of the entire dataset.

If the system finds that the sequencing noise disturbance result of at least one target data distribution feature cluster is that the sequencing noise disturbance is not present, it determines that the sequencing noise disturbance result of the first initial genome sequencing data set and the second initial genome sequencing data set is that the sequencing noise disturbance is not present. This is because as long as one feature cluster is not affected by noise, it is indicated that at least a portion of the data in the entire data set is reliable and not disturbed by noise.

On the other hand, if the sequencing noise disturbance results of all the target data distribution feature clusters indicate that a sequencing noise disturbance is present, the system determines that the sequencing noise disturbance results of the first initial genomic sequencing data set and the second initial genomic sequencing data set are that a sequencing noise disturbance is present. This means that the whole data set is widely affected by noise, and none of the parts is completely reliable.
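The aggregation rule in the two preceding paragraphs reduces to a logical "all" over the per-cluster results. A minimal sketch, in which the function name `dataset_disturbed` and the boolean encoding (True meaning disturbance is present) are assumptions made for illustration:

```python
from typing import Iterable


def dataset_disturbed(cluster_disturbed: Iterable[bool]) -> bool:
    """Return True only if every feature cluster reports disturbance.

    A single undisturbed cluster is enough for the overall result to be
    'sequencing noise disturbance absent', as described in the text.
    """
    flags = list(cluster_disturbed)
    if not flags:
        raise ValueError("at least one cluster result is required")
    return all(flags)
```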

By this comprehensive evaluation method, the artificial intelligence analysis system can provide a clear conclusion about the overall dataset sequencing noise disturbance situation. This conclusion considers not only the individual parts of the dataset but also the detailed feature cluster analysis, and thus has a high accuracy and reliability.

In some alternative embodiments, determining, based on each genome sequencing integrated vector set, the sequencing noise disturbance result of the first local description vector relation network and the second local description vector relation network in the corresponding vector relation network binary group is implemented by a long short-term memory network. The long short-term memory network is obtained through debugging as follows: determining a plurality of genome expression vector relation network sample binary groups, where each sample binary group comprises a first genome expression description vector relation network sample of a first initial genome sequencing data set sample and a second genome expression description vector relation network sample of a second initial genome sequencing data set sample, the first and second initial genome sequencing data set samples correspond to the target plant object, and the first and second genome expression description vector relation network samples are each any one of a genome expression level feature set, a transcript structural feature set, or an epigenetic signature feature set; integrating the first and second genome expression description vector relation network samples in each sample binary group to obtain a genome sequencing integrated vector set sample; inputting each genome sequencing integrated vector set sample into an initial recurrent neural network to obtain a discrimination viewpoint corresponding to each sample; determining, based on the discrimination viewpoint, a sequencing noise disturbance prediction result of the first and second genome expression description vector relation network samples in the sample binary group corresponding to each genome sequencing integrated vector set sample; determining a debugging error function based on the actual sequencing noise disturbance result and the sequencing noise disturbance prediction result of the first and second genome expression description vector relation network samples in each sample binary group; debugging the initial recurrent neural network based on the debugging error function and each sample binary group until the debugging error function meets the debugging termination requirement; and determining the recurrent neural network obtained on completion of debugging as the long short-term memory network.

Based on this embodiment, the artificial intelligence analysis system uses a long short-term memory (LSTM) network to determine, from each genome sequencing integrated vector set, the sequencing noise disturbance result of the corresponding vector relation network binary group (i.e., the first local description vector relation network and the second local description vector relation network). An LSTM network is a special kind of recurrent neural network (RNN) that can process sequence data efficiently and retain long-term dependencies, which gives it significant advantages in processing genomic sequencing data.

To train this long short-term memory network, the system first determines a plurality of genome expression vector relation network sample tuples. The sample tuples are extracted from genomic sequencing data of a target plant object, each sample tuple comprising a first genome expression description vector relation network sample of a first initial genomic sequencing data set sample and a second genome expression description vector relation network sample of a second initial genomic sequencing data set sample. These samples may be any of a genome expression level feature set, a transcript structural feature set, or an epigenetic signature feature set, representing different levels and characteristics of the genome.

Next, the system integrates the first and second genome expression description vector relation network samples in these sample tuples to obtain genome sequencing integrated vector set samples. The integrated vector set samples are then input into an initial recurrent neural network to obtain a discrimination viewpoint corresponding to each sample. Each discrimination viewpoint is the network's preliminary determination of whether sequencing noise disturbance is present for that sample.

Based on these discrimination viewpoints, the system further determines sequencing noise disturbance prediction results for the first and second genome expression description vector relation network samples in the sample tuple corresponding to each genome sequencing integrated vector set sample. These prediction results are the network's specific predictions of sequencing noise disturbance and are compared with the actual results.

To train and optimize the network, the system determines a debug error function based on the actual sequencing noise disturbance results and predicted results of the first and second genome expression description vector relationship network samples in each of the sample tuples. This function measures the difference between the network predictions and the actual results and is a key indicator for optimizing the network performance.

Using the sample tuples and the debugging error function, the system repeatedly debugs and optimizes the initial recurrent neural network until the debugging error function meets the preset debugging termination requirement (for example, the error falls below a certain threshold or the maximum number of iterations is reached). The recurrent neural network obtained on completion of debugging is then taken as the required long short-term memory network.
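The debugging loop and its termination requirement can be sketched as below. To stay dependency-free, a single logistic unit stands in for the recurrent network, so only the loop structure (prediction, error function, parameter update, termination test) reflects the text; it is not an LSTM. The choice of binary cross-entropy as the debugging error function, the learning rate, and all function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

# (integrated vector, actual result: 1 = disturbance present, 0 = absent)
Sample = Tuple[Sequence[float], int]


def predict(weights: List[float], bias: float, x: Sequence[float]) -> float:
    """Predicted probability that sequencing noise disturbance is present."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def debug_error(weights: List[float], bias: float, samples: List[Sample]) -> float:
    """Mean binary cross-entropy between predictions and actual results."""
    eps = 1e-12
    total = 0.0
    for x, y in samples:
        p = min(max(predict(weights, bias, x), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(samples)


def debug_model(samples: List[Sample], lr: float = 0.5,
                error_threshold: float = 0.1, max_iterations: int = 5000):
    """Gradient descent that stops when the error is below the threshold
    or the maximum number of iterations is reached."""
    dim = len(samples[0][0])
    weights, bias = [0.0] * dim, 0.0
    error = debug_error(weights, bias, samples)
    iteration = 0
    n = len(samples)
    while iteration < max_iterations and error >= error_threshold:
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in samples:
            diff = predict(weights, bias, x) - y
            for i, v in enumerate(x):
                grad_w[i] += diff * v
            grad_b += diff
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        error = debug_error(weights, bias, samples)
        iteration += 1
    return weights, bias, error, iteration
```

In a real system the stand-in model would be replaced by a recurrent network from a deep learning framework, while the two termination conditions would remain the same.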

Thus, not only is the accuracy and efficiency of analysis improved, but also the system is enabled to process complex genome sequencing data and effectively identify and predict sequencing noise disturbance. This is of great importance for subsequent data cleaning, experimental design and data analysis.

In some exemplary embodiments, when the sequencing noise disturbance result is that a sequencing noise disturbance is present, the method further comprises denoising the first initial genomic sequencing data set and the second initial genomic sequencing data set based on the sequencing noise disturbance result.

In this embodiment, when the sequencing noise disturbance result indicates that sequencing noise disturbance is present, the artificial intelligence analysis system takes a series of targeted measures to denoise the first initial genome sequencing data set and the second initial genome sequencing data set.

First, the artificial intelligence analysis system analyzes the source and nature of the sequencing noise. This noise may originate from errors in the sequencing instrument itself, contamination during sample processing, or interference during data reading. The system uses algorithms such as deep learning models to identify the noise and distinguish it from the actual genomic sequencing signal.

Next, the system cleans the data using sophisticated statistical methods and machine learning techniques. This may include removing high frequency noise using filters, or predicting and correcting sequencing errors by modeling. In the noise removal process, the artificial intelligence analysis system is particularly careful to protect the integrity and accuracy of genome data, so that the noise-removed data can truly reflect the genome information of a sample.

For example, if the system detects an abnormal gene sequence read, it aligns the read against a known genomic database to confirm whether the abnormality is due to sequencing noise. If the read is confirmed to be noise, the system removes it from the data set; otherwise, the system retains the read to preserve the integrity of the data.
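The read-checking example above can be illustrated with a toy filter. This sketch assumes an ungapped comparison against a single reference string and a fixed mismatch tolerance; real alignment tools handle insertions, deletions and quality scores, and a divergent read may reflect genuine biological variation rather than noise. The names `min_mismatches`, `filter_reads` and the `max_mismatches` parameter are hypothetical.

```python
from typing import Iterable, List, Tuple


def min_mismatches(read: str, reference: str) -> int:
    """Smallest Hamming distance between the read and any same-length
    window of the reference (ungapped comparison)."""
    if len(read) > len(reference):
        return len(read)
    return min(
        sum(a != b for a, b in zip(read, reference[i:i + len(read)]))
        for i in range(len(reference) - len(read) + 1)
    )


def filter_reads(reads: Iterable[str], reference: str,
                 max_mismatches: int = 1) -> Tuple[List[str], List[str]]:
    """Split reads into (kept, rejected) using the mismatch tolerance."""
    kept: List[str] = []
    rejected: List[str] = []
    for read in reads:
        if min_mismatches(read, reference) <= max_mismatches:
            kept.append(read)      # consistent with the reference: retain
        else:
            rejected.append(read)  # treated as noise in this toy model
    return kept, rejected
```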

After denoising the first initial genome sequencing dataset and the second initial genome sequencing dataset, the artificial intelligence analysis system generates a more accurate and more reliable genome sequencing dataset. This optimized dataset can be used for subsequent genomic studies such as genetic variation analysis, disease risk assessment, etc.

Through the denoising processing flow, the artificial intelligence analysis system not only improves the accuracy of genome sequencing data, but also provides a higher-quality data base for scientific researchers, and further promotes the research and application of genomics. The accurate data processing method has very important significance for improving the accuracy of medical diagnosis, the efficiency of drug research and development and the realization of personalized medicine.

In some independent embodiments, denoising the first initial genome sequencing data set and the second initial genome sequencing data set comprises: combining the sequencing noise disturbance result, the first initial genome sequencing data set and the second initial genome sequencing data set to obtain a sequencing technology noise detection vector and a sequencing biological noise detection vector of the target plant object, where the sequencing technology noise detection vector characterizes sequencing technology noise factors in the target plant object and the sequencing biological noise detection vector characterizes sequencing biological noise factors in the target plant object; for the sequencing technology noise detection vector, obtaining a first linkage noise positioning vector based on the part of the sequencing technology noise detection vector that is related to the sequencing biological noise detection vector, the first linkage noise positioning vector representing the sequencing technology noise detection vector integrated with the sequencing biological noise detection vector; for the sequencing biological noise detection vector, obtaining a second linkage noise positioning vector based on the part of the sequencing biological noise detection vector that is related to the sequencing technology noise detection vector, the second linkage noise positioning vector representing the sequencing biological noise detection vector integrated with the sequencing technology noise detection vector; fusing the first linkage noise positioning vector and the second linkage noise positioning vector to obtain a global noise positioning vector; determining a denoising viewpoint for the target plant object based on the global noise positioning vector; and denoising the first initial genome sequencing data set and the second initial genome sequencing data set according to the denoising viewpoint.

In this embodiment, the artificial intelligence analysis system denoises the first and second initial genome sequencing data sets through a multi-step procedure that combines vector operations with noise identification, so as to ensure the accuracy and effectiveness of the denoising.

First, the system needs to combine the sequencing noise disturbance result, the first initial genome sequencing data set and the second initial genome sequencing data set to obtain a sequencing technology noise detection vector and a sequencing biological noise detection vector of the target plant object. These two vectors are the basis for the denoising process and are used to characterize sequencing technology noise factors and sequencing biological noise factors in the target plant object, respectively. In short, sequencing technology noise may originate from errors or misoperations of the sequencing instrument, while sequencing biological noise may be related to biological processes such as collection, preservation, etc. of samples.

Next, the system performs linked noise positioning. For the sequencing technology noise detection vector, the system obtains a first linkage noise positioning vector based on the part related to the sequencing biological noise detection vector. This vector is a representation of sequencing technology noise that takes sequencing biological noise factors into account. Similarly, for the sequencing biological noise detection vector, the system obtains a second linkage noise positioning vector based on the part related to the sequencing technology noise detection vector.

After the two linkage noise positioning vectors are obtained, the system fuses them to obtain a global noise positioning vector. The global noise positioning vector takes both technical and biological noise factors into account and guides the subsequent denoising.
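The data flow of the linkage and fusion steps can be sketched as follows. The text does not specify how the related part of each vector is selected or how the two linkage vectors are fused, so everything here is an illustrative assumption: relatedness is modelled as a normalized element-wise product, each linkage vector adds the gated contribution of the other vector, and fusion is a simple average.

```python
from typing import List, Sequence


def relatedness_gate(tech: Sequence[float], bio: Sequence[float]) -> List[float]:
    """Assumed relatedness score per component, normalized to sum to 1."""
    products = [abs(t * b) for t, b in zip(tech, bio)]
    total = sum(products)
    if total == 0.0:
        return [0.0] * len(products)
    return [p / total for p in products]


def global_noise_positioning(tech: Sequence[float],
                             bio: Sequence[float]) -> List[float]:
    """Fuse the two linkage noise positioning vectors into a global vector."""
    if len(tech) != len(bio):
        raise ValueError("both noise detection vectors must have the same length")
    gate = relatedness_gate(tech, bio)
    # First linkage vector: technology noise enriched with related biological noise.
    first_linkage = [t + g * b for t, g, b in zip(tech, gate, bio)]
    # Second linkage vector: biological noise enriched with related technology noise.
    second_linkage = [b + g * t for t, g, b in zip(tech, gate, bio)]
    # Assumed fusion: element-wise mean of the two linkage vectors.
    return [(f + s) / 2.0 for f, s in zip(first_linkage, second_linkage)]
```

A learned model (for example an attention layer) could replace the hand-written gate; the sketch only shows how the two detection vectors inform each other before fusion.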

Based on this global noise positioning vector, the system determines a denoising viewpoint for the target plant object. The denoising viewpoint is in effect a comprehensive denoising strategy that considers the sources, properties and degrees of influence of the various kinds of noise, so that the denoising is targeted and effective.

Finally, the system performs a denoising process on the first initial genome sequencing dataset and the second initial genome sequencing dataset according to the denoising viewpoint. This process may include steps such as cleaning, correction, and reconstruction of the data to ensure that the denoised dataset more accurately reflects genomic information of the target plant object.

Through the denoising processing flow, the artificial intelligence analysis system not only improves the accuracy and reliability of genome sequencing data, but also provides a higher-quality data basis for subsequent genomics research and application. The implementation of the denoising technology has important significance in promoting research progress in the field of genomics, improving the accuracy of medical diagnosis, promoting development of personalized medical treatment and the like.

In other possible embodiments, the method further comprises updating a genome sequencing system of the target plant object.

In these embodiments, the artificial intelligence analysis system not only denoises the genomic sequencing data but also updates the genomic sequencing system of the target plant object. This step is key to ensuring that the sequencing system continues to maintain optimal performance and accuracy.

After the system completes the denoising of the first initial genome sequencing data set and the second initial genome sequencing data set, it has accumulated a large amount of processing experience and data insight. This experience and insight is critical to optimizing a genome sequencing system, so the artificial intelligence analysis system uses it to update and improve the sequencing system.

Specifically, the system first analyzes the various noise types and sources identified during the denoising process. Such noise may include instrument noise, environmental noise, reagent noise, and the like. With an in-depth understanding of the nature of this noise, the system can find problems and shortcomings that may exist in the sequencing system.

Next, the artificial intelligence analysis system addresses these problems and shortcomings by proposing specific optimization suggestions and improvements. For example, if instrument noise is found to be the dominant source of noise, the system may suggest calibrating or upgrading the sequencing instrument to reduce instrument errors. If environmental noise is significant, the system may suggest improvements to the sequencing environment, such as reducing external interference or optimizing temperature control.

In addition, the system can optimize the algorithm and parameters of the sequencing system according to the data insight in the denoising process. For example, by adjusting parameters such as sequencing depth, read length, etc., the coverage and accuracy of sequencing can be improved. Meanwhile, the system can also utilize an advanced machine learning algorithm to improve the analysis method and the processing flow of the sequencing data, thereby further improving the accuracy and the reliability of the data.
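How read count and read length relate to sequencing depth can be made concrete with the standard mean-coverage relation, coverage = (number of reads × read length) / genome size. The helper names and the example figures below are illustrative assumptions, not values from the embodiment.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Mean sequencing depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def reads_for_coverage(target_coverage: float, read_length: int,
                       genome_size: int) -> int:
    """Minimum number of reads needed to reach the target mean depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)
```

For a hypothetical genome of 5,000,000 bases sequenced with 150-base reads, one million reads give a mean depth of 30, and the second helper returns the read count required for a chosen depth.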

Finally, the artificial intelligence analysis system integrates these optimization suggestions and improvements into the genomic sequencing system of the target plant object. This update process may involve aspects of software upgrades, hardware adjustments, or algorithm optimization. Through this series of updating operations, the sequencing system will be better able to accommodate a variety of complex sequencing environments and requirements, providing more accurate, more reliable genomic sequencing data.

In summary, by updating the genome sequencing system of the target plant object, the artificial intelligence analysis system not only can improve the accuracy and reliability of sequencing data, but also can improve the overall performance and adaptability of the sequencing system. This will provide more powerful data support and technical support for subsequent genomics research, medical diagnosis, drug development, etc.

It is worth mentioning that updating the genome sequencing system of the target plant object includes: using the artificial intelligence analysis system to analyze in depth the denoised first initial genome sequencing data set and the denoised second initial genome sequencing data set, and identifying systematic errors and random errors in the sequencing process; adjusting the optical detection system of the sequencing system based on the error analysis results, and optimizing the parameters of laser excitation and fluorescent signal capture so as to reduce the influence of optical signal attenuation and background noise; improving the fluid control module of the sequencing system to ensure accurate delivery and mixing of reaction reagents and to reduce noise introduced by reagent concentration fluctuation; updating the base identification algorithm of the sequencing system by introducing a deep learning model to improve the accuracy of base calling, and updating the algorithm model in real time through a cloud platform so as to adapt to changing sequencing requirements; and finally comprehensively upgrading the hardware and software of the sequencing system to ensure its stability and efficiency, thereby comprehensively optimizing and updating the genome sequencing system of the target plant object.

In detail, the artificial intelligence analysis system plays a vital role in the process of updating the genome sequencing system of the target plant object. This updating process is not only complex but also highly refined, requiring the system to have the ability to perform depth analysis and precise adjustments.

First, the artificial intelligence analysis system deeply analyzes the first initial genome sequencing dataset and the second initial genome sequencing dataset that have undergone noise removal processing. The purpose of this step is to identify systematic and random errors that may exist during sequencing. Systematic errors are typically due to inherent bias or improper operation of the sequencing system, while random errors may result from various unpredictable external factors. By analyzing these datasets in depth, the system is able to locate the error sources more accurately, providing powerful support for subsequent optimization work.

Next, based on the results of the error analysis, the artificial intelligence analysis system proceeds to adjust the optical detection system of the sequencing system. This includes optimizing parameters of laser excitation and fluorescent signal capture. Laser excitation is a critical step in the sequencing process and is responsible for exciting the fluorescent labeled nucleotides to emit a fluorescent signal. By adjusting the parameters of the laser such as intensity, wavelength, excitation time, etc., the system can excite fluorescent signals more effectively, and simultaneously reduce the influence of light signal attenuation and background noise. This helps to improve the clarity and accuracy of the sequencing signal.

At the same time, the system also improves the fluid control module of the sequencing system. The fluid control module is responsible for precisely controlling and delivering the reagents used in the sequencing reaction, such as nucleotides, enzymes, buffers, and the like. By optimizing fluid control parameters such as flow rate, concentration and mixing ratio, the system can ensure accurate delivery and mixing of the reagents. This not only reduces noise introduced by reagent concentration fluctuations, but also improves the stability and reproducibility of the sequencing reaction.

To further improve sequencing accuracy, the artificial intelligence analysis system also updates the base recognition algorithm of the sequencing system. Traditional base recognition algorithms may be limited by fixed patterns and parameters and have difficulty adapting to changing sequencing requirements. The system therefore introduces a deep learning model to improve the base recognition process. The deep learning model has strong feature extraction and classification capability and can more accurately identify and read base information in sequencing signals. In addition, because the algorithm model is updated in real time through the cloud platform, the system can continuously learn and adapt to new sequencing data and environmental changes, thereby maintaining its recognition performance.

Finally, to ensure the overall performance and stability of the sequencing system, the artificial intelligence analysis system comprehensively upgrades the hardware and software of the sequencing system. Hardware upgrades may include replacing instruments with more advanced sequencing instruments or increasing computer processing power and storage capacity; software upgrades may involve optimizing the operating system, updating data analysis software, or introducing new sequencing algorithms. These upgrades help the sequencing system perform well on large-scale, high-complexity genome sequencing tasks and provide high-quality, efficient sequencing services for researchers.

In summary, by comprehensively optimizing and updating the genome sequencing system of the target plant object, the artificial intelligence analysis system can remarkably improve the accuracy and reliability of sequencing data.

Further, fig. 2 is a schematic structural diagram of an artificial intelligence analysis system 200 according to an embodiment of the present application. The artificial intelligence analysis system 200 as shown in fig. 2 includes a processor 210, and the processor 210 may call and run a computer program from memory to implement the methods of embodiments of the present application.

Optionally, as shown in FIG. 2, the artificial intelligence analysis system 200 may also include a memory 230. Wherein the processor 210 may call and run a computer program from the memory 230 to implement the method in an embodiment of the application.

Wherein the memory 230 may be a separate device from the processor 210 or may be integrated into the processor 210.

Optionally, as shown in fig. 2, the artificial intelligence analysis system 200 may further include a transceiver 220, and the processor 210 may control the transceiver 220 to interact with other devices, and in particular, may send information or data to other devices, or receive information or data sent by other devices.

Optionally, the artificial intelligence analysis system 200 may implement the flows that correspond, in each method of the embodiments of the present application, to the storage engine, to a component of the storage engine (such as a processing module), or to the device in which the storage engine is deployed; for brevity, these flows are not described again here.

It should be appreciated that the processor of an embodiment of the present application may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method embodiments may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or as being executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.

It will be appreciated that the memory in embodiments of the application may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.

It should be appreciated that the above memory is exemplary but not limiting. For example, the memory in the embodiments of the present application may also be static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Direct Rambus RAM (DR RAM), and the like. That is, the memory in embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.

On the basis of the above, a computer readable storage medium is provided, on which a computer program is stored, which computer program, when run, implements the method described above.

The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art.

Claims

1. A plant genome sequencing data analysis method based on artificial intelligence, characterized in that it is applied to an artificial intelligence analysis system, and the method comprises:

Determining, using a long short-term memory network, a first genome expression description vector relationship network of a first initial genome sequencing data set and a second genome expression description vector relationship network of a second initial genome sequencing data set, wherein the first initial genome sequencing data set and the second initial genome sequencing data set correspond to a target plant object, and the first genome expression description vector relationship network and the second genome expression description vector relationship network are each any one of a genome expression feature set, a transcript structure feature set, or an epigenetic marker feature set;

Determining, based on an AI sliding sampling kernel, at least one local description vector relationship network from each of the first genome expression description vector relationship network and the second genome expression description vector relationship network, and determining at least one vector relationship network tuple based on the data distribution feature of each local description vector relationship network in its corresponding genome expression description vector relationship network, wherein the first local description vector relationship network and the second local description vector relationship network in each vector relationship network tuple belong to different genome expression description vector relationship networks and have the same data distribution feature in their corresponding genome expression description vector relationship networks;

Integrating the first local description vector relationship network and the second local description vector relationship network in each vector relationship network tuple to obtain a genome sequencing integrated vector set, and determining, based on each genome sequencing integrated vector set, the sequencing noise perturbation result of the first local description vector relationship network and the second local description vector relationship network in the corresponding vector relationship network tuple;

Determining the sequencing noise perturbation result of the first initial genome sequencing data set and the second initial genome sequencing data set based on the sequencing noise perturbation result of each vector relationship network tuple, wherein the sequencing noise perturbation result is either that sequencing noise perturbation exists or that no sequencing noise perturbation exists;

Wherein integrating the first local description vector relationship network and the second local description vector relationship network in each vector relationship network tuple to obtain a genome sequencing integrated vector set comprises: generating the genome sequencing integrated vector set based on the sequencing attention variables of each characteristic element of the first local description vector relationship network and the second local description vector relationship network, wherein each characteristic element of the genome sequencing integrated vector set includes at least one attention dimension, and the sequencing attention variable of each attention dimension is determined by at least one of the sequencing attention variable of the first local description vector relationship network corresponding to the characteristic element or the sequencing attention variable of the second local description vector relationship network corresponding to the characteristic element;

Wherein the sequencing attention variable of each characteristic element of the genome expression feature set is a genome expression mapping parameter, the sequencing attention variable of each characteristic element of the transcript structure feature set is a transcript structure mapping parameter, and each characteristic element of the epigenetic marker feature set corresponds to multiple sequencing attention variables, each of which is an epigenetic mapping parameter of an epigenetic attention dimension;

Wherein determining, based on the AI sliding sampling kernel, at least one local description vector relationship network from each of the first genome expression description vector relationship network and the second genome expression description vector relationship network comprises: determining at least one first target data distribution feature in the first genome expression description vector relationship network, and determining each data distribution feature in the second genome expression description vector relationship network that is the same as one of the first target data distribution features as a second target data distribution feature, wherein each first target data distribution feature reflects a core sequencing embedding vector of the first genome expression description vector relationship network; and determining at least one local description vector relationship network from each first target data distribution feature and each second target data distribution feature based on the AI sliding sampling kernel and a selected sampling strategy.
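
By way of illustration only, and not as part of the claim, the sampling, pairing, and integration steps of claim 1 can be sketched in Python. The sketch makes several assumptions that the text does not confirm: each description vector relationship network is treated as a plain 2-D matrix, the "data distribution feature" of a local network is taken to be the top-left offset of its window, and integration is done by stacking the two values of each element into two attention dimensions. The names `sliding_windows`, `pair_windows`, and `integrate` are hypothetical.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Offset = Tuple[int, int]


def sliding_windows(net: Matrix, size: int, stride: int) -> Dict[Offset, Matrix]:
    """Cut a 2-D matrix into square windows keyed by their top-left offset.

    Assumption: the offset stands in for the "data distribution feature".
    """
    rows, cols = len(net), len(net[0])
    windows: Dict[Offset, Matrix] = {}
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            windows[(r, c)] = [row[c:c + size] for row in net[r:r + size]]
    return windows


def pair_windows(first: Matrix, second: Matrix, size: int, stride: int):
    """Pair local windows from the two networks that share the same offset."""
    a = sliding_windows(first, size, stride)
    b = sliding_windows(second, size, stride)
    return [(a[key], b[key]) for key in sorted(a.keys() & b.keys())]


def integrate(pair: Tuple[Matrix, Matrix]):
    """Stack the two windows element-wise (assumed integration rule).

    Each element of the result holds two attention dimensions: (x1, x2).
    """
    w1, w2 = pair
    return [[(x1, x2) for x1, x2 in zip(r1, r2)] for r1, r2 in zip(w1, w2)]


first = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
second = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
pairs = pair_windows(first, second, size=2, stride=1)
integrated = [integrate(p) for p in pairs]
```

Each entry of `integrated` would then be passed to whatever classifier decides whether the pair shows sequencing noise perturbation; that classifier is outside this sketch.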

2. The method according to claim 1, characterized in that determining the sequencing noise perturbation result of the first initial genome sequencing data set and the second initial genome sequencing data set based on the sequencing noise perturbation result of each vector relationship network tuple comprises:

Determining at least one target data distribution feature cluster, each target data distribution feature cluster including one of the first target data distribution features and the second target data distribution feature corresponding to that first target data distribution feature;

Determining, for each target data distribution feature cluster, a first number of vector relationship network tuples corresponding to the cluster, and a second number of those vector relationship network tuples whose sequencing noise perturbation result is that no sequencing noise perturbation exists;

For each of the target data distribution feature clusters, determining sequencing noise perturbation results of the first target data distribution feature and the second target data distribution feature in the target data distribution feature cluster based on the first number and the second number corresponding to the target data distribution feature cluster;

Determining the sequencing noise perturbation result of the first initial genome sequencing data set and the second initial genome sequencing data set based on the sequencing noise perturbation result of each target data distribution feature cluster.
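
As an illustrative sketch only, the counting step of claim 2 can be written in Python as follows. It assumes that each vector relationship network tuple has already been labeled with a cluster identifier and a Boolean noise result; the function name `count_per_cluster` and the input layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def count_per_cluster(
    tuple_results: Iterable[Tuple[str, bool]],
) -> Dict[str, Tuple[int, int]]:
    """Return {cluster_id: (first_number, second_number)}.

    tuple_results yields (cluster_id, has_noise) for each tuple.
    first_number  = all tuples in the cluster.
    second_number = tuples in the cluster found to have no noise.
    """
    counts = defaultdict(lambda: [0, 0])
    for cluster_id, has_noise in tuple_results:
        counts[cluster_id][0] += 1
        if not has_noise:
            counts[cluster_id][1] += 1
    return {cid: (total, clean) for cid, (total, clean) in counts.items()}


results = [("a", False), ("a", True), ("b", True)]
per_cluster = count_per_cluster(results)
```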

3. The method according to claim 2, characterized in that, for each of the target data distribution feature clusters, determining the sequencing noise perturbation results of the first target data distribution feature and the second target data distribution feature in the target data distribution feature cluster based on the first number and the second number corresponding to the target data distribution feature cluster, comprises:

Determining a first ratio of the second number to the first number corresponding to the target data distribution feature cluster;

If the first ratio corresponding to the target data distribution feature cluster is less than a first threshold, determining a noise perturbation analysis weight of the target data distribution feature cluster based on the first ratio and the second number corresponding to the target data distribution feature cluster, and determining the sequencing noise perturbation result of the first target data distribution feature and the second target data distribution feature in the target data distribution feature cluster based on the noise perturbation analysis weight;

If the first ratio corresponding to the target data distribution feature cluster is not less than the first threshold, determining that the sequencing noise perturbation result of the first target data distribution feature and the second target data distribution feature in the target data distribution feature cluster is that no sequencing noise perturbation exists.
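
For illustration only, the per-cluster decision of claim 3 can be sketched in Python. The claim does not state how the analysis weight is computed from the ratio and the second number, nor how the weight maps to a result. The weight formula, the `weight_threshold` parameter, both default threshold values, and the function name `cluster_has_noise` are therefore assumptions made for this sketch.

```python
def cluster_has_noise(
    first_number: int,
    second_number: int,
    ratio_threshold: float = 0.8,   # assumed value of the first threshold
    weight_threshold: float = 0.5,  # hypothetical cut-off on the weight
) -> bool:
    """Return True if the cluster is judged to show sequencing noise.

    first_number  = tuples in the cluster.
    second_number = tuples in the cluster found to have no noise.
    """
    if first_number <= 0:
        raise ValueError("a cluster needs at least one tuple")
    ratio = second_number / first_number
    if ratio >= ratio_threshold:
        # Ratio not less than the first threshold: no noise perturbation.
        return False
    # Hypothetical weight: grows with the clean ratio and with the
    # absolute number of clean tuples. The claim names the two inputs
    # but not the formula.
    weight = ratio * second_number / (second_number + 1)
    return weight < weight_threshold


high_ratio = cluster_has_noise(10, 9)   # ratio 0.9, above the threshold
mid_ratio = cluster_has_noise(10, 7)    # ratio 0.7, weight about 0.61
low_ratio = cluster_has_noise(10, 2)    # ratio 0.2, weight about 0.13
```

Under these assumed values, only the last cluster is flagged as noisy; a different weight formula or cut-off would change where that boundary falls.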

4. The method according to claim 2, characterized in that the determining the sequencing noise perturbation results of the first initial genome sequencing data set and the second initial genome sequencing data set based on the sequencing noise perturbation results of each of the target data distribution feature clusters comprises:

If the sequencing noise perturbation result of at least one of the target data distribution feature clusters is that there is no sequencing noise perturbation, determining that the sequencing noise perturbation results of the first initial genome sequencing data set and the second initial genome sequencing data set are that there is no sequencing noise perturbation;

If the sequencing noise perturbation result of every target data distribution feature cluster is that sequencing noise perturbation exists, determining that the sequencing noise perturbation result of the first initial genome sequencing data set and the second initial genome sequencing data set is that sequencing noise perturbation exists.
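
The aggregation rule of claim 4 reduces to a single Boolean test, sketched below in Python for illustration. The function name `datasets_have_noise` and the choice to reject an empty input are assumptions; the claim itself presupposes at least one cluster.

```python
from typing import Sequence


def datasets_have_noise(cluster_noise: Sequence[bool]) -> bool:
    """Combine per-cluster results into one result for both data sets.

    cluster_noise[i] is True if cluster i shows noise perturbation.
    Noise is reported only when every cluster shows it; a single
    clean cluster yields "no noise perturbation".
    """
    if not cluster_noise:
        raise ValueError("at least one cluster result is required")
    return all(cluster_noise)


one_clean = datasets_have_noise([True, False, True])
all_noisy = datasets_have_noise([True, True])
```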

5. The method according to claim 1, characterized in that determining, based on each genome sequencing integrated vector set, the sequencing noise perturbation result of the first local description vector relationship network and the second local description vector relationship network in the corresponding vector relationship network tuple is implemented by a long short-term memory network, and the long short-term memory network is debugged by the following steps:

Determining a plurality of genome expression vector relationship network sample tuples, each genome expression vector relationship network sample tuple comprising a first genome expression description vector relationship network sample of a first initial genome sequencing data set sample and a second genome expression description vector relationship network sample of a second initial genome sequencing data set sample, wherein the first initial genome sequencing data set sample and the second initial genome sequencing data set sample correspond to a target plant object, and the first genome expression description vector relationship network sample and the second genome expression description vector relationship network sample are each any one of a genome expression feature set, a transcript structure feature set, or an epigenetic marker feature set;

Integrating the first genome expression description vector relationship network sample and the second genome expression description vector relationship network sample in each genome expression vector relationship network sample tuple to obtain a genome sequencing integrated vector set sample, inputting each genome sequencing integrated vector set sample into an initial recurrent neural network to obtain a discriminant viewpoint corresponding to that genome sequencing integrated vector set sample, and determining, based on the discriminant viewpoint, a sequencing noise perturbation prediction result of the first genome expression description vector relationship network sample and the second genome expression description vector relationship network sample in the genome expression vector relationship network sample tuple corresponding to each genome sequencing integrated vector set sample;

Determining a debugging error function based on the actual sequencing noise perturbation result and the sequencing noise perturbation prediction result of the first genome expression description vector relationship network sample and the second genome expression description vector relationship network sample in each genome expression vector relationship network sample tuple, debugging the initial recurrent neural network based on the debugging error function and each genome expression vector relationship network sample tuple until the debugging error function meets the debugging termination requirement, at which point the debugging is complete, and determining the recurrent neural network obtained after the debugging as the long short-term memory network.

6. The method according to claim 1, characterized in that when the sequencing noise perturbation result is that sequencing noise perturbation exists, the method further comprises:

Performing noise reduction processing on the first initial genome sequencing data set and the second initial genome sequencing data set based on the sequencing noise perturbation result.

7. The method according to claim 6, characterized in that the method further comprises:

Updating the genome sequencing system of the target plant object.

8. An artificial intelligence analysis system, characterized in that it includes at least one processor and a memory; the memory stores computer-executable instructions; and the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the method according to any one of claims 1-7.