
WO2025137944A1 - Method for determining a consensus sequence of nucleic acid molecules and use thereof - Google Patents

Method for determining a consensus sequence of nucleic acid molecules and use thereof

Info

Publication number
WO2025137944A1
Authority
WO
WIPO (PCT)
Prior art keywords
nucleic acid
sequence
sequencing
characteristic
acid molecule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2023/142432
Other languages
English (en)
Chinese (zh)
Inventor
李一彦
曾涛
孙宇辉
师虓
陈俊毅
董宇亮
黎宇翔
章文蔚
徐讯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Shenzhen Co Ltd
Original Assignee
BGI Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Shenzhen Co Ltd
Priority to PCT/CN2023/142432
Publication of WO2025137944A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6806: Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search

Definitions

  • the present application relates to the field of sequencing technology, and in particular to methods and applications for determining the consensus sequence of nucleic acid molecules.
  • the third generation sequencing technology represented by the technology platform of Oxford Nanopore Technologies (ONT) and other companies is becoming more and more popular in ecological research due to its portable and low-cost sequencing possibilities.
  • ONT: Oxford Nanopore Technologies
  • the disadvantage of ONT is that the quality of the original reads is low, which does not meet the requirements of high quality in some scenarios. Therefore, generating high-quality consensus sequences remains a challenge.
  • constructing a single-stranded circular library is a common method.
  • a consensus sequence is generated using the multiple copy fragments.
  • PB: Pacific Biosciences
  • CCS: circular consensus sequencing
  • BLAST: Basic Local Alignment Search Tool
  • the embodiments of the present invention aim to solve at least one of the technical problems existing in the prior art.
  • the embodiments of the present invention provide a method and application for determining the consensus sequence of nucleic acid molecules, which is simpler and more convenient than BLAST for identifying shorter fragments such as linker sequences, and does not require comparison with a priori known sequences.
  • a method for determining a consensus sequence of a nucleic acid molecule comprising the following steps:
  • S100 obtaining sequencing information of the nucleic acid molecule, the sequencing information including multi-copy sequences of a library molecule, the library molecule including a single-stranded circular molecule formed by connecting the nucleic acid molecule and a characteristic fragment;
  • S200 identifying a characteristic sequence of the characteristic fragment in the sequencing information based on a Levenshtein distance algorithm using a known sequence of the characteristic fragment;
  • S300 dividing the sequencing information according to the characteristic sequence, and determining the repeated nucleic acid sequence of the nucleic acid molecule;
  • S400 generating the consensus sequence according to the set of repeated nucleic acid sequences of the nucleic acid molecule.
  • S200 includes calculating the Levenshtein distance between the sequence to be identified in the sequencing information and the known sequence. If the Levenshtein distance is not greater than a threshold, the sequence to be identified is determined to be a characteristic sequence.
  • the threshold is 0.3 times the length of the known sequence.
  • S200 also includes correcting the characteristic sequence.
  • correction is performed based on the position of the characteristic sequence in the sequencing information.
  • the method of correcting according to the position includes merging a plurality of characteristic sequences that are close in position into one.
  • the repeated nucleic acid sequences in the set in S400 are pre-screened.
  • the sequencing information includes the 1st repeated nucleic acid sequence to the nth repeated nucleic acid sequence in the order of sequencing.
  • the screening includes screening the kth repeated nucleic acid sequence, where 1 < k ≤ n, according to its consistency with the 1st repeated nucleic acid sequence, and including it in the set if the consistency requirement is met.
  • the consistency is selected from at least one of length, similarity, base content, and quality score.
  • generating the consensus sequence according to the set of repeated nucleic acid sequences of the nucleic acid molecule in S400 includes performing a multiple sequence alignment on the repeated nucleic acid sequences in the set to generate the consensus sequence.
  • the multiple sequence alignment method is abPOA.
  • the method further includes performing error correction processing on the generated consensus sequence.
  • the error correction processing adopts at least one of gcpp, Racon, Pilon, and NextPolish.
  • a second aspect of an embodiment of the present invention provides a method for sequencing a nucleic acid molecule, comprising connecting a nucleic acid molecule and a characteristic fragment to form a single-stranded circular molecule, performing rolling circle amplification and sequencing to obtain sequencing information of the nucleic acid molecule, and obtaining a consensus sequence of the nucleic acid molecule according to the aforementioned method.
  • a computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are used to enable a computer to execute the aforementioned method.
  • an electronic device comprising a processor and a memory, wherein the memory stores a computer program executable on the processor, and the processor implements the aforementioned method when executing the computer program.
  • a fifth aspect of the embodiments of the present invention provides a sequencing data analysis system, comprising:
  • an acquisition module, used to acquire sequencing information of the nucleic acid molecule, the sequencing information including a multi-copy sequence of a library molecule, and the library molecule including a single-stranded circular molecule formed by connecting the nucleic acid molecule and a characteristic fragment;
  • an identification module, used to identify a characteristic sequence of the characteristic fragment in the sequencing information based on a Levenshtein distance algorithm using a known sequence of the characteristic fragment;
  • a division module, used to divide the sequencing information according to the characteristic sequence and determine the repeated nucleic acid sequence of the nucleic acid molecule;
  • a generation module, used to generate the consensus sequence according to the set of repeated nucleic acid sequences of the nucleic acid molecule.
  • a sixth aspect of the embodiments of the present invention provides a sequencing system, comprising:
  • a sequencing module, used to connect the nucleic acid molecule and the characteristic fragment to form a single-stranded circular molecule, and to obtain the sequencing information of the nucleic acid molecule through rolling circle amplification and sequencing;
  • an acquisition module, used to acquire the sequencing information;
  • an identification module, used to identify a characteristic sequence of the characteristic fragment in the sequencing information based on a Levenshtein distance algorithm using a known sequence of the characteristic fragment;
  • a division module, used to divide the sequencing information according to the characteristic sequence and determine the repeated nucleic acid sequence of the nucleic acid molecule;
  • a generation module, used to generate the consensus sequence according to the set of repeated nucleic acid sequences of the nucleic acid molecule.
  • the edit distance (Levenshtein distance) method is used to identify the characteristic sequence. Compared with the traditional BLAST-based identification solution, the time required is reduced to 1/10 or less, and the whole process is simpler and more convenient.
  • FIG. 1 is a schematic diagram of the sequencing process according to an embodiment of the present invention.
  • FIG. 2 is an example diagram of characteristic sequences identified on sequencing reads in an embodiment of the present invention.
  • FIG. 3 is a flow chart of consensus sequence generation in an embodiment of the present invention.
  • FIG. 4 is an Identity distribution diagram of each repeated nucleic acid sequence (subread) in an embodiment of the present invention.
  • FIG. 5 is a diagram showing the length distribution of each repeated nucleic acid sequence in an embodiment of the present invention.
  • FIG. 7 is a diagram showing the final distribution of the number of repeated nucleic acid sequences after screening in an embodiment of the present invention.
  • FIG. 8 is a diagram showing evaluation results of consensus sequence recognition in an embodiment of the present invention.
  • FIG. 9 is a statistical result of relevant indicators in the process of constructing a consensus sequence in an embodiment of the present invention.
  • “first”, “second”, and “third” are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated.
  • the features defined as “first”, “second”, and “third” may explicitly or implicitly include at least one of the features.
  • “several” means more than one
  • “multiple” means at least two, such as two, three, etc.
  • “greater than”, “less than”, “exceeding”, etc. are understood to not include the number;
  • “above”, “below”, “within”, etc. are understood to include the number, unless otherwise clearly and specifically defined.
  • Biological samples include at least one of body fluids (such as blood, tissue fluid, lymph fluid, cerebrospinal fluid, urine, sweat, sputum, saliva, gastric juice, intestinal juice, pancreatic juice, bile, prostatic fluid, vaginal secretions, semen, serous cavity effusion, joint cavity effusion, bronchoalveolar lavage fluid, amniotic fluid, etc.), skin, feces, intestinal contents, swabs (such as nasal swabs, pharyngeal swabs, anal swabs, etc.), tissues (such as histological samples obtained by surgery, endoscopy or percutaneous puncture biopsy), cells, and microorganisms (such as bacteria, viruses, fungi, actinomycetes, rickettsia, mycoplasma, chlamydia, spirochetes, etc.).
  • a consensus sequence is a sequence that is restored from a group of similar but not identical sequences and in which each position consists of the most likely base.
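  • As a minimal sketch of this definition, the snippet below takes the most frequent symbol in each column of sequences that are assumed to be already aligned to equal length (gaps written as "-"). Using "most frequent" as a stand-in for "most likely", and the function name `majority_consensus`, are illustrative assumptions; the method described in this document builds its consensus with a multiple sequence alignment tool rather than with this simplified vote.

```python
from collections import Counter


def majority_consensus(aligned):
    """Pick the most frequent symbol in each column of equal-length aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":  # a gap majority means the column is dropped
            consensus.append(base)
    return "".join(consensus)


# Three noisy copies of the same fragment, already aligned (hypothetical data).
copies = ["ACGT-A", "ACCT-A", "ACGTTA"]
print(majority_consensus(copies))  # ACGTA
```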
  • S100 Acquire sequencing information of nucleic acid molecules, where the sequencing information includes multi-copy sequences of library molecules, and the library molecules include single-stranded circular molecules formed by connecting nucleic acid molecules and characteristic fragments.
  • the sequencing of nucleic acid molecules is performed by any of the first generation sequencing, second generation sequencing, and third generation sequencing.
  • the first generation sequencing includes Maxam-Gilbert sequencing technology, Sanger dideoxy sequencing technology, pyrophosphate sequencing technology, fluorescent automatic sequencing technology, and hybridization sequencing technology, etc.
  • the second generation sequencing includes 454 Roche GS FLX, Illumina Solexa, SOLiD, Ion Torrent, BGISEQ, etc.
  • the third generation sequencing includes HeliScope, PacBio HiFi, PacBio CLR, ONT, Cyclone WT, etc.
  • the third generation sequencing is also called single molecule sequencing, which can read nucleotide sequences at the single molecule level, and has the advantages of longer read length and faster sequencing speed.
  • a multi-copy sequence of a nucleic acid molecule is obtained by rolling circle amplification for sequencing.
  • the nucleic acid molecule is connected to a characteristic fragment to form a single-stranded circular library molecule for rolling circle amplification, and the amplified product is sequenced, so that the sequencing information includes the multi-copy sequence of the library molecule.
  • the multi-copy sequence of the sequencing information includes sequence information of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200 copies of the nucleic acid molecule. Due to sequencing errors, the sequence information of these copies is not completely consistent, so the sequencing results of these copies are used for error correction to reduce the error rate.
  • Levenshtein distance belongs to a type of edit distance algorithm, which refers to the minimum number of editing operations required to change one string into another.
  • the allowed editing operations include replacement (replacing one character with another character), insertion (inserting a character) and deletion (deleting a character).
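  • The definition above can be illustrated with the textbook dynamic-programming computation of the Levenshtein distance. This is a plain-Python sketch for illustration only; the example later in this document uses the python_Levenshtein library instead, and the function name `levenshtein` here is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of replacements, insertions and deletions turning a into b."""
    # dist[i][j] = distance between the first i characters of a and the first j of b
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(len(b) + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # replacement (or match when cost is 0)
            )
    return dist[len(a)][len(b)]


print(levenshtein("kitten", "sitting"))  # 3
```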
  • the Levenshtein distance lev a, b (i, j) between the first i characters of string a and the first j characters of string b satisfies the following formula:
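  • For reference, the standard recurrence for the Levenshtein distance is given below; it is assumed to correspond to the formula referred to here. In it, a_i and b_j denote the ith character of a and the jth character of b, and the indicator term equals 1 when the two characters differ and 0 otherwise.

```latex
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
  \max(i,j) & \text{if } \min(i,j) = 0, \\[4pt]
  \min
  \begin{cases}
    \operatorname{lev}_{a,b}(i-1,j) + 1 \\
    \operatorname{lev}_{a,b}(i,j-1) + 1 \\
    \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}
  \end{cases} & \text{otherwise.}
\end{cases}
```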
  • because the characteristic sequence is short, the nucleic acid molecule itself may contain fragments that differ only slightly from it, so the Levenshtein distance method may identify some positions that are not characteristic sequences as characteristic sequences. Therefore, in some embodiments, after the characteristic sequences are identified by the Levenshtein distance algorithm, they are also corrected to exclude such misidentifications as far as possible. Given how this type of misidentification arises, in some embodiments the correction is performed according to the position of the characteristic sequence in the sequencing information.
  • in the sequencing information obtained after the library molecule containing the nucleic acid molecule and the characteristic fragment is subjected to rolling circle amplification and sequencing, the nucleic acid molecule sequence and the characteristic sequence alternate, so adjacent characteristic sequences are separated by a nearly constant number of bases. The approximate position of a candidate in the sequencing information therefore indicates whether it is a true characteristic sequence or a fragment of the nucleic acid molecule mistaken for one.
  • the method for correcting according to the position includes merging a plurality of characteristic sequences that are close in position into one. Specifically, whether positions are close needs to be judged in conjunction with the length of the nucleic acid molecule.
  • the length of the nucleic acid molecule fragment is more than 200bp.
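  • One way to picture the position-based correction is the sketch below, which collapses candidate hits lying closer together than a minimum spacing into a single hit. The function name `merge_close_hits`, the rule of keeping the hit with the smallest distance, and the spacing value of 200 (chosen only because the fragments are stated to be longer than 200 bp) are illustrative assumptions, not the exact rule of the described method.

```python
def merge_close_hits(hits, min_gap):
    """Collapse hits whose start positions are closer than min_gap into one.

    hits: list of (position, distance) tuples sorted by position.
    Within each cluster, the hit with the smallest distance is kept.
    """
    merged = []
    for pos, dist in hits:
        if merged and pos - merged[-1][0] < min_gap:
            # Same cluster: keep whichever hit matches the known sequence better.
            if dist < merged[-1][1]:
                merged[-1] = (pos, dist)
        else:
            merged.append((pos, dist))
    return merged


# Hypothetical candidates: three overlapping windows around one true hit, then two more hits.
candidates = [(100, 3), (101, 1), (102, 4), (650, 2), (1200, 0)]
print(merge_close_hits(candidates, min_gap=200))
# [(101, 1), (650, 2), (1200, 0)]
```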
  • S300 Divide the sequencing information according to the characteristic sequence and determine the repeated nucleic acid sequence of the nucleic acid molecule.
  • the multi-copy sequences in the sequencing information include sequences in which nucleic acid molecules and characteristic fragments are spaced apart from each other. Therefore, referring to FIG. 2 , the sequencing information can be divided using characteristic sequences, and the remaining sequences between the characteristic sequences include multiple repeated nucleic acid sequences obtained by sequencing different copies of nucleic acid molecules.
  • S400 Generate a consensus sequence based on a set of repeated nucleic acid sequences of nucleic acid molecules.
  • the repeated nucleic acid sequences of the nucleic acid molecule obtained by dividing at the characteristic sequences constitute a set of repeated nucleic acid sequences of the nucleic acid molecule, and a consensus sequence of the nucleic acid molecule is generated based on the set.
  • the repeated nucleic acid sequences are pre-screened to obtain repeated nucleic acid sequences with better quality and improve the accuracy of the consensus sequence.
  • the repeated nucleic acid sequences of the nucleic acid molecule include the first repeated nucleic acid sequence to the nth repeated nucleic acid sequence in the order of sequencing after amplification, and the remaining repeated nucleic acid sequences are screened according to the length and sequence information of the first repeated nucleic acid sequence.
  • the kth repeated nucleic acid sequence (1 < k ≤ n) is screened according to its consistency with the first repeated nucleic acid sequence. If the consistency requirement is met, the kth repeated nucleic acid sequence is counted into the set of repeated nucleic acid sequences.
  • the consistency of the kth repeated nucleic acid sequence with the first repeated nucleic acid sequence is assessed on at least one, at least two, or all of length, similarity, base content, quality score, etc.
  • if the length of the kth repeated nucleic acid sequence differs from the length of the first repeated nucleic acid sequence by no more than ±0.1%, ±0.2%, ±0.3%, ±0.5%, ±1%, ±2%, ±3%, ±5%, ±10%, ±15%, ±20%, ±25%, ±30%, ±35%, ±40%, ±45%, or ±50%, the consistency screening requirement is met.
  • if the ratio of the number of matched bases in the CIGAR value of the alignment between the kth repeated nucleic acid sequence and the first repeated nucleic acid sequence to the total number of bases of the kth repeated nucleic acid sequence is greater than 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, or 0.9, the consistency screening requirement is met.
  • if the GC content of the kth repeated nucleic acid sequence differs from the GC content of the first repeated nucleic acid sequence by no more than ±0.01%, ±0.02%, ±0.03%, ±0.05%, ±0.1%, ±0.2%, ±0.3%, ±0.5%, ±1%, ±1.5%, ±2%, ±2.5%, ±3%, ±3.5%, ±4%, ±4.5%, ±5%, ±6%, ±7%, ±8%, ±9%, or ±10%, the consistency screening requirement is met.
  • the quality score includes at least one of a Phred score, a Sanger score, a Solexa score, etc.
  • generating a consensus sequence based on a set of repeated nucleic acid sequences of nucleic acid molecules includes performing a multiple sequence alignment on the repeated nucleic acid sequences in the set to generate a consensus sequence.
  • a multiple sequence alignment method without prior knowledge, such as POA, SPOA, or abPOA, is used to generate a consensus sequence from the set of repeated nucleic acid sequences.
  • the multiple sequence alignment method without prior knowledge is abPOA.
  • the generated consensus sequence is also subjected to error correction processing to further improve the accuracy.
  • the error correction processing can adopt at least one of gcpp, Racon, Pilon, and NextPolish.
  • a second aspect of an embodiment of the present invention provides a method for sequencing a nucleic acid molecule, comprising connecting a nucleic acid molecule and a characteristic fragment to form a single-stranded circular molecule, performing rolling circle amplification and sequencing to obtain sequencing information of the nucleic acid molecule, and obtaining a consensus sequence of the nucleic acid molecule according to the aforementioned method.
  • a computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are used to enable a computer to execute the aforementioned method.
  • an electronic device includes a processor and a memory.
  • the memory stores a computer program executable on the processor.
  • the processor implements the aforementioned method when executing the computer program.
  • the memory is a non-transitory computer-readable storage medium that can be used to store non-transitory software programs and non-transitory computer executable programs, such as the aforementioned method described in the embodiments of the present invention.
  • the processor determines the consensus sequence of the nucleic acid molecule by running the non-transitory software program and instructions stored in the memory.
  • the memory may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application required for at least one function; and the data storage area may store and execute the above-mentioned programs.
  • the memory may include a high-speed random access memory and may also include a non-volatile memory, such as at least one disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
  • the memory may include a memory remotely located relative to the processor, and the remote memory may be connected to the processor via a network.
  • Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the non-transitory software program and instructions required to implement the above method are stored in the memory, and when executed by one or more processors, the above method is executed.
  • a fifth aspect of the embodiments of the present invention provides a sequencing data analysis system, including:
  • an acquisition module, used to acquire sequencing information of nucleic acid molecules, wherein the sequencing information includes multi-copy sequences of library molecules, and the library molecules include single-stranded circular molecules formed by connecting nucleic acid molecules and characteristic fragments;
  • a recognition module, used to identify the characteristic sequence of the characteristic fragment in the sequencing information based on the known sequence of the characteristic fragment and the Levenshtein distance algorithm;
  • a division module, used to divide the sequencing information according to the characteristic sequence and determine the repeated nucleic acid sequence of the nucleic acid molecule;
  • a generation module, used to generate the consensus sequence according to a set of repeated nucleic acid sequences of nucleic acid molecules.
  • the recognition module calculates the Levenshtein distance between the sequence to be identified in the sequencing information and the known sequence. If the Levenshtein distance is not greater than the threshold, the sequence to be identified is determined to be a characteristic sequence.
  • the threshold is a set multiple of the length of the known sequence of the characteristic fragment, for example 0.1 to 0.5 times, such as 0.1, 0.2, 0.3, 0.4, or 0.5 times. Taking a threshold of 0.3 times the length of the known sequence of the characteristic fragment as an example, for a known sequence 21 bp in length, the Levenshtein distance threshold is 21 × 0.3 ≈ 7 bp.
  • the recognition module further comprises correcting the characteristic sequence.
  • the correction is performed according to the position of the characteristic sequence in the sequencing information.
  • the method for correcting according to the position comprises correcting multiple characteristic sequences that are close in position into one.
  • the generation module pre-screens the repeated nucleic acid sequences in the set.
  • the sequencing information includes the first to nth repeated nucleic acid sequences in the order of sequencing, and the screening includes screening according to the consistency of the kth repeated nucleic acid sequence (1 < k ≤ n) with the first repeated nucleic acid sequence, and the kth repeated nucleic acid sequence that meets the consistency requirement is included in the set.
  • the consistency is selected from at least one of length, base content, and quality score.
  • the generation module performs multiple sequence alignment on the repeated nucleic acid sequences in the set to generate a consensus sequence. In some specific embodiments, the method of multiple sequence alignment is abPOA. In some specific embodiments, the generation module also performs error correction processing on the generated consensus sequence. In some specific embodiments, the error correction processing adopts at least one of gcpp, Racon, Pilon, and NextPolish.
  • a sixth aspect of the embodiments of the present invention provides a sequencing system, comprising:
  • a sequencing module, used to connect the nucleic acid molecule and the characteristic fragment to form a single-stranded circular molecule, and to obtain the sequencing information of the nucleic acid molecule through rolling circle amplification and sequencing;
  • an acquisition module, used to acquire the sequencing information;
  • a recognition module, used to identify the characteristic sequence of the characteristic fragment in the sequencing information based on the known sequence of the characteristic fragment and the Levenshtein distance algorithm;
  • a division module, used to divide the sequencing information according to the characteristic sequence and determine the repeated nucleic acid sequence of the nucleic acid molecule;
  • a generation module, used to generate the consensus sequence according to a set of repeated nucleic acid sequences of nucleic acid molecules.
  • the device or system implementation described above is only illustrative, and the modules described as separate components may or may not be physically separated, that is, they may be located in one place or distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tapes, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and can be accessed by a computer.
  • communication media typically embodies computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
  • the randomly fragmented nucleic acid molecules of E. coli DNA are connected to the characteristic fragment (a hairpin with the sequence CCTTGGCTCACAGAACGACAT) for rolling circle sequencing to obtain the sequencing information of the multi-copy sequences of these nucleic acid molecules. Based on this sequencing information, the consensus sequence of the nucleic acid molecules is identified.
  • the steps are as follows, referring to Figure 3:
  • the known sequence of the characteristic fragment is CCTTGGCTCACAGAACGACAT, and the Levenshtein distance threshold is 7bp.
  • the Python library used in this method to calculate the Levenshtein distance is: python_Levenshtein (0.21.1) (https://maxbachmann.github.io/Levenshtein/levenshtein.html).
  • the specific identification steps are: scanning starts from the head end of the FASTQ read with a step length of 1 and a scanning window equal to the length of the known sequence of the characteristic fragment (21 bp). Each step yields the Levenshtein distance between the sequence to be identified in the current window and the known sequence.
  • if this distance is not greater than the threshold, the sequence to be identified in the window is recorded as a characteristic sequence.
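  • The scanning step described above can be sketched as follows. The known sequence and the 0.3 multiple come from this example; rounding 21 × 0.3 up to 7, the helper names `levenshtein` and `scan_read`, and the plain-Python distance function (used here only so the sketch is self-contained, in place of python_Levenshtein) are assumptions.

```python
import math

KNOWN = "CCTTGGCTCACAGAACGACAT"              # known sequence of the characteristic fragment
THRESHOLD = math.ceil(0.3 * len(KNOWN))      # 21 x 0.3 = 6.3, rounded up to 7 (assumed rounding)


def levenshtein(a, b):
    """Two-row dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def scan_read(read, known=KNOWN, threshold=THRESHOLD):
    """Slide a window of len(known) along the read in steps of 1.

    Returns (start, distance) for every window whose distance to the
    known sequence is not greater than the threshold.
    """
    window = len(known)
    hits = []
    for start in range(len(read) - window + 1):
        distance = levenshtein(read[start:start + window], known)
        if distance <= threshold:
            hits.append((start, distance))
    return hits


# Hypothetical read: the characteristic sequence flanked by unrelated bases.
read = "A" * 30 + KNOWN + "T" * 30
hits = scan_read(read)
print((30, 0) in hits)  # True
```

  • Windows that overlap a true hit can also fall within the threshold, so a scan like this typically reports several neighbouring positions per characteristic sequence; this is the situation the position-based correction is meant to handle.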
  • the characteristic sequence screened by this step is shown in Figure 6.
  • the reads are classified according to the number of characteristic sequences originally identified, and the proportion of each class of reads in the total data volume is obtained. This result reflects the quality of library construction and sequencing.
  • if the length of the first repeated nucleic acid sequence is greater than 1/2 of the length of the second repeated nucleic acid sequence, no change is made. After the correction of the characteristic sequence positions is completed, the distribution of the number of characteristic sequences over all reads is recorded; it is called the original characteristic sequence quantity distribution and is used for downstream analysis.
  • the repeated nucleic acid sequences of the nucleic acid molecule are segmented according to the positions of the characteristic sequences in the read. After N characteristic sequences are identified, N+1 repeated nucleic acid sequences are obtained.
  • the repeated nucleic acid sequences obtained by segmentation at the characteristic sequences are saved in files such as subread1, subread2, etc. according to their position in the read. This facilitates evaluation of the quality of the sorted data.
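  • The segmentation step can be sketched as below: the read is cut at each characteristic sequence, and the pieces in between are returned in read order, so N characteristic sequences give N+1 pieces. The function name `split_read` and the toy sequences are illustrative assumptions; the start positions are assumed to be sorted and already corrected.

```python
def split_read(read, hit_starts, feature_len):
    """Cut a read at each characteristic sequence and return the pieces between them.

    hit_starts: sorted start positions of the characteristic sequences in the read.
    feature_len: length of the characteristic sequence (21 in this example).
    """
    subreads = []
    prev_end = 0
    for start in hit_starts:
        subreads.append(read[prev_end:start])
        prev_end = start + feature_len
    subreads.append(read[prev_end:])
    return subreads


feature = "CCTTGGCTCACAGAACGACAT"
insert = "ACGTACGTAC"  # hypothetical 10 bp fragment, kept short for readability
read = insert + feature + insert + feature + insert[:4]
print(split_read(read, [10, 41], len(feature)))
# ['ACGTACGTAC', 'ACGTACGTAC', 'ACGT']
```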
  • the recorded repetitive nucleic acid sequence (subread) information is analyzed for length and identity according to the position of the repetitive nucleic acid sequence to obtain the overall data quality.
  • The length screening criterion is to retain each subreadk (k>1) whose length is between 0.8 and 1.2 times the length of subread1, and to discard subreads that are shorter than 0.8 times or longer than 1.2 times that length.
  • The similarity screening criterion uses the edlib library (1.3.9) in Python: subreadk is aligned to subread1, and the similarity is the ratio of the number of matched bases in the CIGAR string of the alignment to the total number of bases in subreadk. If the similarity between subreadk and subread1 is greater than 0.8, subreadk is retained; otherwise it is discarded.
  • The number of subreads finally retained after length screening and similarity screening is recorded, giving the distribution of the number of subreads after screening.
  • The proportion of each post-screening subread category in the total number can be obtained.
  • ⁇ 3D means that the number of subreads finally retained in the read length of this category is ⁇ 3, and so on.
  • In Comparative Example 1, a consensus sequence is constructed from the rolling circle sequencing information of Example 1 using the circular consensus sequencing (CCS) process of Pacific Biosciences (see Nucleic Acids Res. 2010 Aug; 38(15): e159).
  • CCS circular consensus sequencing
  • The tool used to identify characteristic sequences with the traditional BLAST-based recognition method is the Python library Biopython (1.81) (https://doi.org/10.1093/bioinformatics/btp163).
  • Recognition of the joint sequence based on the Levenshtein distance in Example 1 takes 97 seconds, whereas the traditional BLAST-based recognition method in Comparative Example 1 takes 20 minutes. Compared with the traditional BLAST-based scheme, the Levenshtein-distance-based recognition therefore needs less than 1/10 of the time. In terms of accuracy, the consensus sequence provided by Example 1 is 99.5% accurate, while the consensus sequence provided by the method of Comparative Example 1 is only 98.8% accurate.
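The segmentation and screening steps described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in Example 1: the function names (`split_subreads`, `matched_bases`, `screen_subreads`) are hypothetical, the characteristic sequence positions are assumed to be already known as (start, end) spans, and a small pure-Python alignment stands in for the edlib call named in the text. Keeping subread1 as the reference in the retained set is also an assumption.

```python
from typing import List, Tuple


def split_subreads(read: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Cut a read at each characteristic-sequence span (start, end-exclusive).

    N spans give N + 1 subreads; the characteristic sequences are dropped.
    """
    subreads, cursor = [], 0
    for start, end in sorted(spans):
        subreads.append(read[cursor:start])
        cursor = end
    subreads.append(read[cursor:])
    return subreads


def matched_bases(query: str, target: str) -> int:
    """Matched columns in an optimal unit-cost global alignment.

    Stand-in for counting '=' operations in an alignment CIGAR. Among
    alignments with minimal edit distance, the one with most matches is used.
    Each cell holds (edit distance, negative match count).
    """
    prev = [(j, 0) for j in range(len(target) + 1)]
    for i, q in enumerate(query, 1):
        curr = [(i, 0)]
        for j, t in enumerate(target, 1):
            d, m = prev[j - 1]
            diag = (d, m - 1) if q == t else (d + 1, m)
            up = (prev[j][0] + 1, prev[j][1])
            left = (curr[j - 1][0] + 1, curr[j - 1][1])
            curr.append(min(diag, up, left))
        prev = curr
    return -prev[-1][1]


def screen_subreads(subreads: List[str], min_ratio: float = 0.8,
                    max_ratio: float = 1.2,
                    min_similarity: float = 0.8) -> List[str]:
    """Apply the length and similarity screens against subread1."""
    reference = subreads[0]
    if not reference:
        return []
    kept = [reference]  # assumption: subread1 itself is retained
    for sub in subreads[1:]:
        # Length screen: 0.8x to 1.2x the length of subread1.
        if not min_ratio * len(reference) <= len(sub) <= max_ratio * len(reference):
            continue
        # Similarity screen: matched bases / total bases in subreadk.
        if matched_bases(sub, reference) / len(sub) > min_similarity:
            kept.append(sub)
    return kept


# Toy read: two full copies of an insert (the second with one mismatch),
# a truncated third copy, and a made-up characteristic sequence "GGCCGG".
read = "ACGTACGTAC" + "GGCCGG" + "ACGTACCTAC" + "GGCCGG" + "ACGT"
subreads = split_subreads(read, [(10, 16), (26, 32)])
retained = screen_subreads(subreads)
print(len(subreads), len(retained))
```

Recording `len(retained)` for every read would give the post-screening subread count distribution mentioned above.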

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Analytical Chemistry (AREA)
  • Organic Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Zoology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Wood Science & Technology (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Molecular Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Disclosed in the present application are a method for determining a consensus sequence of nucleic acid molecules and a use thereof. The method comprises the following steps: acquiring sequencing information of the nucleic acid molecules, the sequencing information comprising multi-copy sequences of library molecules, and the library molecules comprising single-stranded circular molecules formed by ligating the nucleic acid molecules and characteristic fragments; identifying characteristic sequences of the characteristic fragments in the sequencing information on the basis of a Levenshtein distance algorithm and by means of known sequences of the characteristic fragments; dividing the sequencing information according to the characteristic sequences, and determining repetitive nucleic acid sequences of the nucleic acid molecules; and generating the consensus sequence according to a set of the repetitive nucleic acid sequences of the nucleic acid molecules.
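As a rough illustration of the identification step in the abstract, the sketch below locates approximate occurrences of a known characteristic sequence in a read by Levenshtein distance, using a semi-global dynamic program in which the match may start anywhere in the read. The function name `find_feature`, the threshold `max_dist`, and the greedy selection of non-overlapping hits are assumptions made for this example; the source does not specify how the distance threshold or overlapping candidates are handled.

```python
from typing import List, Tuple


def find_feature(read: str, feature: str,
                 max_dist: int) -> List[Tuple[int, int, int]]:
    """Return (start, end, distance) spans where `feature` occurs in `read`
    with Levenshtein distance <= max_dist (end is exclusive).

    Assumes max_dist < len(feature), so every reported span is non-empty.
    """
    # Row 0: an empty prefix of the feature matches at any read position
    # with cost 0; each cell carries the read index where its match starts.
    prev = [(0, j) for j in range(len(read) + 1)]
    for i in range(1, len(feature) + 1):
        curr = [(i, 0)]
        for j in range(1, len(read) + 1):
            cost = 0 if feature[i - 1] == read[j - 1] else 1
            diag = (prev[j - 1][0] + cost, prev[j - 1][1])  # match/substitute
            up = (prev[j][0] + 1, prev[j][1])               # skip a feature base
            left = (curr[j - 1][0] + 1, curr[j - 1][1])     # skip a read base
            curr.append(min(diag, up, left))
        prev = curr

    # Last row: best distance for a match ending at each read position.
    candidates = sorted((d, s, e) for e, (d, s) in enumerate(prev)
                        if d <= max_dist)
    hits: List[Tuple[int, int, int]] = []
    for d, s, e in candidates:  # best candidates first
        if all(e <= hs or s >= he for hs, he, _ in hits):
            hits.append((s, e, d))
    return sorted(hits)


# Toy read: the second copy of the made-up feature carries one sequencing error.
read = "ACGTACGTAC" + "GGCCGG" + "ACGTACCTAC" + "GGCAGG" + "ACGT"
for start, end, dist in find_feature(read, "GGCCGG", max_dist=1):
    print(start, end, dist, read[start:end])
```

The returned spans are the positions at which the sequencing information would then be divided into repetitive nucleic acid sequences.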
PCT/CN2023/142432 2023-12-27 2023-12-27 Procédé de détermination d'une séquence consensus de molécules d'acide nucléique et utilisation Pending WO2025137944A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2023/142432 WO2025137944A1 (fr) 2023-12-27 2023-12-27 Procédé de détermination d'une séquence consensus de molécules d'acide nucléique et utilisation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2023/142432 WO2025137944A1 (fr) 2023-12-27 2023-12-27 Procédé de détermination d'une séquence consensus de molécules d'acide nucléique et utilisation

Publications (1)

Publication Number Publication Date
WO2025137944A1 true WO2025137944A1 (fr) 2025-07-03

Family

ID=96216504

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/142432 Pending WO2025137944A1 (fr) 2023-12-27 2023-12-27 Procédé de détermination d'une séquence consensus de molécules d'acide nucléique et utilisation

Country Status (1)

Country Link
WO (1) WO2025137944A1 (fr)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102076871A (zh) * 2008-11-07 2011-05-25 财团法人工业技术研究院 精确序列信息及修饰碱基位置确定的方法
CN104357563A (zh) * 2014-10-30 2015-02-18 东南大学 二次dna片段化的基因组单倍型高通量测序方法
CN110062809A (zh) * 2016-12-20 2019-07-26 豪夫迈·罗氏有限公司 用于环状共有序列测序的单链环状dna文库
CN116497103A (zh) * 2017-01-18 2023-07-28 伊鲁米那股份有限公司 制备测序衔接子的方法和对核酸分子进行测序的方法
CN110219054A (zh) * 2018-03-04 2019-09-10 清华大学 一种核酸测序文库及其构建方法
CN112739829A (zh) * 2018-09-27 2021-04-30 深圳华大生命科学研究院 测序文库的构建方法和得到的测序文库及测序方法
CN116732136A (zh) * 2023-05-17 2023-09-12 安序源生物科技(深圳)有限公司 文库制备方法和测序方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23962629

Country of ref document: EP

Kind code of ref document: A1