CA3052772A1 - Method and systems for the reconstruction of genomic reference sequences from compressed genomic sequence reads
- Publication number
- CA3052772A1
- Authority
- CA
- Canada
- Prior art keywords
- descriptor
- contig
- binarization
- sequence
- reads
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/30—Data warehousing; Computing architectures
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Medical Informatics (AREA)
- Biophysics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Bioethics (AREA)
- Public Health (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Epidemiology (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention relates to a method and an apparatus for representing a reference genome in terms of syntax elements that describe the differences between said reference genome and aligned genomic sequences. Said genomic sequences have previously been aligned to said reference genome. Each aligned genomic sequence is described by a subset of syntax elements. The syntax elements describing all genomic sequences are partitioned into blocks according to their statistical properties. Each block of syntax elements is entropy coded. The entropy-coded blocks are then concatenated to form a compressed bitstream. The differences between the reference genome and the aligned sequences are likewise expressed in terms of syntax elements. These syntax elements are partitioned into blocks according to their statistical properties, and each block is entropy coded. The entropy-coded syntax elements are then embedded in the bitstream of coded blocks of syntax elements describing the aligned reads. The disclosed method enables the reconstruction of the reference genome used for alignment when the compressed genomic sequences are decoded, while preserving different options for random access to the compressed data and enabling efficient compression.
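The pipeline described in the abstract (group syntax elements into blocks by descriptor, entropy code each block, concatenate the blocks into one bitstream, then rebuild the reference from decoded reads plus coded differences) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and does not come from the patent text: the descriptor names `pos` and `mismatch_pos`, the block header layout, the function names, and the use of `zlib` (DEFLATE) as a stand-in for the entropy coder.

```python
import struct
import zlib
from collections import defaultdict


def encode_blocks(elements):
    """Group (descriptor, value) pairs by descriptor and compress each group.

    Values must be integers in [0, 2**32). Descriptor names are hypothetical.
    """
    blocks = defaultdict(list)
    for name, value in elements:
        blocks[name].append(value)

    stream = bytearray()
    for name in sorted(blocks):
        values = blocks[name]
        raw = struct.pack(f"<{len(values)}I", *values)
        payload = zlib.compress(raw)  # stand-in for a real entropy coder
        tag = name.encode("ascii")
        # Assumed block header: 1-byte tag length, tag, 4-byte payload length.
        stream += struct.pack("<B", len(tag)) + tag
        stream += struct.pack("<I", len(payload)) + payload
    return bytes(stream)


def decode_blocks(bitstream):
    """Split the concatenated bitstream back into per-descriptor value lists."""
    blocks = {}
    pos = 0
    while pos < len(bitstream):
        (tag_len,) = struct.unpack_from("<B", bitstream, pos)
        pos += 1
        name = bitstream[pos:pos + tag_len].decode("ascii")
        pos += tag_len
        (size,) = struct.unpack_from("<I", bitstream, pos)
        pos += 4
        raw = zlib.decompress(bitstream[pos:pos + size])
        pos += size
        blocks[name] = list(struct.unpack(f"<{len(raw) // 4}I", raw))
    return blocks


def rebuild_reference(length, reads, differences):
    """Rebuild a reference string from aligned reads plus coded differences.

    reads: list of (start, sequence) pairs, already decoded.
    differences: dict mapping a position to the reference base at that
    position, for positions where the reads disagree with the reference.
    Positions covered by neither source stay 'N' (unknown).
    """
    reference = ["N"] * length
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            reference[start + offset] = base
    for position, base in differences.items():
        reference[position] = base
    return "".join(reference)


if __name__ == "__main__":
    elements = [("pos", 0), ("pos", 4), ("mismatch_pos", 2), ("pos", 9)]
    print(decode_blocks(encode_blocks(elements)))
    print(rebuild_reference(8, [(0, "ACGT"), (4, "TTGA")], {2: "C"}))
```

Keeping each descriptor in its own block means values with similar statistics are compressed together, and a decoder can skip blocks it does not need by reading only the length headers, which is one simple way to picture the random-access property the abstract mentions.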
Applications Claiming Priority (10)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2016/074311 WO2018068830A1 (en) | 2016-10-11 | 2016-10-11 | Method and system for the transmission of bioinformatics data |
| PCT/EP2016/074297 WO2018068827A1 (en) | 2016-10-11 | 2016-10-11 | Efficient data structures for the representation of bioinformatics information |
| PCT/EP2016/074301 WO2018068828A1 (en) | 2016-10-11 | 2016-10-11 | Method and system for storing and accessing bioinformatics data |
| PCT/EP2016/074307 WO2018068829A1 (en) | 2016-10-11 | 2016-10-11 | Method and apparatus for a compact representation of bioinformatics data |
| EP2017017841 | 2017-02-14 | | |
| USPCT/US2017/017842 | 2017-02-14 | | |
| PCT/US2017/017842 WO2018071055A1 (en) | 2016-10-11 | 2017-02-14 | Method and apparatus for the compact representation of bioinformatics data |
| USPCT/US2017/041579 | 2017-07-11 | | |
| PCT/US2017/041579 WO2018071078A1 (en) | 2016-10-11 | 2017-07-11 | Method and apparatus for the access to bioinformatics data structured in access units |
| PCT/US2017/066458 WO2018151786A1 (en) | 2016-10-11 | 2017-12-14 | Method and systems for the reconstruction of genomic reference sequences from compressed genomic sequence reads |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CA3052772A1 (fr) | 2018-08-23 |
Family
ID=67769776
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CA3052772A Abandoned CA3052772A1 (en) | 2016-10-11 | 2017-12-14 | Method and systems for the reconstruction of genomic reference sequences from compressed genomic sequence reads |
Country Status (2)
| Country | Link |
|---|---|
| AU (1) | AU2017399715A1 (fr) |
| CA (1) | CA3052772A1 (fr) |
2017
- 2017-12-14: AU, application AU2017399715A, published as AU2017399715A1 (en), not active, abandoned
- 2017-12-14: CA, application CA3052772A, published as CA3052772A1 (fr), not active, abandoned
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
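| CN113285720A (en) * | 2021-05-28 | 2021-08-20 | 中科计算技术西部研究院 | Lossless compression method for genetic data, integrated circuit, and lossless compression device |
| CN113285720B (en) * | 2021-05-28 | 2023-07-07 | 中科计算技术西部研究院 | Lossless compression method for genetic data, integrated circuit, and lossless compression device |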
Also Published As
| Publication number | Publication date |
|---|---|
| AU2017399715A1 (en) | 2019-10-10 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20190385702A1 (en) | | Method and systems for the reconstruction of genomic reference sequences from compressed genomic sequence reads |
| EP3583249B1 (en) | | Method and systems for the reconstruction of genomic reference sequences from compressed genomic sequence reads |
| JP2020509474A (en) | | Method and systems for reconstructing genomic reference sequences from compressed genomic sequence reads |
| CA3052824A1 (en) | | Method and apparatus for the compact representation of bioinformatics data using multiple genomic descriptors |
| CA3052772A1 (en) | | Method and systems for the reconstruction of genomic reference sequences from compressed genomic sequence reads |
| JP7324145B2 (en) | | Method and systems for the efficient compression of genomic sequence reads |
| CN110663022A (en) | | Method and apparatus for the compact representation of bioinformatics data using multiple genomic descriptors |
| NZ757185B2 (en) | | Method and apparatus for the compact representation of bioinformatics data using multiple genomic descriptors |
| HK40014527B (en) | | Method and systems for the efficient compression of genomic sequence reads |
| HK40014527A (en) | | Method and systems for the efficient compression of genomic sequence reads |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | FZDE | Discontinued | Effective date: 20230614 |