WO2004070029A1 - Method to encode a dna sequence and to compress a dna sequence - Google Patents
- Publication number
- WO2004070029A1 (PCT/KR2003/001093)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- dna sequence
- bases
- encoding
- encoded
- byte
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- E—FIXED CONSTRUCTIONS
- E04—BUILDING
- E04G—SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
- E04G1/00—Scaffolds primarily resting on the ground
- E04G1/28—Scaffolds primarily resting on the ground designed to provide support only at a low height
- E04G1/32—Other free-standing supports, e.g. using trestles
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/3084—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
-
- E—FIXED CONSTRUCTIONS
- E04—BUILDING
- E04G—SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
- E04G1/00—Scaffolds primarily resting on the ground
- E04G1/15—Scaffolds primarily resting on the ground essentially comprising special means for supporting or forming platforms; Platforms
-
- E—FIXED CONSTRUCTIONS
- E04—BUILDING
- E04G—SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
- E04G1/00—Scaffolds primarily resting on the ground
- E04G1/15—Scaffolds primarily resting on the ground essentially comprising special means for supporting or forming platforms; Platforms
- E04G2001/155—Platforms with an access hatch for getting through from one level to another
-
- E—FIXED CONSTRUCTIONS
- E04—BUILDING
- E04G—SCAFFOLDING; FORMS; SHUTTERING; BUILDING IMPLEMENTS OR AIDS, OR THEIR USE; HANDLING BUILDING MATERIALS ON THE SITE; REPAIRING, BREAKING-UP OR OTHER WORK ON EXISTING BUILDINGS
- E04G1/00—Scaffolds primarily resting on the ground
- E04G1/28—Scaffolds primarily resting on the ground designed to provide support only at a low height
- E04G1/30—Ladder scaffolds
- E04G2001/302—Ladder scaffolds with ladders supporting the platform
- E04G2001/305—The ladders being vertical and perpendicular to the platform
Definitions
- the present invention relates to a method for encoding a DNA sequence and a method for compressing a DNA sequence, and particularly to a method for encoding a DNA sequence by expressing each of the 4 types of DNA bases, adenine (A), guanine (G), cytosine (C) and thymine (T), in 2 bits, and a method for compressing the encoded DNA sequence using a common data compression method to increase compression efficiency.
- A adenine
- G guanine
- C cytosine
- T thymine
- DNA sequences of various living organisms are being analyzed, and research on methods to effectively express and compress the DNA sequences is in progress.
- it takes at least 2 bits per base when a DNA base sequence is compressed using common text compression software such as WinZip and ARJ.
- the present invention has been made in view of the foregoing problems, and considering that a DNA sequence comprises 4 types of bases, namely adenine (A), guanine (G), cytosine (C) and thymine (T), it is an object of the present invention to provide a method for encoding a DNA sequence by expressing each base of the DNA sequence in a 2-bit unit and a method for compressing the encoded DNA sequence to improve compression efficiency and compression rate.
- the present invention provides a method for encoding a DNA sequence comprising the steps of: encoding bases of the DNA sequence comprising adenine (A), guanine (G), cytosine (C) and thymine (T), into 2 bits; forming one byte with a predetermined number of the encoded bases; and forming a DNA sequence in the byte unit.
- a method for compressing a DNA sequence comprising the steps of: encoding DNA bases comprising adenine (A), guanine (G), cytosine (C) and thymine (T) into 2 bits, respectively; forming one byte with a predetermined number of the encoded bases; forming a DNA sequence in the byte unit; and compressing the DNA sequence using a data compression method.
- Fig. 1 is a view schematically showing the encoding of DNA bases
- Fig. 2 shows an embodiment of the method for encoding a DNA sequence according to the present invention
- Fig. 3 shows an embodiment of the method for compressing a DNA sequence according to the present invention.
- each base of a DNA sequence can be encoded into 2 bits. That is, each base is one of 4 characters, adenine (A), guanine (G), cytosine (C) and thymine (T), which are expressed as the 2-bit values 00, 01, 10 and 11, respectively. This assignment of 00, 01, 10 and 11 to adenine (A), guanine (G), cytosine (C) and thymine (T) is just an example.
- the bases may be assigned any 2-bit values, provided the values differ from each other (e.g., 01, 11, 00, 10).
- each base is set to be encoded into 2 bits.
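
The 2-bit assignment described above can be sketched as a small lookup table. This is a minimal illustration, not the patent's implementation: the names `BASE_TO_BITS`, `encode_base` and `is_valid_table` are hypothetical, and the table uses the example values (A=00, G=01, C=10, T=11) from the text.

```python
# Hypothetical code table using the example assignment from the text:
# A=00, G=01, C=10, T=11. Any other one-to-one assignment would also do.
BASE_TO_BITS = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}


def encode_base(base: str) -> int:
    """Return the 2-bit code (an int in 0..3) for a single base."""
    try:
        return BASE_TO_BITS[base.upper()]
    except KeyError:
        raise ValueError(f"not a DNA base: {base!r}") from None


def is_valid_table(table: dict) -> bool:
    """Check that the four bases map to four distinct 2-bit values."""
    return set(table) == set("AGCT") and sorted(table.values()) == [0, 1, 2, 3]
```

The alternative assignment mentioned in the text (01, 11, 00, 10) passes the same validity check, since the only requirement is that the four values differ.
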
- Bases of the DNA sequence to be encoded (hereinafter referred to as the "target DNA sequence") are gathered in a predetermined number to form one byte.
- the encoded final DNA sequence is expressed in byte units.
- the number of bases included in one byte may be 1, 2, 3 or 4.
- the remaining bits are filled with a predetermined value.
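
A rough sketch of the byte-forming step follows, under several assumptions the text does not settle: the 2-bit codes are placed from the most significant bit downward, the fill bits occupy the low end of the byte, the "predetermined value" is all zeros or all ones, and a short final group is padded the same way. The function name `pack_bases` and its parameters are hypothetical.

```python
# Example code table from the text (A=00, G=01, C=10, T=11).
BASE_TO_BITS = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}


def pack_bases(seq: str, bases_per_byte: int = 4, fill_bit: int = 0) -> bytes:
    """Pack 1 to 4 bases into each byte, filling unused low bits with fill_bit."""
    if bases_per_byte not in (1, 2, 3, 4):
        raise ValueError("bases_per_byte must be 1, 2, 3 or 4")
    out = bytearray()
    for start in range(0, len(seq), bases_per_byte):
        group = seq[start:start + bases_per_byte]
        value = 0
        for base in group:
            value = (value << 2) | BASE_TO_BITS[base]
        pad = 8 - 2 * len(group)       # number of bits left unused in this byte
        value <<= pad                  # assumption: codes are left-aligned
        if fill_bit:
            value |= (1 << pad) - 1    # assumption: fill the low bits with ones
        out.append(value)
    return bytes(out)
```

With 4 bases per byte no fill bits are needed; with 2 bases per byte, for instance, half of every byte is fill, which is why the 4-base packing is the densest of the options listed.
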
- taking the target sequence 10, "CACGACGTTGTA", as an example in which 4 bases form one byte, the respective procedures are explained.
- the first four bases (CACG) of the target sequence 10 are encoded into 2 bits each to form one byte (S21, S22).
- the encoded byte is added to a temporary DNA sequence (S23).
- the next four bases (ACGT) are then encoded into 2 bits each to form one byte (S22).
- the encoded byte is added to the temporary DNA sequence (S23). The temporary DNA sequence is then "1000100100100111".
- the target DNA sequence still contains bases to be encoded (S24) and again undergoes the step S21.
- the four bases (TGTA) are encoded into 2 bits to form one byte (S22).
- the encoded byte is added to the temporary DNA sequence (S23). Then, the temporary DNA sequence becomes "100010010010011111011100". All the bases of the target DNA sequence are encoded and the process is ended (S24). Here, the information of the temporary DNA sequence is an encoded final DNA sequence 20.
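
The walk-through of steps S21 to S24 can be reproduced with a short loop. This is a sketch under stated assumptions: the step labels in the comments reflect one reading of the description, the function name `encode_sequence` is hypothetical, and the sequence length is assumed to be a multiple of 4 so that no fill bits are needed.

```python
# Example code table from the text, written as bit strings for readability.
BASE_TO_BITS = {"A": "00", "G": "01", "C": "10", "T": "11"}


def encode_sequence(target: str) -> str:
    """Encode a target DNA sequence, 4 bases per byte, as a string of bits."""
    if len(target) % 4:
        raise ValueError("this sketch assumes a length that is a multiple of 4")
    temporary = ""                                # the temporary DNA sequence
    position = 0
    while position < len(target):                 # S24: bases left to encode?
        group = target[position:position + 4]     # S21: take the next four bases
        byte_bits = "".join(BASE_TO_BITS[b] for b in group)  # S22: form one byte
        temporary += byte_bits                    # S23: add it to the temporary sequence
        position += 4
    return temporary                              # the encoded final DNA sequence


print(encode_sequence("CACGACGTTGTA"))  # 100010010010011111011100
```

The 12-base example occupies 24 bits (3 bytes), compared with 12 bytes when each base is stored as one 8-bit character.
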
- the steps S21 to S24 are the same as the procedures described above and shown in Fig. 2.
- the encoded DNA sequence (the temporary DNA sequence in the example) is finally compressed by a compression method (S25).
- the compression method which can be used in the present invention includes any of the text compression methods which have already been developed and used.
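
As an illustration of step S25, the encoded bytes can be handed to any general-purpose compressor. The sketch below uses Python's `zlib` (DEFLATE) purely as a stand-in; the text does not name a specific method beyond existing ones. The helper names `encode_to_bytes` and `compress_dna` are hypothetical, and the sequence length is assumed to be a multiple of 4.

```python
import zlib

# Example code table from the text (A=00, G=01, C=10, T=11).
BASE_TO_BITS = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}


def encode_to_bytes(seq: str) -> bytes:
    """Encode 4 bases per byte; assumes len(seq) is a multiple of 4."""
    if len(seq) % 4:
        raise ValueError("this sketch assumes a length that is a multiple of 4")
    out = bytearray()
    for start in range(0, len(seq), 4):
        value = 0
        for base in seq[start:start + 4]:
            value = (value << 2) | BASE_TO_BITS[base]
        out.append(value)
    return bytes(out)


def compress_dna(seq: str, level: int = 9) -> bytes:
    """Encode the sequence (S21 to S24), then compress the bytes (S25)."""
    return zlib.compress(encode_to_bytes(seq), level)


encoded = encode_to_bytes("CACGACGTTGTA")
compressed = compress_dna("CACGACGTTGTA")
# The compressor is lossless, so decompressing recovers the encoded bytes.
assert zlib.decompress(compressed) == encoded
```

For a sequence this short the compressor's own header overhead outweighs any gain; a benefit from the second stage would only be expected on long sequences with repeated content.
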
- the method for encoding a DNA sequence and the method for compressing a DNA sequence can be preferably performed in the form of a computer program. Therefore, the present invention includes a recording medium which is readable by a computer having computer programs recorded, in which the programs can carry out respective steps of the method for encoding a DNA sequence and the method for compressing a DNA sequence. While the present invention has been described with reference to the particular illustrative embodiments, it is not to be restricted by the embodiments but only by the appended claims. It is to be appreciated that those skilled in the art can change or modify the embodiments without departing from the scope and spirit of the present invention.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Theoretical Computer Science (AREA)
- Architecture (AREA)
- Biotechnology (AREA)
- Health & Medical Sciences (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Medical Informatics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Mechanical Engineering (AREA)
- Structural Engineering (AREA)
- Civil Engineering (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU2003232661A AU2003232661A1 (en) | 2003-02-07 | 2003-06-04 | Method to encode a dna sequence and to compress a dna sequence |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR1020030007920A KR20040071993A (en) | 2003-02-07 | 2003-02-07 | Method to encode a DNA sequence and to compress a DNA sequence |
| KR10-2003-0007920 | 2003-02-07 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2004070029A1 (en) | 2004-08-19 |
Family
ID=32844797
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/KR2003/001093 Ceased WO2004070029A1 (en) | 2003-02-07 | 2003-06-04 | Method to encode a dna sequence and to compress a dna sequence |
Country Status (3)
| Country | Link |
|---|---|
| KR (1) | KR20040071993A (en) |
| AU (1) | AU2003232661A1 (en) |
| WO (1) | WO2004070029A1 (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2010108929A3 (en) * | 2009-03-23 | 2010-11-25 | Intresco B.V. | Methods for providing a set of symbols uniquely distinguishing an organism such as a human individual |
| CN105550535A (en) * | 2015-12-03 | 2016-05-04 | 人和未来生物科技(长沙)有限公司 | Encoding method for rapidly encoding gene character sequence into binary sequence |
| US10902937B2 (en) | 2014-02-12 | 2021-01-26 | International Business Machines Corporation | Lossless compression of DNA sequences |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR101253700B1 (en) * | 2010-11-26 | 2013-04-12 | 가천대학교 산학협력단 | High Speed Encoding Apparatus for the Next Generation Sequencing Data and Method therefor |
| KR101922129B1 (en) | 2011-12-05 | 2018-11-26 | 삼성전자주식회사 | Method and apparatus for compressing and decompressing genetic information using next generation sequencing(NGS) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH08123498A (en) * | 1994-10-21 | 1996-05-17 | Nippon Telegr & Teleph Corp <Ntt> | Waveform data compression method |
| US5651099A (en) * | 1995-01-26 | 1997-07-22 | Hewlett-Packard Company | Use of a genetic algorithm to optimize memory space |
| US5706498A (en) * | 1993-09-27 | 1998-01-06 | Hitachi Device Engineering Co., Ltd. | Gene database retrieval system where a key sequence is compared to database sequences by a dynamic programming device |
| US5727130A (en) * | 1995-08-31 | 1998-03-10 | Motorola, Inc. | Genetic algorithm for constructing and tuning fuzzy logic system |
| US5838964A (en) * | 1995-06-26 | 1998-11-17 | Gubser; David R. | Dynamic numeric compression methods |
| KR20020040406A (en) * | 2000-11-24 | 2002-05-30 | 김응수 | A method of compressing and storing data based on genetic code |
-
2003
- 2003-02-07 KR KR1020030007920A patent/KR20040071993A/en not_active Ceased
- 2003-06-04 AU AU2003232661A patent/AU2003232661A1/en not_active Abandoned
- 2003-06-04 WO PCT/KR2003/001093 patent/WO2004070029A1/en not_active Ceased
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5706498A (en) * | 1993-09-27 | 1998-01-06 | Hitachi Device Engineering Co., Ltd. | Gene database retrieval system where a key sequence is compared to database sequences by a dynamic programming device |
| JPH08123498A (en) * | 1994-10-21 | 1996-05-17 | Nippon Telegr & Teleph Corp <Ntt> | Waveform data compression method |
| US5651099A (en) * | 1995-01-26 | 1997-07-22 | Hewlett-Packard Company | Use of a genetic algorithm to optimize memory space |
| US5838964A (en) * | 1995-06-26 | 1998-11-17 | Gubser; David R. | Dynamic numeric compression methods |
| US5727130A (en) * | 1995-08-31 | 1998-03-10 | Motorola, Inc. | Genetic algorithm for constructing and tuning fuzzy logic system |
| KR20020040406A (en) * | 2000-11-24 | 2002-05-30 | 김응수 | A method of compressing and storing data based on genetic code |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2010108929A3 (en) * | 2009-03-23 | 2010-11-25 | Intresco B.V. | Methods for providing a set of symbols uniquely distinguishing an organism such as a human individual |
| US9607127B2 (en) | 2009-03-23 | 2017-03-28 | Jan Jaap Nietfeld | Methods for providing a set of symbols uniquely distinguishing an organism such as a human individual |
| NL2003311C2 (en) * | 2009-07-30 | 2011-02-02 | Intresco B V | Method for producing a biological pin code. |
| US10902937B2 (en) | 2014-02-12 | 2021-01-26 | International Business Machines Corporation | Lossless compression of DNA sequences |
| CN105550535A (en) * | 2015-12-03 | 2016-05-04 | 人和未来生物科技(长沙)有限公司 | Encoding method for rapidly encoding gene character sequence into binary sequence |
Also Published As
| Publication number | Publication date |
|---|---|
| KR20040071993A (en) | 2004-08-16 |
| AU2003232661A1 (en) | 2004-08-30 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP7586880B2 (en) | Nucleic acid-based data storage | |
| CN110945595B (en) | DNA-based data storage and retrieval | |
| US11379729B2 (en) | Nucleic acid-based data storage | |
| AU2018247323B2 (en) | High-Capacity Storage of Digital Information in DNA | |
| CN109830263B (en) | DNA storage method based on oligonucleotide sequence coding storage | |
| AU2019270159A1 (en) | Compositions and methods for nucleic acid-based data storage | |
| WO2015193140A1 (en) | Method and apparatus for encoding information units in code word sequences avoiding reverse complementarity | |
| KR100537523B1 (en) | Apparatus for encoding DNA sequence and method of the same | |
| WO2004070029A1 (en) | Method to encode a dna sequence and to compress a dna sequence | |
| Goel | A compression algorithm for DNA that uses ASCII values | |
| CN114341988A (en) | Methods for compressing genomic sequence data | |
| TW202008302A (en) | DNA-based data access by converting the input data into a set of nucleotide sequences and synthesizing a set of nucleic acids including the set of nucleotide sequences | |
| Venugopal et al. | Probabilistic Approach for DNA Compression | |
| HK40116173A (en) | Nucleic acid-based data storage | |
| JP2025509231A (en) | Combinatorial permutation and searching of nucleic acid-based data stores | |
| Rani | M.: A new referential method for compressing genomes | |
| Wang et al. | DNA Digital Data Storage based on Distributed Method | |
| 최영재 | High Information Capacity and Low Cost DNA-based Data Storage through Additional Encoding Characters | |
| KR20210056822A (en) | Method of compression and transmision for fastq genome data | |
| EP3098742A1 (en) | Method and apparatus for creating a plurality of oligos with a targeted distribution of nucleotide types | |
| HK40015249B (en) | Nucleic acid-based data storage |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AK | Designated states | Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
| | AL | Designated countries for regional patents | Kind code of ref document: A1 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| | 122 | Ep: pct application non-entry in european phase | |
| | NENP | Non-entry into the national phase | Ref country code: JP |
| | WWW | Wipo information: withdrawn in national office | Country of ref document: JP |