CN109326325B

CN109326325B - Method, system and related assembly for gene sequence comparison

Info

Publication number: CN109326325B
Application number: CN201810827902.0A
Authority: CN
Inventors: 赵健; 史宏志; 李龙; 崔星辰; 尹云峰
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2018-07-25
Filing date: 2018-07-25
Publication date: 2022-02-18
Anticipated expiration: 2038-07-25
Also published as: CN109326325A

Abstract

The present application discloses a method for comparing gene sequences. The method includes determining a sequence to be compared and a reference sequence, and determining a pointer table and a candidate comparison position table according to the reference sequence; extracting a seed sequence of a preset length from the sequence to be compared , and determine the index value of the seed sequence; according to the index value of the seed sequence, query the pointer table and the candidate comparison position table to determine whether there is position information corresponding to the seed sequence in the reference sequence; The extension operation is performed on the sequence to be extended based on the reference sequence to obtain an intermediate sequence, and the gene sequence alignment result is obtained by comparing the intermediate sequence and the sequence to be compared. The method can reduce the storage space of the table storing the index value of the reference sequence and improve the gene sequence comparison rate. The present application also discloses a gene sequence comparison system, a computer-readable storage medium and an electronic device, which have the above beneficial effects.

Description

Method, system and related assembly for gene sequence comparison

Technical Field

The invention relates to the interdisciplinary field of computer technology and biological gene technology, in particular to a method and a system for comparing gene sequences, electronic equipment and a computer readable storage medium.

Background

The strategy of gene sequence alignment usually mainly includes two stages of seed (seed sequence) search and seed extension. The seed search stage mainly searches for a subsequence of a certain length on the read (sequences to be aligned) that can exactly match exactly on the reference sequence, i.e., seed. The expansion stage expands the searched seed on the seed sequence of the reference sequence. And finally, selecting the optimal comparison result according to the scoring condition on each seed sequence.

In the prior art, in order to enable the seed to be quickly and accurately matched to the reference sequence, the reference sequence may be pre-indexed, and the position of the seed on the reference sequence is directly obtained according to the index value in the stage of searching the seed. The index table basic unit of the reference sequence is < index, pos >, the length of the index can be selected according to the strategy of the index, and pos is the position of the seed on the reference sequence. However, since the index table needs to be built bit by bit over the entire reference sequence, the required storage space is huge. Therefore, the access rate of comparing the access of the index table to the random access by using an FPGA (Field-Programmable Gate Array) according to the index value is lower.

Therefore, how to reduce the storage space of the table for storing the reference sequence index values and increase the gene sequence alignment rate is a technical problem that needs to be solved by those skilled in the art.

Disclosure of Invention

An object of the present application is to provide a method, a system, an electronic device, and a computer-readable storage medium for comparing gene sequences, which can reduce the storage space of a table storing reference sequence index values and improve the comparison rate of the index values of the gene sequences.

In order to solve the above technical problems, the present application provides a method for gene sequence alignment, comprising:

determining a sequence to be compared and a reference sequence, and determining a pointer table and a candidate comparison position table according to the reference sequence; the candidate comparison position table is a table which describes the corresponding relation between the position information of the reference subsequence on the reference sequence and the last M bits of the R bit index value of the reference subsequence, the position information of the reference subsequence with the same front R-M bits of the index value is continuously arranged, and the pointer table is a table which describes the offset addresses of the reference subsequence with the same front N bits of the index value and different from the (N + 1) th bit to the (R-M) th bit in the candidate comparison position table; the reference subsequence is obtained by extracting the reference sequence bit by bit according to a preset length;

extracting a seed sequence with a preset length from a sequence to be compared, and determining an index value of the seed sequence;

inquiring the pointer table and the candidate comparison position table according to the index value of the seed sequence to judge whether the position information corresponding to the seed sequence exists in the reference sequence;

if so, taking the seed sequence as a sequence to be expanded, performing expansion operation on the sequence to be expanded by taking the reference sequence as a reference according to the position information to obtain an intermediate sequence, and comparing the intermediate sequence with the sequence to be compared to obtain a gene sequence comparison result.

Optionally, determining the pointer table and the candidate comparison location table according to the reference sequence includes:

extracting the reference sub-sequence with a preset length from the reference sequence bit by bit, recording the position information of the reference sub-sequence, and converting the reference sub-sequence into an index value;

carrying out byte division on the index value according to a preset rule to obtain a first sub-index value, a second sub-index value and a third sub-index value; wherein the first sub-index value is the first N bits of the index value, the third sub-index value is the last M bits of the index value, and the second sub-index value is the (N + 1) th bit to the (R-M) th bit of the index value;

constructing rows and columns of a pointer table according to the first sub-index value and the second sub-index value; each row of the pointer table corresponds to a different first sub-index value, each column corresponds to a different second sub-index value, and the row number P and the column number Q of the pointer table meet a preset relationship; wherein the preset relationship is P-2^N，Q＝2^R-M-N；

Constructing the candidate comparison position table according to the third sub-index value and the position information of the reference sub-sequence corresponding to the third sub-index value on the reference sequence according to a preset sorting rule;

and constructing the pointer table according to the sorting rule and the number of the reference subsequences with the same front R-M bits and different back M bits of the index value.

Optionally, querying the pointer table and the candidate comparison position table according to the index value of the seed sequence to determine whether the reference sequence has the position information corresponding to the seed sequence, including:

inquiring a target offset address in the pointer table according to the first R-M bits of the index value of the seed sequence;

inquiring the candidate comparison position table according to the target offset address, and judging whether the candidate comparison position table has M rear bits of the index value of the seed sequence;

and if so, judging that the position information of the seed sequence exists in the reference sequence.

Optionally, the sorting rule is dictionary sorting.

Optionally, the index value of the reference subsequence is a hash value corresponding to the reference subsequence after hash processing.

Optionally, if the number of bits of the hash value is 40, N is 26, and M is 10.

Optionally, performing an expansion operation on the seed sequence as a sequence to be expanded to obtain an intermediate sequence, including:

setting the seed sequence as the sequence to be spread;

acquiring a pre-set digit of the index value of the sequence to be expanded as an expansion module index value;

and inquiring a corresponding expansion module according to the index value of the expansion module, and inputting the sequence to be expanded into the expansion module to obtain the intermediate sequence.

The present application also provides a system for gene sequence alignment, the system comprising:

the index table determining module is used for determining a sequence to be compared and a reference sequence and determining a pointer table and a candidate comparison position table according to the reference sequence; the candidate comparison position table is a table which describes the corresponding relation between the position information of the reference subsequence on the reference sequence and the last M bits of the R bit index value of the reference subsequence, the position information of the reference subsequence with the same front R-M bits of the index value is continuously arranged, and the pointer table is a table which describes the offset addresses of the reference subsequence with the same front N bits of the index value and different from the (N + 1) th bit to the (R-M) th bit in the candidate comparison position table; the reference subsequence is obtained by extracting the reference sequence bit by bit according to a preset length;

the extraction module is used for extracting a seed sequence with a preset length from the sequence to be compared and determining an index value of the seed sequence;

the query module is used for querying the pointer table and the candidate comparison position table according to the index value of the seed sequence to judge whether the position information corresponding to the seed sequence exists in the reference sequence; if yes, starting a working process of the comparison module;

and the comparison module is used for performing expansion operation on the seed sequence serving as a sequence to be expanded to obtain an intermediate sequence, and comparing the intermediate sequence with the sequence to be compared to obtain a gene sequence comparison result.

The present application also provides a computer readable storage medium having stored thereon a computer program which, when executed, performs the steps of the above-described method of gene sequence alignment.

The application also provides an electronic device, which comprises a memory and a processor, wherein the memory is stored with a computer program, and the processor calls the computer program in the memory to realize the steps executed by the method for comparing the gene sequences.

The invention provides a method for comparing gene sequences, which comprises the steps of determining a sequence to be compared and a reference sequence, and determining a pointer table and a candidate comparison position table according to the reference sequence; the candidate comparison position table is a table which describes the corresponding relation between the position information of the reference subsequence on the reference sequence and the last M bits of the R bit index value of the reference subsequence, the position information of the reference subsequence with the same front R-M bits of the index value is continuously arranged, and the pointer table is a table which describes the offset addresses of the reference subsequence with the same front N bits of the index value and different from the (N + 1) th bit to the (R-M) th bit in the candidate comparison position table; the reference subsequence is obtained by extracting the reference sequence bit by bit according to a preset length; extracting a seed sequence with a preset length from a sequence to be compared, and determining an index value of the seed sequence; inquiring the pointer table and the candidate comparison position table according to the index value of the seed sequence to judge whether the position information corresponding to the seed sequence exists in the reference sequence; if so, taking the seed sequence as a sequence to be expanded, performing expansion operation on the sequence to be expanded by taking the reference sequence as a reference according to the position information to obtain an intermediate sequence, and comparing the intermediate sequence with the sequence to be compared to obtain a gene sequence comparison result.

The index value of the reference subsequence is divided into three sections, the first two sections of the index value construct a pointer table, the last section of the index value and the position of the reference subsequence on the reference sequence construct a candidate comparison position table, and the index table in the prior art is divided into the pointer table and the candidate comparison position table. Because the rows and columns of the pointer table are respectively the first section of the index value and the second section of the index value, the pointer table is equivalent to the base address, and the candidate comparison position table is equivalent to the offset address, the storage space occupied by storing the index value is obviously reduced compared with a single table established only according to the index value and the position information. Furthermore, a large amount of storage space is saved by adopting the pointer table and the candidate comparison position table to store the index values, so that the efficiency of the FPGA for accessing the pointer table and the candidate comparison position table can be improved. The scheme can reduce the storage space of the table for storing the index value of the reference sequence and improve the gene sequence comparison rate. The application also provides a gene sequence comparison system, a computer readable storage medium and an electronic device, which have the beneficial effects and are not repeated herein.

Drawings

In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.

FIG. 1 is a flow chart of a method for aligning gene sequences according to an embodiment of the present disclosure;

fig. 2 is a flowchart illustrating a method for determining whether position information corresponding to the seed sequence exists in a reference sequence according to an embodiment of the present disclosure;

FIG. 3 is a flowchart illustrating an operation of expanding a seed sequence according to an embodiment of the present disclosure;

FIG. 4 is a diagram illustrating a result obtained after an index table is split into a pointer table and an CAL table;

FIG. 5 is a schematic diagram of a system for aligning gene sequences according to an embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring now to FIG. 1, FIG. 1 is a flowchart of a method for aligning gene sequences according to embodiments of the present disclosure.

The specific steps may include:

s101: determining a sequence to be compared and a reference sequence, and determining a pointer table and a candidate comparison position table according to the reference sequence;

the candidate comparison position table is a table which describes the corresponding relation between the position information of the reference subsequence on the reference sequence and the last M bits of the R bit index value of the reference subsequence, the position information of the reference subsequence with the same front R-M bits of the index value is continuously arranged, and the pointer table is a table which describes the offset addresses of the reference subsequence with the same front N bits of the index value and different from the (N + 1) th bit to the (R-M) th bit in the candidate comparison position table; the reference subsequence is obtained by extracting the reference sequence bit by bit according to a preset length;

it should be noted that the present embodiment is a scheme designed for gene sequence alignment, and a specific application scenario may be an application in precise medical treatment, that is, gene diagnosis. For example, a gene of a human is known to be responsible for heart disease, and the gene of a heart patient is compared with a reference gene (of a healthy person) to check whether the gene of the heart has variation. A particular application scenario may also be to perform homology alignments for looking at the similarity of two species, e.g. human and chimpanzee. Of course, there are many specific application scenarios in this embodiment, and the comparison between two pairs of genes may be implemented by the embodiments described in this application as long as the comparison is required.

The sequence to be compared mentioned in this step is a sequence to be subjected to a gene comparison operation (corresponding to the gene sequence of the cardiac patient mentioned in the above paragraph), and the reference sequence is a reference object (corresponding to the gene sequence of the normal human mentioned in the above paragraph) to be subjected to the gene comparison operation. The sequence to be compared is compared with the reference sequence, but the gene sequence is a sequence formed by four bases of adenine A, cytosine C, guanine G and thymine T, so that the gene position of the sequence to be compared corresponding to the reference sequence cannot be determined.

In the above example of cardiac gene comparison, human genes have 10 base pairs per 34 nm, all human genes have 30 hundred million base pairs, and the number of the human genes is huge, and the 30 hundred million base pairs are only composed of 4 base pairs, and a large amount of gene comparison is needed to determine which specific sequence in the sequences to be compared is a sequence related to heart disease, so that an index table needs to be established for the reference sequence, so as to determine a gene sequence which needs to be compared with the reference sequence according to the index table.

Since the reference sequence is usually a long base sequence, in this embodiment, a plurality of reference subsequences are extracted bit by bit from the reference sequence according to a preset length, for example, a total of F base pairs in the reference sequence and a preset length of G base pairs, F-G +1 reference subsequences can be obtained after extracting the reference sequence bit by bit, and the reference subsequences are a part of the reference sequence. The present embodiment defaults to an operation of recording position information of each reference sub-sequence on the reference sequence when performing reference sub-sequence extraction.

In theory, after the position information of the reference subsequence and the reference subsequence is obtained, the seed sequence with the same length as the reference subsequence can be extracted from the sequence to be compared bit by bit, all the seed sequences and all the reference subsequences are compared, and if the seed sequences and all the reference subsequences are consistent, the fact that a section of gene where the seed is located and a section of gene where the reference subsequence is located are genes with the same function (for example, controlling heart disease) is indicated. However, the base sequences cannot be aligned directly in a computer, and in this case, the bases must be aligned in a language readable by a computer. In this example, the reference subsequence and the seed sequence are both converted into index values, and the two base factor sequences are aligned by comparing the index values of the two base factor sequences. As a preferred embodiment, the converted index value may be a combination of

numbers

0 and 1, for example, A is 00, G is 01, C is 10, and T is 11, and the sequence of bases can be expressed by a binary code.

It should be noted that the length of the seed sequence in this embodiment is neither too short nor too long, and too short length easily causes the workload of the gene expansion operation to be too heavy, and too long length easily causes the storage space occupied by the pointer table and the candidate comparison position table to be too large. Therefore, the "preset length" mentioned in this step should be flexibly set in consideration of the total length of the reference sequence.

The key point of this embodiment is that a pointer table and a candidate comparison position table are established according to a reference sequence, and after an index value (default index value is R bits in total) of a reference subsequence is obtained, this embodiment divides the index value of each reference subsequence into three parts, which are: the first N bits of the index value, the first N +1 bits of the index value to R-M bits of the index value and the last M bits of the index value.

And constructing the candidate comparison position table according to the position information of the reference subsequence corresponding to the M bits after the index value and the reference subsequence corresponding to the M bits after the index value on the reference sequence according to a preset sequencing rule, wherein the position information with the same front R-M bits of the index value of the reference subsequence corresponding to the M bits after the index value is ensured to be continuously arranged in the construction process. The meaning of arranging the position information with the same front R-M bits of the index value in the candidate comparison position table continuously is that the pointer in the pointer table can point to all different rear M bits of the index value with the same front R-M bits of the index value.

And constructing the first N bits of the index value and the first N +1 bits to R-M bits of the index value into rows and columns of the pointer table, wherein in the construction process, each row of the pointer table is ensured to correspond to the first N bits of different index values, and each column is ensured to correspond to the first N bits of different index values, namely the first N +1 bits to the R-M bits. The pointer table is used for describing offset addresses of reference subsequences with the same first N bits of the index values and different first N +1 th bits to second R-M bits in the candidate comparison position table, and is equivalent to the pointer table dividing all the index values into a plurality of large blocks according to the first N bits of the index values, then dividing each large block into a plurality of small blocks according to the difference of the first N bits of the index values to the second R-M bits of the index values, wherein the combination of the first N bits of all the index values and the first N +1 th bits of the index values to the second R-M bits of the index values is a base address, the second M bits of the index values are offset addresses, and the pointer table records the corresponding relation between each base address and the offset addresses. Because the index value position information has a corresponding relation with the rear M bits of the index value, the position information of the index value corresponding to the base address and the offset address can be determined after the base address and the offset address are determined.

As a preferred embodiment, the specific process of constructing the pointer table and the candidate comparison location table may be:

the method comprises the following steps: extracting the reference sub-sequence with a preset length from the reference sequence bit by bit, recording the position information of the reference sub-sequence, and converting the reference sub-sequence into an index value;

step two: carrying out byte division on the index value according to a preset rule to obtain a first sub-index value, a second sub-index value and a third sub-index value; wherein the first sub-index value is the first N bits of the index value, the third sub-index value is the last M bits of the index value, and the second sub-index value is the (N + 1) th bit to the (R-M) th bit of the index value;

step three: constructing rows and columns of a pointer table according to the first sub-index value and the second sub-index value; each row of the pointer table corresponds to a different first sub-index value, each column corresponds to a different second sub-index value, and the row number P and the column number Q of the pointer table meet a preset relationship; wherein the preset relationship is P-2^N，Q＝2^R-M-N；

Wherein, the first sub-index value is composed of N-bit binary digits, and the number of rows P in the pointer table is 2^NIn the same way, the column number Q of the pointer table is 2^R-M-NIt should be noted that the obtained pointer table is a table with P rows and Q columns, and it is allowed that there is an index value corresponding to a certain row and a certain column in the table that has no index value of the reference sub-sequence corresponding to the certain row and the certain column, but all combinations of the first sub-index value and the second sub-index value should be included in the pointer table.

Step four: constructing the candidate comparison position table according to the third sub-index value and the position information of the reference sub-sequence corresponding to the third sub-index value on the reference sequence according to a preset sorting rule;

wherein the candidate contrast location table may be constructed in a lexicographic ordering. Each third sub-index value in the candidate comparison position table has a first sub-index value and a second sub-index value which are uniquely corresponding to the third sub-index value in the pointer table, and pointers of each item and each third sub-index value in the candidate comparison position table corresponding to the first sub-index value and the second sub-index value also exist in the pointer table.

Step five: and filling the pointer table according to the sorting rule and the number of the reference subsequences with the same front R-M bits and different back M bits of the index value.

The above process of constructing the pointer table and the candidate comparison location table is illustrated:

for example, the reference sequence is CGATGCATGT, since the gene sequence only writes one strand of base according to the base complementary pairing principle, the preset length is 4 base pairs, the reference subsequence is extracted from the reference sequence bit by bit, the index value is binary code, a is defined as 00, G is defined as 01, C is defined as 10, T is defined as 11, and CAL is position information, the result is as follows:

reference subsequence 1CGAT, index 10010011, CAL ═ 1;

reference subsequence 2GATG, index 01001101, CAL ═ 2;

reference subsequence 3ATGC, index 00110110, CAL ═ 3;

reference subsequence 4TGCA, index 11011000, CAL ═ 4;

reference subsequence 5GCAT, index value 01100011, CAL ═ 5;

with reference to subsequence 6CATG, index 10001101, CAL ═ 6.

With reference to subsequence 7ATGT, index 00110111, CAL ═ 7.

If N is 3 and M is 3, the following two tables are available, and it should be noted that, for convenience of understanding, (1), (2), (3), (4), (5), (6), and (7) appearing in the pointer table of table 1 represent the correspondence between the candidate comparison position table of table 2 and the position information CAL. In table 1, the pointer table may be filled with the number of reference subsequences with the same front R-M bits and different rear M bits of the index value, except for the correspondence between each item in the header of the pointer table and the position information in the candidate comparison position table. In fact, there are no leftmost column and no topmost row in the table, i.e., there is no table name, and here, for ease of understanding only, 1 row and 1 column are additionally added to the pointer table.

Watch 1 pointer

Addr	00	01	10	11
					000
001			(3)(7)
					010	(2)
011	(5)
					100	(6)	(1)
101
					110			(4)
111

TABLE 2 candidate comparison location Table

Key	CAL
		000	(4)
011	(1)
		011	(5)
101	(2)
		101	(6)
110	(3)
		111	(7)

As can be seen from the above example, the improved embodiment needs to store 6 × 5+3 × 7-51 bytes related to the index value, while the prior art needs to store 7 × 8-56 bytes related to the index value. Further, since the above example is only storage of the index value of one reference subsequence, and the number of base pairs in the example is smaller than the number of base pairs in actual operation for convenience of expression, the present embodiment can solve a large amount of storage space. It should be noted that there is a pointing relationship between the pointer table and the candidate comparison position table, i.e., the first segment, the second segment and the third segment of the index value are linked.

For example, taking a 24bp (base pair) seed adopted in the present invention as an example, when only one strand of a gene is considered to be 3Gbp and the seed is not hashed, a total of 36GB of storage space is generally required, since 12 bytes are required for the < index, pos > storage space of one unit. The index, pos, of a unit in the index table of the present invention requires 6 bytes of storage space, and 18GB of storage space is required in total. The pointer table additionally requires 2GB of storage space.

When not disassembled: < index, pos >, index 48bit takes 8 bytes and pos 32bit takes 4 bytes.

After splitting: < index, pos >, index10 bytes, takes 2 bytes, and pos 32bit takes 4 bytes.

In total, 3G entries are required:

pointer table storage space:

2^26*(32bit(piinter)+2^4*14bit)＝2^26*32＝2GB

the total saving in space is 36- (18+2) ═ 16 GB.

S102: extracting a seed sequence with a preset length from a sequence to be compared, and determining an index value of the seed sequence;

the operation of extracting the seed sequence with the preset length from the sequence to be compared in this step is the same as the operation of extracting the seed sequence from the reference sequence bit by bit mentioned in S101, and it should be noted that the logarithm of the base pairs of the seed sequence obtained in this step and the reference subsequence is the same, so as to find out the subsequence completely and accurately matching with the seed sequence on the reference sequence.

And after the index value of the seed sequence is obtained, finding the index value of the reference sub sequence which is the same as the index value of the seed sequence from the pointer table and the candidate comparison position table. It is understood that a plurality of seed sequences with preset lengths can be obtained in this step, and as a preferred embodiment, a plurality of seed sequences can be obtained by performing bit-by-bit extraction on the sequences to be compared. It should be noted that the index value of the seed sequence determined in this step should be the index value of the reference subsequence and be of the same kind, that is, if the base pairs of the seed sequence and the reference subsequence are identical, the index value of the seed sequence and the index value of the reference subsequence should also be identical.

S103: inquiring the pointer table and the candidate comparison position table according to the index value of the seed sequence to judge whether the position information corresponding to the seed sequence exists in the reference sequence; if yes, entering S104; if not, the flow is ended.

The aim of this step is to find a seed sequence that can exactly match the reference subsequence, and since the base pairs of both the reference subsequence and the seed sequence have been converted to index values, only the index values need to be compared. Specifically, in this embodiment, the index value of the reference subsequence passes through the pointer table and the candidate comparison position table, where the pointer table corresponds to the base address and the candidate comparison position table corresponds to the offset address. Therefore, as a preferred embodiment, when determining whether the reference sequence has the position information corresponding to the seed sequence according to the index value of the seed sequence, the first N bits of the index value and the N +1 th to R-M th bits may be compared first, and then whether the last M bits of the index value corresponding to a certain item in the pointer table are consistent with the last M bits of the index value of the seed sequence or not may be searched, and if so, it is indicated that a seed sequence capable of completely and accurately matching the reference subsequence is found, and the candidate position information corresponding to the last M bits of the index value in the comparison position table is the position of the reference subsequence capable of completely and accurately matching the seed sequence on the reference sequence.

S104: and taking the seed sequence as a sequence to be expanded, expanding the sequence to be expanded by taking the reference sequence as a reference according to the position information to obtain an intermediate sequence, and comparing the intermediate sequence with the sequence to be compared to obtain a gene sequence comparison result.

The method includes the steps of obtaining a reference subsequence which is completely and accurately matched with a seed sequence on a reference sequence, and obtaining position information of the reference subsequence on the reference sequence when the reference subsequence is obtained by default, so that the seed sequence which can be found in a pointer table and a candidate comparison position table can be used as a sequence to be expanded, and the sequence to be expanded is expanded according to the position information. In this embodiment, there is an operation of recording position information of the seed sequence on the sequence to be compared by default, so that the obtained intermediate sequence can be compared with the sequence to be compared according to the position information of the seed sequence on the sequence to be compared to obtain a comparison result.

For example, the sequence to be compared is CCGTTCATGT, the seed sequence is the first 4 CCGT of the sequence to be compared, the reference sequence is CCGTTCATAT, the reference subsequence completely and accurately matching the sequence to be compared is found to be CCGT, the sequence to be compared is extended by taking the reference sequence as an extension reference standard to obtain an intermediate sequence CCGTTCATAT, and the intermediate sequence and the sequence to be compared are compared to obtain a gene sequence comparison result.

In this embodiment, the index value of the reference subsequence is divided into three segments, the first two segments of the index value construct a pointer table, and the last segment of the index value and the position of the reference subsequence on the reference sequence construct a candidate comparison position table, so that the index table in the prior art is divided into the pointer table and the candidate comparison position table. Because the rows and columns of the pointer table are respectively the first section of the index value and the second section of the index value, the pointer table is equivalent to the base address, and the candidate comparison position table is equivalent to the offset address, the storage space occupied by storing the index value is obviously reduced compared with a single table established only according to the index value and the position information. Furthermore, a large amount of storage space is saved by adopting the pointer table and the candidate comparison position table to store the index values, so that the efficiency of the FPGA for accessing the pointer table and the candidate comparison position table can be improved. The embodiment can reduce the storage space of the table for storing the reference sequence index value and improve the gene sequence comparison rate.

Referring to fig. 2, fig. 2 is a flowchart for determining whether position information corresponding to the seed sequence exists in a reference sequence according to an embodiment of the present application;

the specific steps may include:

s201: inquiring a target offset address in the pointer table according to the first R-M bits of the index value of the seed sequence;

s202: inquiring the candidate comparison position table according to the target offset address, and judging whether the candidate comparison position table has M rear bits of the index value of the seed sequence; if yes, entering S203; if not, the flow is ended

S203: it is determined that position information of the seed sequence exists in the reference sequence.

After S203, the chapter sequence may be set as a sequence to be extended.

Referring to fig. 3, fig. 3 is a flowchart illustrating an operation of performing an expansion operation on a seed sequence according to an embodiment of the present application;

the specific steps may include:

s301: setting the seed sequence as the sequence to be spread;

s302: acquiring a pre-set digit of the index value of the sequence to be expanded as an expansion module index value;

s303: and inquiring a corresponding expansion module according to the index value of the expansion module, and inputting the sequence to be expanded into the expansion module to obtain the intermediate sequence.

For example, the expansion module may adopt a blocking process, and send the sequence to be expanded into the expansion module with the first two bits of id being 00,01,10, and 11 according to the first 2bits of the index value. In the expansion module, 256 bits of unit data in the pointer table are returned through one burst access of the FPGA. The DDR access efficiency of the FPGA is improved.

The following examples are given in the present application to illustrate the alignment of gene sequences:

step 1: selecting a base sequence of 24bp bit by bit from the reference sequence, sending the base sequence into a hash function to obtain an index value index, and storing the index, pos.

The index value designed by the invention adopts the hash key value after the seed is subjected to hash processing, so that the design has the advantage that the quantity of CAL units in each hash bucket can be uniformly distributed. Meanwhile, a hash function allowing collision is adopted for reducing the storage space, for example, the input is 24bp seed, 48bits, and the output is 40 bits.

Step 2: for all < index, pos >, the < index, pos > are arranged in the index's lexicographic order.

And step 3: and for all sorted < index, pos >, taking the first 26bits of the index as addr of the pointer table and index [27-30] as off of the pointer table, and counting the number of CAL in the off. And counting the CALs of the pointer. Please refer to fig. 4 for the relationship between the pointers in the pointer table and the CAL table, and fig. 4 is a schematic diagram of the result obtained after the index table is split into the pointer table and the CAL table. The CAL table is a candidate comparison position table.

The pointer table comprises three parts of addr, pointer and off. addr is an index value for acquiring the content of the Pointer unit. Pointer is the base address of a hash bucket and is used for addressing CAL table large block units. Off is the offset address of the subblock within one hash bucket. For the convenience of addressing, Off is designed as an accumulated value, i.e. the number of CAL units of all sub-blocks before the current sub-block, rather than the number of CAL units in each sub-block. The Pointer + Off [ n-1] is the address of the nth sub-block in the hash bucket. A special case occurs when sub-block 0 is queried, the offset address is 0, and the number of CAL units in the sub-block in the bucket is off [0 ]. The CAL table includes two parts, key and CAL. keys are used to distinguish different seeds within the same sub-block. Keys are stored in the CAL table instead of seeds, and the CAL unit is addressed together with the pointer table, so that the storage space of the index table is reduced. Taking the 24bp seed adopted in this embodiment as an example, when only one chain length of a gene is considered to be 3Gbp and the seed is not hash-processed, a storage space of < index, pos > of one unit usually needs 12 bytes, and a total storage space of 36GB is needed. The index, pos, of a unit in the index table of the present invention requires 6 bytes of storage space, and 18GB of storage space is required in total. The pointer table additionally requires 2GB of storage space.

And 4, step 4: and selecting a seed of 24bp bit by bit for the read, sending the seed into a hash function to obtain an index value index, and storing the index, pos.

And 5: get CAL table location address: for all < index, pos > of the read, the addr sending address of the pointer table is taken as the first 26bits of the index, and the address id of the off table unit is taken as the index [27-30] -1. Pointer + off [ id ] is the CAL's table location address.

Step 6: number of table cells for CAL acquisition: off [ id +1] -off [ id ].

And 7: acquiring CAL table unit data: and for all the indexes of the read, acquiring the CAL table unit data matched with the key in the CAL table unit number determined in the step 6 by taking index [31-39] bits of the key and taking the CAL table unit address determined in the step 5 as a starting address.

And 8: seed screening: and for all CAL table unit data of the read, acquiring a sub-sequence of the read according to < index, pos > of the seed, acquiring a sub-sequence of the reference sequence according to < index, pos > of the CAL table unit, comparing the two sub-sequences, acquiring a seed sequence of the reference sequence if the two sub-sequences are the same, and discarding the seed if the two sub-sequences are not the same.

When the DDR is randomly accessed in the FPGA, the clock period required for accessing one byte of data is the same as the clock period required for accessing the maximum bandwidth data allowed by burst. In the invention, the FPGA can acquire one unit of data (the storage space is needed to be 265bits) of the pointer table by accessing the DDR once. The seed expansion stage of this embodiment adopts a fast processing method, and specifically, 4 seed expansion modules are designed to process the expansion of seeds whose first 2bits of index value are 00,01,10, and 11, respectively. The advantage of this design is to fully utilize the maximum data bandwidth allowed by the FPGA to access the DDR once. One DDR access can obtain one unit data of the pointer table, and obtain a plurality of CAL unit addresses of seed with a common addr prefix.

And step 9: and respectively sending the seed subsequences and the seed subsequences of the reference sequence into seed expansion modules with id 00,01,10 and 11 according to the difference of the first 2bits of the index for expansion.

Step 10: and determining a final comparison result according to the scores of the 4 expansion modules on the seed expansion.

The embodiment in the practical application provides a gene sequence comparison processing method based on an index table, and through special design of the index table, the storage space of a reference sequence index table is greatly reduced, and meanwhile, the DDR access efficiency of the FPGA is improved. And the seed extension is subjected to block processing, the seed with the common addr prefix is extended in a seed extension module, and 256-bit unit data in the pointer table is returned through one-time burst access of the FPGA. The DDR access efficiency of the FPGA is improved.

Please refer to fig. 5, fig. 5 is a schematic structural diagram of a gene sequence alignment system according to an embodiment of the present disclosure;

the system may include:

Further, the index table determining module 100 includes:

an index value determining unit, configured to extract the reference sub-sequence with a preset length bit by bit for the reference sequence, record position information of the reference sub-sequence, and convert the reference sub-sequence into an index value;

the index value dividing unit is used for carrying out byte division on the index value according to a preset rule to obtain a first sub-index value, a second sub-index value and a third sub-index value; wherein the first sub-index value is the first N bits of the index value, the third sub-index value is the last M bits of the index value, and the second sub-index value is the (N + 1) th bit to the (R-M) th bit of the index value;

a pointer table construction unit, configured to construct rows and columns of a pointer table according to the first sub-index value and the second sub-index value; each row of the pointer table corresponds to a different first sub-index value, each column corresponds to a different second sub-index value, and the row number P and the column number Q of the pointer table meet a preset relationship; wherein the preset relationship is P-2^N，Q＝2^R-M-N；

A candidate comparison position table constructing unit, configured to construct the candidate comparison position table according to a preset ordering rule according to the third sub-index value and position information of a reference sub-sequence corresponding to the third sub-index value on the reference sequence;

and the pointer table filling unit is used for filling the pointer table according to the sorting rule and the number of the reference subsequences with the same front R-M bits and different back M bits of the index value.

Further, the query module 300 includes:

the offset address query unit is used for querying a target offset address in the pointer table according to the first R-M bits of the index value of the seed sequence;

the judging unit is used for inquiring the candidate comparison position table according to the target offset address and judging whether the candidate comparison position table has the last M bits of the index value of the seed sequence; and if so, judging that the position information of the seed sequence exists in the reference sequence.

Further, the ordering rule is dictionary ordering.

Further, the index value of the reference subsequence is a hash value corresponding to the hashed value of the reference subsequence.

Further, if the number of bits of the hash value is 40, N is 26, and M is 10.

Further, comparison module 400 includes:

an expansion module index value unit, configured to set the seed sequence as the sequence to be expanded, and obtain a pre-set number of bits of an index value of the sequence to be expanded as an expansion module index value;

and the extension unit is used for inquiring a corresponding extension module according to the extension module index value and inputting the sequence to be extended into the extension module to obtain the intermediate sequence.

And the sequence comparison unit is used for comparing the intermediate sequence with the sequence to be compared to obtain a gene sequence comparison result.

Since the embodiment of the system part corresponds to the embodiment of the method part, the embodiment of the system part is described with reference to the embodiment of the method part, and is not repeated here.

The present application also provides a computer readable storage medium having stored thereon a computer program which, when executed, may implement the steps provided by the above-described embodiments. The storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

1. a method for gene sequence alignment, is characterized in that, comprises:

Determine the sequence to be compared and the reference sequence, and determine the pointer table and the candidate comparison position table according to the reference sequence; wherein, the candidate comparison position table is to describe the position information of the reference subsequence on the reference sequence and the reference subsequence. The correspondence between the last M bits of the R-bit index value of the sequence, and the table in which the position information of the reference subsequences with the same R-M bits before the index value is continuously arranged, and the pointer table is used to describe that the first N bits of the index value are the same, A table of offset addresses of reference subsequences with different reference subsequences from the N+1th to the R-Mth position in the candidate comparison position table; the reference subsequence is obtained by bit-by-bit extraction of the reference sequence according to a preset length;

Extract a seed sequence of preset length from the sequence to be compared, and determine the index value of the seed sequence;

According to the index value of the seed sequence, query the pointer table and the candidate comparison position table to determine whether there is position information corresponding to the seed sequence in the reference sequence;

If yes, take the seed sequence as the sequence to be expanded, perform an expansion operation on the sequence to be expanded based on the reference sequence according to the position information to obtain an intermediate sequence, and compare the intermediate sequence with the sequence to be compared Obtain gene sequence alignment results.

2. The method according to claim 1, wherein determining the pointer table and the candidate comparison position table according to the reference sequence comprises:

Extracting the reference subsequence with a preset length from the reference sequence bit by bit, recording the position information of the reference subsequence, and converting the reference subsequence into an index value;

The index value is divided into bytes according to a preset rule to obtain a first sub-index value, a second sub-index value and a third sub-index value; wherein, the first sub-index value is the first N bits of the index value , the third sub-index value is the last M bits of the index value, and the second sub-index value is the N+1th to R-Mth bits of the index value;

The rows and columns of the pointer table are constructed according to the first sub-index value and the second sub-index value; wherein, each row of the pointer table corresponds to a different first sub-index value, and each column corresponds to a different second sub-index value The index value, the number of rows P and the number of columns Q of the pointer table satisfy a preset relationship; wherein, the preset relationship is P=2 ^N , Q=2 ^RMN ;

According to the third sub-index value and the position information of the reference sub-sequence corresponding to the third sub-index value on the reference sequence, the candidate comparison position table is constructed according to a preset sorting rule;

The pointer table is constructed according to the sorting rule and the number of reference subsequences with the same first R-M bits and different last M bits of the index value.

3 . The method according to claim 2 , wherein, according to the index value of the seed sequence, the pointer table and the candidate comparison position table are queried to determine whether there is position information corresponding to the seed sequence in the reference sequence. 4 . ,include:

Query the target offset address in the pointer table according to the first R-M bits of the index value of the seed sequence;

Query the candidate comparison position table according to the target offset address, and determine whether the last M bits of the index value of the seed sequence exist in the candidate comparison position table;

If so, it is determined that the reference sequence contains the position information of the seed sequence.

4. The method according to claim 2, wherein the sorting rule is dictionary sorting.

5 . The method according to claim 1 , wherein the index value of the reference subsequence is a hash value corresponding to the reference subsequence after hashing. 6 .

6 . The method according to claim 5 , wherein if the number of digits of the hash value is 40, N is 26 and M is 10. 7 .

7. The method according to any one of claims 1-6, characterized in that, performing an expansion operation on the seed sequence as a sequence to be expanded to obtain an intermediate sequence, comprising:

setting the seed sequence as the sequence to be extended;

Obtain the first preset digits of the index value of the sequence to be expanded as the index value of the expansion module;

The corresponding expansion module is queried according to the index value of the expansion module, and the sequence to be expanded is input into the expansion module to obtain the intermediate sequence.

8. a system of gene sequence alignment, is characterized in that, comprises:

The index table determination module is used to determine the sequence to be compared and the reference sequence, and determine the pointer table and the candidate comparison position table according to the reference sequence; wherein, the candidate comparison position table is a description of the reference subsequence on the reference sequence. The correspondence between the position information and the last M bits of the R-bit index value of the reference subsequence, and a table in which the position information of the reference subsequence with the same R-M bits before the index value is continuously arranged, and the pointer table is used to describe the A table of offset addresses in the candidate comparison position table for the reference subsequences with the same first N bits and different N+1th to R-Mth bits in the index value; the reference subsequence is a reference subsequence according to a preset length The sequence is extracted bit by bit;

an extraction module, used for extracting a seed sequence of a preset length from the sequence to be compared, and determining an index value of the seed sequence;

A query module, configured to query the pointer table and the candidate comparison position table according to the index value of the seed sequence to determine whether there is position information corresponding to the seed sequence in the reference sequence; if so, start the work of the comparison module process;

The comparison module is configured to perform an expansion operation on the seed sequence as a sequence to be expanded to obtain an intermediate sequence, and compare the intermediate sequence with the sequence to be compared to obtain a gene sequence comparison result.

9. An electronic device, characterized in that, comprising:

memory for storing computer programs;

A processor, configured to implement the steps of the gene sequence alignment method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that, a computer program is stored on the computer-readable storage medium, and when the computer program is executed by the processor, the gene sequence according to any one of claims 1 to 7 is realized The steps of the comparison method.