RU2023116499A

RU2023116499A - A SEQUENCE GRAPH-BASED TOOL FOR DETERMINING VARIATION IN SHORT TANDEM REPEAT REGIONS

Info

Publication number: RU2023116499A
Application number: RU2023116499A
Authority: RU
Inventors: Egor DOLZHENKO; Michael A. EBERLE
Original assignee: Illumina, Inc.
Priority date: 2019-03-07
Filing date: 2020-03-06
Publication date: 2023-06-28

Claims

1. A method, implemented using a computer equipped with one or more processors and system memory, for genotyping one or more repeat sequences, each of which contains one or more repeat subsequences, the method comprising:

obtaining a sequence graph, wherein the sequence graph has a graph data structure with vertices representing nucleotide sequences and directed edges connecting the vertices, and wherein the sequence graph contains one or more self-loops, each self-loop representing a repeat subsequence; and

aligning, using the one or more processors, sequence reads of a test sample to the one or more repeat sequences, each of which is represented by the sequence graph.

2. The method of claim 1, wherein each repeat subsequence contains repeats of a repeat unit of one or more nucleotides.

3. The method according to any one of claims 1 or 2, wherein a repeat sequence of the one or more repeat sequences contains a specific repeat unit containing at least one partially specified nucleotide.

4. The method of claim 3, wherein the specific repeat unit contains degenerate codons.

5. The method according to any one of claims 1-4, wherein the one or more self-loops comprise two or more self-loops representing two or more repeat subsequences.

6. The method according to any one of claims 1-5, wherein the sequence graph further comprises two or more alternative paths for two or more alleles.

7. The method of claim 6, wherein the two or more alleles contain a deletion or a substitution.

8. The method of claim 6, wherein the substitution comprises a single nucleotide variant (SNV) or a single nucleotide polymorphism (SNP).

9. The method of claim 6, further comprising genotyping the two or more alleles using sequence reads aligned to the two or more alternative paths.
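For illustration only, the graph data structure recited in claims 1-9 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the claimed method: the class `SequenceGraph`, its methods, the helper `matches`, and the example sequences are all hypothetical names and values chosen for this example. The sketch shows vertices that carry nucleotide sequences, directed edges, a self-loop on a repeat unit, two alternative paths for two alleles, and a repeat unit with a partially specified nucleotide written as the IUPAC code `N`.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# IUPAC nucleotide codes: each symbol maps to the bases it can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


@dataclass
class SequenceGraph:
    """Directed graph whose vertices carry nucleotide sequences (hypothetical)."""

    sequences: list[str] = field(default_factory=list)
    edges: set[tuple[int, int]] = field(default_factory=set)

    def add_vertex(self, sequence: str) -> int:
        """Store a vertex sequence and return its index."""
        self.sequences.append(sequence)
        return len(self.sequences) - 1

    def add_edge(self, source: int, target: int) -> None:
        self.edges.add((source, target))

    def self_loops(self) -> list[int]:
        """Vertices with an edge to themselves, i.e. the repeat units."""
        return sorted(v for v, w in self.edges if v == w)

    def spell(self, path: list[int]) -> str:
        """Concatenate vertex sequences along a path, checking every edge."""
        for source, target in zip(path, path[1:]):
            if (source, target) not in self.edges:
                raise ValueError(f"no edge {source} -> {target}")
        return "".join(self.sequences[v] for v in path)


def matches(pattern: str, query: str) -> bool:
    """True if query agrees with pattern, treating IUPAC codes as wildcards."""
    return len(pattern) == len(query) and all(
        base in IUPAC[symbol] for symbol, base in zip(pattern, query)
    )


# Example locus (made-up sequences): left flank, a two-allele site,
# a degenerate repeat unit with a self-loop, and a right flank.
graph = SequenceGraph()
left = graph.add_vertex("ACGT")
ref_allele = graph.add_vertex("A")
alt_allele = graph.add_vertex("G")
repeat = graph.add_vertex("GCN")
right = graph.add_vertex("TTGA")
for edge in [(left, ref_allele), (left, alt_allele), (ref_allele, repeat),
             (alt_allele, repeat), (repeat, repeat), (repeat, right)]:
    graph.add_edge(*edge)
```

In this sketch, traversing the self-loop twice along the alternative-allele path, `graph.spell([left, alt_allele, repeat, repeat, right])`, spells a haplotype with two copies of the repeat unit, and `matches` can then compare a read against it. A real aligner would search over paths rather than be handed one.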

10. A method, implemented using a computer equipped with one or more processors and system memory, for characterizing a repeat expansion, the method comprising:

collecting, using the one or more processors, sequence reads of a test sample, wherein the sequence reads contain paired-end reads;

aligning, using the one or more processors, the sequence reads to one or more repeat sequences, each of which is represented by a sequence graph;

identifying anchor reads and anchored reads in the paired-end reads, wherein the anchor reads are reads aligned to or near a repeat sequence of the one or more repeat sequences, and wherein the anchored reads are unaligned reads that are paired with the anchor reads; and

determining the probability of a repeat expansion in the test sample based at least in part on the identified anchored reads.

11. The method of claim 10, wherein the sequence graph contains one or more self-loops, each self-loop representing a repeat subsequence, each repeat subsequence containing repeats of a repeat unit of one or more nucleotides.

12. The method according to any one of claims 10 or 11, wherein the anchor reads are aligned to within about 5 kb of the repeat sequence.

13. The method according to any one of claims 10-12, wherein the unaligned reads comprise reads that cannot be aligned, or reads that are aligned to the sequence graph with at least one mismatch.

14. The method according to any one of claims 10-13, wherein the probability of a repeat expansion is determined based on the identified anchor reads as well as the identified anchored reads.
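For illustration only, the read-pair classification recited in claims 10-14 can be sketched in Python as follows. Here an anchor read is the mate aligned to or near the repeat sequence, and an anchored read is its unaligned mate. This is a minimal sketch under stated assumptions: the record types `Read` and `Region`, the functions `is_near` and `find_anchor_pairs`, and the half-open coordinate convention are hypothetical, and the default distance of 5000 bases simply mirrors the "about 5 kb" of claim 12.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Read:
    """One mate of a paired-end read (hypothetical record type)."""

    name: str
    sequence: str
    chrom: Optional[str] = None  # None means the read is unaligned
    start: Optional[int] = None  # 0-based, inclusive
    end: Optional[int] = None    # 0-based, exclusive

    @property
    def aligned(self) -> bool:
        return self.chrom is not None


@dataclass(frozen=True)
class Region:
    """Location of a repeat sequence (hypothetical record type)."""

    chrom: str
    start: int
    end: int


def is_near(read: Read, region: Region, max_distance: int) -> bool:
    """True if the read is aligned within max_distance bases of the region."""
    if not read.aligned or read.chrom != region.chrom:
        return False
    return (read.end > region.start - max_distance
            and read.start < region.end + max_distance)


def find_anchor_pairs(
    pairs: List[Tuple[Read, Read]], region: Region, max_distance: int = 5000
) -> List[Tuple[Read, Read]]:
    """Return (anchor, anchored) tuples for pairs with one mate near the
    repeat region and the other mate unaligned."""
    result = []
    for first, second in pairs:
        for anchor, mate in ((first, second), (second, first)):
            if is_near(anchor, region, max_distance) and not mate.aligned:
                result.append((anchor, mate))
                break
    return result
```

Pairs in which both mates are aligned, or in which the aligned mate lies farther than `max_distance` from the repeat region, are left out of the result. How the returned pairs feed into the probability of a repeat expansion is not specified by this sketch.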

15. A method, implemented using a computer equipped with one or more processors and system memory, for detecting a repeat expansion, the method comprising:

aligning, using the one or more processors, sequence reads of a test sample to one or more repeat sequences, each of which is represented by a sequence graph, wherein the sequence reads contain paired-end reads;

identifying anchor reads and anchored reads in the paired-end reads, wherein the anchor reads are reads aligned to or near a repeat sequence of the one or more repeat sequences, and wherein the anchored reads are unaligned reads that are paired with the anchor reads;

determining a number of high-count reads associated with the test sample, wherein the number of high-count reads corresponds to the number of anchor reads and/or anchored reads that contain more repeats than a threshold value; and

determining the presence of a repeat expansion in the test sample based on the number of high-count reads exceeding a call criterion.

16. The method of claim 15, wherein the sequence graph contains one or more self-loops, each self-loop representing a repeat subsequence, each repeat subsequence containing repeats of a repeat unit of one or more nucleotides.

17. The method according to any one of claims 15 or 16, wherein the anchor reads are aligned to within about 5 kb of the repeat sequence.

18. The method according to any one of claims 15-17, wherein the unaligned reads comprise reads that cannot be aligned, or reads that are aligned to the sequence graph with at least one mismatch.

19. The method according to any one of claims 15-18, further comprising filtering out erroneous and low-quality reads prior to aligning the sequence reads in step (a).

20. The method according to any one of claims 15-19, wherein a high-count read is determined based on the maximum number of repeats of a specific repeat sequence for a read having a given read length.

21. The method of claim 20, wherein a high-count read is a read containing a number of repeats that is at least about 80% of the maximum number of repeats.

22. The method according to any one of claims 15-21, wherein the call criterion is obtained based on the distribution of high-count reads of one or more control samples.

23. The method according to any one of claims 15-22, wherein the call criterion is determined based on the sequencing depth.

24. The method of claim 23, wherein the sequencing depth indicates the average distance between reads in an aligned genome.
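For illustration only, the read counting recited in claims 15-24 can be sketched in Python as follows. The sketch counts reads that are nearly full of repeat units and compares that count with a criterion derived from control samples. It is a minimal sketch under stated assumptions, not the claimed method: all function names are hypothetical, the "maximum number of repeats" is taken to be the read length divided by the repeat unit length, the 0.8 fraction mirrors the "about 80%" of claim 21, and deriving the criterion as a high quantile of the control counts is only one assumed way of using the control distribution.

```python
import math
from typing import Iterable, Sequence


def longest_repeat_run(read: str, unit: str) -> int:
    """Largest number of back-to-back copies of unit found anywhere in read."""
    best = 0
    for offset in range(len(read)):
        count, pos = 0, offset
        while read.startswith(unit, pos):
            count += 1
            pos += len(unit)
        best = max(best, count)
    return best


def is_high_count(read: str, unit: str, fraction: float = 0.8) -> bool:
    """True if the read holds at least `fraction` of the repeats it could hold.

    Assumption: a read of length L can hold at most L // len(unit) repeats.
    """
    max_repeats = len(read) // len(unit)
    if max_repeats == 0:
        return False
    return longest_repeat_run(read, unit) >= fraction * max_repeats


def count_high_count_reads(
    reads: Iterable[str], unit: str, fraction: float = 0.8
) -> int:
    """Number of reads that pass the high-count test."""
    return sum(is_high_count(read, unit, fraction) for read in reads)


def call_criterion(control_counts: Sequence[int], quantile: float = 0.99) -> int:
    """Assumed criterion: a high quantile of the counts seen in controls."""
    if not control_counts:
        raise ValueError("at least one control sample is required")
    ordered = sorted(control_counts)
    index = min(len(ordered) - 1, math.ceil(quantile * len(ordered)) - 1)
    return ordered[max(index, 0)]


def has_expansion(
    test_count: int, control_counts: Sequence[int], quantile: float = 0.99
) -> bool:
    """True if the test sample has more high-count reads than the criterion."""
    return test_count > call_criterion(control_counts, quantile)
```

With a 30-base read and the unit `CAG`, the assumed maximum is 10 repeats, so a read needs a run of at least 8 copies to count. Scaling the criterion by sequencing depth, as in claim 23, is not covered by this sketch.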

25. A method for characterizing a repeat sequence, the method comprising:

receiving, at a first computing device from a second computing device, data indicative of sequence reads of a test sample;

aligning, using one or more processors of the first computing device, the sequence reads to one or more repeat sequences, each of which is represented by a sequence graph;

determining, by the one or more processors of the first computing device, information indicative of repeat expansions of the test sample and/or genotype information of one or more repeat sequences of the test sample, based on the alignment of the sequence reads to the one or more repeat sequences of the sequence graph; and

transmitting the information indicative of repeat expansions and/or the genotype information to a third computing device.

26. The method of claim 25, wherein the sequence graph contains one or more self-loops, each self-loop representing a repeat subsequence, each repeat subsequence containing repeats of a repeat unit of one or more nucleotides.

27. The method according to any one of claims 25 or 26, wherein the second computing device includes a sequencer.

28. The method according to any one of claims 25 or 26, wherein the second computing device contains a database.

29. The method according to any one of claims 25-28, wherein the first computing device is remote from the second computing device and/or the third computing device.

30. The method according to any one of claims 25-29, wherein the information indicative of repeat expansions and/or the genotype information includes one or more patient diagnoses associated with the test sample.

31. A system comprising:

system memory; and

one or more processors configured to:

obtain a sequence graph, wherein the sequence graph has a graph data structure with vertices representing nucleotide sequences and directed edges connecting the vertices, and wherein the sequence graph contains one or more self-loops, each self-loop representing a repeat subsequence; and

align sequence reads of a test sample to one or more repeat sequences, each of which is represented by the sequence graph.

32. A system comprising:

system memory; and

one or more processors configured to:

identify anchor reads and anchored reads in paired-end reads, wherein the anchor reads are reads aligned to or near a repeat sequence of one or more repeat sequences, and wherein the anchored reads are unaligned reads that are paired with the anchor reads; and

determine the probability of a repeat expansion in a test sample based at least in part on the identified anchored reads.

33. A system comprising:

system memory; and

one or more processors configured to:

identify anchor reads and anchored reads in paired-end reads, wherein the anchor reads are reads aligned to or near a repeat sequence of one or more repeat sequences, and wherein the anchored reads are unaligned reads that are paired with the anchor reads; and

determine the presence of a repeat expansion in a test sample based on a number of high-count reads exceeding a call criterion.

34. A system comprising:

system memory; and

one or more processors configured to:

align, using one or more processors of a first computing device, sequence reads to one or more repeat sequences, each of which is represented by a sequence graph;

determine, by the one or more processors of the first computing device, information indicative of repeat expansions of a test sample and/or genotype information of one or more repeat sequences of the test sample, based on the alignment of the sequence reads to the one or more repeat sequences of the sequence graph; and

transmit the information indicative of repeat expansions and/or the genotype information to a third device.