VCF format: variants c.f. S. Brown NYU

Presentation on theme: "VCF format: variants c.f. S. Brown NYU"— Presentation transcript:

1 VCF format: variants c.f. S. Brown NYU
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr G A 23 . DP=3;VDB=0.0298;AF1=1;AC1=2;DP4=0,0,3,0;MQ=60;FQ=-36 GT:PL:GQ 1/1:55,9,0:15

INFO
AA: ancestral allele
AC: allele count in genotypes, for each ALT allele, in the same order as listed
AF: allele frequency for each ALT allele, in the same order as listed; use this when estimated from primary data, not called genotypes
AN: total number of alleles in called genotypes
BQ: RMS base quality at this position
CIGAR: cigar string describing how to align an alternate allele to the reference allele
DB: dbSNP membership
DP: combined depth across samples, e.g. DP=154
END: end position of the variant described in this record (esp. for CNVs)
H2: membership in hapmap2
MQ: RMS mapping quality, e.g. MQ=52
MQ0: number of MAPQ == 0 reads covering this record
NS: number of samples with data
SB: strand bias at this position
SOMATIC: indicates that the record is a somatic mutation, for cancer genomics
VALIDATED: validated by follow-up experiment
c.f. S. Brown NYU
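To make the INFO and FORMAT columns concrete, here is a minimal sketch of how they could be split into key/value pairs. The helper names `parse_info` and `parse_sample` are made up for this illustration and are not part of any VCF library; values are left as strings rather than converted to numbers.

```python
def parse_info(info_field):
    """Split a VCF INFO column into a dict.

    Keys without '=' (flags such as DB or SOMATIC) map to True.
    """
    info = {}
    if info_field == ".":          # "." means no INFO data
        return info
    for item in info_field.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def parse_sample(format_field, sample_field):
    """Pair the FORMAT keys (e.g. GT:PL:GQ) with one sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


# Fields taken from the example record on the slide
info = parse_info("DP=3;VDB=0.0298;AF1=1;AC1=2;DP4=0,0,3,0;MQ=60;FQ=-36")
sample = parse_sample("GT:PL:GQ", "1/1:55,9,0:15")
print(info["DP"], info["MQ"], sample["GT"])
```

For real data a dedicated VCF parser would normally be preferable, since it also handles the header metadata and type conversion.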

2 Alignment or NGS What are the challenges? c.f. S. Brown NYU

3 Aligning millions of short reads
Computationally intensive
Aligning to a reference genome => mapping the reads
Aligning de novo => genome assembly
Either way, using something like Smith-Waterman (S-W) or BLAST would take too long, so these methods are modified to take shortcuts (heuristics).
Heuristics include using indexing methods
Length of each read =
Number of reads = 10^7 to 10^8
Genome length = or double if diploid

4 A short word about indexing
Just like the index of a book, which lists words with their page numbers, this creates an index of short sequences with their locations on the genome.
The index file contains index entries, each made up of a search key value and a pointer to the block in the data file.
Hashing (or keying) means recording positions (like line numbers) and describing the contents by a minimum of characters, called seeds.
The techniques are very well defined in computer science.
The end result is that the files take up a lot less space and take a lot less time to search.
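The book-index analogy can be sketched as a lookup table from every short sequence (k-mer) to the positions where it occurs. This is only a toy illustration: the function name `build_kmer_index`, the example genome and k = 3 are assumptions, and real aligners use far more compact structures.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer (the search key) to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


genome = "ACGTACGTGACG"        # toy sequence, not real data
index = build_kmer_index(genome, 3)
print(index["ACG"])            # positions where the seed ACG occurs
```

Looking up a seed is then a single dictionary access instead of a scan over the whole genome.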

5 Aligning millions of short reads
Computationally intensive
Aligning to a reference genome => mapping the reads
Aligning de novo => genome assembly
Heuristics include using indexing methods; some use ungapped alignment with short words
BWT (Burrows-Wheeler Transformation) greatly reduced computational time
Approaches include using hash tables, spaced seeds, contiguous seeds, etc.

6 Mapping
The step of aligning the reads to the reference genome.
- index the reads, then scan them against the reference
- find the reference match that has the fewest mismatches
- calculate the p-value, and assign each read its location
Problems: accuracy, splice junctions, variants
Challenges: false positives, repeats, parental alleles, HapMap
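The "fewest mismatches" step, and the reason repeats are a challenge, can be illustrated with a brute-force sketch. It scans every reference position instead of using an index and leaves out the p-value calculation; `best_hits` and the toy sequences are assumptions made for this example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_hits(read, reference):
    """Return (lowest mismatch count, all positions that achieve it)."""
    best, positions = None, []
    for pos in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[pos:pos + len(read)])
        if best is None or d < best:
            best, positions = d, [pos]
        elif d == best:
            positions.append(pos)
    return best, positions


reference = "TTACGTAACCGGACGTAATT"       # toy reference containing a repeat
print(best_hits("AACCGT", reference))    # one mismatch, one location
print(best_hits("ACGTAA", reference))    # exact match at two locations (repeat)
```

When more than one position shares the lowest mismatch count, the read cannot be placed unambiguously, which is where a mapping quality or p-value becomes useful.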

7 NGS alignment algorithms
Smith-Waterman
BLAST
Enter BWT
BLAT is precomputed BLAST

8 MAQ algorithm: Mapping and Assembly with Qualities
MAQ first indexes the read sequences and then scans the reference genome to find hits, which are extended and scored (first 28 bp with at most 2 mismatches), minimizing the sum of quality scores of mismatches.
Scheme of indexing in hash tables (six noncontiguous seed templates):
e.g., for an 8-base-long fragment with 1 or 2 variants, you still get segments which are fully matched (2 and 6). c.f. Shamir 2011
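Why six templates are enough for two mismatches can be checked with a small sketch. It assumes the seed is cut into four equal blocks and that the six templates are the six ways of choosing two of those blocks; with at most two mismatches, at least two blocks stay untouched, so at least one template still matches exactly. The 8-base seed, the block length of 2 and all function names are assumptions for illustration.

```python
from itertools import combinations


def template_key(seq, blocks, block_len):
    """Concatenate the chosen blocks of seq (a noncontiguous seed)."""
    return "".join(seq[b * block_len:(b + 1) * block_len] for b in blocks)


def some_template_matches(read, ref, block_len=2):
    """True if read and ref agree exactly on at least one pair of blocks."""
    n_blocks = len(read) // block_len
    return any(
        template_key(read, t, block_len) == template_key(ref, t, block_len)
        for t in combinations(range(n_blocks), 2)
    )


def mutate(seq, positions):
    """Replace the bases at the given positions with a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[c] if i in positions else c for i, c in enumerate(seq))


ref = "ACGTACGT"
# Every possible pair of mismatch positions is still caught by some template
print(all(some_template_matches(mutate(ref, {i, j}), ref)
          for i, j in combinations(range(len(ref)), 2)))
```

Three mismatches spread over three different blocks leave only one clean block, so no template matches and such a hit would be missed by this scheme.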

9 MAQ algorithm Heuristic yet thoughtful:
MAQ loads all reads into memory.
For each read, MAQ creates an integer depending on the 28 bp templates and stores the read ID and the integer in a hash table.
When all reads are processed, we have many reads under the same integer as key.
Now we scan the reference, using 28 bp subsequences.
Using the corresponding hashes of the subsequences, we find reads that match and extend them.
MAQ calculates a score for each extended match and retains the best-scoring hits.
Once the reference is scanned, the next template is used, until no more templates are left.
c.f. Shamir 2011
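The loop described above can be sketched in a few lines. This is a toy version under stated assumptions, not MAQ's implementation: reads are 8 bases long rather than 28 bp seeds, templates are pairs of 2-base blocks, string keys stand in for the integers, and the score is the sum of base qualities at mismatching positions (lower is better). All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

BLOCK = 2  # toy block length (assumption); MAQ works on the first 28 bp


def key_for(seq, blocks):
    """Key built from the blocks selected by one template."""
    return "".join(seq[b * BLOCK:(b + 1) * BLOCK] for b in blocks)


def mismatch_quality_sum(read, quals, window):
    """Sum of base qualities at positions where read and reference differ."""
    return sum(q for r, w, q in zip(read, window, quals) if r != w)


def map_reads(reads, quals, reference):
    """Return {read id: (score, position)} keeping the best hit per read."""
    read_len = len(reads[0])
    best = {}
    for blocks in combinations(range(read_len // BLOCK), 2):  # one template at a time
        table = defaultdict(list)
        for rid, read in enumerate(reads):        # index all reads under this template
            table[key_for(read, blocks)].append(rid)
        for pos in range(len(reference) - read_len + 1):      # scan the reference
            window = reference[pos:pos + read_len]
            for rid in table.get(key_for(window, blocks), []):
                score = mismatch_quality_sum(reads[rid], quals[rid], window)
                if rid not in best or score < best[rid][0]:
                    best[rid] = (score, pos)
    return best


reference = "GGATCCGTACGATTACAGGC"
reads = ["TCCGTACG", "GATGACAG"]   # second read carries one mismatch
quals = [[30] * 8, [30, 30, 30, 10, 30, 30, 30, 30]]
print(map_reads(reads, quals, reference))
```

Indexing the reads and streaming the reference past them keeps the hash table size tied to the number of reads rather than to the genome length.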

10 Burrows Wheeler Transformation
The Burrows–Wheeler transform is an algorithm used in data compression techniques.
The output is easier to compress because it has many repeated characters.
In this example, the transformed string contains a total of eight runs of identical characters: XX, II, XX, SS, PP, .., II, and III, which together make up 17 of the 44 characters.
So it would be great if we could compress our reads like that.
How does it work?

11 Burrows Wheeler Transformation
First we transform:
This string can now be nicely compressed (i.e. we can do that with our reads too!)
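A minimal sketch of the forward transform: append an end marker, sort all rotations of the string, and read off the last column. The `$` marker and the function name `bwt` are conventions assumed here, and sorting full rotations is only practical for short strings.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (toy version)."""
    assert terminator not in text, "terminator must not occur in the text"
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))   # annb$aa
print(bwt("ACAACG"))   # GC$AAAC
```

In the output, characters that precede similar contexts end up next to each other, which is what produces the runs that compress well.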

12 Burrows Wheeler Transformation
However, how do we get back to the original reads (i.e. how does the inverse transformation work)?
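One simple way to invert the transform, sketched here under the assumption that the text ended with a unique `$` marker: repeatedly prepend the transformed string as a new first column and re-sort. After as many rounds as there are characters, the table holds all sorted rotations, and the row ending in `$` is the original text. This is the textbook quadratic method, not what fast aligners do; `inverse_bwt` is a name chosen for this example.

```python
def inverse_bwt(last_column, terminator="$"):
    """Rebuild the original text from its Burrows-Wheeler transform."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepend the last column to every row, then sort the rows again
        table = sorted(c + row for c, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]            # drop the end marker


print(inverse_bwt("annb$aa"))   # banana
print(inverse_bwt("GC$AAAC"))   # ACAACG
```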

13 BowTie
The matrix has the characteristic that similar rows are clustered together.
Starting with the query, we find ranges of rows that match the query.
As characters are successively added to the query, the ranges become smaller.
A match is found if the query matches the complete row; there is no match if the range = 0.
Problems arise when mismatches occur.
Remedy: backtracking (next slide)
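The shrinking range of rows can be sketched with a small exact-match search over the transformed string. This is a simplified illustration of the idea, not Bowtie's code: it counts exact occurrences only and omits the backtracking used for mismatches, and the names `fm_index` and `count_matches` are assumptions.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (toy version)."""
    s = text + terminator
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))


def fm_index(bwt_str):
    """Build the two lookup tables needed for backward search.

    first[c]  = number of characters in the text that sort before c
    occ[c][i] = number of occurrences of c in bwt_str[:i]
    """
    alphabet = sorted(set(bwt_str))
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += bwt_str.count(c)
    occ = {c: [0] * (len(bwt_str) + 1) for c in alphabet}
    for i, ch in enumerate(bwt_str):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first, occ


def count_matches(query, first, occ, n):
    """Narrow the half-open row range [lo, hi) one query character at a time."""
    lo, hi = 0, n
    for ch in reversed(query):          # the query is consumed right to left
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:                    # empty range: no match
            return 0
    return hi - lo


transformed = bwt("ACAACG")
first, occ = fm_index(transformed)
print(count_matches("AC", first, occ, len(transformed)))   # 2
print(count_matches("GG", first, occ, len(transformed)))   # 0
```

Each step costs the same regardless of genome size, which is why this kind of search scales to millions of reads.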

14 BowTie

15 Dealing with splice junctions
(from Garber et al, Nature, 2011)

16 TopHat

17 Pros and Cons
Identifying splice junctions correctly is non-trivial
Splice aligners that use exon-first methods are computationally less intensive (e.g. TopHat, which starts with the results of Bowtie2)
Seed-extend methods are better at finding new isoforms, but could equally come up with false positive splice junctions (e.g. GSNAP)
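To show why junction finding is harder than plain mapping, here is a toy split-read sketch: a read that does not map contiguously is cut in two, and the halves are placed on either side of a gap (the putative intron). This is not how TopHat or GSNAP work internally; the function name, the minimum anchor length and the toy genome are assumptions, and exact string matching stands in for real alignment.

```python
def find_spliced_alignment(read, genome, min_anchor=4, min_intron=2):
    """Return (left_start, left_end, right_start) for the first split that fits.

    Returns None if the read maps contiguously or no split can be placed.
    """
    if genome.find(read) != -1:
        return None                              # unspliced read, no junction
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        while left_pos != -1:
            left_end = left_pos + len(left)
            # The right part must start downstream, leaving room for an intron
            right_pos = genome.find(right, left_end + min_intron)
            if right_pos != -1:
                return left_pos, left_end, right_pos
            left_pos = genome.find(left, left_pos + 1)
    return None


exon1, intron, exon2 = "AACCGGTT", "GTAAGTTTTCAG", "CATGCATG"   # toy sequences
genome = exon1 + intron + exon2
read = exon1[-5:] + exon2[:5]                    # read spanning the junction
print(find_spliced_alignment(read, genome))
```

With short anchors, many split points can fit by chance, which is one source of the false positive junctions mentioned above.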

18 Identifying transcripts
For RNA-Seq, might want to identify novel transcripts
Not trivial
Ignore isoforms
Use genome-guided reconstruction, i.e. using reads that span the potential splice sites of known genomes (e.g. Cufflinks)
Use genome-independent reconstruction, i.e. de novo approaches (e.g. Velvet, Trinity)

19 Finding the real transcripts

