Introduction to computational genomics – hands on course Gene expression (Gasch et al) Unit 1: Mapper Unit 2: Aggregator and peak finder Solexa MNase Reads.


1 Introduction to computational genomics – hands on course
Data sets: Solexa MNase reads; gene expression (Gasch et al); TF binding (GSM336334)
Unit 1: Mapper
Unit 2: Aggregator and peak finder
Unit 3: Gene expression clustering/biclustering
Unit 4: Motif finder
Unit 5: Transcriptional model or your extension
Dates: 7/4, 28/4, 18/5, 17/6, July

2 Project guidelines
Schedule is strict.
Work in pairs.
One week after starting the project: status report and a Q&A session with Amos/Eran (focused on techniques, algorithms, implementation).
Two weeks after starting: status report and a Q&A session (focused on analysis of results, problems).
Three weeks after starting: submission and discussion.
Submit results in the course wiki. Put code in your home dir.
Change pairs at least once.
Grade: based on the instructors' evaluation of projects and participation in classes.

3 Module 1: mapper
Read a MNase-seq Solexa reads file in FASTA format:
>name
ACGTACGTACGT…
>name2
ACGTAAAGAC…
Read a genome reference in FASTA format (a set of chromosomes).
Write a mapping program and find the genomic coordinate of each mappable read.
You can ignore insertions/deletions; there is a bonus for considering these to some extent.
Submit:
–A description of the algorithm and the parameters you used
–Mapping statistics (how many reads mapped successfully, how many were non-unique, running time)
–Graphs showing the distribution of errors over the read position, and the G+C content of the reads compared to the genomic trend
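As a starting point for the input step, here is a minimal sketch of a FASTA reader and a G+C helper. The function names `read_fasta` and `gc_content` are illustrative and not part of the course material; the sketch assumes plain-text FASTA with one `>` header line per record and sequences possibly wrapped over several lines.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)

def gc_content(seq):
    """Fraction of G or C among the A/C/G/T characters of seq (N's are ignored)."""
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

# Toy input in the same shape as the slide's example.
example = io.StringIO(">name\nACGTACGTACGT\n>name2\nACGTAAAGAC\n")
reads = list(read_fasta(example))
```

Because `read_fasta` is a generator, the same function can stream a large reads file without holding all records in memory; for the genome reference you would typically collect the records into a dictionary keyed by chromosome name.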

4 Mapping Solexa reads
Mapping Solexa reads to a genome has unique characteristics:
–The query consists of a very large number of short reads
–Similarity to the reference genome is expected to be very high
[Figure: the genome is the database; the Solexa reads are the query]
You can index the query k-mers (using which k?) and traverse the database to search for hits.
Or you can index the database and map queries one by one.
You can expect a low level of errors: 1 or 2 per read.
You can assume that no more than one gap occurred (even this is a lot).
The algorithm must pay particular attention to ambiguous hits (reads that are mapped to more than one position).
The meta-algorithm:
–Build an index for exact k-mers (db/query?)
–Find k-mer hits
–Extend k-mer hits to matches (filter double matches upon detection, or score them probabilistically)
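The meta-algorithm can be sketched as follows. This is one possible reading of it, not the course's reference solution: it indexes the database (genome), uses non-overlapping exact k-mer seeds, extends each seed hit by counting mismatches over the full read, and ignores gaps and the reverse strand. The names `build_index` and `map_read` and the default `max_mismatches=2` are assumptions made for illustration.

```python
from collections import defaultdict

def build_index(genome, k):
    """Map every k-mer of each chromosome to its list of (chrom, position) hits."""
    index = defaultdict(list)
    for chrom, seq in genome.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((chrom, i))
    return index

def map_read(read, genome, index, k, max_mismatches=2):
    """Return the set of (chrom, start, mismatches) placements of read.

    Seeds are taken at offsets 0, k, 2k, ... If the read holds at least
    max_mismatches + 1 such disjoint seeds, one of them must match exactly
    whenever the read has at most max_mismatches substitutions.
    """
    placements = set()  # a set filters double matches found via several seeds
    for offset in range(0, len(read) - k + 1, k):
        for chrom, pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if start < 0:
                continue
            ref = genome[chrom][start:start + len(read)]
            if len(ref) < len(read):
                continue  # the read would run past the chromosome end
            mismatches = sum(a != b for a, b in zip(read, ref))
            if mismatches <= max_mismatches:
                placements.add((chrom, start, mismatches))
    return placements

# Toy genome; real input would come from the reference FASTA file.
genome = {"chr1": "ACGTTGCAAGGCTTAACCGGTTACGATCGA"}
index = build_index(genome, k=4)
```

A read with more than one placement in the returned set is an ambiguous hit; whether to discard it, report all placements, or weight them is the design decision the slide asks you to think about.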

5 Sequence quality
As with Sanger sequencing, next-generation sequencers generate base-calling scores and report them as -10 log10(p), where p is the error probability.
One would like to treat a mismatch at a low-quality position appropriately.
Uniqueness in the genome
For a genome of size G, what is the expected number of k-mer hits as a function of k?
What if nucleotides have variable G+C content?
What if we map all C's to T's?
The genome k-mer spectrum is strongly affected by repetitive elements and microsatellites.
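The two questions on this slide can be explored numerically. The sketch below assumes the simplest possible model, which real genomes violate: positions are independent, G and C each occur with probability gc/2, and A and T each with probability (1 - gc)/2. Only one strand is counted. The function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Invert Q = -10 * log10(p): return the base-calling error probability p."""
    return 10 ** (-q / 10)

def expected_kmer_hits(genome_size, kmer, gc=0.5):
    """Expected occurrences of kmer on one strand of a random i.i.d. genome."""
    p_base = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    p_kmer = math.prod(p_base[b] for b in kmer)
    # There are genome_size - k + 1 possible start positions.
    return (genome_size - len(kmer) + 1) * p_kmer

# With uniform base frequencies, every k-mer has probability 4**-k per position.
uniform_hits = expected_kmer_hits(12_000_000, "ACGTACGTACGT")
# In an AT-rich genome (gc=0.38 is an arbitrary example value), an AT-only
# 12-mer is expected more often than under the uniform model.
at_rich_hits = expected_kmer_hits(12_000_000, "ATATATATATAT", gc=0.38)
```

Comparing these expectations with the observed k-mer counts in the reference is a quick way to see how far repeats and microsatellites push the real spectrum from the random model.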

6 Hashing DNA
Hashing k-mers of length 11 is easy (2^22 possible words).
For longer k-mers (for searching with mismatches), storage is bounded by the genome length!
How to access the hash efficiently?
Best: random access using integer encoding
–A DNA word needs 2 bits per character; you can hash 12-mers in a vector with 16 million entries
Possible: hash table or binary search tree (e.g., STL map, hash_map, Perl associative containers)
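A minimal sketch of the integer encoding, assuming the mapping A=0, C=1, G=2, T=3 (any fixed 2-bit assignment works) and sequences that contain only these four letters; handling of N is left out. `encode_kmer` and `rolling_codes` are illustrative names.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_kmer(kmer):
    """Pack a DNA word into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def rolling_codes(seq, k):
    """Yield (position, code) for each k-mer of seq, updating the code in O(1)."""
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2*k bits
    value = 0
    for i, base in enumerate(seq):
        # Shift in the new base and drop the base that left the window.
        value = ((value << 2) | CODE[base]) & mask
        if i >= k - 1:
            yield i - k + 1, value

# A 12-mer code is an index into a table of 4**12 = 16,777,216 entries.
table_size = 4 ** 12
```

The code of a k-mer can then be used directly as an index into a flat vector of size 4^k, which is the random-access scheme the slide recommends over a generic hash table or tree.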

7 Suffix trees (just for background)
Suffix trees are efficient string encodings.
Geared toward O(d) lookup of substrings of length d.
The tree contains all suffixes of a string as paths from the root.
Each node has no more than A outgoing edges (A=4 for DNA).
Naïve construction: O(N^2)
O(N) construction (!)
O(N) memory (Prove!)
[Figure: suffix tree for “dabdac”]
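To make the naïve O(N^2) construction concrete, here is a sketch that inserts every suffix into a nested-dictionary trie. Note that this is an uncompressed suffix trie, not a true suffix tree: it does not merge single-child chains into edges, so it does not achieve O(N) memory, and it is not the O(N) construction mentioned above. The `"$"` end marker is an assumption and must not occur in the text.

```python
def build_suffix_trie(text):
    """Naive construction: insert each suffix of text, character by character."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["$"] = start  # record where the suffix ending here began
    return root

def contains(trie, pattern):
    """O(d) substring lookup: walk one edge per character of pattern."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("dabdac")
```

Every substring of the text is a prefix of some suffix, which is why a walk from the root answers substring queries in time proportional to the pattern length only.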

8 Sampling short reads
How many reads do we expect to detect at a certain genomic location?
We sample N times (e.g., 10,000,000) from a large population. The number of hits for a single locus is expected to be binomial, B(N, p), where p is the fraction of fragments in the pool that cover the locus.
If Np is large (>10) we can assume a normal distribution.
If Np is small the distribution is approximately Poisson with mean Np.
The p's (the fraction of fragments that cover a locus) will vary among loci:
–In ChIP-seq: loci that are occupied by the targeted factor will be covered
–In MNase-seq: loci that are adjacent to a cutting site
As is often the case, the theoretical assumptions need not hold. Test the distribution of values and see for yourself.
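The sampling model can be checked numerically. For large N and small p, the standard approximation to a binomial count is a Poisson distribution with mean Np; the sketch below compares the two exactly, using N = 10,000,000 from the slide and an arbitrary example value of p chosen so that Np = 2.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed directly."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

N = 10_000_000   # number of sampled reads, as in the slide's example
p = 2e-7         # hypothetical per-locus fraction, chosen so that N * p = 2

# Side-by-side probabilities of seeing 0..5 reads at the locus.
comparison = [(k, binomial_pmf(k, N, p), poisson_pmf(k, N * p)) for k in range(6)]
```

In the spirit of the slide's last point, the more informative test is to histogram the observed per-locus counts from your own mapping and compare them with these theoretical curves.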

9 From mapped reads to coverage statistics
Divide the genome into fixed bins.
Compute how many reads cover each bin.
A better strategy will depend on the application:
–Add ~200-300 bp for fragmented ChIP product
–Add 147 bp (or -10) for nucs (or linkers)
–Add fragment length for RNA
–Paired-end reads
Statistics on spatial bins
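A minimal sketch of fixed-bin coverage with fragment extension, for one chromosome. The input convention is an assumption: each mapped read is given as (position of its 5' end, strand), with 0-based coordinates, and is extended to `fragment_length` in its direction of sequencing. `bin_coverage` is an illustrative name.

```python
def bin_coverage(reads, chrom_length, bin_size, fragment_length):
    """Count, for each fixed-size bin, the extended fragments that overlap it."""
    n_bins = (chrom_length + bin_size - 1) // bin_size  # ceiling division
    counts = [0] * n_bins
    for pos, strand in reads:
        if strand == "+":
            lo, hi = pos, pos + fragment_length          # half-open [lo, hi)
        else:
            lo, hi = pos - fragment_length + 1, pos + 1
        # Clip the fragment to the chromosome boundaries.
        lo, hi = max(lo, 0), min(hi, chrom_length)
        if lo >= hi:
            continue
        for b in range(lo // bin_size, (hi - 1) // bin_size + 1):
            counts[b] += 1
    return counts

# Two toy reads on a 200 bp chromosome, 50 bp bins, 60 bp fragments.
counts = bin_coverage([(10, "+"), (95, "-")], 200, 50, 60)
```

The `fragment_length` parameter is where the application-specific choices from the list above enter, for example 147 for nucleosome-sized MNase fragments.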

10 Implementation considerations
Best: C/C++ (get used to the STL)
–Vectors
–Maps
–Integer encodings
Java
Perl (can use BioPerl): be aware of your memory model
–Associative arrays are expensive
–Use lists when you can
–vec($myvar, $id, bits)
Python
R
Matlab (don’t)

