Introduction to computational genomics – hands on course


Data: Solexa MNase reads; gene expression (Gasch et al); TF binding (GSM336334)GSM
Unit 1: Mapper
Unit 2: Aggregator and peak finder
Unit 3: Gene expression clustering/biclustering
Unit 4: Motif finder
Unit 5: Transcriptional model or your extension
Dates: 7/4, 28/4, 18/5, 17/6, July

Project guidelines:
The schedule is strict.
Work in pairs.
One week after starting the project: status report and a Q&A session with Amos/Eran (focused on techniques, algorithms, implementation).
Two weeks after starting: status report and a Q&A session (focused on analysis of results, problems).
Three weeks after starting: submission and discussion.
Submit results in the course wiki. Put code in your home dir.
Change pairs at least once.
Grade: based on the instructors' evaluation of projects and participation in classes.

Module 1: mapper
Read a MNase-seq Solexa reads file in FASTA format:
>name
ACGTACGTACGT…
>name2
ACGTAAAGAC…
Read a genome reference in FASTA format (a set of chromosomes).
Write a mapping program and find the genomic coordinate of each mappable read.
You can ignore insertions/deletions; there is a bonus for considering these to some extent.
Submit:
–A description of the algorithm and the parameters you used
–Mapping statistics (how many reads mapped successfully, how many were non-unique, running time)
–Graphs showing the distribution of errors over the read position and the G+C content of the reads, compared to the genomic trend
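
Reading the two FASTA inputs could look roughly like the sketch below. The names `read_fasta` and `gc_content` are illustrative assumptions, not part of the course material, and the code assumes plain (possibly multi-line) FASTA records with no quality lines.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the record name is the first whitespace-delimited token.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            # Sequence lines of one record are concatenated and upper-cased.
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)

def gc_content(seq):
    """Fraction of G and C characters in a sequence (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

# Small in-memory demo; a real run would pass an open file instead.
demo = io.StringIO(">name\nACGTACGT\nACGT\n>name2\nACGTAAAGAC\n")
records = dict(read_fasta(demo))
```

For a whole genome, keeping each chromosome as one Python string is usually workable; millions of reads are better processed as a stream than stored in a dictionary.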

Mapping Solexa reads
Mapping Solexa reads to a genome has unique characteristics:
–The query consists of a very large number of short reads
–Similarity to the reference genome is expected to be very high
[Figure: genome database and Solexa query]
You can index the query k-mers (using which k?) and traverse the database to search for hits, or you can index the database and map queries one by one.
You can expect a low level of errors: 1 or 2 per read.
You can assume that no more than one gap occurred (even this is a lot).
The algorithm must pay particular attention to ambiguous hits (reads that map to more than one position).
The meta-algorithm:
–Build an index for exact k-mers (db/query?)
–Find k-mer hits
–Extend k-mer hits to matches (filter double matches upon detection, or score them probabilistically)
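
One possible reading of the meta-algorithm, as a minimal sketch: it indexes the database (the genome), handles the forward strand only, ignores gaps, and scores candidates by Hamming distance rather than probabilistically. The function names and the `max_mismatches` parameter are assumptions made for illustration.

```python
from collections import defaultdict

def build_index(genome, k):
    """Map every k-mer of every chromosome to its list of (chrom, position)."""
    index = defaultdict(list)
    for chrom, seq in genome.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((chrom, pos))
    return index

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read, genome, index, k, max_mismatches=2):
    """Return sorted (chrom, start, mismatches) hits for one read.

    Seeds are taken at non-overlapping offsets 0, k, 2k, ... If the read holds
    at least max_mismatches + 1 such seeds, one of them must match exactly
    whenever the read has at most max_mismatches substitutions.
    """
    seen = set()   # candidate starts already extended (filters double matches)
    hits = []
    for offset in range(0, len(read) - k + 1, k):
        for chrom, pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            seq = genome[chrom]
            if start < 0 or start + len(read) > len(seq):
                continue
            if (chrom, start) in seen:
                continue
            seen.add((chrom, start))
            mismatches = hamming(read, seq[start:start + len(read)])
            if mismatches <= max_mismatches:
                hits.append((chrom, start, mismatches))
    return sorted(hits)
```

A read with more than one returned hit is ambiguous and would be counted as non-unique in the mapping statistics. Indexing a full genome with a Python dictionary of strings is memory-hungry; the integer encodings discussed later in the course address that.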

Sequence quality
As with Sanger sequencing, next-gen sequencers generate base-calling scores and report them as $-10\log_{10}(p)$, where p is the probability that the base call is wrong.
One would like to treat a mismatch at a low-quality base appropriately.
Uniqueness in the genome
For a genome of size G, what is the expected number of k-mer hits as a function of k?
–What if nucleotides have variable G+C content?
–What if we map all C's to T's?
The genome k-mer spectrum is strongly affected by repetitive elements and microsatellites.
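
The quality score and the uniqueness questions can be explored numerically. The sketch below assumes a random genome with independent, identically distributed bases, which real genomes violate (repeats, microsatellites); all function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Invert Q = -10 * log10(p): error probability for a quality score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Quality score Q = -10 * log10(p) for an error probability p."""
    return -10 * math.log10(p)

def gc_base_freqs(gc):
    """Base frequencies for a given G+C fraction, split evenly within G/C and A/T."""
    return {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}

def merge_c_into_t(freqs):
    """Three-letter alphabet obtained by mapping every C to T."""
    return {"A": freqs["A"], "G": freqs["G"], "T": freqs["T"] + freqs["C"]}

def expected_kmer_hits(genome_size, kmer, base_freqs=None):
    """Expected exact occurrences of `kmer` on one strand of a random genome.

    ASSUMPTION: bases are drawn independently from `base_freqs`
    (uniform 0.25 each when not given).
    """
    if base_freqs is None:
        base_freqs = gc_base_freqs(0.5)
    p = 1.0
    for base in kmer:
        p *= base_freqs[base]
    positions = genome_size - len(kmer) + 1
    return positions * p
```

Under the uniform model the expectation is about G / 4^k, so k must exceed roughly log4(G) before a random k-mer is expected to be unique. After mapping C to T, a T-rich word becomes much more frequent, because T then carries the combined probability of C and T.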

Hashing DNA
K-mers of length 11 are easy ($2^{22}$ entries).
For longer k-mers (for searching with mismatches), storage is bounded by the genome length!
How to access the hash efficiently?
Best: random access using integer encoding
–A DNA word needs 2 bits per character, so you can hash 12-mers in a vector with 16 million entries
Possible: hash table or binary search tree (e.g., STL map, hash_map, Perl associative containers)
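
A minimal sketch of the 2-bit integer encoding, with the code of each successive k-mer updated in constant time. The mapping A=0, C=1, G=2, T=3 and the function names are assumptions; any fixed assignment of the four codes works.

```python
from array import array

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_kmer(kmer):
    """Pack a k-mer into an integer, 2 bits per base, first base most significant."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def decode_kmer(value, k):
    """Inverse of encode_kmer for a known word length k."""
    bases = []
    for _ in range(k):
        bases.append("ACGT"[value & 3])
        value >>= 2
    return "".join(reversed(bases))

def rolling_codes(seq, k):
    """Yield (position, code) for every k-mer made only of A/C/G/T.

    The code is updated with a shift and a mask per base; any other
    character (e.g. N) resets the window.
    """
    mask = (1 << (2 * k)) - 1
    value, valid = 0, 0
    for i, base in enumerate(seq):
        if base not in CODE:
            value, valid = 0, 0
            continue
        value = ((value << 2) | CODE[base]) & mask
        valid += 1
        if valid >= k:
            yield i - k + 1, value

def count_kmers(seq, k):
    """Count k-mers in a flat vector of 4**k entries, indexed by the integer code."""
    counts = array("I", [0]) * (4 ** k)
    for _, code in rolling_codes(seq, k):
        counts[code] += 1
    return counts
```

With k = 12 the vector has 4^12 = 16,777,216 entries, matching the 16 million figure above. Storing positions rather than counts needs a second array sized by the genome length.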

Suffix trees (just for background)
Suffix trees are efficient string encodings, geared toward O(d) lookup of a substring of length d.
The tree contains all suffixes of a string as paths from the root.
Each node has no more than A outgoing edges (A=4 for DNA).
Naïve construction: $O(N^2)$
O(N) construction (!)
O(N) memory (Prove!)
[Figure: suffix tree for “dabdac”]
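
To make the naïve construction concrete, here is a sketch of an uncompressed suffix trie built from nested dictionaries. This is not the compressed suffix tree with O(N) construction and memory mentioned above; it takes O(N^2) time and space and only illustrates the O(d) lookup. The function names are illustrative.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a nested-dict trie (naive, O(N^2))."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            # Follow the edge labelled ch, creating it if it does not exist.
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """Return True if `pattern` is a substring: one edge per character, O(d)."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("dabdac")
```

Every substring of the text is a prefix of some suffix, which is why a walk from the root suffices to answer the query.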

Sampling short reads
How many reads do we expect to detect at a certain genomic location?
We sample N times (e.g., 10,000,000) from a large population. The number of hits for a single locus is expected to be binomial, B(N, p), where p is the fraction of fragments in the pool that cover the locus.
If Np is large (>10) we can assume a normal distribution.
If Np is small, the distribution is approximately Poisson.
The p's (the fraction of fragments that cover a locus) will vary among loci:
–In ChIP-seq: loci that are occupied by the targeted factor will be covered
–In MNase-seq: loci that are adjacent to a cutting site
As is often the case, the theoretical assumptions need not hold. Test the distribution of values and see for yourself.
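
The two regimes can be checked numerically with the standard library alone. The sketch below evaluates the binomial probability in log space (so that N = 10,000,000 does not overflow) and compares it with the normal density for large Np and with the standard Poisson limit for small Np. The function names and example values of p are illustrative assumptions.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p), computed via log-gamma to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

N = 10_000_000
p_rare, p_common = 1e-7, 1e-5      # Np = 1 and Np = 100 (example values)

# Small Np: binomial vs. Poisson with mean Np.
rare = [(k, binomial_pmf(k, N, p_rare), poisson_pmf(k, N * p_rare)) for k in range(4)]

# Large Np: binomial vs. normal with mean Np and variance Np(1 - p).
mean, var = N * p_common, N * p_common * (1 - p_common)
common = [(k, binomial_pmf(k, N, p_common), normal_pdf(k, mean, var)) for k in (80, 100, 120)]
```

Printing `rare` and `common` next to a histogram of observed per-locus counts is a quick way to see whether the theoretical model holds for a given data set.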

From mapped reads to coverage statistics
Divide the genome into fixed bins.
Compute how many reads cover each bin.
A better strategy will depend on the application:
–Add ~ for fragmented ChIP product
–Add 147 (or -10) for nucs (or linkers)
–Add fragment length for RNA
–Paired-end reads
Statistics on spatial bins
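
A minimal sketch of the fixed-bin strategy for one chromosome, with an optional extension of each read start to an assumed fragment length (for example 147 for nucleosomes). It treats all reads as forward-strand; reverse-strand reads would extend in the opposite direction. The function name and parameters are illustrative.

```python
def bin_coverage(read_starts, chrom_length, bin_size, extend=1):
    """Count, for each fixed-size bin, the reads whose interval overlaps it.

    Each read start s is turned into the half-open interval [s, s + extend),
    clipped to the chromosome. ASSUMPTION: forward-strand reads only.
    """
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for start in read_starts:
        lo = max(start, 0)
        hi = min(start + extend, chrom_length)
        if lo >= hi:
            continue
        # Increment every bin touched by [lo, hi).
        for b in range(lo // bin_size, (hi - 1) // bin_size + 1):
            counts[b] += 1
    return counts

# Three read starts on a 300 bp chromosome, 100 bp bins.
unextended = bin_coverage([0, 95, 250], 300, 100, extend=10)
nucleosome = bin_coverage([0, 95, 250], 300, 100, extend=147)
```

The resulting per-bin counts are the values whose distribution can then be compared with the sampling model from the previous slide.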

Implementation considerations
Best: C/C++ (get used to the STL)
–Vectors
–Maps
–Integer encodings
Java
Perl: be aware of your memory model
–Associative arrays are expensive
–Use lists when you can
–vec($myvar, $id, bits)
(can use BioPerl)
Python
R
Matlab (don't)