Whole Genome Assembly and Microarray Analysis


Mate Pairs
Mate pairs allow you to merge islands (contigs) into super-contigs.

Super-contigs are quite large
Make clones of truly predictable length. For example, three sets of clones can be used: 2 Kb, 10 Kb and 50 Kb. The variance in these lengths should be small. Use the mate pairs to order and orient the contigs, and make super-contigs.

Problem 3: Repeats

Repeats & Chimerisms
40-50% of the human genome is made up of repetitive elements. Repeats can cause serious problems in the assembly. Chimerism causes a clone to contain sequence from two different parts of the genome, which can again give a completely wrong assembly.

Repeat detection
Lander-Waterman strikes again! The expected number of clones in a repeat-containing island is much larger than in an island (contig) that contains no repeat. Thus, every contig can be marked as unique or non-unique. In the first step, throw away the non-unique islands.

Detecting Repeat Contigs 1: Read Density
Compute the log-odds ratio of two hypotheses:
–H1: The contig is from a unique region of the genome.
–H2: The contig is from a region that is repeated at least twice.

Detecting Chimeric reads
Chimeric reads: reads that contain sequence from two genomic locations.
–Good overlap: G(a,b) if a and b overlap with a high score.
–Transitive overlap: T(a,c) if G(a,b) and G(b,c).
Find a point x across which only transitive overlaps occur; x is a point of chimerism.

Contig assembly
Reads are merged into contigs up to repeat boundaries. If (a,b) and (a,c) overlap, then (b,c) should overlap as well. Also,
–shift(a,c) = shift(a,b) + shift(b,c)
Most of the contigs are unique pieces of the genome and end at some repeat boundary. Some contigs might lie entirely within repeats; these must be detected.
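One way to make the read-density test concrete is sketched below. It assumes, as in the Lander-Waterman model, that read start positions are Poisson distributed, and it takes the repeat hypothesis to mean exactly two copies. The function and parameter names are illustrative and do not come from the slides.

```python
import math

def unique_log_odds(reads_in_contig, contig_length, total_reads, genome_length):
    """Natural-log odds of H1 (unique region) against H2 (two-copy repeat).

    Assumption: the number of reads starting inside the contig is Poisson
    with mean mu under H1 and 2 * mu under H2, where
    mu = total_reads * contig_length / genome_length.
    """
    mu = total_reads * contig_length / genome_length
    # log[Pois(k; mu) / Pois(k; 2 mu)] = mu - k * ln 2 (the k! terms cancel)
    return mu - reads_in_contig * math.log(2)

# A contig with the expected read count scores positive (looks unique);
# one with twice the expected count scores negative (looks repeated).
print(unique_log_odds(10, 1000, 10_000, 1_000_000))
print(unique_log_odds(20, 1000, 10_000, 1_000_000))
```

A positive score favours the unique hypothesis; where to put the cutoff is a design choice that this sketch leaves open.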
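The shift rule for three mutually overlapping reads can be checked directly. This is a minimal sketch: here a shift is the offset of one read's start relative to another's, and the `tolerance` parameter (an assumption, not in the slides) allows for small indel errors in real reads.

```python
def shifts_consistent(shift_ab, shift_bc, shift_ac, tolerance=0):
    """True if shift(a,c) equals shift(a,b) + shift(b,c) within tolerance."""
    return abs(shift_ac - (shift_ab + shift_bc)) <= tolerance

# b starts 40 bases after a, and c starts 25 bases after b,
# so a consistent layout places c 65 bases after a.
print(shifts_consistent(40, 25, 65))   # consistent triple
print(shifts_consistent(40, 25, 300))  # inconsistent: suggests a repeat-induced overlap
```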

Creating Super Contigs

Supercontig assembly
Supercontigs are built incrementally. Initially, each contig is a supercontig. In each round, a pair of supercontigs is merged, until no more merges can be performed. Create a priority queue with a score for every pair of "mergeable" supercontigs.
–The score has two terms: a reward for multiple mate-pair links, and a penalty for the distance between the links.

Supercontig merging
Remove the top-scoring pair (S1, S2) from the priority queue Q. Merge (S1, S2) to form supercontig T. Remove all pairs in Q containing S1 or S2. Find all supercontigs W that share mate-pair links with T, and insert (T, W) into the priority queue. Detect repeated supercontigs and remove them.

Repeat Supercontigs
If the distance between two supercontigs is not correct, they are marked as repeated. If transitivity is not maintained, then there is a repeat.
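The merging loop can be sketched with a binary heap. This is a simplified illustration, not the scoring used by any particular assembler. The score (number of links minus a weighted mean gap), the `min_links` threshold and the `distance_penalty` weight are all assumptions. The sketch also ignores contig order and orientation, and it does not perform the repeat check.

```python
import heapq
from itertools import count

def merge_supercontigs(contigs, links, min_links=2, distance_penalty=0.001):
    """Greedily merge supercontigs joined by mate-pair links.

    contigs: iterable of contig names.
    links: dict mapping (a, b) to a list of estimated gap sizes, one per link.
    Returns the final supercontigs as sorted lists of contig names.
    """
    members = {c: [c] for c in contigs}  # supercontig id -> contigs it contains
    pair_links = {frozenset(p): list(g) for p, g in links.items()}
    heap, tie = [], count()

    def push(pair):
        gaps = pair_links[pair]
        if len(gaps) >= min_links:  # only well-supported pairs are mergeable
            score = len(gaps) - distance_penalty * sum(gaps) / len(gaps)
            a, b = sorted(pair)
            heapq.heappush(heap, (-score, next(tie), a, b))  # negate for max-heap

    for pair in pair_links:
        push(pair)

    while heap:
        _, _, s1, s2 = heapq.heappop(heap)
        if s1 not in members or s2 not in members:
            continue  # stale entry: one side was already merged away
        t = f"{s1}+{s2}"
        members[t] = members.pop(s1) + members.pop(s2)
        # Re-point links that touched S1 or S2 so that they now touch T.
        moved = {}
        for pair in list(pair_links):
            if s1 in pair or s2 in pair:
                gaps = pair_links.pop(pair)
                others = pair - {s1, s2}
                if others:
                    (w,) = others
                    moved.setdefault(frozenset((t, w)), []).extend(gaps)
        for pair, gaps in moved.items():
            pair_links[pair] = gaps
            push(pair)
    return sorted(sorted(m) for m in members.values())

links = {("A", "B"): [100, 120, 110], ("B", "C"): [500, 480], ("C", "D"): [50]}
print(merge_supercontigs("ABCD", links))
```

Instead of deleting pairs from the queue when S1 or S2 disappears, the sketch leaves them in place and skips them when popped. This "lazy deletion" is a common way to get the same effect with Python's `heapq`, which has no removal operation.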

Filling gaps in Supercontigs

Consensus Derivation & Assembly Summary
–Do an "all pairs" prefix-suffix alignment (speed up using k-mer hashing).
–Construct a graph of overlapping alignments.
–Break the graph into "unique" regions (number of clones similar to the prediction using LW) and "repeat/chimeric" regions. Each such "unique" region is called a contig.
–Merge contigs into super-contigs using mate-pair links.
–For each contig, construct a multiple alignment and a consensus sequence.
–Pad the consensus sequence using Ns.

Summary
Once controversial, whole genome shotgun (WGS) is now routine:
–Human, Mouse, Rat, Dog, Chimpanzee, ...
–Many prokaryotes (one can be sequenced in a day)
–Plant genomes: Arabidopsis, Rice
–Model organisms: Worm, Fly, Yeast
WGS must be followed up with a finishing effort. A lot is not known about genome structure, organization and function.
–Comparative genomics offers low-hanging fruit.

Biol. Data analysis: Review
(Diagram of topics covered so far: Protein Sequence Analysis; Sequence Analysis / DNA signals; Gene Finding; Assembly.)

Other static analysis is possible
(Diagram of topics: Protein Sequence Analysis; Sequence Analysis; Gene Finding; Assembly; ncRNA; Genomic Analysis / Pop. Genetics.)

A static picture of the cell is insufficient
Each cell is continuously active:
–Genes are being transcribed into RNA
–RNA is translated into proteins
–Proteins are post-translationally modified and transported
–Proteins perform various cellular functions
Can we probe the cell dynamically?
–Which transcripts are active? (Transcript profiling)
–Which proteins are active? (Proteomic profiling)
–Which proteins interact? (Gene regulation)
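The first step of the summary, all-pairs prefix-suffix comparison sped up by k-mer hashing, might look like the sketch below. It is deliberately simplified: overlaps are exact string matches, whereas real reads need error-tolerant alignment, and reverse complements are ignored. The function names are illustrative. The filter is only safe when `min_len >= k`, because any overlap of at least k bases must share a k-mer.

```python
from collections import defaultdict

def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0

def find_overlaps(reads, k, min_len):
    """Return {(i, j): overlap length} for read pairs that share at least one k-mer."""
    index = defaultdict(set)  # k-mer -> ids of reads containing it
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)
    # Only pairs that share a k-mer are compared, not all n^2 pairs.
    candidates = {(i, j) for ids in index.values() for i in ids for j in ids if i != j}
    overlaps = {}
    for i, j in candidates:
        length = suffix_prefix_overlap(reads[i], reads[j], min_len)
        if length:
            overlaps[(i, j)] = length
    return overlaps

reads = ["ACGATTACA", "TTACAATAGG", "ATAGGTTCC"]
print(find_overlaps(reads, k=4, min_len=4))
```

The resulting dictionary is one way to represent the edges of the overlap graph used in the second step.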

Micro-array analysis

The Biological Problem
Two conditions need to be differentiated because they have different treatments. Example: ALL (Acute Lymphocytic Leukemia) and AML (Acute Myelogenous Leukemia). Possibly, the sets of over-expressed genes are different in the two conditions.

Supplementary fig. 2. Expression levels of predictive genes in independent dataset. The expression levels of the 50 genes most highly correlated with the ALL-AML distinction in the initial dataset were determined in the independent dataset. Each row corresponds to a gene, with the columns corresponding to expression levels in different samples. The expression level of each gene in the independent dataset is shown relative to the mean of expression levels for that gene in the initial dataset. Expression levels greater than the mean are shaded in red, and those below the mean are shaded in blue. The scale indicates standard deviations above or below the mean. The top panel shows genes highly expressed in ALL, the bottom panel shows genes more highly expressed in AML.

Gene Expression Data
Gene expression data:
–Each row corresponds to a gene
–Each column corresponds to an experiment (sample); each entry is an expression value
Can we separate the experiments into two or more classes? Given a training set of two classes, can we build a classifier that places a new experiment in one of the two classes?
(Figure: expression matrix with genes g as rows and samples s1, s2, ... as columns.)

Three types of analysis problems
–Cluster analysis / unsupervised learning
–Classification into known classes (supervised)
–Identification of "marker" genes that characterize different tumor classes

Supervised Classification: Basics
Consider genes g1 and g2:
–g1 is up-regulated in class A, and down-regulated in class B.
–g2 is up-regulated in class A, and down-regulated in class B.
Intuitively, g1 and g2 are effective in classifying the samples into the two classes. The samples are linearly separable.
(Figure: samples plotted on axes g1 and g2.)

Basics
With 3 genes, a plane is used to separate linearly separable samples. In higher dimensions, a hyperplane is used.

Non-linear separability
Sometimes the data are not linearly separable but can be separated by some other function. In general, the linearly separable problem is computationally easier.

Formalizing the classification problem for micro-arrays
Each experiment (sample) is a vector of expression values.
–By default, all vectors $v$ are column vectors.
–$v^T$ is the transpose of a vector.
The genes are the dimensions of the vector. Classification problem: find a surface that will separate the classes.

Formalizing Classification
Classification problem: find a surface (hyperplane) that will separate the classes. Given a new sample point, its class is then determined by which side of the surface it lies on. How do we find the hyperplane? How do we find the side that a point lies on?
(Figure: samples plotted on axes g1 and g2.)

Basic geometry
What is $\|x\|_2$? What is $x/\|x\|$? What is the dot product?
(Figure: vectors $x = (x_1, x_2)$ and $y$.)

Dot Product
Let $\beta$ be a unit vector.
–$\|\beta\| = 1$
Recall that
–$\beta^T x = \|x\| \cos\theta$, where $\theta$ is the angle between $x$ and $\beta$
What is $\beta^T x$ if $x$ is orthogonal (perpendicular) to $\beta$? How do we specify a hyperplane?

Hyperplane
How can we define a hyperplane L? Find the unit vector that is perpendicular (normal) to the hyperplane.

Points on the hyperplane
Consider a hyperplane L defined by unit vector $\beta$ and distance $\beta_0$. Notes:
–For all $x \in L$, $x^T \beta$ must be the same: $x^T \beta = \beta_0$
–For any two points $x_1, x_2 \in L$, $(x_1 - x_2)^T \beta = 0$

Hyperplane properties
Given an arbitrary point $x$, what is the distance from $x$ to the plane L?
–$D(x, L) = \beta^T x - \beta_0$
When are points $x_1$ and $x_2$ on different sides of the hyperplane?

Separating by a hyperplane
Input: a training set of +ve and -ve examples. Goal: find a hyperplane that separates the two classes. Classification: a new point $x$ is +ve if it lies on the +ve side of the hyperplane, -ve otherwise. In two dimensions the hyperplane is represented by the line $\{x : -\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0\}$.
(Figure: +ve and -ve points on axes $x_1$ and $x_2$, separated by a line.)
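The signed distance $D(x, L)$ also answers the question about sides: two points lie on different sides exactly when their signed distances have opposite signs. A small sketch, with function names chosen here for illustration and $\beta$ assumed to be a unit vector as on the slide:

```python
def signed_distance(beta, beta0, x):
    """D(x, L) = beta^T x - beta0, for a unit normal vector beta."""
    return sum(b * xi for b, xi in zip(beta, x)) - beta0

def on_different_sides(beta, beta0, x1, x2):
    """True if x1 and x2 lie strictly on opposite sides of the hyperplane."""
    return signed_distance(beta, beta0, x1) * signed_distance(beta, beta0, x2) < 0

beta, beta0 = (0.6, 0.8), 1.0  # unit normal, since 0.36 + 0.64 = 1
print(signed_distance(beta, beta0, (3, 4)))             # about 4: on the positive side
print(signed_distance(beta, beta0, (0, 0)))             # -1: on the negative side
print(on_different_sides(beta, beta0, (3, 4), (0, 0)))
```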

Error in classification
An arbitrarily chosen hyperplane might not separate the training set. We need to minimize a misclassification error. Error: the sum of the distances of the misclassified points from the hyperplane. Let $y_i = 1$ for +ve example $i$, and $y_i = -1$ otherwise. Other definitions are also possible.
(Figure: +ve and -ve points on axes $x_1$ and $x_2$, with a candidate hyperplane and its normal $\beta$.)

Gradient Descent
The function $D(\beta)$ defines the error. We follow an iterative refinement: in each step, refine $\beta$ so that the error is reduced. Gradient descent is one approach to such iterative refinement.
(Figure: error curve $D(\beta)$ plotted against $\beta$, with its derivative $D'(\beta)$.)
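The transcript does not preserve the error formula, so the following is a reconstruction under the conventions used above (signed distance $\beta^T x - \beta_0$, labels $y_i = \pm 1$). The set $M$ of misclassified points and the step size $\rho$ are symbols introduced here as assumptions.

```latex
% Error: sum of distances of the misclassified points (each term is non-negative,
% because y_i (\beta^T x_i - \beta_0) < 0 when point i is misclassified)
D(\beta, \beta_0) = -\sum_{i \in M} y_i \left( \beta^T x_i - \beta_0 \right)

% Gradients with respect to the parameters
\frac{\partial D}{\partial \beta} = -\sum_{i \in M} y_i x_i ,
\qquad
\frac{\partial D}{\partial \beta_0} = \sum_{i \in M} y_i

% One gradient-descent step with step size \rho > 0
\beta \leftarrow \beta + \rho \sum_{i \in M} y_i x_i ,
\qquad
\beta_0 \leftarrow \beta_0 - \rho \sum_{i \in M} y_i
```

Each step moves the hyperplane toward the misclassified points, which is what reduces the error.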

Rosenblatt’s perceptron learning algorithm

Classification based on perceptron learning
Use Rosenblatt's algorithm to compute the hyperplane $L = (\beta, \beta_0)$. Assign $x$ to class 1 if $f(x) \ge 0$, and to class 2 otherwise, where $f(x) = \beta^T x - \beta_0$ is the signed distance of $x$ from L.

Perceptron learning
If many solutions are possible, it does not choose between them. If the data are not linearly separable, it does not terminate, and this is hard to detect. The time to convergence is not well understood.

Linear Discriminant analysis
Provides an alternative approach to classification with a linear function. Project all points, including the means, onto a vector $\beta$. We want to choose $\beta$ such that
–The difference of the projected means is large.
–The variance within each group is small.
(Figure: +ve and -ve points on axes $x_1$ and $x_2$, projected onto $\beta$.)
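A minimal sketch of perceptron learning follows, assuming the decision function $f(x) = \beta^T x - \beta_0$ and labels of +1 and -1. It updates on one misclassified point at a time. The learning rate `rate` and the cap `max_epochs` are assumptions added here; the cap is needed because the procedure does not terminate on data that are not linearly separable.

```python
def train_perceptron(points, labels, rate=1.0, max_epochs=1000):
    """Learn (beta, beta0) for f(x) = beta . x - beta0; labels are +1 or -1.

    Returns (beta, beta0, converged). Note that beta is not rescaled to
    unit length during training.
    """
    beta = [0.0] * len(points[0])
    beta0 = 0.0
    for _ in range(max_epochs):
        updated = False
        for x, y in zip(points, labels):
            f = sum(b * xi for b, xi in zip(beta, x)) - beta0
            if y * f <= 0:  # misclassified, or exactly on the hyperplane
                beta = [b + rate * y * xi for b, xi in zip(beta, x)]
                beta0 -= rate * y
                updated = True
        if not updated:  # a full pass with no mistakes: the classes are separated
            return beta, beta0, True
    return beta, beta0, False

def classify(beta, beta0, x):
    """Class 1 if f(x) >= 0, class 2 otherwise."""
    f = sum(b * xi for b, xi in zip(beta, x)) - beta0
    return 1 if f >= 0 else 2

points = [(2, 2), (3, 3), (0, 0), (-1, 0)]
labels = [1, 1, -1, -1]
beta, beta0, converged = train_perceptron(points, labels)
print(beta, beta0, converged)
print(classify(beta, beta0, (2.5, 2.5)), classify(beta, beta0, (-0.5, 0)))
```

The hyperplane returned depends on the order of the training points and on the starting values, so it is only one of the possible separating hyperplanes.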

LDA Cont’d Fisher Criterion
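The slide's formula is not preserved in the transcript. The standard textbook form of the Fisher criterion for two classes is given below as a reference point. The symbols $m_1$, $m_2$ (class means), $C_c$ (the points in class $c$) and $S_W$ (within-class scatter) are notation assumed here, not taken from the slides.

```latex
% Within-class scatter matrix
S_W = \sum_{c \in \{1, 2\}} \sum_{x \in C_c} (x - m_c)(x - m_c)^T

% Fisher criterion: squared difference of projected means
% divided by the projected within-class scatter
J(\beta) = \frac{\left( \beta^T (m_1 - m_2) \right)^2}{\beta^T S_W \, \beta}

% Maximizer (up to scale), assuming S_W is invertible
\beta \propto S_W^{-1} (m_1 - m_2)
```

A large numerator corresponds to well-separated projected means, and a small denominator to low variance within each group, matching the two goals stated for LDA.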

Maximum Likelihood discrimination
Suppose we knew the distribution of points in each class.
–We can compute $\Pr(x \mid \omega_i)$ for all classes $i$, and take the maximum.

ML discrimination
Suppose all the points were in 1 dimension, and all classes were normally distributed.

ML discrimination recipe
We know the form of the distribution for each class, but not the parameters. Estimate the mean and variance for each class. For a new point $x$, compute the discrimination function $g_i(x)$ for each class $i$. Choose $\arg\max_i g_i(x)$ as the class for $x$.
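For the one-dimensional normal case, the recipe can be sketched as follows. The discriminant used is the log of the normal density with the constant term dropped, $g_i(x) = -\tfrac{1}{2}\ln\sigma_i^2 - (x - \mu_i)^2 / (2\sigma_i^2)$, which assumes equal prior probabilities for the classes. The function names and the toy expression values are made up for illustration.

```python
import math
from statistics import mean, pvariance

def fit_classes(training):
    """training: dict of class label -> list of 1-D values.

    Returns label -> (mean, variance), using the maximum-likelihood
    variance estimate (divide by n, not n - 1).
    """
    return {label: (mean(xs), pvariance(xs)) for label, xs in training.items()}

def discriminant(x, mu, var):
    """Log normal density at x, without the constant -0.5 * ln(2 * pi)."""
    return -0.5 * math.log(var) - (x - mu) ** 2 / (2 * var)

def classify(x, params):
    """Pick the class whose discriminant g_i(x) is largest."""
    return max(params, key=lambda label: discriminant(x, *params[label]))

# Toy values for a single gene in two classes (not real data).
params = fit_classes({"ALL": [1.0, 2.0, 3.0], "AML": [7.0, 8.0, 9.0]})
print(classify(2.5, params), classify(8.2, params))
```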