Computational Biology Algorithmic Techniques & Medical Applications CSE 590YA August 15, 2001.


1 Computational Biology Algorithmic Techniques & Medical Applications CSE 590YA August 15, 2001

2 Outline: Overview; Biology; Technology; Algorithms & Applications (Low tech: String algorithms; High tech: Class discovery/prediction); Treatments & clinical outcomes; Conclusions

3 Overview: Human Genome Project. Why is it important? Sequence → functionality; prevention & treatment of disease. Where is there computation in it? Lab hardware/software; analysis: assembly, element discovery. Could not accomplish w/o computers.

4 Bigger Picture. Biology of the (not so) past: isolated; low level (one X at a time); slow accumulation of knowledge. Biology of the present: global; high level (organismal/theoretical); rapid accumulation of knowledge; rapid generation of open questions.

5 Example: S. cerevisiae (yeast). Yeast before expression arrays: model organism for experiments; easy to grow, modify, and study; genetics similar to higher organisms. Yeast after expression arrays: immensely more useful; now know most gene functions; new results every month that used to take five years; results are directly applicable to higher organisms.

6 A good beginning … The genome is not the end: code to be deciphered; human road map. Greater need for computational tools and power. Example: dbSNP. Data exists; need help finding and relating it all.

7 Computers: not just for analysis. Role reversal. Before: biologists generate data, computers analyze it. Now: computers generate experiments, biologists perform them. Cycle. New future for CMBists. Biotech has greatest opportunity for real science to be done, and CS is crucial!

8 CB is good for CS. Old research revisited and applied. Clustering: expired in the 70s, reborn 3 years ago; new papers → reacceptance as research topic; data mining, web statistics, e-commerce. Machine learning: well-studied over the past couple decades; new needs in CB → new research on tuning.

9 Outline: Overview; Biology; Technology; Algorithms & Applications (Low tech: String algorithms; High tech: Class discovery/prediction); Treatments & clinical outcomes; Conclusions

10 Biochemistry 101. Cells: basic building blocks of life. Proteins: key to functionality; catalyze reactions; store and release energy; build cells and cell components; process-specific, yet resource-efficient.

11 The genetics of proteins. DNA: four-base alphabet; genes are instructions for building proteins. Cell cycle: extensive regulatory mechanism; construct proteins at right time and place; break down proteins and reuse components; incredibly complex series of steps.

12 Transcription & translation. DNA → RNA: transcription factors; RNA polymerase. RNA → protein: translation at ribosome; amino acid chains. Protein degradation.

13 Outline: Overview; Biology; Technology; Algorithms & Applications (Low tech: String algorithms; High tech: Class discovery/prediction); Treatments & clinical outcomes; Conclusions

14 Technology: DNA microarrays. Consensus RNAs adhered to slide. Test and control cDNAs produced: fluorescently labeled; hybridized with RNAs on slide. Scan fluorescence with computer. Results: how much RNA present! What does this signify?
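One common way to turn the scanned fluorescence into "how much RNA is present" is a per-spot log ratio of test to control intensity. The sketch below assumes that summary; the slide does not say which one was used. The gene names and intensity values are made up for illustration.

```python
import math

def log_ratio(test_intensity, control_intensity):
    """log2(test / control): positive means more RNA in the test sample."""
    return math.log2(test_intensity / control_intensity)

# Hypothetical (test, control) fluorescence readings for three spots.
spots = {
    "geneA": (800.0, 200.0),  # brighter in test sample
    "geneB": (100.0, 400.0),  # brighter in control sample
    "geneC": (300.0, 300.0),  # unchanged
}

ratios = {gene: log_ratio(t, c) for gene, (t, c) in spots.items()}
for gene, value in ratios.items():
    print(gene, round(value, 2))
```

Using a log scale makes a fourfold increase and a fourfold decrease symmetric around zero, which is convenient for the correlation and clustering steps later in the talk.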

15 Example uses. Timepoints in the cell cycle: Which genes are always “on”? Which genes are responsible for certain events in the cycle? Differential expression in experiment: Which genes are responsible for a particular cell response? What is the response pattern over time?

16 Outline: Overview; Biology; Technology; Algorithms & Applications (Low tech: String algorithms; High tech: Class discovery/prediction); Treatments & clinical outcomes; Conclusions

17 “Low tech” algorithms. 90s: DNA is just a bunch of strings. Questions became answerable! Are there gross similarities in the genome? What do they imply? Are there smaller recurring elements in the genome? What is their function? I know what Gene A does. Can I use that to figure out what Gene B does?

18 String and sequence matching. String matching: find exact replicas of DNA sequence elsewhere in the genome. Are they statistically unlikely? Sequence matching: regions of DNA that look similar (allows for evolution); also applied to proteins. In reality, sequences are more important.
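The exact string-matching question can be sketched with a k-mer index: record where every length-k substring occurs, keep the ones that repeat, and compare against a crude expectation. The uniform, independent four-base background model in `expected_occurrences` is an assumption made here for illustration; the function names and the toy sequence are hypothetical.

```python
from collections import defaultdict

def find_exact_repeats(genome, k):
    """Map each length-k substring that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)
    return {kmer: locs for kmer, locs in positions.items() if len(locs) > 1}

def expected_occurrences(genome_length, k):
    """Expected count of one specific k-mer if bases were uniform and independent."""
    return (genome_length - k + 1) * 0.25 ** k

toy_genome = "ACGTACGTTTGACGTA"
repeats = find_exact_repeats(toy_genome, 4)
print(repeats)                                   # repeated 4-mers and where they start
print(expected_occurrences(len(toy_genome), 4))  # baseline for "statistically unlikely"
```

A k-mer seen far more often than this baseline is a candidate recurring element. Approximate (sequence) matching needs alignment methods instead, which this sketch does not cover.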

19 Computer tools. Biological questions could be answered better by a computer than by a biologist. GenBank, FASTA, BLAST, GAP. Not trivial developments, even for CS: required novel approaches to NP-hard problems. Web proliferation (ongoing): www.cs.jhu.edu/~salzberg/appendixa.html

20 High tech: expression arrays. Use active gene data to classify a cell. Example: cancer type prediction. Subtypes appear very similar histologically; very different clinical courses. Diagnoses: biologists’ insight rather than systematic/unbiased approaches.

21 Classifying cancer. ALL vs. AML: two kinds of leukemia (only recently separated); must be treated very differently; distinguishable in clinic, but not 100% reliable. Golub (1999). Goal: determine cancer type by overall gene expression; build an automated classifier. By-product: one of earliest quantitative uses of DNA microarrays.

22 Strategy. Get expression data for 6800 genes from 27 ALL and 11 AML patients. Clustering: find genes with expression levels that are strongly correlated with the ALL-AML class distinction. Give each such gene a weighted predictive vote for its class. Let important genes vote on test cases.

23 Determining correlation w/ class. Idealized expression patterns (IEPs). Neighborhood analysis. Correlation metric: Euclidean distance, regression, TNOM. Significance. Q: Is gene more highly correlated with IEP than would be expected by chance? A: Examine correlation w/ random IEP permutations. Results: 1100 genes more highly correlated with ALL-AML class distinction than expected by chance.
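The permutation idea on this slide can be sketched as follows: score one gene against the true class labels, then rescore it against shuffled labels to see how often chance does as well. The signal-to-noise score used here, (mean_ALL - mean_AML) / (sd_ALL + sd_AML), is one possible correlation metric and is an assumption of this sketch; the slide lists several others. All expression values are invented.

```python
import random
from statistics import mean, stdev

def signal_to_noise(values, labels):
    """(mean_ALL - mean_AML) / (sd_ALL + sd_AML) for one gene (assumed metric)."""
    all_vals = [v for v, lab in zip(values, labels) if lab == "ALL"]
    aml_vals = [v for v, lab in zip(values, labels) if lab == "AML"]
    return (mean(all_vals) - mean(aml_vals)) / (stdev(all_vals) + stdev(aml_vals))

def permutation_p_value(values, labels, n_perm=1000, seed=0):
    """Fraction of label shuffles whose |score| matches or beats the observed one."""
    rng = random.Random(seed)
    observed = abs(signal_to_noise(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(signal_to_noise(values, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # +1 keeps the estimate away from exactly zero

labels = ["ALL"] * 5 + ["AML"] * 5
informative_gene = [9.1, 8.7, 9.5, 8.9, 9.3, 2.1, 2.8, 1.9, 2.5, 2.2]
noise_gene = [1.0, 2.0, 3.0, 4.0, 5.0, 1.1, 2.1, 3.1, 4.1, 5.1]

p_informative = permutation_p_value(informative_gene, labels)
p_noise = permutation_p_value(noise_gene, labels)
print(p_informative, p_noise)
```

With thousands of genes tested at once, the per-gene p-values would still need a multiple-testing view (how many genes exceed a score by chance), which is closer to what the neighborhood analysis on the slide addresses.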

24 Making a class predictor. Subset of informative genes will elect the class of a new sample. Each casts weighted vote for its class, based on: expression level of gene in test sample; original correlation of gene w/ class distinction. Prediction strength (PS): margin of victory after all genes vote. If less than threshold, then uncertain.
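A minimal sketch of weighted voting and prediction strength, under stated assumptions: each gene's vote is its class-correlation weight times the distance of the test sample's expression from a per-gene decision boundary, and PS is (winning total - losing total) / (sum of both totals). The exact vote formula, the 0.3 threshold, and all gene names and numbers are assumptions for illustration, not values taken from the slide.

```python
def predict(sample, genes, threshold=0.3):
    """Return (call, prediction_strength) for one sample.

    sample: gene name -> expression level in the test sample
    genes:  gene name -> (weight, boundary); a positive weight means high
            expression favours ALL, a negative weight means it favours AML
    """
    votes_all = 0.0
    votes_aml = 0.0
    for gene, (weight, boundary) in genes.items():
        vote = weight * (sample[gene] - boundary)
        if vote > 0:
            votes_all += vote
        else:
            votes_aml += -vote
    total = votes_all + votes_aml
    if total == 0:
        return "uncertain", 0.0
    strength = abs(votes_all - votes_aml) / total  # margin of victory
    winner = "ALL" if votes_all > votes_aml else "AML"
    return (winner if strength >= threshold else "uncertain"), strength

# Hypothetical informative genes: (weight, decision boundary).
genes = {"g1": (2.0, 5.0), "g2": (-1.5, 3.0), "g3": (1.0, 4.0)}

clear_case = predict({"g1": 8.0, "g2": 1.0, "g3": 4.5}, genes)
close_case = predict({"g1": 6.0, "g2": 4.0, "g3": 3.0}, genes)
print(clear_case, close_case)
```

The second sample shows why the threshold matters: the genes disagree, the margin is small, and the predictor declines to make a call rather than guess.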

25 Validation of the model (a). Initial data set: cross-validation. For each patient sample: build a classifier without it (i.e. w/ 37 others); predict class of left-out sample. Calculate cumulative error rate. Results: used top 50 genes; 36/38 samples classified correctly, 2 uncertain.
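The leave-one-out loop described on this slide can be sketched generically. The nearest-class-mean classifier below is only a stand-in to keep the example short; it is not the weighted-voting predictor from the talk, and the one-feature samples are invented.

```python
def nearest_mean_classifier(train):
    """Stand-in classifier: assign a value to the class with the closest mean."""
    by_label = {}
    for value, label in train:
        by_label.setdefault(label, []).append(value)
    centers = {label: sum(vals) / len(vals) for label, vals in by_label.items()}
    return lambda value: min(centers, key=lambda label: abs(value - centers[label]))

def leave_one_out_errors(samples, build):
    """Hold out each sample in turn, train on the rest, count wrong predictions."""
    errors = 0
    for i, (value, label) in enumerate(samples):
        # Everything that depends on the data (including any gene selection)
        # must be redone here, without the held-out sample.
        model = build(samples[:i] + samples[i + 1:])
        if model(value) != label:
            errors += 1
    return errors

samples = [(9.0, "ALL"), (8.5, "ALL"), (9.4, "ALL"),
           (2.0, "AML"), (2.6, "AML"), (6.0, "AML")]
errors = leave_one_out_errors(samples, nearest_mean_classifier)
print(errors, "errors out of", len(samples))
```

The borderline AML sample at 6.0 is misclassified once it is removed from training, which is exactly the kind of error the cumulative rate is meant to expose.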

26 Validation of the model (b). Independent data set: test validation. 34 samples from diverse tissues. 29/34 “strong” predictions; 100% accuracy. PS values quite high for both: .77 in cross-validation; .73 in independent. Mean PS lower for samples from one particular laboratory: importance of standardization in clinical setting.

27 Further results of clinical importance. 10 → 200 voting gene set had same accuracy. Voter gene function: not just lineage markers; surface receptors, anti-apoptotic agents, cell cycle regulators, DNA manipulators, known oncogenes. These genes provide insight into cancer causes. New biological knowledge as a result of computational methods! Other applications of CP & feature selection: response to chemotherapy; eventual outcome of disease.

28 Other array-based classifiers (a): k-means clustering. Select “high-scoring” features like before. Pick k points as initial cluster centroids. Add each new data point to nearest cluster. Move that cluster centroid to new mean. Use these centroids to classify test cases.
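The steps on this slide describe a sequential (one pass, one point at a time) form of k-means, which can be sketched as below. Taking the first k points as the initial centroids is an assumption, since the slide does not say how they are picked; the two-feature data are invented.

```python
def nearest(centroids, point):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((c - x) ** 2 for c, x in zip(centroids[j], point)))

def sequential_k_means(points, k):
    """First k points seed the clusters; each later point updates its nearest centroid."""
    centroids = [list(p) for p in points[:k]]
    counts = [1] * k
    for point in points[k:]:
        j = nearest(centroids, point)
        counts[j] += 1
        # Incremental mean: move the centroid to the mean of all its members so far.
        centroids[j] = [c + (x - c) / counts[j] for c, x in zip(centroids[j], point)]
    return centroids

points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.8, 1.2), (9.2, 8.8)]
centroids = sequential_k_means(points, 2)
print(centroids)
print(nearest(centroids, (2.0, 2.0)), nearest(centroids, (7.0, 8.0)))  # classify test cases
```

To use the centroids as a classifier, each cluster still has to be given a class name, for example by the majority label of its training members.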

29 Other array-based classifiers (b): Support Vector Machines. Goal: find a plane that separates data points. If not separable: boost the data points into a higher dimensional space using some well-behaved kernel function; try to find a separating hyperplane there. Key benefits of SVM version: kernel avoids explicit representation of higher-dim space; finding the maximum margin separating classes avoids overfitting.
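The claim that a kernel avoids explicitly representing the higher-dimensional space can be checked on a small case. The sketch below uses the degree-2 polynomial kernel k(x, y) = (x · y)^2 as an example choice; the slide does not name a specific kernel. It shows only the kernel identity and a separable lifting, not SVM training itself.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D point."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def poly_kernel(x, y):
    """Same inner product as dot(phi(x), phi(y)), computed without building phi."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, y), dot(phi(x), phi(y)))

# XOR-like data: no line in the original 2-D space separates the two classes,
# but after lifting, the sign of the middle coordinate does.
positives = [(1.0, 1.0), (-1.0, -1.0)]
negatives = [(1.0, -1.0), (-1.0, 1.0)]
print([phi(p)[1] for p in positives], [phi(n)[1] for n in negatives])
```

The kernel costs one 2-D dot product regardless of how large the lifted space is, which is what makes very high-dimensional feature maps practical.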

30 Class discovery. What if we don’t know how many clusters we want? The discovery of finer-grained subtypes of cancer has been arduous and slow. How can microarrays help here? Golub (1999) again … Automatic class discovery based solely on gene expression.

31 Self-organizing maps (SOMs). Very much like k-means clustering. However, we don’t know the discriminating features in advance: cluster based on all gene expression levels. Results for 27 ALL/11 AML data set: Class A: 24/25 samples were ALL; Class B: 10/13 samples were AML. Quite effective, but not perfect.
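A very small self-organizing map can be sketched to show how it differs from k-means: the winning node moves toward each sample, and its map neighbors are pulled along more weakly. Everything specific here is an assumption for illustration: the one-dimensional row of nodes, the linear initialization between per-feature minimum and maximum, the learning-rate schedule, and the invented two-feature data.

```python
def best_node(nodes, x):
    """Index of the node closest to sample x (squared Euclidean distance)."""
    return min(range(len(nodes)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(nodes[j], x)))

def train_som(data, n_nodes=2, epochs=20, lr0=0.3, neighbor0=0.2):
    """Train a 1-D row of n_nodes (n_nodes >= 2) on unlabeled samples."""
    dims = range(len(data[0]))
    lo = [min(x[d] for x in data) for d in dims]
    hi = [max(x[d] for x in data) for d in dims]
    # Spread the initial nodes evenly between the feature-wise min and max.
    nodes = [[lo[d] + (hi[d] - lo[d]) * j / (n_nodes - 1) for d in dims]
             for j in range(n_nodes)]
    for epoch in range(epochs):
        decay = 1 - epoch / epochs           # shrink both rates over time
        lr, neighbor = lr0 * decay, neighbor0 * decay
        for x in data:
            winner = best_node(nodes, x)
            for j, node in enumerate(nodes):
                if j == winner:
                    pull = lr
                elif abs(j - winner) == 1:   # adjacent node on the map
                    pull = lr * neighbor
                else:
                    pull = 0.0
                for d in dims:
                    node[d] += pull * (x[d] - node[d])
    return nodes

data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
nodes = train_som(data)
assignments = [best_node(nodes, x) for x in data]
print(assignments)
```

No class labels are used anywhere in training; the two groups emerge from the expression values alone, which is the point of class discovery.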

32 SOMs (cont’d). How can we evaluate the “learned” clusters w/o knowing the true classes? Test by class prediction: accuracy should be high if classes reflect true structure. Results: predictors w/ variety of genes did well in cross-validation. Exception: the one AML in class A was often predicted to be in class B. This suggests an iterative method for class discovery: discover, predict, refine.

33 Independent model validation. Cannot assess “accuracy” on test data; instead, assess prediction strength. High PS indicates that structure in initial data is also present in test data. Results: median PS = .61, 74% of samples above threshold. Compared w/ random clusters, PS’s were highly statistically significant. We have discovered ALL-AML distinction! Even lower-level distinctions also discovered.

34 Other CS w/ expression arrays. Regulatory element detection: correlate expression data with frequency of DNA motifs; taxing even for fastest processors today. Discovery of regulatory pathways: treat expression arrays over time as a graph; establish a Bayesian network model for regulatory pathways over the array graph structure; infer network parameters → pathway structure.
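For regulatory element detection, the simplest version of "correlate expression with motif frequency" is to count a candidate motif upstream of each gene and compute a Pearson correlation with that gene's expression. This is a sketch under that assumption; the motif string, the upstream sequences, and the expression values are all hypothetical.

```python
import math

def count_motif(sequence, motif):
    """Number of (possibly overlapping) occurrences of motif in sequence."""
    return sum(1 for i in range(len(sequence) - len(motif) + 1)
               if sequence[i:i + len(motif)] == motif)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

motif = "ACGCGT"  # hypothetical candidate motif
upstream = ["TTTTTTTTTT", "TTACGCGTTT", "ACGCGTACGCGT", "ACGCGTACGCGTACGCGT"]
expression = [1.0, 2.1, 2.9, 4.2]  # made-up expression level per gene

counts = [count_motif(seq, motif) for seq in upstream]
r = pearson(counts, expression)
print(counts, round(r, 3))
```

The computational cost the slide mentions comes from repeating this over every candidate motif and every experimental condition, not from any single correlation.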

35 Problems with DNA arrays. Different companies, different types. Even within one company: different products over time; different binding efficiencies. Much time spent on normalization; even then, different groups’ results are hard to compare. Biggest worry: RNA levels in cells do not accurately reflect current protein content. Perhaps limits our discovery potential.

36 Proteomics. If protein is most important, why not study it directly? Much work is done on proteins already, but difficult to purify, prepare, quantify; results are very coarse. Emerging technologies: more efficient protein purification and protein arrays are being developed! Lots of discoveries to come.

37 Outline: Overview; Biology; Technology; Algorithms & Applications (Low tech: String algorithms; High tech: Class discovery/prediction); Treatments & clinical outcomes; Conclusions

38 Looking to the future. Biology is becoming a more theoretical, unified science. The problem w/ biology has always been that there are too many layers; work has always been somewhere in the middle. Now research is beginning to focus on processes and pathways and networks in general. This is the proper path to developing theories. Along the way … lots of hard computational problems to be solved!

