
Lecture 1. Course outline The biological system Omics and its impact Big data The statistician/bioinformatician’s role Bios 540 Introduction to Bioinformatics.


1 Lecture 1. Course outline The biological system Omics and its impact Big data The statistician/bioinformatician’s role Bios 540 Introduction to Bioinformatics

2 Course Outline Instructor: Tianwei Yu Office: GCR Room 334 Email: tianwei.yu@emory.edu Office Hours: by appointment. Teaching Assistant: Mr. Qingpo Cai Office Hours: TBA Course Website: http://web1.sph.emory.edu/users/tyu8/540/index.htm

3 Evaluation Class participation (5%) Three homeworks (15% × 3) Final report based on a research article (50%).

4 Course Outline Statistics Other Disciplines CS Biology Genetics …… 540 Bioinformatics Machine learning and other courses

5 Course Outline
Biological sequence analysis: pairwise alignment; multiple alignment; sequence models; motifs; fast alignment; phylogenetic trees.
High-throughput data generation and preprocessing: next-generation sequencing; microarray RNA/DNA profiling; LC/MS-based proteomics/metabolomics (technique; popular models).
General statistical techniques in high-throughput data: multiple testing & FDR; clustering; classification.
Data interpretation & integration: ontology; some important databases; networks.

6 Related course: Bios 740 (Bios/CS 534 from 2017): Machine Learning.
Supervised learning: Classification: Bayesian decision theory, LDA, classification tree, random forest, SVM, boosting, bump hunting, neural networks, deep learning. Model generalization: variance/bias, training/testing error, cross validation.
Unsupervised learning: Dimension reduction: PCA, factor analysis, ICA, NCA, SIR. Clustering: similarity measures, hierarchical, k-means, model-based clustering …

7 Tentative schedule
Lecture 1: Introduction
Lecture 2: Sequencing; Dynamic programming sequence alignment
Lecture 3: BLAST; Hidden Markov Models in alignment (1)
Lecture 4: Hidden Markov Models (2); Multiple Alignment
Lecture 5: Motif discovery; Phylogeny
Lecture 6: Gene expression: microarray and deep sequencing
Lecture 7: Supervised and Unsupervised Learning (1)
Lecture 8: Supervised and Unsupervised Learning (2)
Lecture 9: Multiple Testing
Lecture 10: Analyzing the DNA by deep sequencing (1)
Lecture 11: Analyzing the DNA by deep sequencing (2); MS-based Proteomics & Metabolomics (1)
Lecture 12: MS-based Proteomics & Metabolomics (2)
Lecture 13: Networks and Ontology
Lecture 14: Data integration

8 Course Outline Recommended Readings for the basics: Richard Durbin et al. (2005) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Michael Waterman (1996) Introduction to Computational Biology – Maps, Sequences & Genomes.

9 The complex biological system (Picture edited from http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/) Figure labels: Metabolites. Red: central dogma. Blue line: interactions.

10 The complex biological systems Genome Transcriptome Proteome Interactome Tissue architectures Cell interactions Signaling …… Metabolome Cell Organism Environment Chemicals Microorganisms

11 The complex biological systems: how many players are there in the human system?
• 30,000~70,000 genes
• One or more regulatory sequences per gene
• ~70% of the genes are alternatively spliced to generate >1 transcript and >1 protein per gene
• >40,000 different metabolites (Human Metabolome Database)
• Hundreds of signaling molecules
• Different cellular architectures
The items listed above are only the species. The amount of each species also matters!

12

13 The complex biological systems. Our goal: the comprehensive understanding of diseases. We face the “big data” challenge.

14 Li et al. Seminars in Immunology 25:209 Comprehensive studies of a disease

15 The complex biological system – our goals
“Omics” includes: Genomics, Transcriptomics, Proteomics, Metabolomics, Interactomics, ……
Measured by: Sequencing, Microarrays, LC/MS, NMR, Two hybrid, ……
Their data are: high-throughput, high-noise.
To reduce noise: advanced pre-processing techniques, giving reliable high-throughput information.
Techniques to analyze high-dimensional data and knowledgebases lead to: biological knowledge, medical knowledge, improved health.

16 The complex biological systems --- the genome http://www.insectscience.org/2.10/ref/fig5a.gif http://content.answers.com/main/content/wp/en/f/f0/DNA_Overview.png

17 The complex biological systems --- the genome
• The human genome is a book with 3 billion characters. 5% are words (protein-coding sequences) and 95% are not.
• The mouse genome contains about 2.5 billion characters. It is very similar to the human genome (85% identical in protein-coding regions). That is one of the reasons why mice are suitable for elucidation of biological mechanisms and drug discovery. The similarity results from a common ancestor 80 million years ago.
• How many genomes are sequenced? The number increases rapidly. http://gregoryzynda.com/ncbi/genome/python/2014/03/31/ncbi-genome.html
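To put the "book" analogy in numbers, here is a small back-of-the-envelope sketch. The genome length and the 5% coding share come from the slide; the 2-bits-per-base encoding (four letters A, C, G, T) and the function names are assumptions made only for illustration.

```python
# Back-of-the-envelope arithmetic for the "genome as a book" analogy.
# Figures from the slide: 3 billion characters, 5% protein-coding.
# Assumption for illustration: each base (A, C, G, T) is stored in 2 bits.

GENOME_LENGTH = 3_000_000_000   # characters (bases), from the slide
CODING_PERCENT = 5              # percent of characters that are "words"
BITS_PER_BASE = 2               # assumed encoding: 4 symbols -> 2 bits


def coding_bases(length: int, percent: int) -> int:
    """Number of bases that fall in protein-coding sequence."""
    return length * percent // 100


def storage_megabytes(length: int, bits_per_base: int) -> float:
    """Uncompressed storage in megabytes (10^6 bytes)."""
    return length * bits_per_base / 8 / 1_000_000


print("coding bases:", coding_bases(GENOME_LENGTH, CODING_PERCENT))
print("storage (MB):", storage_megabytes(GENOME_LENGTH, BITS_PER_BASE))
```

Under these assumptions a single genome fits in well under a gigabyte; the data volume in practice comes from raw reads and from the number of samples, not from the reference text itself.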

18 The complex biological systems --- the genome
• Small variations in the genome can cause huge differences in phenotypes: disease susceptibility, drug response, etc.
• The sequence variations in an individual genome can be measured by PCR (low-throughput), or by microarray and deep sequencing (high-throughput).
http://archive.hpcwire.com/hpcwire/2013-06-05/dell_boxes_up_hpc_for_life_sciences.html

19 The complex biological systems --- the epigenome DNA is structured. Modifications to relevant proteins (methylation/acetylation/…) and DNA itself can change its structure and control gene expression (DNA -> RNA). http://www.roadmapepigenomics.org/

20 The complex biological systems --- transcriptome and proteome The cell is a complex machine. Its active parts are the proteins. The DNA records how each protein should be made, but not the quantity at a given moment. To understand the operation of the machine, we want to know how much of each protein is present under certain conditions. There are potentially >10,000 species of proteins in the cell. Protein modifications further complicate things. The proteome can be directly measured by methods like LC/MS/MS, which is costly. A much easier way is to measure the transcriptome. The messenger RNAs serve as the molds in the making of the parts. Normally, the more molds, the more parts made. mRNA doesn’t have tertiary structures, so it is much easier to quantify by microarrays. http://www.katiephd.com/a-whole-new-rna-world/

21 The complex biological systems --- metabolome
• Small molecules, not coded by DNA.
• Substrates of enzymes (proteins). Reflects activities of the regulatory systems and the environment.
• Directly reflects: metabolic regulation; nutrition; environmental response; drug response.
• Indirectly reflects system changes (redox potential…)
• Measured by NMR, GC/MS, LC/MS, ……

22 The interactome The Scientist 2004, 18(12):18

23 KEGG network – proteins(enzymes) & metabolites. The reactome

24 The relevance of omics experiments in medicine
Biomarker discovery. To find non-invasive methods to:
Predict disease risk; early detection
Disease classification
Predict response to treatment
Monitor disease progression
Before the era of high-throughput experiments, what did the doctors do?
Age, gender, ethnicity, behavioral measures, …
Disease stage, dissection of disease tissue …
Use one-at-a-time methods to analyze proteins/metabolites in disease tissue or biological fluids

25 The relevance of omics experiments in medicine
To study the disease mechanism to find a cure:
(1) Diseases with pathogen → Interaction of the human system with the pathogen. Protein interaction, regulation of gene expression, change in metabolite concentration … What can we block to stop the disease progression?
(2) Diseases without pathogen → What goes wrong in the human system? Is it a genetic disorder? Is it a disturbance of the regulation of the system?

26 The relevance of omics experiments in medicine. Genomics – a few examples
Medical question: Is there a (set of) special mutation causing a disease?
Experimental techniques: Deep sequencing; Single Nucleotide Polymorphism (SNP) arrays.
Computational techniques: Association analysis; Linkage analysis; Multiple testing; ……
Medical question: How to find gene products that aid/suppress the development of a certain type of cancer?
Experimental techniques: Array comparative genomic hybridization; SNP array CGH/LOH; Deep sequencing.
Computational techniques: Segmentation; Multiple testing; Clustering; Classification ……
Medical question: How to find a region of DNA whose folding structure affects disease status?
Experimental techniques: Deep sequencing.
Computational techniques: Alignment; Peak modeling; Segmentation …..

27 The relevance of omics experiments in medicine. Transcriptomics – a few examples
Medical question: Are certain gene products associated with the incidence/progression of a disease?
Experimental techniques: Expression microarrays; Whole Transcriptome Shotgun Sequencing.
Computational techniques: Alignment; Multiple testing; Dimension reduction; Clustering; Classification ……
Medical question: Are there subtypes of disease undetected by regular medical examination?
Experimental techniques: Gene expression (and potentially all other “omics” methods).
Computational techniques: (same as above)

28 The relevance of omics experiments in medicine. Proteomics – a few examples
Medical question: Are certain proteins associated with the incidence/progression of a disease?
Experimental techniques: Mass spectrometry (2D gel -> MS, tandem MS, LC/MS/MS, ……).
Computational techniques: Sequence matching; Multiple testing; Dimension reduction; Clustering; Classification ……
Medical question: How do proteins change their modification patterns in a disease state?
Experimental techniques: (targeted) Mass spectrometry.
Computational techniques: (same as above)
Medical question: How do proteins of pathogens work and interact with human proteins?
Experimental techniques: Mass spectrometry; Immunological methods; Large-scale structural study.
Computational techniques: (same as above); Protein structure analysis.

29 The relevance of omics experiments in medicine. Metabolomics – a few examples
Medical question: How are bodily metabolic networks disrupted in metabolic diseases?
Experimental techniques: Mass spectrometry; NMR etc.
Computational techniques: Data alignment; Metabolite mapping; Multiple testing; Dimension reduction; Functional data analysis ……
Medical question: How do some drugs interfere with the human metabolome? How are they transformed/degraded?
Techniques: (same as above)
Medical question: Do pollutants accumulate in the human body and cause diseases?
Experimental techniques: Mass spectrometry.
Computational techniques: (same as above)

30 The relevance of omics experiments in medicine. “Omics” is revolutionizing medicine.
Personalized medicine: Understand each patient’s system and match them with treatments. (Success example: the Oncotype DX breast cancer test from Genomic Health, used to tailor treatment.)
Predictive medicine & preventive medicine: Find the increased risk, even before the disease onset. Predict the progression of disease after it occurs.
Systems biology → better understanding of diseases: How are all the “omics” measurements related? How do they interact? What does it say about possible treatments and development of drugs?

31 The relevance of omics experiments in medicine. What is personalized medicine? Each person is different by:
Different DNA sequence (tens of millions of sequence variations)
Different DNA structures
Different gene expression levels
Different protein modification/degradation patterns
Different metabolite levels in the blood
Different exposure history
……
Figure: number of SNPs. Bioinformatics, Vol. 27 no. 13 2011, pages 1741–1748

32 The relevance of omics experiments in medicine Bioinformatics, Vol. 27 no. 13 2011, pages 1741–1748 Fig. 1. Personalized medicine. Personal genomics connect genotype to phenotype and provide insight into disease. Pharmacogenomics connects genotype to patient-specific treatment. Traditional medicine defines the pathologic states and clinical observations to evaluate and adjust treatments.

33 The relevance of omics experiments in medicine nature medicine volume 17 | number 3 | 297 – 303.

34

35 The relevance of omics experiments in medicine

36 http://jeffhurtblog.com/2012/07/20/three-vs-of-big-data-as-applied-conferences/ The challenges to statisticians/bioinformaticians Luckily, or unluckily, we are part of the “big data” game.

37 The challenges to statisticians/bioinformaticians
All Omics experiments share one characteristic: Omics → the “totality” → there are many!
We are measuring hundreds of thousands of features from one single person. We are overwhelmed by data; even eyeballing the data becomes impossible.
The task: Reduce the data into a more useful form. Make use of the data in medicine and biological research!
Nature Methods 6, S2 - S5 (2009)

38 The challenges to statisticians/bioinformaticians
The sample size issue. Up to now, most genome-wide association studies (GWAS) have yielded very weak biomarkers. Biomarkers found by microarray are often unreliable. Why?
Diseases are complicated! The human population is diverse! We are limited by sample size!
If a disease is caused by the combinatorial effect of 3 genes located in different regions of the genome, high-throughput technology will have difficulty finding them, even with 1000 samples!
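One way to see why a three-gene combinatorial effect is so hard to find is to count the candidate combinations that would have to be tested. The sketch below is illustrative only: it assumes 30,000 genes (the lower end of the gene count quoted earlier in the lecture) and uses a Bonferroni correction as a simple, conservative stand-in for multiple-testing control; the function names are hypothetical.

```python
from math import comb

# Assumptions for illustration only:
N_GENES = 30_000   # lower end of the gene count quoted in the lecture
ALPHA = 0.05       # overall (family-wise) significance level


def n_combinations(n_features: int, k: int) -> int:
    """Number of distinct k-feature combinations that could be tested."""
    return comb(n_features, k)


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


for k in (1, 2, 3):
    tests = n_combinations(N_GENES, k)
    cutoff = bonferroni_threshold(ALPHA, tests)
    print(f"{k}-gene combinations: {tests:.3e} tests, per-test cutoff {cutoff:.2e}")
```

With trillions of candidate triplets, the per-test cutoff becomes so small that a study with around 1000 samples has very little power to detect any single combination, which is the point the slide makes.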

39 The challenges to statisticians/bioinformaticians
Many medical questions using Omics can be generalized into these forms:
• Processing the data to find the features. Pre-processing, sequence comparison, data modeling…
• Identifying features (SNPs, genes, proteins etc.) associated with a disease (or disease state). Find if a feature is significantly different between normal/disease samples. Statistical models, multiple testing, model validation, generalization…
• Finding previously unknown subtypes of a disease. Group samples based on their feature measurements. Dimension reduction, clustering, …
• Predicting disease/normal status or different disease subtypes/states. Based on the measurements of some features, predict a new case. Predictive model building…
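As a concrete instance of the multiple-testing step listed above, here is a minimal sketch of the Benjamini-Hochberg step-up procedure for controlling the false discovery rate (FDR), next to a Bonferroni cutoff for comparison. The p-values are made up for illustration, and the function names are hypothetical; this is a sketch of the standard procedure, not the course's own code.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return one boolean per p-value: True if rejected at the given FDR.

    Step-up rule: sort the p-values, find the largest rank k such that
    p_(k) <= (k / m) * fdr, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


def bonferroni(p_values, alpha=0.05):
    """Return one boolean per p-value: True if p <= alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Made-up p-values, one per feature (e.g. one test per gene).
example = [0.001, 0.012, 0.02, 0.2, 0.9]
print("Benjamini-Hochberg:", benjamini_hochberg(example))
print("Bonferroni:        ", bonferroni(example))
```

On these made-up values the FDR procedure rejects more hypotheses than Bonferroni, which illustrates why FDR control is the usual choice when thousands of features are tested at once.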

40 The challenges specific to statistical bioinformaticians
Compromise
The models may be too complex → assumptions may not hold; theoretical rigor may not be achieved
Too much background knowledge
Computing needs
Different data types: integration
“Dirty” data
Speed: the first few methods (not the best methods) dominate, and the data evolve
Work with others

