Introduction to Modeling and Algorithms in Life Sciences Ananth Grama Purdue University

Acknowledgements: To various sources, including Profs. Mehmet Koyuturk and Michael Raymer, Wiki sources (pictures), and other noted attributions. To the US National Science Foundation and the Center for Science of Information.

Central Dogma of Molecular Biology

Central Dogma of Molecular Biology: Mostly valid, with some exceptions. Reverse transcription: retroviruses such as feline leukemia virus and HIV. RNA replication: RNA-to-RNA transfer in viruses. Direct translation: DNA to protein (typically in cell fragments).

Protein Synthesis: In transcription, a segment of DNA serves as the template for a complementary strand of RNA. This RNA is called messenger RNA (mRNA), since it acts as an intermediary between DNA and the ribosomes. Ribosomes are the parts of the cell that synthesize proteins from mRNA.
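The complementarity rule behind transcription can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `transcribe` and the example strand are assumptions, and the template is assumed to be given in the 3'-to-5' direction so that the result reads 5'-to-3'.

```python
# Pairing from a DNA template base to the RNA base placed opposite it.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template):
    """Return the mRNA complementary to a DNA template strand.

    Assumes the template is written 3'->5', so the mRNA comes out 5'->3'.
    """
    return "".join(DNA_TO_RNA[base] for base in template.upper())

print(transcribe("TACCGGATT"))  # AUGGCCUAA
```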

Eukaryotic Transcription

Synthesizing Proteins (Translation): mRNA is decoded by the ribosome to produce specific proteins (polypeptide chains). Polypeptide chains fold to make active proteins. The amino acids are attached to transfer RNA (tRNA) molecules, which enter one part of the ribosome and bind to the messenger RNA sequence.
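As a rough illustration of decoding, the sketch below reads an mRNA three bases at a time and looks each codon up in the standard genetic code. The names `CODON_TABLE` and `translate` are illustrative, and the simplification of starting at the first base and stopping at the first stop codon is an assumption; real translation starts at a start codon located by the ribosome.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order; '*' marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna):
    """Translate an mRNA string codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the polypeptide
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGGCCUAA"))  # MA
```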

Translation

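Some Numbers: Human DNA has 3 billion base pairs per (haploid) genome copy. Stretched out, at roughly 0.34 nm per base pair, one copy is about 1 m long; a diploid cell with 46 chromosomes holds two copies, about 2 m in total, or roughly 4 cm per chromosome on average. This is packed into a nucleus of 3 to 10 microns.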

Some Numbers

Models for Lactose Intolerance

Haemophilia-A

Haemophilia-A is caused by clotting factor VIII deficiency. Factor VIII is encoded by the F8 gene.

Analyzing Sequences

Sequences: An Evolutionary Perspective. Evolution occurs through a set of modifications to the DNA. These modifications include point mutations, insertions, deletions, and rearrangements. Seemingly diverse species (say mice and humans) share significant similarity (80-90%) in their genes. The locations of genes may themselves be scrambled.

Gene Duplication: Gene duplication has important evolutionary implications. A duplicated (redundant) copy is under relaxed selective pressure, so it can accumulate mutations faster, which can lead to specialization.

Inversions: Paracentric and pericentric inversions

Transposition: A group of conserved genes appears in a transposed fashion at a different location.

Genomic Sequences: Sanger sequencing; next-generation sequencing (Illumina Solexa, Helicos, SOLiD, Roche/454).

Sanger Sequencing

Helicos NGS

Illumina Solexa

Dealing with Reads: From fluorescence to nucleotides (Phred); error correction; mapping to reference genomes; assembly.

Error Correction: NGS reads range from 50 to 300 bp (constantly changing). Error rates range from 1 to 3%. Errors are not uniformly distributed over the read. Correcting errors is a critical step before mapping or assembly.

Error Correction: Needed coverage on the genome; k-mer based error correction; suffix trees.
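One common form of k-mer based error correction starts from the observation that, with enough coverage, a k-mer from the true genome appears in many reads while a k-mer containing a sequencing error appears rarely. The sketch below only performs that detection step; the function names, the toy reads, and the threshold `min_count` are assumptions for illustration, not values from the slides.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_kmers(reads, k, min_count):
    """Return k-mers seen fewer than min_count times (likely to contain errors)."""
    counts = kmer_counts(reads, k)
    return {kmer for kmer, n in counts.items() if n < min_count}

# Three identical reads plus one read with a single substitution (T -> A).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
print(sorted(weak_kmers(reads, k=4, min_count=2)))  # ['ACGA', 'CGAA', 'GAAC']
```

A corrector would then try to edit each read so that all of its k-mers become frequent ones; that step is omitted here.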

Short Read Alignment: Given a reference and a set of reads, report at least one "good" local alignment for each read if one exists. This is an approximate answer to: where in the genome did the read originate?
What is "good"? For now, we concentrate on two criteria.
Fewer mismatches is better:
…TGATCATA…                     …TGATCATA…
  GATCAA      is better than     GAGAAT
Failing to align a low-quality base is better than failing to align a high-quality base:
…TGATATTA…                     …TGATcaTA…
  GATcaT      is better than     GTACAT
(Pop et al.)
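One way to fold both criteria into a single number, offered here as an assumption rather than as the scoring used in the slides, is to add up a per-base quality value at every mismatched position: more mismatches cost more, and a mismatch at a low-quality base costs less than one at a high-quality base. The function name `mismatch_penalty` and the quality values are hypothetical.

```python
def mismatch_penalty(read, quals, ref_window):
    """Sum the quality values of read bases that disagree with the reference.

    read and ref_window are equal-length strings; quals holds one
    Phred-like integer per read base (higher means more trusted).
    """
    if not (len(read) == len(quals) == len(ref_window)):
        raise ValueError("read, quals and ref_window must have equal length")
    return sum(q for r, q, g in zip(read, quals, ref_window) if r != g)

high = [30] * 6
# Fewer mismatches is better: one mismatch beats two.
print(mismatch_penalty("GATCAA", high, "GATCAT"))  # 30
print(mismatch_penalty("GAGAAT", high, "GATCAT"))  # 60
# Two mismatches at low-quality bases beat two at high-quality bases.
print(mismatch_penalty("GATCAT", [30, 30, 30, 5, 5, 30], "GATATT"))  # 10
print(mismatch_penalty("GTACAT", high, "GATCAT"))  # 60
```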

Indexing: Genomes and reads are too large for direct approaches like dynamic programming, so indexing is required. The choice of index is key to performance: suffix tree, suffix array, seed hash tables, and many variants, including spaced seeds.
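To make the idea of an index concrete, here is a small suffix-array sketch: the suffixes of the text are sorted once, after which every occurrence of a pattern can be found by binary search. The construction below sorts whole suffixes and is only suitable for toy inputs; the function names are illustrative.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Return the sorted start positions at which pattern occurs in text."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "acaacg"
sa = build_suffix_array(text)
print(sa)                                # [2, 0, 3, 1, 4, 5]
print(find_occurrences(text, sa, "ac"))  # [0, 3]
```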

Indexing: Genome indices can be big. For human, the sizes quoted are > 35 GB and > 12 GB. Large indices necessitate painful compromises: 1. Require a big-memory machine. 2. Use secondary storage. 3. Build a new index each run. 4. Subindex and do multiple passes.

Burrows-Wheeler Transform: A reversible permutation used originally in compression. Once BWT(T) is built, all else shown here is discarded; the matrix is shown for illustration only. (Figure: T, the Burrows-Wheeler matrix, and its last column, BWT(T).) Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA, Technical Report 124; 1994.
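The matrix view can be written down directly: append a terminator that sorts before every other character, sort all rotations, and read off the last column. This naive sketch builds the full matrix, which a real index would avoid; the function name `bwt` and the use of "$" as terminator are conventions assumed here.

```python
def bwt(text):
    """Return BWT(text) using the sorted-rotations (Burrows-Wheeler matrix) view."""
    t = text + "$"  # '$' sorts before the letters a, c, g, t
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in rotations)

print(bwt("acaacg"))  # gc$aaac
```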

Burrows-Wheeler Transform: The property that makes BWT(T) reversible is the "LF Mapping": the i-th occurrence of a character in the Last column is the same text occurrence as the i-th occurrence of that character in the First column. (Figure: T, BWT(T), and the Burrows-Wheeler matrix, with the label "Rank: 2".)

Burrows-Wheeler Transform: To recreate T from BWT(T), repeatedly apply the rule T = BWT[ LF(i) ] + T; i = LF(i), where LF(i) maps row i to the row whose first character corresponds to i's last character per the LF Mapping. This could be called the "unpermute" or "walk-left" algorithm. (Figure: the final T.)
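A minimal sketch of the walk-left idea follows. It starts in row 0 (the row beginning with the terminator), prepends that row's last character, and jumps with the LF mapping until it reaches the terminator. The helper names and the "$" terminator are assumptions; LF(i) is computed as the first-column offset of the character plus its rank within the last column.

```python
def inverse_bwt(last):
    """Recover T from BWT(T), assuming T was terminated by a single '$'."""
    counts, ranks = {}, []
    for ch in last:  # rank of each character among equal characters in Last
        ranks.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1
    first_start, total = {}, 0
    for ch in sorted(counts):  # row where each character's block begins in First
        first_start[ch] = total
        total += counts[ch]
    i, text = 0, ""  # row 0 is the rotation that starts with '$'
    while last[i] != "$":
        text = last[i] + text                # prepend the preceding text character
        i = first_start[last[i]] + ranks[i]  # LF mapping: move to that character's row
    return text

print(inverse_bwt("gc$aaac"))  # acaacg
```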

FM Index: Ferragina & Manzini propose the "FM Index" based on the BWT. Observed: the LF Mapping also allows exact matching within T; LF(i) can be made fast with checkpointing; and more (see the FOCS paper). Ferragina P, Manzini G: Opportunistic data structures with applications. FOCS. IEEE Computer Society; Ferragina P, Manzini G: An experimental study of an opportunistic index. SIAM symposium on Discrete algorithms. Washington, D.C.; 2001.

Exact Matching with FM Index: To match Q in T using BWT(T), repeatedly apply the rule top = LF(top, qc); bot = LF(bot, qc), where qc is the next character in Q (right-to-left) and LF(i, qc) maps row i to the row whose first character corresponds to i's last character as if it were qc.

Exact Matching with FM Index: In progressive rounds, top and bot delimit the range of rows beginning with progressively longer suffixes of Q.

Exact Matching with FM Index: If the range becomes empty (top = bot), the query suffix (and therefore the query) does not occur in the text.
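The top/bot rule can be sketched as a backward search over the BWT string. In this illustration the range is half-open, so it is empty exactly when top equals bot. The occurrence count is computed by scanning a prefix of the BWT, which is slow; a real FM index would use checkpoints instead. The function names are assumptions, and the BWT string "gc$aaac" corresponds to the text "acaacg" with a "$" terminator.

```python
def first_column_starts(bwt_str):
    """Map each character to the first row of its block in the First column."""
    starts, total = {}, 0
    for ch in sorted(set(bwt_str)):
        starts[ch] = total
        total += bwt_str.count(ch)
    return starts

def lf(bwt_str, starts, i, qc):
    """LF(i, qc): rows before i ending in qc, offset into qc's First-column block."""
    return starts[qc] + bwt_str[:i].count(qc)

def count_matches(bwt_str, query):
    """Return how many times query occurs in the text behind bwt_str."""
    starts = first_column_starts(bwt_str)
    top, bot = 0, len(bwt_str)  # half-open row range [top, bot)
    for qc in reversed(query):  # consume the query right-to-left
        if qc not in starts:
            return 0
        top = lf(bwt_str, starts, top, qc)
        bot = lf(bwt_str, starts, bot, qc)
        if top == bot:          # empty range: this suffix does not occur
            return 0
    return bot - top

print(count_matches("gc$aaac", "aac"))  # 1
print(count_matches("gc$aaac", "ac"))   # 2
print(count_matches("gc$aaac", "agc"))  # 0
```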

Backtracking: Consider an attempt to find Q = "agc" in T = "acaacg". After matching "c", extending the match with "g" fails, because "gc" does not occur in the text. Instead of giving up, try to "backtrack" to a previous position and try a different base.

Sequencing: Find maximal overlaps between fragments ACCGT, CGTGC, TTAC, TACCGT:
--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
TTACCGTGC
The consensus sequence (last row) is determined by vote.
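The basic primitive here is the longest suffix of one fragment that equals a prefix of another. A brute-force sketch, with the illustrative name `overlap`, tries the longest candidate first:

```python
def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0

print(overlap("TTAC", "TACCGT"))   # 3  (TAC)
print(overlap("ACCGT", "CGTGC"))   # 3  (CGT)
print(overlap("CGTGC", "TTAC"))    # 0
```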

Quality Metrics: The coverage at position i of the target or consensus sequence is the number of fragments that overlap that position. (Figure: fragments laid out against the target, forming two contigs separated by a region with no coverage.)
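Coverage is straightforward to compute once each fragment's position on the target is known. In the sketch below the placements are (start, length) pairs; the example uses the fragments TTAC, TACCGT, ACCGT and CGTGC placed at offsets 0, 1, 2 and 4 on the 9-base consensus TTACCGTGC. The function names are illustrative, and counting contigs as maximal runs of nonzero coverage is a simplifying assumption.

```python
def coverage(target_len, placements):
    """Per-position count of fragments covering a target of length target_len."""
    cov = [0] * target_len
    for start, length in placements:
        for pos in range(start, min(start + length, target_len)):
            cov[pos] += 1
    return cov

def count_contigs(cov):
    """Number of maximal runs of positions with nonzero coverage."""
    contigs, inside = 0, False
    for c in cov:
        if c > 0 and not inside:
            contigs += 1
        inside = c > 0
    return contigs

cov = coverage(9, [(0, 4), (1, 6), (2, 5), (4, 5)])
print(cov)                               # [1, 2, 3, 3, 3, 3, 3, 1, 1]
print(count_contigs(cov))                # 1
print(count_contigs([1, 1, 0, 0, 2, 1])) # 2
```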

The Maximum Overlap Graph: An overlap multigraph. Each directed edge (u,v) is weighted with the length of the maximal overlap between a suffix of u and a prefix of v. (Figure: nodes a, b, d, c for the fragments TACGA, CTAAAG, ACCC, GACA; zero-weight edges omitted.)

Paths and Layouts The path dbc leads to the alignment:

Superstrings: Every path that covers every node is a superstring. Zero-weight edges result in alignments like: GACA GCCC TTAAAG. Higher weights produce more overlap, and thus shorter strings. The shortest common superstring is the highest-weight path that covers every node.

Graph formulation of SCS: Input: a weighted, directed graph. Output: the highest-weight path that touches every node of the graph. The problem is NP-hard; use a greedy approximation.
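A common greedy heuristic, sketched here under the assumption that it matches the spirit of the slide, repeatedly merges the pair of fragments with the largest suffix-prefix overlap until one string remains. It does not guarantee the shortest superstring. The function names are illustrative, and ties are broken by scan order.

```python
def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0

def greedy_superstring(fragments):
    """Greedily merge the best-overlapping pair until one string is left."""
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, u in enumerate(frags):
            for j, v in enumerate(frags):
                if i != j:
                    k = overlap(u, v)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0]

print(greedy_superstring(["ACCGT", "CGTGC", "TTAC", "TACCGT"]))  # TTACCGTGC
```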

Greedy Example

So we have sequences now! Find genes in sequences. Query: AGTACGTATCGTATAGCGTAA. What does it do? Find a similar gene with known function in another species and reason from it: align the sequence with known genes and find the gene with the "best" match.
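One standard way to define the "best" match is a local alignment score computed by dynamic programming (the Smith-Waterman recurrence). The sketch below scores the query against each known gene and returns the top scorer. The scoring values, the function names, and the two "known genes" are hypothetical and chosen only for illustration; for genome-scale data an index-based method would be used instead.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman, score only)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

def best_match(query, known_genes):
    """Name of the known gene with the highest local alignment score."""
    return max(known_genes, key=lambda name: local_alignment_score(query, known_genes[name]))

query = "AGTACGTATCGTATAGCGTAA"
# Hypothetical database of annotated genes, for illustration only.
known_genes = {"geneA": "CGTATCGTATAGC", "geneB": "TTTTGGGGCCCC"}
print(best_match(query, known_genes))  # geneA
```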