Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday.

Slides:



Advertisements
Similar presentations
CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Global Sequence Alignment.
Advertisements

Sequence allignement 1 Chitta Baral. Sequences and Sequence allignment Two main kind of sequences –Sequence of base pairs in DNA molecules (A+T+C+G)*
Sequence Alignment.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Marc Schaub Andreas Sundquist Monday & Wednesday.
Lecture outline Sequence alignment Gaps and scoring matrices
Genomic Sequence Alignment. Overview Dynamic programming & the Needleman-Wunsch algorithm Local alignment—BLAST Fast global alignment Multiple sequence.
Welcome to CS262!. Goals of this course Introduction to Computational Biology  Basic biology for computer scientists  Breadth: mention many topics &
Where are we going? Remember the extended analogy? – Given binary code, what does the program do? – How does it work? At the end of the semester, I am.
Computational Genomics Lecture 1, Tuesday April 1, 2003.
CS273a Lecture 8, Win07, Batzoglou Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion.
CS 5263 Bioinformatics Lecture 5: Affine Gap Penalties.
Sequence Alignment. Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch:
Sequencing and Sequence Alignment
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez.
Sequence Alignment. CS262 Lecture 2, Win06, Batzoglou Complete DNA Sequences More than 300 complete genomes have been sequenced.
CS262 Lecture 9, Win07, Batzoglou Multiple Sequence Alignments.
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
Sequence Alignment Cont’d. Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings.
Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.
Sequence Alignment.
Sequence Alignment. Before we start, administrivia Instructor: Serafim Batzoglou, CS x Office hours: Monday 2:00-3:30 TA:
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Marc Schaub Andreas Sundquist Monday & Wednesday.
Alignments and Comparative Genomics. Welcome to CS374! Today: Serafim: Alignments and Comparative Genomics Omkar: Administrivia.
Sequence Alignment Cont’d. Evolution Scoring Function Sequence edits: AGGCCTC  Mutations AGGACTC  Insertions AGGGCCTC  Deletions AGG.CTC Scoring Function:
Multiple Sequence Alignments. Lecture 12, Tuesday May 13, 2003 Reading Durbin’s book: Chapter Gusfield’s book: Chapter 14.1, 14.2, 14.5,
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: George Asimenos Andreas Sundquist Tuesday&Thursday 2:45-4:00 Skilling Auditorium.
Welcome to CS374 Algorithms in Biology. Overview Administrivia Molecular Biology and Computation  DNA, proteins, cells, evolution  Some examples of.
Sequence Alignment Slides courtesy of Serafim Batzoglou, Stanford Univ.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11: Core String Edits.
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TA: Eugene Fratkin Tuesday&Thursday 2:45-4:00 Skilling Auditorium.
Phylogenetic Shadowing Daniel L. Ong. March 9, 2005RUGS, UC Berkeley2 Abstract The human genome contains about 3 billion base pairs! Algorithms to analyze.
Computational Genomics Lecture 1, Tuesday April 1, 2003.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Sequence Alignment Lecture 2, Thursday April 3, 2003.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Phylogenetic Tree Construction and Related Problems Bioinformatics.
Sequence Alignment. -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Given two strings x = x 1 x 2...x M, y = y 1 y 2 …y N,
Welcome to CS374 Algorithms in Biology
1 Theory I Algorithm Design and Analysis (11 - Edit distance and approximate string matching) Prof. Dr. Th. Ottmann.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Developing Pairwise Sequence Alignment Algorithms
Sequence Alignment.
Sequence Alignment and Phylogenetic Prediction using Map Reduce Programming Model in Hadoop DFS Presented by C. Geetha Jini (07MW03) D. Komagal Meenakshi.
Deoxyribonucleic acid (DNA) Biometrics CPSC 4600 Biometrics and Cryptography.
CS 5263 Bioinformatics Lecture 4: Global Sequence Alignment Algorithms.
Sequence Alignment. 2 Sequence Comparison Much of bioinformatics involves sequences u DNA sequences u RNA sequences u Protein sequences We can think of.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
CS223 Algorithms D-Term 2013 Instructor: Mohamed Eltabakh WPI, CS Introduction Slide 1.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
An Introduction to Bioinformatics 2. Comparing biological sequences: sequence alignment.
CS 173, Lecture B August 25, 2015 Professor Tandy Warnow.
Minimum Edit Distance Definition of Minimum Edit Distance.
Overview of Bioinformatics 1 Module Denis Manley..
AdvancedBioinformatics Biostatistics & Medical Informatics 776 Computer Sciences 776 Spring 2002 Mark Craven Dept. of Biostatistics & Medical Informatics.
Advanced Algorithms and Models for Computational Biology -- a machine learning approach Computational Genomics I: Sequence Alignment Eric Xing Lecture.
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
Dynamic Programming Presenters: Michal Karpinski Eric Hoffstetter.
Sequence Similarity.
CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Sequence Alignment.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
1 Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x 1 x 2...x M, y = y.
Definition of Minimum Edit Distance
Definition In simple terms, an algorithm is a series of instructions to solve a problem (complete a task) We focus on Deterministic Algorithms Under the.
Pairwise sequence Alignment.
Intro to Alignment Algorithms: Global and Local
Bioinformatics Algorithms and Data Structures
Presentation transcript:

Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday 11:50-1:05 James H Clark Center S361

Goals of this course Introduction to Computational Biology & Genomics  Basic concepts and scientific questions  Why does it matter?  Basic biology for computer scientists  In-depth coverage of algorithmic techniques  Current active areas of research Useful algorithms  Dynamic programming  String algorithms  HMMs and other graphical models for sequence analysis

Topics in CS262 Part 1: Basic Algorithms  Sequence Alignment & Dynamic Programming  Hidden Markov models, Context Free Grammars, Conditional Random Fields Part 2: Topics in computational genomics and areas of active research  DNA sequencing and assembly  Comparative genomics  Genes: finding genes, gene regulation  Personalized genomics

Course responsibilities Homeworks  4 challenging problem sets, 4-5 problems/pset Due at beginning of class Up to 3 late days (24-hr periods) for the quarter  Collaboration allowed – please give credit Teams of 2 or 3 students Individual writeups If individual (no team) then drop score of worst problem per problem set (Optional) Scribing  Due one week after the lecture, except special permission  Scribing grade replaces 2 lowest problems from all problem sets First-come first-serve, staff list to sign up

Reading material Books  “Biological sequence analysis” by Durbin, Eddy, Krogh, Mitchison Chapters 1-4, 6, 7-8, 9-10  “Algorithms on strings, trees, and sequences” by Gusfield Chapters 5-7, 11-12, 13, 14, 17 Papers Lecture notes

Birth of Molecular Biology DNA Phosphate Group Sugar Nitrogenous Base A, C, G, T Physicist Ornithologist

Genetics in the 20 th Century

21 st Century AGTAGCACAGACTACGACGAGA CGATCGTGCGAGCGACGGCGTA GTGTGCTGTACTGTCGTGTGTG TGTACTCTCCTCTCTCTAGTCT ACGTGCTGTATGCGTTAGTGTC GTCGTCTAGTAGTCGCGATGCT CTGATGTTAGAGGATGCACGAT GCTGCTGCTACTAGCGTGCTGC TGCGATGTAGCTGTCGTACGTG TAGTGTGCTGTAAGTCGAGTGT AGCTGGCGATGTATCGTGGT AGTAGGACAGACTACGACGAGACGAT CGTGCGAGCGACGGCGTAGTGTGCTG TACTGTCGTGTGTGTGTACTCTCCTC TCTCTAGTCTACGTGCTGTATGCGTT AGTGTCGTCGTCTAGTAGTCGCGATG CTCTGATGTTAGAGGATGCACGATGC TGCTGCTACTAGCGTGCTGCTGCGAT GTAGCTGTCGTACGTGTAGTGTGCTG TAAGTCGAGTGTAGCTGGCGATGTAT CGTGGT

Computational Biology Organize & analyze massive amounts of biological data  Enable biologists to use data  Form testable hypotheses  Discover new biology AGTAGCACAGACTACGACGAGA CGATCGTGCGAGCGACGGCGTA GTGTGCTGTACTGTCGTGTGTG TGTACTCTCCTCTCTCTAGTCT ACGTGCTGTATGCGTTAGTGTC GTCGTCTAGTAGTCGCGATGCT CTGATGTTAGAGGATGCACGAT GCTGCTGCTACTAGCGTGCTGC TGCGATGTAGCTGTCGTACGTG TAGTGTGCTGTAAGTCGAGTGT AGCTGGCGATGTATCGTGGT

Intro to Biology

Sequence Alignment

Complete DNA Sequences More than 1000 complete genomes have been sequenced

Evolution

Evolution at the DNA level …ACGGTGCAGTTACCA… …AC----CAGTCCACCA… Mutation SEQUENCE EDITS REARRANGEMENTS Deletion Inversion Translocation Duplication

Evolutionary Rates OK X X Still OK? next generation

Sequence conservation implies function Alignment is the key to Finding important regions Determining function Uncovering evolutionary events

Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x 1 x 2...x M, y = y 1 y 2 …y N, an alignment is an assignment of gaps to positions 0,…, N in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC

What is a good alignment? AGGCTAGTT, AGCGAAGTTT AGGCTAGTT- 6 matches, 3 mismatches, 1 gap AGCGAAGTTT AGGCTA-GTT- 7 matches, 1 mismatch, 3 gaps AG-CGAAGTTT AGGC-TA-GTT- 7 matches, 0 mismatches, 5 gaps AG-CG-AAGTTT

Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch: -s Gap:-d Score F = (# matches)  m - (# mismatches)  s – (#gaps)  d Alternative definition: minimal edit distance “Given two strings x, y, find minimum # of edits (insertions, deletions, mutations) to transform one string to the other”

How do we compute the best alignment? AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC Too many possible alignments: >> 2 N (exercise)

Alignment is additive Observation: The score of aligningx 1 ……x M y 1 ……y N is additive Say thatx 1 …x i x i+1 …x M aligns to y 1 …y j y j+1 …y N The two scores add up: F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])

Dynamic Programming There are only a polynomial number of subproblems  Align x 1 …x i to y 1 …y j Original problem is one of the subproblems  Align x 1 …x M to y 1 …y N Each subproblem is easily solved from smaller subproblems  We will show next Dynamic Programming!!!Then, we can apply Dynamic Programming!!! Let F(i, j) = optimal score of aligning x 1 ……x i y 1 ……y j F is the DP “Matrix” or “Table” “Memoization”

Dynamic Programming (cont’d) Notice three possible cases: 1.x i aligns to y j x 1 ……x i-1 x i y 1 ……y j-1 y j 2.x i aligns to a gap x 1 ……x i-1 x i y 1 ……y j - 3.y j aligns to a gap x 1 ……x i - y 1 ……y j-1 y j m, if x i = y j F(i, j) = F(i – 1, j – 1) + -s, if not F(i, j) = F(i – 1, j) – d F(i, j) = F(i, j – 1) – d

Dynamic Programming (cont’d) How do we know which case is correct? Inductive assumption: F(i, j – 1), F(i – 1, j), F(i – 1, j – 1) are optimal Then, F(i – 1, j – 1) + s(x i, y j ) F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d Where s(x i, y j ) = m, if x i = y j ;-s, if not

G - AGTA A10 -2 T 0010 A-3 02 F(i,j) i = Example x = AGTAm = 1 y = ATAs = -1 d = -1 j = F(1, 1) = max{F(0,0) + s(A, A), F(0, 1) – d, F(1, 0) – d} = max{0 + 1, – 1, – 1} = 1 AAAA TTTT AAAA Procedure to output Alignment Follow the backpointers When diagonal, OUTPUT x i, y j When up, OUTPUT y j When left, OUTPUT x i

The Needleman-Wunsch Matrix x 1 ……………………………… x M y 1 ……………………………… y N Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences An optimal alignment is composed of optimal subalignments

The Needleman-Wunsch Algorithm 1.Initialization. a.F(0, 0) = 0 b.F(0, j) = - j  d c.F(i, 0)= - i  d 2.Main Iteration. Filling-in partial alignments a.For each i = 1……M For each j = 1……N F(i – 1,j – 1) + s(x i, y j ) [case 1] F(i, j) = max F(i – 1, j) – d [case 2] F(i, j – 1) – d [case 3] DIAG, if [case 1] Ptr(i, j)= LEFT,if [case 2] UP,if [case 3] 3.Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment

Performance Time: O(NM) Space: O(NM) Later we will cover more efficient methods