Motif Detection in Yeast Vishakh Joe Bertolami Nick Urrea Jeff Weiss.

Presentation transcript:

Motif Detection in Yeast Vishakh Joe Bertolami Nick Urrea Jeff Weiss

Overview
1. Problem Statement
2. Motivation
3. History
4. Our Approach
5. Evaluation
6. Results
7. Discussion
8. References

1. The Problem  Find regulatory sequences in the upstream region of yeast DNA.  Regulatory sequences are segments of DNA where proteins can bind to enhance transcription of a gene.

The Problem  We are given: Upstream Genome- consists of:  Gene Families- consists of:  Individual Genes- consists of:  Strings like ATGC  We had to find substrings unusually frequent in gene families given their distribution in the whole upstream genome.

The Problem  We emulated techniques devised by van Helden.  Worked on similar data set and tried to emulate and even better his findings.

2. Motivation  Organisms like yeast share many genes with humans.  As a result, they share diseases too.  Finding regulatory sequences in yeast might lead to medical advances.  Might lead to therapies for diseases such as cystic fibrosis.

3. History  Previous century saw rapid advances in genetics.  Scientific community trying to get a better understanding of various genomes.  This particular technique was developed by Jacques van Helden.

4.Our approach  Extract all substrings of lengths 6-8 in the upstream genome.  Calculate frequency of occurrence of each substring.  Put this data in a table.

Our Approach  Consider a gene family.  Find all substrings in it and frequencies and build table.  For each entry, add the probability of occurrence.  Use above data to calculate three scores.

Our Approach  Score 1: Expected Occurrence / Actual Occurrence  Use probability of occurrence and size of gene family to calculate expected occurrence.  Divide by actual occurrence.  Low score -> Unusually frequent substring.

Our Approach  Score 2: Poisson Distribution  Use expected and actual number of occurrences.  If substring occurs ‘n’ times, calculate probability of ‘n’ occurrences using Poisson Distribution.  Lower probability -> Unusually frequent

Our Approach  Score 3: Binomial Theorem  Use probability of occurrence, sizes of genome and gene family and actual occurrences.  If substring occurs ‘n’ times, calculate probability of ‘n’ occurrences using Binomial Distribution.  Lower probability -> Unusually frequent

Our Approach  Sort substrings by a score.  Take top sequences, create a probability matrix.  Iterate probability matrix to get probabilistic model of regulatory sequence.

5. Evaluation Metrics  Van Helden’s results in ’98 paper and his website.  ’98 paper used old data, not very reliable for evaluation.  Website very useful since it works on current data and dynamically calculates results.  Compared our output to his.

Evaluation Metrics  Also, compare three scores types to find best method.

6. Results
Comparison of Results for MET FAMILY

Gene     Van Helden’s site   Binomial Dist   Poisson Dist   Expected / Actual   Old Paper
CACGTG   1                   1               3              4                   1
ACGTGA   2                   2               1              2                   3
TCACGT   3                   3               2              1                   2
ATATAT   44N/A 5
TATATA   55N/A 10
AACTGT
ACAGTT   7                   6               N/A            29                  N/A
ACACAC   897N/A
GTGTGT   986N/A

Results  Probability matrices generated successfully!

7. Discussion  Paper results clearly outdated.  Close co-relation with van Helden’s site.  Binomial distribution best, followed by Poisson and Expected/Actual

Discussion  Why don’t Binomial results perfectly match van Helden’s site? Van Helden paper only outlines general method. He uses many filters and adjustments. Limited info about them on site. We used similar, but not same, filters. Example: Purge sequences that appear twice in a row.

Discussion  Future work Find more filters. Try other similar organisms’ genomes. Biologically verify results!

Discussion  What we learnt Biology!  First-hand look at genetic data  Became more familiar with genes  Clearly understood what the fuss about genetics is about Computer Science  Teamwork  Interfacing CS with other scientific disciplines

References  van Helden, J., André, B. & Collado-Vides, J. (1998). Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 281(5),  van Helden, J., Rios, A. F. & Collado- Vides, J. (2000). Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res. 28(8):