DNA Sequences Analysis. Hasan Alshahrani, CS6800. Outline: Statistical background: HMMs. What is a DNA sequence? How to get a DNA sequence. DNA sequence formats. Analysis methods and tools. What is next?

HMMs: A hidden Markov model (HMM) is a very useful statistical model for molecular biology, although it was originally intended for speech recognition. An HMM can serve as a statistical profile of a protein family (or of a family of DNA sequences) and can then be used to search a database for other similar sequences or family members. Q1: How can HMMs be used in DNA analysis?

To calculate the probability of the sequence ACTTCG, we multiply the probabilities of its nucleotides, where each factor after the first is the conditional probability that a certain nucleotide appears in a position, given the nucleotide in the previous position: $P(\mathrm{ACTTCG}\ldots) = P_1(A)\,P_2(C \mid A)\,P_3(T \mid C)\,P_4(T \mid T)\,P_5(C \mid T)\,P_6(G \mid C)\cdots$
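
The product above can be sketched in a few lines of Python. This is only an illustration: the initial and transition probabilities are made-up values, and the sketch assumes the same transition table at every position, whereas the formula allows position-specific tables $P_1, P_2, \ldots$

```python
# First-order Markov chain over nucleotides.
# All probabilities below are hypothetical illustration values,
# not estimates from real sequence data.
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# TRANSITION[prev][cur] = P(cur | prev); every row sums to 1.
TRANSITION = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    "G": {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}


def sequence_probability(seq, initial=INITIAL, transition=TRANSITION):
    """Multiply P(first base) by P(base | previous base) along the sequence."""
    if not seq:
        return 1.0
    prob = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= transition[prev][cur]
    return prob


print(sequence_probability("ACTTCG"))
```

With these assumed tables the factors are 0.25, 0.2, 0.2, 0.4, 0.2 and 0.2, so the result is approximately 0.00016.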

More formally, the states of an HMM cannot be observed directly, but we can infer the hidden state $q_t$ from a random observation $Y_t$.
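
One common way to infer the hidden states from the observations is the Viterbi algorithm, which finds the most probable state path. The sketch below assumes a hypothetical two-state model ("AT-rich" and "GC-rich" regions); the state names and every probability are invented for illustration and do not come from the slides.

```python
import math

# Hypothetical two-state HMM; all numbers are illustration values.
STATES = ("AT-rich", "GC-rich")
START = {"AT-rich": 0.5, "GC-rich": 0.5}
TRANS = {
    "AT-rich": {"AT-rich": 0.7, "GC-rich": 0.3},
    "GC-rich": {"AT-rich": 0.1, "GC-rich": 0.9},
}
EMIT = {
    "AT-rich": {"A": 0.39, "C": 0.10, "G": 0.10, "T": 0.41},
    "GC-rich": {"A": 0.10, "C": 0.41, "G": 0.39, "T": 0.10},
}


def viterbi(observations):
    """Return the most probable hidden-state path for the observed bases."""
    if not observations:
        return []
    # score[s] = best log-probability of any path that ends in state s
    score = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for symbol in observations[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Best predecessor of state s at this position
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointer[s] = best_prev
            new_score[s] = (
                score[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][symbol])
            )
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi("AAAAGGGG"))
```

Working in log space avoids numerical underflow on long sequences, since the raw probabilities shrink with every position.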

What is a DNA sequence? DNA consists of two long interwoven strands that form the famous “double helix”. Each strand is built from a small set of molecules called nucleotides. The length of double-stranded DNA is often expressed in units of base pairs (bp), kilobase pairs (kb), or megabase pairs (Mb), so that a length of 5 × 10^6 bp can be expressed equivalently as 5000 kb or 5 Mb. Collectively, the 46 chromosomes in one human cell consist of approximately 6 × 10^9 bp of DNA (about 3 × 10^9 bp per haploid set of 23 chromosomes).

How to get a DNA sequence: DNA sequencing uses chemical methods to determine the order of the nucleotide bases (adenine, guanine, cytosine, and thymine) in a molecule of DNA. It is used in many fields and applications, such as forensics and biological systems. Why don’t we use the powerful text-searching algorithms and tools to search DNA databases?

DNA can be sequenced by a chemical procedure that breaks a terminally labelled DNA molecule partially at each repetition of a base. DNA sequencing can be done by different methods:
1. Maxam-Gilbert sequencing
2. Chain-termination methods
3. Dye-terminator sequencing
4. Automation and sample preparation
5. Large-scale sequencing strategies
Q2: Name four DNA sequencing methods.

Example: a chain-termination method. A DNA sequencing printout: the sequence is represented by a series of peaks, one for each nucleotide position. In this example, a red peak is an A, blue is a C, orange is a G, and green is a T.

DNA sequence formats:
- Plain sequence format
- EMBL format
- FASTA format
- GCG format
- GCG-RSF (rich sequence format)
- GenBank format
- IG format

FASTA format: FASTA is a standard format in bioinformatics for representing either nucleotide sequences or peptide sequences. It uses single-letter codes and allows sequence names and comments. A FASTA record consists of a single-line description at the beginning, followed by the sequence data on multiple lines. Each line of the sequence should not exceed 80 characters. Sequence identifiers follow a standard defined by NCBI. Q3: What is the FASTA format?
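
As an illustration of that layout, here is a minimal Python sketch of a FASTA reader. It assumes the common convention that the description line starts with ">"; the record names and sequences in the example are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (description, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence data may span several lines; join them later
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record example
EXAMPLE = """>seq1 hypothetical example record
GAATTCGATT
ACCGT
>seq2 another hypothetical record
GATTA
"""

for description, sequence in parse_fasta(EXAMPLE):
    print(description, len(sequence))
```

The sketch ignores any text that appears before the first description line and does no validation of the sequence letters.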

NCBI database: The National Center for Biotechnology Information (NCBI) in the US maintains a huge collection of DNA and protein sequences. Each sequence in the NCBI database is stored in a separate record with a unique identifier called an accession. Example: by going to the NCBI website and searching for the accession NC_001477, we can retrieve the DNA sequence of the Dengue virus, which causes Dengue fever.

NCBI (cont.): The database can be queried either directly from the website or with the R functions choosebank() and query() from the SeqinR package.

Analysis methods: The analysis methods fall into 5 main groups:
- Knowledge-based single sequence analysis
- Pairwise sequence comparison
- Multiple sequence alignment
- Sequence motif discovery in multiple alignments
- Phylogenetic inference
Q4: What are the main methods of DNA sequence analysis?

Analysis methods: alignment. Alignment compares a sequence with sequences that have already been reported and stored in a database. An alignment can be global or local. Local alignments reveal regions that are highly similar, but do not necessarily provide a comparison across the entire two sequences. The global approach compares one whole sequence with other entire sequences.
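
To make the local case concrete, the following Python sketch scores the best local alignment with the Smith-Waterman dynamic program. The algorithm is not named on the slides, and the scoring values (match +2, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    best = 0
    # prev[j] holds the scores of the previous row of the DP table
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment start anywhere, which is
            # what makes the alignment local rather than global
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


# Two otherwise unrelated sequences that share the region GATTA
print(local_alignment_score("CCCCGATTACCCC", "TTGATTATT"))
```

The only difference from a global alignment is the floor at zero and taking the maximum over all cells, so dissimilar flanking regions do not drag the score down.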

Alignment Examples:

Alignment tools: BLAST. The most common local alignment tool is BLAST (Basic Local Alignment Search Tool), developed by Altschul et al. (1990, J Mol Biol 215:403). “BLAST is a set of algorithms that attempt to find a short fragment of a query sequence that aligns perfectly with a fragment of a subject sequence found in a database.” The score of that initial alignment must exceed a neighborhood score threshold (T); the fragment is then used as a seed to extend the alignment in both directions. In other words, the BLAST algorithm first breaks the query into short words of a specific length.
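
The seed-and-extend idea can be sketched in Python as follows. This is a heavily simplified illustration and not the BLAST implementation: it assumes exact word matches instead of a neighborhood score threshold T, and it extends a seed only while letters are identical, without gaps or a score drop-off. The function names and example sequences are hypothetical.

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a query word occurs in the subject."""
    # Index every word of the subject by its start position
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    # Break the query into overlapping words and look each one up
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, q_pos, s_pos, word_size=3):
    """Extend an exact seed in both directions while the letters keep matching."""
    left = 0
    while (q_pos - left > 0 and s_pos - left > 0
           and query[q_pos - left - 1] == subject[s_pos - left - 1]):
        left += 1
    right = word_size
    while (q_pos + right < len(query) and s_pos + right < len(subject)
           and query[q_pos + right] == subject[s_pos + right]):
        right += 1
    # Start in the query, start in the subject, length of the matching segment
    return q_pos - left, s_pos - left, left + right


query, subject = "GATTACA", "TTGATTAGG"
for q_pos, s_pos in find_seeds(query, subject):
    print(extend_seed(query, subject, q_pos, s_pos))
```

In this example all three seeds extend to the same segment, GATTA, which starts at position 0 of the query and position 2 of the subject.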

Joshua Naranjo. Q5: What is the BLAST algorithm? State its steps.

Can R help? Yes. It has many useful packages for processing DNA sequences, and it can be used to access BLAST as well.

Examples: DNA sequence composition. 1. GC fraction: GC content, the percentage of Gs and Cs in a sequence, is one of the fundamental properties of a genome sequence. We can compute it in two ways. The lengthy way is to count the Gs and Cs ourselves and divide by the length of the whole string. The other way is to use the function GC() from the R package SeqinR, which is the option used here.
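
The manual route, counting Gs and Cs and dividing by the sequence length, can be sketched in Python. This is not the SeqinR function; ignoring letters other than A, C, G and T (such as the ambiguity code N) is an assumption of this sketch.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of a DNA sequence."""
    # Keep only A, C, G, T; other symbols (e.g. N) are ignored by assumption
    bases = [base for base in seq.upper() if base in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    gc = sum(1 for base in bases if base in "GC")
    return gc / len(bases)


print(gc_fraction("GAATTC"))
```

For GAATTC the sketch counts 2 of 6 bases as G or C, so the fraction is one third; multiply by 100 for a percentage.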

2. DNA words: This is the same idea as counting the frequency of single nucleotides such as A or G, but with longer words. Words can be 2 nucleotides long, such as “GC” or “CA”, 3 nucleotides long, such as “AAA”, 4 nucleotides long, and so on. An example of 3-nucleotide words is shown below:
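
The slide's own example is not part of this transcript. As a stand-in, here is a short Python sketch that counts overlapping words of a given length; the function name and the example sequence are hypothetical.

```python
from collections import Counter


def count_words(seq, word_size):
    """Count every overlapping word of length word_size in the sequence."""
    seq = seq.upper()
    return Counter(
        seq[i:i + word_size] for i in range(len(seq) - word_size + 1)
    )


# 3-nucleotide words of a short hypothetical sequence
for word, n in sorted(count_words("AAAACGAAA", 3).items()):
    print(word, n)
```

Words are counted with a sliding window that moves one nucleotide at a time, so neighbouring words overlap.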

3. To find the score of the optimal global alignment between the sequences ‘GAATTC’ and ‘GATTA’, we type:
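
The R command itself is not part of this transcript. As a language-neutral illustration, the optimal global score can be computed with the Needleman-Wunsch dynamic program, sketched here in Python. The scoring values (match +2, mismatch -1, linear gap penalty -2) are assumptions; a different scoring scheme gives a different score.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch, linear gap penalty)."""
    # prev[j] = best score for aligning the processed prefix of a with b[:j];
    # the first row aligns the empty prefix of a against gaps only
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # a[:i] aligned against gaps only
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


print(global_alignment_score("GAATTC", "GATTA"))
```

Under these assumed scores the function returns 5 for this pair, which corresponds for example to aligning GAATTC with GA-TTA: four matches, one gap and one mismatch.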

4. Comparing two sequences using a dot plot, for example with the dotPlot() function from SeqinR.
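
A dot plot marks every pair of positions at which the two sequences carry the same letter, so shared regions show up as diagonal runs. The following Python sketch builds a text version of that matrix; it is an illustration only, not the SeqinR function, and it uses no window or filtering.

```python
def dot_matrix(seq_x, seq_y):
    """Rows follow seq_y, columns follow seq_x; a cell is True where letters match."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Draw the dot matrix as text, with '*' for a match and '.' otherwise."""
    lines = ["  " + seq_x]
    for y, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(y + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GAATTC", "GATTA"))
```

On real sequences a plain letter-by-letter plot is noisy, which is why dot plot tools usually compare short windows instead of single letters.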

Is it that easy? No. It is not simply a matter of giving the sequences to R and getting the results. It is an art that requires a degree of skill: the sequences to be compared must be fitted to a form that reflects some shared quality, for example:
- how they look structurally,
- how they evolved from a common ancestor, or
- optimization of a mathematical construct.

What is next? Are we monkeys?
