EST Sequence Cleaning and Quality Control FW4089 Bioinformatics H. Jiang.



Background
Expressed Sequence Tag (EST): small pieces of DNA seq ( bp) sequenced at one or both ends of an expressed gene.
Advantage: a fast and easy way to fish a gene out.
Challenge: seq errors, redundant ESTs, ...
~12,000 ESTs from 8 cDNA libraries:
  Control Apex (CA), Transgenic Apex (TA)
  Control Apex (CL), Transgenic Apex (TL)
  Control Apex (CR), Transgenic Apex (TR)
  Control Apex (CS), Transgenic Apex (TS)

Sequence Cleaning (1)
Base-calling program: Phred
  Input: chromatogram files
  Output: seq file and quality file (scores)
  >phred -id ./passTrace -sd ./passSeq/ -qd ./passQual/
DNA sequence cleaning (quality trimming and vector removal): Lucy
Lucy steps:
  Read input seq#, seq info, and quality info
  Chop off splice sites
  Remove vector insert
  Produce output seq for fragment assembly

Sequence Cleaning (2)
Lucy input: seq file and quality file
  >lucy [parameters] seqFile qualFile [2ndSeqFile]
Lucy major parameters to set up:
  -vector vector_completeSeq splice_site_file
  (splice_site_file: 2 splice-site seq before and after the insertion point on the vector)
  splice.begin (150 bp): M13 Rev. Primer, EcoRI, Linker 1, Linker 4, EcoRI
  splice.end (150 bp): half of Linker 1, Linker 4, EcoRI, M13 For Primer
Lucy output:
  identified locations of good/clean region
  trimmed seq without vector, linker, Ns (<3% Ns), and poly A/Ts
Other check-up (e.g. chloroplast, mitochondrial, ...)
Submitted 11,545 ESTs to GenBank

Any Problem?
An example: the in-house re-sequenced clone at plate position NP9A.A1 should match MTU2CA.P16.F09, which was submitted. However, a BLAST search of the EST database at NCBI did not find any match with our seq.
Problems?
  Half of the new seq is vector contamination
  Seq cleaning problems?
ClustalW multiple seq alignment of the new seq, the Lucy-trimmed seq, and the seq from the seq company.

NP_9A_A1_new_           AGAATACGCCAGCTTGGTACCGAGCTCGGATCCCTAGTAACGGCCGCCAG
Lucy2CA.P16.F09_299bp_
Larkun-trimmed_751bp_                                    CCTNGNACGGCCGC-AG

NP_9A_A1_new_           TGTGCTGGAATTCGCCCTTTCGAGCGGCCGCCCGGGCAGGTCAACGGTTG
Lucy2CA.P16.F09_299bp_                                          GCAACGGTTG
Larkun-trimmed_751bp_   TGTGCTGGAATTCGCCCTTTCGAGCGGCCGCCCGGGCAGGNNAACGGTTG
                                                                  ********

NP_9A_A1_new_           ACTTCAATCCAACCATAATTAAACGGCGATAGATCATAATTTCAGTCAAG
Lucy2CA.P16.F09_299bp_  ACTTCAATCCAACCATAATTAAACGGCGATAGATCATAATTTCAGTCAAG
Larkun-trimmed_751bp_   ACTTCAATCCAACCATAATTAAACGGCGATAGATCATAATTTCAGTCAAG
                        **************************************************

NP_9A_A1_new_           TTCTAAGAACCCATTATCAAATTATTATCCAACAACAACAATAATAATTT
Lucy2CA.P16.F09_299bp_  TTCTAAGAACCCATTATCAAATTATTATCCAACAACAACAATAATAATTT
Larkun-trimmed_751bp_   TTCTAAGAACCCATTATCAAATTATTATCCAACAACAACAATAATAATTT
                        **************************************************

NP_9A_A1_new_           CTCATTCGAAGAGAATCGTAGAATTCATAATCTAAATCGAAAAAAAAACT
Lucy2CA.P16.F09_299bp_  CTCATTCGAAGAGAATCGTAGAATTCATAATCTAAATCGAGGGGGAAACT
Larkun-trimmed_751bp_   CTCATTCGAAGAGAATCGTAGAATTCATAATCTAAATCGNNNGGNNAACT
                        ***************************************       ****

NP_9A_A1_new_           AAAAATCCATCAAATTAAACAAAAACAAACCCGAAGATGGATGAATCAAT
Lucy2CA.P16.F09_299bp_  AAAAATCCATCAAATTAAACAAAAACAAACCCGAAGATGGATGAATCAAT
Larkun-trimmed_751bp_   AAAAATCCATCAAATTAAACAAAAACAAACCCGAAGATGGATGAATCAAT
                        **************************************************

NP_9A_A1_new_           TGGTCTTGGGATCAGGAAGGGGGAGAAAACCAGNACCAGGAGCAGAACCA
Lucy2CA.P16.F09_299bp_  TGGTCTTGGGATCAGGAAGGGGGAGAAAACCAGCACCAGGAGCAGAACCA
Larkun-trimmed_751bp_   TGGTCTTGGGATCAGGAAGGGGGAGAAAACCAGCACCAGGAGCAGAACCA
                        ********************************* ****************

NP_9A_A1_new_           CGAGGCGGAGAAAACCCAAGAAACTTCTGCAAATTAGTGCCTCGGCCGCG
Lucy2CA.P16.F09_299bp_  CGAGGCGGAGAAAACCCAAGAAACTTCTGCAAATTAGTG
Larkun-trimmed_751bp_   CGAGGCGGAGAAAACCCAAGAAACTTCTGCAAATTAGTGCCTCGGCCGCG
                        ***************************************

NP_9A_A1_new_           ACCACGCTAAGGGCGAATTCTGCAGATATCCATCACACTGGCGGCCGCTC
Lucy2CA.P16.F09_299bp_
Larkun-trimmed_751bp_   ACCACGCTAAGGGCGAATTCTGCAGATATCCATCACACTGGCGGCCGCTC

Annotations marked on the slide: linker seq, seq difference, EcoRI.

Find Questionable Sequences
Strategy:
  Find and compare sequences of the 'clean region' defined by Lucy (coordinates are in debug files produced by Lucy)
  Find those seq that have low quality scores and differ between Phred base calls and Lark calls
Criteria: minimum length >100, >10 continuous low quality scores (<16)
Compare differences between Lark and Lucy seq
Manually check some chromatogram files

Program 1
From the debug file, get a coordinate file: where the good/clean region starts and ends.
Program 1 (a C program):
Usage: ./clean coordfile qualfile threshold regionlength [outputfile]
  1. coordfile: ESTID+CLR
  2. qualfile: quality file
  3. threshold: 16 (Lucy default)
  4. regionlength: 10 (10 continuous low scores)
  5. optional output filename
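The slides do not show the source of Program 1, so the following is only a minimal sketch of the check it describes: scanning the quality scores inside the clean range for a run of consecutive low scores. The function name `has_low_quality_run` and its signature are hypothetical, and treating a score as low when it is `<= threshold` is an assumption taken from the Flex action in Program 2.

```c
#include <stddef.h>

/* Hypothetical helper, not the original ./clean program.
 * Returns 1 if scores[start..end] (0-based, inclusive) contains at least
 * run_len consecutive scores <= threshold, otherwise 0. */
int has_low_quality_run(const int *scores, size_t n,
                        size_t start, size_t end,
                        int threshold, size_t run_len)
{
    size_t run = 0;   /* length of the current run of low scores */

    if (n == 0 || run_len == 0 || start > end)
        return 0;
    if (end >= n)     /* clamp the clean range to the available scores */
        end = n - 1;

    for (size_t i = start; i <= end; ++i) {
        if (scores[i] <= threshold) {
            if (++run >= run_len)
                return 1;
        } else {
            run = 0;  /* a good score breaks the run */
        }
    }
    return 0;
}
```

With the settings listed above, a call would look like `has_low_quality_run(scores, n, clr_start, clr_end, 16, 10)`, where `clr_start` and `clr_end` stand for the clean-range coordinates read from the coordinate file.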

Program 2
Program 2 (a unix-based Flex program), partial code:
1. Define rules (patterns), e.g.:
     CHAR [A-Z]; INT -?[0-9]; MTUID; BAD {INT}*
2. Action:
     {BAD} {
         if ( ? <= 16 ) {
             ++count;
         }
         if (count >= 10 && inbadregion == 0) {
             strncp (?)?
             printf("\n%s ", yytext);
         } else {
             count = 0;
             inbadregion = 1;
         }
     }
3. main()
4. Compile and produce an executable file.

Result (1)
With the C and Flex programs, found ~2400 seq with 10 consecutive low scores.
Lucy claims that the average probability of error (score<=16) over the final clean range is 2.5%.
Next: compare base calls between Lucy-trimmed and Lark sequences, by 2 methods: GAP and local BLAST.
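The link between a score of 16 and the 2.5% figure follows from the standard Phred definition, where Q is the quality score and P the estimated probability that the base call is wrong. The worked value is a check added here, not a number taken from the slides:

```latex
P = 10^{-Q/10}, \qquad Q = 16 \;\Rightarrow\; P = 10^{-1.6} \approx 0.025 \;\; (2.5\%)
```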

GAP of 2400 seq pairs (1)
Convert file format: gcg>fromfasta bigfastaFile of Lark and Lucy
Run a shell script: gap Lucyseq1 Larkseq1 (all 2400 pairs)
Output: *.pair file (example below)
  GAP of: mtu2ta.p13.h06.seq check: 4559 from: 1 to: 358
  MTU2TA.P13.H06
  to: origmtu2ta.p13.h06.seq check: 438 from: 1 to: 358
  origMTU2TA.P13.H06
  Gap Weight: 50 Average Match:
  Length Weight: 3 Average Mismatch:
  Quality: 3248 Length: 369
  Ratio: Gaps: 4
  Percent Similarity: Percent Identity:

GAP of 2400 seq pairs (2)
Perl program to parse *.pair output.
Use regular expressions to find key words and extract the corresponding values:
  if ($line =~ /Ratio: /) {
      @crx = split(":", $line);
      $crx[1] =~ s/^\s*//; $crx[1] =~ s/\s*$//;
      $crx[3] =~ s/^\s*//; $crx[3] =~ s/\s*$//;
      $ratio = $crx[1];
      $gaps = $crx[3];
  }
Criteria: min length >100, gap>=1, identity <95%.
Found: total 188 seq different.
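The keyword-then-value extraction can also be sketched in C, the language of Program 1. This is an illustration rather than the parser used in the talk: `value_after` is a hypothetical helper, and the sample line in the usage note uses an invented ratio because the real value did not survive in the transcript.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find `key` (e.g. "Ratio:") in `line` and read the
 * number that follows it. Returns 1 and stores the number in *value on
 * success; returns 0 if the key is absent or no number follows. */
int value_after(const char *line, const char *key, double *value)
{
    const char *p = strstr(line, key);
    if (p == NULL)
        return 0;
    /* %lf skips the whitespace between the key and the number */
    return sscanf(p + strlen(key), "%lf", value) == 1;
}
```

For a line such as `Ratio:  9.072   Gaps:  4` (invented numbers), calling `value_after(line, "Ratio:", &ratio)` and `value_after(line, "Gaps:", &gaps)` would fill in both values without relying on field positions after a split.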

Local Blast of 2400 seq pairs (1)
Format Lark orig seq to database files:
  >formatdb -i orig -p F -o T
Blast Lucy-trimmed seq (one bigfastaFile) against orig:
  >blastall -i lucy -d orig -p blastn -e 0.05 -m 9 -b 1 >blastout (output redirection)
Blast output:
  # BLASTN [Aug ]
  # Query: MTU2CA.P10.A07
  # Database: orig
  # Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
  MTU2CA.P10.A07 MTU2CA.P10.A
  MTU2CA.P10.A07 MTU2CA.P4.E
  MTU2CA.P10.A07 MTU2CA.P4.A

Local Blast of 2400 seq pairs (2)
Cut-off criteria: pick the top hit (same IDs), select those with identity <95%, length >100, and gap>=1
Parse the output with a Perl program
Found a total of 158 sequences.
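The talk parsed the tabular output with a Perl program that is not shown. As a rough sketch of the same filtering step in C, the code below reads the first six fields of one `-m 9` line and applies the cut-offs. The struct, the function names, and passing the thresholds as parameters are all assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* First six columns of a BLAST -m 9 tabular line (hypothetical struct). */
typedef struct {
    char   query[64];
    char   subject[64];
    double identity;      /* % identity */
    int    length;        /* alignment length */
    int    mismatches;
    int    gap_openings;
} BlastHit;

/* Parse one output line. Returns 1 on success, 0 for comment lines
 * (starting with '#') or lines with fewer than six readable fields. */
int parse_blast_line(const char *line, BlastHit *hit)
{
    if (line == NULL || line[0] == '#')
        return 0;
    return sscanf(line, "%63s %63s %lf %d %d %d",
                  hit->query, hit->subject, &hit->identity,
                  &hit->length, &hit->mismatches, &hit->gap_openings) == 6;
}

/* Apply the cut-offs: the hit must be against the same ID, with identity
 * below max_identity, length above min_length, and at least min_gaps
 * gap openings. */
int is_questionable(const BlastHit *hit, double max_identity,
                    int min_length, int min_gaps)
{
    return strcmp(hit->query, hit->subject) == 0
        && hit->identity < max_identity
        && hit->length > min_length
        && hit->gap_openings >= min_gaps;
}
```

A caller would read `blastout` line by line, skip lines where `parse_blast_line` returns 0, and keep the IDs for which `is_questionable(&hit, 95.0, 100, 1)` is true.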

Compare GAP and Blast Results
Compare results: Blast is quick but less accurate (e.g. it opened a gap, for an unknown reason, in a perfectly matched seq pair).
Find common seq IDs: a total of 144 seq (1.2% of the 11,545 submitted seq).
Manual check of their chromatograms
(To be continued...)
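Finding the IDs flagged by both methods is a set intersection. The slides do not say how it was done, so this is one possible sketch: sort both ID lists, then walk them in step. `common_ids` is a hypothetical name, and the code assumes each list holds unique IDs.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for arrays of C strings */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sorts both ID arrays in place, then stores up to out_cap IDs present in
 * both arrays into out (in sorted order). Returns the number of common IDs
 * found, which can exceed out_cap if out is too small. */
size_t common_ids(const char **a, size_t na,
                  const char **b, size_t nb,
                  const char **out, size_t out_cap)
{
    size_t i = 0, j = 0, k = 0;

    qsort(a, na, sizeof *a, cmp_str);
    qsort(b, nb, sizeof *b, cmp_str);

    while (i < na && j < nb) {
        int c = strcmp(a[i], b[j]);
        if (c < 0) {
            ++i;
        } else if (c > 0) {
            ++j;
        } else {              /* same ID in both lists */
            if (k < out_cap)
                out[k] = a[i];
            ++k;
            ++i;
            ++j;
        }
    }
    return k;
}
```

Applied to the 188 GAP IDs and the 158 Blast IDs, the return value would be the size of the shared set that goes on to manual chromatogram checks.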

A chromatogram example NNNGGNN