Applied Bioinformatics Week 5. Topics Cleaning of Nucleotide Sequences Assembly of Nucleotide Reads.


1 Applied Bioinformatics Week 5

2 Topics Cleaning of Nucleotide Sequences Assembly of Nucleotide Reads

3 Theoretical Part I DNA sequencing Next generation sequencing Cleaning nucleotide sequences

4 DNA Sequencing Sanger Method –Please explain Other methods –Too many to discuss –http://en.wikipedia.org/wiki/DNA_sequencing

5 Shotgun Sequencing Many short (~700 nt) sequences Human genome sequencing project –Finished? How can you make sense of these sequences? Contrast: –Genome walking

6 Next Generation Sequencing Increases the throughput of sequencing –More sequence per time –Not more sequence per read (still around 500) Many commercial platforms available –454 pyrosequencing –Illumina (Solexa) sequencing –... Price is dropping –Whole genomes in a day –http://www.1000genomes.org/

7 454 Pyrosequencing http://genepool.bio.ed.ac.uk/

8 Illumina sequencing http://seqanswers.com/forums/showthread.php?t=21

9 Where is your DNA from? Did you just clone and sequence it? Did you send a sample to a company? Did you find the sequence in a database? Better make sure it is correct and clean

10 Vector Contaminations Long DNA pieces are fragmented and cloned into vectors before sequencing. This usually causes some amount of vector to be sequenced along with the insert. image: Wikipedia

11 Adapter Contaminations Long DNA pieces are fragmented and adapter sequences are ligated to both ends of the fragments before sequencing. This causes adapters to be sequenced along with the desired sequence.
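Adapter removal at the 3' end of a read can be sketched as below. This is a minimal illustration using exact string matching only; the function name `trim_adapter`, the `min_overlap` parameter and the adapter string in the example are assumptions made for this sketch, and the tools discussed in these slides tolerate sequencing errors, which this sketch does not.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter, or a partial adapter at the read end (exact match only)."""
    # Case 1: the complete adapter occurs in the read; cut at its first occurrence.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends inside the adapter; look for the longest adapter
    # prefix (at least min_overlap long) that is also a suffix of the read.
    for k in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter detected: return the read unchanged.
    return read


# Illustrative adapter string (an assumption for this example).
ADAPTER = "AGATCGGAAGAGC"
full = trim_adapter("ACGTACGT" + ADAPTER + "TT", ADAPTER)
partial = trim_adapter("ACGTACGT" + ADAPTER[:5], ADAPTER)
clean = trim_adapter("ACGTACGT", ADAPTER)
```

All three calls return the insert `ACGTACGT`. A small `min_overlap` trims more aggressively but also removes read ends that match the adapter start by chance.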

12 Contaminations Cause Misassembly One important outcome of not removing contaminations from genomic sequences is that they cause misassembly of sequences

13 Cleaning Contaminations Several approaches and tools to clean vector contaminations from genomic sequences have been developed. Most of them rely on a reference vector library, including: –LUCY, LUCY2 –SeqTrim –DeconSeq –TagCleaner –cross_match –SeqClean –VecScreen

14 Problem Definition A vector is a circular DNA sequence. Once vectors are linearized in reference libraries, vector contaminations around the linearization point can no longer be detected and cleaned by currently available tools.
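The linearization problem can be reproduced with a toy example. The sketch below uses a made-up 12 nt "vector" and plain substring search as a stand-in for the alignment step of a real screening tool; the function names are hypothetical.

```python
def found_in_linear(vector: str, fragment: str) -> bool:
    """Search the fragment in the vector as stored in a linear library."""
    return fragment in vector


def found_in_circular(vector: str, fragment: str) -> bool:
    """Search the fragment in the circular vector.

    Appending the first len(fragment) - 1 bases to the end makes every
    window that wraps around the linearization point visible.
    """
    return fragment in vector + vector[: len(fragment) - 1]


# Toy circular vector, written out starting at an arbitrary linearization point.
vector = "AAACCCGGGTTT"
# A contaminating fragment that spans the linearization point.
fragment = vector[-4:] + vector[:4]  # "GTTTAAAC"

linear_hit = found_in_linear(vector, fragment)
circular_hit = found_in_circular(vector, fragment)
```

Here `linear_hit` is `False` and `circular_hit` is `True`: the same contamination is missed or found depending only on where the library entry happens to be cut.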

15 UniVec A vector library by NCBI Problems: –Has complete sequences for only 8 vectors, although full-length sequences are available in public databases for the rest as well. –Only these 8 vectors are appended to themselves by 49 nt to overcome the circularization problem. –Some vectors are divided into partitions, for no apparent reason. –Some adapter sequences are appended to themselves as well, whereas some are not.

16 Previous Solution Not designed for entire libraries Proposes cutting the first 60 nucleotides from the start of a vector sequence and pasting them to the end using a simple text editor An implementation is no longer available Y.-A. Chen, C.-C. Lin, C.-D. Wang, H.-B. Wu, and P.-I. Hwang, “An optimized procedure greatly improves EST vector contamination removal,” 2007.

17 Our Solution Appending all vector sequences in a reference library (or a subset filtered by the user) either to themselves or to their first n nucleotides (n chosen by the user) As customizable as possible, but still efficient with a single click Has a GUI for target users

18 Our Solution Possible Customisations –Cleaning already introduced appendices in the library –Filtering the sequences by a keyword in their definition lines and/or by length –Virtual Circularization Appending sequences to themselves by first n nucleotides
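The filtering and virtual circularization steps could look roughly like the sketch below. It is not the tool described on these slides: the function names, parameters and the two-record toy library are assumptions, and removing appendices that are already present in a library is not covered.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (definition line, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def circularize_library(records, n=None, keyword=None, min_length=0):
    """Filter records, then append the first n bases of each sequence to its end.

    n=None appends the whole sequence to itself.
    keyword filters on the definition line (case-insensitive).
    min_length drops short entries such as adapters.
    """
    result = []
    for header, seq in records:
        if keyword is not None and keyword.lower() not in header.lower():
            continue
        if len(seq) < min_length:
            continue
        k = len(seq) if n is None else min(n, len(seq))
        result.append((header, seq + seq[:k]))
    return result


# Toy library (made up for illustration).
library = ">vecA cloning vector\nAAACCC\nGGGTTT\n>adapter1\nACGT\n"
records = parse_fasta(library)
by_keyword = circularize_library(records, n=4, keyword="vector")
by_length = circularize_library(records, min_length=5)
```

Choosing n at least as long as the longest expected contaminating fragment minus one guarantees that every window across the linearization point is present in the appended sequence.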

19 Efficiency of Our Method
Datasets: –Every 600th EST –P. somniferum EST –Artificial Data
Vector Libraries: –rawUV –cleanUV –appUV
The Percentage of Sequences Cleaned (rawUV / cleanUV / appUV)
–Every 600th EST: 31.00 / 30.94 / 31.79
–P. somniferum EST: 17.26, 18.03 (two values only)
–Artificial Data: 87.50 / 75.00 / 100.00
The Percentage of Nucleotides Cleaned (rawUV / cleanUV / appUV)
–Every 600th EST: 2.86 / 2.85 / 2.90
–P. somniferum EST: 0.45, 0.47 (two values only)
–Artificial Data: 15.35, 19.93 (two values only)

20 Theoretical Part I Mind Mapping Break 10 min

21 Practical Part I

22 Screening for Vector seqs www.ncbi.nlm.nih.gov/VecScreen Get the U87251 sequence (FASTA) –What is this number? –Enter the sequence and run the analysis What do you see as a result? –Would you continue with the experiment? –Would you discard the sequence?

23 Sequencing Since we cannot do any sequencing here, we have to prepare a simulation 1. Select a nucleotide sequence of about 15000 bases 2. Copy and paste that sequence into Word a. 3 times b. Separated by empty lines

24 Sequencing 3. Arbitrarily add line breaks into the resulting document a. At least 30 (10 per copy min) b. Spread out throughout the sequence 4. Add a FASTA definition line after each line break –Use >Copy-N-Fragment-X as a template for the definition line Ensure that the overall number of characters is less than 50000
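The manual simulation above can also be scripted. The sketch below is one possible way to do it, with a hypothetical function name and parameters; it cuts each copy at random positions and writes a definition line before every fragment, including the first one of each copy.

```python
import random


def simulate_reads(sequence: str, copies: int = 3, breaks_per_copy: int = 10,
                   seed: int = 0) -> str:
    """Cut several copies of a sequence at random positions and return FASTA text."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    lines = []
    for copy_no in range(1, copies + 1):
        # Distinct cut positions strictly inside the sequence, so no fragment is empty.
        cuts = sorted(rng.sample(range(1, len(sequence)), breaks_per_copy))
        bounds = [0] + cuts + [len(sequence)]
        for frag_no, (start, end) in enumerate(zip(bounds, bounds[1:]), start=1):
            lines.append(f">Copy-{copy_no}-Fragment-{frag_no}")
            lines.append(sequence[start:end])
    return "\n".join(lines) + "\n"


# Short stand-in sequence; in the exercise this would be the ~15000-base sequence.
template = "ACGT" * 50
fasta = simulate_reads(template)
```

With 3 copies and 10 breaks per copy this yields 33 fragments. For a 15000-base template the output stays below the 50000-character limit, since the definition lines add only a few hundred characters.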

25 Practical Part I 10 min break

26 Theoretical Part II Sequence Assembly

27 Assembling Sequences Shotgun sequencing –Sequence fragments –Find overlapping fragments –Build contiguous sequences (contig) –Assemble into whole genomes Genetic and physical maps –Help orient fragments and contigs Problems with repetitive sequences
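The steps "find overlapping fragments" and "build contigs" can be sketched as a greedy merge. This is a teaching sketch under strong assumptions (error-free reads, same strand, exact overlaps), with hypothetical function names; it is not how CAP3 or other assemblers are implemented.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Toy genome (made up) and three overlapping fragments of it.
genome = "ATGCGTACGTTAGCCGATAGGCTA"
reads = [genome[0:10], genome[6:16], genome[12:24]]
contigs = greedy_assemble(reads)
```

For this toy input the three fragments collapse into a single contig equal to the original sequence. With repetitive sequences the longest overlap is not necessarily the correct one, which is why repeats cause misassemblies.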

28 Sequence Tagged Sites Physical map Up to 200 bp long Unique for a region of the genome STS reference map –Map to assemble BAC/ PAC clones –Repeat process to map contigs to clones

29 Sequence Tagged Site Chromosome Sequence Tagged Site Endonuclease Site The restriction enzyme should digest the DNA into approximately 200 kb long fragments

30 Fragments with STS If it fits into a plasmid (Up to 10 kb) Up to 700 kb! Shortest Chromosome (21) 47 Mb -> 250 BACs
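The clone counts can be checked with simple arithmetic. The sketch below assumes a non-overlapping minimal tiling and uses the fragment sizes given on the slides (about 200 kb per BAC insert, up to 10 kb per plasmid insert); the function name is hypothetical.

```python
import math


def clones_needed(target_bp: int, insert_bp: int) -> int:
    """Minimum number of non-overlapping inserts needed to tile a target."""
    return math.ceil(target_bp / insert_bp)


bacs_for_chr21 = clones_needed(47_000_000, 200_000)   # 47 Mb in ~200 kb pieces
plasmids_per_bac = clones_needed(200_000, 10_000)     # 200 kb in 10 kb pieces
```

This gives 235 BACs and 20 plasmids per BAC. The slide's round figure of 250 BACs is of the same order; real clone libraries need more clones than the minimum because neighbouring inserts must overlap to be ordered.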

31 1 BAC -> 10 – 50 Plasmids / Cosmids Plasmid / Cosmid

32 Primer Polymerase Chain Reaction will lead predominantly to: Use several nucleases EcoRI BamHI HindIII Target ~ 1000 nucleotides

33 Restriction Sequence with degenerate primers? or subclone and sequence Clone01: ACCGACTACGATCGCACTCAGCATCGCGATCCGATACGTAGCTAGCTAGCT Clone02: TGTGTAGCTAGCTGCGGCGCTAGGATAGGCATCTAGCTATCGGACTCTGTG... Clone20: GTAGTACGTGCTAGCTACGTACGTACGATCGTACGTAGTACCGACTACGAT... Sequencing

34 Overlap found between the end of Clone20 and the start of Clone01:
Clone20 GTAGTACGTGCTAGCTACGTACGTACGATCGTACGTAGTACCGACTACGAT
                                               ||||||||||||
Clone01                                        ACCGACTACGATCGCACTCAGCATCGCGATCCGATACGTAGCTAGCTAGCT
Input sequences (FASTA):
>Clone01
ACCGACTACGATCGCACTCAGCATCGCGA
TCCGATACGTAGCTAGCTAGCT
>Clone02
TGTGTAGCTAGCTGCGGCGCTAGGATAGG
CATCTAGCTATCGGACTCTGTG...
>Clone20
GTAGTACGTGCTAGCTACGTACGTACGAT
CGTACGTAGTACCGACTACGAT...
Smith-Waterman or more specialized Alg. all vs all
[Score matrix: clones 01, 02, 03, .., 20, .., 27 compared all vs all; the matrix is symmetric, so only the lower triangle is filled in] Check here as well
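An all-vs-all comparison can be illustrated with the two clone sequences as printed on the slide. The sketch below replaces Smith-Waterman with exact suffix-prefix matching, which is a deliberate simplification; the function names and the `min_len` threshold are assumptions.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_matrix(reads: dict, min_len: int = 3) -> dict:
    """Score every ordered pair (x, y): does the end of x run into the start of y?"""
    return {(x, y): overlap(reads[x], reads[y], min_len)
            for x in reads for y in reads if x != y}


# Clone sequences as printed on the slide (the visible 51 bases of each).
clones = {
    "Clone01": "ACCGACTACGATCGCACTCAGCATCGCGATCCGATACGTAGCTAGCTAGCT",
    "Clone20": "GTAGTACGTGCTAGCTACGTACGTACGATCGTACGTAGTACCGACTACGAT",
}
scores = overlap_matrix(clones)
```

`scores[("Clone20", "Clone01")]` is 12, matching the twelve aligned positions drawn on the slide, while the reverse direction scores 0. The number of pairs grows quadratically with the number of clones, which is one reason assembly is computationally expensive.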

35 [Figure: overlapping clone sequences aligned to each other and placed along the chromosome; not proportional] For each plasmid, the BAC and therefore the position on the chromosome is known. Sequencing all plasmids will give the complete sequence of the genome. Caution: highly simplified! Why? What does coverage mean?
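As a starting point for the coverage question, coverage is commonly described as the average number of reads that span a given position, i.e. total sequenced bases divided by target length. The sketch below uses hypothetical function names and toy numbers.

```python
def mean_coverage(read_lengths, genome_length: int) -> float:
    """Average coverage: total number of sequenced bases / length of the target."""
    return sum(read_lengths) / genome_length


def depth_profile(intervals, genome_length: int) -> list:
    """Per-position depth for reads given as (start, end) with end exclusive."""
    depth = [0] * genome_length
    for start, end in intervals:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


# Three full copies of a 15000-base sequence, as in the practical simulation.
simulated = mean_coverage([15000] * 3, 15000)
# Two toy reads on a 6-base target, overlapping in the middle.
profile = depth_profile([(0, 4), (2, 6)], 6)
```

The simulated data set has 3-fold coverage. The toy profile `[1, 1, 2, 2, 1, 1]` shows that average coverage hides local variation: real projects aim for high average coverage so that few positions end up with none.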

36 Assembling Software As you just saw, assembling sequences is computationally expensive. Therefore most assembly software is not offered as an online service, but it is often freely available for download.

37 Theoretical Part II Mind mapping 10 min break

38 Practical Part II

39 Restriction Maps You sent a sample for sequencing. You might want to check if the sequence makes sense What is a restriction map? www.restrictionmapper.org
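A restriction map lists where restriction enzymes cut a sequence. The sketch below locates recognition sites for the three enzymes named earlier in these slides (EcoRI, BamHI, HindIII) on a made-up test sequence; it reports site positions only and does not model cut offsets, the reverse strand or circular molecules.

```python
# Recognition sequences (5'->3') of the three enzymes named in the slides.
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_sites(seq: str, site: str) -> list:
    """Return 1-based start positions of every occurrence of a recognition site."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + 1)
        start = seq.find(site, start + 1)
    return positions


def restriction_map(seq: str) -> dict:
    """Map each enzyme name to the positions of its recognition sites."""
    return {name: find_sites(seq, site) for name, site in SITES.items()}


# Made-up test sequence with two EcoRI sites and one BamHI site.
sequence = "AAGAATTCTTGGATCCAAGAATTCAA"
site_map = restriction_map(sequence)
```

For this sequence the map is `{"EcoRI": [3, 19], "BamHI": [11], "HindIII": []}`. Comparing such a predicted map with the fragment pattern observed on a gel is one way to check whether a returned sequence makes sense.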

40 CAP3 Assembly Go to: http://pbil.univ-lyon1.fr/cap3.php Use the sequences you prepared earlier and assemble them with CAP3 Analyze the results –Did you get a full correct assembly?

