Automated sequencing machines,


1 Automated sequencing machines,
particularly those made by PE Applied Biosystems, use 4 colors, so they can read all 4 bases at once.

2 All the Genes?
Any human gene can now be found in the genome by similarity searching with over 95% certainty. However:
-- the sequence still has many gaps
-- an uninterrupted genomic segment is unlikely to be found for any given gene
-- pseudogenes still cannot be identified with certainty
This will improve as more sequence data accumulate.

3 Finding Genes in Genome Sequence is Not Easy
About 2% of human DNA encodes functional genes. Genes are interspersed among long stretches of non-coding DNA. Repeats, pseudogenes, and introns confound matters.

4 Impact on Bioinformatics
Genomics produces high-throughput, high-quality data, and bioinformatics provides the analysis and interpretation of these massive data sets. It is impossible to separate genomics laboratory technologies from the computational tools required for data analysis.

5 Six basic questions about genomes
[1] how is a genome sequenced? [2] when is the project finished? [3] sequence one individual or many? [4] what information is in the DNA? [5] how many genes are in the genome? [6] how can whole genomes be compared?

6 [1] Genome projects: sequencing strategies
Hierarchical shotgun method: assemble contigs from various chromosomes, then sequence and assemble them.
Contig: a set of overlapping clones or sequences from which a sequence can be obtained. The sequence may be draft or finished. A contig is thus a chromosome map showing the locations of those regions of a chromosome where contiguous DNA segments overlap. Contig maps are important because they make it possible to study a complete, and often large, segment of the genome by examining a series of overlapping clones that together provide an unbroken succession of information about that region.
Scaffold: an ordered set of contigs placed on a chromosome.
Shotgun: an approach used to decode an organism's genome by shredding it into smaller fragments of DNA that can be sequenced individually. The sequences of these fragments are then ordered, based on overlaps between them, and finally reassembled into the complete sequence. The 'whole genome shotgun' method is applied to the entire genome all at once, while the 'hierarchical shotgun' method is applied to large, overlapping DNA fragments of known location in the genome.

7 3. Whole Genome Shotgun Sequencing
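The genome is cut many times at random, and the fragments are cloned into plasmids (2-10 kbp inserts) or cosmids (40 kbp inserts). Both ends of each insert are sequenced (~500 bp per read), giving forward-reverse linked reads that lie a known distance apart.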

8 ARACHNE: Whole Genome Shotgun Assembly
1. Find overlapping reads
2. Merge good pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence, e.g. ..ACGATTACAATAGGTT..
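Steps 1 and 2 above can be illustrated with a toy greedy assembler. This is only a sketch under strong assumptions (error-free reads, a single strand, no paired-end links, no repeats); it is not ARACHNE's actual algorithm, and the function names `overlap` and `greedy_assemble` are made up for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        # Step 1: find the best-overlapping ordered pair.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        # Step 2: merge that pair into a longer contig.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


reads = ["CAATAGGTT", "ACGATTAC", "TTACAATA"]
print(greedy_assemble(reads))
```

Real assemblers also have to handle sequencing errors, reverse-complemented reads and repeats, which is where the linked forward-reverse reads become important.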

9 [2] When is the project finished?
Get five- to ten-fold coverage.
Finished sequence: a clone insert is contiguously sequenced to a high quality standard, with an error rate of 0.01% (1 error per 10,000 bases). There are usually no gaps in the sequence.
Draft sequence: clone sequences may contain several regions separated by gaps. The true order and orientation of the pieces may not be known.
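As a rough illustration of what "five- to ten-fold coverage" means, coverage can be computed as (number of reads × read length) / genome length. The estimate of the uncovered fraction below uses the Poisson (Lander-Waterman) approximation, which is an assumption added here and is not stated on the slide; the function names are hypothetical.

```python
import math


def fold_coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of times each base is sequenced."""
    return n_reads * read_len / genome_len


def reads_needed(target_coverage: float, read_len: int, genome_len: int) -> int:
    """Number of reads required to reach a target fold coverage."""
    return math.ceil(target_coverage * genome_len / read_len)


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of bases hit by no read (Poisson approximation)."""
    return math.exp(-coverage)


# Example: ~500 bp reads on a 1 Mbp target.
for c in (5, 10):
    n = reads_needed(c, 500, 1_000_000)
    print(f"{c}x coverage: {n} reads, "
          f"~{fraction_uncovered(c):.5%} of bases expected uncovered")
```

Under this approximation, moving from five-fold to ten-fold coverage sharply reduces the expected number of gaps, which is one reason draft and finished sequence differ so much in effort.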

10

11 Repetitive DNA sequences: five classes
[1] Interspersed repeats: transposon-derived repeats -- 45% of human genome; LTR, SINE, LINE
[2] Processed pseudogenes
[3] Simple sequence repeats -- micro- and minisatellites -- ACAAACT, 11 million times in a Drosophila genome -- Human genome has 50,000 CA dinucleotide repeats
[4] Segmental duplications (about 5% of human genome)
[5] Tandem repeats (e.g. telomeres, centromeres)

12 LINE and SINE repeats. A LINE (long interspersed nuclear element) encodes a reverse transcriptase (RT) and perhaps other proteins. Mammalian genomes contain an old LINE family, called LINE2, which apparently stopped transposing before the mammalian radiation, and a younger family, called L1 or LINE1, many of which were inserted after the mammalian radiation (and are still being inserted). A SINE (short interspersed nuclear element) generally moves using RT from a LINE. Examples include the MIR elements, which co-evolved with the LINE2 elements. Since the mammalian radiation, each lineage has evolved its own SINE family. Primates have Alu elements and mice have B1, B2, etc. The process of insertion of a LINE or SINE into the genome causes a short sequence (7-21 bp for Alus) to be repeated, with one copy (in the same orientation) at each end of the inserted sequence. Alus have accumulated preferentially in GC-rich regions, L1s in GC-poor regions.

13 What is the function of nongenic DNA?
Hypotheses:
[1] Nongenic DNA performs essential functions, such as regulation of gene expression.
[2] Nongenic DNA is inert, genetically and physiologically. Excess DNA is incidental and is called “junk DNA.”
[3] Nongenic DNA is a functional parasite or selfish DNA (retrotransposons).
[4] Nongenic DNA has a structural function.

14 Classification of DNA
FUNCTIONAL (sequences that perform a function)
- Coding (translated into proteins)
- Non-coding (not translated)
* Transcribed (functions at the RNA level: ribosomal subunits)
* Not transcribed (functions at the DNA level: intron, promoter, enhancer, etc.)
NON-FUNCTIONAL (sequences that perform no function: “junk DNA”)

15 Gene-finding algorithms
Homology-based searches (“extrinsic”): rely on previously identified genes.
Algorithm-based searches (“intrinsic”): investigate nucleotide composition, open reading frames, and other intrinsic properties of genomic DNA.

16 [Figure labels: DNA, intron, RNA, mature RNA, protein]

17 Homology-based searching: compare DNA to expressed genes (ESTs)
[Figure labels: DNA, intron, RNA, RNA, protein]

18 Algorithm-based searching: compare DNA in exons (unique codon usage) to introns (unique splice sites) to noncoding DNA. Identify open reading frames (ORFs).
[Figure labels: DNA, RNA]
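The ORF-identification step can be sketched as a scan of all six reading frames for an ATG followed in-frame by a stop codon. This is a minimal illustration only: it assumes the standard stop codons, ignores introns and splice sites, and the name `find_orfs` and the `min_codons` threshold are choices made for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3):
    """Return (strand, frame, start, end, orf) for each ATG...stop ORF.

    Coordinates are 0-based, end-exclusive, and refer to the strand that
    was scanned (the reverse complement for strand "-").
    min_codons counts the codons before the stop codon.
    """
    results = []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        results.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return results


for orf in find_orfs("CCATGAAATTTGGGTAACC"):
    print(orf)
```

In practice an ORF scan like this is only a first filter: in organisms with introns, real genes span several exons, which is why gene finders combine ORFs with codon usage and splice-site signals.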

19

20

21 [6] how can whole genomes be compared?
-- molecular phylogeny -- You can BLAST (or PSI-BLAST) all the DNA and/or protein in one genome against another -- We looked at TaxPlot and COG for bacterial (and for some eukaryotic) genomes

22 Orthologue & Paralogue
Orthologue: homologous genes in different organisms that arose by speciation, and that typically retain the same function. Paralogue: homologous genes in the same organism that arose by gene duplication.
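One common heuristic for proposing orthologue pairs from an all-against-all BLAST of two genomes is the reciprocal best hit. The slides do not name this method, so the sketch below is an assumption about how the definitions could be applied; the gene names and scores are invented for illustration.

```python
def best_hits(scores: dict[tuple[str, str], float]) -> dict[str, str]:
    """Map each query gene to its highest-scoring subject gene."""
    best: dict[str, tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical bit scores: species A genes vs species B genes, and back.
a_vs_b = {("A1", "B1"): 200, ("A1", "B2"): 150,
          ("A2", "B1"): 180, ("A2", "B2"): 90}
b_vs_a = {("B1", "A1"): 200, ("B1", "A2"): 180,
          ("B2", "A1"): 150, ("B2", "A2"): 90}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here A2 hits B1 best, but B1 prefers A1, so only the A1/B1 pair is reported as a candidate orthologue pair; genes left unpaired in this way are worth checking as possible paralogues.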

23 Orthologue & Paralogue
[Figure labels: Species 1, Species 2, Gene A, Gene B, Gene A, Gene B, diverge]

24 Orthologue & Paralogue
[Figure labels: Species 1, Gene A, Gene B, Species 2]

25 Orthologue & Paralogue
[Figure labels: Species 1, Gene A, Species 2, Gene A, Gene B]

26 Orthologue & Paralogue
[Figure labels: Species 1, Species 2, Gene A, Gene B]

27 Comparative Genomics Using ACT The Artemis Comparison Tool

28 Artemis Artemis is a free DNA sequence viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence, and its six-frame translation.

29 Artemis comparison tool ACT
Based on Artemis and written in Java. Allows visualisation of two or more sequences together with a comparison file. The comparison file can be generated with BLASTn or tBLASTx. Retains all the functionality of Artemis.

30 Running ACT
[Figure labels: Sequence 1, Sequence 2, BLASTn, tBLASTx, MSPcrunch, reformat]

31 [Figure: analyses run on a DNA sequence]
DNA-level tools: gene finders, Blastn, Blastx, Halfwise, tRNA scan, RepeatMasker.
Features found: genes, pseudo-genes, repeats, promoters, rRNA, tRNA.
Protein-level tools: Fasta, BlastP, Pfam, Prosite, Psort, SignalP, TMHMM.

32 The Annotation Process
[Figure labels: DNA sequence, analysis software, annotator, useful information]

33 DNA in Artemis
[Figure labels: AT content, forward translations, reverse translations, DNA and amino acids]

34 Gene structure IN TRYPANOSOMATIDS Polycistronic structure
-- Polycistronic structure: genes occur on a single strand at a time
-- Inflection points
-- No splicing

35

36 Trypanosome gene structure

37 GENE STRUCTURE IN MALARIA
-- Splicing
-- No polycistronic units
-- Can have small exons
-- Low complexity regions

38 AT content
Coding regions have higher GC content in AT-rich genomes.

39 AT content

40 CODON USAGE
Codon bias is different for each organism. DNA content in coding regions is restricted but not in non-coding regions. The codon usage for any particular gene can influence expression.

41 Codon usage
All organisms have a preferred set of codons.
Malaria vs Trypanosoma, valine codons (only one column of frequencies survives here):
Codon   Frequency
GUU     0.28
GUC     0.19
GUA     0.14
GUG     0.39
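Frequencies like those in the valine table above can be computed by counting each synonymous codon in a coding sequence and dividing by the total for that amino acid. A minimal sketch follows; the function names and the example sequence are invented for illustration and do not come from either organism.

```python
from collections import Counter

VALINE_CODONS = ("GUU", "GUC", "GUA", "GUG")


def codon_counts(cds: str) -> Counter:
    """Count codons in a coding sequence read in frame 0 (DNA or RNA)."""
    rna = cds.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    return Counter(rna[i:i + 3] for i in range(0, usable, 3))


def relative_usage(cds: str, synonymous=VALINE_CODONS) -> dict[str, float]:
    """Fraction of each synonymous codon among all codons for that amino acid."""
    counts = codon_counts(cds)
    total = sum(counts[c] for c in synonymous)
    if total == 0:
        return {c: 0.0 for c in synonymous}
    return {c: counts[c] / total for c in synonymous}


print(relative_usage("GTTGTTGTGGTCATG"))
```

Running this over all annotated genes of a genome, for every amino acid, gives the organism-specific codon usage table that gene finders and Artemis codon usage plots rely on.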

42 Codon Usage

43 Codon Usage in Artemis Forward frames Reverse frames

44 GC frame plot Plots the third position GC content of each frame of a DNA sequence. In coding DNA the GC content of the 3rd base is often higher. Good prediction of coding in malaria and trypanosomes.
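The idea behind a GC frame plot can be sketched as a sliding window that reports, for each of the three forward frames, the GC fraction at third codon positions. This is an illustrative approximation and not Artemis's implementation; the window and step sizes are arbitrary defaults chosen for this example.

```python
def gc3_by_frame(seq: str, window: int = 120, step: int = 30):
    """GC content of third codon positions per frame, in sliding windows.

    Returns a list of (window_start, (gc3_frame0, gc3_frame1, gc3_frame2)).
    Frames are counted from the window start, so keep step a multiple of 3
    if frames should line up with the start of the sequence.
    """
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        values = []
        for frame in range(3):
            third = chunk[frame + 2::3]  # third base of each codon in this frame
            gc = sum(base in "GC" for base in third)
            values.append(gc / len(third) if third else 0.0)
        rows.append((start, tuple(values)))
    return rows


# Artificial sequence whose frame-0 third positions are always G.
print(gc3_by_frame("AAG" * 10, window=30, step=3))
```

A frame whose GC3 curve rises clearly above the other two over a stretch of sequence is a candidate coding region in that frame, which is how the plot is read for malaria and trypanosome sequence.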

45

46 Genefinding programs
Genefinding software packages use hidden Markov models. They predict coding, intergenic and intron sequences, and need to be trained on a specific organism. Never perfect!

47 Phat Cawley et al. (2001) Mol. Bio. Para. 118 p167 http://www. stat
Based on a generalised hidden Markov model (GHMM). Free, and easily installed and run. Good at predicting multi-exon genes, but in some cases misses genes altogether and also over-predicts.

48 What is an HMM?
A statistical model that represents a gene. Similar to a “weight matrix”, but one that can recognise gaps and treat them in a systematic way. Has different “states” that represent introns, exons and intergenic regions.
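To make the notion of "states" concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. It is a deliberately simplified sketch: real gene finders use many more states (exons, introns, intergenic regions) and parameters trained on the target organism, whereas every probability below is invented for illustration.

```python
import math

STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {  # probability of moving from one state to the next
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {  # made-up base frequencies: AT-rich noncoding, GC-richer coding
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq: str) -> list[str]:
    """Most probable state path for seq (A/C/G/T only), in log space."""
    seq = seq.upper()
    if not seq:
        return []
    score = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        prev = score[-1]
        col, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this base.
            best_prev = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            col[s] = (prev[best_prev] + math.log(TRANS[best_prev][s])
                      + math.log(EMIT[s][base]))
            ptr[s] = best_prev
        score.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


print(viterbi("ATATATATATGCGCGCGCGC"))
```

The transition probabilities are what let the model tolerate a few atypical bases without switching state, which is the systematic treatment of irregularities that a plain weight matrix lacks.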

49 GlimmerM Salzberg et al. (1999) Genomics 59 24-31
Adaptation of the prokaryotic genefinder Glimmer, Delcher et al. (1999) NAR. Based on an interpolated HMM (IHMM). Only uses short chains of bases (Markov chains) to generate probabilities. Trained identically to Phat.

50 GlimmerM
Under-predicts splicing. Hardly ever misses a gene completely. Does over-predict. Free with licence.

51 Homology Data
Coding regions are more conserved than non-coding regions due to selective pressure. Comparing all possible translations against all known proteins (Blastx) will give clues to known genes.

52 The Gene Prediction Process
[Figure labels: DNA sequence, analysis software, Phat, GlimmerM, DNA plots, ESTs, FASTA, BlastX, annotator, good gene models]

53 T. brucei vs L. major (cont.)
region expanded in Leishmania

54 T. brucei vs T. cruzi
T. cruzi chr 3: conservation of G6PDH and adjacent hypothetical proteins; alanine aminotransferase; some genes missing. Synteny continues beyond the “centromere” in brucei and beyond.

55 L. major has a break in synteny that is conserved in T. brucei and T. cruzi
[Figure labels: T. cruzi chr 3, T. brucei chr 1, L. major chr 12, T. brucei chr 6]

56 The ACT Display
[Figure labels: genome1, genome2, genome3, zoom scroll bar, filter scroll, Blast HSPs]

57 ACT Designed for looking at complete bacterial genomes.

58 [Figure labels: Knowlesi contigs, tblastx, Falciparum chr 3, tblastx, Yoelii contigs (TIGR)]

59

60 AG-FMVZ-USP

61

62

63 Software www.sanger.ac.uk/Software/Artemis

