Lesson: Sequence processing

Goals:
  Introduce DNA assembly and alignment
  Practice rebuilding full sequences from reads

Sequencing by Synthesis Review
Modified PCR “builds” the sequence over multiple cycles
Each strand of DNA is amplified into a cluster of identical DNA before sequencing
PURPOSE: Review of the Sequencing by Synthesis slides. Refreshes the process of generating a sequence.

Sequencing by Synthesis Review
Multiple clusters are sequenced at once
Clusters can be:
  From different samples OR from the same sample
  Short regions OR long regions that have been broken into shorter pieces
Unique tags (indices) identify the source of each cluster
The sequence from each cluster is referred to as a “read”
PURPOSE: Review of the Sequencing by Synthesis slides. Refreshes the role of indexing and emphasizes that each sequencing run can produce sequences from multiple different sources, either different areas within a genome or different genomes entirely. Sets up that further processing is required before a sequence can be analyzed.

Before analysis can begin:
Sequence information needs to be stored
  FASTA files store sequence information in a text format
Long regions that were broken up for sequencing need to be rebuilt
  Assembly rebuilds long regions using overlapping sequences
  Alignment rebuilds long regions by matching reads to a reference
“References” are the results from the previous times a genome or region was sequenced. A reference can also be called the “consensus” sequence, since it is the agreed-upon complete version of the sequence.
PURPOSE: Introduces vocabulary (FASTA [pronounced fast-ay], assembly, alignment, reference, consensus) and the two major components of sequence processing: storing information and rebuilding a complete sequence.

Storing Sequencing Information
FASTA files
  Used for nucleotide (DNA, RNA) or peptide (protein) sequences
  Contain a header row, marked by “>”, with sample information and then a new row with sequence information
  One FASTA file can contain multiple sequences
  Can be opened with any text editor
PURPOSE: Provides details on FASTA files. Useful if students will be working directly with sequence information (as in our BioinformaticsTools lesson), but can be skipped.
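For classes that will open FASTA files directly, the header-then-sequence layout can be illustrated with a short Python sketch. This is a minimal illustration, not a full FASTA reader: the function name parse_fasta and the sample headers are made up for this example, and it assumes a sequence may be wrapped across several rows under one header.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank rows
        if line.startswith(">"):
            header = line[1:]  # header row: sample information
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence row(s) belonging to the last header
    return records


# Hypothetical file contents: two sequences stored in one FASTA file.
EXAMPLE_FASTA = """>sample_1 made-up description
ATCCGCATTGAC
CTAGCGCA
>sample_2
GCAATACGTGAC
"""

fasta_records = parse_fasta(EXAMPLE_FASTA)
for name, sequence in fasta_records.items():
    print(name, len(sequence), sequence)
```

Keeping the header text as the dictionary key mirrors how the “>” row labels the sequence that follows it.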

Rebuilding Long Sequences: 1
Assembly
  Sequencing works best with short regions, so long regions of DNA are randomly fragmented before sequencing
  Overlaps in the regions are used to reconstruct the full sequence
PURPOSE: Conceptual overview of sequence assembly. Provides visual for discussion of details in next slide.

Assembly Details
DNA is amplified before fragmentation. Lots of copies being randomly fragmented means a lot of overlap.
The more short fragments that overlap with one another, the more certain we can be that the long region has been correctly assembled.
Read 1: ATCCGCATTGAC
Read 2: TGACCTAGCGCA
Read 3: GCAATACGTGAC
Read 4: CATTGACCTAG
Does Read 2 follow Read 1 OR Read 3?
PURPOSE: Provides more details on the process of assembly. Key to assembly is that multiple copies of the original DNA get randomly fragmented so that each copy has different breaks. These different breaks allow reads to overlap at different points, which is vital for ensuring that they are assembled correctly (in the picture example, Read 2 matches both Read 1 and Read 3, but because Read 4 has different end points it can determine which is the correct match).
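The role of Read 4 can be checked with a short Python sketch that measures how many letters at the end of one read match the start of another. This is only an illustration of the overlap idea using the four reads from the slide; the function name overlap and the minimum overlap of 3 letters are assumptions for this example, and real assemblers use far more elaborate methods.

```python
def overlap(left, right, min_length=3):
    """Length of the longest end of `left` that matches the start of `right`.

    Returns 0 when the best overlap is shorter than `min_length`.
    """
    longest_possible = min(len(left), len(right))
    for length in range(longest_possible, max(min_length, 1) - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


reads = {
    "Read 1": "ATCCGCATTGAC",
    "Read 2": "TGACCTAGCGCA",
    "Read 3": "GCAATACGTGAC",
    "Read 4": "CATTGACCTAG",
}

# Read 2 could follow either Read 1 or Read 3 (both end in TGAC) ...
print("Read 1 -> Read 2:", overlap(reads["Read 1"], reads["Read 2"]))
print("Read 3 -> Read 2:", overlap(reads["Read 3"], reads["Read 2"]))
# ... but Read 4 bridges only one of the two candidates.
print("Read 1 -> Read 4:", overlap(reads["Read 1"], reads["Read 4"]))
print("Read 3 -> Read 4:", overlap(reads["Read 3"], reads["Read 4"]))
print("Read 4 -> Read 2:", overlap(reads["Read 4"], reads["Read 2"]))
```

Read 1 and Read 3 tie on their overlap with Read 2, while Read 4 overlaps strongly with Read 1 and Read 2 but not with Read 3, which is the extra evidence the slide describes.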

Practice Assembly
Sequence Processing OR Read Assembly Activity
All groups get only the reads
Think about the following:
  How many “reads” were necessary to cover the entire “genome”?
  How sure are you of the final sequence?
  Are there any regions of ambiguity?
  What information would you want to help resolve that ambiguity?
PURPOSE: Hands-on practice with the process of assembly. The Sequence Processing and Read Assembly activities can be found at ase.tufts.edu/chemistry/walt/sepa/activities.html

Rebuilding Long Sequences: 2
Alignment
  Long regions are randomly fragmented into shorter regions for sequencing
  Short regions are lined up against previous sequencing results to reconstruct the full sequence
PURPOSE: Conceptual overview of sequence alignment. Provides visual for discussion of details in next slide.

Alignment Details
Reference: ATCCCGGA-TCGTTA
           ||| |||| ||| ||
Read:      ATC-CGGAATCGATA
The | indicates a perfect match
Points of variation between the read and the reference are noted and stored in a Variant Call Format (VCF) file
The more short fragments that include a variation, the more certain we can be that the variation isn’t just a sequencing error
Reads can vary from a reference in different ways:
  Changes in a nucleotide
  Insertions
  Deletions
PURPOSE: Reiterates that in alignment the general sequence is known and that the focus is on the variations present in the specific sequence. Introduces notation for marking variations and briefly discusses the types of changes. (Vocabulary not on the slide: a change in a single nucleotide is often called a point mutation or a single nucleotide polymorphism (SNP [pronounced snip]).)
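The three kinds of variation can be picked out of the slide's example alignment with a short Python sketch. It assumes the two rows are already aligned to the same length, with “-” marking a gap; the function name classify_columns and the labels it returns are choices made for this illustration, and the column numbers count positions in the alignment rather than positions in the reference.

```python
def classify_columns(reference, read):
    """List every alignment column where the read differs from the reference."""
    if len(reference) != len(read):
        raise ValueError("aligned rows must have the same length")
    differences = []
    for column, (ref_base, read_base) in enumerate(zip(reference, read), start=1):
        if ref_base == read_base:
            continue  # perfect match, drawn as | on the slide
        if ref_base == "-":
            kind = "insertion"  # the read has a base the reference lacks
        elif read_base == "-":
            kind = "deletion"   # the read is missing a reference base
        else:
            kind = "mismatch"   # change in a single nucleotide
        differences.append((column, kind, ref_base, read_base))
    return differences


aligned_reference = "ATCCCGGA-TCGTTA"
aligned_read = "ATC-CGGAATCGATA"

alignment_differences = classify_columns(aligned_reference, aligned_read)
for column, kind, ref_base, read_base in alignment_differences:
    print(column, kind, ref_base, read_base)
```

Each reported column corresponds to one of the blank spots in the row of | marks.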

Storing Variation Information
Variant Call Format (VCF) file
  Indicates differences compared to a reference
  Contains header rows, marked by “##”, and a table of variants
  Can be opened in text or spreadsheet editors
PURPOSE: Provides details on VCF files. Useful if students will be working directly with variant information (as in our Mutations Investigation module), but can be skipped.
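For classes that will look inside a VCF file, the layout of “##” header rows followed by a tab-separated table can be sketched in Python. This is a simplified reader for illustration only: the function name parse_vcf is made up, the example file contents (including the chromosome name chrDemo and the variant itself) are invented, and only the eight standard column names from the VCF format are assumed.

```python
def parse_vcf(text):
    """Return the variant table as a list of dicts, one per variant row."""
    columns = None
    variants = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # blank rows and "##" header rows carry no variants
        if line.startswith("#"):
            columns = line[1:].split("\t")  # the single "#CHROM ..." column row
            continue
        if columns is None:
            raise ValueError("variant row found before the #CHROM column row")
        variants.append(dict(zip(columns, line.split("\t"))))
    return variants


# Invented example: one header row, the column row, and one variant.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chrDemo\t13\t.\tT\tA\t50\tPASS\t.\n"
)

vcf_variants = parse_vcf(EXAMPLE_VCF)
for variant in vcf_variants:
    print(variant["CHROM"], variant["POS"], variant["REF"], "->", variant["ALT"])
```

All values are kept as text, which is also how they appear when the file is opened in a text or spreadsheet editor.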

Practice Aligning
Sequence Processing OR Read Assembly Activity
All groups get reads and a reference copy of the original text
For more practice with alignment: Aligning Short Texts Activity
Think about the following:
  How are you deciding on the “best” alignment?
  What benefit is there to having multiple “reads” for each text?
Multiple Alignment: When more than two sequences are being aligned
PURPOSE: Hands-on practice with the process of alignment. The Sequence Processing and Read Assembly activities can be found at ase.tufts.edu/chemistry/walt/sepa/activities.html

Evaluating Alignments
Goal: maximize overlap between sequences
Scoring
  Way of quantifying overlap so different alignments can be compared
  Different scoring systems exist, but a simple one would be:
    Matches: +1
    Mismatches: -1
    Gaps: -2
  To use this system: Score = (number of matches) - (number of mismatches) - 2*(number of gaps)
PURPOSE: Introduces a technique for determining the best alignment out of multiple possible alignments. Introduces a simple scoring system students can use to compare alignments.

Comparing Alignments
Score = (number of matches) - (number of mismatches) - 2*(number of gaps)
Alignment 1
Reference: GTCGAATGAAACGATTAA
            |||| | ||    |
Read:       TCGATTTAACGATTA
Alignment 2
Reference: GTCGAATGAAACGATTAA
                ||  ||||||||
Read:        TCGATTTAACGATTA
Alignment 3
Reference: GTCGAATGAAACGATTAA
            |||| | | |||||||
Read:       TCGATTTA-ACGATTA
PURPOSE: Practice using the simple scoring system to pick the best out of three possible alignments (answers below). Highlights that even though gaps incur a worse penalty than mismatches, they can still be better for the overall alignment.
ALIGNMENT 1: Score = 8 matches - 7 mismatches - 2*0 gaps = 1
ALIGNMENT 2: Score = 10 matches - 5 mismatches - 2*1 gaps... see below
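The simple scoring system can be applied to the three alignments with a short Python sketch. It is only an illustration: the function name score_alignment and its offset parameter (how many reference letters sit to the left of the read) are assumptions made here, the offsets are the ones that reproduce the match counts in the answers, and reference letters outside the read are not scored.

```python
def score_alignment(reference, read, offset, match=1, mismatch=-1, gap=-2):
    """Score an aligned read against the stretch of reference it covers."""
    window = reference[offset:offset + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    score = 0
    for ref_base, read_base in zip(window, read):
        if ref_base == "-" or read_base == "-":
            score += gap        # gap in either row
        elif ref_base == read_base:
            score += match
        else:
            score += mismatch
    return score


REFERENCE = "GTCGAATGAAACGATTAA"

alignment_scores = {
    "Alignment 1": score_alignment(REFERENCE, "TCGATTTAACGATTA", offset=1),
    "Alignment 2": score_alignment(REFERENCE, "TCGATTTAACGATTA", offset=2),
    "Alignment 3": score_alignment(REFERENCE, "TCGATTTA-ACGATTA", offset=1),
}

for name, score in alignment_scores.items():
    print(name, score)
print("Best:", max(alignment_scores, key=alignment_scores.get))
```

The gapped alignment scores highest even after paying the larger gap penalty, which is the point the slide is making.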

Coverage
The number of times each nucleotide is “seen” during sequencing
Higher coverage makes it easier to distinguish errors from true sequence variations
What is being sequenced helps determine how common a variation has to be before it’s considered a “real” variation
Low Coverage
Read 1: ATCCGCATTGAC
Read 2: CGCCTTGACCTAG
Read 3: CCGCCCTGACCTAG
High Coverage
Read 1: ATCCGCATTGAC
Read 2: CGCCTTGACCTAG
Read 3: CCGCCCTGACCTAG
Read 4: TCCGCATTGACCT
Read 5: CGCATTGACCTAGCG
Read 6: CGCATTGACCTA
Read 7: ATCCGCATTGACC
Read 8: TCCGCATTGAC
Read 9: GCATTGACCTACCGC
Read 10: ATTCCGCATTG
PURPOSE: Introduces vocabulary (coverage). Coverage is conceptually the same as performing replicates. The examples of high and low coverage highlight that nucleotides will have different levels of coverage and how that affects the certainty of the results.
Expansion Points: The lower left bullet point (“What is being…”) asks students to think about concerns in determining variation validity. Humans have two copies of most genes, so an individual carrying two different alleles of a gene will have a more mixed set of nucleotides at a given point in that gene’s sequence. Each human cell contains multiple copies of mitochondrial DNA. The possibility for error exists every time this mtDNA is replicated, so one individual may have lots of low-level variation across their mtDNA sequence (having multiple versions of mtDNA is known as “heteroplasmy”). THIS BULLET POINT CAN BE OMITTED.
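The effect of coverage on certainty can be illustrated with a short Python sketch that counts which nucleotides are “seen” at one position. The start offsets below are assumptions made by lining each read up against the others by eye; they are not given on the slide. Reads 9 and 10 are left out because their placement is less clear, and the function name column_counts is made up for this example.

```python
from collections import Counter


def column_counts(placed_reads, position):
    """Count the nucleotides that the placed reads show at one position."""
    counts = Counter()
    for start, read in placed_reads:
        index = position - start
        if 0 <= index < len(read):  # this read covers the position
            counts[read[index]] += 1
    return counts


# (assumed start offset, read sequence)
low_coverage = [
    (0, "ATCCGCATTGAC"),    # Read 1
    (3, "CGCCTTGACCTAG"),   # Read 2
    (2, "CCGCCCTGACCTAG"),  # Read 3
]
high_coverage = low_coverage + [
    (1, "TCCGCATTGACCT"),    # Read 4
    (3, "CGCATTGACCTAGCG"),  # Read 5
    (3, "CGCATTGACCTA"),     # Read 6
    (0, "ATCCGCATTGACC"),    # Read 7
    (1, "TCCGCATTGAC"),      # Read 8
]

POSITION = 6  # counting from 0; the seventh letter of Read 1
low_counts = column_counts(low_coverage, POSITION)
high_counts = column_counts(high_coverage, POSITION)
print("Low coverage: ", dict(low_counts))
print("High coverage:", dict(high_counts))
```

Under these assumed placements, the three low-coverage reads show C more often than A at this position, while the larger set shows A in most reads, so the C is easier to recognize as a possible error or a low-frequency variation.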

Types of Sequencing Analysis
De Novo Sequencing
  Used the first time a gene or genome is ever sequenced
  Uses assembly to stitch short regions into a longer whole
Resequencing
  Used subsequent times a genome is sequenced
  Uses alignment to identify short sequences using a reference
PURPOSE: Connects the methods of sequence processing to the different types of sequencing. Introduces vocabulary students are likely to see in discussions of sequencing (de novo, resequencing). The puzzles model trying to reconstruct something without a guide or knowing what the final product should look like.

Compare methods
Sequence Processing OR Read Assembly Activity
Use a different text, provide half the groups a “Reference” sheet
Think about the following:
  How long are the “reads”? How long is the “genome”?
  How easy was this task with vs without a “reference” text?
  How fast was this task with vs without a “reference” text?
  How long are sequencing reads? How long are genomes?
  How easy/fast would using real sequencing data be?
PURPOSE: A chance to compare methods side by side. Can be omitted if the methods are compared after the alignment practice. The Sequence Processing and Read Assembly activities can be found at ase.tufts.edu/chemistry/walt/sepa/activities.html

Role of computers in analysis
Computers can:
  Automate tasks
  Work faster than humans
  Process long sequences just as easily as short sequences
Bioinformatics: use of computers for analyzing complex biological data
Lots of bioinformatics tools exist for you to use in analyzing your sequence
PURPOSE: Introduces the necessity of computers in making much of this processing feasible. Introduces bioinformatics and bioinformatics tools as the way to further analyze sequencing data. Expanded upon in our BioinformaticsTools lesson.
