1
How to Build a Horse
Megan Smedinghoff
2
Background
In February 2007, the Broad Institute released a draft genome of the horse (Equus caballus)
The project cost $15 million and was funded by the National Human Genome Research Institute and the National Institutes of Health
300,000 bacterial artificial chromosomes (BACs) were provided by the University of Veterinary Medicine in Hanover, Germany and the Helmholtz Centre for Infection Research in Braunschweig, Germany
3
Horse Genome Statistics
The horse genome contains approximately 2.7 billion base pairs
The assembly was done using 6.8-fold coverage
The sequenced horse was a thoroughbred mare named Twilight from Cornell University
[Photo: Twilight posing for a picture at Cornell]
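As a rough check on these figures, the number of reads implied by a given coverage can be estimated from coverage = (reads × read length) / genome size. The sketch below assumes reads of about 750 bp, the length shown on the sequencing slide later in this deck. The function name `reads_needed` is hypothetical and used only for illustration.

```python
def reads_needed(genome_bp, coverage, read_len_bp):
    """Approximate read count so that reads * read_len / genome = coverage."""
    return round(coverage * genome_bp / read_len_bp)


genome_bp = 2_700_000_000  # approximate horse genome size, from the slide
coverage = 6.8             # fold coverage, from the slide
read_len_bp = 750          # assumption: read length from the sequencing slide

n_reads = reads_needed(genome_bp, coverage, read_len_bp)
print(f"about {n_reads / 1e6:.1f} million reads")
```

Under these assumptions the estimate comes out at roughly 24 to 25 million reads. This is only an order-of-magnitude check, since real reads vary in length and are trimmed before assembly.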
4
Why Sequence the Horse?
Allows scientists to study diseases that primarily affect horses, such as glanders
SNP information can be used to connect DNA to physical characteristics and explain differences between breeds
Lots of general information about mammals can be gained by looking at the horse, since very few large mammals have been sequenced
5
How the Horse Genome Affects Us
There are over 80 known genetic conditions in the horse that are analogous to human disorders
Horses have some conditions traditionally found in humans, such as allergies and arthritis
Having the complete horse genome helps infer the order of evolution
Horse Racing?
6
Project Proposal
Reassemble the horse genome using the Celera Assembler
Use existing UMD software to compare my assembly with the Broad assembly and produce a reconciled horse genome
Deposit the improved assembly in GenBank
Advisor: Jim Yorke
7
Introduction to Genome Sequencing
[Diagram: DNA target sample → SHEAR → SIZE SELECT (e.g., 10 Kbp ± 8% std. dev.) → LIGATE & CLONE (vector) → SEQUENCE (primer; end reads, or mates; 750 bp)]
Slide courtesy of Art Delcher
8
How Genomes are Assembled
Trim the Reads
Calculate Overlaps
Build Unitigs
Build Contigs
Build Scaffolds
Closure
9
Assembly: Calculating Overlaps
[Diagram: Read A and Read B drawn 5’ to 3’ and 3’ to 5’ in each of the four orientation combinations]
Compare every pair of reads to find every overlap of at least a certain length (~40bp)
Must compare the forward and reverse orientation of each pair of reads
Comparisons must take into account the possibility of sequencing errors and use alignment algorithms such as Smith-Waterman
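The all-against-all comparison can be sketched as follows. This is a deliberately simplified illustration. It swaps Smith-Waterman for a mismatch-only (Hamming) comparison, so it tolerates substitutions but not insertions or deletions, and it is quadratic in the number of reads. The function names and the default error tolerance are assumptions made for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def suffix_prefix_overlap(a, b, min_len=40, max_error=0.06):
    """Length of the longest suffix of `a` that matches a prefix of `b`
    with a mismatch rate of at most `max_error`; 0 if none reaches min_len."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches <= max_error * length:
            return length
    return 0


def find_overlaps(reads, min_len=40, max_error=0.06):
    """reads: dict of name -> sequence.
    Tries every ordered pair of distinct reads in both orientations and
    returns tuples (name_a, orient_a, name_b, orient_b, overlap_length).
    Each physical overlap is reported twice (once per strand)."""
    oriented = []
    for name, seq in reads.items():
        oriented.append((name, "+", seq))
        oriented.append((name, "-", reverse_complement(seq)))
    hits = []
    for name_a, orient_a, seq_a in oriented:
        for name_b, orient_b, seq_b in oriented:
            if name_a == name_b:
                continue
            length = suffix_prefix_overlap(seq_a, seq_b, min_len, max_error)
            if length:
                hits.append((name_a, orient_a, name_b, orient_b, length))
    return hits
```

For example, `find_overlaps({"r1": "AAACCCGGGT", "r2": "CGGGTTTTAC"}, min_len=5, max_error=0.0)` reports that the last 5 bases of `r1` match the first 5 bases of `r2`. Real assemblers avoid the quadratic comparison by first indexing short k-mers shared between reads.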
10
Assembly: Creating Unitigs
[Diagram: reads]
A unitig is a set of reads that have been linked together based on overlaps
A unitig has no ambiguities
11
Assembly: Creating Unitigs (cont.)
Best Buddy Algorithm for Unitig Assembly:
If the longest overlap with read A is read B and the longest overlap with read B is read A, then reads A and B are best buddies
[Diagrams of reads A, B, C, and D: in one, Read A and Read B are best buddies; in the other, Read A and Read B are NOT best buddies]
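The best buddy rule can be sketched directly from this definition. The sketch assumes the overlaps have already been computed and are given as `(read_a, read_b, overlap_length)` tuples. It treats each read as a whole, whereas a real assembler would typically track the two ends of a read separately. The function name `best_buddies` is hypothetical.

```python
def best_buddies(overlaps):
    """overlaps: iterable of (read_a, read_b, overlap_length).
    Returns the set of read pairs that are each other's longest overlap."""
    best = {}  # read -> (overlap_length, partner) for its longest overlap
    for a, b, length in overlaps:
        for read, partner in ((a, b), (b, a)):
            if read not in best or length > best[read][0]:
                best[read] = (length, partner)

    pairs = set()
    for read, (_, partner) in best.items():
        # Mutual check: the partner's longest overlap must point back.
        if best[partner][1] == read:
            pairs.add(tuple(sorted((read, partner))))
    return pairs


# A and B prefer each other, so they are best buddies.
print(best_buddies([("A", "B", 100), ("B", "C", 80)]))
# B's longest overlap is with C, so A and B are NOT best buddies.
print(best_buddies([("A", "B", 100), ("B", "C", 150)]))
```

In the second call, read A still prefers B, but B prefers C, so only B and C are linked.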
12
Assembly: Creating Contigs
[Diagram: Unitig A and Unitig B, linked by Read 1 and Read 2. Read 1 and Read 2 are mates]
A contig is a set of overlapping unitigs
Contigs are assembled by using mate pair information
Since we know the distance between mates and the orientation of the mates, we can infer the placement of the unitigs
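The inference from a single mate pair can be sketched as simple arithmetic. The sketch assumes Read 1 lies on the forward strand of Unitig A, Read 2 lies on the reverse strand of Unitig B, and the insert size is measured between the outer ends of the two mates. The function name and coordinate conventions are assumptions for illustration.

```python
def estimate_gap(insert_size, len_a, read1_start_in_a, read2_end_in_b):
    """Estimate the distance between the end of unitig A and the start of
    unitig B from one mate pair.

    insert_size       expected distance between the outer ends of the mates
    len_a             length of unitig A
    read1_start_in_a  coordinate in A where read 1 (forward strand) starts
    read2_end_in_b    coordinate in B of the outer end of read 2 (reverse strand)

    A negative result suggests that the two unitigs overlap.
    """
    covered_in_a = len_a - read1_start_in_a  # insert bases lying inside A
    covered_in_b = read2_end_in_b            # insert bases lying inside B
    return insert_size - covered_in_a - covered_in_b


# A 10 Kbp insert with 4,000 bp in unitig A and 3,000 bp in unitig B
# leaves roughly 3,000 bp between the two unitigs.
print(estimate_gap(10_000, 6_000, 2_000, 3_000))
```

Because insert sizes vary around their mean, a single mate pair gives only an approximate placement. Several consistent mate pairs give a more trustworthy estimate.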
13
Assembly: Building Scaffolds
[Diagram: Contig A, Contig B, and reads]
Scaffolds are built from contigs
The orientation and approximate distances between contigs are inferred from mate pair information
When possible, the gaps between contigs are filled in with leftover sequence
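One common way to represent a scaffold is to join the oriented contigs with runs of `N` whose length matches the estimated gap. The sketch below illustrates that representation only. It assumes the contigs are already ordered and oriented and that the gap estimates are given. The function name `build_scaffold` and the `min_gap` parameter are hypothetical.

```python
def build_scaffold(contigs, gaps, min_gap=1):
    """Join ordered, oriented contig sequences into one scaffold string.

    contigs  list of contig sequences, in scaffold order
    gaps     estimated gap sizes between consecutive contigs
    min_gap  placeholder length used when an estimate is zero or negative
    """
    if len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap estimate between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gaps, contigs[1:]):
        parts.append("N" * max(gap, min_gap))  # unknown bases in the gap
        parts.append(contig)
    return "".join(parts)


print(build_scaffold(["ACGT", "GGCC"], [3]))  # ACGTNNNGGCC
```

Filling a gap then amounts to replacing a run of `N` with real sequence once reads that span it have been placed.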
14
Arachne Assembler
24-mer indexing: any two reads that share at least one 24-mer are paired
Each pair is scored
Contigs are created by merging paired pairs
Repeat regions are avoided during contig assembly but used during scaffold assembly
Subreads are placed after scaffold assembly
[Photo: Serafim Batzoglou, Arachne author]
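The k-mer indexing step can be sketched as follows. This is a minimal illustration of the idea and not Arachne's implementation. It looks at the forward strand only and does not filter out high-frequency k-mers from repeats. The function name `candidate_pairs` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=24):
    """reads: dict of name -> sequence.
    Returns the set of read-name pairs that share at least one k-mer."""
    index = defaultdict(set)  # k-mer -> names of reads that contain it
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)

    pairs = set()
    for names in index.values():
        # Every two reads listed under the same k-mer become a candidate pair.
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return pairs


# Small example with k=4 so that it is easy to follow by hand.
reads = {"r1": "ACGTACGG", "r2": "TTACGGAA", "r3": "CCCCCCCC"}
print(candidate_pairs(reads, k=4))  # r1 and r2 share the 4-mers TACG and ACGG
```

Only the candidate pairs found this way need to be aligned and scored, which avoids comparing every read against every other read.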
15
Celera Assembler
Find overlaps of at least 40bp with less than 6% error
Overlaps are found using 22-mers
After overlaps are calculated, Celera does error correction using a voting algorithm
Unitigs are assembled using the best buddy algorithm
Scaffolds are assembled from mate pair information
Scaffold gaps are filled when possible
[Photo: Gene Myers, former vice president of Celera Genomics]
16
Project Expectations
Fall 2007: Produce Celera Assembly
Spring 2008: Produce Reconciled Assembly
General Goals
Tackle the unexpected problems that accompany genome assembly
Document my work
Validate my work wherever possible
17
Validation
Genome assemblies are not perfect
I plan to validate my assembly by comparing it to the current draft
I expect about 1.5% difference between the Celera Assembly and the Broad Assembly
I will use MUMmer to measure similarity between genomes
18
MUMmer
MUMmer is a piece of software created by CBCB that is used to compare genomes
MUMmer locates strings of at least 18bp that are present in each genome
Plotting the results makes it easy to see insertions, deletions, inversions, etc.
Graphs courtesy of Adam Phillippy
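The idea of locating shared exact strings can be sketched with a simple seed-and-extend search. This is not MUMmer's algorithm, which uses a suffix-based index and can restrict results to unique matches. The sketch reports every maximal exact match on the forward strand, whether unique or not. The function name `exact_matches` is hypothetical.

```python
from collections import defaultdict


def exact_matches(ref, qry, min_len=18):
    """Return sorted (ref_start, qry_start, length) tuples for maximal exact
    matches of at least `min_len` bases shared by `ref` and `qry`."""
    index = defaultdict(list)  # seed of length min_len -> start positions in ref
    for i in range(len(ref) - min_len + 1):
        index[ref[i:i + min_len]].append(i)

    matches = set()
    for j in range(len(qry) - min_len + 1):
        for i in index.get(qry[j:j + min_len], ()):
            # If the preceding bases also agree, this seed lies inside a match
            # that starts further left and was already found from that seed.
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue
            n = min_len
            while i + n < len(ref) and j + n < len(qry) and ref[i + n] == qry[j + n]:
                n += 1  # extend to the right while the bases keep matching
            matches.add((i, j, n))
    return sorted(matches)


# Small example with min_len=4: the shared string TTACAG starts at
# position 2 in both sequences.
print(exact_matches("GATTACAGGC", "TTTTACAGCC", min_len=4))  # [(2, 2, 6)]
```

Plotting `ref_start` against `qry_start` for each match gives the kind of dot plot described above: matches along the main diagonal indicate agreement, while shifts and breaks in the diagonal indicate insertions or deletions. Detecting inversions would additionally require matching against the reverse complement of the query.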
19
Implementation Details
I plan to use the Genome cluster at the University of Maryland to produce my assembly
Much of my project will utilize existing software
I intend to use Perl to write any additional scripts that may be needed
20
Time Permitting
The University of Maryland has recently produced a lot of software for the genome assembly pipeline, much of which has not been tested on large genomes
I hope to use programs like the UMD overlapper and Figaro to see how these programs affect my assembly
[Photos: Mihai Pop, James White]
21
Acknowledgements
James Yorke, Aleksey Zimin, and the Genome Group for advising me on the nature of this project
Steven Salzberg, Art Delcher, and Adam Phillippy for giving lectures and producing slides on genome assembly topics
Gene Myers's paper on Drosophila
Serafim Batzoglou's paper on Arachne
Wikipedia