Next-generation sequence data and de novo assembly for human genetics. By Jaap van der Heijden
De novo assembly
Overall idea
Repeats and non-random shearing
Scaffolding: multiple libraries; contigs are ordered and oriented by mate pairs -> scaffolding
4 types of assemblers: greedy algorithms; overlap-layout-consensus; align-layout-consensus; BAC-by-BAC sequencing
Types of assemblers I: Greedy algorithms; join the most similar reads first; easily confused by repeats
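The greedy idea on the slide above can be sketched in a few lines. This is a minimal illustration, not the code of any particular assembler: the function names `overlap` and `greedy_assemble`, the `min_len` threshold and the example reads are all assumptions made for this sketch, and sequencing errors and reverse complements are ignored.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest suffix-prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # no overlaps left: these are the final contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads that tile a 13-nucleotide sequence.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # ['ATGGCGTACCTGA']
```

Because the best overlap is always taken first, two reads that end in the same repeat can be joined even when they come from different places in the genome, which is the weakness named on the slide.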
Types of assemblers II: Overlap-layout-consensus assembler; nodes represent read ends; lines (edges) represent similarity between reads (overlap); the layout step removes redundant information; the consensus step builds the genome sequence
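A rough sketch of the overlap and layout steps, under simplifying assumptions: here each node is a whole read rather than a read end, overlaps are exact suffix-prefix matches, and the layout step is reduced to removing transitive edges. The names `overlap_graph` and `remove_transitive_edges` and the example reads are invented for this illustration; the consensus step is left out.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Longest proper suffix of a that is a prefix of b, or 0 if below min_len."""
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlap_graph(reads: list[str], min_len: int = 3) -> dict[tuple[str, str], int]:
    """Overlap step: one directed edge per ordered read pair that overlaps."""
    edges = {}
    for a, b in permutations(reads, 2):
        k = suffix_prefix_overlap(a, b, min_len)
        if k:
            edges[(a, b)] = k
    return edges


def remove_transitive_edges(edges: dict[tuple[str, str], int]) -> dict[tuple[str, str], int]:
    """Layout step (simplified): drop a->c when a->b and b->c already imply it."""
    successors: dict[str, set[str]] = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)
    reduced = dict(edges)
    for a, c in edges:
        for b in successors.get(a, ()):
            if b != c and c in successors.get(b, ()):
                reduced.pop((a, c), None)
                break
    return reduced


reads = ["ATGGCGTA", "GGCGTACC", "CGTACCTG"]  # hypothetical reads
graph = overlap_graph(reads)
layout = remove_transitive_edges(graph)
print(len(graph), len(layout))  # 3 2
```

In this small example the edge from the first to the third read is redundant, because the path through the second read already carries the same information. A real layout step also has to deal with cycles, errors and repeats, which this sketch does not attempt.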
Types of assemblers III: Align-layout-consensus. This process is called comparative assembly. The overlap stage of assembly is replaced by an alignment step. The layout stage is also greatly simplified by the additional constraints that the alignment to the reference provides.
Types of assemblers IV: BAC-by-BAC sequencing; the genome is broken into fragments; each BAC's location is determined in the lab; a minimum tiling path is chosen (the whole genome is covered by at least one BAC); the BACs are sequenced
Lander-Waterman equation: reads fall like “rain drops” to cover a tile; 8-10 fold coverage; 5 contigs for a 1 Mb genome
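The numbers on the slide above can be checked against the Lander-Waterman estimate of the expected number of contigs, N * exp(-c * sigma), with c = N * L / G the coverage and sigma = 1 - T / L. The slide does not give a read length, so the 500 nucleotides used below is an assumption for illustration only, as is setting the minimum detectable overlap T to 0.

```python
import math


def expected_contigs(genome_size: float, read_length: float,
                     coverage: float, min_overlap: float = 0.0) -> float:
    """Lander-Waterman expected number of contigs: N * exp(-c * sigma)."""
    n_reads = coverage * genome_size / read_length  # N = c * G / L
    sigma = 1.0 - min_overlap / read_length         # sigma = 1 - T / L
    return n_reads * math.exp(-coverage * sigma)


# Assumed values: 1 Mb genome, 500 nt reads (read length is not on the slide).
for c in (8, 10):
    print(c, round(expected_contigs(1_000_000, 500, c), 2))
```

With these assumed values, 8-fold coverage gives roughly five expected contigs, and 10-fold coverage gives fewer than one expected gap-separated piece beyond the first, which is in line with the slide.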
Timeline: 1975 Sanger sequencing; 1990 first shotgun/EST assemblers, overlap-layout-consensus approach; 2000 human shotgun assembly; 2001 mouse shotgun assembly; Roche available; 2006 Solexa available; 2007 short-read assemblers, de Bruijn graphs
The complexity of sequence assembly: long reads give better identification but are much slower; short reads are faster to align but more difficult with repeats. Factors: number of reads, length of reads, mismatches. Algorithms can show quadratic or even exponential complexity
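To give a feel for the quadratic case: a naive overlap search compares every read with every other read, so the work grows with the square of the number of reads. The helper below only counts unordered pairs; the name `pairwise_comparisons` and the read counts are assumptions chosen for illustration.

```python
def pairwise_comparisons(n_reads: int) -> int:
    """Number of unordered read pairs an all-against-all overlap search must examine."""
    return n_reads * (n_reads - 1) // 2


for n in (1_000, 1_000_000):
    print(f"{n:>9} reads -> {pairwise_comparisons(n):,} pairs")
```

A thousand times more reads means about a million times more pairs, which is one reason short-read assemblers moved away from all-against-all overlap computation.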
3 NGS projects: dragonfly; medicinal maggots; EST comparison
Dragonfly (Dutch: libelle): order Odonata; 3000 species, 90 in Europe; undergo metamorphosis
Pilot study for African dragonfly: metamorphosis; some migrate, others don't; genetically divergent; contain lots of introns in their genome
Project questions: What are the homologies with other species? How big is the genome? Are there already sequences in GenBank, and are they present in the data?
Dragonfly project data: Genomic: single end, 1 x reads, trimmed to 34/51 nucleotides, nucleotides sequenced. cDNA: paired end, 2 x reads, read length = 51, nucleotides sequenced
Dragonfly methods: assemble cDNA; BLAST the resulting contigs to determine homologies; align genomic DNA to contigs; calculate genome size
Dragonfly assembly results: total contigs: 3898; average length of contigs: 176; average coverage of contigs: 24; contigs larger than 300 nucleotides: 800; average length of contigs larger than 300: 508; average coverage of contigs larger than 300: 15
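Summary figures of this kind can be derived from a list of contig lengths and coverages. The sketch below assumes unweighted averages and a strict "longer than 300" cut-off; the slides do not say how the averages were computed, and the function name `contig_summary` and the example contigs are made up.

```python
from statistics import mean


def contig_summary(contigs: list[tuple[int, float]], min_length: int = 300) -> dict[str, float]:
    """Summarise (length, coverage) pairs overall and for contigs longer than min_length."""
    large = [(length, cov) for length, cov in contigs if length > min_length]
    return {
        "total contigs": len(contigs),
        "average length": mean(length for length, _ in contigs),
        "average coverage": mean(cov for _, cov in contigs),
        f"contigs > {min_length}": len(large),
        f"average length > {min_length}": mean(length for length, _ in large) if large else 0,
        f"average coverage > {min_length}": mean(cov for _, cov in large) if large else 0,
    }


# Hypothetical (length, coverage) pairs, not data from the project.
example = [(100, 30.0), (200, 20.0), (400, 10.0), (600, 20.0)]
for name, value in contig_summary(example).items():
    print(f"{name}: {value}")
```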
Dragonfly genes and homologies: Libellula pulchella; Enallagma aspersum; Erythromma najas; Ischnura verticalis; many Drosophila species. The criteria used in this analysis were an e-value of less than 1*10^-40 and a score of more than 200. COII gene with accession number GQ (partial), COI gene with accession number GQ (partial) and NDI gene with accession number GQ (partial) were found in the cDNA contigs.
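The two criteria above can be applied as a simple filter over BLAST hits. The sketch assumes the standard 12-column tabular BLAST output (e-value in column 11, bit score in column 12) and assumes that the "score" on the slide means the bit score; neither is stated in the slides. The function name `filter_hits` and the example rows are hypothetical.

```python
import csv
import io


def filter_hits(tabular_text: str, max_evalue: float = 1e-40,
                min_score: float = 200.0) -> list[tuple[str, str, float, float]]:
    """Keep hits with e-value < max_evalue and bit score > min_score."""
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue < max_evalue and bitscore > min_score:
            hits.append((row[0], row[1], evalue, bitscore))
    return hits


# Hypothetical hits: query, subject, %identity, length, mismatches, gaps,
# query start/end, subject start/end, e-value, bit score.
rows = [
    ["contig1", "subjA", "98.0", "300", "6", "0", "1", "300", "1", "300", "2e-55", "350"],
    ["contig2", "subjB", "90.0", "120", "12", "0", "1", "120", "1", "120", "1e-20", "150"],
]
print(filter_hits("\n".join("\t".join(r) for r in rows)))
```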
Dragonfly genome size: 30 genomic genes selected after BLAST; size; alignment with Bowtie; “calculation”
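The slides do not spell out the “calculation”. One common way to get a rough estimate, shown here purely as an assumption, is to measure the sequencing depth on reference sequences of known size and divide the total number of sequenced nucleotides by that depth. The function name, parameters and numbers below are hypothetical.

```python
def estimate_genome_size(total_sequenced_nt: float, aligned_nt: float,
                         reference_length_nt: float) -> float:
    """Rough genome size: total sequenced nucleotides divided by the observed depth.

    Depth is estimated as aligned nucleotides per reference nucleotide, assuming
    the reference genes are single copy and sampled like the rest of the genome.
    """
    depth = aligned_nt / reference_length_nt
    return total_sequenced_nt / depth


# Hypothetical numbers: 1 Gb sequenced, 600 kb aligned to 30 kb of reference genes.
print(f"{estimate_genome_size(1_000_000_000, 600_000, 30_000):,.0f} nt")
```

The estimate is only as good as the single-copy assumption: repeated or organellar genes are sampled more often and would make the genome look smaller than it is.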
Medicinal maggots: applied to non-healing wounds. Genes revealed: signaling proteins; inhibitor of apoptosis protein 2; digestive enzymes (lipases, proteinases); antimicrobial peptides (AMPs): Lucilia defensin, diptericin
Medicinal maggots data: 5 degenerate peptide sequences; 36 peptides; cDNA reads, read length 32
Medicinal maggots question: Have we sequenced (pieces of) the genes corresponding to the peptides?
Medicinal maggots methods: build local library of peptides; assemble contigs (CLCbio, Nextgene, Velvet); BLAST contigs to peptides; find hits; make coverage plot
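The last step, the coverage plot, comes down to counting how many hits cover each position of a target. The sketch below assumes hits are given as 1-based inclusive (start, end) intervals on the target and prints a plain-text plot; the function names and the example intervals are made up, and the slides do not say which tool produced the actual plots.

```python
def coverage_profile(target_length: int, hits: list[tuple[int, int]]) -> list[int]:
    """Per-position depth from 1-based inclusive (start, end) intervals."""
    depth = [0] * target_length
    for start, end in hits:
        lo, hi = sorted((start, end))  # reverse-strand hits may have start > end
        for pos in range(max(lo, 1), min(hi, target_length) + 1):
            depth[pos - 1] += 1
    return depth


def text_plot(depth: list[int]) -> str:
    """One line per position, with one '#' per covering hit."""
    return "\n".join(f"{pos:>4} {'#' * d}" for pos, d in enumerate(depth, start=1))


# Hypothetical hits on a 10-residue target.
profile = coverage_profile(10, [(1, 4), (3, 8), (8, 6)])
print(profile)  # [1, 1, 2, 2, 1, 2, 2, 2, 0, 0]
print(text_plot(profile))
```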
Nextgene assembly maggots: number of contigs = ; average length = 59; average coverage = 11; number of contigs >300 = 719; average length >300 = 661; average coverage >300 = 64
CLC assembly: number of contigs = 78; average length = 2282; average coverage = 514
Velvet assembly maggots: total contigs: 586; length of contigs: 168; coverage of contigs: 55; contigs larger than 300 nucleotides: 62; length of contigs larger than 300: 779; coverage of contigs larger than 300: 63
Genes found, maggots: C. vicina mRNA for arylphorin subunit A4 (Velvet); Drosophila willistoni GK21455 (Dwil\GK21455) mRNA (Nextgene); Lucilia cuprina clone sbsp9 serine proteinase mRNA (Nextgene)
EST comparison: traditional EST sequencing; known library; assemblers: CLCbio, Nextgene, Velvet
EST comparison method: assemble cDNA and match with known ESTs
EST results
Conclusions: big differences between assemblers in coverage, length and number of nodes; sequence x performs best on EST test
Questions?