Elephant Seg Dup Analysis 1.Genome 2.Parameters for Pipeline 3.Analysis.


1 Elephant Seg Dup Analysis 1.Genome 2.Parameters for Pipeline 3.Analysis

2 Elephant Genome The genome assembly was downloaded from ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Loxodonta_africana/Loxafr3.0/ This assembly contains 693 scaffolds (GL…) and 1,658 contigs (AAGU…), but they are not mapped to chromosomes. Total gapped length is 3,196 Mb and ungapped sequence length is 3,118 Mb.
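The two totals on this slide differ by about 78 Mb (3,196 Mb minus 3,118 Mb), which is the gap sequence. A minimal sketch of how gapped and ungapped lengths could be tallied from a FASTA file, assuming gaps are written as runs of N; the function name and the toy records are made up for illustration and are not part of the original pipeline.

```python
import io

def assembly_lengths(fasta_handle):
    """Return (gapped_length, ungapped_length) summed over all records."""
    gapped = 0
    ungapped = 0
    for line in fasta_handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and record headers
        gapped += len(line)
        # Assumption: gap positions are written as N (or n).
        ungapped += len(line) - line.upper().count("N")
    return gapped, ungapped

# Toy input; the record names are hypothetical.
toy = io.StringIO(">scaffold_1\nACGTNNNNAC\nGT\n>contig_1\nACGNNA\n")
gapped, ungapped = assembly_lengths(toy)
print(gapped, ungapped, gapped - ungapped)
```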

3 Seg Dup detection pipelines WGAC: detects segmental duplications (Seg Dups) in genomic assemblies by looking for homologous pairs (>1 kb in length, >90% identity).
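The length and identity cut-offs can be read as a simple filter over pairwise alignments. The sketch below only illustrates that criterion; the `Alignment` record and `is_wgac_pair` function are hypothetical names, not the WGAC pipeline's own data structures, and identity is assumed to be stored as a fraction.

```python
from typing import NamedTuple

class Alignment(NamedTuple):
    seq1: str         # name of the first sequence (scaffold or contig)
    seq2: str         # name of the second sequence
    length_bp: int    # aligned length in base pairs
    identity: float   # fraction of identical bases, 0.0 to 1.0

def is_wgac_pair(aln, min_len_bp=1000, min_identity=0.90):
    """True if the alignment is longer than 1 kb and above 90% identity."""
    return aln.length_bp > min_len_bp and aln.identity > min_identity

# Toy alignments with made-up names and values.
alignments = [
    Alignment("scafA", "scafB", 2500, 0.97),  # passes both cut-offs
    Alignment("scafA", "scafA", 800, 0.99),   # too short
    Alignment("scafC", "ctgD", 5000, 0.88),   # identity too low
]
kept = [a for a in alignments if is_wgac_pair(a)]
print(len(kept))
```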

4 Parameters and notes for WGAC pipeline Repeats: – Because no elephant repeat library is available, we masked the combined sequence space of the winMask and RepeatMasker masks. – RepeatMasker with default settings alone is not good enough (tested by BLAST). – The combined masking space is good enough. BLAST parsing seeds in WGAC pipeline: – The seed size is 500 bp.
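One way to picture the combined masking space: a base is masked if either tool masked it. The sketch below assumes each tool has produced a soft-masked copy (lowercase = masked) of the same sequence; that representation and the function name `combine_softmasks` are assumptions for illustration, since the slide does not say how the two masks were merged.

```python
def combine_softmasks(seq_a, seq_b):
    """Union of two soft-masked copies of one sequence (lowercase = masked)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("both masked copies must have the same length")
    out = []
    for a, b in zip(seq_a, seq_b):
        # A position is masked in the result if either copy masked it.
        out.append(a.lower() if (a.islower() or b.islower()) else a.upper())
    return "".join(out)

# Toy example: the two inputs mask different stretches.
masked_by_tool_1 = "ACgtACGT"
masked_by_tool_2 = "acGTACgt"
print(combine_softmasks(masked_by_tool_1, masked_by_tool_2))
```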

5 Result from WGAC Pipeline
Total pairs of WGAC detected (>1 kb and >90% identity): 64164
Interchromosome pairs: 58454
Intrachromosome pairs: 5709
Total WGAC NR (bp): 128,672,221
NR inter (bp): 97,156,068
NR intra (bp): 55,296,067
Total genome size (with gap) (bp): 3,196,721,236
Notes: The inter and intra counts are based on scaffolds and contigs rather than chromosomes.
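Since the assembly has no chromosome assignments, "inter" and "intra" here reduce to whether the two members of a pair sit on different or the same scaffold or contig. A small sketch of that classification, together with the share of the gapped genome covered by the nonredundant WGAC bases reported on this slide (roughly 4%); the function names and toy pair list are hypothetical.

```python
def classify_pair(seq1, seq2):
    """Label a duplication pair by whether both copies share a sequence."""
    return "intra" if seq1 == seq2 else "inter"

def percent_of_genome(nr_bp, genome_bp):
    """Nonredundant duplicated bases as a percentage of genome size."""
    return 100.0 * nr_bp / genome_bp

# Toy pairs with made-up scaffold and contig names.
pairs = [("scafA", "scafA"), ("scafA", "scafB"), ("ctgX", "scafB")]
counts = {"inter": 0, "intra": 0}
for s1, s2 in pairs:
    counts[classify_pair(s1, s2)] += 1
print(counts)

# Figures taken from the slide: total WGAC NR and gapped genome size.
wgac_share = percent_of_genome(128_672_221, 3_196_721_236)
print(f"{wgac_share:.1f}% of the gapped assembly")
```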

6 General analysis of WGAC length and identity distribution 1. Length distribution peaked at 1-2 kb, intra > inter, with 87% of WGAC related to chrUn. 2. Identity distribution peaked at 97-98%; few pairs are higher than 99%.

7 NR distribution (AllDupLen.xls) Because the scaffolds and contigs are not mapped to chromosomes, there is no NR distribution for each chromosome. In general, the large scaffolds have less SD and the smaller scaffolds have more, especially those under 1 Mb. All contigs have a high percentage of SD.

8 Initial stats are in allstat.xls

9 WGAC page, not yet set up

10 WSSD analysis (done by Tin; not yet) Downloaded the WGS reads: about 11,683,735 reads from the Trace Archive at NCBI. Downloaded zfinch-finished BACs. These BACs are used to determine the threshold for WGS depth of coverage. For a 5-kb window, the average number of reads is 59. The threshold for a 5-kb window is 110; for a 1-kb window it is 22. Used the UCSC taeGut1 database rmsk tables as input to mask the genome for repeats with divergence <=10%. (UCSC rmsk options: RepeatMasker -align -s -species 'Taeniopygia guttata')
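The two thresholds quoted above are consistent with linear scaling by window size (110 reads per 5 kb is 22 reads per 1 kb). The sketch below shows that arithmetic and a simple way to flag windows whose read count exceeds the threshold. The slide does not say how the thresholds were derived or whether the comparison is strict, so the scaling rule, the strict `>` comparison, and both function names are assumptions.

```python
def scale_threshold(threshold, from_window_bp, to_window_bp):
    """Rescale a read-count threshold linearly with window size."""
    return threshold * to_window_bp / from_window_bp

def flag_windows(read_counts, threshold):
    """Return indices of windows whose read count exceeds the threshold."""
    return [i for i, count in enumerate(read_counts) if count > threshold]

# Threshold values from the slide; the window counts are toy data.
threshold_5kb = 110
threshold_1kb = scale_threshold(threshold_5kb, 5000, 1000)
print(threshold_1kb)

toy_counts_5kb = [55, 62, 140, 118, 59, 110]
print(flag_windows(toy_counts_5kb, threshold_5kb))
```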

11 WSSD results (not yet available) A total of 16,076 regions with 44,218,871 bp were found in wssdGE10K_nogap.tab (which has a 10-kb cut-off). 13,782 of them are on chrUn. A summary table of WGAC intersected with WSSD is at http://eichlerlab.gs.washington.edu/help/linchen/zfinch/data/wgacCMPwssd.out.xls

12 General view showing WGAC (>5 kb) and WSSD on all chromosomes (not done yet; may be on large scaffolds) Grey above the lines is WSSD. Brown below the lines is WGAC.

13 Union of WSSD and WGAC (gene intersect with Seg Dups not available) A nonredundant union of WGAC and WSSD is generated with a cut-off size of 10 kb (AllDup10kb.tab). There are 3,839 NR regions with 50,902,487 bp, which is about 10 Mb more than WSSD alone. However, be aware that there may be false-positive sites, especially on chrUn, since we know there are many false-positive WGACs on chromosomes and chrUn.
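A nonredundant union of two interval sets amounts to merging overlapping intervals per sequence and then applying the size cut-off. The sketch below assumes half-open (start, end) coordinates and applies the 10-kb cut-off after merging; the slide does not specify either detail, and the function names and toy coordinates are made up for illustration.

```python
from collections import defaultdict

def merge_intervals(intervals):
    """Merge overlapping or touching (seq, start, end) intervals per sequence."""
    by_seq = defaultdict(list)
    for seq, start, end in intervals:
        by_seq[seq].append((start, end))
    merged = []
    for seq in sorted(by_seq):
        cur_start, cur_end = None, None
        for start, end in sorted(by_seq[seq]):
            if cur_end is None or start > cur_end:
                # No overlap with the running interval: emit it, start anew.
                if cur_end is not None:
                    merged.append((seq, cur_start, cur_end))
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        merged.append((seq, cur_start, cur_end))
    return merged

def nr_union(wgac, wssd, min_size_bp=10_000):
    """Nonredundant union of both sets, keeping regions >= min_size_bp."""
    merged = merge_intervals(list(wgac) + list(wssd))
    return [(s, a, b) for s, a, b in merged if b - a >= min_size_bp]

# Toy intervals on hypothetical scaffolds.
wgac = [("s1", 0, 6000), ("s1", 20000, 32000)]
wssd = [("s1", 5000, 12000), ("s2", 100, 500)]
regions = nr_union(wgac, wssd)
total_bp = sum(end - start for _, start, end in regions)
print(regions, total_bp)
```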

14 Summary table 1 (not available)
                        total           chrN            chrUn         No. nr interval   file
wssd (bp)               44,218,871      11,237,985      35,080,886    729               wssdGE10K_nogap.tab
wgac (bp)               384,501,909     232,493,308     152,008,601   7387              oo.weild10kb.join.all.cull
AllDup (bp)             394,988,746     235,022,961     159,965,785   5934              allDUP
Wssd and Wgac shared    8,195,577       3,182,128       5,013,449
Genome (bp)             1,233,186,341   1,057,961,026   175,225,315

15 Large SDs >=10 kb SDs >=10 kb in size were pulled out. There are a total of 3,839 intervals with a combined length of 50,902,487 bp in allDup.tab.

16 result

