Slow and Steady: The Sea Urchin Genome Project David A. Schwarz Mentor: Dr. Andrew Cameron Site: California Institute of Technology.


1 Slow and Steady: The Sea Urchin Genome Project David A. Schwarz Mentor: Dr. Andrew Cameron Site: California Institute of Technology

2 Objective ► Curate the non-annotated, predicted genes of the sea urchin genome. ► Learn to annotate genes and register as many as possible at spbase.org

3 Importance ► The purple sea urchin: the only non-chordate deuterostome with a sequenced genome. ► It could help us understand the evolution of biological processes such as odor perception and immunity. ► Developments made in the project could benefit future genome projects.

4 Strongylocentrotus purpuratus ► Phylum: Echinodermata ► Radially symmetrical shell, 3 – 10 cm. ► Spines can reach 3 cm long. ► Moves slowly, feeding mostly on algae. ► Reproduces by external fertilization.

5 Phylogeny

6 Data Flow Estimated Set of 23,300 genes

7 Genome Sequencing ► WGS = Whole Genome Shotgun Sequencing (genome assembly named Spur_v0.5) ► CAPSS = Cloned-Array Pooled Shotgun Sequencing Strategy (genome assembly named Spur_v2.1)

8 Data Flow Estimated Set of 23,300 genes

9 Sequencing ► WGS: ► Extract DNA ► Digest ► Sequence the Fragments ► Assemble the genome. ► CAPSS: ► Combines WGS with BAC. ► Uses BACs as framework for genome assembly.

10 CAPSS

11 Data Flow Estimated Set of 23,300 genes

12 GLEAN: GLEAN Statistical Algorithm; gene predictors shown: Ensembl, Genscan, Gnomon

13 Discrepancy
► Spur_v0.5:
► 28,944 predicted
► ~10,044 annotated
► 18,944 non-annotated
► ~5,700 gene difference possibly due to:
  - 4-5% species polymorphism (E. Davidson, et al.)
  - Assembly error
  - Prediction error
► Spur_v2.1:
► 23,300 estimated
► Gene number reduced when duplicates overlap
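The "~5,700 gene difference" can be checked against the two totals quoted on the slide. A minimal sketch; the variable names are illustrative only:

```python
# Totals quoted on the slide: Spur_v0.5 predictions and the Spur_v2.1 estimate.
predicted_v05 = 28_944
estimated_v21 = 23_300

# Difference between the two assemblies' gene counts.
difference = predicted_v05 - estimated_v21
print(difference)  # 5644, which the slide rounds to ~5,700
```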

14 Methods ► Python Filtering ► Python Searching ► BioPython module: BLAST hit FASTA sequences ► Grep-like functions: GLEAN models by protein type; FASTA sequences in GLEAN protein database. Filter pattern: Infile: gene list. Check against: data file. If conditions are met: print to outfile.
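The infile / data file / outfile pattern on this slide might look roughly like the following. This is a sketch, not the project's actual script: the function name `filter_gene_list`, the dictionary standing in for the data file, and the score threshold of 100 are all assumptions made for illustration.

```python
import io
from typing import Callable, Dict, Iterable, TextIO


def filter_gene_list(
    infile: Iterable[str],
    data: Dict[str, str],
    condition: Callable[[str, str], bool],
    outfile: TextIO,
) -> int:
    """Write each gene from infile whose data record meets the condition."""
    kept = 0
    for line in infile:
        gene = line.strip()
        if not gene:
            continue  # skip blank lines in the gene list
        record = data.get(gene)
        if record is not None and condition(gene, record):
            outfile.write(gene + "\n")
            kept += 1
    return kept


# In-memory stand-ins for the real files (hypothetical contents).
gene_list = io.StringIO("GLEAN3_00003\nGLEAN3_00018\nGLEAN3_00008\n")
data_file = {"GLEAN3_00003": "38", "GLEAN3_00008": "240"}  # BLAST score per model
out = io.StringIO()

# Assumed condition: keep models whose score is at least 100.
kept = filter_gene_list(gene_list, data_file, lambda gene, rec: float(rec) >= 100, out)
print(kept, out.getvalue().split())
```

Passing the condition as a function keeps one loop reusable for every filtering stage; only the data file and the test change.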

15 Example List (columns: GLEAN model, top BLAST hit, score, e-value)
GLEAN3_00003  ref|NP_104627.1| hypothetical protein [Mesorhizobium loti] >gi|1...  38  0.48
GLEAN3_00004  ref|NP_788284.1| CG33087-PC [Drosophila melanogaster] >gi|232403...  40  0.19
GLEAN3_00005  ref|NP_509604.1| abnormal NUClease NUC-1, deoxyribonuclease DLAD...  69  4e-11
GLEAN3_00008  ref|XP_293875.3| similar to RIKEN cDNA B130016O10 gene [Homo sap...  240  5e-62
GLEAN3_00010  gb|AAH36744.1| FLJ11712 protein [Homo sapiens]  86  6e-16
GLEAN3_00011  gb|AAH36744.1| FLJ11712 protein [Homo sapiens]  143  3e-32
GLEAN3_00014  ref|NP_062642.1| ubiquitin-conjugating enzyme E2A, RAD6 homolog;...  229  2e-59
GLEAN3_00018  failed
GLEAN3_00019  failed
GLEAN3_00020  failed
GLEAN3_00021  ref|NP_196259.2| chaperone protein - related [Arabidopsis thalia...  110  4e-23
GLEAN3_00023  failed
GLEAN3_00024  sp|O42587|PRSA_XENLA 26S protease regulatory subunit 6A (TAT-bin...  130  1e-29
GLEAN3_00027  gb|AAD19348.1| reverse transcriptase-like protein [Takifugu rubr...  172  2e-41
GLEAN3_00028  gb|AAH53792.1| MGC64389 protein [Xenopus laevis]  164  3e-39
GLEAN3_00029  failed
GLEAN3_00030  ref|XP_060945.2| similar to Olfactory receptor 10T2 [Homo sapien...  54  5e-06
GLEAN3_00032  dbj|BAA22375.1| Nfrl [Xenopus laevis]  339  7e-92
GLEAN3_00033  ref|XP_354640.1| RIKEN cDNA D430035D22 gene [Mus musculus]  186  1e-45
GLEAN3_00034  dbj|BAC04242.1| unnamed protein product [Homo sapiens]  207  5e-52
GLEAN3_00037  dbj|BAC02921.1| zVeph-A [Danio rerio]  112  4e-23
GLEAN3_00038  ref|NP_004198.1| solute carrier family 16, member 3; monocarboxy...  44  0.008
GLEAN3_00039  failed
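A list like this could be read into records before any filtering. The sketch below assumes each entry sits on its own line with whitespace between the model name, the hit description, the score, and the e-value, and that a model without a hit is marked `failed`. The `BlastLine` type and `parse_line` function are hypothetical names, not part of the original scripts.

```python
from typing import NamedTuple, Optional


class BlastLine(NamedTuple):
    model: str
    hit: Optional[str]       # None when the BLAST search failed
    score: Optional[float]
    evalue: Optional[float]


def parse_line(line: str) -> BlastLine:
    """Split one list entry into model, hit description, score, and e-value."""
    model, rest = line.split(None, 1)
    rest = rest.strip()
    if rest == "failed":
        return BlastLine(model, None, None, None)
    # The score and e-value are the last two whitespace-separated fields.
    hit, score, evalue = rest.rsplit(None, 2)
    return BlastLine(model, hit, float(score), float(evalue))


lines = [
    "GLEAN3_00005 ref|NP_509604.1| abnormal NUClease NUC-1, deoxyribonuclease DLAD... 69 4e-11",
    "GLEAN3_00018 failed",
    "GLEAN3_00038 ref|NP_004198.1| solute carrier family 16, member 3; monocarboxy... 44 0.008",
]
records = [parse_line(entry) for entry in lines]

# Dropping the entries without a hit corresponds to the "No hits" condition.
with_hits = [r for r in records if r.hit is not None]
print([r.model for r in with_hits])
```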

16 Data Curation: Non-annotated genes (18,900) → Filtering by coordinates (18,761) → Filtering by mRNA expression (17,159) → Filtering by BLAST failures (14,014) → Filtering by sequence (9,469) → Filtering by reciprocal BLAST (5,519) → Filtering by protein quality (2,478). Condition: different name, same genome coordinates. Genes removed: 139.
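The coordinate condition (different name, same genome coordinates) amounts to de-duplicating on position. A minimal sketch, assuming each model carries a scaffold name with start and end positions and that the first model seen at a position is the one kept. The model names, scaffold names, and positions below are made up for illustration.

```python
from typing import Iterable, List, Tuple

Model = Tuple[str, str, int, int]  # (name, scaffold, start, end)


def drop_coordinate_duplicates(models: Iterable[Model]) -> List[str]:
    """Keep the first model seen at each (scaffold, start, end) position."""
    seen = set()
    kept = []
    for name, scaffold, start, end in models:
        key = (scaffold, start, end)
        if key in seen:
            continue  # same coordinates under a different name
        seen.add(key)
        kept.append(name)
    return kept


# Hypothetical models: model_2 repeats the coordinates of model_1.
models = [
    ("model_1", "scaffold_7", 1200, 4800),
    ("model_2", "scaffold_7", 1200, 4800),
    ("model_3", "scaffold_9", 300, 950),
]
unique_models = drop_coordinate_duplicates(models)
print(unique_models)  # ['model_1', 'model_3']
```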

17

18 Data Curation: Non-annotated genes (18,900) → Filtering by coordinates (18,761) → Filtering by mRNA expression (17,159) → Filtering by BLAST failures (14,014) → Filtering by sequence (9,469) → Filtering by reciprocal BLAST (5,519) → Filtering by protein quality (2,478). Condition: evidence for gene expression. Genes removed: 1,603.

19 Data Curation: Non-annotated genes (18,900) → Filtering by coordinates (18,761) → Filtering by mRNA expression (17,159) → Filtering by BLAST failures (14,014) → Filtering by sequence (9,469) → Filtering by reciprocal BLAST (5,519) → Filtering by protein quality (2,478). Condition: no hits. Genes removed: 3,145.

20 Data Curation: Non-annotated genes (18,900) → Filtering by coordinates (18,761) → Filtering by mRNA expression (17,159) → Filtering by BLAST failures (14,014) → Filtering by sequence (9,469) → Filtering by reciprocal BLAST (5,519) → Filtering by protein quality (2,478). Condition: exactly the same BLAST hit. Genes removed: 4,545.
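Models that share exactly the same BLAST hit can be found by grouping on the hit identifier. The sketch below uses four entries from the example list, in which GLEAN3_00010 and GLEAN3_00011 both hit gb|AAH36744.1|. How the shared groups were then resolved is not stated on the slide, so the code only reports them.

```python
from collections import defaultdict

# Top-hit accession per GLEAN model, taken from the example list.
top_hits = {
    "GLEAN3_00008": "ref|XP_293875.3|",
    "GLEAN3_00010": "gb|AAH36744.1|",
    "GLEAN3_00011": "gb|AAH36744.1|",
    "GLEAN3_00014": "ref|NP_062642.1|",
}

# Group model names under the hit they share.
groups = defaultdict(list)
for model, hit in top_hits.items():
    groups[hit].append(model)

# Only hits claimed by more than one model are of interest here.
shared = {hit: names for hit, names in groups.items() if len(names) > 1}
print(shared)  # {'gb|AAH36744.1|': ['GLEAN3_00010', 'GLEAN3_00011']}
```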

21 Data Curation: Non-annotated genes (18,900) → Filtering by coordinates (18,761) → Filtering by mRNA expression (17,159) → Filtering by BLAST failures (14,014) → Filtering by sequence (9,469) → Filtering by reciprocal BLAST (5,519) → Filtering by protein quality (2,478). Condition: successful reciprocal BLAST match. Genes removed: 3,952.

22 Reciprocal BLAST (diagram): sea urchin protein database (GLEAN) and NCBI nr database; proteins A, B, X, Y; result line: GLEAN_A, NCBI Protein B, (score), (e-value). Case shown: good reciprocal BLAST.

23 Reciprocal BLAST (diagram): sea urchin protein database (GLEAN) and NCBI nr database; proteins A, B, X, Y; result line: GLEAN_A, NCBI Protein B, (score), (e-value). Case shown: bad reciprocal BLAST.
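One common way to express the good and bad cases is a reciprocal best-hit test: a GLEAN model's best hit in the NCBI database must, when searched back against the GLEAN database, return the same model. The sketch below assumes the best hits in both directions are already available as dictionaries. The identifiers are placeholders echoing the A, B, X, Y labels on the slides, and the bad case shown is one possible failure, not necessarily the one drawn in the diagram.

```python
from typing import Dict


def is_reciprocal(model: str, forward_best: Dict[str, str], reverse_best: Dict[str, str]) -> bool:
    """True when the model's best NCBI hit has that same model as its best GLEAN hit."""
    hit = forward_best.get(model)
    if hit is None:
        return False  # no forward hit, so nothing to confirm
    return reverse_best.get(hit) == model


# Placeholder best hits: GLEAN -> NCBI (forward) and NCBI -> GLEAN (reverse).
forward_best = {"GLEAN_A": "NCBI_B", "GLEAN_X": "NCBI_Y"}
reverse_best = {"NCBI_B": "GLEAN_A", "NCBI_Y": "GLEAN_A"}

good = is_reciprocal("GLEAN_A", forward_best, reverse_best)  # B points back to A
bad = is_reciprocal("GLEAN_X", forward_best, reverse_best)   # Y points to A, not X
print(good, bad)  # True False
```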

24 Data Curation: Non-annotated genes (18,900) → Filtering by coordinates (18,761) → Filtering by mRNA expression (17,159) → Filtering by BLAST failures (14,014) → Filtering by sequence (9,469) → Filtering by reciprocal BLAST (5,519) → Filtering by protein quality (2,478). Conditions: names such as "hypothetical", "predicted", "unnamed". Genes removed: 3,041.
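The protein-quality condition reads as a keyword test on the hit description. A minimal sketch, assuming a case-insensitive substring match on the three words quoted on the slide; the slide says "such as", so the real list may have been longer. The descriptions are taken from the example list, and the function name is hypothetical.

```python
# Words quoted on the slide; the actual list may have included others.
LOW_INFORMATION_WORDS = ("hypothetical", "predicted", "unnamed")


def has_informative_name(description: str) -> bool:
    """False when the hit description contains any low-information word."""
    lowered = description.lower()
    return not any(word in lowered for word in LOW_INFORMATION_WORDS)


descriptions = [
    "hypothetical protein [Mesorhizobium loti]",
    "FLJ11712 protein [Homo sapiens]",
    "unnamed protein product [Homo sapiens]",
    "zVeph-A [Danio rerio]",
]
informative = [d for d in descriptions if has_informative_name(d)]
print(informative)
```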

25 Annotation Process: Search for sequences of proteins of similar type or domain (using the GLEAN DB and PFAM). Build a phylogenetic tree with Clustal X. Annotate the gene following Spbase guidelines. If necessary, research the protein type or its domains (using PFAM).

26

27 Contributions to Annotation ► AnnotationAssist.py: automates searching for families in the GLEAN database; autofetches sequences for Clustal X; stores everything in a unique directory based on GLEAN model name and family
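The source of AnnotationAssist.py is not shown, so the following is only a guess at how a per-model, per-family directory could be derived. The `annotation_dir` function, the `model_family` naming scheme, and the example family name are all assumptions.

```python
import tempfile
from pathlib import Path


def annotation_dir(base: Path, model: str, family: str) -> Path:
    """Create and return a directory named after the GLEAN model and family."""
    safe_family = "_".join(family.split())  # replace whitespace in the family name
    path = base / f"{model}_{safe_family}"
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demonstrate in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    created = annotation_dir(Path(tmp), "GLEAN3_00014", "ubiquitin-conjugating enzyme")
    created_name = created.name
    existed = created.is_dir()

print(created_name)  # GLEAN3_00014_ubiquitin-conjugating_enzyme
```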

28 References ► Polymorphism: R. J. Britten, A. Cetta, E. H. Davidson, Cell 15, 1175 (1978). ► CAPSS: W. W. Cai, R. Chen, R. A. Gibbs, A. Bradley, Genome Res. 11, 1619 (2001).

29 Acknowledgments ► Dr. Andrew Cameron ► David Felt ► Lauren Lee and Nowelle Ibarra ► SoCalBSI Staff and Coordinator ► SoCalBSI Participants ► Funding: NIH, NSF, DOE, Beckman Institute

