Download presentation
Published byHollie Ray Modified over 9 years ago
1
Quality of services of a primary nucleotide sequence database
Hideaki Sugawara Center for Information Biology and DDBJ, National Institute of Genetics SOKEN-DAI.
2
Primary Biological Databases
International Nucleotide Sequence Database Collaboration (INSDC) serves communities JPO: Japan Patent Office DDBJ: DNA Data Bank of Japan NIG: National Institute of Genetics EMBL: European Molecular Biology Laboratory EBI: Europen Bioinfomatics Institute NCBI: National Center for Biotechnology Information EPO: European Patent Office USTPO: US Trademark and Patent Offrice JPO DDBJ NIG GenBank NCBI EMBL EBI EPO USTPO 2006/10/23 Primary Biological Databases
3
Infrastructure for the services
Informatics Contents Mangement 2006/10/23 Primary Biological Databases
4
Today I will introduce:
Web services (programmable interface) Derived databases Q&A databases A bug tracking system 2006/10/23 Primary Biological Databases
5
Mash-up of biological data resources
It has bee feasible to mash-up diverse databases and tools by hand to some extent. ① Connect to a journal database to find accession numbers of INSDC A. Web site of Journal database Users with Web browser ②. Move to INSDC to search amino acid sequences by the accession numbers found in the step ① B. Web site of Nucleotide database C. Web site of Protein Data Bank (PDB) ③. Move to the Protein Data Bank (PDB) to get the 3D structure. The three steps in the above are accomplished only by click, copy and paste by mouse. However, It is very difficult when dealing with a lot of data. Nucleic Acid ResearchのDB特別号に掲載されるデータベース数は、500をこえ、 自分の目的に沿った様々な生物情報資源を効率的に取得するためには、 Webブラウザを通じて、様々なデータベースや解析ツールにアクセスする必要がある。 例えば、この例で示しておりますのは、論文に記載されている配列データについて、タンパク立体構造が 判明しているのかどうかを調べたいときのアクセス例を示しております。 一件だけならば、このようなアクセスでも苦はないのですが、複数ですと、 取得は非常に困難になってしまます。より・・・ありと考えております。 Large-scale databases are produced in OMICS. Therefore, a machanical mash-up by program will be required more and more. 2006/10/23 Primary Biological Databases
6
Web services for the biological problem solving environment
We implemented a SOAP (Simple Object Access Protocol) server and Web services that provide a program-friendly interface. We propose standardization of bioinformatics services to improve the interoperability of desperately diverse biological data resources. 2006/10/23 Primary Biological Databases
7
The world of Web services
「Web services」 provider EBI, NCBI, ----- DDBJ UDDI Registration of Web services SOAP (HTTP/HTTPS) Search Web services WSDL XML Users UDDI (Universal Description, Discovery and Integration) WSDL (Web Services Description Language). 英語で説明を書く!!! 具体的なサービスの例として、Web servicesの提供者によって、UDDI()と呼ばれる利用可能なサービスの登録 が行われ、利用者はそのUDDIから自分が利用したいサービスの検索を行います。 次に、自分の利用したいサービスを見つけた利用者は、サービスが登録されている提供者から、 そのサイトで提供さえているサービスの一覧とアクセス方法が記載されているWSDLを取得し、 実際に、SOAPを介したサービスを使用することが出来ます。 WSDLとSOAPを介したサービスの提供はXMLファイルにより交換が行われます。 2006/10/23 Primary Biological Databases
8
XML Central of DDBJ: http://xml.nig.ac.jp/index.html
As DDBJ Web services server, We open the SOAP interfaces with the several services which offer by DDBJ. 2006/10/23 Primary Biological Databases
9
Workflow composed of methods provided by Web services
141 methods in 15 services are available. Blast ClustalW DDBJ Ensembl Fasta Getentry GIB GTOP NCBI Genome Annotation PML RefSeq SPS SRS TxSearch VecScreen Open OMIM Gene Ontology Rfam developed Custom made workflows 2006/10/23 Primary Biological Databases
10
Easy to bind Web services methods from your program
Download ActivePerl for WINDOWS and you can call methods in DDBJ Web services. Specify a WDSL file and call a method: In the following example, the method of getXML_DDBJEntry(accession) of the Getentry Web services is called from a Perl program: 1. Include the package needed to use SOAP service. 2. Specifies WSDL file of SOAP service you want to use. 3. Call the service you want to use. 2006/10/23 Primary Biological Databases
11
Workflow with Online Mendelian Inheritance in Man (OMIM)
Web services tool Data Local program 閾値に関しては、記載する。 This workflow reveal homology relationship between human disease genes and genes of other eukaryotes. 2006/10/23 Primary Biological Databases
12
Environmental DNA sequences automatic annotation workflow
In order to make full use of gene information included in nucleotide sequence database, we developed workflow of gene finding of DNA fragments obtained from pooled genome samples of uncultured microbes in environmental samples. BLASTP Environmental Sample EMBOSS (getorf) ORF Selected ORF Attached Product InterPro Scan FLOW Execute getorf in EMBOSS which finds and outputs the sequences of open reading frames (ORFs). The ORFs can be defined as regions of a specified minimum size (longer than 300bp) between START and STOP codons. (The minimum size cut-off for getorf was ) The protein product was attacheded by searching the bacterial division of DDBJ release 62 against predicted ORFs using BLASTP. (The e-value cut-off for BLASTP was 1e-40.) 遺伝子情報 We observed DNA fragments obtained from pooled genome samples of uncultured microbes in environmental and clinical samples. Evidence-based gene finding. Evidence in the form of protein alignments was used to determine the most likely coding frame. The approximate start and stop positions were likewise determined from the bounding coordinates of the alignments and refined to identify specific start and stop codons. This methodology was applied to several different lines of evidence in a series of steps as follows: First evidence was generated by searching the bacterial portion of the nraa dataset against the Sargasso data using tblastn. All blast searches were performed using NCBI blastall version Unless stated otherwise all blast searches were performed with these generic parameters: -Y F "m L" -U T -v 5. The NCBI non-redundant amino acid dataset (nraa) was downloaded from NCBI on September 2nd and contains 1,510,260 peptides of which 626,877 were classified as bacterial. The tblastn searches using the bacterial subset of nraa were performed with these additional parameters: -e 1e-4 -b K The blastx searches against all of nraa were done with these additional parameters: -e 1e-3 -b 5000 -K The tblastx searches used these parameters: -e 1e-3 -b K The search of the Sargasso data against itself using blastn used these parameters: -W 20 -K b q -9 -e 1e-40. 2006/10/23 Primary Biological Databases
13
Access frequency of DDBJ Web services
Examples of methods frequently used: getentry, blast, RefSeq、GO 、 2006/10/23 Primary Biological Databases
14
Web services in the world
EBI, NCBI, ---- DDBJ SOAP(HTTP/HTTPS) 利用者 (Web services map prepared by EBI) 2006/10/23 Primary Biological Databases
15
Primary Biological Databases
Derived databases 2006/10/23 Primary Biological Databases
16
>60M entries >100G bps ~ 0.5 projects INSDC is not only large-scale but also complex. It is not easy to retrieve what you want to anapyze.
17
Primary Biological Databases
A simple derived database: subset of complete microorganisms genomes from INSDC 2006/10/23 Primary Biological Databases
18
Gene prediction programs used
2006/10/23 Primary Biological Databases
19
Parameters: the cutoff length used
number >20 >(=)30 >33.3aa (100bp) >40aa >50aa >60aa >66.6aa (200bp) >80 >100aa >150aa >200aa >300aa >400aa 1 25 3 7 4 2 6 2006/10/23 Primary Biological Databases
20
COG ID in the qualifier of /note
Annotation: description of products ~ Hahella chejuensis KCTC 2396 CDS /gene="argB" /locus_tag="HCH_01027" /EC_number=" " /note="COG0548" /codon_start=1 /transl_table=11 /product="Acetylglutamate kinase" /protein_id="ABC " /db_xref="GI: " /translation="MLDRDNALQVAAVLSKALPYIQRFAGKTIVIKYGGNAMTDEELK NSFARDVVMMKLVGINPIVVHGGGPQIGDLLQRLNIKSSFINGLRVTDSETMDVVEMV LGGSVNKDIVALINRNGGKAIGLTGKDANFITARKLEVTRATPDMQKPEIIDIGHVGE VTGVRKDIITMLTDSDCIPVIAPIGVGQDGASYNINADLVAGKVAEVLQAEKLMLLTN IAGLMNKEGKVLTGLSTKQVDELIADGTIHGGMLPKIECALSAVKNGVHSAHIIDGRV PHATLLEIFTDEGVGTLITRKGCDDA" COG ID in the qualifier of /note ~ Archaeoglobus fulgidus DSM 4304 CDS complement( ) /locus_tag="AF_1280" /note="similar to GB:L77117 SP:Q60382 PID: percent identity: 56.06; identified by sequence similarity; putative" /codon_start=1 /transl_table=11 /product="acetylglutamate kinase (argB)" /protein_id="AAB " /db_xref="GI: " /translation="MENVELLIEALPYIKDFHSTTMVIKIGGHAMVNDRILEDTIKDI VLLYFVGIKPVVVHGGGPEISEKMEKFGLKPKFVEGLRVTDKETMEVVEMVLDGKVNS KIVTTFIRNGGKAVGLSGKDGLLIVARKKEMRMKKGEEEVIIDLGFVGETEFVNPEII RILLDNGFIPVVSPVATDLAGNTYNLNADVVAGDIAAALKAKKLIMLTDVPGILENPD DKSTLISRIRLSELENMRSKGVIRGGMIPKVDAVIKALKSGVERAHIIDGSRPHSILI ELFTKEGIGTMVEP" Some details of homology analysis in the /note qualifier ~ Agrobacterium tumefaciens C58 circular chromosome CDS complement( ) /gene="AGR_C_666" /note="acetylglutamate kinase PA5323 {imported} - Pseudomonas aeruginosa (strain PAO1)" /codon_start=1 /transl_table=11 /product="AGR_C_666p" /protein_id="AAK " /db_xref="GI: " /translation="MTSSESEIQARLLAQALPFMQKYENKTIVVKYGGHAMGDSTLGK AFAEDIALLKQSGINPIVVHGGGPQIGAMLSKMGIESKFEGGLRVTDAKTVEIVEMVL AGSINKEIVALINQTGEWAIGLCGKDGNMVFAEKAKKTVIDPDSNIERVLDLGFVGEV VEVDRTLLDLLAKSEMIPVIAPVAPGRDGATYNINADTFAGAIAGALHATRLLFLTDV PGVLDKNKELIKELTVSEARALIKDGTISGGMIPKVETCIDAIKAGVQGVVILNGKTP HSVLLEIFTEGAGTLIVP" Product ID in the qualifier of /product Product name in the qualifier of /note 2006/10/23 Primary Biological Databases
21
Annotation: description of products
unknown hypothetical protein probable orf predicted protein putative protein hypothetical conserved protein uncharacterized protein conserved domain protein conseved hypthetical 2006/10/23 Primary Biological Databases
22
Annotation: inconsistency or biological variation
Agrobacterium tumefaciens C58 circular chromosome (Cereon) Agrobacterium tumefaciens C58 circular chromosome (U. Washington) 2006/10/23 Primary Biological Databases
23
Primary Biological Databases
A derived database: contents of complete microorganisms genomes were evaluated Gene Trek in Procayote Space 2006/10/23 Primary Biological Databases
24
Primary Biological Databases
Features of GTPS Simultaneous prediction and evaluation of all the possible protein coding genes (ORFs) in prokaryote genomes All the ORFs are graded The criteria and evidence data are available Web site Updated once a year 2006/10/23 Primary Biological Databases
25
Primary Biological Databases
Short history of GTPS ver. (data froze in) strains Archaea Bacteria 2003 (Jul 2003) 123 14 109 2004 (Sep 2004) 183 17 166 2005 (Feb 2006) 302 25 277 2006/10/23 Primary Biological Databases
26
Not putative membrane nor unknown protein
Grade blastp hit InterProScan hit Coverage Subject Subject alignment subject query AAAA BBBB Function known motif Not putative membrane nor unknown protein ≧ 70% AAA BBB Unknown motif & or AA BB No hit AAAA1-D3 grades Potential genes alignment/ ORF Putative membrane or unknown protein Function known or unknown motif ≧ 70% A B ≧ 70% Putative membrane protein No hit C or Function known or unknown motif No hit ≧ 70% Unknown protein D No hit & ≧ 70% Unknown protein E No hit or X No hit No hit 2006/10/23 Primary Biological Databases
27
Potential genes (AAAA1 - D3)
Number of ORFs sorted by each grade GTPS ver. 2003 GTPS ver. 2004 GTPS ver. 2005 Grades AAAA-A BBBB-B C D E X Total 283,247 7,208 4,680 79,779 6,788 466,681 848,383 431,672 10,250 7,511 107,382 10,225 687,110 1,254,150 752,186 20,755 16,227 137,656 17,739 278,075 1,222,638 Glimmer2 and RBS finder Glimmer 3 Potential genes (AAAA1 - D3) 370,876 551,246 903,845 INSDC 362,828 537,312 904,530 2006/10/23 Primary Biological Databases
28
Comparison of the number of protein coding genes between GTPS and INSDC
GTPS ver. 2003 GTPS ver. 2004 GTPS ver. 2005 Identical 3' matched New ORFs (Not annotated in INSDC) Not predicted by Glimmer Total (potential genes) 261,720 92,206 14,954 1,996 370,876 70.6% 24.9% 4.0% 0.5% 390,557 133,146 23,935 3,608 551,246 70.8% 24.2% 4.3% 0.7% 576,762 252,365 31,438 43,280 903,845 63.8% 27.9% 3.5% 4.8% Glimmer 3 was used for ver.2005. 2006/10/23 Primary Biological Databases
29
How many genes are out there?
The ratio of the clustered ORFs to all the ORFs among the sampled genomes increased with the increasing number of the genomes. The ratio is not yet saturated at the 180 genomes (29.5% of the genes among 180 genomes were clustered and presumed to have the same function.). 2006/10/23 Primary Biological Databases
30
Primary Biological Databases
Summary It is the long term mission of the primary database to archive all the data published for the long term. At the same time, it is requested to provide objective/reliable data and tools for the mash-up. It proposes and maintains standards of biological data processing for the long term. 2006/10/23 Primary Biological Databases
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.