Center for Computational Genomics and Bioinformatics, University of Minnesota: Source View Community Integrative Bioinformatics (NSF) Arabidopsis (reference.


1 Center for Computational Genomics and Bioinformatics, University of Minnesota
Source View
Community
Integrative Bioinformatics (NSF)
Arabidopsis (reference organism)
All cereals (NSF)
Rice
Legumes
Soy EST (USB)
Soy Functional (NSF)
Medicago (NSF)
Trees
Pine EST (DOE)
Pine Functional (NSF)

2 Partnerships
Research Community Support: Shared Expertise and Knowledge
Bioinformatics Community
Plant Community
Metacomputing Community
Federal Support: Grants and Contracts
Corporate Support: Hardware, Software, and Data
Integrated Genomics

3 Application View
All public genomic data Sequence processing Similarity Searches Unigene Sets Diogenes Pipeline, Automation BioData

4 Application View
All public genomic data Sequence processing Similarity Searches Unigene Sets Diogenes Pipeline, Automation Genomics Desktop Functional Genomics Array Design SAGE Clustering Data Mining Visualization & Exploration BioData

5 Application View
Warehouse Multi-species Comparative Functional Genomics Metafam All public genomic data Sequence processing Similarity Searches Unigene Sets Diogenes Pipeline, Automation Genomics Desktop Functional Genomics Array Design SAGE Clustering Data Mining Visualization & Exploration Metabolic Pathway Reconstruction BioData Relational Genbank

6 The Genomics Grid
Distributed Computing: Condor, Globus, Sun Grid
Clusters of Workstations
High Performance Networking
– ATM / GBE / FCAL
– Internet 2
Special Purpose Hardware
– Time Logic “DeCypher”
Interoperable Software
– “Grid Aware” Applications
– Remote SQL Queries
– Java
Enterprise level data storage
– Oracle
High Throughput Genomics
Visual Exploration of Global Data Resources
Real Time, Visual Collaboration

7 Design Goals
Scalable - Provide a workload management solution for large scale bioinformatics processing
Extensible - Add new tools easily without modifying core components
Portable - Deliver functionality in heterogeneous environments
Collaborative - Combine processing resources to increase throughput

8 Underlying Components

9 Web Based Submission Tool
Client
Data Files
Metadata
Context
Unique Internal Identifiers: 55 56 57 58
Individual Data Items (Chromatograms or sequence files)
All metadata related to each individual sequence
XML format
“Preprocessing” database
Data submissions happen in batches, initiated by clients. File formats, processing requirements, and batch structure vary widely.
Data arrives at CCGB in a well structured format, amenable to automatic processing.

10 Data Submission Prototype
In this example of a data submission page, the user selects the appropriate data directory and uses Netscape’s file browser to upload the TAB-delimited spreadsheet file.

11 Metadata Required for Processing
Name: Type
– Sequence ID: String (used to identify which data file is associated with this metadata)
– Sequence Name: String (used for GSS# or EST# in GB submission)
– Experiment Type
– Data Type
– Date Sequenced: Date
– Seq Primer: Identifier for Primer (CBC maintained list)
– Contact Name: Identifier for NCBI Contact File (CBC maintained list)
– Citation: Identifier for NCBI Citation File (CBC maintained list)
– Library: Identifier for NCBI Library File (CBC maintained list)
– Class
– Organism: Identifier for organism (CBC maintained list)
– Send to DB
Some quality control checking is done at submission time to ensure that the metadata are consistent and correct. This includes a “spellcheck”-like feature that verifies that primers, citations, and similar entries refer to items known to CBC.
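The submission-time check described on this slide can be illustrated with a short Python sketch. It is not the tool shown in the presentation: the column names are taken from the metadata table, while the list contents, the set of required columns, and the function name `validate_rows` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical controlled vocabularies standing in for the maintained lists.
KNOWN = {
    "Seq Primer": {"T7", "SP6"},
    "Organism": {"Pinus taeda", "Glycine max"},
}
# Assumed set of columns that must be filled in for every sequence.
REQUIRED = ["Sequence ID", "Sequence Name", "Date Sequenced", "Seq Primer", "Organism"]


def validate_rows(tsv_text):
    """Return (row_number, message) pairs for problems in a TAB-delimited sheet."""
    problems = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        # Every required column needs a non-empty value.
        for field in REQUIRED:
            if not (row.get(field) or "").strip():
                problems.append((row_number, f"missing {field}"))
        # "Spellcheck": identifiers must appear in the matching maintained list.
        for field, allowed in KNOWN.items():
            value = (row.get(field) or "").strip()
            if value and value not in allowed:
                problems.append((row_number, f"unknown {field}: {value}"))
    return problems
```

Rejecting a batch at upload time, before any processing job is created, keeps malformed metadata out of the automated pipeline.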

12 Tasks in Processing Biological Data
Base Calling (Phred, Phran)
Vector Filter (VF4)
Artifact Filter (af)
BLAST (blast, blastx, tblastx, blastn)
Contig construction (Phrap)
Microarray Design
Primer Selection
Functional Analysis & Annotation
Submission to public repositories (Genbank)
Publication

13 TkBatch
User Interface
– Provide a configurable interface to a set of tools.
Batch Processing System
– Enable batch submission of thousands of jobs.
Dependency Management
– Define Directed Acyclic Graphs (DAGs) for process flow. A DAG is not a tree.
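The remark that a DAG is not a tree can be made concrete with a small Python sketch. It is an illustration only, not TkBatch code: the task names and the function `execution_order` are hypothetical, and the ordering uses Kahn's algorithm, a standard way to order a dependency graph. In the example, `report` waits on two parents, which a tree cannot express.

```python
from collections import deque


def execution_order(dependencies):
    """Order tasks so each runs after everything it waits for (Kahn's algorithm).

    `dependencies` maps a task name to the tasks it must wait for.
    """
    tasks = set(dependencies)
    for parents in dependencies.values():
        tasks.update(parents)
    # Unfinished prerequisites per task, and the reverse (parent -> children) map.
    remaining = {t: set(dependencies.get(t, ())) for t in tasks}
    children = {t: [] for t in tasks}
    for task, parents in remaining.items():
        for parent in parents:
            children[parent].append(task)
    ready = deque(sorted(t for t, parents in remaining.items() if not parents))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in sorted(children[task]):
            remaining[child].discard(task)
            if not remaining[child]:
                ready.append(child)
    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order


# Hypothetical process flow: "report" has two parents, so this is a DAG, not a tree.
FLOW = {
    "vector_filter": ["base_call"],
    "blast": ["vector_filter"],
    "assemble": ["vector_filter"],
    "report": ["blast", "assemble"],
}
```

Calling `execution_order(FLOW)` places `report` after both `blast` and `assemble`; a cyclic definition raises `ValueError` instead of looping.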

14 TkBatch – Use Outline
Watchlist: Directed Acyclic Graph of processes which will act on the input data
File List: Input data, possibly selected from diverse locations in the file system
Compile to – Job Description: Enumerates all tasks included in the job, all job dependencies, as well as a “status journal” indicating progress through the tasks.
Submit to – Distributed Processing
– CONDOR metacomputing platform
– Similar to GLOBUS and Sun’s GRID
– Uses idle workstations to perform processing tasks
– Dependency
Observe through TkBatch
– Building process monitoring capabilities into the TkBatch system.
– Obtaining CONDOR source code to make improvements directly.
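The "compile to" step in this outline, which turns a watchlist plus a file list into a job description, might look roughly like the following Python sketch. The data layout, the status values `pending` and `done`, and the function names are assumptions for illustration; the slides do not show the actual TkBatch job description format.

```python
def compile_job(watchlist, file_list):
    """Expand a watchlist (tool -> prerequisite tools) over every input file.

    Returns a job description with all tasks, their dependencies, and a
    status journal that starts with every task marked "pending".
    """
    tasks = {}
    for path in file_list:
        for tool, prerequisites in watchlist.items():
            tasks[f"{tool}:{path}"] = {
                "tool": tool,
                "input": path,
                "depends_on": [f"{p}:{path}" for p in prerequisites],
            }
    journal = {task_id: "pending" for task_id in tasks}
    return {"tasks": tasks, "journal": journal}


def runnable(job):
    """Pending task ids whose prerequisites are all marked done in the journal."""
    return [
        task_id
        for task_id, task in job["tasks"].items()
        if job["journal"][task_id] == "pending"
        and all(job["journal"][dep] == "done" for dep in task["depends_on"])
    ]
```

Keeping progress in a journal next to the task list means an interrupted job can be resumed by asking `runnable` which tasks are still outstanding.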

15 Application Configuration
Some system abstraction, but still a very “close to the road” interface
Tools cannot be selected unless they are appropriate to the current output type in the watchlist.
Reasonable defaults are provided for command line options.
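The selection rule on this slide, where a tool is offered only if it accepts the watchlist's current output type, can be sketched in Python as follows. The tool table, the type names, and both function names are hypothetical; the Phrap defaults reuse the `minmatch="40"` and `minscore="80"` values that appear on the BioData File Tree slide.

```python
# Hypothetical tool table: name -> (input type, output type, default options).
TOOLS = {
    "phred": ("chromatogram", "sequence", {}),
    "vector_filter": ("sequence", "sequence", {}),
    "phrap": ("sequence", "contig", {"minmatch": 40, "minscore": 80}),
    "blastn": ("sequence", "blast_report", {}),
}


def selectable_tools(current_type):
    """Names of the tools whose input type matches the current output type."""
    return sorted(name for name, spec in TOOLS.items() if spec[0] == current_type)


def append_tool(watchlist, current_type, name, **overrides):
    """Append a tool with its defaults (plus overrides); return the new output type."""
    accepts, produces, defaults = TOOLS[name]
    if accepts != current_type:
        raise ValueError(
            f"{name} expects {accepts}, but the watchlist produces {current_type}"
        )
    watchlist.append((name, {**defaults, **overrides}))
    return produces
```

Checking types as each tool is added means an invalid chain is rejected while the watchlist is being built, before any job is submitted.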

16 Analysis Tools
RelGB
– A simple relational framework for GenBank Data
– Java based UI for biologically relevant queries
SSR Identification & primer design for ESTs
– All; UTR; BAC-end; BAC EST contigs: Diogenes-Blast; Primer3
– Analysis Tools

17 BioData Summary
PERL and CGI scripts, operating on XML indexes to data directories
Creates a set of predefined web views on the data
http://web.ahc.umn.edu/biodata
Grant Summary Grant Info Grant Statistics Contig list Submission Set List Submission Sequence Length Distribution Submission Set Visualization Search BLAST reports Sequence List Contig Sets Contig Info Table Phrap Parameters Submissions in the Contig Set Contig Quality Graphs Sequences in the Contig Set Contig Page Sequence Info Contig Visualization Sequence Analysis Tools BLAST Reports Sequence Info Raw Sequence Filtered Sequence Sequence Quality Graph Sequence Analysis Tools BLAST Reports Project Statistics Number of sequences Number of submissions Length Statistics Contig Statistics Quality Statistics

18 BioData File Tree
contig_dir_###
|
+-index.xml
| | <contigdata
| | kingdom="Planta"
| | family="Pinaceae"
| | species="Pinus taeda"
| | files="contigs"
| | >
| |
| | Xylem
| |
| | NXNV
| |
| | 991206a 991206b 991207a 991207b
| | 20000103a 20000217a 20000515a 20000612
| | 20000103b 20000221a 20000515b 20000613a
| | 20000103c 20000328a 20000515c 20000613b
| |
| | <phrapparams
| | minmatch="40"
| | minscore="80"
| | >
| |
| | <contigversionlist
| | AssemblyProcessId="PtaedaNormalXylem"
| | AssemblyProcessVersion="1"
| | AssemblyStepNumber="1"
| | >
| |
| +-libraryname.fasta.screen.ace.1
| |
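A script that builds web views from such an index would start by reading its attributes. The Python sketch below shows one way to do that with the standard library. The transcript preserves only fragments of `index.xml`, so the sample document here is an assumption: it reuses the attribute names and values visible above, but the nesting of `phrapparams` inside `contigdata` and the self-closing form are guesses, and the elements whose tags were lost are left out.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified stand-in for index.xml; the nesting is a guess.
SAMPLE_INDEX = """\
<contigdata kingdom="Planta" family="Pinaceae" species="Pinus taeda" files="contigs">
  <phrapparams minmatch="40" minscore="80"/>
</contigdata>
"""


def read_index(xml_text):
    """Extract the species, file prefix, and Phrap parameters from an index."""
    root = ET.fromstring(xml_text)
    params = root.find("phrapparams")
    return {
        "species": root.get("species"),
        "files": root.get("files"),
        # Phrap parameters are stored as attribute strings; convert to integers.
        "phrap": {k: int(v) for k, v in params.attrib.items()} if params is not None else {},
    }
```

With the sample above, `read_index(SAMPLE_INDEX)["phrap"]` yields the `minmatch` and `minscore` values as integers, ready to display in a view such as the Phrap Parameters page.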

19 CCGB Condor Cluster
65 processors on 37 machines
Performance
– 4.75 Gflops
– 25 BIPS
– 19 GB memory
– Figures are roughly equivalent to a 16 processor IBM SP2
Customized usage policies

