1 The Centralized Life Sciences Data (CLSD) service Michael Grobe Scientific Data Services Research Computing University Information Technology Services.

Presentation transcript:

1 The Centralized Life Sciences Data (CLSD) service Michael Grobe Scientific Data Services Research Computing University Information Technology Services Indiana University at Indianapolis January 2007

2 Outline
Basic genome science processes and vocabulary
Basic relational algebra
Simple SQL as an expression of the relational algebra
DB2 and the Federated Server
CLSD data sources: “relationalized”, mirrored, and federated
Accessing CLSD
Directions for possible future work:
  Adding data sources
  Integrating more completely with the TeraGrid
  Integrating with other Grids
Questions, suggestions

3 The chemistry
A “polymer” is a chemical composed of many similar units, e.g. polyvinyl chloride, starches, etc.
DNA is a (usually double-stranded) polymer composed of nucleotides whose bases are Thymine, Adenine, Cytosine, and Guanine.
DNA carries genetic information. Individual units of genetic information are stored in individual (possibly quite long) segments of DNA.
RNA is a (usually single-stranded) polymer composed of nucleotides whose bases are Uracil, Adenine, Cytosine, and Guanine.
There are many varieties of RNA (mRNA, snRNA, rRNA, snoRNA, etc.), and they serve different functions within a cell. For example, RNA “transfers” genetic information, catalyses reactions, and otherwise assists or interferes with reactions.

4 The chemistry II
Polymers are synthesized by catalysts called “polymerases” in a process called “polymerization.”
Proteins are polymers composed of (over 20 different kinds of) amino acids, such as: Methionine (M), Isoleucine (I), Cysteine (C), Histidine (H), Alanine (A), Glutamic acid (E), Leucine (L), etc.
Proteins:
  provide structure: microfilaments (polymers of actin), microtubules (polymers of tubulins), channels through the cell wall, etc.,
  catalyse and co-catalyse reactions, as “enzymes,”
  bind with DNA to enhance or inhibit “transcription” and “translation”,
  are sometimes marked for transport or degradation.
Protein primary, secondary and tertiary structures are important.
Proteins are degraded within proteasomes.

5 From Atherly,et al., 1999 Genetic material: 2 meters of DNA packaged into less than 1.4 microns

6 The central model of molecular genetics
DNA can be reliably replicated during the process of cell division by DNA-dependent DNA polymerases.
DNA can be “transcribed” to messenger RNA (mRNA) by DNA-dependent RNA polymerases. Transcription takes place in the nucleus (or equivalent).
mRNA is transported to the cytoplasm, where “ribosomes” use it as a template for creating proteins in a process called “translation.”
Translation produces 1 amino acid for each sequence of 3 bases (“triplet”) in the mRNA. The function mapping each of the 64 possible triplets to an amino acid (or to a stop signal) is the “genetic code.”
Ribosomes are complexes of RNA and protein.
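
The transcription and translation steps on this slide can be sketched in a few lines of Python. This is only an illustration, not part of CLSD: the function names are made up for the example, the table is the standard genetic code written in the conventional U, C, A, G order, and the input is treated as the DNA coding strand so that transcription reduces to replacing T with U.

```python
from itertools import product

# Standard genetic code in the conventional order (first, second, third base
# each cycling through U, C, A, G). "*" marks a stop triplet.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA matching a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA one triplet at a time and stop at the first stop triplet."""
    protein = []
    for i in range(0, len(mrna) - len(mrna) % 3, 3):
        amino_acid = GENETIC_CODE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGAACTGATTTAA")
    print(len(GENETIC_CODE), mrna, translate(mrna))
```

The dictionary has one entry for each of the 64 triplets, which is the “function” the slide calls the genetic code.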

7 The central model within the cell Diagram from: (Don’t forget about degradation and recycling of AAs.)

8 The central model in more detail (Graphics of DNA and RNA from Atherly, et al. 1999)

9 The central model in even more detail (Graphics of DNA and RNA from Atherly, et al. 1999)

10 Mutations and polymorphisms
               Nucleotide sequence   Translated AA sequence
Wildtype:      ACTGAACTGATT          Thr-Glu-Leu-Ile
Substitution:  ACTGACCTGATT          Thr-Asp-Leu-Ile
Deletion:      ACTCTGATT             Thr-Leu-Ile
Insertion:     ACTGAACCTGAACTGATT    Thr-Glu-Pro-Glu-Leu-Ile
If mutations like these occur in genetic material within oocytes, they may be transmitted to offspring, and define “polymorphic” gene variations.
A Single Nucleotide Polymorphism (SNP) is a variation where one base is changed and passed on to offspring (and occurs with sufficient frequency).
A Deletion/Insertion Polymorphism (DIP) is a variation where multiple bases have been removed or inserted into a sequence.
dbSNP is a database of SNPs and DIPs containing millions of entries, and over 120K unique sequences that are inserted or deleted.
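
The effect of each mutation can be checked by translating the four nucleotide sequences from this slide. The sketch below is illustrative Python, not CLSD code; it reads the sequences as DNA triplets in the first reading frame and uses the standard genetic code, and the helper names are invented for the example.

```python
from itertools import product

# Standard genetic code for DNA triplets (bases ordered T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# One-letter to three-letter amino acid abbreviations ("Ter" for a stop).
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "Ter",
}


def translate(dna: str) -> str:
    """Translate complete triplets of a DNA sequence into three-letter codes."""
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "-".join(THREE_LETTER[CODON_TO_AA[codon]] for codon in codons)


# The four sequences shown on the slide.
VARIANTS = {
    "Wildtype": "ACTGAACTGATT",
    "Substitution": "ACTGACCTGATT",
    "Deletion": "ACTCTGATT",
    "Insertion": "ACTGAACCTGAACTGATT",
}

if __name__ == "__main__":
    for name, sequence in VARIANTS.items():
        print(f"{name:13s}{sequence:20s}{translate(sequence)}")
```

Because the deletion and insertion here change the length by a multiple of three, the downstream triplets stay in frame; a change of one or two bases would shift every later triplet.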

11 Exons, introns and isoforms in eucaryotes

12 Exons, introns and isoforms II Alternative splicing products (isoforms) can be derived from the same gene, so that one gene can code for multiple proteins. Both protein-coding and non-protein-coding genes may be embedded within introns, and may be “co-expressed.” The spliceosome is composed of a collection of protein and small nuclear RNA molecules (snRNA). Almost every human gene is thought to have at least 2 isoforms. The set of all isoforms is sometimes called the “transcriptome.”

13 Scale of human genome data
Total number of bases: 3.2Gbp (DNA from one half of one chromosome (chromatid) from each of 24 chromosomes: 22 autosomal chromosome pairs plus the sex chromosomes.)
Percentage of genome consisting of protein coding genes: < 2%
Average gene length: ~3Kbp (but up to 2.4Mbp)
Average exon length: 200bp
Average protein length: AA
Percentage of “junk” DNA: often said to be ~50%
Percentage of “junk” DNA now suspected to be transcribed (the “dark matter” of the genome): ~50 to 100%
Some of that junk is mRNA that negatively regulates translation.

14 The “promoter” region: A landing site for a procaryotic RNA Polymerase

15 Transcription factors, activators, enhancers: What is a “gene” Such sites may be several thousand base pairs upstream of a start site, and even downstream of a start site. Some are even in introns. Control of cell processes occurs at every step in the protein lifecycle: transcription, translation, transport, degradation.

16 "We can no longer think of a gene as a simple region of DNA that transcribes RNA for the sole purpose of making proteins," "The reality is that a single gene may be a large region of DNA from which a whole cast of RNA molecules are transcribed, all of which are expressed in a coordinated fashion to provide a biological function." Tom Gingeras, Affymetrix

17 Process control: cancer-related reaction pathways from Hanahan, et al.

18 Basic relational algebra
The relational algebra operates on relations, which are sets of tuples of the same arity, which is to say, collections of lists of the same length.
Here are two 4-tuples: ( 1, 2, 3, 4 ) ( 8, 7, 9, 4 )
Relations are commonly represented as tables.
There are 5 primitive operations within the relational algebra:
  Projection: extract specific columns from a relation
  Selection: extract specific rows
  Set union: create a new table composed of all the rows of two other tables
  Set difference: remove the rows in one relation that appear in another
  Cartesian product: “multiply” two tables to create a third
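
As a rough illustration of the five primitives, a relation can be modeled in Python as a set of equal-length tuples. The function names and the second relation S are made up for this sketch; R holds the two 4-tuples from the slide.

```python
# A relation is modeled as a set of tuples that all have the same arity.
R = {(1, 2, 3, 4), (8, 7, 9, 4)}
S = {(8, 7, 9, 4), (5, 5, 5, 5)}   # hypothetical second relation


def project(relation, columns):
    """Projection: keep only the listed column positions (0-based)."""
    return {tuple(row[c] for c in columns) for row in relation}


def select(relation, predicate):
    """Selection: keep only the rows for which the predicate is true."""
    return {row for row in relation if predicate(row)}


def union(a, b):
    """Set union: all rows that appear in either relation."""
    return a | b


def difference(a, b):
    """Set difference: rows of a that do not appear in b."""
    return a - b


def cartesian_product(a, b):
    """Cartesian product: every row of a concatenated with every row of b."""
    return {row_a + row_b for row_a in a for row_b in b}


if __name__ == "__main__":
    print(project(R, [0, 3]))
    print(select(R, lambda row: row[0] > 5))
    print(union(R, S))
    print(difference(R, S))
    print(cartesian_product(R, S))
```

Because relations are sets, projecting both rows of R onto their last column yields a single tuple, `(4,)`: duplicates disappear automatically.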

19 Cartesian product in more detail Cartesian product (arity: 4 + 3; length: 3 * 2) Table2 (arity 3; length 2) Table1 (arity 4; length 3)
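
The arity and length arithmetic on this slide can be reproduced with two small tables. The table contents below are invented for the illustration (the slide's own tables are a graphic); only the shapes, arity 4 with 3 rows and arity 3 with 2 rows, follow the slide.

```python
from itertools import product

# Hypothetical tables with the shapes named on the slide.
table1 = [                      # arity 4, length 3
    (1, "a", "x", 10),
    (2, "b", "y", 20),
    (3, "c", "z", 30),
]
table2 = [                      # arity 3, length 2
    ("p", 0.1, True),
    ("q", 0.2, False),
]

# Every row of table1 is paired with every row of table2.
cartesian = [row1 + row2 for row1, row2 in product(table1, table2)]

if __name__ == "__main__":
    print("arity:", len(cartesian[0]), "length:", len(cartesian))
    for row in cartesian:
        print(row)
```

The result has arity 4 + 3 = 7 and length 3 * 2 = 6, which is why unrestricted products of large tables grow so quickly.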

20 Relational databases and query languages
Database management systems based on the relational algebra were described by Edward F. Codd, working for IBM, in the early 1970s.
Codd’s formulation included: indexes and keys, decomposition into normal forms, and integrity constraints.
Multiple languages and interfaces were developed to query and modify collections of relations, among them the Structured English Query Language, SEQUEL, developed by Chamberlin and Boyce.

21 SQL as an implementation of the relational algebra
The most successful such language, SQL, was based on SEQUEL, and maps to the relational primitives as follows:
Projection:         select fieldname_list from tablename
Selection:          select * from tablename where condition
Union:              (select fieldname_list from tablename1) union (select fieldname_list from tablename2)
                    (use UNION ALL to keep duplicates)
Set difference:     (select * from tablename1) except (select * from tablename2)
Cartesian product:  select * from tablename1, tablename2
Note that SQL does not specify how to perform a query; only what the result should be. It is a “declarative,” rather than “procedural,” language.
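
The mapping can be tried out without a DB2 installation. The sketch below uses Python's built-in sqlite3 module as a stand-in for DB2 (an assumption made for convenience; SQLite accepts the same basic forms), with two tiny made-up tables t1 and t2.

```python
import sqlite3

# SQLite (in memory) stands in for DB2; the tables are invented examples.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table t1 (a integer, b integer);
    create table t2 (a integer, b integer);
    insert into t1 values (1, 2), (3, 4);
    insert into t2 values (3, 4), (5, 6);
""")


def rows(sql):
    """Run a query and return its rows in sorted order."""
    return sorted(con.execute(sql).fetchall())


projection = rows("select a from t1")
selection = rows("select * from t1 where a > 1")
union = rows("select a, b from t1 union select a, b from t2")
union_all = rows("select a, b from t1 union all select a, b from t2")
difference = rows("select a, b from t1 except select a, b from t2")
cartesian = rows("select * from t1, t2")

if __name__ == "__main__":
    for name in ("projection", "selection", "union", "union_all",
                 "difference", "cartesian"):
        print(name, globals()[name])
```

The row (3, 4) appears in both tables, so `union` returns three rows while `union all` returns four.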

22 IBM’s DB2 and WebSphere Federated Server, nee Information Integrator, nee DiscoveryLink DB2 is a fully-featured relational database system that can house and serve large databases. Data is usually imported in relational form, structured as rows composed of individual data values, possibly identified by unique IDs (keys). DB2 can also access data in tables managed by other, usually physically remote, database management systems, such as Oracle, MySQL or DB2. This process is known as “data federation.” DB2 can also federate some external resources that are not normally accessed as relational tables (e.g. Blast). Such resources are transformed, or “relationalized” on-the-fly by “wrappers”. Once these resources have been registered with their wrappers they may be referred to within SQL queries as is any other resource.

23 WFS diagram from Del Prete

24 Some WFS jargon
Wrapper: a library to access a particular class of data sources or protocols. Each wrapper contains information about data source characteristics. There are BLAST and PubMed wrappers, and now a “generic Script wrapper” that talks to user scripts.
Server: represents a specific data source (user mappings may be required for authentication).
Nickname: a local table name (alias) for data on a server (mapped to rows and columns).
A nickname looks like a table, but links to a server, which links to a wrapper/data source, where the wrapper knows how to process the data from the source.

25 Use of the generic Script wrapper (Drawing design courtesy of Doug Del Prete)

26 Using NCBI data within DB2: More than just mirroring Mirroring usually implies maintaining exact copies of data sources. Most data mirrored by CLSD must not only be copied, but also inserted into the CLSD relational structure. This is accomplished by a series of scripts that: Download the data from its external site, Convert it to a form that can be used to update CLSD tables, Insert the data into tables, and Monitor the overall process to identify and log errors. These scripts are run regularly from crontab entries, and monitoring results are examined after every run.
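
The four script stages (download, convert, insert, monitor) could be organized along the following lines. This is only a sketch under stated assumptions, not the CLSD scripts: the download step is replaced by a hard-coded two-line sample, SQLite stands in for DB2, and the table and function names are hypothetical.

```python
import csv
import io
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("mirror")


def download() -> str:
    """Stand-in for fetching a flat file from the external site."""
    return (
        "ec_number\tname\n"
        "1.1.1.1\talcohol dehydrogenase\n"
        "2.6.1.1\taspartate transaminase\n"
    )


def convert(text: str) -> list:
    """Turn the tab-separated text into rows ready for insertion."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [(record["ec_number"], record["name"]) for record in reader]


def insert(con: sqlite3.Connection, rows: list) -> int:
    """Load the rows into a (hypothetical) enzyme table, replacing old copies."""
    con.execute(
        "create table if not exists enzyme (ec_number text primary key, name text)"
    )
    con.executemany("insert or replace into enzyme values (?, ?)", rows)
    con.commit()
    return len(rows)


def run_mirror(con: sqlite3.Connection) -> bool:
    """Run all stages and log the outcome so each run can be reviewed."""
    try:
        count = insert(con, convert(download()))
        log.info("mirror run loaded %d rows", count)
        return True
    except Exception:
        log.exception("mirror run failed")
        return False


con = sqlite3.connect(":memory:")
ok = run_mirror(con)
```

In a setup like the one described on the slide, a script of this kind would be started from a crontab entry and its log examined after each run.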

27 CLSD “relationalized” data sources
BIND -- Pathways, Gene interactions
ENZYME -- Enzyme nomenclature
ePCR -- ePCR results of UniSTS vs Homo sapiens
KEGG data sources:
  LIGAND -- Pathways, Reactions, & Compounds
  PATHWAY -- Pathway map coordinates
NCBI data sources:
  LocusLink -- Genetic Loci. (LocusLink has been inactive since July 1, 2005, when it was retired in favor of Entrez Gene.)
  UniGene -- Gene clusters
SGD -- Saccharomyces Genome Database

28 KEGG datasource info
PATHWAY: 42,273 pathways generated from 306 reference pathways
LIGAND: 14,238 compounds, 4,111 drugs, 10,951 glycans, 6,810 reactions, 7,127 reactant pairs

29 CLSD federated data sources
Federated NCBI data sources (subject to hit rate throttling):
  Nucleotide -- Nucleotide sequences
  PubMed -- Journal abstracts
Federated local mirrors of NCBI data sources (not throttled):
  Blast (updated monthly) is mirrored by UITS
  dbSNP (updated at major builds) is mirrored by IUSM
Some KEGG resources are federated via the FS KEGG user-defined functions

30 BLAST: Both mirrored and federated
NCBI Blast is typically accessed via a web page at NCBI or at some mirror site. Data is returned in a typical web interface format suitable for users.
Within CLSD, BLAST is accessed via an SQL query, and data is returned as a table that can be manipulated like any other DB2 table.
For example, here is an SQL query that invokes a blastall process running on libra00 from within DB2:
  select GB_ACC_NUM, description, e_value
  from ncbi.BLASTN_NT
  where BlastSeq = 'AGTACTAGCTAGCTAGCTACTAGCTGACTGACTGACTGATGCATCGATGATGC'
The local version of blastall conducts the search and returns results encoded within XML (by specifying the -m 7 parameter).

31 The DB2 federation software converts the XML encoded results into something like this:
GB_ACC_NUM  DESCRIPTION                                                                                           E_VALUE
(VARCHAR)   (VARCHAR)                                                                                             (DOUBLE)
AE003644    Drosophila melanogaster chromosome L, section 53 of 83 of the complete sequence
AE003410    Drosophila melanogaster, chromosome L, region 34C4-36A7 (Adh region), section 4 of 10 of the comple
AC092228    Drosophila melanogaster, chromosome L, region 35X-35X, BAC clone BACR21J17, complete sequence
AP008207    Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 1, complete sequence
AP003197    Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 1, BAC clone:B1015E06
AP003105    Human DNA sequence from chromosome 1, putative argumentativeness gene GROBE1
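
To give a feel for what “relationalizing” XML output involves, the sketch below flattens a BLAST-style XML document into (accession, description, e-value) rows. It is not the DB2 wrapper, whose internals are not described here. The element names follow the NCBI BLAST XML output format as commonly documented; the sample document, its accession numbers, and its e-values are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# A made-up, heavily trimmed document in the style of BLAST XML output.
SAMPLE_XML = """
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>hypothetical sequence one</Hit_def>
          <Hit_accession>XX000001</Hit_accession>
          <Hit_hsps><Hsp><Hsp_evalue>1e-20</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
        <Hit>
          <Hit_def>hypothetical sequence two</Hit_def>
          <Hit_accession>XX000002</Hit_accession>
          <Hit_hsps><Hsp><Hsp_evalue>0.003</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def relationalize(xml_text: str) -> list:
    """Return one (accession, description, e_value) row per HSP of each hit."""
    root = ET.fromstring(xml_text)
    table = []
    for hit in root.iter("Hit"):
        accession = hit.findtext("Hit_accession")
        description = hit.findtext("Hit_def")
        for hsp in hit.iter("Hsp"):
            table.append((accession, description, float(hsp.findtext("Hsp_evalue"))))
    return table


if __name__ == "__main__":
    for row in relationalize(SAMPLE_XML):
        print(row)
```

Once the nested hits are flattened into rows like these, ordinary SQL operations such as selection on the e-value column become possible.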

32 Modifying BLAST search settings via SQL
Parameters sent to blastall can be set by using equality comparisons as assignment statements within SQL conditionals, as in:
  select Score, E_Value, HSP_Info, HSP_Q_Seq, HSP_H_Seq
  from ncbi.BLASTN_NT
  where BlastSeq = 'gagttgtcaatggcgagg'
    and gapcost = 8
    and E_Value < .0005
which will pass gapcost and e-value settings on to blastall.

33 BLAST data sources available via CLSD
Here is a list showing which search types are supported by the DB2 BLAST wrapper within CLSD.
BLAST search type: Data sources
BLASTN: NT, EST_HUMAN, EST_MOUSE, and EST_OTHER
  A nucleotide sequence is compared with the contents of a nucleotide sequence database.
BLASTP: NR, SP
  An amino acid sequence is compared with the contents of an amino acid database.
BLASTX: NR, SP
  A nucleotide sequence is compared with the contents of an amino acid sequence database. Query is translated in all six reading frames.
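
“All six reading frames” means three starting offsets on the given strand plus three on its reverse complement. The following Python sketch shows the idea with the standard genetic code; it is an illustration only and says nothing about how BLASTX itself is implemented. The frame labels (+1 to +3, -1 to -3) and function names are choices made for this example.

```python
from itertools import product

# Standard genetic code for DNA triplets (bases ordered T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete triplets; stop triplets are shown as '*'."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna: str) -> dict:
    """Translate a nucleotide query in all six reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


if __name__ == "__main__":
    for frame, peptide in six_frames("ATGGCC").items():
        print(frame, peptide)
```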

34 User-defined functions (supplied by IBM)
There exist special functions for manipulating sequence patterns:
  LSPatternMatch
  LSPrositePattern
To get a list of (aspartate aminotransferase) BLAST results filtered by a (pyridoxal phosphate attachment site) pattern specified in PROSITE pattern language:
  select gb_acc_num, HSP_H_SEQ
  from ncbi.blastp_nr
  where blastseq='MSQICKRGLLISNRLAPAALRCKSTWFSEVQMGPPDAILGVTE\
AFKKDTNPKKINLGAGAYRDDNTQPFVLPSVREAEKRVVSRSLDKEYATIIGI\
PEFYNKAIELALGKGSKRLAAKHNVTAQSISGTGALRIGAAFLAKFWQGNREI\
YIPSPSWGNHVAIFEHAGLPVNRYRYYDKDT'
    and DB2LS.LSPatternMatch(HSP_H_SEQ, DB2LS.LSPrositePattern( '[GS]-[LIVMFYTAC]-[GSTA]-K-x(2)-[GSALVN].' ) ) > 0
Note the use of the period (.) to terminate the PROSITE pattern, and that the LSPatternMatch function returns the character position of the left-most substring matching the pattern, or zero if there is no match.
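
The behaviour described for these functions can be imitated outside DB2 by converting a PROSITE pattern to a regular expression. The sketch below is not IBM's implementation; the function names are invented, the test sequence is made up, and the converter handles only the common PROSITE syntax (element separators, x wildcards, repeat counts, excluded-residue braces, and the < and > anchors outside brackets).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern into Python regular expression syntax."""
    regex = pattern.rstrip(".")                            # drop the terminating period
    regex = regex.replace("-", "")                         # element separators
    regex = regex.replace("x", ".")                        # x matches any residue
    regex = regex.replace("{", "[^").replace("}", "]")     # {AB}: any residue except A, B
    regex = regex.replace("(", "{").replace(")", "}")      # (2) or (2,4): repeat counts
    regex = regex.replace("<", "^").replace(">", "$")      # N- and C-terminal anchors
    return regex


def pattern_position(sequence: str, prosite_pattern: str) -> int:
    """Return the 1-based position of the left-most match, or 0 if none."""
    match = re.search(prosite_to_regex(prosite_pattern), sequence)
    return match.start() + 1 if match else 0


if __name__ == "__main__":
    pattern = "[GS]-[LIVMFYTAC]-[GSTA]-K-x(2)-[GSALVN]."
    print(prosite_to_regex(pattern))
    print(pattern_position("MMSLTKDEGMM", pattern))
```

For the pattern on the slide, the converter yields `[GS][LIVMFYTAC][GSTA]K.{2}[GSALVN]`, and the made-up sequence `MMSLTKDEGMM` matches starting at position 3.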

35 Accessing CLSD: getting an account
To access CLSD you must have an account on the Libra Cluster at IU (aka libra00.uits.iu.edu).
If you don’t have an account and are associated with Indiana University, request an account by filling out a Research Systems Account Application. In the comments section of the account request, add that you need a local and persistent password for use with CLSD.
Once you have a Libra account, send email to SDS at indiana.edu and request instructions for defining a local and persistent password for use with CLSD.
TeraGrid users should send email to SDS at indiana.edu explaining how CLSD will be used, and describing their TeraGrid activities. SDS will then arrange for an appropriate Libra account and send instructions for defining a suitable password.

36 Accessing CLSD: options
DB2 can be accessed in a variety of ways:
  DB2 Command Line Processor (Unix, Windows)
  DB2 Control Center (wherever JRE is running)
  DB2 driver for Perl DBI
  DB2 drivers for the Java Database Connectivity (JDBC) Application Program Interface (API), especially the JDBC Universal Driver
  Demonstration Web pages (invoke a Java servlet that uses JDBC):
  Demonstration WebService (invoked as a function call via JAX-RPC):
  Demonstration Web pages (invoke a Java servlet that invokes the CLSD WebService):
  Experimental WSRF Resource (using WSRF within a GT4 container)
  Experimental OGSA-DAI service (running within a GT4 container)

37 JDBC access
Connect to the CLSD:
  Class.forName( "com.ibm.db2.jcc.DB2Driver" );
  con = DriverManager.getConnection( "jdbc:db2://libra00.uits.iu.edu:50000/clsd2", accountName, accountPassword );
Prepare a query, send it to the db, and receive a result:
  statement = con.createStatement();
  resultSet = statement.executeQuery( query );
Get some query meta-data (column labels and column data types):
  ResultSetMetaData rsmd = resultSet.getMetaData();
  result = rsmd.getColumnLabel( colCount );
  result2 = rsmd.getColumnTypeName( colCount );

38 JDBC access (continued)
Get a row of data:
  for( int colCount = 1; colCount <= numcols; colCount++ ) {
    String returnedString = ""; // Must be predefined.
    returnedString = resultSet.getString( colCount ) + "";
    out.println( " " + returnedString + " \n" );
  }

39 Accessing CLSD thru a WebService (JAX-RPC)
The Java API for XML-based Remote Procedure Calls, or JAX-RPC, is a specification that defines a system for building distributed services (so-called “WebServices”) within the client-server model.
JAX-RPC makes it possible for a function invocation in a client like:
  a_variable = function_name( parameter_list )
to cause the function, “function_name,” to run on a remote server and return a response containing the value to be assigned to the variable “a_variable”. Likewise, a function invocation in a client like:
  returnString = queryCLSD( "select * from syscat.tables", "1", "5", "accountName", "accountPassword", "table" )
will return a (possibly very long) string containing the response to the query (given that various linkages have been prearranged).

40 Outline of the CLSDservice
  public class CLSDservice {
    // Full source at:
    //
    public String queryCLSD( String query, String startingRowToPrint, String maxRows,
                             String account, String password, String format ) {
      // Get a query string, etc. from the command line or Web browser.
      // Declare JDBC drivers and connect to DB2.
      // Prepare a JDBC statement containing the SQL query, submit it to DB2,
      // and capture the returned JDBC result set.
      // Query result set metadata for column names and types to return as the
      // first row, and then collect the contents of each data row.
      return theResponse;
    } // end queryCLSD
  } // end Class CLSDservice

41 SOAP and WSDL JAX-RPC uses SOAP and WSDL to establish the various linkages required to implement remote procedure calls. SOAP messages are usually encoded as XML messages within HTTP requests where: A SOAP request is an HTTP POST request with an XML body. A SOAP response is an HTTP response header followed by an XML body. Such RPC functions are “exposed” as “operations” when described within web pages using the Web Services Description Language (WSDL).
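
To make the “HTTP POST with an XML body” idea concrete, the sketch below builds a SOAP 1.1 request envelope for a queryCLSD call using only Python's standard library. It is an illustration, not the messages JAX-RPC actually generates for CLSD: the service namespace `urn:example:clsd` and the parameter element names `arg0` to `arg5` are assumptions, and nothing is sent over the network.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"   # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:clsd"                           # hypothetical service namespace


def build_request(operation: str, params: list) -> str:
    """Return the XML body of a SOAP request that invokes one operation."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for index, value in enumerate(params):
        # Parameter element names are placeholders for this sketch.
        ET.SubElement(call, f"arg{index}").text = value
    return ET.tostring(envelope, encoding="unicode")


# Headers that would accompany the XML body in the HTTP POST request.
HTTP_HEADERS = {
    "Content-Type": "text/xml; charset=utf-8",
    "SOAPAction": '""',
}

if __name__ == "__main__":
    print(build_request(
        "queryCLSD",
        ["select * from syscat.tables", "1", "5",
         "accountName", "accountPassword", "table"],
    ))
```

In practice the element names, namespaces, and endpoint address would be taken from the service's WSDL description rather than written by hand, which is the linkage work that JAX-RPC tooling automates.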

42 Java command-line client to access CLSD via CLSDservice
  import javax.xml.namespace.QName;
  import org.apache.axis.client.Call;
  import org.apache.axis.client.Service;

  public class testCLSDClient {
    public static void main(String [] args) {
      try {
        String endpoint = "";   // URL of the CLSDservice endpoint
        Service service = new Service();
        Call call = (Call) service.createCall();
        call.setTargetEndpointAddress( new java.net.URL( endpoint ) );
        // First QName argument: namespace URI of the service.
        call.setOperationName( new QName( "", "queryCLSD" ) );
        String returnString = (String) call.invoke( new Object[] {
            "select * from syscat.tables", "1", "5",
            "accountName", "accountPassword", "table" } );
        System.out.println( returnString );
      } catch (Exception e) {
        System.err.println( e.toString() );
      }
    }
  }

43 Perl command-line client to access CLSD via CLSDservice
  #!perl -w
  use SOAP::Lite;
  # Set up the call to CLSD using SOAP.
  $host = "discover.uits.indiana.edu";
  $service = SOAP::Lite -> service( "" );   # URL of the service's WSDL description
  # Make the call to CLSD.
  $result = $service->queryCLSD( "select tabschema,tabname from syscat.tables", 1, 5, "DB2account", "password", "table" );
  print $result;

44 OGSA The Open Grid Services Architecture (OGSA) is an “architecture” for building computational grids. In particular, OGSA “…defines a set of core capabilities and behaviors that address key concerns in Grid systems.” [2] It does not, however, implement or define how to implement such core capabilities. OGSA is NOT layered or object oriented. However, both will be exploited naturally in some implementations. OGSA provides an architecture for building services such as: “Service-Based distributed query processing,” “Grid Workflow”, “Grid Monitoring Architecture” etc.

45 OGSA-DAI
OGSA-Data Access and Integration (OGSA-DAI) is a very flexible and powerful data access framework that can be used within an OGSA grid environment. It provides various data movement, virtualization, and manipulation services that transform the use of data into a higher-level workflow.
The OGSA-DAI client shown in the next slide uses the OGSA-DAI Client Toolkit to send a hard-coded query to CLSD (here known as the “DB2Resource”). The Toolkit allows clients to use JDBC by creating a JDBC ResultSet object from an OGSA-DAI WebRowSet.
The response is encoded using XML and may be retrieved as a single string, or as individual fields by using individual JDBC calls as shown below.

46 Java command-line client to access CLSD via OGSA-DAI
  public class queryCLSD {
    public static void main(String[] args) throws Exception {
      // Create an instance of the data service.
      String handle = "";   // URL of the OGSA-DAI data service
      String id = "DB2Resource";
      DataService service = GenericServiceFetcher.getInstance().getDataService( handle, id );
      // Define a request composed of one activity.
      SQLQuery query = new SQLQuery( "select tabschema,tabname from syscat.tables" );
      WebRowSet rowset = new WebRowSet( query.getOutput() );
      ActivityRequest request = new ActivityRequest();
      request.add( query );
      request.add( rowset );

47 Java command-line client to access CLSD via OGSA-DAI 2
      // Submit the request and retrieve results.
      Response response = service.perform( request );
      ResultSet result = rowset.getResultSet();
      ResultSetMetaData rsmd = result.getMetaData();
      int numCols = rsmd.getColumnCount();
      // Display each column from each row.
      while( result.next() ) {
        for( int colCount = 1; colCount <= numCols; colCount++ ) {
          System.out.print( " " + result.getString( colCount ) );
        }
        System.out.println();
      }
    }
  }

48 This client displays a small part of the functionality provided by OGSA-DAI. In addition, an OGSA-DAI service can be configured to: operate on XML or text data sources, as well as relational data sources, perform a series of operations (also known as “activities”) as part of a single request, deliver results to a third party (via FTP, GridFTP, SMTP, etc.) or to another data service, deliver results asynchronously, which can be very useful for long-running requests, and utilize authentication methods supported by WSRF to provide grid-based security. Also, exposing a database via OGSA-DAI makes it available for OGSA Distributed Query Processing (OGSA-DQP), so that its use may be further virtualized within the DQP model. In some cases, however, OGSA-DAI and DQP may introduce performance penalties.

49 Current and possible directions
Adding data sources: mirrored and federated
  Requests for mirroring or federating will be gladly entertained.
  DB2 now provides a user-configurable script wrapper that connects to a remote DB2 daemon, which can start any co-located script and return data encoded in XML (restricted to one foreign key per table).
  Such a script could be built to relay any web resource that returns XML meeting key restrictions.
  Wrappers could be constructed to relay some OGSA-DAI resources.
Implementing the OGSA-DAI service in production mode.
Integrating with the TeraGrid
  CLSD is currently accessible from the TeraGrid, but authentication is local.
  It may be possible to enforce TeraGrid-based X.509 authentication, using either WSRF or OGSA-DAI interfaces.

50 References:
– Atherly, Alan G., et al., The Science of Genetics,
– Apache Foundation, AXIS User’s Guide,
– Codd, Edward F., A Relational Model of Data for Large Shared Data Banks, (See also:
– CLSD web page:
– Foster, Ian, et al., “The Open Grid Services Architecture, Version 1.5”.
– Sotomayor, Borja, and Lisa Childers, Globus Toolkit 4: Programming Java Services
– Sundaram, Babu, Understanding WSRF,
Questions, comments, suggestions?