Presentation is loading. Please wait.

Presentation is loading. Please wait.

Grid enabled e-Research in the Life Sciences Prof Richard Sinnott Technical Director National e-Science Centre Anthony Stell University of Glasgow, Scotland,

Similar presentations


Presentation on theme: "Grid enabled e-Research in the Life Sciences Prof Richard Sinnott Technical Director National e-Science Centre Anthony Stell University of Glasgow, Scotland,"— Presentation transcript:

1 Grid enabled e-Research in the Life Sciences Prof Richard Sinnott Technical Director National e-Science Centre Anthony Stell University of Glasgow, Scotland, UK 26 th October 2006

2 Life Sciences Some of the Big Questions –How does a cell/brain work? –Which genes/pathways are involved in which diseases and can we develop drugs to target them? –Why do people who eat less tend to live longer? –Is this drug effective (for these individuals)? –How important are genetic / social / environmental factors to specific diseases? …how clinically significant is the consumption of deep fried Mars Bar and Pizza Crunch in Scotland?

3 Life Science Grids Extensive Research Community > 4000 at Glasgow Extensive Applications –Many people care about them Health, Food, Environment Interacts with virtually every discipline –Physics, Chemistry, Maths/Stats, Nano-engineering, … MANY databases relevant to bioinformatics (and growing!) –Heterogeneity, Interdependence, Complexity, Change, …

4 Database Growth PDB Content Growth DBs growing exponentially!!! Biobliographic (MedLine, PubMed…) Protein Seq (UniProt, …) 3D Molecular Structure (PDB, …) Nucleotide Seq (GenBank, EMBL…) Pathways (KEGG, WIT…) Molecular Classifications (SCOP,…) Motif Libraries (PROSITE, Blocks, …) … Yesterday EMBL Database contained 147,881,486,173 nucleotides in 81,229,974 entries. Homo sapiens Mus musculus Rattus norvegicus Bos taurus Pan troglo- dytes Canis familiaris Monodelphis domestica Macaca mulatta Danio rerio Aedes aegypti Other

5 More genomes …... Arabidopsis thaliana mouse rat Caenorhabitis elegans Drosophila melanogaster Mycobacterium leprae Vibrio cholerae Plasmodium falciparum Mycobacterium tuberculosis Neisseria meningitidis Z2491 Helicobacter pylori Xylella fastidiosa Borrelia burgorferi Rickettsia prowazekii Bacillus subtilis Archaeoglobus fulgidus Campylobacter jejuni Aquifex aeolicus Thermotoga maritima Chlamydia pneumoniae Pseudomonas aeruginosa Ureaplasma urealyticum Buchnerasp. APS Escherichia coli Saccharomyces cerevisiae Yersinia pestis Salmonella enterica Thermoplasma acidophilum

6 Distributed and Heterogeneous data LPSYVDWRSAGAVVDIKSQG ECGGCWAFSAIATVEGINKI TSGSLISLSEQELIDCGRTQQD NTRGCDGGYI TDGFQFIIND GGINTEENYPYTAQDGDCDV AGGTATAGCGCGCGCGATATATA AAATGTACGTACGGGCCCTTATA CGCGCGCGATATATAGCGCGCG Sequence Structure Function Gene expression Morphology Pathways

7 Translational Research Just one example!

8 Systems-Biology… Nucleotide sequences Nucleotide structures Gene expressions Protein Structures Protein functions Protein-protein interaction (pathways) Cell Cell signalling Tissues Organs PhysiologyOrganisms Populations + links to plant/crops, environmental, health, … information sources

9 Is Grid the Answer? Key problems to be addressed –Tools that simplify access to and usage of data Internet hopping is not ideal! –Tools that simplify access to and usage of large scale HPC facilities qsub [-a date_time] [-A account_string] [-c interval] [-C directive_prefix] [-e path] [-h] [-I] [-j join] [-k keep] [-l resource_list] [-m mail_options] [-M user_list] [-N name] [-o path] [-p priority] [-q destination] [-r c] [-S path_list] [-u user_list] [-v variable_list] [-V] [-W additional_attributes] [-z] [script] …

10 Is Grid the Answer …ctd? Key problems …ctd –Tools designed to aid understanding of complex data sets and relationships between them e.g. through visualisation –Support different kinds of collaborative research break down the silos be multi-discipline support the research process –Provide access to many more computational resources to expedite scientific process (or to make it feasible!)

11 Access to and Usage of Data Grid technology should allow to –hide heterogeneity, –deal with location transparency, –address security concerns, –support data provenance –… Data Access and Integration Specification (DAIS) being defined by GGF –OGSA-DAI/DAIT projects key role in shaping these standards Other commercial solutions –IBM Information Integrator, SRS, …

12 Access to and Usage of HPC facilities Consider whole genome-genome comparisons between two species –Current strategy essentially chops up one genome and fires searches for those fragments in the other then re- assembles results messy approximate matching - re-assembly difficult important correlations can be lost –to make this tractable so called junk DNA ignored –chopping may introduce artefacts or hide phenomena  Better to put both full genomes in memory and perform a useful complete comparison  Only possible with very high-end machines (available via grids) –Should not have to be script writer/Linux sys- admin to use these facilities

13 Cognitive aspects of Data Life science data can be “ugly” –Raw data sets messy –Requires significant effort to understand –Schemas/data models evolving –…–… Tools needed to –Simplify understanding –Improve analysis –Navigate through potentially huge data sets e.g. to find genes of interest in chromosomes of different species, … they are!!!

14 Collaborative Aspects Should provide tools that automate the way researchers wish to work –User driven workflows –Linking compute and data resources “on the fly” Where is the “best place” to submit these jobs right now? –MyGrid workbench gaining widespread acceptance 20,000+ downloads 3000+ bio-services

15 Collaborative Aspects …ctd Break down the silos and multi-disciplinary –We are all looking at possible genetic factors in cancer, metal health, cardiovascular… …so we should co-ordinate our efforts and share data, knowledge, … –Has anyone generated results like these? –Can I see them now »…rather than waiting 2 years for the Nature / Science publication –I need input from a physicist, chemist, a statistician, a … to explain this, to process these results, to simulate this phenomenon, to verify these results …

16 Nucleotide sequences Nucleotide structures Gene expressions Protein Structures Protein functions Protein-protein interaction (pathways) Cell Cell signalling Tissues Organs PhysiologyOrganisms Populations BRIDGES SBRN VOTES DyVOSE GEMEPS GLASSESP-Grid GS SFHS

17 BRIDGES Project Synteny Service Information Integrator OGSA-DAI Magna Vista Service VO Authorisation blast ++ +

18 Bridges Portal

19 www.nesc.ac.uk MagnaVista

20

21 GeneVista

22 Grid Blast Interface Allows ‘genome scale’ blasting Transparently uses NGS, ScotGrid, other GU clusters, Condor pools Many databases already deployed across nodes No user certificates Fine grained security at back-end

23

24

25

26

27 Grid Enabled Microarray Expression Profile Search (GEMEPS) 1 year BBSRC project started 1 st March 2006 –Involves Glasgow, Cornell University, US, Riken Institute, Japan –Aim to provide tools for discovery, comparison and analysis of microarray data sets How does my data compare to others? –Species, disease, platform, results, … How do these experiments compare? Can we improve the way we establish how genes in different species are linked? –Requires data access, integration and move towards data mining –Built upon fine grained security Microarrays expensive and contain potentially important (valuable) data sets

28 Experiences Currently exploring microarray data sets in detail –GEO, ArrayExpress, local in-house microarray storage solutions at Riken, Cornell, SHWFGF … Investigating/Grid enabling CellMontage software (http://cellmontage.cbrc.jp/)http://cellmontage.cbrc.jp/ –system for searching gene expression databases for cells or tissues similar to a query gene expression profile –similarity of two profiles computed by comparing the order of genes ranked by expression (Spearman Rank) simple measure but sufficient to characterize cell types across different microarray platforms gene sets/expression value ranges differ between platforms, making direct comparison difficult/impossible

29 Microarray Data Resources Various standards and interoperability issues –MIAME –MAGE-ML –MINiML –SOFTtext –SOFTmatrix –… What’s in a name? –Gene names, probe names, platforms, species names in experiments, … Life Science Identifiers

30 Grid Enabled Microarray Expression Profile Search (GEMEPS)

31 Overview of VOTES Grids –Compute and Data Grids –Accessing Grids –Grid Security Clinical Trials and VOTES –Clinical Trials –VOTES Goals –Security Issues –Classification Issues Project so far –Implementation and Technologies Conclusion –Application to life sciences –The main challenge: the human one

32 Grids – what are they? Use existing resources to solve large-scale compute or data problems more efficiently –Rather than throwing money at hardware solutions… –Develop applications that intelligently use available resources whilst maintaining security between all parties involved Compute grids –Aggregation of CPU cycles and storage for better performance Data grids –Enhance quality and value of distributed information Virtual Organisations –Where parties share data and resources but in a limited sense, so who can access what must be strictly controlled.

33 Soundbites “Next generation of the Internet” [Various] –Knitting together network infrastructures… “Internet on steroids” [Me at social events] –Getting better performance with what you’ve got… “More bang for your buck” [BBC Magazine] –Do the same as could be achieved with lots of hardware, but doing it more efficiently… “Co-ordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations” [“Anatomy of the Grid”, Ian Foster] –…

34 Accessing Grids Want an open, usable interface to access grid applications… As intuitive and easy to use as browsers to access applications on the Internet… Portal technology is one possible way forward: –Developed as stateful web applications. –Communicate to middleware solutions which do their magic to allocate and use underlying grid infrastructures.

35 Grid Security - 1 Security is often classified as: “AAA” –Authentication “Who are you?” –Authorization “What are you allowed to do?” –Accounting “Where were you on the night of…?” But there are other aspects to be considered: –Anonymisation –Confidentiality –Non-repudiation

36 Grid Security - 2 Not just server checking on client, but vice versa: –Because a server might be a client to another process… PKI – digital keys/certificates for authentication –Clever mathematics provide useful encryption and signature tools… –“The Code Book” by Simon Singh Proxy certificates –Pushing your credentials further down a path of trust than your immediate neighbour… –Delegation of trust

37 Grid Technologies Range of middleware solutions: –Globus Toolkit –Open Middleware Infrastructure Institute (OMII) –Shibboleth –GridSphere Not mature –Difficult to implement… No clear leaders –Our job is to pick, choose and develop…

38 Clinical Trials Research studies into new drugs, medical devices or other interventions on patients in scientifically-controlled environment. Required for regulatory authority approval of new therapies. Generally speaking: they help improve quality of life.

39 VOTES Virtual Organisations for Trials and Epidemiological Studies 3 year (£2.8 million) MRC funded project started in October 2005 Collaboration between various UK universities: –Glasgow, Oxford, Nottingham/Leicester, Manchester, Imperial College London Focuses on three key areas of clinical trials: –Patient Recuitment –Data Collection –Study Management

40 Key Areas Patient Recruitment –How many men aged between 45 and 65 had a heart attack last year? How many of them would be willing to participate in the trial of a new drug? Data Collection –Are the participants taking their drug/placebo on a regular basis? Have there been any incidents relating to the trial? Study Management –Who can see the trial data (e.g. consultants, nurses)? Who ensures the trial is in the patient’s interest? Can we simplify the ethical review process?

41 Data Grids Falls into the remit of the “Information Grid”: –“… which provides a way for information resources to be joined with related information resources to greater exploit the value of the inherent relationships among information, then for new connections to be made as situations change.” [Grid Computing with Oracle, technical white paper, 2005] Two main challenges: –Security –Data classification And we want to “plug in” to the existing NHS IT infrastructure…

42 Additional Clinical Security Anonymisation –De-identifying data –Only interested in the statistical data => don’t need to know the patient’s identity –So the identifying data is encrypted Statistical Inference –When two bits of seemingly innocuous data are joined, can result in identification –E.g. an unusual condition in a particular postcode

43 Data Classification Main problem here is one of language and definition across domains. Solutions proposed include: –Global schema Essentially an overall description of data that all parties must subscribe to. –Ontology Methods of translating the idiosyncratic description to a common description used by all parties. No clear solution to this yet… –Current method is to join distributed databases on CHI number (Community Health Index).

44 VOTES Portal Overview Developed on local test-bed of distributed servers and databases: –Log in and are assigned privileges based on role –Select clinical trial –Select parameters to view and apply conditions (if desired) –Results of this query are brought back from the databases distributed over the test-bed (or VO if you will…), joined and presented as a unified resource. Demo available at break…

45 VOTES Portal Snapshots

46 Architecture

47 Technologies –GridSphere (2.1) –Globus Toolkit (4.0) –OGSA-DAI (2.2) Security Framework –Database user management (Resource-level) Local restrictions on local resources –Access Control matrix (VO-level) A bit-wise privilege matrix that will be available to the whole VO Representative NHS Databases –GPASS –SCI Store

48 Conclusions Grid Computing is a challenging field… We provide one *possible* solution to applying the technology paradigms to clinical trials and studies. And it is hopefully a worthwhile effort, as it potentially brings: –Efficient use of distributed resources and data. –Enhanced analysis and understanding of said data. –Closer collaboration between participants. –Peace, prosperity and general happiness to human-kind... Maybe…

49 The main challenge… … is the human one. Encouraging technological uptake… Challenging techno-phobic attitudes…

50 Further Information Website: http://www.nesc.ac.uk/hub/projects/voteshttp://www.nesc.ac.uk/hub/projects/votes Portal: http://labpc- 12.nesc.gla.ac.uk:18080/gridspherehttp://labpc- 12.nesc.gla.ac.uk:18080/gridsphere Contact: –Prof. Richard Sinnott – r.sinnott@nesc.gla.ac.ukr.sinnott@nesc.gla.ac.uk –Anthony Stell – a.stell@nesc.gla.ac.uka.stell@nesc.gla.ac.uk –Oluwafemi Ajayi – o.ajayi@nesc.gla.ac.uko.ajayi@nesc.gla.ac.uk


Download ppt "Grid enabled e-Research in the Life Sciences Prof Richard Sinnott Technical Director National e-Science Centre Anthony Stell University of Glasgow, Scotland,"

Similar presentations


Ads by Google