Grid Computing Research and Applications Sornthep Vannarat Large scale Simulation Research Laboratory National Electronics and Computer Technology Center
Outline • Introduction to Grid computing • Open Grid Service Architecture • Bioinformatics applications on Grid • Information Grid project • GEO Grid project • Knowledge Grid • Web 2.0 and Grid computing • Grid activities at NECTEC
What is Grid computing? • Next-generation computing platform and global cyberinfrastructure for solving large-scale problems in science, engineering, and business • Grid Café [http://gridcafe.web.cern.ch/gridcafe/] • Web is a service for sharing information over the Internet, the Grid is a service for sharing computer power and data storage capacity over the Internet • Ian Foster – 1998: Computational Grid is a hardware and software infrastructure that provides dependable, consistent, and pervasive access to high-end computational capabilities – 2000: Grid computing is concerned with coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations – 2002: Grid is a system that (1) coordinates resources that are NOT subject to centralized control (2) uses standard, open, general purpose protocols and interfaces (3) delivers non-trivial qualities of service
Status of Grid computing • A promising work in progress • Usable, but only with considerable effort • WISDOM: – EGEE docking project – Find new inhibitors for proteins produced by Plasmodium falciparum – Over 46 million docking simulations in 6 weeks using 1,700 computers in 15 countries, equivalent to 80 CPU-years • Beyond computing power
Types of Grids • Computing grid • Data/storage grid • Information grid • Instrument grid • Access grid
The Grid Problem • Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources (from “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”) • Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals, assuming the absence of: – central location – central control – omniscience – existing trust relationships
Elements of the Problem • Resource sharing – Computers, storage, sensors, networks, … – Sharing always conditional: issues of trust, policy, negotiation, payment, … • Coordinated problem solving – Beyond client-server: distributed data analysis, computation, collaboration, … • Dynamic, multi-institutional virtual orgs – Community overlays on classic org structures – Large or small, static or dynamic
Challenges • To provide seamless access • Heterogeneous environments • Multiple administrative domains and autonomy issues • Scalability • Dynamicity/adaptability
Grid computing middleware • “Global Grids and Software Toolkits: A Study of Four Grid Middleware Technologies”, Parvin Asadzadeh et al. • UNICORE – Uniform Interface to Computing Resources – Ready-to-run Grid system including client and server software – UNICORE 6.0.1, released 26 Nov 2007: WSRF-based implementation • Globus Toolkit – Developed by the Globus Alliance – Open source software toolkit used for building grids, with services written in a combination of C and Java – GT 4.0.5: OGSA, WSRF-based • Legion, Gridbus • EGEE’s gLite
One View of Requirements • Identity & authentication • Authorization & policy • Resource discovery • Resource characterization • Resource allocation • (Co-)reservation, workflow • Distributed algorithms • Remote data access • High-speed data transfer • Performance guarantees • Monitoring • Adaptation • Intrusion detection • Resource management • Accounting & payment • Fault management • System evolution • Etc.
Layered Grid Architecture • Fabric – “Controlling things locally”: access to, & control of, resources • Connectivity – “Talking to things”: communication (Internet protocols) & security • Resource – “Sharing single resources”: negotiating access, controlling use • Collective – “Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services • Application • [Diagram: the Grid layers shown alongside the Internet Protocol Architecture layers: Link, Internet, Transport, Application]
• Service-oriented architecture – Key to virtualization, discovery, composition, local-remote transparency • Leverage industry standards – Internet, Web services • Distributed service management – A “component model for Web services” • A framework for the definition of composable, interoperable services • Source: “The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration”, Foster, Kesselman, Nick, Tuecke, 2002
Web Services • XML-based distributed computing technology • Web service = a server process that exposes typed ports to the network • Described by the Web Services Description Language, an XML document that contains – Type of message(s) the service understands & types of responses & exceptions it returns – “Methods” bound together as “port types” – Port types bound to protocols as “ports” • A WSDL document completely defines a service and how to access it • WSRF
GT4 GRAM Architecture • [Diagram. Components: Client; service host(s) and compute element(s); GT4 Java Container hosting GRAM services, Delegation, and RFT File Transfer; GRAM adapter; sudo; local scheduler; user job; SEG; GridFTP; remote storage element(s). Labeled interactions: job functions, delegate, transfer request, FTP control, FTP data, local job control, job events]
Monitoring & Discovery • [Diagram. Components: GT4 containers, each with an Index service, hosting GRAM, RFT, and user services; GridFTP reached through an adapter; clients (e.g., WebMDS). Labels: automated registration in container; WS-ServiceGroup; registration & WSRF/WSN access; custom protocols for non-WSRF entities]
PKI • Public Key Infrastructure • Key-based encryption • Symmetric and asymmetric encryption • Public and private keys • Digital signature • Digital certificate • Certificate Authority (CA)
GSI • Grid Security Infrastructure • Transport and message-level security • Authorization schemes • Credential delegation and single sign-on • Different levels of security: container, service, and resource
OGSA-DAI • An extensible framework for data access and integration • Expose heterogeneous data resources to a grid through web services • Interact with data resources – Queries and updates – Data transformation / compression – Data delivery – Application-specific functionality • A base for higher-level services – Federation, mining, visualisation,… • Open Grid Forum DAIS Working Group – DAIS (Database Access and Integration) specifications – OGSA-DAI to be a reference implementation of DAIS
OGSA-DAI functionality • Interaction with data resources – Relational – MySQL, SQL Server, DB2, PostgreSQL, Oracle – XMLDB – eXist, Xindice – Files – text, binary, indexed – SQL multi-resources – aggregation of OGSA-DAI services exposing relational resources • Transformation and compression – ZIP, GZIP, XSLT, ResultSet-to-WebRowSet, ResultSet-to-CSV, … – WebRowSet projection, frequency distribution, random sample, … • Delivery – Local file, HTTP, SMTP, SOAP attachments, GridFTP, other OGSA-DAI services • Resource creation and destruction • Document-oriented interface – service interface is resource agnostic
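The query, transformation, and compression steps listed above can be illustrated with a small stand-alone sketch. It does not use the OGSA-DAI API; it only mimics a "ResultSet-to-CSV" step followed by GZIP compression, using Python's standard library and an in-memory SQLite database as a stand-in relational resource. The table name `protein` and its contents are made up for illustration.

```python
import csv
import gzip
import io
import sqlite3

def query_to_gzipped_csv(connection, sql):
    """Run a query, render the rows as CSV with a header, then GZIP the text."""
    cursor = connection.execute(sql)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([column[0] for column in cursor.description])  # header row
    writer.writerows(cursor)                                       # data rows
    return gzip.compress(buffer.getvalue().encode("utf-8"))

# Hypothetical relational resource (illustrative table and data).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE protein (id INTEGER, name TEXT)")
connection.execute("INSERT INTO protein VALUES (1, 'RbsB')")

payload = query_to_gzipped_csv(connection, "SELECT id, name FROM protein")
print(gzip.decompress(payload).decode("utf-8"))
```

The compressed `payload` is what a delivery step (local file, HTTP, GridFTP, and so on) would then ship to the requester.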
Bioinformatics and Grid • Bioinformatics applications often require high-performance computing and large data handling • Tools: bioinformatics tools and web services • Data: – Public databases – Biological knowledge: ontology and metadata – Unpublished data • Grid computing meets the requirements – Computing Grids – Data Grids – Knowledge Grids
Computing Grid • High throughput computing – Thousands of small independent tasks • Grid computing vs. cluster computing – Both aim at parallel and distributed computing – They differ in network latency and robustness – Task failures are much more frequent in grid computing • Two types of high-throughput computing – numerical processing – symbolic processing
High throughput numerical processing • Systems biology aims at modeling of biological dynamics in molecules, cells, organs and individuals • Huge computational power is needed for – molecular folding – molecular docking – spatiotemporal molecular interaction – kinetic parameter estimation • Problem decomposition techniques – parameter sweep – stochastic modeling
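A parameter sweep decomposes one large problem into many independent tasks, one per combination of parameter values, which is what makes it a natural fit for high-throughput grids. The sketch below is a minimal illustration; the parameter names `k_on` and `k_off` and their values are hypothetical, not taken from the talk.

```python
from itertools import product

def parameter_sweep(grid):
    """Yield one independent task (a dict of parameter values) per grid point."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

# Hypothetical kinetic parameters; names and ranges are illustrative only.
grid = {"k_on": [0.1, 1.0, 10.0], "k_off": [0.01, 0.1]}
tasks = list(parameter_sweep(grid))
print(len(tasks))  # 3 * 2 = 6 independent tasks
```

Each element of `tasks` could be submitted as a separate grid job, since no task depends on the result of another.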
WISDOM • EGEE docking project • Find new inhibitors for proteins produced by Plasmodium falciparum • Over 46 million docking simulations in 6 weeks • 1,700 computers in 15 countries • Equivalent to 80 CPU-years
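As a rough cross-check of these figures, the short calculation below converts 80 CPU-years delivered in 6 weeks into an average number of continuously busy CPUs. It assumes one CPU per computer and 52 weeks per year, neither of which is stated on the slide, so the result is only an order-of-magnitude reading.

```python
CPU_YEARS = 80        # from the slide
WEEKS = 6             # from the slide
COMPUTERS = 1700      # from the slide
WEEKS_PER_YEAR = 52   # approximation (assumption)

def average_busy_cpus(cpu_years, weeks):
    """CPUs that would have to run non-stop to deliver cpu_years in the given weeks."""
    return cpu_years * WEEKS_PER_YEAR / weeks

busy = average_busy_cpus(CPU_YEARS, WEEKS)
share = busy / COMPUTERS  # assumes one CPU per computer
print(f"{busy:.0f} CPUs busy on average, {share:.0%} of {COMPUTERS}")
```

Under these assumptions roughly 690 CPUs were busy on average, about two fifths of the 1,700 machines involved.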
DIANE • Enhanced version of WISDOM • Light-weight framework • Search for drugs for predicted variants of H5N1 • 2 million docking complexes with a size of 600 gigabytes • 2,000 grid worker nodes in 17 countries
Limitations of EGEE Infrastructure • Experiences from virtual screening projects • Overall grid efficiency about 50 percent • Major sources of failure – Server license failure 23% – Workload management failure 10% – Site failure 9%
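One way to read these numbers is to ask how many submissions a job needs on average before it succeeds. The sketch below assumes that "efficiency" means the per-submission success probability and that attempts fail independently; both are simplifying assumptions, not claims from the talk.

```python
# Failure shares from the slide, as fractions of all submitted jobs.
FAILURE_SHARES = {"server license": 0.23, "workload management": 0.10, "site": 0.09}

def expected_submissions(p_success):
    """Mean number of submissions until one succeeds (geometric distribution)."""
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    return 1 / p_success

listed_failures = sum(FAILURE_SHARES.values())  # share explained by the three listed causes
attempts = expected_submissions(0.5)            # "about 50 percent" efficiency
print(f"listed causes: {listed_failures:.0%}, expected submissions per job: {attempts:.1f}")
```

Under this reading, each job is submitted twice on average, and the three listed causes account for most of the failed half.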
Study of kinetic pathways • Estimation of ODEs for modeling of metabolic pathways and signal transduction pathways • Genetic algorithms: – Estimating optimal parameter fitting to biological experimental results – High degrees of parallelism (multiple trials with initial conditions) • Parameter-parameter dependencies: – Calculating moment parameters, such as AUC, MRT, VRT
High throughput symbolic processing • Sequence analysis: homology searches, genome comparisons, genome-wide analyses • Sequencing data are expected to increase more rapidly – High-throughput DNA sequencing technologies – Metagenomic projects – Human resequencing projects – Genome sequencing projects on other species • Requires large databases such as DNA and protein sequence databases • Sharing and updating of biological databases on the grid are of key importance
Sharing biological databases • Sharing is becoming more and more difficult and intractable • Automatic updating of databases is necessary • Concerns – Duplicated database copying – Disk overflow – Unexpected shutdown – Version management – File checksum integrity verification – Parallel and pipelined mechanisms for high-throughput data transfer
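File checksum verification, one of the concerns above, can be sketched with Python's standard `hashlib`. The choice of SHA-256 and the throwaway file standing in for a database replica are assumptions made for illustration; the talk does not name an algorithm.

```python
import hashlib
import os
import tempfile

def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large database files never sit fully in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_copy(path, expected_hex, algorithm="sha256"):
    """Return True when the local copy matches the published checksum."""
    return file_checksum(path, algorithm) == expected_hex

# Demo on a throwaway file standing in for a transferred database replica.
content = b">seq1\nATGC\n"
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(content)
    replica_path = handle.name

published = hashlib.sha256(content).hexdigest()
intact = verify_copy(replica_path, published)   # checksum matches
tampered = verify_copy(replica_path, "0" * 64)  # checksum does not match
os.remove(replica_path)
```

A site would run `verify_copy` after each transfer and request the file again when it returns `False`.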
EGEE Framework • EGEE provides a general framework for sharing replicas of biological databases, represented by: – Physical File Name (PFN) – Logical File Name (LFN) – Globally Unique Identifier (GUID) • Replica Manager System (RMS) – Replica Metadata Catalog (RMC) – Replica Location Service (RLS) • [Diagram labels: LFN-1, LFN-2, LFN-3; GUID; PFN-1, PFN-2; RMC; RLS]
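The relationship between the three kinds of names can be sketched as two lookup tables: several logical names point to one GUID, and that GUID points to several physical replicas. The class below is an illustrative model only, not the EGEE API; it assumes the RMC holds the LFN-to-GUID mapping and the RLS holds the GUID-to-PFN mapping, and all file names and hosts are hypothetical.

```python
import uuid

class ReplicaCatalog:
    """Toy model of a replica catalog: LFN -> GUID -> list of PFNs."""

    def __init__(self):
        self.lfn_to_guid = {}   # role assumed for the Replica Metadata Catalog (RMC)
        self.guid_to_pfns = {}  # role assumed for the Replica Location Service (RLS)

    def register(self, lfn, pfn):
        """Record a physical replica for a logical file, creating a GUID if needed."""
        guid = self.lfn_to_guid.setdefault(lfn, str(uuid.uuid4()))
        self.guid_to_pfns.setdefault(guid, []).append(pfn)
        return guid

    def add_alias(self, existing_lfn, new_lfn):
        """Give the same file a second logical name."""
        self.lfn_to_guid[new_lfn] = self.lfn_to_guid[existing_lfn]

    def replicas(self, lfn):
        """Resolve a logical name to all known physical locations."""
        return list(self.guid_to_pfns[self.lfn_to_guid[lfn]])

# Hypothetical names and hosts, for illustration only.
catalog = ReplicaCatalog()
guid = catalog.register("lfn:/bio/refseq.fasta", "srm://site-a.example.org/data/refseq.fasta")
catalog.register("lfn:/bio/refseq.fasta", "srm://site-b.example.org/data/refseq.fasta")
catalog.add_alias("lfn:/bio/refseq.fasta", "lfn:/bio/latest.fasta")
print(catalog.replicas("lfn:/bio/latest.fasta"))
```

Because the alias shares the GUID, a job can ask for either logical name and be directed to whichever physical copy is closest.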
GADU • Genome Analysis and Database Update system • Automated, scalable, high-throughput computational workflow engine • Executes bioinformatics tools (BLAST, BLOCKS, PFam, Chisel and InterPro) • Public databases (NCBI RefSeq, PIR, InterPro and KEGG)
Homology Search • GRID BLAST implementations have been developed and reported – Prestaging of sequence databases to minimize the runtime overhead of transferring large sequence databases – Database updates that keep data consistent across the data grid – Dynamic load balancing of query sequences – Assembling of the results from distributed jobs
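The last two points, load balancing of queries and assembling distributed results, can be sketched as follows. Queries are cut into small chunks, a pool of workers pulls chunks as they become free, and the partial results are merged. This is a local illustration with threads, not a Grid BLAST implementation; `placeholder_search` is a hypothetical stand-in that only returns sequence lengths.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Split the query list into small chunks so idle workers can pick up more work."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def placeholder_search(chunk):
    # Stand-in for a real search job on one worker node: returns sequence lengths.
    return {query_id: len(sequence) for query_id, sequence in chunk}

def run_distributed(queries, chunk_size=2, workers=3):
    """Dispatch chunks to a worker pool and assemble the partial results."""
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(placeholder_search, chunked(queries, chunk_size)):
            merged.update(partial)
    return merged

# Hypothetical query sequences.
queries = [("q1", "ATGC"), ("q2", "ATGCCG"), ("q3", "AT"), ("q4", "GGCATA"), ("q5", "A")]
hits = run_distributed(queries)
print(hits)
```

Smaller chunks balance load better when nodes differ in speed, at the cost of more per-job overhead, which is why the database itself is prestaged rather than shipped with each chunk.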
Genome Comparison • Most promising life science applications for grid computing • Expandable and flexible large scale computing facility is needed • E.g. Investigation of horizontal gene transfer among 354,606 ORFs extracted from more than 100 microbial genomes – Used 229 CPUs located in 5 institutions • Number of pair-wise sequence comparisons ∝ N²
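The quadratic growth is easy to make concrete. Counting each unordered pair once and excluding self-comparisons (an assumption about how the comparisons were counted), N sequences give N(N-1)/2 pairs. The sketch below applies this to the ORF count quoted above.

```python
def pair_count(n):
    """Number of unordered pairs among n sequences, excluding self-comparisons."""
    return n * (n - 1) // 2

ORFS = 354_606  # from the slide
total_pairs = pair_count(ORFS)
print(f"{total_pairs:.2e} pair-wise comparisons")

# Doubling the number of sequences roughly quadruples the work.
growth = pair_count(2 * ORFS) / total_pairs
```

For this data set the count is on the order of 6 × 10^10 comparisons, which explains why the work was spread over hundreds of CPUs.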
Integration of bioinformatics services • Resourceome – Uniform and secure interface – Providing workflows – Using metadata and ontology • Metadata, ontology, XML: fill the semantic gap of heterogeneous databases • Framework: OGSA based on WSRF
RbsB in Different Formats • DDBJ • SWISS-PROT • PDB
BioPfuga • Workflow system integrating application programs • Separates application programs into smaller parts • Standardizes the data format for transferring data between different application programs
Bioinformatics workflow • Necessary for end-users of bioinformatics web/grid services • Taverna provides a workflow language and graphical user interface for building, running, and editing workflows • A semantic indexing system for bioinformatics services has become essential for choosing resources • Searching for functionally similar bioinformatics workflows is also important • Bioinformatics ontology is essential for automatic generation of bioinformatics workflows
Secure Data Access • Many bioinformatics databases are public and freely available • But access to the data needs to be strictly controlled in distributed collaborative research (for example, clinical data) • Public Key Infrastructure (PKI) is the predominant method for enforcing authentication • The Virtual Organization for Trials and Epidemiological Studies (VOTES) project uses Internet2 Shibboleth technology
Information Grid • An open and flexible infrastructure that facilitates the integration of any information anywhere across heterogeneous data sources in a grid environment • 3 essential components – MDL: Marker Description Language – Information Services – Information Brokers
MDL: Marker Description Language • A unified language that defines: – standard schema model – integration configuration model – standard schema discovery model
Information Service • Acts as an agent to publish information • Responsibilities: – Connect to a current data source of an organization – Transform a generic query (mdlQuery) into a specific query – Transform the query result into the standard schema defined in the specified MDL document • [Diagram: generic query (mdlQuery) → specific query (SQL) → query result (table) → query result (mdl-based result)] • Generic Information Service Tool – Manual mapping – RDBMS – No authentication
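The two transformations above (generic query to SQL, then table rows to a standard-schema result) can be sketched with a hand-written mapping. The actual mdlQuery syntax is not shown in the talk, so the dictionary form used here, the `MAPPING` table, and all table and column names are hypothetical; an in-memory SQLite database stands in for an organization's RDBMS.

```python
import sqlite3

# Hypothetical manual mapping from standard-schema names to local table/column names.
MAPPING = {
    "entity": {"person": "tbl_person"},
    "field": {"name": "full_nm", "province": "prov"},
}

def to_sql(mdl_query):
    """Translate a generic query (here a plain dict) into parameterized SQL."""
    table = MAPPING["entity"][mdl_query["entity"]]
    columns = [MAPPING["field"][field] for field in mdl_query["select"]]
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    params = []
    if mdl_query.get("where"):
        clauses = []
        for field, value in mdl_query["where"].items():
            clauses.append(f"{MAPPING['field'][field]} = ?")
            params.append(value)
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

def answer(connection, mdl_query):
    """Run the specific query and relabel rows with standard-schema field names."""
    sql, params = to_sql(mdl_query)
    rows = connection.execute(sql, params).fetchall()
    return [dict(zip(mdl_query["select"], row)) for row in rows]

# Stand-in local data source with illustrative rows.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE tbl_person (full_nm TEXT, prov TEXT)")
connection.executemany("INSERT INTO tbl_person VALUES (?, ?)",
                       [("Somchai", "Bangkok"), ("Malee", "Chiang Mai")])

query = {"entity": "person", "select": ["name"], "where": {"province": "Bangkok"}}
result = answer(connection, query)
print(result)
```

Because callers only see standard-schema names, a broker could send the same query to services backed by differently structured databases and merge the answers.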
Information Broker • Acts as a broker of Information Services • Responsibilities: – Connect to Information Services – Connect to other Information Brokers – Discover potential Information Brokers and Services – Integrate information • [Diagram labels: mdlQuery; integrated mdl-based result]
Knowledge Grid • Tacit knowledge – "We should start from the fact that we can know more than we can tell", Michael Polanyi, a 20th-century philosopher • Knowledge represented on computers is just a part of our knowledge • Grid as a place where people work together and create knowledge • Sharing explicit and tacit knowledge • This framework gives a meta-philosophical approach to rationalise the current Grid phenomenon.
Knowledge spiral theory • Knowledge creation requires a cyclic process of knowledge conversion between tacit knowledge and explicit knowledge – Socialization (tacit knowledge to tacit knowledge) – Externalization (tacit knowledge to explicit knowledge) – Combination (explicit knowledge to explicit knowledge) – Internalization (explicit knowledge to tacit knowledge)
Socialization • First step in forming a community • Grid portals are helpful for attracting those who are interested in some specific field • Must allow the formation of user-defined communities • Knowledge grids should provide facilities like those of a social communication system – Participants form new communities – Participants recruit other participants • Face-to-face or off-site meetings also help promote mutual understanding in a community.
Externalization • For example, publication of research papers • Externalization is the essence of knowledge creation • A knowledge grid should provide facilities for participants to publish their knowledge in a community • Web-based dynamic content is one promising way to publish knowledge
Combination • Combination expands knowledge by the sharing of explicit knowledge in a community • Synergy effects can be expected if participants bring together their own knowledge • Grid portals and application-oriented grids play an essential role in this process
Internalization • Internalization is a process of acquiring tacit knowledge by experience • To make use of a grid for real-world life science problems, a problem-solving layer for bioinformatics must be developed • Gridification of public databases and bioinformatics tools is a necessary but not sufficient condition • A bioinformatics environment should provide secure facilities to deal with unpublished data, and customization facilities to develop one's own bioinformatics environment coordinated with the global bioinformatics environment
Web 2.0 Design Patterns • The Long Tail • Data is the Next Intel Inside • Users Add Value "architecture of participation" • Network Effects by Default • Some Rights Reserved – Design for "hackability" and "remixability." • The Perpetual Beta • Cooperate, Don't Control • Software Above the Level of a Single Device • What Is Web 2.0... by Tim O'Reilly, http://www.oreilly.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html
Web 2.0 Core Competencies • Services, not packaged software, with cost-effective scalability • Control over unique, hard-to-recreate data sources that get richer as more people use them • Trusting users as co-developers • Harnessing collective intelligence • Leveraging the long tail through customer self-service • Software above the level of a single device • Lightweight user interfaces, development models, AND business models
What are we talking about? • Communities & all that social stuff? – Great, love it, should have done all this 20 years ago… • Easier to use web interfaces? – Love them as a user but they are (still) hard to build (tried JSF+AJAX+Swing Webflow - argh!!!) – Is it worth the effort? Researchers are not occasional users! • Existing web 2.0 applications? – Each great individually but try using them in combination… – How can I share my connotea™ bookmarks with my Facebook™ friends? • REST as an architectural style? – Good idea - for some applications - flipside of the Grid btw.
Web 2.0 and Grid computing • Simplify user interface • More flexible than (conventional) portal • Software as a service • Collaboration Grid • Knowledge Grid
Tools and mashups based on web service infrastructure http://www.chembiogrid.org/projects/proj_tools.html
Shared Bookmarking for Social Networks • MSI-CIEC project to support tagging and online shared bookmarking. – Pioneered by del.icio.us in 2003 (!) • Bookmarking services allow you to – Share links (URLs) with networks of friends – Organize your links by mnemonic tags – Find other interesting URLs by popularity (most bookmarked) – Find interesting URLs by keywords • When used collectively, tags form folksonomies. – “Pave the cow paths” – Typically about tagged URLs. – But also about people who tag. – Semantic Web Lesson: everything is a URI.
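The bookmarking operations listed above reduce to a few indexes: tag to URLs, URL to users, and tag to the people who used it. The sketch below is an illustrative in-memory model, not the MSI-CIEC or del.icio.us implementation; the user names and `example.org` URLs are made up.

```python
from collections import Counter, defaultdict

class BookmarkStore:
    """Toy shared-bookmark index supporting tag lookup and popularity ranking."""

    def __init__(self):
        self.by_tag = defaultdict(set)   # tag -> URLs carrying that tag
        self.by_url = defaultdict(set)   # URL -> users who bookmarked it
        self.taggers = defaultdict(set)  # tag -> users who used the tag

    def add(self, user, url, tags):
        """Record that a user bookmarked a URL under the given tags."""
        self.by_url[url].add(user)
        for tag in tags:
            self.by_tag[tag].add(url)
            self.taggers[tag].add(user)

    def most_bookmarked(self, n=1):
        """Return the n URLs bookmarked by the most distinct users."""
        counts = Counter({url: len(users) for url, users in self.by_url.items()})
        return [url for url, _ in counts.most_common(n)]

    def find(self, tag):
        """Return all URLs carrying a tag, in sorted order."""
        return sorted(self.by_tag.get(tag, ()))

# Hypothetical users and links.
store = BookmarkStore()
store.add("alice", "http://example.org/grid", ["grid", "hpc"])
store.add("bob", "http://example.org/grid", ["grid"])
store.add("bob", "http://example.org/web20", ["web2.0"])
print(store.most_bookmarked(), store.find("grid"), sorted(store.taggers["grid"]))
```

The `taggers` index reflects the point that a folksonomy describes the people who tag as well as the URLs being tagged.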
Grid Activity Summary • Grid Testbed – CFD Application – Virtualization • Information Grid • Grid CA
Grid Computing Testbed • Internal Level – NECTEC conducts tests of Globus Toolkit 4 on the following issues: • Middleware • Pre-WS and Web Services components • CFD on Grid • Gfarm file system
Grid Computing Testbed • National Level – NECTEC cooperates with the Thai National Grid Project (TNGP) to set Thailand Grid community standards.
Grid Computing Testbed • International Level – NECTEC has been an active member of the PRAGMA resources and data working group. – Aims to improve the interoperability of Grid middleware in the Asia Pacific region and make the Grid usable for scientists.
CFD on Grid • Computation requirements for CFD are very high. • Nevertheless, some major challenges still exist in CFD research; for example, turbulence research and very-large-scale CFD simulations. • Grid provides dependable, consistent, pervasive, and inexpensive access to high-end computational capability. • We investigate the feasibility and scalability of cross-platform simulation paradigms for a fine-grain application such as CFD on our Grid testbed.
CFD on Grid • Some Remarks – Grid infrastructure that does not depend on public IP addresses. – Ability to migrate jobs automatically. – Dedicated Grid environment for fine-grain applications such as CFD. – Improved algorithms for high-latency Grid infrastructure.
NECTEC GOC CA • A digital certificate issuer developed specifically to support authentication for Grid resources. • Developed under the X.509 Public Key Infrastructure by the Large Scale Simulation Research Laboratory (LSR), National Electronics and Computer Technology Center. • Issues certificates to users, hosts, and services. • Current Status: – Production-level CA under APGrid PMA – About 10 certificates issued (all for internal users)