Security: systems, clouds, models, and privacy challenges iDASH Symposium San Diego CA October 10-11 2011 Geoffrey.

Presentation transcript:

Security: systems, clouds, models, and privacy challenges
iDASH Symposium, San Diego CA, October 10-11, 2011
Geoffrey Fox
Director, Digital Science Center, Pervasive Technology Institute
Associate Dean for Research and Graduate Studies, School of Informatics and Computing
Indiana University Bloomington

Philosophy of Clouds and Grids
Clouds are (by definition) a commercially supported approach to large scale computing (data-sets)
  – So we should expect Clouds to continue to replace Compute Grids
  – Current Grid technology involves “non-commercial” software solutions which are hard to evolve/sustain
Public Clouds are broadly accessible resources like Amazon and Microsoft Azure: powerful but not easy to customize, and with data trust/privacy issues
Private Clouds run similar software and mechanisms but on “your own computers” (not clear if still elastic)
  – Platform features such as Queues, Tables, Databases currently limited
  – Still shared for cost effectiveness?
Services are still the correct architecture, with either REST (Web 2.0) or Web Services
Clusters are still a critical concept for either MPI or Cloud software

2 Aspects of Cloud Computing: Infrastructure and Runtimes
Cloud infrastructure: outsourcing of servers, computing, data, file space, utility computing, etc.
  – Handled through Web services that control virtual machine lifecycles
Cloud runtimes or Platform: tools (for using clouds) to do data-parallel (and other) computations
  – Apache Hadoop, Google MapReduce, Microsoft Dryad, Bigtable, Chubby and others
  – MapReduce designed for information retrieval but is excellent for a wide range of science data analysis applications
  – Can also do much traditional parallel computing for data-mining if extended to support iterative operations
  – Data Parallel File system as in HDFS and Bigtable
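The remark about extending MapReduce to support iterative operations can be illustrated with a minimal sketch. It is not taken from Hadoop, Twister, or any other runtime named on the slide; the function names (`map_phase`, `reduce_phase`, `iterative_mapreduce`) and the choice of one-dimensional k-means as the data-mining example are assumptions made purely for illustration.

```python
from collections import defaultdict

def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        yield nearest, p

def reduce_phase(pairs, centroids):
    """Reduce: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # A centroid with no assigned points keeps its previous value.
    return [sum(groups[i]) / len(groups[i]) if groups[i] else c
            for i, c in enumerate(centroids)]

def iterative_mapreduce(points, centroids, max_iters=100, tol=1e-9):
    """Driver: repeat map + reduce until the centroids stop moving."""
    for _ in range(max_iters):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if all(abs(a - b) <= tol for a, b in zip(updated, centroids)):
            return updated
        centroids = updated
    return centroids

# Hypothetical toy data: two obvious clusters on a line.
result = iterative_mapreduce([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
print(result)
```

The point of the sketch is the driver loop: a single MapReduce pass does one assignment-and-average step, and the data-mining algorithm only emerges when the runtime can feed the reduce output back into the next map phase.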

Biomedical Cloud Issues
Operating cost of a large shared (public) cloud ~20% that of traditional cluster
Gene sequencing cost decreasing much faster than Moore’s law
Biomedical computing does not need low latency (microsecond) synchronization of HPC Cluster
  – Amazon a factor of 6 less effective on HPC workloads than state of art HPC cluster
  – i.e. Clouds work for biomedical applications if we can make them convenient and address privacy and trust
Deduce that the natural infrastructure for biomedical data analysis is cloud plus (iterative) MapReduce
Software as a Service likely to be dominant usage model
  – Paid by “credit card” whether commercial, government or academic
  – “standard” services like BLAST plus services with your software

What is Modern Data System Architecture I?
Traditionally each new instrument or major project has a new data center established
  – e.g. in Astronomy each wavelength has its data center
Such centers offer
  – Data access with low level FTP/Web interface OR
  – Database access or other sophisticated search (e.g. GIS)
No agreement across fields if significant computing needed on data
  – Life Sciences tend to need substantial computing from assembly, alignment, clustering, ….
“Old model” was scientist downloading data for analysis in local computer system
  – Is this realistic with multi-petabyte datasets?
  – Maybe with Content Delivery Network (Caching)

What is Modern Data System Architecture II?
We are taught to “bring the computing to the data” but
  – Downloading data from central repository violates this
Could have a giant cloud with a co-located giant data store but not very plausible politically or technically
More likely multiple distributed 1-10 petabyte data archives with associated cloud (MapReduce) infrastructure
  – Analyses could still involve data and computing from multiple such environments
  – Need hierarchical algorithms but usually natural
These can be private or public clouds
For cost reasons, they will always be multi-user shared systems but can be ~single function
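One way to read "hierarchical algorithms" across several archives is a two-level computation: each archive reduces its own raw data to a small summary, and only the summaries travel to a top-level combiner. The sketch below assumes that reading; the site names, the data values, and the choice of mean and variance as the statistic are hypothetical.

```python
def local_summary(values):
    """Runs next to one archive: reduce raw data to (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def combine(summaries):
    """Runs at the top level: merge per-archive summaries into global statistics."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return mean, variance

# Hypothetical archives; in practice each list would stay at its own site.
archives = {"site_a": [1.0, 2.0, 3.0], "site_b": [4.0, 5.0]}
mean, variance = combine([local_summary(v) for v in archives.values()])
print(mean, variance)
```

Only three numbers per archive cross the network here, regardless of archive size, which is why this kind of decomposition suits multi-petabyte stores that cannot be moved.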

Trustworthy Cloud Computing
Public Clouds are elastic (can be scaled up and down) as large and shared
  – Sharing implies privacy and security concerns; need to learn how to use shared facilities
Private clouds are not easy to make elastic or cost effective (as too small)
  – Need to support public (aka shared) and private clouds
“Amazon is 100X more secure than your infrastructure” (Bio-IT Boston April 2011)
  – But how do we establish this trust?
“Amazon is more or less useless as NIH will only let us run 20% of our genomic data on it so not worth the effort to port software to cloud” (Bio-IT Boston)
  – Need to establish trust

Inside Modern Data System Architecture III?
Even within our cloud, we can examine data architecture with ~3 major choices
  1) Shared file system (Lustre, GPFS, NFS …) as used to support high performance computing
  2) Object Store such as S3 (Amazon) or Swift (OpenStack)
  3) Data Parallel File Systems such as Hadoop or Google File Systems
Shared File or Object Stores separate computing and data and are limited by bandwidth of compute cluster to storage system connection
  – Intra cluster bandwidth >> inter cluster bandwidth?
Data Parallel File Systems canNOT put computing on same NODE as data in a multi-user environment
  – Can put data on same CLUSTER as computing

Traditional 3-level File System?

Data Parallel File System?
No archival storage and computing brought to data
[Figure: nodes pairing compute (C) with Data; File1 → Breakup → Block1, Block2, …, BlockN → Replicate each block]
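The "Breakup" and "Replicate each block" steps can be sketched in a few lines. This is an illustration only: the round-robin placement, the node names, and the function names are assumptions, and real systems such as HDFS use their own (for example rack-aware) placement policies.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Breakup: cut a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    """Replicate: assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

# Hypothetical 10-byte file on a 4-node cluster.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["n0", "n1", "n2", "n3"], replication=3)
print(blocks)
print(placement)
```

A scheduler can then send the task for a block to any node listed in `placement[block]`, which is the sense in which computing is brought to the data.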

Trustworthy Cloud Approaches
Rich access control with roles and sensitivity to combined datasets
Anonymization & Differential Privacy
  – defend against sophisticated datamining and establish trust that it can
Secure environments (systems) such as Amazon Virtual Private Cloud
  – defend against sophisticated attacks and establish trust that it can
Application specific approaches such as database privacy
Hierarchical algorithms where sensitive computations need modest computing on non-shared resources
Iterative MapReduce can be built on classic pub-sub communication software with known security approaches
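As a concrete instance of the differential privacy item, the sketch below applies the standard Laplace mechanism to a counting query. The slide does not name a mechanism, so this choice is an assumption, as are the function name `private_count`, the record layout, and the `epsilon` values. Laplace noise is drawn as the difference of two exponential samples.

```python
import random

def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of matching records (Laplace mechanism).

    A counting query changes by at most 1 when one record is added or
    removed, so its sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    rate = epsilon  # rate = 1 / scale, with scale = 1 / epsilon
    # The difference of two i.i.d. exponentials is Laplace(0, 1 / rate).
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise

# Hypothetical records: does a sample carry a given variant?
samples = [{"variant": True}, {"variant": False}, {"variant": True}, {"variant": True}]
noisy = private_count(samples, lambda r: r["variant"], epsilon=0.5)
print(noisy)
```

Smaller `epsilon` means more noise and stronger privacy; the released value is deliberately not the exact count of 3.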

Twister v0.9, March 15, 2011
New Interfaces for Iterative MapReduce Programming
SALSA Group
Bingjing Zhang, Yang Ruan, Tak-Lon Wu, Judy Qiu, Adam Hughes, Geoffrey Fox, Applying Twister to Scientific Applications, Proceedings of IEEE CloudCom 2010 Conference, Indianapolis, November 30-December 3, 2010
Twister4Azure to be released May 2011
MapReduceRoles4Azure available now at

Twister4Azure Architecture

BLAST Sequence Search
Cap3 Sequence Assembly
Smith Waterman Sequence Alignment

Multidimensional Scaling (MDS) Performance
30,000*30,000 data points, 15 instances, 3 MR steps per iteration, 30 Map tasks per application
[Plot: Speedup vs. # Instances]
Probably super linear as used small instances

100,043 Metagenomics Sequences
Scaling to 10’s of millions with Twister on cloud