1 The Terabyte Analysis Machine Jim Annis, Gabriele Garzoglio, Jun 2001 Introduction The Cluster Environment The Distance Machine Framework Scales The.

Slides:



Advertisements
Similar presentations
Microsoft Research Microsoft Research Jim Gray Distinguished Engineer Microsoft Research San Francisco SKYSERVER.
Advertisements

1 Online Science The World-Wide Telescope as a Prototype For the New Computational Science Jim Gray Microsoft Research
Prof. Natalia Kussul, PhD. Andrey Shelestov, Lobunets A., Korbakov M., Kravchenko A.
Remote Visualisation System (RVS) By: Anil Chandra.
CONCEPTUAL WEB-BASED FRAMEWORK IN AN INTERACTIVE VIRTUAL ENVIRONMENT FOR DISTANCE LEARNING Amal Oraifige, Graham Oakes, Anthony Felton, David Heesom, Kevin.
University of St Andrews School of Computer Science Experiences with a Private Cloud St Andrews Cloud Computing co-laboratory James W. Smith Ali Khajeh-Hosseini.
University of Chicago Department of Energy The Parallel and Grid I/O Perspective MPI, MPI-IO, NetCDF, and HDF5 are in common use Multi TB datasets also.
SkewReduce YongChul Kwon Magdalena Balazinska, Bill Howe, Jerome Rolia* University of Washington, *HP Labs Skew-Resistant Parallel Processing of Feature-Extracting.
1. Topics Is Cloud Computing the way to go? ARC ABM Review Configuration Basics Setting up the ARC Cloud-Based ABM Hardware Configuration Software Configuration.
1 CEOS/WGISS20 – Kyiv – September 13, 2005 Paul Kopp SIPAD New Generation: Dominique Heulet CNES 18, Avenue E.Belin Toulouse Cedex 9 France
Bin Fu Eugene Fink, Julio López, Garth Gibson Carnegie Mellon University Astronomy application of Map-Reduce: Friends-of-Friends algorithm A distributed.
20 Spatial Queries for an Astronomer's Bench (mark) María Nieto-Santisteban 1 Tobias Scholl 2 Alexander Szalay 1 Alfons Kemper 2 1. The Johns Hopkins University,
A Web service for Distributed Covariance Computation on Astronomy Catalogs Presented by Haimonti Dutta CMSC 691D.
1998/5/21by Chang I-Ning1 ImageRover: A Content-Based Image Browser for the World Wide Web Introduction Approach Image Collection Subsystem Image Query.
Title US-CMS User Facilities Vivian O’Dell US CMS Physics Meeting May 18, 2001.
The new The new MONARC Simulation Framework Iosif Legrand  California Institute of Technology.
Astro-DISC: Astronomy and cosmology applications of distributed super computing.
FLANN Fast Library for Approximate Nearest Neighbors
UNIVERSITY of MARYLAND GLOBAL LAND COVER FACILITY High Performance Computing in Support of Geospatial Information Discovery and Mining Joseph JaJa Institute.
Ch 4. The Evolution of Analytic Scalability
CONTI’2008, 5-6 June 2008, TIMISOARA 1 Towards a digital content management system Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal.
N Tropy: A Framework for Analyzing Massive Astrophysical Datasets Harnessing the Power of Parallel Grid Resources for Astrophysical Data Analysis Jeffrey.
Alex Szalay, Jim Gray Analyzing Large Data Sets in Astronomy.
Appendix B Planning a Virtualization Strategy for Exchange Server 2010.
Profiling Grid Data Transfer Protocols and Servers George Kola, Tevfik Kosar and Miron Livny University of Wisconsin-Madison USA.
Submitted by: Shailendra Kumar Sharma 06EYTCS049.
Master Thesis Defense Jan Fiedler 04/17/98
Introduction to Apache OODT Yang Li Mar 9, What is OODT Object Oriented Data Technology Science data management Archiving Systems that span scientific.
November 2, 2000HEPiX/HEPNT FERMI SAN Effort Lisa Giacchetti Ray Pasetes GFS information contributed by Jim Annis.
DATABASE MANAGEMENT SYSTEMS IN DATA INTENSIVE ENVIRONMENNTS Leon Guzenda Chief Technology Officer.
GRNET Greek Research & Education Network GRNET Simple Storage – GSS Ioannis Liabotis, Panos Louridas Amsterdam, June 2007.
Instrumentation of the SAM-Grid Gabriele Garzoglio CSC 426 Research Proposal.
SLAC Site Report Chuck Boeheim Assistant Director, SLAC Computing Services.
Indexing and Visualizing Multidimensional Data I. Csabai, M. Trencséni, L. Dobos, G. Herczegh, P. Józsa, N. Purger Eötvös University,Budapest.
Yunhong Gu and Robert Grossman University of Illinois at Chicago 碩資工一甲 王聖爵
Quick Introduction to NorduGrid Oxana Smirnova 4 th Nordic LHC Workshop November 23, 2001, Stockholm.
Challenges in Mining Large Image Datasets Jelena Tešić, B.S. Manjunath University of California, Santa Barbara
Tevfik Kosar Computer Sciences Department University of Wisconsin-Madison Managing and Scheduling Data.
LOGO Development of the distributed computing system for the MPD at the NICA collider, analytical estimations Mathematical Modeling and Computational Physics.
James Annis Gabriele Garzoglio Peretz Partensky Chris Stoughton The Experimental Astrophysics Group, Fermilab The Terabyte Analysis Machine Project data.
Near Real-Time Verification At The Forecast Systems Laboratory: An Operational Perspective Michael P. Kay (CIRES/FSL/NOAA) Jennifer L. Mahoney (FSL/NOAA)
Hadoop IT Services Hadoop Users Forum CERN October 7 th,2015 CERN IT-D*
Grid Appliance The World of Virtual Resource Sharing Group # 14 Dhairya Gala Priyank Shah.
COMP381 by M. Hamdi 1 Clusters: Networks of WS/PC.
EC Review – 01/03/2002 – WP9 – Earth Observation Applications – n° 1 WP9 Earth Observation Applications 1st Annual Review Report to the EU ESA, KNMI, IPSL,
CS 157B: Database Management Systems II April 10 Class Meeting Department of Computer Science San Jose State University Spring 2013 Instructor: Ron Mak.
09/13/04 CDA 6506 Network Architecture and Client/Server Computing Peer-to-Peer Computing and Content Distribution Networks by Zornitza Genova Prodanoff.
Parallel IO for Cluster Computing Tran, Van Hoai.
Markus Frank (CERN) & Albert Puig (UB).  An opportunity (Motivation)  Adopted approach  Implementation specifics  Status  Conclusions 2.
Grid Activities in CMS Asad Samar (Caltech) PPDG meeting, Argonne July 13-14, 2000.
HARD DISKS. INTRODUCTION TO HARD DISKS  Hard disk is the core fundamental component of the Computer system.  A mass storage device that stores the permanent.
IT 5433 LM1. Learning Objectives Understand key terms in database Explain file processing systems List parts of a database environment Explain types of.
Microsoft Research San Francisco (aka BARC: bay area research center) Jim Gray Researcher Microsoft Research Scalable servers Scalable servers Collaboration.
Canadian Bioinformatics Workshops
Hadoop Clusters Tess Fulkerson.
University of Technology
BARC Scaleable Servers
SDM workshop Strawman report History and Progress and Goal.
Data Warehousing and Data Mining
Ch 4. The Evolution of Analytic Scalability
COMPASS Database SPACE TELESCOPE SCIENCE INSTITUTE Gretchen Greene
CS246 Search Engine Scale.
TeraScale Supernova Initiative
Web Mining Department of Computer Science and Engg.
CS246: Search-Engine Scale
Google Sky.
Automatic and Efficient Data Virtualization System on Scientific Datasets Li Weng.
FREERIDE: A Framework for Rapid Implementation of Datamining Engines
L. Glimcher, R. Jin, G. Agrawal Presented by: Leo Glimcher
Presentation transcript:

1 The Terabyte Analysis Machine Jim Annis, Gabriele Garzoglio, Jun 2001 Introduction The Cluster Environment The Distance Machine Framework Scales The Technology Performance Tests Conclusions Links

2 The Terabyte Analysis Machine Jim Annis, Gabriele Garzoglio, Jun 2001 Introduction The goal of the Terabyte Analysis Machine (TAM) project is to explore and integrate emerging and stable technologies into an efficient computing environment for physicists and astronomers working on data intensive scientific analysis. The studies in progress are in the areas of Knowledge and Distributed Intelligence (KDI) and Large Distributed Archives in Astronomy and Particle Physics (ALDAP)

3 The Terabyte Analysis Machine Jim Annis, Gabriele Garzoglio, Jun 2001 The Cluster Environment TAM is a cluster of (currently) 5 Linux boxes (2xPIII 650 MHz, 1 GB RAM, 72 GB IDE disk) connected to a 1 TB raid box via fibre channel It uses GFS as File System. Condor as Batch System.

4 The Terabyte Analysis Machine Jim Annis, Gabriele Garzoglio, Jun 2001 The Distance Machine Framework As an example we have implemented an analysis engine for the k th nearest neighbor search: ANN. ANN uses the SDSS (SX) database (John Hopkins University) It extracts the relevant data from SX using the Tag Extractor utility developed by Koen Holtman (Caltech) The Distance Machine framework allows specific analysis programs to transparently access large databases, making best use of sophisticated specific algorithms for indexing and partitioning the database.

5 The Terabyte Analysis Machine Jim Annis, Gabriele Garzoglio, Jun 2001 Scales The SDSS database will have: 15 Tbytes image database 500 Gbytes catalog database 250 million objects 100 million galaxies (example of the search for cluster of galaxies) 2.5 Kbytes object size 500 byte tag object size (example of tag with about 50 parameters of interest) 50 Gbytes tag database Moving the tag database: 50 Gbytes at 10 Mbytes/sec => 1.5 hours to transfer over network 50 Gbytes at 30 Mbytes/sec => 0.5 hours to write to disk Dimensionality of the problem: 120 parameters (total) per object

6 The Terabyte Analysis Machine Jim Annis, Gabriele Garzoglio, Jun 2001 The Technology Due to the client/server architecture, ANN works in a distributed environment. On TAM, many ANN servers can run in parallel, each one dealing with a different parameter space, topology and portion of the sky. Astrotool can be used as an analysis client for ANN. ANN acts as a server that provides nearest neighbors to an analysis client using corba interfaces.

7 The Terabyte Analysis Machine Jim Annis, Gabriele Garzoglio, Jun 2001 Performance Tests The Distance Machine finds Nearest Neighbors using the kd-tree technique. This technique is O(N logN) Re-clustering techniques improve performance

8 The Terabyte Analysis Machine Jim Annis, Gabriele Garzoglio, Jun 2001 Conclusions The Distance Machine implements a framework to study how re-indexing and re-partitioning techniques increase performance on parallel access to large data sets. It’s flexible enough to be used on data other than astronomy (e.g. “out of the box”: study of Nearest Neighbors techniques for particle tagging) It is a core engine for cluster finding analysis on the SDSS data. The Distance Machine can be used as a test bed for the GriPhyN tools (virtual NN data sets).

9 The Terabyte Analysis Machine Jim Annis, Gabriele Garzoglio, Jun 2001 Links Compute.html

10 The Terabyte Analysis Machine Jim Annis, Gabriele Garzoglio, Jun 2001 The Distance Machine ANN/Objectivity (Garzoglio/Annis) Adaptive smoothing kernel Similarity search Cluster search in high dimensions Current use: spectroscopic cluster catalog (the Finger of God catalog) Next project: (summer student Peretz Partensky) link DM/maxBcg to Kent image base Run cluster finder on a objects on demand (Griphyn…) Show nearest neighbors on images Next instrument: kd-tree based range search Range search the basis for most environmental studies

11 The Terabyte Analysis Machine Jim Annis, Gabriele Garzoglio, Jun 2001 The Terabyte Analysis Machine The compute cluster Intended to hold tsObj files on disk Intended as Fermi/SDSS science analysis machine Performs well as cluster finder Non-EAG users: TAG, UC, Penn State, CMU Purchase in the works to double the machine size Upgrade cluster software to use QCD farm experience Suffering from a lack of computer professional support: Only Annis and Garzoglio can bring the machine fully up Dan Yocum (EAG) will be brought up to speed/ownership GFS tutorial at Sistina soon