1 High Performance Presentation: 5 slides/Minute? (65 slides / 15 minutes) IO and DB “stuff” for LSST A new world record? Jim Gray Microsoft Research.

1 1 High Performance Presentation: 5 slides/Minute? (65 slides / 15 minutes) IO and DB “stuff” for LSST A new world record? Jim Gray Microsoft Research

2 2 TerraServer Lessons Learned Hardware is 5 9’s (with clustering) Software is 5 9’s (with clustering) Admin is 4 9’s (offline maintenance) Network is 3 9’s (mistakes, environment) Simple designs are best 10 TB DB is management limit 1 PB = 100 x 10 TB DB this is 100x better than 5 years ago. (yahoo!, HotMail are 300TB, Google! Is 2PB) Minimize use of tape –Backup to disk (snapshots) –Portable disk TBs

3 3 Serving BIG images Break into tiles (compressed): –10KB for modems –1MB for LANs Mosaic the tiles for pan, crop Store image pyramid for zoom –2x zoom only adds 33% overhead: 1 + ¼ + 1/16 + … Use a spatial index to cluster & find objects Figure labels: 1.6 x 1.6 km² image, 0.8 x 0.8 km² image, 0.4 x 0.4 km² image, 0.2 x 0.2 km² tile
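The 33% figure is the sum of the geometric series on the slide. A minimal sketch, assuming each coarser pyramid level halves the resolution in both dimensions (so it holds a quarter of the pixels of the level below); the function name is illustrative only:

```python
def pyramid_overhead(levels: int, zoom_factor: int = 2) -> float:
    """Extra storage for `levels` coarser levels, as a fraction of the base image."""
    area_ratio = zoom_factor ** 2  # a 2x zoom step shrinks pixel count by 4x
    return sum(1 / area_ratio ** k for k in range(1, levels + 1))


if __name__ == "__main__":
    for levels in (1, 2, 4, 8):
        print(levels, round(pyramid_overhead(levels), 4))
```

The overhead approaches 1/3 as levels are added, which is where "only adds 33%" comes from.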

4 4 Economics People are more than 50% of costs Disks are more than 50% of capital Networking is the other 50% –People –Phone bill –Routers Cpus are free (they come with the disks)

5 5 SkyServer/ SkyQuery Lessons DB is easy Search –It is BEST to index –You can put objects and attributes in a row (SQL puts big blobs off-page) –If you can’t index, you can extract attributes and quickly compare –SQL can scan at 5M records/cpu/second –Sequential scans are embarrassingly parallel Web services are easy XML Data Sets : –a universal way to represent answers –minimize round trips: 1 request/response –Diffgrams allow disconnected update

6 6 How Will We Find Stuff? Put everything in the DB (and index it) Need dbms features: Consistency, Indexing, Pivoting, Queries, Speed/scalability, Backup, replication If you don’t use one, you’re creating one! Simple logical structure: –Blob and link is all that is inherent –Additional properties (facets == extra tables) and methods on those tables (encapsulation) More than a file system Unifies data and meta-data Simpler to manage Easier to subset and reorganize Set-oriented access Allows online updates Automatic indexing, replication SQL

7 7 How Do We Represent Data To The Outside World? File metaphor too primitive: just a blob Table metaphor too primitive: just records Need Metadata describing data context –Format –Provenance (author/publisher/citations/…) –Rights –History –Related documents In a standard format XML and XML schema DataSet is great example of this World is now defining standard schemas Figure labels: schema, data or diffgram

8 8 Emerging Concepts Standardizing distributed data –Web Services, supported on all platforms –Custom configure remote data dynamically –XML: Extensible Markup Language –SOAP: Simple Object Access Protocol –WSDL: Web Services Description Language –DataSets: Standard representation of an answer Standardizing distributed computing –Grid Services –Custom configure remote computing dynamically –Build your own remote computer, and discard –Virtual Data: new data sets on demand

9 9 Szalay’s Law: The utility of N comparable datasets is N² Metcalfe’s law applies to telephones, fax, Internet. Szalay argues as follows: Each new dataset gives new information. 2-way combinations give new information. Example: Combine these 3 datasets –(ID, zip code) –(ID, birth day) –(ID, height) Other example: quark star: Chandra Xray + Hubble optical, +600 year old records.. Drake, J. J. et al. Is RX J a Quark Star?. Preprint, (2002). X-ray, optical, infrared, and radio views of the nearby Crab Nebula, which is now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese Astronomers. Crab star 1053 AD

10 10 Science is hitting a wall FTP and GREP are not adequate You can GREP 1 MB in a second You can GREP 1 GB in a minute You can GREP 1 TB in 2 days You can GREP 1 PB in 3 years. Oh!, and 1PB ~10,000 disks At some point you need indices to limit search, plus parallel data search and analysis tools This is where databases can help You can FTP 1 MB in 1 sec You can FTP 1 GB / min (= 1 $/GB) … 2 days and 1K$ … 3 years and 1M$
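The GREP numbers are sequential-scan arithmetic. A rough sketch, assuming a single sustained scan rate of 10 MB/s (an assumption chosen because it reproduces the "about 3 years per PB" figure; the slide does not state a rate) and the slide's ~10,000 disks for the parallel case:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def scan_seconds(num_bytes: float, bytes_per_second: float) -> float:
    """Time for one reader to scan `num_bytes` sequentially."""
    return num_bytes / bytes_per_second


def parallel_scan_seconds(num_bytes: float, bytes_per_second: float, disks: int) -> float:
    """Same scan when the data is spread evenly over `disks` independent readers."""
    return num_bytes / (bytes_per_second * disks)


if __name__ == "__main__":
    petabyte = 1e15
    rate = 10e6  # assumed 10 MB/s per reader
    print(scan_seconds(petabyte, rate) / SECONDS_PER_YEAR, "years, one reader")
    print(parallel_scan_seconds(petabyte, rate, 10_000) / 3600, "hours, 10,000 disks")
```

Under these assumptions a serial scan takes years while a fully parallel one takes hours, which is the case for parallel search and for indices that avoid the scan altogether.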

11 11 Networking: Great hardware & Software 5 GBps (1 λ = 40 Gbps) Gbps Ethernet common (~100 MBps) –Offload gives ~2 Hz/byte –Will improve with RDMA & zero-copy –10 Gbps mainstream by 2004 Faster I/O –1 GB/s today (measured) –10 GB/s under development –SATA (serial ATA) 150 MBps/device

12 12 Bandwidth: 3x bandwidth/year for 25 more years Today: –40 Gbps per channel (λ) –12 channels per fiber (wdm): 500 Gbps –32 fibers/bundle = 16 Tbps/bundle In lab 3 Tbps/fiber (400 x WDM) In theory 25 Tbps per fiber 1 Tbps = USA 1996 WAN bisection bandwidth Aggregate bandwidth doubles every 8 months! 1 fiber = 25 Tbps

13 13 Hero/Guru Networking Figure labels: Redmond/Seattle, WA; San Francisco, CA; New York; Arlington, VA; 5626 km, 10 hops; Information Sciences Institute; Microsoft; Qwest; University of Washington; Pacific Northwest Gigapop; HSCC (high speed connectivity consortium); DARPA

14 14 Real Networking Bandwidth for 1 Gbps “stunt” cost 400k$/month – ~ 200$/Mbps/m (at each end + hardware + admin) –Price not improving very fast –Doesn’t include operations / local hardware costs Admin… costs more ~1$/GB to 10$/GB Challenge: Go home and FTP from a “fast”server The Guru Gap: FermiLab JHU –Both “well connected” –vBNS, NGI, Internet2, Abilene,…. –Actual desktop-to-desktop ~ 100KBps –12 days/TB (but it crashes first). The reality: to move 10GB, mail it! TeraScale Sneakernet

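15 15 How Do You Move A Terabyte?
Context | Speed Mbps | Rent $/month | $/Mbps | $/TB Sent | Time/TB
Home phone | … | … | 1,… | 3,086 | 6 years
Home DSL | … | … | … | … | 5 months
T1 | 1.5 | …,200 | … | 2,… | 2 months
T3 | 43 | …,000 | … | 2,… | 2 days
OC3 | 155 | …,000 | … | … | 14 hours
100 Mbps | … | … | … | … | 1 day
Gbps | 1000 | … | … | … | … hours
OC … | … | …,920,… | … | … | 14 minutes
Source: TeraScale Sneakernet, Microsoft Research, Jim Gray et al.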
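The Time/TB column follows from link speed alone. A small sketch, assuming 1 TB = 10^12 bytes and a link that sustains its full nominal rate (real transfers will be slower); the helper names are illustrative:

```python
def seconds_per_terabyte(speed_mbps: float) -> float:
    """Seconds to send 1 TB (10**12 bytes) over a link of `speed_mbps` megabits/s."""
    bits_per_terabyte = 8 * 10**12
    return bits_per_terabyte / (speed_mbps * 10**6)


def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that fits."""
    units = (("years", 365 * 86400), ("days", 86400), ("hours", 3600), ("minutes", 60))
    for unit, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.0f} seconds"


if __name__ == "__main__":
    for context, mbps in (("T1", 1.5), ("T3", 43), ("OC3", 155), ("Gbps", 1000)):
        print(context, humanize(seconds_per_terabyte(mbps)))
```

For the speeds that survive in the table (1.5, 43, and 155 Mbps) this gives roughly two months, two days, and 14 hours, in line with the listed times.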

16 16 There Is A Problem GREAT!!!! –XML documents are portable objects –XML documents are complex objects –WSDL defines the methods on objects (the class) But will all the implementations match? –Think of UNIX or SQL or C or… This is a work in progress. Niklaus Wirth: Algorithms + Data Structures = Programs

17 17 Changes To DBMS’s Integration of Programs and Data –Put programs inside the database allows OODB –Gives you parallel execution Integration of Relational, Text, XML, Time Scaleout (even more) AutoAdmin (“no knobs”) Manage Petascale databases (utilities, geoplex, online, incremental)

18 18 Publishing Data Roles Authors Publishers Curators Archives Consumers Traditional Scientists Journals Libraries Archives Scientists Emerging Collaborations Project web site Data+Doc Archives Digital Archives Scientists

19 19 The Core Problem: No Economic Model The archive user has not yet been born. How can he pay you to curate the data? The Scientist gathered data for his own purpose Why should he pay (invest time) for your needs? Answer to both: that’s the scientific method Curating data (documenting the design, the acquisition and the processing) Is very hard and there is no reward for doing it. The results are rewarded, not the process of getting them. Storage/archive NOT the problem (it’s almost free) Curating/Publishing is expensive.

20 20 SDSS Data Inflation – Data Pyramid Level 1A Grows 5TB pixels/year growing to 25TB ~ 2 TB/y compressed growing to 13TB ~ 4 TB today (level 1A in NASA terms) Level 2 Derived data products ~10x smaller But there are many catalogs. Publish new edition each year –Fixes bugs in data. –Must preserve old editions –Creates data pyramid Store each edition –1, 2, 3, 4… N ~ N² bytes Net: Data Inflation: L2 ≥ L1 Figure (editions E1, E2, E3, E4 over time): 4 editions of level 1A data (source data) and 4 editions of level 2 derived data products. Note that each derived product is small, but they are numerous. This proliferation combined with the data pyramid implies that level 2 data more than doubles the total storage volume.
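The "1, 2, 3, 4… N ~ N²" remark is a running sum. A minimal sketch, assuming edition k is k units in size (one more unit of data per yearly edition, which is a simplification) and that every old edition is kept:

```python
def total_units_stored(editions: int) -> int:
    """Total storage when editions of size 1, 2, ..., N are all preserved."""
    return sum(range(1, editions + 1))


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        # The closed form N(N+1)/2 grows like N**2 / 2.
        print(n, total_units_stored(n), n * (n + 1) // 2)
```

Doubling the number of preserved editions roughly quadruples the archive, which is the data inflation the slide warns about.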

21 21 What’s needed? (not drawn to scale) Science Data & Questions Scientists Database To store data Execute Queries Plumbers Data Mining Algorithms Miners Question & Answer Visualization Tools

22 22 CS Challenges For Astronomers Objectify your field: –Precisely define what you are talking about. –Objects and Methods / Attributes –This is REALLY difficult. –UCDs are a great start but, there is a long way to go “Software is like entropy, it always increases.” -- Norman Augustine, Augustine’s Laws –Beware of legacy software – cost can eat you alive –Share software where possible. –Use standard software where possible. –Expect it will cost you 25% to 40% of project.  Explain what you want to do with the VO –20 queries or something like that. Science Data & Questions Scientists

23 23 Challenge to Data Miners: Linear and Sub-Linear Algorithms Today most correlation / clustering algorithms are polynomial N² or N³ or… N² is VERY big when N is big (10^18 is big) Need sub-linear algorithms Current approaches are near optimal given current assumptions. So, need new assumptions probably heuristic and approximate Data Mining Algorithms Miners Techniques
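To make "10^18 is big" concrete, here is a back-of-the-envelope sketch. It assumes N = 10^9 objects and a machine doing 10^9 simple operations per second; both numbers are illustrative assumptions, not measurements from the talk:

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def quadratic_seconds(n: int, ops_per_second: float = 1e9) -> float:
    """Run time of an all-pairs (N**2) algorithm."""
    return n**2 / ops_per_second


def nlogn_seconds(n: int, ops_per_second: float = 1e9) -> float:
    """Run time of an N log2 N algorithm on the same machine."""
    return n * math.log2(n) / ops_per_second


if __name__ == "__main__":
    n = 10**9
    print(quadratic_seconds(n) / SECONDS_PER_YEAR, "years for N^2")
    print(nlogn_seconds(n), "seconds for N log N")
```

Under these assumptions the quadratic algorithm needs decades while N log N finishes in under a minute.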

24 24 Challenge to Data Miners: Rediscover Astronomy Astronomy needs deep understanding of physics. But, some was discovered as variable correlations then “explained” with physics. Famous example: Hertzsprung-Russell Diagram star luminosity vs color (=temperature) Challenge 1 (the student test): How much of astronomy can data mining discover? Challenge 2 (the Turing test): Can data mining discover NEW correlations? Data Mining Algorithms Miners

25 25 Plumbers: Organize and Search Petabytes Automate –instrument-to-archive pipelines It is a messy business – very labor intensive Most current designs do not scale (too many manual steps) BaBar (1TB/day) and ESO pipeline seem promising. A job-scheduling or workflow system –Physical Database design & access Data access patterns are difficult to anticipate Aggressively and automatically use indexing, sub-setting. Search in parallel Goals –Answer easy queries in 10 seconds. –Answer hard queries (correlations) in 10 minutes. Database To store data Execute Queries Plumbers

26 26 Scaleable Systems Scale UP: grow by adding components to a single system. Scale OUT: grow by adding more systems.

27 27 What’s New – Scale Up 64 bit & TB size main memory SMP on chip: everything’s smp 32… 256 SMP: locality/affinity matters TB size disks High-speed LANs

28 28 Who needs 64-bit addressing? You! Need 64-bit addressing! “640K ought to be enough for anybody.” Bill Gates, 1981 But that was 21 years ago == 2 x 21/3 = 14 bits ago. 20 bits + 14 bits = 34 bits so.. “16GB ought to be enough for anybody” Jim Gray. 34 bits > 31 bits so… 34 bits == 64 bits YOU need 64 bit addressing!

29 29 64 bit – Why bother? 1966 Moore’s law: 4x more RAM every 3 years. 1 bit of addressing every 18 months 36 years later: 2 x 36/3 = 24 more bits Not exactly right, but… 32 bits not enough for servers 32 bits gives no headroom for clients So, time is running out ( has run out ) Good news: Itanium™ and Hammer™ are maturing And so is the base software (OS, drivers, DB, Web,...) Windows & 256GB today!
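The address-bit arithmetic on these two slides can be written out directly. A minimal sketch, taking the slides' premise that RAM grows 4x every 3 years (that is, 2 address bits per 3 years) and that 640K is roughly a 20-bit address space; the function name is illustrative:

```python
def address_bits_added(years: float, bits_per_3_years: float = 2.0) -> float:
    """Extra address bits needed after `years` of 4x-per-3-years RAM growth."""
    return bits_per_3_years * years / 3


if __name__ == "__main__":
    since_1981 = address_bits_added(21)   # 14 bits
    since_1966 = address_bits_added(36)   # 24 bits
    total_bits = 20 + int(since_1981)     # 34 bits
    print(since_1981, since_1966, total_bits, 2**total_bits // 2**30, "GB")
```

34 bits addresses 16 GB, which already exceeds what a 31- or 32-bit address can reach, hence the jump to 64-bit.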

30 30 64 bit – why bother? Memory intensive calculations: –You can trade memory for IO and processing Example: Data Analysis & Clustering at JHU in memory CPU time is ~N log N, N ~ 100M Disk M chunks → time ~ M² must run many times Now running on HP Itanium Windows .NET Server 2003 SQL Server Graph courtesy of Alex Szalay & Adrian Pope of Johns Hopkins University (time axis labels: day, week, month, year, decade)

31 31 Amdahl’s balanced System Laws 1 mips needs 4 MB ram and needs 20 IO/s At 1 billion instructions per second need 4 GB/cpu need 50 disks/cpu! 64 cpus … 3,000 disks 1 bips cpu 4 GB RAM 50 disks 10,000 IOps 7.5 TB

32 32 The 5 Minute Rule – Trade RAM for Disk Arms If data re-referenced every 5 minutes It is cheaper to cache it in ram than to get it from disk A disk access/second ~ 50$ or ~ 50MB for 1 second or ~ 50KB for 1,000 seconds. Each app has a memory “knee” Up to the knee, more memory helps a lot.
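The break-even point on this slide can be computed from its two prices. A sketch, assuming one disk access per second costs about $50 (from the slide) and RAM costs about $1/MB (inferred from the slide's "50$ or 50MB"; treat it as an assumption); the function name is illustrative:

```python
def break_even_seconds(object_bytes: float,
                       dollars_per_access_per_second: float = 50.0,
                       ram_dollars_per_mb: float = 1.0) -> float:
    """Re-reference interval at which caching in RAM costs the same as re-reading from disk.

    An object fetched once every T seconds uses 1/T of a disk arm, so the disk
    cost is dollars_per_access_per_second / T; the RAM cost is the object's size
    times the RAM price. Setting the two equal and solving for T gives the result.
    """
    ram_cost = object_bytes / 1e6 * ram_dollars_per_mb
    return dollars_per_access_per_second / ram_cost


if __name__ == "__main__":
    for size in (50e6, 50e3, 8e3):
        print(int(size), "bytes ->", round(break_even_seconds(size)), "seconds")
```

With these prices a 50 MB object breaks even at 1 second and a 50 KB object at 1,000 seconds, matching the slide; data re-referenced more often than that is cheaper to keep in RAM.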

33 33 64 bit Reduces IO, saves disks Large memory reduces IO 64-bit simplifies code Processors can be faster (wider word) Ram is cheap (4 GB ~ 1k$ to 20k$) Can trade ram for disk IO Better response time. Example –tpcC 4x1Ghz Itanium2 vs 4x1.6Ghz IA32 40 extra GB → 60% extra throughput 4x1.6Ghz IA32 8GB 4x1 Ghz IA64 48GB 4x1.6Ghz IA32 32GB

34 34 AMD Hammer™ Coming Soon AMD Hammer™ is 64bit capable 2003: millions of Hammer™ CPUs will ship 2004: most AMD CPUs will be 64bit 4GB ram is less than 1,000$ today less than 500$ in 2004 Desktops (Hammer™) and servers (Opteron™). You do the math,… Who will demand 64bit capable software?

35 35 A 1TB Main Memory Amdahl’s law: 1mips/MB, now 1:5 so ~20 x 10 Ghz cpus need 1TB ram 1TB ram ~ 250k$ … 2m$ today ~ 25k$ … 200k$ in 5 years 128 million pages –Takes a LONG time to fill –Takes a LONG time to refill Needs new algorithms Needs parallel processing Which leads us to… –The memory hierarchy –smp –numa
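The "128 million pages" and "takes a LONG time to fill" points are quick to check. A sketch, assuming 8 KB pages and a 1 GB/s I/O rate (the "measured today" figure quoted earlier in the talk); both are assumptions about what the slide had in mind:

```python
TERABYTE = 2**40


def page_count(memory_bytes: int = TERABYTE, page_bytes: int = 8 * 1024) -> int:
    """Number of fixed-size pages in a memory of the given size."""
    return memory_bytes // page_bytes


def fill_seconds(memory_bytes: int = TERABYTE, bytes_per_second: float = 1e9) -> float:
    """Time to load the whole memory once at a sustained I/O rate."""
    return memory_bytes / bytes_per_second


if __name__ == "__main__":
    print(page_count() // 2**20, "Mi pages")
    print(round(fill_seconds() / 60, 1), "minutes to fill at 1 GB/s")
```

Even at 1 GB/s, a cold start takes on the order of 18 minutes before the memory is warm, which is why refills after a restart matter.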

36 36 Hyper-Threading: SMP on chip If cpu is always waiting for memory Predict memory requests and prefetch –done If cpu still always waiting for memory Multi-program it ( multiple hardware threads per cpu ) –Hyper Threading: Everything is SMP –2 now more later –Also multiple cpus/chip If your program is single threaded –You waste ½ the cpu and memory bandwidth –Eventually waste 80% App builders need to plan for threads.

37 37 The Memory Hierarchy Locality REALLY matters CPU 2 GHz, RAM at 5 MHz RAM is no longer random access. Organizing the code gives 3x (or more) Organizing the data gives 3x (or more) Table columns: Level, latency (clocks), size. Rows: Registers 1 1 KB L KB L KB L MB Near RAM GB Far RAM GB

38 38 RAM Off chip Icache Arithmetic Logical Unit Dcache L2 cache The Bus Remote cache Disk Network Other Cpus registers L1 cache Remote RAM

39 39 Scaleup Systems Non-Uniform Memory Architecture (NUMA) Coherent but… remote memory is even slower All cells see a common memory Slow local main memory Slower remote main memory Scaleup by adding cells Planning for 64 cpu, 1TB ram Interconnect, Service Processor, Partition management are vendor specific Several vendors doing this Itanium and Hammer Figure: four cells (each with 4 CPUs, 4 memories, and an I/O chipset) joined by a system interconnect (crossbar/switch), with a partition manager, config DB, and service processor

40 40 Changed Ratios Matter If everything changes by 2x, Then nothing changes. So, it is the different rates that matter. Improving FAST CPU speed Memory & disk size Network Bandwidth Slowly changing Speed of light People costs Memory bandwidth WAN prices

41 41 Disks are becoming tapes Capacity: –150 GB now, 300 GB this year, 1 TB by 2007 Bandwidth: – 40 MBps now 150 MBps by 2007 Read time –2 hours sequential, 2 days random now 4 hours sequential, 12 days random by IO/s 40 MBps 150 GB 200 IO/s 150 MBps 1 TB

42 42 Disks are becoming tapes Consequences Use most disk capacity for archiving Copy on Write (COW) file system in Windows and other OSs. RAID10 saves arms, costs space (OK!). Backup to disk Pretend it is a 100GB disk + 1 TB disk –Keep hot 10% of data on fastest part of disk. –Keep cold 90% on colder part of disk Organize computations to read/write disks sequentially in large blocks.

43 43 Wiring is going serial and getting FAST! Gbps Ethernet and SATA built into chips Raid Controllers: inexpensive and fast. 1U storage 2-10 TB SAN or NAS (iSCSI or CIFS/DAFS) Enet 100MBps/link 8xSATA 150MBps/link

44 44 NAS – SAN Horse Race Storage Hardware 1k$/TB/y Storage Management 10k$...300k$/TB/y So as with Server Consolidation Storage Consolidation Two styles: NAS (Network Attached Storage) File Server SAN (System Area Network) Disk Server I believe NAS is more manageable.

45 45 SAN/NAS Evolution Modular Monolithic Sealed

46 46 IO Throughput: K Accesses Per Second (Kaps) vs. RPM

47 47 Comparison Of Disk Cost $’s for similar performance Seagate Disk Prices*
Model # | Size | Speed | Connect. | Cost | $/K Rev
ATA 100 | 40 GB | 5400 RPM | ATA | $86 | $15.9
ATA 1000 | 40 GB | 7200 RPM | ATA | $101 | $14.0
36 ES 2 | 36.7 GB | 10K RPM | SCSI | $325 | $32.5
X15 36LP | 36.7 GB | 15K RPM | SCSI | $455 | $29.7
X15 36LP | 36.7 GB | 15K RPM | Fibre | $455 | $29.7
*Source: Seagate online store, quantity one prices

48 48 Comparison Of Disk Costs ¢/MB for different systems
Mfg. | Size | Type | Cost | Cost/MB
Dell | 80 GB | Int. ATA | $115 | 1.4¢
WD | 120 GB | Ext. ATA | $276 | 2.3¢
Seagate | 181 GB | Int SCSI | $1155 | 6.4¢
EMC | XX GB | SAN | | xx¢
Source: Dell

49 49 Why Serial ATA Matters Modern interconnect Point-to-point drive connection –150 MBps –> 300 MBps Facilitates ATA disk arrays Enables inexpensive “cool” storage

50 50 Performance (on Y2k SDSS data) Run times: on 15k$ HP Server (2 cpu, 1 GB, 8 disk) Some take 10 minutes Some take 1 minute Median ~ 22 sec. Ghz processors are fast! –(10 mips/IO, 200 ins/byte) –2.5 m rec/s/cpu ~1,000 IO/cpu sec ~ 64 MB IO/cpu sec

51 51 NVO: How Will It Work? Define commonly used `atomic’ services Build higher level toolboxes/portals on top We do not build `everything for everybody’ Use the rule: –Define the standards and interfaces –Build the framework –Build the 10% of services that are used by 90% –Let the users build the rest from the components

52 52 Federation Data Federations of Web Services Massive datasets live near their owners: –Near the instrument’s software pipeline –Near the applications –Near data knowledge and curation –Super Computer centers become Super Data Centers Each Archive publishes a web service –Schema: documents the data –Methods on objects (queries) Scientists get “personalized” extracts Uniform access to multiple Archives –A common global schema

53 53 Grid and Web Services Synergy I believe the Grid will be many web services that share data (computrons are free) IETF standards Provide –Naming –Authorization / Security / Privacy –Distributed Objects Discovery, Definition, Invocation, Object Model –Higher level services: workflow, transactions, DB,.. Synergy: commercial Internet & Grid tools

54 54 Web Services: The Key? Web SERVER: –Given a url + parameters –Returns a web page (often dynamic) Web SERVICE: –Given an XML document (soap msg) –Returns an XML document –Tools make this look like an RPC. F(x,y,z) returns (u, v, w) –Distributed objects for the web. –+ naming, discovery, security,.. Internet-scale distributed computing Figure labels: Your program, Web Server, http, Web page; Your program, Web Service, soap, object in xml, Data in your address space
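The "XML document in, XML document out, looks like an RPC" idea can be sketched with the standard library alone. This is an illustration, not the tooling the talk refers to: the service namespace and the method name F are hypothetical, and only the SOAP 1.1 envelope namespace is a real identifier.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:hypothetical-service"        # hypothetical, for illustration


def build_request(method: str, params: dict) -> bytes:
    """Wrap a call like F(x, y, z) in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def parse_call(message: bytes):
    """Recover (method name, parameters) from the envelope, as a server stub would."""
    envelope = ET.fromstring(message)
    call = envelope.find(f"{{{SOAP_NS}}}Body")[0]

    def local(tag: str) -> str:
        return tag.split("}", 1)[1]  # strip the "{namespace}" prefix

    return local(call.tag), {local(child.tag): child.text for child in call}


if __name__ == "__main__":
    message = build_request("F", {"x": 1, "y": 2, "z": 3})
    print(message.decode("utf-8"))
    print(parse_call(message))
```

Client-side tools generate the equivalent of `build_request` from the WSDL, so the caller only ever sees a function call and its return values.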

55 55 Grid? Harvesting spare cpu cycles is not important –They are “free” (1$/cpu day) –They need applications and data (which are not free) (1$/GB shipped) Accessing distributed data IS important –Send the programs to the data –Send the questions to the databases. Super Computer Centers become Super Data Centers Super Application Centers

56 56 The Grid: Foster & Kesselman (Argonne National Laboratory) Internet computing and GRID technologies promise to change the way we tackle complex problems. They will enable large-scale aggregation and sharing of computational, data and other resources across institutional boundaries …. Transform scientific disciplines ranging from high energy physics to the life sciences

57 57 Grid/Globus Leader of the pack for GRID middleware Layered software toolkit –1: Grid Fabric (OS, TCP) –2: Grid Services Globus Resource Allocation Manager Globus Information Service (meta-computing directory service) Grid Security Infrastructure GridFTP –3: Application Toolkits Job submission MPICH-G2 message passing interface –4:Specific Applications OVERFLOW Navier-Stokes flow solver

58 58 Globus in gory detail
SHELL SCRIPTS
globus-mds-search '(&(hn=denali.mcs.anl.gov)(objectclass=GlobusSystemDynamicInformation))' cpuload1 |\
  sed -n -e '/^hn=/p' -e '/^cpuload1=/p' |\
  sed -e 's/,.*$//' -e 's/=/ /g' |\
  awk '/^hn/{printf "%s", $2} /^cpuload/{printf " %s\n", $2}'
if [ $# -eq 0 ]; then
  echo "provide argument " 1>&2
  exit 1
fi
if [ -z "$GRAMCONTACT" ] ; then
  GRAMCONTACT="`globus-hostname2contacts -type fork pitcairn.mcs.anl.gov`"
fi
pwd=`/bin/pwd`
rsl="&(executable=${pwd}/myjobtest)(count=$1)"
arch=`${GLOBUS_INSTALL_PATH}/sbin/config.guess`
${GLOBUS_INSTALL_PATH}/tools/${arch}/bin/globusrun -o -r "${GRAMCONTACT}" "${rsl}"
LIBRARIES
/* get process id and hostname */
pid = getpid();
rc = globus_libc_gethostname(hn, 256);
globus_assert(rc == GLOBUS_SUCCESS);
/* get current time and convert to string format.
   setting [25] to zero will strip the newline character. */
mytime = time(GLOBUS_NULL);
timestr = globus_libc_ctime_r( &mytime, buf, 30 );
timestr[25] = '\0';
globus_libc_printf("%s : process %d on %s came to life\n", timestr, pid, hn);
/* THE BARRIER!!! */
globus_duroc_runtime_barrier();
/* Passed the barrier: get current time again and print it out. */
mytime = time(GLOBUS_NULL);
timestr = globus_libc_ctime_r( &mytime, buf, 30 );
globus_libc_printf("%s : process %d on %s passed the barrier\n", timestr, pid, hn);
/* TODO 1: get the layout of the DUROC job using first
   globus_duroc_runtime_intra_subjob_rank() and then
   globus_duroc_runtime_inter_subjob_structure(). */
/* We are done. */
rc = globus_module_deactivate_all();
globus_assert(rc == GLOBUS_SUCCESS);
return 0;

59 59 Shielding Users Users do not want to deal with XML, they want their data Users do not want to deal with configuring grid computing, they want results SOAP: data appears in user memory, XML is invisible SOAP call: just a remote procedure

60 60 Atomic Services Metadata information about resources –Waveband –Sky coverage –Translation of names to universal dictionary (UCD) Simple search patterns on the resources –Cone Search –Image mosaic –Unit conversions Simple filtering, counting, histogramming On-the-fly recalibrations

61 61 Higher Level Services Built on Atomic Services Perform more complex tasks Examples –Automated resource discovery –Cross-identifications –Photometric redshifts –Outlier detections –Visualization facilities Expectation: –Build custom portals in matter of days from existing building blocks (like today in IRAF or IDL)

62 62 SkyQuery Distributed Query tool using a set of services Feasibility study, built in 6 weeks from scratch –Tanu Malik (JHU CS grad student) –Tamas Budavari (JHU astro postdoc) Implemented in C# and .NET Won 2nd prize of Microsoft XML Contest Allows queries like: SELECT o.objId, o.r, o.type, t.objId FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t WHERE XMATCH(o,t)<3.5 AND AREA(181.3,-0.76,6.5) AND o.type=3 AND (o.i - t.m_j)>2

63 63 Architecture Image cutout SkyNode SDSS SkyNode 2Mass SkyNode First SkyQuery Web Page

64 64 Cross-id Steps Parse query Get counts Sort by counts Make plan Cross-match –Recursively, from small to large Select necessary attributes only Return output Insert cutout image SELECT o.objId, o.r, o.type, t.objId FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t WHERE XMATCH(o,t)<3.5 AND AREA(181.3,-0.76,6.5) AND (o.i - t.m_j) > 2 AND o.type=3
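The "get counts, sort by counts, cross-match from small to large" plan can be sketched as follows. This is not SkyQuery's implementation: it assumes, purely for illustration, that the match tolerance is a plain angular separation in arcseconds (the slides do not define XMATCH), represents each object as an (obj_id, ra_deg, dec_deg) tuple, and uses a loop rather than recursion. All names are hypothetical.

```python
import math


def separation_arcsec(a, b):
    """Great-circle separation in arcseconds between two (obj_id, ra_deg, dec_deg) objects."""
    ra1, dec1, ra2, dec2 = (math.radians(v) for v in (a[1], a[2], b[1], b[2]))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600


def make_plan(archives):
    """Get counts per archive, then order archive names from smallest to largest."""
    counts = {name: len(objects) for name, objects in archives.items()}
    return sorted(counts, key=counts.get)


def cross_match(archives, tolerance_arcsec):
    """Extend partial matches one archive at a time, starting with the smallest."""
    plan = make_plan(archives)
    partial = [{plan[0]: obj} for obj in archives[plan[0]]]
    for name in plan[1:]:
        partial = [
            {**match, name: candidate}
            for match in partial
            for candidate in archives[name]
            if all(separation_arcsec(candidate, seen) <= tolerance_arcsec
                   for seen in match.values())
        ]
    return partial
```

Starting from the smallest archive keeps the set of partial matches small before the larger archives are consulted; a real system would also use a spatial index instead of the nested scan shown here.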

65 65 Show Cutout Web Service

