CS597A: Managing and Exploring Large Datasets Kai Li.

1 CS597A: Managing and Exploring Large Datasets Kai Li

2 About This Seminar
Goal:
  – Identify research directions and issues in managing and exploring large datasets
Plan:
  – Overview of a few state-of-the-art storage systems
  – Read papers on a few research systems in storage, data management, and data exploration
  – Discuss wild ideas
  – Define, work on, and present course projects

3 Why Is This Area Interesting? (Where Are The Bottlenecks?)
[Diagram labels: Network; Create; Transform; Transmit; Store and Retrieve]

4 Computer Food Chains
Computer systems in 1980s:
  – Supercomputer (Cray, etc)
  – Mini-super (Convex, etc)
  – Mainframe (IBM 370)
  – Minicomputer (VAX)
  – WS (SUN)
  – PC
Computer systems in 1990s and 2000s:
  – Supercomputer (Cray, etc)
  – Servers (IBM, SUN)
  – PC
  – Laptop
  – PDA

5 Storage Arrays of Food Chains?
Storage Area Network (SAN)
  – “Super” SAN storage (EMC, Hitachi, IBM)
  – “MiniSuper” SAN storage (HPQ, Startups)
  – iSCSI (Startups)
Network Attached Storage (NAS)
  – “Super” NAS (NetApp, SUN)
  – “MiniSuper” NAS (Startups)
  – PC storage (Dell, Snap!, MSFT SAK boxes)
Direct Attached Storage (DAS)
  – “Super” SCSI RAID
  – ATA RAID
  – ATA disks
  – USB, Microdrive, Flash

6 Typical General Infrastructures
[Diagram labels: Network; Clients; File servers w/o disks; File servers w/ disks; Storage Area Network; Mirrored storage (e.g., EMC); BCV or 3rd copy (e.g., EMC); Backup tape library]

7 Exponential Growth (Courtesy Jim Gray, Turing Lecture 99)
Performance/price doubles every 18 months
100x per decade
Progress in the next 18 months = ALL previous progress
  – New storage = sum of all old storage (ever)
  – New processing = sum of all old processing
[Chart label: 15 years ago]
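Both claims on this slide follow from the same doubling rule, and a short Python sketch can check the arithmetic. The function names and the unit-of-progress model (doubling period k contributes 2**k units) are illustrative assumptions, not part of the lecture.

```python
def growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """Improvement factor after `months` if a metric doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


def cumulative_progress(periods: int) -> int:
    """Total progress over the first `periods` doubling periods.

    Assumed model: period k contributes 2**k units, so period 0 contributes 1.
    """
    return sum(2 ** k for k in range(periods))


if __name__ == "__main__":
    # Doubling every 18 months compounds to roughly 100x over 120 months.
    print(f"factor per decade: {growth_factor(120):.1f}")
    # The next period (2**n units) exceeds all earlier periods combined by one unit.
    n = 10
    print(2 ** n, cumulative_progress(n))
```

Under this model the next period's contribution is always one unit more than everything before it, which is the sense in which "new storage = sum of all old storage".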

8 Disk Density vs. Moore’s Law

9 Storage Capacity Grows Fast

10 Raw Storage Is Cheap
Disk drives beat tapes in 2002 in $/TB (IDC)
  – Disk $/TB declines 50%/year
  – Tape $/TB declines 29%/year
But ATA arrays ($/TB) beat tape libraries in 2006 (Gartner)
  – Disk system $/TB declines 40%/year
  – Tape library $/TB declines 29%/year
(Source: Gartner and IDC)
[Chart labels: $/TB; 2002; 2006]
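The decline rates on this slide imply how far apart disk systems and tape libraries were before the 2006 crossover. The Python sketch below works that out under two assumptions that are not stated in the lecture: the yearly decline rates stay constant, and 2002 is taken as the reference year. The function names are hypothetical.

```python
import math


def implied_cost_ratio(years_to_crossover: float,
                       disk_decline: float,
                       tape_decline: float) -> float:
    """Disk-to-tape $/TB ratio `years_to_crossover` years before the costs are equal."""
    disk_keep = 1.0 - disk_decline   # fraction of disk $/TB left after one year
    tape_keep = 1.0 - tape_decline   # fraction of tape $/TB left after one year
    return (tape_keep / disk_keep) ** years_to_crossover


def years_to_parity(cost_ratio: float,
                    disk_decline: float,
                    tape_decline: float) -> float:
    """Years until disk and tape $/TB meet, starting from `cost_ratio` = disk / tape."""
    yearly_gain = (1.0 - tape_decline) / (1.0 - disk_decline)
    return math.log(cost_ratio) / math.log(yearly_gain)


if __name__ == "__main__":
    # Disk systems fall 40%/year and tape libraries 29%/year (slide figures).
    ratio_2002 = implied_cost_ratio(2006 - 2002, 0.40, 0.29)
    print(f"implied disk-system/tape-library $/TB ratio in 2002: {ratio_2002:.2f}")
```

With the slide's system-level rates, the implied 2002 ratio comes out just under 2x, which is consistent with raw disks already beating tape while complete disk systems still lagged tape libraries.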

11 Summary of Storage Trends
Disk density beats Moore’s Law
Data growth rate follows Moore’s Law
Raw disks are cheap, while storage systems are very expensive
Crossover from tapes to disks

12 How Much Information Is There? (Courtesy Jim Gray, Turing Lecture 99)
Soon everything can be recorded and indexed
Most data will never be seen by humans
Precious resource: human attention
Auto-summarization and auto-search are key technologies
www.lesk.com/mlesk/ksg97/ksg.html
[Chart: scale of prefixes Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo, with markers for A Photo, A Book, A Movie, All LoC books (words), All Books MultiMedia, and Everything Recorded!]
Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli

13 How Much Information Is There? (Hal Varian, Peter Lyman et al. 2001)
The Web has a lot of documents
  – “Surface” web had 2.5B docs, adding 7.5M pages/day
  – “Deep” web had 550B docs, 95% publicly accessible
Most websites are in English
  – 78% of all websites and 96% of e-commerce sites
E-mail generates a large amount of information
  – A “white-collar” worker receives ~40 messages/day
  – Each year, e-mail information is 500x that of the web
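To put the surface-web figures in perspective, the Python sketch below estimates how quickly 7.5M new pages per day would double a 2.5B-document collection. It assumes a constant daily rate and a 365-day year, neither of which the slide states; the names are illustrative.

```python
SURFACE_WEB_DOCS = 2_500_000_000   # "surface" web size from the slide
PAGES_ADDED_PER_DAY = 7_500_000    # daily additions from the slide


def days_to_double(current_docs: int, added_per_day: int) -> float:
    """Days for a collection to double in size at a constant daily addition rate."""
    return current_docs / added_per_day


# Pages added in one 365-day year at the slide's daily rate.
pages_per_year = PAGES_ADDED_PER_DAY * 365
doubling_days = days_to_double(SURFACE_WEB_DOCS, PAGES_ADDED_PER_DAY)

if __name__ == "__main__":
    print(f"pages added per year: {pages_per_year:,}")
    print(f"days to double: {doubling_days:.0f}")
```

At that rate the surface web would add more than its own size in a single year.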

14 How Much Information Is There? (Hal Varian, Peter Lyman et al. 2001)
Storage media   TB/year (Upper est.)   TB/year (Lower est.)   Growth rate
Paper                            240                     23            2%
Film                         427,216                 58,216            4%
Optical                           83                     31           70%
Magnetic                   1,693,000                577,210           55%
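A quick way to read this table is to ask what fraction of new information lands on magnetic media. The Python sketch below totals the four rows and computes that share; the values are my reading of the slide's table (upper and lower estimates in TB/year), and the function names are hypothetical.

```python
# (upper estimate, lower estimate) in TB/year, as read from the slide's table
TB_PER_YEAR = {
    "Paper": (240, 23),
    "Film": (427_216, 58_216),
    "Optical": (83, 31),
    "Magnetic": (1_693_000, 577_210),
}


def totals(table):
    """Return (upper_total, lower_total) summed over all storage media."""
    upper = sum(u for u, _ in table.values())
    lower = sum(l for _, l in table.values())
    return upper, lower


def magnetic_share(table):
    """Return magnetic media's fraction of the upper and lower totals."""
    upper_total, lower_total = totals(table)
    mag_upper, mag_lower = table["Magnetic"]
    return mag_upper / upper_total, mag_lower / lower_total


if __name__ == "__main__":
    share_upper, share_lower = magnetic_share(TB_PER_YEAR)
    print(f"magnetic share: {share_upper:.0%} (upper), {share_lower:.0%} (lower)")
```

Under either estimate magnetic storage accounts for the large majority of the total, which is why the rest of the seminar concentrates on disk-based systems.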

15 Challenges In Managing and Exploring Datasets
A disk behaves like a big tape
  – Storage is indeed “infinitely” large
  – Ability to get information is slow
Reliability is far from what we need
  – Disks do fail
  – Software and humans corrupt data
Managing storage is difficult
  – Storage and data are both growing
Retrieving data is difficult
  – Get what you want
  – See what you get

16 Properties of A Research Goal (Jim Gray, 1999)
Simple to state
Not obvious how to do it
Clear benefit
Progress and solution are testable
Can be broken into smaller steps
  – So that you can see intermediate progress

17 Systems Challenges (Lampson, SOSP Keynote 99)
Systems that work
  – Meeting their specs
  – Always available
  – Adapting to changing environment
  – Evolving while they run
  – Made from unreliable components
  – Growing without practical limit
Credible simulations or analysis
Writing good specs
Testing
Performance
  – Understanding when it doesn’t matter

18 What Should the “New World” Focus Be? (Hennessy, FCRC keynote 99)
Availability
  – Both appliance & service
Maintainability
  – Two functions:
      Enhancing availability by preventing failure
      Ease of SW and HW upgrades
Scalability
  – Especially of service
Cost
  – Per device and per service transaction
Performance
  – Remains important, but it’s not SPECint

19 Tentative Syllabus
Today: About the Course
Week 2: Read several vision papers
Week 3: Guest lecture on archival storage
Week 4: Commercial storage systems (EMC, Veritas, NetApp)
Week 5: Global-scale storage (OceanStore and the like)
Week 6: Managing personal data (Coda, Bayou, Personal RAID)
Week 7: Managing geographical data (TerraServer)
Week 8: Guest lecture on managing astrophysical data (SkyServer)
Week 9: Managing and exploring large scientific data
Week 10: Managing medical data
Week 11: Managing genomic data
Week 12: Project reports and presentations
A detailed, tentative reading list will be available this weekend

