CS597A: Managing and Exploring Large Datasets Kai Li.

About This Seminar
Goal:
 – Identify research directions and issues in managing and exploring large datasets
Plan:
 – Overview of a few state-of-the-art storage systems
 – Read papers on a few research systems in storage, data management, and data exploration
 – Discuss wild ideas
 – Define, work on, and present course projects

Why Is This Area Interesting? (Where Are The Bottlenecks?)
[Diagram: Create, Transform, Transmit (Network), Store and Retrieve]

Computer Food Chains
Computer systems in 1980s: Supercomputer (Cray, etc), Mini-super (Convex, etc), Mainframe (IBM 370), Minicomputer (VAX), WS (SUN), PC
Computer systems in 1990s and 2000s: Supercomputer (Cray, etc), Servers (IBM, SUN), PC, Laptop, PDA

Storage Arrays of Food Chains?
Storage Area Network (SAN)
 – “Super” SAN storage (EMC, Hitachi, IBM)
 – “MiniSuper” SAN storage (HPQ, Startups)
 – iSCSI (Startups)
Network Attached Storage (NAS)
 – “Super” NAS (NetApp, SUN)
 – “MiniSuper” NAS (Startups)
 – PC storage (Dell, Snap!, MSFT SAK boxes)
Direct Attached Storage (DAS)
 – “Super” SCSI RAID
 – ATA RAID
 – ATA disks
 – USB, Microdrive, Flash

Typical General Infrastructures
[Diagram, two configurations]
 – Clients, Network, file servers w/o disks, Storage Area Network, mirrored storage (e.g. EMC), BCV or 3rd copy (e.g. EMC), backup tape library
 – Clients, file servers w/ disks, Storage Area Network, backup tape library

Exponential Growth (Courtesy Jim Gray, Turing Lecture 99)
Performance/Price doubles every 18 months
100x per decade
Progress in next 18 months = ALL previous progress
 – New storage = sum of all old storage (ever)
 – New processing = sum of all old processing
[Figure label: 15 years ago]
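The two claims on this slide follow from simple arithmetic, which a short sketch can check. The function names below are illustrative and not part of the original slides; the only input taken from the slide is the 18-month doubling period.

```python
# Sketch: check the arithmetic behind "doubles every 18 months".
DOUBLING_MONTHS = 18  # doubling period stated on the slide


def growth_factor(months: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Multiplicative improvement after `months` months of steady doubling."""
    return 2.0 ** (months / doubling_months)


def total_produced(periods: int) -> int:
    """Total output over `periods` doubling periods: 1 + 2 + 4 + ... units."""
    return sum(2 ** k for k in range(periods))


# A decade is 120 months, i.e. about 6.67 doubling periods.
print(f"10 years: {growth_factor(120):.1f}x")

# Output of the next period (2**n) versus everything produced before it (2**n - 1).
for n in (1, 5, 10):
    print(n, 2 ** n, total_produced(n))
```

The first line prints a factor just above 100, matching "100x per decade". The loop shows that each new period's output exceeds the sum of all earlier periods by exactly one unit, which is the sense in which the next 18 months equal all previous progress.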

Disk Density vs. Moore’s Law

Storage Capacity Grows Fast

Raw Storage Is Cheap
Disk drives beat tapes in 2002 in $/TB (IDC)
 – Disk $/TB declines 50%/year
 – Tape $/TB declines 29%/year
But ATA arrays ($/TB) beat tape libraries in 2006 (Gartner)
 – Disk system $/TB declines 40%/year
 – Tape library $/TB declines 29%/year
(Source: Gartner and IDC)
[Chart axis: $/TB]
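The crossover dates follow from the two decline rates: each year the disk-to-tape price ratio shrinks by a constant factor. The sketch below shows the calculation. The starting price ratio of 2.0 is a hypothetical value chosen for illustration, since the slide gives only the yearly declines; the function name is likewise an assumption.

```python
import math


def years_to_crossover(cost_ratio: float, disk_decline: float, tape_decline: float) -> float:
    """Years until disk $/TB equals tape $/TB.

    cost_ratio: current disk $/TB divided by tape $/TB (> 1 means disk costs more).
    disk_decline, tape_decline: fractional yearly price declines (0.40 = 40%/year).
    """
    if disk_decline <= tape_decline:
        raise ValueError("disk must decline faster than tape for a crossover")
    # Each year the ratio is multiplied by (1 - disk_decline) / (1 - tape_decline).
    yearly_factor = (1 - disk_decline) / (1 - tape_decline)
    return math.log(1 / cost_ratio) / math.log(yearly_factor)


# Hypothetical starting point: disk systems cost twice as much per TB as tape libraries.
# Decline rates are the slide's figures for disk systems (40%) and tape libraries (29%).
print(f"{years_to_crossover(2.0, 0.40, 0.29):.1f} years")
```

Under that assumed starting ratio, the system-level rates give a crossover roughly four years out, which is the same order as the gap between the slide's 2002 and 2006 dates.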

Summary of Storage Trends
Disk density beats Moore’s Law
Data growth rate follows Moore’s Law
Raw disks are cheap while storage systems are very expensive
Crossover from tapes to disks

How Much Information Is There? (Courtesy Jim Gray, Turing Lecture 99)
Soon everything can be recorded and indexed
Most data will never be seen by humans
Precious resource: human attention
Auto-summarization and auto-search are key technologies
[Chart: scale Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta; labeled points: A Book, A Photo, A Movie, All LoC books (words), All Books MultiMedia, Everything Recorded!]
Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli
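The prefix ladder on this slide steps by a factor of 1000 at each rung. As a minimal sketch, the helper below (a hypothetical function, not from the slides) expresses a byte count with the largest prefix that keeps the value at or above 1, using decimal prefixes as on the chart.

```python
# Decimal prefixes in the order shown on the slide, each 1000x the previous one.
PREFIXES = ["", "Kilo", "Mega", "Giga", "Tera", "Peta", "Exa", "Zetta", "Yotta"]


def describe_bytes(n: float) -> str:
    """Express n bytes with the largest prefix that keeps the value >= 1."""
    if n < 0:
        raise ValueError("byte count must be non-negative")
    index = 0
    # Compare against exact integer powers of 1000 to avoid rounding drift.
    while index < len(PREFIXES) - 1 and n >= 1000 ** (index + 1):
        index += 1
    value = n / 1000 ** index
    return f"{value:.3g} {PREFIXES[index]}bytes"


print(describe_bytes(500))
print(describe_bytes(2.5e12))
print(describe_bytes(10 ** 24))
```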

How Much Information Is There? (Hal Varian, Peter Lyman et al. 2001)
Web has a lot of documents
 – “Surface” web had 2.5B docs, adding 7.5M pages/day
 – “Deep” web had 550B docs, 95% publicly accessible
Most websites are in English
 – 78% of all websites and 96% of e-commerce sites
Email generates a large amount of information
 – A “white-collar” worker receives ~40 messages/day
 – Email information is 500x that of the web every year

How Much Information Is There? (Hal Varian, Peter Lyman et al. 2001)

Storage media   TB/year (upper est.)   TB/year (lower est.)   Growth rate
Paper                            240                     23            2%
Film                         427,216                 58,216            4%
Optical                           83                     31           70%
Magnetic                   1,693,000                577,210           55%
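A quick way to read this table is to total the rows and ask what share of new information lands on magnetic media. The sketch below does that, reading each row as (upper estimate, lower estimate) in TB/year; the dictionary and function names are illustrative only.

```python
# (upper estimate, lower estimate) in TB/year, as read from the table above.
ESTIMATES_TB_PER_YEAR = {
    "Paper": (240, 23),
    "Film": (427_216, 58_216),
    "Optical": (83, 31),
    "Magnetic": (1_693_000, 577_210),
}


def totals() -> tuple[int, int]:
    """Sum the upper and lower estimates across all media."""
    upper = sum(u for u, _ in ESTIMATES_TB_PER_YEAR.values())
    lower = sum(l for _, l in ESTIMATES_TB_PER_YEAR.values())
    return upper, lower


def share(medium: str) -> tuple[float, float]:
    """Fraction of the upper and lower totals contributed by one medium."""
    upper_total, lower_total = totals()
    upper, lower = ESTIMATES_TB_PER_YEAR[medium]
    return upper / upper_total, lower / lower_total


print(totals())
print(share("Magnetic"))
```

Under either estimate, magnetic storage accounts for the large majority of the yearly total, which is why the rest of the deck concentrates on disks and tapes.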

Challenges In Managing and Exploring Datasets
Disks behave like big tapes
 – Storage is indeed “infinitely” large
 – Ability to get information is slow
Reliability is far from what we need
 – Disks do fail
 – Software and humans corrupt data
Managing storage is difficult
 – Storage and data are both growing
Retrieving data is difficult
 – Get what you want
 – See what you get
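The comparison of a disk to a big tape can be made concrete by estimating how long it takes to read an entire disk. The sketch below is illustrative only: the capacity, bandwidth, page size, and random-read rate are hypothetical round numbers, not figures from the slides.

```python
def sequential_scan_hours(capacity_gb: float, bandwidth_mb_per_s: float) -> float:
    """Hours to read the whole disk front to back at a steady transfer rate."""
    seconds = capacity_gb * 1000 / bandwidth_mb_per_s
    return seconds / 3600


def random_scan_hours(capacity_gb: float, page_kb: float, reads_per_s: float) -> float:
    """Hours to read the whole disk one random page at a time."""
    pages = capacity_gb * 1_000_000 / page_kb
    return pages / reads_per_s / 3600


# Hypothetical drive: 200 GB, 50 MB/s sequential, 100 random 8 KB reads per second.
print(f"sequential: {sequential_scan_hours(200, 50):.1f} h")
print(f"random:     {random_scan_hours(200, 8, 100):.1f} h")
```

With these assumed numbers a sequential pass takes on the order of an hour while page-at-a-time random access takes days, so capacity growing faster than access speed pushes disks toward tape-like, scan-oriented use.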

Properties of A Research Goal (Jim Gray, 1999)
Simple to state
Not obvious how to do it
Clear benefit
Progress and solution are testable
Can be broken into smaller steps
 – So that you can see intermediate progress

Systems Challenges (Lampson, SOSP Keynote 99)
Systems that work
 – Meeting their specs
 – Always available
 – Adapting to changing environment
 – Evolving while they run
 – Made from unreliable components
 – Growing without practical limit
Credible simulations or analysis
Writing good specs
Testing
Performance
 – Understanding when it doesn’t matter

What Should the “New World” Focus Be? (Hennessy, FCRC keynote 99)
Availability
 – Both appliance & service
Maintainability
 – Two functions: enhancing availability by preventing failure; ease of SW and HW upgrades
Scalability
 – Especially of service
Cost
 – Per device and per service transaction
Performance
 – Remains important, but it’s not SPECint

Tentative Syllabus
Today: About the Course
Week 2: Read several vision papers
Week 3: Guest lecture on archival storage
Week 4: Commercial storage systems (EMC, Veritas, NetApp)
Week 5: Global-scale storage (OceanStore and the like)
Week 6: Managing personal data (Coda, Bayou, Personal RAID)
Week 7: Managing geographical data (TerraServer)
Week 8: Guest lecture on managing astrophysical data (SkyServer)
Week 9: Managing and exploring large scientific data
Week 10: Managing medical data
Week 11: Managing genomic data
Week 12: Project reports and presentations
A detailed, tentative reading list will be available this weekend