How much information? Adapted from a presentation by: Jim Gray Microsoft Research Alex Szalay Johns Hopkins University.

1 How much information? Adapted from a presentation by: Jim Gray Microsoft Research http://research.microsoft.com/~gray Alex Szalay Johns Hopkins University http://tarkus.pha.jhu.edu/~szalay/

2 How much information is there in the world Infometrics - the measurement of information What can we store What do we intend to store. What is stored. Why are we interested.

3 Infinite Storage? Yotta Zetta Exa Peta Tera Giga Mega Kilo The Terror Bytes are Here –1 TB costs <100$ to buy –1 TB costs 300k$/y to own Management & curation are expensive –Searching without indexing 1TB takes minutes or hours Petrified by Peta Bytes? But… people can “afford” them so, – They will be used. Solution: Automate processes
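The "minutes or hours" claim for an unindexed 1 TB search follows from dividing the data size by the read bandwidth. A minimal sketch, assuming decimal terabytes and two hypothetical sequential read rates that are not from the slide:

```python
TB = 10**12  # decimal terabyte, in bytes


def full_scan_seconds(size_bytes, bytes_per_second):
    """Time to read every byte once when no index lets us skip data."""
    return size_bytes / bytes_per_second


# Hypothetical sequential read rates (assumptions, not from the slide).
assumed_rates = {"100 MB/s": 100e6, "1 GB/s": 1e9}

for label, rate in assumed_rates.items():
    minutes = full_scan_seconds(TB, rate) / 60
    print(f"1 TB at {label}: about {minutes:.0f} minutes")
```

At the slower assumed rate the scan takes hours, at the faster one minutes, which is the range the slide gives.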

4 Digital Information Created, Captured, Replicated Worldwide (exabytes): 10-fold growth in 5 years! Contributing sources: DVD, RFID, digital TV, MP3 players, digital cameras, camera phones, VoIP, medical imaging, laptops, data center applications, games, satellite images, GPS, ATMs, scanners, sensors, digital radio, DLP theaters, telematics, peer-to-peer, email, instant messaging, videoconferencing, CAD/CAM, toys, industrial machines, security systems, appliances. Source: IDC, 2008

5 Scale of things to come Information: –In 2002, recorded media and electronic information flows generated about 22 exabytes (1 EB = 10^18 bytes) of information –In 2006, the amount of digital information created, captured, and replicated was 161 EB –In 2010, the amount of information added annually to the digital universe will be about 988 EB (almost 1 ZB)
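The 2006 and 2010 figures imply a compound annual growth rate, which can be compared with the "10-fold growth in 5 years" headline. A small sketch of that arithmetic (the function name is illustrative):

```python
def annual_growth_rate(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1.0 / years) - 1.0


# 161 EB in 2006 growing to 988 EB in 2010 (figures from the slide).
rate_2006_2010 = annual_growth_rate(161, 988, 2010 - 2006)

# "10-fold growth in 5 years" expressed as a yearly rate.
rate_tenfold = annual_growth_rate(1, 10, 5)

print(f"2006-2010: {rate_2006_2010:.1%} per year")
print(f"10x in 5 years: {rate_tenfold:.1%} per year")
```

Both come out just under 60% per year, so the two statements are consistent with each other.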

6 Digital Universe Environmental Footprint In our physical universe, 98.5% of the known mass is invisible, composed of interstellar dust or what scientists call “dark matter.” In the digital universe, we have our own form of dark matter — the tiny signals from sensors and RFID tags and the voice packets that make up less than 6% of the digital universe by gigabyte, but account for more than 99% of the “units,” information “containers,” or “files” in it. Tenfold growth of the digital universe in five years will have a measurable impact on the environment, in terms of both power consumed and electronic waste.

7 How much information is there? Soon most everything will be recorded and indexed. Most bytes will never be seen by humans. Data summarization, trend detection, and anomaly detection are key technologies. See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html See Lyman & Varian: How much information http://www.sims.berkeley.edu/research/projects/how-much-info/ Scale (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta) with markers: A Book, A Movie, All books (words), All Books MultiMedia, Everything Recorded!, A Photo. Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli

8

9 Digital Immortality Requirements for storing various media for a single person’s lifetime at modest fidelity Bell, Gray, CACM, ‘01

10 What is Digital Immortality? Preservation and interaction of digitized experiences for individuals and/or groups –Preservation and access –Active interaction with archives through queries and/or an avatar (agents) –Avatar interactions for group experiences Issues: –Archiving –Indexing –Veracity –Access

11 Information Census (Lesk; Varian & Lyman)
Media      TB/y        Growth Rate, %
optical    50          70
paper      100         2
film       100,000     4
magnetic   1,000,000   55
total      1,100,150   50
~10 Exabytes; ~90% digital; > 55% personal. Print: .003% of bytes, 5 TB/y, but text has lowest entropy. Email is (10 Bmpd) 4 PB/y and is 20% text (estimate by Gray). WWW is ~50 TB; deep web ~50 PB. Growth: 50%/y

12 New Information Flows Telephone increase is significant

13 Internet

14 First Disk 1956 IBM 305 RAMAC 4 MB 50x24” disks 1200 rpm 100 ms access 35k$/y rent Included computer & accounting software (tubes not transistors)

15 10 years later 1.6 meters 30 MB

16 Terabyte external drive for $200 - 20 cents a gigabyte. In 5 years, 1 cent/gigabyte, $10 for a terabyte? Now - Terabytes on your desk

17 The Cost of Storage about 1K$/TB 12/1/1999 9/1/2000 9/1/2001 4/1/2002 9/22/2003

18 Storage capacity beating Moore’s law. Improvements: Capacity 60%/y, Bandwidth 40%/y, Access time 16%/y. 1000 $/TB today, 100 $/TB in 2007. Moore’s law: 58.70%/year; TB growth: 112.30%/year since 1993; Price decline: 50.70%/year since 1993. Most (80%) data is personal (not enterprise). This will likely remain true.

19 Disk Evolution Capacity:100x in 10 years 1 TB 3.5” drive in 2006 20 GB as 1” micro-drive System on a chip High-speed LAN Disk replacing tape Disk is super computer! Kilo Mega Giga Tera Peta Exa Zetta Yotta

20 Disk Storage Cheaper Than Paper
File cabinet (4 drawer):
  Cabinet                    250$
  Paper (24,000 sheets)      250$
  Space (2x3 @ 10€/ft2)      180$
  Total                      700$
  0.03 $/sheet: 3 pennies per page
Disk:
  Disk (250 GB)              250$
  ASCII: 100 million pages   2e-6 $/sheet (10,000x cheaper): micro-dollar per page
  Image: 1 million photos    3e-4 $/photo (100x cheaper): milli-dollar per photo
Store everything on disk. Note: Disk is 100x to 1000x cheaper than RAM
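The per-page comparison can be checked directly from the slide's numbers. A small sketch, assuming all amounts are in the same currency (the slide mixes $ and € symbols) and using illustrative variable names:

```python
# Figures from the slide; all treated as one currency (an assumption).
cabinet, paper, floor_space = 250, 250, 180
sheets = 24_000

disk_price = 250            # one 250 GB disk
ascii_pages_on_disk = 100e6
photos_on_disk = 1e6

paper_cost_per_sheet = (cabinet + paper + floor_space) / sheets
disk_cost_per_page = disk_price / ascii_pages_on_disk
disk_cost_per_photo = disk_price / photos_on_disk

print(f"paper: {paper_cost_per_sheet:.3f} per sheet")
print(f"disk:  {disk_cost_per_page:.1e} per ASCII page, "
      f"{paper_cost_per_sheet / disk_cost_per_page:,.0f}x cheaper")
print(f"disk:  {disk_cost_per_photo:.1e} per photo, "
      f"{paper_cost_per_sheet / disk_cost_per_photo:,.0f}x cheaper")
```

The ratios land near the slide's rounded 10,000x and 100x.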

21 Why Put Everything in Cyberspace? Low rent min $/byte Shrinks time now or later Shrinks space here or there Automate processing knowbots Point-to-Point OR Broadcast Immediate OR Time Delayed Locate Process Analyze Summarize

22 Memex As We May Think, Vannevar Bush, 1945 “A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility” “yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can be profligate and enter material freely”

23 Trying to fill a terabyte in a year
Item                          Items/TB   Items/day
300 KB JPEG                   3 M        9,800
1 MB Doc                      1 M        2,900
1 hour 256 kb/s MP3 audio     9 K        26
1 hour 1.5 Mb/s MPEG video    290        0.8
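The table's columns are simple divisions: items per terabyte is 1 TB over the item size, and items per day spreads that over a year. A rough sketch, assuming decimal units (1 TB = 10^12 bytes) and a 365-day year; the helper names are illustrative:

```python
TB = 10**12
DAYS_PER_YEAR = 365


def stream_bytes_per_hour(bits_per_second):
    """Size in bytes of one hour of a constant-bitrate stream."""
    return bits_per_second / 8 * 3600


def items_per_tb(item_bytes):
    return TB / item_bytes


items = {
    "300 KB JPEG": 300e3,
    "1 MB Doc": 1e6,
    "1 hour 256 kb/s MP3 audio": stream_bytes_per_hour(256e3),
    "1 hour 1.5 Mb/s MPEG video": stream_bytes_per_hour(1.5e6),
}

for name, size in items.items():
    per_tb = items_per_tb(size)
    print(f"{name:28s} {per_tb:12,.0f} per TB {per_tb / DAYS_PER_YEAR:10,.1f} per day")
```

Under these assumptions the first three rows come out close to the slide's rounded figures. The video row comes out noticeably higher than the slide's 290 items/TB, so the slide presumably used a different assumption for that row.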

24 Projected Portable Computer for 2006 100 Gips processor 1 GB RAM 1 TB disk 1 Gbps network “Some” of your software finding things is a data mining challenge

25 The Personal Terabyte(s) (All Your Stuff Online) So you’ve got it – now what do you do with it? TREASURED (what’s the one thing you would save in a fire?) Can you find anything? Can you organize that many objects? Once you find it will you know what it is? Once you’ve found it, could you find it again? Information Science Goal: Have GOOD answers for all these Questions

26 How Will We Find Anything? Need Queries, Indexing, Pivoting, Scalability, Backup, Replication, Online update, Set-oriented access If you don’t use a DBMS, you will implement one! Simple logical structure: –Blob and link is all that is inherent –Additional properties (facets == extra tables) and methods on those tables (encapsulation) More than a file system Unifies data and meta-data SQL ++ DBMS
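The "blob and link, plus facets as extra tables" structure can be sketched with any relational database. The example below uses Python's built-in sqlite3 module; the table and column names (`blobs`, `links`, `photo_facets`, `place`) are hypothetical and only illustrate the idea:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The only inherent structure: opaque items and links between them.
    CREATE TABLE blobs (id INTEGER PRIMARY KEY, content BLOB NOT NULL);
    CREATE TABLE links (src INTEGER REFERENCES blobs(id),
                        dst INTEGER REFERENCES blobs(id));
    -- A facet is an extra table holding properties for some blobs.
    CREATE TABLE photo_facets (blob_id INTEGER PRIMARY KEY REFERENCES blobs(id),
                               taken_on TEXT, place TEXT);
""")

conn.execute("INSERT INTO blobs (id, content) VALUES (1, ?)", (b"fake-jpeg-bytes",))
conn.execute("INSERT INTO blobs (id, content) VALUES (2, ?)", (b"trip notes",))
conn.execute("INSERT INTO links (src, dst) VALUES (2, 1)")
conn.execute("INSERT INTO photo_facets VALUES (1, '2003-09-22', 'Baltimore')")

# Set-oriented query over the facet, returning the matching blobs.
rows = conn.execute(
    "SELECT b.id, length(b.content) FROM blobs b "
    "JOIN photo_facets f ON f.blob_id = b.id WHERE f.place = ?",
    ("Baltimore",),
).fetchall()
print(rows)
```

Indexing, replication, and backup then come from the DBMS rather than from hand-written code.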

27 How Do We Represent It To The Outside World? Schematized Storage. File metaphor too primitive: just a blob. Table metaphor too primitive: just records. Need metadata describing data context: –Format –Provenance (author/publisher/citations/…) –Rights –History –Related documents. In a standard format: XML and XML schema. DataSet is a great example of this. World is now defining standard schemas. [Figure: XML DataSet example with a schema section followed by data or diffgram rows; only numeric fragments survive in the transcript]
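A metadata record covering the listed context fields might look like the following sketch, built with Python's standard xml.etree.ElementTree. The element names and sample values are hypothetical and do not follow any particular standard schema:

```python
import xml.etree.ElementTree as ET


def describe(data_format, author, rights, history, related):
    """Build a small XML metadata record for one data item."""
    record = ET.Element("metadata")
    ET.SubElement(record, "format").text = data_format
    provenance = ET.SubElement(record, "provenance")
    ET.SubElement(provenance, "author").text = author
    ET.SubElement(record, "rights").text = rights
    hist = ET.SubElement(record, "history")
    for event in history:
        ET.SubElement(hist, "event").text = event
    rel = ET.SubElement(record, "related")
    for doc in related:
        ET.SubElement(rel, "document").text = doc
    return record


record = describe(
    data_format="text/csv",                 # illustrative values only
    author="Example Observatory",
    rights="public",
    history=["captured", "calibrated"],
    related=["catalog-description.txt"],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because the record is plain XML, any consumer can parse it back without knowing how the underlying blob or table is stored.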

28 80% of data is personal / individual. But, what about the other 20%? Business –Walmart online: 1 PB and growing… –Paradox: most “transaction” systems < 1 PB. –Have to go to image/data monitoring for big data Government –Government is the biggest business. Science –LOTS of data.

29 Q: Where will the Data Come From? A: Sensor Applications Earth Observation –15 PB by 2007 Medical Images & Information + Health Monitoring –Potential 1 GB/patient/y → 1 EB/y Video Monitoring –~1E8 video cameras @ 1E5 B/s → 10 TB/s → 100 EB/y → filtered??? Airplane Engines –1 GB sensor data/flight –100,000 engine hours/day –30 PB/y Smart Dust: ?? EB/y http://robotics.eecs.berkeley.edu/~pister/SmartDust/ http://www-bsac.eecs.berkeley.edu/~shollar/macro_motes/macromotes.html
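The sensor estimates are products of a population, a per-unit data rate, and a time span. A back-of-the-envelope sketch, assuming about 1E5 bytes per second per camera (the per-camera rate that makes the 10 TB/s aggregate work out) and about 1 GB per engine hour; both are assumptions for illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
TB, PB, EB = 10**12, 10**15, 10**18

# Video monitoring: ~1E8 cameras, each assumed to emit ~1E5 bytes/s.
cameras = 1e8
bytes_per_camera_per_s = 1e5          # assumption: ~100 KB/s per camera
camera_rate = cameras * bytes_per_camera_per_s
camera_eb_per_year = camera_rate * SECONDS_PER_YEAR / EB

# Airplane engines: 100,000 engine hours/day, assumed ~1 GB per engine hour.
engine_hours_per_day = 100_000
bytes_per_engine_hour = 1e9           # assumption
engine_pb_per_year = engine_hours_per_day * bytes_per_engine_hour * 365 / PB

print(f"cameras: {camera_rate / TB:.0f} TB/s, {camera_eb_per_year:.0f} EB/y unfiltered")
print(f"engines: {engine_pb_per_year:.1f} PB/y")
```

The results agree with the slide to within its order-of-magnitude rounding: a few hundred EB/y for unfiltered video and a few tens of PB/y for engines.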

30 CERN Tier 0 Instruments: CERN – LHC Peta Bytes per Year Looking for the Higgs Particle Sensors: 1000 GB/s (1TB/s ~ 30 EB/y) Events 75 GB/s Filtered 5 GB/s Reduced 0.1 GB/s ~ 2 PB/y Data pyramid: 100GB : 1TB : 100TB : 1PB : 10PB
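The chain from sensors to reduced data is a sequence of filtering stages, each cutting the rate by some factor. A small sketch of the stage-to-stage reduction and the yearly volume, assuming decimal units and continuous operation all year (so the yearly figures are upper bounds):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
GB, PB = 10**9, 10**15

# (stage name, rate in GB/s), as listed on the slide.
stages = [("Sensors", 1000.0), ("Events", 75.0), ("Filtered", 5.0), ("Reduced", 0.1)]


def reduction_factors(stages):
    """Factor by which each stage shrinks the rate of the previous one."""
    return [(a[0], b[0], a[1] / b[1]) for a, b in zip(stages, stages[1:])]


def pb_per_year(gb_per_s):
    """Yearly volume in PB for a rate sustained all year."""
    return gb_per_s * GB * SECONDS_PER_YEAR / PB


for src, dst, factor in reduction_factors(stages):
    print(f"{src} -> {dst}: {factor:.1f}x smaller")
for name, rate in stages:
    print(f"{name}: {pb_per_year(rate):,.1f} PB/y if sustained")
```

The sensor rate works out to roughly 30 EB/y, and the reduced stream to a few PB/y.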

31 Science Data Volume ESO/STECF Science Archive 100 TB archive Similar at Hubble, Keck, SDSS,… ~1PB aggregate

32 Premise: DataGrid Computing Store exabytes twice (for redundancy) Access them from anywhere Implies huge archive/data centers Supercomputer centers become super data centers Examples: Google, Yahoo!, Hotmail, BaBar, CERN, Fermilab, SDSC, …

33 Thesis Most new information is digital (and old information is being digitized) An Information Science Grand Challenge: –Capture –Organize –Summarize –Visualize this information Optimize Human Attention as a resource Improve information quality

34 Access!

35 The Evolution of Science Observational Science –Scientist gathers data by direct observation –Scientist analyzes data Analytical Science –Scientist builds analytical model –Makes predictions Computational Science –Simulate analytical model –Validate model and make predictions Data Exploration Science –Data captured by instruments or generated by simulator –Processed by software –Placed in a database / files –Scientist analyzes database / files

36 Computational Science Evolves Historically, Computational Science = simulation. New emphasis on informatics: –Capturing, –Organizing, –Summarizing, –Analyzing, –Visualizing Largely driven by observational science, but also needed by simulations. Too soon to say if comp-X and X-info will unify or compete. BaBar, Stanford Space Telescope P&E Gene Sequencer From http://www.genome.uci.edu/

37 Next-Generation Data Analysis Looking for –Needles in haystacks – the Higgs particle –Haystacks: Dark matter, Dark energy Needles are easier than haystacks Global statistics have poor scaling –Correlation functions are N^2, likelihood techniques N^3 As data and computers grow at the same rate, we can only keep up with N log N A way out? –Discard notion of optimal (data is fuzzy, answers are approximate) –Don’t assume infinite computational resources or memory Requires combination of statistics & computer science
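The scaling argument can be made concrete: if the data size N and the available compute both grow by the same factor g, an algorithm with cost f(N) takes f(gN) / (g f(N)) times as long as before. A minimal sketch, with an illustrative starting size and growth factor:

```python
import math


def relative_runtime(cost, n, growth):
    """Runtime after data and compute both grow by `growth`, relative to before."""
    return cost(n * growth) / (growth * cost(n))


costs = {
    "N log N": lambda n: n * math.log(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

n, growth = 1e9, 2.0  # illustrative values: a billion objects, doubling
for name, cost in costs.items():
    print(f"{name}: runtime x{relative_runtime(cost, n, growth):.2f}")
```

Only the N log N algorithm stays close to a factor of 1; the N^2 and N^3 methods fall behind by g and g^2 on every round of growth.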

38 Smart Data (active databases) If there is too much data to move around, take the analysis to the data! Do all data manipulations at database –Build custom procedures and functions in the database Automatic parallelism guaranteed Easy to build in custom functionality –Databases & procedures being unified –Example: temporal and spatial indexing –Pixel processing Easy to reorganize the data –Multiple views, each optimal for certain types of analyses –Building hierarchical summaries is trivial Scalable to Petabyte datasets
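"Taking the analysis to the data" means the database computes the summary and only the small result crosses the wire. A toy sketch using Python's built-in sqlite3 module; the `detections` table, its columns, and the `is_bright` function are hypothetical, and SQLite here merely stands in for a large server-side DBMS:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detections (zone INTEGER, flux REAL)")
conn.executemany(
    "INSERT INTO detections VALUES (?, ?)",
    [(i % 4, float(i)) for i in range(1000)],
)

# Summary computed inside the database: four rows come back, not a thousand.
summary = conn.execute(
    "SELECT zone, COUNT(*), AVG(flux) FROM detections GROUP BY zone ORDER BY zone"
).fetchall()

# Custom function registered with the database and used inside a query.
conn.create_function("is_bright", 1, lambda flux: int(flux > 900))
bright_count = conn.execute(
    "SELECT SUM(is_bright(flux)) FROM detections"
).fetchone()[0]

print(summary)
print(bright_count)
```

The same pattern scales up conceptually: the per-zone summary is one level of a hierarchy, and further levels are additional GROUP BY queries or views over it.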

39 Data Mining in the Image Domain: Can We Discover New Types of Phenomena Using Automated Pattern Recognition? (Every object detection algorithm has its biases and limitations) – Effective parametrization of source morphologies and environments – Multiscale analysis (Also: in the time/lightcurve domain)

40 Challenge: Make Data Publication & Access Easy Augment FTP with data query: Return intelligent data subsets Make it easy to –Publish: Record structured data –Find: Find data anywhere in the network Get the subset you need –Explore datasets interactively Realistic goal: –Make it as easy as publishing/reading web sites today.

41 Federation Data Federations of Web Services Massive datasets live near their owners: –Near the instrument’s software pipeline –Near the applications –Near data knowledge and curation –Super Computer centers become Super Data Centers Each Archive publishes a web service –Schema: documents the data –Methods on objects (queries) Scientists get “personalized” extracts Uniform access to multiple Archives –A common global schema Challenge: –What is the object model for your science?

42 Web Services: The Key? Web SERVER: –Given a URL + parameters –Returns a web page (often dynamic) Web SERVICE: –Given an XML document (SOAP message) –Returns an XML document –Tools make this look like an RPC: F(x,y,z) returns (u, v, w) –Distributed objects for the web –+ naming, discovery, security, … Internet-scale distributed computing [Diagram labels: Your program, Data in your address space, Web Service, SOAP, object in XML; Your program, Web Server, HTTP, Web page]
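The "XML document in, XML document out, made to look like an RPC" idea can be illustrated without any network code. In the sketch below, `service` is a local stand-in for a remote web service and `F` is the client-side stub; the element names and the arithmetic are invented for illustration, and this is plain XML rather than a real SOAP envelope:

```python
import xml.etree.ElementTree as ET


def service(request_xml):
    """Stand-in for a remote web service: XML document in, XML document out."""
    request = ET.fromstring(request_xml)
    x, y, z = (float(request.findtext(name)) for name in ("x", "y", "z"))
    response = ET.Element("response")
    for name, value in (("u", x + y), ("v", y + z), ("w", x * z)):
        ET.SubElement(response, name).text = repr(value)
    return ET.tostring(response, encoding="unicode")


def F(x, y, z):
    """Client stub: hides the XML encoding so the call reads like an RPC."""
    request = ET.Element("request")
    for name, value in (("x", x), ("y", y), ("z", z)):
        ET.SubElement(request, name).text = repr(value)
    reply = ET.fromstring(service(ET.tostring(request, encoding="unicode")))
    return tuple(float(reply.findtext(name)) for name in ("u", "v", "w"))


u, v, w = F(1, 2, 3)
print(u, v, w)
```

In a real deployment the call to `service` would be an HTTP request, and generated tooling would produce the stub from the service's published description.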

43 Web Services Architecture

44 Information Science and Data Generation Trends What do large amounts of information provide? –New opportunities for search! –New discoveries Business opportunities? Research opportunities? Problems?

