1 The Data Avalanche
Talk at the University of Tokyo, Japan, October 2005
Jim Gray, Microsoft Research

2 Numbers: Terabytes and Gigabytes are BIG!
In dollars:
Mega – a house in San Francisco
Giga – a very rich person
Tera – roughly the Bush national debt
Peta – more than all the money in the world
In bytes:
A gigabyte: the human genome
A terabyte: a 150-mile-long shelf of books

3 Outline
(Sidebar: the storage prefix ladder – Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta – with a "we are here" marker.)
Historical trends imply that in 20 years:
we can store everything in cyberspace: the personal petabyte;
computers will have natural interfaces: speech recognition/synthesis; vision, object recognition beyond OCR.
Implications:
The information avalanche will only get worse.
The user interface will change: less typing; more writing, talking, gesturing; more seeing and hearing.
Organizing, summarizing, and prioritizing information is a key technology.

4 How much information is there?
(Sidebar: the prefix ladder from Kilo to Yotta, annotated with examples – a book, a photo, a movie, all books (words), all recorded multimedia, everything recorded. Key to the small prefixes: 10^-3 milli, 10^-6 micro, 10^-9 nano, 10^-12 pico, 10^-15 femto, 10^-18 atto, 10^-21 zepto, 10^-24 yocto.)
Soon everything can be recorded and indexed.
Most bytes will never be seen by humans.
Data summarization, trend detection, and anomaly detection are key technologies.
See Mike Lesk, "How much information is there?", and Lyman & Varian, "How Much Information?".

5 Things Have Changed
1956: IBM 305 RAMAC – 10 MB disk, ~1 M$ (in y2004 dollars).

6 The Next 50 Years Will See MORE CHANGE
ops/s/$ had three growth curves, 1890–1990 (a combination of Hans Moravec + Larry Roberts + Gordon Bell; metric: WordSize × ops/s / system price):
Mechanical and relay: 7-year doubling
Tube, transistor, …: 2.3-year doubling
Microprocessor: 1.0-year doubling

7 Constant Cost or Constant Function?
100x improvement per decade can be taken two ways:
Constant price – 100x more function for the same price: Mainframe → SMP → Constellation/Cluster; Mini → SMP → Constellation; Workstation → …
Lower price – the same function 100x cheaper, creating a new category: graphics/storage appliances, PDAs, camera/browsers.

8 Growth Comes From NEW Apps
The 10 M$ computer of 1980 costs 1 k$ today.
If we were still doing the same things, IT would be a ~0 B$/y industry.
NEW things absorb the new capacity.

9 The Surprise-Free Future in 20 Years
10,000x more power for the same price: the personal supercomputer; personal petabyte stores.
The same function for 10,000x less cost: smart dust – the penny PC?
The 10 peta-op computer (for 1,000$).

10 10,000x Would Change Things
Human–computer interface: decent computer vision; decent computer speech recognition; decent computer speech synthesis.
Vast information stores, and the ability to search and abstract those stores.

11 How Good Is HCI Today? Surprisingly good.
Demo of making faces; demo of speech synthesis ("Daisy", HAL's synthetic voice).
Speech recognition is improving fast; vision is getting better; pen computing is finally a reality; displays are improving fast (compared to the last 30 years).

12 Outline
(Sidebar: the storage prefix ladder – Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta – with a "we are here" marker.)
Historical trends imply that in 20 years:
we can store everything in cyberspace: the personal petabyte;
computers will have natural interfaces: speech recognition/synthesis; vision, object recognition beyond OCR.
Implications:
The information avalanche will only get worse.
The user interface will change: less typing; more writing, talking, gesturing; more seeing and hearing.
Organizing, summarizing, and prioritizing information is a key technology.

13 How much information is there?
(Sidebar: the prefix ladder from Kilo to Yotta, annotated with examples – a book, a photo, a movie, all books (words), all recorded multimedia, everything recorded.)
Almost everything is recorded digitally.
Most bytes are never seen by humans.
Data summarization, trend detection, and anomaly detection are key technologies.
See Mike Lesk, "How much information is there?", and Lyman & Varian, "How Much Information?".

14 And >90% in Cyberspace. Because:
Low rent: minimal $/byte.
Shrinks time: now or later (immediate OR time-delayed).
Shrinks space: here or there (point-to-point OR broadcast).
Automated processing: knowbots that locate, process, analyze, and summarize.

15 MyLifeBits: The Guinea Pig
Gordon Bell is digitizing his life. He has now scanned virtually all:
books written (and read, when possible);
personal documents (correspondence, memos, email, bills, legal, …);
photos; posters, paintings, and photos of things (artifacts, medals, plaques, …);
home movies and videos; the CD collection;
and, of course, all PC files.
Recording: phone, radio, TV, web pages, conversations…
Paperless throughout (…″ scanned, 12′ discarded).
Only 30 GB, excluding videos. Video is 2+ TB and growing fast.

16 Capture and encoding

17 I mean everything

18 A 25K-day Life ~ a Personal Petabyte
1 PB over a ~25,000-day lifetime. Will anyone look at web pages in 2020? Probably new modalities & media will dominate by then.

19 Challenges
Capture: get the bits in.
Organize: index them.
Manage: no worries about loss or space.
Curate/annotate: automate where possible.
Privacy: keep it safe from theft.
Summarize: give thumbnail summaries.
Interface: how to ask/anticipate questions.
Present: show it in understandable ways.

20 Memex – "As We May Think", Vannevar Bush, 1945
"A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility."
"Yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can be profligate and enter material freely."

21 Too Much Storage? Try to fill a terabyte in a year.

Item                           Items/TB   Items/day
300 KB JPEG                    3 M        9,800
1 MB document                  1 M        2,900
1 hour 256 kb/s MP3 audio      9 K        26
1 hour 1.5 Mb/s MPEG video     290        0.8

A petabyte volume has to be some form of video.
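The table's arithmetic is easy to reproduce. A minimal sketch in Python (my own illustration, assuming decimal units; note that a literal 1.5 Mb/s gives ~1,500 hours of video per TB, so the slide's 290 implies a higher effective bitrate of roughly 8 Mb/s):

```python
# Items per terabyte, and items per day needed to fill a TB in a year.
TB = 1e12  # decimal terabyte, in bytes

item_sizes = {                                # the slide's assumed sizes, bytes
    "300 KB JPEG":       300e3,
    "1 MB document":     1e6,
    "1 h 256 kb/s MP3":  256e3 / 8 * 3600,    # ~115 MB
    "1 h 1.5 Mb/s MPEG": 1.5e6 / 8 * 3600,    # ~675 MB
}

for name, size in item_sizes.items():
    per_tb = TB / size
    print(f"{name:20s} {per_tb:12,.0f}/TB {per_tb / 365:10,.1f}/day")
```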

22 How Will We Find Anything?
We need queries, indexing, pivoting, scalability, backup, replication, online update, and set-oriented access. If you don't use a DBMS, you will implement one!
Simple logical structure: a blob and a link are all that is inherent. Additional properties are facets (== extra tables), with methods on those tables (encapsulation). See the sketch below.
More than a file system: it unifies data and meta-data. SQL++ / DBMS.
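A minimal sketch of that blob-and-link structure in SQLite (the table and column names are my own illustration, not an actual MyLifeBits schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- everything is a blob with system annotations...
CREATE TABLE item (
    id      INTEGER PRIMARY KEY,
    content BLOB,
    created TEXT              -- e.g. 'Date 7/7/2000'
);
-- ...plus links between blobs (a story transcluding a photo, etc.)
CREATE TABLE link (
    src  INTEGER REFERENCES item(id),
    dst  INTEGER REFERENCES item(id),
    kind TEXT                 -- e.g. 'annotates', 'transcludes'
);
-- a facet: an extra table of typed properties for one class of item
CREATE TABLE photo (
    id  INTEGER REFERENCES item(id),
    lat REAL, lon REAL        -- auto-annotation from a GPS camera
);
""")
```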

23 Photos

24 Searching: the most useful app?
Challenge: what questions produce useful results? There are many ways to present the answers.

25

26 Detail view TEXT: PDP-8

27 Resource explorer: ancestor (collections), annotations, descendant & preview panes turned on.

28 Synchronized Timelines with Histogram Guide
Drag & drop: search for microsoft.com; drag over all results.
Multiple timelines: select the Sheet window; Window/Split; new search for google.com; drag the results into the new timeline; lock the scrollbars.
Transclusion: right-click to explore.

29 The Value of Media Depends on Annotations
"It's just bits until it is annotated."

30 System Annotations Provide a Base Level of Value
Date 7/7/2000

31 Tracking Usage – Even Better
Date 7/7/2000. Opened 30 times, e-mailed to 10 people (it's valued by the user!).

32 Getting the User to Say a Little Something Is a Big Jump
Date 7/7/2000. Opened 30 times, e-mailed to 10 people. "BARC dim sum intern farewell lunch."

33 Getting the User to Tell a Story Is the Ultimate in Media Value
A story is a "layout" in time and space – the most valuable content (by selection, and by being well annotated).
Stories must include links to any media they use (for future navigation/search – "transclusion"). Cf. MovieMaker; Creative Memories PhotoAlbums.
Example story: "Dapeng was an intern at BARC for the summer of 2000. We took him to lunch at our favorite dim sum place to say farewell. At table L–R: Dapeng, Gordon, Tom, Jim, Don, Vicky, Patrick, Jim."

34 The Value of Media Depends on Annotations
"It's just bits until it is annotated."
Auto-annotate whenever possible, e.g. GPS cameras.
Make manual annotation as easy as possible: XP photo capture; voice; photos with voice; etc.
Support gang annotation. Make stories easy.
(Same example story as on the previous slide.)

35 80% of Data Is Personal / Individual. But What About the Other 20%?
Business: Wal-Mart online is 1 PB and growing… Paradox: most "transaction" systems are < 1 PB; you have to go to image/data monitoring for big data.
Government: government is the biggest business.
Science: LOTS of data.

36 Instruments: CERN LHC – Petabytes per Year
Looking for the Higgs particle.
Sensors: GB/s (1 TB/s ~ 30 EB/y) → events: GB/s → filtered: GB/s → reduced: GB/s → ~2 PB/y at CERN Tier 0.
Data pyramid: 100 GB : 1 TB : 100 TB : 1 PB : 10 PB.

37 Information Avalanche
Both better observational instruments and better simulations are producing a data avalanche. Examples:
Turbulence: a 100 TB simulation, then mine the information.
BaBar: grows 1 TB/day – 2/3 simulation information, 1/3 observational information.
CERN: the LHC will generate 1 GB/s, 10 PB/y.
VLBA (NRAO) generates 1 GB/s today.
NCBI: "only ½ TB", but doubling each year; a very rich dataset.
Pixar: 100 TB per movie.
Image courtesy of C. Meneveau & A. Szalay @ JHU.

38 Q: Where Will the Data Come From? A: Sensor Applications
Earth observation: 15 PB by 2007.
Medical images & information + health monitoring: potentially 1 GB/patient/y → 1 EB/y.
Video monitoring: ~1E8 video cameras × ~1E5 B/s → 10 TB/s → 100 EB/y filtered???
Airplane engines: 1 GB of sensor data per flight, 100,000 engine-hours/day → 30 PB/y.
Smart dust: ?? EB/y.
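A quick check of two of these back-of-envelope estimates (my own sketch of the slide's arithmetic):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600               # ~3.15e7

# Video monitoring: ~1e8 cameras at ~1e5 bytes/s each
total = 1e8 * 1e5                                # = 1e13 B/s = 10 TB/s
print(f"video: {total / 1e12:.0f} TB/s, "
      f"{total * SECONDS_PER_YEAR / 1e18:.0f} EB/y raw")
# ~300 EB/y raw; the slide's 100 EB/y presumably assumes filtering.

# Airplane engines: 100,000 engine-hours/day at ~1 GB per engine-hour
pb_per_year = 100_000 * 1e9 * 365 / 1e15
print(f"engines: {pb_per_year:.1f} PB/y")        # ~36.5, the slide's ~30 PB/y
```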

39 The Big Picture
(Diagram: experiments & instruments, simulations, literature, and other archives all feed facts into the archive; questions go in, answers come out.)
The big problems:
Data ingest. Managing a petabyte. A common schema. How to organize it? How to reorganize it? How to coexist with others?
Query and visualization tools. Support and training.
Performance: execute queries in a minute; batch query scheduling.

40 FTP – GREP: Download (FTP and GREP) Is Not Adequate
You can GREP 1 MB in a second; 1 GB in a minute; 1 TB in 2 days; 1 PB in 3 years. Oh, and 1 PB is ~3,000 disks.
At some point we need indices to limit the search, and parallel data search and analysis. This is where databases can help.
The next-generation technique, data exploration: bring the analysis to the data!
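The GREP ladder is just a single-disk sequential scan rate applied to growing volumes; a minimal sketch (assuming ~10 MB/s effective scan throughput, the order of magnitude the slide's figures imply, and ~300 GB disks for the disk count):

```python
RATE = 10e6                                  # bytes/s, one-disk grep throughput

for name, size in [("1 MB", 1e6), ("1 GB", 1e9),
                   ("1 TB", 1e12), ("1 PB", 1e15)]:
    secs = size / RATE
    print(f"grep {name}: {secs:13,.1f} s  (~{secs / 86400:9,.2f} days)")
# 1 PB -> ~1e8 s, about 3 years of scanning.

print(f"disks per PB: {1e15 / 300e9:,.0f}")  # ~3,300, the slide's "~3,000 disks"
```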

41 The Speed Problem
Many users want to search the whole DB with ad hoc, often combinatorial queries, and want ~1-minute response.
Brute force (parallel search): 1 disk = 50 MBps ⇒ ~1M disks/PB ~ 300 M$/PB.
Indices (limit the search; use a column store): 1,000x less equipment, ~1 M$/PB.
Pre-compute the answer: no one knows how to do it for all questions.
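A sketch of the brute-force arithmetic (the raw disk count comes out a few times lower than the slide's ~1M, which presumably allows for real-world overheads):

```python
PB = 1e15
DISK_RATE = 50e6        # bytes/s per disk, the slide's assumption
RESPONSE = 60           # seconds

disks = PB / (DISK_RATE * RESPONSE)            # disks scanning in parallel
print(f"disks for a 1-minute PB scan: {disks:,.0f}")    # ~330,000
print(f"at ~300 $/disk: {disks * 300 / 1e6:,.0f} M$")   # ~100 M$
# An index that narrows the scan to ~0.1% of the data needs ~1,000x
# less hardware for the same response time -- the slide's ~1 M$/PB.
```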

42 Next-Generation Data Analysis
Looking for needles in haystacks: the Higgs particle. The haystacks: dark matter, dark energy. Needles are easier than haystacks.
Global statistics have poor scaling: correlation functions are N², likelihood techniques N³. As data and computers grow at the same rate, we can only keep up with N log N.
A way out? Relax the notion of optimal (data is fuzzy, answers are approximate); don't assume infinite computational resources or memory. A combination of statistics & computer science.
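To see why N log N is the break-even point: if data size N and compute power both grow by the same factor g, the wall-clock time of an O(f(N)) method changes by f(gN) / (g·f(N)). A small sketch of that ratio (my own illustration):

```python
import math

def slowdown(f, n, g):
    """How much slower an O(f(N)) analysis gets when data
    and compute both grow by a factor of g."""
    return f(g * n) / (g * f(n))

n, g = 1e9, 10
for name, f in [("N log N", lambda x: x * math.log(x)),
                ("N^2",     lambda x: x ** 2),
                ("N^3",     lambda x: x ** 3)]:
    print(f"{name:8s} slowdown: {slowdown(f, n, g):7.2f}x")
# N log N stays ~1.1x (we keep up); N^2 slows 10x; N^3 slows 100x.
```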

43 Analysis and Databases
Much statistical analysis deals with:
creating uniform samples; data filtering;
assembling relevant subsets;
estimating completeness; censoring bad data;
counting and building histograms;
generating Monte Carlo subsets;
likelihood calculations; hypothesis testing.
Traditionally these are performed on files, but most of these tasks are much better done inside a database.
Move Mohamed to the mountain, not the mountain to Mohamed.
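A minimal illustration of the point with SQLite from Python: a filtered histogram computed where the data lives, instead of shipping the raw rows to a client program (the table and column names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obj (mag REAL, type INTEGER)")
con.executemany("INSERT INTO obj VALUES (?, ?)",
                [(14 + (i % 60) / 10, i % 4) for i in range(100_000)])

# Filtering, counting, and histogramming pushed into the database:
rows = con.execute("""
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*)
    FROM obj
    WHERE type = 3              -- assemble the relevant subset
    GROUP BY bin
    ORDER BY bin""")
for mag_bin, n in rows:
    print(mag_bin, n)
```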

44 Outline
(Sidebar: the storage prefix ladder – Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta – with a "we are here" marker.)
Historical trends imply that in 20 years:
we can store everything in cyberspace: the personal petabyte;
computers will have natural interfaces: speech recognition/synthesis; vision, object recognition beyond OCR.
Implications:
The information avalanche will only get worse.
The user interface will change: less typing; more writing, talking, gesturing; more seeing and hearing.
Organizing, summarizing, and prioritizing information is a key technology.

45 The Evolution of Science
Observational science: the scientist gathers data by direct observation, then analyzes it.
Analytical science: the scientist builds an analytical model and makes predictions.
Computational science: simulate the analytical model; validate the model and make predictions.
Data exploration science: data captured by instruments, or generated by a simulator, is processed by software and placed in a database / files; the scientist analyzes the database / files.

46 e-Science
Data captured by instruments, or generated by a simulator, is processed by software and placed in files or a database; the scientist analyzes the files / database.
Virtual laboratories; networks connecting e-scientists.
Strong support from funding agencies: better use of resources. Primitive today.

47 e-Science Is Data Mining
There are LOTS of data; people cannot examine most of it. We need computers to do the analysis.
Manual or automatic exploration:
Manual: a person suggests a hypothesis; the computer checks it.
Automatic: the computer suggests a hypothesis; a person evaluates its significance.
Given an arbitrary parameter space: data clusters; points between data clusters; isolated data clusters; isolated data groups; holes in data clusters; isolated points. (Nichol et al. 2001)
1 petabyte = 1,000 terabytes ≈ 10,000 disks today. You can grep 1 MB in 1 s, 1 GB in 1 min, 1 TB in 2 days, 1 PB in 3 years; you can FTP 1 MB in 1 s.
Slide courtesy of and adapted from Robert …, CalTech.

48 TerraServer / TerraService – http://terraService.Net/
US Geological Survey photo (DOQ) & topo (DRG) images online. On the Internet since June 1998. Operated by Microsoft Corporation.
Cross-indexed with home sales, demographics, and encyclopedia articles.
A web service. A 20 TB data source; 10 M web hits/day.

49 USGS Image Data
Digital OrthoQuads (DOQ): 18 TB, 260,000 files uncompressed; digitized aerial imagery; 88% coverage of the conterminous US; 1-meter resolution; < 10 years old.
Digital Raster Graphics (DRG): 1 TB compressed TIFF, 65,000 files; scanned topographic maps; 100% U.S. coverage; 1:24,000, 1:100,000, and 1:250,000 scale maps; maps vary in age.

Speaker notes: This slide shows an example image from the two USGS data themes currently stored on Microsoft TerraServer. The first theme, aerial imagery, is the USGS "DOQ" data-set. To date, we have received 14 TB of USGS data. We get about 0.75 TB of uncompressed imagery per quarter from the USGS. Some of the data is for areas that were not "covered" previously; the rest are more recent updates or error corrections to previously received data. Currently we have 88% coverage. The USGS expects to reach 96% coverage of the lower 48 states by the beginning of the year. There are a variety of issues with achieving 100% coverage; note that it has nothing to do with "image security" issues.
The second data-set we added was the topographic map data, the USGS "DRG" (Digital Raster Graphics). These are scanned paper maps at several different map scales – 1:250k, 1:100k, 1:63k, 1:25k, and 1:24k. We have 100% coverage of the entire United States, including Alaska and Hawaii. However, many of the maps are woefully out of date; it is not uncommon to find maps from the 1950s. Often this is not disastrous, as the primary use for this data is the actual topo lines, shown in brown in the image. We've found that hikers and campers love these maps; they are very handy for assessing the gravity/difficulty of a particular hiking location.
The big challenge in processing the topo data was removing the ancillary information from the digital file. If you have ever seen a USGS paper map, you will notice quite a bit of data – tick marks, titles, legend, etc. – located all around the edges of the map. The digital file, like the paper map, includes all the information located around the map; we've dubbed this the "marginalia". To edge-match all the topo maps, the marginalia had to be removed first, then tiles extracted and merged from neighboring topo maps. Also, the DRG data is in a different projection, UTM NAD27, than the USGS DOQ data, UTM NAD83. We "moved" the USGS DRG data from NAD27 to NAD83 before storing it in our database. This allows us to "switch" between DOQ and DRG at the same scale and give the user a "topo view" or "image view" of the same location.

50 User Interface Concept
Display imagery: 316 M 200×200-pixel images; a 7-level image pyramid; resolution from 1 meter/pixel to 64 meters/pixel.
Navigation tools: 1.5 M place names; a "click-on" coverage map; longitude-and-latitude search; U.S. address search.
External geo-spatial links to: USGS on-line stream-flow gauges; Home Advisor demographics; Home Advisor real estate; Encarta articles.
Concept: the user navigates an "almost seamless" image of earth – click on the image to zoom in; buttons to pan NW, N, NE, W, E, SW, S, SE; links to switch between topo, imagery, and relief data.

Speaker notes: {Note: this slide can/should be skipped if the site was demonstrated on-line.} Conceptually, the Microsoft TerraServer database and web application present an "almost" seamless mosaic of an image of earth. By "almost" we mean the mosaic "comes to an end" because of some limitations in presenting "flat maps of a round object", or because our vendor "ran out of imagery" at some point. For example, there is not much point in digitizing images of the oceans, and the USGS only digitizes U.S. territory, so our imagery stops at the U.S. borders with Canada and Mexico.
Metaphorically, imagine you are "virtually flying over the country" at some altitude. Looking out the window and straight down, you can see some "square" of earth and some amount of detail. The amount of detail is controlled by the altitude of your "plane"; what you see next depends on the direction of your aircraft. On TerraServer, you can "click on the image" to change your virtual plane's altitude to be "lower to the ground" so you can see more detail. Buttons surround the image so you can "move your virtual plane" to see imagery in the NW, N, NE, W, E, SW, S, or SE directions. Links in the blue TerraServer "toolbar" allow you to "Print", "Download", or examine meta "Information" about the imagery you are viewing. You can also change the size of your "virtual window" so that you can see more or less imagery at one time. Finally, you can switch your view to a topographic map, a relief map, or the default aerial/satellite image.
Currently, the USGS has completed photographing and digitizing approximately 88% of the continental U.S., that is, the conterminous lower-48 states. We have 100% coverage of all 50 states in the topographic map data-set and 100% coverage of the globe in the relief map; however, the relief map data is very low resolution – approximately 1 kilometer per pixel. To date (July 2001), we have received approximately 18 terabytes of uncompressed aerial imagery from the USGS in the form of 251,000 files. The files range in size from 46 to 151 MB per file, depending on whether they are black-and-white "grayscale" files or RGB "color-infrared" files. Note, we never directly display color-infrared data; instead, we convert it to grayscale during the loading process. The imagery we receive from the USGS has about a 300-pixel overlap with adjacent files; the TerraServer loading process removes the overlap during the load and compression process. The result is that TerraServer contains 305 million compressed images that we have dubbed "tiles". Each tile is 200 pixels by 200 pixels in size. The amount of ground covered by a TerraServer tile depends on the resolution or "scale" of a pixel. From experience, we have found that we can produce a maximum of 7 "zoomed-out views" from the full-resolution data; we've learned from our graphics experts to create 2x, 4x, 8x, 16x, 32x, and 64x "zoom reductions" from the initial full-resolution data.
For USGS 1-meter-per-pixel imagery, we have 1 m, 2 m, 4 m, 8 m, 16 m, 32 m, and 64 m resolution data. For USGS topographic map data, we have 2 m, 4 m, 8 m, 16 m, 32 m, 64 m, and 128 m per-pixel resolution data. We provide three ways for users to find an image among the 305 million tiles: search by place name, by coverage map, and by longitude and latitude. We use the 1.5-million-place-name "gazetteer" made available to us by Microsoft's Geography Business Unit, the makers of "Streets and Trips" and "MapPoint". Users can enter part of a formal place name, e.g. "San", "San, CA", "New York, NY", "Yellowstone", etc., and TerraServer will do a wildcard search to locate all the places matching the specified string where we have imagery. Alternatively, we display a low-resolution map of the earth, shaded to show the locations where we have image coverage; users can click on the shaded map to see the image located nearest where they clicked. Finally, users can enter a longitude/latitude pair to find an image containing that position. The bottom line is that the TerraServer interface provides a simple way to search for an "initial tile of interest" to display in the center of a web page. Simple buttons exist to enlarge, zoom in, zoom out, change the type of data, and pan around the "tile of interest". This simple concept is free of "GIS terminology" and other sophisticated concepts that would require sophisticated scripts or applets hosted within the web browser. Links to print, download, and view meta-data information.

51 TerraService: New Things
A popular web service. Exactly the map you want:
Dynamic map re-projection: UTM to geographic projection; dynamic texture mapping?
New data: 1-foot-resolution natural-color imagery; Census TIGER data.
Lights-out management: MOM; auto-backup / restore on drive failure.

52 “Urban Area” Data
Microsoft campus at 4-meter resolution; “Redundant Bunch 1” ball field at 0.25-meter resolution.

53 TerraServer Becomes a Web Service: TerraServer.net → TerraService.Net
A web server is for people; a web service is for programs.
The end of screen scraping: no faking a URL – you pass real parameters; no parsing the answer – the data is formatted into your address space.
Hundreds of users, but one specific example: the US Department of Agriculture.

54 TerraServer Web Services
Terra-Tile-Service: get image meta-data; query the TS gazetteer; retrieve TS image tiles; projection conversions.
Landmark-Service: landmarks layered on TerraServer imagery; geo-coded data of well-known objects (points), e.g. schools, golf courses, hospitals; polygons of well-known objects (shapes), e.g. ZIP codes, cities.
Sample apps: a web map client (OpenGIS-"like"); a fat map client (a Visual Basic / C# Windows Form) that accesses the web services for all its data.

55 Web Services: Internet-Scale Distributed Computing
Web SERVER: given a URL + parameters, it returns a web page (often dynamic). Your program → http → web server → web page.
Web SERVICE: given an XML document (a SOAP message), it returns an XML document – data delivered as an object in XML, into your address space. Tools make this look like an RPC: F(x, y, z) returns (u, v, w).
Distributed objects for the web, plus naming, discovery, security, … (See the sketch below.)
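A minimal sketch of the contrast in Python (the endpoint URLs and the method name are hypothetical, and `zeep` is a present-day SOAP client used purely for illustration):

```python
import requests          # plain HTTP: you get a page meant for people
from zeep import Client  # SOAP: you get typed data meant for programs

# Web SERVER: URL + parameters -> HTML you would have to screen-scrape
page = requests.get("http://example.org/map",
                    params={"lon": 139.76, "lat": 35.68})
print(page.text[:80])    # "<html>..." -- parsing this is fragile

# Web SERVICE: a SOAP message in, an XML document out; the toolkit
# makes F(x, y, z) look like an ordinary call returning objects
svc = Client("http://example.org/Service.asmx?WSDL")      # hypothetical WSDL
tile = svc.service.GetTileMeta(lon=139.76, lat=35.68)     # hypothetical method
print(tile)              # a real object in your address space
```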

56 TerraServer Hardware: Storage Bricks
Storage bricks ("white-box commodity servers"): 4 TB raw / 2 TB RAID1 SATA storage; dual hyper-threaded Xeon 2.4 GHz, 4 GB RAM; KVM over IP.
Partitioned databases (PACS – a partitioned array): 3 storage bricks = 1 TerraServer's data; data partitioned across 20 databases; more data & partitions coming.
Low-cost availability: 4 copies of the data; RAID1 SATA mirroring; 2 redundant "bunches"; a spare brick to repair a failed brick (a 2N+1 design).
The web application is "bunch aware": it load-balances between the redundant databases and fails over to the surviving database on failure.
~100 K$ capital expense.

57 Virtual Observatory – http://www.astro.caltech.edu/nvoconf/, http://www.voforum.org/
Premise: most data is (or could be) online. So the Internet is the world's best telescope:
It has data on every part of the sky, in every measured spectral band (optical, x-ray, radio, …), as deep as the best instruments (of 2 years ago).
It is up when you are up, and the "seeing" is always great (no working at night, no clouds, no moons, no …).
It's a smart telescope: it links objects and data to the literature on them.

58 Why Astronomy Data?
(Figure: the same sky in many bands – DSS optical, 2MASS 2 µm, IRAS 25 µm and 100 µm, WENSS 92 cm, NVSS 20 cm, GB 6 cm, ROSAT ~keV.)
It has no commercial value: no privacy concerns; results can be freely shared with others; great for experimenting with algorithms.
It is real and well documented: high-dimensional data (with confidence intervals); spatial data; temporal data.
Many different instruments from many different places and many different times: federation is a goal.
The questions are interesting: how did the universe form?
And there is a lot of it (petabytes).

59 Time and Spectral Dimensions: The Multiwavelength Crab Nebula
The Crab supernova was recorded in 1054 AD. X-ray, optical, infrared, and radio views of the nearby Crab Nebula, which is now in a state of chaotic expansion after the supernova explosion first sighted in 1054 A.D. by Chinese astronomers.
Slide courtesy of Robert …, CalTech.

60 SkyServer.SDSS.org: A Modern Archive
Raw pixel data lives in file servers; catalog data (derived objects) lives in a database. Online query to any and all.
Also used for education: 150 hours of online astronomy that implicitly teaches data analysis.
Interesting things: spatial data search; a client query interface via a Java applet; a query interface via Emacs; popular (~1% of TerraServer); cloned by other surveys (a template design); web services are the core of it.

61 Demo of SkyServer
Shows the standard web server; pixel/image data; point and click; exploring one object; exploring sets of objects (data mining).

62 Data Federations of Web Services
Massive datasets live near their owners: near the instrument's software pipeline; near the applications; near the data knowledge and curation. Supercomputer centers become super data centers.
Each archive publishes a web service: a schema that documents the data, and methods on objects (queries).
Scientists get "personalized" extracts, and uniform access to multiple archives through a common global schema: federation.

63 SkyQuery: A Prototype WWT
Started with the SDSS data and schema; imported 12 other datasets into that spine schema (about a day per dataset, plus load time); unified them with a portal.
An implicit spatial join among the datasets.
All built on web services: pure XML, pure SOAP; used the .NET toolkit.

64 Federation: SkyQuery.Net
Combined 4 archives initially; just added 10 more. Send a query to the portal; the portal joins data from the archives.
Problem: you want to do multi-step data analysis (not just a single query). Solution: allow personal databases on the portal.
Problem: some queries are monsters. Solution: "batch schedule" them on the portal server, depositing the answer in the personal database.

65 SkyQuery Structure
Each SkyNode publishes a schema web service and a database web service.
The portal plans the query (2-phase), integrates the answers, and is itself a web service.
(Diagram: the SkyQuery portal and image cutout service federating the 2MASS, INT, SDSS, and FIRST SkyNodes.)

66 SkyQuery: http://skyquery.net/
A distributed query tool using a set of web services. Four astronomy archives, from Pasadena, Chicago, Baltimore, and Cambridge (England).
A feasibility study, built in 6 weeks by Tanu Malik (JHU CS grad student) and Tamas Budavari (JHU astro postdoc), with help from Szalay, Thakar, and Gray. Implemented in C# and .NET.
Allows queries like the following, where XMATCH is the implicit spatial cross-match and AREA restricts the search region:

SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o, t) < 3.5
  AND AREA(181.3, -0.76, 6.5)
  AND o.type = 3 AND (o.I - t.m_j) > 2

67 SkyNode Basic Web Services
Metadata: information about resources – waveband, sky coverage, translation of names to a universal dictionary (UCD).
Simple search patterns on the resources: cone search; image mosaic; unit conversions.
Simple filtering, counting, and histogramming; on-the-fly recalibrations.
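The cone search is the simplest of these: by the VO's Simple Cone Search convention it is an HTTP GET with a position and search radius in decimal degrees, returning a VOTable (an XML answer set). A minimal sketch against a hypothetical SkyNode endpoint:

```python
import requests

params = {"RA": 181.3, "DEC": -0.76, "SR": 0.05}   # position + radius, degrees
resp = requests.get("http://example.org/skynode/ConeSearch",  # hypothetical URL
                    params=params)

votable_xml = resp.text   # a VOTable: UCD-tagged metadata + rows of data
print(votable_xml[:200])
```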

68 Portals: Higher-Level Services
Built on the atomic services, these perform more complex tasks. Examples: automated resource discovery; cross-identifications; photometric redshifts; outlier detection; visualization facilities.
Goal: build custom portals in days from existing building blocks (as one does today in IRAF or IDL).

69 Open SkyQuery
SkyQuery is being adopted by AstroGrid as the reference implementation for OGSA-DAI (Open Grid Services Architecture – Data Access and Integration). SkyNode is the basic archive object. The SkyQuery Language (VoQL) is evolving.

70 The Registry
UDDI seemed inappropriate: complex; irrelevant questions; relevant questions missing.
Evolved Dublin Core instead: it represents datasets, services, and portals, and needs to be machine readable.
Federation (the DNS model). Push & pull: register, then harvest.

71 Unified Definitions: Universal Content Definitions (UCDs)
Collated all the table heads from all the literature: 100,000 terms reduced to ~1,500. Rough consensus that this is the right thing; refinement is in progress as people use UCDs.
Defines units (gram, radian, second, jansky, …) and semantic concepts / metrics (standard error, χ² fit, magnitude, passband, velocity, …).

72 Classes and Methods
(Diagram: your program ↔ SOAP ↔ web service; data arrives as an object in XML, in your address space.)
First class – the VOTable: represents an answer set in XML; defined by an XML Schema (XSD); metadata (in terms of UCDs) plus the data representation (numbers and text).
First method – cone search: get the objects in this cone.

73 Provenance
Most data will be derived. To do science, you need to trace derived data back to its source, so programs and inputs must be registered, and it must be possible to re-run them.
Example: Space Telescope calibrated data is run on demand; you can specify the software version (to get old answers).
Scientific data provenance and curation are largely unsolved problems (some ideas, but no science).

74 Other Classes
(Diagram: your program ↔ SOAP ↔ web service; data arrives as an object in XML, in your address space.)
The space-time class.
The image class (returns pixels): SdssCutout; the Simple Image Access Protocol; HyperAtlas.
Spectral: the Simple Spectral Access Protocol; 500 K spectra available at … .
Query services: ADQL and SkyNode. And the Registry (see slide 70).

75 Object Model
General acceptance of XML; recent acceptance of XML Schema (XSD over DTD); wait-and-see about SOAP/WSDL/…
"Web services are just CORBA with angle brackets." "FTP is good enough for me."
Personal opinion: web services are much more than "CORBA + <>" – a huge focus on interop, and a huge focus on integrated tools. But the community says "Show me!" Many technologists are convinced, but not yet the astronomers.

76 Data Sources
Literature online and cross-indexed: SIMBAD, ADS, NED, …
Many curated archives online: FIRST, DPOSS, 2MASS, USNO, IRAS, SDSS, VizieR, … – typically files with English meta-data and some programs.
Groups, researchers, and amateurs publish datasets online in various formats. Data publications are ephemeral (they may disappear); many have unknown provenance; documentation varies, some good and some none.

77 The WWT Components / What We Learned
Components: data sources (literature, archives); unified definitions (units, semantics/concepts/metrics, representations, provenance); the object model (classes and methods); portals. The WWT is a poster child for the Data Grid.
What we learned: Astro is a community of 10,000, homogeneous & cooperative – if you can't do it for astro, do not bother with 3M bio-info. Agreement takes time, and endless meetings; the big problems are non-technical; legacy is a big problem. The plumbing and tools are there, but… what is the object model? What do you want to save? How do you document provenance?

78 MyDB Added to SkyQuery
Let users add a personal DB, 1 GB for now. Use it as a workbook. Online and batch queries. It moves the analysis to the data. Users can cooperate (share MyDB). Still exploring this.
(Diagram: the SkyQuery portal with an image cutout service and MyDB, federating the 2MASS, INT, SDSS, and FIRST nodes.)

79 The Big Picture
(Diagram: experiments & instruments, simulations, literature, and other archives all feed facts into the archive; questions go in, answers come out.)
The big problems:
Data ingest. Managing a petabyte. A common schema. How to organize it? How to reorganize it? How to coexist with others?
Query and visualization tools. Support and training.
Performance: execute queries in a minute; batch query scheduling.

