Big Data
Kirk Borne, George Mason University
LSST All Hands Meeting, August 13-17, 2012
Characteristics of Big Data – 1a
Big quantities of data are acquired everywhere. It is now a big issue in all aspects of life: science, business, healthcare, government, social networks, etc.
Characteristics of Big Data – 2
Big quantities of data are acquired everywhere. It is now a big issue in all aspects of life: science, business, healthcare, government, social networks, etc.
But what do we mean by “big”? Gigabytes? Terabytes? Petabytes? Exabytes?
The meaning of “big” is domain-specific and resource-dependent (data storage, I/O throughput, computation cycles, communication costs).
I say: we are all dealing with our own “tonnabytes”.
Characteristics of Big Data – 3
There are 4 dimensions to the Big Data challenge:
1. Volume (the “tonnabytes” data challenge)
2. Variety (complexity, curse of dimensionality)
3. Velocity (rate of data and information flowing at us)
4. Verification (verifying inference-based models from data)
Therefore, we need something better to cope with the tonnabytes …
This graphic says it all …
Clustering – examine the data and find the data clusters (clouds), without considering what the items are = Characterization!
Classification – for each new data item, try to place it within a known class (i.e., a known category or cluster) = Classify!
Outlier Detection – identify those data items that don’t fit into the known classes or clusters = Surprise!
Graphic provided by S. G. Djorgovski, Caltech
Data-Enabled Science: Scientific KDD (Knowledge Discovery from Data)
Characterize the new (clustering, unsupervised learning)
Assign the known (classification, supervised learning)
Discover the unknown (outlier detection, semi-supervised learning)
The two major benefits of BIG DATA:
1. best statistical analysis of “typical” events
2. automated search for “rare” events
Graphic from S. G. Djorgovski
Basic Astronomical Knowledge Problems – 1
The clustering problem:
– Finding clusters of objects within a data set
– What is the significance of the clusters (statistically and scientifically)?
– What is the optimal algorithm for finding friends-of-friends or nearest neighbors?
  N is > 10^10, so what is the most efficient way to sort?
  Number of dimensions ~ 1000; therefore, we have an enormous subspace search problem
– Are there pair-wise (2-point) or higher-order (N-way) correlations?
  N is > 10^10, so what is the most efficient way to do an N-point correlation?
– Algorithms that scale as N^2 log N won’t get us there
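The friends-of-friends question above can be illustrated with a minimal Python sketch. It is not the method used by LSST or by the speaker; it assumes Euclidean distance, a hypothetical `linking_length` parameter, and a hash grid so that each point is compared only with points in adjacent cells rather than with all N others. The grid visits 3^d cells per point, so it is only practical in a few dimensions, which is exactly the subspace search problem the slide raises for ~1000 dimensions.

```python
import math
from collections import defaultdict
from itertools import product


def friends_of_friends(points, linking_length):
    """Group points into clusters of mutually linked 'friends'.

    Two points are friends if they lie within linking_length of each other;
    a cluster is any chain of friends. Returns sorted lists of point indices.
    """
    if not points:
        return []
    n = len(points)
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    def cell_of(p):
        # Grid cells have side linking_length, so friends are always
        # in the same cell or in a directly adjacent one.
        return tuple(math.floor(c / linking_length) for c in p)

    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[cell_of(p)].append(idx)

    dim = len(points[0])
    offsets = list(product((-1, 0, 1), repeat=dim))
    for idx, p in enumerate(points):
        cell = cell_of(p)
        for off in offsets:
            neighbor = tuple(a + b for a, b in zip(cell, off))
            for j in grid.get(neighbor, ()):
                if j > idx and math.dist(p, points[j]) <= linking_length:
                    union(idx, j)

    clusters = defaultdict(list)
    for i in range(n):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


sample = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.0), (10.0, 10.0)]
groups = friends_of_friends(sample, linking_length=0.6)
```

For roughly uniform data the work is close to linear in N, since each point only inspects its local neighborhood; heavily clumped data can still degrade toward pairwise comparison within a crowded cell.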
Basic Astronomical Knowledge Problems – 2
Outlier detection: (unknown unknowns)
– Finding the objects and events that are outside the bounds of our expectations (outside known clusters)
– These may be real scientific discoveries or garbage
– Outlier detection is therefore useful for:
  Novelty Discovery – is my Nobel prize waiting?
  Anomaly Detection – is the detector system working?
  Data Quality Assurance – is the data pipeline working?
– How does one optimally find outliers in a 10^3-D parameter space, or in interesting subspaces (in lower dimensions)?
– How do we measure their “interestingness”?
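One simple way to put a number on “interestingness” is the distance from each object to its k-th nearest neighbor: objects far from any dense region get a large score. The Python sketch below is an illustration under assumptions, not the approach described in the talk. The function names, the choice of `k`, and the rule of flagging scores above a multiple of the median score are all hypothetical, and the brute-force search is O(N^2), so it only suits small samples.

```python
import math
import statistics


def knn_outlier_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbor."""
    if k >= len(points):
        raise ValueError("k must be smaller than the number of points")
    scores = []
    for i, p in enumerate(points):
        # Brute force: distances from p to every other point, sorted.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def flag_outliers(points, k=3, threshold=3.0):
    """Return indices whose score exceeds threshold times the median score."""
    scores = knn_outlier_scores(points, k)
    typical = statistics.median(scores)
    return [i for i, s in enumerate(scores) if s > threshold * typical]


# A compact 5x5 grid of 'known' objects plus one distant object.
cloud = [(x * 0.1, y * 0.1) for x in range(5) for y in range(5)]
cloud.append((10.0, 10.0))
surprises = flag_outliers(cloud)
```

The score ranks candidates; it cannot say whether a flagged object is a discovery, a detector fault, or a pipeline error, which is the distinction the slide draws.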
Basic Astronomical Knowledge Problems – 3
The dimension reduction problem:
– Finding correlations and “fundamental planes” of parameters
– Number of attributes can be hundreds or thousands: the Curse of High Dimensionality!
– Are there combinations (linear or non-linear functions) of observational parameters that correlate strongly with one another?
– Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties?
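The eigenvector question above is what principal component analysis answers for linear combinations. As a minimal sketch, assuming a small dense data set and using plain power iteration (a stand-in chosen for brevity, not a method named in the talk), the leading eigenvector of the covariance matrix gives the direction of greatest variance. The function names are illustrative, and power iteration can stall if the start vector happens to be orthogonal to the leading eigenvector.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def leading_eigenvector(matrix, iterations=200):
    """Approximate the dominant eigenpair of a symmetric matrix."""
    d = len(matrix)
    v = [1.0 + j for j in range(d)]  # arbitrary non-uniform start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return v, 0.0  # zero matrix: no variance in any direction
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    eigenvalue = sum(
        v[a] * sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)
    )
    return v, eigenvalue


# Two perfectly correlated parameters: y = 2x.
data = [(float(x), 2.0 * x) for x in range(10)]
cov = covariance_matrix(data)
direction, variance = leading_eigenvector(cov)
explained = variance / sum(cov[i][i] for i in range(len(cov)))
```

Here `explained` is the fraction of total variance carried by the first component; a value near 1 means the two parameters reduce to a single condensed one.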
Basic Astronomical Knowledge Problems – 4
The superposition / decomposition problem:
– Finding the defining features that separate different classes of objects that overlap in simple parameter spaces
– What if there are 10^10 objects that overlap in a 10^3-D parameter space?
– What is the optimal way to separate and extract the different unique classes of objects?
The LSST Big Data Manifesto
More data is not just more data … more is different!
Discover the unknown unknowns.
Massive Data-to-Knowledge challenge.
The LSST Big Data Challenges
1. Massive data stream: ~2 Terabytes of image data per hour that must be mined in real time (for 10 years).
2. Massive 20-Petabyte database: more than 20 billion objects need to be classified, and most will be monitored for important variations in real time.
3. Massive event stream: knowledge extraction in real time for ~2,000,000 events each night.
Challenge #1 includes both the static data mining aspects of #2 and the dynamic data mining aspects of #3.
Look at these in more detail...
LSST big data challenges #1, 2
Each night for 10 years, LSST will obtain roughly as much data as the entire Sloan Digital Sky Survey obtained.
Our grad students will be asked to mine these data (~20 TB each night ≈ 40,000 CDs filled with data):
– A truckload of CDs each and every day for 10 yrs
– Cumulatively, a football stadium full of 100 million CDs after 10 yrs
The challenge is to find the new, the novel, the interesting, and the surprises (the unknown unknowns) within all of these data.
Yes, more is most definitely different!
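The figures quoted on the challenge slides can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the talk (2 TB per hour, 20 TB and 40,000 CDs per night, 100 million CDs and 10 years in total, 2,000,000 events per night). Decimal units (1 TB = 10^6 MB) are an assumption, and the derived quantities are implications of those figures rather than values given by the speaker.

```python
def lsst_data_budget():
    """Derive implied quantities from the figures quoted on the slides."""
    image_rate_tb_per_hour = 2.0   # challenge 1: ~2 TB of images per hour
    nightly_volume_tb = 20.0       # ~20 TB each night
    cds_per_night = 40_000         # ~40,000 CDs per night
    total_cds = 100_000_000        # 100 million CDs after 10 years
    events_per_night = 2_000_000   # challenge 3: ~2,000,000 events per night
    survey_years = 10

    # 20 TB per night at 2 TB per hour implies the length of an observing night.
    hours_per_night = nightly_volume_tb / image_rate_tb_per_hour
    # Implied capacity per CD, assuming decimal units (1 TB = 10^6 MB).
    mb_per_cd = nightly_volume_tb * 1_000_000 / cds_per_night
    # 100 million CDs over 10 years implies this many observing nights per year.
    nights_per_year = total_cds / (cds_per_night * survey_years)
    # Average event rate while observing.
    events_per_second = events_per_night / (hours_per_night * 3600)

    return {
        "hours_per_night": hours_per_night,
        "mb_per_cd": mb_per_cd,
        "nights_per_year": nights_per_year,
        "events_per_second": events_per_second,
    }


budget = lsst_data_budget()
```

The numbers hang together: about 10 observing hours per night, 500 MB per CD, about 250 observing nights per year, and on average roughly 56 new events every second of observing.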
LSST big data challenge #3
Approximately 2,000,000 times each night for 10 years, LSST will detect a new sky event, and the astronomical community will be challenged with classifying these events.
What will we do with all of these events?
[Figure: event light curve, flux vs. time]
Characterize first! (Unsupervised Learning)
Classify later.
Characterization includes …
Feature Detection and Extraction:
– Identifying and describing features in the data, via machine algorithms or human inspection (including the potentially huge contributions from Citizen Science)
– Extracting feature descriptors from the data
– Curating these features for search, re-use, & discovery
– Finding other parameters and features from other archives, other databases, other information sources, and using those to help characterize (ultimately classify) each new event
… hence, coping with a highly multivariate parameter space
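As an illustration of extracting feature descriptors, the sketch below reduces one event's light curve (paired time and flux samples) to a handful of numbers that can be stored, searched, and fed to clustering or classification later. The particular descriptors and their names are assumptions chosen for the example; the talk does not specify a feature set.

```python
import statistics


def characterize_light_curve(times, fluxes):
    """Summarize a (time, flux) series as a small dictionary of descriptors."""
    if len(times) != len(fluxes) or len(times) < 2:
        raise ValueError("need at least two paired (time, flux) samples")
    samples = sorted(zip(times, fluxes))  # order by time
    ts = [t for t, _ in samples]
    fs = [f for _, f in samples]

    baseline = statistics.median(fs)      # robust estimate of quiescent flux
    peak_index = max(range(len(fs)), key=fs.__getitem__)
    # Rate of change between consecutive samples (skip repeated timestamps).
    slopes = [
        (f2 - f1) / (t2 - t1)
        for (t1, f1), (t2, f2) in zip(samples, samples[1:])
        if t2 > t1
    ]
    return {
        "baseline": baseline,
        "amplitude": max(fs) - min(fs),
        "scatter": statistics.stdev(fs),
        "peak_time": ts[peak_index],
        "peak_excess": fs[peak_index] - baseline,
        "max_rise_rate": max(slopes) if slopes else 0.0,
    }


features = characterize_light_curve(
    times=[0, 1, 2, 3, 4],
    fluxes=[10, 10, 30, 12, 10],
)
```

Each event then becomes one row in a multivariate table, and joining in parameters from other archives simply adds columns to that row.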
Data-driven Discovery (Unsupervised Learning)
i.e., What can I do with characterizations?
1. Class Discovery – Clustering
2. Principal Component Analysis – Dimension Reduction
3. Outlier Detection – Surprise / Anomaly / Deviation / Novelty Discovery
4. Link Analysis – Association Analysis – Network Analysis
5. and more.
Addressing the D2K (Data-to-Knowledge) Challenge
Complete end-to-end application of Data Science:
Data management, metadata management, data search, information extraction, data mining, knowledge discovery
Applies to any discipline (not just science disciplines)
Skilled workforce needed to take data to knowledge
Informatics in Education and An Education in Informatics
Data Science Education: Two Perspectives
Informatics in Education – working with data in all learning settings
Informatics (Data Science) enables transparent reuse and analysis of data in inquiry-based classroom learning.
Learning is enhanced when students work with real data and information (especially online data) that are related to the topic (any topic) being studied.
http://serc.carleton.edu/usingdata/ (“Using Data in the Classroom”)
Example: CSI The Cosmos
An Education in Informatics – students are specifically trained:
… to access large distributed data repositories
… to conduct meaningful inquiries into the data
… to mine, visualize, and analyze the data
… to make objective data-driven inferences, discoveries, and decisions
Numerous Data Science programs now exist at several universities (GMU, Caltech, RPI, Michigan, Cornell, U. Illinois, Indiana U., …)
http://cds.gmu.edu/ (Computational & Data Sciences @ GMU)
Responses to Big Data – 1
2.5 approaches to dealing with Big Data:
– Data Science = Informatics & Statistics (and data-intensive computing)
– Citizen Science = Human Computation
– Or else … (where possible) combine these two: use the very effective human cognitive skills of pattern recognition and anomaly detection to generate training sets of relevant features (characterizations) to improve the machine algorithms.
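The combined approach can be sketched in two steps: aggregate volunteer votes into consensus labels, then train a machine classifier on them. The example below is a hedged illustration, not a pipeline described in the talk. The agreement threshold, the function names, the galaxy-type labels, and the choice of a nearest-centroid classifier are all assumptions made to keep the sketch short.

```python
import math
from collections import Counter, defaultdict


def consensus_labels(votes, min_agreement=0.6):
    """Keep items whose most common volunteer label reaches min_agreement."""
    labels = {}
    for item, item_votes in votes.items():
        if not item_votes:
            continue
        label, count = Counter(item_votes).most_common(1)[0]
        if count / len(item_votes) >= min_agreement:
            labels[item] = label
    return labels


def fit_centroids(features, labels):
    """Average the feature vectors of each labeled class."""
    groups = defaultdict(list)
    for item, label in labels.items():
        groups[label].append(features[item])
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def classify(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


# Hypothetical volunteer votes and feature vectors for three objects.
votes = {
    "a": ["spiral", "spiral", "elliptical"],
    "b": ["elliptical", "elliptical", "elliptical"],
    "c": ["spiral", "elliptical"],  # no consensus, so excluded from training
}
features = {"a": (1.0, 0.0), "b": (0.0, 1.0), "c": (0.5, 0.5)}

training = consensus_labels(votes)
centroids = fit_centroids(features, training)
prediction = classify(centroids, (0.9, 0.2))
```

Dropping low-agreement items trades training-set size for label quality; the objects volunteers disagree on may also be worth a second look as candidate outliers.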
Responses to Big Data – 2
LSST Informatics & Statistics Science Collaboration:
– breakout @ 11am in TB-A
New Journal: Astronomy & Computing
– Poster and flyers available in hallway
– http://www.journals.elsevier.com/astronomy-and-computing/
New AAS Working Group on Astroinformatics & Astrostatistics
– Members: “Bill” Zeljko Ivezic (chair), Kirk Borne, George Djorgovski, Eric Feigelson, Eric Ford, Alyssa Goodman, Aneta Siemiginowska, Alex Szalay, Rick White.
– Visit https://www.facebook.com/AstroInformatics
LSST Informatics & Statistics Breakout Session
Brief “lightning” talks by 7 team members:
– Jogesh Babu: Statistical Resources
– Kirk Borne: Outlier Detection for Surprise Discovery in Big Data
– Matthew Graham: Characterizing and Classifying CRTS
– Joseph Richards: Time-Domain Discovery and Classification
– Sam Schmidt: Upcoming Challenges for Photometric Redshifts
– Lior Shamir: Automatic Analysis of Galaxy Morphology
– John Wallin: Citizen Science and Machine Learning
Open Discussion:
– LSST Publication Reviews: informatics & statistics participation
– LSST Science Book chapter
– Research Roadmap document
11:00am-12:30pm today – Tortolita Ballroom A