Published by Osborn Lewis. Modified over 8 years ago.
1
Introduction to Data Science
Section 2
Data Matters 2015
Sponsored by the Odum Institute, RENCI, and NCDS
Thomas M. Carsey
carsey@unc.edu
2
The Data Lifecycle
3
Data Science is More than Analysis
Data analysis gets most of the attention in data science. As a result, many people struggle to distinguish data science from applied statistics.
Analysis is obviously important, but statistical analysis skills are only useful if the data can be collected and put into a usable form.
Data science is much broader than just data analysis.
4
The Data Lifecycle
Data science considers data at every stage of what is called the data lifecycle.
This lifecycle generally refers to everything from collecting data to analyzing it to sharing it so others can re-analyze it.
– In fact, it includes the planning process that should be in place before any other work begins.
New visions of this process focus in particular on integrating every action that creates, analyzes, or otherwise touches data.
These same new visions treat the process as dynamic: data archives are not just digital shoe boxes under the bed.
There are many representations of this lifecycle.
5 to 8
[Slides 5 to 8: images only; no text recovered.]
9
Lessons from the Lifecycle
Data science is more than just data analysis.
Effective data science requires:
– Planning
– Vision
– Storage
– Interoperability of systems
– A team approach
– Adaptability and scalability
10
What is Missing?
Most definitions of data science underplay or leave out discussions of:
– Substantive theory
– Metadata
– Privacy and ethics
– Greater consideration for missing data, representativeness, and uncertainty
– More thinking about the proper null hypothesis
– Leadership on leveraging data science for the public good
11
Substantive Theory
12
The Data Generating Process (DGP)
Most of the time we don’t care about the data itself.
Most of the time we are trying to learn something about an underlying process that produces the data: a DGP.
Technically trained folks might be good at uncovering patterns in data, but you need substantive expertise to:
– Know where to look in the first place
– Know what to look for
– Know what your findings might actually mean
13
What is the DGP?
Good analysis starts with a question you want to answer.
– Blind data mining can only get you so far, and really, there is no such thing as completely blind mining.
Answering that question requires laying out expectations of what you will find and explanations for those expectations.
Those expectations and explanations rest on assumptions.
If your data collection, data management, and data analysis are not compatible with those assumptions, you risk producing meaningless or misleading answers.
14
The DGP (cont.)
Think of the world you are interested in as governed by dynamic processes.
Those processes produce observable bits of information about themselves: data.
We can use data science to:
– Collect, catalog, and organize those bits of information
– Discover patterns in data and fit models to that data
– Make predictions outside of our data
– Inform explanations of both those patterns and those predictions
Real discovery is NOT about modeling patterns in observable data. It is about understanding the processes that produced that data.
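As a rough illustration of the distinction between data and the process behind it, the sketch below simulates a made-up DGP and then tries to recover its parameters from the observed data alone. The linear form, the parameter values, and the function names `simulate_dgp` and `fit_line` are assumptions chosen for the example, not anything from the slides.

```python
import random


def simulate_dgp(n, intercept, slope, noise_sd, seed):
    """Draw n (x, y) pairs from the assumed process y = intercept + slope * x + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical "true" process; in real work these values are unknown.
TRUE_INTERCEPT, TRUE_SLOPE = 2.0, 0.5

xs, ys = simulate_dgp(n=5000, intercept=TRUE_INTERCEPT, slope=TRUE_SLOPE,
                      noise_sd=1.0, seed=42)
est_intercept, est_slope = fit_line(xs, ys)

print(f"true slope = {TRUE_SLOPE}, estimated slope = {est_slope:.3f}")
print(f"true intercept = {TRUE_INTERCEPT}, estimated intercept = {est_intercept:.3f}")
```

The estimates land close to the true values only because the fitted model matches the form of the process that generated the data. Choosing that form is where substantive expertise comes in.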
15
Theories and DGPs
Theories provide explanations for the processes we care about.
They answer the question: why does something work the way it does?
Theories make predictions about what we should see in data.
We use data to test the predictions, but we never completely test a theory.
16
Why do we need theory?
Can’t we just find “truth” in the data if we have enough of it? Especially if we have all of it?
No!
– More data does not mean more representative data.
– Every method of analysis makes some assumptions, so we are better off if we make them explicit.
– Patterns without understanding are at best uninformative and at worst deeply misleading.
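The first point can be made concrete with a small simulation. The sketch below uses an entirely synthetic population and an assumed selection rule (people with above-average values are much more likely to end up in the large sample). It then compares a small random sample with a large self-selected one.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic population: 200,000 values of some attribute (units are arbitrary).
population = [rng.gauss(3.0, 1.0) for _ in range(200_000)]
true_mean = statistics.fmean(population)

# Small random sample: every unit is equally likely to be included.
small_random = rng.sample(population, 500)

# Large self-selected sample (assumed rule): units above 3.0 are included
# with probability 0.6, all others with probability 0.1.
big_biased = [x for x in population
              if rng.random() < (0.6 if x > 3.0 else 0.1)]

err_small = abs(statistics.fmean(small_random) - true_mean)
err_big = abs(statistics.fmean(big_biased) - true_mean)

print(f"population mean:        {true_mean:.3f}")
print(f"random sample (n={len(small_random)}): error {err_small:.3f}")
print(f"biased sample (n={len(big_biased)}): error {err_big:.3f}")
```

Under these assumptions the self-selected sample is more than a hundred times larger and still misses the population mean by far more than the small random sample does. Adding more data of the same kind does not remove the bias.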
17
Robert Matthews (Aston University), 2000. “Storks Deliver Babies (P=0.008).” Teaching Statistics. Volume 22, Number 2, Summer 2000.
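A pattern of this kind is easy to reproduce. The sketch below does not use the article's data: it generates synthetic values in which a hypothetical third variable (labeled `area` here) drives both stork counts and births. It then shows that the strong raw correlation largely disappears once that variable is accounted for. All numbers and names are illustrative assumptions.

```python
import random


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def residuals(y, x):
    """Residuals of y after a least-squares line in x has been removed."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


rng = random.Random(1)

# Synthetic data: the hypothetical confounder drives both variables, and
# storks have no direct effect on births by construction.
area = [rng.uniform(1, 100) for _ in range(500)]
storks = [2.0 * a + rng.gauss(0, 20) for a in area]
births = [5.0 * a + rng.gauss(0, 50) for a in area]

r_raw = pearson_r(storks, births)
r_partial = pearson_r(residuals(storks, area), residuals(births, area))

print(f"raw correlation:                {r_raw:.2f}")
print(f"correlation after removing area: {r_partial:.2f}")
```

Nothing in the data says which variable to control for. That choice comes from an understanding of the process, which is the role theory plays.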
18
New Behaviors Require New Theories
The Target example illustrated how existing theories about habit formation informed their data mining efforts.
However, whole new behaviors exist that are creating a lot of the data that data scientists want to analyze:
– Online shopping
– Cell phone usage
– Crowdsourced recommendation systems
– Facebook, Google searching, etc.
– Online mobilization of social protests
We need new theories for these new behaviors.
19
Metadata
20
What is Metadata?
Metadata is data about data.
It is frequently ignored or misunderstood.
Metadata is required to give data meaning. It includes:
– Variable names and labels, value labels, information on who collected the data, when, by what methods, in what locations, for what purpose, etc.
Metadata is essential to use data effectively, to reuse data, to share data, and to integrate data.
Data without metadata is worthless.
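A minimal sketch of that last point, using an invented two-variable survey: the raw rows are unreadable until a codebook supplies variable labels and value labels. The variable names, labels, and the helper `apply_codebook` are hypothetical.

```python
# Raw data as it might arrive without documentation: opaque names and codes.
raw_rows = [
    {"v1": 1, "v2": 34},
    {"v1": 2, "v2": 51},
]

# Hypothetical codebook: a variable label for each column, plus value labels
# for coded variables (None means the value is used as recorded).
codebook = {
    "v1": {"label": "Employment status", "values": {1: "Employed", 2: "Unemployed"}},
    "v2": {"label": "Age in years", "values": None},
}


def apply_codebook(row, codebook):
    """Return a copy of row with variable labels as keys and decoded values."""
    labeled = {}
    for name, value in row.items():
        entry = codebook[name]
        value_labels = entry["values"]
        labeled[entry["label"]] = value_labels[value] if value_labels else value
    return labeled


labeled_rows = [apply_codebook(row, codebook) for row in raw_rows]
for row in labeled_rows:
    print(row)
```

Without the codebook there is no way to tell from `{"v1": 1, "v2": 34}` whether `v1` is employment status, a region code, or something else.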
21
The Value of Metadata
Data by itself is just a bunch of 0s and 1s.
Metadata:
– Provides meaning
– Allows for cataloging
– Facilitates search and discovery
– Enables linking data sets
22
Types of Metadata
NISO defines three types:
– Structural: describes how the components of the data are organized (columns, rows, chapters, etc.)
– Descriptive: provides titles, authors, keywords, subjects, etc. that facilitate attribution and search/discovery.
– Administrative: technical information on how the file was created, software used, formats for storage, etc. Includes rights and preservation metadata.
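One way to picture the three types is as separate sections of a single record attached to a data set. The sketch below is only an illustration: the class name `DatasetMetadata` and every field value are invented for the example and do not follow any particular standard's schema.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Illustrative container with one section per metadata type."""
    structural: dict = field(default_factory=dict)      # how the data are organized
    descriptive: dict = field(default_factory=dict)     # what supports search and attribution
    administrative: dict = field(default_factory=dict)  # how the file was made and may be used


# Hypothetical record for a small survey file.
survey_metadata = DatasetMetadata(
    structural={"rows": "one per respondent", "columns": ["v1", "v2"]},
    descriptive={"title": "Example survey", "keywords": ["employment", "age"]},
    administrative={"file_format": "CSV", "rights": "example license"},
)

for section, contents in vars(survey_metadata).items():
    print(section, "->", contents)
```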
23
Metadata Standards
There are emerging standards for metadata:
– The American National Standards Institute
– The International Organization for Standardization
Dublin Core: 15 classic metadata terms.
– Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
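As a small sketch of how the 15 Dublin Core elements might be used in practice, the code below stores them as a tuple (with Date as the seventh element) and reports which elements a record leaves empty. The record describes this slide deck using details from the title slide. The `type` and `language` values and the helper `missing_elements` are assumptions made for the example.

```python
# The 15 elements of the Dublin Core element set, as lowercase keys.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def missing_elements(record):
    """List the Dublin Core elements that are absent or empty in record."""
    return [name for name in DUBLIN_CORE_ELEMENTS if not record.get(name)]


# Partial record; "type" and "language" are illustrative guesses.
record = {
    "title": "Introduction to Data Science, Section 2",
    "creator": "Thomas M. Carsey",
    "date": "2015",
    "type": "Text",
    "language": "en",
}

print("elements filled: ", [k for k in DUBLIN_CORE_ELEMENTS if record.get(k)])
print("elements missing:", missing_elements(record))
```

A check like this does not make a record good, but it makes gaps visible before the data is shared or archived.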
24
Privacy and Ethics
We will do this at the end.