Introduction to Data Science Section 2 Data Matters 2015 Sponsored by the Odum Institute, RENCI, and NCDS Thomas M. Carsey 1.


1 Introduction to Data Science Section 2 Data Matters 2015 Sponsored by the Odum Institute, RENCI, and NCDS Thomas M. Carsey carsey@unc.edu 1

2 The Data Lifecycle 2

3 Data Science is More than Analysis Data analysis gets most of the attention in data science. In that sense, many people struggle to distinguish data science from applied statistics. Analysis is obviously important, but statistical analysis skills are only useful if the data can be collected and put in a usable form. Data Science is much broader than just data analysis. 3

4 The Data Lifecycle Data science considers data at every stage of what is called the data lifecycle. This lifecycle generally refers to everything from collecting data to analyzing it to sharing it so others can re-analyze it. – In fact, it includes the planning process that should be in place before any other work begins. New visions of this process in particular focus on integrating every action that creates, analyzes, or otherwise touches data. These same new visions treat the process as dynamic – data archives are not just digital shoe boxes under the bed. There are many representations of this lifecycle. 4


9 Lessons from the Lifecycle Data Science is more than just data analysis. Effective data science requires – Planning – Vision – Storage – Interoperability of systems – A team approach – Adaptability and Scalability 9

10 What is Missing? Most definitions of data science underplay or leave out discussions of: – Substantive theory – Metadata – Privacy and ethics – Greater consideration of missing data, representativeness, and uncertainty – More thinking about the proper null hypothesis – Leadership on leveraging data science for the public good 10

11 Substantive Theory 11

12 The Data Generating Process (DGP) Most of the time we don’t care about the data itself. Most of the time we are trying to learn something about an underlying process that produces the data – a DGP. Technically trained folks might be good at uncovering patterns in data, but you need substantive expertise to: – Know where to look in the first place – Know what to look for – Know what your findings might actually mean 12

13 What is the DGP? Good analysis starts with a question you want to answer. – Blind data mining can only get you so far, and really, there is no such thing as completely blind mining Answering that question requires laying out expectations of what you will find and explanations for those expectations. Those expectations and explanations rest on assumptions. If your data collection, data management, and data analysis are not compatible with those assumptions, you risk producing meaningless or misleading answers. 13

14 The DGP (cont.) Think of the world you are interested in as governed by dynamic processes. Those processes produce observable bits of information about themselves – data. We can use data science to: – Collect, catalog, and organize those bits of information – Discover patterns in data and fit models to that data – Make predictions outside of our data – Inform explanations of both those patterns and those predictions. Real discovery is NOT about modeling patterns in observable data. It is about understanding the processes that produced that data. 14
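The relationship between a DGP and the data it produces can be sketched with a small simulation. This is only an illustration under stated assumptions: the linear process, its parameter values, and the function names `simulate_dgp` and `fit_line` are hypothetical and do not come from the slides.

```python
import random

def simulate_dgp(n, intercept=1.0, slope=2.0, noise_sd=1.0, seed=0):
    """Hypothetical DGP: y = intercept + slope * x + random noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# We only ever observe xs and ys; the parameters of the process are hidden.
xs, ys = simulate_dgp(5000)
est_intercept, est_slope = fit_line(xs, ys)
print(f"estimated intercept={est_intercept:.2f}, slope={est_slope:.2f}")
```

The fitted line recovers the process here only because the model matches how the data were generated. With real data that match is an assumption, which is where substantive theory comes in.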

15 Theories and DGPs Theories provide explanations for the processes we care about. They answer the question: why does something work the way it does? Theories make predictions about what we should see in data. We use data to test the predictions, but we never completely test a theory. 15

16 Why do we need theory? Can’t we just find “truth” in the data if we have enough of it? Especially if we have all of it? No! – More data does not mean more representative data. – Every method of analysis makes some assumptions, so we are better off if we make them explicit. – Patterns without understanding are at best uninformative and at worst deeply misleading. 16
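The point that more data is not the same as more representative data can be illustrated with a rough simulation. Everything in it is an assumption made for illustration: the two-group population, the group means, and the sample sizes are hypothetical.

```python
import random

rng = random.Random(42)

# Hypothetical population: two equally sized groups with different averages.
group_a = [rng.gauss(40, 5) for _ in range(50_000)]
group_b = [rng.gauss(60, 5) for _ in range(50_000)]
population = group_a + group_b
true_mean = sum(population) / len(population)

def mean(values):
    return sum(values) / len(values)

# A large sample drawn only from group B (for example, only people who use
# one particular service), versus a small random sample of everyone.
big_biased_sample = rng.sample(group_b, 20_000)
small_random_sample = rng.sample(population, 500)

biased_error = abs(mean(big_biased_sample) - true_mean)
random_error = abs(mean(small_random_sample) - true_mean)
print(f"error of large biased sample: {biased_error:.2f}")
print(f"error of small random sample: {random_error:.2f}")
```

In this setup the sample of 500 lands much closer to the population mean than the sample of 20,000, because sample size cannot correct for who was left out.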

17 Robert Matthews (Aston University), 2000. “Storks Deliver Babies (p = 0.008).” Teaching Statistics, Volume 22, Number 2, Summer 2000. 17
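A pattern like the stork example can arise when a third variable drives both measured quantities. The sketch below uses simulated numbers, not the data from the article; the confounder (called `area` here), the coefficients, and the function names are all hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

rng = random.Random(7)
# Assumed confounder: bigger regions have more storks AND more births,
# but storks have no direct effect on births in this simulation.
area = [rng.uniform(1, 100) for _ in range(2000)]
storks = [3 * a + rng.gauss(0, 20) for a in area]
births = [5 * a + rng.gauss(0, 30) for a in area]

raw_r = pearson(storks, births)
adjusted_r = partial_corr(storks, births, area)
print(f"raw correlation: {raw_r:.2f}, controlling for area: {adjusted_r:.2f}")
```

The raw correlation is strong, yet it largely disappears once the confounder is held fixed. Knowing which confounder to check is a question of theory, not something the correlation reveals.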

18 New Behaviors Require New Theories The Target example illustrated how existing theories about habit formation informed the company's data mining efforts. However, whole new behaviors exist that are creating a lot of the data that data scientists want to analyze: – Online shopping – Cell phone usage – Crowdsourced recommendation systems – Facebook, Google searching, etc. – Online mobilization of social protests We need new theories for these new behaviors. 18

19 Metadata 19

20 What is Metadata? Metadata is data about data. It is frequently ignored or misunderstood. Metadata is required to give data meaning. It includes: – Variable names and labels, value labels, information on who collected the data, when, by what methods, in what locations, for what purpose, etc. Metadata is essential to use data effectively, to reuse data, to share data, and to integrate data. Data without metadata is worthless. 20
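A minimal sketch of how variable and value labels give raw codes their meaning. The variables `v1` and `v2`, their codes, and the `apply_codebook` helper are hypothetical examples, not drawn from any real dataset.

```python
# Raw survey rows as they might arrive: bare codes with no meaning attached.
raw_rows = [
    {"v1": 1, "v2": 3},
    {"v1": 2, "v2": 9},
]

# Hypothetical codebook (metadata): a label for each variable and each value.
codebook = {
    "v1": {
        "label": "Employment status",
        "values": {1: "Employed", 2: "Unemployed"},
    },
    "v2": {
        "label": "Self-rated health",
        "values": {1: "Poor", 2: "Fair", 3: "Good", 9: "Missing"},
    },
}

def apply_codebook(row, codebook):
    """Translate coded values into labeled, human-readable values."""
    labeled = {}
    for var, code in row.items():
        entry = codebook[var]
        # Codes absent from the codebook are flagged rather than guessed.
        labeled[entry["label"]] = entry["values"].get(code, f"Unlabeled code {code}")
    return labeled

labeled_rows = [apply_codebook(row, codebook) for row in raw_rows]
for row in labeled_rows:
    print(row)
```

Without the codebook, a reader cannot tell that the 9 in `v2` marks a missing answer rather than a very high health rating.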

21 The Value of Metadata Data by itself is just a bunch of 0’s and 1’s. Metadata – Provides meaning – Allows for cataloging – Facilitates search and discovery – Enables linking data sets 21

22 Types of Metadata NISO defines three types: – Structural: describes how the components of the data are organized (columns, rows, chapters, etc.) – Descriptive: provides titles, authors, keywords, subjects, etc. that facilitate attribution and search/discovery. – Administrative: technical information on how the file was created, software used, formats for storage, etc. Includes rights and preservation metadata. 22

23 Metadata Standards There are emerging standards for metadata – The American National Standards Institute – The International Organization for Standardization Dublin Core – 15 classic metadata terms. – Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights 23
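As a rough illustration, a Dublin Core style record can be held as a simple mapping and checked against the 15 element names. The lowercase keys, the `check_record` helper, and the `language` and `keywords` entries are assumptions made for this sketch. The title, creator, and date come from this presentation.

```python
# The 15 elements of the Dublin Core Metadata Element Set (lowercase by
# convention in this sketch).
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

def check_record(record):
    """Report keys that are not Dublin Core elements and elements left empty."""
    unknown = sorted(key for key in record if key not in DUBLIN_CORE_ELEMENTS)
    missing = [name for name in DUBLIN_CORE_ELEMENTS if not record.get(name)]
    return {"unknown": unknown, "missing": missing}

record = {
    "title": "Introduction to Data Science, Section 2",
    "creator": "Thomas M. Carsey",
    "date": "2015",
    "language": "en",              # assumption: not stated in the slides
    "keywords": "data lifecycle",  # not a Dublin Core element; 'subject' is
}

report = check_record(record)
print("unknown keys:", report["unknown"])
print("empty elements:", report["missing"])
```

A shared vocabulary like this is what lets catalogs search and link records that were created by different people.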

24 Privacy and Ethics We will do this at the end 24

