Musings on Data Science and Students Experiencing Data Analytics New England SENCER Center for Innovation Prof. Randy Paffenroth Data Science Program Department of Mathematical Sciences Worcester Polytechnic Institute 2014
My Research "Internet Connectivity Access layer" by User:Ludovic.ferre - Internet_Connectivity_Overview2_Access.svg. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons -
This is a panel, so I want to be provocative! Provocative Adjective 1. tending or serving to provoke; inciting, stimulating, irritating, or vexing. So, I will be a little sad if I don’t end up irritating anyone
The first war: Terminology. Analyzing data has a long history! Many terms have been used to describe such endeavors: Statistics, Artificial Intelligence, Machine Learning, Data Analytics. Since I happen to work in a “Data Science” program perhaps I may be allowed the indulgence of using that terminology…
Whatever we call it, what makes things different now?
Experiments, observations, and numerical simulations in many areas of science and business are currently generating terabytes of data, and in some cases are on the verge of generating petabytes and beyond. Analyses of the information contained in these data sets have already led to major breakthroughs in fields ranging from genomics to astronomy and high-energy physics and to the development of new information-based industries. - Frontiers in Massive Data Analysis, National Research Council of the National Academies Given a large mass of data, we can by judicious selection construct perfectly plausible unassailable theories—all of which, some of which, or none of which may be right. - Paul Arnold Srere
The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it. - Hal Varian, Google's Chief Economist. My personal goal: Getting students to be able to think critically about data.
What is Big Data? There are many examples of "data", but what makes some of it “big”? The classic definition revolves around the three Vs: volume, velocity, and variety. Volume: There is just a lot of it being generated all the time. Things get interesting, and “big”, when you can’t fit it all on one computer anymore. Why? There are many ideas here, such as MapReduce and Hadoop, that all revolve around being able to process data that goes from terabytes, to petabytes, to exabytes. Velocity: Data is being generated very quickly. Can you even store it all? If not, then what do you get rid of and what do you keep? Variety: The data types you mention all take different shapes. What does it mean to store them so that you can play with or compare them?
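To make the MapReduce idea mentioned under Volume a little more concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop and not any particular framework's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the two sample text chunks are invented for illustration. The point is only that each chunk is processed independently, so in a real system the chunks could live on different computers.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: combine the values collected for each key into one count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    """Run map, shuffle, and reduce over a list of text chunks."""
    return reduce_phase(shuffle(map_phase(chunk) for chunk in chunks))


# Hypothetical data: each string stands in for a block stored on a different machine.
chunks = ["big data is big", "data science is not big data"]
counts = word_count(chunks)
print(counts)
```

Because `map_phase` never looks outside its own chunk, only the small (word, count) pairs have to move between machines, not the raw data.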
Is Big Data the same as Data Science? Are Big Data and Data Science the same thing? I wouldn't say so... Data Science can be done on small data sets. And not everything done using Big Data would necessarily be called Data Science. But there certainly is a substantial overlap! Big Data Data Science
But... what does getting "knowledge" from data really mean? Are we searching for causality? “Causation: The relation between mosquitoes and mosquito bites. Easily understood by both parties but never satisfactorily defined by philosophers and scientists.” - Michael Scriven, Evaluation Thesaurus, 1991 http://freshspectrum.com/causation/ Most strikingly, society will need to shed some of its obsession for causality in exchange for simple correlations: not knowing why but only what. - Big Data: A Revolution That Will Transform How We Live, Work, and Think, Viktor Mayer-Schönberger and Kenneth Cukier.
Can you even be certain? For real world problems, I claim that you will never be certain of any inferences from data. I mean, what happens to your carefully thought out marketing plan for some rocking slacks when the Martians land? What is unacceptable is when the data you actually have does not support the conclusion you report.
It can be easy to fool yourself! Human beings are really good at pattern detection... Perhaps a bit too good!
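One way to see how easy it is to fool yourself is to go looking for patterns in pure noise. The sketch below is an illustration under stated assumptions, not something from the talk: it generates 100 short series of random numbers with no relationship to each other, then reports the most strongly correlated pair. The series count, series length, and random seed are arbitrary choices.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


rng = random.Random(0)  # fixed seed so the run is repeatable

# 100 unrelated series, each with only 10 observations of pure Gaussian noise.
series = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(100)]

# Search all 4,950 pairs for the strongest absolute correlation.
best_r, best_pair = max(
    (abs(pearson(series[i], series[j])), (i, j))
    for i, j in itertools.combinations(range(len(series)), 2)
)

print(f"Strongest |correlation| among unrelated series: {best_r:.2f} for pair {best_pair}")
```

With this many pairs and so few observations per series, some pair will look strongly related even though every number was drawn independently. Searching first and explaining afterwards is exactly how a "pattern" gets invented.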
Skills for Data Science
Which is most important?
WPI Data Science Program: A Collaboration Business School Computer Science Department Mathematical Sciences Department
M.S. in Data Science Program: INTEGRATIVE DATA SCIENCE (3 CREDITS); GRADUATE QUALIFYING PROJECT OR MS THESIS (3 TO 9 CREDITS); MATHEMATICAL ANALYTICS (3 CREDITS); DATA ACCESS & MANAGEMENT (3 CREDITS); DATA ANALYTICS & MINING (3 CREDITS); BUSINESS INTELLIGENCE & CASE STUDIES (3 CREDITS); CONCENTRATION AND ELECTIVES (9 TO 15 CREDITS)
Data Science Core
INTEGRATIVE DATA SCIENCE: DS 501 INTRODUCTION TO DATA SCIENCE (NEW COURSE)
MATHEMATICAL ANALYTICS (SELECT ONE): MA 543/DS 502 STATISTICAL METHODS FOR DATA SCIENCE (NEW COURSE); MA 542 REGRESSION ANALYSIS; MA 554 APPLIED MULTIVARIATE ANALYSIS
DATA ACCESS AND MANAGEMENT (SELECT ONE): CS 542 DATABASE MANAGEMENT SYSTEMS; MIS 571 DATABASE APPLICATIONS DEVELOPMENT; CS 561 ADVANCED TOPICS IN DATABASE SYSTEMS; CS 585/DS 503 BIG DATA MANAGEMENT (NEW COURSE)
DATA ANALYTICS AND MINING (SELECT ONE): CS 548 KNOWLEDGE DISCOVERY AND DATA MINING; CS 539 MACHINE LEARNING; CS 586/DS 504 BIG DATA ANALYTICS (NEW COURSE)
BUSINESS INTELLIGENCE AND CASE STUDIES (SELECT ONE): MIS 584 BUSINESS INTELLIGENCE; MKT 568 DATA MINING BUSINESS APPLICATIONS
Data Science Certificate Program (18 credits): 15 CREDIT DATA SCIENCE CORE plus 3 CREDIT ELECTIVE
2014 Data Science Cohort
NATIONALITY: Cambodia, India, China, Pakistan, Taiwan, Iran, U.S.A., Brazil, Nepal, Afghanistan, Indonesia
EDUCATIONAL FOUNDATION: Quantitative / computational backgrounds. Programming with data structures and algorithms for computational skills. Quantitative skills: calculus, linear algebra and statistics.
EMPLOYMENT HISTORIES: Senior Research Analyst; Senior Business Analyst; Patient Financial Services; Data Base Analyst - Architect; Decision Scientist; Ministry of Finance; Lahey Health; Technical Program Management; U.S. Department of State
GENDER: 66.7% male, 33.3% female
10% FULBRIGHT SCHOLARS
2014 Data Science Cohort FALL 2014 Total Applicants 126 Total acceptances 33 Fulbright Scholars 3 Brazil Science Mobility Student 1 Countries Represented 9 Domestic Students 5 International Students 28 Many hold more than one earned Bachelor’s Degree US Universities include Columbia, UNH and WPI Dean Oates gave two Awards of $5K to outstanding students. These awards help attract top students.
Skills Acquired by Our Students. Fundamental/Technical: SQL / Data Modeling / Cleaning; Data Integration / Warehousing; Statistical Learning / Machine Learning; Distributed Computing; Big Data Management; Classification / Regression / Decision Trees; Business Intelligence; Distributed Mining Algorithms. Professional Skills: Business Use Cases / Entrepreneurship; Interdisciplinary Teams / Leadership; Story Telling / Visualization; Presentations / Reports. Tools: Oracle / MySQL / DB2 / SQL Server; R / SAS / SciKit; Weka / RapidMiner / MATLAB; IBM Cognos / SPSS Modeler; Hadoop / Mahout / Cassandra; Python / Java / Cloud Computing; Storm / Spark / InfoSphere Streams; Spotfire / Tableau.
Data Science Tools for Students: Free! Software: Python, IPython, NumPy, pandas, Matplotlib, Mayavi, scikit-learn: learn.org/stable/ Data: UCI Machine Learning Repository; Kaggle https://www.kaggle.com/; U.S. Government https://www.data.gov/
Data Science Tools for Students: Cheap Computing resources: AWS Google Rackspace