1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 1a, January 27, 2015, Lally 102 Introduction to Data Analytics, Current Challenges. Course Outline.



Admin info (keep/ print this slide)
Class: ITWS-4963/ITWS-6965
Hours: 12:00pm-1:50pm Tuesday/ Friday
Location: Lally 102
Instructor: Peter Fox
Instructor contact: (do not leave a
Contact hours: Monday** 3:00-4:00pm (or by appt)
Contact location: Winslow 2120 (sometimes Lally 207A announced by )
TA: Jiaju Shen
Web site:
–Schedule, lectures, syllabus, reading, assignments, etc. 2

Contents Intro – about this course Learning objectives Outline of the course Definitions and why Analytics is more than Analysis What skills are needed What is expected 3

Truth in Advertising 4

Assessment and Assignments Via written assignments with specific percentage of grade allocation provided with each assignment Via individual oral presentations with specific percentage of grade allocation provided Via presentations – depending on class size Via participation in class (not to exceed 10% of total, start with 10% and lose % by not participating) Late submission policy: first time with valid reason – no penalty, otherwise 20% of score deducted each late day. Talk to me EARLY if you are having schedule problems completing assignments 5

Assessment and Assignments Reading assignments –Are given when needed to support key topics or to complete assignments –Will not be discussed in class unless there are questions You will mostly perform individual work (i.e. group work is TBD) 6

Project options (examples) Social networks Financial Social-economic, marketing Network/ security data Linked data 7

Objectives Introduce students to relevant methods to recognize and apply quantitative algorithms, techniques and interpretation To develop students' strategic thinking skills, combined with a solid technical foundation in data and model-driven decision-making. Develop ability to apply critical and analytical methods to formulate and solve science, engineering, medical, and business problems In groups, students will identify qualitative problems and apply content analytics Students will examine real-world examples to place data-mining techniques in context, to develop data-analytic thinking, and to illustrate that proper application is as much an art as it is a science. By the end of the course, students can effectively communicate analytic findings to non-specialists [At the advanced level, evaluation focuses on decision making under uncertainty, learning how to build optimization models that incorporate random parameters: static stochastic optimization, two-stage optimization with recourse, chance-constrained optimization, and sequential decision making. ] 8

Learning Objectives Through class lectures, practical sessions, written and oral presentation assignments and projects, students should:
–Demonstrate knowledge of relevant analytic methods, recognize and apply quantitative algorithms and techniques, and interpret results
–Demonstrate strategic thinking skills, combined with a solid technical foundation in data and model-driven decision-making
–Develop the ability to apply critical and analytical methods to formulate and solve science, engineering, medical, and business problems
–Examine real-world examples to place data-mining techniques in context, to develop data-analytic thinking, and to illustrate that proper application is as much an art as it is a science
–Effectively communicate analytic findings to non-specialists
–[graduate level] Develop and demonstrate a working knowledge of decision making under uncertainty, and be able to build optimization models that incorporate random parameters: static stochastic optimization, two-stage optimization with recourse, chance-constrained optimization, and sequential decision making
–***TBD*** In groups, identify qualitative problems, apply content analytics, and present interpreted results 9

Undergraduates/ Grads Graduate students are assessed at: –Higher level of demonstration –Additional questions or tasks in assignments Undergraduates are welcome to complete these higher requirements for extra grade Extra points are given for outstanding/ above-and-beyond work** 10

Academic Integrity Student-teacher relationships are built on trust. For example, students must trust that teachers have made appropriate decisions about the structure and content of the courses they teach, and teachers must trust that the assignments that students turn in are their own. Acts that violate this trust undermine the educational process. The Rensselaer Handbook of Student Rights and Responsibilities defines various forms of Academic Dishonesty and you should make yourself familiar with these. In this class, all assignments that are turned in for a grade must represent the student’s own work. In cases where help was received, or teamwork was allowed, a notation on the assignment should indicate your collaboration. Submission of any assignment that is in violation of this policy will result in a penalty. If found in violation of the academic dishonesty policy, students may be subject to two types of penalties. The instructor administers an academic (grade) penalty of full loss of grade for the work in violation, and the student may also enter the Institute judicial process and be subject to such additional sanctions as: warning, probation, suspension, expulsion, and alternative actions as defined in the current Handbook of Student Rights and Responsibilities. If you have any questions concerning this policy before submitting an assignment, please ask for clarification. 11

Current Syllabus/Schedule Web site:

Questions so far? 13

Introductions Who you are, background? Why you are here? What you expect to learn? 14

The nature of the challenge 15


Perspective People make decisions every day and increasingly they are using computers to assist them. Knowledge is power: –Or accurate/ reliable knowledge is actionable Gaining knowledge, and knowing how to use it, from information and data sources (often multiple ones) A model = formula/ equation that could depend on parameters and variables 18
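To make "a model = formula/ equation that could depend on parameters and variables" concrete, here is a minimal sketch in Python (the course itself uses R). It assumes the simplest possible case, a straight line y = slope * x + intercept, where x is the variable and slope and intercept are the parameters estimated from data. The function names and the data points are hypothetical, chosen only for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Estimate the parameters (slope, intercept) of y = slope * x + intercept
    by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(x, slope, intercept):
    """Evaluate the model for a new value of the variable x."""
    return slope * x + intercept


# Hypothetical observations that happen to lie exactly on a line
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)
forecast = predict(4.0, slope, intercept)
```

The split between fit_line and predict mirrors the slide's distinction: the parameters come from the data, and the variable is what you supply when you use the model.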

So what are we talking about? 19

Definitions (at least for this course) Data - are encodings that represent the qualitative or quantitative attributes of a variable or set of variables. Data (plural of "datum", which is seldom used) - are typically the results of measurements, computations, or observations and can be the basis of graphs, images, or observations of a set of variables. Data - are often viewed as the lowest level of abstraction from which information and knowledge are derived*** 20

And then there is Big Data 21

~ Data for this course
Facebook – Facebook networks (from a date in Sept. 2005) for 100 colleges and universities. These files only include intra-school links. Anonymized. Well curated. Very good quality. Matlab.
InterNetwork Illinois – Telecommunication network traffic and telemetry for central Illinois. Well curated. Good quality. Unexplored.
Linked data – logd.tw.rpi.edu/datasets, e.g. EPA Facility Register System for each U.S. state. Linked. Quality unknown. RDF. 22

A view from IBM … “Anyone who wants to learn something about data analytics should take a road trip. Myriad real-time decisions must be made based on analysis of static information as well as ever-changing conditions. Data about traffic, weather, road construction, fuel, time, current location and available funds are just a few of the factors.” This information and much more are needed to answer questions like: –If I skip this gas station, will I run out of gas before the next one? –Is it worth driving 50 miles out of the way to see the Corn Palace? How late will that side trip make us? –Can I make it to Billings, Mont., by sunset or should I look for a place to stop? 23
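The road-trip questions above reduce to small calculations once the data are in hand. The sketch below, in Python, is only an illustration under stated assumptions: fuel consumption is constant, the side trip is driven at a constant average speed, and every number is hypothetical. The function names are invented for this example.

```python
def can_reach(fuel_gallons, miles_per_gallon, distance_miles):
    """Return True if the remaining fuel covers the distance to the next station.
    Assumes constant fuel economy (an assumption, not a fact about any real car)."""
    return fuel_gallons * miles_per_gallon >= distance_miles


def side_trip_delay_hours(extra_miles, average_speed_mph, stop_hours=0.0):
    """Extra travel time caused by a detour, plus time spent at the stop."""
    return extra_miles / average_speed_mph + stop_hours


# Hypothetical inputs: 3 gallons left, 30 mpg, next station 75 miles away
skip_ok = can_reach(fuel_gallons=3.0, miles_per_gallon=30.0, distance_miles=75.0)

# Hypothetical inputs: 50 extra miles at 50 mph, plus a one-hour visit
delay = side_trip_delay_hours(extra_miles=50.0, average_speed_mph=50.0, stop_hours=1.0)
```

In practice each input (traffic, weather, fuel level) changes during the trip, which is why the quote stresses real-time decisions rather than a single calculation.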

Case Studies (warming up)
Sports Analytics – Moneyball
Nate Silver
Google Analytics - casestudy.htm
Marketing Analytics – products for pregnant (women)
Amazon Recommender – “If you liked, …”
utilizing-real-time-data-analytics 24

Data…
Finding it …
Getting ready to use it …
Using it:
–Big data technologies: Hadoop (MapReduce) + Pig, HDFS, Hive – see database
–NoSQL, Graph, HBase, Cassandra, MongoDB, Riak, CouchDB
–MPP Databases: Storm, Drill, Dremel
We’ll review a few of these 25

Analysis
Software packages / environments:
–GNU R
–RStudio
Extensive libraries
Going from preliminary to initial analysis…
Parametric (assumes or asserts a probability distribution) and non-parametric statistics 26
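The parametric versus non-parametric distinction can be seen by estimating the same quantity both ways. The sketch below is in Python rather than R and uses simulated data, so every value in it is an assumption made for illustration. It estimates the 95th percentile once by asserting a normal distribution and once directly from the ordered data.

```python
import random
from statistics import NormalDist, quantiles

# Simulated data (hypothetical): 2000 draws from a standard normal distribution
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Parametric: assert that the data are normal, estimate mean and standard
# deviation, then read the 95th percentile off the fitted distribution.
fitted = NormalDist.from_samples(data)
p95_parametric = fitted.inv_cdf(0.95)

# Non-parametric: make no distributional assumption and take the empirical
# 95th percentile (the last of the 19 cut points that split the data in 20).
p95_empirical = quantiles(data, n=20)[-1]
```

Here the two estimates agree closely because the normal assumption is true by construction. With skewed or heavy-tailed data the parametric estimate can be badly off, while the empirical one only needs enough observations.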

What is "statistics"? The term "statistics" has two common meanings, which we want to clearly separate: descriptive and inferential statistics. But to understand the difference between descriptive and inferential statistics, we must first be clear on the difference between populations and samples. 27 Courtesy Marshall Ma (and prior sources)

Populations and samples
A population is a set of well-defined objects
–We must be able to say, for every object, if it is in the population or not
–We must be able, in principle, to find every individual of the population
A geographic example of a population is all pixels in a multi-spectral satellite image
A sample is a subset of a population
–We must be able to say, for every object in the population, if it is in the sample or not
–Sampling is the process of selecting a sample from a population
Continuing the example, a sample from this population could be a set of pixels from known ground truth points 28 Courtesy Marshall Ma (and prior sources)

Populations and samples
A population = “all” of the data, if you can get it (BIG Data)
–This is what is different about the methods you use
A sample = “some” of the data, and you may not know how representative it is
–This is what limits analysis, and certainly the development of models 29 Courtesy Marshall Ma (and prior sources)
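A minimal Python sketch of the population/sample relationship follows. The population is simulated (10,000 hypothetical values), the sample is drawn at random without replacement, and membership is tracked by index so that, for every object in the population, we can say whether it is in the sample. All names and numbers are assumptions made for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)

# Hypothetical population: 10,000 simulated measurements
population = [rng.gauss(50.0, 10.0) for _ in range(10_000)]

# Sampling: select 100 distinct members of the population at random
sample_indices = set(rng.sample(range(len(population)), k=100))
sample = [population[i] for i in sorted(sample_indices)]


def in_sample(index):
    """For any object in the population, say whether it is in the sample."""
    return index in sample_indices


population_mean = mean(population)  # available only when you hold "all" the data
sample_mean = mean(sample)          # what you can compute from "some" of the data
```

Because this sample is random, its mean tends to land near the population mean. A sample collected for convenience carries no such expectation, which is the representativeness problem the slide points to.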

What do we mean by "statistics"?
Two common uses of the word:
Descriptive statistics: numerical summaries of samples
–i.e., what was observed
–Note the ‘sample’ may be exhaustive, i.e., identical to the population
Inferential statistics: from samples to populations
–i.e., what could have been or will be observed in a larger population
Example:
Descriptive: "The adjustments of 14 GPS control points for this orthorectification ranged from 3.63 to 8.36 m with an arithmetic mean of m"
Inferential: "The mean adjustment for any set of GPS points taken under specified conditions and used for orthorectification is no less than 4.3 and no more than 6.1 m; this statement has a 5% probability of being wrong" 30 Courtesy Marshall Ma (and prior sources)
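The two kinds of statement in the example can be produced from the same sample. The Python sketch below uses 14 hypothetical adjustment values; only the minimum (3.63) and maximum (8.36) come from the slide, and the rest are invented, so the results are not the slide's figures. The inferential part assumes roughly normal data and uses the two-sided 95% Student t critical value for 13 degrees of freedom (about 2.160).

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical adjustments in metres for 14 GPS control points.
# Only the minimum and maximum match the slide; the other values are invented.
adjustments_m = [3.63, 4.10, 4.55, 4.80, 5.02, 5.20, 5.35,
                 5.50, 5.75, 6.00, 6.40, 6.90, 7.50, 8.36]

# Descriptive statistics: what was observed in this sample
n = len(adjustments_m)
observed_min = min(adjustments_m)
observed_max = max(adjustments_m)
sample_mean = mean(adjustments_m)

# Inferential statistics: a 95% confidence interval for the population mean.
# T_CRIT is the two-sided 95% critical value of Student's t with n - 1 = 13
# degrees of freedom.
T_CRIT = 2.160
half_width = T_CRIT * stdev(adjustments_m) / sqrt(n)
confidence_interval = (sample_mean - half_width, sample_mean + half_width)
```

The descriptive numbers are facts about these 14 points. The interval is a claim about points not yet measured, and it is only as good as the assumptions that the sample is representative and approximately normal.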

Patterns and Relationships Stepping from elementary/ distribution analysis to algorithmic-based analysis Often: data mining: classification, clustering, rules; machine learning; support vector machine, non-parametric models Outcome: model and an evaluation of its fitness for purpose 31

Prediction Choosing applicable models Combining models Confidence levels Multi-variate Future, event, pattern Past event, relation Etc. 32

Prescription Decisions and Effects: What should you do and why? “Business Rules” Benefit or mitigate or adapt? [Personal e.g.] Builds on Prediction, often involves scenarios and post-analysis 33

Summary
We’ll work our way through the stages of analytics
We’ll use both current laptop-installed software and some server data infrastructures for analytics to give you practical experience
We’ll cover algorithms, models, and software to use them
Aim: ~ Tuesday lecture, Friday hands-on lab and interactions
Course will be “adapted” as we go (more people this year) 34

Just for fun… 35

Traversal for new patterns 36

Smart visual exploration 37

Skills needed Database or data structures Literacy with computers and applications that can handle the data we will use Pick up R programming, terminology and syntax Ability to access internet, servers and retrieve/ acquire data, install/ configure software Presentation of project proposals and assignment results 38

Tentative assignment structure (no exam)
Assignment 1: Review of a DA Case Study. Due ~ week 2 (Friday). 10% (written/ discuss; individual);
Assignment 2: Datasets and data infrastructures – lab assignment. Due ~ week 3. 10% (lab; individual);
Assignment 3: Preliminary and Statistical Analysis. Due ~ week 4. 15% (15% written and 0% oral; individual);
Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ week 5. 15% (10% written and 5% oral; individual);
Assignment 5: Term project proposal. Due ~ week 6. 5% (0% written and 5% oral; individual);
Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual);
Term project. Due ~ week % (25% written, 5% oral; individual). 39

What is expected Attend class, complete assignments, participate Ask questions, offer answers in class Work individually (and in a group TBD) on assignments Work constructively in class sessions Next class is Jan. 30 – Introduction or refresher to statistics 40

Reading/ watching
Sports Analytics – Moneyball
Nate Silver
Google Analytics - casestudy.htm
utilizing-real-time-data-analytics

Reference Material On Website after class 14 reading section Data Analytics – various intro material Using R 42

Files This is where the files for assignments and exercises will be placed – data, code (fragments), etc. 43

Assignment 1 Choose a DA case study from a) readings, or b) your choice (must be approved by me) Read it and provide a short written review/ critique (business case, area of application, approach/ methods, tools used, results, actions, benefits). Hand in a written report. Be prepared to discuss it in class on Friday the 6th. Details on the Web site (under Reading/Assignments; Week 1) 44

GNU R – load this first: http://lib.stat.cmu.edu/R/CRAN/
RStudio – see R-intro.html in manuals
–Manuals
–Libraries – at the command line – library(), or select the packages tab, and check/ uncheck as needed 45

Exercises – getting data in RStudio
–Read in csv file (two ways to do this) - GPW3_GRUMP_SummaryInformation_2010.csv
–Read in Excel file (directly or by csv convert) EPI_data.xls (2010EPI_data tab)
–See if you can plot some variables
–Anything in common between them? 46
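The exercise itself is meant to be done in RStudio. As a language-neutral illustration of the "read in csv file" step, here is a hedged sketch in Python using only the standard library. It reads from an in-memory stand-in rather than the course file, and the column names (Country, Population, Area) are assumptions that will not match GPW3_GRUMP_SummaryInformation_2010.csv.

```python
import csv
import io

# Hypothetical stand-in for a CSV file on disk; columns and values are invented
csv_text = "\n".join([
    "Country,Population,Area",
    "A,1000,10.5",
    "B,2500,20.0",
    "C,400,3.2",
]) + "\n"


def read_rows(handle):
    """Parse CSV text into a list of dicts, converting numeric columns."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "Country": row["Country"],
            "Population": int(row["Population"]),
            "Area": float(row["Area"]),
        })
    return rows


rows = read_rows(io.StringIO(csv_text))

# A derived variable, the kind of thing you might then plot or compare
density = {r["Country"]: r["Population"] / r["Area"] for r in rows}
```

To use a real file, pass an open file handle (opened with newline="") instead of io.StringIO and adjust the column names to the file's actual header row.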