Feature Engineering Studio February 23, 2015. Let’s start by discussing the HW.

Presentation transcript:

Feature Engineering Studio February 23, 2015

Let’s start by discussing the HW

Assignment 3: Data Cleaning
Look for outliers in your data set
Find 3 variables that have one or more outliers (if you can)
Identify those variables
Give the mean, median, SD, and some outlier values for each
For each variable, write a 1-sentence “just so story” (or multiple just so stories) about what might have caused the outlier(s)
Argue (briefly) for a reasonable approach to dealing with that variable’s outliers (and explain why your chosen approach is reasonable)
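The summary statistics the assignment asks for can also be computed outside Excel. The following is a minimal Python sketch, not part of the original slides. The function name, the sample data, and the 3-SD cutoff are all assumptions made for illustration; the assignment itself does not define what counts as an outlier.

```python
import statistics


def summarize_outliers(values, z_cutoff=3.0):
    """Return mean, median, SD, and values more than z_cutoff SDs from the mean.

    The 3-SD cutoff is an illustrative assumption, not a rule from the course.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # sample standard deviation
    outliers = [v for v in values if sd > 0 and abs(v - mean) / sd > z_cutoff]
    return {"mean": mean, "median": median, "sd": sd, "outliers": outliers}


# Hypothetical response times in seconds; one student left the window open.
times = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6, 5, 5, 4, 6, 5, 5, 4, 6, 5, 600]
summary = summarize_outliers(times)
print(summary)
```

Note how far the mean is pulled from the median by a single extreme value, which is one reason the assignment asks for both.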

Everyone will present an outlier
Alphabetical order based on first name
– Tie-breaker: last name
I’ll call out letters
– Using the class roster failed last time

Tell us about your best outlier
Mean, median, SD, and some outlier values
Give your “just so story” (or multiple just so stories) about what might have caused the outlier(s)
What do you plan to do about it (if anything)?

Questions? Comments?

Things you can do in Excel part 2 of 3

Identifying specific cases of interest

Did event of interest ever occur for student?
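For readers working outside Excel, here is a minimal Python sketch of this kind of feature. The log layout (student, action) and all names are hypothetical.

```python
# Hypothetical action log: one (student, action) pair per logged action.
log = [
    ("ann", "attempt"),
    ("ann", "help"),
    ("bob", "attempt"),
    ("bob", "attempt"),
    ("cat", "help"),
]


def ever_occurred(log, event):
    """Map each student to True if the event appears at least once in their log."""
    students = {student for student, _ in log}
    with_event = {student for student, action in log if action == event}
    return {student: student in with_event for student in sorted(students)}


ever_help = ever_occurred(log, "help")
print(ever_help)
```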

Ratios between events of interest
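One way to sketch a ratio feature in Python, assuming a hypothetical (student, action) log. Returning None when the denominator is zero is an illustrative design choice; how to treat that case is left to the analyst.

```python
from collections import Counter

# Hypothetical action log: one (student, action) pair per logged action.
log = [
    ("ann", "attempt"), ("ann", "attempt"), ("ann", "help"),
    ("bob", "attempt"), ("bob", "attempt"), ("bob", "attempt"), ("bob", "attempt"),
    ("cat", "help"),
]


def event_ratio(log, numerator, denominator):
    """Per-student ratio of two event counts; None when the denominator is 0."""
    counts = Counter(log)  # counts each (student, action) pair
    ratios = {}
    for student in sorted({s for s, _ in log}):
        denom = counts[(student, denominator)]
        ratios[student] = counts[(student, numerator)] / denom if denom else None
    return ratios


help_per_attempt = event_ratio(log, "help", "attempt")
print(help_per_attempt)
```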

How many students had 3 (or 4, 5, 2,…) of an event
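A small Python sketch of this tally, roughly what a COUNTIF over per-student counts would give in Excel. The log layout and names are hypothetical.

```python
from collections import Counter

# Hypothetical action log: one (student, action) pair per logged action.
log = [
    ("ann", "help"), ("ann", "help"), ("ann", "help"),
    ("bob", "help"), ("bob", "attempt"), ("bob", "help"), ("bob", "help"),
    ("cat", "attempt"),
    ("dan", "help"),
]


def students_per_event_count(log, event):
    """Return {number of events: number of students with exactly that many}."""
    per_student = {student: 0 for student, _ in log}  # include students with zero
    for student, action in log:
        if action == event:
            per_student[student] += 1
    return Counter(per_student.values())


distribution = students_per_event_count(log, "help")
print(sorted(distribution.items()))
```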

Unitized actions (such as unitized time)
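The slide does not define “unitized.” The sketch below assumes it means expressing each time in standard-deviation units relative to the mean for the same problem step (a z-score). That reading, the row layout, and the use of the population SD are all assumptions.

```python
import statistics
from collections import defaultdict

# Hypothetical rows: (student, step, seconds).
rows = [
    ("ann", "s1", 10), ("bob", "s1", 20), ("cat", "s1", 30),
    ("ann", "s2", 5), ("bob", "s2", 5),
]


def unitize_times(rows):
    """Convert each time to a z-score relative to all times on the same step."""
    by_step = defaultdict(list)
    for _, step, seconds in rows:
        by_step[step].append(seconds)
    stats = {
        step: (statistics.mean(times), statistics.pstdev(times))
        for step, times in by_step.items()
    }
    unitized = []
    for student, step, seconds in rows:
        mean, sd = stats[step]
        # If every time on a step is identical, there is no spread to scale by.
        z = (seconds - mean) / sd if sd > 0 else 0.0
        unitized.append((student, step, z))
    return unitized


unitized = unitize_times(rows)
for row in unitized:
    print(row)
```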

Last 3 or 5 unitized
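A possible Python sketch of a “last n” feature: a running sum over each student's most recent n values, including the current action. Whether the current action belongs in the window, and whether to sum or average, are assumptions here. The input values are hypothetical and already in SD units.

```python
from collections import defaultdict, deque

# Hypothetical rows in time order: (student, unitized value).
rows = [("ann", 1.0), ("ann", -0.5), ("bob", 0.25), ("ann", 2.0), ("ann", 0.5)]


def rolling_last_n(rows, n=3):
    """For each row, sum the student's most recent n values (current included)."""
    recent = defaultdict(lambda: deque(maxlen=n))  # old values drop off automatically
    out = []
    for student, value in rows:
        window = recent[student]
        window.append(value)
        out.append((student, sum(window)))
    return out


last3 = rolling_last_n(rows, n=3)
print(last3)
```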

Comparing earlier behaviors to later behaviors through caching
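The slide gives no detail on the caching technique. The sketch below shows one plausible reading: keep a running per-student summary of earlier actions (the cache) and compare each new action against it. The row layout and the choice of “current time minus mean of earlier times” are assumptions.

```python
# Hypothetical rows in time order: (student, seconds).
rows = [("ann", 10), ("ann", 20), ("bob", 8), ("ann", 30)]


def compare_to_earlier(rows):
    """For each action, report current time minus the student's earlier mean time.

    The first action for a student has nothing to compare against, so it gets None.
    """
    cache = {}  # student -> (count of earlier actions, total earlier seconds)
    out = []
    for student, seconds in rows:
        count, total = cache.get(student, (0, 0.0))
        diff = seconds - total / count if count else None
        out.append((student, seconds, diff))
        cache[student] = (count + 1, total + seconds)  # update after comparing
    return out


compared = compare_to_earlier(rows)
print(compared)
```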

Counts-if

Percentages of action type

Percentages of time spent per action/location/KC/etc.
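Percentage features of both kinds (share of actions by type, and share of time by action, location, or KC) follow the same pattern. A minimal Python sketch, assuming a hypothetical row layout of (student, action, KC, seconds):

```python
from collections import defaultdict

# Hypothetical rows: (student, action, kc, seconds).
rows = [
    ("ann", "attempt", "fractions", 30.0),
    ("ann", "help", "fractions", 10.0),
    ("ann", "attempt", "decimals", 40.0),
    ("ann", "attempt", "decimals", 20.0),
    ("bob", "attempt", "fractions", 5.0),
]


def percentages(rows, key_index, weight_index=None):
    """Per-student percentage breakdown over the column at key_index.

    With weight_index=None each row counts once (percentage of actions);
    otherwise rows are weighted by that column (e.g. percentage of time).
    """
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        weight = 1.0 if weight_index is None else row[weight_index]
        totals[row[0]][row[key_index]] += weight
    result = {}
    for student, by_key in totals.items():
        grand = sum(by_key.values())
        result[student] = (
            {key: 100.0 * value / grand for key, value in by_key.items()} if grand else {}
        )
    return result


pct_actions = percentages(rows, key_index=1)                 # by action type
pct_time_kc = percentages(rows, key_index=2, weight_index=3)  # time per KC
print(pct_actions)
print(pct_time_kc)
```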

List merging
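List merging here is taken to mean joining two tables on a shared key, much like a VLOOKUP in Excel. A minimal Python sketch under that assumption, with hypothetical feature and label data:

```python
# Hypothetical per-student feature values and ground-truth labels.
features = [("ann", 0.5), ("bob", 0.0), ("cat", 0.9)]
labels = {"ann": 1, "bob": 0}  # "cat" has no label on purpose


def merge_lists(features, labels, default=None):
    """Attach each student's label to their feature row; missing labels get default."""
    return [(student, value, labels.get(student, default)) for student, value in features]


merged = merge_lists(features, labels)
print(merged)
```

Rows that come back with the default value are worth checking by hand, since they usually point to a key mismatch between the two lists.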

Pearson Correlation
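In Excel this is typically done with the CORREL function. For comparison, a minimal pure-Python sketch of the standard formula, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²). The function name and sample data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical feature values and ground-truth scores for five students.
feature = [1, 2, 3, 4, 5]
truth = [2, 4, 5, 4, 5]
r = pearson_r(feature, truth)
print(round(r, 4))
```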

T-tests
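The slide does not say which t-test is meant. As one illustration, the sketch below computes the statistic and degrees of freedom for Welch's unequal-variance two-sample t-test in pure Python. It deliberately stops short of a p-value, which requires the t distribution from a stats package. The data and names are hypothetical.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test."""
    # Squared standard error of each group mean.
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    if va + vb == 0:
        raise ValueError("t is undefined when both groups have zero variance")
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical feature values for students with and without the outcome.
with_outcome = [2, 4, 6, 8]
without_outcome = [1, 2, 3, 4]
t, df = welch_t(with_outcome, without_outcome)
print(round(t, 4), round(df, 4))
```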

More complex stats in Excel
I have worksheets that can do Chi-squared, Cohen’s Kappa, Extra-Sum-of-Squares F-test, and various meta-analytic methods in Excel
But if you don’t really know what you’re doing, it’s better to use a stats package for these

What else might you want to do in Excel?

Questions? Comments?

HW4: Feature Engineering 1, “Bring Me a Rock”
Get your data set
Open it in Excel
Create as many features as you feel inspired to create
– Features should be created with the goal of predicting your ground truth variable
– At least 12 separate features that are not just variations on a theme (e.g. “time for last 3 actions” and “time for last 4 actions” are variations on a theme; but “time for last 3 actions” and “total time between help requests and next action” are two separate features)
For each feature, write a 1-3 sentence “just so story” for why it might work
Test how good each feature is

Testing Feature Goodness
For this assignment, there are several ways to test feature goodness
Single-feature prediction models in a data mining or stats package, giving Pearson correlation, Spearman’s rho, or Cohen’s kappa (special session this Wednesday)
Compute Pearson correlation in Excel
Compute t-test in Excel
Compute other metrics in Excel (but see earlier disclaimer)
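Of the metrics listed, Cohen's kappa is the least obvious to compute by hand. A minimal Python sketch follows, using the standard definition κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance. The predictions here are hypothetical, as if a single feature had been thresholded into a yes/no prediction. That thresholding step is an assumption and is not described in the slides.

```python
from collections import Counter


def cohens_kappa(predicted, actual):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(predicted)
    if n == 0 or n != len(actual):
        raise ValueError("need two non-empty sequences of equal length")
    observed = sum(p == a for p, a in zip(predicted, actual)) / n
    pred_counts, actual_counts = Counter(predicted), Counter(actual)
    categories = set(pred_counts) | set(actual_counts)
    expected = sum(pred_counts[c] * actual_counts[c] for c in categories) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical single-feature predictions versus ground truth for eight students.
predicted = [1, 1, 0, 0, 1, 0, 1, 0]
actual = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(predicted, actual)
print(kappa)
```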

Were you right?
Which of your “just so stories” seem to be correct?
Did any of your features correlate in the opposite direction from what you expected?

Assignment 4
Write a brief report for me
Email me an Excel sheet with your features
You don’t need to prepare a presentation
But be ready to discuss your features in class

Next Classes
2/25 Special Session
– Using RapidMiner to Produce Prediction Models
– Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)
– Statistical significance tests using linear regression don’t count…
3/2 Advanced Feature Distillation in Excel
– HW4 due