Feature Engineering Studio February 23, 2015. Let’s start by discussing the HW.

1 Feature Engineering Studio February 23, 2015

2 Let’s start by discussing the HW

3 Assignment 3 Data Cleaning
Look for outliers in your data set
Find 3 variables that have one or more outliers (if you can)
Identify those variables
Give the mean, median, SD, and some outlier values in them
For each variable, write a 1 sentence “just so story” (or multiple just so stories) about what might have caused the outlier(s)
Argue (briefly) for a reasonable approach to dealing with that variable’s outliers (and explain why your chosen approach is reasonable)
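As a rough illustration of the summary the assignment asks for, the sketch below computes the mean, median, and SD of one variable and flags values far from the mean. The 2-SD cutoff, the function name `outlier_summary`, and the sample data are all assumptions made for this example; the assignment does not prescribe a particular outlier rule.

```python
import statistics

def outlier_summary(values, z_cutoff=2.0):
    """Return mean, median, sample SD, and values more than z_cutoff SDs from the mean.

    The cutoff is an illustrative assumption, not a rule from the course.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # sample standard deviation
    outliers = [v for v in values if sd > 0 and abs(v - mean) / sd > z_cutoff]
    return {"mean": mean, "median": median, "sd": sd, "outliers": outliers}

# Hypothetical response times in seconds; the 600 might be a student who walked away.
times = [4, 5, 6, 5, 7, 6, 5, 4, 6, 600]
summary = outlier_summary(times)
print(summary)
```

Note how far the mean sits from the median here: a single extreme value is enough to pull the mean away, which is one reason to report both.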

4 Everyone will present an outlier
Alphabetical Order Based on First Name
– Tie-Breaker: Last Name
I’ll call out letters
– Using the class roster failed last time

5 Tell us about your best outlier
Mean, Median, SD, and some outlier values
Give your “just so story” (or multiple just so stories) about what might have caused the outlier(s)
What do you plan to do about it (if anything)?

6 Questions? Comments?

7 Things you can do in Excel part 2 of 3

8 Identifying specific cases of interest

9 Did event of interest ever occur for student?
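The slide gives only the feature's name. One plausible reading, sketched in Python rather than Excel, is a per-student flag that is true if the event shows up anywhere in that student's rows. The log layout (`student` and `action` keys) and the function name are hypothetical.

```python
def ever_occurred(log, event):
    """Map each student to True if `event` appears at least once in their rows."""
    result = {}
    for row in log:
        student = row["student"]
        result[student] = result.get(student, False) or row["action"] == event
    return result

# Hypothetical action log: one dict per logged action.
log = [
    {"student": "s1", "action": "attempt"},
    {"student": "s1", "action": "help"},
    {"student": "s2", "action": "attempt"},
    {"student": "s2", "action": "attempt"},
]
flags = ever_occurred(log, "help")
print(flags)
```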

10 Ratios between events of interest
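A minimal Python sketch of a ratio feature, assuming it means one event count divided by another for each student (here, help requests per attempt). The log layout and names are hypothetical, and returning `None` for a zero denominator is a design choice made for this example.

```python
from collections import Counter

def event_ratio(log, numerator, denominator):
    """Per student: count(numerator) / count(denominator), or None if undefined."""
    counts = {}
    for row in log:
        counts.setdefault(row["student"], Counter())[row["action"]] += 1
    ratios = {}
    for student, c in counts.items():
        # Avoid dividing by zero when the denominator event never occurred.
        ratios[student] = c[numerator] / c[denominator] if c[denominator] else None
    return ratios

# Hypothetical action log.
log = (
    [{"student": "s1", "action": a} for a in ["attempt", "help", "attempt", "help", "attempt", "attempt"]]
    + [{"student": "s2", "action": "help"}]
    + [{"student": "s3", "action": a} for a in ["attempt", "attempt"]]
)
ratios = event_ratio(log, "help", "attempt")
print(ratios)
```

In a spreadsheet the undefined case would surface as a division error, so it is worth deciding up front how such students should be treated.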

11 How many students had 3 (or 4, 5, 2,…) of an event
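One way to read this slide is as a two-step count: first count the event per student, then count how many students share each total. The Python sketch below follows that reading; the log layout and names are hypothetical.

```python
from collections import Counter

def students_per_event_count(log, event):
    """Return a Counter mapping 'times the event occurred' -> 'number of students'."""
    per_student = Counter()
    for row in log:
        # Adding 0 still registers the student, so students with no events are counted.
        per_student[row["student"]] += int(row["action"] == event)
    return Counter(per_student.values())

# Hypothetical action log.
log = (
    [{"student": "s1", "action": "help"}] * 3
    + [{"student": "s2", "action": "help"}, {"student": "s2", "action": "attempt"}]
    + [{"student": "s3", "action": "attempt"}]
    + [{"student": "s4", "action": "help"}] * 3
)
histogram = students_per_event_count(log, "help")
print(histogram)
```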

12 Unitized actions (such as unitized time)
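The slide does not define "unitized". The sketch below assumes it means expressing each action's time in standard deviations above or below the mean time for the same problem step, so that slow steps and fast steps become comparable. That interpretation, the `step` and `time` keys, and the use of the population SD are all assumptions.

```python
import statistics
from collections import defaultdict

def unitize(rows, group_key="step", value_key="time"):
    """Return one z-score per row, computed within the row's group (assumed meaning of 'unitized')."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    stats = {g: (statistics.mean(v), statistics.pstdev(v)) for g, v in groups.items()}
    unitized = []
    for row in rows:
        mean, sd = stats[row[group_key]]
        # A group with no variation gets 0.0 rather than a division error.
        unitized.append((row[value_key] - mean) / sd if sd > 0 else 0.0)
    return unitized

# Hypothetical rows: time in seconds for each action on a given step.
rows = [
    {"step": "A", "time": 2}, {"step": "A", "time": 4}, {"step": "A", "time": 6},
    {"step": "B", "time": 10}, {"step": "B", "time": 10},
]
z = unitize(rows)
print(z)
```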

13 Last 3 or 5 unitized
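A possible Python analogue, assuming the feature is the sum of a student's most recent 3 (or 5) unitized values, including the current action. Whether the current action belongs in the window is an assumption, as are the `unitized_time` key and the function name.

```python
from collections import defaultdict, deque

def last_n_sum(rows, n=3, value_key="unitized_time"):
    """For each row, sum the student's last n values up to and including this row."""
    windows = defaultdict(lambda: deque(maxlen=n))  # one sliding window per student
    sums = []
    for row in rows:
        window = windows[row["student"]]
        window.append(row[value_key])  # the oldest value drops out automatically
        sums.append(sum(window))
    return sums

# Hypothetical rows in chronological order, with students interleaved.
rows = [
    {"student": "s1", "unitized_time": 1.0},
    {"student": "s2", "unitized_time": 0.0},
    {"student": "s1", "unitized_time": -0.5},
    {"student": "s1", "unitized_time": 2.0},
    {"student": "s2", "unitized_time": 1.0},
    {"student": "s1", "unitized_time": 0.5},
]
sums = last_n_sum(rows, n=3)
print(sums)
```

The rows must already be sorted by time for the window to mean "most recent".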

14 Comparing earlier behaviors to later behaviors through caching

15 Counts-if
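Excel's conditional counting functions (`COUNTIF`, `COUNTIFS`) count rows that meet a condition. A common feature built this way is a running count: how many times has this student done X so far? The Python sketch below illustrates that idea only; it is an assumption about what the slide demonstrated, and the log layout is hypothetical.

```python
from collections import Counter

def running_count_if(log, event):
    """For each row, count how often that row's student has produced `event` so far."""
    seen = Counter()
    counts = []
    for row in log:
        if row["action"] == event:
            seen[row["student"]] += 1
        counts.append(seen[row["student"]])  # includes the current row
    return counts

# Hypothetical action log in chronological order.
log = [
    {"student": "s1", "action": "help"},
    {"student": "s1", "action": "attempt"},
    {"student": "s2", "action": "help"},
    {"student": "s1", "action": "help"},
]
counts = running_count_if(log, "help")
print(counts)
```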

16 Percentages of action type
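A minimal sketch, assuming the feature is the share of each student's actions that fall into each action type. The log layout and function name are hypothetical; shares are returned as fractions between 0 and 1 rather than percentages.

```python
from collections import Counter, defaultdict

def action_type_shares(log):
    """Per student: fraction of that student's actions belonging to each action type."""
    counts = defaultdict(Counter)
    for row in log:
        counts[row["student"]][row["action"]] += 1
    return {
        student: {action: c / sum(by_action.values()) for action, c in by_action.items()}
        for student, by_action in counts.items()
    }

# Hypothetical action log.
log = (
    [{"student": "s1", "action": a} for a in ["attempt", "attempt", "help", "attempt"]]
    + [{"student": "s2", "action": a} for a in ["attempt", "attempt"]]
)
shares = action_type_shares(log)
print(shares)
```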

17 Percentages of time spent per action/location/KC/etc.
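The time-based version weights each row by its duration instead of counting rows. The sketch assumes every row carries a `time` value and a grouping column (action, location, KC, and so on); those column names are hypothetical.

```python
from collections import defaultdict

def time_shares(log, by="action"):
    """Per student: fraction of total time spent in each value of the `by` column."""
    totals = defaultdict(float)
    per_group = defaultdict(lambda: defaultdict(float))
    for row in log:
        totals[row["student"]] += row["time"]
        per_group[row["student"]][row[by]] += row["time"]
    return {
        student: {group: t / totals[student] for group, t in groups.items()}
        for student, groups in per_group.items()
        if totals[student] > 0  # skip students with no recorded time
    }

# Hypothetical action log; time is in seconds.
log = [
    {"student": "s1", "action": "attempt", "time": 30},
    {"student": "s1", "action": "help", "time": 10},
    {"student": "s1", "action": "attempt", "time": 40},
]
shares_by_time = time_shares(log, by="action")
print(shares_by_time)
```

Unchecked outliers in the time column can dominate these shares, so values like a 600-second idle period are worth handling first.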

18 List merging
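Assuming "list merging" here means attaching values from one list to another by a shared key (the kind of task a lookup function handles in a spreadsheet), a Python analogue is a dictionary lookup per row. The key name, the `label` column, and the sample data are hypothetical.

```python
def merge_labels(log, labels, key="student", default=None):
    """Return copies of the log rows with a 'label' column looked up by `key`."""
    return [{**row, "label": labels.get(row[key], default)} for row in log]

# Hypothetical per-action log and per-student ground-truth labels.
log = [
    {"student": "s1", "action": "attempt"},
    {"student": "s2", "action": "help"},
    {"student": "s3", "action": "attempt"},
]
labels = {"s1": 1, "s2": 0}  # s3 has no label in this example
merged = merge_labels(log, labels)
print(merged)
```

Rows whose key is missing from the second list get the default, which makes unmatched cases easy to find afterwards.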

19 Pearson Correlation
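For reference, Pearson's correlation between a feature $x$ and a label $y$ is $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$. A short Python sketch of that formula follows; the function name and sample data are made up for the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical feature values and ground-truth values for five students.
feature = [1, 2, 3, 4, 5]
truth = [2, 1, 4, 3, 5]
r = pearson_r(feature, truth)
print(r)
```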

20 T-tests
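A t-test can check whether a feature's values differ between two groups, for example rows where the ground truth is 1 versus 0. The sketch below computes Welch's unequal-variance t statistic and its degrees of freedom; choosing the Welch variant is an assumption, since the slide does not say which test was shown. It stops short of a p-value, which needs the t distribution from a statistics package.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances
    n_a, n_b = len(a), len(b)
    se2 = var_a / n_a + var_b / n_b  # squared standard error of the mean difference
    t = (mean_a - mean_b) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / ((var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1))
    return t, df

# Hypothetical feature values split by ground-truth label.
feature_when_label_0 = [1, 2, 3, 4, 5]
feature_when_label_1 = [3, 4, 5, 6, 7]
t, df = welch_t(feature_when_label_0, feature_when_label_1)
print(t, df)
```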

21 More complex stats in Excel
I have worksheets that can do Chi-squared, Cohen’s Kappa, Extra-Sum-of-Squares F-test, and various meta-analytic methods in Excel
But if you don’t really know what you’re doing, it’s better to use a stats package for these

22 What else might you want to do in Excel?

23 Questions? Comments?

24 HW4 Feature Engineering 1 “Bring Me a Rock”
Get your data set
Open it in Excel
Create as many features as you feel inspired to create
– Features should be created with the goal of predicting your ground truth variable
– At least 12 separate features that are not just variations on a theme (e.g. “time for last 3 actions” and “time for last 4 actions” are variations on a theme; but “time for last 3 actions” and “total time between help requests and next action” are two separate features)
For each feature, write a 1-3 sentence “just so story” for why it might work
Test how good each feature is

25 Testing Feature Goodness
For this assignment, there are a bunch of ways to test feature goodness
Single-feature prediction models in a data mining or stats package, giving Pearson correlation, Spearman’s rho, or Cohen’s kappa (special session this Wednesday)
Compute Pearson correlation in Excel
Compute t-test in Excel
Compute other metrics in Excel (but see earlier disclaimer)
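Of the metrics listed, Cohen's kappa measures agreement between predictions and ground truth beyond what chance would give: $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is observed agreement and $p_e$ is the agreement expected by chance. A minimal Python sketch, assuming the single-feature model's output has already been turned into category predictions; the data are invented for the example.

```python
from collections import Counter

def cohens_kappa(predicted, actual):
    """Cohen's kappa between two equal-length sequences of category labels."""
    n = len(predicted)
    if n == 0 or n != len(actual):
        raise ValueError("need two non-empty sequences of equal length")
    observed = sum(p == a for p, a in zip(predicted, actual)) / n
    pred_counts, actual_counts = Counter(predicted), Counter(actual)
    categories = set(pred_counts) | set(actual_counts)
    # Chance agreement: product of the two marginal proportions, summed over categories.
    expected = sum(pred_counts[c] * actual_counts[c] for c in categories) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)

# Hypothetical binary predictions from one feature, and the ground truth.
predicted = [1, 1, 0, 0, 1, 0, 1, 0]
actual = [1, 1, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(predicted, actual)
print(kappa)
```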

26 Were you right?
Which of your “just so stories” seem to be correct?
Did any of your features correlate in the opposite direction from what you expected?

27 Assignment 4
Write a brief report for me
Email me an Excel sheet with your features
You don’t need to prepare a presentation
But be ready to discuss your features in class

28 Next Classes
2/25 Special Session
– Using RapidMiner to Produce Prediction Models
– Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)
– Statistical significance tests using linear regression don’t count…
3/2 Advanced Feature Distillation in Excel
– HW4 due

