Published by Cecily Spencer. Modified over 9 years ago.
1
Feature Engineering Studio February 23, 2015
2
Let’s start by discussing the HW
3
Assignment 3: Data Cleaning
Look for outliers in your data set
Find 3 variables that have one or more outliers (if you can)
Identify those variables
Give the mean, median, SD, and some outlier values for each
For each variable, write a 1 sentence “just so story” (or multiple just so stories) about what might have caused the outlier(s)
Argue (briefly) for a reasonable approach to dealing with that variable’s outliers (and explain why your chosen approach is reasonable)
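As a rough illustration of the summary statistics the assignment asks for, here is a minimal Python sketch. The function name `summarize_outliers`, the sample data, and the "more than 2 SDs from the mean" cutoff are all assumptions made for this example; the assignment does not prescribe a cutoff, and with small samples a stricter one (such as 3 SDs) may never trigger.

```python
import statistics

def summarize_outliers(values, threshold=2.0):
    """Return mean, median, sample SD, and values far from the mean.

    `threshold` is an assumed cutoff in SD units, not part of the assignment.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # sample standard deviation
    outliers = [v for v in values if sd > 0 and abs(v - mean) / sd > threshold]
    return {"mean": mean, "median": median, "sd": sd, "outliers": outliers}

# Hypothetical seconds-per-action values with one suspicious entry.
times = [12, 15, 14, 13, 16, 15, 14, 13, 15, 400]
summary = summarize_outliers(times)
print(summary)
```

A large gap between the mean and the median, as in this made-up data, is often the first hint that a variable has outliers worth a "just so story".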
4
Everyone will present an outlier
Alphabetical order based on first name
– Tie-breaker: last name
I’ll call out letters
– Using the class roster failed last time
5
Tell us about your best outlier
Mean, median, SD, and some outlier values
Give your “just so story” (or multiple just so stories) about what might have caused the outlier(s)
What do you plan to do about it (if anything)?
6
Questions? Comments?
7
Things you can do in Excel part 2 of 3
8
Identifying specific cases of interest
9
Did event of interest ever occur for student?
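The slide does not show the spreadsheet steps, so the following is only a sketch of the same idea outside Excel: a per-student flag that is true if an event appears anywhere in that student's log rows. The log layout (student, action) and the name `ever_occurred` are hypothetical.

```python
def ever_occurred(log, event):
    """Map each student to True if `event` appears at least once in their rows."""
    flags = {}
    for student, action in log:
        flags[student] = flags.get(student, False) or (action == event)
    return flags

# Hypothetical interaction log: (student, action) pairs in time order.
log = [
    ("s1", "attempt"), ("s1", "help"),
    ("s2", "attempt"), ("s2", "attempt"),
    ("s3", "help"),
]
flags = ever_occurred(log, "help")
print(flags)
```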
10
Ratios between events of interest
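One way to read this slide is as a per-student ratio of two event counts. The sketch below assumes a (student, action) log and hypothetical action names; returning `None` when the denominator is zero is a design choice for this example, not something the slide specifies.

```python
from collections import Counter

def event_ratio(log, numerator, denominator):
    """Per-student count of `numerator` events divided by `denominator` events."""
    counts = {}
    for student, action in log:
        counts.setdefault(student, Counter())[action] += 1
    ratios = {}
    for student, c in counts.items():
        # Counter returns 0 for actions the student never performed.
        ratios[student] = c[numerator] / c[denominator] if c[denominator] else None
    return ratios

# Hypothetical log; "help" and "error" are illustrative event names.
log = [
    ("s1", "help"), ("s1", "error"), ("s1", "error"),
    ("s2", "help"), ("s2", "attempt"),
    ("s3", "error"),
]
ratios = event_ratio(log, "help", "error")
print(ratios)
```

Keeping the undefined case explicit avoids silently treating "no errors at all" as a ratio of zero.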
11
How many students had 3 (or 4, 5, 2,…) of an event
12
Unitized actions (such as unitized time)
13
Last 3 or 5 unitized
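The slides do not define "unitized". The sketch below assumes it means time expressed in standard-deviation units relative to the mean for the same step across all students, and then sums that value over each student's last n actions. The row layout, function names, and the choice of population SD are all assumptions.

```python
import statistics
from collections import defaultdict

def unitize(rows):
    """rows: (student, step, seconds) in time order.

    Returns (student, step, z), where z is seconds standardized within the step.
    """
    by_step = defaultdict(list)
    for _, step, seconds in rows:
        by_step[step].append(seconds)
    stats = {s: (statistics.mean(v), statistics.pstdev(v)) for s, v in by_step.items()}
    out = []
    for student, step, seconds in rows:
        mean, sd = stats[step]
        z = (seconds - mean) / sd if sd > 0 else 0.0  # constant step: no spread
        out.append((student, step, z))
    return out

def last_n_sum(unitized, n=3):
    """For each row, sum unitized time over that student's last n actions so far."""
    history = defaultdict(list)
    result = []
    for student, _, z in unitized:
        history[student].append(z)
        result.append(sum(history[student][-n:]))
    return result

# Hypothetical data: two students, two steps.
rows = [("s1", "a", 10), ("s2", "a", 20), ("s1", "b", 5), ("s2", "b", 5)]
unitized = unitize(rows)
recent = last_n_sum(unitized, n=3)
print(unitized, recent)
```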
14
Comparing earlier behaviors to later behaviors through caching
15
Counts-if
16
Percentages of action type
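Conditional counts (in the spirit of Excel's COUNTIF) and percentages of action type can be combined in one pass. This is a minimal sketch assuming a (student, action) log; the data and the name `action_percentages` are hypothetical.

```python
from collections import Counter

def action_percentages(log):
    """Per student, the percentage of their actions that fall in each action type."""
    per_student = {}
    for student, action in log:
        per_student.setdefault(student, Counter())[action] += 1
    return {
        student: {a: 100.0 * n / sum(c.values()) for a, n in c.items()}
        for student, c in per_student.items()
    }

# Hypothetical log.
log = [
    ("s1", "attempt"), ("s1", "attempt"), ("s1", "help"), ("s1", "attempt"),
    ("s2", "help"),
]
pct = action_percentages(log)
print(pct)
```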
17
Percentages of time spent per action/location/KC/etc.
18
List merging
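List merging usually means joining two tables on a shared key, for example attaching a ground truth label to per-student features (a lookup in spreadsheet terms). The sketch below is an inner join on a hypothetical `student` column; rows without a match on both sides are dropped, which is an assumption of this example.

```python
def merge_on_key(left, right, key):
    """Inner-join two lists of dicts on `key`; right-hand fields are added to left rows."""
    lookup = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = lookup.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged

# Hypothetical feature table and label table.
features = [
    {"student": "s1", "help_count": 2},
    {"student": "s2", "help_count": 0},
    {"student": "s4", "help_count": 1},
]
labels = [
    {"student": "s1", "post_test": 0.8},
    {"student": "s2", "post_test": 0.5},
    {"student": "s3", "post_test": 0.9},
]
merged = merge_on_key(features, labels, "student")
print(merged)
```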
19
Pearson Correlation
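For reference, the Pearson correlation between a feature $x$ and an outcome $y$ is $r = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}$. The sketch below computes it directly; the function name and data are illustrative, and in Excel a built-in such as CORREL would normally be used instead.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical feature values and outcomes.
r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_pos, r_neg)
```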
20
T-tests
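The slide does not say which t-test variant is used. As one possibility, the sketch below computes the two-sample t statistic with unequal variances (Welch's test) and its approximate degrees of freedom. It stops short of a p-value, which requires the t distribution and is what a spreadsheet or stats package would report.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least 2 values")
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    se2 = va / na + vb / nb
    if se2 == 0:
        raise ValueError("t is undefined when both groups have zero variance")
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical feature values for two groups of students.
t, df = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t, df)
```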
21
More complex stats in Excel
I have worksheets that can do Chi-squared, Cohen’s Kappa, Extra-Sum-of-Squares F-test, and various meta-analytic methods in Excel
But if you don’t really know what you’re doing, it’s better to use a stats package for these
22
What else might you want to do in Excel?
23
Questions? Comments?
24
HW4: Feature Engineering 1, “Bring Me a Rock”
Get your data set
Open it in Excel
Create as many features as you feel inspired to create
– Features should be created with the goal of predicting your ground truth variable
– At least 12 separate features that are not just variations on a theme (e.g. “time for last 3 actions” and “time for last 4 actions” are variations on a theme; but “time for last 3 actions” and “total time between help requests and next action” are two separate features)
For each feature, write a 1-3 sentence “just so story” for why it might work
Test how good each feature is
25
Testing Feature Goodness
For this assignment, there are a bunch of ways to test feature goodness
Single-feature prediction models in a data mining or stats package, giving Pearson correlation, Spearman’s rho, or Cohen’s kappa (special session this Wednesday)
Compute Pearson correlation in Excel
Compute t-test in Excel
Compute other metrics in Excel (but see earlier disclaimer)
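Of the metrics listed, Cohen's kappa applies when both the ground truth and the single-feature model's output are categorical. It compares observed agreement $p_o$ with the agreement expected by chance $p_e$: $\kappa = (p_o - p_e) / (1 - p_e)$. The following is a minimal sketch with made-up labels, not the course's worksheet or any particular package's implementation.

```python
from collections import Counter

def cohens_kappa(labels, predictions):
    """Cohen's kappa between two equal-length sequences of categorical values."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels)
    observed = sum(l == p for l, p in zip(labels, predictions)) / n
    label_counts, pred_counts = Counter(labels), Counter(predictions)
    # Chance agreement: product of the marginal proportions, summed over categories.
    expected = sum(label_counts[c] * pred_counts[c] for c in label_counts) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)

# Hypothetical ground truth and single-feature predictions.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
print(kappa)
```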
26
Were you right?
Which of your “just so stories” seem to be correct?
Did any of your features correlate in the opposite direction from what you expected?
27
Assignment 4
Write a brief report for me
Email me an Excel sheet with your features
You don’t need to prepare a presentation
But be ready to discuss your features in class
28
Next Classes
2/25 Special Session
– Using RapidMiner to Produce Prediction Models
– Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)
– Statistical significance tests using linear regression don’t count…
3/2 Advanced Feature Distillation in Excel
– HW4 due