Published by Edgar Cole. Modified over 9 years ago.
1
Feature Engineering Studio September 23, 2013
2
Welcome to Mucking Around Day
3
Sort into pairs
Partner with the person next to you
One group of 3 is allowed
4
Sort into pairs
Do we have a group of 3?
One of the 3 will work with me
5
Sort into pairs
Go over your reports together
– A maximum of 5 minutes apiece
6
5 minutes for first person
7
5 minutes for second person
8
Re-assemble into one big group
9
Who here found something really cool while mucking around? Show us, tell us
10
Who here found a histogram with a normal distribution? Show us, tell us
11
Who here found a histogram with a hypermode? Show us, tell us
12
Who here found a histogram with a flat distribution? Show us, tell us
13
Who here found a histogram with a skewed distribution? Show us, tell us
14
Who here found a histogram with a bimodal distribution? Show us, tell us
15
Who here found a histogram with something else interesting? Show us, tell us
16
Who here found something surprising with their min, max, average, stdev?
17
Categorical variables
Who here found something curious, weird, or interesting in the distribution of their categorical variables?
18
Who here hasn’t spoken yet? (and analyzed data) Tell us something interesting you found in your data
19
Who here played with pivot tables? What did you learn?
20
My turn to play with pivot tables
Who wants to volunteer their data?
(I might request a 2nd or 3rd data set, depending on how the 1st one goes)
21
Who here played with vlookup? What did you learn?
22
My turn to play with vlookup
Using the same volunteered data set(s)
23
Other cool things you can create with a few simple formulas (plus demos!)
24
Identifying specific cases of interest
25
Did event of interest ever occur for student?
26
Counts-so-far (and total value for student)
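A minimal sketch of a counts-so-far feature outside Excel, assuming the log is a list of (student, action) pairs already sorted in time order. The function name, the "help" event, and the choice to count only rows before the current one are illustrative assumptions, not something the slides specify.

```python
from collections import defaultdict

def counts_so_far(log, event):
    """For each row, count how often `event` occurred earlier for that student.

    Also returns each student's final total for the event.
    Assumes `log` is ordered by time (hypothetical layout).
    """
    running = defaultdict(int)
    so_far = []
    for student, action in log:
        so_far.append(running[student])  # count before the current row
        if action == event:
            running[student] += 1
    return so_far, dict(running)

log = [("s1", "help"), ("s1", "attempt"), ("s2", "help"),
       ("s1", "help"), ("s2", "attempt")]
so_far, totals = counts_so_far(log, "help")
print(so_far)   # [0, 1, 0, 1, 1]
print(totals)   # {'s1': 2, 's2': 1}
```

Excluding the current row keeps the feature from including the very action it might later be used to predict. If the current row should count, move the increment above the append.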
27
Counts-last-N-actions
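One way to sketch counts-last-N-actions, assuming the same hypothetical (student, action) log in time order. The window here covers the N actions before the current row, which is an assumption. A per-student deque with a maximum length drops the oldest action automatically.

```python
from collections import defaultdict, deque

def counts_last_n(log, event, n):
    """For each row, count `event` among that student's previous n actions."""
    windows = defaultdict(lambda: deque(maxlen=n))
    out = []
    for student, action in log:
        window = windows[student]
        out.append(sum(1 for a in window if a == event))
        window.append(action)  # oldest action falls out once the window is full
    return out

log = [("s1", "help"), ("s1", "help"), ("s1", "attempt"),
       ("s1", "help"), ("s2", "help"), ("s1", "attempt")]
print(counts_last_n(log, "help", 2))  # [0, 1, 2, 1, 0, 1]
```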
28
First attempts
29
Ratios between events of interest
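A rough sketch of a ratio feature, for example help requests per attempt for each student. The event names and the (student, action) layout are assumptions for illustration. The main practical detail is deciding what to do when the denominator is zero; this sketch returns None rather than inventing a value.

```python
from collections import Counter, defaultdict

def event_ratio(log, numerator, denominator):
    """Per-student ratio of two event counts; None if the denominator is 0."""
    counts = defaultdict(Counter)
    for student, action in log:
        counts[student][action] += 1
    ratios = {}
    for student, c in counts.items():
        ratios[student] = c[numerator] / c[denominator] if c[denominator] else None
    return ratios

log = [("s1", "help"), ("s1", "attempt"), ("s1", "attempt"), ("s2", "help")]
print(event_ratio(log, "help", "attempt"))  # {'s1': 0.5, 's2': None}
```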
30
How many students had 3 (or 4, 5, 2,…) of an event
31
Times-so-far
32
Cutoff-based features
33
Unitized actions (such as unitized time)
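The slide does not define "unitized". One common reading, assumed here, is expressing each time as the number of standard deviations above or below the mean time for the same problem step. The sketch below uses that reading with hypothetical (step, seconds) rows and the population standard deviation; swap in `statistics.stdev` to match a sample standard deviation.

```python
from collections import defaultdict
from statistics import mean, pstdev

def unitize_times(rows):
    """Convert each time to SD units relative to other times on the same step."""
    by_step = defaultdict(list)
    for step, seconds in rows:
        by_step[step].append(seconds)
    stats = {step: (mean(v), pstdev(v)) for step, v in by_step.items()}
    out = []
    for step, seconds in rows:
        m, sd = stats[step]
        # With no spread on a step, every time equals the mean, so use 0.0
        out.append((seconds - m) / sd if sd > 0 else 0.0)
    return out

rows = [("a", 2.0), ("a", 4.0), ("b", 10.0)]
print(unitize_times(rows))  # [-1.0, 1.0, 0.0]
```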
34
Last 3 or 5 unitized
35
Comparing earlier behaviors to later behaviors through caching
36
Counts-if
37
Percentages of action type
38
Percentages of time spent per action/location/KC/etc.
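A small sketch of the percentage-of-time idea, assuming hypothetical (student, action, seconds) rows. The same grouping works for location, KC, or any other column by changing what sits in the middle position. Shares are returned as fractions of each student's total time.

```python
from collections import defaultdict

def time_share(rows):
    """Fraction of each student's total time spent on each action type."""
    per_action = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    for student, action, seconds in rows:
        per_action[student][action] += seconds
        total[student] += seconds
    return {
        student: {a: t / total[student] for a, t in acts.items()}
        for student, acts in per_action.items()
        if total[student] > 0  # skip students with no recorded time
    }

rows = [("s1", "help", 10.0), ("s1", "attempt", 30.0), ("s2", "help", 5.0)]
print(time_share(rows))
# {'s1': {'help': 0.25, 'attempt': 0.75}, 's2': {'help': 1.0}}
```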
39
Questions? Comments?
40
Other cool ideas?
41
Assignment 3
Feature Engineering 1 “Bring Me a Rock”
Get your data set
Open it in Excel
Create as many features as you feel inspired to create
– Features should be created with the goal of predicting your ground truth variable
– At least 12 separate features that are not just variations on a theme (e.g. “time for last 3 actions” and “time for last 4 actions” are variations on a theme, but “time for last 3 actions” and “total time between help requests and next action” are two separate features)
For each feature, write a 1-3 sentence “just so story” for why it might work
Test how good each feature is
42
Testing Feature Goodness
For this assignment, there are a bunch of ways to test feature goodness
Single-feature prediction models in a data mining or stats package, giving correlation or kappa (special session this Wednesday)
Compute correlation in Excel (want to see?)
– You can do this with binary variables too, although it’s not really optimal
Compute t-test in Excel (want to see?)
Compute kappa in Excel (if you don’t know how, it’s easier to do in RapidMiner)
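For anyone who wants to check an Excel result by hand, here is a minimal sketch of two of the measures named above: Pearson correlation between a feature and the ground truth, and Cohen's kappa between a thresholded feature and a binary label. The function names and example values are illustrative assumptions, and this is not how RapidMiner computes them internally.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation; undefined (division by zero) if either list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def cohens_kappa(pred, truth):
    """Agreement beyond chance between two label lists of equal length."""
    n = len(pred)
    labels = set(pred) | set(truth)
    observed = sum(p == t for p, t in zip(pred, truth)) / n
    expected = sum((pred.count(l) / n) * (truth.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

print(round(pearson_r([1, 2, 3], [2, 4, 6]), 6))   # 1.0
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))    # 0.5
```

Correlation against a 0/1 ground truth is legal arithmetic, which matches the caveat above that it works with binary variables but is not really optimal.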
43
Were you right?
Which of your “just so stories” seem to be correct?
Did any of your features correlate in the opposite direction from what you expected?
44
Assignment 3
Write a brief report for me
Email me an Excel sheet with your features
You don’t need to prepare a presentation
But be ready to discuss your features in class
45
Next Classes
9/25 Special Session
– Using RapidMiner to Produce Prediction Models
– Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)
– Statistical significance tests using linear regression don’t count…
9/30 Advanced Feature Distillation in Excel
– Assignment 3 due
– Online Equation Solver Tutorials should be in your INBOX
46
Upcoming Classes
10/2 Special session on prediction models
– Come to this if you don’t know why student-level cross-validation is important, or if you don’t know what J48 is
10/7 Advanced Feature Distillation in Google Refine
10/9 Special session? TBD.
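As a preview of the student-level cross-validation idea, here is a rough sketch that assigns every row from the same student to the same fold, so a model is never tested on a student it was trained on. The round-robin assignment and the function name are assumptions for illustration; real tools typically randomize which students go in which fold.

```python
def student_level_folds(students, k):
    """Return a fold index per row, keeping each student's rows in one fold."""
    unique = sorted(set(students))
    fold_of = {s: i % k for i, s in enumerate(unique)}  # round-robin by student
    return [fold_of[s] for s in students]

rows = ["s2", "s1", "s2", "s3", "s1"]
print(student_level_folds(rows, 2))  # [1, 0, 1, 0, 0]
```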