Feature Selection on Time-Series Cab Data

1 Feature Selection on Time-Series Cab Data
Yingkit (Keith) Chow

2 Contents
Introduction
Features Considered
FCBF (filter-type feature selection)
FCBF-PCA (my variation)
Conclusion

3 All Features Considered
Each time sample consists of the following features:
Day of week and time of day (the first two features)
taxis[t, 6:9], taxis[t-1, 6:9], …, taxis[t-5, 6:9]
[6:9] indexes the columns of the matrix taxis holding the four cab counts: cabs entering with the meter off, cabs entering with the meter on, cabs exiting with the meter off, and cabs exiting with the meter on.
Not all of these features will be relevant to classifying whether a game is present. A sketch of assembling this feature matrix follows.
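A minimal sketch of how such a lagged feature matrix could be assembled. The helper name, the assumption that columns 0 and 1 of taxis hold day of week and time of day, and the translation of the MATLAB-style inclusive range 6:9 into Python's slice(6, 10) are all illustrative, not taken from the original code:

import numpy as np

def build_lagged_features(taxis, lags=5, cab_cols=slice(6, 10)):
    # One row per time sample t >= lags:
    # [day of week, time of day, taxis[t, cab_cols], taxis[t-1, cab_cols], ..., taxis[t-lags, cab_cols]]
    rows = []
    for t in range(lags, taxis.shape[0]):
        feats = [taxis[t, 0], taxis[t, 1]]        # assumed day-of-week and time-of-day columns
        for k in range(lags + 1):
            feats.extend(taxis[t - k, cab_cols])  # current and lagged cab counts
        rows.append(feats)
    return np.asarray(rows)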

4 Fast Correlation-Based Filter
Algorithm:
Find features that are relevant (SU(i, C) > threshold), where SU is symmetric uncertainty, described on the next slide.
Remove redundant features by comparing the remaining features: remove feature j if SU(i, j) >= SU(j, C).
See [1] for the pseudo code. Here i is a feature already in the selected subset (starting with the single most informative feature); each remaining feature j is then checked for redundancy against i and for how informative it is about the class C.
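A compact sketch of that selection loop, my reading of the pseudo code in [1] rather than the author's code; it assumes the SU scores have already been computed (for example with the helper sketched under the equations slide):

import numpy as np

def fcbf_select(su_fc, su_ff, threshold=0.01):
    # su_fc[i]    : symmetric uncertainty between feature i and the class C
    # su_ff[i, j] : symmetric uncertainty between features i and j
    # Step 1: keep only relevant features, ordered by SU with the class.
    candidates = [i for i in np.argsort(su_fc)[::-1] if su_fc[i] > threshold]
    selected = []
    while candidates:
        i = candidates.pop(0)                     # most informative remaining feature
        selected.append(i)
        # Step 2: drop every remaining feature j that i makes redundant,
        # i.e. SU(i, j) >= SU(j, C).
        candidates = [j for j in candidates if su_ff[i, j] < su_fc[j]]
    return selected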

5 Equations [1]
Information gain (IG): IG(X|Y) = H(X) - H(X|Y)
Symmetric uncertainty (SU): SU(X, Y) = 2 * IG(X|Y) / [H(X) + H(Y)]
H(X) is the entropy of feature X, and H(X|Y) is the conditional entropy of X given Y. SU is used instead of IG because it compensates for features having more values and normalizes the score to the range [0, 1] [1].
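As a concrete illustration, on discretized (binned) features both quantities can be estimated from empirical counts; the function names below are mine, not from the original code:

import numpy as np

def entropy(x):
    # Empirical entropy (in bits) of a discrete-valued vector.
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetric_uncertainty(x, y):
    # SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)),
    # with IG(X|Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X, Y).
    joint = np.array([f"{a}|{b}" for a, b in zip(x, y)])  # joint labels for H(X, Y)
    hx, hy, hxy = entropy(x), entropy(y), entropy(joint)
    denom = hx + hy
    return 2.0 * (hx + hy - hxy) / denom if denom > 0 else 0.0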

6 FCBF Classifier (MATLAB classify, linear), Number of Bins = 96
Threshold = 0.01, Accuracy = 91.9%
I was expecting the features immediately before the start of a game to be selected. However, this is likely because we are trying to classify whenever a game is active, so samples in the middle and at the end of a game also rely on data from 5 samples earlier to help in the decision.
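A rough Python analog of this pipeline: equal-width binning for the entropy estimates, FCBF selection, then a linear discriminant standing in for MATLAB's classify with the 'linear' option. The bin count and threshold follow the slide; everything else is an assumption for illustration:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def discretize(X, n_bins=96):
    # Equal-width binning of each column; FCBF needs discrete values
    # to estimate the entropies behind SU.
    Xb = np.empty(X.shape, dtype=int)
    for j in range(X.shape[1]):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), n_bins + 1)
        Xb[:, j] = np.digitize(X[:, j], edges[1:-1])
    return Xb

# Hypothetical usage, with X_train/y_train/X_test built from the taxi data:
# Xb = discretize(X_train, n_bins=96)
# keep = fcbf_select(su_fc, su_ff, threshold=0.01)  # SU scores computed from Xb
# clf = LinearDiscriminantAnalysis().fit(X_train[:, keep], y_train)
# game_pred = clf.predict(X_test[:, keep])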

7 Choice of Number of Bins
Num bins = 96: results shown on the previous slide (red is the ground truth of a game, blue is my classification).
Num bins = 20: accuracy = 58.6%. Here the algorithm breaks down and selects only feature 2, the time of day. The blue curve becomes periodic: the same time segment of every day is classified as a game.
Throughout this presentation I train on samples 11:10000 of the taxi data and test on samples 25000:32000. Both the blue and red plots are boolean; the scaling is only there to help visualize how the classification is doing.

8 FCBF-PCA
FCBF compares individual features with each other. We can use PCA to try to capture a group of features: for example, one eigenvector might capture the shape of the rise in cabs arriving with meters on before a game, or the increase in cabs entering with meters off toward the end of a game. An example is shown on the next slide, and a sketch of the projection step follows below.
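A minimal PCA projection sketch; the number of components kept and the function name are assumptions, and it is these projections, not the raw columns, that FCBF then scores:

import numpy as np

def pca_project(X, n_components=4):
    # Project the features onto the leading eigenvectors of their
    # covariance matrix; each projection summarizes a group of features.
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]   # leading eigenvectors
    return Xc @ top, top                       # projected data and the eigenvectors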

9 Cab Traffic Behavior
Before the start of a game: cabs entering with the meter on and cabs exiting with the meter off are high.
Towards the end of a game: cabs entering with the meter off and cabs exiting with the meter on are high.

10 FCBF-PCA Classifier (MATLAB classify, linear), Number of Bins = 20
Threshold = 0.01, Accuracy = 92.9%
Note: the features here are projections onto the eigenvectors, not the original feature dimensions.

11 Conclusions
The choice of the number of bins has an enormous impact on performance (possibly due to the 96 discrete values of the time-of-day variable).
FCBF-PCA was less susceptible to the choice of numBins: 10, 20, and 100 bins all resulted in approximately 91% accuracy.

12 Future Work
Currently I am using labels of game or not game. I will try to make it work for detecting the first sample of a game, with a second classifier to detect the last sample of a game, since mid-game traffic generally has entirely different characteristics from the beginning and end of a game. However, I might be limited by the number of samples.

13 Questions I’m not currently in NYC so please send questions or comments to:

14 Citations
[1] Lei Yu and Huan Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution", ICML 2003.
[2] Lei Yu and Huan Liu, "Efficient Feature Selection via Analysis of Relevance and Redundancy", Journal of Machine Learning Research 5, 2004.

