Presentation on theme: "How to determine ranking out of complex data types"— Presentation transcript:
1. Ranking Algorithms: How to determine ranking out of complex data types
2. Ranking Topics in a Presentation
Scoring columns: Supports Presentation Theme (weight .25), Supports Class Understanding (weight .75), Weighted Total.
- Examples and Cases: .55, .9, .8125
- Quarterback Rating example: .61
- Target Example: .5, .8, .725
- Data Pre-processing: .77, .7, .7175
- Why
- How
- Outcomes
- Scaling Data For Indexing: .87, .7167
- Decision trees
- Cluster Analysis: .675
- Support Vector Machines
- Collaborative Filtering
- Drawing Conclusions: .65
- Multiple regression
- Anova
- Meta: .575
- Conclusion: .2, .4
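The rows on this slide that show all three values are consistent with a simple weighted sum of the two criterion scores. A minimal sketch of that calculation follows; the function and variable names are illustrative, and only rows whose three values are legible in the transcript are included.

```python
# Criterion weights taken from the column headers on the slide.
THEME_WEIGHT = 0.25
UNDERSTANDING_WEIGHT = 0.75


def weighted_total(theme_score, understanding_score):
    """Combine the two criterion scores into a single ranking value."""
    return THEME_WEIGHT * theme_score + UNDERSTANDING_WEIGHT * understanding_score


# (theme score, understanding score) for rows that are legible on the slide.
topics = {
    "Examples and Cases": (0.55, 0.9),
    "Target Example": (0.5, 0.8),
    "Data Pre-processing": (0.77, 0.7),
}

# Rank topics from highest to lowest weighted total.
ranked = sorted(topics, key=lambda name: weighted_total(*topics[name]), reverse=True)
for name in ranked:
    print(f"{name}: {weighted_total(*topics[name]):.4f}")
```

Because class understanding carries three times the weight of theme support, a topic with a lower theme score can still outrank one with a higher theme score.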
3. Ranking QBs in the NFL: Passer Rating
This formula is meant to measure a quarterback’s passing performance with a single numeric value.
- The passer rating scale runs from 0 to 158.3.
- It uses 5 variables (completions, yards, touchdowns, interceptions and attempts).
- Each statistic is weighted prior to the second step.
- A min/max limit is applied to each statistic so that one outlier statistic does not dominate the formula.
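The slide does not give the formula itself. The sketch below uses the commonly published NFL passer rating formula, in which each of four per-attempt components is limited to the range 0 to 2.375; the function names are illustrative and the constants should be checked against an official source before relying on them.

```python
def clamp(value, low=0.0, high=2.375):
    """Apply the min/max limit so no single statistic dominates."""
    return max(low, min(high, value))


def passer_rating(completions, attempts, yards, touchdowns, interceptions):
    """Return the passer rating on the 0 to 158.3 scale."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    # Step 1: weight each per-attempt statistic, then limit it.
    a = clamp((completions / attempts - 0.3) * 5)
    b = clamp((yards / attempts - 3) * 0.25)
    c = clamp((touchdowns / attempts) * 20)
    d = clamp(2.375 - (interceptions / attempts) * 25)
    # Step 2: combine the four limited components into one number.
    return (a + b + c + d) / 6 * 100


# A single completed touchdown pass already reaches the top of the scale,
# which is why a small number of attempts can be deceptive.
print(round(passer_rating(1, 1, 13, 1, 0), 1))
```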
4. Advantages and Flaws of Passer Rating
Pros:
- Provides a quick and easy way of evaluating quarterback performance based on a standard formula.
- Captures key variables associated with passer performance.
Cons:
- Incomplete measurement: doesn’t take into consideration dropped passes, times sacked, yards after catch, fumbles, etc.
- Can be deceptive if the number of attempts is small.
- The number of touchdown passes scored is largely dependent on other players.
Scenario: Below are last game’s statistics. Which quarterback should the coach start next game?
- Philip Rivers (current starter): PR = 108.7
- Joseph Gast (backup): PR = 158.3
5. Target: Using ranking algorithms to predict pregnancy
The goal: Since birth records are public, new parents are bombarded with marketing and advertising offers. Target’s goal was to identify parents before the baby was born. More specifically, Target wanted to identify pregnant women in the second trimester and send them coupons for diapers, car seats, etc.
The outcome: Target was successful! Women thought it was creepy. The publicity that followed affected Target negatively.
The solution: Continue to “target” (haha) pregnant women with relevant ads, but purposefully include non-relevant ads so they do not notice.
6. Target: How to determine what metrics to capture
- Target already collected vast data on customers’ purchase habits for various other reasons (e.g., Christmas toys).
- Andrew Pole (Target’s senior analyst of consumer habits) started mining data from the baby shower registry.
- Pole found interesting changes in buyer behavior as the due date approached, such as:
  - Buying more unscented lotion.
  - Buying magnesium, zinc and calcium supplements.
  - Buying unscented soaps.
  - Buying larger quantities of cotton balls, hand sanitizers and washcloths.
7. Target: Putting pregnancy prediction score to use
- In the end, Pole identified 25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score.
- He could also predict shoppers’ due dates well enough to send coupons timed to specific stages of pregnancy.
- Target then ran studies on how to advertise to pregnant women and found that the coupons were more likely to be used when coupled with “random” items.
“Just wait. We’ll be sending you coupons for things you want before you even know you want them.” –Andrew Pole
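The slides do not say how the 25 products were combined into a score. One plausible shape for such a score is a weighted sum of product indicators squashed into the range 0 to 1. The sketch below is purely illustrative: the weights, the baseline, and the logistic form are all assumptions, not Target's model.

```python
import math

# Hypothetical weights: the slides name some products but not their weights.
PRODUCT_WEIGHTS = {
    "unscented lotion": 1.2,
    "calcium supplement": 0.9,
    "magnesium supplement": 0.9,
    "zinc supplement": 0.9,
    "unscented soap": 0.7,
    "cotton balls": 0.5,
    "hand sanitizer": 0.4,
    "washcloths": 0.4,
}
BASELINE = -3.0  # hypothetical offset so an empty basket scores low


def prediction_score(basket):
    """Map a basket of purchased items to a score between 0 and 1."""
    # Each distinct indicator product contributes its weight once.
    total = BASELINE + sum(PRODUCT_WEIGHTS.get(item, 0.0) for item in set(basket))
    return 1 / (1 + math.exp(-total))


print(prediction_score(["unscented lotion", "cotton balls", "zinc supplement"]))
```

Shoppers can then be ranked by this score, and only those above some chosen cutoff would receive the timed coupons.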
8. Data Pre-Processing 1: Why data must be pre-processed
Organizations often need to make decisions from multi-criteria datasets; however:
- Original data often suffers from:
  - Lacking attributes
  - Lacking values
  - Containing aggregates
- Data obtained from different sources is often inconsistent (different attribute names, invalid codes, different data types).
- Datasets may contain errors or outliers.
9. Data Pre-Processing 2: How Preprocessing Works
- Fill in missing values using central tendency along with learning algorithms to predict the value.
- Cluster values to isolate outliers.
- Smooth data using regression.
- Correct inconsistencies using decision-making techniques.
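Two of these steps can be sketched with the standard library alone. This sketch fills missing values with the median (one measure of central tendency) and, in place of the clustering the slide mentions, flags outliers with a simpler median-absolute-deviation rule. The threshold of 3.5 and the helper names are assumptions for illustration.

```python
from statistics import median


def fill_missing(values):
    """Replace None entries with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [0.6745 * abs(v - center) / mad > threshold for v in values]


raw = [10, 12, None, 11, 13, 95]
filled = fill_missing(raw)
flags = flag_outliers(filled)
print(filled)
print(flags)
```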
10. Data Pre-Processing 3: Outcomes of Preprocessing
The purpose of data pre-processing is to produce a better data set without loss of relevant information. This:
- Allows statistical analysis on incomplete datasets.
- Allows the resulting dataset to be uploaded to data visualization software.
- Can be used to test and track many relationships between variables.
- Provides a method of ranking similar data from disparate sources.
11. Scaling Data for Indexing: Decision Trees
- Most common form of index.
- Uses a B-Tree structure to parse values.
- Quick and easy given simple metrics.
- Created by dividing groups of data roughly in half and putting values into each half.
- The process is repeated until each “decision” contains exactly one value.
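The halving process described on this slide can be sketched as a balanced binary search tree built from sorted values. This is a simplification: a real B-tree stores many keys per node, and the class and function names here are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(sorted_values):
    """Split the values roughly in half, recursively, until each node holds one value."""
    if not sorted_values:
        return None
    mid = len(sorted_values) // 2
    return Node(
        sorted_values[mid],
        build(sorted_values[:mid]),
        build(sorted_values[mid + 1:]),
    )


def contains(node, target):
    """Return (found, number of decisions taken) for a lookup."""
    steps = 0
    while node is not None:
        steps += 1
        if target == node.value:
            return True, steps
        node = node.left if target < node.value else node.right
    return False, steps


index = build(list(range(1, 16)))
print(contains(index, 11))
print(contains(index, 16))
```

With 15 values, no lookup needs more than 4 decisions, which is the point of halving the data at every level.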
12. Scaling Data for Indexing: Support-Vector Machines
- Uses machine learning to assign items to classes.
- Requires training and test data.
- Training cost grows faster than linearly (roughly quadratically or worse) with the size of the training data.
- Used to coerce non-standard information into standard classifications (e.g., handwriting recognition algorithms).
Figure: Sample of Support-Vector Application
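To make the idea concrete, here is a minimal sketch of a linear support-vector classifier trained by subgradient descent on the hinge loss. It is a teaching simplification under stated assumptions (tiny hand-made data, fixed learning rate, no kernel); practical work would normally use an established library instead.

```python
def train_linear_svm(points, labels, epochs=200, lr=0.1, lam=0.01):
    """Fit weights w and bias b so that sign(w.x + b) matches the labels (+1 or -1)."""
    w = [0.0] * len(points[0])
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        # Gradient of the regularization term.
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # point is inside the margin: hinge loss is active
                for j, xj in enumerate(x):
                    grad_w[j] -= y * xj / n
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Classify a point by which side of the separating line it falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Training data: two well-separated groups (made up for illustration).
train_x = [(2, 2), (3, 1), (2, 3), (-2, -2), (-1, -3), (-3, -1)]
train_y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(train_x, train_y)

# Test data: points the model has not seen.
print(predict(w, b, (2.5, 2.5)), predict(w, b, (-2.5, -2.5)))
```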
13. Scaling Data for Indexing: Cluster Analysis
- Starts with decision-tree type analysis.
- Data elements are arranged as objects according to their attributes.
- Similar objects are arranged in clusters.
- As clusters get too large, new clusters are formed.
- New data is compared against cluster ranges and indexed accordingly.
- Data retrieval looks at the cluster first, then at objects within the appropriate cluster.
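A rough sketch of cluster-based indexing and retrieval follows. It uses a simple "leader" scheme as a stand-in for the method on the slide: a point joins the nearest existing cluster unless it lies farther than a chosen radius from that cluster's first member, in which case it starts a new cluster. The radius and function names are assumptions.

```python
import math


def nearest(leaders, point):
    """Return (index, distance) of the closest cluster leader, or (None, inf)."""
    best_i, best_d = None, math.inf
    for i, leader in enumerate(leaders):
        d = math.dist(leader, point)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d


def build_index(points, radius):
    """Group points into clusters, opening a new cluster when none is close enough."""
    leaders, clusters = [], []
    for p in points:
        i, d = nearest(leaders, p)
        if i is None or d > radius:
            leaders.append(p)
            clusters.append([p])
        else:
            clusters[i].append(p)
    return leaders, clusters


def lookup(leaders, clusters, query):
    """Look at the cluster first, then at the objects within that cluster."""
    i, _ = nearest(leaders, query)
    return min(clusters[i], key=lambda p: math.dist(p, query))


points = [(1, 1), (1.5, 1.2), (8, 8), (8.5, 7.5), (1.2, 0.8)]
leaders, clusters = build_index(points, radius=2)
print(clusters)
print(lookup(leaders, clusters, (8.4, 7.6)))
```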
14. Collaborative Filtering
- Crowd-sources rankings based on users’ likes and dislikes.
- Gives users a test set of data to rank on a Likert scale.
- The Likert scale is dropped; the relative ranking is retained.
- Users are matched with others who have similar tastes, making it possible to predict how they would rank new objects.
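A minimal sketch of the matching step: two users are compared only by whether they order pairs of shared items the same way, so the Likert values themselves drop out. The user names, ratings, and the choice to borrow the single most similar user's rating are all illustrative assumptions.

```python
from itertools import combinations


def agreement(a, b):
    """Fraction of shared item pairs that both users rank in the same order."""
    shared = sorted(set(a) & set(b))
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0
    same = sum(1 for x, y in pairs if (a[x] - a[y]) * (b[x] - b[y]) > 0)
    return same / len(pairs)


def predict(ratings, user, item):
    """Borrow the rating of the most similar user who has rated the item."""
    candidates = [u for u in ratings if u != user and item in ratings[u]]
    if not candidates:
        return None
    neighbor = max(candidates, key=lambda u: agreement(ratings[user], ratings[u]))
    return ratings[neighbor][item]


# Made-up Likert ratings (1 to 5) for items A through D.
ratings = {
    "ann": {"A": 5, "B": 3, "C": 1},
    "bob": {"A": 4, "B": 2, "C": 1, "D": 5},
    "cat": {"A": 1, "B": 3, "C": 5, "D": 1},
}
print(predict(ratings, "ann", "D"))
```

Here ann and bob order A, B and C identically even though their raw scores differ, so bob's opinion of D is used as the prediction for ann.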
15. Drawing Conclusions: Multiple Regression
Uses multiple variables to predict a linear relationship.
- One dependent variable; k explanatory variables.
- β = slope terms.
- Multiple coefficient of determination = R².
- R² never decreases as you add explanatory variables; however, this does not mean the model is better.
- Adjusted R² weighs errors more heavily by penalizing the model for adding bad explanatory variables.
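The slide's formula for adjusted R² did not survive in the transcript. The standard definition is 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of explanatory variables. A small sketch with made-up numbers shows the penalty at work:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k explanatory variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than explanatory variables plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Adding a third variable nudges R-squared up from 0.900 to 0.901,
# yet the adjusted value falls because the gain does not justify the extra term.
print(adjusted_r2(0.900, n=11, k=2))
print(adjusted_r2(0.901, n=11, k=3))
```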
16. Drawing Conclusions: Factorial Anova
- Used when the independent variables are categorical (a factorial design has two or more of them); otherwise use multiple regression.
- Tests the extent to which one variable depends on changes in other variables.
- Great for analyzing the interaction of categorical variables and how they relate to other variables.
- The effect is measured by dividing the data into categories and comparing the mean sum of squares for each category to the total sum of squares.
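The sum-of-squares comparison can be sketched for a single categorical factor. This is a simplification of the factorial case, which would also partition out a second factor and the interaction; the data and names are made up for illustration.

```python
def sums_of_squares(groups):
    """Return (between-category sum of squares, total sum of squares)."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between, ss_total


groups = {"a": [1, 2, 3], "b": [4, 5, 6]}
ss_between, ss_total = sums_of_squares(groups)
# Share of the total variation explained by category membership.
print(ss_between, ss_total, ss_between / ss_total)
```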
17. Drawing Conclusions: Meta Analysis
- Meta-analysis contrasts and combines the results of different studies, usually ones with small sample sizes.
- A meta-analysis measures to what extent different studies on a common topic produced the same effect.
- The inverse of the variance (σ²) is often used as a weight so that larger samples have a greater effect.
Pros:
- Improved precision and accuracy of estimates due to more data.
- Results can be generalized to larger populations.
- A hypothesis test can be applied to summary estimates.
Cons:
- Publication bias: negative results are less likely to be published.
- Agenda-driven bias: cherry-picked studies.
- Simpson’s paradox.
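Inverse-variance weighting can be sketched as follows, assuming a simple fixed-effect pooling of per-study effect estimates. The two studies and their numbers are made up for illustration.

```python
def pooled_effect(effects, variances):
    """Combine study effects using weights of 1 / variance."""
    weights = [1 / v for v in variances]
    estimate = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_variance = 1 / sum(weights)
    return estimate, pooled_variance


# Study 1 is more precise (smaller variance), so it pulls the estimate toward 0.2.
estimate, pooled_variance = pooled_effect(effects=[0.2, 0.6], variances=[0.01, 0.04])
print(estimate, pooled_variance)
```

The pooled variance is smaller than either study's own variance, which is the "improved precision" listed among the pros.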
18. Conclusion
- Using rankings to analyze multiple variables is an extremely useful tool for expedited decision making.
- Data preprocessing allows statistical analysis of incomplete datasets (improves data hygiene).
- Indexes are scaled by common attributes:
  - A decision tree deals with 1 attribute.
  - A cluster deals with multiple attributes.
  - Support vector machines rely on computer-generated algorithms for identifying similarity.
- Statistical tests such as regression analysis, factorial Anova and meta-analysis can be used to draw conclusions about whether or not variables are related to each other.