Statistical Analysis with Big Data


1 Statistical Analysis with Big Data
Dr. Fred Oswald, Rice University. CARMA webcast, November 6, 2015, University of South Florida, Tampa, FL

2 Preface: I-O psychology and business management are slowly being recognized in the big data arena. Slowly. Witness stories about “big data” in HR and personnel selection, with little to no mention of the (DECADES) of personnel assessment expertise in HR and I-O psychology…

3 Adding insult to injury…

4 Overview Run LASSO regression and random forests, two predictive models that are relatively novel to organizational research and useful for big data (and small data). Demonstrate a philosophy: Flexibly model to improve prediction – but don’t overfit. Discuss four ideas/implications related to predictive models such as those we ran.

5 Example 1 LASSO Many R packages exist to perform 'big data' types of analyses, such as the LASSO (Least Absolute Shrinkage and Selection Operator). LASSO not only estimates regression weights, as 'normal' (OLS) regression does; it also shrinks many of those weights, often all the way to exactly zero. Yeehaw!
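A minimal sketch of the basic call, assuming the glmnet package and simulated stand-in data (the variable names and values below are hypothetical, not the data from the talk):

library(glmnet)

set.seed(123)
n <- 200; p <- 12
x <- matrix(rnorm(n * p), n, p)              # hypothetical predictors
y <- 0.5 * x[, 1] - 0.3 * x[, 2] + rnorm(n)  # outcome driven by only two of them

fit <- glmnet(x, y, alpha = 1)   # alpha = 1 requests the L1 (LASSO) penalty; fits a whole sequence of lambdas
coef(fit, s = 0.1)               # weights at one value of lambda; many are exactly zero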

6 LASSO Check out the coefficients across the range of values of the “tuning parameter,” lambda. You can see where LASSO regression consistently selects predictive variables and excludes others. Yeehaw!
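A short sketch of that coefficient trace, again with glmnet on simulated stand-in data:

library(glmnet)
set.seed(123)
x <- matrix(rnorm(200 * 12), 200, 12)
y <- 0.5 * x[, 1] - 0.3 * x[, 2] + rnorm(200)
fit <- glmnet(x, y, alpha = 1)

# Each curve is one predictor's weight across log(lambda);
# as lambda increases, more weights are shrunk to exactly zero.
plot(fit, xvar = "lambda", label = TRUE)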

7 LASSO Let's see what variables tend to predict the job-search behaviors of employed managers (Boudreau, 2001). [R demonstration] 12 predictors of managers' job search behaviors. [You would use actual data; here I simulated 1,000 cases from a population correlation matrix – see the sketch below.]
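A hedged sketch of that kind of simulation, drawing N = 1,000 cases from a population correlation matrix with MASS::mvrnorm. The matrix, predictor names, and criterion weights below are placeholders, not the actual Boudreau (2001) values:

library(MASS)

p <- 12
R_pop <- matrix(0.20, p, p)   # placeholder population correlations of .20
diag(R_pop) <- 1

set.seed(2015)
X <- mvrnorm(n = 1000, mu = rep(0, p), Sigma = R_pop)
colnames(X) <- paste0("pred", 1:p)   # hypothetical predictor names

# placeholder criterion: job-search behavior driven by a few of the predictors
y <- drop(X %*% c(0.3, -0.2, 0.2, rep(0, 9)) + rnorm(1000))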

8 Nothin’……………………………………OLS
All LASSO solutions varying lambda… Considering all other predictors at a given value of lambda… job satisfaction (less sat  > search) compensation (less $  > search) gender (female  > search) Nothin’……………………………………OLS

9 Least Angle Regression = LARS (graph = the trace of all LASSO soln's)
I think of LARS as Tiptoe Regression™, as it is more cautious than stepwise regression… Start with all coefficients = 0. The predictor with the highest validity is the first one entered. But now don't step, tiptoe… Increase its regression weight from 0 toward its 'normal' regression value until one of the other predictors correlates with the residual (y − ŷ) just as much. Enter that predictor next. Now move the weights of those two predictors toward their joint 'normal' regression solution, until a third predictor correlates as much with the new residual. Enter that predictor. Keep goin' until all predictors are entered. This method efficiently provides all LASSO solutions.
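If you want to see that tiptoeing path directly, here is a brief sketch with the lars package on simulated stand-in data:

library(lars)

set.seed(1)
x <- matrix(rnorm(200 * 12), 200, 12)
y <- 0.5 * x[, 1] - 0.3 * x[, 2] + rnorm(200)

fit <- lars(x, y, type = "lar")  # least angle regression; type = "lasso" gives the LASSO path
plot(fit)                        # the trace of all solutions, predictors entering one by one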

10 In general: Cross-validation
As mentioned, other weights may work better across other samples (e.g., unit weights – but is there something better?). How to find out? Train the model on a given set of data (develop the weights). Test the model on a fresh set of data (apply the weights to new data; how good are the predictions?).
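A minimal sketch of that train/test idea with a single holdout split (simulated stand-in data and an ordinary lm() for simplicity):

set.seed(42)
n <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 0.4 * dat$x1 + rnorm(n)

train_rows <- sample(n, size = 0.7 * n)   # 70% of cases to develop the weights
train <- dat[train_rows, ]
test  <- dat[-train_rows, ]               # fresh cases to test them

m <- lm(y ~ x1 + x2, data = train)        # develop the model on the training set
pred <- predict(m, newdata = test)        # apply the weights to new data
cor(pred, test$y)                         # how good are the predictions?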

11 LASSO: k-fold Cross-validation
Train the model on a given set of data (develop the weights). Test the model on a fresh set of data (apply the weights to new data; how good are the predictions?). 10-fold cross-validation [R demonstration]: Divide the data into 10 equal parts or "folds" (randomly, or stratified random). Develop the model on 9 of the folds; test the model on the remaining fold. Do this for each fold, so that every case receives a prediction from a model that was not built on it; pool these out-of-fold predictions to evaluate the model. [demonstrate w/ LASSO regression]
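A sketch of 10-fold cross-validated LASSO with cv.glmnet (again on simulated stand-in data):

library(glmnet)

set.seed(7)
x <- matrix(rnorm(1000 * 12), 1000, 12)
y <- 0.4 * x[, 1] - 0.3 * x[, 2] + rnorm(1000)

cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)  # each fold is held out once
plot(cvfit)                    # CV error across the sequence of lambdas
cvfit$lambda.min               # lambda minimizing CV error
cvfit$lambda.1se               # a simpler model within one SE of the minimum
coef(cvfit, s = "lambda.1se")  # the retained predictors at that simpler lambda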

12 LASSO k-fold Cross-validation
[Cross-validation plot: tuning parameter (lambda) on one axis, number of retained predictors along the top, with the 'simpler' and 'optimal' regions of the curve marked.]

13 LASSO is swell, but also check out…elastic net!
When several predictors are correlated, LASSO will tend to select one of them rather than the group. [Also note that p can be >> N.] The elastic net will encourage selecting the group of predictors in this case: the elastic net encourages parsimony (like LASSO) yet also tries to select groups of related variables when they are predictive (like ridge regression). [Comparison of penalty terms: OLS regression, ridge regression, LASSO, elastic net – see the sketch below.]
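A sketch of the elastic net in glmnet, where alpha mixes the ridge (alpha = 0) and LASSO (alpha = 1) penalties; alpha = 0.5 below is an arbitrary illustration on simulated stand-in data:

library(glmnet)

set.seed(11)
x <- matrix(rnorm(500 * 12), 500, 12)
x[, 2] <- 0.8 * x[, 1] + 0.6 * rnorm(500)     # make predictors 1 and 2 correlated
y <- 0.4 * x[, 1] + 0.4 * x[, 2] + rnorm(500)

enet  <- glmnet(x, y, alpha = 0.5)      # elastic net penalty
cvfit <- cv.glmnet(x, y, alpha = 0.5)   # cross-validate lambda at this alpha
coef(cvfit, s = "lambda.1se")           # correlated predictors tend to enter together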

14 Example 2: Random Forests
First look at trees, used to classify data. [Tree diagram: splits on Cognitive Ability, Conscientiousness, Biodata, Teamwork, and Openness; each split branches into high (> X) versus low (≤ X); the terminal nodes give predicted task performance scores (3.2, 3.6, 3.8, 4.1, 4.3, 5.6).]
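A minimal sketch of a single regression tree with the rpart package; the predictor names mirror the slide, but the data are simulated placeholders:

library(rpart)

set.seed(3)
n <- 500
dat <- data.frame(
  cognitive_ability = rnorm(n),
  conscientiousness = rnorm(n),
  biodata           = rnorm(n),
  teamwork          = rnorm(n),
  openness          = rnorm(n)
)
dat$task_performance <- 4 + 0.5 * dat$cognitive_ability +
  0.3 * dat$conscientiousness + rnorm(n)

tree <- rpart(task_performance ~ ., data = dat)  # each split sends cases high (>) or low (<=)
plot(tree); text(tree)                           # terminal nodes are the predicted scores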

15 Example 2 Random Forests
Draw a large number of bootstrapped samples (e.g., k = 500 samples, each drawn with replacement and containing roughly 2/3 of the unique cases). For each sample, build a tree similar to the one just illustrated, but with a catch: at each node, only consider a random subset of the predictors as candidates to split the node (the square root of the total number of predictors is a common default). This yields diverse (decorrelated) trees: different variables at each node, different cutpoints at each variable.
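A sketch of that procedure with the randomForest package (simulated stand-in data). Note that mtry = sqrt(p) is the package's classification default; for a numeric outcome it is set explicitly here to match the slide:

library(randomForest)

set.seed(99)
x <- data.frame(matrix(rnorm(1000 * 12), 1000, 12))
y <- 0.4 * x[, 1] - 0.3 * x[, 2] + rnorm(1000)

rf <- randomForest(x, y,
                   ntree = 500,                   # number of bootstrapped trees
                   mtry  = floor(sqrt(ncol(x))))  # random subset of predictors per split
rf                                                # prints OOB error / % variance explained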

16 Example 2 (cont'd) Random Forests
For each tree, look at the "out of bag" (oob) data – the cases that did NOT participate in building that tree. Run those cases down the tree to get their predicted scores. Then, for each case, average the predicted scores across the trees in which it was out of bag. [R demonstration]

case   Tree 1      Tree 2   ...   Tree k   Average predicted ŷ
1      (in tree)   3.4      ...   2.1      3.554
2      4.5                  ...            4.312
...
N      2.6         2.4      ...            2.561
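A sketch of pulling those out-of-bag predictions: with the randomForest package, calling predict() on the fitted forest with no new data returns, for each case, the averaged prediction from the trees in which that case was out of bag (simulated stand-in data again):

library(randomForest)

set.seed(99)
x <- data.frame(matrix(rnorm(1000 * 12), 1000, 12))
y <- 0.4 * x[, 1] - 0.3 * x[, 2] + rnorm(1000)
rf <- randomForest(x, y, ntree = 500)

oob_pred <- predict(rf)     # one averaged OOB prediction per case
head(cbind(oob_pred, y))    # compare predicted vs. observed
cor(oob_pred, y)            # an honest (cross-validated) estimate of prediction accuracy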

17 A simple model w/ false assumptions can beat a complex model with truer ones (the latter might need more data to show its strength). From Domingos (2012).

18 Some thoughts while under the hood… 1.
If Big Data only captures the 3Vs (volume, velocity, variety) on an ever-expanding hard drive, it is useless. Taylor (2013, HBR Blog): “We can amass all the data in the world, but if it doesn’t help to save a life, allocate resources better, fund the organization, or avoid a crisis, what good is it?” Thinking about the increasing amount of data companies have on hand: Some of those data are directly relevant to selection (lots of online applications, screening tests). Other data might be relatively indirect, but an argument can be made for selection (resume text mining). Still other data are indirect and difficult to justify even if predictive (e.g., time to complete an application online). …What do we do in this situation?

19 Some thoughts while under the hood… 2.
Useful ‘signals’ in data discovered through predictive modeling could be amplified by developing measures that collect more data (given enough development time, testing time, $...). (knowledge → new measures → new knowledge) (Fayyad et al., 1996)

20 Some thoughts while under the hood… 3.
Work to engineer knowledge, not just prediction. If predictions become more accurate and robust than ever… will we understand (theorize about, legally defend) them any better? (related idea: why do we have reliability, why not just validity?) Generally, our substantive research focuses on correlations, mean differences, etc. at the factor or scale level, not at the single-item level as big data might. History tells us that item-level analyses can be hard to interpret (e.g., DIF). Interpretable surprises are hard to find. At what level(s) is there knowledge?

21 Some thoughts while under the hood… 4.
Big Data analytics provides reasons/opportunities to collaborate – if there is a culture for that. Ties to other org functions (perhaps served by Big Data)

22 Thank you! Fred Oswald

