Published by Rosaline McCarthy. Modified over 6 years ago.
1
A Smart Tool to Predict Salary Trends of H1-B Holders
Ramya Ramesh, Akshay Poosarla
Under the guidance of: Prof. Meiliu Lu
2
Problem Statement
• The H-1B is an employment-based, non-immigrant visa category for temporary foreign workers in the United States.
• The wage prediction model predicts the wages of H-1B employees from the H-1B petition dataset.
• Goal: build an efficient model to predict the salary trends of H-1B workers and analyze the key factors that influence wages.
3
Data Collection
• The dataset, the 2016 non-immigrant employee petition dataset, was collected from a website
• 647,852 rows and 41 columns
• 633,943 H-1B applications
4
Understanding More about the Dataset
5
Understanding More About Data
• Case_status: Certified, Certified_withdrawn, Denied, Withdrawn
• Employer_Name: around 12,000 different employer names are listed
• Agent_Attorney_Name: name of the agent who filed the petition
• Prevailing_wage: total wage
• Unit_of_Pay: Hourly, Weekly, Monthly, Biweekly, Yearly
• H1-B dependent: Yes, No
• Worksite_state: state of the workplace
• Soc_code: Standard Occupational Classification code
6
Data Preprocessing
• The dataset is very large and not clean
7
Data Preprocessing
• Converted every prevailing wage to a prevailing wage per year (original units included weekly, bi-weekly, hourly, and monthly)
• Handling the missing values:
  • Replaced missing numeric values with the mean
  • Replaced missing categorical values with the mode
  • This attempt was unsuccessful, so the rows with missing values were removed
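The preprocessing steps on this slide can be sketched in a few lines of Python. This is only an illustration, not the code used in the project: the conversion factors (a 40-hour week and 52 weeks per year) and the function names `annualize`, `impute`, and `drop_incomplete` are assumptions that do not appear in the slides.

```python
from statistics import mean, mode

# Assumed conversion factors (not stated in the slides):
# 40 working hours per week and 52 weeks per year.
PERIODS_PER_YEAR = {
    "hourly": 40 * 52,
    "weekly": 52,
    "biweekly": 26,
    "monthly": 12,
    "yearly": 1,
}


def annualize(wage, unit):
    """Convert a wage quoted per `unit` into a yearly wage."""
    key = unit.strip().lower().replace("-", "").replace(" ", "")
    return wage * PERIODS_PER_YEAR[key]


def impute(values, numeric):
    """Fill None entries with the mean (numeric) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]


def drop_incomplete(rows):
    """Keep only the rows (dicts) in which no field is None."""
    return [r for r in rows if all(v is not None for v in r.values())]
```

For example, `annualize(30.0, "Hourly")` yields a yearly wage under the 2,080-hour assumption, and `drop_incomplete` corresponds to the fallback the slide describes, where rows with missing values are removed instead of imputed.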
8
Data Preprocessing
• The data must be consistent before it is passed to the model
• Made values consistent across the dataset in different columns:
  • Job location: "NewYork", "NY", and "New York" all became "NY"
  • Job titles: variants such as "Computer Information system manager", "Computer and Information System Manager", and "Computer systems Manager" were merged into one title
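One way to make such values consistent is to reduce each string to a canonical key and look it up in a small alias table. The sketch below is hypothetical: the alias tables, the stop-word list, the crude plural stripping, and the choice of canonical title are all assumptions made for illustration, not the rules used in the project.

```python
import re

# Hypothetical alias tables; a real cleanup would need many more entries.
STATE_ALIASES = {"newyork": "NY", "new york": "NY", "ny": "NY"}
TITLE_ALIASES = {"computer system manager": "computer information system manager"}
STOPWORDS = {"and", "of", "the"}


def normalize_state(raw):
    """Map known spellings of a state to its two-letter code."""
    key = " ".join(raw.lower().split())
    return STATE_ALIASES.get(key, raw.strip())


def normalize_title(raw):
    """Lowercase, drop stop words, strip simple plurals, then apply aliases."""
    words = [w for w in re.findall(r"[a-z]+", raw.lower()) if w not in STOPWORDS]
    words = [
        w[:-1] if w.endswith("s") and not w.endswith(("ss", "is", "us")) else w
        for w in words
    ]
    key = " ".join(words)
    return TITLE_ALIASES.get(key, key)
```

With these tables, the three job-title variants listed on the slide all map to the same string, and the three spellings of New York all map to "NY".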
9
Data Preprocessing
• Removed the outliers in the Prevailing_wage column
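The slides do not say how outliers were identified. A common choice, shown here purely as an assumption, is the interquartile-range (IQR) rule, which drops values more than 1.5 IQRs outside the first and third quartiles. The function name `remove_outliers` and the multiplier `k` are illustrative.

```python
from statistics import quantiles


def remove_outliers(wages, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule, not from the slides)."""
    q1, _, q3 = quantiles(wages, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [w for w in wages if low <= w <= high]
```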
10
Feature Selection
• The most important part of preprocessing
11
Feature Selection
• Boruta package
12
Feature Selection (continued)
• Job_title
• Employer_name
• Employer_state
• Agent_Attorney_name
• Agent_Attorney_state
• Soc_code
• Soc_name
13
Data Insights
14
Data Insights
15
Data Insights
16
Data Insights
17
Naïve Bayes Classification
• Divided the prevailing wage into 9 classes spanning 25,000 to 118,000.
• The width of each class was calculated using the normalization technique.
• Trained a Naïve Bayes classifier to predict which class each prevailing wage falls into.
• The accuracy obtained was too low.
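The slide does not define the "normalization technique". One plausible reading, assumed here, is equal-width binning, where the class width is the range divided by the number of classes: (118,000 - 25,000) / 9, or roughly 10,333. The sketch below follows that assumption; clamping out-of-range wages to the first or last class is also an assumption.

```python
def class_width(low, high, n_classes):
    """Width of each class under equal-width binning."""
    return (high - low) / n_classes


def wage_class(wage, low=25_000, high=118_000, n_classes=9):
    """Return a class index in 0..n_classes-1 for a yearly wage.

    Wages outside [low, high] are clamped to the first or last class,
    which is an assumption rather than something stated in the slides.
    """
    width = class_width(low, high, n_classes)
    index = int((wage - low) // width)
    return max(0, min(n_classes - 1, index))
```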
18
Naïve Bayes Classification
• To improve the accuracy of the Naïve Bayes classifier, divided the prevailing wage into only 3 classes:
  • Below 60,000: low
  • Between 60,000 and 90,000: average
  • Above 90,000: high
• Trained the Naïve Bayes classifier on the dataset with one-against-many classes and obtained an accuracy of 83%.
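The three-class labeling and a categorical Naïve Bayes classifier can be sketched as follows. This is a minimal illustration, not the project's code: it predicts among the three classes directly instead of using the one-against-many setup from the slide, it uses add-one (Laplace) smoothing, and it treats exactly 60,000 and 90,000 as "average". All three choices, and the names `wage_band` and `CategoricalNB`, are assumptions.

```python
import math
from collections import Counter, defaultdict


def wage_band(wage):
    """Label a yearly wage as low, average, or high."""
    if wage < 60_000:
        return "low"
    if wage <= 90_000:
        return "average"
    return "high"


class CategoricalNB:
    """Tiny categorical Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        n_features = len(rows[0])
        # counts[label][i][value] = times feature i took `value` in that class
        self.counts = defaultdict(lambda: [Counter() for _ in range(n_features)])
        self.vocab = [set() for _ in range(n_features)]
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.counts[label][i][value] += 1
                self.vocab[i].add(value)
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(n / total)
            for i, value in enumerate(row):
                seen = self.counts[label][i][value]
                score += math.log((seen + 1) / (n + len(self.vocab[i])))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A toy usage, with made-up rows of (state, job title): label each wage with `wage_band`, call `CategoricalNB().fit(rows, labels)`, then `predict` on a new row.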
19
Support Vector Machines
• Took a random sample of the dataset.
• Trained an SVM on the sample with one-against-many classes.
• Obtained an accuracy of 95.84%.
20
Multilinear Regression
• To train on the dataset using multilinear regression, all the categorical values have to be changed to numeric.
• Employer_name is a categorical value in our dataset, and there are more than 10,000 unique employer names.
• Tried converting the categorical values to binary using Python pandas, but the CSV output was corrupted because the conversion was creating a 12,000 × 12,000 square matrix.
• Took a random sample of the data and trained the model using multilinear regression.
• The R mean squared error was found to be 0.5.
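The size problem with binary (one-hot) encoding is easy to see with a little arithmetic, and a sparse representation avoids it by storing one column index per row instead of a full row of zeros and ones. The sketch below is illustrative only; the function names are hypothetical, and the slides do not say whether a sparse encoding was tried.

```python
def dense_one_hot_cells(n_rows, n_levels):
    """Number of cells in a dense one-hot matrix."""
    return n_rows * n_levels


def sparse_one_hot(values):
    """Return (levels, column_index_per_row).

    Each row of the implied one-hot matrix has a single 1 at the
    returned column index, so only one integer per row is stored.
    """
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    return levels, [index[v] for v in values]
```

For a 12,000 × 12,000 matrix, `dense_one_hot_cells(12_000, 12_000)` gives 144 million cells, while the sparse form needs only one integer per row plus the list of levels.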
21
Decision Trees
• Obtained an accuracy of 94.94%
22
Text Analysis
23
Results
Model                                        Accuracy
Naïve Bayes Classifier (one against many)    83%
Support Vector Machines (random sampling)    95.844%
Decision Trees                               94.94%
Multilinear Regression (random sampling)     R-squared error: 0.5
24
Limitations of R
• For a large dataset, converting categorical values into numeric ones was a major difficulty, because labels have to be assigned to each of the factors. Assigning labels to a categorical variable that has 12,000 levels is a tedious process.
• We could not train a random forest in R on the dataset, because it cannot handle categorical predictors with more than 32 categories.
• When the decision tree was plotted, predictor variables with more than 52 levels were not printed, so we could not interpret the rules of the decision tree.
• Visualization is a limitation in R.
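A common workaround for a limit on the number of categorical levels is to keep the most frequent levels and lump everything else into a single catch-all level. The slides do not mention trying this; the sketch below is an assumption-laden illustration written in Python, and the name `lump_rare_levels` and the "Other" label are hypothetical. The same idea can be applied to a factor in R.

```python
from collections import Counter


def lump_rare_levels(values, max_levels=32, other="Other"):
    """Cap the number of distinct levels at `max_levels`.

    The (max_levels - 1) most frequent levels are kept and every
    other value is replaced by `other`.
    """
    counts = Counter(values)
    if len(counts) <= max_levels:
        return list(values)
    keep = {level for level, _ in counts.most_common(max_levels - 1)}
    return [v if v in keep else other for v in values]
```

Applied to Employer_name, this would reduce roughly 12,000 employer names to the 31 most common ones plus "Other", at the cost of losing detail about smaller employers.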
25
THANK YOU