A Smart Tool to Predict Salary Trends of H1-B Holders

Presentation transcript:

A Smart Tool to Predict Salary Trends of H1-B Holders
Ramya Ramesh, Akshay Poosarla
Under the guidance of: Prof. Meiliu Lu

Problem Statement
• The H-1B is an employment-based, non-immigrant visa category for temporary foreign workers in the United States.
• The wage prediction model predicts the wages of H-1B employees from the H-1B petition dataset.
• Goal: build an efficient model to predict the salary trends of H-1B workers and analyze the key factors that influence wages.

Data Collection
• The dataset was collected from the website https://app.enigma.io/
• 2016 non-immigrant employee petition dataset
• 647,852 rows and 41 columns
• 633,943 H-1B applications

Understanding More about the Dataset

Understanding More About the Data
• Case_status: Certified, Certified_withdrawn, Denied, Withdrawn
• Employer_Name: around 12,000 different employer names are listed
• Agent_Attorney_Name: name of the agent who filed the petition
• Prevailing_wage: total wage
• Unit_of_Pay: Hourly, Weekly, Monthly, Biweekly, Yearly
• H1-B dependent: Yes, No
• Worksite_state: state of the workplace
• Soc_code: Standard Occupational Classification

Data Preprocessing
• The dataset is very large and not clean.

Data Preprocessing
• Converted every prevailing wage to a wage per year, whatever its original unit (e.g. weekly, bi-weekly, hourly, monthly).
• Handling the missing values:
  • Replaced missing numeric values with the mean.
  • Replaced missing categorical values with the mode.
  • This attempt was unsuccessful, so the rows with missing values were removed.
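The unit conversion and the final "remove rows with missing values" step can be sketched in plain Python as follows. This is only an illustration, not the authors' code: the multipliers (2,080 paid hours, 52 weeks, 26 bi-weekly periods, 12 months per year), the helper names `annualize` and `drop_incomplete`, and the sample records are all assumptions.

```python
# Periods per year for each Unit_of_Pay value. The multipliers are assumptions:
# 2,080 paid hours (40 h x 52 weeks), 52 weeks, 26 bi-weekly periods, 12 months.
PERIODS_PER_YEAR = {
    "hourly": 2080,
    "weekly": 52,
    "biweekly": 26,
    "monthly": 12,
    "yearly": 1,
}


def annualize(wage, unit):
    """Return the wage per year, or None if the unit is not recognised."""
    key = unit.strip().lower().replace("-", "").replace(" ", "")
    factor = PERIODS_PER_YEAR.get(key)
    return None if factor is None else float(wage) * factor


def drop_incomplete(rows, required):
    """Keep only the rows in which every required field has a value."""
    return [r for r in rows if all(r.get(f) not in (None, "") for f in required)]


# Hypothetical records; the field names follow the dataset description.
rows = [
    {"Prevailing_wage": 30.0, "Unit_of_Pay": "Hourly"},
    {"Prevailing_wage": 5000.0, "Unit_of_Pay": "Monthly"},
    {"Prevailing_wage": None, "Unit_of_Pay": "Yearly"},  # missing wage: dropped
]
complete = drop_incomplete(rows, ["Prevailing_wage", "Unit_of_Pay"])
yearly = [annualize(r["Prevailing_wage"], r["Unit_of_Pay"]) for r in complete]
print(yearly)
```

Dropping rows before converting keeps `annualize` free of missing-value handling, which mirrors the order described on the slide.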

Data Preprocessing
• The data must be consistent before it can be passed to the model.
• Made the values consistent across the dataset in different columns:
  • Job location: "NewYork", "NY", and "New York" all become "NY".
  • Job titles: variants such as "Computer Information system manager", "Computer and Information System Manager", and "Computer systems Manager" were brought to a single form.
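One way to make such values consistent is a lookup table keyed on a normalized form of the string. The sketch below is an assumption about how this could be done, not the method used in the project; the alias tables, the helper names, and the choice of canonical job title are hypothetical.

```python
import re

# Illustrative alias tables; a real project would need far more entries.
STATE_ALIASES = {"newyork": "NY", "new york": "NY", "ny": "NY"}

CANONICAL_TITLE = "Computer and Information Systems Manager"  # assumed target
TITLE_ALIASES = {
    "computer information system manager": CANONICAL_TITLE,
    "computer and information system manager": CANONICAL_TITLE,
    "computer systems manager": CANONICAL_TITLE,
}


def normalize_key(value):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", value.lower())
    return " ".join(text.split())


def normalize_state(value):
    """Map known spellings of a state to its two-letter code."""
    return STATE_ALIASES.get(normalize_key(value), value.strip())


def normalize_title(value):
    """Map known variants of a job title to one canonical title."""
    return TITLE_ALIASES.get(normalize_key(value), value.strip())


for raw in ["NewYork", "NY", "New York"]:
    print(raw, "->", normalize_state(raw))
```

Values that are not in a table pass through unchanged (apart from trimming), so unknown spellings stay visible for later review.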

Data Preprocessing
• Removed the outliers in the prevailing wage column.
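The slides do not say which rule was used to identify outliers. As one common possibility, the sketch below applies the interquartile-range (IQR) rule with the conventional factor of 1.5; the rule, the factor, the function name, and the sample wages are all assumptions.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR] (IQR rule, assumed here)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


# Hypothetical yearly wages with one implausible entry.
wages = [50000, 55000, 60000, 65000, 70000, 75000, 80000, 1000000]
kept = remove_outliers_iqr(wages)
print(len(wages) - len(kept), "outlier(s) removed")
```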

Feature Selection
• The most important part of preprocessing.

Feature Selection
• Boruta package

Feature Selection (continued)
• Job_title
• Employer_name
• Employer_state
• Agent_Attorney_name
• Agent_Attorney_state
• Soc_code
• Soc_name

Data Insights

Naïve Bayes Classification
• Divided the prevailing wage into 9 classes, ranging from 25,000 to 118,000.
• The width of each class was calculated using the normalization technique.
• Trained a Naïve Bayes classifier on the data to determine which class each prevailing wage falls into.
• The accuracy obtained was too low.

Naïve Bayes Classification
• To improve the accuracy of the Naïve Bayes classifier, divided the prevailing wage into only 3 classes: below 60,000 as low, 60,000 to 90,000 as average, and above 90,000 as high.
• Trained the Naïve Bayes classifier on the dataset with one class against many and obtained an accuracy of 83%.
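The three-class split can be written as a small function. The thresholds come from the slide; treating exactly 60,000 and exactly 90,000 as "average" is an assumption, since the slide does not say how the boundaries are handled, and the function name and sample wages are hypothetical.

```python
from collections import Counter


def wage_class(yearly_wage):
    """Map a yearly prevailing wage to 'low', 'average', or 'high'."""
    if yearly_wage < 60000:
        return "low"
    if yearly_wage <= 90000:  # boundary values counted as average (assumption)
        return "average"
    return "high"


# Hypothetical yearly wages.
wages = [45000, 60000, 72500, 90000, 120000]
labels = [wage_class(w) for w in wages]
print(Counter(labels))
```

Checking the class counts this way is a quick test of whether the three classes are reasonably balanced before a classifier is trained.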

Support Vector Machines
• Took a random sample of the dataset.
• Trained an SVM on the sample with one class against many.
• Obtained an accuracy of 95.84%.

Multilinear Regression
• To train on the dataset with multilinear regression, all categorical values have to be converted to numeric ones.
• Employer_name is a categorical value in our dataset, and there are more than 10,000 unique employer names.
• Tried converting the categorical values to binary using Python pandas, but the CSV format of the dataset got corrupted because it was creating a 12,000 x 12,000 square matrix.
• Took a random sample of the data and trained the model using multilinear regression.
• R mean squared error was found to be 0.5.
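A rough size estimate shows why binary (one-hot) encoding of Employer_name becomes hard to handle. The sketch assumes one dense column per employer name and one row per application, using the 633,943 applications and roughly 12,000 names reported earlier; the function names are hypothetical, and the compact index form at the end is a possible alternative, not something the project used.

```python
def dense_one_hot_cells(n_rows, n_levels):
    """Number of cells in a dense one-hot matrix (rows x levels)."""
    return n_rows * n_levels


def one_hot_indices(values):
    """Compact form: the list of levels plus one column index per row."""
    levels = sorted(set(values))
    position = {level: i for i, level in enumerate(levels)}
    return levels, [position[v] for v in values]


cells = dense_one_hot_cells(633_943, 12_000)
print(f"{cells:,} cells, about {cells / 2**30:.1f} GiB at one byte per cell")

# Hypothetical employer names: only one integer per row is stored.
levels, indices = one_hot_indices(["Employer B", "Employer A", "Employer B"])
print(levels, indices)
```

The dense matrix has about 7.6 billion cells, whereas the index form stores a single integer per row and carries the same information.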

Decision Trees
• Obtained an accuracy of 94.94%.

Text Analysis
https://bigml.com/dashboard/sources

Results

Model                                        Accuracy
Naïve Bayes Classifier (one against many)    83%
Support Vector Machines (random sampling)    95.844%
Decision Trees                               94.94%
Multilinear Regression (random sampling)     R-squared error: 0.5

Limitations of R
• For a large dataset, converting categorical values into numeric ones was a major problem, because a label has to be assigned to each level of every factor. Assigning labels to a categorical variable with 12,000 levels is a tedious process.
• We cannot train a random forest in R on the dataset if it contains categorical variables with more than 32 levels; it cannot handle categorical predictors with more than 32 categories.
• When the decision tree was plotted, predictor variables with more than 52 levels were not printed, so we could not interpret the rules of the decision tree.
• Visualization is a limitation in R.
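Two workarounds for high-cardinality categorical variables are sketched below in Python. Neither is described in the slides, so both are assumptions: generating integer labels automatically instead of assigning them by hand, and lumping rare levels into a single "OTHER" level so that a variable stays within a fixed limit such as 32 levels. All function names and sample values are hypothetical.

```python
from collections import Counter


def fit_labels(values):
    """Assign an integer code to each distinct value, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def encode(values, mapping, unknown=-1):
    """Replace each value by its code; unseen values get `unknown`."""
    return [mapping.get(v, unknown) for v in values]


def lump_rare_levels(values, max_levels=32, other="OTHER"):
    """Keep the most frequent levels and merge the rest into `other`."""
    keep = {v for v, _ in Counter(values).most_common(max_levels - 1)}
    return [v if v in keep else other for v in values]


# Hypothetical employer names with very different frequencies.
employers = ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"]
lumped = lump_rare_levels(employers, max_levels=3)
codes = encode(lumped, fit_labels(lumped))
print(sorted(set(lumped)), codes)
```

Integer codes impose an artificial order on the levels, so they fit tree-based models better than linear regression. Lumping discards the identity of rare employers, which may or may not be acceptable for wage prediction.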

THANK YOU