A Smart Tool to Predict Salary Trends of H1-B Holders


A Smart Tool to Predict Salary Trends of H1-B Holders
Ramya Ramesh, Akshay Poosarla
Under the guidance of: Prof. Meiliu Lu

Problem Statement
• The H-1B is an employment-based, non-immigrant visa category for temporary foreign workers in the United States.
• The wage prediction model predicts the wages of H-1B employees from the H-1B petition dataset.
• Goal: build an efficient model to predict the salary trends of H-1B workers and analyze the key factors that influence their wages.

Data Collection
• The dataset was collected from the website https://app.enigma.io/
• 2016 non-immigrant employee petition dataset
• 647,852 rows and 41 columns
• 633,943 H-1B applications

Understanding More about the Dataset

Understanding More About the Data
• Case_status: Certified, Certified_withdrawn, Denied, Withdrawn
• Employer_Name: around 12,000 different employer names were listed
• Agent_Attorney_Name: name of the agent who filed the petition
• Prevailing_wage: total wage
• Unit_of_Pay: Hourly, Weekly, Monthly, Biweekly, Yearly
• H1-B dependent: Yes, No
• Worksite_state: state of the workplace
• Soc_code: Standard Occupational Classification

Data Preprocessing
The dataset is very large and not clean.

Data Preprocessing
• Converted every prevailing wage to a per-year figure, whatever its original unit of pay (e.g. weekly, bi-weekly, hourly, monthly).
• Handling the missing values: first replaced missing numeric values with the mean and missing categorical values with the mode.
• That attempt was unsuccessful, so the rows with missing values were removed instead.
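The unit conversion described on this slide can be sketched in a few lines. The slides do not state the conversion factors used, and the authors appear to have worked mainly in R, so the Python below is only an illustration: the 40-hour week and 52-week year are assumptions, and `annual_wage` and `PERIODS_PER_YEAR` are hypothetical names.

```python
# Assumed conversion factors (not given in the slides):
# a 40-hour working week and a 52-week year.
PERIODS_PER_YEAR = {
    "hourly": 40 * 52,
    "weekly": 52,
    "biweekly": 26,
    "monthly": 12,
    "yearly": 1,
}


def annual_wage(wage, unit_of_pay):
    """Convert a prevailing wage to a yearly figure."""
    # Normalize spellings such as "Bi-weekly", "bi weekly", "Biweekly".
    key = unit_of_pay.strip().lower().replace("-", "").replace(" ", "")
    if key not in PERIODS_PER_YEAR:
        raise ValueError(f"unknown unit of pay: {unit_of_pay!r}")
    return wage * PERIODS_PER_YEAR[key]


print(annual_wage(30.0, "Hourly"))     # 62400.0
print(annual_wage(2500, "Bi-weekly"))  # 65000
```

Raising an error on an unknown unit, instead of passing the value through, makes inconsistent Unit_of_Pay entries visible early in cleaning.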

Data Preprocessing
• Before the data can be passed to the model, it has to be consistent.
• Made the values consistent across the dataset in different columns.
• Job location: NewYork, NY and New York all became NY.
• Job titles: variants such as Computer Information system manager, Computer and Information System Manager and Computer systems Manager were unified.
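A minimal sketch of the location clean-up, assuming a simple alias lookup. Only the three spellings named on the slide come from the source; `STATE_ALIASES` and `normalize_location` are hypothetical names, and a real table would need one entry per variant found in the data.

```python
# Alias table: the three spellings come from the slide, the structure is assumed.
STATE_ALIASES = {
    "newyork": "NY",
    "new york": "NY",
    "ny": "NY",
}


def normalize_location(raw):
    """Map known spelling variants of a location to one canonical code."""
    # Collapse repeated whitespace and ignore case before the lookup.
    key = " ".join(raw.split()).lower()
    # Unknown values are returned stripped but otherwise unchanged.
    return STATE_ALIASES.get(key, raw.strip())


print([normalize_location(s) for s in ["NewYork", "NY", "New York"]])
# ['NY', 'NY', 'NY']
```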

Data Preprocessing
Removed the outliers in the Prevailing_wage column.
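The slide does not say which outlier rule was applied. One common choice is the interquartile-range (IQR) rule, sketched below purely as an assumption; the function name, the 1.5 multiplier and the sample wages are illustrative and not taken from the project.

```python
import statistics


def remove_outliers(wages, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(wages, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [w for w in wages if low <= w <= high]


# Made-up annual wages with one implausibly large entry.
sample = [50_000, 55_000, 60_000, 65_000, 70_000, 75_000, 80_000, 900_000]
print(remove_outliers(sample))
```

With these sample values the 900,000 entry falls above the upper fence and is dropped, while the other seven wages are kept.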

Feature Selection
The most important part of preprocessing.

Feature Selection
Boruta package

Feature Selection (continued)
• Job_title
• Employer_name
• Employer_state
• Agent_Attorney_name
• Agent_Attorney_state
• Soc_code
• Soc_name

Data Insights


Naïve Bayes Classification
• Divided the prevailing wage into 9 classes spanning 25,000 to 118,000.
• The width of each class was calculated using the normalization technique.
• Trained a Naïve Bayes classifier to predict which class each prevailing wage falls into.
• The accuracy obtained was too low.
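The slide does not spell out how the class width was derived. A minimal sketch, assuming nine equal-width bins over the stated range of 25,000 to 118,000; `wage_class` and the constants are hypothetical names, and equal-width binning is an assumption, not the authors' confirmed method.

```python
LOW, HIGH, N_CLASSES = 25_000, 118_000, 9


def wage_class(wage):
    """Return a class index 0..8 for a wage inside [LOW, HIGH]."""
    if not LOW <= wage <= HIGH:
        raise ValueError("wage outside the binned range")
    width = (HIGH - LOW) / N_CLASSES  # about 10,333 per class
    # The top edge (wage == HIGH) is folded into the last class.
    return min(int((wage - LOW) / width), N_CLASSES - 1)


print(wage_class(25_000), wage_class(60_000), wage_class(118_000))  # 0 3 8
```

With bins this narrow, neighbouring classes hold very similar wages, which is consistent with the low accuracy reported for the nine-class setup.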

Naïve Bayes Classification
• To improve the accuracy of the Naïve Bayes classifier, divided the prevailing wage into only 3 classes: below 60,000 as low, 60,000 to 90,000 as average, and above 90,000 as high.
• Trained the Naïve Bayes classifier on the dataset with one-against-many classes and obtained an accuracy of 83%.
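The three-class labelling can be written as a small function. The thresholds are the ones given on the slide; the treatment of the exact boundary values (60,000 and 90,000 counted as average) is an assumption, and `wage_band` is a hypothetical name.

```python
def wage_band(annual_wage):
    """Label a yearly wage as 'low', 'average' or 'high'."""
    if annual_wage < 60_000:
        return "low"
    if annual_wage <= 90_000:  # boundaries counted as average (assumed)
        return "average"
    return "high"


print([wage_band(w) for w in (45_000, 60_000, 90_000, 120_000)])
# ['low', 'average', 'average', 'high']
```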

Support Vector Machines
• Took a random sample of the dataset.
• Trained an SVM on the sample with one-against-many classes.
• Obtained an accuracy of 95.84%.

Multilinear Regression
• To train the dataset using multilinear regression, all the categorical values have to be changed to numeric.
• Employer_name is a categorical column in our dataset, with more than 10,000 unique employer names.
• Tried converting the categorical values to binary columns using Python pandas, but the CSV format of the dataset got corrupted because it was creating a 12,000 × 12,000 square matrix.
• Took a random sample of the data and trained the model using multilinear regression.
• The R mean squared error was found to be 0.5.
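To see why binary (one-hot) encoding became unwieldy here, it helps to count cells. The sketch below compares the size of a dense one-hot block with plain integer codes. The row and level counts are the figures quoted in the slides, the employer names are made up, and both function names are hypothetical.

```python
def one_hot_cells(n_rows, n_levels):
    """Number of cells in a dense one-hot block for one categorical column."""
    return n_rows * n_levels


def label_encode(values):
    """Map each distinct category to an integer code in first-seen order."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes


# 633,943 applications and roughly 12,000 employer names, as quoted in the slides.
print(one_hot_cells(633_943, 12_000))  # 7607316000

# Made-up employer names, for illustration only.
encoded, codes = label_encode(["Acme", "Globex", "Acme", "Initech"])
print(encoded)  # [0, 1, 0, 2]
```

Integer codes keep the column to a single feature, but they impose an arbitrary ordering on the employers, so a linear model will treat them as if they were a numeric scale. That trade-off should be weighed before using them in a regression.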

Decision Trees
Obtained an accuracy of 94.94%.

Text Analysis https://bigml.com/dashboard/sources

Results

Model                                        Accuracy
Naïve Bayes Classifier (one against many)    83%
Support Vector Machines (random sampling)    95.844%
Decision Trees                               94.94%
Multilinear Regression (random sampling)     R-squared error: 0.5

Limitations of R
• For a large dataset, converting categorical values into numeric ones was a major difficulty, because a label has to be assigned to each factor level. Assigning labels to a categorical variable that has 12,000 levels is a tedious process.
• We could not train a random forest in R on this dataset: it cannot handle categorical predictors with more than 32 categories.
• When the decision tree was plotted, predictor variables with more than 52 levels were not printed, so we could not interpret the rules of the decision tree.
• Visualization is a limitation in R.
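One common workaround for a cap on category counts is to keep the most frequent levels and lump the rest into a single catch-all level. The slides do not say the authors tried this, so the sketch below is only an illustration in Python; `cap_levels`, the "Other" label and the sample values are hypothetical.

```python
from collections import Counter


def cap_levels(values, max_levels=32, other="Other"):
    """Keep the (max_levels - 1) most frequent categories; lump the rest."""
    keep = {v for v, _ in Counter(values).most_common(max_levels - 1)}
    return [v if v in keep else other for v in values]


# Made-up categories: A appears 5 times, B 3 times, C twice, D once.
sample = ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"]
capped = cap_levels(sample, max_levels=3)
print(sorted(set(capped)))  # ['A', 'B', 'Other']
```

The cost is that rare employers or job titles lose their individual identity, which may matter if a few small categories carry unusually high or low wages.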

THANK YOU