Classifying enterprises by economic activity

Slides:



Advertisements
Similar presentations
Sequential Minimal Optimization Advanced Machine Learning Course 2012 Fall Semester Tsinghua University.
Advertisements

© Federal Statistical Office Germany, IV A2 Federal Statistical Office Germany Application of Regular Expressions in the German Business Register Session.
Indian Statistical Institute Kolkata
Bring Order to Your Photos: Event-Driven Classification of Flickr Images Based on Social Knowledge Date: 2011/11/21 Source: Claudiu S. Firan (CIKM’10)
A Survey on Text Categorization with Machine Learning Chikayama lab. Dai Saito.
Sparse vs. Ensemble Approaches to Supervised Learning
IR & Metadata. Metadata Didn’t we already talk about this? We discussed what metadata is and its types –Data about data –Descriptive metadata is external.
Sparse vs. Ensemble Approaches to Supervised Learning
TTI's Gender Prediction System using Bootstrapping and Identical-Hierarchy Mohammad Golam Sohrab Computational Intelligence Laboratory Toyota.
Ensemble Learning (2), Tree and Forest
1 How to use Weka How to use Weka. 2 WEKA: the software Waikato Environment for Knowledge Analysis Collection of state-of-the-art machine learning algorithms.
Automated Patent Classification By Yu Hu. Class 706 Subclass 12.
Machine Learning1 Machine Learning: Summary Greg Grudic CSCI-4830.
Use of web scraping and text mining techniques in the Istat survey on “Information and Communication Technology in enterprises” Giulio Barcaroli(*), Alessandra.
Project 1: Machine Learning Using Neural Networks Ver 1.1.
Empirical Research Methods in Computer Science Lecture 7 November 30, 2005 Noah Smith.
Today Ensemble Methods. Recap of the course. Classifier Fusion
Machine Learning Documentation Initiative Workshop on the Modernisation of Statistical Production Topic iii) Innovation in technology and methods driving.
TEXT ANALYTICS - LABS Maha Althobaiti Udo Kruschwitz Massimo Poesio.
Neural Text Categorizer for Exclusive Text Categorization Journal of Information Processing Systems, Vol.4, No.2, June 2008 Taeho Jo* 報告者 : 林昱志.
USE RECIPE INGREDIENTS TO PREDICT THE CATEGORY OF CUISINE Group 7 – MEI, Yan & HUANG, Chenyu.
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
Class Imbalance in Text Classification
1 Statistics & R, TiP, 2011/12 Neural Networks  Technique for discrimination & regression problems  More mathematical theoretical foundation  Works.
Finding τ → μ−μ−μ+ Decays at LHCb with Data Mining Algorithms
***Classification Model*** Hosam Al-Samarraie, PhD. CITM-USM.
A Brief Introduction and Issues on the Classification Problem Jin Mao Postdoc, School of Information, University of Arizona Sept 18, 2015.
Virtual Examples for Text Classification with Support Vector Machines Manabu Sassano Proceedings of the 2003 Conference on Emprical Methods in Natural.
A Supervised Machine Learning Algorithm for Research Articles Leonidas Akritidis, Panayiotis Bozanis Dept. of Computer & Communication Engineering, University.
Tree and Forest Classification and Regression Tree Bagging of trees Boosting trees Random Forest.
Information Organization: Evaluation of Classification Performance.
CS : Designing, Visualizing and Understanding Deep Neural Networks
Combining Models Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya.
A Simple Approach for Author Profiling in MapReduce
Sharing of previous experiences on scraping Istat’s experience
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
A Smart Tool to Predict Salary Trends of H1-B Holders
R Data Mining for Insurance Retention Modeling
Introduction to Machine Learning
Machine Learning – Classification David Fenyő
ESSNet Pilot: Web Scraping for Job Vacancy Statistics
WP1: Web scraping Job Vacancies- ELSTAT
Alan D. Mead, Talent Algorithms Inc. Jialin Huang, Amazon
Erasmus University Rotterdam
Application of Classification and Clustering Methods on mVoC (Medical Voice of Customer) data for Scientific Engagement Yingzi Xu, Department of Statistics,
Source: Procedia Computer Science(2015)70:
Removing Duplicate Job Ads
Natural Language Processing of Knee MRI Reports
Big Data ESSNet: Web Scraping for Job Vacancy Statistics Nigel Swier UK Office for National Statistics.
CMPT 733, SPRING 2016 Jiannan Wang
ESSNet Pilot: Web Scraping for Job Vacancy Statistics
Project 1: Text Classification by Neural Networks
CSCI 5832 Natural Language Processing
Machine Learning and Verbatim Survey Response
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Ensemble learning Reminder - Bagging of Trees Random Forest
Classification Breakdown
Model generalization Brief summary of methods
Analysis for Predicting the Selling Price of Apartments Pratik Nikte
Information Retrieval
CMPT 733, SPRING 2017 Jiannan Wang
Predicting Loan Defaults
Introduction to Sentiment Analysis
Manisha Panta, Avdesh Mishra, Md Tamjidul Hoque, Joel Atallah
Advisor: Dr.vahidipour Zahra salimian Shaghayegh jalali Dec 2017
Machine Learning for Cyber
Information Organization: Evaluation of Classification Performance
Machine Learning for Cyber
Presentation transcript:

Classifying enterprises by economic activity Cedefop/CRISP/ESSNet 21 March 2018, Milan

Outline What is this about Few small presentations (~7 min) Discussion UK – Fero Hajnovic Germany – Chris Gabriel Islam France – Maxime Bergeat Belgium – Thomas Delclite Discussion - Think about what to get out of this session in discussion!

What is this about? Methods for extracting economic activity from job ads Assessment of the algorithms (accuracy, speed, applicability)

UK Fero Hajnovic

Data Adzuna, 10th of Jan. 2018 Survey (has SIC) Matching Ads Comps Before matching ~ 1 000 000 ~45 000 After matching ~ 10 000 ~ 1000

Machine learning 3. Word selection 1. NLP 4. Classifier + grid search - low IDFs - 2000 words - in some… - ...but not all SICs Machine learning 1. NLP The Skills People Group are the largest provider of... skills people group largest provider funded commercial... 4. Classifier + grid search - Multinomial NB - SVM 2. TF-IDF 1051x27808 sparse matrix with 385703 stored elements

Results Accuracy 66% (NB) 60% (SVC) 55% (Combined - NB + Random forest) Similar recall Todo More data Equal distribution Word selection

Germany Chris Gabriel Islam

Source: Wikipedia, Adrichel Using R on CEDEFOP data Used Package: Class Algorithm: K-Nearest-Neighbour, Accurarcy: 78 %   A C D F G H I J K M N Q R 317 1 4 2 98% 5 758 183 367 50 55% 3 0% 1804 7 44 27 48 93% 109 26 806 22 131 12 13 506 49% 10 4626 19 83 97% 366 171 181 67 23% 29 30 38% 9 39 54 35% 77 47 24 32 8 940 224 15 68% 57 358 112 34 92 2999 42 80% 11 17 482 94% 66 96% 57% 92% 51% 95% 24% 40% 74% 88% 99% Source: Wikipedia, Adrichel During this time only CEDEFOP-data was available, so we had to measure the quality of the date by itself Variables for Prediction: YearGrabDate + YearExpireDate + Experience + Salary In R there were some problems reading the CSV: Only 43000 entries out of 2.1 millions © Federal Statistical Office of Germany (Destatis)

Using Python on Federal Employment Agency data Used Packages: NLTK and Sklearn Algorithm: Multinomial Naive Bayes Classifier, Accuracy: 85 %   A B C D E F G H I J K L M N O P Q R S T U 446 19 21 48 14 17 1 5 33 4 2 73% 9 16 6 3 17% 15 8866 539 597 98 221 172 95 728 7 78% 24 51 11 36% 37 45 10 30 55% 231 6711 88 52 8 90 220 90% 673 294 8598 261 223 254 112 400 53 23 79% 69 152 3103 40 115 201 82% 22 39 5747 18 29 67 96% 79 28 46 1879 44 87% 32 478 77% 38 194 51% 435 12 154 399 97 319 4352 242 299 66% 55 524 695 260 190 240 286 27 267 25618 13 347 20 954 89% 848 74% 35 107 9747 105 304 58% 71 64 60 111 146 1393 69% 12% 31 100% 80% 84% 83% 61% 92% 88% Good data since it is provided manually Only variable was job description, but we want to try the job title as well. Problem: Sometimes no job title Used options: Stopwords in German, alpha=0.01, fit to prior was true Tokenizing and discarding punctuation as well as counting only words that appears at least n-times didn‘t increase the accuracy © Federal Statistical Office of Germany (Destatis)

Outlook Focus on signal words or dictionaries ML in order to detect ads from employment agencies Try out other packages and tools: FastText, Tensorflow, etc. Cooperate with collleagues in our institute on our newly built ML platform © Federal Statistical Office of Germany (Destatis)

France Maxime Bergeat

Methodology Data received from National Employment Agency (Pôle emploi) ~140 websites Subset of 100 000 job ads Always including job description and coded occupation (no job title ) Variables used as predictors Words from job description (input is basically a document-term matrix) Selection of words is based on term frequencies of words. Mostly Python sklearn library Goal: classification of occupation (22 categories) March 2018 Cedefop meeting

Application #1 ~1500 words used as predictors (words present in > 0.5% of all documents) Different methods tried for classification Logistic Regression Gradient boosting Random Forests with bagging Basic neural network Support vector machine Here: results presented for Logistic regression With L1 penalty Evaluation of results: train and test datasets (80% / 20%) Global accuracy: 70% March 2018 Cedefop meeting

Application #2 – Precision matrix March 2018 Cedefop meeting

Application #3 – Recall matrix March 2018 Cedefop meeting

Future work Improve selection of words used as predictors Choose words which are occupation-specific Improve tuning of parameters used in predictive methods Improve evaluation process -> Cross-validation First results seem good… … Global accuracy of 80% with selection of occupation-specific words for predictors and basic tuning of parameter (for penalty) with random forests (words present in > 2% of documents for at least one occupation category) First experiments show that cross-validation does not improve a lot robustness of evaluation results (k-fold cross validation). To be continued… March 2018 Cedefop meeting

Belgium Thomas Delclite

Machine Learning and prediction of NACE code by job description ESSnet Big Data, Milan, 03/2018 http://statbel.fgov.be

Using R package on administrative data Sensitivity Specificity Package RTextTools http://statbel.fgov.be

Using R package on administrative data Package RTextTools http://statbel.fgov.be

Issues Predictions for three languages Application to webscraping data Quality of webscraping data to assess variations of job vacancy http://statbel.fgov.be

Discussion SIC classification in Cedefop data? Why is this useful? Boundary till which this works BG – companies house,

Thank you