PROFILING USERS BY ESTIMATING COMPOSITE AND MULTI-VALUED ATTRIBUTES FROM BIG DATA SOURCES FOR SOCIAL STATISTICS PURPOSES NTTS 2017, Brussels, March.

PROFILING USERS BY ESTIMATING COMPOSITE AND MULTI-VALUED ATTRIBUTES FROM BIG DATA SOURCES FOR SOCIAL STATISTICS PURPOSES
NTTS 2017, Brussels, March 14-16, 2017
Jacek Maślankowski
Central Statistical Office, Statistical Office in Gdańsk, Poland
University of Gdańsk, Poland

AGENDA
- Prerequisites
- Framework
- Results of analysis
- Conclusions

OVERVIEW AND THE GOAL OF THE STUDY
- Show a methodology for extracting data from big data sources for social statistics purposes
- Provide information on the population with detailed attributes that can be extracted from the data or at least estimated
- Attributes can be composite (e.g., address, name) as well as multi-valued (e.g., phone numbers)
- H1: The usability of the data can be increased by estimating values for specific entities
- H2: The representativeness of web data does not allow it to be applied directly for social statistics purposes

METHODOLOGY AT A GLANCE
A set of combined methods is used to extract user profiles from both social media and web pages.
Attributes available for analysis:
- Screen name
- Full name
- Geographic location
- URL
- Description
Estimation of attributes:
- Gender: different forms of verbs, language markers
- Phone/e-mail address: found with regular expressions
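The two estimation steps listed above can be illustrated with a minimal Python sketch. It is not the author's implementation: the regular expressions, the function names `guess_gender` and `extract_profile`, and the assumption that the text is Polish (where first-person past-tense verbs end in -łem for men and -łam for women) are all introduced here for illustration only.

```python
import re

# Assumed, deliberately simple patterns; real e-mail and phone formats vary widely.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumption: optional +country code, then 9 digits in groups of three.
PHONE_RE = re.compile(r"(?<![\d+])(?:\+\d{1,3}[ -]?)?\d{3}[ -]?\d{3}[ -]?\d{3}(?!\d)")

# Assumption: Polish first-person singular past-tense suffixes.
MASCULINE_SUFFIX = "łem"
FEMININE_SUFFIX = "łam"


def guess_gender(text):
    """Return 'male', 'female', or None based on verb-suffix counts."""
    tokens = re.findall(r"\w+", text.lower())
    masculine = sum(tok.endswith(MASCULINE_SUFFIX) for tok in tokens)
    feminine = sum(tok.endswith(FEMININE_SUFFIX) for tok in tokens)
    if masculine > feminine:
        return "male"
    if feminine > masculine:
        return "female"
    return None  # no evidence, or a tie


def extract_profile(description):
    """Collect multi-valued contact attributes and a gender estimate."""
    return {
        "emails": EMAIL_RE.findall(description),
        "phones": PHONE_RE.findall(description),
        "gender": guess_gender(description),
    }


if __name__ == "__main__":
    sample = "Wczoraj byłam w Gdańsku. Kontakt: anna@example.org, tel. +48 601 234 567"
    print(extract_profile(sample))
```

The suffix rule is noisy by design: nouns such as "stołem" also end in -łem, so a heuristic of this kind will produce false positives on real text.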

METHODOLOGY IN DETAIL
METHODS:
(1) Machine Learning tools
(2) Text Mining methods
STEPS:
- Analyse the readiness of the data source
- Social Big Data – Social Mining – Web Mining: profiling users
- Identification of demographic attributes
POPULATION:
Social media users and Internet users who comment on selected web portals
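To make the machine-learning step more concrete, here is a small, dependency-free sketch of a multinomial Naive Bayes text classifier, the family of model that the results slide refers to as MultinomialNB. The class `TinyMultinomialNB`, the whitespace tokenizer, and the four training sentences are invented for illustration; the study's actual features, training data, and tooling are not described in the slides.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lower-cased whitespace tokens are good enough for a toy example.
    return text.lower().split()


class TinyMultinomialNB:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            for tok in tokenize(text):
                if tok in self.vocab:  # tokens unseen in training are ignored
                    count = self.word_counts[label][tok]
                    score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training sentences (Polish first-person past tense), not study data.
TRAIN_TEXTS = ["byłam wczoraj w kinie", "zrobiłam zakupy",
               "byłem wczoraj w pracy", "zrobiłem obiad"]
TRAIN_LABELS = ["female", "female", "male", "male"]

if __name__ == "__main__":
    model = TinyMultinomialNB().fit(TRAIN_TEXTS, TRAIN_LABELS)
    print(model.predict("byłam w kinie"))
    print(model.predict("zrobiłem obiad wczoraj"))
```

With a realistic training set, a library implementation would normally be preferred over a hand-written class like this one.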

USE CASES
WHO:
An entity is a person who is active on social media or who comments on various events in the country.
WHAT:
Three cases were used for the analysis and to enhance social statistics:
- intentions to vote (mostly covered in statistics from the OECD)
- media education: how much people trust the media
- social confidence

RESULTS – ENTITY CLASSIFICATION (IDENTIFY GENDER)
BASED ON THE VERB FORM:
- Based on the suffix of the verb form: F1-score varies from 0.25 to 0.57 in specific datasets, with precision from 0.33 to 0.5
- Cannot be included as the primary method of identification
BASED ON THE VERB FORM AND TRAINING DATASET:
- MultinomialNB and Linear SVM
- From 0.8075 to 0.866
- Supported by different methods
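For readers less familiar with the metrics quoted above, the following sketch shows how precision, recall, and F1-score are computed for one class. The function name `precision_recall_f1` and the six labels in the example are made up for illustration and are not taken from the study's datasets.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for the class named `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical gold labels and predictions ("f" = female, "m" = male).
    y_true = ["f", "f", "f", "m", "m", "m"]
    y_pred = ["f", "m", "m", "f", "m", "m"]
    print(precision_recall_f1(y_true, y_pred, positive="f"))
```

In this toy case one of two predicted "f" labels is correct (precision 0.5) and one of three true "f" users is found (recall about 0.33), which gives an F1-score of roughly 0.4.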

CONCLUSIONS (1/2)
- Several useful and reliable attributes can be extracted to enhance social statistics surveys
- These attributes can enrich social surveys, e.g., on social confidence and intention to vote
- Results are presented using geographic and demographic attributes of the entities

CONCLUSIONS (2/2)
- Hypothesis H1 has been confirmed by comparing the results of the analysis with data from official statistics
- Hypothesis H2 has also been confirmed: some differences in the results have to be expected
- Each data source must be treated individually

THANK YOU!
Jacek Maślankowski
Statistical Office in Gdańsk, Poland