Matching and Industry Coding


Matching and Industry Coding (title slide)

What is Matching?

Matching, in this instance, is the linking of trusted data from two or more differing sources.

Why match?
- Combined datasets can contain more information than is available in individual datasets
- To build efficient sampling frames for surveys, by reducing duplication
- To reduce response burden
- To impute missing data values (e.g. using turnover-per-head ratios for an SIC division to calculate turnover where employment is known)
- To provide validated data from trusted sources

Current Matching

All administrative data (VAT, PAYE, Companies House) passes through a matching process before being loaded on to the IDBR. This ensures that linkage takes place, so that units on the IDBR are represented accurately and duplication is avoided. Two systems are used for matching:
- NameKey matching
- SSA-NAME3 matching

Matching Process: Stage 1 (NameKey Matching)

The input data (VAT, PAYE, Companies House) first undergoes NameKey matching. Namekeys are generated from the business name: they are strings of characters with spaces, noise words (e.g. "and") and non-alphabetic/non-numeric characters (e.g. full stops) removed. The data is then passed through the matching process, and each record is classified as either a Match or a No Match:
- Matches are considered definite where the Namekey and postcode match exactly (100%); the register is amended accordingly.
- No Matches move on to the next stage of matching, SSA-NAME3.
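Namekey generation as described above can be sketched as follows. The noise-word list and the exact namekey-plus-postcode lookup are illustrative assumptions; the IDBR's actual noise-word list and key format are not given here.

```python
import re

# Hypothetical noise-word list; the real IDBR list is not published here.
NOISE_WORDS = {"and", "the", "of", "ltd", "limited", "co"}

def namekey(business_name: str) -> str:
    """Build a namekey: strip non-alphanumeric characters from each
    word, drop noise words, then join without spaces."""
    words = re.split(r"\s+", business_name.lower())
    cleaned = [re.sub(r"[^a-z0-9]", "", w) for w in words]
    kept = [w for w in cleaned if w and w not in NOISE_WORDS]
    return "".join(kept).upper()

def stage1_match(record_key: str, record_postcode: str, register: dict):
    """A definite match requires an exact namekey + postcode pair;
    anything else falls through to Stage 2 (returns None)."""
    return register.get((record_key, record_postcode))
```

Because punctuation and noise words are removed, "Smith & Jones Ltd" and "Smith and Jones Ltd." generate the same key and so match each other.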

Matching Process: Stage 2 (SSA-NAME3 Key Search)

Records unmatched at Stage 1 are passed through the SSA-NAME3 key search. This part of the process generates permutations on the name and address of each input unit to widen the search of the register, and matching is then undertaken. Where candidates are found for an input unit, each is given a score based on the closeness of the input unit to the register unit on name, address and postcode. Where no candidates, or more than 300 candidates, are found, the outcome is treated as a No Match and no further matching (automatic or clerical) is carried out.
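The candidate scoring step might look like the sketch below. SSA-NAME3 is proprietary, so the similarity measure (Python's difflib), the field weights and the record layout are all illustrative assumptions, not the real algorithm.

```python
from difflib import SequenceMatcher

def field_score(a: str, b: str) -> float:
    """String closeness in [0, 1]; a stand-in for SSA-NAME3's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_score(input_unit: dict, register_unit: dict,
                    weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted closeness on name, address and postcode.
    The weights are illustrative, not SSA-NAME3's."""
    fields = ("name", "address", "postcode")
    return sum(w * field_score(input_unit[f], register_unit[f])
               for w, f in zip(weights, fields))

def search(input_unit: dict, candidates: list, max_candidates: int = 300):
    """No candidates, or more than 300, is treated as a No Match (None);
    otherwise the best-scoring register candidate is returned."""
    if not candidates or len(candidates) > max_candidates:
        return None
    return max(candidates, key=lambda c: candidate_score(input_unit, c))
```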

Matching Process: Stage 2 (continued): SSA-NAME3 Matching and Scoring

The SSA-NAME3 process continues by reviewing the candidates scored in the key search:
- Definite Match: the input unit scores greater than 79% against a single IDBR unit.
- Possible: the score is greater than 79% but against more than one unit; or the score is between 60% and 79% (between 33% and 79% for corporate businesses).
- No Match: the score is below 60% (33% for corporates). No further action is taken.

Rules are then applied to Matches and Possibles. For example, the status of the matched units may be reviewed: if one unit is status 1 (company) but the other is status 3 (partnership), the matching rule fails. From the rules a decision is made on whether the match passes or fails:
- Matched passes: the units are linked to the register.
- Matched fails: the units are reported out for clerical review.
- Possible passes: reported out for clerical review (e.g. the PAYE possibles table on the IDBR).
- Possible fails: no further action is taken.
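The scoring bands above can be captured in a small sketch. The function interface (a percentage score plus a count of top-scoring register units) is an assumption for illustration.

```python
def classify(score: float, n_best: int, corporate: bool = False) -> str:
    """Map a candidate score (as a percentage) to a match outcome,
    using the thresholds above: >79% against a single unit is a
    Definite Match; otherwise scores down to 60% (33% for corporate
    businesses) are Possibles; anything lower is a No Match.
    n_best is the number of register units achieving the top score."""
    lower = 33 if corporate else 60
    if score > 79:
        return "Definite" if n_best == 1 else "Possible"
    if score >= lower:
        return "Possible"
    return "No Match"
```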

Matching: Process Summary

The process is summarised as a flow:
1. Input data enters Stage 1, NameKey matching, producing a Match or a No Match.
2. No Matches pass to Stage 2, the SSA-NAME3 key search, where candidates are either found or not found.
3. Where candidates are found, Stage 3, SSA-NAME3 matching and scoring, produces a Match, a Possible or a No Match.

Matching Developments

Apart from business as usual, matching development work is also being undertaken. Data has been taken from the Farms Survey System (FSS) register maintained by the Department for Environment, Food and Rural Affairs (Defra), with the aim of matching the FSS to the IDBR. This also involved trialling different matching software, IDS2.4, an upgrade of SSA-NAME that provides improved "tuning" of parameters, with increased matches resulting.

Initially the two systems were compared on a sample of 2,000 large businesses from the FSS:

            IDS2.4        SSA-NAME1.7
Matched     1,399 (70%)     941 (47%)
Possible      275 (14%)      55  (3%)
Unmatched     326 (16%)   1,004 (50%)
Total       2,000         2,000

Testing these new, more advanced tools on Defra data has resulted in improvements in matching.

Matching Developments (continued)

All live FSS units, a total of 197,810, were then run through IDS2.4:

            Large    All
Definite      70%    46%
Possible      14%     1%
Unmatched     16%    53%

The "Large" column repeats the results from the 2,000-business sample for comparison. A lower match rate results because most of the units are small, and many will not be on the IDBR as they are not registered with its administrative inputs (e.g. PAYE, VAT). The trial of IDS2.4 has now finished, but the results have been secured, as these percentages are better than SSA-NAME1.7 could provide.

Matching: Future Plans

Future plans are dependent on the re-engineering of the IDBR, under which it will be necessary to change the matching software employed. SSA-NAME1.7 will not function in the new environment, hence the testing of IDS2.4, which should work on the new structure; however, decisions on the software to be employed have yet to be made.

The Defra system contains information on farms in England. The next stage is to extend the project to match information from the Scottish farms register to the IDBR, then Wales. This will provide further information for carrying out agricultural surveys from the IDBR.

Work is also being undertaken on rationalising the datasets received by the ONS. Part of the Administrative Data Integration (ADI) Project aims to review datasets received in the ONS and establish whether they can be integrated on to the IDBR, or whether they duplicate IDBR information (in which case such datasets may no longer be required, as the IDBR should already provide the information). The project concerns itself only with electronic micro-data. Once datasets have been identified, matching to the IDBR may be required.

Industry Coding

There is a requirement to code business descriptions to five-digit Standard Industrial Classification (SIC) codes. This ensures that when a selection is made for a survey, businesses from the correct part of the economy are selected. The tool used is the Precision Data Coder (PDC), software developed by the Inference Group in Australia, although a new coding tool, Automatic Coding by Text Recognition (ACTR), is being piloted.

Industry Coding: Rules for Coding

Before using a coding tool, the rules for coding should be applied. Business descriptions fall into two categories:
- Single business descriptions, where the description can be used directly for coding (e.g. "Manufacture of Garden Sheds"); and
- Multiple business descriptions, where two rules apply:
  - If there is more than one activity, the first activity must be selected. For example, "Manufacture and Wholesale of..." should be coded to "Manufacture".
  - If there is more than one product, the first product must be selected. For example, "Manufacture of Garden Sheds and Paving Stones..." should be coded to "Manufacture of Garden Sheds".
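The two truncation rules above can be sketched roughly as follows, assuming a small illustrative list of activity words; the real coding rules operate against the full SIC knowledge base, not a word list.

```python
# Illustrative activity words; the real rule base is far larger.
ACTIVITIES = {"manufacture", "wholesale", "retail", "repair"}

def apply_coding_rules(description: str) -> str:
    """Truncate a business description at the second activity
    (first-activity rule) or at an 'and' following the first
    product (first-product rule)."""
    words = description.rstrip(".").split()
    kept, seen_activity = [], False
    for w in words:
        bare = w.lower().strip(",")
        if bare in ACTIVITIES:
            if seen_activity:
                break          # second activity reached: stop
            seen_activity = True
        elif bare == "and" and seen_activity and len(kept) > 2:
            break              # heuristic: 'and' after the first product
        kept.append(w)
    while kept and kept[-1].lower() in ("and", "of"):
        kept.pop()             # drop connectives left dangling by truncation
    return " ".join(kept)
```

So "Manufacture and Wholesale of Garden Sheds" truncates to "Manufacture", and "Manufacture of Garden Sheds and Paving Stones" to "Manufacture of Garden Sheds", matching the worked examples above.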

Industry Coding: Process

The coding tools can be used in two ways:
- Batch processing: a file of numerous business descriptions is run through the coding tool to assign an SIC code to each automatically.
- Interactive processing: individual coders use the tool to code business descriptions that were not coded by the batch processor, or to confirm SIC changes to businesses.
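Batch processing can be pictured as a loop that applies a coding function to each description and separates coded from uncoded records. The record layout and the `coder` interface here are assumptions for illustration, not PDC's actual API.

```python
def batch_code(rows, coder):
    """Run a batch of business-description records through a coding
    function. `coder` returns an SIC code string, or None when it
    cannot code the description (such records go on to interactive
    processing instead)."""
    coded, uncoded = [], []
    for row in rows:
        sic = coder(row["description"])
        (coded if sic else uncoded).append({**row, "sic": sic})
    return coded, uncoded
```

Records left in `uncoded` would then be presented to coders through the interactive screen.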

Industry Coding: Process (continued)

These slides show the interactive PDC screen, which is used for coding individual business descriptions. If "clothing" is entered in the middle bar, PDC offers a number of options, and it is then for the coder to decide the most appropriate code to apply. To assist, the bottom window of the screen provides further information on the highlighted SIC code.

Industry Coding: Developments

The ONS is currently trialling Automatic Coding by Text Recognition (ACTR), a coding tool developed by Statistics Canada, with a view to adopting the software in April 2006, when the current PDC licence ends. ACTR has the same attributes as PDC; however, changes to the PDC knowledge base (the business-description-to-SIC mapping) can only be made once a year via the Inference Group in Australia, whereas the ACTR knowledge base can be updated within the ONS, ensuring that anomalies are removed from the system as soon as possible. The slide shows the interactive ACTR window: entering a description in the text box produces results in the box below, with a score assigned to each match so the coder can decide on the best possible description.

Matching and Coding

Any questions?