Matching and Industry Coding


1 Matching and Industry Coding
Title page

2 What is Matching?
Matching is the linking of trusted data from two or more differing sources.
Why match? Combined datasets can contain more information than is available on the individual datasets. Matching is used:
To build efficient sampling frames for surveys, by reducing duplication
To reduce response burden
To impute missing data values (e.g. using turnover per head ratios for an SIC division to calculate turnover where employment is known)
To provide validated data from trusted sources
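The imputation example mentioned above (turnover per head applied to known employment) can be sketched as follows. This is a minimal illustration only: the function name and the figures used are hypothetical, and the slides do not describe how the ratios are actually derived.

```python
def impute_turnover(employment, division_turnover, division_employment):
    """Estimate a unit's turnover from its employment.

    The turnover-per-head ratio is taken for the unit's SIC division
    (total division turnover divided by total division employment) and
    multiplied by the unit's own employment. All inputs are hypothetical.
    """
    if division_employment <= 0:
        raise ValueError("division employment must be positive")
    turnover_per_head = division_turnover / division_employment
    return employment * turnover_per_head


# Illustrative figures only: a division with turnover 5,000 and 100 employees
# gives 50 per head, so a unit with 10 employees is imputed a turnover of 500.
estimate = impute_turnover(10, 5000.0, 100)
```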

3 Current Matching System
All administrative data (VAT, PAYE, Companies House) passes through a matching process before being loaded on to the IDBR. This ensures that linkage takes place, so that units on the IDBR are represented accurately and duplication is avoided. Two systems are used for matching: NameKey matching and SSA-NAME3 matching.

4 Matching Process: Stage 1
Stage 1 is NameKey matching. The input data (VAT, PAYE, Companies House) undergoes NameKey matching. Namekeys are generated from the business name: they are strings of characters with spaces, noise words (e.g. "and") and non-alphanumeric characters (e.g. full stops) removed. The data is then passed through the matching process and each unit is classified as either a Match or a No Match.
Match: considered a definite match if both the Namekey and the postcode match exactly (100%). The register is amended accordingly.
No Match: these units move on to the next stage of matching, SSA-NAME3.
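A rough sketch of how a Namekey might be generated and compared is given below. It follows the description in the notes (remove spaces, noise words and non-alphanumeric characters, then require an exact match on Namekey and postcode). The upper-casing, the postcode normalisation and the contents of the noise-word list are assumptions; the slides name only "and" as a noise word.

```python
import re

# Assumption: the real noise-word list is not given; only "and" is named.
NOISE_WORDS = {"AND"}


def make_namekey(business_name):
    """Build a Namekey: drop noise words, punctuation and spaces."""
    key_parts = []
    for token in business_name.upper().split():
        # Strip every character that is not a letter or a digit.
        cleaned = re.sub(r"[^A-Z0-9]", "", token)
        if cleaned and cleaned not in NOISE_WORDS:
            key_parts.append(cleaned)
    return "".join(key_parts)


def normalise_postcode(postcode):
    """Assumed normalisation: upper-case with all whitespace removed."""
    return "".join(postcode.upper().split())


def is_definite_match(input_unit, register_unit):
    """Stage 1 rule: exact agreement on both Namekey and postcode."""
    return (
        make_namekey(input_unit["name"]) == make_namekey(register_unit["name"])
        and normalise_postcode(input_unit["postcode"])
        == normalise_postcode(register_unit["postcode"])
    )
```

For example, "Smith and Jones Ltd." and "SMITH & JONES LTD" both reduce to the same key, so the two records match at Stage 1 if their postcodes also agree.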

5 Matching Process: Stage 2
Stage 2 is SSA-NAME3. The No Matches from NameKey matching are passed through the SSA-NAME3 key search. This part of the process generates permutations of the name and address of input units to widen the search on the Register.
Candidates found: each candidate is given a score reflecting the closeness between the input unit and a Register unit, based on name, address and postcode. The scores are used to find the best match.
No candidates, or more than 300 candidates, found: the outcome is considered a No Match. No further matching (automatic or clerical) is carried out.
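The candidate-count rule from this slide can be expressed in a few lines. This sketch covers only the screening decision; how SSA-NAME3 generates its search keys and candidates is not described in the slides and is not modelled here. The function and constant names are illustrative.

```python
# From the slide: more than 300 candidates is treated as a No Match.
MAX_CANDIDATES = 300


def key_search_outcome(candidates):
    """Decide what happens to an input unit after the key search."""
    count = len(candidates)
    if count == 0 or count > MAX_CANDIDATES:
        # No further automatic or clerical matching is carried out.
        return "no match"
    return "score candidates"
```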

6 Matching Process: Stage 2 (continued)
The SSA-NAME3 process continues by reviewing the candidates that were given a score in the previous slide. Each input unit is classified as a Match, a Possible or a No Match:
Match: the input unit scores greater than 79% against a single IDBR unit. This is considered a Definite Match.
Possible: the score is greater than 79% but against more than one IDBR unit; or the score is between 60% and 79% (between 33% and 79% for corporate businesses).
No Match: the score is less than 60% (less than 33% for corporates). No further action is taken.
Rules are then applied to Matches and Possibles. For example, the status of the matched units may be reviewed: if one unit is status 1 (Company) but the other is status 3 (Partnership), the matching rule fails. From the rules a decision is made on whether the Match or Possible passes or fails:
Match passes: the units are linked to the register.
Match fails: the units are reported out for clerical review.
Possible passes: the units are reported out for clerical review (e.g. PAYE possible table on IDBR).
Possible fails: no further action is taken.
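The thresholds and outcomes on this slide can be sketched as a small decision function. The thresholds (79%, 60%, and 33% for corporates) come from the notes; treating a score of exactly 60% (or 33%) as a Possible is an interpretation of "less than 60%" for No Matches. The function names and return strings are illustrative, and the rule check itself is reduced to a pass/fail flag.

```python
DEFINITE_THRESHOLD = 79   # a score must be greater than this for a Definite Match
LOWER_THRESHOLD = 60      # below this is a No Match
LOWER_THRESHOLD_CORPORATE = 33


def classify(scores, corporate=False):
    """Classify an input unit from the scores of its candidate IDBR units."""
    lower = LOWER_THRESHOLD_CORPORATE if corporate else LOWER_THRESHOLD
    if not scores:
        return "no match"
    high = [s for s in scores if s > DEFINITE_THRESHOLD]
    if len(high) == 1:
        return "match"      # one clear winner
    if len(high) > 1:
        return "possible"   # more than one unit scores above 79%
    if max(scores) >= lower:
        return "possible"   # best score falls in the middle band
    return "no match"


def next_action(category, rules_pass):
    """Map the classification and the rule outcome to the action taken."""
    if category == "match":
        return "link to register" if rules_pass else "report for clerical review"
    if category == "possible":
        return "report for clerical review" if rules_pass else "no further action"
    return "no further action"
```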

7 Matching: Process Summary
The process is summarised in this slide.
Input data
Stage 1 - NameKey matching: Match / No match
Stage 2 - SSA-Name3 key search: Candidates found / Candidates not found
Stage 3 - SSA-Name3 matching and scoring: Match / Possible / No match

8 Matching Developments
            IDS2.4   SSA-NAME1.7
Matched      1,399           941
Possible       275            55
Unmatched      326         1,004
Total        2,000         2,000

Apart from business-as-usual matching, development work is also being undertaken. We have taken data from the Farms Survey System (FSS) Register, maintained by the Department for Environment, Food and Rural Affairs (Defra), with the aim of matching the FSS to the IDBR. This process also involved trialling different matching software, IDS2.4, which is an upgrade of SSA-NAME. IDS2.4 provides improved "tuning" of parameters, resulting in more matches. Initially the two systems were compared using a sample of 2,000 large businesses from the FSS. The results showed that 1,399 (70%) of the records were matched in IDS2.4 compared to 941 (47%) on the existing software, whilst 326 (16%) were unmatched in IDS2.4 compared to 1,004 (50%) using SSA-NAME1.7. Testing new advanced tools on Defra data has resulted in improvements in matching.
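The percentages quoted in the notes can be reproduced from the counts in the table. The short script below is only a check of that arithmetic; the dictionary layout and names are illustrative.

```python
# Counts taken from the table on this slide (sample of 2,000 large FSS businesses).
RESULTS = {
    "IDS2.4": {"Matched": 1399, "Possible": 275, "Unmatched": 326},
    "SSA-NAME1.7": {"Matched": 941, "Possible": 55, "Unmatched": 1004},
}


def percentages(counts):
    """Return each outcome as a whole-number percentage of the column total."""
    total = sum(counts.values())
    return {outcome: round(100 * n / total) for outcome, n in counts.items()}


for system, counts in RESULTS.items():
    print(system, sum(counts.values()), percentages(counts))
```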

9 Matching Developments
All live FSS units were then run through IDS2.4, a total of 197,810 units. The results are shown in the table, alongside the results for the large-business sample from the previous slide (the Large column).

            Large    All
Definite      70%    46%
Possible      14%     1%
Unmatched     16%    53%

The match rate is lower because most of the units are small, and many will not match because they may not be on the IDBR: they are not registered with the IDBR's administrative inputs (e.g. PAYE, VAT). The trial of IDS2.4 has now finished, but the results have been secured because these percentages are better than SSA-NAME1.7 could provide.

10 Matching Future Plans
Future plans in matching are dependent on the re-engineering of the IDBR. Under the re-engineering it will be necessary to change the matching software employed. SSA-NAME1.7 will not function in the new environment, hence the testing of IDS2.4, which should work on the new structure. However, decisions on the software to be employed have yet to be made.
The Defra system contains information on farms in England. The next stage is to extend the project to match information from the Scottish farms register to the IDBR, then Wales. This will provide further information for carrying out agricultural surveys from the IDBR.
Work is being undertaken on rationalising the datasets received by the ONS. Part of the Administrative Data Integration (ADI) Project aims to review the datasets received in the ONS and see whether they can be integrated on to the IDBR, or whether they duplicate information already on the IDBR (in which case such datasets may no longer be required, as the IDBR should already provide the information). This project is concerned only with electronic micro-data. Once datasets have been identified, matching to the IDBR may be required.

11 Industry Coding
There is a requirement to code business descriptions to 5-digit Standard Industrial Classification (SIC) codes. This ensures that, when a selection is made for a survey, businesses from the correct part of the economy are selected. The tool used to do this is the Precision Data Coder (PDC), software developed by Inference Group in Australia. A new coding tool, Automatic Coding by Text Recognition (ACTR), is being piloted.

12 Industry Coding Rules Rules for Coding
Before using a coding tool it is appropriate to mention the "Rules for Coding" that should be applied. Business descriptions may fall in to two categories:
Single business descriptions, where the description can be used for coding as it stands (e.g. "Manufacture of Garden Sheds"); and
Multiple business descriptions, where the following rules apply.
If there is more than one activity, the first activity must be selected. For example, a business description that states "Manufacture and Wholesale of…." should be coded to "Manufacture".
If there is more than one product, the first product must be selected. For example, "Manufacture and Wholesale of Garden Sheds and Paving Stones…." should be coded to "Manufacture of Garden Sheds".
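The two rules above (first activity, first product) can be illustrated with a deliberately naive text parser. It assumes descriptions of the form "<activities> of <products>", with alternatives separated by "and" or commas; real business descriptions are far messier, and nothing here reflects how PDC or ACTR actually parse text. The function names are hypothetical.

```python
import re


def first_item(text):
    """Return the first alternative in a list joined by commas or 'and'."""
    return re.split(r",|\band\b", text, flags=re.IGNORECASE)[0].strip()


def primary_description(description):
    """Reduce a multiple business description to first activity and first product."""
    parts = re.split(r"\s+of\s+", description.strip(), maxsplit=1, flags=re.IGNORECASE)
    activity = first_item(parts[0])
    if len(parts) == 1:
        # No "of" clause: only the activity rule can be applied.
        return activity
    product = first_item(parts[1])
    return f"{activity} of {product}"
```

A single business description such as "Manufacture of Garden Sheds" passes through unchanged, while "Manufacture and Wholesale of Garden Sheds and Paving Stones" is reduced to "Manufacture of Garden Sheds".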

13 Industry Coding Process
The coding tools can be used in two ways:
Batch processing involves running a file of numerous business descriptions through the coding tool to assign an SIC code to them automatically.
Interactive processing is where coders use the coding tool on individual business descriptions, either to code descriptions that were not coded by the batch processor or to confirm SIC changes to businesses.
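The split between batch and interactive processing can be sketched as a batch pass that codes what it can and sets the rest aside for a coder. This is an illustration of the workflow only: the exact-lookup knowledge base, the function name and the code "00001" are placeholders, not real SIC codes and not the way PDC or ACTR match text.

```python
def batch_code(descriptions, knowledge_base):
    """Assign codes automatically; return (coded, left for interactive coding)."""
    coded = {}
    uncoded = []
    for description in descriptions:
        # Assumed normalisation: lower-case and collapse repeated whitespace.
        key = " ".join(description.lower().split())
        code = knowledge_base.get(key)
        if code is None:
            uncoded.append(description)   # to be coded interactively
        else:
            coded[description] = code
    return coded, uncoded


# Placeholder knowledge base: "00001" is a dummy value, not a real SIC code.
knowledge_base = {"manufacture of garden sheds": "00001"}
coded, uncoded = batch_code(
    ["Manufacture of  Garden Sheds", "Unknown trade"], knowledge_base
)
```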

14 Industry Coding: Process
This slide shows the blank interactive PDC screen, which is used for coding individual business descriptions. If "clothing" is entered on the middle bar, PDC comes up with a number of options. It is then for the coder to decide which is the most appropriate business description to apply. To assist them, the bottom window on the screen provides further information on the highlighted SIC code.

15 Industry Coding Developments
Automatic Coding by Text Recognition (ACTR) trial
Ability to update the knowledge base within ONS
PDC licence ends April 2006
Interactive ACTR
The ONS is currently trialling Automatic Coding by Text Recognition (ACTR), a coding tool developed by Statistics Canada, with a view to adopting the software in April 2006 at the end of the current PDC licence. ACTR has the same attributes as PDC. With PDC, however, changes to the knowledge base codes (Business Description to SIC mapping) can only be made once a year via the Inference Group in Australia. With ACTR it is possible to update the knowledge base codes within the ONS, ensuring that anomalies are removed from the system as soon as possible. The slide shows the interactive ACTR window. Entering a description in the text box produces the results in the box below, with a score assigned to each match so that the coder can decide which is the best possible description.

16 Matching and Coding Questions
Any questions?

