
Crime Section, Central Statistics Office.  The Crime Section would like to acknowledge the assistance provided by the Probation Service in this project.


1 Crime Section, Central Statistics Office.

2  The Crime Section would like to acknowledge the assistance provided by the Probation Service in this project. ◦ In particular, we would like to thank Michael Donnellan and Aidan Gormley.

3  Connectivity between the various Criminal Justice Database Systems  The Challenge - Absence of unique identifier  The Solution – CSO statistical matching.  Results of matching exercise  Future Goals

4 Robust links between PULSE and CCTS. Tenuous link between PULSE/CCTS and Probation. Need to turn these into strong links, but how?

5  Common unique identifier allows rapid integration of datasets.  The common identifiers between PULSE and CCTS include Charge No., Summons No.  These are linked to the Person PULSE ID in PULSE, to allow linking by individual.  Result: Able to produce statistics combining police and court outcome data.  However, there is a problem....
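As an illustration of linking on a shared key, the sketch below joins police and court records on a charge number and carries the person identifier across. The field names (`charge_no`, `person_pulse_id`, `outcome`) and the sample rows are invented for illustration; they are not the actual PULSE or CCTS schema.

```python
# Hypothetical records; field names and values are illustrative only.
pulse_charges = [
    {"charge_no": "C001", "person_pulse_id": "P10"},
    {"charge_no": "C002", "person_pulse_id": "P11"},
]
ccts_outcomes = [
    {"charge_no": "C001", "outcome": "Probation order"},
    {"charge_no": "C002", "outcome": "Fine"},
]


def join_on_charge(charges, outcomes):
    """Attach each court outcome to the person behind the charge."""
    person_by_charge = {c["charge_no"]: c["person_pulse_id"] for c in charges}
    joined = []
    for outcome in outcomes:
        person = person_by_charge.get(outcome["charge_no"])
        if person is not None:  # keep only outcomes with a known charge
            joined.append({"person_pulse_id": person, **outcome})
    return joined


combined = join_on_charge(pulse_charges, ccts_outcomes)
```

Because both systems share the charge number, the join is a plain dictionary lookup with no name comparison involved.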

6  No such common identifier between CCTS/PULSE and Probation  Probation Service uses its own unique identifiers.  No linking between this and PULSE identifiers such as Person PULSE ID and Court Outcome number.  Cannot link the datasets and cannot produce statistics.

7  But a solution exists:  If persons in the separate systems can be matched on variables that exist in both systems, then a table linking their unique identifiers can be produced.  Variables such as first name, surname, date of birth and address exist in both systems.  These can be used to link the two systems.  This is the basis of the CSO solution.

8  The CSO received a test dataset from the Probation Service, for the years 2007 and 2008.  Over 8700 data orders with corresponding info.  First, a manual matching exercise was carried out to test feasibility.  Matching was by first name, surname, address and date of birth on over 7800 probation records.  A random sample of 800 records was used.  It took 8.5 person-days to process this 10% sample.  At this rate, it would have taken over 90 days to process the entire dataset.
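The projection can be reproduced approximately with a short calculation. This is a sketch that assumes simple linear scaling from the sample to the roughly 8,700 orders; the slide does not state how the "over 90 days" figure was actually derived.

```python
# Figures taken from the slide; linear scaling is an assumption.
sample_records = 800
sample_person_days = 8.5
total_orders = 8700  # "over 8700" orders in the test dataset

days_per_record = sample_person_days / sample_records
projected_days = days_per_record * total_orders  # about 92 person-days
```

Under that assumption the full manual exercise comes to about 92 person-days, consistent with the "over 90 days" estimate.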

9  The next step was to automate the matching process for the entire dataset.  A fully automated matching solution was not really possible.  A mixed-model method combining automatic and manual matching was used to achieve 99% matching.  70% of records were matched automatically, without human involvement.  This match was on first name, surname and date of birth.

10  Additional sorting/matching algorithms simplified manual matching of the remaining 28%.  There were four additional stages, with progressively increasing human involvement.  These identified cases where, for example, age or address data did not match.  The processes were still mainly automated and algorithm-based, so fast to run.  The entire process was completed in 2 man-days, with 99% of all the records (7,800+) matched.  This compares with the projected 90+ man-days.

11  Step One.  Both datasets are sorted by names, addresses and dates of birth. NB: All datasets shown are merely representations, not actual data.

12 These are large datasets.

13

14  Step Two.  The probation and PULSE records are matched automatically by names and date of birth, using SAS.  70% of entries are matched automatically this way.  For each probation ID, the corresponding PULSE IDs are listed.  A person may have multiple PULSE IDs for a single probation ID.
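The original matching was done in SAS; the Python sketch below only illustrates the idea of an exact match on first name, surname and date of birth that yields a link table from each probation ID to one or more PULSE IDs. The field names and records are invented for illustration.

```python
from collections import defaultdict


def match_key(rec):
    """Exact-match key: case-insensitive names plus date of birth."""
    return (rec["first_name"].strip().lower(),
            rec["surname"].strip().lower(),
            rec["dob"])


def build_link_table(probation, pulse):
    """Return ({probation_id: [pulse_ids]}, [unmatched probation_ids])."""
    pulse_ids_by_key = defaultdict(list)
    for rec in pulse:
        pulse_ids_by_key[match_key(rec)].append(rec["pulse_id"])
    links, unmatched = {}, []
    for rec in probation:
        ids = pulse_ids_by_key.get(match_key(rec))
        if ids:
            links[rec["probation_id"]] = list(ids)
        else:
            unmatched.append(rec["probation_id"])  # left for later stages
    return links, unmatched


# Representative rows only, not actual data.
probation = [
    {"probation_id": "PR1", "first_name": "Mary", "surname": "Byrne", "dob": "1980-01-02"},
    {"probation_id": "PR2", "first_name": "Sean", "surname": "O'Brien", "dob": "1975-05-06"},
]
pulse = [
    {"pulse_id": "P1", "first_name": "mary", "surname": "byrne", "dob": "1980-01-02"},
    {"pulse_id": "P2", "first_name": "Mary", "surname": "Byrne", "dob": "1980-01-02"},
    {"pulse_id": "P3", "first_name": "Sean", "surname": "OBrien", "dob": "1975-05-06"},
]
links, unmatched = build_link_table(probation, pulse)
```

Here `PR1` links to two PULSE IDs, while `PR2` stays unmatched because the surname is spelled differently in the two sources, which is the kind of case the later stages address.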

15  Step Three.  The next step is to ensure that surnames with the prefix “O’” are recorded in the same manner in both datasets.  This step has minimal human involvement.  One dataset records “O’” as “O”.  This is not detected or matched in the initial stage.  It can be fixed with an automatic software “Replace” function.  When the automatic matching (Step Two) is run again, 85% of records now match automatically.
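A minimal sketch of such a "Replace" step, assuming the goal is a single canonical spelling of the prefix before matching is rerun. The function name `normalise_surname` and the choice of upper-case output are illustrative assumptions, not the CSO's actual rule.

```python
import re


def normalise_surname(surname):
    """Collapse "O'", "O’" and "O " prefixes to a bare "O" (upper-cased)."""
    s = surname.strip().upper()
    # A leading O followed by an apostrophe (straight or curly) or a space,
    # plus any further spaces, is replaced by the bare letter O.
    return re.sub(r"^O['\u2019\s]\s*", "O", s)


variants = ["O'Brien", "O\u2019Brien", "O Brien", "OBrien"]
normalised = {normalise_surname(v) for v in variants}
```

All four spellings reduce to the same key, so a rerun of the exact match treats them as the same surname. Surnames without the prefix pass through unchanged apart from the upper-casing.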

16  Step Four ◦ The next step is to match cases where the surname and date of birth match and the first names are closely related. ◦ This step has more human involvement. Geographical info is used as a further check. This allows us to find aliases. ◦ Example shown here:  It is clear that although “Liz” and “Elizabeth”, and “Alex” and “Lex”, differ, each pair refers to the same person.
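One way to shortlist such pairs for human review is sketched below. It assumes surname and date of birth have already matched, and flags first names where one contains the other or where they are highly similar. The containment rule and the 0.8 threshold are illustrative assumptions, not the rules used in the project.

```python
from difflib import SequenceMatcher


def first_names_related(a, b, threshold=0.8):
    """Flag two first names as likely variants of one another."""
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return True
    shorter, longer = sorted((a, b), key=len)
    # Short forms are often contained in the full name ("liz" in "elizabeth").
    if len(shorter) >= 3 and shorter in longer:
        return True
    # Otherwise fall back to an overall string-similarity score.
    return SequenceMatcher(None, a, b).ratio() >= threshold


candidates = [("Liz", "Elizabeth"), ("Alex", "Lex"), ("John", "Mary")]
flags = [first_names_related(x, y) for x, y in candidates]
```

A flagged pair is only a candidate: as the slide notes, geographical information and human judgement provide the further check.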

17  Step Five. ◦ Additional matching steps are then carried out.  One is to check for matching first names, surnames and geographical info where dates of birth differ.  Special checks can identify matching cases here. ◦ Another set of checks searches for matching first name and date of birth but slightly different surnames.  All these steps lead to a match rate of over 95%.  The final step is a fully manual operation to match the remaining 5%.
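The slide does not describe the "special checks", so the sketch below is only one plausible form they could take: a date-of-birth comparison that tolerates a single mistyped character or a swapped day and month, and a surname comparison based on string similarity. The function names, the ISO date format and the 0.85 threshold are all assumptions.

```python
from difflib import SequenceMatcher


def dob_close(a, b):
    """True if two YYYY-MM-DD dates differ by at most one character,
    or have day and month swapped."""
    if len(a) != len(b):
        return False
    differing = sum(1 for x, y in zip(a, b) if x != y)
    if differing <= 1:
        return True
    year_a, month_a, day_a = a.split("-")
    year_b, month_b, day_b = b.split("-")
    return year_a == year_b and month_a == day_b and day_a == month_b


def surname_close(a, b, threshold=0.85):
    """True if two surnames are near-identical by similarity ratio."""
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return ratio >= threshold


typo = dob_close("1985-03-12", "1985-03-13")      # one digit differs
swapped = dob_close("1985-03-12", "1985-12-03")   # day and month swapped
different = dob_close("1985-03-12", "1990-07-25")
near_surname = surname_close("Byrne", "Byrnes")
```

Pairs passing either check would still need the other variables to agree, and review by a person, before being accepted as a match.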

18  The CSO produced detailed results from this linkage.  Tables were produced showing:  Table A: Number of subsequent First Offences (recidivism), during the period 2008-11, by individuals with probation orders issued in 2007-08  Table B: Subsequent First Offences (recidivism), during the period 2008-11, by individuals with probation orders issued in 2007-08, as a percentage of the Original Primary Offence  Table C: Subsequent First Offence (recidivism) by individuals, during the period 2008-11, with probation orders issued in 2007-08, as a percentage of total original primary offences  Table D: Subsequent First Offence (recidivism) during the period 2008-11 of individuals with probation orders issued in 2007-08, as a percentage of total subsequent First Offences  Unfortunately, we can show only sample data here.

19

20

21  Further development of the matching model.  To incorporate text analysis and fuzzy matching.  To develop a fully automatic process that matches 99% of records.

22  This project shows a simple, effective solution for integrating datasets in the absence of a common identifier.  This project does not diminish the importance of developing unique identifiers. ◦ But it does allow matching of records where it is not feasible to apply a planned common identifier retroactively.  This method is not limited to Criminal Justice Administrative Data. ◦ It can be applied to any datasets with common information on names, dates of birth etc.

