Identity Matching: SSNs Are Not Enough!
John Sabel, ERDC
ARRA SLDS Conference, May 21, 2014
About the ERDC
RCW established the Education Research & Data Center (ERDC) in the Washington State Office of Financial Management (OFM). In collaboration with statutory partner agencies, representing education and employment, and the Legislative Evaluation and Accountability Program (LEAP) committee, ERDC conducts analyses of early learning, K-12, and higher education programs and of education and workforce issues across the P-20W system.

ERDC Vision
To promote a seamless, coordinated preschool-to-career (P-20W) experience for all learners by providing objective analysis and information.

ERDC Mission
To develop longitudinal information spanning the P-20W system in order to facilitate analyses, provide meaningful reports, collaborate on education research, and share data.

ERDC Values
1. Coordinate, facilitate, build upon and enhance the education data collection and analysis already being done by multiple agencies and institutions.
2. Adhere strictly to both the letter and spirit of privacy laws affecting individual student record data and be sensitive to other privacy concerns.
3. Achieve consensus wherever possible among participating agencies and institutions in determining the best data and research available to help guide the implementation of P-20W goals.
4. Conduct all business, data development and research in an open and transparent fashion (to the extent allowed by privacy laws), with the full inclusion of education agencies, organizations, and institutions as well as legislative participants.
About the P20W Data Warehouse
The ERDC is the owner and user of the State of Washington’s P20W Data Warehouse. The system is hosted by the Department of Enterprise Services.

The P20W Data Warehouse is a statewide longitudinal data system that includes de-identified data about people's early childhood, kindergarten through 12th grade, higher education, and workforce experiences and performance (hence the name P20W). The data are collected and linked from existing state agency data systems. They cover the kinds of services people receive, the programs in which they participate, and their academic performance and program or degree completion. They also include a variety of demographic data, so that different groups of people can be examined.

Personally identifiable information, such as names, Social Security numbers, addresses, and other data that can identify a person as an individual, is not part of the research database.
If SSNs Were Perfect and Ubiquitous …
SELECT K12.*, College.*
FROM K12
INNER JOIN College ON K12.SSN = College.SSN
SSNs Are Not Perfect
A person's actual SSN can differ from the recorded SSN for any number of reasons:
- Transcription error.
- Wrong SSN recorded, for example a parent filling in their own SSN on their child's Running Start application.
- An incorrect SSN intentionally entered on a form.
Multiple SSNs per P20ID
In the ERDC P20W data warehouse, an individual P20ID (unique person ID) sometimes has more than one SSN:
Multiple P20IDs per SSN
Conversely, some SSNs are shared by more than one P20ID:
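Both situations can be surfaced with a grouped count over a person-to-SSN linkage table. The following is a minimal sketch using Python's built-in `sqlite3`; the table name `person_ssn`, its columns, and the sample rows are invented for illustration and are not the warehouse's actual schema.

```python
import sqlite3

# Hypothetical linkage table: one row per (P20ID, SSN) pair seen in source records.
rows = [
    ("P1", "111-11-1111"),
    ("P1", "111-11-1112"),  # same person, second SSN (e.g. a transcription error)
    ("P2", "222-22-2222"),
    ("P3", "222-22-2222"),  # same SSN recorded for two different people
    ("P4", "333-33-3333"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person_ssn (p20id TEXT, ssn TEXT)")
conn.executemany("INSERT INTO person_ssn VALUES (?, ?)", rows)

# P20IDs that carry more than one distinct SSN.
multi_ssn = conn.execute(
    "SELECT p20id, COUNT(DISTINCT ssn) FROM person_ssn "
    "GROUP BY p20id HAVING COUNT(DISTINCT ssn) > 1 ORDER BY p20id"
).fetchall()

# SSNs that are attached to more than one distinct P20ID.
shared_ssn = conn.execute(
    "SELECT ssn, COUNT(DISTINCT p20id) FROM person_ssn "
    "GROUP BY ssn HAVING COUNT(DISTINCT p20id) > 1 ORDER BY ssn"
).fetchall()

print(multi_ssn)
print(shared_ssn)
```

Either result set is a list of candidates for review, not a list of errors: a shared SSN may point to a bad SSN or to two P20IDs that should be merged.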
Ways to Address Imperfect SSNs
ERDC is using or developing a number of ways to validate or invalidate SSNs:
- Frequency and use analysis of P20IDs and SSNs in the P20W data warehouse.
- Comparison of the last 4 digits of SSNs with Department of Licensing data.
- Using the Social Security Administration's Death Master File and the Washington Department of Health Death Names file to find SSN area/group numbers (the first 5 digits of an SSN) that have never been issued.
- Using the Social Security Administration's High Group list to find when SSN area/group numbers were issued. Data are readily available only from November 2003 to June 24; on June 25, 2011, SSN randomization began.
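A cheaper first pass, before any of the file-based checks above, is a purely structural test. The sketch below is not one of the methods listed on the slide. It only applies the Social Security Administration's general rules that area numbers 000, 666, and 900-999, group number 00, and serial number 0000 are never assigned. The function name is hypothetical.

```python
import re

def is_structurally_valid_ssn(ssn: str) -> bool:
    """Return False for SSNs that can never have been issued.

    Passing this check does NOT mean the SSN is real or belongs to the person;
    it only screens out impossible values.
    """
    digits = re.sub(r"[-\s]", "", ssn)          # drop dashes and spaces
    if not re.fullmatch(r"[0-9]{9}", digits):   # must be exactly nine digits
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or area >= "900":  # never-assigned area numbers
        return False
    if group == "00" or serial == "0000":        # never-assigned group/serial
        return False
    return True

print(is_structurally_valid_ssn("123-45-6789"))
print(is_structurally_valid_ssn("666-12-3456"))
```

SSNs that fail this test can be set aside immediately; those that pass still need the frequency, Department of Licensing, and issuance-date checks.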
SSNs Are Not Ubiquitous
Even if SSNs were perfect, fewer than half of the P20IDs in the P20W data warehouse have them:
Any Identifier Has Similar Problems. So What to Do?
Along with SSNs, any “global” identifier will have some or all of these problems. So what to do? Add additional identifiers for identity matching:
- First, middle, and last names
- Birth date
- Gender
- School/college codes
- District codes
All that said, SSN really is an excellent identity matching variable to have.
Using a Larger Set of Identifiers for Identity Matching
Identity matching is split into three phases:
1. Deterministic matching, automerge: Always strive first to minimize false positives, and then try to minimize false negatives. Matched records are merged automatically.
2. Probabilistic matching, automerge: Additional matches are found and merged automatically.
3. Probabilistic matching, manual merge: Additional potential matches are manually reviewed and then selectively merged.
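The three phases can be pictured as a routing decision for each candidate pair of records. The sketch below makes several assumptions: the slides do not say how ERDC scores pairs, so the agreement score (share of comparable identifiers that agree), the two thresholds, and the field names are all hypothetical stand-ins for a real probabilistic model.

```python
# Hypothetical identifier set, loosely following the identifiers listed in the deck.
FIELDS = ("first_name", "last_name", "birth_date", "gender", "ssn")

def agreement_score(a: dict, b: dict) -> float:
    """Share of identifiers present in both records that agree (illustrative only)."""
    compared = [(a.get(f), b.get(f)) for f in FIELDS if a.get(f) and b.get(f)]
    if not compared:
        return 0.0
    return sum(x == y for x, y in compared) / len(compared)

def route_pair(a: dict, b: dict, auto_threshold: float = 0.95,
               review_threshold: float = 0.75) -> str:
    """Assign a candidate pair to one of the three phases, or reject it."""
    # Phase 1: every identifier is present and identical.
    if all(a.get(f) and a.get(f) == b.get(f) for f in FIELDS):
        return "deterministic automerge"
    score = agreement_score(a, b)
    if score >= auto_threshold:      # Phase 2: confident enough to merge unreviewed
        return "probabilistic automerge"
    if score >= review_threshold:    # Phase 3: send to a human reviewer
        return "manual review"
    return "no match"

base = {"first_name": "ANA", "last_name": "RIVERA",
        "birth_date": "2001-03-07", "gender": "F", "ssn": "123456789"}
print(route_pair(base, dict(base)))
print(route_pair(base, {**base, "ssn": None}))
print(route_pair(base, {**base, "first_name": "ANNA"}))
```

Keeping the automerge threshold high reflects the stated priority of minimizing false positives first; pairs that fall short are left for manual review instead of being merged.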
Deterministic Matching, Automerge Example
* Collapsed DOB is a birth date that has been transformed so that birth dates with the same birth year but inverted birth months and days have the same value.
(Figure label: Set of all true positive matches)
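One way to obtain the collapsed DOB property described in the footnote is to store the month and day in sorted order next to the year. This is a sketch of that idea, not necessarily the transformation ERDC uses; the function name `collapse_dob` is hypothetical.

```python
from datetime import date

def collapse_dob(dob: date) -> tuple[int, int, int]:
    """Map a birth date to (year, smaller of month/day, larger of month/day).

    March 7 and July 3 of the same year collapse to the same value, so a
    month/day swap made during data entry no longer blocks a match.
    """
    low, high = sorted((dob.month, dob.day))
    return (dob.year, low, high)

print(collapse_dob(date(2001, 3, 7)))
print(collapse_dob(date(2001, 7, 3)))
```

Dates whose day is greater than 12 have no valid inverted counterpart, so for them the collapsed value behaves like the original date.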
Manual Review of Potential Match Pairs
Potential match pairs are brought into Excel for manual review. Cell pairs that are not alike are color coded: red means the cells are different, and yellow means one cell has no data. Each potential match pair is classified in the “Class” variable according to similarities in the different identifiers. This allows the potential match pairs to be sorted.
(Invented example data)
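The color rule and a sortable class value can be sketched in a few lines. The slide does not say how “Class” is encoded, so the one-letter-per-field code below (S = same, D = different, M = missing) is an assumption, as are the function names and the choice to treat a pair with both cells blank as yellow.

```python
def compare_cells(left, right):
    """Return 'yellow' if either cell is blank, 'red' if they differ, else None."""
    a = (left or "").strip().upper()
    b = (right or "").strip().upper()
    if not a or not b:   # assumption: a blank on either side (or both) is yellow
        return "yellow"
    if a != b:
        return "red"
    return None          # cells agree, no highlight

# Hypothetical encoding for the Class variable: one letter per identifier.
CODES = {None: "S", "red": "D", "yellow": "M"}

def classify_pair(rec_a: dict, rec_b: dict, fields) -> str:
    """Build a class string such as 'MSDS' that groups similar pairs when sorted."""
    return "".join(CODES[compare_cells(rec_a.get(f), rec_b.get(f))] for f in fields)

fields = ("ssn", "last_name", "first_name", "birth_date")
a = {"ssn": "123-45-6789", "last_name": "Rivera", "first_name": "Ana", "birth_date": "2001-03-07"}
b = {"ssn": "", "last_name": "RIVERA", "first_name": "Anna", "birth_date": "2001-03-07"}
print(classify_pair(a, b, fields))
```

Sorting the review sheet on this string places all pairs with the same agreement pattern together, so a reviewer can work through one pattern at a time.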
Other Methods to Improve Match Rate
ERDC uses several other techniques to improve the match rate:
- Rigorous standardization of all name fields.
- Bringing additional fields, such as school history over time, into manual review.
- County affinity matrices.
- Use of name change data.
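As an illustration of the first technique, name standardization typically removes differences in case, accents, punctuation, and spacing before names are compared. The rules below are a generic sketch under that assumption, not ERDC's actual standardization rules; `standardize_name` is a hypothetical helper.

```python
import re
import unicodedata

def standardize_name(name: str) -> str:
    """Uppercase, strip accents, drop punctuation, and collapse whitespace."""
    # Decompose accented letters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    upper = stripped.upper()
    # Apostrophes and periods vanish (O'Brien -> OBRIEN); everything else
    # that is not a letter, including hyphens and commas, becomes a space.
    upper = re.sub(r"['’.]", "", upper)
    upper = re.sub(r"[^A-Z]+", " ", upper)
    return " ".join(upper.split())

print(standardize_name("  O'Brien "))
print(standardize_name("José  García-López"))
```

Letters that do not decompose to an A-Z base character are replaced by a space in this sketch, so a production rule set would need an explicit transliteration table for them.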
How to Obtain P-20W Data
ERDC Data Request Process
Please go to the ERDC’s “Accessing P-20W Data” page at:
1. Fill out the Data Request Form and send it to ERDC.
2. ERDC calls the requestor to clarify the request, if necessary.
3. If the request is changed, ERDC sends the changes to the requestor for approval.
4. ERDC sends the data request, which includes the study questions and data requested, to the data contributors.
5. Data contributors have 5 days to review and respond to the requestor about the data requested.
6. Requestor works with ERDC to revise the request based on feedback, if necessary.
7. ERDC creates a data sharing agreement (DSA) with the requestor to share the linked, de-identified data.
   a. Copy of signed DSA will be made available by ERDC via the website or
8. ERDC works to get the data to the requestor.
9. Requestor works with the data and contacts data contributors with questions about their data.
10. Requestor sends a draft report to ERDC for distribution to the data contributors.
11. Data contributors have 10 days to review the report and respond to the requestor with comments about the use of the data.
12. Requestor releases the report.
Contact the ERDC
ERDC Website
ERDC Mailing Address: P.O. Box Olympia, WA
ERDC Phone/Fax: Phone: (360) Fax: (360)
John Sabel