Protecting Information in the NIEM Lifecycle Using Synthetic Data. 15 December 2009. Nykia Jackson, Barbara Shapter, Kim Sterret-Day. Johns Hopkins University.


1 Protecting Information in the NIEM Lifecycle Using Synthetic Data
15 December 2009
Nykia Jackson, Barbara Shapter, Kim Sterret-Day
Johns Hopkins University Applied Physics Laboratory

2 Introduction
JHU/APL for DHS S&T CCI in support of DHS EDMO
Objective: to help EDMO and the NIEM community
  By learning of needs and gaps
  By exploring technologies that fill critical gaps in the NIEM lifecycles of model management, IEPD development, and implementation support
12/15/09 NIEM Blue Team Tools Day

3 Filling a Gap
Objective: to develop a proof-of-concept capability to create test data for an information exchange using synthetic data
At the practitioner level
  Need a tool to aid the testing of an implementation by the generation of safe test data

4 Vision for Solution
[Diagram labels: Synthetic Data; SYNINGEN]

5 Design
Proof-of-concept for one IEPD
Synthetic Instance Generator (SYNINGEN)
  Synthetic data source
  Embedded database
  Dynamic insertion of controlled erroneous data
[Diagram labels: SYNINGEN; Synthetic Data; Test Records; Schema Binding; IEPD Schemas; Pre-process]
9/17/09 NIEM Blue Team Tools Day
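The "dynamic insertion of controlled erroneous data" idea can be illustrated with a minimal Python sketch. The slides do not say which kinds of errors SYNINGEN inserts or how it selects records, so the injector names (`blank`, `overlong`, `bad_date`), the `inject_errors` function, and the `rate` and `seed` parameters below are all hypothetical. The point is only that errors are applied at a known rate and their positions are recorded, so a tester knows which records an implementation ought to reject.

```python
import copy
import random

# Hypothetical error injectors; the slides do not list SYNINGEN's actual error types.
INJECTORS = {
    "blank": lambda value: "",                         # required field left empty
    "overlong": lambda value: str(value) + "X" * 300,  # value exceeding a typical length limit
    "bad_date": lambda value: "2009-13-45",            # impossible month and day
}


def inject_errors(records, field, kind, rate, seed=0):
    """Corrupt `field` in roughly `rate` of the records.

    Returns (new_records, corrupted_indexes). The input list is not modified,
    and a fixed seed makes the choice of corrupted records repeatable.
    """
    rng = random.Random(seed)
    corrupt = INJECTORS[kind]
    out = copy.deepcopy(records)
    corrupted = []
    for index, record in enumerate(out):
        if rng.random() < rate:
            record[field] = corrupt(record[field])
            corrupted.append(index)
    return out, corrupted


if __name__ == "__main__":
    good = [
        {"DriverLicenseNumber": "T0000001", "DOB": "1970-01-01"},
        {"DriverLicenseNumber": "T0000002", "DOB": "1982-05-17"},
    ]
    bad, where = inject_errors(good, "DOB", "bad_date", rate=0.5, seed=1)
    print(where, bad)
```

Returning the corrupted indexes alongside the data is what makes the errors "controlled": the test harness can compare the records a service rejects against the records known to be bad.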

6 IEPD Selection
Requirements
  Consists of a cross-section of commonly used data fields
  Contains a minimal amount of domain-specific data, to utilize the capabilities of the previously developed synthetic data generator
  Provides a concrete method for assessing IEPD implementation / test data via development of a web service (WS)
Selection: CONNECT Driver License Search IEPD
  Designed to facilitate the effective exchange of criminal justice information amongst the CONNECT states (Alabama, Nebraska, Wyoming, Tennessee, and Kansas)
  Defines the driver license search parameters, driver license search results (summary), and driver license details

7 Demonstration
Generate Test Data
  Good values
  Bad values
Test Web Service Client
[Diagram labels: Web Service; Driver License Query Client; SYNINGEN; Synthetic Data; Test Records; Schema Binding; IEPD Schemas; Pre-process]

8 Synthetic Data Generation
[Diagram labels: Synthetic Data; SYNINGEN]

9 Why Use Synthetic Data?
Need for large-scale, high-quality synthetic datasets to support DHS Test and Evaluation (T&E) activities
  Designing
  Modeling
  Testing (including usability)
  Training
  Tool studies
Need poses privacy protection challenges due to lack of access to actual data, i.e., Personally Identifiable Information (PII), and other access limitations
Four possible data methods available to address data needs (with limitations)
  Use of actual data
  Sanitized or anonymized data
  Manually created fictitious data
  Machine-generated large-scale datasets from real-world models, algorithms, or reference statistical patterns

10 Synthetic Data Generator (SDG)
Synthetic data: datasets composed entirely of fictitious data that can be used in a given context (or situation) instead of directly measurable or accessible actual data
Prototype developed for Department of Homeland Security (DHS) Science and Technology (S&T) Command, Control, and Interoperability (CCI) Division
Automatic capability to produce robust datasets composed of entities (e.g., people with behaviors/footprints over time)
Creates synthetic test data that models a community with highly connected social networks of entities and relationships
Data reflects typical daily activities in which people travel, communicate, and spend money in ways that are normally expected in a reasonable world
Datasets are in simple delimited text format
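Because the SDG datasets are described as simple delimited text, they can be loaded with standard tooling. The sketch below uses Python's `csv` module. The slides do not state the delimiter or whether files carry a header row, so the pipe delimiter, the header row, and the `read_delimited` helper are assumptions made for illustration only.

```python
import csv
import io


def read_delimited(text, delimiter="|"):
    """Parse delimited text with a header row into a list of dicts.

    The pipe delimiter and the presence of a header row are assumptions;
    the slides only say the format is "simple delimited text".
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


if __name__ == "__main__":
    # Illustrative rows only; not taken from an actual SDG dataset.
    sample = (
        "PersonID|GivenName|FamilyName|AddressCity|AddressState\n"
        "1|Eladio|Berstis|Lansing|Michigan\n"
        "2|Soto|Berstis|Providence|Rhode Island\n"
    )
    for row in read_delimited(sample):
        print(row["PersonID"], row["GivenName"], row["FamilyName"])
```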

11 SDG Data
JHU/APL developed the concept and rules that characterize the reasonable world
Categories of interest
  Demographics (including immigration)
  Social networks
  Communication patterns
  Travel
  Financial transactions (including consumer spending)
Produce datasets that are consistent in time and space
[Diagram labels: Person; City; travels; Credit Card Number; Credit Card Transaction; purchases; Phone Number; communicates using; Call Transcript; caller; receiver]

12 Current Synthetic Fields
PERSON: PersonID, BinaryBase64Object, BinaryDescriptionText, BinaryFormatID, BinaryFormtStandardText, BinaryCategoryText, GivenName, FamilyName, MiddleName, Suffix, Citizenship, PassportNumber, DriverLicenseNumber, DriverLicenseState, DriverLicenseExpirationDate, DriverLicenseIssueDate, DOB, EthnicityCode, EthnicityText, EyeColorCode, EyeColorText, GenderCode, GenderText, HairColorCode, HairColorText, HeightInches, WeightPounds, AddressStreetNumber, AddressStreetName, AddressCity, AddressCounty*, AddressState, AddressPostalCode, AddressPostalExtensionCode
PHONE_NUMBER: PersonID, Type (Landline or Mobile), Number
CREDIT_CARD_TRANSACTION: PersonID, TransactionNumber, CreditCardNumber, PurchaseCity, Date, Amount, Company, Industry
PHONE_CALL: PersonID, Date, DurationSeconds, Type, FromCityID, FromNumber, ToCityID, ToNumber
TRAVEL: PersonID, FromCityID, ToCityID, Date
CITY: CityID, City, State, Region, Country
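The field lists above map naturally onto relational tables keyed by PersonID and CityID. The following sketch loads a subset of them into an in-memory SQLite database. The slides mention an embedded database but do not name the product, so SQLite, the column types, the constraints, and the helper names `build_db` and `phone_numbers_for` are assumptions chosen for illustration.

```python
import sqlite3

# Subset of the slide's tables. Column types and constraints are assumptions.
SCHEMA = """
CREATE TABLE CITY (
    CityID INTEGER PRIMARY KEY, City TEXT, State TEXT, Region TEXT, Country TEXT
);
CREATE TABLE PERSON (
    PersonID INTEGER PRIMARY KEY, GivenName TEXT, FamilyName TEXT, DOB TEXT,
    DriverLicenseNumber TEXT, DriverLicenseState TEXT
);
CREATE TABLE PHONE_NUMBER (
    PersonID INTEGER REFERENCES PERSON(PersonID),
    "Type" TEXT CHECK ("Type" IN ('Landline', 'Mobile')),
    "Number" TEXT
);
CREATE TABLE TRAVEL (
    PersonID INTEGER REFERENCES PERSON(PersonID),
    FromCityID INTEGER REFERENCES CITY(CityID),
    ToCityID INTEGER REFERENCES CITY(CityID),
    "Date" TEXT
);
"""


def build_db():
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def phone_numbers_for(conn, person_id):
    """Return (Type, Number) pairs for one person, ordered by number."""
    cursor = conn.execute(
        'SELECT "Type", "Number" FROM PHONE_NUMBER '
        'WHERE PersonID = ? ORDER BY "Number"',
        (person_id,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = build_db()
    conn.execute(
        "INSERT INTO PERSON (PersonID, GivenName, FamilyName) VALUES (1, 'Eladio', 'Berstis')"
    )
    # Fictitious placeholder numbers.
    conn.executemany(
        "INSERT INTO PHONE_NUMBER VALUES (?, ?, ?)",
        [(1, "Landline", "555-0100"), (1, "Landline", "555-0101")],
    )
    print(phone_numbers_for(conn, 1))
```

Keeping PersonID as the join key mirrors the slide, where every non-CITY table carries a PersonID column.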

13 Synthetic Data Sample
Bio: Eladio Berstis, a USA citizen, lives in Lansing, Michigan, and was born on June 8, [...]. He subscribes to two landline telephone numbers: [...] and [...]. He shares these numbers with family members who live with him. Eladio has a relative, Soto Berstis, who lives in Providence, Rhode Island. Eladio calls Soto regularly. Eladio owns two MasterCard credit cards.

14 Utility
Developed prototype web portal interface to SDG
  User specifies characteristic attributes for a dataset through this interface
  Has been extended to generate other domain-specific data
Applications
  North American Threat (NAT) Dataset for intelligence analysis
  Privacy Protection Technology
  NIEM Test Data
  Suspicious Activity Reports (SARs)
Datasets have been generated and distributed to research institutions and agencies

15 Feedback
CY2010 Q1: Delivery of SYNINGEN software to DHS EDMO
  Independent of IEPD
Definition of fields desired in a synthetic dataset for NIEM
  What is useful?
  What level of fidelity is desired for the reasonable world?
Feedback welcomed

16 Backup Slides

17 How reasonable is the data?
Names
  People in the same family tend to share the same family name
  Western naming convention of a single first name and single last name
  Many companies do not have realistic names
  Actual cities around the world
  Affiliations can have either actual (e.g., Al Qaeda) or fictitious (e.g., Augusta Gang) names
Travels
  Does not specify the transportation modes for travels
  Tracks people at the city level (data does not tell us whether a person was seen at a restaurant)
  No more than one travel event in one day
Phone Calls
  Simplified communications among people
  Access to both landline and mobile phone numbers
  Mobile number is owned by only one person
  Landline phone number may be used by a number of people
  Two phone calls originating from the same phone number will not overlap in time (no guarantee that a phone number could not be a receiver and a caller at the same time)
Credit Card Transactions
  Types of data that people are likely to find in their own monthly statements: date, amount, company, and industry
  Transactions occur in the same city as a person's current location
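Two of the reasonable-world rules listed above (at most one travel event per person per day, and no overlapping calls originating from the same number) are easy to check mechanically. The sketch below is one possible validator, not the SDG's own logic. The function names and the tuple layouts are hypothetical, loosely following the TRAVEL and PHONE_CALL fields, and it assumes the Date of a call is its start time.

```python
from collections import Counter
from datetime import datetime, timedelta


def travel_violations(travels):
    """Find (person_id, date) pairs with more than one travel event.

    `travels` is an iterable of (person_id, date_string) tuples.
    """
    counts = Counter(travels)
    return sorted(key for key, n in counts.items() if n > 1)


def overlapping_calls(calls):
    """Find calls that start before an earlier call from the same number ends.

    `calls` is an iterable of (from_number, start_datetime, duration_seconds).
    Returns a list of (from_number, start_datetime) for each offending call.
    Only originating numbers are checked, matching the slide's caveat that a
    number may still be caller and receiver at the same time.
    """
    by_number = {}
    for number, start, duration in calls:
        end = start + timedelta(seconds=duration)
        by_number.setdefault(number, []).append((start, end))

    offenders = []
    for number, spans in by_number.items():
        spans.sort()
        latest_end = None  # latest end time seen so far for this number
        for start, end in spans:
            if latest_end is not None and start < latest_end:
                offenders.append((number, start))
            latest_end = end if latest_end is None else max(latest_end, end)
    return offenders


if __name__ == "__main__":
    t0 = datetime(2009, 12, 15, 10, 0, 0)
    calls = [
        ("555-0100", t0, 600),                         # 10:00 to 10:10
        ("555-0100", t0 + timedelta(minutes=5), 60),   # starts mid-call
        ("555-0101", t0, 600),                         # different number
    ]
    print(overlapping_calls(calls))
    print(travel_violations([(1, "2009-12-15"), (1, "2009-12-15"), (2, "2009-12-15")]))
```

Tracking the latest end time, rather than comparing only adjacent calls, also catches a long call that overlaps a later, non-adjacent one.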

