Sensitive Information Sweep

1 Sensitive Information Sweep
Using Cornell’s Spider
Wyman Miles, Cornell University
Kerry Havens, University of Colorado at Boulder
Steve Lovaas, Colorado State University

2 Overview
Quick Background
The Technical Problem (Kerry)
The Organizational Problem (Steve)
Spider (Wyman)
Summary & Questions

3 What is “Sensitive Information”?
A Growing Concern
A Moving Target
SSN, Credit Card, Driver’s License, Medical Records, Student Information, Proprietary Research,…
Data in Context – Aggregation

4 Why Are We All Here?
The Front Page!
CDW-G 2006 Survey – more than 3 million college students may have lost personal information in the last year.
Identity theft is the fastest growing crime in the U.S.
By far the biggest culprit? Lost or stolen computers.

5 Regulations, Standards, & Laws
Federal – HIPAA, FERPA, SarbOx, GLB,… Identity Theft Protection Act?
State – Many states are passing identity theft protection laws; New York & Colorado have a state CISO
Industry – PCI DSS

6 The Technical Problem: Finding sensitive information in a haystack
Kerry Havens, University of Colorado at Boulder
Some background information on how we use Spider on desktops and servers. In addition to Google Hacks…

7 SSN Remediation
At CU-Boulder, SSNs were used as a student identifier before 2004.
A House Bill was approved in 2003 requiring institutions to change this method to ensure the privacy of a student’s social security number.
CU-Boulder started issuing student IDs to new students in July 2004 and converting SSNs to SIDs in 2005.
Because SSNs were used as an identifier in the recent past, there are still data elements on public facing servers. Most of the time, the department forgot the data was there, or the data does not need to stay on the public server and can be securely archived or deleted.

8 Where the data is not stored
File type exclusions – fine tuning: binary files where the data cannot be read; received input from community for fine tuning
False positives: international telephone numbers; examples for web form validation
Why is the department webpage asking for SSNs?
Identifying certain file types and data known to be false positives, and skipping that data when scanning with Spider, will improve efficiency. File type exclusions include .exe, .dll, .iso, and so on. The list of binary file types can be extensive, and input from the community will help add to it. At CU-Boulder, we ran into several instances of false positives that were valid data (not binary data). These included things like international phone numbers and example SSNs for web form validation. While the data was not a valid SSN, it raises the question: why is the department taking SSN information?
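The extension skip list described on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Spider's own code: the names `SKIP_EXTENSIONS`, `should_scan`, and `files_to_scan` are hypothetical, and the list holds only the three extensions the slide names.

```python
from pathlib import Path

# Hypothetical skip list: only the extensions named on the slide.
# A real list would be longer and tuned per environment.
SKIP_EXTENSIONS = {".exe", ".dll", ".iso"}


def should_scan(path):
    """Return True unless the file extension is on the skip list."""
    return Path(path).suffix.lower() not in SKIP_EXTENSIONS


def files_to_scan(paths):
    """Keep only the paths that should be handed to the scanner."""
    return [p for p in paths if should_scan(p)]
```

Comparing the lowercased suffix means `SETUP.EXE` and `setup.exe` are treated the same, which matters on filesystems where case is not normalized.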

9 OS and File Encoding Problems
HTML encoding problems
Representations (pictures) of sensitive data are not found; examples include PDF
Searching a UNIX filesystem
Preparing the file before searching for private data; for example, using strings to extract text from text/binary hybrids like .doc or .xls
There have been instances where code in an HTML file hits as a false positive, but we find a lot of sensitive data in HTML files. Allowing these false positives was a better approach than missing valid private data. One of the problems with skipping file types is the possibility of skipping binary data that actually contains private data. While examining the binary data of a .pdf file will not reveal the private data, the file could actually be a bitmap representation of a form with an SSN or credit card number (like tax forms). When searching a UNIX filesystem, we discovered that hybrid text/binary files (like MS Office files) would reveal many false positives. The challenge is to prepare the file before sweeping for private data. Strings worked well for extracting the text from an Excel or Word file. However, different file types may require different preparation (such as archives and .pst files).
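The preparation step can be approximated without the strings utility itself. The sketch below is an assumption about how such a step might look, not the procedure used at CU-Boulder: it pulls runs of four or more printable ASCII bytes out of a binary blob, which is similar in spirit to what strings does by default. The names `PRINTABLE_RUN` and `extract_strings` are hypothetical.

```python
import re

# Runs of 4 or more printable ASCII bytes (space through tilde).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(data):
    """Return the printable ASCII runs found in a bytes object."""
    return [run.decode("ascii") for run in PRINTABLE_RUN.findall(data)]
```

The extracted text, rather than the raw file, would then be passed to the pattern search. Text stored in a multi-byte encoding such as UTF-16 is not picked up by this sketch, so it is a starting point rather than a complete preparation step.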

10 Where the data is stored
Typical file types of discovered data:
Gradebooks
Course web pages
Homework assignments
Travel authorization forms
Personal financial documents

11 Regular Expressions
Returns too much data: /\d{3}-\d{2}-\d{4}/
Searching for environment specific data in the hope that common data will lead us to more data: /\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4} | (52[1-4]|65[0-3])\d{6})\b/
State specific information can be found at the Social Security Administration’s website.
The default regular expression for SSNs in Spider is too broad and returns too much data. We’ve also noticed that SSNs are not always delimited with dashes when we find them in the wild. But a non-delimited, nine-digit number is not going to be easy to find. That is why we used custom regular expressions. Custom regular expressions are the leading feature of Spider. Without them, the false positives would be too numerous for the program to be useful. These customizations included Colorado-specific prefixes for SSNs and CU-Boulder specialized credit cards that we found would lead us to more data. Usually, there will be at least one student from Colorado in a course or even a section of courses, and that match will lead us to more data. We used the state-specific prefixes that can be found on the Social Security Administration’s website.
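To see the difference between the two expressions, they can be run side by side in Python. This is a sketch under two stated assumptions: the spaces around the alternation in the slide are treated as layout and dropped (Python would otherwise match them literally), and the delimiter class is written as `[-\s]` because `|` inside a character class is a literal pipe rather than "or". The names `BROAD`, `CUSTOM`, and `find_all` are hypothetical.

```python
import re

# Default-style pattern from the slide: any dash-delimited 3-2-4 digit group.
BROAD = re.compile(r"\d{3}-\d{2}-\d{4}")

# Custom pattern from the slide, adapted as described above
# (no literal spaces around the alternation, no pipe in the delimiter class).
CUSTOM = re.compile(
    r"\b([0-7]\d{2}[-\s]\d{2}[-\s]\d{4}|(52[1-4]|65[0-3])\d{6})\b"
)


def find_all(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]
```

On a sample such as `"ID 912-34-5678, SSN 123 45 6789, raw 521123456"`, the broad pattern reports only the dash-delimited number that starts with 9, while the custom pattern skips it and reports the space-delimited number and the undelimited number with a Colorado prefix.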

12 Regular Expressions
Let’s dissect this…
/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4} |
(52[1-4]|65[0-3])\d{6})\b/
This may give you some insight into what goes into building the regular expressions and why they are so useful. The first line checks for delimited SSNs from anywhere in the country; the second line checks for SSNs from Colorado that are not delimited.

13 Regular Expressions
Let’s dissect this…
/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4} | (52[1-4]|65[0-3])\d{6})\b/
Boundary

14 Regular Expressions
Let’s dissect this…
/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4} | (52[1-4]|65[0-3])\d{6})\b/
First acceptable digit
8s and 9s are not used as the first digit for SSNs, but they are important to find for SIDs or foreign students.

15 Regular Expressions
Let’s dissect this…
/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4} | (52[1-4]|65[0-3])\d{6})\b/
2, 4, or 6 digits in a row

16 Regular Expressions
Let’s dissect this…
/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4} | (52[1-4]|65[0-3])\d{6})\b/
Delimited by dash or space

17 Regular Expressions
Let’s dissect this…
/\b([0-7]\d{2}[-|\s]\d{2}[-|\s]\d{4} | (52[1-4]|65[0-3])\d{6})\b/
Colorado specific prefix, not delimited
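The slide-by-slide dissection can be kept next to the pattern itself by using Python's verbose regex mode, where whitespace is ignored and `#` starts a comment. This is a sketch, not Spider's configuration syntax: the name `SSN_PATTERN` is hypothetical, and the delimiter class is written as `[-\s]` on the assumption that the `|` in the slide's `[-|\s]` was meant as "or" rather than a literal pipe.

```python
import re

SSN_PATTERN = re.compile(
    r"""
    \b                      # boundary: no digit or letter directly before
    (
        [0-7]\d{2}          # first acceptable digit is 0-7, then two more
        [-\s]               # delimited by dash or whitespace
        \d{2}               # two digits in a row
        [-\s]               # delimited by dash or whitespace
        \d{4}               # four digits in a row
      |
        (52[1-4]|65[0-3])   # Colorado specific prefix, not delimited
        \d{6}               # six digits in a row
    )
    \b                      # boundary: no digit or letter directly after
    """,
    re.VERBOSE,
)
```

Writing the pattern this way makes it easier to swap in another state's prefixes or a local ID format without losing track of which part does what.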

18 CU Experiences
Pitfalls: users’ interpretations of the log file; fine tuning file extension exceptions and regular expressions
Recommendations: keep current environment in mind

19 The Organizational Problem: a really big haystack
Steve Lovaas, Network Security Manager, Colorado State University

20 Organizational Vision
Support from the top: cabinet-level committee driving the project; spurred by headlines and state mandates; VP for IT who really gets security
Campus PR campaign: web site; public meetings
Tied SSN purge to the rollout of a new CSUID in Fall 2006

21 Using Resources
Project Constraints: tight timeline; no budget; not a trivial programming project
Buy / Build / Leverage tools? Goal: 100% coverage vs. Best Effort; Spider chosen for Windows, Linux, Mac; manual searching on AIX, mainframe

22 Ultimate Responsibility
Original thought: deans / dept. heads
Revised edition: individual employees
Developed a personal attestation form for every employee to sign, submitted in bulk by colleges
More work for central IT
Senior VP: Doing the scan and signing the form is a CONDITION OF EMPLOYMENT

23 Individual Attestation Form
Every employee
2 choices:
I don’t interact with SSNs in the course of my job
SSNs in all electronic files under my control have been removed or encrypted
VP for IT must approve exceptions

24 CSU Experiences
Pitfalls: a beta tool for a live project requires quick response and careful management of user expectations & acceptance; careful of deadlines, it’s a lot of work!
Recommendations: don’t do this kind of project without active support from the very top; anticipate the need for analysis/parsing tools; have a supported encryption solution for exceptions

25 Cornell Spider
Wyman Miles, Sr. Security Engineer, Cornell University

26 A Brief History of Spider
Early 2005, scan Web for SSNs
Later, scan disk images for SSNs/CCNs
March 2006, debut at BU Security Camp
April 2006, Educause, demand for a Windows version
Version 1.0 in May, 2.0 in June

27 A Brief History, II
June 2006, major feedback from Steve: bug reports, tests, feature requests
Engine developed that same month: internal incident response
OSX Spider Sept 2006 Windows Spider rewrite
April 2007, GPL release of all Spiders

28 Current Spider
SSN, SIN, CCN, NINO discovery in many file types
Various data type validators
Web scanning, back to its roots
Scan for data in unallocated space
Faster. More readable source
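The slides do not say which checks Spider's validators apply. As one illustration of what a data type validator can do, the sketch below uses the Luhn checksum, the standard check digit scheme for payment card numbers, to weed out digit strings that merely look like CCNs. Its use here is an assumption, and the function name `luhn_valid` is hypothetical.

```python
def luhn_valid(number):
    """Return True if the digits in number pass the Luhn checksum."""
    # Ignore spaces, dashes, and any other non-digit characters.
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A candidate that matches a card-number pattern but fails this check can be dropped from the log, which reduces the false positives a reviewer has to read through.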

29 Various Spiders
Windows Spider, aka Spider3
OSX Spider
Engine, general UNIX spider
LinSpider, our oldest version
Spider Simple: Windows Spider preconfigured to skip noisy files

30 Future Spider
Feature set convergence between Engine, OSX, Windows
Community Development
Possible I2 hosting of distribution and documentation
More documentation!
Client-Server model revisited

31 Spider Log

32 Spider at Cornell
Incident response: a compromise has happened, what was at risk?
Pre-emptive
Dan Elswit, CALS Security Officer

33 Spider in CIT
CIT abandoned SSNs a few years ago, but they remain
Tech support uses Spider Simple to discover lurking SSNs
Manual process

34 Athletics
Spider Simple
Unique log names to network share
Centralized analysis

35 Spider Downloads

36 Summary
Purging sensitive information is something we’re going to have to get good at
Get support from the highest levels
Tune regular expressions and file/ext skip lists for your environment
Anticipate parsing needs, exceptions
New Spider features, more users, broader OS support
Spider also for ongoing support, forensics

37 Questions?
Wyman Miles: Kerry Havens: Steve Lovaas: The Spider users’ list:
