Relational Data Pre-Processing Techniques for Improved Securities Fraud Detection. Andrew Fast, Lisa Friedland, Marc Maier, Brian Taylor, and David Jensen (Knowledge Discovery Laboratory, Department of Computer Science, University of Massachusetts Amherst). Henry G. Goldberg and John Komoroske (National Association of Securities Dealers, Financial Industry Regulatory Authority).
Reps are required to file disclosures for incidents ranging in severity from customer complaints to criminal charges. Individual characteristics (Age, Years in Industry, …) are not distinctive; it is the collection of related entities that sets this rep apart.
Goals. Overall Goal: Identify new methods and improve existing methods for detecting securities fraud that consider the relationships among reps, branches, and firms. Goals of This Talk: 1) Describe the pre-processing steps needed to prepare the data for knowledge discovery. 2) Demonstrate that pre-processing is both necessary and beneficial for knowledge discovery in relational domains.
The Knowledge Discovery Process. Data Pre-Processing: Challenge 1: Inputs; Challenge 2: Class Labels.
Challenge 1: Preparing Inputs. Small-scale social structure between reps is important: Branch Associations and Tribes. Both are NOT IN RAW DATA. (Cortes, Pregibon, and Volinsky 2001) (Neville et al. 2005)
Identifying Branch Associations. Self-reported addresses are messy! Examples: 110 Wall Street, NEW YORK, NY, 10005; 311 S. WACKER DRIVE, CHICAGO, IL, 60606; 1400 World Trade Center, St. Paul, MN, 55101; 30 East 7th Street Suite 1400, St. Paul, MN, 55101; 110 Wall Street, NY, NY, 10005; 110 Wall Street, MANHATTAN, NY, 10005; 110 Wall Street, 22ND FLOOR, NEW YORK, NY, 10005; 30 East 7th Street Suite 1400, St. Paul, MN, 55101; 311 S. WACKER DRIVE, CHICARGO, OH, 60606; 1400 World Trade Center, St. Paul, MN, 55101. Use a string matching algorithm to infer branches. Matched: 70% (~3.35 million). Unmatched: 30% (~1.43 million).
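The slide does not say which string matching algorithm was used. As a minimal sketch only, the following assumes a simple approach: normalize each address, require the same zip code, and compare with Python's standard difflib similarity ratio. The function names and the 0.75 threshold are illustrative assumptions, not the authors' method.

```python
import re
from difflib import SequenceMatcher


def normalize(address: str) -> str:
    """Uppercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", address.upper())
    return " ".join(cleaned.split())


def zip_code(address: str) -> str:
    """Return the trailing 5-digit zip code, or '' if none is found."""
    match = re.search(r"(\d{5})\s*$", address)
    return match.group(1) if match else ""


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized addresses."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def infer_branches(addresses, threshold=0.75):
    """Greedy grouping: put each address in the first inferred branch whose
    first address shares its zip code and is similar enough (assumed rule)."""
    branches = []
    for addr in addresses:
        for branch in branches:
            first = branch[0]
            if zip_code(first) == zip_code(addr) and similarity(first, addr) >= threshold:
                branch.append(addr)
                break
        else:
            branches.append([addr])  # no match: start a new inferred branch
    return branches
```

Under these assumptions, "110 Wall Street, NY, NY, 10005" is grouped with "110 Wall Street, NEW YORK, NY, 10005", and the misspelled "CHICARGO, OH" entry is grouped with the Chicago address because the zip code and street still agree.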
Identifying Tribes. Why did they move? 1. Looking for a better location to commit fraud. 2. Friends inviting friends to better jobs. 3. Geographic limitations. 4. Branches merged or acquired. [Figure: anomalous movement versus background movement between branches A and B]
Identifying Tribes. Tribe - |trīb| noun: A group of reps that works together at many statistically unlikely branches. Reps in tribes found by our algorithm: –move between branches in more zip-codes than the rest of the population. –have more disclosures than the rest of the population. For more details on how to find tribes, please see the talk by Lisa Friedland: "Finding Tribes: Identifying Close-Knit Individuals from Employment Patterns," R11: Pattern Discovery (II), Tuesday 10:30 am ~ 11:50 am, Regency 2.
Challenge 2: Class Label. Whether a rep is planning to commit fraud is unknown; instead we use a surrogate class label called a risk score, based on disclosure history. With NASD guidance, we assigned a weight to each disclosure type based on its severity. For reps, sum over disclosures to determine the risk score for a given year. For branches, compute the average risk score over reps.
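The two aggregation steps on this slide (sum over a rep's disclosures for a year, then average over a branch's reps) can be sketched as follows. The actual NASD-guided weights are not given in the slides, so the weight values and disclosure-type keys below are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical severity weights; the real weights are not stated in the slides.
SEVERITY_WEIGHTS = {
    "customer_complaint": 1.0,
    "bankruptcy": 2.0,
    "regulatory_action": 3.0,
    "criminal_charge": 5.0,
}


def rep_risk_scores(disclosures):
    """Sum severity weights per (rep, year).

    `disclosures` is an iterable of (rep_id, year, disclosure_type) tuples.
    """
    scores = defaultdict(float)
    for rep_id, year, disclosure_type in disclosures:
        scores[(rep_id, year)] += SEVERITY_WEIGHTS[disclosure_type]
    return dict(scores)


def branch_risk_score(rep_ids, year, rep_scores):
    """Average the yearly rep scores over a branch.

    Reps with no disclosures in that year contribute a score of 0.
    """
    if not rep_ids:
        return 0.0
    total = sum(rep_scores.get((rep_id, year), 0.0) for rep_id in rep_ids)
    return total / len(rep_ids)
```

Averaging rather than summing at the branch level keeps a large branch from looking risky simply because it employs more reps.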
Risk Scores to Class Labels. Simple Approach: 1) Sort by risk score. 2) Choose top N. But risk scores are not distributed evenly. Scores vary by: –Geography –Year –Firm Demographics. We want high-risk entities, but relative to similar branches or reps.
Creating a Normalized Class Label. Stratify by year, zip-code, and firm demographics. [Figure: bins for 2005, by zip-code region (0 to 9) and branch category: Small Branch, Small Firm; Small Branch, Large Firm; Medium Branch, Small Firm; Medium Branch, Large Firm; Large Branch, Large Firm.] For each bin, choose the top 5% of all scores, as long as they are also above the median of non-zero scores.
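The per-bin selection rule can be sketched as below. Two details are assumptions, because the slide does not specify them: the median of non-zero scores is computed within each bin, and "above" is treated as strictly greater. The function name and input layout are illustrative.

```python
import math
from collections import defaultdict
from statistics import median


def normalized_labels(entities, top_fraction=0.05):
    """Return the ids labeled high-risk under the per-bin rule.

    `entities` is an iterable of (entity_id, bin_key, risk_score) tuples.
    An entity is high-risk if its score is in the top `top_fraction` of its
    bin AND strictly above the median of that bin's non-zero scores
    (within-bin median is an assumption).
    """
    by_bin = defaultdict(list)
    for entity_id, bin_key, score in entities:
        by_bin[bin_key].append((score, entity_id))

    high_risk = set()
    for members in by_bin.values():
        nonzero = [score for score, _ in members if score > 0]
        if not nonzero:
            continue  # a bin with no disclosures has no high-risk entities
        floor = median(nonzero)
        members.sort(key=lambda m: m[0], reverse=True)  # highest scores first
        n_top = math.ceil(top_fraction * len(members))
        for score, entity_id in members[:n_top]:
            if score > floor:
                high_risk.add(entity_id)
    return high_risk
```

The median condition matters in quiet bins: if only one entity in a bin has any disclosures, it is in the top 5% but not above the median of non-zero scores, so it is not labeled.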
The Knowledge Discovery Process kdl.cs.umass.edu/proximity
Learning Models of Risk. Two approaches for learning models: –Model the highest-risk entities from each bin with a single model. –Model each bin independently (stratified). Two possible class labels: –Normalized Class Label. –Global Class Label (no normalization).
Relational Probability Tree (Neville et al. 2003)
Challenge 1: Scorecard. Branch Associations: REQUIRED. Enabled all other analyses. Branch features appeared many times in learned trees. Tribes: Useful as a local pattern. Not widespread enough to be useful as a global pattern despite desirable characteristics.
Challenge 2: Scorecard. Normalized Class Label: Allows improved branch models; no improvement in rep models. Risk score on branches is an aggregate over reps, so normalization accounts for discrepancies in sizes. No aggregation is required for reps, so normalization has no effect. Stratified Models: Do not improve performance. The small number of positive instances in each bin leads to a lack of generalization.
Summary. Finding small-scale social structure is critical for relational domains. –Branch matching is essential; all analyses required branches. –Tribes are an interesting local pattern, but not widespread enough for global modeling. We are continuing to explore other techniques to capitalize on the dynamic structure of complex domains. Under the right circumstances, normalization can be useful when creating a class label for relational domains. –Helpful when aggregating over lower-level entities.
Andrew Fast email@example.com kdl.cs.umass.edu/~afast kdl.cs.umass.edu/proximity Poster #35 tonight. Questions?
Consolidation and Link Formation. Consolidation: identify entities of interest. Link Formation: construct structured relations between consolidated entities. (Goldberg and Senator 1995) –Raw data don’t always explicitly contain the entities and structure of interest. –These are critical for knowledge discovery in relational data. –In the securities industry, social structure among reps is useful for detecting misconduct. (Neville et al. 2005)
For more details on tribes, please see the talk by Lisa Friedland: "Finding Tribes: Identifying Close-Knit Individuals from Employment Patterns," R11: Pattern Discovery (II), Tuesday 10:30 am ~ 11:50 am, Regency 2. How did we do? –Reps in tribes move between branches in more zip-codes than the general population of reps. –Reps in tribes are almost 8 times more likely to be at high risk of fraud.
Identifying Tribes. Rather than limiting ourselves to static relations, we can consider groups of reps who coordinate their movement from branch to branch. Not all group movement is coordinated; consider: –Two firms in a small town. (Geography) –Branch is sold to another firm. (Acquisitions and Mergers) –Branch is closed. (Branch Closings) Not all coordinated movement is nefarious. –Friends inviting friends to better jobs. Tribe - |trīb| noun: A group of reps that coordinates its movement between statistically unlikely branches.
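The tribe-finding algorithm itself is covered in the talk by Lisa Friedland and is not described in these slides. Purely to illustrate the idea of "statistically unlikely" shared branches, the sketch below scores each pair of reps by the rarity of the branches they have in common. The scoring rule, the function name, and the input layout are assumptions, and the sketch ignores employment dates.

```python
import math
from collections import defaultdict
from itertools import combinations


def tribe_scores(employment):
    """Score each pair of reps by the rarity of the branches they share.

    `employment` maps rep_id -> set of branch_ids the rep has worked at.
    A shared branch contributes -log(fraction of all reps who worked there),
    so a branch that nearly everyone passes through adds almost nothing,
    while a rarely visited branch adds a lot (illustrative rule only).
    """
    n_reps = len(employment)
    reps_at = defaultdict(set)
    for rep, branches in employment.items():
        for branch in branches:
            reps_at[branch].add(rep)

    scores = defaultdict(float)
    for reps in reps_at.values():
        surprise = -math.log(len(reps) / n_reps)
        for a, b in combinations(sorted(reps), 2):
            scores[(a, b)] += surprise
    return dict(scores)
```

Pairs with the highest scores share several uncommon branches, which is the kind of movement the background explanations above (geography, mergers, closings) are least likely to produce. A fuller treatment would also check that the reps were at each branch at the same time.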
Challenge 2: Scorecard. Normalized Class Label: –Allows better models for branches, not for reps. Risk score on branches is an aggregate over reps. –Normalization accounts for discrepancies in sizes. No aggregation is needed for reps. –A high-scoring rep is high-risk no matter which bin they belong to. Stratified Models: –Stratified models perform worse than combined models. The number of positive instances per bin is small, so the models do not generalize well.
Challenge 2: Class Label. A “Will commit fraud in future” flag is not given. –Create a surrogate class label, or risk score, from the collection of disclosures on reps. Risk is not uniformly distributed across the data. –Market conditions vary over time. (Temporal) –Laws vary from state to state. (Geographic) –A small firm may have different market pressures than a large firm. (Demographic)
Weighted Risk Score. Assign each disclosure type a score based on its severity. –Customer complaints –Bankruptcy –Regulatory action. Sum over disclosure types for each rep for each year. Average over reps for branches. Who is high-risk? –Order the entire population by risk score, choose top n. –But risk scores are not uniformly distributed. Check out your broker: http://brokercheck.finra.org/ http://www.nasdbrokercheck.com
Normalizing Risk. Stratify data into bins with uniform risk score. –Consider each year independently. –Segment USA by zip-code. –Create 5 categories of branches based on branch and firm size: Small branch, small firm. Small branch, large firm. Medium branch, small firm. Medium branch, large firm. Large branch, large firm. For each year, 5 branch types × 10 zip-code regions = 50 bins. Who is high-risk? –Order each of the bins (rather than the entire population) by risk score, choose top.
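A minimal sketch of how a branch might be assigned to one of the 50 bins per year. The slides do not say how the 10 zip-code regions or the size categories were defined, so using the first digit of the zip code and the size cut-offs below are assumptions chosen only for illustration.

```python
# Hypothetical size cut-offs (head counts); the slides do not define them.
SMALL_BRANCH_MAX = 10
MEDIUM_BRANCH_MAX = 50
SMALL_FIRM_MAX = 500


def zip_region(zip_code: str) -> int:
    """Assume the first zip-code digit (0 to 9) identifies the region."""
    return int(zip_code[0])


def branch_category(branch_size: int, firm_size: int) -> str:
    """Map branch and firm size to one of the five categories on the slide."""
    firm = "small firm" if firm_size <= SMALL_FIRM_MAX else "large firm"
    if branch_size <= SMALL_BRANCH_MAX:
        return f"small branch, {firm}"
    if branch_size <= MEDIUM_BRANCH_MAX:
        return f"medium branch, {firm}"
    # The slide lists only one large-branch category, so every large
    # branch falls here regardless of firm size (an assumption).
    return "large branch, large firm"


def bin_key(year: int, zip_code: str, branch_size: int, firm_size: int):
    """Return the (year, region, category) bin for a branch."""
    return (year, zip_region(zip_code), branch_category(branch_size, firm_size))
```

For example, under these assumptions `bin_key(2005, "60606", 30, 10000)` gives `(2005, 6, "medium branch, large firm")`. Ranking within each such bin is what makes the resulting label relative to similar branches.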