
Byron Marshall: Oregon State University Hsinchun Chen: University of Arizona ISI 2006 IEEE Intelligence and Security Informatics Conference May 23-24 San.


1 USING IMPORTANCE FLOODING TO IDENTIFY INTERESTING NETWORKS OF CRIMINAL ACTIVITY
Byron Marshall: Oregon State University
Hsinchun Chen: University of Arizona
ISI 2006: IEEE Intelligence and Security Informatics Conference, May 23-24, San Diego, CA

2 Importance Flooding & Networks of Criminal Activity
The Case for Importance Flooding Analysis in Law Enforcement (LE)
– The need: feasible, cross-jurisdictional intelligence analysis tools
– The promise of network methodologies
– Helping analysts create link charts: Criminal Activity Networks (CANs), importance flooding, and path-based importance heuristics
– Importance flooding for LE and more

3 We Need Feasible Cross-Jurisdictional Analysis Methodologies
Events of the past few years both highlight the need for cross-jurisdictional sharing and demonstrate the difficulty of establishing feasible systems.
– Focusing on investigational usefulness
– Beyond criminal justice data
– Respecting privacy and security issues
Good sharing systems should support investigations.

4 Network-Based Methodologies: The Tool of Choice
– Criminal association networks are understandable and actionable.
– Criminal conspirators receive longer sentences.
[Figure: Police Records and the COPLINK Schema; labels: ABC 123, 13 crossings, 2 crossings]

5 Network- or Graph-Based Analysis is Well Known in Law Enforcement (LE)
From a LE research perspective:
– Sparrow (1991) discussed the investigational implications of social network measures and identified some network properties that would impact real-world analysis applications: size, incompleteness, fuzzy boundaries, and dynamism.
– Social network analysis measures usually evaluate networks based on a single association weight (Coffman 2004, Xu and Chen 2003).
– Shortest path analysis in CrimeLink Explorer (Schroeder et al. 2003).

6 Network-Based Analysis in Real-World Investigations
Link charts
– combine many cases into an overall picture of criminal activity related to crime types or localities,
– or focus on a particular case and particular suspects.
They can be used to
– focus an investigation,
– communicate within law enforcement agencies,
– or present data in court.
Link chart creation is valuable, manual, and expensive.

7 CAN Analysis: A Fraud/Meth Link Chart
An analyst spent 6 weeks in 2003 charting relationships between fraud and methamphetamines, to inform officers and focus investigations.
– Start with target individuals
– Research known associates
– Consider patterns of relationships
[Figure: link chart; role labels: Dealer, Leg Breaker, Check Washer]

8 From a Data-Mining Perspective
– The interestingness (or importance) issue is a well-recognized problem in the association rule mining field (Silberschatz and Tuzhilin 1996).
– “Beliefs” such as expected patterns and known information can be used to guide data-mining algorithms to unexpected or actionable items (Padmanabhan and Tuzhilin 1999).

9 Network Interestingness
In particular, interestingness as discovered in a network of relationships has received some attention:
– White and Smyth (2003) implement a generalized spreading activation model building from a “root set of nodes.”
– Lin and Chalupsky (2003) detect novel network paths (not just nodes or links) to reveal interesting information.

10 Improving the Link Chart Creation Process
– More systematic = less training
– Faster = cheaper
– Algorithmic = can be applied to larger (e.g., cross-jurisdictional) data sets
– Easier = used in more investigations

11 Research Gaps
– Previous research has not directly addressed link chart creation.
– Previous analysis uses only weighted associations (network structure).
– But practitioners rely heavily on individual and network-based activity heuristics.
Emulating crime analysts, we want to use both associations (network structure) and importance heuristics (node semantics) in our system.

12 Research Questions
How can we effectively identify interesting sub-networks
– from associations found in a large collection of criminal incidents
– employing domain knowledge
– to generate useful investigational leads and support criminal conspiracy investigations?
Does the use of path-based importance heuristics and importance flooding improve upon link-weight-only methodologies?

13 Important Considerations (Solution Constraints)
Criminal record datasets:
– May miss personal associations (e.g., family)
– May miss key individuals who appear unimportant until caught or linked to investigational targets
– Are ambiguous: very different associations look the same in the records

14 Thus: Design Goals
Selection algorithm design goals:
– Be target focused
– Use query-specific information to fill in the gaps
– Tolerate missing and ambiguous data
– Incorporate adjustable heuristics (or beliefs)
These goals are appropriate for both large-scale cross-jurisdictional analysis and local investigations.

15 Importance Flooding
Basic intuition:
– Both a person’s past activity and their involvement in interesting association patterns establish initial importance.
– Interestingness is partly path-based. That is, we improve the analysis by considering patterns of association.
– Associates of interesting people become relatively more interesting.

16 Identifying Interesting Sub-networks of Criminal Associations
The importance flooding module assigns a relative importance score to nodes in the network.
[Diagram labels: Police Records; COPLINK Schema; Combined Associations; Target List; Link Weight Heuristics; Importance Heuristics; Simple Filtering (Path Distance); Importance Flooding; Importance Ranking; Importance Ranked Network]

17 How Does Importance Flooding Work?
[Figure legend: Target Individuals; Other people; Associations from incidents]
Step 1: Assign weights to network links based on:
– the role of each actor in each incident, and
– the frequency of association
Step 2: Assign initial importance values to nodes given:
– involvement in a specified type of incident (e.g., fraud)
– involvement in a set of incident types (e.g., fraud & drugs)
– participation in an identified path (e.g., Fraud-Drugs-Assault)
Step 3: Recursively pass importance to neighbors
Step 4: Start with targets, best first search

18 Importance Flooding
[Figure: example network with targets and nodes A, B, C, D; label: Priority: A, B, C, …]
1. Assign weights to network links
– (.5 * .6) + (.5 * .2) + (.5 * .2) = .5
– (.99 * .6) + (.5 * .2) + (.5 * .2) = .794
– 4 associations = 1 (frequency adjusted)
2. Assign initial importance values to nodes
3. Importance flooding
– Iteration 1: scalar coefficients [.5, .25]
– New score .1 = 1 (target value) * (1/5) (five outlinks) * 1 (link weight) * .5 (scalar coefficient at 1 hop)
– D's new score .026 = (.1 from C * (1/2) outlinks * .794 weight * .25 (2nd hop)) + (.25 from B * (1/2) outlinks * .5 weight * .25 (2nd hop))
– Normalized results after iteration 1: Targets = 1, B = .25, C = .1, A = .05, D = .026. Note: A is scored higher than D.
– After the third iteration, D becomes “more interesting” than A because it links the two targets. Also, more distant nodes receive some attention.
– Normalized results after iteration 3: Targets = 1, B = .436, C = .176, D = .093, A = .086
4. Start with targets, best first search
– Once all the ranks are assigned, we select the highest ranked node with a direct connection to a previously selected node.
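The per-link arithmetic on this slide can be checked with a short sketch. It covers only the single-link contribution (source score, divided by the source's outlinks, times link weight, times the hop coefficient); the function name `passed_importance` is hypothetical, and the full iteration and normalization procedure is not spelled out on the slide, so it is not reproduced here.

```python
def passed_importance(source_score, n_outlinks, link_weight, hop_coefficient):
    """Importance passed along one link, following the slide's arithmetic.

    The name and signature are illustrative assumptions, not the authors' code.
    """
    return source_score * (1.0 / n_outlinks) * link_weight * hop_coefficient


# Scalar coefficients by hop distance, as given on the slide.
HOP_COEFFICIENTS = [0.5, 0.25]

# A target (score 1) with five outlinks passes importance over a link of weight 1.
first_hop = passed_importance(1.0, 5, 1.0, HOP_COEFFICIENTS[0])

# D receives second-hop importance from C (.1) and from B (.25).
d_score = (
    passed_importance(0.1, 2, 0.794, HOP_COEFFICIENTS[1])
    + passed_importance(0.25, 2, 0.5, HOP_COEFFICIENTS[1])
)

print(round(first_hop, 3), round(d_score, 3))
```

Both values agree with the slide's figures of .1 and .026 after rounding.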

19 Could An Analyst Make A Link Chart Faster Using Importance Flooding?
The Fraud/Meth Chart:
– 110 people
– “Bronze” standard
– Start with 4 targets
The basic notion of the methodology is to help the analyst build an association network around the target individuals.
We measure success by computing the ratio of “correct” suggestions (people included in the manually constructed chart) to “incorrect” suggestions (people not selected in the original chart).

20 Our Testbed
– A network of person-incident-person triples from incident reports, with date, crime type, and role (e.g., suspect, arrestee, victim) for each individual.
– For the Fraud/Meth chart we included 4,877 people, including 73 of the 110 targets.

21 Compare Selection Methods
– BFS: Breadth first search
  – approximates a manual approach
– CA: Closest Associate (link weights only)
  – choose the unselected node which is most closely associated with a previously selected node
  – a link chart implementation of previous ideas
– IMP: Importance flooding (path-based importance), compared against two variants:
  – Path heuristics with no flooding (PATH)
  – Node-only importance flooding (NO: no path heuristics)
– PIF: Perfect Importance Flooding
  – Approximate flooding with “perfect” information: “correct” nodes = 1; all other nodes = 0
  – this would be the theoretical limit for our methodology
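Closest Associate and importance-ranked selection both grow the chart outward from the targets, always picking the best unselected node that is directly connected to an already selected one. The sketch below illustrates that shared pattern with a priority queue. The function `select_nodes`, the toy network, and the link weights are hypothetical; the node scores are borrowed from the worked example on slide 18. BFS, which ranks by hop count, is not covered.

```python
import heapq


def select_nodes(neighbors, targets, priority, limit):
    """Greedy frontier selection starting from the targets.

    neighbors: dict mapping node -> list of adjacent nodes
    priority(selected_node, candidate) -> number; higher is picked first
    """
    selected = list(targets)
    chosen = set(targets)
    heap = []  # entries are (-priority, candidate); heapq is a min-heap

    def push_neighbors(node):
        for cand in neighbors.get(node, ()):
            if cand not in chosen:
                heapq.heappush(heap, (-priority(node, cand), cand))

    for target in targets:
        push_neighbors(target)

    while heap and len(selected) < limit:
        _, cand = heapq.heappop(heap)
        if cand in chosen:
            continue  # stale entry: already selected via another link
        chosen.add(cand)
        selected.append(cand)
        push_neighbors(cand)
    return selected


# Hypothetical toy network (not the network drawn on the slides).
neighbors = {
    "T1": ["A", "B", "C"], "T2": ["B"],
    "A": ["T1"], "B": ["T1", "T2", "D"], "C": ["T1", "D"], "D": ["B", "C"],
}
weights = {frozenset(pair): w for pair, w in [
    (("T1", "A"), 0.99), (("T1", "B"), 0.5), (("T1", "C"), 0.5),
    (("T2", "B"), 0.5), (("B", "D"), 0.794), (("C", "D"), 0.3),
]}
scores = {"A": 0.086, "B": 0.436, "C": 0.176, "D": 0.093}

# CA: rank candidates by the weight of the link to a selected node.
ca_order = select_nodes(
    neighbors, ["T1", "T2"], lambda s, c: weights[frozenset((s, c))], 6)
# Importance ranking: rank candidates by their own importance score.
imp_order = select_nodes(neighbors, ["T1", "T2"], lambda s, c: scores[c], 6)
print(ca_order)
print(imp_order)
```

On this toy network, the link-weight ranking reaches A first through its strong link to T1, while the importance ranking reaches B first and leaves A for last.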

22 Results: Fraud/Meth
How many nodes did we look at to find one included in the manually-drawn link chart?
[Charts: Breadth First Search; Closest Associate; Importance Flooding; Perfect Importance Flooding]
– BFS was least effective
– Closest Associate helped
– Importance Flooding: consistently better
– Given perfect information, Importance Flooding was very effective

23 Conclusions
Our analysis shows the algorithm’s promise.
– All the intelligent methods outperformed breadth first search.
– Importance analysis seems to improve on a link-weight-only approach.
– Path-based heuristics helped as compared to solely node-based heuristics.
Still, we should be cautious: our data set is limited and we can’t really say that one chart is “correct” while all others are “incorrect.” More study is needed.

24 Why Does It Fit the Domain?
– We can encode the kind of heuristics used by investigators: path heuristics, association weights, and target focus.
– Inquiry-specific information can be leveraged by the algorithm.
– Heuristics can be tuned to a particular investigation.
– The data we use is shareable:
  – We use relationships, not complete reports.
  – Different entity matching rules can be used for different applications.
  – But we still move beyond criminal justice data.

25 Importance Flooding: Not Just for CANs
– We believe that this kind of algorithm can be applied to other informal node-link knowledge representations.
– When the desired output is a network, this algorithm is designed to overcome link and identifier ambiguity by leveraging both the structure and the semantics of the underlying network.
– For example, we plan to explore the use of this algorithm in selecting interesting subsets of a network of biomedical pathway relations extracted from the text of journal abstracts.

26 Acknowledgement
– NSF, Knowledge Discovery and Dissemination (KDD) # 9983304.
– NSF, ITR: "COPLINK Center for Intelligence and Security Informatics Research - A Crime Data Mining Approach to Developing Border Safe Research."
– Department of Homeland Security (DHS) / Corporation for National Research Initiatives (CNRI): "Border Safe."
Thanks to the Tucson Police Department: Kathy Martinjak, Tim Petersen, and Chuck Violette.

27 Hypotheses: Tested on Fraud/Meth Data (Results)
A(technique) = average of (nodes selected / correct nodes selected)
Techniques:
– BFS: breadth first (rank by # of hops)
– CA: closest associates
– IMP: importance flooding
– PATH: path rules, no flooding
– NO: node rules, flooding
– PIF: perfect importance flooding
Hypotheses (should hold for 100, 250, 500, 1000, and 2000 selected nodes):
– All techniques improve on BFS
  – H1a: A(IMP) < A(BFS): Accepted **
  – H1b: A(CA) < A(BFS): Accepted **
– Importance flooding outperforms closest associates
  – H2: A(IMP) < A(CA): Accepted **
– Importance flooding outperforms heuristics with no flooding
  – H3: A(IMP) < A(PATH): Accepted at 500, 1000, and 2000 but NOT at 100 or 250
– Importance flooding with path rules outperforms flooding with only node rules
  – H4: A(IMP) < A(NO): Accepted **
– Given “perfect” information, flooding outperforms other techniques
  – H5a: A(PIF) < A(IMP): Accepted **
  – H5b: A(PIF) < A(CA): Accepted **
** significant at p = .01 for all levels of selected nodes

28 Association Strength
Association strength is based on the role of each actor in each incident and the frequency of association (Schroeder et al. 2003).
Example (as used in this work):
– Person/Role
  – Suspect/Suspect relationships = .99
  – Suspect/Not Suspect = .5
  – Not Suspect/Not Suspect = .3
– Frequency Adjustment
  – 4 or more associations: weight = 1
  – else: (strongest * .6, 2nd * .2, and 3rd * .2)
[Figure: nodes 1, 2, and 3. Nodes 1 and 2 may both be selected before 3 because of the association strength.]
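A minimal sketch of the link weight rule above, assuming the three terms of the frequency adjustment are summed (as in the worked sums on slide 18) and that a pair with fewer than three shared incidents contributes 0 for the missing terms. That second assumption is not stated on the slide. The function names are hypothetical.

```python
def role_weight(a_is_suspect, b_is_suspect):
    """Strength of a single association, from the roles of the two people."""
    if a_is_suspect and b_is_suspect:
        return 0.99  # Suspect/Suspect
    if a_is_suspect or b_is_suspect:
        return 0.5   # Suspect/Not Suspect
    return 0.3       # Not Suspect/Not Suspect


def link_weight(incident_roles):
    """Combined weight for one pair of people.

    incident_roles: list of (a_is_suspect, b_is_suspect), one per shared incident.
    """
    if len(incident_roles) >= 4:
        return 1.0  # frequency adjustment: 4 or more associations
    strengths = sorted(
        (role_weight(a, b) for a, b in incident_roles), reverse=True)
    # Assumption: missing 2nd or 3rd associations count as 0.
    strengths += [0.0] * (3 - len(strengths))
    return strengths[0] * 0.6 + strengths[1] * 0.2 + strengths[2] * 0.2


# One suspect/suspect incident plus two suspect/non-suspect incidents.
mixed = link_weight([(True, True), (True, False), (False, True)])
# Three suspect/non-suspect incidents.
uniform = link_weight([(True, False)] * 3)
print(round(mixed, 3), round(uniform, 3))
```

The two example pairs give .794 and .5 after rounding, matching the sums shown on slide 18.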

29 Activity and Path-Based Initial Importance (System Design)
– Group Rule Example: Assign this node to the Fraud Group if they were ever a suspect in a fraud incident. Optionally, add 2 to the importance value of any node assigned to the Fraud Group.
– Multi-Group Rule Example: Assign this node to the Fraud/Drug group if they are a member of both the Fraud and Drug Groups.
– Path Rule Example: Add 5 to nodes participating in an F-s/s-D-s/s-A path, where
  – F = In Fraud, D = In Drug Sales, A = In Assault
  – s/s = two individuals who are both suspects in a recent incident
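The group and path rules can be sketched as follows. This is an illustration under stated assumptions, not the system's implementation: the function name and data layout are hypothetical, the path bonus is applied once per node no matter how many matching paths it lies on, and the two ends of a path must be different people. The multi-group rule is left out because the slide assigns it no value.

```python
def initial_importance(suspect_in, ss_links, group_bonus=2, path_bonus=5):
    """Initial importance from one group rule and one path rule.

    suspect_in: dict person -> set of group codes ("F", "D", "A")
    ss_links: pairs of people who were both suspects in a recent incident
    """
    score = {person: 0 for person in suspect_in}

    # Group rule: members of the Fraud Group get the group bonus.
    for person, groups in suspect_in.items():
        if "F" in groups:
            score[person] += group_bonus

    # Build adjacency over suspect/suspect (s/s) links.
    adjacent = {}
    for a, b in ss_links:
        adjacent.setdefault(a, set()).add(b)
        adjacent.setdefault(b, set()).add(a)

    # Path rule F-s/s-D-s/s-A: look for a Drug node linked to both
    # a Fraud node and an Assault node.
    on_path = set()
    for middle, groups in suspect_in.items():
        if "D" not in groups:
            continue
        linked = adjacent.get(middle, set())
        fraud_ends = [p for p in linked if "F" in suspect_in.get(p, ())]
        assault_ends = [p for p in linked if "A" in suspect_in.get(p, ())]
        for f in fraud_ends:
            for a in assault_ends:
                if f != a:  # assumption: the path ends are distinct people
                    on_path.update((f, middle, a))

    # Assumption: the bonus is added once per node.
    for person in on_path:
        score[person] += path_bonus
    return score


# Hypothetical example data.
people = {"p1": {"F"}, "p2": {"D"}, "p3": {"A"}, "p4": {"F"}, "p5": set()}
links = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
result = initial_importance(people, links)
print(result)
```

Here p1, p2, and p3 form an F-s/s-D-s/s-A path, so each receives the path bonus, and p1 also receives the Fraud Group bonus.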

30 The Arrow Key Investigation (Testbed)
– Depicts 110 key people (coincidentally the same size as the Fraud/Meth chart)
– 23 original starting points (targets) were identified

31 BorderSafe Research Testbed
– 14 years of local law enforcement (LE) data (1990-2004)
– 2.2 million people, 5.2 million incidents, 1 million vehicles
– Sources: Tucson Police Department (TPD), Pima County Sheriff’s Department (PCSD)
[Figure labels: Incident Reports; People]

32 Data Integration Framework: 2 Steps, 3 Classes of Data (Marshall et al. 2004)
Step 1: Schema-Level Integration
– RMS data in varying structures (Pima County, Tucson Police); COPLINK Schema
Step 2: Entity Alignment
Classes of data:
– (1) Base data establishes a network of relations (incident records). Nodes: locations, people, and vehicles. Links are police incidents.
– (2) Supplementary data annotate entities (e.g., border crossings and vehicle seizures).
– (3) Sensitive or restricted query-specific data is added but not stored in the database (e.g., subpoenaed phone records).
[Figure labels: ABC 123; 13 crossings; 2 crossings]

33 Results: Arrow Key Investigation
How many nodes did we look at to find one included in the manually-drawn link chart?
[Charts: Breadth First Search; Closest Associate; Importance Flooding; Perfect Importance Flooding]
– BFS did not do as badly as in the Fraud/Meth data set
– Closest Associate was less effective
– Importance Flooding still performed well
– Important direct connections were not contained in available data

34 Average Nodes Selected Per Correct Node
As more nodes are selected, a higher proportion of “incorrect” nodes are selected.
[Chart: average nodes selected per correct node, by node ranking method]

35 Raw Results – Fraud/Meth Data
Avg = Average Number of Nodes Selected Per Correct Node
SD = Standard Deviation of Number of Nodes Per Correct Node
In each cell, the top number is the number of correct nodes selected; the second row is the Avg and (SD).
Ranking Methodology | Perfect | Importance Flooding | Path Importance No Flooding | Node Importance With Flooding | Closest Associate | Breadth First
1 to 100 Nodes Selected | 68 | 27 | 24 | 19 | 15 | 0
  Avg (SD) | 1.08 (0.14) | 3.40 (0.62) | 3.39 (0.68) | 3.86 (1.07) | 4.98 (1.51) | N/A
1 to 250 Nodes Selected | 69 | 48 | 49 | 42 | 31 | 23
  Avg (SD) | N/A | 3.95 (0.71) | 3.96 (0.69) | 4.70 (1.00) | 6.20 (1.43) | 47.63 (21.62)
1 to 500 Nodes Selected | 69 | 56 | 51 | 47 | 38 | 35
  Avg (SD) | N/A | 5.53 (1.82) | 5.76 (2.10) | 5.76 (2.10) | 8.68 (2.93) | 29.68 (23.61)
1 to 1000 Nodes Selected | 69 | 62 | 59 | 53 | 51 | 38
  Avg (SD) | N/A | 8.99 (3.97) | 9.58 (4.32) | 10.92 (5.06) | 12.47 (4.52) | 25.06 (17.48)
1 to 2000 Nodes Selected | 69 | 68 | 67 | 63 | 64 | 45
  Avg (SD) | N/A | 15.71 (7.77) | 16.37 (7.93) | 18.23 (8.57) | 19.41 (7.98) | 30.04 (13.76)
Until All 69 Correct Nodes Were Selected | 101 | 2158 | 3058 | 4407 | 4140 | 4828
  Avg (SD) | 1.08 (0.14) | 16.80 (8.43) | 23.69 (12.21) | 35.03 (17.88) | 33.05 (15.59) | 46.92 (17.76)

36 We Need Feasible Cross-Jurisdictional Analysis Methodologies
Sharing data across agencies is difficult and expensive. Are simple queries the answer?
– Vast Volume
  – Tucson alone has law enforcement data for 2 million people, 5 million incidents, and 1 million vehicles in 14 years
– Privacy Policies
  – What is buried in the reports?
  – Medical? Personal? Sensitive?
– Entity Equivalence
  – No unique identifier is available
  – Task-dependent accuracy requirements

37 Numeric Measure
We expect that in networks ranked by a “better” algorithm, an analyst would have to look at fewer nodes to find a “correct” node.
– A(technique) = Average (nodes selected / correct nodes selected).
– It can be measured at various selected node levels.
– For example: A(importance flooding) at 250 = average ratio of selected nodes to “correct” nodes, selected by the importance flooding algorithm, when the number of selected nodes is 1, 2, 3, …, 250.
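A minimal sketch of this measure, assuming that prefixes containing no correct node yet are skipped because the ratio is undefined there. That handling is an assumption; the slides only show N/A where no correct nodes were found. The function name is hypothetical.

```python
def average_nodes_per_correct(ranked, correct, level):
    """A(technique) at `level`: mean over k = 1..level of
    k / (number of correct nodes among the first k selected).

    ranked: nodes in the order the technique selects them
    correct: set of nodes that appear in the manually drawn chart
    """
    ratios = []
    hits = 0
    for k, node in enumerate(ranked[:level], start=1):
        if node in correct:
            hits += 1
        if hits:  # assumption: skip prefixes with no correct node yet
            ratios.append(k / hits)
    return sum(ratios) / len(ratios) if ratios else None


# Hypothetical ranking: "a" and "b" are on the chart, "x" and "y" are not.
a_value = average_nodes_per_correct(["a", "x", "b", "y"], {"a", "b"}, 4)
print(a_value)
```

For this ranking the ratios at k = 1..4 are 1, 2, 1.5, and 2, so the measure is 1.625; lower values mean the analyst inspects fewer nodes per correct one.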

38 Network- or Graph-Based Analysis is Well Known in Law Enforcement (LE)
From a LE research perspective:
– Sparrow (1991) discussed the investigational implications of social network measures and identified some network properties that would impact real-world analysis applications: size, incompleteness, fuzzy boundaries, and dynamism.
– Social network analysis measures usually evaluate networks based on a single association weight (Coffman 2004, Xu and Chen 2003).
– Shortest path analysis in CrimeLink Explorer (Schroeder et al. 2003).
– CrimeLink Explorer’s taxonomy of criminal association closeness for shortest path analysis considered:
  – crime type and person role,
  – shared addresses or phones, and
  – incident co-occurrence.

39 Association Closeness and Importance (chosen based on analyst input)
Link Weight Heuristics
– Suspect/Suspect relationships = .99
– Suspect/Not Suspect = .5
– Not Suspect/Not Suspect = .3
Frequency Adjustment
– 4 or more associations: weight = 1
– else: sum of (strongest relation * .6, 2nd * .2, and 3rd * .2)
Importance:
– Groups: Aggravated Assault (A), Drug Sales (D), Drug Possession (P), Fraud (F)
  – (A), (D), or (F) = 3
  – Nodes with any 2 of (A), (D), & (F) = 3; with all of (A), (D), & (F) = 5
– Path Rules (all applied only to crimes after 01/01/2001):
  – (A)-(D)-(F) = 5
  – (A)-(D), (A)-(F), (D)-(F), (P)-(F) = 3

