Published by Alannah Walters. Modified over 6 years ago.
1
Analyzing the Enron Emails using Topical Analysis and Graph Theory
Casey Kalinowski Faculty Mentors: Dr. Zakaria Kurdi & Dr. Kim McCabe
2
Statement of the Problem
The Enron Scandal
  2001 bankruptcy of the Enron Corporation
  Investors lost millions of dollars
  Investigations by the FBI, IRS, and the Securities and Exchange Commission lasting 5 years
The Enron Emails
  520,914 emails from over 150 employees of the Enron Corporation
  Obtained during the investigation and released to the public as the largest dataset of its kind
  Speaker note: stress that 520,914 emails are far too many to go through during an investigation without aid from software
Is it possible to analyze the emails with modern techniques in a way that is more efficient and effective than the original investigation?
3
Application of Theory
Topical Analysis
  Artificial Intelligence
  Natural Language Processing
  Create topics and assign a topic to each piece of data
  Speaker note: in criminology, topical analysis is known as content analysis
Graph Theory
  Application of graphs to research
  Graph – nodes connected by edges
  Node degree – number of edges coming from a node
  Edge weight – numerical value shared by the two connected nodes
4
Application of Theory
Topical Analysis Models
  Gensim Latent Dirichlet Allocation (LDA) model
  Natural Language Toolkit (NLTK) WordNet
  Keyword search
Creating the Graphs
  Nodes – unique email addresses
  Node degree – number of edges connecting to other addresses
  Edges – emails between two nodes
  Edge weight – number of emails sent between two nodes
5
Methodology
Research Design
  Retrospective – relies on previously collected data
  Exploratory – little or no previous research on this topic
Source of Data
  520,914 emails from 156 employees of the Enron Corporation
  Collected during the Enron investigation and released to the public by the Federal Energy Regulatory Commission in 2004
  Retrieved from Carnegie Mellon University
Test Sample
  5,890 emails from the inbox/outbox of Kenneth Lay
  Speaker note: the test sample is small because the research is ongoing
6
Analysis of the Data
  Processing the data
  Creating the topics
  Topical analysis of the data
  Creating the graphs
  Analyzing the graphs
7
Processing the Data
Converting the emails from .CSV format to .JSON format with Python
.CSV
  "dasovich-j/notes_inbox/526.",
  "Message-ID:
  Date: Wed, 20 Sep :51: (PDT)
  From:
  To:
  Subject:
  Cc:
  Mime-Version: 1.0
  Content-Type: text/plain; charset=us-ascii
  Content-Transfer-Encoding: 7bit
  Bcc:
  X-From: Steven J Kean
  X-To: David Parquet
  X-cc: Jeff Dasovich, Sandra McCubbin
  X-bcc:
  X-Folder: \Jeff_Dasovich_Dec2000\Notes Folders\Notes inbox
  X-Origin: DASOVICH-J
  X-FileName: jdasovic.nsf

  I talked to Hettie today. It's unlikely that we are going to find time for Jeff and the Governor to talk (because of the Governor's schedule). We’ll try to set something up later. In the meantime, the Governor should just sign the bill. Of course, Hettie had already communicated this; the Gov’s office acknowledged that the message was received but did not make a specific commitment. "
.JSON
  {"owner": "dasovich-j", "date": " :51:00", "subject": " ", "message": " I talked to Hettie today. It's unlikely that we are going to find time for Jeff and the Governor to talk (because of the Governor's schedule). We’ll try to set something up later. In the meantime, the Governor should just sign the bill. Of course, Hettie had already communicated this; the Gov’s office acknowledged that the message was received but did not make a specific commitment."}
Speaker note: show the differences between the two formats
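The conversion step above could look roughly like the following. This is a minimal sketch, not the presenter's script: it assumes each CSV row has two columns (file path, raw email text) as in the slide's example, and the names `row_to_record`, `csv_to_json`, and `SAMPLE_CSV` are hypothetical. The sample row is made up for illustration, and the raw Date header is kept as-is rather than normalized.

```python
import csv
import io
import json
from email import message_from_string

# Made-up sample row (not from the Enron dataset): file path, raw email text.
SAMPLE_CSV = (
    '"lay-k/inbox/1.","Date: Mon, 1 Jan 2001 09:00:00 -0800 (PST)\n'
    'Subject: Budget meeting\n'
    'X-Folder: inbox\n'
    '\n'
    'Please review the attached budget before Friday.\n"\n'
)


def row_to_record(file_path, raw_message):
    """Turn one (file path, raw email text) row into a flat dict."""
    msg = message_from_string(raw_message)
    return {
        # Assumption: the mailbox owner is the first path component.
        "owner": file_path.split("/")[0],
        "date": msg.get("Date", ""),
        "subject": msg.get("Subject", ""),
        # Assumption: plain-text, non-multipart messages.
        "message": msg.get_payload().strip(),
    }


def csv_to_json(csv_text):
    """Convert two-column CSV text into a JSON array string."""
    reader = csv.reader(io.StringIO(csv_text))
    records = [row_to_record(path, raw) for path, raw in reader]
    return json.dumps(records, indent=2)


print(csv_to_json(SAMPLE_CSV))
```

Parsing the headers with the standard `email` module keeps only the fields of interest and drops routing headers such as X-Folder, which is the main difference between the two formats on the slide.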
8
Creating the Topics
NLTK WordNet and Keyword Search
  Topics picked by hand: Accounting, Bankruptcy, Fraud, Leisure, Management, Meeting, Stock, and None
  Speaker note: mention that Gensim is artificial intelligence
9
Creating the Topics
LDA Model
Gensim – Latent Dirichlet Allocation (LDA)
  Create a training set from only the nouns and plural nouns in the emails
  Feed the training set to Gensim to create the LDA model
  Process the emails with the LDA model to create lists of topics
  Use the top six topics with the WordNet technique: Business, Employees, Information, Market, People, and Stock
Example
  Email: “I talked to Hettie today. It's unlikely that we are going to find time for Jeff and the Governor to talk (because of the Governor's schedule). We’ll try to set something up later. In the meantime, the Governor should just sign the bill. Of course, Hettie had already communicated this; the Gov’s office acknowledged that the message was received but did not make a specific commitment.”
  Nouns: “today time governor talk schedule try set something meantime governor sign bill course office message commitment”
  LDA model output:
    Topic 0: talk, bill, office, commitment
    Topic 1: governor, course, office, today
    Topic 2: talk, schedule, message, set
10
Topical Analysis of the Data
Keyword Search
  Very simple and easy to implement
  The keyword with the highest number of occurrences in an email becomes that email’s topic
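A keyword search of this kind can be sketched in a few lines. The sketch assumes that each hand-picked topic name doubles as its own keyword and that an email with no keyword hits receives the "none" topic; the slides do not spell out either detail, and `keyword_topic` is a hypothetical name.

```python
import re
from collections import Counter

# Hand-picked topics from the slides; assumption: each name is its own keyword.
TOPICS = ["accounting", "bankruptcy", "fraud", "leisure",
          "management", "meeting", "stock"]


def keyword_topic(text, topics=TOPICS):
    """Return the topic whose keyword occurs most often, or "none"."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    counts = {topic: words[topic] for topic in topics}
    # On a tie, max() keeps the topic that appears first in the list.
    best = max(counts, key=counts.get)
    # With no keyword present at all, fall back to the "none" topic.
    return best if counts[best] > 0 else "none"


print(keyword_topic("The stock fell. Sell the stock before the meeting."))
```

Exact word matching means "stocks" would not count toward "stock"; a real run would probably need stemming or a list of keywords per topic.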
11
Topical Analysis of the Data
NLTK WordNet
  Use WordNet to find the hyponyms of every topic
  Compare every hyponym of every topic to every sense of every word in each email
  Use WordNet to find the similarity between the hyponym and the sense of the word
  A similarity of .25 or higher scores 1 point
  The topic with the highest weighted average score becomes the topic of the email
  Weighted average = (total score / total number of word senses in the email)
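The scoring rule above can be sketched independently of WordNet by passing the sense lookup and the similarity measure in as functions. Everything named here (`topic_score`, `assign_topic`, the toy sense labels, and the toy similarity values) is hypothetical and made up for illustration; in the actual pipeline the senses and similarities would come from NLTK WordNet, and the slides do not say which similarity measure was used. The sketch reads the rule as one point per hyponym/sense pair that reaches the threshold.

```python
def topic_score(words, hyponyms, senses, similarity, threshold=0.25):
    """Weighted average for one topic: points / number of word senses."""
    total_senses = 0
    points = 0
    for word in words:
        for sense in senses(word):
            total_senses += 1
            for hyponym in hyponyms:
                # One point per comparison at or above the threshold.
                if similarity(hyponym, sense) >= threshold:
                    points += 1
    return points / total_senses if total_senses else 0.0


def assign_topic(words, topic_hyponyms, senses, similarity):
    """Return (best topic, all scores); "none" if nothing scored."""
    scores = {topic: topic_score(words, hyps, senses, similarity)
              for topic, hyps in topic_hyponyms.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else "none"), scores


# Toy stand-ins for WordNet lookups (invented labels and values).
TOY_SENSES = {"stock": ["stock#1", "stock#2"], "bill": ["bill#1"]}
TOY_SIMILARITY = {("share", "stock#1"): 0.5,
                  ("share", "stock#2"): 0.25,
                  ("invoice", "bill#1"): 0.3}


def toy_senses(word):
    return TOY_SENSES.get(word, [])


def toy_similarity(hyponym, sense):
    return TOY_SIMILARITY.get((hyponym, sense), 0.0)


topic, scores = assign_topic(
    ["stock", "bill"],
    {"stock": ["share"], "accounting": ["invoice"]},
    toy_senses,
    toy_similarity,
)
print(topic, scores)
```

In the toy run the email has three word senses; "stock" earns two points (2/3) and "accounting" one (1/3), so "stock" is assigned.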
12
Topical Analysis of the Data
Tested on 50 random emails
  Manually assign a topic to each email
  Run topical analysis on the emails
  Compare the generated topics to the manually assigned topics
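The comparison step amounts to an accuracy calculation: the share of emails whose generated topic matches the manual one. A minimal sketch follows, with a hypothetical `accuracy` helper and made-up labels rather than the study's 50 emails.

```python
def accuracy(manual, generated):
    """Fraction of emails whose generated topic matches the manual topic."""
    if len(manual) != len(generated):
        raise ValueError("both label lists must have the same length")
    if not manual:
        return 0.0
    matches = sum(m == g for m, g in zip(manual, generated))
    return matches / len(manual)


# Made-up labels for four emails, for illustration only.
manual_topics = ["stock", "meeting", "none", "fraud"]
generated_topics = ["stock", "meeting", "stock", "fraud"]
print(accuracy(manual_topics, generated_topics))
```

Running the same function once per technique (keyword search, WordNet, Gensim + WordNet) against one set of manual labels would give directly comparable numbers.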
13
Creating the Graphs
NetworkX – Python tool
  Load emails from the .JSON file
  Create a node for each unique email address
  Create an edge between the sending node and the receiving node and add 1 to the edge weight
  Export as a .graphml file
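The graph-building steps can be sketched with NetworkX as follows. The record fields `"from"` and `"to"` are assumptions (the JSON example on the earlier slide shows only owner, date, subject, and message), the addresses are made up, and an undirected graph is assumed because the slides count emails "between" two nodes. `build_email_graph` is a hypothetical name.

```python
import networkx as nx


def build_email_graph(emails):
    """Build an undirected graph whose edge weights count emails exchanged."""
    graph = nx.Graph()
    for mail in emails:
        sender = mail["from"]          # assumed field name
        for recipient in mail["to"]:   # assumed field name, list of addresses
            if graph.has_edge(sender, recipient):
                # Another email between the same pair: add 1 to the weight.
                graph[sender][recipient]["weight"] += 1
            else:
                # add_edge also creates any node that does not exist yet.
                graph.add_edge(sender, recipient, weight=1)
    return graph


# Made-up records for illustration.
sample_emails = [
    {"from": "a@example.com", "to": ["b@example.com", "c@example.com"]},
    {"from": "b@example.com", "to": ["a@example.com"]},
]
graph = build_email_graph(sample_emails)
print(dict(graph.degree()))
print(graph["a@example.com"]["b@example.com"]["weight"])
```

Calling `nx.write_graphml(graph, "emails.graphml")` would then produce a file that Gephi can open; the file name is only an example.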
14
Analyzing the Graphs
Gephi – open source graph analysis and visualization tool
  View graph data (node degree, edge weight, centrality, etc.)
  Visualize the graphs
  Most importantly, refine the graphs
15
Analyzing the Graphs
All topics: 31 nodes, 38 edges
  Nodes sized by degree
  Edges sized by weight
  Speaker note: note the size of the node
16
Analyzing the Graphs
Gensim + WordNet, topic Market: 26 nodes, 32 edges
  Speaker note: talk about how Rosalee Fleming connects to Kenneth Lay
17
Analyzing the Graphs
WordNet, topic Accounting: 18 nodes, 20 edges
  Speaker note: note the edge between … and …
  Speaker note: talk about how, as an investigator, I would look into these other people
18
Limitations to Research
Reliability of the data
  Did not collect the Enron emails personally
  No record of what was done to the emails before they were released to the public
  Speaker note: this is not truly a limitation, because this work is more of a concept and test than real research
19
Conclusion and Future Proposals
Work in progress
  Improvements to effectiveness and efficiency
Promising results
  WordNet and Gensim achieved higher accuracy than keyword search
Future application
  Once refined, the technique can be applied to other datasets