Text Mining SEC Filings for Fraud Detection Fletcher Glancy ISQS 7342.

Presentation transcript:

Research Issues
1. Can fraud be detected from SEC filings?
2. Can text mining provide a methodology for detection of potential fraud?
3. If text mining can provide an indication of potential fraud, which algorithm gives the best performance?
Fletcher Glancy, 12/2/2008

Brief Background
Corporate governance fraud has been a major concern, e.g., Enron, WorldCom, HealthSouth.
Detection has come only after many years of abuse.
Most detection techniques involve ratio analysis.
Churyk et al. used Context Analysis to detect fraud in the MD&A of 10-K filings.

Potential Strengths of Text Mining
Text mining (TM) can be automated.
The results can be used for further data mining.
TM eliminates researcher bias that is potentially present in Context Analysis.

Potential Problems/Weaknesses
There is no context in text mining, only statistics.
It is difficult to understand the relationships in a document-term matrix.
Text mining is unable to handle negatives or punctuation.
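The "no context, only statistics" problem can be illustrated with a minimal sketch. It assumes a plain bag-of-words document-term matrix built by whitespace tokenization; the function name `document_term_matrix` and the two sentences are made up for illustration and are not from the presentation.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, rows) with one row of term counts per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Two made-up sentences with opposite meanings but identical word counts.
docs = ["earnings were good not bad", "earnings were bad not good"]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
print(matrix)
```

Both rows of the matrix come out identical, because word order, and with it the scope of "not", is discarded when only counts are kept.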

Narrow the Focus - Negatives
Antonyms: word opposites.
Negatives: “not good” = “bad”.
Interference by articles: “Not a good day.”
Interference by modifiers: “Not highly motivated.”

Possible Data Preparation Options
Preprocess to remove articles.
Convert punctuation to text: replace ‘;’ with “semicolon”.
Combine “not” with the word it negates: “not highly motivated” becomes “highly not_motivated”.
Create not_noun tokens and replace each with its antonym: “not_dead” is replaced with “alive”.
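The four preparation options can be sketched as small token-level functions. This is only an illustration under stated assumptions: the lookup tables (`PUNCTUATION_NAMES`, `MODIFIERS`, `ANTONYMS`) are tiny hypothetical stand-ins, and "not" is attached to the next non-modifier word rather than to a part-of-speech-tagged noun.

```python
import re

ARTICLES = {"a", "an", "the"}
# Hypothetical, deliberately tiny lookup tables for illustration only.
PUNCTUATION_NAMES = {";": "semicolon", ",": "comma", ".": "period"}
MODIFIERS = {"highly", "very", "really"}
ANTONYMS = {"not_dead": "alive", "not_good": "bad"}


def tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"[a-z_]+|[;,.]", text.lower())


def remove_articles(tokens):
    return [t for t in tokens if t not in ARTICLES]


def attach_negation(tokens):
    """Move "not" past any modifiers and fuse it with the next word."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] != "not":
            out.append(tokens[i])
            i += 1
            continue
        j = i + 1
        while j < len(tokens) and tokens[j] in MODIFIERS:
            out.append(tokens[j])  # modifiers keep their place
            j += 1
        if j < len(tokens) and tokens[j].isalpha():
            out.append("not_" + tokens[j])
            i = j + 1
        else:
            out.append("not")  # nothing suitable to attach to
            i = j
    return out


def replace_antonyms(tokens):
    return [ANTONYMS.get(t, t) for t in tokens]


def punctuation_to_text(tokens):
    return [PUNCTUATION_NAMES.get(t, t) for t in tokens]


def preprocess(text):
    """Apply all four options cumulatively, in a fixed order."""
    tokens = remove_articles(tokenize(text))
    tokens = replace_antonyms(attach_negation(tokens))
    return " ".join(punctuation_to_text(tokens))


print(preprocess("Not highly motivated"))
print(preprocess("Not a good day"))
print(preprocess("He is not dead; he is tired."))
```

Keeping each option as its own function makes it possible to test the alternatives individually as well as cumulatively.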

Testing Data Preparation Options
Select/create a text database.
  – 10-K Notes and MD&A.
  – Firms that have received an AAER (Accounting and Auditing Enforcement Release).
Preprocess with each alternative individually and cumulatively.
Create a document-term matrix and compute its SVD.
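The document-term matrix and SVD step can be sketched in plain Python. This is a toy illustration, not the presentation's tooling: the three sentences are made up, and the SVD is computed with one-sided Jacobi (Hestenes) rotations, chosen here only because it is short enough to show in full. A real study would more likely use a numerical library.

```python
import math
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, rows) with one row of term counts per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([float(counts[term]) for term in vocabulary])
    return vocabulary, rows


def svd_scores(matrix, sweeps=50, tol=1e-12):
    """One-sided Jacobi SVD: return (document scores, singular values).

    Columns are rotated pairwise until mutually orthogonal. The rotated
    matrix then holds each document's coordinates on the singular
    dimensions, and the column norms are the singular values.
    """
    a = [row[:] for row in matrix]
    m, n = len(a), len(a[0])
    for _ in range(sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(a[i][p] ** 2 for i in range(m))
                beta = sum(a[i][q] ** 2 for i in range(m))
                gamma = sum(a[i][p] * a[i][q] for i in range(m))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):
                    ap, aq = a[i][p], a[i][q]
                    a[i][p] = c * ap - s * aq
                    a[i][q] = s * ap + c * aq
        if not rotated:
            break
    norms = [math.sqrt(sum(a[i][j] ** 2 for i in range(m))) for j in range(n)]
    order = sorted(range(n), key=lambda j: -norms[j])  # largest first
    scores = [[row[j] for j in order] for row in a]
    return scores, [norms[j] for j in order]


# Made-up sentences standing in for 10-K text.
documents = [
    "revenue increased and margins improved",
    "revenue was restated after accounting errors",
    "margins improved as revenue increased",
]
vocabulary, dtm = document_term_matrix(documents)
scores, singular = svd_scores(dtm)
print([round(s, 3) for s in singular])
```

Each row of `scores` is one document expressed in the singular dimensions, which is the reduced representation a later classifier would use.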

Testing Data Preparation Options (continued)
Calculate the variance of the document set using SVD.
Fit a logistic regression on the SVD of the document set and calculate the variance.
Test for predictability using a validation set.
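These three steps can be sketched as follows, under loud assumptions: the singular values and the two-dimensional document scores are made-up numbers rather than results, the label 1 stands for "firm received an AAER", and the logistic regression is fitted by plain batch gradient descent. The share of squared singular values equals the share of variance only if the matrix was column-centered first.

```python
import math


def variance_explained(singular_values):
    """Share of total squared variation carried by each singular dimension."""
    total = sum(s * s for s in singular_values)
    return [s * s / total for s in singular_values]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, width = len(features), len(features[0])
    weights, bias = [0.0] * width, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * width, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += error
            for k in range(width):
                grad_w[k] += error * x[k]
        bias -= rate * grad_b / n
        weights = [w - rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def predict(weights, bias, x):
    """Return 1 when the predicted probability is at least 0.5, else 0."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1 if sigmoid(z) >= 0.5 else 0


def accuracy(weights, bias, features, labels):
    hits = sum(predict(weights, bias, x) == y for x, y in zip(features, labels))
    return hits / len(labels)


# Made-up SVD scores (two dimensions per document) and made-up labels.
train_x = [[-1.0, 0.2], [-0.8, -0.1], [-1.2, 0.0],
           [1.0, 0.1], [0.9, -0.2], [1.1, 0.3]]
train_y = [0, 0, 0, 1, 1, 1]
valid_x = [[-0.9, 0.1], [1.2, -0.1]]
valid_y = [0, 1]

weights, bias = fit_logistic(train_x, train_y)
print(variance_explained([4.0, 3.0]))
print(accuracy(weights, bias, valid_x, valid_y))
```

Running the same fit once per preprocessing alternative and comparing validation accuracy would be one way to rank the alternatives.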

Questions?
Welcome to my potential dissertation topic!