Presentation is loading. Please wait.

Presentation is loading. Please wait.

Web-Based Data Collection and Analysis

Similar presentations


Presentation on theme: "Web-Based Data Collection and Analysis"— Presentation transcript:

1 Web-Based Data Collection and Analysis
Andy Leone June 2014

2 Course Overview Tools for extracting and analyzing text Software
SEC Filings Websites PDF Files Software Perl – ActivePerl Perl “Front End” Komodo MySQL

3 Why Perl? There Are Alternatives to Perl (e.g., Python, Ruby, R)
Perl – Practical Extraction and Reporting Language It is a mature language with considerable support. Countless packages (add-ons) About the best there is when it comes to “Regular Expressions.”

4 The big picture- What I do
SAS Connect WRDS SAS Pass-through Perl Edgar SQL ODBC ODS Proc Export Excel

5 SEC FTP Files can be obtained from the SEC’s FTP site:
ftp.sec.gov Note that the files are not made available on the FTP site until 24 hours after they are filed. It is possible to get real-time feeds. You are supposed to do large downloads after 9:00PM Eastern.

6 Index Files Index files are archived each quarter and the most recent one us updated every day: /edgar/full-index/2006/QTR1/company.zip The zip file contains a file “company.idx”

7 Index File, continued -First 10 lines - description then data-
-Company Name -Form Type -CIK Number -Filing Date -Directory and File Name Description: Master Index of EDGAR Dissemination Feed by Company Name Last Data Received: September 6, 2006 Comments: Anonymous FTP: ftp://ftp.sec.gov/edgar/ Company Name Form Type CIK Date Filed File Name 033 ASSET MANAGEMENT LLC / F-HR edgar/data/ / txt 1 800 CONTACTS INC Q edgar/data/ / txt

8 Data Files SGML format All files contain basic header information tagged by <SEC-HEADER> This is followed by each file submitted with the filing. Example- AMREP Corp. Submitted a 10-K/A (Text Version)

9 Challenge Lack of uniformity Another Example:
Smart-tek Solutions Inc. (Text)

10 Perl Practical Extracting and Reporting Language
A powerful scripting language for working with unformatted text.

11 Perl The key feature of Perl is Regular Expressions.
It is relatively easy to search through text files and match words or word patterns. For example, to identify firm with an internal control deficiency when wording differs. “We have identified a material weakness” “A material weakness exists” “A control deficiency exists” “We identified a control deficiency” “No material weakness exists” (Don’t want to match this one)

12 Expectations Using PERL can be very helpful. But..
More refined measures (e.g., modified audit opinions). Large samples relative to hand collection. But.. Extracting data from free-form text can be messy. Especially when formatting varies so much from file to file. A lot of trial and error. It would take an infinite amount of time to get success rates up to %. Need to accept 90-95% and live with some “noise”

13 Example: Audit Opinions
How do you find the opinion? Look for something like “We have audited.” But PWC has to start with “In our opinion.” How do you find the end of the opinion? Look for a date (e.g., March 15, 2005) a the start of a line But some dates in the opinion happen to fall at start of line (e.g., for year ended December 31, 2005) Then require the date to be followed by spaces. But some opinions are dated with “except for.” Foreign auditors might date the report in day- month-year order (15 March, 2005). Some firms have multiple opinions. Sometimes there are typos.

14 When is PERL most useful?
Extracting specific text and looking for keywords and phrases Audit Opinions Internal Controls Footnotes Not great for unstructured financial information. For example, use of proceeds in IPO prospectus. Sometimes you still need to code the data by hand.

15 What I do Download all index files and write them to a MYSQL database (via PERL). Download filings I work with fairly often (e.g., 10-K, 10-Q, 8-K) Create header header tables in MYSQL for each filing type. Create project-specific tables as needed. Note: MYSQL allows for easy access from SAS, STATA, excel, etc. via ODBC.

16 MYSQL, PERL and SAS DBI Module
This makes it really easy to work with data via SQL or statistics software (SAS, STATA). SAS Example: proc sql; connect to odbc (datasrc=Compustat user=me password=xxxx); create table myrestates as select * from connection to odbc (select gvkey, file, sic, hlink, company_name, to_CFO, TO_CFO_Date,cik, irreg,to_CEO, to_CEO_Date from 8kdata.restate_sample_jul2006 a); quit;

17 Software Installation Checklist
ActivePerl 5.10 Let’s go to the Package Manager and install a couple of modules: DBD::mysqlPP HTML::Format Komodo Edit 4.4 MySQL 5.0 MySQL GUI Tools SAS (or higher).


Download ppt "Web-Based Data Collection and Analysis"

Similar presentations


Ads by Google