Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp.

1 Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp

2 Setup problems
Mac: mostly no problems, thanks to the Linux-like environment and great support.
Windows on MobaXterm: you can use apt-cyg to install everything.
– apt-cyg install python
– apt-cyg install idle
– apt-cyg install idlex

3 REGEX CHALLENGE! Three regex challenges: one is from a well-known T-shirt joke (if you know it, don’t say anything), and two are song lyrics (I tried to pick well-known songs). Raise your hand to give the answer.

4 Challenge 1: a T-shirt people wear. Difficulty: * Hint: a phrase. r"(bb|[^b]{2})"

5 Challenge 1: a T-shirt people wear. Difficulty: * Hint: a phrase. r"(bb|[^b]{2})" Answer: “To be or not to be”

6 Challenge 2 Difficulty: ***** Hint: this is literally the entire lyric of the song. r"(\w+ [a-z]{3} w..ld ){144}"

7 Challenge 2 Difficulty: **** Hint: this is literally the entire lyric of the song. Hint 2: it’s a song by the music duo who created the latest Record of the Year. r"(ar\w{3} [a-z]{3} w..ld ){144}"

8 Challenge 2 Difficulty: ***** Hint: this is literally the entire lyric of the song. Hint 2: it’s a song by the music duo who created the latest Record of the Year. r"(\w+ [a-z]{3} w..ld ){144}" Answer: Around the World, by Daft Punk

9 Challenge 3 Difficulty: ** Hint: lyric of an old song. r"ah, ((ba ){4} (bar){2}a an{2} \s)+"

10 Challenge 3 Difficulty: ** r"ah, ((ba ){4} (bar){2}a an{2} \s)+" Answer: Ah, ba ba ba ba Barbara Ann

11 Song Phrases
Ever since I learned regex, I have thought that many Daft Punk songs are optimized for regex. Each of these simple regexes gives the lyrics of a song in its entirety:
– Around the World: r"(Around the world ){144}"
– Technologic: r"((buy|use|break|fix|trash|change) it )+ now upgrade it"
– Harder, Better, Faster, Stronger: r"(((work|make|do|makes|more) (it|us|than) (harder|better|faster|stronger|ever))+ hour after our work is never over. \s)+"
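The claim that one regex covers a whole song can be checked with Python's `re` module. This is a minimal sketch: the lyric string is generated by repetition rather than taken from a real lyric sheet, and the helper name `is_entire_lyric` is made up for illustration.

```python
import re

# Patterns from the slides, written with straight quotes.
AROUND = re.compile(r"(Around the world ){144}")
CHALLENGE = re.compile(r"(\w+ [a-z]{3} w..ld ){144}")

# Build the lyric by repetition. The trailing space matters because
# each repetition of the group ends with a space.
lyric = "Around the world " * 144


def is_entire_lyric(pattern, text):
    """Return True only if the pattern matches the whole text."""
    return pattern.fullmatch(text) is not None


print(is_entire_lyric(AROUND, lyric))     # exact pattern
print(is_entire_lyric(CHALLENGE, lyric))  # looser challenge pattern
```

`fullmatch` is used instead of `search` so that a lyric with too few or too many repetitions is rejected.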

12 THE BIGGEST concern for doctoral students doing empirical work (years 2-4): “WHERE AND HOW DO I GET THE DATA?!”

13 Data sources
1. Companies
2. Wharton Organizations
3. Scraping the Web
4. APIs: application programming interfaces

14 DATA SOURCES
1. Companies (HARD, UNIQUE)
– Hardest, but once you find a good company you are set for a paper or two, or more.
2. Wharton Organizations
– WRDS (EASY, COMMON): great for auxiliary data. Other people can also easily access this data, and it has probably been used already.
– WCAI (EASY, UNIQUE): the data is actually pretty great, and only a few select teams get it after a proposal review process.
3. Scraping the Web with WGET/REGEX/tools (MEDIUM, MEDIUM)
– Relatively easy, but painful for big projects, and some websites do not allow it.
4. APIs: application programming interfaces (EASY, COMMON)
– Easy, but restricted to what the company has made available.

15 Resources for Public Data
There are many lists of lists for public data. Find a link to a list of lists on the course website under “resources for learning”. If you have a good source, please email me so I can link it on the web.

16 Companies

17 Quick tips
– Don’t be afraid to contact random companies.
– Attend conferences and network like an MBA; think of it as a game.
– Send a short 2-3 page proposal suggesting a research collaboration.
– Read about the company you are contacting and make sure to offer something that interests the company.
– Expect a low success probability. Among the many proposals I’ve sent (30+ if you count emails):
  – Mostly no response.
  – 1 company I had worked with for 10 months dropped the ball after its CTO changed twice.
  – 4 gave data that was very easy to get but not useful or suitable for research.
  – 2 gave very useful data that I am currently working with.
  – 1 company is disputing the NDA.
– NDAs: you can request help from the UPenn legal team here – https://medley05.isc-

18 NDAs are super important
A horror story I heard:
– A student worked with a company for 1+ year, and then the company decided that the result was too good to publish. It wanted the result kept as a trade secret/IP.
– The NDA the student had signed was bad.
– No publication.
Most NDAs are OK, but some are not. If yours is bad, get help from the link on the previous slide and negotiate.
Look out for “work for hire” type NDAs.

19 Wharton Specific

20 You probably heard about these organizations at the Wharton doctoral orientation.
– WRDS: Wharton Research Data Services
– WCAI: Wharton Customer Analytics Initiative
– Other organizations exist, but mostly for conferences and not for data: centers-and-initiativ.cfm

21 Basic Web Scraping

22 Caveats
I spent time writing and testing scraping code for this course: you input a list of music artists in CSV format, and the script queries a website to obtain information such as the genres associated with each artist. It was written in March 2013. In July it broke because the site had updated its pages. This is one problem with scraping: you never know when your code will stop working, and then you have to rewrite it.

23 Outline of basic scraping
1. CRAWLING: Instead of using a web browser, use scripts to access the HTML (XML, etc.). Or crawl through a website recursively and download all the HTML, text, or other files. (WGET, Python, or any language such as PHP)
2. PATTERN SEARCHING: The researcher looks at the raw HTTP output, finds where the required data is, and figures out what the pattern is. (Firefox Developer’s Toolbox)
3. EXTRACTION: Use a text-extraction tool to extract the information and store it. If the data is in a structured format such as XML, use the appropriate tools for that format. (REGEX, Apache Lucene, SED, AWK, etc.)
4. Go publish papers with the data.
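The crawl, pattern-search, and extract steps can be sketched in a few lines of Python using only the standard library. Everything specific here is an assumption for illustration: the `<li class="genre">` markup, the sample HTML, and the function names are invented, and a real page would need its own pattern found with a DOM inspector.

```python
import csv
import io
import re
from urllib.request import urlopen


def fetch(url):
    """Step 1 (crawling): download one page and return it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Step 2 (pattern searching): a hypothetical pattern, found by
# inspecting the page source by hand.
GENRE_PATTERN = re.compile(r'<li class="genre">(.*?)</li>')


def extract_genres(html):
    """Step 3 (extraction): pull every genre out of the raw HTML."""
    return [genre.strip() for genre in GENRE_PATTERN.findall(html)]


def to_csv(genres):
    """Store the extracted values as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["genre"])
    for genre in genres:
        writer.writerow([genre])
    return buffer.getvalue()


# Offline demonstration on a made-up snippet instead of a live site.
sample_html = (
    '<ul><li class="genre">house</li>'
    '<li class="genre"> electronic </li></ul>'
)
genres = extract_genres(sample_html)
print(to_csv(genres))
```

Keeping `fetch` separate from `extract_genres` means that when the site changes its markup, only the pattern needs rewriting.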

24 Alternatives
Want something easier or with a GUI?
– MOZENDA: Wharton has a license and it’s cheap.
More advanced scraping
– We will cover this next week with Scrapy.
There are many other tools and packages for this.
– know-of-a-good-python-based-web-crawler-that-i-could-use

25 Tools used in our examples
WGET + Python REGEX
HTML/DOM inspector
– Firefox has the Web Developer’s Toolbox, an add-on you can download.
– This is useful for finding the pattern of the data you want to extract.

26 Scraping Example 1
Facebook SEC filing exploration
– Purpose: exploration before research
– What this toy example does: get the SEC filings for Facebook and extract certain parts
– I am interested in reading a few words before and after every mention of “shares”
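The idea of reading a few words around each mention of “shares” can be sketched as a small keyword-in-context function. This is an illustration under assumptions, not the course script: the sample sentence is invented, the function name `context_windows` is hypothetical, and words are split on whitespace rather than with a single large regex.

```python
import re


def context_windows(text, keyword="shares", n_words=3):
    """Return each occurrence of keyword with n_words on either side."""
    words = text.split()
    windows = []
    for i, word in enumerate(words):
        # Strip punctuation and ignore case before comparing.
        if re.sub(r"\W", "", word).lower() == keyword:
            start = max(0, i - n_words)
            windows.append(" ".join(words[start:i + n_words + 1]))
    return windows


sample = (
    "The company issued 100 shares of Class A common stock. "
    "Holders of shares may vote."
)
for window in context_windows(sample, n_words=2):
    print(window)
```

In practice the `text` argument would be the contents of one downloaded filing, read from disk.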

27 DOWNLOAD HTMLS/TXT/JPG/ETC
WGET: “GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.”
Fire up and

28 WGET FB’s SEC filings
wget -r -l1 -H -t1 -nd -N -np -A.txt -e robots=off
-r -H -l1 -np: these options tell wget to download recursively.
-nd: no directories. Keep the downloaded files in one folder.
-A.txt: only download .txt files.
-e robots=off: ignore robots.txt. Avoid this option if wget works without it. If you do use it, make sure to also use the --wait option, or your IP may get banned.

29 Caveats
WGET only works well for certain websites. You can use it to download all photos, etc. But if your script makes too many requests, the site may ban your IP. You can specify delayed requests.
Once a website gets fancy, you have to use other tools such as PHP or Python packages:
– ASP
– POST (as opposed to GET) requests in HTTP
– JavaScript-generated sites
– AJAX sites
This is a toy example for learning. You can still use this method for simple scraping, but consider learning pro tools (we’ll cover the basics of one such tool next week).
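Delayed requests can also be written directly in Python. The sketch below pauses between downloads so a script does not hammer a server. The function names and the one-second default are assumptions for illustration, and the `fetch` and `sleep` arguments exist only so the loop can be tried without touching the network.

```python
import time
from urllib.request import urlopen


def download(url):
    """Fetch one URL and return the raw bytes."""
    with urlopen(url) as response:
        return response.read()


def fetch_all(urls, wait_seconds=1.0, fetch=download, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first.
            sleep(wait_seconds)
        pages[url] = fetch(url)
    return pages


# Dry run with stand-ins for the network call and the pause.
pauses = []
pages = fetch_all(
    ["a", "b", "c"],
    wait_seconds=2,
    fetch=lambda url: url.upper(),
    sleep=pauses.append,
)
print(pages, pauses)
```

A longer or randomized wait is kinder to the site; the right value depends on the website's terms and is not something this sketch can decide.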

30 Scraping Example 2 concert venues – This example gets a list of artists and queries to get concert venue information. – Another toy example


32 Fire up

33 API (Application Programming Interface)

34 Programmable Web
– Search engine for freely available APIs online
– real-estate-apis-zillow-trulia-walk-score/
– Usage examples
Usually, you have to apply for API keys from the website or the company offering the data.
Mostly free (limited queries).

35 Idea behind API
1. You obtain a key from the company offering the data.
2. You make requests for data (there are many different ways, depending on the API).
3. The company’s server returns the data.
4. You analyze the data.

36 Commonly Used Protocol in API
REST (REpresentational State Transfer): guidelines for client-server interaction for exchanging data, as opposed to the alternative, SOAP.
I recommend this funny explanation of REST vs SOAP (a diagram involving Martin Lawrence) – access-protocol-soap
Based on HTTP: you request data via HTTP GET, and the server gives you the data.
– HTTP-URL?QueryStrings
– QueryStrings: Field=Value pairs separated by &
– E.g.
– v: stands for video = some value
– t: stands for start time = some value
Usual data formats
– XML: eXtensible Markup Language
– JSON: JavaScript Object Notation
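The `HTTP-URL?QueryStrings` layout can be built and taken apart with Python's `urllib.parse`. This is a minimal sketch: the base URL `https://www.example.com/watch` and the values `abc123` and `30s` are placeholders, since the slide's own example URL is not shown.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(base, params):
    """Join a base URL and Field=Value pairs separated by &."""
    return base + "?" + urlencode(params)


# Placeholder base URL and values, for illustration only.
url = build_url("https://www.example.com/watch", {"v": "abc123", "t": "30s"})
print(url)

# Going the other way: split the query string back into fields.
fields = parse_qs(urlsplit(url).query)
print(fields)
```

`urlencode` also escapes characters such as spaces and ampersands inside values, which is easy to get wrong when concatenating strings by hand.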

37 XML Example
Bloodroot Sanguinaria canadensis 4 Mostly Shady $2.44 031599
Columbine Aquilegia canadensis 3 Mostly Shady $9.37 030699
Many XML-related packages exist.
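The two plant records can be parsed with Python's built-in `xml.etree.ElementTree`. The element names below (`CATALOG`, `PLANT`, `COMMON`, and so on) are assumptions, because only the values appear in this transcript; the values themselves are the ones on the slide.

```python
import xml.etree.ElementTree as ET

# Tag names are assumed for illustration; the values come from the slide.
XML_TEXT = """
<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$2.44</PRICE>
    <AVAILABILITY>031599</AVAILABILITY>
  </PLANT>
  <PLANT>
    <COMMON>Columbine</COMMON>
    <BOTANICAL>Aquilegia canadensis</BOTANICAL>
    <ZONE>3</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$9.37</PRICE>
    <AVAILABILITY>030699</AVAILABILITY>
  </PLANT>
</CATALOG>
"""

root = ET.fromstring(XML_TEXT.strip())

# Turn each <PLANT> element into a dictionary keyed by lowercase tag name.
plants = [
    {field.tag.lower(): field.text for field in plant}
    for plant in root.findall("PLANT")
]

for plant in plants:
    print(plant["common"], plant["price"])
```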

38 JSON Example (just like Python)
newObject = {
  "first": "Ted",
  "last": "Logan",
  "age": 17,
  "sex": "M",
  "salary": 0,
  "registered": false,
  "interests": ["Van Halen", "Being Excellent", "Partying"]
}
Main Python module: import json
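The `json` module turns the object above into a Python dictionary. One small difference from Python syntax is worth seeing: JSON writes `false` in lowercase, and `json.loads` converts it to Python's `False`. The variable names in this sketch are made up for illustration.

```python
import json

# The object from the slide, as JSON text.
text = """
{
  "first": "Ted",
  "last": "Logan",
  "age": 17,
  "sex": "M",
  "salary": 0,
  "registered": false,
  "interests": ["Van Halen", "Being Excellent", "Partying"]
}
"""

person = json.loads(text)       # JSON text -> Python dict
print(person["first"], person["age"])
print(person["registered"])     # JSON false becomes Python False
print(person["interests"][0])

round_trip = json.dumps(person)  # Python dict -> JSON text
print(round_trip)
```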

39 Yahoo Finance Data Example

40 Python Package Wrapper
Yahoo provides a simple web interface that lets anyone download stock information via a URL –
– s: symbol, e.g. “GOOG”
– f: stat (e.g. l1 means last trade price)
More info here –
Ordered to take down –

41 This Wrapper Package does it for you ystockquote – – See the simple source code to learn Open up

42 Example: YQL
APIs are written by individual companies; they support different I/O and usually different languages. Yahoo Query Language is a simple interface that Yahoo has made available to developers, combining several APIs.
“Yahoo! Query Language (YQL) enables you to access Internet data with SQL-like commands.”
Apply for your API Key –

43 Our example: BBYOPEN
Best Buy provides this API for retail information.
– Archive query: returns a single file containing all attributes for all items exposed by the given API
– Basic query: returns information about a single item
– Advanced query: returns information about one or more items according to your specifications
– Store availability query: returns information about products available at specific stores
API overview –

44 Basic Query
Basic query structure
– API: one of {products, stores, reviews, categories}
– Item: the value of the fundamental attribute for the selected API:
  – products: sku
  – stores: storeId
  – reviews: id
  – categories: id
– Format: one of {xml, json}
– show=: (optional) the item attributes you want displayed
– Key: your API key
Note: show= and Key can be specified in either order.
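The pieces of a basic query can be assembled into a URL with a short helper. This is a sketch under several assumptions: the base URL `https://api.example.com/v1` is a placeholder, the `api/item.format?show=...&apiKey=...` layout and the `apiKey` parameter name are guesses for illustration, and the SKU and key values are invented. Check the API overview for the real layout before using it.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.com/v1"  # placeholder, not the real endpoint
APIS = {"products", "stores", "reviews", "categories"}
FORMATS = {"xml", "json"}


def basic_query(api, item, fmt, key, show=None):
    """Build a basic-query URL for a single item (assumed URL layout)."""
    if api not in APIS:
        raise ValueError(f"unknown API: {api}")
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt}")
    params = {}
    if show:
        # Optional list of attributes to display, comma-separated.
        params["show"] = ",".join(show)
    params["apiKey"] = key  # assumed parameter name
    # safe="," keeps the commas in the show list readable.
    return f"{BASE_URL}/{api}/{item}.{fmt}?{urlencode(params, safe=',')}"


print(basic_query("products", 8880044, "json", "YOUR_KEY", show=["sku", "name"]))
```

Validating `api` and `fmt` up front turns a typo into an immediate error instead of a wasted request against a limited query quota.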

45 Basic Query Examples

46 API example Open up

47 Lab session
For the next 10-15 minutes, choose your favorite website and try to scrape a few items. We’ll do this again with Scrapy.

48 Data isn’t impossibly hard to get after all. There are many routes, but it could take a LONG time (especially if you are going the company route). START EARLY and you’ll get that data.

49 Next Session
Hugh will be speaking about HPCC. After that, we will learn the basics of Scrapy.
Brush up on your HTML and look into XPATH – is the best
Intro into Big Data and Empirical Business Research
