Data Scraping
Presented by Stephen Popick & Chun Kuang (KC) at the GWU SAGE Student Seminar, February 2nd, 2015
Look under the hood of web pages
<a href="/biz/founding-farmers-washington">
  <img alt="Founding Farmers" class="photo-box-img" height="90"
       src="//s3-media2.fl.yelpcdn.com/bphoto/AWpidG8KMEEIRmqiKEnZHA/90s.jpg" width="90">
</a>
</div>
<div class="media-story">
  <h3 class="search-result-title">
    <span class="indexed-biz-name">
      <a class="biz-name" href="/biz/founding-farmers-washington" data-hovercard-id="oqDHJGanf7M-gqxF_RfqqA">Founding Farmers</a>
    </span>
  </h3>
Look under the hood of web pages
<div class="biz-rating biz-rating-large clearfix">
  <div class="rating-large">
    <i class="star-img stars_4" title="4.0 star rating">
      <img alt="4.0 star rating" class="offscreen" height="303"
           src="//s3-media4.fl.yelpcdn.com/assets/srv0/yelp_styleguide/c2252a4cd43e/assets/img/stars/stars_map.png" width="84">
    </i>
  </div>
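Once you know where the rating lives, pulling it out is a one-liner. A minimal sketch (not from the slides) using the XML package to read the "4.0 star rating" title attribute from the snippet above:

library(XML)

# parse the rating snippet shown above
snippet <- '<i class="star-img stars_4" title="4.0 star rating"></i>'
doc <- htmlParse(snippet, asText = TRUE)

# grab the <i> node whose class marks it as the star image
node <- getNodeSet(doc, "//i[contains(@class, 'star-img')]")[[1]]

# the rating is stored in the title attribute
xmlGetAttr(node, "title")   # "4.0 star rating"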
Scrapy.org saved the day!
Scrapy code in Python

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["yelp.com"]
    start_urls = ["..."]  # the Yelp start URL was cut from the slide

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            # the two extraction lines were truncated on the slide; they pulled the
            # 'price-range' and 'rating-qualifier' text, along the lines of:
            # item['price'] = site.select(".//span[@class='price-range']/text()").extract()
            # item['rating'] = site.select(".//span[@class='rating-qualifier']/text()").extract()
            items.append(item)
        return items
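Once saved inside a Scrapy project, the spider is run from the project directory with the command scrapy crawl dmoz, where dmoz is the name attribute given in the class above.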
DIY Web Scraping
R – a free statistical language
RStudio – adds a nice GUI to R
And YOU!
R contains many user-written routines.
You don’t need to build your own tools; you can use tools others have built. User-written packages are available from CRAN (https://cran.r-project.org).
Let’s Build
Packages I needed: stringr, lubridate, RCurl, XML, plyr
Once you’ve installed these packages, you can call them in R using the command library(packagename). That loads the package in.
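In practice that is one install step (from CRAN) and one library() call per package; a minimal sketch:

# install once, then load each session
install.packages(c("stringr", "lubridate", "RCurl", "XML", "plyr"))

library(stringr)    # string manipulation (str_replace_all, str_trim)
library(lubridate)  # date handling
library(RCurl)      # getURL() for fetching pages
library(XML)        # htmlParse() and friends for parsing HTML
library(plyr)       # ldply() for flattening lists into data frames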
Indeed.com
To provide motivation in my dissertation’s introduction, I demonstrated how salary offers based on educational attainment vary across California zip codes, using Indeed.com.
Investigate the URL
I typed “high school” and “90210” into Indeed’s search engine. Let’s try another with “masters degree” and “94305”. In both cases the search term and the zip code show up as parameters in the resulting URL. We have a URL pattern! Rock!
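That pattern is what the scraper exploits: build the URL yourself instead of clicking through the search box. A hedged sketch of the idea (the exact Indeed root was cut from the slides, so the root below, including the q1 parameter, is an assumption; only the &l1= zip parameter appears in the code later on):

library(stringr)

# hypothetical root -- the real one was truncated on the later slides
root <- "http://www.indeed.com/salary?q1="

# spaces in the search term become '+' and the zip rides along as &l1=
url <- paste(root, str_replace_all("high school", " ", "+"), "&l1=", "90210", sep = "")
url   # "http://www.indeed.com/salary?q1=high+school&l1=90210"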
My code
Searches for “high school degree”
Iterates by zip code
Looks for the embedded code for the salary
Once located, saves the salary value to a database

key.words=c("high school degree")
zdf=read.csv(file="pops_zip_file.csv",header=TRUE,stringsAsFactors=FALSE)
zips=as.character(zdf$zipcode)
root="..."  # the Indeed root URL was cut from the slide
salary.data=list()
for(key in key.words){
  for(zip in zips){
    url=paste(root,str_replace_all(key," ","+"),"&l1=",zip,sep="")
    page=getURL(url,ssl.verifypeer=FALSE)
    xml=htmlParse(page,asText=TRUE)
    # the line locating the salary node was cut from the slide; something like:
    # node=getNodeSet(xml,"//span[@class='salary']")  # this XPath is an assumption
    salary=str_trim(unlist(lapply(node,xmlValue))[1])
    if(is.null(salary.data[[key]])){
      salary.data[[key]]=list()
    }
    salary.data[[key]][[zip]]=salary
  }
}
df=ldply(salary.data,function(lst){df=ldply(lst);names(df)=c("zip","salary");df})
names(df)[1]="education"
write.csv(df,file="salary_high.csv",row.names=FALSE)
Part 1

key.words=c("high school degree")
zdf=read.csv(file="pops_zip_file.csv",header=TRUE,stringsAsFactors=FALSE)
zips=as.character(zdf$zipcode)
root="..."  # the Indeed root URL was cut from the slide
Part 2

for(key in key.words){
  for(zip in zips){
    url=paste(root,str_replace_all(key," ","+"),"&l1=",zip,sep="")
    page=getURL(url,ssl.verifypeer=FALSE)
    xml=htmlParse(page,asText=TRUE)
    # the line locating the salary node was cut from the slide; something like:
    # node=getNodeSet(xml,"//span[@class='salary']")  # this XPath is an assumption
    salary=str_trim(unlist(lapply(node,xmlValue))[1])
    if(is.null(salary.data[[key]])){
      salary.data[[key]]=list()
    }
    salary.data[[key]][[zip]]=salary
  }
}
Part 3

df=ldply(salary.data,function(lst){df=ldply(lst);names(df)=c("zip","salary");df})
names(df)[1]="education"
write.csv(df,file="salary_high.csv",row.names=FALSE)
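The nested list (education → zip → salary) flattens in two ldply() passes: the inner call turns each zip/salary list into rows, and the outer call stacks those and adds the education key as an .id column. A toy illustration of the shape (the zips and salary values below are invented for illustration only):

library(plyr)

# hypothetical scraped values, shaped like salary.data above
toy <- list("high school degree"=list("90210"="$35,000","94305"="$38,000"))

ldply(toy, function(lst){ df <- ldply(lst); names(df) <- c("zip","salary"); df })
#                  .id   zip  salary
# 1 high school degree 90210 $35,000
# 2 high school degree 94305 $38,000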
Kimono Labs Free Web Scraper
Easy to implement
Crawls text easily
Can’t crawl images
How To Yelp with R and Kimono
Step 1: I used Kimono Labs to crawl Yelp’s local listing of restaurants in Washington, DC. I pulled the first 5 restaurant names. Kimono allows you to upload a database to iterate over (such as zip codes), so I could have pulled many more restaurants.
How To Yelp with R and Kimono
Step 2: Next, I inspected the HTML elements of the Yelp page to find where the rating information was held.
Step 3: I edited my Indeed scraper code to scrape Yelp. The output:

.id          name                 rating
washington   El Rinconcito Cafe   4
             Old Ebbitt Grill
             Tortino
             Founding Farmers
             Quill                4.5
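A hedged sketch of what that edited scraper could look like. The XPaths follow the class names shown in the HTML on the earlier slides (biz-name, star-img); the search URL is an assumption, not the author’s actual code:

library(RCurl)
library(XML)
library(stringr)

# hypothetical search URL -- Yelp's real URL scheme may differ
url  <- "http://www.yelp.com/search?find_desc=restaurants&find_loc=Washington,+DC"
page <- getURL(url, ssl.verifypeer = FALSE)
xml  <- htmlParse(page, asText = TRUE)

# names live in <a class="biz-name">; ratings in the title of <i class="star-img ...">
biz   <- unlist(lapply(getNodeSet(xml, "//a[@class='biz-name']"), xmlValue))
stars <- unlist(lapply(getNodeSet(xml, "//i[contains(@class,'star-img')]"), xmlGetAttr, "title"))

# pair them up defensively in case the page yields unequal counts
n <- min(length(biz), length(stars))
data.frame(name = str_trim(biz[seq_len(n)]), rating = stars[seq_len(n)])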
Compare