Data Scraping
Presented by Stephen Popick & Chun Kuang (KC) at the GWU SAGE Student Seminar, February 2nd, 2015
Look under the hood of web pages
<a href="/biz/founding-farmers-washington">
  <img alt="Founding Farmers" class="photo-box-img" height="90"
       src="//s3-media2.fl.yelpcdn.com/bphoto/AWpidG8KMEEIRmqiKEnZHA/90s.jpg" width="90">
</a>
</div>
<div class="media-story">
  <h3 class="search-result-title">
    <span class="indexed-biz-name">
      <a class="biz-name" href="/biz/founding-farmers-washington" data-hovercard-id="oqDHJGanf7M-gqxF_RfqqA">Founding Farmers</a>
    </span>
  </h3>
Look under the hood of web pages
<div class="biz-rating biz-rating-large clearfix">
  <div class="rating-large">
    <i class="star-img stars_4" title="4.0 star rating">
      <img alt="4.0 star rating" class="offscreen" height="303"
           src="//s3-media4.fl.yelpcdn.com/assets/srv0/yelp_styleguide/c2252a4cd43e/assets/img/stars/stars_map.png" width="84">
    </i>
  </div>
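Once you know where the rating lives, pulling it out is a one-liner. A minimal sketch (not from the slides) using the XML package to read the "4.0 star rating" title attribute from the snippet above:

library(XML)

# parse the rating snippet shown above
snippet <- '<i class="star-img stars_4" title="4.0 star rating"></i>'
doc <- htmlParse(snippet, asText = TRUE)

# grab the <i> node whose class marks it as the star image
node <- getNodeSet(doc, "//i[contains(@class, 'star-img')]")[[1]]

# the rating is stored in the title attribute
xmlGetAttr(node, "title")   # "4.0 star rating"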
Scrapy.org saved the day!
Scrapy code in Python

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["yelp.com"]
    start_urls = ["..."]  # the Yelp start URL was cut from the slide

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            # the two extraction lines were truncated on the slide; they pulled the
            # 'price-range' and 'rating-qualifier' text, along the lines of:
            # item['price'] = site.select(".//span[@class='price-range']/text()").extract()
            # item['rating'] = site.select(".//span[@class='rating-qualifier']/text()").extract()
            items.append(item)
        return items
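Once saved inside a Scrapy project, the spider is run from the project directory with the command scrapy crawl dmoz, where dmoz is the name attribute given in the class above.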
DIY Web Scraping
R – a free statistical language
RStudio – adds a nice GUI to R
And YOU!
R contains many user-written routines.
You don’t need to build your own tools; you can use tools others have built. User-written packages are available from CRAN (https://cran.r-project.org).
Let’s Build
Packages I needed: stringr, lubridate, RCurl, XML, plyr
Once you’ve installed these packages, you can call them in R using the command library(packagename). That loads the package in.
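In practice that is one install step (from CRAN) and one library() call per package; a minimal sketch:

# install once, then load each session
install.packages(c("stringr", "lubridate", "RCurl", "XML", "plyr"))

library(stringr)    # string manipulation (str_replace_all, str_trim)
library(lubridate)  # date handling
library(RCurl)      # getURL() for fetching pages
library(XML)        # htmlParse() and friends for parsing HTML
library(plyr)       # ldply() for flattening lists into data frames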
Indeed.com
To provide motivation in my dissertation’s introduction, I demonstrated how salary offers based on educational attainment vary across California zip codes, using Indeed.com.
Investigate the URL
I typed “high school” and “90210” into Indeed’s search engine. Let’s try another with “masters degree” and “94305”. In both cases the search term and the zip code show up as parameters in the resulting URL. We have a URL pattern! Rock!
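That pattern is what the scraper exploits: build the URL yourself instead of clicking through the search box. A hedged sketch of the idea (the exact Indeed root was cut from the slides, so the root below, including the q1 parameter, is an assumption; only the &l1= zip parameter appears in the code later on):

library(stringr)

# hypothetical root -- the real one was truncated on the later slides
root <- "http://www.indeed.com/salary?q1="

# spaces in the search term become '+' and the zip rides along as &l1=
url <- paste(root, str_replace_all("high school", " ", "+"), "&l1=", "90210", sep = "")
url   # "http://www.indeed.com/salary?q1=high+school&l1=90210"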
My code
Searches for “high school degree”
Iterates by zip code
Looks for the embedded code for the salary
Once located, saves the salary value to a database

key.words=c("high school degree")
zdf=read.csv(file="pops_zip_file.csv",header=TRUE,stringsAsFactors=FALSE)
zips=as.character(zdf$zipcode)
root="..."  # the Indeed root URL was cut from the slide
salary.data=list()
for(key in key.words){
  for(zip in zips){
    url=paste(root,str_replace_all(key," ","+"),"&l1=",zip,sep="")
    page=getURL(url,ssl.verifypeer=FALSE)
    xml=htmlParse(page,asText=TRUE)
    # the line locating the salary node was cut from the slide; something like:
    # node=getNodeSet(xml,"//span[@class='salary']")  # this XPath is an assumption
    salary=str_trim(unlist(lapply(node,xmlValue))[1])
    if(is.null(salary.data[[key]])){
      salary.data[[key]]=list()
    }
    salary.data[[key]][[zip]]=salary
  }
}
df=ldply(salary.data,function(lst){df=ldply(lst);names(df)=c("zip","salary");df})
names(df)[1]="education"
write.csv(df,file="salary_high.csv",row.names=FALSE)
Part 1

key.words=c("high school degree")
zdf=read.csv(file="pops_zip_file.csv",header=TRUE,stringsAsFactors=FALSE)
zips=as.character(zdf$zipcode)
root="..."  # the Indeed root URL was cut from the slide
Part 2

for(key in key.words){
  for(zip in zips){
    url=paste(root,str_replace_all(key," ","+"),"&l1=",zip,sep="")
    page=getURL(url,ssl.verifypeer=FALSE)
    xml=htmlParse(page,asText=TRUE)
    # the line locating the salary node was cut from the slide; something like:
    # node=getNodeSet(xml,"//span[@class='salary']")  # this XPath is an assumption
    salary=str_trim(unlist(lapply(node,xmlValue))[1])
    if(is.null(salary.data[[key]])){
      salary.data[[key]]=list()
    }
    salary.data[[key]][[zip]]=salary
  }
}
Part 3

df=ldply(salary.data,function(lst){df=ldply(lst);names(df)=c("zip","salary");df})
names(df)[1]="education"
write.csv(df,file="salary_high.csv",row.names=FALSE)
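The nested list (education → zip → salary) flattens in two ldply() passes: the inner call turns each zip/salary list into rows, and the outer call stacks those and adds the education key as an .id column. A toy illustration of the shape (the zips and salary values below are invented for illustration only):

library(plyr)

# hypothetical scraped values, shaped like salary.data above
toy <- list("high school degree"=list("90210"="$35,000","94305"="$38,000"))

ldply(toy, function(lst){ df <- ldply(lst); names(df) <- c("zip","salary"); df })
#                  .id   zip  salary
# 1 high school degree 90210 $35,000
# 2 high school degree 94305 $38,000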
Kimono Labs Free Web Scraper
Easy to implement
Crawls text easily
Can’t crawl images
How To Yelp with R and Kimono
Step 1: I used Kimono Labs to crawl Yelp’s local listing of restaurants in Washington, DC. I pulled the first 5 restaurant names. Kimono allows you to upload a database to iterate over (such as zip codes), so I could have pulled many more restaurants.
How To Yelp with R and Kimono
Step 2: Next, I inspected the HTML elements of the Yelp page to find where the rating information was held.
Step 3: I edited my Indeed scraper code to scrape Yelp. The output:

.id          name                 rating
washington   El Rinconcito Cafe   4
             Old Ebbitt Grill
             Tortino
             Founding Farmers
             Quill                4.5
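A hedged sketch of what that edited scraper could look like. The XPaths follow the class names shown in the HTML on the earlier slides (biz-name, star-img); the search URL is an assumption, not the author’s actual code:

library(RCurl)
library(XML)
library(stringr)

# hypothetical search URL -- Yelp's real URL scheme may differ
url  <- "http://www.yelp.com/search?find_desc=restaurants&find_loc=Washington,+DC"
page <- getURL(url, ssl.verifypeer = FALSE)
xml  <- htmlParse(page, asText = TRUE)

# names live in <a class="biz-name">; ratings in the title of <i class="star-img ...">
biz   <- unlist(lapply(getNodeSet(xml, "//a[@class='biz-name']"), xmlValue))
stars <- unlist(lapply(getNodeSet(xml, "//i[contains(@class,'star-img')]"), xmlGetAttr, "title"))

# pair them up defensively in case the page yields unequal counts
n <- min(length(biz), length(stars))
data.frame(name = str_trim(biz[seq_len(n)]), rating = stars[seq_len(n)])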
Compare