Digging into Open Data Kim Rees, Periscopic

1 Digging into Open Data Kim Rees, Periscopic @krees, @periscopic

2 Public vs. Open Public data is not necessarily open: copyrights, patents, trademarks, restrictive licenses, etc. may still restrict its use.

3 Open Data is... -Accessible without limitations on entity or intent -In a digital, machine-readable format -Free of restriction on use or redistribution in its licensing conditions

4 Open ≠ Exempt Be sure to check the Data Use Policies of your sources: citations, attributions. See

5 Open/Public Government Publications -The Guardian, WSJ, NYT, The Economist, etc. Companies -GE, Yahoo, Nike, Mint, Trulia, etc. Academia -Carnegie Mellon DASL, Berkeley Data Lab, MIT Open Data Library, etc.

6 Open Accessible


8 Finding Data Most government sites (some of these are rabbit holes) Commercial data markets (Infochimps, DataMarket, Azure Marketplace, Kasabi) Locating free data - -Open Science Data: Ask! (often you can email researchers/journalists directly to request data you can't find online) Research time = liberal estimate * 5


10 Scraping Data WebHarvy ($$, robust) Dapper (free, but limited) Google (free, but limited) OutWit Hub ($$, free limited version) Mozenda ($$$$, subscription based) Able2Extract ($$, for PDFs) ScraperWiki (free, but programming required) Alternatives to Needlebase, RIP!!!!

11 Scraping Data Programmatically You can use any programming language, but Python is the language of choice. Libraries for getting web pages: urllib2, requests, mechanize
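The fetching libraries named above are from the Python 2 era (the slides predate Python 3's dominance). As a minimal sketch, urllib2's role is played by urllib.request in Python 3; a data: URL stands in for a real page here so the example runs without network access:

```python
from urllib.request import urlopen

# urllib2 (Python 2) became urllib.request in Python 3.
# A data: URL substitutes for a real page so no network is needed.
html = urlopen("data:text/html,<b>Open%20Data</b>").read().decode("utf-8")
print(html)  # <b>Open Data</b>
```

With requests or mechanize the call shape is similar: open a URL, read the body, then hand the HTML to a parser.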

12 Scraping Data Programmatically Libraries for parsing web pages: html5lib, lxml, BeautifulSoup
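As a rough stand-in for the BeautifulSoup-style extraction the slides rely on, the standard library's html.parser can do the same job; this hypothetical BoldCollector class (not from the slides) gathers the text of every <b> tag:

```python
from html.parser import HTMLParser

# Collects the text inside every <b>...</b> tag it encounters.
class BoldCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_bold = False
        self.bold = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold:
            self.bold.append(data)

p = BoldCollector()
p.feed("<html><body><b>Name:</b> Ada <b>Field:</b> Math</body></html>")
print(p.bold)  # ['Name:', 'Field:']
```

html5lib and lxml expose tree-based APIs instead of callbacks, but the idea is the same: parse once, then query for the tags that carry your data.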

13 import mechanize
url = ""  # fill in the target URL
b = mechanize.Browser()
b.set_handle_robots(False)
ob = b.open(url)
page = ob.read()
b.close()

14 import mechanize
import re
url = ""  # fill in the target URL
b = mechanize.Browser()
b.set_handle_robots(False)
ob = b.open(url)
html = ob.read()
b.close()
# the tag names inside these lookbehind/lookahead patterns were stripped
# by the HTML export; <b> is restored from the variable name, and <body>
# in the second pattern is a guess
bold = re.compile('((?<=<b>).*?(?=</b>))')
full = re.compile('(?s)(?<=<body>).*?(?=</body>)')
t = full.findall(html)[0]
s = list(set([x.replace(":", "") for x in bold.findall(t)]))
print s



17 import mechanize
import re
page_ids = [98936, 99001, 98929]  # page ids of interest
b = mechanize.Browser()
base_url = ""  # fill in the base URL
html = {}
for pid in page_ids:
    page = b.open(base_url + str(pid))
    print ("processing: " + b.title())
    html[pid] = parseit(page.read())  # parseit() from the previous script
    page.close()
b.close()

18 from nltk import WordNetLemmatizer
WordNetLemmatizer().lemmatize(token)  # requires the WordNet corpus: nltk.download('wordnet')

19 Cleaning Data: Google Refine, Data Wrangler, ParseNIP, Python, SQL. Visualizing Data: Tableau ($$), Spotfire ($$), Many Eyes, Gephi, R, D3, Protovis, etc.
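Cleaning data in plain Python often comes down to the same few moves the GUI tools automate: trim whitespace, drop empty rows, normalize case. A minimal sketch with made-up sample data:

```python
import csv
import io

# Messy sample input: stray whitespace, a blank row, inconsistent case.
raw = "Name , City \n Ada , portland \n\n Grace , NEW YORK \n"

rows = []
for row in csv.reader(io.StringIO(raw)):
    cleaned = [cell.strip() for cell in row]   # trim whitespace
    if any(cleaned):                           # drop empty rows
        rows.append([cleaned[0], cleaned[1].title()])  # normalize case

print(rows)  # [['Name', 'City'], ['Ada', 'Portland'], ['Grace', 'New York']]
```

For larger jobs the same logic moves into SQL or a dedicated tool, but the transformations stay this mundane.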

20 Business Considerations -The ins and outs of using existing tools or rolling your own data parsing scripts -Thinking ahead: the stability of open data -Data timeliness -When screen scraping, no one will tell you when the format of the page is going to change. ScraperWiki can help with this a bit if it's an option for you.
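One defensive pattern against silent page-format changes is to validate every scrape result and fail loudly rather than store junk. A minimal sketch with a hypothetical check_scrape helper (not from the slides):

```python
# Validate scraped records; raise instead of silently storing junk
# when the page layout has changed under you.
def check_scrape(records):
    if not records:
        raise ValueError("no records scraped; page format may have changed")
    for r in records:
        if not r.get("name"):
            raise ValueError("record missing 'name'; page format may have changed")
    return records

ok = check_scrape([{"name": "Ada"}, {"name": "Grace"}])
print(len(ok))  # 2
```

A loud failure in a nightly job is far cheaper than weeks of quietly corrupted data.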

21 Future... -Linked data -More adoption (keeping up appearances) -More adoption in private industry -Better anonymized data -Better discovery methods -Better tools

22 Resources

23 Digging into Open Data Kim Rees, Periscopic @krees, @periscopic
