Data.gov Wiki: A Semantic Web Approach to Government Data Li Ding, Dominic DiFranzo, Sarah Magidson, Jim Hendler Tetherless World Constellation Aug 7, 2009
Government Data on the Web
Objectives
Investigate the role of the Semantic Web in producing, processing and utilizing government datasets
– To enrich the value of data via normalizing, linking and information extraction
– To realize the value of data via applications, especially visualization
– To support web developers via machine-friendly data access and web services
Semantic Web Architecture for Government Data
[Figure: architecture diagram with three stages (Convert Data, Link & Enrich Data, View & Use Data) and a Data Processors (Web Services & Analyzers) layer]
Labels in the diagram: CSV, XSL, …; RDF/XML; Link Annotator; Sem Wiki; GOV data (RDF); Linked Data; SPARQL End Point; SPARQL Web Service; XSLT Service; Diff Service; RSS Generator; Google Viz; MIT Exhibit; RSS 1.0; tagCloud; Tabulator; …
Li Ding, Dominic DiFranzo, Sarah Magidson, and Jim Hendler · Tetherless World Constellation · Rensselaer Polytechnic Institute · Aug ·
Translate GOV data into RDF
Principle 1: Keep the translation minimal
– Keep the table structure
– Skip parsing values; use a unique property namespace
Principle 2: Let the translation meet the Web
– RDF/XML as output
– Partition big datasets; use dereferenceable URIs
Principle 3: Make the translation extensible
– Property definitions updatable via Semantic MediaWiki
Principle 4: Preserve knowledge provenance
– Record provenance metadata using DC and FOAF
Dominic
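The principles above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the converter used for the project: the base URI, the `vocab/` property namespace, the `entryN` naming, and the header-cleaning rule are all hypothetical, and giving each dataset its own namespace is an assumption suggested by the `dgp92` and `dgp401` prefixes in the sample queries. Each row becomes one resource, each column becomes one property, values stay unparsed strings, the output is RDF/XML, and a Dublin Core `dc:source` statement records where the table came from.

```python
import csv
import io
import re
from xml.sax.saxutils import escape, quoteattr

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def property_name(header):
    """Turn a column header into an XML-safe local name (hypothetical rule)."""
    name = re.sub(r"[^A-Za-z0-9_]+", "_", header.strip()).strip("_").lower()
    return name if name and not name[0].isdigit() else "col_" + name


def csv_to_rdfxml(csv_text, base_uri, source_url):
    """Translate a CSV table into RDF/XML, one resource per row.

    base_uri and source_url are placeholders supplied by the caller.
    """
    prop_ns = base_uri + "vocab/"  # assumed: one property namespace per dataset
    reader = csv.reader(io.StringIO(csv_text))
    headers = [property_name(h) for h in next(reader)]
    out = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        "<rdf:RDF xmlns:rdf=%s xmlns:dc=%s xmlns:p=%s>"
        % (quoteattr(RDF_NS), quoteattr(DC_NS), quoteattr(prop_ns)),
        # Provenance: point the dataset back at the table it was built from.
        "  <rdf:Description rdf:about=%s>" % quoteattr(base_uri),
        "    <dc:source rdf:resource=%s/>" % quoteattr(source_url),
        "  </rdf:Description>",
    ]
    for i, row in enumerate(reader, start=1):
        entry_uri = "%sentry%d" % (base_uri, i)
        out.append("  <rdf:Description rdf:about=%s>" % quoteattr(entry_uri))
        for name, value in zip(headers, row):
            if value != "":
                # Values are copied as plain string literals, never parsed.
                out.append("    <p:%s>%s</p:%s>" % (name, escape(value), name))
        out.append("  </rdf:Description>")
    out.append("</rdf:RDF>")
    return "\n".join(out)
```

Because values are not parsed, a code such as `020` keeps its leading zero, and empty cells simply produce no triple. Partitioning a large table would amount to calling the function on slices of rows and writing each result to its own document.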
Translated Dataset Statistics
data.gov hosts 432 datasets:
– 390 in the “Raw Data Catalog” and 41 in the “Tool Catalog”
– from 37 US government agencies
We have 16 translated RDF datasets:
– 13,532,385 table entries
– 2,927,399,269 triples
– 2,526 properties
data.gov mentioned 458 data access points (mainly tables):
– 3: RSS, ATOM
– 248: csv/txt
– 46: xml
– 66: xls (MS Excel)
– 14: kml or kmz
– 22: ESRI shape
Cloud of government data
Category labels: Budget, Population, Energy and Utilities, Geography and Environment
Datasets shown:
– (#10) Residential Energy Consumption Survey
– (#401) Budget Authority and offsetting receipts
– (#403) Governmental Receipts
– (#402) Outlays and offsetting receipts
– (#249) 2006 Toxics Release Inventory
– (#90) ACS PUMS Housing
– (#191) 2005 Toxics Release Inventory
– (#91) ACS PUMS Population
– (#34) Worldwide M1+ Earthquakes past 7 days
– (#9) CASTNET Visibility
– (#397) 2007 Toxics Release Inventory
– (#8) CASTNET Ozone
– CASTNET sites
Issues in Data.gov
Duplicated datasets: some datasets are part of another dataset.
– Dataset 140 (2005 Toxics Release Inventory data for the state of California (EPA)) is a subset of Dataset 191.
Formatting issues: the format of some datasets is not friendly to machine processing.
– Dataset 37 (Lower Colorado River Daily Average Water Elevations and Releases (US Bureau of Reclamation)).
– Dataset 335 (National Longitudinal Surveys (US Bureau of Labor Statistics)) tells you how to order data from the government.
Access point issues: access points that are interactive web pages are not friendly to machine access.
– Dataset 330 (Local Area Unemployment Statistics (US Bureau of Labor Statistics)).
Sarah
Demos
Visualization
– Tabulator
– Google Visualization (live)
– Exhibit (live)
Computation
– RSS generation
– TDB query (live)
Live Demos:
–
–
Dominic, Sarah
TODO List
More demos
– US Pollution Map
– US agency
– Earthquake in RPI Map
Getting more data linked
– Link properties
– Link instance data
More web services
– Gov data auto-completion
SPARQL integration for 2B triples
– TDB
– 4Store
[Figure labels: (#9) CASTNET Visibility, (#8) CASTNET Ozone, CASTNET sites]
Sample SPARQL queries
List datasets:
– SELECT ?s ?o WHERE {?s ?o }
List all loaded documents:
– SELECT ?s ?o WHERE {?s }
List descriptions of an EPA site (integration):
– SELECT ?s WHERE {?s "SHN418". }
List contributions of each agency (count):
– PREFIX dgp92: SELECT ?ag count(*) WHERE { ?entry dgp92:agency ?ag. } GROUP BY ?ag ORDER BY ?ag
List agencies (distinct):
– PREFIX dgp401: SELECT DISTINCT ?ag ?ag_code ?branch ?branch_code WHERE { ?entry dgp401:bureau_name ?ag; dgp401:bureau_code ?ag_code; dgp401:agency_name ?branch; dgp401:agency_code ?branch_code. } ORDER BY ?ag
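As a rough sketch of how a web developer might run the agency-count query from a script, the Python below builds a SPARQL Protocol GET request and flattens the JSON result format. Several things here are assumptions rather than facts from the slides: the `dgp92` namespace IRI and the endpoint URL are placeholders, and the aggregate is written in the SPARQL 1.1 form `(COUNT(*) AS ?n)`, which differs from the `count(*)` form shown above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical namespace IRI: the real dgp92 IRI is not given in the slides.
COUNT_BY_AGENCY = """
PREFIX dgp92: <http://example.org/dataset/92/vocab/>
SELECT ?ag (COUNT(*) AS ?n)
WHERE { ?entry dgp92:agency ?ag . }
GROUP BY ?ag
ORDER BY ?ag
"""


def build_query_url(endpoint, query):
    """Build a SPARQL Protocol GET URL carrying the query text."""
    return endpoint + "?" + urlencode({"query": query})


def parse_bindings(json_text):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    doc = json.loads(json_text)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in doc["results"]["bindings"]
    ]


def run_query(endpoint, query):
    """Send the query and return the parsed rows (needs a reachable endpoint)."""
    request = Request(
        build_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

With a real endpoint in place of the placeholder, `run_query(endpoint, COUNT_BY_AGENCY)` would return one dictionary per agency, with the count available as a string under the key `n`.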