Where’s My Data? Using MetriDoc to manage data integration headaches
Joe Zucca, Tommy Barker


1 Where’s My Data? Using MetriDoc to manage data integration headaches
Joe Zucca – zucca@pobox.upenn.edu
Tommy Barker – tbarker@pobox.upenn.edu
Sponsored by

2 The Problem
- The request seems simple, but the solution is complex
- The question is generally “who did / used x?”, which leads to other questions
  - Where’s the data?
  - What’s the grain of the answer?
- So how do we answer these questions?
  - If lucky, run a script / query against a database and generate a report
  - If not lucky, build an application to answer the question
- This is what MetriDoc is built for

3 Current Solution - Datafarm
Datafarm = Crontab + Perl + CGI = Spaghetti
Diagram: Datafarm sits between the data sources (Voyager, Blackboard, COUNTER, DLA logs, Gate Count, Ezproxy, Penn Community, Borrow Direct) and the applications (App 1, App 2, App 3, ... App n)

4 Datafarm Shortcomings
- Maintainability issues
- Not shareable
- Not reusable

5 MetriDoc = Datafarm 2.0
- As our system grew, we began creating MetriDoc to address Datafarm’s problems
  - Needed a scheduler more sophisticated than cron
  - Needed languages more maintainable than Perl
  - Needed integration tools to simplify data gathering across disparate systems
- We built prototypes and services to help us evaluate technologies
- Received a grant from IMLS to speed up development
  - Hired another programmer

6 MetriDoc Philosophy
- Keep it simple
  - Sometimes a script is all you need
  - Ease of use is more important than performance
- Don’t reinvent the wheel
- 100% open source
- Shareable data

7 MetriDoc – How it Works
- MetriDoc’s core is built around database schemas
- A MetriDoc implementation consists of loading tables and normalized tables
  - Loading tables prime the repository
    - The user is responsible for populating these tables
  - Normalized tables are built from the data in the loading tables
    - MetriDoc takes care of this
- Conforming to similar schemas provides interesting possibilities
  - Sharing data is easy
  - Sharing a single repository is easy (think Amazon Web Services)
  - Easier to collaborate
- From a user’s perspective
  - MetriDoc has tools to get your data into the loading tables
  - But ultimately you just need to get it in there, so you can use whatever tools you like
  - Use the MetriDoc tools to manage your integration needs
  - Useful for getting, transforming / resolving, moving and loading data

8 MetriDoc – Core Technologies
- JVM
  - Java is used for infrastructure
  - Groovy is the primary language
- Master Scheduler
  - Essentially the brains of MetriDoc
  - Using Hudson for now (http://hudson-ci.org/)
- Integration Tooling
  - Tooling built on top of Apache Camel (http://camel.apache.org/)
  - Helps move data from one place to another
  - Really helpful for batch processing
- Resolutions / Transformation Tools
  - Patron anonymization, text normalization, resource id to title resolutions, etc.
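As a rough illustration of what "patron anonymization" and "text normalization" could look like, here is a minimal Python sketch. It is not MetriDoc's actual tooling (which the slides describe as Java / Groovy based). The function names, the keyed-hash approach, and the normalization rules are all assumptions chosen for the example.

```python
import hashlib
import hmac
import unicodedata


def anonymize_patron(patron_id: str, secret_key: bytes) -> str:
    """Replace a patron id with a keyed hash (hypothetical approach).

    The same id and key always yield the same token, so usage can still be
    counted per patron without storing the original id.
    """
    digest = hmac.new(secret_key, patron_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def normalize_text(value: str) -> str:
    """Fold case, unify Unicode forms, and collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", value).casefold()
    return " ".join(folded.split())


if __name__ == "__main__":
    key = b"example-key-not-for-production"  # assumption: key managed elsewhere
    print(anonymize_patron("jsmith", key))
    print(normalize_text("  Vaccine   JOURNAL "))
```

A keyed hash is used here rather than a plain hash so that tokens cannot be reproduced by someone who knows the patron ids but not the key; whether that trade-off suits a given repository is a local policy decision.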

9 The MetriDoc Solution
MetriDoc = Hudson + Java / Groovy + Apache Camel = Integration Nirvana
Step 1 – Fill the loading tables
Diagram: Hudson jobs (Load Ezproxy, Load Patron Info, Load Counter) take data from the sources (Voyager, Ezproxy, COUNTER) and fill the Loading Tables

10 Loading Tables
Raw Ezproxy log line:
00.000.000.000||Philadelphia||PA||United States||Default+datasets+documents+pwp+vanwert||jsmith||[19/Jan/2011:00:01:44 -0500]||GET||https://proxy.library.upenn.edu:443/login?url=http://www.sciencedirect.com/science?_ob=GatewayURL&_origin=SFX&_method=citationSearch&_volkey=0264410X%2329%23266%232&_version=1&md5=8e47306a7f3a7da8a6fe7b521a7a149b||302||0||http://elinks.library.upenn.edu/sfx_local?genre=article&issn=0264410X&title=Vaccine&volume=29&issue=2&date=20101216&atitle=An+adjuvanted+pandemic+influenza+H1N1+vaccine+provides+early+and+long+term+protection+in+health+care+workers.&spage=266&sid=EBSCO:aph&pid=Madhun%2c+Abdullah+S.%3bAkselsen%2c+Per+Espen%3bSjursen%2c+Haakon%3bPedersen%2c+Gabriel%3bSvindland%2c+Signe%3bN%c3%b8stbakken%2c+Jane+Kristin%3bNilsen%2c+Mona%3bMohn%2c+Kristin%3bJul-Larsen%2c+%c3%85sne%3bSmith%2c+Ingrid%3bMajor%2c+Diane%3bWood%2c+John%3bCox%2c+Rebecca+J.5550217620101216aph||Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5 (.NET CLR 3.5.30729)]||Re07OuEIyQo8X6w||UPennLibrary=AAAAAUkQ36AAAFTaAwO7Ag==; __utma=10244330.1344196133.1295210953.1295404568.1295411821.9; __utmc=10244330; __utmz=10244330.1295411821.9.3.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=upenn; WRUID=0; __utmv=10244330.|1=User-Type=Current%20Students=1,; __utma=94565761.447912360.1295320755.1295404584.1295411882.4; __utmc=94565761; __utmz=94565761.1295320755.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=upenn%20blackboard; hp=/vanpelt/; __utma=261680716.1522407254.1295392237.1295404624.1295412044.3; __utmc=261680716; __utmz=261680716.1295412044.3.3.utmcsr=library.upenn.edu|utmccn=(referral)|utmcmd=referral|utmcct=/biomed/; proxySessionID=18175547; ezproxy=Re07OuEIyQo8X6w; ARPT=MWPYIPS108CWYL; EHost2=sid=49d81d33-5139-4dbd-b94f-5d76b01ffbdc@sessionmgr13&k2=dGJyMPGtr0iyqbVIrOPfgeyk44Dt6fIA&k3=dGJyMOPY8Xvt&k4=ehost&k6=en&k7=live&k8=DS:live; __utmb=10244330.4.10.1295411821; __utmb=94565761.6.9.1295413021459; __utmb=261680716.1.10.1295412044; ASPSESSIONIDCCAQQCRC=AHJAGJMDDPNIIMLMHBCPCHBL

Resulting loading table row:
Patron_id | Patron_ip      | url         | Ref_url        | Proxy_id | Ezproxy_id
jsmith    | 00.000.000.000 | http://www… | http://elinks… | 18175547 | Re07OuEIyQo8X6w
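The mapping from a raw log line to a loading-table row can be sketched in a few lines of Python. This is only an illustration, not MetriDoc's loader: the field positions are inferred from the single sample line on this slide, and the function name, the dictionary keys, and the rule that Proxy_id comes from the `proxySessionID` cookie are assumptions.

```python
FIELD_SEP = "||"
LOGIN_MARKER = "login?url="  # assumption: target URL follows this marker


def parse_ezproxy_line(line: str) -> dict:
    """Split one '||'-delimited Ezproxy log line into loading-table columns."""
    # maxsplit keeps the trailing cookie string in one piece
    fields = line.rstrip("\n").split(FIELD_SEP, 14)
    if len(fields) < 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")

    # Field 8 is the proxied request; the resource URL is its url= target.
    request_url = fields[8]
    marker_at = request_url.find(LOGIN_MARKER)
    target_url = (
        request_url[marker_at + len(LOGIN_MARKER):]
        if marker_at != -1
        else request_url
    )

    # Field 14 is the cookie header; keep the first value seen per name.
    cookies = {}
    for part in fields[14].split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name not in cookies:
            cookies[name] = value

    return {
        "patron_id": fields[5],
        "patron_ip": fields[0],
        "url": target_url,
        "ref_url": fields[11],
        "proxy_id": cookies.get("proxySessionID"),
        "ezproxy_id": fields[13],
    }


if __name__ == "__main__":
    sample = FIELD_SEP.join([
        "00.000.000.000", "Philadelphia", "PA", "United States", "Default",
        "jsmith", "[19/Jan/2011:00:01:44 -0500]", "GET",
        "https://proxy.library.upenn.edu:443/login?url=http://www.sciencedirect.com/science",
        "302", "0", "http://elinks.library.upenn.edu/sfx_local", "Mozilla/5.0",
        "Re07OuEIyQo8X6w", "proxySessionID=18175547; ezproxy=Re07OuEIyQo8X6w",
    ])
    print(parse_ezproxy_line(sample))
```

Because the log format is site-configurable in Ezproxy, a real loader would need the field order confirmed against the local log format rather than hard-coded as above.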

11 The MetriDoc Solution
MetriDoc = Hudson + Java / Groovy + Apache Camel = Integration Nirvana
Step 2 – Populate the normalized tables
Diagram: Hudson jobs (Normalize Ezproxy, Normalize Patron Info, Normalize Counter) read the Loading Tables and populate the Repository
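To make the loading-to-normalized step concrete, here is a small Python sketch using an in-memory SQLite database. The table and column names (`loading_ezproxy`, `patron`, `resource_host`, `ezproxy_hit`) are hypothetical and are not MetriDoc's schema; the sketch only shows the general pattern of deriving deduplicated, keyed tables from raw loading rows.

```python
import sqlite3
from urllib.parse import urlsplit

# Hypothetical schema: one loading table plus three normalized tables.
SCHEMA = """
CREATE TABLE loading_ezproxy (
    patron_id TEXT, patron_ip TEXT, url TEXT,
    ref_url TEXT, proxy_id TEXT, ezproxy_id TEXT
);
CREATE TABLE patron (id INTEGER PRIMARY KEY, patron_id TEXT UNIQUE);
CREATE TABLE resource_host (id INTEGER PRIMARY KEY, host TEXT UNIQUE);
CREATE TABLE ezproxy_hit (
    patron_fk INTEGER REFERENCES patron(id),
    host_fk INTEGER REFERENCES resource_host(id),
    ezproxy_id TEXT
);
"""


def normalize_ezproxy(conn: sqlite3.Connection) -> int:
    """Build normalized rows from every row in the loading table."""
    rows = conn.execute(
        "SELECT patron_id, url, ezproxy_id FROM loading_ezproxy"
    ).fetchall()
    for patron_id, url, ezproxy_id in rows:
        host = urlsplit(url).hostname or "unknown"
        # INSERT OR IGNORE keeps one row per distinct patron and host.
        conn.execute(
            "INSERT OR IGNORE INTO patron (patron_id) VALUES (?)", (patron_id,)
        )
        conn.execute(
            "INSERT OR IGNORE INTO resource_host (host) VALUES (?)", (host,)
        )
        conn.execute(
            """INSERT INTO ezproxy_hit (patron_fk, host_fk, ezproxy_id)
               SELECT p.id, h.id, ?
               FROM patron p, resource_host h
               WHERE p.patron_id = ? AND h.host = ?""",
            (ezproxy_id, patron_id, host),
        )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO loading_ezproxy VALUES (?, ?, ?, ?, ?, ?)",
        ("jsmith", "00.000.000.000", "http://www.sciencedirect.com/science",
         "http://elinks.library.upenn.edu/sfx_local", "18175547",
         "Re07OuEIyQo8X6w"),
    )
    print(normalize_ezproxy(conn), "loading row(s) normalized")
```

Once the data is in this shape, a "who used x?" question becomes a join between `ezproxy_hit`, `patron`, and `resource_host` at whatever grain the normalized tables preserve.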

12 Jenkins – Death to Cron
- Generally used for building software, but a fantastic cron replacement
- Can run arbitrary scripts locally and remotely
- Supports master / slave distribution model seamlessly
- Can be managed entirely via REST
- Extensible
- Helps with job dependencies
- It is simple and free
- Active community with a huge collection of plugins
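As an example of the "managed via REST" point, the following Python sketch builds (but does not send) a request that would trigger a job through Jenkins' remote access API, which exposes a `build` endpoint under each job's URL. The server URL, job name, user, and token are placeholders, and some installations also require a CSRF crumb header, which is not handled here.

```python
import base64
import urllib.parse
import urllib.request


def build_trigger_request(
    base_url: str, job_name: str, user: str, api_token: str
) -> urllib.request.Request:
    """Prepare a POST to <base_url>/job/<job_name>/build with basic auth."""
    job_path = urllib.parse.quote(job_name, safe="")
    url = f"{base_url.rstrip('/')}/job/{job_path}/build"
    credentials = base64.b64encode(
        f"{user}:{api_token}".encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        url,
        data=b"",  # empty body; the endpoint expects a POST
        method="POST",
        headers={"Authorization": f"Basic {credentials}"},
    )


if __name__ == "__main__":
    # Placeholder values; nothing is sent over the network here.
    request = build_trigger_request(
        "http://localhost:8080", "load ezproxy", "someuser", "sometoken"
    )
    print(request.get_method(), request.full_url)
    # To actually trigger the job: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the trigger easy to test without a running server.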

13 A Little Groovy

14 The Metridoc Job Framework


17 Metrics on the Cheap


19 Where we are…

