IT Futures, Nov 2008: Curated Data. Peter Buneman, School of Informatics, University of Edinburgh and Digital Curation Centre.


1 Curated Data
Peter Buneman
School of Informatics, University of Edinburgh and Digital Curation Centre
IT Futures, Nov 2008

2 Where is our “reference” data?
Before 2000
After 2000

3 Where is our research data?
In a curated database

4 Curated databases
Value lies in the organization and annotation of data
Copied from other sources
Proliferating in scientific research
Constantly checked/verified.
Produced by dedicated organizations.
Now seen as “publications” by scientists. Highly cited.
Human and machine-readable
Usually open access
Labour-intensive

5 The cost of curated data
In $/€/£ per byte:
10^-7   Big physics (LHC) data
10^-3   [Movie]
10^-1   Book
1       “Production” code / Curated data
10      “Reliable” code / Curated data

6 A change for the better?
Redundant/Distributed                        Single-source/Centralised
Persistent                                   Volatile
Readable by people                           Internal DBMS format
Clear standards for citation                 No standards for citation
Historical record (old data is useful)       No historical record
Well understood ownership/IP                 Mind-boggling legal issues
20th century libraries did some things better!

7 CIA World Factbook / Swissprot
Much of the “sweat” is organisation and annotation
ID   11SB_CUCMA STANDARD; PRT; 480 AA.
AC   P13744;
DT   01-JAN-1990 (REL. 13, CREATED)
DT   01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
DT   01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
DE   11S GLOBULIN BETA SUBUNIT PRECURSOR.
OS   CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC   VIOLALES; CUCURBITACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=CV. KUROKAWA AMAKURI NANKIN;
RX   MEDLINE; 88166744.
RA   HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
RL   EUR. J. BIOCHEM. 172:627-632(1988).
RN   [2]
RP   SEQUENCE OF 22-30 ND 297-302.
RA   OHMIYA M., HARA I., MASTUBARA H.;
RL   PLANT CELL PHYSIOL. 21:157-167(1980).
CC   -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
CC   -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
CC       BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
CC       DISULFIDE BOND.
CC   -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
DR   EMBL; M36407; G167492; -.
DR   PIR; S00366; FWPU1B.
DR   PROSITE; PS00305; 11S_SEED_STORAGE; 1.
KW   SEED STORAGE PROTEIN; SIGNAL.
FT   SIGNAL 1 21
FT   CHAIN 22 480 11S GLOBULIN BETA SUBUNIT.
FT   CHAIN 22 296 GAMMA CHAIN (ACIDIC).
FT   CHAIN 297 480 DELTA CHAIN (BASIC).
FT   MOD_RES 22 22 PYRROLIDONE CARBOXYLIC ACID.
FT   DISULFID 124 303 INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
FT   CONFLICT 27 27 S -> E (IN REF. 2).
FT   CONFLICT 30 30 E -> S (IN REF. 2).
SQ   SEQUENCE 480 AA; 54625 MW; D515DD6E CRC32;
     MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
     RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
     IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV
     FADTRNVANQ IDPYLRKFYL AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE
     EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE
     TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY
     TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF
     KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE
//
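To make the point about organisation and annotation concrete, here is a minimal Python sketch that groups the lines of a Swissprot-style entry by their two-letter line code and pulls out the taxonomy. It is an illustration only: the function name `parse_entry`, the excerpted `ENTRY` string and its column spacing are assumptions, not part of the talk or of any official parser.

```python
from collections import defaultdict

# A few lines excerpted from the entry above (not the full record).
# The exact column spacing is an assumption made for this sketch.
ENTRY = """\
ID   11SB_CUCMA STANDARD; PRT; 480 AA.
AC   P13744;
DE   11S GLOBULIN BETA SUBUNIT PRECURSOR.
OS   CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC   VIOLALES; CUCURBITACEAE.
KW   SEED STORAGE PROTEIN; SIGNAL.
//
"""


def parse_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        code, _, rest = line.partition(" ")
        if len(code) == 2 and rest.strip():
            fields[code].append(rest.strip())
    return dict(fields)


fields = parse_entry(ENTRY)

# The OC (taxonomy) field spans two lines; join them and split on ';'.
taxonomy = [
    name.strip(" .")
    for name in " ".join(fields["OC"]).split(";")
    if name.strip(" .")
]

print(fields["AC"])
print(taxonomy)
```

Even in this short excerpt, every line except the identifier is curated annotation rather than raw sequence data.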

8 Some of the issues
Publishing data, data exchange, integration and transformation
Annotation
Provenance
Archiving
Citation

9 Provenance
These two sites disagree on the population of China and neither tells you that they got it from the CIA World Factbook
ID   11SB_CUCMA STANDARD; PRT; 480 AA.
AC   P13744;
DT   01-JAN-1990 (REL. 13, CREATED)
DT   01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
DT   01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
DE   11S GLOBULIN BETA SUBUNIT PRECURSOR.
OS   CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC   VIOLALES; CUCURBITACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=CV. KUROKAWA AMAKURI NANKIN;
RX   MEDLINE; 88166744.
RA   HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
RL   EUR. J. BIOCHEM. 172:627-632(1988).
RN   [2]
Where did this taxonomic information come from?

10 Provenance
A major problem, both in databases and in scientific programming
We need to develop good models of what to record and how to record it
Fundamental connections with database theory
Intimately connected with archiving and citation.

11 Archiving
How do you keep a record of something that is continually changing?
– Alan Bundy's “liquid publications”
Old data is useful
– for verification (citation), and
– for longitudinal studies
For databases:
– Frequent archiving: space consuming
– Infrequent archiving: loses information

12 The CIA World Factbook
By far the most widely used source of demographic data
Annual (paper) distribution from 1980
Web versions from circa 1998
Annual release until 2007. Now released every few weeks.
CIA keeps annual versions from 2001 but tells you that older versions are “available from libraries”!
We obtained, with some difficulty and a lot of data cleaning, all versions since 1990

13 A Working System
Implemented by Heiko Müller
For scale, we require external sorting of large XML files
Designed and implemented by Ioannis Koltsidas, Heiko Müller and Stratis Viglas
Has a simple temporal query language
Experimented with recent (HTML) versions of the CIA World Factbook

14 What the archive looks like
Afghanistan
  Communications
    Internet users
      1,000 (2002)
      30,000 (2005)
      NA
    Radios
      167,000 (1999)
    Telephones - main lines in use
      100,000 (2005)
      280,000 (2005)
      29,000 (1998)
      33,100 (2002)
    …
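One way to picture such an archive is as a single merged structure in which every value is stored once, together with the versions in which it held. The Python sketch below illustrates that idea under stated assumptions: the names `add_version` and `snapshot_at`, the dictionary layout and the version labels `v1` to `v3` are all hypothetical, and this is not the XML-based system described in the talk.

```python
def add_version(archive, version, snapshot):
    """Merge one snapshot (path -> value) into the archive.

    archive maps path -> list of [value, [versions in which it held]].
    A value unchanged since the last recorded one is not stored again;
    only its version list grows.
    """
    for path, value in snapshot.items():
        history = archive.setdefault(path, [])
        if history and history[-1][0] == value:
            history[-1][1].append(version)
        else:
            history.append([value, [version]])


def snapshot_at(archive, version):
    """Reconstruct the snapshot that was archived as `version`."""
    return {
        path: value
        for path, history in archive.items()
        for value, versions in history
        if version in versions
    }


# Values are taken from the slide; the version labels are made up.
USERS = ("Afghanistan", "Communications", "Internet users")
RADIOS = ("Afghanistan", "Communications", "Radios")

v1 = {USERS: "1,000 (2002)", RADIOS: "167,000 (1999)"}
v2 = {USERS: "1,000 (2002)", RADIOS: "167,000 (1999)"}
v3 = {USERS: "30,000 (2005)", RADIOS: "167,000 (1999)"}

archive = {}
for label, snap in [("v1", v1), ("v2", v2), ("v3", v3)]:
    add_version(archive, label, snap)

print(archive[USERS])
print(snapshot_at(archive, "v2") == v2)
```

Under this layout an unchanged value costs one extra version label per release instead of a full copy, which is why frequent archiving need not be space consuming.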

15 How did the population of China change from 2002 to 2007?
Population
1,284,303,705 (July 2002 est.)
1,286,975,468 (July 2003 est.)
1,298,847,624 (July 2004 est.)
1,306,313,812 (July 2005 est.)
1,313,973,713 (July 2006 est.)
1,321,851,888 (July 2007 est.)
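Once the history of a single field has been retrieved, longitudinal questions reduce to simple arithmetic. The Python sketch below computes the year-on-year change from the figures listed above; keying the values by the year of the estimate and the helper name `changes` are assumptions made for illustration.

```python
# Figures as listed on the slide, keyed by the year of the estimate
# (the keying is an assumption; the slide only shows the values).
population = {
    2002: 1_284_303_705,
    2003: 1_286_975_468,
    2004: 1_298_847_624,
    2005: 1_306_313_812,
    2006: 1_313_973_713,
    2007: 1_321_851_888,
}


def changes(history):
    """Return (from_year, to_year, difference) for consecutive years."""
    years = sorted(history)
    return [(a, b, history[b] - history[a]) for a, b in zip(years, years[1:])]


for start, end, delta in changes(population):
    print(f"{start}-{end}: {delta:+,}")

total = population[2007] - population[2002]
print(f"2002-2007: {total:+,}")
```

The 2003 to 2004 step is several times larger than its neighbours, which is the kind of irregularity that only becomes visible when old versions are kept.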

16 How did the land area of countries change from 2002 to 2007?
…
land
  82,444 sq km
  82,738 sq km
…
land
  545,630 sq km
  640,053 sq km; 545,630 sq km (metropolitan France)
…

17 What are the differences between the factbooks on 21/08/2007 and 10/09/2007?
30,000 (2005)
535,000 (2006)
1.4 million (2005)
2.52 million (2006)
…
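A query like this amounts to a diff between two snapshots. The Python sketch below shows one simple way to compute it when each snapshot is a mapping from a keyed path to a value. The function name `diff` and the paths are hypothetical (the field labels did not survive in the transcript); only the values come from the slide.

```python
def diff(old, new):
    """Compare two snapshots (path -> value).

    Returns three dicts: paths only in `new`, paths only in `old`,
    and paths whose value changed, mapped to (old_value, new_value).
    """
    added = {p: new[p] for p in new.keys() - old.keys()}
    removed = {p: old[p] for p in old.keys() - new.keys()}
    changed = {
        p: (old[p], new[p])
        for p in old.keys() & new.keys()
        if old[p] != new[p]
    }
    return added, removed, changed


# Hypothetical paths; the values are the ones shown on the slide.
factbook_aug = {
    ("Country A", "Field X"): "30,000 (2005)",
    ("Country B", "Field Y"): "1.4 million (2005)",
}
factbook_sep = {
    ("Country A", "Field X"): "535,000 (2006)",
    ("Country B", "Field Y"): "2.52 million (2006)",
}

added, removed, changed = diff(factbook_aug, factbook_sep)
for path, (before, after) in sorted(changed.items()):
    print("/".join(path), ":", before, "->", after)
```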

18 Efficiency: 100 days of OMIM
[Plot: size (bytes) x 10^6 against version, with curves for archive, inc diff, version, compressed inc diff (gzip) and compressed archive (XMill)]
Uncompressed archive size is
– ≤ 1.01 times diff repository size
– ≤ 1.04 times size of largest version
Compressed archive size is between 0.94 and 1 times compressed diff repository size
gzip: Unix compression tool
XMill: XML compression tool

19 How do you cite something in a database?
Many scientific databases ask you to cite them, but...
– they don’t tell you how, or
– they tell you to give the URL, or
– they tell you to cite a paper about the database.
Nutrition Education for Diverse Audiences [Internet]. Urbana (IL): University of Illinois Cooperative Extension Service, Illinet Department; [updated 2000 Nov 28; cited 2001 Apr 25]. Diabetes mellitus lesson; [about 1 screen]. Available from http://www.aces.uiuc.edu/~necd/inter2_search.cgi?ind=854148396
NLM Recommended Formats for Bibliographic Citation. Internet Supplement. NLM Technical report Bethesda, MD 20894, July 2001.

20 The structure of a citation
Bard JB and Davies JA. Development, Databases and the Internet. Bioessays. 1995 Nov; 17(11):999-1001.
[Location and descriptive information]
Ann. Phys., Lpz 18 639-641
Nature, 171,737-738
(We often want more than location)

21 Location is more important in databases
The IUPHAR database
1. The IUPHAR database (C1) contains no information about Ginandtonicin.
2. The IUPHAR database (C2) lists five ligands for Melatonin receptor MT1.
3. The IUPHAR database (C3) asserts that luzindole is an antagonist ligand for receptor MT1.

22 We could solve the problem by turning the database into a book
Selected pages from the book, generated by a 100-line style sheet and data publishing technology developed by Wenfei Fan


24 Automatically generating citations
A rule:
{ DB=IUPHAR, Version=$v, Family=$f, Receptor=$r, Contributors=$a, Editor=$e, Date=$d, DOI=$i}
← /Root[ ]/Version[Number=$’v,Editor=$?e, DOI=$.i, Date=$.d]
  /Data[ ]/Family[FamilyName=$’f]
  /Contributor-list/Contributor=$+a]
  /Receptor[ReceptorName=$’r]
What gets generated (example):
{ DB=IUPHAR, Version=11, Family=Calcitonin, Receptor=CALCR, Contributors={Debbie Hay, David R. Poyner}, Editor=Tony Harmar, Date=Jan 2006, DOI=10.1234 }
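The effect of such a rule can be imitated in a few lines of Python: walk down the hierarchy to the cited receptor and collect the citation fields found along the path. This is only a sketch. The nested-dictionary layout of `DATABASE` and the function `cite` are assumptions made for illustration; the field values are the ones in the example output above, and the rule language itself is not implemented here.

```python
# Hypothetical in-memory layout of the database; only the values
# come from the example citation on the slide.
DATABASE = {
    "Version": {
        "Number": 11,
        "Editor": "Tony Harmar",
        "DOI": "10.1234",
        "Date": "Jan 2006",
    },
    "Data": {
        "Family": [
            {
                "FamilyName": "Calcitonin",
                "Contributors": ["Debbie Hay", "David R. Poyner"],
                "Receptor": [{"ReceptorName": "CALCR"}],
            }
        ]
    },
}


def cite(db, family_name, receptor_name):
    """Build a citation record for one receptor from the path that leads to it."""
    version = db["Version"]
    for family in db["Data"]["Family"]:
        if family["FamilyName"] != family_name:
            continue
        for receptor in family["Receptor"]:
            if receptor["ReceptorName"] == receptor_name:
                return {
                    "DB": "IUPHAR",
                    "Version": version["Number"],
                    "Family": family["FamilyName"],
                    "Receptor": receptor["ReceptorName"],
                    "Contributors": list(family["Contributors"]),
                    "Editor": version["Editor"],
                    "Date": version["Date"],
                    "DOI": version["DOI"],
                }
    raise KeyError(f"{family_name}/{receptor_name} not found")


citation = cite(DATABASE, "Calcitonin", "CALCR")
print(citation)
```

The citation combines location (family, receptor) with descriptive information (contributors, editor, version, date) drawn from different levels of the hierarchy.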

25 Other database research:
Data cleaning: yes, even curated data needs to be cleaned
Database wikis: how does a community develop and populate a structured database?
And, more generally:
Economic issues. Although curated DBs are mostly open-access, they are much more expensive than other “digital objects” to maintain.
How does one distribute a curated database for better access and safety?
What are the IP issues in data that is heavily fragmented and copied?

26 What is digital curation?


28 The Tegola Project
Bringing high-speed internet to rural Scotland
Can you configure a router? Crimp an RJ45 jack? Climb a Munro?
Then come and help us deliver broadband to the remote areas where it is desperately needed

