PROVENANCE MANAGEMENT & CITATIONS IN CURATED DATABASES. Kleisarchaki Sophia, HY561, 05/05/09.


1 PROVENANCE MANAGEMENT & CITATIONS IN CURATED DATABASES
Kleisarchaki Sophia, HY561, 05/05/09

2 ABOUT THE AUTHOR: PETER BUNEMAN
He works in the Database Group in the Laboratory for Foundations of Computer Science (University of Edinburgh). He spent many years in the Database Group of the Department of Computer and Information Science at the University of Pennsylvania. You can find him... in polynomial time.

3 CONTENTS
Before all:
• “Curated Databases”, Peter Buneman, James Cheney, Wang-Chiew Tan
• “Provenance in Databases (Tutorial Outline)”, Peter Buneman, Wang-Chiew Tan
1st paper: “Provenance Management in Curated Databases”, Peter Buneman, Adriane P. Chapman, James Cheney
2nd paper: “How to cite curated databases and how to make them citable”, Peter Buneman

4 CURATED DATABASES
What is a curated database?
• The term “curated” comes from the Latin curare, to care for.
• Curated databases are the result of a great deal of annotation, correction and transfer of data from other sources.
• They are populated and updated with a great deal of human effort through the consultation, verification and aggregation of existing sources and the interpretation of new raw data.

5 CURATED DATABASES
What is a curated database NOT?
• Curated databases are not warehouses. They are manually constructed by highly skilled scientists.
• They are not views.
• They are not computed automatically from existing datasets.

6 CURATED DATABASES
Notable examples of curated databases:
• UniProt (formerly called SwissProt), used in molecular biology.
• CIA World Factbook: a source of demographic data.
• IUPHAR: a receptor database, maintained by volunteers.
• Nuclear Protein Database (NPD).
• Reference manuals, dictionaries and gazetteers.
• Such databases are not confined to biology; they are also being developed in areas such as astronomy and geology. Wikipedia and other wikis are also curated in that they are the product of direct human effort.

7 CURATED DATABASES
What are the characteristics of a curated database?
• Source. Data is copied and edited from existing sources, perhaps other curated databases. Knowing the origin (the provenance) is important.
• Annotation. In addition to core data, curated databases also contain annotations that carry additional pieces of information such as provenance.
• Update. A common practice is to keep a working database up to date and to “publish” versions of it.
• Schema and structure. Curated databases are constructed “on the cheap” and are usually stored in a text file. Almost inevitably the structure of the entries evolves over time.


9 PROVENANCE IN DATABASES (1/2)
Provenance, also called lineage or pedigree, describes the source and derivation of data.
Why is provenance important? It helps to:
• Determine the authenticity of a work.
• Establish the historical importance of a work by suggesting other artists who might have seen and been influenced by it.
• Determine the legitimacy of current ownership.
• Trust the data.

10 PROVENANCE IN DATABASES (2/2)
Overview of provenance:
• Provenance: describes the source and derivation of data.
  • Workflow or coarse-grain provenance: records a complete history of the derivation of some data set.
  • Dataflow or fine-grain provenance: describes the derivation of part of the resulting data set.
    • Why-provenance: keeps the justification for the element appearing in the output.
    • Where-provenance: identifies the source elements from which the data in the target is copied.

11 WHERE-, WHY-PROVENANCE
NYHotels (source table)
Hotel             Zip     Rating
Waldorf Astoria   10022   4.5
Holiday Inn DT    10013   4.0

Restaurant table (source table)
Restaurant           Cost   Type      Zip
Peacock Alley        $$$    French    10022
Bull & Bear          $$$    Seafood   10022
Pacifica             $      Chinese   10013
Soho Kitchen & Bar   $      American  10022

View (JOIN, PROJECT) with columns Hotel, Restaurant, Cost, Rating.
The slide marks the where-provenance and the why-provenance of the Rating value 4.5 in the view.
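The distinction can be sketched in a few lines of Python. This is only an illustration under assumptions: the second table's name (`NYRestaurants`), the join on `Zip`, and the representation of locations as tuples are not taken from the slide.

```python
# Source tables as listed on the slide; the name NYRestaurants is an assumption.
ny_hotels = [
    {"Hotel": "Waldorf Astoria", "Zip": "10022", "Rating": 4.5},
    {"Hotel": "Holiday Inn DT", "Zip": "10013", "Rating": 4.0},
]
ny_restaurants = [
    {"Restaurant": "Peacock Alley", "Cost": "$$$", "Type": "French", "Zip": "10022"},
    {"Restaurant": "Bull & Bear", "Cost": "$$$", "Type": "Seafood", "Zip": "10022"},
    {"Restaurant": "Pacifica", "Cost": "$", "Type": "Chinese", "Zip": "10013"},
    {"Restaurant": "Soho Kitchen & Bar", "Cost": "$", "Type": "American", "Zip": "10022"},
]


def join_with_provenance(hotels, restaurants):
    """Join on Zip (assumed), project four columns, and annotate each view row."""
    view = []
    for hi, h in enumerate(hotels):
        for ri, r in enumerate(restaurants):
            if h["Zip"] != r["Zip"]:
                continue
            row = {"Hotel": h["Hotel"], "Restaurant": r["Restaurant"],
                   "Cost": r["Cost"], "Rating": h["Rating"]}
            # Why-provenance: the source tuples that justify this row's presence.
            why = {("NYHotels", hi), ("NYRestaurants", ri)}
            # Where-provenance: the source cell each output value was copied from.
            where = {
                "Hotel": ("NYHotels", hi, "Hotel"),
                "Restaurant": ("NYRestaurants", ri, "Restaurant"),
                "Cost": ("NYRestaurants", ri, "Cost"),
                "Rating": ("NYHotels", hi, "Rating"),
            }
            view.append((row, why, where))
    return view
```

For a view row with Rating 4.5, `why` names one hotel tuple and one restaurant tuple, while `where["Rating"]` names only the Rating cell in NYHotels.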

12 CONTENTS
1st paper: “Provenance Management in Curated Databases”, Peter Buneman, Adriane P. Chapman, James Cheney
2nd paper: “How to cite curated databases and how to make them citable”, Peter Buneman

13 WHAT IS THE PROBLEM BEING ADDRESSED IN THE PAPER?
• Database technology is employed not only to provide access to source data, but also to the derived knowledge of scientists who have interpreted the data.
• Provenance, or metadata describing creation, recording, ownership, processing or version history, is essential for assessing the value of such data.
What information should be retained? How should it be managed?

14 WHAT IS THIS PAPER ABOUT?
• Investigates general-purpose techniques for recording provenance for data that is copied among databases.
• Describes an approach that tracks the user's actions in order to record them in a convenient, queryable form.
• Presents an implementation of this technique and uses it to evaluate the feasibility of database support for provenance management.

15 CURATED DATABASES - EXAMPLE
Example: a curator
a) Copies records of some interesting proteins from a SwissProt webpage into her database.
b) Fixes the new entries so that the PTM (post-translational modification) found in SwissProt is not confused with her own.
c) Copies some publications from OMIM and NCBI.
d) One year later, finds a discrepancy between two PTMs.

16 THE PROBLEM
It is necessary to retain provenance information describing the source and version history of the data. We focus on “fine-grained” provenance, which describes how data has moved through a network of databases. We need to record both local modifications to the database (insert, delete, update) and global operations such as copying data from external sources.
Constraints:
1. There is no standard for storing or exchanging provenance.
2. Practices for identifying or locating data vary.
3. Past versions may not be archived.
4. Curators employ a variety of application programs that cannot be changed.

17 OUR APPROACH (1/2)
The user's actions are captured as a sequence of insert, delete, copy and paste operations by a provenance-aware application.
[Figure: provenance architecture, showing external source databases, the local database and an auxiliary provenance database.]

18 OUR APPROACH (2/2)
We implemented a naïve approach and several more sophisticated ones.
• The naïve approach increases the time to process each update by 28%.
• The amount of provenance information stored is proportional to the size of the changed data.
Optimization techniques:
• Transactional provenance management.
• Hierarchical provenance management.
Together these optimizations reduce the added processing cost of provenance tracking to less than 5-10% per operation and reduce the storage cost by a factor of 5-7 relative to the naïve approach. Typical provenance queries can also be executed more efficiently.

19 MANUAL UPDATES AND PROVENANCE (1/2)
“Where does a piece of data come from?” We need a means of describing the location of any data element.
Two assumptions:
• The database can be viewed as a tree.
• Each path of edge labels occurs at most once, so a path identifies a single node (for example, SwissProt/Release{20}/Q01780 identifies a specific entry).

20 MANUAL UPDATES AND PROVENANCE (2/2)
Update operations are of the form:
u ::= ins {a : v} into p | del a from p | copy q into p
• ins {a : v} into p: inserts an edge labeled a with value v into the subtree at p.
• del a from p: deletes an edge and its subtree.
• copy q into p: replaces the subtree at p with a copy of the subtree at location q.
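A minimal sketch of these three operations, assuming the databases are modeled as nested Python dictionaries keyed by edge label and that paths are written as slash-separated labels. The function names and the dictionary encoding are illustrative choices, not the paper's implementation.

```python
import copy


def walk(db, labels):
    """Follow a list of edge labels from the top of the forest `db`."""
    node = db
    for label in labels:
        node = node[label]
    return node


def ins(db, a, v, p):
    """ins {a : v} into p: add an edge labeled a with value v under p."""
    parent = walk(db, p.split("/"))
    if a in parent:
        raise KeyError(f"edge {a!r} already exists under {p}")
    parent[a] = v


def delete(db, a, p):
    """del a from p: remove the edge a and its whole subtree."""
    del walk(db, p.split("/"))[a]


def copy_into(db, q, p):
    """copy q into p: replace the subtree at p with a copy of the subtree at q."""
    *parent_labels, last = p.split("/")
    walk(db, parent_labels)[last] = copy.deepcopy(walk(db, q.split("/")))
```

The deep copy keeps the source subtree independent of the pasted one, so later edits in the target do not leak back into the source.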

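21 PROVENANCE TRACKING
Provenance links are stored in a relation Prov(Tid, Op, Loc, Src), where Tid is the transaction identifier, Op the operation (insert, delete or copy), Loc the affected location and Src the source location of a copy.
[Figure: provenance architecture, showing external source databases, the local database and an auxiliary provenance database.]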

22 NAÏVE PROVENANCE
• Stores one provenance record for each copied, inserted or deleted node.
• Retains the maximum possible information about the user's actions.
• Wasteful in terms of space.
[Figure: one transaction per line.]

23 TRANSACTIONAL PROVENANCE
• Actions are grouped into transactions larger than a single operation.
• Only the provenance links describing the net changes resulting from a transaction are stored.
• Details about intermediate states are not retained, so this is less precise than the naïve approach.
Number of transactional provenance records: i + d + c
• i: number of inserted nodes in the output.
• d: number of nodes deleted from the input.
• c: number of copied nodes in the output.
[Figure: entire update as one transaction.]
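The idea of keeping only net changes can be sketched as follows. This is a simplified model, not the paper's algorithm: operations are given per node as hypothetical `(op, loc, src)` triples, and inserts and copies are assumed to target locations that did not exist before the transaction.

```python
def net_provenance(tid, ops):
    """Collapse the node-level operations of one transaction into net links.

    ops is a list of (op, loc, src) triples with op in {"I", "D", "C"}.
    Returns Prov-style records (tid, op, loc, src).
    """
    pending = {}  # loc -> (op, src), the links to write at commit time
    for op, loc, src in ops:
        if op == "I":
            pending[loc] = ("I", None)
        elif op == "C":
            # If the source is temporary data created in this transaction,
            # link to where that data came from instead.
            earlier = pending.get(src)
            if earlier is not None and earlier[0] != "D":
                pending[loc] = earlier
            else:
                pending[loc] = ("C", src)
        elif op == "D":
            if loc in pending:
                del pending[loc]  # created and removed in the same transaction
            else:
                pending[loc] = ("D", None)
        else:
            raise ValueError(f"unknown operation {op!r}")
    return [(tid, op, loc, src) for loc, (op, src) in pending.items()]
```

In the test case, five user actions leave three records: one copied node, one inserted node and one deleted node, matching i + d + c.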

24 HIERARCHICAL PROVENANCE (1/2)
• It is not necessary to store all of the provenance links explicitly.
• The provenance of a child of a copied node can often be inferred from its parent's provenance using a simple rule.
• Does not discard any information.
• Does not require the user to group operations into transactions.
[Figure: hierarchical version of the naïve approach. 25% smaller than Prov, but much larger savings are possible.]

25 HIERARCHICAL PROVENANCE (2/2)
We can define the full provenance table as a view of the hierarchical table as follows:
• If the provenance is specified in HProv, then it is just copied into Prov.
• Otherwise, the provenance of every target path p/a not mentioned in HProv is q/a, provided p was copied from q.
Infer(t, p) ← ¬(∃x, q. HProv(t, x, p, q))
Prov(t, op, p, q) ← HProv(t, op, p, q)
Prov(t, C, p/a, q/a) ← Prov(t, C, p, q), Infer(t, p/a)
Prov(t, I, p/a, ⊥) ← Prov(t, I, p, ⊥), Infer(t, p/a)
Prov(t, D, p/a, ⊥) ← Prov(t, D, p, ⊥), Infer(t, p/a)
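Read operationally, the view says that a path not mentioned in HProv inherits from its nearest ancestor that is mentioned. A small sketch of that lookup, assuming HProv is held as a dictionary from `(transaction, location)` to `(op, source)` with at most one record per key (an encoding chosen here for illustration):

```python
def prov_lookup(hprov, t, path):
    """Return the (op, src) that the Prov view assigns to `path` in transaction t.

    Returns None if neither the path nor any ancestor has a record in
    transaction t, meaning the path was unchanged.
    """
    labels = path.split("/")
    # Try the path itself first, then ever shorter prefixes (nearest ancestor).
    for cut in range(len(labels), 0, -1):
        prefix = "/".join(labels[:cut])
        record = hprov.get((t, prefix))
        if record is None:
            continue
        op, src = record
        if op == "C":
            # A child p/a of a node copied from q is inferred to come from q/a.
            return ("C", "/".join([src] + labels[cut:]))
        return (op, None)  # inserts and deletes carry no source
    return None
```

Only the root of a copied subtree needs a stored record; the provenance of every descendant is computed on demand.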

26 TRANSACTIONAL-HIERARCHICAL PROVENANCE
A combination of the transactional and hierarchical provenance techniques.
Storage is i + d + C:
• i: number of inserted nodes in the output.
• d: number of nodes deleted from the input.
• C: number of roots of copied subtrees that appear in the output.
[Figure: hierarchical version of (b), entire update as one transaction.]

27 PROVENANCE QUERIES
Define some convenient views of the raw Prov table.
Ins(t, p) ← Prov(t, I, p, ⊥)  “p was inserted during transaction t”
Del(t, p) ← Prov(t, D, p, ⊥)  “p was deleted during transaction t”
Copy(t, p, q) ← Prov(t, C, p, q)  “p was copied from q during transaction t”
Unch(t, p) ← ¬(∃x, q. Prov(t, x, p, q))  “p was unchanged during transaction t”

28 PROVENANCE QUERIES
Define some convenient views of the raw Prov table.
From(t, p, q): “node p comes from q during transaction t”
From(t, p, q) ← Copy(t, p, q)
From(t, p, p) ← Unch(t, p)
Trace(p, t, q, u): “the data at location p at the end of transaction t came from the data at location q at the end of transaction u”
Trace(p, t, p, t).
Trace(p, t, q, t-1) ← From(t, p, q).
Trace(p, t, q, u) ← Trace(p, t, r, s), Trace(r, s, q, u).
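Because each location has at most one predecessor per transaction, Trace can be computed by stepping backwards one transaction at a time. The sketch below assumes the full Prov table is a dictionary from `(transaction, location)` to `(op, source)` and that transactions are numbered 1, 2, 3, ...; both are assumptions made for the example.

```python
def trace(prov, p, t):
    """List the (location, transaction) pairs that Trace relates to (p, t)."""
    chain = [(p, t)]  # Trace(p, t, p, t) holds trivially
    while t > 0:
        record = prov.get((t, p))
        if record is None:
            pass  # Unch(t, p): the data stayed where it was
        elif record[0] == "C":
            p = record[1]  # Copy(t, p, q): continue from the source q
        else:
            break  # inserted or deleted during t: no earlier origin
        t -= 1
        chain.append((p, t))
    return chain
```

The chain stops either at transaction 0 or at the transaction that inserted the data.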

29 Let’s answer some… “simple” questions!

30 PROVENANCE QUERIES (1/2)
Q1: Src. What transaction first created the data at a location? (e.g. who entered your telephone number incorrectly?)
Src(p) = {u | ∃q. Trace(p, tnow, q, u), Ins(u, q)}
Q2: Hist. What is the sequence of all transactions that copied a node to its current position?
Hist(p) = {u | ∃q, r. Trace(p, tnow, q, u), Copy(u, q, r)}
Q3: Mod. What transactions are responsible for the creation or modification of the subtree under a node?
Mod(p) = {u | ∃q, r. p ≤ q, Trace(q, tnow, r, u), ¬Unch(u, r)}
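The three queries can be sketched on top of a backwards trace. As before, the dictionary encoding of Prov and the integer transaction numbers are assumptions; `paths` is a hypothetical list of the locations that currently exist, needed to enumerate the subtree under a node.

```python
def trace(prov, p, t):
    """Walk back from (p, t) through copies and unchanged steps."""
    chain = [(p, t)]
    while t > 0:
        record = prov.get((t, p))
        if record is not None:
            if record[0] != "C":
                break  # inserted or deleted here: nothing earlier
            p = record[1]
        t -= 1
        chain.append((p, t))
    return chain


def op_at(prov, u, q):
    """Operation recorded for location q in transaction u, or None if unchanged."""
    record = prov.get((u, q))
    return record[0] if record else None


def src(prov, p, tnow):
    """Q1: transactions that inserted the data now found at p."""
    return {u for q, u in trace(prov, p, tnow) if op_at(prov, u, q) == "I"}


def hist(prov, p, tnow):
    """Q2: transactions that copied the data on its way to p, oldest first."""
    return sorted(u for q, u in trace(prov, p, tnow) if op_at(prov, u, q) == "C")


def mod(prov, paths, p, tnow):
    """Q3: transactions that created or modified anything at or under p."""
    result = set()
    for q in paths:
        if q == p or q.startswith(p + "/"):
            result |= {u for r, u in trace(prov, q, tnow) if op_at(prov, u, r)}
    return result
```

With hierarchical storage, `prov.get` would be replaced by a lookup that infers missing records from ancestors.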

31 PROVENANCE QUERIES (2/2)
There are many interesting queries that mention both provenance and the raw data.
Q4: Project the A field out of relation R(Id, A, B) along with its current provenance.
Q(x, Px) ← R(k, x, y), From(tnow, “R/” + k + “/A”, Px)
Such queries are tricky to write by hand. Providing advanced support for provenance queries is future work.
Note: if some source databases do not track provenance, then queries stop following the chain of provenance at those databases.

32 IMPLEMENTATION
Wrappers for source and target databases.
[Figure: provenance architecture with source database OrganelleDB, target database MiMI and an auxiliary provenance database.]

33 IMPLEMENTATION OF PROVENANCE TRACKING (1/2)
Naïve provenance
• A straightforward process of recording the target and source information of every transaction that affects the target database.
• For a paste operation, one record is added per node in the copied subtree.
Transactional provenance
• When a commit action occurs, CPDB stores the provenance links connecting the current version with its predecessor.
• No links corresponding to temporary data are stored.
• The implementation maintains a provlist of provenance links that will be added to the provenance store when the user commits.

34 IMPLEMENTATION OF PROVENANCE TRACKING (2/2)
Hierarchical provenance
• Stores at most one record per operation.
• For a copy, stores the record connecting the root of the copied tree to the root of the source.
Hierarchical transactional provenance
• Maintains hierarchical provenance records instead of naïve provenance records in provlist.
• Checks for and removes redundant links from provlist, e.g. copy S/a to T/a, copy S/a/b to T/a/b → redundant links.
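One way to picture the redundancy check: a copy link is redundant when the nearest ancestor link already implies it. The sketch below assumes provlist holds only copy links, represented as a dictionary from target path to source path; it illustrates the idea and is not CPDB's actual code.

```python
def prune_redundant(links):
    """Drop copy links that can be inferred from the nearest ancestor's link."""
    kept = {}
    for loc, src in links.items():
        labels = loc.split("/")
        inferred = None
        # Look for the nearest proper ancestor that has its own link.
        for cut in range(len(labels) - 1, 0, -1):
            prefix = "/".join(labels[:cut])
            if prefix in links:
                inferred = "/".join([links[prefix]] + labels[cut:])
                break
        if inferred != src:
            kept[loc] = src  # carries information the ancestor does not
    return kept
```

In the slide's example, the link from S/a/b to T/a/b is dropped because the link from S/a to T/a already implies it, whereas a child pasted from elsewhere keeps its own link.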

35 PROVENANCE QUERIES - IMPLEMENTATION
• Src, Mod and Hist are implemented as programs.
• For naïve and transactional provenance, they query the provenance store directly.
• For hierarchical provenance, the provenance store corresponds to the HProv relation. They query the provenance store directly and compute the appropriate provenance links on the fly.

36 EVALUATION
• The experiments focused primarily on the storage and processing requirements of provenance tracking for the different approaches.
• Query optimization and database tuning are left for future work.
• Random sequences of copy-paste operations were used to simulate worst-case behavior.

37 EXPERIMENTAL SETUP
Performed five sets of experiments. Used six patterns of update operations.
[Tables: update patterns; deletion patterns.]

38 FIRST TWO EXPERIMENTS
First experiment. Figure 7: Number of entries in the provenance store after a variety of update patterns of length 3500.
Second experiment. Figure 8: Number of entries in the provenance store after mix and real update patterns of length 14000. The number at the top of each bar shows the physical size of the table.
N (naïve) and T (transactional) store 4 records per copy. H (hierarchical) and HT (hierarchical transactional) store only 1 record.

39 SECOND EXPERIMENT
Figure 9 shows the time spent on storing provenance information for all the techniques.
Figure 9: The average amount of time for target database processing and for add, delete, copy and commit operations on the provenance store during a 14000-mix update.
The copy time for T is close to zero, because copies do not involve interaction with the provenance store.

40 SECOND EXPERIMENT
Figure 10: The overhead of provenance tracking per operation as a percentage of the time to perform each basic operation.
• For the naïve approach, all operations require less than 30% of the processing time needed for interaction with the target DB.
• H-provenance requires more time to process inserts than copies.
• H-provenance treats deletes in the same way as naïve provenance.
• T-provenance: inserts and copies run essentially instantaneously, because no interaction with the target database or provenance store is needed.

41 THIRD EXPERIMENT
Measured the effects of deletes on provenance storage.
Figure 11: The effect of deletion on the provenance store. The notation (ac) indicates the provenance table size when only add and copy operations are performed, while (acd) includes deletes.
HT-provenance stores the fewest records among the approaches for each update pattern.

42 FOURTH EXPERIMENT
Figure 12: The effect of transaction size on provenance processing time.
The time to process a commit grows approximately linearly with transaction length.

43 FIFTH EXPERIMENT
Figure 13: The time needed to perform basic provenance queries.
All three queries ran fastest for transactional provenance.

44 CONCLUSIONS
The experimental results affirm that provenance can be tracked and managed efficiently using our approach. This is a promising first step towards providing powerful, general-purpose tools that will make life easier for scientific data curators and increase the reliability and transparency of the scientific record.

45 CONTENTS
1st paper: “Provenance Management in Curated Databases”, Peter Buneman, Adriane P. Chapman, James Cheney
2nd paper: “How to cite curated databases and how to make them citable”, Peter Buneman

46 WHAT IS THE PROBLEM BEING ADDRESSED IN THE PAPER?
• The importance of citing databases.
• Citing something that has internal structure and evolves over time.
• The paper proposes a stable citation system for IUPHAR.
It describes:
• How to publish the database in a form that can be cited.
• How to ensure that the citations remain valid.
• How to generate and validate the citations automatically.

47 PRELIMINARIES (1/4)
Citations are used to identify the source material and provide some additional information.
Example: Bard JB and Davies JA. Development, Databases and the Internet. Bioessays. 1995 Nov;17(11):999-1001.
This is much more than we need to identify the work. Either of the following is sufficient:
• Bioessays 17:999-1001
• Bard JB and Davies JA. Development, Databases and the Internet.

48 PRELIMINARIES (1/4)
Citations are used to identify the source material and provide some additional information.
Example: the citations “Ann. Phys., Lpz 18 639-641” and “Nature, 171, 737-738”, while adequate for identification, hardly convey the importance of these publications.


50 PRELIMINARIES (2/4)
• A citation does not give us a specific mechanism for retrieving a document.
• It is useful for finding what we are looking for.
• It is a structure that can be used by a variety of mechanisms such as online indexes and search engines.
A citation consists of two kinds of information. Example: Bard JB and Davies JA. Development, Databases and the Internet. Bioessays. 1995 Nov;17(11):999-1001.
First kind: location.

51 PRELIMINARIES (2/4)
• A citation does not give us a specific mechanism for retrieving a document.
• It is useful for finding what we are looking for.
• It is a structure that can be used by a variety of mechanisms such as online indexes and search engines.
A citation consists of two kinds of information. Example: Bard JB and Davies JA. Development, Databases and the Internet. Bioessays. 1995 Nov;17(11):999-1001.
Second kind: descriptive information (authorship, title, date).

52 PRELIMINARIES (3/4)
Requirements concerning citations:
• There is some “thing” that is being cited.
• The thing should be accessible.
• The thing should not change over time.
There are few accepted practices for supporting citation of data:
• Few standards.
• Little supporting technology.

53 PRELIMINARIES (4/4)
D1: For any citation C, the thing that C cites should remain fixed.
Since databases change, this simple requirement is not always easy to maintain.
D2: Any citable thing T should contain a citation C that cites T.
Anything we cite should provide us with at least one way of citing it. This is not always done in journal publications. It is essential because:
1. We want confirmation that we have found the correct thing. Even if we have found T using some other citation C’, we want to be sure that C and C’ refer to the same thing.
2. If we found T by some other means (e.g. a search engine), we want to know how to cite it.

54 CURRENT PRACTICE
On-line databases frequently give recommendations on how to cite them, but:
• They often omit version information.
• They fail to provide an adequate location.
The Columbia Guide to Online Style, although it discusses issues of permanence of links, does not mention D1 as one of its citation “principles”.
The ISO 690 standard deals with citations of parts of electronic documents.

55 STRUCTURAL ISSUES (1/5)
Databases have explicit structure. This offers the possibility of a citation using this structure to home in on the relevant data.
Example (IUPHAR database). Figure 1: Rough structure of the IUPHAR web interface.
The structure of what the user sees is not the same as that of the underlying database.

56 STRUCTURAL ISSUES (2/5)
Consider the following:
1. The IUPHAR database (C1) contains no information about Ginandtonicin.
2. The IUPHAR database (C2) lists five ligands for Melatonin receptor MT1.
3. The IUPHAR database (C3) asserts that luzindole is an antagonist ligand for receptor MT1.
Making the context too narrow can be as counterproductive as making it too wide.
• C1 should refer to the whole database.
• C2 should be the web page for that receptor or maybe the receptor family page.
• C3: cite just that row, or the table? Better, cite the receptor or its family.

57 STRUCTURAL ISSUES (3/5)
One citation is coarser than another if it refers to a higher structure.
D3: It should be possible to cite a database at varying degrees of coarseness.
In order to make further progress we have to look at the internal structure of citations. In “Life Sci., 53, 393-398”, “Life Sci.” is the journal, 53 the volume number and 393-398 the pages. Our understanding is based on a common structure shared by all journals.

58 STRUCTURAL ISSUES (4/5)
A “concrete syntax” for citations is a sequence {k1 = v1, k2 = v2, ...}, where k1, k2, ... are keywords and v1, v2, ... are associated values.
Example: {Journal = “Life Sci.”, Number = 53, Pages = 393-398}
There is a natural “part of” relationship among citations.
Example: {Journal = “Life Sci.”} is part of {Journal = “Life Sci.”, Number = 53}
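The keyword-value syntax maps naturally onto a dictionary, and “part of” becomes a subset test on its entries. A minimal sketch, assuming all values are simple hashable values such as strings or numbers (the function name is illustrative):

```python
def is_part_of(coarser, finer):
    """True if every keyword=value pair of `coarser` also appears in `finer`."""
    return coarser.items() <= finer.items()


journal = {"Journal": "Life Sci."}
volume = {"Journal": "Life Sci.", "Number": 53}
article = {"Journal": "Life Sci.", "Number": 53, "Pages": "393-398"}
```

Here `is_part_of(journal, volume)` and `is_part_of(volume, article)` both hold, while `is_part_of(article, journal)` does not, mirroring the idea that a coarser citation carries less location information.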

59 STRUCTURAL ISSUES (5/5)
D4: If C and C’ are citations and what C’ cites is coarser than what C cites, then the location information in C’ should be a part of the location information in C.
C’: {DB=IUPHAR, IUPHAR-Receptor-family=Melatonin}
C: {IUPHAR-Receptor-family=Melatonin}

60 TEMPORAL ISSUES (1/2)
The obvious way to deal with change in citations is to provide a version number in the citation: {DB=IUPHAR, Version=17, Family=Melatonin}
• Using time instead may be misleading.
• To what does the version refer? D5: Versions should be recorded at the database level.
• The rate of publication of versions is much slower than the rate of updates.
• Having such a citation obliges someone to keep past versions.
• It is possible to cite a range of versions: {..Version=2-8..}

61 TEMPORAL ISSUES (2/2)
Now, what is a citation without a version number? It refers to the latest version of the database.
So we need two words:
• One for a fixed citation.
• One for a “current link”, the place at which you may find the latest information.
A good job of distinguishing between “this” version, the “latest” version and previous versions of documents was presented.

62 PRESENTATION, CONTENT AND PRESERVATION
• The structure of the cited “thing” is not necessarily the same as the structure of the underlying database.
• The underlying database contains information (working notes etc.) that is not intended as part of the published material.
• We should not be making direct citations to the internal structure of the database.
• The hierarchy that the user sees should be represented as an XML document.

63 AUTOMATICALLY GENERATING CITATIONS (1/3)
• Inserting citation data manually is both time-consuming and error-prone.
• Automatic generation of citations is a good check on the integrity of the document.
• It guarantees that the contents of the document are consistent with the citation.
• It gives guarantees on the descriptive information (e.g. there is at most one Title).

64 AUTOMATICALLY GENERATING CITATIONS (2/3)
A rule that generates location information:
{DB=IUPHAR, Version=$v, Family=$f} ← /Root[]/Version[Number=$’v]/Data[]/Family[FamilyName=$’f]
• The left-hand side is a concrete syntax of citations with variables.
• The right-hand side is a pattern expressed in the syntax of XPath.
• /Root[]: the database or document has a unique root.
• Version[Number=$’v]: each Version must have a Number that uniquely identifies the node and provides a value for $v.
• Data[]: for each Version, there is precisely one Data node.
• Family[FamilyName=$’f]: each Family node has a FamilyName which uniquely identifies the family.
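A rough sketch of applying such a rule with Python's standard `xml.etree.ElementTree` module. The XML layout (Number and FamilyName as child elements), the sample document and the function name are assumptions made for illustration; the rule language itself is not implemented here, only this one rule by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical published form of the database; the element layout is assumed.
DOC = """<Root>
  <Version>
    <Number>11</Number>
    <Data>
      <Family><FamilyName>Calcitonin</FamilyName></Family>
      <Family><FamilyName>Melatonin</FamilyName></Family>
    </Data>
  </Version>
</Root>"""


def location_citations(root):
    """Generate {DB, Version, Family} citations and check the rule's constraints."""
    citations = []
    seen_versions = set()
    for version in root.findall("Version"):
        number = version.findtext("Number")
        if number is None or number in seen_versions:
            raise ValueError("each Version needs a unique Number")
        seen_versions.add(number)
        data_nodes = version.findall("Data")
        if len(data_nodes) != 1:
            raise ValueError("each Version needs exactly one Data node")
        seen_families = set()
        for family in data_nodes[0].findall("Family"):
            name = family.findtext("FamilyName")
            if name is None or name in seen_families:
                raise ValueError("each Family needs a unique FamilyName")
            seen_families.add(name)
            citations.append({"DB": "IUPHAR", "Version": number, "Family": name})
    return citations
```

Generating citations this way doubles as validation: a document that violates a uniqueness constraint is rejected instead of yielding an ambiguous citation.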

65 AUTOMATICALLY GENERATING CITATIONS (3/3)
A rule that generates descriptive information:
{DB=IUPHAR, Version=$v, Family=$f, Receptor=$r, Contributors=$a, Editor=$e, Date=$d, DOI=$i} ← /Root[]/Version[Number=$’v, Editor=$?e, DOI=$.i, Date=$.d]/Data[]/Family[FamilyName=$’f, Contributor-list/Contributor=$+a]/Receptor[ReceptorName=$’r]
• $. : exactly one value.
• $? : at most one value.
• $+ : one or more values expected.
Generates: {DB=IUPHAR, Version=11, Family=Calcitonin, Receptor=CALCR, Contributors={Debbie Hay, David R. Poyner}, Editor=Tony Harmar, Date=Jan, 2006, DOI=10.1234}
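The three cardinality annotations can be pictured as checks applied to the list of values a pattern matched. The sketch below assumes the markers ".", "?" and "+" stand for exactly one, at most one, and one or more respectively; the helper `bind` is hypothetical and only illustrates the checks.

```python
def bind(values, marker):
    """Check the number of matched values against a cardinality marker.

    "." requires exactly one value, "?" allows zero or one, "+" requires
    at least one. Returns the bound value (or list of values for "+").
    """
    values = list(values)
    if marker == ".":
        if len(values) != 1:
            raise ValueError(f"expected exactly one value, got {len(values)}")
        return values[0]
    if marker == "?":
        if len(values) > 1:
            raise ValueError(f"expected at most one value, got {len(values)}")
        return values[0] if values else None
    if marker == "+":
        if not values:
            raise ValueError("expected one or more values, got none")
        return values
    raise ValueError(f"unknown marker {marker!r}")
```

A missing Editor would bind to `None` under "?", whereas a missing Date would be reported as an error under ".".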

66 CONCLUSIONS
We have to do a modest amount of work in structuring the data appropriately in XML, after which citations can be specified and generated by some simple rules.

