
1 Chris Rosin, Parity Computing, Inc.
Accuracy of Author Disambiguation
August 30, 2006
Indiana University, Bloomington, IN
Workshop on Scholarly Databases

2 Introduction
Chris Rosin, President, Parity Computing, Inc.
– Headquarters in San Diego, CA
– Providing intelligent automation for large collections of text
– Commercial software and services
– Clients are primarily scholarly publishers

3 Data
Proprietary data held by a client
Core is bibliographic metadata
– Header metadata, abstract & indexing, references...
– Many millions of records
Additional public and private data sources
– Patent metadata, institution databases...
Full-text repositories

4 Integration Challenges
Data quality and consistency
– Varying data capture policies, partially structured data
– Incorrect/omitted fields
Scalability to very large databases
Consistently identify entities across records, databases, representations
– Duplicate record detection and consensus record creation
– Author merging and disambiguation
– Affiliation parsing and institution standardization
– Reference extraction and linking
New value derived from integrated data
– e.g., extended author profiles

5 Author Disambiguation Accuracy
Partition author occurrences into distinct profiles
– e.g., 500 Medline occurrences of “C. Sato” as article author
– Partition into profiles of sizes 139, 47, 14, 13, 12,...
Automation can use all available context in database
– Coauthors, affiliations, titles, abstract, indexing, references...
Evaluate the automation
– Compare with a trained human using the same data
– Bias: only merge with sufficient evidence to be confident
Different from fully researching identities (e.g., AMS)
Starting from automated profiles, use as baseline for feedback via authors/editors
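The slide does not describe the actual disambiguation algorithm. As a rough illustration of the "only merge with sufficient evidence" bias, the following is a minimal sketch that assumes coauthor overlap is the only evidence used and that two occurrences are merged when they share at least `min_shared` coauthors. The function name, the threshold, and the sample data are all hypothetical, and the pairwise loop would not scale to millions of records.

```python
from itertools import combinations


def partition_occurrences(coauthors, min_shared=2):
    """Group author occurrences into profiles.

    coauthors: dict mapping occurrence id -> set of coauthor names.
    Two occurrences are linked only if they share at least `min_shared`
    coauthors (an assumed, deliberately conservative rule).
    Returns a list of sets of occurrence ids, largest profile first.
    """
    parent = {occ: occ for occ in coauthors}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Quadratic pairwise comparison: fine for a sketch, not for a real database.
    for a, b in combinations(coauthors, 2):
        if len(coauthors[a] & coauthors[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for occ in coauthors:
        groups.setdefault(find(occ), set()).add(occ)
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical occurrences of one ambiguous name, keyed by made-up ids.
sample = {
    "o1": {"A", "B"},
    "o2": {"A", "B", "C"},
    "o3": {"C", "D"},
    "o4": {"E"},
}
profiles = partition_occurrences(sample)
print([len(p) for p in profiles])
```

Raising `min_shared` trades recall for precision: fewer merges happen, so profiles are purer but more fragmented.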

6 Author Disambiguation Precision & Recall
Compare automated profiles to actual authors
Define precision and recall appropriately for this context
Precision: for a profile, the (largest) fraction of occurrences in a profile that belong to the same actual author
– What fraction of the profile is correct?
Recall: for an actual author, the fraction of that author’s occurrences contained in the author’s largest profile
– How complete is the author’s largest profile?
Average precision and recall weighted by #occurrences, so each occurrence counts equally towards the average.
On some proprietary databases, we find that automation can achieve ~99% precision and ~95% recall.
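The definitions above can be sketched in a few lines of Python. This is one reading of the slide, not the presenter's code: it assumes each occurrence has exactly one automated profile label and one known actual-author label. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def weighted_precision_recall(profile_of, author_of):
    """Occurrence-weighted precision and recall.

    profile_of: dict mapping occurrence id -> automated profile label.
    author_of:  dict mapping occurrence id -> actual author label.
    """
    by_profile = defaultdict(Counter)  # profile -> count per actual author
    by_author = defaultdict(Counter)   # author  -> count per profile
    for occ, profile in profile_of.items():
        author = author_of[occ]
        by_profile[profile][author] += 1
        by_author[author][profile] += 1

    total = len(profile_of)
    # A profile's precision is (majority count / profile size). Weighting by
    # profile size cancels the denominator, leaving the sum of majority counts.
    precision = sum(max(c.values()) for c in by_profile.values()) / total
    # Likewise, an author's recall is (largest-profile count / author's count),
    # weighted by the author's number of occurrences.
    recall = sum(max(c.values()) for c in by_author.values()) / total
    return precision, recall


# Toy data: five occurrences, two automated profiles, two actual authors.
profile_of = {1: "P1", 2: "P1", 3: "P1", 4: "P2", 5: "P2"}
author_of = {1: "A", 2: "A", 3: "B", 4: "A", 5: "B"}
precision, recall = weighted_precision_recall(profile_of, author_of)
print(precision, recall)  # 0.6 0.6
```

Because both averages are weighted by occurrence count, a single large, wrongly merged profile lowers precision far more than many small singleton errors do.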

7 Data Example (Medline)
Profile 1
Sato C, Nishizawa K, Kojima K. Calcium-dependent process in reduction of cell surface charge after x-irradiation. Int J Radiat Biol Relat Stud Phys Chem Med, 1979. (PMID 3071047)
Sato C, Kuriyama R, Nishizawa K. Microtubule-organizing centers abnormal in number, structure, and nucleating activity in x-irradiated mammalian cells. J Cell Biol, 1983. (PMID 4076888)
Profile 2
Sato C, Nishizawa K, Nakayama T, Ohtsuka K, Nakamura H, Kobayashi T, Inagaki M (Laboratory of Experimental Radiology, Aichi Cancer Center Research Institute, Nagoya, Japan). Rapid phosphorylation of MAP-2-related cytoplasmic and nuclear Mr 300,000 protein by serine kinases after growth stimulation in quiescent cells. Exp Cell Res, 1988. (PMID 5627622)
Precision 100%, Recall 66.7%
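The stated figures can be reproduced by hand, assuming all three occurrences belong to the same actual author. The slide does not say this outright, but the 66.7% recall implies it. Each profile is then 100% pure, and the author's largest profile (Profile 1) holds 2 of the 3 occurrences:

```latex
\begin{align*}
\text{Precision} &= \frac{2 \cdot \tfrac{2}{2} + 1 \cdot \tfrac{1}{1}}{3} = 1 = 100\% \\
\text{Recall} &= \frac{\text{occurrences in the author's largest profile}}{\text{author's total occurrences}} = \frac{2}{3} \approx 66.7\%
\end{align*}
```

The example shows the conservative bias at work: the automation split one author into two profiles rather than risk a wrong merge, which costs recall but not precision.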

8 Chris Rosin
crosin@paritycomputing.com
Parity Computing, Inc.
San Diego, CA
http://www.paritycomputing.com

