Also By The Same Author: AKTiveAuthor, A Citation Graph Approach To Name Disambiguation
AKT DTA Colloquium, January 23, 2006
Duncan McRae-Spencer
Name ambiguity is a problem for automated information extraction. Two problems:
1. Same name, different object: David L. Harris (Harvey Mudd College, formerly Stanford and MIT) and David L. Harris (Sandia Labs, Albuquerque).
2. Different name, same object: Professor Nick Jennings, Nicholas Jennings, N. R. Jennings.
Existing Solutions:
– By-hand disambiguation (e.g. DBLP). Problem: slow, labour-intensive.
– Text and context processing: Li et al. (2005). Problem: deals with names within text, not document authors.
– Metadata machine-learning techniques: Han et al. (2004, 2005). Problem: requires a known canonical set, and 50% of the data is used in training.
AKTiveAuthor: Linking together paper authors using metadata analysis. Based on the following observation:
– People cite their own work. When they cite an author whose name is similar to their own, it is the same person 95-98% of the time.
Step one: Initial clustering on last name.
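A minimal sketch of the initial clustering step, in Python. The function names, the input shape (paper id mapped to a list of author strings) and the assumption that names are written "Given Last" are all illustrative; the slides do not say how AKTiveAuthor parses names.

```python
from collections import defaultdict


def last_name(author: str) -> str:
    """Return the lower-cased final token of a 'Given Last' author string.

    Assumption: the surname is the last whitespace-separated token.
    """
    return author.split()[-1].lower()


def cluster_by_last_name(papers):
    """Group (paper_id, author) pairs into name-clusters keyed by surname.

    `papers` maps a paper id to its list of author strings (hypothetical shape).
    """
    clusters = defaultdict(list)
    for paper_id, authors in papers.items():
        for author in authors:
            clusters[last_name(author)].append((paper_id, author))
    return dict(clusters)


papers = {
    "p1": ["Nicholas Jennings"],
    "p2": ["N. R. Jennings", "David L. Harris"],
    "p3": ["David L. Harris"],
}
clusters = cluster_by_last_name(papers)
```

Each name-cluster deliberately over-groups: both David L. Harris entries land in the same cluster whether or not they are the same person, and the later stages are what split or confirm them.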
Self-citation analysis:
– Within a name-cluster, test papers against each other.
– Does paper A appear in the bibliography of paper B, or vice versa?
– Iteratively use this approach to build groups of papers, each representing one real-world author.
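The grouping described in the self-citation stage can be sketched as follows. This is an illustration, not the AKTiveAuthor implementation: it swaps the iterative pairwise testing for a union-find structure, which reaches the same transitive grouping in one pass, and it omits any check on whether the join is plausible. The function name and the `cites` mapping (paper id to the ids in its bibliography) are assumptions.

```python
def group_by_self_citation(paper_ids, cites):
    """Join papers in one name-cluster whenever one cites the other.

    `paper_ids` lists the papers in a single name-cluster; `cites` maps a
    paper id to the ids found in its bibliography (hypothetical shape).
    """
    parent = {p: p for p in paper_ids}

    def find(p):
        # Walk up to the group representative, halving the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a in paper_ids:
        for b in cites.get(a, ()):
            if b in parent:  # only citations that stay inside the cluster count
                parent[find(a)] = find(b)

    groups = {}
    for p in paper_ids:
        groups.setdefault(find(p), set()).add(p)
    return sorted(groups.values(), key=lambda g: sorted(g))


cluster = ["a", "b", "c", "d", "e"]
cites = {"b": ["a"], "c": ["b", "x"], "e": ["d"]}
groups = group_by_self_citation(cluster, cites)
```

Here paper "c" never cites "a" directly, but both end up in one group through "b"; the citation to "x" is ignored because "x" is outside the name-cluster.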
Co-authorship Analysis:
– Standard approach in disambiguation (Han et al.) and social network analysis (AKT Ontocopi).
– Use co-authorship relationships to further match the groups created in the self-citation stage.
Source URL Analysis:
– Extra linking provided using the source URL metadata field.
– Links papers by the same author on different subjects across one time period.
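One way to picture both of these stages is as a merge of existing groups that share some metadata value: a co-author name in the first case, a source URL in the second. The sketch below assumes exact-match sharing, which the slides do not specify; `merge_groups_sharing` and the sample names are hypothetical.

```python
def merge_groups_sharing(groups, features):
    """Merge groups of papers that share at least one feature value.

    `groups` is a list of sets of paper ids (e.g. from the self-citation
    stage). `features` maps a paper id to a set of values, such as the
    paper's co-authors (excluding the name being disambiguated) or its
    source URL. Exact equality of values is an assumption of this sketch.
    """
    merged = []  # list of (papers, feature_values) pairs with disjoint values
    for group in groups:
        papers = set(group)
        values = set().union(*(features.get(p, set()) for p in group))
        keep = []
        for other_papers, other_values in merged:
            if values & other_values:
                # Shared co-author or URL: absorb the earlier group.
                papers |= other_papers
                values |= other_values
            else:
                keep.append((other_papers, other_values))
        keep.append((papers, values))
        merged = keep
    return [papers for papers, _ in merged]


groups = [{"p1", "p2"}, {"p3"}, {"p4"}]
coauthors = {
    "p1": {"A. Smith"},
    "p2": set(),
    "p3": {"A. Smith", "B. Jones"},
    "p4": {"C. Brown"},
}
by_coauthor = merge_groups_sharing(groups, coauthors)
```

Running the same function a second time with a paper-to-URL mapping would model the source URL stage, again under the exact-match assumption.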
Sanity Check:
– Before committing to a join at any of the three stages, check whether it is obviously not the same person.
– E.g. Norman L. Johnson and David E. Johnson (self-citation match): rejected.
– E.g. Earl and Erik Johnson (co-authorship match): rejected.
– E.g. Nicholas Jennings and N. Jennings: allowed.
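A rule that reproduces the three examples on this slide is sketched below. It is an assumption rather than the documented check: given names are compared position by position, an initial matches any name starting with that letter, and two full given names must be equal. The function names are hypothetical.

```python
def given_names(author: str):
    """Lower-cased given-name tokens of a 'Given Last' string, dots removed."""
    tokens = [t.strip(".").lower() for t in author.split()[:-1]]
    return [t for t in tokens if t]


def plausibly_same_person(name_a: str, name_b: str) -> bool:
    """Return False only when the given names clearly conflict."""
    for a, b in zip(given_names(name_a), given_names(name_b)):
        if len(a) == 1 or len(b) == 1:
            # At least one side is an initial: compare first letters only.
            if a[0] != b[0]:
                return False
        elif a != b:
            # Two full given names must agree exactly.
            return False
    return True


checks = [
    plausibly_same_person("Norman L. Johnson", "David E. Johnson"),
    plausibly_same_person("Earl Johnson", "Erik Johnson"),
    plausibly_same_person("Nicholas Jennings", "N. Jennings"),
]
```

Under this rule a nickname such as "Nick Jennings" would be rejected against "Nicholas Jennings", so a real check would need extra handling for name variants.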
Metrics:
– Essentially an information retrieval exercise. Three measures, each per individual paper:
– Precision: (number of relevant docs retrieved) / (number of docs retrieved).
– Recall: (number of relevant docs retrieved) / (number of relevant docs overall).
– F-measure: harmonic mean of Precision and Recall, used as a generic measure of IR success.
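The three measures can be computed per paper as in the sketch below. The framing of "retrieved" as the papers the algorithm grouped with a given paper and "relevant" as the papers in its hand-disambiguated group is an interpretation of the slide; the function name and sample ids are illustrative.

```python
def precision_recall_f(retrieved: set, relevant: set):
    """Precision, recall and F-measure for one paper.

    `retrieved`: papers the algorithm placed in the same group as the paper.
    `relevant`: papers that truly share its author (by-hand ground truth).
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# The algorithm found three of the author's four papers and nothing wrong.
p, r, f = precision_recall_f({"p1", "p2", "p3"}, {"p1", "p2", "p3", "p4"})
```

In this example precision is 1.0 and recall is 0.75, giving an F-measure of about 0.857: missing a paper lowers recall without touching precision.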
Results:
– Tested eight name-clusters, checking against by-hand disambiguated results.
– Precision ranged from 0.991 to 1.000 (mean 0.997).
– Recall ranged from 0.705 to 0.935 (mean 0.818).
– F-measure ranged from 0.824 to 0.965 (mean 0.899).
Analysis / Conclusions:
– Precision is higher than recall, mainly due to the sanity check.
– All three methods (self-citation, co-authorship and source URL analysis) are needed for best results.
– Heavily-dominated name-clusters give the best results (e.g. Giles, where C. Lee Giles accounts for 81.6% of the cluster).
– Large and small name-clusters perform equally well.
Future Work:
– Original purpose: citation graph services, e.g. view my papers, count my citations, calculate my impact.
– Improving the disambiguation algorithm: institutional affiliation data, tightening up co-authorship, better initial clustering.