Presentation transcript: "The MESUR project: an update from the trenches" (NISO Usage Data Forum, November 2007, Dallas, TX)

Slide 1: The MESUR project: an update from the trenches
Johan Bollen, Digital Library Research & Prototyping Team, Los Alamos National Laboratory - Research Library, jbollen@lanl.gov
Acknowledgements: Herbert Van de Sompel (LANL), Marko A. Rodriguez (LANL), Lyudmila L. Balakireva (LANL), Wenzhong Zhao (LANL), Aric Hagberg (LANL). Research supported by the Andrew W. Mellon Foundation.
NISO Usage Data Forum, November 1-2, 2007, Dallas, TX

Slide 2: Citations and the journal impact factor
Journal Impact Factor: mean 2-year citation rate. For 2003: citations in 2003 to articles published in journal X in 2001 and 2002, divided by the number of articles X published in 2001 and 2002 (see the formula below).
Citation data are the gold standard of scholarly evaluation:
- A citation is taken to represent scholarly influence.
- Extracted from published materials.
- Main bibliometric data source for scholarly evaluation.
- Widely applied and a fair approximation of journal "status", but:
- Used to rank authors, departments, institutions, regions, nations, etc.
- Now common in tenure, promotion and other evaluation procedures!
(Diagram: citations from all 2003 articles to journal X's 2001-2002 articles; journals ranked from low to high influence.)
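Stated as a formula (a restatement of the slide's definition; the worked numbers are hypothetical):

$$\mathrm{IF}_{2003}(X) = \frac{c_{2003\rightarrow 2001}(X) + c_{2003\rightarrow 2002}(X)}{n_{2001}(X) + n_{2002}(X)}$$

where $c_{2003\rightarrow y}(X)$ is the number of citations made in 2003 to articles that X published in year $y$, and $n_y(X)$ is the number of articles X published in year $y$. For example, 450 + 150 such citations divided by 120 + 80 articles gives an Impact Factor of 600 / 200 = 3.0.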

Slide 3: From ranking to assessment
Ranking:
- single criterion of value
- single data source, e.g. citation data
- impact: frequentist, e.g. average citation rates (ISI Impact Factor)
"Assessment 3.0":
- situate the item in a value landscape
- multiple sources of scholarly information, e.g. citation, usage and bibliographic data
- impact: relational indicators: trust, reputation, centrality, etc.
One significant problem: which dimensions to choose?
(Diagram: journals ordered from "best" to "bad" along a single mean-citation-rate axis, contrasted with an open multi-dimensional space.)

Slide 4: Many possible facets of scholarly impact (figure)

Slide 5: The MESUR project
Team:
- Johan Bollen (LANL): Principal investigator
- Herbert Van de Sompel (LANL): Architectural consultant
- Aric Hagberg (LANL): Mathematical and statistical consultant
- Marko Rodriguez (LANL): PhD student (Cognitive Science, UCSC)
- Lyudmila Balakireva (LANL): Database management and development
- Wenzhong Zhao (LANL): Data processing, normalization and ingestion
"The Andrew W. Mellon Foundation has awarded a grant to Los Alamos National Laboratory (LANL) in support of a two-year project that will investigate metrics derived from the network-based usage of scholarly information. The Digital Library Research & Prototyping Team of the LANL Research Library will carry out the project. The project's major objective is enriching the toolkit used for the assessment of the impact of scholarly communication items, and hence of scholars, with metrics that derive from usage data."

Slide 6: Usage data
Citation data can be characterized at several levels of the scholarly communication process:
- Community: authors of journal articles.
- Artifacts: journal articles.
- Data: citation data (+1 year publication delay).
- Metrics: mean citation rate rules supreme.
- Scale: expensive to extract.
Hence, various initiatives have focused on usage data: COUNTER, IRS, SUSHI, CiteBase. But where are the metrics?
For usage data, in contrast:
- Community: all users, including most authors.
- Artifacts: everything that is accessible.
- Data: recorded upon publication.
- Metrics: a range of web and Web 2.0 inspired metrics, e.g. clickstream analysis and data mining.
- Scale: automatically recorded at the point of service.

Slide 7: Divergence and convergence
CSU usage ranking vs. 2003 Impact Factor (PRIF):
1. JAMA-J AM MED ASSOC (usage 78.565, PRIF 21.455)
2. SCIENCE (usage 71.414, PRIF 29.781)
3. NATURE (usage 60.373, PRIF 30.979)
4. J AM ACAD CHILD PSY (usage 40.828, PRIF 3.779)
5. AM J PSYCHIAT (usage 39.708, PRIF 7.157)
LANL usage ranking vs. 2003 Impact Factor (PRIF):
1. PHYS REV LETT (usage 60.196, PRIF 7.035)
2. J CHEM PHYS (usage 37.568, PRIF 2.950)
3. J NUCL MATER (usage 34.618, PRIF 1.179)
4. PHYS REV E (usage 31.132, PRIF 2.202)
5. J APPL PHYS (usage 30.441, PRIF 2.171)
MSR usage ranking vs. 2005 Impact Factor (PRIF):
1. SCIENCE (usage 15.830, PRIF 30.927)
2. NATURE (usage 15.167, PRIF 29.273)
3. PNAS (usage 12.798, PRIF 10.231)
4. LECT NOTES COMP SCI (usage 10.131, PRIF 0.402)
5. J BIOL CHEM (usage 8.409, PRIF 5.854)
Convergence! Open research questions:
- Is this guaranteed?
- Convergence to what? A common baseline?
What we do know:
- An institutional perspective can be contrasted to the baseline.
- As the aggregation increases in size, so does its value.
- Cross-validation is key.

Slide 8: MESUR: Metrics from Scholarly Usage of Resources (pronounced "measure")
Andrew W. Mellon Foundation funded study of usage-based metrics (2006-2008), executed by the Digital Library Research and Prototyping Team, Los Alamos National Laboratory Research Library.
Objectives:
1. Create a model of the scholarly communication process.
2. Create a large-scale reference data set (semantic network) that relates all relevant bibliographic, citation and usage data according to (1).
3. Characterize the reference data set.
4. Survey usage-based metrics on the basis of the reference data set.

Slide 9: Project data flow and work plan (diagram, steps 1-4)

Slide 10: Project timeline ("We are here!" marker on the timeline)

Slide 11: Presentation structure: an update on the MESUR project
1) Usage data characterization
2) Analysis
   1) Usage graphs
   2) Metrics analysis
3) Results
4) Discussion

Slide 12: Presentation structure (section divider; outline as on slide 11)

Slide 13: Usage data providers
Publisher-recorded usage data:
- COUNTER statistics
- Item-level data
Aggregator-recorded usage data:
- COUNTER statistics
- Item-level data
Institutionally recorded usage data:
- SFX logs
- Custom log formats (XML)

Slide 14: Usage data collected
Data: > 1B usage events and 1B citations
- At this point, 247,083,481 usage events loaded
- Another +1,000,000,000 on the way
Documents: > 50M
Journals: 326,000
Community: > 100M users and authors combined
Data providers: 8 publishers, 3 aggregators, 4 institutions

Slide 15: Data acquired: timelines
Span:
- Majority: ~1 year of data
- Some minor 2002-2003 data
Sharing models:
- Historical data for the period [t-x, t]
- Periodic updates
Main issues:
- Restoration of archives
- Digital preservation issues: keeping all data fields intact; integration of various sources of usage data

Slide 16: Presentation structure (section divider; outline as on slide 11)

Slide 17: MESUR's usage data representation framework
Assumptions:
1. Sessions identify sequences of events (same user - same document).
2. Documents are tied to aggregate request objects.
3. Request objects consist of a series of service requests and the date and time at which each request took place.
Implications:
1. Sequence is preserved.
2. Most usage data and statistics can be reconstructed from the framework.
3. Lends itself to XML and RDF formats.
4. Permits request-type filtering.
Examples:
1. COUNTER statistics = aggregate request counts for each document (journal) grouped by date/time (month).
2. Usage graph: overlay sessions that contain the same document pairs.
Out of the 13 MESUR providers so far, only 3 natively follow this model. The usage data of another 8 contains the information necessary for conversion. (A data-structure sketch follows below.)
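To make the framework concrete, here is a minimal sketch of how sessions, request objects and service requests could be represented. The class and field names are hypothetical; they do not reflect MESUR's actual ontology or schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

# Hypothetical names for illustration only; MESUR's actual RDF/OWL model differs.
@dataclass
class ServiceRequest:
    request_type: str      # e.g. "full-text", "abstract", "citation export"
    timestamp: datetime    # date and time at which the request took place

@dataclass
class RequestObject:
    document_id: str       # the document the requests pertain to
    requests: List[ServiceRequest] = field(default_factory=list)

@dataclass
class Session:
    session_id: str        # same user, bounded period of activity
    request_objects: List[RequestObject] = field(default_factory=list)

    def document_sequence(self) -> List[str]:
        """Documents in the order they were requested (sequence preserved)."""
        ordered = sorted(
            (req.timestamp, ro.document_id)
            for ro in self.request_objects
            for req in ro.requests
        )
        return [doc for _, doc in ordered]
```

From such sessions, COUNTER-style statistics can be recovered by aggregating request counts per document and month, while the preserved within-session sequence supports the usage-graph construction described later.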

Slide 18: Implications for structural analysis of usage data
Sequence preservation allows:
- Reconstruction of user behavior
- Usage graphs!
Statistics do not allow this type of analysis, BUT they are useful for:
- validating results
- rankings

Slide 19: How to generate a usage graph
Documents are associated by co-occurrence in the same session:
- Same session, same user: common interest.
- Frequency of co-occurrence in the same session: degree of relationship.
Usage data: works for journals and articles, or anything for which usage was recorded.
Options:
- Strict pair-wise sequence
- All pairs within a session
Note: not something we invented; this is association rule learning from data mining. Beer and diapers! (A sketch of the graph construction follows below.)
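A minimal sketch of the co-occurrence construction under both options. The function name and toy sessions are illustrative, not MESUR's actual pipeline; it assumes Python 3.10+ for itertools.pairwise.

```python
from collections import Counter
from itertools import combinations, pairwise  # pairwise: Python 3.10+

def usage_graph(sessions, strict_pairwise=True):
    """Build a weighted usage graph from per-session document sequences.

    sessions: iterable of document-ID sequences, one per session.
    strict_pairwise=True links only consecutive documents within a session;
    False links every pair of documents that co-occur in the same session.
    Returns a Counter mapping (doc_a, doc_b) edges to co-occurrence counts.
    """
    edges = Counter()
    for seq in sessions:
        pairs = pairwise(seq) if strict_pairwise else combinations(set(seq), 2)
        for a, b in pairs:
            if a != b:
                edges[(a, b)] += 1   # raw frequency weight
    return edges

# Toy example: two sessions over journals
sessions = [["PHYS REV LETT", "PHYS REV E", "J APPL PHYS"],
            ["PHYS REV LETT", "PHYS REV E"]]
print(usage_graph(sessions))
# Counter({('PHYS REV LETT', 'PHYS REV E'): 2, ('PHYS REV E', 'J APPL PHYS'): 1})
```

The same routine works at the journal or article level, since it only needs the identifiers recorded in each session.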

Slide 20: Usage graphs
MESUR graph created from:
- 200M usage events
- Usage restricted to 2006
- Journals clipped to the 7,600 journals of the 2004 JCR
- Pair-wise sequences: within a session, only consecutive pairs
- Raw frequency weights
(Diagram: journal1 and journal2 connected by a weighted edge.)

Slide 21: Tracing the flow of information (Marko Rodriguez)

Slide 22: Presentation structure (section divider; outline as on slide 11)

Slide 23: Metric types
Note: metrics can be calculated on both citation and usage data.
Structural metrics require graphs:
- Citation graph, e.g. the 2004 JCR
- Usage graph, e.g. the one created by MESUR
Reference: Wasserman (1994), Social Network Analysis.

Slide 24: Frequentist metrics
Raw counts:
- Citations: count the number of citations to a document or journal.
- Usage: count the number of times a document or journal was accessed.
Normalized:
- Journal Impact Factor: number of citations to a journal, divided by the number of articles published in the journal.
- Usage Impact Factor: number of requests for a journal's articles, divided by the number of articles published in the journal (a computational sketch follows below).
Reference: Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. Journal of the American Society for Information Science and Technology, 59(1), 2008.
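A minimal sketch of the Usage Impact Factor calculation, by analogy with the citation-based formula. The input names and example counts are hypothetical, not MESUR's production code.

```python
def usage_impact_factor(requests_by_article, articles_published):
    """Usage Impact Factor: total requests for a journal's articles in a
    chosen window, divided by the number of articles the journal published
    in that window (analogous to the Journal Impact Factor).

    requests_by_article: dict mapping article IDs to request counts
    articles_published: number of articles published in the window
    """
    total_requests = sum(requests_by_article.values())
    return total_requests / articles_published if articles_published else 0.0

# Hypothetical example: 1,200 requests to 400 articles -> UIF of 3.0
print(usage_impact_factor({"a%d" % i: 3 for i in range(400)}, 400))
```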

Slide 25: Structural metrics calculated from the usage graph
Classes of metrics:
- Degree: in-degree, out-degree
- Shortest path: closeness, betweenness, Newman
- Random walk: PageRank, eigenvector
- Distribution: in-degree entropy, out-degree entropy, bucket entropy
Each can be defined to take edge weights into account, i.e. there is an unweighted and a weighted variant. (A small computational sketch follows below.)
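For illustration, a sketch that computes one metric from each class on a toy weighted, directed usage graph with networkx. The journal names and edge weights are invented, and the entropy helper is an assumed definition rather than MESUR's exact one.

```python
import math
import networkx as nx

# Toy weighted, directed usage graph; weights are invented.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("SCIENCE", "NATURE", 12),
    ("NATURE", "SCIENCE", 9),
    ("PNAS", "SCIENCE", 7),
    ("PNAS", "NATURE", 5),
    ("J BIOL CHEM", "PNAS", 4),
])

in_degree   = dict(G.in_degree())                            # degree class
closeness   = nx.closeness_centrality(G)                     # shortest-path class
betweenness = nx.betweenness_centrality(G, weight="weight")  # shortest-path class
pagerank    = nx.pagerank(G, weight="weight")                # random-walk class

def in_degree_entropy(G, node):
    """Entropy of the incoming edge-weight distribution (distribution class)."""
    weights = [d["weight"] for _, _, d in G.in_edges(node, data=True)]
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights) if total else 0.0

for j in G:
    print(j,
          "in-degree:", in_degree[j],
          "closeness:", round(closeness[j], 3),
          "betweenness:", round(betweenness[j], 3),
          "pagerank:", round(pagerank[j], 3),
          "in-degree entropy:", round(in_degree_entropy(G, j), 3))
```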

Slide 26: Social network metrics: different aspects of impact I (diagram contrasting degree metrics, e.g. in-degree/IF and degree centrality, with shortest-path metrics, e.g. closeness and betweenness centrality)

Slide 27: Social network metrics: different aspects of impact II
Random-walk metrics, e.g. PageRank. Basic idea:
- Random walkers follow edges, plus a probability of random teleportation.
- Visitation numbers converge to the PageRank: the stationary probability distribution.
(Illustration from wikipedia.org; a power-iteration sketch follows below.)
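A minimal power-iteration sketch of the random-walker-with-teleportation idea. The adjacency structure and parameter values are illustrative, not the implementation used by MESUR.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """adjacency: dict mapping each node to the list of nodes it links to."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}    # random teleportation
        for v, targets in adjacency.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share                      # walker follows an edge
            else:                                             # dangling node: spread evenly
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank  # approximates the stationary probability distribution

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```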

Slide 28: Presentation structure (section divider; outline as on slide 11)

Slide 29: Set of metrics calculated on the MESUR data set
Citation-based metrics (JCR 2004): CITE-BE, CITE-ID, CITE-IE, CITE-IF, CITE-OD, CITE-OE, CITE-PG, CITE-UBW, CITE-UBW-UN, CITE-UCL, CITE-UCL-UN, CITE-UNM, CITE-UNM-UN, CITE-UPG, CITE-UPR, CITE-WBW, CITE-WBW-UN, CITE-WCL, CITE-WCL-UN, CITE-WID, CITE-WNM, CITE-WNM-UN, CITE-WOD, CITE-WPR
Usage-based metrics (MESUR 2006): USES-BE, USES-ID, USES-IE, USES-OD, USES-OE, USES-PG, USES-UBW, USES-UBW-UN, USES-UCL, USES-UCL-UN, USES-UNM, USES-UNM-UN, USES-UPG, USES-UPR, USES-WBW, USES-WBW-UN, USES-WCL, USES-WCL-UN, USES-WID, USES-WNM, USES-WNM-UN, USES-WOD, USES-WPR

Slide 30: Overlaps and discrepancies: Usage Impact Factor (chart)

Slide 31: Overlaps and discrepancies
Citation PageRank vs. citation Impact Factor, for Computer Science only.
- Map the metrics themselves.
- We can examine the structure of metric correlations and arrive at a structural model of metric relationships (see the sketch below).
- Metric semantics: what do they tell us about scholarly impact?
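One way such metric correlations could be examined is by computing pairwise rank correlations between metrics over the same set of journals and inspecting the resulting correlation structure. This is a sketch with invented values for a handful of journals, not MESUR's actual analysis.

```python
from scipy.stats import spearmanr

# Invented metric values for the same five journals, keyed by metric code.
metrics = {
    "CITE-IF": [30.9, 29.3, 10.2, 0.4, 5.9],
    "CITE-PG": [0.012, 0.011, 0.010, 0.004, 0.006],
    "USES-ID": [4100, 4000, 3500, 2400, 2500],
}

names = list(metrics)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        rho, _ = spearmanr(metrics[a], metrics[b])  # rank correlation between two metrics
        print(f"{a} vs {b}: rho = {rho:.2f}")
```

Clustering or mapping the resulting correlation matrix is what yields a "map" of how the metrics relate to one another.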

Slide 32: Citation rankings
2004 Impact Factor: 1. CANCER (49.794), 2. ANNU REV IMMUNOL (47.400), 3. NEW ENGL J MED (44.016), 4. ANNU REV BIOCHEM (33.456), 5. NAT REV CANCER (31.694)
Citation PageRank: 1. SCIENCE (0.0116), 2. J BIOL CHEM (0.0111), 3. NATURE (0.0108), 4. PNAS (0.0101), 5. PHYS REV LETT (0.006)
Betweenness: 1. PNAS (0.076), 2. SCIENCE (0.072), 3. NATURE (0.059), 4. LECT NOTES COMPUT SC (0.039), 5. LANCET (0.017)
Closeness: 1. PNAS (7.02e-05), 2. LECT NOTES COMPUT SC (6.72e-05), 3. NATURE (6.43e-05), 4. SCIENCE (6.37e-05), 5. J BIOL CHEM (6.37e-05)
In-degree: 1. SCIENCE (3448), 2. NATURE (3182), 3. PNAS (2913), 4. LANCET (2190), 5. NEW ENGL J MED (2160)
In-degree entropy: 1. LANCET (9.849), 2. SCIENCE (9.748), 3. NEW ENGL J MED (9.701), 4. NATURE (9.611), 5. JAMA (9.526)

Slide 33: Usage rankings
2004 Impact Factor: 1. CANCER (49.794), 2. ANNU REV IMMUNOL (47.400), 3. NEW ENGL J MED (44.016), 4. ANNU REV BIOCHEM (33.456), 5. NAT REV CANCER (31.694)
In-degree: 1. SCIENCE (4195), 2. NATURE (4019), 3. PNAS (3562), 4. J BIOL CHEM (2438), 5. LECT NOTES COMPUT SC (2432)
In-degree entropy: 1. MED HYPOTHESES (9.364), 2. PNAS (9.152), 3. LIFE SCI (9.027), 4. LANCET (8.939), 5. INT J BIOCHEM CELL B (8.858)
Betweenness: 1. SCIENCE (0.035), 2. NATURE (0.032), 3. PNAS (0.020), 4. LECT NOTES COMPUT SC (0.017), 5. LANCET (0.006)
Closeness: 1. SCIENCE (0.670), 2. NATURE (0.665), 3. PNAS (0.644), 4. LECT NOTES COMPUT SC (0.591), 5. BIOCHEM BIOPH RES CO (0.587)
PageRank (usage graph): 1. SCIENCE (0.0016), 2. NATURE (0.0015), 3. PNAS (0.0013), 4. LECT NOTES COMPUT SC (0.0010), 5. J BIOL CHEM (0.0008)

Slide 34: Metrics relationships (figure)

Slide 35: Metrics relationships
Citation and usage metrics reveal entirely different patterns:
- The citation map is split into 2 sections: degree metrics (right); shortest-path and random-walk metrics (left).
- The usage map is split into 4 clusters: degree metrics; PageRank and entropy; closeness; betweenness.
The usage pattern may be caused by:
- Noise in the usage graph
- A higher density of usage edges per node

Slide 36: Hierarchical analysis (figure)

Slide 37: Presentation structure (section divider; outline as on slide 11)

Slide 38: MESUR: an update
Usage data:
- Creation of the single largest reference data set of usage, citation and bibliographic data.
- +1,000,000,000 usage events to be loaded in the next month.
- Usage data obtained from multiple publishers, aggregators and institutions.
- Infrastructure for a continued research program in this domain.
- Results will guide scholarly evaluation and may help produce standards for usage data representation.
Usage graphs:
- An adequate data model for item-level usage data naturally leads to them.
- Reduced distortion compared to raw usage: structure counts, not raw hits.
- Several options for how to create them: MESUR is investigating these options.
Metrics:
- Frequentist and structural metrics.
- Each can represent different facets of scholarly impact.
- Simple metrics can produce adequate results, e.g. in-degree vs. PageRank. How far should we go?
- Note the increasing convergence of usage metrics to citation metrics as the sample increases.

Slide 39: Some relevant publications
- Marko A. Rodriguez, Johan Bollen and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage. In Proceedings of the Joint Conference on Digital Libraries, Vancouver, June 2007.
- Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. (cs.DL/0610154)
- Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006.
- Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006.
- Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006. (arxiv.org:cs.DL/0601030)
- Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6):1419-1440, 2005.

