Presentation is loading. Please wait.

Presentation is loading. Please wait.

Research Library, Los Alamos National November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Aggregation and Analysis.

Similar presentations


Presentation on theme: "Research Library, Los Alamos National November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Aggregation and Analysis."— Presentation transcript:

1 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen Digital Library Research & Prototyping Team Los Alamos National Laboratory - Research Library jbollen@lanl.gov Acknowledgements: Herbert Van de Sompel (LANL), Marko A. Rodriguez (LANL), Lyudmila L. Balakireva (LANL) Wenzhong Zhao (LANL), Aric Hagberg (LANL) Research supported by the Andrew W. Mellon Foundation. NISO Usage Data Forum November 1-2, 2007 Dallas, TX

2 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Batgirl: “Who’s going to be batman now?”

3 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Usage data has arrived. Value of usage data/statistics is undeniable: Business intelligence Scholarly assessment Monitoring of scholarly trends Enhanced end-user services Production chain: Recording Aggregation Analysis Challenges and opportunities at all links in chain. But of interest to this workshop in particular: Recording: data models, standardization Aggregation: restricted sampling vs. representativeness Analysis: frequentist vs. structural This presentation: overview of lessons learned by MESUR project at LANL. Technical project details later this afternoon.

4 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Structure of presentation. 1)MESUR: introduction and overview 2)Aggregation: value proposition 3)Data models: 1) Usage data representation 2) Aggregation frameworks 4)Usage data providers: technical and socio-cultural issues, a survey 5)Sampling strategies: theoretical issues 6)Discussion: way forward, roadmap

5 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Structure of presentation. 1)MESUR: introduction and overview 2)Aggregation: value proposition 3)Data models: 1) Usage data representation 2) Aggregation frameworks 4)Usage data providers: technical and socio-cultural issues, a survey 5)Sampling strategies: theoretical issues 6)Discussion: way forward, roadmap

6 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team The MESUR project. Johan Bollen (LANL): Principal investigator. Herbert Van de Sompel (LANL): Architectural consultant. Aric Hagberg (LANL): Mathematical and statistical consultant. Marko Rodriguez (LANL): PhD student (Computer Science, UCSC). Lyudmila Balakireva (LANL): Database management and development. Wenzhong Zhao (LANL): Data processing, normalization and ingestion. “The Andrew W. Mellon Foundation has awarded a grant to Los Alamos National Laboratory (LANL) in support of a two-year project that will investigate metrics derived from the network-based usage of scholarly information. The Digital Library Research & Prototyping Team of the LANL Research Library will carry out the project. The project's major objective is enriching the toolkit used for the assessment of the impact of scholarly communication items, and hence of scholars, with metrics that derive from usage data.”

7 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Project data flow and work plan. 1 23 4a 4b negotiation aggregationingestion reference data set metrics survey

8 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Project timeline. We are here!

9 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Lecture focus: MESUR lessons learned

10 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Structure of presentation. 1)MESUR: introduction and overview 2)Aggregation: value proposition 3)Data models: 1) Usage data representation 2) Aggregation frameworks 4)Usage data providers: technical and socio-cultural issues, a survey 5)Sampling strategies: theoretical issues 6)Discussion: way forward, roadmap

11 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Lesson 1: aggregation adds significant value. Focus on institution-oriented services and analysis Multiple institutions, each with focus on institution- oriented services and analysis Aggregating usage data provides: 1.Validation: reliability and span 2.Perspective: group-based services and analysis Note: MESUR=not services

12 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Aggregation value: validation Aggregating usage data provides: 1.Reliability: overlaps and intersections Validation: error checking Reliability: repeated measurements 2.Span: Diversity: multiple communities Representativeness: sample of scholarly community

13 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Aggregation value: convergence. CSU Usage PRIF (2003)Title (abbv.) 178.565 21.455JAMA-J AM MED ASSOC 271.41429.781SCIENCE 360.37330.979 NATURE 440.8283.779J AM ACAD CHILD PSY 539.7087.157AM J PSYCHIAT LANL Usage PRIF (2003)Title (abbv.) 160.1967.035PHYS REV LETT 237.5682.950J CHEM PHYS 334.6181.179J NUCL MATER 431.1322.202PHYS REV E 530.4412.171J APPL PHYS MSR Usage PRIF (2005)Title (abbv.) 1 15.830 30.927SCIENCE 2 15.16729.273NATURE 3 12.79810.231PNAS 4 10.1310.402LECT NOTES COMP SCI 5 8.4095.854J BIOL CHEM Convergence! Open research questions: o Is this guaranteed? o To what? A common-baseline? What we do know: o Institutional perspective can be contrasted to baseline. o As aggregation increases in size, so does value.

14 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team We have COUNTER/SUSHI. How about the aggregation of item-level usage data? If there is value in aggregating COUNTER and other reports, there is considerable value in aggregating item-level usage data.

15 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Structure of presentation. 1)MESUR: introduction and overview 2)Aggregation: value proposition 3)Data models: 1) Usage data representation 2) Aggregation frameworks 4)Usage data providers: technical and socio-cultural issues, a survey 5)Sampling strategies: theoretical issues 6)Discussion: way forward, roadmap

16 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Lesson 2: We need a standardized representation framework for item-level usage data. MESUR: Ad hoc parsing is highly problematic - field semantics - field relations - various data models Framework objectives: 1.Minimize data loss: 1. Preserve event info 2. Preserve sequence info 3. Preserve document metadata 2.“realistic”: scalability and granularity. 3.Apply to variety of usage data, i.e. no inherent bias towards specific type of usage data

17 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Requirements for usage data representation framework. Implications 1.Sequence: session ID and date/time preserves sequence 2.Privacy: session ID groups events not by user ID 3.Request types: filter on types of usage Needs to minimally represent following concepts: 1.Event ID: distinguish usage events 2.Referent ID: DOI, SICI, metadata 3.User/Session ID: define groups of events related by user 4.Date and time ID: identify data and time of event 5.Request types: identiby type of request issued by user

18 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team COUNTER reports: information loss From: www.niso.org/presentations/MEC06-03-Shepherd.pdf 1.Event ID: distinguish usage events ? 2.Referent ID: DOI, SICI, metadata 3.User/Session ID: define groups of events related by user? 4.Date and time ID: identify data and time of event 5.Request types: identiby type of request issued by user

19 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team From same usage data: journal usage graphs journal1 journal2

20 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team From same data: article usage graphs

21 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Usage graphs connect this domain to 50 years of network science Heer (2005) - Large-Scale Online Social Network Visualization social network analysis small world graphs network science graph theory social modeling Good reads: Barabasi (2003) Linked. Wasserman (1994). Social network analysis.

22 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Structure of presentation. 1)MESUR: introduction and overview 2)Aggregation: value proposition 3)Data models: 1) Usage data representation 2) Aggregation frameworks 4)Usage data providers: technical and socio-cultural issues, a survey 5)Sampling strategies: theoretical issues 6)Discussion: way forward, roadmap

23 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Standardization objectives similar to work done for COUNTER and SUSHI: 1.Serialization (~COUNTER): - standard to serialize usage data - Suitable for large-scale, open aggregation - Event provenance and identification 2.Transfer protocol (~SUSHI): - Communication of usage data between log archive and aggregator - Allow open aggregation across stakeholders in scholarly community 3.Privacy standard: - Standards need to address privacy concerns - Should allow emergence of trusted intermediaries: aggregation ecology Lesson 3: aggregating item-level usage data requires standardized aggregation framework. LANL has made proposals based on community standards

24 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team OpenURL ContextObject to represent usage data Referent * identifier * metadata Requester * User or user proxy: IP, session, … ServiceType <ctx:context-object timestamp=“2005-06-01T10:22:33Z” … identifier=“urn:UUID:58f202ac-22cf-11d1-b12d-002035b29062” …> … info:pmid/12572533 info:ofi/fmt:xml:xsd:journal … Isolation of common receptor for coxsackie B … Science … … urn:ip:63.236.2.100 … … yes … … Resolver… Referrer… …. Resolver: * identifier of linking server Event information: * event datetime * globally unique event ID

25 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Aggregation framework: existing standards Aggregated logs Log DB Serialization: OpenURL ContextObjects Log Repository 1 Log harvester CO Aggregated Usage Data Johan Bollen and Herbert Van de Sompel. An Architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006. Log Repository 2 Log Repository 3 Tranfer protocol: OAI-PMH

26 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team The issue of anonymization Privacy and anonymization concerns play on multiple levels that standard needs to address: 1.Institutions: where was usage data recorded? 2.Providers: who provided usage data? 3.Users: who is the user? -Goes beyond “naming” and masking identity: simple statistics can reveal identity -User identity can be inferred from activity pattern (AOL search data) -Law enforcement issues MESUR: 1.Session ID: preserve sequence without any references to individual users 2.Negotiated filtering of usage data

27 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Structure of presentation. 1)MESUR: introduction and overview 2)Aggregation: value proposition 3)Data models: 1) Usage data representation 2) Aggregation frameworks 4)Usage data providers: technical and socio-cultural issues, a survey 5)Sampling strategies: theoretical issues 6)Discussion: way forward, roadmap

28 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Lesson 4: socio-cultural issues matter. BUT: 12 months +550 email messages Countless teleconferences Correspondence reveals socio-cultural issues for each class of providers o Privacy o Business models o Ownership o World-view Study by means of email term analysis of correspondence o Term frequencies: tag clouds o Term co-occurrences: concept networks MESUR’s efforts Negotiated with 1.Publishers 2.Aggregators 3.Institutions Results: +1B usage events obtained Usage data provided ranging from 100M to 500K Spanning multiple communities and collections

29 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team The world according to usage data providers: institutions Note: strong focus on technical matters related to sharing SFX data

30 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team The world according to usage data providers: institutions

31 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team The world according to usage data providers: publishers Note: strong focus on procedural matters related to agreements and project objectives

32 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team The world according publishers

33 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team The world according to usage data providers: aggregators Note: strong focus on procedural matters and project/research objectives

34 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team The world according to usage data providers: aggregators

35 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team The world according to usage data providers: Open Access Publishers Note: strong focus on technical matters related to sharing of data

36 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team The world according to usage data providers: Open Access publishers

37 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Summary diagram of provider concerns

38 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Lesson 5: size isn’t everything “Return on investment” analysis Value generated relative to investment Investment = correspondence Value = usage events loaded Extracted from each provider correspondence with MESUR: Investment: 1. Delay between first and last email 2. Number of emails 3. Number of bytes in emails 4. Timelines of email “intensity” Value: o Number of usage events loaded o (bytes won’t work: different formats) MESUR’s efforts: 12 months +550 email messages Good results but large variation: Usage data provided ranging from 100M to 500K

39 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Return on investment (ROI) DaysEmailsBytesEvents (M)Type 1 1674739858149.5A 2 6381818522.5OAP 3 178935338025.3P 4 126824342715.1I 5 19043293045.6A 6 5570342093.4I 7 11441394101.0A ROI = value / work value = usage events loaded work = 1) days between first contact 2) emails sent 3) bytes communicated

40 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Return on investment (ROI) TypeAskedPositiveNegative% pos Publishers148657% Aggregators330100% Institutions440100% Totals21147 ROI = value / workValue = usage data obtained Work = asking Survivor bias! Another way to look at it:

41 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Return on investment: ETA? All Publisher Aggregator

42 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Return on investment: timelines Institution Note: same pattern, but much less traffic.

43 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Structure of presentation. 1)MESUR: introduction and overview 2)Aggregation: value proposition 3)Data models: 1) Usage data representation 2) Aggregation frameworks 4)Usage data providers: technical and socio-cultural issues, a survey 5)Sampling strategies: theoretical issues 6)Discussion: way forward, roadmap

44 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Lesson 6: Sampling is (seriously) difficult Usage data = sample Sample of whom? o Different communities o Different collections o Different interfaces Who to aggregate from? o Different providers o Interfaces o Systems o Representation frameworks Result = difficult choices

45 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team A tapestry of usage data providers: Main players: Individual institutions Aggregators Publishers We will hear from several in this workshop Each represent different, and possibly overlapping, samples of the scholarly community. Institutions: Institutional communities Many collections Aggregators: Many communities Many collections Publishers: Many communities Publisher collection

46 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team A tapestry of usage data providers * A: Technical o Aggregation o Normalization o Deduplication MESUR’s research MESUR: significant project overhead can be reduced by standards Unresolved. However, MESUR lessons may elucidate. * C: Socio-cultural o World views o Concerns o Business models * B: Theoretical o Who: Sample characteristics? o What: resulting aggregate? o Why: sampling objective? Which providers to choose? Different problems:

47 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Structure of presentation. 1)MESUR: introduction and overview 2)Aggregation: value proposition 3)Data models: 1) Usage data representation 2) Aggregation frameworks 4)Usage data providers: technical and socio-cultural issues, a survey 5)Sampling strategies: theoretical issues 6)Discussion: way forward, roadmap

48 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Conclusion There exists considerable interest in aggregating item-level usage data. This requires standardization on multiple levels: 1)Recording and processing Field semantics: what do the various recorded usage items mean? Requests: standardize on request types and semantics? Standardized representation needs to be scalable yet minimize information loss 2)Aggregation: Serialization: standardized serialization for resulting usage data Transfer: standard protocol for exposing and harvesting of usage data Privacy: protect identity and rights of providers, institutions and users Although technical issues can be resolved: 1)Socio-cultural issues: Worldviews, business models and ownership shape aggregation strategies. Greatly improved by existing standards and infrastructure for usage recording, processing and aggregating 2)Theoretical questions of sampling: MESUR’s research project will provide insights.

49 Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Some relevant publications. Marko A. Rodriguez, Johan Bollen and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage, In Proceedings of the Joint Conference on Digital Libraries, Vancouver, June 2007 Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. Journal of the American Society for Information Science and technology, 59(1), pages 001-014 (cs.DL/0610154). Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006. Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006. Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006 (arxiv.org:cs.DL/0601030) Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6):1419-1440, 2005.


Download ppt "Research Library, Los Alamos National November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team Aggregation and Analysis."

Similar presentations


Ads by Google