Repository Statistics
Peter Millington
Technical Development Officer
SHERPA, University of Nottingham

Overview
- Introduction
- Global statistics
- The what & why of repository statistics
- Benchmarks & data sources
- Compilation methods
- Web usage logging tools
- Google Analytics demo
- Problems and solutions
- Group session – Key issues
Global Repository Statistics
- Data Sources – Global lists of repositories
  - OpenDOAR
  - ROAR
  - Repository66
- May be useful for advocacy work
- Examples of types of chart & presentation
ROAR – Individual Growth Charts
ROAR – Individual Source Data
[Table with columns: Month, Records, Archives]
Delegates' What and Why of Statistics
- Rate of growth
  - For advocacy
  - Measure of success – for our paymasters
- Rate of usage
  - Targeting weak areas – departments
  - Measure of success
  - Justifying funding
- Most downloaded author/paper
  - Promotes interest and engagement from authors

Delegates' What and Why of Statistics
- Where are visitors coming from – referrers
  - Curiosity – is it being seen by the right people?
- Citation statistics
  - To demonstrate the beneficial impact of repositories
- Drilling down for more detail
  - For a sense of reality
- Steep slopes, animation, etc.
  - Glitzy marketing
Individual Repositories - Content
- Growth & deposition rates
  - Measure of progress
  - Impact of advocacy events
  - Impact of mandatory deposition
- Types of document or item
  - Trend-watching?
- Breakdown by department and/or author
  - How much is everyone contributing?
- Proportion of full text v metadata only
  - Measure of usefulness
Item types: Universidade do Minho
Individual Repositories - Performance
- Proportion of publications deposited
  - How comprehensive is the archive?
- Proportion of authors who are depositing
  - Are they complying with local mandates?
- Compliance with funders' mandates
  - Are you meeting your obligations?
- Repository administration
  - Are your turnaround times acceptable?
Compliance with the CERN Mandate
Compliance Benchmarks
- Counting publications
  - Institution-wide bibliographies, e.g. maintained by research managers
  - Publication lists on departmental web pages
  - Public/Commercial databases – ISI, Medline, etc.
- Counting authors
  - Who qualifies as an author?
    - Academic staff, research students, managers
  - University Calendars & departmental staff lists
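Once a benchmark list of publications exists, the proportion deposited is a simple ratio. The sketch below is illustrative only: the function name `deposit_rate` and the identifiers are hypothetical, and it assumes the benchmark list and the repository share a common identifier for each publication, which in practice usually needs manual matching.

```python
def deposit_rate(published, deposited):
    """Share of benchmark publications that are also in the repository."""
    published = set(published)
    if not published:
        raise ValueError("benchmark publication list is empty")
    # Items in the repository but missing from the benchmark are ignored.
    return len(published & set(deposited)) / len(published)


# Hypothetical identifiers, for illustration only.
bibliography = ["pub-001", "pub-002", "pub-003", "pub-004"]
repository = ["pub-002", "pub-004", "pub-999"]

rate = deposit_rate(bibliography, repository)
print(f"{rate:.0%} of benchmark publications deposited")
```

The same function can serve for authors by passing staff-list names and depositor names instead of publication identifiers.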
Individual Repositories - Usage
- Rates of usage
  - Measure of usefulness
  - Impact of news-related items
- Most downloaded items
  - Identifying research(ers) with most impact?
  - Engendering competition between authors?
- Downloads according to author
  - Performance reviews?
- Geographical distribution of users
  - Are you reaching your intended audience?
Sources of Data
- Repository's own database
- OAI-PMH
- Server's access log
- Remote logging

Compilation Methods
- Repository's own database
  - Copying from the human interface
  - Interactive SQL commands
Copying from the Human Interface
Interactive SQL Commands

mysql> SELECT type,COUNT(*) FROM eprint GROUP BY type;
+-----------------+----------+
| type            | COUNT(*) |
+-----------------+----------+
| article         |      456 |
| book            |        5 |
| book_section    |       39 |
| conference_item |      173 |
| exhibition      |        1 |
| monograph       |       18 |
| other           |        3 |
| thesis          |        4 |
+-----------------+----------+
8 rows in set (0.00 sec)
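The same query can be scripted so that the counts are collected regularly rather than typed by hand. This is a minimal sketch, not the repository software's own method: it uses an in-memory SQLite database as a stand-in for the repository's MySQL database, reuses the `eprint` table and `type` column from the query above, and fills the table with made-up rows. The `eprintid` column and the function name `count_by_type` are assumptions.

```python
import sqlite3


def count_by_type(conn):
    """Return {item type: count}, mirroring the GROUP BY query above."""
    rows = conn.execute(
        "SELECT type, COUNT(*) FROM eprint GROUP BY type ORDER BY type"
    ).fetchall()
    return dict(rows)


# Stand-in database with made-up rows (not real repository data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eprint (eprintid INTEGER PRIMARY KEY, type TEXT)")
conn.executemany(
    "INSERT INTO eprint (type) VALUES (?)",
    [("article",), ("article",), ("thesis",), ("book",)],
)

counts = count_by_type(conn)
for item_type, n in counts.items():
    print(f"{item_type:<12} {n}")
```

Against a live repository, only the connection line would change, to whichever MySQL client library is installed locally.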
Compilation Methods
- Repository's own database
  - Copying from the human interface
  - Interactive SQL commands
- OAI-PMH
  - Harvesting programs – e.g. ROAR's Celestial
OAI-PMH ListIdentifiers
OAI-PMH ListRecords
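A harvesting program boils down to requesting one of these verbs and counting the headers that come back. The sketch below is illustrative only: the base URL and the sample response are invented, nothing is fetched over the network, and resumption tokens (used by real servers to page through long lists) are not handled. Note that an OAI-PMH datestamp records when a record was created, modified or deleted, so the monthly counts only approximate deposit dates.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListIdentifiers request URL for a repository's OAI base URL."""
    query = urlencode({"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def records_per_month(xml_text):
    """Count non-deleted record headers by the YYYY-MM of their datestamp."""
    root = ET.fromstring(xml_text)
    months = Counter()
    for header in root.iterfind(".//oai:header", OAI_NS):
        if header.get("status") == "deleted":
            continue
        datestamp = header.findtext("oai:datestamp", default="", namespaces=OAI_NS)
        months[datestamp[:7]] += 1
    return dict(months)


# Invented response, shaped like a real ListIdentifiers reply.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:repository.example.org:1</identifier><datestamp>2008-01-15</datestamp></header>
    <header><identifier>oai:repository.example.org:2</identifier><datestamp>2008-01-20</datestamp></header>
    <header status="deleted"><identifier>oai:repository.example.org:3</identifier><datestamp>2008-02-01</datestamp></header>
    <header><identifier>oai:repository.example.org:4</identifier><datestamp>2008-02-03</datestamp></header>
  </ListIdentifiers>
</OAI-PMH>"""

url = list_identifiers_url("https://repository.example.org/oai")
monthly = records_per_month(SAMPLE)
print(url)
print(monthly)
```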
ROAR - Celestial
date  identifier  url
oai:bora.uib.no:1956/2270  Department of Earth Science
oai:bora.uib.no:1956/2272  Department of History
oai:bora.uib.no:1956/2273  Department of the History of Religions
oai:bora.uib.no:1956/2274  Section for Endocrinology
oai:bora.uib.no:1956/2275  Department of the History of Religions
oai:bora.uib.no:1956/2276  Department of the History of Religions
oai:bora.uib.no:1956/2277  Department of the History of Religions
oai:bora.uib.no:1956/2278  Department of the History of Religions
oai:bora.uib.no:1956/2279  Department of Oral Sciences
oai:bora.uib.no:1956/2281  Department of the History of Religions
oai:bora.uib.no:1956/2282  Department of Sociology
oai:bora.uib.no:1956/2283  Else Øyen
oai:bora.uib.no:1956/2284  Section for Art History
oai:bora.uib.no:1956/2285  Section for Russian
oai:bora.uib.no:1956/2286  Department of Geography
oai:bora.uib.no:1956/2287  Department of Greek, Latin and Egyptology
oai:bora.uib.no:1956/2288  Section for Spanish
oai:bora.uib.no:1956/2289  Department of Mathematics
oai:bora.uib.no:1956/2290  Department of Geography
oai:bora.uib.no:1956/2291  Department of Geography
oai:bora.uib.no:1956/2292  Department of Biology
oai:bora.uib.no:1956/2293  Department of Biology
Compilation Methods
- Repository's own database
  - Copying from the human interface
  - Interactive SQL commands
- OAI-PMH
  - Harvesting programs – e.g. ROAR's Celestial
- Server's access log
  - Web usage statistics tools
Raw Web Access Logs

[10/Apr/2005:05:34: ] "GET /portfolio.css HTTP/1.0" "-" "ia_archiver"
[10/Apr/2005:07:16: ] "GET /DAWN_Index.htm HTTP/1.0" "-" "ia_archiver"
[10/Apr/2005:07:17: ] "GET /Eric.htm HTTP/1.0" "-" "ia_archiver"
[10/Apr/2005:07:21: ] "GET /Library_Form.htm HTTP/1.0" "-" "ia_archiver"
[10/Apr/2005:07:22: ] "GET /cleansing.htm HTTP/1.0" "-" "ia_archiver"
[10/Apr/2005:07:25: ] "GET /index.htm HTTP/1.0" "-" "ia_archiver"
[10/Apr/2005:07:28: ] "GET /integration.htm HTTP/1.0" "-" "ia_archiver"
[10/Apr/2005:07:31: ] "GET /merging.htm HTTP/1.0" "-" "ia_archiver"
[10/Apr/2005:07:34: ] "GET /publication.htm HTTP/1.0" "-" "ia_archiver"
[10/Apr/2005:08:22: ] "GET /ABACUS_Index.htm HTTP/1.0" "-" "ia_archiver"
[10/Apr/2005:08:27: ] "GET /limitations.htm HTTP/1.0" "-" "ia_archiver"
[20/Dec/2004:13:22: ] "GET /robots.txt HTTP/1.1" "-" "gazz/
[20/Dec/2004:13:23: ] "GET / HTTP/1.1" "-" "gazz/
[20/Dec/2004:13:25: ] "GET /Logo.gif HTTP/1.1" "-" "gazz/
[20/Dec/2004:13:27: ] "GET /contact.htm HTTP/1.1" "-" "gazz/
[20/Dec/2004:13:29: ] "GET /profile.htm HTTP/1.1" "-" "gazz/
[20/Dec/2004:13:37: ] "GET /index.htm HTTP/1.1" "-" "gazz/
[20/Dec/2004:13:47: ] "GET /publication.htm HTTP/1.1" "-" "gazz/
[20/Dec/2004:13:49: ] "GET /InsideInfo.jpg HTTP/1.1" "-" "gazz/5.0

Recorded fields include:
- IP address of the computer requesting a file
- Date & time transaction completed
- Name of file requested
- Success code – usually 200 for successfully completed
- File size in bytes
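The recorded fields can be pulled out of each line with a regular expression. This is a minimal sketch assuming the server writes the common Apache "combined" log format. The sample line is invented for illustration (the IP address, time zone, status and size are not taken from the log above), and the names `LOG_PATTERN` and `parse_line` are hypothetical.

```python
import re

# One entry in Apache "combined" format:
# ip ident user [time] "request" status size "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return the fields of one log entry as a dict, or None if it does not parse."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # The size is logged as "-" when no body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Invented sample line (values are illustrative only).
sample = (
    '192.0.2.10 - - [10/Apr/2005:05:34:12 +0100] '
    '"GET /portfolio.css HTTP/1.0" 200 1024 "-" "ia_archiver"'
)
entry = parse_line(sample)
print(entry["ip"], entry["path"], entry["status"], entry["size"], entry["agent"])
```

Tools such as Analog, Webalizer and AWStats do this parsing internally; a hand-rolled parser is mainly useful for one-off questions the packaged reports do not answer.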
Web Usage Statistics Tools
- Analog
- Webalizer
- AWStats
- etc.
Sample output from the Analog Statistics Package
Sample output from the Webalizer Statistics Package
Sample output from the AWStats Statistics Package
Compilation Methods
- Repository's own database
  - Copying from the human interface
  - Interactive SQL commands
- OAI-PMH
  - Harvesting programs – e.g. ROAR's Celestial
- Server's access log
  - Web usage statistics tools
- Remote logging
  - Google Analytics
- Sign up to a Google Account
- Specify the URL to be logged
- Obtain snippet of JavaScript code
- Insert snippet into HTML of pages to be logged
  - Ideally into a template file
- Make sure the modified pages are live!
- Logging starts automatically
- Log in to your account to view the analytics
Google Analytics JavaScript snippet

var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
var pageTracker = _gat._getTracker("UA ");
pageTracker._initData();
pageTracker._trackPageview();

Find URL
- Containing/Excluding
- String, e.g. pdf
- Regular expressions, e.g. /[0-9]*/ for EPrints IDs
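The two kinds of URL filter can be tried out offline before they are typed into the analytics interface. The sketch below is illustrative only: it assumes EPrints-style paths that start with the numeric eprint ID (for example /123/ for the abstract page), and the function names and sample paths are hypothetical. It uses [0-9]+ rather than [0-9]*, because the starred form also matches an empty string between two slashes.

```python
import re

# Assumed URL layout: the path begins with the numeric eprint ID.
EPRINT_ID = re.compile(r"^/([0-9]+)/")


def eprint_id(path):
    """Return the eprint ID at the start of a path, or None if there is none."""
    match = EPRINT_ID.match(path)
    return int(match.group(1)) if match else None


def is_pdf(path):
    """Plain string test, like a 'containing pdf' filter restricted to the suffix."""
    return path.lower().endswith(".pdf")


# Hypothetical paths for illustration.
paths = ["/123/", "/123/1/paper.PDF", "/view/year/2008.html", "//style.css"]
for p in paths:
    print(p, eprint_id(p), is_pdf(p))
```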
Problems
- Web bots and crawlers
  - Inflating usage volume
  - Skewing usage time series
- Auxiliary files & non-eprint pages
  - CSS style sheet files
  - Image files – jpeg, gif, etc.
  - Index pages
- Linking URLs to bibliographic references
  - What does that eprint number mean?
Problems and Solutions
- Web bots and crawlers
  - Use robots.txt & meta robots tags to prevent crawling
  - Filtering out known bots
  - Still leaves maverick hackers' & students' bots
- Auxiliary files & non-eprint pages
  - Configuring & tuning the analysis tool
  - Filter using regular expressions
- Linking URLs to bibliographic references
  - Programmatic concordance, e.g. IRStats
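Both filters can be sketched in a few lines. This is only an illustration of the idea, not how any particular statistics package does it: the marker list is a small hand-picked sample (real bot lists are far longer and need regular updating), the suffix list is an assumption about which files count as auxiliary, and the hits are invented.

```python
from collections import Counter

# Substrings that identify some well-known crawlers (a sample, not a full list).
BOT_MARKERS = ("bot", "crawler", "spider", "ia_archiver", "slurp")
# File types treated as auxiliary rather than as page views (an assumption).
AUX_SUFFIXES = (".css", ".gif", ".jpg", ".jpeg", ".png", ".js", ".ico")


def is_bot(agent):
    """True if the user-agent string contains a known crawler marker."""
    agent = agent.lower()
    return any(marker in agent for marker in BOT_MARKERS)


def is_auxiliary(path):
    """True for style sheets, images, scripts and robots.txt."""
    path = path.lower()
    return path.endswith(AUX_SUFFIXES) or path == "/robots.txt"


def count_page_views(hits):
    """Count hits per path, skipping crawlers and auxiliary files."""
    views = Counter()
    for path, agent in hits:
        if is_bot(agent) or is_auxiliary(path):
            continue
        views[path] += 1
    return dict(views)


# Invented (path, user-agent) pairs for illustration.
hits = [
    ("/index.htm", "Mozilla/5.0"),
    ("/index.htm", "ia_archiver"),
    ("/portfolio.css", "Mozilla/5.0"),
    ("/publication.htm", "Mozilla/5.0"),
    ("/index.htm", "Mozilla/5.0"),
    ("/robots.txt", "ExampleBot/1.0"),
]
views = count_page_views(hits)
print(views)
```

A crawler that sends a browser-like user-agent string passes straight through this filter, which is the "maverick" case noted on the slide.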
Over to Chris for DSpace statistics…
What are your priorities for statistics?
Peter Millington