How Does Digitization Affect Scholarship?
Mark McCabe, University of Michigan
Roger Schonfeld, Ithaka
Christopher Snyder, Dartmouth College
December 11, 2007
What Characteristics Are Important to Authors?
Journal Characteristics Important to an Author
When it comes to influencing decisions about journals in which to publish an article of yours, how important to you is each of the following possible characteristics of an academic journal?
a) The journal makes its articles freely available on the Internet, so there is no cost to purchase or to read.
b) The journal permits scholars to publish articles for free, without paying page or article charges.
c) Measures have been taken to ensure the protection and safeguarding of the journal’s content for the long term.
d) The current issues of the journal are circulated widely, and are well read by scholars in your field.
e) The journal is highly selective; only a small percentage of submitted articles are published.
f) The journal is available to readers not only in developed nations, but also in developing nations.
Preferences for Academic Journals, 2006
Percent of faculty who believe that each characteristic is “very important” in influencing their decisions about where to publish their articles
Background on the Present Study
Objectives
What are the scholarly impacts of various business models for journal publishing?
How do various business models for journal publishing affect the value derived by authors and readers?
Natural Experiment
In 1995, publishers and content aggregators began digitizing current and archival content and placing it online. However, as late as 2005 (the endpoint of our analysis), backfiles for many journals (and, in some cases, current content) remained offline. We exploit this variation in the timing of online availability to explore the impact of online access.
Previous Studies
Many previous studies of this relationship find large effects.
Common flaws: these efforts do not adequately control for potential selection problems affecting article quality, do not use adequate statistical methods, or both. For example, did the best journals, at least in some disciplines, gain an online presence earlier?
This study avoids these problems: variation in journal quality for content published prior to 1995 is unlikely to be related to online strategies adopted by publishers after 1995.
Some Empirical Questions
What is the impact of online access on journal citation rates?
Are the benefits greater for newer or older content?
Are the effects discipline-specific?
Which online “channels” have the greatest impact?
Is the geographic and institutional distribution of citing authors influenced by online access?
People, Funding, and Timeline
Researchers:
Mark McCabe, Professor of Economics, University of Michigan (Principal Investigator)
Chris Snyder, Professor of Economics, Dartmouth (Co-Principal Investigator)
Roger Schonfeld, Manager of Research, Ithaka
Funded by a grant from The Andrew W. Mellon Foundation.
Data collection is complete and analysis is underway; full findings are expected to become available by mid-2008.
Our Data
Three Disciplines
History
Economics and Business
Biological and General Sciences
Hundreds of publishers, aggregators, and archives provided data.
100 journals in each discipline, compared journal-year by journal-year: 50 that were digitized early on, and 50 that were digitized only more recently or not at all.
We examine citations TO these journals that appeared in ANY journal from 1980 to 2005.
Complete citation databases were obtained from ISI.
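The journal-year by journal-year comparison implies a panel in which each cell counts the citations received in one citation year by one journal's content from one publication year. A minimal sketch of how such cells might be tallied follows. The record layout (cited journal, cited publication year, citing year) is hypothetical and is not the ISI format, and applying the 1980 to 2005 window to the citing year is an assumption.

```python
from collections import Counter

def build_panel(citations, first_year=1980, last_year=2005):
    """Count citations per (journal, publication year, citation year).

    `citations` is an iterable of (cited_journal, cited_pub_year, citing_year)
    tuples; this record layout is hypothetical, not the ISI format.
    """
    counts = Counter()
    for journal, pub_year, cite_year in citations:
        # Keep only citations made inside the study window, to content
        # that already existed when it was cited.
        if first_year <= cite_year <= last_year and pub_year <= cite_year:
            counts[(journal, pub_year, cite_year)] += 1
    return counts

def panel_rows(counts):
    """Yield one row per non-empty cell, adding the content's age when cited."""
    for (journal, pub_year, cite_year), n in sorted(counts.items()):
        yield {"journal": journal, "pub_year": pub_year,
               "cite_year": cite_year, "age": cite_year - pub_year,
               "citations": n}

# Made-up records, for illustration only.
sample = [("J1", 1990, 1995), ("J1", 1990, 1995), ("J1", 1990, 2001),
          ("J2", 1985, 1979), ("J2", 1985, 2006)]
rows = list(panel_rows(build_panel(sample)))
```

This sketch only emits cells with at least one citation. A full panel would also need explicit zero-citation cells for every journal, publication year, and citation year combination in scope.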
Descriptive Statistics
Columns: Obs, Mean, Std dev, Min, Max
ECONOMICS
Year journal first published
Publication year 3,
Citation year 58,
Citations to journal-publication-year in a year 58,
SCIENCE
Year journal first published
Publication year 3,
Citation year 71,
Citations to journal-publication-year in a year 71, , ,589
Skewed Distribution of Citations in Economics
[Histogram: frequency of citations to journal-publication-year in a year.]
About 4,700 zeros; one observation had 771 cites.
Skewed Distribution of Citations in Science
[Histogram: frequency of citations to journal-publication-year in a year.]
About 5,500 zeros; one observation had 32,500 cites.
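Many zeros combined with a few very large counts is the usual sign of overdispersion, which is one common reason to prefer a negative binomial model over a Poisson model for count data. The sketch below shows two simple diagnostics, the share of zeros and the variance-to-mean ratio. The counts are made up for illustration and are not the study's data.

```python
from statistics import mean, pvariance

def dispersion_summary(counts):
    """Return the share of zeros and the variance-to-mean ratio of counts."""
    m = mean(counts)
    return {
        "zero_share": sum(1 for c in counts if c == 0) / len(counts),
        # A Poisson variable has variance equal to its mean (ratio 1);
        # a ratio far above 1 signals overdispersion.
        "variance_to_mean": pvariance(counts) / m if m > 0 else float("nan"),
    }

# Made-up counts with many zeros and one large value, for illustration only.
toy = [0, 0, 0, 0, 0, 0, 1, 2, 3, 94]
summary = dispersion_summary(toy)
```

For these toy counts the mean is 10 and the population variance is 785, so the ratio is 78.5, far from the value of 1 that a Poisson model would imply.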
Online Availability for 1980 Content
Columns: Titles, Mean, St Dev, Min, Max
Economics (82 journals published in 1980): JSTOR, ProQuest, Ebsco, Publisher Website
Science (74 journals published in 1980): JSTOR, Ebsco, PubMed Central, Publisher Website
Geographic Distribution of First Authors of Articles that Cite Other Articles
Columns: Science Cites (000), %, Econ Cites (000), %
English-Speaking Countries* 9, ,
Non-English-Speaking Western Europe** 3,
Rest of the World 2,
Total Cites 15,521 1,687
* US, England, Canada, Australia, Scotland, New Zealand, Wales, Ireland, Northern Ireland
** Germany, Netherlands, France, Spain, Italy, Sweden, Belgium, Norway, Switzerland, Denmark, Finland, Austria, Greece, Portugal, Czech Republic, Slovakia
Challenges
ISI data requires extensive clean-up and quality control.
Many publishers and aggregators maintain poor records of their journals’ online histories.
First-author data are confusing and require more consideration.
Findings
Regression Outputs
Four fixed-effects (within) regressions, each run with the same Stata command, with robust standard errors adjusted for clustering on articlegroup:
. xtreg lncit1 age* cyr* d2* js2* ow2*, i(articlegroup) fe robust;
Regressors in every run: age1 to age49; cyr1981 to cyr2005; d21995 to d22005; js21995 to js22005 (js21995 and js21996 dropped); ow21995 to ow22005 (ow21995, ow21996, and ow21997 dropped); _cons.
Group variable: articlegroup. Number of groups = 99. Maximum obs per group = 975.
Sample: min obs per group; F statistic
(first, unlabeled run): 52; F(102,54464)
USA: 136; F(102,57725)
Non_USA English: 136; F(102,56959)
Non_English_Non_Europe: 104; F(102,53138)
[Coefficient tables: Coef., Std. Err., t, P>|t|, and 95% Conf. Interval for each regressor, plus sigma_u, sigma_e, and rho (fraction of variance due to u_i).]
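Read as an equation, the xtreg command in these outputs corresponds roughly to the within-group specification sketched below. The slides do not define the variables, so the reading here is an assumption based on their names: lncit1 as a log citation measure, age and cyr as content-age and citation-year dummies, and d2, js2, and ow2 as three families of year-specific online-availability dummies. The index g denotes the articlegroup and i an observation within it.

```latex
\mathtt{lncit1}_{gi}
  = \alpha_g
  + \sum_{a=1}^{49} \beta_a \, \mathtt{age}^{(a)}_{gi}
  + \sum_{y=1981}^{2005} \gamma_y \, \mathtt{cyr}^{(y)}_{gi}
  + \sum_{y=1995}^{2005} \Bigl(
        \delta_y \, \mathtt{d2}^{(y)}_{gi}
      + \theta_y \, \mathtt{js2}^{(y)}_{gi}
      + \omega_y \, \mathtt{ow2}^{(y)}_{gi}
    \Bigr)
  + \varepsilon_{gi}
```

Here \alpha_g is the articlegroup fixed effect removed by the fe option. The dummies reported as dropped (js2 for 1995 and 1996, ow2 for 1995 to 1997) would simply be absent from the corresponding sums.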
Science Journal Citations Peak in Year Three
[Figure: citations relative to age 49, plotted against years since publication, with 95% confidence interval.]
Notes: Results from negative binomial regression with age dummies, digital dummy aggregated across channels for any presence, restricted to publication years
Economics Journal Citations Peak in Year Five
[Figure: citations relative to age 49, plotted against years since publication, with 95% confidence intervals for science and for economics.]
Notes: Results from negative binomial regression with age dummies, digital dummy aggregated across channels for any presence, restricted to publication years
Preliminary General Findings
Citation levels more than double in both disciplines over the sample period.
There is an increase in citations as a result of digitization and online availability.
The effect is highly significant, both for pre-1995 content (digitized backfiles) and for born-digital periods.
Disciplinary Differences
Citation rates peak earlier in science (3 years) than in economics (5 years); the subsequent decline in citations is more rapid in science.
Online access is associated with an average increase in citations of about 10% for economics and 20% for science titles.
However, the changes in citations observed over time are an order of magnitude larger than the measured impact of online access.
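In a model with a log link or a log outcome, a dummy coefficient beta translates into a percent change of 100 * (exp(beta) - 1). The sketch below applies that standard conversion in both directions. The coefficient values it produces are back-calculated from the rounded 10% and 20% figures above for illustration; they are not estimates reported by the study.

```python
import math

def pct_change_from_coef(beta):
    """Percent change in expected citations implied by a dummy coefficient
    in a log-link (or log-outcome) model: 100 * (exp(beta) - 1)."""
    return 100.0 * math.expm1(beta)

def coef_from_pct_change(pct):
    """Inverse mapping: the coefficient implied by a percent change."""
    return math.log1p(pct / 100.0)

# Coefficients implied by the rounded effects quoted on the slide
# (illustrative back-calculation, not the study's estimates).
beta_econ = coef_from_pct_change(10.0)     # about 0.095
beta_science = coef_from_pct_change(20.0)  # about 0.182
```

For small effects the coefficient and the percent change are close (0.095 versus 10%), but the gap widens as effects grow, which matters when comparing the online-access effect with the much larger changes over time.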
For Science, Online Access Boosts Citations 20% Overall
[Figure: citations relative to age 49, plotted against years since publication, for online and offline content.]
Notes: Results from negative binomial regression with age dummies, digital dummy aggregated across channels for any presence, restricted to publication years
For Economics, Online Access Boosts Citations 10% Overall
[Figure: citations relative to age 49, plotted against years since publication, for online and offline content.]
Notes: Results from negative binomial regression with age dummies, digital dummy aggregated across channels for any presence, restricted to publication years
Channel Effects
For Science: JSTOR and publisher portals are important, but other third-party channels are not (except for the period 1995-97).
For Economics: all types of channels have a significant impact.
Longer embargo periods clearly decrease the ability of a given channel to increase citations.
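One way to see why an embargo would weaken a channel is to count how much content the channel actually exposes in a given year. The sketch below assumes a moving-wall embargo, where a channel shows only content at least a fixed number of years old. The function names, the parameters `launch_year` and `embargo_years`, and the moving-wall rule itself are illustrative assumptions, not fields or definitions from the study.

```python
def available_online(pub_year, cite_year, launch_year, embargo_years=0):
    """Return True if content published in `pub_year` could be read through
    a channel in `cite_year`.

    `launch_year` is the year the channel first carried the journal (None if
    never); `embargo_years` is a moving-wall delay. Both are hypothetical
    parameters, not fields from the study's data.
    """
    if launch_year is None or cite_year < launch_year:
        return False
    # Under a moving wall, only content at least `embargo_years` old is shown.
    return cite_year - pub_year >= embargo_years

def covered_years(first_pub_year, cite_year, launch_year, embargo_years=0):
    """List the publication years reachable through the channel in `cite_year`."""
    return [y for y in range(first_pub_year, cite_year + 1)
            if available_online(y, cite_year, launch_year, embargo_years)]

# A channel launched in 1997, viewed from 2005, for a journal published since 1980.
no_embargo = covered_years(1980, 2005, 1997)
five_year_wall = covered_years(1980, 2005, 1997, embargo_years=5)
```

With no embargo the channel covers all 26 publication years from 1980 to 2005. A five-year wall cuts that to the 21 years through 2000 and removes exactly the recent content that, per the age profiles above, tends to be cited most.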
HIGHLY PRELIMINARY: Geographic Effects on Citation Growth over Time
Rate of citation growth for biology is much higher (double) in non-English-speaking countries.
Rate of citation growth for economics is moderately higher in non-English-speaking countries.
Implication: Are these disciplines growing faster in non-English-speaking countries?
Impact of Digitization for Science – Publisher Website
Impact of Digitization for Science – JSTOR
Impact of Digitization for Science – Aggregators
Impact of Digitization for Economics – Publisher Website
Impact of Digitization for Economics – JSTOR
Impact of Digitization for Economics – Aggregators
HIGHLY PRELIMINARY: Geographic Effects on Citation Patterns
Science: The channel impact is about twice as large in the non-English-speaking countries (e.g., overall a 30% increase versus 15%).
Economics: The channel impact is about twice as large outside of the developed English-speaking countries (~20% increase versus less than 10%).
There is much we can learn from various models for the distribution of content and their relative strengths over time.
Further Questions and Discussion
Further Questions
Does year of source-item publication matter? Will references to older articles increase more than references to more recently published articles?
Have self-citation patterns changed? Presumably we will find no effect, an important confirmation of our data and analytical framework.
Findings and Discussion
We find a consistent, significant impact from digitization. At the same time, it is an order of magnitude smaller than the changes observed over time. Is the impact “large” or “small,” and what implications, if any, are there?
The impact is greater in science than in economics. Why? What are the implications?
The impact is greater outside of the English-speaking countries. Why? What are the implications?
Channel effects are dramatic. What are the implications?
How Does Digitization Affect Scholarship? Roger C. Schonfeld (212) 500 –