A social network analysis of research collaboration in the economics community Thomas Krichel (Long Island University) Nisa Bakkalbasi (Yale University)


sponsors Open Society Institute through the sponsoring of the ACIS project – RAS is now based on ACIS. Miteq Corp for the computation support – They sponsored usage of an HP ProLiant 8-CPU machine on which the computations are done. – Otherwise they would have taken a verrrrry long time.

structure of this talk – background on RePEc – RePEc author service – centrality as an incentive device – back to basics – results using the RePEc author service – implementation challenges

RePEc essence and history It is an open-access abstracting and indexing database about economics. It goes back to 1993 when Thomas Krichel started to build indexes of printed and online working papers in economics. Now it also covers journal articles and some other publication types such as books and book chapters.

what is interesting about RePEc – Large – Unfunded – Relational – Evaluation-oriented

RePEc is large Over 550 archives contribute document data to the collection. There are about 350k items described. This is more than in arXiv.org, at some recent count. There are about 10 different user services that use RePEc data or process it further.

RePEc is unfunded While there are some sponsors for parts of RePEc, neither data collection nor service provision is externally sponsored. Most data about publications come from dedicated RePEc archives based at – economics departments at universities – other research centers – some specialized administrative units such as central banks. Services are mainly run by amateurs.

RePEc is relational RePEc registers not only documents but also researchers and their institutions. Institutions are centrally registered by one volunteer, Christian Zimmermann. People register with the RePEc Author Service (RAS). More about this later.

RePEc is evaluation-oriented Since we have identified authors, we can aggregate evaluative measures over authors and institutions. Recently, Christian Zimmermann has built a battery of 22 different indicators for individuals. This is a very rich dataset for scientometric exercises. Any questions?

RAS history and essence It goes back to 1999 when Thomas Krichel directed work by Markus Johannes Richard Klink to build a special author registration web interface. In 2002 the Open Society Institute contributed $50k to develop generic software to implement services such as RAS. – The software is written by Ivan V. Kurmanov. – It is called ACIS (Academic Contribution Information System).

how does RAS work? Authors contact RAS to let RePEc know what papers they have written. – Registrants create and maintain a personal profile. – Registrants create and maintain a name variations profile. – RAS creates and maintains a contributions profile. Once an initial profile is defined, ACIS has a mechanism called ARPU that alerts authors about documents being added to their profile. The contributions profile contains the names of all documents.

what is interesting about RAS? Registration of authors solves all problems of trying to identify authors by their names. – There are many ways to represent the same name, e.g. Bruno Van Pottelsbergh De la Potterie, proceedings page 128. Some RAS registrants have even longer names! – Many different authors may share the same name or the same way in which the name can be represented. Solving these problems "manually" is very expensive and only feasible for small sets of authors.

but RAS is not complete Bakkalbasi and Krichel (2006) /home/krichel/papers/elba.pdf (the Elba paper) have shown that, at the time of writing, – Roughly every third RePEc document has at least one registered author. – Roughly every fourth RePEc authorship is captured by RAS. These figures are not likely to change very rapidly. – RAS gets more registrants. – RePEc gets more documents.

RAS and co-authorship In the Elba paper there is a conjecture that the fact that author A is registered does not significantly increase the chance that the co-authors of A are registered. This cannot be formally shown without labouring through attempts to identify authors by name. One indication is that the graph formed by co-author relationships in RAS is not dense. This has been found in recent work by Nisa Bakkalbasi.

registration incentive on co-authors To get authors to register, we need good incentives. In conventional (Zimmermann's 22) indicators, the position of an author depends only on the author's own actions. If we use co-authorship, we can devise rankings that depend on co-authorship. If we have such a ranking, authors will have incentives to get their co-authors to register.

imagine a RAS-CIS A RAS Collaboration Information System should be built. RAS-CIS could show the registrants – local information about shortest paths – network summaries via centrality indices. The summary information will improve as more of the author's collaborators register.

two tasks to build RAS-CIS We have to select the measures to calculate and develop the tools to calculate them. This is what the paper is about. We have to build an interface that will allow intuitive access to that data. The data would have to be updated. Since there has been no similar service before, this is a hard task. But it is not done here.

the job here We calculate different centrality rankings of authors. We compare the rankings among themselves. We want to select the measure that is best to use in a web-based collaboration centrality ranking service. RAS-CIS is still to be built.

collaboration graph From a social networking perspective, collaboration establishes a graph structure. – RAS authors are the nodes. – Collaborations, i.e. common claims of the same paper, are the arcs between nodes. – If there is no common paper claimed by two authors, no arc exists between the nodes. Specific results depend on how the arc length is calculated from the collaboration structure.
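The graph construction on this slide can be sketched as follows. This is a minimal illustration, not the code used for the talk: the `claims` mapping, the author names, and the function name `collaboration_graph` are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: paper id -> RAS authors who claimed the paper.
claims = {
    "p1": ["ann", "bob"],
    "p2": ["ann", "bob", "cat"],
    "p3": ["dan"],  # single-author paper: a node, but no arcs
}

def collaboration_graph(claims):
    """Return {author: set of co-authors} built from common paper claims."""
    graph = {}
    for authors in claims.values():
        for author in authors:
            graph.setdefault(author, set())
        # Every pair of authors on the same paper is joined by an arc.
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

graph = collaboration_graph(claims)
```

An author with no common claims, such as `dan` here, stays in the graph as an isolated node.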

graph components If there is a path between one author A and another author B along collaboration arcs, we say that A and B belong to the same component of the collaboration graph. It is commonly observed in real-world networks that the largest component is quite large. It usually has more than 50% of all nodes and is therefore known as the giant component. Most centrality measures are only meaningful for the members of the giant component.
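Components can be found with a breadth-first search from each unvisited node. The sketch below assumes the graph is stored as a dictionary of neighbour sets; the toy graph and the names `components`, `giant` and `share` are illustrative only.

```python
from collections import deque

def components(graph):
    """Return the connected components of an undirected graph, largest first."""
    seen = set()
    result = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        members = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    members.add(neighbour)
                    queue.append(neighbour)
        result.append(members)
    return sorted(result, key=len, reverse=True)

# Hypothetical toy graph with three components.
toy = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}
parts = components(toy)
giant = parts[0]                # the largest component
share = len(giant) / len(toy)   # fraction of all nodes it contains
```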

face the force of facts – registrants are found in RAS – registrants (70% of registrants) are authors, i.e. they have claimed at least one paper – registrants (66% of authors) are co-authors, i.e. they are authors who have collaborated with at least one other RAS author – registrants (83% of co-authors) are in the giant component.

the RAS nodes 5019 authors is still a rather large network. Compare this to the 96 authors in the Hou, Kretschmer and Liu paper on page 77 in the proceedings. There are at least shortest paths between the authors, and many more other paths. Calculating a set of shortest paths takes 10 hours on an 8-CPU machine.

network type Between any two nodes, there is an edge if the authors have ever collaborated. But the length of the edge depends on your point of view of the strength of collaboration. Different edge lengths lead to different networks. We introduce three networks in the following three slides.

network 1: binary network In the binary network, the collaboration strength between any two authors is one if the two authors have claimed at least one common paper in RAS. The collaboration strength is zero otherwise. The edge length is the inverse of the collaboration strength. If the collaboration strength is zero, there is no edge between the two nodes. We use an algorithm by Newman to do the calculations.
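In the binary network every edge has length one, so shortest paths can be found by breadth-first search. The sketch below is a plain BFS that also counts how many distinct shortest paths reach each node; it illustrates the idea and is not claimed to be the Newman algorithm used for the talk. The toy graph and the function name are assumptions.

```python
from collections import deque

def bfs_paths(graph, source):
    """Return (distance, number of shortest paths) from source to each node."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                # First time we reach this node: record its distance.
                dist[neighbour] = dist[node] + 1
                count[neighbour] = 0
                queue.append(neighbour)
            if dist[neighbour] == dist[node] + 1:
                # node lies on a shortest path to neighbour.
                count[neighbour] += count[node]
    return dist, count

# Hypothetical toy graph: a square a-b-d-c-a.
square = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a", "d"}, "d": {"b", "c"}}
dist, count = bfs_paths(square, "a")
```

In the square, `d` is two steps from `a` and is reached by two different shortest paths, which shows how readily path multiplicity arises when all edges have the same length.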

network 2: symmetric weighted network In a symmetric weighted network, for each paper that two authors have claimed in common, we increment the collaboration strength between the two authors by one divided by the number of authors on that paper minus 1. As a result, the total collaboration strength of an author is the number of co-authored papers. We used the Dijkstra algorithm to find the shortest paths. This will find only one shortest path.
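A minimal sketch of this weighting, under one stated assumption: each paper with n authors adds 1/(n-1) to the strength of every pair of its authors, which is the increment that makes an author's total strength equal to their number of co-authored papers. Edge length is taken as the inverse of strength, as in the binary network. All names and data are hypothetical.

```python
import heapq
from itertools import combinations

def symmetric_strengths(papers):
    """Return {author: {co_author: strength}}; each paper adds 1/(n-1) per pair."""
    strength = {}
    for authors in papers:
        n = len(authors)
        if n < 2:
            continue  # single-author papers create no collaboration
        for a, b in combinations(authors, 2):
            strength.setdefault(a, {}).setdefault(b, 0.0)
            strength.setdefault(b, {}).setdefault(a, 0.0)
            strength[a][b] += 1.0 / (n - 1)
            strength[b][a] += 1.0 / (n - 1)
    return strength

def dijkstra(strength, source):
    """Shortest path lengths from source, with edge length = 1 / strength."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbour, s in strength[node].items():
            candidate = d + 1.0 / s
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist

# Hypothetical toy data: each inner list is the author list of one paper.
papers = [["ann", "bob"], ["ann", "bob"], ["bob", "cat", "dan"]]
strength = symmetric_strengths(papers)
dist = dijkstra(strength, "ann")
```

Here `bob` has co-authored three papers and his strengths sum to 3; `ann` and `bob` are only 0.5 apart because they share two papers.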

network 3: random walk network In this type of network, we norm the collaboration strength of each author to be one. This generates an asymmetric network where inward edges are shorter for important authors who have written more papers. This type of measure is used in SNA to measure prestige. We used the Dijkstra algorithm to find the shortest paths. This will find only one shortest path.
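One way to read the normalisation, offered as an assumption rather than as the exact procedure of the talk: divide each author's outgoing strengths by their sum, so that they add up to one, and take the directed edge length as the inverse of that share. The input strengths and the function name below are hypothetical.

```python
def random_walk_lengths(strength):
    """Return directed edge lengths {author: {co_author: length}}.

    The share of author A's collaboration that goes to B is
    strength[A][B] / total(A); the length of the edge A -> B is its inverse.
    """
    lengths = {}
    for author, partners in strength.items():
        total = sum(partners.values())
        lengths[author] = {p: total / s for p, s in partners.items()}
    return lengths

# Hypothetical symmetric strengths: "star" collaborates with everyone.
strength = {
    "star": {"x": 1.0, "y": 1.0, "z": 2.0},
    "x": {"star": 1.0},
    "y": {"star": 1.0},
    "z": {"star": 2.0},
}
lengths = random_walk_lengths(strength)
```

The edge from `x` into `star` has length 1, while the edge from `star` out to `x` has length 4, so edges pointing towards the well-connected author are the short ones.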

centrality measures For each network, we can look at two centrality measures. – closeness centrality: a node is more central if it has a shorter average shortest path to all other nodes. – betweenness centrality: a node is more central if it lies on more of the shortest paths leading from one node to another. These centrality measures rank authors from the most central to the least central.
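Both measures can be sketched for the unweighted (binary) case. Closeness is computed here as the average shortest-path length, so lower values mean more central. Betweenness uses a Brandes-style accumulation, which is a standard technique swapped in for illustration and not necessarily what was run for the talk. The graph is assumed to be connected, as in the giant component.

```python
from collections import deque

def closeness(graph, source):
    """Average shortest-path length from source to all other nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    others = [d for node, d in dist.items() if node != source]
    return sum(others) / len(others)

def betweenness(graph):
    """Number of shortest paths through each node (fractional when tied)."""
    score = {node: 0.0 for node in graph}
    for s in graph:
        dist = {s: 0}
        sigma = {node: 0 for node in graph}   # shortest-path counts from s
        sigma[s] = 1
        preds = {node: [] for node in graph}  # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {node: 0.0 for node in graph}
        for w in reversed(order):             # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # In an undirected graph each pair was counted from both ends.
    return {node: value / 2 for node, value in score.items()}

# Hypothetical toy graph: the path a - b - c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
between = betweenness(path)
```

On the path, `b` sits on the only shortest path between `a` and `c`, so it has betweenness 1 and the smallest average distance.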

notation for centrality measures – BIC closeness centrality in the binary network – BIB betweenness centrality in the binary network – SYC closeness centrality in the symmetric weighted network – SYB betweenness centrality in the symmetric weighted network – RWC closeness centrality in the random walk network – RWB betweenness centrality in the random walk network

pair-wise Spearman rank correlation [table: rows and columns BIC, BIB, SYC, SYB, RWC, RWB; the correlation values are not preserved in this transcript]
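Spearman's rank correlation is the Pearson correlation of the ranks of two score lists. The sketch below uses average ranks for ties and made-up scores, since the values of the original table are not available here.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation of two equally long score lists."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical centrality scores of five authors under two measures.
measure_one = [1, 2, 3, 4, 5]
measure_two = [2, 1, 4, 3, 5]
rho = spearman(measure_one, measure_two)
```

Two rankings that differ only by swaps of neighbouring authors, as here, still correlate strongly (0.8 in this toy case).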

comments All three closeness measures produce very similar rankings. SYB and BIB are close, but RWB is quite far off both of them. Overall, the choice between betweenness and closeness seems to be more important than the choice between network models. This has been a surprise to us. BIC and BIB are close by 60%, the others are even lower.

adding the number of documents We can add the number of documents as an additional ranking criterion NDO. We get [table: correlations of NDO with BIC, BIB, SYC, SYB, RWC, RWB; the values are not preserved in this transcript] Overall, the weighted network appears to be best correlated with the number of documents. This should come as no surprise.

why add this alien number NDO? We can think of NDO as the simplest indication of the personal fame of an author. If we want to incentivize authors to climb the ranks of a collaboration centrality ranking, we need to have people at the top that they actually recognize. Remember Groucho Marx: "I'll never join a club that accepts me as a member". Thus the symmetric weighted network appears appealing.

symmetric weighted network If we use the symmetric weighted network in an interface, the numbers that come out for closeness are not intuitive because the total lengths are fractions. But the fact that there should be much less path multiplicity makes the presentation simpler. But the paths may be longer (in simple counts of intermediate nodes) than in the binary model.

RAS-CIS The most difficult aspect is to build the interface when there is no similar service present at this time. The updating cannot be done instantaneously, but ought to be close to it. – If the contributions profile of an author changes, we can recalculate her paths. – We can also recalculate the paths of her co-authors. – But then we end up with an overall network that is no longer symmetric.

more work RAS authorship data are a high-quality dataset that is easy to use. It is not widely used at this point. Note in particular that much of the data affecting collaboration has not been worked on – affiliation data – journal/series data – subject classification data. New ideas and partnerships welcome!

Thank you for your attention! write to