A social network analysis of research collaboration in the economics community Thomas Krichel (Long Island University) Nisa Bakkalbasi (Yale University)
sponsors Open Society Institute, through its sponsorship of the ACIS project – http://acis.openlib.org – RAS is now based on ACIS. Miteq Corp, for computation support – They sponsored the use of the HP ProLiant 8-CPU machine on which the computations were done. – Otherwise they would have taken a verrrrry long time.
structure of this talk background on RePEc RePEc author service centrality as an incentive device back to basics results using the RePEc author service implementation challenges
RePEc essence and history It is an open-access abstracting and indexing database about economics. It goes back to 1993, when Thomas Krichel started to build indexes of printed and online working papers in economics. Now it also covers journal articles and some other publication types such as books and book chapters.
what is interesting about RePEc Large Unfunded Relational Evaluation oriented
RePEc is large Over 550 archives contribute document data to the collection. There are about 350k items described. That is more than in arXiv.org, at some recent count. About 10 different user services use RePEc data or process it further.
RePEc is unfunded While there are some sponsors for parts of RePEc, neither data collection nor service provision is externally sponsored. Most data about publications come from dedicated RePEc archives based at – economics departments at universities – other research centers – some specialized administrative units such as central banks. Services are mainly run by amateurs.
RePEc is relational RePEc registers not only documents but also researchers and their institutions. Institutions are centrally registered by one volunteer, Christian Zimmermann. People register with the RePEc Author Service (RAS). More about this later.
RePEc is evaluation-oriented Since we have identified authors, we can aggregate evaluative measures over authors and institutions. Recently, Christian Zimmermann has built a battery of 22 different indicators for individuals. This is a very rich dataset for scientometric exercises. any questions?
RAS history and essence It goes back to 1999, when Thomas Krichel directed work by Markus Johannes Richard Klink to build a special author registration web interface. In 2002 the Open Society Institute contributed $50k to develop generic software to implement services such as RAS. – The software is written by Ivan V. Kurmanov. – It is called ACIS (Academic Contribution Information System)
how does RAS work? Authors contact RAS to let RePEc know what papers they have written. – Registrants create and maintain a personal profile – Registrants create and maintain a name variations profile – RAS creates and maintains a contributions profile. Once an initial profile is defined, ACIS has a mechanism called ARPU that alerts authors about documents being added to their profile. The contributions profile contains the names of all documents.
what is interesting about RAS? Registration of authors solves all problems of trying to identify authors by their names. – There are many ways to represent the same name, e.g. Bruno Van Pottelsbergh De la Potterie, proceedings page 128. Some RAS registrants have even longer names! – Many different authors may share the same name or the same way in which the name can be represented. Solving these problems "manually" is very expensive and only feasible for small sets of authors.
but RAS is not complete Bakkalbasi and Krichel (2006) http://openlib.org/home/krichel/papers/elba.pdf (Elba paper) have shown that, at their time of writing – Roughly every third RePEc document has at least one registered author. – Roughly every fourth RePEc authorship is captured by RAS. These figures are not likely to change very rapidly. – RAS gets more registrants. – RePEc gets more documents.
RAS and co-authorship In the Elba paper there is a conjecture that the fact that author A is registered does not significantly increase the chance that the co-authors of A are registered. This cannot be formally shown without labouring through attempts to identify authors by name. One indication is that the graph formed by co-author relationships in RAS is not dense. This has been found in recent work by Nisa Bakkalbasi.
registration incentive on co-authors To get authors to register, we need good incentives. In conventional (Zimmermann's 22) indicators, the position of an author depends only on the author's own actions. If we use co-authorship, we can devise rankings that depend on co-authorship. If we have such a ranking, authors will have incentives to get their co-authors to register.
imagine a RAS-CIS A RAS Collaboration Information System should be built. RAS-CIS could show the registrants – local information about shortest paths – network summaries via centrality indices The summary information will improve as more collaborators of the author register.
two tasks to build RAS-CIS We have to select the measures to calculate and develop the tools to calculate them. This is what the paper is about. We have to build an interface that will allow intuitive access to that data. The data would have to be updated. Since there has been no similar service before, this is a hard task. But it is not done here.
the job here We calculate different centrality rankings of authors. We compare the rankings among themselves. We want to select the measure that is best to use in a web-based collaboration centrality ranking service. RAS-CIS is still to be built.
collaboration graph From a social networking perspective, collaboration establishes a graph structure – RAS authors are the nodes. – Collaboration, i.e. common claim(s) of the same paper, forms the arcs between nodes. – If there is no common paper claimed by two authors, no arc exists between the nodes. Specific results depend on how the arc length is calculated from the collaboration structure.
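The graph construction on this slide can be sketched as follows. This is a minimal illustration, assuming the claims are available as a mapping from a paper identifier to the authors who claimed it; the function name `build_collaboration_graph` and the toy data are hypothetical and not part of RAS or ACIS.

```python
from collections import defaultdict
from itertools import combinations


def build_collaboration_graph(claims):
    """Return {author: set of co-authors} from {paper id: list of claiming authors}.

    Hypothetical helper: two authors are linked if they claimed a common paper.
    """
    neighbours = defaultdict(set)
    for authors in claims.values():
        # Every pair of authors on the same paper gets an arc in both directions.
        for a, b in combinations(sorted(set(authors)), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return dict(neighbours)


# Toy claims (invented data): "dan" has a paper but no registered co-author.
claims = {"p1": ["ann", "bob"], "p2": ["bob", "cat"], "p3": ["dan"]}
graph = build_collaboration_graph(claims)
print(graph)
```

Authors without any registered co-author, such as "dan" here, do not appear as nodes, which mirrors the distinction between authors and co-authors made later in the talk.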
graph components If there is a path between one author A and another author B along collaboration arcs, we say that A and B belong to the same component of the collaboration graph. It is commonly observed in real-world networks that the largest component is quite large. It usually has more than 50% of all nodes and it is therefore known as the giant component. Most centrality measures are only meaningful for the members of the giant component.
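Finding the components, and the giant component among them, can be sketched with a breadth-first search. This is an illustration under the assumption that the graph is stored as a dictionary of neighbour sets; the name `connected_components` and the toy graph are hypothetical.

```python
from collections import deque


def connected_components(neighbours):
    """Return the components of an undirected graph, largest first."""
    seen = set()
    components = []
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy graph (invented data) with one component of three and one of two authors.
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": {"e"}, "e": {"d"}}
components = connected_components(toy)
giant = components[0]
print(giant, len(giant) / len(toy))
```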
face the force of facts 13049 registrants are found in RAS. 9111 registrants (70% of registrants) are authors, i.e. they have claimed at least one paper. 6038 registrants (66% of authors) are co-authors, i.e. they are authors who have collaborated with at least one other RAS author. 5019 registrants (83% of co-authors) are in the giant component.
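The percentages on this slide follow directly from the four counts. A short check, using only the numbers quoted above (the helper name `share` is illustrative):

```python
counts = {
    "registrants": 13049,
    "authors": 9111,
    "co-authors": 6038,
    "giant component": 5019,
}


def share(part, whole):
    """Percentage of `whole` made up by `part`, rounded to the nearest integer."""
    return round(100 * part / whole)


authors_share = share(counts["authors"], counts["registrants"])
coauthors_share = share(counts["co-authors"], counts["authors"])
giant_share = share(counts["giant component"], counts["co-authors"])
print(authors_share, coauthors_share, giant_share)  # 70 66 83
```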
the RAS nodes 5019 authors is still a rather large network. Compare it to the 96 authors in the Hou, Kretschmer and Liu paper on page 77 in the proceedings. There are at least 12592671 shortest paths between the authors, and many more other paths. Calculating a set of shortest paths takes 10 hours on an 8-CPU machine.
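The lower bound of 12592671 is the number of unordered author pairs in the giant component: every pair is connected, so it has at least one shortest path. A quick check of that arithmetic:

```python
giant_size = 5019

# Number of unordered pairs of distinct authors: n * (n - 1) / 2.
pairs = giant_size * (giant_size - 1) // 2
print(pairs)  # 12592671
```

Pairs with several equally short paths push the true count above this bound, which is why the slide says "at least".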
network type Between any two nodes, there is an edge if the authors have ever collaborated. But the length of the edge depends on your point of view of the strength of collaboration. Different edge lengths lead to different networks. We introduce three networks in the following three slides.
network 1: binary network In the binary network, the collaboration strength between any two authors is one if the two authors have claimed at least one common paper in RAS. The collaboration strength is zero otherwise. The edge length is the inverse of the collaboration strength. If the collaboration strength is zero, there is no edge between the two nodes. We use an algorithm by Newman to do the calculations.
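In the binary network every edge has length one, so shortest paths can be found by breadth-first search. The sketch below is a plain breadth-first search that also counts how many shortest paths reach each node; it is not a reproduction of the Newman algorithm mentioned on the slide, and the function name and toy graph are hypothetical.

```python
from collections import deque


def bfs_shortest_paths(neighbours, source):
    """Return (distance, number of shortest paths) from source to each reachable node."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in dist:
                # First visit fixes the shortest distance in an unweighted graph.
                dist[nxt] = dist[node] + 1
                count[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                # `node` is a predecessor of `nxt` on some shortest path.
                count[nxt] += count[node]
    return dist, count


# Toy "diamond" graph (invented data): two equally short routes from a to d.
diamond = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a", "d"}, "d": {"b", "c"}}
dist, count = bfs_shortest_paths(diamond, "a")
print(dist["d"], count["d"])  # 2 2
```

The path count illustrates the multiplicity of shortest paths that the binary network produces and that the later slides contrast with the weighted networks.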
network 2: symmetric weighted network In a symmetric weighted network, for each paper that two authors have claimed in common, we increment the collaboration strength between the two authors by one divided by the number of authors on that paper minus 1. As a result, the total collaboration strength of an author is the number of co-authored papers. We used the Dijkstra algorithm to find the shortest paths. This will find only one shortest path between any two nodes.
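A minimal sketch of this weighting and of the Dijkstra search follows. It assumes that a paper with n authors adds 1/(n - 1) to the strength of each author pair, which is the weighting under which an author's total strength equals the number of co-authored papers, and it assumes, as in the binary network, that edge length is the inverse of strength. Function names and data are hypothetical.

```python
import heapq
from collections import defaultdict
from itertools import combinations


def symmetric_strengths(claims):
    """Collaboration strength per author pair; assumes 1/(n-1) per n-author paper."""
    strength = defaultdict(lambda: defaultdict(float))
    for authors in claims.values():
        authors = sorted(set(authors))
        if len(authors) < 2:
            continue  # single-author papers create no collaboration
        per_pair = 1.0 / (len(authors) - 1)
        for a, b in combinations(authors, 2):
            strength[a][b] += per_pair
            strength[b][a] += per_pair
    return strength


def dijkstra(strength, source):
    """Shortest distances from source, with edge length = 1 / strength (assumed)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, s in strength[node].items():
            candidate = d + 1.0 / s
            if candidate < dist.get(nxt, float("inf")):
                dist[nxt] = candidate
                heapq.heappush(heap, (candidate, nxt))
    return dist


# Toy claims (invented data): one two-author and one three-author paper.
claims = {"p1": ["ann", "bob"], "p2": ["ann", "bob", "cat"]}
strength = symmetric_strengths(claims)
total_ann = sum(strength["ann"].values())
dist = dijkstra(strength, "ann")
print(total_ann, dist)
```

In the toy data "ann" has two co-authored papers and a total strength of 2.0, while her distances to "bob" and "cat" are fractions or non-unit values rather than simple step counts.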
network 3: random walk network In this type of network, we norm the collaboration strength of each author to be one. This generates an asymmetric network where inward edges are shorter for important authors who have written more papers. This type of measure is used in SNA to measure prestige. We used the Dijkstra algorithm to find the shortest paths. This will find only one shortest path between any two nodes.
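The normalisation and the resulting asymmetry can be illustrated as below. The sketch assumes that each author's strengths are divided by that author's total, and that edge length is again the inverse of the normalised strength; the function name and the toy "star" network are hypothetical.

```python
def normalise_strengths(strength):
    """Scale each author's outgoing strengths so that they sum to one."""
    normalised = {}
    for author, partners in strength.items():
        total = sum(partners.values())
        normalised[author] = {p: s / total for p, s in partners.items()}
    return normalised


# Toy network (invented data): "star" has written one paper with each of x, y, z.
strength = {
    "star": {"x": 1.0, "y": 1.0, "z": 1.0},
    "x": {"star": 1.0},
    "y": {"star": 1.0},
    "z": {"star": 1.0},
}
normalised = normalise_strengths(strength)

# Assumed edge length: inverse of the normalised strength, per direction.
length = {a: {p: 1.0 / w for p, w in ps.items()} for a, ps in normalised.items()}
print(length["x"]["star"], length["star"]["x"])
```

The edge from "x" into "star" has length one, while the edge from "star" out to "x" is three times as long, so edges pointing into the more prolific author are the shorter ones.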
centrality measures For each network, we can look at two centrality measures. – closeness centrality: a node is more central if it has a shorter average shortest path leading to all other nodes. – betweenness centrality: a node is more central if it lies on more of the shortest paths leading from one node to another. These centrality measures rank authors from the most central to the least central.
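Both measures can be sketched for the unweighted (binary) case. Closeness is computed here as the average shortest-path length, so a smaller value means a more central node; betweenness uses the well-known Brandes accumulation scheme, which is a swapped-in technique and not necessarily what was used for the talk. Function names and the toy path graph are hypothetical.

```python
from collections import deque


def closeness(neighbours, source):
    """Average shortest-path length from source to all other reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    others = [d for n, d in dist.items() if n != source]
    return sum(others) / len(others)


def betweenness(neighbours):
    """Number of shortest paths through each node (fractional when paths tie)."""
    score = {n: 0.0 for n in neighbours}
    for s in neighbours:
        dist = {s: 0}
        sigma = {n: 0 for n in neighbours}  # shortest-path counts from s
        sigma[s] = 1
        preds = {n: [] for n in neighbours}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in neighbours[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {n: 0.0 for n in neighbours}
        for w in reversed(order):  # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {n: value / 2 for n, value in score.items()}


# Toy path graph a - b - c - d (invented data).
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
between = betweenness(path)
close_a, close_b = closeness(path, "a"), closeness(path, "b")
print(between, close_a, close_b)
```

On the path graph the two inner nodes each lie on two shortest paths and have the shorter average distance, so both measures rank them above the end nodes.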
notation for centrality measures – BIC: closeness centrality in the binary network – BIB: betweenness centrality in the binary network – SYC: closeness centrality in the symmetric weighted network – SYB: betweenness centrality in the symmetric weighted network – RWC: closeness centrality in the random walk network – RWB: betweenness centrality in the random walk network
comments All three closeness measures produce very similar rankings. SYB and BIB are close, but RWB is quite far off both of them. Overall, the choice between betweenness and closeness seems to be more important than the choice between network models. This has been a surprise to us. BIC and BIB are close by 60%; the others are even lower.
adding the number of documents We can add the number of documents as an additional ranking criterion NDO. We get NDO BIC BIB SYC SYB RWC RWB NDO 184.108.40.206.60.70.19 Overall, the weighted network appears to be best correlated with the number of documents. This should come as no surprise.
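The talk does not say which statistic is used to compare rankings. One common choice is Spearman's rank correlation, sketched here purely as an assumption; the function name and the toy rankings are hypothetical and the values are not results from the talk.

```python
def spearman(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same items, without ties."""
    items = sorted(rank_a)
    n = len(items)
    d_squared = sum((rank_a[x] - rank_b[x]) ** 2 for x in items)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Toy rankings (invented data): the two measures disagree only on the top two authors.
by_closeness = {"ann": 1, "bob": 2, "cat": 3, "dan": 4}
by_documents = {"ann": 2, "bob": 1, "cat": 3, "dan": 4}
rho = spearman(by_closeness, by_documents)
print(rho)
```

A value of 1 means identical rankings and -1 means exactly reversed rankings; real centrality rankings contain ties, which this simplified formula does not handle.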
why add this alien number NDO? We can think of NDO as the simplest indication of the personal fame of an author. If we want to incentivize authors to climb the ranks of a collaboration centrality ranking, we need to have people at the top that they actually recognize. Remember Groucho Marx "I'll never join a club that accepts me as a member". Thus the symmetric weighted network appears appealing.
symmetric weighted network If we use the symmetric weighted network in an interface, the numbers that come out for closeness are not intuitive because the total lengths are fractions. But the fact that there should be much less path multiplicity makes the presentation simpler. But the paths may be longer (in simple counts of intermediate nodes) than in the binary model.
RAS-CIS The most difficult aspect is to build the interface when there is no similar service present at this time. The updating cannot be done instantaneously, but ought to be close to it. – If the contributions profile of an author changes, we can recalculate her paths. – We can also recalculate the paths of her co-authors. – But then we end up with an overall network that is no longer symmetric.
more work RAS authorship data are a high-quality dataset that is easy to use. It is not widely used at this point. Note in particular that much of the data affecting collaboration has not been worked on – affiliation data – journal/series data – subject classification data New ideas and partnerships welcome!
Thank you for your attention! http://openlib.org/home/krichel write to email@example.com