1 Writeslike.us Em Tonkin, Andrew Hewson

1 Writeslike.us
Em Tonkin, Andrew Hewson
e.tonkin@ukoln.ac.uk, a.hewson@ukoln.ac.uk

2 Background
Relevant research themes:
- Metadata harvesting and reuse
- Automatic metadata extraction
- Text analysis
- Social network analysis
- Scholarly communication, particularly informal communication

3 Aim
Helping people to find each other:
- Finding other researchers with similar interests to yourself in your geographic area
- Or in your area of research
- Not everybody with similar interests will attend the same conferences!
- Helping students find potential research supervisors
- Encouraging serendipity

4 Relevant technologies
In fact there are an awful lot of these.
Social network analysis:
- Requires a very large dataset
- Solvable either by:
  a) being Facebook or similar (but adoption rates are far from 100%)
  b) automated analysis of relevant data
- Solution b) is cheap, simple, and very fallible
- Not a new approach: it is at the core of bibliometrics
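
As a rough illustration of option b), the sketch below derives a co-authorship graph from nothing more than per-paper author lists. The function name `coauthor_edges` and the sample data are invented for this example; the slides do not say how the project built its graph.

```python
from collections import Counter
from itertools import combinations

def coauthor_edges(records):
    """Count how often each pair of authors appears on the same paper.

    `records` is a list of author-name lists, one per paper (hypothetical
    input shape). Pairs are stored in sorted order so (a, b) == (b, a).
    """
    edges = Counter()
    for authors in records:
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges

# Invented sample data: three papers.
papers = [
    ["Doe, J.", "Smith, J."],
    ["Smith, J.", "Doe, J.", "Roe, R."],
    ["Roe, R."],
]
edges = coauthor_edges(papers)
```

Edge weights of this kind are cheap to compute but only as reliable as the author names they are built from, which is why disambiguation matters.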

5 Relevant technical problems
Author identity disambiguation:
- Formal social networks disambiguate between instances of individual names (for example, if there are many people called 'John Smith', the system can tell you which is which)
- Needs to be solved to an acceptable level
- Need to define how good 'acceptable' is
- Formal solutions usually depend on unique identifiers + registries
- Cheap, moderately effective solution: disambiguate via textual characteristics + metadata
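
A minimal sketch of the "cheap, moderately effective" route, assuming each record carries a name, an optional institution, and some text such as a title or abstract. The record layout, the bag-of-words features, and the 0.3 threshold are all illustrative assumptions rather than the project's actual method.

```python
import math
from collections import Counter

def term_vector(text):
    """Lower-cased bag of words; a stand-in for richer textual features."""
    return Counter(word for word in text.lower().split() if len(word) > 3)

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def same_author(rec_a, rec_b, threshold=0.3):
    """Guess whether two records with the same name describe one person.

    A shared institution (metadata) is treated as strong evidence;
    otherwise fall back on textual similarity. The threshold is an
    arbitrary illustrative value, not a tuned one.
    """
    if rec_a["name"] != rec_b["name"]:
        return False
    inst_a, inst_b = rec_a.get("institution"), rec_b.get("institution")
    if inst_a and inst_a == inst_b:
        return True
    return cosine(term_vector(rec_a["text"]), term_vector(rec_b["text"])) >= threshold

# Invented records for three papers by a 'John Smith'.
records = [
    {"name": "Smith, John", "institution": "Bath",
     "text": "metadata harvesting from institutional repositories"},
    {"name": "Smith, John", "institution": None,
     "text": "automatic metadata extraction for repositories"},
    {"name": "Smith, John", "institution": "Leeds",
     "text": "glacier dynamics in alpine valleys"},
]
```

With this data, `same_author(records[0], records[1])` is true on textual grounds alone, while `same_author(records[0], records[2])` is false. Deciding what error rate counts as 'acceptable' amounts to choosing the threshold.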

6 Methodology
Harvest OAI metadata, which captures a large list of:
- Author names (somewhat randomly formatted)
- Digital object titles, descriptions (sometimes), dates (sometimes) and content (sometimes)
- Citations (sometimes)
Spider digital objects and analyse them for formal metadata: retrieve email addresses, etc.
Retain OAI source: a useful clue regarding author affiliations (sometimes)
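
The two extraction steps can be sketched as follows, assuming records arrive in the standard `oai_dc` Dublin Core format. The sample record, the helper names `parse_record` and `find_emails`, and the deliberately simple email pattern are assumptions made for this example; no network harvesting is shown.

```python
import re
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set used inside oai_dc records.
DC = "{http://purl.org/dc/elements/1.1/}"

# Invented sample record; real repositories vary in which fields they fill.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An invented example title</dc:title>
  <dc:creator>Smith, John</dc:creator>
  <dc:creator>J. Doe</dc:creator>
  <dc:date>2009</dc:date>
</oai_dc:dc>"""

def parse_record(xml_text):
    """Pull the 'sometimes present' fields out of one oai_dc record."""
    root = ET.fromstring(xml_text)

    def values(tag):
        return [el.text.strip() for el in root.iter(DC + tag) if el.text]

    return {
        "title": values("title"),
        "creators": values("creator"),
        "date": values("date"),
        "description": values("description"),  # often empty
    }

# Simple pattern; it will miss unusual addresses and obfuscated ones.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_emails(text):
    """Return the distinct email-like strings found in a document's text."""
    return sorted({match.lower() for match in EMAIL.findall(text)})

record = parse_record(SAMPLE)
emails = find_emails("Contact Jane.Roe@example.org or j.doe@example.ac.uk.")
```

Every field comes back as a list, possibly empty, which keeps the 'sometimes' nature of the metadata explicit for later steps.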

7 Methodology (II)
Analyse text for noun-phrase-like structures: a useful clue as to theme
Background information required, such as:
- Institution name, domains/URLs associated with each institution
- Retrieved via harvesting from Wikipedia
- Much of this information is not well-structured, so unavailable via DBpedia
Poorly structured information needs filtering: for example, author names are not consistently structured between repositories. This is a machine learning problem.
Search with contextual network graph algorithm
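
The slides do not describe the search implementation. The sketch below shows the general idea usually associated with contextual network graph search, namely spreading activation over a bipartite graph of authors and terms, in a simplified form. The function names, the `decay` and `threshold` values, and the sample data are assumptions for illustration only.

```python
from collections import defaultdict

def build_graph(author_terms):
    """Bipartite graph: each author node links to the term nodes they use."""
    graph = defaultdict(set)
    for author, terms in author_terms.items():
        for term in terms:
            graph[("author", author)].add(("term", term))
            graph[("term", term)].add(("author", author))
    return graph

def search(graph, query_terms, decay=0.5, threshold=0.01):
    """Spread energy outwards from the query terms and rank authors by it.

    Each node keeps the energy it receives and passes a decayed share,
    split equally among its neighbours, until shares fall below
    `threshold`. Both parameters are illustrative, untuned values.
    """
    energy = defaultdict(float)
    frontier = [(("term", t), 1.0) for t in query_terms if ("term", t) in graph]
    while frontier:
        node, amount = frontier.pop()
        energy[node] += amount
        share = amount * decay / len(graph[node])
        if share >= threshold:
            frontier.extend((neighbour, share) for neighbour in graph[node])
    scores = {name: e for (kind, name), e in energy.items() if kind == "author"}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented data: terms extracted from each author's papers.
graph = build_graph({
    "A": {"metadata", "repositories"},
    "B": {"metadata", "text mining"},
    "C": {"glaciers"},
})
results = search(graph, ["repositories"])
```

Here author A ranks first because they use the query term directly, B appears with a low score through the shared term "metadata", and C is never reached. That indirect match is what lets this kind of search surface people with related rather than identical interests.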

8 'Sometimes' and 'usually'
Statistics are:
- Cheap
- Imperfect
- Available
Rapid innovation philosophy:
- Cheap is good
- Simple is good
- Solutions requiring novel/additional uptake of infrastructure are out of reach

9 Results
- Basic concept worked well
- Law of diminishing returns: beyond the first 80-90%, increasing effort led to only minor improvements in the dataset (minor niggles!)
- Interface development actually required more time than the dataset development, and exceeded project length...
- But a useful dataset can be released as linked data and reused for various purposes

10 Walkthrough: Basic search (the harder method!)

11 Advanced search

16 Walkthrough

17 Conclusion
- OAI-DC (and Wikipedia!) is a good source of 'semi-structured' data
- There is a great deal of potential for using this together with appropriate analysis tools, such as those explored within the FixRep project, to develop social-network-like graphs
- Application of this type of data for the purpose of encouraging informal academic communication/collaboration is an interesting research field with many potential applications
