Download presentation
Presentation is loading. Please wait.
Published byMatthew McLain Modified over 10 years ago
1
Current work on CitEc José Manuel Barrueco Cruz http://www.uv.es/~barrueco Thomas Krichel http://openlib.org/home/krichel
2
Data Papers from RePEc dataset –31139 Working Papers –15145 Journal Articles all of them available online, not all are free More than 90% of them are in PDF or PostScript formats
3
Harvesting Perl script that: – Reads the RePEc data – Downloads the documents full text – Converts them to ASCII (using pstotext) – Tries to find a Reference section
4
Test on 1000 documents 13% are not found in the URL specified 3% are not it PDF or PS 15% give errors in the pstotext conversion 9% are converted but a reference section can not be found 60% were successfully converted
5
Parsing problems of CiteSeer Publication date. When a reference contains more than one year it is discarded Source of publication, i.e. working papers series or journals titles is not parsed be CiteSeer. We will need to add code with a list of all journals and working paper series.
6
To do Study of citation patterns Use of data in user services Use of data in logging and registration services
7
Thank you for your attention. Contact José Manuel Barrueco Cruz for more information
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.