© Brigitte Jörg June 4th, 2008 in Maribor, Slovenia Project Results Analyzing European Research Competencies in IST – Results from a European SSA Project.


1 Analyzing European Research Competencies in IST – Results from a European SSA Project
Brigitte Jörg (DFKI), Jure Ferlez (IJS), Hans Uszkoreit (DFKI), Mitja Jermol (IJS)
June 4th, 2008 in Maribor, Slovenia

2 Project Information
- Funding Organization: European Commission
- Funding Program: Sixth Framework Programme (FP6: IST, 3rd Call)
- Project Type: Specific Support Action (SSA)
- Duration: 32 Months (April 2005 – November 2007)
- Project Co-ordination: DFKI GmbH
- Technical Co-ordination: Jozef Stefan Institute (IJS)
- Technology Partners: DFKI, IJS, Ontotext, CCLRC
- Project Consortium: 15 partners from EU MS, NMS and ACC

3 Project Consortium
- Deutsches Forschungszentrum für Künstliche Intelligenz, Germany
- Jozef Stefan Institute, Slovenia
- Ontotext Lab, Sirma AI EAD, Bulgaria
- RTD Talos, Cyprus
- Institute of Information Theory and Automation, Czech Republic
- Archimedes Foundation, Estonia
- Computer and Automation Research Institute, Hungarian Academy of Sciences, Hungary
- Institute of Mathematics and Computer Science, University of Latvia, Latvia
- Lithuanian Innovation Centre, Lithuania
- Projects in Motion, Malta
- Technical University of Silesia, Poland
- National Institute for R&D in Informatics, Romania
- Slovak University of Technology, Slovakia
- TUBITAK, Turkey
- The Science and Technology Facilities Council, UK (formerly CCLRC, UK)

4 Technology Partners
- DFKI (Co-ordinator): "LT World" Portal, Information Extraction, Semantic Web
- Jozef Stefan Institute (Technical Co-ordinator): "Project Intelligence", Data Mining, Social Network Analysis
- Ontotext: "KIM Semantic Annotation Platform"
- euroCRIS: "CERIF" Standard, Access to Data

5 Project Objectives
- Set up and populate an information portal on IST research
- Provide information about RTD actors and their experience and expertise
- Provide innovative and automated services
  - To promote RTD competencies in specific fields
  - To support partner search for IST proposals and commercial projects

6 Presentation Outline
- Information Repository
  - Data Collection
  - Data Integration / Data Cleaning
  - Evaluation of Results
- Analytic Tools
- Overall Conclusion

7 Repository Features
- Information Repository (CERIF 2004) containing
  - Organisation
  - Person
  - Project
  - Publications
- Data Collection (CERIF XML) from
  - National CRISs
  - National Collections
  - Web Crawlings
  - Community Support
- Data Integration into ONE single dataset
  - to enable analysis at European Level
- Data Cleaning with
  - Supervised Machine Learning Methods (Active Learning)

8 Repository Data Analysis
- Duplicate records inherent in single datasets
- Even more duplicate records after merging single datasets
- Most obvious duplicates for organisations and persons
  - no significant number of duplicate projects
  - publications have been ignored
- Duplicate records are a known problem

9 Formal Problem Definition (Winkler 2006)
- Problem: duplicate detection in record set A
- Given: a set of records in A
- Classify: every pair $(a, b) \in A \times A$ into $M$ (set of true matches) or $U$ (set of true non-matches)
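As a minimal sketch of this problem statement, the Python below splits record pairs into M and U using a caller-supplied decision rule. The function name `classify_pairs`, the sample records and the case-insensitive rule are illustrative assumptions, not part of the project. The sketch also enumerates only unordered pairs of distinct records, whereas the definition above ranges over all of A x A.

```python
from itertools import combinations

def classify_pairs(records, is_match):
    """Split all unordered pairs of distinct records into M (matches) and U (non-matches)."""
    M, U = [], []
    for a, b in combinations(records, 2):
        (M if is_match(a, b) else U).append((a, b))
    return M, U

# Hypothetical records and a deliberately naive decision rule (assumption).
records = ["DFKI GmbH", "dfki gmbh", "Jozef Stefan Institute"]
M, U = classify_pairs(records, lambda a, b: a.lower() == b.lower())
```

The number of pairs grows quadratically with the number of records, so a real deployment would likely need to restrict the candidate pairs before classifying them.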

10 IST World Problem Definition
- Heuristic Analysis of Random Samples: National Datasets / Cordis Datasets
  - most obvious duplicates found inside Cordis FP5 and Cordis FP6 datasets and across Cordis FP5 and FP6 datasets
  - not so many duplicates found in national datasets
  - a lot of duplicate person records across all datasets
  - no duplicate records found in project datasets
  - only some duplicate records across project datasets
  - publications have not been examined
- Decision taken with respect to the IST World scope
  - not touching project records
  - ignore publication records
  - find a solution for person records (IST World Community)
  - concentrate on cleaning organisation records

11 Data Cleaning Application
Problems with Organisation Records
Most entries had slightly different names caused by additional special characters or character modifications
- Capitalization, Lowercase Letters
- Blanks, extra Spaces
- Hyphens
- Quotes
- Comma in Different Places
- Article in Name
- Full stop in Name
- Incomplete Names
- English Translation
- Word Order
- Language Specific Characters (Jorg instead of Jörg)
- Special Characters (wrong encoding &, ?, )
- Mixture of Organisation Names and Department Names
- Differences in Addresses
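Several of the surface variations listed above (case, extra spaces, hyphens, quotes, commas, full stops, articles, language-specific characters) can be reduced by a canonicalisation step before comparison. The following Python is only a sketch under that assumption. The function `normalize_name` and the choice of English articles are hypothetical, and the slides do not say that the project normalised names this way. Incomplete names, translations and word order are not addressed here.

```python
import unicodedata

# Punctuation treated as separators (illustrative choice, an assumption).
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in "-\"'“”„‘’,."})
_ARTICLES = {"the", "a", "an"}  # English articles only (assumption)

def normalize_name(name: str) -> str:
    """Return a canonical form of an organisation name for comparison."""
    # Decompose accented letters and drop the combining marks (Jörg -> Jorg).
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-insensitive form, with hyphens, quotes, commas and full stops as spaces.
    cleaned = stripped.casefold().translate(_PUNCT_TO_SPACE)
    # split() collapses repeated blanks; articles are dropped.
    tokens = [tok for tok in cleaned.split() if tok not in _ARTICLES]
    return " ".join(tokens)
```

For example, `normalize_name("Jörg")` yields `"jorg"`, so the two spellings mentioned in the list compare as equal after normalisation.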

12 IST World Dataset Integration
- Organisation Names: Fulltext Indexing
- Querying Organisation Names + Location
  (1) Name/Location Strings (Bag of Words)
  (2) Word/Character Order (String Kernels)
  (3) Spelling Errors (Edit Distance Measure)
  (4) Normalization of (1-3)
- Human Decision: M = Match, U = Non-Match, - = unknown
- Machine Learning (Support Vector Machine): M = Match, U = Non-Match, - = unknown
- Machine Decision: M = Match, U = Non-Match
- Knowledge about Records
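To make the three similarity signals concrete, here is a rough Python sketch that turns a pair of name strings into a feature vector with values between 0 and 1. It is not the project's implementation. Word-set Jaccard overlap stands in for the bag-of-words feature, character-trigram overlap is a much simpler substitute for a string kernel, and the Levenshtein distance is scaled by the longer string's length. All function names are assumptions.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def jaccard(x, y) -> float:
    """Overlap of two collections as |intersection| / |union|."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)

def char_ngrams(s: str, n: int = 3) -> set:
    """Set of character n-grams of s (empty if s is shorter than n)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def pair_features(a: str, b: str) -> list:
    """Return [word overlap, trigram overlap, edit similarity] for two names."""
    word_sim = jaccard(a.split(), b.split())              # (1) bag of words
    ngram_sim = jaccard(char_ngrams(a), char_ngrams(b))   # (2) order-sensitive stand-in
    longest = max(len(a), len(b)) or 1
    edit_sim = 1.0 - levenshtein(a, b) / longest          # (3) spelling errors
    return [word_sim, ngram_sim, edit_sim]                # (4) all scaled to [0, 1]
```

Vectors of this kind are what a classifier such as a support vector machine would be trained on, with the human M/U decisions as labels. Reordering the words of a name leaves the first feature at 1.0 while lowering the other two, which is why the slide lists word order and spelling as separate signals.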

13 Active Learning Application
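The slides do not state which query strategy the active learning application used. One common choice is uncertainty sampling, where the pairs whose classifier decision value is closest to zero are shown to the human annotator first. The Python below sketches only that selection step. The linear scorer, its weights and the candidate pool are hypothetical placeholders for a trained model and real feature vectors.

```python
def decision_value(features, weights=(1.0, 1.0, 1.0), bias=-1.5):
    """Hypothetical linear scorer: positive suggests M, negative suggests U."""
    return sum(w * f for w, f in zip(weights, features)) + bias

def most_uncertain(candidates, score, k=2):
    """Return the k candidates whose score is closest to the decision boundary."""
    return sorted(candidates, key=lambda c: abs(score(c)))[:k]

# Illustrative pool of unlabelled feature vectors (assumption).
pool = [[0.9, 0.9, 0.9], [0.5, 0.5, 0.5], [0.1, 0.0, 0.2], [0.6, 0.4, 0.6]]
to_label = most_uncertain(pool, decision_value, k=2)
```

After the annotator labels the selected pairs as M or U, the model would be retrained and the selection repeated.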

14 Evaluation of Results in CORDIS FP6 dataset
- human evaluation of 1000 organisation record pairs
  - 30 M correct; 934 U correct
  - 1 M incorrect; 35 U incorrect
  - 97% precision
  - 46% recall
- integration approach worked well
  - can be used for large scale integration tasks
- Result: semi-automated identification of 4000 duplicates with high accuracy and a reasonable recall
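The reported precision and recall follow from the four counts if "M incorrect" is read as a false positive and "U incorrect" as a false negative. That reading is an interpretation, chosen because it reproduces the stated percentages. A short Python check:

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Counts from the slide: 30 M correct, 934 U correct, 1 M incorrect, 35 U incorrect.
tp, tn, fp, fn = 30, 934, 1, 35
precision, recall = precision_recall(tp, fp, fn)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
# prints: precision = 96.8%, recall = 46.2%
```

The four counts sum to the 1000 evaluated pairs. Precision is high because only 1 of the 31 predicted matches was wrong. Recall is lower because 35 of the 65 true duplicates were labelled as non-matches.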

15 Analytic Tools
- Advanced Tools
  - Collaboration Diagram
  - Competence Diagram
- Experimental Tools
  - Collaboration Trends
  - Competence Trends
  - Consortia Prediction
  - Semantic Search

16 How to analyze or generate a Diagram
- definition of a query in the IST World Portal
- get a list of result records matching the query
- generate diagrams based on results

17 Competence Diagram
Query: IST SSA projects within FP6
Aim: investigate the thematic range of SSA projects in FP6
Thematic Areas (Blue Clouds): SEMANTIC, HEALTH, LEGAL, CHANGING, ROADMAP, SOFTWARE
Projects (Red Dots): Linked with Full Record in Repository

18 Competence Diagram
Query: IST SSA projects within FP6
Aim: investigate the thematic range of SSA projects in FP6
Goals (List of Keywords): DEMENTIA, PEOPLE, MEDICAL, STANDARDS, …
Configuration of Result Space: 40% of result list, 30 topics

19 Competence Diagram
Query: IST SSA projects within FP6
Aim: investigate the thematic range of SSA projects in FP6
Configuration of Result Space: 40% of result list, 30 topics
Diagram labels: Goals, Themes

20 Collaboration Diagram
Query: IST SSA projects within FP6
Aim: investigate the collaboration of SSA partners in FP6
Configuration of Result Space: 20% of result list
Diagram labels: Number of joint partners, Project

21 Evaluation of Analytic Tools
- IST World made it possible to perform the defined tasks
  - for more details see the full paper in the Proceedings
- All analytics depend on the underlying data
- The analytic tools are very powerful

22 Evaluation of Queries
- Query execution performed in March 2008
- Queried datasets: IST World / Cordis
  - IST World Portal: http://www.ist-world.org/
  - CORDIS Search: http://cordis.europa.eu/en/home.html

23 Results of Query Evaluation
Discovered inconsistencies with Cordis data:
- "FP6" string: 30 of 80 relevant records missed the string
- "SSA" string: 15 of 208 relevant records missed the string
- "Specific Support Action" string: 15 of 208 relevant records missed the string
- Dates (Year of the call): not consistently recorded
- Query 1: 22 projects contained the string "Coordination Action", "Specific Targeted Action", "Integrated Project", others
- An investigation of the results of Query 1 in Cordis revealed: 80 projects of the result list are missing in IST World

24 Overall Conclusion
- Integration Method:
  - Could be further developed
  - Test data could be used to generate a better classification model
  - Feature generation could be improved by using ontological knowledge
  - Transfer learning methods might be helpful for re-use of the learned model
- Evaluation of large Datasets:
  - very difficult
  - needs expert knowledge
- Analytic Tools:
  - depend on the quality of the underlying data
  - are very powerful for investigation of large datasets

25 European Research Dataset (entries), January 2008
- European Research: 55078 Orgs, 30489 Proj, 58164 Exp, 165795 Pubs
- Bulgaria: 794 Orgs, 73 Proj, 10940 Exp, 19023 Pubs
- Cyprus: 29 Orgs
- Czech Republic: 183 Orgs, 163 Proj, 164 Exp
- Estonia: 75 Orgs, 1256 Proj, 6726 Exp, 51376 Pubs
- Hungary: 2665 Orgs, 1297 Proj, 2425 Exp
- Latvia: 106 Orgs, 830 Proj, 701 Exp
- Lithuania: 102 Orgs
- Malta: 58 Orgs, 27 Proj, 898 Exp, 180 Pubs
- Poland: 1451 Orgs, 2179 Proj, 7392 Exp, 16086 Pubs
- Romania: 169 Orgs, 68 Proj, 87 Exp
- Serbia: 60 Orgs, 2278 Exp, 79130 Pubs
- Slovenia: 1723 Orgs, 3748 Proj, 11655 Exp
- Slovakia: 56 Orgs, 432 Proj, 683 Exp
- Turkey: 285 Orgs
- EPRI-start: 286 Orgs, 275 Exp
- Cordis FP5+FP6: 48988 Orgs, 20436 Proj, 13941 Exp
- Community: 61 Orgs, 41 Proj, 435 Exp

26 Beyond the Project
- IST World is online: http://www.ist-world.org/
- Registration is free
- Create your Competence Map / Collaboration Map
- Continuation is planned …

