
Alexandre Donizeti Alves, Horacio Hideki Yanasse, Nei Yoshihiro Soma. October 24, 2011. 11th Workshop on Domain-Specific Modeling.


1 Alexandre Donizeti Alves, Horacio Hideki Yanasse, Nei Yoshihiro Soma. October 24, 2011. 11th Workshop on Domain-Specific Modeling

2 Introduction The Lattes Platform is an information system deployed by CNPq (National Council for Scientific and Technological Development) to manage information on science, technology and innovation related to researchers and institutions in Brazil. This platform is the major source of information available on Brazilian researchers.

3 Introduction: Lattes Platform

4 Introduction The Lattes CV system, a curricular information system, is the main component of the platform. Currently, the Lattes CV system stores around 2,000,000 curricula of researchers, lecturers, students and professionals from diverse areas of knowledge.

5 Introduction: Lattes CV system Jorge Almeida Guimaraes

6 Introduction: Lattes curriculum (English)

7

8 Introduction: Lattes curriculum (Portuguese)

9 Introduction In recent years, many studies have used data extracted from the Lattes Platform about researchers in different areas of knowledge. A common problem reported in these studies is that the curricula and the information extracted from them had to be obtained manually.

10 Introduction Therefore, this system has very high potential for quality information extraction.

11 LattesMiner LattesMiner is an internal multilingual DSL for automatic information extraction from Lattes curricula. It is composed of a set of classes written in Java that allow developers to implement their own applications with a high level of abstraction and expressive power.

12 LattesMiner
Data Discovery is used to find the ID numbers of the researchers. Usually, only the name of the researcher is available.
Data Acquisition is responsible for downloading the Lattes curricula of the researchers from the Lattes CV system on the Web.
Data Extraction is the main component of LattesMiner. It is responsible for extracting data from the HTML files, using information extraction based on regular expressions.
Data Structure stores the extracted data in XML files or in any database.
Data Visualization is responsible for the identification and visualization of academic social networks. These networks are identified by checking the relationships between researchers.
Data Analysis is responsible for the analysis of the extracted data and of the relationships identified.
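The regular-expression technique named for the Data Extraction component can be illustrated with a minimal sketch. The slides do not show the real Lattes HTML or the patterns LattesMiner uses, so the markup, the patterns, the sample values and the class name RegexExtractionSketch below are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractionSketch {

    // Hypothetical markup: the actual structure of a Lattes HTML page is not
    // given in the slides, so these patterns are illustrative assumptions.
    private static final Pattern NAME =
        Pattern.compile("<h2 class=\"nome\">([^<]+)</h2>");
    private static final Pattern YEAR_TITLE =
        Pattern.compile("<li>(\\d{4})\\.\\s*([^<]+)</li>");

    /** Returns the first captured name, or null if the pattern does not match. */
    static String extractName(String html) {
        Matcher m = NAME.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    /** Returns one "year: title" string for every list item that matches. */
    static List<String> extractPublications(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = YEAR_TITLE.matcher(html);
        while (m.find()) {
            result.add(m.group(1) + ": " + m.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample input, not real curriculum data.
        String html = "<h2 class=\"nome\">Maria Silva</h2>"
            + "<ul><li>2010. A paper on DSLs</li><li>2011. Another paper</li></ul>";
        System.out.println(extractName(html));
        System.out.println(extractPublications(html));
    }
}
```

Patterns of this kind are tied to the page layout, which is one reason to hide them behind a higher-level interface.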

13 LattesMiner Biodata Board BiodataIE BoardIE BoardDao BiodataDao lattes.miner lattes.miner.ie lattes.miner.en lattes.miner.dao Perfil Banca lattes.miner.br The LattesMiner class is composed by instances of classes Biodata and Board, in addition to many others not presented here.

14 LattesMiner LattesMiner was created as a fluent interface, which provides a compact yet easy-to-read representation of the domain problem. Fluent interfaces are implemented using method chaining. LattesMiner also makes use of static factory methods and static imports.
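How method chaining and a static factory method combine into a fluent interface can be sketched as follows. This is not LattesMiner's implementation: the Curriculum class, its fields and the behavior of save() are hypothetical, and only the method names load, biodata, boards and save echo names that appear in the listings.

```java
import java.util.ArrayList;
import java.util.List;

public class FluentSketch {

    /** Hypothetical builder; it only records which steps were requested. */
    static final class Curriculum {
        private final String id;
        private final List<String> steps = new ArrayList<>();

        private Curriculum(String id) {
            this.id = id;
        }

        // Each method returns this, which is what makes chaining possible.
        Curriculum biodata() {
            steps.add("biodata");
            return this;
        }

        Curriculum boards() {
            steps.add("boards");
            return this;
        }

        /** Terminal operation: summarizes the chain instead of persisting anything. */
        String save() {
            return id + " -> " + String.join(", ", steps);
        }
    }

    /** Static factory method; with a static import, callers can write load(id) directly. */
    static Curriculum load(String id) {
        return new Curriculum(id);
    }

    public static void main(String[] args) {
        // "0000" is a placeholder identifier, not a real Lattes ID.
        System.out.println(load("0000").biodata().boards().save());
    }
}
```

The private constructor forces every chain to start at the factory method, so client code reads as a sentence from load(...) to save().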

15 Case Study For the following examples, researchers in the Computer Science area who hold a CNPq Research Productivity Scholarship were considered. The list contains the names of all the researchers. However, their corresponding ID numbers are not provided.

16 Listing 1

    import java.util.*;
    import lattes.util.Util;
    import static lattes.miner.LattesMiner.*;

    public class Listing1 {
        public static void main(String[] args) {
            // Look up the ID of every researcher named in names.txt
            List<String> list = new ArrayList<String>();
            for (String name : Util.getList("names.txt"))
                list.add(search(name));
            // Write the IDs found to ids.txt
            Util.setList(list, "ids.txt");
        }
    }

Java application code

17 Listing 2

    dir("cvs");
    for (String id : Util.getList("ids.txt"))
        download(id).save();

Code fragment used to download the Lattes curricula of the researchers.

18 Listing 3

    props("mysql");
    for (String id : Util.getList("ids.txt")) {
        load(id).biodata().address();
        publications(JOURNAL).save();
    }

This listing shows how to extract data from the Lattes curricula of the researchers.

19 Listing 4

    for (String id : Util.getList("ids.txt")) {
        // Portuguese
        for (Banca b : carregar(id).bancas().getBancas()) {
            if (b.ano() == 2010)
                System.out.println(b.aluno());
        }
        // English
        for (Board b : load(id).boards().getBoards()) {
            if (b.year() == 2010)
                System.out.println(b.student());
        }
    }

Code fragment illustrating how LattesMiner is used to extract information in different languages.
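One way a second language could be layered over the same data is a thin class that delegates to its counterpart. The slides do not say how LattesMiner relates Banca to Board internally, so the delegation below, the fields and the constructors are assumptions; only the class names and the accessor names year, student, ano and aluno come from Listing 4.

```java
public class MultilingualSketch {

    /** English-facing class (fields and constructor are assumptions). */
    static class Board {
        private final int year;
        private final String student;

        Board(int year, String student) {
            this.year = year;
            this.student = student;
        }

        int year() {
            return year;
        }

        String student() {
            return student;
        }
    }

    /** Portuguese-facing view that delegates every call to a Board. */
    static class Banca {
        private final Board board;

        Banca(Board board) {
            this.board = board;
        }

        int ano() {
            return board.year();
        }

        String aluno() {
            return board.student();
        }
    }

    public static void main(String[] args) {
        // Placeholder values, not data from a real curriculum.
        Board board = new Board(2010, "Student A");
        Banca banca = new Banca(board);
        System.out.println(board.year() == banca.ano());
        System.out.println(board.student().equals(banca.aluno()));
    }
}
```

With delegation, extraction logic is written once and each language adds only vocabulary.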

20 Results SUCUPIRA is a system for the identification and visualization of academic social networks. Shown here is the geographical distribution of the five researchers who have published the most articles in scientific journals.

21 Results This is a graph of contacts of the five researchers who have published the most in scientific journals. The graph depicts an academic social network of the five researchers. Nodes are labeled with the name of the researcher. The color of an edge represents the number of relationships between the researchers it connects.
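The number of relationships behind each edge can be computed with a simple pair count. The sketch below assumes that a relationship is a shared publication and that each publication is given as a list of author names; the slides do not define either, and the class name ContactGraphSketch and the key format are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ContactGraphSketch {

    /** Counts how many author lists each unordered pair of names shares. */
    static Map<String, Integer> edgeWeights(List<List<String>> authorLists) {
        Map<String, Integer> weights = new TreeMap<>();
        for (List<String> authors : authorLists) {
            for (int i = 0; i < authors.size(); i++) {
                for (int j = i + 1; j < authors.size(); j++) {
                    String a = authors.get(i);
                    String b = authors.get(j);
                    // Order the pair so that (a, b) and (b, a) share one key.
                    String key = a.compareTo(b) <= 0 ? a + " | " + b : b + " | " + a;
                    weights.merge(key, 1, Integer::sum);
                }
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Placeholder author lists, one per publication.
        List<List<String>> papers = Arrays.asList(
            Arrays.asList("A", "B"),
            Arrays.asList("B", "A", "C"),
            Arrays.asList("C"));
        System.out.println(edgeWeights(papers));
    }
}
```

A visualization layer could then map each count to an edge color.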

22 Conclusions Currently, the Lattes curricula are available in HTML format. LattesMiner, however, does not depend on the data format because it allows users to program their own applications with a high level of abstraction. If the data format is eventually modified, the DSL interface remains the same.

23 Conclusions An advantage of LattesMiner is that it searches by the name of the researcher. LattesMiner is multilingual. Another advantage is that the extracted data can be stored in a structured format (XML or a database), allowing these data to be easily used by other applications.

24 Future work The next step, which is already being implemented in the LattesMiner DSL, is a statistical analysis of the data.

25 ACKNOWLEDGMENTS

