1
Terminology extraction and the identification of research areas: an essay for Space Engineering
Ricardo Eito Brun, Strasbourg, 5 Nov 2015
2
The value of Terminology as a research evaluation tool
From a pragmatic perspective, terminology work and terminography need to analyze the “language and vocabulary” used in the documents. New terms are created to refer to new concepts, tools or techniques. Terms are created by “combination” of existing terms. The “frequency of use” of a term may be a good indicator of its “popularity”, or of the extent to which it has been adopted by a particular community. A “decreasing frequency” of use may indicate a “lack of interest” in, or the “commoditization” of, a concept, tool or technique.
3
The value of Terminology as a research evaluation tool
In the context of scientific and technical knowledge, “recombination of existing knowledge” is a means of creating new knowledge. For example, applying a new technique or tool to a specific process may significantly improve its performance or capability, and using an existing method in a different context may help solve a known problem. There are also “research trends” followed by researchers, funding agencies, etc., that guide research investment policies.
4
The value of Terminology as a research evaluation tool
How can terminology be useful in this context? Is it possible to apply terminology analysis techniques to get a profile of research trends? Can these terminological analyses be used with a retrospective purpose (history of techniques)? Can they be used with a prospective purpose (identification of future research trends)?
5
Research Presentation
The purpose of this research is to use proven, widely available term extraction techniques, coupled with “bibliometric analysis” techniques, to characterize research trends (retrospective approach). The focus is the scientific and technical production of the European Space Agency (ESA). A preliminary analysis has been run on a small set of patents (334 documents) to assess the feasibility of the approach. Terminology extraction is done with AlchemyAPI. The TerMine tool is another candidate.
6
Context of Research
This activity is part of a bibliometric analysis of the ESA scientific and technical production. In the last 50 years, ESA staff has produced more than scientific and technical articles and proceedings. The bibliometric study aims to analyze: productivity (who the most productive authors are, productivity by period); impact (number of citations received by the different researchers, and its evolution over time); collaboration patterns (to what extent ESA collaborated with other entities, and its evolution over time); and areas of research and their evolution over time.
7
Research objectives (current)
Identify the “subject areas” and topics in the research conducted by ESA in different time periods. Characterize research topics using techniques such as term frequency and “word co-occurrence”. Analyze the “lifecycle” (evolution) of different research topics.
8
Research objectives (future stage)
Compare the “terminological profile” of the patents released in a particular period with the “terminological profile” of the “basic research” conducted before. Is there any relationship between the basic research and its translation into “working innovations” (products, methods or services)? Characterize research fronts by sets of well-defined terms. Analyse the relationships between citing and cited documents from a terminological perspective: is the research described in a specific document the result of recombining the terms used in previously conducted (cited) research?
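One possible way to make the profile comparison concrete is sketched below. It is only an illustration under stated assumptions: a “terminological profile” is assumed to be a mapping from term to weight, cosine similarity is one of several measures that could be used, and the function names `cosine` and `inherited_share` are hypothetical, not part of the original study.

```python
import math

def cosine(profile_a, profile_b):
    """Cosine similarity between two term->weight profiles (assumed representation)."""
    dot = sum(w * profile_b.get(term, 0.0) for term, w in profile_a.items())
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def inherited_share(citing_terms, cited_terms):
    """Fraction of the citing document's terms that already occur in the cited documents."""
    citing = set(citing_terms)
    if not citing:
        return 0.0
    return len(citing & set(cited_terms)) / len(citing)
```

A high `inherited_share` together with previously unseen term combinations would be compatible with the recombination hypothesis, but it is not evidence of it on its own.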
9
Research steps 1. Identify source documents
Source: WoS Derwent database. Data set: 334 patents. Export the result set to “tagged format”.
10
Research steps 2. Convert results to XML and split records
A custom XSLT style sheet, built with Altova® MapForce®, converts the “tagged” data set into XML. The record set is split into individual records. Only the “relevant fields” are kept: title, abstract, keywords.
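The conversion step can be sketched in Python as follows. This is not the XSLT/MapForce style sheet used in the study, only a stand-in that shows the same three operations (parse, split, keep relevant fields). The two-letter tags `TI`, `AB`, `DE` and the end-of-record marker `ER` are assumptions about the export layout and should be checked against the real file; the element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed tag-to-field mapping; verify against the actual tagged export.
FIELDS = {"TI": "title", "AB": "abstract", "DE": "keywords"}
END_OF_RECORD = "ER"  # assumed end-of-record marker

def parse_tagged(text):
    """Split a tagged export into records, keeping only the fields in FIELDS."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("  "):
            # Continuation line: append to the field currently being read.
            if tag in FIELDS:
                current[FIELDS[tag]] += " " + line.strip()
            continue
        tag, _, value = line.partition(" ")
        if tag == END_OF_RECORD:
            if current:
                records.append(current)
            current, tag = {}, None
        elif tag in FIELDS:
            current[FIELDS[tag]] = value.strip()
    return records

def record_to_xml(doc_id, record):
    """Serialize one record as a small XML document (hypothetical element names)."""
    root = ET.Element("record", id=str(doc_id))
    for name, value in record.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")
```

Each string returned by `record_to_xml` could then be written to its own file, giving one XML document per patent record.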
11
Research steps 2. Terminology extraction
The AlchemyAPI tool is run in batch mode. It is a command-line tool that outputs the set of “terms” and their “relevance”. A built-in PHP script processes the set of files. The result is a file with docId, term and weight in document. AlchemyAPI can be called from different programming languages. It does not extract only “words”, but also “terms made up of two or more words”.
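To illustrate the shape of the output (docId, term, weight), here is a deliberately naive stand-in extractor. It does not call AlchemyAPI and does not reproduce its relevance scoring: candidate terms are simply runs of up to three consecutive non-stop-words, and the weight is the term count divided by the highest count in the document. The stop-word list and function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the one used in the study).
STOP = {"a", "an", "the", "of", "for", "and", "is", "in", "to", "with", "on"}

def candidate_terms(text, max_len=3):
    """Yield every 1..max_len word sequence found inside runs of non-stop-words."""
    chunk = []
    # The trailing "" acts as a sentinel that flushes the last run.
    for tok in re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()) + [""]:
        if tok and tok not in STOP:
            chunk.append(tok)
            continue
        for n in range(1, max_len + 1):
            for i in range(len(chunk) - n + 1):
                yield " ".join(chunk[i:i + n])
        chunk = []

def extract(doc_id, text):
    """Return (doc_id, term, weight) rows; weight is count / highest count."""
    counts = Counter(candidate_terms(text))
    top = max(counts.values(), default=1)
    return [(doc_id, term, round(c / top, 2)) for term, c in counts.most_common()]
```

Unlike a real extractor, this sketch ignores sentence boundaries and part of speech, so it over-generates candidates; it is only meant to show that compound terms, not just single words, end up in the output rows.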
12
Research steps 2. Terminology extraction
13
Research steps 3. Terminology clean-up
The output generated by the tool was visually inspected to identify “extracted terms” that should be removed and excluded from later processing. These were mostly words appearing in section titles, or generic terms (e.g. “Advantage”). Defining “stop word lists” to guide term extraction is being considered as a way to get more accurate results.
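A stop-list filter over the (docId, term, weight) rows could look like the sketch below. The entries in `GENERIC_TERMS` are hypothetical examples apart from “advantage”, which the slide mentions; the real list would come from the visual inspection described above.

```python
# Hypothetical stop list of generic terms and section-title words.
GENERIC_TERMS = {"advantage", "description", "claims", "background"}

def clean(rows, stop_terms=GENERIC_TERMS):
    """Split rows into (kept, removed) so the removed terms can still be reviewed."""
    kept, removed = [], []
    for doc_id, term, weight in rows:
        target = removed if term.strip().lower() in stop_terms else kept
        target.append((doc_id, term, weight))
    return kept, removed
```

Returning the removed rows as well keeps the manual check possible while the stop list is still being refined.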
14
Research steps 4. Analysis of terms used in a specific period.
Extracted terms are “tagged” with a specific time period. The extraction process was run with a “five-year period”, which can be changed to a one-year period. Tagging the extracted terms with dates makes it possible to follow the evolution in the use of terms over time. Word clouds can be generated with the most relevant terms per period; they give a quick overview of the “main concepts” involved in the research for that period.
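The period tagging can be sketched as follows, assuming each document has a known year and that the weights of a term are summed per period (the slides do not say how weights are aggregated, so summing is an assumption). The `origin` year that anchors the five-year bins is also hypothetical.

```python
from collections import defaultdict

def period_label(year, width=5, origin=1965):
    """Label of the period containing `year`; bins start at `origin` (assumed)."""
    start = origin + ((year - origin) // width) * width
    return f"{start}-{start + width - 1}" if width > 1 else str(start)

def weights_by_period(rows, doc_year, width=5):
    """Sum term weights per (period, term) from (doc_id, term, weight) rows."""
    totals = defaultdict(float)
    for doc_id, term, weight in rows:
        totals[(period_label(doc_year[doc_id], width), term)] += weight
    return dict(totals)
```

Calling `weights_by_period(rows, doc_year, width=1)` switches to one-year periods without changing anything else, and the per-period totals are the kind of input a word-cloud generator would need.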
15
Research steps 4. Analysis of terms used in a specific period.
16
Research steps 5. Evolution of the “Use of terms”
The evolution of the “weight” of individual terms may help identify “research trends”: these values show how important a term was in the different periods. At this stage, a second conceptual analysis is needed to group terms that refer to more generic or more specific concepts. Setting up these hierarchies allows analysis at different levels, e.g. “research on propulsion” versus “research on combined ion-electric propulsion”. Setting up these hierarchies is manual work, done with the support of subject experts.
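Once experts have defined the hierarchy, rolling weights up to broader concepts is mechanical. The sketch below assumes the hierarchy is stored as a narrower-to-broader mapping with no cycles; the entries in `BROADER` are hypothetical examples built from the propulsion case above.

```python
from collections import defaultdict

# Hypothetical expert-defined mapping from a narrower term to its broader term.
BROADER = {
    "combined ion-electric propulsion": "propulsion",
    "ion thruster": "propulsion",
}

def roll_up(period_weights, broader=BROADER):
    """Add each term's weight to all of its ancestors, per period.

    `period_weights` maps (period, term) to a weight. The mapping in
    `broader` is assumed to be acyclic.
    """
    out = defaultdict(float)
    for (period, term), weight in period_weights.items():
        out[(period, term)] += weight
        parent = broader.get(term)
        while parent:
            out[(period, parent)] += weight
            parent = broader.get(parent)
    return dict(out)
```

The result keeps the specific terms and adds the aggregated concept, so the same data can be read at either level.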
17
Research steps 5. Evolution of the “Use of terms” (example)
18
Research steps 5. Evolution of the “Use of terms”
19
Research steps 6. “Concept identification”
Co-occurrence of terms is considered a good indicator of “relationships between concepts”. This idea has been widely used in bibliometric analysis and Information Retrieval to analyse “areas of knowledge”. In our case, term co-occurrence may be considered an indicator of “patterns of knowledge recombination”. Co-occurrence is calculated with the BibExcel tool, using the output generated with AlchemyAPI. Note: some bibliometric tools perform co-occurrence analysis, but they work on “single words”, not “compound terms”.
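The counting itself is simple, as the following sketch shows. It is not BibExcel's implementation; it only illustrates document-level co-occurrence on whole (possibly compound) terms, taking the (docId, term, weight) rows as input and ignoring the weights.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(rows):
    """Count, for each unordered term pair, the documents in which both terms appear."""
    terms_by_doc = {}
    for doc_id, term, _weight in rows:
        terms_by_doc.setdefault(doc_id, set()).add(term)
    pairs = Counter()
    for terms in terms_by_doc.values():
        # Sorting makes each pair canonical: (a, b) with a < b.
        pairs.update(combinations(sorted(terms), 2))
    return pairs
```

Because terms are treated as opaque strings, “ion thruster” stays one node rather than being split into “ion” and “thruster”.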
20
Research steps 6. “Concept identification”
21
Research steps 7. “Drawing conclusions”
BibExcel can generate input for graphical representations in the Pajek and VOSviewer tools. These are dynamic tools that may be used to “explore” the network of related terms. The analysis can be restricted to specific time periods.
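For readers without BibExcel, a term network can be written by hand in the plain-text Pajek network layout (a `*Vertices` section followed by an `*Edges` section). The sketch below assumes the input is a mapping from canonical term pairs to co-occurrence counts; the function name is hypothetical, and labels containing double quotes are not handled.

```python
def to_pajek(pairs):
    """Render {(term_a, term_b): count} as Pajek network text."""
    labels = sorted({term for pair in pairs for term in pair})
    index = {term: i for i, term in enumerate(labels, start=1)}  # Pajek ids start at 1
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{i} "{term}"' for term, i in index.items()]
    lines.append("*Edges")
    lines += [f"{index[a]} {index[b]} {n}" for (a, b), n in sorted(pairs.items())]
    return "\n".join(lines) + "\n"
```

Filtering the pair counts by period before calling `to_pajek` gives one network file per time period.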
22
Research steps 7. “Drawing conclusions”
23
Way Forward
Preliminary results are satisfactory. Terminology extraction tools perform well, although some pre- and post-processing is still needed. Visual displays are a useful way to present terms and show their relevance and relationships (based on co-occurrence). Running the analysis on a bigger set of records is expected to increase the quality of the results (as well as the complexity of “data cleaning”). Differences in the “terms profile” of sets of documents may be considered evidence of the “knowledge recombination” that leads to innovation.
24
Way Forward
But some issues remain open. Analysis is restricted to title, abstract and keywords due to the unavailability of full-text search in the chosen database. Stop-word lists need to be refined for a better data “clean-up”. To get a detailed characterization of research fields, it is still necessary to identify relationships between concepts (mainly IS_A / BT/NT and RT) to support analysis at different levels of aggregation. A way to show and compare the evolution of the “terms” on user demand needs to be automated. Data is currently kept in files (XML, Excel and text); a database is needed for further analysis.
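As a starting point for the database mentioned above, a minimal relational layout could look like the sketch below. It uses SQLite from the Python standard library purely for illustration; the table and column names are hypothetical, and the relation kinds are limited to the BT/NT/RT labels named on the slide.

```python
import sqlite3

SCHEMA = """
CREATE TABLE document (
    doc_id TEXT PRIMARY KEY,
    year   INTEGER NOT NULL
);
CREATE TABLE term_weight (
    doc_id TEXT NOT NULL REFERENCES document(doc_id),
    term   TEXT NOT NULL,
    weight REAL NOT NULL,
    PRIMARY KEY (doc_id, term)
);
CREATE TABLE term_relation (
    term    TEXT NOT NULL,
    related TEXT NOT NULL,
    kind    TEXT NOT NULL CHECK (kind IN ('BT', 'NT', 'RT')),
    PRIMARY KEY (term, related, kind)
);
"""

def open_store(path=":memory:"):
    """Create the (hypothetical) schema in a new SQLite database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def term_evolution(conn, term):
    """Total weight of one term per year, oldest first."""
    return conn.execute(
        "SELECT d.year, SUM(w.weight) FROM term_weight w "
        "JOIN document d USING (doc_id) "
        "WHERE w.term = ? GROUP BY d.year ORDER BY d.year",
        (term,),
    ).fetchall()
```

A query such as `term_evolution(conn, "ion thruster")` is the kind of on-demand view of term evolution that currently requires manual work on the files.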