Study in Spatial Distribution Analysis of Science Research Activities based on Toponym Resolution in Text
Jianxia Ma 1, Guodong Cheng 2, Shaoxiong Liu 1, Hanqing Ma 1, Jinhui Ma 3, Na Li 1
1. The Lanzhou Branch of the National Science Library, Chinese Academy of Sciences, Lanzhou 73000, China; 2. Cold and Arid Regions Environmental and Engineering Research Institute, Chinese Academy of Sciences, Lanzhou, China; 3. College of Earth and Environmental Science, Lanzhou University, Lanzhou, China
Collnet 2012, Seoul, Korea
Outline: Background; Introduction to Related Study; Framework of the Analysis Tool; Spatial Analysis of Research Activity in Sporopollen in China; Conclusion
Background: Recently, many scholars and applications have begun to present analysis results of scientific papers visually in combination with GIS. Most of these studies are based on the author addresses supplied by the authors themselves. There are few reports on analyzing the distribution of research areas through text mining of research papers, especially papers written in Chinese.
Katy Börner, Shashikant Penumarthy, Mark Meiss, et al. Mapping the Diffusion of Information Among Major U.S. Research Institutions. Scientometrics, 2006, 68(3).
Xuemei Wang, Mingguo Ma. Spatial information mining and visualization for Qinghai-Tibet Plateau's literature based on GIS. In: Yaolin Liu, Xinming Tang (eds.), International Symposium on Spatial Analysis, Spatial-Temporal Data Mining. Wuhan, Proc. of SPIE, 2009, 1-8.
Lutz Bornmann, Ludo Waltman. The detection of hot regions in the geography of science: A visualization approach by using density maps. arXiv, v2.
Lutz Bornmann, Loet Leydesdorff, Christiane Walch-Solimena, Christoph Ettl. Mapping excellence in the geography of science: An approach based on Scopus data.
Background: In earth science and in fields related to resources and the environment, research is closely tied to specific locations. Reading articles one by one and annotating the research area by hand is an inefficient way to understand how research areas are distributed. Working this way, it is hard to see where the research blanks and hot spots are.
Background: By automatically recognizing and marking the geographical names referred to in research papers, we can analyze the spatial distribution of research activities in a field and identify its hot areas and blank areas. This helps decision makers and researchers adjust research strategy and optimize the allocation of research resources. It also adds a new spatial dimension to traditional information analysis.
Background: Possibility: Can we mine hidden geographical knowledge from large-scale collections of research papers to support spatial analysis of research activity? How: How can we analyze geographical features in massive textual collections and mine the hidden knowledge efficiently? Key: Toponym resolution in research articles comprises two tasks, namely geo-parsing and geo-coding.
Introduction to Related Study: Geo-parsing. Geo-parsing consists of detecting and extracting the geographic names referred to in the unstructured text of an article or a Web page using Named Entity Recognition (NER) techniques.
Gazetteer-based extraction: simple and allows efficient implementations, but with a loss of precision in toponym extraction. Building a gazetteer with full coverage is a tedious job. Natural language processing, generally based on statistical models: Hidden Markov Models (HMMs), Maximum Entropy Models (MEMs), Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs), and Support Vector Machines (SVMs) have been discussed in many publications for the extraction of geographic names. These models require a lot of training data and are corpus dependent.
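The gazetteer-based approach can be illustrated with a minimal sketch. It assumes a tiny hypothetical gazetteer (`GAZETTEER`) and English text, and it prefers the longest matching phrase; it is an illustration of the general idea, not the tool described in these slides.

```python
import re

# Hypothetical toy gazetteer; a real one would hold many thousands of entries.
GAZETTEER = {"lanzhou", "qinling mountains", "junggar basin", "yangtze river"}
MAX_WORDS = max(len(name.split()) for name in GAZETTEER)


def extract_toponyms(text):
    """Return gazetteer names found in text, preferring the longest match."""
    tokens = re.findall(r"[A-Za-z]+", text)
    found = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate phrase first, then shrink it.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in GAZETTEER:
                found.append(" ".join(tokens[i:i + n]))
                match_len = n
                break
        # Skip past a match, or advance one token when nothing matched.
        i += match_len if match_len else 1
    return found


sample = "Pollen records from the Junggar Basin and the Qinling Mountains near Lanzhou."
print(extract_toponyms(sample))
```

The sketch also shows the limits noted above: a place missing from the gazetteer is silently skipped, and an ordinary word that happens to equal a gazetteer entry would be reported as a toponym.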
Geo-coding: Geo-coding is the key step in linking textual information to maps. The gazetteer, or geographical knowledge base, is the key component. A well-designed digital gazetteer can support geo-entity identification, toponym disambiguation, and geo-coding. Well-known digital gazetteers include the ADL Gazetteer, Getty TGN, and GeoNames. Some digital map services, including Google Maps, Microsoft Bing Maps, Yahoo PlaceFinder, and Baidu Map, provide APIs for geo-coding.
Chinese Toponym Extraction: Unlike English, Chinese text has no spaces to mark word boundaries. Previous research focused on syntax rules and word segmentation. Statistical models have been used to identify unknown geographical names in Chinese text. That research was mainly carried out on web pages and news; little of it deals with research papers.
Framework of The Analysis Tool
Documentary Database Preparation; Geo-parsing in Text; Geo-extraction from author affiliation and address fields; Geo-recognition from unstructured text; CRF++ Based Toponym Identification; Geographical Knowledge Base with Semantic Relationship Supporting Toponym Disambiguation; GeoFocus; Geo-Coding; Spatial Analysis of Research Activity Based on Toponym Resolution from Documents with ArcGIS
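For the CRF++ step, training data has to be supplied as one token per line with the label in the last column and a blank line between sentences. The sketch below shows one plausible way to prepare character-level data for Chinese text. The `B-LOC`/`I-LOC`/`O` tag set, the helper name `to_crfpp_lines`, and the sample sentence are assumptions for illustration; the slides do not state which tag set or features the authors used.

```python
def to_crfpp_lines(sentence, toponym_spans):
    """Label each character B-LOC, I-LOC, or O and format it as 'char<TAB>tag'.

    toponym_spans holds (start, end) character offsets, end exclusive.
    """
    tags = ["O"] * len(sentence)
    for start, end in toponym_spans:
        tags[start] = "B-LOC"
        for k in range(start + 1, end):
            tags[k] = "I-LOC"
    return [f"{ch}\t{tag}" for ch, tag in zip(sentence, tags)]


# Hypothetical example: "Sporopollen research in the Lanzhou area",
# with the first two characters (Lanzhou) annotated as a toponym.
sentence = "兰州地区孢粉研究"
lines = to_crfpp_lines(sentence, [(0, 2)])

# A blank line terminates the sentence in the training file.
print("\n".join(lines) + "\n")
```

Working at the character level sidesteps the word-boundary problem of Chinese text, at the cost of leaving it to the model to learn where names begin and end.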
Geographical Knowledge Base with Semantic Relationship Supporting Toponym Disambiguation
Geographical KB: Abbreviation-Alias-Formal toponym transformation; Toponym-Footprint/Coordinate mapping. Combined with toponym rules, it supports toponym annotation. Combining the administrative relationships, spatial relationships, and feature types of geo-entities supports disambiguation and geo-coding.
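How such a knowledge base could support disambiguation can be sketched as follows. This is a simplified illustration under stated assumptions: the `ALIASES` and `ENTRIES` tables are hypothetical, the coordinates are rough, and only the administrative parent is used as context, whereas the slide also mentions spatial relationships and feature types.

```python
# Hypothetical miniature knowledge base. Names, types, and coordinates are
# rough illustrations, not data from the authors' system.
ALIASES = {
    "Xizang": "Tibet",  # alias -> formal name
    "Zang": "Tibet",    # abbreviation -> formal name
}

ENTRIES = [
    {"name": "Chaoyang", "parent": "Beijing", "type": "district", "lat": 39.92, "lon": 116.44},
    {"name": "Chaoyang", "parent": "Liaoning", "type": "city", "lat": 41.57, "lon": 120.45},
    {"name": "Tibet", "parent": "China", "type": "region", "lat": 31.7, "lon": 88.1},
]


def normalize(name):
    """Map an abbreviation or alias to its formal toponym."""
    return ALIASES.get(name, name)


def resolve(name, context_names):
    """Pick the entry whose administrative parent appears in the context.

    Returns None for unknown names and for ambiguous names without
    supporting context.
    """
    formal = normalize(name)
    candidates = [e for e in ENTRIES if e["name"] == formal]
    if not candidates:
        return None
    context = {normalize(c) for c in context_names}
    for entry in candidates:
        if entry["parent"] in context:
            return entry
    return candidates[0] if len(candidates) == 1 else None


print(resolve("Chaoyang", ["Liaoning"]))
print(resolve("Xizang", []))
```

Returning `None` for an ambiguous name without context is a deliberate choice in this sketch: a wrong coordinate would distort the later spatial analysis more than a missing one.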
Spatial Analysis of Research Activity in Sporopollen in China: Author distribution. Source: 1490 papers from CNKI ( ). Of these, 1402 items have clear author affiliations and addresses. The tool identified 97.08% of the author affiliations and addresses. In combination with Google Earth and Google Maps, the geo-coding rate reached 96.9%.
As the figure shows, most authors in palynology come from Beijing, Jiangsu, and Shanghai, followed by Gansu and Shanxi. Few authors are from Xizang and Ningxia.
Distribution of the research area in sporopollen in China: According to manual annotation of the abstracts, 1112 papers refer to geographical names.
Distribution of research area in sporopollen: The hottest research areas for sporopollen in China are the estuary of the Yangtze River, the Shandong inland area, Beijing, the Qinling mountain area, and the Junggar Basin. Sampling points are sparse south of the Changjiang (Yangtze River), in the mountainous border area of Heilongjiang, Jilin, and Inner Mongolia, in the northwest desert, and in the southwest tropical regions. These places deserve more attention in the future: in addition to research significance, the geographic representativeness of an area and the filling of blank research areas are also worth considering.
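The idea of hot and blank areas can be made concrete with a simple grid count over geo-coded points. The slides use ArcGIS for the spatial analysis; the sketch below swaps in a plain one-degree grid as a stand-in, and the sample coordinates, cell size, and threshold are all hypothetical.

```python
import math
from collections import Counter


def grid_counts(points, cell_deg=1.0):
    """Count (lat, lon) points per grid cell of cell_deg degrees."""
    counts = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts


def classify(counts, cells_of_interest, hot_threshold):
    """Split the cells of interest into hot cells and blank (empty) cells."""
    hot = sorted(c for c in cells_of_interest if counts.get(c, 0) >= hot_threshold)
    blank = sorted(c for c in cells_of_interest if counts.get(c, 0) == 0)
    return hot, blank


# Hypothetical geo-coded study sites, not data from the presentation.
points = [(31.2, 121.5), (31.7, 121.9), (31.4, 121.1), (39.9, 116.4)]
counts = grid_counts(points)
hot, blank = classify(counts, [(31, 121), (39, 116), (40, 90)], hot_threshold=2)
print("hot cells:", hot)
print("blank cells:", blank)
```

A fixed grid is cruder than the density surfaces a GIS package produces, but it is enough to list cells with many study sites and cells with none.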
Conclusion: The experiment shows that it is possible to analyze the distribution of research activities based on automatic identification and annotation of geo-entities in large-scale textual collections. The method can help science decision makers allocate research resources.
Further research: Further research and experiments are needed, and are in fact ongoing, to improve the geo-parsing and geo-coding rates. We need a much larger training corpus and need to adjust the feature template to get better results. We also need to take other heuristics into consideration to improve toponym resolution. A systematic evaluation of the method should be carried out as well.
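One common way to carry out such an evaluation is to compare the extracted toponyms against a manually annotated gold standard and report precision, recall, and F1. The sketch below assumes toponyms are compared as exact (document, name) pairs; the sample data are hypothetical, and the slides do not say which measures the authors plan to use.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f1) for two collections of annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (document id, toponym) pairs.
predicted = {("doc1", "Lanzhou"), ("doc1", "Beijing"), ("doc2", "Shanghai")}
gold = {("doc1", "Lanzhou"), ("doc2", "Shanghai"), ("doc2", "Gansu"), ("doc3", "Xizang")}

precision, recall, f1 = evaluate(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Exact matching is strict: a partially correct name counts as both a false positive and a false negative, so a more lenient matching rule may be appropriate depending on the goal of the evaluation.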