Study in Spatial Distribution Analysis of Science Research Activities based on Toponym Resolution in Text Jianxia Ma 1, Guodong Cheng 2, Shaoxiong Liu.


Study in Spatial Distribution Analysis of Science Research Activities Based on Toponym Resolution in Text
Jianxia Ma 1, Guodong Cheng 2, Shaoxiong Liu 1, Hanqing Ma 1, Jinhui Ma 3, Na Li 1
1. The Lanzhou Branch of the National Science Library, Chinese Academy of Sciences, Lanzhou 73000, China; 2. Cold and Arid Regions Environmental and Engineering Research Institute, Chinese Academy of Sciences, Lanzhou, China; 3. College of Earth and Environmental Science, Lanzhou University, Lanzhou, China
COLLNET 2012, Seoul, Korea

Outline: Background; Introduction to Related Study; Framework of the Analysis Tool; Spatial Analysis of Research Activity in Sporopollen in China; Conclusion

Background: Recently, many scholars and applications have begun to present the results of analyzing scientific papers visually, in combination with GIS. Most of these studies rely on the addresses supplied directly by the authors. Few reports analyze the distribution of research areas by text mining the research papers themselves, especially papers written in Chinese.

Katy Börner, Shashikant Penumarthy, Mark Meiss, et al. Mapping the Diffusion of Information Among Major U.S. Research Institutions. Scientometrics, 2006, 68(3).
Xuemei Wang, Mingguo Ma. Spatial information mining and visualization for Qinghai-Tibet Plateau's literature based on GIS [A]. In: Yaolin Liu, Xinming Tang. International Symposium on Spatial Analysis, Spatial-Temporal Data Mining [C]. Wuhan, Proc. of SPIE, 2009, 1-8.
Lutz Bornmann, Ludo Waltman. The detection of hot regions in the geography of science: A visualization approach by using density maps. arXiv: v2.
Lutz Bornmann, Loet Leydesdorff, Christiane Walch-Solimena, Christoph Ettl. Mapping excellence in the geography of science: An approach based on Scopus data.

Background: In earth science and in resource- and environment-related fields, research is closely tied to specific locations. Reading articles one by one and annotating the research area by hand is an inefficient way to understand how research areas are distributed, and it makes it hard to see where the research gaps and hot spots are.

Background: By automatically recognizing and marking the geographical names referred to in research papers, we can analyze the spatial distribution of research activities in a field and identify its hot areas and blank areas. This helps decision makers and researchers adjust research strategy and optimize the allocation of research resources. It also adds a new spatial dimension to traditional information analysis.

Background: Possibility: Can we mine hidden geographical knowledge from large-scale collections of research papers to support spatial analysis of research activity? How: How can we analyze geographical features in massive text collections and mine the hidden knowledge efficiently? Key: Toponym resolution in research articles comprises two tasks, geo-parsing and geo-coding.

Introduction to Related Study: Geo-parsing. Geo-parsing consists of detecting and extracting the geographic names referred to in the unstructured text of an article or a Web page, using Named Entity Recognition (NER) techniques.

Gazetteer-based extraction: Simple and allows efficient implementations, but at the cost of precision in toponym extraction. Building a gazetteer with full coverage is a tedious job.
Natural language processing, generally based on statistical models: Hidden Markov Models (HMMs), Maximum Entropy Models (MEMs), Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs), and Support Vector Machines (SVMs) have been discussed in many publications on the extraction of geographic names. These models require a lot of training and are corpus dependent.
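The gazetteer-based approach can be illustrated with a minimal longest-match sketch. It is not the tool described in these slides: the function name `extract_toponyms`, the four-entry `GAZETTEER`, and the sample sentence are all assumptions made for illustration. Because it matches on characters, it also works on Chinese text without word segmentation.

```python
def extract_toponyms(text, gazetteer):
    """Return (start, end, name) for each longest gazetteer match in text."""
    max_len = max(len(name) for name in gazetteer)
    matches = []
    i = 0
    while i < len(text):
        found = None
        # Try the longest candidate first so that a longer name such as
        # "青海湖" (Qinghai Lake) wins over its prefix "青海" (Qinghai).
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in gazetteer:
                found = (i, i + length, candidate)
                break
        if found:
            matches.append(found)
            i = found[1]  # continue after the matched name
        else:
            i += 1
    return matches


# Hypothetical toy gazetteer; a real one needs far broader coverage.
GAZETTEER = {"兰州", "甘肃", "青海湖", "青海"}
sample = "青海湖位于青海，兰州在甘肃。"
result = extract_toponyms(sample, GAZETTEER)
print(result)
```

Pure dictionary lookup like this marks every string that happens to appear in the gazetteer, whether or not it is used as a place name in context, which is one source of the precision loss noted above.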

Geo-coding: Geo-coding is the key step in linking textual information to maps. The gazetteer, or geographical knowledge base, is the key component. A well-designed digital gazetteer can support geo-entity identification, toponym disambiguation, and geo-coding. Well-known digital gazetteers include the ADL Gazetteer, Getty TGN, and GeoNames. Some digital map services, including Google Maps, Microsoft Bing Maps, Yahoo PlaceFinder, and Baidu Map, also provide APIs for geo-coding.

Chinese Toponym Extraction: Unlike English, Chinese text has no spaces to mark word boundaries. Previous research focused on syntax rules and word segmentation. Statistical models have been used to identify unknown geographical names in Chinese text. Most of this research was carried out on web pages and news; little of it deals with research papers.

Framework of The Analysis Tool

Documentary Database Preparation
Geo-parsing in Text
Geo-extraction from authors' affiliation and address fields
Geo-recognition from unstructured text
CRF++ Based Toponym Identification
Geographical Knowledge Base with Semantic Relationship Supporting Toponym Disambiguation
GeoFocus
Geo-Coding
Spatial Analysis of Research Activity Based on Toponym Resolution from Documents with ArcGIS
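To make the CRF++-based identification step more concrete, the sketch below prepares character-level training rows in the column layout CRF++ reads: one token per line, whitespace-separated columns, the label in the last column, and a blank line between sentences. The slides do not state the tag set or the feature template, so the `B-LOC`/`I-LOC`/`O` labels, the function name `to_bio_rows`, and the example sentence are assumptions.

```python
def to_bio_rows(sentence, spans):
    """Convert a sentence and toponym spans to 'char<TAB>label' rows.

    spans: list of (start, end) character offsets, end exclusive.
    """
    labels = ["O"] * len(sentence)
    for start, end in spans:
        labels[start] = "B-LOC"          # first character of a toponym
        for k in range(start + 1, end):
            labels[k] = "I-LOC"          # remaining characters of the toponym
    return [f"{ch}\t{lab}" for ch, lab in zip(sentence, labels)]


# Hypothetical annotated sentence: "The sampling site is near Lanzhou."
rows = to_bio_rows("采样点位于兰州附近", [(5, 7)])
print("\n".join(rows))
print()  # a blank line separates sentences in the training file
```

Labeling single characters avoids a separate word-segmentation step, which suits Chinese text where word boundaries are not marked.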

Geographical Knowledge Base with Semantic Relationship Supporting Toponym Disambiguation

Geographical KB: Abbreviation-Alias-Formal toponym transformation. Toponym-Footprint/Coordinate. Combined with toponym rules to support toponym annotation. Combines the administrative relationship, spatial relationship, and feature type of each geo-entity to support disambiguation. Geo-coding.
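A minimal sketch of how such a knowledge base could support disambiguation and geo-coding follows. It is an illustration under stated assumptions, not the authors' knowledge base: the `GeoEntity` fields, the alias table, the four entries, and their rounded coordinates are all hypothetical, and only the administrative relationship is used as a disambiguation cue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoEntity:
    name: str          # formal toponym
    parent: str        # administrative parent
    feature_type: str
    lon: float         # rounded, illustrative coordinates only
    lat: float


# Hypothetical alias -> formal name table.
ALIAS_TO_FORMAL = {"Peking": "Beijing", "Chaoyang District": "Chaoyang"}

# Hypothetical entries; "Chaoyang" is deliberately ambiguous.
KB = [
    GeoEntity("Beijing", "China", "municipality", 116.4, 39.9),
    GeoEntity("Liaoning", "China", "province", 123.4, 41.8),
    GeoEntity("Chaoyang", "Beijing", "district", 116.4, 39.9),
    GeoEntity("Chaoyang", "Liaoning", "city", 120.5, 41.6),
]


def normalize(name):
    """Map an abbreviation or alias to its formal toponym."""
    return ALIAS_TO_FORMAL.get(name, name)


def resolve(name, context_names):
    """Pick the candidate whose administrative parent appears in the context."""
    formal = normalize(name)
    candidates = [e for e in KB if e.name == formal]
    if not candidates:
        return None
    context = {normalize(c) for c in context_names}
    for entity in candidates:
        if entity.parent in context:
            return entity
    return candidates[0]  # fallback: first candidate when context gives no cue


hit = resolve("Chaoyang District", ["Liaoning"])
print(hit.name, hit.parent, (hit.lon, hit.lat))
```

Spatial relationships and feature types, which the slide also lists, could be added as further scoring cues in `resolve`.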

Spatial Analysis of Research Activity in Sporopollen in China: the distribution of authors. CNKI: 1490 papers ( ). 1402 items have clear author affiliations and addresses. The tool identified 97.08% of the author affiliations and addresses. In combination with Google Earth and Google Maps, the geo-coding rate reached 96.9%.

As the figure shows, most palynology authors come from Beijing, Jiangsu, and Shanghai, followed by Gansu and Shanxi. Few authors are from Xizang and Ningxia.

Distribution of the research area in sporopollen in China: According to manual annotation of the abstracts, 1112 papers refer to geographical names.

Distribution of research area in sporopollen: The hottest research areas for sporopollen in China are the estuary of the Yangtze River, the Shandong inland area, Beijing, the Qinling mountain area, and the Junggar Basin. Sampling points are sparse south of the Changjiang (Yangtze River), in the mountainous border area of Heilongjiang, Jilin, and Inner Mongolia, in the northwest desert, and in the southwest tropical regions. These places deserve more attention in the future: besides research significance, geographic representativeness and the filling of blank research areas are also worth considering.
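The idea of hot and blank areas can be sketched by counting geo-coded points per grid cell. The slides perform this analysis with ArcGIS; the plain-Python version below is a stand-in technique for illustration, and the function names, the one-degree cell size, the threshold, and the sample coordinates are all made up.

```python
import math
from collections import Counter


def grid_counts(points, cell_deg=1.0):
    """Count (lon, lat) points per grid cell of cell_deg degrees."""
    counts = Counter()
    for lon, lat in points:
        cell = (math.floor(lon / cell_deg), math.floor(lat / cell_deg))
        counts[cell] += 1
    return counts


def hot_and_blank(counts, cells_of_interest, hot_threshold):
    """Split the cells of interest into hot cells and cells with no points."""
    hot = sorted(c for c in cells_of_interest if counts.get(c, 0) >= hot_threshold)
    blank = sorted(c for c in cells_of_interest if counts.get(c, 0) == 0)
    return hot, blank


# Made-up sample points, not data from the study.
points = [(121.5, 31.2), (121.8, 31.4), (121.1, 31.9), (116.4, 39.9), (88.3, 44.6)]
counts = grid_counts(points)
hot, blank = hot_and_blank(counts, [(121, 31), (116, 39), (100, 25)], hot_threshold=3)
print("hot cells:", hot)
print("blank cells:", blank)
```

A density surface, as produced by GIS software, smooths these counts, but the cell counts already show where sampling is concentrated and where it is absent.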

Conclusion: The experiment shows that it is possible to analyze the distribution of research activities based on automatic identification and annotation of geo-entities in large-scale text collections. The method can help science decision makers allocate research resources.

Further research: Further research and experiments are needed, and are in fact ongoing, to improve the geo-parsing and geo-coding rates. We need a much larger training corpus and need to adjust the feature template to get better results. We also need to consider other heuristics to improve toponym resolution. A systematic evaluation of the method should be carried out as well.

Thanks for your attention!