1/1/ A Knowledge-based Approach to Citation Extraction Min-Yuh Day 1,2, Tzong-Han Tsai 1,3, Cheng-Lung Sung 1, Cheng-Wei Lee 1, Shih-Hung Wu 4, Chorng-Shyong.

Slides:



Advertisements
Similar presentations
CSE594 Fall 2009 Jennifer Wong Oct. 14, 2009
Advertisements

Document Summarization using Conditional Random Fields Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, Zheng Chen IJCAI 2007 Hao-Chin Chang Department of Computer.
Jing-Shin Chang National Chi Nan University, IJCNLP-2013, Nagoya 2013/10/15 ACLCLP – Activities ( ) & Text Corpora.
New Features Update ISI Web of Knowledge. Copyright 2006 Thomson Corporation 2 New features added Mozilla Firefox web browser is now supported New access.
Ke Liu1, Junqiu Wu2, Shengwen Peng1,Chengxiang Zhai3, Shanfeng Zhu1
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
C ONCLUSION & A BSTRACT RESEARCH METHOD FOR ACADEMIC PROJECT I.
ELPUB 2006 June Bansko Bulgaria1 Automated Building of OAI Compliant Repository from Legacy Collection Kurt Maly Department of Computer.
1/1/ Integrating Genetic Algorithms with Conditional Random Fields to Enhance Question Informer Prediction Min-Yuh Day 1, 2, Chun-Hung Lu 1, 2, Chorng-Shyong.
1/1/ An Integrated Knowledge-based and Machine Learning Approach for Chinese Question Classification Min-Yuh Day 1,2, Cheng-Wei Lee 1, Shih-Hung Wu 3,
Traditional Information Extraction -- Summary CS652 Spring 2004.
Introduction to EndNote Martin Snelling March 2007.
1/1/ Question Classification in English-Chinese Cross-Language Question Answering: An Integrated Genetic Algorithm and Machine Learning Approach Min-Yuh.
1/1/ Designing an Ontology-based Intelligent Tutoring Agent with Instant Messaging Min-Yuh Day 1,2, Chun-Hung Lu 1,3, Jin-Tan David Yang 4, Guey-Fa Chiou.
Scalable Text Mining with Sparse Generative Models
1 Drawing Dynamic Geometry Figures with Natural Language Wing-Kwong Wong a, Sheng-Kai Yin b, Chang-Zhe Yang c a Department of Electronic Engineering b.
Online Stacked Graphical Learning Zhenzhen Kou +, Vitor R. Carvalho *, and William W. Cohen + Machine Learning Department + / Language Technologies Institute.
1 A Topic Modeling Approach and its Integration into the Random Walk Framework for Academic Search 1 Jie Tang, 2 Ruoming Jin, and 1 Jing Zhang 1 Knowledge.
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
Semantic Web Technologies Lecture # 2 Faculty of Computer Science, IBA.
Webpage Understanding: an Integrated Approach
CONTI’2008, 5-6 June 2008, TIMISOARA 1 Towards a digital content management system Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal.
Dept. Computer Science, Korea Univ. Intelligent Information System Lab. XML clustering methods Sohn Jong-Soo Intelligent Information.
Social Networking Techniques for Ranking Scientific Publications (i.e. Conferences & journals) and Research Scholars.
1 A Discriminative Approach to Topic- Based Citation Recommendation Jie Tang and Jing Zhang Presented by Pei Li Knowledge Engineering Group, Dept. of Computer.
Learning Object Metadata Mining Masoud Makrehchi Supervisor: Prof. Mohamed Kamel.
Authors: Ting Wang, Yaoyong Li, Kalina Bontcheva, Hamish Cunningham, Ji Wang Presented by: Khalifeh Al-Jadda Automatic Extraction of Hierarchical Relations.
Online Autonomous Citation Management for CiteSeer CSE598B Course Project By Huajing Li.
1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.
A Comparison of On-line Computer Science Citation Databases Vaclav Petricek, Ingemar J. Cox, Hui Han, Isaac G. Councill, C. Lee Giles
IJCAI 2003 Workshop on Learning Statistical Models from Relational Data First-Order Probabilistic Models for Information Extraction Advisor: Hsin-His Chen.
Developing a Concept Extraction Technique with Ensemble Pathway Prat Tanapaisankit (NJIT), Min Song (NJIT), and Edward A. Fox (Virginia Tech) Abstract.
Document Clustering 文件分類 林頌堅 世新大學圖書資訊學系 Sung-Chien Lin Department of Library and Information Studies Shih-Hsin University.
TEMPLATE DESIGN © Zhiyao Duan 1,2, Lie Lu 1, and Changshui Zhang 2 1. Microsoft Research Asia (MSRA), Beijing, China.2.
Efficiently Computed Lexical Chains As an Intermediate Representation for Automatic Text Summarization H.G. Silber and K.F. McCoy University of Delaware.
The ISI Web of Knowledge nce/training/wok/#tab3.
NTCIR /21 ASQA: Academia Sinica Question Answering System for CLQA (IASL) Cheng-Wei Lee, Cheng-Wei Shih, Min-Yuh Day, Tzong-Han Tsai, Tian-Jian Jiang,
LOGO A comparison of two web-based document management systems ShaoxinYu Columbia University March 31, 2009.
Summary  Extractive speech summarization aims to automatically select an indicative set of sentences from a spoken document to concisely represent the.
Date : 2013/03/18 Author : Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, Grant Weddell Source : CIKM’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling.
Retrieval of Highly Related Biomedical References by Key Passages of Citations Rey-Long Liu Dept. of Medical Informatics Tzu Chi University Taiwan.
 Depending on what kind of paper you are writing determines which format you use. Your professor or teacher will also direct which format they want to.
Digital Libraries1 David Rashty. Digital Libraries2 “A library is an arsenal of liberty” Anonymous.
Web Technologies for Bioinformatics Ken Baclawski.
Reference Collections: Collection Characteristics.
Feb 21-25, 2005ICM 2005 Mumbai1 Converting Existing Corpus to an OAI Compliant Repository J. Tang, K. Maly, and M. Zubair Department of Computer Science.
Boris Radovanović Marko Stojićević
A System for Automatic Personalized Tracking of Scientific Literature on the Web Tzachi Perlstein Yael Nir.
Citation-Based Retrieval for Scholarly Publications 指導教授:郭建明 學生:蘇文正 M
A Portrait of the Semantic Web in Action Jeff Heflin and James Hendler IEEE Intelligent Systems December 6, 2010 Hyewon Lim.
Refined Online Citation Matching and Adaptive Canonical Metadata Construction CSE 598B Course Project Report Huajing Li.
Learning to Rank: From Pairwise Approach to Listwise Approach Authors: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li Presenter: Davidson Date:
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Meta-Path-Based Ranking with Pseudo Relevance Feedback on Heterogeneous Graph for Citation Recommendation By: Xiaozhong Liu, Yingying Yu, Chun Guo, Yizhou.
Multi-Class Sentiment Analysis with Clustering and Score Representation Yan Zhu.
Publication Pattern of CA-A Cancer Journal for Clinician Hsin Chen 1 *, Yee-Shuan Lee 2 and Yuh-Shan Ho 1# 1 School of Public Health, Taipei Medical University.
A Bibliographic Management Software NORSHUHADA SAIDIN REFERENCE & RESEARCH DIVISION PERPUSTAKAAN KEJURUTERAAN UNIVERSITI SAINS MALAYSIA.
MINING DEEP KNOWLEDGE FROM SCIENTIFIC NETWORKS
CSE594 Fall 2009 Jennifer Wong Oct. 14, 2009
Bibliometric Analysis of Water Research
Performing CS Research
Building an autonomous citation index for grey literature: the
Bibliometric Analysis of Process Safety and Environmental Protection
Bibliometric Analysis of Quality of Life Publication
Marcos André Gonçalves
Hierarchical, Perceptron-like Learning for OBIE
Building Topic/Trend Detection System based on Slow Intelligence
Bibliometric Analysis of Desflurane Research
Ping LUO*, Fen LIN^, Yuhong XIONG*, Yong ZHAO*, Zhongzhi SHI^
CSE594 Fall 2009 Jennifer Wong Oct. 14, 2009
Presentation transcript:

1/1/ A Knowledge-based Approach to Citation Extraction Min-Yuh Day 1,2, Tzong-Han Tsai 1,3, Cheng-Lung Sung 1, Cheng-Wei Lee 1, Shih-Hung Wu 4, Chorng-Shyong Ong 2, Wen-Lian Hsu 1 1 Institute of Information Science, Academia Sinica, Nankang, Taipei, Taiwan 2 Department of Information Management, National Taiwan University, Taipei, Taiwan 3 Department of Computer Science and Engineering, National Taiwan University, Taipei, Taiwan 4 Dept. of Computer Science and Information Engineering, Chaoyang Univ. of Technology, Taiwan IEEE IRI 2005

Min-Yuh Day 2/2/ Outline Introduction Proposed Approach Experimental Results and Discussion Related Works Conclusions and Future Research

Min-Yuh Day 3/3/ Introduction Integration of the bibliographical information of scholarly publications available on the Internet is an important task in academic research. Accurate reference metadata extraction for scholarly publications is essential for the integration of information from heterogeneous reference sources. We propose a knowledge-based approach to literature mining and focus on reference metadata extraction methods for scholarly publications. INFOMAP: ontological knowledge representation framework Automatically extract the reference metadata.

Min-Yuh Day 4/4/ Proposed Approach

Min-Yuh Day 5/5/ Reference Data Collection Journal Spider (journal agent) collect journal data from the Journal Citation Reports (JCR) indexed by the ISI and digital libraries on the Web. Citation data source ISI web of science DBLP Citeseer PubMed Phase 1

Min-Yuh Day 6/6/ Knowledge Representation in INFOMAP Phase 2

Min-Yuh Day 7/7/ INFOMAP INFOMAP as ontological knowledge representation framework extracts important citation concepts from a natural language text. Feature of INFOMAP represent and match complicated template structures hierarchical matching regular expressions semantic template matching frame (non-linear relations) matching graph matching Using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different kinds of reference formats or styles.

Min-Yuh Day 8/8/ Reference Metadata Extraction Journal Reference styles Reference style example Bioinformatics style (BIOI) Davenport, T., DeLong, D., & Beers, M. (1998) Successful knowledge management projects. Sloan Management Review, 39(2), ACM style (ACM) 1.Davenport, T., DeLong, D. and Beers, M Successful knowledge management projects. Sloan Management Review, 39 (2) IEEE style (IEEE) [1]T. Davenport, D. DeLong, and M. Beers, "Successful knowledge management projects," Sloan Management Review, vol. 39, no. 2, pp , APA style (APA) Davenport, T., DeLong, D., & Beers, M. (1998). Successful knowledge management projects. Sloan Management Review, 39(2), JCB style (JCB) Davenport, T., DeLong, D., & Beers, M Successful knowledge management projects. Sloan Management Review 39(2), MISQ style (MISQ) Davenport, T., DeLong, D., and Beers, M. "Successful knowledge management projects," Sloan Management Review (39:2) 1998, pp Table 1. Examples of different journal reference styles Phase 3

Min-Yuh Day 9/9/ Knowledge-based Reference Metadata Extraction - Online Service Phase 4

Min-Yuh Day 10/ Citation Extraction From Text to BixTex W. L. Hsu, "The coloring and maximum independent set problems on planar perfect graphs," J. Assoc. Comput. Machin., (1988), W. L. Hsu, "On the general feasibility test of scheduling lot sizes for several products on one machine," Management Science 29, (1983), W. L. Hsu, "The distance-domination numbers of trees," Operations Research Letters 1, (3), (1982), Author = {W. L. Hsu}, Title = {The coloring and maximum independent set problems on planar perfect graphs,"}, Journal = {J. Assoc. Comput. Machin.}, Volume = {}, Number = {}, Pages = { }, Year = {1988 Author = {W. L. Hsu}, Title = {On the general feasibility test of scheduling lot sizes for several products on one machine,"}, Journal = {Management Science}, Volume = {29}, Number = {}, Pages = {93-105}, Year = {1983 Author = {W. L. Hsu}, Title = {The distance-domination numbers of trees,"}, Journal = {Operations Research Letters}, Volume = {1}, Number = {3}, Pages = {96-100}, Year = {1982 }} Figure 5. The system output of BibTex Format Figure 3. The system input of knowledge-based RME

Min-Yuh Day 11/ Figure 6. The online service of knowledge-based RME ( System Output System Input (Plain text) Output BibTex

Min-Yuh Day 12/ Experimental Results and Discussion Experimental data We used EndNote to collect Bioinformatics citation data for 2004 from PubMed. A total of 907 bibliography records were collected from PubMed digital libraries on the Web. Reference testing data was generated for each of the six reference styles (BIOI, ACM, IEEE, APA, MISQ, and JCB). Randomly selected 500 records for testing from each of the six reference styles.

Min-Yuh Day 13/ Accuracy of Citation Extraction Definition: We consider a field to be correctly extracted only when the field values in the reference testing data are correctly extracted. The accuracy of citation extraction is defined as follows:

Min-Yuh Day 14/ Experimental results of citation extraction from six reference styles

Min-Yuh Day 15/ Example Results

Min-Yuh Day 16/ Analysis of the structure of reference styles FieldField Relation StructurePercentage% Author 54.29% 42.86% N/A2.85% Year 48.57% 20.00% 14.29% 5.71% 2.86% 2.86% N/A5.71% Title 48.57% 42.86% N/A8.57% Journal 71.43% 20.00% 5.71% N/A2.86% Volume 40.00% 31.43% 14.29% 5.71% 2.86% 2.86% N/A2.85% Issue 34.29% 14.29% N/A51.42% Pages 42.86% 34.29% 17.14% 2.86% N/A2.85%

Min-Yuh Day 17/ Related Works Machine learning approaches Citeseer [8, 9, 12] take advantage of probabilistic estimation, which is based on the training sets of tagged bibliographical data, to boost performance. The citation parsing technique of Citeseer can identify titles and authors with approximately 80% accuracy and page numbers with approximately 40% accuracy. Seymore et al. [15] use the Hidden Markov Model (HMM) to extract important fields from the headers of computer science research papers Achieve an overall word accuracy of 92.9% Peng et al. [14] employ Conditional Random Fields (CRF) to extract various common fields from the headers and citations of research papers. Achieve an overall word accuracy of 85.1%(HMM) compared to 95.37%(CRF) and an overall instance accuracy of 10%(HMM) compared to 77.33%(CRF) for paper references.

Min-Yuh Day 18/ Related Works (Cont.) Rule-based models Chowdhury [3] and Ding et al. [5], use a template mining approach for citation extraction from digital documents. Ding et al. [5] use three templates for extracting information from cited articles (citations) and obtain a quite satisfactory result (more than 90%) for the distribution of information extracted from each unit in cited articles. The advantage of their rule-based model is its efficiency in extracting reference information. However, they treat references in one style only from tagged texts (e.g., references formatted in HTML), whereas our method treats references in more than six reference styles from plain text.

Min-Yuh Day 19/ Comparison with related works Knowledge-based approach Our proposed knowledge-based RME method for scholarly publications can extract reference information from 907 records in various reference styles with a high degree of precision the overall average field accuracy is 97.87% for six major styles listed in Table % for the MISQ style 87% for other 30 randomly selected styles

Min-Yuh Day 20/ Conclusions Citation extraction is a challenging problem The diverse nature of reference styles We have proposed a knowledge-based citation extraction method for scholarly publications. The experimental results indicate that, by using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different reference styles with a high degree of precision. The overall average field accuracy of citation extraction is 97.87% for six major reference styles.

Min-Yuh Day 21/ Future Research Integrate the ontological and the machine learning approaches to boost the performance of citation information extraction Maximum-Entropy Method (MEM) Hidden Markov Model (HMM) Conditional Random Fields (CRF) Support Vector Machines (SVM)

Min-Yuh Day 22/ Q & A A Knowledge-based Approach to Citation Extraction Min-Yuh Day 1,2, Tzong-Han Tsai 1,3, Cheng-Lung Sung 1, Cheng-Wei Lee 1, Shih-Hung Wu 4, Chorng-Shyong Ong 2, Wen-Lian Hsu 1 1 Institute of Information Science, Academia Sinica, Nankang, Taipei, Taiwan 2 Department of Information Management, National Taiwan University, Taipei, Taiwan 3 Department of Computer Science and Engineering, National Taiwan University, Taipei, Taiwan 4 Dept. of Computer Science and Information Engineering, Chaoyang Univ. of Technology, Taiwan IEEE IRI 2005