Hierarchical Clustering for POS Tagging of the Indonesian Language
Derry Tanti Wijaya and Stéphane Bressan



Motivation
- Lack of annotated training data for Bahasa Indonesia
- Contextual information gives clues to the part of speech of words
- User knowledge of the language helps in determining the part of speech of words

Idea
- Cluster words based on their contextual similarities
- The clustering must be interactive to allow the inclusion of user knowledge
- Incremental hierarchical clustering is chosen because it builds clusters level by level, which allows user interaction between hierarchy levels

Related Works: Schutze’s Approach
- Schutze (1999) proposes the first algorithm for tagging words whose POS is unknown
- Similarity between words is determined using contextual information:
  - The left and right neighbors of a word are its features
  - Similarity between words is the cosine of their feature vectors
- The Buckshot algorithm is used to cluster the words
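The neighbor-based features and cosine similarity described above can be sketched as follows. This is a minimal illustration, not the implementation used in the cited work: the function names, the `window` parameter, and the toy Indonesian sentences are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def context_vectors(tokens, feature_words, window=1):
    """Count, for every word, which feature words occur at each offset around it.

    Hypothetical helper: `window=1` uses one left and one right neighbor.
    """
    features = set(feature_words)
    vectors = {}
    for i, word in enumerate(tokens):
        vec = vectors.setdefault(word, Counter())
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens) and tokens[j] in features:
                # The offset is part of the key, so left and right
                # neighbors are counted as different features.
                vec[(offset, tokens[j])] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus invented for this sketch (not taken from the evaluated corpus).
tokens = "saya makan nasi . dia makan roti . saya minum air . dia minum teh .".split()
vectors = context_vectors(tokens, feature_words=set(tokens))
print(round(cosine(vectors["makan"], vectors["minum"]), 3))  # 0.5
```

In this toy corpus the two verbs share their left neighbors but not their right neighbors, so they are only partially similar, while the two nouns "nasi" and "roti" have identical contexts.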

Related Works: Extended Schutze’s Approach
- Bressan et al. (2004) extend Schutze’s approach by considering a broader context for the feature vectors: two left and two right neighbors
- Shown to be superior on the Brown corpus

Proposed Method: Overview
- Cluster words into their POS classes based on the cosine similarity of their feature vectors
- Evaluate two incremental hierarchical clustering algorithms:
  - Single-link incremental hierarchical clustering
  - Our own Borůvka hierarchical clustering
- Can be extended to other hierarchical clustering methods (average-link, complete-link)

Proposed Method: Clustering
- Single-Link
  - Treat each vertex as a separate cluster initially
  - Scan through the list of edges from heaviest to lightest (highest to lowest similarity)
  - Iteratively merge pairs of clusters connected by the heaviest edge until there is only one cluster left
- Borůvka
  - Treat each vertex as a separate cluster initially
  - Scan through the list of clusters
  - Iteratively merge each cluster with the cluster to which it is connected by its heaviest edge until there is only one cluster left
- Single-Link scans through the list of edges, while Borůvka scans through the list of vertices (clusters)
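The two merging strategies can be sketched on a small similarity graph as follows. This is an illustrative reading of the steps above under stated assumptions, not the authors' code: the union-find representation, the function names, the `n_clusters` stopping parameter, and the example edge weights are all hypothetical.

```python
def find(parent, x):
    """Return the representative of x's cluster (union-find with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def clusters_of(parent):
    """List the current clusters as sorted lists of vertices."""
    groups = {}
    for v in parent:
        groups.setdefault(find(parent, v), set()).add(v)
    return sorted(sorted(g) for g in groups.values())


def single_link(vertices, edges, n_clusters=1):
    """Scan edges from heaviest to lightest, merging one pair of clusters at a time."""
    parent = {v: v for v in vertices}
    remaining = len(parent)
    for sim, a, b in sorted(edges, reverse=True):
        if remaining <= n_clusters:
            break
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
            remaining -= 1
    return clusters_of(parent)


def boruvka_levels(vertices, edges):
    """At each level, merge every cluster along its heaviest outgoing edge."""
    parent = {v: v for v in vertices}
    levels = []
    while True:
        best = {}  # cluster representative -> heaviest edge leaving that cluster
        for sim, a, b in edges:
            ra, rb = find(parent, a), find(parent, b)
            if ra == rb:
                continue
            for r in (ra, rb):
                if r not in best or sim > best[r][0]:
                    best[r] = (sim, ra, rb)
        if not best:
            break  # one cluster left, or no edges between remaining clusters
        for sim, ra, rb in best.values():
            ra, rb = find(parent, ra), find(parent, rb)
            if ra != rb:
                parent[ra] = rb
        levels.append(clusters_of(parent))
    return levels


# Edges are (similarity, word, word); the weights are made up for illustration.
vertices = ["a", "b", "c", "d", "e"]
edges = [(0.9, "a", "b"), (0.8, "c", "d"), (0.3, "b", "c"),
         (0.2, "d", "e"), (0.1, "a", "e")]
for level, clusters in enumerate(boruvka_levels(vertices, edges), start=1):
    print(level, clusters)
print(single_link(vertices, edges, n_clusters=2))
```

In this sketch each Borůvka pass produces a whole level of the hierarchy at once, which is the natural point at which a user could inspect or adjust the clusters, whereas single-link advances one merge at a time.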

Proposed Method: Tools
- Feature Vectors
  - Measure the similarity of words by the degree to which they share the same two neighbors on the left and right (extended Schutze’s approach)
- Interactive Clustering
  - Allow the user to decide which clusters to merge or break in between levels
- Constrained Clustering
  - Allow the user to input constraints (words/morphological) in between levels
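The slides do not specify how constraints are represented, so the following is only one possible sketch of constrained merging. It assumes that constraints take the form of "cannot-link" word pairs that veto a merge, and that a morphological constraint can be expressed as a pair of prefixes; both function names and the prefix rule are hypothetical and deliberately simplistic.

```python
def morphological_pairs(words, prefix_a, prefix_b):
    """Hypothetical morphological constraint: words starting with prefix_a
    must not share a cluster with words starting with prefix_b."""
    group_a = [w for w in words if w.startswith(prefix_a)]
    group_b = [w for w in words if w.startswith(prefix_b)]
    return [(a, b) for a in group_a for b in group_b]


def merge_allowed(cluster_a, cluster_b, cannot_link):
    """Return False if merging would join any pair marked as cannot-link."""
    for x, y in cannot_link:
        if (x in cluster_a and y in cluster_b) or (y in cluster_a and x in cluster_b):
            return False
    return True


# Toy example: keep "me-" forms apart from "pe-" forms.
# A prefix test this crude would misfire on many real words.
words = ["membaca", "pembaca", "menulis"]
cannot_link = morphological_pairs(words, "me", "pe")
print(merge_allowed({"membaca"}, {"menulis"}, cannot_link))
print(merge_allowed({"membaca", "menulis"}, {"pembaca"}, cannot_link))
```

A check of this kind could be consulted between hierarchy levels before a proposed merge is accepted; word-level constraints supplied directly by the user would simply be added to the same `cannot_link` list.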

Performance Evaluation: Experimental Setup
- Evaluate the proposed method using the Indonesian Language Corpus (Asian et al., 2004)
- Use the 3000 most frequent words in the corpus to compose the feature vectors
- Select 198 words whose POS tags are not ambiguous to be clustered
- Manually tag these 198 words using the Penn Treebank tag set
- Study recall, precision and F1 measure
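The slides do not state how precision, recall and F1 are computed for a clustering, so the sketch below shows one common convention as an assumption: each cluster is labeled with its majority gold tag, scored against that tag, and the scores are averaged over clusters. The function name and the five-word example are hypothetical.

```python
from collections import Counter


def cluster_scores(clusters, gold):
    """Average precision, recall and F1 over clusters (majority-tag convention).

    Assumption of this sketch: a cluster's label is its most frequent gold tag.
    """
    tag_totals = Counter(gold.values())
    scores = []
    for cluster in clusters:
        tag, hits = Counter(gold[w] for w in cluster).most_common(1)[0]
        precision = hits / len(cluster)   # share of the cluster carrying its label
        recall = hits / tag_totals[tag]   # share of that tag captured by the cluster
        f1 = 2 * precision * recall / (precision + recall)
        scores.append((precision, recall, f1))
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Toy gold tags (Penn Treebank style) and a toy clustering, invented for illustration.
gold = {"orang": "NN", "rumah": "NN", "buku": "NN", "makan": "VB", "minum": "VB"}
clusters = [["orang", "rumah"], ["buku", "makan", "minum"]]
precision, recall, f1 = cluster_scores(clusters, gold)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Other conventions, such as pairwise precision and recall over word pairs, would give different numbers for the same clustering.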

Performance Evaluation: Experimental Results
- Present the average precision, recall and F1 measure at each hierarchy level
- Borůvka always gives higher F1 and recall than Single-Link
- Borůvka builds clusters level by level and therefore allows user interactivity in between levels

Performance Evaluation: Experimental Results
- Borůvka and Single-Link comparison

Performance Evaluation: Experimental Results
- Borůvka at different levels
- Best clustering is found at level 2

Performance Evaluation: Experimental Results
- Adding words and morphological constraints to Borůvka gives the highest improvement of F1 values
- Best clustering is found at level 2

Performance Evaluation: Experimental Results
- Clustering at fine granularity displays semantic significance beyond part-of-speech tagging
- Clustering is able to group words by their named-entity types and senses

Performance Evaluation: Experimental Results
- Example of named-entity grouping:
  - At level 1 and level 2 of the clusters, names of days, months, years, places, and people are grouped in separate clusters
- Example of word-sense grouping:
  - Indonesian repeat words (e.g. orang-orang: people) are most often used as nouns
  - However, some repeat words (e.g. pelan-pelan: slowly) are adjectives
  - Our proposed clustering is able to place them correctly in different groups (one of nouns and one of adjectives)

Conclusion
- We apply clustering to the problem of POS tagging for Bahasa Indonesia
- We present a tool for interactive and constrained exploration of POS classes
- Performance of Borůvka is better than that of Single-Link and is satisfactory even for a small set of words
- Clustering at fine granularity displays semantic results beyond part-of-speech tagging (named-entity tagging, word-sense identification)

References
- Hinrich Schutze. Distributional Part-of-speech Tagging. In EACL7.
- Stéphane Bressan and Lily Indrajaja. Part-of-speech Tagging without Training. In Proceedings of IFIP International Conference, INTELLCOMM 2004, Bangkok, Thailand.
- Jelita Asian, Hugh Williams, and Seyed Tahaghoghi. A Testbed for Indonesian Text Retrieval. In Proceedings of the 9th Australasian Document Computing Symposium, Melbourne, Australia.