Hetero-Labeled LDA: A Partially Supervised Topic Model with Heterogeneous Label Information
Dongyeop Kang1, Youngja Park2, Suresh Chari2
1. IT Convergence Laboratory, KAIST Institute, Korea
2. IBM T.J. Watson Research Center, NY, USA

Topic Discovery – Supervised
Topic classification
  Learns decision boundaries between classes from labeled data
  Accurate topic classification for general domains
  Very hard to build a model for business applications due to the labeled-data bottleneck

Topic Discovery – Unsupervised
Probabilistic topic modeling
  Learns a topic distribution for each class from data without label information, and assigns new data to the topic with the most similar distribution
  e.g., Latent Dirichlet Allocation (LDA)
  Not sufficiently accurate or interpretable
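
For orientation, a minimal sketch of plain unsupervised LDA using the gensim library (the toy corpus and parameters below are illustrative assumptions, not the paper's setup):

# Plain unsupervised LDA with gensim; toy corpus for illustration only.
from gensim import corpora, models

texts = [["trade", "bank", "export", "dollar"],
         ["wheat", "grain", "corn", "harvest"],
         ["trade", "export", "tariff", "bank"]]

dictionary = corpora.Dictionary(texts)           # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])       # top words per discovered topic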

Topic Discovery – Semi-supervised
Supervised topic modeling methods
  Supervised LDA [Blei & McAuliffe, 2007], Labeled LDA [Ramage, 2009]: document labels provided
Semi-supervised topic modeling methods
  Seeded LDA [Jagarlamudi, 2012], zLDA [Andrzejewski, 2009]: word labels/constraints provided
Limitations
  Only one kind of domain knowledge is supported
  The labels must cover the entire topic space: |L| = |T|
  All documents in the training data must be labeled: |D_unlabeled| = ∅

Partially Semi-supervised Topic Modeling with Heterogeneous Labels
Generating labeled training samples is much more challenging in real-world applications
  In most large companies, data are generated and managed independently by many different divisions
  Different types of domain knowledge are available in different divisions
Can we discover accurate and meaningful topics with a small amount of various types of domain knowledge?

Hetero-Labeled LDA: Main Contributions
Heterogeneity
  Domain knowledge (labels) comes in different forms, e.g., document labels, topic-indicative features, a partial taxonomy
Partialness
  Only a small amount of labels is given
  We address two kinds of partialness
    Partially labeled documents: |L| << |T|
    Partially labeled corpus: |D_labeled| << |D_unlabeled|
Three levels of domain information: group information, label information, topic distribution

Challenges
[Figure: heterogeneous, partial labels across topics. Some topics have document labels (L_d), some have only feature labels (L_w), e.g.
  {trade, billion, dollar, export, bank, finance}
  {grain, wheat, corn, oil, oilseed, sugar, tonn}
  {game, team, player, hit, dont, run, pitch}
  {god, christian, jesus, bible, church, christ}
and some topics have no labels at all.]
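
Topic-indicative feature labels like these can be folded into a standard topic model as seed-word priors. A minimal sketch under that assumption, using gensim's per-topic word prior eta (this mimics seeded LDA, not the HLLDA mechanism itself; the seed sets are illustrative):

# Seed-word priors via an asymmetric eta matrix in gensim's LDA.
import numpy as np
from gensim import corpora, models

texts = [["trade", "bank", "export"], ["grain", "wheat", "corn"],
         ["trade", "dollar", "bank"], ["wheat", "oilseed", "sugar"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

seed_words = {0: ["trade", "bank"], 1: ["grain", "wheat"]}  # feature labels L_w
num_topics = 2
eta = np.full((num_topics, len(dictionary)), 0.01)  # weak base prior
for k, words in seed_words.items():
    for w in words:
        eta[k, dictionary.token2id[w]] = 1.0        # boost seeded words in topic k

lda = models.LdaModel(corpus, num_topics=num_topics,
                      id2word=dictionary, eta=eta, passes=20)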

Hetero-Labeled LDA: Heterogeneity
[Plate diagram: the LDA core (α → θ → z → w, with per-topic word distributions φ drawn from β, over K topics and D documents of W_d words each), extended with observed document labels Λ_d (prior γ) and word labels Λ_w (prior δ)]

Hetero-Labeled LDA: Partialness
[Plate diagram: as above, but the labeled topic sets cover only part of the topic space: K_d << K, K_w << K, K_d ∩ K_w ≠ ∅]

Hetero-Labeled LDA: Heterogeneity + Partialness
[Plate diagram: document labels and word labels are combined through a hybrid constraint Ψ, mixing a document-specific topic distribution (over K_d) with a general topic distribution (over all K topics)]

Hetero-Labeled LDA: Generative Process
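
The generative process itself was shown as equations on the slides and is not preserved in this transcript. For orientation, the generative story of Labeled LDA [Ramage, 2009], which HLLDA generalizes with word labels and partialness (a sketch of the predecessor model, not of HLLDA):

% Labeled LDA generative story (Ramage et al., 2009), for orientation only.
\begin{align*}
&\text{for each topic } k = 1,\dots,K: && \phi_k \sim \mathrm{Dirichlet}(\beta)\\
&\text{for each document } d: && \Lambda_d \subseteq \{1,\dots,K\} \text{ (observed label set)}\\
& && \theta_d \sim \mathrm{Dirichlet}(\alpha) \text{ restricted to topics in } \Lambda_d\\
&\text{for each word position } i = 1,\dots,W_d: && z_{d,i} \sim \mathrm{Mult}(\theta_d), \quad w_{d,i} \sim \mathrm{Mult}(\phi_{z_{d,i}})
\end{align*}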

Hetero-Labeled LDA: Inference & Learning
Gibbs sampling
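
The HLLDA-specific sampling equations are not preserved in the transcript. For reference, the collapsed Gibbs update for standard LDA, which samplers in this family restrict with the label constraints above (counts n exclude the current token i; V is the vocabulary size):

\[
P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \left(n_{d,k}^{-i} + \alpha\right) \cdot \frac{n_{k,w_i}^{-i} + \beta}{n_{k,\cdot}^{-i} + V\beta}
\]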

Experiments
Algorithms
  Baseline: LDA, LLDA, zLDA
  Proposed: HLLDA (L = T), HLLDA (L < T)
Evaluation metrics
  Prediction accuracy: the higher the better
  Clustering F-measure: the higher the better
  Variation of information: the lower the better
Datasets
  Dataset       N       V        T
  Reuters       21,073  32,848   20
  News20        19,997  82,780   20
  Delicious.5K  5,000   890,429
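
For reference, variation of information between a predicted clustering X and the ground truth Y is the standard information-theoretic distance

\[
\mathrm{VI}(X, Y) = H(X) + H(Y) - 2\,I(X; Y),
\]

where H is entropy and I is mutual information; identical clusterings give VI = 0.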

Experiment: Questions
Q1. How does mixing heterogeneous label information improve classification and clustering performance?

[Charts: multi-class prediction accuracy and clustering F-measure for the compared methods]

Experiment: Questions
Q2. How does HLLDA improve performance under partial labeling?
  Partially labeled corpus: |D_labeled| << |D_unlabeled|
  Partially labeled documents: |L| << |T|
    For a document, the provided label set covers only a subset of the topics the document belongs to; the goal is to predict the full set of topics for each document.

Partially Labeled Documents: |L| << |T|

Partially Labeled Corpus: |D_labeled| << |D_unlabeled|

Experiment: Questions
Q3. How interpretable are the generated topics?
  Comparison between LLDA and HLLDA
  User study for topic quality

News-20: LLDA (10) vs. HLLDA (10)
[Topic word lists: LLDA(10) with 10 document labels; LLDA(10) with another 10 document labels; HLLDA(10) with 10 document labels]

Delicious.5K: LLDA (10) vs. HLLDA (10)
[Topic word lists: LLDA(10) with 10 document labels; LLDA(10) with another 10 document labels; HLLDA(10) with 10 document labels]

User Study for Topic Quality
[Chart: number of topically irrelevant (red) and relevant (blue) words per topic; more blue words indicate higher topic quality, more red words lower quality]

Conclusions
Proposed a novel algorithm for partially semi-supervised topic modeling
  Incorporates multiple kinds of heterogeneous domain knowledge that can be easily obtained in real life
  Supports two types of partialness: |L| << |T| and |D_labeled| << |D_unlabeled|
  A unified graphical model
Experimental results confirm that learning from multiple sources of domain information is beneficial (mutually reinforcing)
HLLDA outperforms existing semi-supervised methods on classification and clustering tasks

THANK YOU
Contact: young_park@us.ibm.com