Link Distribution on Wikipedia [0407] KwangHee Park

Table of contents  Introduction  Topic modeling  Preliminary Problem  Conclusion

Introduction  Why focused on Link  When someone make new article in Wikipedia, mostly they simply link to other language source or link to similar and related article. After that, that article to be wrote by others  Assumption  Link terms in the Wikipedia articles is the key terms which can represent specific characteristic of articles

Introduction  Problem what we want to solve is  To analyses latent distribution of set of Target document by topic modeling

Topic modeling  Topic  Topics are latent concepts buried in the textual artifacts of a community described by a collection of many terms that co-occur frequently in context Laura Dietz, Avaré Stewart 2006 ‘ Utilize Probabilistic Topic Models to Enrich Knowledge Bases’  T = {W i,…,W n }

Topic modeling  Bag of word assumption  The bag-of-words model is a simplifying assumption used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order. From Wikipedia  Each document in the corpus is represented by a vector of integer  {f 1,f 2,…f |w| }  F i = frequency of i th word  |w| = number of words

Topic modeling  Instead of directly associating documents to words, associate each document with some topics and each topic with some significant words  Document = {T n, T k,…,T m }  {Doc : 1 }  {T n : 0.4, T k : 0.3,… }

Topic modeling  Based upon the idea that documents are mixtures of topics  Modeling  Document  topic  term

Topic modeling  LSA  performs dimensionality reduction using the singular value decomposition.  The transformed word– document co-occurrence matrix, X, is factorized into three smaller matrices, U, D, and V.  U provides an orthonormal basis for a spatial representation of words  D weights those dimensions  V provides an orthonormal basis for a spatial representation of documents.

Topic modeling  pLSA Observed word distributions word distributions per topic Topic distributions per document

Topic modeling  LDA (Latent Dirichlet Allocation)  Number of parameters to be estimated in pLSA grows with size of training set  In this point LDA method has advantage  Alpha and beta are corpus-level documents that are sampled once in the corpus creating generative model (outside of the plates!) pLSA LDA

Topic modeling – our approach  Target  Document = Wikipedia article  Terms = the linked terms in the document  Modeling method  LDA  Modeling tool  LingPipe API
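
The slides name the (Java) LingPipe API as the tool; purely as a stand-in illustration of the same pipeline, here is a sketch with Python's gensim on hypothetical link-term data:

```python
from gensim import corpora, models

# Each "document" is the list of link terms extracted from one Wikipedia article
docs = [["cancer", "cell", "oncology"],
        ["cell", "dna", "genetics"],
        ["oncology", "treatment", "cancer"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
for bow in corpus:
    print(lda.get_document_topics(bow))  # topic distribution per article
```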

Advantage of linked terms  No extra preprocessing is needed  Boundary detection  Stopword removal  Word stemming  They carry more semantics  Correlation between term and document  e.g., "cancer" as a term in article A ↔ "Cancer" as a document (the link target)
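
A minimal sketch of pulling link terms out of raw wikitext (my illustration; the regex covers only simple [[target]] and [[target|anchor]] links):

```python
import re

WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def extract_link_terms(wikitext):
    """Return link targets from wikitext; no stemming or stopword removal needed."""
    return [m.group(1).strip() for m in WIKILINK.finditer(wikitext)]

print(extract_link_terms("Types of [[cancer]] include [[lung cancer|lung]] tumours."))
# ['cancer', 'lung cancer']
```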

Preliminary Problem  How well do the link terms in a document represent the specific characteristics of that document?  Link evaluation  Calculate similarity between documents

Link evaluation  Similarity based evaluation  Calculate similarity between terms  Sim_t{term1,term2}  Calculate similarity between documents  Sim_d{doc1,doc2}  Compare two similarity

Link evaluation  Sim_t  Similarity between terms  Not affected input term set  Sim_d  Similarity between documents  Significantly affected input term set p,q = topic distribution of each document Lin 1991

Link evaluation  Compare top 10 most similar each link  Ex )Link A  Term list most similar to A as term  Document list most similar to A as document  Compare two list – number of overlaps  Now under experiment

Conclusion  Topic modeling with the link distribution in Wikipedia  Need to measure how well the link distribution can represent each article's characteristics  After that, analyze the topic distribution in a variety of ways  We expect the topic distribution can be applied in many applications

Thank you