Link Distribution in Wikipedia [0324] KwangHee Park.


1 Link Distribution in Wikipedia [0324] KwangHee Park

2 Table of contents  Introduction  Cluster using LDA  Experiment  Disease, settlement  Demo  Considering Application

3 Introduction  Why focus on links?  When someone creates a new article in Wikipedia, they mostly just link to other-language versions or to similar and related articles. After that, the article is written out by others.  Assumption  The link terms in a Wikipedia article are the key terms that represent the specific characteristics of that article.
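As a rough illustration of what a "link term" could look like in practice, the sketch below pulls link targets out of raw wikitext. The slides do not say how links were extracted, so the regular expression, the function name `extract_link_terms`, the decision to skip namespaced links, and the sample text are all assumptions made for this example.

```python
import re

# Matches the target of [[Target]], [[Target#Section]] or [[Target|label]].
# Assumption: input is raw wikitext; the slides do not describe the extraction step.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def extract_link_terms(wikitext):
    """Return link targets in order of appearance, skipping namespaced links."""
    terms = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        # Skip File:, Category: and interlanguage links such as es:...
        if ":" in target:
            continue
        terms.append(target.replace(" ", "_"))
    return terms

# Hypothetical sample text, not taken from a real article.
sample = (
    "'''Starvation''' is a severe deficiency in [[calorie|caloric]] intake. "
    "See [[Malnutrition#Causes]] and [[File:Example.jpg|thumb]]."
)
print(extract_link_terms(sample))
```

Each article then becomes a list of link terms, which is the kind of input a topic model can consume.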

4 Introduction  The problem we want to solve:  Analyze the latent distribution of a set of target documents by clustering their link term sets.  Find the tendency of the latent distribution for a specific domain by limiting the input documents to that domain.

5 Process  Terminology  Term set = all terms in the input documents  Topic = set of terms  {W_i, …, W_n}  Document = set of terms  {W_k, W_l, …, W_n}  Document = mixture of topics  {T_n, T_k, …, T_m}  {Doc : 1}  {T_n : 0.4, T_k : 0.3, …}  Cluster the term set  Find the latent distribution of each document  Group by domain

6 LDA  The clustering technique  The LDA model consists of a fixed number of topics.  Each topic is modeled as a distribution over words.  A document under LDA is modeled as a distribution over topics.  [Diagram: Term Set; Topic 1, Topic 2, Topic 3, …, Topic n; Doc 1, Doc 2, Doc 3]
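To make the two distributions concrete, here is a minimal pure-Python sketch of LDA fitted by collapsed Gibbs sampling. The slides do not state which inference method or hyperparameters were used, so the sampler, the values of `alpha`, `beta` and `iterations`, and the toy link-term lists are assumptions for illustration only.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of link terms.

    Returns (doc_topic, topic_term): one topic distribution per document
    and one term distribution per topic.
    """
    rng = random.Random(seed)
    vocab = sorted({term for doc in docs for term in doc})
    term_id = {term: i for i, term in enumerate(vocab)}
    V = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic_count = [[0] * num_topics for _ in docs]
    topic_term_count = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every term occurrence.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for term in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            doc_topic_count[d][k] += 1
            topic_term_count[k][term_id[term]] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, term in enumerate(doc):
                w = term_id[term]
                k = assignments[d][n]
                # Remove the current assignment from the counts.
                doc_topic_count[d][k] -= 1
                topic_term_count[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic_count[d][t] + alpha)
                    * (topic_term_count[t][w] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = k
                doc_topic_count[d][k] += 1
                topic_term_count[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimates of the two families of distributions.
    doc_topic = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic_count)
    ]
    topic_term = [
        {vocab[w]: (row[w] + beta) / (topic_total[t] + V * beta) for w in range(V)}
        for t, row in enumerate(topic_term_count)
    ]
    return doc_topic, topic_term

# Hypothetical link-term lists for four articles.
docs = [
    ["Malnutrition", "Famine", "Vitamin", "Infection"],
    ["Infection", "Fever", "Vitamin", "Malnutrition"],
    ["County_seat", "Census", "Population", "Mayor"],
    ["Census", "Mayor", "Population", "Township"],
]
doc_topic, topic_term = lda_gibbs(docs, num_topics=2)
for row in doc_topic:
    print([round(p, 2) for p in row])
```

Each printed row is one document's distribution over topics, in the {T_n : 0.4, T_k : 0.3, …} form from the Process slide. A real run on thousands of link terms would normally use an optimized library rather than this loop.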

7 Experiment  Domain:  Disease  #Docs: 208  #Link terms:  English: 46615, Spanish: 34560, French: 31747, Chinese: 9286, Korean: 3272  Settlement  #Docs: 1328  #Link terms:  English: 372483, Spanish: 227950, French: 150921, Chinese: 93227, Korean: 38089  Number of topics:  10, 20, 30, 40, 50, 75, 100, 125, 150, 175, 200, 225, 250  Demo site:  http://143.248.135.30

8 Considering Application  Document classification  Classify the domain of a target document by calculating the similarity between the topic distributions of documents  Usage: template recommendation, …  Domain characteristic  [Chart: # of appearances / # of total docs versus topic number, for Disease and Settlement]
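The classification idea can be sketched as follows. The slide does not name a similarity measure or say how a domain is summarized, so cosine similarity, the per-domain mean profile, the helper names, and the 3-topic distributions below are all assumptions chosen for illustration.

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two topic distributions of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def mean_distribution(distributions):
    """Average topic distribution of a domain's documents (assumed domain profile)."""
    n = len(distributions)
    return [sum(column) / n for column in zip(*distributions)]

def classify_domain(doc_dist, domain_profiles):
    """Return the domain whose profile is most similar to the document."""
    return max(
        domain_profiles,
        key=lambda name: cosine_similarity(doc_dist, domain_profiles[name]),
    )

# Hypothetical topic distributions over 3 topics.
disease_docs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
settlement_docs = [[0.1, 0.2, 0.7], [0.2, 0.1, 0.7]]
profiles = {
    "Disease": mean_distribution(disease_docs),
    "Settlement": mean_distribution(settlement_docs),
}

print(classify_domain([0.65, 0.25, 0.10], profiles))
print(classify_domain([0.10, 0.15, 0.75], profiles))
```

Once the domain is known, a template recommendation could follow from a lookup of the template commonly used in that domain.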

9 Template recommendation  Examples: Starvation, Trenton,_New_Jersey  Starvation → Disease  Trenton,_New_Jersey → Settlement

10 Thanks

11 Domain characteristic  [Chart: # of appearances / # of total docs versus topic number, for Disease and Settlement]
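The ratio on the chart's axis, # of appearances / # of total docs, could be computed per topic along these lines. The slides do not define when a topic counts as "appearing" in a document, so the weight threshold, the function name, and the sample distributions are assumptions.

```python
def topic_appearance_rate(doc_topic, threshold=0.1):
    """For each topic, the fraction of documents where its weight reaches the threshold.

    Assumption: a topic "appears" in a document when its weight is >= threshold.
    """
    total = len(doc_topic)
    num_topics = len(doc_topic[0])
    return [
        sum(1 for dist in doc_topic if dist[t] >= threshold) / total
        for t in range(num_topics)
    ]

# Hypothetical topic distributions for four documents of one domain.
disease = [
    [0.70, 0.22, 0.08],
    [0.60, 0.35, 0.05],
    [0.90, 0.05, 0.05],
    [0.45, 0.15, 0.40],
]
print(topic_appearance_rate(disease))  # [1.0, 0.75, 0.25]
```

Computing this vector separately for the Disease and Settlement documents gives two profiles over topic numbers that can be compared side by side.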

