Constructing Folksonomies from User-Specified Relations on Flickr. Anon Plangprasopchok & Kristina Lerman, USC Information Sciences Institute.
Motivation. [Diagram labels: Users; Web content; Classification; user actions: Consume, Produce, Annotate, Organize, Discover; Annotation / Metadata; Leverage: Organize, Search, Recommend]
Inducing Folksonomy. GOAL: induce hidden classification hierarchies, “folksonomies”*, from user-generated metadata. Although metadata from an individual user may be too inaccurate and incomplete, metadata from different users may complement each other and become meaningful in combination. In this work, we explore strategies that combine metadata from many users and then induce folksonomies. * This definition differs somewhat from the original one by Thomas Vander Wal.
Outline: Motivation; Hierarchical Relations; Approaches; Results; Discussion; Related Work
Hierarchical Relations in Social Web. Appear implicitly: tags (e.g., Insect, Grasshopper, Australian, Macro, Orthoptera). Appear explicitly: folder (collection) and sub-folder (set) relations. Goal: to induce deeper hierarchies from this metadata.
Inducing Hierarchy from Tags: existing approaches. Graph-based [Mika05]: builds a network of associated tags (node = tag, edge = co-occurrence of tags) and suggests applying betweenness centrality and set theory to determine broader/narrower relations. Hierarchical clustering [Brooks06; Heymann06+]: tags that appear more frequently are likely to have higher centrality and thus to be more abstract. Probabilistic subsumption [Sanderson99+; Schmitz06]: x is broader than y if x subsumes y, where x subsumes y if p(x|y) > t and p(y|x) < t.
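The subsumption rule above can be sketched in a few lines of Python. This is only an illustration, not the cited authors' implementation: the function name `subsumption_relations`, the toy photo data, and the threshold value t = 0.8 are all assumptions made for the example. Probabilities are estimated from co-occurrence counts over tagged items.

```python
from collections import Counter
from itertools import permutations


def subsumption_relations(tagged_items, t=0.8):
    """Return sorted (x, y) pairs where x subsumes y, i.e. x is broader than y.

    tagged_items: iterable of tag collections, one per item (e.g. per photo).
    t: threshold from the rule p(x|y) > t and p(y|x) < t (0.8 is an assumption).
    """
    tag_count = Counter()   # number of items carrying each tag
    pair_count = Counter()  # number of items carrying both tags of an ordered pair
    for tags in tagged_items:
        unique = set(tags)
        tag_count.update(unique)
        pair_count.update(permutations(sorted(unique), 2))

    relations = []
    for (x, y), n_xy in pair_count.items():
        p_x_given_y = n_xy / tag_count[y]
        p_y_given_x = n_xy / tag_count[x]
        if p_x_given_y > t and p_y_given_x < t:
            relations.append((x, y))
    return sorted(relations)


# Hypothetical toy data: each set is the tag set of one photo.
photos = [
    {"insect", "grasshopper"},
    {"insect", "moth"},
    {"insect", "moth"},
    {"insect"},
    {"insect", "grasshopper", "macro"},
]

for broader, narrower in subsumption_relations(photos):
    print(f"{broader} -> {narrower}")
```

On this toy data the rule also reports "grasshopper" as broader than "macro", even though the two tags describe different aspects of the photo. A purely co-occurrence-based rule cannot tell such cases apart.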
Inducing Hierarchy from Tags: some difficulties when using tags to induce hierarchy. Example relations induced using the subsumption approach on tags [Sanderson99+, Schmitz06]: Washington → United States; Car → Automobile. Notation: A → B means A is broader than B (a hypernym relation). Difficulties: specificity; rarity; tags come from different facets (tags shown: Insect, Hongkong, Color, Brazilian).
Inducing Hierarchy from User-Specified Relations. User-specified relations include, e.g., Flickr’s Collection-Set, Delicious’ Bundle-Tag, and Bibsonomy’s Relation-Tag. Key intuition: few people specify peculiar relations such as one between “automobile” and “car”, or between “Washington” and “United States”. In this work, we concentrate on metadata from Flickr.
Remove noisy relations, 1st approach: Conflict Resolution (when both A → B and B → A appear). Relation conflicts occur because of noise. Voting scheme: keep A → B (and discard B → A) if N_u(A → B) > 1 and N_u(A → B) > N_u(B → A). Example: insect → butterfly (10) versus butterfly → insect (2).
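A minimal sketch of the voting scheme above, under stated assumptions: N_u is read as the number of users who specify a relation, the counts 10 and 2 are taken from the slide and assigned to insect → butterfly and butterfly → insect respectively, and relations with no reverse counterpart are simply passed through because the slide only describes the conflicting case. The function name `resolve_conflicts` is hypothetical.

```python
def resolve_conflicts(user_counts):
    """Apply the voting scheme to conflicting relations.

    user_counts maps (broader, narrower) -> number of users specifying it.
    A relation with a reverse counterpart is kept only if more than one user
    specified it and it has strictly more votes than the reverse direction.
    Relations without a reverse counterpart are kept unchanged (assumption).
    """
    kept = {}
    for (a, b), n_ab in user_counts.items():
        n_ba = user_counts.get((b, a))
        if n_ba is None:
            kept[(a, b)] = n_ab          # no conflict to resolve
        elif n_ab > 1 and n_ab > n_ba:
            kept[(a, b)] = n_ab          # wins the vote
    return kept


counts = {
    ("insect", "butterfly"): 10,
    ("butterfly", "insect"): 2,
    ("insect", "moth"): 1,               # hypothetical non-conflicting relation
}
print(resolve_conflicts(counts))
```

Under this reading, a tie between the two directions discards both relations.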
Remove noisy relations, 2nd approach: Significance Test. Use a statistical significance test to decide whether A → B is significant. Null hypothesis: the observed relation A → B was generated by chance, via the random, independent generation of the individual concepts A and B. [Figure: distribution of the number of observations of A → B, with accept and reject regions. Is B narrower than A by chance?]
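The slide does not say which statistical test is used, so the following is only one possible reading of the null hypothesis, not the authors' method. It assumes a one-sided binomial test: if A appears as a parent and B as a child independently, a random relation is A → B with probability p0 = p(A as parent) · p(B as child), and the observed count is compared against Binomial(n, p0). The function names, the significance level 0.05, and the counts in the usage lines are all hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); exact sum, suitable for moderate n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_significant(n_ab, n_a_parent, n_b_child, n_total, alpha=0.05):
    """Reject the 'generated by chance' null hypothesis for relation A -> B.

    n_ab:        observed number of A -> B relations
    n_a_parent:  number of relations with A as the broader concept
    n_b_child:   number of relations with B as the narrower concept
    n_total:     total number of observed relations
    alpha:       significance level (0.05 is an assumption)
    """
    p0 = (n_a_parent / n_total) * (n_b_child / n_total)
    return binomial_tail(n_ab, n_total, p0) < alpha


# Hypothetical counts: about 4 chance co-occurrences are expected (500 * 0.1 * 0.08).
print(is_significant(15, 50, 40, 500))  # far above chance level
print(is_significant(4, 50, 40, 500))   # about what chance alone would produce
```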
Link Concepts. Link concepts together by simply assuming that the same terms refer to the same concept. Example relations: anim → bug; anim → insect; bug → insect; anim → moth; bug → moth; insect → moth.
Select path: linking relations from many users can cause a spaghetti graph. Nodes: anim (a), bug (b), insect (i), moth (m). Possible paths from anim to moth: 1) a → b → i → m; 2) a → i → m; 3) a → m; 4) a → b → m. Network bottleneck idea: “the flow bottleneck is a minimum flow capacity among all relations in the path”. 1) a → b → i → m [BN score = min(26, 1, 18) = 1]; 2) a → i → m [BN score = min(72, 18) = 18]; 3) a → m [BN score = min(10) = 10]; 4) a → b → m [BN score = min(26, 4) = 4].
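The bottleneck scoring above can be reproduced with a short sketch. The edge weights are read off the BN score computations on the slide. Keeping the path with the largest bottleneck score is an assumption here, since the slide lists the scores but does not state the selection rule. The function names are hypothetical.

```python
def all_paths(graph, src, dst, path=None):
    """Yield every simple path from src to dst in a weighted directed graph."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in graph.get(src, {}):
        if nxt not in path:  # avoid revisiting nodes
            yield from all_paths(graph, nxt, dst, path)


def bottleneck(graph, path):
    """Bottleneck (BN) score: the minimum edge weight along the path."""
    return min(graph[a][b] for a, b in zip(path, path[1:]))


def best_path(graph, src, dst):
    """Pick the path with the largest bottleneck score (assumed selection rule)."""
    return max(all_paths(graph, src, dst), key=lambda p: bottleneck(graph, p))


# Edge weights taken from the BN score computations on the slide.
graph = {
    "anim": {"bug": 26, "insect": 72, "moth": 10},
    "bug": {"insect": 1, "moth": 4},
    "insect": {"moth": 18},
}

for p in all_paths(graph, "anim", "moth"):
    print(" -> ".join(p), "BN score =", bottleneck(graph, p))
print("selected:", " -> ".join(best_path(graph, "anim", "moth")))
```

Enumerating all simple paths is exponential in the worst case, so this is only practical for small sub-graphs such as the one on the slide.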
Evaluation & Data Set. Hypothesis: an approach that takes explicit relations into account can induce better hierarchies. “Better” means more consistent with the reference hierarchy, obtained from the Open Directory Project (ODP). The ODP hierarchy is created by volunteer editors working under ODP guidelines.
Evaluation & Data Set (2). The baseline is the subsumption approach [Schmitz06], with collection and set terms used instead of tags to make it comparable. Data set: data from 17 user groups devoted to wildlife and naturalist photography; 21,792 of 39,922 users specify at least one collection; 110,543 unique terms (cf. 166,153 unique terms in ODP), with 15,495 terms in common.
Evaluation methodology. ODP has many sub-hierarchies, so comparing all of them with the induced ones is impractical. It is easier to compare after specifying a “root concept” and “leaf concepts”, i.e., a certain sub-tree to compare. [Diagram: relations (right after tokenization) → induce (remove noise + link) → induced hierarchy, compared against the reference hierarchy (ODP)]
Metrics. Taxonomic Overlap [adapted from Maedche02+]: measures structural similarity between two trees; for each node, it determines how many ancestor and descendant nodes overlap with those in the reference tree. Lexical Recall: measures how well an approach discovers concepts that exist in the reference hierarchy (coverage).
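Both metrics can be sketched as follows. This is a simplified reading, not the exact adapted metric from the talk: lexical recall is taken as the fraction of reference concepts that also appear in the induced hierarchy, and taxonomic overlap is taken as the average, over concepts shared by both trees, of the Jaccard overlap between each concept's ancestors-plus-descendants in the two trees. Trees are represented as child → parent dictionaries. All names and the toy hierarchies are hypothetical.

```python
def lexical_recall(induced_terms, reference_terms):
    """Fraction of reference concepts that the induced hierarchy contains."""
    reference_terms = set(reference_terms)
    return len(set(induced_terms) & reference_terms) / len(reference_terms)


def ancestors(parent, node):
    """All ancestors of node in a tree given as a child -> parent mapping."""
    found = set()
    while node in parent:
        node = parent[node]
        found.add(node)
    return found


def related(parent, node):
    """Ancestors and descendants of node."""
    descendants = {n for n in parent if node in ancestors(parent, n)}
    return ancestors(parent, node) | descendants


def terms(parent):
    """All concepts mentioned in a child -> parent mapping."""
    return set(parent) | set(parent.values())


def taxonomic_overlap(induced, reference):
    """Average Jaccard overlap of related-node sets over shared concepts."""
    scores = []
    for node in terms(induced) & terms(reference):
        a, b = related(induced, node), related(reference, node)
        if a | b:
            scores.append(len(a & b) / len(a | b))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy hierarchies (child -> parent).
reference = {"insect": "animal", "moth": "insect", "butterfly": "insect"}
induced = {"insect": "animal", "moth": "animal"}

print("lexical recall:", lexical_recall(terms(induced), terms(reference)))
print("taxonomic overlap:", taxonomic_overlap(induced, reference))
```

In the toy example the induced tree misses "butterfly" and attaches "moth" directly under "animal", so both scores fall below 1.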
Discussion. A simple strategy aggregates a large number of shallow relations specified by different users into a common, deeper hierarchy. Induced hierarchies are more consistent with ODP. Future work includes: term ambiguity; global structure; relation types; application to other datasets.
Related Work. Learning concept hierarchy from text data: syntactic-based [Hearst92, Caraballo99, Pasca04, Cimiano+05, Snow+06]; word clustering [e.g., Segal+02, Blei+03]. Inducing concept hierarchy from tags: graph-based and clustering-based [Mika05, Brooks+06, Heymann+06, Zhou07+]; probabilistic subsumption [Schmitz06]. Ontology alignment [e.g., Udrea+07]. Exploiting user-specified hierarchy: GiveALink [Markines06+].
Questions? Is the metric used in the evaluation meaningful? How scalable is the system? WordNet and ODP already exist; why do we need this system? How is this work related to ontology enrichment? Is it ethical to collect users’ data? … THANK YOU!