GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.



CONTENTS
- Motivation
- Problem statement
- Proposed approach
- Data type labelling
  - Experiments and results
- Application concept
  - Experiments and results
- Similar dataset identification
  - Experiments and results
- Conclusions and future work

MOTIVATION
- Annotation is the act of adding a note by way of comment or explanation.
- Unlike text documents, images and videos are searchable only when they have tags or annotations (i.e., content).
- Recently, genomic and archaeological databases have also been annotated for indexing.

ANNOTATING RESEARCH DATASETS
- Research datasets come with no context, so popular search engines can hardly find them.
- Goal: make the dataset visible and informative.

EXAMPLE OF STRUCTURED ANNOTATION

PROBLEM STATEMENT
- Given a dataset name “D” as a string of English characters, the research task is to generate semantic annotations for the dataset denoted by “D” in the following categories:
  - Characteristic data type
  - Application domain
  - List of similar datasets
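The three annotation categories can be pictured as one record per dataset. The sketch below is only an illustration: the class name `DatasetAnnotation`, its field names, and the sample values are assumptions, not something defined in the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetAnnotation:
    """Hypothetical container for the three annotation categories."""
    name: str  # the dataset name "D", which is the only input
    data_types: List[str] = field(default_factory=list)  # characteristic data type labels
    application_concepts: List[str] = field(default_factory=list)  # application domain descriptors
    similar_datasets: List[str] = field(default_factory=list)  # names of similar datasets


# Illustrative values only; they are not taken from the slides.
example = DatasetAnnotation(
    name="example-dataset",
    data_types=["graph"],
    application_concepts=["social network analysis"],
    similar_datasets=["another-dataset"],
)
print(example)
```

Keeping the three categories as separate list fields means each can be filled by its own step (classification, concept extraction, similarity ranking) without touching the others.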

PROPOSED APPROACH
Research challenges:
- There is no universal schema for describing the content of a dataset.
- The common attribute across datasets is the dataset name.
- There is no well-known structure for semantic annotation of research datasets.
- The proposed structure should positively impact a user's search for datasets.

CONTEXT GENERATION
- Critical step: how to generate useful context for a dataset.
- Source of context: usage of the dataset in research (research articles and journals).
- Get a proxy using web knowledge: the Google Scholar search engine.
- The top-50 results are used to build the context for the dataset, called its “global context”.

IDENTIFYING DATA TYPE LABELS
- For a dataset ‘D’:
  - Given: the global context of ‘D’ and a list of data types.
  - Required: the data type of ‘D’.
- Approach: supervised multi-label classification.
- Feature construction:
  0. Preprocessing of the global context (stop-word removal, etc.).
  1. BOW and TF-IDF representation of the global context of ‘D’.
  2. Dimensionality reduction by PCA, retaining 98% of the variance.
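The feature-construction steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the stop-word list, the tokenizer, and the `tf * log(N / df)` weighting are choices made here, and all function names are hypothetical. PCA itself is not implemented; `components_for_variance` only shows how a 98% variance threshold picks the number of components once per-component variances are available.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real one would be much larger.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def bag_of_words(contexts):
    """Return the sorted vocabulary and one term-count Counter per context."""
    counts = [Counter(tokenize(c)) for c in contexts]
    vocab = sorted(set().union(*counts))
    return vocab, counts


def tfidf_matrix(contexts):
    """Build dense TF-IDF rows using tf * log(N / df)."""
    vocab, counts = bag_of_words(contexts)
    n_docs = len(contexts)
    # Document frequency: number of contexts that contain each term.
    df = Counter(term for c in counts for term in c)
    rows = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero for empty contexts
        rows.append([(c[t] / total) * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows


def components_for_variance(variances, coverage=0.98):
    """Smallest number of leading components whose variance share reaches coverage."""
    total = sum(variances)
    running = 0.0
    for k, v in enumerate(sorted(variances, reverse=True), start=1):
        running += v
        if running / total >= coverage:
            return k
    return len(variances)


# Made-up global contexts, for illustration only.
contexts = [
    "graph of social network",
    "network traffic graph",
    "image classification benchmark",
]
vocab, rows = tfidf_matrix(contexts)
print(vocab)
print(components_for_variance([90.0, 9.0, 1.0]))
```

The BOW representation corresponds to the raw counts returned by `bag_of_words`; the TF-IDF rows reweight those counts so that terms shared by many contexts contribute less.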

EXPERIMENTS AND RESULTS
- Ground truth: author-provided data type labels.
- Baseline: ZeroR classifier.
- Evaluation metrics: typical multi-label classification metrics (Tsoumakas et al., 2010).
[Table: dataset statistics. Columns: Dataset, Instances, Label count, Label density, Label cardinality. Rows: SNAP, UCI. Values not preserved.]
[Tables: results for the SNAP dataset and the UCI dataset. Columns: Measure, ZeroR, AdaBoostMH (one table with TF-IDF features, one with BOW features). Rows: F-measure ↑, Average Precision ↑, Macro AUC ↑. Values not preserved.]
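Label cardinality and label density, the two statistics named in the dataset table, follow the standard multi-label definitions: cardinality is the average number of labels per instance, and density is cardinality divided by the total number of labels. A minimal sketch, with hypothetical function names and made-up label sets:

```python
def label_cardinality(label_sets):
    """Average number of labels per instance."""
    return sum(len(s) for s in label_sets) / len(label_sets)


def label_density(label_sets, n_labels):
    """Label cardinality divided by the total number of labels."""
    return label_cardinality(label_sets) / n_labels


# Made-up label sets for four instances; not data from the slides.
label_sets = [
    {"graph"},
    {"graph", "text"},
    {"image"},
    {"text", "time-series"},
]
print(label_cardinality(label_sets))
print(label_density(label_sets, n_labels=4))
```

The label count is passed explicitly as `n_labels` because the full label list may contain labels that never occur in a given sample.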

CONCEPT GENERATION
- Given a dataset ‘D’, find k descriptors (n-gram words) for the application of the dataset.
- Approach: concept extraction from world knowledge (Wikipedia, DBpedia).
- Input feature: global context of ‘D’.
- Preprocessing of the global context.
- Text analytics tools (AlchemyAPI) are used for concept generation.
- Pruning of input query terms.

EXPERIMENTS AND RESULTS  Baseline: Context generated from the short description provided by the owner. Text pre-processing was done.  Evaluation metrics: user rating. Comparison of average user rating on UCI and SNAP dataset. UCI datasetSNAP dataset

IDENTIFYING SIMILAR DATASETS
- Given a dataset ‘D’, find the k most similar datasets from a list of datasets.
- Approach: cosine similarity between the TF-IDF vectors of the global context of ‘D’ and the global context of each d_i in the list of datasets.
- Top-k selection from the list ranked in descending order of similarity.
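The ranking step described above can be sketched as follows. The vectors are assumed to be TF-IDF vectors of each dataset's global context, already built over a shared vocabulary; the function names, dataset names, and numbers are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(query_name, vectors, k):
    """Rank all other datasets by cosine similarity to query_name, descending."""
    query = vectors[query_name]
    scored = [
        (cosine(query, vec), name)
        for name, vec in vectors.items()
        if name != query_name
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Made-up TF-IDF vectors over a three-term vocabulary.
vectors = {
    "A": [1.0, 0.0, 1.0],
    "B": [1.0, 0.0, 0.5],
    "C": [0.0, 1.0, 0.0],
    "D": [0.2, 1.0, 0.1],
}
print(top_k_similar("A", vectors, k=2))
```

Cosine similarity ignores vector length, so a dataset with a long global context is not favoured over one with a short context that uses the same terms in the same proportions.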

EXPERIMENTS AND RESULTS  Ground truth: dataset categorization provided by the dataset repository owners. Different categorization for SNAP and UCI.  Baseline: Context generated from owner’s description.  Evaluation metrics: SNAP datasetUCI dataset

USE CASE: SYNTHETIC QUERYING
- Synthetic querying on the annotated database of research datasets.
- 50 queries on the SNAP database and 50 queries on the UCI database.
- Query structure: "find a dataset used for … like …"; the slot values are randomly generated from their respective lists.
- Evaluation metric: overlap between the context of the retrieved results and the input query.
- Baseline: querying Google and extracting dataset names from the retrieved results.
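The slides do not define the overlap metric precisely. One plausible reading, shown here purely as an assumption, is the fraction of query terms that also occur in the context of the retrieved result. The function names and the sample strings are hypothetical.

```python
import re


def terms(text):
    """Lowercased alphabetic tokens of a text, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def query_overlap(query, retrieved_context):
    """Fraction of query terms that also occur in the retrieved context (assumed metric)."""
    q = terms(query)
    if not q:
        return 0.0
    return len(q & terms(retrieved_context)) / len(q)


# Made-up query and context, for illustration only.
print(query_overlap("graph clustering", "A social graph used for community detection"))
```

Other readings, such as a symmetric Jaccard overlap, would fit the slide text equally well; the one-sided version is used here only because it keeps long contexts from being penalised.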

QUANTITATIVE AND QUALITATIVE EVALUATION
[Figure: comparison of Google results with the annotated database for a few samples.]

CONCLUSIONS AND FUTURE WORK  Real world datasets play an important role- testing and validation purposes.  General purpose search engines cannot find datasets due to lack of annotation.  A novel concept of structured semantic annotation of dataset- data type labels, application concepts, similar datasets.  Annotation generated using global context from the web corpus.  Data type labels identification using multi-label classifier- using web context helps to improve accuracy both for SNAP and UCI test datasets.

CONCLUSIONS AND FUTURE WORK  Concept generation using web context performs better than baseline based on user ratings.  Web context is not significantly helpful in identifying similar datasets for UCI and SNAP datasets.  18% improvement in accuracy over normal datasets search using Google ( for synthetic queries).  Future work: finding an overall encompassing structure of annotation ; extending analysis across different domains.

THANK YOU