Introduction Dataset search

Slides:



Advertisements
Similar presentations
WSCD INTRODUCTION  Query suggestion has often been described as the process of making a user query resemble more closely the documents it is expected.
Advertisements

Jean-Eudes Ranvier 17/05/2015Planet Data - Madrid Trustworthiness assessment (on web pages) Task 3.3.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
Basic IR: Queries Query is statement of user’s information need. Index is designed to map queries to likely to be relevant documents. Query type, content,
Image Search Presented by: Samantha Mahindrakar Diti Gandhi.
Mapping Between Taxonomies Elena Eneva 11 Dec 2001 Advanced IR Seminar.
Presented by Zeehasham Rasheed
Query Biased Snippet Generation in XML Search Yi Chen Yu Huang, Ziyang Liu, Yi Chen Arizona State University.
Overview of Search Engines
CHAMELEON : A Hierarchical Clustering Algorithm Using Dynamic Modeling
Personalization in Local Search Personalization of Content Ranking in the Context of Local Search Philip O’Brien, Xiao Luo, Tony Abou-Assaleh, Weizheng.
Tag Clouds Revisited Date : 2011/12/12 Source : CIKM’11 Speaker : I- Chih Chiu Advisor : Dr. Koh. Jia-ling 1.
1 Wikification CSE 6339 (Section 002) Abhijit Tendulkar.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
Fourth Edition Discovering the Internet Discovering the Internet Complete Concepts and Techniques, Second Edition Chapter 3 Searching the Web.
A Probabilistic Graphical Model for Joint Answer Ranking in Question Answering Jeongwoo Ko, Luo Si, Eric Nyberg (SIGIR ’ 07) Speaker: Cho, Chin Wei Advisor:
Pete Bohman Adam Kunk. Real-Time Search  Definition: A search mechanism capable of finding information in an online fashion as it is produced. Technology.
A Graph-based Friend Recommendation System Using Genetic Algorithm
Keyword Searching and Browsing in Databases using BANKS Seoyoung Ahn Mar 3, 2005 The University of Texas at Arlington.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Presenter: Shanshan Lu 03/04/2010
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Q2Semantic: A Lightweight Keyword Interface to Semantic Search Haofen Wang 1, Kang Zhang 1, Qiaoling Liu 1, Thanh Tran 2, and Yong Yu 1 1 Apex Lab, Shanghai.
Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.
Keyword Query Routing.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
Data Mining for Web Intelligence Presentation by Julia Erdman.
WIRED Week 3 Syllabus Update (next week) Readings Overview - Quick Review of Last Week’s IR Models (if time) - Evaluating IR Systems - Understanding Queries.
How Useful are Your Comments? Analyzing and Predicting YouTube Comments and Comment Ratings Stefan Siersdorfer, Sergiu Chelaru, Wolfgang Nejdl, Jose San.
An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages G. Rohit Bharadwaj Niket Tandon Vasudeva Varma Search and Information.
Digital Libraries1 David Rashty. Digital Libraries2 “A library is an arsenal of liberty” Anonymous.
A Novel Visualization Model for Web Search Results Nguyen T, and Zhang J IEEE Transactions on Visualization and Computer Graphics PAWS Meeting Presented.
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
Toward Semantic Search: RDFa based facet browser Jin Guang Zheng Tetherless World Constellation.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Text Information Management ChengXiang Zhai, Tao Tao, Xuehua Shen, Hui Fang, Azadeh Shakery, Jing Jiang.
Third Edition Discovering the Internet Discovering the Internet Complete Concepts and Techniques, Second Edition Chapter 3 Searching the Web.
Using Blog Properties to Improve Retrieval Gilad Mishne (ICWSM 2007)
Discovery and Metadata March 9, 2004 John Weatherley
Harnessing the Deep Web : Present and Future -Tushar Mhaskar Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy January 7,
An Effective Statistical Approach to Blog Post Opinion Retrieval Ben He, Craig Macdonald, Jiyin He, Iadh Ounis (CIKM 2008)
Semantic Graph Mining for Biomedical Network Analysis: A Case Study in Traditional Chinese Medicine Tong Yu HCLS
HOW TO USE GOOGLE WEBMASTER TOOLS TO IMPROVE SEO ? GOOGLE WEBMASTEER.
Information Retrieval in Practice
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Search Engine Architecture
Based on Menu Information
Multimedia Content-Based Retrieval
Kenneth Baclawski et. al. PSB /11/7 Sa-Im Shin
Associative Query Answering via Query Feature Similarity
Multimedia Information Retrieval
Web IR: Recent Trends; Future of Web Search
Applying Key Phrase Extraction to aid Invalidity Search
#VisualHashtags Visual Summarization of Social Media Events using Mid-Level Visual Elements Sonal Goel (IIIT-Delhi), Sarthak Ahuja (IBM Research, India),
Measuring Data Quality and Compilation of Metadata
Lecture 5: Leave no relevant data behind: Data Search
Presented by: Prof. Ali Jaoua
Martin Rajman, EPFL Switzerland & Martin Vesely, CERN Switzerland
Intent-Aware Semantic Query Annotation
ece 627 intelligent web: ontology and beyond
Disambiguation Algorithm for People Search on the Web
Magnet & /facet Zheng Liang
MAPO: Mining and Recommending API Usage Patterns
Journal of Web Semantics 55 (2019)
Presentation transcript:

Towards More Usable Dataset Search: From Query Characterization to Snippet Generation

Introduction Dataset search Existing efforts mostly focus on metadata management, result filtering, and dataset browsing. Little attention has been given to the queries, which may differ considerably from general Web search queries. We analyze real data needs, reveal the complex constitution of a typical dataset search query, where various elements of both metadata and data content can be mentioned.

Introduction A new query-centered framework Our progress in two aspects Characterizing data needs Generating dataset snippets

Contribution We collect real data needs from diverse sources, derive 1,947 dataset search queries and semantically annotate them using a novel fine- grained scheme. Our analysis provides implications for enhancing dataset search. Complementary to query-independent illustrative dataset snippet, we extract a query-biased snippet from the content of a dataset by adapting a state-of-the-art algorithm for the group Steiner tree problem. We conduct a user study to evaluate the usefulness of these two types of snippets, and also quantitatively analyze their features.

Characterization of data needs Collection and preprocessing Four data sources: Stack Overflow Open Data Stack Exchange Reddit data.gov.uk “looking for dataset” human experts 1,947 dataset search queries 200 data needs I am looking for datasets that lists the location of accidents or traffic (latitude and longitude) with date and time in many countries. I found datasets for USA and UK, now looking for datasets for other countries. Any type of road accident would be great. location of accidents or traffic with date and time in many countries

Characterization of data needs Analysis A query in average contains 9.22 words. More than half (58.60%) of the queries contain 5-11 words. Three types of queries are identified: phrases (63.33%), keywords (31.38%) sentences (5.29%)

Characterization of data needs Analysis Category % of queries Example query Metadata Name 3.54% HUST-ASL Dataset Domain/Topic 94.45% weather dataset with solar radiance and solar energy production Data Format 16.23% jpg images for all unicode characters Language 3.90% annotated movie review dataset in German Accessibility 7.40% open source handwritten English alphabets dataset Provenance 0.21% FDA datasets about medicine name and the result has adverse events Statistics 2.98% dataset contains at least 1000 examples of opinion articles Overall 96.05%

Characterization of data needs Analysis Category % of queries Example query Content Concept 50.59% dataset about people, include gender, ethnicity, name Geospatial 19.21% judicial decisions in France Other Entities 0.41% datasets with nutrition data for many commercial food products (i.e., Lucky Charms, Monster Energy, Nutella, etc.) Temporal 9.35% 2011–2013 MoT failure rates on passenger cars Other Numbers 1.59% businesses that employ over 1000 people in Yorkshire region Overall 63.79%

Characterization of data needs Implications It would be insufficient to only take metadata into account in indexing, ranking, and result presentation like Google Dataset Search. (60.61% of the queries mention both metadata and data content) The constitution of a dataset search query is complex, requiring novel semantic parsing techniques to process, understand, and finally improve the accuracy of search. Data needs collected from different sources exhibit different features. The results of analyzing a single type of data needs may not be generalizable.

Snippet generation methods Query-biased Snippet Query is treated as a set of keywords after removing stop words. Each keyword is mapped to a set of nodes in the graph-structured data content. Finding an optimal connected subgraph that spans at least one mapped node from each query keyword. Optimality is defined by the minimization of total edge weights. Illustrative Snippet Query-independent, optimal size-constrained connected subgraph Optimality is defined by a linear combination of covering the most frequent schema-level elements, i.e., classes and properties covering the most central instance-level elements, i.e., entities.

Snippet generation Preliminary Results

Related work Dataset search Snippet Generation Query Analysis Google Dataset Search and other prototype systems exploit metadata but ignore the content of a dataset. Snippet Generation None of existing static snippets can precisely explain how the dataset is relevant to the query. Query Analysis Limited data sources - national open data portals Lexical annotation

Conclusion On the way of developing a more usable dataset search engine, we summarize our current findings as Real data needs mention both metadata and data content. The constitution of a dataset search query is complex. Snippet generation for dataset search should combine query relevance with schema coverage