GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.

CONTENTS  Motivation  Problem statement  Proposed approach  Data type labelling  Experiments and results  Application concept  Experiments and results  Similar dataset identification  Experiments and results  Conclusions and future work

MOTIVATION  Annotation is the act of adding a note by way of comment or explanation.  Unlike text documents, images and videos are searchable only when they have tags or annotations (i.e. content).  Recently, genomic and archaeological databases have been annotated for indexing.

ANNOTATING RESEARCH DATASETS  Research datasets carry no context, so popular search engines have difficulty finding them.  Goal: make the dataset visible and informative.

EXAMPLE OF STRUCTURED ANNOTATION

PROBLEM STATEMENT  Given a dataset name “D” as a string of English characters, the research task is to generate semantic annotations for the dataset denoted by “D” in the following categories:  Characteristic data type  Application domain  List of similar datasets

PROPOSED APPROACH Research challenges  No universal schema for describing the content of a dataset.  The only common attribute is the dataset name.  No well-known structure for semantic annotation of research datasets.  The proposed structure should positively impact the user’s search for datasets.

CONTEXT GENERATION Critical step: how to generate useful context for a dataset. Source of context: usage of the dataset in research (research articles and journals). Get a proxy using web knowledge: the Google Scholar search engine. The top-50 results are used to build the context for the dataset, called the “global context”.

IDENTIFYING DATA TYPE LABELS  For a dataset ‘D’: Given: global context of ‘D’, a list of data types. Required: data type of ‘D’.  Approach: supervised multi-label classification. Feature construction: 0. Preprocessing of the global context (stop word removal etc.). 1. BOW and TFIDF representation of the global context of ‘D’. 2. Dimensionality reduction by PCA, retaining 98% of the variance.
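The feature construction steps above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the stop-word list, the tokenizer, the TF-IDF weighting variant (term frequency times log inverse document frequency) and the function names are all assumptions, and the PCA step is left out to keep the sketch dependency-free.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the slides do not name one).
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in", "on", "is", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (tok.strip(".,;:!?()\"'") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

def bag_of_words(text):
    """BOW representation: term -> raw count."""
    return Counter(tokenize(text))

def tfidf_vectors(contexts):
    """contexts: dict of dataset name -> global-context text.
    Returns dict of dataset name -> {term: tf-idf weight}."""
    bows = {name: bag_of_words(text) for name, text in contexts.items()}
    n_docs = len(bows)
    # Document frequency: in how many contexts each term appears.
    doc_freq = Counter()
    for bow in bows.values():
        doc_freq.update(bow.keys())
    vectors = {}
    for name, bow in bows.items():
        total = sum(bow.values())
        vectors[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bow.items()
        }
    return vectors
```

With this weighting, a term that occurs in every context gets weight 0, so only terms that distinguish one dataset's context from the others contribute to the feature vector.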

EXPERIMENTS AND RESULTS
Dataset   Instances   Label count   Label density   Label cardinality
SNAP      42          5             0.34            1.69
UCI       110         4             0.275           1.1
Ground truth: author-provided data type labels. Baseline: ZeroR classifier. Evaluation metrics: typical multi-label classification metrics (Tsoumakas et al. 2010).
SNAP dataset
Measure               ZeroR   AdaBoostMH (tfidf)
F-measure ↑           0.025   0.172
Average Precision ↑   0.657   0.663
Macro AUC ↑           0.5     0.555
UCI dataset
Measure               ZeroR   AdaBoostMH (BOW)
F-measure ↑           0.854   0.873
Average Precision ↑   0.908   0.924
Macro AUC ↑           0.5     0.54
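For readers unfamiliar with the label statistics reported on this slide, the sketch below shows the standard definitions: label cardinality is the average number of labels per instance, and label density is cardinality divided by the number of distinct labels. The function names and the toy label sets are hypothetical. The reported figures are consistent with these definitions (1.69 / 5 ≈ 0.34 for SNAP, 1.1 / 4 = 0.275 for UCI).

```python
def label_cardinality(label_sets):
    """Average number of labels assigned per instance."""
    return sum(len(labels) for labels in label_sets) / len(label_sets)

def label_density(label_sets, n_labels):
    """Label cardinality normalised by the number of distinct labels."""
    return label_cardinality(label_sets) / n_labels

# Toy example (hypothetical data, not from the SNAP or UCI experiments).
toy = [{"graph", "text"}, {"graph"}, {"image"}, {"text", "image", "graph"}]
cardinality = label_cardinality(toy)   # 7 labels over 4 instances
density = label_density(toy, n_labels=3)
```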

CONCEPT GENERATION  Given a dataset ‘D’, find k descriptors (n-gram words) for the application of the dataset.  Approach: concept extraction from world knowledge (Wikipedia, DBpedia).  Input feature: global context of ‘D’.  Preprocessing of the global context.  Used text analytics tools (AlchemyAPI) for concept generation.  Pruning of input query terms.

EXPERIMENTS AND RESULTS  Baseline: context generated from the short description provided by the owner. Text pre-processing was done.  Evaluation metric: user rating. [Figure: comparison of average user rating; panels: UCI dataset, SNAP dataset]

IDENTIFYING SIMILAR DATASETS  Given a dataset ‘D’, find k-most similar datasets from a list of datasets.  Approach: cosine similarity between TFIDF vectors of global-context of ‘D’ and global-context of d_i in list of datasets.  Top-k selection from list ranked in descending order.
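The ranking step above can be sketched as follows, assuming each dataset's global context has already been turned into a sparse TF-IDF vector (a dict of term -> weight). The function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict term -> weight)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty context is treated as similar to nothing
    return dot / (norm_u * norm_v)

def top_k_similar(target, vectors, k):
    """Rank every other dataset by similarity to `target`, descending,
    and return the first k (name, score) pairs."""
    scores = [
        (name, cosine(vectors[target], vec))
        for name, vec in vectors.items()
        if name != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Toy vectors (hypothetical, for illustration only).
toy_vectors = {
    "D": {"graph": 1.0, "network": 1.0},
    "d1": {"graph": 1.0, "network": 1.0},
    "d2": {"image": 1.0},
    "d3": {"graph": 1.0},
}
ranked = top_k_similar("D", toy_vectors, k=2)
```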

EXPERIMENTS AND RESULTS  Ground truth: dataset categorization provided by the dataset repository owners. Different categorization for SNAP and UCI.  Baseline: context generated from the owner’s description.  Evaluation metric: precision@k. [Figure panels: SNAP dataset, UCI dataset]
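A minimal sketch of precision@k as it is commonly defined: the fraction of the top-k retrieved datasets that are relevant. Treating "relevant" as "in the same repository category as the query dataset" is an assumption based on the ground truth described above; the function name and example names are hypothetical.

```python
def precision_at_k(ranked_names, relevant_names, k):
    """Fraction of the first k ranked items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_names[:k]
    hits = sum(1 for name in top_k if name in relevant_names)
    return hits / k

# Hypothetical ranking and ground-truth category members.
ranking = ["d1", "d3", "d2"]
same_category = {"d1", "d2"}
p_at_2 = precision_at_k(ranking, same_category, k=2)
p_at_3 = precision_at_k(ranking, same_category, k=3)
```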

USE CASE: SYNTHETIC QUERYING  Synthetic querying on the annotated database of research datasets.  50 queries on the SNAP database and 50 queries on the UCI database.  Query structure: find a dataset used for like  are randomly generated from their respective lists.  Evaluation metric: overlap between the context of the retrieved results and the input query.  Baseline: querying Google and extracting dataset names from the retrieved results.

QUANTITATIVE AND QUALITATIVE EVALUATION Comparison of Google results with annotated DB for a few samples

CONCLUSIONS AND FUTURE WORK  Real-world datasets play an important role in testing and validation.  General-purpose search engines cannot find datasets due to lack of annotation.  A novel concept of structured semantic annotation of datasets: data type labels, application concepts, similar datasets.  Annotations are generated using global context from the web corpus.  Data type label identification uses a multi-label classifier; web context helps to improve accuracy on both the SNAP and UCI test datasets.

CONCLUSIONS AND FUTURE WORK  Concept generation using web context performs better than the baseline, based on user ratings.  Web context is not significantly helpful in identifying similar datasets for the UCI and SNAP datasets.  18% improvement in accuracy over ordinary dataset search using Google (for synthetic queries).  Future work: finding an overall encompassing structure of annotation; extending the analysis across different domains.

THANK YOU
