Detecting Nearly Duplicated Records in Location Datasets
Microsoft Research Asia, Search Technology Center
Yu Zheng, Xing Xie, Shuang Peng, James Fu

Presentation transcript:


Background
  Web maps and local search engines are frequently used.
  The quality of these services depends on geographic data.

Background
  Name             | Address                  | GPS Position | Phone Num. | Category | Type
  The Matt's Bar   | 701 5th Ave, Seattle, WA |              |            | Café     | YP
  Silver Cloud Inn | 314 7th Ave, Redmond, WA |              |            | Hotel    | POI

  Point of interest (POI)
    Collected by people holding GPS-enabled devices in the physical world
    Accurate GPS coordinates
    Less accurate address
  Yellow page (YP)
    Entered by people in a cyber environment, e.g., online
    Accurate address
    Inaccurate GPS coordinates (translated by geocoding)

Problem
  Nearly duplicated POIs
    The same entity in the physical world
    With slightly different presentations of fields such as name and address
  Caused by multiple sources
    Different vendors and channels
    Different types: POI and YP
  Results
    Makes data management harder
    Confuses users
  Example: "Seattle Premier Outlet Mall" vs. "Seattle Premium Outlet"

What we do
  Infer the similarity between two location entities
    Using a machine learning approach
    Considering multiple fields: name, address, coordinates, categories
  Identify some useful features
  Evaluate our method on real datasets

Methodology
  Compute similarities between two entities
    Name similarity
    Address similarity
    Category similarity
  Train an inference model
    Using these similarities as features
    On a small human-labeled training set
  Apply the model to a large-scale dataset
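The training step can be illustrated with a minimal sketch. The experiments in this deck use decision trees with bagging; the code below swaps in one-level decision stumps to stay short, and every function name, the feature order (name, address, category similarity), and the toy training data are assumptions made for illustration, not the authors' implementation.

```python
import random
from collections import Counter

def train_stump(rows, labels):
    """Pick the (feature, threshold) split with the fewest training errors.

    The stump predicts "duplicate" (1) when the feature value is >= threshold.
    """
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            errors = sum((r[f] >= t) != bool(y) for r, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    _, f, t = best
    return f, t

def train_bagging(rows, labels, n_models=15, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx]))
    return models

def predict(models, row):
    """Majority vote over all stumps: 1 = duplicate, 0 = non-duplicate."""
    votes = Counter(int(row[f] >= t) for f, t in models)
    return votes.most_common(1)[0][0]

# Made-up training pairs: (name_sim, address_sim, category_sim) and a label.
TOY_ROWS = [(0.9, 0.8, 0.7), (0.8, 0.9, 0.9), (0.7, 0.7, 0.8), (0.85, 0.75, 0.9),
            (0.1, 0.2, 0.3), (0.3, 0.1, 0.2), (0.4, 0.3, 0.1), (0.2, 0.4, 0.4)]
TOY_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]
```

Calling `predict(train_bagging(TOY_ROWS, TOY_LABELS), pair_features)` classifies a new pair of records. Bagging reduces the variance of a single tree, which matters when the labeled set is small.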

Name similarity

Address similarity
  The geospatially closer two records are, the higher the probability that they are near duplicates.
  Example: the same building can have two different address presentations
    79 Beaver St, New York, NY
    Water St, New York, NY
  City structure

Address similarity
  Insert YP data into the city structure according to their addresses
  Calculate the mean coordinates of each leaf node
  Insert POI data into the city structure according to their coordinates
  Find the co-parent node of the two records in the structure
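The four steps above can be sketched as follows. This is only an illustration under stated assumptions: the city structure is modeled as address paths such as (city, district, street), the nearest leaf is found with plain Euclidean distance on coordinates, and the score (shared path depth divided by full depth) is a hypothetical choice, since the slides do not give the formula.

```python
from collections import defaultdict
from math import hypot

def build_leaf_centroids(yp_records):
    """Group YP records by address path and average their coordinates.

    yp_records: iterable of (path, (lat, lon)), where path is a tuple
    such as (city, district, street). Returns {path: mean (lat, lon)}.
    """
    groups = defaultdict(list)
    for path, coord in yp_records:
        groups[tuple(path)].append(coord)
    return {
        path: (sum(c[0] for c in coords) / len(coords),
               sum(c[1] for c in coords) / len(coords))
        for path, coords in groups.items()
    }

def nearest_leaf(centroids, coord):
    """Place a POI at the leaf whose mean coordinates are closest to it."""
    return min(centroids,
               key=lambda p: hypot(centroids[p][0] - coord[0],
                                   centroids[p][1] - coord[1]))

def co_parent_depth(path_a, path_b):
    """Number of leading path components the two records share."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth

def address_similarity(path_a, path_b):
    """Assumed score in [0, 1]: a deeper co-parent means more similar."""
    return co_parent_depth(path_a, path_b) / max(len(path_a), len(path_b))
```

For a YP record and a POI record, the YP path comes from its address and the POI path from `nearest_leaf`; the two paths are then compared with `address_similarity`.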

Category similarity
  Map each entity to a category hierarchy
  Find the co-parent node of the two entities
  The lower (deeper) the level of the co-parent node, the higher the similarity
  E.g., some shops provide coffee, lunch, and wine at the same time, so different people would classify these shops into different categories.
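A minimal sketch of the co-parent idea for categories, assuming a single-rooted hierarchy stored as a child-to-parent map. The category names, the `PARENT` table, and the normalization by the deeper entity's depth are all hypothetical; the slides do not specify the hierarchy or the scoring formula.

```python
# Hypothetical category hierarchy: child -> parent, with "All" as the root.
PARENT = {
    "Food & Drink": "All",
    "Lodging": "All",
    "Cafe": "Food & Drink",
    "Restaurant": "Food & Drink",
    "Wine Bar": "Food & Drink",
    "Hotel": "Lodging",
}

def path_from_root(category, parent=PARENT):
    """Return the list of categories from the root down to `category`."""
    path = [category]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]

def category_similarity(cat_a, cat_b, parent=PARENT):
    """Depth of the co-parent divided by the depth of the deeper entity.

    The root has depth 0, so categories that only share the root score 0.0
    and identical categories score 1.0.
    """
    path_a = path_from_root(cat_a, parent)
    path_b = path_from_root(cat_b, parent)
    shared = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared += 1
    co_parent_depth = max(shared - 1, 0)
    deepest = max(len(path_a), len(path_b)) - 1
    return co_parent_depth / deepest if deepest else 1.0
```

With this table, a shop labeled "Cafe" by one source and "Wine Bar" by another still gets a non-zero score, because the two labels meet at "Food & Drink" rather than at the root.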

Experiments - Settings
  Beijing dataset
    0.7 million entities in total: 0.3 million POIs and 0.4 million YPs
    Human labeled
  Model: decision tree + bagging
  Baselines
    Exact match
    Rule-based: edit distance and geo-distance

  Datasets | Training Set | Test Set | Total
  D        |              |          |
  D        |              |          |
  D        |              |          |
  D        |              |          |
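The rule-based baseline combines edit distance on names with geographic distance. A rough sketch follows, assuming Levenshtein distance, the haversine formula, and two thresholds (`max_edits`, `max_meters`) chosen purely for illustration; the slides do not state which distance definitions or cutoffs were used.

```python
from math import asin, cos, radians, sin, sqrt

def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters on a sphere of radius 6,371 km."""
    earth_radius_m = 6371000.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    h = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_m * asin(sqrt(h))

def rule_based_duplicate(rec_a, rec_b, max_edits=3, max_meters=100.0):
    """Flag two (name, lat, lon) records as duplicates if both rules pass.

    The thresholds are assumed values, not taken from the presentation.
    """
    name_a, lat_a, lon_a = rec_a
    name_b, lat_b, lon_b = rec_b
    close_names = edit_distance(name_a.lower(), name_b.lower()) <= max_edits
    close_points = haversine_m(lat_a, lon_a, lat_b, lon_b) <= max_meters
    return close_names and close_points
```

Fixed thresholds like these are brittle: a pair such as "Seattle Premier Outlet Mall" and "Seattle Premium Outlet" differs by more than a few edits, which is one motivation for learning the decision from several similarity features instead.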

Experiments - Results
  Single feature study
    S1 and S2 are name similarity
    S3 denotes address similarity
    S4 represents category similarity

Experiments - Results
  Feature combination

  Features | Duplicated Pre. | Duplicated Rec. | Non-duplicated Pre. | Non-duplicated Rec. | Overall accuracy

Experiments - Results

  Features          | Duplicated Pre. | Duplicated Rec. | Non-duplicated Pre. | Non-duplicated Rec. | Overall accuracy
  Exact match       |                 |                 |                     |                     |
  Rule-based method |                 |                 |                     |                     |
  Our approach      |                 |                 |                     |                     |

Conclusion
  A classification model using
    Name similarity
    Address similarity
    Category similarity
  Determines nearly duplicated location data
    With an overall accuracy of 0.89

Thanks!
Yu Zheng, Microsoft Research Asia