HmSearch: An Efficient Hamming Distance Query Processing Algorithm

Presentation transcript:

HmSearch: An Efficient Hamming Distance Query Processing Algorithm
Xiaoyang Zhang 1, Jianbin Qin 1, Wei Wang 1, Yifang Sun 1, Jiaheng Lu 2
1 University of New South Wales, Australia
2 Renmin University of China, China

School of Computer Science and Engineering
Motivation
–Identify near-duplicate web pages: each page is mapped to a simhash code, and similar pages receive similar codes.
–Chemical data: each compound is mapped into a binary code, and similar compounds receive similar codes.

More Applications
–Iris recognition
–Image retrieval
–C2LSH

Outline
Problem Definition
Framework
HmSearch
–Partitioning Scheme
–Signature Generation
–Enhanced Filtering
–Hierarchical Filtering and Verification
–Dimension Rearrangement
Conclusion
Experiment

Hamming Distance Query
Hamming distance: the number of positions at which the corresponding symbols differ, for two vectors of equal length.
Example: q: ABCD, v: ACCD, hd(q, v) = 1
Hamming distance query: given a database V of vectors, a query vector Q (all vectors have the same dimensionality N), and a Hamming distance threshold k, find all v_i in V such that hd(v_i, Q) <= k.
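
The definition above can be written down directly as a linear scan. The sketch below is only a baseline for stating the problem, not the HmSearch algorithm; the function names `hamming_distance` and `hamming_query` are made up for this illustration.

```python
from typing import List, Sequence


def hamming_distance(a: Sequence, b: Sequence) -> int:
    """Count the positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return sum(x != y for x, y in zip(a, b))


def hamming_query(database: List[Sequence], query: Sequence, k: int) -> List[Sequence]:
    """Return every vector v in the database with hd(v, query) <= k (brute force)."""
    return [v for v in database if hamming_distance(v, query) <= k]


# The example from the slide: q = ABCD and v = ACCD differ in one position.
print(hamming_distance("ABCD", "ACCD"))
print(hamming_query(["ABCD", "ACCD", "AXYD"], "ABCD", 1))
```

Every query touches every vector here, which is the cost that the partitioning and signature techniques on the later slides aim to avoid.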

Basic Idea
General framework:
1. We can handle k=1 efficiently (shown later).
2. So we transform a problem with larger k into several small k=1 problems by partitioning.
3. We filter by looking at each partition.
4. We verify at the end.
Example: split q and v into a left part and a right part. If hd(q, v) <= 1, then hd(q_left, v_left) = 0 or hd(q_right, v_right) = 0, i.e. at least one part of v is the same as in q. So if k = 1, v can be filtered by looking at each part.

Framework
[Flow diagram; only the stage labels are preserved]
–Indexing path: Data, Partitioning, Generating Signatures, Indexing, Index
–Query path: Query, Partitioning, Generating Signatures, Candidates0, Filtering, Candidates1, Verification, Results
–Techniques annotated on the diagram: General Partitioning Scheme; 1-variants and 1-deletion variants; Enhanced Filtering; Hierarchical Filtering and Verification; Dimension Rearrangement

Partitioning
Lower bound for the partition strategy: given q and v such that hd(q, v) <= k, if the N dimensions are divided into κ parts, there should be at least m partitions such that hd(q_part, v_part) <= k' (the expressions for m and k' are not preserved in this transcript).
In our algorithm, we choose κ (the chosen value is not preserved in this transcript) such that:
–When k is even, m = 1
–When k is odd, m = 2
–When k = 0 or 1, m = 1, hd = 0
–When k >= 2, hd <= 1
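
The m values listed above can be checked with a short pigeonhole experiment. The transcript does not preserve the number of partitions, so the sketch below assumes κ = ⌊(k+3)/2⌋, a choice that reproduces the slide's cases (m = 1 for even k, m = 2 for odd k, threshold 1). All function names, and the even-sized split, are illustrative assumptions.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def num_partitions(k):
    # ASSUMPTION: kappa = floor((k + 3) / 2); the slide's own formula is lost.
    return (k + 3) // 2


def filter_params(k):
    """Return (m, k') as listed on the slide: at least m parts with hd <= k'."""
    if k <= 1:
        return 1, 0          # k = 0 or 1: one part must match exactly
    return (1 if k % 2 == 0 else 2), 1


def split(vec, parts):
    """Cut vec into `parts` contiguous pieces of near-equal length."""
    base, extra = divmod(len(vec), parts)
    out, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        out.append(vec[start:start + size])
        start += size
    return out


def part_distances(q, v, k):
    """Hamming distance of q and v inside each partition."""
    parts = num_partitions(k)
    return [hamming_distance(a, b) for a, b in zip(split(q, parts), split(v, parts))]


q = [1, 0, 1, 1, 0, 0]
v = [1, 1, 1, 0, 0, 1]           # hd(q, v) = 3
print(part_distances(q, v, 3))   # distances inside the 3 partitions used for k = 3
```

Under this assumption, the reasoning is that a part with distance at least 2 consumes at least 2 of the k allowed mismatches, so at most ⌊k/2⌋ parts can exceed the threshold 1, which leaves at least κ − ⌊k/2⌋ parts within it.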

Signature Generation
1-variants: substitute each dimension with each domain value in turn (plus the vector itself).
v=[1, 2, 3] and Σ (domain) = [1, 2, 3]
1-val(v) = [1, 2, 3], [2, 2, 3], [3, 2, 3], [1, 1, 3], [1, 3, 3], [1, 2, 1], [1, 2, 2]
We index all 1-val(v), and when q comes in, we search q in the index.
OR
1-deletion-variants: substitute each dimension with '#' in turn.
v=[1, 2, 3]
1-del-val(v) = [#, 2, 3], [1, #, 3], [1, 2, #]
We index all 1-del-val(v), and when q comes in, we generate 1-del-val(q) and search all 1-del-val(q) in the index.
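
Both signature types can be sketched in a few lines. The code below is a minimal illustration: the function names and the dictionary-of-sets index are assumptions for this example, not the data structures used in HmSearch. It reproduces the two enumerations above and shows the 1-deletion-variant lookup for vectors within Hamming distance 1.

```python
from collections import defaultdict


def one_variants(v, domain):
    """v itself, plus every vector obtained by changing one dimension to another domain value."""
    out = [tuple(v)]
    for i, x in enumerate(v):
        for d in domain:
            if d != x:
                out.append(tuple(v[:i]) + (d,) + tuple(v[i + 1:]))
    return out


def one_deletion_variants(v):
    """Every vector obtained by replacing exactly one dimension with '#'."""
    return [tuple(v[:i]) + ("#",) + tuple(v[i + 1:]) for i in range(len(v))]


def build_deletion_index(vectors):
    """Map each 1-deletion variant to the ids of the vectors that produce it."""
    index = defaultdict(set)
    for vid, v in enumerate(vectors):
        for sig in one_deletion_variants(v):
            index[sig].add(vid)
    return index


def lookup_within_one(index, q):
    """Ids of indexed vectors sharing a 1-deletion variant with q, i.e. hd <= 1."""
    ids = set()
    for sig in one_deletion_variants(q):
        ids |= index.get(sig, set())
    return ids


print(one_variants([1, 2, 3], [1, 2, 3]))
print(one_deletion_variants([1, 2, 3]))
index = build_deletion_index([[1, 2, 3], [1, 3, 3], [3, 3, 1]])
print(lookup_within_one(index, [1, 2, 3]))
```

The trade-off visible here is that 1-variants put the enumeration cost into the index (one entry per domain value per dimension) and need a single lookup per query, while 1-deletion-variants keep the index at one entry per dimension and move the enumeration to query time.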

Enhanced Filter (Even)
Based on the formula before: when k (k>=1) is even, m = 1.
Example: if k = 2, the formula before gives m = 1, so a v with a single partition where hd(v_part, q_part) = 1 passes the filter and becomes a false positive.
However, we find that if k (k>=1) is even, v is qualified only in two situations:
1) m = 1, where hd(v_part, q_part) = 0
2) m = 2, where hd(v_part, q_part) <= 1
Using the enhanced filter, neither situation applies, so v is filtered.

Enhanced Filter (Odd)
Based on the formula before: when k (k>=1) is odd, m = 2.
Example: if k = 3, the formula before gives m = 2, so a v with two partitions where hd(v_part, q_part) = 1 passes the filter and becomes a false positive.
However, we find that if k (k>=1) is odd, v is qualified only in two situations:
1) m = 2, where hd(v_part, q_part) <= 1 and at least one of them = 0
2) m = 3, where hd(v_part, q_part) <= 1
Using the enhanced filter, neither situation applies, so v is filtered.
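
The even and odd rules can be combined into one predicate over the per-partition distances. This is a sketch under the assumption that the vector was split into ⌊(k+3)/2⌋ partitions (the transcript does not preserve that formula); `passes_enhanced_filter` is a name made up for the illustration.

```python
def passes_enhanced_filter(part_dists, k):
    """Decide whether a candidate survives the enhanced filter.

    part_dists holds hd(v_part, q_part) for every partition.
    ASSUMPTION: the partitions come from splitting into floor((k + 3) / 2) parts.
    """
    exact = sum(d == 0 for d in part_dists)   # partitions that match exactly
    close = sum(d <= 1 for d in part_dists)   # partitions with at most one mismatch
    if k % 2 == 0:
        # Even k: one exact partition, or two partitions within distance 1.
        return exact >= 1 or close >= 2
    # Odd k: two partitions within distance 1 with one of them exact,
    # or three partitions within distance 1.
    return (close >= 2 and exact >= 1) or close >= 3


print(passes_enhanced_filter([1, 2], 2))     # the even-k false positive is rejected
print(passes_enhanced_filter([1, 1, 2], 3))  # the odd-k false positive is rejected
print(passes_enhanced_filter([0, 1, 2], 3))  # a candidate that can still qualify
```

The reason no true result is lost, under the stated assumption, is a counting argument: a candidate rejected for even k has at least 1 + 2(κ − 1) = k + 1 mismatches, and one rejected for odd k has at least 2 + 2(κ − 2) = k + 1.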

Hierarchical Filtering and Verification
v=[5, 0, 3, 6], q=[5, 2, 2, 5], |Σ| = 8, k=1
[Table: bits of v and q by significance (1st, 2nd, 3rd), with diff (XOR) and the cumulative OR of the diffs; the final OR is 0111, so hd(v, q) = 3]
So hd(v, q) >= 2, and v is filtered.
Moreover, even if k=4, calculating hd(v, q) = 3 directly takes 4 comparisons.
We can use binary operations to do hierarchical filtering and verification.

Hierarchical Filtering and Verification
v=[5, 0, 3, 6], q=[5, 2, 3, 5]
[Table: bits of v and q by significance (1st, 2nd, 3rd), with diff (XOR) and cumdiff (OR)]
Number of 1s in cumdiff: 1 (<= 1, continue), then 2 (> 1, filtered)
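
The XOR/OR procedure on these two slides can be sketched with one integer per bit level. Processing the levels from the least significant bit upward is an assumption of this sketch (the transcript does not preserve the table), and the function names are made up. The number of 1s in the running OR never decreases, so a candidate can be dropped as soon as it exceeds k, and after the last level it equals the exact Hamming distance.

```python
def bit_planes(vec, bits):
    """Pack bit j of every dimension into one integer per level (level 0 = least significant)."""
    planes = []
    for j in range(bits):
        plane = 0
        for x in vec:
            plane = (plane << 1) | ((x >> j) & 1)
        planes.append(plane)
    return planes


def cumdiff_counts(v, q, bits):
    """Number of 1s in the cumulative OR of the XOR diffs after each level."""
    counts, cumdiff = [], 0
    for pv, pq in zip(bit_planes(v, bits), bit_planes(q, bits)):
        cumdiff |= pv ^ pq                   # dimensions known to differ so far
        counts.append(bin(cumdiff).count("1"))
    return counts


def hierarchical_check(v, q, bits, k):
    """Return True iff hd(v, q) <= k, stopping at the first level that proves hd > k."""
    cumdiff = 0
    for pv, pq in zip(bit_planes(v, bits), bit_planes(q, bits)):
        cumdiff |= pv ^ pq
        if bin(cumdiff).count("1") > k:
            return False                     # filtered early
    return True                              # all levels seen: exact verification


# The two examples from the slides, with 3 bits per value (|Σ| = 8).
print(cumdiff_counts([5, 0, 3, 6], [5, 2, 2, 5], 3))
print(hierarchical_check([5, 0, 3, 6], [5, 2, 2, 5], 3, 1))
print(cumdiff_counts([5, 0, 3, 6], [5, 2, 3, 5], 3))
```

In a real index the bit levels of the data vectors would be precomputed once, so each level costs one XOR, one OR and one population count per machine word instead of one comparison per dimension.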

Impact of Data Skewness
Given k=2, then m = 1 and k'=1.
[Figure: v1, v2 and q by dimension under two different partitionings; values not preserved]
–In one arrangement, all vectors are qualified.
–In the other arrangement, only v1 is qualified.
We propose to reset the dimension order and the partition length to improve performance.

Greedy Dimension Rearrangement
MaxFreq is the maximum frequency of any value in each dimension.
[Figure: v1 and v2 by dimension, with MaxFreq per dimension and MaxFreq per partition, before and after the rearrangement that achieves the goal; values not preserved]
Our goal: minimize the global MaxFreq.
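
The objective can be made concrete with a small evaluation routine. The sketch below only measures MaxFreq for a given assignment of dimensions to partitions; it does not implement the greedy rearrangement itself, whose steps are not preserved in the transcript. Treating the MaxFreq of a partition as the count of its most common value combination is an assumption, as are the function names and the toy data.

```python
from collections import Counter


def max_freq_per_dim(vectors):
    """For each dimension, the count of its most frequent value."""
    n = len(vectors[0])
    return [max(Counter(v[i] for v in vectors).values()) for i in range(n)]


def max_freq_per_partition(vectors, partitions):
    """For each partition (a list of dimension indices), the count of the most
    frequent value combination. ASSUMPTION: this is what the slide calls
    'MaxFreq for partition'."""
    return [
        max(Counter(tuple(v[i] for i in dims) for v in vectors).values())
        for dims in partitions
    ]


def global_max_freq(vectors, partitions):
    """The quantity to minimize: the largest MaxFreq over all partitions."""
    return max(max_freq_per_partition(vectors, partitions))


# Toy data (made up): dimensions 0 and 1 are constant, 2 and 3 vary.
data = [[0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 1]]
print(max_freq_per_dim(data))
print(global_max_freq(data, [[0, 1], [2, 3]]))   # skewed dimensions grouped together
print(global_max_freq(data, [[0, 2], [1, 3]]))   # skewed dimensions spread out
```

A partition whose most common value combination is shared by many vectors gives the filter little power, because a query matching that combination pulls all of those vectors in as candidates; spreading the skewed dimensions across partitions lowers that worst case.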

Conclusion
1. General partition scheme
2. 1-variants and 1-deletion-variants
3. Techniques that help boost performance
–Enhanced Filtering
–Hierarchical Filtering and Verification
–Dimension Rearrangement

Experiment Settings
Environment
–Intel Xeon X GHz CPU, 4GB RAM
–Debian
–AMD Opteron™ GHz CPU, 96GB RAM (for PubChem)
–Ubuntu/Linaro ubuntu5
–All compiled with GCC with -O3
Dataset

Experiment Settings
Terms
–EF, Enhanced Filtering
–HB, Hierarchical Binary Filter
–RD, Rearranging Dimensions
Our algorithms
1. HSD, HSV: our proposed algorithms, the former using 1-deletion-variants as signatures and the latter using 1-variants as signatures
2. HSD-nEB, HSV-nEB: variations that remove EF and HB
3. HSD-nB, HSV-nB: variations that remove HB
4. HSD-nR, HSV-nR: variations that remove RD
Baseline algorithm
1. Scancount (Li et al., ICDE08)
State-of-the-art algorithms
1. Google (Manku et al., WWW07)
2. Hengine (Liu et al., ICDE11)

Query Time
HSV has the best performance.

Candidate Size
HSV has the smallest candidate size.

Effect of EF and HB
EF and HB help improve the performance.

Effect of RD
RD boosts the performance for the PubChem data.

Index Size
HSV and HSD have a larger index size.

Thank you