Geographically-Typed Geospatial Data Source Matching with High-Quality Clustering and Multi-Attribute Matching. Jeffrey Partyka, Dr. Latifur Khan, Dr. Bhavani Thuraisingham.

Presentation transcript:

Geographically-Typed Geospatial Data Source Matching with High-Quality Clustering and Multi-Attribute Matching
Jeffrey Partyka, Dr. Latifur Khan, Dr. Bhavani Thuraisingham
Funded by NGA & US Air Force

Topic Outline
- Problem Statement
- Background Information
- Matching Procedures
  - Generalized Solution
  - N-grams
  - Non-Geographic Matching (NGT Matching)
  - Geographic Matching (GT Matching)
  - Attribute Weighting
  - High-Quality Clustering
  - 1:N Matching
- Experimental Results
- Future Work

Motivation
- Internet Architecture
  - Highly Distributed
  - Federated Architecture
- Web Application Problems
  - Low Performance for Information Retrieval
  - Accuracy of Retrieved Information

Sample Scenario
Query: Publication of Academic Staff
Rank Data Source: MIT Ontology, Karlsruhe Ontology, UMBC Ontology
{Article, Book, Booklet, InBook, InCollection, InProceedings, Manual, Misc, Proceedings, Report, Technical Report, Project Report, Thesis, Master Thesis, PhD Thesis, Unpublished, Faculty Member, Lecturer}

Different Bibliography Ontologies: MIT Ontology, Karlsruhe Ontology, UMBC Ontology

Problem Statement: Schema Matching
Given two data sources, $S_1$ and $S_2$, each composed of a set of tables, where $S_1 = \{T_{11}, T_{12}, T_{13}, \ldots, T_{1k}, \ldots, T_{1m}\}$ and $S_2 = \{T_{21}, T_{22}, T_{23}, \ldots, T_{2j}, \ldots, T_{2n}\}$ with $1 \le k \le m$ and $1 \le j \le n$, determine the similarity between $T_{1k}$ and $T_{2j}$.
Example tables drawn from $S_1$ and $S_2$:
roadName | City
Johnson Rd. | Plano
School Dr. | Richardson
Zeppelin St. | Lakehurst
Alma Dr. | Richardson
Road | County
Custer Pwy | Cooke
15th St. | Collin
Parker Rd. | Collin
Alma Dr. | Collin
COUNTY | Destination
SNOHOMISH | Mukilteo
PIERCE | Point Defiance
KITSAP | Southworth
SNOHOMISH | Edmonds
City | County
Anacortes | Skagit
Friday Harbor | San Juan
Argyle | San Juan
Kirkland | King

Problem Statement: Ontology Matching
Given two ontologies, $O_1$ and $O_2$, each composed of a set of concepts, where $O_1 = \{C_{11}, C_{12}, C_{13}, \ldots, C_{1k}, \ldots, C_{1m}\}$ and $O_2 = \{C_{21}, C_{22}, C_{23}, \ldots, C_{2j}, \ldots, C_{2n}\}$ with $1 \le k \le m$ and $1 \le j \le n$, determine the similarity between $C_{1k}$ and $C_{2j}$.

Motivating Scenarios
1. Making Complex Business Decisions: "Should we invest in a new cholesterol drug for the Asia-Pacific region?" Input comes from R&D, Corporate, Marketing, Regulatory Affairs, and Manufacturing; the answer is Yes/No/Maybe.
2. Robust Semantic Web Applications: "Find the group of friends around Jeff. Then find the most important person out of the group. Find out if this person was at an event of type Meeting, and happened between 9AM-11AM within 5 miles of UTD." The query draws on a Social Network (Jeff, Jeff's friends), a Geospatial Ontology (within 5 miles of UTD), Temporal Logic (9:00am-11:00am), and an RDFS Lookup (event of type 'Meeting'); the answer is Yes/No/Maybe.

Matching Approaches
Mappings may be generated in several ways. Some approaches are:
(1) Name Matching
(2) Structure Matching
(3) Instance Matching
Example tables:
Address: County | DSP
Kitsap | Kingston
Wahkiak | Puget Island
Second table: COUNTY | NAME | CID (values on the slide: TRAIL RANGE DR, 96, KITSAP, 97)

Some Definitions
Definition 1 (attribute): An attribute of a table T, denoted as att(T), is defined as a property of T that further describes it.
Definition 2 (instance): An instance x of an attribute att(T) is defined as a data value associated with att(T).
Definition 3 (keyword): A keyword k of an instance x associated with attribute att(T) is defined as a meaningful word (not a stopword) representing a portion of the instance.

Some Definitions (cont)
Definition 4a (geographic type (GT)): A geographic type GT associated with attribute att(T) is defined as a class of instances of att(T) that represent the same geographic feature (e.g., "lake", "road").
Definition 4b (non-geographic type (NGT)): A non-geographic type NGT associated with attribute att(T) is defined as a group of keywords from instances of att(T) that are semantically related to each other.
Example keywords: Collin, Plano, Richardson, New Jersey, Trenton, Monmouth

Overview of Matching Algorithm
1. Select attribute pairs for comparison (attributes shown: roadName, roadType, city and town, rType, rName, county; selected pair: roadName and rName).
2. Match instances between compared attributes (instances shown: K Ave., Jupiter Rd., Coit Rd., L Ave., LBJ Freeway, US 75) by running the Sim algorithms.
3. Determine final attribute similarity (roadName and rName: EBD = .98).

Determining Semantic Similarity
We use Entropy-Based Distribution (EBD), a measurement of type similarity between two attributes (or columns):
$\mathrm{EBD} = \frac{H(C \mid T)}{H(C)}$
EBD takes values in the range [0, 1]. A greater EBD corresponds to more similar type distributions between the compared attributes (columns).

Applying EBD to Semantic Matching
Types of the instances of att1: X X X Y Y Z
Types of the instances of att2: X X Y Y Y Z
Entropy: $H(C) = -\sum_{c} p(c) \log p(c)$
Conditional entropy: $H(C \mid T) = -\sum_{t} \sum_{c} p(c, t) \log p(c \mid t)$
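The EBD ratio H(C|T) / H(C) can be worked through on the att1/att2 type sequences from this slide. The sketch below is not the authors' implementation. It assumes that C records which attribute (column) an instance came from, that T is the instance's type, and that both attributes are non-empty; the function names are made up for illustration.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

def ebd(types_att1, types_att2):
    """EBD = H(C|T) / H(C), assuming C = source column and T = instance type."""
    # Pool both columns into (column, type) observations.
    pairs = [("att1", t) for t in types_att1] + [("att2", t) for t in types_att2]
    total = len(pairs)

    # H(C): entropy of the column labels alone.
    h_c = entropy(Counter(c for c, _ in pairs).values())

    # H(C|T) = sum over types t of p(t) * H(C | T = t).
    h_c_given_t = 0.0
    for t, n_t in Counter(t for _, t in pairs).items():
        per_column = Counter(c for c, tt in pairs if tt == t)
        h_c_given_t += (n_t / total) * entropy(per_column.values())

    return h_c_given_t / h_c

att1 = ["X", "X", "X", "Y", "Y", "Z"]
att2 = ["X", "X", "Y", "Y", "Y", "Z"]
score = ebd(att1, att2)
print(round(score, 3))  # 0.976
```

Under this reading, two columns with identical type distributions score 1, and two columns with no types in common score 0, which agrees with the stated range of [0, 1].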

Matching Using N-grams
Use commonly occurring N-grams [2,3] in compared attributes to determine similarity (N = 2).
Table A: StrName | FENAME | Status
LOCUST-GROVE DR | LOCUST GROVE | BUILT
TRAIL RANGE DR | TRAIL RANGE | BUILT
Table B: Street | Laddress | Raddress (Street values shown: LOUISE-DOVER DR, CR45/MANET CT)
Some N-grams extracted from A.StrName = {LO, OC, CU, ST, OV, …}
Some N-grams extracted from B.Street = {LO, OU, UI, OV, …}
The N-grams (e.g., LO, OV, ST, UI) play the role of the types T when computing the conditional entropy H(C|T).
[2] Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar: Content-based ontology matching for GIS datasets. ACM SIGSPATIAL GIS 2008 (ACM GIS, Laguna Beach, California, Nov. 2008): 51.
[3] Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar: Ontology Alignment Using Multiple Contexts. 7th International Semantic Web Conference (ISWC) Karlsruhe, Germany, Oct
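As a rough illustration of the extraction step, the sketch below collects character 2-grams from the A.StrName and B.Street values on this slide and intersects them. The tokenization (uppercase, split on anything that is not a letter or digit) is an assumption; the slide does not say how hyphens, slashes, or spaces are handled.

```python
import re

def ngrams(value, n=2):
    """Distinct character n-grams of one instance, taken within each token."""
    grams = set()
    # Assumed tokenization: runs of letters/digits, case-folded to uppercase.
    for token in re.findall(r"[A-Za-z0-9]+", value.upper()):
        grams.update(token[i:i + n] for i in range(len(token) - n + 1))
    return grams

def attribute_ngrams(instances, n=2):
    """Union of the n-grams of every instance of an attribute."""
    grams = set()
    for value in instances:
        grams |= ngrams(value, n)
    return grams

str_name = ["LOCUST-GROVE DR", "TRAIL RANGE DR"]
street = ["LOUISE-DOVER DR", "CR45/MANET CT"]

# N-grams that occur in both attributes act as the shared types.
shared = attribute_ngrams(str_name) & attribute_ngrams(street)
print(sorted(shared))
```

The shared 2-grams are what make two columns comparable under this method, which is also why columns with no lexical overlap score poorly.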

Faults of this Method
Semantically similar columns are not guaranteed to have a high similarity score.
A ∈ T1: City | Country
Dallas | USA
Houston | USA
Kingston | Jamaica
Halifax | Canada
Mexico City | Mexico
B ∈ T2: ctyName | country
Shanghai | China
Beijing | China
Tokyo | Japan
New Delhi | India
Kuala Lumpur | Malaysia
2-grams extracted from A: {Da, al, la, as, Ho, ou, us, …}
2-grams extracted from B: {Sh, ha, an, ng, gh, ha, ai, Be, ei, ij, …}

Non-Geographic Matching
Example keywords: Dallas, USA, Houston, Tokyo, Beijing, Halifax, New Delhi, China, Jamaica, India, Malaysia
● Use clustering methods to group keywords of instances together without relying on shared N-grams between instances [4]
● K-means is not suitable because we cannot compute a centroid among string instances, so we use K-medoid clustering
● Use Normalized Google Distance (NGD) as a distance measure between any two keywords in a cluster
● WordNet would not be a suitable distance measure in the GIS domain
[4] Jeffrey Partyka, Latifur Khan, Bhavani M. Thuraisingham: Semantic Schema Matching without Shared Instances. 3rd IEEE International Conference on Semantic Computing (ICSC) Berkeley, California, September 2009:

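Definition of Google Distance
NGD(x, y) [7] is a measure for the symmetric conditional probability of co-occurrence of x and y:
$\mathrm{NGD}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log N - \min\{\log f(x), \log f(y)\}}$
where $f(x)$ is the number of pages containing $x$, $f(x, y)$ is the number of pages containing both $x$ and $y$, and $N$ is the total number of pages indexed.
[7] Cilibrasi, R., Vitányi, P.: The Google Similarity Distance. IEEE Trans. Knowledge and Data Engineering 19 (2007)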
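For reference, the NGD of Cilibrasi and Vitányi [7] is computed from page counts: the larger of log f(x) and log f(y), minus log f(x, y), divided by log N minus the smaller of log f(x) and log f(y). The sketch below evaluates it from counts supplied by the caller. The function name and the example counts are hypothetical, and returning infinity for terms that never co-occur is a convention chosen here; no search-engine API is called.

```python
import math

def ngd(f_x, f_y, f_xy, n_pages):
    """Normalized Google Distance from raw hit counts.

    f_x, f_y : number of pages containing x, respectively y
    f_xy     : number of pages containing both x and y
    n_pages  : total number of pages indexed (N)
    """
    if f_xy == 0:
        # Assumed convention: terms that never co-occur are infinitely far apart.
        return math.inf
    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(n_pages) - min(log_x, log_y))

# Made-up counts: two terms that always co-occur, then two that rarely do.
print(ngd(1000, 1000, 1000, 10**6))  # 0.0
print(ngd(1000, 1000, 10, 10**6))
```

Smaller values mean the two keywords tend to appear on the same pages, which is what makes NGD usable as the distance inside a keyword clustering step.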

Compared tables (Attribute 1 from T1 ∈ O1, Attribute 2 from T2 ∈ O2):
T1: roadName | City
Johnson Rd. | Plano
School Dr. | Richardson
Zeppelin St. | Lakehurst
T2: Road | County
Custer Pwy | Collin
15th St. | Collin
Parker Rd. | Collin
Step 1: Extract distinct keywords from compared attributes. Keywords extracted from attributes = {Johnson, Rd., School, 15th, …}
Step 2: Group distinct keywords together into semantic clusters, using K-medoid clustering with NGD as the instance similarity, e.g., {"Rd.", "Dr.", "St.", "Pwy", …} and {"Johnson", "School", "Dr.", …}
Step 3: Calculate similarity: Similarity = H(C|T) / H(C)
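The clustering in Step 2 can be sketched as a basic K-medoid loop with a pluggable distance. This is a simplified alternating variant, not necessarily the variant used by the authors. The `toy_distance` function is a hypothetical stand-in for NGD so the example runs offline, the keywords are assumed to be distinct, and the initial medoids are simply the first k keywords (real runs usually pick them at random).

```python
def k_medoids(items, k, distance, max_iter=100):
    """Group distinct items around k medoids using a caller-supplied distance."""
    medoids = list(items[:k])  # deterministic start, for reproducibility only
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for item in items:
            nearest = min(medoids, key=lambda m: distance(item, m))
            clusters[nearest].append(item)
        # Update step: the new medoid of a cluster is the member with the
        # smallest total distance to the other members.
        new_medoids = [
            min(members, key=lambda c: sum(distance(c, o) for o in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return clusters

SUFFIXES = {"Rd.", "Dr.", "St.", "Pwy"}

def toy_distance(a, b):
    """Hypothetical stand-in for NGD: near within a group, far across groups."""
    if a == b:
        return 0.0
    return 0.1 if (a in SUFFIXES) == (b in SUFFIXES) else 0.9

keywords = ["Johnson", "Rd.", "School", "Dr.", "Zeppelin", "St."]
clusters = k_medoids(keywords, k=2, distance=toy_distance)
for medoid, members in clusters.items():
    print(medoid, members)
```

Because a medoid is always one of the keywords, no centroid has to be computed over strings, which is the reason given for preferring K-medoid over K-means.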

Problems with Non-Geographic Matching via NGD + K-medoid
It is possible that two different geographic entities in the same location (e.g., Dallas, TX and Dallas County) will be mistaken for being similar:
roadName | City
Johnson Rd. | Plano
School Dr. | Richardson
Zeppelin St. | Lakehurst
Alma Dr. | Richardson
Preston Rd. | Addison
Dallas Pkwy | Dallas
Road | County
Custer Pwy | Cooke
15th St. | Collin
Parker Rd. | Collin
Alma Dr. | Collin
Campbell Rd. | Denton
Harry Hines Blvd. | Dallas
similarity = .797

Geographic Type Matching
We use a gazetteer to determine the geographic type (GT) of an instance [5,6].
Figure: instances of S1 and instances of S2 (Anacortes, Edmonds, Victoria, Clinton) are mapped to GTs; the GTs of Victoria and Clinton are marked "?".
[5] Jeffrey Partyka, Latifur Khan, Bhavani M. Thuraisingham: Geographically-Typed Semantic Schema Matching. In: Divyakant, A., Aref, W., Lu, C.T. et al. (eds.) ACM SIGSPATIAL GIS 2009, Seattle, Washington. ACM (Nov. 2009)
[6] Jeffrey Partyka, Latifur Khan, Bhavani M. Thuraisingham: Geospatial Schema Matching with High-Quality Clustering and Multi-Attribute Matching. Submitted to the 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2011, May 2011, Shenzhen, China).

Using Latlong Value to Enhance GT Matching

GSim: Combining NGT and GT Matching
We apply GT matching for an attribute comparison if >= 50% of the instances involved in the comparison have GT information. If this is not the case, then NGT matching is applied instead [1].
featureName | City
Collin Creek | Plano
White Rock Lake | Dallas
Dallas River | Lakehurst
Lake | County
Cooke Lake | Cooke
Mud Lake | Collin
Stone Briar Lake | Collin
Decision: >= 50% of instances have a GT? If yes, GT Matching (instances such as Cooke Lake, Mud Lake, Stone Briar Lake, Collin Creek); if no, NGT Matching (keywords such as Lake, Creek, River, Rock, Stone, Mud).
[1] Jeffrey Partyka, Pallabi Parveen, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar: Enhanced Geographically-Typed Semantic Schema Matching. To appear in the Journal of Web Semantics, 2011.
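The 50% rule can be written as a small dispatch function. This is a sketch under assumptions: the gazetteer is modeled as a plain dictionary from instance to geographic type, the entries below are invented for illustration, and the function name is hypothetical.

```python
def choose_matcher(instances, gazetteer, threshold=0.5):
    """Return "GT" when at least `threshold` of the instances have a
    geographic type in the gazetteer, otherwise "NGT"."""
    if not instances:
        return "NGT"
    typed = sum(1 for value in instances if value in gazetteer)
    return "GT" if typed / len(instances) >= threshold else "NGT"

# Hypothetical gazetteer: instance -> geographic type.
gazetteer = {
    "Collin Creek": "creek",
    "White Rock Lake": "lake",
    "Cooke Lake": "lake",
    "Mud Lake": "lake",
}

feature_name = ["Collin Creek", "White Rock Lake", "Dallas River"]
lake = ["Cooke Lake", "Mud Lake", "Stone Briar Lake"]

# All instances involved in the comparison are pooled before applying the rule.
print(choose_matcher(feature_name + lake, gazetteer))  # GT
```

Here four of the six pooled instances have a type, so GT matching is selected; with fewer than half typed, the comparison falls back to NGT matching.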

Attribute Weighting
We can distribute the weight of each attribute match based on its importance.
strAdd | city | state | zipCode
1000 Park Blvd. | Plano | TX |
Spring Valley Rd. | Richardson | TX |
Danube Ln. | Plano | TX |
Roehampton Dr. | Dallas | TX | 75252
Street Address | City | State | Zip
100 Genstar Dr. | Dallas | TX |
Spring Creek Rd. | Plano | TX |
Danube Ln. | Plano | TX |
Roehampton Dr. | Dallas | TX |
Weights shown for the four attribute matches: %, 23%, 26%, 24%

Measuring Attribute Match Importance
Attribute match importance is determined by:
1. Attribute Uniqueness
2. Attribute Relevance
Figure labels: attributes name, roadType, city, town, road_type, rName, cty Name, lakeType, type, lakename, destPort, county, edez_id; tables Roads, Ports, Lakes and Roads, Sea Ports, LakeFeatures, Dest.

Attribute Uniqueness
Determine the uniqueness of attributes att1 and att2 involved in a match (att1-att2) by clustering all attributes from all tables over $S_1$ and $S_2$.
Figure labels: cutoff 1, cutoff 2

Attribute Clustering
Use Intercluster Similarity (ICS) to decide if clusters A and B should merge.
Calculate the cutoff point (CP) to determine when to stop clustering.

Cutoff Point vs. # of Cluster Iterations

Calculating AU, corrected EBD value
Calculate AU for an attribute att in a match, with $AU_{att} \in [0, 1]$.
Calculate pairwise uniqueness (PU) for a match att1-att2: $PU_{att1,att2} = \mathrm{avg}(AU_{att1(T)}, AU_{att2(T')})$
Recalculate EBD between att1(T) and att2(T'): $EBD_{corr}(att1, att2) = EBD_{orig}(att1, att2) \times PU_{att1,att2}$
Attributes shown: rName, name (Roads), name (Ports), lakename, name (Lakes), Name (Sea Ports), destPort, Dest
Table columns: Att Match | PU att1-att2 | EBD orig | EBD corr, with rows Name (Ports) – Name (Sea Ports) and destPort – Dest
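The correction is plain arithmetic: PU is the average of the two AU values, and the corrected EBD is the original EBD multiplied by PU. The sketch below takes the AU values as given inputs, since the AU formula itself does not appear in the transcript. The numbers are made up and the function names are hypothetical.

```python
def pairwise_uniqueness(au_att1, au_att2):
    """PU for a match att1-att2: the average of the two AU values (each in [0, 1])."""
    return (au_att1 + au_att2) / 2.0

def corrected_ebd(ebd_orig, au_att1, au_att2):
    """EBD_corr = EBD_orig x PU."""
    return ebd_orig * pairwise_uniqueness(au_att1, au_att2)

# Made-up values: a strong raw match between one fully unique attribute
# and one attribute that is only half as unique.
pu = pairwise_uniqueness(1.0, 0.5)
print(pu)                           # 0.75
print(corrected_ebd(0.9, 1.0, 0.5))
```

Since PU never exceeds 1, the correction can only lower a score, so matches between generic attributes (such as several columns all called "name") are discounted relative to matches between distinctive ones.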

Attribute Weighting Algorithm

High-Quality Clustering
Due to the inherent randomness of clustering (e.g., choosing the initial centroid), EBD scores may not be stable [6].
We need a way to produce consistent EBD values:
- To eliminate EBD variability
- To provide a confidence value for our EBD value
- To guarantee that our EBD value was generated from a high-quality clustering
We proposed the following two cluster-based measures:
(1) Semantic Purity: the "meaning distance" between any two instances within the same cluster
(2) Geographic Purity: the GT purity of a given cluster

Cluster Purity Measures
Distance-based measure: $Imp_S$
Geographic-type measure
Objective function to be minimized: $O_{SSKM}$, with weights $W_i$
Example instances: Collin, Tarrant, Plano, Kaufman, Coppell, Richardson, regrouped as Collin, Tarrant, Kaufman and Plano, Coppell, Richardson

1:N Matching
Many relationships are not 1:1, but involve matching groups of entities.
Cmp: the Mailing Address column of a table with columns id | Mailing Address (values on the slide: 112 Plano Dr., Plano, TX, Coit Rd., Richardson, TX, Preston Rd., Dallas, TX 42 Hedgecoxe Rd.)
N1 to N4: the columns of the second table:
Street Address | City | State | Zip
100 Genstar Dr. | Dallas | TX |
Spring Creek Rd. | Plano | TX |
Danube Ln. | Plano | TX |
Roehampton Dr. | Dallas | TX | 75252

Defining 1:N Matching
1:N matching can be defined in many ways:
- Optimize similarity or value of N?
- Meronymy or Subsumption?
We chose to optimize similarity (EBD):
- Use EBD scores produced from 1-1 matches between Cmp and $N_k$ ($1 \le k \le N$)
- Apply a greedy algorithm to add attributes to match with Cmp based on decreasing EBD score (highest to lowest)
- Any 1:N match will minimize the set difference between GT(Cmp) and the union of the sets of GTs for the N matching attributes
- We do not include an attribute in a 1:N match if it would make the EBD of the current match decrease

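1:N Matching Example
Cmp: the Mailing Address column of a table with columns id | Mailing Address (values on the slide: 112 Plano Dr., Plano, TX, Coit Rd., Richardson, TX, Hedgecoxe Rd.)
N1 to N4: columns of the second table:
Street Address | City | State | Zip | County
100 Genstar Dr. | Dallas | TX | 75252 | Dallas
2091 Spring Creek Rd. | Plano | TX | 75075 | Collin
1704 Danube Ln. | Plano | TX | 75075 | Collin
Results table columns: Attribute | 1-1 EBD w/ Cmp | 1:N EBD, with rows Street Address (1-1 EBD = .81), City, State, Zip, and Final 1:N EBD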

1:N Matching From Type Perspective
Figure: the types (W, X, Y, Z) of the instances of Mailing Address are compared with the types of the instances of Street Address, City, State, and Zip, using the entropy H(C) and the conditional entropy H(C|T).

Greedy 1:N Matching Algorithm
program 1:N_Matching (S(T2), Sebd(T2)) {
    var E(T2) = Φ;
    S(T2) = Φ;
    Sebd(T2) = 0.0;
    GT_Cmp = getGTSet(Cmp);
    E(T2) = getMatchCandidates(Cmp, T2, GT_Cmp);
    E(T2) = orderByEBD(E(T2));
    for each att A ∈ E(T2), taken in order of max value of EBD(Cmp, A) {
        if (increaseEBD(Cmp, Sebd(T2))) {
            Emax = A;
            S(T2) = S(T2) ∪ {Emax};
            Sebd(T2) = addEBD(Sebd(T2), EBD(Cmp, Emax));
        }
        E(T2) = E(T2) – {A};
    }
}
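A runnable sketch of the greedy loop, under stated assumptions: the 1:1 EBD scores are passed in as a dictionary, and the EBD of Cmp against a group of attributes is supplied by the caller as `group_ebd`, because the transcript does not spell out how the combined score is computed. Only the .81 for Street Address comes from the slides; every other number, and the additive toy scorer, is invented for illustration.

```python
def greedy_one_to_n(candidates, pair_ebd, group_ebd):
    """Greedily choose the attributes of T2 that are matched against Cmp.

    candidates : attribute names from T2
    pair_ebd   : dict mapping attribute -> 1:1 EBD with Cmp
    group_ebd  : callable(list of attributes) -> EBD of Cmp against the group
                 (assumed helper, not defined in the slides)
    """
    selected = []
    best = 0.0
    # Visit candidates from highest to lowest 1:1 EBD.
    for att in sorted(candidates, key=lambda a: pair_ebd[a], reverse=True):
        score = group_ebd(selected + [att])
        # Skip an attribute that would make the EBD of the current match decrease.
        if score >= best:
            selected.append(att)
            best = score
    return selected, best

# Hypothetical scores (only .81 appears in the slides).
pair_ebd = {"Street Address": 0.81, "City": 0.70, "State": 0.60, "County": 0.20}
gain = {"Street Address": 0.81, "City": 0.08, "State": 0.05, "County": -0.10}

def toy_group_ebd(attrs):
    """Toy stand-in: each attribute adds (or removes) a fixed amount."""
    return sum(gain[a] for a in attrs)

match, score = greedy_one_to_n(list(pair_ebd), pair_ebd, toy_group_ebd)
print(match, round(score, 2))
```

With these toy numbers the loop keeps Street Address, City, and State and rejects County, because adding County would lower the running score.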

Proof of Correctness
Theorem 1 (greedy choice property for the 1:N matching algorithm): All choices for $Emax_x(T_2)$ will be present in an optimal 1:N match with $Cmp \in T_1$.
Proof: Suppose that $Sebd_N(T_2)$, for an arbitrary $S_N(T_2)$, produces an optimal EBD. Let us build a new set called $S2ebd_N(T_2)$ from $S2_N(T_2)$ such that every attribute included in $S2_N(T_2)$ represents a value of $Emax_x(T_2)$ for some $x$. Also, the cardinalities of $S_N(T_2)$ and $S2_N(T_2)$ are equal, and every attribute in $S_N(T_2)$ and $S2_N(T_2)$ is identical, except for an arbitrary attribute indexed by $r$ ($r \le N$). $Emax_r$ >= the EBD value produced between Cmp and attribute $r$ given in $S_N(T_2)$. Since all other attributes are equal between $S_N(T_2)$ and $S2_N(T_2)$, their associated 1:1 EBD scores with Cmp are also identical. Therefore, $EBD(Cmp, S2_N(T_2)) \ge EBD(Cmp, S_N(T_2))$, but since $S_N(T_2)$ produces an optimal EBD with Cmp through $Sebd_N(T_2)$, it follows that $EBD(Cmp, S2_N(T_2)) = EBD(Cmp, S_N(T_2))$. Thus, $S2_N(T_2)$ also produces an optimal EBD with Cmp through $S2ebd_N(T_2)$.

Proof of Correctness (cont)
Theorem 2 (optimal substructure property): Let $Sebd_{N-1}(T_2)$, $N > 1$, be the EBD score corresponding to the attribute match between $Cmp \in T_1$ and $S_{N-1}(T_2) \subseteq T_2$. If $Sebd_{N-1}(T_2)$ is an optimal EBD score, and $Sebd_N(T_2)$ is obtained by adding $Emax_x$ to $S_{N-1}(T_2)$, then $Sebd_N(T_2)$ must also be an optimal EBD score.
Proof: Assume that $S_N(T_2)$ was formed by adding $Emax_x$ to $S_{N-1}(T_2)$, but does not produce an optimal value of $Sebd_N(T_2)$. $Emax_x$ represents the attribute with the highest EBD score with Cmp to be included in $S_{N-1}(T_2)$ with respect to all other attributes in $E_x(T_2)$. This means that $S_{N-1}(T_2)$ contains some attribute indexed by $r$ ($r \le N-1$) whose EBD value is less than that of $Emax_r$. Thus, $Sebd_{N-1}(T_2)$ is not an optimal EBD score. This contradicts the statement above that $Sebd_{N-1}(T_2)$ is an optimal EBD score. Therefore, if $Sebd_{N-1}(T_2)$ is an optimal EBD score, and $Sebd_N(T_2)$ is obtained by adding $Emax_x$ to $S_{N-1}(T_2)$, then $Sebd_N(T_2)$ must be an optimal EBD score.
Theorem 3: Greedy 1:N matching produces a safe match with an optimal EBD score. This follows from Theorem 1 and Theorem 2.

Dataset Details
- GIS Transportation Dataset (GTD)
- GIS Location Dataset (GLD)

Dataset Details (cont)
GIS Point of Interest Dataset (GPD)
- Through all of our datasets, few shared instances exist
- Data is multijurisdictional in nature
- Number of attributes and instances differ

NGT Matching Over GTD

GT Matching Over GTD

The Effect of Latlong Values on Matching in GPD

The Effect of Attribute Weighting on Matching in GTD and GLD

Observing the Effects of Multiple Matching Methods over GPD
(1) GT matching
(2) GT matching + latlong
(3) GT matching + latlong + NGT matching
(4) GT matching + latlong + NGT matching + attribute weighting

GSim vs. N-grams, SVD, NMF & GSim G

1:N Matching Experiment Results
Experiment 1
T1 = {‘Address’}
T2 = {‘Street Address’, ‘City’, ‘State’, ‘Zip’}

1:N Matching Experiment Results (cont)
Experiment 2
T1 = {‘Island_Group’}
T2 = {‘Island1’, ‘Island2’, ‘Island3’, ‘Island4’, ‘Island5’, ‘Island6’}
‘Island6’ is not a part of ‘Island_Group’

Summary of Matching Methods
Match types: Exact Match, Synonym Match, GT Match, GT + Latlong Match, Hierarchical GT Match
Methods: N-grams, NGT Matching, GT Matching, GT + Latlong, GT + Cluster Purity, Final GSim (Ideal)

Hierarchical GT Matching
Use a GT hierarchy to match types with relationships between them (e.g., superclass/subclass, meronym/holonym, etc.)
GT hierarchy shown: Bodies of Water; Lakes, Rivers; Rapids, Streams
att1: Dell Lake, Dallas River, Coppell Stream
att2: HP Lake, Collin River, Plano Rapids
Steps: Get GT relations from the ontology, then calculate similarity.
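Looking up a relation between two GTs in a hierarchy can be sketched with a child-to-parent map. The parent links below are an assumption read off the flattened slide (Lakes and Rivers under Bodies of Water; Rapids and Streams under Rivers); the actual ontology may be shaped differently. The relation labels and function names are hypothetical.

```python
# Assumed child -> parent links; the slide's layout does not confirm them.
PARENT = {
    "Lakes": "Bodies of Water",
    "Rivers": "Bodies of Water",
    "Rapids": "Rivers",
    "Streams": "Rivers",
}

def ancestors(gt):
    """All types above gt, from nearest to farthest."""
    chain = []
    while gt in PARENT:
        gt = PARENT[gt]
        chain.append(gt)
    return chain

def gt_relation(a, b):
    """Classify how geographic type a relates to geographic type b."""
    if a == b:
        return "same"
    if b in ancestors(a):
        return "subclass"      # a sits below b
    if a in ancestors(b):
        return "superclass"    # a sits above b
    if set(ancestors(a)) & set(ancestors(b)):
        return "related"       # a and b share a common ancestor
    return "unrelated"

print(gt_relation("Rapids", "Rivers"))   # subclass
print(gt_relation("Streams", "Rapids"))  # related
```

A similarity step could then give partial credit to "subclass" or "related" pairs, such as Coppell Stream and Plano Rapids, instead of treating them as a mismatch.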

THANK YOU! ANY QUESTIONS?