
Geographically-Typed Geospatial Data Source Matching with High-Quality Clustering and Multi-Attribute Matching
Jeffrey Partyka, Dr. Latifur Khan, Dr. Bhavani Thuraisingham
Funded by NGA & US Air Force

Topic Outline
- Problem Statement
- Background Information
- Matching Procedures
  - Generalized Solution
  - N-grams
  - Non-Geographic Matching (NGT Matching)
  - Geographic Matching (GT Matching)
  - Attribute Weighting
  - High-Quality Clustering
  - 1:N Matching
- Experimental Results
- Future Work

Motivation
Internet Architecture
▫ Highly Distributed
▫ Federated Architecture
Web Application Problems
▫ Low Performance for Information Retrieval
▫ Accuracy of Retrieved Information

Sample Scenario
Query: Publication of Academic Staff
Task: rank the data sources (MIT Ontology, Karlsruhe Ontology, UMBC Ontology)
Concepts: {Article, Book, Booklet, InBook, InCollection, InProceedings, Manual, Misc, Proceedings, Report, Technical Report, Project Report, Thesis, Master Thesis, PhD Thesis, Unpublished, Faculty Member, Lecturer}

Different Bibliography Ontologies MIT Ontology Karlsruhe Ontology UMBC Ontology

Problem Statement: Schema Matching
Given 2 data sources, S1 and S2, each of which is composed of a set of tables, where T11, T12, T13, ..., T1k, ..., T1m ∈ S1 and T21, T22, T23, ..., T2j, ..., T2n ∈ S2, with 1 <= k <= m and 1 <= j <= n, determine the similarity between T1k and T2j.

Example tables from the two sources:

roadName     | City
Johnson Rd.  | Plano
School Dr.   | Richardson
Zeppelin St. | Lakehurst
Alma Dr.     | Richardson

Road        | County
Custer Pwy  | Cooke
15th St.    | Collin
Parker Rd.  | Collin
Alma Dr.    | Collin

COUNTY    | Destination
SNOHOMISH | Mukilteo
PIERCE    | Point Defiance
KITSAP    | Southworth
SNOHOMISH | Edmonds

City          | County
Anacortes     | Skagit
Friday Harbor | San Juan
Argyle        | San Juan
Kirkland      | King

Problem Statement: Ontology Matching
Given 2 ontologies, O1 and O2, each of which is composed of a set of concepts, where C11, C12, C13, ..., C1k, ..., C1m ∈ O1 and C21, C22, C23, ..., C2j, ..., C2n ∈ O2, with 1 <= k <= m and 1 <= j <= n, determine the similarity between C1k and C2j.

Motivating Scenarios
1. Making Complex Business Decisions
"Should we invest in a new cholesterol drug for the Asia-Pacific region?"
Departments involved: R & D, Corporate, Marketing, Regulatory Affairs, Manufacturing. Answer: Yes/No/Maybe?
2. Robust Semantic Web Applications
"Find the group of friends around Jeff. Then find the most important person out of the group. Find out if this person was at an event of type Meeting, and happened between 9AM-11AM within 5 miles of UTD"
Query parts: Jeff, Jeff's friends; within 5 miles of UTD; 9:00am-11:00am; event of type 'Meeting'.
Components: Social Network, Geospatial Ontology, Temporal Logic, RDFS Lookup. Answer: Yes/No/Maybe?

Matching Approaches
Mappings may be generated in several ways. Some approaches are:
(1) Name Matching
(2) Structure Matching
(3) Instance Matching
Example attribute names from the figure: Email, emailAddress.
Example instances from the figure: County / DSP (Kitsap / Kingston, Wahkiak / Puget Island); COUNTY / NAME / CID (TRAIL RANGE DR, 96, KITSAP, 97, ?).

Some Definitions
Definition 1 (attribute): An attribute of a table T, denoted as att(T), is defined as a property of T that further describes it.
Definition 2 (instance): An instance x of an attribute att(T) is defined as a data value associated with att(T).
Definition 3 (keyword): A keyword k of an instance x associated with attribute att(T) is defined as a meaningful word (not a stopword) representing a portion of the instance.

Some Definitions (cont)
Definition 4a (geographic type (GT)): A geographic type GT associated with attribute att(T) is defined as a class of instances of att(T) that represent the same geographic feature (e.g., "lake", "road").
Definition 4b (non-geographic type (NGT)): A non-geographic type NGT associated with attribute att(T) is defined as a group of keywords from instances of att(T) that are semantically related to each other.
Example keywords from the figure: Collin, Plano, Richardson, New Jersey, Trenton, Monmouth.

Topic Outline
- Problem Statement
- Background Information
- Matching Procedures
  - Generalized Solution
  - N-grams
  - Non-Geographic Matching (NGT)
  - GT Matching
  - Attribute Weighting
  - High-Quality Clustering
  - 1:N Matching
- Experimental Results
- Future Work

Overview of Matching Algorithm
1. Select attribute pairs for comparison (e.g., from {roadName, roadType, city} and {town, rType, rName, county}, select roadName and rName).
2. Match instances between compared attributes (e.g., K Ave., Jupiter Rd., Coit Rd., L Ave., LBJ Freeway, US 75).
3. Determine final attribute similarity: run Sim algorithms (e.g., roadName-rName EBD = .98).

Determining Semantic Similarity
We use Entropy-Based Distribution (EBD). EBD is a measurement of type similarity between 2 attributes (or columns):
EBD = H(C|T) / H(C)
EBD takes values in the range [0,1]. A greater EBD corresponds to more similar type distributions between the compared attributes (columns).

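Applying EBD to Semantic Matching
att1 types: X X X Y Y Z
att2 types: X X Y Y Y Z
Here C records which attribute (column) an instance comes from and T records its type.
Entropy: H(C) = - sum over c of p(c) log p(c)
Conditional entropy: H(C|T) = - sum over t of p(t) * sum over c of p(c|t) log p(c|t)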
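The EBD ratio can be worked through on the att1/att2 example from this slide. The following is a minimal Python sketch, assuming the standard Shannon definitions of entropy and conditional entropy with base-2 logarithms; the function names (entropy, conditional_entropy, ebd) are illustrative and not taken from the presentation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())


def conditional_entropy(columns, types):
    """H(C|T) = sum over types t of p(t) * H(C | T = t)."""
    n = len(types)
    by_type = {}
    for column, t in zip(columns, types):
        by_type.setdefault(t, []).append(column)
    return sum(len(cols) / n * entropy(cols) for cols in by_type.values())


def ebd(att1_types, att2_types):
    """EBD = H(C|T) / H(C) for the types observed in two attributes."""
    columns = ["att1"] * len(att1_types) + ["att2"] * len(att2_types)
    types = list(att1_types) + list(att2_types)
    h_c = entropy(columns)
    if h_c == 0:  # one attribute is empty, so there is nothing to compare
        return 0.0
    return conditional_entropy(columns, types) / h_c


score = ebd(list("XXXYYZ"), list("XXYYYZ"))
print(round(score, 3))
```

Under these assumptions, two attributes with identical type distributions give an EBD of 1, and two attributes that share no types give an EBD of 0.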

Matching Using N-grams
Use commonly occurring N-grams [2,3] in compared attributes to determine similarity (N = 2).

Table A (T_A):
StrName         | FENAME       | Status
LOCUST-GROVE DR | LOCUST GROVE | BUILT
TRAIL RANGE DR  | TRAIL RANGE  | BUILT

Table B (T_B):
Street          | Laddress | Raddress
LOUISE-DOVER DR | 1600     | 1798
CR45/MANET CT   | 2500     | 2598

Some N-grams extracted from A.StrName = {LO, OC, CU, ST, OV, ...}
Some N-grams extracted from B.Street = {LO, OU, UI, OV, ...}
N-gram types shown in the figure: LO, OV, ST, UI. The conditional entropy H(C|T) is computed over these N-gram types.

[2] Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar: Content-based ontology matching for GIS datasets. ACM SIGSPATIAL GIS 2008 (ACM GIS, Laguna Beach, California, Nov. 2008): 51.
[3] Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar: Ontology Alignment Using Multiple Contexts. 7th International Semantic Web Conference (ISWC), Karlsruhe, Germany, Oct. 2008.
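The 2-gram extraction described on this slide can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: it assumes that instances are upper-cased, split into alphanumeric tokens, and that N-grams do not cross token boundaries. The slide does not state these details, and the helper name ngrams is hypothetical.

```python
import re


def ngrams(value, n=2):
    """Return the character n-grams of each alphanumeric token in value."""
    grams = []
    for token in re.findall(r"[A-Za-z0-9]+", value.upper()):
        grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


a_grams = set(ngrams("LOCUST-GROVE DR"))
b_grams = set(ngrams("LOUISE-DOVER DR"))
shared = a_grams & b_grams  # N-gram types present in both instances
print(sorted(shared))
```

The shared N-grams play the role of the types T when the conditional entropy H(C|T) is computed for the two compared attributes.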

Faults of this Method
Semantically similar columns are not guaranteed to have a high similarity score.

A ∈ T1:
City        | Country
Dallas      | USA
Houston     | USA
Kingston    | Jamaica
Halifax     | Canada
Mexico City | Mexico

B ∈ T2:
ctyName      | country
Shanghai     | China
Beijing      | China
Tokyo        | Japan
New Delhi    | India
Kuala Lumpur | Malaysia

2-grams extracted from A: {Da, al, la, as, Ho, ou, us, ...}
2-grams extracted from B: {Sh, ha, an, ng, gh, ha, ai, Be, ei, ij, ...}

Non-Geographic Matching
Example keywords: Dallas, USA, Houston, Tokyo, Beijing, Halifax, New Delhi, China, Jamaica, India, Malaysia
● Use clustering methods to group keywords of instances together without relying on shared N-grams between instances [4]
● K-means is not suitable because we cannot compute a centroid among string instances, so we use K-medoid clustering
● Use Normalized Google Distance (NGD) as a distance measure between any two keywords in a cluster
● WordNet would not be a suitable distance measure in the GIS domain
[4] Jeffrey Partyka, Latifur Khan, Bhavani M. Thuraisingham: Semantic Schema Matching without Shared Instances. 3rd IEEE International Conference on Semantic Computing (ICSC), Berkeley, California, September 2009: 297-302.

Definition of Google Distance
NGD(x, y) [7] is a measure for the symmetric conditional probability of co-occurrence of x and y.
[7] Cilibrasi, R., Vitányi, P.: The Google Similarity Distance. IEEE Trans. Knowledge and Data Engineering 19, 370-383 (2007)
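For reference, the formula given by Cilibrasi and Vitányi [7] is NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}), where f(x) is the number of pages containing x, f(x, y) the number containing both terms, and N the total number of indexed pages. A minimal Python sketch follows; the page counts used in the example are made-up numbers for illustration, and returning infinity when a count is zero is an assumption of this sketch.

```python
from math import log


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts.

    fx, fy: pages containing each term; fxy: pages containing both;
    n: total number of indexed pages.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")  # assumption: terms that never co-occur are maximally distant
    lx, ly, lxy = log(fx), log(fy), log(fxy)
    return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))


# Hypothetical counts: two terms that appear on 1000 pages each, 10 of them together.
print(round(ngd(1000, 1000, 10, 10**6), 3))
```

Terms that always co-occur get a distance of 0, and the distance grows as co-occurrence becomes rarer.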

Non-Geographic Matching Procedure
Step 1: Extract distinct keywords from the compared attributes.
Keywords extracted from attributes = {Johnson, Rd., School, 15th, ...}
Step 2: Group distinct keywords together into semantic clusters (K-medoid + NGD instance similarity).
Example clusters: {"Rd.", "Dr.", "St.", "Pwy", ...} and {"Johnson", "School", "Dr.", ...}
Step 3: Calculate similarity.
Similarity = H(C|T) / H(C)

T1 ∈ O1 (Attribute 1):
roadName     | City
Johnson Rd.  | Plano
School Dr.   | Richardson
Zeppelin St. | Lakehurst

T2 ∈ O2 (Attribute 2):
Road        | County
Custer Pwy  | Collin
15th St.    | Collin
Parker Rd.  | Collin
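Step 2 above can be illustrated with a small K-medoid loop. This is a simplified Python sketch under stated assumptions: medoids are initialised deterministically from the first k items (real runs would typically pick them at random), and toy_dist is a hand-made stand-in for NGD with invented distances, not real search counts.

```python
def k_medoids(items, k, dist, max_iter=100):
    """Cluster items around k medoids; returns {medoid: [members]}."""
    medoids = list(items[:k])  # assumption: deterministic initialisation
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach each item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for x in items:
            nearest = min(medoids, key=lambda m: dist(x, m))
            clusters[nearest].append(x)
        # Update step: the new medoid minimises total distance within its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist(c, y) for y in members))
            for members in clusters.values()
            if members
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return clusters


# Hypothetical semantic groups standing in for NGD values.
GROUP = {"Rd.": "suffix", "Dr.": "suffix", "St.": "suffix",
         "Johnson": "name", "School": "name"}


def toy_dist(a, b):
    if a == b:
        return 0.0
    return 0.2 if GROUP[a] == GROUP[b] else 0.9


clusters = k_medoids(["Rd.", "Johnson", "Dr.", "School", "St."], 2, toy_dist)
print(sorted(sorted(members) for members in clusters.values()))
```

Each resulting cluster then acts as one non-geographic type (NGT) when EBD is computed in Step 3.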

Problems with Non-Geographic Matching via NGD + K-medoid
It is possible that two different geographic entities in the same location (e.g., Dallas, TX and Dallas County) will be mistaken for being similar:

roadName     | City
Johnson Rd.  | Plano
School Dr.   | Richardson
Zeppelin St. | Lakehurst
Alma Dr.     | Richardson
Preston Rd.  | Addison
Dallas Pkwy  | Dallas

Road              | County
Custer Pwy        | Cooke
15th St.          | Collin
Parker Rd.        | Collin
Alma Dr.          | Collin
Campbell Rd.      | Denton
Harry Hines Blvd. | Dallas

similarity = .797

Geographic Type Matching
We use a gazetteer to determine the geographic type (GT) of an instance [5,6].
Figure: instances of S1 and instances of S2 (e.g., Anacortes, Edmonds, Victoria, Clinton) are mapped to GTs; the GTs of Victoria and Clinton are marked "?".
[5] Jeffrey Partyka, Latifur Khan, Bhavani M. Thuraisingham: Geographically-Typed Semantic Schema Matching. In: Divyakant, A., Aref, W., Lu, C.T. et al. (eds.) ACM SIGSPATIAL GIS 2009, Seattle, Washington, pp. 456-459. ACM (Nov. 2009)
[6] Jeffrey Partyka, Latifur Khan, Bhavani M. Thuraisingham: Geospatial Schema Matching with High-Quality Clustering and Multi-Attribute Matching. Submitted to the 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2011, May 2011, Shenzhen, China).

Using Latlong Value to Enhance GT Matching

GSim: Combining NGT and GT Matching
We apply GT matching for an attribute comparison if >= 50% of the instances involved in the comparison have GT information. If this is not the case, then NGT matching is applied instead [1].

featureName     | City
Collin Creek    | Plano
White Rock Lake | Dallas
Dallas River    | Lakehurst

Lake             | County
Cooke Lake       | Cooke
Mud Lake         | Collin
Stone Briar Lake | Collin

Decision: >= 50% of instances have a GT? If yes, apply GT Matching (types such as Lake, Creek, River); if no, apply NGT Matching (keywords such as Rock, Stone, Mud, Cooke).
[1] Jeffrey Partyka, Pallabi Parveen, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar: Enhanced Geographically-Typed Semantic Schema Matching. To appear in the Journal of Web Semantics, 2011.
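The 50% rule can be sketched as a small dispatch function. This is an illustrative Python sketch: the gazetteer is modelled as a plain dictionary with invented entries, coverage is computed over the instances of both compared attributes together (one reading of "the instances involved in the comparison"), and the names gt_coverage and choose_matching are hypothetical.

```python
def gt_coverage(instances, gazetteer):
    """Fraction of instances for which the gazetteer knows a geographic type."""
    if not instances:
        return 0.0
    return sum(1 for x in instances if x in gazetteer) / len(instances)


def choose_matching(att1, att2, gazetteer, threshold=0.5):
    """Return "GT" when enough instances have GT information, else "NGT"."""
    combined = list(att1) + list(att2)
    return "GT" if gt_coverage(combined, gazetteer) >= threshold else "NGT"


# Hypothetical gazetteer entries for illustration only.
gazetteer = {
    "Collin Creek": "creek",
    "White Rock Lake": "lake",
    "Cooke Lake": "lake",
    "Mud Lake": "lake",
}

features = ["Collin Creek", "White Rock Lake", "Dallas River"]
lakes = ["Cooke Lake", "Mud Lake", "Stone Briar Lake"]
print(choose_matching(features, lakes, gazetteer))
```

In this toy setup four of the six instances have a GT, so the featureName-Lake comparison is routed to GT matching.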

Attribute Weighting
We can distribute the weight of each attribute match based on its importance:

strAdd               | city       | state | zipCode
1000 Park Blvd.      | Plano      | TX    | 75075
209 Spring Valley Rd. | Richardson | TX    | 75080
1703 Danube Ln.      | Plano      | TX    | 75075
18431 Roehampton Dr. | Dallas     | TX    | 75252

Street Address        | City   | State | Zip
100 Genstar Dr.       | Dallas | TX    | 75252
2091 Spring Creek Rd. | Plano  | TX    | 75075
1704 Danube Ln.       | Plano  | TX    | 75075
18331 Roehampton Dr.  | Dallas | TX    | 75252

Example weights for the four attribute matches: 27%, 23%, 26%, 24%

Measuring Attribute Match Importance
Attribute match importance is determined by:
1. Attribute Uniqueness
2. Attribute Relevance
Figure: attributes (name, roadType, city, town, road_type, rName, ctyName, lakeType, type, lakename, destPort, county, edez_id, Dest) drawn from the tables Roads, Ports, Lakes, Sea Ports and LakeFeatures.

Attribute Uniqueness
Determine the uniqueness of attributes att1 and att2 involved in a match (att1-att2) by clustering all attributes from all tables over S1 and S2.
Figure: attribute clusters at cutoff 1 and cutoff 2.

Attribute Clustering
Use Intercluster Similarity (ICS) to decide if clusters A and B should merge.
Calculate a cutoff point (CP) to determine when to stop clustering.

Cutoff Point vs. # of Cluster Iterations

Calculating AU, corrected EBD value
Calculate AU for an attribute att in a match, with AU_att ∈ [0,1].
Calculate pairwise uniqueness (PU) for a match att1-att2:
PU(att1, att2) = avg(AU_att1(T), AU_att2(T'))
Recalculate EBD between att1(T) and att2(T'):
EBD_corr(att1, att2) = EBD_orig(att1, att2) x PU(att1, att2)

Att Match                         | PU   | EBD_orig | EBD_corr
Name (Ports) - Name (Sea Ports)   | .688 | .90      | .619
destPort - Dest                   | .938 | .80      | .750

Attributes in the figure: rName, name (Roads), name (Ports), lakename, name (Lakes), Name (Sea Ports), destPort, Dest.
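The correction step is simple arithmetic and can be checked against the two rows of the table. A minimal Python sketch follows; the function names are illustrative, and the individual AU values passed to pairwise_uniqueness in the last line are hypothetical, since the slide reports only the resulting PU values.

```python
def pairwise_uniqueness(au_att1, au_att2):
    """PU for a match att1-att2: the average of the two attribute uniqueness values."""
    return (au_att1 + au_att2) / 2


def corrected_ebd(ebd_orig, pu):
    """EBD_corr = EBD_orig x PU."""
    return ebd_orig * pu


# Rows from the table: (PU, EBD_orig).
print(round(corrected_ebd(0.90, 0.688), 3))  # Name (Ports) - Name (Sea Ports)
print(round(corrected_ebd(0.80, 0.938), 3))  # destPort - Dest

# Hypothetical AU values that would average to the first PU in the table.
print(round(pairwise_uniqueness(0.5, 0.876), 3))
```

Because PU lies in [0,1], the corrected EBD can only stay equal to or fall below the original EBD, so matches between common, non-unique attributes are penalised most.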

Attribute Weighting Algorithm

High-Quality Clustering
Due to the inherent randomness of clustering (e.g., choosing the initial centroid), EBD scores may not be stable [6].
We need a way to produce consistent EBD values:
- To eliminate EBD variability
- To provide a confidence value for our EBD value
- To guarantee that our EBD value was generated from a high-quality clustering
We proposed the following two cluster-based measures:
(1) Semantic Purity: the "meaning distance" between any two instances within the same cluster
(2) Geographic Purity: the GT purity of a given cluster

Cluster Purity Measures
Distance-based Measure: Imp S =
Geographic-Type Measure:
Objective Function to be Minimized: O SSKM = where W i =
Example instances in the figure: Collin, Tarrant, Plano, Kaufman, Coppell, Richardson.

1:N Matching
Many relationships are not 1:1, but involve matching groups of entities.

Cmp (columns: id, Mailing Address):
112 Plano Dr., Plano, TX, 75075
218 Coit Rd., Richardson, TX, 75080
3200 Preston Rd., Dallas, TX
42 Hedgecoxe Rd.

N1, N2, N3, N4:
Street Address        | City   | State | Zip
100 Genstar Dr.       | Dallas | TX    | 75252
2091 Spring Creek Rd. | Plano  | TX    | 75075
1704 Danube Ln.       | Plano  | TX    | 75075
18331 Roehampton Dr.  | Dallas | TX    | 75252

Defining 1:N Matching
1:N matching can be defined in many ways:
- Optimize similarity or value of N?
- Meronymy or Subsumption?
We chose to optimize similarity (EBD):
- Use EBD scores produced from 1-1 matches between Cmp and N_k (1 <= k <= N)
- Apply a greedy algorithm to add attributes to the match with Cmp based on decreasing EBD score (highest to lowest)
- Any 1:N match will minimize the set difference between GT(Cmp) and the union of the sets of GTs for the N matching attributes
- We do not include an attribute in a 1:N match if it would make the EBD of the current match decrease

1:N Matching Example

Cmp (columns: id, Mailing Address):
112 Plano Dr., Plano, TX, 75075
218 Coit Rd., Richardson, TX, 75080
42 Hedgecoxe Rd.

N1, N2, N3, N4 and County:
Street Address        | City   | State | Zip   | County
100 Genstar Dr.       | Dallas | TX    | 75252 | Dallas
2091 Spring Creek Rd. | Plano  | TX    | 75075 | Collin
1704 Danube Ln.       | Plano  | TX    | 75075 | Collin

Step | Attribute      | 1-1 EBD w/ Cmp | 1:N EBD
1    | Street Address | .81            |
2    | City           | .79            | .88
3    | State          | .72            | .92
4    | Zip            | .66            | .95

Final 1:N EBD: .95

1:N Matching From Type Perspective
Figure: the types (W, X, Y, Z) found in Mailing Address are compared against the types found in Street Address, City, State and Zip, using Entropy = H(C) and Conditional Entropy = H(C|T).

Greedy 1:N Matching Algorithm

program 1:N_Matching (S(T2), Sebd(T2)) {
    var E(T2) = Φ;
    var S(T2) = Φ;
    Sebd(T2) = 0.0;
    GT_Cmp = getGTSet(Cmp);
    E(T2) = getMatchCandidates(Cmp, T2, GT_Cmp);
    E(T2) = orderByEBD(E(T2));
    for att A ∈ E(T2) with max value of EBD(Cmp, A) {
        if (increaseEBD(Cmp, Sebd(T2))) {
            Emax = A;
            S(T2) = S(T2) U Emax;
            Sebd(T2) = addEBD(Sebd(T2), EBD(Cmp, Emax));
        }
        E(T2) = E(T2) - A;
    }
}
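The greedy loop can be sketched in Python as follows. This is a simplified illustration rather than the authors' code: the candidate filtering by geographic type (getMatchCandidates) is omitted, and the 1:N score is supplied by a caller-provided function group_ebd, because the slides do not specify how addEBD combines scores. In the example, the scores for Street Address, City, State and Zip come from the 1:N Matching Example slide, while both County values are hypothetical.

```python
def greedy_one_to_n(candidates, one_to_one_ebd, group_ebd):
    """Greedily add attributes, highest 1-1 EBD first, while the 1:N EBD increases.

    one_to_one_ebd(att) -> EBD between Cmp and att.
    group_ebd(attrs)    -> EBD between Cmp and the attribute list attrs.
    """
    ordered = sorted(candidates, key=one_to_one_ebd, reverse=True)
    selected, best = [], 0.0
    for att in ordered:
        score = group_ebd(selected + [att])
        if score > best:  # skip attributes that would lower the current EBD
            selected.append(att)
            best = score
    return selected, best


ONE_TO_ONE = {"Street Address": 0.81, "City": 0.79, "State": 0.72,
              "Zip": 0.66, "County": 0.40}  # County value is hypothetical
GROUP = {
    ("Street Address",): 0.81,
    ("Street Address", "City"): 0.88,
    ("Street Address", "City", "State"): 0.92,
    ("Street Address", "City", "State", "Zip"): 0.95,
    ("Street Address", "City", "State", "Zip", "County"): 0.90,  # hypothetical
}

match, score = greedy_one_to_n(
    list(ONE_TO_ONE), ONE_TO_ONE.get, lambda attrs: GROUP[tuple(attrs)]
)
print(match, score)
```

With these inputs County is left out of the match, since adding it would lower the 1:N EBD below the current value.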

Proof of Correctness
Theorem 1 (Proof of Greedy Choice Property for 1:N matching algorithm): All choices for Emax_x(T2) will be present in an optimal 1:N match with Cmp ∈ T1.
Suppose that Sebd_N(T2), for an arbitrary S_N(T2), produces an optimal EBD. Let us build a new set called S2ebd_N(T2) from S2_N(T2) such that every attribute included in S2_N(T2) represents a value of Emax_x(T2) for some x. Also, the cardinalities of S_N(T2) and S2_N(T2) are equal, and every attribute between S_N(T2) and S2_N(T2) is identical, except for an arbitrary attribute indexed by r (r = the EBD value produced between Cmp and attribute r given in S_N(T2)). Since all other attributes are equal between S_N(T2) and S2_N(T2), their associated 1:1 EBD scores with Cmp are also identical. Therefore, EBD(Cmp, S2_N(T2)) >= EBD(Cmp, S_N(T2)), but since S_N(T2) produces an optimal EBD with Cmp through Sebd_N(T2), EBD(Cmp, S2_N(T2)) = EBD(Cmp, S_N(T2)). Thus, S2_N(T2) also produces an optimal EBD with Cmp through S2ebd_N(T2).

Proof of Correctness (cont)
Theorem 2 (Proof of optimal substructure property): Let Sebd_N-1(T2), N > 1, be the EBD score corresponding to the attribute match between Cmp ∈ T1 and S_N-1(T2) ⊆ T2. If Sebd_N-1(T2) is an optimal EBD score, and Sebd_N(T2) is obtained by adding Emax_x to S_N-1(T2), then Sebd_N(T2) must also be an optimal EBD score.
Assume that S_N(T2) was formed by adding Emax_x to S_N-1(T2), but does not produce an optimal value of Sebd_N(T2). Emax_x represents the attribute with the highest EBD score with Cmp to be included in S_N-1(T2) with respect to all other attributes in E_x(T2). Then this means that S_N-1(T2) contains some attribute indexed by r (r <= N-1) whose EBD value is less than that of Emax_r. Thus, Sebd_N-1(T2) is not an optimal EBD score. This contradicts the statement above that Sebd_N-1(T2) is an optimal EBD score. Therefore, if Sebd_N-1(T2) is an optimal EBD score, and Sebd_N(T2) is obtained by adding Emax_x to S_N-1(T2), then Sebd_N(T2) must be an optimal EBD score.
Theorem 3: Greedy 1:N matching produces a safe match with an optimal EBD score. This follows from Theorem 1 and Theorem 2.

Dataset Details GIS Transportation Dataset (GTD) GIS Location Dataset (GLD)

Dataset Details (cont)
GIS Point of Interest Dataset (GPD)
- Through all of our datasets, few shared instances exist
- Data is multijurisdictional in nature
- Number of attributes and instances differ

NGT Matching Over GTD

GT Matching Over GTD

The Effect of Latlong Values on Matching in GPD

The Effect of Attribute Weighting on Matching in GTD and GLD

Observing the Effects of Multiple Matching Methods over GPD
(1) GT matching
(2) GT matching + latlong
(3) GT matching + latlong + NGT matching
(4) GT matching + latlong + NGT matching + attribute weighting

GSim vs. N-grams, SVD, NMF & GSim G

1:N Matching Experiment Results
Experiment 1
T1 = {'Address'}
T2 = {'Street Address', 'City', 'State', 'Zip'}

1:N Matching Experiment Results (cont)
Experiment 2
T1 = {'Island_Group'}
T2 = {'Island1', 'Island2', 'Island3', 'Island4', 'Island5', 'Island6'}
'Island6' is not a part of 'Island_Group'.

Summary of Matching Methods
Match capabilities compared: Exact Match, Synonym Match, GT Match, GT + Latlong Match, Hierarchical GT Match
Methods compared: N-grams, NGT Matching, GT Matching, GT + Latlong, GT + Cluster Purity, Final GSim (Ideal)

Hierarchical GT Matching
Use a GT hierarchy to match types with relationships between them (e.g., superclass/subclass, meronym/holonym, etc.)
GT hierarchy nodes in the figure: Bodies of Water, Lakes, Rivers, Rapids, Streams
att1: Dell Lake, Dallas River, Coppell Stream
att2: HP Lake, Collin River, Plano Rapids
Steps: (1) Get GT relations from the ontology. (2) Calculate similarity.

THANK YOU! ANY QUESTIONS?
