Presented By: Kiran Kancharlapalli DBMS - Topics 11 & 12.

1 Presented By: Kiran Kancharlapalli DBMS - Topics 11 & 12

2 Semantic Interoperability

3 What is Semantic Interoperability? Ability of computer systems to transmit data with unambiguous, shared meaning Data must be made available between heterogeneous agents Metadata must also be made available allowing a software agent to learn how to interpret the data – Document Type Definition – XML-Schema – RDF Annotations Requirement to enable machine computable logic, inferring and knowledge discovery between information systems. Results in Semantic Web

4 How is it accomplished? By adding data about the data (metadata), linking each data element to a controlled, shared vocabulary. The meaning of the data is transmitted with the data itself, in one self-describing "information package" that is independent of any information system. Syntactic interoperability is a prerequisite for semantic interoperability – it refers to the packaging and transmission mechanisms for data

5 What is Semantic Web? Will be able to provide justified answers to natural language questions – Current search engines provide lists of resources that are supposed to contain the answer Knowledge rather than plain data would be retrieved i.e. data which is relevant to the user’s task Social factors such as privacy and trust would also be taken into account

6 Benefits Search can often be frustrating because of the limitations of keyword-based matching techniques. – Users frequently experience one of two problems: either get back no results or too many irrelevant results. The problem is that words can be synonymous (that is, two words have the same meaning) or polysemous (a single word has multiple meanings). However, if the languages used to describe web pages were semantically interoperable, then the user could specify a query in the terminology that was most convenient, and be assured that the correct results were returned, regardless of how the data was expressed in the sources.

7 Ontologies

8 What are Ontologies? Content theories possible about objects in a specified domain A representation vocabulary, specialized to some domain or subject matter Provide potential terms for describing knowledge about the domain Translating the terms in an ontology from, say English to French, does not change the ontology conceptually

9 What are Ontologies? Designed to reuse across multiple applications and implementations

10 Motivation select EMPDAT from PERSTAB where POS='mgmnt' – What does it mean? – PERSTAB is a table which lists employee data. What's an employee? How is an employee different from a contractor? What if I want data on both? Even if this information is available in English, a human has to read it

11 Motivation (cntd…) "Parenthood is a more general relationship than motherhood." "Mary is the mother of Bill." "Who are Bill's parents?" "Mary is the parent of Bill." – that fact is not stated anywhere, but can be derived by a DAML application.

12 More formally stated, given the statements (motherOf subProperty parentOf) (Mary motherOf Bill) when stated in DAML, allows you to conclude (Mary parentOf Bill) Java code or a stored procedure could do this sort of inference for facts in XML or SQL But the DAML spec itself says the conclusion is true In contrast, different Java code could reach a different conclusion
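The sub-property rule on this slide can be sketched in a few lines of Python. This is only an illustration of the single rule "if P is a sub-property of Q and (a P b) holds, then (a Q b) holds"; it is not a DAML reasoner, and the function name `infer_super_properties` and the tuple encoding of statements are assumptions made for this sketch.

```python
# Statements are (subject, predicate, object) tuples; this encoding is an
# assumption for the sketch, not DAML syntax.
SUB_PROPERTY = "subProperty"


def infer_super_properties(triples):
    """Return the input statements plus everything the sub-property rule adds."""
    facts = set(triples)
    changed = True
    while changed:  # repeat until no new statement can be derived
        changed = False
        # (narrow, broad) pairs such as ("motherOf", "parentOf")
        sub_props = {(s, o) for (s, p, o) in facts if p == SUB_PROPERTY}
        for (s, p, o) in list(facts):
            for (narrow, broad) in sub_props:
                if p == narrow and (s, broad, o) not in facts:
                    facts.add((s, broad, o))
                    changed = True
    return facts


statements = [
    ("motherOf", SUB_PROPERTY, "parentOf"),
    ("Mary", "motherOf", "Bill"),
]
closure = infer_super_properties(statements)

# "Who are Bill's parents?"
parents_of_bill = sorted(s for (s, p, o) in closure if p == "parentOf" and o == "Bill")
print(parents_of_bill)
```

The point made on the slide still applies: here the conclusion depends on this particular hand-written function, whereas with DAML the specification itself fixes which conclusions follow.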

13 Everything is not a nail Ontology is not always the right tool for the job Face recognition, vehicle control systems etc – not the right applications for ontology

14 Many Ways to Use Ontology As an information engineering tool – Create a database schema – Map the schema to an upper ontology – Use the ontology as a set of reminders for additional information that should be included As more formal comments – Define an ontology that is used to create a DB or OO system – Use a theorem prover at design time to check for inconsistencies For taxonomic reasoning – Do limited run-time inference in Prolog, a description logic, or even Java For first order logical inference – Full-blown use of all the axioms at run time

15 Upper Ontology An attempt to capture the most general and reusable terms and definitions

16 Motivation to capture Upper Ontology Ontologies may have different names for the same things – type – a relation between a class and an instance – instance – a relation between a class and an instance – isa – a relation between a class and an instance – … Ontologies may have the same name for different things, and no corresponding terms – before – a relation between two time points – before – a relation between two time intervals Either use the same upper ontology, or at least map to a common upper ontology

17 Some Formal Upper Ontologies DOLCE Cyc SUMO

18 Simple Methodology Extract nouns and verbs from a source text Find classes in SUMO for the nouns and verbs Record a mapping as being either equal, subsuming or instance. – type a single word that relates to the UBL term in the "SUMO term" or "English Word" text areas in the SUMO browser Create a subclass of SUMO if it's a subsuming mapping Add properties to the subclass – reusing SUMO properties – extending SUMO properties by creating a &%subrelation of an existing property Add English definition to the class – define constraints that express how the subclass is more specific than the superclass Express the classes and properties in KIF and begin creating axioms, based on the English definitions created previously

19 High Level Distinctions The first fundamental distinction is that between ‘Physical’ (things which have a position in space/time) and ‘Abstract’ (things which don’t) Physical Abstract

20 High Level Distinctions Partition of ‘Physical’ into ‘Objects’ and ‘Processes’ Physical Object Process

21 DBpedia: A Nucleus for a Web of Open Data is an effort to: – extract structured information from Wikipedia – make this information available on the Web under an open license – interlink the DBpedia dataset with other datasets on the Web

22 [Figure: parts of a Wikipedia article used for extraction. Labels: title, abstract, infoboxes, geo-coordinates, categories, images, links (other languages, other wiki pages, to the web), redirects, disambiguates]

23 Extracting Structured Information from Wikipedia Wikipedia consists of – 6.9 million articles – in 251 languages – monthly growth-rate: 4% Wikipedia articles contain structured information – infoboxes which use a template mechanism – images depicting the article’s topic – categorization of the article – links to external webpages – intra-wiki links to other articles – inter-language links to articles about the same topic in different languages

24 [Figure: DBpedia architecture diagram. Labels: Wikipedia dumps (article texts, DB tables); extraction (articles, infobox, categories); DBpedia datasets, loaded into Virtuoso and MySQL; published via SPARQL endpoint and Linked Data; clients: traditional web browser, Web 2.0 mashups, Semantic Web browsers, SNORQL browser, query builder]

25 Extracting Infobox Data (RDF Representation)

26 DBpedia Basics The structured information can be extracted from Wikipedia and can serve as a basis for enabling sophisticated queries against Wikipedia content. The project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data. The Developers Guide to Semantic Web Toolkits lists development toolkits for processing DBpedia data in your preferred programming language.

27 The DBpedia Dataset 1,600,000 concepts including – 58,000 persons – 70,000 places – 35,000 music albums – 12,000 films described by 91 million triples – using 8,141 different properties. – 557,000 links to pictures – 1,300,000 links to external web pages – 207,000 Wikipedia categories – 75,000 YAGO categories

28 Accessing the DBpedia Dataset over the Web 1. SPARQL Endpoint 2. Linked Data Interface 3. DB Dumps for Download

29 SPARQL SPARQL is a query language for RDF. RDF is a directed, labeled graph data format for representing information in the Web. This specification defines the syntax and semantics of the SPARQL query language for RDF. SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware.

30 The DBpedia SPARQL Endpoint hosted on an OpenLink Virtuoso server can answer SPARQL queries like – Give me all sitcoms that are set in NYC? – All tennis players from Moscow? – All films by Quentin Tarantino? – All German musicians that were born in Berlin in the 19th century?


32 Example To find everything Bart wrote on the blackboard in season 12 of The Simpsons: The Simpsons episode Wikipedia pages are the identified "things" that we would consider as the subjects of our RDF triples. The bottom of the Wikipedia page for the "Tennis the Menace" episode tells us that it is a member of the Wikipedia category "The Simpsons episodes, season 12". The episode's DBpedia page tells us that p:blackboard is the property name for the Wikipedia infobox "Chalkboard" field. SELECT ?episode ?chalkboard_gag WHERE { ?episode skos:subject [category URI lost in extraction] . ?episode dbpedia2:blackboard ?chalkboard_gag }
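How the two triple patterns in this query combine can be sketched with a small in-memory matcher. This is not how Virtuoso evaluates SPARQL; it is a minimal illustration of joining patterns on a shared variable. The episode identifiers, gag strings and the `SEASON_12` stand-in for the category URI are all made-up placeholders, and the functions `match_pattern` and `query` are hypothetical names.

```python
def match_pattern(pattern, triple, bindings):
    """Unify one triple pattern with one triple; return extended bindings or None."""
    new_bindings = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # a variable such as ?episode
            if new_bindings.get(term, value) != value:
                return None  # variable already bound to something else
            new_bindings[term] = value
        elif term != value:  # a constant that does not match
            return None
    return new_bindings


def query(patterns, triples):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


SEASON_12 = "category:season_12"  # placeholder, not the real category URI

toy_triples = [  # placeholder data for illustration only
    ("ex:episode_a", "skos:subject", SEASON_12),
    ("ex:episode_a", "dbpedia2:blackboard", "gag text A"),
    ("ex:episode_b", "skos:subject", "category:other_season"),
    ("ex:episode_b", "dbpedia2:blackboard", "gag text B"),
]

patterns = [
    ("?episode", "skos:subject", SEASON_12),
    ("?episode", "dbpedia2:blackboard", "?chalkboard_gag"),
]

results = query(patterns, toy_triples)
print(results)
```

Because both patterns share `?episode`, only episodes that are in the chosen category and also have a blackboard value survive the join.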


34 Possible Improvements Better data cleansing required. Improvement in the classification. Interlink DBpedia with more datasets. Improvement in the user interfaces. Performance Scalability More Expressiveness

35 Questions for Discussion DBpedia gains new information when it extracts data from the latest Wikipedia dump, whereas Freebase, in addition to Wikipedia extractions, gains new information through its userbase of editors. – Which is the better approach? Can Freebase or DBpedia be a substitute for Wikipedia? – Freebase: not ideal, in that we would have two similar things (Wikipedia and Freebase) – DBpedia: not ideal, in that it only extracts data from a dump How can we interlink Freebase and DBpedia? What could be killer applications using DBpedia? – If there are some, fine – If there are none, do we really need a large, general store of structured knowledge?

36 Uncertainty propagation Every physical quantity has: – A value or size – Uncertainty (or 'Error') – Units Without these three things, no physical quantity is complete. When quoting your measured result, follow the simple rules: Ex: A = 1.71 ± 0.01 m Always quote the main value to the same number of decimal places as the uncertainty. Always include units! (but if the quantity is dimensionless, say so) Never quote the uncertainty to more than 1 or 2 significant figures (this would make no sense)
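The quoting rules above can be sketched as a small formatting helper. The function name `format_measurement` and its parameters are assumptions for illustration. The sketch only handles the decimal-place rule; it does not round values whose uncertainty is 10 or larger to tens or hundreds.

```python
import math


def format_measurement(value, uncertainty, unit, sig_figs=1):
    """Quote value and uncertainty to the same number of decimal places.

    sig_figs is the number of significant figures kept in the uncertainty
    (the slide recommends 1 or 2).
    """
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Position of the leading digit of the uncertainty, e.g. -2 for 0.02
    exponent = math.floor(math.log10(uncertainty))
    decimals = max(0, sig_figs - 1 - exponent)
    return f"{value:.{decimals}f} ± {uncertainty:.{decimals}f} {unit}"


print(format_measurement(1.7134, 0.02, "m"))
print(format_measurement(12.345, 0.5, "s"))
```

For a dimensionless quantity, the `unit` argument could simply be passed as `"(dimensionless)"`, in line with the rule to say so explicitly.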

37 Terminology: ‘Uncertainty’ and ‘Error’ The terms Uncertainty and Error are used interchangeably to describe a measured range of possible true values. The meaning of the term Error is : – NOT the DIFFERENCE between your experimental result & that predicted by theory, or an accepted standard result ! – NOT a MISTAKE in the experimental procedure or analysis ! Hence, the term Uncertainty is less ambiguous. Nevertheless, we still use terms like ‘propagation of errors’, ‘error bars’, ‘standard error’, etc. The term “human error” is imprecise - avoid using this as an explanation of the source of error.

38 Error Propagation using Calculus Functions of one variable If the uncertainty in measured x is Δx, what is the uncertainty in a derived quantity z(x)? Error propagation is just calculus – you do this formally in the "Data Handling" course. Basic principle is that, if (Δx)/x is small, then to first order: $\Delta z \approx \frac{dz}{dx}\,\Delta x$ e.g., if $z = x^n$, then: $\Delta z = n x^{n-1}\,\Delta x$ Hence, for this particular function, the percent (or fractional) error in z is: $\frac{\Delta z}{z} = n\,\frac{\Delta x}{x}$ or just n times the percent error in x

39 Error Propagation using Calculus Functions of more than one variable Suppose the uncertainties in two measured quantities x and y are Δx and Δy; what is the uncertainty in some derived quantity z(x,y)? For such functions of 2 variables we use partial differentiation. But combining errors ALWAYS INCREASES total error, so make sure terms add with the same sign: $\Delta z = \left|\frac{\partial z}{\partial x}\right|\Delta x + \left|\frac{\partial z}{\partial y}\right|\Delta y$ It is better to add in quadrature, i.e. "the root of the sum of the squares": $\Delta z = \sqrt{\left(\frac{\partial z}{\partial x}\,\Delta x\right)^2 + \left(\frac{\partial z}{\partial y}\,\Delta y\right)^2}$ We can usually handle error propagation in this way by calculus
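Adding in quadrature can be sketched numerically. The helper below estimates each partial derivative with a central difference rather than by hand, which is an assumption of this sketch and not something the slides prescribe. The name `propagate_uncertainty` and the example numbers are illustrative only.

```python
import math


def propagate_uncertainty(func, values, uncertainties, rel_step=1e-6):
    """Combine the uncertainties of the inputs of func in quadrature.

    Each partial derivative is approximated by a central difference.
    """
    total = 0.0
    for i, (v, dv) in enumerate(zip(values, uncertainties)):
        h = rel_step * (abs(v) if v != 0 else 1.0)  # small step in variable i
        up = list(values)
        down = list(values)
        up[i] = v + h
        down[i] = v - h
        partial = (func(*up) - func(*down)) / (2 * h)
        total += (partial * dv) ** 2  # squares, so signs cannot cancel
    return math.sqrt(total)


# Example: z = x * y with x = 2.0 ± 0.1 and y = 3.0 ± 0.2
dz = propagate_uncertainty(lambda x, y: x * y, [2.0, 3.0], [0.1, 0.2])
print(round(dz, 3))
```

In this example the two contributions are 3.0 × 0.1 = 0.3 and 2.0 × 0.2 = 0.4, so the quadrature sum is 0.5, smaller than the 0.7 obtained by adding the terms linearly.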

40 Simplified Error Propagation A short-cut avoiding calculus Instead of differentiating ∂z/∂x, ∂z/∂y etc, a simpler approach is also acceptable: 1. In the derived quantity z, replace x by x + Δx, say 2. Evaluate Δz in the approximation that Δx is small Ex. 1: z = x + a, where a = constant Ex. 2: z = bx, where b = constant Ex. 3: z = bx^2, where b = constant
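Worked through with the two steps above, the three examples come out as follows. This is a reconstruction from the stated method, not the slide's own working. The uncertainty is taken as the magnitude of each result.

```latex
% Ex. 1: z = x + a
z + \Delta z = (x + \Delta x) + a
\;\Rightarrow\; \Delta z = \Delta x

% Ex. 2: z = bx
z + \Delta z = b\,(x + \Delta x)
\;\Rightarrow\; \Delta z = b\,\Delta x ,
\qquad \frac{\Delta z}{z} = \frac{\Delta x}{x}

% Ex. 3: z = bx^2, dropping the (\Delta x)^2 term because \Delta x is small
z + \Delta z = b\,(x + \Delta x)^2 = bx^2 + 2bx\,\Delta x + b\,(\Delta x)^2
\;\Rightarrow\; \Delta z \approx 2bx\,\Delta x ,
\qquad \frac{\Delta z}{z} \approx 2\,\frac{\Delta x}{x}
```

Ex. 3 agrees with the calculus result for z = x^n with n = 2: the fractional error in z is n times the fractional error in x.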

41 Synthetic Data Any production data applicable to a given situation that are not obtained by direct measurement. Used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data. Many times the particular aspects come in the form of human information (e.g. name, home address, IP address, telephone number, social security number, credit card number)

42 Importance Obtaining actual or real data sets could be difficult, and sometimes impossible due to impediments such as – Privacy issues – Image control – Logistics issues – Time – Cost Protecting information confidentiality – Data cannot be traced back to an individual Certain conditions may not be found in the original data

43 Importance (cntd.) Used to train the fraud detection system itself, thus creating the necessary adaptation of the system to a specific environment – By creating realistic behavior profiles of users and attackers – Ex: Intrusion Detection Systems are trained using Synthetic Data Allow a baseline to be set – Ex: Researcher doing clinical trials generate synthetic data to aid in creating a baseline for future studies and testing More or less realism could be exhibited according to the selected properties of the original data sets

44 Synthetic Data Generation Mostly Scenario based – Evaluating Information Analytics Software – Matching Data Mining Patterns – Evaluate quality of extraction algorithms Specific Algorithms and generators for a scenario or a set of (similar) scenarios Patterns from data mining techniques could be used to generate synthetic data sets

45 Researchers frequently need to explore the effects of certain data characteristics on their models. – To help construct datasets exhibiting specific properties, such as autocorrelation or degree disparity, synthetic data could be generated having one of several types of graph structure: random graphs; independent and identically distributed (i.i.d.) connected components; lattice graphs having a ring structure; lattice graphs having a grid structure; forest fire graphs; cluster graphs with nodes arranged in separate clusters (cliques)

46 Synthetic data is generated with simple forms of realism by: – Domain sampling within a field – Preserving cardinality relationships In all cases, data generation follows the same two steps: – Generate the empty graph structure. – Generate attribute values based on user-supplied prior probabilities. Because the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively.
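The two-step process can be sketched for the ring-lattice case. This is a simplified illustration and not the generator the slides describe: dependence between related objects is approximated by letting a node copy an already-assigned neighbor's value with probability `copy_probability`, which is an assumption of this sketch, as are the function names and the example priors.

```python
import random


def generate_ring_lattice(num_nodes):
    """Step 1: build the empty structure, a ring where node i links to i-1 and i+1."""
    if num_nodes < 3:
        raise ValueError("a ring needs at least 3 nodes")
    return {
        node: {(node - 1) % num_nodes, (node + 1) % num_nodes}
        for node in range(num_nodes)
    }


def generate_attributes(graph, priors, copy_probability, rng):
    """Step 2: assign one attribute value per node.

    With probability copy_probability a node copies a neighbor that already
    has a value (introducing autocorrelation); otherwise it samples from the
    user-supplied prior probabilities.
    """
    labels = list(priors)
    weights = [priors[label] for label in labels]
    values = {}
    for node in sorted(graph):
        assigned = [values[n] for n in sorted(graph[node]) if n in values]
        if assigned and rng.random() < copy_probability:
            values[node] = rng.choice(assigned)
        else:
            values[node] = rng.choices(labels, weights=weights, k=1)[0]
    return values


rng = random.Random(0)  # fixed seed so the run is repeatable
graph = generate_ring_lattice(6)
priors = {"normal": 0.9, "fraud": 0.1}  # hypothetical prior probabilities
attributes = generate_attributes(graph, priors, copy_probability=0.5, rng=rng)
print(attributes)
```

Raising `copy_probability` makes neighboring nodes more alike, which is one simple way to control the autocorrelation mentioned on the previous slide.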

47 Data Quality Some Definitions – The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use. – The totality of features and characteristics of data that bears on their ability to satisfy a given purpose; the sum of the degrees of excellence for factors related to data. – Complete, standards based, consistent, accurate and time stamped.

48 Data Quality Data are of high quality if, – they are fit for their intended uses in operations, decision making and planning – they correctly represent the real-world construct to which they refer As data volume increases – the question of internal consistency within data arises, regardless of fitness for use for any external purpose e.g. a person's age and birth date may conflict within different parts of a database
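The age versus birth date example can be sketched as a simple internal-consistency check. The record layout (`id`, `age`, `birth_date`) and the function names are assumptions for illustration. A real check would also need to know the date on which the stored age was recorded.

```python
from datetime import date


def age_on(birth_date, as_of):
    """Age in completed years on the date as_of."""
    years = as_of.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in as_of's year
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def find_age_conflicts(records, as_of):
    """Return the ids of records whose stored age disagrees with the birth date."""
    return [
        record["id"]
        for record in records
        if record["age"] != age_on(record["birth_date"], as_of)
    ]


records = [  # made-up records for illustration
    {"id": 1, "age": 33, "birth_date": date(1990, 6, 15)},
    {"id": 2, "age": 40, "birth_date": date(1990, 6, 15)},
]
conflicts = find_age_conflicts(records, as_of=date(2024, 6, 14))
print(conflicts)
```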

49 Data Attributes There are nearly 200 such attributes, with little agreement on their definitions and measures. The most common are – Accuracy – Correctness – Currency – Completeness – Relevance

50 Incorrect Data Includes – invalid and outdated information – can originate from different data sources resulting from data entry, or data migration and conversion projects Total cost to the US economy due to data quality problems is over US$600 billion per annum

51 Frameworks for understanding data quality A systems-theoretical approach – influenced by American pragmatism expands the definition of data quality to include information quality, and emphasizes the inclusiveness of the fundamental dimensions of accuracy and precision One framework seeks to integrate – product perspective (conformance to specifications) and – service perspective (meeting consumers' expectations)

52 One highly theoretical approach analyzes the ontological nature of information systems to define data quality rigorously Another framework evaluates the quality of the form, meaning and use of the data

53 Data Quality Assurance Service providers clean the data on a contract basis Consultants advise on fixing processes or systems to avoid data quality problems in the first place Tools for analyzing and repairing poor quality data

54 Data profiling - initially assessing the data to understand its quality challenges Data standardization - a business rules engine ensures that data conforms to quality rules Geocoding - for name and address data. Corrects data to US and Worldwide postal standards Matching or Linking - a way to compare data so that similar, but slightly different records can be aligned. – Matching may use "fuzzy logic" to find duplicates in the data. It often recognizes that 'Bob' and 'Robert' may be the same individual. – It might be able to find links between husband and wife at the same address. – It often can build a 'best of breed' record, taking the best components from multiple data sources and building a single super-record. Monitoring - keeping track of data quality over time and reporting variations in the quality of data. Software can also auto-correct the variations based on pre-defined business rules. Batch and Real time - Once the data is initially cleansed (batch), companies build the processes into enterprise applications to keep it clean.
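The matching step can be sketched with Python's standard `difflib` module. This is a rough string-similarity illustration and not the "fuzzy logic" of any particular product. The nickname table, the 0.85 threshold and the function names are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real tool would use a much larger one.
NICKNAMES = {"bob": "robert", "bill": "william"}


def normalize_name(full_name):
    """Lowercase the name and expand known nicknames, word by word."""
    parts = full_name.lower().split()
    return " ".join(NICKNAMES.get(part, part) for part in parts)


def names_match(name_a, name_b, threshold=0.85):
    """Treat two names as the same person if their similarity reaches threshold."""
    ratio = SequenceMatcher(
        None, normalize_name(name_a), normalize_name(name_b)
    ).ratio()
    return ratio >= threshold


print(names_match("Bob Smith", "Robert Smith"))    # nickname expanded first
print(names_match("Robert Smith", "Robert Smyth"))  # small spelling difference
print(names_match("Alice Jones", "Robert Smith"))
```

A lower threshold finds more duplicates but also produces more false matches, so in practice it would be tuned against records whose status is already known.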

55 ?
