Big Data Quality: the next semantic challenge
Maria Teresa Pazienza
(BIG) DATA IS ONLY AS USEFUL AS ITS QUALITY
Introduction
Since Big Data is big and messy, its challenges can be classified into engineering tasks (managing data at an unimaginable scale) and semantic tasks (finding and meaningfully combining the information that is relevant to your needs).
Challenges for Big Data
Identify relevant pieces of information in messy data:
Named entity resolution (e.g., event extraction from tweets and other short texts)
Coreference resolution (deciding whether two mentions refer to the same entity; indexing billions of RDF triples calls for easy-to-use data formats such as RDF/RDFS and OWL)
Information extraction (difficult to scale)
Paraphrase resolution (identifying the entry in a given knowledge base to which an entity mention in a document refers)
Ontology population and entity consolidation (organizing extracted tuples into a queryable form: instances of ontologies, database tuples for a schema, or sets of quads: subject, predicate, object, context)
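The entity consolidation step above can be illustrated with a minimal sketch: grouping surface mentions that likely refer to the same real-world entity under a normalized key. The normalization rules below are illustrative assumptions, not a production coreference resolver.

```python
# Minimal entity-consolidation sketch: cluster raw mentions by a crude
# canonical key. Normalization rules here are illustrative assumptions.
from collections import defaultdict

def normalize(mention: str) -> str:
    """Crude canonical key: lowercase, strip punctuation and honorifics."""
    m = mention.lower().strip(".,;: ")
    for prefix in ("mr ", "mrs ", "dr ", "prof "):
        if m.startswith(prefix):
            m = m[len(prefix):]
    return m

def consolidate(mentions):
    """Group raw mentions into candidate entity clusters."""
    clusters = defaultdict(list)
    for mention in mentions:
        clusters[normalize(mention)].append(mention)
    return dict(clusters)

clusters = consolidate(["Dr Smith", "smith", "SMITH.", "Jones"])
```

A real system would replace the key function with context-sensitive coreference and entity-linking models; the point here is only the clustering shape of the task.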
Basic assumptions
Datasets published on the web of data cover a diverse set of domains.
Data on the web show a large variation in quality.
Data extracted from semi-structured sources (DBpedia, etc.) often contain inconsistencies as well as misrepresented and incomplete information.
Even datasets with quality problems can be useful for certain applications, as long as the quality stays within the range required by each application context.
Quality on the Web: specific aspects
Coherence via links to external datasets
Data representation quality
Consistency with regard to implicit information (inference mechanisms for knowledge representation formalisms on the web, such as OWL, usually follow an open-world assumption, whereas databases usually adopt closed-world semantics)
Ontology quality
There is no consensus on how data quality dimensions and metrics should be defined.
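The open-world versus closed-world contrast in the list above can be sketched in a few lines: under closed-world semantics an absent fact is false, while under open-world semantics it is merely unknown. The triples and identifiers below are illustrative.

```python
# Sketch of closed-world vs open-world query semantics over a toy fact set.
# Identifiers are illustrative, not real DBpedia URIs.
facts = {("dbpedia:Rome", "capitalOf", "dbpedia:Italy")}

def holds_cwa(triple):
    # Closed world (databases): not stated implies false.
    return triple in facts

def holds_owa(triple):
    # Open world (OWL on the web of data): not stated implies unknown.
    return True if triple in facts else None  # None stands for "unknown"

q = ("dbpedia:Rome", "capitalOf", "dbpedia:France")
```

The practical consequence for quality assessment: a missing triple in a linked dataset cannot be counted as an error the way a missing database row can.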
Quality on the Web: specific aspects
The challenges stem from the openness of the web of data, the diversity of the information, and an unbounded, dynamic set of autonomous data sources and publishers.
Dimensions of data quality
Organized into two categories: contextual, referring to attributes that depend on the context in which the data are observed or used, and intrinsic, referring to attributes that are objective and native to the data.
Contextual dimensions of data quality
These include at least relevancy, value added, quantity, believability, accessibility, understandability, availability, verifiability, and reputation of the data. Contextual dimensions of data quality lend themselves more to information than to data, because these dimensions arise from placing data within a situation- or problem-specific context.
Intrinsic dimensions of data quality
Intrinsic data quality has four dimensions:
Accuracy (degree to which data are equivalent to their corresponding "real" values)
Timeliness (degree to which data are up to date: currency, the length of time since the record's last update, and volatility, which describes the frequency of updates)
Consistency (degree to which related data records match in terms of format and structure)
Completeness (degree to which data are full and complete in content, with no missing data; e.g., an address)
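Two of these dimensions, completeness and format consistency, are simple enough to sketch as record-level metrics. The field names and the DD/MM/YY date check below are illustrative assumptions, echoing the shipping-address and delivery-date examples of Table 1.

```python
# Sketch of metrics for two intrinsic dimensions over a list of records.
# Field names and the date-format rule are illustrative assumptions.
import re

REQUIRED = ("name", "street", "city", "zip")
DATE_FMT = re.compile(r"^\d{2}/\d{2}/\d{2}$")  # agreed DD/MM/YY format

def completeness(record):
    """Fraction of required fields that are present and non-empty."""
    return sum(1 for f in REQUIRED if record.get(f)) / len(REQUIRED)

def format_consistency(records, field):
    """Fraction of records whose field matches the agreed format."""
    return sum(1 for r in records if DATE_FMT.match(r.get(field, ""))) / len(records)

records = [
    {"name": "ACME", "street": "1 Main St", "city": "Rome", "zip": "00100",
     "delivery": "12/05/24"},
    {"name": "Beta", "street": "", "city": "Pisa", "zip": "56100",
     "delivery": "2024-05-12"},  # wrong format: ISO instead of DD/MM/YY
]
```

Accuracy and timeliness need external references (true values, update timestamps) and so cannot be computed from the records alone.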
Intrinsic dimensions of data quality
Accuracy: Are the data free of errors? Supply chain example: the customer shipping address in a customer relationship management system matches the address on the most recent customer order.
Timeliness: Are the data up to date? Supply chain example: the inventory management system reflects real-time inventory levels at each retail location.
Consistency: Are the data presented in the same format? Supply chain example: all requested delivery dates are entered in a DD/MM/YY format.
Completeness: Are necessary data missing? Supply chain example: the customer shipping address includes all data points necessary to complete a shipment (i.e., name, street address, city, state, and zip code).
Table 1. Dimensions of data quality.
The question from knowledge management experts
Can Big Data leverage semantics? Yes.
Commonly used data in a Big Data context:
Data generated by humans (mainly disseminated through web tools such as social networks, cookies, …)
Data generated by connected objects
The Internet of human beings and the Internet of Things become a mix of big data that must be targeted to understand, plan, and act in a predictive way.
Bidirectionality
The relation between Big Data and semantics is bidirectional: just as Big Data leverages semantics, some semantic tasks are optimized by using tools designed for processing large data sets.
Challenges for Big Data
a) Meaningful data integration challenges:
Define the problem to solve
Identify relevant pieces of data in Big Data
ETL them into appropriate formats and store them for processing
Disambiguate them
Solve the problem
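The middle steps of that pipeline (identify, transform, disambiguate) can be sketched as three small functions. The relevance predicate, field names, and output triple format below are all illustrative assumptions.

```python
# Sketch of the integration steps above as a tiny pipeline:
# extract relevant records, transform them into a uniform triple format,
# and disambiguate duplicates. All names and rules are illustrative.

def extract(raw):
    # Identify relevant pieces: keep records mentioning the target entity.
    return [r for r in raw if "rome" in r["text"].lower()]

def transform(records):
    # ETL into an appropriate format: uniform (subject, predicate, object).
    return [("dbpedia:Rome", "mentionedIn", r["source"]) for r in records]

def disambiguate(triples):
    # Consolidate: drop duplicate assertions about the same source.
    return sorted(set(triples))

raw = [
    {"text": "Rome is the capital of Italy", "source": "doc1"},
    {"text": "ROME hosted the event", "source": "doc1"},
    {"text": "Paris in spring", "source": "doc2"},
]
triples = disambiguate(transform(extract(raw)))
```

At Big Data scale, each stage would run distributed (e.g., as map and reduce steps), but the stage boundaries stay the same.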
Challenges for Big Data
b) The Billion Triple Challenge, which aims at processing large-scale data against a target vocabulary and linking each entity to its corresponding sources
c) The Linked Open Data ripper, for providing good use cases for LOD and for linking them efficiently with non-LOD data
d) The value of using semantics in data integration and in the design of future DBMSs
Challenges for Big Data
Semantics could be considered a magic word for bridging the gap created by the heterogeneity of data. Semantics can be used in a decidable system, which makes it possible to detect inconsistencies in the data, to generate new knowledge using an inference engine, or simply to link more accurately specific data that are not suited to machine-learning-based techniques.
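Both uses named above, generating new knowledge with an inference engine and detecting inconsistencies, can be sketched in miniature: a naive forward-chaining closure of a transitive property, and a check for an individual asserted to belong to two disjoint classes. The class names and rules are illustrative, not a real OWL reasoner.

```python
# Toy inference engine and inconsistency check; names are illustrative.

def infer_transitive(triples, predicate="subClassOf"):
    """Naive forward chaining: close the given predicate under transitivity."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(closed):
            for (c, p2, d) in list(closed):
                if p1 == p2 == predicate and b == c and (a, predicate, d) not in closed:
                    closed.add((a, predicate, d))
                    changed = True
    return closed

def inconsistent(individual_types, disjoint_pairs):
    """True if any individual belongs to two declared-disjoint classes."""
    for classes in individual_types.values():
        for (a, b) in disjoint_pairs:
            if a in classes and b in classes:
                return True
    return False

kb = {("Dog", "subClassOf", "Mammal"), ("Mammal", "subClassOf", "Animal")}
closed = infer_transitive(kb)
```

A production system would use an OWL reasoner with proper tableau or rule-based algorithms; the sketch only shows why decidability matters, since the closure loop must terminate.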
Challenges for Big Data
To determine the quality of datasets published on the web and make this quality information explicit. Assuring data quality is a particular challenge in LOD, since it involves a set of autonomously evolving data sources. Information quality criteria:
For web documents: page trustworthiness versus page rank
For structured information: correctness of facts, adequacy of the semantic representation, degree of coverage
Trustworthiness of web sources
The trustworthiness or accuracy of a web source is defined as the probability that it contains the correct value for a fact, assuming it mentions any value for that fact at all. Trustworthiness is orthogonal to PageRank.
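That conditional definition translates directly into a small estimator: the fraction of a source's stated facts whose values match a gold reference, computed only over facts the source actually mentions. The gold values and source data below are illustrative.

```python
# Sketch of the trustworthiness definition above:
# P(value correct | source mentions a value for that fact).
# Gold reference and source data are illustrative.

def trustworthiness(source_facts, gold):
    """Fraction of the source's checkable stated facts that are correct."""
    mentioned = [k for k in source_facts if k in gold]
    if not mentioned:
        return None  # source mentions no checkable fact
    correct = sum(1 for k in mentioned if source_facts[k] == gold[k])
    return correct / len(mentioned)

gold = {("Rome", "capitalOf"): "Italy", ("Paris", "capitalOf"): "France"}
source = {("Rome", "capitalOf"): "Italy", ("Paris", "capitalOf"): "Italy"}
score = trustworthiness(source, gold)
```

The conditioning on mentioned facts is what makes trustworthiness orthogonal to PageRank: a rarely linked page can be highly trustworthy, and a popular one can state many wrong values.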
Data quality assessment methodology
A data quality assessment methodology is defined as the process of evaluating whether a piece of data meets the information consumers' needs in a specific use case. The process involves measuring the quality dimensions that are relevant to the user and comparing the assessment results with the user's quality requirements.
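The assessment process described above can be sketched as one comparison step: measure the dimensions, then check each against that user's required threshold. The dimension names and thresholds below are illustrative assumptions; a real methodology would also specify how each dimension is measured.

```python
# Sketch of the assessment step above: compare measured quality dimensions
# against user-specific requirements. Names and thresholds are illustrative.

def assess(measured, requirements):
    """Return per-dimension pass/fail and an overall fitness-for-use verdict."""
    report = {dim: measured.get(dim, 0.0) >= threshold
              for dim, threshold in requirements.items()}
    return report, all(report.values())

# One dataset, one user's requirements: only the dimensions the user cares
# about (accuracy, completeness) enter the verdict; timeliness is ignored.
measured = {"accuracy": 0.97, "completeness": 0.80, "timeliness": 0.60}
requirements = {"accuracy": 0.95, "completeness": 0.90}
report, fit = assess(measured, requirements)
```

The same measured values can yield different verdicts under different requirement sets, which captures the earlier point that datasets with quality problems may still be fit for some applications.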