Inconsistent Data on the Semantic Web: A Theoretical Approach
Brian Goodrich


The Problem
A computer application produces a set of outputs from a set of inputs and its internal logic. If an application is given input data that puts it in a conflicted state when deciding its output, it will crash unless it has some logic for resolving that conflict. The Semantic Web is based on being able to parse human intent from structured, semi-structured, and unstructured data on the Web. Human intent is frequently conflicting.

Conflicting Data Sources
Malicious or incorrect information: deliberate deception or rerouting attempts, or information that is simply wrong through ignorance.
Incomplete information: insufficient context or simply unfinished data.
Humor: especially sarcasm, satire, and exaggeration (e.g. political cartoons).
Time: what once was one thing is now another (e.g. quality of service, price).
Ontological deficiency: the extraction ontology lacks the detail needed to separate data appropriately.

Solution
Fast: maintain the current speed of the Web.
Accurate: make correct decisions about which data to rely on.
Dynamic: keep pace with change on the Web.

Thesis
To propose a method that simplifies dealing with conflicting data on the Semantic Web in a fast, accurate, and dynamic way by supplying each web source with a derived indicator of its communal usage, called a Consensual Reliability Score (CRS).

Methods
Formula for deriving the CRS from inputs a, b, c, and d, with weighting constants z, y, x, and w.
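
The transcript does not reproduce the formula itself, only its terms: (a * z), (b * y), (c * x), and (d * w). As a minimal sketch, assuming the terms are simply added together and that each input is already a number on a comparable scale, the derivation might look like the following. The function name `crs`, the choice of addition, and the example values are all hypothetical and are not taken from the slides.

```python
def crs(a, b, c, d, z, y, x, w):
    """Combine the four inputs into a single score.

    ASSUMPTION: the terms are summed; the slides elide the operator.
    a: site type score, b: incoming index, c: usage, d: direct survey.
    z, y, x, w: weighting constants for a, b, c, d respectively.
    """
    return a * z + b * y + c * x + d * w


# Hypothetical inputs and weights, chosen only for illustration.
score = crs(a=0.5, b=1.0, c=0.5, d=0.0, z=1.0, y=2.0, x=2.0, w=4.0)
print(score)
```

Keeping the weights as explicit parameters reflects the slides' point that some of them still need to be determined by experiment.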

Site Type Mining (a * z)…
Five types of web pages: head pages, navigation pages, content pages, look-up pages, and personal pages.

Incoming Index …(b * y)…
A distributed web crawler counts hyperlinks, then traverses the unique hyperlink paths looking for additional links.
Link counts are stored in a hash indexed by the destination of the hyperlinks.
This provides a dynamic count of how often the internet as a whole points to a given web source, and therefore an indication of how often people use that source.
Excludes orphan sites (mostly personal sites and spam pop-ups).
Based on the success of the Google search engine.
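
The link-counting idea can be sketched in a few lines. This is a single-process illustration rather than the distributed crawler the slide describes, and it assumes the link structure is already available as an in-memory dictionary instead of being fetched from live pages. The function name `incoming_index` and the `.example` hosts are hypothetical.

```python
from collections import Counter, deque


def incoming_index(link_graph, seeds):
    """Count incoming hyperlinks per destination, starting from seed pages.

    link_graph: dict mapping a page URL to the list of URLs it links to
    (a stand-in for fetching and parsing real pages).
    """
    counts = Counter()        # hash keyed by hyperlink destination
    visited = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            counts[target] += 1
            if target not in visited:   # follow each unique path once
                visited.add(target)
                queue.append(target)
    return counts


# Hypothetical four-page web; nothing links to "orphan.example".
graph = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "orphan.example": ["a.example"],
}
index = incoming_index(graph, seeds=["a.example"])
```

Because the crawl only reaches pages by following links, the orphan page is never visited and never receives a count, which matches the slide's claim that orphan sites are excluded.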

Usage Mining …(c * x)…
The most straightforward way to test how often people use a web source.
Query the site’s number of hits: how many people have seen this site?
Problem: unlike the Incoming Index method, it does not exclude orphan sites.
Further experimentation is needed to determine the weight x.

Direct Survey …(d * w)…
The most reliable method of determining reliability.
Manually query users directly.
Too slow and costly to be considered a whole solution, but it can assist in CRS derivation.
Hopefully offsets frequently visited sites with no true information (onion.com, humor, etc.).
More experimentation is needed to determine the weight w.

Review

“Classical content data mining is not applicable in this case (CRS derivation) because it is the content of the web sources that is in question.” -Brian Goodrich

Storage
Global Index: fast access; centralized storage for CRSBot; centralized vulnerability; a vital non-distributed resource in a distributed system.
Local Storage: non-centralized vulnerability; non-unified derivation formula (disrupts the trust algorithm).
Local Derivation: too slow to be useful (problem size too large).

Related Work
Tim Berners-Lee: There is a choice here, and I am not sure right now which appeals to me most. One is to say precisely, "whatever any document says of the form xxxx is a member of W3C so long as it is signed with key ". The other is to say, "whatever is of form xxxx and can be inferred from information signed with key ".
Both choices have problems, and both use static references in a dynamic environment (the Web).

Contributions
CRS provides a fast and accurate measure of community consensus on the Web.
It allows reliable decisions between conflicting data on the Web, fine-tuning the results from the Semantic Web.
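
One way an application might use the score to decide between conflicting data is to prefer the claim from the source with the highest CRS. The slides do not specify a decision rule, so the sketch below is an assumption: the function name `resolve`, the default score of 0.0 for unknown sources, and the example hosts and values are all hypothetical.

```python
def resolve(claims, crs_scores):
    """Return the claimed value whose source has the highest CRS.

    claims: dict mapping source -> claimed value.
    crs_scores: dict mapping source -> CRS.
    ASSUMPTION: a source with no known score is treated as 0.0.
    """
    best_source = max(claims, key=lambda source: crs_scores.get(source, 0.0))
    return claims[best_source]


# Two hypothetical sources disagree about a price.
claims = {"shop.example": "19.99", "old-mirror.example": "24.99"}
scores = {"shop.example": 8.5, "old-mirror.example": 2.0}
price = resolve(claims, scores)
```

With a rule like this the application always has a way to settle a conflict, instead of reaching the undecidable state described in the problem statement.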

Limitations
Totally reliant on usage patterns of the internet, which may not always reflect which data is more correct.
Reflects only consensus about a data source, not the actual data contained in it.
Cannot express complex or compound relationships or extract partial truths.

Questions?