Web Semantization
by Martin Kruliš (v1.2), 9. 1. 2017

Web of Documents
[Figure: diagram of many boxes, each labeled "Web Page"]

Crawling
Automatic Web Processing
- By an application (crawler, bot)
- For the purpose of searching, indexing, data mining, …
Typical crawling process
- Breadth-first search of the link graph
- Crawler starts with initial URLs (seeds), which are placed in the queue
- Each page in the queue is downloaded and
  - processed (e.g., indexed) or saved for processing
  - links (URLs) are harvested and enqueued for processing

Although the crawler basically executes BFS on a graph, the process becomes rather complicated once all technical details are taken into account, such as:
- The size of the web requires enormous processing power.
- HTTP servers may be offline at the time of crawling, or they may crumble under the intensive traffic caused by the crawler.
- HTML pages may be malformed, URLs may not be valid, …
- Dynamic pages may generate different content each time they are downloaded. Worse, they can be used to create infinite-sized pages or link chains of supposedly legitimate pages (e.g., a calendar that can scroll into the future with no restrictions).
- …
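
The crawling loop described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the web is replaced by a hypothetical in-memory dictionary (`LINK_GRAPH`), no HTTP requests are made, and `max_pages` is an assumed safety limit against the infinite link chains mentioned above.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs linked from that page.
# A real crawler would download each page and parse its links instead.
LINK_GRAPH = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
}


def crawl(seeds, fetch_links, max_pages=1000):
    """Breadth-first traversal of the link graph starting from the seeds."""
    queue = deque(seeds)      # pages waiting to be downloaded
    seen = set(seeds)         # URLs that were already enqueued
    visited = []              # stands in for "processed or saved" pages
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):   # harvest links from the page
            if link not in seen:        # enqueue each URL only once
                seen.add(link)
                queue.append(link)
    return visited


order = crawl(["http://example.org/"], lambda url: LINK_GRAPH.get(url, []))
print(order)
```

The `seen` set keeps the crawler from downloading the same URL twice, and the page limit is one simple (assumed) way to stop on dynamically generated, endless link chains.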

Robots.txt
Managing the Crawling Bots
- Configured by robots.txt in the root of the web
  - See … for details

    User-agent: Googlebot
    Disallow: /private

    User-agent: *
    Disallow: /

  - In this example, Google should not index private stuff; other bots should not index anything
- Optionally, <meta> tags in HTML can be used

    <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">

- Obeyed by “good” robots (e.g., web indexers)
  - Bots that harvest e-mails for spamming or that exploit known vulnerabilities may choose to ignore it
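
How a well-behaved bot might evaluate these rules can be sketched with Python's standard `urllib.robotparser` module. The rules are the ones from the slide; the bot name `SomeOtherBot` and the tested paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from the slide, given as a list of lines.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private

User-agent: *
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)   # parse in-memory lines, no network access

# Googlebot may fetch everything except paths under /private.
print(parser.can_fetch("Googlebot", "/index.html"))
print(parser.can_fetch("Googlebot", "/private/notes.html"))

# Any other (hypothetical) bot falls under the "*" record.
print(parser.can_fetch("SomeOtherBot", "/index.html"))
```

The check is purely voluntary on the bot's side, which is why robots.txt cannot be used as an access-control mechanism.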

Searching the Web
- Many available commercial services
  - Google, Bing, Yahoo, Seznam, …
- Technical issues
  - Size of the data vs. response time
  - Showing the (relevant) results
    - Understanding user’s query
    - Placing the most relevant results on the top
- Social issues
  - Do we trust these services? How much?

PageRank
Ranking algorithm for web pages by Larry Page
- Assigns each page a value between 0 and 1
  - Probability that a random surfer will reach the page
  - Sum of all ranks is 1 (it is a probability distribution)
- Links between pages determine the rank
  - If a link A → B exists, A contributes to the rank of B
- Can be computed algebraically or iteratively
  - Initially $PR(x) = 1/N$, where $N$ is the number of pages
  - Each node $x$ sends $PR(x) / d_{out}(x)$ along each of its outgoing edges
  - The new $PR$ of a node is computed as the sum of the values arriving on its incoming edges

PageRank is named after Larry Page (not after its function, ranking pages). The exact implementation of this algorithm remains a Google secret. Furthermore, it is not the only mechanism used for ordering pages. Query relevance is taken into account first, and there are also personalization issues if the user is logged in.
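
The iterative computation can be sketched as follows. This follows only the simplified description on the slide: there is no damping factor (which the published formulation adds), and the sketch assumes that every page has at least one outgoing link. The three-page graph is a made-up example.

```python
def pagerank(links, iterations=50):
    """Iterative rank computation as described on the slide.

    links: dict mapping each page to the list of pages it links to.
    Assumption: every page has at least one outgoing link.
    """
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}   # PR(x) = 1/N
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for page, targets in links.items():
            share = rank[page] / len(targets)   # PR(x) / d_out(x)
            for target in targets:
                new_rank[target] += share       # sum over incoming edges
        rank = new_rank
    return rank


# Hypothetical link graph: A -> B, A -> C, B -> C, C -> A
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(GRAPH)
print(ranks)
```

For this graph the values settle near 0.4 for A and C and 0.2 for B, and they keep summing to 1 because each page distributes its whole rank among its outgoing links.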

Search Engine Optimization (SEO)
- URL is very important
  - Keywords should also be in the URL
- Meta-tags (description, keywords)
- Correct usage of tags that mark significant content
  - Especially <h1>, <em>, … are treated as more important
  - <article> - compact part of a page
  - <section> - division of the content
  - <nav> - navigation elements (links here are important)
  - <img alt=""> - description of the image

Application Design
Front Controller Design Pattern
- A mod_rewrite example

    RewriteEngine On
    RewriteCond %{REQUEST_URI} !^/myweb/(css|pic|index\.php)
    RewriteRule ^([-a-zA-Z0-9_]+)/?$ /myweb/index.php?%{QUERY_STRING}&page=$1 [L]

- /myweb/home is rewritten to /myweb/index.php?page=home
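
To see what the two directives do, the rule can be mimicked in Python with the `re` module. This is a rough simulation, not how Apache works internally: the function `rewrite` is hypothetical, and it assumes the rule lives in a per-directory context for /myweb/, so the pattern is matched against the path relative to that directory.

```python
import re
from urllib.parse import parse_qs

# Mirrors the RewriteCond: skip static directories and index.php itself.
EXCLUDED = re.compile(r"^/myweb/(css|pic|index\.php)")
# Mirrors the RewriteRule pattern: a single path segment, optional slash.
RULE = re.compile(r"^([-a-zA-Z0-9_]+)/?$")
BASE = "/myweb/"   # assumed directory in which the rule is configured


def rewrite(request_uri, query_string=""):
    """Return the rewritten URL, or the original one if the rule does not apply."""
    if not request_uri.startswith(BASE) or EXCLUDED.search(request_uri):
        return request_uri
    match = RULE.match(request_uri[len(BASE):])
    if match is None:
        return request_uri
    # Same substitution as in the rule: old query string, then page=$1.
    return f"{BASE}index.php?{query_string}&page={match.group(1)}"


target = rewrite("/myweb/home")
print(target)
print(parse_qs(target.split("?", 1)[1]))
print(rewrite("/myweb/css/style.css"))
```

With an empty original query string the literal substitution leaves a leading `&` in the query; parsing the query shows that the resulting parameters are the same as for `page=home`.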

Web Semantization
[Figure: example with parts labeled Affiliation, Name, E-mail, Job, Group membership]

Web Semantization
Machine-readable Web Annotations
- HTML provides structural information
  - How the data are nested or related
  - How the data should be visualized
- Semantic metadata can specify the meaning of the web page contents
  - Emphasizing information that could be automatically processed by search engines or browsers
  - E.g., names, postal addresses, date/time information, entity relations (person affiliated with institution), …

Microformats (μF)
- Use existing HTML attributes to include the semantic information into a web page
  - class – CSS classes of predefined names
  - rel – relationship of a target link in <a> element
  - rev – reverse relationship
- Vocabularies for various specific domains exist
  - hCard – contact information
  - hCalendar – calendar events
  - hResume – personal resumes and CVs
  - …

Microformats
Example

    <ul class="vcard">
      <li class="fn">Martin Kruliš</li>
      <li class="org">Charles University in Prague</li>
      <li class="tel">…</li>
      <li><a class="url" href="…">…</a></li>
    </ul>
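
Because microformats reuse the class attribute, a consumer only needs to look for the predefined class names. A minimal sketch with Python's standard `html.parser` follows; the class `HCardParser` is hypothetical, it handles only flat text properties, and the sample markup keeps just the two values that are present in the example above.

```python
from html.parser import HTMLParser

# Subset of hCard class names used in the example.
HCARD_PROPERTIES = {"fn", "org", "tel", "url"}


class HCardParser(HTMLParser):
    """Collects the text of elements whose class is an hCard property name."""

    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = None   # property of the element being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        matches = [name for name in classes if name in HCARD_PROPERTIES]
        self._current = matches[0] if matches else None

    def handle_data(self, data):
        if self._current and data.strip():
            self.card[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


SAMPLE = """
<ul class="vcard">
  <li class="fn">Martin Kruliš</li>
  <li class="org">Charles University in Prague</li>
</ul>
"""

hcard = HCardParser()
hcard.feed(SAMPLE)
print(hcard.card)
```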

Resource Description Framework (RDF)
- Describes objects in triples (subject-predicate-object expressions)
- Used for conceptual modeling and knowledge management
- Can be saved in various formats (text, XML, …)
RDF in Attributes (RDFa)
- Uses HTML/XML attributes that can carry metadata
  - about, rel, rev, src, href, resource, property, content, datatype, and typeof
- Vocabulary is bound to an XML namespace
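
The triple model itself is simple enough to sketch in plain Python. The statements below reuse the two persons from the RDFa example in this deck; the `Triple` type, the `match` helper, and the `rdf:type` spelling of the type predicate are illustrative assumptions, not an RDF library API.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Statements about two persons, written as subject-predicate-object.
TRIPLES = [
    Triple("#krulis", "rdf:type", "Person"),
    Triple("#krulis", "name", "Martin Kruliš"),
    Triple("#krulis", "knows", "#michelfeit"),
    Triple("#michelfeit", "rdf:type", "Person"),
    Triple("#michelfeit", "name", "Jan Michelfeit"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples that agree with the given (non-None) parts."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Whom does #krulis know, and what are the names of all persons?
known = [t.obj for t in match(TRIPLES, subject="#krulis", predicate="knows")]
names = [t.obj for t in match(TRIPLES, predicate="name")]
print(known, names)
```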

Resource Description Framework
Example

    <div vocab="…">
      <div resource="#krulis" typeof="Person">
        <span property="name">Martin Kruliš</span> knows
        <a property="knows" href="#michelfeit">Jan</a>
      </div>
      <div resource="#michelfeit" typeof="Person">
        <span property="name">Jan Michelfeit</span>
      </div>
    </div>

HTML5 Microdata
- A new specification of how to include metadata into HTML markup (using dedicated attributes)
  - itemscope – item is specified within this element
  - itemtype – URL of a vocabulary schema
  - itemprop – tag that annotates the content
  - …
- Vocabularies for various domains exist
  - schema.org schemas
    - Person, event, product, offer, …
  - Some microformat schemas can be used as well

HTML5 Microdata
Example

    <section itemscope itemtype="…">
      Person: <span itemprop="name">Martin Kruliš</span><br>
      Job: <span itemprop="jobTitle">assistant professor</span><br>
      Affiliation: <span itemprop="affiliation">Charles University in Prague</span><br>
      Web: <a href="…" itemprop="url">…</a>
    </section>
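
A consumer of such markup could read the item along these lines. The sketch uses Python's standard `html.parser`; the class `MicrodataParser` is hypothetical, it supports a single, non-nested item with text-only properties, and the itemtype URL `http://schema.org/Person` in the sample is an assumption (the URL is not preserved in the example above).

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Reads one flat microdata item: its type and its text properties."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._prop = None   # itemprop of the element being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:               # element that holds the item
            self.item_type = attrs.get("itemtype")
        self._prop = attrs.get("itemprop")

    def handle_data(self, data):
        if self._prop and data.strip():
            self.properties[self._prop] = " ".join(data.split())

    def handle_endtag(self, tag):
        self._prop = None


# Sample markup; the itemtype URL is assumed, not taken from the slide.
SAMPLE = """
<section itemscope itemtype="http://schema.org/Person">
  Person: <span itemprop="name">Martin Kruliš</span><br>
  Job: <span itemprop="jobTitle">assistant professor</span><br>
  Affiliation: <span itemprop="affiliation">Charles University in Prague</span>
</section>
"""

microdata = MicrodataParser()
microdata.feed(SAMPLE)
print(microdata.item_type, microdata.properties)
```

The labels outside the annotated elements ("Person:", "Job:") are ignored, which is the point of the annotation: only the marked values are meant for machines.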

Google Rich Snippets
Rich/Structured Snippets
- Google-supported vocabulary for annotations
  - Can be encoded in Microformat, RDFa, or Microdata
- Supports various domains
  - People, products, films, events, reviews, music, …
- The data are mapped to the knowledge graph
  - And displayed in the search engine

Discussion
