IBE110: HTML document processing concepts and searching on the Web 2015 Judith A. Molka-Danielsen.



Document Processing
Hypertext processing: In the 1990s we saw the development of internetworks and ubiquitous graphical interfaces (windows).
Tim Berners-Lee at CERN, the European Organization for Nuclear Research, created HTML and the URL (Uniform Resource Locator) scheme, so that a simple standardized form of markup, based on SGML, could be used to describe documents and a naming scheme would allow for the universal identification of documents.
Documents could then be viewed in graphical format, and large collections could be linked across multiple networks. This is hypertext processing.

Properties of Documents
Syntax - can express structure, presentation style, semantics, and external actions. It can be implicit in the contents of a document or expressed in a language.
Structure - a structural element such as a section can have a formatting style associated with it that tells how the elements relate to each other within the document.
Presentation style - how the document is displayed or printed. It can be embedded in the document, as in TeX or with LaTeX macros, or it can be defined separately, as CSS is for HTML documents. Presentation style can be determined by the author (in applications or languages) or by the reader (in the Web browser).
Semantics - the meaning within a language; it can be associated with use.

Characteristics continued...
Metadata - information about the organization of the data, that is, data about the data. Examples are author, publication date, and subject codes.

What is Markup?
Markup is everything in a document that is not content.
Typesetters used procedural markup to give instructions for how a document should look (for example, 16 pt bold Helvetica). Word processing software such as Microsoft Word also uses procedural markup, with its own specific set of markup codes. The codes apply to a single physical way of presenting information, such as a printed page. They do not define the appearance on other media such as CD-ROM or the Internet.
Descriptive markup, or generic markup, describes the structure of the document rather than its appearance. Content is separate from style, so you can publish on all media using the same set of structure instructions.

SGML
SGML (Standard Generalized Markup Language, ISO 8879, 1986) specifies a standard method for describing the structure of a document. Structural elements are, for example, title, chapter, and paragraph.
It is an extensible meta language. It can support an infinite variety of document structures, such as information bulletins, technical manuals, parts catalogs, design specifications, reports, letters, and memos.
The Document Type Definition (DTD) describes the structure of the document, much as a schema describes a database. The DTD provides a framework of elements (chapters, headers) and specifies rules for the relationships between elements, e.g. a chapter header must come after the start of a chapter.
A document instance is a document whose contents are tagged in conformance with a DTD. A DTD can be applied throughout a whole organization.

SGML continued
SGML uses tagging to identify the content's position within a DTD structure, so we insert tags around the content. Elements can be nested.
A parser program verifies that a document follows the rules of a DTD, that is, it checks whether the document is structurally correct.
Documents can be ported to different formats for different output media (printer, screen, CD-ROM, speaker, TV).
Style is usually handled separately by style sheets, such as Cascading Style Sheets (CSS).

HTML
HTML (first version in 1992) is a tagging language that can be used on the World Wide Web for text formatting and for linking documents. It adopts the syntax of SGML and is an application of SGML described by a particular DTD.
HTML is not an extensible language: authors cannot add their own tags.
HTML supports style sheets written in the CSS language (color, font, and layout for web pages) to define the look and layout of text and other materials. HTML can embed scripts written in languages such as JavaScript, which affect the behavior of HTML web pages.
The World Wide Web Consortium (W3C), maintainer of both the HTML and the CSS standards, has encouraged the use of CSS over explicit presentational HTML.
HTML5: cross-platform support for mobile applications and support for more file types. Work started in 2008; in 2014 it was a Proposed Recommendation of the W3C.
Potential of HTML5: http://learningcircuits.blogspot.no/2011/12/what-do-we-mean-when-we-say-html5.html
Element reference list:

Positive comments on HTML
HTML uses tags to separate content (text) from format (structure, appearance).
It lets amateurs control markup (both good and bad).
HTML tags were used for appearance formatting, but little attention was paid to content structuring.

Negative comments on HTML
HTML did not offer enough custom control over the WYSIWYG environment. Things looked different in different browsers (reader interpreted, not author interpreted).
Navigating through hypertext requires user memory.
Designing hypertext (document collections) for easy searching is hard to do.

Comments on CSS
Cascading Style Sheets helped HTML by freeing tags from carrying format information; that information goes in the style sheet instead. This lets tags carry structure information.
CSS is a styling tool that can also work with other markup languages such as XML.
The current version is CSS3.

Comments on CSS
Diagram labels: The Document; Formatting (Structure, Appearance); Content (Information, Data).
Structure - HTML does this a little bit.
Appearance - or presentation. Previously HTML did this with tags, but now all presentation control should be taken out of HTML documents and put in CSS or XSL files.

XML
XML (Extensible Markup Language, XML 1.0, 1998) is also a meta language, in that it describes other languages. There is no predefined list of elements; elements are specified using a DTD or a Schema. Style sheets (XSL) can be used to specify the output format of each element.
XML is based on SGML, but it is a subset and is considered easier to program.
Most current browsers can also display XML.
More on XML in a later lecture.
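
As a rough illustration of "no predefined list of elements", the sketch below parses a small XML document whose element names (memo, to, from, body) are invented for this example. It uses Python's standard xml.etree.ElementTree module, which checks only that the document is well-formed; it does not validate against a DTD or Schema, so that step is left out here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: these element names are made up for illustration.
# XML itself predefines none of them.
MEMO = """<memo date="2015-01-15">
  <to>Students</to>
  <from>Instructor</from>
  <body>Lecture notes are <em>online</em>.</body>
</memo>"""


def outline(xml_text):
    """Return (tag, leading text) pairs for every element, in document order."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    return [(el.tag, (el.text or "").strip()) for el in root.iter()]


def is_well_formed(xml_text):
    """True if the text parses as XML (this is not DTD validation)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for tag, text in outline(MEMO):
        print(tag, repr(text))
    # A start tag without its matching end tag is rejected by the parser.
    print(is_well_formed("<memo><to>Students</memo>"))
```

The nesting of em inside body shows up in the outline simply as a later entry; a real application would walk the tree rather than flatten it.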

How do search engines work?
They create directories in different ways:
Human-powered directories: In 2001, Yahoo depended on humans for its listings. You submitted a description to them, and the search looked for matches in the descriptions submitted. Changing your web page had no effect on your listing. You could get reviewed by others if you were a good site.
Crawler-based search engines: Most search engines use these. They create listings automatically, and the indexes change periodically when the crawler is rerun. The crawlers must (1) crawl through web pages, (2) make an index, and (3) rank the results.
Hybrid search engines: use both humans and crawlers to produce directories or listings.
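
The index and rank steps of a crawler-based engine can be sketched with a tiny inverted index. This is only a sketch under stated assumptions: the page texts and file names are invented, the crawl step is replaced by an in-memory dictionary, and ranking by raw word count is a stand-in chosen for brevity, not how any particular engine ranks.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for documents a crawler has fetched.
PAGES = {
    "a.html": "Smart data needs smart tools",
    "b.html": "Data about data is metadata",
    "c.html": "Tools for the web",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to {url: number of occurrences on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, word):
    """Return URLs containing the word, most occurrences first (ties by URL)."""
    postings = index.get(word.lower(), {})
    return sorted(postings, key=lambda url: (-postings[url], url))


if __name__ == "__main__":
    index = build_index(PAGES)
    print(search(index, "data"))
    print(search(index, "tools"))
```

Because the index is built ahead of time, a query only looks up one dictionary entry instead of rescanning every page, which is why changes to a page are not visible until the crawler runs again.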

Search engine features
Crawling features: deep crawling, frame support, image maps, robots, meta tags, link popularity, paid inclusion.
Indexing features: full body text, stop words, meta descriptions, meta keywords, ALT text, comments, stemming.
Ranking features: meta tags boost ranking, link popularity boosts ranking, direct hit boosts ranking.
Spam features: meta refresh (target pages take visitors automatically to other pages in a web site), invisible text (text is the same color as the background), tiny text.

Meaning of full text search types
Keyword search - Accepts a list of words as criteria and matches a document that contains any of the words. For example, a keyword search for "smart data" matches a document that contains either "smart" or "data".
Boolean search - Accepts a Boolean expression that states rules for the presence or absence of words in a document. Matches a document in which the required words are present and the forbidden words are absent. For example, a Boolean search for "smart & data" matches only documents that contain both "smart" and "data".
Phrase search - Accepts a list of words as criteria and matches a document that contains the words in the stated order as a complete phrase. For example, a phrase search for "smart data" matches only documents that contain the complete phrase "smart data".
Proximity search - Accepts a list of words as criteria and matches a document that contains the words in any order in close proximity. For example, a proximity search for "smart data" would match a document that contains the phrase "the data is smart".
Fuzzy search - Also known as pattern search. Tunes one of the previous search strategies by matching slight variations on the words in the criteria list. For example, a fuzzy phrase search for "smart data" could match a document that contains the variant phrase "a smart datum".
Ranking - Also known as weighting. In a fuzzy search, determines the relevance of the document based on the similarity of the match to the criteria. Documents with a higher ranking appear earlier in the result list. For example, a fuzzy phrase search for "smart data" would rank a document containing the exact phrase "smart data" higher than a document containing the variant phrase "a smart datum".
Stop words - Also known as noise words. Words that should be ignored in matches, such as "a", "the", "some", and other articles and prepositions. For example, if "in" and "the" are stop words, a phrase search for "smart data" would match a document that contains the phrase "smart in the data".
Synonyms - Also known as a thesaurus. Words that are equivalent for the documents in the repository. For example, if "smart" and "intelligent" are synonyms, a phrase search for "smart data" would match a document that contains the phrase "intelligent data".
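
Several of these search types can be sketched as small matching functions over word lists. The function names, the stop-word list, and the proximity window of four words are assumptions made for this illustration; the example calls reuse the "smart data" cases from the definitions above. Fuzzy matching, ranking, and synonyms are left out to keep the sketch short.

```python
import re

# Illustrative stop-word list; real engines use much longer ones.
STOP_WORDS = {"a", "the", "in", "is"}


def tokens(text, drop_stop_words=False):
    """Lowercase the text and split it into words, optionally dropping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    if drop_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return words


def keyword_match(query, doc):
    """Keyword search: the document contains any of the query words."""
    return bool(set(tokens(query)) & set(tokens(doc)))


def boolean_and_match(query, doc):
    """Boolean AND search: the document contains all of the query words."""
    return set(tokens(query)) <= set(tokens(doc))


def phrase_match(query, doc, drop_stop_words=False):
    """Phrase search: the query words appear consecutively, in order."""
    q = tokens(query, drop_stop_words)
    d = tokens(doc, drop_stop_words)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


def proximity_match(query, doc, window=4):
    """Proximity search: all query words fall within `window` consecutive words."""
    q = set(tokens(query))
    d = tokens(doc)
    starts = range(max(1, len(d) - window + 1))
    return any(q <= set(d[i:i + window]) for i in starts)


if __name__ == "__main__":
    print(keyword_match("smart data", "big data sets"))
    print(boolean_and_match("smart data", "big data sets"))
    print(phrase_match("smart data", "the data is smart"))
    print(proximity_match("smart data", "the data is smart"))
    print(phrase_match("smart data", "smart in the data", drop_stop_words=True))
```

Note that stop-word removal happens before matching, which is what lets the phrase search skip over "in the" in the last example.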

Other search engines besides Google
Examples of search engines: AltaVista (now Yahoo!), HotBot, NorthernLight, Excite.

Search Engine User Interface
Many search engines have advanced features that the general searcher does not know how to use. The most commonly used features are quotation marks and capitalization. (Show example case study in class.)
Important issues:
Query interface: differs by engine. In AltaVista (Yahoo!) a sequence of words is a logical union; in HotBot it is an intersection.
Interface for complex queries: Boolean, phrase, proximity, wild cards, filtering, and special qualifications by date, language, URL, title, Internet domain, or file type.
Response interface: 10 entries per page. Each entry contains information on URL, size, date indexed, and some text.
Return options: the number of pages returned, and possibly sorting by URL or date.

Crawling the Web
A ranking algorithm such as PageRank can be used to rank the relevancy of documents in a hit set. The same algorithm can be used by web crawler programs to decide which page to visit next.
Crawlers can traverse up to 10 million Web pages per day.
Traversal approaches: Breadth first looks at all pages linked from the current page, then at all pages linked from those, and so on. Depth first follows the first link on the page and on each successive page, then returns up recursively. This is a narrow but deep search.
Crawlers can use a lot of bandwidth, so priorities and restrictions might be set on their use.
Crawlers are also referred to as spiders.
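
The two traversal approaches can be compared on a small link graph. The graph below is hypothetical and held in memory, so no pages are actually fetched; a real crawler would also need politeness delays and robots.txt handling, which this sketch omits.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to, in the order they appear.
LINKS = {
    "home": ["news", "about"],
    "news": ["story1", "story2"],
    "about": ["staff"],
    "story1": [],
    "story2": [],
    "staff": [],
}


def crawl_breadth_first(start, links):
    """Visit all pages one link away, then two links away, and so on."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


def crawl_depth_first(start, links, seen=None):
    """Follow the first unvisited link as far as possible, then back up."""
    seen = set() if seen is None else seen
    seen.add(start)
    order = [start]
    for target in links.get(start, []):
        if target not in seen:
            order += crawl_depth_first(target, links, seen)
    return order


if __name__ == "__main__":
    print(crawl_breadth_first("home", LINKS))
    print(crawl_depth_first("home", LINKS))
```

Both functions keep a set of pages already seen, since real link graphs contain cycles and a crawler without that check would loop forever.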

Ranking: how is it used by search engines
Criteria: location, frequency, meta tags, number of web pages indexed, spamming controls, "off the page" link analysis, click-through ratings.
The difference between Web ranking and DBMS ranking is that Web ranking can use hyperlink information. It can use the number of links coming into a site, or the number of outward-pointing links to other sites.
Authorities are pages that have many links pointing to them. They are likely to be good sources of information on the searched topic. The number of inward reference links can indicate the popularity of the site, and perhaps this reflects the quality of the information there.
Hubs are pages that have many outgoing links to other servers. They point to pages with similar or related information.
Better authority pages come from incoming edges from good hubs. Better hub pages come from outgoing edges to good authorities.
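
The mutual reinforcement between hubs and authorities can be sketched as an iterative computation, in the style of the HITS algorithm. The link graph, the number of iterations, and the normalization to unit length are assumptions of this sketch; the slide does not specify any of them.

```python
import math


def hits(links, iterations=20):
    """Compute hub and authority scores.

    links maps each page to the list of pages it links to.
    Returns two dicts: (hub scores, authority scores).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Rescale both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


if __name__ == "__main__":
    # Hypothetical graph: two directory-like pages and one page with a single link.
    graph = {"hub1": ["A", "B"], "hub2": ["A", "B"], "C": ["A"]}
    hub, auth = hits(graph)
    print(sorted(auth, key=auth.get, reverse=True)[:2])
    print(sorted(hub, key=hub.get, reverse=True)[:2])
```

In this graph page A collects links from all three other pages and ends up as the strongest authority, while the two pages that link to both A and B score higher as hubs than the page that links to A alone.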

How users can improve searching on the web
User problems:
The user does not understand the meaning of searching.
The user does not know the rules (case, stemming) used by the search engine and gets unexpected answers.
Users have problems with Boolean logic.
Users find the engines slow and the answer sets too large, not very relevant, and not up to date.
Techniques for users to improve information retrieval:
Start with a relevant page and use the keywords from that page.
Use authors' personal Web pages. Pages on the topic already contain relevant references and links.
Use web directories to select a category as a starting point.
Use search engines to improve the query formulation on a relevant set of answers.
On Web query languages: structured searches (using SQL-type queries) only work on domains where the data is structured.

Size of the Web (Netcraft survey)