Special Topics in Computer Science The Art of Information Retrieval Chapter 6: Text and Multimedia Languages and Properties Alexander Gelbukh www.Gelbukh.com.



2 Previous chapter: Conclusions
Query operations:
- Relevance feedback
  o Simple, understandable
  o Needs user attention
  o Term re-weighting
- Local analysis for query expansion
  o Co-occurrences in the retrieved docs
  o Usually gives better results than global analysis
  o Computationally expensive
- Global analysis
  o Worse results. What is good for the collection is not necessarily good for a query
  o Linguistic methods, dictionaries, ontologies, stemming, ...

3 Previous chapter: Trends and research topics
- Interactive interfaces
  o Graphical, 2D or 3D
- Refining global analysis techniques
- Application of linguistic methods. Stemming. Ontologies
- Local analysis for the Web (now too expensive)
- Combine the three techniques (feedback, local, global)

4 Anatomy of a document...
- We search for documents
- What is a document?

5 Characteristics of a document
- Syntax is a device that "plays" the document, producing semantics (in a sense: presentation)
- Like a CD drive plays a CD to produce music
- Knowing Korean + paper with glyphs → meaning

6 ...Anatomy of a document
- Queries are conditions on the semantics/presentation, not on the (binary?) data of the document
- Thus we need to know the syntax
  o Example: search in PS or PDF
- How to describe it formally?

7 Metadata
- Information about the organization of data
  o Data about the data
- Descriptive vs. semantic metadata
  o Descriptive: about creation: author, date, ...
  o Semantic: about meaning: keywords, subject codes, ... Ontologies
  o Others: who may use the document and how. E.g.: adult, confidential, signature
- Standards (many)
  o Dublin Core Metadata Element Set: 15 fields. Descriptive.
  o Machine Readable Cataloging Record (MARC): bibliographic
- Web: very important
  o Many projects on Web ontologies. Semantic Web.

8 Text
- Encoding. ASCII-7, 8. Unicode: oriental languages
- Format. Binary vs. ASCII (better). DOC, RTF, PDF, PS
- Compression. ZIP, ARJ
- Binary in ASCII: uuencode
- To predict the behavior of tools and systems, we need to model text
- Entropy: the limit of compression, degree of chaos
- Statistics of letters and words
  o Very skewed
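
The entropy bullet can be illustrated with a minimal sketch. It assumes the simplest possible text model, in which each symbol is drawn independently (a zero-order model); the function name `entropy_bits_per_symbol` is made up for this illustration. Real compressors also exploit context, so they can do better than this per-symbol bound.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(text):
    """Zero-order entropy of the symbols in `text`, in bits per symbol.

    Assumption: symbols are treated as independent draws, so only
    their relative frequencies matter.
    """
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed symbol probabilities p
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # A skewed distribution has lower entropy than a uniform one
    # over the same alphabet, so it compresses better.
    print(entropy_bits_per_symbol("aaaaaaab"))
    print(entropy_bits_per_symbol("abababab"))
```

Under this model, a text of n symbols cannot be coded in fewer than about n times the returned value bits, which is the sense in which entropy is "the limit of compression".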

9 Zipf law

10 Zipf law, etc.
- The i-th most frequent word appears $k / i^{\theta}$ times, with $\theta$ = 1.5 to 2
- Mandelbrot form: $k / (c + i)^{\theta}$
- 50% of the text is made up of a few hundred words
- Most of them are stopwords: the, of, and, a, to, in, ...
- Not indexed → smaller indices
- Distribution of words by docs
  o The parameters p, k depend on the collection and the word
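
A small sketch of the rank-frequency model on this slide. It takes the frequency of the i-th most frequent word to be k / i^theta, and for the Mandelbrot form it assumes the same exponent is applied, k / (c + i)^theta. The function names and the parameter values in the usage part are illustrative choices, not values from the lecture.

```python
def zipf_frequencies(k, theta, n_ranks):
    """Predicted counts k / i**theta for ranks i = 1..n_ranks."""
    return [k / i ** theta for i in range(1, n_ranks + 1)]


def mandelbrot_frequencies(k, c, theta, n_ranks):
    """Mandelbrot generalization: k / (c + i)**theta (assumed form)."""
    return [k / (c + i) ** theta for i in range(1, n_ranks + 1)]


def share_of_top(freqs, top):
    """Fraction of all occurrences covered by the `top` most frequent words."""
    return sum(freqs[:top]) / sum(freqs)


if __name__ == "__main__":
    # Hypothetical parameters: a 100,000-word vocabulary, theta = 1.5.
    freqs = zipf_frequencies(k=1_000_000, theta=1.5, n_ranks=100_000)
    # How much of the text the few hundred top-ranked words account for.
    print(share_of_top(freqs, 300))
```

Fitting k, c and theta to a real collection would require counting word frequencies first; the sketch only evaluates the model.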

11 Heaps law

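12 Heaps law, etc.
- Number of distinct words (size of vocabulary) = $K n^{\beta}$, where n is the size of the text
  o $\beta \approx 0.5$: about the square root
- Applies to collections
- Applies to the WWW
- Average length of a word in English (letters per word), measured as:
  o an average over the text
  o an average over the text without stopwords
  o an average over the vocabulary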
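
Heaps' law can be checked empirically by tracking how the vocabulary grows as a text is read. The sketch below assumes the text has already been split into tokens; `vocabulary_growth` and `heaps_estimate` are names invented for this example, and the constants K and beta must be fitted per collection.

```python
def vocabulary_growth(tokens):
    """sizes[n-1] = number of distinct tokens among the first n tokens."""
    seen = set()
    sizes = []
    for tok in tokens:
        seen.add(tok)
        sizes.append(len(seen))
    return sizes


def heaps_estimate(n, K, beta):
    """Heaps' law prediction of vocabulary size for a text of n words."""
    return K * n ** beta


if __name__ == "__main__":
    tokens = "the cat sat on the mat and the dog sat on the cat".split()
    print(vocabulary_growth(tokens))
    # Hypothetical constants, for illustration only.
    print(heaps_estimate(n=1_000_000, K=10, beta=0.5))
```

Comparing the measured growth curve with `heaps_estimate` over increasing prefixes of a real corpus is one way to fit K and beta.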

13 Similarity between strings
- A distance is symmetric and satisfies the triangle inequality: $dist(a,c) \le dist(a,b) + dist(b,c)$
- Hamming: number of differing positions. Also for sets.
- Soundex: phonetic similarity
- Levenshtein: minimum number of insertions, deletions, substitutions
  o dist(survey, surgery) = 2
  o A very good measure
- Longest common subsequence: survey, surgery → surey
- Various metrics to compare whole docs
  o E.g., consider strings as symbols, or similarity of strings, etc.
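
The three string measures named on this slide can be sketched with textbook dynamic programming. These are straightforward versions written for clarity, not tuned implementations; the function names are chosen for this example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def longest_common_subsequence(a, b):
    """One longest common subsequence of a and b (not necessarily contiguous)."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(levenshtein("survey", "surgery"))                 # 2
    print(longest_common_subsequence("survey", "surgery"))  # surey
```

Levenshtein distance is also the natural tool for the approximate keyword search that OCR errors call for.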

14 Markup languages
- Our documents do not belong to us but to Bill Gates!
- Extra textual syntax to describe formatting, structure, ...
- Marks are called tags. Initial and ending tags surround the marked text.
- Standard metalanguage: SGML (Standard Generalized Markup Language)
  o XML (eXtensible), its subset: new metalanguage for the Web
  o HTML is an instance of SGML

15 SGML
- Provides rules for defining tags
- A document consists of:
  o Definitions of tags: Document Type Definition (DTD); informal comments or an additional description
  o Text with tags
- Tags: <tag> text </tag>
- Mostly defines semantics, not printing format
  o Printing format is defined in other languages

16 HTML
- 1992; version 4.0: 1997
- Instance of SGML
  o A DTD exists but is usually not used
- Also does not define (much of) the formatting. Thus: Cascading Style Sheets (CSS)
  o define aspects of formatting
  o can be combined (cascaded)
  o not well supported by browsers
- Does NOT (unlike generic SGML, which is too expensive)
  o allow specifying new tags
  o support nesting structures
  o support validity checks

17 XML (eXtensible...)
- More flexible than HTML, simpler than SGML
- Simplified subset of SGML
  o Much simpler to implement
- Allows for human- and machine-readable markup
  o Good for development of Web docs
  o Allows doing things that are now done with Java scripts
- Using a DTD is optional; the parser can discover the tags
- Extensible Stylesheet Language (like CSS for HTML)
  o Like macros in a word processor
- Extensible Linking Language
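
The point that a DTD is optional can be seen with any non-validating XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the sample document and the helper name `discovered_tags` are invented for illustration. The parser only requires the document to be well formed and learns the tag names from the text itself.

```python
import xml.etree.ElementTree as ET

# A made-up document with its own tag vocabulary and no DTD.
SAMPLE = """<lecture number="6">
  <title>Text and Multimedia Languages and Properties</title>
  <topic>Metadata</topic>
  <topic>Markup languages</topic>
</lecture>"""


def discovered_tags(xml_text):
    """Sorted list of distinct tag names found in a well-formed document."""
    root = ET.fromstring(xml_text)
    return sorted({elem.tag for elem in root.iter()})


if __name__ == "__main__":
    print(discovered_tags(SAMPLE))
    try:
        ET.fromstring("<a><b></a>")  # tags not properly nested
    except ET.ParseError as err:
        print("not well formed:", err)
```

Well-formedness (properly nested, closed tags) is checked even without a DTD; validity against a declared structure is the extra step a DTD would add.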

18 Uses of XML
- MathML: Mathematical Markup Language
  o Not only presentation but also meaning of expressions!
- SMIL: Synchronized Multimedia Integration Language
  o Declarative language to specify positions and timing
- Resource Description Framework (RDF)
  o Metadata for XML
- Trend: HTML evolves to model and describe the structure of data, not presentation details

19 Multimedia
- Text, sound, images, video
- Image formats. BMP. Compression:
  o GIF. Good for few colors
  o JPG. Lossy compression. Parametric: can be controlled
  o TIFF is used for exchange; can contain metadata
- Moving images:
  o MPEG: Moving Picture Experts Group. Encodes changes
- Textual images. Compression. Retrieval:
  o Metadata, keywords
  o OCR. Many typos; keyword search should be approximate
  o Treat as a sequence of images, convert the query similarly

20 Taxonomy of Web languages

21 Conclusions
- Modeling of text helps predict the behavior of systems
  o Zipf law, Heaps law
- Describing the structure of documents formally allows part of their meaning to be handled automatically, e.g., in search
- Languages to describe document syntax
  o SGML, too expensive
  o HTML, too simple
  o XML, good combination

22 Thank you!
- Till November 6
- The class of Oct 30 is cancelled