Special Topics in Computer Science. The Art of Information Retrieval. Chapter 6: Text and Multimedia Languages and Properties. Alexander Gelbukh, www.Gelbukh.com
2 Previous chapter: Conclusions
Query operations:
Relevance feedback
  o Simple, understandable
  o Needs user attention
  o Term re-weighting
Local analysis for query expansion
  o Co-occurrences in the retrieved docs
  o Usually gives better results than global analysis
  o Computationally expensive
Global analysis
  o Worse results. What is good for the collection is not necessarily good for a query
  o Linguistic methods, dictionaries, ontologies, stemming,...
3 Previous chapter: Trends and research topics
Interactive interfaces
  o Graphical, 2D or 3D
Refining global analysis techniques
Application of linguistic methods. Stemming. Ontologies
Local analysis for the Web (now too expensive)
Combine the three techniques (feedback, local, global)
4 Anatomy of a document...
We search for documents
What is a document?
5 Characteristics of a document
Syntax is a device that "plays" the document, producing its semantics (or, roughly, its presentation)
Like a CD drive plays a CD to produce music
Knowing Korean + a paper with glyphs → meaning
6 ...Anatomy of a document
Queries are conditions on semantics/presentation, not on the (binary?) data of the document
Thus we need to know the syntax
  o Example: search in PS or PDF
How to describe it formally?
7 Metadata
Info about the organization of data
  o Data about the data
Descriptive vs. semantic metadata
  o Descriptive: about creation: author, date,...
  o Semantic: about meaning: keywords, subject codes,... Ontologies
  o Others: who may use it and how. E.g.: adult, confidential, signature
Standards (many)
  o Dublin Core Metadata Element Set: 15 fields. Descriptive.
  o Machine-Readable Cataloging (MARC) record: bibliographic
Web: very important
  o Many projects on Web ontologies. Semantic Web.
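As a rough illustration of descriptive metadata, the sketch below checks a record against the 15 element names of the Dublin Core Metadata Element Set. The helper `unknown_fields` and the sample record are hypothetical; only the element names come from the standard.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def unknown_fields(record: dict[str, str]) -> list[str]:
    """Return the field names of `record` that are not Dublin Core elements."""
    return sorted(k for k in record if k.lower() not in DUBLIN_CORE_ELEMENTS)


# Hypothetical record describing this presentation.
record = {
    "title": "Text and Multimedia Languages and Properties",
    "creator": "Alexander Gelbukh",
    "type": "presentation",                   # illustrative value
    "keywords": "Zipf law, Heaps law, XML",   # not a Dublin Core element
}
print(unknown_fields(record))
```

In Dublin Core terms the keywords would go under `subject`, which is where descriptive and semantic metadata meet.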
8 Text
Encoding. 7-bit and 8-bit ASCII. Unicode: also covers oriental scripts
Format. Binary vs. ASCII (better). DOC, RTF, PDF, PS
Compression. ZIP, ARJ
Binary in ASCII: uuencode
To predict the behavior of tools and systems, we need to model text
Entropy: the limit of compression, degree of chaos
Statistics of the letters and words
  o Very skewed
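The entropy remark can be made concrete with a minimal sketch, assuming a zero-order model in which each character is treated as an independent symbol. Under that assumption the entropy bounds the average number of bits per character; compressors that model context can do better than this particular bound. The function name is illustrative.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(text: str) -> float:
    """Zero-order Shannon entropy of `text`, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed character probabilities.
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return abs(h)  # avoid returning -0.0 for single-symbol texts


sample = "abracadabra"
h = entropy_bits_per_symbol(sample)
print(f"{h:.3f} bits per character")
print(f"about {h * len(sample) / 8:.1f} bytes under this model")
```

A very skewed letter distribution gives a low entropy, which is why ordinary text compresses well.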
9 Zipf law
10 Zipf law, etc.
The i-th most frequent word appears $k / i^{\theta}$ times, $\theta$ = 1.5 to 2
Mandelbrot form: $k / (c + i)^{\theta}$
50% of a text consists of a few hundred words
Most of them are stopwords: the, of, and, a, to, in...
Not indexed, hence smaller indices
Distribution of words by docs
  o p, k depend on collection and word
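A small sketch of how the Zipf relation can be examined on a text: count words, rank them by frequency, and estimate the exponent as the negated slope of a least-squares line in log-log space. The tokenizer (lowercase ASCII letters only) and the function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def rank_frequency(text: str) -> list[tuple[int, str, int]]:
    """Return (rank, word, count) triples, most frequent word first."""
    words = re.findall(r"[a-z]+", text.lower())  # crude tokenizer, English only
    ranked = Counter(words).most_common()
    return [(rank, w, c) for rank, (w, c) in enumerate(ranked, start=1)]


def fit_theta(freqs: list[float]) -> float:
    """Estimate theta in freq = k / rank**theta from frequencies sorted by rank."""
    if len(freqs) < 2 or min(freqs) <= 0:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -sxy / sxx  # slope of log(freq) against log(rank) is -theta


text = "the cat and the dog and the bird"
for rank, word, count in rank_frequency(text)[:2]:
    print(rank, word, count)
```

On a real collection the fit is usually restricted to a range of ranks, since the very top and the long tail deviate from a straight line.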
11 Heaps law
12 Heaps law, etc.
# of distinct words (size of vocabulary): $V = K n^{\beta}$, where n is the text size in words
  o $\beta \approx 1/2$: roughly a square root
Applies to collections
To WWW
Average length of a word in English, in letters per word:
  o average by text
  o without stopwords
  o average by vocabulary
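Vocabulary growth can be measured directly and compared with the Heaps estimate. This is a sketch only: the constants passed to `heaps_estimate` are placeholders, since K and the exponent have to be fitted for each collection.

```python
def vocabulary_growth(tokens: list[str]) -> list[int]:
    """Vocabulary size after each token of the text, in reading order."""
    seen: set[str] = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


def heaps_estimate(n: int, K: float, beta: float) -> float:
    """Predicted number of distinct words in a text of n words: K * n**beta."""
    return K * n ** beta


tokens = "a b a c b d".split()
print(vocabulary_growth(tokens))
# K = 10 and beta = 0.5 are placeholder values, not fitted constants.
print(heaps_estimate(10_000, K=10, beta=0.5))
```

Because the exponent is below 1, the vocabulary keeps growing with the text but ever more slowly, which is what makes vocabulary-sized structures affordable for large collections.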
13 Similarity between strings
Distance properties: symmetric; triangle inequality: $dist(a,c) \le dist(a,b) + dist(b,c)$
Hamming: # of different positions. Also for sets.
Soundex: phonetic similarity
Levenshtein: min # of insertions, deletions, substitutions
  o dist (survey, surgery) = 2
  o A very good measure
Longest common subsequence: survey, surgery → surey
Various metrics to compare whole docs
  o E.g., consider strings as symbols, or similarity of strings, etc.
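The three measures on this slide can be sketched with textbook dynamic programming. These are plain illustrative implementations, not tuned ones; the function names are chosen here.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def lcs(a: str, b: str) -> str:
    """One longest common subsequence of a and b."""
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = len(a), len(b)
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(levenshtein("survey", "surgery"))  # 2, as on the slide
print(lcs("survey", "surgery"))          # surey
```

Both dynamic programs take time proportional to the product of the string lengths, which matters when whole documents rather than words are compared.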
14 Markup languages
Our documents do not belong to us but to Bill Gates!
Extra textual syntax to describe formatting, structure,...
Marks are called tags. Initial and ending tags surround the marked text.
Standard metalanguage: SGML (Standard Generalized Markup Language)
  o XML (eXtensible), its subset: new metalanguage for Web
  o HTML is an instance of SGML
15 SGML
Provides rules for defining tags
A document consists of:
  o Definitions of tags
    Document Type Definition, DTD
    Informal comments or an additional description
  o Text with tags
    Tags: <tag> text </tag>
Mostly defines semantics, not printing format
  o Formatting is defined in other languages
16 HTML
1992; 4.0: 1997
Instance of SGML
  o A DTD exists, but is usually not used
Also does not define (much of) formatting. Thus:
Cascading Style Sheets (CSS)
  o define aspects of formatting
  o can be combined (cascaded)
  o not well supported by browsers
Does NOT (unlike generic SGML, which is too expensive)
  o allow specifying new tags
  o support nesting structures
  o support validity checks
17 XML (eXtensible...)
More flexible than HTML, simpler than SGML
Simplified subset of SGML
  o Much simpler to implement
Allows for human- and machine-readable markup
  o Good for development of Web docs
  o Allows doing things that are now done with JavaScript
Using a DTD is optional; the parser can discover tags
Extensible Stylesheet Language (XSL), like CSS in HTML
  o Like macros in a word processor
Extensible Linking Language (XLL)
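The point that a parser can discover tags without a DTD can be illustrated with Python's standard `xml.etree.ElementTree`. The sample document and its tag names (`course`, `lecture`, `title`, `topic`) are made up for this sketch.

```python
import xml.etree.ElementTree as ET


def discover_tags(xml_text: str) -> list[str]:
    """Parse well-formed XML without any DTD and list the tag names it uses."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if not well-formed
    return sorted({element.tag for element in root.iter()})


# Hypothetical document: no DTD is declared anywhere.
doc = """<course>
  <lecture n="6">
    <title>Text and Multimedia Languages and Properties</title>
    <topic>Zipf law</topic>
    <topic>Heaps law</topic>
  </lecture>
</course>"""

print(discover_tags(doc))
print([t.text for t in ET.fromstring(doc).iter("topic")])
```

Without a DTD the parser only checks that the document is well-formed (tags properly nested and closed); checking that the right tags appear in the right places still requires a DTD or another schema.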
18 Uses of XML
MathML: Mathematical Markup Language
  o Not only presentation but also meaning of expressions!
SMIL: Synchronized Multimedia Integration Language
  o Declarative language to specify positions and timing
RDF: Resource Description Framework
  o Metadata for XML
Trend: HTML evolves to model and describe the structure of data, not presentation details
19 Multimedia
Text, sound, images, video
Image formats. BMP. Compression:
  o GIF. Good for few colors
  o JPG. Lossy compression. Parametric: the quality can be controlled
  o TIFF is used for exchange; can contain metadata
Moving images:
  o MPEG: Moving Picture Experts Group. Encodes changes
Textual images. Compression. Retrieval:
  o Metadata, keywords
  o OCR. Many typos; keyword search should be approximate
  o Treat as a sequence of images, convert the query similarly
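Approximate keyword search over OCR output can be sketched with `difflib.get_close_matches` from the Python standard library. Note that difflib scores similarity with a matching-blocks ratio rather than edit distance, so it is a stand-in for the technique, not the same measure. The cutoff of 0.8 and the sample OCR words are assumptions.

```python
import difflib


def approximate_keyword_search(query: str, ocr_words: list[str],
                               cutoff: float = 0.8) -> list[str]:
    """Return OCR words similar enough to `query`, best match first."""
    candidates = [w.lower() for w in ocr_words]
    return difflib.get_close_matches(
        query.lower(),
        candidates,
        n=max(len(candidates), 1),  # n must be positive; keep every hit
        cutoff=cutoff,              # assumed threshold, tune per OCR engine
    )


# Hypothetical OCR output with typical misreadings (l read as i, rm read as nn).
ocr_words = ["infonnation", "retrievai", "systems"]
print(approximate_keyword_search("retrieval", ocr_words))
```

A lower cutoff tolerates more OCR damage but returns more false matches, so the threshold trades recall against precision.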
20 Taxonomy of Web languages
21 Conclusions
Modeling of text helps predict the behavior of systems
  o Zipf law, Heaps law
Describing the structure of documents formally makes it possible to process part of their meaning automatically, e.g., for search
Languages to describe document syntax
  o SGML, too expensive
  o HTML, too simple
  o XML, good combination
22 Thank you! Till November 6 The class of Oct 30 is cancelled