WIRED Week 5 Readings Overview - Text & Multimedia Languages & Properties - Text Operations - Multimedia IR Finalize Topic Discussions Schedule Projects.

WIRED Week 5
- Readings Overview
  - Text & Multimedia Languages & Properties
  - Text Operations
  - Multimedia IR
- Finalize Topic Discussions Schedule
- Projects and Papers Ideas
  - Searchable Personal Digital Library
  - Browser hacks for searching

Text & Multimedia (Chapter 6)
- Metadata
- Text
- Markup Languages
- Multimedia
- Trends

It all comes down to Text
- Main form of knowledge communication, storage, retrieval
  - Picture = 1K words?
- Clusters of (hopefully related) text are documents
- Documents have syntax, structure and semantics
  - Styles
  - Formats
  - Uses
  - Languages

Metadata
- Information that describes a document that is not (necessarily) in the document
- Describes the document in relation to other documents
- Context about the Content
- Document semantics
- Internally consistent descriptions of content for individual documents, document sets or a specified set of content
- For collections or individual documents

Metadata Types
- Dublin Core elements
- MARC (machine readable cataloging)
  - What isn’t machine readable?
- Semantic Web elements
- Bottom-up, derived data
- Format-based
  - ASCII, EBCDIC
  - RTF
  - PostScript, PDF
  - MIME

Information Theory & Text
- Text conveys an amount of information related to the distribution of symbols (in a document).
- Entropy (Ribeiro-Neto 1999): $E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$, measured in bits, for an alphabet of σ symbols where symbol i occurs with probability $p_i$.
- The more unique the symbols, the more information contained.
- Text models supply these probabilities.
- Text models can be unique themselves.
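
The entropy on this slide can be estimated directly from symbol counts. The sketch below is only an illustration: it assumes the probabilities p_i are the relative frequencies of characters observed in one string, and the function name `entropy_bits` is hypothetical rather than taken from the readings.

```python
import math
from collections import Counter


def entropy_bits(text: str) -> float:
    """Shannon entropy in bits per symbol, using observed symbol frequencies."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # E = -sum(p_i * log2(p_i)) over the symbols that actually occur
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    for sample in ("aaaa", "abab", "abcd"):
        print(sample, entropy_bits(sample))
```

A string made of one repeated symbol carries no information per symbol, while a string that spreads evenly over more distinct symbols scores higher, which matches the "more unique symbols, more information" point.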

Natural Language and Text
- We have two kinds of symbols
  - Words
  - Separators, Differentiators
- Conventions of language have probabilities
  - “th”, “qu”, “ff”, “ing”
  - “fn”, “qi”, “aa”, “en”
- Etaoin Shrdlu
- Grammar models have structure
- Languages have structure too

Models of Entropy & Frequency
- Entropy is related to Frequency (Uniqueness)
- Zipf’s Law is a model of the distribution of words in a text, document or language.
  - “Principle of Least Effort”
  - Known, predictable models of language use
  - The i-th most frequent word appears $1/i^{\theta}$ times as often as the most frequent word
  - Related quantities: vocabulary size, harmonic number

Zipf’s Law
- Applied to word frequency in a text, the distribution states that the n-th ranked word will appear k/n times, where k is a constant for that text.
- It is easier to choose and use familiar words, so the probability of occurrence of familiar words is higher.
- $r \times f = C$, where r is the rank, f is the frequency and C is a constant.
- Count all of the words in a document (minus a stop list); the most frequent words represent the subject matter of the document.
- Relative frequency (more often than expected) can be used instead of absolute frequency.
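
The rank-frequency relationship can be checked on any word list. This is a minimal sketch, assuming the text is already split into words; `rank_frequency` is a hypothetical helper, and ties in frequency are broken alphabetically purely to make the output deterministic.

```python
from collections import Counter


def rank_frequency(words):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(words)
    # Sort by descending frequency, then alphabetically to break ties
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    sample = "the the the the of of and".split()
    for row in rank_frequency(sample):
        print(row)
```

If the text follows Zipf's law, the last column (rank times frequency) stays roughly constant across rows. Real corpora only approximate this, and a sample this small is for illustration only.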

Wyllys on Zipf’s Law
- Surprisingly constrained relationship between rank and frequency in natural language.
- Zipf said the fundamental reason for human behavior is the striving to minimize effort.
- Mandelbrot’s further refinement of Zipf’s law: $(r + m)^{B} f = c$, where r is the rank of a word, f is its frequency, and m, B and c are constants dependent on the corpus.
- m has the greatest effect when r is small.

Heaps’ Law
- Predicts the growth of a vocabulary in a normal (natural language) text
- A text can also be a collection of documents
  - Papers for this class?
  - The Web?
- Length of the words in the vocabulary increases logarithmically with text size
  - The longer the text (documents), the longer the words in the vocabulary
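
Vocabulary growth can be observed by counting distinct words as a text is read. The sketch below assumes a pre-tokenized text. It also uses the usual statement of Heaps' law, V = K * n^beta, where K and beta are corpus-dependent constants; the values used here are placeholders, not figures from the readings, and the function names are hypothetical.

```python
def vocabulary_growth(tokens):
    """Number of distinct words seen after each token of the text."""
    seen = set()
    sizes = []
    for token in tokens:
        seen.add(token)
        sizes.append(len(seen))
    return sizes


def heaps_estimate(n, k, beta):
    """Predicted vocabulary size for a text of n words: V = k * n**beta."""
    return k * n ** beta


if __name__ == "__main__":
    tokens = "a b a c b d".split()
    print(vocabulary_growth(tokens))
    # k and beta below are illustrative placeholders only
    print(heaps_estimate(100, 10, 0.5))
```

Plotting the observed sizes against text length and fitting K and beta would show how quickly new words stop appearing as a collection grows.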

Text Document Properties
- More text:
  - equals less overall entropy
  - More overall predictability
  - At the vocabulary level (Zipf)
  - At the document level (Heaps)
- Users will be searching over similar texts a lot in a document set.
- Documents have similarity
  - Measured by a distance function
  - Edit distance is the number of transforms to make things equal (entropy)
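
Edit distance can be computed with dynamic programming. The sketch below assumes the Levenshtein variant, where the allowed transforms are single-character insertion, deletion and substitution, each with cost 1. The slide does not say which variant is meant, so treat that choice as an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

The same function works on lists of words instead of characters if whole-word edits are a better fit for comparing documents.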

Markup Languages
- Additional structure applied to text
- Formats for presentation or content description
- SGML
  - DTD
  - HTML
- XML
  - MathML
  - SMIL
  - RDF
- Prescribed by authors with tools
- Automated for higher machine readability

Trends for Text
- More Markup languages (finer details)
- Automated markup & conversion
  - Based on “Laws”
  - From CMS
- Semantic Web text representation
- Multi-lingual text representation
  - Global measures
- Laws of your language use and search term preferences

Text Operations (Chapter 7)
- Document Preprocessing
- Document Clustering
- Text Compression
- Comparing Text Compression Techniques
- Trends

Document Preprocessing
1. Lexical Analysis
   - Characters, digits, punctuation
   - Sentence, paragraphs
2. Stopword filtering
   - Eliminate redundant words & phrases
   - Reduce entropy
3. Word Stemming
   - Prefix, suffix, variations
4. Index term selection
   - Syntax, frequency, structure (markup)
5. Term category structures
   - Thesaurus, estimated queries, metadata use
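
The first four preprocessing steps can be sketched as a small pipeline. Everything here is a simplified assumption: the stopword list is a tiny hand-picked sample, `naive_stem` strips only a few English suffixes and is not a real stemming algorithm such as Porter's, and index terms are chosen by raw frequency alone. All function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real systems use much longer ones
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}
# Suffixes tried in order by the toy stemmer below
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Step 1, lexical analysis: lowercase and keep runs of letters or digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Step 2, stopword filtering."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Step 3, toy suffix stripping that keeps at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text, top_k=5):
    """Step 4, index term selection: the top_k most frequent stems."""
    stems = [naive_stem(t) for t in remove_stopwords(tokenize(text))]
    return Counter(stems).most_common(top_k)


if __name__ == "__main__":
    doc = "The indexing of documents: indexed documents are searched."
    print(index_terms(doc, top_k=2))
```

Step 5 (term category structures such as a thesaurus) is left out because it depends on external vocabularies rather than on the text alone.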

Document Clustering
- Grouping together similar or related documents in classes (p. 173)
- Global – with whole collection
  - Collections on the Web?
  - Sites? Domains? Versions?
- Local – in context of the query
  - Multiple queries?
  - Many contexts?
  - Links?

Text Compression
- Is compression always good?
  - Less space may mean less functionality.
  - Open standards for compression, immediately (machine) recognizable
- Taking advantage of document preprocessing and the Laws to reduce size (fewer bytes)
- Random access to text is difficult enough, compressed text more so

Compression Methods
- Statistical Methods
  - Probability (chain of codings)
  - Words and NLP
- Dictionary Methods
  - Symbols and substitution
  - Inverted File Compression
- Vocabulary Lists with pointers
- Which is best?
  - Speed of compression & compressed size
  - Memory, access & pattern match
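
As one concrete statistical method, the sketch below builds Huffman code lengths from symbol frequencies. The slides do not name a specific coding scheme, so Huffman coding is an assumption chosen for illustration, and the function names are hypothetical. Symbols may be characters or whole words.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Map each distinct symbol to the length in bits of its Huffman code."""
    counts = Counter(symbols)
    if not counts:
        return {}
    if len(counts) == 1:
        # A lone symbol still needs one bit per occurrence
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth in the tree})
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def compressed_bits(symbols):
    """Total bits needed to encode the sequence with those code lengths."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)


if __name__ == "__main__":
    words = "the cat and the dog and the bird".split()
    print(huffman_code_lengths(words))
    print(compressed_bits(words), "bits for", len(words), "words")
```

Frequent symbols get shorter codes, which is where the skewed word distribution described by Zipf's law helps. The sketch only measures code lengths; it does not emit bit strings or store the code table that a decoder would need.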

Compression is new again
- Web is making all these issues important again
- Distributed indexing
- Meta tags
- Multiple authors
- Versioning

Multimedia IR (Chapters 11 & 12)
- Data Modeling
- Query Languages
- XML & SQL
- Indexing
  - Text track
- Feature Extraction
  - Keystone frames, transitions
- Speed, dynamic identification
  - Machine Learning
  - Feature Extraction (ownership, subject)

Finalize Topic Discussions
- Leading WIRED Topic Discussions
  - Week 6, 8 (1)
- About 20 minutes reviewing issues from the week’s readings
- Key ideas from the readings
- Questions you have about the readings
- Concepts from readings to expand on
  - PowerPoint slides
  - Handouts
  - Extra readings (at least a few days before class) – send to wired listserv

Web Information Retrieval System Evaluation
- 5 page written evaluation of a Web IR System
- technology overview (how it works)
- a brief overview of the development of this type of system (why it works better)
- intended uses for the system (who, when, why)
- (your) examples or case studies of the system in use and its overall effectiveness

Projects and/or Papers Overview
- How can (Web) IR be better?
  - Better IR models
  - Better User Interfaces
- More to find vs. easier to find
- Scriptable applications
- New interfaces for applications
- New datasets for applications