1 Information retrieval
2019/2020

2 about
name.surname@stuba.sk | fiit.stuba.sk/~kompan | 4.30
Jakub Mačina, Michal Kompan, Peter Gaspar

3 course overview
Part I – Introduction to Information Retrieval: Boolean IR and document indexing, vector space model, index compression, evaluation
Part II – Machine learning in Information Retrieval: classification, clustering, pattern mining
Part III – Personalized Recommendations in Information Retrieval: user modeling, collaborative recommendation, content-based recommendation

4 course overview
Class lab – consultations (min. 6x for both assignments)
Lectures – 2 h/week

5 Requirements

6 CL schedule

7 CL assignment 1

8 CL assignment 1 details
Data, data, data – do NOT use a database (keep the index in memory)
Formats: txt, xml, json, ...
Large volume – index at least 500 MB, ideally GBs; 10k+ "records"
Have the data prepared for the 1st consultation (at least a sample, alternatives, etc.)

9 CL assignment 1 details
Doc – motivation, problem, your contribution, evaluation
Soft – ZIP including source codes + data sample
File names:
Part I: VI_2019_Z1_AISLOGIN.PDF, VI_2019_Z1_AISLOGIN.ZIP
Part II: VI_2019_Z2_AISLOGIN.PDF, VI_2019_Z2_AISLOGIN.ZIP, VI_2019_Z2_AISLOGIN.csv
Submit via AIS – CLASSWORK SUBMISSION section
AISLOGIN – do NOT use the student number

10 CL assignment 2 – TBA
Design, evaluate and document a recommender
Do not use libraries for the recommendation itself
Do use libraries for specific sub-tasks (e.g. matrix factorization)
E-commerce VI Challenge (optional)

11 general
Send emails from "normal" servers (reply)
Send emails from "normal" addresses

12 how to pass
Avoid plagiarism
Earn at least 25p (of 50p) from class lab
You have to work on both tasks
Submit results in the correct form (filename convention)
Earn at least 56p in total
fiit.stuba.sk/~kompan

13 additional material
C. D. Manning, P. Raghavan and H. Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008.
M. Laclavík, M. Šeleng: Vyhľadávanie informácií (Information Retrieval). 2012.
I. H. Witten and E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
F. Ricci, L. Rokach, B. Shapira, P. B. Kantor (Eds.): Recommender Systems Handbook. 1st Edition, 2011.

14 ir

16 history
For more than 5,000 years, man has organized information for later retrieval and searching
For holding the various items, special-purpose buildings called libraries, or bibliothekes, are used
For centuries, indexes have been created manually as sets of categories, with labels associated with each category
The advent of modern computers has allowed the construction of large indexes automatically

21 Fielden, N. (2002)

22 IR goals
The goal of IR is to retrieve all and only the "relevant" documents in a collection for a particular user with a particular information need
Information Retrieval (IR) deals with the representation, storage and organization of unstructured data

23 IR goals
Early goals of the IR area: indexing text, searching for useful documents in a collection
Nowadays, research in IR includes: modeling, Web search, text classification, systems architecture, user interfaces, data visualization, filtering, languages

24 history
1994: 100,000 web pages indexed, WWW Worm
1997: 2 million web pages, WebCrawler (100 million overall)
2000: 1 billion web pages
1994: WWW Worm, 1,500 queries per day
1997: AltaVista, 20 million queries per day
Lycos, AltaVista, Yahoo, ...
Trillions of pages indexed by Google

26 IR objects
Text: Web pages, books, articles, papers, reports, letters, blogs, ...
Conversational: emails, tweets, comments, ...
Graphics & images, presentations
Speech & video
Maps & satellite imagery

27 basic concept
Retrieve vs. Browse
Retrieve: a user seeks information on a topic of their interest; the user first translates the information need into a query, which requires specifying the words that compose the query; we say that the user is searching or querying for information of their interest
Browse: a user has an interest that is either poorly defined or inherently broad; for instance, the user has an interest in car racing and wants to browse documents on Formula 1 and Formula Indy; in this case, we say that the user is browsing or navigating the documents of the collection

28 basic concept: Searching vs. Browsing

30 ir – Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition

31 information
data: 20 | information: 20 °C | knowledge: room temperature
information = data with semantics
IR does not have to understand semantics (but it is essential when it tries to)
Characters → Data → Information → Knowledge → Actions (via Syntax, Semantics, Pragmatics, Reasoning) (Bergman, 2002, Experience Management)

33 ml

34 ml+dm
used in the sense of learning from data
used in the sense of transforming unstructured data
used in the sense of enriching data

35 ml
Supervised learning – training data provided as input; training data: classes for the input documents
Unsupervised learning – no training data is provided; examples: neural network models, clustering
Semi-supervised learning – small training data combined with a larger amount of unlabeled data
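
A minimal sketch contrasting the three settings, assuming scikit-learn is available; the toy documents, labels and model choices below are purely illustrative:

    # Supervised vs. unsupervised vs. semi-supervised learning on a toy document set.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans
    from sklearn.semi_supervised import LabelSpreading

    docs = ["cheap pills buy now", "meeting agenda attached",
            "win money fast", "quarterly report draft"]
    X = TfidfVectorizer().fit_transform(docs)

    # Supervised: a class (1 = spam, 0 = not spam) is given for every training document.
    y = [1, 0, 1, 0]
    clf = LogisticRegression().fit(X, y)

    # Unsupervised: no labels at all, e.g. group the documents into 2 clusters.
    clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)

    # Semi-supervised: a few labels plus unlabeled documents (label -1 means "unknown").
    y_partial = [1, 0, -1, -1]
    semi = LabelSpreading().fit(X.toarray(), y_partial)

    print(clf.predict(X), clusters, semi.transduction_)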

36 ml
classification, clustering, pattern mining
Applications in IR:
Language identification (classes: English vs. French, etc.)
Automatic detection of spam sites
Automatic detection of sexually explicit or adult content
Sentiment detection: is a (e.g. restaurant) review +1 or -1
Topic-specific or vertical search
Machine-learned ranking function in ad hoc retrieval
Entity linking: map named-entity mentions to entities

37 information overload

38 information overload
The state of having too much information to make a decision or remain informed about a topic
IR technologies can assist a user in looking up content if the user knows exactly what they are looking for (too many sources)
One of the most important sources is user-generated content, which has been the key to success for many of today's leading Web 2.0 companies, such as Amazon, eBay and YouTube

39 information overload
How much data is out there?
Hundreds of billions of documents? Approx. 10 KB/doc → several PB
Everything else: email, personal files, proprietary databases, broadcast media, print
Estimated 5 exabytes p.a. (growing at 30%), i.e. about 800 MB p.a. per person

40
Year   Annual Number of Google Searches   Average Searches Per Day
2016   3,293,250,000,000                  9,022,000,000
2015   2,834,650,000,000                  7,766,000,000
2014   2,095,100,000,000                  5,740,000,000
2013   2,161,530,000,000                  5,922,000,000
2012   1,873,910,000,000                  5,134,000,000
2011   1,722,071,000,000                  4,717,000,000
2010   1,324,670,000,000                  3,627,000,000
2009   953,700,000,000                    2,610,000,000
2008   637,200,000,000                    1,745,000,000
2007   438,000,000,000                    1,200,000,000
2000   22,000,000,000                     60,000,000
1998   3,600,000 (*Google's official first year)   9,800

41 rs

42 David Goldberg, David Nichols, Brian M. Oki, and Douglas Terry: Using collaborative filtering to weave an information tapestry. Commun. ACM 35, 12 (December 1992).

43 rs
Based on the "ask a friend" idea
content-based
collaborative filtering
hybrid
context
special features (e.g. images)
group

44 collaborative filtering

45 collaborative filtering
Users vs. items matrix: rows User 1 – User 7, columns A–G (X = known interaction, - = unknown)
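
A minimal sketch of user-based collaborative filtering over such a users × items matrix, using cosine similarity between users; the interaction values and item names below are illustrative, not taken from the slide:

    # User-based collaborative filtering: recommend items liked by the most similar users.
    import numpy as np

    items = ["A", "B", "C", "D", "E"]
    # Rows = users, columns = items; 1 = known interaction, 0 = unknown.
    R = np.array([
        [1, 0, 1, 1, 0],   # User 1
        [1, 1, 0, 1, 0],   # User 2
        [0, 0, 1, 0, 1],   # User 3
    ])

    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return (u @ v) / denom if denom else 0.0

    target = 0  # recommend for User 1
    sims = np.array([cosine(R[target], R[u]) if u != target else 0.0
                     for u in range(R.shape[0])])

    # Score unseen items by a similarity-weighted sum of the other users' interactions.
    scores = sims @ R
    scores[R[target] == 1] = -np.inf      # do not re-recommend already seen items
    print(items[int(np.argmax(scores))])  # best candidate item for User 1 -> 'B'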

46 rs vs. ir
IR deals with large repositories of unstructured content about a large variety of topics
RS focuses on smaller content repositories on a single topic
Personalization in IR (personalized search engines) is now receiving more and more interest
IR and RS support different stages of the information search/discovery process (nowadays more and more overlapping)

47 rs vs. ir
IR: static content base, dynamic information need, invest time in indexing content, queries presented in "real time"
RS: reverse assumptions from IR – static information need, dynamic content base, invest effort in modeling the user need (hand-created "profile" vs. machine-learned profile)

51 rs – Is Seeing Believing? How Recommender Interfaces Affect Users' Opinions

52 evaluation and testing
The most important part
Offline/online experiment design
Evaluation metrics

56 example

57 example
What does it take to build a basic search engine for Boolean retrieval?
Queries are Boolean expressions, e.g., 'Caesar AND Brutus' (predicates for term occurrence)
The search engine returns all documents that satisfy the Boolean expression
Conceptually simple and easy to understand
Basic operation: identify the set of documents containing a certain term

58 example
Which plays of Shakespeare contain the words BRUTUS and CAESAR, but NOT CALPURNIA?
One could grep all of Shakespeare's plays for BRUTUS and CAESAR, then strip out lines containing CALPURNIA
Why is grep not the solution?
Slow (for large collections)
"NOT CALPURNIA" is non-trivial
Other operations (e.g., find the word CAR near SHOP) not feasible
No ranked retrieval (best documents to return)

59 term-document matrix
Term-document incidence matrix (1 = the term occurs in the play, 0 = it does not):
            Anthony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Anthony               1                  1             0          0       0        1
Brutus                1                  1             0          1       0        0
Caesar                1                  1             0          1       1        1
Calpurnia             0                  1             0          0       0        0
Cleopatra             1                  0             0          0       0        0
Mercy                 1                  0             1          1       1        1
Worser                1                  0             1          1       1        0

60 incidence vectors
We have a 0/1 vector for each term (characteristic functions for the term occurrence predicates)
To answer the query BRUTUS AND CAESAR AND NOT CALPURNIA:
Take the vectors for BRUTUS, CAESAR and CALPURNIA
Complement the vector of CALPURNIA
Do a (bitwise) AND on the three vectors: 110100 AND 110111 AND 101111 = 100100
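
A minimal sketch of this query over the 0/1 incidence vectors in Python; the document list and the three vectors follow the example above:

    # Boolean retrieval over term incidence vectors: BRUTUS AND CAESAR AND NOT CALPURNIA.
    docs = ["Anthony and Cleopatra", "Julius Caesar", "The Tempest",
            "Hamlet", "Othello", "Macbeth"]

    incidence = {
        "brutus":    [1, 1, 0, 1, 0, 0],
        "caesar":    [1, 1, 0, 1, 1, 1],
        "calpurnia": [0, 1, 0, 0, 0, 0],
    }

    def and_not(a, b, c):
        """a AND b AND NOT c, computed position-wise over 0/1 vectors."""
        return [x & y & (1 - z) for x, y, z in zip(a, b, c)]

    hits = and_not(incidence["brutus"], incidence["caesar"], incidence["calpurnia"])
    print([d for d, bit in zip(docs, hits) if bit])  # ['Anthony and Cleopatra', 'Hamlet']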

61 term-document matrix (the same incidence matrix as in slide 59, shown again)

62 bigger collections
Previous example: N = 6 documents, 7 terms
Consider N = 10^6 documents, each with about 1,000 tokens (before language processing)
On average 6 bytes per token, including spaces and punctuation ⇒ size is about 6 GB
Assume there are M = 500,000 distinct terms in the collection

63 bigger collections
A 500,000 × 10^6 matrix has half a trillion 0's and 1's
But it has no more than one billion 1's (1,000 × 10^6: 1,000 terms per doc × the number of docs in the collection)
The matrix is extremely sparse

64 inverted index
We need variable-size postings lists
On disk, a continuous run of postings is normal and best
In memory, we can use linked lists or variable-length arrays
Dictionary → Postings:
Brutus → 1, 2, 4, 11, 31, 45, 173, 174
Caesar → 1, 2, 4, 5, 6, 16, 57, 132
Calpurnia → 2, 31, 54, 101
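
A minimal sketch of the AND operation on two sorted postings lists (the classic two-pointer merge, linear in the combined list length), using the example postings above:

    # Intersect two sorted postings lists; runs in O(len(p1) + len(p2)).
    def intersect(p1, p2):
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    brutus = [1, 2, 4, 11, 31, 45, 173, 174]
    caesar = [1, 2, 4, 5, 6, 16, 57, 132]
    print(intersect(brutus, caesar))  # [1, 2, 4]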

65 inverted index construction
Documents to be indexed: "Friends, Romans, countrymen."
Tokenizer → token stream: Friends, Romans, Countrymen
Linguistic modules → modified tokens: friend, roman, countryman
Indexer → inverted index: friend, roman, countryman, each with its postings list
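
A minimal sketch of this pipeline in Python, with a crude regex tokenizer and lowercasing standing in for the linguistic modules (no stemming); the example documents are illustrative:

    # Build a tiny inverted index: tokenize, normalize, then map each term to a sorted postings list.
    import re
    from collections import defaultdict

    docs = {
        1: "Friends, Romans, countrymen, lend me your ears.",
        2: "So let it be with Caesar. The noble Brutus ...",
    }

    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())  # crude tokenization + case folding

    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)

    # Sort each postings list so AND queries can use an efficient linear merge.
    inverted_index = {term: sorted(ids) for term, ids in postings.items()}
    print(inverted_index["caesar"])  # [2]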

66 speed speed speed
The most important part
60–120 s to decide (Neil Hunt: Quantifying the Value of Better Recommendations)

68 sources
VI – M. Laclavík, FIIT STU, SAV BA
IR – T. Hofmann, ETH Zurich
PIR – R. Larson, UC Berkeley
SI ISR – F. Ricci, FU Bozen-Bolzano

