Search Engine Technology (1)


Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu

SET FALL 2013 … Introduction

Examples of search engines Conventional (library catalog). Search by keyword, title, author, etc. Text-based (Lexis-Nexis, Google, Yahoo!). Search by keywords. Limited search using queries in natural language. Multimedia (QBIC, WebSeek, SaFe) Search by visual appearance (shapes, colors,… ). Question answering systems (Ask, NSIR, Answerbus) Search in (restricted) natural language Clustering systems (Vivísimo, Clusty) Research systems (Lemur, Nutch)

What does it take to build a search engine? Decide what to index Collect it Index it (efficiently) Keep the index up to date Provide user-friendly query facilities

What else? Understand the structure of the web for efficient crawling Understand user information needs Preprocess text and other unstructured data Cluster data Classify data Evaluate performance

Goals of the course Understand how search engines work Understand the limits of existing search technology Learn to appreciate the sheer size of the Web Learn to write code for text indexing and retrieval Learn about the state of the art in IR research Learn to analyze textual and semi-structured data sets Learn to appreciate the diversity of texts on the Web Learn to evaluate information retrieval Learn about standardized document collections Learn about text similarity measures Learn about semantic dimensionality reduction Learn about the idiosyncrasies of hyperlinked document collections Learn about web crawling Learn to use existing software Understand the dynamics of the Web by building appropriate mathematical models Build working systems that assist users in finding useful information on the Web

Course logistics Wednesdays 6:10-7:55 in 410 IAB Dates: Sep 4, 11, 18, 25 Oct 2, 9, 16, 23, 30 Nov 6, 13, 20, 27 Dec 4 + final in mid-December, date TBA URL: http://www1.cs.columbia.edu/~cs6998/ Instructor: Dragomir Radev Email: radev@cs.columbia.edu Office hours: TBA TAs: Amit Ruparel and Ashlesha Shirbhate {ar3202, ass2167}@columbia.edu set_ta@lists.cs.columbia.edu

Course outline Classic document retrieval: storing, indexing, retrieval Web retrieval: crawling, query processing. Text and web mining: classification, clustering Network analysis: random graph models, centrality, diameter and clustering coefficient

Syllabus Introduction. Queries and Documents. Models of Information retrieval. The Boolean model. The Vector model. Document preprocessing. Tokenization. Stemming. The Porter algorithm. Storing, indexing and searching text. Inverted indexes. Word distributions. The Zipf distribution. The Benford distribution. Heaps' law. TF*IDF. Vector space similarity and ranking. Retrieval evaluation. Precision and Recall. F-measure. Reference collections. The TREC conferences. Automated indexing/labeling. Compression and coding. Optimal codes. String matching. Approximate matching. Query expansion. Relevance feedback. Text classification. Naive Bayes. Feature selection. Decision trees.

Syllabus Linear classifiers. k-nearest neighbors. Perceptron. Kernel methods. Maximum-margin classifiers. Support vector machines. Semi-supervised learning. Lexical semantics and Wordnet. Latent semantic indexing. Singular value decomposition. Vector space clustering. k-means clustering. EM clustering. Random graph models. Properties of random graphs: clustering coefficient, betweenness, diameter, giant connected component, degree distribution. Social network analysis. Small worlds and scale-free networks. Power law distributions. Centrality. Graph-based methods. Harmonic functions. Random walks. PageRank. Hubs and authorities. Bipartite graphs. HITS. Models of the Web.

Syllabus Crawling the web. Webometrics. Measuring the size of the web. The Bow-tie-method. Hypertext retrieval. Web-based IR. Document closures. Focused crawling. Question answering. Burstiness. Self-triggerability. Information extraction. Adversarial IR. Human behavior on the web. Text summarization. POSSIBLE TOPICS: Discovering communities, spectral clustering. Semi-supervised retrieval. Natural language processing. XML retrieval. Text tiling. Human behavior on the web.

Readings Required: Introduction to Information Retrieval by Manning, Schuetze, and Raghavan (http://nlp.stanford.edu/IR-book/information-retrieval-book.html), freely available, hard copy for sale. Optional: Modeling the Internet and the Web: Probabilistic Methods and Algorithms by Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003, ISBN: 0-470-84906-1 (http://ibook.ics.uci.edu). Papers from SIGIR, WWW and journals (to be announced in class).

Prerequisites Linear algebra: vectors and matrices. Calculus: Finding extrema of functions. Probabilities: random variables, discrete and continuous distributions, Bayes theorem. Programming: experience with at least one web-aware programming language such as Perl (highly recommended) or Java in a UNIX environment. Required CS account

Course requirements Three assignments (30%) Some of them will be in Perl. The rest can be done in any appropriate language (e.g. Python or Java). All will involve some data analysis and evaluation. Final project (30%) Research paper or software system. Class participation (10%) Final exam (30%)

Final project format Research paper - using the SIGIR format. Students will be in charge of problem formulation, literature survey, hypothesis formulation, experimental design, implementation, and possibly submission to a conference like SIGIR or WWW. Software system - develop a working system or API. Students will be responsible for identifying a niche problem, implementing it and deploying it, either on the Web or as an open-source downloadable tool. The system can be either stand alone or an extension to an existing one.

Active research projects Scientific paper analysis, bibliometrics Citation analysis Question answering Social media Political debates Blogs and rumors IR for the humanities Health IR Collective intelligence Sentiment analysis and word polarity Cartoons Social networks

More project ideas Shingling Build a language identification system. Participate in the Netflix challenge. Query log analysis. Build models of Web evolution. Information diffusion in blogs or web. Author-topic models of web pages. Using the web for machine translation. News recommendation system. Compress the text of Wikipedia (losslessly). Spelling correction using query logs. Automatic query expansion.

List of projects from the past Document Closures for Indexing Tibet - Table Structure Recognition Library Ruby Blog Memetracker Sentence decomposition for more accurate information retrieval Extracting Social Networks from LiveJournal Google Suggest Programming Project (Java Swing Client and Lucene Back-End) Leveraging Social Networks for Organizing and Browsing Shared Photographs Media Bias and the Political Blogosphere Measuring Similarity between search queries Extracting Social Networks and Information about the people within them from Text LSI + dependency trees

Available corpora Netflix challenge AOL query logs Blogs Bio papers AAN Email Generifs Web pages Political science corpus VAST del.icio.us SMS News data: aquaint, tdt, nantc, reuters, setimes, trec, tipster Europarl multilingual US congressional data DMOZ Pubmedcentral DUC/TAC Timebank Wikipedia wt2g/wt10g/wt100g dotgov RTE Paraphrases GENIA Generifs Hansards IMDB MTA/MTC nie cnnsumm Poliblog Sentiment xml epinions Enron

Related courses elsewhere Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze) Cornell (Jon Kleinberg) CMU (Yiming Yang and Jamie Callan) UMass (James Allan) UTexas (Ray Mooney) Illinois (Chengxiang Zhai) Johns Hopkins (David Yarowsky) UNT (Rada Mihalcea)

The size of the World Wide Web The size of the indexed world wide web pages (By Sep.4, 2012) Indexed by Google: about 40 billion pages Indexed by Bing: about 16.5 billion pages Indexed by Yahoo: about 4.8 billion pages http://www.worldwidewebsize.com/

Twitter hits 400 million tweets per day (June 2012, Dick Costolo, CEO at Twitter) Over 2.5 billion photos uploaded to Facebook each month (2010, blog.facebook.com) Google’s clusters process a total of more than 20 petabytes of data per day (2008, Jeffrey Dean from Google)

55 Million WordPress Sites in the World WordPress.com users produce about 500,000 new posts and 400,000 new comments on an average day http://en.wordpress.com/stats/

Dynamically generated content New pages get added all the time The size of the blogosphere doubles every 6 months Yahoo deals with 12TB of data per day (according to Ron Brachman)

SET FALL 2013 … 2. Models of Information retrieval The Vector model The Boolean model

Sample queries (from Excite) In what year did baseball become an offical sport? play station codes . com birth control and depression government "WorkAbility I"+conference kitchen appliances where can I find a chines rosewood tiger electronics 58 Plymouth Fury How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? emeril Lagasse Hubble M.S Subalaksmi running

Fun things to do with search engines Googlewhack Reduce document set size to 1 Find query that will bring given URL in the top 10

Key Terms Used in IR QUERY: a representation of what the user is looking for - can be a list of words or a phrase. DOCUMENT: an information entity that the user wants to retrieve COLLECTION: a set of documents INDEX: a representation of information that makes querying easier TERM: word or concept that appears in a document or a query

Mappings and abstractions Reality Data Information need Query From Robert Korfhage’s book

Documents Not just printed paper Can be records, pages, sites, images, people, movies Document encoding (Unicode) Document representation Document preprocessing (e.g., removing metadata) Words, terms, types, tokens

Sample query sessions (from AOL) toley spies grames tolley spies games totally spies games tajmahal restaurant brooklyn ny taj mahal restaurant brooklyn ny taj mahal restaurant brooklyn ny 11209 do you love me like you say do you love me like you say lyrics do you love me like you say lyrics marvin gaye

Characteristics of user queries Sessions: users revisit their queries. Very short queries: typically 2 words long. A large number of typos. A small number of popular queries. A long tail of infrequent ones. Almost no use of advanced query operators with the exception of double quotes

Queries as documents Advantages: Mathematically easier to manage Problems: Different lengths Syntactic differences Repetitions of words (or lack thereof)

Document representations Term-document matrix (m x n) Document-document matrix (n x n) Typical example in a medium-sized collection: 3,000,000 documents (n) with 50,000 terms (m) Typical example on the Web: n=30,000,000,000, m=1,000,000 Boolean vs. integer-valued matrices

Storage issues Imagine a medium-sized collection with n=3,000,000 and m=50,000 How large a term-document matrix will be needed? Is there any way to do better? Any heuristic?
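A back-of-the-envelope calculation may help with the storage question. The sketch below takes n and m from the slide; the 4-byte cell width, the average of 200 distinct terms per document, and the 4-byte posting size are illustrative assumptions, not figures from the course.

```python
n_docs = 3_000_000    # n, from the slide
n_terms = 50_000      # m, from the slide

cells = n_docs * n_terms              # entries in the m x n term-document matrix
dense_bool_bytes = cells // 8         # Boolean matrix at 1 bit per cell
dense_int_bytes = cells * 4           # integer-valued matrix, assuming 4 bytes per cell

# Sparse alternative: store only the non-zero entries (one posting each).
avg_distinct_terms_per_doc = 200      # hypothetical value, not from the slides
bytes_per_posting = 4                 # hypothetical: one 32-bit document id
sparse_bytes = n_docs * avg_distinct_terms_per_doc * bytes_per_posting

GB = 10**9
print(f"cells:            {cells:.2e}")
print(f"dense, Boolean:   {dense_bool_bytes / GB:.2f} GB")
print(f"dense, integer:   {dense_int_bytes / GB:.2f} GB")
print(f"sparse postings:  {sparse_bytes / GB:.2f} GB")
```

Under these assumptions almost every cell of the dense matrix is zero, which is the motivation for storing postings instead of the full matrix.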

Tokenizing text (CNN) -- A tropical storm has strengthened into Hurricane Leslie in the Atlantic Ocean, forecasters said Wednesday. The slow-moving storm could affect Bermuda this weekend, according to the National Hurricane Center in Miami. The Category 1 hurricane was churning Wednesday afternoon about 465 miles (750 kilometers) south-southeast of the British territory and moving north at 2 mph (4 kph), the hurricane center said. http://www.cnn.com/2012/09/05/world/americas/bermuda-hurricane-leslie/index.html
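As a rough illustration of tokenizing this passage, here is a deliberately naive tokenizer applied to its first two sentences. The regular expression (lowercase everything, keep hyphenated words together) is an assumption made for the sketch; a real tokenizer has to make many more decisions.

```python
import re

text = ("A tropical storm has strengthened into Hurricane Leslie in the "
        "Atlantic Ocean, forecasters said Wednesday. The slow-moving storm "
        "could affect Bermuda this weekend, according to the National "
        "Hurricane Center in Miami.")

def tokenize(s):
    # Lowercase, then keep runs of letters/digits; internal hyphens stay attached.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", s.lower())

tokens = tokenize(text)   # every occurrence counts
types = set(tokens)       # distinct word forms only

print(len(tokens), "tokens,", len(types), "types")
print(tokens[:8])
```

The gap between the number of tokens and the number of types comes from repeated words such as "storm" and "hurricane".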

Inverted index Instead of an incidence vector, use a posting table CLEVELAND: D1, D2, D6 OHIO: D1, D5, D6, D7 Use linked lists to be able to insert new document postings in order and to remove existing postings. Can be used to compute document frequency Keep everything sorted! This gives you a logarithmic improvement in access.
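A minimal sketch of such a posting table follows. It uses sorted Python lists with binary search rather than the linked lists mentioned on the slide, and the class name, method names, and toy documents are all hypothetical; the documents were chosen only so that the postings match the CLEVELAND and OHIO example.

```python
from bisect import bisect_left
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(list)   # term -> sorted list of doc ids

    def add(self, doc_id, terms):
        # Insert doc_id into each term's posting list, keeping it sorted.
        for term in set(terms):
            plist = self.postings[term]
            i = bisect_left(plist, doc_id)
            if i == len(plist) or plist[i] != doc_id:
                plist.insert(i, doc_id)

    def remove(self, doc_id):
        # Delete doc_id from every posting list that contains it.
        for plist in self.postings.values():
            i = bisect_left(plist, doc_id)
            if i < len(plist) and plist[i] == doc_id:
                plist.pop(i)

    def contains(self, term, doc_id):
        # Binary search: logarithmic in the length of the posting list.
        plist = self.postings.get(term, [])
        i = bisect_left(plist, doc_id)
        return i < len(plist) and plist[i] == doc_id

    def document_frequency(self, term):
        return len(self.postings.get(term, []))

# Toy collection (hypothetical), added out of order on purpose.
index = InvertedIndex()
for doc_id, terms in [(6, ["cleveland", "ohio"]), (1, ["cleveland", "ohio"]),
                      (7, ["ohio"]), (2, ["cleveland"]), (5, ["ohio"])]:
    index.add(doc_id, terms)

print(index.postings["cleveland"], index.document_frequency("ohio"))
```

Keeping each list sorted is what allows both the binary-search lookup here and the linear-time merges used for Boolean operations.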

Basic operations on inverted indexes Conjunction (AND) – iterative merge of the two postings: O(x+y) Disjunction (OR) – very similar Negation (NOT) – can we still do it in O(x+y)? Example: MICHIGAN AND NOT OHIO Example: MICHIGAN OR NOT OHIO Recursive operations Optimization: start with the smallest sets
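The merges above can be sketched as follows, assuming each posting list is a sorted Python list of document ids. The function names are hypothetical, and the CLEVELAND and OHIO postings stand in for MICHIGAN, whose postings are not given. Each function walks both lists once, so it runs in O(x+y).

```python
def intersect(a, b):
    # A AND B: advance whichever pointer has the smaller id.
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    # A OR B: same walk, but emit ids from either list.
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def and_not(a, b):
    # A AND NOT B: keep ids of a that do not occur in b.
    j = 0
    out = []
    for x in a:
        while j < len(b) and b[j] < x:
            j += 1
        if j == len(b) or b[j] != x:
            out.append(x)
    return out

cleveland = [1, 2, 6]
ohio = [1, 5, 6, 7]
all_docs = list(range(1, 8))   # assumption: the collection is D1..D7

print(intersect(cleveland, ohio))                 # CLEVELAND AND OHIO
print(and_not(cleveland, ohio))                   # CLEVELAND AND NOT OHIO
print(union(cleveland, and_not(all_docs, ohio)))  # CLEVELAND OR NOT OHIO
```

Note the asymmetry: A AND NOT B only needs the two posting lists, while A OR NOT B needs the complement of B and therefore depends on the size of the whole collection.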

Major IR models Boolean Vector Probabilistic Language modeling Fuzzy retrieval Latent semantic indexing

The Boolean model Venn diagrams [Figure: Venn diagram with labels x, y, z, w, D1, D2]

Boolean queries Operators: AND, OR, NOT, parentheses Example: CLEVELAND AND NOT OHIO (MICHIGAN AND INDIANA) OR (TEXAS AND OKLAHOMA) Ambiguous uses of AND and OR in human language Exclusive vs. inclusive OR Restrictive operator: AND or OR?

Canonical forms of queries De Morgan’s Laws: NOT (A AND B) = (NOT A) OR (NOT B) NOT (A OR B) = (NOT A) AND (NOT B) Normal forms Conjunctive normal form (CNF) Disjunctive normal form (DNF) Some people swear by CNF - why?

Evaluating Boolean queries Incidence vectors: CLEVELAND: 1100010 OHIO: 1000111 Examples: CLEVELAND AND OHIO CLEVELAND AND NOT OHIO CLEVELAND OR OHIO
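The three example queries can be evaluated with bitwise operations on the incidence vectors. This sketch assumes that the leftmost bit stands for D1 and that the collection has seven documents; the helper names are hypothetical.

```python
N_DOCS = 7
MASK = (1 << N_DOCS) - 1          # keeps NOT within the 7-document collection

cleveland = int("1100010", 2)
ohio = int("1000111", 2)

def bits(v):
    # Render an incidence vector as a fixed-width bit string.
    return format(v, f"0{N_DOCS}b")

def docs(v):
    # List the documents whose bit is set (leftmost bit = D1, by assumption).
    return [f"D{i + 1}" for i, b in enumerate(bits(v)) if b == "1"]

both = cleveland & ohio               # CLEVELAND AND OHIO
only_cleveland = cleveland & ~ohio & MASK   # CLEVELAND AND NOT OHIO
either = cleveland | ohio             # CLEVELAND OR OHIO

for name, v in [("AND", both), ("AND NOT", only_cleveland), ("OR", either)]:
    print(name, bits(v), docs(v))
```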

Exercise D1 = “computer information retrieval” D2 = “computer retrieval” D3 = “information” D4 = “computer information” Q1 = “information AND retrieval” Q2 = “information AND NOT computer”
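One way to check an answer to this exercise is to treat each term as the set of documents containing it and map AND to intersection and AND NOT to set difference. The helper name `having` is hypothetical.

```python
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

def having(term):
    # Set of documents whose text contains the term as a whole word.
    return {d for d, text in docs.items() if term in text.split()}

q1 = having("information") & having("retrieval")   # information AND retrieval
q2 = having("information") - having("computer")    # information AND NOT computer

print(sorted(q1), sorted(q2))
```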

Exercise [Figure: Venn diagram of four sets labeled Swift, Shakespeare, Milton, Chaucer, with regions numbered 1-15] ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))

SET FALL 2013 … 3. Document preprocessing. Tokenization. Stemming. The Porter algorithm. Storing, indexing and searching text. Inverted indexes.

Document preprocessing Dealing with formatting and encoding issues Hyphenation, accents, stemming, capitalization Tokenization: USA vs. U.S.A. – equivalence class Paul’s, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc, can’t Example: “The New York-Los Angeles flight” Hewlett-Packard numbers, e.g., (888) 555-1313, 1-888-555-1313 dates, e.g., Jan-13-2012, 20120113, 13 January 2012, 01/13/12 MIT, mit (in German)?

Non-English languages http://www.kyodo.co.jp/entame/showbiz/2012-09-04_64060/ ストロベリーナイト http://ja.wikipedia.org/wiki/%E3%82%B9%E3%83%88%E3%83%AD%E3%83%99%E3%83%AA%E3%83%BC%E3%83%8A%E3%82%A4%E3%83%88 http://it.wikipedia.org/wiki/Dorama http://it.wikipedia.org/wiki/Strawberry_Night テレビドラマ

Non-English languages Arabic: كتاب (http://en.wiktionary.org/wiki/%D9%83%D8%AA%D8%A7%D8%A8) Japanese: この本は重い。 (kono hon ha omoi) German: Lebensversicherungsgesellschaftsangestellter Chinese: 和 尚 (shàng hé) (http://www.mandarintools.com/worddict.html)

Hiragana, Katakana, Romaji, Kanji http://tastymiso.com/japanese-alphabets/113

Document preprocessing Normalization: Casing (cat vs. CAT), the Fed Stemming (computer, computation) Soundex Accent removal – cote in French Labeled/labelled, extraterrestrial/extra-terrestrial/extra terrestrial, Qaddafi/Kadhafi/Ghadaffi Index reduction Dropping stop words (“and”, “of”, “to”) Problematic for “to be or not to be”
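A small sketch of three of these steps (case folding, accent removal, stop-word dropping) is shown below. The function names are hypothetical, the stop list contains only the three words named on the slide, and "côté" is used as one possible accented spelling behind the "cote" example.

```python
import unicodedata

STOP_WORDS = {"and", "of", "to"}   # only the three examples from the slide

def strip_accents(s):
    # Decompose characters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def normalize(text, drop_stop_words=True):
    tokens = strip_accents(text).lower().split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(normalize("CAT"), normalize("cat"))   # casing collapses to one form
print(strip_accents("côté"))                # accents removed
print(normalize("to be or not to be"))      # stop words dropped
```

Even this three-word stop list removes a third of "to be or not to be"; a typical longer stop list would remove every word of the phrase, which is the problem the slide points at.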

Porter’s algorithm Example: the word “duplicatable”: duplicat (rule 4), duplicate (rule 1b1), duplic (rule 3). Another rule in step 4, removing “ic,” cannot be applied, since only one rule from each step is allowed to be applied. More examples: SSES → SS: caresses → caress; IES → I: ponies → poni; SS → SS: caress → caress; S → [blank]: cats → cat
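The four plural-suffix rules listed under "More examples" can be sketched as a single function. This covers only those four rules, not the full Porter algorithm, and the function name is hypothetical. The rules are tried from longest suffix to shortest, and only the first match is applied.

```python
def strip_plural_suffix(word):
    # Try the rules from longest suffix to shortest; apply the first match only.
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]      # IES -> I
    if word.endswith("ss"):
        return word           # SS -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]      # S -> (blank)
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", strip_plural_suffix(w))
```

The SS rule looks redundant, but it has to come before the S rule so that "caress" is not cut down to "cares".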

Porter’s algorithm

Links http://maya.cs.depaul.edu/~classes/ds575/porter.html http://www.tartarus.org/~martin/PorterStemmer/def.txt

When does stemming help? Camera, cameras? Electricity, electrical? Operating, operations, operative, operational (systems, research, dentistry, plan)

Approximate string matching The Soundex algorithm (Odell and Russell) Uses: spelling correction hash function non-recoverable

The Soundex algorithm 1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions 2. Assign the following numbers to the remaining letters after the first: b, f, p, v: 1; c, g, j, k, q, s, x, z: 2; d, t: 3; l: 4; m, n: 5; r: 6

The Soundex algorithm 3. If two or more letters with the same code were adjacent in the original name, omit all but the first 4. Convert to the form “LDDD” by adding terminal zeros or by dropping rightmost digits Examples: Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300; the same codes as Ellery, Ghosh, Heilbronn, Kant, and Ladd, respectively. Some problems: Rogers and Rodgers, Sinclair and StClair
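The four Soundex steps can be sketched in a few lines. This follows the steps as worded on the slides, so "adjacent" is checked against the original name before any letters are dropped; other Soundex variants treat h and w differently. The function and table names are hypothetical.

```python
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0], "")   # the first letter's code counts for adjacency
    for ch in letters[1:]:
        code = CODES.get(ch, "")       # a, e, h, i, o, u, w, y have no code
        if code and code != prev:      # skip repeats adjacent in the original name
            digits.append(code)
        prev = code
    # Letter plus three digits: pad with zeros, or drop rightmost digits.
    return (letters[0].upper() + "".join(digits) + "000")[:4]

for a, b in [("Euler", "Ellery"), ("Gauss", "Ghosh"), ("Hilbert", "Heilbronn"),
             ("Knuth", "Kant"), ("Lloyd", "Ladd"),
             ("Rogers", "Rodgers"), ("Sinclair", "StClair")]:
    print(a, soundex(a), b, soundex(b))
```

The last two pairs show the problem cases: Rogers and Rodgers, and Sinclair and StClair, sound alike but receive different codes.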

Readings MRS1, MRS2, MRS3 MRS5 (Zipf), MRS6 MRS7, MRS8