
1 Internet Information Retrieval Sun Wu

2 Course Goal
To learn the basic concepts and techniques of internet search engines
–How to use and evaluate search engines
–How they work
–How to design and develop a large-scale internet search engine
–Research issues in internet search engines

3 Outline
Basic Introduction
Data Crawling
Data Preprocessing and Mining
Index and Search Kernel (Full Text Retrieval System)
User Interface and Query Processing
Service Maintenance and Management

4 Basic Introduction
–Introduction to the WWW and history of search engines
–Search engine architecture
–Classification of search engines and applications
–Evaluation of search engines
–Search engine market and SEO (Search Engine Optimization)

5 Data Crawling
A web search engine needs to crawl web data at large scale (billions of web pages/objects) efficiently!
Techniques and design issues
–Architecture of a distributed crawling system
–Optimization of crawling efficiency
–Focused crawling
–Crawling quality optimization: URL job queue management, selection, filtering, and scheduling. Spam and porn data detection
–Incremental crawling
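
The URL job queue item above can be illustrated with a small sketch. It is not taken from the course material: the class name `UrlFrontier`, the per-host cap, and the filtering rules are all assumptions chosen to show queue management, selection, and filtering in one place.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit


class UrlFrontier:
    """Hypothetical FIFO URL job queue with duplicate filtering and a per-host cap."""

    def __init__(self, max_per_host=2):
        self.queue = deque()      # URLs waiting to be fetched
        self.seen = set()         # URLs already accepted once
        self.per_host = {}        # host -> number of accepted URLs
        self.max_per_host = max_per_host

    def add(self, url):
        """Return True if the URL was queued, False if it was filtered out."""
        url, _fragment = urldefrag(url)          # "#top" does not change the page
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            return False                         # selection: web pages only
        if url in self.seen:
            return False                         # filtering: no duplicates
        host = parts.netloc.lower()
        if self.per_host.get(host, 0) >= self.max_per_host:
            return False                         # crude guard against one host dominating
        self.seen.add(url)
        self.per_host[host] = self.per_host.get(host, 0) + 1
        self.queue.append(url)
        return True

    def next_url(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

A production crawler would likely replace the FIFO order with priority scheduling and add politeness delays per host; this sketch leaves both out.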

6 Data Preprocessing and Mining
Before indexing, many data preprocessing and mining tasks have to be done.
The goals of data preprocessing and mining are to
–Optimize the data quality and transform the data into a form suitable for indexing
–Extract valuable information that is useful for the search engine service
Spam detection and data filtering
–Some spam data cannot be caught in the crawling phase, so it has to be detected after crawling.

7 Data Preprocessing and Mining
Data partition:
–Language partition
–URL partition
–Data type partition
Redundancy removal:
–Cross-site redundancy removal
–In-site redundancy removal
Link analysis to find relationships between web content and assign ranking scores to web sites/pages.
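
The slides do not name a specific link analysis algorithm. As one well-known example, a PageRank-style power iteration can be sketched as follows; the function name, the damping factor of 0.85, and the handling of pages without outgoing links are assumptions for illustration only.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Assign a score to every page from a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start from a uniform distribution

    for _ in range(iterations):
        # Every page gets a small base score, independent of its in-links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its score equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no out-links spreads its score evenly.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank
```

For example, `pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]})` gives the highest score to `c`, which is linked from three pages, and the lowest to `d`, which nothing links to.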

8 Index and Search Kernel
A full text retrieval system is needed. Hashing and inverted indexes are the basic techniques.
Inverted index architecture and techniques
Index performance optimization
Search kernel query processing optimization
Ranking techniques
Distributed indexing and searching
–Horizontal partition
–Vertical partition
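
To make the inverted index idea concrete, here is a minimal in-memory sketch. The tokenizer, the function names, and the AND-only query semantics are assumptions; a real search kernel would also store positions and frequencies, compress the posting lists, and rank the results.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)   # the dict itself is a hash table keyed by term
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))
```

With `docs = {1: "Web crawling at scale", 2: "Inverted index for web search"}`, the call `search(build_index(docs), "web search")` returns `[2]`, because only document 2 contains both terms.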

9 User Interface and Query Processing
Process the user's query based on the search kernel
Design issues for user friendliness
SERP (Search Engine Result Page) design
Integrated and classified search results
Query session management and interactive search
–Error correction
–Recommended and related searches
–User ranking feedback
Personalization
–Locality tuning
–Personalized search ranking adjustment
–Search result management
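
One simple way to approach the error correction item is to suggest the closest known term by edit distance. The slides do not say which method the course uses; the function names, the vocabulary list, and the distance threshold below are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))               # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                                # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                     # delete ca
                cur[j - 1] + 1,                  # insert cb
                prev[j - 1] + (ca != cb),        # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def suggest(term, vocabulary, max_distance=2):
    """Return the closest vocabulary word, or None if nothing is close enough."""
    if not vocabulary:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda w: (edit_distance(term, w), w))
    return best if edit_distance(term, best) <= max_distance else None
```

Scanning the whole vocabulary is only workable for small term lists. A large engine would more plausibly draw candidates from query logs or an n-gram index before comparing distances.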

10 Service Maintenance and Management
Search engine service management
–Bot detection and service security
–Fault tolerance and load balancing
–Log analysis
–Performance optimization and query caching
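
Query caching can be sketched as a small least-recently-used (LRU) cache placed in front of the search kernel. The class name, the capacity, and the query normalization rule are assumptions, and the eviction policy is one common choice rather than the one the course prescribes.

```python
from collections import OrderedDict


class QueryCache:
    """Hypothetical LRU cache mapping normalized queries to result lists."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.entries = OrderedDict()             # oldest entry first

    @staticmethod
    def _key(query):
        # Treat queries that differ only in case or spacing as the same query.
        return " ".join(query.lower().split())

    def get(self, query):
        """Return cached results, or None on a cache miss."""
        key = self._key(query)
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)            # mark as most recently used
        return self.entries[key]

    def put(self, query, results):
        """Store results and evict the least recently used entry if full."""
        key = self._key(query)
        self.entries[key] = results
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # drop the oldest entry
```

Because a small share of queries tends to account for a large share of traffic, even a modest cache can avoid many calls to the search kernel. Cached entries also need to be invalidated when the index is updated, which this sketch omits.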

11 Prerequisites
Solid background in data structures
Basic web programming experience

12 Textbooks
No textbooks are used.
Students are encouraged to use search engines to find related information, articles, papers, guides, and software on the web.

13 Course Requirements
No examinations
Approximately two project assignments. Each requires a solid written report plus an oral presentation.

