Special Topics in Computer Science The Art of Information Retrieval Chapter 13: Searching the Web Alexander Gelbukh www.Gelbukh.com
2 Previous chapter: Conclusions
- Interface is a key element of the system. If users cannot use it, it does not matter how good the system is.
- Interface design choices are important at every stage of the process
  o Especially to formulate queries
  o Also to present results
  o 3D interfaces to present results
- Also, overall system interface and action tracking
- Difficult to assess quality. Difficult to find new ideas
- Very promising if you find them!
3 Previous chapter: Research topics
- Many ideas throughout the chapter
  o some may be obsolete
- New interface types! 3D interfaces
- Ways of assessing the quality of interfaces
4 Web: challenges (differences)
- Distributed data
- Volatile data: 40% / month
- Very large volume
  o Very large answers
  o 1998: 3,000,000 servers, 350,000,000 pages
  o 2003: Google alone: 3,307,998,701 pages (10 times more)
- Unstructured and redundant data. 30% are duplicates
- Quality of data. 0.5% errors, 30% in foreign names
- Heterogeneous data (languages, alphabets: Chinese)
- Heterogeneous and inexperienced users
5 Search engines
- Difference: full text is not available
  o Now obsolete: Google stores it, some other engines too
- Centralized (logically) architecture
  o there are distributed (physically) architectures
- Crawlers (robots) collect data/index in a central place
- A search engine indexes only a small part (2%? 30%?) of the Web
- Recall is nearly irrelevant for simple queries
- Google: a revolution (the AltaVista of our days)
6 Ranking
- Commercial secret
- Ranking can take into account hypertext
- Google: PageRank algorithm
  o Roughly, # of incoming links (much more complicated)
- Problems: tricks
  o Link exchange
  o Anti-trick measures: detect link exchangers
  o Penalize tricks: repeated keywords, etc.
- Related pages
  o Co-cited or co-citing pages are related
  o Clustering the search results
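The "roughly, # of incoming links" idea can be illustrated with the textbook power-iteration form of PageRank. This is only a minimal sketch, not Google's production ranking (which the slide notes is a commercial secret): the damping factor 0.85, the iteration count, and the toy link graph `toy_web` are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    `damping` and `iterations` are illustrative defaults (assumptions).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" has the most incoming links, "d" has none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Note that page "a" ends up ranked well above "b" although each has a single incoming link: the link to "a" comes from the highly ranked page "c". This is the sense in which the real measure is "much more complicated" than a raw count of incoming links.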
7 Crawling
- Depth-first? Breadth-first? Most popular first?
- How to divide the work between crawlers?
- Index is always obsolete
  o Not equally obsolete (like stars)
  o Depends on crawling policy
  o 2% - 9% of invalid links. Snapshots. PageRank first!
- Robot instruction file (robots.txt) on each server
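The three crawl orders named on this slide differ only in how the frontier of discovered but not yet fetched pages is organized. The sketch below compares them on an in-memory toy graph; it fetches nothing from the network, and `TOY_LINKS`, `TOY_POPULARITY`, and the function name `crawl_order` are hypothetical names introduced for illustration.

```python
import heapq
from collections import deque


def crawl_order(links, seed, policy="breadth", popularity=None):
    """Return the order in which pages reachable from `seed` would be fetched.

    policy: "breadth" (FIFO queue), "depth" (LIFO stack, newest first),
            or "popular" (highest popularity score first).
    """
    popularity = popularity or {}
    counter = 0  # tie-breaker: equal popularity comes out in discovery order
    if policy == "popular":
        frontier = [(-popularity.get(seed, 0), counter, seed)]
    else:
        frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        if policy == "breadth":
            page = frontier.popleft()
        elif policy == "depth":
            page = frontier.pop()
        else:
            page = heapq.heappop(frontier)[2]
        order.append(page)
        for target in links.get(page, []):
            if target in seen:
                continue
            seen.add(target)
            if policy == "popular":
                counter += 1
                heapq.heappush(
                    frontier, (-popularity.get(target, 0), counter, target)
                )
            else:
                frontier.append(target)
    return order


# Hypothetical site structure and popularity scores (e.g. PageRank estimates).
TOY_LINKS = {"home": ["news", "docs"], "news": ["story"], "docs": ["api"]}
TOY_POPULARITY = {"news": 1, "docs": 5, "story": 9}

print(crawl_order(TOY_LINKS, "home", "breadth"))
print(crawl_order(TOY_LINKS, "home", "depth"))
print(crawl_order(TOY_LINKS, "home", "popular", TOY_POPULARITY))
```

A real crawler would additionally check each server's robot instruction file before fetching and would revisit pages over time, which this sketch leaves out.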
8 Metasearchers
- Search using many engines and unify the results
  o How to rank?! Merge rankings?
  o Inquirus: Download each page and analyze it; rank
- Intersection of different major search engines is 1%
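One possible answer to "Merge rankings?" is reciprocal rank fusion, which needs only the positions of results in each engine's list. The slide does not name a merging method, so this technique is swapped in purely as an illustration; the constant `k=60` is a conventional choice, and the engine result lists are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of URLs into one ranking.

    Each list contributes 1 / (k + position) to a URL's score, so URLs
    ranked high by several engines rise to the top. Ties are broken
    alphabetically to keep the output deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for position, url in enumerate(ranking, start=1):
            scores[url] += 1.0 / (k + position)
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical result lists from three engines.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4", "u1"]
engine_c = ["u2", "u3"]

merged = reciprocal_rank_fusion([engine_a, engine_b, engine_c])
print(merged)
```

Unlike the Inquirus approach of downloading and analyzing every page, this kind of merging uses no page content at all, which makes it cheap but blind to the actual relevance of each result.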
9 Other topics
- Indexing, hierarchies, interfaces, user problems (understanding Boolean search, ...) have been covered in previous chapters
- Hyperlink (structured) search
- Fish search: explore neighborhood of a hit on the fly
  o Relevant docs frequently have relevant neighbors
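The fish-search idea can be sketched as a link traversal whose exploration budget is refilled whenever a relevant page is found and shrinks along irrelevant ones. This is a much simplified illustration of the principle, not the published algorithm: the substring relevance test, the `depth` parameter, and the toy pages and links are all assumptions.

```python
from collections import deque


def fish_search(pages, links, start, query, depth=2):
    """Explore the neighborhood of `start`, collecting relevant pages.

    pages: url -> text, links: url -> list of linked urls.
    Children of a relevant page get the full `depth` budget again;
    children of an irrelevant page get one less; exploration along a
    path stops once the budget reaches zero.
    """
    def is_relevant(url):
        # Assumption: a naive case-insensitive substring match.
        return query.lower() in pages.get(url, "").lower()

    queue = deque([(start, depth)])
    seen = {start}
    found = []
    while queue:
        url, budget = queue.popleft()
        relevant = is_relevant(url)
        if relevant:
            found.append(url)
        child_budget = depth if relevant else budget - 1
        if child_budget <= 0:
            continue
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, child_budget))
    return found


# Hypothetical pages: "n5" is relevant but sits behind two irrelevant pages.
PAGES = {
    "hit": "web crawling basics",
    "n1": "crawling policies",
    "n2": "cooking recipes",
    "n3": "crawling traps",
    "n4": "gardening",
    "n5": "crawling at scale",
}
LINKS = {"hit": ["n1", "n2"], "n1": ["n3"], "n2": ["n4"], "n4": ["n5"]}

print(fish_search(PAGES, LINKS, "hit", "crawling", depth=1))
print(fish_search(PAGES, LINKS, "hit", "crawling", depth=3))
```

With a small budget the search stays close to chains of relevant pages; a larger budget lets it cross a stretch of irrelevant pages to reach "n5", at the cost of fetching more pages.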
10 Research topics
- NLP techniques to improve indexing and ranking
  o WSD. Anaphora? Semantic structures
- Semantic Web
  o Ontologies
- Text Mining to improve navigation. Web Mining (links)
- Distributed architectures
- Scalable index compression (? – just bigger disks)
- Multimedia search
11 Conclusions
- The Web has its own challenges as compared with general collections
- Search engines have to cope with them
- Gathering data (crawling) is a problem specific to the Web
- Also, the Web provides new types of info (links), which can be used by search engines
Special Topics in Computer Science The Art of Information Retrieval Chapter 14: Libraries and Bibliographical Systems Alexander Gelbukh
13 Differences with IR...
- Historically the first applications for searching
  o Predecessor of IR
- Docs: bibliographic records
  o Free text
  o Structured fields (e.g., date)
- Users: mostly librarians, or users of a library
  o thus: very limited budget
- Usually use Boolean model (IR: vector space)
  o Seems to be mostly due to historical reasons (among others)
  o Recently tend to add natural language search
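A Boolean query over bibliographic records reduces to set operations on an inverted index, optionally combined with a filter on a structured field such as the date. The sketch below illustrates this under simple assumptions: the three records are invented, only the title is indexed, and the helper names `build_index` and `postings` are hypothetical.

```python
from collections import defaultdict

# Invented bibliographic records: free-text title plus a structured year field.
RECORDS = [
    {"id": 1, "title": "Cataloguing for small libraries", "year": 1985},
    {"id": 2, "title": "Digital libraries and cataloguing", "year": 1999},
    {"id": 3, "title": "Searching digital collections", "year": 2002},
]


def build_index(records):
    """Map each lowercase title word to the set of record ids containing it."""
    index = defaultdict(set)
    for record in records:
        for word in record["title"].lower().split():
            index[word].add(record["id"])
    return index


INDEX = build_index(RECORDS)
ALL_IDS = {record["id"] for record in RECORDS}


def postings(term):
    """Ids of records whose title contains `term` (empty set if none)."""
    return INDEX.get(term.lower(), set())


# Boolean operators are plain set operations.
both = postings("cataloguing") & postings("digital")         # AND
either = postings("digital") | postings("libraries")         # OR
excluded = postings("digital") - postings("cataloguing")     # AND NOT
negated = ALL_IDS - postings("digital")                      # NOT

# A structured field is filtered directly and intersected with the text match.
recent = {record["id"] for record in RECORDS if record["year"] >= 1999}
recent_cataloguing = postings("cataloguing") & recent

print(both, either, excluded, negated, recent_cataloguing)
```

The result of a Boolean query is an unranked set: a record either matches or it does not, in contrast to the graded similarity scores of the vector space model.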
14 ...Differences with IR
- Creating the database is a subtask of such systems
  o Suit data to the system, not system to data as in IR
  o Carefully selected, structured, and annotated data
  o Annotation standards. Decimal classification, ...
15 Online Public Access Catalogs (OPAC)
- Three generations:
  1. Known-item finding tools (by title, author, ...)
  2. Subject headings, keywords, ...
  3. Search strategy assistance, natural language queries, improved GUI, ...
- Prove to be very hard to use for inexperienced users
- Nowadays tend to become similar to digital library tools
16 Research topics
- Ease of use
- More power and flexibility?
- Integration with Digital Libraries?
17 Conclusions
- Highly interoperable and standardized
- Look like legacy systems...
Special Topics in Computer Science The Art of Information Retrieval Chapter 15: Digital Libraries Alexander Gelbukh
19 Digital libraries (DL)
- Simplistic view: library in a machine-readable form
  o Digitization issues. Multilingual.
- 5S model:
  1. Streams (texts, multimedia, ...)
  2. Structures (databases, indices, ...)
  3. Spaces (interfaces in 1D, 2D, 3D, time, ...)
  4. Scenarios (procedures, transformations, services, ...)
  5. Societies (authors, annotators, ...)
- This provides a way to define a DL
20 Architecture
- Provide Web services
- Manipulate Digital Objects (Items?)
- Repositories of such objects. Access protocol. Standards.
- Security. Payment. Copyright. Watermarking
- Parallel search across heterogeneous distributed (multilingual) collections
- Multimedia collections
- Metadata, Standard formats
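Parallel search across heterogeneous collections can be sketched as one adapter function per repository, all queried concurrently and returning results in a common form. Everything here is a stand-in: the two in-memory "repositories", their differing field names, and the function names are hypothetical, and a real system would call remote services through a standard access protocol instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Two hypothetical repositories that store titles under different field names.
REPOSITORY_A = [{"title": "Digital Libraries Overview"}, {"title": "Map Archive"}]
REPOSITORY_B = [{"name": "Digital Music Collection"}, {"name": "Rare Books"}]


def search_repository_a(query):
    """Adapter for repository A: matches on the 'title' field."""
    return [r["title"] for r in REPOSITORY_A if query.lower() in r["title"].lower()]


def search_repository_b(query):
    """Adapter for repository B: matches on the 'name' field."""
    return [r["name"] for r in REPOSITORY_B if query.lower() in r["name"].lower()]


def federated_search(query, backends):
    """Send `query` to every backend in parallel; return results per backend."""
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = {name: pool.submit(search, query) for name, search in backends.items()}
        return {name: future.result() for name, future in futures.items()}


BACKENDS = {"repo_a": search_repository_a, "repo_b": search_repository_b}
results = federated_search("digital", BACKENDS)
print(results)
```

The adapters are where heterogeneity is absorbed: each one hides its repository's own schema, which is the role that shared metadata standards and protocols play between real digital libraries.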
21 Systems
- A lot of specific projects and systems are mentioned in the book.
- Interoperability. Standards for automatically searching remote libraries. Protocols
22 Research topics
- Markup tools to produce high-quality documents
- Scaling
- Interoperability. Standards
- Better integration with IR
23 Conclusions
- Turning heaps of texts collected in conventional (or new) libraries into searchable and accessible information
- DLs are technological solutions, which involve IR as one of their aspects
- Unlike the Web, they handle carefully prepared docs. Very costly.
- Like the Web, they are highly distributed and heterogeneous. Hence the importance of standardization and interoperability