Midnight in the Garden of Good and Evil Search Engines Presentation by Richard Wiggins –Technical Advisor, NEM Online, Michigan State University www.msu.edu/staff/rww firstname.lastname@example.org –Columnist, “Internet Buzz,” webreference.com www.webreference.com/outlook email@example.com –Co-host, Nothing But Net television program (produced by Media One)
A Parable: The Encounter Between the USS Nimitz and a Canadian Vessel...
A Frequency Analysis of the Appearance of a Critical Search Term Among Major Search Engines...
Frequency of the Search Term “Slavko” Among Major Search Indexes
AltaVista: 5477
Excite: 1160
Infoseek: 1452
HotBot: 4226
Come Join Our Tour of...a place millions want to visit... …where a cast of characters stands ready to help you find exactly what you’re looking for...
SearchVannah’s Tour Guides …a relatively new town …only existed since 1993 With so many visitors, lots of tour guides have set up shop –They tend to have funny names –They compete fiercely –They’re all trying to make money helping visitors find their way
The Tour Guides AltaVista –Fast, lots of memory, knows a lot –But people complain sometimes results are inconsistent InfoSeek –Claims answers are more relevant MetaCrawler –Doesn’t know anything at all! Just asks the other tour guides!
HotBot HotBot: This tour guide wears the ugliest clothes!
The Tour Guides... Inktomi: other tour guides hire Inktomi to answer their questions One guide knows a LOT less than all the others… –But it’s the most popular by far! –The smarter tour guides think of it as just a dumb Yahoo… But maybe tourists want to know where the B&B is, not a list of all the towels and dishes
Definitions Crawler: automated tool to discover new and changed pages, feeds data to… Indexer: builds and maintains an index, concordance-style Search engine: the actual tool end-users employ when searching …but in popular usage, all together = “search engine”
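The three roles defined above can be sketched in a few lines of Python. This is only an illustration: the in-memory `PAGES` table stands in for pages a real crawler would fetch over the network, and the names `crawl`, `build_index`, and `search` are hypothetical, not taken from any actual engine.

```python
from collections import defaultdict

# Hypothetical in-memory "web": URL -> page text (stands in for fetched pages).
PAGES = {
    "a.example/home": "Climate change research at the university",
    "a.example/music": "School of Music concert schedule",
    "a.example/climate": "Climate data and climate models",
}


def crawl(pages):
    """Crawler: discover pages and feed (url, text) pairs to the indexer."""
    for url, text in pages.items():
        yield url, text


def build_index(documents):
    """Indexer: build a concordance-style inverted index (word -> set of URLs)."""
    index = defaultdict(set)
    for url, text in documents:
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Search engine: return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index = build_index(crawl(PAGES))
print(sorted(search(index, "climate models")))
```

Keeping the crawler, indexer, and query step as separate functions mirrors the definitions on the slide, even though popular usage lumps all three together as "the search engine."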
Leveraging 30 Years of Information Retrieval (IR) Most new ideas we see in Web engines were thought of long ago... –Stemming –Controlled vocabulary –Text analytics –Knowledge Bases –Personalization (by observing user usage patterns) –Natural language
How Do People Search? “Honestly, tourists are the dumbest people” -- anonymous Tour Guide
What Do People Search For? Major search services say people look for... –Sex sites –One’s own name –Friends, colleagues’ Web sites (also by name) –Items in the news –Company / product information –Etc.
One user view of search.msu.edu: Academics application for graduation overseas study ordering catalog School of Music Computer Science human ecology department psychology 101
Another user view of search.msu.edu: Virtual Library DNA sequencing climate change beam theory feline brain tumor PRL and sequencing
Another user view of search.msu.edu: Extension livestock pavilion wildlife fisheries bathtub removal and installation Round Bale Storage
Another user view of search.msu.edu: Conversational I would like to know if you offer a workshop on “International Law”
What Do People Search For? Matt Koll’s Formulation “Finding a needle in a haystack” can mean anything from –a known needle in a known haystack, –a known needle in an unknown haystack, –to any needle in a haystack Where are the haystacks? GenX rendition: Needles? Haystacks? Whatever!
Typical User Search Strategy Type in a one-word search term Maybe two words Seldom exploit advanced options –Capitalization –Quoting phrases (e.g. “climate change”) –Date restrictions –Host:, URL: parameters Seldom use iterative refinement
Users Make “Wrong” Choices Picking the right database is confusing –Reference librarians, experienced users learn brand names –Inexperienced users do not Lycos example: “Small” versus “Large” catalog –“Small” catalog was faster, more precise –Virtually no one used it, thinking “Large” meant “better”
A Route 128 Story Engineering firm on Route 128 Engineers new products Has constant need for specialized information Uses traditional sources, and the Web “Joe down the hall” does the Internet searches Joe is a reference librarian with an engineering degree (and no training in online searching!)
Prospects for Training are Dismal! We don’t know the users, so we can’t hope to train them Users won’t read documentation or help notes If engine doesn’t deliver, users react viscerally –“This engine is useless” or –“The Internet has nothing useful” –“The Internet has too much information!”
How Well Do Today’s Engines Meet Real Users’ Needs? Most engines cannot yield high precision, high recall hit list with only one search term But most users don’t compose or refine their searches carefully Boolean operators virtually unused Therefore most users probably fail to get desired results Many sample searches from MSU example would not yield desired information
AltaVista “Intelligent” Case Matching Example Looking for information on “TREC” search engines testing at NIST
Scale Issues “This town is growing so fast, and there’s too many tourists!” -- a 3rd generation resident
The Problem of Scale No one knows exact size of Web –Databases, intranets complicate issue –“Dark matter” -- Vint Cerf Probably 250 to 500 million pages publicly accessible Recent Science article claims most spider coverage is incomplete AltaVista claims 140 million pages in index
Problem of Scale: Transaction Load AltaVista handles 30 million searches per day Inktomi is “back-end” for numerous sites –HotBot, N2H2 (Japan), Australian news service –Soon, the “find a Web site” function in Windows 98 No popular service has melted down yet
Inktomi’s “Network of Workstations” Model Eric Brewer, CEO, claims centralized high-speed servers cannot scale Developed new clustering scheme: dozens or hundreds of low-cost servers on high-speed network But centralized engines have not broken down yet 64-bit processors @ 300-450 MHz, gigabytes of RAM, fast paths to disk
Trends “We have a forward-looking sense of fashion!” -- one of the tour guides
Trends Among Search Engines Observations of Dr. Susan Feldman, Cornell: More professional look, feel than a couple years ago Common syntax evolving: –Plus sign prefix for required term, minus for excluded term –Quotes signify phrases, caps signify case significant Unique “personalities” evolving
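The common syntax described above (plus for required, minus for excluded, quotes for phrases) might be parsed roughly as follows. This is a sketch under assumptions: the function name `parse_query` and the three output buckets are hypothetical, and case significance is not handled.

```python
import re

# An optional +/- sign followed by either a quoted phrase or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a query into required, excluded, and optional terms.

    A quoted phrase is kept together as a single term.
    """
    parsed = {"required": [], "excluded": [], "optional": []}
    for sign, phrase, word in TOKEN.findall(query):
        term = phrase if phrase else word
        if not term:
            continue  # skip empty quotes such as ""
        if sign == "+":
            parsed["required"].append(term)
        elif sign == "-":
            parsed["excluded"].append(term)
        else:
            parsed["optional"].append(term)
    return parsed


print(parse_query('+"climate change" -hoax policy'))
```

Each engine of the period applied these operators with its own ranking rules, so the parser only shows how the shared notation breaks down, not how any particular engine used the result.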
The Role of Meta-Crawlers Experts agree that spider coverage varies across services No two services cover the same sites for a given search Therefore searching across multiple indexes yields more results Therefore metacrawlers can be useful
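The reasoning above can be made concrete with a small sketch of what a metacrawler does: forward the query to several engines and merge the hit lists. The two "engines" here are hypothetical stand-in functions returning made-up URLs; a real metacrawler would call the services over the network and would need its own ranking policy.

```python
def engine_a(query):
    """Hypothetical engine: returns a ranked list of URLs."""
    return ["x.example/1", "x.example/2"]


def engine_b(query):
    """Hypothetical engine with different coverage."""
    return ["x.example/2", "x.example/3"]


def metasearch(query, engines):
    """Ask every engine, then merge the hit lists and drop duplicate URLs."""
    merged = {}
    for name, engine in engines.items():
        for rank, url in enumerate(engine(query), start=1):
            entry = merged.setdefault(
                url, {"url": url, "sources": [], "best_rank": rank}
            )
            entry["sources"].append(name)
            entry["best_rank"] = min(entry["best_rank"], rank)
    # Pages found by more engines come first, then best rank, then URL.
    return sorted(
        merged.values(),
        key=lambda e: (-len(e["sources"]), e["best_rank"], e["url"]),
    )


results = metasearch("slavko", {"A": engine_a, "B": engine_b})
for hit in results:
    print(hit["url"], hit["sources"])
```

Because the two engines overlap on only one page, the merged list is longer than either list alone, which is the slide's argument for metacrawlers.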
Targeted Spiders Train the spider to crawl only sites that fit a certain subject domain InfoSeek News Index –Death of a Princess example Internet.com’s “vertical” index LawCrawler NEM Online –Research project at Michigan State University –Harnessing information of use to manufacturers
“death of Princess Diana” Search on Infoseek, 8/31/97 1:00 pm
Traditional Model: First, Pick a Database, Then Do Your Search
Why Northern Light is a Breakthrough Delivering quality sources alongside Web resources –As Web becomes more cluttered, advantage grows Database search paradigm inverted: First do your search, then pick your source Automatic categorization yields manageable hit lists –Advantage also grows as Web grows
Beyond Text: Still Images, Digitized Speech, Video We tend to think of search engines as limited to text But increasingly we will face digital content Thanks to scanners, digital cameras, digital sound cards, digital video cameras These digital collections will be corporate assets But to use, and re-purpose, these assets, we will need search engines
IBM Almaden’s Image Search Software Able to index a large collection of still images Able to find similar images –User selects image, asks for similar shapes –User draws shapes –User filters by color, textual metadata Samples available online: –Searchable digital postage stamp archive www.qbic.almaden.ibm.com/cgi-bin/stamps-demo –Searchable archive of trademarks (logos)
AltaVista Keyword Index into Clinton Testimony Video
Cross-Language Searching Internet is biased towards English But it is a World Wide Web Tools to allow searching in one language, against a universe in other languages, are evolving Challenges of understanding meaning, resolving ambiguities multiply But effective tools are coming
The AltaVista Translation Service: Extending Search Engines into New Areas Translates to/from English, Spanish, French, Italian, German Try translating “Are you having a bad hair day?” to another language and back...
Translation Result: “Are you having a bad hair day?”...becomes… “It is for you defective day of hats, no?”
Any Portal in a Storm Search engine services becoming portals Non-index services –Browsing view –Stock quotes –Pager services –Personalization (“My Yahoo,” “My AltaVista,” “My Foot”) The linear search engine result set can’t compete without added components
Evaluating the Engines “You just can’t trust some of the other tour guides!” -- every tour guide
Evaluating Search Engines Searchenginewatch.com –Part of internet.com family –“Search Engine EKG” –Measures rate of crawling, other metrics, for leading Web engines National Institute of Standards and Technology TREC Series –Rigorous annual “bakeoff” conducted by Donna Harman –Leading technology firms, university researchers compete
AltaVista vs Infoseek: An Accidental Bakeoff Michigan State University was first university to acquire AltaVista Intranet product (1996) Used for campus-wide spider as well as subject-specific index (manufacturing) –search.msu.edu –www.nemonline.org Infoseek on its own initiative set up an index of msu.edu
AltaVista vs Infoseek: Preliminary Observations In many cases, AltaVista and Infoseek return very similar results Using actual searches typed by users, in some cases Infoseek shows superior relevancy ranking –Word proximity has more weight Infoseek also appears to offer superior duplicate detection “Find similar” in Infoseek works very well
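To show what "word proximity has more weight" could mean in practice, here is a toy scoring function. It is not Infoseek's actual ranking algorithm, which is not described in these slides; the name `proximity_score` and the formula 1 / (1 + span) are assumptions chosen only for illustration.

```python
import itertools


def proximity_score(text, terms):
    """Score a document higher when all query terms appear close together.

    Returns 1 / (1 + span), where span is the smallest distance in words
    covering one occurrence of every term; 0.0 if any term is missing.
    """
    words = text.lower().split()
    terms = [t.lower() for t in terms]
    positions = [[i for i, w in enumerate(words) if w == t] for t in terms]
    if not terms or any(not p for p in positions):
        return 0.0
    # Brute force over one position per term; fine for short documents only.
    span = min(max(c) - min(c) for c in itertools.product(*positions))
    return 1.0 / (1 + span)


print(proximity_score("climate change policy", ["climate", "change"]))
print(proximity_score("climate policy and social change", ["climate", "change"]))
```

A page where the two words are adjacent scores higher than one where they are separated by unrelated words, which is the kind of difference a user typing a two-word query would notice in the ranking.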
Decentralized Searching: The Infoseek Experiment Steve Kirsch (CEO of Infoseek) offers this experiment: “Name a movie by James Cameron”
What This Experiment Shows… Some servers are louder than others Several servers know recent, highly publicized information Some pieces of information are known only to one server Some servers give out wrong information Some servers never answer any question
Decentralization Trend We’ve tried decentralized indexes with little success –WAIS –Harvest But scale of single central indexes may force new attempts Infoseek intends major push –Network of “Ultraseek” intranet sites –“Use other people’s servers to do the hard work”
Ethics of Engines “What does ethics have to do with helping people find things??!” -- every tour guide
The Ethics of Search Engines Gaining value from freely-available content Yahoo, AskJeeves advertise themselves as reference sources They make money on answers that others provide for free Are they a bibliography, which has always been legitimate? Or, thanks to the hyperlink, are they exploiting those who provide the real value?
Ethics: Index Spamming People learned to spam the index early on –Overload your page with terms people use in searching –Some sites present a different page to the spider than the end user sees –One church asked a Web developer to put in meta tags with obscene words Is spamming unethical? –Seems to be, but why exactly? –Sears catalog vs Montgomery Ward
Ethics: The Search Services’ Incentives Most make money from banner ads They want to maximize page impressions and clickthroughs The ideal user would search forever! Banner ads adapt to the search based on keyword –Banner ad technology is better than result set technology!
Ethics: Editorial Copy Versus Advertising-Influenced In the print world, it’s pretty obvious what’s an advertisement –Yellow Pages –New York Times –Thomas Register To avoid confusion, some ads are labeled In the online world, it’s not always clear If companies sold better search positions, how would we know?
Ethics: Buy First Place on the Hit List -- and Tell How Much You Paid!
Paying for Position "We tried the editorial system of rating pages, and we found that it wasn't scalable...but the market is infinitely scalable.” –Jeffrey Brewer, CEO, GoTo.com
The Future “I don’t know what the future is, but we’ll be number one!!!” -- every tour guide
The Future: Promises and Limits IR scientists say engines may be approaching fundamental limit Koll: typical gigabyte of searchable space holds 25,000 occurrences of typical search term “With a lot of work, maybe we can get to 50% recall and 50% precision” But combination of approaches can yield greater power
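For readers unfamiliar with the two measures quoted above: precision is the share of returned hits that are relevant, and recall is the share of all relevant documents that were returned. A minimal sketch, with hypothetical document IDs:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    precision = relevant hits / all hits returned
    recall    = relevant hits / all relevant documents that exist
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 hits returned, 2 of them relevant,
# out of 4 relevant documents in the whole collection.
print(precision_recall(["a", "b", "c", "d"], ["a", "b", "e", "f"]))
```

In this made-up case both values come out at 50%, the level the slide suggests might be reachable "with a lot of work."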
The Search Engine Industry Analysts generally agree that “Yahoo wins!” Claims ~100 million transactions per day Also claims 30 million unique users Also claims more “viewership” per day than most specialty cable TV channels (e.g. MTV) And it’s a catalog, not a full-text engine!
Search Engine Companies’ “Value Per User” (Mecklermedia)
Changes and Alliances All search sites now offer browsing views All services offering free e-mail Yahoo offering news Alliances –AltaVista plus Real Name System –AltaVista plus Amazon.com –Lycos plus Barnes and Noble online –Yahoo drops AltaVista when AltaVista adds browsing view
Combining Best Features Of Yahoo, Infoseek, AskJeeves Build a knowledge base –Leverage the actual queries people issue –An FAQ Offer a blend of drill-down hierarchy, knowledge base, full-text Search for one word yields rich result set –E.g. “Intel” Example: Verity’s new Knowledge Organizer
Verity’s Knowledge Organizer Product A tool to capture and organize an organization’s online information –Build your own Yahoo and AltaVista-style search service Site builds its own topical taxonomy –Using a graphical user interface Tool indexes within categories and across them End user can –drill down within topics –search within and across topics
A Modest Proposal: The Accidental Thesaurus For intranet, online product catalog, newspaper, campus sites Build a thesaurus based on what people look for Don’t even try to be comprehensive Use your search logs to find what people look for -- and how they actually search Fuzzy matching of user searches against thesaurus, a la AskJeeves
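The proposal above could be prototyped along these lines. Everything here is an assumption made for illustration: the log format (one query per line), the `THESAURUS` entries and their URLs, and the use of Python's standard `difflib` module for the fuzzy matching (the slides do not say how AskJeeves matches queries).

```python
from collections import Counter
import difflib


def top_queries(log_lines, n=3):
    """Count the queries in a search log and return the n most frequent."""
    counts = Counter(
        line.strip().lower() for line in log_lines if line.strip()
    )
    return counts.most_common(n)


# Hypothetical hand-built thesaurus: popular query -> best page for it.
THESAURUS = {
    "application for graduation": "registrar.example/graduation",
    "overseas study": "studyabroad.example/",
}


def lookup(query, thesaurus, cutoff=0.6):
    """Fuzzy-match a user's query against the thesaurus entries."""
    matches = difflib.get_close_matches(
        query.strip().lower(), list(thesaurus), n=1, cutoff=cutoff
    )
    return thesaurus[matches[0]] if matches else None


log = ["Overseas study", "overseas study ", "", "psychology 101"]
print(top_queries(log, n=1))
print(lookup("aplication for graduation", THESAURUS))
```

The log tells the editor which entries are worth adding, and the fuzzy lookup lets a misspelled or slightly reworded query still reach the hand-picked page, without any attempt at comprehensive coverage.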
New Job Title: The Info Snout Like an Info Scout...only nosier Similar job as cataloging librarian...more like a pathfinder builder Daily routine: –Look at search logs –Find new terms, add to thesaurus –Also look at company newsletters, newspaper, trade journals, etc
Lack of Structure Today’s spiders effectively index every page as a separate document What if an OPAC did that? The atom in a hit list should be a document, not a page With XML, one could define structure for documents But will we have one definition, or many?
The Future Much more intelligent engines Not much more intelligence in users The linear, undifferentiated hit list will die Cross-language Text, image, sound, video The “Star Trek” computer model of searching
A Comment from the PR Person at a Major Internet Search Service... “I hope you are aware of our product, and I hope your remarks will show that our product is one of the good ones, not one of the evil ones…” We will not name the company, but its name evokes “aurora borealis”….
Infonortics Search Engines Conference Outstanding two day conference with leading search engine experts –From academe and from search industry Held April 1 in Boston; two previous conferences Scheduled for April 19-20, 1999 –Back Bay Hilton, Boston See www.infonortics.com
Special Thanks To... Judy Matthews, Michigan State University Libraries Lou Rosenfeld, Argus Associates Sue Davidsen, Michigan Electronic Library Julie Long, Advanced Information Consultants
See Related Articles in June 1998 issue of Searcher “Infonortics '98 Search Engines Conference” article by Judy Matthews and Rich Wiggins: http://www.infotoday.com/searcher/jun/story4.htm Article & chart covering search engine trends by Susan Feldman: http://www.infotoday.com/searcher/jun/story2.htm
These slides will appear... www.nemonline.org/present/rww