Search and the ‘Net in 2009 Trends, Challenges and Cutting-Edge Developments in Internet Search Michael Hunter Reference Librarian Hobart and William Smith.

Search and the ‘Net in 2009 Trends, Challenges and Cutting-Edge Developments in Internet Search Michael Hunter Reference Librarian Hobart and William Smith Colleges For Rochester Regional Library Council Member Libraries’ Staff Sponsored by the Rochester Regional Library Council Supported by Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library 2008

For today... Landscape of Search in 2009 Web 2.0 and Social Search Web 3.0 – What, Where and What it can do The Year at Google New Services Recent Developments at Established Services Possible Directions for Search in the Future Linklist http://people.hws.edu/hunter/searchnet09.htm

How large is the Web? Internet Systems Consortium www.isc.org

Web Search in 2008 Who’s crawling the Web? Google, Yahoo, Live Search (MSN), Ask (owns Teoma), Gigablast

Size Estimates 7/9/08 Google AND Yahoo! text filetypes in millions

User Satisfaction ForeSee Results and U. Michigan 8/14/07

Web 2.0 and Social Search Services that Let people form online communities Facilitate collaboration Operate across formats and platforms

Web Search Read-only CONSUMERS CONTENT Full-text keyword search Machine-based ranking Search Engine

Web 2.0 Search Read-Write-Create-Describe-Organize Consumers Creators Taggers Online Communities Creators

Searching Web 2.0 Enhances results beyond just full-text search Users’ interactions within “community-based” systems can be used to infer query context and intent Full-text search: return all postings relevant to “steep slopes” Community-based search: return only postings by community members sharing the same interests, e.g. men with heart disease interested in “steep slopes”
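
The contrast on this slide between plain full-text retrieval and community-aware retrieval can be sketched roughly as follows. This is only an illustration: the `Posting` record, its field names, the sample postings, and the rule "the author must share all of the searcher's interests" are assumptions made for the example, not a description of any real service.

```python
from dataclasses import dataclass, field


@dataclass
class Posting:
    # Hypothetical record: who wrote the posting, its text, and the
    # interests the author has declared in the community.
    author: str
    text: str
    author_interests: set = field(default_factory=set)


def full_text_search(postings, phrase):
    """Plain keyword search: every posting whose text contains the phrase."""
    phrase = phrase.lower()
    return [p for p in postings if phrase in p.text.lower()]


def community_search(postings, phrase, searcher_interests):
    """Keep only matches whose author shares all of the searcher's interests."""
    return [
        p for p in full_text_search(postings, phrase)
        if searcher_interests <= p.author_interests
    ]


# Invented sample data for illustration only.
postings = [
    Posting("al", "Skiing steep slopes in Vermont", {"skiing"}),
    Posting("bo", "Walking steep slopes after bypass surgery", {"heart disease"}),
    Posting("cy", "Flat trails only", {"heart disease"}),
]

all_hits = full_text_search(postings, "steep slopes")
community_hits = community_search(postings, "steep slopes", {"heart disease"})
```

Here `all_hits` contains both postings that mention the phrase, while `community_hits` keeps only the one written by an author with the matching interest.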

Web 2.0 Applications Flickr – Share and tag photos YouTube – Share, tag, comment and subscribe to video content Del.icio.us –Share, tag, store and organize web content Yahoo Answers – Ask, answer, rate, comment on questions

Social Search Engines 2004- Incorporate human-generated metadata in retrieval and ranking of search results Tags – from creators and readers/viewers Comments, ratings, edits, deletions Searcher-supplied interest profiles Online community-based search engines Eurekster, Mahalo, Wikia Search, Delver, me.dium.com, Stumbleupon, Thagoo (meta)

Potential Advantages At least one human has selected and endorsed each result Readers generate most tags, not authors Reduces impact of link spam in ranking User input keeps results more current and relevant

Concerns User-generated spam Content vandalism Unique searches benefit less “Long tail” effect

Wikiasari: Quick rummaging search “User ranked results” Open source SE by Jimmy Wales and Amazon Initial results ordered with algorithms a la Google Users reorder results, which will be used in ranking of future similar searches Edits allowed on all search results Strength is in general search topics
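
A minimal sketch of the "users reorder results" idea: an algorithmic score sets the initial order, and stored user votes adjust the order for later runs of the same query. The class name, the `vote_weight` parameter, and the additive scoring formula are hypothetical; the slide does not say how the actual service combines the two signals, and "similar" searches are reduced here to case-insensitive identical queries.

```python
from collections import defaultdict


class UserRankedResults:
    """Hypothetical re-ranker: algorithmic score plus weighted user votes."""

    def __init__(self, vote_weight=0.1):
        self.vote_weight = vote_weight
        # (normalized query, url) -> net number of user votes
        self.votes = defaultdict(int)

    def record_vote(self, query, url, delta):
        """Store a user's promotion (+) or demotion (-) of a result."""
        self.votes[(query.lower(), url)] += delta

    def rank(self, query, scored_results):
        """scored_results is a list of (url, algorithmic_score) pairs."""
        q = query.lower()
        return sorted(
            scored_results,
            key=lambda r: r[1] + self.vote_weight * self.votes[(q, r[0])],
            reverse=True,
        )


engine = UserRankedResults()
results = [("a.example", 0.9), ("b.example", 0.8)]

before = [url for url, _ in engine.rank("Jazz", results)]
engine.record_vote("jazz", "b.example", 2)   # two users promote the second result
after = [url for url, _ in engine.rank("Jazz", results)]
```

In this toy run the votes lift `b.example` above `a.example` the next time the query is issued.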

Web 3.0 – The Brave New Web

Web-based applications or services that combine metadata and artificial intelligence to provide a more productive and intuitive user experience First used by John Markoff of the New York Times in 2006

Key Components of Web 3.0 Expanded from Wikipedia “Web 3.0” (Nova Spivack) http://en.wikipedia.org/wiki/Web_3.0 Semantic Web – Embedded Metadata Intelligent Applications Human Judgment and Analysis Expanded Network Computing Open Technologies – Open Source Distributed Databases

Key Components of Web 3.0 Embedded metadata (Semantic Web): Dublin Core, Creative Commons, RDF, GeoRSS, hCard, hCalendar, hReview, hAtom

Expressing Dublin Core in HTML/XHTML meta and link elements
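
As a rough illustration of what this looks like in a page header, the sketch below generates a `link` element declaring the Dublin Core namespace followed by `DC.`-prefixed `meta` elements, which is the general pattern of the DCMI convention. The function name and the sample record are assumptions for the example; consult the DCMI recommendation for the exact rules (qualifiers, schemes, language attributes).

```python
from html import escape

# Namespace URI of the Dublin Core Metadata Element Set, version 1.1.
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"


def dublin_core_head(record):
    """Render a dict of Dublin Core elements as HTML link/meta elements."""
    lines = [f'<link rel="schema.DC" href="{DC_NAMESPACE}">']
    for element, value in record.items():
        # escape() protects quotes and angle brackets inside attribute values.
        lines.append(
            f'<meta name="DC.{escape(element)}" content="{escape(value)}">'
        )
    return "\n".join(lines)


# Sample record built from this presentation's own title slide.
head = dublin_core_head({
    "title": "Search and the Net in 2009",
    "creator": "Michael Hunter",
    "date": "2008",
})
```

A crawler that understands the convention can then read `DC.creator` or `DC.date` directly instead of guessing them from the page text.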

Key Components of Web 3.0 Intelligent applications Natural Language Processing (NLP) Information extraction Machine learning Data mining

Natural Language Processing (NLP) With NLP software unstructured text and data can be processed to reveal degrees of meaning by Extracting terms identified as significant Summarizing content Discovering relationships among terms and groups of terms HOW???

NLP Extraction Take all articles from a group of pharmaceutical journals published in one year (the “corpus”) Extraction – Run a relevant controlled vocabulary (list of all known drugs) against the corpus

NLP Extraction Drugs found, number of occurrences and location in the corpus plus a list of possible drugs not in the controlled vocabulary 86>penicillin click for locations 124>tetracycline click for locations 213>aspirin click for locations Are these also drugs? XXX, XXX, XXX
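
The extraction step can be sketched in a few lines: run each vocabulary term against every document, record where it occurs, and flag unknown words that might belong in the vocabulary. The two-sentence corpus is invented, and the suffix heuristic used to guess "possible drugs" is a deliberately crude assumption; real NLP software uses far richer evidence.

```python
import re
from collections import defaultdict


def extract_terms(corpus, vocabulary):
    """Map each vocabulary term to (document id, character offset) pairs."""
    locations = defaultdict(list)
    for doc_id, text in corpus.items():
        for term in vocabulary:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, re.IGNORECASE):
                locations[term].append((doc_id, match.start()))
    return dict(locations)


def candidate_terms(corpus, vocabulary, suffixes=("cillin", "cycline", "mycin")):
    """Guess at possible drugs missing from the vocabulary (hypothetical heuristic)."""
    known = {term.lower() for term in vocabulary}
    found = set()
    for text in corpus.values():
        for word in re.findall(r"[A-Za-z]+", text):
            word = word.lower()
            if word not in known and word.endswith(suffixes):
                found.add(word)
    return sorted(found)


# Invented two-document corpus for illustration only.
corpus = {
    "article-1": "Penicillin remains first-line; aspirin was given for fever.",
    "article-2": "Patients allergic to penicillin received erythromycin instead.",
}
vocabulary = ["penicillin", "tetracycline", "aspirin"]

hits = extract_terms(corpus, vocabulary)
counts = {term: len(locs) for term, locs in hits.items()}
maybe_drugs = candidate_terms(corpus, vocabulary)
```

`counts` corresponds to the occurrence numbers on the slide, `hits` to the "click for locations" links, and `maybe_drugs` to the "Are these also drugs?" list.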

NLP Summarization Retain phrases surrounding the extracted term(s) with links to locations in the corpus (KWIC Index) rare uses of penicillin Often penicillin is contraindicated when responds well to penicillin
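
A keyword-in-context (KWIC) index of this kind can be sketched as below. The function name, the `width` parameter, and the one-sentence corpus are assumptions for the example; the location is recorded as a simple word position rather than a clickable link.

```python
import re


def kwic(corpus, term, width=3):
    """Return (doc id, word position, left context, term, right context) tuples."""
    entries = []
    for doc_id, text in corpus.items():
        words = re.findall(r"\w+", text)
        for i, word in enumerate(words):
            if word.lower() == term.lower():
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                entries.append((doc_id, i, left, word, right))
    return entries


# Invented sentence built around one of the slide's sample phrases.
corpus = {
    "article-1": "Often penicillin is contraindicated when allergy is suspected.",
}
index = kwic(corpus, "penicillin", width=2)
```

Each entry keeps the surrounding phrase together with enough location information to jump back to the source passage.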

NLP Summarization Tag all words in the corpus with their grammatical function and search for noun – verb – noun and other syntactic patterns (drug A) treats (disease B) (drug C) causes (disease B) (drug D) is contraindicated in (disease B)
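
Real systems tag every word with its part of speech before looking for patterns. The sketch below swaps in a much simpler technique, a regular expression over a fixed list of relation phrases, just to show the shape of the output. The relation list and the sample sentences are assumptions made for the example.

```python
import re

# Stand-in for part-of-speech tagging: a fixed set of relation phrases
# placed between two single-word terms.
RELATION_PATTERN = re.compile(
    r"\b(?P<subject>\w+) "
    r"(?P<relation>treats|causes|is contraindicated in) "
    r"(?P<object>\w+)",
    re.IGNORECASE,
)


def extract_relations(text):
    """Return (subject, relation, object) triples found in the text."""
    return [
        (m["subject"].lower(), m["relation"].lower(), m["object"].lower())
        for m in RELATION_PATTERN.finditer(text)
    ]


# Invented sample sentences for illustration only.
text = "Penicillin treats syphilis. Aspirin is contraindicated in hemophilia."
relations = extract_relations(text)
```

The resulting triples are the raw material for the relationship queries described on the next slide.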

NLP Term Relationship Queries answered by tracking references across sentences Can penicillin cause shock? “Penicillin treatment is not without risks. In certain cases it can trigger anaphylactic shock.”

NLP can do even more … Word disambiguation bank (river) bank (finances) bank (verb) Retrieval of alternative word forms Retrieval of variants in capitalization and spelling Topic detection and tracking Following different themes in a changing RSS feed Machine translation

Key Components of Web 3.0 Human Judgment and Analysis Web 2.0 - Selection, tagging, rating, comment Expanded Network Computing Distributed Computing Cloud Computing Grid Computing Interoperability

Key Components of Web 3.0 Open Technologies Open Source Creative Commons Open Archive Initiative Distributed Databases Structured data records in reusable and searchable formats Standardized query language – SPARQL technology RDF structured databases
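
To give a feel for how RDF-style data is queried, here is a toy sketch in plain Python: facts are stored as subject, predicate, object triples, and a pattern with `?variables` is matched against them. It imitates the idea behind a single SPARQL triple pattern such as `SELECT ?doc ?who WHERE { ?doc dc:creator ?who }`; it is not a SPARQL engine, and the triples and the `match` helper are invented for the example.

```python
def match(triples, pattern):
    """Return variable bindings for one triple pattern.

    Pattern terms beginning with '?' are variables; all others must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


# Invented triples for illustration only.
triples = [
    ("doc1", "dc:creator", "Michael Hunter"),
    ("doc1", "dc:date", "2008"),
    ("doc2", "dc:creator", "Jane Doe"),
]
creators = match(triples, ("?doc", "dc:creator", "?who"))
```

Because every record has the same three-part shape, the same query works across databases that were never designed together, which is the point of the "distributed databases" item on the slide.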

Semantic Search Systems Understand the user’s query Understand Web text Bring these together for query results that are contextually relevant Algorithms that match the meanings and not just the words Natural Language Processing Concept Mapping

Expanding processing capabilities XML NLP Machine learning Data mining Open source WEB 2.0 Tags Comments Ratings Online Communities Dublin Core RDF Web 3.0 Semantic Search Headup Twine Mahalo The new Yahoo and more... Human judgment

New Services

Kosmix www.kosmix.com Google interface Offers overview of results by document type: Basic Facts, Reviews & Opinions, Media, People & Community, Shopping, News Extensive clustering by subject Blended results with thumbnails of images, video and audio clips, presentations and reports Human-created “topic pages” for subjects of current interest

VideoCrawler Video meta engine by AT&T Searches over 1,600 online video sources by Media type Rating Date of creation Popularity

ChaCha www.chacha.com Free mobile search service Requires a (free) account Text your questions and a human “guide” sends back an answer, limited to 160 characters Supported by 98% of mobile providers

Highlights of this year at Google

Universal Search Results from G.’s verticals blended into Web results and ranked together (Google as Portal?) Launched 2007 Web Search, News, Book Search, Video, Images, Blog Search, Local/Maps, Product Search

Under the Hood G “knows” 1 trillion web-based items (8/13/08) Not indexed: data records calendar pages other auto-generated content duplicates Supplemental Index incorporated into main index (1/3/08)

Results Ranking Beyond Word Frequency, Links & PageRank Algorithms incorporating Language – interpreting phrases, synonyms, diacritics, spelling mistakes, etc. Query – language used to express the search Time – returning pages with freshness appropriate to the query Personalization – Not all people want the same set of results

Personalization for each searcher New ranking process based on Wording of the query Text relationships of the pages retrieved Location Recent searches executed at that computer Functions with or without a Google account “Focusing on the user’s intent and location” On-screen message- Customized based on recent search activity Launched mid-November ‘08

Evaluating Search Automated evaluations every minute Monitoring users’ click behavior (anonymous or personalized via Google account) Google quality raters – hired to execute and evaluate specific queries Ratings: Vital, Useful, Relevant, Not Relevant, Off-topic Google Quality Raters Guidelines 2007 http://www.mauriziopetrone.com/blog/wp-content/uploads/quality-rater-guidelines-2007.pdf Over 450 improvements launched in 2007

New features and collections SearchWiki - customize search by re-ranking, deleting, adding, and commenting on search results – requires free account “Voice Search” - mobile app for iPhone allows queries to be spoken then run in Google Life Photo Archive hosted in Image Search Images from 1790 to today, most never before published Searchable by category and decade http://images.google.com/hosted/life or Add source:life to any term(s) in Image Search

Google Books 2 divisions Partner (Publisher) Program and Library Project Partner Program Publishers authorize G. to scan and make searchable the full text of their books Users see only the page containing their search terms Link to purchase copy No download possible

Google Books 2 divisions Partner Program and Library Project Library Project (2004) Scan and make searchable millions of books For works in copyright, users see only a few sentences around search terms Users may browse full text of public domain works NOTE: Not possible to print ANY material from either Google Print project

Current Member Libraries U. of California, Harvard, U. of Michigan, NYPL, Oxford, Stanford, U. of Virginia, U. of Wisconsin, U. of Texas (Austin), U. Complutense (Madrid), Princeton, Bayerische Staatsbibliothek

Copyright and the World of Books The set of ALL books that are in the public domain The set of ALL books that are copyrighted and in print (mostly in the Partner Program) ALL other books are still in copyright, but out of print

Google Books in the Future G. will sell electronic access to in-copyright, out-of-print works, with permission of publisher and author 37% of revenue to Google 63% of revenue to publisher and author In-print, in-copyright works scanned through the Library project will be full-text searchable. No portion of these works will be shown unless they are also part of the Partner Program
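
The revenue split stated on this slide is simple arithmetic; the sketch below applies it to a hypothetical sale price. The function name and the $10.00 price are assumptions for the example, and `Decimal` is used only to avoid floating-point rounding on currency.

```python
from decimal import Decimal

# Shares as stated on the slide.
GOOGLE_SHARE = Decimal("0.37")
RIGHTSHOLDER_SHARE = Decimal("0.63")   # publisher and author together


def split_revenue(sale_price):
    """Return (Google's portion, publisher-and-author portion) of one sale."""
    price = Decimal(sale_price)
    return price * GOOGLE_SHARE, price * RIGHTSHOLDER_SHARE


# Hypothetical price for one electronic-access sale.
google_part, rightsholder_part = split_revenue("10.00")
```

For a $10.00 sale this gives $3.70 to Google and $6.30 to be divided between publisher and author.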

Google Books in the Future Public domain works will continue to be fully available as before Individuals and institutions may purchase full online access to all Google Book titles “Public and university libraries” will be able to offer free access to out-of-print, in-copyright titles through “designated terminals”. http://books.google.com/agreement

Google’s Custom Search www.google.com/coop/cse/ A tool from the Google Coop initiative Keywords chosen determine content and weighting of results (limit of 100 characters) Search Entire web Your selected sites only Entire web with selected sites emphasized Within Coop, a CSE can be created and maintained collaboratively Stored or Linked versions available

The Latest at Established Services

Why do I need more than Google? The Google effect -- the single most powerful force in today’s Internet a private profit-driven company owns more information on individuals’ search behavior, companies and organizations than any other entity

Why do I need more than Google? Great potential for misuse/abuse of this information for financial gain Societies seldom leave basic services (utilities, medical and traffic regulation) totally to the “free market” Is web search now a “basic service” ???

Search dominance --- Potential skewing for commercial, political or social purposes Database composition Ranking Privacy No single search engine can crawl the whole web Limits search features, results display, consumer and shopping information http://google-watch.org

Yahoo Open Strategy Y!OS – major internal and external redesign to unify all Yahoo’s services Owns Flickr, del.icio.us, Upcoming “We are building social into everything we do” Offers more control over what is shared Easier to set up small social networks Will open some search technology to developers and users (http://developer.yahoo.com/search/boss/)

Yahoo and the Semantic Web Will begin to include certain metadata embedded in web pages as search and ranking elements: Dublin Core, Creative Commons, RDF, GeoRSS, hCard, hCalendar, hReview, hAtom Will support Open Search specification allowing crawler access to deep web resources (!!!)

Yahoo Pipes - pipes.yahoo.com Users can combine, filter and display any RSS content Finished “pipes” can be shared and embedded in other web pages e.g. a pipe for RSS feeds from educational blogs filtering for technology, physics or any other keywords Version available for the iPhone iphone.pipes.yahoo.com
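
What a simple "pipe" does can be approximated in a short script: read one or more RSS feeds, keep the items that mention any of the chosen keywords, and hand back the combined list. This sketch uses only the Python standard library and an invented inline feed; it does not call Yahoo Pipes itself, and the function name and sample URLs are assumptions.

```python
import xml.etree.ElementTree as ET


def filter_feeds(feeds, keywords):
    """Combine several RSS documents and keep items mentioning any keyword."""
    wanted = [k.lower() for k in keywords]
    matches = []
    for rss_xml in feeds:
        root = ET.fromstring(rss_xml)
        for item in root.iter("item"):
            title = item.findtext("title", default="")
            description = item.findtext("description", default="")
            haystack = f"{title} {description}".lower()
            if any(k in haystack for k in wanted):
                matches.append({
                    "title": title,
                    "link": item.findtext("link", default=""),
                })
    return matches


# Invented sample feed for illustration only.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Education blogs</title>
<item><title>Physics labs on a budget</title>
<link>http://example.org/1</link><description>Lab ideas</description></item>
<item><title>Field trip notes</title>
<link>http://example.org/2</link><description>Museum visit</description></item>
</channel></rss>"""

filtered = filter_feeds([SAMPLE_FEED], ["technology", "physics"])
```

Passing several feeds in the list gives the "combine" step; the keyword test gives the "filter" step.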

MSN’s Live.com Database increasing Simpler Interface (4/08) “Rich Answers” blended results Image search enhancements filter:face filter:portrait filter:bw NLP question processing improved Live Search Books and Search Academic ended 5/08

New & Notable at Ask The Butler is gone! Teoma is in his place! Smart Search Web Answers Zoom Superior Mapping Tools

Gigablast Maintains unique database Offers advanced search features “Freshness dating limit” estimates the date that a particular page was first published or most recently edited or modified Custom Topic Search of Gigablast – up to 500 domains (www.gigablast.com/cts.html)

Future directions in search Social Search will continue to grow and adopt spam-prevention measures Personalization will increase as searchers trade privacy for enhanced, customized results and alerting services Yahoo’s new social-based service will combine Web 2.0 capabilities with a robust search engine Open Source development of NLP, data and text mining software will continue to incorporate these capabilities into free services

Thank You! Michael Hunter Reference Librarian Hobart and William Smith Colleges Geneva, NY 14456 (315) 781-3552 hunter@hws.edu
