Presentation on theme: "Google Search Appliance November 2, 2010 Susan Fagan."— Presentation transcript:
Google Search Appliance November 2, 2010 Susan Fagan
2 Why Google Search Appliance? A different approach to search at EPA Smarter ranking Improved indexing Easier operations A future We're going to call it GSA from here on in
3 How GSA ranks documents It's a secret, but we know some things –Page rank –Self learning We can control some things –Date biasing –Source biasing –Metadata biasing –Best bets We're going to let it do its thing before we tune it too much
4 How GSA ranks documents: Page Rank Who links to your pages? Who links to pages that link to your pages? How does everybody link? –What does it say in the link text? –Is the link always the primary URL (because if it isn't, you don't get any points)? A primary URL is a URL that contains no aliases that are not primary. Primary as defined by what you put in the TSSMS Alias Tool.
5 How GSA Ranks Documents: Things We Can Control Date biasing –Newer is better –We control how much better Source biasing –Boost or decrease chunks of our website –Regions are slightly decreased for Agency search Metadata biasing –We control how much each metadata field counts –We can turn up the bias as metadata quality improves
6 How GSA Ranks Documents: More Things We Can Control Best Bets –Like buying keywords from Google.com –Specific pages for specific keywords or phrases –Always featured at the top –Take effect immediately
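As a rough illustration of the Best Bets idea on this slide, the sketch below pins a hand-picked page above the ranked results when the query matches a keyword or phrase. It is not the GSA's actual mechanism; the table contents, the example.org URLs, and the function name apply_best_bets are all hypothetical.

```python
# Hypothetical Best Bets table: keyword or phrase -> hand-picked page.
BEST_BETS = {
    "air quality": "https://example.org/air-quality",
    "recycling": "https://example.org/recycling",
}


def apply_best_bets(query, ranked_results, best_bets=BEST_BETS):
    """Return the results with any matching Best Bet featured first."""
    key = query.strip().lower()
    featured = best_bets.get(key)
    if featured is None:
        # No Best Bet for this query: leave the ranking untouched.
        return list(ranked_results)
    # Feature the Best Bet at the top and avoid listing it twice.
    return [featured] + [url for url in ranked_results if url != featured]
```

Because the lookup is a plain table, a change to the table applies to the very next query, which matches the "take effect immediately" point above.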
7 How GSA Indexes Documents Continuous crawl Learns by experience Crawl rates tunable by host and time Requires some starting points (seeds) Restricted by Do Not Crawl list: a manually maintained list, in the GSA Admin UI, of URL patterns that the crawler should not visit. Respects robots.txt (in its own way)
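To make the Do Not Crawl idea concrete, here is a minimal sketch of checking a URL against a list of exclusion patterns before visiting it. It assumes simple shell-style wildcards, which is not the GSA's own pattern syntax; the patterns, the example.org URLs, and the function name may_crawl are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical entries; the real list is maintained in the GSA Admin UI.
DO_NOT_CRAWL = [
    "https://example.org/private/*",
    "https://example.org/*.tmp",
]


def may_crawl(url, patterns=DO_NOT_CRAWL):
    """Return False if the URL matches any Do Not Crawl pattern."""
    # fnmatchcase compares case-sensitively on every platform.
    return not any(fnmatchcase(url, pattern) for pattern in patterns)
```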
8 How EPA is implementing GSA Same Java webapp on the same servers Your search form will stay the same Area search won't change much Your XML search application may change (most won't) Smart, fast indexing, with some help Only indexing primary URLs
9 Implementing GSA: Your search form will stay the same Implemented Northern Light via an object-oriented Java application –We get to keep our code this time –6 weeks to change it, instead of 6 months –Nothing changes for client pages Two Model 7007 Google Search Appliances –Primary –Hot spare for failover –Parallel indexes 2,000,000 document license
10 Implementing GSA: Your search form URL is the same All common elements work the same Some obscure elements go away –weighted_search, search_crumbs Custom result templates work the same Advanced search works the same
11 Implementing GSA: Area Search Area search is here for now If you search by TSSMS –We will translate it on the fly to URL –We will only translate TSSMS to primary alias If you search by URL –Nothing changes… –…but aliases are your problem Contact Peter to test your area search
12 Implementing GSA: Your XML search app Parameters and templates are unchanged GSA response packet automatically transformed to original NL format Only 1,000 results are available for a single query 3 applications have been observed exceeding that limit
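For applications that page deep into a result set, the 1,000-result ceiling can be handled by clipping each page request. This is a minimal sketch only: the function name page_window and its parameters are hypothetical and are not the GSA's or the EPA webapp's actual interface.

```python
MAX_RESULTS = 1000  # per-query ceiling stated on this slide


def page_window(start, page_size, max_results=MAX_RESULTS):
    """Return (start, count) clipped to the ceiling; count is 0 past it."""
    if start < 0 or page_size < 0:
        raise ValueError("start and page_size must be non-negative")
    if start >= max_results:
        # Nothing beyond the ceiling can be retrieved for a single query.
        return start, 0
    return start, min(page_size, max_results - start)
```

An application that needs more than 1,000 documents would have to narrow its query (for example by area or date) rather than keep paging.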
13 Implementing GSA: Smart, fast indexing Continuous crawl – scans the website at least daily for new links If it's not linked, it won't be found Librarian looks daily for new content If all this doesn't work (quickly), tell the librarian Notes databases do not require Verity Views
14 Implementing GSA: Indexing your primary URL Search engines think different URLs are different documents This means duplicates in search results All non-primary aliases are being placed in the Do Not Crawl list
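The duplicate problem on this slide can be sketched as follows: two alias URLs look like two documents until both are rewritten to the primary form. The sketch assumes a simple prefix-to-prefix alias table; the table contents, the example.org URLs, and the function names are hypothetical and do not represent the TSSMS Alias Tool's real data or format.

```python
# Hypothetical alias table: non-primary URL prefix -> primary URL prefix.
ALIASES = {
    "https://example.org/oar/": "https://example.org/air/",
    "https://example.org/airoffice/": "https://example.org/air/",
}


def to_primary(url, aliases=ALIASES):
    """Rewrite a URL under a non-primary alias to its primary form."""
    for alias, primary in aliases.items():
        if url.startswith(alias):
            return primary + url[len(alias):]
    return url  # already primary, or not covered by the table


def do_not_crawl_entries(aliases=ALIASES):
    """List the non-primary alias prefixes to exclude from crawling."""
    return sorted(aliases)
```

Excluding the alias prefixes from the crawl means only the primary form of each page is indexed, so each document appears once in search results.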
15 What will our customers see? The same thing… at first. Breadcrumbs are gone…what were they, anyway? Folders replaced by Related Searches FAQ will come back Best Bets for top documents The document they're looking for!
16 What do we have to do? Plan our November 19 public access implementation Test (with your help) Implement Make it better
17 What do you have to do? Keep working on ROT Keep working on metadata Don't change your search form… …Area search will work, if you want it Tell us what you think
18 What are we leaving out … for now? EPA thesaurus –Contains only general terms –We will add EPA vocabulary Google's spellchecker –We'll use our own for now –We'll compare and use the winner RSS presentation – delivers only raw XML in search results, for now Recent searches
19 What's in our future? Marketplace of One Box modules –Faceted search? –Contextual search? –Business intelligence? More social media OneEPA integration Web CMS integration Advanced analytics Special collections Geographic search? GSA for intranet
20 Contact: Susan Fagan Fagan.Susan@epa.gov 202-566-2021