GOAT SEARCH Revorg GOAT Search Solution (Powered by Lucene)

Presentation transcript:

About Me
Grover Fields
• Revorg, LLC (Owner)
• M.S. Information System (Troy University)
• B.S. Industrial Engineering (Florida A&M University)
• Stanford Project Management Courses

About Me
• 10+ years of development, analysis, and implementation
• 10+ years of ColdFusion experience
• 2+ years of Java experience
• CommonSpot, StrongMail, ClickFix (Developer)
• Web site:

Agenda
• What? What can we do with GOAT?
• Why? Why do we want to use GOAT and not Verity?
• How? How do we do that?
• Conclusion and alternative solutions

What
• What is a search engine?
  - Builds an index on text
  - Answers queries using that index, a la Verity
  - "But we have a database already"
• What does a search engine offer?
  - Scalability
  - Relevance ranking
  - Tweaking
  - Integrates different sources (web pages, files, databases)

What is a search engine? (cont.)
• Works on words, not on substrings
  - auto != automatic, automobile
• Indexing process:
  - Convert document
  - Extract text and metadata
  - Normalize text
  - Write (inverted) index
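
The "words, not substrings" point and the last two indexing steps can be sketched in a few lines. This is a minimal toy model in plain Python, not Lucene's API: the function names `normalize`, `build_index`, and `search`, and the three sample documents, are assumptions made for illustration only.

```python
import re
from collections import defaultdict

def normalize(text):
    """Normalize text: lowercase it and split it into whole-word terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Write an inverted index: each term maps to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return index

# Hypothetical sample documents (already converted to plain text).
docs = {
    1: "Auto parts for sale",
    2: "Automatic transmission repair",
    3: "Automobile insurance quotes",
}
index = build_index(docs)

def search(term):
    """Look the term up in the index; no document text is scanned."""
    return sorted(index.get(term.lower(), set()))

# A substring scan, by contrast, matches every document containing "auto".
substring_hits = sorted(d for d, text in docs.items() if "auto" in text.lower())
```

Here `search("auto")` finds only the document that contains the word "auto", while the substring scan also picks up "automatic" and "automobile".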

Apache Lucene Overview
• Lucene Java 2.4
  - A high-performance, full-featured text search engine library written entirely in Java.
  - It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
• No GUI

Apache Lucene Overview
• Java library for indexing and searching
• No dependencies
• Works with Java 1.4 or later
• Input for indexing: Document objects
  - Each document: a set of Fields, each with a field name and field content
• Stores its index as files on disk or in memory
• No document converters
• No web crawler

Lucene Java users
• HBCU.info
• LinkedIn
• IBM OmniFind Yahoo! Edition
• Technorati.com
• Eclipse
• Monster.com
• …

Lucene Java Summary
• Java library for indexing and searching
• Lightweight / no dependencies
• Powerful and fast and tested!
• No document conversion
• No GUI

Why?
• Cost of Enterprise Search Solution
• Need for search speed
• Java projects to work on
  - Things to do

Verity Limitations
• 10,000 documents for ColdFusion Developer Edition
• 125,000 documents for ColdFusion Standard Edition
• 250,000 documents for ColdFusion Enterprise Edition
What do developers do in a shared hosting environment? Is it possible for the hosting company to limit the number of documents per Web site?

T-SQL Limitations?
• Search for "Yahoo" on my blog
    SELECT entry.id
    FROM tbl_mango_entry AS entry
    INNER JOIN tbl_mango_post AS post ON entry.id = post.id
    WHERE entry.blog_id = 'default'
      AND (entry.title LIKE '%yahoo%'
           OR entry.content LIKE '%yahoo%'
           OR entry.excerpt LIKE '%yahoo%')
      AND post.posted_on <= GETDATE()
      AND entry.status = 'published'
    ORDER BY post.posted_on DESC
• Multiply that by 10, 100, 500, or 1,000 users/hr?

T-SQL Limitations?
• Full table scan = 1 THING
• PERFORMANCE KILLER!!!
• No search sorting
  - RDBMS isn’t designed to do this but allows it
• Use the right tools!
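
The full-table-scan point can be checked against a query planner. The sketch below uses Python's built-in `sqlite3` module as a stand-in, which is an assumption: the deck targets T-SQL on MSSQL, and SQLite's planner is not SQL Server's, but the behavior of a leading-wildcard `LIKE` is the same in spirit. The table `entry`, the index name, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE INDEX idx_entry_title ON entry (title)")
conn.executemany(
    "INSERT INTO entry (title) VALUES (?)",
    [("yahoo",), ("Searching with Yahoo tools",), ("lucene",)],
)

def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)

# A leading wildcard cannot use the index ordering, so every row is visited.
wildcard_plan = plan("SELECT id FROM entry WHERE title LIKE '%yahoo%'")

# An exact match can seek directly into the index.
exact_plan = plan("SELECT id FROM entry WHERE title = 'yahoo'")

print(wildcard_plan)
print(exact_plan)
```

The exact plan text varies by SQLite version, but the wildcard query is reported as a scan and the exact match as an index search.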

How?
• GOAT Search Solution
  - Lucene
  - ColdFusion MX 8
    - MX is fine but GUI needs to be rolled back
  - Commons IO 1.4
• Simply package .jar files
• Simple Web-based GUI

How?
• Macromedia JDBC Drivers
  - Same drivers that ColdFusion uses
  - No additional drivers to install
• Supports RDBMS ONLY
  - MSSQL
  - MySQL
  - Oracle
• No file system support (yet)

Basics?
• Indexing extracts both meaning and structure from unstructured information by indexing each document.
• The index contains a complete list of all the words used in a given document, along with metadata about that document.
• Lucene creates a collection that normalizes both the structured and unstructured data.
• Search requests then check these collections rather than scanning the actual documents and database fields.
• This provides a faster search of information, regardless of the file type and whether the source is structured or unstructured.

Basics?
• Collection
  - A special database created by Lucene that contains metadata that describes the documents
• Documents
  - A sequence of fields
  - Similar to a row in a database table (Row 1, Row 2, etc.)
• Fields
  - A named sequence of terms
  - Similar to a column in a table (Primary Key, Column 1)
• Terms
  - A term is a string
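
The collection / document / field / term hierarchy above maps naturally onto a small data model. The following is a rough sketch in plain Python rather than Lucene's own classes: the `Field` and `Document` dataclasses, the `row_to_document` helper, and the sample rows are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Field:
    name: str          # plays the role of a column name
    terms: List[str]   # a named sequence of terms; each term is a string

@dataclass
class Document:
    fields: List[Field]  # a sequence of fields, like one table row

def row_to_document(row):
    """Turn one database row (a dict) into a Document of Fields."""
    return Document(
        [Field(name, str(value).lower().split()) for name, value in row.items()]
    )

# Hypothetical rows, as they might come back from a JDBC query.
rows = [
    {"id": 1, "title": "Lucene basics"},
    {"id": 2, "title": "ColdFusion and Lucene"},
]

# The collection is simply the set of documents built from those rows.
collection = [row_to_document(row) for row in rows]
```

Each row becomes one document, each column becomes one field, and the column value is broken into lowercase terms.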

Knowledge?
• Index
  - A special database created by Lucene that contains metadata that describes the documents
• Query syntax
  - Similar to Google’s advanced search: field:value
  - E.g., resume:coldfusion
• Results
  - Primary key list of values
  - XML based on the document
  - CFX tag integration
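
A field:value query that returns a list of primary keys can be illustrated as follows. This is a toy sketch in plain Python, not GOAT's or Lucene's actual query parser: the `search` function, the `(field, term)` index layout, and the sample rows are assumptions, and only the single `field:value` form from the slide is handled.

```python
from collections import defaultdict

# Hypothetical rows; "id" plays the role of the primary key.
rows = [
    {"id": 101, "resume": "Senior ColdFusion developer", "city": "Atlanta"},
    {"id": 102, "resume": "Java and Lucene engineer", "city": "Tallahassee"},
    {"id": 103, "resume": "ColdFusion and Java consultant", "city": "Atlanta"},
]

# Index layout: (field name, term) -> set of primary keys.
index = defaultdict(set)
for row in rows:
    for name, value in row.items():
        if name == "id":
            continue
        for term in str(value).lower().split():
            index[(name, term)].add(row["id"])

def search(query):
    """Resolve a single 'field:value' query to a sorted list of primary keys."""
    field_name, _, value = query.partition(":")
    key = (field_name.strip().lower(), value.strip().lower())
    return sorted(index.get(key, set()))

matches = search("resume:coldfusion")
```

The returned keys could then be used to fetch the full rows from the database, which keeps the index small and the database as the source of record.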

Alternative Solutions for Search
• Commercial vendors:
  - FAST, $100k
  - Autonomy, $80k
  - Google, $50k
• Commercial search engines based on Lucene
  - IBM OmniFind Yahoo! Edition
• RDBMS with integrated search
  - Oracle
  - MySQL
  - MSSQL
  - PERFORMANCE KILLERS

RoadMap
A set of guidelines, instructions, or explanations: "wrote an ethics code as a road map for the behavior of elected officials."
• Overhaul Java programming (still novice)
• Integrate with other products
  - Aperture
  - Nutch
  - Solr
• File system integration
  - .txt, .pdf, .doc, .ppt, etc.
• Geospatial-based searches
  - E.g., all jobs within a 50-mile radius
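
The "all jobs within a 50-mile radius" item on the roadmap could be approached with a great-circle distance filter. The sketch below uses the haversine formula in plain Python. It is one possible approach, not the planned GOAT implementation, and the function names, the Earth-radius constant, and the job coordinates are assumptions for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation

def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def within_radius(origin, jobs, radius_miles=50):
    """Return the ids of jobs no farther than radius_miles from origin."""
    lat, lon = origin
    return [
        job["id"] for job in jobs
        if miles_between(lat, lon, job["lat"], job["lon"]) <= radius_miles
    ]

# Hypothetical job locations around a hypothetical origin.
origin = (30.0, -84.0)
jobs = [
    {"id": 1, "lat": 30.5, "lon": -84.0},  # half a degree north
    {"id": 2, "lat": 31.0, "lon": -84.0},  # a full degree north
    {"id": 3, "lat": 30.0, "lon": -84.5},  # half a degree west
]
nearby = within_radius(origin, jobs)
```

One degree of latitude is roughly 69 miles, so the job a full degree north falls outside the radius while the two half-degree offsets fall inside it. In practice a text index would first narrow the candidates and the distance filter would run over that smaller set.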
