Introduction to YouSeer

Partha Mukherjee

Outline
- Overview
- YouSeer components
- Heritrix
- Solr
- Demo

Overview
- YouSeer is a complete open source search engine, available on SourceForge, that integrates the open source crawler Heritrix with the open source indexer Solr/Lucene.
- It is Java-based and runs successfully on Windows.
Requirements
- 512 MB RAM and 6.5 GB of hard disk space
- Java 1.6 (Java 1.5 also works)

Search Engine: Basic Workflow
Courtesy of Saurabh Kataria

Advantages of YouSeer
- Built on top of scalable components
- Tested on 23M documents, while Solr and Heritrix can scale to billions
- Very flexible and easy to extend
- Modifying the index and the ingestion module is easy
- The crawler supports complicated crawling policies

YouSeer Components
Heritrix:
- The Internet Archive's crawler
- Reported to scale up to 1B documents
- Written in Java, with a web interface
Apache Solr:
- Open source enterprise search server based on Lucene
- Has a REST-like API
- Supports caching, distributed search, and index replication

YouSeer Architecture
[Architecture diagram. Labeled components: WWW, Heritrix, File System, Storage, Middleware, Apache Solr, Apache Tomcat, DB, Cache, Request]

Heritrix Workflow
1) Choose a URI from among all those scheduled
2) Fetch that URI
3) Analyze or archive the results
4) Select discovered URIs of interest, and add them to those scheduled
5) Note that the URI is done, and repeat
Source: "An Introduction to Heritrix: An open source archival quality web crawler", Gordon Mohr et al.
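
The five steps above can be sketched as a simple loop. This is only an illustration in Python, not Heritrix's own code (Heritrix is written in Java). The fetch and extract_links callables and the in-memory FAKE_WEB dictionary are hypothetical stand-ins, so the sketch runs without network access.

```python
from collections import deque


def crawl(seeds, fetch, extract_links, max_uris=100):
    """Toy crawl loop mirroring the five workflow steps (not Heritrix's code)."""
    scheduled = deque(seeds)   # URIs waiting to be crawled
    seen = set(seeds)          # every URI ever scheduled, to avoid re-adding
    archive = {}               # URI -> fetched content
    while scheduled and len(archive) < max_uris:
        uri = scheduled.popleft()             # 1) choose a scheduled URI
        content = fetch(uri)                  # 2) fetch that URI
        archive[uri] = content                # 3) archive the result
        for link in extract_links(content):   # 4) schedule discovered URIs
            if link not in seen:
                seen.add(link)
                scheduled.append(link)
        # 5) the URI is done (it has left the queue); repeat
    return archive


# Hypothetical in-memory "web": a page's content is simply its list of links.
FAKE_WEB = {"a": ["b", "c"], "b": ["a"], "c": []}
result = crawl(["a"], fetch=lambda u: FAKE_WEB.get(u, []), extract_links=lambda c: c)
```

The seen set is what keeps the loop from scheduling page "a" again when page "b" links back to it.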

Heritrix Crawl Result
- By default, Heritrix writes everything it crawls to disk as Internet Archive ARC files
- By default, the ARC files are written compressed, using gzip
- Each record (which contains one document) is gzipped separately
- All gzipped records are concatenated to make up a file of multiple gzipped members
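
The "multiple gzipped members" layout can be tried out with Python's standard gzip module, which reads concatenated members as one continuous stream. This is a small sketch of that gzip property only. It does not parse the ARC record format, and the record contents are made-up placeholders.

```python
import gzip
import io


def concat_gzip_members(records):
    """Compress each record on its own and concatenate the gzip members."""
    return b"".join(gzip.compress(record) for record in records)


def read_all(data):
    """Decompress a multi-member gzip stream into a single byte string."""
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as fh:
        return fh.read()


# Placeholder records standing in for crawled documents.
records = [b"record one\n", b"record two\n"]
blob = concat_gzip_members(records)
restored = read_all(blob)
```

Because each record is compressed independently, a reader that knows a record's byte offset can decompress it without touching the rest of the file.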

Apache Solr
- Very popular search server built on Lucene
- Easy to configure and optimize
- All modifications are made in the XML files; there is no need to touch the code
- The index has a schema, similar to a database schema
- Think of the index as a table in a database: you have to define the columns

Solr Schema Example
<field name="url" type="string" indexed="true" stored="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="keywords" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="creationDate" type="date" indexed="true" stored="true"/>
<field name="rating" type="sint" indexed="true" stored="true"/>
<field name="published" type="boolean" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="true" />
<field name="all" type="text" indexed="true" stored="true" multiValued="true"/>
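
To make the "index as a table" analogy concrete, the field definitions can be read as a list of columns. The sketch below uses Python's standard XML parser on three of the fields above. The enclosing fields element and the list_columns helper are assumptions added for illustration and are not part of YouSeer.

```python
import xml.etree.ElementTree as ET

# Three field definitions from the schema example, wrapped in a single root
# element so that the fragment parses as one XML document.
SCHEMA_FRAGMENT = """
<fields>
  <field name="url" type="string" indexed="true" stored="true"/>
  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="keywords" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
</fields>
"""


def list_columns(schema_xml):
    """Return (name, type, is_multivalued) for every field definition."""
    root = ET.fromstring(schema_xml.strip())
    return [
        (f.get("name"), f.get("type"), f.get("multiValued", "false") == "true")
        for f in root.iter("field")
    ]


columns = list_columns(SCHEMA_FRAGMENT)
```

Each tuple plays the role of a column definition; a multiValued field is the main departure from a plain database table, since one document may hold several values for it.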

Solr Documents
Solr accepts well-formed XML documents
<add>
  <doc>
    <field name="URL"></field>
    <field name="title">CNN Breaking News – Obama wins</field>
    <field name="content">Barack Obama is the 44th president of the USA</field>
    <field name="pubDate"> T23:59:59.999Z</field>
  </doc>
</add>
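
A document in this add/doc/field shape can be generated with an XML library rather than by string concatenation, so that characters such as & and < are escaped correctly. The following is a minimal sketch; the build_add_xml helper and the sample field values are hypothetical, and sending the result to Solr is left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Serialize a dict of field name -> value as a Solr <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = value  # special characters are escaped on serialization
    return ET.tostring(add, encoding="unicode")


# Sample values chosen to include characters that need escaping.
xml_text = build_add_xml({"title": "A & B <news>", "content": "Example body text"})
```

Parsing xml_text back yields the original title unchanged, which is the property that hand-built strings tend to break.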

YouSeer workflow
- Waits for the crawled documents to be written
- Iterates over the compressed files and processes the documents
- Extracts the textual content of each document and parses its metadata
- Generates an XML file as output
- Each custom extractor appends its result to this file
- This XML file is submitted to the index
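
The idea that each extractor appends its result to the output XML can be sketched as a chain of small functions. This is an illustration under stated assumptions, not YouSeer's actual Java code: the extractor functions, the regular expressions, and the sample page are all hypothetical, and the field names are borrowed from the schema example.

```python
import re
import xml.etree.ElementTree as ET


def extract_title(html):
    """Hypothetical extractor: pull the <title> text, if there is one."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return {"title": match.group(1).strip()} if match else {}


def extract_content(html):
    """Hypothetical extractor: strip tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    return {"content": " ".join(text.split())}


EXTRACTORS = [extract_title, extract_content]


def to_solr_doc(url, html, extractors=EXTRACTORS):
    """Build one <doc> element; every extractor appends its own fields."""
    doc = ET.Element("doc")
    ET.SubElement(doc, "field", name="url").text = url
    for extractor in extractors:
        for name, value in extractor(html).items():
            ET.SubElement(doc, "field", name=name).text = value
    return doc


page = "<html><title>Hi</title><body><p>Hello world</p></body></html>"
doc = to_solr_doc("http://example.org/", page)
fields = {f.get("name"): f.text for f in doc}
```

Adding a new kind of metadata then means adding one function to the EXTRACTORS list, without changing the code that assembles the document.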

Demo: Configuration
- The Solr schema is already configured in your installation
- Solr is installed on Tomcat
- By default, the Heritrix web interface listens on port 8080, the same port as the Apache Tomcat server, so change it to another port number, e.g. ./heritrix -p 9000

Demo
- Download the virtual machine image from:
- Unzip fedora-11-i386.zip
- The virtual image is a Linux VMware image
- To run the VM, you need to download and install VMware Player from:
- Double-click the VMware virtual machine configuration icon

Demo
- Log in to YouSeer with the password "heritrixsolr".
- You are now in a virtual Linux environment running inside Windows.
- When leaving the VM environment:
  - Log out from youseer ("youseer -> quit")
  - Shut down the VM ("shutdown")
- Press Ctrl + Alt to work on your local machine.

Demo
About to start Heritrix (crawler)!
- In the VM, open a terminal
- Go to the apps directory (cd apps)
- There you will find the solr, tomcat, heritrix, etc. applications
- Don't forget to start the Solr server before running Heritrix
- Go to apache-solr…/example/
- Locate the jar file "start.jar" and run it
- Solr should keep running the whole time

Demo
- Now open another terminal, or another tab in the same terminal
- Go to heritrix under /home/apps
- Run Heritrix with the following command-line arguments: ./heritrix -p XXXX --admin=nameX:passwordX
- Now open the browser in the VM and type the URL
- The Heritrix UI appears (username = nameX, password = passwordX)

Demo: Heritrix Heritrix log in screen

Demo: Heritrix

Demo: Heritrix Enter the Seed URLs

Demo: Heritrix
Configure first job
- The most important parameter is the user agent, under the configuration settings
- Enter a valid URL and email address, and use your OWN email address
- Do not run more than 5 threads, to avoid overloading the machine and crashing the system

Demo: Heritrix Change the Agent URL

Demo: Features of Heritrix

Demo: More features

Demo: Heritrix

Demo
- ARC files are written to: ~/crawler/heritrix /jobs/JOB-NAME/arcs
- To start Tomcat, enter start-tomcat
- Solr will start automatically
- The YouSeer ingestion module (middleware) is located under: ~/youseer/release
- Add a folder entry to the Apache web server configuration file
  - Retrieve cached copies of documents from the ARC files
- Use the URL of Solr to post the documents
- Specify the number of worker threads used to process the documents
- java -jar YouSeer.jar [IndexURL] [Path_ARCfiles] [Cached_virtual_Folder] [Number_of_Threads] [wait_Time]

Demo
To index documents crawled by Heritrix:
- Navigate to ~/youseer/release
- Run: java -jar YouSeer.jar [IndexURL] /absolute/path/to/arc/files /cachingDirectory 1 0
The arguments are, in order:
- Solr URL
- The full path to the ARC files
- The virtual directory that maps to the cached files
- Number of threads (please keep it below 5)
- Waiting time between retries
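
When the ingestion step is scripted, the five arguments can be assembled and checked before the command is launched. The sketch below only builds the argument list in the order given above and does not run anything. The build_youseer_command helper, its parameter names, and the Solr URL shown are placeholders for illustration, not part of YouSeer.

```python
def build_youseer_command(index_url, arc_path, cache_dir, threads=1, wait_time=0):
    """Return the argument list for YouSeer.jar, in the order given above."""
    if not 1 <= threads < 5:
        # The slides ask to keep the thread count below 5.
        raise ValueError("number of threads must be between 1 and 4")
    if wait_time < 0:
        raise ValueError("waiting time must not be negative")
    return [
        "java", "-jar", "YouSeer.jar",
        index_url,        # Solr URL (placeholder value in the example below)
        arc_path,         # full path to the ARC files
        cache_dir,        # virtual directory that maps to the cached files
        str(threads),     # number of threads
        str(wait_time),   # waiting time between retries
    ]


cmd = build_youseer_command(
    "http://solr.example/update",   # placeholder, replace with your Solr URL
    "/absolute/path/to/arc/files",
    "/cachingDirectory",
)
```

The resulting list can be passed to subprocess.run from the ~/youseer/release directory; keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.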

Comments
- YouSeer tracks which ARC files have been processed in a database; the default name is submitted.db
- If you want to re-ingest the documents, update the submitted.db file
- Map the virtual directory within the Tomcat directory
- Execute $
- <Context path="/cached" docBase="/heritrix /jobs/JOB_NAME/arcs" crossContext="false" debug="0" reloadable="true"/>
- The search interface:

Test case (http://pike.psu.edu)

Test Case(:pike)

References
Want to download separately?
–crawler/files/archive-crawler%20(heritrix%201.x)/

THANK YOU
