807 - TEXT ANALYTICS Massimo Poesio Lab 2: (Quick intro to) SOLR Document clustering with MAHOUT.

Clustering
Many packages:
– CLUTO
– Weka
– MALLET
MAHOUT:
– Supported by the Apache Software Foundation
– Industrial strength (builds on top of Hadoop)
– Includes libraries for reading in index files in different formats, including Weka .arff and Lucene index files
– We’ll use SOLR to produce Lucene index files

This Lab
– Clustering with Mahout
– Clustering with indices produced using Lucene: brief review of SOLR

MAHOUT
A machine learning framework
Built to be usable on top of Hadoop, for scalability
What’s in it:
– Simple Matrix/Vector library
– Taste Collaborative Filtering
– Clustering: Canopy, K-Means, Fuzzy K-Means, Mean-shift, Dirichlet
– Classifiers: Naïve Bayes, Complementary NB
– Evolutionary: integration with Watchmaker for fitness functions

Basic format: bin/mahout followed by a command name, for example:
– bin/mahout kmeans
– bin/mahout seqdirectory

INPUT FORMAT
IMF p. 155: ‘for clustering Mahout relies on data in org.apache.mahout.matrix.Vector format’
– Vector = a tuple of floats
– SparseVector vs DenseVector
Several libraries for creating Vectors from other formats:
– Weka
– Apache Lucene
– programmatic
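
The difference between a dense and a sparse vector can be sketched in a few lines of Python. This is only an illustration of the idea under simple assumptions: the function names (to_sparse, to_dense, dot) and the sample document are hypothetical and are not Mahout classes or data from the lab.

```python
# Illustrative only: plain-Python stand-ins for the dense and sparse
# vector ideas. These are not Mahout classes.

def to_sparse(dense):
    """Keep only the non-zero entries as an {index: value} mapping."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}


def to_dense(sparse, size):
    """Expand an {index: value} mapping back into a full list of floats."""
    return [sparse.get(i, 0.0) for i in range(size)]


def dot(a, b):
    """Dot product of two sparse vectors, iterating over the smaller one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)


# Hypothetical document over a 10-term vocabulary that uses three terms.
doc_dense = [0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 3.0, 0.0]
doc_sparse = to_sparse(doc_dense)
print(doc_sparse)
```

Document vectors are usually sparse, because each document uses only a small part of the vocabulary, so storing just the non-zero entries tends to save both memory and time.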

K-MEANS CLUSTERING: the Federalist Papers example
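
As a reminder of what k-means does, here is a minimal pure-Python sketch of the standard assign-then-update loop. It is not Mahout's implementation, and the function names, the seed parameter, and the toy points are all assumptions made for illustration (the Federalist Papers data is not reproduced here).

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, iterations=20, seed=0):
    """Return (centroids, assignment) after at most `iterations` rounds."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as a start
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, assignment


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, assignment = kmeans(points, k=2)
print(assignment)
```

On document data the points would be term vectors rather than 2-D coordinates, and the distance measure is a choice to be made (cosine distance is a common alternative to Euclidean).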

CONVERSION: the Reuters example

For more sophisticated indexing, we can use SOLR for preprocessing; Mahout knows how to read in Lucene-style indices

What is Solr?
– Solr is an open source enterprise search server based on the Lucene Java search library
– Solr runs in a Java servlet container such as Tomcat or Jetty
– Solr is free software and a project of the Apache Software Foundation
– Solr is a sub-project of Lucene and can be found at http://lucene.apache.org/solr/
By Mick England

Key Features
– Advanced Full-Text search
– Optimized for High Volume Web Traffic
– Standards Based Open Interfaces: XML and HTTP
– Comprehensive HTML Administration Interface
– Server statistics exposed over JMX for monitoring
– Scalability through efficient replication
– Flexibility with XML configuration and Plugins
– Push vs Crawl indexing method

Solr Clients
Solr can be integrated with, among others:
– Ruby
– PHP
– Java
– Python
– JSON
– Forrest/Cocoon
– C# (Deveel Solr Client or solrnet)
– Coldfusion
– Drupal (apacheSolr project for Drupal)

Why SOLR? It can be used to preprocess documents and produce an index for them, which can then be used as their representation for clustering

Indexing
– Push vs Crawl
– Schema.xml
– Add documents
– HTML interface: Update, Delete, Commit
– DataImportHandler: for searching databases
By Mick England

SOLR: what you should do
(Installing SOLR on your laptop: see Section 0 of Lab script)
– Posting docs to SOLR
– Searching
– Getting the indexed docs

Posting documents to SOLR
– SOLR documents: fields
– schema.xml

SOLR Documents: fields
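
A SOLR document is a set of named fields, and documents are posted inside an XML add message. The sketch below builds such a message with Python's standard library. The helper name solr_add_xml and the field names id and text are hypothetical; the fields actually accepted depend on the schema.xml of your installation.

```python
import xml.etree.ElementTree as ET


def solr_add_xml(docs):
    """Build a Solr XML <add> message from a list of {field: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # One <field name="..."> element per field of the document.
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    # encoding="unicode" returns a str with no XML declaration.
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; the field names must exist in schema.xml.
message = solr_add_xml([{"id": "doc1", "text": "hello world"}])
print(message)
```

The resulting string would then be sent in an HTTP POST to the update handler of the SOLR server, followed by a commit; the exact URL depends on how SOLR was installed (see the Lab script).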

Importing Lucene indices into MAHOUT: use the lucene.vector option
