807 - TEXT ANALYTICS
Massimo Poesio
Lab 2: (Quick intro to) SOLR; Document clustering with MAHOUT
Clustering
Many packages
– CLUTO
– Weka
– MALLET
MAHOUT
– Supported by the Apache Software Foundation
– Industrial strength (builds on top of Hadoop)
– Includes libraries for reading in index files in different formats, including Weka .arff and Lucene index files
– We’ll use SOLR to produce Lucene index files
This Lab
Clustering with Mahout
Clustering with indices produced using Lucene: brief review of SOLR
MAHOUT
A machine learning framework
Built to be usable on top of Hadoop (for scalability)
What’s in it:
– Simple Matrix/Vector library
– Taste: collaborative filtering
– Clustering: Canopy / K-Means / Fuzzy K-Means / Mean-shift / Dirichlet
– Classifiers: Naïve Bayes, Complementary NB
– Evolutionary: integration with Watchmaker for fitness functions
Basic format
bin/mahout
bin/mahout kmeans
bin/mahout seqdirectory
INPUT FORMAT
IMF p. 155: ‘for clustering Mahout relies on data in org.apache.mahout.matrix.Vector format’
– Vector = a tuple of floats
– SparseVector vs DenseVector
Several libraries for creating Vectors from other formats
– Weka
– Apache Lucene
– programmatic
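To see why the SparseVector vs DenseVector distinction matters for text, here is a small Python sketch of the two storage schemes. It is an illustration only: the helper names (`to_sparse`, `to_dense`, `dot`) are made up for this example and are not Mahout classes or methods.

```python
# Dense vs sparse storage of the same document vector.
# Hypothetical helpers for illustration; not part of Mahout.

def to_sparse(dense):
    """Keep only the non-zero entries as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}

def to_dense(sparse, size):
    """Expand {index: value} back into a list of floats of length `size`."""
    return [sparse.get(i, 0.0) for i in range(size)]

def dot(a, b):
    """Dot product of two sparse vectors; iterates over the smaller one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

dense = [0.0, 2.0, 0.0, 0.0, 1.5, 0.0]
sparse = to_sparse(dense)
print(sparse)               # {1: 2.0, 4: 1.5}
print(dot(sparse, sparse))  # 6.25
```

A document typically uses only a small fraction of the vocabulary, so most entries of its vector are zero; the sparse form stores and iterates over the non-zero entries only.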
K-MEANS CLUSTERING
The Federalist papers example
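As a reminder of what k-means does to the document vectors, here is a minimal single-machine sketch in Python. It is plain Lloyd's algorithm with caller-supplied initial centroids; it is not Mahout's Hadoop implementation, and the function names are chosen for this example only.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to `point` (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(point, centroids[k]))

def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, centroids, max_iter=20):
    """Alternate assignment and centroid update until nothing changes."""
    for _ in range(max_iter):
        assign = [closest(p, centroids) for p in points]
        new = []
        for k, c in enumerate(centroids):
            members = [p for p, a in zip(points, assign) if a == k]
            # A cluster with no members keeps its previous centroid.
            new.append(mean(members) if members else c)
        if new == centroids:
            break
        centroids = new
    # Final assignment computed against the returned centroids.
    return centroids, [closest(p, centroids) for p in points]

points = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]]
centroids, labels = kmeans(points, [[1.0, 1.0], [9.0, 8.0]])
print(centroids)  # [[1.0, 1.5], [8.5, 8.0]]
print(labels)     # [0, 0, 1, 1]
```

The result depends on the initial centroids, which is one reason to try several starting points or a seeding step before k-means.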
CONVERSION
The Reuters example
For more sophisticated indexing …
… can use SOLR for preprocessing; Mahout knows how to read in Lucene-style indices
What is Solr?
Solr is an open source enterprise search server based on the Lucene Java search library
Solr runs in a Java servlet container such as Tomcat or Jetty
Solr is free software and a project of the Apache Software Foundation
Solr is a sub-project of Lucene and can be found at
By Mick England
Key Features
Advanced Full-Text search
Optimized for High Volume Web Traffic
Standards Based Open Interfaces – XML and HTTP
Comprehensive HTML Administration Interface
Server statistics exposed over JMX for monitoring
Scalability through efficient replication
Flexibility with XML configuration and Plugins
Push vs Crawl indexing method
Solr Clients
Solr can be integrated with, among others…
– Ruby
– PHP
– Java
– Python
– JSON
– Forrest/Cocoon
– C#: Deveel Solr Client or solrnet
– Coldfusion
– Drupal: apacheSolr project for Drupal
Why SOLR?
It can be used to preprocess documents and produce an index for them, which can then be used as their representation for clustering
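What "index as representation" amounts to can be sketched in a few lines of Python: a term dictionary assigns each term a column, and each document becomes a sparse vector of term counts. This is a toy stand-in, assuming lowercasing and whitespace splitting in place of the analysis SOLR/Lucene would actually perform; the function names are hypothetical.

```python
from collections import Counter

def build_dictionary(docs):
    """Map each distinct term to a column index, in sorted order."""
    terms = sorted({t for d in docs for t in d.lower().split()})
    return {t: i for i, t in enumerate(terms)}

def tf_vector(doc, dictionary):
    """Sparse term-frequency vector {column: count} for one document."""
    counts = Counter(doc.lower().split())
    return {dictionary[t]: float(n)
            for t, n in counts.items() if t in dictionary}

docs = ["the cat sat", "the dog sat on the mat"]
dictionary = build_dictionary(docs)
print(dictionary)
# {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(tf_vector(docs[0], dictionary))
# {5: 1.0, 0: 1.0, 4: 1.0}
```

A real index would also apply whatever tokenization, stop-word removal and stemming the schema specifies, which is the reason for using SOLR rather than a hand-written tokenizer.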
Indexing
Push vs Crawl
Schema.xml
Add documents
HTML interface
– Update
– Delete
– Commit
DataImportHandler
– For importing and indexing data from databases
By Mick England
SOLR: what you should do
(Installing SOLR on your laptop: see Section 0 of Lab script)
Posting docs to SOLR
Searching
Getting the indexed docs
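For the searching step, a query is an HTTP GET against SOLR's select handler. The Python sketch below only builds the URL. The base address `http://localhost:8983/solr` is an assumption (a default local install; yours may differ, see the Lab script), and the field names `text`, `id` and `title` are hypothetical and depend on your schema.xml. The parameters `q`, `rows`, `wt` and `fl` are standard SOLR query parameters.

```python
from urllib.parse import urlencode

def select_url(query, base="http://localhost:8983/solr", rows=10, fields=None):
    """Build a URL for SOLR's select handler.

    `base` is an assumed local install; adjust it to your own setup.
    """
    params = {"q": query, "rows": rows, "wt": "json"}
    if fields:
        # fl restricts which stored fields are returned for each hit.
        params["fl"] = ",".join(fields)
    return base + "/select?" + urlencode(params)

print(select_url("text:hadoop", rows=5, fields=["id", "title"]))
# http://localhost:8983/solr/select?q=text%3Ahadoop&rows=5&wt=json&fl=id%2Ctitle
```

The query `*:*` matches every document, which gives one way to get the indexed docs back out.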
Posting documents to SOLR
SOLR documents
– fields
schema.xml
SOLR Documents: fields
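A SOLR document is a set of named fields, and the XML update message wraps one or more such documents in an `<add>` element. The Python sketch below builds that message. The field names `id` and `text` and the sample values are hypothetical; the fields you can use are the ones declared in your schema.xml.

```python
import xml.etree.ElementTree as ET

def add_message(docs):
    """Serialize a list of {field name: value} dicts as a SOLR <add> message."""
    add = ET.Element("add")
    for d in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in d.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")

# Hypothetical document; field names must match your schema.xml.
print(add_message([{"id": "fed-10", "text": "Among the numerous advantages"}]))
# <add><doc><field name="id">fed-10</field><field name="text">Among the numerous advantages</field></doc></add>
```

The message is sent to SOLR's update handler by HTTP POST, and the new documents become searchable after a commit.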
Importing Lucene indices into MAHOUT
Use the lucene.vector option