807 - TEXT ANALYTICS Massimo Poesio Lab 2: (Quick intro to) SOLR Document clustering with MAHOUT.


1 807 - TEXT ANALYTICS Massimo Poesio Lab 2: (Quick intro to) SOLR Document clustering with MAHOUT

2 Clustering
Many packages:
– CLUTO
– Weka
– MALLET
MAHOUT:
– Supported by the Apache Software Foundation
– Industrial strength (builds on top of Hadoop)
– Includes libraries for reading in index files in different formats, including Weka .arff and Lucene index files
– We’ll use SOLR to produce Lucene index files

3 This Lab
– Clustering with Mahout
– Clustering with indices produced using Lucene: brief review of SOLR

4 MAHOUT
A machine learning framework
Built to be usable on top of Hadoop – scalability
What’s in it:
– Simple Matrix/Vector library
– Taste Collaborative Filtering
– Clustering: Canopy / K-Means / Fuzzy K-Means / Mean-shift / Dirichlet
– Classifiers: Naïve Bayes, Complementary NB
– Evolutionary: integration with Watchmaker for fitness function

5 Basic format
bin/mahout
bin/mahout kmeans
bin/mahout seqdirectory

6 INPUT FORMAT
IMF p. 155: ‘for clustering Mahout relies on data in org.apache.mahout.matrix.Vector format’
– Vector = a tuple of floats
– SparseVector vs DenseVector
Several libraries for creating Vectors from other formats:
– Weka
– Apache Lucene
– programmatic
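The sparse/dense distinction above can be illustrated with a minimal sketch. This is plain Python, not Mahout's Vector classes; the 8-term vocabulary and the helper names `to_sparse` and `to_dense` are assumptions made for the example.

```python
from typing import Dict, List


def to_sparse(dense: List[float]) -> Dict[int, float]:
    """Keep only the non-zero entries, keyed by their position."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}


def to_dense(sparse: Dict[int, float], size: int) -> List[float]:
    """Expand an {index: value} mapping back into a full-length list."""
    dense = [0.0] * size
    for index, value in sparse.items():
        dense[index] = value
    return dense


# Hypothetical document over an 8-term vocabulary: only terms 1 and 6 occur.
doc_dense = [0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
doc_sparse = to_sparse(doc_dense)
print(doc_sparse)
```

Document vectors over a large vocabulary are mostly zeros, which is why a sparse representation is usually the natural choice for text.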

7 K-MEANS CLUSTERING
The Federalist papers example
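As a reminder of what the k-means step computes, here is a minimal pure-Python sketch. It is not Mahout's implementation and does not use the Federalist papers data; the toy 2-D points, the value of `k`, and all function names are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: Sequence[Point]) -> Point:
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))


def kmeans(points: Sequence[Point], k: int, iterations: int = 20,
           seed: int = 0) -> Tuple[List[Point], List[int]]:
    """Alternate assignment and centroid update until nothing changes."""
    rng = random.Random(seed)
    centroids = rng.sample(list(points), k)  # initial centroids: k random points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            new_centroids.append(mean(members) if members else centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

In the lab the points would be document vectors rather than 2-D coordinates, and the distance measure is a configurable choice.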

8 CONVERSION
The Reuters example

9 For more sophisticated indexing … … can use SOLR for preprocessing; Mahout knows how to read in Lucene-style indices

10 What is Solr?
– Solr is an open source enterprise search server based on the Lucene Java search library.
– Solr runs in a Java servlet container such as Tomcat or Jetty.
– Solr is free software and a project of the Apache Software Foundation.
– Solr is a sub-project of Lucene and can be found at http://lucene.apache.org/solr/
By Mick England

11 Key Features
– Advanced Full-Text search
– Optimized for High Volume Web Traffic
– Standards Based Open Interfaces – XML and HTTP
– Comprehensive HTML Administration Interface
– Server statistics exposed over JMX for monitoring
– Scalability through efficient replication
– Flexibility with XML configuration and Plugins
– Push vs Crawl indexing method

12 Solr Clients
Solr can be integrated with, among others…
– Ruby
– PHP
– Java
– Python
– JSON
– Forrest/Cocoon
– C# or Deveel Solr Client or solrnet
– Coldfusion
– Drupal or apacheSolr project for Drupal

13 Why SOLR?
It can be used to preprocess documents and produce an index for them, which can then be used as the representation of the documents for clustering
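To make "index as representation" concrete, the sketch below builds a toy inverted index and reads a per-document term-count vector back out of it. It is a simplification, not how Lucene stores its index; whitespace tokenisation, the two sample documents, and the function names are assumptions.

```python
from collections import defaultdict
from typing import Dict


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Map each term to {doc_id: number of occurrences in that document}."""
    index: Dict[str, Dict[str, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():  # naive whitespace tokeniser
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return dict(index)


def document_vector(index: Dict[str, Dict[str, int]], doc_id: str) -> Dict[str, int]:
    """Collect the term counts of one document from the index."""
    return {term: postings[doc_id]
            for term, postings in index.items() if doc_id in postings}


# Hypothetical two-document collection.
docs = {"d1": "solr indexes text", "d2": "mahout clusters text text"}
index = build_inverted_index(docs)
print(document_vector(index, "d2"))
```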

14 Indexing
Push vs Crawl
Schema.xml
Add documents
HTML interface
– Update
– Delete
– Commit
DataImportHandler
– For searching databases
By Mick England

15 SOLR: what you should do
(Installing SOLR on your laptop: see Section 0 of Lab script)
– Posting docs to SOLR
– Searching
– Getting the indexed docs

16 Posting documents to SOLR
SOLR documents – fields
schema.xml

17 SOLR Documents: fields
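A Solr document is a set of named fields, and an XML update message wraps them as `<add><doc><field name="…">…</field></doc></add>`. The sketch below only builds such a message; it does not send it. The field names `id` and `title` and the sample values are assumptions, and in practice every field name has to be declared in schema.xml.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_add_xml(fields: Dict[str, str]) -> str:
    """Build a Solr <add> message containing one document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value  # ElementTree escapes special characters on output
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; the field names must match those in schema.xml.
xml_message = build_add_xml({"id": "doc-1", "title": "Federalist No. 10"})
print(xml_message)
```

The message would then be pushed to Solr over HTTP and followed by a commit so that the document becomes searchable.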

18 Importing Lucene indices into MAHOUT
Use the lucene.vector option

