Text processing Team 3 김승곤 박지웅 엄태건 최지헌 임유빈. Outline 1.Introduction 2.Text Processing 3.Index Techniques in Database 4.Index Techniques in Wireless Network.

Text processing Team 3 김승곤 박지웅 엄태건 최지헌 임유빈

Outline 1.Introduction 2.Text Processing 3.Index Techniques in Database 4.Index Techniques in Wireless Network 5.Text Processing Operations 6.Apache Lucene 7.Apache Solr 8.Demo 5/18 5/20

Text Processing Operations ●Text processing operations o Classification o Clustering o Part-of-speech tagging o Parsing o Sentiment analysis o Language modeling o Named entity recognition o etc. ●Why is indexing important to these operations?

Classification ●Classification o Automatically classify items into correct classes o Supervised learning ●Text classification o Classify documents using text features o Used as a common approach to many text processing operations ●Examples o Spam filter o Email routing o Language identification o etc.

Classification ●Approaches o Probabilistic  e.g. Naive Bayes o Geometric  e.g. Support vector machine o Artificial neural network o Decision tree o etc.
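As an illustration of the probabilistic approach, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing, applied to the spam-filter example. The training sentences and the function names train and classify are made up for this sketch; a real system would use a library such as scikit-learn and far more data.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count classes and per-class word occurrences from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify(text, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total)        # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            if w in vocab:                                    # skip unseen words
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical toy training set
model = train([
    ("win cash now", "spam"),
    ("cheap cash prize", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
])
print(classify("cash prize now", model))
print(classify("lunch meeting", model))
```

The per-class word counts are the only statistics the classifier needs, which is one reason word-level indexes are useful for this kind of operation.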

Clustering ●Unsupervised learning ●Based on similarity values, make groups of similar items ●Text clustering o Large volume o Sparse data o e.g. grouping documents sharing a same topic Image from http://analyticstraining.com/2011/cluster-analysis-for-business/

Clustering ●Approaches o Mainly statistical o Hierarchical o Partitional o … ●Examples o k-means o affinity propagation Images from http://scikit-learn.org/stable/modules/clustering.html

Language Modeling ●A method for representing language in a machine-comprehensible form ●Approaches o Probabilistic language model  Uses the probability of a sequence of words o Recently, neural language models are widely used  Use a neural network to map words to real-valued vectors
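A probabilistic language model can be sketched with bigram counts: the probability of a sentence is approximated as the product of P(word | previous word). The two-sentence corpus and the function names below are assumptions for illustration, and no smoothing is applied, so any unseen bigram gives probability zero.

```python
from collections import Counter

def train_bigram(sentences):
    """Count context words and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(words[:-1])              # every word that has a successor
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams):
    """Unsmoothed maximum-likelihood probability of a sentence."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

# Hypothetical toy corpus
unigrams, bigrams = train_bigram(["i like coffee", "i like tea"])
p_seen = sentence_probability("i like coffee", unigrams, bigrams)
p_unseen = sentence_probability("coffee like i", unigrams, bigrams)
print(p_seen, p_unseen)
```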

POS Tagging ●Every word has its part-of-speech tag o noun, verb, adjective, adverb, … o e.g. What is the airspeed of an unladen swallow?  What/WP is/VBZ the/DT airspeed/NN of/IN an/DT unladen/JJ swallow/NN o e.g. 아버지가 방에 들어가신다 (Korean: "Father enters the room")  아버지/NNG 가/JKS 방/NNG 에/JKB 들어가/VV 시/EPH 다/EFN ●Approaches o classifier, sequence model, rule-based, ... ●A partly easy problem o Many words are unambiguous o Even the simplest baseline (always assign each word its most frequent tag) reaches about 90% accuracy o State-of-the-art methods reach about 97% accuracy

Parsing ●Syntactic structure o Constituency (phrase structure) o Dependency ●Parsing solves ambiguity of sentences ●Approaches o Pre-1990: by defining symbolic grammar o After that: statistical method  due to the rise of annotated data (e.g. Penn Treebank)

Sentiment Analysis ●Detection of attitudes ●Types of sentiment analysis o Whether the attitude is positive/negative o Rank the attitude from 1 to 5 o Or more complex types

Sentiment Analysis ●Approaches o Classification o Regression o Using a lexicon (e.g. WordNet) ●Why sentiment analysis? o For companies, to know consumers’ opinions on a product o For politicians, to know people’s opinions on a candidate or an issue ●Also known as o Opinion extraction, opinion mining, sentiment mining, subjectivity analysis

Named Entity Recognition ●Important sub-task of information extraction ●Find and classify names in text ●Approaches o Sequence model o Lexicon o Classification

Why Index? ●Many operations are based on statistical approaches o They need a large number of documents ●Retrieving documents by the words they contain is a very frequent task o The word is the common unit of many operations

References 1.https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html 2.http://www.nltk.org/api/nltk.tag.htm 3.Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003): 1137-1155. 4.http://scikit-learn.org/stable/documentation.html

Apache Lucene

●Lucene? o Open-source Java full-text search “Library” o Makes it easy to add search functionality to an application or website o Does not care about the source of the data, its format, or even its language  as long as you can convert it to text ●Main Capabilities o Creating, maintaining, and accessing the Lucene inverted index Lucene Overview

●Basic Process 1.Adds content to a full-text index 2.Performs queries on this index 3.Returns results ranked by a. The relevance to the query b. An arbitrary field i. e.g., Last modified date

How to Make Content Searchable ●Search engines generally: a. Extract tokens from content b. Optionally transform the tokens depending on needs  Stemming  Expand with synonyms (usually done at query time)  Remove tokens (stopwords)  Add metadata c. Store tokens and related metadata (position, etc.) in a data structure optimized for searching  Called an Inverted Index

●Inverted Index o Searches an index instead of searching the text directly o Converts a page-centric structure (page->words) into a keyword-centric data structure (word->pages) Terms
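The page->words to word->pages conversion can be sketched in a few lines. This is only a toy illustration with made-up pages and whitespace tokenization; Lucene's real index also stores positions, frequencies, and other metadata.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each word to the sorted list of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical toy pages (page -> words)
pages = {
    1: "lucene is a search library",
    2: "solr is built on lucene",
    3: "mysql is a database",
}
index = build_inverted_index(pages)   # word -> pages
print(index["lucene"])  # [1, 2]
print(index["solr"])    # [2]
```

A lookup is now a single dictionary access instead of a scan over every page, which is the point of searching the index rather than the text.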

●Documents o The unit of search and index o An index consists of one or more Documents o Content can be from various sources  SQL/NoSQL database, a file system, websites o e.g.) Lucene index of a database table of users ● Each user = Lucene Document

●Fields o A Document consists of one or more Fields o Simply a name-value pair  e.g.) Title : Avengers Terms

●Fields o Types  Keyword ●Not analyzed, but indexed and stored ●Original value should be preserved in its entirety ●e.g.) File system path, dates...  UnIndexed ●Neither analyzed nor indexed, but stored as is ●Need to display with search results, but whose values you’ll never search directly ●e.g.) Database primary key...  UnStored ●Analyzed and indexed but not stored ●Large amount of text that doesn’t need to be retrieved in its original form ●e.g.) Bodies of web pages, any other type of text document  Text ●Analyzed and indexed ●If String, stored ●If the data is from a Reader, not stored Terms

An example of Lucene Fields

Terms ●Attributes o Tokenized  Analyze the content, extracting Tokens and adding them to the inverted index o Stored  Keep the content in a storage data structure for use by the application

Lucene Architecture

Lucene Functionality 1.Language Analysis 2.Indexing 3.Querying 4.Ancillary Features The Core of Lucene

Language Analysis

●Overview o The process of converting raw text into indexable tokens o Analyzer = Tokenizer + TokenFilter classes  Lucene provides many Analyzers out-of-the-box ● StandardAnalyzer, WhitespaceAnalyzer, etc.  Tokenizer for chunking the input into Tokens  TokenFilter can further modify the Tokens o Easy to add your own o Done on both the content to be indexed and the query

Language Analysis ●Input o Contents (documents) to be indexed o Queries to be searched ●Output o Appropriate internal representation as needed Input Output

1.Optional character filtering and normalization a. e.g.) removing diacritics 2. Tokenization a.“Time is an illusion. Lunch time doubly so.” ==> [“Time”, “is”, “an”, “illusion.”, “Lunch”, “time”, “doubly”, “so.”] Language Analysis

3.Token Filtering a. Stopword removal i. Remove words too common to be useful ii. e.g.) and, a, the, but, … b. Stemming i. Chop off the ends of words to map different forms of a word to a single form ii. e.g.) lazy, laziness -> lazi Language Analysis

3.Token filtering c. Lemmatization  Remove inflectional endings only and return the base or dictionary form of a word (lemma)  e.g.) better, best -> good d. N-gram creation  For approximate matching  e.g.) “This is my car” ⇒ [“This”, “is”, “my”, “car”], [“This is”, “is my”, “my car”], [“This is my”, “is my car”] Language Analysis
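The analysis steps above (tokenization, stopword removal, n-gram creation) can be sketched as a small pipeline. This is not Lucene's Analyzer API, only a plain-Python stand-in; the function names are invented here, the stopword set contains just the examples listed on the slide, and unlike the whitespace split shown earlier this tokenizer drops punctuation.

```python
import re

STOPWORDS = {"and", "a", "the", "but"}   # only the examples from the slide

def tokenize(text):
    """Split into runs of letters or digits, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)

def remove_stopwords(tokens):
    """Drop tokens that are too common to be useful."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("This is my car")
print(word_ngrams(tokens, 1))  # ['This', 'is', 'my', 'car']
print(word_ngrams(tokens, 2))  # ['This is', 'is my', 'my car']
print(word_ngrams(tokens, 3))  # ['This is my', 'is my car']
print(remove_stopwords(tokenize("The cat and a dog")))  # ['cat', 'dog']
```

Stemming and lemmatization are left out because they need language-specific rules or dictionaries.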

Indexing

●Indexing o Prepare / Add text to Lucene o Optimized for searching ●Lucene Indexing o Well-known inverted index representation o Keeping adjacent non-inverted data on a per- document basis ●Key Point o Lucene only indexes Strings  Convert whatever file format we have into something Lucene can use Indexing

Indexing with Lucene ●Overview o Fast: over 200 GB/hour o Incremental and “near-realtime” o Multi-threaded o Beyond full-text: numbers, dates, binary,... o Customize what is indexed (“analysis”) o Customize index format (“codecs”)

Indexing ●Document Model o A flat ordered list of fields with content o Fields have name, content data, float weight, and other attributes o Does not need to have a unique identifier

Indexing ●Store terms and documents in arrays

Indexing ●Insertions? o Insertion = write a new segment o Merge segments when there are too many of them o concatenate docs, merge terms, dicts and postings lists (merge sort!)

Indexing ●Deletions? o Deletion = turn a bit off o Ignore deleted documents when searching and merging (reclaims space) o Merge policies favor segments with many deletions

●Updates require writing a new segment o Single-doc updates are costly, bulk updates preferred o Writes are sequential ●Segments are never modified in place o Filesystem-cache-friendly o Lock-free! ●Terms are deduplicated o Saves space for high-freq terms ●Docs are uniquely identified by an ord o Useful for cross-API communication o Lucene can use several indexes in a single query ●Terms are uniquely identified by an ord o Important for sorting: compare longs, not strings o Important for faceting Indexing
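The segment behavior described on the last three slides can be mimicked with a toy model: an insertion writes a new immutable segment, a deletion only flips a "live" bit, and a merge rewrites the surviving documents into one segment. The class names and the merge-everything policy are simplifying assumptions; Lucene's real merge policies choose which segments to merge.

```python
class Segment:
    """An immutable batch of documents plus one 'live' bit per document."""
    def __init__(self, docs):
        self.docs = tuple(docs)            # never modified after creation
        self.live = [True] * len(docs)     # deletion = turn a bit off

class ToyIndex:
    def __init__(self, max_segments=3):
        self.segments = []
        self.max_segments = max_segments

    def add(self, docs):
        """Insertion = write a new segment; merge when there are too many."""
        self.segments.append(Segment(docs))
        if len(self.segments) > self.max_segments:
            self.merge()

    def delete(self, doc):
        """Mark matching documents as deleted without rewriting the segment."""
        for seg in self.segments:
            for i, d in enumerate(seg.docs):
                if d == doc:
                    seg.live[i] = False

    def merge(self):
        """Concatenate live documents into one new segment (reclaims space)."""
        survivors = [d for seg in self.segments
                     for d, alive in zip(seg.docs, seg.live) if alive]
        self.segments = [Segment(survivors)]

    def search_all(self):
        """Return every document that has not been deleted."""
        return [d for seg in self.segments
                for d, alive in zip(seg.docs, seg.live) if alive]

idx = ToyIndex(max_segments=2)
idx.add(["a", "b"])
idx.add(["c"])
idx.delete("b")
before_merge = (len(idx.segments), len(idx.segments[0].docs), idx.search_all())
idx.add(["d"])                             # third segment triggers a merge
after_merge = (len(idx.segments), idx.segments[0].docs)
print(before_merge, after_merge)
```

Before the merge the deleted document still occupies space in its segment; only the merge drops it.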

●Term vectors o Per-document inverted index o Useful for more-like-this (related search terms) o Sometimes used for highlighting

Indexing ●Numeric/binary doc values o Per doc and per field single numeric values o Useful for sorting and custom scoring o Norms are numeric doc values

Indexing ●Sorted (set) doc values o Ordinal-enabled per-doc and per-field values  Sorted: single-valued, useful for sorting  Sorted set: multi-valued, useful for faceting

Indexing ●Stored fields vs Doc values o Optimized for different access patterns  get many field values for a few docs: stored fields  get a few field values for many docs: doc values
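The two access patterns correspond roughly to row-oriented versus column-oriented storage. The sketch below uses plain Python containers as a stand-in; the documents and helper names are hypothetical and say nothing about Lucene's actual file formats.

```python
docs = [
    {"title": "A", "stars": 5, "year": 2010},
    {"title": "B", "stars": 3, "year": 2011},
    {"title": "C", "stars": 4, "year": 2009},
]

# Stored-fields style: one record per document (row-oriented).
stored_fields = docs

# Doc-values style: one array per field, indexed by document id (column-oriented).
doc_values = {field: [d[field] for d in docs] for field in docs[0]}

def fetch_document(doc_id):
    """Many field values for one doc: a single row lookup."""
    return stored_fields[doc_id]

def sort_by(field):
    """One field value for many docs: scan a single column."""
    column = doc_values[field]
    return sorted(range(len(column)), key=column.__getitem__)

print(fetch_document(1))
print(sort_by("stars"))  # [1, 2, 0]
```

Sorting or faceting touches only the one column it needs, while displaying a hit reads one complete row.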

Indexing ●Lucene APIs

Querying

●Lucene Query Parser converts strings into Java objects that can be used for searching ●Query objects can also be constructed programmatically ●Native support for many types of queries o Keyword o Phrase o Wildcard o Many more

Core Searching classes ●IndexSearcher ●Term o Basic unit for searching, consists of the field and the value of that field ●Query o TermQuery, BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, RangeQuery, FilteredQuery, and SpanQuery ●Hits o Simple container of pointers to ranked search results

Types of Queries 1.TermQuery a. Useful for retrieving documents by a key b. When the expression consists of a single word 2.PrefixQuery a. Matches documents containing terms beginning with a specified string b. When it ends with an asterisk(*) in query expressions 3.RangeQuery a. Facilitates searches from a starting term through an ending term

Types of Queries 4.BooleanQuery o A container of Boolean clauses o A clause is a subquery that can be optional, required, or prohibited 5.PhraseQuery o An index contains positional information of terms o Uses this information to locate documents where terms are within a certain distance of one another 6.FuzzyQuery o Matches terms similar to a specified term
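To make the query types concrete, here is a rough sketch of term, prefix, and Boolean matching over a toy inverted index. The functions only imitate the matching semantics described above; they are not Lucene classes, they do no scoring, and in this simplification optional clauses matter only when there are no required clauses.

```python
# Hypothetical toy documents
docs = {
    1: "welcome to the hotel california",
    2: "this hotel is located in california",
    3: "coffee and a meal in china",
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def term_query(term):
    """Documents containing exactly this term."""
    return index.get(term, set())

def prefix_query(prefix):
    """Documents containing any term that starts with the prefix."""
    hits = set()
    for term, ids in index.items():
        if term.startswith(prefix):
            hits |= ids
    return hits

def boolean_query(required=(), optional=(), prohibited=()):
    """Combine term clauses that are required, optional, or prohibited."""
    if required:
        hits = set.intersection(*(term_query(t) for t in required))
    else:
        hits = set().union(*(term_query(t) for t in optional))
    for t in prohibited:
        hits -= term_query(t)
    return hits

print(term_query("hotel"))
print(prefix_query("cal"))
print(boolean_query(required=["hotel", "california"], prohibited=["welcome"]))
```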

Querying ●Support a variety of query options o Ability to filter, page, and sort results o Pseudo relevance feedback ●Over 50 different kinds of query representations ●Several query parsers ●A query parsing framework

Various Types of Queries

Analysis and Search Relevancy

Lucene Tutorial 1.Download Lucene from http://lucene.apache.org/java 2.Write Code a. Indexing Side i. Write code to add Documents to index b. Search Side i. Write code to transform user query into Lucene Query instances ii. Submit Query to Lucene to search iii. Display results

Basic Application

1.http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html 2.http://www.lucenetutorial.com/index.html 3.http://trijug.org/downloads/TriJug-11-07.pdf 4.https://lucene.apache.org/core/ 5.https://fosdem.org/2015/schedule/event/apache_lucene_5/attachments/slides/750/export/events/attachments/apache_lucene_5/slides/750/Uwe_Schindler___Apache_Lucene_5.pdf 6.http://www.slideshare.net/nitin_stephens/lucene-basics 7.http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal?from_action=save References

Apache Solr

Relationship between Lucene & Solr ●Engine & Car o Lucene  A programmatic library which you can't use as-is o Solr  A complete application which you can use out-of-box

Solr Overview ●Solr? o Web application o Enterprise search platform built on Lucene o Highly reliable, scalable, fault tolerant ●Solr is not just an HTTP wrapper around Lucene o It adds many functionalities to Lucene o Some features were implemented in Solr before they became available in Lucene

Solr vs. Lucene ●Solr uses Lucene library, but extends it o Data-driven schemaless mode o Faceted search and filtering o Geospatial search o Performance optimizations o Monitoring o Rich document parsing o and so on...

Solr Functionality ●Advanced full-text search ●Scalability & Fault tolerance ●Open interfaces ●Administration interfaces ●Easy monitoring ●Easy configuration ●Near real-time indexing ●Extensible plugins

Scaling ●On distributed systems, Solr provides... o High scalability o Fault tolerance ●Built on Apache Zookeeper o Coordinator for distributed systems o Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services

Open Interfaces ●REST-like API o Invoke diverse operations via HTTP requests ●XML, JSON, CSV, binary format o Put data with these formats o Receive data in these formats ●Easy integration with any language

Web Interfaces ●Provides administrative and monitoring features

Web Interfaces (cont.) ●Provides querying interfaces ●Provides various querying options

Solr Query ●Solr query supports… o keyword matching o wildcard matching o proximity matching o range search o assigning different weights on search conditions o function query

●RDBMS vs. Text search platform o Does one size fit all? o Comparison on features and performances between two database systems o MySQL(RDBMS) vs. Solr(Text search platform)

MySQL vs. Solr ●MySQL o RDBMS used for general purposes ●Solr o Search platform that targets text retrieval only ●Questions o Will the performance difference between the two systems be significant? o Does Solr have other advantages over traditional DBMSs?

Settings ●Yelp review dataset o https://www.yelp.com/dataset_challenge/dataset o The dataset we used in course projects o Provided in JSON format o Over 1.5 million reviews

Importing Data ●There is no direct method to import JSON data into MySQL o We have to insert the records one by one, o Or convert them to a CSV file and load that into the DB ●In Solr, rich document parsing is available o XML, JSON, PDF, Word, etc. o Powered by Apache Tika

Importing Data ●In MySQL, … o Convert to CSV o And load

Importing Data ●In Solr… o Single line command

Test #1: Matching Single Term ●Retrieve documents that have the word ‘cuisine’ in their contents ●In SQL, o SELECT * FROM review WHERE text LIKE '%cuisine%'; ●In Solr query, o HTTP request to ‘/select’, with parameter o q=text:cuisine

Test #1: Matching Single Term ●Video clip: MySQL

Test #1: Matching Single Term Video clip: Solr

Test #2: Matching with Conditions ●Retrieve documents with below conditions o contains ‘meal’ in text o contains ‘coffee’ in text o does not contain ‘china’ in text o star rating is over 3 o written before 2012 o sort by date, ascending order o retrieve up to 500 documents

Test #2: Matching with Conditions ●In MySQL, SELECT * FROM review WHERE text LIKE '%meal%' AND text LIKE '%coffee%' AND text NOT LIKE '%china%' AND stars > 3 AND date < '2012-01-01 00:00:00.000' ORDER BY date ASC LIMIT 500; ●In Solr, HTTP request to ‘/select’, with parameter q=text:meal AND text:coffee AND -text:china AND stars:{3 TO *} AND date:{* TO 2012-01-01T00:00:00Z} sort=date ASC rows=500
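The Solr request above is an ordinary HTTP GET whose parameters must be URL-encoded. A minimal sketch of building that URL with the Python standard library follows; the host, port, and core name ("reviews") are assumptions, and the request itself is not sent.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical Solr host and core name; adjust to the actual deployment.
base = "http://localhost:8983/solr/reviews/select"

params = {
    "q": "text:meal AND text:coffee AND -text:china "
         "AND stars:{3 TO *} AND date:{* TO 2012-01-01T00:00:00Z}",
    "sort": "date asc",
    "rows": 500,
}

# urlencode escapes spaces, braces, colons, and '*' so the query survives transport.
url = base + "?" + urlencode(params)

# Decoding the query string again recovers the original parameters.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["q"][0])
```

Sending the request would then be a single call such as urllib.request.urlopen(url) against a running Solr instance.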

Test #2: Matching with Conditions ●Video clip: MySQL

Test #2: Matching with Conditions ●Video clip: Solr

Test #3: Proximity Search ●Proximity search o Matching term occurrences within a specified distance o e.g. ‘hotel’ and ‘california’ within distance 4  This hotel is located in California  Welcome to the Hotel California, such a lovely place

Test #3: Proximity Search ●In MySQL o Can we do it…? ●In Solr o text:"hotel california"~4 o or, {!surround} text:3w(hotel, california)
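For comparison, this is roughly what an application would have to do by hand if the database cannot run a proximity search: tokenize each text and compare term positions. The function below is an illustrative sketch; its distance is the plain difference between token positions, which does not map one-to-one onto the slop value in Solr's "hotel california"~4 syntax.

```python
import re

def within_distance(text, term_a, term_b, distance):
    """True if term_a and term_b occur at most `distance` token positions apart."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(a - b) <= distance for a in pos_a for b in pos_b)

s1 = "This hotel is located in California"
s2 = "Welcome to the Hotel California, such a lovely place"
print(within_distance(s1, "hotel", "california", 4))  # True
print(within_distance(s1, "hotel", "california", 3))  # False
print(within_distance(s2, "hotel", "california", 1))  # True
```

Running this over 1.5 million reviews means scanning every text, whereas a positional index answers the same question from the postings alone.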

Test #4: Faceted Search ●Solr supports faceted search feature o This allows users to explore information by applying multiple filters o Dynamic clustering of search results into categories that let users drill into search results by any value in any field. ●Popular technique for commercial applications

Test #4: Faceted Search ●When to use? o I want to find a specific item o but it is hard to define what I want to find ●By faceted search, we can remove irrelevant candidates, by applying filters

Test #4: Faceted Search ●Define filters o Star rating o Date written  From 2006-01-01  To 2010-01-01  By 3-month interval ●Query o GET request to ‘/select’, with parameters  q=*  facet=true  facet.field=stars  facet.date=date  f.date.facet.date.start= 2006-01-01T00:00:00Z  f.date.facet.date.end= 2010-01-01T00:00:00Z  f.date.facet.date.gap= +3MONTH

Test #4: Faceted Search

●q=* ●facet=true ●facet.field=stars ●facet.date=date ●f.date.facet.date.start=2006-01-01T00:00:00Z ●f.date.facet.date.end=2010-01-01T00:00:00Z ●f.date.facet.date.gap=+3MONTH ●fq=stars:4 ●fq=date:{2007-07-01T00:00:00Z TO 2007-07-01T00:00:00Z+3MONTH}
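What the facet parameters compute can be sketched with counters: one count per star rating, one count per 3-month bucket inside the requested date range, and a filter step that corresponds to the fq parameters. The sample reviews and function names are invented for this sketch; because the range starts on January 1 with a 3-month gap, the buckets coincide with calendar quarters.

```python
from collections import Counter
from datetime import date

# Hypothetical sample reviews
reviews = [
    {"stars": 4, "date": date(2007, 7, 15)},
    {"stars": 4, "date": date(2007, 8, 1)},
    {"stars": 5, "date": date(2006, 1, 10)},
    {"stars": 2, "date": date(2009, 12, 31)},
    {"stars": 4, "date": date(2011, 3, 3)},    # outside the date facet range
]

START, END = date(2006, 1, 1), date(2010, 1, 1)

def quarter_start(d):
    """First day of the 3-month bucket that contains d."""
    return date(d.year, 3 * ((d.month - 1) // 3) + 1, 1)

def facet_counts(docs):
    """Count documents per star rating and per 3-month bucket in [START, END)."""
    stars = Counter(d["stars"] for d in docs)
    dates = Counter(quarter_start(d["date"]) for d in docs
                    if START <= d["date"] < END)
    return stars, dates

def apply_filter(docs, stars, bucket):
    """Drill down: keep documents with the given rating in the given bucket."""
    return [d for d in docs
            if d["stars"] == stars and quarter_start(d["date"]) == bucket]

stars_facet, date_facet = facet_counts(reviews)
selected = apply_filter(reviews, 4, date(2007, 7, 1))
print(stars_facet[4], date_facet[date(2007, 7, 1)], len(selected))  # 3 2 2
```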

Test #4: Faceted Search ●In MySQL? o SELECT stars, COUNT(*) FROM review GROUP BY stars; o SELECT YEAR(date), (CASE WHEN MONTH(date) >= 1 AND MONTH(date) < 4 THEN 1 WHEN MONTH(date) >= 4 AND MONTH(date) < 7 THEN 2 WHEN MONTH(date) >= 7 AND MONTH(date) < 10 THEN 3 WHEN MONTH(date) >= 10 AND MONTH(date) <= 12 THEN 4 END) AS period, COUNT(*) FROM review WHERE date >= '2006-01-01 00:00:00' AND date < '2010-01-01 00:00:00' GROUP BY YEAR(date), period; ●Long and messy!

Test #5: Language Analysis ●For an efficient text retrieval, language analysis techniques are used o Stemming o Synonyms o Stopword removal o etc.

Test #5: Language Analysis ●In Solr, we can apply filters at index time and at query time o Some filters are applied automatically by default, depending on the language o For English, the Porter stemmer is used by default  Of course, we can switch to a different stemmer

Test #5: Language Analysis ●‘a nice hotel’, ‘hotel with niceness’ o Both give the same results, due to stemming and stopword removal

Test #5: Language Analysis ●Are these possible in MySQL? o Almost impossible with MySQL alone o Would have to be done at the application level, not the DB level

Results ●RDBMS vs. Text search platform o Response time  Using its indices, the text search platform retrieved documents faster o Rich search functionality  The text search platform offers rich functionality such as proximity search and faceted search o Language analysis  The text search platform applies filters at index and query time to match synonyms, inflected forms, etc.

Conclusions ●For text retrieval, Solr outperforms MySQL o What if updates occur frequently? o What if we need to find documents by something other than words? o What if we need complex join operations? ●Does one size fit all? o An RDBMS is a good choice for general purposes, but there are systems built for specific domains ●We have to select a suitable system o If you are a database engineer who has to build a text retrieval system, a text retrieval engine might be a good choice

References 1.http://lucene.apache.org/solr/features.html 2.https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/apache-solr-ref-guide-5.1.pdf 3.https://lucidworks.com/blog/faceted-search-with-solr/
