
1 SYSTEM FOR INTELLIGENT SEARCH AND ANALYSIS OF LARGE-SCALE TEXT COLLECTIONS
Ilya Tikhomirov, PhD
Institute for Systems Analysis, Federal Research Center «Computer Science and Control» of the Russian Academy of Sciences
117312, Moscow, pr. 60-letiya Oktyabrya, 9; +7 (499) 135-04-63; www.isa.ru

2 About
Russian Academy of Sciences:
- the national academy of Russia
- methodological guidance of more than 400 research centers
Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences:
- multidisciplinary research (mathematics, IT, economics, etc.)
- 1200+ employees, 300+ PhDs

3 TextAppliance (www.textapp.ru)
TextAppliance is a system for intelligent search and analysis of large-scale text collections.
Different from [product logos, not preserved in the transcript], etc.:
- Uses deep natural language processing
- Based on advanced Exactus technology
- Result of state-of-the-art research in computer science

4 Functions
TextAppliance consists of a hardware cluster and intelligent software services for search and analysis of large-scale text collections:
- Semantic and exploratory search
- Search for similar documents
- Semantic plagiarism detection
- Formation, comparison, and topic analysis of user collections
- Automatic extraction of keywords
- Automatic generation of document summaries
- Topic analysis for document collections

5 Features of TextAppliance
- Processes documents in Russian and English; extensible to Persian
- Can be easily integrated into an organization's infrastructure
- Provides a wide set of search and analytical functions
- High quality of text processing
- Supports common document formats, including PDF without a text layer (performs OCR)

6 Architecture
- Scalability
- Resiliency
- Easy to integrate (JSON / XML-RPC)
- Support for Big Data
- Full-text indexing
- Extraction and indexing of metadata
- Support for common document formats

7 Implementation
- Implemented on a computational cluster running Debian Linux
- Distributed computing provides scalability and stability under high load

8 Technologies behind TextAppliance

9 Semantic search method
Perform deep natural language processing of the user query:
- POS-tagging
- Syntactic parsing
- Semantic role labeling
- Semantic relation extraction
- Named entity recognition
Compare the linguistic structure of the query with the structures of documents in a large indexed text collection.

10 Semantic search scheme

11 Tokenization and sentence splitting
- Extract tokens from raw text
- Extract sentences from raw text
Example: "The mother brings her son to school." is split into the tokens: the | mother | brings | her | son | to | school | .
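
The two steps on this slide can be illustrated with a minimal regex-based sketch. This is not the TextAppliance implementation, which the slides do not describe. The function names `split_sentences` and `tokenize` and the splitting rules are assumptions chosen for illustration, and the second sentence in the input is made up to show the split.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive rule (an assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace. Abbreviations would break this rule.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence: str) -> list[str]:
    # A token is either a word (optionally with an apostrophe part)
    # or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

sentences = split_sentences("The mother brings her son to school. He likes it.")
tokens = tokenize(sentences[0])
print(sentences)
print(tokens)
```

Real systems usually replace both rules with trained models or curated abbreviation lists, because periods are ambiguous in running text.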

12 POS-tagging and morphological analysis
- Determine part-of-speech (POS) tags of words
- Determine morphological features of words (for morphologically rich languages)
Example: the/det mother/noun brings/verb her/pronoun son/noun to/prep school/noun .
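
The input and output of POS-tagging can be shown with a toy dictionary lookup over the slide's example sentence. This only illustrates the data shape: a real tagger resolves ambiguity from context, and the slides do not say which method TextAppliance uses. The `punct` and `unknown` labels and the `tag` function are assumptions, not part of the original.

```python
# Toy lexicon copied from the slide's example sentence.
LEXICON = {
    "the": "det", "mother": "noun", "brings": "verb", "her": "pronoun",
    "son": "noun", "to": "prep", "school": "noun",
}

def tag(tokens: list[str]) -> list[tuple[str, str]]:
    tagged = []
    for tok in tokens:
        if not any(ch.isalnum() for ch in tok):
            # Assumed label for punctuation; the slide leaves "." untagged.
            tagged.append((tok, "punct"))
        else:
            # Assumed fallback label for words missing from the toy lexicon.
            tagged.append((tok, LEXICON.get(tok.lower(), "unknown")))
    return tagged

tagged = tag(["The", "mother", "brings", "her", "son", "to", "school", "."])
print(tagged)
```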

13 Syntax parsing
- Build a syntax tree
- Extract the grammatical structure of a sentence

14 Semantic analysis
- Creates an abstracted representation of text that does not depend on a particular language
- Extracts semantic roles and semantic relations

15 Relational-Situational model of text (1)
- Syntax relations
- Semantic roles and values of syntaxemes
- Semantic relations between syntaxemes
- Coreference relations
- Other information extracted from texts: names of persons, names of companies, geographical objects, etc.
Example: “Oxygen arrives at tissues from lungs through blood. There it is spent on oxidation of various substances.”

16 Relational-Situational model of text (2)
The model M is defined through the following components:
- $S = \{s_1, s_2, \ldots, s_n\}$ is the set of syntaxemes; $s_i$ is a syntaxeme
- $R \subseteq S \times S$ is a family of relationships on the set of syntaxemes
- $T_s = \{\text{'p'}, \text{'n'}\}$ is the set of syntaxeme types
- $I_s : S \to T_s$ maps each syntaxeme to its type
Each syntaxeme is a triple $s = \langle W, P, \tau \rangle$ with $\tau \in T_s$, where:
- $W$ is the word
- $P$ is the set of syntaxeme features, including categorial semantic class, prepositions, and other morphological properties
- $\tau$ is the type of the syntaxeme ('p' for a predicate word, 'n' for a nominal syntaxeme)
$R = \{(s_1, s_2)\}$ is a family of binary relationships that consists of three subfamilies:
- $R_p$: relationships between predicate words and nominal syntaxemes (or syntaxeme meanings)
- $R_n$: relationships between nominal syntaxemes
- $R_c$: relationships that express anaphora and co-reference
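
One way to hold this model in memory is sketched below, using the sentence from the previous slide. The class names `Syntaxeme` and `TextModel`, the feature strings such as `prep:at`, and the particular relations entered are illustrative assumptions. The slides define the sets but not a data layout or the actual analysis of this sentence.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Syntaxeme:
    word: str                           # W: the word
    features: frozenset = frozenset()   # P: semantic class, preposition, morphology
    kind: str = "n"                     # type: 'p' (predicate word) or 'n' (nominal)

    def __post_init__(self):
        if self.kind not in ("p", "n"):
            raise ValueError("syntaxeme type must be 'p' or 'n'")

@dataclass
class TextModel:
    syntaxemes: set = field(default_factory=set)  # S
    r_p: set = field(default_factory=set)         # predicate word -> nominal syntaxeme
    r_n: set = field(default_factory=set)         # nominal -> nominal
    r_c: set = field(default_factory=set)         # anaphora and co-reference

    def relate(self, family: str, a: Syntaxeme, b: Syntaxeme) -> None:
        # family is 'p', 'n' or 'c' and selects the subfamily of R.
        target = {"p": self.r_p, "n": self.r_n, "c": self.r_c}[family]
        self.syntaxemes.update((a, b))
        target.add((a, b))

    def type_of(self, s: Syntaxeme) -> str:
        return s.kind                   # plays the role of I_s : S -> T_s

    def relations(self) -> set:
        return self.r_p | self.r_n | self.r_c   # R as the union of its subfamilies

# "Oxygen arrives at tissues from lungs ... There it is spent ..."
arrives = Syntaxeme("arrives", frozenset(), "p")
oxygen = Syntaxeme("oxygen")
tissues = Syntaxeme("tissues", frozenset({"prep:at"}))    # hypothetical feature label
lungs = Syntaxeme("lungs", frozenset({"prep:from"}))      # hypothetical feature label
it = Syntaxeme("it")

model = TextModel()
for argument in (oxygen, tissues, lungs):
    model.relate("p", arrives, argument)
model.relate("c", it, oxygen)           # "it" refers back to "oxygen"
print(len(model.syntaxemes), len(model.relations()))
```

Making `Syntaxeme` frozen keeps it hashable, so pairs of syntaxemes can be stored directly in sets, which mirrors the definition of R as a subset of S × S.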

17 Semantic search example

18 Indexing technology
- Fast indexing: sublinear dependency between the number of indexed documents and indexing speed
- Efficient search enhanced by linguistic information, including semantic structures and syntax trees
- Stores rich linguistic structures of texts efficiently, with minimum overhead for keeping semantic information

19 Evaluation of semantic search (ROMIP’08)
[Chart axes: Recall, Precision]
1st place

20 Evaluation of question answering search (ROMIP’10)
Best results for all metrics

21 Evaluation of plagiarism detection (CLEF’2014)
The developed method ranks 2nd on F-measure and 1st on the ratio of F-measure to the number of checked fragments.
[Charts: F-measure; ratio of F-measure to number of queries]
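
For reference, the two quantities named on this slide can be computed as in the sketch below. It assumes the balanced F-measure, F = 2PR / (P + R), which the slide does not state explicitly. The document identifiers and the query count are made-up inputs, not results from the evaluation.

```python
def precision_recall_f(retrieved: set, relevant: set) -> tuple[float, float, float]:
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

def f_per_query(f_measure: float, num_queries: int) -> float:
    # Quality obtained per unit of search effort.
    return f_measure / num_queries

# Made-up example: 4 sources retrieved, 3 truly relevant, 2 in common.
p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 3, 5})
efficiency = f_per_query(f, num_queries=10)
print(p, r, f, efficiency)
```

The ratio rewards a method that reaches a comparable F-measure while issuing fewer queries or checking fewer fragments.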

22 TextAppliance applications

23 Academic applications
- Search and analytics on scientific publications, fields, and research groups; created for the Ministry of Education and Science of the Russian Federation
- Plagiarism detection in scientific publications
- Intelligent patent search and patent analytics
- TextAppliance helps the Russian Foundation for Basic Research review grant applications

24 Academic applications (1)
Analysis of publication activity on the topic of "expert systems"

25 Academic applications (2)

26 Academic applications (3)
Example: some analytics on “electronic book” patents
[Charts: Patent holders; Patents by country; Number of patents]

27 Academic applications (4)
The Russian Foundation for Basic Research (RFBR) is the largest science foundation in Russia. RFBR uses TextAppliance to improve the review of applications for scientific grants.
TextAppliance helps to:
- structure large collections of applications and reports
- find plagiarism in applications and reports
- find topically similar projects
- identify emerging scientific fields that need additional support
- assign experts to projects

28 Business applications
Leading Russian publishers:
- Infra-M, product Znanium.com
- Rucont.ru
Integrated in their products:
- Plagiarism detection service
- Service of intelligent thematic search
- Service for analysis of scientific document structure (evaluation of publication quality)

29 Business application (1): Znanium.com

30 Business application (2): Rucont.ru
Example: 3D representation of clustering results

31 Customers and partners

32 TextAppliance team (1)
- PhD Olga Vybornova
- Dr.Sc., Prof. Gennady Osipov
- PhD Ilya Tikhomirov
- PhD Ivan Smirnov
- Researcher Dmitry Devyatkin
- PhD Alexander Shvets
- PhD Artem Shelmanov
- PhD Ilya Sochenkov

33 TextAppliance team (2)
- PhD student Roman Suvorov
- PhD student Denis Zubarev
- Dr.Sc., Prof. Sergey Krylov
- PhD student Margarita Ananyeva
- PhD student Margarita Kamenskaya
- PhD student Ivan Khramoin
- Student Vasiliy Iadrencev
- Student Vadim Isakov

34 www.textapp.ru, demo.textapp.ru
Institute for Systems Analysis, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences
117312, Moscow, pr. 60-letiya Oktyabrya, 9
Tel/fax: +7 499 135 0463
tih@isa.ru

