Similar Document Retrieval and Analysis in Information Retrieval System based on correlation method for full text indexing.


1 Similar Document Retrieval and Analysis in Information Retrieval System based on correlation method for full text indexing

2 Searching similar documents  Searching for similar documents, that is, for documents whose content is similar to a query, is a new, forward-looking technology.  In the correlation method, the correlations between words or ASCII symbols are taken into account when creating the full-text index of an archive of electronic documents.  This makes it possible to pick out automatically the terminology typical of the documents indexed in the archive.  When ASCII symbols are indexed, similar document retrieval is language independent.
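The slides do not spell out how the correlations are computed. As a minimal sketch only, assuming the simplest possible reading (counting adjacent symbol pairs on each page), a symbol-level index could look like this. The names `pair_counts` and `build_index` are hypothetical and are not taken from the CCT products.

```python
from collections import Counter
from typing import Dict


def pair_counts(text: str) -> Counter:
    """Count adjacent character pairs in a text.

    Assumption: adjacent-pair counts stand in for the "correlations
    between ASCII symbols"; the real method may differ.
    """
    # Lowercase and collapse whitespace so layout does not affect the counts.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 2] for i in range(len(normalized) - 1))


def build_index(pages: Dict[str, str]) -> Dict[str, Counter]:
    """Map each page id to its pair-count profile."""
    return {page_id: pair_counts(text) for page_id, text in pages.items()}


pages = {
    "p1": "Full text indexing of documents",
    "p2": "Growing tomatoes in a greenhouse",
}
index = build_index(pages)
print(index["p1"].most_common(3))
```

Because the profile is built from raw symbols rather than from a dictionary of words, nothing in it depends on the language of the text, which is the property the slide points to.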

3 High relevance of the document retrieval  This technology  increases the relevance of document retrieval,  solves the problems of fuzzy informational content,  consolidates information from various sources and generates a report on the similarity of documents already stored in the database, that is, detects duplicate documents.

4 Natural language, full page query  A sentence in natural language, a paragraph, or even a whole page of text can be submitted as the search query.  The query passed to the similar-document search is encoded by means of the available expanded alphabet.

5 Relevance criteria  On the basis of a list of symbols, the following sum is calculated for each indexed page: [formula image not included in the transcript]  The obtained Pi values are then ordered, and the pages with the highest values are returned to the user as the search results.
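The ranking step (compute a value Pi for every indexed page, order the values, return the top pages) can be sketched as follows. The formula for Pi is not preserved in this transcript, so the scoring rule below, a sum of overlapping symbol-pair counts, is purely an assumed stand-in. `symbol_pairs`, `score` and `rank_pages` are hypothetical names.

```python
from collections import Counter
from typing import Dict, List, Tuple


def symbol_pairs(text: str) -> Counter:
    """Adjacent character pair counts (assumed page and query profile)."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 2] for i in range(len(normalized) - 1))


def score(query: Counter, page: Counter) -> int:
    """Assumed stand-in for P_i: total overlap of pair counts.

    This is NOT the formula from the slide, which is missing.
    """
    return sum(min(count, page[pair]) for pair, count in query.items())


def rank_pages(query_text: str, pages: Dict[str, str],
               top_k: int = 10) -> List[Tuple[str, int]]:
    """Score every page against the query and return the best top_k."""
    query = symbol_pairs(query_text)
    scored = [(page_id, score(query, symbol_pairs(text)))
              for page_id, text in pages.items()]
    # Highest score first, as the pages with the highest P_i are shown first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


pages = {
    "a": "correlation method for full text indexing",
    "b": "growing tomatoes",
}
for page_id, value in rank_pages("full text index", pages):
    print(page_id, value)
```

A real implementation would look the page profiles up in the prebuilt index rather than recomputing them for every query; they are recomputed here only to keep the sketch self-contained.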

6 Software products of the Controlling Chaos Technologies Ltd.  The described method of text processing is implemented and used in the software products of Controlling Chaos Technologies Ltd., namely CCT Archive and CCT Publisher.  The products are intended for creating electronic archives of unstructured documents with full-text search, and for creating and preparing electronic books, encyclopedias, and magazine archives for CD and DVD.  Examples of successful application of the software products are the electronic archives of the well-known Russian magazines "Chemistry and the Life", "Quantum", and "Znanie - Sila".

7 Archive of magazine "Quantum"  The next slide shows the results of the search system operating on the electronic archive of the magazine "Quantum" as an example.  At the upper left is the natural-language query on which the search was carried out; below it is the ranked list of the documents found. To the right is the document page with the matches highlighted.

8 Archive of magazine "Quantum"

9 Basic time characteristics  Below are the basic time characteristics achieved with the present software implementation of the algorithms described.  All values were obtained on an ordinary personal computer. By text size we mean the number of ASCII symbols in the text, not the size of the files containing it.

10 Basic time characteristics  The maximum size of the indexed text is about 1 GB.  The text indexing rate is about 1 MB per minute.  The time to open an index is no more than 1 minute.  Search time is about 1 second.
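Taken together, the first two figures imply how long a full build of a maximum-size index would take. The short calculation below is a rough estimate only, assuming the quoted rate stays constant over the whole archive and reading the maximum size as 1024 megabytes of text.

```python
# Figures quoted on the slide (both are approximate).
max_text_mb = 1024              # maximum indexed text, read as 1024 MB (assumption)
indexing_rate_mb_per_min = 1.0  # quoted indexing rate

minutes = max_text_mb / indexing_rate_mb_per_min
hours = minutes / 60

print(f"{minutes:.0f} min, about {hours:.1f} h")  # prints "1024 min, about 17.1 h"
```

So indexing an archive at the size limit is an overnight job of roughly 17 hours, while opening the finished index and searching it take about a minute and about a second respectively.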

11 Rubrication and text clusterization  It should be noted that the technology being developed is not language dependent and can be adapted to any language system.  Developing the ideas behind similar-document search further makes it possible to solve such problems as plagiarism detection, rubrication (assigning texts to subject categories) and text clusterization, and Internet content filtering.

