Similar Document Retrieval and Analysis in Information Retrieval System based on correlation method for full text indexing.


Searching similar documents
• Searching for similar documents, that is, documents whose content resembles the query, is a new, forward-looking technology.
• In the correlation method, the correlations between words or between ASCII symbols are taken into account when creating the full-text index of an archive of electronic documents.
• This makes it possible to pick out automatically the terminology typical of the documents indexed in the archive.
• When ASCII symbols are indexed, similar-document retrieval is language independent.
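The slides do not define how the correlations are computed. As a minimal sketch, assuming that a "correlation between ASCII symbols" can be approximated by counting adjacent symbol pairs, a per-page index could be built as follows. The names `symbol_pairs` and `build_index` are hypothetical and are not taken from CCT Archive or CCT Publisher.

```python
from collections import Counter


def symbol_pairs(text: str) -> Counter:
    """Count adjacent symbol pairs in a text (assumed stand-in for 'correlations')."""
    text = text.lower()
    return Counter(text[j:j + 2] for j in range(len(text) - 1))


def build_index(pages: dict[str, str]) -> dict[str, Counter]:
    """Map each page id to its symbol-pair counts."""
    return {page_id: symbol_pairs(body) for page_id, body in pages.items()}


# The same code runs unchanged on text in any language,
# because it looks only at symbols, not at words.
pages = {
    "p1": "full text indexing of archives",
    "p2": "indexacion de texto completo",
}
index = build_index(pages)
print(index["p1"].most_common(3))
```

Because the index is keyed on symbols rather than on a dictionary of words, no stemming or stop-word list is needed, which is one plausible reading of the language-independence claim.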

High relevance of the document retrieval
This technology:
• increases the relevance of document retrieval,
• solves the problems of fuzzy informational content,
• consolidates information from various resources and generates a report on the similarity of documents already stored in the database, that is, detects duplicate documents.
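A similarity report of the kind mentioned above can be illustrated with a pairwise comparison. This is only a sketch under stated assumptions: documents are compared by the cosine similarity of their symbol-pair counts, and the threshold of 0.9 is arbitrary. Neither choice comes from the slides, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def profile(text: str) -> Counter:
    """Symbol-pair counts of a text (assumed document representation)."""
    text = text.lower()
    return Counter(text[j:j + 2] for j in range(len(text) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_report(docs: dict[str, str], threshold: float = 0.9):
    """List document pairs whose similarity reaches the (assumed) threshold."""
    profiles = {name: profile(body) for name, body in docs.items()}
    report = []
    for x, y in combinations(sorted(profiles), 2):
        score = cosine(profiles[x], profiles[y])
        if score >= threshold:
            report.append((x, y, round(score, 3)))
    return report


docs = {
    "memo_v1": "The quick brown fox jumps over the lazy dog.",
    "memo_v2": "The quick brown fox jumped over the lazy dog.",
    "notes": "Indexing rate is about one megabyte per minute.",
}
report = similarity_report(docs)
print(report)
```

Only the two near-identical memos are reported. Comparing every pair is quadratic in the number of documents, so a real archive would need the index to narrow the candidates first.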

Natural language, full page query
• A sentence in natural language, a paragraph, or even a whole page of text can be submitted as the search query.
• The query passed to the similar-document search is encoded using the available expanded alphabet.

Relevance criteria
On the basis of a list of symbols, the following sum is calculated for each indexed page: [formula not preserved in the transcript]
The obtained Pi values are then ordered, and the pages with the highest values are returned to the user as the search results.
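The formula for the sum is not preserved in the transcript, so the following is only a sketch of the ranking step with a stand-in score. It assumes that the score of a page (its Pi value) is the overlap between the symbol-pair counts of the query and those of the page. The names `symbol_pairs` and `rank_pages` are hypothetical.

```python
from collections import Counter


def symbol_pairs(text: str) -> Counter:
    """Count adjacent symbol pairs in a text."""
    text = text.lower()
    return Counter(text[j:j + 2] for j in range(len(text) - 1))


def rank_pages(query: str, pages: dict[str, str], top_n: int = 3):
    """Score every page against the query and return the best ones first."""
    q = symbol_pairs(query)
    scores = {}
    for page_id, body in pages.items():
        page = symbol_pairs(body)
        # Assumed stand-in for the slide's sum: shared symbol-pair counts.
        scores[page_id] = sum(min(count, page[pair]) for pair, count in q.items())
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ordered[:top_n]


pages = {
    "p1": "inverted files and full text indexing",
    "p2": "storage devices and hard drives",
    "p3": "full text search in electronic archives",
}
ranked = rank_pages("full text index", pages)
print(ranked)
```

Whatever the actual sum is, the last two lines of `rank_pages` mirror the step described on the slide: order the scores and return the pages with the highest values.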

Software products of Controlling Chaos Technologies Ltd.
• The described method of text processing is implemented and used in the software products of Controlling Chaos Technologies Ltd., namely CCT Archive and CCT Publisher.
• These products are intended for creating electronic archives of unstructured documents with full-text search, and for creating and preparing electronic books, encyclopedias, and magazine archives for CD and DVD.
• Examples of successful application are the electronic archives of the well-known Russian magazines "Chemistry and the Life", "Quantum", and "Znanie - Sila".

Archive of the magazine "Quantum"
• The next slide shows the results produced by the search system, using the electronic archive of the magazine "Quantum" as an example.
• At the upper left is the natural-language query on which the search was carried out; below it is the ranked list of the documents found. On the right is a document page with the matches highlighted.

Archive of the magazine "Quantum"
[screenshot of the search results; image not preserved in the transcript]

Basic time characteristics
• Below are the basic timing characteristics achieved with the current software implementation of the algorithms described.
• All values were obtained on an ordinary personal computer. Text size here means the number of ASCII symbols in the text, not the size of the files containing it.

Basic time characteristics
• The maximum size of the indexed text is about 1 GB.
• The text indexing rate is about 1 MB per minute.
• Opening an index takes no more than 1 minute.
• A search takes about 1 second.
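Taken together, the size limit and the indexing rate imply a total indexing time for a maximum-size archive. The short calculation below assumes that "about 1 GB" means 1024 MB and that the rate stays constant over the whole run; the slides state neither.

```python
MAX_TEXT_MB = 1024          # assumption: "about 1 GB" taken as 1024 MB
INDEX_RATE_MB_PER_MIN = 1   # indexing rate stated on the slide

minutes = MAX_TEXT_MB / INDEX_RATE_MB_PER_MIN
hours = minutes / 60

print(f"Indexing a maximum-size archive takes about {hours:.0f} hours")
```

Under these assumptions, building the index is a batch job of roughly 17 hours, while opening it and searching it afterwards take about a minute and about a second respectively.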

Rubrication and text clusterization
• The technology being developed is language independent and can be adapted to any language.
• Developing the ideas behind similar-document search further makes it possible to solve such problems as plagiarism detection, rubrication (categorization) and clustering of texts, and Internet content filtering.