The Marathi Portal with a Search Engine
Center for Indian Language Technology Solutions, IIT Bombay

A Search Engine
- To promote the use of information available on the web in the Marathi language
- To locate the right pages that you need
- To present the pages to the user in order of importance

Types of Searches
- Based on user queries
- Category-based search: browse through pre-classified categories
- Search of selected literature, which will be hosted on the Marathi Portal

Search Engine: Performance Criteria
- Coverage: cover as many pages as possible. A study has revealed that a large part of the web remains unindexed.
- Response time: the user should be presented with the results as quickly as possible.
- Relevance: the information presented should be relevant and ordered by importance.

Main Components of a Search Engine
- Crawling unit
- Indexing unit
- Searching unit
- Ranking unit

A Prototype
- A prototype has been developed to gauge the complexity and architectural issues involved in developing the complete Marathi Portal

About the Prototype
- A search engine prototype has been built with manually selected sites in different categories
- It indexes about 1800 pages consisting of over 10,14,000 (1,014,000) words
- The engine is developed on the Windows platform using MS Access
- Monolingual ISFOC pages are covered

Ranking Criteria Used in the Prototype
- Number of words in the query string that appear in the document: in an OR search, documents containing the largest number of query words are ranked higher
- Proximity between words: number of query words that occur together within a distance of 5 words
- Context of the word: is it in the title or the body?
- Frequency of the desired word in the document: number of occurrences of the word
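The slide lists the four criteria but does not say how they are combined. The following is a minimal sketch that adds them up with assumed weights (10, 3, 5 and 1 are illustrative only), and the function name `score` is hypothetical. Proximity is approximated by counting neighbouring hits of different query words that lie within 5 words of each other in the body.

```python
def score(query_words, title_words, body_words, window=5):
    """Combine the four slide criteria into one number (weights are assumed)."""
    query = set(query_words)

    # 1. Number of distinct query words that appear anywhere in the document.
    present = query & (set(title_words) | set(body_words))

    # 2. Proximity: consecutive hits of different query words in the body
    #    that are at most `window` words apart.
    hit_positions = [i for i, w in enumerate(body_words) if w in query]
    close_pairs = sum(
        1
        for a, b in zip(hit_positions, hit_positions[1:])
        if b - a <= window and body_words[a] != body_words[b]
    )

    # 3. Context: query words found in the title count extra.
    in_title = len(query & set(title_words))

    # 4. Frequency: total occurrences of query words in the body.
    frequency = sum(1 for w in body_words if w in query)

    # Assumed weights; the prototype's real weighting is not given.
    return 10 * len(present) + 3 * close_pairs + 5 * in_title + frequency


def rank(query_words, documents):
    """Sort (doc_id, title_words, body_words) tuples, best match first."""
    return sorted(
        documents,
        key=lambda doc: score(query_words, doc[1], doc[2]),
        reverse=True,
    )
```

With these weights, a page that contains every query word always outranks a page that repeats a single query word a few times, which matches the first criterion being listed first.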

A Fast Engine is under Development
- A Linux-based fast prototype for the same number of pages is being developed
- It takes 2 minutes to build the dictionary, 2 hours to build the index, and less than a second to search

What if the Machine that Hosts the Engine Fails?
- The index must be in main memory while a search is being performed
- You cannot afford to lose the index, since rebuilding it over a large number of pages would take days (even months for large engines)
- Dumping the index of the Linux prototype through traversal takes around 35 minutes
- But loading it back into main memory took only 2 minutes!
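The dump-and-reload idea can be sketched as follows. This is not the prototype's own dump format, which the slides do not describe. It assumes the index is an ordinary Python dictionary and uses the standard `pickle` module. The function names `dump_index` and `load_index` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def dump_index(index, path):
    """Write the in-memory index to disk so a crash does not force a rebuild."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    # Write to a temporary file first, then rename it over the target,
    # so a failure during the dump does not corrupt the previous copy.
    with open(tmp, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)
    tmp.replace(path)


def load_index(path):
    """Read a previously dumped index back into main memory."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Example round trip with a tiny index (word -> {doc_id: [positions]}).
index = {"aahe": {"p1": [0, 4]}, "aani": {"p1": [2]}}
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "index.pkl"
    dump_index(index, path)
    restored = load_index(path)
```

Dumping periodically trades some disk I/O for the ability to restart after a failure in minutes rather than repeating the full index build.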

Requirements from the Infrastructure for the Actual Portal
- High RAM, in GBs
- High computing power: parallel processing through a network of workstations
- Parallel I/O
- As the number of users increases, more and more parallelism will have to be employed to guarantee the same performance criteria to each user
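One common way to use several machines for search is to split the index into shards, query every shard at once and merge the partial results. The slides do not say whether the portal will work this way, so the sketch below is only an illustration of the idea. Threads on one machine stand in for the workstations, and `search_shard` and `parallel_search` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query_words):
    """OR search inside one shard; a shard maps word -> set of document ids."""
    hits = set()
    for word in query_words:
        hits |= shard.get(word, set())
    return hits


def parallel_search(shards, query_words):
    """Query every shard concurrently and merge the partial results."""
    if not shards:
        return set()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = pool.map(search_shard, shards, [query_words] * len(shards))
        return set().union(*partial)
```

For example, `parallel_search([{"marathi": {"p1"}}, {"kavita": {"p9"}}], ["marathi", "kavita"])` merges the hits from both shards into one result set.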

Representations and Fonts
- Currently only ISFOC is supported
- There are sites in Marathi with different types of encodings, which need to be integrated
- Converters
- Input/display technology for Linux

Crawling
- Crawling and meta-crawling techniques
- Some interesting facts:
  - The word ‘Aahe’ was found to be one of the most widely occurring words
  - The words ‘Aahe’ and ‘Aani’ together span most of the documents
  - There are specific words that occur most widely and most frequently in different categories
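Observations like these can be reproduced on any crawled collection by counting how often each word occurs and how many documents contain it. The sketch below assumes the pages are already available as whitespace-separated text. The function names and the sample data in the usage note are illustrative, not taken from the prototype.

```python
from collections import Counter


def word_statistics(documents):
    """Return (occurrences per word, number of documents containing each word)."""
    term_freq = Counter()
    doc_freq = Counter()
    for text in documents:
        words = text.split()
        term_freq.update(words)       # every occurrence counts
        doc_freq.update(set(words))   # each document counts once per word
    return term_freq, doc_freq


def coverage(documents, words):
    """Fraction of documents that contain at least one of `words`."""
    wanted = set(words)
    covered = sum(1 for text in documents if wanted & set(text.split()))
    return covered / len(documents) if documents else 0.0
```

Calling `doc_freq.most_common(10)` lists the most widely occurring words, and `coverage(documents, ["aahe", "aani"])` gives the share of documents spanned by those two words together.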

Indexing and Searching
- Incremental
- Dynamic
- Fast search
- In memory
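These four properties can be illustrated with a small in-memory inverted index. This is a minimal sketch, not the prototype's data structure: the class name, the method names and the whitespace tokenization are all assumptions.

```python
from collections import defaultdict


class InvertedIndex:
    """In-memory inverted index: word -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add_document(self, doc_id, text):
        # Incremental: a new page is added without rebuilding the index.
        for position, word in enumerate(text.split()):
            self.postings[word].setdefault(doc_id, []).append(position)

    def remove_document(self, doc_id):
        # Dynamic: a page that changed or disappeared can be dropped.
        for word in list(self.postings):
            self.postings[word].pop(doc_id, None)
            if not self.postings[word]:
                del self.postings[word]

    def search_or(self, query):
        # Fast search: one dictionary lookup per query word, all in memory.
        hits = set()
        for word in query.split():
            hits.update(self.postings.get(word, {}))
        return hits
```

Storing word positions in the postings costs extra memory, but it allows proximity between query words to be checked without re-reading the pages.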

Relevancy
- What the user really wants
- Heuristics for ranking results
- Query modification

Selected Texts
- Saint Tukarama’s Abhangs will be made searchable and will be hosted on this website
- Search on other selected texts will also be hosted on this website