The Marathi Portal with a Search Engine
Center for Indian Language Technology Solutions, IIT Bombay
A Search Engine
  To promote the use of information available on the web in the Marathi language
  Locate the pages that the user needs
  Present the pages to the user in order of importance
Types of Searches
  Based on user queries
  Category-based search: browse through pre-classified categories
  Search of selected literature, which will be hosted on the Marathi Portal
Search Engine: Performance Criteria
  Coverage: cover as many pages as possible. A study has revealed that a large part of the web remains unindexed.
  Response time: the user should be presented with the results as quickly as possible.
  Relevance: the information presented should be relevant and ordered by importance.
Main Components of a Search Engine
  Crawling unit
  Indexing unit
  Searching unit
  Ranking unit
A Prototype
  A prototype has been developed to gauge the complexity and the architectural issues involved in developing the complete Marathi Portal
About the Prototype
  A search engine prototype has been built with manually selected sites in different categories
  It indexes about 1800 pages consisting of over 10,14,000 words (about 1.01 million)
  The engine is developed on the Windows platform using MS Access
  Monolingual ISFOC pages are covered
Ranking Criteria Used in the Prototype
  Number of words in the query string that appear in the document: in an OR search, documents containing the largest number of query words are ranked higher
  Proximity between words: number of query words that occur together within a distance of 5 words
  Context of the word: is it in the title or in the body?
  Frequency of the desired word in the document: number of occurrences of the word
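The four criteria above can be sketched as a small scoring function. This is only an illustration, not the prototype's code: the function name `score`, the order in which the signals are compared, and the reading of proximity as "pairs of distinct query words within the window" are all assumptions made for the example.

```python
from itertools import combinations


def score(query_words, title_words, body_words, window=5):
    """Return a tuple of ranking signals; a larger tuple ranks higher.

    Assumption: signals are compared in the order listed on the slide.
    """
    query = set(query_words)
    # Positions at which each query word occurs in the body.
    positions = {
        w: [i for i, t in enumerate(body_words) if t == w] for w in query
    }
    # 1. Number of distinct query words found anywhere in the document.
    matched = sum(1 for w in query if positions[w] or w in title_words)
    # 2. Proximity: pairs of distinct query words whose body positions
    #    lie within `window` words of each other (an interpretation).
    close_pairs = sum(
        1
        for a, b in combinations(sorted(query), 2)
        if any(abs(i - j) <= window for i in positions[a] for j in positions[b])
    )
    # 3. Context: query words that appear in the title.
    in_title = sum(1 for w in query if w in title_words)
    # 4. Frequency: total occurrences of query words in the body.
    frequency = sum(len(positions[w]) for w in query)
    return (matched, close_pairs, in_title, frequency)
```

Returning a tuple keeps the criteria separate, so documents can be sorted with `sorted(docs, key=..., reverse=True)`. A weighted sum would be an equally plausible design.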
A Fast Engine Is under Development
  A Linux-based fast prototype for the same number of pages is being developed
  It takes 2 minutes to build the dictionary, 2 hours to build the index, and less than a second to search
What If the Machine That Hosts the Engine Fails?
  The index must be in main memory while a search is being performed
  You cannot afford to lose the index, since it would take days (even months for large engines) to build it again on a large number of pages
  Dumping the index of the Linux prototype through traversal takes around 35 minutes
  But loading it into main memory took only 2 minutes!
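The dump-and-reload idea above can be sketched as follows. The slides do not describe the prototype's on-disk format, so the use of Python's `pickle` module and the helper names `dump_index` and `load_index` are assumptions chosen purely for illustration.

```python
import os
import pickle


def dump_index(index, path):
    """Write the in-memory index to disk.

    The data is written to a temporary file first and then renamed, so a
    crash during the dump does not destroy the previous good copy.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, path)  # replaces any existing file at `path`


def load_index(path):
    """Read a previously dumped index back into main memory."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

After a machine failure, calling `load_index` on the last dump restores the index without recrawling or reindexing. Only pages changed since the dump would need to be processed again.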
Requirements from the Infrastructure for the Actual Portal
  High RAM: in GBs
  High computing power: parallel processing through a network of workstations
  Parallel IO
  As the number of users increases, more and more parallelism will have to be employed to guarantee the same performance criteria to each user
Representations and Fonts
  Currently only ISFOC is supported
  There are Marathi sites with different types of encodings, which need to be integrated
  Converters
  Input/display technology for Linux
Crawling
  Crawling and meta-crawling techniques
  Some interesting facts:
    The word ‘Aahe’ was found to be one of the most widely occurring words
    The words ‘Aahe’ and ‘Aani’ together span most of the documents
    There are specific words that occur most widely and most frequently in different categories
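Statistics of the kind quoted above (how widely a word occurs, and how many documents a pair of words spans) can be computed from tokenized pages. The sketch below is hypothetical: it assumes each document is already a list of transliterated words, and the function names `document_frequency` and `coverage` are invented for the example.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for words in docs:
        df.update(set(words))  # a word counts once per document
    return df


def coverage(docs, words):
    """Fraction of documents containing at least one of `words`."""
    wanted = set(words)
    hits = sum(1 for doc in docs if wanted & set(doc))
    return hits / len(docs) if docs else 0.0
```

Calling `document_frequency(docs).most_common(10)` lists the most widely occurring words, and `coverage(docs, ["aahe", "aani"])` gives the share of documents spanned by that pair.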
Indexing and Searching
  Incremental
  Dynamic
  Fast search
  In memory
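One common way to meet these goals is an in-memory inverted index that can add, replace, or remove a single document without a full rebuild. The sketch below is a minimal illustration under that assumption. The class `InvertedIndex` and its methods are hypothetical and do not describe the actual engine.

```python
from collections import defaultdict


class InvertedIndex:
    """In-memory inverted index with per-document updates."""

    def __init__(self):
        # word -> {doc_id: [positions of the word in that document]}
        self.postings = defaultdict(dict)
        # doc_id -> set of its words, kept so a document can be removed
        self.doc_words = {}

    def add(self, doc_id, words):
        """Add a document, replacing any earlier version of it."""
        self.remove(doc_id)
        for pos, word in enumerate(words):
            self.postings[word].setdefault(doc_id, []).append(pos)
        self.doc_words[doc_id] = set(words)

    def remove(self, doc_id):
        """Delete a document's postings; unknown ids are ignored."""
        for word in self.doc_words.pop(doc_id, ()):
            del self.postings[word][doc_id]
            if not self.postings[word]:
                del self.postings[word]

    def search(self, query_words, mode="or"):
        """Return ids of documents matching any ("or") or all ("and") words."""
        sets = [set(self.postings.get(w, ())) for w in query_words]
        if not sets:
            return set()
        return set.intersection(*sets) if mode == "and" else set.union(*sets)
```

Storing word positions in the postings is a design choice that would also allow proximity checks at query time, at the cost of more memory.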
Relevancy
  What the user really wants
  Heuristics for ranking results
  Query modification
Selected Texts
  Saint Tukarama’s Abhangs will be made searchable and will be hosted on this website
  Search over other selected texts will also be hosted on this website