17th APAN Meetings & Joint Techs Workshop

1 17th APAN Meetings & Joint Techs Workshop
FilipinianaWeb
Nestor Michael C. Tiglao
Computer Networks Lab (CNL), University of the Philippines
17th APAN Meetings & Joint Techs Workshop, Jan. 30, 2004

2 World Wide Web
Enormous growth (10 billion pages)
Imagine the Web without search engines
Need for intelligent document discovery mechanisms

3 Web Crawlers
Programs that retrieve Web pages
Two kinds: general-purpose crawlers and focused crawlers

4 Sample Query: anthrax

5 Result 1

6 Result 2

7 Focused Crawler
Selectively seeks out pages that are relevant to a pre-defined set of topics
Topics are specified by sample documents
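
The idea on this slide can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a best-first frontier and uses cosine similarity to the sample documents as a stand-in for the classifier, and it has no distiller. The names `focused_crawl`, `fetch`, `threshold` and `TOY_WEB` are hypothetical, and the "web" is an in-memory dictionary so the example runs without network access.

```python
import heapq
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def focused_crawl(seeds, fetch, samples, threshold=0.2, max_pages=100):
    """Crawl best-first, following links only from pages that resemble the samples.

    fetch(url) is supplied by the caller and must return (text, outgoing_links).
    """
    topic = Counter()
    for doc in samples:
        topic.update(tokenize(doc))

    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    fetched = 0

    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        fetched += 1
        score = cosine(Counter(tokenize(text)), topic)
        if score < threshold:
            continue  # off-topic page: do not follow its links
        relevant.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return relevant


# Toy in-memory web (hypothetical data): url -> (text, outgoing links)
TOY_WEB = {
    "a": ("anthrax bacteria spores infection", ["b", "c"]),
    "b": ("anthrax vaccine and bacteria research", ["d"]),
    "c": ("heavy metal band tour dates", ["e"]),
    "d": ("bacteria infection treatment", []),
    "e": ("concert tickets", []),
}

pages = focused_crawl(["a"], lambda url: TOY_WEB[url], ["anthrax bacteria infection"])
print([url for url, _ in pages])
```

In this toy run, page "c" is fetched but scores below the threshold, so its link to "e" is never followed. That pruning is what distinguishes a focused crawler from a general-purpose one.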

8 Research on Search Engines
Implemented the focused crawler on a Linux cluster using Beowulf and MPI (2002)
Philippine-specific search engine using the openMosix platform (2003)

9 Focused Crawler Architecture
Components: User Interface, Results, Sample Document, Classifier, Crawl Tables, Distiller, Crawler

10 Focused Crawler Design

11 Flowchart

12 Performance (Crawl Time)

13 Why another search engine?
Existing Philippine search engines (Yehey.com, Alleba, Tanikalang Ginto, Pugad.com and EdsaWorld) are actually web directories
We need a better search engine

14 Unique Situation
Many Philippine-related sites are not registered under the .ph domains
Many sites are hosted outside the Philippines
English is the de facto language

15 System Design (Gagambot)

16 Filters
.ph domain filter: gov.ph, edu.ph
Language filter: iso 639; iso /latin1 and windows-1252; subset of Unicode characters; utf-8 and us-ascii
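
As a rough illustration of the .ph domain filter and the character-set part of the language filter, the sketch below checks a URL's hostname and the charset declared in an HTTP Content-Type header. The function names and the `ALLOWED_CHARSETS` set are assumptions made for this example, and the slides do not say how Gagambot combines its filters, so the two checks are shown as independent predicates.

```python
from urllib.parse import urlsplit

# Charsets named on the slide. "iso-8859-1" is the registered name for
# latin1; listing both spellings is an assumption made for this sketch.
ALLOWED_CHARSETS = {"iso-8859-1", "latin1", "windows-1252", "utf-8", "us-ascii"}


def in_ph_domain(url):
    """Return True if the URL's hostname falls under the .ph top-level domain."""
    host = (urlsplit(url).hostname or "").lower()
    return host.endswith(".ph")


def charset_allowed(content_type):
    """Return True if a Content-Type header declares one of the allowed charsets.

    Headers with no charset parameter are rejected in this sketch.
    """
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.strip().lower() == "charset":
            return value.strip("\"' ").lower() in ALLOWED_CHARSETS
    return False


print(in_ph_domain("http://www.gov.ph/index.html"))
print(charset_allowed("text/html; charset=UTF-8"))
```

Because many Philippine-related sites sit outside the .ph domain, a check like `in_ph_domain` can only ever be one signal among several.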

17 Filters 2
GeoURL filter: location-to-URL reverse directory; finds URLs by their proximity to a given location
Bayesian filter: analyzes the textual content of the HTML document
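
The slides do not describe the Bayesian filter beyond saying that it analyzes page text. One common way to build such a filter is a naive Bayes classifier over word counts, sketched below under that assumption. The class name `NaiveBayesFilter`, the add-one smoothing and the training sentences are all hypothetical, and the input is assumed to be plain text already extracted from the HTML.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesFilter:
    """Two-class naive Bayes over word counts (relevant vs. not relevant)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, relevant):
        """Add one labelled document to the model."""
        self.word_counts[relevant].update(tokenize(text))
        self.doc_counts[relevant] += 1

    def log_odds(self, text):
        """Log of P(relevant | text) / P(not relevant | text), add-one smoothed."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total = {c: sum(self.word_counts[c].values()) for c in (True, False)}
        score = math.log((self.doc_counts[True] + 1) / (self.doc_counts[False] + 1))
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            p_rel = (self.word_counts[True][word] + 1) / (total[True] + len(vocab))
            p_irr = (self.word_counts[False][word] + 1) / (total[False] + len(vocab))
            score += math.log(p_rel / p_irr)
        return score

    def is_relevant(self, text):
        return self.log_odds(text) > 0


nb = NaiveBayesFilter()
nb.train("Manila Cebu Philippines barangay news", True)
nb.train("Philippine peso Manila senate", True)
nb.train("Paris France travel guide", False)
nb.train("London stock market news", False)

print(nb.is_relevant("senate news from Manila"))
print(nb.is_relevant("Paris travel guide"))
```

A content-based check of this kind can accept Philippine-related pages regardless of where they are registered or hosted.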

18 FilipinianaWeb

19 Current Plans
Develop FilipinianaWeb on a grid platform
Better filtering techniques
Integrate focused crawling
Support for other object formats: documents, images, XML, etc.

20 Conclusion
FilipinianaWeb is a work-in-progress and a proof-of-application
Grid infrastructure will help provide the computational and resource requirements of a production-level search engine

