1 Challenges in Web Document Summarization: Some Myths and Reality A. Rahman H. Alam Document Analysis and Recognition Team (DART) BCL Computers Inc. Santa Clara, Calif, USA

2 Basic Problem Statement What are web-based documents? What is summarization? Textual summarization vs. content summarization. What myths do we have about summarization? What is the reality?

3 Why Summarization? The display area of handheld devices such as PDAs and cell phones is too small for useful web browsing. Download times are still too long for comfortable browsing on wireless devices. The cost is still too high.

4 Where is the Money? There are 1.2 billion web pages. At 2 hours to adapt each existing page for wireless, conversion would take 2.4 billion work-hours. Assuming $20 per hour, this effort requires an investment of around $50 billion.
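The estimate on this slide can be checked with a few lines of arithmetic. The figures are the slide's own; the variable names are only illustrative.

```python
# Figures taken from the slide; names are illustrative.
pages = 1_200_000_000      # web pages
hours_per_page = 2         # hours to adapt one existing page for wireless
rate_usd_per_hour = 20     # assumed hourly labor cost

work_hours = pages * hours_per_page
cost_usd = work_hours * rate_usd_per_hour

print(f"{work_hours / 1e9:.1f} billion work-hours")   # 2.4 billion work-hours
print(f"${cost_usd / 1e9:.0f} billion")               # $48 billion
```

The exact product is $48 billion, which the slide rounds to "around $50 billion".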

5 Current need? Viewing web sites on small-screen handheld devices. Since web sites are written in HTML, we need to translate them into formats that wireless devices can support.

6 Myths Web summarization is easy: no scanning; no image processing; no word- or character-level recognition; HTML has structural elements; documents are already in electronic format.

7 Current Solutions Handcrafting: –Custom web sites are typically crafted by hand by a set of content experts. Transcoding: –Transcoding replaces HTML tags with suitable device-specific tags (HDML, WML, etc.).
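As a rough illustration of transcoding by tag replacement, the following sketch swaps a handful of HTML tags for WML-style ones using Python's standard `html.parser`. The `TAG_MAP` table, the class name `TagSwapTranscoder`, and the `transcode` helper are hypothetical and chosen only for illustration; this is not any vendor's transcoder, and it does not emit a complete WML deck.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical HTML -> WML-style tag mapping, for illustration only.
TAG_MAP = {
    "p": "p", "b": "b", "strong": "b", "i": "i", "em": "i",
    "h1": "big", "h2": "big", "a": "a", "br": "br",
}

class TagSwapTranscoder(HTMLParser):
    """Replaces mapped tags, drops unmapped tags, and keeps all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        target = TAG_MAP.get(tag)
        if target is None:
            return  # unmapped tag: dropped, but its text is still emitted
        if target == "br":
            self.out.append("<br/>")
        elif target == "a":
            href = dict(attrs).get("href") or ""
            self.out.append(f'<a href="{escape(href, quote=True)}">')
        else:
            self.out.append(f"<{target}>")

    def handle_endtag(self, tag):
        target = TAG_MAP.get(tag)
        if target and target != "br":
            self.out.append(f"</{target}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

def transcode(html_text):
    parser = TagSwapTranscoder()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)

sample = '<h1>News</h1><div><strong>Hi</strong> <a href="/x">more</a></div>'
print(transcode(sample))
```

In this sketch only the tags change: every piece of text from the input page is carried over to the output unchanged.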

8 Handcrafting Take an existing website and make it available for wireless access: Aether Systems, Mshift and 2Roam currently offer this type of solution. Use a proprietary graphical interface to ease the development of wireless applications from scratch: Covigo and iConverse offer this type of solution. Let the user do all the coding in languages such as C++ or Java: ThinAirApps offers this type of solution.

9 Handcrafting Labor-intensive. Expensive. Typically less than 1% of a web site gets converted to wireless content.

10 Transcoding Transcoding was introduced in Japan during 1999-2000. It was widely rejected by Japanese users. Recently, Google and Pixo introduced this solution for the US market, but have so far failed to attract the attention of end users.

11 The Alternate Solution Separate the content into smaller segments. Generate a summary of each segment. Prioritize the summaries of the individual segments. Put them together to form a summary of the overall document.
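The four steps on this slide can be sketched as a small pipeline. Every heuristic below is a placeholder assumption, not the authors' method: segments are taken to be blank-line-separated blocks of text, a segment's summary is its first sentence, and its priority is its word count. All function names are hypothetical.

```python
import re

def split_segments(text):
    """Step 1: separate content into segments (assumed: blank-line-separated blocks)."""
    return [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]

def summarize_segment(segment):
    """Step 2: low-level summary (placeholder: the first sentence, or the whole segment)."""
    match = re.match(r"(.+?[.!?])(\s|$)", segment, flags=re.S)
    return match.group(1) if match else segment

def priority(segment):
    """Step 3: importance estimate (placeholder: longer segments rank higher)."""
    return len(segment.split())

def summarize_document(text, max_items=3):
    """Step 4: put the top-ranked segment summaries together."""
    ranked = sorted(split_segments(text), key=priority, reverse=True)
    return [summarize_segment(s) for s in ranked[:max_items]]

page = (
    "Home | About | Contact\n\n"
    "The council approved the new budget on Monday. It takes effect in July. "
    "Critics want changes.\n\n"
    "Buy now! Limited offer."
)
for line in summarize_document(page, max_items=2):
    print(line)
```

A real system would replace each placeholder with something stronger, but the shape of the pipeline (segment, summarize, prioritize, assemble) stays the same.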

12 Summarization vs. Transcoding Problems with transcoded pages: long displays; long download times; finding information is difficult; no mapping of the importance of content in the original document.

13 Steps to Summarization Segmentation – a tree. Problems: –Tables –Frames –JavaScript –Graphics –Other artifacts –Over-segmentation –Under-segmentation –Poor coding –Browsers are too good! [Figure: segmentation tree with nodes CContent, CTable, CRow, CCol, etc.]
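A minimal sketch of segmentation into a tree, assuming the node names on the slide (written here as CContent, CTable, CRow, CCol) correspond to the document root and to the `table`, `tr`, and `td`/`th` tags. That mapping, and the `Node`, `SegmentTreeBuilder`, `build_tree`, and `outline` names, are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Node kinds borrowed from the slide's tree; the tag mapping is an assumption.
KIND_BY_TAG = {"table": "CTable", "tr": "CRow", "td": "CCol", "th": "CCol"}

class Node:
    def __init__(self, kind, parent=None):
        self.kind = kind
        self.parent = parent
        self.children = []
        self.text = []

class SegmentTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("CContent")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        kind = KIND_BY_TAG.get(tag)
        if kind:
            node = Node(kind, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        kind = KIND_BY_TAG.get(tag)
        # Close a node only when the end tag matches the currently open node;
        # mismatched end tags are ignored rather than repaired.
        if kind and self.current.kind == kind and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())

def build_tree(html_text):
    builder = SegmentTreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root

def outline(node, depth=0):
    """Return the tree as indented lines, one per node."""
    lines = ["  " * depth + node.kind]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines

tree = build_tree("<table><tr><td>A</td><td>B</td></tr></table><p>tail</p>")
print("\n".join(outline(tree)))
```

The sketch only behaves well on cleanly nested markup. Unclosed cells, nested frames, or script-generated content would produce the over- and under-segmentation problems the slide lists.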

14 Steps to Summarization Labeling: –Main Story –Links –Navigation Bars –Advertisement Bars –Other Stories –Forms –Images. Visual cues: size of font, headlines, boldness, color, links, flashing, italic (I), emphasized, underlines. Problems: graphics, OCR, JavaScript, CSS.
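One way to picture the labeling step is a rule-based classifier over simple per-segment counts. The sketch below uses some of the slide's label names, but the features (word count, words inside links, form and image presence) and every threshold are invented for illustration; the slides do not say how labels are actually assigned.

```python
def label_segment(words, link_words, has_form=False, image_count=0):
    """Assign a label to a segment using made-up thresholds (assumptions only)."""
    if has_form:
        return "Form"
    if words == 0:
        return "Image" if image_count else "Other"
    link_ratio = link_words / words
    if link_ratio > 0.8 and words <= 30:
        return "Navigation Bar"   # short and almost entirely links
    if link_ratio > 0.5:
        return "Links"            # mostly links, but longer
    if words >= 100:
        return "Main Story"       # long block of mostly plain text
    return "Other Story"

print(label_segment(words=10, link_words=10))    # Navigation Bar
print(label_segment(words=400, link_words=12))   # Main Story
```

Visual cues such as font size, boldness, or color could be added as further features in the same style, subject to the problems the slide notes with graphics, scripts, and CSS.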

15 Steps to Summarization Labeling => Segment Summary: extraction of a low-level summary of the segment. Priority: estimating the importance of these segments. Table of Contents (TOC) => Document Summary: putting together a summary of the document.

16 Conclusion Content can be used effectively to summarize web documents. Content summarization is more complex than textual summarization. HTML structure is a good starting point, but not enough to understand context. Summarization offers significant advantages over transcoding. Summarization also makes for a faster browsing experience. There is a lot of money in this!

