
Content Extraction from HTML Documents A. Rahman H. Alam R. Hartono Document Analysis and Recognition Team (DART) BCL Computers Inc. Santa Clara, Calif, USA

Current need? Viewing websites on small-screen handheld devices. Since web sites are written in HTML, we need to translate them into formats that wireless devices can support.

Current Solutions Handcrafting: –Custom web sites are typically crafted by hand by a set of content experts. Transcoding: –Transcoding replaces HTML tags with suitable device-specific tags (HDML, WML, etc.).
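As a rough illustration of what tag-for-tag transcoding involves, here is a minimal sketch using Python's standard `html.parser` module. The `TAG_MAP` table and the `transcode` helper are hypothetical names introduced for this example. The mapping covers only a handful of tags, and the output is a WML-like fragment rather than a complete, valid WML deck.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Hypothetical HTML -> WML-like tag mapping (illustrative, far from complete).
TAG_MAP = {
    "p": "p", "b": "b", "strong": "b", "i": "i", "em": "i",
    "h1": "big", "h2": "big", "a": "a",
}


class ToyTranscoder(HTMLParser):
    """Rewrites mapped tags and drops unmapped ones, keeping their text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        target = TAG_MAP.get(tag)
        if target is None:
            return  # unmapped tag: emit nothing, its text still passes through
        if target == "a":
            href = dict(attrs).get("href") or ""
            self.out.append("<a href=%s>" % quoteattr(href))
        else:
            self.out.append("<%s>" % target)

    def handle_endtag(self, tag):
        target = TAG_MAP.get(tag)
        if target is not None:
            self.out.append("</%s>" % target)

    def handle_data(self, data):
        # Note: script/style bodies are not filtered in this sketch.
        self.out.append(escape(data))


def transcode(html):
    parser = ToyTranscoder()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(transcode('<h1>News</h1><div><strong>Hi</strong> <a href="/x">more</a></div>'))
```

Every piece of text in the page survives the conversion; only the markup changes.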

Handcrafting Automation –Use of XML. There is no standard XML tagset (Document Type Definition – DTD) in use by vendors. XML has been available to web designers for the last 10 years. Examination of websites shows little use of document structural elements. –Web masters see themselves as artists rather than programmers. –XML may meet the same fate as SGML, an earlier attempt to create structured documents.

Handcrafting Take an existing website and make it available for wireless access. Aether Systems, Mshift and 2Roam currently offer these types of solutions. Use a proprietary graphical interface to ease the development of wireless applications from scratch. Covigo and iConverse offer these types of solutions. Let the user do all coding in languages such as C++ or Java. ThinAirApps offers this type of solution.

Handcrafting Labor intensive. Expensive. Typically less than 1% of a web site gets converted to wireless content.

Transcoding Most web pages have a loose repeating visual structure. The wireless user gets the same repeating information with every screen. Browsing is an unfriendly experience. Transcoding sends all the information to the wireless device, making browsing substantially slower on the wireless network.

Transcoding Transcoding was introduced in Japan during 1999-2000. It was widely rejected by Japanese users. Recently, Google and Pixo introduced this solution for the US market, but have so far failed to attract the attention of end users.

The Alternate Solution Separate the content into smaller segments. Generate a summary of each segment. Prioritize the summaries of the individual segments. Put them together to form a summary of the overall document.

Steps to Content Extraction Structural analysis: Understanding the relationship of the various segments to the document. Decomposition: Breakdown of these segments into operational units. Contextual Analysis: Employment of context to revise the segmentation. (Continued=>)

Steps to Content Extraction (Continued) Labeling => Segment Summary: Extraction of a low level summary of the segment Priority: Estimating importance of these segments Table of Content (TOC) => Document Summary: Putting together a summary of the document
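The steps listed above can be sketched end to end in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation. The names `BLOCK_TAGS`, `Segmenter`, `label`, `priority` and `table_of_contents` are all hypothetical. Segments are cut at block-level tags, the label is simply the first few words, and priority is approximated by word count. The contextual analysis step is omitted.

```python
from html.parser import HTMLParser

# Assumed set of tags that mark segment boundaries (illustrative choice).
BLOCK_TAGS = {"div", "p", "td", "li", "h1", "h2", "h3", "table", "ul", "form"}


class Segmenter(HTMLParser):
    """Structural analysis and decomposition: split text at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._buf = []

    def _flush(self):
        text = " ".join("".join(self._buf).split())  # normalize whitespace
        if text:
            self.segments.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)


def segment(html):
    parser = Segmenter()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text outside block tags
    return parser.segments


def label(text, max_words=5):
    """Labeling: a low-level summary, here just the leading words."""
    words = text.split()
    head = " ".join(words[:max_words])
    return head + (" ..." if len(words) > max_words else "")


def priority(text):
    """Priority: assumed proxy where longer segments rank higher."""
    return len(text.split())


def table_of_contents(html):
    """TOC: segment labels ordered from highest to lowest priority."""
    ranked = sorted(segment(html), key=priority, reverse=True)
    return [label(s) for s in ranked]


page = ('<div><a href="/">Home</a> <a href="/s">Sports</a></div>'
        '<h1>Storm hits coast</h1>'
        '<p>A severe storm reached the coast early on Monday morning.</p>')
for entry in table_of_contents(page):
    print(entry)
```

A device could show only these short labels first and fetch the full text of a segment when the user selects it.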

Content Extraction Proximity Analysis: Relational analysis of content between segments. Content Classification: Classification into various types, i.e. [stories], [navigation], [links], [images], [forms] etc. Relationship Analysis –Contextual grammar (Natural Language) –Knowledge modes –Information retrieval techniques
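One simple way to picture the content classification step is a rule-based classifier over a few counts taken from each segment. The sketch below is an assumption-laden illustration, not the method described in the presentation. The `FeatureCounter` class, the `classify` function and every threshold in it are hypothetical.

```python
from html.parser import HTMLParser


class FeatureCounter(HTMLParser):
    """Counts links, images, form fields and words in an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = self.images = self.fields = 0
        self.words = self.link_words = 0
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1
            self._in_link += 1
        elif tag == "img":
            self.images += 1
        elif tag in ("input", "select", "textarea"):
            self.fields += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        n = len(data.split())
        self.words += n
        if self._in_link:
            self.link_words += n


def classify(fragment):
    """Returns one of: forms, images, navigation, links, stories."""
    c = FeatureCounter()
    c.feed(fragment)
    c.close()
    if c.fields:
        return "forms"
    if c.images and c.words == 0:
        return "images"
    # Assumed rule: a segment is link-dominated if at least half its words
    # sit inside anchors.
    if c.links and 2 * c.link_words >= c.words:
        # Assumed rule: very short anchor texts suggest a navigation bar.
        return "navigation" if c.link_words / c.links <= 2 else "links"
    return "stories"


print(classify('<a href="/">Home</a> <a href="/n">News</a>'))
print(classify('<p>A long paragraph of story text with '
               '<a href="/x">one link</a> inside it.</p>'))
```

In practice such rules would need tuning against real pages, and the relationship analysis techniques listed above would refine the result.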

Content Extraction: Why do we need it? Viewing any website: Any solution to web browsing has to be universal. High network access: Any transformation has to be fast and on-the-fly. Network Usage: Network traffic should not increase because of these systems. (Continued=>)

Content Extraction: Why do we need it (continued)? Easy Configurability: Any such system should be easily configurable. Rapid Deployment: Should be rapidly deployable. Non-intrusive Design: Should be possible to transform web sites without modifying the actual web site. Multiple Views: System integrators should be able to create multiple views of the same site.

Advantages of Content Extraction Displays size Locating information Important content can be on top Multiple levels of abstraction can be created The browsing can use a demand-driven model Faster download More efficient use of small display areas Mapping of the importance of content from the original document

Supported Devices and Formats PDAs (HTML 3.2) Cell phones –USA/Europe: WAP –Japan: iMode (NTT DoCoMo), J-Sky (J-Phone), EZWeb (KDDI)

Conclusion Content from web documents can be extracted based on the –HTML structure –Proximity analysis –Logical relationship analysis –Information retrieval techniques. Content can be used effectively to summarize web documents –Better option compared to handcrafting or transcoding –Produces a faster browsing experience
