Record-Boundary Discovery in Web Documents by Yuan Jiang December 1, 1998
Introduction n The amount of Web data has increased dramatically n Two commonly used methods to retrieve data from the Web u Browsing u Keyword Searching n Limitations u inefficient u easy to get lost while tracing links on the Webs u unwanted data n Database Retrieval Techniques n Problem: Most Web data is unstructured or semistructured, and cannot be queried using traditional query languages
A Sample Web Document Classifieds Funeral Notices - October 1, 1998 Lemar K. Adamson age 84, of Tucson, died September 30, 1998. He is survived by wife, Cindy; daughters, Elvia, Gloria, Irene, Isabel, Jewel, and Jessica; sons, Paul, John, Jeffery, and Louis; brothers, Kirk, Justin, Ivan, Hubert and Grover. Funeral service at 10:00 a.m. Monday, October 5, 1998 at Silverbell Ward, 1540 E. Linden. Burial in City Cemetery. Friends may call from 9:00 a.m. to 10:00 a.m. Monday, at the church. Arrangements by BRING'S MEMORIAL CHAPEL, 236 S. Scott Our beloved Brian Fielding Frost, age 41, passed away Tuesday morning, September 30, 1998, due to injuries sustained in an automobile accident. He was born January 12, 1957 in Salt Lake City, to Donald Fielding and Helen Glade Frost. He married Susan Fox on June 1, 1981. He is survived by Susan; sons Alfred, Joseph; parents, and two sisters, Anne (Dale) Elkins and Sally (Kent) Britton. Funeral services will be held at 12 noon Tuesday, October 6, 1998 in the Howard Stake Center, 350 South 1600 East. Friends may call 5-7pm. Monday at Carrillo's Tucson Mortuary, 3401 S. Highland Drive. Interment at Holy Hope Cemetery. Leonard Kenneth Gunther age 82. A resident of Tucson, passed away peacefully on September 30, 1998. He was born June 6, 1916 in Iowa. He joined the U.S. Navy serving during World War II. He remained a member of the U.S. Naval Reserve (USNR) for several years. He is survived by his wife, Gwendolyn; sons, Eric D. of San Francisco, CA, Vincent J. of Tucson; a daughter, Janet H. of Provo, UT; and one granddaughter, Sarah R. of Phoenix, AZ. Friends may call from 5:00 p.m. until 7:00 p.m. on Monday, October 5, 1998 at HEATHER MORTUARY, 1040 N. Columbus Blvd. Funeral services will be at 11:00 a.m. at HEATHER MORTUARY, on Tuesday, October 6, 1998. Burial will be private at South Lawn Cemetery. All material is copyrighted.
Building Wrappers Part of the problem is record separation n The record-identification task in wrapper construction is nontrivial n Previous Work u manually [AM97, GHR97, HGMC + 97] u semi-automatically [Ade98, AK97a, AK97b, DEW97, KWD97, Sod97] n Our Work u automatic with the following assumptions u the Web document F has multiple records F is in HTML F contains at least one record-separator tag
Heuristic for Locating Groups of Records Classifieds Funeral Notices - October 1, 1998 Lemar K. Adamson … BRING'S MEMORIAL CHAPEL, … Our beloved Brian Fielding Frost, … Howard Stake Center, … Carrillo's Tucson Mortuary... Leonard Kenneth Gunther … HEATHER MORTUARY, … HEATHER MORTUARY,... All material is copyrighted.
Highest-count Tags (HT) Individual Heuristic n Observation: the candidate tag with the most appearances is a likely separator. n Rank the candidate tags based on number of appearances.
Identifiable “Separator” Tags (IT) Individual Heuristic n Observation: Both hand-created and tool-generated HTML documents tend to consistently use some few common separator tags. n Use a pre-determined list of likely HTML separator tags: hr tr td a table p br h4 h1 strong b i
Standard Deviation (SD) Individual Heuristic n Observation: when multiple records about an entity appear in a document, the records are typically about the same size. n The candidate tag with the minimum standard deviation based on the size of the plain text between identical tags tends to be the separator.
Repeating-Tag Pattern (RP) Individual Heuristic n Observation: divisions between records often include several tags that consistently appear in the same order. n If an adjacent pair occurs at a record boundary and is the record separator, the count for this pair should be about the same as the count of number of occurrences of alone.
Ontology-Matching (OM) Individual Heuristic n Observation: One or more fields of a record (called record- identifying fields) appears once and only once in a record. n If we can locate a value for the field or even just an indication that the value exists, we can count the number of such occurrences. n To choose record-identifying fields, we u limit the number of fields to be at least 3 and no more than 20% of the number of sets of objects in the ontology; u choose fields with a 1-1 correspondence to the entity of interest over fields functionally dependent on the entity of interest; u choose keyword indicators over identifiable values.
Example: Ontology-Matching (OM) (Continue) Funeral Notices - October 1, 1998 Lemar K. Adamson... died September 30, 1998.... Funeral service at 10:00 a.m. Monday, October 5, 1998... Burial in City Cemetery. Arrangements by BRING'S MEMORIAL CHAPEL, … Our beloved Brian Fielding Frost, age 41, passed away Tuesday morning, September 30, 1998,... He was born January 12, 1957 … Funeral services will be held at 12 noon Tuesday, October 6, 1998 in the Howard Stake Center, … at Carrillo's Tucson Mortuary,... Interment at Holy Hope Cemetery. Leonard Kenneth Gunther... passed away peacefully on September 30, 1998. He was born June 6, 1916 … at HEATHER MORTUARY,... Funeral services will be at 11:00 a.m. at HEATHER MORTUARY, on Tuesday, October 6, 1998. Burial will be private at South Lawn Cemetery.
Combined Heuristic n Each individual heuristic is independent of the others but works well only for some particular Web documents. n Certainty Measure (Stanford certainty theory) u Suppose CF(E1) & CF(E2) are two certainty factors for the same observation B, then the compound certainty factor of B is CF(E1) + CF(E2) - CF(E1) * CF(E2). u By using this rule repeatedly, it is possible to combine the results of evidence from any number of independent events.
Experimental results for all the combined heuristics
Record-Boundary Discovery Algorithm Input: A Web document D. Output: The record separator of D. Step1: Create the tag tree T of D. Step2: Locate the highest-fan-out subtree HF in T. Step3: Extract the set of candidate tags CT from HF. Step4: Apply the five individual heuristics OM, SD, IT, HT, and RP to CT. Step5: For each candidate tag in CT, apply Stanford certainty theory to the results of all five heuristics (ORSIH). Step6: Choose the candidate tag with the highest compound certainty factor as the record separator for D.
Example n The results of applying the five individual heuristics to the sample document presented earlier are as follows: OML = [(hr, 1), (br, 2), (b, 3)] RPL = [(hr, 1), (br, 2), (b, 3)] SDL = [(hr, 1), (b, 2), (br, 3)] ITL = [(hr, 1), (br, 2), (b, 3)] HTL = [(b, 1), (br, 2), (hr, 3)] n Combining these five individual heuristics together yields: ORSIH: [(hr, 99.96%), (b, 64.75%), (br, 56.34%)] n Hence, ‘hr’ is chosen as the record separator.
Conclusions n We described a heuristic approach to discover record boundaries in unstructured Web documents. n Main contribution: we provided a set of individual heuristics and a way to combine these heuristics into a method for discovering record boundaries. n Under normal assumptions, the process is O(n), where n is the size of a document. n The experiments we conducted showed that this approach uniformly attained an accuracy of 100%.