Wrappers Kapowtech RoboSuite 6.0 Team number 10 – Vampires, meeting number 2
Documents
- Lixto WhitePaper
- Wrapper Development Tools
- Piggy Bank
- WebVCR
- Kapow RoboSuite Documentation
Why wrappers?
- HTML is used to display data
- the data is stored inside the HTML
- the Web is designed for human consumption, even when its content was derived from a well-defined database
- wrapper: a robot that browses the Web and extracts data
Applications
- online price comparisons
- automatic stock market surveillance
- personalized online news
- flight tickets
- job search
- competitors advantage
- research of a new technology
- ...
Lixto WhitePaper (presented by Duri; table on the next slide)
- Comparison of wrappers, programming languages and by-hand conversion
- Criteria such as learning time, expressive power, user friendliness, ...
Comparison of wrappers, programming languages and by hand conversion
Wrapper Development Tools
3 main functions:
- download HTML pages from a website
- search for, recognize and extract data
- save the extracted data in a suitable format, such as XML, XLS or a database, for later import into other applications
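The three functions above can be illustrated with a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the sample page, the `class` attributes used to recognize fields, and the names `PriceTableParser`, `extract` and `to_xml` do not come from any of the tools discussed. The download step is replaced by an inline HTML string so that the example runs offline; a real wrapper might fetch the page with `urllib.request.urlopen` instead.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Stand-in for a downloaded page (hypothetical markup).
PAGE = """
<html><body><table>
<tr><td class="name">Widget</td><td class="price">9.99</td></tr>
<tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
</table></body></html>
"""


class PriceTableParser(HTMLParser):
    """Collects one dict per <tr>, keyed by the class of each <td>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # class of the <td> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append({})
        elif tag == "td":
            self._field = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None

    def handle_data(self, data):
        # Only text inside a classified <td> is treated as data.
        if self._field and self.rows:
            row = self.rows[-1]
            row[self._field] = row.get(self._field, "") + data.strip()


def extract(html):
    """Recognize and extract the records from the page."""
    parser = PriceTableParser()
    parser.feed(html)
    parser.close()
    return [row for row in parser.rows if row]


def to_xml(rows):
    """Save the extracted records as an XML string."""
    root = ET.Element("items")
    for row in rows:
        item = ET.SubElement(root, "item")
        for key, value in row.items():
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_xml(extract(PAGE)))
```

The extraction rule here is tied to the page's markup, which is why a change in the site's layout usually requires the wrapper to be adapted.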
Wrapper Development Tools
Non-commercial tools:
- most of them were developed at universities
- output data: mainly text and XML
- most of them offer an API
- most of them are implemented in Java and are open source
- most of them offer Web crawling
- some of them offer a GUI
- just a few offer an editor (regular expressions, ontologies)
Wrapper Development Tools
Commercial tools:
- most of them were developed by commercial companies
- output data: mainly XML, tables and text
- most of them offer database connectivity
- most of them offer Web crawling
- most of them offer an API
- all of them offer a GUI
- most of them offer an editor (regular expressions, Perl, VBScript, ...)
Piggy Bank
An extension for the Firefox Web browser that turns it into a Semantic Web browser. It lets users:
- combine information from several web sites and browse it all together
- save information found on the Web
- tag each saved item
- share saved information
- browse and search through an existing web site
Piggy Bank – Applications
- You are meeting friends and want to locate a restaurant with Chinese cuisine that is close to your favorite coffee shop with a wireless network.
- You are moving to a new place and are looking for an apartment close to a school and a subway station, away from crime hotspots, near a hospital, ...
Piggy Bank – How it works
- semantic web
- RDF model
- XML information
- screen scraper
Piggy Bank Example
WebVCR
- smart bookmarks: shortcuts to Web content that requires multiple steps to be retrieved (hard-to-reach Web content)
- VCR style: record and replay the user's actions, optionally browsing through the recorded steps
- no programming required from the user, just usual browsing
WebVCR – application: navigation on travelocity.com
Juliana plans to attend the WWW9 conference and is looking for flights from Newark to Amsterdam that leave Newark on May 14th and return from Amsterdam on May 20th. She must take the following steps:
- go to
- choose the Find/Book a Flight option
- log in
- specify the details of the itinerary
- produced address:
WebVCR – how to cope with changes
Changes do not pose a problem to a user browsing the Web, since the user can easily determine which link to follow, but they do present a challenge to a system that performs automatic navigation.
- Attempt to locate a link in the last retrieved page corresponding to the DOM location stored in the current smart bookmark step. If the link exists, the target of the link matches the bookmark, and either the URL or the text of the retrieved link matches the step, then use that link.
- Otherwise, if there is a unique link in the page whose target, URL, and text match those of the stored link, use that link.
- Otherwise, if there is a unique link in the page whose target and URL match those of the stored link, use that link.
- Otherwise, if there is a unique link in the page whose target and text match those of the stored link, use that link.
WebVCR – how to cope with changes
Otherwise, if the link corresponds to a CGI-bin script (e.g., it contains a "?"), then find all links that match the stored URL up to the first occurrence of a "?" and store them in a set of candidate links, denoted L. Eliminate any elements of L whose parameter names do not match the stored version. For instance, if the stored URL is then matches, but does not, since it has a parameter named z that does not appear in the stored version. For each parameter in the stored version whose value matches the corresponding parameter value in at least one element of L, eliminate all elements of L with a non-matching value for the same parameter.
WebVCR – how to cope with changes
If L is a singleton set, use that element. Otherwise, the playback can either be aborted, or the link present at the recorded DOM location can be used to try to proceed through the playback (our implementation uses the latter). However, the playback might fail later in the sequence, or the sequence might traverse pages different from what the user had recorded.
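The matching cascade described on the last three slides can be sketched in Python as follows. This is an illustrative reading of the slides, not WebVCR's actual code: the `Link` record, the function names, and the modelling of a "DOM location" as an index into the page's list of links are all assumptions. "Parameter names match" is interpreted here as the two URLs having exactly the same set of parameter names, and a stored URL without a "?" is assumed to fall straight through to the final fallback.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import parse_qsl


@dataclass(frozen=True)
class Link:
    target: str  # frame or window the link loads into
    url: str
    text: str


def _unique(links: List[Link]) -> Optional[Link]:
    return links[0] if len(links) == 1 else None


def _params(url: str) -> dict:
    # Query parameters after the first "?", as a name -> value dict.
    return dict(parse_qsl(url.partition("?")[2], keep_blank_values=True))


def _cgi_candidates(stored: Link, page: List[Link]) -> List[Link]:
    base = stored.url.partition("?")[0]
    wanted = _params(stored.url)
    # Candidate set L: same URL up to "?", same parameter names (assumed: same set).
    cands = [l for l in page
             if "?" in l.url and l.url.partition("?")[0] == base
             and set(_params(l.url)) == set(wanted)]
    # Filter on a stored value only if at least one candidate still has it.
    for name, value in wanted.items():
        if any(_params(l.url)[name] == value for l in cands):
            cands = [l for l in cands if _params(l.url)[name] == value]
    return cands


def find_link(stored: Link, dom_index: int, page: List[Link]) -> Optional[Link]:
    at_loc = page[dom_index] if 0 <= dom_index < len(page) else None
    # Rule 1: link at the recorded location, same target, URL or text matches.
    if at_loc is not None and at_loc.target == stored.target and (
            at_loc.url == stored.url or at_loc.text == stored.text):
        return at_loc
    # Rules 2-4: unique link matching target plus URL and text, URL, or text.
    same_target = [l for l in page if l.target == stored.target]
    for keep in (
        lambda l: l.url == stored.url and l.text == stored.text,
        lambda l: l.url == stored.url,
        lambda l: l.text == stored.text,
    ):
        hit = _unique([l for l in same_target if keep(l)])
        if hit is not None:
            return hit
    # CGI rule: use the candidate set L only if it is a singleton.
    if "?" in stored.url:
        hit = _unique(_cgi_candidates(stored, page))
        if hit is not None:
            return hit
    # Fallback: whatever is at the recorded location (None means abort).
    return at_loc


if __name__ == "__main__":
    stored = Link("_top", "http://example.com/cgi/search?q=flights&page=1", "Search")
    page = [
        Link("_top", "http://example.com/home", "Home"),
        Link("_top", "http://example.com/cgi/search?q=flights&page=2", "Next results"),
        Link("_top", "http://example.com/cgi/search?q=hotels&page=2&z=1", "Hotels"),
    ]
    print(find_link(stored, 5, page))
```

In the usage example the recorded location no longer exists and no link matches by URL or text, so the choice falls to the CGI rule: the third link is dropped because of its extra parameter `z`, and the remaining candidate is kept even though its `page` value differs, because no candidate carries the stored value for that parameter.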
WebVCR – problems
- HTTP authentication: some user actions cannot be recorded in the client. It is not possible to detect when HTTP authentication takes place, and the values entered by the user are not available through the DOM API.
- State information: cookies. The login and password are entered only the first time; after that, access goes straight through via cookies.
- Signed applets
- Automatic refresh: they assume that auto refresh takes place
- Microsoft IE limitations