
1 Extracting tabular data from the Web

2 Limitations of the current BP screen scraper
- Parsing is done line by line.
- Pattern matching is not very accurate and is unpredictable.
- The code for fetching and parsing HTML pages has to be rewritten for each website (e.g. MSAMB in Maharashtra, Krishi Marata Vahini in Karnataka).
- Misplaced tags are not handled.

3 Characteristics of a solution to this problem
- Flexible.
- Unicode compliant.
- Smarter pattern matching: explore the structure of the HTML page rather than a single line at a time.

4 Possible Solutions

5 Solution 1
Step 1: Fetch data from the desired site.
Step 2: Tidy the HTML page.
Step 3: Construct the HTML DOM (Document Object Model) tree.
Step 4: Extract node information using the Document object.
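
Steps 3 and 4 can be sketched with the DOM classes that ship with the JDK. This is only an illustration: it assumes Steps 1 and 2 have already produced well-formed markup, and the class name, the sample page and its values are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomExtractSketch {
    // Hypothetical, already tidied page; a real page would come from Steps 1 and 2.
    static final String SAMPLE =
        "<html><head><title>Prices</title></head><body><table>"
      + "<tr><td>Pune</td><td>120</td></tr>"
      + "<tr><td>Nashik</td><td>95</td></tr>"
      + "</table></body></html>";

    // Step 3: build the DOM tree from well-formed markup.
    static Document buildDom(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse page", e);
        }
    }

    // Step 4: read node information through the Document object.
    static List<String> cellTexts(Document doc) {
        NodeList cells = doc.getElementsByTagName("td");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < cells.getLength(); i++) {
            out.add(cells.item(i).getTextContent().trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the text of every table cell in document order.
        System.out.println(cellTexts(buildDom(SAMPLE)));
    }
}
```

Because the tree is built from the whole page, the extraction no longer depends on how the markup happens to be split across lines.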

6 Solution 2
- Similar to Solution 1.
- Use XPath to locate data (Step 4).
- The relative position of nodes in the DOM tree is stored as an XPath.
- These XPaths are stored in the properties file instead of the entire table structure.
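
A minimal sketch of this idea, using the JDK's `java.util.Properties` and `javax.xml.xpath` classes. The property keys, the XPath expressions and the sample page are assumptions made for illustration; the slides do not show the actual properties file.

```java
import java.io.StringReader;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathPropertiesSketch {
    // Hypothetical tidied page with one header row and one data row.
    static final String SAMPLE_PAGE =
        "<html><body><table>"
      + "<tr><th>APMC</th><th>Low Rate</th></tr>"
      + "<tr><td>Pune</td><td>120</td></tr>"
      + "</table></body></html>";

    // Hypothetical properties content: one XPath per field of interest.
    static final String SAMPLE_PROPS =
        "market.name=/html/body/table/tr[2]/td[1]\n"
      + "market.lowRate=/html/body/table/tr[2]/td[2]\n";

    // Looks up the XPath stored under 'key' and evaluates it against the page.
    static String lookup(String page, String props, String key) {
        try {
            Properties p = new Properties();
            p.load(new StringReader(props));
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(page)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(p.getProperty(key), doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("Lookup failed for " + key, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lookup(SAMPLE_PAGE, SAMPLE_PROPS, "market.name"));
        System.out.println(lookup(SAMPLE_PAGE, SAMPLE_PROPS, "market.lowRate"));
    }
}
```

With this layout, supporting a new website would mean adding XPaths to the properties file rather than rewriting parsing code.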

7 Solution 3
- Tested a software product: screen-scraper (www.screen-scraper.com).
- Proxy server that allows the contents of HTTP and HTTPS requests to be viewed.
- Engine that can be configured to extract information from Web sites using special patterns and regular expressions.
- Embedded scripting engine that allows extracted data to be manipulated, written out to a file, or inserted into a database.
- It can be used with PHP, Java, or any COM-friendly language such as Visual Basic or Active Server Pages.
- Costs $90.
- No Unicode support.

11 Other Possible Solutions
- XMLize the HTML content.
- XML is more structured and well-formed.
- Data interchange between incompatible systems.
- Can use XSL and XSLT to convert from one form to another.
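
As a rough illustration of the XSLT route, the sketch below uses the JDK's `javax.xml.transform` API to turn a table in a well-formed page into comma-separated lines. The stylesheet, the sample page and the class name are assumptions, not part of the original slides.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltSketch {
    // Hypothetical page that has already been converted to well-formed XML.
    static final String PAGE =
        "<html><body><table>"
      + "<tr><td>Pune</td><td>120</td></tr>"
      + "<tr><td>Nashik</td><td>95</td></tr>"
      + "</table></body></html>";

    // Illustrative stylesheet: one output line per row, cells joined by commas.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:for-each select='//tr'>"
      + "<xsl:for-each select='td'>"
      + "<xsl:value-of select='.'/>"
      + "<xsl:if test='position() != last()'>,</xsl:if>"
      + "</xsl:for-each>"
      + "<xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the page and returns the text output.
    static String toCsv(String page, String stylesheet) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(page)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Prints one comma-separated line per table row.
        System.out.print(toCsv(PAGE, STYLESHEET));
    }
}
```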

12 Implementation

13 HTML scraper
The HTML scraper has 3 main steps:
1. Downloading the web page using crawlers like 'wget'.
2. Parsing and constructing the DOM tree.
3. Querying the DOM tree to retrieve the desired information and inserting it into the database.

14 Implementation
- Download the web page using
    wget --post-data="data" www.agmarknet.nic.in
  The page can be stored locally.
- Construct the DOM tree using the JTidy API.
    Tidy tidy = new Tidy();
- Parse the DOM tree.
    Document doc = tidy.parseDOM(htmlfile, null);

15 Query the DOM tree:
- Depth First Search (DFS) through the DOM tree, or
- Using the XPath APIs.
Store the HTML page structure in a file and use DFS, or store XPaths and use them for querying.
Insert into the database using JDBC.
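
The DFS option can be sketched as a recursive pre-order walk over the DOM that groups cell text by table row. This is an assumption about how such a traversal might look, not the presenters' code: the sample data row is invented, the page is assumed to be well-formed, and nested tables are not handled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DfsTableSketch {
    // Hypothetical tidied page; the data row values are made up.
    static final String SAMPLE =
        "<html><head/><body><table>"
      + "<tr><th>APMC</th><th>Arrivals</th><th>Variety</th>"
      + "<th>Low Rate</th><th>Mid Rate</th><th>High Rate</th></tr>"
      + "<tr><td>Pune</td><td>100</td><td>Local</td>"
      + "<td>500</td><td>600</td><td>700</td></tr>"
      + "</table></body></html>";

    // Parses the page and collects one list of cell texts per <tr>.
    static List<List<String>> rows(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            List<List<String>> rows = new ArrayList<>();
            visit(doc.getDocumentElement(), rows);
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse page", e);
        }
    }

    // Depth-first, pre-order: handle the node, then recurse into its children.
    static void visit(Node node, List<List<String>> rows) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            if (name.equals("tr")) {
                rows.add(new ArrayList<>()); // start a new row
            } else if ((name.equals("td") || name.equals("th")) && !rows.isEmpty()) {
                rows.get(rows.size() - 1).add(node.getTextContent().trim());
                return; // cell text captured; no need to descend further
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            visit(c, rows);
        }
    }

    public static void main(String[] args) {
        // Prints each row's cell texts as a list, header row first.
        for (List<String> row : rows(SAMPLE)) {
            System.out.println(row);
        }
    }
}
```

Each collected row could then be written to the database with a JDBC `PreparedStatement`.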

16 DOM tree of the parsed HTML page
[Figure: DOM tree with nodes html, head, table, tr; table column headers APMC, Arrivals, Variety, Low Rate, Mid Rate, High Rate]

19 Statistics
- Total time taken by the new parser is less than 15 seconds per page, whereas the old one takes more than 30 seconds.
- Daily data fetching time = (200 * 15) seconds = 3000 seconds (50 minutes).

20 Parsers (using DFS) for NIC and MSAMB (both English and Marathi) are ready.