Deep Web Exploration
Dr. Ngu, Steven Bauer, Paris Nelson (REU-IR)
This research is funded by the NSF REU program.

Abstract

With more and more information going online, extracting and managing information from the Internet is becoming increasingly important. While information on the surface Web is relatively easy to obtain thanks to search engines such as Google and Bing, collecting information from the deep Web is still a challenging task: these search engines do not index information located inside the deep Web. Compared to the surface Web, the deep Web contains vastly more information. In particular, building a generalized search engine that can index and search the deep Web across all domains remains a difficult research problem. In this paper, we highlight these challenges and demonstrate, via a prototype implementation, a generalized deep Web discovery framework. In particular, we describe our methods for automatic submission and extraction of results, which enable users to query and explore the deep Web.

Our Submission Technique

Our automated submission technique relies on a cluster of partial-match terms per domain. Building these clusters requires manually visiting a small number of high-quality web sites to find common terms. Once the web page is loaded, we scrape all of the forms on it. Each domain has a minimum set of required input fields before we can submit; for people search, these are first name, last name, and location. We search the extracted forms for one that contains all of the required fields, based on partial matches against each element's ID and name. If we can find all of the required elements except one, we brute-force the remaining elements in that form with the query data and test whether we have reached a results page. Once we have reached the results page, we catalog the names and IDs of the elements used to submit so that subsequent queries can go directly to those elements.

Results

We ran our crawler, which returns 100 potential sites for a given domain, then ran our program on that list of 100 sites and recorded the results. The table below displays the results for a run performed on the people-search domain. True positives are sites that we submitted to and confirmed as people sites that actually were people sites. False positives are sites that we submitted to and confirmed as people sites that were actually something different. True negatives are sites that we could not submit to that were not people sites. False negatives are sites that we could not submit to that were people sites.

  Outcome            Rate    Sites
  True positives     100%    11
  False positives    0%      0
  True negatives     85%     75
  False negatives    15%     13

As can be seen from the data, there is definite room for improvement in the submission technique. We assessed why we missed the 14 people sites and determined the following:
- 6 of the sites used drop-down boxes instead of standard text boxes, a functionality we had yet to implement.
- 3 of the sites used one box for the full name, while our submission technique expected one text box for first name and another for last name.
- The remaining sites could have been submitted successfully with the addition of a few more partial terms for finding boxes, owing to the unique names different sites give to the same boxes.
With more time to expand the project, these three considerations could easily be accommodated. Similar results were obtained when the program was run on the flight and hotel domains, with similar causes.
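The partial-match form search described under Our Submission Technique might look roughly like the sketch below. It assumes BeautifulSoup for HTML parsing, and the term clusters shown are illustrative placeholders rather than the project's actual per-domain clusters.

```python
from bs4 import BeautifulSoup

# Hypothetical partial-match term clusters for the people-search domain.
# The real clusters are built by manually inspecting high-quality sites.
REQUIRED_FIELDS = {
    "first_name": ("first", "fname", "given"),
    "last_name": ("last", "lname", "surname"),
    "location": ("city", "state", "location", "zip"),
}

def find_candidate_form(html):
    """Return (form, matched_inputs) for the form whose input elements
    partially match the most required fields by ID or name."""
    soup = BeautifulSoup(html, "html.parser")
    best_form, best_matches = None, {}
    for form in soup.find_all("form"):
        matches = {}
        for inp in form.find_all(["input", "select"]):
            ident = ((inp.get("id") or "") + " " + (inp.get("name") or "")).lower()
            for field, terms in REQUIRED_FIELDS.items():
                if field not in matches and any(t in ident for t in terms):
                    matches[field] = inp
        if len(matches) > len(best_matches):
            best_form, best_matches = form, matches
    return best_form, best_matches

# If all required fields but one are matched, the remaining inputs in the
# winning form would be brute-forced with the query data, as described above.
```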
Conclusion

Because of the low number of successful submissions, we were unable to get a good sample size for our extraction methods. Our techniques are general enough that, once submission achieves a higher success rate through minor modifications, they should support a large number of sites. What we have concluded is that our submission can be greatly improved with the addition of a few simple functionalities, and this would be reflected in the success of our extraction techniques as well. Our technique and framework have laid a solid foundation that future work may expand upon.

Introduction

Over the past decade, the Internet has become an important source of information for many aspects of human life, and it has always been important for web users to be able to extract the information they need from the Internet quickly and easily. Currently, most users get the information they need from the surface Web. This information is considered "static", meaning the same query always returns the same set of answers. However, there is much more information out there that is not indexed by search engines, known as the deep Web. Unlike the surface Web, the information in the deep Web is dynamic: it is generally generated from a source (e.g., a database, file, or application) which may change constantly. The deep Web is an important information source, since it has been suggested that it contains 400-550 times the information of the surface Web. Currently, to query any deep Web source, a user has to manually find its HTML search forms, input a query, and submit it. These forms are typically used to quickly search for a particular page at that specific site. Our goal is to create a web service that automatically queries across these deep Web databases and returns all of the results to the user.

Extraction Challenges

Result pages have no set structure. Web designers attempt to create a visually appealing interface, which is often very complex and difficult for a machine to read. Even when a group of websites uses tables, they are all configured differently, and many websites have several levels of nested tables. Data can be named arbitrarily, which makes it difficult to identify that two columns from different web sites carry the same information. It is difficult to discern the relevant data from all of the other text on a webpage, especially when trying to grab data beyond the generic data provided by most websites. Oftentimes some information is available for certain entries but not for others, causing alignment issues.

Automated Submission Challenges

Websites have no set structure, and the input boxes and buttons that we need to identify can be, and usually are, named almost arbitrarily. It is difficult to recognize when a submission has succeeded rather than simply reached an error page. Complex websites sometimes use JavaScript and Flash for their input interfaces, which are difficult, if not impossible, to parse. Sometimes it is possible to find the majority of the input boxes being sought, but one cannot be matched. Often websites use an indirection/loading page in between the input page and the results page.
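The submission technique submits query data and then tests whether a results page was reached, which the challenges above note is hard to recognize. The poster does not spell out its exact test; the sketch below is only one plausible heuristic, offered as an assumption rather than the authors' method: treat the submission as successful when the page shows no obvious error marker and echoes at least one of the submitted query values.

```python
import re

# Rough heuristic for "did we reach a results page?" (an assumption for
# illustration; not the project's actual test).
ERROR_MARKERS = ("no results found", "0 results", "page not found", "try again")

def looks_like_results_page(page_text, query_values):
    """Treat a submission as successful if the page shows no obvious error
    marker and echoes at least one of the submitted query values."""
    text = re.sub(r"\s+", " ", page_text).lower()
    if any(marker in text for marker in ERROR_MARKERS):
        return False
    return any(value.lower() in text for value in query_values)

# Example: looks_like_results_page(html_text, ["john", "smith", "austin"])
```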
Extraction Techniques

In order to maximize the number and types of sites we can extract from, we have created three separate extraction methods, each designed to retrieve information in a distinct format.

Repeated Data Extractor
- Recursively traverse the results page, attempting to locate every web element that has more than 4 sub-elements sharing the same class name or ID, where those sub-elements also make up at least 75% of the parent's total elements.
- Out of the found elements, the parent element with the largest number of children is our extraction candidate. Recursively iterate through this element's children, extracting the text values and matching them up with their class name or ID.
- After all of the text is extracted, we attempt to filter out irrelevant information by throwing out groups of elements that are larger or smaller in number than the rest of our extracted elements. Next we filter out all element groups that repeat the same text data over and over. This step could throw out fields such as name, because unless the results have different middle names they will all be the same; we overcome this by keeping anything that matches our query.
- Once we have finished filtering the results, we use our pattern-matching technique to identify known data types and change their headers to our common terms.

[Figure: an example results page displaying the structure supported by the RDE]
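The repeated-structure search at the heart of the Repeated Data Extractor can be sketched as follows. The thresholds (more than 4 matching children covering at least 75% of the parent) come from the description above; the use of BeautifulSoup and the grouping key are assumptions made for illustration.

```python
from collections import Counter
from bs4 import BeautifulSoup

def find_extraction_candidate(html):
    """Find the parent element whose direct children most strongly repeat a
    class name or ID: more than 4 repeats that make up at least 75% of the
    parent's child elements (the Repeated Data Extractor heuristic)."""
    soup = BeautifulSoup(html, "html.parser")
    best, best_count = None, 0
    for parent in soup.find_all(True):                      # every tag on the page
        children = parent.find_all(True, recursive=False)   # direct children only
        keys = [" ".join(c.get("class", [])) or (c.get("id") or "") for c in children]
        keys = [k for k in keys if k]
        if not keys:
            continue
        key, count = Counter(keys).most_common(1)[0]
        if count > 4 and count >= 0.75 * len(children) and count > best_count:
            best, best_count = parent, count
    return best

# The candidate's children would then be walked recursively, their text grouped
# by class/ID, and the groups filtered as described above.
```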
Table Extractor
- Uses the same traversal, filtering, and query-matching steps as the Repeated Data Extractor.

[Figure: an example results page displaying a table supported by our Table Extractor]

Annotated Extractor
- Uses predefined regular expressions for expected values to identify data on a results page (for flights, expected values would include the flight number and price).
- Once these data have been found, they are annotated with tags so that they can easily be referred to and retrieved later.
- The data are extracted one at a time until an instance of each has occurred, and the order in which they were found is recorded for extracting the rest of the data. That way, if one row of data is missing a datum, the extractor will recognize this and account for it.

Data Recognition

Regardless of the extraction method, the extracted data is processed here. This program uses regular expressions for data types and small clusters of terms to identify data headers. Data headers that are recognized are presented using the existing cluster keyword. If the header name is not recognized but the data matches the data requirements of a known cluster, the new header is added to that cluster for future recognition. If the header name is not recognized and the data matches a regular expression whose data type is not associated with any known cluster, then a new cluster is created with that header name as the keyword. Any misalignments in the table are also corrected by the program.
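A minimal sketch of the Data Recognition flow just described. The specific clusters and regular expressions below are invented placeholders; only the control flow (recognize a known header, otherwise test the column data against known clusters and learn the header, otherwise create a new cluster keyed by the header) follows the description above.

```python
import re

# Illustrative clusters: keyword -> (known header terms, regex the column data must match).
# These particular clusters and patterns are placeholders, not the project's actual ones.
CLUSTERS = {
    "phone": ({"phone", "telephone", "tel"}, re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[.-]?\d{4}")),
    "price": ({"price", "fare", "cost"}, re.compile(r"\$?\d+(\.\d{2})?")),
}

def recognize_header(header, column_values):
    """Map a column header to a common cluster keyword, learning new headers
    and creating new clusters as described under Data Recognition."""
    h = header.strip().lower()
    # Case 1: the header is already a known term of some cluster.
    for keyword, (terms, _pattern) in CLUSTERS.items():
        if h in terms:
            return keyword
    # Case 2: unknown header, but the column data fits a known cluster's pattern.
    for keyword, (terms, pattern) in CLUSTERS.items():
        if column_values and all(pattern.fullmatch(v.strip()) for v in column_values):
            terms.add(h)            # remember this header for future recognition
            return keyword
    # Case 3: unknown header and unknown data type: start a new cluster.
    CLUSTERS[h] = ({h}, re.compile(r".+"))
    return h

# Usage: recognize_header("Telephone", ["512-555-0100", "512-555-0199"]) -> "phone"
```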

