(Big) data accessing Prof. Wenwen Li School of Geographical Sciences and Urban Planning 5644 Coor Hall


1 (Big) data accessing. Prof. Wenwen Li, School of Geographical Sciences and Urban Planning, 5644 Coor Hall, wenwen@asu.edu

2 Outline 1. Concepts 2. Stand-alone based data access 3. Web based data access 4. Web crawler 5. Semantic-based data access 6. Summary

3 Concepts: Definition. Data access typically refers to software and activities related to storing, retrieving, or acting on data housed in a database or other repository.* Data access also refers to a user's ability to access or retrieve data stored within a database or other repository.** * http://en.wikipedia.org/wiki/Data_access ** http://www.techopedia.com/definition/26929/data-access

4 Concepts

5 Concepts: Types of data access.* There are two fundamental types of data access. Sequential access: data are read in a predefined order, one item after another, e.g. video. Random access: data at any given location can be read directly, e.g. * http://en.wikipedia.org/wiki/Sequential_access#/media/File:Random_vs_sequential_access.svg
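The difference between the two access types can be sketched in a few lines of Python. This is only an illustration: the fixed-width record layout, the `RECORD_SIZE` constant, and the in-memory buffer standing in for a file are assumptions, not part of the slides.

```python
import io

# Hypothetical fixed-width records; an in-memory buffer stands in for a file.
RECORD_SIZE = 4
data = io.BytesIO(b"rec0rec1rec2rec3rec4")

def read_sequential(stream, index):
    """Reach record `index` by reading every record that precedes it."""
    stream.seek(0)
    record = b""
    for _ in range(index + 1):
        record = stream.read(RECORD_SIZE)
    return record

def read_random(stream, index):
    """Jump straight to record `index` by computing its byte offset."""
    stream.seek(index * RECORD_SIZE)
    return stream.read(RECORD_SIZE)

print(read_sequential(data, 3))  # b'rec3', after reading records 0 to 2 first
print(read_random(data, 3))      # b'rec3', after a single seek
```

Both calls return the same record; the cost differs because sequential access grows with the position of the record, while random access does not.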

6 Stand-alone data access: Features. Different data repositories are accessed through authorization; users and administrators are distinguished; the repository is managed by an administrator; local database; security; query.
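A minimal sketch of these features, assuming a local SQLite database and an application-level role table. The `PERMISSIONS` mapping, the `stations` table, and the `run` helper are hypothetical names chosen for illustration; SQLite itself has no user accounts, so the authorization check here lives in the application code and trusts the caller to declare the action honestly.

```python
import sqlite3

# Hypothetical role table (an assumption for illustration only).
PERMISSIONS = {"admin": {"read", "write"}, "user": {"read"}}

def open_repository():
    """Create a small local, in-memory database to stand in for a repository."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stations (name TEXT, elevation_m REAL)")
    conn.executemany("INSERT INTO stations VALUES (?, ?)",
                     [("A", 350.0), ("B", 1210.5)])
    conn.commit()
    return conn

def run(conn, role, action, sql, params=()):
    """Execute `sql` only if `role` is authorized for the declared `action`."""
    if action not in PERMISSIONS.get(role, set()):
        raise PermissionError(f"role {role!r} may not {action}")
    cursor = conn.execute(sql, params)
    if action == "write":
        conn.commit()
        return cursor.rowcount
    return cursor.fetchall()

conn = open_repository()
# A regular user may query the data ...
print(run(conn, "user", "read",
          "SELECT name FROM stations WHERE elevation_m > ?", (1000,)))
# ... but only the administrator may change it.
run(conn, "admin", "write", "INSERT INTO stations VALUES (?, ?)", ("C", 12.0))
```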

7 Web-based data access: Why? Large amounts of data have been gathered and published on the Internet since the 1990s. Huge volumes of rapidly expanding data and ever-changing experimental and simulation results are largely disconnected from each other. The communities that create datasets are distributed by nature.

8 Web-based data access: Key components. Data repository: stores and manages the data. Open standards: make data from diverse sources interoperable. Clearinghouse network: enables interaction between data producers and data users. Policies: regulate data access and licensing, protect privacy, and provide custodianship at all administrative levels. People: data users, domain experts, and others.

9 Web-based data access * http://www.resc.rdg.ac.uk/twiki/bin/view/Resc/GeoServer

10 Web-based data access: Web indexing. Web indexing (or Internet indexing) refers to various methods for indexing the contents of a website or of the Internet as a whole.* For individual websites: a “back-of-the-book” style index (a website A-Z index). For search engines: keywords and metadata. Web search engine: a web search engine aims to discover and search information from web resources. Searchable results include images, web pages, and so on. Real-time search is implemented with a crawler; web pages are found through web indexing. * http://www.opengeospatial.org/standards
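Keyword indexing for a search engine can be sketched as an inverted index that maps each word to the pages containing it. This is a simplified illustration under stated assumptions: the `example.org` URLs and page texts are made up, and real engines add metadata, ranking, and stemming that are left out here.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map every word to the set of page URLs in which it occurs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages, used only to exercise the functions.
PAGES = {
    "http://example.org/a": "Web map service for land cover",
    "http://example.org/b": "Land cover data download",
    "http://example.org/c": "Web crawler notes",
}
index = build_index(PAGES)
print(sorted(search(index, "land cover")))
```

A query is answered by intersecting the sets for its words, so no page text has to be rescanned at search time.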

11 Web-based data access: Challenges. Dynamic data; large data; unstructured data; uneven quality of data; diverse data formats; different data types; Zipf’s Law.* * https://www.cs.utexas.edu/~mooney/ir-course/
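Zipf's law states that the frequency of a term is roughly inversely proportional to its frequency rank, f(r) ≈ C / r^s with s close to 1, so a few terms are very common and most are rare. The sketch below builds a rank-frequency table for a text and computes what the law would predict; the function names and the toy sentence are assumptions for illustration, and a toy text is far too small to actually follow the law.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

def zipf_prediction(top_count, rank, s=1.0):
    """Frequency Zipf's law predicts at `rank`, given the rank-1 frequency."""
    return top_count / rank ** s

# Toy text (hypothetical); a real test needs a large corpus.
table = rank_frequency("the cat and the dog and the bird")
top_count = table[0][2]
for rank, word, count in table:
    print(rank, word, count, round(zipf_prediction(top_count, rank), 2))
```

On a large web corpus, comparing the observed counts with `zipf_prediction` shows the long tail of rare terms that makes indexing and retrieval harder.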

12 Web-based data access: Limitations. The client-side logic for WFS (Web Feature Service) and WCS (Web Coverage Service) would be more complex. Catalogs assume that service providers have registered their services and that the services are registered with: 1) correct classifications; 2) up-to-date information. Ranking is based on the number and weight of other links pointing to a web page, and does not measure: 1) the quality of service (QoS); 2) the quality of data.

13 Web crawler: A Web crawler* is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider,** an ant, or an automatic indexer.*** Web crawlers are used by search engines (e.g. Google, Bing) to: read pages by processing hyperlinks and HTML code; download and search the pages they read; update the web content held by the search engine; index the web content of other websites. * http://en.wikipedia.org/wiki/Web_crawler#Architectures ** https://web.archive.org/web/20040903174942/archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Agents/spetka/spetka.html *** Kobayashi, M. and Takeda, K. (2000). "Information retrieval on the web". ACM Computing Surveys (ACM Press) 32 (2): 144–173.

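14 Web crawler: Search strategies. Breadth-first search; depth-first search. Both strategies can be implemented with a queue of URLs: breadth-first adds newly found links to the end of the queue, depth-first adds them to the front. Restrictions: the crawl can be limited to a specific website, a specific web directory, or a specific page.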
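The two strategies and a directory restriction can be sketched with a single frontier of URLs. The `LINKS` dictionary is a hypothetical link graph that stands in for real pages so the example runs without network access; `allowed_prefix` is an assumed parameter name for the restriction.

```python
from collections import deque

# Hypothetical link graph standing in for real pages (no network access).
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/a1": [], "/a2": [], "/b1": [],
}

def crawl(seed, strategy="bfs", allowed_prefix="/"):
    """Visit pages reachable from `seed`, returning them in visiting order."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        new_links = []
        for link in LINKS.get(url, []):
            # Restriction: stay inside the allowed site or directory.
            if link.startswith(allowed_prefix) and link not in seen:
                seen.add(link)
                new_links.append(link)
        if strategy == "bfs":
            frontier.extend(new_links)                 # append to the end
        else:
            frontier.extendleft(reversed(new_links))   # push to the front
    return order

print(crawl("/", "bfs"))  # ['/', '/a', '/b', '/a1', '/a2', '/b1']
print(crawl("/", "dfs"))  # ['/', '/a', '/a1', '/a2', '/b', '/b1']
print(crawl("/a", "bfs", allowed_prefix="/a"))  # ['/a', '/a1', '/a2']
```

The only difference between the strategies is the end of the queue at which new links are inserted.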

15 Web crawler: Link extraction. Extract all links and URLs on a page; extract relative URLs and complete them using the current page URL. Filtering: original location of the HTML; base URL; internal page fragments; anchor text. Robot exclusion: Robots Exclusion Protocol; Robots META tag.
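These steps can be sketched with the Python standard library: `html.parser` to find links, `urllib.parse` to resolve relative URLs and drop fragments, and `urllib.robotparser` for the Robots Exclusion Protocol. The sample HTML, the `example.org` URLs, the `demo-bot` user agent, and the class and function names are assumptions for illustration; a production crawler would handle many more edge cases.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.robotparser import RobotFileParser

class LinkExtractor(HTMLParser):
    """Collect absolute, fragment-free link URLs from one HTML page."""

    def __init__(self, page_url):
        super().__init__()
        self.base_url = page_url   # replaced if the page has a <base> tag
        self.links = []
        self.follow = True         # cleared by a robots META "nofollow"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            self.base_url = urljoin(self.base_url, attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "nofollow" in (attrs.get("content") or "").lower():
                self.follow = False
        elif tag == "a":
            href = attrs.get("href")
            if not href or href.startswith("#"):
                return             # skip links to fragments of the same page
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            if absolute not in self.links:
                self.links.append(absolute)

def extract_links(html, page_url):
    parser = LinkExtractor(page_url)
    parser.feed(html)
    return parser.links if parser.follow else []

def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of a robots.txt file."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)

# Hypothetical sample inputs.
SAMPLE_HTML = """<html><head><base href="http://example.org/docs/"></head><body>
<a href="a.html">A</a> <a href="/b.html#top">B</a>
<a href="#section">same page</a> <a href="a.html#x">A again</a>
</body></html>"""
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"

print(extract_links(SAMPLE_HTML, "http://example.org/index.html"))
print(allowed_by_robots(SAMPLE_ROBOTS, "demo-bot", "http://example.org/private/x.html"))
```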

16 Web crawler: Multi-thread crawler. Avoids the delay of waiting for one page to download; avoids overloading a single server with page requests; each thread requests a page from a different host; requests are distributed to improve data transmission. Example: each of early Google's crawlers ran about 300 threads, enabling downloads of more than 100 pages per second.
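A minimal sketch of the idea, assuming a thread pool and a simulated download. `fake_fetch` sleeps instead of contacting a server, and the host names are made up; `interleave_by_host` is one simple (assumed) way to spread consecutive requests across different hosts, not the scheduling used by any particular search engine.

```python
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import zip_longest
from urllib.parse import urlparse

def fake_fetch(url, delay=0.05):
    """Stand-in for an HTTP download: sleeps to simulate network latency."""
    time.sleep(delay)
    return url, f"<html>content of {url}</html>"

def interleave_by_host(urls):
    """Reorder URLs so that consecutive requests go to different hosts."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlparse(url).netloc].append(url)
    ordered = []
    for group in zip_longest(*by_host.values()):
        ordered.extend(u for u in group if u is not None)
    return ordered

def crawl_concurrently(urls, workers=10, fetch=fake_fetch):
    """Download all URLs with a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch, interleave_by_host(urls)))

if __name__ == "__main__":
    urls = [f"http://host{h}.example/page{p}" for h in range(2) for p in range(5)]
    for workers in (1, 10):
        start = time.perf_counter()
        pages = crawl_concurrently(urls, workers=workers)
        elapsed = time.perf_counter() - start
        print(f"{workers} thread(s): {len(pages)} pages in {elapsed:.2f} s")
```

Because each download mostly waits on the network, adding threads lets the waits overlap, which is where the throughput gain comes from.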

17 Web crawler: Topic-directed crawler. Pre-define pages on the topics of interest; rank links by the similarity between pages and the topic; preferentially request pages closer to the topic of interest. Link-directed crawler. Distinguish pages by in-degree and out-degree; authorities: pages ranked by their incoming links; hubs: pages ranked by their outgoing links.
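The hub and authority ranking used by a link-directed crawler can be sketched with the HITS iteration (Kleinberg's hubs-and-authorities algorithm). The slides do not name the algorithm, so treating it as HITS is an assumption; the link graph and page names below are hypothetical.

```python
def normalize(scores):
    """Scale scores in place so that their squares sum to 1."""
    norm = sum(value * value for value in scores.values()) ** 0.5 or 1.0
    for page in scores:
        scores[page] /= norm

def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries for a link graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking in.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        normalize(auth)
        # Hub: sum of the authority scores of the pages linked to.
        hub = {page: sum(auth[t] for t in links.get(page, [])) for page in pages}
        normalize(hub)
    return hub, auth

# Hypothetical graph: a portal and a blog pointing at three services.
GRAPH = {"portal": ["wms1", "wms2", "wms3"], "blog": ["wms1"]}
hub, auth = hits(GRAPH)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

A crawler can then prioritize links found on high-scoring hubs, since those pages tend to point at many authoritative resources.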

18 Web crawler: Prototype implementation

19 Web crawler: Efficiency improvement by concurrent threads. With only one crawling thread and one determination thread, the crawler's throughput is very low; when the number of crawling and determination threads is increased to 10 each, the throughput increases dramatically.

20 Web crawler: Efficiency improvement by concurrent threads

21 Web crawler: Coverage and timeliness compared to other WMS (Web Map Service) crawlers. Coverage: the claimed number of WMSs found, the actual number of live WMSs found, the number of unique WMS hosts, and the total number of live layers. Liveliness of services and layers: determined by downloading and parsing the capabilities documents of the WMSs. Timeliness: the number of dead links in their results.
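A liveliness check of this kind can be sketched as: build a GetCapabilities request, then try to parse the response and list its named layers. The example below skips the download and parses a small hand-written document instead; the endpoint URL, the sample XML, and the function names are assumptions, and this is not the evaluation code behind the slides.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def capabilities_url(endpoint):
    """Build a WMS GetCapabilities request URL for a service endpoint."""
    separator = "&" if "?" in endpoint else "?"
    query = urlencode({"service": "WMS", "request": "GetCapabilities"})
    return endpoint + separator + query

def local_name(tag):
    """Strip an XML namespace such as '{http://www.opengis.net/wms}'."""
    return tag.rsplit("}", 1)[-1]

def named_layers(xml_text):
    """Return the layer names in a capabilities document, or None if the
    text cannot be parsed (the service is then treated as not live)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    names = []
    for element in root.iter():
        if local_name(element.tag) == "Layer":
            for child in element:
                if local_name(child.tag) == "Name" and child.text:
                    names.append(child.text.strip())
    return names

# Hand-written sample response (hypothetical service).
SAMPLE_CAPABILITIES = """<WMS_Capabilities version="1.3.0" xmlns="http://www.opengis.net/wms">
  <Capability><Layer><Title>Root</Title>
    <Layer><Name>landcover</Name></Layer>
    <Layer><Name>rivers</Name></Layer>
  </Layer></Capability>
</WMS_Capabilities>"""

print(capabilities_url("http://example.org/wms"))
print(named_layers(SAMPLE_CAPABILITIES))
print(named_layers("<html>404 Not Found"))
```

Counting services for which `named_layers` returns a list gives the number of live WMSs, and the total length of those lists gives the number of live layers.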

22 Web crawler: Coverage and timeliness compared to other WMS crawlers

23 Web crawler: Quickness in locating WMSs and findings regarding WMS distribution

24 Semantic-based data access: Semantics refers to the meaning of languages, as opposed to their form (syntax). In other words, semantics is about the interpretation of an expression.* Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results.** * http://en.wikipedia.org/wiki/Semantics#Computer_science ** http://en.wikipedia.org/wiki/Semantic_search

25 Semantic-based data access: Advantages. Discovers latent semantic associations between terms and meanings; answers queries with reduced processing time; effectively identifies place names by using spatial filtering; supports both subject-based and location-based queries; displays and renders search results using multi-dimensional and multivariate visualization; animates spatio-temporal development.

26 Semantic-based data access: Latent semantic association discovery

27 Summary: Data access types; stand-alone and web-based data access; web indexing and web search engines; web crawler (search strategies, restrictions, link extraction, filtering); multi-thread crawler; topic-directed and link-directed crawlers; prototype; semantic-based data access.

