(Big) Data Accessing. Prof. Wenwen Li, School of Geographical Sciences and Urban Planning, 5644 Coor Hall

Outline 1. Concepts 2. Stand-alone data access 3. Web-based data access 4. Web crawler 5. Semantic-based data access 6. Summary

Concepts Definition: Data access typically refers to software and activities related to storing, retrieving, or acting on data housed in a database or other repository.* It can also refer to a user's ability to access or retrieve data stored within a database or other repository.**

Concepts

Concepts Types of data access* There are two fundamental types of data access: Sequential access: data are retrieved in a pre-defined sequence (one item after another), e.g. streaming a video. Random access: data can be retrieved from any given location directly, e.g. looking up a database record by key or reading a file at an arbitrary offset.
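The distinction is easy to see with ordinary file I/O. A minimal Python sketch is below; the file name records.bin and the fixed record size are assumptions for illustration:

```python
# Minimal sketch contrasting sequential and random access on a binary file.
# Assumes fixed-size records in a hypothetical file "records.bin".
RECORD_SIZE = 64

def read_sequential(path):
    """Sequential access: read records one after another, in stored order."""
    with open(path, "rb") as f:
        while True:
            record = f.read(RECORD_SIZE)
            if not record:
                break
            yield record

def read_random(path, index):
    """Random access: jump directly to the index-th record via seek()."""
    with open(path, "rb") as f:
        f.seek(index * RECORD_SIZE)   # position the file pointer directly
        return f.read(RECORD_SIZE)
```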

Stand-alone data access Features: different data repositories are accessed under authorization; access rights differ between ordinary users and the administrator; the repository is managed by an administrator; data reside in a local database; security and query facilities are provided locally.
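A minimal sketch of stand-alone access against a local database, using Python's built-in sqlite3 module; the file name inventory.db, the table, and its columns are assumptions for illustration:

```python
import sqlite3

# Stand-alone access: the data live in a local database file, and queries
# never leave the machine. "inventory.db" and the table schema are hypothetical.
conn = sqlite3.connect("inventory.db")
cur = conn.cursor()

cur.execute("""CREATE TABLE IF NOT EXISTS stations (
                   id INTEGER PRIMARY KEY,
                   name TEXT,
                   temperature REAL)""")
cur.execute("INSERT INTO stations (name, temperature) VALUES (?, ?)", ("Tempe", 41.5))
conn.commit()

# Query the local repository directly.
for row in cur.execute("SELECT name, temperature FROM stations WHERE temperature > ?", (30,)):
    print(row)

conn.close()
```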

Web-based data access Why? Large amounts of data have been gathered and published on the Internet since the 1990s. Huge volumes of rapidly expanding data and ever-changing experimental and simulation results remain largely disconnected from each other. The communities that create these datasets are themselves distributed.

Web-based data access Key components Data repository: stores and manages the data. Open standards: make data from diverse sources interoperable. Clearinghouse network: enables interaction between data producers and data users. Policies: regulate data access and licensing, protect privacy, and provide custodianship at all administrative levels. People: such as data users and domain experts.

Web-based data access

Web-based data access Web indexing Web indexing (or Internet indexing) refers to various methods for indexing the contents of a website or of the Internet as a whole.* For individual websites: a “back-of-the-book”-style A-Z index. For search engines: keywords and metadata. Web search engine A web search engine aims to discover and retrieve information from web resources. Searchable results include web pages, images, and other resources. A crawler keeps the index current, so queries can be answered in (near) real time, and pages are located through the web index rather than by scanning the live Web.
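To make keyword indexing concrete, here is a minimal inverted-index sketch in Python; the sample documents are invented, and real search engines add tokenization, ranking, and metadata handling on top of this:

```python
from collections import defaultdict

# Toy corpus: doc id -> text (hypothetical pages).
docs = {
    1: "web crawler visits pages and follows links",
    2: "search engines build an index of web pages",
    3: "semantic search interprets the meaning of query terms",
}

# Build the inverted index: term -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Return doc ids containing every query term (simple AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

print(search("web pages"))   # -> {1, 2}
```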

Web-based data access Challenges: dynamic data; large data volumes; unstructured data; uneven data quality; diverse data formats; different data types; skewed frequency distributions (Zipf's law).
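Zipf's law is the empirical observation that when items (e.g. words or sites on the Web) are ranked by frequency, frequency falls off roughly as a power of rank. A common formulation is below; the exponent s is close to 1 for natural-language text:

```latex
% Zipf's law: normalized frequency f of the item at rank r among N items
f(r) = \frac{C}{r^{s}}, \qquad C = \left( \sum_{k=1}^{N} \frac{1}{k^{s}} \right)^{-1}, \qquad s \approx 1
```

For web data this means a few terms, sites, or formats dominate, while a long tail of rare ones still has to be handled by the crawler and the index.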

Web-based data access Limitations Client-side logic for handling Web Feature Service (WFS) and Web Coverage Service (WCS) data can become complex. Catalog-based discovery assumes that service providers have registered their services in the catalog with 1) correct classifications and 2) up-to-date information. Link-based ranking scores a web page by the number and weight of other links pointing to it, which does not measure 1) the quality of service (QoS) or 2) the quality of the data.

Web crawler A Web crawler* is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider**, an ant, or an automatic indexer***. Web crawlers are used by search engines (e.g. Google, Bing) to read pages by processing hyperlinks and HTML code, download and scan the pages they read, keep the search engine's copy of web content up to date, and index the content of other websites. *** Kobayashi, M. and Takeda, K. (2000). "Information retrieval on the web". ACM Computing Surveys 32(2): 144–173.

Web crawler Search strategies Breadth-first search and depth-first search; both maintain a frontier of URLs to visit (a queue for breadth-first, a stack for depth-first). Restrictions: crawling can be limited to a specific website, to a specific web directory, or to a specific page. A minimal breadth-first crawler is sketched below.
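A minimal breadth-first crawler sketch using only Python's standard library; the seed URL and the same-host restriction mirror the "limited to a specific website" case, and politeness delays and robots.txt checks are omitted here (robots handling is sketched with the next slide):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_bfs(seed, max_pages=20):
    """Breadth-first crawl restricted to the seed's host."""
    host = urlparse(seed).netloc
    frontier = deque([seed])          # FIFO queue -> breadth-first order
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue                  # skip pages that fail to download
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)          # resolve relative URLs
            if urlparse(absolute).netloc == host:  # stay on the same website
                frontier.append(absolute)
    return visited

# Example with a hypothetical seed: crawl_bfs("https://example.org/")
```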

Web crawler Link extraction Extract all links and URLs from a page, including relative URLs resolved against the current URL. Filtering: account for the original location of the HTML, the base URL, internal page fragments, and anchor text. Robot exclusion: honor the robots exclusion protocol (robots.txt) and the robots META tag. A sketch of URL resolution and robots.txt checking follows.
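A small sketch, using only the Python standard library, of resolving relative links, stripping page fragments, and checking the robots exclusion protocol; the user agent string and URLs are assumptions for illustration:

```python
from urllib.parse import urljoin, urldefrag
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler/0.1"   # hypothetical crawler name

def allowed_by_robots(url):
    """Check the site's robots.txt (robots exclusion protocol)."""
    robots_url = urljoin(url, "/robots.txt")
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()                        # downloads and parses robots.txt
    return rp.can_fetch(USER_AGENT, url)

def normalize_link(base_url, href):
    """Resolve a relative href against the base URL and drop '#fragment' parts."""
    absolute = urljoin(base_url, href)
    clean, _fragment = urldefrag(absolute)   # internal page fragments are filtered out
    return clean

# Example usage with a hypothetical page and link:
link = normalize_link("https://example.org/data/index.html", "../maps/page.html#section2")
print(link)                          # https://example.org/maps/page.html
if allowed_by_robots(link):
    print("OK to fetch")
```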

Web crawler Multi-thread crawler Hide the network delay of downloading a single page. Avoid overloading any single server with requests. Each thread requests a page from a different host. Distributing requests across hosts improves overall data transmission. Example: early Google crawlers ran about 300 threads each, enabling the download of more than 100 pages per second. A sketch follows.
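A minimal multi-threaded fetching sketch with Python's concurrent.futures; the URL list is hypothetical, and a real crawler would also group URLs by host so that each worker talks to a different server:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

# Hypothetical frontier of URLs, ideally spread over different hosts.
urls = [
    "https://example.org/a.html",
    "https://example.net/b.html",
    "https://example.com/c.html",
]

def fetch(url):
    """Download one page; network time dominates, so threads overlap the waits."""
    with urlopen(url, timeout=10) as resp:
        return url, len(resp.read())

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(fetch, u) for u in urls]
    for fut in as_completed(futures):
        try:
            url, size = fut.result()
            print(f"{url}: {size} bytes")
        except Exception as exc:
            print(f"failed: {exc}")
```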

Web crawler Topic-directed crawler Pre-define pages representing the topics of interest. Rank candidate links by measuring the similarity between their pages and the topic. Preferentially request the pages closest to the topic of interest (see the sketch below). Link-directed crawler Distinguish pages by in-degree and out-degree. Authorities: pages ranked by their incoming links. Hubs: pages ranked by their outgoing links.
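A sketch of the topic-directed idea: score each candidate page against a topic description with a simple term-overlap (cosine) similarity and keep the frontier as a priority queue so that the most topical pages are fetched first; the topic terms and page texts are invented for illustration:

```python
import heapq
from collections import Counter
from math import sqrt

def similarity(text, topic_terms):
    """Cosine similarity between a page's term counts and the topic terms."""
    page = Counter(text.lower().split())
    topic = Counter(topic_terms)
    dot = sum(page[t] * topic[t] for t in topic)
    norm = sqrt(sum(v * v for v in page.values())) * sqrt(sum(v * v for v in topic.values()))
    return dot / norm if norm else 0.0

topic_terms = ["geospatial", "web", "map", "service"]

# Hypothetical candidate pages: (url, text extracted from the page).
candidates = [
    ("https://example.org/wms", "a web map service publishes geospatial layers"),
    ("https://example.org/blog", "notes on cooking and travel"),
]

# Priority queue ordered by descending similarity (negate the score for heapq's min-heap).
frontier = []
for url, text in candidates:
    heapq.heappush(frontier, (-similarity(text, topic_terms), url))

while frontier:
    score, url = heapq.heappop(frontier)
    print(f"{url} (similarity {-score:.2f})")   # fetch the most topical pages first
```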

Web crawler Prototype implementation

Web crawler Efficiency improvement by concurrent threads With only one crawling thread and one determination thread, the crawler's throughput is very low; when the number of crawling and determination threads is increased to 10 each, throughput rises dramatically.

Web crawler Efficiency improvement by concurrent threads

Web crawler Coverage and timeliness compared to other WMS (Web Map Service) crawlers Coverage: the claimed number of WMSs found, the actual number of live WMSs found, the number of unique WMS hosts, and the total number of live layers. The 'liveness' of services and layers was determined by downloading and parsing each WMS's capabilities document. Timeliness: the number of dead links in each crawler's results.

Web crawler Coverage and timeliness compared to other WMS crawlers

Web crawler Speed in locating WMSs and findings regarding WMS distribution

Semantic-based data access Semantics: refers to the meaning of language, as opposed to its form (syntax); in other words, semantics is about the interpretation of an expression.* Semantic search: seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results.**
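One simple ingredient of semantic search is query expansion, where a query term is matched not only literally but also against related concepts. A toy Python sketch is below; the related-term map and documents are invented, and real systems use ontologies or learned embeddings instead:

```python
# Toy semantic matching: expand query terms with related concepts before searching.
# The related-term map is a stand-in for an ontology or embedding model.
related_terms = {
    "map": {"map", "wms", "cartography", "layer"},
    "temperature": {"temperature", "climate", "heat"},
}

docs = {
    1: "a wms layer showing land surface heat for arizona",
    2: "lecture notes on web crawling strategies",
}

def semantic_search(query):
    """Score documents by overlap with the expanded query terms."""
    expanded = set()
    for term in query.lower().split():
        expanded |= related_terms.get(term, {term})
    scores = {}
    for doc_id, text in docs.items():
        scores[doc_id] = len(expanded & set(text.split()))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(semantic_search("temperature map"))   # doc 1 matches via "heat", "wms", "layer"
```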

Semantic-based data access Advantages Discovering latent semantic associations between terms and meanings. Answering queries with reduced processing time. Effectively identifying place names by using spatial filtering (see the sketch below). Supporting both subject-based and location-based queries. Displaying and rendering search results with multi-dimensional and multivariate visualization. Animating spatio-temporal development.
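Spatial filtering can be illustrated with a bounding-box test: place-name candidates whose coordinates fall outside the query's region of interest are discarded. The gazetteer entries and bounding box below are invented for illustration:

```python
# Toy spatial filter: keep only place-name candidates inside a query bounding box.
# Coordinates and the gazetteer are hypothetical.
candidates = {
    "Phoenix":  (33.45, -112.07),   # (latitude, longitude)
    "Tempe":    (33.43, -111.94),
    "Portland": (45.52, -122.68),
}

# Query region of interest: (min_lat, min_lon, max_lat, max_lon), roughly Arizona.
bbox = (31.0, -115.0, 37.0, -109.0)

def in_bbox(point, box):
    lat, lon = point
    min_lat, min_lon, max_lat, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

matches = {name: pt for name, pt in candidates.items() if in_bbox(pt, bbox)}
print(matches)   # Phoenix and Tempe pass the filter; Portland is excluded
```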

Semantic-based data access Latent semantic association discovery

Summary Data access types: stand-alone data access and web-based data access. Web indexing and web search engines. Web crawler: search strategies, restrictions, link extraction, filtering; multi-thread crawler; topic-directed and link-directed crawlers; prototype implementation. Semantic-based data access.