Algorithms (Contd.)

Presentation transcript:

Algorithms (Contd.)

How do we describe algorithms?
Pseudocode
  – Combines English with simple code constructs
  – Works with various types of primitives
      Could be + - / *
      Could be more complex operations
  – Describes how data is organized
  – Describes operations on the data
  – Is meant to be higher level than programming

Searching with indices (pseudocode)
Build the indices
  – Do this by going through the list and determining where department names change
  – Store the results in an array called Indices
Search the indices
  – Do a binary search on the array Indices
      Do this by comparing to the middle element
  – If the target comes after the middle element, binary search the upper half
  – If it comes before, binary search the lower half
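The two steps above can be sketched in Python. This is a minimal illustration, not the lecturer's code: it assumes the list holds (department, name) records already sorted by department, and the function names build_indices and find_department and the sample records are made up for the example.

```python
def build_indices(records):
    """Return [(department, start_position)] for each point where the department changes."""
    indices = []
    previous = None
    for position, (department, _name) in enumerate(records):
        if department != previous:
            indices.append((department, position))
            previous = department
    return indices


def find_department(indices, target):
    """Binary search on Indices; return the start position of target, or -1 if absent."""
    low, high = 0, len(indices) - 1
    while low <= high:
        middle = (low + high) // 2
        department, start = indices[middle]
        if department == target:
            return start
        if department < target:
            low = middle + 1      # target can only be in the upper half
        else:
            high = middle - 1     # target can only be in the lower half
    return -1


# Hypothetical sample data, sorted by department.
records = [
    ("Art", "Ng"), ("Art", "Ruiz"),
    ("Biology", "Chen"),
    ("Music", "Okafor"), ("Music", "Silva"), ("Music", "Tran"),
]
indices = build_indices(records)
print(indices)
print(find_department(indices, "Music"))
```

Building the indices takes one pass over the list; each later search only touches about log2 of the number of departments, not the whole list.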

Building a web search engine
Crawl/spider the web
Organize the results for fast query processing
Process queries

Crawl the web
Every month use networking to go to as many reachable web pages as you can
  – 10B pages, 10 Kbytes/page, so 100 terabytes
      Can compress an average page to 3 Kbytes
Numeracy
  – To crawl 10B pages in 100 days:
      Crawl 100M pages per day
      Crawl about 4M pages per hour
      Crawl about 1,000 pages per second
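The numeracy on this slide can be checked with a few lines of Python. The figures come from the slide; the only assumption is decimal units (1 Kbyte = 1,000 bytes, 1 terabyte = 10^12 bytes).

```python
PAGES = 10_000_000_000        # 10B pages
BYTES_PER_PAGE = 10_000       # 10 Kbytes per page (decimal units assumed)
COMPRESSED_BYTES = 3_000      # 3 Kbytes per compressed page
CRAWL_DAYS = 100

total_terabytes = PAGES * BYTES_PER_PAGE / 1e12
compressed_terabytes = PAGES * COMPRESSED_BYTES / 1e12
pages_per_day = PAGES / CRAWL_DAYS
pages_per_hour = pages_per_day / 24
pages_per_second = pages_per_hour / 3600

print(f"{total_terabytes:.0f} TB raw, {compressed_terabytes:.0f} TB compressed")
print(f"{pages_per_day:,.0f} pages/day")
print(f"{pages_per_hour:,.0f} pages/hour")
print(f"{pages_per_second:,.0f} pages/second")
```

The exact rates are a little over 4.1M pages per hour and a little over 1,150 pages per second, which the slide rounds to 4M and 1,000.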

Organize the results
Put into alphabetical order
Build indices for faster lookup
Make multiple copies so that searching can proceed in parallel
When you update, you rebuild the indices
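The slide does not name a data structure. One common way to organize crawled pages for lookup is an inverted index, which maps each word to the pages containing it. The sketch below assumes that structure; the names build_index and lookup, the whitespace tokenization, and the three sample pages are all illustrative.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the sorted list of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    # Words in alphabetical order, each with a sorted list of page ids.
    return {word: sorted(ids) for word, ids in sorted(index.items())}


def lookup(index, query):
    """Return the ids of pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Hypothetical crawled pages: id -> text.
pages = {
    1: "computer science department",
    2: "history of science",
    3: "computer hardware",
}
index = build_index(pages)
print(lookup(index, "computer science"))
```

Rebuilding after an update means calling build_index again on the new set of pages, which matches the last point on the slide.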

Process search queries
Look up indices
Look up words/phrases
  – Advertiser can buy a word or phrase
This search gives you internal addresses of web pages
  – Look them up to build results page
Ranking results: content match, popularity, price paid by advertisers, …

Ranking by Popularity
The web is a collection of links
  – A document’s importance is determined by
      How many pages point to it
      How important those pages are
Used for determining
  – How often to crawl a page
  – How to order pages presented
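Because a page's importance depends on the importance of the pages pointing to it, the definition is circular, and one way to resolve it is to iterate until the scores settle. The sketch below is an assumption about how that could be done, in the spirit of link-analysis ranking. The slide does not give a formula, so the damping value 0.85, the iteration count, the function name, and the four-page link graph are all illustrative.

```python
def rank_by_popularity(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it points to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    score = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small base score (assumed damping scheme).
        new_score = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score among the pages it links to.
                share = damping * score[page] / len(targets)
                for target in targets:
                    new_score[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for other in pages:
                    new_score[other] += damping * score[page] / n
        score = new_score
    return score


# Hypothetical link graph: A, B and D point to C; C points to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = rank_by_popularity(links)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[:2])
```

In this graph C ranks first because three pages point to it, and A ranks second even though only one page points to it, because that one page is the important page C.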

Content Relevance
Simple string matching
  – Does the document/string contain the word computer?
More complex string matching
  – Did the word computer occur before or after the word science?
  – Did it appear within 10 words of the word science?
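The three questions on this slide can each be written as a small check over word positions. This is a rough sketch that assumes words are separated by whitespace and ignores punctuation; the function names and the sample sentence are made up for illustration.

```python
def positions(text, word):
    """Return the word positions at which word occurs in text."""
    return [i for i, token in enumerate(text.lower().split()) if token == word]


def contains_word(text, word):
    """Simple matching: does the text contain the word at all?"""
    return bool(positions(text, word))


def occurs_before(text, first, second):
    """True if some occurrence of first comes before some occurrence of second."""
    first_pos, second_pos = positions(text, first), positions(text, second)
    return bool(first_pos and second_pos) and min(first_pos) < max(second_pos)


def within(text, first, second, distance=10):
    """True if the two words appear within `distance` words of each other."""
    return any(abs(i - j) <= distance
               for i in positions(text, first)
               for j in positions(text, second))


doc = "The department of computer science offers courses in science and engineering"
print(contains_word(doc, "computer"))
print(occurs_before(doc, "computer", "science"))
print(within(doc, "computer", "science", 10))
```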

How does string matching work?
State machines
  – Move along states as long as you keep matching
  – Back off when you miss a match

State machine – looking for abcd
[Diagram: states Sa, Sb, Sc, Sd, OK; edges labeled Read a, Read b, Read c, Read d, Other]
What happens if input is abccadbacabcd?
Sa Sb Sc Sd Sa Sb Sa Sa Sb Sa Sb Sc Sd OK

State machine – looking for abcd
[Diagram: states Sa, Sb, Sc, Sd, OK; edges labeled Read a, Read b, Read c, Read d, Other]
What happens if input is abcabcd?
Sa Sb Sc Sd Sa Sa Sa Sa

State machine – looking for abcd
[Diagram: states Sa, Sb, Sc, Sd, OK; edges labeled Read a, Read b, Read c, Read d, Other, plus an additional Read a edge]
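The state traces on the slides above can be reproduced with a short simulation. This sketch assumes the reading suggested by those traces: Sa is the start state, each correct character moves one state to the right, and any other character returns to Sa. The restart_on_a flag models the extra Read a edge, under the assumption that a mismatching a leads to Sb rather than Sa. The function names are illustrative.

```python
PATTERN = "abcd"
NAMES = ["Sa", "Sb", "Sc", "Sd", "OK"]


def run_machine(text, restart_on_a):
    """Return the states visited; state k means k pattern characters are matched."""
    state = 0
    visited = [state]
    for ch in text:
        if state == len(PATTERN):
            break                  # already in OK, stop reading
        if ch == PATTERN[state]:
            state += 1             # keep matching
        elif restart_on_a and ch == "a":
            state = 1              # the mismatching a starts a new match
        else:
            state = 0              # back off to the start
        visited.append(state)
    return visited


def trace(text, restart_on_a):
    return " ".join(NAMES[s] for s in run_machine(text, restart_on_a))


print(trace("abccadbacabcd", False))  # Sa Sb Sc Sd Sa Sb Sa Sa Sb Sa Sb Sc Sd OK
print(trace("abcabcd", False))        # Sa Sb Sc Sd Sa Sa Sa Sa
print(trace("abcabcd", True))         # Sa Sb Sc Sd Sb Sc Sd OK
```

Without the extra edge, the machine throws away the a that caused the mismatch in abcabcd and never finds the abcd that follows. Restarting on a is enough here because the four characters of abcd are all different; a pattern with repeated characters would need a more careful back-off rule.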

Larger search challenges
Allow strings to have don’t cares
  – Starts with a and ends with e
  – Has some number of copies of the substring ab
Finding strings similar to but not the same as your string
  – For spelling correction
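Both challenges have standard tools, shown here as a sketch, not as what the lecture prescribes. Don't-care patterns are written as regular expressions with Python's re module, where "some number of copies" is read as "one or more, and nothing else", which is an interpretation. Similarity is measured with edit distance, one common choice for spelling correction.

```python
import re

starts_a_ends_e = re.compile(r"a.*e")   # a, then anything, then e
copies_of_ab = re.compile(r"(ab)+")     # one or more copies of ab


def edit_distance(s, t):
    """Minimum number of single-character inserts, deletes and substitutions."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # delete cs
                               current[j - 1] + 1,       # insert ct
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


print(bool(starts_a_ends_e.fullmatch("apple")))
print(bool(copies_of_ab.fullmatch("ababab")))
print(edit_distance("corection", "correction"))
```

A spelling corrector could suggest the dictionary words with the smallest edit distance from what the user typed.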

Algorithms -- summary
Methods for solving problems
Understand at a high level
Make sure your reasoning is correct
Worry about efficiency in situations where that matters
Write as pseudocode

Distributed Algorithms

Distributed computing
Key idea
  – Buying 1000 machines of speed x is significantly cheaper than buying one machine of speed 1000x
  – No one person has to buy all 1000 machines: a lot of computational, communication and storage resources are already in place and can be harvested for bigger things
Key challenge
  – Making the machines work together for effective speedup. Communication between machines is a key challenge.
Approaches
  – Find problems that can be distributed easily

Distributed problems
Problems that can use decentralized computing
  – Weather prediction
      Weather in a location is most affected by weather nearby
  – Movie generation
      Individual frames can be generated separately
  – Google search engine
      10,000s of PCs, all of them cheap, many of them identical
      Can answer over 100,000,000 queries per day in ½ sec or less each
  – Looking for the origin of the universe
      Can be localized like weather prediction
  – File swapping and access (distributed storage)
  – Looking for extraterrestrial intelligence
  – Content caching and distribution

Distributed computers
Scales of distributed computing
  – Cluster-in-a-room: hundreds of machines
      All dedicated to the task
  – PCs on a campus: thousands of machines
      Using spare cycles
  – SETI cluster: millions of machines
      Screen saver situation

Cluster in a Room
Machines are dedicated to the network
All machines run similar software
Problem is divided into pieces
  – Each piece is assigned to a machine in the cluster
Problem pieces should be loosely linked
  – Computation is faster than communication
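The divide-and-assign pattern can be imitated on a single computer. In this sketch, worker threads from Python's concurrent.futures stand in for the machines of a cluster; that is only an analogy, since real cluster machines communicate over a network. The functions split and work and the sum-of-squares task are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, pieces):
    """Divide data into roughly equal contiguous pieces."""
    size = (len(data) + pieces - 1) // pieces
    return [data[i:i + size] for i in range(0, len(data), size)]


def work(piece):
    """Stand-in for the real computation on one piece."""
    return sum(x * x for x in piece)


data = list(range(1, 1001))
with ThreadPoolExecutor(max_workers=4) as pool:
    # Each piece goes to one worker; workers do not talk to each other.
    partial_results = list(pool.map(work, split(data, 4)))

# The only communication is collecting one number per piece.
total = sum(partial_results)
print(total)
```

The pieces are loosely linked in the sense of the slide: each worker does a lot of computation on its own piece and sends back a single number.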

PCs on a Campus
Loosely coupled on a local-area network
PCs do other things some of the time
When free cycles are available, they’re used
Many more machines, but less of each machine available

Workstation Network at Google
Front end 100 machines called Searching machines Retrieving machines
Fit machines in a 7’x2’x3’ rack

SETI
Telescope at Arecibo, PR collects data
Data is processed in real time by fast machines
But no one looks for weak signals
  – Too costly
Project built to do this

Receive data from Arecibo
  – 35 Gbytes per day by snail mail
Break into Work Units
  – .25 Mbyte each, so 140,000 WUs per day
WU takes 20 hours to process
Need about 117,000 dedicated machines to process one day of data
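The machine count follows from the other figures on the slide, as this short check shows. It assumes decimal units (1 Gbyte = 1,000 Mbytes) and that a dedicated machine works 24 hours a day.

```python
BYTES_PER_DAY = 35e9           # 35 Gbytes of data per day
BYTES_PER_WORK_UNIT = 0.25e6   # .25 Mbyte per Work Unit
HOURS_PER_WORK_UNIT = 20

work_units_per_day = BYTES_PER_DAY / BYTES_PER_WORK_UNIT
machine_hours_per_day = work_units_per_day * HOURS_PER_WORK_UNIT
dedicated_machines = machine_hours_per_day / 24

print(f"{work_units_per_day:,.0f} work units per day")
print(f"{dedicated_machines:,.0f} dedicated machines")
```

Each day's data needs 2.8 million machine-hours of work, so keeping up takes roughly 116,700 machines running around the clock, which the slide rounds to 117,000.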

Get individual users to download software
Machine idle and screen saver runs software
  – Download WU
  – Compute
  – When finished send back result
Database at Berkeley reassembles results
Progress to date --

Medical/Biological Applications
Peer-to-Peer Medicine
Cancer Research
…