
Application of the A* Informed Search Heuristic to Finding Information on the World Wide Web

For: CS590 Intelligent Systems
Related Subject Areas: Artificial Intelligence, Graphs, Epistemology, Knowledge Management and Information Filtering
Presenter: Daniel J. Sullivan
Date: April 30th, 2003
Location: SL210

Problem Domain

This project explores the application of the A* heuristic search function to the problem of document retrieval and classification based upon a relevance criterion. The work includes a modification of A* and proposes a means of determining relevance as a function of independent textual mappings.

The Principal Objectives of this Project

1. The problem of retrieving useful information from the WWW.
2. The A* (A-star) heuristic approach to searching a state space.
3. The development of a simple relevance heuristic which does not require a large sample base.
4. The development and testing of a basic search agent.

The A* Heuristic

1. An informed search technique.
2. A function which evaluates the total cost of a path as the actual cost G(n) up to the current node plus the estimated cost H(n) from the current node to the goal node.
3. Requires an effective means of predicting expected path cost, and the estimate must be admissible: H(n) cannot overestimate the cost of reaching the goal node.
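As a minimal illustration of the general technique (not the project's Perl code), an A* loop over a small hypothetical graph can be sketched as follows; the graph, costs, and heuristic values are all invented for the example:

```python
import heapq

def a_star(start, goal, neighbors, h):
    """Generic A*: f(n) = g(n) + h(n), always expanding the lowest-f node.

    `neighbors(n)` yields (next_node, step_cost) pairs; `h` must be
    admissible (never overestimate the remaining cost to `goal`).
    """
    frontier = [(h(start), 0, start, [start])]  # (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for nxt, cost in neighbors(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None

# Hypothetical toy graph with an admissible heuristic:
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)],
         "C": [("D", 2)], "D": []}
h = {"A": 3, "B": 2, "C": 2, "D": 0}.get
cost, path = a_star("A", "D", lambda n: graph[n], h)
# cost == 4 along the path A -> B -> C -> D
```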

A* Function with a User-Set Time Limit

1. F(n) = G(n) + H(n), where n = time in seconds.
2. G(n) = total time elapsed.
3. DV (Document Value) = Relevance × Size (in bytes).
4. CRI = Current bytes of Relevant Information.
5. BP (Best Path) = Max_Bandwidth × total_time_avail; this is the perfect case and serves as the admissibility criterion.
6. H(n) = (BP − CRI) / DV, which yields the number of seconds left if this path is followed.
7. Links with the lowest total time left to reach the information goal are inserted into the priority queue and explored first.
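The scoring above can be sketched directly; every number in the example below is hypothetical (bandwidth in bytes per second, sizes in bytes, times in seconds), and the function names are illustrative rather than the program's own:

```python
def h_seconds(relevance, size_bytes, cri_bytes, max_bandwidth, total_time_avail):
    """H(n) = (BP - CRI) / DV: estimated seconds left on this path.

    BP is the best-case bytes obtainable in the full time budget,
    CRI the relevant bytes gathered so far, and DV the value of the
    document behind the candidate link.
    """
    dv = relevance * size_bytes            # Document Value
    bp = max_bandwidth * total_time_avail  # Best Path (admissibility bound)
    return (bp - cri_bytes) / dv

def f_score(elapsed_seconds, relevance, size_bytes, cri_bytes,
            max_bandwidth, total_time_avail):
    """F(n) = G(n) + H(n): elapsed time plus estimated time remaining."""
    return elapsed_seconds + h_seconds(relevance, size_bytes, cri_bytes,
                                       max_bandwidth, total_time_avail)

# Hypothetical link: relevance 0.5, 100 kB page, 2 MB of relevant bytes
# gathered so far, 1 MB/s bandwidth, 10-second budget, 3 seconds elapsed.
score = f_score(3, 0.5, 100_000, 2_000_000, 1_000_000, 10)
```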

Relevance

1. The technique used in this project is simply a comparison of text sample features.
2. It begins with a single sample, without specifying how large the sample needs to be.
3. It uses more than one functional mapping for comparison and expects that the weights assigned to each mapping accurately reflect their specificity.

Text Document Mappings

Φ1: S → WL, where S is the sample document and WL is the set of ordered pairs (a, b) such that a is a word in S and b is its relative frequency. This is the most basic lexical comparison between documents.

Φ2: S → WC, where S is the sample document and WC is a set of ordered pairs (a, b) such that a is a content-related token (a ∈ C) from S and b is the relative frequency of this token.

Φ3: S → TC, where S is the same as above and TC is a set of ordered pairs (a, b) such that a ∈ O and b is the relative frequency of a.

Φ4: S → OP, where S is the same as above and OP is a set of 3-tuples (a, b, c) such that a is in S, b is in S, b is one place ahead of a in the ordering (b = a + 1), and c is the relative frequency of this pair of words.

Φ5: S → MXST, where S is the same as above and MXST represents a set of 3-tuples which is a subset of OP and represents the maximum spanning tree connecting all words in the document based upon their ordering.
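The first mapping (word → relative frequency) is easy to sketch. The helper below is a hypothetical illustration, with a deliberately simple tokenizer:

```python
from collections import Counter
import re

def word_freqs(text):
    """Map a document to {word: relative frequency} — the basic
    lexical mapping from sample document to (word, frequency) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

freqs = word_freqs("the cat sat on the mat")
# "the" accounts for 2 of the 6 tokens
```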

Value of Different Mappings?

Document 1 is the original sample document; the user wants to find a maximum of related materials (related to this sample). Document 2 is a document downloaded from the WWW, which may contain related information.

The set produced by Φ3 above should show similarity for most documents, even those which are not really relevant. But are there small distinctions which can be used to judge similarity? The diagonal region in the diagram indicates the intersection between sets. For this case, let's assume it is a comparison using the Φ2 mapping: the intersection is small, but clearly not of the same magnitude as testing whether these documents use a similar frequency of operators. In this case it is obvious we would not want to weight these mappings the same.
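One way to realize the weighting argument (entirely illustrative, not the thesis code) is a weighted sum of per-mapping set overlaps, with higher weights on more specific mappings such as the content-token mapping; the feature sets and weights below are invented:

```python
def weighted_similarity(sample_maps, doc_maps, weights):
    """Combine per-mapping overlaps, weighting specific mappings higher.

    Each mapping is represented here as a set of features; the overlap
    is |A ∩ B| / |A ∪ B| (Jaccard), then weighted and summed.
    """
    score = 0.0
    for name, w in weights.items():
        a, b = sample_maps[name], doc_maps[name]
        if a | b:
            score += w * len(a & b) / len(a | b)
    return score

# Hypothetical feature sets: the operator-style mapping ("F3") overlaps
# heavily for almost any pair of documents, so it gets a low weight;
# content tokens ("F2") are specific, so they get a high weight.
sample = {"F2": {"heuristic", "search", "astar"}, "F3": {"the", "a", "of"}}
doc    = {"F2": {"search", "index"},              "F3": {"the", "a", "of"}}
sim = weighted_similarity(sample, doc, {"F2": 0.8, "F3": 0.2})
# F2 contributes 0.8 * 1/4; F3 contributes 0.2 * 3/3
```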

Reasons to Work with Web Search Agents

1. To investigate general and common problems for all forms of intelligence.
2. To experiment in a domain where machines are on a more 'equal' footing in terms of perception.
3. To confront the common and real problem of information overload.

What My Program Does

The program takes a sample of text (possibly a very small sample) and searches the World Wide Web for similar text documents (HTML format).

Principal Objects in the Design

HEURISTICS: Contains all of the code related to the main investigation of this thesis, including A* as implemented.
PRIORITY QUEUE: Ensures that the links with the lowest A* value are visited first.
CONNECTION OBJECT: Opens a connection to a web site, downloads the information, and returns it.
TEXT PROCESSOR: Prepares information for processing, removes links, and initializes key data points.
DATABASE: Manages important data which needs to be persistent.
LINK OBJECT: The actual data type managed by the Priority Queue; it contains two values, a hyperlink and an A* score.
VISIT LIST: A hash table (simply a hash as implemented by Perl) which ensures that there are no duplicate visits.
MAIN: All of the functionality, including the execution of A*, is included in the Main module.
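The Priority Queue and Link Object pair can be sketched with a binary heap (the original implementation was in Perl; the class and names below are illustrative):

```python
import heapq

class LinkQueue:
    """Min-heap of (A* score in seconds, hyperlink) pairs.

    Popping always returns the link with the lowest score, so the
    most promising link is visited first.
    """
    def __init__(self):
        self._heap = []

    def insert(self, url, score):
        heapq.heappush(self._heap, (score, url))

    def pop_best(self):
        score, url = heapq.heappop(self._heap)
        return url, score

q = LinkQueue()
q.insert("http://example.org/a", 120.0)
q.insert("http://example.org/b", 45.0)
url, score = q.pop_best()  # the lowest-score link comes out first
```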

Simplified High-Level Process Flow

1. Process the text sample and create comparison tables.
2. Submit an initialization query to an Internet search engine.
3. Place the returned links in the priority queue with an initially low seconds score.
4. Remove the lowest-score link from the priority queue and download its information.
5. Apply the A* function to the retrieved data and insert all extracted links into the priority queue with their scores.
6. Has the time limit been reached? If NO, return to step 4. If YES, halt and process the retrieved data as directed by the user and the purpose of the search.
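Glued together, the flow above reduces to a short time-bounded loop; all of the helper functions passed in below are hypothetical stand-ins for the program's modules (scoring, connection object, text processor):

```python
import heapq
import time

def crawl(seed_links, time_limit, score_fn, fetch_fn, extract_links_fn):
    """Time-bounded best-first crawl following the process flow above."""
    frontier = [(0.0, url) for url in seed_links]  # initially low scores
    heapq.heapify(frontier)
    visited, results = set(), []
    start = time.monotonic()
    while frontier and time.monotonic() - start < time_limit:
        _, url = heapq.heappop(frontier)
        if url in visited:          # the visit list prevents duplicates
            continue
        visited.add(url)
        page = fetch_fn(url)        # download the page
        results.append(page)
        for link in extract_links_fn(page):
            if link not in visited:
                heapq.heappush(frontier, (score_fn(page, link), link))
    return results  # halted: process as directed by the user

# Stub run over two hypothetical in-memory "pages":
pages = {"a": "doc-a", "b": "doc-b"}
out_links = {"doc-a": ["b"], "doc-b": []}
docs = crawl(["a"], time_limit=5.0, score_fn=lambda page, link: 0.0,
             fetch_fn=pages.get, extract_links_fn=out_links.get)
```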

Lessons Learned

1. Use an ANN for the relevance function.
2. Investigate whether this problem is better solved using hill-climbing.
3. Use Java and distributed objects to break the tasks down further and enable simultaneous processing; many tasks can be performed at the same time.