Intelligent Web Agent. Under the guidance of Dr. S. B. Nair. Presented by Sandeep Kumar (001127), Dept. of CSE, IIT Guwahati.
Goals: To develop an intelligent web agent that learns about the user's behavior without any user intervention and uses the information gathered to help the user find pages of interest by ranking Google results according to those interests.
Intelligent Web Agent: Agent – something that perceives its environment through its sensors and acts upon it through its effectors. Web Agent – an agent whose environment is the World Wide Web. Intelligent Web Agent – a rational web agent, i.e. one that, when given a choice, can make a rational decision leading to its goal. Personalized Intelligent Web Agent (PIWA) – an agent that learns user preferences and behavior over a length of time and exhibits ample intelligence in its decisions.
Problem: The explosive growth of the internet has made it the largest knowledge repository mankind has ever had. It needs to be used efficiently, providing users with the information they seek. Searching for information is a problem. The structure of the internet, a massive mess of hyperlinks pointing to HTML pages, makes it difficult. Search engines, though excellent in their working, fail to satisfy a user's needs: they return millions of results, most of which are unwanted.
How can we help the user: A web-based search engine lacks information about the user. Search engines are based on algorithms that need more information about the search query than a normal user provides, which is just one or two words. We can have a software agent on the user's desktop. It has more access to the user's browsing behavior and can learn from it. Using its knowledge base, it can help the user find information matching his interests.
Profiling: Static – profiles are built beforehand, like templates. Dynamic – profiles are generated dynamically by learning from the user's behavior. An intelligent search engine is needed over a general-purpose search engine. It should perform real-time adaptive learning by monitoring the user's habits, with no relevance feedback from the user, and must change as the user's interests change.
The Algorithm: Salient features: Representation of a user's interest as a group of words. Generation of the group of words by unobtrusive monitoring of the user's browsing habits, with no relevance feedback from the user. Dynamic membership of the group of words representing a particular user's interest. Using the group of words to improve web query generation. Using the group of words to rank results from a general-purpose search engine.
Interest: The basic knowledge block of the user profile. Represented by a group of 10 words, each having an associated weight, and a timestamp. The weight represents the importance of the word in that particular interest. E.g.: said 16, yesterday 15, city 15, sadr 14, iraq 12, holy 12, news 12, east 11, talks 11, tension 11. This captures the user's interest in what Sadr said yesterday in a holy city of Iraq and his talks about tension and the east.
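One way to picture an interest is as a small record holding the weighted words and a timestamp. The sketch below is only an illustration: the class name `Interest`, its fields, and the use of seconds since the epoch for the timestamp are assumptions, not taken from the presentation.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Interest:
    """Hypothetical container for one interest: weighted words plus a timestamp."""
    weights: Dict[str, int]
    # Assumption: the timestamp is stored as seconds since the epoch.
    timestamp: float = field(default_factory=time.time)

    def top_words(self, n: int = 10) -> List[Tuple[str, int]]:
        """Return the n heaviest words, ties broken alphabetically."""
        return sorted(self.weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# The example interest quoted on the slide.
example = Interest({
    "said": 16, "yesterday": 15, "city": 15, "sadr": 14, "iraq": 12,
    "holy": 12, "news": 12, "east": 11, "talks": 11, "tension": 11,
})
```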
Generation of Interest: Key point – unobtrusive and user friendly.
Implementation: The agent acts as a proxy for the user's browser and passively monitors incoming traffic. The 10 words are extracted from the very pages the user browses through. Steps to extract the top words from an HTML document: get the page, do feature extraction, do stop-word removal, do stemming.
Generation of Interest: Interests are generated from the HTML pages browsed by the user. Feature extraction is done using the latest features of HTMLEditorKit (available in JDK SE > 1.4). HTML tags are given weights, such as title 10, meta-names 6, block-quote 4, boldface and underline 2, font sizes, etc. Plain content is given weight 1 (similar to term frequency, tf). The weights are summed for every word. Commonly used words are removed by stop-word elimination, and words of length <= 2 are removed as well. The top 10 words are selected. Morphological analysis is not done, because many words (e.g. yahoo) do not occur in a dictionary and the process is still not very efficient.
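The tag-weighted extraction can be sketched as follows. The presentation used Java's HTMLEditorKit; this is a rough Python equivalent built on the standard `html.parser` module, not the author's code. Several details are assumptions: which meta names count (`keywords`, `description`), using the highest weight among the enclosing tags for a piece of text, skipping `script` and `style`, and the tiny stop-word list. Font-size weights are omitted because the slide does not give them.

```python
import re
from collections import Counter
from html.parser import HTMLParser

TAG_WEIGHTS = {"title": 10, "blockquote": 4, "b": 2, "u": 2}  # from the slide
META_WEIGHT = 6                                   # "meta-names 6" on the slide
META_NAMES = {"keywords", "description"}          # assumption
SKIP_TAGS = {"script", "style"}                   # assumption
STOP_WORDS = {"the", "and", "for", "that", "with", "this", "are", "was"}  # illustrative


class WeightedWordParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []       # stack of currently open tags
        self.scores = Counter()   # word -> summed weight

    def add_words(self, text, weight):
        for word in re.findall(r"[a-z]+", text.lower()):
            # Drop stop words and words of length <= 2.
            if len(word) > 2 and word not in STOP_WORDS:
                self.scores[word] += weight

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() in META_NAMES:
                self.add_words(attr.get("content") or "", META_WEIGHT)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to (and including) the most recent matching open tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIP_TAGS for t in self.open_tags):
            return
        # Plain content counts 1; weighted tags override it (assumption: max wins).
        weight = max([TAG_WEIGHTS.get(t, 1) for t in self.open_tags], default=1)
        self.add_words(data, weight)


def extract_top_words(html, n=10):
    """Return the n heaviest (word, weight) pairs of an HTML page."""
    parser = WeightedWordParser()
    parser.feed(html)
    parser.close()
    return sorted(parser.scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

For a page whose title is "Iraq news", the word iraq would collect 10 from the title plus whatever it earns in the meta tags and body text, so title words tend to dominate the top 10.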
Creation of Profile: Get 10 keywords from each page visited. 2 possible cases: the current page (its keywords) matches or is similar to a past interest -> that interest is updated; the current page (its keywords) is new -> a new interest is created. A match occurs if 3 or more words (>= 30%) are common to the keywords of the current page and a past interest. Interest update: sum the weights of the matched words and take the top 10 from the merged list.
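The match-or-create rule can be sketched like this, with interests and page keywords held as plain word-to-weight dictionaries. The function names are hypothetical, and one detail is an assumption: when several past interests reach the 3-word threshold, the sketch updates the one with the largest overlap, which the slide does not specify.

```python
def merge_keywords(interest, keywords, n=10):
    """Sum weights of shared words, then keep the n heaviest words of the merged list."""
    merged = dict(interest)
    for word, weight in keywords.items():
        merged[word] = merged.get(word, 0) + weight
    top = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    return dict(top)


def update_profile(profile, keywords, min_match=3):
    """Update the best-matching interest in place, or append a new one.

    profile is a list of {word: weight} dicts; returns the index touched.
    """
    best_index, best_overlap = None, 0
    for i, interest in enumerate(profile):
        overlap = len(set(interest) & set(keywords))
        if overlap > best_overlap:
            best_index, best_overlap = i, overlap
    if best_index is not None and best_overlap >= min_match:
        profile[best_index] = merge_keywords(profile[best_index], keywords)
        return best_index
    profile.append(dict(keywords))
    return len(profile) - 1
```

With a past interest containing iraq, sadr, holy and city, a new page whose keywords include iraq, sadr and holy shares 3 words and is merged into it; a page about an unrelated topic starts a new interest.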
Maintenance of Profile: An optimum profile size is needed: a list that is too big will hold erroneous interests and cause performance problems, while a small list may not cover all of the user's interests. At present the size is 20 interests. When an interest is created or updated, its timestamp is updated (the timestamp is associated with an 11th word, the marker 1234567890). The product of the timestamp and the sum of the weights of an interest is used to determine which interests remain in the list and which are removed.
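A minimal sketch of the eviction rule, assuming each interest is a dictionary with hypothetical `weights` and `timestamp` fields and that the timestamp is a number that grows with recency (for example seconds since the epoch). The slide gives only the product rule and the limit of 20; everything else here is illustrative.

```python
MAX_INTERESTS = 20  # profile size quoted on the slide


def retention_score(interest):
    """Timestamp multiplied by the sum of the word weights."""
    return interest["timestamp"] * sum(interest["weights"].values())


def prune(profile, max_size=MAX_INTERESTS):
    """Keep the max_size interests with the highest retention score."""
    return sorted(profile, key=retention_score, reverse=True)[:max_size]
```

Under this rule a recently touched interest with modest weights can outlive an older, heavier one, because both recency and weight enter the product.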
Use of Profile for web searches: Direct searching: the user provides a search query. 3 cases: the query matches one interest; the query matches more than one interest; the query does not match any interest. No match: a simple Google search. More than one match: for each matched interest, sum the weights of the query words in it and select the interest for which the sum is maximum. So now we have one matched interest.
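Selecting the matched interest might look like the sketch below. It assumes interests are word-to-weight dictionaries, that query words are already lowercased, and that "match" means at least one query word occurs in the interest; the function name is hypothetical.

```python
def select_interest(query_words, profile):
    """Return the interest with the largest summed weight of query words, or None."""
    best, best_sum = None, 0
    for interest in profile:
        total = sum(interest.get(word, 0) for word in query_words)
        if total > best_sum:
            best, best_sum = interest, total
    return best  # None means no match: fall back to a plain search
```

A return value of `None` corresponds to the no-match case, where the query goes to the search engine unchanged.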
Trigger Pair Model: Trigger pair – take some words from the matched interest whose weights are less than the smallest weight of any word of the query. This is done to prevent the original query from being overshadowed by more popular (heavier) words. For the same reason, 1 word is added for a single-word query and 2 words for a query of two or more words. The trigger pair refines the results from Google to a great extent.
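The query expansion can be sketched as below. The slide says only that the added words must weigh less than the lightest query word; picking the heaviest of those candidates is an assumption, as is the function name. Apart from iraq at 42, the weights in the usage note are made up for illustration.

```python
def expand_query(query_words, interest):
    """Append 1 word (single-word query) or 2 words (longer query) from the interest."""
    present = [interest[w] for w in query_words if w in interest]
    if not present:
        return list(query_words)
    threshold = min(present)  # smallest weight of any query word in the interest
    candidates = sorted(
        ((w, wt) for w, wt in interest.items()
         if wt < threshold and w not in query_words),
        key=lambda kv: (-kv[1], kv[0]),  # assumption: prefer the heaviest candidates
    )
    extra = 1 if len(query_words) == 1 else 2
    return list(query_words) + [w for w, _ in candidates[:extra]]
```

With an interest such as `{"iraq": 42, "fallujah": 30, "marines": 25, "baghdad": 20}`, the query `["iraq"]` would become `["iraq", "fallujah"]`.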
Ranking of results to the user's interest: Get the top 20 results from Google. Get the top 10 keywords for each result. Score each page by summing the products of the weights of the words common to the matched interest and the page keywords. Get the top 10 pages based on the score: 1st result. Take the arithmetic mean of the Google rank and the algorithm's rank for each page and get the top 10 pages: 2nd result.
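Both rankings can be sketched as follows, with each result represented by its keyword-to-weight dictionary and listed in Google order. The function names are hypothetical, and breaking ties by Google position is an assumption the slide does not state.

```python
def page_score(page_keywords, interest):
    """Sum of weight products over words common to the page and the interest."""
    return sum(wt * interest[w] for w, wt in page_keywords.items() if w in interest)


def piwa_order(results, interest):
    """Indices of results sorted by descending score (ties keep Google order)."""
    return sorted(range(len(results)),
                  key=lambda i: (-page_score(results[i], interest), i))


def mixed_order(results, interest):
    """Indices sorted by the arithmetic mean of Google rank and PIWA rank."""
    piwa_rank = {idx: rank for rank, idx in enumerate(piwa_order(results, interest), 1)}
    return sorted(range(len(results)),
                  key=lambda i: (((i + 1) + piwa_rank[i]) / 2, i))


def top_pages(order, n=10):
    """Keep the first n entries of an ordering."""
    return order[:n]
```

Here `results[i]` is the keyword dictionary of the result Google ranked at position `i + 1`, so `top_pages(piwa_order(results, interest))` gives the first result list and `top_pages(mixed_order(results, interest))` the second.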
Points to note: 1st interest: lexington and concorde came in because they were advertisements on the first page; the parser is not designed to ignore them and treats them as part of the page. Ignoring them is possible if the structure of the page is known beforehand, but that is impossible in the present case. The timestamps of the 2nd and 10th pages merge, as both pages are from Reuters.
Search: IRAQ. Search query: iraq. Matched interest: the last one, as the weight of iraq is maximum in it (42). The trigger pair causes Fallujah to be appended.
Points to note: The 1st result of Google is about Fallujah and has fewer elements of fighting in it; hence it ranks a poor 10th in the PIWA ranking. The 7th result of Google, which is basically a discussion board on the Iraq war with lots of discussion on Iraq, Fallujah, marines, coalition and Baghdad (words in the matched interest), ranks first in the PIWA ranking. Mixed results: the page ranked 1st there is 2nd in Google and 5th in PIWA.
2. Sandeep: 3 pages visited (2 homepages and 1 resume); 3 interests formed.
Points to note: The matched interest was derived from the 10th result of Google, which ranked 2nd in the PIWA ranking and hence 5th in the mixed results. A page visited in the past affects the results greatly. Sandeep is a very general term, so the results are still not much inclined in my favour.
Points to note: A normal search in Google without the trigger pair returned results on war, which the user did not want. Google's 7th, 8th, 9th and 10th results do not make it to the top 10 of PIWA, which shows the ranking differences based on the user's interests.
Mixed Results: In classical AI search terminology: Google – explore strategy, get new results. PIWA – exploit strategy, use past information to decide new ranks. Mixed – a 50-50 mix of both, which can be changed to explore more initially and then exploit more, as in any other learning process.
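The adjustable mix can be illustrated with a weighted mean of the two ranks. The parameter name `exploit` is hypothetical; a value of 0.5 reproduces the 50-50 arithmetic mean, while lower values lean on Google (explore) and higher values lean on PIWA (exploit). How the value should change over time is not specified in the presentation, so no schedule is shown.

```python
def blended_rank(google_rank, piwa_rank, exploit=0.5):
    """Weighted mean of the two ranks; exploit is the share given to PIWA."""
    if not 0.0 <= exploit <= 1.0:
        raise ValueError("exploit must be between 0 and 1")
    return (1.0 - exploit) * google_rank + exploit * piwa_rank
```

For a page ranked 2nd by Google and 5th by PIWA, `blended_rank(2, 5)` gives 3.5.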
Conclusion: A profile for a user was generated with absolutely no relevance feedback from the user. Profile maintenance is dynamic: the profile is continuously updated with new information. The profile is used to improve the user's web searches to suit his interests.
Future Work: Improve the GUI, especially for the search utility. Support for plugins, to handle non-HTML documents. Support for encoded pages: SSL, gzipped, etc. A news reader can be made easily with an improved parser that knows the structure of news pages beforehand.