
Ah-Hwee Tan Laboratories for Information Technology, Singapore Oct 11, 2002 Guest Lecture to Singapore-MIT Alliance Artificial Intelligence Technologies.

1 Ah-Hwee Tan Laboratories for Information Technology, Singapore Oct 11, 2002 Guest Lecture to Singapore-MIT Alliance Artificial Intelligence Technologies for Web Intelligence

2 What is Web Intelligence (WI)? How to do WI? Technologies and Tools (disclaimer: snapshots only) What's next? Outline

3 and … spying on the web Web Intelligence

4 Scanning, tracking, and analyzing information on the world wide web for the purpose of competitive intelligence Intelligence as in Central Intelligence Agency Web Intelligence

5 Web Intelligence Consortium (WIC) Artificial Intelligence (AI), Information Technology (IT), + Web Intelligence as in Artificial Intelligence The other definition of Web Intelligence

6 Highlights the importance of gathering, analyzing, and distributing competitive information to gain competitive advantages. Too risky to do business without CI. SCIP grew from 150 (1991) to 7000 (2000). Press articles have increased from 100 (1991) to 6000 (2000). Competitive Intelligence (CI) (Fuld & Company, 2000, 2001)

7 Competitive Intelligence Cycle (Fuld & Company, 2000, 2001) Planning & Direction Information Gathering Analysis & Production Evaluation & Tracking

8 Information Gathering –Getting the information (search, information retrieval) Analysis and Production –Putting things in perspective (clustering, categorization) –Gaining insights (info/knowledge extraction, discovery) Evaluation and Tracking AI Technologies for Web Intelligence

9 Technologies for Search Purpose: Getting the right information Challenges –Too much information, irrelevant information, out-of-date information Technologies –Information retrieval, PageRank Tools –General: Google, AltaVista, Excite, etc –Specialized: Patent (Delphion), News (LexisNexis)

10 SMART (Salton, 1971) One of the first, and still best, IR systems. Vector space model for representing documents. Automatic indexing. Given a new query –converts it to a vector –uses a similarity measure to compare it to the documents –returns the top n documents. Can perform relevance feedback
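
The query steps on this slide (convert to a vector, compare with a similarity measure, return the top n) can be sketched as follows. This is a minimal illustration, not SMART itself: it assumes raw term counts as vector components and cosine similarity as the measure, and the names `to_vector`, `cosine`, and `search` are made up for the example.

```python
import math
from collections import Counter

def to_vector(text):
    """Bag-of-words term-count vector for a piece of text."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def search(query, documents, n=2):
    """Return the indices of the n documents most similar to the query."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(d)), i) for i, d in enumerate(documents)]
    # Highest similarity first; ties broken by document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:n]]

docs = [
    "operating system kernel scheduling",
    "web search engine ranking",
    "competitive intelligence on the web",
]
top = search("web search", docs, n=2)
```

Relevance feedback would then adjust the query vector using documents the user marks as relevant; that step is left out of the sketch.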

11 Document Representation Vector Space Model –Bag of words, e.g. operating, system –Terms/Phrases, e.g. operating systems N-grams (Huffman, TREC-4, 1995) Syntactic 3-tuples (Kanagasa & Pan, PRICAI-2000) Concept-Relation-Concept (Paik et al, US6,263,335)

12 Indexing Goal –To select a set of important keyword features among all words that appear in the document set How –remove stop words, reduce to root form –pick terms based on part-of-speech tagging –keyword weighting
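
A rough sketch of the first indexing step (remove stop words, reduce to root form). The stop-word list and the suffix-stripping rule are toy assumptions for illustration; a real system would use a full stop list and a proper stemmer, and part-of-speech tagging is not shown.

```python
# Toy stop-word list, assumed for the example only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on"}

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, strip punctuation, drop stop words, and stem what is left."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    return [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]

terms = index_terms("Indexing the documents for searching.")
```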

13 Feature Weighting Goal –To represent a doc using a real-valued vector How: An example –For doc $d_j$ and keyword $w_i$, calculate term frequency $\mathrm{TF}(w_i, d_j)$, inverse document frequency $\mathrm{IDF}(w_i) = \log(N / \mathrm{DF}(w_i))$, and the TF.IDF weight $I_{ij} = \mathrm{TF}(w_i, d_j) \cdot \mathrm{IDF}(w_i)$ –Normalize $I_j = (I_{1j}/I_m, I_{2j}/I_m, \ldots, I_{Nj}/I_m)$ where $I_m = \max_i I_{ij}$
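
The weighting example can be worked through in a few lines. The sketch assumes that N in the IDF term is the number of documents and DF(w) is the number of documents containing w (the slide does not define either), and it normalizes each document vector by its largest weight as on the slide. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, max-normalized TF.IDF vectors)."""
    n_docs = len(docs)
    vocab = sorted({t for d in docs for t in d})
    # Document frequency: number of documents in which each term occurs.
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        weights = [tf[t] * math.log(n_docs / df[t]) for t in vocab]
        peak = max(weights, default=0.0)
        # Divide by the largest weight in this document (I_m on the slide).
        vectors.append([w / peak if peak > 0 else 0.0 for w in weights])
    return vocab, vectors

docs = [["web", "search", "web"], ["web", "mining"], ["patent", "search"]]
vocab, vectors = tfidf_vectors(docs)
```

In the first document "web" occurs twice and "search" once, both with the same IDF, so after normalization "web" gets weight 1 and "search" gets half of that.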

14 PageRank (Page & Brin, 1998) Google uses the web's vast link structure as an indicator of an individual page's value. A page that receives many links is important. A page that receives a link from an important page is also important. Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant
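
The two rules on this slide (many incoming links, and links from important pages, both raise importance) can be illustrated with a small power-iteration sketch. The damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links are assumptions of this example, not details given in the lecture, and this is not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its rank on in equal shares to the pages it links to.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Page with no outgoing links: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

Here "c" collects three incoming links and ends up with the highest rank, while "a" outranks "b" and "d" only because its single link comes from the important page "c".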

15 How to Search - Tips from an Intelligence scout (Courtesy of LIT's Planning Group)

16 LIT KSKS Process 1) KIT (Identify your Key Intelligence Topic) 2) Sources (and resources) 3) KIQ (Key Intelligence Questions) 4) Search Strategy

17 Key Intelligence Topic Identify your Key Intelligence Topic(s) Drill down –instead of Ubiquitous Computing, what sub topics are you REALLY interested in? –a taxonomy will be useful

18 KIT Start with a good descriptive paragraph on your topic, name a few applications. Think out of the box - terminologies used by reporters, journalists, laymen

19 Sources

20 … and Resources TIME and MANPOWER and TRAINING Monitoring = Project –Monitoring : long periods of time, identify the delta (change) –Project: specific, determined period of time. Objective/goal is to know as much as possible on topic

21 Key Intelligence Questions Known Analysis Techniques: 5F, 5C, SCP, TOWS LIT methodology: KIQ technique (Combo of above) Your KIQs form the backbone of your analysis (WYAIWYG)

22 …. KIQs Ask yourself 5-8 Key Intelligence Questions Establish key indicators or proxy indicators

23 Sample KIQs -Top industry players? (big, small, listed, unknown) Region? Profiles. -R&D labs? Region? -Major research trends? -Products available? Prototypes? Technologies? -Research challenges? (problems and issues) -Upcoming markets (segments? size? Time frame) -IP and opportunities for LIT? Demand/ opportunities Supply Environment/ opportunities Strength/ opportunities Supply/weakness/ threats Environment

24 Questions Where are the markets for the applications? What time frame for market release? What are the price points? Who are the top # players? (by countries/region/labs/companies) What products available? Any prototypes? What are the technologies behind these? What are the research trends/ challenges? Any IP opportunities?

25 Search Strategy Sources and URLs Search Magnets (word/phrase spotting) Tools Reiterate!

26 Magnets Magnets are specific, well used terms to increase probability –append to your normal search string Trends, surveys, forecasts, estimates, units shipped, scenarios CEO + interview market research report, table of contents see handout Appendix B. cheat sheet on magnets
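
Appending magnets to a normal search string is easy to automate. The helper below is a hypothetical sketch: it just produces one query per magnet, which can then be pasted into whichever search tool is being used.

```python
def with_magnets(base_query, magnets):
    """Return one search string per magnet, each appended to the base query."""
    return [f"{base_query} {m}" for m in magnets]

queries = with_magnets(
    '"wearable computing"',
    ["market forecast", "units shipped", "CEO interview"],
)
```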

27 Recap KIT (sub topics) –terms (known to you): –terms (used elsewhere during a search) Sources –Specific syntax –Magnets KIQs Tools - Search

28 Tools for Search Copernics (PC) Google, AltaVista Link Search (web, free) Lexis Nexis (web, subscription) – Use advanced search –purpose: increase relevance –tablebase InfoTech Trends (web, subscription) Delphion Patent Server (web, subscription)

29 Copernics: Search, File, Track

30 Google ( - a tool for

31 Google: Search

32 Tips for using Google Try the obvious first. If you're looking for information on java project, enter "Java project" rather than "java". Use words likely to appear on a site with the information you want. "Java Project Spanish Inquisition" gets better results than "spanish java". Make keywords as specific as possible.

33 All terms By default, Google only returns pages that include all of your search terms. There is no need to include "and" between terms. Keep in mind that the order in which the terms are typed will affect the search results.

34 Stop words If a common word is essential to getting the results you want, you can include it by putting a "+" sign in front of it. (Be sure to include a space before the "+" sign.) e.g. Star Wars Episode +1 rather than Star Wars Episode 1

35 Google: not case sensitive Google searches are NOT case sensitive. All letters, regardless of how you type them, will be understood as lower case. For example, searches for "george washington", "George Washington", and "gEoRgE wAsHiNgToN" will all return the same results.

36 Google: no stemming Google does not use "stemming" or support "wildcard" searches. In other words, Google searches for exactly the words that you enter in the search box.

37 Find out who links to you Find out who links to the Java Project

38 Google: Site search The word "site" followed by a colon enables you to restrict your search to a specific site. To do this, use the syntax spanish inquisition

39 Useful if you are looking for news surrounding a small, unknown, unlisted company that may be your competitor. Instead of searching for the small company, search for who else links to or writes about that small company. Who else? (What can you find out about the small company?) Its interested investors or alliances, its suppliers, research collaborations. Use the Good Old Alta Vista. Altavista: Link search

40 Link:infineon + fabric + wearable Who else links to infineon? Who else is interested in infineon? Note: why is www left out in the link search? Link: everyone else except krdl (not interested in self citations) link: url:edu Who are the edu (usually universities, including research) with an interest in or collaborating with krdl? link: url:edu Same as above, but not interested in local universities. Alta Vista link search

41 - The Legal and News Provider Lexis Nexis

42 - Power Search - Relevance e.g headline( smart homes ) - Proximity and Stemming e.g comput! (stemming) e.g w/10 (within 10 words) e.g w/p (within paragraph) - Limit currency (90 days, previous year), then expand Lexis Nexis
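
To make the proximity and truncation operators concrete, here is a small sketch of what "comput! w/10 patent" is asking for. It only illustrates the idea in plain Python and is not how Lexis Nexis implements or counts proximity; the function name `within` and the tokenization rule are assumptions.

```python
import re

def within(text, term_a, term_b, n):
    """True if term_a and term_b occur within n words of each other.

    A trailing '!' on a term matches any word that starts with that prefix.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())

    def positions(term):
        term = term.lower()
        if term.endswith("!"):
            return [i for i, w in enumerate(words) if w.startswith(term[:-1])]
        return [i for i, w in enumerate(words) if w == term]

    return any(abs(i - j) <= n for i in positions(term_a) for j in positions(term_b))

near = within("New patent covers wearable computing devices", "comput!", "patent", 10)
far = within("patent " + "filler " * 20 + "computer", "comput!", "patent", 10)
```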

43 - (red eye) w/p patent Example red-eye correction

44 - Find the Elusive Market Numbers Specific source within Lexis Nexis: select RDS TableBase. Text articles accompanied by tabulated data from market research consultants and investment houses. Supplement with another useful table database, Infotechtrends. Lexis Nexis Power Tip 2

45 Lexis Nexis RDS TableBase market size data

46 Results

47 Handset leaders? Strategy Analytics, a Boston-based research firm, estimates that Nokia and Samsung Electronics Co. Ltd., Seoul, South Korea, were the only leading handset makers to make a profit last year.

48 - InfoTech Trends Data compiled from various IT related trade magazines - Login with ip address Data and Tables (2)


50 Clustering –Organizing information into groups based on similarity functions and thresholds –e.g. NorthernLight, BullsEye, Vivisimo Categorization –Organizing information into a predefined set of classes –e.g. Yahoo!, Autonomy Knowledge Server Technologies for Organizing

51 Grouping of information based on their similarities Unsupervised/self-organizing, require no training or predefinition of classes Many methods available –Agglomerative, K-means, SOFM, ART, etc Purpose is to identify groupings or themes automatically Clustering (Sch64, Wis69)

52 Bottom up, hierarchical Algorithm –Given N inputs, begin with N clusters –Merge the pair of clusters that are closest –Update similarity matrix –Repeat until 1 cluster remains Simple, but too slow to run Agglomerative Hierarchical Clustering (Barnard & Downs, 1992)
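
The algorithm on this slide can be sketched directly. The example assumes Euclidean distance and single-link merging (distance between the closest members of two clusters), neither of which the slide specifies, and it recomputes distances on every pass instead of maintaining a similarity matrix, which also shows why the naive version is slow.

```python
def agglomerative(points):
    """Single-link agglomerative clustering; returns the merges in the order made."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Begin with one cluster per input point (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        # Merge the closest pair and continue until one cluster remains.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
merges = agglomerative(points)
```

Reading the merge list from the end gives the hierarchy top down: the last merge joins the two groups that were formed first.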

53 Bottom-up, flat approach Algorithm –Initialize K reference clusters –Assign each data point to the nearest cluster centroid –Recalculate the centroid of each cluster using the means of the input –Repeat until convergence K-means (Tou & Gonzalez, 74)
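
A minimal sketch of the K-means loop described above. It assumes the initial centroids are supplied by the caller, uses squared Euclidean distance, and treats "convergence" as the point where no assignment changes; those choices are assumptions of the example.

```python
def kmeans(points, centroids, max_iter=100):
    """K-means on tuples of floats, starting from the given centroids."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assign each data point to the nearest cluster centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: sqdist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Recalculate each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, assignment = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```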

54 Initialize K cluster vectors (with neighborhood relationship) Given an input, identify the closest cluster Update the cluster vector together with those in the neighborhood toward the input vector Repeat and shrink the neighborhood until convergence Self-Organizing Map (Kohonen, 1997)
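
The steps above can be sketched for a one-dimensional map, where a unit's neighbours are simply the units next to it on a line. The learning rate, the fixed number of epochs, and the schedule that shrinks the neighbourhood to the winner alone halfway through training are all assumptions of this example.

```python
def best_unit(x, weights):
    """Index of the unit whose vector is closest to input x."""
    return min(range(len(weights)),
               key=lambda k: sum((xi - wi) ** 2 for xi, wi in zip(x, weights[k])))

def train_som(data, weights, epochs=20, lr=0.5, radius=1):
    """Train a 1-D self-organizing map; weights is a list of unit vectors on a line."""
    weights = [list(w) for w in weights]
    for epoch in range(epochs):
        # Shrink the neighbourhood to the winner alone in the second half of training.
        r = radius if epoch < epochs // 2 else 0
        for x in data:
            winner = best_unit(x, weights)
            for k in range(len(weights)):
                if abs(k - winner) <= r:
                    # Move the unit part of the way toward the input vector.
                    weights[k] = [w + lr * (xi - w) for w, xi in zip(weights[k], x)]
    return weights

data = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
initial = [(0.2, 0.2), (0.5, 0.5), (0.8, 0.8)]
trained = train_som(data, initial)
```

After training, inputs from the two groups of points map to units at opposite ends of the line.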

55 Tools for Search & Organizing BullsEye (PC) NorthernLight (web, free) Vivisimo (web,free) Aurigin/ThemeScape for Patents (web, subscription)

56 BullsEye: Search, Organize, File, Track

57 NorthernLight

58 NorthernLight Custom Search Folders group your results by Subject (e.g., hypertension, baseball, camping, expert systems, desserts) Type (e.g., press releases, product reviews, resumes, recipes) Source (e.g. personal pages, magazines, encyclopedias, databases) Language (e.g., English, German, French, Spanish)

59 Introducing Vivisimo ( - a tool for search and

60 Vivisimo Meta-search engine Supports the most advanced features of the major search engines using one Vivísimo syntax Vivísimo translates your query into the corresponding syntax of each underlying search engine.

61 Vivisimo




65 A user defines a set of categories or classes. Assigning a text document to one or more of the predefined categories or document classes. Theme extraction –The simplest form of text mining Text Categorization

66 Supervised learning approach Examples –Decision tree (C4.5, C5) –K Nearest Neighbor (KNN) –Bayes classifier –Linear least square fit (LLSF) –Support vector machine (SVM) –Neural Networks Assume the availability of a large pre-labeled training corpus Statistical Text Categorization
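
As an illustration of one of the listed methods, here is a small K Nearest Neighbor sketch. It assumes word-set (Jaccard) overlap as the similarity measure and a tiny hand-made training set; a real categorizer would use weighted term vectors and the large pre-labeled corpus the slide mentions.

```python
from collections import Counter

def knn_classify(text, training, k=3):
    """training: list of (text, label) pairs. Returns the majority label of the k nearest."""
    def overlap(a, b):
        # Jaccard similarity between the two word sets.
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    ranked = sorted(training, key=lambda pair: overlap(text, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

training = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("market rates and shares", "finance"),
    ("team wins football match", "sports"),
    ("football player scores goal", "sports"),
    ("tennis match final score", "sports"),
]
label = knn_classify("football match tonight", training, k=3)
```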

67 Autonomy's Intelligent Data Operating Layer (IDOL) Server Enterprise software Functions –retrieval –clustering –categorization –Community & collaboration –XML –Agents –...

68 Pros –Unsupervised/self-organizing, require no training or predefinition of classes –Able to identify new themes Cons –Users have no control –Difficult to navigate due to ever changing cluster structure Clustering: Pros and Cons

69 Requires learning (supervised) and/or definition of classification rules/knowledge. Every piece of information has to be assigned to one or more class(es). Good control, but lacks flexibility to handle new information. Categorization: Pros and Cons

70 User-configurable Clustering (Tan & Pan, PAKDD-02) New way of information organization and content management Combines automatic clustering with user-defined structure (preferences) Reduces to a clustering system if no user indication given Allows personalization in a direct, intuitive, and interactive manner Control + flexibility

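71 [Figure: ARAM architecture. Input field F1a receives the information vector A and input field F1b receives the preference vector B; each has a vigilance check. Both connect to the category field F2, which holds the information clusters.] Adaptive Resonance Associative Map (ARAM) (Tan, Neural Networks, 1995)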

72 FOCI - a tool for search, clustering, personalization, tracking, and sharing

73 Flexible Organizer for Competitive Intelligence (FOCI) (Tan et al., IJCAI-01 workshop, CIKM-01, KAIS Journal forthcoming) A platform for gathering, organizing, tracking, analyzing, and sharing intranet and internet based competitive information New way of turning raw information into competitive knowledge First multilingual CI software –Based on LIT Multilingual Efficient Analyzer –English and Chinese Domain localization (Technology)

74 FOCI Architecture [Diagram components: Intranet/Internet, Content Gathering, Content Management, Content Mining, Content Publishing, Domain-Specific Knowledge, CI Portfolio, Visualization Front End, Users]

75 FOCI - Personalized Content Management Portfolio created through Search Unsupervised clustering Loop –Personalization by users –Reorganization of clusters Saving of personalized portfolio Tracking of new information

76 Personalization Functions Marking/labeling (selected) clusters –Personal interpretation Inserting Clusters –Indicate preference on groupings Merging clusters –Indicate preferences on similarities Splitting clusters –Indicate preferences on differences...

77 Clustering by URL + Title + Description

78 A partially Personalized Portfolio

79 A fully Personalized Portfolio

80 Organizing New Information (Without Personalization) 42 documents from DirectHit, Netscape, and BusinessWire

81 Organizing New Information (Based on Personalized Portfolio)

82 Technologies for Analyzing To analyze document content in terms of entities and relations Challenges Need to understand natural language Technologies –Information extraction –Knowledge extraction –Concept map visualization –Discovery of new knowledge

83 Similarity: both start from text and produce a semi-structured or structured form. Differences: Information Extraction uses predefined/pre-trained templates, yields flat/relational output, and is for building databases (records and fields); Knowledge Extraction needs to handle new concepts, yields deep structure, and is for building knowledge bases (facts and rules). Information Extraction vs Knowledge Extraction

84 Knowledge Extraction by Concept Frame Graph (Kanagasa & Tan, CIKM 2002) Concept extraction

85 Knowledge Extraction by Concept Frame Graph (Kanagasa & Tan, CIKM 2002) Concept mapping Q & A

86 Technology Landscape Search and organizing –already mature –many vendors –Autonomy, Verity, Mohomine, Semio, Stratify,... Analysis –still in research –real knowledge discovery

87 What's next? Autonomous agents –Personal software to spy for you Semantic Web –XML/RDF –web-based applications and services

88 Semantic Web (Tim Berners-Lee et al, Scientific American, May 2001) Assumption: The real power of WWW as a platform for knowledge repository and sharing has yet to be unleashed. Vision: Automated services, interweaving computers and human beings. SW will bring structure to the web, creating an environment where software agents... can readily carry out sophisticated tasks for users

89 Semantic Web + Agents Standard (XML, RDF) The Old Web Ontology Information Mining/ Knowledge Management

90 More readings Intelligence Software Report –more info integration and data analysis software Taxonomy & Content Classification –A Delphi Group White Paper –more content/information management software
