Jean-Eudes Ranvier, 17/05/2015, Planet Data - Madrid. Trustworthiness assessment (on web pages), Task 3.3.



Credibility assessment on web pages

Introduction
- The number of available data sources keeps increasing at a fast pace
  - Sensors embedded in mobile phones, websites, blogs, ...
- Data becomes more valuable when combined from different sources
- What about the trustworthiness of this aggregated data?
  - Unknown data sources
  - No standard way to evaluate trustworthiness
  - Subjectivity of the consumer of the data
  - Important economic incentive to lie
- Interesting case of the WWW
  - Web credibility assessment

What is the problem of web credibility?
- Non-credible websites represent a significant percentage of the web
- Credibility is seen as an aggregation of objective and subjective components (Fogg)
  - Credibility = trustworthiness AND expertise
- Web users can be naive or lazy and won't try to verify information
- Focus on domains where expertise is hard to evaluate for lay users
  - Medical treatments
  - Trading operations
  - Ideological assertions
- Economic and political interests are at stake

Background
- Trustworthiness components in the context of web credibility:
- Y. Yamamoto and K. Tanaka. Enhancing credibility judgment of web search results.
  - Accuracy: referential importance
  - Authority: social reputation
  - Objectivity: content typicality
  - Currency: update frequency
  - Coverage: coverage of topic
- M. J. Metzger. Making sense of credibility on the web: Models for evaluating online information and recommendations for future research.
  - Credentials
  - Advertisements
  - Design

Credibility assessment as a classification problem
- Use historical information on evaluations for future credibility assessment
- A machine learning approach
  - Binary classification: users evaluate pages as credible or non-credible
  - Content-based features: extracted programmatically from web pages
  - Training set and test set
    - Leave-one-out cross-validation
    - Tested by category
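The leave-one-out procedure named on this slide can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names, the `predict` callback signature, and the toy majority-vote classifier are all assumptions introduced here.

```python
from typing import Callable, List


def leave_one_out_accuracy(
    samples: List[object],
    labels: List[int],
    predict: Callable[[List[object], List[int], object], int],
) -> float:
    """Hold out each page once, train on the rest, and count correct predictions."""
    correct = 0
    for i in range(len(samples)):
        # Everything except page i forms the training set.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def majority_label(train_x: List[object], train_y: List[int], held_out: object) -> int:
    """Hypothetical stand-in classifier: predicts the majority class of the training labels."""
    return 1 if sum(train_y) >= 0 else -1


# Toy data: labels use the binary scale {-1, 1} from the slides.
pages = ["page_a", "page_b", "page_c", "page_d"]
ratings = [1, 1, 1, -1]
accuracy = leave_one_out_accuracy(pages, ratings, majority_label)
```

To test by category, as the slide indicates, the same loop would be run separately on the pages of each category.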

Feature selection
- Categories
  - Act as a filter: only pages from the same category are tested for similarity
- Keywords and entities in the document
  - Reflect the topic of the web page at a finer grain
- Sentiment analysis
  - Computed at the word level
  - Used in conjunction with keywords and entities
- Part of speech
  - Extra feature reflecting the overall structure of the web page
- Number of ads displayed (in progress)
  - Ads distract users from their activity and the page loses credibility
- Complexity of the CSS files (not included yet)
  - Pages with no structure tend to lose credibility
- PageRank
  - Google's metric, which includes a credibility measure

Experimental setup
- Two machine learning algorithms
  - kNN item-item algorithm
    - Computes a similarity between pages
    - Takes into account only the most similar pages
  - C4.5 decision tree
    - Has good performance in general
    - However, not suitable for multivalued features (keywords, entities)
    - Defined as a baseline
- Microsoft corpus
  - 1000 pages evaluated for credibility by experts and regular users
  - Divided into 5 topics
  - Top 40 pages retrieved by search engines for 5 queries
  - Rescaled from Likert scale [0;5] to binary scale {-1;1}
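The rescaling from the Likert scale to the binary scale could look like the sketch below. The slides do not state the cut-off used, so the `threshold` value of 3.0 is purely an assumption for illustration, as is the function name.

```python
def likert_to_binary(rating: float, threshold: float = 3.0) -> int:
    """Map a Likert rating in [0, 5] to the binary scale {-1, 1}.

    The threshold is an assumption: the slides do not say where the
    boundary between credible and non-credible was placed.
    """
    if not 0 <= rating <= 5:
        raise ValueError("rating must lie in [0, 5]")
    return 1 if rating >= threshold else -1


binary_ratings = [likert_to_binary(r) for r in [0, 2, 3, 5]]
```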

Content-based rating
- kNN item-item algorithm
  - Based on similarity between pages rated by the user
- Aggregated similarities
  - Based on the similarity of page features
  - Cosine similarity for monovalued features (POS, PageRank, ...)
  - Jaccard similarity for multivalued features (keywords, entities)
  - Only positive similarities are taken into account

$$ r_{u,i} = \frac{\sum_{j \in \mathrm{similarItems}} s_{i,j} \, r_{u,j}}{\sum_{j \in \mathrm{similarItems}} s_{i,j}} $$
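A minimal sketch of the two similarity measures and the similarity-weighted prediction follows. It reads the formula as: the predicted rating of user u for page i is the average of the user's ratings r(u,j) on similar pages j, weighted by the similarities s(i,j). All function names and the dictionary layout are assumptions, and the slides do not specify how the per-feature similarities are aggregated into one score, so that step is left out.

```python
import math
from typing import Dict, Sequence, Set


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity for monovalued (numeric) features."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity for multivalued features such as keywords or entities."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def predict_rating(similarities: Dict[str, float], ratings: Dict[str, int]) -> float:
    """Similarity-weighted average of the user's ratings on similar pages.

    similarities maps a page id j to s(i, j); ratings maps j to r(u, j).
    Only positive similarities contribute, as stated on the slide.
    """
    numerator = 0.0
    denominator = 0.0
    for page, s in similarities.items():
        if s > 0 and page in ratings:
            numerator += s * ratings[page]
            denominator += s
    # With no usable neighbour, return 0.0 (no evidence either way); this is a design choice.
    return numerator / denominator if denominator > 0 else 0.0


keyword_sim = jaccard_similarity({"vaccine", "dose"}, {"dose", "trial"})
predicted = predict_rating({"p1": 0.75, "p2": 0.25, "p3": -0.4},
                           {"p1": 1, "p2": -1, "p3": -1})
```

A positive predicted value would then be read as credible and a negative one as non-credible on the {-1;1} scale.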

Evaluation: preliminary results

Results
- Mixed results
  - Precision ~ 0.7, recall ~ 0.8
- Impossible to predict the credibility accurately
- Biased by the distribution of ratings over classes
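To make the bias remark concrete, the sketch below computes precision and recall for the credible class and applies them to a synthetic, made-up label set (not the Microsoft corpus). It shows how a trivial classifier that always answers "credible" already scores well when most pages are rated credible. The function name and the data are assumptions for illustration.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Tuple[float, float]:
    """Precision and recall for the positive (credible) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Synthetic example: 7 credible pages, 3 non-credible pages.
true_labels = [1] * 7 + [-1] * 3
always_credible = [1] * 10
precision, recall = precision_recall(true_labels, always_credible)
```

Here the always-credible baseline reaches a precision of 0.7 and a recall of 1.0 without using any feature, which is why scores should be compared against the class distribution.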

Results
- Tests on keywords + entities + sentiment
- Similar results (precision ~ 0.7, recall ~ 0.8)

Results
- Mixed results among classes
- Tests on all features (POS + keywords + entities + sentiment)
- Similar results (precision ~ 0.7, recall ~ 0.8)

Future work
- Semantic distances
  - Pages seen as sets of concepts
  - Definition of a distance between two sets in the concept space
  - Similarity using a path distance in a concept hierarchy
- Social referrals
  - Use evaluations from other people
  - Weights based on their trustworthiness
  - Estimate page credibility based on beta reputation
- Combine reputation with classification approaches into an aggregated metric
  - To get a better estimate of credibility than either component alone
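One common form of beta reputation takes the expected value of a Beta distribution over positive and negative feedback, (r + 1) / (r + s + 2). The sketch below uses that form and scales each evaluation by the rater's trustworthiness. The slides only name the idea, so the weighting scheme, the function name, and the input layout are assumptions.

```python
from typing import List, Tuple


def beta_reputation(feedback: List[Tuple[bool, float]]) -> float:
    """Expected credibility of a page from weighted feedback.

    feedback is a list of (is_positive, weight) pairs, where weight stands
    for the rater's trustworthiness (an assumed weighting scheme).
    Returns (r + 1) / (r + s + 2), with r and s the weighted amounts of
    positive and negative feedback.
    """
    r = sum(weight for is_positive, weight in feedback if is_positive)
    s = sum(weight for is_positive, weight in feedback if not is_positive)
    return (r + 1) / (r + s + 2)


no_evidence = beta_reputation([])
mostly_positive = beta_reputation([(True, 1.0), (True, 1.0), (True, 1.0), (False, 1.0)])
```

With no feedback the score is 0.5, and it moves towards 0 or 1 as weighted evidence accumulates, which makes it straightforward to combine with a classifier output on a common scale.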

Conclusion
- Project based on content-based aspects
- Results are promising, although there is room for improvement
  - Accuracy of the prediction
  - Time complexity of the implementation
- Several features remain unimplemented
  - Local extraction of features
  - Integration of new page features
  - Semantic aspect of web pages