Jean-Eudes Ranvier, 17/05/2015, Planet Data - Madrid: Trustworthiness assessment (on web pages), Task 3.3

1 Jean-Eudes Ranvier, 17/05/2015, Planet Data - Madrid: Trustworthiness assessment (on web pages), Task 3.3

2 Credibility assessment on web pages: Introduction
- The number of available data sources keeps increasing at a fast pace: sensors embedded in mobile phones, websites, blogs, ...
- Data becomes more valuable when combined from different sources
- What about the trustworthiness of this aggregated data?
  - Unknown data sources
  - No standard way to evaluate trustworthiness
  - Subjectivity of the data consumer
  - Strong economic incentive to lie
- Interesting case of the WWW: web credibility assessment

3 What is the problem of web credibility?
- Non-credible websites represent a significant percentage of the web
- Credibility is seen as an aggregation of objective and subjective components (Fogg): credibility = trustworthiness AND expertise
- Web users can be naïve or lazy and won't try to verify information
- Focus on domains where expertise is hard for ordinary users to evaluate
  - Medical treatments
  - Trading operations
  - Ideological assertions
- Economic and political interests are at stake

4 Background
Trustworthiness components in the context of web credibility:
- Y. Yamamoto and K. Tanaka, "Enhancing credibility judgment of web search results"
  - Accuracy: referential importance
  - Authority: social reputation
  - Objectivity: content typicality
  - Currency: update frequency
  - Coverage: coverage of topic
- M. J. Metzger, "Making sense of credibility on the web: Models for evaluating online information and recommendations for future research"
  - Credentials
  - Advertisements
  - Design

5 Credibility assessment as a classification problem
- Use historical evaluations to predict future credibility assessments
- A machine learning approach
  - Binary classification: users evaluate pages as credible or non-credible
  - Content-based
  - Features extracted programmatically from web pages
- Training set and test set
  - Leave-one-out cross-validation
  - Tested by category
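As a rough illustration of the leave-one-out evaluation mentioned on this slide, the sketch below holds out each page once, fits on the remaining pages, and checks the prediction. The function names, the page dictionaries, and the majority-vote stand-in classifier are all hypothetical; the slides do not show the actual evaluation code.

```python
from typing import Callable, Sequence

Label = int  # +1 = credible, -1 = non-credible (binary labels, as on the slide)


def leave_one_out_accuracy(
    pages: Sequence[dict],
    labels: Sequence[Label],
    predict: Callable[[list, list, dict], Label],
) -> float:
    """Hold out each page once, train on all the others, score the prediction."""
    if not pages or len(pages) != len(labels):
        raise ValueError("pages and labels must be non-empty and the same length")
    correct = 0
    for i in range(len(pages)):
        # Training data is every page except the held-out one.
        train_pages = [p for k, p in enumerate(pages) if k != i]
        train_labels = [y for k, y in enumerate(labels) if k != i]
        if predict(train_pages, train_labels, pages[i]) == labels[i]:
            correct += 1
    return correct / len(pages)


def majority_label(train_pages: list, train_labels: list, page: dict) -> Label:
    """Toy stand-in classifier (an assumption): predict the most common label."""
    return 1 if sum(train_labels) >= 0 else -1


# Made-up toy data, not the corpus used in the presentation.
toy_pages = [{"id": k} for k in range(5)]
toy_labels = [1, 1, 1, -1, -1]
loo_accuracy = leave_one_out_accuracy(toy_pages, toy_labels, majority_label)
```

To test by category, as the slide describes, the same function could be called once per category on the pages belonging to it.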

6 Feature selection
- Categories: act as a filter; only pages from the same category are tested for similarity
- Keywords and entities in the document: reflect the topic of the web page at a finer grain
- Sentiment analysis: computed at the word level; used in conjunction with keywords and entities
- Part of speech: extra feature reflecting the overall structure of the web page
- Number of ads displayed (in progress): ads distract users from their activity and the page loses credibility
- Complexity of the CSS files (not included yet): pages with no structure tend to lose credibility
- PageRank: Google's metric, which includes a credibility measure

7 Experimental setup
- Two machine learning algorithms
  - kNN item-item algorithm: computes a similarity between pages and takes into account only the most similar pages
  - C4.5 decision tree: performs well in general, but is not suitable for multivalued features (keywords, entities); used as a baseline
- Microsoft corpus
  - 1000 pages evaluated for credibility by experts and regular users
  - Divided into 5 topics
  - Top 40 pages retrieved by search engines for 5 queries
  - Ratings rescaled from the Likert scale [0;5] to the binary scale {-1;1}

8 Content-based rating
- kNN item-item algorithm, based on the similarity between pages rated by the user
- Aggregated similarities, based on the similarity of page features
  - Cosine similarity for monovalued features (POS, PageRank, ...)
  - Jaccard similarity for multivalued features (keywords, entities)
- Only positive similarities are taken into account

$$r_{u,i} = \frac{\sum_{j \in \mathrm{similarItems}} s_{i,j} \, r_{u,j}}{\sum_{j \in \mathrm{similarItems}} s_{i,j}}$$

where $s_{i,j}$ is the similarity between pages $i$ and $j$, and $r_{u,j}$ is the rating given by user $u$ to page $j$.
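The item-item prediction described on this slide can be sketched as follows, assuming the standard similarity-weighted average of the user's ratings over the k most similar pages. The page layout (`numeric`, `keywords`), the equal weighting of the cosine and Jaccard similarities, and the default `k` are assumptions for illustration; the slides do not specify how the per-feature similarities are aggregated.

```python
import math


def cosine(a, b):
    """Cosine similarity between two numeric feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard similarity between two sets (e.g. keywords or entities)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def page_similarity(p, q):
    # Assumption: unweighted mean of the monovalued and multivalued similarities.
    return 0.5 * cosine(p["numeric"], q["numeric"]) + 0.5 * jaccard(p["keywords"], q["keywords"])


def predict_rating(target, rated, k=3):
    """Similarity-weighted average of ratings over the k most similar rated pages.

    `rated` is a list of (page, rating) pairs with rating in {-1, 1}.
    Only strictly positive similarities are kept, as stated on the slide.
    """
    scored = [(page_similarity(target, page), rating) for page, rating in rated]
    neighbours = sorted((s for s in scored if s[0] > 0), key=lambda t: t[0], reverse=True)[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return 0.0  # no similar page: no evidence either way
    return sum(sim * rating for sim, rating in neighbours) / total


# Made-up example pages, not taken from the corpus.
target = {"numeric": [1.0, 0.0], "keywords": {"aspirin", "dose"}}
rated = [
    ({"numeric": [1.0, 0.0], "keywords": {"aspirin", "dose"}}, 1),   # similarity 1.0
    ({"numeric": [1.0, 0.0], "keywords": {"aspirin"}}, -1),          # similarity 0.75
    ({"numeric": [0.0, 1.0], "keywords": {"stocks"}}, -1),           # similarity 0.0, ignored
]
prediction = predict_rating(target, rated)
```

A positive `prediction` would be read as "credible" and a negative one as "non-credible" on the binary {-1;1} scale.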

9 Evaluation: preliminary results

10 Results
- Mixed results: precision ~ 0.7, recall ~ 0.8
- Impossible to predict the credibility accurately
- Biased by the distribution of ratings over classes
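To make the note about class distribution concrete, the sketch below computes precision and recall for the "credible" class and applies them to a trivial classifier that labels every page as credible. The labels are invented for illustration (a hypothetical 70/30 split) and are not the presentation's data.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the `positive` class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical imbalanced labels: 7 credible pages, 3 non-credible pages.
y_true = [1] * 7 + [-1] * 3
# Trivial classifier that calls every page credible.
y_pred = [1] * 10

precision, recall = precision_recall(y_true, y_pred)
```

On skewed labels like these, a classifier that learns nothing still reaches a precision equal to the share of credible pages and a perfect recall, which is why scores should be compared against the class distribution.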

11 Results
- Tests on keywords + entities + sentiment
- Similar results (precision ~ 0.7, recall ~ 0.8)

12 Results
- Mixed results among classes
- Tests on all features (POS + keywords + entities + sentiment)
- Similar results (precision ~ 0.7, recall ~ 0.8)

13 Future work
- Semantic distances
  - Pages seen as sets of concepts
  - Definition of a distance between two sets in the concept space
  - Similarity using a path distance in a concept hierarchy
- Social referrals
  - Use evaluations from other people
  - Weights based on their trustworthiness
  - Estimate page credibility based on beta reputation
- Combine reputation with classification approaches into an aggregated metric
  - To get a better estimate of credibility than either component alone
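One possible reading of the social-referral idea is sketched below, using the expected value of a beta distribution, (r + 1) / (r + s + 2), where r and s are the amounts of positive and negative feedback. Weighting each evaluation by the rater's trustworthiness is an assumption about how the two bullets would combine; the slides describe this only as future work, and the function name is hypothetical.

```python
def beta_reputation(evaluations):
    """Expected credibility of a page under a beta reputation model.

    `evaluations` is a list of (is_credible, rater_weight) pairs, where
    rater_weight in [0, 1] stands for the rater's trustworthiness (assumed).
    """
    r = sum(w for ok, w in evaluations if ok)        # weighted positive feedback
    s = sum(w for ok, w in evaluations if not ok)    # weighted negative feedback
    return (r + 1.0) / (r + s + 2.0)


# Made-up evaluations: two trusted raters say "credible",
# one half-trusted rater says "non-credible".
score = beta_reputation([(True, 1.0), (True, 1.0), (False, 0.5)])

# With no evaluations the estimate stays at the neutral prior of 0.5.
prior = beta_reputation([])
```

A score of this kind could then be combined with the classifier's output to form the aggregated metric mentioned on the slide.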

14 Conclusion
- Project based on content-based aspects
- Results are promising, although there is room for improvement
  - Accuracy of the prediction
  - Time complexity of the implementation
- Several features remain unimplemented
  - Local extraction of features
  - Integration of new page features
  - Semantic aspect of web pages

