Predicting the controversy of a Wikipedia article

Taras Gritsenko

Overview
Introduction
Background work
Theoretical analysis
Experimental design/analysis
Conclusion

Introduction
Problem: Is it possible to observe controversy in Wikipedia articles using existing, publicly available data?
What is a controversial topic? A topic that incites discussion and gives rise to public disagreement, with no consensus.
Many types of topics are controversial:
Politics (e.g. the most recent presidential campaign)
Legislation (e.g. gun control, medical marijuana, abortion, ...)
Science and technology (has science gone too far?)

Introduction
With respect to Wikipedia, what characteristics can we isolate and analyze to determine properties of controversy?
Over 5,275,000 articles (somewhat difficult to quantify, seems small).
It goes without saying that we can't know for certain whether something is objectively controversial, but we can invent some construct and be reasonably certain.
It is not exactly clear what this construct or threshold is.
Controversiality: 0 or 1, true or false.

Background Work
Identify aspects of Wikipedia articles that may indicate controversy.
Apply algorithms and... creative techniques.
Fidel Castro's article on Wikipedia: is it controversial?
Figure 1. The Wikipedia article corresponding to Cuban politician Fidel Castro (H-Index = 6, 299 comments).

Background Work
I spent a good portion of time thinking about reasonable ways to approach this problem.
Problem: Where do I get a lot of metadata relating to Wikipedia articles?
Naïve approach: download a database dump distributed by the Wikimedia Foundation (a 10.3GB file, downloading at 600kb/s).
Ironically, I spent a week downloading this and it didn't even contain talk page metadata.
All the download links for the Wikipedia talk page dump were broken.

Background Work
Use the seemingly useless random article feature for random sampling.
We can potentially visit all of the articles on Wikipedia.
It's easy to scrape statically structured webpages.
But what does all of this have to do with... controversy?
Figure 2. A screenshot of the homepage of Wikipedia, featuring the random article feature.

Background Work
Wikipedia article ratings dataset from July 2011 to July 2012.
Each entry contains a rating value from 1-5 and a key (1-4) identifying the component being rated.

Article rating distribution
Ratings are potentially indicators of controversy.
The distribution of ratings can reveal whether a concept, topic, or anything else is controversial, subject to what is being assessed.
"Like to dislike" ratio.
Theory: as the ratio between positive and negative ratings approaches 1:1, we expect the topic to be more controversial (no "convergence" or skewing in the dataset).
Since our rating data can take any value from 1-5, we can divide scores into "likes" and "dislikes" depending on some threshold.

How do we measure the degree to which data is skewed?
A rating vector in 4-space needs to be converted into a single scalar rating.
Simple: take the average of all 4 components of a rating, yielding some value between 1 and 5.
Given some threshold α, the rating is placed in one bucket or the other.
Figure 3. A "controversial" distribution with an r_a of 0.55.

e.g. α = 3 (median from 1-5)
$$r_a = \frac{\#\text{ of positive ratings}}{\#\text{ of negative ratings}}, \qquad r_a = \frac{1}{r_a} \ \text{ if } r_a > 1$$
Order can be interchanged.
Straightforward approach, minimal effort.
Alternative? Second frequency moment.
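The bucketing and ratio above can be sketched in Go as follows. This is a minimal illustration rather than the program used for the experiments: the names `Rating`, `mean` and `likeDislikeRatio` are hypothetical, and the choice to count a score exactly equal to α as negative is an assumption, since the slides do not say how ties are handled.

```go
package main

import "fmt"

// Rating holds the four component scores (each 1-5) of one article rating.
type Rating [4]float64

// mean collapses the four components into a single score between 1 and 5.
func mean(r Rating) float64 {
	sum := 0.0
	for _, v := range r {
		sum += v
	}
	return sum / float64(len(r))
}

// likeDislikeRatio returns r_a in (0, 1]; values near 1 mean an even split.
// Scores strictly above alpha count as positive, all others as negative
// (the tie rule is an assumption of this sketch).
// ok is false when either bucket is empty and no ratio can be formed.
func likeDislikeRatio(ratings []Rating, alpha float64) (ra float64, ok bool) {
	pos, neg := 0, 0
	for _, r := range ratings {
		if mean(r) > alpha {
			pos++
		} else {
			neg++
		}
	}
	if pos == 0 || neg == 0 {
		return 0, false
	}
	ra = float64(pos) / float64(neg)
	if ra > 1 {
		ra = 1 / ra // fold the ratio so it never exceeds 1
	}
	return ra, true
}

func main() {
	// Made-up ratings: four positive and two negative at alpha = 3.
	ratings := []Rating{
		{5, 5, 4, 4}, {4, 4, 4, 4}, {5, 4, 5, 4}, {4, 5, 5, 5},
		{1, 2, 1, 2}, {2, 2, 2, 2},
	}
	ra, ok := likeDislikeRatio(ratings, 3)
	fmt.Println(ra, ok) // prints 0.5 true
}
```

Folding the ratio with 1/r_a means it does not matter which bucket is treated as the numerator, which is what "order can be interchanged" refers to.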

Second frequency moment
Article rating distribution as a function of frequency moments.
Divide the continuous range of all possible ratings into n segments (classes) of width ε, into which scores fall.
Ideally,
$$\varepsilon = \frac{\text{highest rating} - \text{lowest rating}}{n \text{ segments}}$$
Or, within the context of scores from 1-5, ε = (5-1)/4 = 1.

Second frequency moment
$$\text{second moment} = \sum_{i=1}^{n} n_i^2$$
where n_i is the number of elements in the i'th class, or bin.
Our frequency, or score, is divided by the square of the total number of scores computed, giving a score bounded between 0 and 1 (i.e. normalizing it).
Reason: in practice, articles have different amounts of rating data.
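A minimal Go sketch of the normalized second frequency moment, under stated assumptions: the function name `secondMoment` and its parameters are hypothetical, scores are assumed to lie in the range [lo, hi], and the highest score is placed in the last class (the slides do not say how that edge case is handled).

```go
package main

import "fmt"

// secondMoment bins scores from the range [lo, hi] into n equal-width classes
// and returns the sum of squared class counts divided by the squared number
// of scores, so the result lies between 0 and 1.
func secondMoment(scores []float64, lo, hi float64, n int) float64 {
	if len(scores) == 0 || n <= 0 || hi <= lo {
		return 0
	}
	eps := (hi - lo) / float64(n) // class width
	counts := make([]int, n)
	for _, s := range scores {
		i := int((s - lo) / eps)
		if i < 0 {
			i = 0
		}
		if i >= n {
			i = n - 1 // assumption: the top score falls into the last class
		}
		counts[i]++
	}
	sum := 0
	for _, c := range counts {
		sum += c * c
	}
	total := len(scores)
	return float64(sum) / float64(total*total)
}

func main() {
	spread := []float64{1, 2, 3, 4} // one score per class
	agreed := []float64{4, 4, 4, 4} // every score in the same class
	fmt.Println(secondMoment(spread, 1, 5, 4)) // prints 0.25
	fmt.Println(secondMoment(agreed, 1, 5, 4)) // prints 1
}
```

With this normalization, scores spread evenly over the n classes give the minimum value 1/n, and scores concentrated in a single class give 1.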

Discussion depth
The depth of a discussion can be an indicator of controversy.
Possible measures: maximum depth, number of comments on a talk page, ...
The more controversial an article, the higher the 'average' depth of a reply chain (discussion) between users.
Use the notion of an H-Index to estimate how deep the comments in a discussion typically go.

H-Index
Also known as the Hirsch index or Hirsch number.
Used in measuring the productivity and citation impact of a scholar.
A scholar with an index of h has published h papers, each of which has been cited at least h times.
The H-Index of a discussion (comment-reply) tree is
$$\text{H-Index} = \max_{i} \min\bigl(f(i),\, i\bigr)$$
where f(i) corresponds to the number of replies at depth i.
Figure 4. A discussion tree with a max-depth of 7 and H-Index of 3.
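The formula translates directly into a short Go function. This is a sketch only: `hIndex` is a hypothetical name, depths are assumed to start at 1, and the reply counts in `main` are invented for illustration rather than taken from any article.

```go
package main

import "fmt"

// hIndex returns the largest value of min(f(i), i) over all depths i,
// where f[i-1] holds the number of replies at depth i (depths start at 1).
func hIndex(f []int) int {
	h := 0
	for idx, replies := range f {
		depth := idx + 1
		m := replies
		if depth < m {
			m = depth
		}
		if m > h {
			h = m
		}
	}
	return h
}

func main() {
	// Hypothetical reply counts for depths 1 through 7.
	f := []int{5, 4, 3, 2, 1, 1, 1}
	fmt.Println(hIndex(f)) // prints 3
}
```

In this example the discussion reaches depth 7, but only up to depth 3 are there at least as many replies as the depth itself, so the H-Index is 3.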

Experimental design: Article distribution
2.85GB of .tsv (tab-separated) data, roughly 1,200,000 article ratings, parsed in a Go program I wrote.
Columns: timestamp, page_id, page_title, page_namespace, rev_id, user_id, rating_key, rating_value
page_title of the sample rows: RC_Timişoara, RC_Timişoara, RC_Timişoara, RC_Timişoara

Top 50 handpicked results:
Article                                              Second Moment
Germanic Wars                                        0.04
Khader_Adnan                                         0.05
Felix_Z._Longoria,_Jr.                               0.09
Timeline_of_the_2011–2012_Egyptian_revolution ..     0.179
Deadgirl_(2008_film)                                 0.18
Abdul_Hakeem,_Pakistan                               0.180
Non-lethal_weapon                                    0.185
Gaza                                                 0.186
Yuri_Sidorenko                                       0.190
Kidnapping_of_children_by_Nazi_Germany               0.210

In general there was a relationship between controversy and article rating distribution, but it wasn't nearly as obvious as with the H-Index approach.
There were very few articles with a frequency < 0.
I filtered out articles with few or no ratings (< 10 ratings), since they were unreliable.
The average number of ratings was 12.
Conclusion: Article rating data simply isn't particularly reliable, since users rating an article are often critical of the article itself rather than simply reflecting their personal feelings toward the topic.

Experimental design: H-Index of articles
How do we traverse a webpage, specifically a Wiki discussion?
Implement an algorithm to traverse the DOM tree.
Start of a comment: <dl> or <dd>; end of a comment: </dl> or </dd>.

Simple Tree-traversal Algorithm:
1. Keep a map (integer -> integer) of all depths, with key i corresponding to the number of comments at depth i.
2. Given the set of all tokens on the webpage, traverse each token.
3. If the token is an opening tag to a comment (<dd> or <dl>, choose one), increment depths[currentDepth] by 1 and currentDepth by 1.
4. If the token is a closing tag to a comment, decrement currentDepth by 1.
5. When no more comments are to be parsed, the result is a map containing all of the levels of depth and the number of comments at each depth.
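The steps above can be sketched in Go as follows. This is an illustration under stated assumptions, not the original crawler: it treats `<dd>` as the comment tag, finds tags with a regular expression instead of a full HTML tokenizer, and uses the hypothetical names `commentTag` and `countDepths`. The guard against unmatched closing tags is an addition of this sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

// commentTag matches opening and closing <dd> tags.
// Capture group 1 is "/" for a closing tag and empty for an opening tag.
var commentTag = regexp.MustCompile(`(?i)<(/?)dd\b[^>]*>`)

// countDepths walks the <dd> tags of a page in document order and returns
// a map from depth (starting at 0) to the number of comments opened there.
func countDepths(page string) map[int]int {
	depths := make(map[int]int)
	currentDepth := 0
	for _, m := range commentTag.FindAllStringSubmatch(page, -1) {
		if m[1] == "" {
			// Opening tag: record a comment at this depth, then descend.
			depths[currentDepth]++
			currentDepth++
		} else if currentDepth > 0 {
			// Closing tag: climb back up (ignoring unmatched closers).
			currentDepth--
		}
	}
	return depths
}

func main() {
	// A made-up discussion: two top-level comments, the first with two replies.
	page := `<dl><dd>a<dl><dd>b</dd><dd>c</dd></dl></dd><dd>d</dd></dl>`
	fmt.Println(countDepths(page))
}
```

Because only one tag kind is counted, nested `<dl>` wrappers do not inflate the depth, which is why the algorithm says to choose either `<dd>` or `<dl>` rather than both.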

Result: 1,000,000 articles traversed in hours (80-90 threads).
Only 5% of the traversed articles had talk pages (extrapolating, only about 260,000 articles in all of English Wikipedia have talk pages).
Most articles have an H-Index of 2.

Top 15 results: Article H-Index Max-Depth Muawiyah I/Archive 1 21 29 Nagorno-Karabakh/Archive 5 16 19 Time Cube/Archive 1 Jehovah's Witnesses/Archive 25 24 John Vincent Atanasoff/Archive 13 23 Freemasonry/Archive 13 Ebionites/Archive 3 15 22 Soviet invasion of Manchuria/Archive 3 Political correctness/Archive 17 27 Societal attitudes toward homosexuality /Archive 2 14 Two envelopes problem/Archive 1 13 Gaza War/Archive 47 Political positions of John McCain/Archive 2 List of states with limited recognition/Archive 8 17 Race and intelligence/Archive 74 12

Sample output sorted by number of comments:
21 29 Muawiyah I/Archive 13 19 Neuro-linguistic programming/Archive 9 11 Chiropractic/Archive 10 16 Chiropractic/Archive 8 15 Monty Hall problem/Arguments/Archive 8 12 Comparison of the health care systems in Canada and the United States/Archive 10 15 Pseudoscience/Archive 14 25 Moment of inertia 974 11 14 Transcendental Meditation/Archive 13 20 Commonwealth realm/Archive 10 17 Historicity of Jesus/Archive 9 15 Stephen Barrett/Archive 4 887 13 Intelligent design/Archive Nagorno-Karabakh/Archive Monty Hall problem/Archive 15 Acupuncture/Archive

Most of the articles with controversial discussions are archives.
Theory: Many Wikipedia articles that generate controversy, particularly in their talk pages, were more controversial in the past, since they get reverted and edited quickly (discussions don't stay relevant indefinitely).
Looking at the number of comments produces similar results to looking at the H-Index (the more popular an article is, the more controversial it may be).
Generally, fewer than 10 of the top 100 articles on the list were not controversial.

Conclusion
Generally speaking, there is no perfect method for predicting controversy.
Even the methods you'd think would be 100 percent accurate aren't necessarily so.
Some methods produce more interesting results than others.
I avoided relative controversy (the theory is in my paper), but in the future finding an implementation for that would be nice.
Future work: relative controversy, improving upon statistical analysis.
