Prediction of Retweet Cascade Size over Time


Andrey Kupavskii, Liudmila Ostroumova, Alexey Umnov, Svyatoslav Usachev, Pavel Serdyukov, Gleb Gusev, Andrey Kustarev {kupavskiy, ostroumova-la, umnov, kaathewise, pavser, gleb57,

Motivation: sociology, breaking news detection, viral marketing, freshness of the search engine layout.

Viral marketing: You spread an advertisement and want to get 1000 retweets within a day. You choose the set of initial users and then try to predict whether you will get 1000 retweets or not. If you wait for some time and use the information about the initial spread of the cascade, you can make the prediction more accurate.

Prediction: We predict the number of retweets the tweet will gain during the time T since the initial tweet. There are two variants of the prediction task. In the first one, we use only the information available at the moment of the initial tweet. In the second one, we also use the information about the spread of the cascade up to the moment T0.

Algorithm: We train gradient boosted decision tree models. One of them approximates the natural logarithm of the size of the cascade at the moment T, minimizing the mean squared error. Two others perform binary classification that picks out large epidemics: tweets that gained at least 4000 retweets, and tweets that gained between 1600 and 3999 retweets.

Features: Social and time-sensitive features of the initial node, content features, and features of the nodes infected up to the moment T0. PageRank in the retweet graph can be used as a measure of user influence.

New features:

PageRank in the retweet graph: The vertices of the retweet graph are users; there is an edge (A,B) with weight w if user B retweeted user A w times. We calculate PageRank for both the weighted and the unweighted graph.

The flow of the cascade: For each edge from a participating user to one of their followers we define the activity of the follower and of the edge, which depends on time. Informally, the flow of the initial part of a cascade is the sum of activities over all edges between participating users and their followers.

Other features: Average local and global retweet ratios of the initial user up to the moment T; the number of retweets at the moment T0; the sum of average retweet ratios, the PageRanks, and the total number of followers of the infected users at the moment T0.

Experimental results:

F1-score for the binary classification of the two groups of tweets that gained the largest number of retweets, using different sets of features.

Tweet class    Baseline   All     No flow   No PR
[1600,3999]    0.659      0.775   0.76      0.761
≥4000          0.436      0.67    0.657     0.632

Mean square error of the logarithm of the predicted cascade size at the moment T. If the error is equal to x, then, roughly speaking, the actual and predicted numbers of retweets differ on average by a factor of e^x.

Setting          Baseline   + New features
T0=0, T=15m      0.981      0.957
T0=0, T=1w       1.243      1.226

For T0 > 0 the transcript lists a single value per setting:

T0=15s, T=15m    0.796
T0=15s, T=1w     1.050
T0=30s, T=15m    0.588
T0=30s, T=1w     0.838

Takeaway: wait for 30 seconds to make the prediction much more precise.

Conclusions: The predictions have high precision. Using the initial spread of the tweet increases the quality of the prediction significantly. New features such as PageRank in the retweet graph and the flow of the cascade are important for the prediction.

Future work: analysis of other measures of tweet popularity; study of the cascade growth in more detail; comparison of different measures of user influence; modeling the tweet spread from the epidemiological point of view.

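The rule of thumb from the results caption (an error of x corresponds to a typical multiplicative gap of about e^x between actual and predicted cascade sizes) can be checked with a few lines of arithmetic. This is only an illustrative sketch: the helper names `log_mse` and `typical_factor` are hypothetical, and it assumes all cascade sizes are positive, since the poster does not say how tweets with zero retweets are handled.

```python
import math


def log_mse(actual_sizes, predicted_sizes):
    """Mean squared error between the natural logs of actual and predicted sizes."""
    pairs = list(zip(actual_sizes, predicted_sizes))
    return sum((math.log(a) - math.log(p)) ** 2 for a, p in pairs) / len(pairs)


def typical_factor(error):
    """Multiplicative gap e^x that the poster associates with a log error of x."""
    return math.exp(error)


# Errors reported on the poster for T=15m with the new features.
factor_at_t0_0 = typical_factor(0.957)    # T0=0
factor_at_t0_30s = typical_factor(0.588)  # T0=30s
```

Under this reading, the reported errors for T=15m correspond to a typical gap of roughly 2.6 times without waiting and roughly 1.8 times after waiting 30 seconds.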
