CS 765 – Fall 2014 Paulo Alexandre Regis Reddit analysis.


1 CS 765 – Fall 2014 Paulo Alexandre Regis Reddit analysis

2 Outline Review; Reddit API; Data collection / cleaning; Network creation; Tools; Conclusion; Q&A

3 What is reddit? Reddit is an open-source platform that supports interaction among communities. It has been used as a news hub, a Q&A platform, and a channel for propagating internet hoaxes and memes. Its main features include voting, posting, and commenting. It has a public API that allows data crawling. It has not been studied in depth.

4 The API Limit of 30 requests per minute, with at most 100 results per request: 30 × 100 = 3000 results per minute. PRAW is an API wrapper that takes care of the API limits. PRAW can flatten the comment tree (contrary to what the report describes).
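The rate-limit arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the two constants are the figures quoted on the slide, and the function names are made up for illustration (they are not part of PRAW or the Reddit API).

```python
# Figures quoted on the slide; function names are hypothetical.
REQUESTS_PER_MINUTE = 30   # API request limit
RESULTS_PER_REQUEST = 100  # maximum results returned by one request


def max_results_per_minute(requests_per_minute=REQUESTS_PER_MINUTE,
                           results_per_request=RESULTS_PER_REQUEST):
    """Upper bound on results per minute if every request is full."""
    return requests_per_minute * results_per_request


def min_seconds_between_requests(requests_per_minute=REQUESTS_PER_MINUTE):
    """Spacing between requests that keeps a crawler within the limit."""
    return 60.0 / requests_per_minute


print(max_results_per_minute())        # 3000
print(min_seconds_between_requests())  # 2.0
```

A hand-written crawler would need to sleep about 2 seconds between requests; according to the slide, PRAW handles this spacing itself.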

5 Comment tree
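To illustrate what "flattening" a comment tree means, here is a minimal standalone sketch. It is not PRAW's implementation: the nested-dictionary layout with `id`, `author`, and `replies` keys is an assumption made only for this example.

```python
def flatten_comment_tree(top_level_comments):
    """Return every comment in the tree as one flat list.

    Comments are visited depth-first, so each reply appears after
    its parent. Each comment is a dict with an optional "replies" list
    (a hypothetical layout, not the format returned by the API).
    """
    flat = []
    stack = list(reversed(top_level_comments))
    while stack:
        comment = stack.pop()
        flat.append(comment)
        # Push replies in reverse so the first reply is visited first.
        stack.extend(reversed(comment.get("replies", [])))
    return flat


tree = [
    {"id": "c1", "author": "alice", "replies": [
        {"id": "c2", "author": "bob", "replies": [
            {"id": "c3", "author": "alice", "replies": []},
        ]},
        {"id": "c4", "author": "carol", "replies": []},
    ]},
    {"id": "c5", "author": "dave", "replies": []},
]

print([c["id"] for c in flatten_comment_tree(tree)])
# ['c1', 'c2', 'c3', 'c4', 'c5']
```

An explicit stack is used instead of recursion so that very deep reply chains do not hit Python's recursion limit.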

6 Data collection
Item                 Count          Size
Total subreddits     443 685        17 MB
Filtered subreddits  18 058
Posts                6 514 338      721 MB
Comments             54 224 887*    > 2 GB
Users                unknown
* Estimated, in progress

7 Reddit stats

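8 Data cleaning Keep only subreddits with at least 300 subscribers. The data set is not a snapshot: Reddit doesn’t stop while the crawl runs! Repeated results have to be removed. User names go through an anonymizer before the data is made available.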
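The cleaning steps listed here could look roughly like the sketch below. Everything beyond the 300-subscriber threshold is an assumption: the record fields (`name`, `subscribers`, `id`), the function names, and the choice of a salted SHA-256 hash as the anonymizer are illustrative, since the slides do not say how these steps were implemented.

```python
import hashlib


def filter_subreddits(subreddits, min_subscribers=300):
    """Keep subreddits with at least `min_subscribers` subscribers."""
    return [s for s in subreddits if s["subscribers"] >= min_subscribers]


def drop_repeated(records, key="id"):
    """Remove repeated results, keeping the first record seen per key."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


def anonymize_user(username, salt):
    """Replace a user name with a stable pseudonym (assumed scheme)."""
    digest = hashlib.sha256((salt + username).encode("utf-8")).hexdigest()
    return digest[:16]


subreddits = [
    {"name": "big", "subscribers": 5000},
    {"name": "edge", "subscribers": 300},
    {"name": "tiny", "subscribers": 12},
]
posts = [{"id": "p1"}, {"id": "p2"}, {"id": "p1"}]

print([s["name"] for s in filter_subreddits(subreddits)])  # ['big', 'edge']
print([p["id"] for p in drop_repeated(posts)])             # ['p1', 'p2']
```

Because the pseudonym depends only on the salt and the user name, the same user maps to the same node every time, which the network construction relies on.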

9 Data cleaning

11 Network creation Nodes are users. An edge is created between two users when they comment on the same post. Examine how the network changes when a threshold is applied.
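A minimal sketch of this construction follows. The slide does not say what the threshold applies to; this example assumes it is the number of posts two users have both commented on. The function name and the `(post_id, user)` input layout are also hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_co_comment_network(comments, threshold=1):
    """Build a weighted user-user network from (post_id, user) pairs.

    Two users are linked when they commented on the same post. The edge
    weight counts the posts they share; edges below `threshold` are
    dropped (assumed meaning of the threshold).
    """
    users_by_post = {}
    for post_id, user in comments:
        users_by_post.setdefault(post_id, set()).add(user)

    weights = Counter()
    for users in users_by_post.values():
        # Sorting gives each undirected edge one canonical (u, v) form.
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1

    return {edge: w for edge, w in weights.items() if w >= threshold}


comments = [
    ("p1", "a"), ("p1", "b"), ("p1", "c"),
    ("p2", "a"), ("p2", "b"), ("p2", "a"),  # repeated commenter counted once
]

print(build_co_comment_network(comments))
# {('a', 'b'): 2, ('a', 'c'): 1, ('b', 'c'): 1}
print(build_co_comment_network(comments, threshold=2))
# {('a', 'b'): 2}
```

Note that a post with n distinct commenters contributes n(n-1)/2 edges, so very popular posts dominate the edge count; a threshold is one way to keep the network manageable.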

12 Tools PRAW (did I mention this library is important?). Graph visualizers (Pajek, Gephi, igraph).

13 Analysis proposal Calculate the node degree (number of links) in different scenarios. Compare the calculated degree with each user’s “karma”. Compare the network with previous studies of other social networks. Is it power-law? Small-world?
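As a sketch of the first step of the proposal, the code below computes node degrees from an edge list and the resulting degree distribution, which is the quantity one would inspect (for example on a log-log plot) when asking whether the network looks power-law. The function names and the toy edge list are assumptions for illustration only.

```python
from collections import Counter


def node_degrees(edges):
    """Degree (number of links) of every node that appears in `edges`."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return dict(degrees)


def degree_distribution(degrees):
    """Fraction of nodes having each degree value, ordered by degree."""
    counts = Counter(degrees.values())
    total = len(degrees)
    return {k: counts[k] / total for k in sorted(counts)}


edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
degrees = node_degrees(edges)

print(degrees)                       # {'a': 3, 'b': 2, 'c': 2, 'd': 1}
print(degree_distribution(degrees))  # {1: 0.25, 2: 0.5, 3: 0.25}
```

Users with no edges never appear in the edge list, so they are not counted here; they would have to be added with degree 0 before comparing degree with karma across all users.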

14 Conclusion Time is the main constraint. Expected crawling time? 2 more weeks just for comments. Plan B: analyze the data already collected.
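A rough best-case estimate of the crawling time can be derived from numbers given earlier in the deck: the estimated comment count from slide 6 and the rate limit from slide 4. The function name is hypothetical, and the calculation assumes every request returns the full 100 results, so it should be read as a lower bound rather than a prediction.

```python
def estimated_crawl_days(total_items, requests_per_minute=30,
                         results_per_request=100):
    """Best-case crawl time in days at the API rate limit."""
    minutes = total_items / (requests_per_minute * results_per_request)
    return minutes / (60 * 24)


COMMENTS_ESTIMATE = 54_224_887  # estimated comment count from slide 6

print(round(estimated_crawl_days(COMMENTS_ESTIMATE), 1))  # 12.6
```

Even under ideal conditions the comments alone take close to two weeks, which is consistent with the time concern raised on this slide.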

15 Questions?

