Presentation on theme: "Cong Ding, Yang Chen*, and Xiaoming Fu University of Göttingen"— Presentation transcript:
1 Crowd Crawling: Towards Collaborative Data Collection for Large-scale Online Social Networks. Cong Ding, Yang Chen*, and Xiaoming Fu. University of Göttingen; *Duke University.
2 Significance of social network data crawling: understanding user behaviors; improving SNS architectures; handling privacy/security issues; and so on.
3 Current data collection methods (1): ISP-based measurement [Schneider IMC’09]. Only ISPs can do this.
4 Current data collection methods (2): Cooperate with SNS companies [Yang IMC’11]. Most research groups do not have this opportunity.
5 Current data collection methods (3): Crawl data as a single group (and share it with others) [Gjoka INFOCOM’10]. Suffers from request rate limiting.
6 Shortcomings of crawling by a single group: wastes computing and network resources; introduces overhead for service providers (and may lead to stricter rate limiting); leaves the research community without ground truth.
7 A new thought: Why not collect data collaboratively?
9 System design: fetching UIDs (BFS, etc.); handling crawling failure (timeout); bypassing request rate limiting (massive IP addresses); data fidelity (redundant crawling).
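The design points on this slide could be combined roughly as in the sketch below. It is a minimal illustration under stated assumptions, not the authors' implementation: the `Coordinator` class, its method names, the timeout value, and the `pick_majority` helper are all hypothetical. The coordinator hands out UIDs in BFS order, reassigns a UID if its crawler does not report back before the timeout, and a simple majority vote over redundant copies stands in for the data fidelity check. Spreading requests over many IP addresses is assumed to happen by running many independent crawlers against this coordinator.

```python
import time
from collections import Counter, deque


class Coordinator:
    """Hypothetical task coordinator that hands out UIDs in BFS order."""

    def __init__(self, seed_uids, timeout_s=60.0):
        self.frontier = deque(seed_uids)  # UIDs waiting to be crawled (BFS queue)
        self.seen = set(seed_uids)        # every UID ever enqueued
        self.in_flight = {}               # uid -> time it was assigned
        self.done = set()                 # UIDs whose data has been reported
        self.timeout_s = timeout_s        # assumed value, tune per deployment

    def next_task(self, now=None):
        """Return the next UID to crawl, or None if nothing is pending."""
        now = time.monotonic() if now is None else now
        # Crawling failure: requeue tasks whose crawler timed out.
        for uid, assigned_at in list(self.in_flight.items()):
            if now - assigned_at > self.timeout_s:
                del self.in_flight[uid]
                self.frontier.append(uid)
        while self.frontier:
            uid = self.frontier.popleft()
            if uid in self.done:  # a late report already covered this UID
                continue
            self.in_flight[uid] = now
            return uid
        return None

    def report(self, uid, neighbor_uids):
        """Record a finished crawl and enqueue newly discovered UIDs."""
        self.in_flight.pop(uid, None)
        self.done.add(uid)
        for n in neighbor_uids:
            if n not in self.seen:
                self.seen.add(n)
                self.frontier.append(n)


def pick_majority(copies):
    """Data fidelity: accept a value only if a strict majority of the
    redundant crawls agree on it; otherwise return None."""
    value, count = Counter(copies).most_common(1)[0]
    return value if count * 2 > len(copies) else None
```

In this sketch a crawler would loop over `next_task()`, fetch the user's data, and call `report()` with the UIDs found in that user's social connections.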
10 Implementation: a proof-of-concept prototype (without the data fidelity part) that crawls Weibo; 472 PlanetLab servers act as crawlers.
11 Evaluation: In 24 hours, we crawled 2.22M users’ data from Weibo, including user profiles, all posts, and all social connections. Comparison: Fu et al. (PLOS ONE 2013) collected 30K users’ data in 6 days; Guo et al. (PAM 2013) collected 1M users’ data in 1 month.

             Crowd Crawling   Fu et al.   Guo et al.
#UIDs/day    2.22M            5K          33K
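The per-day figures in the comparison follow from simple division, which the short sketch below reproduces. The function name `uids_per_day` is illustrative, and "1 month" is assumed to mean 30 days, which is consistent with the 33K figure on the slide. The per-server figure additionally assumes that all 472 PlanetLab servers from the implementation slide contributed equally, which the slides do not state.

```python
def uids_per_day(users, days):
    """Average number of users crawled per day."""
    return users / days


rates = {
    "Crowd Crawling": uids_per_day(2_220_000, 1),  # 2.22M users in 24 hours
    "Fu et al.": uids_per_day(30_000, 6),          # 30K users in 6 days
    "Guo et al.": uids_per_day(1_000_000, 30),     # 1M users in 1 month (assumed 30 days)
}

# Speedup of Crowd Crawling relative to each single-group crawl.
speedup = {
    name: rates["Crowd Crawling"] / rate
    for name, rate in rates.items()
    if name != "Crowd Crawling"
}

# Average load per crawler, assuming all 472 servers contributed equally.
per_server = rates["Crowd Crawling"] / 472

for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} UIDs/day")
print(f"Per server: {per_server:,.0f} UIDs/day")
```

Under these assumptions the collaborative crawl is roughly 444 times faster than Fu et al. and about 67 times faster than Guo et al., while each individual server fetches under 5K users per day.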
14 Conclusion and Discussion: Data sharing may violate some providers’ terms of service: Twitter does not allow data sharing (even for research), whereas Weibo allows data sharing among researchers. Unlimited data sharing might raise ethical issues, so the data should be anonymized. We will publish the data crawled in the evaluation.
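The slides do not say how the data would be anonymized. One common first step, shown here purely as an assumption, is to replace each UID with a keyed hash before release, so that identifiers stay consistent across records without exposing the original IDs. The function names, the 16-character truncation, and the edge-list format are hypothetical. Pseudonymizing identifiers alone does not guarantee anonymity, since profile text and graph structure can still identify users.

```python
import hashlib
import hmac


def pseudonymize(uid, secret_key):
    """Map a UID to a stable pseudonym using HMAC-SHA256.

    secret_key (bytes) must stay private; without it the mapping
    cannot be recomputed from the published data.
    """
    digest = hmac.new(secret_key, uid.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability (assumption)


def pseudonymize_edges(edges, secret_key):
    """Apply the same mapping to both endpoints of each follow edge."""
    return [
        (pseudonymize(src, secret_key), pseudonymize(dst, secret_key))
        for src, dst in edges
    ]


key = b"example-key-do-not-reuse"       # illustrative only
edges = [("1001", "1002"), ("1002", "1003")]
anon_edges = pseudonymize_edges(edges, key)
```

Because the mapping is deterministic for a given key, the social graph keeps its shape: the destination of the first edge and the source of the second still refer to the same pseudonym.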