SNFS: The design and implementation of a Social Network File System Ch. Kaidos, A. Pasiopoulos N. Ntarmos, P. Triantafillou University of Patras
Shameless plug... If interested, please check out eXO: Decentralized Autonomous Scalable Social Networking, 5th Conference on Innovative Data Systems Research (CIDR 2011), 2011.
Social Networks Our Take: 1. Search for people (friends, experts, …) and content (books, photos, videos, blogs, websites, …). 2. Form entities (collections): friends-lists, content-libs. 3. Search for entities, using previously formed collections… 4. SNFS currently provides the foundation for these…
Tagging (Tag 1, Tag 2, Tag 3, Tag 4, Tag 5). Profiles: sets of tags describing entities. Search: based on profiles. Ranked retrieval (top-k).
Current State: 5,000,000,000 photos; 3,000 photos/min (as of September 2010); 2,000,000,000 videos served up each day (May 2010); 600,000,000 monthly active users (January 2011); 15,000,000 books (October 2010), 130,000,000 by the end of the decade.
Current State: Need to access published content. 22,750,000,000 queries in search engines; 4,000,000,000 queries in YouTube; 351,000,000 queries in Facebook; 416,000,000 queries in MySpace (U.S. market figures, December 2009).
Current State: How do I find stuff I want? How do I provide interesting objects to my users?
Proposal: A content-aware file system for Social Network Systems. Useful to users. And service providers too!
Previous Work on File Indexing. 1991 – Semantic File Systems by Gifford. 1996 – BeFS by Giampaolo and Meurillon, part of BeOS; BeOS never had commercial success. – Indexing Service on Windows NT, not needed at the time; a remnant of the Object File System from the never-released Cairo project. Typically no ranked retrieval. No user input (tags). No user relationships.
Desktop Searches. 2004 – Windows Desktop Search, widely popular. – Mac OS X's Spotlight, Google Desktop, Beagle, Strigi, Tracker... Typically no ranked retrieval? No user relationships: relationships are not exploited for searching.
Problems: Power tools for power users... But for average users... Boolean operators??? SQL-like queries???
Previous Work on Ranked Retrieval 1968 – SMART system by Salton, introduced weights in retrieval, instead of classical Boolean retrieval 1975 – Vectors and cosine similarity by Salton 1988 – Other functions for similarity tested and evaluated by Salton and Buckley 2003 – Fagin proposes and compares several efficient algorithms for top-k retrieval
Design – SNFS. Tags are extracted from each object and stemmed, and their frequencies are counted. Weights for each tag and document are calculated. Each object is associated with a unique ID in a tree. A tf-idf weighting scheme was chosen.
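A minimal sketch of the weighting step on this slide, assuming the common tf × log(N/df) variant (the slides do not say which tf-idf formula SNFS uses) and a toy suffix stripper standing in for a real stemmer. The names `toy_stem` and `tfidf_weights` are hypothetical and not taken from SNFS.

```python
import math
from collections import Counter

def toy_stem(tag):
    """Crude stand-in for a real stemmer (assumption): lowercase, strip a few suffixes."""
    tag = tag.lower()
    for suffix in ("ing", "es", "s"):
        if tag.endswith(suffix) and len(tag) > len(suffix) + 2:
            return tag[: -len(suffix)]
    return tag

def tfidf_weights(objects):
    """objects: {object_id: [tag, ...]} -> {object_id: {stemmed_tag: weight}}.

    weight = tf * log(N / df), where tf is the tag count within the object,
    N the number of objects and df the number of objects carrying the tag.
    """
    counts = {oid: Counter(toy_stem(t) for t in tags) for oid, tags in objects.items()}
    n = len(objects)
    df = Counter()
    for tag_counts in counts.values():
        df.update(tag_counts.keys())  # each object counts once per distinct tag
    return {
        oid: {t: tf * math.log(n / df[t]) for t, tf in tag_counts.items()}
        for oid, tag_counts in counts.items()
    }

# Hypothetical toy data: three objects with user-supplied tags.
objects = {1: ["photos", "beach", "beach"], 2: ["photo", "city"], 3: ["books"]}
weights = tfidf_weights(objects)
```

In this toy data "beach" occurs twice in object 1 and nowhere else, so it receives the highest weight there, while "photo" is shared by two objects and is weighted lower.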
Design – SNFS. Term weight and object ID are stored in an inverted index. Each posting list of the index is a B+Tree stored in secondary memory. The position of the root of each B+Tree in the index is stored in a red-black tree.
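The index layout can be sketched in memory as follows. This is only an illustration under stated assumptions: a Python `dict` stands in for the red-black tree that locates each posting list, and a list sorted by decreasing weight stands in for each on-disk B+Tree. `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict

def build_inverted_index(weights):
    """weights: {object_id: {term: weight}} -> {term: [(weight, object_id), ...]}.

    Each posting list is sorted by decreasing weight (ties broken by object ID),
    which is the order the threshold algorithms expect as input.
    """
    index = defaultdict(list)
    for oid, term_weights in weights.items():
        for term, w in term_weights.items():
            index[term].append((w, oid))
    for postings in index.values():
        postings.sort(key=lambda p: (-p[0], p[1]))
    return dict(index)

# Hypothetical weights for three objects.
weights = {
    1: {"beach": 2.2, "photo": 0.4},
    2: {"photo": 0.4, "city": 1.1},
    3: {"photo": 0.9},
}
index = build_inverted_index(weights)
```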
Design – Search and retrieval. The query is split into terms and stemmed. The score of each document is calculated using a threshold algorithm and a tf-idf function.
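For orientation, the sketch below shows query preprocessing plus an exhaustive baseline that sums the weights of every matching object. It is not the SNFS method; the threshold algorithms aim to return the same top-k while reading less of each posting list. The function names are hypothetical, lowercasing stands in for stemming, and scoring by a plain sum of term weights is an assumption.

```python
import heapq
from collections import defaultdict

def split_query(query, stem=str.lower):
    """Split on whitespace and apply a stemmer (here just lowercasing, an assumption)."""
    return [stem(term) for term in query.split()]

def exhaustive_top_k(index, terms, k):
    """Score every object by summing its weights over the query terms.

    index: {term: [(weight, object_id), ...]}; returns [(object_id, score), ...]
    in decreasing score order, ties broken by smaller object ID.
    """
    scores = defaultdict(float)
    for term in terms:
        for weight, oid in index.get(term, []):
            scores[oid] += weight
    return heapq.nlargest(k, scores.items(), key=lambda item: (item[1], -item[0]))

# Hypothetical index with two terms and integer object IDs.
index = {
    "beach": [(2.0, 1), (0.5, 2)],
    "photo": [(1.0, 2), (0.5, 1), (0.25, 3)],
}
terms = split_query("Beach Photo")
result = exhaustive_top_k(index, terms, 2)
```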
Threshold Algorithms: NRA (No Random Access) Algorithm. Input: posting lists sorted on weight (decreasing). [Figure: posting lists for terms t1, t2, t3 read in parallel at depths 1, 2, 3; a Doc ID / Score table accumulates partial scores for d1 to d5; the threshold at each depth is s1+s2+s3, s4+s5+s6, s7+s8+s9.] The algorithm halts when no object below the top-k can improve its score enough to exceed the threshold.
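A minimal in-memory sketch of NRA, assuming scores are plain sums of per-term weights and that a missing posting contributes 0. It follows the usual textbook formulation (lower and upper score bounds per object) rather than the SNFS code, and `nra_top_k` is a hypothetical name.

```python
def nra_top_k(lists, k):
    """lists: one posting list per query term, each [(weight, object_id), ...]
    sorted by decreasing weight. Only sequential (sorted) access is used.

    Returns (top, depth): top is [(object_id, lower_bound_score), ...] and
    depth is how many rows were read from each list before halting.
    """
    m = len(lists)
    seen = {}                 # object_id -> {list_index: weight}
    last = [0.0] * m          # last weight read from each list
    worst, top = {}, []
    depth = 0
    max_depth = max((len(pl) for pl in lists), default=0)
    while depth < max_depth:
        for i, pl in enumerate(lists):
            if depth < len(pl):
                w, oid = pl[depth]
                seen.setdefault(oid, {})[i] = w
                last[i] = w
            else:
                last[i] = 0.0  # exhausted list: nothing more to gain here
        depth += 1
        # Lower bound: weights seen so far. Upper bound: assume the last-read
        # weight in every list where the object has not been seen yet.
        worst = {o: sum(ws.values()) for o, ws in seen.items()}
        best = {
            o: worst[o] + sum(last[i] for i in range(m) if i not in seen[o])
            for o in seen
        }
        ranked = sorted(seen, key=lambda o: (-worst[o], o))
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            kth = worst[top[-1]]
            # Halt when neither an unseen object (bounded by sum(last)) nor a
            # seen object outside the top-k can still beat the k-th lower bound.
            if sum(last) <= kth and all(best[o] <= kth for o in rest):
                break
    return [(o, worst[o]) for o in top], depth

# Hypothetical posting lists for two query terms.
lists = [
    [(0.75, "d1"), (0.5, "d2"), (0.125, "d3")],
    [(1.0, "d2"), (0.5, "d1"), (0.25, "d3")],
]
result, depth = nra_top_k(lists, 1)
```

On this toy input the scan stops at depth 2, before d3 is ever read. Note that the scores NRA reports are lower bounds; they happen to be exact here because both top candidates were seen in every list.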
Threshold Algorithms: TA (Threshold Algorithm with random accesses). Input: posting lists sorted on weight (decreasing). [Figure: posting lists for terms t1, t2, t3 read in parallel at depths 1, 2, 3; a Doc ID / Score table holds the scores of d1 to d5; the threshold at each depth is s1+s2+s3, s4+s5+s6, s7+s8+s9.] The algorithm halts when the threshold drops to or below the score of the last object in the top-k.
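A minimal in-memory sketch of TA under the same assumptions (score is the sum of per-term weights, a missing posting counts as 0). Random access is simulated with a per-list dictionary; in a disk-based index each such lookup would presumably be a separate tree search, which is why the sketch counts them. `ta_top_k` is a hypothetical name.

```python
def ta_top_k(lists, k):
    """lists: one posting list per query term, each [(weight, object_id), ...]
    sorted by decreasing weight.

    Returns (top, depth, random_accesses) with top = [(object_id, score), ...].
    """
    # Stand-in for random access: object_id -> weight, one table per list.
    lookup = [{oid: w for w, oid in pl} for pl in lists]
    scores = {}
    random_accesses = 0
    top = []
    depth = 0
    max_depth = max((len(pl) for pl in lists), default=0)
    while depth < max_depth:
        threshold = 0.0
        for i, pl in enumerate(lists):
            if depth >= len(pl):
                continue  # exhausted list adds nothing to the threshold
            w, oid = pl[depth]
            threshold += w
            if oid not in scores:
                # First sighting: fetch the object's weight from every other list.
                total = w
                for j, table in enumerate(lookup):
                    if j != i:
                        total += table.get(oid, 0.0)
                        random_accesses += 1
                scores[oid] = total
        depth += 1
        top = sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]
        # Halt once k objects score at least as high as the threshold, i.e. the
        # best score any still-unseen object could reach.
        if len(top) == k and top[-1][1] >= threshold:
            break
    return top, depth, random_accesses

# Hypothetical posting lists for two query terms.
lists = [
    [(0.75, "d1"), (0.5, "d2"), (0.125, "d3")],
    [(1.0, "d2"), (0.5, "d1"), (0.25, "d3")],
]
top, depth, random_accesses = ta_top_k(lists, 1)
```

Unlike NRA, every score TA reports is exact, at the cost of one random access per newly seen object and per other list.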
Qualitative Comparison (NRA vs. TA): disk accesses, state keeping and computation, system calls. We expect TA to perform many more slow disk accesses. Can NRA's large state-keeping and computation needs outweigh TA's disk accesses? We implement both, on hard disk and on RAM-disk, to find out...
Implementation with FUSE
Testing - 4 real-world test sets - files containing tags from online objects - index is normally on secondary memory - RAM-disk used to evaluate the effect of disk accesses
Results demanded vs Time Disk based index NRA TA
Results demanded vs Time RAM based index NRA TA
Query Terms vs Time Disk based index NRA TA
Query Terms vs Time RAM based index NRA TA
Beagle vs NRA Terms vs time Results vs time
Conclusions SNFS: - Indexing, storage, and ranked retrieval of entities in a SN. - Study of the efficiency of algorithms, using real-world data and various implementations. - Competitive performance (e.g., against Beagle). - Many ways of further expansion
Future Work - Expansion for distributed systems and clouds - Distributed file systems (HDFS) - Distributed data structures - Tagging, indexing, and searching for entity collections – straightforward, as our object implementation/abstraction captures this. - Establishing entities consisting of relationships between entities, using advanced tagging, and searching for these…