Projects (2012-2013)
Ida Mele



Rules
Students have to work in teams (max 2 people).
The project has to be delivered by the deadline that will be published on my web site.
– Usually the project deadline is the same day as the written exam. Students who pass the exam during the first session can deliver the project by the second session.
The project score ranges from 0 to 10.
– The professor decides the final mark, also taking into account the score of the written exam.
A project can be assigned to at most 2 groups.

Project Request
Students have to send me an email with the subject "WebIR - project request" and the following information:
– Name and last name of each student in the group.
– Title of the project.
– Short description of what the students intend to do (up to 250 words).
Important: all the members of the group should be cc-ed in the email.
If everything is OK, you will receive a confirmation email.
There is no deadline for the project request.

Project Delivery
The presentation of the project takes 15 minutes.
The presentation should contain the description of the problem, the design decisions, the most important issues related to the implementation, and the results achieved.
Students use slides for their presentations and, if they want, they can give a demo as well.
Students have to deliver the source code and the slides.
More instructions about the project delivery will be published on my web site.

Project list
1. Analyze the link structure of the web graph of Sapienza University.
2. Analyze the link structure of the Twitter social network.
3. Find communities in Facebook.
4. Find communities in IMDB.
5. Find communities in DBLP.
6. Hadoop implementation of PageRank.
7. Hadoop implementation of HITS.
8. Build a reverse web graph with Hadoop.
9. Build an inverted index with Hadoop.
10. Personalized ranking of news.
11. Enrich News using Tweets.
12. Enrich News using Wikipedia.

Projects
1) Analyze the link structure of the web graph of Sapienza University.
– Crawl the portion of the Web under the domain uniroma1.it and create the corresponding web graph. Analyze its link structure and identify the authoritative web sites.
– Tip: the students can use node features such as degree, in-degree, out-degree, PageRank, etc., and plot the distributions of these measures. They can enrich their analysis by studying the edge reciprocity and the graph assortativity.
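The node and edge measures mentioned in the tip can be computed directly from an edge list. The following is a minimal Python sketch, not part of the original slides. The function names and the toy edge list are hypothetical, and reciprocity is assumed here to mean the fraction of directed edges whose reverse edge also exists (self-loops are ignored).

```python
from collections import Counter


def degree_stats(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    in_deg, out_deg = Counter(), Counter()
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return in_deg, out_deg


def degree_distribution(degrees):
    """Map each degree value to the number of nodes with that degree.

    Nodes whose degree is zero never appear in the Counter, so they are
    not counted here.
    """
    return Counter(degrees.values())


def reciprocity(edges):
    """Fraction of directed edges (u, v) whose reverse (v, u) also exists."""
    edge_set = {(u, v) for u, v in edges if u != v}  # drop self-loops
    if not edge_set:
        return 0.0
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)


# Hypothetical toy web graph: pages a, b, c, d
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
in_deg, out_deg = degree_stats(edges)
out_distribution = degree_distribution(out_deg)
edge_reciprocity = reciprocity(edges)
```

On a crawled graph, the distributions returned by `degree_distribution` are what one would plot, typically on log-log axes.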

Projects
2) Analyze the link structure of the Twitter social network.
– Use the Twitter API to create the who-follows-whom network. Analyze the distributions of followers and followings, and identify the most popular users. Study the edge reciprocity and determine whether the network is assortative.
– Tip: the students can use PageRank and/or other node features to identify the most popular users.
– Tip: the network is assortative when nodes tend to be connected with similar nodes; for example, nodes with high degree have edges to nodes with high degree.
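One common way to quantify assortativity is the Pearson correlation between the degrees of the two endpoints of each edge: positive values suggest an assortative network, negative values a disassortative one. The sketch below is an illustration under simplifying assumptions, not a method prescribed by the slides. It treats the network as undirected, and the function name and toy star graph are hypothetical.

```python
import math
from collections import Counter


def degree_assortativity(edges):
    """Pearson correlation of endpoint degrees over an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Each undirected edge contributes both (deg u, deg v) and (deg v, deg u).
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    if n == 0:
        return float("nan")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when all nodes have the same degree
    return cov / math.sqrt(var_x * var_y)


# Hypothetical star graph: one hub connected to three low-degree nodes.
star = [("hub", "x"), ("hub", "y"), ("hub", "z")]
r_star = degree_assortativity(star)
```

A star graph is the extreme disassortative case, since the high-degree hub connects only to degree-1 nodes.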

Projects
3) Find communities in Facebook.
– Use the Facebook API to download data about your friends and friends of friends. Create the corresponding friendship graph and find communities of users. Check whether the communities correspond to groups of users who live in the same city, work for the same organization, or attend the same school, university, etc.
– Tip: the students can identify clusters of users by using a graph-partitioning tool.

Projects
4 and 5) Find communities in a network of collaborations.
Project n.4: use IMDB.
Project n.5: use DBLP.
– Create a graph where nodes are people and a link between two people represents the fact that they have worked together. Use this graph to find communities of people. Check, for example, whether the people in a community come from the same country, are famous (for project n.4), or belong to the same university (for project n.5).
– Tip: the information about the number of collaborations is important; students can use weighted edges to represent it.
– Tip: the students can use a graph-partitioning tool to find clusters of users.
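The weighted collaboration graph described above can be sketched as follows. This is a minimal Python illustration, assuming the data has already been reduced to one list of people per movie or paper. The function name and sample records are hypothetical, and the real IMDB or DBLP formats are not modeled.

```python
from collections import Counter
from itertools import combinations


def collaboration_graph(works):
    """Build weighted undirected edges from lists of collaborators.

    `works` is an iterable of lists, one list of people per movie or paper.
    The result maps each pair (a, b), with a < b, to the number of works
    the two people share.
    """
    weights = Counter()
    for people in works:
        # set() removes duplicate names; sorted() gives a canonical pair order
        for a, b in combinations(sorted(set(people)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical sample: three works and their contributors
works = [["Ann", "Bob"], ["Bob", "Ann", "Cy"], ["Cy"]]
graph = collaboration_graph(works)
```

The resulting weights can then be exported in whatever edge-list format the chosen graph-partitioning tool expects.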

Projects
6 and 7) Hadoop implementation of a ranking algorithm.
Project n.6: implementation of PageRank.
Project n.7: implementation of HITS.
– Given a web graph, where nodes represent web pages and an edge between two nodes u and v represents a link from the source page u to the target page v, implement a ranking algorithm that computes the scores of the nodes. Plot and analyze the distribution of the obtained scores.
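To make the structure of an iterative PageRank computation concrete, here is a small single-machine Python sketch organized as a map step followed by a reduce-style aggregation. It does not use Hadoop or any Hadoop API. The function names, the damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from pages without out-links are all assumptions made for illustration.

```python
from collections import defaultdict


def pagerank_map(rank, out_links):
    """Map step: send an equal share of a page's rank to each link target."""
    for target in out_links:
        yield target, rank / len(out_links)


def pagerank(graph, damping=0.85, iterations=50):
    """Iterate PageRank on `graph`, a dict mapping page -> list of targets."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        incoming = defaultdict(float)  # reduce step: sum shares per target
        dangling = 0.0                 # rank held by pages with no out-links
        for node in nodes:
            out_links = graph.get(node, [])
            if not out_links:
                dangling += ranks[node]
            for target, share in pagerank_map(ranks[node], out_links):
                incoming[target] += share
        ranks = {
            node: (1 - damping) / n
            + damping * (incoming[node] + dangling / n)
            for node in nodes
        }
    return ranks


# Hypothetical toy graphs
cycle_ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
sink_ranks = pagerank({"a": ["c"], "b": ["c"], "c": []})
```

In a real Hadoop job each iteration would be a separate MapReduce pass, with the graph structure carried along with the rank values between passes.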

Projects
8) Build a reverse web graph with Hadoop.
– Given a web graph, the algorithm creates the graph with reversed edges. For example, if the input graph has the edge (u,v), the output graph will have the edge (v,u). Represent the input and output graphs (or portions of them) using a graph tool.
– Tip: for each link, the map creates (target, source) pairs. The reducer concatenates the sources and emits (target, list of sources) pairs.
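The map and reduce roles for edge reversal can be simulated in a few lines of plain Python. This is only a sketch of the data flow, not Hadoop code. It assumes the map emits the target as key and the source as value, and that the reducer collects all sources for each target. The function names and the toy edge list are hypothetical.

```python
from collections import defaultdict


def map_link(source, target):
    """Map step: for the link source -> target, emit (target, source)."""
    yield target, source


def reduce_sources(target, sources):
    """Reduce step: gather every source that links to `target`."""
    return target, sorted(sources)


def reverse_graph(edges):
    """Simulate map, shuffle (group by key), and reduce on an edge list."""
    grouped = defaultdict(list)
    for source, target in edges:
        for key, value in map_link(source, target):
            grouped[key].append(value)
    return dict(reduce_sources(key, values) for key, values in grouped.items())


# Hypothetical toy web graph
edges = [("u", "v"), ("w", "v"), ("v", "u")]
reversed_graph = reverse_graph(edges)
```

The grouping done by the `defaultdict` plays the part of the shuffle phase that Hadoop performs between the map and the reduce.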

Projects
9) Build an inverted index with Hadoop.
– Given a large collection of documents, the algorithm creates the inverted index, where the dictionary contains the indexed terms and the list of postings is stored for each term.
– Tip (for the dictionary): the students can decide to use stemming or to remove stop-words.
– Tip (for the postings): the students can build an inverted index where each posting has the ID of the document containing the term and other information, such as the frequency of the term in the document and the positions of the occurrences of the term in the document.
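The posting structure described in the tips (document ID, term frequency, positions) can be illustrated with a small in-memory Python sketch. It is not a Hadoop implementation. The tokenizer, the tiny stop-word list, and the function names are hypothetical choices, and stemming is left out.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list
STOP_WORDS = {"the", "a", "of", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build term -> {doc_id: {"tf": ..., "positions": [...]}}.

    Positions are token offsets in the document, counted before
    stop-word removal.
    """
    positions = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            if term in STOP_WORDS:
                continue
            positions[term].setdefault(doc_id, []).append(pos)
    return {
        term: {
            doc_id: {"tf": len(pos_list), "positions": pos_list}
            for doc_id, pos_list in postings.items()
        }
        for term, postings in sorted(positions.items())
    }


# Hypothetical two-document collection
docs = {1: "The web graph of the web", 2: "Graph mining"}
index = build_index(docs)
```

In a MapReduce version, the map would emit one (term, (doc_id, position)) record per token and the reduce would assemble the posting list for each term.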

Projects
10) Personalized ranking of news.
– Create a system that re-ranks news articles according to the user's interests. Users can specify their interests by selecting them from a list of keywords (e.g., gossip, sport, politics, …). The system uses an algorithm that ranks the news articles according to the user's preferences.
– Tip: the students can use different sources for collecting the news articles.
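As a starting point, re-ranking by user interests can be as simple as counting how often the selected keywords occur in each article. The Python sketch below shows only this naive baseline, which is an assumption and not the algorithm the slides ask for. The function name, the article fields, and the sample articles are hypothetical.

```python
def rerank(articles, interests):
    """Sort articles by the number of interest keywords found in their text.

    The sort is stable, so articles with equal scores keep their
    original order.
    """
    wanted = {keyword.lower() for keyword in interests}

    def score(article):
        words = article["text"].lower().split()
        return sum(1 for word in words if word.strip(".,;:!?") in wanted)

    return sorted(articles, key=score, reverse=True)


# Hypothetical articles and user profile
articles = [
    {"title": "A", "text": "Parliament votes on politics bill"},
    {"title": "B", "text": "Sport: the sport final tonight"},
]
ranked = rerank(articles, ["sport"])
```

A fuller system would likely weight terms (for example with TF-IDF) and map keywords to topics instead of matching exact words.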

Projects
11) Enrich News using Tweets.
– Enrich a news site with the information published by the users of Twitter. Given a news article, the system gathers the user tweets about it and shows the news article along with the tweets.
– Tip: students can use news about concerts of famous singers, or about strikes, riots…
– Tip: students can decide to show a timeline of tweets at the top of the page, or to rank the tweets and show the top-n tweets on the left of the page.

Projects
12) Enrich News using Wikipedia.
– Enrich the facts reported in news pages with information extracted from Wikipedia. Given a news article, identify the names of the people mentioned in the article and, for each of them, report the Wikipedia information about their life.
– Tip: the students can use the Stanford Named Entity Recognizer for the entity-extraction task. It makes it easy to find the names of famous people.
– Tip: the students can use the whole Wikipedia page or paragraphs extracted from it.

Other important information
Graph datasets: students who want to work on graphs but cannot crawl a portion of the Web can find some large graphs here:
News datasets: students who want to work on news articles but cannot collect the pages from the Web can send me an email.
Some famous graph tools:
– Gephi
– METIS, for graph partitioning.
For questions, send me an email; I will reply ASAP.