Tutorial: Big Data Algorithms and Applications Under Hadoop

Tutorial: Big Data Algorithms and Applications Under Hadoop
KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA August 7, 2014

Schedule I. Introduction to big data (8:00 – 8:30) II. Hadoop and MapReduce (8:30 – 9:45) III. Coffee break (9:45 – 10:00) IV. Distributed algorithms and applications (10:00 – 11:40) V. Conclusion (11:40 – 12:00)

V. Conclusion

Conclusion What is big data? Why big matters to you?
What are techniques for big data analytics? Hadoop and MapReduce Clustering algorithm: K-means Topic modeling algorithm: LDA Social network analysis: centrality

What is big data? Five Vs Volume: the size of data
Velocity: the change speed of data, streaming generating data Variety: the format of data is various Veracity: the truth of data Value: companies can benefit from big data analysis

Why big data matters to you?
Big data analytics has been occurred in every domain, including finance, government, science, healthcare, IT, etc. Big data becomes a hot word in job descriptions Many companies benefit from big data analysis

Techniques in big data analytics
Machine learning Text/web mining Distributed computing Social network analysis Natural language processing Visualization Optimization

Hadoop and MapReduce Hadoop is a platform
MapReduce is a computing mechanism

HDFS architecture

MapReduce framework Per cluster node: Single JobTracker per master
Responsible for scheduling the jobs’ component tasks on the slaves Monitor slave progress Re-execute failed tasks Single TaskTracker per slave Execute the task as directed by the master

Hadoop data flow

K-Means k initial "means" (in this case k=3) are randomly generated within the data domain (shown in color). k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means. The centroid of each of the k clusters becomes the new mean. Steps 2 and 3 are repeated until convergence has been reached.

Topic modeling algorithm: LDA
Nd K βk topics α V-dimensional Dirichlet θd topic proportions for document Zd,n topic assignment for word Wd,n observed word η K-dimensional Dirichlet Joint distribution

Network analysis: centrality
Degree centrality of a node in a network is the number of links (vertices) incident on the node. Closeness centrality determines how “close” a node is to other nodes in a network by measuring the sum of the shortest distances (geodesic paths) between that node and all other nodes in the network. Betweenness centrality determines the relative importance of a node by measuring the amount of traffic flowing through that node to other nodes in the network. This is done by measuring the fraction of paths connecting all pairs of nodes and containing the node of interest. Eigenvector centrality is a more sophisticated version of degree centrality where the centrality of a node not only depends on the number of links incident on the node but also the quality of those links. This quality factor is determined by the eigenvectors of the adjacency matrix of the network.

Some tools (I) Weka 3: data mining software in Java Apache Mahout: scalable machine learning library Natural language toolkit (NLTK) Gephi: network analysis

Some tools (II) igraph: network analysis package Data visualization Hive: distributed data warehouse Pig: analyzing large dataset

Recommended papers Big data report: MapReduce:

Recommended papers Machine learning algorithm survey: Community detection in a network survey: Topic modeling:

Thank you

Tutorial: Big Data Algorithms and Applications Under Hadoop

Similar presentations

Presentation on theme: "Tutorial: Big Data Algorithms and Applications Under Hadoop"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Tutorial: Big Data Algorithms and Applications Under Hadoop

Similar presentations

Presentation on theme: "Tutorial: Big Data Algorithms and Applications Under Hadoop"— Presentation transcript:

Similar presentations

About project

Feedback