Big Data Research Progress, Chao, Jan 22, 2013

1 Big Data Research Progress Chao Jan 22, 2013

2 Big Data Lab
Big Data@CSAIL, MIT
– http://bigdata.csail.mit.edu/
– 23 nodes
– GROWING BIG LINKED DATA FROM SEED: BUILDING A DEMO
– VISION MACHINE: LEARNING ONLINE FROM 25 MILLION IMAGES
– NATURAL LANGUAGE INTERFACE FOR BIG DATA
– SCIDB
– MACHINE LEARNING
– SOCIAL: CONDENSR
– SOCIAL: TWITINFO
– SOCIAL: INFLUENCE MODELING
– …

3 Big Data Lab
NASA tournament lab
– http://www.nasa.gov/directorates/heo/ntl/
Big data challenge
– http://open.nasa.gov/blog/2012/10/03/nasa-tournament-labs-big-data-challenge/
– Apply the process of open innovation to conceptualizing new and novel approaches to using “big data” information sets from various U.S. government agencies, e.g., health, energy and earth science.

4 Big Data People
Jimmy Lin (University of Maryland)
– http://www.umiacs.umd.edu/~jimmylin/
Ron Bekkerman (LinkedIn)
– http://people.cs.umass.edu/~ronb/
Misha Bilenko (MSR)
– http://research.microsoft.com/en-us/um/people/mbilenko/
John Langford (Yahoo! Research)
– http://hunch.net/~jl/

5 Tutorial
Scaling Up Machine Learning: Parallel and Distributed Approaches, KDD’2011
Ron Bekkerman (LinkedIn), Misha Bilenko (MSR) and John Langford (Yahoo! Research)
http://hunch.net/~large_scale_survey/

6 Tutorial
State-of-the-art platforms and algorithm choices
– Hardware options (from FPGAs and GPUs to multi-core systems and commodity clusters)
– Programming frameworks (including CUDA, MPI, MapReduce, and DryadLINQ)
– Learning settings (e.g., semi-supervised and online learning)
Example-driven, covering a number of popular algorithms (e.g., boosted trees, spectral clustering, belief propagation) and diverse applications (e.g., speech recognition and object recognition in vision).

7 Parallelization: platform choices
Platform          Communication Scheme   Data size
Peer-to-Peer      TCP/IP                 Petabytes
Virtual Clusters  MapReduce / MPI        Terabytes
HPC Clusters      MPI / MapReduce        Terabytes
Multicore         Multithreading         Gigabytes
GPU               CUDA                   Gigabytes
FPGA              HDL                    Gigabytes

8 The Book
Cambridge University Press
Due in November 2011
21 chapters
Covering
– Platforms
– Algorithms
– Learning setups
– Applications

9 Chapter contributors
[Figure: contributors labeled by chapter number, 2 to 21]

10 New age of big data
The world has gone mobile
– 5 billion cellphones produce data daily
Social networks have gone online
– Twitter produces 200M tweets a day
Crowdsourcing is the reality
– Labeling of 100,000+ data instances is doable within a week

11 Big Data Data
DATA.GOV
– http://www.data.gov/developers/community/developers
– Data portal provided by the US government

12 Big Data in Q&A
It is estimated that 2.5 quintillion bytes of new data are created daily, with an estimated 80% of this produced as "unstructured" data
IBM Watson deep Q&A
– http://www.research.ibm.com/articles/watson.shtml
– Evidence-based decision support
– Jeopardy!
– Provide a single correct answer with confidence
– Analyze over 200 million pages in three seconds
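
To put the figures on this slide on a common scale, here is a small back-of-the-envelope calculation. It assumes "quintillion" means 10^18 (short scale) and uses decimal exabytes (10^18 bytes); the variable names are illustrative only, and the inputs are the slide's own estimates, not measurements.

```python
# Figures quoted on the slide (estimates, not measurements).
BYTES_PER_DAY = 2.5e18        # 2.5 quintillion bytes, assuming short scale (10^18)
UNSTRUCTURED_SHARE = 0.80     # estimated share produced as "unstructured" data
WATSON_PAGES = 200e6          # pages analyzed by Watson
WATSON_SECONDS = 3            # time quoted for that analysis

EXABYTE = 1e18                # decimal exabyte, an assumption of this sketch

total_eb_per_day = BYTES_PER_DAY / EXABYTE
unstructured_eb_per_day = total_eb_per_day * UNSTRUCTURED_SHARE
watson_pages_per_second = WATSON_PAGES / WATSON_SECONDS

print(f"New data per day:        {total_eb_per_day:.1f} EB")
print(f"Unstructured per day:    {unstructured_eb_per_day:.1f} EB")
print(f"Watson throughput:       {watson_pages_per_second / 1e6:.1f} million pages/s")
```

Under these assumptions the daily volume is 2.5 EB, of which about 2.0 EB is unstructured, and the quoted Watson figure corresponds to roughly 67 million pages per second.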

13 Big Data in Q&A
IBM Watson deep Q&A
– Health care
– 2011: pilot program with WellPoint, whose affiliated health plans cover one in nine Americans
– 2012: partnership with Memorial Sloan-Kettering Cancer Center, where work is under way to teach Watson about oncology diagnosis and treatment options

14 Big Data Blog
http://whatsthebigdata.com/
– News and events about Big Data
http://www.greenplum.com/industry-buzz/big-data/research-papers
– News and research papers about Big Data

15 Big Data Publication
Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture
http://arxiv.org/pdf/1210.7350v1.pdf
Architecture behind Twitter's real-time related query suggestion and spelling correction service
– First implementation: typical Hadoop-based analytics stack, did not meet the latency requirement
– Second implementation: system deployed in production, custom in-memory processing engine

16 Big Data Publication
Fast Candidate Generation for Two-Phase Document Ranking: Postings List Intersection with Bloom Filters
http://www.umiacs.umd.edu/~jimmylin/publications/Asadi_Lin_CIKM2012.pdf
– Most modern web search engines employ a two-phase ranking strategy: a candidate list of documents is generated using a “cheap” but low-quality scoring function, which is then reranked by an “expensive” but high-quality method
– Candidate generation for conjunctive query processing in this context
– Fast, approximate postings list intersection algorithms based on Bloom filters
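
The idea of approximate postings list intersection with a Bloom filter can be sketched as follows. This is a minimal illustration, not the algorithm from the paper: the class `BloomFilter`, the function `approx_intersect`, and the parameters `bits_per_item` and `num_hashes` are all hypothetical names chosen for this example. One postings list is loaded into a Bloom filter and the other is probed against it, so the result may contain false positives but never misses a true match, which fits a first phase whose candidates are reranked later.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over a bit array (illustrative, not tuned)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all probe positions from one SHA-256 digest.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def approx_intersect(short_postings, long_postings, bits_per_item=16, num_hashes=4):
    """Return doc ids from short_postings that are probably in long_postings.

    The result is a superset of the exact intersection: false positives are
    possible, false negatives are not.
    """
    bloom = BloomFilter(max(8, bits_per_item * len(long_postings)), num_hashes)
    for doc_id in long_postings:
        bloom.add(doc_id)
    return [doc_id for doc_id in short_postings if doc_id in bloom]


# Usage: postings lists for two query terms (made-up doc ids).
short_list = [3, 8, 15, 42, 99]
long_list = list(range(0, 1000, 3))
candidates = approx_intersect(short_list, long_list)
print(candidates)
```

In this sketch the filter is built at query time for clarity; a real system would more plausibly build it once per term at indexing time so that queries only pay for the probes.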

17 Big Data Publication
Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling
– http://www.umiacs.umd.edu/~jimmylin/publications/Ture_Lin_NAACL-HLT2012.pdf
Large-Scale Machine Learning at Twitter
– http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
Smoothing Techniques for Adaptive Online Language Models: Topic Tracking in Tweet Streams
– http://www.umiacs.umd.edu/~jimmylin/publications/Lin_etal_KDD2011.pdf

18 Big Data Book
Data-Intensive Text Processing with MapReduce
http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf

