Big Data Research Progress Chao Jan 22, 2013
Big Data Lab Big MIT – 23 nodes – GROWING BIG LINKED DATA FROM SEED: BUILDING A DEMO – VISION MACHINE: LEARNING ONLINE FROM 25 MILLION IMAGES – NATURAL LANGUAGE INTERFACE FOR BIG DATA – SCIDB – MACHINE LEARNING – SOCIAL: CONDENSR – SOCIAL: TWITINFO – SOCIAL: INFLUENCE MODELING – …
Big Data Lab NASA Tournament Lab – Big data challenge – tournament-labs-big-data-challenge/ – Apply the process of open innovation to conceptualizing new and novel approaches to using “big data” information sets from various U.S. government agencies, e.g., health, energy and earth science.
Big Data People Jimmy Lin (University of Maryland) – Ron Bekkerman (LinkedIn) – Misha Bilenko (MSR) – John Langford (Yahoo! Research)
Tutorial Scaling Up Machine Learning: Parallel and Distributed Approaches – KDD 2011 – Ron Bekkerman (LinkedIn), Misha Bilenko (MSR) and John Langford (Yahoo! Research)
Tutorial State-of-the-art platforms and algorithm choices – Hardware options (from FPGAs and GPUs to multi-core systems and commodity clusters) – Programming frameworks (including CUDA, MPI, MapReduce, and DryadLINQ) – Learning settings (e.g., semi-supervised and online learning) – Example-driven, covering a number of popular algorithms (e.g., boosted trees, spectral clustering, belief propagation) and diverse applications (e.g., speech recognition and object recognition in vision).
Parallelization: platform choices
Platform          Communication Scheme   Data size
Peer-to-Peer      TCP/IP                 Petabytes
Virtual Clusters  MapReduce / MPI        Terabytes
HPC Clusters      MPI / MapReduce        Terabytes
Multicore         Multithreading         Gigabytes
GPU               CUDA                   Gigabytes
FPGA              HDL                    Gigabytes
The Book – Cambridge University Press – Due in November – chapters covering – Platforms – Algorithms – Learning setups – Applications
Chapter contributors
New age of big data The world has gone mobile – 5 billion cellphones produce data daily Social networks have gone online – Twitter produces 200M tweets a day Crowdsourcing is the reality – Labeling of 100,000+ data instances is doable within a week
Big Data Data DATA.GOV – elopers – Data portal provided by the US government
Big Data in Q&A It is estimated that 2.5 quintillion bytes of new data are created daily, with an estimated 80% of this produced as "unstructured" data – IBM Watson deep Q&A – Evidence-based decision support – Jeopardy! – Provide a single correct answer with confidence – Analyze over 200 million pages in three seconds
Big Data in Q&A IBM Watson deep Q&A – Health care – 2011: pilot program with WellPoint, whose affiliated health plans cover one in nine Americans – 2012: partnership with Memorial Sloan-Kettering Cancer Center, where work is under way to teach Watson about oncology diagnosis and treatment options
Big Data Blog – News and events about Big Data – buzz/big-data/research-papers – News and research papers about Big Data
Big Data Publication Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture – Architecture behind Twitter's real-time related query suggestion and spelling correction service – First implementation: a typical Hadoop-based analytics stack, which did not meet the latency requirement – Second implementation: the system deployed in production, built on a custom in-memory processing engine
Big Data Publication Fast Candidate Generation for Two-Phase Document Ranking: Postings List Intersection with Bloom Filters – M2012.pdf – Most modern web search engines employ a two-phase ranking strategy: a candidate list of documents is generated using a “cheap” but low-quality scoring function, which is then reranked by an “expensive” but high-quality method – Candidate generation for conjunctive query processing in this context – Fast, approximate postings list intersection algorithms based on Bloom filters
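To make the idea of approximate postings list intersection concrete, here is a minimal Python sketch. It is not the paper's algorithm: the `BloomFilter` class, `build_filter`, `approximate_intersection`, and the `bits_per_item` and `num_hashes` parameters are hypothetical names and settings chosen for illustration. It assumes each longer postings list is summarized by a Bloom filter and the shortest list is probed against those filters.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over a bit array (illustrative, not tuned)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all positions from two halves of one digest.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def build_filter(postings, bits_per_item=16, num_hashes=4):
    """Summarize one postings list (document ids) as a Bloom filter."""
    bloom = BloomFilter(max(8, bits_per_item * len(postings)), num_hashes)
    for doc_id in postings:
        bloom.add(doc_id)
    return bloom


def approximate_intersection(shortest_postings, other_filters):
    """Keep doc ids from the shortest list that every filter reports as present.

    The result may contain false positives but never drops a true match.
    """
    return [d for d in shortest_postings if all(d in f for f in other_filters)]


# Hypothetical postings lists for a three-term conjunctive query.
postings = {
    "big": [1, 3, 5, 7, 9, 11],
    "data": [3, 4, 5, 9, 20],
    "lab": [5, 9, 30],
}
filters = [build_filter(postings["big"]), build_filter(postings["data"])]
candidates = approximate_intersection(postings["lab"], filters)
```

A Bloom filter can report a document that is not actually in a list but never misses one that is, so the candidate set is a superset of the exact intersection. That fits a two-phase setup in which the second-phase ranker can discard spurious candidates.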
Big Data Publication Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling – NAACL-HLT2012.pdf – Large-Scale Machine Learning at Twitter – SIGMOD2012.pdf – Smoothing Techniques for Adaptive Online Language Models: Topic Tracking in Tweet Streams – DD2011.pdf
Big Data Book Data-Intensive Text Processing with MapReduce – ms/MapReduce-book-final.pdf
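For readers new to the MapReduce model the book is built around, the following is a minimal single-process Python sketch of word counting. It only imitates the map, shuffle, and reduce stages in memory and is not taken from the book. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count`, and the sample documents, are hypothetical.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every token in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the partial counts for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run the three stages in sequence over an in-memory collection."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


# Hypothetical two-document collection.
docs = {"d1": "big data big lab", "d2": "data lab"}
counts = word_count(docs)
```

In a real cluster, the map and reduce calls would run in parallel on different machines and the shuffle would move data over the network. The sketch keeps only the data flow between the stages.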