Big Data Research Progress
Chao, Jan 22, 2013

Big Data Lab
Big MIT
– 23 nodes
– Growing Big Linked Data from Seed: Building a Demo
– Vision Machine: Learning Online from 25 Million Images
– Natural Language Interface for Big Data
– SciDB
– Machine Learning
– Social: Condensr
– Social: TwitInfo
– Social: Influence Modeling
– …

Big Data Lab
NASA tournament lab
– Big data challenge
– tournament-labs-big-data-challenge/
– Apply the process of open innovation to conceptualizing new and novel approaches to using “big data” information sets from various U.S. government agencies, e.g., health, energy and earth science.

Big Data People
– Jimmy Lin (University of Maryland)
– Ron Bekkerman (LinkedIn)
– Misha Bilenko (MSR)
– John Langford (Yahoo! Research)

Tutorial
Scaling Up Machine Learning: Parallel and Distributed Approaches, KDD 2011
Ron Bekkerman (LinkedIn), Misha Bilenko (MSR) and John Langford (Yahoo! Research)

Tutorial
– State-of-the-art platforms and algorithm choices
– Hardware options (from FPGAs and GPUs to multi-core systems and commodity clusters)
– Programming frameworks (including CUDA, MPI, MapReduce, and DryadLINQ)
– Learning settings (e.g., semi-supervised and online learning)
– Example-driven, covering a number of popular algorithms (e.g., boosted trees, spectral clustering, belief propagation) and diverse applications (e.g., speech recognition and object recognition in vision)

Parallelization: platform choices

Platform          Communication Scheme   Data size
Peer-to-Peer      TCP/IP                 Petabytes
Virtual Clusters  MapReduce / MPI        Terabytes
HPC Clusters      MPI / MapReduce        Terabytes
Multicore         Multithreading         Gigabytes
GPU               CUDA                   Gigabytes
FPGA              HDL                    Gigabytes

The Book
Cambridge University Press. Due in November. Chapters covering:
– Platforms
– Algorithms
– Learning setups
– Applications

Chapter contributors

New age of big data
The world has gone mobile
– 5 billion cellphones produce daily data
Social networks have gone online
– Twitter produces 200M tweets a day
Crowdsourcing is the reality
– Labeling of 100,000+ data instances is doable within a week

Big Data Data
DATA.GOV
– elopers
– Data portal provided by US government

Big Data in Q&A
It is estimated that 2.5 quintillion bytes of new data are created daily, with an estimated 80% of this produced as "unstructured" data.
IBM Watson deep Q&A
– Evidence-based decision support
– Jeopardy!
– Provide a single correct answer with confidence
– Analyze over 200 million pages in three seconds

Big Data in Q&A
IBM Watson deep Q&A
– Health care
– 2011: pilot program with WellPoint, whose affiliated health plans cover one in nine Americans
– 2012: partnership with Memorial Sloan-Kettering Cancer Center, where work is under way to teach Watson about oncology diagnosis and treatment options

Big Data Blog
– News and events about Big Data
buzz/big-data/research-papers
– News and research papers about Big Data

Big Data Publication
Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture
Describes the architecture behind Twitter's real-time related query suggestion and spelling correction service
– First implementation: a typical Hadoop-based analytics stack, which did not meet the latency requirement
– Second implementation: a custom in-memory processing engine, which is the system deployed in production

Big Data Publication
Fast Candidate Generation for Two-Phase Document Ranking: Postings List Intersection with Bloom Filters
M2012.pdf
– Most modern web search engines employ a two-phase ranking strategy: a candidate list of documents is generated using a “cheap” but low-quality scoring function, and is then reranked by an “expensive” but high-quality method
– The paper addresses candidate generation for conjunctive query processing in this context
– It proposes fast, approximate postings list intersection algorithms based on Bloom filters
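
The idea on this slide can be illustrated with a minimal Python sketch. It is not the paper's algorithm or data layout: the BloomFilter class, the parameters bits_per_item and num_hashes, and the function names approximate_intersection and two_phase_rank are all hypothetical names chosen for illustration. The sketch scans the shortest postings list and probes a Bloom filter built over each of the other lists. Because a Bloom filter has no false negatives, every true match survives; a few false positives may slip through, which the second, "expensive" ranking phase is assumed to tolerate.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter over document IDs (illustrative, not tuned)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True for every added item; occasionally true for others.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def build_filter(postings, bits_per_item=16, num_hashes=4):
    """Build a filter for one postings list (assumed parameters)."""
    bloom = BloomFilter(max(8, bits_per_item * len(postings)), num_hashes)
    for doc_id in postings:
        bloom.add(doc_id)
    return bloom


def approximate_intersection(postings_lists):
    """Phase 1: candidates for a conjunctive (AND) query.

    Returns a superset of the exact intersection, drawn from the
    shortest postings list.
    """
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    shortest, others = ordered[0], ordered[1:]
    # A real index would build these filters once, at indexing time.
    filters = [build_filter(p) for p in others]
    return [d for d in shortest if all(d in f for f in filters)]


def two_phase_rank(postings_lists, expensive_score, k):
    """Phase 2: rerank the cheap candidates with a costlier scorer."""
    candidates = approximate_intersection(postings_lists)
    return sorted(candidates, key=expensive_score, reverse=True)[:k]
```

For example, two_phase_rank([list_a, list_b], expensive_score=my_model, k=10) would return at most ten document IDs, where list_a, list_b and my_model stand in for the caller's own postings lists and scoring function.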

Big Data Publication
– Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling
  NAACL-HLT2012.pdf
– Large-Scale Machine Learning at Twitter
  SIGMOD2012.pdf
– Smoothing Techniques for Adaptive Online Language Models: Topic Tracking in Tweet Streams
  DD2011.pdf

Big Data Book
Data-Intensive Text Processing with MapReduce
ms/MapReduce-book-final.pdf