DBSI Teaser presentation The Beckman Report On Database Research



Presentation transcript:

1 DBSI Teaser presentation: The Beckman Report on Database Research
Presented by: Akshita Anand ( ), Priyanka Balotra (MT14018), Sakshi Agarwal (MT14043)
Hello everyone, we are going to deliver a presentation on "The Beckman Report on Database Research". In October 2013, 30 leaders from the database community (28 database researchers and two invited speakers) met at the Beckman Center, University of California, to discuss big data as a defining challenge of our time. Well-known researchers such as Rakesh Agrawal, Daniel Abadi, Raghu Ramakrishnan and many more took part in this meeting.

2 Content
Characteristics of Big Data
Research Challenges
Community Challenges
Conclusion
This will be our roadmap for this presentation.

3 Characteristics of Big Data
Big data has been identified as a defining challenge for the database field. Big data is a broad term for data sets so large that traditional data processing applications are inadequate for them. Its characteristics are commonly summarized as the three V's: Volume, Velocity and Variety. Volume refers to the sheer amount of data; velocity concerns how fast data arrives and how fast operations must be performed on it; and variety reflects the fact that big data is collected from many different sources and so comes in many different forms.

4 Research Challenges
1. Scalable big/fast data infrastructures
2. Coping with diversity in data management
3. End-to-end processing of data
At the end of that meeting, the researchers identified five challenging areas for big data: scalable big/fast data infrastructures; coping with diversity in data management; end-to-end processing of data; cloud services; and the roles of people in the data life cycle. In this presentation we will go through the first three of these research challenges.

5 Challenge #1: Scalable Big Data Infrastructures
Talking about the first challenge, scalable big data infrastructures, in more detail:
Parallel and distributed processing. The database world has seen great success with parallel processing, data warehousing and higher-level languages, which have made data processing much easier; Hadoop is one example. Still, more powerful cost-aware query processors and optimizers are needed to fully exploit large clusters, and for that we also require new hardware.
New hardware. Fields like graphics processing and integrated circuits produce very large data sets, so processing them calls for more heterogeneous environments and specialized processors. At the same time, we cannot overlook cost-efficient storage: both server-attached and network-attached storage architectures need to be considered, HDFS being an example.
High-speed data streams. Even if all these requirements are satisfied, data will keep arriving at ever higher speeds, so we also need algorithms to process those streams of data.
Late-bound schemas. One important challenge is late-bound schemas: we need query engines that can run efficiently over raw files. Since many such files are processed only once, storing and indexing them is not worth the cost, so they are kept as raw (e.g. binary) files and a schema is applied only at query time.
Consistency. Many systems, such as NoSQL systems, have been developed with consistency and scale in mind, but most of them provide only basic data access with weak atomicity and isolation guarantees, so programming models for data consistency need to be revisited.
Lastly, scalability should be measured not only in petabytes of data and queries per second, but also in total cost of ownership, end-to-end processing speed and usability. Measuring progress against such broader metrics will require new types of benchmarks.
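The late-bound-schema idea above can be sketched in a few lines: the raw file is never loaded or indexed, and a schema (column names and types) is bound to the bytes only when a query runs. This is a minimal illustrative sketch, not the report's design; the sample data, the schema, and the "query" are all invented.

```python
import csv
import io

# Raw data kept as-is on "disk" (a string here); no load or index step.
RAW = """2013-10-14,click,42
2013-10-14,view,17
2013-10-15,click,58
"""

def scan(raw_text, schema):
    """Parse raw CSV rows on the fly, casting each field per the schema.

    The schema is bound at query time ("late"), not at storage time.
    """
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}

# The schema lives with the query, not with the file.
schema = [("day", str), ("event", str), ("count", int)]

# "Query": total clicks per day, computed in one pass over the raw bytes.
totals = {}
for rec in scan(RAW, schema):
    if rec["event"] == "click":
        totals[rec["day"]] = totals.get(rec["day"], 0) + rec["count"]

print(totals)  # {'2013-10-14': 42, '2013-10-15': 58}
```

Because the data is touched only once, the one-pass scan avoids exactly the storage and indexing cost the slide mentions.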

6 Challenge #2: Diversity in Data Management
No one size fits all.
Cross-platform integration: integration of platforms; hiding heterogeneity; optimizing performance.
Programming models: diversity in programming abstractions and reusability; the need for more than one language; a focus on domain-specific languages.
Data processing workflows: platforms that can span both "raw" and "cooked" data, for example querying data with SQL and then analyzing it with R.
1. Platforms need to be integrated or federated so that data analysts can combine and analyze data across systems. This involves hiding the heterogeneity of data formats and access languages, and also optimizing the performance of accesses and flows in big data systems. Disconnected devices raise further challenges in reliable data ingestion, query processing, and data consistency in such sometimes-connected, wide-area environments.
2. Diverse programming abstractions are needed to operate on very large datasets, along with reusable middle-layer components. A single data analysis language does not meet everyone's needs; users should be free to analyze their data in the language they are most comfortable with, such as R, Python or SQL. We also need tools that simplify the implementation of new scalable, data-parallel languages.
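The SQL-then-R workflow mentioned above can be sketched with standard-library pieces: a declarative SQL engine produces a result set, which then crosses over into a statistics library for analysis. This is only a toy sketch of the idea; Python's sqlite3 and statistics modules stand in for a real warehouse and for R, and the table and values are made up.

```python
import sqlite3
import statistics

# A tiny in-memory table standing in for "cooked" warehouse data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 140.0), ("west", 90.0), ("west", 110.0)],
)

# Step 1: declarative SQL selects and filters the data.
rows = conn.execute("SELECT amount FROM sales WHERE region = 'east'").fetchall()

# Step 2: the result set crosses into a statistics library for analysis
# (the role R plays in the slide's example).
amounts = [amount for (amount,) in rows]
print(statistics.mean(amounts))  # 120.0
```

The pain point the slide describes is exactly this hand-off: today the analyst glues the two systems together manually, where an integrated platform would hide the boundary.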

7 Challenge #3: End-to-End Processing of Data
Data-to-knowledge pipeline. The steps of the raw-data-to-knowledge pipeline include data acquisition; selection, assessment, cleaning and transformation; and extraction and integration. These steps must cope with an ever greater diversity of data and users.
Tool diversity. Multiple tools are needed, one or more for each step of the raw-data-to-knowledge pipeline.
Tool customizability. Tools should capture domain knowledge, such as dictionaries, knowledge bases and rules, and should be easy to customize to a new domain. Hand-crafted rules are needed alongside machine learning: in precision-sensitive applications like e-commerce, corner cases must be handled exactly, and hand-crafted rules provide the precision such systems need, so tools should support this combination.
Throughout the pipeline we must also take human feedback into account, and the tools must be usable by subject-matter experts, not just by IT professionals. For example, a journalist may want to clean, map, and publish data from a spreadsheet file of crime statistics. Tools must also be tailored to data scientists, the new class of data analysis professionals that has emerged.
Finally, data comes in a wide variety of formats, structured and unstructured, but we need to use them together in a structured fashion, through tools that are seamlessly integrated and easy to use for both lay and expert users.
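The pipeline steps named above can be sketched end to end in a few lines: acquire raw rows, clean them, then extract structured facts with a hand-crafted rule of the kind the slide says precision-sensitive applications need. Everything here is invented for illustration; the crime-statistics rows echo the slide's journalist example, and the regex is a stand-in for a real rule engine.

```python
import re

def acquire():
    # Acquisition: raw spreadsheet-like rows, messy as found.
    return ["  Theft , 120 ", "Burglary,85", "", "Assault , n/a"]

def clean(rows):
    # Cleaning: drop blank rows, trim whitespace around fields.
    for row in rows:
        if row.strip():
            yield ",".join(part.strip() for part in row.split(","))

def extract(rows):
    # Extraction via a hand-crafted rule (a regex): precision-oriented,
    # so rows that don't match (e.g. an "n/a" count) are skipped, not guessed.
    rule = re.compile(r"^(\w+),(\d+)$")
    for row in rows:
        m = rule.match(row)
        if m:
            yield (m.group(1), int(m.group(2)))

# Raw data -> knowledge: the pipeline composed end to end.
knowledge = dict(extract(clean(acquire())))
print(knowledge)  # {'Theft': 120, 'Burglary': 85}
```

Note how the rule silently rejects the "n/a" row rather than emitting a wrong number; that choice of precision over recall is exactly where the slide says hand-crafted rules complement machine learning.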

8 Open Source and Understanding Data
Open source. There are few open-source tools today; most products are expensive and proprietary. Open-source tools are needed across data capture, data processing, analysis, and the generation of outputs.
Understanding data. This requires capturing and managing appropriate meta-information, and support for filtering, summarization and visualization. For example, Facebook automatically identifies faces in an image so that users can optionally tag them.
Knowledge bases. The more knowledge tools have about a target domain, the better they can analyze that domain.

9 Community Challenges and Conclusion
The community also faces challenges beyond research. Some of these are new and some are old, but big data has made them increasingly important:
Database education
Data science
Database research has been restricted by the rigors of the enterprise and of relational database systems; handling data diversity and exploiting new hardware, software, and cloud-based platforms now demand attention. It is also time to rethink our approaches to education, our involvement with data consumers, and our value system and its impact on how we evaluate research.
Conclusion: this is an exciting time for database research. The rise of big data and the vision of a data-driven world present many exciting new research challenges.

10 THANK YOU




