732A54 Big Data Analytics 6hp http://www.ida.liu.se/~732A54

1 732A54 Big Data Analytics 6hp

2 Teachers
Examiner: Patrick Lambrix (B:2B:474)
Lectures: Patrick Lambrix, Christoph Kessler, Jose Pena, Valentina Ivanova, Johan Falkenjack
Labs: Zlatan Dragisic, Valentina Ivanova
Director of studies: Patrick Lambrix

3 Course literature
Articles (on web/handout)
Lab descriptions (on web)

4 Data and Data Storage

5 Data and Data Storage
Database / Data source: one (of several) ways to store data in electronic format
Used in everyday life: bank, hotel reservations, library search, shopping

6 Databases / Data sources
Database management system (DBMS): a collection of programs to create and maintain a database
Database system = database + DBMS

7 Databases / Data sources
(Diagram: the database system receives queries against an information model and returns answers; internally it performs processing of queries/updates, access to stored data, and physical database management.)

8 What information is stored?
Model the information: Entity-Relationship model (ER), Unified Modeling Language (UML)

9 What information is stored? - ER
Entities and attributes, entity types, key attributes, relationships, cardinality constraints; EER: sub-types

10 1 tgctacccgc gcccgggctt ctggggtgtt ccccaaccac ggcccagccc tgccacaccc 61 cccgcccccg gcctccgcag ctcggcatgg gcgcgggggt gctcgtcctg ggcgcctccg 121 agcccggtaa cctgtcgtcg gccgcaccgc tccccgacgg cgcggccacc gcggcgcggc 181 tgctggtgcc cgcgtcgccg cccgcctcgt tgctgcctcc cgccagcgaa agccccgagc 241 cgctgtctca gcagtggaca gcgggcatgg gtctgctgat ggcgctcatc gtgctgctca 301 tcgtggcggg caatgtgctg gtgatcgtgg ccatcgccaa gacgccgcgg ctgcagacgc 361 tcaccaacct cttcatcatg tccctggcca gcgccgacct ggtcatgggg ctgctggtgg 421 tgccgttcgg ggccaccatc gtggtgtggg gccgctggga gtacggctcc ttcttctgcg 481 agctgtggac ctcagtggac gtgctgtgcg tgacggccag catcgagacc ctgtgtgtca 541 ttgccctgga ccgctacctc gccatcacct cgcccttccg ctaccagagc ctgctgacgc 601 gcgcgcgggc gcggggcctc gtgtgcaccg tgtgggccat ctcggccctg gtgtccttcc 661 tgcccatcct catgcactgg tggcgggcgg agagcgacga ggcgcgccgc tgctacaacg 721 accccaagtg ctgcgacttc gtcaccaacc gggcctacgc catcgcctcg tccgtagtct 781 ccttctacgt gcccctgtgc atcatggcct tcgtgtacct gcgggtgttc cgcgaggccc 841 agaagcaggt gaagaagatc gacagctgcg agcgccgttt cctcggcggc ccagcgcggc 901 cgccctcgcc ctcgccctcg cccgtccccg cgcccgcgcc gccgcccgga cccccgcgcc 961 ccgccgccgc cgccgccacc gccccgctgg ccaacgggcg tgcgggtaag cggcggccct 1021 cgcgcctcgt ggccctacgc gagcagaagg cgctcaagac gctgggcatc atcatgggcg 1081 tcttcacgct ctgctggctg cccttcttcc tggccaacgt ggtgaaggcc ttccaccgcg 1141 agctggtgcc cgaccgcctc ttcgtcttct tcaactggct gggctacgcc aactcggcct 1201 tcaaccccat catctactgc cgcagccccg acttccgcaa ggccttccag ggactgctct 1261 gctgcgcgcg cagggctgcc cgccggcgcc acgcgaccca cggagaccgg ccgcgcgcct 1321 cgggctgtct ggcccggccc ggacccccgc catcgcccgg ggccgcctcg gacgacgacg 1381 acgacgatgt cgtcggggcc acgccgcccg cgcgcctgct ggagccctgg gccggctgca 1441 acggcggggc ggcggcggac agcgactcga gcctggacga gccgtgccgc cccggcttcg 1501 cctcggaatc caaggtgtag ggcccggcgc ggggcgcgga ctccgggcac ggcttcccag 1561 gggaacgagg agatctgtgt ttacttaaga ccgatagcag gtgaactcga agcccacaat 1621 cctcgtctga atcatccgag gcaaagagaa aagccacgga ccgttgcaca aaaaggaaag 1681 tttgggaagg gatgggagag tggcttgctg atgttccttg ttg Hint

11 DEFINITION Homo sapiens adrenergic, beta-1-, receptor
ACCESSION NM_000684
SOURCE ORGANISM human
REFERENCE 1 AUTHORS Frielle, Collins, Daniel, Caron, Lefkowitz, Kobilka TITLE Cloning of the cDNA for the human beta 1-adrenergic receptor
REFERENCE 2 AUTHORS Frielle, Kobilka, Lefkowitz, Caron TITLE Human beta 1- and beta 2-adrenergic receptors: structurally and functionally related receptors derived from distinct genes

Speaker notes: We will model this data using the different approaches to give a feel for the issues in each; the focus is on data model + querying. Note that the formatted file resources have their own tags. Source of data: GenBank. This membrane-bound protein is responsible for detecting adrenalin (epinephrine), the first step in a chain of biochemical reactions that put the cell in a stressed state: adrenalin binds to the beta-receptor and induces a conformational change, which lets an associated membrane-bound G-protein lose an alpha sub-unit; the alpha-unit changes a GDP to a GTP and activates a neighboring membrane-bound adenylate cyclase; the adenylate cyclase synthesizes (converts ATP to) a lot of cAMP and then shuts down; the cAMP level in the cell activates a cAMP-dependent protein kinase (cAPK) that mediates the signal to the cell response through phosphorylation (this phosphorylation puts the cell in a 'stressed state').

12 Entity-relationship
(ER diagram: entity PROTEIN with attributes protein-id, accession, source, definition; entity ARTICLE with attributes article-id, title, author; m:n relationship Reference between PROTEIN and ARTICLE.)

13 Databases / Data sources
(Diagram, repeated from slide 7: queries against the information model, answers returned; processing of queries/updates, access to stored data, physical database management.)

14 How is the information stored? (high level) How is the information accessed? (user level)
(Diagram: four approaches ordered by increasing structure and precision: Text (IR), Semi-structured data, Data models (DB), Rules + Facts (KB).)

15 IR - formal characterization
Information retrieval model: (D,Q,F,R)
D is a set of document representations
Q is a set of queries
F is a framework for modeling document representations, queries and their relationships
R associates a real number to document-query pairs (ranking)

16 IR - Boolean model
Terms: cloning, adrenergic, receptor. Document vectors: Doc1 (1 1 0), Doc2 (0 1 0).
Q1: cloning and (adrenergic or receptor) --> (1 1 0) or (1 1 1) or (0 1 1). Result: Doc1
Q2: cloning and not adrenergic --> (0 1 0) or (0 1 1). Result: Doc2

Speaker notes (Boolean model): the query is rewritten in disjunctive normal form, i.e. a disjunction of conjunctions of literals.
Q1: cloning and (adrenergic or receptor)
= (cloning and adrenergic) or (cloning and receptor)
= (cloning and adrenergic and receptor) or (cloning and adrenergic and not receptor) or (cloning and receptor and not adrenergic)
= (1 1 1) or (1 1 0) or (0 1 1)
Q2: cloning and not adrenergic
= (cloning and not adrenergic and receptor) or (cloning and not adrenergic and not receptor)
= (0 1 1) or (0 1 0)
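As an illustration, here is a minimal Python sketch of Boolean-model matching on this toy example. One assumption is made: the document vectors are read with the coordinate order (adrenergic, cloning, receptor), which is the order the DNF expansions above use, so Doc1 contains adrenergic and cloning while Doc2 contains only cloning.

# Toy Boolean-model matching (sketch; the coordinate order is an assumption, see above).
docs = {
    "Doc1": {"adrenergic", "cloning"},   # (1 1 0)
    "Doc2": {"cloning"},                 # (0 1 0)
}

def q1(terms):  # cloning and (adrenergic or receptor)
    return "cloning" in terms and ("adrenergic" in terms or "receptor" in terms)

def q2(terms):  # cloning and not adrenergic
    return "cloning" in terms and "adrenergic" not in terms

print([d for d, t in docs.items() if q1(t)])  # ['Doc1']
print([d for d, t in docs.items() if q2(t)])  # ['Doc2']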

17 IR - Vector model (simplified)
Doc1 (1,1,0), Doc2 (0,1,0), Q (1,1,1) over the terms cloning, receptor, adrenergic.

Speaker notes (vector space model):
- define an n-dimensional space with n = number of words
- represent each document as a vector: for each word, the corresponding coordinate is 1 if the word occurs in the document, else 0
- represent the query in a similar way
- the angle between the vectors is an indication of similarity: sim(d,q) = cos alpha = (d . q) / (|d| x |q|)
Note: |q| has no influence on the ranking; |d| gives a normalization (the length of d depends on the number of words in the document). This example uses a simplified model; normally the coordinates are weights representing the importance and selectiveness of words in the document or query.
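A minimal sketch of this similarity computation in Python, on the slide's vectors:

import math

# Cosine similarity between a document vector d and a query vector q.
def cosine(d, q):
    dot = sum(di * qi for di, qi in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(x * x for x in q))
    return dot / norm

doc1, doc2, q = (1, 1, 0), (0, 1, 0), (1, 1, 1)
print(cosine(doc1, q))  # about 0.82
print(cosine(doc2, q))  # about 0.58, so Doc1 is ranked above Doc2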

18 Semi-structured data
(Tree diagram, Protein DB: a PROTEIN node with children DEFINITION "Homo sapiens adrenergic, beta-1-, receptor", ACCESSION NM_000684, SOURCE human, and REFERENCE nodes, each with a TITLE ("Cloning of …", "Human beta-1 …") and AUTHOR children (Frielle, Collins, Daniel, Caron, Lefkowitz, Kobilka).)

19 Semi-structured data - Queries
Semi-structured data - Queries select source from PROTEINDB.protein P where P.accession = ”NM_000684”;

20 Relational databases
(Tables:
PROTEIN(PROTEIN-ID, ACCESSION, SOURCE, DEFINITION) with the row (1, NM_000684, human, "Homo sapiens adrenergic, beta-1-, receptor");
REFERENCE(PROTEIN-ID, ARTICLE-ID) linking protein 1 to articles 1 and 2;
ARTICLE(ARTICLE-ID, ARTICLE-TITLE) with the titles "Cloning of the cDNA for the human beta 1-adrenergic receptor" and "Human beta 1- and beta 2-adrenergic receptors: structurally and functionally related receptors derived from distinct genes";
ARTICLE-AUTHOR(ARTICLE-ID, AUTHOR) with the authors Frielle, Collins, Daniel, Caron, Lefkowitz, Kobilka.)

21 Relational databases - SQL
select source
from protein
where accession = 'NM_000684';
(Evaluated over the PROTEIN table of slide 20; the answer is human.)
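A minimal runnable sketch of the same query using Python's sqlite3 module; the table and its single row mirror the PROTEIN table above (set up here only for the example).

import sqlite3

# In-memory table mirroring the PROTEIN relation on slide 20.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (protein_id INTEGER, accession TEXT, source TEXT, definition TEXT)")
conn.execute("INSERT INTO protein VALUES (1, 'NM_000684', 'human', 'Homo sapiens adrenergic, beta-1-, receptor')")

rows = conn.execute("SELECT source FROM protein WHERE accession = 'NM_000684'").fetchall()
print(rows)  # [('human',)]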

22 Evolution of Database Technology
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: Advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, temporal, multimedia, etc.)
1990s: Data mining, data warehousing, multimedia databases, and Web databases
2000s: Stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems; NoSQL databases

23 Knowledge bases
(F) source(NM_000684, Human)
(R) source(P?, Human) => source(P?, Mammal)
(R) source(P?, Mammal) => source(P?, Vertebrate)
Q: ?- source(NM_000684, Vertebrate)   A: yes
Q: ?- source(x?, Mammal)   A: x? = NM_000684

Speaker notes: The fact F says that the source organism of NM_000684 is Human. The rules say that if the source organism is Human then it is also Mammal, and if Mammal then Vertebrate. "Is Vertebrate the source organism of NM_000684?" is answered yes, inferred using F + 2 rules. "Retrieve all the entries that have Mammal as source organism" is answered using F + 1 rule.
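How such inference could work can be illustrated with a toy forward-chaining loop in Python (the slide does not fix a particular formalism or system; this sketch only illustrates applying the rules to the fact):

# Toy forward chaining over the fact and rules of the slide.
facts = {("NM_000684", "Human")}
rules = [("Human", "Mammal"), ("Mammal", "Vertebrate")]  # if source(x, A) then source(x, B)

changed = True
while changed:
    changed = False
    for premise, conclusion in rules:
        for entry, cls in list(facts):
            if cls == premise and (entry, conclusion) not in facts:
                facts.add((entry, conclusion))
                changed = True

print(("NM_000684", "Vertebrate") in facts)      # True (first query)
print([e for e, c in facts if c == "Mammal"])    # ['NM_000684'] (second query)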

24 Interested in more?
732A57 Database Technology (relational databases)
TDDD43 Advanced data models and databases (IR, semi-structured data, DB, KB)
732A47 Text mining (includes IR)

25 Analytics

26 Analytics
Discovery, interpretation and communication of meaningful patterns in data

27 Analytics - IBM
Descriptive: What is happening? (reporting, analysis, content analytics)
Diagnostic: Why did it happen? (discovery and explanation)
Predictive: What could happen? (predictive analytics and modeling)
Prescriptive: What action should I take? (decision management)
Cognitive: What did I learn, what is best?

28 Analytics - Oracle
Classification, regression, clustering, attribute importance, anomaly detection, feature extraction and creation, market basket analysis

29 Why Analytics? The Explosive Growth of Data
Data collection and data availability: automated data collection tools, database systems, the Web, a computerized society
Major sources of abundant data: business (Web, e-commerce, transactions, stocks, …), science (remote sensing, bioinformatics, scientific simulation, …), society and everyone (news, digital cameras, YouTube)
We are drowning in data, but starving for knowledge!

30 Ex.: Market Analysis and Management
Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing: find clusters of "model" customers who share the same characteristics (interests, income level, spending habits, etc.); determine customer purchasing patterns over time
Cross-market analysis: find associations/correlations between product sales, and predict based on such associations
Customer profiling: what types of customers buy what products (clustering or classification)
Customer requirement analysis: identify the best products for different groups of customers; predict what factors will attract new customers
Provision of summary information: multidimensional summary reports; statistical summary information (data central tendency and variation)
Market basket analysis: computer and printer sell well together

31 Ex.: Fraud Detection & Mining Unusual Patterns
Approaches: clustering & model construction for frauds, outlier analysis
Applications: health care, retail, credit card service, telecommunications
Auto insurance: ring of collisions
Money laundering: suspicious monetary transactions
Medical insurance: professional patients, ring of doctors, and ring of references; unnecessary or correlated screening tests
Telecommunications: phone-call fraud; phone call model (destination of the call, duration, time of day or week); analyze patterns that deviate from an expected norm
Anti-terrorism

32 Knowledge Discovery (KDD) Process
(Diagram: databases -> data cleaning and data integration -> data warehouse -> selection and transformation -> task-relevant data -> data mining -> pattern evaluation and presentation.)
Preprocessing steps:
Data cleaning: remove noise and inconsistent data
Data integration: multiple sources can be combined
Data selection: data relevant to the analysis task is retrieved from the database
Data transformation: data transformed into forms appropriate for mining (e.g. summarizing, aggregating)
Data mining: extract data patterns using 'intelligent' methods
Pattern evaluation: identify truly interesting patterns, interestingness measures
Knowledge presentation: present knowledge to the user (visualization and knowledge representation)
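A compact sketch of the first preprocessing steps on made-up transaction data, using pandas (the data and column names are hypothetical, purely to make the steps concrete):

import pandas as pd

# Made-up transaction data to illustrate cleaning, selection and transformation.
raw = pd.DataFrame({
    "customer": ["a", "a", "b", None, "c"],
    "item":     ["bread", "milk", "bread", "milk", "beer"],
    "amount":   [10, 12, 9, 11, None],
})

clean = raw.dropna()                                      # data cleaning: drop incomplete records
selected = clean[clean["item"].isin(["bread", "milk"])]   # data selection: keep task-relevant data
transformed = selected.groupby("item")["amount"].sum()    # data transformation: aggregate for mining
print(transformed)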

33 Data Mining: Classification Schemes
General functionality: descriptive data mining (characterize the general properties of the data) and predictive data mining (perform inference on the current data to make predictions)
Data to be mined: relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Knowledge to be mined: characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.; multiple/integrated functions and mining at multiple levels
Techniques utilized: database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted: retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

34 Data Mining – what kinds of patterns?
Concept/class description:
Characterization: summarizing the data of the class under study in general terms, e.g. the characteristics of customers spending more than a given amount (in SEK) per year
Discrimination: comparing the target class with other (contrasting) classes, e.g. compare the characteristics of products that had a sales increase to products that had a sales decrease last year

35 Data Mining – what kinds of patterns?
Frequent patterns, associations, correlations
Frequent itemset: items frequently appearing together in transactional data (e.g. bread + milk; computer + printer)
Frequent sequential pattern: sequence patterns (e.g. first buy a computer, then a camera, then a memory card)
Frequent structured pattern: graphs, trees, lattices
E.g. buy(X, "Diaper") => buy(X, "Beer") [support=0.5%, confidence=75%]
confidence: if X buys a diaper, then there is a 75% chance that X buys beer
support: of all transactions under consideration, 0.5% showed that diaper and beer were bought together
E.g. age(X, "20..29") and income(X, "20k..29k") => buys(X, "cd-player") [support=2%, confidence=60%]
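The two measures are easy to compute directly. Below is a toy Python calculation of support and confidence for the rule {diaper} => {beer} over four made-up transactions (illustration only; the percentages on the slide come from other data):

# Support and confidence for the association rule {diaper} => {beer}.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]

both = sum(1 for t in transactions if {"diaper", "beer"} <= t)   # transactions with diaper and beer
antecedent = sum(1 for t in transactions if "diaper" in t)       # transactions with diaper

support = both / len(transactions)   # 0.5: half of all transactions contain both items
confidence = both / antecedent       # about 0.67: 2 of the 3 diaper transactions also contain beer
print(support, confidence)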

36 Data Mining – what kinds of patterns?
Classification and prediction
Construct models (functions) that describe and distinguish classes or concepts for future prediction. The derived model is based on analyzing training data, i.e. data whose class labels are known. E.g., classify countries based on climate, or classify cars based on gas mileage.
Predict some unknown or missing numerical values
Decision trees, neural nets
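A minimal sketch of learning such a model from labeled training data, here with scikit-learn's decision tree on made-up car data (the library choice, features and labels are all assumptions for illustration):

from sklearn.tree import DecisionTreeClassifier

# Made-up training data: [engine size (litres), weight (kg)] -> mileage class.
X_train = [[1.2, 900], [1.6, 1100], [3.0, 1800], [4.0, 2100]]
y_train = ["high", "high", "low", "low"]

model = DecisionTreeClassifier().fit(X_train, y_train)   # learn from data with known labels
print(model.predict([[1.4, 1000], [3.5, 2000]]))         # predict classes for unseen cars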

37 Data Mining – what kinds of patterns?
Cluster analysis: the class label is unknown; group data to form new classes, e.g., cluster customers to find target groups for marketing; maximizing intra-class similarity & minimizing inter-class similarity
Outlier analysis: an outlier is a data object that does not comply with the general behavior of the data; noise or exception? Useful in fraud detection and rare-events analysis
Trend and evolution analysis: trend and deviation
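A small clustering sketch in the same spirit, using scikit-learn's k-means on made-up customer data (again, the library and the data are assumptions for illustration):

from sklearn.cluster import KMeans

# Made-up customer data: [age, yearly spend]; no class labels are given.
X = [[25, 300], [27, 350], [24, 320], [55, 1500], [60, 1600], [58, 1550]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # group the data into new classes
print(km.labels_)   # cluster label per customer, e.g. [0 0 0 1 1 1]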

38 Interested in more?
732A95 Introduction to machine learning
732A61 Data mining – clustering and association analysis

39 Big Data

40 Big Data
Data so large that it becomes difficult to process using a 'traditional' system
E.g. a 100 MB file you cannot send, a 100 GB image you cannot view, a 100 TB video you cannot edit
Prefixes (powers of 10): kilo (3), mega (6), giga (9), tera (12), peta (15), exa (18), zetta (21), yotta (24)

41 Big Data – 3Vs
Volume: size of the data

42 Volume - examples
Facebook processes 500 TB per day
Walmart handles 1 million customer transactions per hour
Airbus generates 640 TB in one flight (10 TB per 30 minutes)
72 hours of video uploaded to YouTube every minute
SMS, internet, social media

43 https://y2socialcomputing.files.wordpress.com/2012/06/social-media-visual-last-blog-post-what-happens-in-an-internet-minute-infographic.jpg

44 Big Data – 3Vs
Volume: size of the data
Variety: type and nature of the data (text, semi-structured data, databases, knowledge bases)
10% of the data is structured
Challenge: to extract meaningful information

45 Linked open data Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak.

46 Linked open data of US government
Format (# datasets): HTML (27005), XML (24077), PDF (19628), CSV (10058), JSON (8948), RDF (6153), JPG (5419), WMS (5019), Excel (3389), WFS (2781)

47 Big Data – 3Vs
Volume: size of the data
Variety: type and nature of the data
Velocity: speed of generation and processing of data

48 Velocity - examples Traffic data Financial market Social networks

49

50 Big Data – other Vs
Variability: inconsistency of the data
Veracity: quality of the data
Value: useful analysis results
…

51 Applications

52 BDA system architecture
(Layer diagram, top to bottom: specialized services for domain A and domain B, Big Data Services Layer, Knowledge Management Layer, Data Storage and Management Layer.)
Data storage and management layer: development of a general software environment for computational sciences in big data domains. This software environment must exhibit flexibility in terms of the data storage and data management solutions so as to guarantee its generality. The suitability of such solutions is highly dependent on the application area. Therefore, for each application area, an area-specific definition of the properties that make such solutions suitable requires a close collaboration between domain experts and data management systems experts.
Knowledge management layer: integration of various semantic technologies such as ontologies and graphical models. Graphical models can be used to improve and merge existing ontologies which, in turn, can be used to assist in the learning of the final graphical models which, in turn, can be used to produce human-understandable reasoning. Due to these dependencies, the benefit of such integration can be maximized only by a coordinated effort of partners with expertise in these different types of semantic technologies.
Big data services layer: development of parallel algorithms for learning graphical models from big data, and reasoning with big models. Our project environment provides a unique opportunity to leverage knowledge from domain experts in machine learning, computer architectures, software design, and performance of distributed systems within the same environment. This is necessary to achieve the best results.

53 BigMecs
Specialized services for materials design: deployment of the knowledge system built in the project to design materials with the desired features efficiently. In addition to important contributions to material sciences, we expect that the integration of this service and the other two specialized services into our complete system framework will help demonstrate the power and flexibility of our system. Without a cross-disciplinary research team this would not be possible.
Specialized services for environmental sciences: deployment of the knowledge system built in the project to simulate and analyze models of freshwater flow and nutrients for large parts of the world, improving existing techniques in terms of processing speed, quality of the results, and interpretability of the results.
Specialized services for computational criminology: deployment of the knowledge system built in the project to extract actionable information from large information streams in real time, as well as to perform post-processing and reasoning to help law enforcement in their duties.

54 BDA system architecture
(Diagram as on slide 52, with the Data Storage and Management Layer highlighted.)
Large amounts of data, distributed environment
Unstructured and semi-structured data
Not necessarily a schema
Heterogeneous
Streams
Varying quality

55 Data Storage and management – this course
NoSQL databases
OLTP vs OLAP
Horizontal scalability
Consistency, availability, partition tolerance
Data management
Hadoop
Data management systems

Speaker notes:
OLTP covers the traditional applications we usually think about when talking about databases: a university database (students, teachers, courses), airline ticket reservations, credit card processing. Many concurrent users; high availability and fast response times for all users; many reads, writes and updates.
OLAP covers data warehouses for more complex analysis: preloaded, mainly reads, optimized for read operations.
Online analytical processing (OLAP) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is an effectiveness measure. OLAP applications are widely used by data mining techniques. OLAP databases store aggregated, historical data in multi-dimensional schemas (usually star schemas) and typically have a data latency of a few hours. The OLAP approach is used to analyze multidimensional data from multiple sources and perspectives; the three basic operations are roll-up (consolidation), drill-down, and slicing & dicing.
Online transaction processing (OLTP) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). OLTP systems emphasize very fast query processing and maintaining data integrity in multi-access environments; effectiveness is measured by the number of transactions per second. OLTP databases contain detailed and current data. The schema used to store transactional databases is the entity model (usually 3NF); normalization is the norm for data modeling in such systems.
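To make the OLTP/OLAP contrast concrete, here is a small self-contained sketch in Python using sqlite3; the tables, columns and data are hypothetical and only illustrate the two workload styles:

import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables for the two workload styles.
conn.execute("CREATE TABLE seats (flight TEXT, seat TEXT, passenger TEXT)")
conn.execute("INSERT INTO seats VALUES ('SK123', '14C', NULL)")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("north", 2015, 100.0), ("north", 2016, 120.0), ("south", 2016, 80.0)])

# OLTP-style: a short transaction that updates a single current row (many such per second).
conn.execute("UPDATE seats SET passenger = 'Smith' WHERE flight = 'SK123' AND seat = '14C'")

# OLAP-style: a read-only aggregation over historical data (few, complex queries).
print(conn.execute("SELECT region, year, SUM(amount) FROM sales GROUP BY region, year").fetchall())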

56 BDA system architecture
(Diagram as on slide 52, with the Knowledge Management Layer highlighted.)
Semantic technologies
Integration
Knowledge acquisition

57 Knowledge management – this course
Not a focus topic in this course. For semantic and integration approaches, see TDDD43.

58 BDA system architecture
(Diagram as on slide 52, with the Big Data Services Layer highlighted.)
Analytics services for Big Data

59 Big Data Services – this course
Big data versions of analytics/data mining algorithms

60 Databases Parallel programming Machine learning

61 2016: course for SDM year 1 and year 2
→ some review/introduction lectures

62 Course overview
Review Databases (lectures + labs)
Python (lectures + labs)
Databases for Big Data (lectures + lab)
Parallel algorithms for processing Big Data (lectures + lab + exercise session)
Machine Learning for Big Data (lectures + lab)
Visit to National Supercomputer Centre

63 Info
Results reported in connection to exams
Info about handing in labs is on the web; strong recommendation to hand in as soon as possible
Sign up for labs via the web (in pairs)

64 Info
Relational database labs require a special database account → make sure you are registered for the course
BDA labs require special access to NSC resources → fill out the forms

65 Info
Lab deadlines: final deadlines in connection to the exams; no reporting between exams
HARD DEADLINE: March exam (no guarantee that NSC resources are available after April)

66 Examination
Written exam
Labs

67 What if I already took …? What if I also take…?
TDDD37/732A57 Database technology: RDB labs 1-2 in one of the courses, results registered for both
732A47 Text mining: Python labs in one of the courses, results registered for both

68 Changes w.r.t. last year New course

69 My own interest and research
Modeling of data
Ontologies: ontology engineering; ontology alignment (winner of the Anatomy track at OAEI 2008, organizer of OAEI tracks since 2013); ontology debugging (founder and organizer of WoDOOM/CoDeS since 2012)
Ontologies and databases for Big Data
Former work: knowledge representation, data integration, knowledge-based information retrieval, object-centered databases

70

