Presentation on theme: "Data, Information, and Databases BDIS 6.1"— Presentation transcript:
1 Data, Information, and Databases BDIS 6.1 BSAD 141 Dave Novak
2 Topics Covered: Information types (transactional vs. analytical). Five characteristics of information quality. Database versus a DBMS. RDBMS: advantages and terminology. Multi-user issues.
3 The Need for High-Quality Information: Data are everywhere. Which data are important? Which data should the organization store? Which data need to be further manipulated? Which data are required to make different types of decisions? How does the organization convert various data into the information that is needed?
4 The Need for High-Quality Information Recall difference between data and information from Lecture #1
5 The Need for High-Quality Information: Organizations need to obtain and analyze many different levels, formats, and granularities of information to make decisions. Granularity refers to the extent of detail within the information (fine and detailed, or “coarse” and abstract).
6 The Need for High-Quality Information: CRITICAL TO REMEMBER! Decisions are only as good as the quality of the data and information used to make them… Crap in, crap out. Using technology to help you make a decision based on poor-quality data doesn’t help.
7 Data Quality Problems: Example of Low Quality Data. Issue 1: Without a first name it would be impossible to correlate this customer with customers in other databases (Sales, Marketing, Billing, Customer Service) to gain a complete customer view (CRM). Issue 2: Without a complete street address there is no possible way to communicate with this customer via mail or deliveries. An order might be sitting in a warehouse waiting for the complete address before shipping. The company has spent time and money processing an order that might never be completed. Issue 3: If this is the same customer, the company will waste money sending out two sets of promotions and advertisements to the same customer. It might also send two identical orders and have to incur the expense of one order being returned. Issue 4: This is a good example of where cleaning data is difficult because this may or may not be an error. There are many times when a phone and a fax have the same number. Since the phone number is also in the address field, chances are that the number is inaccurate. Issue 5: The business would have no way of communicating with this customer via … Issue 6: The company could determine the area code based on the customer’s address. This takes time, which costs the company money. This is a good reason to ensure that information is entered correctly the first time. All incorrect information needs to be fixed, which costs time and money.
8 Characteristics of High Quality Information: 1) Accurate 2) Complete 3) Consistent 4) Unique 5) Timely
9 1) Accurate: Are the data (is the information) correct, precise, and exact? For example: Are the data factual? Are values error-free? Have data been verified? Correct spelling. Precise numbers. Accuracy: Are all the values correct? For example, is the name spelled correctly? Is the dollar amount recorded properly?
10 2) Complete: Are the data whole (complete) and do they have all the necessary parts? For example: Are there missing values or pieces of data? Full street address. Area code along with phone number. Empty fields. Full names. Completeness: Are any of the values missing? For example, is the address complete, including street, city, state, and zip code?
11 3) Consistent: Are the data in agreement with themselves and with known facts? For example: Does summary information agree with detailed information? Can you reconcile the data? Do mathematical manipulations yield correct results? Are data manipulations performed consistently for the entire data set? Consistency: Is aggregate or summary information in agreement with detailed information? For example, do all total fields equal the true total of the individual fields?
12 4) Unique: Are the data unique (one of a kind), or are there redundant, repetitious, or unnecessary data stored in the same database? For example: Are there duplicate records for the same “event”? Are there different versions of “the same” file or event (which is the latest or most accurate)? Uniqueness: Is each transaction, entity, and event represented only once in the information? For example, are there any duplicate customers?
13 5) Timely: Are the data current with respect to decision-making needs? Timeliness depends on the situation. Real-time information: immediate, up-to-date information. Real-time system: provides real-time information in response to requests. “Real-time” is a relative description that depends on the use or need. Timeliness: Is the information current with respect to the business requirements? For example, is information updated weekly, daily, or hourly?
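A minimal Python sketch of how three of the five characteristics (completeness, uniqueness, and consistency) might be checked automatically. The customer records, field names, and helper functions are hypothetical and exist only for illustration; accuracy and timeliness usually need outside knowledge and are not covered here.

```python
# Hypothetical customer records; the field names are assumptions for illustration.
customers = [
    {"id": 1, "first": "Ann", "last": "Lee", "street": "12 Elm St", "phone": "802-555-0101"},
    {"id": 2, "first": "", "last": "Smith", "street": "", "phone": "555-0102"},
    {"id": 3, "first": "Ann", "last": "Lee", "street": "12 Elm St", "phone": "802-555-0101"},
]


def incomplete_ids(records, required=("first", "last", "street", "phone")):
    """Completeness: ids of records that have an empty required field."""
    return [r["id"] for r in records if any(not r[f] for f in required)]


def duplicate_ids(records, key_fields=("first", "last", "street")):
    """Uniqueness: ids of records that repeat an earlier record."""
    seen = set()
    dupes = []
    for r in records:
        key = tuple(r[f].strip().lower() for f in key_fields)
        if key in seen:
            dupes.append(r["id"])
        seen.add(key)
    return dupes


def totals_agree(line_amounts, reported_total):
    """Consistency: does the summary total equal the sum of the detail lines?"""
    return round(sum(line_amounts), 2) == round(reported_total, 2)


print(incomplete_ids(customers))
print(duplicate_ids(customers))
print(totals_agree([19.99, 5.01], 25.00))
```

Record 2 is flagged as incomplete (no first name or street) and record 3 as a duplicate of record 1, which mirrors Issues 1 to 3 on the low-quality data slide.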
14 How can data be of “poor” quality? Customers intentionally enter inaccurate information to protect their privacy or because they are irritated. Different data entry standards and formats. Operators enter abbreviated or erroneous information by accident or to save time. Third-party and external information contains inconsistencies, inaccuracies, and errors. Addressing the above sources of information inaccuracies will significantly improve the quality of organizational information. Determine a few additional sources of low quality information: for example, a customer service representative could accidentally transpose a number in an address or misspell a last name.
15 What is a Database? Database: a collection of information organized in a way that provides efficient retrieval. There are electronic and physical (paper/print) databases. A database can be a very simple collection of data, such as names arranged alphabetically in an address book.
16 What is a Database Management System (DBMS)? Database management system (DBMS): a set of computer programs / software that allows users to store, modify, query, and retrieve data in a systematic and controlled manner.
17 Database Management System (DBMS): A database (the physical collection of data) is typically not portable across different DBMS. Like application software, different DBMS are generally designed to work with specific system software and specific database schemas. A database is typically something inside the DBMS, although in the case of an MS Excel workbook the database is a standalone object.
18 Database Management System (DBMS): A very popular and common type of DBMS is the relational DBMS (RDBMS). The standard programming and user interface is the Structured Query Language (SQL), a programming language used to create, modify, and retrieve information from a database. Different DBMS products use different (proprietary) variations of standard SQL.
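A short sketch of the three things SQL is described as doing (create, modify, retrieve), run through Python's built-in sqlite3 module. SQLite is used only because it ships with Python; the "student" table and its contents are made up for illustration, and other RDBMS products would accept similar statements with their own variations.

```python
import sqlite3

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Create: define a table.
conn.execute(
    "CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT, major TEXT)"
)

# Modify: add rows, then change one of them.
conn.executemany(
    "INSERT INTO student (student_id, name, major) VALUES (?, ?, ?)",
    [(1, "Ann Lee", "Business"), (2, "Bo Chen", "Economics")],
)
conn.execute("UPDATE student SET major = 'Business' WHERE student_id = 2")

# Retrieve: query the data.
business_students = conn.execute(
    "SELECT name FROM student WHERE major = 'Business' ORDER BY student_id"
).fetchall()
print(business_students)
conn.close()
```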
19 Database Management System (DBMS): According to the following source (which I did not verify with the Gartner report), the top five commercial RDBMS vendors in 2011 were: Oracle (≈ 50% market share), IBM (≈ 20% market share), Microsoft (≈ 17% market share), SAP, and Teradata.
20 Database Management System (DBMS): Oracle: Oracle Database and MySQL. IBM: DB2 and Informix. Microsoft: SQL Server. SAP: Sybase Enterprise and Sybase IQ. Teradata.
21 Single File Data Management: MS Excel is a database, but it is not a DBMS! Each worksheet is a single large two-dimensional matrix. A database is simply an organized collection of data that can be accessed. A DBMS is software that is used to manage the database and provides a set of tools used to manipulate and query data.
22 Relational Database Management System (RDBMS): Data are organized as a set of formal tables. Data can be accessed and combined in different ways without reorganizing the data within the tables. Data can be manipulated in different ways and combined with data in other tables without altering the original data in the table. An RDBMS can be easily extended / scaled: new data and new categories of data can be added without changing existing data.
23 RDBMS Terminology: Data model: a picture of logical data structures that details the relationships among data elements. Metadata: a formal description of data structures (like tables and fields) and any constraints on the table or on values within the table; in short, data about the containers of data.
24 RDBMS Terminology: Data dictionary: compiles all of the metadata about the elements in the data model.
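To make "metadata" concrete, here is a small sketch that asks SQLite to describe a table and collects the answer into a simple data dictionary. The "student" table is hypothetical, and PRAGMA table_info is specific to SQLite; other DBMS products expose their metadata through different commands or catalog tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student ("
    " student_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " major TEXT)"
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
data_dictionary = [
    {
        "column": name,
        "type": col_type,
        "required": bool(notnull),
        "primary_key": bool(pk),
    }
    for _cid, name, col_type, notnull, _default, pk in conn.execute(
        "PRAGMA table_info(student)"
    )
]

for entry in data_dictionary:
    print(entry)
conn.close()
```

Note that the output describes the containers (column names, types, constraints), not any student data.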
25 Entity Sets (Tables): Relational table or entity set: each table consists of columns (fields/attributes) and rows (records/entities). The table has a name that describes the group of related entities within the table. For example, a table labeled “Student” would contain a group of student entities.
26 Entity / Record / Row: A person, place, thing, transaction, or event about which data are being collected and stored. The individual rows in a table contain entities. Each row is also referred to as a record. Example?
27 Attributes / Field / Column: The data elements that describe the characteristics of a specific entity. The columns in each table contain the attributes. Example?
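One possible answer to the "Example?" prompts, as a sketch: in a hypothetical "student" table, each row is one student (an entity / record) and each column is one characteristic of a student (an attribute / field). The table and values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT, major TEXT)"
)
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Ann Lee", "Business"), (2, "Bo Chen", "Economics")],
)

cursor = conn.execute("SELECT * FROM student ORDER BY student_id")
attributes = [col[0] for col in cursor.description]  # column (field) names
entities = cursor.fetchall()                         # one tuple per row (record)

print("Attributes:", attributes)
for entity in entities:
    print("Entity:", entity)
conn.close()
```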
28 What is a Relationship? 1) When designing a relational DB, data need to be separated into tables that contain related data elements. For example, we would not store data related to a customer (name, address, phone, etc.) and data related to the customer’s particular order (orderID, date, shipping method, etc.) in the same table.
29 What is a Relationship? All information specific to a customer would go into a “Customer” table. All information specific to individual orders would go into an “Order” table. We would then create a relationship between the tables to match a particular customer with a particular order.
30 What is a Relationship? A relationship in an RDB is an association between the entities within the different tables. There are THREE (3) types of relationships: One-to-One (1:1), One-to-Many (1:M), and Many-to-Many (M:M).
31 Creating Relationships Through Keys: KEYS are used to create relationships between the entities in different tables in the RDB. There are two kinds: the primary key and the foreign key.
32 Creating Relationships Through Keys: For our purposes: Every table in an RDB MUST have a primary key. The foreign key is not required in every table and will only appear on the “many” side of the relationship.
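The Customer / Order example can be sketched as a one-to-many (1:M) relationship. The column names and sample rows below are assumptions made for illustration; the order table is named "customer_order" here only because ORDER is a reserved word in SQL. The foreign key sits on the "many" side (orders), pointing back at the primary key of the "one" side (customer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,          -- primary key
    name        TEXT
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,          -- primary key
    order_date  TEXT,
    customer_id INTEGER REFERENCES customer(customer_id)  -- foreign key
);
INSERT INTO customer VALUES (1, 'Ann Lee'), (2, 'Bo Chen');
INSERT INTO customer_order VALUES
    (100, '2024-01-05', 1),
    (101, '2024-01-09', 1),
    (102, '2024-02-11', 2);
""")

# Follow the relationship: match each order with its customer.
orders_by_customer = conn.execute("""
    SELECT c.name, o.order_id
    FROM customer AS c
    JOIN customer_order AS o ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()
print(orders_by_customer)
conn.close()
```

One customer (Ann Lee) appears with two orders, while each order belongs to exactly one customer.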
33 Advantages of RDBMS: RDBMS advantages from a business perspective include: 1) Flexibility. 2) Scalability and performance. 3) Improved information integrity (quality), including reduced information redundancy. 4) Information security. A good way to explain databases is to compare them to spreadsheets. What are the limitations when using a spreadsheet? Limited number of rows and columns (Excel 97-2003 worksheets: 65,536 rows by 256 columns; Excel 2007 and later: 1,048,576 rows by 16,384 columns); once you need more rows than the limit, you have outgrown your spreadsheet. Only one user can access the spreadsheet. Users can view all information in the spreadsheet. Users can change all information in the spreadsheet. All of the disadvantages associated with a spreadsheet are addressed when using a database.
34 1) Flexibility: Handle changes quickly and easily. Provide users with different views of the data: arranging data items in different ways depending on the user need, or showing a particular user only some of the available fields while not showing them other fields.
35 1) Flexibility: Schema. Different database schemas can be “owned” by or associated with different users. The schema is a user-personalized set of tables, views, and indexes.
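A sketch of "showing a particular user only some of the available fields" using a SQL view. The "employee" table, the "employee_directory" view, and the sample rows are hypothetical; the point is only that the view exposes name and department while leaving salary out, without changing the underlying table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    name        TEXT,
    department  TEXT,
    salary      REAL
);
INSERT INTO employee VALUES
    (1, 'Ann Lee', 'Sales', 52000),
    (2, 'Bo Chen', 'Finance', 61000);

-- A view is a stored query: a different "window" onto the same data.
CREATE VIEW employee_directory AS
    SELECT name, department FROM employee;
""")

cursor = conn.execute("SELECT * FROM employee_directory ORDER BY name")
visible_fields = [col[0] for col in cursor.description]
directory_rows = cursor.fetchall()
print(visible_fields)
print(directory_rows)
conn.close()
```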
36 2) Scalability and Performance: A DBMS must expand to meet increased demand while maintaining acceptable performance levels. Scalability: refers to how well a system can adapt to increased demands. Performance: measures how quickly a system performs a certain process or transaction. What happens to a business if it suddenly experiences 60 percent growth in sales and its IT systems fail under the increased activity?
37 3) Information Integrity: Information integrity: a measure of information quality; knowing that data have not been entered incorrectly or altered in an unauthorized manner. Integrity constraints: rules that help ensure the quality of information. We will discuss entity integrity and referential integrity (there is also domain integrity). Can you define two relational integrity constraints for an ordering system? Users cannot create an order for a nonexistent customer. An order cannot be shipped without an address. Can you define two business-critical integrity constraints for an ordering system? Product returns are not accepted for fresh product 15 days after purchase. A discount maximum of 20 percent.
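Three of the example constraints can be sketched directly in SQL and left for the DBMS to enforce. This is an illustrative sketch, not a full ordering system: the table and column names are assumptions, "cannot be shipped without an address" is simplified to "cannot be recorded without an address" (NOT NULL), and the PRAGMA line is needed only because SQLite leaves foreign key enforcement off by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id         INTEGER PRIMARY KEY,
    -- referential integrity: the customer must already exist
    customer_id      INTEGER NOT NULL REFERENCES customer(customer_id),
    -- an order must carry an address
    ship_address     TEXT NOT NULL,
    -- business rule: discount maximum of 20 percent
    discount_percent REAL NOT NULL DEFAULT 0
        CHECK (discount_percent BETWEEN 0 AND 20)
);
INSERT INTO customer VALUES (1, 'Ann Lee');
""")


def try_insert(order_id, customer_id, ship_address, discount_percent):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer_order VALUES (?, ?, ?, ?)",
                (order_id, customer_id, ship_address, discount_percent),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "valid order": try_insert(100, 1, "12 Elm St", 10),
    "nonexistent customer": try_insert(101, 99, "12 Elm St", 0),
    "missing address": try_insert(102, 1, None, 0),
    "discount over 20 percent": try_insert(103, 1, "12 Elm St", 25),
}
for case, accepted in results.items():
    print(case, "->", "accepted" if accepted else "rejected")
```

Only the first order is accepted; the other three are rejected by the database itself rather than by application code.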
38 3) Information Integrity: Controlling Redundancy. Redundant data are OK if they serve a specific purpose, such as a backup directly linked to the source. Backup systems promote fault tolerance. Unintentional redundancy is not good: wasted storage, difficult to modify, possible inconsistencies.
39 4) Information Security: Information is an organizational asset and must be protected. RDBMS offer several security features. Access level: determines the level of access each individual user has (who can access the DBMS). Access control: determines the types of things each group can do (types of access, such as power to create, modify, delete, and/or read; which types of SQL statements can be executed). Why would you want to define access level security? Access levels will typically mimic the hierarchical structure of the organization and protect organizational information from being viewed and manipulated by individuals who should not have access to sensitive or confidential information. Low-level employees typically have the lowest levels of access. High-level employees typically have access to all types of database information. For example, you would not want analysts viewing all salary information for the entire company. In general: Analysts can usually only view their own salary. Managers have higher access and can view the salaries of all their team members, but cannot view other managers’ salaries. Directors can view all of their managers’ and analysts’ salaries, but not other directors’ salaries. The CFO and CEO can view every employee’s salary.
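The salary example amounts to a simple rule: you can see your own salary and the salaries of everyone below you in the reporting chain. The sketch below expresses that rule in plain Python with a made-up reporting structure (the CFO is left out for brevity). It is only an illustration of the logic; an actual RDBMS would typically enforce such rules with its own permission features, such as SQL GRANT statements and views.

```python
# Hypothetical reporting structure: employee -> direct manager (None at the top).
reports_to = {
    "ceo": None,
    "director_a": "ceo",
    "director_b": "ceo",
    "manager_a": "director_a",
    "manager_b": "director_b",
    "analyst_a": "manager_a",
    "analyst_b": "manager_b",
}


def is_above(viewer, employee):
    """True if viewer appears somewhere in employee's chain of managers."""
    boss = reports_to[employee]
    while boss is not None:
        if boss == viewer:
            return True
        boss = reports_to[boss]
    return False


def can_view_salary(viewer, employee):
    """Users see their own salary and the salaries of everyone below them."""
    return viewer == employee or is_above(viewer, employee)


print(can_view_salary("analyst_a", "analyst_a"))    # own salary
print(can_view_salary("analyst_a", "analyst_b"))    # another analyst
print(can_view_salary("manager_a", "analyst_a"))    # own team member
print(can_view_salary("manager_a", "manager_b"))    # another manager
print(can_view_salary("director_a", "analyst_a"))   # analyst under own manager
print(can_view_salary("ceo", "director_b"))         # top of the hierarchy
```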
40 Multiuser Issues: DBMS serve many different users with different needs. Many users may require concurrent access to the same data. The DBMS must preserve the integrity of the data and the performance of the system.
41 Multiuser Issues: Problem: if multiple users (say tens or even hundreds of users) access the same data concurrently, how does the DBMS allow one user to change data without that change being overwritten by another user? This is typically referred to as the lost-update problem.
42 Multiuser Issues: Concurrent access is managed through the use of transactions and locks. Transaction: a single indivisible action that affects some data. Once a transaction is committed, it is permanent and its changes are visible to all users. If a transaction is not committed, its changes are “rolled back” (reversed).
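A small sketch of commit and rollback using sqlite3. The "account" table, the balances, and the transfer function are invented for illustration. A transfer consists of two updates that must succeed or fail together; when the second update breaks a rule (no negative balances), the first one is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance REAL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO account VALUES (?, ?)", [("checking", 100.0), ("savings", 50.0)]
)
conn.commit()


def transfer(amount, source, target):
    """Move money as one indivisible unit: both updates commit, or neither does."""
    try:
        with conn:  # commit on success, roll back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    return dict(conn.execute("SELECT name, balance FROM account"))


committed = transfer(30, "checking", "savings")      # both updates succeed
after_commit = balances()
rolled_back = transfer(500, "checking", "savings")   # would overdraw checking
after_rollback = balances()

print(committed, after_commit)
print(rolled_back, after_rollback)
```

After the failed transfer the balances are exactly what they were after the successful one, even though the deposit into savings had already been applied inside the failed transaction.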
43 Multiuser Issues: Locks: a lock literally “locks” the data so that changes cannot be made to the data while another transaction is in process.
45 Learning Outcomes: Five characteristics of quality information. Define database, DBMS, RDBMS, and supporting components and terminology. Advantages of RDBMS. What is SQL? Describe the lost-update problem and how it is addressed.
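The lost-update problem and the effect of a lock can be sketched without a database at all. In the illustration below, a Python dictionary stands in for the stored data and threading.Lock stands in for a DBMS lock; both are simplifications, and the inventory numbers are made up. The first half replays the problematic sequence step by step; the second half lets two users run concurrently but forces each read-modify-write to finish before the next begins.

```python
import threading

inventory = {"widgets": 10}

# Lost update: both users read the same starting value before either writes.
read_by_a = inventory["widgets"]
read_by_b = inventory["widgets"]
inventory["widgets"] = read_by_a - 3   # user A records a sale of 3
inventory["widgets"] = read_by_b - 4   # user B records a sale of 4, overwriting A
lost_update_result = inventory["widgets"]

# With a lock, each read-modify-write completes before the next one starts.
inventory["widgets"] = 10
lock = threading.Lock()


def sell(quantity):
    with lock:  # other users wait here until the lock is released
        current = inventory["widgets"]
        inventory["widgets"] = current - quantity


threads = [threading.Thread(target=sell, args=(q,)) for q in (3, 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
locked_result = inventory["widgets"]

print("Without lock:", lost_update_result)
print("With lock:", locked_result)
```

Seven widgets were sold in total, so the correct remaining count is 3; the unlocked sequence leaves 6 because user A's update was lost.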