Database Systems Design, Implementation, & Management 13th Edition Solution Manual
Appendix B The University Lab: Conceptual Design
Discussion Focus

What actions are taken during the database initial study, and why are those actions important to the database designer?
NOTE We recommend that you use Appendix B’s Table B.1, “A Database Design Map for the University Computer Lab (UCL),” in this and all subsequent discussions about the design process. The design procedure summary should be used as a template in all design and implementation exercises, too. Student feedback indicates that this blueprint is especially helpful when it is used in conjunction with class projects. Use Appendix B’s Figure B.4, “The ER Model Segment for Business Rule 1,” to illustrate how the database design map was used to generate the initial ER diagram.
The database initial study is essentially a process based on data gathering and analysis. Carefully conducted and systematic interviews usually constitute an important part of this process. The initial study must take its cues from an organization's key end users. Therefore, one of the first initial study tasks is to establish who the organization's key end users are. Once the key end users are identified, the initial study must be conducted to establish the following outputs:

• objectives
• organizational structure
• description of operations
• definition of problems and constraints
• description of system objectives
• definition of system scope and boundaries

The database designer cannot expect to develop a usable design unless these outputs are carefully defined and delineated. For example, a database design is not likely to be useful unless and until it accomplishes specific objectives and helps solve an organization's problems. The inherent assumption is that those problems are usually based on the lack of useful and timely information. The value of having such a list of required outputs is also clear because the list constitutes a checklist for the database designer: the designer should not proceed with the database design until all the items on the list have been completed.
What is the purpose of the conceptual design phase, and what is its end product?

The conceptual design phase is the first of three database design phases: conceptual, logical, and physical. The purpose of the conceptual design phase is to develop the following outputs:

• information sources and users
• information needs: user requirements
• the initial ER model
• the definition of attributes and domains

The conceptual design's end product is the initial ER diagram, which constitutes the preliminary database blueprint. It is very unlikely that useful logical and physical designs can be produced unless and until this blueprint has been completed. Too much "design" activity takes place without the benefit of a carefully developed database blueprint. Implementing a database without a good database blueprint almost invariably produces a lack of data integrity based on various data anomalies. In fact, it may easily be argued that implementing a successful database without a good database blueprint is just as likely as writing a great book by stringing randomly selected words together.

Why is an initial ER model not likely to be the basis for the implementation of the database?

ER modeling is an iterative process. The initial ER model may establish many of the appropriate entities and relationships, but it may be impossible to implement such relationships. Also, given the nature of the ER modeling process, it is very likely that the end users will develop a greater understanding of their organization's operations as the modeling proceeds, thus making it possible to establish additional entities and relationships. In fact, it may be argued that one very important benefit of ER modeling is that it is an outstanding communications tool. In any case, before the ER model can be implemented, it must be carefully verified with respect to the business transactions and information requirements.
(Note that students will learn how to develop the verification process in Appendix C.) Clearly, unless and until the ER model accurately reflects an organization's operations and requirements, the development of logical and physical designs is premature. After all, the database implementation is only as good as the final ER blueprint allows it to be!
Answers to Review Questions

1. What factors relevant to database design are uncovered during the initial study phase?

The database initial study phase yields the information required to determine an organization's needs, as well as the factors that influence data generation, collection, and processing. Students must understand that this phase is generally concurrent with the planning phase of the SDLC and that, therefore, several of the initial study activities are common to both.

The most important discovery of the initial study phase is the set of the company's objectives. Once the designer has a clear understanding of the company's main goals and its mission, (s)he can use this understanding as the guide to making all subsequent decisions concerning the analysis, design, and implementation of the database and the information system. The initial study phase also establishes the company's organizational structure; the description of operations; problems, constraints, and alternate solutions; system objectives; and the proposed system scope and boundaries.

The organizational structure and the description of operations are interdependent because operations are usually a function of the company's organizational structure. The determination of structure and operations allows the designer to analyze the existing system and to describe a set of problems, constraints, and possible solutions. Naturally, the designer must find a feasible solution within the existing constraints. In most cases, the best solution is not necessarily the most feasible one. The constraints also force the designer to narrow the focus to the very specific problems that must be solved. In short, the combination of all the factors we have just discussed helps the designer to put together a set of realistic, achievable, and measurable system objectives within the system's required scope and boundaries.

2. Why is the organizational structure relevant to the database designer?
The delivery of information must be timely, it must reach the right people, and the delivered information must be accurate. Since the proper use of timely and accurate information is the key factor in the success of any system, the reports and queries drawn from the database must reach the key decision makers within the organization. Clearly, understanding the organizational structure helps the designer to define the organization's lines of communication and to establish reporting requirements.

3. What is the difference between the database design scope and its boundaries? Why is the scope and boundary statement so important to the database designer?

The system's boundaries are the limits imposed on the database design by external constraints such as available budget and time, the current level of technology, end-user resistance to change, and so on. The scope of a database defines the extent of the database design coverage and reflects a conscious decision to include some things and exclude others. Note that the existence of boundaries usually has an effect on
the system's scope. For legal and practical design reasons, the designer cannot afford to work on an unbounded system. If the system's limits have not been adequately defined, the designer may be legally required to expand the system indefinitely. Moreover, an unbounded system will not contain the built-in constraints that make its use practical in a real-world environment. For example, a completely unbounded system will never be completed, nor may it ever be ready for reasonable use. Even a system with an "optimistic" set of bounds may drag the design out over many years and may cost too much. Keep in mind that company managers almost invariably want least-cost solutions to specific problems.

4. What business rule(s) and relationships can be described for the ERD in Figure QB.4?
Figure QB.4 The ERD for Question 4
The business rules and relationships are summarized in Table QB.4.

Table QB.4 Business Rules and Relationships Summary

Business rule: A supplier supplies many parts. Each part is supplied by many suppliers.
Relationship: many to many (PART - SUPPLIER)

Business rule: A part is used in many products. Each product is composed of many parts.
Relationship: many to many (PRODUCT - PART)

Business rule: A product is bought by many customers. Each customer buys many products.
Relationship: many to many (PRODUCT - CUSTOMER)
Note that the ERD in Figure QB.4 uses the PART_VEND, PART_PROD, and PROD_CUST entities to convert the M:N relationships to a series of 1:M relationships. Also, note the use of two composite entities:

• The PART_VEND entity's composite PK is VEND_ID + PART_CODE.
• The PART_PROD entity's composite PK is PART_CODE + PROD_CODE.

The use of these composite PKs means that the relationship between PART and PART_VEND is strong, as is the relationship between VENDOR and PART_VEND. These strong relationships are indicated through the use of a solid relationship line. No PK has been indicated for the PROD_CUST entity, but the existence of weak relationships (note the dashed relationship lines) lets you assume that the PROD_CUST entity's PK is not a composite one. In this case, a revision of the ERD might include the establishment of a composite PK (PROD_CODE + CUST_NUM) for the PROD_CUST entity. (If you are using Microsoft Visio Professional, declaring the relationships between CUSTOMER and PROD_CUST and between PRODUCT and PROD_CUST to be strong will automatically generate the composite PK PROD_CODE + CUST_NUM.)

5. Write the connectivity and cardinality for each of the entities shown in Question 4.

We have indicated the connectivities and cardinalities in Figure QB.5. (The Crow's Foot ERD combines the connectivity and cardinality depiction through the use of the relationship symbols. Therefore, the use of text boxes, which we have created with the Visio text tool, to indicate connectivities and cardinalities is technically redundant.)
Figure QB.5 Connectivities and Cardinalities
Figure QB.5’s connectivities and cardinalities are reflected in the business rules: • One part can be supplied by one or more suppliers, and a supplier can supply many parts.
• A product is made up of several parts, and a part can be a component of different products.
• A product can be bought by several customers, and a customer can purchase several products.
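The M:N-to-1:M conversion through composite entities described in Questions 4 and 5 can be demonstrated concretely. The sketch below uses Python's sqlite3 to show how a composite PK in a bridge table such as PART_VEND implements two 1:M relationships; the column names and sample data are illustrative, not taken from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# The two "1" sides of the converted relationships.
cur.executescript("""
CREATE TABLE VENDOR (
    VEND_ID   INTEGER PRIMARY KEY,
    VEND_NAME TEXT NOT NULL
);
CREATE TABLE PART (
    PART_CODE TEXT PRIMARY KEY,
    PART_DESC TEXT NOT NULL
);
-- Composite entity: its composite PK (VEND_ID + PART_CODE) makes each
-- vendor/part pairing unique and places PART_VEND on the "many" side
-- of two 1:M relationships.
CREATE TABLE PART_VEND (
    VEND_ID   INTEGER NOT NULL REFERENCES VENDOR (VEND_ID),
    PART_CODE TEXT    NOT NULL REFERENCES PART (PART_CODE),
    PRIMARY KEY (VEND_ID, PART_CODE)
);
""")
cur.execute("INSERT INTO VENDOR VALUES (1, 'Acme'), (2, 'Global')")
cur.execute("INSERT INTO PART VALUES ('P100', 'Gasket'), ('P200', 'Valve')")
# Both vendors supply P100; vendor 1 also supplies P200 (M:N overall).
cur.executemany("INSERT INTO PART_VEND VALUES (?, ?)",
                [(1, 'P100'), (2, 'P100'), (1, 'P200')])
conn.commit()
# Counting bridge rows per part recovers the "many suppliers" side.
rows = cur.execute(
    "SELECT COUNT(*) FROM PART_VEND WHERE PART_CODE = 'P100'"
).fetchone()[0]
print(rows)  # 2
```

Note that the composite PK also prevents a duplicate vendor/part pairing, which is exactly the constraint the composite entity is meant to enforce.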
6. What is a module, and what role does a module play within the system?

A module is a separate and independent collection of application programs that covers a given operational area within an information system. A module accomplishes a specific system function, and it is, therefore, a component of the overall system. For example, a system designed for a retail company may be composed of the modules shown in Figure QB.6.
Figure QB.6 The Retail Company System Modules
(The Retail System comprises four modules: Inventory, Purchasing, Sales, and Accounting.)
Within Figure QB.6's Retail System, each module addresses specific functions. For example:

• The inventory module registers any new item and monitors quantity on hand, reorder quantity, location, and so on.
• The purchasing module registers the orders sent to the suppliers, supplier information, order status, and so on.
• The sales module covers the sales of items to customers, generates the sales slips (invoices), performs credit sales checks, and so on.
• The accounting module covers accounts payable and accounts receivable and generates appropriate financial status reports.

The example demonstrates that each module has a specific purpose and operates on a database subset (external view). Each external view represents the entities of interest for the specific module. However, an entity set may be shared by several modules.

7. What is a module interface, and what does it accomplish?

A module interface is the method through which modules are connected and by which they interact with one another to exchange data and status information. The definition of proper module interfaces is critical for systems development, because such interfaces establish an ordered way through which system components (modules) interchange information. If the module interfaces are not properly defined, even a collection of properly working modules will not yield a useful working system.
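The module-interface idea can be sketched in code. In the minimal example below, a hypothetical sales module requests a stock decrement from a hypothetical inventory module through one published method rather than touching the inventory data directly; all class and method names (InventoryModule, reserve_item, and so on) are illustrative, not from the text.

```python
class InventoryModule:
    """Owns the inventory data; other modules never touch it directly."""

    def __init__(self):
        self._on_hand = {"ITEM-01": 5}   # private module state

    # The interface: the single, well-defined point of interaction.
    def reserve_item(self, item_code: str, qty: int) -> bool:
        if self._on_hand.get(item_code, 0) >= qty:
            self._on_hand[item_code] -= qty
            return True
        return False


class SalesModule:
    """Exchanges data with inventory only through its interface."""

    def __init__(self, inventory: InventoryModule):
        self.inventory = inventory       # the only coupling point

    def sell(self, item_code: str, qty: int) -> str:
        ok = self.inventory.reserve_item(item_code, qty)
        return "invoice written" if ok else "back-ordered"


inv = InventoryModule()
sales = SalesModule(inv)
print(sales.sell("ITEM-01", 3))   # invoice written
print(sales.sell("ITEM-01", 4))   # back-ordered (only 2 left)
```

Because the sales module depends only on the interface, the inventory module's internal data structures can change without breaking the rest of the system, which is the "ordered way of interchanging information" the answer describes.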
Problem Solutions

1. Modify the initial ER diagram presented in Figure B.19 to include the following entity supertype and subtypes: The University Computer Lab USER may be a student or a faculty member.

The answer to problem 1 is included in the answer to problem 2.

2. Using an ER diagram, illustrate how the change you made in problem 1 affects the relationship of the USER entity to the following entities: LAB_USE_LOG, RESERVATION, CHECK_OUT, WITHDRAW.

The new ER diagram segment will contain the supertype USER and the subtypes FACULTY and STUDENT. How the use of these supertype/subtype relationships affects the entities listed here is illustrated in the ER diagram shown in Figure PB.2a.
Figure PB.2a The Crow’s Foot ERD with Supertypes and Subtypes
The ER segment shown in Figure PB.2a reflects the following conditions:

• Not all users are faculty members, so FACULTY is optional to USER.
• Not all users are students, so STUDENT is optional to USER.
• The conditions in the first two bullets are typical of the supertype/subtype implementation.
• Not all faculty members withdraw items, so a faculty member may never show up in the WITHDRAW table. Therefore, WITHDRAW is optional to FACULTY.
• Not all items are necessarily withdrawn; some are never used. Therefore, WITHDRAW is optional to ITEM. (An item that is never withdrawn will never show up in the WITHDRAW table.)
• Not all items are checked out, so an ITEM may never show up in the CHECK_OUT table. Therefore, CHECK_OUT is optional to ITEM.
• Not all users check out items, so it is possible that a USER (a faculty member or a student) never shows up in the CHECK_OUT table. Therefore, CHECK_OUT is optional to USER.
• Not all faculty members place reservations, so RESERVATION is optional to FACULTY.
• Not all students use the lab; that is, some students will never sign the log to check in. Therefore, LOG is optional to STUDENT.

Given the text's initial development of the UCL Management System's ERD, the USER entity was
related to both the WITHDRAW and CHECK_OUT entities. Therefore, there was no way of knowing whether a STUDENT or a FACULTY member was related to WITHDRAW or CHECK_OUT. Although the business rules were quite specific about the relationships, the ER diagram did not reflect them. By adding a new USER supertype and two subtypes, STUDENT and FACULTY, the ERD more closely represents the business rules. The supertype/subtype relationship in Figure PB.2a lets us see that STUDENT is related to LOG and that only FACULTY members can make a RESERVATION and WITHDRAW items. However, both STUDENT and FACULTY can CHECK_OUT items.

While this supertype/subtype solution conforms to the problem solution requirements, the design is far from complete. For example, one would suppose that FACULTY is already a subtype to EMPLOYEE. Also, can a faculty member also be a student? In other words, are the supertypes/subtypes overlapping or disjoint? In this initial ERD, we have assumed overlapping subtypes; that is, a user can be a faculty member and a student at the same time.

Another solution, which would eliminate the USER/FACULTY and USER/STUDENT supertype/subtypes in the ERD, is to add an attribute, such as USER_TYPE, to the USER entity to identify the user as faculty or student. The application software can then be used to enforce the restrictions on various user types. Actually, that approach was used in the final (verified) Computerlab.mdb database on your CD. (The verified database is provided for Appendix C.)

3. Create the initial ER diagram for a car dealership. The dealership sells both new and used cars, and it operates a service facility. Base your design on the following business rules:

a. A salesperson can sell many cars, but each car is sold by only one salesperson.
b. A customer can buy many cars, but each car is sold to only one customer.
c. A salesperson writes a single invoice for each car sold.
d. A customer gets an invoice for each car (s)he buys.
e. A customer might come in just to have a car serviced; that is, one need not buy a car to be classified as a customer.
f. When a customer takes one or more cars in for repair or service, one service ticket is written for each car.
g. The car dealership maintains a service history for each car serviced. The service records are referenced by the car's serial number.
h. A car brought in for service can be worked on by many mechanics, and each mechanic may work on many cars.
i. A car that is serviced may or may not need parts. (For example, parts are not necessary to adjust a carburetor or to clean a fuel injector nozzle.)

As you examine the initial ERD in Figure PB.3a, note that business rules (a) through (d) refer to the relationships of four main entities in the database: SALESPERSON, INVOICE, CUSTOMER, and CAR. Note also that an INVOICE requires a SALESPERSON, a CUSTOMER, and a CAR. Business rule (e) indicates that INVOICE is optional to CUSTOMER and CAR because a CAR is not necessarily sold to a CUSTOMER. (Some customers only have their cars serviced.)

The position of the CAR entity and its relationships to the CUSTOMER and INV_LINE entities is subject to discussion. If the dealer sells the CAR, the CAR entity is clearly related to the INV_LINE that
is related to the INVOICE. (If the car is sold, it generates one invoice line on the invoice. However, the invoice is likely to contain additional invoice lines, such as a dealer preparation charge, destination charge, and so on.) At this point, the discussion can proceed in different directions:

• The sold car can be linked to the customer through the invoice. Therefore, the relationship between CUSTOMER and CAR shown in Figures PB.3a and PB.3b is not necessary.
• If the customer brings a car in for service, whether or not that car was bought at the dealer, the relationship between CUSTOMER and CAR is desirable. After all, when a service ticket is written in the SERVICE_LOG, it would be nice to be able to link the customer to the subsequent transaction. More important, it is the customer who gets the invoice for the service charge. However, if the CUSTOMER-CAR relationship is to be retained, it will be appropriate to make a distinction between the cars in the dealership's inventory, which are not related to a customer at that point, and the cars that are owned by customers. If no distinction is made between customer-owned cars and cars still in the dealership inventory, Figure PB.3a's CAR entity will either have a null CUST_NUM or the CUSTOMER entity must contain a dummy record to indicate a "no customer, dealer-owned" condition.
Figure PB.3a The Car Dealership Initial Crow’s Foot ERD
Regardless of which argument “wins” in the presentation of the various scenarios, remind the students that the ERD to be developed in this exercise is to reflect the initial design. More important, such discussions clearly indicate the need for very detailed descriptions of operations and the development of
precisely written business rules. (It may be useful to review that business rules, which are derived from the description of operations, are written to help define entities, relationships, constraints, connectivities, and cardinalities.)

The dealer's service function is likely to be crucial to the dealer: good service helps generate future sales, and the service function is very likely an important cash flow generator. Therefore, the CAR entity plays an important role. If a customer brings in a car for service and the car was not bought at the dealership, it must be added to the CAR table in order to enable the system to keep a record of the service. This is why we have depicted the CUSTOMER owns CAR relationship in Figures PB.3a and PB.3b. Also, note that the optionality next to CAR reflects the fact that not all cars are owned by a customer: some cars belong to the dealership.

Because Figure PB.3a shows the initial ERD, that ERD will be subject to revision as the description of operations becomes more detailed and accurate, thus modifying some of the existing business rules and creating additional business rules. Therefore, additional entities and relationships are likely to be developed, some optional relationships may become mandatory, and some mandatory relationships may become optional. Additional changes are likely to be generated by normalization procedures.

Finally, the initial design includes some features that require fine-tuning. For example, a SALESPERSON is just another kind of EMPLOYEE; perhaps the main difference between a "general" employee and a salesperson is that the latter requires tracking of sales performance for commission and/or bonus purposes. Therefore, EMPLOYEE would be the supertype and SALESPERSON the subtype. All these issues must be addressed in the verification and logical design phases covered in Appendix E.
Incidentally, your students may ask why the design does not show a HISTORY entity. The reason for its absence is that the car's history can be traced through the SERVICE entity.
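The EMPLOYEE/SALESPERSON supertype/subtype idea raised above can be sketched at the table level: the subtype's PK doubles as an FK to the supertype, so a salesperson row cannot exist without a matching employee row. The sketch below uses Python's sqlite3; the column names (EMP_NUM, SLSP_COMM_PCT, and so on) are illustrative assumptions, not attributes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # enforce the subtype-to-supertype link
conn.executescript("""
CREATE TABLE EMPLOYEE (
    EMP_NUM  INTEGER PRIMARY KEY,
    EMP_NAME TEXT NOT NULL
);
-- Subtype: shares the supertype PK and adds subtype-specific data
-- (here, the commission percentage used for sales-performance tracking).
CREATE TABLE SALESPERSON (
    EMP_NUM       INTEGER PRIMARY KEY REFERENCES EMPLOYEE (EMP_NUM),
    SLSP_COMM_PCT REAL NOT NULL
);
""")
conn.execute("INSERT INTO EMPLOYEE VALUES (101, 'A. Ramirez')")
conn.execute("INSERT INTO SALESPERSON VALUES (101, 0.03)")
conn.commit()
# Joining supertype and subtype recovers the full salesperson picture.
row = conn.execute("""
    SELECT E.EMP_NAME, S.SLSP_COMM_PCT
    FROM EMPLOYEE E JOIN SALESPERSON S ON E.EMP_NUM = S.EMP_NUM
""").fetchone()
print(row)  # ('A. Ramirez', 0.03)
```

With the foreign key enforced, an attempt to insert a SALESPERSON row with no matching EMPLOYEE row fails, which is precisely the supertype/subtype constraint the design calls for.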
NOTE Although we are generally reluctant to make forward references, you may find it very useful to look ahead to the ERD shown in Appendix C’s Figure PC.1a. The discussion that precedes the presentation of the modified ERD is especially valuable – students often find such sample data to be the key to understanding a complex design. In any case, the modified ERD in Figure PC.1a provides ample evidence that the initial ERD is only a starting point for the design process.
As you discuss the design shown in Figure PB.3a, note that it is far from implementation-ready. For example:

• The INVOICE is likely to contain multiple charges, yet it is only capable of handling one charge at a time at this point. The addition of an INV_LINE entity is clearly an excellent idea.
• The SERVICE entity has some severe limitations caused by the lack of a SERVICE_LINE entity. (Note the previous point.) Given this design, it is impossible to store and track all the individual service (maintenance) procedures that are generated by a single service request. For example, a 50,000-mile check may involve multiple procedures such as belt replacements, tire rotation, tire balancing, brake service, and so on. Therefore, the SERVICE entity, like the INVOICE entity, must be related to service lines, each one of which details a specific maintenance procedure.
• The PART_USAGE entity's function is rather limited. For example, its depiction as a composite entity does properly translate the notion that a part can be used in many service procedures and a service procedure can use many parts. Unfortunately, the lack of a SERVICE_LINE entity means that we cannot track the parts usage to a particular maintenance procedure.
• According to business rule (d), the relationship between CAR and INVOICE would be 1:1. However, if it is possible for the dealer to take the car in trade at a later date and subsequently sell it again, the same CAR_VIN value may appear in INVOICE more than once. We have depicted the latter scenario.
The initial design does have one very nice feature at this point: The existence of the WORK_LOG entity's WORKLOG_ACTION attribute makes it possible to record which mechanic started the service procedure and which one ended the procedure. (The WORKLOG_ACTION attribute has only two values, open and close.) Note that this feature eliminates the need for a null ending date in the SERVICE entity while the car is being serviced. Better yet, if we need to be able to track which mechanics opened and closed the service procedure, the WORK_LOG entity's presence eliminates the need for synonyms in the SERVICE entity. Note, for example, that the following few sample entries in the WORK_LOG table let us conclude that service number 12345 was opened by mechanic 104 on 10-Mar-2014 and closed by the same mechanic on 11-Mar-2014.
Table PB.3 Sample Data Entries in the WORK_LOG Entity

EMP_NUM   SERVICE_NUM   WORKLOG_ACTION   WORKLOG_DATE
104       12345         OPEN             10-Mar-2014
107       12346         OPEN             10-Mar-2014
104       12345         CLOSE            11-Mar-2014
104       12346         CLOSE            11-Mar-2014
112       12347         OPEN             11-Mar-2014
The format you see in Table PB.3 is based on a standard we developed for aviation maintenance databases. Because almost all aspects of aviation are tightly regulated, accountability is always close to the top of the list of design requirements. (In this case, we must be able to find out who opened the maintenance procedure and who closed it.) You will discover in Chapter 9, “Database Design,” that we will apply the accountability standard to other aspects of the design, too. (Who performed each maintenance procedure? Who signed out the part(s) used in each maintenance procedure? And so on.) It is worth repeating that a discussion of the shortcomings of the initial design will set an excellent stage for the introduction of Appendix C’s verification process. Strict accountability standards are becoming the rule in many areas outside aviation. Such standards may be triggered by legislation or by company operations in an increasingly litigious environment.
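The accountability query described above ("who opened and who closed service 12345?") can be sketched directly against Table PB.3's data. The sketch below uses Python's sqlite3 and the sample rows from Table PB.3; only the date format (ISO) is changed for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE WORK_LOG (
    EMP_NUM        INTEGER,
    SERVICE_NUM    INTEGER,
    WORKLOG_ACTION TEXT,     -- only two values: OPEN or CLOSE
    WORKLOG_DATE   TEXT
)""")
# The sample entries from Table PB.3 (dates rewritten in ISO format).
conn.executemany("INSERT INTO WORK_LOG VALUES (?, ?, ?, ?)", [
    (104, 12345, "OPEN",  "2014-03-10"),
    (107, 12346, "OPEN",  "2014-03-10"),
    (104, 12345, "CLOSE", "2014-03-11"),
    (104, 12346, "CLOSE", "2014-03-11"),
    (112, 12347, "OPEN",  "2014-03-11"),
])
# Who opened and who closed service 12345, and when?
for action in ("OPEN", "CLOSE"):
    emp, date = conn.execute(
        "SELECT EMP_NUM, WORKLOG_DATE FROM WORK_LOG "
        "WHERE SERVICE_NUM = 12345 AND WORKLOG_ACTION = ?", (action,)
    ).fetchone()
    print(action, emp, date)
# OPEN 104 2014-03-10
# CLOSE 104 2014-03-11
```

Because each OPEN and CLOSE event is a separate row, no null "ending" columns or synonym attributes are needed in SERVICE, which is the design benefit the text points out.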
4. Create the initial ER diagram for a video rental shop. Use (at least) the following description of operations on which to base your business rules.

The video rental shop classifies movie titles according to their type: Comedy, Western, Classical, Science Fiction, Cartoon, Action, Musical, and New Release. Each type contains many possible titles, and most titles within a type are available in multiple copies. For example, note the summary presented in Table PB.4:
Table PB.4 The Video Rental Type and Title Relationship

TYPE      TITLE                          COPY
Musical   My Fair Lady                   1
Musical   My Fair Lady                   2
Musical   Oklahoma!                      1
Musical   Oklahoma!                      2
Musical   Oklahoma!                      3
Cartoon   Dilly Dally & Chit Chat Cat    1
Cartoon   Dilly Dally & Chit Chat Cat    2
Cartoon   Dilly Dally & Chit Chat Cat    3
Action    Amazon Journey                 1
Action    Amazon Journey                 2
Keep the following conditions in mind as you design the video rental database:

• The movie type classification is standard; not all types are necessarily in stock.
• The movie list is updated as necessary; however, a movie on that list might not be ordered if the video shop owner decides that the movie is not desirable for some reason.
• The video rental shop does not necessarily order movies from all of the vendors on the vendor list; some vendors on the vendor list are merely potential vendors from whom movies may be ordered in the future.
• Movies classified as new releases are reclassified to an appropriate type after they have been in stock for more than 30 days. The video shop manager wants to have an end-of-period (week, month, year) report for the number of rentals by type.
• If a customer requests a title, the clerk must be able to find it quickly. When a customer selects one or more titles, an invoice is written. Each invoice may thus contain charges for one or more titles. All customers pay in cash.
• When the customer checks out a title, a record is kept of the checkout date and time and the expected return date and time. Upon the return of rented titles, the clerk must be able to check quickly whether the return is late and to assess the appropriate late return fee.
• The video-store owner wants to be able to generate periodic revenue reports by title and by type. The owner also wants to be able to generate periodic inventory reports and to keep track of titles on order.
• The video-store owner, who employs two (salaried) full-time and three (hourly) part-time employees, wants to keep track of all employee work time and payroll data. Part-time employees must arrange entries in a work schedule, while all employees sign in and out on a work log.
NOTE The description of operations not only establishes the operational aspects of the business; it also establishes some specific system objectives we have listed next.
As you design this database, remember that transaction and information requirements help drive the design by defining required entities, relationships, and attributes. Also, keep in mind that the description provided by the problem leaves many possibilities for design differences. For example, consider the EMPLOYEE classification as full-time or part-time. If there are few distinguishing characteristics between the two, the situation may be handled by using an attribute EMP_CLASS (whose values might be F or P) in the EMPLOYEE table. If full-time employees earn a base salary and part-time employees earn only an hourly wage, that problem can be handled by having two attributes, EMP_HOURPAY and EMP_BASE_PAY, in EMPLOYEE. Using this approach, the EMP_HOURPAY would be $0.00 for the salaried full-time employees, while the EMP_BASE_PAY would be $0.00 for the part-time employees. (To ensure correct pay computations, the application software would check whether EMP_CLASS is F or P and select the appropriate pay attribute.) On the other hand, if part-time employees are handled quite differently from full-time employees in terms of work scheduling, benefits, and so on, it would be better to use a supertype/subtype classification for FULL_TIME and PART_TIME employees. (The more unique variables exist, the more sense a supertype/subtype relationship makes.)

For discussion purposes, examine the following requirements:

• The clerk must be able to find customers' requests quickly. This requirement is met by creating an easy way to query the MOVIE data (by name, type, etc.) while entering the RENTAL data.
• The clerk must be able to check quickly whether or not the return is late and to assess the appropriate "late return" fee. This requirement is met by adding attributes such as expected return date, actual return date, and late fees to the RENTAL entity. Note that there is no need to add a new entity, nor do we need to create an additional relationship.
Keep in mind that some requirements are easily met by including the appropriate attributes in the tables and by combining those attributes through an application program that enforces the business rule. Remember that not all business rules can be represented in the database conceptual diagram.

• The store owner wants to be able to keep track of all employee work time and payroll data. Here we must create two new entities, WORK_SCHEDULE and WORK_LOG, which will show the employee's work schedule and the actual times worked, respectively. These entities will also help us generate the payroll report.

The description also specifies some of the expected reports:

• End-of-period report for the number of rentals by type. This report will use the RENTAL, MOVIE, and TYPE entities to generate all rental data for some specified period of time.
• Revenue report by title and by type. This report will use the RENTAL, MOVIE, and TYPE entities to generate all the rental data.
• Periodic inventory reports. This report will use the MOVIE and TYPE entities.
• Titles on order. This report will use the ORDER, MOVIE, and TYPE entities.
• Employee work times and payroll data. This report will use the EMPLOYEE, WORK_SCHEDULE, and WORK_LOG entities.
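The single-table approach with an EMP_CLASS discriminator, discussed earlier for the full-time/part-time classification, is exactly the kind of business rule that is enforced in the application program rather than in the conceptual diagram. The sketch below illustrates the idea; the dictionary keys mirror the attribute names suggested in the text, while the pay function itself is an illustrative assumption.

```python
def period_pay(emp: dict, hours_worked: float) -> float:
    """Return one period's pay, selecting the pay attribute by EMP_CLASS."""
    if emp["EMP_CLASS"] == "F":                  # full-time: salaried
        return emp["EMP_BASE_PAY"]
    return emp["EMP_HOURPAY"] * hours_worked     # part-time: hourly


# The unused pay attribute is simply $0.00, as the text suggests.
full_timer = {"EMP_CLASS": "F", "EMP_BASE_PAY": 2400.00, "EMP_HOURPAY": 0.00}
part_timer = {"EMP_CLASS": "P", "EMP_BASE_PAY": 0.00, "EMP_HOURPAY": 15.00}

print(period_pay(full_timer, 80))   # 2400.0  (hours ignored for salaried)
print(period_pay(part_timer, 20))   # 300.0
```

If the two employee classes diverged further (separate scheduling, benefits, and so on), this discriminator approach would give way to the supertype/subtype design, as the discussion notes.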
This summary sets the stage for the ERD shown in Figure PB.4a. Note that the WORK_SCHEDULE and WORK_LOG entities are optional to EMPLOYEE. The optionalities reflect the following conditions:
• Only part-time employees have corresponding records in the work log table.
• Only full-time employees have corresponding records in the work schedule table.

Although there is a temptation to create FULL_TIME and PART_TIME entities, which are then related to WORK_LOG and WORK_SCHEDULE, respectively, such a decision reflects a substitution of an entity for an attribute. It is far better to simply create an attribute, perhaps named EMP_TYPE, in the EMPLOYEE entity. The EMP_TYPE attribute values would then be P = part-time or F = full-time. The application software can then be used to force an entry into the WORK_LOG and WORK_SCHEDULE entities, depending on the EMP_TYPE attribute value.

Student question: Using the argument just presented, what other entity might be replaced by an attribute?

Answer: The TYPE entity can be represented by a TITLE_TYPE attribute in the TITLE entity. The TITLE_TYPE values would then be “Western”, “Adventure”, and so on. This approach works fine as long as the type values don't require additional descriptive material. If they do, TYPE is better represented by an entity in order to avoid data redundancy problems.
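The application-level enforcement just described (forcing WORK_LOG entries only for part-time employees, based on EMP_TYPE) might look like the following sketch, again using SQLite via Python's sqlite3; the table layouts and sample data are illustrative assumptions, not the textbook's implementation:

```python
import sqlite3

# Application-level enforcement of the EMP_TYPE rule: only part-time
# employees (EMP_TYPE = 'P') receive WORK_LOG rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (
        EMP_NUM  INTEGER PRIMARY KEY,
        EMP_TYPE TEXT NOT NULL CHECK (EMP_TYPE IN ('F', 'P'))
    );
    CREATE TABLE WORK_LOG (
        LOG_ID       INTEGER PRIMARY KEY,
        EMP_NUM      INTEGER NOT NULL REFERENCES EMPLOYEE (EMP_NUM),
        LOG_DATE_IN  TEXT,
        LOG_DATE_OUT TEXT
    );
    INSERT INTO EMPLOYEE VALUES (1, 'F'), (2, 'P');
""")

def log_work(emp_num, date_in, date_out):
    """Insert a WORK_LOG row only for part-time employees."""
    (emp_type,) = conn.execute(
        "SELECT EMP_TYPE FROM EMPLOYEE WHERE EMP_NUM = ?",
        (emp_num,)).fetchone()
    if emp_type != 'P':
        raise ValueError("WORK_LOG entries apply to part-time employees only")
    conn.execute(
        "INSERT INTO WORK_LOG (EMP_NUM, LOG_DATE_IN, LOG_DATE_OUT) "
        "VALUES (?, ?, ?)", (emp_num, date_in, date_out))

log_work(2, "2014-05-14 08:00", "2014-05-14 12:00")   # accepted: part-timer
```

An attempt to log work for employee 1 (full-time) would raise the error, mirroring the rule the application software is expected to enforce.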
Figure PB.4a The Initial Crow’s Foot ERD for the Video Rental Store
Additional discussion: At this point, the ERD has not yet been verified against the transaction
requirements. For example, there is no way to check which specific video has been rented by a customer. (If five customers rent copies of the same video, you don't know which customer has which copy.) Therefore, the design requires additional work triggered by the verification process. In addition, the work log entity's single LOG_DATE is incapable of tracking when the part-time employees logged in or out. Therefore, two attributes must be used, perhaps named LOG_DATE_IN and LOG_DATE_OUT. And if you want to determine the hours worked by each part-time employee, it will be necessary to record the time in and time out as well. Similarly, the work schedule cannot yet be used to track the full-time employees' schedules. Who has worked, and when? Clearly, the verification process discussed in Appendix C is not a luxury!

5. Suppose a manufacturer produces three high-cost, low-volume products: P1, P2, and P3. Product P1 is assembled with components C1 and C2; product P2 is assembled with components C1, C3, and C4; and product P3 is assembled with components C2 and C3. Components may be purchased from several vendors, as shown in Table PB.5:
Table PB.5 The Component/Vendor Summary

VENDOR   COMPONENTS SUPPLIED
V1       C1, C2
V2       C1, C2, C3, C4
V3       C1, C2, C4
Each product has a unique serial number, as does each component. To keep track of product performance, careful records are kept to ensure that each product's components can be traced to the component supplier. Products are sold directly to final customers; that is, no wholesale operations are permitted. The sales records include the customer identification and the product serial number. Using the preceding information, do the following: a. Write the business rules governing the production and sale of the products. The business rules are summarized in Figure PB.5A.
Figure PB.5A The Business Rule Summary

PRODUCT   COMPONENTS
P1        C1, C2
P2        C1, C3, C4
P3        C2, C3

Business Rule 1: A component can be part of several products, and a product is made up of several components.

VENDOR    COMPONENTS SUPPLIED
V1        C1, C2
V2        C1, C2, C3, C4
V3        C1, C2, C4

Business Rule 2: A component can be supplied by several vendors, and a vendor supplies several components.
b. Create an ER diagram capable of supporting the manufacturer's product/component tracking requirements. The two business rules shown in Figure PB.5A allow the designer to generate the ERD shown in Figure PB.5B1. (Note the M:N relationships between PRODUCT and COMPONENT and between COMPONENT and VENDOR that have been converted through the composite entities PROD_COMP and COMP_VEND.)
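The decomposition of the two M:N relationships through the PROD_COMP and COMP_VEND composite entities can be sketched as bridge tables. The sketch below (SQLite via Python's sqlite3; DDL details are assumptions) loads the data from Figure PB.5A and answers a typical tracking question:

```python
import sqlite3

# The two M:N relationships decomposed through composite (bridge) entities.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PRODUCT   (PROD_CODE TEXT PRIMARY KEY);
    CREATE TABLE COMPONENT (COMP_CODE TEXT PRIMARY KEY);
    CREATE TABLE VENDOR    (VEND_CODE TEXT PRIMARY KEY);
    CREATE TABLE PROD_COMP (              -- bridges PRODUCT and COMPONENT
        PROD_CODE TEXT REFERENCES PRODUCT (PROD_CODE),
        COMP_CODE TEXT REFERENCES COMPONENT (COMP_CODE),
        PRIMARY KEY (PROD_CODE, COMP_CODE)
    );
    CREATE TABLE COMP_VEND (              -- bridges COMPONENT and VENDOR
        COMP_CODE TEXT REFERENCES COMPONENT (COMP_CODE),
        VEND_CODE TEXT REFERENCES VENDOR (VEND_CODE),
        PRIMARY KEY (COMP_CODE, VEND_CODE)
    );
    INSERT INTO PRODUCT   VALUES ('P1'), ('P2'), ('P3');
    INSERT INTO COMPONENT VALUES ('C1'), ('C2'), ('C3'), ('C4');
    INSERT INTO VENDOR    VALUES ('V1'), ('V2'), ('V3');
    INSERT INTO PROD_COMP VALUES ('P1','C1'), ('P1','C2'),
        ('P2','C1'), ('P2','C3'), ('P2','C4'), ('P3','C2'), ('P3','C3');
    INSERT INTO COMP_VEND VALUES ('C1','V1'), ('C1','V2'), ('C1','V3'),
        ('C2','V1'), ('C2','V2'), ('C2','V3'), ('C3','V2'),
        ('C4','V2'), ('C4','V3');
""")
# Which vendors could have supplied components of product P2?
rows = conn.execute("""
    SELECT DISTINCT cv.VEND_CODE
    FROM PROD_COMP pc JOIN COMP_VEND cv ON pc.COMP_CODE = cv.COMP_CODE
    WHERE pc.PROD_CODE = 'P2'
    ORDER BY cv.VEND_CODE
""").fetchall()
print([v for (v,) in rows])   # ['V1', 'V2', 'V3']
```

Note that this query can only name the *possible* suppliers; as the discussion below shows, serial-number tracking requires additional structure.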
Figure PB.5B1 The Initial Crow’s Foot ERD for Problem B.5B
As you examine Figure PB.5B1, note that we have used default optionalities in the composite entities named PROD_COMP and COMP_VEND. Naturally, these optionalities must be verified against the business rules before the design is implemented. However, at this point the optionalities make sense – after all, various versions of a PRODUCT do not necessarily contain all available COMPONENTs, nor do all VENDORs supply all COMPONENTs. Quite aside
from the likely existence of the relationships we just pointed out, optionalities are generally desirable from an operational point of view – at least from the database management angle. Yet, no matter how “obvious” a relationship may appear to be, it is worth repeating that the existence of the optionalities must be verified. Designs that do not reflect the actual data environment are not likely to be useful at the end-user level.

Given the ERDs in Figures PB.5B1 and PB.5B2, you can see that each PRODUCT entry actually represents a product line, i.e., a collection of products belonging to the same product type or line, rather than a specific product occurrence with a unique serial number. Therefore, this model will not enable us to identify the serial number for each component used in, for example, a product with serial number 348765. In other words, this solution does not allow us to track the provider of a part that was used in a specific PRODUCT occurrence. (Note the example in Figure PB.5C.)
Figure PB.5C An Initial Implementation

PRODUCT: P1, P2, P3
COMPONENT: C1, C2, C3, C4
VENDOR: V1, V2, V3

PROD_COMP (product, component):
(P1, C1), (P1, C2), (P2, C1), (P2, C3), (P2, C4), (P3, C2), (P3, C3)

COMP_VEND (component, vendor):
(C1, V1), (C1, V2), (C1, V3), (C2, V1), (C2, V2), (C2, V3), (C3, V2), (C4, V2), (C4, V3)
As you examine Figure PB.5C, note that there are no serial numbers for the components, nor are there any for the products produced. In other words, we do not meet the requirements imposed by:

BUSINESS RULE 3: Each product has a unique serial number.

For example, there will be several products P1, each with a unique serial number. Each unique product will be composed of several components, and each of those components has a unique serial number. The implementation of business rule 3 will allow us to keep track of the supplier of each component. One way to produce the tracking capability required by business rule 3 is to use a ternary relationship between PRODUCT, COMPONENT, and VENDOR, shown in Figure PB.5D1:
Figure PB.5D1 The Crow’s Foot Ternary Relationship between PRODUCT, COMPONENT, and VENDOR
The ER diagram we have just shown represents a many-to-many-to-many TERNARY relationship, expressed by M:N:P. This ternary relationship indicates that:
• A product is composed of many components, and a component appears in many products.
• A component is provided by many vendors, and a vendor provides many components.
• A product contains components of many vendors, and a vendor's components appear in many products.

Assigning attributes to the SERIALS entity, we may draw the dependency diagram shown in Figure PB.5E.
Figure PB.5E The Initial Dependency Diagram

SERIALS (P_SERIAL, C_SERIAL, PROD_TYPE, COMP_TYPE, VEND_CODE)
Partial dependency: P_SERIAL → PROD_TYPE
Transitive dependencies: C_SERIAL → COMP_TYPE, VEND_CODE
We may safely assume that all serial numbers are unique. If we make this assumption, we can conclude that the product serial number will identify the product type and that the component serial number will identify the component type and the vendor. Using the standard normalization procedures, we may thus decompose the entity as shown in the dependency diagrams in Figure PB.5F.
Figure PB.5F The Normalized Structure

The original dependency diagram:
SERIALS (P_SERIAL, C_SERIAL, PROD_TYPE, COMP_TYPE, VEND_CODE)
Partial dependency: P_SERIAL → PROD_TYPE
Transitive dependencies: C_SERIAL → COMP_TYPE, VEND_CODE

The normalized dependency diagrams:
Table name: P_SERIAL (P_SERIAL, PROD_TYPE)
Table name: C_SERIAL (C_SERIAL, COMP_TYPE, VEND_CODE)
Table name: SERIAL (P_SERIAL, C_SERIAL)
As you examine the dependency diagrams in Figure PB.5F, note the following:
• P_SERIAL has a 1:M relationship with PRODUCT, because one product has many product serial numbers.
• C_SERIAL has a 1:M relationship with COMPONENT, because one component has many component serial numbers.
• SERIAL is the composite entity that connects P_SERIAL and C_SERIAL, thus reflecting the fact that one product has many components and a component can be found in many products.
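The normalized structure can be sketched as DDL. The sketch below (SQLite via Python's sqlite3; constraint details are assumptions) shows how a serialized component can be traced to its vendor through the SERIAL composite entity, using one row of the sample data from Figure PB.5G:

```python
import sqlite3

# The normalized P_SERIAL / C_SERIAL / SERIAL structure of Figure PB.5F.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE P_SERIAL (
        P_SERIAL  TEXT PRIMARY KEY,
        PROD_TYPE TEXT NOT NULL
    );
    CREATE TABLE C_SERIAL (
        C_SERIAL  TEXT PRIMARY KEY,
        COMP_TYPE TEXT NOT NULL,
        VEND_CODE TEXT NOT NULL
    );
    CREATE TABLE SERIAL (   -- composite entity linking the two
        P_SERIAL TEXT REFERENCES P_SERIAL (P_SERIAL),
        C_SERIAL TEXT REFERENCES C_SERIAL (C_SERIAL),
        PRIMARY KEY (P_SERIAL, C_SERIAL)
    );
    INSERT INTO P_SERIAL VALUES ('X0D101', 'P1');
    INSERT INTO C_SERIAL VALUES ('C90001', 'C1', 'V1');
    INSERT INTO SERIAL   VALUES ('X0D101', 'C90001');
""")
# Trace the vendor of a serialized component in product X0D101.
row = conn.execute("""
    SELECT c.C_SERIAL, c.VEND_CODE
    FROM SERIAL s JOIN C_SERIAL c ON s.C_SERIAL = c.C_SERIAL
    WHERE s.P_SERIAL = 'X0D101'
""").fetchone()
print(row)   # ('C90001', 'V1')
```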
To illustrate the relationships we have just described, let's take a look at some data in Figure PB.5G:
Figure PB.5G Sample Data

Table P_SERIAL:
P_SERIAL  PROD_TYPE
X0D101    P1
X0C102    P1
200201    P2
200202    P2
200203    P2
200204    P2
300301    P3
300302    P3

Table C_SERIAL:
C_SERIAL  COMP_TYPE  VENDOR
C90001    C1         V1
C90002    C1         V2
C90003    C1         V3
C80003    C2         V1
C80002    C2         V1
C80909    C2         V2
C80976    C2         V3
C80908    C2         V2
C80965    C3         V2
C76894    C3         V2
C40097    C4         V2
C45096    C4         V2
C67673    C4         V3
C45679    C4         V3

Table SERIAL:
P_SERIAL  C_SERIAL
X0D101    C90001
X0C101    C80976
X0C102    C90002
X0C102    C80002
200201    C90002
200201    C76894
200201    C45678
……        ….. etc.
The new ER diagram will enable us to identify the product by a unique serial number, and each of the product's components will have a unique serial number, too. Therefore, the new ER diagram will look like Figure PB.5H1.
Figure PB.5H1 The Revised (Final) Crow’s Foot ERD
As you examine Figure PB.5H1's ERD, note that the COMP_VEND composite entity seems redundant, because the C_SERIAL entity already depicts the many-to-many relationship between VENDOR and COMPONENT. However, COMP_VEND represents a more general relationship that enables us to determine the likely providers of a general component (what vendors supply component C1?), rather than a specific component's vendor (which vendor supplied the C1 component with serial number C90003?). The designer must confer with the end user to decide whether such a general relationship is necessary or whether it can be removed from the database without affecting its semantic content.
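The contrast between the two questions can be made concrete with two queries against a minimal sketch of the relevant tables (SQLite via Python's sqlite3; table layouts are assumptions, with sample data drawn from Table PB.5 and Figure PB.5G):

```python
import sqlite3

# General suppliers (COMP_VEND) versus the supplier of one specific
# serialized unit (C_SERIAL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE COMP_VEND (COMP_CODE TEXT, VEND_CODE TEXT,
                            PRIMARY KEY (COMP_CODE, VEND_CODE));
    CREATE TABLE C_SERIAL  (C_SERIAL TEXT PRIMARY KEY,
                            COMP_TYPE TEXT, VEND_CODE TEXT);
    INSERT INTO COMP_VEND VALUES ('C1','V1'), ('C1','V2'), ('C1','V3');
    INSERT INTO C_SERIAL  VALUES ('C90001','C1','V1'),
                                 ('C90002','C1','V2'),
                                 ('C90003','C1','V3');
""")
# General: what vendors supply component C1?
likely = [v for (v,) in conn.execute(
    "SELECT VEND_CODE FROM COMP_VEND WHERE COMP_CODE = 'C1' "
    "ORDER BY VEND_CODE")]
# Specific: which vendor supplied the C1 unit with serial C90003?
(actual,) = conn.execute(
    "SELECT VEND_CODE FROM C_SERIAL WHERE C_SERIAL = 'C90003'").fetchone()
print(likely, actual)   # ['V1', 'V2', 'V3'] V3
```

If the end user never asks the first (general) question, COMP_VEND is a candidate for removal, as the text suggests.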
6. Create an ER diagram for a hardware store. Make sure that you cover (at least) store transactions, inventory, and personnel. Base your ER diagram on an appropriate set of business rules that you develop. (Note: It would be useful to visit a hardware store and conduct interviews to discover the type and extent of the store's operations.)

Since the problem does not specify a set of business rules, we will create some that will enable us to develop an initial ER diagram.
NOTE
Please take into consideration that, depending on the assumptions made and on the selection of business rules, students are likely to create quite different solutions to this problem. You may find it quite useful to study each student solution and to incorporate the most interesting parts of each solution into a common ER diagram. We know that this is not an easy job, but your students will benefit because you will thus enable them to develop very important analytical skills. You should stress that:
• A problem may be examined from many different angles.
• Similar organizations, using different business rules, will generate design problems that may be solved through the use of quite different solutions.
To get the class discussion started, we will assume these business rules:
1. A product is provided by many suppliers, and a supplier can provide several products.
2. An employee has many dependents, but a dependent can be claimed by only one employee.
3. An employee can write many invoices, but each invoice is written by only one employee.
4. Each invoice belongs to only one customer, and each customer owns many invoices.
5. A customer makes several payments, and each payment belongs to only one customer.
6. Each payment may be applied partially or totally to one or more invoices, and each invoice can be paid off in one or more payments.

Using these business rules, we may generate the ERD shown in Figure PB.6A.
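Business rule 6 (a payment applied across several invoices, and an invoice paid off by several payments) is an M:N relationship that decomposes into a composite entity. The sketch below illustrates one way to do this, using a hypothetical PAY_LINE bridge table (SQLite via Python's sqlite3; all names and amounts are invented for illustration):

```python
import sqlite3

# Business rule 6 decomposed through a composite entity (PAY_LINE).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE INVOICE  (INV_NUM INTEGER PRIMARY KEY, INV_TOTAL REAL);
    CREATE TABLE PAYMENT  (PAY_NUM INTEGER PRIMARY KEY, PAY_AMOUNT REAL);
    CREATE TABLE PAY_LINE (
        PAY_NUM INTEGER REFERENCES PAYMENT (PAY_NUM),
        INV_NUM INTEGER REFERENCES INVOICE (INV_NUM),
        PAYLINE_AMOUNT REAL,       -- portion applied to this invoice
        PRIMARY KEY (PAY_NUM, INV_NUM)
    );
    INSERT INTO INVOICE VALUES (1, 100.00), (2, 40.00);
    INSERT INTO PAYMENT VALUES (500, 120.00);
    -- one payment split across two invoices:
    INSERT INTO PAY_LINE VALUES (500, 1, 80.00), (500, 2, 40.00);
""")
# Outstanding balance per invoice:
balances = conn.execute("""
    SELECT i.INV_NUM,
           i.INV_TOTAL - COALESCE(SUM(pl.PAYLINE_AMOUNT), 0) AS balance
    FROM INVOICE i LEFT JOIN PAY_LINE pl ON i.INV_NUM = pl.INV_NUM
    GROUP BY i.INV_NUM ORDER BY i.INV_NUM
""").fetchall()
print(balances)   # [(1, 20.0), (2, 0.0)]
```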
Figure PB.6A The Crow’s Foot ERD for Problem 6 (The Hardware Store)
The ERD shown in Figure PB.6A requires less tweaking than the previous ERDs to get it ready for implementation. For example, given the presence of the INV_LINE entity, the customer can buy more than one product per invoice. Similarly, the ORD_LINE entity makes it possible for more than one product to be ordered per order. However, as you examine the PAYMENT entity in Figure PB.6A, note that the current PK definition limits the payments for a given customer and invoice number to one per day. (Two payments by the same customer for the same invoice number on the same date would violate the entity integrity rules, because the two composite PK values would be identical in that scenario.) Therefore, the design shown in Figure PB.6A still requires additional work, to be completed during the verification process.

7. Use the following brief description of operations as the source for the next database design:

All aircraft owned by ROBCOR require periodic maintenance. When maintenance is required, a maintenance log form is used to enter the aircraft identification number, the general nature of the maintenance, and the maintenance starting date. A sample maintenance log form is shown in Figure PB.7A.
FIGURE PB.7A The Maintenance Log Form
Note that the maintenance log form shown in Figure PB.7A contains a space used to enter the maintenance completion date and a signature space for the supervising mechanic who releases the aircraft back into service. Each maintenance log form is numbered sequentially. (Note: A supervising mechanic is one who holds a special Federal Aviation Administration (FAA) Inspection Authorization (IA). Three of ROBCOR's ten mechanics hold such an IA.)

Once the maintenance log form is initiated, the maintenance log form's number is written on a maintenance specification sheet, also known as a maintenance line form. When completed, the specification sheet contains the details of each maintenance action, the time required to complete the maintenance, parts (if any) used in the maintenance action, and the identification of the mechanic who performed the maintenance action. The maintenance specification sheet is the billing source (time and parts for each of the maintenance actions), and it is one of the sources through which parts use may be audited. A sample maintenance specification sheet (line form) is shown in Figure PB.7B.
FIGURE PB.7B The Maintenance Line Form (page 1 of 1)

Log #: 2155

Item  Action description                            Time  Part        Units  Mechanic
1     Performed run-up. Rough mag reset             0.8   None        0      112
2     Cleaned #2 bottom plug, left engine           0.9   None        0      112
3     Replaced nose gear shimmy dampener            1.3   P-213342A   1      103
4     Replaced left main gear door oleo strut seal  1.7   GR/311109S  1      112
5     Cleaned and checked gear strut seals          1.7   None        0      116
6
7
8
Parts used in any maintenance action must be signed out by the mechanic who used them, thus allowing ROBCOR to track its parts inventory. Each sign-out form contains a listing of all the parts associated with a given maintenance log entry. Therefore, a parts sign-out form contains the maintenance log number against which the parts are charged. In addition, the parts sign-out procedure is used to update the ROBCOR parts inventory. A sample parts sign-out form is shown in Figure PB.7C.
FIGURE PB.7C The Parts Sign-out Form (page 1 of 1)

Log #: 2155    Form sequence #: 24226

Part        Description                                         Units  Unit Price  Mechanic
P-213342A   Nose gear shimmy dampener, PA31-350/1973            1      $189.45     112
GR/311109S  Left main gear door oleo strut seal, PA31-350/1973  1      $59.76      103
Mechanics are highly specialized ROBCOR employees, and their qualifications are quite different from those of an accountant or a secretary, for example.

Given this brief description of operations, draw the fully labeled ER diagram. Make sure you include all the appropriate relationships, connectivities, and cardinalities.

Before drawing the ER diagram, note the following relationships:
• Not all employees are mechanics, but all mechanics are employees. Therefore, the MECHANIC entity is optional to EMPLOYEE, and EMPLOYEE is the supertype of MECHANIC.
• All mechanics must sign off work on the MAINTENANCE they performed, and they must sign out for the PART(s) used.
• Only some mechanics (the IAs) may sign off the LOG. Therefore, LOG is optional to MECHANIC.
• Because not all MAINTENANCE entries are associated with a PART (some maintenance doesn't require parts), PART is optional to MAINTENANCE.

These relationships are all reflected in the ER diagram shown in Figure PB.7D1.
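One common way to implement the EMPLOYEE supertype with a MECHANIC subtype is to make the subtype's PK also an FK to the supertype. The sketch below (SQLite via Python's sqlite3) illustrates this; the column names, the IA flag, and the sample names are assumptions, although mechanic number 112 appears on the sample forms:

```python
import sqlite3

# Supertype/subtype sketch: MECHANIC's PK is also an FK to EMPLOYEE.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE EMPLOYEE (
        EMP_NUM  INTEGER PRIMARY KEY,
        EMP_NAME TEXT NOT NULL
    );
    CREATE TABLE MECHANIC (
        EMP_NUM     INTEGER PRIMARY KEY REFERENCES EMPLOYEE (EMP_NUM),
        MECH_HAS_IA TEXT NOT NULL CHECK (MECH_HAS_IA IN ('Y', 'N'))
    );
    INSERT INTO EMPLOYEE VALUES (112, 'Jones'), (204, 'Smith');
    INSERT INTO MECHANIC VALUES (112, 'Y');   -- 112 is an IA mechanic
""")
# Only mechanics holding an IA may sign off a maintenance log:
ia_mechs = [n for (n,) in conn.execute("""
    SELECT e.EMP_NAME
    FROM MECHANIC m JOIN EMPLOYEE e ON m.EMP_NUM = e.EMP_NUM
    WHERE m.MECH_HAS_IA = 'Y'
""")]
print(ia_mechs)   # ['Jones']
```

Because every MECHANIC row must reference an existing EMPLOYEE row, the "all mechanics are employees, but not all employees are mechanics" rule holds structurally.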
Figure PB.7D1 The Initial Crow’s Foot ERD for Problem 7 (ROBCOR Aircraft Service)
As you discuss the ERD shown in Figure PB.7D1, note its similarity to the car dealership's maintenance section of the ERD presented in Figure PB.3a. However, the ROBCOR Aircraft Service ERD has been developed at a much higher detail level, thus requiring fewer modifications during the verification process. Figure PB.7D1 shows that:
• Each LOG entity occurrence will yield one or more maintenance procedures.
• Each of the individual maintenance procedures will be listed in the LOG_LINE entity.
• A mechanic must sign off on each of the LOG_LINE entity occurrences.
• The possible parts use in each LOG_LINE entity occurrence is now traceable.
• A part can be accounted for from the moment it is signed out by the mechanic to the point at which it is installed during the maintenance procedure.

The “references” relationship between LOG and PART is subject to discussion. After all, you can always trace each part's use to the LOG through the LOG_LINE entity. Therefore, the relationship is redundant. Such redundancies are – or should be – picked up during the verification process.

We have shown the MECHANIC to be a subtype of the EMPLOYEE supertype. Whether the supertype/subtype relationship makes sense depends on the type and extent of the attributes that are to be associated with the MECHANIC entity. There may be externally imposed requirements – often imposed through the government's regulatory process – that can best be met through a supertype/subtype relationship. However, in the absence of such externally imposed requirements, it is usually better to use an attribute in EMPLOYEE – such as the employee's primary job code – and link the employees to their various qualifications through a composite entity. The application software will then be used to enforce the requirement that the person doing maintenance work is, in fact, a mechanic.

8. You have just been employed by the ROBCOR Trucking Company to develop a database.
To gain a sense of the database's intended functions, you have spent some time talking to ROBCOR's employees and you've examined some of the forms used to track driver assignments and truck maintenance. Your notes include the following observations:
• Some drivers are qualified to drive more than one type of truck operated by ROBCOR. A driver may, therefore, be assigned to drive more than one truck type during some period of time. ROBCOR operates several trucks of a given type. For example, ROBCOR operates two panel trucks, four half-ton pick-up trucks, two single-axle dump trucks, one double-axle truck, and one 16-wheel truck. A driver with a chauffeur's license is qualified to drive only a panel truck and a half-ton pick-up truck and, thus, may be assigned to drive any one of six trucks. A driver with a commercial license with an appropriate heavy equipment endorsement may be assigned to drive any of the ten trucks in the ROBCOR fleet. Each time a driver is assigned to drive a truck, an entry is made in a log containing the employee number, the truck identification, and the sign-out (departure) date. Upon the driver's return, the log is updated to include the sign-in (return) date and the number of driver duty hours.
• If trucks require maintenance, a maintenance log is filled out. The maintenance log includes the date on which the truck was received by the maintenance crew. The truck cannot be released for service until the
maintenance log release date has been entered and the log has been signed off by an inspector. All inspectors are qualified mechanics, but not all mechanics are qualified inspectors.
• Once the maintenance log entry has been made, the maintenance log number is transferred to a service log in which all service log transactions are entered. A single maintenance log entry can give rise to multiple service log entries. For example, a truck might need an oil change as well as a fuel injector replacement, a brake adjustment, and a fender repair. Each service log entry is signed off by the mechanic who performed the work.
• To track the maintenance costs for each truck, the service log entries include the parts used and the time spent to install the part or to perform the service. (Not all service transactions involve parts. For example, adjusting a throttle linkage does not require the use of a part.)
• All employees are automatically covered by a standard health insurance policy. However, ROBCOR's benefits include optional co-paid term life insurance and disability insurance. Employees may select both options, one option, or no options.
Given those brief notes, create the ER diagram. Make sure you include all appropriate entities and relationships, and define all connectivities and cardinalities.

The ERD in Figure PB.8a contains a maintenance portion that has become our standard, given that it enables the end user to track all activities and parts for all vehicles. In fact, given its ability to support high accountability standards, we first developed the "basics" of this design for aviation maintenance tracking.
Figure PB.8a The Initial Crow’s Foot ERD for the ROBCOR Trucking Service
As you examine the ERD in Figure PB.8a, note that the driver assignment to drive trucks is a M:N relationship: given the passage of time, a driver can be assigned to drive a truck many times, and a truck can be assigned to a driver many times. We have implemented this relationship through the use of a composite entity named ASSIGN.

The M:N relationship between EMPLOYEE and BENEFIT – that is, the insurance package mentioned in problem 8's last bullet – has been implemented through the composite entity named EMP_BEN. (An employee can select many benefit packages, and each insurance package may be selected by many employees.) The reason for the optionality is that not all of the insurance packages are necessarily selected by the employee. For example, using the BENEFIT table contents shown in Table PB.8A, an employee may decide to select option 2, options 2 and 3, or neither option. (The standard health insurance package is assigned automatically.)
Table PB.8A Table name: BENEFIT

BEN_CODE  BEN_DESCRIPTION                        BEN_CHARGE
1         Standard health                        $0.00
2         Co-paid term life insurance, $100,000  $35.00
3         Co-paid disability insurance           $42.50
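The EMP_BEN composite entity can be sketched against the BENEFIT rows of Table PB.8A. In the sketch below (SQLite via Python's sqlite3), the employee IDs and their benefit selections are invented for illustration:

```python
import sqlite3

# EMP_BEN composite entity with the BENEFIT rows from Table PB.8A.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BENEFIT (
        BEN_CODE INTEGER PRIMARY KEY,
        BEN_DESCRIPTION TEXT,
        BEN_CHARGE REAL
    );
    CREATE TABLE EMP_BEN (
        EMP_ID   INTEGER,
        BEN_CODE INTEGER REFERENCES BENEFIT (BEN_CODE),
        PRIMARY KEY (EMP_ID, BEN_CODE)
    );
    INSERT INTO BENEFIT VALUES
        (1, 'Standard health', 0.00),
        (2, 'Co-paid term life insurance, $100,000', 35.00),
        (3, 'Co-paid disability insurance', 42.50);
    -- Employee 409 takes the standard package plus both options;
    -- employee 411 takes only the automatic standard package.
    INSERT INTO EMP_BEN VALUES (409, 1), (409, 2), (409, 3), (411, 1);
""")
# Total co-pay charge for employee 409:
charge = conn.execute("""
    SELECT SUM(b.BEN_CHARGE)
    FROM EMP_BEN eb JOIN BENEFIT b ON eb.BEN_CODE = b.BEN_CODE
    WHERE eb.EMP_ID = 409
""").fetchone()[0]
print(charge)   # 77.5  ($35.00 + $42.50; standard health is free)
```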
Incidentally, we have used a BENEFIT entity rather than an INSURANCE entity to anticipate the likelihood that benefits may include items other than insurance. For example, employees might be given a benefit such as an investment plan, a flextime option, child care, and so on.

The decomposition of M:N relationships continues to be a good subject for discussion. For example, we have shown many of the decompositions as composite entities. However, while such an approach is perfectly acceptable at the initial design stage, caution your students that composite PKs cannot be referenced easily by subsequent additions of entities that must reference those PKs. Therefore, we would note that the composite PK used in the LOG_ACTION entity – EMP_ID + LOG_NUM + LOGACT_TYPE – should be replaced by an “artificial” single-attribute PK named LOGACT_NUM. The EMP_ID and LOG_NUM attributes would continue to be used as FKs to the MECHANIC and LOG entities. (Naturally, the EMP_ID and LOG_NUM attributes should be indexed to avoid duplication of records and to speed up queries.) A few sample entries are shown in Table PB.8B.
Table PB.8B Table name: LOG_ACTION

LOGACT_NUM  LOG_NUM  EMP_ID  LOGACT_TYPE  LOGACT_DATE
1000        5023     409     Open         14-May-2014
1001        5024     409     Open         15-May-2014
1002        5023     411     Close        15-May-2014
1003        5025     378     Open         15-May-2014
1004        5024     411     Close        15-May-2014
1005        5026     409     Open         16-May-2014
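The surrogate-key version of LOG_ACTION can be sketched as follows (SQLite via Python's sqlite3; the DDL details are assumptions, while the sample rows come from Table PB.8B):

```python
import sqlite3

# LOG_ACTION with LOGACT_NUM as a single-attribute surrogate PK;
# LOG_NUM and EMP_ID remain as (indexed) FK attributes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LOG_ACTION (
        LOGACT_NUM  INTEGER PRIMARY KEY,
        LOG_NUM     INTEGER NOT NULL,
        EMP_ID      INTEGER NOT NULL,
        LOGACT_TYPE TEXT NOT NULL CHECK (LOGACT_TYPE IN ('Open','Close')),
        LOGACT_DATE TEXT NOT NULL
    );
    CREATE INDEX idx_logact_log ON LOG_ACTION (LOG_NUM);
    CREATE INDEX idx_logact_emp ON LOG_ACTION (EMP_ID);
    INSERT INTO LOG_ACTION VALUES
        (1000, 5023, 409, 'Open',  '2014-05-14'),
        (1001, 5024, 409, 'Open',  '2014-05-15'),
        (1002, 5023, 411, 'Close', '2014-05-15'),
        (1003, 5025, 378, 'Open',  '2014-05-15'),
        (1004, 5024, 411, 'Close', '2014-05-15'),
        (1005, 5026, 409, 'Open',  '2014-05-16');
""")
# Which logs have been opened but never closed?
open_logs = [n for (n,) in conn.execute("""
    SELECT LOG_NUM FROM LOG_ACTION WHERE LOGACT_TYPE = 'Open'
    AND LOG_NUM NOT IN
        (SELECT LOG_NUM FROM LOG_ACTION WHERE LOGACT_TYPE = 'Close')
    ORDER BY LOG_NUM
""")]
print(open_logs)   # [5025, 5026]
```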
Finally, we have used supertype/subtype relationships between EMPLOYEE and its DRIVER and MECHANIC subtypes. If drivers and mechanics are assumed to have many characteristics (such as special certifications at different levels) that are not common to all employees, this approach eliminates nulls. However, keep in mind the earlier discussion about the use of supertypes and subtypes in problem 7: the supertype/subtype approach may be dictated by external factors, but it must be approached with some caution. For example, if drivers have multiple license types, it would be far better to create a LICENSE entity and relate it to DRIVER through a composite entity, perhaps named DRIVER_LICENSE. The composite entity may then be designed to include the date on which the license was earned and other pertinent facts pertaining to licenses. Such flexibility is not available in a subtype unless you are willing to tolerate the possible occurrence of nulls as more data about the (multiple) licenses are kept and some of the drivers do not have all of those licenses.
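The DRIVER_LICENSE alternative just suggested can be sketched as follows (SQLite via Python's sqlite3; all entity names, codes, and dates are illustrative assumptions):

```python
import sqlite3

# A LICENSE entity related to drivers through the composite entity
# DRIVER_LICENSE avoids nulls when drivers hold multiple licenses.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LICENSE (LIC_CODE TEXT PRIMARY KEY, LIC_DESC TEXT);
    CREATE TABLE DRIVER_LICENSE (
        EMP_ID   INTEGER,
        LIC_CODE TEXT REFERENCES LICENSE (LIC_CODE),
        DL_DATE_EARNED TEXT,     -- one of the "pertinent facts"
        PRIMARY KEY (EMP_ID, LIC_CODE)
    );
    INSERT INTO LICENSE VALUES
        ('CH', 'Chauffeur'), ('CO', 'Commercial, heavy equipment');
    INSERT INTO DRIVER_LICENSE VALUES
        (501, 'CH', '2010-03-01'),
        (501, 'CO', '2013-07-15'),   -- driver 501 holds both licenses
        (502, 'CH', '2012-01-20');
""")
count = conn.execute(
    "SELECT COUNT(*) FROM DRIVER_LICENSE WHERE EMP_ID = 501"
).fetchone()[0]
print(count)   # 2
```

Each additional license a driver earns is simply another DRIVER_LICENSE row; no new columns (and hence no nulls) are needed.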
Appendix C The University Lab: Conceptual Design, Verification, Logical Design, and Implementation

Discussion Focus

How is a database design verified, and why is such verification necessary?

Use our detailed answer to question 1 to focus class discussion on database design verification. Stress that the verification process uses the initial ER model as a communication tool. The designer may begin the verification process by describing the organization's operations to its end users, basing the detailed description on the initial ER model. Next, explain how the operations will be supported by the database design. Stress that the design must support the end-user application views, outputs, and inputs. Points to be addressed include such questions as:
• Is the description accurate? If not, what aspects of the description must be corrected?
• Does the model support the end-user requirements? If not, what aspects of the end-user requirements have not been addressed or have been addressed inadequately?

Keep in mind that even a model that perfectly addresses all initially determined end-user requirements is likely to need adjustments as those end users begin to understand the ramifications of the database design's capabilities. In many cases, the end users may learn what the organization's processes and procedures actually are, thus leading to new requirements and the perception of new opportunities. The database designer must keep such likely developments in mind, especially if (s)he works as a database design consultant. (Anticipation of such developments must be factored into the contract negotiations for consulting fees.)

Discuss the role of the system modules. The use of system modules can hardly be overemphasized in a database design environment. Stress these module characteristics and features:
• Modules represent subsets of the database model: smaller "pieces" are more easily understood.
• Modules are self-contained and accomplish a specific system function; if such a system function must be modified, other functions remain unaffected.
• Modules fit into a modular database design, which is more easily modified and adapted to new circumstances. Because modification efforts are focused on a database subset, the productivity of both designers and application developers is likely to be enhanced.
• Module interfaces must be clear if the modules are expected to work well within the overall system.
Answers to Review Questions

1. Why must a conceptual model be verified? What steps are involved in the verification process?

The verification of a conceptual model is crucial to a successful database design. The verification process allows the designer to check the accuracy of the database design by:
• Re-examining data and data transformations.
• Enabling the designer to evaluate the design efficiency relative to the end user's and system's design goals.

Keep in mind that, to a large extent, the best design is the one that serves the end-user requirements best. For example, a design that works well for a manufacturing firm may not fit the needs of a marketing research firm, and vice versa. The verification process helps the designer to avoid implementation problems later by:
• Validating the model's entities. (Remember the minimal data rule.)
• Confirming entity relationships and eliminating duplicate, unnecessary, or improperly defined relationships.
• Eliminating data redundancies.
• Improving the model's semantic precision to better represent real-world operations.
• Confirming that all user requirements (processing, performance, or security) are met.

Verification is a continuous activity in any database design. The database design process is evolutionary in nature: it requires the continuous evaluation of the developing model by examining the effect of adding new entities and by confirming that any design changes enhance the model's accuracy. The verification process requires the following steps:
1. Identify the database's central entity. The central entity is the most important entity in the database; most of the other entities depend on it.
2. Identify and define each module and its components. The designer divides the database model into smaller sets that reflect the data needs of particular system modules such as inventory, orders, payroll, etc.
3. Identify and define each of the module's processes.
Specifically, this step requires the identification and definition of the database transactions that represent the module's real-world operations.
4. Verify each of the transactions against the database.

2. What steps must be completed before the database design is fully implemented? (Make sure that you list the steps in the correct sequence and discuss each step briefly.)

The DBLC, discussed in detail in Chapter 9, “Database Design,” constitutes a database's history, tracing it from its conceptual design to its implementation and operation. We highly recommend that the database designer follow the DBLC's steps carefully in order to ensure that the database will properly meet all user and system requirements. Before a database can be successfully implemented, the following steps must be completed:
1. Define the conceptual model's components: entities, attributes, domains, and relationships.
2. Normalize the database to ensure that all transitive dependencies are eliminated and that each entity's attributes are solely dependent on its key attribute(s).
3. Verify the conceptual model to ensure that the proposed database will meet the system's transaction requirements and that the end-user and system requirements will be met. The verification process will probably delete and/or create entities, attributes, and relationships. It may also refine existing entities, attributes, and relationships.
4. Create the logical design, which requires the definition of the table structures using a specific DBMS (relational, network, or hierarchical). Logical design also includes, if necessary, appropriate indexes and views.
5. Create the physical design to define access paths, including space allocation, storage group creation, table spaces, and any other physical storage characteristic that is dependent on the hardware and software to be used in the system's implementation.
6. Implement the design. Somehow, this last step seems to suffer from planning neglect, to the detriment of the system's operation.
Implementation, operation, and maintenance plans must (at least) include a careful definition and description of the activities required to implement the database design:
• loading and conversion
• definition of database standards
• system and procedures documentation: security, backup, and recovery
• operational procedures to be followed by users
• a detailed training plan
• identification of responsibilities for operation and maintenance

3. What major factors should be addressed when database system performance is evaluated? Discuss each factor briefly.

Database system performance refers to the system's ability to retrieve information within a reasonable amount of time and at a reasonable cost. Keeping in mind that "reasonable" means different things to different people, we must address at least these important performance factors:
• Concurrent users: For any given system, the more users connected to the system, the longer the data retrieval time.
• Resource limits: The fewer resources that are available to the user, the longer the access queues will be.
• Communication speeds: Lower communication speeds mean longer response times.
• Query response time: Queries must be tuned to provide optimum query response time. (See Appendix C, “Database Performance Tuning.”) Lack of query response tuning means slow response times. Depending on how good the design and the program code are, the query response time can vary from minutes to hours for the same query.
Although the preceding discussion is focused on the speed aspect of performance, there are other equally important issues that must be considered. A successful database implementation requires a balanced approach to all database issues, including concurrency control, query response time, database integrity, security, backup and recovery, data replication, and data distribution.

4. How would you verify the ER diagram shown in Figure QC.4? Make specific recommendations.
Figure QC.4 The ERD for Question 4
The verification process must include the following steps: 1. Identify and define the main entities, attributes, and domains. In this case, the main entities are PARTS, SUPPLIER, PRODUCT, and CUSTOMER. Identify proper primary keys and composite and multi-valued attributes.
2. Identify and define the relationships among the entities. By examining the diagram, we may conclude that several M:N relationships exist:
• PARTS and SUPPLIER
• PARTS and PRODUCT
• PRODUCT and CUSTOMER
3. Identify the composite entities and their primary and foreign keys. Each composite (bridge) entity creates the connection needed to maintain a 1:M relationship with each of the original entities.
4. Normalize the model.
5. Verify the model, starting with the identification of the central entity. Given the ER diagram's layout, we conclude that the central entity is PRODUCT.
6. Identify each module and its components. Three modules can be identified:
• Inventory, containing PARTS and SUPPLIER
• Production, containing PARTS and PRODUCT
• Sales, containing PRODUCT and CUSTOMER
7. Identify each module's processes or transaction requirements. Start by listing known transaction descriptions by module. For brevity's sake, we will use the inventory module as an example. The inventory module supports the following transactions:
• Add a new product to inventory
• Modify an existing product in inventory
• Delete a product from inventory
• Generate a list of products by product type
• Generate a price list by product type
• Query the product database by product description
Check the database model against these transaction requirements, verify the model's efficiency and effectiveness, and make the necessary changes.

5. Describe and discuss the ER model's treatment of the UCL's inventory/order hierarchy:
a. Category
b. Class
c. Type
d. Subtype

The objective here is to focus student attention on the details of the UCL's approach to inventory management. Note that the UCL's ER model uses two closely related entities to manage items in inventory: ITEM and INVENTORY_TYPE.
These two entities maintain a 1:M relationship: One item belongs to only one inventory type, but an inventory type can contain many items. Inventory types are classified through the use of a hierarchy composed of CATEGORY, CLASS, and TYPE. (We may even identify a SUBTYPE for each TYPE!) Basically, the hierarchy may be described this way: A category has many classes, and a class has many types. For example, the category hardware includes the classes computer and printer. The class computer has many types that are defined by their CPU: 486 and Pentium computers. Similarly, the category supplies can have several classes: diskette, paper, etc. Each class can have many types: 3.5 DD diskette, 3.5 HD diskette,
8.5x11 paper, 8.5x14 paper, and so on. We may even identify subtypes: Each type can have many subtypes. For example, the class "paper" includes the types “single-sheet” and “continuous-feed”; the single-sheet type may be classified by subtype 8 x 11 inches or 11 x 14 inches. The following table summarizes some of the inventory types identified in the system. Note that the hierarchy may be illustrated as shown in Table QC.5A.
Table QC.5A The Classification Hierarchy

Category   Class      Type              Subtype
Hardware   Computer   Desktop           P4
                      Desktop           P3
                      Laptop            P4
           Printer    Laser             8 ppm
                      Laser             12 ppm
                      Inkjet            Color
                      Inkjet            Black
                      Plotter           2x3
Supply     Paper      Continuous-feed
                      Single sheet      8 x 10
                      Single sheet      11 x 14
It is important to note that each item can belong to only one specific inventory type. Also, keep in mind that the ORDER_ITEM entity interfaces with the INVENTORY_TYPE, rather than with the ITEM entity. The reason for this interface is clearly based on the chapter's description of the UCL operations: "The CLD requests items without specifying a specific brand and/or vendor." Given this requirement, it is clear that the ITEM can't be identified in the request. (The ITEM's primary key is its serial number, which can't be identified until the ITEM is received!) However, to make the request, we must know the requested item's inventory type. Therefore, ORDER is related to the INVENTORY_TYPE, and not to the ITEM. The hierarchy shown here has led us to develop the classification scheme shown in the text's Inventory Classification Hierarchy, illustrated in Table QC.5B:
Table QC.5B An Inventory Classification Hierarchy

GROUP      CATEGORY        CLASS                    TYPE               SUBTYPE
HWPCDTP5   Hardware (HW)   Personal Computer (PC)   Desktop (DT)       Pentium (P5)
HWPCLP48   Hardware (HW)   Personal Computer (PC)   Laptop (LT)        Pentium IV
HWPRLS     Hardware (HW)   Printer (PR)             Laser (LS)         Standard
HWPRDM80   Hardware (HW)   Printer (PR)             Inkjet (IJ)        80-column
SUPPSS11   Supply (SU)     Paper (PP)               Single Sheet (SS)  8.5" x 11"
HWEXHDID   Hardware (HW)   Expansion Board (EX)     Video (VI)         XX
SWDBXXXX   Software (SW)   Database (DB)            XX                 XX
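As Table QC.5B suggests, each GROUP code is simply the concatenation of the two-character category, class, type, and subtype codes, with "XX" used as a placeholder when a level does not apply. A minimal sketch (the helper function name is hypothetical):

```python
# Compose a GROUP code from the two-character level codes in Table QC.5B.
# "XX" mirrors the placeholder entries in the table for unused levels.
def make_group_code(category, cls, inv_type="XX", subtype="XX"):
    return f"{category}{cls}{inv_type}{subtype}"

print(make_group_code("HW", "PC", "DT", "P5"))  # HWPCDTP5
print(make_group_code("SW", "DB"))              # SWDBXXXX
```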
The classification hierarchy may also be illustrated with the help of the tree diagram shown in Figure QC.5:
Figure QC.5 The INV_TYPE Classification Hierarchy As a Tree Diagram
CATEGORY: Hardware
   CLASS: Personal Computer (PC)
      TYPE: Desktop (DT)
         SUBTYPE: Intel P4 (300)
         SUBTYPE: Intel P5 (600)
   CLASS: Printer (PR)
      TYPE: Inkjet (IJ)
         SUBTYPE: Black (BL)
         SUBTYPE: Color (CO)
6. Modern businesses tend to provide continuous training to keep their employees productive in a fast-changing and competitive world. In addition, government regulations often require certain types of training and periodic retraining. (For example, pilots must take semiannual courses involving weather, air regulations, and so on.) To make sure that an organization can track all training received by each of its employees, trace the development of the ERD segment in Figure QC.6 from the initial business rule that states: An employee can take many courses, and each course can be taken by many employees. Once you have traced the development of the ERD segment, verify it and then provide sample data for each of the three tables to illustrate how the design would be implemented.
Figure QC.6 The ERD Segment for Question 6
Follow the verification steps described in the answer to question 4. Note that the composite TRAINING entity shown in Figure QC.6 reflects part of the verification process that began with the
M:N relationship between EMPLOYEE and COURSE. (An employee can take many courses, and many employees can take each course.) Part of the verification process involves the elimination of multi-valued attributes. For example, an EMPLOYEE table attribute such as EMP_TRAINING, containing strings such as “fire safety, weather, air regulations,” has already been eliminated by the composite TRAINING entity. The structure shown in Figure QC.6 allows us to add attributes to ensure that training details – such as dates, grades, training locations, etc. – can be traced, too.

One additional – and very important – point is worth mentioning: at this point, Figure QC.6’s ERD cannot handle recurrent training requirements. That is, if some courses must be retaken periodically, as is common in many transportation businesses, the TRAINING entity’s PK – at this point composed of EMP_NUM + COURSE_CODE – will not yield a unique value if the course is retaken from time to time. The solution to this problem can be found in one of two ways:

1. Add the training date to the TRAINING entity’s composite PK to become EMP_NUM + COURSE_CODE + TRAIN_DATE. This approach is illustrated in the examples shown in Tables QC.6A through QC.6C. Note that employee 105 took the FAR-135-P course on 26-Sep-2013 and on 11-Feb-2014. Employee 101 took the WEA-01 course on 26-Sep-2013 and on 26-Mar-2014. Note that the addition of the TRAIN_DATE to the composite PK prevents the duplication of training records. For example, if you tried to enter the first TRAINING record twice, the combination of EMP_NUM + COURSE_CODE + TRAIN_DATE would not be unique and the DBMS would diagnose an entity integrity violation.
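The entity integrity violation described above can be demonstrated with any DBMS that enforces primary keys. The following sketch uses SQLite through Python's sqlite3 module; dates are written in ISO format for simplicity:

```python
import sqlite3

# With the composite PK EMP_NUM + COURSE_CODE + TRAIN_DATE, a retake on a
# different date is accepted, but re-entering the same record is rejected.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE TRAINING (
        EMP_NUM     INTEGER,
        COURSE_CODE TEXT,
        TRAIN_DATE  TEXT,
        TRAIN_GRADE INTEGER,
        PRIMARY KEY (EMP_NUM, COURSE_CODE, TRAIN_DATE)
    )""")
row = (105, "FAR-135-P", "2013-09-26", 90)
conn.execute("INSERT INTO TRAINING VALUES (?, ?, ?, ?)", row)
# The retake on 11-Feb-2014 yields a distinct PK value, so it is accepted.
conn.execute("INSERT INTO TRAINING VALUES (?, ?, ?, ?)",
             (105, "FAR-135-P", "2014-02-11", 97))
try:
    conn.execute("INSERT INTO TRAINING VALUES (?, ?, ?, ?)", row)
    duplicate_rejected = False
except sqlite3.IntegrityError:  # entity integrity violation
    duplicate_rejected = True
print(duplicate_rejected)  # True
```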
Table QC.6A The EMPLOYEE Table Contents

EMP_NUM  EMP_LNAME
105      Ortega
101      Williams

Table QC.6B The TRAINING Table Contents

EMP_NUM  COURSE_CODE  TRAIN_DATE   TRAIN_GRADE
105      FAR-135-P    26-Sep-2013  90
105      HM-01        18-Dec-2013  92
101      FAR-135-P    23-Nov-2013  93
105      WEA-01       10-Mar-2014  87
101      HM-01        15-Sep-2013  91
101      WEA-01       26-Sep-2013  85
105      FAR-135-P    11-Feb-2014  97
101      WEA-01       26-Mar-2014  89

Table QC.6C The COURSE Table Contents

COURSE_CODE  COURSE_DESCRIPTION
FAR-135-P    Aircraft charter regulations for pilots
FAR-135-M    Aircraft maintenance for charter operations
HM-01        Hazardous materials handling
WEA-01       Aviation weather – basic operations
WEA-02       Aviation weather – instrument operations
2. Create a new PK attribute named TRAIN_NUM to uniquely identify each entity occurrence in the TRAINING entity, and then create a unique composite index composed of EMP_NUM + COURSE_CODE + TRAIN_DATE. This action will remove the weak/composite designation from TRAINING, because the TRAINING entity’s PK is no longer composed of the PK attributes of the EMPLOYEE and COURSE entities. (And the “receives” and “is used in” relationships will no longer be classified as “identifying” – thus changing the relationship descriptions from “identifying” or “strong” to “non-identifying” or “weak.”) The composite index will prevent the duplication of records. Note the change in the structure and contents of the TRAINING table shown in Table QC.6D.
Table QC.6D The Modified TRAINING Table Structure and Contents

TRAIN_NUM  EMP_NUM  COURSE_CODE  TRAIN_DATE   TRAIN_GRADE
1203       105      FAR-135-P    26-Sep-2013  90
1204       105      HM-01        18-Dec-2013  92
1205       101      FAR-135-P    23-Nov-2013  93
1206       105      WEA-01       10-Mar-2014  87
1207       101      HM-01        15-Sep-2014  91
1208       101      WEA-01       26-Sep-2013  85
1209       105      FAR-135-P    11-Feb-2014  97
1210       101      WEA-01       26-Mar-2014  89
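The second approach can be sketched the same way: a single-attribute surrogate PK (TRAIN_NUM) plus a UNIQUE composite index that still blocks duplicate training records. SQLite again stands in for whatever DBMS is actually used:

```python
import sqlite3

# Surrogate PK plus a unique composite index: the index, not the PK,
# now enforces the "no duplicate training record" rule.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE TRAINING (
        TRAIN_NUM   INTEGER PRIMARY KEY,
        EMP_NUM     INTEGER NOT NULL,
        COURSE_CODE TEXT NOT NULL,
        TRAIN_DATE  TEXT NOT NULL,
        TRAIN_GRADE INTEGER
    )""")
conn.execute("""
    CREATE UNIQUE INDEX UX_TRAINING
        ON TRAINING (EMP_NUM, COURSE_CODE, TRAIN_DATE)""")
conn.execute(
    "INSERT INTO TRAINING VALUES (1203, 105, 'FAR-135-P', '2013-09-26', 90)")
try:
    # A different surrogate key value does not help: the unique index still
    # rejects the duplicate EMP_NUM + COURSE_CODE + TRAIN_DATE combination.
    conn.execute(
        "INSERT INTO TRAINING VALUES (9999, 105, 'FAR-135-P', '2013-09-26', 90)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```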
We would recommend the second approach. Generally speaking, single-attribute PKs are preferred over composite PKs. Single-attribute PKs are more easily handled if the table is to be linked to a related table later. (The linking is done through a FK – which is the PK in the “parent” table. But if the parent table uses a composite PK, how can you then create the appropriate FK?) In any case, the declaration of a composite PK automatically generates a matching composite index, so you would not reduce the number of indexes by using approach 1.

7. You read in this appendix that:

An examination of the UCL's Inventory Management module reporting requirements uncovered the following problems:
• The Inventory module generates three reports, one of which is an Inventory Movement Report. But the inventory movements are spread across two different entities (CHECK_OUT and WITHDRAW). That spread makes it difficult to generate the output and reduces the system's performance.
• An item's quantity on hand is updated with an inventory movement that can represent a purchase, a withdrawal, a check-out, a check-in, or an inventory adjustment. Yet only the withdrawals and check-outs are represented in the system.

What solution was proposed for that set of problems? How would such a solution be useful in other types of inventory environments?

The proposed solution was to create a common entry point for all inventory movements. This common entry point is represented by a new entity named INV_TRANS. The INV_TRANS entity is
used to record an entry for each inventory transaction. In other words, the system keeps track of all inputs to and withdrawals from inventory by using this INV_TRANS entity. It is important to realize that the INV_TRANS entity is a crucial entity in the system, because it reflects all item transactions.

Such a solution is not unique to the UCL's inventory system: Most inventory systems must be able to keep track of such transactions. Having a central point of reference facilitates the processing, updating, querying, and reporting capabilities of the inventory system. The UCL's data model keeps track of several types of inventory transaction purposes or motives: check-outs, withdrawals, adjustments, and purchases. Note the system's flexibility: The user is able to classify all inventory transactions by type and/or motive. In addition to being flexible, the UCL system is easily expandable: If necessary, the system can support additional types of inventory transaction motives. For example, the system may be expanded to include inter-warehouse inventory transfers, items retired from inventory because they are date-limited, and so on. (Date-limited inventory is typical for such things as pharmaceuticals, food, etc.) Given its flexibility and expandability, we may conclude that the UCL system's inventory data model represents a very viable solution to modeling real-world inventory transactions. Therefore, it may be made to fit just about any inventory environment.
Note: Optimum vs. Implemented Solutions. The final UCL ERD makes use of the INV_TRANS entity to replace the WITHDRAW entity. Perhaps some of your students wonder about the similarity of the CHECK_OUT and CO_ITEM entities when compared to the INV_TRANS and INTR_ITEM entities. For instance, it is quite appropriate to argue that CHECK_OUT is a type of inventory transaction and that, therefore, CHECK_OUT is a subtype of an INV_TRANS supertype. Why did the designer create such apparent system redundancy? Why wasn't the type/subtype hierarchy used more efficiently? (Classification hierarchies and supertypes/subtypes are covered in Chapter 5, “Advanced Data Modeling.”)
To answer this question, return to the discussion about fine-tuning the database for performance, integrity, and security. Based on the estimated number of transactions, the number of items, and the number of possible concurrent accesses to the INV_TRANS entity, it was clear that this entity would be one of the most active in the system. The large number of check-out reports and the even larger expected number of inventory transactions prompted both the designer and the end user to choose between controlled redundancy and a performance bottleneck. Perhaps some students will argue that the use of the CHECK_OUT and CO_ITEM entities represents a major burden to the system and that, therefore, the system should be implemented without these entities. This argument clearly has some merit: The only immediate advantage of having the CHECK_OUT and CO_ITEM entities is that the Inventory check-outs report uses these entities, rather than the INV_TRANS and INTR_ITEM entities. Therefore, the elimination of CHECK_OUT
and CO_ITEM reduces the concurrent access conflicts for the INV_TRANS and INTR_ITEM entities. Finally, we note that both the designer and the end user are aware of the consequences of the selected solution. Remember, this is a real solution to a real problem, and it helps to illustrate the point that we made earlier: The best solution is not always the one that is implemented. Each system is subject to constraints, and the designer must inform the end user of the consequences of the data modeling design selections.
An important note on primary key selection for multi-user systems

The LOG is an entity that keeps a record of all the students who use the UCL. Note that the primary key is formed by LOG_DATE, LOG_TIME, and USER_ID. Ask the students why USER_ID has been made a part of the primary key. Since each user can be in only one place at one time, it seems safe to assume that USER_ID does not need to be part of the primary key. So why not just use a primary key composed of LOG_DATE and LOG_TIME?

For example, suppose that the student Christobal Columbus enters the UCL on 02-Mar-2014 at 02:10:11 p.m. To use the UCL's facilities and services, Mr. Columbus must give his student identification card to the lab assistant. Clearly, Mr. Columbus can be at only that one location at that time. When the lab assistant enters Mr. Columbus's USER_ID, that entry is made at a specific and unique date and time. When the lab assistant registers the next student, that student's USER_ID is entered at a different time on the computer's clock. Since every USER_ID entry is made at a different time, there seems to be no need for USER_ID to be part of the primary key.

However, this scenario is correct only in a single-user, stand-alone system. Remember that the UCL system runs on a LAN and that the ACCESS module is accessed by two lab assistants through two different terminals. Therefore, it is possible that, at a given time, two data entries are made at the same computer clock time. When the data are saved to the database, one of the two entries will be executed first; and, to preserve entity integrity, the second entry will be aborted because the date and time already exist in the database. Since it is possible to have two users register in the LOG during the same day and at the same time, only their USER_IDs will be different. Therefore, to ensure the uniqueness of the primary keys, the inclusion of USER_ID as part of the primary key is quite appropriate.
Of course, you might use the LOG_READER, instead of the USER_ID, to define the primary key. After all, the same LOG_READER cannot be swiped twice at precisely the same time. In either case, the uniqueness of the entry is preserved, thus preserving entity integrity. Which attribute (USER_ID or LOG_READER) is used as a part of the primary key is the designer's decision. The only requirement is that entity integrity is maintained.
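The argument for including USER_ID in the LOG's primary key can also be demonstrated directly. In the sketch below (SQLite via Python's sqlite3 module; the table layout is an assumption based on the discussion above), two different users registered at the same computer-clock time are both accepted, while a true duplicate entry is rejected:

```python
import sqlite3

# LOG primary key = LOG_DATE + LOG_TIME + USER_ID, as discussed above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE LOG (
        LOG_DATE TEXT,
        LOG_TIME TEXT,
        USER_ID  TEXT,
        PRIMARY KEY (LOG_DATE, LOG_TIME, USER_ID)
    )""")
# Two lab assistants happen to save entries at the same clock time:
# different USER_IDs keep the PK values unique, so both are accepted.
conn.execute("INSERT INTO LOG VALUES ('2014-03-02', '14:10:11', 'COLUMBUS')")
conn.execute("INSERT INTO LOG VALUES ('2014-03-02', '14:10:11', 'MAGELLAN')")
try:
    conn.execute("INSERT INTO LOG VALUES ('2014-03-02', '14:10:11', 'COLUMBUS')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```

With only LOG_DATE + LOG_TIME as the PK, the second assistant's entry would have been the one rejected.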
Problem Solutions

1. Verify the conceptual model you created in Appendix B, Problem 3. Create a data dictionary for the verified model.

The verification of the car dealership's database design conforms to the verification process described in Appendix C. (We have also illustrated the verification process in this appendix's review question 4.) Since the verification process has already been explored in depth in several places, we will focus on the ERDs that were modified during the course of the verification process. Use the Data Dictionary format shown in Chapter 7, “Introduction to Structured Query Language (SQL),” Table 7.3 as your data dictionary template.

The basic verified database design is shown in Figure PC.1. As you discuss Figure PC.1, note that the verification process substantially modified the service component of the initial ERD. (See the discussion that accompanies Figure PD.3a in this manual’s Appendix D, Problem 3.) These changes reflect the increasingly important accountability requirements. As you examine the ERD in Figure PC.1, focus on the following features:
• SALESPERSON and MECHANIC are subtypes of the supertype EMPLOYEE. This feature is based on the likelihood that the subtypes contain data that are unique to those subtypes. For example, a salesperson is likely to have at least part of his/her pay determined by sales commissions. Similarly, mechanics are likely to have special certification and training requirements that “general” employees are not likely to have. The use of these subtypes eliminates nulls in the EMPLOYEE table, thus making them desirable in this case.
• Although some employee job-related data are stored in their subtypes – see, for example, our discussion of the SALESPERSON and MECHANIC subtypes – we still need to know what the employee job assignments are. Although we have not included pay and benefit options in this design, both options are likely to be job related.
Some jobs are paid on an hourly basis, some on a weekly basis, and some jobs are salaried. Base pay schedules are usually determined by job qualifications. Therefore, the JOB entity stores a JOB_PERIOD attribute (hour, week, or year) and a JOB_PAY attribute. If the JOB_PERIOD = “hour”, a JOB_PAY = $18.90 is clearly an hourly rate. If the JOB_PERIOD = “year”, a JOB_PAY = $45,275 is clearly a yearly salary. In larger companies, job assignments are useful in tracking the distribution of job “densities” to see if some job classification distributions are appropriate to meet the business objectives. (Do we have too many employees who are classified as “support” personnel? Too many accountants?) Also, note that the relationship between JOB and EMPLOYEE reflects the business rule that each employee has only one job assignment at a time. Naturally, any given job can be held by many employees. For example, many employees may be mechanics, support personnel, accountants, and so on.

…Additional discussion points follow Figure PC.1.
Figure PC.1 The Verified Car Dealership Crow’s Foot ERD
… continued discussion of Figure PC.1’s ERD.
• To track all maintenance procedures and parts precisely, only qualified mechanics may open and close service logs, check out parts, and sign off service work. Note that the PART_LOG tracks all parts that have been logged out. The relationship between SERVICE_LOG and PART_LOG lets us trace all checked-out parts to a specific service log entry. The use of the PART_CODE in both the SVC_LOG_LINE and the PART_LOG entities makes it possible to write a query to let us check whether or not a logged-out part was actually used.
• All the maintenance actions can be tracked at this point. We know who opened and closed the service log through the SVC_LOG_ACTION. We know which mechanic performed each maintenance procedure (in the SVC_LOG_LINE), and we know which mechanic checked out which part(s) – in PART_LOG.
• If a car was sold by the dealer, that fact is recorded in an invoice. However, the CAR entity may be expanded to include a “bought here” Y/N attribute, in addition to mileage and other pertinent data. Also, cars owned by the dealership may simply show the dealer as the “customer.” (Naturally, you can add a DLR_CAR entity if the dealer car attributes and data tracking requirements are different from the customer CAR data.)
• Before any car is sold to a customer, that car must be inspected and, if necessary, repaired. Therefore, even a new car will show up in the SERVICE_LOG, and the SVC_LOG_NUM will never be null in the INVOICE … even if the invoice records the sale of a car, rather than a specific service charge.
• The PAYMENT entity has a rather limited set of options at this point. However, it does enable the manager to track multiple payments on a given invoice and to keep track of specific invoice balances. Further verification procedures would (most likely!) add functionality to the PAYMENT entity. For example, you might change the PAYMENT entity to an account transaction entity, perhaps named ACCT_TRANSACTION. This change would reflect the need to identify the transaction type – debit or credit – and, if a payment is made, the payment mode – cash, credit card, or check.
• The employee qualifications can now be tracked without limit. If an employee gains an additional qualification, all that is needed is an entry in the EDUCATION table.
• The customer car data are stored in CAR, so we can keep the service records on all the customer cars, thus producing the required car histories. (To save space, we have not included all the appropriate attributes – but you can add such attributes as CAR_BOUGHT (Y/N) to indicate whether or not the car was bought at this dealership, CAR_LAST_MILES to indicate the mileage recorded during the most recent service, and so on.) At this point, we assume that the SERVICE_LOG and SVC_LOG_LINE records yield the information required to bill the customer.
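The query mentioned above – checking whether a logged-out part was actually used – can be sketched as an outer join on SVC_LOG_NUM and PART_CODE. The simplified table layouts below are assumptions for illustration, using a few rows from the sample data:

```python
import sqlite3

# Parts that appear in PART_LOG (logged out) but not in SVC_LOG_LINE (used)
# are flagged by a LEFT JOIN with an IS NULL test.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PART_LOG (SVC_LOG_NUM INTEGER, PART_CODE TEXT)")
conn.execute("CREATE TABLE SVC_LOG_LINE (SVC_LOG_NUM INTEGER, PART_CODE TEXT)")
conn.executemany("INSERT INTO PART_LOG VALUES (?, ?)",
                 [(10013, "FLTR-0156"), (10013, "Oil-PZ30/40"),
                  (10014, "THERM-007B")])
conn.executemany("INSERT INTO SVC_LOG_LINE VALUES (?, ?)",
                 [(10013, "FLTR-0156"), (10013, "Oil-PZ30/40")])
unused = conn.execute("""
    SELECT PL.SVC_LOG_NUM, PL.PART_CODE
    FROM PART_LOG PL
    LEFT JOIN SVC_LOG_LINE SL
      ON PL.SVC_LOG_NUM = SL.SVC_LOG_NUM
     AND PL.PART_CODE = SL.PART_CODE
    WHERE SL.PART_CODE IS NULL""").fetchall()
print(unused)  # [(10014, 'THERM-007B')]
```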
To set the stage for further discussion of Figure PC.1’s ERD, a few sample data entries in the added SERVICE_LOG, LOG_ACTION, MECHANIC, LOG_LINE, PART, and PART_LOG tables are useful. (The PART_CODE entry “000000” is a dummy PART entry that signifies “no part used.”) Also note that the order of the attributes is immaterial. (In other words, whether CAR_VIN is shown in the last column of the SERVICE_LOG table or as the first, second, or third column has no bearing on the discussion or on the results obtained from the use of the table.)
Sample SERVICE_LOG Data

SVC_LOG_NUM  LOG_COMPLAINT                          SVC_LOG_CHARGE  CAR_VIN
10012        Hard to start. Accelerates poorly.     $89.75          2AA-W-123456
10013        Oil change. Rotate and balance tires.  $19.95          5DR-T-8765432
10014        Temp gauge shows high temps.           $135.70         4UY-D-6543210
Sample SVC_LOG_ACTION Data

SVC_LOG_NUM  SVC_LOGACT_TYPE  SVC_LOGACT_DATE  EMP_NUM
10012        Open             03-Mar-2014      104
10013        Open             03-Mar-2014      112
10012        Close            04-Mar-2014      112
10014        Open             04-Mar-2014      104
10013        Close            04-Mar-2014      104
Sample SVC_LOG_LINE Data (Several attributes left out to save space)

SVC_LINE_NUM  SVC_LOG_NUM  SVC_LINE_WORK                      EMP_NUM  PART_CODE
1             10012        Cleaned injection nozzles          106      000000
1             10013        Drained oil                        112      000000
2             10013        Installed filter                   112      FLTR-0156
3             10013        Replaced oil                       112      Oil-PZ30/40
4             10013        Rotated tires                      114      000000
5             10013        Balanced tires, using four weights (LF0.5oz, RF1.1oz, RR1.2oz, LR0.8oz)  106  WT-LD10012
1             10014        Drained coolant                    104      000000
2             10014        Replaced thermostat                112      THERM-007B
3             10014        Replaced coolant                   104      COOL-289XZ
Sample PART_LOG Data

PARTLOG_NUM  EMP_NUM  PART_CODE    SVC_LOG_NUM  PARTLOG_DATE  PARTLOG_UNITS
10185        112      FLTR-0156    10013        03-Mar-2014   1
10186        112      Oil-PZ30/40  10013        03-Mar-2014   8
10187        114      WT-LD10012   10013        03-Mar-2014   4
10188        112      THERM-007B   10014        04-Mar-2014   1
10189        114      COOL-289XZ   10014        04-Mar-2014   1
The main processes that can be identified in this system include:
• The generation of an invoice (INSERT).
• The car sales generation and reports (SELECT).
• The registration of a service for a customer's car (INSERT, UPDATE).
• The registration of the work log of the employees (mechanics) who worked on a car (INSERT, UPDATE).
• The registration of parts inventory (INSERT, UPDATE).
• The registration of parts used in a service (INSERT, UPDATE).
• The registration of the car history (INSERT, UPDATE).
• Queries and reports such as:
  ➢ Parts List
  ➢ Car Price List
  ➢ Sales Reports
  ➢ Service Report
  ➢ Car History Report
  ➢ Parts Used Report
  ➢ Work Log Report

The designer must check that the database model supports all these processes and that the model is flexible enough to support future modifications. If problems are encountered during the model's verification against the required database transactions that are designed to support the identified processes, the designer must make the necessary changes to the data model. These changes are reflected in Figure PC.1.
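One of the listed reports, the Car History Report, can be sketched as a simple SELECT against the SERVICE_LOG sample data shown earlier. The table layout below is simplified and assumed for illustration:

```python
import sqlite3

# Car History Report sketch: all service log entries for one car,
# found through the CAR_VIN value recorded in SERVICE_LOG.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SERVICE_LOG (
        SVC_LOG_NUM   INTEGER PRIMARY KEY,
        LOG_COMPLAINT TEXT,
        CAR_VIN       TEXT
    )""")
conn.executemany("INSERT INTO SERVICE_LOG VALUES (?, ?, ?)", [
    (10012, "Hard to start. Accelerates poorly.", "2AA-W-123456"),
    (10013, "Oil change. Rotate and balance tires.", "5DR-T-8765432"),
    (10014, "Temp gauge shows high temps.", "4UY-D-6543210")])
history = conn.execute("""
    SELECT SVC_LOG_NUM, LOG_COMPLAINT
    FROM SERVICE_LOG
    WHERE CAR_VIN = ?
    ORDER BY SVC_LOG_NUM""", ("4UY-D-6543210",)).fetchall()
print(history)  # [(10014, 'Temp gauge shows high temps.')]
```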
NOTE The verification process for Problems 2-5 conforms to the process discussed at length in Problem 1. Therefore, we will only show the verified ERDs. The data dictionary format example shown in Problem 1 can also be used as the template in Problems 2-5; therefore, we do not show additional data dictionaries for Problems 2-5. The ERDs supply the necessary entities, the attribute names, and the relationships. However, it will be very useful to compare the ERDs in the following problems to the original ERDs – in the previous appendix – from which they were derived.
2. Verify the conceptual model you created in Appendix B, Problem 4. Create a data dictionary for the verified model.

Compare the ERD shown in Figure PC.2A to the ERD shown in Figure PD.4A (see Appendix D) to see the impact of the verification process. Use the Data Dictionary format shown in Chapter 7, Table 7.3 as your data dictionary template.
Figure PC.2A The Crow's Foot Verified Conceptual Model for the Video Rental Store
As you discuss the ERD components in Figure PC.2A, note particularly the following points:
• Remind your students that relationships are read from the parent to the related entity. Therefore, ORDER contains ORD_LINE. (The natural tendency to read from top to bottom or from left to right is not the governing factor in an ERD!)
• We can now track individual copies of each movie. If there are 12 copies of a given movie, each copy can be rented out separately.
• ORD_LINE and RENT_LINE are composite entities. So why is COPY not a composite entity? Here is an excellent example of why single-attribute PKs are a requirement when the entity is referenced by another entity. In this case, the COPY entity’s PK is referenced by the RENT_LINE. Therefore, COPY must have a single-attribute PK. (Note that the PK of the COPY entity is the single attribute COPY_CODE, rather than the combination of MOVIE_CODE and COPY_CODE.)
• It is reasonable to assume that each order goes to a particular vendor. Therefore, the VEND_CODE is the FK in ORDER, rather than in ORD_LINE. However, if the order goes to a clearing house and you still want to keep track of the individual vendors that supplied the movies to the clearing house, VEND_CODE will be the FK in ORD_LINE.
NOTE The design shown in Figure PC.2A is implemented in a small sample database named RC_Video.mdb. This database, stored in MS Access format, is located on the Instructor’s CD. If you want your students to write the applications for this segment of the database, you will find that the appropriate tables are available. Because our discussion focus is on the database’s rental transaction segment, the database does not contain all of the tables that are shown in Figure PC.2A’s ERD. We have also added a number of attributes – especially in the RENTAL table – to make it easier to see how the actual applications might be developed.
The partial implementation of the ERD shown in Figure PC.2A is reflected in the RC_Video database’s relational diagram segment depicted in Figure PC.2B.
Figure PC.2B The Relational Diagram for the RC_Video Database
Once the database design has been implemented, you can easily use MS Access to illustrate a variety of implementation issues. For example, in a real-world application the RENTLINE table's RENTLINE_DATE_OUT can simply be generated by specifying the default date to be the current date, Date(). The RENTLINE_DATE_DUE would then be Date()+2, assuming that the checked-out videos are due two days later. (Or substitute whatever criteria you want to use in the queries.)

3. Verify the conceptual model you created in Appendix B, Problem 5. Create a data dictionary for the verified model.

Compare the ERD shown in Figure PC.3 to the ERD shown in Figure PB.5a. Note that the original ERD survived the verification process intact. In this case, the verification process merely confirmed that the model met all the database requirements. Use the Data Dictionary format shown in Chapter 7, Table 7.3 as your data dictionary template.
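The Date() and Date()+2 defaults discussed for the RENTLINE table can be sketched outside Access as well. This is a minimal sketch, assuming a two-day rental period; the function name is our own:

```python
import datetime

def rental_dates(days_due: int = 2):
    """Mimic the Access defaults: DATE_OUT = Date(), DATE_DUE = Date() + 2."""
    date_out = datetime.date.today()
    date_due = date_out + datetime.timedelta(days=days_due)
    return date_out, date_due

out, due = rental_dates()
```

Substituting a different `days_due` value corresponds to changing the due-date criterion in the queries.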
Figure PC.3 The Revised (Final) Crow's Foot ERD for the Manufacturer
4. Verify the conceptual model you created in Appendix B, Problem 6. Create a data dictionary for the verified model. Compare the ERD shown in Figure PC.4A to the ERD shown in Figure PD.6A (in Appendix D) to see the impact of the verification process. Use the Data Dictionary format shown in Chapter 7, Table 7.3 as your data dictionary template.
Figure PC.4A The Crow's Foot ERD for Problem 4 (The Hardware Store)
NOTE The following screen images are based on the database named RC_Hardware. This database is stored in MS Access format on your instructor's CD. If you want your students to write the applications for this segment of the database, you will find that the appropriate tables are available. (We have added attributes in various tables to enhance their information content.) However, because our discussion focus is on the database's sales transaction segment, the database does not contain the DEPENDENT table, nor does it contain the VENDOR, ORDER, and ORD_LINE tables that are shown as entities in Figure PC.4A.
As you discuss the ERD shown in Figure PC.4B, note that the transactions are tracked in the ACCT_TRANSACTION table. The sample table contents are captured in the screen shown in Figure PC.4B. You can easily demonstrate with a set of queries applied to this database that the inclusion of this table structure in the database design yields very desirable results.
Figure PC.4B The RC_Hardware ACCT_TRANSACTION Table Contents
As you discuss the sample table contents in Figure PC.4B with your students, note that the invoice balances are stored in this table, rather than in the INVOICE table. The reason for this arrangement is simple: the end user must be able to track the remaining balances after each transaction. If the balances for each of the invoices were kept in the INVOICE table, you would be limited to seeing only the most recent balance. Better yet, you can now track all payment transactions by customer or by invoice. For example, a simple query can be written to show the CUSTOMER, INVOICE, and ACCT_TRANSACTION results grouped by customer. Therefore, you can track the entire payment history for each customer. (See Figure PC.4C – note that the query name is shown in the header.)
Figure PC.4C The RC_Hardware Transaction Query Results
Naturally, the CUST_BALANCE value in the CUSTOMER table and the remaining TRANS_INV_BALANCE value in the ACCT_TRANSACTION table must be updated according to the TRANS_AMOUNT value entered by the end user in the ACCT_TRANSACTION table. The applications software must be written to make such updates automatically. For example, if you use Microsoft Access, you can use macros or you can use VBA code to accomplish this task. As you examine the query output in Figure PC.4C, note that you can easily trace the transactions for each of the customers. For example:
• Customer 10012 (Smith) made a purchase (invoice #1, transaction #1) on February 3, 2014. The transaction amount was $239.21, but customer 10012 made only a partial payment of $100.00, thus leaving a balance of $139.21 on invoice #1.
• Customer 10012 made another purchase (transaction #6, $27.98) on February 15, 2014. This time, customer 10012 paid the entire invoice amount. (Note that the remaining balance for invoice #20 is $0.00 for that transaction.)
• Customer 10012 made a $50.00 payment on account (see transaction #7 on February 15, 2014), leaving the balance at $139.21 – $50.00 = $89.21 for the original invoice #1. (Note that the remaining balance for invoice #1 after transaction #1 was $139.21 on 3-Feb-2014.) The original invoice amount of $239.21 was retrieved from the INVOICE table used in this query and this value is not – and must not be – updated. (However, the applications software must update the CUSTOMER table's customer balance to show the total of all outstanding balances for that customer.)
• Customer 10012 made a $20.00 payment on account (see transaction #10) on February 17, 2014, leaving the balance for invoice #1 at $89.21 – $20.00 = $69.21. Again, the original invoice amount of $239.21 was retrieved from the INVOICE table used in this query and this value is not – and must not be – updated.
• Customer 10020 (Rieber) received a $10 refund (see transaction #9 on February 17, 2014.)
This transaction was applied to the outstanding balance of $92.19 (see transaction #8) for invoice #12, thus reducing the remaining balance for invoice #12 from $92.19 to $82.19. If the customer comes in to make a payment on account, the system’s end user must be able to query the INVOICE table to find the invoices with outstanding invoice balances. The customer then makes a payment to a specific invoice in the ACCT_TRANSACTION table and the applications software will update both the remaining balance in the ACCT_TRANSACTION table and the customer balance in the CUSTOMER table.
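The balance-tracking behavior walked through above can be demonstrated with a small sketch. The ACCT_TRANSACTION column names come from the discussion; the INVOICE column names, data types, and SQLite dialect are illustrative assumptions, and the sample rows reproduce the invoice #1 history for customer 10012:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE INVOICE (
    INV_NUM    INTEGER PRIMARY KEY,
    CUS_CODE   INTEGER,
    INV_AMOUNT REAL                 -- original amount; never updated
);
CREATE TABLE ACCT_TRANSACTION (
    TRANS_NUM         INTEGER PRIMARY KEY,
    INV_NUM           INTEGER REFERENCES INVOICE (INV_NUM),
    TRANS_DATE        TEXT,
    TRANS_AMOUNT      REAL,
    TRANS_INV_BALANCE REAL          -- remaining balance AFTER this transaction
);
""")
conn.execute("INSERT INTO INVOICE VALUES (1, 10012, 239.21)")
conn.executemany("INSERT INTO ACCT_TRANSACTION VALUES (?,?,?,?,?)", [
    (1,  1, '2014-02-03', 100.00, 139.21),
    (7,  1, '2014-02-15',  50.00,  89.21),
    (10, 1, '2014-02-17',  20.00,  69.21),
])
# The full payment history per customer: every transaction keeps its own
# remaining balance, so no history is lost.
history = conn.execute("""
    SELECT I.CUS_CODE, T.INV_NUM, T.TRANS_DATE, T.TRANS_AMOUNT, T.TRANS_INV_BALANCE
    FROM   INVOICE I JOIN ACCT_TRANSACTION T ON I.INV_NUM = T.INV_NUM
    ORDER  BY I.CUS_CODE, T.INV_NUM, T.TRANS_NUM
""").fetchall()
```

Storing the balance in each transaction row is what makes this query possible; keeping only a single balance in INVOICE would limit you to the most recent value.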
If you want to know the entire payment history for each invoice, you can write the query to group the results by invoice number. Figure PC.4D shows the results. It is – again – worth noting that such capability is provided at the database design level.
Figure PC.4D The RC_Hardware Invoice Payment History
As you discuss Figure PC.4D, note that all the payment transactions for each invoice are easily traced. For example:
• The total charge placed on invoice #1 is $239.21. The initial payment on February 3, 2014 was $100.00, leaving a balance of $139.21.
• The next payment on invoice #1 was made on February 15. This payment of $50.00 leaves a balance of $89.21. Note that the invoice amount, $239.21, is stored in the INVOICE table and this amount must not be changed.
• The next payment on invoice #1 was made on February 17. This payment of $20.00 leaves a balance of $69.21. Again note that the invoice amount, $239.21, is stored in the INVOICE table and this amount must not be changed.
5. Verify the conceptual model you created in Appendix B, Problem 7. Create a data dictionary for the verified model. Compare the ERD shown in Figure PC.5A to the ERD shown in Figure PD.7A (in Appendix D) to see the impact of the verification process. Note the ternary relationship between SIGN_OUT, LOG_LINE, PART, and EMPLOYEE. This relationship enables the end user to track all parts used in each of the log lines for each of the logs and to verify that the parts that were signed out for the log line were, in fact, used in that log line's maintenance procedure. Use the Data Dictionary format shown in Chapter 7, Table 7.3 as your data dictionary template.
Figure PC.5A The Verified Crow's Foot ERD for ROBCOR Aircraft Service
The basic ERD shown in Figure PC.5A can easily be modified to incorporate additional tracking capability for a host of other requirements. For example, note that Figure PC.5B includes all the educational, training, and testing options for ROBCOR Aircraft Service's employees. Given the growing regulatory environment and increasingly restrictive insurance requirements, such detailed tracking requirements are becoming more common in a wide range of business operations. Therefore, discussions about the tracking requirements in the production database design are very productive.
Figure PC.5B The Modified Crow's Foot ERD for ROBCOR Aircraft Service
To help you demonstrate the use of the composite PKs in the EMP_TEST, EDUCATION, and TRAINING tables, we have implemented a segment of the FlyFar design in the FlyFar database. The relational diagram for the FlyFar database is shown in Figure PC.5C.
NOTE The following screen images are based on the database named FlyFar. This database, stored in MS Access format, may be found on your instructor's CD. If you want your students to write the applications for this segment of the database, you will find that the appropriate tables are available. (We have added attributes in various tables to enhance their information content.)
Figure PC.5C The Relational Diagram for the FlyFar Database Segment
Note the effect of the composite PKs in the composite entities shown in Figure PC.5C. If you examine the last record in the EMP_TEST table shown in Figure PC.5D, you will see that this attempted record entry duplicates a previously entered record. However, note that the DBMS – in this case, Microsoft Access – has caught the attempted duplication. (The use of the composite PK – EMP_NUM + TEST_CODE + EMPTEST_DATE – requires the PK entries to be unique in order to avoid an entity integrity violation. Therefore, the system catches the duplicate record before you have a chance to save it. In fact, to avoid the entity integrity violation, the DBMS will not permit you to save the duplicate record.)
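The same entity-integrity behavior can be reproduced in any DBMS, not just Access. The following sketch uses SQLite through Python's sqlite3 module; the column names come from the EMP_TEST discussion, while the data types, the EMPTEST_RESULT attribute, and the second table name are illustrative assumptions. It also previews the surrogate-key alternative (single-attribute PK plus a unique index on the candidate key):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Composite-PK version: the PK itself enforces entity integrity.
conn.execute("""
    CREATE TABLE EMP_TEST (
        EMP_NUM        INTEGER,
        TEST_CODE      TEXT,
        EMPTEST_DATE   TEXT,
        EMPTEST_RESULT TEXT,
        PRIMARY KEY (EMP_NUM, TEST_CODE, EMPTEST_DATE)
    )""")
conn.execute("INSERT INTO EMP_TEST VALUES (105, 'FAR135-w', '2013-01-22', 'Pass')")
# Same employee and test on a DIFFERENT date: accepted.
conn.execute("INSERT INTO EMP_TEST VALUES (105, 'FAR135-w', '2014-03-01', 'Pass')")
try:
    # Exact duplicate of the previous PK values: rejected by the DBMS.
    conn.execute("INSERT INTO EMP_TEST VALUES (105, 'FAR135-w', '2014-03-01', 'Fail')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Surrogate-key alternative: a single-attribute, system-generated PK plus a
# unique index on the original candidate key still prevents duplicates.
conn.execute("""
    CREATE TABLE EMP_TEST2 (
        EMPTEST_NUM  INTEGER PRIMARY KEY,
        EMP_NUM      INTEGER,
        TEST_CODE    TEXT,
        EMPTEST_DATE TEXT
    )""")
conn.execute("CREATE UNIQUE INDEX idx_emptest ON EMP_TEST2 (EMP_NUM, TEST_CODE, EMPTEST_DATE)")
conn.execute("INSERT INTO EMP_TEST2 (EMP_NUM, TEST_CODE, EMPTEST_DATE) "
             "VALUES (105, 'FAR135-w', '2013-01-22')")
try:
    conn.execute("INSERT INTO EMP_TEST2 (EMP_NUM, TEST_CODE, EMPTEST_DATE) "
                 "VALUES (105, 'FAR135-w', '2013-01-22')")
    surrogate_duplicate_rejected = False
except sqlite3.IntegrityError:
    surrogate_duplicate_rejected = True
```

Either design rejects the duplicate; the surrogate-key version additionally gives related tables a single-attribute PK to reference.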
Figure PC.5D A Duplicate Record Warning
If you change the test date to indicate that the test result to be entered is different from an earlier test result, the DBMS will accept the data entry. (Note that the FAR135-w test was taken twice by employee 105: once on 22-Jan-2013 and once on 01-Mar-2014.)
Remind your students that you can also create a single-attribute, system-generated PK named EMPTEST_NUM for the tblEMPTEST table in Figure PC.5C. This action will convert the composite (weak) EMP_TEST entity to a strong entity. (The EMP_NUM and TEST_CODE remain as foreign keys.) However, if you still want to avoid the duplication of records – a very desirable feature – you must maintain a candidate key composed of EMP_NUM, TEST_CODE, and EMPTEST_DATE, and you must set the index properties to "required" and "unique" for each of the attributes in that candidate key. (The same features may be used in the tblEDUCATION and tblTRAINING tables.) Whether you use a single-attribute PK or a composite PK may depend on specified system transaction and/or tracking requirements. The single-attribute PK/composite PK decision is often a function of professional judgment – clearly, the composite PKs work well in the original design shown in Figure PC.5B. However, if a PK is to be referenced by the FK(s) in one or more related tables, the creation of a single-attribute PK is appropriate. In fact, trying to create a relationship between a FK in one table and a composite PK in a related table will quickly illustrate the need for a single-attribute PK. In any case, query design becomes a more complex task when relationships based on composite PKs are traced through several levels.

6. Design (through the logical phase) a student advising system that will enable an advisor to bring up the student's complete performance record at the university. A sample output screen should look like the one shown in Table PC.6.
Table PC.6 The Student Transcript for Problem 6

Name: Xxxxxxxxxxxxxxxxx X. Xxxxxxxxxxxxxxxxxxxxxxx                Page # of ##
Department: xxxxxxxxxxxxxxxxxxxxxxx                                Major: xxxxxxxxxxxxxx
Social Security Number: ###-##-####                                Report Date: ##/Xxx/####

Spring, 20XX
Course                               Hours    Grade    Grade points
ENG 111 (Freshman English)             3        B        ##
Xxxxxxxxxxxxxxxxxxxxxxxxxxx            #        X        ##
Xxxxxxxxxxxxxxxxxxxxxxxxxxx            #        X        ##
Xxxxxxxxxxxxxxxxxxxxxxxxxxx            #        X        ##
Xxxxxxxxxxxxxxxxxxxxxxxxxxx            #        X        ##
Total this semester                   ##                 ##     GPA: #.##
Total to date                        ###                ###     Cumulative GPA: #.##

Summer, 20XX
Course                               Hours    Grade    Grade points
CIS 300 (Computers in Society)         3        A        ##
Xxxxxxxxxxxxxxxxxxxxxxxxxxx            #        X        ##
Xxxxxxxxxxxxxxxxxxxxxxxxxxx            #        X        ##
Xxxxxxxxxxxxxxxxxxxxxxxxxxx            #        X        ##
Xxxxxxxxxxxxxxxxxxxxxxxxxxx            #        X        ##
Total this semester                   ##                 ##     GPA: #.##
Total to date                        ###                ###     Cumulative GPA: #.##

Fall, 20XX
Course                               Hours    Grade    Grade points
CIS 400 (Systems Analysis)             3        B        ##
Xxxxxxxxxxxxxxxxxxxxxxxxxxx            #        X        ##
Xxxxxxxxxxxxxxxxxxxxxxxxxxx            #        X        ##
Xxxxxxxxxxxxxxxxxxxxxxxxxxx            #        X        ##
Xxxxxxxxxxxxxxxxxxxxxxxxxxx            #        X        ##
Total this semester                   ##                 ##     GPA: #.##
Total to date                        ###                ###     Cumulative GPA: #.##
Note that this problem is, basically, an extension of the database design developed in Chapter 4's discussion of Tiny College. We merely need to expand the presentation to enable us to develop the required outputs.
The Development of the ERD

To satisfy the requirements, the ERD must be based on (at least) the following business rules:
1. A department has many students, and each student "belongs" to only one department.
2. A student takes many classes, and each class is taken by many students.
3. A student may enroll in a class one or more times. Naturally, if a class is taken more than once, that "repeat" class is taken in a different semester.
4. A class is a section of a course; i.e., a course can yield many classes, but each class references only one course. For example, two sections of the course described by CIS483, Database Systems, 3 credit hours, Prerequisites: 9 hours of CIS courses, including CIS370 (Systems Analysis), may be taught in the Fall and Spring semesters, while the course may not be offered in the Summer session. (Since a course is not necessarily offered each semester, CLASS is optional to COURSE.)
5. Each course belongs to a department. For example, the English department would not offer a Database course.

The database should include at least the following components:

DEPARTMENT (DEPT_CODE, DEPT_NAME)
STUDENT (STU_NUM, STU_LNAME, STU_FNAME, STU_INITIAL, DEPT_CODE)
   DEPT_CODE references DEPARTMENT
COURSE (CRS_CODE, CRS_DESC, CRS_CREDIT_HOURS)
CLASS (CLASS_ID, CRS_CODE, CLASS_PLACE, CLASS_TIME)
   CRS_CODE references COURSE
ENROLL (STU_NUM, CLASS_CODE, ENROLL_SEMESTER, ENROLL_CRS_CREDITS, ENROLL_CRS_NAME, ENROLL_GRADE)
   STU_NUM references STUDENT, CLASS_CODE references CLASS

Note 1: The inclusion of ENROLL_SEMESTER allows a student to register for a class one or more times, but only one time per semester.
Note 2: The ENROLL entity includes the course description and course credits, because the course name and its credits may change over time. Therefore, you cannot count on the current course name and credit value to reconstruct previous course names and credit hours used on a transcript, which is a historical record. Naturally, to avoid data anomalies, the applications software should be written to make sure that the system transfers current course data to the current transcript record.
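The point of Note 2 – copying the current course data into the transcript record at enrollment time – can be sketched as follows. The entity and attribute names come from the design above; the data types, the enroll() helper, and the sample rows are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE COURSE (CRS_CODE TEXT PRIMARY KEY, CRS_DESC TEXT, CRS_CREDIT_HOURS INTEGER);
CREATE TABLE CLASS  (CLASS_ID INTEGER PRIMARY KEY, CRS_CODE TEXT REFERENCES COURSE (CRS_CODE));
CREATE TABLE ENROLL (
    STU_NUM INTEGER, CLASS_CODE INTEGER, ENROLL_SEMESTER TEXT,
    ENROLL_CRS_CREDITS INTEGER, ENROLL_CRS_NAME TEXT, ENROLL_GRADE TEXT,
    PRIMARY KEY (STU_NUM, CLASS_CODE, ENROLL_SEMESTER)
);
""")
conn.execute("INSERT INTO COURSE VALUES ('CIS483', 'Database Systems', 3)")
conn.execute("INSERT INTO CLASS VALUES (1, 'CIS483')")

def enroll(stu_num, class_id, semester):
    # Copy the CURRENT course name and credits into the transcript record,
    # so later catalog changes cannot rewrite the historical record.
    conn.execute("""
        INSERT INTO ENROLL (STU_NUM, CLASS_CODE, ENROLL_SEMESTER,
                            ENROLL_CRS_CREDITS, ENROLL_CRS_NAME, ENROLL_GRADE)
        SELECT ?, CLASS.CLASS_ID, ?, COURSE.CRS_CREDIT_HOURS, COURSE.CRS_DESC, NULL
        FROM   CLASS JOIN COURSE ON CLASS.CRS_CODE = COURSE.CRS_CODE
        WHERE  CLASS.CLASS_ID = ?
    """, (stu_num, semester, class_id))

enroll(321452, 1, 'SPRING-20XX')
conn.execute("UPDATE COURSE SET CRS_CREDIT_HOURS = 4")   # later catalog change
credits = conn.execute("SELECT ENROLL_CRS_CREDITS FROM ENROLL").fetchone()[0]
```

Because the credits were copied at enrollment time, the transcript still shows 3 credit hours even after the catalog value changes to 4.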
NOTE To keep the model simple, we have not included such "obvious" entities as MAJOR, connected to STUDENT and DEPARTMENT; the PROFESSOR who teaches CLASSES and who may chair the DEPARTMENT; the COLLEGE to which the DEPARTMENT belongs; etc. These details may be discussed in connection with the Tiny College database discussed in Chapter 4, "Entity Relationship Modeling." Given this simplification, the DEPARTMENT used in this example does not have any foreign keys.
Verification of the ER model

Required Output: Selected Student Record

The required report can be easily generated through the use of the tables depicted in our database model. The SQL code that will generate the required information will look like this:

SELECT   STUDENT.STU_NUM, STU_LNAME, DEPARTMENT.DEPT_CODE, DEPT_NAME,
         ENROLL_SEMESTER, ENROLL_CRS_CREDITS, ENROLL_CRS_NAME, ENROLL_GRADE
FROM     STUDENT, DEPARTMENT, ENROLL, CLASS, COURSE
WHERE    STUDENT.STU_NUM = ENROLL.STU_NUM
AND      CLASS.CLASS_ID = ENROLL.CLASS_CODE
AND      DEPARTMENT.DEPT_CODE = STUDENT.DEPT_CODE
AND      CLASS.CRS_CODE = COURSE.CRS_CODE
ORDER BY ENROLL.STU_NUM, ENROLL_SEMESTER, ENROLL_CRS_NAME;
The previous SQL query generates the data needed for the report. The specific output format may be created by using the DBMS's report generator or by using a 3GL programming language such as COBOL or C. Also, note that the "Grade points" column in the Student Record is a computed column that is produced by multiplying the CRS_CREDIT in the COURSE table by the numeric value equivalent to the letter ENROLL_GRADE in the ENROLL table. To compute the value for such a column, the programmer uses a conversion table such as the one shown in Table P6.1.
Table P6.1 A Gradepoint Conversion Table

Letter Grade    Numeric Value
A               4
B               3
C               2
D               1
F               0
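The gradepoint computation the programmer would implement with Table P6.1 can be sketched in a few lines. The conversion values come from the table; the function names and the sample semester are our own:

```python
# Table P6.1's letter-to-number conversion.
GRADE_POINTS = {'A': 4, 'B': 3, 'C': 2, 'D': 1, 'F': 0}

def grade_points(credit_hours, letter_grade):
    """The computed "Grade points" column: credits x numeric grade value."""
    return credit_hours * GRADE_POINTS[letter_grade]

def gpa(courses):
    """courses: list of (credit_hours, letter_grade) tuples for one semester."""
    hours = sum(h for h, _ in courses)
    points = sum(grade_points(h, g) for h, g in courses)
    return round(points / hours, 2)

semester = [(3, 'B'), (3, 'A'), (4, 'C')]   # 9 + 12 + 8 points over 10 hours
```

The same function applied to all of a student's ENROLL rows, rather than one semester's, yields the cumulative GPA shown on the transcript.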
When the verification process is completed, the ERD looks like the one shown in Figure PC.6.
Figure PC.6 The Crow’s Foot ERD for the (Transcript-based) Student Advising System
NOTE As you discuss the ERD shown in Figure PC.6, note that optionalities are often used for operational reasons. For example, keeping CLASS optional to COURSE means that you don’t have to generate a class when a new course is put into a catalog. (In this case, the optionality also reflects the business rule “not all courses generate classes each semester.”) Keeping ENROLL optional to both CLASS and STUDENT means that you won’t have to generate a dummy record in the ENROLL table when you sign up a new student or when you generate a new class entry in the registration schedule.
7. Design and verify a database application for one of your local not-for-profit organizations (for example, the Red Cross, the Salvation Army, your church, mosque, or synagogue). Create a data dictionary for the verified design.

Since this problem's solution depends on the selected organization, no solution can be presented here. However, the steps required in the solution are shown in discussion question 4. An abbreviated version is presented in problem 1.
8. Using the information given in the physical design section (C.5), estimate the space requirements for the following entities: RESERVATION, INV_TRANS, TR_ITEM, LOG, ITEM, and INV_TYPE. (Hint: You may want to check Appendix B's Table B.3, A Sample Volume of Information Log.)

You must generate the data storage requirement for each of the tables. Therefore, begin by identifying the attribute characteristics and storage requirements. The supported data types depend on the database software. For example, some software supports the Julian date format, while other software requires dates to be identified as strings. Even date strings vary in length, depending on the default format (18-Mar-2014 or 3/18/14, for example). Therefore, the correct answer depends on the DBMS you use. In short, the following data storage requirements are meant to be used for discussion purposes only. (Only a few sample tables are shown, but they are sufficient to illustrate the process and to serve as the basis for a discussion about required table spaces.)
Table: RESERVATION (4 per week, 14 weeks per semester, 56 reservations per semester)

Attribute    Data Type    Storage (bytes)
RES_ID       INT           4
RES_DATE     DATE          8
USER_ID      CHAR(11)     11
LA_ID        CHAR(11)     11

Row length: 34 bytes × 56 rows = 1,904 bytes
Table: INV_TRANS (80 per week, 14 weeks per semester, 1,120 transactions per semester)

Attribute        Data Type    Storage (bytes)
TRANS_ID         INT           4
TRANS_TYPE       CHAR(1)       1
TRANS_PURPOSE    CHAR(2)       2
TRANS_DATE       DATE          8
LA_ID            CHAR(11)     11
USER_ID          CHAR(11)     11
ORDER_ID         INT           4
TRANS_COMMENT    CHAR(50)     50

Row length: 91 bytes × 1,120 rows = 101,920 bytes
Table: TR_ITEM (240 per week, 14 weeks per semester, 3,360 per semester)

Attribute    Data Type      Storage (bytes)
TRANS_ID     INT             4
ITEM_ID      NUMBER(8,0)     8
LOC_ID       CHAR(10)       10
TRANS_QTY    INT             4

Row length: 26 bytes × 3,360 rows = 87,360 bytes
Table: LOG (5,000 per week, 14 weeks per semester, 70,000 log entries per semester)

Attribute     Data Type    Storage (bytes)
LOG_DATE      DATE          8
LOG_TIME      CHAR(12)     12
LOG_READER    CHAR(1)       1
USER_ID       CHAR(11)     11

Row length: 32 bytes × 70,000 rows = 2,240,000 bytes
Table: ITEM (890 identified)

Attribute           Data Type      Storage (bytes)
ITEM_ID             NUMBER(8,0)     8
TY_GROUP            CHAR(8)         8
ITEM_INV_ID         CHAR(7)         7
ITEM_DESCRIPTION    CHAR(45)       45
ITEM_QTY            INT             4
VEND_ID             CHAR(5)         5
ITEM_STATUS         CHAR(1)         1
ITEM_BUY_DATE       DATE            8

Row length: 86 bytes × 890 rows = 76,540 bytes
Table: INV_TYPE (15 categories)

Attribute         Data Type    Storage (bytes)
TY_GROUP          CHAR(8)       8
TY_CATEGORY       CHAR(2)       2
TY_CLASS          CHAR(2)       2
TY_TYPE           CHAR(2)       2
TY_SUBTYPE        CHAR(2)       2
TY_DESCRIPTION    CHAR(35)     35
TY_UNIT           CHAR(4)       4

Row length: 55 bytes × 15 rows = 825 bytes
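The arithmetic behind each table's estimate is the same: sum the column storage sizes to get the row length, then multiply by the expected number of rows. A minimal sketch (function name is our own) applied to two of the tables above:

```python
# Reproduce the space-requirement arithmetic for the sample tables.
def table_bytes(column_bytes, rows):
    """Return (row_length, total_bytes) for a table."""
    row_length = sum(column_bytes)
    return row_length, row_length * rows

# RESERVATION: INT(4) + DATE(8) + CHAR(11) + CHAR(11), 56 rows per semester
res_row, res_total = table_bytes([4, 8, 11, 11], 56)

# TR_ITEM: INT(4) + NUMBER(8,0)(8) + CHAR(10)(10) + INT(4), 3,360 rows
tr_row, tr_total = table_bytes([4, 8, 10, 4], 3360)
```

Remember that the per-column byte counts are DBMS dependent, so students using a different DBMS should substitute that product's storage sizes before multiplying.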
Appendix F Client/Server Systems
Discussion Focus

Why may client/server computing be considered an evolutionary, rather than a revolutionary, change?

Client/server computing didn't happen suddenly. Instead, it is the result of years of slow-paced changes in end-user computing. Using Appendix F, Section F.2's evolution of client/server information systems, first illustrate the typical mainframe scenario (users accessing dumb terminals), then move the discussion to the development of the microcomputer and its impact on work styles to set the stage for the current PC-based client/server computing scenario. Use Appendix F, Table F.1 to illustrate the contrasting characteristics of the mainframe-based and client/server-based information systems.

Why may the client/server evolution be characterized as a bottom-up change, and how does this change affect the computing environment?

Modern end users use intelligent computers, GUIs, user-friendly systems, and data analysis tools to effectively increase their productivity. In addition, data sharing requirements make efficient use of network resources a priority issue. Given such an end user-based environment, it is not surprising that the end user drives the client/server architecture's development and acceptance. Given this introduction and expanding on its theme, students are more easily able to contrast the PC-based client/server computing model and the traditional mainframe computing model. After identifying the differences in computing style that characterize these models, the discussion may be shifted toward the formal definition of client/server systems, the forces that drive client/server systems, and the managerial expectations of client/server system benefits. Section F.7.2's examination of the managerial expectations of client/server system benefits is the key to understanding the opportunities and risks associated with client/server systems.
Students will benefit from the suggested readings in this section, so we suggest their assignment.

What are the client/server's infrastructure requirements and how do they function?

Appendix F's Client/Server Architecture Section F.3 deals with the technical details of the main client/server components. This discussion is more likely to be fruitful if you first assign Appendix G, "Client/Server Network Infrastructure," and Appendix F's Section F.3 as background reading to acquaint students with the basic network components such as cabling, topology, types, communication devices, and network protocols. Use the example illustrated in Figure F.3 to briefly explain the interaction among the main components: client, server, and communications channel or network. Emphasize that each of these components requires a combination of hardware and software subcomponents. We suggest that, before discussing the technical details of each client/server component, it will be helpful to explain the client/server architectural principles that govern most client/server systems. Use the OSI network reference model illustrated in Appendix F,
Table F.2 to explain the communications channel components, and then use Appendix F, Figure F.7 to illustrate the flow of data from the client to the communication channel to the server.

What is middleware and why is it a crucial client/server component?

Middleware is part of the communications channel. Its presence is so important in a successful client/server environment that some authors define three client/server architectural components: client, server, and middleware. To explain the need for and use of middleware software, use Appendix F, Figure F.11 to illustrate how middleware functions in Oracle and SQL Server databases. Appendix F, Figure F.12 illustrates the use of middleware in an IBM DB2 shop. Note that even the middleware software can have client and server components.

What, if any, client/server standards exist and how do such standards affect the client/server database environment?

There is no single standard to choose from at this point. However, there are several de facto standards, created by market acceptance. (See Appendix F, Section F.4, "The Quest for Standards.") Therefore, client/server developers have many "standards" to choose from when developing applications. The important database issue is how the selection of one of the de facto standards affects database design, implementation, and management. Section F.7 explains some of the desired features of client/server databases.

What are the logical components of a client/server application and how are these components allocated in a client/server environment?

Appendix F's Client/Server Architectural Styles section defines an application's main logical components and how those components can be allocated to clients and/or servers. Use Appendix F, Section F.6 as the basis for a discussion about the different levels of processing logic distribution. (Note particularly Figure F.16.)
What are some of the managerial and technical issues encountered in the implementation of client/server systems?

Appendix F, Section F.7 addresses this discussion question in detail. Specifically, this section shows how the change from traditional to client/server data processing affects the MIS function.
Answers to Review Questions

NOTE Since the answers to many of these questions are covered in detail in Appendix F, we have elected to give you section references to avoid needless duplication.
1. Mainframe computing used to be the only way to manage data. Then personal computers changed the data management scene. How do those two computing styles differ, and how did the shift to PC-based computing evolve?

The evolution toward client/server information systems is explained in Section F.2. The main differences between mainframe-based information systems and PC-based client/server information systems are illustrated in Table F.1. The answer to this question may also include a discussion, based on Section F.2, of the forces that drive client/server systems.

2. What is client/server computing, and what benefits can be expected from client/server systems?

Client/server is a term used to describe a computing model for the development of computerized systems. This model is based on the distribution of an application's functions between two types of independent and autonomous entities: servers and clients. A client is any process that requests specific services from server processes. A server is a process that provides requested services for clients. See Section F.1 for additional client/server definition details. Note that client/server is a computing model that focuses on the separation and distribution of the application's functions. Therefore, the application is divided into client and server processes. The clients request services from the server processes. Note also that the client and server processes can reside on the same computer or on different computers connected by a network. The final result is an application in which part of the processing is done at the client side and part of the processing is done at the server side. The advantages of separating and distributing the application's processing are efficient resource utilization and maximization of resource effectiveness. Perhaps the greatest single advantage is found in the utilization of the existing personal computer power for local data access and processing.
Thus the end user is able to use local PCs to access mainframe and minicomputer legacy data and to process such data locally by using user-friendly PC software. Used correctly, this computing approach yields greater information autonomy, lower costs, improved access to information, and, therefore, a greater potential for better decision making. These benefits may yield better service to customers, thus generating more business. (The managerial expectations of client/server benefits are explained in greater detail in section F.7.)
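The request/response pattern that defines the client/server model can be sketched with a minimal TCP exchange. The port selection, message text, and single-request server are illustrative assumptions, not details from Appendix F:

```python
import socket
import threading

def server(sock):
    """A server process provides requested services for clients."""
    conn, _ = sock.accept()
    request = conn.recv(1024).decode()
    conn.sendall(f"result for {request}".encode())  # the "service"
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
t = threading.Thread(target=server, args=(listener,))
t.start()

# The client process requests a specific service from the server process.
client = socket.socket()
client.connect(("127.0.0.1", port))
client.sendall(b"query")
reply = client.recv(1024).decode()
client.close()
t.join()
```

Note that the two processes could just as easily run on different machines; only the address passed to connect() would change, which is the location-transparency idea behind the model.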
3. Explain how client/server system components interact.

The main client/server components are the client, the server, and the communications channel. Some experts include middleware as a separate component. Section F.3.1 provides a detailed explanation of the component interactions.

4. Describe and explain the client/server architectural principles.

Client/server components must conform to some basic architectural principles if they are to interact properly. Section F.3 includes a detailed description and explanation of the client/server architectural principles.

5. Describe the client and the server components of the client/server computing model. Give examples of server services.

Section F.3 offers an extensive description of the client and server components. This section also provides several examples of server services. Figures F.5 and F.6 illustrate client and server internal components. Server services – file, print, fax, communications, database, transaction, and miscellaneous services such as CD-ROM, video, and back-up – are detailed in Section F.3.3.

6. Using the OSI network reference model, explain the communications middleware component's function.

The communications channel provides the means through which clients and servers communicate. The communications channel connects clients and servers, and its main function is the delivery of messages between them. Using the OSI network reference model, Section F.3.5 provides a detailed explanation of the communications channel. Note that we use the OSI network reference model because most client/server applications are based on a scenario in which clients and servers are tied together through a network.

7. What major network communications protocols are currently in use?

The client/server network infrastructure includes the network cabling, network topology, network type, communication devices, and network protocols.
Section F.3.6 provides a detailed description of these components and their use. The network protocols determine how messages between computers are sent, interpreted, and processed. The main network protocols in use today are Transmission Control Protocol/Internet Protocol (TCP/IP), Sequenced Packet Exchange/Internetwork Packet Exchange (SPX/IPX), and Network Basic Input Output System (NetBIOS). Section F.3.6 provides a more detailed explanation of these and other network protocols.
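Because the appendix discusses message delivery abstractly, it may help to show a client and a server actually exchanging a message over TCP/IP. The sketch below is not from the textbook; the request string and the "RESULT for:" reply format are invented for illustration, and it uses only Python's standard socket and threading libraries:

```python
import socket
import threading

def run_server(srv):
    # A toy "database server": accept one connection, read the client's
    # request message, and send back a canned reply over TCP/IP.
    conn, _ = srv.accept()
    request = conn.recv(1024).decode()
    conn.sendall(("RESULT for: " + request).encode())
    conn.close()
    srv.close()

# Bind to an OS-assigned port on the loopback interface.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

server_thread = threading.Thread(target=run_server, args=(srv,))
server_thread.start()

# The client side: open a connection, deliver the request message,
# and wait for the server's reply (the communications channel's job).
cli = socket.create_connection(("127.0.0.1", port))
cli.sendall(b"SELECT * FROM CUSTOMER")
reply = cli.recv(1024).decode()
cli.close()
server_thread.join()

print(reply)  # RESULT for: SELECT * FROM CUSTOMER
```

The protocol details (TCP handshakes, packet framing, routing) are hidden below the socket interface, which is exactly the insulation role the middleware discussion describes.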
8. Explain what middleware is and what it does. Why would MIS managers be particularly interested in such software?

Middleware is software that is used to manage client/server interactions. Most important to the end user and the MIS manager is the fact that middleware provides services that insulate the client from the details of network protocols and server processes. MIS managers are usually concerned with finding ways to improve end-user data access and programmer productivity. By using middleware, end users can access legacy data and programmers can write better applications faster. The applications are network independent and database server independent. Such an environment yields improved productivity, thereby generating development cost savings. Sections F.3.4 and F.3.5 provide additional database middleware details.

9. Suppose you are currently considering the purchase of a client/server DBMS. What characteristics should you look for? Why?

A client/server DBMS is just one of the components in an information system. The DBMS should be able to support all applications, business rules, and procedures necessary to implement the system. Therefore, the DBMS must match the system's technical characteristics, it must have good management capabilities, and it must provide the desired level of support from the vendor and third parties. Specifically:
• On the technical side, the DBMS should provide data distribution, location transparency, transaction transparency, a data dictionary, good performance, support for access via a variety of front ends and programming languages, support for several client types (DOS, UNIX, Windows, etc.), and third-party support for CASE tools, application development environments, and so on.
• On the managerial side, the DBMS must provide a wide variety of managerial tools: database backup and recovery, GUI-based tools, remote management, interfaces to other management systems, performance monitoring tools, database utilities, etc.
• On the support side, the DBMS must have good third-party vendor support, technical support, training, and consulting.
Section F.7 provides additional details about these topics. Always remember that DBMS selection is a task within the system development process and normally follows the requirements definition task. In other words, the DBMS features should be determined by the characteristics of the system that is going to be built, and not the other way around. Unfortunately, in the real world this is a luxury enjoyed by few development projects!
10. Describe and contrast the four client/server computing architectural styles that were introduced in this appendix.

This question deals with identifying the application processing logic components and deciding where to locate them. Section F.6 covers this very important topic in great detail. (Note particularly the summary in Figure F.19, "Functional Logic Splitting in Four Client/Server Architectural Styles.") Note that client/server computing styles include several layers of hardware and software in which processing takes place. This layered environment is quite different from the homogeneous environment encountered in traditional minicomputer and mainframe programming.

11. Contrast client/server and traditional data processing.

From a managerial point of view, client/server data processing tends to be more complex than traditional data processing. In fact, client/server computing changes the way in which we look at the most fundamental computing chores and expands the reach of information systems. These changes create a managerial paradox. On the one hand, MIS frees end users to do their individual data processing; on the other hand, end users become more dependent on the client/server infrastructure and on the expanded services provided by the MIS department. Client/server computing changes the way in which systems are designed, developed, and managed by forcing a change from:
• proprietary to open systems
• maintenance-oriented coding to analysis, design, and service
• data collection to data deployment
• a centralized to a distributed style of data management
• a vertical, inflexible organizational style to a more horizontal, flexible style.
For additional details on these and related topics, refer to section F.7.1.

12. Discuss and evaluate the following statement: There are no unusual managerial issues related to the introduction of client/server systems.
The managerial issues in client/server systems arise from the changes in data processing style discussed in Section F.7, from the management of multiple hardware and software vendors, from the maintenance and support of the client/server infrastructure (communications, applications, and so on), and from the management and control of the associated costs.
Problem Solutions

1. ROBCOR, a medium-sized company, has decided to update its computing environment. ROBCOR has been a minicomputer-based shop for several years, and all of its managerial and clerical personnel have personal computers on their desks. ROBCOR has offered you a contract to help the company move to a client/server system. Write a proposal that shows how you would implement such an environment.

Because Problem 1 cannot be answered properly without addressing the computing style issue in Problem 2, the answers to both problems are supplied after Problem 2.

2. Identify the main computing style of your university computing infrastructure. Then recommend improvements based on client/server strategy. (You might want to talk with your department's secretary or with your advisor to find out how well the current system meets their information needs.)

Problems 1 and 2 are research questions that yield extensive class projects. The questions are designed with two ideas in mind:
1. To have the student assume the consultant's "proactive" role.
2. To entice the students to use the knowledge acquired in this appendix to develop an integrated approach to client/server systems implementation.
The expected output for these projects is a business-quality paper and a professional-level class presentation of the findings, recommended solutions, and the suggested implementation. The material presented in Section F.7 yields an outline appropriate for such a paper. It will be beneficial if students have taken at least an introductory course in systems analysis and design. Keep in mind that you can either use the two scenarios presented in these problems or assign students a real-world case to accomplish the same goals. In the first case, the professor assumes the role of the end user. In the second case, an external third party is the end user.
The problem with real-world cases is that the professor must secure commitment from the third party. Unfortunately, it is sometimes difficult for company managers to provide possibly sensitive internal information to students and to devote scarce time to student projects. Even if the project is kept within the university's bounds, you are likely to discover that university administrators may not be able or willing to provide critical information. Students should be encouraged to use the presentations as a basis for further analysis of the more nettlesome issues that must be confronted in the development of client/server systems. We suggest several class discussion sessions in which different student groups present alternative solutions. Such presentations will force students not only to design a solution but also to sell that solution to management.
Appendix G Object Oriented Databases
Appendix G Object-Oriented Databases

Discussion Focus

Because an OO model necessarily forces the student to come to terms with some very abstract notions, begin the discussion by pointing out the practical benefits of system modularity; then carefully delineate the OO features that almost inevitably lead to modularity. At this point, students should be familiar with the relational model, so show how the OO model handles attributes and demonstrate that:
1. Although the term "object" in a relational environment is sometimes used as a synonym for "entity," an object in an OO environment contains much more than attributes. Use Figure G.2 to show that the object contains both data and methods: every operation to be performed on the object is implemented by a method! And the methods are carried along with the object, thus producing modularity!
2. An OO attribute (not its value!) can reference one or more other objects.
3. Unlike the relational model, the OO model does not need a JOIN to link tables through their common attribute values.
4. There is quite a difference between the relational model's primary key and the OO model's OID.
Section G.5, "OODM and Previous Data Models: Similarities and Differences," yields an excellent summary of the OO data model; do not omit any of its subsections in your discussion. Next, it is crucial that you shift attention to Section G.5.1's Figure G.34, "An Invoice Representation"; this is where students are most likely to see the database's conceptual modeling implications.
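Point 3 in the list above can be made concrete with a short sketch. The table contents, attribute names, and values below are invented for illustration; the point is the contrast between value-based matching and direct object references:

```python
# Relational style: the link between rows is value based, so a related
# row must be found by matching equal attribute values (what a JOIN does).
customers = [{"cus_code": 10010, "cus_name": "Ramas"}]
invoices  = [{"inv_num": 1001, "cus_code": 10010, "inv_total": 250.0}]

def customer_of(invoice):
    # Re-compute the relationship by value matching.
    return next(c for c in customers if c["cus_code"] == invoice["cus_code"])

# OO style: the invoice's attribute references the customer object
# directly, so no value matching (and no JOIN) is needed.
class Customer:
    def __init__(self, cus_name):
        self.cus_name = cus_name

class Invoice:
    def __init__(self, customer, inv_total):
        self.customer = customer      # attribute references another object
        self.inv_total = inv_total

ramas = Customer("Ramas")
inv = Invoice(ramas, 250.0)

print(customer_of(invoices[0])["cus_name"])  # Ramas (found by value)
print(inv.customer.cus_name)                 # Ramas (followed by reference)
```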
Answers to Review Questions NOTE To ensure in-depth chapter coverage, most of the following questions cover the same material that we covered in detail in the text. Therefore, in most cases, we merely cite the specific section, rather than duplicate the text material.
1. Discuss the evolution of object-oriented concepts. Explain how those concepts have affected computer-related activities.

See Section G.2.

2. How would you define object orientation? What are some of its benefits? How are OO programming languages related to object orientation?

See the definition of OO in Section G.1, "Object Orientation and Its Benefits." The benefits derived from the application of OO concepts are summarized in Table G.1, "Object Orientation Contributions." The relationship between OO and OO programming languages (OOPLs) is discussed in Section G.3.5, "Messages and Methods." Add to this discussion that OO concepts have created a powerful programming environment that has radically changed both programming and systems development. Although traditional programmers tended to agree that modularity is one of the primary goals of structured programming and good design, modularity was often difficult to achieve. Even a cursory examination of OO concepts leads to the conclusion that the conceptually autonomous structure (in which an object contains both data and methods) makes the much sought-after modularity almost inevitable.

3. Define and describe the following:
a. Object: see Section G.3.1, "Objects: Components and Characteristics."
b. Attributes: see Section G.3.3, "Attributes (Instance Variables)."
c. Object state: see Section G.3.4, "Object State."
d. Object ID: see Section G.3.2, "Object Identity."

4. Define and contrast the concepts of method and message. What OO concept provides the differentiation between a method and a message? Give examples.

Methods and messages are discussed in Section G.3.5, "Messages and Methods." Use Figure G.3, "Method Components," to illustrate the role of methods and messages.
5. Explain how encapsulation provides a contrast to traditional programming constructs such as record definition. What benefits are obtained through encapsulation? Give an example.

Encapsulation hides the object's internal data representation and method implementation, thus ensuring the object's data integrity and consistency. The programmer needs only to ask an object to perform an action, without having to specify how the action is to be performed. Because the implementation details need not be specified, the programmer can concentrate on the overall process. Clearly, an object is an independent entity, and object independence assures system modularity. For example, an object-oriented system may be formed by thousands of independent objects (or even more) that interact to perform specific actions: a perfectly modular system. In contrast, the programmer who uses a traditional programming language has direct access to the internal components of a record type and can therefore manipulate the data elements at will. This ability is not necessarily valuable; programmers can (and do) make mistakes, thus causing problems in critical systems. For example, when you create a record type "customer" in your program, you have direct access to all the data elements of that record, so there is no protection of the data.

6. Using an example, illustrate the concepts of class and class instances.

See Section G.3.6, "Classes." Use Figure G.5, "Class Illustration," to illustrate that a class is composed of objects or object instances; then use the example summarized in Figure G.6, "Representation of the Class STUDENT." As you discuss the example in Figure G.5, note that each class instance is an object with a unique OID and that each object knows to which class it belongs.

7. What is a class protocol, and how is it related to the concepts of methods and classes?
Draw a diagram to show the relationship between these OO concepts: object, class, instance variables, methods, object's state, object ID, behavior, protocol, and messages.

See Section G.3.7, "Protocol." Use the summary presented in Figure G.8, "OO Summary: Object Characteristics," to tie the OO concepts together.

8. Define the concepts of class hierarchy, superclasses, and subclasses. Explain the concept of inheritance and the different types of inheritance. Use examples in your explanations.

See Sections G.3.8, "Superclasses, Subclasses, and Inheritance," and G.3.9, "Method Overriding and Polymorphism." The examples are illustrated in Figures G.9 through G.14.
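A small runnable sketch can support the answers to questions 5 through 8. The Account/SavingsAccount classes, attribute names, and sample values below are invented for illustration; they are not from the appendix:

```python
class Account:
    # Encapsulation: the balance is stored in a name-mangled attribute and
    # can be changed only through the object's methods (its protocol).
    def __init__(self, owner, balance=0.0):
        self.owner = owner
        self.__balance = balance        # hidden from direct outside access

    def deposit(self, amount):          # a method: the only way to mutate state
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.__balance += amount

    def balance(self):
        return self.__balance

class SavingsAccount(Account):
    # Subclass: inherits the data structure and methods, adds behavior.
    def __init__(self, owner, balance=0.0, rate=0.05):
        super().__init__(owner, balance)
        self.rate = rate

    def add_interest(self):
        self.deposit(self.balance() * self.rate)

a = SavingsAccount("Pat", 100.0, rate=0.10)
a.deposit(50.0)      # message sent to the object; inherited method responds
a.add_interest()     # subclass-specific method: 10% of 150.0 = 15.0
print(a.balance())   # 165.0
```

The caller never touches the balance directly; it can only send messages that the class protocol defines, which is the record-definition contrast question 5 asks about.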
9. Define and explain the concepts of method overriding and polymorphism. Use examples in your explanations.

See Section G.3.9, "Method Overriding and Polymorphism." Illustrate the example with reference to Figure G.14, "Employee Class Hierarchy Polymorphism."

10. Explain the concept of abstract data types. How do they differ from traditional or base data types? What is the relationship between a type and a class in OO systems?

See Sections G.3.10 through G.4. Show Figure G.17, "Defining Three Abstract Data Types," to illustrate the use of abstract data types. Use Table G.3, "Comparing the OO and ER Model Components," to illustrate the distinction between type and class.

11. What are the five minimum attributes of an OO data model?

See Section G.6, "Object Oriented Database Management Systems." Note especially Table G.4, "The Thirteen OODBMS Rules." Table G.4's subsection, "Rules that make it a DBMS," lists the five characteristics (rules) that define an OO data model.

12. Describe the difference between early and late binding. How does each of those affect the object-oriented data model? Give examples.

Late and early binding are discussed and illustrated in Section G.4.4, "Late and Early Binding: Use and Importance." See particularly Figures G.32, "Inventory Class with Early Binding," and G.33, "OODM Inventory Class with Late Binding."

13. What is an object space? Using a graphic representation of objects, depict the relationship(s) that exist between a student taking several classes and a class taken by several students. What type of object is needed to depict that relationship?

The object space or object schema is the equivalent of a database schema. The object space is used to represent the composition of the state of an object at a given time. For example, you can use the schema shown in Figure QG.13 to represent the M:N relationship between STUDENT and CLASS:
Figure QG.13 The Object Schema for the Relationship between Student and Class
[Figure: STUDENT and CLASS object schemas in an M:N relationship linked through the intersection class ENROLL. STUDENT carries STU_SOC_SEC_NUM, STU_LNAME, STU_FNAME, STU_ADDRESS, STU_CITY, STU_STATE, STU_ZIPCODE, STU_CUM_GPA, STU_SEM_GPA, and the collection CLASS_TAKEN (ENROLL objects). CLASS carries CLASS_CODE, CLASS_DESCRIPTION, and the collection TAKEN_BY (ENROLL objects). Each ENROLL instance references one STUDENT and one CLASS and carries the GRADE attribute.]
As you discuss Figure QG.13, note the use of an intersection class (ENROLL) to represent the M:N relationship between STUDENT and CLASS. This intersection class contains the GRADE attribute, which represents the student's grade in a given course.
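The intersection class idea can be sketched in a few lines of Python. The names (Enroll, taken_by, schedule) mirror the figure, but the code itself, the constructor signatures, and the sample students are illustrative assumptions, not material from the appendix:

```python
class Student:
    def __init__(self, name):
        self.name = name
        self.schedule = []     # ENROLL objects this student participates in

class Class_:
    # Trailing underscore avoids clashing with Python's 'class' keyword.
    def __init__(self, code):
        self.code = code
        self.taken_by = []     # ENROLL objects for this class (section)

class Enroll:
    # Intersection object: links one Student to one Class_ and carries
    # the relationship attribute GRADE, as in Figure QG.13.
    def __init__(self, student, klass, grade):
        self.student, self.klass, self.grade = student, klass, grade
        student.schedule.append(self)
        klass.taken_by.append(self)

s1, s2 = Student("Ann"), Student("Bob")
c1 = Class_("CIS-101")
Enroll(s1, c1, "A")
Enroll(s2, c1, "B")

grades = [(e.student.name, e.grade) for e in c1.taken_by]
print(grades)  # [('Ann', 'A'), ('Bob', 'B')]
```

Each side of the M:N relationship holds a collection of ENROLL references, so the grade belongs to the relationship rather than to either participant.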
NOTE Now's the time to tie the OO concepts and constructs together. Show how the OO model uses objects, object instances, and object sharing; then explore the 1:1, 1:M, and M:N relationships illustrated in Figures G.23 through G.28. Next, show how an object space may be represented, using Figure G.29 as an example. Finally, show how abstract data types help the designer implement the OO model.
14. Compare and contrast the OODM with the ER and relational models. How is a weak entity represented in the OODM? Give examples.

Section G.4, "Characteristics of an Object Oriented Data Model," sets the discussion stage. Use Table G.3, "Comparing the OO and ER Model Components," to summarize the comparisons between the OO model and the ERM. (The acronym "ERM" denotes the entity relationship model.) This question is also addressed extensively in Section G.9, "How OO Concepts Have Influenced the Relational Model." Section G.5.1, "Object, Entity, and Tuple," is an excellent comparison source. Your students may get an "I got it!" moment when they examine Figure G.34, "An Invoice Representation." Although the OODM has much in common with the relational and ER data models, the OODM introduces
some fundamental differences. Table QG.14 provides the summary that will help you clarify the OODM characteristics we introduced in this appendix. (Although a case may be made for the OO data model's contributions, we urge you to let your students read and discuss C. J. Date's "Third Manifesto," referenced in Section G.9.)
Table QG.14 A Comparison of OODM, ERM, and Relational Model Features

OODM               ER Model (ERM)               Relational Model
Type               Entity definition (limited)  Table definition (limited)
Object             Entity                       Table row or tuple
Class              Entity set                   Table
Instance variable  Attribute                    Column (attribute)
OID                N/A                          N/A
N/A                Primary key                  Primary key
Object schema      ER diagram                   Relational schema
Class hierarchy    N/A*                         N/A
Inheritance        N/A*                         N/A
Encapsulation      N/A                          N/A
Method             N/A                          N/A

* Entity supertypes and subtypes provide only a limited, design-level approximation; see the discussion that follows.
You may find it useful to point out the similarities between entity types and subtypes on the one hand and class hierarchies and inheritance on the other. Remember that entity types and subtypes are design constructs that provide data modeling with data abstraction, but these constructs do not automatically imply the existence of inheritance. In fact, no RDBMS supports these constructs directly; instead, the programmer has to link the tables at run time to ensure that the attributes will be "inherited." You could also create views to ensure that the constructs include the "inherited" attributes.

15. Name and describe the 13 mandatory features of an OODBMS.

See Section G.6.1, "Features of an Object Oriented Database," and use the thirteen OODBMS commandments listed in Table G.4, "The Thirteen OODBMS Rules."

16. What are the advantages and disadvantages of an OODBMS?

See Section G.8, "OODBMS: Advantages and Disadvantages."

17. Explain how OO concepts affect database design. How does the OO environment affect the DBA's role?

See Section G.9, "How OO Concepts Have Influenced the Relational Model."
18. What are the essential differences between the relational database model and the object database model?

Start by using Table QG.14 (shown in the answer to question 14) to compare and contrast the basic concepts of the object model, the entity relationship model, and the relational model. A more detailed discussion of the similarities and differences among the models is found in Section G.5, "OODM and Previous Data Models: Similarities and Differences." In this section, we have shown that:
• An object extends beyond the static concept of an entity or tuple in the other data models.
• Like the entity set and the table, a class includes the data structure. However, unlike the entity set and table, the class also includes methods.
• Unlike its relational and ER counterparts, encapsulation allows an object's "internals" to be hidden from the outside.
• Unlike its relational and ER counterparts, inheritance allows an object to inherit attributes and methods from a parent class.
• The object ID (OID) is associated with the primary key concept in the relational and ER models, but it is not quite the same thing. An OID is an attribute that is not directly exposed, user definable, or directly accessible the way a PK is in the relational model.
• The relational and ER model relationships are based on primary key/foreign key relationships. Such relationships are "value based"; that is, they are based on two attributes in different tables sharing equal values. The relationships in the object model are not based on the specific value of any attribute.
• Data access in the relational model is based on a query language known as SQL. SQL is a set-oriented language that uses associative access methods to retrieve related rows from tables. In contrast with the relational model, the object data model suffers from the lack of a standard query language.
Because of its identity-based access style, the object model resembles the record-at-a-time access of the older hierarchical and network models.

19. Using a simple invoicing system as your point of departure, explain how its representation in an entity relationship model (ERM) differs from its representation in an object data model (ODM). (Hint: Check Figure G.34.)

Use the appendix's Figure G.34 to illustrate the idea that the object model represents the INVOICE as an object containing other objects (CUSTOMER and LINE). In contrast, the ER model uses three separate entities related to each other through their primary key/foreign key attributes. Note that the object model automatically includes the CUSTOMER and LINE object instances when each INVOICE instance is made current.

20. What are the essential differences between an RDBMS and an OODBMS?

Section G.6, "Object Oriented Database Management Systems," explains the basic characteristics of an OODBMS. Such characteristics clearly show that the OODBMS shares features such as data accessibility, persistence, backup and recovery, transaction management, concurrency control, and security and integrity with the RDBMS. In addition, the OODBMS has unique characteristics such as support for complex objects, encapsulation and inheritance, abstract data types, and object identity.

21. Discuss the object/relational model's characteristics.
See Section G.9, "How OO Concepts Have Influenced the Relational Model." Section G.5.1, "Object, Entity, and Tuple," is an excellent comparison source. Your students may get an "I got it!" moment when they examine Figure G.34, "An Invoice Representation."
Problem Solutions

1. Convert the following relational database tables to the equivalent OO conceptual representation. Explain each of your conversions with the help of a diagram. (Note: The RRE Trucking Company database includes the three tables shown in Figure PG.1.)
FIGURE PG.1 The RRE Trucking Company Database
As you examine Figure PG.1, note that, for simplicity's sake, we have chosen not to represent BASE_MANAGER as an abstract data type belonging to the class PERSON.
Figure PG.1 The OO Conceptual Representation
[Figure: The OO schema for the RRE Trucking Company database. TRUCK (TRUCK_NUM c, BASE, TYPE, TRUCK_MILES n, TRUCK_BUY_DATE d, TRUCK_SERIAL_NUM c) participates in two 1:M relationships: each TRUCK references one BASE and one TYPE, while BASE (BASE_CODE, BASE_CITY c, BASE_STATE c, BASE_AREA_CODE c, BASE_PHONE c, BASE_MANAGER c, TRUCKS: CTRUCK) and TYPE (TYPE_CODE n, TYP_DESCRIPTION c, TRUCKS: CTRUCK) each contain a collection of TRUCK objects. Note: c = character data, d = date data, n = numeric data.]
Figure PG.1 also illustrates that the CTRUCK class represents a collection of TRUCK objects. In other words, one instance of the CTRUCK class will contain several instances of the class TRUCK.

2. Using the tables in Figure PG.1 as a source of information:

a. Define the implied business rules for the relationships.

Given the tables in Figure PG.1, you may develop the following relationships:
• A BASE can have many TRUCKs.
• Each TRUCK belongs to only one BASE.
• A TRUCK has only one truck TYPE.
• Each truck TYPE may have several TRUCKs belonging to it.
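One way to make the conversion in Problem 1 (and the business rules just listed) tangible is a quick Python sketch. The class and attribute names follow Figure PG.1, but the constructor signatures and sample values are assumptions made for illustration:

```python
class Base:
    def __init__(self, base_code, base_city):
        self.base_code = base_code
        self.base_city = base_city
        self.trucks = []          # plays the role of the CTRUCK collection

class TruckType:
    def __init__(self, type_code, description):
        self.type_code = type_code
        self.description = description
        self.trucks = []          # TRUCKs classified under this type

class Truck:
    def __init__(self, truck_num, base, truck_type):
        self.truck_num = truck_num
        self.base = base          # object reference, not a foreign-key value
        self.truck_type = truck_type
        base.trucks.append(self)        # maintain both sides of each 1:M link
        truck_type.trucks.append(self)

nashville = Base("501", "Nashville")
double_axle = TruckType(1, "Single box, double-axle")
t = Truck("5001", nashville, double_axle)

print(t.base.base_city)                         # Nashville
print([x.truck_num for x in nashville.trucks])  # ['5001']
```

Note that the 1:M rules fall out of the structure: each Truck holds exactly one Base and one TruckType reference, while each Base and TruckType holds a collection of Truck references.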
b. Using your best judgment, choose the type of participation of the entities in the relationship (mandatory or optional). Explain your choices.

From the data shown in Figure PG.1 you can conclude that:
• BASE and TYPE are mandatory for TRUCK: a TRUCK must have a BASE, and a truck is of a given TYPE.
• TRUCK is mandatory for BASE: a BASE must have at least one TRUCK to be considered a BASE.
• TRUCK is optional for TYPE: there can be zero, one, or more TRUCKs belonging to a TYPE.

c. Develop the conceptual object schema.

Using the results of Problems (a) and (b), the conceptual object schema is represented by Figure PG.2C.
Figure PG.2C The Conceptual Object Schema
[Figure: Object instances. TRUCK (OID: TX34): TRUCK_NUM: 5001, BASE: [BD39], TYPE: [DF56], TRUCK_MILES: 167123.5, TRUCK_BUY_DATE: 11/8/07, TRUCK_SERIAL_NUM: AA-322-12212-W11. BASE (OID: BD39): BASE_CITY: Nashville, BASE_STATE: TN, BASE_AREA_CODE: 615, BASE_PHONE: 123-4567, BASE_MANAGER: Andrea D. Gallager, TRUCKS: [Y678]. TYPE (OID: DF56): TYPE_CODE: 1, TYPE_DESCRIPTION: Single box, double-axle, TRUCKS: [Y54F]. CTRUCK (OID: Y54F): [TX34]. CTRUCK (OID: Y678): [TX34], [TX37], [TX65].]
3. Using the data presented in Problem 1, develop an object space diagram representing the object's state for the instances of TRUCK listed below. Label each component clearly with proper OIDs and attribute names.

a. The instance of the class TRUCK with TRUCK_NUM = 5001

The instance of this class is shown in Problem 2c's conceptual object schema (Figure PG.2C).

b. The instances of the class TRUCK with TRUCK_NUM = 5003 and 5004.

As you examine the conceptual object schema shown in Problem 2c, note the following features:
• OIDs are used to reference the object instances of the classes BASE and TYPE.
• The BASE and TYPE object instances reference two different CTRUCK object instances.
• Using the OIDs, each CTRUCK object instance contains the references to several object instances of the class TRUCK.
Using these features, the conceptual object schema looks like Figure PG.3b.
Figure PG.3b The Conceptual Object Schema
[Figure: Object instances. TRUCK (OID: TX37): TRUCK_NUM: 5003, BASE: [BD39], TYPE: [DF48], TRUCK_MILES: 221346.6, TRUCK_BUY_DATE: 12/27/07, TRUCK_SERIAL_NUM: AC-445-78656-Z99. TRUCK (OID: TX65): TRUCK_NUM: 5004, BASE: [BD39], TYPE: [DF56], TRUCK_MILES: 99894.3, TRUCK_BUY_DATE: 2/21/08, TRUCK_SERIAL_NUM: WG-11223144-T34. BASE (OID: BD39): BASE_CITY: Nashville, BASE_STATE: TN, BASE_AREA_CODE: 615, BASE_PHONE: 123-4567, BASE_MANAGER: Andrea D. Gallager, TRUCKS: [Y678]. TYPE (OID: DF56): TYPE_CODE: 1, TYPE_DESCRIPTION: Single box, double-axle. TYPE (OID: DF48): TYPE_CODE: 2, TYPE_DESCRIPTION: Single box, single-axle. CTRUCK (OID: Y54F): [TX37], [TX65], ... CTRUCK (OID: Y678): [TX34], [TX37], [TX65].]
As you examine Figure PG.3b's conceptual object schema, note the following features:
• OIDs are used to reference the object instances of the classes BASE and TYPE.
• Both object instances reference the same BASE and TYPE object instances. This property is also called referential object sharing.

4. Given the information in Problem 1, define a superclass VEHICLE for the TRUCK class. Redraw the object space you developed in Problem 3, taking into consideration the new superclass that you just added to the class hierarchy.

To add a superclass VEHICLE to the TRUCK class, first define the superclass VEHICLE, after which you can create the subclass TRUCK. After this task has been completed, the end user will see the attributes and methods inherited from VEHICLE as though they belonged to TRUCK. (The user does not perceive the difference!) To illustrate this point, the object space must also show the new VEHICLE instance. (See Figure PG.4.)
Figure PG.4 The Conceptual Object Schema
[Figure: VEHICLE (OID: VF345): MAKER: Ford, YEAR: 1992. The class/subclass relationship links VEHICLE to TRUCK (OID: TX34), which inherits MAKER: Ford and YEAR: 1992 from the VEHICLE superclass and adds TRUCK_NUM: 5001, BASE: [BD39], TYPE: [DF56], TRUCK_MILES: 162123.5, TRUCK_BUY_DATE: 11/08/07, TRUCK_SERIAL_NUM: AA-322-12212-W11. Interclass relationships connect TRUCK to BASE (OID: BD39): BASE_CITY: Nashville, BASE_STATE: TN, BASE_AREA_CODE: 615, BASE_PHONE: 123-4567, BASE_MANAGER: Andrea D. Gallager, TRUCKS: [Y678]; and to TYPE (OID: DF56): TYPE_CODE: 1, TYPE_DESCRIPTION: Single box, double-axle, TRUCKS: [Y54F]. CTRUCK (OID: Y678): [TX34], [TX37], [TX65]. CTRUCK (OID: Y54F): [TX34].]
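The superclass/subclass arrangement in Problem 4 might be sketched as follows. The describe method is an invented addition (it is not part of the figure) included only to show method overriding alongside attribute inheritance:

```python
class Vehicle:
    def __init__(self, maker, year):
        self.maker = maker
        self.year = year

    def describe(self):
        return f"{self.year} {self.maker}"

class Truck(Vehicle):
    def __init__(self, maker, year, truck_num):
        super().__init__(maker, year)   # MAKER and YEAR come from the superclass
        self.truck_num = truck_num

    def describe(self):
        # Method overriding: same message, refined subclass behavior.
        return f"Truck {self.truck_num}: {super().describe()}"

t = Truck("Ford", 1992, "5001")
print(t.maker)       # Ford (inherited attribute; the user sees no difference)
print(t.describe())  # Truck 5001: 1992 Ford
```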
5. Assume the following business rules:
• A course contains many sections, but each section references only one course.
• A section is taught by one professor, but each professor may teach one or more different sections of one or more courses.
• A section may contain many students, and each student may be enrolled in many sections, but each section belongs to a different course. (Students may take many courses, but they cannot take many sections of the same course.)
• Each section is taught in one room, but each room may be used to teach different sections of one or more courses.
• A professor advises many students, but a student has only one advisor.

Based on those business rules:

a. Identify and describe the main classes of objects.

Using the business rules, we may identify the following objects: COURSE, CLASS, PROFESSOR, STUDENT, and ROOM.
NOTE We commonly use CLASS to identify a section of a COURSE. (In fact, all of the examples in Chapters 2 and 3 were based on this convention.) We use this convention for the simple reason that it properly reflects commonly used language. For example, students invariably will tell you that they have enrolled in your class; they'll tell you they're going to your class, rather than going to your section. However, do keep in mind that "class" has a specific (and different!) meaning in the OO environment. Fortunately, the context in which "class" is used easily identifies which "class" you're talking about.
The classes corresponding to these objects are shown in Figure PG.5a.
Figure PG.5a The Conceptual Object Schema
[Figure: Five classes and their relationships. STUDENT (STU_NUM c, STU_LNAME c, STU_FNAME c, STU_ADDRESS c, STU_CITY c, STU_STATE c, STU_ZIPCODE c, STU_CUM_GPA n, STU_SEM_GPA n, SCHEDULE: collection of CLASS, ADVISOR: PROFESSOR). CLASS (COURSE: COURSE, PROFESSOR: PROFESSOR, ROOM: ROOM, ENROLL: collection of STUDENT with GRADE). COURSE (CRS_CODE c, CRS_DESCRIPTION c, CRS_CREDIT n, OFFERING: collection of CLASS). PROFESSOR (PROF_NUM c, PROF_NAME c, PROF_DOB d, DEPT_CODE c, TEACH_LOAD: collection of CLASS, ADVISEES: collection of STUDENT). ROOM (BLDG_CODE c, ROOM_NUM c, RESERVATION: collection of CLASS). COURSE, PROFESSOR, and ROOM each participate in a 1:M relationship with CLASS; STUDENT and CLASS participate in an M:N relationship; PROFESSOR and STUDENT participate in a 1:M advising relationship. Note: c = character, d = date, n = numeric.]
Use the following descriptions to characterize the model's components:

COURSE
OFFERING INCLUDES CLASS
A COURSE CAN GENERATE MANY CLASSES
CLASS IS OPTIONAL TO COURSE (a course may not be offered)

PROFESSOR
TEACH_LOAD INCLUDES CLASS
A PROFESSOR CAN TEACH MANY CLASSES
CLASS IS OPTIONAL TO PROFESSOR (a professor may not teach a class)
ADVISEES INCLUDES STUDENT
A PROFESSOR MAY ADVISE MANY STUDENTS
STUDENT IS OPTIONAL TO PROFESSOR (in the advises relationship)

ROOM
RESERVATION INCLUDES CLASS
ONE ROOM CAN HAVE MANY CLASSES SCHEDULED IN IT
CLASS IS OPTIONAL TO ROOM (a room may not have classes scheduled in it)
STUDENT ADVISOR INCLUDES PROFESSOR
A STUDENT HAS ONE PROFESSOR (who advises that student)
PROFESSOR IS MANDATORY (a student must have an advisor)
SCHEDULE INCLUDES CLASS
A STUDENT MAY TAKE MANY CLASSES (i.e., SECTIONS OF A COURSE)
CLASS IS MANDATORY TO STUDENT (a student must take at least one Section of a course)
CLASS REQUIRES A COURSE
COURSE IS MANDATORY (a class can't exist without a course)
PROFESSOR IS MANDATORY (a class must have a professor)
ROOM IS MANDATORY (a class must be taught in a room)
A CLASS MAY HAVE MANY STUDENTS ENROLLED IN IT
STUDENT IS OPTIONAL (a class may not have any students enrolled in it)
AN ENROLLED STUDENT RECEIVES A GRADE

b. Modify your description in Part (a) to include the use of abstract data types such as NAME, DOB, and ADDRESS.

An abstract data type allows us to create user-defined operations for that new type. To create a new data type, first define the abstract data types (classes) NAME, DOB (date of birth), and ADDRESS, as shown in Figure PG.5B-1.
Figure PG.5B-1 The Abstract Data Types (Classes)

NAME: FIRST_NAME (c), INITIAL (c), LAST_NAME (c)
DOB: MONTH (n), DAY (n), YEAR (n)
ADDRESS: STREET (c), APT_NUM (c), CITY (c), STATE (c), ZIPCODE (c)
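The three abstract data types can be approximated with small value classes. A hedged Python sketch follows (field names come from Figure PG.5B-1; the formatting methods are illustrative additions showing the kind of user-defined operations an abstract data type can carry):

```python
from dataclasses import dataclass

# Each user-defined type bundles its data with its own operations.

@dataclass
class Name:
    first_name: str
    initial: str
    last_name: str

    def full(self) -> str:
        """User-defined operation: render the name for display."""
        return f"{self.first_name} {self.initial}. {self.last_name}"

@dataclass
class Dob:
    month: int
    day: int
    year: int

    def iso(self) -> str:
        """User-defined operation: render the date in ISO form."""
        return f"{self.year:04d}-{self.month:02d}-{self.day:02d}"

@dataclass
class Address:
    street: str
    apt_num: str
    city: str
    state: str
    zipcode: str
```

A redefined PROFESSOR or STUDENT class would then declare attributes of type Name, Dob, and Address instead of separate character fields.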
Having created the new abstract data types or classes, we must redefine PROFESSOR and STUDENT classes so they can reference these newly created classes. For example, the object instance representation for a PROFESSOR will look like Figure PG.5B-2.
Figure PG.5B-2 The Object Instance Representation for PROFESSOR

PROFESSOR (OID: 230843): PROF_NAME: [M45]; PROF_DOB: [456]; PROF_ADDRESS: [401]; DEPT_CODE: CIS; TEACH_LOAD: [D40]; ADVISEES: [X34]
NAME (OID: M45): FIRST_NAME: June; INITIAL: W; LAST_NAME: Hasselblatt
DOB (OID: 456): MONTH: 10; DAY: 30; YEAR: 1961
ADDRESS (OID: 401): STREET: North Side Blvd.; APT_NUM: 1093B; CITY: Paris; STATE: TN; ZIPCODE: 37892
Within the new object space illustrated in Figure PG.5B-2, the PROFESSOR object instance now contains references to the NAME, DOB, and ADDRESS object instances.

c. Use object-representation diagrams to show the relationships between:
• Course and Section.
• Section and Professor.
• Professor and Student.

To answer this question, we must remember how 1:M relationships are interpreted in the OODM. We must also remember that the OODM interpretation of such 1:M relationships yields some important implications. Keep in mind that all pairs of objects exist in a 1:M relationship: a course has many Sections (classes), a professor teaches many classes, and a professor advises many students.
To save space in this manual, we will illustrate only one case of 1:M relationships; the same concepts apply to all cases. We will focus our attention on the relationships of the objects in the class PROFESSOR. The object representation for an object of the (OO) class PROFESSOR will look like Figure PG.5C.
Figure PG.5C The Object Representation for an Object of the Class PROFESSOR

PROFESSOR (OID: 230843): PROF_NAME: [M45]; PROF_DOB: [456]; PROF_ADDRESS: [401]; DEPT_CODE: CIS; TEACH_LOAD: [D40]; ADVISEES: [X34]
Collection of CLASS objects (OID: D40): references to CLASS objects with OIDs A34332, 349, 369, 380, ...
Collection of STUDENT objects (OID: X34): references to STUDENT objects with OIDs 346, 345, 556, 580, ...
Note that we have omitted the object instances for the classes NAME, DOB, and ADDRESS. (These classes are shown in the answer to problem 5g.) Note also that we have used the "collection of" classes to represent the collection of • CLASSes taught by the PROFESSOR. • STUDENTs advised by the PROFESSOR. Collection objects are used to implement 1:M relationships.
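The OID indirection in Figures PG.5B-2 and PG.5C can be mimicked with a dictionary that plays the role of the object space: attribute values store OIDs rather than the objects themselves, and a 1:M relationship is represented by the OID of a collection object. This is a teaching sketch only (the dictionary-based store is an illustrative stand-in for a real OODBMS; OID values are taken from the figure):

```python
# A toy "object space": every object is reachable only through its OID.
object_space = {
    "M45": {"FIRST_NAME": "June", "INITIAL": "W", "LAST_NAME": "Hasselblatt"},
    "456": {"MONTH": 10, "DAY": 30, "YEAR": 1961},
    "D40": ["A34332", "349", "369", "380"],   # collection object: CLASS OIDs
    "X34": ["346", "345", "556", "580"],      # collection object: STUDENT OIDs
    "230843": {                               # the PROFESSOR instance
        "PROF_NAME": "M45",
        "PROF_DOB": "456",
        "DEPT_CODE": "CIS",
        "TEACH_LOAD": "D40",                  # 1:M handled via a collection OID
        "ADVISEES": "X34",
    },
}

def deref(oid):
    """Follow an OID into the object space."""
    return object_space[oid]

prof = deref("230843")
name = deref(prof["PROF_NAME"])            # NAME instance via its OID
classes_taught = deref(prof["TEACH_LOAD"])  # the collection object
```

Dereferencing PROF_NAME yields the NAME instance, while dereferencing TEACH_LOAD yields the collection whose members are, in turn, OIDs of CLASS objects.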
d. Use object representation diagrams to show the relationships between:
• Section and Student.
• Room and Section.
What type of object is necessary to represent those relationships?

The relationship between CLASS (Section) and STUDENT is M:N; that is, each class has many students, and each student has many classes. The relationship between CLASS and ROOM is 1:M, because each class is taught in only one room and each room is used to teach several classes. We covered the use and representation of 1:M relationships in our answer to question 5c, so please refer to that material. Depending on the level of abstraction used, representing an M:N relationship in an object representation diagram is fairly simple. For example, at the conceptual level, we can show the relationship between CLASS and STUDENT as in Figure PG.5D-1:
Figure PG.5D-1 The Relationship Between CLASS and STUDENT

STUDENT: STU_NUM (C), STU_LNAME (C), STU_FNAME (C), STU_ADDRESS (C), STU_CITY (C), STU_STATE (C), STU_ZIPCODE (C), STU_CUM_GPA (N), STU_SEM_GPA (N); ADVISOR: PROFESSOR (1, required); SCHEDULE: CLASS + GRADE (M, required)
CLASS: COURSE: COURSE (1, required); PROFESSOR: PROFESSOR (1, required); ROOM: ROOM (1, required); ENROLL: STUDENT + GRADE (M, optional)

Note: C = Character, D = Date, N = Numeric
As you examine Figure PG.5D-1, note that:
• A student must be registered in one or more CLASSes, and the student earns a GRADE in each CLASS. (Reminder: We've used CLASS to represent a Section of a course.)
• The CLASS requires a COURSE, a PROFESSOR, and a ROOM.
• The CLASS may have one or more STUDENTs, each of whom earns a GRADE in that CLASS. In other words, STUDENT is optional to CLASS.

From a conceptual point of view, the preceding diagram captures both the nature and characteristics of the relationship between CLASS and STUDENT. At the implementation level, the object-oriented data model uses an intersection class to manage M:N relationships only when additional information (attributes) about the M:N relationship between the objects is required. In this case, GRADE is that additional information, so a GRADE is associated with a CLASS and a STUDENT. The intersection class is automatically included within the STUDENT and CLASS object space and represents the individual characteristics of the M:N relationship between them. For clarity's sake, we have labeled this new object STU-REC. The STU-REC object answers three questions: which students are in which Section, in which Sections is a given student registered, and what grade did each student earn in a Section? The object diagram in Figure PG.5D-2 shows these relationships:
Figure PG.5D-2 The Object Diagram for Problem PG.5d

STUDENT: STU_NUM (C), STU_LNAME (C), STU_FNAME (C), STU_ADDRESS (C), STU_CITY (C), STU_STATE (C), STU_ZIPCODE (C), STU_GPA (N); ADVISOR: PROFESSOR (1); SCHEDULE: STU_REC (M)
CLASS: COURSE: COURSE (1); PROFESSOR: PROFESSOR (1); ROOM: ROOM (1); ENROLL: STU_REC (M)
STU_REC (intersection class): STUDENT: STUDENT (1); CLASS: CLASS (1); CLASS GRADE

Note: C = Character, D = Date, N = Numeric
As you discuss Figure PG.5D-2, note that STU_REC (the student record) is the intersection class that represents the M:N relationship between STUDENT and CLASS.
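The intersection class can be sketched directly: each STU_REC instance holds exactly one STUDENT reference, one CLASS reference, and the grade, so the M:N relationship dissolves into two 1:M relationships. A minimal Python illustration (names follow Figure PG.5D-2; the `grade_of` lookup helper is an assumption added for demonstration):

```python
class Student:
    def __init__(self, stu_num):
        self.stu_num = stu_num
        self.schedule = []      # SCHEDULE: many STU_REC objects

class Class_:
    def __init__(self, class_code):
        self.class_code = class_code
        self.enroll = []        # ENROLL: many STU_REC objects

class StuRec:
    """Intersection class: one student, one class, plus the grade."""
    def __init__(self, student, class_, grade):
        self.student = student  # STUDENT: exactly one
        self.class_ = class_    # CLASS: exactly one
        self.grade = grade      # the attribute that justifies the class
        student.schedule.append(self)
        class_.enroll.append(self)

def grade_of(student, class_):
    # Illustrative helper: "what grade did this student earn in this class?"
    for rec in class_.enroll:
        if rec.student is student:
            return rec.grade
    return None
```

Walking `class_.enroll` answers "which students are in this Section," walking `student.schedule` answers "which Sections is this student in," and each record carries the grade.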
e. Using an OO generalization, define a superclass PERSON for STUDENT and PROFESSOR. Describe this new superclass and its relationship to its subclasses.

A superclass PERSON can be defined for STUDENT and PROFESSOR. PERSON will contain the following attributes:
Attribute name    Data Type
NAME              NAME
DOB               DOB
ADDRESS           ADDRESS

STUDENT and PROFESSOR will inherit the above attributes from their superclass PERSON. The class hierarchy will look like Figure PG.5E.
Figure PG.5E The Class Hierarchy

PERSON (superclass): NAME, DOB, ADDRESS
STUDENT (subclass): NAME, DOB, ADDRESS (inherited from PERSON); ADVISOR: PROFESSOR (1); SCHEDULE: STU_REC (M); STU_GPA (N)
PROFESSOR (subclass): NAME, DOB, ADDRESS (inherited from PERSON); DEPT_CODE (C); TEACH_LOAD: CLASS (M); ADVISEES: STUDENT (M)

Note: C = Character, D = Date, N = Numeric
As you discuss Figure PG.5E, note the differences between inheritance and interclass relationships. Explain that:
• Inheritance is automatic.
• Inheritance moves from top to bottom within the class hierarchy.
• Inheritance represents a 1:1 relationship between the superclass and its subclass(es).
• Inheritance need not be explicitly defined through the attribute data type.
In contrast, interclass relationships must be defined explicitly through the attribute's data type. In addition, interclass relationships may represent a 1:1, a 1:M, or an M:N relationship.
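The PERSON generalization maps directly onto class inheritance in an OO language. A brief sketch (a plain Python rendering of the hierarchy in Figure PG.5E, limited to the attributes shown there):

```python
class Person:                      # superclass
    def __init__(self, name, dob, address):
        self.name = name
        self.dob = dob
        self.address = address

class Student(Person):             # inherits NAME, DOB, ADDRESS automatically
    def __init__(self, name, dob, address, advisor):
        super().__init__(name, dob, address)
        self.advisor = advisor     # interclass relationship: defined explicitly
        self.schedule = []         # SCHEDULE: many STU_REC objects

class Professor(Person):           # inherits NAME, DOB, ADDRESS automatically
    def __init__(self, name, dob, address, dept_code):
        super().__init__(name, dob, address)
        self.dept_code = dept_code
        self.teach_load = []       # TEACH_LOAD: many CLASSes
        self.advisees = []         # ADVISEES: many STUDENTs
```

Note how the contrast drawn above shows up in the code: the inherited attributes appear nowhere in the subclass definitions, while every interclass relationship (ADVISOR, TEACH_LOAD, ADVISEES) must be declared explicitly.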
6. Convert the following relational database tables to the equivalent OO conceptual representation. Explain each of your conversions with the help of a diagram. (Note: The R&C Stores database includes the three tables shown in Figure PG.6.)
FIGURE PG.6 The R&C Stores Database
The conversion is shown in Figure PG.6-1.
Figure PG.6-1 The Completed OO Conceptual Representation for the R&C Stores Database

REGION: REGION_CODE (N), REGION_LOCATION (C); STORES: STORE (M)
STORE: STORE_CODE (N), STORE_NAME (C), STORE_YTD_SALES (N); REGION: REGION (1); MANAGER: EMPLOYEE (1); WORKERS: EMPLOYEE (M)
EMPLOYEE: EMP_CODE (N), EMP_TITLE (C), EMP_LNAME (C), EMP_FNAME (C), EMP_INITIAL (C); WORKS_AT: STORE (1); MANAGER_OF: STORE (1)

Note: C = Character, D = Date, N = Numeric
Note that Figure PG.6-1 reflects the following conditions: • Each REGION can have many STOREs. • The STORE object includes references to the REGION and EMPLOYEE objects. The EMPLOYEE object references reflect that an employee is a manager of a store and that each store employs many employees. • The EMPLOYEE object has reciprocal relationships with the STORE object. These relationships reflect that each employee works at one store and that each store is managed by one employee. The latter relationship makes STORE optional to EMPLOYEE, because not all employees manage a store.
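The conversion rules just listed can be sketched in code: the 1:M REGION-STORE relationship becomes a collection on the "1" side and a single reference on the "M" side, and the STORE-EMPLOYEE pair carries reciprocal references for both the works-at and manager relationships. A hedged Python illustration (attribute names follow Figure PG.6-1; the `promote` helper is an assumption added to show how the reciprocal manager references stay in sync):

```python
class Region:
    def __init__(self, region_code, region_location):
        self.region_code = region_code
        self.region_location = region_location
        self.stores = []             # STORES: many STOREs

class Store:
    def __init__(self, store_code, store_name, region):
        self.store_code = store_code
        self.store_name = store_name
        self.region = region         # REGION: exactly one
        self.manager = None          # MANAGER: one EMPLOYEE
        self.workers = []            # WORKERS: many EMPLOYEEs
        region.stores.append(self)

class Employee:
    def __init__(self, emp_code, emp_lname, works_at):
        self.emp_code = emp_code
        self.emp_lname = emp_lname
        self.works_at = works_at     # WORKS_AT: exactly one STORE
        self.manager_of = None       # MANAGER_OF: optional (most don't manage)
        works_at.workers.append(self)

def promote(employee, store):
    """Set both sides of the reciprocal manager relationship."""
    employee.manager_of = store
    store.manager = employee
```

Leaving `manager_of` as None by default captures the last bullet above: STORE is optional to EMPLOYEE because not all employees manage a store.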
7. Convert the following relational database tables to the equivalent OO conceptual representation. Explain each of your conversions with the help of a diagram. (Note: The Avion Sales database includes the tables shown in Figure PG.7.)
FIGURE PG.7 The Avion Sales Database
The OO representation is shown in Figure PG.7-1.
Figure PG.7-1 The Completed OO Conceptual Representation for the Avion Sales Database

CUSTOMER: CUS_NUM (N), CUS_LNAME (C), CUS_FNAME (C), CUS_INITIAL (C), CUS_CREDIT (C), CUS_BALANCE (C); INVOICES: INVOICE (M)
INVOICE: INV_NUM (N), INV_DATE (D), INV_SUB (N), INV_TAX (N), INV_TOTAL (N), INV_PYMT (N); CUSTOMER: CUSTOMER (1); SALESREP: EMPLOYEE (1); LINES: INVLINE (M)
INVLINE: INVLINE_NUM (N), INVLINE_UNITS (N), INVLINE_PRICE (N), INVLINE_TOTAL (N); INVOICE: INVOICE (1); PRODUCT: PRODUCT (1)
PRODUCT: PROD_CODE (N), PROD_COST (N), PROD_PRICE (N), PROD_QOH (N), PROD_MIN_QOH (N), PROD_LAST_ORDER (D); LINES: INVLINE (M)
EMPLOYEE: EMP_NUM (N), EMP_TITLE (C), EMP_LNAME (C), EMP_FNAME (C), EMP_INITIAL (C), EMP_HIRE_DATE (D); INVOICES: INVOICE (M)

Note: C = Character, D = Date, N = Numeric
8. Using the ERD shown in Appendix C, "The University Lab Conceptual Design Verification, Logical Design, and Implementation," Figure C.22 (the Check_Out component), create the equivalent OO representation.

Figure C.22 in Appendix C shows how the M:N relationship between USER and ITEM can be implemented by modeling the relationship through the Check_Out (bridge) entity. The OO representation of the M:N USER-ITEM relationship uses a CHECKOUT object. This object has its own attributes, and it references the USER and ITEM objects as shown in Figure PG.8. Note that the CHECKOUT object is a complex object that contains a group of repeating attributes: item, location, quantity, and date in.
Figure PG.8 The Completed OO Conceptual Representation for Figure C.22's Check-Out Component

USER: USER_ID (N), USER_CLASS (C), USER_SEX (C), USER_TYPE (C); DEPARTMENT: DEPARTMENT (1); CHECKOUT: CHECKOUT (M)
CHECKOUT: CO_ID (N), CO_DATE (D); USER: USER (1); CO_ITEMS (repeating group, M): ITEM (1), LOCATION (1), COI_QTY (N), COI_DATE_IN (D)
ITEM: ITEM_ID (N), ITEM_UNIV_ID (N), ITEM_SERIAL_NUM (N), ITEM_DESCRIPTION (C), ITEM_QTY (N), ITEM_STATUS (C), ITEM_BUY_DATE (D); INV_TYPE: INV_TYPE (1); VENDOR: VENDOR (1); CHECKOUT: CHECKOUT (M)

Note: C = Character, D = Date, N = Numeric
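The repeating group inside the complex CHECKOUT object can be modeled as a nested list of line entries, one per checked-out item. A minimal sketch (names follow Figure PG.8; representing each repeating-group entry as a dictionary, and the `add_item` helper, are illustrative assumptions):

```python
class Checkout:
    """Complex object: header attributes plus a repeating group of items."""
    def __init__(self, co_id, co_date, user):
        self.co_id = co_id
        self.co_date = co_date
        self.user = user             # USER: exactly one (here just an ID string)
        self.co_items = []           # CO_ITEMS: the repeating group

    def add_item(self, item, location, qty, date_in=None):
        # Each entry repeats the same four attributes: item, location,
        # quantity, and date in (None until the item is returned).
        self.co_items.append(
            {"item": item, "location": location,
             "coi_qty": qty, "coi_date_in": date_in})

co = Checkout(1001, "2024-01-15", user="U-17")
co.add_item("microscope", "Lab 3", 1)
co.add_item("cables", "Lab 3", 4)
```

A single CHECKOUT instance thus replaces the bridge-entity rows of the relational design: one header, many embedded item lines.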
9. Using the contracting company's ERD in Chapter 6, "Normalization of Database Tables," Figure 6.15, create the equivalent OO representation.

Figure 6.15 depicts the M:N relationship between EMPLOYEE and PROJECT. The object representation of this relationship is shown in Figure PG.9.
Figure PG.9 The Completed OO Conceptual Representation for Figure 6.15's Contracting Company ERD

EMPLOYEE: EMP_NUM (N), EMP_LNAME (C), EMP_FNAME (C), EMP_INITIAL (C), EMP_HIREDATE (D); ASSIGN: ASSIGN (M)
ASSIGN: ASSIGN_NUM (N), ASSIGN_DATE (D), ASSIGN_HOURS (N); EMPLOYEE: EMPLOYEE (1); PROJECT: PROJECT (1); JOB: JOB (1)
PROJECT: PROJ_NUM (N), PROJ_NAME (C), PROJ_DATE (D); ASSIGN: ASSIGN (M)
JOB: JOB_CODE (N), JOB_DESCRIPTION (C), JOB_CHG_HOUR (N); ASSIGN: ASSIGN (M)

Note: C = Character, D = Date, N = Numeric
Appendix I Databases in Electronic Commerce

Discussion Focus

Use some major websites to illustrate the use of electronic commerce. For example, show how you would buy an airline ticket, a computer, or a book. This chapter's problem section is an excellent source for discussions about the use of the Web to buy goods and services and to make product/service/price comparisons. www.pricewatch.com and www.priceline.com are excellent ways to illustrate how you might become a more informed shopper.
Answers to Review Questions

1. What does e-commerce mean and how did it evolve?

Electronic commerce (e-commerce) is the use of electronic computer-based technology to:
• Bring new products, services, or ideas to market.
• Support and enhance business operations, including the sale of products and/or services over the Web.

The Internet started in the early 1960s as a military project to ensure the survival of computer communications in case of nuclear attack. However, the Internet soon became the prime vehicle for sharing academic research, thus making higher education institutions the Internet's primary users. (Section I.2, "The Road to Electronic Commerce," shows the evolution toward e-commerce.)
• During the early 1960s, banks created a private telephone network to do electronic funds transfers (EFT). This service allowed two banks to exchange funds electronically in a fast, efficient, and secure manner.
• During the early 1970s, banks also created services such as the automated teller machine (ATM) to provide after-hours services to their customers. ATMs were initially installed by only a few banks nationwide, and these banks permitted only a limited number of account transactions.
• During the late 1970s and early 1980s, there was a boom in the use of Electronic Data Interchange (EDI). EDI enables two companies to exchange business documents over private phone networks. The use of EDI facilitated the coordination of business operations between business partners.
• The early 1980s and the 1990s brought us the personal computer, which triggered the Internet's accelerated growth. The wide acceptance and use of the Internet led to the current dominance of the World Wide Web. The Web made the transfer of information among multiple organizations as simple as a click. The Web also became fertile ground for the exploration and exploitation of new Internet-based technologies to enhance business processes within and between corporations.

2. Identify and briefly explain five advantages and five disadvantages of e-commerce.
A summary of the advantages of e-commerce is described in Section I.3.1. We can briefly enumerate benefits such as:
• Comparison shopping
• Reduced costs and increased competition
• Convenience for online shoppers
• 24 x 7 x 365 operations
• Global access
• Lower barriers to entry
• Increased market (customer) knowledge

A summary of the disadvantages is presented in Section I.3.2:
• Hidden costs of operation
• Network unreliability
• Higher costs of staying in business
• Lack of security
• Loss of privacy
• Low service levels
• Legal issues (fraud, copyright problems)

3. Define and contrast B2B and B2C e-commerce styles.

E-commerce styles can be classified as:
• Business to Business (B2B): electronic commerce between businesses.
• Business to Consumer (B2C): electronic commerce between businesses and consumers.
• Intra-Business: electronic commerce activities between employers and employees.

Business-to-business (B2B) is a type of e-commerce in which a business sells products and/or services to other businesses. B2B refers to all types of electronic commerce transactions that take place between businesses. The seller is any company that sells a product or service using electronic exchanges (such as over the Internet or using EDI). The buyer may be a not-for-profit organization such as the Red Cross, a for-profit company such as Dell Computers, or a government organization such as a local municipality. A business-to-consumer (B2C) website sells products or services directly to consumers or end users. In B2C e-commerce, the main focus is on using the Internet -- in particular, the Web -- as a marketing, sales, and post-sales support channel. A complete discussion of e-commerce styles is presented in Section I.4.

4. Describe and give an example of each of the two principal B2B forms.

B2B Integration.
In this scenario, companies establish partnerships to reduce costs and time, to improve business opportunities, and to enhance competitiveness. For example, a company that manufactures computers might partner with its suppliers for hard disks, memory and other components. Such a partnership will help automate its purchasing system by integrating it with its suppliers’ ordering
systems -- which, in turn, will tie into their respective inventory systems. In this case, when a component's stock in company "A" falls below the minimum, the system will automatically generate an order to supplier "S." Both systems would be integrated and would exchange business data, probably using XML. (See Section 13.8.) Using the same technique, company "A" may also integrate its distribution system with its distributors. Finally, the distributors may integrate their operations with those of their retailers, which in turn may integrate their activities with those of their customers. As a consequence of such integration, companies (sellers) learn to operate with other companies (buyers), and the integration of their operations makes it possible to achieve a level of efficiency that makes it difficult to switch to another provider.

B2B Marketplace. In this scenario, the objective is to provide a way in which businesses can easily search, compare, and purchase products and services from other businesses. The web-based system basically works as an online broker to serve both buyers and sellers. Given such an environment, many of the activities are focused on attracting new members -- either providers or buyers. The "broker" offers sellers a way to market their products and/or services to other businesses, while buyers are attracted by the fact that they can compare products from different sellers and get access to special deals that are offered only to members. Given this B2B Marketplace scenario, the broker obtains revenue through membership and transaction fees. An example of a B2B web marketplace for the automotive industry is the http://www.covisint.com website.

5. Describe the e-commerce architecture; then briefly describe each of its components.
The e-commerce architecture is described in detail in Section I.5, "E-commerce Architecture." Start with the three layers: Basic Internet Services, Business Enabling Services, and E-commerce Business Services. Figure I.6, "E-commerce Architecture," provides a good summary. The basic building blocks are provided in Table I.2, "Internet Building Blocks and Basic Services." The business-enabling services are summarized in Table I.3, "Business-Enabling Services."

6. What types of services are provided by the bottom layer of the e-commerce architecture?

The bottom layer of the e-commerce architecture, known as the Internet Basic Services layer, describes the basic building blocks and services that are provided by the Internet and the World Wide Web. The Internet provides the basic services that facilitate the transmission of data and information between computers.

7. Name and explain the operation of the main building blocks of the Internet and its basic services.

The main Internet services are transport services, such as TCP/IP, routers, and other protocols; document formatting, storage, addressing, retrieval, and presentation protocols, such as the World Wide Web, web servers, web browsers, HTML, URL, HTTP, and the File Transfer Protocol; and other services such as e-mail, newsgroups, and discussion groups. A complete description of the main Internet services is located in Chapter 13's Table 13.1 and in Section 13.5.1.
8. What does the business-enabling services layer do? What services does it provide? Give six examples of business-enabling services.

The Internet Basic Services (IBS) layer only forms a foundation on which to run a basic website. However, IBS does not provide the services required for even elementary business transactions. The Business Enabling Services layer provides the additional services needed to support business transactions. Business transactions require accountability, reliability, authentication, trust, fidelity, and performance. These requirements are supported through hardware and software components that work together to provide the functionality not provided by the other layers. Table I.3 describes services that are used to enhance websites by providing their users the ability to perform searches, authenticate and secure business data, manage website contents, and so on. The list in Table I.3 is not exhaustive -- technological advances enable new services, which in turn are used to enable additional business services. The business-enabling services are search services, security, site monitoring and analysis, load balancing, personalization, web development, database integration, transaction management, messaging, and support for multiple devices. The services provided by this layer are built on top of the Internet Basic Services to provide the additional services required to support business transactions.

9. What is the definition of security? Explain why security is so important for e-commerce transactions.

In an e-commerce context, security refers to all the activities associated with the protection of data and other components against accidental or intentional (probably illegal) use by unauthorized users. Privacy deals with the rights of individuals and organizations to determine the "who, what, when, where, and how" of the authorization to use their data.
Providing security is a major concern of e-commerce. Companies spend millions of dollars annually on hardware and software to protect their own data (including personal customer data) and property against criminal activities. For e-commerce to be successful, it must ensure the security and privacy of all business transactions and the data associated with those transactions.

10. Give an example of an e-commerce transaction scenario. What three things should security be concerned with in this e-commerce transaction?

E-commerce data must be secured from a transaction's beginning to its conclusion. Note, for example, the following transaction sequence:
• A customer buying products online from home enters order and credit card information in a merchant's web page.
• The information travels from the customer's computer over the Internet to the merchant's web server.
• The merchant's web server receives the order and credit card data and stores these data in a database.
• The web server sends the order confirmation and shipping information back to the client.
• The seller uses a third-party shipping company to deliver the products to the customer.
• The seller uses a third-party payment processing company to settle payment.
• The shipping company delivers the product to the customer.
• At the end of the month, the customer receives his/her credit card statement -- possibly electronically.
• The customer pays the credit card bill, either by writing and mailing a check or through electronic funds transfer.
These transaction components are easily illustrated with the help of Figure I.8, "A Sample E-commerce Transaction." Given the transaction scenario in Figure I.8, security (procedures and technology) deals with all activities required to:
• Guarantee the identity of the transaction's participants by ensuring that both the buyer and the seller are who they say they are. In other words, there must be a secure way to properly identify the transaction participants and the authenticity of the messages.
• Protect the transaction data from unauthorized modification while it travels on the Internet. Because it is not feasible to have private lines connecting every pair of computers, we use the Internet. Unlike private lines that directly connect the sender with the receiver, the Internet is formed by millions of interconnected networks. E-commerce data must pass through several different networks to travel from the client to the server; this increases the risk of data being stolen, modified, or forged.
• Protect the resources (data and computers). This includes protecting the end-user and business data stored on the web server and databases from unauthorized access. It also includes securing the web server against attacks from hackers wanting to break into the system to modify or steal data or to impair normal operations by limiting resource availability.

11. You are hired as a resource security officer for an e-commerce company. Briefly discuss what technical issues you must address in your security plan.

The security plan should include issues such as physical security of the computing environment and protection of the data in the databases.
Online transaction security must also cover issues such as authentication, the use of digital certificates to ensure the identity of the parties involved in business transactions, and the use of public-key encryption with digital signatures to guarantee that the data traveling on the Internet cannot be tampered with or read by unauthorized parties. The security plan should include both resource security and transaction security. Transaction security includes encryption methods at the transport level, such as S-HTTP and SSL. Resource security deals with protecting the resources (hardware and software) that enable the conduct of e-commerce -- servers, routers, operating systems, and applications -- against threats posed by hackers, viruses, theft, and so on.
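One of the transport-level ideas above (detecting unauthorized modification of data in transit) can be illustrated with a message authentication code from the Python standard library. This is a teaching sketch, not a substitute for SSL/TLS or digital signatures; in practice the shared key would come from a key-exchange protocol, and the sample order string is invented:

```python
import hashlib
import hmac

def sign(key: bytes, message: bytes) -> str:
    """Sender attaches an HMAC tag so tampering can be detected."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(key: bytes, message: bytes, tag: str) -> bool:
    """Receiver recomputes the HMAC and compares in constant time."""
    return hmac.compare_digest(sign(key, message), tag)

# Illustrative transaction data (hypothetical values).
key = b"shared-secret"
order = b"amount=99.95;item=textbook"
tag = sign(key, order)
```

If an attacker alters the order on the wire (say, changing the amount), the recomputed tag no longer matches, so the modification is detected even though the data itself was readable; confidentiality would additionally require encryption.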
Problem Solutions

1. Use the Internet at your university computer lab or home to research the scenarios described in Problems 1-10. Then work through the following problems:
a. What websites did you visit?
b. Classify each site (B2B, B2C, and so on).
c. What information did you collect? Was the information useful? Why or why not?
d. What decision(s) did you make based on the information you collected?

The format is provided in the answers to Problems 2-9. Naturally, the websites shown here change periodically, so use the examples as a general guide. Also, keep in mind that there are many sites beyond those shown in the answers to Problems 2-9.
2. Research -- and document -- the purchase of a new car. Based on your research, explain why you plan to buy this car.
Websites (A): Vehix.com, CarMax.com, AutobyTel.com
Classification (B): B2C in each case
Information collected (C): Car models, features, comparisons, ratings, evaluations, and dealer prices on new and used models. The information was very useful.
Decision (D): Made an informed decision; found the most affordable car with the best ratings and features, aided by the capability of comparing car models and features.
3. Research -- and document -- the purchase of a new house.

Websites (A): Century21.com, RealEstate.com, ColdwellBanker.com
Classification (B): C2B/B2C (Century21.com, ColdwellBanker.com); B2C (RealEstate.com)
Information collected (C): Searched multiple homes based on my search criteria. The websites provide information such as school systems, nearby attractions, city guides, comparable house prices, and financing options. The sites also provided ways to place a home for sale. In addition, there were tips for buyers and sellers and price comparisons for new and older homes. Mortgage calculators were available to help determine what the buyer can qualify for.
Decision (D): I was able to determine how much home I could afford and found a home within the price range, location, and features desired.
4. You are in the market for a new job. Search the web for your ideal job. Document your job search and your job selection.

Websites (A): Monster.com, HotJobs.com, Headhunter.net
Classification (B): C2B/B2C (Monster.com); C2B/B2B/B2C (HotJobs.com, Headhunter.net)
Information collected (C): Searched job openings by location, industry, and salary range. Researched employers and compared salaries. Options to post a resume and obtain resume advice, a resume-writing service, and interview tips. Salary calculators, relocation information and services, and moving information.
Decision (D): Was able to fine-tune a job search and applied for jobs that matched my qualifications and experience. Was able to research companies and to compare salaries in different geographic regions.
5. You need to do your taxes. Download IRS Form 1040 and look for online tax processing help, documenting your search.

Websites (A): IRS.gov, Hrblock.com, CompleteTax.com
Classification (B): G2C (IRS.gov); B2C (Hrblock.com, CompleteTax.com)
Information collected (C): Obtained information about the latest tax laws and downloaded tax return forms. Learned how to file a tax return electronically. Found information about IRS red flags for auditing. Found tax advice in many different areas, determined how much to save in multiple retirement instruments, etc.
Decision (D): Searched for tax advisors within my area. Learned how retirement instruments can be used to save for retirement and to reduce the tax burden. Used many different tax calculators to estimate how much I will pay in taxes.
6. Research the purchase of a 20-year level term life insurance policy and report your findings.

Websites (A): QuoteSmith.com, Intellequote.com, Conseco.com
Classification (B): B2B/B2C (QuoteSmith.com, Intellequote.com); B2C (Conseco.com)
Information collected (C): Searched for policies by state, age, smoking status, and amount. Obtained policy details, latest prices, and providers. Compared insurance company premiums and ratings.
Decision (D): Could find the best possible deals in no time at all.
7. Research -- and document -- the purchase of a new computer.

Websites (A): Dell.com, Pricewatch.com, Gateway.com
Classification (B): B2C in each case
Information collected (C): Searched for computers; compared prices, options, and warranty information. Could configure my computer according to my specifications. Obtained leasing and credit term information. Compared prices on new and refurbished computers.
Decision (D): I was able to find my computer at the right price, with the right features, and with the best warranty.
8. Vacation time is almost here! Research -- and document -- the destination(s) and activities of next summer's vacation.

Websites (A): Travelocity.com, TourDeals.com, Expedia.com
Classification (B): B2C in each case
Information collected (C): Found information on vacation packages with all-inclusive features, such as airfare, hotel accommodations, and guided-tour details. I was able to perform searches by destination, travel dates, tour operators, etc. I could compare prices for tours and hotels. I was also able to find special deals and offers to various destinations. I could do all the trip planning online and get additional information such as currency conversions, city guides, comments from past users, weather information, etc.
Decision (D): I was able to search for (and find) multiple tour vacation packages that fit my criteria. The information provided was also useful to completely plan the vacation and do all booking online.
9. You have some money to invest. Research—and document—mutual fund information for investment purposes. Report your investment decision(s) based on the research you conduct.
A Vanguard.com
Fidelity.com
B B2C
B2C
C Obtained information about investing in the stock market, mutual funds markets, and markets for other instruments. Search yielded comparative fund information such as returns, expense ratios, ratings, market capitalization, family type, price history, etc. Obtained list of best-rated funds according to search criteria.
Morningstar.com
B2C
D Was able to determine the best investment strategy to fit my risk tolerance. Obtained all critical information appropriate to my investment needs. Enabled me to manage all of my investments online.
Appendix J Web Database Development with ColdFusion
Appendix J Web Database Development with ColdFusion

Discussion Focus

Here is a good opportunity to take a look at the "big picture" of Internet database development. Review the main points in Chapter 14, "Database Connectivity and Web Development," and Appendix J, "Web Database Development with ColdFusion." Specifically, focus on:
• Different database connectivity technologies.
• Multi-tier architecture for database development.
• How Web-to-database middleware is used to integrate databases with the Internet.

Rather than showing you long code listings in this manual, we guide you through the solution steps as they are found in the script files located on the Instructor's CD. Using this technique, the student can step through the solutions and see the code at the same time. Figure J.1 shows the RobCor ColdFusion application's main menu.
Figure J.1 RobCor Teacher Menu
The ColdFusion Problem Solutions section is a menu-driven system that guides you through all of the solutions for this appendix. For example, if you click on the Solutions to Problems link (see Figure J.1), you will open the page shown in Figure J.2.
Figure J.2 RobCor Problem Solutions Menu
If you click on the View link for the first row shown in Figure J.2, you will see the code for the rc-u0.cfm script in Figure J.3.
Figure J.3 RobCor View Code Sample
Answers to Review Questions

1. What are scripts, and how are they created in ColdFusion?

Scripts are a series of instructions interpreted and executed at run time. Scripts are used in web database application development to tell the application server which actions to perform, such as connecting to, querying, and updating a database from a web front end. Scripts are, for the most part, transparent to the clients. The application developer must create scripts to access the database and to create the web pages dynamically. The application server executes the scripts and passes the results (output) to the web server in HTML format.

2. Describe the basic services provided by the ColdFusion Web application server.

The ColdFusion Web Application Server provides the following services (among others):
• An integrated development environment.
• Session management with support for persistent application variables.
• Security and authentication.
• A computationally complete programming language (commands and functions) to represent and store business logic.
• Access to other services: FTP, SMTP, IMAP, POP, etc.
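The script-to-HTML flow described in question 1 can be illustrated outside ColdFusion. The following minimal Python stand-in (the table rows and column names are hypothetical, not taken from the RobCor scripts) shows the core idea: a server-side script runs a query and merges the result rows into markup before the page is returned to the web server.

```python
# A sketch (in Python, not CFML) of what a server-side script does: take the
# rows returned by a query and merge them into an HTML fragment, much as a
# CFOUTPUT loop would. The rows below stand in for a database result set.

def render_user_list(rows):
    """Build an HTML list from query results."""
    items = "".join(
        "<li>{} {}</li>".format(r["USR_FNAME"], r["USR_LNAME"]) for r in rows
    )
    return "<ul>" + items + "</ul>"

rows = [
    {"USR_FNAME": "Anne", "USR_LNAME": "Smith"},
    {"USR_FNAME": "Peter", "USR_LNAME": "Jones"},
]
html = render_user_list(rows)
```

The client only ever sees the generated HTML, which is why scripts are described above as transparent to the clients.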
3. Discuss the following assertion: The web is not capable of performing transaction management.

Note the discussion in Section J.2.3, Transaction Management. The concept of database transactions is foreign to the Web. Remember that the Web's request-reply model means that the Web client and the Web server interact by using very short messages. Those messages are limited to the request for and delivery of pages and their components. (Page components may include pictures, multimedia files, and so on.) The dilemma created by the Web's request-reply model is that:
• The Web cannot maintain an open line between the client and the database server.
• The mechanics of a recovery from incomplete or corrupted database transactions require that the client maintain an open communications line with the database server.

4. Transaction management is critical to the e-commerce environment. Given the assertion made in Item 3, how is transaction management supported?

Clearly, the creation of mission-critical Web applications mandates support for database transaction management capabilities. Given the just-described dilemma, designers must ensure proper transaction management support at the database server level. Many Web-to-database middleware products provide transaction management support. For example, ColdFusion provides this support through the use of its CFTRANSACTION tag. If the transaction load is very high, this function can be assigned to an independent computer. With that approach, the Web application and database servers are free to perform other tasks, and the overall transaction load is distributed among multiple processors.

5. Describe the Web page development problems related to database parent/child relationships.

This condition is addressed in detail in Section J.2.4, Denormalization of Database Tables.
Specifically, note the following: When the Web is used to interact with databases, the application design must take into account the fact that the Web forms cannot use the multiple data entry lines that are typical of parent-child (1:M) relationships. Yet those 1:M relationships are crucial in e-commerce. For example, think of order and order line, or invoice and invoice line. Most end users are familiar with the conventional GUI entry forms that support multi-table (parent-child) data entry through a multiple-component structure composed of a main form and a subform. Using such main-form/subform forms, the end user can enter multiple purchases associated with a single invoice. All data entry is done on a single screen. Unfortunately, the Web environment does not support this very common type of data entry screen. As illustrated in the ColdFusion script examples, the Web can easily handle single-table data entry. However, when multi-table data entries or updates are needed—such as order with order lines, invoice with invoice lines, and reservation with reservation lines—the Web falls short. Although
implementing the parent/child data entry is not impossible in a Web environment, its final outcome is less than optimal, usually counterintuitive, less user-friendly, and prone to errors.

To see how the Web developer might deal with the parent/child data entry, let's briefly examine the ORDER and ORDER_LINE relationship used to store customer orders. Using an applications middleware server such as ColdFusion to create a Web front end to update orders, one or more of the following techniques might be used:
• Design HTML frames to separate the screen into order header and detail lines. An additional frame would be used to provide status information or menu navigation.
• Use recursive calls to pages to refresh and display the latest items added to an order.
• Create temporary tables or server-side arrays to hold the child table data while in data entry mode. This technique is usually based on the bottom-up approach in which the end user first selects the products to order. When the ordering sequence is completed, the order-specific data, such as customer ID, shipping information, and credit card details, are entered. Using this technique, the order detail data are stored in the temporary tables or arrays.
• Use stored procedures or triggers to move the data from the temporary table or array to the master tables.

Although the Web itself does not support the parent/child data entry directly, it is possible to resort to Web programming languages such as Java, JavaScript, or VBScript to create the required Web interfaces. The price of that approach is a steeper application development learning curve and a need to hone programming skills. And while that augmentation works, it also means that complete programs are stored outside the HTML code that is used in a Web site.
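The "temporary table" technique described above can be sketched in miniature. This is a Python stand-in (the lists, function names, and sample products are hypothetical, not part of the RobCor solution): order lines accumulate in a temporary holding area, and only at checkout are the parent ORDERS row and the child ORDER_LINE rows written together.

```python
# Sketch of the bottom-up, temporary-table approach to 1:M data entry.
# Plain lists stand in for database tables; in a real system the cart would
# live in a server-side temporary table or array, and checkout() would be a
# stored procedure or trigger moving the data into the master tables.

orders = []        # master ORDERS table
order_lines = []   # master ORDER_LINE table
cart = []          # temporary holding area for the order in progress

def add_to_cart(product, qty, price):
    cart.append({"product": product, "qty": qty, "price": price})

def checkout(customer_id):
    """Move the cart contents into the master tables as one logical unit."""
    order_id = len(orders) + 1
    orders.append({"order_id": order_id, "customer": customer_id})
    for n, line in enumerate(cart, start=1):
        order_lines.append({"order_id": order_id, "line": n, **line})
    cart.clear()
    return order_id

add_to_cart("Flashlight", 2, 9.95)
add_to_cart("Battery", 4, 1.25)
oid = checkout("C123")
```

Note that the parent row is created only once the child data are complete, which is exactly the order-of-entry inversion the bullet list above describes.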
Problem Solutions

In the following exercises, you are required to create ColdFusion scripts. When you create these scripts, include one main script to show the records and the main options, for a total of five scripts for each table (show, search, add, edit, and delete). Consider and document foreign key and business rules when creating your scripts.

1. Create ColdFusion scripts to search, add, edit, and delete records for the USER table in the RobCor database.

This problem's solution is contained in the system summary below problem 5.

2. Create ColdFusion scripts to search, add, edit, and delete records for the INVTYPE table in the RobCor database.

This problem's solution is contained in the system summary below problem 5.

3. Create ColdFusion scripts to search, add, edit, and delete records for the VENDOR table in the RobCor database.

This problem's solution is contained in the system summary below problem 5.
4. Modify the insert scripts (rc-5a.cfm and rc-5b.cfm) for the DEPARTMENT table so that the only users who can be manager of a department are those who belong to that department.

This problem's solution is contained in the system summary below problem 5.

5. Create an Order data-entry screen, using the ORDERS and ORDER_LINE tables in the RobCor database. To do this, you can use frames and other advanced ColdFusion tags. Consult the online manual and review the demo applications.
NOTE The following pages show sample ColdFusion scripts that are required by the problem set. To avoid repetition and to save space, we have illustrated only one example of each script type (select, insert, update, and delete). Use the Instructor's CD to access the complete list of ColdFusion scripts. To install ColdFusion and the scripts, follow the instructions in the ColdFusion_Setup.doc file.
This series of exercises requires the student to create the data manipulation scripts for the USER, INVTYPE, and VENDOR tables. The logic used in these scripts is the same as that shown in Appendix J, "Web Database Development with ColdFusion." To produce a user-friendly environment, we have created a menu to access all USER table database operations. The menu lets the user add, edit, delete, and search records in the USER table. Please refer to the solution scripts found on the Instructor's CD. The menu script, named rc-u0.cfm, produces the output shown in Figure PJ.1.
Figure PJ.1 The User Management Menu
The following pages list the scripts that are required to add, edit, delete, and search records. For a complete listing, please refer to the Instructor's CD.
Inserting Records in the USER Table The rc-ua1.cfm script produces the data entry screen shown in Figure PJ.2. This data entry screen consists of an HTML form that contains several input boxes to enter the data.
Figure PJ.2 The Add User Web Form
When the user clicks the Add button, script rc-ua2.cfm is invoked. The rc-ua2.cfm script uses the CFINSERT tag to add a row to the database. (Check Figure PJ.3 to see the script's effect.)
Figure PJ.3 The Add User Results
Updating Records in the USER Table The rc-ue1.cfm script will show all the data that match the USER_ID selected by the user. In addition, the script lets the end user modify the selected data. Figure PJ.4 shows the script’s output.
Figure PJ.4 The Edit User Form
Clicking on the Add button will trigger the rc-ue2.cfm script, which runs the CFUPDATE tag to update the database. The rc-ue2.cfm script produces the output shown in Figure PJ.5.
Figure PJ.5 The Edit User Results Screen
Deleting Records from the USER Table The rc-ud1.cfm script produces the results shown in Figure PJ.6.
Figure PJ.6 The Delete User Form
In this rc-ud1.cfm script, the Delete button is shown only if this user is not a manager of a department and does not have any orders in the ORDERS table. Otherwise, the script will not allow you to delete
the USER record. After clicking on the Delete button, the rc-ud2.cfm script is invoked. The rc-ud2.cfm script produces the output shown in Figure PJ.7.
Figure PJ.7 The Delete User Results Screen
You have deleted the user record successfully
Searching for Records in the USER Table The rc-s1.cfm script produces the output shown in Figure PJ.8.
Figure PJ.8 The Search User Form
Figure PJ.8's very simple screen lets the end user enter a last name to conduct a search in the USER table, or the end user may select a user ID as the search key. Clicking on the Search button invokes the rc-us2.cfm script. The rc-us2.cfm script produces the output shown in Figure PJ.9.
Figure PJ.9 The Search User Results Screen
Note that the insert scripts (rc-5a.cfm and rc-5b.cfm) for the DEPARTMENT table only list the users who can be manager of a department. The key to this script is in the SELECT SQL statement. (Note the condition used in the WHERE clause. This condition lists only those users who are not already managers of a department.)

<CFQUERY NAME="USRLIST" DATASOURCE="RobCor">
SELECT USR_ID, USR_LNAME, USR_FNAME, USR_MNAME
FROM USER
WHERE USR_ID NOT IN
      (SELECT USR_ID FROM DEPARTMENT WHERE USR_ID > 0)
ORDER BY USR_LNAME, USR_FNAME, USR_MNAME
</CFQUERY>
The rc-5a.cfm script produces the output shown in Figure PJ.10.
Figure PJ.10 The Department Data Entry Screen
The rc-5b.cfm script uses a CFINSERT tag to add the data to the database. The rc-5b.cfm script's output is shown in Figure PJ.11.
Figure PJ.11 The Department Insert Query
Challenge Project

Create an Order data entry screen, using the ORDERS and ORDER_LINE tables in the RobCor database. (To complete this problem successfully, you should know how to use frames and ColdFusion tags. Please consult the online ColdFusion manual and study the demo applications to learn how such components can be developed.)

This challenge project requires the student to know HTML coding and the use of frames. In addition, some type of CGI programming or Java/JavaScript programming is recommended. Although Chapter 14 provides the basis for your students to develop Web-to-database interfaces, it does not cover all the components required to complete this problem. Therefore, you might pitch this problem at students who have some prior Web development experience. (Or perhaps you used supplemental material to examine our database design and implementation material from an applications development point of view!)

Even if your students do not (yet) have the appropriate Web application development skills, they will find the following discussion interesting and useful for several reasons. First, they will have a chance to revisit some important database design and implementation issues in a Web environment. Second, they will be exposed to some of the details of Web database applications development. Finally, they may even file this problem away in their minds, to be dusted off when they take Web-based classes!

From a design standpoint, the developer can approach this problem from several different angles:
• Create a multiple-frame page that will have one frame for the ORDER header information and another for the ORDER_LINE data entry. In this case, the ORDER data will have to be entered, validated, and saved first, before the ORDER_LINE frame is shown or accessed. Once the main ORDER data are saved, the second frame can be used to enter the ORDER_LINE rows. Both frames use buttons that enable the system to accept data entry and to perform validation checks.
This solution is not particularly well suited to a commercial e-commerce environment for two good reasons:
1. The end-user navigation among frames is awkward and is likely to be rejected by end users.
2. Keeping both frames synchronized is difficult and, unless the coding is particularly robust, is prone to failure.
• A second way to tackle this problem is to borrow the typical "shopping cart" style used by most on-line stores. Students can go to Amazon, Sears, or any other on-line store to step through the process of purchasing a product online. This process usually starts with the selection of all of the desired items and then progresses to the payment component. In other words, the process first collects the order line data and then, at its conclusion, collects the (invoice) payment data. Given this scenario, the browser must use temporary tables to store the data for the orders in progress. Later, such data are used to update the production database. In between, business logic is used to validate the data and to save such data in the proper format. The business logic is implemented through the use of CGI programs written in languages such as Perl or Java. In addition, the business logic component usually employs stored procedures in the database environment. This solution is generally preferred by on-line stores because it is based on well-established and proven technology. (In fact, you can even buy applications that provide the entire shopping cart feature straight out of the box!)
•
Using Java programming code or a plug-in application such as PowerBuilder may also solve the problem. This approach requires that the complete application be downloaded to the client computer, to be run locally. Given this scenario, the developer creates an interface that can handle the one-to-many simultaneous data entry format. The entire application logic is then sent to and executed by the client side. The client application is connected to the back-end database through the Web. This solution is similar to those offered by high-end Web-to-database middleware products such as NetObjects, IBM's WebSphere, or Oracle9i's Web database features. In most cases, the application will be in the form of a Java applet that works in tandem with a server applet.
Appendix P Working with MongoDB
Appendix P Working with MongoDB

NOTE
Most of this appendix, and all of the end-of-appendix problems, require the use of the Ch14_FACT.json file. The appendix provides instructions on how to import this file into a MongoDB database as a collection. The documents in this file are a reduced version of the data from the Ch07_FACT database used in Chapter 7.

It can be helpful to draw students' attention to the fact that this is a reduced data set compared to Chapter 7. The data set is reduced because it is limited to a specific intended application. Refer students back to Chapter 14, Big Data and NoSQL, to remind them that document databases like MongoDB are aggregate aware. Therefore, the data are organized into documents with a great deal of redundancy across documents, but in a manner that reduces the number of documents that need to be accessed during the processing of a transaction.
Answers to Review Questions

1. What is the difference between a replacement update and an operator update in MongoDB?

In MongoDB, a replacement update will replace the entire document being updated. If the existing document has key:value pairs that are not included in the update command, then those pairs are lost. Only the pairs specified in the update command will exist in the replaced version of the document. With an operator update, the existing document is unchanged except for the changes specified in the update command. Pairs not included in the update command are not affected.
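The contrast between the two update styles can be mimicked with plain dictionaries. The following is a minimal Python sketch of the semantics (not MongoDB shell code; the document fields are hypothetical): a replacement update keeps only the new pairs, while an operator update, like $set, preserves everything it does not name.

```python
# Sketch of MongoDB update semantics over a plain dict "document".
# replace_update mimics update(filter, {new doc}): all pairs except _id
# are replaced. operator_update mimics update(filter, {$set: {...}}):
# only the named pairs change.

def replace_update(doc, new_doc):
    return {"_id": doc["_id"], **new_doc}   # _id kept; all other pairs replaced

def operator_update(doc, set_pairs):
    return {**doc, **set_pairs}             # existing pairs kept unless overwritten

original = {"_id": 1, "fname": "Rachel", "age": 24}
replaced = replace_update(original, {"fname": "Rachel C."})
patched  = operator_update(original, {"fname": "Rachel C."})
```

After the replacement update, the age pair is gone; after the operator update, it survives, which is the loss-of-data hazard the answer above warns about.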
2. Explain what an upsert does. Upsert is a combination insert/update operation. If an existing document is found that matches the criteria given, then an update is performed on that document using the key:value pairs specified in the command. If an existing document is not found that matches the criteria given, then an insert is performed to create a document with the key:value pairs specified in the command.
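The branch the answer above describes can be sketched directly. This is a Python stand-in for upsert semantics (not MongoDB shell code; the collection and field names are hypothetical): update the first matching document if one exists, otherwise insert a new one built from the criteria plus the update pairs.

```python
# Sketch of upsert semantics over a list of dict "documents".

def upsert(collection, criteria, set_pairs):
    for doc in collection:
        if all(doc.get(k) == v for k, v in criteria.items()):
            doc.update(set_pairs)                 # match found: behave as an update
            return "updated"
    collection.append({**criteria, **set_pairs})  # no match: behave as an insert
    return "inserted"

patrons = [{"lname": "Cunningham", "age": 24}]
r1 = upsert(patrons, {"lname": "Cunningham"}, {"age": 25})
r2 = upsert(patrons, {"lname": "Norwood"}, {"age": 40})
```

Note that on the insert branch the criteria themselves become part of the new document, just as MongoDB folds the query fields into an upserted document.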
3. Describe the difference between using $push and $addToSet in MongoDB. Both commands are used to add a value to an array. The $push command will always add the value to the array, even if it results in duplicate values in the array. The $addToSet command will only add the value to the array if adding it does not result in duplicate values in the array.
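The difference can be shown with ordinary lists. This is a Python sketch of the two operators' behavior (not MongoDB shell code; the array contents are hypothetical):

```python
# Sketch of MongoDB array operators: push always appends (like $push),
# add_to_set appends only when the value is not already present (like $addToSet).

def push(array, value):
    array.append(value)

def add_to_set(array, value):
    if value not in array:
        array.append(value)

subjects = ["database"]
push(subjects, "database")        # duplicate allowed
add_to_set(subjects, "database")  # duplicate suppressed
add_to_set(subjects, "cloud")     # new value added
```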
4. Explain the functions used to enable pagination of results in MongoDB. Results can be provided in pages of information by using the limit() and skip() functions. The limit() function specifies how many results to return. The skip() function allows the programmer to provide an offset of documents before the limit is applied.
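The arithmetic behind limit() and skip() is simple enough to sketch: page n of size s skips (n - 1) * s documents and returns the next s. The following Python stand-in (not MongoDB shell code) uses slicing to do the same job:

```python
# Sketch of limit()/skip() pagination over a result list.

def page(results, page_number, page_size):
    skip = (page_number - 1) * page_size   # offset, as skip() provides
    return results[skip:skip + page_size]  # cap, as limit() provides

docs = list(range(1, 11))   # ten stand-in documents: 1..10
first_page = page(docs, 1, 3)
last_page = page(docs, 4, 3)
```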
5. Explain the difference in processing when using an explicit and versus an implicit and with MongoDB. With both forms of logical and, the DBMS must apply criteria to a document to determine if the document should be included in the results. An explicit and, using the $and operator, will determine that a document should not be included and stop applying criteria to that document as soon as one of the criteria evaluates to FALSE for that document. An implicit and will apply all criteria to the document before determining whether the document should be included in the results. As a result, an explicit and tends to perform better in most cases.
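The evaluation difference can be made concrete with a counter. This Python sketch (not MongoDB internals, just an illustration of the short-circuit claim above; the document and criteria are hypothetical) shows both forms reaching the same answer while checking a different number of criteria:

```python
# Sketch of the processing difference: explicit_and stops at the first failing
# criterion; implicit_and evaluates every criterion before deciding. Each
# function returns (included?, number_of_criteria_checked).

def explicit_and(doc, criteria):
    checks = 0
    for key, want in criteria.items():
        checks += 1
        if doc.get(key) != want:
            return False, checks          # short-circuit on first failure
    return True, checks

def implicit_and(doc, criteria):
    results = [doc.get(k) == v for k, v in criteria.items()]  # all evaluated
    return all(results), len(results)

doc = {"type": "student", "age": 24, "state": "TN"}
criteria = {"type": "faculty", "age": 24, "state": "TN"}
```

Both agree the document is excluded, but the explicit form stops after the first criterion fails while the implicit form checks all three.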
Problem Solutions

For the following set of problems, use the fact database and patron collection created in the text for use with MongoDB.

1. Create a new document in the patron collection. The document should satisfy the following requirements:
First name is "Rachel"
Last name is "Cunningham"
Display name is "Rachel Cunningham"
Patron type is student
Rachel is 24 years old
Rachel has never checked out a book
Be certain to use the same keys as already exist in the collection. Be certain capitalization is consistent with the documents already in the collection. Do not store any keys that do not have a value (in other words, no NULLs).

db.patron.insert(
  { fname: "Rachel",
    lname: "Cunningham",
    display: "Rachel Cunningham",
    type: "student",
    age: 24 }
);

2. Modify the document entered in the previous question with the following data. Do not replace the current document.
Rachel has checked out two books on January 25, 2018.
The id of the first checkout is "95000"
The first book checked out was book number 5237
Book 5237 is titled "Mastering the database environment"
Book 5237 was published in 2015 and is in the "database" subject
The id of the second checkout is "95001"
The second book checked out was book number 5240
Book 5240 is titled "iOS Programming"
Book 5240 was published in 2015 and is in the "programming" subject
Use the same keys as already exist within the collection. Conform to the existing documents in terms of capitalization.

db.patron.update(
  { "_id": ObjectId("5a45c23f395ff183e78d9c17") },
  { $set: { checkouts: [
      { "id": "95000", "year": "2018", "month": "1", "day": "25",
        "book": "5237", "title": "Mastering the database environment",
        "pubyear": "2015", "subject": "database" },
      { "id": "95001", "year": "2018", "month": "1", "day": "25",
        "book": "5240", "title": "iOS Programming",
        "pubyear": "2015", "subject": "programming" }
    ] } }
)
3. Write a query to retrieve the _id, display name, and age of students that have checked out a book in the cloud subject.

db.patron.find({type: "student", "checkouts.subject": "cloud"}, {display: 1, age: 1})
4. Write a query to retrieve only the first name, last name, and type of faculty patrons that have checked out at least one book with the subject “programming”. db.patron.find({type: "faculty", "checkouts.subject":"programming"}, {fname:1, lname:1, type:1, _id:0})
5. Write a query to retrieve the documents of patrons that are faculty and checked out book 5235, or that are students under the age of 30 that have checked out book 5240. Display the documents in a readable format.

db.patron.find({$or: [
    {type: "faculty", "checkouts.book": "5235"},
    {type: "student", "checkouts.book": "5240", age: {$lt: 30}}
  ]
}).pretty()

6. Write a query to display only the first name, last name, and age of students that are between the ages of 22 and 26.

db.patron.find({
    type: "student",
    $and: [ {age: {$gte: 22}}, {age: {$lte: 26}} ]
  },
  {fname: 1, lname: 1, age: 1, _id: 0}
)
Appendix Q Working with Neo4j
Appendix Q Working with Neo4j

NOTE
Most of this appendix, and all of the end-of-appendix problems, require the use of the Ch14_FCC.txt file. The appendix provides instructions on how to import this file into a Neo4j graph.
Answers to Review Questions

1. Explain the difference between using the same variable name and different variable names when matching multiple patterns in Neo4j.

Within a given command, all references to a variable are treated as references to the same object (node, edge, or path). Therefore, if the same variable is used in multiple patterns in the same command, then the same node or edge will be required to match both patterns. If different variable names are used, then the node or edge does not have to be the same node or edge in both patterns.
2. What is the difference between using WHERE and embedding properties in a node when creating a pattern in Neo4j? Embedded properties are much more limited. Embedded property specifications are treated as using an equality operator and combined using a logical AND. With a WHERE clause, other operators in addition to equality can be used such as less than, greater than, substrings, etc. Also, criteria in a WHERE clause can be combined with OR connectors as well as AND.
Problem Solutions

For the following problems, use the Food Critics Club (FCC) graph database that was created and used earlier in the text for use with Neo4j.

1. Create a node that meets the following requirements. Use existing labels and property names as appropriate. The node will be a member, and should be labeled as such, with member id 5000. The member's name is "Abraham Greenberg". Abraham was born in 1978 and lives in the state of "OH". Abraham's email address is agreen@nomail.com, and his username is agberg.
Create (:Member {mid: 5000, fname: "Abraham", lname: "Greenberg", birth: 1978, state: "OH", email: "agreen@nomail.com", username: "agberg"})

2. Create a restaurant node with restaurant id 10000, the name "Hungry Much", and located in Cobb Place, KY.

Create (:Restaurant {rid: 10000, name: "Hungry Much", state: "KY", city: "Cobb Place"})

3. Update the "Hungry Much" restaurant created above to add the phone number "(931) 555-8888" and a price rating of 2.

Match (r :Restaurant {name: "Hungry Much"})
Set r.phone = "(931) 555-8888", r.price = 2
4. Create a REVIEWED relationship between the member created above and the restaurant created above. The review should rate the restaurant as a 5 on taste, service, atmosphere, and value. Match (abe :Member {fname: "Abraham", lname: "Greenberg"}), (hungry :Restaurant {name: "Hungry Much"}) Create (abe) -[rev :REVIEWED {taste: 5, service: 5, atmosphere: 5, value: 5}]-> (hungry)
5. Create a REVIEWED relationship between member Frank Norwood and the restaurant created above. The review should rate the restaurant as a 4 on taste, service, and value, and rate the restaurant as a 2 on atmosphere. Match(frank :Member {fname: "Frank", lname: "Norwood"}), (hungry :Restaurant {name: "Hungry Much"}) Create(frank) -[rev :REVIEWED {taste:4, service: 4, atmosphere: 2, value: 4}]-> (hungry)
6. Write a query to display member Frank Norwood and every restaurant that he has rated as a 4 or above on value. Match (frank :Member {fname: "Frank", lname: "Norwood"}) -[rev :REVIEWED]-> (rest :Restaurant) Where rev.value >= 4 Return frank, rest
7. Write a query to display cuisine, restaurant, and owner for every "American" or "Steakhouse" cuisine restaurant.

Match (c :Cuisine) <-- (rest :Restaurant) <-[:OWNS]- (o :Owner)
Where c.name = "American" OR c.name = "Steakhouse"
Return c, rest, o

8. Write a query to return the shortest path based only on reviews between members Abraham Greenberg and Herb Christopher.

Match p = shortestPath((abe :Member {fname: "Abraham", lname: "Greenberg"}) -[:REVIEWED*]- (herb :Member {fname: "Herb", lname: "Christopher"}))
Return p
Chapter 1 Database Systems
Chapter 1 Database Systems

Discussion Focus

How often have your students heard that "you have only one chance to make a good first impression"? That's why it's so important to sell the importance of databases and the desirability of good database design during the first class session.

Start by showing your students that they interact with databases on a daily basis. For example, how many of them have bought anything using a credit card during the past day, week, month, or year? None of those transactions would be possible without a database. How many have shipped a document or a package via an overnight service or via certified or registered mail? How many have checked course catalogs and class schedules online? And surely all of your students registered for your class? Did anybody use a web search engine to look for – and find – information about almost anything? This point is easy to make: databases are important because we depend on their existence to perform countless transactions and to provide information. If you are teaching in a classroom equipped with computers, give some "live" performances. For example, you can use the web to look up a few insurance quotes or compare car prices and models. Incidentally, this is a good place to make the very important distinction between data and information. In short, spend some time discussing the points made in Section 1.1, "Why Databases?" and Section 1.2, "Data vs. Information."

After demonstrating that modern daily life is almost inconceivable without the ever-present databases, discuss how important it is that the (database) transactions are made successfully, accurately, and quickly. That part of the discussion points to the importance of database design, which is at the heart of this book. If you want to have the keys to the information kingdom, you'll want to know about database design and implementation.
And, of course, databases don’t manage themselves … and that point leads to the importance of the database administration (DBA) function. There is a world of exciting database employment opportunities out there. After discussing why databases, database design, and database administration are important, you can move through the remainder of the chapter to develop the necessary vocabulary and concepts. The review questions help you do that … and the problems provide the chance to test the newfound knowledge.
Answers to Review Questions

1. Define each of the following terms:

a. data
Raw facts from which the required information is derived. Data have little meaning unless they are grouped in a logical manner.

b. field
A character or a group of characters (numeric or alphanumeric) that describes a specific characteristic. A field may define a telephone number, a date, or other specific characteristics that the end user wants to keep track of.

c. record
A logically connected set of one or more fields that describes a person, place, event, or thing. For example, a CUSTOMER record may be composed of the fields CUST_NUMBER, CUST_LNAME, CUST_FNAME, CUST_INITIAL, CUST_ADDRESS, CUST_CITY, CUST_STATE, CUST_ZIPCODE, CUST_AREACODE, and CUST_PHONE.

d. file
Historically, a collection of file folders, properly tagged and kept in a filing cabinet. Although such manual files still exist, we more commonly think of a (computer) file as a collection of related records that contain information of interest to the end user. For example, a sales organization is likely to keep a file containing customer data. Keep in mind that the phrase related records reflects a relationship based on function. For example, customer data are kept in a file named CUSTOMER. The records in this customer file are related by the fact that they all pertain to customers. Similarly, a file named PRODUCT would contain records that describe products – the records in this file are all related by the fact that they all pertain to products. You would not expect to find customer data in a product file, or vice versa.
NOTE Field, record, and file are computer terms, created to help describe how data are stored in secondary memory. Emphasize that computer file data storage does not match the human perception of such data storage.
2. What is data redundancy, and which characteristics of the file system can lead to it?

Data redundancy exists when unnecessarily duplicated data are found in the database. For example, a customer's telephone number may be found in the customer file, in the sales agent file, and in the invoice file. Data redundancy is symptomatic of a (computer) file system, given its inability to represent and manage data relationships. Data redundancy may also be the result of poorly designed databases that allow the same data to be kept in different locations. (Here's another opportunity to emphasize the need for good database design!)

3. What is data independence, and why is it lacking in file systems?

Data independence is a condition in which the programs that access data are not dependent on the data storage characteristics of the data. Systems that lack data independence are said to exhibit data dependence. File systems exhibit data dependence because file access is dependent on a file's data characteristics. Therefore, any time the file data characteristics are changed, the programs that access the data within those files must be modified. Data independence exists when changes in the data characteristics do not require changes in the programs that access those data. File systems lack data independence because all data access programs are subject to change when any of the file system's data storage characteristics – such as a data type – change.

4. What is a DBMS, and what are its functions?

A DBMS is best described as a collection of programs that manage the database structure and that control shared access to the data in the database. Current DBMSes also store the relationships between the database components; they also take care of defining the required access paths to those components.
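The update anomaly behind question 2's redundancy discussion can be demonstrated with a short sketch; the "files", names, and phone numbers below are illustrative assumptions, not data from the text.

```python
# A sketch of the update anomaly caused by data redundancy: the same
# customer phone number is stored in two separate "files" (lists of dicts).
customer_file = [{"CUST_NAME": "Holly Parker", "CUST_PHONE": "615-898-2368"}]
invoice_file = [{"CUST_NAME": "Holly Parker", "CUST_PHONE": "615-898-2368",
                 "INV_NUMBER": "1001"}]

# The phone number changes, but only one file is updated...
customer_file[0]["CUST_PHONE"] = "615-898-9909"

# ...so the two redundant copies now disagree: a data inconsistency.
inconsistent = customer_file[0]["CUST_PHONE"] != invoice_file[0]["CUST_PHONE"]
```

The point for students: with redundant storage, every update must be made everywhere, correctly, or the data become inconsistent.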
The functions of a current-generation DBMS may be summarized as follows:
• The DBMS stores the definitions of data and their relationships (metadata) in a data dictionary; any changes made are automatically recorded in the data dictionary.
• The DBMS creates the complex structures required for data storage.
• The DBMS transforms entered data to conform to those data structures.
• The DBMS creates a security system and enforces security within that system.
• The DBMS creates complex structures that allow multiple-user access to the data.
• The DBMS performs backup and data recovery procedures to ensure data safety.
• The DBMS promotes and enforces integrity rules to eliminate data integrity problems.
• The DBMS provides access to the data via utility programs and programming language interfaces.
• The DBMS provides end-user access to data within a computer network environment.

5. What is structural independence, and why is it important?

Structural independence exists when data access programs are not subject to change when the file's structural characteristics, such as the number or order of the columns in a table, change. Structural independence is important because it substantially decreases programming effort and program maintenance costs.
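Structural independence can be shown live in class. The sketch below uses Python's standard sqlite3 module (table and column names are illustrative): because the program selects columns by name, a structural change, adding a column, does not break the access code.

```python
import sqlite3

# A sketch of structural independence using SQLite. The access function
# refers to columns by name, not by position, so it survives a structural
# change to the table. Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_number TEXT, cust_lname TEXT)")
conn.execute("INSERT INTO customer VALUES ('10010', 'Ramas')")

def last_names(c):
    # Accesses data by column name only -- no positional dependence.
    return [row[0] for row in c.execute("SELECT cust_lname FROM customer")]

before = last_names(conn)

# Structural change: a new column is added to the table.
conn.execute("ALTER TABLE customer ADD COLUMN cust_phone TEXT")

after = last_names(conn)  # the same program code still works unchanged
```

Contrast this with a file-system program that reads fixed record layouts: there, adding a field would force every access program to be rewritten.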
6. Explain the differences between data, information, and a database.

Data are raw facts. Information is the result of processing data to reveal the meaning behind the facts. Let's summarize some key points:
• Data constitute the building blocks of information.
• Information is produced by processing data.
• Information is used to reveal the meaning of data.
• Good, relevant, and timely information is the key to good decision making.
• Good decision making is the key to organizational survival in a global environment.
A database is a computer structure for storing data in a shared, integrated fashion so that the data can be transformed into information as needed.

7. What is the role of a DBMS, and what are its advantages? What are its disadvantages?

A database management system (DBMS) is a collection of programs that manages the database structure and controls access to the data stored in the database. Figure 1.2 (shown in the text) illustrates that the DBMS serves as the intermediary between the user and the database. The DBMS receives all application requests and translates them into the complex operations required to fulfill those requests. The DBMS hides much of the database's internal complexity from the application programs and users. The application program might be written by a programmer using a programming language such as COBOL, Visual Basic, or C++, or it might be created through a DBMS utility program. Having a DBMS between the end user's applications and the database offers some important advantages. First, the DBMS enables the data in the database to be shared among multiple applications or users. Second, the DBMS integrates the many different users' views of the data into a single all-encompassing data repository. Because data are the crucial raw material from which information is derived, you must have a good way of managing such data. As you will discover in this book, the DBMS helps make data management more efficient and effective.
In particular, a DBMS provides advantages such as:
• Improved data sharing. The DBMS helps create an environment in which end users have better access to more and better-managed data. Such access makes it possible for end users to respond quickly to changes in their environment.
• Better data integration. Wider access to well-managed data promotes an integrated view of the organization's operations and a clearer view of the big picture. It becomes much easier to see how actions in one segment of the company affect other segments.
• Minimized data inconsistency. Data inconsistency exists when different versions of the same data appear in different places. For example, data inconsistency exists when a company's sales department stores a sales representative's name as "Bill Brown" and the company's personnel department stores that same person's name as "William G. Brown" or when the company's regional sales office shows the price of product "X" as $45.95 and its national sales office shows the same product's price as $43.95. The probability of data inconsistency is greatly reduced in a properly designed database.
• Improved data access. The DBMS makes it possible to produce quick answers to ad hoc queries. From a database perspective, a query is a specific request for data manipulation (for
example, to read or update the data) issued to the DBMS. Simply put, a query is a question and an ad hoc query is a spur-of-the-moment question. The DBMS sends back an answer (called the query result set) to the application. For example, end users, when dealing with large amounts of sales data, might want quick answers to questions (ad hoc queries) such as:
➢ What was the dollar volume of sales by product during the past six months?
➢ What is the sales bonus figure for each of our salespeople during the past three months?
➢ How many of our customers have credit balances of $3,000 or more?
• Improved decision making. Better-managed data and improved data access make it possible to generate better quality information, on which better decisions are based.
• Increased end-user productivity. The availability of data, combined with the tools that transform data into usable information, empowers end users to make quick, informed decisions that can make the difference between success and failure in the global economy.
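One of the ad hoc queries listed above, how many customers have credit balances of $3,000 or more, can be run live in class. The sketch below uses Python's sqlite3 module with an illustrative customer table; the names and balances are assumptions, not data from the text.

```python
import sqlite3

# An ad hoc query answered by the DBMS: "How many of our customers have
# credit balances of $3,000 or more?" Sample table and values are
# illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_name TEXT, cust_balance REAL)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [("Ramas", 4512.00), ("Dunne", 890.25), ("Smith", 3000.00)])

# The DBMS returns the answer as a query result set.
result_set = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE cust_balance >= 3000").fetchone()
high_balance_count = result_set[0]
```

The spur-of-the-moment nature of the question is the point: no program had to be written in advance to answer it.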
The advantages of using a DBMS are not limited to the few just listed. In fact, you will discover many more advantages as you learn more about the technical details of databases and their proper design. Although the database system yields considerable advantages over previous data management approaches, database systems do carry significant disadvantages. For example:
• Increased costs. Database systems require sophisticated hardware and software and highly skilled personnel. The cost of maintaining the hardware, software, and personnel required to operate and manage a database system can be substantial. Training, licensing, and regulation compliance costs are often overlooked when database systems are implemented.
• Management complexity. Database systems interface with many different technologies and have a significant impact on a company's resources and culture. The changes introduced by the adoption of a database system must be properly managed to ensure that they help advance the company's objectives. Given the fact that database systems hold crucial company data that are accessed from multiple sources, security issues must be assessed constantly.
• Maintaining currency. To maximize the efficiency of the database system, you must keep your system current. Therefore, you must perform frequent updates and apply the latest patches and security measures to all components. Because database technology advances rapidly, personnel training costs tend to be significant.
• Vendor dependence. Given the heavy investment in technology and personnel training, companies might be reluctant to change database vendors. As a consequence, vendors are less likely to offer pricing point advantages to existing customers, and those customers might be limited in their choice of database system components.
• Frequent upgrade/replacement cycles. DBMS vendors frequently upgrade their products by adding new functionality.
Such new features often come bundled in new upgrade versions of the software. Some of these versions require hardware upgrades. Not only do the upgrades themselves cost money, but it also costs money to train database users and administrators to properly use and manage the new features.
8. List and describe the different types of databases.

The focus is on Section 1-3b, TYPES OF DATABASES. Organize the discussion around the number of users, database site location, and data use:
• Number of users
o Single-user
o Multiuser
o Workgroup
o Enterprise
• Database site location
o Centralized
o Distributed
o Cloud-based
• Type of data
o General-purpose
o Discipline-specific
• Database use
o Transactional (production) database (OLTP)
o Data warehouse database (OLAP)
• Degree of data structure
o Unstructured data
o Structured data
For a description of each type of database, please see Section 1-3b.

9. What are the main components of a database system?

The basis of this discussion is Section 1-7a, THE DATABASE SYSTEM ENVIRONMENT. Figure 1.10 provides a good bird's eye view of the components. Note that the system's components are hardware, software, people, procedures, and data.

10. What is metadata?

Metadata is data about data. That is, metadata define the data characteristics such as the data type (such as character or numeric) and the relationships that link the data. Relationships are an important component of database design. What makes relationships especially interesting is that they are often defined by their environment. For instance, the relationship between EMPLOYEE and JOB is likely to depend on the organization's definition of the work environment. For example, in some organizations an employee can have multiple job assignments, while in other organizations – or even in other divisions within the same organization – an employee can have only one job assignment. The details of relationship types and the roles played by those relationships in data models are defined and described in Chapter 2, "Data Models." Relationships will play a key role in subsequent chapters. You cannot effectively deal with database design issues unless you address relationships.

11. Explain why database design is important.
The focus is on Section 1-4, WHY DATABASE DESIGN IS IMPORTANT. Explain that modern database and applications development software is so easy to use that many people can quickly learn to implement a simple database and develop simple applications within a week or so, without giving design much thought. As data and reporting requirements become more complex, those same people will simply (and quickly!) produce the required add-ons. That's how data redundancies and all their attendant anomalies develop, thus reducing the "database" and its applications to a status worse than useless. Stress these points:
• Good applications can't overcome bad database designs.
• The existence of a DBMS does not guarantee good data management, nor does it ensure that the database will be able to generate correct and timely information.
• Ultimately, the end user and the designer decide what data will be stored in the database.
A database created without the benefit of a detailed blueprint is unlikely to be satisfactory. Pose this question: would you think it smart to build a house without the benefit of a blueprint? So why would you want to create a database without a blueprint? (Perhaps it would be OK to build a chicken coop without a blueprint, but would you want your house to be built the same way?)

12. What are the potential costs of implementing a database system?

Although the database system yields considerable advantages over previous data management approaches, database systems do impose significant costs. For example:
• Increased acquisition and operating costs. Database systems require sophisticated hardware and software and highly skilled personnel. The cost of maintaining the hardware, software, and personnel required to operate and manage a database system can be substantial.
• Management complexity. Database systems interface with many different technologies and have a significant impact on a company's resources and culture.
The changes introduced by the adoption of a database system must be properly managed to ensure that they help advance the company's objectives. Given the fact that database systems hold crucial company data that are accessed from multiple sources, security issues must be assessed constantly.
• Maintaining currency. To maximize the efficiency of the database system, you must keep your system current. Therefore, you must perform frequent updates and apply the latest patches and security measures to all components. Because database technology advances rapidly, personnel training costs tend to be significant.
• Vendor dependence. Given the heavy investment in technology and personnel training, companies may be reluctant to change database vendors. As a consequence, vendors are less likely to offer pricing point advantages to existing customers, and those customers may be limited in their choice of database system components.

13. Use examples to compare and contrast unstructured and structured data. Which type is more prevalent in a typical business environment?

Unstructured data are data that exist in their original (raw) state, that is, in the format in which they were collected. Therefore, unstructured data exist in a format that does not lend itself to the processing that yields information. Structured data are the result of taking unstructured data and formatting (structuring) such data to facilitate storage, use, and the generation of information. You apply structure (format) based on the type of processing that you intend to perform on the data.
Some data might not be ready (unstructured) for some types of processing, but they might be ready (structured) for other types of processing. For example, the data value 37890 might refer to a zip code, a sales value, or a product code. If this value represents a zip code or a product code and is stored as text, you cannot perform mathematical computations with it. On the other hand, if this value represents a sales transaction, it is necessary to format it as numeric. If invoices are stored as images for future retrieval and display, you can scan them and save them in a graphic format. On the other hand, if you want to derive information such as monthly totals and average sales, such graphic storage would not be useful. Instead, you could store the invoice data in a (structured) spreadsheet format so that you can perform the requisite computations. Based on sheer volume, most data is unstructured or semistructured. Data for conducting actual business transactions is usually structured.

14. What are some basic database functions that a spreadsheet cannot perform?

Spreadsheets do not support self-documentation through metadata, enforcement of data types or domains to ensure consistency of data within a column, defined relationships among tables, or constraints to ensure consistency of data across related tables.

15. What common problems do a collection of spreadsheets created by end users share with the typical file system?

A collection of spreadsheets shares several problems with the typical file system. The first problem is that end users create their own, private, copies of the data, which creates issues of data ownership. This situation also creates islands of information where changes to one set of data are not reflected in all of the copies of the data. This leads to the second problem – lack of data consistency.
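The 37890 example from question 13 can be shown directly: the same raw value supports different processing depending on the structure (data type) applied to it. This is a sketch, not textbook code.

```python
# The value 37890 means different things depending on the structure applied
# to it: as a ZIP code it is text (arithmetic is meaningless), while as a
# sales amount it must be numeric to support computation.
raw_value = "37890"

as_zip_code = raw_value       # structured as text for identification
as_sales = float(raw_value)   # structured as numeric for computation

doubled_sales = as_sales * 2      # valid numeric processing
repeated_text = as_zip_code * 2   # "multiplying" text merely repeats it
```

The last line is the teaching hook: the operation succeeds but produces nonsense, because the structure chosen did not match the intended processing.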
Because the data in various spreadsheets may be intended to represent a view of the business environment, a lack of consistency in the data may lead to faulty decision making based on inaccurate data.

16. Explain the significance of the loss of direct, hands-on access to business data that users experienced with the advent of computerized data repositories.

Users lost direct, hands-on access to the business data when computerized data repositories were developed because the IT skills necessary to directly access and manipulate the data were beyond the average user's abilities, and because security precautions restricted access to the shared data. This was significant because it removed users from the direct manipulation of data and introduced significant time delays for data access. When users need answers to business questions from the data, necessity often does not give them the luxury of time to wait days, weeks, or even months for the required reports. The desire to return hands-on access to the data to the users, among other drivers, helped to propel the development of database systems. While database systems have greatly improved the ability of users to directly access data, the need to quickly manipulate data for themselves has led to the problem of spreadsheets being used when databases are needed.

17. Explain why the cost of ownership may be lower with a cloud database than with a traditional, company database.

Cloud databases reside on the Internet instead of within the organization's own network infrastructure. This can reduce costs because the organization is not required to purchase and maintain the hardware and software necessary to house the database and support the necessary levels of system performance.
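Returning to question 10 above, the claim that a DBMS stores metadata in a data dictionary can also be demonstrated concretely. The sketch below uses SQLite as a stand-in for a full DBMS (the table and columns are illustrative): PRAGMA table_info reads back the column definitions the system recorded.

```python
import sqlite3

# Metadata is data about data: the DBMS records each column's name and
# declared data type. This sketch reads that metadata back from SQLite's
# catalog. Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee (
    emp_num   INTEGER,
    emp_lname TEXT)""")

# PRAGMA table_info returns one row of metadata per column:
# (cid, name, type, notnull, dflt_value, pk)
metadata = conn.execute("PRAGMA table_info(employee)").fetchall()
column_types = {row[1]: row[2] for row in metadata}
```

Note for class: the metadata was never inserted by the user; the DBMS recorded it automatically when the table was defined, which is exactly the self-documentation spreadsheets lack (question 14).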
Problem Solutions
ONLINE CONTENT The file structures you see in this problem set are simulated in a Microsoft Access database named Ch01_Problems, available at www.cengagebrain.com.
Given the file structure shown in Figure P1.1, answer Problems 1 - 4.
FIGURE P1.1 The File Structure for Problems 1-4
1. How many records does the file contain? How many fields are there per record?

The file contains seven records (21-5Z through 31-7P) and each of the records is composed of five fields (PROJECT_CODE through PROJECT_BID_PRICE).

2. What problem would you encounter if you wanted to produce a listing by city? How would you solve this problem by altering the file structure?

The city names are contained within the MANAGER_ADDRESS attribute and decomposing this character (string) field at the application level is cumbersome at best. (Queries become much more difficult to write and take longer to execute when internal string searches must be conducted.) If the ability to produce city listings is important, it is best to store the city name as a separate attribute.

3. If you wanted to produce a listing of the file contents by last name, area code, city, state, or zip code, how would you alter the file structure?

The more we divide the address into its component parts, the greater its information capabilities. For example, by dividing MANAGER_ADDRESS into its component parts (MGR_STREET, MGR_CITY, MGR_STATE, and MGR_ZIP), we gain the ability to easily select records on the basis of zip codes, city names, and states. Similarly, by subdividing the MANAGER name into its components MGR_LASTNAME, MGR_FIRSTNAME, and MGR_INITIAL, we gain the ability to produce more efficient searches and listings. For example, creating a phone directory is easy when you can sort by last name, first name, and initial. Finally, separating the area code and the phone number will yield the ability to efficiently group data by area codes. Thus MGR_PHONE might be
decomposed into MGR_AREA_CODE and MGR_PHONE. The more you decompose the data into their component parts, the greater the search flexibility. Data that are decomposed into their most basic components are said to be atomic.

4. What data redundancies do you detect? How could those redundancies lead to anomalies?

Note that the manager named Holly B. Parker occurs three times, indicating that she manages three projects coded 21-5Z, 25-9T, and 29-2D, respectively. (The occurrences indicate that there is a 1:M relationship between PROJECT and MANAGER: each project is managed by only one manager but, apparently, a manager may manage more than one project.) Ms. Parker's phone number and address also occur three times. If Ms. Parker moves and/or changes her phone number, these changes must be made more than once and they must all be made correctly... without missing a single occurrence. If any occurrence is missed during the change, the data are "different" for the same person. After some time, it may become difficult to determine what the correct data are. In addition, multiple occurrences invite misspellings and digit transpositions, thus producing the same anomalies. The same problems exist for the multiple occurrences of George F. Dorts.

5. Identify and discuss the serious data redundancy problems exhibited by the file structure shown in Figure P1.5.
FIGURE P1.5 The File Structure for Problems 5-8
NOTE It is not too early to begin discussing proper structure. For example, you may focus student attention on the fact that, ideally, each row should represent a single entity. Therefore, each row's fields should define the characteristics of one entity, rather than include characteristics of several entities. The file structure shown here includes characteristics of multiple entities. For example, the JOB_CODE is likely to be a characteristic of a JOB entity. PROJ_NUM and PROJ_NAME are clearly characteristics of a PROJECT entity. Also, since (apparently) each project has more than one employee assigned to it, the file structure shown here shows multiple occurrences for each of the projects. (Hurricane occurs three times, Coast occurs twice, and Satellite occurs four times.)
Given the file's poor structure, the stage is set for multiple anomalies. For example, if the charge for JOB_CODE = EE changes from $85.00 to $90.00, that change must be made twice. Also, if employee June H. Sattlemeier is deleted from the file, you also lose information about the existence of her JOB_CODE = EE, its hourly charge of $85.00, and the PROJ_HOURS = 17.5. The loss of the PROJ_HOURS value will ultimately mean that the Coast project costs are not being charged properly, thus causing a loss of PROJ_HOURS*JOB_CHG_HOUR = 17.5 x $85.00 = $1,487.50 to the company. Incidentally, note that the file contains different JOB_CHG_HOUR values for the same CT job code, thus illustrating the effect of changes in the hourly charge rate over time. The file structure appears to represent transactions that charge project hours to each project. However, the structure of this file makes it difficult to avoid update anomalies and it is not possible to determine whether a charge change is accurately reflected in each record. Ideally, a change in the hourly charge rate would be made in only one place and this change would then be passed on to the transaction based on the hourly charge. Such a structural change would ensure the historical accuracy of the transactions. You might want to emphasize that the recommended changes require a lot of work in a file system.

6. Looking at the EMP_NAME and EMP_PHONE contents in Figure P1.5, what change(s) would you recommend?

A good recommendation would be to make the data more atomic. That is, break up the data components whenever possible. For example, separate the EMP_NAME into its components EMP_FNAME, EMP_INITIAL, and EMP_LNAME. This change will make it much easier to organize employee data through the employee name component. Similarly, the EMP_PHONE data should be decomposed into EMP_AREACODE and EMP_PHONE.
For example, breaking up the phone number 653-234-3245 into the area code 653 and the phone number 234-3245 will make it much easier to organize the phone numbers by area code. (If you want to print an employee phone directory, the more atomic employee name data will make the job much easier.)
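The decomposition recommended in this answer can be sketched as follows; the employee names and phone numbers below are illustrative, not taken from Figure P1.5.

```python
# A sketch of decomposing EMP_NAME and EMP_PHONE into atomic components so
# that sorting and grouping become trivial. Sample values are illustrative.
employees = [
    {"EMP_NAME": "June H. Sattlemeier", "EMP_PHONE": "653-234-3245"},
    {"EMP_NAME": "Anne K. Ramoras", "EMP_PHONE": "615-898-2368"},
]

atomic = []
for emp in employees:
    first, initial, last = emp["EMP_NAME"].split()
    area_code, number = emp["EMP_PHONE"].split("-", 1)
    atomic.append({"EMP_FNAME": first, "EMP_INITIAL": initial,
                   "EMP_LNAME": last, "EMP_AREACODE": area_code,
                   "EMP_PHONE": number})

# A phone directory sorted by last name is now a one-line operation.
directory = sorted(atomic, key=lambda e: e["EMP_LNAME"])
```

The same point applies in reverse: with the composite EMP_NAME field, sorting by last name would require fragile string surgery in every program that needs it.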
7. Identify the various data sources in the file you examined in Problem 5.

Given their answers to Problem 5 and some additional scrutiny of Figure P1.5, your students should be able to identify these data sources:
• Employee data such as names and phone numbers.
• Project data such as project names. If you start with an EMPLOYEE file, the project names clearly do not belong in that file. (Project names are clearly not employee characteristics.)
• Job data such as the job charge per hour. If you start with an EMPLOYEE file, the job charge per hour clearly does not belong in that file. (Hourly charges are clearly not employee characteristics.)
• The project hours, which are most likely the hours worked by the employee for that project. (Such hours are associated with a work product, not the employee per se.)

8. Given your answer to Problem 7, what new files should you create to help eliminate the data redundancies found in the file shown in Figure P1.5?

The new files are probably PROJECT, EMPLOYEE, JOB, and CHARGE. The PROJECT file should contain project characteristics such as the project name, the project manager/coordinator, the project budget, and so on. The EMPLOYEE file might contain the employee names, phone number, address, and so on. The JOB file would contain the billing charge per hour for each of the job types – a database designer, an applications developer, and an accountant would generate different billing charges per hour. The CHARGE file would be used to keep track of the number of hours by job type that will be billed for each employee who worked on the project.

9. Identify and discuss the serious data redundancy problems exhibited by the file structure shown in Figure P1.9. (The file is meant to be used as a teacher class assignment schedule.
One of the many problems with data redundancy is the likely occurrence of data inconsistencies; note, for example, that two different initials have been entered for the teacher named Maria Cordoza.)
FIGURE P1.9 The File Structure for Problems 9-10
Note that the teacher characteristics occur multiple times in this file. For example, the teacher named Maria Cordoza's first name, last name, and initial occur three times. If changes must be made for any given teacher, those changes must be made multiple times. All it takes is one incorrect entry or one forgotten change to create data inconsistencies. Redundant data are not a luxury you can afford in a data environment.
10. Given the file structure shown in Figure P1.9, what problem(s) might you encounter if building KOM were deleted? You would lose all the time assignment data about teachers Williston, Cordoza, and Hawkins, as well as the KOM rooms 204E, 123, and 34. Here is yet another good reason for keeping data about specific entities in their own tables! This kind of an anomaly is known as a deletion anomaly. 11. Using your school’s student information system, print your class schedule. The schedule probably would contain the student identification number, student name, class code, class name, class credit hours, class instructor name, the class meeting days and times, and the class room number. Use Figure P1.11 as a template to complete the following actions.
a. Create a spreadsheet using the template shown in Figure P1.11 and enter your current class schedule.
b. Enter the class schedule of two of your classmates into the same spreadsheet.
c. Discuss the redundancies and anomalies caused by this design.

This could be a good "mini-group" problem – groups of 3 students max. Ask them to create their individual class schedules in separate spreadsheets and then a single spreadsheet containing all their class schedules. This exercise should encourage group discussion, help the students discover data anomalies, and let them brainstorm better ways to store the class schedule data. Students are likely to use MS Excel or Google Sheets to create a simple tabular spreadsheet containing the data outlined in Figure P1.11. The rows of the spreadsheet(s) will represent each one of the classes they are taking. Students are likely to identify the redundancies around the class information since all three schedules (the student's own schedule plus the schedules of the two classmates) will have at least the database class in common. This easily leads to discussions of separating the data into at least two tables in a database. However, that still leaves the redundancy of repeating student data with each class being taken. Astute students might realize that this is analogous to the Employee Skill Certifications shown in Figures 1.4 and 1.5, such that a table for student data, a table for class data, and a table to relate the students and classes is appropriate.
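The three-table design that astute students arrive at can be prototyped quickly. The sketch below (table, column, and sample names are illustrative) uses Python's sqlite3 module to show that the class name is stored exactly once, no matter how many students enroll.

```python
import sqlite3

# A sketch of the student/class/enrollment design: STUDENT and CLASS hold
# each fact once, and ENROLLMENT relates them, eliminating the redundancy
# of the single-spreadsheet layout. All names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student    (stu_id TEXT PRIMARY KEY, stu_name TEXT);
CREATE TABLE class      (class_code TEXT PRIMARY KEY, class_name TEXT);
CREATE TABLE enrollment (stu_id TEXT, class_code TEXT);
""")
conn.execute("INSERT INTO student VALUES ('S1', 'Pat'), ('S2', 'Lee')")
conn.execute("INSERT INTO class VALUES ('DB101', 'Database Systems')")
# Both students take the database class; its name is stored only once.
conn.execute("INSERT INTO enrollment VALUES ('S1', 'DB101'), ('S2', 'DB101')")

roster = conn.execute("""
    SELECT s.stu_name
    FROM student s JOIN enrollment e ON s.stu_id = e.stu_id
    WHERE e.class_code = 'DB101'
    ORDER BY s.stu_name""").fetchall()
```

Renaming the class now means changing one row in CLASS, while the spreadsheet version would require editing every student's copy of the class name.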
Chapter 2 Data Models
Discussion Focus

Although all of the topics covered in this chapter are important, our students have given us consistent feedback: If you can write precise business rules from a description of operations, database design is not that difficult. Therefore, once data modeling (Section 2-1, "Data Modeling and Data Models," Section 2-2, "The Importance of Data Models," and Section 2-3, "Data Model Basic Building Blocks") has been examined in detail, Section 2-4, "Business Rules," should receive a lot of class time and attention. Perhaps it is useful to argue that the answers to questions 2 and 3 in the Review Questions section are the key to successful design. That's why we have found it particularly important to focus on business rules and their impact on the database design process.

What are business rules, what is their source, and why are they crucial?

Business rules are precisely written and unambiguous statements that are derived from a detailed description of an organization's operations. When written properly, business rules define one or more of the following modeling components:
• entities
• relationships
• attributes
• connectivities
• cardinalities – these will be examined in detail in Chapter 3, "The Relational Database Model." Basically, the cardinalities yield the minimum and maximum number of entity occurrences in an entity. For example, the relationship described by "a professor teaches one or more classes" means that the PROFESSOR entity is referenced at least once – and possibly many times – in the CLASS entity.
• constraints
Because the business rules form the basis of the data modeling process, their precise statement is crucial to the success of the database design. And, because the business rules are derived from a precise description of operations, much of the design's success depends on the accuracy of the description of operations. Examples of business rules are:
• An invoice contains one or more invoice lines.
• Each invoice line is associated with a single invoice.
• A store employs many employees.
• Each employee is employed by only one store.
• A college has many departments.
• Each department belongs to a single college. (This business rule reflects a university that has multiple colleges such as Business, Liberal Arts, Education, Engineering, etc.)
• A driver may be assigned to drive many different vehicles. Each vehicle can be driven by many drivers. (Note: Keep in mind that this business rule reflects the assignment of drivers during some period of time.)
• A client may sign many contracts. Each contract is signed by only one client.
• A sales representative may write many contracts. Each contract is written by one sales representative.
Note that each relationship definition requires the definition of two business rules. For example, the relationship between the INVOICE and (invoice) LINE entities is defined by the first two business rules in the bulleted list. This two-way requirement exists because there is always a two-way relationship between any two related entities. (This two-way relationship description also reflects the implementation by many of the available database design tools.) Keep in mind that the ER diagrams cannot always reflect all of the business rules. For example, examine the following business rule: A customer cannot be given a credit line over $10,000 unless that customer has maintained a satisfactory credit history (as determined by the credit manager) during the past two years. This business rule describes a constraint that cannot be shown in the ER diagram. The business rule reflected in this constraint would be handled at the applications software level through the use of a trigger or a stored procedure. (Your students will learn about triggers and stored procedures in Chapter 8, "Advanced SQL.") Given their importance to successful design, we cannot overstate the importance of business rules and their derivation from a properly written description of operations. It is not too early to start asking students to write business rules for simple descriptions of operations. Begin by using familiar operational scenarios, such as buying a book at the book store, registering for a class, paying a parking ticket, or renting a DVD. Also, try reversing the process: Give the students a chance to write the business rules from a basic data model such as the one represented by the text's Figures 2.1 and 2.2.
Ask your students to write the business rules that are the foundation of the relational diagram in Figure 2.2 and then point their attention to the relational tables in Figure 2.1 to indicate that an AGENT occurrence can occur multiple times in the CUSTOMER entity, thus illustrating the implementation impact of the business rules: "An agent can serve many customers. Each customer is served by one agent."
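The $10,000 credit-line constraint discussed above cannot be drawn in an ERD; it must be enforced in code. Below is a minimal Python sketch of such an application-level check. The function name and the satisfactory-history flag are hypothetical stand-ins for what a trigger or stored procedure would actually examine inside the DBMS.

```python
# Sketch of an application-level check for the credit-line business rule.
# In a production system this logic would live in a trigger or stored
# procedure; the names used here are illustrative only.

CREDIT_LIMIT_THRESHOLD = 10_000

def may_raise_credit_line(requested_limit: float,
                          has_satisfactory_history: bool) -> bool:
    """Return True if the requested credit line is allowed by the rule."""
    if requested_limit <= CREDIT_LIMIT_THRESHOLD:
        return True
    # Above $10,000: allowed only with a satisfactory two-year credit
    # history, as determined by the credit manager.
    return has_satisfactory_history
```

This makes a useful classroom point: the ERD captures structure, while constraints like this one are enforced procedurally.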
Answers to Review Questions 1. Discuss the importance of data modeling. A data model is a relatively simple representation, usually graphical, of a more complex real-world object or event. The data model's main function is to help us understand the complexities of the real-world environment. The database designer uses data models to facilitate the interaction among designers, application programmers, and end users. In short, a good data model is a communications device that helps eliminate (or at least substantially reduce) discrepancies between the database design's components and the real-world data environment. The development of data models, bolstered by powerful database design tools, has made it possible to substantially diminish the database design error potential. (Review Sections 2.1 and 2.2 in detail.) 2. What is a business rule, and what is its purpose in data modeling? A business rule is a brief, precise, and unambiguous description of a policy, procedure, or principle within a specific organization's environment. In a sense, business rules are misnamed: they apply to any organization -- a business, a government unit, a religious group, or a research laboratory; large or small -- that stores and uses data to generate information. Business rules are derived from a description of operations. As its name implies, a description of operations is a detailed narrative that describes the operational environment of an organization. Such a description requires great precision and detail. If the description of operations is incorrect or incomplete, the business rules derived from it will not reflect the real-world data environment accurately, thus leading to poorly defined data models, which lead to poor database designs. In turn, poor database designs lead to poor applications, thus setting the stage for poor decision making -- which may ultimately lead to the demise of the organization.
Note especially that business rules help to create and enforce actions within that organization’s environment. Business rules must be rendered in writing and updated to reflect any change in the organization’s operational environment. Properly written business rules are used to define entities, attributes, relationships, and constraints. Because these components form the basis for a database design, the careful derivation and definition of business rules is crucial to good database design. 3. How do you translate business rules into data model components? As a general rule, a noun in a business rule will translate into an entity in the model, and a verb (active or passive) associating nouns will translate into a relationship among the entities. For example, the business rule “a customer may generate many invoices” contains two nouns (customer and invoice) and a verb (“generate”) that associates them.
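The noun/verb translation rule described above can be illustrated with a deliberately naive sketch. The pattern below only handles one sentence shape ("a <noun> may <verb> many <noun>s" -- an assumption of this example); real business rules require a human designer, not string matching.

```python
# Toy illustration of the noun-to-entity, verb-to-relationship rule.
# The regular expression assumes one rigid sentence form; it is a
# classroom illustration, not a real rule parser.
import re

def translate_rule(rule: str):
    """Extract (entity1, relationship, entity2) from a simple rule."""
    m = re.match(r"a (\w+) may (\w+) many (\w+)s", rule.lower())
    if not m:
        return None
    entity1, verb, entity2 = m.groups()
    # Nouns become entities (conventionally written in capitals);
    # the verb becomes the relationship between them.
    return entity1.upper(), verb, entity2.upper()

components = translate_rule("A customer may generate many invoices")
```

Here the nouns customer and invoice become the CUSTOMER and INVOICE entities, and the verb "generate" names the relationship, exactly as in the answer above.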
4. Describe the basic features of the relational data model and discuss their importance to the end user and the designer. A relational database is a single data repository that provides both structural and data independence while maintaining conceptual simplicity. The relational database model is perceived by the user to be a collection of tables in which data are stored. Each table resembles a matrix composed of row and columns. Tables are related to each other by sharing a common value in one of their columns. The relational model represents a breakthrough for users and designers because it lets them operate in a simpler conceptual environment. End users find it easier to visualize their data as a collection of data organized as a matrix. Designers find it easier to deal with conceptual data representation, freeing them from the complexities associated with physical data representation. 5. Explain how the entity relationship (ER) model helped produce a more structured relational database design environment. An entity relationship model, also known as an ERM, helps identify the database's main entities and their relationships. Because the ERM components are graphically represented, their role is more easily understood. Using the ER diagram, it’s easy to map the ERM to the relational database model’s tables and attributes. This mapping process uses a series of well-defined steps to generate all the required database structures. (This structures mapping approach is augmented by a process known as normalization, which is covered in detail in Chapter 6 “Normalization of Database Tables.”) 6. Consider the scenario described by the statement “A customer can make many payments, but each payment is made by only one customer” as the basis for an entity relationship diagram (ERD) representation. This scenario yields the ERDs shown in Figure Q2.6. (Note the use of the PowerPoint Crow’s Foot template. 
We will start using the Visio Professional-generated Crow’s Foot ERDs in Chapter 3, but you can, of course, continue to use the template if you do not have access to Visio Professional.)
Figure Q2.6 The Chen and Crow’s Foot ERDs for Question 6
Chen model: CUSTOMER (1) makes (M) PAYMENT
Crow's Foot model: CUSTOMER makes PAYMENT
NOTE Remind your students again that we have not (yet) illustrated the effect of optional relationships on the ERD’s presentation. Optional relationships and their treatment are covered in detail in Chapter 4, “Entity Relationship (ER) Modeling.”
7. Why is an object said to have greater semantic content than an entity? An object has greater semantic content because it embodies both data and behavior. That is, in addition to data, the object also contains a description of the operations that may be performed on the object. 8. What is the difference between an object and a class in the object oriented data model (OODM)? An object is an instance of a specific class. It is useful to point out that the object is a run-time concept, while the class is a more static description. Objects that share similar characteristics are grouped in classes. A class is a collection of similar objects with shared structure (attributes) and behavior (methods). Therefore, a class resembles an entity set. However, a class also includes a set of procedures known as methods.
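The class/object distinction can be shown with a short Python sketch. The Invoice class and its attributes below are illustrative, not taken from the text.

```python
# A class is the static description (shared structure and methods);
# each object is a run-time instance of that class.

class Invoice:
    """Shared structure (attributes) and behavior (methods)."""
    def __init__(self, inv_number: int, inv_total: float):
        self.inv_number = inv_number   # attribute (structure)
        self.inv_total = inv_total     # attribute (structure)

    def apply_discount(self, pct: float) -> float:
        """Behavior: a method shared by every Invoice object."""
        self.inv_total *= (1 - pct)
        return self.inv_total

# Two distinct objects (instances) of the same class:
inv_a = Invoice(1001, 200.0)
inv_b = Invoice(1002, 350.0)
```

The class is written once; each object carries its own attribute values while sharing the class's methods, which is exactly the "shared structure and behavior" point in the answer above.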
9. How would you model Question 6 with an OODM? (Use Figure 2.4 as your guide.) The OODM that corresponds to Question 6's ERD is shown in Figure Q2.9:
Figure Q2.9 The OODM Model for Question 9: CUSTOMER (M) PAYMENT
10. What is an ERDM, and what role does it play in the modern (production) database environment? The Extended Relational Data Model (ERDM) is the relational data model's response to the Object Oriented Data Model (OODM). Most current RDBMSes support at least a few of the ERDM's extensions. For example, support for binary large objects (BLOBs) is now common. Although the "ERDM" label has frequently been used in the database literature to describe the relational database model's response to the OODM's challenges, C. J. Date objects to the ERDM label for the following reasons:1
• The useful contribution of "the object model" is its ability to let users define their own -- and often very complex -- data types. However, mathematical structures known as "domains" in the relational model also provide this ability. Therefore, a relational DBMS that properly supports such domains greatly diminishes the reason for using the object model. Given proper support for domains, relational database models are quite capable of handling the complex data encountered in time series, engineering design, office automation, financial modeling, and so on. Because the relational model can support complex data types, the notion of an "extended relational database model" or ERDM is "extremely inappropriate and inaccurate" and "it should be firmly resisted." (The capability that is supposedly being extended is already there!)
• Even the label object/relational model (O/RDM) is not quite accurate, because the relational database model's domain is not an object model structure. However, there are already quite a few O/R products -- also known as Universal Database Servers -- on the market. Therefore, Date concedes that we are probably stuck with the O/R label. In fact, Date believes that "an O/R system is in everyone's future."
More precisely, Date argues that a true O/R system would be "nothing more nor less than a true relational system -- which is to say, a system that supports the relational model, with all that such support entails."
1 C. J. Date, "Back To the Relational Future," http://www.dbpd.com/vault/9808date.html
C. J. Date concludes his discussion by observing that "We need do nothing to the relational model to achieve object functionality. (Nothing, that is, except implement it, something that doesn't yet seem to have been tried in the commercial world.)" 11. What is a relationship, and what three types of relationships exist? A relationship is an association among (two or more) entities. Three types of relationships exist: one-to-one (1:1), one-to-many (1:M), and many-to-many (M:N or M:M). 12. Give an example of each of the three types of relationships. 1:1 An academic department is chaired by one professor; a professor may chair only one academic department. 1:M A customer may generate many invoices; each invoice is generated by one customer. M:N An employee may have earned many degrees; a degree may have been earned by many employees. 13. What is a table, and what role does it play in the relational model? Strictly speaking, the relational data model bases data storage on relations. These relations are based on algebraic set theory. However, the user perceives the relations to be tables. In the relational database environment, designers and users perceive a table to be a matrix consisting of a series of row/column intersections. Tables, also called relations, are related to each other by sharing a common entity characteristic. For example, an INVOICE table would contain a customer number that points to that same number in the CUSTOMER table. This feature enables the RDBMS to link invoices to the customers who generated them. Tables are especially useful from the modeling and implementation perspectives. Because tables are used to describe the entities they represent, they provide an easy way to summarize entity characteristics and relationships among entities. And, because they are purely conceptual constructs, the designer does not need to be concerned about the physical implementation aspects of the database design.
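The idea that tables are linked by a shared column value can be sketched with two in-memory "tables." The column names and values below are illustrative; they merely mimic the CUSTOMER-to-INVOICE link described above.

```python
# Two "tables" modeled as lists of dicts. The shared CUS_CODE column is
# what relates a customer to his or her invoices. Values are illustrative.

customers = [
    {"cus_code": 10010, "cus_lname": "Ramas"},
    {"cus_code": 10011, "cus_lname": "Dunne"},
]
invoices = [
    {"inv_number": 1001, "cus_code": 10010},
    {"inv_number": 1002, "cus_code": 10011},
    {"inv_number": 1003, "cus_code": 10010},
]

def invoices_for(cus_code: int):
    """Follow the shared CUS_CODE value from CUSTOMER into INVOICE."""
    return [inv["inv_number"] for inv in invoices
            if inv["cus_code"] == cus_code]
```

The lookup is a tiny hand-rolled join: the RDBMS performs exactly this matching, but declaratively and at scale.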
14. What is a relational diagram? Give an example. A relational diagram is a visual representation of the relational database's entities, the attributes within those entities, and the relationships between those entities. Therefore, it is easy to see what the entities represent and to see what types of relationships (1:1, 1:M, M:N) exist among the entities and how those relationships are implemented. An example of a relational diagram is found in the text's Figure 2.2. The "Relationships" option (under Database Tools on the main MS Access menu) can be used to illustrate simple relational diagrams. 15. What is connectivity? (Use a Crow's Foot ERD to illustrate connectivity.) Connectivity is the relational term to describe the types of relationships (1:1, 1:M, M:N).
In the figure, the business rule that an advisor can advise many students and a student has only one assigned advisor is shown in a relationship with a connectivity of 1:M. The business rule that a student can register only one vehicle to park on campus and a vehicle can be registered by only one student is shown in a relationship with a connectivity of 1:1. Finally, the rule that a student can register for many classes, and a class can be registered for by many students, is shown by the relationship with a connectivity of M:N. 16. Describe the Big Data phenomenon. Over the last few years, a new wave of data has emerged into the limelight. Such data have always existed but did not receive the attention they are receiving today. These data are characterized by being high volume (petabyte size and beyond), high frequency (data are generated almost constantly), and mostly semi-structured. These data come from multiple and varied sources such as web site logs, posts on social media sites, and machine-generated information (GPS, sensors, etc.). Such data have accumulated over the years, and companies are now awakening to the fact that they contain a lot of hidden information that could help the day-to-day business (such as browsing patterns, purchasing preferences, behavior patterns, etc.). The need to manage and leverage this data
has triggered a phenomenon labeled "Big Data". Big Data refers to a movement to find new and better ways to manage large amounts of web-generated data and derive business insight from it, while, at the same time, providing high performance and scalability at a reasonable cost. 17. What does the term "3 Vs" refer to? The term "3 Vs" refers to the three basic characteristics of Big Data databases:
• Volume: Refers to the amounts of data being stored. With the adoption and growth of the Internet and social media, companies have multiplied the ways to reach customers. Over the years, and with the benefit of technological advances, data for millions of e-transactions were being stored daily on company databases. Furthermore, organizations are using multiple technologies to interact with end users, and those technologies are generating mountains of data. This ever-growing volume of data quickly reached petabytes in size and it's still growing.
• Velocity: Refers not only to the speed with which data grows but also to the need to process these data quickly in order to generate information and insight. With the advent of the Internet and social media, business response times have shrunk considerably. Organizations need not only to store large volumes of quickly accumulating data, but also to process such data quickly. The velocity of data growth is also due to the increase in the number of different data streams from which data is being piped to the organization (via the web, e-commerce, Tweets, Facebook posts, emails, sensors, GPS, and so on).
• Variety: Refers to the fact that the data being collected comes in multiple different data formats. A great portion of these data comes in formats not suitable to be handled by the typical operational databases based on the relational model.
The 3 Vs framework illustrates what companies now know, that the amount of data being collected in their databases has been growing exponentially in size and complexity. Traditional relational databases are good at managing structured data but are not well suited to managing and processing the amounts and types of data being collected in today's business environment. 18. What is Hadoop, and what are its basic components? In order to create value from their previously unused Big Data stores, companies are using new Big Data technologies. These emerging technologies allow organizations to process massive data stores of multiple formats in cost-effective ways. Some of the most frequently used Big Data technologies are Hadoop and MapReduce.
• Hadoop is a Java-based, open source, high speed, fault-tolerant distributed storage and computational framework. Hadoop uses low-cost hardware to create clusters of thousands of computer nodes to store and process data. Hadoop originated from Google's work on distributed file systems and parallel processing and is currently supported by the Apache Software Foundation.2 Hadoop has several modules, but the two main components are the Hadoop Distributed File System (HDFS) and MapReduce.
• Hadoop Distributed File System (HDFS) is a highly distributed, fault-tolerant file storage system designed to manage large amounts of data at high speeds. In order to achieve high throughput, HDFS uses the write-once, read-many model. This means that once the data are written, they cannot be modified. HDFS uses three types of nodes: a name node that stores all the metadata about the file system; a data node that stores fixed-size data blocks (which could be replicated to other data nodes); and a client node that acts as the interface between the user application and the HDFS.
• MapReduce is an open source application programming interface (API) that provides fast data analytics services. MapReduce distributes the processing of the data among thousands of nodes in parallel. MapReduce works with structured and nonstructured data. The MapReduce framework provides two main functions, Map and Reduce. In general terms, the Map function takes a job and divides it into smaller units of work; the Reduce function collects all the output results generated from the nodes and integrates them into a single result set.

2 For more information about Hadoop, visit hadoop.apache.org.
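The Map and Reduce functions just described can be sketched in a single process. Real MapReduce distributes this work across thousands of nodes; the word-count job below is the customary classroom illustration, not an example from the text.

```python
# Single-process sketch of the two MapReduce phases.
# Map: split each input into (key, 1) pairs.
# Reduce: collect all pair outputs and integrate them per key.
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    pairs = []
    for doc in documents:
        for word in doc.split():
            pairs.append((word, 1))
    return pairs

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

word_counts = reduce_phase(map_phase(["big data", "big fast data"]))
```

In Hadoop, the pairs emitted by many mappers are shuffled across the network so that all pairs with the same key reach the same reducer; here that shuffle is implicit in the single dictionary.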
19. Define and describe the basic characteristics of a NoSQL database. Every time you search for a product on Amazon, send messages to friends on Facebook, watch a video on YouTube, or search for directions in Google Maps, you are using a NoSQL database. NoSQL refers to a new generation of databases that address the very specific challenges of the "big data" era and have the following general characteristics:
• Not based on the relational model. These databases are generally based on a variation of the key-value data model rather than on the relational model, hence the NoSQL name. The key-value data model is based on a structure composed of two data elements: a key and a value, in which for every key there is a corresponding value (or a set of values). The key-value data model is also referred to as the attribute-value or associative data model. In the key-value data model, each row represents one attribute of one entity instance. The "key" column points to an attribute, and the "value" column contains the actual value for the attribute. The data type of the "value" column is generally a long string to accommodate the variety of actual data types of the values that are placed in the column.
• Support distributed database architectures. One of the big advantages of NoSQL databases is that they generally use a distributed architecture. In fact, several of them (Cassandra, Big Table) are designed to use low-cost commodity servers to form a complex network of distributed database nodes.
• Provide high scalability, high availability, and fault tolerance. NoSQL databases are designed to support the ability to add capacity (add database nodes to the distributed database) when the demand is high, and to do it transparently and without downtime. Fault tolerance means that if one of the nodes in the distributed database fails, the database will keep operating as normal.
• Support very large amounts of sparse data. Because NoSQL databases use the key-value data model, they are suited to handle very high volumes of sparse data; that is, cases where the number of attributes is very large but the number of actual data instances is low.
• Geared toward performance rather than transaction consistency. One of the biggest problems of very large distributed databases is to enforce data consistency. Distributed databases automatically make copies of data elements at multiple nodes to ensure high availability and fault tolerance. If the node with the requested data goes down, the request
can be served from any other node with a copy of the data. However, what happens if the network goes down during a data update? In a relational database, transaction updates are guaranteed to be consistent or the transaction is rolled back. NoSQL databases sacrifice consistency in order to attain high levels of performance. NoSQL databases provide eventual consistency. Eventual consistency is a feature of NoSQL databases that indicates that data are not guaranteed to be consistent immediately after an update (across all copies of the data); rather, updates will propagate through the system, and eventually all data copies will be consistent.
20.
Using the example of a medical clinic with patients and tests, provide a simple representation of how to model this example using the relational model and how it would be represented using the key-value data modeling technique. As you can see in Figure Q2.20, the relational model stores data in a tabular format in which each row represents a "record" for a given patient, while the key-value data model uses three different fields to represent each data element in the record. Therefore, for each patient row, there are three rows in the key-value model.
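The same contrast can be shown with a small Python sketch. The attribute names below are illustrative, not the ones in the text's Figure Q2.20.

```python
# Relational model: one row per patient, one column per attribute.
relational_row = {"pat_id": 1, "pat_lname": "Smith", "test": "Blood"}

def to_key_value(entity_id, row):
    """Key-value model: one (id, key, value) triple per attribute.
    The value is stored as a string, mirroring the long-string
    "value" column described in Question 19."""
    return [(entity_id, key, str(value)) for key, value in row.items()]

# Each relational column becomes its own key-value row:
kv_rows = to_key_value(1, {"pat_lname": "Smith", "test": "Blood"})
```

One relational row with n attribute columns thus expands into n key-value rows, which is why the answer above reports three key-value rows per patient.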
21.
What is logical independence? Logical independence exists when you can change the internal model without affecting the conceptual model. When you discuss logical and other types of independence, it’s worthwhile to discuss and review some basic modeling concepts and terminology:
• In general terms, a model is an abstraction of a more complex real-world object or event. A model's main function is to help you understand the complexities of the real-world environment. Within the database environment, a data model represents data structures and their characteristics, relations, constraints, and transformations.
• As its name implies, a purely conceptual model stands at the highest level of abstraction and focuses on the basic ideas (concepts) that are explored in the model, without specifying the details that will enable the designer to implement the model. For example, a conceptual model would include entities and their relationships, and it may even include at least some of the attributes that define the entities, but it would not include attribute details such as the nature of the attributes (text, numeric, etc.) or the physical storage requirements of those attributes.
• The terms data model and database model are often used interchangeably. In the text, the term database model is used to refer to the implementation of a data model in a specific database system.
• Data models (relatively simple representations, usually graphical, of more complex real-world data structures), bolstered by powerful database design tools, have made it possible to substantially diminish the potential for errors in database design.
• The internal model is the representation of the database as "seen" by the DBMS. In other words, the internal model requires the designer to match the conceptual model's characteristics and constraints to those of the selected implementation model. An internal schema depicts a specific representation of an internal model, using the database constructs supported by the chosen database.
• The external model is the end users' view of the data environment.
22.
What is physical independence? You have physical independence when you can change the physical model without affecting the internal model. Therefore, a change in storage devices or methods and even a change in operating system will not affect the internal model. The terms physical model and internal model may require a bit of additional discussion: • The physical model operates at the lowest level of abstraction, describing the way data are saved on storage media such as disks or tapes. The physical model requires the definition of both the physical storage devices and the (physical) access methods required to reach the data within those storage devices, making it both software- and hardware-dependent. The storage structures used are dependent on the software (DBMS, operating system) and on the type of storage devices that the computer can handle. The precision required in the physical model’s definition demands that database designers who work at this level have a detailed knowledge of the hardware and software used to implement the database design. • The internal model is the representation of the database as “seen” by the DBMS. In other words, the internal model requires the designer to match the conceptual model’s characteristics and constraints to those of the selected implementation model. An internal schema depicts a specific representation of an internal model, using the database constructs supported by the chosen database.
Problem Solutions
Use the contents of Figure 2.1 to work Problems 1-3.
1. Write the business rule(s) that governs the relationship between AGENT and CUSTOMER. Given the data in the two tables, you can see that an AGENT -- through AGENT_CODE -- can occur many times in the CUSTOMER table. But each customer has only one agent. Therefore, the business rules may be written as follows: One agent can have many customers. Each customer has only one agent. Given these business rules, you can conclude that there is a 1:M relationship between AGENT and CUSTOMER. 2. Given the business rule(s) you wrote in Problem 1, create the basic Crow's Foot ERD. The Crow's Foot ERD is shown in Figure P2.2a.
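A small sketch can check sample data against the 1:M rule just stated ("each customer has only one agent"). The customer and agent codes below are illustrative; only the rule itself comes from the problem.

```python
# Sample CUSTOMER rows carrying the AGENT_CODE foreign key.
# Values are illustrative, not taken from Figure 2.1.

customer_rows = [
    {"cus_code": 10010, "agent_code": 501},
    {"cus_code": 10011, "agent_code": 502},
    {"cus_code": 10012, "agent_code": 501},  # agent 501 serves two customers
]

def satisfies_one_agent_per_customer(rows):
    """Each CUS_CODE may appear only once, i.e., with exactly one agent.
    An agent code appearing many times is fine: that is the M side."""
    seen = set()
    for row in rows:
        if row["cus_code"] in seen:
            return False
        seen.add(row["cus_code"])
    return True
```

Note that agent 501 appearing twice is allowed (one agent, many customers), while a repeated customer code would violate the rule.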
Figure P2.2a The Crow's Foot ERD for Problem 2: AGENT serves CUSTOMER
For discussion purposes, you might use the Chen model shown in Figure P2.2b. Compare the two representations of the business rules by noting the different ways in which connectivities (1,M) are represented. The Chen ERD is shown in Figure P2.2b.
Figure P2.2b The Chen ERD for Problem 2: AGENT (1) serves (M) CUSTOMER
3. Using the ERD you drew in Problem 2, create the equivalent Object representation and UML class diagram. (Use Figure 2.4 as your guide.) The OO model is shown in Figure P2.3a.
Figure P2.3a The OO Model for Problem 3: AGENT (M) CUSTOMER
Figure P2.3b The UML Model for Problem 3
Using Figure P2.4 as your guide, work Problems 4–5. The DealCo relational diagram shows the initial entities and attributes for the DealCo stores, located in two regions of the country.
Figure P2.4 The DealCo relational diagram
4. Identify each relationship type and write all of the business rules. One region can be the location for many stores. Each store is located in only one region. Therefore, the relationship between REGION and STORE is 1:M. Each store employs one or more employees. Each employee is employed by one store. (In this case, we are assuming that the business rule specifies that an employee cannot work in more than one store at a time.) Therefore, the relationship between STORE and EMPLOYEE is 1:M. A job -- such as accountant or sales representative -- can be assigned to many employees. (For example, one would reasonably assume that a store can have more than one sales representative. Therefore, the job title "Sales Representative" can be assigned to more than one employee at a time.) Each employee can have only one job assignment. (In this case, we are assuming that the business rule specifies that an employee cannot have more than one job assignment at a time.) Therefore, the relationship between JOB and EMPLOYEE is 1:M. 5. Create the basic Crow's Foot ERD for DealCo. The Crow's Foot ERD is shown in Figure P2.5a.
Figure P2.5a The Crow's Foot ERD for DealCo: REGION is location for STORE; STORE employs EMPLOYEE; JOB is assigned to EMPLOYEE
The Chen model is shown in Figure P2.5b. (Note that you always read the relationship from the “1” to the “M” side.)
Figure P2.5b The Chen ERD for DealCo: REGION (1) is location for (M) STORE; STORE (1) employs (M) EMPLOYEE; JOB (1) is assigned to (M) EMPLOYEE
Using Figure P2.6 as your guide, work Problems 6−8. The Tiny College relational diagram shows the initial entities and attributes for Tiny College.
Figure P2.6 The Tiny College relational diagram
6. Identify each relationship type and write all of the business rules. The simplest way to illustrate the relationship between ENROLL, CLASS, and STUDENT is to discuss the data shown in Table P2.6. As you examine the Table P2.6 contents and compare the attributes to the relational schema shown in Figure P2.6, note these features:
• We have added an attribute, ENROLL_SEMESTER, to identify the enrollment period.
• Naturally, no grade has yet been assigned when the student is first enrolled, so we have entered a default value "NA" for "Not Applicable." The letter grade -- A, B, C, D, F, I (Incomplete), or W (Withdrawal) -- will be entered at the conclusion of the enrollment period, the SPRING-14 semester.
• Student 11324 is enrolled in two classes; student 11892 is enrolled in three classes, and student 10345 is enrolled in one class.
Table P2.6 Sample Contents of an ENROLL Table

STU_NUM   CLASS_CODE   ENROLL_SEMESTER   ENROLL_GRADE
11324     MATH345-04   SPRING-14         NA
11324     ENG322-11    SPRING-14         NA
11892     CHEM218-05   SPRING-14         NA
11892     ENG322-11    SPRING-14         NA
11892     CIS431-01    SPRING-14         NA
10345     ENG322-07    SPRING-14         NA
All of the relationships are 1:M. The relationships may be written as follows: COURSE generates CLASS. One course can generate many classes. Each class is generated by one course. CLASS is referenced in ENROLL. One class can be referenced in enrollment many times. Each individual enrollment references one class. Note that the ENROLL entity is also related to STUDENT. Each entry in the ENROLL entity references one student and the class for which that student has enrolled. A student cannot enroll in the same class more than once. If a student enrolls in four classes, that student will appear in the ENROLL entity four times, each time for a different class. STUDENT is shown in ENROLL. One student can be shown in enrollment many times. (In database design terms, “many” simply means “more than once.”) Each individual enrollment entry shows one student. 7. Create the basic Crow’s Foot ERD for Tiny College. The Crow’s Foot model is shown in Figure P2.7a.
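The sample ENROLL data from Table P2.6 can be used to show how the bridge entity turns the M:N relationship between STUDENT and CLASS into two 1:M relationships:

```python
# ENROLL rows taken from Table P2.6, as (STU_NUM, CLASS_CODE) pairs.
# Each row ties one student to one class.

enroll = [
    (11324, "MATH345-04"), (11324, "ENG322-11"),
    (11892, "CHEM218-05"), (11892, "ENG322-11"), (11892, "CIS431-01"),
    (10345, "ENG322-07"),
]

def classes_of(stu_num):
    """STUDENT 1:M ENROLL: one student appears in many enrollment rows."""
    return [c for s, c in enroll if s == stu_num]

def students_in(class_code):
    """CLASS 1:M ENROLL: one class appears in many enrollment rows."""
    return [s for s, c in enroll if c == class_code]
```

Student 11892 maps to three classes and class ENG322-11 maps to two students, which is precisely the M:N relationship recovered from the two 1:M sides.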
Figure P2.7a The Crow’s Foot Model for Tiny College [The ERD shows COURSE “generates” CLASS, CLASS “is referenced in” ENROLL, and STUDENT “is shown in” ENROLL.]
The Chen model is shown in Figure P2.7b.
Figure P2.7b The Chen Model for Tiny College [The ERD shows COURSE (1) “generates” CLASS (M), CLASS (1) “is referenced in” ENROLL (M), and STUDENT (1) “is shown in” ENROLL (M).]
8. Create the UML class diagram that reflects the entities and relationships you identified in the relational diagram. The OO model is shown in Figure P2.8.
Figure P2.8a The OO Model for Tiny College [The diagram shows the four object classes with their attributes: COURSE (CRS_CODE, CRS_DESCRIPTION, CRS_CREDIT), CLASS (CLASS_CODE, CLASS_DAYS, CLASS_TIME, CLASS_ROOM), STUDENT (STU_NUM, STU_LNAME, STU_FNAME, STU_INITIAL, STU_DOB), and ENROLL (ENROLL_SEMESTER, ENROLL_GRADE), with the 1:M and M:N connections among them. Note: C = Character, D = Date, N = Numeric.]
Figure P2.8b The UML Model for Tiny College
9. Typically, a patient staying in a hospital receives medications that have been ordered by a particular doctor. Because the patient often receives several medications per day, there is a 1:M relationship between PATIENT and ORDER. Similarly, each order can include several medications, creating a 1:M relationship between ORDER and MEDICATION.
a. Identify the business rules for PATIENT, ORDER, and MEDICATION.
The business rules reflected in the PATIENT description are:
A patient can have many (medical) orders written for him or her.
Each (medical) order is written for a single patient.
The business rules reflected in the ORDER description are:
Each (medical) order can prescribe many medications.
Each medication can be prescribed in many orders.
The business rules reflected in the MEDICATION description are:
Each medication can be prescribed in many orders.
Each (medical) order can prescribe many medications.
b. Create a Crow's Foot ERD that depicts a relational database model to capture these business rules.
Figure P2.9 Crow's foot ERD for Problem 9
10. United Broke Artists (UBA) is a broker for not-so-famous painters. UBA maintains a small network database to track painters, paintings, and galleries. A painting is painted by a particular artist, and that painting is exhibited in a particular gallery. A gallery can exhibit many paintings, but each painting can be exhibited in only one gallery. Similarly, a painting is painted by a single painter, but each painter can paint many paintings. Using PAINTER, PAINTING, and GALLERY, in terms of a relational database: a. What tables would you create, and what would the table components be? We would create the three tables shown in Figure P2.10a. (Use the Ch02_UBA database in your instructor's resources to illustrate the table contents.)
FIGURE P2.10a The UBA Database Tables
As you discuss the UBA database contents, note in particular the following business rules that are reflected in the tables and their contents:
• A painter can paint many paintings.
• Each painting is painted by only one painter.
• A gallery can exhibit many paintings.
• A painter can exhibit paintings at more than one gallery at a time. (For example, if a painter has painted six paintings, two may be exhibited in one gallery, one at another, and three at the third gallery. Naturally, if galleries specify exclusive contracts, the database must be changed to reflect that business rule.)
• Each painting is exhibited in only one gallery.
The last business rule reflects the fact that a painting can be physically located in only one gallery at a time. If the painter decides to move a painting to a different gallery, the database must be updated to remove the painting from one gallery and add it to the different gallery.
b. How might the (independent) tables be related to one another? Figure P2.10b shows the relationships.
FIGURE P2.10b The UBA Relational Model
11. Using the ERD from Problem 10, create the relational schema. (Create an appropriate collection of attributes for each of the entities. Make sure you use the appropriate naming conventions to name the attributes.) The relational diagram is shown in Figure P2.11.
FIGURE P2.11 The Relational Diagram for Problem 11
12. Convert the ERD from Problem 10 into the corresponding UML class diagram. The basic UML solution is shown in Figure P2.12.
FIGURE P2.12 The UML for Problem 12
13. Describe the relationships (identify the business rules) depicted in the Crow’s Foot ERD shown in Figure P2.13.
Figure P2.13 The Crow’s Foot ERD for Problem 13

The business rules may be written as follows:
• A professor can teach many classes.
• Each class is taught by one professor.
• A professor can advise many students.
• Each student is advised by one professor.

14. Create a Crow’s Foot ERD to include the following business rules for the ProdCo company:
a. Each sales representative writes many invoices.
b. Each invoice is written by one sales representative.
c. Each sales representative is assigned to one department.
d. Each department has many sales representatives.
e. Each customer can generate many invoices.
f. Each invoice is generated by one customer.
The Crow’s Foot ERD is shown in Figure P2.14. Note that a 1:M relationship is always read from the one (1) to the many (M) side. Therefore, the customer-invoice relationship is read as “one customer generates many invoices.”
Figure P2.14 Crow’s Foot ERD for the ProdCo Company
15. Write the business rules that are reflected in the ERD shown in Figure P2.15. (Note that the ERD reflects some simplifying assumptions. For example, each book is written by only one author. Also, remember that the ERD is always read from the “1” to the “M” side, regardless of the orientation of the ERD components.)
FIGURE P2.15 The Crow’s Foot ERD for Problem 15
The relationships are best described through a set of business rules:
• One publisher can publish many books.
• Each book is published by one publisher.
• A publisher can submit many (book) contracts.
• Each (book) contract is submitted by one publisher.
• One author can sign many contracts.
• Each contract is signed by one author.
• One author can write many books.
• Each book is written by one author.
This ERD will be a good basis for a discussion about what happens when more realistic assumptions are made. For example, a book – such as this one – may be written by more than one author. Therefore, a contract may be signed by more than one author. Your students will learn how to model such relationships after they have become familiar with the material in Chapter 3. 16. Create a Crow’s Foot ERD for each of the following descriptions. (Note: The word many merely means “more than one” in the database modeling environment.) a. Each of the MegaCo Corporation’s divisions is composed of many departments. Each of those departments has many employees assigned to it, but each employee works for only one department. Each department is managed by one employee, and each of those managers can manage only one department at a time. The Crow’s Foot ERD is shown in Figure P2.16a.
FIGURE P2.16a The MegaCo Crow’s Foot ERD [The diagram shows DEPARTMENT (1) “is assigned to” EMPLOYEE (M) and EMPLOYEE (1) “manages” DEPARTMENT (1).]
As you discuss the contents of Figure P2.16a, note the 1:1 relationship between the EMPLOYEE and the DEPARTMENT in the “manages” relationship and the 1:M relationship between the DEPARTMENT and the EMPLOYEE in the “is assigned to” relationship.
b. During some period of time, a customer can download many ebooks from BooksOnline. Each of the ebooks can be downloaded by many customers during that period of time. The solution is presented in Figure P2.16b. Note the M:N relationship between CUSTOMER and EBOOK. Such a relationship is not implementable in a relational model.
FIGURE P2.16b The BooksOnline Crow’s Foot ERD
If you want to let the students convert Figure P2.16b’s ERD into an implementable ERD, add a third DOWNLOAD entity to create a 1:M relationship between CUSTOMER and DOWNLOAD and a 1:M relationship between EBOOK and DOWNLOAD. (Note that such a conversion has been shown in the next problem solution.) c. An airliner can be assigned to fly many flights, but each flight is flown by only one airliner.
FIGURE P2.16c The Airline Crow’s Foot ERD [Initial M:N solution: AIRCRAFT “flies” FLIGHT. Implementable solution: AIRCRAFT (1) “is assigned to” ASSIGNMENT (M) and FLIGHT (1) “shows in” ASSIGNMENT (M).]
We have created a small Ch02_Airline database to let you explore the implementation of the model. (Check the data files available for Instructors at www.cengagebrain.com.) The tables and the relational diagram are shown in the following two figures.
FIGURE P2.16c The Airline Database Tables
FIGURE P2.16c The Airline Relational Diagram
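The AIRCRAFT-ASSIGNMENT-FLIGHT decomposition above can be sketched in SQLite. This is only an illustration of the bridge-entity technique, not the Ch02_Airline data files; all table names, column names, and sample values here are assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")

# The M:N relationship between AIRCRAFT and FLIGHT is decomposed into two
# 1:M relationships through the ASSIGNMENT bridge entity.
con.executescript("""
CREATE TABLE AIRCRAFT (AC_NUM  TEXT PRIMARY KEY);
CREATE TABLE FLIGHT   (FLT_NUM TEXT PRIMARY KEY);
CREATE TABLE ASSIGNMENT (
    AC_NUM   TEXT REFERENCES AIRCRAFT (AC_NUM),  -- AIRCRAFT is assigned to ASSIGNMENT
    FLT_NUM  TEXT REFERENCES FLIGHT (FLT_NUM),   -- FLIGHT shows in ASSIGNMENT
    ASN_DATE TEXT,
    PRIMARY KEY (AC_NUM, FLT_NUM, ASN_DATE)
);
""")

con.executemany("INSERT INTO AIRCRAFT VALUES (?)", [("N1001",), ("N1002",)])
con.executemany("INSERT INTO FLIGHT VALUES (?)", [("101",), ("202",)])
# One aircraft flies many flights over time, and the same flight number can
# be flown by different aircraft on different dates -- the original M:N.
con.executemany("INSERT INTO ASSIGNMENT VALUES (?, ?, ?)", [
    ("N1001", "101", "2014-03-01"),
    ("N1001", "202", "2014-03-02"),
    ("N1002", "101", "2014-03-02"),
])

# Each side of the M:N is now a simple 1:M query through the bridge.
flights_of_n1001 = [r[0] for r in con.execute(
    "SELECT FLT_NUM FROM ASSIGNMENT WHERE AC_NUM = 'N1001' ORDER BY FLT_NUM")]
print(flights_of_n1001)  # → ['101', '202']
```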
d. The KwikTite Corporation operates many factories. Each factory is located in a region. Each region can be “home” to many of KwikTite’s factories. Each factory employs many employees, but each of those employees is employed by only one factory. The solution is shown in Figure P2.16d.
FIGURE P2.16d The KwikTite Crow’s Foot ERD [The diagram shows REGION (1) “contains” FACTORY (M) and FACTORY (1) “employs” EMPLOYEE (M).]

Remember that a 1:M relationship is always read from the “1” side to the “M” side. Therefore, the relationship between FACTORY and EMPLOYEE is properly read as “factory employs employee,” and the relationship between REGION and FACTORY as “region contains factory.”
e. An employee may have earned many degrees, and each degree may have been earned by many employees. The solution is shown in Figure P2.16e.
FIGURE P2.16e The Earned Degree Crow’s Foot ERD [The diagram shows the M:N relationship EMPLOYEE “earns” DEGREE.]
Note that this M:N relationship must be broken up into two 1:M relationships before it can be implemented in a relational database. Use the Airline ERD’s decomposition in Figure P2.16c as the focal point in your discussion.
17. Write the business rules that are reflected in the ERD shown in Figure P2.17.
• A theater can show many movies.
• A movie can be shown in many theaters.
• A movie can receive many reviews.
• Each review is for a single movie.
• A reviewer can write many reviews.
• Each review is written by a single reviewer.
Note that the M:N relationship between THEATER and MOVIE must be broken into two 1:M relationships using a bridge table before it can be implemented in a relational database.
Chapter 3 The Relational Database Model

Discussion Focus

Why is most of this book based on the relational database model? The answer is, quite simply, that the relational database model has a very successful track record and is the dominant database model in the market.

But why has the relational database model (RDM) been so successful? The object-oriented database model (OODM) seemed poised to dislodge the RDM in the face of increasingly complex data that included video and audio, yet the OODM fell short in the database arena. However, the OODM's basic concepts have become the basis of a wide variety of database systems analysis and design procedures, and the basic OO approach has been adopted by many application generators and other development tools.

The OODM's inability to replace the RDM is due to several factors. First, the large installed base of RDM-based databases is difficult to overcome. Change is often difficult and expensive, so the prime requisite for change is an overwhelming advantage of the change agent. The OODM's advantages were simply not accepted as overwhelming and were, therefore, not accepted as cost-effective. Second, the OODM's design, implementation, and management learning curves are much steeper than the RDM's. Third, the RDM preempted the OODM in some important respects by adopting many of the OODM's best features, thus becoming the extended relational data model (ERDM). Because the ERDM retains the basic modeling simplicity of the RDM while being able to handle the complex data environment that was supposed to be the OODM's forte, you can have the proverbial cake and eat it, too. The OODM-ERDM battle for dominance in the database marketplace seems remarkably similar to the one waged by the hierarchical and network models against the relational model almost three decades ago.
The OODM and ERDM are similar in the sense that each attempts to address the demand for more semantic information to be incorporated into the model. However, the OODM and the ERDM differ substantially both in underlying philosophy and in the nature of the problem to be addressed. Although the ERDM includes a strong semantic component, it is primarily based on the relational data model’s concepts. In contrast, the OODM is wholly based on the OO and semantic data model concepts. The ERDM is primarily geared to business applications, while the OODM tends to focus on very specialized engineering and scientific applications. In the database arena, the most likely scenario appears to be an ever-increasing merging of OO and relational data model concepts and procedures. Although the ERDM label has frequently been used in the database literature to describe the -- quite successful -- relational data model’s response to the OODM challenge, C. J. Date objects to the ERDM label for the following reasons (set forth in “Back to the Relational Future”). • The useful contribution of the object model is its ability to let users define their own -- and often very complex -- data types. However, mathematical structures known as “domains” in the relational model also provide this ability. Therefore, a relational DBMS that properly supports such domains greatly
diminishes the reason for using the object model. Given proper support for domains, relational data models are quite capable of handling the complex data encountered in time series, engineering design, office automation, financial modeling, and so on.
• Because the relational model can support complex data types, the notion of an “extended relational data model” or ERDM is “extremely inappropriate and inaccurate” and “it should be firmly resisted.” (The capability that is supposedly being extended is already there!)
• Even the label object/relational data model (O/RDM) is not quite accurate, because the relational data model’s domain is not an object model structure. However, there are already quite a few O/R products -- also known as universal database servers -- on the market. Therefore, Date concedes that we are probably stuck with the O/R label. In fact, Date believes that “an O/R system is in everyone’s future.” More precisely, Date argues that a true O/R system would be “nothing more nor less than a true relational system -- which is to say, a system that supports the relational model, with all that such support entails.”
C. J. Date concludes his discussion by observing that “We need do nothing to the relational model to achieve object functionality. (Nothing, that is, except implement it, something that doesn’t yet seem to have been tried in the commercial world.)” Because C. J. Date is generally considered to be one of the world’s leading database thinkers and innovators, his observations cannot be easily dismissed. In any case, regardless of the label that is used to tag the relational data model’s growing capabilities, it seems clear that the relational data model is likely to maintain its database market dominance for some time. We believe, therefore, that our continued emphasis on the relational data model is appropriate.
Answers to Review Questions
ONLINE CONTENT The website (www.cengagebrain.com) includes MS Access databases and SQL script files (Oracle, SQL Server, and MySQL) for all of the data sets used throughout the book.
1. What is the difference between a database and a table?
A table, a logical structure that represents an entity set, is only one of the components of a database. The database is a structure that houses one or more tables and metadata. The metadata are data about data. Metadata include the data (attribute) characteristics and the relationships between the entity sets.

2. What does it mean to say that a database displays both entity integrity and referential integrity?
Entity integrity describes a condition in which all tuples within a table are uniquely identified by their primary key. The unique value requirement prohibits a null primary key value, because nulls are not unique.
Referential integrity describes a condition in which a foreign key value has a match in the corresponding table or in which the foreign key value is null. The null foreign key “value” makes it possible not to have a corresponding value, but the matching requirement on values that are not null makes it impossible to have an invalid value.

3. Why are entity integrity and referential integrity important in a database?
Entity integrity and referential integrity are important because they are the basis for expressing and implementing relationships in the entity relationship model. Entity integrity ensures that each row is uniquely identified by the primary key. Therefore, entity integrity means that a proper search for an existing tuple (row) will always be successful. (And the failure to find a match on a row search will always mean that the row for which the search is conducted does not exist in that table.)
Referential integrity means that, if the foreign key contains a value, that value refers to an existing valid tuple (row) in another relation. Therefore, referential integrity ensures that it will be impossible to assign a non-existing foreign key value to a table.

4. What are the requirements that two relations must satisfy in order to be considered union-compatible?
In order for two relations to be union-compatible, both must have the same number of attributes (columns) and corresponding attributes (columns) must have the same domain. The first requirement is easily identified by a cursory glance at the relations' structures. If the first relation has
3 attributes, then the second relation must also have 3 attributes. If the first table has 10 attributes, then the second relation must also have 10 attributes.
The second requirement is more difficult to assess and requires understanding the meanings of the attributes in the business environment. Recall that an attribute's domain is the set of allowable values for that attribute. To satisfy the second requirement for union-compatibility, the first attribute of the first relation must have the same domain as the first attribute of the second relation. The second attribute of the first relation must have the same domain as the second attribute of the second relation. The third attribute of the first relation must have the same domain as the third attribute of the second relation, and so on.

5. Which relational algebra operators can be applied to a pair of tables that are not union-compatible?
The Product, Join, and Divide operators can be applied to a pair of tables that are not union-compatible. Divide does place specific requirements on the tables to be operated on; however, those requirements do not include union-compatibility. Select (or Restrict) and Project are performed on individual tables, not pairs of tables. (Note that if two tables are joined, then the result is a single table and the Select or Project operator is performed on that single table.)

6. Explain why the data dictionary is sometimes called "the database designer's database."
Just as the database stores data that is of interest to the users regarding the objects in their environment that are important to them, the data dictionary stores data that is of interest to the database designer about the important decisions that were made in regard to the database structure.
The data dictionary contains the number of tables that were created, the names of all of those tables, the attributes in each table, the relationships between the tables, the data type of each attribute, the enforced domains of the attributes, etc. All of these data represent decisions that the database designer had to make and data that the database designer needs to record about the database.

7. A database user manual notes that, “The file contains two hundred records, each record containing nine fields.” Use appropriate relational database terminology to “translate” that statement.
Using the proper relational terminology, the statement may be translated to "the table -- or entity set -- contains two hundred rows -- or, if you like, two hundred tuples, or entities. Each of these rows contains nine attributes."

8. Use the STUDENT and PROFESSOR tables shown in Figure Q3.8 to illustrate the difference between a natural join, an equijoin, and an outer join.
FIGURE Q3.8 The Ch03_CollegeQue Database Tables
The natural JOIN process begins with the PRODUCT of the two tables. Next, a SELECT (or RESTRICT) is performed on the PRODUCT generated in the first step to yield only the rows for which the PROF_CODE values in the STUDENT table are matched in the PROFESSOR table. Finally, a PROJECT is performed to produce the natural JOIN output by listing only a single copy of each attribute. The order in which the query output rows are shown is not relevant.

STU_CODE   PROF_CODE   DEPT_CODE
128569     2           6
512272     4           4
531235     2           6
553427     1           2
The equiJOIN's results depend on the specified condition. At this stage of the students' understanding, it may be best to focus on equijoins that retrieve all matching values in the common attribute. In such a case, the output will be:

STU_CODE   STUDENT.PROF_CODE   PROFESSOR.PROF_CODE   DEPT_CODE
128569     2                   2                     6
512272     4                   4                     4
531235     2                   2                     6
553427     1                   1                     2

Notice that in equijoins, the common attribute appears once from each table. It is normal to prefix the attribute name with the table name when an attribute name appears more than once in the result. This maintains the requirement that attribute names be unique within a relational table. In the Outer JOIN, the unmatched pairs would be retained and the values that do not have a match in the other table would be left null. It should be made clear to the students that Outer Joins are not the
opposite of Inner Joins (like Natural Joins and Equijoins). Rather, they are "Inner Join Plus" -- they include all of the matched records found by the Inner Join plus the unmatched records. Outer JOINs are normally performed as either a Left Outer Join or a Right Outer Join so that the operator specifies which table's unmatched rows should be included in the output. Full Outer Joins depict the matched records plus the unmatched records from both tables. Also, like Equijoins, Outer Joins do not drop a copy of the common attribute. Therefore, a Full Outer Join will yield these results:

STU_CODE   STUDENT.PROF_CODE   PROFESSOR.PROF_CODE   DEPT_CODE
128569     2                   2                     6
512272     4                   4                     4
531235     2                   2                     6
553427     1                   1                     2
100278
531268
                               3                     6
A Left Outer Join of STUDENT to PROFESSOR would include the matched rows plus the unmatched STUDENT rows:

STU_CODE   STUDENT.PROF_CODE   PROFESSOR.PROF_CODE   DEPT_CODE
128569     2                   2                     6
512272     4                   4                     4
531235     2                   2                     6
553427     1                   1                     2
100278
531268
A Right Outer Join of STUDENT to PROFESSOR would include the matched rows plus the unmatched PROFESSOR row:

STU_CODE   STUDENT.PROF_CODE   PROFESSOR.PROF_CODE   DEPT_CODE
128569     2                   2                     6
512272     4                   4                     4
531235     2                   2                     6
553427     1                   1                     2
                               3                     6
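The three join flavors above can be verified with SQLite, using the Figure Q3.8 key values. This is a sketch: the two unmatched students carry null PROF_CODE values, and since older SQLite versions lack RIGHT JOIN, the right outer join is simulated by swapping the table order in a LEFT JOIN:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE PROFESSOR (PROF_CODE INTEGER, DEPT_CODE INTEGER);
CREATE TABLE STUDENT   (STU_CODE  INTEGER, PROF_CODE INTEGER);
""")
con.executemany("INSERT INTO PROFESSOR VALUES (?, ?)",
                [(1, 2), (2, 6), (3, 6), (4, 4)])
con.executemany("INSERT INTO STUDENT VALUES (?, ?)",
                [(128569, 2), (512272, 4), (531235, 2),
                 (553427, 1), (100278, None), (531268, None)])

# Inner (equi)join: only the matched rows survive.
inner = con.execute("""
    SELECT S.STU_CODE, S.PROF_CODE, P.PROF_CODE, P.DEPT_CODE
    FROM STUDENT S JOIN PROFESSOR P ON S.PROF_CODE = P.PROF_CODE
""").fetchall()

# Left outer join: "inner join plus" the unmatched STUDENT rows, null-filled.
left = con.execute("""
    SELECT S.STU_CODE, S.PROF_CODE, P.PROF_CODE, P.DEPT_CODE
    FROM STUDENT S LEFT JOIN PROFESSOR P ON S.PROF_CODE = P.PROF_CODE
""").fetchall()

# Right outer join of STUDENT to PROFESSOR, written as a swapped left join:
# the matched rows plus the unmatched professor (PROF_CODE 3).
right = con.execute("""
    SELECT S.STU_CODE, S.PROF_CODE, P.PROF_CODE, P.DEPT_CODE
    FROM PROFESSOR P LEFT JOIN STUDENT S ON S.PROF_CODE = P.PROF_CODE
""").fetchall()

print(len(inner), len(left), len(right))  # → 4 6 5
```

The row counts match the tables above: four matched rows, plus two unmatched students on the left, plus one unmatched professor on the right.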
9. Create the table that would result from πstu_code(student).

STU_CODE
128569
512272
531235
553427
100278
531268

10. Create the table that would result from πstu_code,dept_code(student ⋈ professor).

STU_CODE   DEPT_CODE
128569     6
512272     4
531235     6
553427     2
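The two PROJECT results above can be reproduced with a few lines of Python. This is a sketch of the relational algebra semantics (PROJECT is a set operator, so duplicates are dropped); the `project` helper is illustrative, not part of any library:

```python
# Tuples mirror the Figure Q3.8 tables: STUDENT (STU_CODE, PROF_CODE)
# and PROFESSOR (PROF_CODE, DEPT_CODE).
student = [(128569, 2), (512272, 4), (531235, 2),
           (553427, 1), (100278, None), (531268, None)]
professor = [(1, 2), (2, 6), (3, 6), (4, 4)]

# PROJECT (π): keep the named columns and, as a set operator, drop duplicates.
def project(rows, *cols):
    return sorted({tuple(row[c] for c in cols) for row in rows})

# π stu_code (STUDENT) -- all six students survive.
print(project(student, 0))

# STUDENT ⋈ PROFESSOR (natural join on PROF_CODE), then π stu_code, dept_code.
joined = [(s_code, s_prof, p_dept)
          for (s_code, s_prof) in student
          for (p_code, p_dept) in professor
          if s_prof == p_code]
print(project(joined, 0, 2))  # → [(128569, 6), (512272, 4), (531235, 6), (553427, 2)]
```

Note how the two null-advisor students drop out of the join result, exactly as in the Question 10 table.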
11. Create the basic ERD for the database shown in Figure Q3.8. Both the Chen and Crow’s Foot solutions are shown in Figure Q3.11. (We have used the PowerPoint template to produce the first of the two Crow’s Foot ERDs and Visio Professional to produce the second of the two Crow’s Foot ERDs.)
Figure Q3.11 The Chen and Crow’s Foot ERD Solutions for Question 11 [Chen ERD (generated with PowerPoint): PROFESSOR (1) “advises” STUDENT (M). Crow’s Foot ERD (generated with PowerPoint): PROFESSOR “advises” STUDENT. A third diagram, generated with Visio Professional, is also shown.]
NOTE From this point forward, we will show the ERDs in Visio Professional format unless the problem specifies a different format.

12. Create the relational diagram for the database shown in Figure Q3.8. The relational diagram, generated in the Microsoft Access Ch03_CollegeQue database, is shown in Figure Q3.12.

Figure Q3.12 The Relational Diagram
Use Figure Q3.13 to answer questions 13 – 17. Figure Q3.13 The Ch03_VendingCo database tables
13. Write the relational algebra formula to apply a UNION relational operator to the tables shown in Figure Q3.13. The question does not specify the order in which the tables should be used in the operation. Therefore, both of the following are correct.

BOOTH ⋃ MACHINE
MACHINE ⋃ BOOTH
You can use this as an opportunity to emphasize that the order of the tables in a UNION command does not change the contents of the data returned.

14. Create the table that results from applying a UNION relational operator to the tables shown in Figure Q3.13.

BOOTH_PRODUCT   BOOTH_PRICE
Chips           1.50
Cola            1.25
Energy Drink    2.00
Chips           1.25
Chocolate Bar   1.00
Note that when the attribute names are different, the result will take the attribute names from the first relation. In this case, the solution assumes the operation was BOOTH UNION MACHINE. If the operation had been MACHINE UNION BOOTH, then the attribute names from the MACHINE table would have appeared as the attribute names in the result. Also, notice that the "Chips" from both tables appears in the result, but the "Energy Drink" from both does not. A UNION operator will eliminate duplicate rows from the result; however, the entire row must match for two rows to be considered duplicates. In the case of "Chips", the product names were the same but the prices were different. In the case of "Energy Drink", both the product names and the prices matched, so the second Energy Drink row was dropped from the result.
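The whole-row duplicate-elimination rule can be demonstrated with SQLite's UNION, using the Figure Q3.13 values (prices here are illustrative REAL values):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE BOOTH   (BOOTH_PRODUCT   TEXT, BOOTH_PRICE   REAL);
CREATE TABLE MACHINE (MACHINE_PRODUCT TEXT, MACHINE_PRICE REAL);
""")
con.executemany("INSERT INTO BOOTH VALUES (?, ?)",
                [("Chips", 1.50), ("Cola", 1.25), ("Energy Drink", 2.00)])
con.executemany("INSERT INTO MACHINE VALUES (?, ?)",
                [("Chips", 1.25), ("Chocolate Bar", 1.00), ("Energy Drink", 2.00)])

# UNION eliminates a row only when the ENTIRE row matches: the two "Chips"
# rows differ in price, so both survive; the identical "Energy Drink" rows
# collapse into one. Column names come from the first SELECT (BOOTH).
rows = con.execute("""
    SELECT BOOTH_PRODUCT, BOOTH_PRICE FROM BOOTH
    UNION
    SELECT MACHINE_PRODUCT, MACHINE_PRICE FROM MACHINE
""").fetchall()
print(len(rows))  # → 5
```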
15. Write the relational algebra formula to apply an INTERSECT relational operator to the tables shown in Figure Q3.13. The question does not specify the order in which the tables should be used in the operation. Therefore, both of the following are correct.

BOOTH ⋂ MACHINE
MACHINE ⋂ BOOTH
16. Create the table that results from applying an INTERSECT relational operator to the tables shown in Figure Q3.13.

BOOTH_PRODUCT   BOOTH_PRICE
Energy Drink    2.00

Note that when the attribute names are different, the result will take the attribute names from the first relation. In this case, the solution assumes the operation was BOOTH INTERSECT MACHINE. If the operation had been MACHINE INTERSECT BOOTH, then the attribute names from the MACHINE table would have appeared as the attribute names in the result.

17. Using the tables in Figure Q3.13, create the table that results from MACHINE DIFFERENCE BOOTH.

MACHINE_PRODUCT   MACHINE_PRICE
Chips             1.25
Chocolate Bar     1.00
Note that the order in which the relations are specified is significant in the results returned. The DIFFERENCE operator returns the rows from the first relation that are not duplicated in the second relation. Just as with the INTERSECT operator, the entire row must match an existing row to be considered a duplicate. 18. Suppose that you have the ERM shown in Figure Q3.18.
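Both results above can be checked in SQLite, where DIFFERENCE is spelled EXCEPT. A sketch using the Figure Q3.13 values (prices illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE BOOTH   (BOOTH_PRODUCT   TEXT, BOOTH_PRICE   REAL);
CREATE TABLE MACHINE (MACHINE_PRODUCT TEXT, MACHINE_PRICE REAL);
""")
con.executemany("INSERT INTO BOOTH VALUES (?, ?)",
                [("Chips", 1.50), ("Cola", 1.25), ("Energy Drink", 2.00)])
con.executemany("INSERT INTO MACHINE VALUES (?, ?)",
                [("Chips", 1.25), ("Chocolate Bar", 1.00), ("Energy Drink", 2.00)])

# INTERSECT keeps only rows found in BOTH relations; the whole row must match.
both = con.execute("""
    SELECT BOOTH_PRODUCT, BOOTH_PRICE FROM BOOTH
    INTERSECT
    SELECT MACHINE_PRODUCT, MACHINE_PRICE FROM MACHINE
""").fetchall()
print(both)  # → [('Energy Drink', 2.0)]

# EXCEPT (DIFFERENCE): rows of the first relation not duplicated in the
# second. Table order matters -- swapping it gives a different result.
machine_only = con.execute("""
    SELECT MACHINE_PRODUCT, MACHINE_PRICE FROM MACHINE
    EXCEPT
    SELECT BOOTH_PRODUCT, BOOTH_PRICE FROM BOOTH
""").fetchall()
print(sorted(machine_only))  # → [('Chips', 1.25), ('Chocolate Bar', 1.0)]
```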
FIGURE Q3.18 The Crow’s Foot ERD for Question 18
How would you convert this model into an ERM that displays only 1:M relationships? (Make sure that you draw the revised ERM.) The Crow’s Foot solution is shown in Figure Q3.18. Note that the original M:N relationship has been decomposed into two 1:M relationships based on these business rules:
• A driver may receive many (driving) assignments.
• Each (driving) assignment is made for a single driver.
• A truck may be driven in many (driving) assignments.
• Each (driving) assignment is made for a single truck.
Note that a driver can drive only one truck at a time, but during some period of time, a driver may be assigned to drive many trucks. The same argument holds true for trucks -- a truck can only be driven during one trip (assignment) at a time, but during some period of time, a truck may be assigned to be driven in many trips. Also, remind students that they will be introduced to optional (and additional) relationships as they study Chapter 4, “Entity Relationship (ER) Modeling.” Finally, remind your students that you always read the relationship from the “1” side to the “M” side. Therefore, you read “DRIVER receives ASSIGNMENT” and “TRUCK is driven in ASSIGNMENT.”
Figure Q3.18 The Crow’s Foot ERM Solution for Question 18
19. What are homonyms and synonyms, and why should they be avoided in database design?
Homonyms appear when more than one attribute has the same name. Synonyms exist when the same attribute has more than one name. Avoid both to avoid inconsistencies.
For example, suppose we check the database for a specific attribute such as NAME. If NAME refers to customer names as well as to sales rep names, a clear case of a homonym, we have created an ambiguity, because it is no longer clear which entity the NAME belongs to. Synonyms make it difficult to keep track of foreign keys if they are named differently from the primary keys they point to. Using REP_NUM as the foreign key in the CUSTOMER table to reference the primary key REP_NUM in the SALESREP table is much clearer than naming the CUSTOMER table's foreign key SLSREP. The proliferation of different attribute names to describe the same attributes will also make the data dictionary more cumbersome to use.
Some RDBMSs let the data dictionary check for homonyms and synonyms to alert the user to their existence, thus making their use less likely. For example, if a CUSTOMER table contains the (foreign) key REP_NUM, the entry of the attribute REP_NUM in the SALESREP table will either cause it to inherit all the characteristics of the original REP_NUM, or it will reject the use of this attribute name when different characteristics are declared by the user.
20. How would you implement a 1:M relationship in a database composed of two tables? Give an example.
Let’s suppose that an auto repair business wants to track its operations by customer. At the most basic level, it’s reasonable to assume that any database design you produce will include at least a car entity and a customer entity. Further suppose that it is reasonable to assume that:
• A car is owned by just one customer.
• A customer can own more than one car.
The CAR and CUSTOMER entities and their relationships are represented by the Crow’s Foot ERD shown in Figure Q3.20. (Discussion: Explain to your students that the ERDs are very basic at this point. Your students will learn how to incorporate much more detail into their ERDs in Chapter 4. For example, no thought has -- yet -- been given to optional relationships or to the strength of those relationships. At this stage of learning the business of database design, simple is good! To borrow an old Chinese proverb, a journey of a thousand miles begins with a single step.)
Figure Q3.20 The CUSTOMER owns CAR ERM
21. Identify and describe the components of the table shown in Figure Q3.21, using correct terminology. Use your knowledge of naming conventions to identify the table’s probable foreign key(s).
FIGURE Q3.21 The Ch03_NoComp Database EMPLOYEE Table
Figure Q3.21's database table contains:
• One entity set: EMPLOYEE.
• Six attributes: EMP_NUM, EMP_LNAME, EMP_INIT, EMP_FNAME, DEPT_CODE, and JOB_CODE.
• Ten entities: the ten workers shown in rows 1-10.
• One primary key: the attribute EMP_NUM, because it identifies each row uniquely.
• Two foreign keys: the attribute DEPT_CODE, which probably references a department to which the employee is assigned, and the attribute JOB_CODE, which probably references another table in which you would find the description of the job and perhaps additional information pertaining to the job.

Use the database composed of the two tables shown in Figure Q3.22 to answer Questions 22-27.
FIGURE Q3.22 The Ch03_Theater Database Tables
22. Identify the primary keys. DIR_NUM is the DIRECTOR table's primary key. PLAY_CODE is the PLAY table's primary key.

23. Identify the foreign keys. The foreign key is DIR_NUM, located in the PLAY table. Note that the foreign key is located on the "many" side of the relationship between director and play. (Each director can direct many plays ... but each play is directed by only one director.)
24. Create the ERM. The entity relationship model is shown in Figure Q3.24.
Figure Q3.24 The Theater Database ERD
25. Create the relational diagram to show the relationship between DIRECTOR and PLAY. The relational diagram, shown in Figure Q3.25, was generated with the help of Microsoft Access. (Check the Ch03_Theater database.)
Figure Q3.25 The Relational Diagram
26. Suppose you wanted quick lookup capability to get a listing of all plays directed by a given director. Which table would be the basis for the INDEX table, and what would be the index key? The PLAY table would be the basis for the appropriate index table. The index key would be the attribute DIR_NUM.

27. What would be the conceptual view of the INDEX table that is described in question 26? Depict the contents of the conceptual INDEX table. The conceptual index table is shown in Figure Q3.27.
Figure Q3.27 The Conceptual Index Table

Index Key  Pointers to the PLAY table
100        4
101        2, 5, 7
102        1, 3, 6
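The same lookup can be delegated to a DBMS-maintained index rather than a hand-built pointer table. A sqlite3 sketch in which the PLAY rows are assumed to carry the codes used as pointers above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PLAY (PLAY_CODE INTEGER PRIMARY KEY, DIR_NUM INTEGER)")
conn.executemany("INSERT INTO PLAY VALUES (?, ?)",
                 [(1, 102), (2, 101), (3, 102), (4, 100),
                  (5, 101), (6, 102), (7, 101)])
# The index key is DIR_NUM; the DBMS builds and maintains the
# pointers for us instead of the conceptual hand-drawn table.
conn.execute("CREATE INDEX PLAY_DIR_NDX ON PLAY (DIR_NUM)")
plays = [r[0] for r in conn.execute(
    "SELECT PLAY_CODE FROM PLAY WHERE DIR_NUM = 101 ORDER BY PLAY_CODE")]
print(plays)  # [2, 5, 7]
```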
Problem Solutions

Use the database shown in Figure P3.1 to answer Problems 1-9.
FIGURE P3.1 The Ch03_StoreCo Database Tables
1. For each table, identify the primary key and the foreign key(s). If a table does not have a foreign key, write None in the space provided.

TABLE     PRIMARY KEY   FOREIGN KEY(S)
EMPLOYEE  EMP_CODE      STORE_CODE
STORE     STORE_CODE    REGION_CODE, EMP_CODE
REGION    REGION_CODE   None
2. Do the tables exhibit entity integrity? Answer yes or no and then explain your answer.

TABLE     ENTITY INTEGRITY  EXPLANATION
EMPLOYEE  Yes               Each EMP_CODE value is unique and there are no nulls.
STORE     Yes               Each STORE_CODE value is unique and there are no nulls.
REGION    Yes               Each REGION_CODE value is unique and there are no nulls.
3. Do the tables exhibit referential integrity? Answer yes or no and then explain your answer. Write NA (Not Applicable) if the table does not have a foreign key.

TABLE     REFERENTIAL INTEGRITY  EXPLANATION
EMPLOYEE  Yes                    Each STORE_CODE value in EMPLOYEE points to an existing STORE_CODE value in STORE.
STORE     Yes                    Each REGION_CODE value in STORE points to an existing REGION_CODE value in REGION, and each EMP_CODE value in STORE points to an existing EMP_CODE value in EMPLOYEE.
REGION    NA
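The rule being checked here can be enforced declaratively by the DBMS. A minimal sqlite3 sketch for the STORE-to-REGION foreign key, with assumed sample values, shows a dangling reference being rejected:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enable FK enforcement in SQLite
conn.execute("CREATE TABLE REGION (REGION_CODE INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE STORE (
    STORE_CODE  INTEGER PRIMARY KEY,
    REGION_CODE INTEGER NOT NULL REFERENCES REGION (REGION_CODE))""")
conn.execute("INSERT INTO REGION VALUES (1)")
conn.execute("INSERT INTO STORE VALUES (10, 1)")      # OK: region 1 exists
try:
    conn.execute("INSERT INTO STORE VALUES (11, 9)")  # region 9 does not exist
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)  # True: the dangling reference was rejected
```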
4. Describe the type(s) of relationship(s) between STORE and REGION. Because REGION_CODE values occur more than once in STORE, we may conclude that each REGION can contain many stores. But since each STORE is located in only one REGION, the relationship between STORE and REGION is M:1. (It is, of course, equally true that the relationship between REGION and STORE is 1:M.)
5. Create the ERD to show the relationship between STORE and REGION. The Crow’s Foot ERD is shown in Figure P3.5. Note that each store is located in a single region, but that each region can have many stores located in it. (It’s always a good time to focus a discussion on the role of business rules in the creation of a database design.)
Figure P3.5 ERD for the STORE and REGION Relationship
6. Create the relational diagram to show the relationship between STORE and REGION. The relational diagram is shown in Figure P3.6. Note (again) that the location of the entities is immaterial … the relationships are carried along with the entity. Therefore, it does not matter whether you locate the REGION on the left side or on the right side of the display. But you always read from the “1” side to the “M” side, regardless of the entity location.
Figure P3.6 The Relational Diagram for the STORE and REGION Relationship
7. Describe the type(s) of relationship(s) between EMPLOYEE and STORE. (Hint: Each store employs many employees, one of whom manages the store.) There are TWO relationships between EMPLOYEE and STORE. The first relationship, expressed by STORE employs EMPLOYEE, is a 1:M relationship, because one store can employ many employees and each employee is employed by one store. The second relationship, expressed by EMPLOYEE manages STORE, is a 1:1 relationship, because each store is managed by one employee and an employee manages only one store.
NOTE It is useful to introduce several ways in which the manages relationship may be implemented. For example, rather than creating the manages relationship between EMPLOYEE and STORE, it is possible to simply list the manager's name as an attribute in the STORE table. This approach creates a redundancy that may not do much damage if the information requirements are limited. However, if it is necessary to keep track of each manager's sales and personnel management performance by store, the manages relationship we have shown here will do a much better job in terms of information generation. Also, you may want to introduce the notion of an optional relationship. After all, not all employees participate in the manages relationship. We will cover optional relationships in detail in Chapter 4, “Entity Relationship (ER) Modeling.”
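One way to sketch the manages relationship alongside the employs relationship: keep a nullable, UNIQUE foreign key in STORE, so at most one store can list a given employee as its manager, and employees who manage nothing simply never appear in that column. The schemas and values below are assumptions for illustration, not the textbook's exact tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE EMPLOYEE (EMP_CODE INTEGER PRIMARY KEY)")
# EMP_CODE in STORE is the manager; UNIQUE makes the relationship 1:1,
# and allowing NULL keeps it optional (not every employee manages).
conn.execute("""CREATE TABLE STORE (
    STORE_CODE INTEGER PRIMARY KEY,
    EMP_CODE   INTEGER UNIQUE REFERENCES EMPLOYEE (EMP_CODE))""")
conn.executemany("INSERT INTO EMPLOYEE VALUES (?)", [(1,), (2,)])
conn.execute("INSERT INTO STORE VALUES (10, 1)")
try:
    conn.execute("INSERT INTO STORE VALUES (11, 1)")  # employee 1 again
    one_to_one = False
except sqlite3.IntegrityError:  # UNIQUE constraint enforces the 1:1 side
    one_to_one = True
print(one_to_one)  # True
```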
8. Draw the ERD to show the relationships among EMPLOYEE, STORE, and REGION. The Crow’s Foot ERD is shown in Figure P3.8. Remind students that you always read from the “1” side to the “M” side in any 1:M relationship, i.e., a STORE employs many EMPLOYEEs and a REGION contains many STORES. In a 1:1 relationship, you always read from the “parent” entity to the related entity. In this case, only one EMPLOYEE manages each STORE … and each STORE is managed by only one EMPLOYEE. We have shown Figure P3.8’s Visio Professional-generated ERD to include the properties of the manages relationship. Note that there is no mandatory 1:1 relationship available at this point. That’s why there is an optional relationship – the O symbol – next to the STORE entity to indicate that an employee is not necessarily a manager. Let your students know that such optional relationships will be explored in detail in Chapter 4. (Explain that you can create mandatory 1:1 relationships when you add attributes to the entity boxes and specify a mandatory data entry for those attributes that are involved in the 1:1 relationship.)
Figure P3.8 StoreCo Crow’s Foot ERD
9. Create the relational diagram to show the relationships among EMPLOYEE, STORE, and REGION. The relational diagram is shown in Figure P3.9.
Figure P3.9 The Relational Diagram
NOTE The relational diagram in Figure P3.9 was generated in Microsoft Access. If a relationship already exists between two entities, Access generates a virtual table (in this case, EMPLOYEE_1) to represent the additional relationship. The virtual table cannot be queried; its only function is to store the manages relationship between EMPLOYEE and STORE. Just how multiple relationships are stored and managed is a function of the software you use.
Use the database shown in Figure P3.10 to work Problems 10−16. Note that the database is composed of four tables that reflect these relationships: • An EMPLOYEE has only one JOB_CODE, but a JOB_CODE can be held by many EMPLOYEEs. • An EMPLOYEE can participate in many PLANs, and any PLAN can be assigned to many EMPLOYEEs. Note also that the M:N relationship has been broken down into two 1:M relationships for which the BENEFIT table serves as the composite or bridge entity.
FIGURE P3.10 The Ch03_BeneCo Database Tables
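The BENEFIT bridge described above can be sketched in sqlite3: the composite primary key (EMP_CODE + PLAN_CODE) forbids duplicate enrollments, while the two foreign keys implement the two 1:M relationships. Sample values are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE EMPLOYEE (EMP_CODE INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE PLAN (PLAN_CODE TEXT PRIMARY KEY)")
# BENEFIT is the composite (bridge) entity: each row links one
# employee to one plan, decomposing the M:N relationship.
conn.execute("""CREATE TABLE BENEFIT (
    EMP_CODE  INTEGER REFERENCES EMPLOYEE (EMP_CODE),
    PLAN_CODE TEXT    REFERENCES PLAN (PLAN_CODE),
    PRIMARY KEY (EMP_CODE, PLAN_CODE))""")
conn.executemany("INSERT INTO EMPLOYEE VALUES (?)", [(14,), (15,)])
conn.executemany("INSERT INTO PLAN VALUES (?)", [("1",), ("2",), ("3",)])
conn.executemany("INSERT INTO BENEFIT VALUES (?, ?)",
                 [(14, "1"), (14, "2"), (15, "1"), (15, "3")])
n = conn.execute("SELECT COUNT(*) FROM BENEFIT WHERE EMP_CODE = 14").fetchone()[0]
print(n)  # 2 plans for employee 14
```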
10. For each table in the database, identify the primary key and the foreign key(s). If a table does not have a foreign key, write None in the space provided.

TABLE     PRIMARY KEY            FOREIGN KEY(S)
EMPLOYEE  EMP_CODE               JOB_CODE
BENEFIT   EMP_CODE + PLAN_CODE   EMP_CODE, PLAN_CODE
JOB       JOB_CODE               None
PLAN      PLAN_CODE              None
11. Create the ERD to show the relationship between EMPLOYEE and JOB. The ERD is shown in Figure P3.11. Note that the JOB_CODE = 1 occurs twice in the EMPLOYEE table, as does the JOB_CODE = 2, thus providing evidence that a JOB can be assigned to many EMPLOYEEs. But each EMPLOYEE has only one JOB_CODE, so there exists a 1:M relationship between JOB and EMPLOYEE.
Figure P3.11 The ERD for the EMPLOYEE-JOB Relationship
12. Create the relational diagram to show the relationship between EMPLOYEE and JOB. The relational diagram is shown in Figure P3.12.
Figure P3.12 The Relational Diagram
13. Do the tables exhibit entity integrity? Answer yes or no and then explain your answer.

TABLE     ENTITY INTEGRITY  EXPLANATION
EMPLOYEE  Yes               Each EMP_CODE value is unique and there are no nulls.
BENEFIT   Yes               Each combination of EMP_CODE and PLAN_CODE values is unique and there are no nulls.
JOB       Yes               Each JOB_CODE value is unique and there are no nulls.
PLAN      Yes               Each PLAN_CODE value is unique and there are no nulls.
14. Do the tables exhibit referential integrity? Answer yes or no and then explain your answer. Write NA (Not Applicable) if the table does not have a foreign key.

TABLE     REFERENTIAL INTEGRITY  EXPLANATION
EMPLOYEE  Yes                    Each JOB_CODE value in EMPLOYEE points to an existing JOB_CODE value in JOB.
BENEFIT   Yes                    Each EMP_CODE value in BENEFIT points to an existing EMP_CODE value in EMPLOYEE, and each PLAN_CODE value in BENEFIT points to an existing PLAN_CODE value in PLAN.
JOB       NA
PLAN      NA
15. Create the ERD to show the relationships among EMPLOYEE, BENEFIT, JOB, and PLAN. The Crow’s Foot ERD is shown in Figure P3.15.
Figure P3.15 BeneCo Crow’s Foot ERD
16. Create the relational diagram to show the relationships among EMPLOYEE, BENEFIT, JOB, and PLAN. The relational diagram is shown in Figure P3.16. Note that the location of the entities is immaterial – the relationships move with the entities.
Figure P3.16 The Relational Diagram
Use the database shown in Figure P3.17 to answer Problems 17-23.
FIGURE P3.17 The Ch03_TransCo Database Tables
17. For each table, identify the primary key and the foreign key(s). If a table does not have a foreign key, write None in the space provided.

TABLE  PRIMARY KEY  FOREIGN KEY(S)
TRUCK  TRUCK_NUM    BASE_CODE, TYPE_CODE
BASE   BASE_CODE    None
TYPE   TYPE_CODE    None
NOTE The TRUCK_SERIAL_NUM could also be designated as the primary key. Because the TRUCK_NUM was designated to be the primary key, TRUCK_SERIAL_NUM is an example of a candidate key.
18. Do the tables exhibit entity integrity? Answer yes or no and then explain your answer.

TABLE  ENTITY INTEGRITY  EXPLANATION
TRUCK  Yes               The TRUCK_NUM values in the TRUCK table are all unique and there are no nulls.
BASE   Yes               The BASE_CODE values in the BASE table are all unique and there are no nulls.
TYPE   Yes               The TYPE_CODE values in the TYPE table are all unique and there are no nulls.
19. Do the tables exhibit referential integrity? Answer yes or no and then explain your answer. Write NA (Not Applicable) if the table does not have a foreign key.

TABLE  REFERENTIAL INTEGRITY  EXPLANATION
TRUCK  Yes                    The BASE_CODE values in the TRUCK table reference existing BASE_CODE values in the BASE table or they are null. (The TRUCK table's BASE_CODE is null for TRUCK_NUM = 1004.) Also, the TYPE_CODE values in the TRUCK table reference existing TYPE_CODE values in the TYPE table.
BASE   NA
TYPE   NA
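The TRUCK 1004 case illustrates that a null foreign key does not violate referential integrity: the FK must either match an existing primary key value or be null. A minimal sqlite3 sketch with assumed values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE BASE (BASE_CODE INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE TRUCK (
    TRUCK_NUM INTEGER PRIMARY KEY,
    BASE_CODE INTEGER REFERENCES BASE (BASE_CODE))""")
conn.execute("INSERT INTO BASE VALUES (501)")
conn.execute("INSERT INTO TRUCK VALUES (1003, 501)")   # FK matches a base
conn.execute("INSERT INTO TRUCK VALUES (1004, NULL)")  # null FK is legal
n_trucks = conn.execute("SELECT COUNT(*) FROM TRUCK").fetchone()[0]
print(n_trucks)  # 2 -- both rows were accepted
```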
20. Identify the TRUCK table’s candidate key(s). A candidate key is any key that could have been used as a primary key, but that was, for some reason, not chosen to be the primary key. For example, the TRUCK_SERIAL_NUM could have been selected as the PK, but the TRUCK_NUM was actually designated to be the PK. Therefore, the TRUCK_SERIAL_NUM is a candidate key. Also, any combination of attributes that would uniquely identify any truck would be a candidate key. For example, the combination of BASE_CODE, TYPE_CODE, TRUCK_MILES, and TRUCK_BUY_DATE is not likely to be duplicated and this combination would, therefore, be a candidate key. However, while the latter combination might constitute a candidate key, such a combination would not be practical. (An extreme and impractical example of a candidate key would be the combination of all of a table’s attributes.)
NOTE Some of the answers to Problem 21 show only a few of the available correct choices. For example, a superkey is, in effect, a candidate key containing redundant attributes. Therefore, any primary key plus any other attribute(s) is a superkey. Because a secondary key does not necessarily yield unique outcomes, the number of attributes that constitute a secondary key is somewhat arbitrary. The adequacy of a secondary key depends on the extent of the end user's willingness to accept multiple matches.
21. For each table, identify a superkey and a secondary key.

TRUCK
  Superkeys: TRUCK_NUM + TRUCK_MILES
             TRUCK_NUM + TRUCK_MILES + TRUCK_BUY_DATE
             TRUCK_NUM + TRUCK_MILES + TRUCK_BUY_DATE + TYPE_CODE
  Secondary key: BASE_CODE + TYPE_CODE
  (This secondary key is likely to produce multiple matches, but it is not likely that end users will know attribute values such as TRUCK_MILES or TRUCK_BUY_DATE. Therefore, the selected attributes create a reasonable secondary key.)

BASE
  Superkeys: BASE_CODE + BASE_CITY
             BASE_CODE + BASE_CITY + BASE_STATE
  Secondary key: BASE_CITY + BASE_STATE
  (This is a very effective secondary key, since it is not likely that a state contains two cities with the same name.)

TYPE
  Superkey: TYPE_CODE + TYPE_DESCRIPTION
  Secondary key: TYPE_DESCRIPTION
22. Create the ERD for this database. The Crow’s Foot ERD is shown in Figure P3.22.
Figure P3.22 TransCo Crow's Foot ERD
23. Create the relational diagram for this database. The relational diagram is shown in Figure P3.23.
Figure P3.23 The Ch03_TransCo Relational Diagram
Use the database shown in Figure P3.24 to answer Problems 24−31. ROBCOR is an aircraft charter company that supplies on-demand charter flight services using a fleet of four aircraft. Aircraft are identified by a unique registration number. Therefore, the aircraft registration number is an appropriate primary key for the AIRCRAFT table.
FIGURE P3.24 The Ch03_AviaCo Database Tables (Part 1)

The nulls in the CHARTER table’s CHAR_COPILOT column indicate that a copilot is not required for some charter trips or for some aircraft. Federal Aviation Administration (FAA) rules require a copilot on jet aircraft and on aircraft having a gross take-off weight over 12,500 pounds. None of the aircraft in the AIRCRAFT table are governed by this requirement; however, some customers may require the presence of a copilot for insurance reasons. All charter trips are recorded in the CHARTER table.
FIGURE P3.24 The Ch03_AviaCo Database Tables (Part 2)
NOTE Earlier in the chapter, it was stated that it is best to avoid homonyms and synonyms. In this problem, both the pilot and the copilot are pilots in the PILOT table, but EMP_NUM cannot be used for both in the CHARTER table. Therefore, the synonyms CHAR_PILOT and CHAR_COPILOT were used in the CHARTER table. Although the solution works in this case, it is very restrictive and it generates nulls when a copilot is not required. Worse, such nulls proliferate as crew requirements change. For example, if the AviaCo charter company grows and starts using larger aircraft, crew requirements may increase to include flight engineers and load masters. The CHARTER table would then have to be modified to include the additional crew assignments; such attributes as CHAR_FLT_ENGINEER and CHAR_LOADMASTER would have to be added to the CHARTER table. Given this change, each time a smaller aircraft flew a charter trip without the number of crew members required in larger aircraft, the missing crew members would yield additional nulls in the CHARTER table. You will have a chance to correct those design shortcomings in Problem 27. The problem illustrates two important points: 1. Don’t use synonyms. If your design requires the use of synonyms, revise the design! 2. To the greatest possible extent, design the database to accommodate growth without requiring structural changes in the database tables. Plan ahead and try to anticipate the effects of change on the database.
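The redesign the NOTE points to replaces the two synonym columns with a composite CREW table between CHARTER and EMPLOYEE, so each crew assignment is a row rather than a column. A sqlite3 sketch; the CREW_JOB attribute and sample values are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE EMPLOYEE (EMP_NUM INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE CHARTER (CHAR_TRIP INTEGER PRIMARY KEY)")
# One row per crew assignment: no CHAR_COPILOT nulls, and adding a
# flight engineer later is just another row, not a structural change.
conn.execute("""CREATE TABLE CREW (
    CHAR_TRIP INTEGER REFERENCES CHARTER (CHAR_TRIP),
    EMP_NUM   INTEGER REFERENCES EMPLOYEE (EMP_NUM),
    CREW_JOB  TEXT NOT NULL,
    PRIMARY KEY (CHAR_TRIP, EMP_NUM))""")
conn.executemany("INSERT INTO EMPLOYEE VALUES (?)", [(105,), (109,)])
conn.execute("INSERT INTO CHARTER VALUES (10003)")
conn.executemany("INSERT INTO CREW VALUES (?, ?, ?)",
                 [(10003, 105, "Pilot"), (10003, 109, "Copilot")])
crew = conn.execute("""SELECT EMP_NUM, CREW_JOB FROM CREW
                       WHERE CHAR_TRIP = 10003 ORDER BY EMP_NUM""").fetchall()
print(crew)  # [(105, 'Pilot'), (109, 'Copilot')]
```

A trip with no copilot simply has one CREW row; nothing in the structure needs to change as crew requirements grow.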
24. For each table, where possible, identify:

a. The primary key

TABLE     PRIMARY KEY
CHARTER   CHAR_TRIP
AIRCRAFT  AC_NUMBER
MODEL     MOD_CODE
PILOT     EMP_NUM
EMPLOYEE  EMP_NUM
CUSTOMER  CUS_CODE

b. A superkey

TABLE     SUPERKEY
CHARTER   CHAR_TRIP + CHAR_DATE
AIRCRAFT  AC_NUMBER + MOD_CODE
MODEL     MOD_CODE + MOD_NAME
PILOT     EMP_NUM + PIL_LICENSE
EMPLOYEE  EMP_NUM + EMP_DOB
CUSTOMER  CUS_CODE + CUS_LNAME
c. A candidate key

TABLE     CANDIDATE KEY
CHARTER   No practical candidate key is available. For example, CHAR_DATE + CHAR_DESTINATION + AC_NUMBER + CHAR_PILOT + CHAR_COPILOT will still not necessarily yield unique matches, because it is possible to fly an aircraft to the same destination twice on one date with the same pilot and copilot. You could, of course, present the argument that the combination of all the attributes would yield a unique outcome.
AIRCRAFT  See the previous discussion.
MODEL     See the previous discussion.
PILOT     See the previous discussion.
EMPLOYEE  See the previous discussion. Perhaps the combination of EMP_LNAME + EMP_FNAME + EMP_INITIAL + EMP_DOB will yield an acceptable candidate key.
CUSTOMER  See the previous discussion.

d. The foreign key(s)

TABLE     FOREIGN KEY(S)
CHARTER   CHAR_PILOT (references PILOT), CHAR_COPILOT (references PILOT), AC_NUMBER (references AIRCRAFT), CUS_CODE (references CUSTOMER)
AIRCRAFT  MOD_CODE
MODEL     None
PILOT     EMP_NUM (references EMPLOYEE)
EMPLOYEE  None
CUSTOMER  None

e. A secondary key

TABLE     SECONDARY KEY
CHARTER   CHAR_DATE + AC_NUMBER + CHAR_DESTINATION
AIRCRAFT  MOD_CODE
MODEL     MOD_MANUFACTURER + MOD_NAME
PILOT     PIL_LICENSE + PIL_MED_DATE
EMPLOYEE  EMP_LNAME + EMP_FNAME + EMP_DOB
CUSTOMER  CUS_LNAME + CUS_FNAME + CUS_PHONE
25. Create the ERD. (Hint: Look at the table contents. You will discover that an AIRCRAFT can fly many CHARTER trips but that each CHARTER trip is flown by one AIRCRAFT. Similarly, you will discover that a MODEL references many AIRCRAFT but that each AIRCRAFT references a single MODEL, etc.) The Crow’s Foot ERD is shown in Figure P3.25. The optional (default) 1:1 relationship crops up in this ERD, just as it did in the Problem 8 solution. Use the same discussion that accompanied Problem 8. Also, note that EMPLOYEE is the “parent” of PILOT. Note that all pilots are employees, but not all employees are pilots – some are mechanics, accountants, and so on. (This discussion previews some of the Chapter 4 coverage … coming attractions, so to speak.) The relationship between PILOT and EMPLOYEE is read from the “parent” entity to the related entity. In this case, the relationship is read as “an EMPLOYEE is a PILOT.”
Figure P3.25 The Ch03_AviaCo Database ERD
26. Create the relational diagram. The relational diagram is shown in Figure P3.26.
Figure P3.26 The Ch03_AviaCo Database Relational Diagram
27. Modify the ERD you created in Problem 25 to eliminate the problems created by the use of synonyms. (Hint: Modify the CHARTER table structure by eliminating the CHAR_PILOT and CHAR_COPILOT attributes; then create a composite table named CREW to link the CHARTER and EMPLOYEE tables. Some crewmembers, such as flight attendants, may not be pilots. That’s why the EMPLOYEE table enters into this relationship.) The Crow’s Foot ERD is shown in Figure P3.27.
Figure P3.27 The Ch03_AviaCo_2 Database ERD
28. Draw the relational diagram for the design you revised in Problem 27. (After you have had a chance to revise the design, your instructor will show you the results of the design change, using a copy of the revised database named Ch03_AviaCo_2.) The relational diagram for the Ch03_AviaCo_2 database is shown in Figure P3.28. Note that there are a few additional entities that you will encounter again in Chapter 4. (You can safely ignore the extra entities, RATING and EARNEDRATING, at this point … but you can let the students “read” the relationship between these two entities.) Note that you can easily derive the M:N relationship between PILOT and RATING. (A PILOT can earn many RATINGs. A RATING can be earned by many PILOTs.) Even though your students may not know what a rating is, they can still draw conclusions about its relationship to other entities by looking at relational diagrams and ERDs. And that’s one of the many strengths of design tools. Also, you can let your students break the M:N relationship down into two 1:M relationships – note that this is done through the EARNEDRATING entity. The issues encountered in the design and implementation of the Ch03_AviaCo_2 database will be revisited many times in the book.
Figure P3.28 The Ch03_AviaCo_2-Relational Diagram
You are interested in seeing data on charters flown by either Mr. Robert Williams (employee number 105) or Ms. Elizabeth Travis (employee number 109) as pilot or copilot, but not charters flown by both of them. Complete Problems 29-31 to find these data.

29. Create the table that would result from applying the SELECT and PROJECT relational operators to the CHARTER table to return only the CHAR_TRIP, CHAR_PILOT, and CHAR_COPILOT attributes for charters flown by either employee 105 or employee 109.

CHAR_TRIP  CHAR_PILOT  CHAR_COPILOT
10003      105         109
10006      109
10009      105
10010      109
10013      105
10016      109         105
10018      105         104

30. Create the table that would result from applying the SELECT and PROJECT relational operators to the CHARTER table to return only the CHAR_TRIP, CHAR_PILOT, and CHAR_COPILOT attributes for charters flown by both employee 105 and employee 109.

CHAR_TRIP  CHAR_PILOT  CHAR_COPILOT
10003      105         109
10016      109         105

31. Create the table that would result from applying a DIFFERENCE relational operator of your result from Problem 29 to your result from Problem 30.

CHAR_TRIP  CHAR_PILOT  CHAR_COPILOT
10006      109
10009      105
10010      109
10013      105
10018      105         104
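The three operations in Problems 29-31 can be mimicked with ordinary Python set comprehensions over the solution rows (None stands for an empty copilot column):

```python
# CHARTER rows (CHAR_TRIP, CHAR_PILOT, CHAR_COPILOT) from the
# Problem 29 solution table; None marks a missing copilot.
charter = [
    (10003, 105, 109), (10006, 109, None), (10009, 105, None),
    (10010, 109, None), (10013, 105, None), (10016, 109, 105),
    (10018, 105, 104),
]
crew = {105, 109}
# SELECT + PROJECT: trips flown by 105 or 109 (Problem 29)
either = {row for row in charter if row[1] in crew or row[2] in crew}
# SELECT + PROJECT: trips flown by both 105 and 109 (Problem 30)
both = {row for row in charter if {row[1], row[2]} >= crew}
# DIFFERENCE: Problem 29 result minus Problem 30 result (Problem 31)
exclusive = either - both
print(sorted(r[0] for r in exclusive))  # [10006, 10009, 10010, 10013, 10018]
```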
Chapter 4 Entity-Relationship (ER) Modeling

Discussion Focus

This chapter attempts to present a fairly balanced view of entity-relationship modeling. On the one hand, a pragmatic focus on the aspects of the ERD that directly affect the design and implementation of the database must be considered. On the other hand, properly documenting the business needs that the database must support is equally important. Therefore, we present both Crow's Foot and Chen notation ERDs and discuss the differences between them. Crow's Foot tends to be more pragmatic, while Chen notation has greater semantic content. One exercise that we have found very valuable is to have the students, after working with both notations, discuss the strengths and weaknesses of each. Among the points of comparison that we focus on in class are:
• In both notations, entities are modeled the same.
• Attributes are modeled differently – in ovals with Chen notation and in an attribute box with Crow's Foot.
• Identifiers are modeled the same.
• Single-valued attributes are modeled the same.
• Chen notation distinguishes multivalued attributes with a double line; Crow's Foot notation does not. This can be attributed to the pragmatism of Crow's Foot notation – if you wouldn't implement the design with a multivalued attribute, then don't draw the design with a multivalued attribute.
• Chen notation distinguishes derived attributes with a dotted attribute line; Crow's Foot notation does not.
• Chen notation commonly uses specific minimum and maximum cardinalities. Specific cardinalities are almost never used in practice with Crow's Foot notation, which is why most practitioners who use Crow's Foot refer to connectivity as maximum cardinality and participation as minimum cardinality.
• Both notations typically distinguish weak entities; however, as a peculiarity of the MS Visio tool, weak entities are implied through relationship strength.
Students tend to catch on to the notations rather quickly; however, giving them some simple modeling tasks in which the data model components (entities, attributes, etc.) are already identified, so that they merely have to apply the notation, can be very helpful. One very important issue that often confuses students is the resolution of M:N relationships into 1:M relationships. We have used Microsoft Visio Professional to create the ERDs in the text and in this manual. Note that ERDs are done at the conceptual level. However, MS Visio has an implementation focus. Therefore, M:N relationships are not directly supported. Instead, the designer is limited to modeling 1:1 and/or 1:M relationships.
Although M:N relationships may properly be viewed in a relational database model at the conceptual level, such relationships should not be implemented, because their existence creates undesirable redundancies. Therefore, M:N relationships must be decomposed into 1:M relationships to fit into the ER framework. For example, if you were to develop an ER model for a video rental store, you would note that tapes can be rented more than once and that customers can rent more than one tape.

To make the discussion more interesting and to address several design issues at once, explain to the students that it seems reasonable to keep in mind that newly arrived tapes that have just been entered into inventory have not yet been rented, and that some tapes may never be rented at all if there is no demand for them. Therefore, CUSTOMER is optional to TAPE. Assuming that the video store only rents videos and that a CUSTOMER entry will not exist unless a person coming into the video store actually rents that first tape, TAPE is mandatory to CUSTOMER. On the other hand, if the store has other services, such as selling movies or games, then a CUSTOMER entry could exist without having rented a video. In that case, TAPE is optional to CUSTOMER. Note that this discussion includes a very brief description of the video store's operations and some business rules. The relationship between customers and tapes would thus be M:N, as shown in Figure IM4.1.
Figure IM4.1 The M:N Relationship

A customer can rent many tapes. A tape can be rented by many customers. Some customers do not rent tapes. (Such customers might buy tapes or other items.) Therefore, TAPE is optional to CUSTOMER in the rental relationship. Some tapes are never rented. Therefore, CUSTOMER is optional to TAPE in the rental relationship.

(The figure shows the M:N “CUSTOMER rents TAPE” relationship in both Chen notation and Crow’s Foot notation.)
As you discuss the presentation in Figure IM4.1, note that the ERD reflects two business rules:
1. A CUSTOMER may rent many TAPEs.
2. A TAPE can be rented by many CUSTOMERs.
The M:N relationship depicted in Figure IM4.1 must be broken up into two 1:M relationships through the use of a bridge entity, also known as a composite entity. The composite entity, named RENTAL in the example shown in Figure IM4.2, must include at least the primary key components (CUS_NUM and TAPE_CODE) of the two entities it bridges, because the RENTAL entity’s foreign keys must point to the primary keys of the two entities CUSTOMER and TAPE.
Figure IM4.2 Decomposition of the M:N Relationship
Several points about Figure IM4.2 are worth emphasizing:
• The RENTAL entity’s PK could have been the combination TAPE_CODE + CUS_NUM. This composite PK would have created strong relationships between RENTAL and CUSTOMER and between RENTAL and TAPE. Because this composite PK was not used, it is a candidate key.
• In this case, the designer made the decision to use a single-attribute PK rather than a composite PK. Note that the RENTAL entity uses the PK RENT_NUM. It is useful to point out that single-attribute PKs are usually more desirable than composite PKs – especially when relationships must be established between RENTAL and some – as yet unidentified – entity. (You cannot use a composite PK as a foreign key in a related entity!) In addition, a composite PK makes queries less efficient – a point that will become clearer in Chapter 11, “Database Performance Tuning and Query Optimization.”
• Note the placement of the optional symbols. Because a tape that is never rented will never show up in the RENTAL entity, RENTAL has become optional to TAPE. That's why the optional symbol has migrated from CUSTOMER to the opposite side of RENTAL. Also, note the addition of a few attributes in each of the three entities to make it easier to see what is being tracked.
• Because the M:N relationship has now been decomposed into two 1:M relationships, the ER model shown in Figure IM4.2 can be implemented. However, it may be useful to remind your students that “implementable” is not necessarily synonymous with “practical” or “useful.” (We’ll modify the ERD in Figure IM4.2 after some additional discussion.)
• Remind the students that the relationships are read from the 1 side to the M side. Therefore, the relationships between CUSTOMER and RENTAL and between TAPE and RENTAL are read as
  CUSTOMER generates RENTAL
  TAPE enters RENTAL
(Compare these relationships to those generated by the two business rules above Figure IM4.1.)
• The dashed relationship lines indicate weak relationships. In this case, the RENTAL entity’s primary key is RENT_NUM – and this PK does not use any attribute from the CUSTOMER and TAPE entities. The (implied) cardinalities in Figure IM4.2 reflect the rental transactions. Each rental transaction, i.e., each record in the RENTAL table, will reference one and only one customer and one and only one tape. The (simplified!) implementation of this model may thus yield the sample data shown in the database in
Figure IM4.3. The database's relational diagram is shown in Figure IM4.4.
Figure IM4.3 The Ch04_Rental Database Tables
The relational diagram that corresponds to the design in Figure IM4.2 is shown in Figure IM4.4.
Figure IM4.4 The Ch04_Rental Database Relational Diagram
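To make the decomposition concrete, the Figure IM4.2 design can be sketched as a SQL schema. The sketch below uses Python's sqlite3 for convenience (any SQL DBMS works); the table and column names follow the figure, but the sample data values are hypothetical.

```python
import sqlite3

# Sketch of the decomposed M:N design: RENTAL carries FKs to both
# CUSTOMER and TAPE, so each rental row references exactly one
# customer and one tape.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE CUSTOMER (
    CUS_NUM    INTEGER PRIMARY KEY,
    CUS_LNAME  TEXT NOT NULL
);
CREATE TABLE TAPE (
    TAPE_CODE  TEXT PRIMARY KEY,
    TAPE_TITLE TEXT NOT NULL
);
CREATE TABLE RENTAL (
    RENT_NUM   INTEGER PRIMARY KEY,   -- single-attribute PK, as in the figure
    CUS_NUM    INTEGER NOT NULL REFERENCES CUSTOMER (CUS_NUM),
    TAPE_CODE  TEXT    NOT NULL REFERENCES TAPE (TAPE_CODE)
);
""")
conn.execute("INSERT INTO CUSTOMER VALUES (101, 'Ramas')")       # hypothetical data
conn.execute("INSERT INTO TAPE VALUES ('R2345-1', 'Once Upon a Midnight Breezy')")
conn.execute("INSERT INTO RENTAL VALUES (10050, 101, 'R2345-1')")
# A tape that is never rented simply never appears in RENTAL,
# which is why RENTAL is optional to TAPE.
rented = conn.execute("SELECT COUNT(*) FROM RENTAL").fetchone()[0]
```

Note that the composite candidate key (TAPE_CODE + CUS_NUM) was deliberately not used as the PK, matching the single-attribute-PK decision discussed above.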
The Ch04_Rental database’s TAPE and RENTAL tables contain some attributes that merit additional discussion.
• The TAPE_CODE attribute values include a “trailer” after the dash. For example, note that the third record in the TAPE table has a PK value of R2345-2. The “trailer” indicates the tape copy. For example, the “-2” trailer in the PK value R2345-2 indicates the second copy of the “Once Upon a Midnight Breezy” tape. So why include a separate TAPE_COPY attribute? This decision was made to make it easier to generate queries that make use of the tape copy value. (It’s much more difficult to use a right-string function to “strip” the tape copy value than simply to use the TAPE_COPY value. And “simple” usually translates into “fast” in a query environment – “fast” is a good thing!)
• The RENTAL table uses two dates: RENT_DATE_OUT and RENT_DATE_RETURN. This decision leaves the RENT_DATE_RETURN value null until the tape is returned. Naturally, such nulls can be avoided by creating an additional table in which the return date is not a named attribute. Note the following few check-in and check-out transactions:

RENT_NUM   TRANS_TYPE    TRANS_DATE
10050      Checked-out   10-Jan-2018
10050      Returned      11-Jan-2018
10051      Checked-out   10-Jan-2018
10051      Returned      11-Jan-2018
10052      Checked-out   11-Jan-2018
10053      Returned      10-Jan-2018
…..        ……………         ……………
The decision to leave the RENT_DATE_RETURN date in the RENTAL table – and to leave its value null until the tape is returned – is, again, up to the designer, who evaluates the design according to often competing goals: simplicity, elegance, reporting capability, query speed, index management, and so on. (Remind your students that they, too, should be able to evaluate such decisions as they gain database design knowledge.)

Discuss with your students that Figure IM4.4's database is not quite ready for prime time. For example, its structure allows customers to rent only one tape per rental transaction. Therefore, you'd have to generate a separate rental transaction for each tape rented by a customer. (In other words, if a customer rents five tapes at a time, you'd have to generate five separate rentals.) Clearly, the design would be much improved by expanding it to include rental lines, as is done in Figure IM4.5.
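The null-until-returned choice also makes "which tapes are still out?" a trivial query, since open rentals are exactly the rows whose return date is NULL. A minimal sketch (SQLite via Python; the rental rows are hypothetical, echoing the transaction dates above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE RENTAL (
    RENT_NUM         INTEGER PRIMARY KEY,
    RENT_DATE_OUT    TEXT NOT NULL,
    RENT_DATE_RETURN TEXT               -- stays NULL until the tape comes back
)""")
conn.executemany("INSERT INTO RENTAL VALUES (?, ?, ?)", [
    (10050, '2018-01-10', '2018-01-11'),
    (10051, '2018-01-10', '2018-01-11'),
    (10052, '2018-01-11', None),        # still checked out
])
# Outstanding rentals: just test the return date for NULL.
rows = conn.execute(
    "SELECT RENT_NUM FROM RENTAL WHERE RENT_DATE_RETURN IS NULL").fetchall()
outstanding = [r[0] for r in rows]
```

The separate check-in/check-out transaction table sketched in the text would avoid these nulls at the cost of a join whenever both dates are needed, which is exactly the kind of trade-off the designer must weigh.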
Figure IM4.5 Implementable Video Rental ERD
NOTE As you discuss the ERD in Figure IM4.5, note the use of optionalities. For example, a tape may be in the rental inventory, but there is no guarantee that any customer will ever rent it. In short, a tape that has never been rented will not show up in any rental transaction, thus never appearing in the RENT_LINE entity. Therefore, RENT_LINE is optional to TAPE. Generally, if a relationship has not been defined as mandatory or cannot reasonably be assumed to be mandatory, it is considered to be optional.
What role does the ER diagram play in the design process? A completed ER diagram is the actual blueprint of the database. Its composition must reflect an organization's operations accurately if the database is to meet that organization's data requirements. It forms the basis for a final check on whether the included entities are appropriate and sufficient, on the attributes found within those entities, and on the relationships between those entities. It is also used as a final crosscheck against the proposed data dictionary entries. The completed ER diagram also lets the designer communicate more precisely with those who commissioned the database design. Finally, the completed ER diagram serves as the implementation guide to those who create the actual database. In short, the ER diagram is as important to the database designer as a blueprint is to the architect and builder.
Answers to Review Questions

1. What two conditions must be met before an entity can be classified as a weak entity? Give an example of a weak entity.

To be classified as a weak entity, two conditions must be met:
1. The entity must be existence-dependent on its parent entity.
2. The entity must inherit at least part of its primary key from its parent entity.
For example, the (strong) relationship depicted in the text’s Figure 4.9 shows a weak CLASS entity:
1. CLASS is clearly existence-dependent on COURSE. (You can’t have a database class unless a database course exists.)
2. The CLASS entity’s PK is defined through the combination of CLASS_SECTION and CRS_CODE. The CRS_CODE attribute is also the PK of COURSE.
The conditions that define a weak entity are the same as those for a strong relationship between an entity and its parent. In short, the existence of a weak entity produces a strong relationship. And if the entity is strong, its relationship to the other entity is weak. (Note the solid relationship line in the text’s Figure 4.9.)
Keep in mind that whether an entity is weak usually depends on the database designer’s decisions. For instance, if the database designer had decided to use a single-attribute PK as shown in the text’s Figure 4.8, the CLASS entity would be strong. (The CLASS entity’s PK is CLASS_CODE, which is not derived from the COURSE entity.) In this case, the relationship between COURSE and CLASS is weak. (Note the dashed relationship line in the text’s Figure 4.8.) However, regardless of how the designer classifies the relationship – weak or strong – CLASS is always existence-dependent on COURSE.

2. What is a strong (or identifying) relationship, and how is it depicted in a Crow’s Foot ERD?

A strong relationship exists when an entity is existence-dependent on another entity and inherits at least part of its primary key from that entity. The Visio Professional software shows the strong relationship as a solid line.
In other words, a strong relationship exists when a weak entity is related to its parent entity. (Note the discussion in question 1.)
3. Given the business rule “an employee may have many degrees,” discuss its effect on attributes, entities, and relationships. (Hint: Remember what a multivalued attribute is and how it might be implemented.)

Suppose that an employee has the following degrees: BA, BS, and MBA. These degrees could be stored in a single string as a multivalued attribute named EMP_DEGREE in an EMPLOYEE table such as the one shown next:

EMP_NUM   EMP_LNAME   EMP_DEGREE
123       Carter      AA, BBA
124       O’Shanski   BBA, MBA, Ph.D.
125       Jones       AS
126       Ortez       BS, MS
Although the preceding solution has no obvious design flaws, it is likely to yield reporting problems. For example, suppose you want to get a count of all employees who have BBA degrees. You could, of course, do an “in-string” search to find all of the BBA values within the EMP_DEGREE strings. But such a solution is cumbersome from a reporting point of view. Query simplicity is a valuable thing to application developers – and to end users who like maximum query execution speeds. Database designers ought to pay attention to the competing database interests that exist in the data environment. One – very poor – solution is to create a field for each expected value. This “solution” is shown next:

EMP_NUM   EMP_LNAME   EMP_DEGREE1   EMP_DEGREE2   EMP_DEGREE3
123       Carter      AA            BBA
124       O’Shanski   BBA           MBA           Ph.D.
125       Jones       AS
126       Ortez       BS            MS
This “solution” yields nulls for all employees who have fewer than three degrees. And if even one employee earns a fourth degree, the table structure must be altered to accommodate the new data value. (One piece of evidence of poor design is the need to alter table structures in response to the need to add data of an existing type.) In addition, query simplicity is not enhanced by the fact that any degree can be listed in any column. For example, a BA degree might be listed in the second column, after an associate of arts (AA) degree has been entered in EMP_DEGREE1. One might simplify the query environment by creating a set of attributes that define the data entry, thus producing the following results:

EMP_NUM  EMP_LNAME  EMP_AA  EMP_AS  EMP_BA  EMP_BS  EMP_BBA  EMP_MS  EMP_MBA  EMP_PhD
123      Carter     X                               X
124      O’Shanski                                  X                X        X
125      Jones              X
126      Ortez                              X                X

This “solution” clearly proliferates the nulls at an ever-increasing pace.
The only reasonable solution is to create a new DEGREE entity that stores each degree in a separate record, thus producing the following tables. (There is a 1:M relationship between EMPLOYEE and DEGREE. Note that the EMP_NUM can occur more than once in the DEGREE table. The DEGREE table’s PK is EMP_NUM + DEGREE_CODE. This solution also makes it possible to record the date on which the degree was earned, the institution from which it was earned, and so on.)

Table name: EMPLOYEE
EMP_NUM   EMP_LNAME
123       Carter
124       O’Shanski
125       Jones
126       Ortez

Table name: DEGREE
EMP_NUM   DEGREE_CODE   DEGREE_DATE   DEGREE_PLACE
123       AA            May-1999      Lake Sumter CC
123       BBA           Aug-2004      U. of Georgia
124       BBA           Dec-1990      U. of Toledo
124       MBA           May-2001      U. of Michigan
124       Ph.D.         Dec-2005      U. of Tennessee
125       AS            Aug-2002      Valdosta State
126       BS            Dec-1989      U. of Missouri
126       MS            May-2002      U. of Florida
Note that this solution leaves no nulls, produces a simple query environment, and makes it unnecessary to alter the table structure when employees earn additional degrees. (You can make the environment even more flexible by naming the new entity QUALIFICATION, thus making it possible to store degrees, certifications, and other useful data that define an employee’s qualifications.)

4. What is a composite entity, and when is it used?

A composite entity is generally used to transform M:N relationships into 1:M relationships. (Review the discussion that accompanied Figures IM4.3 through IM4.5.) A composite entity, also known as a bridge entity, is one that has a primary key composed of multiple attributes. The PK attributes are inherited from the entities that it relates to one another.
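The reporting payoff of the normalized DEGREE design can be demonstrated directly: counting BBA holders becomes a plain equality test instead of an in-string search. A sketch in SQLite (an illustrative choice; the data rows mirror the sample tables above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMPLOYEE (
    EMP_NUM   INTEGER PRIMARY KEY,
    EMP_LNAME TEXT NOT NULL
);
CREATE TABLE DEGREE (
    EMP_NUM     INTEGER NOT NULL REFERENCES EMPLOYEE (EMP_NUM),
    DEGREE_CODE TEXT    NOT NULL,
    DEGREE_DATE TEXT,
    PRIMARY KEY (EMP_NUM, DEGREE_CODE)   -- one row per degree earned
);
""")
conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?)",
                 [(123, 'Carter'), (124, "O'Shanski"), (125, 'Jones')])
conn.executemany("INSERT INTO DEGREE (EMP_NUM, DEGREE_CODE) VALUES (?, ?)", [
    (123, 'AA'), (123, 'BBA'),
    (124, 'BBA'), (124, 'MBA'), (124, 'PhD'),
    (125, 'AS'),
])
# "How many employees hold a BBA?" -- no string parsing needed.
bba_count = conn.execute(
    "SELECT COUNT(*) FROM DEGREE WHERE DEGREE_CODE = 'BBA'").fetchone()[0]
```

Adding a fourth degree for any employee is one new row, with no table alteration, which is the design point the answer above makes.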
5. Suppose you are working within the framework of the conceptual model in Figure Q4.5.
Figure Q4.5 The Conceptual Model for Question 5
Given the conceptual model in Figure Q4.5:

a. Write the business rules that are reflected in it.

Even a simple ERD such as the one shown in Figure Q4.5 is based on many business rules. Make sure that each business rule is written on a separate line and that all of its details are spelled out. In this case, the business rules are derived from the ERD in a “reverse-engineering” procedure designed to document the database design. In a real-world database design situation, the ERD is generated on the basis of business rules that are written before the first entity box is drawn. (Remember that the business rules are derived from a carefully and precisely written description of operations.) Given the ERD shown in Figure Q4.5, you can identify the following business rules:
1. A customer can own many cars.
2. Some customers do not own cars.
3. A car is owned by one and only one customer.
4. A car may generate one or more maintenance records.
5. Each maintenance record is generated by one and only one car.
6. Some cars have not (yet) generated a maintenance procedure.
7. Each maintenance procedure can use many parts. (Comment: A maintenance procedure may include multiple maintenance actions, each one of which may or may not use parts. For example, a 10,000-mile check may include the installation of a new oil filter and a new air filter. But tightening an alternator belt does not require a part.)
8. A part may be used in many maintenance records. (Comment: Each time an oil change is made, an oil filter is used. Therefore, many oil filters may be used during some period of time. Naturally, you are not using the same oil filter each time – but the part classified as “oil filter” shows up in many maintenance records as time passes.)
9. Each maintenance procedure generates one or more maintenance lines.
10. Each part may appear in many maintenance lines. (Review the comment in business rule 8.)

Note that the apparent M:N relationship between MAINTENANCE and PART has been resolved through the use of the composite entity named MAINT_LINE. The MAINT_LINE entity ensures that the M:N relationship between MAINTENANCE and PART has been broken up to produce the two 1:M relationships shown in business rules 9 and 10.
As you review business rules 9 and 10, use the following two tables to show some sample data entries. For example, take a look at the (simplified) contents of the following MAINTENANCE and LINE tables and note that the MAINT_NUM 10001 occurs three times in the LINE table:

Sample MAINTENANCE Table Data
MAINT_NUM   MAINT_DATE
10001       15-Mar-2018
10002       15-Mar-2018
10003       16-Mar-2018

Sample LINE Table Data
MAINT_NUM   LINE_NUM   LINE_DESCRIPTION          LINE_PART   LINE_UNITS
10001       1          Replace fuel filter       FF-015      1
10001       2          Replace air filter        AF-1187     1
10001       3          Tighten alternator belt   NA          0
10002       1          Replace taillight bulbs   BU-2145     2
10003       1          Replace oil filter        OF-2113     1
10003       2          Replace air filter        AF-1187     1
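The MAINTENANCE/LINE pairing above is a classic weak (composite) entity: LINE's PK combines the parent's MAINT_NUM with a line number. A minimal SQLite sketch (illustrative; table names follow the sample data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MAINTENANCE (
    MAINT_NUM  INTEGER PRIMARY KEY,
    MAINT_DATE TEXT NOT NULL
);
CREATE TABLE LINE (
    MAINT_NUM        INTEGER NOT NULL REFERENCES MAINTENANCE (MAINT_NUM),
    LINE_NUM         INTEGER NOT NULL,
    LINE_DESCRIPTION TEXT,
    PRIMARY KEY (MAINT_NUM, LINE_NUM)   -- weak entity: PK borrows MAINT_NUM
);
""")
conn.execute("INSERT INTO MAINTENANCE VALUES (10001, '2018-03-15')")
conn.executemany("INSERT INTO LINE VALUES (?, ?, ?)", [
    (10001, 1, 'Replace fuel filter'),
    (10001, 2, 'Replace air filter'),
    (10001, 3, 'Tighten alternator belt'),
])
# MAINT_NUM 10001 occurs three times in LINE, matching the sample data.
lines_for_10001 = conn.execute(
    "SELECT COUNT(*) FROM LINE WHERE MAINT_NUM = 10001").fetchone()[0]
```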
b. Identify all of the cardinalities.

The Visio-generated Crow’s Foot ERD, shown in Figure Q4.5, does not show cardinalities directly. Instead, the cardinalities are implied through the Crow’s Foot symbols. You might write the cardinality (0,N) next to the MAINT_LINE entity in its relationship with the PART entity to indicate that a part might occur “N” times in the maintenance line entity or that it might never show up in the maintenance line entity. The latter case would occur if a given part has never been used in maintenance.

6. What is a recursive relationship? Give an example.

A recursive relationship exists when an entity is related to itself. For example, a COURSE may be a prerequisite to a COURSE. (See Section 4.1.J, “Recursive Relationships,” for additional examples.)
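The COURSE prerequisite example can be implemented with a foreign key that points back at the same table's primary key. A sketch in SQLite (the CRS_PREREQ column name and the sample rows are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
# Recursive 1:M relationship: CRS_PREREQ references the same table's PK.
# A course with no prerequisite stores NULL.
conn.execute("""
CREATE TABLE COURSE (
    CRS_CODE   TEXT PRIMARY KEY,
    CRS_TITLE  TEXT NOT NULL,
    CRS_PREREQ TEXT REFERENCES COURSE (CRS_CODE)
)""")
conn.executemany("INSERT INTO COURSE VALUES (?, ?, ?)", [
    ('CIS-380', 'Database Techniques I', None),
    ('CIS-490', 'Database Techniques II', 'CIS-380'),
])
# A self-join retrieves the prerequisite's title for CIS-490.
prereq = conn.execute("""
    SELECT P.CRS_TITLE
    FROM COURSE C JOIN COURSE P ON C.CRS_PREREQ = P.CRS_CODE
    WHERE C.CRS_CODE = 'CIS-490'""").fetchone()[0]
```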
7. How would you (graphically) identify each of the following ERM components in a Crow’s Foot model? The answers to questions (a) through (d) are illustrated with the help of Figure Q4.7.
FIGURE Q4.7 Crow’s Foot ERM Components
[Figure: two STUDENT entity boxes – a simplified Crow’s Foot entity box with no attribute component, and a full entity box listing STU_NUM (PK), STU_LNAME, STU_FNAME, STU_INITIAL, and DEPT_CODE (FK) – plus a Crow’s Foot connectivity symbol with an implied (0,N) cardinality, a dashed line marking a weak relationship, and a solid line marking a strong relationship.]
a. an entity

An entity is represented by a rectangle containing the entity name. (Remember that, in ER modeling, the word "entity" actually refers to the entity set.) The Crow’s Foot ERD – as represented in Visio Professional – does not distinguish among the various entity types such as weak entities and composite entities. Instead, the Crow’s Foot ERD uses relationship types – strong or weak – to indicate the nature of the relationships between entities. For example, a strong relationship indicates the existence of a weak entity. A composite entity is defined by the fact that at least one of its PK attributes is also a foreign key. Therefore, the Visio Crow’s Foot ERD’s composite and weak entities are not differentiated – whether an entity is weak or composite depends on the definition of the business rule(s) that describe the relationships. In any case, two conditions must be met before an entity can be classified as weak:
1. The entity must be existence-dependent on its parent entity.
2. The entity must inherit at least part of its primary key from its parent entity.
b. the cardinality (0,N)

Cardinalities are implied through the use of Crow’s Foot symbols. For example, note the implied (0,N) cardinality in Figure Q4.7.

c. a weak relationship

A weak relationship exists when the PK of the related entity does not contain at least one of the PK attributes of the parent entity. For example, if the PK of a COURSE entity is CRS_CODE and the PK of the related CLASS entity is CLASS_CODE, the relationship between COURSE and CLASS is weak. (Note that the CLASS PK does not include the CRS_CODE attribute.) A weak relationship is indicated by a dashed line in the (Visio) ERD.

d. a strong relationship

A strong relationship exists when the PK of the related entity contains at least one of the PK attributes of the parent entity. For example, if the PK of a COURSE entity is CRS_CODE and the PK of the related CLASS entity is CRS_CODE + CLASS_SECTION, the relationship between COURSE and CLASS is strong. (Note that the CLASS PK includes the CRS_CODE attribute.) A strong relationship is indicated by a solid line in the (Visio) ERD.

8. Discuss the difference between a composite key and a composite attribute. How would each be indicated in an ERD?

A composite key is one that consists of more than one attribute. If the ER diagram contains the attribute names for each of its entities, a composite key is indicated in the ER diagram by the fact that more than one attribute name is underlined to indicate its participation in the primary key. A composite attribute is one that can be subdivided to yield meaningful attributes for each of its components. For example, the composite attribute CUS_NAME can be subdivided to yield the CUS_FNAME, CUS_INITIAL, and CUS_LNAME attributes. There is no ER convention that enables us to indicate that an attribute is a composite attribute.

9. What two courses of action are available to a designer when encountering a multivalued attribute?
The discussion that accompanies the answer to question 3 is valid as an answer to this question. For additional insight, see the discussion in Section 4-1b, in particular Figures 4.3, 4.4, and 4.5 and Table 4.1.

10. What is a derived attribute? Give an example.

A derived attribute is an attribute whose value is calculated (derived) from other attributes. The derived attribute need not be physically stored within the database; instead, it can be derived by using an algorithm. For example, an employee’s age, EMP_AGE, may be found by computing the integer value of the difference between the current date and the EMP_DOB. If you use MS Access, you would use INT((DATE() – EMP_DOB)/365).
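The same derivation can be sketched in Python. Note that comparing month/day directly avoids the small leap-year drift that dividing by 365 introduces; the function name and dates here are illustrative, not from the text:

```python
from datetime import date

def emp_age(emp_dob: date, today: date) -> int:
    """Derived attribute: whole years elapsed between emp_dob and today."""
    years = today.year - emp_dob.year
    if (today.month, today.day) < (emp_dob.month, emp_dob.day):
        years -= 1           # birthday not yet reached this year
    return years

# Computed on demand -- the age is never stored in the database.
age = emp_age(date(1990, 6, 15), date(2018, 3, 15))
```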
Similarly, a sales clerk's total gross pay may be computed by adding a computed sales commission to base pay. For instance, if the sales clerk's commission is 1%, the gross pay may be computed by

EMP_GROSSPAY = INV_SALES*0.01 + EMP_BASEPAY

Or the invoice line item amount may be calculated by

LINE_TOTAL = LINE_UNITS*PROD_PRICE

11. How is a relationship between entities indicated in an ERD? Give an example, using the Crow’s Foot notation.

Use Figure Q4.7 as the basis for your answer. Note the distinction between the dashed and solid relationship lines, then tie this distinction to the answers to questions 7c and 7d.

12. Discuss two ways in which the 1:M relationship between COURSE and CLASS can be implemented. (Hint: Think about relationship strength.)

Note the discussion about weak and strong entities in questions 7c and 7d. Then follow up with this discussion: The relationship is implemented as strong when the CLASS entity’s PK contains the COURSE entity’s PK. For example,

COURSE(CRS_CODE, CRS_TITLE, CRS_DESCRIPTION, CRS_CREDITS)
CLASS(CRS_CODE, CLASS_SECTION, CLASS_TIME, CLASS_PLACE)

Note that the CLASS entity’s PK is CRS_CODE + CLASS_SECTION – and that the CRS_CODE component of this PK has been “borrowed” from the COURSE entity. (Because CLASS is existence-dependent on COURSE and uses a PK component from its parent (COURSE) entity, the CLASS entity is weak in this strong relationship between COURSE and CLASS.) The Visio Crow’s Foot ERD shows a strong relationship as a solid line. (See Figure Q4.12a.) Visio refers to a strong relationship as an identifying relationship.
Figure Q4.12a Strong COURSE and CLASS Relationship
Sample data are shown next:

Table name: COURSE
CRS_CODE   CRS_TITLE                CRS_DESCRIPTION                                                     CRS_CREDITS
ACCT-211   Basic Accounting         An introduction to accounting. Required of all business majors.     3
CIS-380    Database Techniques I    Database design and implementation issues. Uses CASE tools to
                                    generate designs that are then implemented in a major database
                                    management system.                                                  3
CIS-490    Database Techniques II   The second half of CIS-380. Basic Web database application
                                    development and management issues.                                  4

Table name: CLASS
CRS_CODE   CLASS_SECTION   CLASS_TIME                      CLASS_PLACE
ACCT-211   1               8:00 a.m. – 9:30 a.m. T-Th.     Business 325
ACCT-211   2               8:00 a.m. – 8:50 a.m. MWF       Business 325
ACCT-211   3               8:00 a.m. – 8:50 a.m. MWF       Business 402
CIS-380    1               11:00 a.m. – 11:50 a.m. MWF     Business 415
CIS-380    2               3:00 p.m. – 3:50 p.m. MWF       Business 398
CIS-490    1               1:00 p.m. – 3:00 p.m. MW        Business 398
CIS-490    2               6:00 p.m. – 10:00 p.m. Th.      Business 398
The relationship is implemented as weak when the CLASS entity’s PK does not contain the COURSE entity’s PK. For example, COURSE(CRS_CODE, CRS_TITLE, CRS_DESCRIPTION, CRS_CREDITS) CLASS(CLASS_CODE, CRS_CODE, CLASS_SECTION, CLASS_TIME, CLASS_PLACE) (Note that CRS_CODE is no longer part of the CLASS PK, but that it continues to serve as the FK to COURSE.) The Visio Crow’s Foot ERD shows a weak relationship as a dashed line. (See Figure Q4.12b.) Visio refers to a weak relationship as a non-identifying relationship.
Figure Q4.12b Weak COURSE and CLASS Relationship
Given the weak relationship depicted in Figure Q4.12b, the CLASS table contents would look like this:

Table name: CLASS
CLASS_CODE   CRS_CODE   CLASS_SECTION   CLASS_TIME                      CLASS_PLACE
21151        ACCT-211   1               8:00 a.m. – 9:30 a.m. T-Th.     Business 325
21152        ACCT-211   2               8:00 a.m. – 8:50 a.m. MWF       Business 325
21153        ACCT-211   3               8:00 a.m. – 8:50 a.m. MWF       Business 402
38041        CIS-380    1               11:00 a.m. – 11:50 a.m. MWF     Business 415
38042        CIS-380    2               3:00 p.m. – 3:50 p.m. MWF       Business 398
49041        CIS-490    1               1:00 p.m. – 3:00 p.m. MW        Business 398
49042        CIS-490    2               6:00 p.m. – 10:00 p.m. Th.      Business 398
The advantage of the second CLASS entity version is that its PK can be referenced easily as a FK in another related entity such as ENROLL. Using a single-attribute PK makes implementation easier. This is especially true when the entity represents the “1” side in one or more relationships. In general, it is advisable to avoid composite PKs whenever it is practical to do so.
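The two CLASS implementations discussed in question 12 can be placed side by side in SQL. In the sketch below (SQLite for convenience; the CLASS_STRONG/CLASS_WEAK table names are invented here purely to show both variants in one schema), the strong version borrows CRS_CODE into its PK, while the weak version keeps a single-attribute PK and an ordinary FK:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE COURSE (
    CRS_CODE  TEXT PRIMARY KEY,
    CRS_TITLE TEXT NOT NULL
);
-- Strong (identifying) version: CRS_CODE is both part of the PK
-- and the FK back to COURSE.
CREATE TABLE CLASS_STRONG (
    CRS_CODE      TEXT    NOT NULL REFERENCES COURSE (CRS_CODE),
    CLASS_SECTION INTEGER NOT NULL,
    PRIMARY KEY (CRS_CODE, CLASS_SECTION)
);
-- Weak (non-identifying) version: single-attribute PK, with
-- CRS_CODE kept only as an ordinary FK.
CREATE TABLE CLASS_WEAK (
    CLASS_CODE    INTEGER PRIMARY KEY,
    CRS_CODE      TEXT    NOT NULL REFERENCES COURSE (CRS_CODE),
    CLASS_SECTION INTEGER NOT NULL
);
""")
conn.execute("INSERT INTO COURSE VALUES ('ACCT-211', 'Basic Accounting')")
conn.execute("INSERT INTO CLASS_STRONG VALUES ('ACCT-211', 1)")
conn.execute("INSERT INTO CLASS_WEAK VALUES (21151, 'ACCT-211', 1)")
sections = conn.execute(
    "SELECT COUNT(*) FROM CLASS_WEAK WHERE CRS_CODE = 'ACCT-211'").fetchone()[0]
```

Both versions enforce the 1:M relationship; the weak version's single-attribute CLASS_CODE is simply easier to reference from further entities such as ENROLL.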
13. How is a composite entity represented in an ERD, and what is its function? Illustrate the Crow’s Foot model.

The label "composite" is based on the fact that the composite entity contains at least the primary key attributes of each of the entities that are connected by it. The composite entity is an important component of the ER model because relational database models should not contain M:N relationships – and the composite entity can be used to break up such relationships into 1:M relationships. Remind students to heed the advice given in the answer to the previous question: avoid composite PKs whenever it is practical to do so. Note that the CLASS entity structure shown in Figure Q4.12b is far better than that of the CLASS entity structure shown in Figure Q4.12a. Suppose, for example, that you want to design a class enrollment entity to serve as the “bridge” between STUDENT and CLASS in the M:N relationship defined by these two business rules:
• A student can take many classes.
• Each class can be taken by many students.
In this case, you could create a (composite) entity named ENROLL to link CLASS and STUDENT, using these structures:

STUDENT(STU_NUM, STU_LNAME, …)
ENROLL(STU_NUM, CLASS_NUM, ENROLL_GRADE, …)
CLASS(CLASS_CODE, CRS_CODE, CLASS_SECTION, CLASS_TIME, CLASS_PLACE)

Your students might argue that a composite PK in ENROLL does no harm, since it is not likely to be related to another entity in the typical academic database setting. Although that is a good observation, you would run into a problem in the event that something later requires a relationship between ENROLL and another entity. In any case, you may simplify the creation of future
relationships if you create an “artificial” single-attribute PK such as ENROLL_NUM, while maintaining STU_NUM and CLASS_NUM as FK attributes. In other words:

ENROLL(ENROLL_NUM, STU_NUM, CLASS_NUM, ENROLL_GRADE, …)

The ENROLL_NUM attribute values can easily be generated through the proper use of SQL code or application software, thus eliminating the need for data entry by humans. The use of composite vs. single-attribute PKs is worth discussing. Composite PKs are frequently encountered in composite entities, and your students will see that MS Visio generates composite PKs automatically when you classify a relationship as strong. Composite PKs are not “wrong” in any sense, but minimizing their use does make the implementation of multiple relationships simpler – and simple is generally a good thing!
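The "system-generated surrogate key" idea can be sketched directly. In SQLite an INTEGER PRIMARY KEY auto-generates values, so ENROLL_NUM never requires human data entry; the UNIQUE constraint preserves the natural candidate key (STU_NUM + CLASS_CODE). The sample row values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE ENROLL (
    ENROLL_NUM   INTEGER PRIMARY KEY,   -- system-generated surrogate PK
    STU_NUM      INTEGER NOT NULL,
    CLASS_CODE   INTEGER NOT NULL,
    ENROLL_GRADE TEXT,
    UNIQUE (STU_NUM, CLASS_CODE)        -- the natural candidate key survives
)""")
# ENROLL_NUM is omitted from the INSERTs; SQLite assigns 1, 2, ...
conn.execute("INSERT INTO ENROLL (STU_NUM, CLASS_CODE, ENROLL_GRADE) "
             "VALUES (321452, 21151, 'A')")
conn.execute("INSERT INTO ENROLL (STU_NUM, CLASS_CODE, ENROLL_GRADE) "
             "VALUES (321452, 38041, 'B')")
first_key = conn.execute("SELECT MIN(ENROLL_NUM) FROM ENROLL").fetchone()[0]
```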
NOTE Because composite entities are frequently encountered in the real-world environment, we continue to use them in the text and in many of our exercises and examples. However, the words of caution about their use should be repeated from time to time, and you might ask your students to convert such composite entities.

Let’s examine another example of the use of composite entities. Suppose that a trucking company keeps a log of its trucking operations to keep track of its driver/truck assignments. The company may assign any given truck to any given driver many times and, as time passes, each driver may be assigned to drive many of the company's trucks. Since this M:N relationship should not be implemented, we create the composite entity named LOG, whose attributes are defined by the end-user information requirements. In this case, it may be useful to include LOG_DATE, TRUCK_NUM, DRIVER_NUM, LOG_TIME_OUT, and LOG_TIME_IN. Note that the LOG's TRUCK_NUM and DRIVER_NUM attributes are the LOG's foreign keys. The TRUCK_NUM and DRIVER_NUM attribute values provide the bridge to TRUCK and DRIVER, respectively. In other words, to form a proper bridge between TRUCK and DRIVER, the composite LOG entity must contain at least the primary keys of the entities connected by it.

You might think that the combination of the composite entity’s foreign keys may be designated to be the composite entity's primary key. However, this combination will not produce unique values over time. For example, the same driver may drive a given truck on different dates. Adding the date to the PK attributes will solve that problem. But we still have a non-unique outcome when the same driver drives a given truck twice on the same date. Adding a time attribute will finally create a unique set of PK attribute values – but the PK is now composed of four attributes: TRUCK_NUM, DRIVER_NUM, LOG_DATE, and LOG_TIME_OUT.
(The combination of these attributes yields a unique outcome, because the same driver cannot check out two trucks at the same time on a given date.) Because multi-attribute PKs may be difficult to manage, it is often advisable to create an “artificial” single-attribute PK, such as LOG_NUM, to uniquely identify each record in the LOG table. (Access users can define such an attribute to be an “autonumber” to ensure that the system will generate unique LOG_NUM values for each record.) Note that this solution produces a LOG table that contains two candidate keys: the designated primary key and the combination of foreign keys that could have served as the primary key.
While the preceding solution simplifies the PK definition, it does not prevent the creation of duplicate records that merely have a different LOG_NUM value. Note, for example, the first two records in the following table:

LOG_NUM   LOG_DATE      TRUCK_NUM   DRIVER_NUM   LOG_TIME_OUT   LOG_TIME_IN
10015     12-Mar-2014   322453      1215         07:18 a.m.     04:23 p.m.
10016     12-Mar-2014   322453      1215         07:18 a.m.     04:23 p.m.
10017     12-Mar-2014   545567      1298         08:12 a.m.     09:15 p.m.
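Guarding the surrogate-key LOG table against exactly this kind of duplicate can be sketched as follows (SQLite for illustration; a unique index over the four-attribute candidate key rejects the second, identical assignment):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE LOG (
    LOG_NUM      INTEGER PRIMARY KEY,   -- artificial single-attribute PK
    TRUCK_NUM    INTEGER NOT NULL,
    DRIVER_NUM   INTEGER NOT NULL,
    LOG_DATE     TEXT NOT NULL,
    LOG_TIME_OUT TEXT NOT NULL,
    LOG_TIME_IN  TEXT
)""")
# The four-attribute candidate key still guards against duplicate
# rows that differ only in LOG_NUM.
conn.execute("""
CREATE UNIQUE INDEX LOG_NATURAL_KEY
    ON LOG (TRUCK_NUM, DRIVER_NUM, LOG_DATE, LOG_TIME_OUT)""")
conn.execute("INSERT INTO LOG (TRUCK_NUM, DRIVER_NUM, LOG_DATE, LOG_TIME_OUT) "
             "VALUES (322453, 1215, '2014-03-12', '07:18')")
try:
    # Same truck, driver, date, and time out: the index rejects it.
    conn.execute("INSERT INTO LOG (TRUCK_NUM, DRIVER_NUM, LOG_DATE, LOG_TIME_OUT) "
                 "VALUES (322453, 1215, '2014-03-12', '07:18')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```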
To avoid such duplicate records, you can create a unique index on TRUCK_NUM + DRIVER_NUM + LOG_DATE + LOG_TIME_OUT.

Composite entities may be named to reflect their component entities. For example, an employee may have several insurance policies (life, dental, accident, health, etc.), and each insurance policy may be held by many employees. This M:N relationship is converted to a set of two 1:M relationships by creating a composite entity named EMP_INS. The EMP_INS entity must contain at least the primary key components of each of the two entities connected by it. How many additional attributes are kept in the composite entity depends on the end-user information requirements.

14. What three (often conflicting) database requirements must be addressed in database design?

Database design must reconcile the following requirements:
a. Design elegance requires that the design adhere to design rules concerning nulls, derived attributes, redundancies, relationship types, and so on.
b. Information requirements are dictated by the end users.
c. Operational (transaction) speed requirements are also dictated by the end users.
Clearly, an elegant database design that fails to address end-user information requirements, or one that forms the basis for an implementation whose use progresses at a snail's pace, has little practical use.

15. Briefly, but precisely, explain the difference between single-valued attributes and simple attributes. Give an example of each.

A single-valued attribute is one that can have only one value. For example, a person has only one first name and only one social security number. A simple attribute is one that cannot be decomposed into its component pieces. For example, a person's sex is classified as either M or F, and there is no reasonable way to decompose M or F. Similarly, a person's first name cannot be decomposed into meaningful components.
(In contrast, if a phone number includes the area code, it can be decomposed into the area code and the phone number. And a person's name may be decomposed into a first name, an initial, and a last name.) Single-valued attributes are not necessarily simple. For example, an inventory code HWPRIJ23145 may refer to a classification scheme in which HW indicates Hardware, PR indicates Printer, IJ indicates Inkjet, and 23145 indicates an inventory control number. Therefore,
HWPRIJ23145 may be decomposed into its component parts... even though it is single-valued. To facilitate product tracking, manufacturing serial codes must be single-valued, but they may not be simple. For instance, the product serial number TNP5S2M231109154321 might be decomposed this way:
TN = state, i.e., Tennessee
P5 = plant number 5
S2 = shift 2
M23 = machine 23
11 = month, i.e., November
09 = day
154321 = time on a 24-hour clock, i.e., 15:43:21, or 3:43 p.m. plus 21 seconds

16. What are multivalued attributes, and how can they be handled within the database design?

The answer to question 3 is just as valid as an answer to this question. You can augment that discussion with the following discussion: As the name implies, multivalued attributes may have many values. For example, a person's education may include a high school diploma, a 2-year college associate degree, a four-year college degree, a Master's degree, a Doctoral degree, and various professional certifications such as a Certified Public Accounting certificate or a Certified Data Processing certificate. There are basically three ways to handle multivalued attributes – and two of those three ways are bad:
1. Each of the possible outcomes is kept as a separate attribute within the table. This solution is undesirable for several reasons. First, the table would generate many nulls for those who had minimal educational attainments. Using the preceding example, a person with only a high school diploma would generate nulls for the 2-year college associate degree, the four-year college degree, the Master's degree, the Doctoral degree, and for each of the professional certifications. In addition, how many professional certification attributes should be maintained?
If you store two professional certification attributes, you will generate one null for someone with only one professional certification and two nulls for everyone without professional certifications. And suppose a person has five professional certifications? Would you create additional attributes, thus creating many more nulls in the table, or would you simply ignore the additional certifications, thereby losing information?
2. The educational attainments may be kept as a single variable-length string or character field. This solution is undesirable because it makes the table difficult to query. For example, even a simple question such as "How many employees have four-year college degrees?" requires string parsing that is time-consuming at best. Of course, if there is never a need to group employees by education, the variable-length string might be acceptable from a design point of view. However, as database designers we know that, sooner or later, information requirements are likely to grow, so string storage is probably a bad idea from that perspective, too.
3. Finally, the most flexible way to deal with multivalued attributes is to create a composite entity that links employees to education. With the composite entity, there is never a situation in which additional attributes must be created within the EMPLOYEE table to accommodate people with multiple certifications. In short, we eliminate the generation of nulls. In addition, we gain information flexibility because we can also store the details (date earned, place earned, and so on) for each of the educational attainments. The (simplified) structures might look like those in Figures Q4.16a and Q4.16b.
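The composite-entity approach can be sketched with a few SQL statements, run here through Python's sqlite3. The table names follow the figures; the specific columns and codes (EMPED_YEAR, "BBA", and so on) are illustrative assumptions, not taken from the figures:

```python
import sqlite3

# In-memory database sketching the EMPLOYEE / EDUCATION / EMP_EDUC design.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE EMPLOYEE (
    EMP_NUM   INTEGER PRIMARY KEY,
    EMP_LNAME TEXT NOT NULL
);
CREATE TABLE EDUCATION (
    EDUC_CODE        TEXT PRIMARY KEY,
    EDUC_DESCRIPTION TEXT NOT NULL
);
-- Composite (bridge) entity: one row per attainment, so no nulls
-- and no fixed limit on attainments per employee.
CREATE TABLE EMP_EDUC (
    EMP_NUM    INTEGER REFERENCES EMPLOYEE(EMP_NUM),
    EDUC_CODE  TEXT    REFERENCES EDUCATION(EDUC_CODE),
    EMPED_YEAR INTEGER,                     -- detail stored per attainment
    PRIMARY KEY (EMP_NUM, EDUC_CODE)
);
""")
con.execute("INSERT INTO EMPLOYEE VALUES (101, 'Romero')")
con.executemany("INSERT INTO EDUCATION VALUES (?, ?)",
                [("BBA", "Bachelor's degree"),
                 ("CNP", "Certified Network Professional")])
con.executemany("INSERT INTO EMP_EDUC VALUES (?, ?, ?)",
                [(101, "BBA", 1989), (101, "CNP", 2002)])

# Query capability: list each employee's attainments and the year earned.
rows = con.execute("""
    SELECT E.EMP_LNAME, D.EDUC_DESCRIPTION, X.EMPED_YEAR
    FROM EMP_EDUC X
    JOIN EMPLOYEE  E ON E.EMP_NUM   = X.EMP_NUM
    JOIN EDUCATION D ON D.EDUC_CODE = X.EDUC_CODE
    ORDER BY X.EMPED_YEAR
""").fetchall()
```

Adding a third attainment is just one more EMP_EDUC row; no table structure changes and no nulls.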
Figure Q4.16a The Ch04_Questions Database Tables
Figure Q4.16b The Ch04_Questions Relational Diagram
By looking at the structures shown in Figures Q4.16a and Q4.16b, we can tell that the employee named Romero earned a Bachelor's degree in 1989, a Certified Network Professional certification in 2002, and a Certified Data Processing certification in 2004. If Randall were to earn a Master's degree and a Certified Public Accountant certification later, we would merely add two more records to the EMP_EDUC table. If any employee earns additional educational attainments beyond those listed in the EDUCATION table, all we need to do is add the appropriate record(s) to the EDUCATION table and then enter the employee's attainments in the EMP_EDUC table. There are no nulls, we have superb query capability, and we have flexibility. Not a bad set of design goals! The database design on which Figures Q4.16a and Q4.16b are based is shown in Figure Q4.16c.
Figure Q4.16c The Crow’s Foot ERD for the Ch04_Questions Database
NOTE Discuss with the students that the design in Figure Q4.16c shows that an employee must meet at least one educational requirement, because EMP_EDUC is not optional to EMPLOYEE. Thus, each employee must appear at least once in the EMP_EDUC table. Also, given this design, some of the educational attainments may not yet have been earned by any employee, because the design shows EMP_EDUC to be optional to EDUCATION. In other words, some of the EDUCATION records are not necessarily referenced by any employee. (In the original M:N relationship between EMPLOYEE and EDUCATION, EMPLOYEE must have been optional to EDUCATION.)
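The optional side of the relationship can be demonstrated with a query: a LEFT JOIN finds EDUCATION rows that no employee references yet. This is a sketch with assumed codes and trimmed columns, not the figures' exact data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE EDUCATION (EDUC_CODE TEXT PRIMARY KEY, EDUC_DESCRIPTION TEXT);
CREATE TABLE EMP_EDUC  (EMP_NUM INTEGER, EDUC_CODE TEXT,
                        PRIMARY KEY (EMP_NUM, EDUC_CODE));
""")
con.executemany("INSERT INTO EDUCATION VALUES (?, ?)",
                [("BBA", "Bachelor's degree"),
                 ("MBA", "Master's degree"),
                 ("CPA", "Certified Public Accountant")])
# Only the BBA has been earned so far -- the other rows are unreferenced.
con.execute("INSERT INTO EMP_EDUC VALUES (101, 'BBA')")

# EDUCATION rows not referenced by any employee (EMP_EDUC is optional
# to EDUCATION, so such rows are perfectly legal).
unearned = [r[0] for r in con.execute("""
    SELECT D.EDUC_CODE
    FROM EDUCATION D LEFT JOIN EMP_EDUC X ON X.EDUC_CODE = D.EDUC_CODE
    WHERE X.EDUC_CODE IS NULL
    ORDER BY D.EDUC_CODE
""")]
```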
The final four questions are based on the ERD in Figure Q4.17.
FIGURE Q4.17 The ERD For Questions 17−20
17. Write the ten cardinalities that are appropriate for this ERD. The cardinalities are indicated in Figure Q4.17sol.
FIGURE Q4.17sol The Cardinalities
18. Write the business rules reflected in this ERD.

The following business rules are reflected in the ERD:
• A store may place many orders. (Note the use of "may," which is reflected in the ORDER optionality.)
• An order must be placed by a store. (Note that STORE is mandatory to ORDER. In this ERD, the order environment apparently reflects a wholesale environment.)
• An order contains at least one order line. (Note that ORDER_LINE is mandatory to ORDER, and vice versa.)
• Each order line is contained in one and only one order. (Discussion: Although a given item – such as a hammer – may be found in many orders, a specific hammer sold to a specific store is found in only one order.)
• Each order line has a specific product written in it.
• A product may be written in many orders. (Discussion: Many stores can order one or more specific products, but a product that is not in demand may never be sold to a store and will, therefore, not show up in any order line – note that ORDER_LINE is optional to PRODUCT. Also, note that each order line may indicate more than one of a specific item. For example, the item may be "hammer" and the number sold may be 1, 2, or 500. The ORDER_LINE entity would have at least the following attributes: ORDER_NUM, ORDLINE_NUM, PROD_CODE, ORDLINE_PRICE, ORDLINE_QUANTITY. The ORDER_LINE composite PK would be ORDER_NUM + ORDLINE_NUM. You might add the derived attribute ORDLINE_AMOUNT, which would be the result of multiplying ORDLINE_PRICE by ORDLINE_QUANTITY.)
• A store may employ many employees. (Discussion: A new store may not yet have any employees, yet the database may already include the new store's information – location, type, and so on. If you made the EMPLOYEE entity mandatory to STORE, you would have to create an employee for that store before you had even hired one.)
• Each employee is employed by one (and only one) store.
• An employee may have one or more dependents. (Discussion: You cannot require an employee to have dependents, so DEPENDENT is optional to EMPLOYEE. Note the use of the word "may" in the relationship.)
• A dependent must be related to an employee. (Discussion: It makes no sense to keep track of dependents of people who are not even employees. Therefore, EMPLOYEE is mandatory to DEPENDENT.)
19. What two attributes must be contained in the composite entity between STORE and PRODUCT? Use proper terminology in your answer.

The composite entity must include at least the primary keys of the entities it references. The combination of these attributes may be designated as the composite entity's (composite) primary key. Each of the primary key's attributes is a foreign key that references one of the entities for which the composite entity serves as a bridge. As you discuss the model in Figure Q4.17sol, note that an order is represented by two entities, ORDER and ORDER_LINE. Note also that STORE's 1:M relationship with ORDER and ORDER's 1:M relationship with ORDER_LINE together reflect the conceptual M:N relationship between STORE and PRODUCT. The original business rules probably read:
• A store can order many products.
• A product can be ordered by many stores.
20. Describe precisely the composition of the DEPENDENT weak entity's primary key. Use proper terminology in your answer.

The DEPENDENT entity will have a composite PK that includes the EMPLOYEE entity's PK and one of its own attributes. For example, if the EMPLOYEE entity's PK is EMP_NUM, the DEPENDENT entity's PK might be EMP_NUM + DEP_NUM.

21. The local city youth league needs a database system to help track children who sign up to play soccer. Data needs to be kept on each team, the children who will be playing on each team, and their parents. Also, data needs to be kept on the coaches for each team. Draw the data model described below.

Entities required: Team, Player, Coach, and Parent.

Attributes required:
Team: Team ID number, Team name, and Team colors.
Player: Player ID number, Player first name, Player last name, and Player age.
Coach: Coach ID number, Coach first name, Coach last name, and Coach home phone number.
Parent: Parent ID number, Parent last name, Parent first name, Home phone number, and Home Address (Street, City, State, and ZIP Code).

The following relationships must be defined:
• Team is related to Player.
• Team is related to Coach.
• Player is related to Parent.

Connectivities and participations are defined as follows:
• A Team may or may not have a Player.
• A Player must have a Team.
• A Team may have many Players.
• A Player has only one Team.
• A Team may or may not have a Coach.
• A Coach must have a Team.
• A Team may have many Coaches.
• A Coach has only one Team.
• A Player must have a Parent.
• A Parent must have a Player.
• A Player may have many Parents.
• A Parent may have many Players.
This is a great exercise in that it opens up possibilities for several discussion points. The conceptual ERD prior to placement of foreign keys and the resolution of the M:N relationship is shown in Figure Q4.21a.
FIGURE Q4.21a Conceptual ERD for Question 21
The most apparent issue that must be resolved is the M:N relationship between Player and Parent, so that foreign keys can be appropriately placed throughout the data model. The revised ERD with properly placed foreign keys is shown in Figure Q4.21b.
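The M:N resolution can be sketched with a bridge table. The name PLAYER_PARENT and the specific columns are assumptions for illustration, not taken from the figure:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE PLAYER (PLAYER_ID INTEGER PRIMARY KEY, PLAYER_LNAME TEXT);
CREATE TABLE PARENT (PARENT_ID INTEGER PRIMARY KEY, PARENT_LNAME TEXT);
-- Bridge (composite) entity resolving the M:N relationship into two 1:M.
CREATE TABLE PLAYER_PARENT (
    PLAYER_ID INTEGER REFERENCES PLAYER(PLAYER_ID),
    PARENT_ID INTEGER REFERENCES PARENT(PARENT_ID),
    PRIMARY KEY (PLAYER_ID, PARENT_ID)
);
""")
con.execute("INSERT INTO PLAYER VALUES (1, 'Smith')")
con.executemany("INSERT INTO PARENT VALUES (?, ?)",
                [(10, 'Smith'), (11, 'Jones')])
# One player, two parents -- legal under the connectivity rules above,
# and a parent could likewise be linked to several players.
con.executemany("INSERT INTO PLAYER_PARENT VALUES (?, ?)", [(1, 10), (1, 11)])

parent_count = con.execute(
    "SELECT COUNT(*) FROM PLAYER_PARENT WHERE PLAYER_ID = 1").fetchone()[0]
```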
FIGURE Q4.21b ERD with foreign keys for Question 21
This solution, however, still leaves an interesting question about the Team_Colors attribute. What if teams have more than one color, as the plural "colors" used by the business users implies? Let's consider three options: (1) leave it as is (treating Team_Colors as a single-valued attribute), (2) create multiple attributes within the TEAM entity, or (3) create a new COLOR table.

Team_Colors may be left as a single attribute if discussion with the business users determines that they are not concerned with dealing with the different colors individually. For example, if they will never be interested in knowing how many teams have Blue as one of their team colors, then we may choose to implement the design as given above. However, if the users are interested in addressing the different colors of a given team individually, or foresee the possibility that they may become interested at some time in the future, then we must modify the design to accommodate this need.

If we determine that all teams have the same number of colors, and that no team now or in the future will ever have more than that number of colors, then we may modify the design by adding attributes to the TEAM entity. For example, if all teams, now and forever, will always have exactly two team colors, then we may produce the design shown in Figure Q4.21c.
FIGURE Q4.21c ERD with two team colors for Question 21
This is a reasonable solution given the assurance that all teams, now and forever, will have exactly two team colors. A problem arises, however, if we cannot rely on that assurance. If some teams have fewer colors, our design will produce an increased number of nulls. If a team ever has more than two colors, we will have to modify the structure of the database after it has been built to add another team-color attribute. This change in structure may, in turn, require changes in the front-end applications so that they can properly address the new attribute. To avoid these potentially serious modifications in the future, we can redesign the database with a more robust structure that can handle any number of team colors without future modifications to the database or the front-end applications. The design with a separate table to handle the multivalued Team_Colors attribute is shown in Figure Q4.21d.
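The separate-table option can be sketched as follows. The COLOR table here holds one row per team/color pair; the column names are assumptions, not taken from Figure Q4.21d:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE TEAM (TEAM_ID INTEGER PRIMARY KEY, TEAM_NAME TEXT);
-- One row per color, so a team may have any number of colors
-- without nulls or future structural changes.
CREATE TABLE COLOR (
    TEAM_ID    INTEGER REFERENCES TEAM(TEAM_ID),
    COLOR_NAME TEXT,
    PRIMARY KEY (TEAM_ID, COLOR_NAME)
);
""")
con.execute("INSERT INTO TEAM VALUES (1, 'Hornets')")
con.executemany("INSERT INTO COLOR VALUES (?, ?)",
                [(1, 'Blue'), (1, 'Gold'), (1, 'White')])

# Questions like "how many teams wear Blue?" become simple queries.
blue_teams = con.execute(
    "SELECT COUNT(DISTINCT TEAM_ID) FROM COLOR WHERE COLOR_NAME = 'Blue'"
).fetchone()[0]
```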
FIGURE Q4.21d ERD with Color table for Question 21
Problem Solutions

1. Use the following business rules to create a Crow's Foot ERD. Write all appropriate connectivities and cardinalities in the ERD.
• A department employs many employees, but each employee is employed by one department.
• Some employees, known as "rovers," are not assigned to any department.
• A division operates many departments, but each department is operated by one division.
• An employee may be assigned many projects, and a project may have many employees assigned to it.
• A project must have at least one employee assigned to it.
• One of the employees manages each department, and each department is managed by only one employee.
• One of the employees runs each division, and each division is run by only one employee.
The answers to problem 1 (all parts) are included in Figure P4.1.
Figure P4.1 Problem 1 ERD Solution
As you discuss the ERD shown in Figure P4.1, note that this design reflects several useful features that become especially important when the design is implemented. For example:
• The ASSIGN entity is shown as optional to PROJECT. This decision makes sense from a practical perspective because it lets you create a new project record without having to create a new assignment record. (If a new project is started, there will not yet be any assignments.)
• The relationship expressed by "DEPARTMENT employs EMPLOYEE" is shown as mandatory on the EMPLOYEE side. This means that a DEPARTMENT must have at least one EMPLOYEE in order to have departmental status. However, DEPARTMENT is optional to EMPLOYEE, so an employee can be entered without entering a departmental FK value. If the existence of nulls is not acceptable, you can create a "No assignment" record in the DEPARTMENT table, to be referenced in the EMPLOYEE table if an employee is not assigned to a department.
• Note also the implications of the 1:1 "EMPLOYEE manages DEPARTMENT" relationship. The flip side of this relationship is that "each DEPARTMENT is managed by one EMPLOYEE." (This latter relationship is shown as mandatory in the ERD; that is, each department must be managed by an employee!) Therefore, one of the EMPLOYEE table's PK values must appear as the FK value in the DEPARTMENT table. (Because this is a 1:1 relationship, the index property of the EMP_NUM FK in the DEPARTMENT table must be set to "unique.")
• Although you ought to approach a 1:1 relationship with caution – most 1:1 relationships are the result of a misidentification of attributes as entities – the 1:1 relationships reflected in "EMPLOYEE manages DEPARTMENT" and "EMPLOYEE runs DIVISION" are appropriate. These 1:1 relationships avoid the data redundancies you would encounter if you duplicated employee data – such as names, phones, and e-mail addresses – in the DIVISION and DEPARTMENT entities.
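The "unique FK" device for a 1:1 relationship can be sketched directly: a UNIQUE constraint on the FK guarantees that no employee manages more than one department. The column names are assumed for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE EMPLOYEE (EMP_NUM INTEGER PRIMARY KEY, EMP_LNAME TEXT);
CREATE TABLE DEPARTMENT (
    DEPT_CODE TEXT PRIMARY KEY,
    EMP_NUM   INTEGER NOT NULL UNIQUE       -- unique FK turns 1:M into 1:1
              REFERENCES EMPLOYEE(EMP_NUM)
);
""")
con.executemany("INSERT INTO EMPLOYEE VALUES (?, ?)",
                [(100, 'Adams'), (200, 'Burke')])
con.execute("INSERT INTO DEPARTMENT VALUES ('ACCT', 100)")

# A second department managed by employee 100 violates the UNIQUE
# constraint, enforcing "each employee manages at most one department."
try:
    con.execute("INSERT INTO DEPARTMENT VALUES ('MKTG', 100)")
    one_to_one_enforced = False
except sqlite3.IntegrityError:
    one_to_one_enforced = True
```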
Also, if you have multiple relationships between two entities – such as the "EMPLOYEE manages DEPARTMENT" and "DEPARTMENT employs EMPLOYEE" relationships – you must make sure that each relationship has a designated primary entity. For example, the 1:1 relationship expressed by "EMPLOYEE manages DEPARTMENT" requires that the EMPLOYEE entity be designated as the primary (or "first") entity. If you use Visio to create your Crow's Foot ERDs, Figure P4.3 shows how the 1:1 relationship is specified. If you use some other CASE tool, you will discover that it, too, is likely to require similar relationship specifications.

2. Create a complete ERD in Crow's Foot notation that can be implemented in the relational model using the following description of operations. Hot Water (HW) is a small start-up company that sells spas. HW does not carry any stock. A few spas are set up in a simple warehouse so customers can see some of the models available, but any products sold must be ordered at the time of the sale.
• HW can get spas from several different manufacturers.
• Each manufacturer produces one or more different brands of spas.
• Each and every brand is produced by only one manufacturer.
• Every brand has one or more models.
• Every model is produced as part of a brand. For example, Iguana Bay Spas is a manufacturer that produces Big Blue Iguana spas, a premium-level brand, and Lazy Lizard spas, an entry-level brand. The Big Blue Iguana brand offers several models, including the BBI-6, an 81-jet spa with two 6-hp motors, and the BBI-10, a 102-jet spa with three 6-hp motors.
• Every manufacturer is identified by a manufacturer code. The company name, address, area code, phone number, and account number are kept in the system for every manufacturer.
• For each brand, the brand name and brand level (premium, mid-level, or entry-level) are kept in the system.
• For each model, the model number, number of jets, number of motors, number of horsepower per motor, suggested retail price, HW retail price, dry weight, water capacity, and seating capacity must be kept in the system.
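The two 1:M relationships (MANUFACTURER generates BRAND, BRAND contains MODEL) can be sketched as a chain of FKs. Attributes are trimmed to a few representative columns, and the key choices (MFG_CODE, BRAND_NAME as PK) are assumptions rather than the solution figure's exact design:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE MANUFACTURER (MFG_CODE TEXT PRIMARY KEY, MFG_NAME TEXT);
CREATE TABLE BRAND (
    BRAND_NAME  TEXT PRIMARY KEY,
    BRAND_LEVEL TEXT,                       -- premium, mid-level, entry-level
    MFG_CODE    TEXT NOT NULL REFERENCES MANUFACTURER(MFG_CODE)
);
CREATE TABLE MODEL (
    MODEL_NUM  TEXT PRIMARY KEY,
    MODEL_JETS INTEGER,
    BRAND_NAME TEXT NOT NULL REFERENCES BRAND(BRAND_NAME)
);
""")
con.execute("INSERT INTO MANUFACTURER VALUES ('IBS', 'Iguana Bay Spas')")
con.executemany("INSERT INTO BRAND VALUES (?, ?, ?)",
                [('Big Blue Iguana', 'premium', 'IBS'),
                 ('Lazy Lizard', 'entry-level', 'IBS')])
con.executemany("INSERT INTO MODEL VALUES (?, ?, ?)",
                [('BBI-6', 81, 'Big Blue Iguana'),
                 ('BBI-10', 102, 'Big Blue Iguana')])

# Every model rolls up to exactly one brand and thus one manufacturer.
model_count = con.execute("""
    SELECT COUNT(*) FROM MODEL M
    JOIN BRAND B ON B.BRAND_NAME = M.BRAND_NAME
    WHERE B.MFG_CODE = 'IBS'
""").fetchone()[0]
```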
Figure P4.2 Problem 2 ERD Solution
3. The Jonesburgh County Basketball Conference (JCBC) is an amateur basketball association. Each city in the county has one team as its representative. Each team has a maximum of 12 players and a minimum of 9 players. Each team also has up to three coaches (offensive, defensive, and physical training coaches). During the season, each team plays two games (home and visitor) against each of the other teams. Given those conditions, do the following:
• Identify the connectivity of each relationship.
• Identify the type of dependency that exists between CITY and TEAM.
• Identify the cardinality between teams and players and between teams and city.
• Identify the dependency between COACH and TEAM and between TEAM and PLAYER.
• Draw the Chen and Crow's Foot ERDs to represent the JCBC database.
• Draw the UML class diagram to depict the JCBC database.

The Chen ERD solution is shown in Figure P4.3Chen. (The Crow's Foot solution is shown after the discussion.)
Figure P4.3Chen The JCBC Chen ERD

[Figure P4.3Chen depicts the Chen ERD with entities CITY, TEAM, PLAYER, COACH, and GAME. CITY sponsors TEAM (1:1), TEAM has PLAYER (1:M, with (9,12) players per team), TEAM is coached by COACH (1:M, with (1,3) coaches per team), and TEAM participates in a recursive M:N relationship with GAME, with (2,N) cardinalities on the TEAM–GAME relationship.]
To help the students understand the ER diagram's components better, note the following relationships:
• The main components are TEAM and GAME.
• Each team plays each other team at least twice.
• To play a game, two teams are necessary: the home team and the visiting team.
• Each team plays at least twice: once as the home team and once as the visiting team.

Given these relationships, it becomes clear that TEAM participates in a recursive M:N relationship with GAME. The relationship between TEAM and GAME becomes clearer if we list some attributes for each of these entities. Note that the TEAM_CODE appears twice in a GAME record: once as GAME_HOME_TEAM and once as GAME_VISIT_TEAM.

GAME entity: GAME_ID, GAME_DATE, GAME_HOME_TEAM, GAME_VISIT_TEAM, GAME_HOME_SCORE, GAME_VISIT_SCORE
TEAM entity: TEAM_CODE, TEAM_NAME, CITY_CODE
Implementation of this solution yields the relational diagram shown in Figure P4.3RD. (If you implement this design in Microsoft Access, note that Access will generate a virtual table named TEAM_1 to indicate that two relationships exist between GAME and TEAM. We created a database named Ch04_JCBC_V1 to illustrate this design implementation.)
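The double relationship can be sketched with two FKs in GAME, both referencing TEAM. The game-summary query joins TEAM twice under different aliases, which is exactly what Access models with its virtual TEAM_1 table (the data rows are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE TEAM (TEAM_CODE INTEGER PRIMARY KEY, TEAM_NAME TEXT);
CREATE TABLE GAME (
    GAME_ID          INTEGER PRIMARY KEY,
    GAME_HOME_TEAM   INTEGER REFERENCES TEAM(TEAM_CODE),
    GAME_VISIT_TEAM  INTEGER REFERENCES TEAM(TEAM_CODE),
    GAME_HOME_SCORE  INTEGER,
    GAME_VISIT_SCORE INTEGER
);
""")
con.executemany("INSERT INTO TEAM VALUES (?, ?)",
                [(1, 'Bears'), (2, 'Rattlers')])
con.execute("INSERT INTO GAME VALUES (1, 1, 2, 95, 90)")

# Join TEAM twice, once per role (home team and visiting team).
summary = con.execute("""
    SELECT H.TEAM_NAME, G.GAME_HOME_SCORE, V.TEAM_NAME, G.GAME_VISIT_SCORE
    FROM GAME G
    JOIN TEAM H ON H.TEAM_CODE = G.GAME_HOME_TEAM
    JOIN TEAM V ON V.TEAM_CODE = G.GAME_VISIT_TEAM
""").fetchone()
```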
Figure P4.3RD The JCBC Relational Diagram, Version 1
The solution shown in Figure P4.3Chen yields a database that enables its users to track all games. For example, a simple query – based on the two relationships between TEAM and GAME – yields the output shown in Figure P4.3SO. (We have created only a few records to show the results for games 1 and 2 played by teams named Bears, Rattlers, Sharks, and Tigers.)
Figure P4.3SO The JCBC Database Game Summary Output, Version 1
As you examine the design and its implementation – check the relational diagram in Figure P4.3RD – note that this solution uses synonyms, because the TEAM_CODE shows up in GAME twice: once as GAME_HOME_TEAM and once as GAME_VISIT_TEAM. Given the use of these synonyms, the GAME entity also becomes structurally cumbersome as you decide to track more game data. For example, if you wanted to keep track of points, rebounds, and fouls, you would have to have one set of each for each of the two teams, all in the same record. Clearly, such a structure is undesirable: the use of synonyms requires the addition of two new attributes – one for the home team and one for the visiting team – for each additional characteristic you want to track.

To eliminate the structural problem discussed in the previous paragraph, you can let each game be represented by two entities: GAME and GAME_LINE. Figure P4.3RD2 shows the structures of these two entities in a segment of the revised relational diagram. We have added a LOCATION entity to specify the actual location of each game – knowing that a game is played in Nashville is not sufficiently specific. Players, coaches, and spectators ought to know where in Nashville the game is played.
Figure P4.3RD2 The Revised JCBC Database Relational Diagram
NOTE Quite aside from the fact that we ought to know where in each city any given game is played, the LOC_ID attribute in GAME refers to a LOCATION entity that was created to make the database more flexible by permitting the use of multiple locations in each city. Although this capability was not required by the problem description – each city fields only one team at this point – it is very likely that additional teams will be organized in the future. Good design first ensures that current requirements are met. This design does that. But good design also anticipates the reasonably expected changing dynamics of the database environment. This revised design does that, too.

Additional flexibility is gained by the use of the GAME entity. For example, if you want to track the assignment of referees in each of the games, you can easily create a REFEREE entity in an M:N relationship with the GAME entity. (A referee may referee many games, and many referees referee each game.) This M:N relationship may then be transformed into two 1:M relationships through the use of a composite entity, perhaps named REF_GAME.
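The REF_GAME composite entity suggested above might be sketched like this (all attribute names beyond REF_GAME itself are assumptions for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE GAME    (GAME_ID INTEGER PRIMARY KEY);
CREATE TABLE REFEREE (REF_ID  INTEGER PRIMARY KEY, REF_LNAME TEXT);
-- Composite entity: two 1:M relationships replace the M:N between
-- REFEREE and GAME.
CREATE TABLE REF_GAME (
    REF_ID  INTEGER REFERENCES REFEREE(REF_ID),
    GAME_ID INTEGER REFERENCES GAME(GAME_ID),
    PRIMARY KEY (REF_ID, GAME_ID)
);
""")
con.execute("INSERT INTO GAME VALUES (1)")
con.executemany("INSERT INTO REFEREE VALUES (?, ?)",
                [(7, 'Lee'), (8, 'Ortiz')])
# Two referees work game 1; referee 7 could also work other games.
con.executemany("INSERT INTO REF_GAME VALUES (?, ?)", [(7, 1), (8, 1)])

refs_in_game1 = con.execute(
    "SELECT COUNT(*) FROM REF_GAME WHERE GAME_ID = 1").fetchone()[0]
```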
Finally, point out to the students that the relationship between the newly created GAME and GAME_LINE entities is structurally similar to the by now familiar relationship between INVOICE and INV_LINE entities. The completed database design is implemented as shown in the Crow’s Foot ERD in Figure P4.3CF.
Figure P4.3CF The JCBC Crow’s Foot ERD
Figure P4.3UML The JCBC UML Class Diagram
NOTE You may wonder why we examined this solution in such detail. (The sample implementation is shown in the database named Ch04_JCBC_Version2.) After all, mere games hardly seem to merit this level of database design attention. Actually, there is the proverbial method in the madness. The basketball environment – or any other game environment – is likely to be familiar to your students. Therefore, it becomes easier for you to show the design and implementation of recursive relationships, which are actually rather complex. Fortunately, even complex design issues become manageable in a familiar data environment. Recursive relationships are common enough – or should be – to merit attention and the development of expertise in their implementation. In many manufacturing industries, incredibly detailed part tracking is mandatory. For example, the implementation of the recursive relationship "PART contains PART" is especially desirable in aviation manufacturing businesses. Such businesses are required by federal law to maintain absolute parts-tracing records. If a complex part fails, it must be possible to follow all the trails to all the component parts that may have been involved in the part's failure.

4. Create an ERD based on the Crow's Foot model, using the following requirements:
• An INVOICE is written by a SALESREP. Each sales representative can write many invoices, but each invoice is written by a single sales representative.
• The INVOICE is written for a single CUSTOMER. However, each customer can have many invoices.
• An INVOICE can include many detail lines (LINE), each of which describes one product bought by the customer.
• The product information is stored in a PRODUCT entity.
• The product's vendor information is found in a VENDOR entity.
NOTE The ERD must reflect business rules that you are free to define (within reason). Make sure that your ERD reflects the conditions you require. Finally, make sure that you include the attributes that would permit the model to be successfully implemented. The Crow’s Foot ERD solution is shown in Figure 4.4.
Figure P4.4 The Crow’s Foot ERD Solution for Problem 4
NOTE Keep in mind that the preceding ER diagram reflects a set of business rules that may easily be modified. For example, if customers are supplied via a commercial customer list, many of the customers on that list will not (yet!) have bought anything, so INVOICE would be optional to CUSTOMER. We are assuming here that many vendors can supply a product and that each vendor can supply many products. The PRODUCT may be optional to VENDOR if the vendor list includes potential vendors from which you have not (yet) ordered anything. Some products may never sell, so LINE is optional to PRODUCT... because an unsold product will never appear in an invoice line. You may also want to show the students how the composite entities may be represented at the final implementation level. For example, LINE is shown as weak to INVOICE, because it borrows the invoice number as part of its primary key and it is existence-dependent on INVOICE. The modified ER diagram is shown next. The point of this exercise is that the design's final iteration depends on the exact nature of the business rules and the desired level of implementation detail.
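The weak LINE entity described above – borrowing the invoice number as part of its PK and being existence-dependent on INVOICE – can be sketched as follows. The column names and cascade rule are illustrative assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")     # SQLite enforces FKs only if on
con.executescript("""
CREATE TABLE INVOICE (INV_NUMBER INTEGER PRIMARY KEY, INV_DATE TEXT);
-- Weak entity: LINE borrows INV_NUMBER as part of its composite PK and
-- is existence-dependent on INVOICE (enforced here with ON DELETE CASCADE).
CREATE TABLE LINE (
    INV_NUMBER  INTEGER REFERENCES INVOICE(INV_NUMBER) ON DELETE CASCADE,
    LINE_NUMBER INTEGER,
    LINE_UNITS  INTEGER,
    PRIMARY KEY (INV_NUMBER, LINE_NUMBER)
);
""")
con.execute("INSERT INTO INVOICE VALUES (1001, '2024-03-15')")
con.executemany("INSERT INTO LINE VALUES (?, ?, ?)",
                [(1001, 1, 3), (1001, 2, 1)])

# Existence dependence: deleting the invoice removes its lines.
con.execute("DELETE FROM INVOICE WHERE INV_NUMBER = 1001")
remaining = con.execute("SELECT COUNT(*) FROM LINE").fetchone()[0]
```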
5. The Hudson Engineering Group (HEG) has contacted you to create a conceptual model whose application will meet the expected database requirements for the company’s training program. The HEG administrator gives you the description (see below) of the training group’s operating environment. (Hint: Some of the following sentences identify the volume of data rather than cardinalities. Can you tell which ones?) The HEG has 12 instructors and can handle up to 30 trainees per class. HEG offers five Advanced Technology courses, each of which may generate several classes. If a class has fewer than ten trainees, it will be canceled. Therefore, it is possible for a course not to generate any classes. Each class is taught by one instructor. Each instructor may teach up to two classes or may be assigned to do research only. Each trainee may take up to two classes per year. Given that information, do the following: a. Define all of the entities and relationships. (Use Table 4.4 as your guide.) The HEG entities and relationships are shown in Table P4.5a.
Table P4.5a The Components of the HEG ERD

ENTITY       RELATIONSHIP    CONNECTIVITY   ENTITY
INSTRUCTOR   teaches         1:M            CLASS
COURSE       generates       1:M            CLASS
CLASS        is listed in    1:M            ENROLL
TRAINEE      is written in   1:M            ENROLL
As you examine the summary in Table P4.5a, it is reasonable to assume that many of the relationships are optional and that some are mandatory. (Remember a point we made earlier: when in doubt, assume an optional relationship.)
• A COURSE does not necessarily generate a class during each training period. (Some courses may be taught every other period or during some other specified time frames.) Therefore, it is reasonable to assume that CLASS is optional to COURSE.
• Each CLASS must be related to a COURSE. (The class must cover designated course material!) Therefore, COURSE is mandatory to CLASS.
• Some instructors may teach a class every other period or even rarely. Therefore, it is reasonable to assume that CLASS is optional to INSTRUCTOR during any enrollment period. This optionality makes sense from an implementation point of view, too. For example, if you appoint a new instructor, that instructor will not – yet – have taught a class.
• Not all trainees are likely to be enrolled in classes during some time period. In fact, in a real-world setting, many trainees are likely to get informal "on the job" training without going to formal classes. Therefore, it is reasonable to assume that ENROLL is optional to TRAINEE.
• You cannot create an enrollment record without having a trainee. Therefore, TRAINEE is mandatory to ENROLL. (Discussion point: What about making TRAINEE optional to ENROLL? In any case, optional relationships may be used for operational reasons, whether or not they are directly derived from a business rule.)
Note that a real-world database design requires the explicit recognition of each relationship's characteristics. When in doubt, ask the end users!

b. Describe the relationship between instructor and class in terms of connectivity, cardinality, and existence dependence.

Both questions (a) and (b) are addressed in the ER diagram shown in Figure P4.5b.
Figure P4.5b The HEG ERD
As you discuss Figure P4.5b, keep the discussion in part (a) in mind. Also, note the following points:
• A trainee can take more than one class, and each class contains many (10 or more) trainees, so there is an M:N relationship between TRAINEE and CLASS. (Therefore, a composite entity serves as the bridge between TRAINEE and CLASS.)
• A class is taught by only one instructor, but an instructor can teach up to two classes. Therefore, there is a 1:M relationship between INSTRUCTOR and CLASS.
• Finally, a COURSE may generate more than one CLASS, while each CLASS is based on one COURSE, so there is a 1:M relationship between COURSE and CLASS.

These relationships are all reflected in the ER diagram shown in Figure P4.5b. Note the optional and mandatory relationships:
• To exist, a CLASS must have TRAINEEs enrolled in it, but TRAINEEs do not necessarily take CLASSes. (Some may take "on the job" training.)
• An INSTRUCTOR may not be teaching any CLASSes during some enrollment periods. For example, an instructor may be assigned to duties other than training. However, each CLASS must have an INSTRUCTOR.
• If an insufficient number of people sign up for a CLASS, a COURSE may not generate any CLASSes, but each CLASS must represent a COURSE.
NOTE The sentences "HEG has 12 instructors" and "HEG offers five Advanced Technology courses" are not reflected in the ER diagram. Instead, they represent information concerning the volume of data (the number of entities in an entity set) rather than information concerning entity relationships.

Because the HEG description in Problem 5 leaves room for different interpretations of optional vs. mandatory relationships, we like to give the student the benefit of the doubt. Therefore, unless the question or problem description is sufficiently precise to leave no doubt about the existence of optional/mandatory relationships, we base the student grade on two criteria:
1. Was the basic nature of the relationship – 1:1, 1:M, or M:N – selected and displayed properly?
2. Given the student's rendering of such a relationship, are the cardinalities appropriate?
You can add substantial detail to the ERD by including sample attributes for each of the entities. Using Visio Professional, you can also let your students declare the nature – weak or strong – of the relationships among the entities. Finally, remind your students that the order in which the attributes appear in each entity is immaterial. Therefore, the (composite) PK of the ENROLL entity can be written as either CLASS_CODE + TRN_NUM or TRN_NUM + CLASS_CODE. That is also why it is immaterial which of the foreign key attributes is FK1 or FK2.

As you discuss the ERD shown in Figure P4.5b, note that the basic components of this problem are found in the text's Figure 4.35. Note also that the ENROLL entity in Figure P4.5b uses a composite PK (TRN_NUM + CLASS_CODE) and that, therefore, the relationships between ENROLL and both CLASS and TRAINEE are strong. Finally, discuss the reason for the weak relationship between COURSE and CLASS – the CLASS entity's PK (CLASS_CODE) does not "borrow" the PK of the parent COURSE entity. If the CLASS entity's PK had been composed of CRS_CODE + CLASS_SECTION, the relationship between COURSE and CLASS would have been strong.

Discussion: Review the text to show the two possible relationship strengths between COURSE and CLASS. Emphasize that the choice of the PK component(s) is usually a designer option, but that single-attribute PKs tend to yield more design options than composite PKs. Even the composite ENROLL entity can be modified to have a single-attribute PK such as ENROLL_NUM. Given that choice, CLASS_CODE + TRN_NUM constitutes a candidate key, with CLASS_CODE and TRN_NUM continuing to serve as foreign keys to CLASS and TRAINEE, respectively. In that scenario, you can create a (unique) composite index on the candidate key to prevent duplicate enrollments.

6. Automata Inc. produces specialty vehicles by contract. The company operates several departments, each of which builds a particular vehicle, such as a limousine, a truck, a van, or an RV.
Chapter 4 Entity Relationship (ER) Modeling
Before a new vehicle is built, the department places an order with the purchasing department to request specific components. Automata’s purchasing department is interested in creating a database to keep track of orders and to accelerate the process of delivering materials. The order received by the purchasing department may contain several different items. An inventory is maintained so that the most frequently requested items are delivered almost immediately. When an order comes in, it is checked to determine whether the requested item is in inventory. If an item is not in inventory, it must be ordered from a supplier. Each item may have several suppliers. Given that functional description of the processes encountered at Automata’s purchasing department, do the following:
a. Identify all of the main entities.
b. Identify all of the relations and connectivities among entities.
c. Identify the type of existence dependency in all the relationships.
d. Give at least two examples of the types of reports that can be obtained from the database.
The initial Crow’s Foot ERD is shown in Figure P4.6init. The discussion preceding Figure P4.6rev explains why the revision was made.
Figure P4.6init Initial Automata Crow’s Foot ERD
As you explain the development of the Crow’s Foot ERD shown in Figure P4.6init, several points are worth stressing:
• The ORDER and ORD_LINE entities are perfect reflections of the INVOICE and INV_LINE entities the students have encountered before. This kind of 1:M relationship is quite common in a business environment, and you will see it recur throughout the book and in its many problems. Note that the ORD_LINE entity is weak, because it inherits part of its PK from its ORDER “parent” entity. Therefore, the “contains” relationship between ORDER and ORD_LINE is properly shown as an identifying (strong) relationship. (The relationship line is solid, rather than dashed.) Finally, note that ORD_LINE is mandatory to ORDER; it is not possible to have an ORDER that does not contain at least one order line. And, of course, ORDER is mandatory to ORD_LINE, because an ORD_LINE occurrence cannot exist without referencing an ORDER.
• The ORDER entity is shown as optional to DEPARTMENT, indicating that it is quite possible that a department has not (yet) placed an order. Aside from the fact that such an optionality makes common sense, it also makes operational sense from a database point of view. For example, if the ORDER entity were mandatory to the DEPARTMENT entity, the creation of a new department would require the creation of an order, so you might have to create a “dummy” order when you create a new department. Also, keep in mind that an order cannot be written by a department that does not (yet) exist.
• Note also that the VENDOR may not (yet) have received an order, so ORDER is optional to VENDOR. The VENDOR entity may contain vendors who are simply potential suppliers of items; you may want to have such potential vendors available just in case your “usual” vendor(s) run(s) out of items that you need.
The other optionalities should be discussed, too – using the same basic scenarios that were described in bullets 2 and 3.
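The existence dependence just discussed (an order cannot reference a department that does not exist, while a department may exist with no orders) can be demonstrated at the implementation level. The following sqlite3 sketch uses assumed table and column names, and renames ORDER to purchase_order because ORDER is a reserved word in SQL:

```python
import sqlite3

# Optional participation vs. existence dependence, sketched with assumed names.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE department (dept_code TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE purchase_order (
        order_num INTEGER PRIMARY KEY,
        dept_code TEXT NOT NULL REFERENCES department(dept_code)
    )""")

# A department with no orders is perfectly legal (ORDER is optional to DEPARTMENT).
conn.execute("INSERT INTO department VALUES ('RV')")

try:
    # An order naming a nonexistent department violates existence dependence.
    conn.execute("INSERT INTO purchase_order VALUES (1, 'LIMO')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The NOT NULL foreign key makes DEPARTMENT mandatory to purchase_order, while nothing forces a department row to have matching orders.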
NOTE In this presentation, the relationship between VENDOR and ITEM is shown as 1:M. Therefore, each vendor can supply many items, but only one vendor can supply each item. If it is possible for items to be supplied by more than one vendor, there is an M:N relationship between VENDOR and ITEM, and this relationship would have to be implemented through a composite (bridge) entity. In fact, such an M:N relationship is specified in the brief description of the Automata company’s data environment. Therefore, the following Figure P4.6rev more accurately reflects the problem description.
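The bridge-entity resolution described in the note can be sketched in sqlite3. The bridge name SUPPLY and all column names below are illustrative assumptions, not taken from the figures:

```python
import sqlite3

# Resolving an M:N VENDOR-ITEM relationship through a composite (bridge) entity.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vendor (vend_code TEXT PRIMARY KEY);
    CREATE TABLE item   (item_code TEXT PRIMARY KEY);
    -- Bridge entity: one row per (vendor, item) pairing; the composite PK
    -- turns the single M:N relationship into two 1:M relationships.
    CREATE TABLE supply (
        vend_code TEXT REFERENCES vendor(vend_code),
        item_code TEXT REFERENCES item(item_code),
        PRIMARY KEY (vend_code, item_code)
    );
    INSERT INTO vendor VALUES ('V1'), ('V2');
    INSERT INTO item   VALUES ('bolt');
    INSERT INTO supply VALUES ('V1', 'bolt'), ('V2', 'bolt');  -- two vendors, one item
""")
rows = conn.execute(
    "SELECT vend_code FROM supply WHERE item_code = 'bolt' ORDER BY vend_code"
).fetchall()
print(rows)  # [('V1',), ('V2',)]
```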
Figure P4.6rev Revised Automata Crow’s Foot ERD
7. United Helpers is a nonprofit organization that provides aid to people after natural disasters. Based on the following brief description of operations, create the appropriate fully labeled Crow’s Foot ERD.
• Individuals volunteer their time to carry out the tasks of the organization. For each volunteer, their name, address, and telephone number are tracked. Each volunteer may be assigned to several tasks during the time that they are doing volunteer work, and some tasks require many volunteers. It is possible for a volunteer to be in the system without having been assigned a task yet. It is possible to have tasks to which no one has been assigned. When a volunteer is assigned to a task, the system should track the start time and end time of that assignment.
• For each task, there is a task code, task description, task type, and a task status. For example, there may be a task with task code “101,” description of “answer the telephone,” a type of “recurring,” and a status of “ongoing.” There could be another task with a code of “102,” description of “prepare 5000 packages of basic medical supplies,” a type of “packing,” and a status of “open.”
• For all tasks of type “packing,” there is a packing list that specifies the contents of the packages. There are many different packing lists to produce different packages, such as basic medical packages, child care packages, food packages, etc. Each packing list has a packing list ID number, packing list name, and a packing list description, which describes the items that ideally go into making that type of package. Every packing task is associated with only one packing list. A packing list may not be associated with any tasks, or may be associated with many tasks. Tasks that are not packing tasks are not associated with any packing list.
• Packing tasks result in the creation of packages. Each individual package of supplies that is produced by the organization is tracked. Each package is assigned an ID number. The date the package was created and the total weight of the package are recorded. A given package is associated with only one task. Some tasks (e.g., “answer the phones”) will not have produced any packages, while other tasks (e.g., “prepare 5000 packages of basic medical supplies”) will be associated with many packages.
• The packing list describes the ideal contents of each package, but it is not always possible to include the ideal number of each item. Therefore, the actual items included in each package should be tracked. A package can contain many different items, and a given item can be used in many different packages.
• Each item that the organization provides has an item ID number, item description, item value, and item quantity on hand stored in the system. Along with tracking the actual items that are placed in each package, the quantity of each item placed in the package must be tracked as well. For example, a packing list may state that basic medical packages should include 100 bandages, 4 bottles of iodine, and 4 bottles of hydrogen peroxide. However, because of the limited supply of items, a given package may include only 10 bandages, 1 bottle of iodine, and no hydrogen peroxide. The fact that this package includes bandages and iodine needs to be recorded along with the quantity of each item included. It is possible for the organization to have items that have not been included in any package yet, but every package will contain at least one item.
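The package/item rules above describe an M:N relationship that carries its own attribute (the quantity actually packed). A minimal sqlite3 sketch, with all table and column names assumed for illustration:

```python
import sqlite3

# The quantity packed belongs on the bridge entity, not on PACKAGE or ITEM,
# because it depends on the specific package-item pairing.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE package (pack_id INTEGER PRIMARY KEY, pack_weight REAL);
    CREATE TABLE item    (item_id INTEGER PRIMARY KEY, item_descript TEXT);
    CREATE TABLE packed_item (
        pack_id  INTEGER REFERENCES package(pack_id),
        item_id  INTEGER REFERENCES item(item_id),
        quantity INTEGER NOT NULL,
        PRIMARY KEY (pack_id, item_id)
    );
    INSERT INTO item VALUES (1, 'bandage'), (2, 'iodine');
    INSERT INTO package VALUES (100, 2.5);
    -- Package 100 received only 10 bandages and 1 bottle of iodine.
    INSERT INTO packed_item VALUES (100, 1, 10), (100, 2, 1);
""")
for descript, qty in conn.execute("""
        SELECT item_descript, quantity
        FROM packed_item JOIN item USING (item_id)
        WHERE pack_id = 100 ORDER BY item_id"""):
    print(descript, qty)
```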
The ERD for United Helpers is shown in Figure P4.7a.
FIGURE P4.7a United Helpers ERD
This problem, however, does leave room for interesting discussion with the students regarding the need to verify requirements with the business users. In fact, getting unambiguous business rules can be one of the most difficult parts of the design process. In this problem, the potential for a relationship between the packing list (LIST) and the items (ITEM) stocked by the organization can be a source for discussion. Students may envision that a LIST can specify many ITEMs and an ITEM can be specified in many LISTs. This would imply the need for an M:N relationship between ITEM and LIST. However, the business users may not intend for the packing list to be that specific. For example, the packing list may specify that "2 liters of iodine" should be included in a given type of package without specifying whether it should be two 1-liter bottles of iodine or four 500ml bottles of iodine. Note that "1-liter bottle of iodine" and "500ml bottle of iodine" would have to be separate entity instances in ITEM because they have different values. If the packing list is intentionally generic in its description of the ideal contents, then a relationship between LIST and ITEM would not be appropriate.
8. Using the Crow’s Foot methodology, create an ERD that can be implemented for a medical clinic, using at least the following business rules:
a. A patient can make many appointments with one or more doctors in the clinic, and a doctor can accept appointments with many patients. However, each appointment is made with only one doctor and one patient.
b. Emergency cases do not require an appointment. However, for appointment management purposes, an emergency is entered in the appointment book as “unscheduled.”
c. If kept, an appointment yields a visit with the doctor specified in the appointment. The visit yields a diagnosis and, when appropriate, treatment.
d. With each visit, the patient’s records are updated to provide a medical history.
e. Each patient visit creates a bill. Each patient visit is billed by one doctor, and each doctor can bill many patients.
f. Each bill must be paid. However, a bill may be paid in many installments, and a payment may cover more than one bill.
g. A patient may pay the bill directly, or the bill may be the basis for a claim submitted to an insurance company.
h. If the bill is paid by an insurance company, the deductible is submitted to the patient for payment.
The ERD solution is shown in Figure P4.8.
Figure P4.8 The Medical Clinic’s Crow’s Foot ERD
9. Create a Crow’s Foot notation ERD to support the following business operations:
• A friend of yours has opened Professional Electronics and Repairs (PEAR) to repair smartphones, laptops, tablets, and MP3 players. She wants you to create a database to help her run her business.
• When a customer brings a device to PEAR for repair, data must be recorded about the customer, the device, and the repair. The customer’s name, address, and a contact phone number must be recorded (if the customer has used the shop before, the information already in the system for the customer is verified as being current). For the device to be repaired, the type of device, model, and serial number are recorded (or verified if the device is already in the system). Only customers who have brought devices into PEAR for repair will be included in this system.
• Since a customer might sell an older device to someone else who then brings the device to PEAR for repair, it is possible for a device to be brought in for repair by more than one customer. However, each repair is associated with only one customer. When a customer brings in a device to be fixed, it is referred to as a repair request, or just “repair,” for short. Each repair request is given a reference number, which is recorded in the system along with the date of the request and a description of the problem(s) that the customer wants fixed. It is possible for a device to be brought to the shop for repair many different times, and only devices that are brought in for repair are recorded in the system. Each repair request is for the repair of one and only one device. If a customer needs multiple devices fixed, then each device will require its own repair request.
• There are a limited number of repair services that PEAR can perform. For each repair service, there is a service ID number, description, and charge.
“Charge” is how much the customer is charged for the shop to perform the service, including any parts used. The actual repair of a device is the performance of the services necessary to address the problems described by the customer. Completing a repair request may require the performance of many services. Each service can be performed many different times during the repair of different devices, but each service will be performed only once during a given repair request.
• All repairs eventually require the performance of at least one service, but which services will be required may not be known at the time the repair request is made. It is possible for services that are available at PEAR to have never been required in performing any repair.
• Some services involve only labor activities and no parts are required, but most services require the replacement of one or more parts. The quantity of each part required in the performance of each service should also be recorded. For each part, the part number, part description, quantity in stock, and cost are recorded in the system. The cost indicated is the amount that PEAR pays for the part. Some parts may be used in more than one service, but each part is required for at least one service.
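The rule that "each service will be performed only once during a given repair request" maps naturally to a composite PK on the bridge between REPAIR and SERVICE. A sqlite3 sketch with assumed names:

```python
import sqlite3

# The composite PK (repair_num, service_id) enforces "at most once per repair."
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE repair  (repair_num INTEGER PRIMARY KEY);
    CREATE TABLE service (service_id INTEGER PRIMARY KEY, charge REAL);
    CREATE TABLE repair_service (
        repair_num INTEGER REFERENCES repair(repair_num),
        service_id INTEGER REFERENCES service(service_id),
        PRIMARY KEY (repair_num, service_id)
    );
    INSERT INTO repair  VALUES (1);
    INSERT INTO service VALUES (10, 49.95);
    INSERT INTO repair_service VALUES (1, 10);
""")
try:
    # Performing the same service twice on the same repair is rejected.
    conn.execute("INSERT INTO repair_service VALUES (1, 10)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```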
Figure P4.9 The PEAR ERD
10. Luxury-Oriented Scenic Tours (LOST) provides guided tours to groups of visitors to the Washington D.C. area. In recent years, LOST has grown quickly and is having difficulty keeping up with all of the various information needs of the company. The company’s operations are as follows. • LOST offers many different tours. For each tour, the tour name, approximate length (in hours), and fee charged is needed. Guides are identified by an employee ID, but the system should also record a guide’s name, home address, and date of hire. Guides take a test to be qualified to lead specific tours. It is important to know which guides are qualified to lead which tours and the date that they completed the qualification test for each tour. A guide may be qualified to lead many different tours. A tour can have many different qualified guides. New guides may or may not be qualified to lead any tours, just as a new tour may or may not have any qualified guides.
• Every tour must be designed to visit at least three locations. For each location, a name, type, and official description are kept. Some locations (such as the White House) are visited by more than one tour, while others (such as Arlington Cemetery) are visited by a single tour. All locations are visited by at least one tour. The order in which the tour visits each location should be tracked as well.
• When a tour is actually given, that is referred to as an “outing.” LOST schedules outings well in advance so they can be advertised and so employees can understand their upcoming work schedules. A tour can have many scheduled outings, although newly designed tours may not have any outings scheduled. Each outing is for a single tour and is scheduled for a particular date and time. All outings must be associated with a tour. All tours at LOST are guided tours, so a guide must be assigned to each outing. Each outing has one and only one guide. Guides are occasionally asked to lead an outing of a tour even if they are not officially qualified to lead that tour. Newly hired guides may not have ever been scheduled to lead any outings.
• Tourists, called “clients” by LOST, pay to join a scheduled outing. For each client, the name and telephone number are recorded. Clients may sign up to join many different outings, and each outing can have many clients. Information is kept only on clients who have signed up for at least one outing, although newly scheduled outings may not have any clients signed up yet.
a. Create a Crow’s Foot notation ERD to support LOST operations.
Figure P4.10a The first LOST ERD
b. The operations provided state that it is possible for a guide to lead an outing of a tour even if the guide is not officially qualified to lead outings of that tour. Imagine that the business rules instead specified that a guide is never, under any circumstance, allowed to lead an outing unless he or she is qualified to lead outings of that tour. How could the data model in part a) be modified to enforce this new constraint?
Figure P4.10b The second LOST ERD
Case Solutions
11. The administrators of Tiny College are so pleased with your design and implementation of their student registration/tracking system that they want you to expand the design to include the database for their motor vehicle pool. A brief description of operations follows:
• Faculty members may use the vehicles owned by Tiny College for officially sanctioned travel. For example, the vehicles may be used by faculty members to travel to off-campus learning centers, to travel to locations at which research papers are presented, to transport students to officially sanctioned locations, and to travel for public service purposes. The vehicles used for such purposes are managed by Tiny College’s Travel Far But Slowly (TFBS) Center.
• Using reservation forms, each department can reserve vehicles for its faculty, who are responsible for filling out the appropriate trip completion form at the end of a trip. The reservation form includes the expected departure date, vehicle type required, destination, and name of the authorized faculty member. The faculty member arriving to pick up a vehicle must sign a checkout form to log out the vehicle and pick up a trip completion form. (The TFBS employee who releases the vehicle for use also signs the checkout form.) The faculty member’s trip completion form includes the faculty member’s identification code, the vehicle’s identification, the odometer readings at the start and end of the trip, maintenance complaints (if any), gallons of fuel purchased (if any), and the Tiny College credit card number used to pay for the fuel. If fuel is purchased, the credit card receipt must be stapled to the trip completion form. Upon receipt of the faculty trip completion form, the faculty member’s department is billed at a mileage rate based on the vehicle type (sedan, station wagon, panel truck, minivan, or minibus) used. (Hint: Do not use more entities than are necessary.
Remember the difference between attributes and entities!)
• All vehicle maintenance is performed by TFBS. Each time a vehicle requires maintenance, a maintenance log entry is completed on a prenumbered maintenance log form. The maintenance log form includes the vehicle identification, a brief description of the type of maintenance required, the initial log entry date, the date on which the maintenance was completed, and the identification of the mechanic who released the vehicle back into service. (Only mechanics who have an inspection authorization may release the vehicle back into service.)
• As soon as the log form has been initiated, the log form’s number is transferred to a maintenance detail form; the log form’s number is also forwarded to the parts department manager, who fills out a parts usage form on which the maintenance log number is recorded. The maintenance detail form contains separate lines for each maintenance item performed, for the parts used, and for identification of the mechanic who performed the maintenance item. When all maintenance items have been completed, the maintenance detail form is stapled to the maintenance log form, the maintenance log form’s completion date is filled out, and the mechanic who releases the vehicle back into service signs the form. The stapled forms are then filed, to be used later as the source for various maintenance reports.
• TFBS maintains a parts inventory, including oil, oil filters, air filters, and belts of various types. The parts inventory is checked daily to monitor parts usage and to reorder parts that reach the “minimum quantity on hand” level. To track parts usage, the parts manager requires each mechanic to sign out the parts that are used to perform each vehicle’s maintenance; the parts manager records the maintenance log number under which the part is used.
• Each month TFBS issues a set of reports. The reports include the mileage driven by vehicle, by department, and by faculty members within a department. In addition, various revenue reports are generated by vehicle and department. A detailed parts usage report is also filed each month. Finally, a vehicle maintenance summary is created each month.
Given that brief summary of operations, draw the appropriate (and fully labeled) ERD. Use the Chen methodology to indicate entities, relationships, connectivities, and cardinalities. The solution is shown in Figure P4.11.
Figure P4.11 The Tiny-College TFBS Maintenance ERD
12. During peak periods, Temporary Employment Corporation (TEC) places temporary workers in companies. TEC’s manager gives you the following description of the business:
• TEC has a file of candidates who are willing to work.
• If the candidate has worked before, that candidate has a specific job history. (Naturally, no job history exists if the candidate has never worked.) Each time the candidate works, one additional job history record is created.
• Each candidate has earned several qualifications. Each qualification may be earned by more than one candidate. (For example, it is possible for more than one candidate to have earned a Bachelor of Business Administration degree or a Microsoft Network Certification. And clearly, a candidate may have earned both a BBA and a Microsoft Network Certification.)
• TEC offers courses to help candidates improve their qualifications.
• Every course develops one specific qualification; however, TEC does not offer a course for every qualification. Some qualifications have multiple courses that develop that qualification.
• Some courses cover advanced topics that require specific qualifications as prerequisites. Some courses cover basic topics that do not require any prerequisite qualifications. A course can have several prerequisites. A qualification can be a prerequisite for more than one course.
• Courses are taught during training sessions. A training session is the presentation of a single course. Over time, TEC will offer many training sessions for each course; however, new courses may not have any training sessions scheduled right away.
• Candidates can pay a fee to attend a training session. A training session can accommodate several candidates, although new training sessions will not have any candidates registered at first.
• TEC also has a list of companies that request temporaries.
• Each time a company requests a temporary employee, TEC makes an entry in the Openings folder. That folder contains an opening number, a company name, required qualifications, a starting date, an anticipated ending date, and hourly pay.
• Each opening requires only one specific or main qualification.
• When a candidate matches the qualification, the job is assigned, and an entry is made in the Placement Record folder. That folder contains an opening number, a candidate number, and total hours worked. In addition, an entry is made in the job history for the candidate.
• An opening can be filled by many candidates, and a candidate can fill many openings.
• TEC uses special codes to describe a candidate’s qualifications for an opening. The list of codes is shown in Table P4.12.
TABLE P4.12 TEC QUALIFICATION CODES
CODE          DESCRIPTION
SEC-45        Secretarial work, at least 45 words per minute
SEC-60        Secretarial work, at least 60 words per minute
CLERK         General clerking work
PRG-VB        Programmer, Visual Basic
PRG-C++       Programmer, C++
DBA-ORA       Database Administrator, Oracle
DBA-DB2       Database Administrator, IBM DB2
DBA-SQLSERV   Database Administrator, MS SQL Server
SYS-1         Systems Analyst, level 1
SYS-2         Systems Analyst, level 2
NW-NOV        Network Administrator, Novell experience
WD-CF         Web Developer, ColdFusion
TEC’s management wants to keep track of the following entities:
• COMPANY
• OPENING
• QUALIFICATION
• CANDIDATE
• JOB_HISTORY
• PLACEMENT
• COURSE
• SESSION
Given that information, do the following:
a. Draw the Crow’s Foot ERDs for this enterprise.
b. Identify all possible relationships.
c. Identify the connectivity for each relationship.
d. Identify the mandatory/optional dependencies for the relationships.
e. Resolve all M:N relationships.
The solutions for Problems 12a-12e are shown in Figure P4.12.
Figure P4.12 TEC Solution ERD
To help the students better understand the components of Figure P4.12’s ER diagram, the following discussion is likely to be useful:
• Each COMPANY may list one or more OPENINGs. Because we will maintain COMPANY data even if a company has not (yet!) hired any of TEC's candidates, OPENING is an optional entity in the COMPANY lists OPENING relationship. OPENING is existence-dependent on COMPANY, because there cannot be an opening unless a company lists it. If you decide to use the COMPANY primary key as a component of the OPENING's primary key, you have satisfied the conditions that will allow you to classify OPENING as a weak entity, and the relationship between COMPANY and OPENING will be strong or identifying. In other words, the OPENING entity is weak if its PK is the combination of OPENING_NUM and COMP_CODE. (The COMP_CODE remains the FK in the OPENING entity.)
Note that there is a 1:M relationship between COMPANY and OPENING, because a company can list multiple job openings. The next table segment shows that the WEST Company has two available job openings and the EAST Company has one available job opening. Naturally, the actual table would have additional attributes in it – but we’re merely illustrating the behavior of the PK components here.
COMP_CODE   OPENING_NUM
West        1
West        2
East        1
However, if the OPENING’s PK is defined to be a single OPENING attribute such as a unique OPENING_NUM, OPENING is no longer a weak entity. We have decided to use the latter approach in Figure P4.12. (If you use Microsoft Access to implement this design, OPENING_NUM may be declared as an autonumber.) Note that this decision causes the relationship between COMPANY and OPENING to be weak. (The relationship line is dashed.) In this case, the COMP_CODE attribute would continue to be the FK pointing to the COMPANY table, but it would no longer be a part of the OPENING entity PK. The next table segment shows what such an arrangement would look like:
OPENING_NUM   COMP_CODE
10025         West
10026         West
10027         East
• Similarly, the relationship between PLACEMENT and OPENING may be defined as strong or weak. We have used a weak relationship between OPENING and PLACEMENT.
• A job candidate may have had many jobs -- remember that TEC is a temp employer. Therefore, a candidate may have many entries in HISTORY. But keep in mind that a candidate may just have completed job training and, therefore, may not have had job experience (i.e., no job history) yet. In short, HISTORY is optional to CANDIDATE.
• To enable TEC or its clients to trace the entire employment record of any candidate, it is reasonable to expect that the HISTORY entity also records the job(s) held by the candidate before that candidate was placed by TEC. Only the portion of the job history created through TEC placement is reflected in the PLACEMENT entity. Therefore, PLACEMENT is optional to HISTORY. The semantics of the problem seem to suggest that HISTORY is an entity that exists in a 1:1 relationship with PLACEMENT. After all, each placement generates one (and only one) entry in the candidate’s history.
• Because each placement must generate an entry in the HISTORY entity, one would reasonably conclude that HISTORY is mandatory to PLACEMENT. Note that PLACEMENT is redundant, because a job placement obviously creates a job history entry. However, such a redundancy can be justified on the basis that PLACEMENT may be used to track job placement details that are of interest to TEC management.
• HISTORY is clearly existence-dependent on CANDIDATE; it is not possible to make an entry in HISTORY without having a CANDIDATE to generate that history. Given this scenario, the CANDIDATE entity’s primary key may be used as one of the components of the HISTORY entity's primary key, thus making HISTORY a weak entity.
• Each CANDIDATE may have earned one or more QUALIFICATIONs. Although a company may list a qualification, there may not be a matching candidate because it is possible that none of the candidates has this qualification. For instance, it is possible that none of the available candidates is a Pascal programmer. Therefore, CANDIDATE is optional to QUALIFICATION. However, many candidates may have a given qualification. For example, many candidates may be C++ programmers. Keep in mind that each qualification may be matched to many job candidates, so the relationship between CANDIDATE and QUALIFICATION is M:N. This relationship must be decomposed into two 1:M relationships with the help of a composite entity we will name EDUCATION. The EDUCATION entity will contain the qualification code, the candidate identification, the date on which the candidate earned the qualification, and so on. A few sample data entries might look like this:
QUAL_CODE   CAND_NUM   EDUC_DATE
PRG-VB      4358       12-Dec-00
PRG-C++     4358       05-Mar-03
DBA-ORA     4358       23-Nov-01
DBA-DB2     2113       02-Jun-85
DBA-ORA     2113       26-Jan-02
Note that the preceding table contents illustrate that candidate 4358 has three listed qualifications, while candidate 2113 has two listed qualifications. Note also that the qualification code DBA-ORA occurred more than once. Clearly, the PK must be a combination of QUAL_CODE and CAND_NUM, thus making the relationships between QUALIFICATION and EDUCATION and between EDUCATION and CANDIDATE strong. In this example, the EDUCATION entity is both weak and composite.
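The composite-PK design of EDUCATION can be sketched in sqlite3. Column names follow the sample table; everything else is an assumption for illustration:

```python
import sqlite3

# EDUCATION is a weak, composite entity: its PK combines the borrowed parent
# keys QUAL_CODE and CAND_NUM, which simultaneously serve as FKs.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE qualification (qual_code TEXT PRIMARY KEY);
    CREATE TABLE candidate     (cand_num  INTEGER PRIMARY KEY);
    CREATE TABLE education (
        qual_code TEXT    REFERENCES qualification(qual_code),
        cand_num  INTEGER REFERENCES candidate(cand_num),
        educ_date TEXT,
        PRIMARY KEY (qual_code, cand_num)   -- borrowed from both parents
    );
    INSERT INTO qualification VALUES ('PRG-VB'), ('DBA-ORA');
    INSERT INTO candidate VALUES (4358), (2113);
    INSERT INTO education VALUES
        ('PRG-VB',  4358, '12-Dec-00'),
        ('DBA-ORA', 4358, '23-Nov-01'),
        ('DBA-ORA', 2113, '26-Jan-02');
""")
# DBA-ORA appears twice, but each (qual_code, cand_num) pair is unique.
n = conn.execute(
    "SELECT COUNT(*) FROM education WHERE qual_code = 'DBA-ORA'"
).fetchone()[0]
print(n)  # 2
```

As with ENROLL earlier, a single-attribute surrogate PK plus a unique composite index over (qual_code, cand_num) would enforce the same uniqueness.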
NOTE If you use Visio to create the ERD, you only enter the EDUC_DATE column in the EDUCATION entity. Do not type the foreign key attributes under the column headings – Visio will automatically create the FK entries as you declare the relationships. In this ERD, select the relationships between QUALIFICATION and EDUCATION and between EDUCATION and CANDIDATE to be strong, thus ensuring that the relationship lines will be solid, rather than dashed. The QUAL_CODE and the CAND_NUM will automatically be inserted as PKs and as FKs in the EDUCATION entity. If you declare the QUAL_CODE and the CAND_NUM attributes in the EDUCATION entity and you then create the relationship lines, Visio will write duplicate QUAL_CODE and the CAND_NUM attributes into the EDUCATION entity as PKs and FKs. Clearly, this result is undesirable if the entity is implemented as a table. Follow this simple rule: Never declare a FK in an entity by creating it yourself. Let Visio write all the FKs as you establish the relationships. •
• Each job OPENING requires one QUALIFICATION, and any given qualification may fit many openings, thus producing a 1:M relationship between QUALIFICATION and OPENING. For example, a job opening for a C++ programmer requires an applicant to have the C++ programming qualification, but there may be many job openings for C++ programmers! However, a qualification does not require an opening. (After all, if there is no listing with a C++ requirement, a candidate who has the C++ qualification does not match the listing!) Therefore, OPENING is optional to QUALIFICATION. In the ERD shown in Figure P4.12, we decided to define the OPENING entity’s PK to be OPENING_NUM. This decision produces a non-identifying (weak) relationship between OPENING and QUALIFICATION. However, if you want to ensure that there cannot be a listed opening unless it also lists the required qualification for that opening, the OPENING is existence-dependent on QUALIFICATION. If you then decide to let the OPENING entity inherit QUAL_CODE from QUALIFICATION as part of its PK, OPENING is properly classified as a weak entity to QUALIFICATION.
• One or more candidates may fill a listed job opening. Also, keep in mind that, during some period of time, a candidate may fill many openings. (TEC supplies temporaries, remember?) Therefore, the relationship between OPENING and CANDIDATE is M:N. We will decompose this M:N relationship into two 1:M relationships, using the composite entity named PLACEMENT as the bridge between CANDIDATE and OPENING. Because a candidate is not necessarily placed, PLACEMENT is optional to CANDIDATE. Similarly, since an opening may be listed even when there is no available candidate, PLACEMENT is optional to OPENING.
Chapter 4 Entity Relationship (ER) Modeling

13. Use the following description of the operations of the RC_Charter2 Company to complete this exercise.
The RC_Charter2 Company operates a fleet of aircraft under the Federal Air Regulations (FAR) Part 135 (air taxi or charter) certificate, enforced by the FAA. The aircraft are available for air taxi (charter) operations within the United States and Canada.
Charter companies provide so-called “unscheduled” operations—that is, charter flights take place only after a customer reserves the use of an aircraft to fly at a customer-designated date and time to one or more customer-designated destinations, transporting passengers, cargo, or some combination of passengers and cargo. A customer can, of course, reserve many different charter flights (trips) during any time frame. However, for billing purposes, each charter trip is reserved by one and only one customer. Some of RC_Charter2’s customers do not use the company’s charter operations; instead, they purchase fuel, use maintenance services, or use other RC_Charter2 services. However, this database design will focus on the charter operations only.
Each charter trip yields revenue for the RC_Charter2 Company. That revenue is generated by the charges that a customer pays upon the completion of a flight. The charter flight charges are a function of aircraft model used, distance flown, waiting time, special customer requirements, and crew expenses. The distance flown charges are computed by multiplying the round-trip miles by the model’s charge per mile. Round-trip miles are based on the actual navigational path flown. The sample route traced in Figure P4.13a illustrates the procedure. Note that the number of round-trip miles is calculated to be 130 + 200 + 180 + 390 = 900.
FIGURE P4.13a ROUND-TRIP MILE DETERMINATION
(The figure traces the route: Home Base to Pax Pickup, 130 miles; Pax Pickup to Intermediate Stop, 200 miles; Intermediate Stop to Destination, 180 miles; Destination back to Home Base, 390 miles.)
Depending on whether a customer has RC_Charter2 credit authorization, the customer may:
a) Pay the entire charter bill upon the completion of the charter flight.
b) Pay a part of the charter bill and charge the remainder to the account. The charge amount may not exceed the available credit.
c) Charge the entire charter bill to the account. The charge amount may not exceed the available credit.
d) Customers may pay all or part of the existing balance for previous charter trips. Such payments may be made at any time and are not necessarily tied to a specific charter trip.

The charter mileage charge includes the expense of the pilot(s) and other crew required by FAR 135. However, if customers request additional crew not required by FAR 135, those customers are charged for the crew members on an hourly basis. The hourly crew member charge is based on each crew member’s qualifications.
The database must be able to handle crew assignment. Each charter trip requires the use of an aircraft, and a crew flies each aircraft. The smaller piston engine-powered charter aircraft require a crew consisting of only a single pilot. Larger aircraft (that is, aircraft having a gross takeoff weight of 12,500 pounds or more) and jet-powered aircraft require a pilot and a copilot, while some of the larger aircraft used to transport passengers may require flight attendants as part of the crew. Some of the older aircraft require the assignment of a flight engineer, and larger cargo-carrying aircraft require the assignment of a loadmaster. In short, a crew can consist of more than one person and not all crew members are pilots.
The charter flight’s aircraft waiting charges are computed by multiplying the hours waited by the model’s hourly waiting charge. Crew expenses are limited to meals, lodging, and ground transportation.
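The distance and waiting charge rules described so far can be sketched as a small function. The function name, parameter names, and the rates used in the example are illustrative assumptions, not values from the text.

```python
# Hedged sketch of the charter charge rules: distance charge equals
# round-trip miles times the model's charge per mile, and waiting charge
# equals hours waited times the model's hourly waiting charge.
def charter_charges(round_trip_miles, charge_per_mile,
                    hours_waited, waiting_charge_per_hour,
                    crew_expenses=0.0):
    distance_charge = round_trip_miles * charge_per_mile
    waiting_charge = hours_waited * waiting_charge_per_hour
    return distance_charge + waiting_charge + crew_expenses

# Using the 900 round-trip miles from Figure P4.13a with assumed rates
# of $2.50 per mile and $100 per waiting hour:
total = charter_charges(900, 2.50, 3, 100.00)
print(total)  # 2250.0 distance + 300.0 waiting = 2550.0
```

In a real design these rates would be attributes of the aircraft MODEL entity, which is why the charges are "a function of aircraft model used."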
The RC_Charter2 database must be designed to generate a monthly summary of all charter trips, expenses, and revenues derived from the charter records. Such records are based on the data that each pilot in command is required to record for each charter trip: trip date(s) and time(s), destination(s), aircraft number, pilot (and other crew) data, distance flown, fuel usage, and other data pertinent to the charter flight. Such charter data are then used to generate monthly reports that detail revenue and operating cost information for customers, aircraft, and pilots.

All pilots and other crew members are RC_Charter2 Company employees; that is, the company does not use contract pilots and crew. FAR Part 135 operations are conducted under a strict set of requirements that govern the licensing and training of crew members. For example, pilots must have earned either a Commercial license or an Airline Transport Pilot (ATP) license. Both licenses require appropriate ratings. Ratings are specific competency requirements. For example:

• To operate a multiengine aircraft designed for takeoffs and landings on land only, the appropriate rating is MEL, or Multiengine Landplane. When a multiengine aircraft can take off and land on water, the appropriate rating is MES, or Multiengine Seaplane.
• The instrument rating is based on a demonstrated ability to conduct all flight operations with sole reference to cockpit instrumentation. The instrument rating is required to operate an aircraft under Instrument Meteorological Conditions (IMC), and all such operations are governed under FAR-specified Instrument Flight Rules (IFR). In contrast, operations conducted under “good weather” or visual flight conditions are based on the FAR Visual Flight Rules (VFR).
• The type rating is required for all aircraft with a takeoff weight of more than 12,500 pounds or for aircraft that are purely jet-powered. If an aircraft uses jet engines to drive propellers, that aircraft is said to be turboprop-powered. A turboprop—that is, a turbo propeller-powered aircraft—does not require a type rating unless it meets the 12,500-pound weight limitation.
Although pilot licenses and ratings are not time-limited, exercising the privilege of the license and ratings under Part 135 requires both a current medical certificate and a current Part 135 checkride. The following distinctions are important:

• The medical certificate may be Class I or Class II. The Class I medical is more stringent than the Class II, and it must be renewed every six months. The Class II medical must be renewed yearly. If the Class I medical is not renewed during the six-month period, it automatically reverts to a Class II certificate. If the Class II medical is not renewed within the specified period, it automatically reverts to a Class III medical, which is not valid for commercial flight operations.
• A Part 135 checkride is a practical flight examination that must be successfully completed every six months. The checkride includes all flight maneuvers and procedures specified in Part 135.

Nonpilot crew members must also have the proper certificates in order to meet specific job requirements. For example, loadmasters need an appropriate certificate, as do flight attendants. In addition, crew members such as loadmasters and flight attendants, who may be required in operations that involve large aircraft (more than a 12,500-pound takeoff weight and passenger configurations over 19), are also required periodically to pass a written and practical exam. The RC_Charter2 Company is required to keep a complete record of all test types, dates, and results for each crew member, as well as pilot medical certificate examination dates. In addition, all flight crew members are required to submit to periodic drug testing; the results must be tracked, too. (Note that nonpilot crew members are not required to take pilot-specific tests such as Part 135 checkrides. Nor are pilots required to take crew tests such as loadmaster and flight attendant practical exams.) However, many crew members have licenses and/or certifications in several areas.
For example, a pilot may have an ATP and a loadmaster certificate. If that pilot is assigned to be a loadmaster on a given charter flight, the loadmaster certificate is required. Similarly, a flight attendant may have earned a commercial pilot’s license. Sample data formats are shown in Table P4.13.
TABLE P4.13 SAMPLE DATA FORMATS

Part A Tests
TEST CODE  TEST DESCRIPTION            TEST FREQUENCY
1          Part 135 Flight Check       6 months
2          Medical, Class 1            6 months
3          Medical, Class 2            12 months
4          Loadmaster Practical        12 months
5          Flight Attendant Practical  12 months
6          Drug test                   Random
7          Operations, written exam    6 months
Part B Results
EMPLOYEE  TEST CODE  TEST DATE  TEST RESULT
101       1          12-Nov-17  Pass-1
103       6          23-Dec-17  Pass-1
112       4          23-Dec-17  Pass-2
103       7          11-Jan-18  Pass-1
112       7          16-Jan-18  Pass-1
101       7          16-Jan-18  Pass-1
101       6          11-Feb-18  Pass-2
125       2          15-Feb-18  Pass-1
Part C Licenses and Certificates
LICENSE OR CERTIFICATE  LICENSE OR CERTIFICATE DESCRIPTION
ATP                     Airline Transport Pilot
Comm                    Commercial license
Med-1                   Medical certificate, class 1
Med-2                   Medical certificate, class 2
Instr                   Instrument rating
MEL                     Multiengine Land aircraft rating
LM                      Loadmaster
FA                      Flight Attendant
Part D Licenses and Certificates Held by Employees
EMPLOYEE  LICENSE OR CERTIFICATE  DATE EARNED
101       Comm                    12-Nov-93
101       Instr                   28-Jun-94
101       MEL                     9-Aug-94
103       Comm                    21-Dec-95
112       FA                      23-Jun-02
103       Instr                   18-Jan-96
112       LM                      27-Nov-05
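Parts A and B of Table P4.13 suggest a TEST entity and a composite TEST_RESULT entity between EMPLOYEE and TEST. The following is a hedged SQLite sketch loaded with a few of the sample rows; the EMPLOYEE table is omitted for brevity, and table and column names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE TEST (
    TEST_CODE        INTEGER PRIMARY KEY,
    TEST_DESCRIPTION TEXT,
    TEST_FREQUENCY   TEXT
);
-- Composite entity between EMPLOYEE and TEST. TEST_DATE is part of the
-- PK because an employee retakes the same test over time (Part B shows
-- employee 101 taking the drug test twice).
CREATE TABLE TEST_RESULT (
    EMP_NUM     INTEGER NOT NULL,
    TEST_CODE   INTEGER NOT NULL REFERENCES TEST,
    TEST_DATE   TEXT    NOT NULL,
    TEST_RESULT TEXT,
    PRIMARY KEY (EMP_NUM, TEST_CODE, TEST_DATE)
);
""")
conn.executemany("INSERT INTO TEST VALUES (?, ?, ?)", [
    (1, 'Part 135 Flight Check', '6 months'),
    (6, 'Drug test', 'Random'),
])
conn.executemany("INSERT INTO TEST_RESULT VALUES (?, ?, ?, ?)", [
    (101, 1, '2017-11-12', 'Pass-1'),
    (103, 6, '2017-12-23', 'Pass-1'),
    (101, 6, '2018-02-11', 'Pass-2'),
])
# Who has taken the drug test, in date order?
rows = conn.execute(
    "SELECT EMP_NUM FROM TEST_RESULT WHERE TEST_CODE = 6 ORDER BY TEST_DATE"
).fetchall()
print(rows)  # [(103,), (101,)]
```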
Pilots and other crew members must receive recurrency training appropriate to their work assignments. Recurrency training is based on an FAA-approved curriculum that is job-specific. For example, pilot recurrency training includes a review of all applicable Part 135 flight rules and regulations, weather data interpretation, company flight operations requirements, and specified flight procedures. The RC_Charter2 Company is required to keep a complete record of all recurrency training for each crew member subject to the training.

The RC_Charter2 Company is required to maintain a detailed record of all crew credentials and all training mandated by Part 135. The company must keep a complete record of each requirement and of all compliance data.

To conduct a charter flight, the company must have a properly maintained aircraft available. A pilot who meets all of the FAA’s licensing and currency requirements must fly the aircraft as Pilot in Command (PIC). For those aircraft that are powered by piston engines or turboprops and have a gross takeoff weight under 12,500 pounds, single-pilot operations are permitted under Part 135 as long as a properly maintained autopilot is available. However, even if FAR Part 135 permits single-pilot operations, many customers require the presence of a copilot who is capable of conducting the flight operations under Part 135.

The RC_Charter2 operations manager anticipates the lease of turbojet-powered aircraft, and those aircraft are required to have a crew consisting of a pilot and copilot. Both pilot and copilot must meet the same Part 135 licensing, ratings, and training requirements. The company also leases larger aircraft that exceed the 12,500-pound gross takeoff weight. Those aircraft can carry enough passengers to require the presence of one or more flight attendants. If those aircraft carry cargo weighing over 12,500 pounds, a loadmaster must be assigned as a crew member to supervise the loading and securing of the cargo.
The database must be designed to meet the anticipated additional charter crew assignment capability.

a. Given this incomplete description of operations, write all applicable business rules to establish entities, relationships, optionalities, connectivities, and cardinalities. (Hint: Use the following five business rules as examples, writing the remaining business rules in the same format.)
• A customer may request many charter trips.
• Each charter trip is requested by only one customer.
• Some customers have not yet requested a charter trip.
• An employee may be assigned to serve as a crew member on many charter trips.
• Each charter trip may have many employees assigned to it to serve as crew members.

b. Draw the fully labeled and implementable Crow’s Foot ERD based on the business rules you wrote in Part a of this problem. Include all entities, relationships, optionalities, connectivities, and cardinalities.

The following business rules can be derived from the description of operations:
• A customer may request many charter trips.
• Each charter trip is requested by only one customer.
• Some customers have not (yet) requested a charter trip.
• Every charter trip is requested by at least one customer.
• An employee may be assigned to serve as a crew member on many charter trips.
• Each charter trip may have many employees assigned to it to serve as crew members.
• An employee may not yet have been assigned to serve as a crew member on any charter trip.
• A charter trip may not yet have any employee assigned to serve as a crew member.
• Each customer may make many payments.
• Some customers have not made any payments yet.
• Every payment is made by only one customer.
• Every payment must have been made by a customer.
• A payment may be toward many charter trips.
• A payment may not be in reference to any charter trip.
• Every charter trip must have a payment made.
• Each charter trip has only one payment.
• Every charter trip involves the use of a single aircraft.
• Every charter trip requires at least one aircraft.
• An aircraft may be used for many charter trips.
• An aircraft may not yet have been used for any charter trip.
• Each aircraft is only one model airplane.
• Every aircraft has a model designation.
• An airplane model is not required to be associated with any aircraft that the company owns.
• The company may own many aircraft of a given model.
• A given flight assignment may be given to many crew members.
• Some flight assignments may not have ever been given to any crew member.
• Every crew member assignment is associated with a flight assignment.
• Every crew member assignment is associated with only one flight assignment.
• An employee may have taken many tests.
• Some employees may have taken no tests yet.
• A test may be taken by many employees.
• A test may not have been taken by any employee yet.
• Each employee has one job with the company.
• Every employee has only one job with the company.
• A job may be done by many employees.
• A job may be currently unfilled and not be associated with any employee.
• An employee may be a pilot, and every pilot is an employee.
• A pilot may have earned many ratings.
• Some pilots have not earned any rating yet.
• A rating may be earned by many pilots.
• Some ratings are not held by any pilots.
• A pilot may have many licenses.
• A pilot may not have any license yet.
• A license may be held by many pilots.
• A license may not be held by any pilot yet.
• Every employee can have many qualifications.
• Some employees do not have any qualifications.
• Each qualification can be held by many employees.
• Some qualifications are not held by any employee.
The completed ERD is shown in Figure P4.13b.
Figure P4.13b The RC_Charter2 Flight Department Crow’s Foot ERD
Ch05-Advanced Data Modeling
Chapter 5 Advanced Data Modeling

Discussion Focus

Your discussion can be divided into three parts to reflect the chapter coverage:

The first part of the discussion covers the Extended Entity Relationship Model.
a. Start by exploring the use of entity supertypes and subtypes.
b. Use the specialization hierarchy example in Figure 5.2 to illustrate the main constructs.
c. Illustrate the benefits of attribute inheritance and relationship inheritance.
d. Remember that an entity supertype and an entity subtype are related in a 1:1 relationship.
e. Emphasize the use of the subtype discriminator and then explain the concept of overlapping and disjoint constraints in relation to entity subtypes.
f. The completeness constraint indicates whether every supertype occurrence must be a member of at least one subtype.
g. Explore the specialization and generalization hierarchies.
h. Finally, explain the use of entity clusters as an alternative method to simplify crowded data models.

The second part of the discussion covers the importance of proper primary key selection.
a. Start by clearly stating the function of a PK -- identification -- and how that function differs from the descriptive nature of the other attributes in an entity. Explain the use of PKs to uniquely identify each entity instance.
b. Discuss natural keys, primary keys, and surrogate keys.
c. Examine the primary key guidelines that specify the PK characteristics. PKs must be unique and nonintelligent, must not change over time, and are ideally single-attribute, numeric, and security compliant.
d. Finally, contrast the use of surrogate and composite primary keys. Remind students that composite primary keys are useful in composite entities where each primary key combination is allowed only once in the M:N relationship.

The third part of the discussion covers four special design cases:
a. Implementing 1:1 relationships.
b. Maintaining the history of time-variant data.
c. Fan traps.
d. Redundant relationships.
Answers to Review Questions

1. What is an entity supertype, and why is it used?
An entity supertype is a generic entity type that is related to one or more entity subtypes, where the entity supertype contains the common characteristics and the entity subtypes contain the unique characteristics of each entity subtype. The reason for using supertypes is to minimize the number of nulls and to minimize the likelihood of redundant relationships.

2. What kinds of data would you store in an entity subtype?
An entity subtype is a more specific entity type that is related to an entity supertype, where the entity supertype contains the common characteristics and the entity subtypes contain the unique characteristics of each entity subtype. The entity subtype will store the data that is specific to the entity; that is, attributes that are unique to the subtype.

3. What is a specialization hierarchy?
A specialization hierarchy depicts the arrangement of higher-level entity supertypes (parent entities) and lower-level entity subtypes (child entities). To answer the question precisely, we have used the text’s Figure 5.2. (We have reproduced the figure on the next page for your convenience.) Figure 5.2 shows the specialization hierarchy formed by an EMPLOYEE supertype and three entity subtypes—PILOT, MECHANIC, and ACCOUNTANT.
(Text) FIGURE 5.2 A Specialization Hierarchy
The specialization hierarchy shown in Figure 5.2 reflects the 1:1 relationship between EMPLOYEE and its subtypes. For example, a PILOT subtype occurrence is related to one instance of the EMPLOYEE supertype, and a MECHANIC subtype occurrence is related to one instance of the EMPLOYEE supertype.

4. What is a subtype discriminator? Give an example of its use.
A subtype discriminator is the attribute in the supertype entity that is used to determine to which entity subtype the supertype occurrence is related. For any given supertype occurrence, the value of the subtype discriminator determines which subtype the supertype occurrence is related to. For example, an EMPLOYEE supertype may include the EMP_TYPE value “P” to indicate the PROFESSOR subtype.

5. What is an overlapping subtype? Give an example.
Overlapping subtypes are subtypes that contain non-unique subsets of the supertype entity set; that is, each entity instance of the supertype may appear in more than one subtype. For example, in a university environment, a person may be an employee or a student or both. In turn, an employee may be a professor as well as an administrator. Because an employee also may be a student, STUDENT and EMPLOYEE are overlapping subtypes of the supertype PERSON, just as PROFESSOR and ADMINISTRATOR are overlapping subtypes of the supertype EMPLOYEE. The text’s Figure 5.4 (reproduced next for your convenience) illustrates overlapping subtypes with the use of the letter O inside the category shape.
(Text) FIGURE 5.4 Specialization Hierarchy with Overlapping Subtypes

6. What is a disjoint subtype? Give an example.
Disjoint subtypes, also known as nonoverlapping subtypes, are subtypes that contain a unique subset of the supertype entity set; in other words, each entity instance of the supertype can appear in only one of the subtypes. For example, in Figure 5.2, shown in Question 3, an employee (supertype) who is a pilot (subtype) can appear only in the PILOT subtype, not in any of the other subtypes. In an ERD, such disjoint subtypes are indicated by the letter d inside the category shape. See Figure 5.2 in the textbook or in Question 3. Also, see Figure 5.5 Disjoint and Overlapping Subtypes in the textbook.

7. What is the difference between partial completeness and total completeness?
Partial completeness means that not every supertype occurrence is a member of a subtype; that is, there may be some supertype occurrences that are not members of any subtype. Total completeness means that every supertype occurrence must be a member of at least one subtype.
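A disjoint specialization hierarchy like Figure 5.2 can be sketched as tables. The following is a minimal SQLite illustration with assumed column names: the supertype carries the subtype discriminator, and the subtype shares the supertype's PK, which implements the 1:1 relationship between them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Supertype: shared attributes plus the EMP_TYPE subtype discriminator.
CREATE TABLE EMPLOYEE (
    EMP_NUM   INTEGER PRIMARY KEY,
    EMP_LNAME TEXT NOT NULL,
    EMP_TYPE  TEXT CHECK (EMP_TYPE IN ('P', 'M', 'A'))  -- pilot/mechanic/accountant
);
-- Subtype: shares the supertype's PK (the 1:1 relationship) and holds
-- only the pilot-specific attributes.
CREATE TABLE PILOT (
    EMP_NUM     INTEGER PRIMARY KEY REFERENCES EMPLOYEE,
    PIL_LICENSE TEXT
);
""")
conn.execute("INSERT INTO EMPLOYEE VALUES (101, 'Kolmycz', 'P')")
conn.execute("INSERT INTO PILOT VALUES (101, 'ATP')")
try:
    # A subtype row cannot exist without its supertype row.
    conn.execute("INSERT INTO PILOT VALUES (999, 'ATP')")
except sqlite3.IntegrityError:
    print("orphan subtype row rejected")
```

Enforcing the disjoint constraint itself (an EMP_NUM appearing in only one subtype table) requires additional application or trigger logic not shown here.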
For questions 8–10, refer to Figure Q5.8.
FIGURE Q5.8 The PRODUCT data model
8. List all of the attributes of a movie.
Recall that the subtype inherits all of the attributes and relationships of the supertype. Therefore, all of the attributes of a subtype include the common attributes from the supertype plus the unique (unique to that subtype) attributes from the subtype. All of the attributes of a movie would be:
• Prod_Num
• Prod_Title
• Prod_ReleaseDate
• Prod_Price
• Prod_Type
• Movie_Rating
• Movie_Director
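At the implementation level, this attribute inheritance is realized by a join: the subtype table shares the supertype's PK, and joining the two reassembles the movie's full attribute set. The following is a hedged SQLite sketch; the table and column names follow Figure Q5.8, while the sample data values are assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PRODUCT (
    Prod_Num         INTEGER PRIMARY KEY,
    Prod_Title       TEXT,
    Prod_ReleaseDate TEXT,
    Prod_Price       REAL,
    Prod_Type        TEXT
);
CREATE TABLE MOVIE (
    Prod_Num       INTEGER PRIMARY KEY REFERENCES PRODUCT,
    Movie_Rating   TEXT,
    Movie_Director TEXT
);
""")
conn.execute("INSERT INTO PRODUCT VALUES (1, 'Sample Film', '2017-05-01', 19.99, 'MOVIE')")
conn.execute("INSERT INTO MOVIE VALUES (1, 'PG', 'J. Doe')")
# The join returns the five common attributes from PRODUCT plus the two
# unique attributes from MOVIE: seven attributes in all.
row = conn.execute("""
    SELECT P.*, M.Movie_Rating, M.Movie_Director
    FROM PRODUCT P JOIN MOVIE M ON P.Prod_Num = M.Prod_Num
""").fetchone()
print(len(row))  # 7
```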
9. According to the data model, is it required that every entity instance in the PRODUCT table be associated with an entity instance in the CD table? Why or why not?
No. The completeness constraint for the data model shows a total completeness constraint from PRODUCT to the subtypes. However, the total completeness constraint indicates that every instance in the supertype (PRODUCT) must be associated with one row in some subtype, not all subtypes. Since the subtypes are designated as disjoint, or exclusive, every row in the supertype is associated with a row in only one subtype. For some products that subtype will be CD, but for other products the subtype will be either Movie or Book.
10. Is it possible for a book to appear in the BOOK table without appearing in the PRODUCT table? Why or why not?
No. Subtypes can only exist within the context of a supertype. The subtype’s primary key is also a foreign key that references the supertype’s primary key, so a BOOK row cannot exist without a corresponding PRODUCT row.
11. What is an entity cluster, and what advantages are derived from its use?
An entity cluster is a “virtual” entity type used to represent multiple entities and relationships in the ERD. An entity cluster is formed by combining multiple interrelated entities into a single abstract entity object. An entity cluster is considered “virtual” or “abstract” in the sense that it is not actually an entity in the final ERD, but rather a temporary entity used to represent multiple entities and relationships with the purpose of simplifying the ERD and thus enhancing its readability.

12. What primary key characteristics are considered desirable? Explain why each characteristic is considered desirable.
Desirable PK characteristics are summarized in the text’s Table 5.3, reproduced below for your convenience, together with the reason why each characteristic is desirable.

• Unique values: The PK must uniquely identify each entity instance. A primary key must be able to guarantee unique values. It cannot contain nulls.
• Nonintelligent: The PK should not have embedded semantic meaning. An attribute with embedded semantic meaning is probably better used as a descriptive characteristic of the entity rather than as an identifier. In other words, a student ID of “650973” would be preferred over “Smith, Martha L.” as a primary key identifier.
• No change over time: If an attribute has semantic meaning, it may be subject to updates. This is why names do not make good primary keys. If you have “Vickie Smith” as the primary key, what happens when she gets married? If a primary key is subject to change, the foreign key values must be updated, thus adding to the database work load. Furthermore, changing a primary key value means that you are basically changing the identity of an entity.
• Preferably single-attribute: A primary key should have the minimum number of attributes possible. Single-attribute primary keys are desirable but not required. Single-attribute primary keys simplify the implementation of foreign keys. Having multiple-attribute primary keys can cause primary keys of related entities to grow through the possible addition of many attributes, thus adding to the database work load and making (application) coding more cumbersome.
• Preferably numeric: Unique values can be better managed when they are numeric because the database can use internal routines to implement a “counter-style” attribute that automatically increments values with the addition of each new row. In fact, most database systems include the ability to use special constructs, such as Autonumber in MS Access, to support self-incrementing primary key attributes.
• Security compliant: The selected primary key must not be composed of any attribute(s) that might be considered a security risk or violation. For example, using a Social Security number as a PK in an EMPLOYEE table is not a good idea.

TABLE 5.3 Desirable Primary Key Characteristics

13. Under what circumstances are composite primary keys appropriate?
Composite primary keys are particularly useful in two cases:
• As identifiers of composite entities, where each primary key combination is allowed only once in the M:N relationship.
• As identifiers of weak entities, where the weak entity has a strong identifying relationship with the parent entity.
To illustrate the first case, assume that you have a STUDENT entity set and a CLASS entity set. In addition, assume that those two sets are related in a M:N relationship via an ENROLL entity set in which each student/class combination may appear only once in the composite entity. The text’s Figure 5.7 (reproduced here for your convenience) shows the ERD to represent such a relationship.
(Text) FIGURE 5.7 M:N Relationship Between Student and Class
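The "each student/class combination only once" property of the ENROLL composite entity can be sketched in SQL. This is a minimal SQLite illustration; attribute names beyond the two keys are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE STUDENT (STU_NUM INTEGER PRIMARY KEY);
CREATE TABLE CLASS   (CLASS_CODE INTEGER PRIMARY KEY);
-- Composite PK: the same student/class combination may appear only once.
CREATE TABLE ENROLL (
    STU_NUM      INTEGER NOT NULL REFERENCES STUDENT,
    CLASS_CODE   INTEGER NOT NULL REFERENCES CLASS,
    ENROLL_GRADE TEXT,
    PRIMARY KEY (STU_NUM, CLASS_CODE)
);
""")
conn.execute("INSERT INTO STUDENT VALUES (321452)")
conn.execute("INSERT INTO CLASS VALUES (10014)")
conn.execute("INSERT INTO ENROLL VALUES (321452, 10014, 'A')")
try:
    # The same student enrolling in the same class a second time is
    # rejected by the composite PK.
    conn.execute("INSERT INTO ENROLL VALUES (321452, 10014, 'B')")
except sqlite3.IntegrityError:
    print("duplicate enrollment rejected")
```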
As shown in the text’s Figure 5.7, the composite primary key automatically provides the benefit of ensuring that there cannot be duplicate values—that is, it ensures that the same student cannot enroll more than once in the same class.

In the second case, a weak entity in a strong identifying relationship with a parent entity is normally used to represent one of two cases:
1. A real-world object that is existence-dependent on another real-world object. Such objects are distinguishable in the real world. A dependent and an employee are two separate people who exist independently of each other. However, such objects can exist in the model only when they relate to each other in a strong identifying relationship. For example, the relationship between EMPLOYEE and DEPENDENT is one of existence dependency in which the primary key of the dependent entity is a composite key that contains the key of the parent entity.
2. A real-world object that is represented in the data model as two separate entities in a strong identifying relationship. For example, the real-world invoice object is represented by two entities in a data model: INVOICE and LINE. Clearly, the LINE entity does not exist in the real world as an independent object, but rather as part of an INVOICE.

In both cases, having a strong identifying relationship ensures that the dependent entity can exist only when it is related to the parent entity. In summary, the selection of a composite primary key for composite and weak entity types provides benefits that enhance the integrity and consistency of the model.

14. What is a surrogate primary key, and when would you use one?
A surrogate primary key is an “artificial” PK that is used to uniquely identify each entity occurrence when there is no good natural key available or when the “natural” PK would include multiple attributes. A surrogate PK is also used if the natural PK would be a long text value.
The reasons for using a surrogate PK are to ensure entity integrity, to simplify application development by making queries simpler, to ensure query efficiency (for example, a query based on a simple numeric attribute is much faster than one based on a 200-character string), and to ensure that relationships between entities can be created more easily than would be the case with a composite PK that may have to be used as a FK in a related entity.

15. When implementing a 1:1 relationship, where should you place the foreign key if one side is mandatory and one side is optional? Should the foreign key be mandatory or optional?
Section 5.4.1 provides a detailed discussion. The text’s Table 5.5, reproduced here for your convenience, shows the rationale for selecting the foreign key in a 1:1 relationship based on the relationship properties in the ERD.

Case I: One side is mandatory and the other side is optional. Action: Place the PK of the entity on the mandatory side in the entity on the optional side as a FK, and make the FK mandatory.
Case II: Both sides are optional. Action: Select the FK that causes the fewest number of nulls, or place the FK in the entity in which the (relationship) role is played.
Case III: Both sides are mandatory. Action: See Case II, or consider revising your model to ensure that the two entities do not belong together in a single entity.

TABLE 5.5 Selection of Foreign Key in a 1:1 Relationship

16. What are time-variant data, and how would you deal with such data from a database design point of view?
As the label implies, time-variant data are time-sensitive. For example, if a university wants to keep track of the history of all administrative appointments by date of appointment and date of termination, you see time-variant data at work. From a design point of view, time-variant data are handled by adding a history entity related to the original entity in a 1:M relationship, with attributes that record each value and the date (or date range) during which that value was in effect.

17. What is the most common design trap, and how does it occur?
A design trap occurs when a relationship is improperly or incompletely identified and is therefore represented in a way that is not consistent with the real world. The most common design trap is known as a fan trap. A fan trap occurs when you have one entity in two 1:M relationships to other entities, thus producing an association among the other entities that is not expressed in the model.
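Case I of Table 5.5 can be sketched with a 1:1 "EMPLOYEE manages STORE" example, an assumption chosen purely for illustration. The FK is placed on the optional side (STORE), made mandatory with NOT NULL, and made UNIQUE so the connectivity stays 1:1 rather than 1:M.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE EMPLOYEE (
    EMP_CODE  INTEGER PRIMARY KEY,
    EMP_LNAME TEXT
);
-- Case I: FK on the optional side, mandatory (NOT NULL), and UNIQUE
-- so that no employee can manage two stores.
CREATE TABLE STORE (
    STORE_CODE INTEGER PRIMARY KEY,
    STORE_NAME TEXT,
    EMP_CODE   INTEGER NOT NULL UNIQUE REFERENCES EMPLOYEE
);
""")
conn.execute("INSERT INTO EMPLOYEE VALUES (1, 'Ramas')")
conn.execute("INSERT INTO EMPLOYEE VALUES (2, 'Smith')")  # manages no store (optional side)
conn.execute("INSERT INTO STORE VALUES (10, 'Access Junction', 1)")
try:
    # Every store must have a manager; a NULL FK violates the
    # mandatory participation.
    conn.execute("INSERT INTO STORE VALUES (11, 'Database Corner', ?)", (None,))
except sqlite3.IntegrityError:
    print("store without a manager rejected")
```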
Problem Solutions

1. Given the following business scenario, create a Crow’s Foot ERD using a specialization hierarchy if appropriate.

Two-Bit Drilling Company keeps information on employees and their insurance dependents. Each employee has an employee number, name, date of hire, and title. If an employee is an inspector, then the date of certification and the renewal date for that certification should also be recorded in the system. For all employees, the Social Security number and dependent names should be kept. All dependents must be associated with one and only one employee. Some employees will not have dependents, while others will have many dependents.

The data model for this solution is shown in Figure P5.1 below.
FIGURE P5.1 Two-Bit Drilling Company ERD
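A hedged relational sketch of the Figure P5.1 design, assuming SQLite and illustrative column names: the single INSPECTOR subtype shares the EMPLOYEE PK, and DEPENDENT is a weak entity whose composite PK borrows the parent's key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE EMPLOYEE (
    EMP_NUM      INTEGER PRIMARY KEY,
    EMP_NAME     TEXT NOT NULL,
    EMP_HIREDATE TEXT,
    EMP_TITLE    TEXT,
    EMP_SSN      TEXT
);
-- The single INSPECTOR subtype holds only the certification attributes.
CREATE TABLE INSPECTOR (
    EMP_NUM           INTEGER PRIMARY KEY REFERENCES EMPLOYEE,
    INSP_CERT_DATE    TEXT,
    INSP_RENEWAL_DATE TEXT
);
-- Weak entity: the composite PK borrows the parent key, and the
-- mandatory FK ties every dependent to exactly one employee.
CREATE TABLE DEPENDENT (
    EMP_NUM  INTEGER NOT NULL REFERENCES EMPLOYEE,
    DEP_NAME TEXT NOT NULL,
    PRIMARY KEY (EMP_NUM, DEP_NAME)
);
""")
conn.execute("INSERT INTO EMPLOYEE VALUES (1, 'Pat Lee', '2015-03-01', 'Inspector', '111-22-3333')")
conn.execute("INSERT INTO INSPECTOR VALUES (1, '2016-06-01', '2018-06-01')")
conn.execute("INSERT INTO DEPENDENT VALUES (1, 'Chris Lee')")
```

The optional participation of DEPENDENT (some employees have none) falls out naturally: nothing forces an EMPLOYEE row to have a matching DEPENDENT row.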
In this scenario, a specialization hierarchy is appropriate because there is an identifiable type or kind of employee (inspectors), and additional attributes are recorded that are specific to just that kind or type. It is worth noting that if there is only a single subtype, the disjoint/overlapping designation may be omitted: if there is only one subtype, then there is no other subtype to overlap or be disjoint from. Also, when there is only a single subtype, the completeness constraint is always partial completeness. If the completeness constraint were identified as total completeness, that would mean that every employee must be an inspector, in which case inspector would be a synonym for employee, not a kind of employee.

2. Given the following business scenario, create a Crow’s Foot ERD using a specialization hierarchy if appropriate.
Tiny Hospital keeps information on patients and hospital rooms. The system assigns each patient a patient ID number. In addition, the patient’s name and date of birth are recorded. Some patients are resident patients (they spend at least one night in the hospital) and others are outpatients (they are treated and released). Resident patients are assigned to a room. Each room is identified by a room number. The system also stores the room type (private or semiprivate), and room fee. Over time, each room will have many patients that stay in it. Each resident patient will stay in only one room. Every room must have had a patient, and every resident patient must have a room. The data model for this scenario is given in Figure P5.2 below.
FIGURE P5.2 Tiny Hospital ERD
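The point that optional participation, rather than a subtype, captures the resident/outpatient distinction can be sketched with a nullable foreign key. Names below are assumed for illustration (SQLite via Python):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A nullable Room_Num models optional participation: a NULL value
# marks an outpatient, a non-NULL value marks a resident patient.
con.executescript("""
CREATE TABLE ROOM (
    Room_Num  INTEGER PRIMARY KEY,
    Room_Type TEXT,      -- 'private' or 'semiprivate'
    Room_Fee  REAL
);
CREATE TABLE PATIENT (
    Pat_Num  INTEGER PRIMARY KEY,
    Pat_Name TEXT,
    Pat_DOB  TEXT,
    Room_Num INTEGER REFERENCES ROOM   -- NULL => outpatient
);
INSERT INTO ROOM VALUES (101, 'private', 250.00);
INSERT INTO PATIENT VALUES (1, 'Adams', '1990-01-01', 101);  -- resident
INSERT INTO PATIENT VALUES (2, 'Baker', '1985-06-15', NULL); -- outpatient
""")

# Resident patients are exactly those with a room assigned.
residents = con.execute(
    "SELECT Pat_Name FROM PATIENT WHERE Room_Num IS NOT NULL").fetchall()
print(residents)  # [('Adams',)]
```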
Note that in this scenario, a specialization hierarchy is not appropriate. While resident patients are an identifiable kind or type of patient instance, there are no additional attributes that are unique to only that kind or type of patient. Participation in a relationship that is unique to a particular kind or type of instance is not sufficient justification for a specialization hierarchy. Indicating that only some instances will participate in a relationship is addressed by the optional participation designation. In this scenario, all resident patients must have a room; however, not all patients are resident patients, so ROOM is optional to PATIENT. If students ask about the need for an attribute to distinguish between outpatients and resident patients, remind them that in this limited scenario the only distinction between outpatients and resident patients is whether or not they are associated with a room. Therefore, the Room_Num foreign key in the PATIENT table can serve in that capacity.

3. Given the following business scenario, create a Crow's Foot ERD using a specialization hierarchy if appropriate.

Granite Sales Company keeps information on employees and the departments that they work in. For each department, the department name, internal mail box number, and office phone extension are kept. A department can have many assigned employees, and each employee is assigned to only one department. Employees can be salaried employees, hourly employees, or contract employees. All employees are assigned an employee number. This is kept along with the employee's name and address. For hourly employees, hourly wage and target weekly work hours are stored (e.g. the company may target 40 hours/week for some, 32 hours/week for others, and 20 hours/week for others). Some salaried employees are salespeople that can earn
a commission in addition to their base salary. For all salaried employees, the yearly salary amount is recorded in the system. For salespeople, their commission percentage on sales and commission percentage on profit are stored in the system. For example, John is a salesperson with a base salary of $50,000 per year plus a 2 percent commission on the sales price for all sales he makes plus another 5 percent of the profit on each of those sales. For contract employees, the beginning and ending dates of their contract are stored along with the billing rate for their hours.

The data model for this scenario is given in Figure P5.3 below.
FIGURE P5.3 Granite Sales ERD
4. In Chapter 4, you saw the creation of the Tiny College database design. That design reflected such business rules as "a professor may advise many students" and "a professor may chair one department." Modify the design shown in Figure 4.35 to include these business rules:
• An employee could be staff or a professor or an administrator.
• A professor may also be an administrator.
• Staff employees have a work level classification, such as Level I and Level II.
• Only professors can chair a department. A department is chaired by only one professor.
• Only professors can serve as the dean of a college. Each of the university's colleges is served by one dean.
• A professor can teach many classes.
• Administrators have a position title.
Given that information, create the complete ERD containing all primary keys, foreign keys, and main attributes. The solution is shown in Figure P5.4 below.
FIGURE P5.4 Updated Tiny College ERD
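The mixed subtype rules in this problem (STAFF disjoint from the others, while a professor may also be an administrator) could be enforced at the table level. One hedged sketch, using illustrative discriminator flags and a CHECK constraint (the actual SQL constraint mechanisms are covered in Chapter 7):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Illustrative sketch: flag columns on the supertype plus a CHECK
# constraint. Staff may not also be a professor or an administrator
# (disjoint), but professor and administrator may overlap.
con.execute("""
    CREATE TABLE EMPLOYEE (
        Emp_Num  INTEGER PRIMARY KEY,
        Emp_Name TEXT,
        Is_Staff INTEGER NOT NULL DEFAULT 0,
        Is_Prof  INTEGER NOT NULL DEFAULT 0,
        Is_Admin INTEGER NOT NULL DEFAULT 0,
        CHECK (NOT (Is_Staff = 1 AND (Is_Prof = 1 OR Is_Admin = 1)))
    )""")

# A professor who is also an administrator is allowed (overlapping).
con.execute("INSERT INTO EMPLOYEE VALUES (1, 'Lee', 0, 1, 1)")

# A staff member who is also a professor violates the disjoint rule.
try:
    con.execute("INSERT INTO EMPLOYEE VALUES (2, 'Kim', 1, 1, 0)")
except sqlite3.IntegrityError:
    print("disjoint rule enforced")
```

The same rule could instead be enforced with triggers over separate subtype tables; the flag approach is simply the shortest way to show the idea.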
Note that the business rules require that the subtypes be overlapping for some subtypes but disjoint for others. Specifically, the STAFF subtype is disjoint from ADMIN and PROFESSOR, but ADMIN and PROFESSOR are overlapping. Such complex requirements may be implemented in the database through the use of database constraints, as described in Chapter 7, Introduction to Structured Query Language (SQL).

5. Tiny College wants to keep track of the history of all administrative appointments (date of appointment and date of termination). (Hint: Time-variant data are at work.) The Tiny College chancellor may want to know how many deans worked in the College of Business between January 1, 1960, and January 1, 2018, or who the dean of the College of Education was in 1990. Given that information, create the complete ERD containing all primary keys, foreign keys, and main attributes. The solution is shown in the following figure:
FIGURE P5.5 Tiny College Job History ERD Segment
6. Some Tiny College staff employees are information technology (IT) personnel. Some IT personnel provide technology support for academic programs. Some IT personnel provide technology infrastructure support. Some IT personnel provide both technology support for academic programs and technology infrastructure support. IT personnel are not professors. IT personnel are required to take periodic training to retain their technical expertise. Tiny College tracks all IT personnel training by date, type, and results (completed vs. not completed). Given that information, create the complete ERD containing all primary keys, foreign keys, and main attributes.

This problem provides an opportunity to reinforce the idea that to qualify as a subtype, the identifiable kind or type of instance must include additional attributes; being an identifiable kind or type of entity instance is necessary but not sufficient to justify the creation of subtypes. Given the minimal attributes specified in the problem, the solution would be as shown in Figure 5.6a.
FIGURE 5.6a Minimal Tiny College IT Staffing Solution
If, as is often the case in textbook problems, we assume that the attributes specified are just a subset of the complete attribute requirements for each entity, we can consider what the data model would be if additional attributes unique to the described kinds of entity instances exist. In that case, the expanded solution, including subtypes for the described kinds of staff members, is shown in Figure 5.6b.
FIGURE 5.6b Expanded Tiny College IT Staffing Solution
Note that in the specification of ITSTAFF as a subtype of STAFF, there is no disjoint/overlapping designation for the subtype. When there is only one subtype, there is nothing to be disjoint from or to overlap with; therefore, the designation may be safely omitted.

7. The FlyRight Aircraft Maintenance (FRAM) division of the FlyRight Company (FRC) performs all maintenance for FRC's aircraft. Produce a data model segment that reflects the following business rules:
• All mechanics are FRC employees. Not all employees are mechanics.
• Some mechanics are specialized in engine (EN) maintenance. Some mechanics are specialized in airframe (AF) maintenance. Some mechanics are specialized in avionics (AV) maintenance. (Avionics are the electronic components of an aircraft that are used in communication and navigation.) All mechanics take periodic refresher courses to stay current in their areas of expertise. FRC tracks all courses taken by each mechanic—date, course type, certification (Y/N), and performance.
• FRC keeps a history of the employment of all mechanics. The history includes the date hired, date promoted, date terminated, and so on. (Note: The "and so on" component is, of course, not a real-world requirement. Instead, it has been used here to limit the number of attributes you will show in your design.)
Given those requirements, create the Crow's Foot ERD segment. The solution is shown in the following figure:
Note that this is a very simplified version of the aircraft problem domain. The purpose is to help students with the modeling notation for specialization hierarchies and to illustrate how this notation
is different from the original entity relationship models. To truly justify the existence of the mechanic subtypes, each subtype MUST have attributes that are unique to that particular subtype. A good class exercise is to have students suggest attributes that may be unique to each subtype.

8. "Martial Arts R Us" (MARU) needs a database. MARU is a martial arts school with hundreds of students. It is necessary to keep track of all the different classes that are being offered, who is assigned to teach each class, and which students attend each class. Also, it is important to track the progress of each student as they advance. Create a complete Crow's Foot ERD for these requirements:
• Students are given a student number when they join the school. This is stored along with their name, date of birth, and the date they joined the school.
• All instructors are also students, but clearly, not all students are instructors. In addition to the normal student information, for each instructor, the date that they start working as an instructor must be recorded, along with their instructor status (compensated or volunteer).
• An instructor may be assigned to teach any number of classes, but each class has one and only one assigned instructor. Some instructors, especially volunteer instructors, may not be assigned to any class.
• A class is offered for a specific level at a specific time, day of the week, and location. For example, one class taught on Mondays at 5:00 pm in Room #1 is an intermediate-level class. Another class taught on Mondays at 6:00 pm in Room #1 is a beginner-level class. A third class taught on Tuesdays at 5:00 pm in Room #2 is an advanced-level class.
• Students may attend any class of the appropriate level during each week, so there is no expectation that any particular student will attend any particular class session. Therefore, the actual attendance of students at each individual class meeting must be tracked.
• A student will attend many different class meetings, and each class meeting is normally attended by many students. Some class meetings may have no students show up for that meeting. New students may not have attended any class meetings yet.
• At any given meeting of a class, instructors other than the assigned instructor may show up to help. Therefore, a given class meeting may have several instructors (a head instructor and many assistant instructors), but it will always have at least the one instructor that is assigned to that class. For each class meeting, the date that the class was taught and the instructors' roles (head instructor or assistant instructor) need to be recorded. For example, Mr. Jones is assigned to teach the Monday, 5:00 pm, intermediate class in Room #1. During one particular meeting of that class, Mr. Jones was present as the head instructor and Ms. Chen came to help as an assistant instructor.
• Each student holds a rank in the martial arts. The rank name, belt color, and rank requirements are stored. Each rank will have numerous rank requirements. Each requirement is considered a requirement just for the rank at which the requirement is introduced. Every requirement is associated with a particular rank. All ranks except white belt have at least one requirement.
• A given rank may be held by many students. While it is customary to think of a student as having a single rank, it is necessary to track each student's progress through the ranks. Therefore, every rank that a student attains is kept in the system. New students joining the school are automatically given a white belt rank. The date that a student is awarded each rank should be kept in the system. All ranks have at least one student that has achieved that rank at some time.
The solution for this case is shown in Figure P5.8 below.
FIGURE P5.8 MARU ERD Solution
Notice that the figure includes surrogate keys for RANK, REQUIREMENT, and MEETING because the natural keys did not meet the requirements for a good primary key. The most common area of confusion among students in this particular case surrounds attendance at the class meetings. Students tend to think of the relationship between CLASS and STUDENT as similar to the M:N enroll relationship that they have seen throughout the textbook. In this case, however, the relationship is not an enrollment relationship; instead, it is an attendance relationship. As described in the case, students do not enroll in any particular class. What must be tracked is the attendance for each individual class meeting. Therefore, the M:N relationship in this scenario is actually between the STUDENT and the individual class MEETING.
The case also provides an opportunity to reinforce the fact that subtypes inherit not only the attributes of the supertype but also its relationships. One requirement of the case is that the system must be able to track which instructors actually taught each class meeting. There is already a M:N relationship between STUDENT and MEETING that can be implemented with the ATTENDANCE bridge entity using only the Stu_Num and Meet_Num attributes. Students should consider that because INSTRUCTOR is a subtype of STUDENT, instructors are already associated in a M:N relationship with MEETING through that same bridge. By adding the Attend_Role attribute to ATTENDANCE, the bridge entity can properly track all students in a given class meeting and record what role each played in that meeting (e.g., student, assistant instructor, or head instructor).

Finally, it is worth pointing out to students that requirements are described as being an attribute of a rank. Some students will immediately consider requirements to be an entity, while others will model requirement as an attribute of the RANK entity. Considering rank requirements to be an attribute of RANK is perfectly acceptable; however, as such, rank requirements would be a multi-valued attribute. Therefore, the preferred implementation of a multi-valued attribute (creating a new entity for the multi-valued attribute) would result in the creation of the REQUIREMENT table anyway. So either way the student approaches the problem, it will eventually lead to the solution shown above.
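The ATTENDANCE bridge described above can be sketched in tables. Because INSTRUCTOR is a subtype of STUDENT, a single bridge between STUDENT and MEETING records everyone present, and Attend_Role distinguishes how each participated. Names below are assumed for illustration (SQLite via Python):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE STUDENT (Stu_Num  INTEGER PRIMARY KEY, Stu_Name TEXT);
CREATE TABLE MEETING (Meet_Num INTEGER PRIMARY KEY, Meet_Date TEXT);
-- Bridge entity: one row per person per class meeting.
CREATE TABLE ATTENDANCE (
    Stu_Num     INTEGER REFERENCES STUDENT,
    Meet_Num    INTEGER REFERENCES MEETING,
    Attend_Role TEXT CHECK (Attend_Role IN
        ('student', 'assistant instructor', 'head instructor')),
    PRIMARY KEY (Stu_Num, Meet_Num)
);
INSERT INTO STUDENT VALUES (1, 'Jones'), (2, 'Chen'), (3, 'Patel');
INSERT INTO MEETING VALUES (50, '2024-03-04');
INSERT INTO ATTENDANCE VALUES
    (1, 50, 'head instructor'),
    (2, 50, 'assistant instructor'),
    (3, 50, 'student');
""")

# Everyone at meeting 50, with the role each played.
roles = con.execute("""
    SELECT S.Stu_Name, A.Attend_Role
      FROM ATTENDANCE A JOIN STUDENT S ON A.Stu_Num = S.Stu_Num
     WHERE A.Meet_Num = 50
     ORDER BY S.Stu_Num""").fetchall()
print(roles)
```

One bridge table serves both "who attended" and "who taught," which is exactly the inheritance point the discussion makes.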
9. The Journal of E-commerce Research Knowledge is a prestigious information systems research journal. It uses a peer-review process to select manuscripts for publication. Only about 10 percent of the manuscripts submitted to the journal are accepted for publication. A new issue of the journal is published each quarter. Create a complete ERD to support the business needs described below.
• Unsolicited manuscripts are submitted by authors. When a manuscript is received, the editor will assign the manuscript a number and record some basic information about it in the system. The title of the manuscript, the date it was received, and a manuscript status of "received" are entered. Information about the author(s) is also recorded. For each author, the author's name, mailing address, e-mail address, and affiliation (school or company for which the author works) are recorded. Every manuscript must have an author. Only authors that have submitted manuscripts are kept in the system. It is typical for a manuscript to have several authors. A single author may have submitted many different manuscripts to the journal. Additionally, when a manuscript has multiple authors, it is important to record the order in which the authors are listed in the manuscript credits.
• At her earliest convenience, the editor will briefly review the topic of the manuscript to ensure that the manuscript's contents fall within the scope of the journal. If the content is not within the scope of the journal, the manuscript's status is changed to "rejected" and the author is notified via e-mail. If the content is within the scope of the journal, then the editor selects three or more reviewers to review the manuscript. Reviewers work for other companies or universities and read manuscripts to ensure the scientific validity of the manuscripts. For each reviewer, the system records a reviewer number, reviewer name, reviewer e-mail address, affiliation, and areas of interest.
Areas of interest are pre-defined areas of expertise that the reviewer has specified. An area of interest is identified by an IS code and includes a description (e.g., IS2003 is the code for "database modeling"). A reviewer can have many areas of interest, and an area of interest can be associated with
many reviewers. All reviewers must specify at least one area of interest. It is unusual, but it is possible to have an area of interest for which the journal has no reviewers. The editor will change the status of the manuscript to "under review" and record which reviewers the manuscript was sent to and the date on which it was sent to each reviewer. A reviewer will typically receive several manuscripts to review each year, although new reviewers may not have received any manuscripts yet.
• The reviewers will read the manuscript at their earliest convenience and provide feedback to the editor regarding the manuscript. The feedback from each reviewer includes rating the manuscript on a 10-point scale for appropriateness, clarity, methodology, and contribution to the field, as well as a recommendation for publication (accept or reject). The editor will record all of this information in the system for each review received from each reviewer and the date that the feedback was received. Once all of the reviewers have provided their evaluations, the editor will decide whether or not to publish the manuscript and change its status to "accepted" or "rejected". If the manuscript will be published, the date of acceptance is recorded.
• Once a manuscript has been accepted for publication, it must be scheduled. For each issue of the journal, the publication period (Fall, Winter, Spring, or Summer), publication year, volume, and number are recorded. An issue will contain many manuscripts, although the issue may be created in the system before it is known which manuscripts will be published in that issue. An accepted manuscript appears in only one issue of the journal. Each manuscript goes through a typesetting process that formats the content, including fonts, font size, line spacing, justification, and so on. Once the manuscript has been typeset, its number of pages is recorded in the system.
The editor will then decide which issue each accepted manuscript will appear in and the order of manuscripts within each issue. The order and the beginning page number for each manuscript must be stored in the system. Once the manuscript has been scheduled for an issue, the status of the manuscript is changed to "scheduled." Once an issue is published, the print date for the issue is recorded, and the status of each manuscript in that issue is changed to "published."
The solution for this case is shown in Figure P5.9 below.
FIGURE P5.9 Journal of E-commerce Research Knowledge ERD Solution
Again, this is another opportunity to stress to students that the creation of subtypes requires that there be an identifiable kind or type of entity instance and that the kind or type have additional attributes unique to it. In this case, AUTHOR is a subtype because it is an identifiable kind or type of PERSON and it includes additional attributes that are unique to authors
(i.e. the address attributes). There is no subtype for reviewers because there are no attributes that are unique to just that kind or type of PERSON. Reviewers do have relationships that are unique to them, but that is not a sufficient reason to create a subtype. It is not uncommon for students to want to make a separate subtype for each value that the manuscript status attribute can have. Students will often, rightly, point out that there are new attributes that come into play with different manuscript statuses. What the students are missing is that there is no described mechanism by which a manuscript that has been accepted can fail to be published. Therefore, once a manuscript is accepted, it does have all of the attributes in the ACCEPTED subtype; the user just doesn't have a value for all of them yet.

10. Global Unified Technology Sales (GUTS) is moving toward a "bring your own device" (BYOD) model for employee computing. Employees can use traditional desktop computers in their offices. They can also use a variety of personal mobile computing devices such as tablets, smartphones, and laptops. The new computing model introduces some security risks that GUTS is attempting to address. The company wants to ensure that any devices connecting to their servers are properly registered and approved by the Information Technology department. Create a complete ERD to support the business needs described below:
• Every employee works for a department that has a department code, name, mail box number, and phone number. The smallest department currently has 5 employees, and the largest department has 40 employees. This system will only track in which department an employee is currently employed. Very rarely, a new department can be created within the company. At such times, the department may exist temporarily without any employees. For every employee, their employee number and name (first, last, and middle initial) are recorded in the system.
It is also necessary to keep each employee's title.
• An employee can have many devices registered in the system. Each device is assigned an identification number when it is registered. Most employees have at least one device, but newly hired employees might not have any devices registered initially. For each device, the brand and model need to be recorded. Only devices that are registered to an employee will be in the system. While unlikely, it is possible that a device could transfer from one employee to another. However, if that happens, only the employee who currently owns the device is tracked in the system. When a device is registered in the system, the date of that registration needs to be recorded.
• Devices can be either desktop systems that reside in a company office or mobile devices. Desktop devices are typically provided by the company and are intended to be a permanent part of the company network. As such, each desktop device is assigned a static IP address, and the MAC address for the computer hardware is kept in the system. A desktop device is kept in a static location (building name and office number). This location should also be kept in the system so that if the device becomes compromised, the IT department can dispatch someone to remediate the problem.
• For mobile devices, it is important to also capture the device's serial number, which operating system (OS) it is using, and the version of the OS. The IT department is also verifying that each mobile device has a screen lock enabled and has encryption enabled for data. The system should support storing information on whether or not each mobile device has these capabilities enabled.
• Once a device is registered in the system, and the appropriate capabilities are enabled if it is a mobile device, the device may be approved for connections to one or more servers. Not all devices meet the requirements to be approved at first, so the device might be in the system for a period of time before it is approved to connect to any server. GUTS has a number of servers, and a device must be approved for each server individually. Therefore, it is possible for a single device to be approved for several servers but not for all servers.
• Each server has a name, brand, and IP address. Within the IT department's facilities are a number of climate-controlled server rooms where the physical servers can be located. Which room each server is in should also be recorded. Further, it is necessary to track which operating system is being used on each server.
• Some servers are virtual servers and some are physical servers. If a server is a virtual server, then the system should track which physical server it is running on. A single physical server can host many virtual servers, but each virtual server is hosted on only one physical server. Only physical servers can host a virtual server. In other words, one virtual server cannot host another virtual server. Not all physical servers host a virtual server.
• A server will normally have many devices that are approved to access the server, but it is possible for new servers to be created that do not yet have any approved devices. When a device is approved for connection to a server, the date of that approval should be recorded. It is also possible for a device that was approved for a server to lose its approval. If that happens, the date that the approval was removed should be recorded. If a device loses its approval, it may regain that approval at a later date if whatever circumstance led to the removal is resolved.
• A server can provide many user services, such as email, chat, homework managers, and others.
Each service on a server has a unique identification number and name. The date that GUTS began offering that service should be recorded. Each service runs on only one server, although new servers might not offer any services initially. Client-side services are not tracked in this system, so every service must be associated with a server.
• Employees must get permission to access a service before they can use it. Most employees have permissions to use a wide array of services, but new employees might not have permission on any service. Each service can support multiple approved employees as users, but new services might not have any approved users at first. The date on which the employee is approved to use a service is tracked by the system. The first time an employee is approved to access a service, the employee must create a username and password. This will be the same username and password that the employee will use for every service for which the employee is eventually approved.
The solution for this case is shown in Figure P5.10 below.
FIGURE P5.10 Global Unified Technology Sales ERD Solution
11. Global Computer Solutions (GCS) is an information technology consulting company with many offices located throughout the United States. The company’s success is based on its ability to maximize its resources—that is, its ability to match highly skilled employees with projects according to region. To better manage its projects, GCS has contacted you to design a database so that GCS managers can keep track of their customers, employees, projects, project schedules, assignments, and invoices.
The GCS database must support all of GCS's operations and information requirements. A basic description of the main entities follows:
• The employees working for GCS have an employee ID, an employee last name, a middle initial, a first name, a region, and a date of hire recorded in the system. Valid regions are as follows: Northwest (NW), Southwest (SW), Midwest North (MN), Midwest South (MS), Northeast (NE), and Southeast (SE).
• Each employee has many skills, and many employees have the same skill. Each skill has a skill ID, description, and rate of pay. Valid skills are as follows: Data Entry I, Data Entry II, Systems Analyst I, Systems Analyst II, Database Designer I, Database Designer II, Java I, Java II, C++ I, C++ II, Python I, Python II, ColdFusion I, ColdFusion II, ASP I, ASP II, Oracle DBA, MS SQL Server DBA, Network Engineer I, Network Engineer II, Web Administrator, Technical Writer, and Project Manager. Table P5.11a shows an example of the Skills Inventory.

Skill                 Employee
Data Entry I          Seaton Amy; Williams Josh; Underwood Trish
Data Entry II         Williams Josh; Seaton Amy
Systems Analyst I     Craig Brett; Sewell Beth; Robbins Erin; Bush Emily; Zebras Steve
Systems Analyst II    Chandler Joseph; Burklow Shane; Robbins Erin
DB Designer I         Yarbrough Peter; Smith Mary
DB Designer II        Yarbrough Peter; Pascoe Jonathan
Cobol I               Kattan Chris; Epahnor Victor; Summers Anna; Ellis Maria
Cobol II              Kattan Chris; Epahnor Victor; Batts Melissa
C++ I                 Smith Jose; Rogers Adam; Cope Leslie
C++ II                Rogers Adam; Bible Hanah
VB I                  Zebras Steve; Ellis Maria
VB II                 Zebras Steve; Newton Christopher
ColdFusion I          Duarte Miriam; Bush Emily
ColdFusion II         Bush Emily; Newton Christopher
ASP I                 Duarte Miriam; Bush Emily
ASP II                Duarte Miriam; Newton Christopher
Oracle DBA            Smith Jose; Pascoe Jonathan
SQL Server DBA        Yarbrough Peter; Smith Jose
Network Engineer I    Bush Emily; Smith Mary
Network Engineer II   Bush Emily; Smith Mary
Web Administrator     Bush Emily; Smith Mary; Newton Christopher
Technical Writer      Kilby Surgena; Bender Larry
Project Manager       Paine Brad; Mudd Roger; Kenyon Tiffany; Connor Sean

Table P5.11a Skills Inventory

• GCS has many customers. Each customer has a customer ID, customer name, phone number, and region.
• GCS works by projects. A project is based on a contract between the customer and GCS to design, develop, and implement a computerized solution. Each project has specific characteristics such as the project ID, the customer to which the project belongs, a brief description, a project date (the date the contract was signed), an estimated project start date and end date, an estimated project budget, an actual start date, an actual end date, an actual cost, and one employee assigned as manager of the project. The actual cost of the project is updated each Friday by adding that week's cost to the actual cost. The week's cost is computed by multiplying the hours each employee worked by the rate of pay of the project.
• The employee who is the manager of the project must complete a project schedule, which effectively is a design and development plan. In the project schedule (or plan), the manager must determine the tasks that will be performed to take the project from beginning to end. Each task has a task ID, a brief task description, starting and ending dates, the type of skills needed, and the number of employees (with the required skills) needed to complete the task. General tasks are the initial interview, database and system design, implementation, coding, testing, and final evaluation and sign-off. For example, GCS might have the project schedule shown in Table P5.11b.
Project ID: 1          Description: Sales Management System
Company: See Rocks     Contract Date: 2/12/2018     Region: NW
Start Date: 3/1/2018   End Date: 7/1/2018           Budget: $15,500

Start     End       Task                        Skill(s)             Quantity
Date      Date      Description                 Required             Required
3/1/18    3/6/18    Initial Interview           Project Manager      1
                                                Systems Analyst II   1
                                                DB Designer I        1
3/11/18   3/15/18   Database Design             DB Designer I        1
3/11/18   4/12/18   System Design               Systems Analyst II   1
                                                Systems Analyst I    2
3/18/18   3/22/18   Database Implementation     Oracle DBA           1
3/25/18   5/20/18   System Coding & Testing     Java I               2
                                                Java II              1
                                                Oracle DBA           1
3/25/18   6/7/18    System Documentation        Technical Writer     1
6/10/18   6/14/18   Final Evaluation            Project Manager      1
                                                Systems Analyst II   1
                                                DB Designer I        1
                                                Java II              1
6/17/18   6/21/18   On-Site System Online and   Project Manager      1
                    Data Loading                Systems Analyst II   1
                                                DB Designer I        1
                                                Java II              1
7/1/18    7/1/18    Sign-Off                    Project Manager      1

Table P5.11b Project Schedule Form
Ch05-Advanced Data Modeling
• Assignments: GCS pools all of its employees by region, and from this pool, employees are assigned to a specific task scheduled by the project manager. For example, in the first project's schedule, you know that a Systems Analyst II, a Database Designer I, and a Project Manager are needed for the period from 3/1/18 to 3/6/18. The project manager is assigned when the project is created and remains for the duration of the project. Using that information, GCS searches for employees who are located in the same region as the customer and who match the required skills, and assigns them to the project task. Each project schedule task can have many employees assigned to it, and a given employee can work on multiple project tasks. However, an employee can work on only one project task at a time. For example, if an employee is already assigned to work on a project task from 2/20/18 to 3/3/18, the employee cannot work on another task until the current assignment is closed (ends). The date on which an assignment is closed does not necessarily match the ending date of the project schedule task, because a task can be completed ahead of or behind schedule. Given all of the preceding information, you can see that the assignment associates an employee with a project task, using the project schedule. Therefore, to keep track of the assignment, you require at least the following information: assignment ID, employee, project schedule task, assignment start date, and assignment end date. The end date could be any date, as some projects run ahead of or behind schedule. Table P5.11c shows a sample assignment form.
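The "one project task at a time" rule amounts to a date-range overlap check, which might be sketched as follows (function and variable names are illustrative, not from the case; date ranges are treated as inclusive, and an open assignment has no end date yet):

```python
from datetime import date

# Sketch of the "one task at a time" rule: a new assignment may not overlap
# any of an employee's existing assignments. Names are illustrative.

def overlaps(start_a, end_a, start_b, end_b):
    """Two inclusive date ranges overlap unless one ends before the other
    starts. An open assignment (end date None) blocks everything onward."""
    end_a = end_a or date.max
    end_b = end_b or date.max
    return start_a <= end_b and start_b <= end_a

def can_assign(existing, new_start, new_end):
    """existing: list of (start, end) tuples for one employee."""
    return not any(overlaps(s, e, new_start, new_end) for s, e in existing)

# Employee already assigned 2/20/18 - 3/3/18 (the example in the text):
current = [(date(2018, 2, 20), date(2018, 3, 3))]
print(can_assign(current, date(2018, 3, 1), date(2018, 3, 6)))   # False
print(can_assign(current, date(2018, 3, 4), date(2018, 3, 6)))   # True
```

In a database implementation this check would typically run as a query against the assignment table before the new row is inserted.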
Project ID: 1            Description: Sales Management System
Company: See Rocks       Contract Date: 2/12/2018       As of: 03/29/18

                          SCHEDULED            ASSIGNMENTS                          ACTUAL
Project Task            | Start   | End     | Skill           | Employee        | Start   | End
Initial Interview       | 3/1/18  | 3/6/18  | Project Mgr.    | 101—Connor S.   | 3/1/18  | 3/6/18
                        |         |         | Sys. Analyst II | 102—Burklow S.  | 3/1/18  | 3/6/18
                        |         |         | DB Designer I   | 103—Smith M.    | 3/1/18  | 3/6/18
Database Design         | 3/11/18 | 3/15/18 | DB Designer I   | 104—Smith M.    | 3/11/18 | 3/14/18
System Design           | 3/11/18 | 4/12/18 | Sys. Analyst II | 105—Burklow S.  | 3/11/18 |
                        |         |         | Sys. Analyst I  | 106—Bush E.     | 3/11/18 |
                        |         |         | Sys. Analyst I  | 107—Zebras S.   | 3/11/18 |
Database Implementation | 3/18/18 | 3/22/18 | Oracle DBA      | 108—Smith J.    | 3/15/18 | 3/19/18
System Coding & Testing | 3/25/18 | 5/20/18 | Java I          | 109—Summers A.  | 3/21/18 |
                        |         |         | Java I          | 110—Ellis M.    | 3/21/18 |
                        |         |         | Java II         | 111—Ephanor V.  | 3/21/18 |
                        |         |         | Oracle DBA      | 112—Smith J.    | 3/21/18 |
System Documentation    | 3/25/18 | 6/7/18  | Tech. Writer    | 113—Kilby S.    | 3/25/18 |
Final Evaluation        | 6/10/18 | 6/14/18 | Project Mgr.    |                 |         |
                        |         |         | Sys. Analyst II |                 |         |
                        |         |         | DB Designer I   |                 |         |
                        |         |         | Java II         |                 |         |
On-Site System Online   | 6/17/18 | 6/21/18 | Project Mgr.    |                 |         |
and Data Loading        |         |         | Sys. Analyst II |                 |         |
                        |         |         | DB Designer I   |                 |         |
                        |         |         | Java II         |                 |         |
Sign-Off                | 7/1/18  | 7/1/18  | Project Mgr.    |                 |         |
Table P5.11c Project Assignment Form
• (Note: The assignment number is shown as a prefix of the employee name; for example, 101, 102.) Assume that the assignments shown previously are the only ones existing as of the date of this design. The assignment number can be whatever number matches your database design. The hours an employee works are kept in a work log containing a record of the actual hours worked by an employee on a given assignment. The work log is a weekly form that the employee fills out at the end of each week (Friday) or at the end of each month. The form contains the date, which is the current Friday or the last workday of the month if it doesn't fall on a Friday. The form also contains the assignment ID, the total hours worked either that week or up to the end of the month, and the bill number to which the work log entry is charged. Obviously, each work log entry can be related to only one bill. A sample list of the current work log entries for the first sample project is shown in Table P5.11d.
Employee Name | Week Ending | Assignment Number | Hours Worked | Bill Number
Burklow S.    | 3/1/18      | 1-102             | 4            | xxx
Connor S.     | 3/1/18      | 1-101             | 4            | xxx
Smith M.      | 3/1/18      | 1-103             | 4            | xxx
Burklow S.    | 3/8/18      | 1-102             | 24           | xxx
Connor S.     | 3/8/18      | 1-101             | 24           | xxx
Smith M.      | 3/8/18      | 1-103             | 24           | xxx
Burklow S.    | 3/15/18     | 1-105             | 40           | xxx
Bush E.       | 3/15/18     | 1-106             | 40           | xxx
Smith J.      | 3/15/18     | 1-108             | 6            | xxx
Smith M.      | 3/15/18     | 1-104             | 32           | xxx
Zebras S.     | 3/15/18     | 1-107             | 35           | xxx
Burklow S.    | 3/22/18     | 1-105             | 40           |
Bush E.       | 3/22/18     | 1-106             | 40           |
Ellis M.      | 3/22/18     | 1-110             | 12           |
Ephanor V.    | 3/22/18     | 1-111             | 12           |
Smith J.      | 3/22/18     | 1-108             | 12           |
Smith J.      | 3/22/18     | 1-112             | 12           |
Summers A.    | 3/22/18     | 1-109             | 12           |
Zebras S.     | 3/22/18     | 1-107             | 35           |
Burklow S.    | 3/29/18     | 1-105             | 40           |
Bush E.       | 3/29/18     | 1-106             | 40           |
Ellis M.      | 3/29/18     | 1-110             | 35           |
Ephanor V.    | 3/29/18     | 1-111             | 35           |
Kilby S.      | 3/29/18     | 1-113             | 40           |
Smith J.      | 3/29/18     | 1-112             | 35           |
Summers A.    | 3/29/18     | 1-109             | 35           |
Zebras S.     | 3/29/18     | 1-107             | 35           |
Note: xxx represents the bill ID. Use the one that matches the bill number in your database.
Table P5.11d Project Work-Log Form as of 3/29/18
• Finally, every 15 days, a bill is written and sent to the customer for the total hours worked on the project during that period. When GCS generates a bill, it uses the bill number to update the work log entries that are part of that bill. In summary, a bill can refer to many work log entries, and each work log entry can be related to only one bill. GCS sent one bill on 3/15/18 for the first project (SEE ROCKS), totaling the hours worked between 3/1/18 and 3/15/18. Therefore, you can safely assume that there is only one bill in this table and that the bill covers the work log entries shown in the preceding form.
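The billing step, in which GCS stamps a new bill's number onto the unbilled work log entries it covers, might be sketched as follows. This is a minimal SQLite sketch; the table and column names are assumptions for illustration and need not match your students' designs.

```python
import sqlite3

# Sketch of bill generation: create a bill, then attach all still-unbilled
# work log entries to it. Table and column names are illustrative.

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bill (bill_id INTEGER PRIMARY KEY, bill_date TEXT)")
con.execute("""CREATE TABLE worklog (
    wl_id    INTEGER PRIMARY KEY,
    wl_hours REAL,
    bill_id  INTEGER REFERENCES bill (bill_id))""")
con.executemany("INSERT INTO worklog (wl_hours) VALUES (?)",
                [(4,), (4,), (4,), (24,), (24,), (24,)])

# Every 15 days: insert the bill row, then stamp its number onto the
# unbilled entries, exactly as the narrative describes.
cur = con.execute("INSERT INTO bill (bill_date) VALUES ('2018-03-15')")
bill_id = cur.lastrowid
con.execute("UPDATE worklog SET bill_id = ? WHERE bill_id IS NULL", (bill_id,))

total = con.execute("SELECT SUM(wl_hours) FROM worklog WHERE bill_id = ?",
                    (bill_id,)).fetchone()[0]
print(total)   # 84.0 hours billed on the first bill
```

The sample hours match the first two weeks of the work log form; the single UPDATE enforces the "each work log entry relates to only one bill" rule by touching only rows with no bill number yet.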
Your assignment is to create a database that will fulfill the operations described in this problem. The minimum required entities are employee, skill, customer, region, project, project schedule, assignment, work log, and bill. (There are additional required entities that are not listed.)
• Create all of the required tables and all of the required relationships.
• Create the required indexes to maintain entity integrity when using surrogate primary keys.
• Populate the tables as needed, as indicated in the sample data and forms.
This is a complex database design case that requires the identification of many business rules, the organization of those business rules, and the development of a complete database model. Note that this database design case has three primary objectives:
• Evaluation of primary keys and surrogate keys. (When should each one be used?)
• Evaluation of the use of indexes on candidate keys to avoid duplicate entries when using surrogate keys.
• Evaluation of the use of redundant relationships. In some cases, it is better to have the foreign key attribute added to an entity instead of using multiple join operations.
We recommend that you use this problem as the basis for a two-part case project. One way to work with this database case is to form small groups of two or three students and then let each group work the problem independently. The following list provides a sample scenario:
• Divide the class into groups of three students per group.
• Distribute the GCS database case to all students.
• Assign a deadline for the groups to submit an initial design ERD with written explanations of the ERD components and features. This deadline should be two weeks from the assignment date. (While the groups are working on the design phase, students will be learning to use SQL to generate information.)
• The initial ERD must include:
➢ All the main entities with all primary/foreign keys clearly labeled.
➢ The identification of all relevant dependent attributes.
➢ For each table, the identification of all possible required indexes.
• Meet with each group and evaluate each design, paying close attention to:
➢ The propagation of primary/foreign keys and how surrogate keys would be useful to simplify the design.
➢ The use of indexes to minimize the occurrence of duplicate entries.
➢ By this time, students should be familiar with SQL. Ask questions about how a query would be written to generate information. You can use the sample queries provided in the GCSdata-sol.mdb teacher solution file. (This database is located on your Instructor's CD.)
Please note that there are two database files available:
• The GCSdata.mdb database is located in the Student subfolder on the Instructor's CD. This MS Access database contains the sample CUSTOMER, EMPLOYEE, REGION, and SKILL tables. You can either distribute this file to your students by copying it to a common drive in your lab or ask your students to download this file from the Course Technology website for this book.
• The GCSdata-sol.mdb database is located in the Teacher subfolder on the Instructor's CD. This MS Access database contains the complete set of populated tables. In addition, the solution database contains some sample queries. You can use the sample queries as the basis for the second part of this case, which may be used to complement the SQL coverage in Chapters 7 and 8.
Figure P5-11a shows the sample tables in the GCSdata.mdb student database.
Figure P5-11a GCS Student Sample Database Tables
The GCSdata-sol.mdb file contains the solution for this design case. Figure P5-11b shows the relational diagram for the solution.
Figure P5.11b – Relational Diagram for the GCS Database
To help your students understand the ERD, use Table P5.11 to describe the main tables and the main indexes that are appropriate for this design implementation.
TABLE P5.11 ERD Documentation

Table Name | Primary Key | Unique, Not Null Index (on candidate key)
Customer | cus_id (surrogate) | unique(cus_name)
    The unique index on cus_name is used to ensure that no duplicate customers exist.
Region | region_id (surrogate) | unique(region_name)
    The unique index on region_name is used to ensure that no duplicate regions are entered.
Employee | emp_id (surrogate) | unique(emp_lname, emp_fname, emp_mi)
    The unique index on emp_lname, emp_fname, and emp_mi is used to ensure that no duplicate employees are entered.
Skill | skill_id (surrogate) | unique(skill_description)
    The unique index on skill_description is used to ensure that no duplicate skills are entered.
EmpSkill | emp_id, skill_id (composite) | (none)
    The composite primary key ensures that no duplicate skills are entered for each employee.
Project | prj_id (surrogate) | unique(cus_id, prj_description)
    The unique index on cus_id and prj_description is used to ensure that no duplicate project entries exist for a given customer.
Task (project schedule) | task_id (surrogate) | unique(prj_id, task_descript)
    The unique index on prj_id and task_descript is used to ensure that no duplicate task is given for the same project.
TS (task schedule) | ts_id (surrogate) | unique(task_id, skill_id)
    The unique index on task_id and skill_id is used to prevent duplicate listings for a single skill within a single task for a single project.
Assign | asn_id (surrogate) | unique(ps_id, emp_id, ts_id)
    The unique index on ps_id, emp_id, and ts_id is used to ensure that an employee cannot be assigned twice to perform the same skill on the same task for a given project.
Worklog | wl_id (surrogate) | unique(asn_id, wl_date)
    The unique index on asn_id and wl_date is used to ensure that no duplicate work log entries exist (for an employee) on a given date.
Bill | bill_id (surrogate) | (none)
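The surrogate-key-plus-unique-index pattern documented above can be demonstrated with a short sketch. This uses SQLite syntax with the Region table as the example; the index name is illustrative, and the same pattern applies to every table in the list.

```python
import sqlite3

# Sketch of the pattern: a surrogate primary key plus a unique index on
# the candidate key, so duplicates are rejected even though the PK is
# just an auto-assigned number. SQLite syntax; the index name is made up.

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE region (
    region_id   INTEGER PRIMARY KEY,     -- surrogate key
    region_name TEXT NOT NULL)""")
con.execute("CREATE UNIQUE INDEX idx_region_name ON region (region_name)")

con.execute("INSERT INTO region (region_name) VALUES ('NW')")
try:
    con.execute("INSERT INTO region (region_name) VALUES ('NW')")  # duplicate
except sqlite3.IntegrityError as e:
    print("rejected:", e)   # the unique index preserves entity integrity

count = con.execute("SELECT COUNT(*) FROM region").fetchone()[0]
print(count)   # 1
```

Without the unique index, nothing would stop two 'NW' rows from coexisting under different surrogate key values, which is exactly the duplicate-entry risk the table above addresses.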
It is important to point out to your students that surrogate primary keys are usually not shown in the graphical user interfaces that are available to the end users. The only function of the surrogate primary key is to provide a single-attribute identifier for each row in the table. The completed ERD for the GCS database is shown in Figure P5.11c.
Figure P5.11c – ERD for the GCS Database
Chapter 6 Normalization of Database Tables
Discussion Focus
Why are some table structures considered to be bad and others good, and how do you recognize the difference between good and bad structures?
From an information management point of view, possibly the most vexing and destructive problems are created through uncontrolled data redundancies. Such redundancies produce update and delete anomalies that create data integrity problems. The loss of data integrity can destroy the usefulness of the data within the database. (If necessary, review Chapter 1, Section 1-6b, "Data Redundancy," to make sure that your students understand the terminology and appreciate the dangers of data redundancy.)
Table structures are poor whenever they promote uncontrolled data redundancy. For example, the table structure shown in Figure IM6.1 is poor because it stores redundant data. In this example, the AC_MODEL, AC_RENT_CHG, and AC_SEATS attributes are redundant. (For example, note that the hourly rental charge of $58.50 is stored four times, once for each of the four Cessna C-172 Skyhawk aircraft; check records 1, 2, 4, and 9.)
Figure IM6.1 A Poor Table Structure
If you use the AIRCRAFT_1 table as shown in Figure IM6.1, a change in hourly rental rates for the Cessna 172 Skyhawk must be made four times; if you forget to change just one of those rates, you have a data integrity problem. How much better it would be to have critical data in only one place! Then, if a change must be made, it need be made only once. In contrast to the poor AIRCRAFT_1 table structure shown in Figure IM6.1, table structures are good when they preclude the possibility of producing uncontrolled data redundancies. You can produce such a happy circumstance by splitting the AIRCRAFT_1 table shown in Figure IM6.1 into the AIRCRAFT and MODEL tables shown in Figures IM6.2 and IM6.3, respectively. To retain access to all of the data originally stored in the AIRCRAFT_1 table, these two tables can be connected through the AIRCRAFT table's foreign key, MOD_CODE.
Figure IM6.2 The Revised AIRCRAFT Table
Figure IM6.3 The MODEL Table
Note that, after the revision, a rental rate change need be made in only one place and the number of seats for each model is given in only one place. No more update and delete anomalies, and no more data integrity problems. The relational diagram in Figure IM6.4 shows how the two tables are related.
Figure IM6.4 The Relational Diagram
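The effect of the split can be demonstrated with a small sketch. This uses SQLite syntax; the MOD_CODE foreign key follows the text, while the remaining column names and sample values are illustrative rather than copied from the figures.

```python
import sqlite3

# Sketch of the AIRCRAFT/MODEL split discussed above. MOD_CODE is the
# foreign key named in the text; other column names and the sample data
# are illustrative.

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE model (
    mod_code     TEXT PRIMARY KEY,
    mod_name     TEXT,
    mod_seats    INTEGER,
    mod_rent_chg REAL)""")
con.execute("""CREATE TABLE aircraft (
    ac_number TEXT PRIMARY KEY,
    mod_code  TEXT REFERENCES model (mod_code))""")

con.execute("INSERT INTO model VALUES ('C-172', 'Cessna 172 Skyhawk', 4, 58.50)")
con.executemany("INSERT INTO aircraft VALUES (?, ?)",
                [("N1001", "C-172"), ("N1002", "C-172")])

# The rate is stored once; a change in one place reaches every aircraft.
con.execute("UPDATE model SET mod_rent_chg = 65.00 WHERE mod_code = 'C-172'")
rows = con.execute("""SELECT a.ac_number, m.mod_rent_chg
                      FROM aircraft a JOIN model m USING (mod_code)
                      ORDER BY a.ac_number""").fetchall()
print(rows)   # [('N1001', 65.0), ('N1002', 65.0)]
```

The single UPDATE is the payoff: in the unnormalized AIRCRAFT_1 structure, the same change would have to be repeated in every matching row.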
What does normalization have to do with creating good tables, and what's the point of having to learn all these picky normalization rules?
Normalization provides an organized way of determining a table's structural status. Better yet, normalization principles and procedures provide a set of simple techniques through which we can achieve the desired and definable structural results. Without normalization principles and procedures, we lack evaluation standards and must rely on experience (and yes, some intuition) to minimize the probability of generating data integrity problems. The problem with relying on experience is that we usually learn from experience by making errors. While we're learning, who and what will be hurt by the errors we make? Relying on intuition may work reasonably well for some, but intuitive work habits seldom create design consistency. Worse, you can't teach intuition to those who follow in your database footsteps. In short, normalization principles and rules drastically decrease the likelihood of producing bad table structures, help standardize the process of producing good tables, and make it possible to transmit skills to the next generation of database designers.
NOTE
Given the clear advantages of using normalization procedures to check and correct table structures, students sometimes think that normalization corrects all design defects. Unfortunately, normalization is only a part of the "good design to implementation" process. For example, normalization does not detect the presence of synonyms. Remind your students that normalization takes place in tandem with data modeling. The proper procedure is to follow these steps:
1. Create a detailed description of operations.
2. Derive all the appropriate business rules from the description of operations.
3. Model the data with the help of a good tool such as Visio's Crow's Foot option to produce an initial ERD. This ERD is the initial database blueprint.
4. Use the normalization procedures to remove data redundancies. This process may produce additional entities.
5. Revise the ERD created in step 3.
6. Use the normalization procedures to audit the revised ERD. If additional data redundancies are discovered, repeat steps 4 and 5.
Also remind your students that some business rules cannot be incorporated in the ERD, regardless of the level of business rule detail or the completeness of the normalization process. For example, the business rule that specifies the constraint "A pilot may not perform flight duties more than 10 hours per 24-hour period" cannot be modeled in the ERD. However, tools such as Visio do allow you to write "reminders" of such constraints as text. Because such constraints cannot be modeled, they must be enforced through the application software.
Answers to Review Questions

1. What is normalization?
Normalization is the process for assigning attributes to entities. Properly executed, the normalization process eliminates uncontrolled data redundancies, thus eliminating the data anomalies and the data integrity problems that are produced by such redundancies. Normalization does not eliminate data redundancy; instead, it produces the carefully controlled redundancy that lets us properly link database tables.

2. When is a table in 1NF?
A table is in 1NF when all the key attributes are defined (no repeating groups in the table) and when all remaining attributes are dependent on the primary key. However, a table in 1NF may still contain partial dependencies, i.e., dependencies based on only part of the primary key, and/or transitive dependencies that are based on a non-key attribute.

3. When is a table in 2NF?
A table is in 2NF when it is in 1NF and it includes no partial dependencies. However, a table in 2NF may still have transitive dependencies, i.e., dependencies based on attributes that are not part of the primary key.

4. When is a table in 3NF?
A table is in 3NF when it is in 2NF and it contains no transitive dependencies.

5. When is a table in BCNF?
A table is in Boyce-Codd Normal Form (BCNF) when it is in 3NF and every determinant in the table is a candidate key. For example, if the table is in 3NF and it contains a nonprime attribute that determines a prime attribute, the BCNF requirements are not met. (Reference the text's Figure 6.8 to support this discussion.) This description clearly yields the following conclusions:
• If a table is in 3NF and it contains only one candidate key, 3NF and BCNF are equivalent.
• BCNF can be violated only if the table contains more than one candidate key. Putting it another way, the BCNF requirement cannot be violated if there is only one candidate key.
6. Given the dependency diagram shown in Figure Q6.6, answer items 6a-6c:
FIGURE Q6.6 Dependency Diagram for Question 6

C1 | C2 | C3 | C4 | C5
a. Identify and discuss each of the indicated dependencies.
C1 → C2 represents a partial dependency, because C2 depends only on C1, rather than on the entire primary key composed of C1 and C3.
C4 → C5 represents a transitive dependency, because C5 depends on an attribute (C4) that is not part of a primary key.
C1, C3 → C2, C4, C5 represents a set of proper functional dependencies, because C2, C4, and C5 depend on the primary key composed of C1 and C3.

b. Create a database whose tables are at least in 2NF, showing the dependency diagrams for each table.
The normalization results are shown in Figure Q6.6b.
Figure Q6.6b The Dependency Diagram for Question 6b

Table 1:  C1 | C2
    Primary key: C1
    Foreign key: None
    Normal form: 3NF

Table 2:  C1 | C3 | C4 | C5
    Primary key: C1 + C3
    Foreign key: C1 (to Table 1)
    Normal form: 2NF, because the table exhibits the transitive dependency C4 → C5
c. Create a database whose tables are at least in 3NF, showing the dependency diagrams for each table.
The normalization results are shown in Figure Q6.6c.
Figure Q6.6c The Dependency Diagram for Question 6c

Table 1:  C1 | C2
    Primary key: C1
    Foreign key: None
    Normal form: 3NF

Table 2:  C1 | C3 | C4
    Primary key: C1 + C3
    Foreign keys: C1 (to Table 1), C4 (to Table 3)
    Normal form: 3NF

Table 3:  C4 | C5
    Primary key: C4
    Foreign key: None
    Normal form: 3NF
7. The dependency diagram in Figure Q6.7 indicates that authors are paid royalties for each book that they write for a publisher. The amount of the royalty can vary by author, by book, and by edition of the book.
Figure Q6.7 Book royalty dependency diagram
a. Based on the dependency diagram, create a database whose tables are at least in 2NF, showing the dependency diagram for each table.
The normalization results are shown in Figure Q6.7a.
Figure Q6.7a The 2NF normalization results for Question 7a.
b. Create a database whose tables are at least in 3NF, showing the dependency diagram for each table.
The normalization results are shown in Figure Q6.7b.
Figure Q6.7b The 3NF normalization results for Question 7b.
8. The dependency diagram in Figure Q6.8 indicates that a patient can receive many prescriptions for one or more medicines over time. Based on the dependency diagram, create a database whose tables are in at least 2NF, showing the dependency diagram for each table.
Figure Q6.8 Prescription dependency diagram
The normalization results are shown in Figure Q6.8a.
Figure Q6.8a The 2NF normalization results for Question 8.
9. What is a partial dependency? With what normal form is it associated?
A partial dependency exists when an attribute is dependent on only a portion of the primary key. This type of dependency is associated with 1NF.

10. What three data anomalies are likely to be the result of data redundancy? How can such anomalies be eliminated?
The most common anomalies considered when data redundancy exists are update anomalies, addition anomalies, and deletion anomalies. All of these can easily be avoided through data normalization. Data redundancy produces data integrity problems, caused by the fact that data entry failed to conform to the rule that all copies of redundant data must be identical.
11. Define and discuss the concept of transitive dependency.
Transitive dependency is a condition in which an attribute is dependent on another attribute that is not part of the primary key. This kind of dependency usually requires the decomposition of the table containing the transitive dependency. To remove a transitive dependency, the designer must perform the following actions:
• Place the attributes that create the transitive dependency in a separate table.
• Make sure that the new table's primary key attribute is the foreign key in the original table.
Figure Q6.11 shows an example of a transitive dependency removal.
Figure Q6.11 Transitive Dependency Removal

Original table:
    INV_NUM | INV_DATE | INV_AMOUNT | CUS_NUM | CUS_ADDRESS | CUS_PHONE
    Transitive dependency: CUS_NUM → CUS_ADDRESS, CUS_PHONE

New tables:
    INV_NUM | INV_DATE | INV_AMOUNT | CUS_NUM
    CUS_NUM | CUS_ADDRESS | CUS_PHONE
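The removal illustrated in Figure Q6.11 can be sketched as follows. This uses SQLite syntax; the table names (invoice, customer) and the sample rows are illustrative, not from the figure.

```python
import sqlite3

# Sketch of the transitive dependency removal in Figure Q6.11: the
# attributes determined by CUS_NUM move to their own table, and CUS_NUM
# stays behind as the foreign key. Table names and rows are illustrative.

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE original (
    inv_num INTEGER PRIMARY KEY, inv_date TEXT, inv_amount REAL,
    cus_num INTEGER, cus_address TEXT, cus_phone TEXT)""")
con.executemany("INSERT INTO original VALUES (?, ?, ?, ?, ?, ?)",
    [(1001, '2018-01-15', 100.0, 10, '12 Elm St', '555-1111'),
     (1002, '2018-01-16', 200.0, 10, '12 Elm St', '555-1111')])

# New tables: the customer data now appears once, keyed by CUS_NUM.
con.execute("""CREATE TABLE customer AS
               SELECT DISTINCT cus_num, cus_address, cus_phone FROM original""")
con.execute("""CREATE TABLE invoice AS
               SELECT inv_num, inv_date, inv_amount, cus_num FROM original""")

cus_rows = con.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
inv_rows = con.execute("SELECT COUNT(*) FROM invoice").fetchone()[0]
print(cus_rows, inv_rows)   # 1 2
```

The redundant address and phone values collapse into a single customer row, while both invoice rows survive with CUS_NUM as the linking foreign key.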
12. What is a surrogate key, and when should you use one?
A surrogate key is an artificial PK introduced by the designer with the purpose of simplifying the assignment of primary keys to tables. Surrogate keys are usually numeric, they are often automatically generated by the DBMS, they are free of semantic content (they have no special meaning), and they are usually hidden from the end users.
13. Why is a table whose primary key consists of a single attribute automatically in 2NF when it is in 1NF?
A dependency based on only a part of a composite primary key is called a partial dependency. Therefore, if the PK is a single attribute, there can be no partial dependencies.

14. How would you describe a condition in which one attribute is dependent on another attribute when neither attribute is part of the primary key?
This condition is known as a transitive dependency. A transitive dependency is a dependency of one nonprime attribute on another nonprime attribute. (The problem with transitive dependencies is that they still yield data anomalies.)

15. Suppose that someone tells you that an attribute that is part of a composite primary key is also a candidate key. How would you respond to that statement?
This argument is incorrect if the composite PK contains no redundant attributes. If the composite primary key is properly defined, all of the attributes that compose it are required to identify the remaining attribute values. By definition, a candidate key is one that can be used to identify all of the remaining attributes but was not chosen to be the PK. In other words, a candidate key can serve as a primary key, but it was not chosen for that task for one reason or another. Clearly, a part of a proper ("minimal") composite PK cannot be used as a PK by itself. More formally, you learned in Chapter 3, "The Relational Database Model," Section 3-2, that a candidate key can be described as a superkey without redundancies, that is, a minimal superkey. Using this distinction, note that a STUDENT table might contain the composite key STU_NUM, STU_LNAME. This composite key is a superkey, but it is not a candidate key because STU_NUM by itself is a candidate key!
The combination STU_LNAME, STU_FNAME, STU_INIT, STU_PHONE might also be a candidate key, as long as you discount the possibility that two students share the same last name, first name, initial, and phone number. If the student's Social Security number had been included as one of the attributes in the STUDENT table (perhaps named STU_SOCSECNUM), both it and STU_NUM would have been candidate keys because either one would uniquely identify each student. In that case, the selection of STU_NUM as the primary key would be driven by the designer's choice or by end-user requirements. Note, incidentally, that a primary key is a superkey as well as a candidate key.

16. A table is in ___3rd___ normal form when it is in ___2nd normal form___ and there are no transitive dependencies. (See the discussion in Section 6-3c, "Conversion to Third Normal Form.")
Problem Solutions

1. Using the descriptions of the attributes given in the figure, convert the ERD shown in Figure P6.1 into a dependency diagram that is in at least 3NF.
An initial dependency diagram depicting only the primary key dependencies is shown in Figure P6.1a below.
Figure P6.1a Initial dependency diagram for Problem 1.
There are no composite keys being used; therefore, by definition, there is no issue with partial dependencies, and the entities are already in 2NF. Based on the descriptions of the attributes, it appears that the patient name, phone number, and address can be determined by the patient ID number. Therefore, the following transitive dependency can be identified:
App_PatientID → (App_Name, App_Phone, App_Street, App_City, App_State, App_Zip)
As discussed in the chapter, ZIP codes can be used to determine a city and state; therefore, we also have the transitive dependency:
App_Zip → App_City, App_State
Figure P6.1b depicts the dependency diagram with these transitive dependencies included.
Figure P6.1b Revised dependency diagram for Problem 1.
Since the first transitive dependency completely encloses the second transitive dependency, it is appropriate to resolve the first transitive dependency before resolving the second. Figure P6.1c shows the results of resolving the first transitive dependency.
Figure P6.1c Resolving the first transitive dependency
Finally, the second and final transitive dependency can now be resolved as shown in the final dependency diagram in Figure P6.1d.
Figure P6.1d Final dependency diagram for Problem 1
Note that at this time we have resolved all of the transitive dependencies. Decisions on whether or not to denormalize, and perhaps not remove the final transitive dependency, have yet to be made. Also, the structures have not yet had the benefit of additional design modifications, such as achieving proper naming conventions for the attributes in the new tables. However, creating the fully normalized structures is an important step toward making informed decisions about the compromises in the design that we may choose to make.
NOTE: Please note that we are making the assumption that a ZIP code determines only one city and state. Unfortunately, this is not true; a handful of ZIP codes cross state lines. In those cases, it would be appropriate not to use the [App_Zip, App_City, App_State] relation and instead add these attributes to the previous relation. Hence, the relation would be:
[App_PatientID, App_Name, App_Phone, App_Street, App_City, App_Zip, App_State]

2. Using the descriptions of the attributes given in the figure, convert the ERD shown in Figure P6.2 into a dependency diagram that is in at least 3NF.
An initial dependency diagram depicting only the primary key dependencies is shown in Figure P6.2a below.
Figure P6.2a Initial dependency diagram for Problem 2.
Based on the descriptions of the attributes given, the following partial dependency can be determined:
Pres_SessionNum → (Pres_Date, Pres_Room)
Also, the following transitive dependency can be determined:
Pres_AuthorID → (Pres_FName, Pres_LName)
Figure P6.2b shows the revised dependency diagram including the partial and transitive dependencies.
Figure P6.2b Revised dependency diagram for Problem 2
Resolving the partial dependency to achieve 2NF yields the dependency diagram shown in Figure P6.2c.
Figure P6.2c 2NF dependency diagram for Problem 2
Finally, the transitive dependency is resolved to achieve the 3NF solution shown in the final dependency diagram in Figure P6.2d.
Figure P6.2d Final dependency diagram for Problem 2
3. Using the INVOICE table structure shown in Table P6.3, do the following:
Table P6.3 Sample INVOICE Records

Attribute Name | Sample Value  | Sample Value       | Sample Value | Sample Value  | Sample Value
INV_NUM        | 211347        | 211347             | 211347       | 211348        | 211349
PROD_NUM       | AA-E3422QW    | QD-300932X         | RU-995748G   | AA-E3422QW    | GH-778345P
SALE_DATE      | 15-Jan-2018   | 15-Jan-2018        | 15-Jan-2018  | 15-Jan-2018   | 16-Jan-2018
PROD_LABEL     | Rotary sander | 0.25-in. drill bit | Band saw     | Rotary sander | Power drill
VEND_CODE      | 211           | 211                | 309          | 211           | 157
VEND_NAME      | NeverFail, Inc. | NeverFail, Inc.  | BeGood, Inc. | NeverFail, Inc. | ToughGo, Inc.
QUANT_SOLD     | 1             | 8                  | 1            | 2             | 1
PROD_PRICE     | $49.95        | $3.45              | $39.99       | $49.95        | $87.75
a. Write the relational schema, draw its dependency diagram and identify all dependencies, including all partial and transitive dependencies. You can assume that the table does not contain repeating groups and that any invoice number may reference more than one product. (Hint: This table uses a composite primary key.) The solutions to both problems (3a and 3b) are shown in Figure P6.3a.
NOTE We have combined the solutions to Problems 3a and 3b to let you illustrate the start of the normalization process within a single PowerPoint slide. Students generally seem to have an easier time understanding the normalization process if they can compare the normal forms directly. We will continue to use this technique for several of the initial normalization decompositions … if the available PowerPoint slide space permits it.
b. Remove all partial dependencies, write the relational schema, and draw the new dependency diagrams. Identify the normal forms for each table structure you created.
NOTE You can assume that any given product is supplied by a single vendor but a vendor can supply many products. Therefore, it is proper to conclude that the following dependency exists: PROD_NUM → PROD_DESCRIPTION, PROD_PRICE, VEND_CODE, VEND_NAME (Hint: Your actions should produce three dependency diagrams.)
Figure P6.3a The Dependency Diagrams for Problems 3a and 3b
c. Remove all transitive dependencies, write the relational schema, and draw the new dependency diagrams. Also identify the normal forms for each table structure you created. To illustrate the effect of Problem 3's complete decomposition, we have shown Problem 3a's dependency diagram again in Figure P6.3c.
Figure P6.3c The Dependency Diagram for Problem 3c
d. Draw the Crow’s Foot ERD.
NOTE Emphasize that, because the dependency diagrams cannot show the nature (1:1, 1:M, M:N) of the relationships, the ER Diagrams remain crucial to the design effort. Complex design is impossible to produce successfully without some form of modeling, be it ER, Semantic Object Modeling, or some other modeling methodology. Yet, as the preceding decompositions demonstrate, the dependency diagrams are a valuable addition to the designer's toolbox. (Normalization is likely to suggest the existence of entities that may not have been considered during the modeling process.) And, if information or transaction management issues require the existence of attributes that create other than 3NF or BCNF conditions, the proper dependency diagrams will at least force awareness of these conditions.
The invoicing ERD, accompanied by its relational diagram, is shown in Figure P6.3d. (The relational diagram only includes the critical PK and FK components, plus a few sample attributes, for space considerations.)
Figure P6.3d The Invoicing ERD and Its (Partial) Relational Diagram
Crow’s Foot Invoicing ERD

(Figure: Crow’s Foot ERD and relational diagram, sample attributes only: INVOICE (INV_NUM, INV_DATE) 1:M LINE (INV_NUM, PROD_NUM, NUM_SOLD); PRODUCT (PROD_NUM, PROD_DESCRIPTION, PROD_PRICE, VEND_CODE) 1:M LINE; VENDOR (VEND_CODE, VEND_NAME) 1:M PRODUCT.)
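The full Problem 3 decomposition can be sanity-checked with a quick SQLite sketch. The table and attribute names are taken from Figure P6.3d, and the data are the Table P6.3 sample records; using SQLite here is an illustration, not part of the original solution. Joining the four tables reproduces the original un-normalized rows, so the decomposition is lossless:

```python
import sqlite3

# Problem 3's final 3NF decomposition (per Figure P6.3d),
# populated with the sample records from Table P6.3.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE VENDOR (
  VEND_CODE INTEGER PRIMARY KEY,
  VEND_NAME TEXT);
CREATE TABLE PRODUCT (
  PROD_NUM         TEXT PRIMARY KEY,
  PROD_DESCRIPTION TEXT,
  PROD_PRICE       REAL,
  VEND_CODE        INTEGER REFERENCES VENDOR);
CREATE TABLE INVOICE (
  INV_NUM  INTEGER PRIMARY KEY,
  INV_DATE TEXT);
CREATE TABLE LINE (
  INV_NUM  INTEGER REFERENCES INVOICE,
  PROD_NUM TEXT REFERENCES PRODUCT,
  NUM_SOLD INTEGER,
  PRIMARY KEY (INV_NUM, PROD_NUM));  -- the original composite PK survives here
""")
con.executemany("INSERT INTO VENDOR VALUES (?,?)",
    [(211, "NeverFail, Inc."), (309, "BeGood, Inc."), (157, "ToughGo, Inc.")])
con.executemany("INSERT INTO PRODUCT VALUES (?,?,?,?)",
    [("AA-E3422QW", "Rotary sander", 49.95, 211),
     ("QD-300932X", "0.25-in. drill bit", 3.45, 211),
     ("RU-995748G", "Band saw", 39.99, 309),
     ("GH-778345P", "Power drill", 87.75, 157)])
con.executemany("INSERT INTO INVOICE VALUES (?,?)",
    [(211347, "15-Jan-2018"), (211348, "15-Jan-2018"), (211349, "16-Jan-2018")])
con.executemany("INSERT INTO LINE VALUES (?,?,?)",
    [(211347, "AA-E3422QW", 1), (211347, "QD-300932X", 8), (211347, "RU-995748G", 1),
     (211348, "AA-E3422QW", 2), (211349, "GH-778345P", 1)])

# Joining the four tables reproduces the original un-normalized view.
rows = con.execute("""
    SELECT L.INV_NUM, L.PROD_NUM, I.INV_DATE, P.PROD_DESCRIPTION,
           V.VEND_CODE, V.VEND_NAME, L.NUM_SOLD, P.PROD_PRICE
    FROM LINE L
    JOIN INVOICE I ON I.INV_NUM   = L.INV_NUM
    JOIN PRODUCT P ON P.PROD_NUM  = L.PROD_NUM
    JOIN VENDOR  V ON V.VEND_CODE = P.VEND_CODE
    ORDER BY L.INV_NUM, L.PROD_NUM""").fetchall()
```

Note how each fact is now stored exactly once: the price of the rotary sander appears in one PRODUCT row, no matter how many invoice lines reference it.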
4. Using the STUDENT table structure shown in Table P6.4, do the following:
Table P6.4 Sample STUDENT Records

Attribute Name   Sample Value     Sample Value     Sample Value     Sample Value     Sample Value
STU_NUM          211343           200128           199876           199876           223456
STU_LNAME        Stephanos        Smith            Jones            Ortiz            McKulski
STU_MAJOR        Accounting       Accounting       Marketing        Marketing        Statistics
DEPT_CODE        ACCT             ACCT             MKTG             MKTG             MATH
DEPT_NAME        Accounting       Accounting       Marketing        Marketing        Mathematics
DEPT_PHONE       4356             4356             4378             4378             3420
COLLEGE_NAME     Business Admin   Business Admin   Business Admin   Business Admin   Arts & Sciences
ADVISOR_LNAME    Grastrand        Grastrand        Gentry           Tillery          Chen
ADVISOR_OFFICE   T201             T201             T228             T356             J331
ADVISOR_BLDG     Torre Building   Torre Building   Torre Building   Torre Building   Jones Building
ADVISOR_PHONE    2115             2115             2123             2159             3209
STU_GPA          3.87             2.78             2.31             3.45             3.58
STU_HOURS        75               45               117              113              87
STU_CLASS        Junior           Sophomore        Senior           Senior           Junior
a. Write the relational schema, draw its dependency diagram, and identify all dependencies, including all transitive dependencies. The dependency diagram for problem 4a is shown in Figure P6.4a.
Figure P6.4a The Dependency Diagram for Problem 4a
(Figure: dependency diagram over STU_NUM, STU_LNAME, STU_MAJOR, DEPT_CODE, DEPT_NAME, DEPT_PHONE, COLLEGE_NAME, ADV_LASTNAME, ADV_OFFICE, ADV_BUILDING, ADV_PHONE, STU_CLASS, STU_GPA, and STU_HOURS, with the transitive dependencies marked.)
Note 1: The ADV_LASTNAME is not a determinant of ADV_OFFICE or ADV_PHONE, because there are (potentially) many advisors who have the same last name.

Note 2: If a department has only one phone, DEPT_CODE is a determinant of DEPT_PHONE. If a department has several phones, the DEPT_CODE is no longer a determinant of DEPT_PHONE. In any case, if you know the DEPT_PHONE value, you know the DEPT_CODE value. Therefore, DEPT_PHONE is a determinant of DEPT_CODE. This latter dependency, indicated in orange, sets the stage for a BCNF violation when the initial structure is normalized.

Note 3: ADV_OFFICE is a determinant of ADV_BUILDING if the ADV_OFFICE is, in effect, a code. For example, if offices such as HE-201 and HE-324 use the prefix HE to indicate their location in the Heinz building, the office locators determine the building.
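The determinant reasoning in the notes above can be checked mechanically against sample data: X determines Y only if every X value pairs with exactly one Y value. The helper below is illustrative (the function name and the reduced row sample are not from the text); keep in mind that sample data can refute a dependency but can never establish one, which is why the notes appeal to business rules:

```python
def is_determinant(rows, x, y):
    """Return True if attribute x functionally determines attribute y in rows."""
    seen = {}
    for row in rows:
        if row[x] in seen and seen[row[x]] != row[y]:
            return False  # same x value paired with two different y values
        seen[row[x]] = row[y]
    return True

# The Table P6.4 sample records, reduced to the attributes under discussion.
students = [
    {"DEPT_CODE": "ACCT", "DEPT_PHONE": "4356", "ADVISOR_LNAME": "Grastrand"},
    {"DEPT_CODE": "ACCT", "DEPT_PHONE": "4356", "ADVISOR_LNAME": "Grastrand"},
    {"DEPT_CODE": "MKTG", "DEPT_PHONE": "4378", "ADVISOR_LNAME": "Gentry"},
    {"DEPT_CODE": "MKTG", "DEPT_PHONE": "4378", "ADVISOR_LNAME": "Tillery"},
    {"DEPT_CODE": "MATH", "DEPT_PHONE": "3420", "ADVISOR_LNAME": "Chen"},
]

# Consistent with Note 2: each phone maps to one department.
phone_determines_dept = is_determinant(students, "DEPT_PHONE", "DEPT_CODE")
# Refuted by the data: MKTG is paired with both Gentry and Tillery,
# so a department does not determine its advisors.
dept_determines_advisor = is_determinant(students, "DEPT_CODE", "ADVISOR_LNAME")
```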
As you discuss Figure 6.4a, note that the single attribute PK (STU_NUM) automatically places this table in 2NF, because it is not possible to have partial dependencies when the PK consists of a single attribute. The relational schema for the dependency diagram shown in Figure P6.4a is written as: STUDENT(STU_NUM, STU_LNAME, STU_MAJOR, DEPT_CODE, DEPT_NAME, DEPT_PHONE, ADVISOR_LNAME, ADVISOR_OFFICE, ADVISOR_BLDG, ADVISOR_PHONE, STU_GPA, STU_HOURS, STU_CLASS)
b. Write the relational schema and draw the dependency diagram to meet the 3NF requirements to the greatest extent possible. If you believe that practical considerations dictate using a 2NF structure, explain why your decision to retain 2NF is appropriate. If necessary, add or modify attributes to create appropriate determinants and to adhere to the naming conventions.
NOTE Although the completed student hours (STU_HOURS) do determine the student classification (STU_CLASS), this dependency is not as obvious as you might initially assume it to be. For example, a student is considered a junior if that student has completed between 61 and 90 credit hours. Therefore, a student who is classified as a junior may have completed 66, 72, or 87 hours or any other number of hours within the specified range of 61– 90 hours. In short, any hour value within a specified range will define the classification.
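The range-based classification described in the note can be sketched as a lookup function. Only the junior range (61–90 hours) is given in the text; the other boundaries below are assumed for illustration:

```python
def stu_class(hours):
    """Map completed credit hours to a student classification."""
    if hours <= 30:
        return "Freshman"   # assumed range
    if hours <= 60:
        return "Sophomore"  # assumed range
    if hours <= 90:
        return "Junior"     # 61-90 hours, per the note
    return "Senior"         # assumed range
```

Because every value in a range maps to the same classification, application software can refresh STU_CLASS from STU_HOURS with a lookup like this each semester, which is why the solution retains the 2NF condition rather than decomposing it.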
The normalized structure is shown in Figure P6.4b. The relational schemas are written as:

STUDENT(STU_NUM, STU_LNAME, STU_MAJOR, DEPT_CODE, ADVISOR_NUM, STU_GPA, STU_HOURS, STU_CLASS)
(Note that we have added the ADVISOR_NUM to serve as a FK to the advisor attributes.)
ADVISOR(ADVISOR_NUM, ADVISOR_LNAME, ADVISOR_OFFICE, ADVISOR_BLDG, ADVISOR_PHONE)
MAJOR(MAJOR_CODE, DEPT_CODE, MAJOR_DESCRIPTION)
BUILDING(BLDG_CODE, BLDG_NAME, BLDG_MANAGER)
DEPARTMENT(DEPT_CODE, DEPT_NAME, DEPT_PHONE, COLLEGE_CODE)
COLLEGE(COLL_CODE, COLL_NAME)

(After studying Chapter 4, “Entity Relationship Modeling,” your students should know enough about database design to suggest many improvements in the design before it can be implemented.)
Figure P6.4b The Normalized Dependency Diagrams for Problem 4b
(Figure: normalized dependency diagrams for STUDENT(STU_NUM, STU_LNAME, STU_MAJOR, DEPT_CODE, ADV_NUM, STU_CLASS, STU_GPA, STU_HRS), MAJOR(MAJOR_CODE, DEPT_CODE, MAJOR_DESCRIPTION), BUILDING(BLDG_CODE, BLDG_NAME, BLDG_MANAGER), DEPARTMENT(DEPT_CODE, DEPT_NAME, DEPT_PHONE, COLL_CODE), COLLEGE(COLL_CODE, COLL_NAME), and ADVISOR(ADV_NUM, ADV_LASTNAME, ADV_OFFICE, ADV_BUILDING, ADV_PHONE), with the remaining transitive dependencies marked.)

Note: If a department has only one phone, DEPT_CODE is a determinant of DEPT_PHONE. If a department has several phones, the DEPT_CODE is not a determinant of the DEPT_PHONE. However, if you know a department phone number, you also know the DEPT_CODE ... thus creating a condition in which the BCNF requirement is not met.

Note: If several advisors share a phone, the ADV_PHONE is not a determinant of the other advisor attributes.

Note: The ADV_NUM attribute was created to produce a proper primary key. The dotted transitive dependency line indicates that this dependency is subject to interpretation. (See the discussion in the IM text.)
As you discuss Figure P6.4b, explain that, in this case, the STUDENT table structure indicates a 2NF condition because two transitive dependencies exist. If there is an information requirement to track the components of each major, we can break out a major code, store it in STUDENT, create a new entity named MAJOR, and relate it to its department in a 1:M relationship. (Each department offers many majors, but only one department offers each major.) Creating a new entity to eliminate the student classification-induced transitive dependency would increase implementation complexity needlessly; student hours are updated each semester by application software, and other application software can then use a look-up table to update the classification when necessary. Structure simplicity is a virtue. In any case, the normalization diagram may be modified as shown next. (We have added a few attributes, such as BLDG_MANAGER, to improve the database's ability to provide information.) Note that the assumptions inherent in the business rules also make an impact on normalization practices! If the room is numbered to reflect the building it is in – for example, HE105 indicates room 105 in the Heinz building – one might argue that the ADV_OFFICE value is the determinant of the ADV_BUILDING. (You will learn in Chapter 7 that you can create a query to find a building by looking at room prefixes.) However, if you define dependencies in strictly relational algebra terms,
you might argue that partitioning the attribute value to “create” a dependency indicates that the partitioned attribute is not (in that strict sense) a determinant. Although we have indicated a transitive dependency from ADV_OFFICE to ADV_BUILDING, we have used a dotted line to indicate that there is room for argument in this set of transitive dependencies. In any case, the (arguable) dependency ADV_OFFICE → ADV_BUILDING does not create any problems in a practical sense, so it is acceptable to ignore this (arguable) transitive dependency. Keep in mind that the decomposition shown in Figure P6.4b is subject to many modifications, depending on information requirements and business rules. For example, both the department and the college may be tied to the building in which they are located. Additional modifications are discussed in the answer to Problem 9.

c. Draw the Crow’s Foot ERD.
NOTE This ERD constitutes a small segment of a university’s full-blown design. For example, this segment might be combined with the Tiny College presentation in Chapter 4. The Crow’s Foot ERD is shown in Figure P6.4c.
Figure P6.4c The College ERD
As you examine the ER diagrams in Figure P6.4c, note that we have made several assumptions that cannot be inferred directly from the dependency diagram in problem 4b. For example: • Apparently, some buildings do not house advisors. Some buildings may be used for storage, others for classrooms, and so on.
• When a student is assigned to a department, that department must assign an advisor to that student. That is, a student must have an advisor. Therefore, ADVISOR is mandatory to STUDENT.
• Evidently, some advisors do not (yet?) have students assigned to them. From an operational point of view, this optionality is desirable, because it enables us to create a new advisor without having to assign a student advisee to that new advisor. (The new advisor may have to receive some training before having students assigned to him or her.)
• Some departments do not offer majors. For example, a department may offer service courses only.
• Some colleges do not have departments. This condition is subject to a business rule that is not specified, nor can it be inferred from the dependency diagram. However, this characteristic is not unusual in a college environment. For example, some professional curricula are certified by special boards. Such boards may make certification conditional on the professional curriculum’s independence. (We have created the optionality for discussion purposes. This discussion should stress the importance of the business rules. You generate the business rules by asking detailed questions!)
• All departments must be affiliated with a college.
• STUDENT is optional to MAJOR. This optionality, too, is desirable from an operational point of view. For example, new majors may not (yet) have attracted students.
Business rules may change the nature of the structures shown here. For example, an advisor is likely to be a professor ... who is an employee of the university. Therefore, you might introduce a superset/subset relationship between EMPLOYEE and PROFESSOR, while the need to distinguish between professors and advisors disappears. Similarly, EMPLOYEE may be the source of information concerning the BUILDING manager, thus creating a relationship between BUILDING and EMPLOYEE. Note also that the nature of the relationships (1:1, 1:M, M:N) is not revealed by the dependency diagrams. For example, the 1:M relationship between MAJOR and DEPARTMENT (a department can offer many majors, but each major is offered by only one department) cannot be inferred from the dependency diagram. Normalization and ER modeling are part of the same design process! Finally, note that we have also included several new entities, MAJOR and BUILDING, to reflect the preceding discussion.
NOTE Remind your students that the order of the attribute listing in each entity is immaterial. Although it is customary to list the PK attribute first, there is no requirement to do so. Similarly, whether the STU_LNAME is listed before or after the STU_GPA has no effect on the STUDENT entity’s functionality.
5. To keep track of office furniture, computers, printers, and so on, the FOUNDIT company uses the table structure shown in Table P6.5.
Table P6.5 Sample ITEM Records

ITEM_ID     ITEM_LABEL         ROOM_NUMBER  BLDG_CODE  BLDG_NAME    BLDG_MANAGER
231134-678  HP DeskJet 895Cse  325          NTC        Nottooclear  I. B. Rightonit
342245-225  HP Toner           325          NTC        Nottooclear  I. B. Rightonit
254668-449  DT Scanner         123          CSF        Canseefar    May B. Next
a. Given that information, write the relational schema and draw the dependency diagram. Make sure that you label the transitive and/or partial dependencies. The answers to this problem are shown in Figure P6.5a and the relational schema definition below the figure.
Figure P6.5a The FOUNDIT Co. Initial Dependency Diagram
The dotted transitive dependency lines indicate that these transitive dependencies are subject to interpretation. We will address these dependencies in the discussion that accompanies Problem 5b’s solution. The relational schema may be written as follows: ITEM(ITEM_ID, ITEM_DESCRIPTION, BLDG_ROOM, BLDG_CODE, BLDG_NAME, BLDG_MANAGER)
b. Write the relational schema and create a set of dependency diagrams that meet 3NF requirements. Rename attributes to meet the naming conventions, and create new entities and attributes as necessary. The dependency diagrams are shown in Figure P6.5b. We have added a sample relational diagram to illustrate the relationships at this point. The relational schemas are written below Figure P6.5b. The dependency diagrams in Figure P6.5b reflect the notion that one employee manages each building.
Figure P6.5b FOUNDIT Co. 3NF and Its Relational Diagram FOUNDIT Co. Dependency Diagrams: all tables in 3NF
(Figure: dependency diagrams, all tables in 3NF, for ITEM(ITEM_ID, ITEM_DESCRIPTION, BLDG_ROOM, BLDG_CODE), BUILDING(BLDG_CODE, BLDG_NAME, EMP_CODE), and EMPLOYEE(EMP_CODE, EMP_LNAME, EMP_FNAME, EMP_INITIAL), together with the relational diagram EMPLOYEE 1:M BUILDING 1:M ITEM.)
The relational schemas are written as follows:

EMPLOYEE(EMP_CODE, EMP_LNAME, EMP_FNAME, EMP_INITIAL)
BUILDING(BLDG_CODE, BLDG_NAME, EMP_CODE)
ITEM(ITEM_ID, ITEM_DESCRIPTION, ITEM_ROOM, BLDG_CODE)

As you discuss the dependency diagrams in Figure P6.5b, remind the students that BLDG_CODE is not a determinant of BLDG_ROOM. A building can have many rooms, so knowing the building code will not tell you which room in that building holds the item.
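The three schemas above can be sketched in SQLite with the foreign keys wired up. This is an illustration, not part of the original solution; the EMP_CODE value below is invented, and the split of the manager's name into first name, initial, and last name is a guess:

```python
import sqlite3

# Problem 5b's three 3NF tables, with foreign keys enforced.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
con.executescript("""
CREATE TABLE EMPLOYEE (
  EMP_CODE  TEXT PRIMARY KEY,
  EMP_LNAME TEXT, EMP_FNAME TEXT, EMP_INITIAL TEXT);
CREATE TABLE BUILDING (
  BLDG_CODE TEXT PRIMARY KEY,
  BLDG_NAME TEXT,
  EMP_CODE  TEXT REFERENCES EMPLOYEE);  -- the building's manager
CREATE TABLE ITEM (
  ITEM_ID          TEXT PRIMARY KEY,
  ITEM_DESCRIPTION TEXT,
  ITEM_ROOM        TEXT,
  BLDG_CODE        TEXT REFERENCES BUILDING);
""")
con.execute("INSERT INTO EMPLOYEE VALUES ('E1', 'Rightonit', 'I.', 'B.')")
con.execute("INSERT INTO BUILDING VALUES ('NTC', 'Nottooclear', 'E1')")
con.execute("INSERT INTO ITEM VALUES ('231134-678', 'HP DeskJet 895Cse', '325', 'NTC')")

# An item in an unknown building is rejected, which is exactly the kind
# of inconsistency the decomposition is meant to prevent.
try:
    con.execute("INSERT INTO ITEM VALUES ('999999-000', 'DT Scanner', '123', 'XXX')")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
```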
If the room is numbered to reflect the building it is in – for example, HE105 indicates room 105 in the Heinz building – one might argue that the BLDG_ROOM value is the determinant of the BLDG_CODE and the BLDG_NAME values. You will learn in Chapter 7, “Introduction to Structured Query Language (SQL),” that you can create a query to find a building by looking at room prefixes. However, if you define dependencies in strictly relational algebra terms, you might argue that partitioning the attribute value to “create” a dependency indicates that the partitioned attribute is not (in that strict sense) a determinant. Although we have indicated a transitive dependency from BLDG_ROOM to BLDG_CODE and BLDG_NAME, we have used a dotted line to indicate that there is room for argument in this set of transitive dependencies. In any case, the (arguable) dependency BLDG_ROOM → BLDG_CODE does not create any problems in a practical sense, so we have not identified it in the Problem 9 solution. Clearly, BLDG_CODE is a determinant of BLDG_NAME. Therefore, the transitive dependency is marked properly in the Problem 5b solution.

c. Draw the Crow’s Foot ERD.

Use Figure P6.5c to show that, in this case, the ER diagram reflects the business rule that one employee can manage many (or at least more than one) buildings. Because not all employees are required to manage buildings, BUILDING is optional to EMPLOYEE in the manages relationship. Once again, the nature of this relationship is not and cannot be reflected in the dependency diagram.
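The room-prefix query mentioned above can be sketched in SQLite. The two-letter-prefix room coding and the sample codes are assumptions made for this example:

```python
import sqlite3

# If rooms are coded as HE105 (building prefix + room number), the
# building can be recovered from the room value alone.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE BUILDING (BLDG_CODE TEXT PRIMARY KEY, BLDG_NAME TEXT)")
con.execute("CREATE TABLE ITEM (ITEM_ID TEXT PRIMARY KEY, BLDG_ROOM TEXT)")
con.executemany("INSERT INTO BUILDING VALUES (?,?)",
                [("HE", "Heinz"), ("SC", "Science")])
con.executemany("INSERT INTO ITEM VALUES (?,?)",
                [("I1", "HE105"), ("I2", "SC210")])

# Join on the first two characters of the room value.
rows = con.execute("""
    SELECT I.ITEM_ID, B.BLDG_NAME
    FROM ITEM I
    JOIN BUILDING B ON B.BLDG_CODE = substr(I.BLDG_ROOM, 1, 2)
    ORDER BY I.ITEM_ID""").fetchall()
```

The fact that the query works does not make BLDG_ROOM a determinant in the strict relational algebra sense, which is the point of the dotted dependency line in Figure P6.5a.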
NOTE We also assume here that each item has a unique item code and that, therefore, an item can be located in only one place at a time. However, we demonstrate in Appendixes B and C that inventory control requirements usually cover both durable and consumable items. Although durables such as tables, desks, lamps, computers, printers, etc. would be uniquely identified by an assigned inventory code, consumables such as individual reams of paper would clearly not be so identified. Therefore, a given inventory description such as "8.5 inch x 11 inch laser printer paper" could describe reams of paper located in many different buildings and in rooms within those buildings. We demonstrate in Appendixes B and C how such a condition may be properly handled.
Figure P6.5c The FOUNDIT Co. ERD
As you examine Figure P6.5c, note that the BLDG_ROOM is actually an ITEM entity attribute, so it is appropriate to rename it ITEM_ROOM. Also, keep in mind that a room may be related to the building in which it is located. (A BUILDING may contain many ROOMs. Each ROOM is located in a single building.) Therefore, you can expand the design shown in Figure P6.5b to the one shown in Figure P6.5c. This solution assumes that a room is directly traceable to a building. For example, room SC-508 would be located in the Science (SC) Building and room BA-305 would be located in the Business Administration (BA) building. Note that we have made ROOM optional to BUILDING to reflect the likelihood that some buildings – such as storage sheds – may not contain designated (numbered) rooms. Although optionalities make excellent default conditions, it is always wise to establish the optionality based on a business rule. In any case, the designer must ask about the nature of the room/building relationship.
6. The table structure shown in Table P6.6 contains many unsatisfactory components and characteristics. For example, there are several multivalued attributes, naming conventions are violated, and some attributes are not atomic.
Table P6.6 Sample EMPLOYEE Records

Attribute Name        Sample Value                                  Sample Value    Sample Value     Sample Value
EMP_NUM               1003                                          1018            1019             1023
EMP_LNAME             Willaker                                      Smith           McGuire          McGuire
EMP_EDUCATION         BBA, MBA                                      BBA                              BS, MS, Ph.D.
JOB_CLASS             SLS                                           SLS             JNT              DBA
EMP_DEPENDENTS        Gerald (spouse), Mary (daughter), John (son)                  JoAnne (spouse)  George (spouse), Jill (daughter)
DEPT_CODE             MKTG                                          MKTG            SVC              INFS
DEPT_NAME             Marketing                                     Marketing       General Service  Info. Systems
DEPT_MANAGER          Jill H. Martin                                Jill H. Martin  Hank B. Jones    Carlos G. Ortez
EMP_TITLE             Sales Agent                                   Sales Agent     Janitor          DB Admin
EMP_DOB               23-Dec-1968                                   28-Mar-1979     18-May-1982      20-Jul-1959
EMP_HIRE_DATE         14-Oct-1997                                   15-Jan-2006     21-Apr-2003      15-Jul-1999
EMP_TRAINING          L1, L2                                                        L1               L1, L3, L8, L15
EMP_BASE_SALARY       $38,255.00                                    $30,500.00      $19,750.00       $127,900.00
EMP_COMMISSION_RATE   0.015                                         0.010
a. Given the structure shown in Table P6.6, write the relational schema and draw its dependency diagram. Label all transitive and/or partial dependencies. The dependency diagram is shown in Figure P6.6a. Note that the order of the attributes has been changed to make the transitive dependencies easier to mark. (In any case, the order in which the attributes are written into a relational database table is immaterial.) The relational schema is written below Figure P6.6a.
Figure P6.6a The Dependency Diagram for Problem 6a
(Figure: dependency diagram over EMP_CODE, EMP_LNAME, EMP_EDUCATION, DEPT_CODE, DEPT_NAME, DEPT_MANAGER, EMP_DEPENDENTS, EMP_DOB, EMP_HIRE_DATE, EMP_TRAINING, JOB_TITLE, JOB_CLASS, EMP_BASE_SALARY, and EMP_COMMISSION_RATE, with the transitive dependencies among the DEPT_ attributes and among the JOB_ and salary attributes marked.)
The relational schema is written as:

EMPLOYEE(EMP_CODE, EMP_LNAME, EMP_EDUCATION, JOB_CLASS, EMP_DEPENDENTS, DEPT_CODE, DEPT_NAME, DEPT_MANAGER, EMP_TITLE, EMP_DOB, EMP_HIRE_DATE, EMP_TRAINING, EMP_BASE_SALARY, EMP_COMMISSION_RATE)

b. Draw the dependency diagrams that are in 3NF. (Hint: You might have to create a few new attributes. Also make sure that the new dependency diagrams contain attributes that meet proper design criteria; that is, make sure there are no multivalued attributes, that the naming conventions are met, and so on.)

Dependency diagrams have no way to indicate multivalued attributes, nor do they provide the means through which such attributes can be handled. Therefore, the solution to this problem requires a basic knowledge of modeling concepts, once again indicating that normalization and design are part of the same process. Given the sample data shown in Problem 6, EMP_EDUCATION, EMP_DEPENDENTS, and EMP_TRAINING are multivalued attributes whose values are stored as strings. We have created the appropriate entities (EDUCATION, DEPENDENT, and QUALIFICATION) to avoid the use of multivalued attributes. (See Figure P6.6b.)
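Flattening one of those multivalued strings into rows for the new entity is a one-time conversion task. The sketch below (an illustration; the helper name is invented) splits the EMP_DEPENDENTS string from Table P6.6 into DEPENDENT rows keyed by EMP_CODE and DEP_NUM:

```python
import re

def split_dependents(emp_code, dependents):
    """'Gerald (spouse), Mary (daughter)' -> [(emp_code, 1, 'Gerald', 'spouse'), ...]"""
    rows = []
    for num, part in enumerate(dependents.split(","), start=1):
        m = re.match(r"\s*(\S+)\s*\((\w+)\)", part)
        if m:
            rows.append((emp_code, num, m.group(1), m.group(2)))
    return rows

rows = split_dependents(1003, "Gerald (spouse), Mary (daughter), John (son)")
```

Each resulting tuple maps directly onto the DEPENDENT(EMP_CODE, DEP_NUM, DEP_FNAME, DEP_TYPE) structure shown in Figure P6.6b.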
Figure P6.6b The Dependency Diagrams for Problem 6b

(Figure: 3NF dependency diagrams for EMPLOYEE(EMP_CODE, EMP_LNAME, DEPT_CODE, JOB_CLASS, EMP_DOB, EMP_HIRE_DATE), DEPARTMENT(DEPT_CODE, DEPT_NAME, EMP_CODE), QUALIFICATION(EMP_CODE, EDUC_CODE, QUAL_DATE), EDUCATION(EDUC_CODE, EDUC_DESCRIPTION), DEPENDENT(EMP_CODE, DEP_NUM, DEP_FNAME, DEP_TYPE), and JOB(JOB_CLASS, JOB_TITLE, JOB_BASE_SALARY).)
As you discuss Figure P6.6b, note that a real world design would have to include additional entities or additional attributes in the existing entities. For example, while the job description is likely to include a (job) base salary, employee experience – perhaps measured by time in the job classification and performance – is likely to add to the job’s base salary. Therefore, the EMPLOYEE table might include a salary or hourly wage adjustment attribute. Overall employment longevity is likely to be included, too … employers often find it useful to keep (expensive) job turnover rates low. And, of course, you might include year-to-date (YTD) earnings and taxes in each employee’s records, too. This problem is a great source of discussion material! The relational schemas are written as:
EMPLOYEE(EMP_CODE, EMP_LNAME, DEPT_CODE, JOB_CLASS, EMP_DOB, EMP_HIRE_DATE)
DEPENDENT(EMP_CODE, DEP_NUM, DEP_FNAME, DEP_TYPE)
DEPARTMENT(DEPT_CODE, DEPT_NAME, EMP_CODE)
JOB(JOB_CLASS, JOB_TITLE, JOB_BASE_SALARY)
EDUCATION(EDUC_CODE, EDUC_DESCRIPTION)
QUALIFICATION(EMP_CODE, EDUC_CODE, QUAL_DATE_EARNED)

c. Draw the relational diagram.

The relational diagram is shown in Figure P6.6c.
Figure P6.6c The Relational Diagram for Problem 6c

(Figure: relational diagram linking DEPARTMENT 1:M EMPLOYEE, EMPLOYEE 1:M DEPENDENT, EMPLOYEE 1:M QUALIFICATION M:1 EDUCATION, and JOB 1:M EMPLOYEE, using the PK and FK attributes listed in the schemas above.)
d. Draw the Crow’s Foot ERD. The Crow’s Foot solution is shown in Figure P6.6d.
Figure P6.6d The Crow’s Foot ERD for Problem 6d
7. Suppose you are given the following business rules to form the basis for a database design. The database must enable the manager of a company dinner club to mail invitations to the club’s members, to plan the meals, to keep track of who attends the dinners, and so on.
• Each dinner serves many members, and each member may attend many dinners.
• A member receives many invitations, and each invitation is mailed to many members.
• A dinner is based on a single entree, but an entree may be used as the basis for many dinners. For example, a dinner may be composed of a fish entree, rice, and corn. Or the dinner may be composed of a fish entree, a baked potato, and string beans.
Because the manager is not a database expert, the first attempt at creating the database uses the structure shown in Table P6.7:
Table P6.7 Sample RESERVATION Records

Attribute Name       Sample Value                             Sample Value         Sample Value
MEMBER_NUM           214                                      235                  214
MEMBER_NAME          Alice B. VanderVoort                     Gerald M. Gallega    Alice B. VanderVoort
MEMBER_ADDRESS       325 Meadow Park                          123 Rose Court       325 Meadow Park
MEMBER_CITY          Murkywater                               Highlight            Murkywater
MEMBER_ZIPCODE       12345                                    12349                12345
INVITE_NUM           8                                        9                    10
INVITE_DATE          23-Feb-2016                              12-Mar-2016          23-Feb-2016
ACCEPT_DATE          27-Feb-2016                              15-Mar-2016          27-Feb-2016
DINNER_DATE          15-Mar-2016                              17-Mar-2016          15-Mar-2016
DINNER_ATTENDED      Yes                                      Yes                  No
DINNER_CODE          DI5                                      DI5                  DI2
DINNER_DESCRIPTION   Glowing sea delight                      Glowing sea delight  Ranch Superb
ENTREE_CODE          EN3                                      EN3                  EN5
ENTREE_DESCRIPTION   Stuffed crab                             Stuffed crab         Marinated steak
DESSERT_CODE         DE8                                      DE5                  DE2
DESSERT_DESCRIPTION  Chocolate mousse with raspberry sauce    Cherries jubilee     Apple pie with honey crust
a. Given the table structure illustrated in Table P6.7, write its relational schema and draw its dependency diagram. Label all transitive and/or partial dependencies. (Hint: This structure uses a composite primary key.) The relational schema may be written as follows:

MEMBER(MEMBER_NUM, MEMBER_NAME, MEMBER_ADDRESS, MEMBER_CITY, MEMBER_ZIPCODE, INVITE_NUM, INVITE_DATE, ACCEPT_DATE, DINNER_DATE, DINNER_ATTENDED, DINNER_CODE, DINNER_DESCRIPTION, ENTREE_CODE, ENTREE_DESCRIPTION, DESSERT_CODE, DESSERT_DESCRIPTION)

The dependency diagram is shown in Figure P6.7a. Note that DIN_CODE in Figure P6.7a does not determine DIN_ATTEND; just because a dinner is offered does not mean that it is attended. Note also that we have shortened the prefixes – for example, MEMBER_ADDRESS has been shortened to MEM_ADDRESS – to provide sufficient space to include all the attributes.
Figure P6.7a The Dependency Diagram for Problem 7a
(Figure: dependency diagram for the composite PK (MEM_NUM, INVITE_NUM) over MEM_NAME, MEM_ADDRESS, MEM_CITY, MEM_ZIP, INVITE_DATE, ACC_DATE, DIN_ATTEND, DIN_CODE, DIN_DESCRIPTION, ENT_CODE, ENT_DESCRIPTION, DES_CODE, and DES_DESCRIPTION, with the partial dependencies and the transitive dependencies such as DIN_CODE → DIN_DESCRIPTION, ENT_CODE → ENT_DESCRIPTION, and DES_CODE → DES_DESCRIPTION marked.)
b. Break up the dependency diagram you drew in Problem 7a to produce dependency diagrams that are in 3NF and write the relational schema. (Hint: You might have to create a few new attributes. Also, make sure that the new dependency diagrams contain attributes that meet proper design criteria; that is, make sure that there are no multivalued attributes, that the naming conventions are met, and so on.)

Actually, there is no way to prevent the existence of multivalued attributes by merely following normalization rules. Instead, knowledge of E-R modeling concepts will help define the environment in which the multivalued attributes are dealt with. Although we keep repeating the message, it is worth repeating: normalization and modeling fit within the same design spectrum, and they take place concurrently as the definition of entities and their attributes takes place. The design process can be described thus:
• Define entities, attributes, and relationships and model them.
• Normalize.
• Redesign based on the normalization outcomes and the evaluation of the design's ability to meet transaction and information requirements.
• Normalize the results and evaluate the normal forms until the process has yielded a stable design, implementation, and applications development environment.
Such a process will yield the dependency diagrams shown in Figure P6.7b. In this case, it hardly seems practical to eliminate the 2NF condition displayed by MEMBER. After all, zip codes tend to be thought of as part of the address. Worse, the elimination of the MEMBER's 2NF condition would require the creation of a ZIPCODE table, with ZIP_CODE as the foreign key in the MEMBER table. Such a solution would merely add complexity without adding functionality.
Figure P6.7b The Dependency Diagram for Problem 7b
(Figure: 3NF dependency diagrams for MEMBER(MEM_NUM, MEM_NAME, MEM_ADDRESS, MEM_CITY, MEM_STATE, MEM_ZIP), INVITATION(INVITE_NUM, INVITE_DATE, DIN_CODE, MEM_NUM, INVITE_ACCEPT, INVITE_ATTEND), DINNER(DIN_CODE, DIN_DATE, DIN_DESCRIPTION, ENT_CODE, DES_CODE), ENTREE(ENT_CODE, ENT_DESCRIPTION), and DESSERT(DES_CODE, DES_DESCRIPTION), with the remaining transitive dependency in MEMBER marked.)
As you examine Figure P6.7b, note how easy it is to see the functionality of the decomposition. For example, the (composite) INVITATION and DINNER entities make it possible to track who was sent an invitation on what date (INVITE_DATE) to a dinner to be held at some specified date (DIN_DATE), what dinner (DIN_CODE) would be served on that date, who (MEM_NUM) accepted the invitation (INVITE_ACCEPT), and who actually attended (INVITE_ATTEND). The INVITE_ACCEPT attribute would be a simple Y/N value, as would the INVITE_ATTEND. To avoid nulls, the default values for INVITE_ACCEPT and INVITE_ATTEND could be set to N. Getting the number of acceptances for a given dinner by a given date would be simple, thus enabling the catering service to plan the dinner better. The relational schemas follow:

MEMBER(MEM_NUM, MEM_NAME, MEM_ADDRESS, MEM_CITY, MEM_STATE, MEM_ZIP)
INVITATION(INVITE_NUM, INVITE_DATE, DIN_CODE, MEM_NUM, INVITE_ACCEPT, INVITE_ATTEND)
ENTREE(ENT_CODE, ENT_DESCRIPTION)
DINNER(DIN_CODE, DIN_DATE, DIN_DESCRIPTION, ENT_CODE, DES_CODE)
DESSERT(DES_CODE, DES_DESCRIPTION)

Naturally, to track costs and revenues, the manager would ask you to add appropriate attributes to DESSERT and ENTREE. For example, the DESSERT table might include DES_COST and DES_PRICE to enable the manager to track net returns on each dessert served. One would also expect the manager to want to track YTD expenditures of the members, and, of course, there would have to be an invoicing module for billing purposes. And what about keeping track of member balances as the members charge meals and make payments on account?

c. Using the results of Problem 7b, draw the Crow’s Foot ERD.

The Crow’s Foot ERD is shown in Figure P6.7c.
Figure P6.7c The Crow’s Foot ERD for Problem 7c
8. Use the dependency diagram shown in Figure P6.8 to work the following problems.
FIGURE P6.8 Initial Dependency Diagram for Problem 8
(Figure: dependency diagram over attributes A, B, C, D, E, F, and G, with the dependency arrows that are decomposed in Problems 8a through 8c.)
a. Break up the dependency diagram in Figure P6.8 to create two new dependency diagrams, one in 3NF and one in 2NF. The dependency diagrams are shown in Figure P6.8a.
Figure P6.8a The Dependency Diagram for Problem 8a
(Figure: a 3NF diagram with A → D, and a 2NF diagram with the composite PK A + B and dependent attributes C, E, F, and G, with the transitive dependency marked.)

Note that the dependency from C is not a transitive dependency, because C does not determine another nonkey attribute value. Instead, C determines the value of a key attribute.
b. Modify the dependency diagrams you created in Problem 8a to produce a collection of dependency diagrams that are all in 3NF. (Hint: One of your dependency diagrams will be in 3NF, but not in BCNF.) The solution is shown in Figure P6.8b.
Figure P6.8b The Dependency Diagram for Problem 8b
c. Modify the dependency diagrams in Problem 8b to produce a collection of dependency diagrams that are all in 3NF and BCNF. The solution is shown in Figure P6.8c. Note that the A, C, and E attributes in the first three structures can be used as foreign keys in the fourth structure.
Figure P6.8c The Dependency Diagrams for Problem 8c
9. Suppose that you have been given the table structure and data shown in Table P6.9, which was imported from an Excel spreadsheet. The data reflect that a professor can have multiple advisees, can serve on multiple committees, and can edit more than one journal.
Table P6.9 Sample PROFESSOR Records

Attribute Name | Sample Value | Sample Value | Sample Value | Sample Value
EMP_NUM | 123 | 104 | 118 | 120
PROF_RANK | Professor | Asst. Professor | Assoc. Professor | Assoc. Professor
EMP_NAME | Ghee | Rankin | Ortega | Smith
DEPT_CODE | CIS | CHEM | CIS | ENG
DEPT_NAME | Computer Info. Systems | Chemistry | Computer Info. Systems | English
PROF_OFFICE | KDD-567 | BLF-119 | KDD-562 | PRT-345
ADVISEE | 1215, 2312, 3233, 2218, 2098 | 3102, 2782, 3311, 2008, 2876, 2222, 3745, 1783, 2378 | 2134, 2789, 3456, 2002, 2046, 2018, 2764 | 2873, 2765, 2238, 2901, 2308
COMMITTEE_CODE | PROMO, TRAF, APPL, DEV | DEV | SPR, TRAF | PROMO, SPR, DEV
JOURNAL_CODE | JMIS, QED, JMGT | | JCIS, JMGT |
Given the information in Table P6.9: a. Draw the dependency diagram. The dependency diagram is shown in Figure P6.9a.
Figure P6.9a The Dependency Diagram for Problem 9a
(Figure content: EMP_NUM is the key; the attributes are PROF_RANK, EMP_NAME, DEPT_CODE, DEPT_NAME, PROF_OFFICE, ADVISEE, COMMITTEE_CODE, and JOURNAL_CODE, with a transitive dependency from DEPT_CODE to DEPT_NAME. Figure note: if each professor has a private office, PROF_OFFICE is a determinant of EMP_NUM. However, if an office can be shared among two or more professors, the dependency shown here does not exist. Because this dependency is not clear-cut, the dependency line is shown as a dashed line.)
Note that Figure P6.9a reflects several ambiguities. For example, although each PROF_OFFICE value shown in Table P6.9 is unique, does that limited information indicate that each professor has a private office? If so, the office number identifies the professor who uses that office. This condition yields a dependency. However, this dependency is not a transitive one, because a nonkey attribute, PROF_OFFICE, determines the value of a key attribute, EMP_NUM. (We have indicated this uncertain dependency through a dashed dependency line.)
NOTE The assumption that PROF_OFFICE → EMP_NUM is a rather restrictive one, because it would mean that professors cannot share an office. One could safely assume that administrators at all levels would not care to be tied by such a restrictive office assignment requirement. Therefore, we will remove this restriction in the remaining problem solutions.
Also, note that there is no reliable way to identify the effect of multivalued attributes on the dependencies. For example, EMP_NUM = 123 could identify any one of five advisees. Therefore, knowing the EMP_NUM does not identify a specific ADVISEE value. The same is true for the COMMITTEE_CODE and JOURNAL_CODE attributes. Therefore, these attributes are not marked with a solid arrow line. However, if you know that EMP_NUM = 123, you will also know all five advisees, all four committee codes, and all three journal codes for that employee number value. But you do not have a unique identification for each of those attribute values. Therefore, you cannot conclude that EMP_NUM → ADVISEE, nor can you conclude that EMP_NUM → COMMITTEE_CODE or that EMP_NUM → JOURNAL_CODE.
b. Identify the multivalued dependencies.
Table P6.9 shows several professor attributes (ADVISEE, COMMITTEE_CODE, and JOURNAL_CODE) that represent multivalued dependencies.
c. Create the dependency diagrams to yield a set of table structures in 3NF.
The dependency diagrams are shown in Figure P6.9c. Note that we have assumed that professors can share an office.
Figure P6.9c The Dependency Diagram for Problem 9c
(Figure content: 3NF structures built from EMP_NUM, PROF_RANK, EMP_NAME, PROF_OFFICE, DEPT_CODE, and DEPT_NAME, plus structures for the multivalued ADVISEE, COMMITTEE_CODE, and JOURNAL_CODE attributes.)
d. Eliminate the multivalued dependencies by converting the affected table structures to 4NF. The structures shown in Figure P6.9d1 conform to the 4NF requirement. Yet this normalization does not yield a viable database design. Here is another opportunity to stress that normalization without data modeling is a poor way to generate useful databases. (Note that we have assumed that an advisee can have only one advisor, but that an advisor can have many advisees.)
Figure P6.9d1 The Initial Dependency Diagrams for Problem 9d
(Figure content: structures pairing EMP_NUM with ADVISEE, with COMMITTEE_CODE, and with JOURNAL_CODE, plus a structure containing EMP_NUM, PROF_RANK, EMP_NAME, PROF_OFFICE, DEPT_CODE, and DEPT_NAME.)
Problem: This “solution” has limited value, because the relationships between EMP_NUM and JOURNAL_CODE and between EMP_NUM and COMMITTEE_CODE are M:N. (A professor can write for many journals, and each journal includes articles by many professors. Similarly, a professor can serve on many committees, and each committee is composed of several professors.)
The dependency diagrams shown in Figure P6.9d1 constitute an attempt to eliminate the shortcomings of the “system” shown in Figure P6.9c. Unfortunately, while this solution meets the normalization requirements, it lacks the ability to properly link the professors to committees and journals. (That’s because the relationships between professors and journals and between professors and committees are M:N.) This solution would yield Tables P6.9d1 and P6.9d2. (One would expect a professor to be an employee, so it’s reasonable to assume that, at some point, we’ll have to create a supertype/subtype relationship between employee and professor. To save space, we show only the first three EMP_NUM value sets from Table P6.9.)
Table P6.9d1 Implementation of the M:N Relationship between EMP_NUM and COMMITTEE_CODE

EMP_NUM | COMMITTEE_CODE
123 | PROMO
123 | TRAF
123 | APPL
123 | DEV
104 | DEV
118 | SPR
118 | TRAF
The PK of the table shown in Table P6.9d1 is EMP_NUM + COMMITTEE_CODE.
Table P6.9d2 Implementation of the M:N Relationship between EMP_NUM and JOURNAL_CODE

EMP_NUM | JOURNAL_CODE
123 | JMIS
123 | QED
123 | JMGT
118 | JCIS
118 | JMGT
The PK of the table shown in Table P6.9d2 is EMP_NUM + JOURNAL_CODE. Because EMP_NUM = 104 does not show any entries in the JOURNAL_CODE column, that employee number does not occur in Table P6.9d2. The preceding table structures create multiple redundancies. Therefore, this solution is not acceptable. Here is yet another indication that normalization, while very useful, is not always (usually?) capable of producing implementable solutions. For example, the preceding examples illustrate that multivalued attributes and M:N relationships cannot be effectively modeled without first using the ERD. (After the ERD has done its work, you should, of course, use dependency diagrams to check for data redundancies!) Figure P6.9e shows a more practical solution to the problem, and its structures all conform to the normalization requirements.
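The composite-key implementation of Table P6.9d1 can be sketched as a bridge table. This is a minimal sqlite3 illustration (not part of the original solution); the data rows come from Tables P6.9 and P6.9d1, while the data types and the PROF_COMMITTEE table name are illustrative assumptions.

```python
import sqlite3

# Sketch of the composite-entity approach: the PROF_COMMITTEE bridge table
# carries a composite primary key (EMP_NUM + COMMITTEE_CODE), so each
# professor/committee pairing is stored exactly once, with no repeating groups.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PROFESSOR (
        EMP_NUM  INTEGER PRIMARY KEY,
        EMP_NAME TEXT
    );
    CREATE TABLE COMMITTEE (
        COMMITTEE_CODE TEXT PRIMARY KEY
    );
    CREATE TABLE PROF_COMMITTEE (
        EMP_NUM        INTEGER REFERENCES PROFESSOR,
        COMMITTEE_CODE TEXT    REFERENCES COMMITTEE,
        PRIMARY KEY (EMP_NUM, COMMITTEE_CODE)
    );
""")
conn.executemany("INSERT INTO PROFESSOR VALUES (?, ?)",
                 [(123, 'Ghee'), (104, 'Rankin'), (118, 'Ortega')])
conn.executemany("INSERT INTO COMMITTEE VALUES (?)",
                 [('PROMO',), ('TRAF',), ('APPL',), ('DEV',), ('SPR',)])
conn.executemany("INSERT INTO PROF_COMMITTEE VALUES (?, ?)",
                 [(123, 'PROMO'), (123, 'TRAF'), (123, 'APPL'), (123, 'DEV'),
                  (104, 'DEV'), (118, 'SPR'), (118, 'TRAF')])

# The M:N relationship is now queryable from either side.
committees_123 = [r[0] for r in conn.execute(
    "SELECT COMMITTEE_CODE FROM PROF_COMMITTEE "
    "WHERE EMP_NUM = 123 ORDER BY COMMITTEE_CODE")]
print(committees_123)  # ['APPL', 'DEV', 'PROMO', 'TRAF']
```

The same pattern applies to the EMP_NUM/JOURNAL_CODE relationship of Table P6.9d2.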
e. Draw the Crow’s Foot ERD to reflect the dependency diagrams you drew in Part c. (Note: You might have to create additional attributes to define the proper PKs and FKs. Make sure that all of your attributes conform to the naming conventions.) Given the discussion in the previous problem segment d, we have incorporated additional features in the Crow’s Foot ERD shown in Figure P6.9e. Note that we have eliminated the M:N relationships in this design by creating composite entities. This design is implementable and it meets design standards. Normalization was part of the process that led to this solution, but it was only a part of that solution. Normalization does not replace design!
Figure P6.9e The Crow’s Foot ERD for Problem 9e
10. The manager of a consulting firm has asked you to evaluate a database that contains the table structure shown in Table P6.10.
Table P6.10 Sample CLIENT Records

Attribute Name | Sample Value | Sample Value | Sample Value
CLIENT_NUM | 298 | 289 | 289
CLIENT_NAME | Marianne R. Brown | James D. Smith | James D. Smith
CLIENT_REGION | Midwest | Southeast | Southeast
CONTRACT_DATE | 10-Feb-2018 | 15-Feb-2018 | 12-Mar-2018
CONTRACT_NUMBER | 5841 | 5842 | 5843
CONTRACT_AMOUNT | $2,985,000.00 | $670,300.00 | $1,250,000.00
CONSULT_CLASS_1 | Database Administration | Internet Services | Database Design
CONSULT_CLASS_2 | Web Applications | | Database Administration
CONSULT_CLASS_3 | | | Network Installation
CONSULT_CLASS_4 | | |
CONSULTANT_NUM_1 | 29 | 34 | 25
CONSULTANT_NAME_1 | Rachel G. Carson | Gerald K. Ricardo | Angela M. Jamison
CONSULTANT_REGION_1 | Midwest | Southeast | Southeast
CONSULTANT_NUM_2 | 56 | 38 | 34
CONSULTANT_NAME_2 | Karl M. Spenser | Anne T. Dimarco | Gerald K. Ricardo
CONSULTANT_REGION_2 | Midwest | Southeast | Southeast
CONSULTANT_NUM_3 | 22 | 45 |
CONSULTANT_NAME_3 | Julian H. Donatello | Geraldo J. Rivera |
CONSULTANT_REGION_3 | Midwest | Southeast |
CONSULTANT_NUM_4 | | 18 |
CONSULTANT_NAME_4 | | Donald Chen |
CONSULTANT_REGION_4 | | West |
Table P6.10 was created to enable the manager to match clients with consultants. The objective is to match a client within a given region with a consultant in that region, and to make sure that the client’s need for specific consulting services is properly matched to the consultant’s expertise. For example, if the client needs help with database design and is located in the Southeast, the objective is to make a match with a consultant who is located in the Southeast and whose expertise is in database design. (Although the consulting company manager tries to match consultant and client locations to minimize travel expense, it is not always possible to do so.) The following basic business rules are maintained:
• Each client is located in one region.
• A region can contain many clients.
• Each consultant can work on many contracts.
• Each contract might require the services of many consultants.
• A client can sign more than one contract, but each contract is signed by only one client.
• Each contract might cover multiple consulting classifications. (For example, a contract may list consulting services in database and networking.)
• Each consultant is located in one region.
• A region can contain many consultants.
• Each consultant has one or more areas of expertise (class). For example, a consultant might be classified as an expert in both database design and networking.
• Each area of expertise (class) can have many consultants in it. For example, the consulting company might employ many consultants who are networking experts.
a. Given that brief description of the requirements and the business rules, write the relational schema and draw the dependency diagram for the preceding (and very poor) table structure. Label all transitive and/or partial dependencies.
Here is a perfect illustration of the value of business rules. If the business rules had not been available, the sample record would produce ambiguities. For example, if you only look at the sample data in the one available record, defining the relationships between client, contract, date, consultant, and expertise would have been difficult, at best. The business rules augment the original data, and their use removes the ambiguities. The business rules help establish that a client can sign more than one contract, so you need more than the client number to identify the remaining attributes. Clearly, another client can sign a contract on the same date, so the CLIENT_NUM is not the determinant of the date. Also, the same client can sign multiple contracts on the same date or on different dates, using the same set of consultants for each contract or a different set of consultants for each contract. Remember also that the consultants have more than one area of expertise, so the same consultant may work on different contracts for the same client or for different clients. Given the combination of the business rules and the sample record in the original problem (or given the use of the two records provided in the first part of this discussion), the dependencies show up in Figure P6.10a.
Figure P6.10a The ConsultCo Dependency Diagram
(Figure content: the CONTRACT attribute serves as the key; the diagram includes CLIENT_NUM, CLIENT_NAME, DATE, CLASS_1 through CLASS_4, REGION, and CONS_NUM_1/CONS_NAME_1 through CONS_NUM_4/CONS_NAME_4, with transitive dependencies from CLIENT_NUM and from each CONS_NUM attribute to the corresponding name and REGION values. Note: the REGION attribute has been duplicated to show all of the dependencies in a single diagram.)
The relational schema is written as follows:
CONTRACT(CLIENT_NUM, CLIENT_NAME, DATE, CONTRACT, CLASS_1, CLASS_2, CLASS_3, CLASS_4, REGION, CONS_NUM_1, CONS_NAME_1, CONS_NUM_2, CONS_NAME_2, CONS_NUM_3, CONS_NAME_3, CONS_NUM_4, CONS_NAME_4)
Or, if you prefer that the PK be the first listed attribute, you can write the relational schema this way: CONTRACT(CONTRACT, CLIENT_NUM, CLIENT_NAME, DATE, CLASS_1, CLASS_2, CLASS_3, CLASS_4, REGION, CONS_NUM_1, CONS_NAME_1, CONS_NUM_2, CONS_NAME_2, CONS_NUM_3, CONS_NAME_3, CONS_NUM_4, CONS_NAME_4) In any case, remind your students that the order in which the attributes are listed is immaterial in a relational database environment. b. Break up the dependency diagram you drew in Problem 10a to produce dependency diagrams that are in 3NF and write the relational schema. (Hint: You might have to create a few new attributes. Also make sure that the new dependency diagrams contain attributes that meet proper design criteria; that is, make sure that there are no multivalued attributes, that the naming conventions are met, and so on.) To complete the structures, we have added the REGION_NAME and we have modified the attribute names to make them conform to our naming conventions. Although the normalization procedure has left us with the 3NF system shown in Figure P6.10b, it is not possible to see that some of the relationships between the entities are of the M:N variety. (It would be appropriate to point out that the multivalued attributes encountered in Problem 10's sample values are probably best handled through the use of composite entities. Similarly, the M:N relationship between contract and consultant would have to be handled through a composite entity, perhaps named ASSIGNMENT, to indicate the assignment of consultants to contracts. We will resolve those issues in the answers to subsequent problems.) Here is yet another indication that normalization, while very useful as a tool to help eliminate data redundancies, is incapable of serving as the sole source of good database design.
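The 3NF structures just described can also be expressed as table declarations with explicit foreign keys. The following is a minimal sqlite3 sketch (not part of the original solution); the data types are illustrative assumptions, and the sample row values are drawn from Table P6.10.

```python
import sqlite3

# A sketch of how the 3NF REGION/CLIENT/CONTRACT structures might be
# declared, with foreign keys making the relationships explicit.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
conn.executescript("""
    CREATE TABLE REGION (
        REGION_CODE TEXT PRIMARY KEY,
        REGION_NAME TEXT
    );
    CREATE TABLE CLIENT (
        CLIENT_NUM  INTEGER PRIMARY KEY,
        CLIENT_NAME TEXT,
        REGION_CODE TEXT REFERENCES REGION (REGION_CODE)
    );
    CREATE TABLE CONTRACT (
        CONTR_NUM   INTEGER PRIMARY KEY,
        CLIENT_NUM  INTEGER REFERENCES CLIENT (CLIENT_NUM),
        CONTR_DATE  TEXT,
        REGION_CODE TEXT REFERENCES REGION (REGION_CODE)
    );
""")
conn.execute("INSERT INTO REGION VALUES ('SE', 'Southeast')")
conn.execute("INSERT INTO CLIENT VALUES (289, 'James D. Smith', 'SE')")
conn.execute("INSERT INTO CONTRACT VALUES (5842, 289, '2018-02-15', 'SE')")

# The FK constraint rejects a contract for a nonexistent client.
try:
    conn.execute("INSERT INTO CONTRACT VALUES (5999, 999, '2018-03-01', 'SE')")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
print(fk_enforced)  # True
```

The foreign keys capture the 1:M relationships; as the text notes, the M:N relationships still require composite entities, which the ERD work in Problem 10c supplies.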
Figure P6.10b The ConsultCo Dependency Diagrams in 3NF
(Figure content:
CLIENT: CLIENT_NUM, CLIENT_NAME, REGION_CODE
CLASS: CLASS_CODE, CLASS_DESCRIPTION
CONTRACT: CONTR_NUM, CLIENT_NUM, CONTR_DATE, REGION_CODE
CONSULTANT: CONS_NUM, CONS_NAME, REGION_CODE
REGION: REGION_CODE, REGION_NAME)
Note: At this point, the entities are properly identified and the attributes are all accounted for. However, several of these entities are related to each other through M:N relationships. Normalization does not provide information about the nature of the relationships, thus illustrating the need for combining normalization and ER modeling techniques.
The relational schemas are written as follows:
CLIENT(CLIENT_NUM, CLIENT_NAME, REGION_CODE)
CLASS(CLASS_CODE, CLASS_DESCRIPTION)
CONTRACT(CONTR_NUM, CLIENT_NUM, CONTR_DATE, REGION_CODE)
CONSULTANT(CONS_NUM, CONS_NAME, REGION_CODE)
REGION(REGION_CODE, REGION_NAME)
Keep in mind that the preceding dependency diagrams and relational schemas do not (yet) define a practical design. For example, processing requirements usually dictate that the attributes be made more atomic. (Printing mailing labels and creating mailing lists and phone directories would mandate the decomposition of CLIENT_NAME into CLIENT_FNAME, CLIENT_LNAME, and CLIENT_INITIAL. The CONS_NAME must be similarly decomposed.) Also, remember that this simple system lacks many important entities and attributes. For instance, at this point there's no way to contact the clients, nor can clients contact the consultants. Clearly, we ought to add addresses and phone numbers. However, we have added some crucial relationships to enable us to track billing charges by class and to track billable hours by class, by consultant, and by client. (Note also that the ASSIGN_CHG_HOUR is written into the ASSIGNMENT table by the applications software from the CLASS table to ensure the historical accuracy of the charges. If the CLASS_CHG_HOUR changes, we must preserve the original charge per hour that was in effect when the assignment charge was made.) You can let your students use database software such as Microsoft Access to implement this system. Naturally, you can add tables and attributes to enable the system to handle invoicing and reporting of consulting activities by consultant, by type, by client, and so on. We have added a few of the appropriate entities and attributes in the answer to Problem 10c.
c. Using the results of Problem 10b, draw the Crow’s Foot ERD.
The Crow’s Foot ERD is shown in Figure P6.10c.
Figure P6.10c The ConsultCo ERD for Problem 10c
The addition of the ASSIGNMENT entity addresses the problem of keeping track of billable hours and charges by consultant, and the addition of the SKILL entity enables the end user to track all consultant qualifications. Whether or not optionalities are included in the ERD depends on the business rules and on the operational requirements. For example, you can infer from Figure P6.10c that the ASSIGNMENT entity does not necessarily contain a given CLASS code. (Perhaps there is a “customer support” classification that may not have been used yet.) Similarly, you can infer that a given CONTRACT number has not (yet) been used in the ASSIGNMENT entity. (It is again worth emphasizing that many optionalities exist for operational reasons. That’s why the optionality is often used as the default condition. In any case, the database designer is obligated to develop precise business rules to make sure that the data environment is properly reflected in the design.)
11. Given the sample records in the CHARTER table shown in Table P6.11, do the following:
Table P6.11 Sample CHARTER Records

Attribute Name | Sample Value | Sample Value | Sample Value | Sample Value
CHAR_TRIP | 10232 | 10233 | 10234 | 10235
CHAR_DATE | 15-Jan-2018 | 15-Jan-2018 | 16-Jan-2018 | 17-Jan-2018
CHAR_CITY | STL | MIA | TYS | ATL
CHAR_MILES | 580 | 1,290 | 524 | 768
CUST_NUM | 784 | 231 | 544 | 784
CUST_LNAME | Brown | Hanson | Bryana | Brown
CHAR_PAX | 5 | 12 | 2 | 5
CHAR_CARGO | 235 lbs. | 18,940 lbs. | 348 lbs. | 155 lbs.
PILOT | Melton | Chen | Henderson | Melton
COPILOT | | Henderson | Melton |
FLT_ENGINEER | | O’Shaski | |
LOAD_MASTER | | Benkasi | |
AC_NUMBER | 1234Q | 3456Y | 2256W | 1234Q
MODEL_CODE | PA31-350 | CV-580 | PA31-350 | PA31-350
MODEL_SEATS | 10 | 38 | 10 | 10
MODEL_CHG_MILE | $2.79 | $23.36 | $2.79 | $2.79
a. Write the relational schema and draw the dependency diagram for the table structure. Make sure that you label all dependencies. CHAR_PAX indicates the number of passengers carried. The CHAR_MILES entry is based on round-trip miles, including pickup points. (Hint: Look at the data values to determine the nature of the relationships. For example, note that employee Melton has flown two charter trips as pilot and one trip as copilot.)
The dependency diagram is shown in Figure P6.11a.
Figure P6.11a The Dependency Diagram for Problem 11a
(Figure content: CHAR_TRIP is the key; the diagram includes CHAR_DATE, CHAR_CITY, CHAR_MILES, CUST_NUM, CUST_LNAME, CHAR_PAX, CHAR_CARGO, PILOT, COPILOT, FLT_ENGINEER, LOAD_MASTER, AC_NUMBER, MOD_CODE, MOD_SEATS, and MOD_CHG_MILE, with a transitive dependency from CUST_NUM to CUST_LNAME and transitive dependencies from AC_NUMBER and MOD_CODE to the model attributes.)
The relational schema is written as follows:
CHARTER(CHAR_TRIP, CHAR_DATE, CHAR_CITY, CHAR_MILES, CUST_NUM, CUST_LNAME, CHAR_PAX, CHAR_CARGO, PILOT, COPILOT, FLT_ENGINEER, LOAD_MASTER, AC_NUMBER, MODEL_CODE, MODEL_SEATS, MODEL_CHG_MILE)
b. Decompose the dependency diagram in Problem 11a to create table structures that are all in 3NF and write the relational schema. Make sure that you label all dependencies.
The normalized dependency diagram is shown in Figure P6.11b. (Note the addition of MOD_CODE in the AIRCRAFT table to serve as the AIRCRAFT table’s FK to MODEL.)
Figure P6.11b The Normalized Dependency Diagram for Problem 11b
(Figure content:
CHARTER table: CHAR_TRIP, CHAR_DATE, CHAR_CITY, CHAR_PAX, CHAR_MILES, CUST_NUMBER, PILOT, COPILOT, FLT_ENGINEER, LOAD_MASTER
CUSTOMER table: CUST_NUMBER, CUST_LNAME
AIRCRAFT table: AC_NUM, MOD_CODE
MODEL table: MOD_CODE, MOD_SEATS, MOD_CHG_MILE)
c. Draw the Crow’s Foot ERD to reflect the properly decomposed dependency diagrams you created in Problem 11b. Make sure that the ERD yields a database that can track all of the data shown in Problem 11. Show all entities, relationships, connectivities, optionalities, and cardinalities. The initial Crow’s Foot ERD is shown in Figure P6.11c.
Figure P6.11c The Initial Crow’s Foot ERD for Problem 11c
While the ERD shown in Figure P6.11c faithfully reflects the results generated by the normalization process, it has a major design flaw. This flaw has the following consequences:
• If additional crewmembers such as copilots, loadmasters, and flight engineers are not assigned to the flight, the CHARTER table will include many nulls. (Many of the smaller aircraft that are used in charter flying require only that a pilot and a functioning autopilot be used. In fact, the Federal Air Regulations (FARs) that govern charter aviation permit single-pilot operations for aircraft that have less than a 12,500-lb. gross takeoff weight and that are not turbine-powered.)
• The inclusion of COPILOT, FLT_ENGINEER, and LOAD_MASTER also produces synonyms in the CHARTER table.
• As the aircraft used in the charter flights become larger and more complex, crews become larger, thus producing more synonyms and more potential nulls. (Not to mention that the CHARTER table will have to be modified to accept additional crew members such as flight attendants.)
The problems associated with the ERD shown in Figure P6.11c are eliminated through the composite entity named CREW in Figure P6.11d. Note that this modification makes it possible to assign any number of crewmembers. To ensure that the crewmembers are properly qualified, a job attribute can be added to the EMPLOYEE entity, and the applications software can then assign crewmembers based on job classifications such as pilot, loadmaster, flight attendant, etc. Because only some employees are qualified as crewmembers, CREW is optional to EMPLOYEE. But each crewmember must be an employee, so EMPLOYEE is mandatory to CREW.
Figure P6.11d The Final Crow’s Foot ERD for Problem 11c
Note that the application shown in Figure P6.11e, based on the design shown in Figure P6.11d, enables the end user to input only those crew members that are required for the charter flight. (In this case, only two crew members are required, but the design permits the addition of many more crew members without making structural changes in the database tables. Such flexibility is the essence of good design.)
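The flexibility of the CREW composite entity can be sketched directly. The following is a minimal sqlite3 illustration (not part of the original solution); the CREW_JOB attribute, employee numbers, and job labels are illustrative assumptions, while the trip numbers and crew names come from Table P6.11.

```python
import sqlite3

# Sketch of the CREW composite entity: each charter can carry any number
# of crew rows, so no nulls and no structural changes are needed when a
# larger aircraft requires a bigger crew.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (
        EMP_NUM  INTEGER PRIMARY KEY,
        EMP_NAME TEXT
    );
    CREATE TABLE CHARTER (
        CHAR_TRIP INTEGER PRIMARY KEY,
        CHAR_CITY TEXT
    );
    CREATE TABLE CREW (
        CHAR_TRIP INTEGER REFERENCES CHARTER,
        EMP_NUM   INTEGER REFERENCES EMPLOYEE,
        CREW_JOB  TEXT,
        PRIMARY KEY (CHAR_TRIP, EMP_NUM)
    );
""")
conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?)",
                 [(1, 'Melton'), (2, 'Henderson')])
conn.executemany("INSERT INTO CHARTER VALUES (?, ?)",
                 [(10232, 'STL'), (10234, 'TYS')])
# Trip 10232 needs only a pilot; trip 10234 carries a pilot and a copilot.
conn.executemany("INSERT INTO CREW VALUES (?, ?, ?)",
                 [(10232, 1, 'Pilot'),
                  (10234, 2, 'Pilot'), (10234, 1, 'Copilot')])

crew_sizes = dict(conn.execute(
    "SELECT CHAR_TRIP, COUNT(*) FROM CREW GROUP BY CHAR_TRIP"))
print(crew_sizes)  # {10232: 1, 10234: 2}
```

Adding a flight attendant to a trip is just one more CREW row; the CHARTER structure never changes.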
Figure P6.11e Sample Charter Record
Chapter 7 An Introduction to Structured Query Language (SQL)
Chapter 7 Introduction to Structured Query Language (SQL)
NOTE Several points are worth emphasizing:
• We have provided the SQL scripts for both Chapters 7 and 8. These scripts are intended to facilitate the flow of the material presented to the class. However, given the comments made by our students, the scripts should not replace the manual typing of the SQL commands by students. Some students learn SQL better when they have a chance to type their own commands and get the feedback provided by their errors. We recommend that the students use their lab time to practice the commands manually.
• Because this chapter focuses on learning SQL, we recommend that if you use Microsoft Access, you use the Microsoft Access SQL window to type SQL queries. Using this approach, you will be able to demonstrate the interoperability of standard SQL. For example, you can cut and paste the same SQL command from the SQL query window in Microsoft Access to Oracle SQL*Plus and to MS SQL Query Analyzer. This approach achieves two objectives:
➢ It demonstrates that adhering to the SQL standard means that most of the SQL code will be portable among DBMSes.
➢ It also demonstrates that even a widely accepted SQL standard is sometimes implemented with slight distinctions by different vendors. For example, the treatment of date formats in Microsoft Access and Oracle is slightly different.
• Chapter 7 is all about SELECT queries to retrieve data. We chose to start with SELECT queries because simple SELECT queries are conceptually easy to understand, which gives students a good place to start. Also, most database jobs will require students to work with databases that are already in place. We emphasize to students the importance of learning the data model in which they work. This also provides an opportunity to highlight the importance of good naming conventions when creating the database design.
Students can see how helpful it is to have a proper naming convention for attributes within an entity, the importance of having the name of the foreign key reflect the table from which it originates, and the benefits of descriptive entity and attribute names.
Answers to Review Questions 1. Explain why it would be preferable to use a DATE data type to store date data instead of a character data type. The DATE data type uses numeric values based on the Julian calendar to store dates. This makes date arithmetic such as adding and subtracting days or fractions of days possible (as well as numerous special date-oriented functions discussed in the next chapter!).
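The date-arithmetic point can be demonstrated concretely. This is a small sqlite3 sketch (not part of the original answer); it uses SQLite's julianday() function, while other DBMSs expose equivalent functionality (for example, direct DATE subtraction in Oracle). The two dates come from the Table P6.11 sample data.

```python
import sqlite3

# Because dates are stored internally as numbers, subtracting two dates
# yields the number of days (including fractional days) between them.
conn = sqlite3.connect(":memory:")
days = conn.execute(
    "SELECT julianday('2018-01-17') - julianday('2018-01-15')").fetchone()[0]
print(days)  # 2.0
```

A character data type would support no such arithmetic without explicit conversion, and would also sort incorrectly for formats such as 15-Jan-2018.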
2. Explain why the following command would create an error, and what changes could be made to fix the error.
SELECT V_CODE, SUM(P_QOH) FROM PRODUCT;
The command would generate an error because an aggregate function is applied to the P_QOH attribute, but V_CODE is neither in an aggregate function nor in a GROUP BY clause. This can be fixed by 1) placing V_CODE in an appropriate aggregate function based on the data that is being requested by the user, 2) adding a GROUP BY clause to group by values of V_CODE (i.e., GROUP BY V_CODE), 3) removing the V_CODE attribute from the SELECT clause, or 4) removing the SUM aggregate function from P_QOH. Which of these solutions is most appropriate depends on the question that the query was intended to answer.
3. What is a CROSS JOIN? Give an example of its syntax.
A CROSS JOIN is identical to the PRODUCT relational operator. The CROSS JOIN is also known as the Cartesian product of two tables. For example, if you have two tables, AGENT, with 10 rows, and CUSTOMER, with 21 rows, the CROSS JOIN resulting set will have 210 rows and will include all of the columns from both tables. Syntax examples are:
SELECT * FROM CUSTOMER CROSS JOIN AGENT;
or
SELECT * FROM CUSTOMER, AGENT;
If you do not specify a join condition when joining tables, the result will be a CROSS JOIN or PRODUCT operation.
4. What three join types are included in the OUTER JOIN classification?
An OUTER JOIN is a type of JOIN operation that yields all rows with matching values in the join columns as well as all unmatched rows. (Unmatched rows are those without matching values in the join columns.) The SQL standard prescribes three different types of join operations:
LEFT [OUTER] JOIN
RIGHT [OUTER] JOIN
FULL [OUTER] JOIN
The LEFT [OUTER] JOIN will yield all rows with matching values in the join columns, plus all of the unmatched rows from the left table.
(The left table is the first table named in the FROM clause.) The RIGHT [OUTER] JOIN will yield all rows with matching values in the join columns, plus all of the unmatched rows from the right table. (The right table is the second table named in the FROM clause.)
The FULL [OUTER] JOIN will yield all rows with matching values in the join columns, plus all the unmatched rows from both tables named in the FROM clause.
5. Using tables named T1 and T2, write a query example for each of the three join types you described in Question 4. Assume that T1 and T2 share a common column named C1.
LEFT OUTER JOIN example:
SELECT * FROM T1 LEFT OUTER JOIN T2 ON T1.C1 = T2.C1;
RIGHT OUTER JOIN example:
SELECT * FROM T1 RIGHT OUTER JOIN T2 ON T1.C1 = T2.C1;
FULL OUTER JOIN example:
SELECT * FROM T1 FULL OUTER JOIN T2 ON T1.C1 = T2.C1;
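The LEFT OUTER JOIN example above can be made runnable in sqlite3 (this demonstration is not part of the original answer). The T1/T2 contents and the extra T1_VAL/T2_VAL columns are illustrative assumptions; the point is how the unmatched left-side row is padded with NULL.

```python
import sqlite3

# T1 row (3, 'c') has no C1 match in T2, so the LEFT OUTER JOIN pads the
# T2 side of that row with NULL (surfaced as None in Python).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (C1 INTEGER, T1_VAL TEXT);
    CREATE TABLE T2 (C1 INTEGER, T2_VAL TEXT);
    INSERT INTO T1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO T2 VALUES (1, 'x'), (2, 'y'), (4, 'z');
""")
rows = conn.execute("""
    SELECT T1.C1, T1.T1_VAL, T2.T2_VAL
    FROM T1 LEFT OUTER JOIN T2 ON T1.C1 = T2.C1
    ORDER BY T1.C1""").fetchall()
print(rows)  # [(1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', None)]
```

A RIGHT OUTER JOIN would instead keep the unmatched T2 row (4, 'z'), and a FULL OUTER JOIN would keep both (note that RIGHT and FULL joins require SQLite 3.39 or later).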
6. What is a recursive join? A recursive join is a join in which a table is joined to itself.
7. Rewrite the following WHERE clause without the use of the IN special operator.
WHERE V_STATE IN (‘TN’, ‘FL’, ‘GA’)
WHERE V_STATE = 'TN' OR V_STATE = 'FL' OR V_STATE = 'GA'
Notice that each criterion must be complete (i.e., attribute-operator-value).
8. Explain the difference between an ORDER BY clause and a GROUP BY clause.
An ORDER BY clause has no impact on which rows are returned by the query; it simply sorts those rows into the specified order. A GROUP BY clause does impact the rows that are returned by the query. A GROUP BY clause gathers rows into collections that can be acted on by aggregate functions.
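The ORDER BY/GROUP BY contrast (and the GROUP BY fix suggested for Question 2) can be shown side by side. This sqlite3 sketch is not part of the original answer; the three-row PRODUCT fragment and its values are illustrative assumptions.

```python
import sqlite3

# ORDER BY returns every row (merely sorted); GROUP BY collapses the rows
# into one summary row per V_CODE, which is also what makes the
# "SELECT V_CODE, SUM(P_QOH)" query from Question 2 legal.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PRODUCT (P_CODE TEXT, V_CODE INTEGER, P_QOH INTEGER);
    INSERT INTO PRODUCT VALUES ('A1', 21, 8), ('A2', 21, 5), ('B1', 25, 12);
""")
ordered = conn.execute(
    "SELECT P_CODE, V_CODE FROM PRODUCT ORDER BY V_CODE, P_CODE").fetchall()
grouped = conn.execute(
    "SELECT V_CODE, SUM(P_QOH) FROM PRODUCT GROUP BY V_CODE "
    "ORDER BY V_CODE").fetchall()
print(len(ordered))  # 3 -- all rows survive an ORDER BY
print(grouped)       # [(21, 13), (25, 12)] -- one row per group
```

All three rows survive the ORDER BY, while the GROUP BY query returns only one summary row per vendor.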
9. Explain why the two following commands produce different results.
SELECT DISTINCT COUNT (V_CODE) FROM PRODUCT;
SELECT COUNT (DISTINCT V_CODE) FROM PRODUCT;
The difference is in the order of operations. The first command executes the COUNT function to count the number of values in V_CODE (say the count returns "14", for example), including duplicate values, and then the DISTINCT keyword only allows one count of that value to be displayed (only one row with the value "14" appears as the result). The second command applies the DISTINCT keyword to the V_CODE values before the count is taken, so only unique values are counted.
10. What is the difference between the COUNT aggregate function and the SUM aggregate function?
COUNT returns the number of values without regard to what the values are. SUM adds the values together and can only be applied to numeric values.
11. In a SELECT query, what is the difference between a WHERE clause and a HAVING clause?
Both a WHERE clause and a HAVING clause can be used to eliminate rows from the results of a query. The differences are 1) the WHERE clause eliminates rows before any grouping for aggregate functions occurs, while the HAVING clause eliminates groups after the grouping has been done, and 2) the WHERE clause cannot contain an aggregate function, but the HAVING clause can.
12. What is a subquery, and what are its basic characteristics?
A subquery is a query (expressed as a SELECT statement) that is located inside another query. The first SQL statement is known as the outer query; the second is known as the inner query or subquery. The inner query or subquery is normally executed first. The output of the inner query is used as the input for the outer query. A subquery is normally expressed inside parentheses and can return zero, one, or more rows, and each row can have one or more columns.
A subquery can appear in many places in a SQL statement:
• as part of a FROM clause,
• to the right of a WHERE conditional expression,
• to the right of the IN clause,
• in an EXISTS operator,
• to the right of a HAVING clause conditional operator,
• in the attribute list of a SELECT clause.
Examples of subqueries are:
INSERT INTO PRODUCT
SELECT * FROM P;

DELETE FROM PRODUCT
WHERE V_CODE IN (SELECT V_CODE FROM VENDOR WHERE V_AREACODE = '615');

SELECT V_CODE, V_NAME
FROM   VENDOR
WHERE  V_CODE NOT IN (SELECT V_CODE FROM PRODUCT);
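The WHERE-versus-HAVING distinction from Question 11 can be sketched in one query; the P_PRICE column and the threshold values are illustrative assumptions:

```sql
-- WHERE filters individual rows before grouping;
-- HAVING filters the groups after aggregation.
SELECT   V_CODE, COUNT(*) AS NUM_PRODUCTS
FROM     PRODUCT
WHERE    P_PRICE > 10.00    -- row-level test; no aggregate functions allowed
GROUP BY V_CODE
HAVING   COUNT(*) >= 3;     -- group-level test; aggregate functions allowed
```

Only products priced above 10.00 enter the groups, and only vendors with at least three such products survive the HAVING test.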
13. What are the three types of results a subquery can return?
A subquery can return 1) a single value (one row, one column), 2) a list of values (many rows, one column), or 3) a virtual table (many rows, many columns).
14. What is a correlated subquery? Give an example.
A correlated subquery is a subquery that executes once for each row in the outer query. This process is similar to the typical nested loop in a programming language. Contrast this type of subquery with the typical subquery, which executes the innermost subquery first and works outward until the last outer query is executed. That is, the typical subquery executes in serial order, one after another, starting with the innermost subquery. In contrast, a correlated subquery runs the outer query first and then runs the inner subquery once for each row returned by the outer query. For example, the following subquery lists all the product line sales in which the "units sold" value is greater than the "average units sold" value for that product (as opposed to the average for all products):
SELECT INV_NUMBER, P_CODE, LINE_UNITS
FROM   LINE LS
WHERE  LS.LINE_UNITS > (SELECT AVG(LINE_UNITS)
                        FROM   LINE LA
                        WHERE  LA.P_CODE = LS.P_CODE);
The previous nested query executes the inner subquery once for each row returned by the outer query, computing the average units sold for that row's product code. 15. Explain the difference between a regular subquery and a correlated subquery. A regular, or uncorrelated, subquery executes before the outer query. It executes only once, and the result is held for use by the outer query. A correlated subquery relies in part on the outer query, usually through a WHERE condition in the subquery that references an attribute in the outer query. Therefore, a correlated subquery executes once for each row evaluated by the outer query, and it can potentially produce a different result for each row in the outer query. 16. What does it mean to say that SQL operators are set-oriented? The description of SQL operators as set-oriented means that the commands work over entire tables at a time, not row by row. 17. The relational set operators UNION, INTERSECT, and MINUS work properly only if the relations are union-compatible. What does union-compatible mean, and how would you check for this condition?
Union-compatible means that the relations yield attributes with identical names and compatible data types. That is, relation A(c1,c2,c3) and relation B(c1,c2,c3) have union compatibility if both relations have the same number of attributes and corresponding attributes in the relations have "compatible" data types. Compatible data types do not require that the attributes be exactly identical, only that they are comparable. For example, VARCHAR(15) and CHAR(15) are comparable, as are NUMBER(3,0) and INTEGER, and so on. Note that this is a practical definition of union-compatibility, which is different from the theoretical definition discussed in Chapter 3. From a theoretical perspective, corresponding attributes must have the same domain. However, the DBMS does not understand the meaning of the business domain, so it must work with a more concrete understanding of the data in the corresponding columns. Thus, it considers only the data types.
18. What is the difference between UNION and UNION ALL? Write the syntax for each.
UNION yields unique rows; in other words, UNION eliminates duplicate rows. On the other hand, a UNION ALL operator yields all rows of both relations, including duplicates. Notice that for two rows to be duplicates, they must have the same values in all columns. To illustrate the difference between UNION and UNION ALL, assume two relations: A(ID, Name) with rows (1, Lake), (2, River), and (3, Ocean), and B(ID, Name) with rows (1, River), (2, Lake), and (3, Ocean). Given this description, SELECT * FROM A UNION SELECT * FROM B will yield:
ID  Name
1   Lake
2   River
3   Ocean
1   River
2   Lake
while SELECT * FROM A UNION ALL SELECT * FROM B will yield:
ID  Name
1   Lake
2   River
3   Ocean
1   River
2   Lake
3   Ocean
19. Suppose that you have two tables, EMPLOYEE and EMPLOYEE_1. The EMPLOYEE table contains the records for three employees: Alice Cordoza, John Cretchakov, and Anne McDonald. The EMPLOYEE_1 table contains the records for employees John Cretchakov and Mary Chen. Given that information, what is the query output for the UNION query? (List the query output.)
The query output will be:
Alice Cordoza
John Cretchakov
Anne McDonald
Mary Chen
20. Given the employee information in Question 19, what is the query output for the UNION ALL query? (List the query output.)
The query output will be:
Alice Cordoza
John Cretchakov
Anne McDonald
John Cretchakov
Mary Chen
21. Given the employee information in Question 19, what is the query output for the INTERSECT query? (List the query output.)
The query output will be:
John Cretchakov
22. Given the employee information in Question 19, what is the query output for the MINUS query? (List the query output.)
This question can yield two different answers. If you use
SELECT * FROM EMPLOYEE
MINUS
SELECT * FROM EMPLOYEE_1
the answer is
Alice Cordoza
Anne McDonald
If you use
SELECT * FROM EMPLOYEE_1
MINUS
SELECT * FROM EMPLOYEE
the answer is
Mary Chen
23. Suppose that a PRODUCT table contains two attributes, PROD_CODE and VEND_CODE. Those two attributes have values of ABC, 125, DEF, 124, GHI, 124, and JKL, 123, respectively. The VENDOR table contains a single attribute, VEND_CODE, with values 123, 124, 125, and 126, respectively. (The VEND_CODE attribute in the PRODUCT table is a foreign key to the VEND_CODE in the VENDOR table.) Given that information, what would be the query output for:
Because the common attribute is VEND_CODE, the output will show only the VEND_CODE values generated by each query.
a. A UNION query based on these two tables?
125, 124, 123, 126
b. A UNION ALL query based on these two tables?
125, 124, 124, 123, 123, 124, 125, 126
c. An INTERSECT query based on these two tables?
123, 124, 125
d. A MINUS query based on these two tables?
If you use PRODUCT MINUS VENDOR, the output will be empty (no rows are returned).
If you use VENDOR MINUS PRODUCT, the output will be 126.
24. Why does the order of the operands (tables) matter in a MINUS query but not in a UNION query?
MINUS queries are analogous to algebraic subtraction: the result contains the values that exist in the first operand but not in the second operand. UNION queries are analogous to algebraic addition: the result is a combination of the two operands. (These analogies are not perfect, obviously, but they are helpful when learning the basics.) Addition and UNION have the commutative property (a + b = b + a), while subtraction and MINUS do not (a – b ≠ b – a).
25. What MS Access/SQL Server function should you use to calculate the number of days between the current date and January 25, 1999?
SELECT DATE() - #25-JAN-1999#
NOTE: In MS Access, you do not need to specify a FROM clause for this type of query.
26. What Oracle function should you use to calculate the number of days between your birthday and the current date?
The SYSDATE keyword can be used to retrieve the current date from the server. By subtracting your birth date from the current date, using date arithmetic, the number of days will be returned. Note that in Oracle, the SQL statement requires the use of the FROM clause. In this case, you may use the DUAL table. (The DUAL table is a dummy "virtual" table provided by Oracle for this type of query. The table contains only one row and one column, so queries against it return just one value.)
27. What string function should you use to list the first three characters of a company's EMP_LNAME values? Give an example, using a table named EMPLOYEE.
In Oracle, you use the SUBSTR function as illustrated next:
SELECT SUBSTR(EMP_LNAME, 1, 3) FROM EMPLOYEE;
In SQL Server, you use the SUBSTRING function as shown:
SELECT SUBSTRING(EMP_LNAME, 1, 3) FROM EMPLOYEE;
28. What two things must a SQL programmer understand before beginning to craft a SELECT query?
Before crafting a SELECT query, the SQL programmer must understand 1) the data model in which the query will operate, and 2) the problem being solved. Data models are often complex to the point that knowing what data is available, the meaning of that data, and how to transform the data to produce the desired results will require the programmer to become very familiar with the data model before the query can be created. Problem statements that seem clear to users can often be interpreted
in many ways, so it is important for the programmer to understand exactly what the user is requesting.
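Question 26's Oracle date arithmetic can be sketched as follows; the birth date shown and the DD-MON-YYYY format mask are illustrative assumptions:

```sql
-- Oracle: days between a (hypothetical) birth date and today.
-- SYSDATE returns the current date; DUAL supplies the required FROM table.
SELECT SYSDATE - TO_DATE('25-JAN-1999', 'DD-MON-YYYY') AS DAYS_ELAPSED
FROM   DUAL;
```

Because Oracle date arithmetic subtracts two DATE values directly, the result is a number of days (including a fractional part for the time of day).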
Problem Solutions
All of the problems in the Problem section require writing SQL code. Since there are minor differences in the code based on the DBMS used, solutions for all of the problems are provided in separate files for Oracle, MySQL, and Microsoft SQL Server. Solutions for Microsoft Access are provided in .mdb files for each data model used in the problem section. The files are located in the "Teacher" data files that accompany the book, and are named as follows:
Oracle: Ch07_ProblemSolutions_ORA.txt
MySQL: Ch07_ProblemSolutions_MySQL.txt
SQL Server: Ch07_ProblemSolutions_SQL.txt
MS Access: Ch07_ConstructCo.mdb, Ch07_Fact.mdb, Ch07_LargeCo.mdb, Ch07_SaleCo.mdb
Chapter 8 Advanced SQL
NOTE
Several points are worth emphasizing:
• Chapter 8 focuses on creating database structures and manipulating the data in tables. The material covers creating databases and objects within the databases, such as tables, indexes, and views.
• We have provided the SQL scripts for this chapter. These scripts are intended to facilitate the flow of the material presented to the class. However, given the comments made by our students, the scripts should not replace the manual typing of the SQL commands by students. Some students learn SQL better when they have a chance to type their own commands and get the feedback provided by their errors. We recommend that the students use their lab time to practice the commands manually.
• In this chapter, the stored procedures and triggers are executed in the Oracle RDBMS. Unlike SQL, which is standardized, languages for creating stored procedures and triggers are not standardized across different DBMS products. For example, while PL/SQL in Oracle and T-SQL (Transact-SQL) in SQL Server perform similar tasks in roughly similar ways, the syntax and keywords for these languages are very different. This material is presented to help students understand the nature of the tasks performed by these program modules. The programs shown in the text illustrate these concepts. The concepts are common across most DBMS products, even though the actual syntax for the languages differs. Even if instructors do not use Oracle or do not teach the syntax of PL/SQL as presented in the chapter, students still benefit from understanding the need for programs such as the ones presented and the nature of how these programs are implemented.
Answers to Review Questions
1. What type of integrity is enforced when a primary key is declared?
Creating a primary key constraint enforces entity integrity (i.e., no part of the primary key can contain a null, and the primary key values must be unique).
2. Explain why it might be more appropriate to declare an attribute that contains only digits as a character data type instead of a numeric data type.
An attribute that contains only digits may be properly defined as character data when the values are nominal; that is, the values do not have numerical significance but serve only as labels, such as ZIP codes and telephone numbers. One easy test is to consider whether or not a leading zero should be retained. For the ZIP code 03133, the leading zero should be retained; therefore, it is appropriate to define it as character data. For the quantity on hand of 120, we would not expect to retain a leading zero such as 0120; therefore, it is appropriate to define the quantity on hand as a numeric data type.
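The two answers above can be combined into one hedged DDL sketch; the table and column names are illustrative, not taken from the text's sample database:

```sql
-- PRIMARY KEY enforces entity integrity (no nulls, unique values).
-- CUS_ZIP holds only digits, but the digits are nominal labels, so a
-- character type is appropriate (it preserves the leading zero in '03133').
CREATE TABLE CUSTOMER (
    CUS_CODE    NUMBER       PRIMARY KEY,
    CUS_LNAME   VARCHAR2(25) NOT NULL,
    CUS_ZIP     CHAR(5),          -- nominal digits: character data
    CUS_BALANCE NUMBER(9,2)       -- a true quantity: numeric data
);
```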
3. What is the difference between a column constraint and a table constraint?
A column constraint can refer only to the attribute with which it is specified. A table constraint can refer to any attributes in the table.
4. What are "referential constraint actions"?
Referential constraint actions, such as ON DELETE CASCADE, are default actions that the DBMS should take when a DML command would result in a referential integrity constraint violation. Without referential constraint actions, DML commands that would violate referential integrity fail with an error indicating that the referential integrity constraint cannot be violated. Referential constraint actions can allow the DML command to complete successfully while making the designated changes to the related records to maintain referential integrity.
5. What is the purpose of a CHECK constraint?
A CHECK constraint is used to limit the values that can appear in an attribute. It performs the function of enforcing a domain.
6. Explain when an ALTER TABLE command might be needed.
ALTER TABLE is used to modify the structure of an existing table by adding, removing, or modifying column definitions and, in some cases, constraints. Many database structures have long, useful lives in an organization. It is not uncommon for a database to exist in organizational systems for decades. If the existing database structure needs to be modified to accommodate changes in business requirements or the integration of new systems, the existing structure will be modified with ALTER TABLE commands. This preserves the existing data in the table, as opposed to dropping the table and then re-creating it.
7. What is the difference between an INSERT command and an UPDATE command?
The INSERT command is used to add a new row to a table. The UPDATE command changes the values in attributes of an existing row. UPDATE will not increase the number of rows in a table, but INSERT will.
8. What is the difference between using a subquery with a CREATE TABLE command and using a subquery with an INSERT command?
Using a subquery with a CREATE TABLE command is a DDL operation that creates a new database table. The table will be structured to match the structure of the data returned by the subquery, and the data from the subquery will be placed in the table. Therefore, using a subquery with CREATE TABLE both creates the structure and places data inside that structure. Using a subquery with an INSERT command is a DML operation that adds data to an existing table. This operation requires that the target table where the data should be stored already exist. The programmer must ensure that the structure of the data being returned by the subquery is
appropriate, in terms of data types and constraints, for the structure of the table where the results are to be stored.
9. What is the difference between a view and a materialized view?
A view defines a query to retrieve data, but it does not create another copy of the data. Whenever the view is used, the defined query is executed to retrieve the current data from the base tables. A materialized view also defines a query, but it also stores another copy of the data in the materialized view. When the materialized view is used, the data from the secondary copy is returned. A materialized view must be periodically refreshed as the data in the base tables changes over time.
10. What is a sequence? Write its syntax.
A sequence is a special type of object that generates unique numeric values in ascending or descending order. You can use a sequence to assign values to a primary key field in a table. A sequence provides functionality similar to the Autonumber data type in MS Access. For example, both sequences and Autonumber data types provide unique ascending or descending values. However, there are some subtle differences between the two:
• In MS Access, an Autonumber is a data type; in Oracle, a sequence is a completely independent object rather than a data type.
• In MS Access, you can have only one Autonumber per table; in Oracle, you can have as many sequences as you want, and they are not tied to any particular table.
• In MS Access, the Autonumber data type is tied to a field in a table; in Oracle, the sequence-generated value is not tied to any field in any table and can, therefore, be used on any attribute in any table.
The syntax used to create a sequence is:
CREATE SEQUENCE CUS_NUM_SEQ START WITH 100 INCREMENT BY 10 NOCACHE;
11. What is a trigger, and what is its purpose? Give an example.
A trigger is a block of PL/SQL code that is automatically invoked by the DBMS upon the occurrence of a data manipulation event (INSERT, UPDATE, or DELETE).
Triggers are always associated with a table and are invoked before or after a data row is inserted, updated, or deleted. Any table can have one or more triggers. Triggers provide a method of enforcing business rules such as:
• A customer making a credit purchase must have an active account.
• A student taking a class with a prerequisite must have completed that prerequisite with a B grade.
• To be scheduled for a flight, a pilot must have a valid medical certificate and a valid training completion record.
Triggers are also excellent for enforcing data constraints that cannot be directly enforced by the data model. For example, suppose that you must enforce the following business rule: if the quantity on hand of a product falls below the minimum quantity, the P_REORDER attribute must be automatically set to 1. To enforce this business rule, you can create the following TRG_PRODUCT_REORDER trigger:
CREATE OR REPLACE TRIGGER TRG_PRODUCT_REORDER
BEFORE INSERT OR UPDATE OF P_ONHAND, P_MIN ON PRODUCT
FOR EACH ROW
BEGIN
  IF :NEW.P_ONHAND <= :NEW.P_MIN THEN
    :NEW.P_REORDER := 1;
  ELSE
    :NEW.P_REORDER := 0;
  END IF;
END;
CREATE OR REPLACE PROCEDURE PRC_LINE_ADD
  (W_LN IN NUMBER, W_P_CODE IN VARCHAR2, W_LU NUMBER)
AS
  W_LP NUMBER := 0.00;
BEGIN
  -- GET THE PRODUCT PRICE
  SELECT P_PRICE INTO W_LP
  FROM   PRODUCT
  WHERE  P_CODE = W_P_CODE;
  -- ADDS THE NEW LINE ROW
  INSERT INTO LINE
  VALUES (INV_NUMBER_SEQ.CURRVAL, W_LN, W_P_CODE, W_LU, W_LP);
  DBMS_OUTPUT.PUT_LINE('Invoice line ' || W_LN || ' added');
END;
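Several of the DDL answers above (Questions 3 through 6 and 10) can be tied together in one hedged sketch; the table, column, and sequence names are illustrative assumptions, not part of the text's sample database:

```sql
-- Column constraints (NOT NULL, CHECK) versus a table constraint that
-- carries a referential constraint action (Questions 3 through 5):
CREATE TABLE INVOICE (
    INV_NUMBER NUMBER      PRIMARY KEY,
    CUS_CODE   NUMBER      NOT NULL,                 -- column constraint
    INV_TOTAL  NUMBER(9,2) CHECK (INV_TOTAL >= 0),   -- CHECK enforces a domain
    CONSTRAINT FK_INV_CUS FOREIGN KEY (CUS_CODE)     -- table constraint
        REFERENCES CUSTOMER (CUS_CODE) ON DELETE CASCADE
);

-- ALTER TABLE modifies the structure while preserving data (Question 6):
ALTER TABLE INVOICE ADD (INV_COMMENT VARCHAR2(50));

-- A sequence supplies primary key values via NEXTVAL (Question 10):
CREATE SEQUENCE INV_NUMBER_SEQ START WITH 1000 INCREMENT BY 1 NOCACHE;
INSERT INTO INVOICE (INV_NUMBER, CUS_CODE, INV_TOTAL)
VALUES (INV_NUMBER_SEQ.NEXTVAL, 10010, 149.99);
```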
Problem Solutions
All of the problems in the Problem section require writing SQL or PL/SQL code. Since there are minor differences in the code based on the DBMS used, solutions for problems are provided in separate files for Oracle, MySQL, and Microsoft SQL Server. Solutions for Microsoft Access are provided in .mdb files for each data model used in the problem section. A few of the problems do not apply to all DBMS products. For example, MySQL is installed in "autocommit" mode by default; therefore, issuing COMMIT commands is not necessary. On the other hand, Oracle does not use autocommit by default and does require COMMIT commands to make DML command results permanent in the database. Therefore, instructions about issuing commands to make DML changes permanent do not apply to MySQL but are necessary for Oracle. Also, since only PL/SQL is presented in the text for creating stored procedures and triggers, problems that require creating these types of modules are provided only in PL/SQL for Oracle.
The files are in the "Teacher" data files that accompany the book, and are named as follows:
Oracle: Ch08_ProblemSolutions_ORA.txt
MySQL: Ch08_ProblemSolutions_MySQL.txt
SQL Server: Ch08_ProblemSolutions_SQL.txt
MS Access: Ch08_AviaCo.mdb, Ch08_ConstructCo.mdb, Ch08_MovieCo.mdb, Ch08_SaleCo.mdb, Ch08_SimpleCo.mdb
Chapter 9 Database Design
Discussion Focus
What is the relationship between a database and an information system, and how does this relationship have a bearing on database design?
An information system performs three sets of services:
• It provides for data collection, storage, and retrieval.
• It facilitates the transformation of data into information.
• It provides the tools and conditions to manage both data and information.
Basically, a database is a fact (data) repository that serves an information system. If the database is designed poorly, one can hardly expect that the data/information transformation will be successful, nor is it reasonable to expect efficient and capable management of data and information. The transformation of data into information is accomplished through application programs. It is impossible to produce good information from poor data; and, no matter how sophisticated the application programs are, it is impossible to use good application programs to overcome the effects of bad database design. In short: good database design is the foundation of a successful information system.
Database design must yield a database that:
• Does not fall prey to uncontrolled data duplication, thus preventing data anomalies and the attendant lack of data integrity.
• Is efficient in its provision of data access.
• Serves the needs of the information system.
The last point deserves emphasis: even the best-designed database lacks value if it fails to meet information system objectives. In short, good database designers must pay close attention to the information system requirements. Systems design and database design are usually tightly intertwined and are often performed in parallel. Therefore, database and systems designers must cooperate and coordinate to yield the best possible information system.
What is the relationship between the SDLC and the DBLC?
The SDLC traces the history (life cycle) of an information system.
The DBLC traces the history (life cycle) of a database system. Since we know that the database serves the information system, it is not surprising that the two life cycles conform to the same basic phases. Suggestion: Use Figure 9.8 as the basis for a discussion of the parallel activities.
What basic database design strategies exist, and how are such strategies executed?
Suggestion: Use Figure 9.14 as the basis for this discussion.
There are two basic approaches to database design: top-down and bottom-up. Top-down design begins by identifying the different entity types and the definition of each entity's attributes. In other words, top-down design:
• starts by defining the required data sets and then
• defines the data elements for each of those data sets.
Bottom-up design:
• first defines the required attributes and then
• groups the attributes to form entities.
Although the two methodologies tend to be complementary, database designers who deal with small databases with relatively few entities, attributes, and transactions tend to emphasize the bottom-up approach. Database designers who deal with large, complex databases usually find that a primarily top-down design approach is more appropriate. In spite of the frequent arguments concerning the best design approach, perhaps the top-down vs. bottom-up distinction is quite artificial. The text's note is worth repeating:
NOTE Even if a generally top-down approach is selected, the normalization process that revises existing table structures is (inevitably) a bottom-up technique. E-R models constitute a top-down process even if the selection of attributes and entities may be described as bottom-up. Since both the E-R model and normalization techniques form the basis for most designs, the top-down vs. bottom-up debate may be based on a distinction without a difference.
Answers to Review Questions
1. What is an information system? What is its purpose?
An information system is a system that
• provides the conditions for data collection, storage, and retrieval,
• facilitates the transformation of data into information, and
• provides management of both data and information.
An information system is composed of hardware, software (DBMS and applications), the database(s), procedures, and people. Good decisions are generally based on good information. Ultimately, the purpose of an information system is to facilitate good decision making by making relevant and timely information available to the decision makers.
2. How do systems analysis and systems development fit into a discussion about information systems?
Both systems analysis and systems development constitute part of the Systems Development Life Cycle, or SDLC. Systems analysis, phase II of the SDLC, establishes the need for and the extent of an information system by
• establishing end-user requirements,
• evaluating the existing system, and
• developing a logical systems design.
Systems development, based on the detailed systems design found in phase III of the SDLC, yields the information system. The detailed system specifications are established during the systems design phase, in which the designer completes the design of all required system processes.
3. What does the acronym SDLC mean, and what does an SDLC portray?
SDLC is the acronym used to label the Systems Development Life Cycle. The SDLC traces the history of an information system from its inception to its obsolescence. The SDLC is composed of five phases: planning, analysis, detailed systems design, implementation, and maintenance.
4. What does the acronym DBLC mean, and what does a DBLC portray?
DBLC is the acronym used to label the Database Life Cycle. The DBLC traces the history of a database system from its inception to its obsolescence.
Since the database constitutes the core of an information system, the DBLC is concurrent to the SDLC. The DBLC is composed of six phases: initial study, design, implementation and loading, testing and evaluation, operation, and maintenance and evolution.
5. Discuss the distinction between centralized and decentralized conceptual database design.
Centralized and decentralized design constitute variations on the bottom-up and top-down approaches we discussed in the third question presented in the Discussion Focus. Basically, the centralized approach is best suited to relatively small and simple databases that lend themselves well to a bird's-eye view of the entire database. Such databases may be designed by a single person or by a small and informally constituted design team. The company operations and the scope of its problems are sufficiently limited to enable the designer(s) to perform all of the necessary database design tasks:
1. Define the problem(s).
2. Create the conceptual design.
3. Verify the conceptual design with all user views.
4. Define all system processes and data constraints.
5. Assure that the database design will comply with all achievable end-user requirements.
The centralized design procedure thus yields the design summary shown in Figure Q9.5A.
Figure Q9.5A The Centralized Design Procedure
[Diagram: the conceptual model undergoes conceptual model verification against the user views, system processes, and data constraints, with all definitions recorded in the data dictionary.]
Note that the centralized design approach requires the completion and validation of a single conceptual design.
NOTE Use the text’s Figures 9.15 and 9.16 to contrast the two design approaches, then use Figure 9.6 to show the procedure flows; demonstrate that such procedure flows are independent of the degree of centralization.
In contrast, when company operations are spread across multiple operational sites or when the database has multiple entities that are subject to complex relations, the best approach is often based on the decentralized design. Typically, a decentralized design requires that the design task be divided into multiple modules, each one of which is assigned to a design team. The design team activities are coordinated by the lead designer, who must aggregate the design teams' efforts. Since each team focuses on modeling a subset of the system, the definition of boundaries and the interrelation between data subsets must be very precise. Each team creates a conceptual data model corresponding to the subset being modeled. Each conceptual model is then verified individually against the user views, processes, and constraints for each of the modules. After the verification process has been completed, all modules are integrated in one conceptual model. Since the data dictionary describes the characteristics of all the objects within the conceptual data model, it plays a vital role in the integration process. Naturally, after the subsets have been aggregated into a larger conceptual model, the lead designer must verify that the combined conceptual model is still able to support all the required transactions. Thus the decentralized design activities may be summarized as shown in Figure Q9.6B.
Figure Q9.6B The Decentralized Design Procedure
[Diagram: the data component is divided into Subsets A, B, and C; a conceptual model is created for each subset and verified against that subset's views, processes, and constraints; the verified models are then aggregated into the final conceptual model, with the data dictionary supporting every step.]
Keep in mind that the aggregation process requires the lead designer to assemble a single model in which various aggregation problems must be addressed:
• Synonyms and homonyms. Different departments may know the same object by different names (synonyms), or they may use the same name to refer to different objects (homonyms). The object may be an entity, an attribute, or a relationship.
• Entity and entity subclasses. An entity subset may be viewed as a separate entity by one or more departments. The designer must integrate such subclasses into a higher-level entity.
• Conflicting object definitions. Attributes may be recorded as different types (character, numeric), or different domains may be defined for the same attribute. Constraint definitions, too, may vary. The designer must remove such conflicts from the model.

6. What is the minimal data rule in conceptual design? Why is it important?

The minimal data rule specifies that all the data defined in the data model are actually required to fit present and expected future data requirements. This rule may be phrased as "All that is needed is there, and all that is there is needed."
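For class discussion, the aggregation conflicts listed under question 5 (synonyms, homonyms, conflicting object definitions) can be detected mechanically once each team's data dictionary is available as structured data. The Python sketch below is illustrative only — the attribute records and their types are invented, not taken from the text:

```python
# Sketch: flag aggregation conflicts between two design teams' data
# dictionaries. A real data dictionary carries far more metadata
# (domains, constraints, relationships); here each attribute maps to
# a (type, domain) pair.

def find_conflicts(dict_a, dict_b):
    """Return attributes defined by both teams with differing types or
    domains (homonym / conflicting-definition candidates)."""
    conflicts = []
    for name, spec_a in dict_a.items():
        spec_b = dict_b.get(name)
        if spec_b is not None and spec_a != spec_b:
            conflicts.append((name, spec_a, spec_b))
    return conflicts

# Team A recorded EMP_NUM as numeric; Team B recorded it as character
# data -- the lead designer must reconcile the two definitions.
team_a = {"EMP_NUM": ("NUMBER", "1000-9999"),
          "EMP_LNAME": ("VARCHAR(25)", None)}
team_b = {"EMP_NUM": ("CHAR(4)", None),
          "DEPT_CODE": ("CHAR(3)", None)}

print(find_conflicts(team_a, team_b))  # flags the EMP_NUM type conflict
```

Attributes defined by only one team (EMP_LNAME, DEPT_CODE) pass through untouched; only same-name, different-definition pairs are reported for the lead designer's attention.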
7. Discuss the distinction between top-down and bottom-up approaches to database design.

There are two basic approaches to database design: top-down and bottom-up. Top-down design begins by identifying the different entity types and then defining each entity's attributes. In other words, top-down design:
• starts by defining the required data sets and then
• defines the data elements for each of those data sets.
Bottom-up design:
• first defines the required attributes and then
• groups the attributes to form entities.
Although the two methodologies tend to be complementary, database designers who deal with small databases with relatively few entities, attributes, and transactions tend to emphasize the bottom-up approach. Database designers who deal with large, complex databases usually find that a primarily top-down design approach is more appropriate.

8. What are business rules? Why are they important to a database designer?

Business rules are narrative descriptions of the business policies, procedures, or principles that are derived from a detailed description of operations. Business rules are particularly valuable to database designers because they help define:
• Entities
• Attributes
• Relationships (1:1, 1:M, M:N, expressed through connectivities and cardinalities)
• Constraints
To develop an accurate data model, the database designer must have a thorough and complete understanding of the organization's data requirements. The business rules are very important to the designer because they enable the designer to fully understand how the business works and what role data plays within company operations.
NOTE Do keep in mind that an ERD cannot always include all the applicable business rules. For example, although constraints are often crucial, it is not always possible to model them. For instance, there is no way to model a constraint such as "no pilot may be assigned to flight duties more than ten hours during any 24-hour period." It is also worth emphasizing that the description of (company) operations must be done in almost excruciating detail, and it must be verified and re-verified. An inaccurate description of operations yields inaccurate business rules that lead to database designs that are destined to fail.
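It is worth showing students that a constraint like the pilot-hours rule, while not expressible in an ERD, is straightforward to enforce in application code (or a trigger) at transaction time. The Python sketch below is illustrative only — the duty-roster data structure is invented, and the check is simplified to the 24-hour window ending at the new duty's start rather than every possible sliding window:

```python
from datetime import datetime, timedelta

MAX_FLIGHT_HOURS = 10.0  # business rule: per any 24-hour period

def can_assign(assignments, new_start, new_hours):
    """assignments: list of (start_datetime, hours) already scheduled
    for the pilot. Returns True only if adding the new duty keeps the
    total in the 24-hour window ending at new_start within the limit.
    (Simplified: a full check would slide the window.)"""
    window_start = new_start - timedelta(hours=24)
    recent = sum(h for start, h in assignments
                 if window_start <= start <= new_start)
    return recent + new_hours <= MAX_FLIGHT_HOURS

duty = [(datetime(2024, 3, 1, 6, 0), 5.0),
        (datetime(2024, 3, 1, 12, 0), 4.0)]      # 9 hours already flown
print(can_assign(duty, datetime(2024, 3, 1, 20, 0), 2.0))  # False: 11 > 10
print(can_assign(duty, datetime(2024, 3, 1, 20, 0), 1.0))  # True: exactly 10
```

The point for discussion: such rules live in the description of operations and must be captured somewhere — in code, triggers, or procedures — even though the ERD cannot carry them.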
9. What is the data dictionary's function in database design?
A good data dictionary provides a precise description of the characteristics of all the entities and attributes found within the database. The data dictionary thus makes it easier to check for the existence of synonyms and homonyms, to check whether all attributes exist to support required reports, to verify appropriate relationship representations, and so on. The data dictionary's contents are both developed and used during the six DBLC phases:

DATABASE INITIAL STUDY
The basic data dictionary components are developed as the entities and attributes are defined during this phase.

DATABASE DESIGN
The data dictionary contents are used to verify the database design components: entities, attributes, and their relationships. The designer also uses the data dictionary to check the database design for homonyms and synonyms and verifies that the entities and attributes will support all required query and report requirements.

IMPLEMENTATION AND LOADING
The DBMS's data dictionary helps to resolve any remaining attribute definition inconsistencies.

TESTING AND EVALUATION
If problems develop during this phase, the data dictionary contents may be used to help restructure the basic design components to make sure that they support all required operations.

OPERATION
If the database design still yields (the almost inevitable) operational glitches, the data dictionary may be used as a quality control device to ensure that operational modifications to the database do not conflict with existing components.

MAINTENANCE AND EVOLUTION
As users face inevitable changes in information needs, the database may be modified to support those needs. Perhaps entities, attributes, and relationships must be added, or relationships must be changed. If new database components are fit into the design, their introduction may produce conflicts with existing components.
The data dictionary turns out to be a very useful tool to check whether a suggested change invites conflicts within the database design and, if so, how such conflicts may be resolved.

10. What steps are required in the development of an ER diagram? (Hint: See Table 9.3.)

Table 9.3 is reproduced for your convenience.
TABLE 9.3 Developing the Conceptual Model, Using ER Diagrams

STEP  ACTIVITY
1  Identify, analyze, and refine the business rules.
2  Identify the main entities, using the results of Step 1.
3  Define the relationships among the entities, using the results of Steps 1 and 2.
4  Define the attributes, primary keys, and foreign keys for each of the entities.
5  Normalize the entities. (Remember that entities are implemented as tables in an RDBMS.)
6  Complete the initial ER diagram.
7  Validate the ER model against the user's information and processing requirements.
8  Modify the ER diagram, using the results of Step 7.

Point out that some of the steps listed in Table 9.3 take place concurrently. And some, such as the normalization process, can generate a demand for additional entities and/or attributes, thereby causing the designer to revise the ER model. For example, while identifying two main entities, the designer might also identify the composite bridge entity that represents the many-to-many relationship between those two main entities.

11. List and briefly explain the activities involved in the verification of an ER model.

Section 9-4c, "Data Model Verification," includes a discussion on verification. In addition, Appendix C, "The University Lab: Conceptual Design Verification, Logical Design, and Implementation," covers the verification process in detail. The verification process is detailed in the text's Table 9.5, reproduced here for your convenience.

TABLE 9.5 The ER Model Verification Process

STEP  ACTIVITY
1  Identify the ER model's central entity.
2  Identify each module and its components.
3  Identify each module's transaction requirements. Internal: Updates/Inserts/Deletes/Queries/Reports. External: Module interfaces.
4  Verify all processes against the module's processing and reporting requirements.
5  Make all necessary changes suggested in Step 4.
6  Repeat Steps 2−5 for all modules.
Keep in mind that the verification process requires the continuous verification of business transactions as well as system and user requirements. The verification sequence must be repeated for each of the system's modules.

12. What factors are important in a DBMS software selection?

The selection of DBMS software is critical to the information system's smooth operation. Consequently, the advantages and disadvantages of the proposed DBMS software should be carefully studied. To avoid false expectations, the end user must be made aware of the limitations of both the DBMS and the
database. Although the factors affecting the purchasing decision vary from company to company, some of the most common are:
• Cost. Purchase, maintenance, operational, license, installation, training, and conversion costs.
• DBMS features and tools. Some database software includes a variety of tools that facilitate the application development task. For example, the availability of query by example (QBE), screen painters, report generators, application generators, data dictionaries, and so on helps to create a more pleasant work environment for both the end user and the application programmer. Database administrator facilities, query facilities, ease of use, performance, security, concurrency control, transaction processing, and third-party support also influence DBMS software selection.
• Underlying model. Hierarchical, network, relational, object/relational, or object.
• Portability. Across platforms, systems, and languages.
• DBMS hardware requirements. Processor(s), RAM, disk space, and so on.

13. List and briefly explain the four steps performed during the logical design stage.

1) Map the conceptual model to logical model components. In this step, the conceptual model is converted into a set of table definitions, including table names, column names, primary keys, and foreign keys, to implement the entities and relationships specified in the conceptual design.
2) Validate the logical model using normalization. It is possible for normalization issues to be discovered during the process of mapping the conceptual model to logical model components. Therefore, it is appropriate at this stage to validate that all of the table definitions from the previous step conform to the appropriate normalization rules.
3) Validate logical model integrity constraints. This step involves the conversion of attribute domains and constraints into constraint definitions that can be implemented within the DBMS to enforce those domains.
Also, entity and referential integrity constraints are validated. Views may be defined to enforce security constraints.
4) Validate the logical model against the user requirements. The final step of this stage is to ensure that all definitions created throughout the logical model are validated against the users' data, transaction, and security requirements. Every component (table, view, constraint, etc.) of the logical model must be associated with satisfying the user requirements, and every user requirement should be addressed by the model components.

14. List and briefly explain the three steps performed during the physical design stage.

1) Define data storage organization. Based on estimates of the data volume and growth, this step involves the determination of the physical location and physical organization for each table. Also, which columns will be indexed and the type of indexes to be used are determined. Finally, the type of implementation to be used for each view is decided.
2) Define integrity and security measures. This step involves creating users and security groups, and then assigning privileges and controls to those users and groups.
3) Determine performance measurements.
The actual performance of the physical database implementation must be measured and assessed for compliance with user performance requirements.

15. What three levels of backup may be used in database recovery management? Briefly describe what each of those three backup levels does.

A full backup of the database creates a backup copy of all database objects in their entirety. A differential backup of the database creates a backup of only those database objects that have changed since the last full backup. A transaction log backup does not create a backup of database objects; instead, it backs up the log of changes that have been applied to the database objects since the last backup.
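The full and differential backup levels can be illustrated as simple selection rules over change timestamps. The Python sketch below is a teaching aid only — the object names and integer "logical clock" timestamps are invented, and a real DBMS works at the level of pages and log records rather than whole named objects:

```python
# Sketch of two backup levels as selection rules. (A transaction log
# backup, by contrast, would copy the change log itself rather than
# selecting database objects.)

def full_backup(objects):
    """A full backup copies every database object."""
    return sorted(objects)

def differential_backup(objects, changed_at, last_full):
    """A differential backup copies only the objects changed since the
    last full backup."""
    return sorted(o for o in objects if changed_at[o] > last_full)

objects = {"CUSTOMER", "INVOICE", "PRODUCT"}
changed_at = {"CUSTOMER": 5, "INVOICE": 12, "PRODUCT": 3}  # logical clocks
last_full_taken_at = 4

print(full_backup(objects))   # ['CUSTOMER', 'INVOICE', 'PRODUCT']
print(differential_backup(objects, changed_at, last_full_taken_at))
# ['CUSTOMER', 'INVOICE'] -- PRODUCT is unchanged since the full backup
```

The payoff of the differential level is visible even in this toy: restoring requires only the last full backup plus the latest differential, and each differential copies far less data than a full backup.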
Problem Solutions

1. The ABC Car Service & Repair Centers are owned by the SILENT car dealer; ABC services and repairs only SILENT cars. Three ABC Car Service & Repair Centers provide service and repair for the entire state. Each of the three centers is independently managed and operated by a shop manager, a receptionist, and at least eight mechanics. Each center maintains a fully stocked parts inventory. Each center also maintains a manual file system in which each car's maintenance history is kept: repairs made, parts used, costs, service dates, owner, and so on. Files are also kept to track inventory, purchasing, billing, employees' hours, and payroll. You have been contacted by the manager of one of the centers to design and implement a computerized system. Given the preceding information, do the following:

a. Indicate the most appropriate sequence of activities by labeling each of the following steps in the correct order. (For example, if you think that "Load the database." is the appropriate first step, label it "1.")

____ Normalize the conceptual model.
____ Obtain a general description of company operations.
____ Load the database.
____ Create a description of each system process.
____ Test the system.
____ Draw a data flow diagram and system flowcharts.
____ Create a conceptual model, using ER diagrams.
____ Create the application programs.
____ Interview the mechanics.
____ Create the file (table) structures.
____ Interview the shop manager.
The answer to this question may vary slightly from one designer to the next, depending on the selected design methodology and even on personal designer preferences. Yet, in spite of such differences, it is possible to develop a common design methodology to permit the development of a basic
decision-making process and the analysis required in designing an information system. Whatever the design philosophy, a good designer uses a specific and ordered set of steps through which the database design problem is approached. The steps are generally based on three phases: analysis, design, and implementation. These phases yield the following activities:

ANALYSIS
1. Interview the shop manager
2. Interview the mechanics
3. Obtain a general description of company operations
4. Create a description of each system process

DESIGN
5. Create a conceptual model, using ER diagrams
6. Draw a data flow diagram and system flowcharts
7. Normalize the conceptual model

IMPLEMENTATION
8. Create the table structures
9. Load the database
10. Create the application programs
11. Test the system

This listing implies that, within each of the three phases, the steps are completed in a specific order. For example, it would seem reasonable to argue that we must first complete the interviews if we are to obtain a proper description of the company operations. Similarly, we may argue that a data flow diagram precedes the creation of the ER diagram. Nevertheless, the specific tasks and the order in which they are addressed may vary. Such variations do not matter, as long as the designer bases the selected procedures on an appropriate design philosophy, such as top-down vs. bottom-up. Given this discussion, we may present Problem 1's solution this way:

__7__ Normalize the conceptual model.
__3__ Obtain a general description of company operations.
__9__ Load the database.
__4__ Create a description of each system process.
_11__ Test the system.
__6__ Draw a data flow diagram and system flowcharts.
__5__ Create a conceptual model, using ER diagrams.
_10__ Create the application programs.
__2__ Interview the mechanics.
__8__ Create the file (table) structures.
__1__ Interview the shop manager.
b. Describe the various modules that you believe the system should include.

This question may be addressed in several ways. We suggest the following approach to develop a system composed of four main modules: Inventory, Payroll, Work Order, and Customer. We have illustrated the information system's main modules in Figure P9.1B.
Figure P9.1B The ABC Company’s IS System Modules
The Inventory module will include the Parts and Purchasing submodules. The Payroll module will handle all employee and payroll information. The Work Order module keeps track of each car's maintenance history and of all work orders for maintenance done on a car. The Customer module keeps track of the billing of the work orders to the customers and of the payments received from those customers.

c. How will a data dictionary help you develop the system? Give examples.

We have addressed the role of the data dictionary within the DBLC in detail in the answer to Review Question 9. Remember that the data dictionary makes it easier to check for the existence of synonyms and homonyms, to check whether all attributes exist to support required reports, to verify appropriate relationship representations, and so on. Therefore, the data dictionary's contents will help us to provide consistency across modules and to evaluate the system's ability to generate the required reports. In addition, the use of the data dictionary facilitates the creation of system documentation.
d. What general (system) recommendations might you make to the shop manager? (For example, if the system will be integrated, what modules will be integrated? What benefits would be derived from such an integrated system? Include several general recommendations.)

The designer's job is to provide solutions to the main problems found during the initial study. Clearly, any system is subject to both internal and external constraints. For example, we can safely assume that the owner of the ABC Car Service & Repair Center has a time frame in mind, not to mention a spending limitation. As is true in all design work, the designer and the business owner must prioritize the modules and develop those that yield the greatest benefit within the stated time and development budget constraints. Keep in mind that it is always useful to develop a modular system that provides for future enhancement and expansion. Suppose, for example, that ABC Car Service & Repair management decides to integrate all of its service stations in the state in order to provide better statewide service. Such integration is likely to yield many benefits: the maintenance history of each car will be available to any station for cars that have been serviced in more than one location; the parts inventory will be online, thus allowing parts orders to be placed between service stations; mechanics can better share tips concerning the solution of car maintenance problems; and so on.

e. What is the best approach to conceptual database design? Why?

Given the nature of this business, the best way to produce this conceptual database design would be to use a centralized and top-down approach. Keep in mind that the designer must keep the design sufficiently flexible to make sure that it can accommodate any future integration of this system with the other service stations in the state.

f. Name and describe at least four reports the system should have. Explain their use.
Who will use those reports?

REPORT 1
Monthly Activity contains a summary of service categories by branch and by month. Such reports may become the basis for forecasting personnel and stock requirements for each branch and for each period.

REPORT 2
Mechanic Summary Sheet contains a summary of work hours clocked by each mechanic. This report would be generated weekly and would be useful for payroll and maintenance personnel scheduling purposes.

REPORT 3
Monthly Inventory contains a summary of parts in inventory, inventory draw-down, parts reorder points, and information about the vendors who will provide the parts to be reordered. This report will be especially useful for inventory management purposes.

REPORT 4
Customer Activity contains a breakdown of customers by location, maintenance activity, current balances, available credit, and so on. This report would be useful to forecast various service demand factors, to mail promotional materials, to send maintenance reminders, to keep track of special
customer requirements, and so on.

2. Suppose you have been asked to create an information system for a manufacturing plant that produces nuts and bolts of many shapes, sizes, and functions. What questions would you ask, and how would the answers to those questions affect the database design?

Basically, all answers to all (relevant) questions help shape the database design. In fact, all information collected during the initial study and all subsequent phases will have an impact on the database design. Keep in mind that the information is collected to establish the entities, attributes, and the relationships among the entities. Specifically, the relationships, connectivities, and cardinalities are shaped by the business rules that are derived from the information collected by the designer. Sample questions and their likely impact on the design might be:
• Do you want to develop the database for all departments at once, or do you want to design and implement the database for one department at a time?
• How will the design approach affect the design process? (In other words, assess top-down vs. bottom-up, centralized vs. decentralized, system scope and boundaries.)
• Do you want to develop one module at a time, or do you want an integrated system? (Inventory, production, shipping, billing, etc.)
• Do you want to keep track of the nuts and bolts by lot number, production shift, type, and department? Impact: conceptual and logical database design.
• Do you want to keep track of the suppliers of each batch of raw material used in the production of the nuts and bolts? Impact: conceptual and logical database design; ER model.
• Do you want to keep track of the customers who received the batches of nuts and bolts? Impact: conceptual and logical database design; ER model.
• What reports will you require, what will be the specific reporting requirements, and to whom will these reports be distributed?
The answers to such questions affect the conceptual and logical database design, the database's implementation, its testing, and its subsequent operation.

a. What do you envision the SDLC to be?

The SDLC is not a function of the information collected. Regardless of the extent of the design or its specific implementation, the SDLC phases remain:
PLANNING
Initial assessment
Feasibility study

ANALYSIS
User requirements
Study of existing systems
Logical system design
DETAILED SYSTEMS DESIGN
Detailed system specifications
IMPLEMENTATION
Coding, testing, debugging
Installation, fine-tuning

MAINTENANCE
Evaluation
Maintenance
Enhancements

b. What do you envision the DBLC to be?

As is true for the SDLC, the DBLC is not a function of the kind and extent of the collected information. Thus, the DBLC phases and their activities remain as shown:

DATABASE INITIAL STUDY
Analyze the company situation
Define problems and constraints
Define objectives
Define scope and boundaries
DATABASE DESIGN
Create the conceptual design
Create the logical design
Create the physical design
IMPLEMENTATION AND LOADING
Install the DBMS
Create the database(s)
Load or convert the data

TESTING AND EVALUATION
Test the database
Fine-tune the database
Evaluate the database and its application programs

OPERATION
Produce the required information flow
MAINTENANCE AND EVOLUTION
Introduce changes
Make enhancements

3. Suppose you perform the same functions noted in Problem 2 for a larger warehousing operation. How are the two sets of procedures similar? How and why are they different?

The development of an information system will differ in the approach and philosophy used. More
precisely, the design team will probably be formed by a group of systems analysts and may decide to use a decentralized approach to database design. Also, as is true for any organization, the system scope and constraints may be very different for different systems. Therefore, designers may opt to use different techniques at different stages. For example, the database initial study phase may include separate studies carried out by separate design teams at several geographically distant locations. The findings of those design teams will later be integrated to identify the main problems, solutions, and opportunities that will guide the design and development of the system.

4. Using the same procedures and concepts employed in Problem 1, how would you create an information system for the Tiny College example in Chapter 4?

Tiny College is a medium-sized educational institution that uses many database-intensive operations, such as student registration, academic administration, inventory management, and payroll. To create an information system, first perform an initial database study to determine the information system's objectives. Next, study Tiny College's operations and processes (flow of data) to identify the main problems, constraints, and opportunities. A precise definition of the main problems and constraints will enable the designer to make sure that the design improves Tiny College's operational efficiency. An improvement in operational efficiency is likely to create opportunities to provide new services that will enhance Tiny College's competitive position. After the initial database study is done and the alternative solutions are presented, the end users ultimately decide which one of the probable solutions is most appropriate for Tiny College. Keep in mind that the development of a system this size will probably involve people who have quite different backgrounds.
For example, it is likely that the designer must work with people who play a managerial role in communications and local area networks, as well as with the "troops in the trenches" such as programmers and system operators. The designer should, therefore, expect that there will be a wide range of opinions concerning the proposed system's features. It is the designer's job to reconcile the many (and often conflicting) views of the "ideal" system. Once a proposed solution has been agreed upon, the designer(s) may determine the proposed system's scope and boundaries. We are then able to begin the design phase. As the design phase begins, keep in mind that Tiny College's information system is likely to be used by many users (20 to 40 minimum) who are located on distant sites across campus. Therefore, the designer must consider a range of communication issues involving the use of such technologies as local area networks. These technologies must be considered as the database designer(s) begin to develop the structure of the database to be implemented. The remaining development work conforms to the SDLC and the DBLC phases. Special attention must be given to the system design's implementation and testing to ensure that all the system modules interface properly. Finally, the designer(s) must provide all the appropriate system documentation and ensure that all
appropriate system maintenance procedures (periodic backups, security checks, etc.) are in place to ensure the system's proper operation. Keep in mind that two very important issues in a university-wide system are end-user training and support. Therefore, the system designers must make sure that all end users know the system and know how it is to be used to enjoy its benefits. In other words, make sure that end-user support programs are in place when the system becomes operational.

5. Write the proper sequence of activities in the design of a video rental database. (The initial ERD was shown in Figure 9.9.) The design must support all rental activities, customer payment tracking, and employee work schedules, as well as track which employees checked out the videos to the customers. After you finish writing the design activity sequence, complete the ERD to ensure that the database design can be successfully implemented. (Make sure that the design is normalized properly and that it can support the required transactions.)

Given its level of detail and (relative) complexity, this problem would make an excellent class project. Use the chapter's coverage of the database life cycle (DBLC) as the procedural template. The text's Figure 9.3 is particularly useful as a procedural map for this problem's solution, and Figure 9.6 provides a more detailed view of the database design's procedural flow. Make sure that the students review Section 9-3b, "Database Design," before they attempt to produce the problem solution. Appendix B, "The University Lab: Conceptual Design," and Appendix C, "The University Lab: Conceptual Design Verification, Logical Design, and Implementation," show a very detailed example of the procedures required to deliver a completed database. You will find a more detailed video rental database problem description in Appendix B, Problem 4. This problem requires the completion of the initial database design.
The solution is shown in this manual's Appendix B coverage. This design is verified in Appendix C, Problem 2. The Visio Professional files for the initial and verified designs are located on your instructor's CD. Select the FigB-P04a-The-Initial-Crows-Foot-ERD-for-the-Video-Rental-Store.vsd file to see the initial design. Select the Fig-C-P02a-The-Revised-Video-Rental-Crows-Foot-ERD.vsd file to see the verified design.

6. In a construction company, a new system has been in place for a few months and now there is a list of possible changes/updates that need to be done. For each of the changes/updates, specify what type of maintenance needs to be done: (a) corrective, (b) adaptive, or (c) perfective.

a. An error in the size of one of the fields has been identified, and the field definition needs to be updated.
This is a change in response to a system error – corrective maintenance.

b. The company is expanding into a new type of service, and this will require enhancing the system with a new set of tables to support the new service and integrate it with the existing data.
This is a change to enhance the system – perfective maintenance.
c. The company has to comply with some government regulations. To do this, it will require adding a couple of fields to the existing system tables.
This is a change in response to changes in the business environment – adaptive maintenance.

7. You have been assigned to design the database for a new soccer club. Indicate the most appropriate sequence of activities by labeling each of the following steps in the correct order. (For example, if you think that "Load the database" is the appropriate first step, label it "1.")

_10__ Create the application programs.
__4__ Create a description of each system process.
_11__ Test the system.
__9__ Load the database.
__7__ Normalize the conceptual model.
__1__ Interview the soccer club president.
__5__ Create a conceptual model using ER diagrams.
__2__ Interview the soccer club director of coaching.
__8__ Create the file (table) structures.
__3__ Obtain a general description of the soccer club operations.
__6__ Draw a data flow diagram and system flowcharts.
Chapter 10 Transaction Management and Concurrency Control
Discussion Focus

Why does a multi-user database environment give us a special interest in transaction management and concurrency control?

Begin by exploring what a transaction is, what its components are, and why it must be managed carefully even in a single-user database environment. Then explain why a multi-user database environment makes transaction management even more critical. Emphasize the following points:
• A transaction represents a real-world event such as the sale of a product.
• A transaction must be a logical unit of work. That is, no portion of a transaction stands by itself. For example, the product sale has an effect on inventory and, if it is a credit sale, it has an effect on customer balances.
• A transaction must take a database from one consistent state to another. Therefore, all parts of a transaction must be executed, or the transaction must be aborted. (A consistent state of the database is one in which all data integrity constraints are satisfied.)
All transactions have four properties: atomicity, consistency, isolation, and durability. (These four properties are also known as the ACID test of transactions.) In addition, multiple transactions must conform to the property of serializability. Table IM10.1 provides a good summary of transaction properties.
Table IM10.1 Transaction Properties

All transactions (single-user and multi-user databases):
• Atomicity: Unless all parts of the transaction are executed, the transaction is aborted.
• Consistency: Indicates the permanence of the database's consistent state.
• Durability: Once a transaction is committed, it cannot be rolled back.
• Isolation: Data used by one transaction cannot be used by another transaction until the first transaction is completed.
Concurrent transactions (multi-user databases):
• Serializability: The result of the concurrent execution of transactions is the same as though the transactions were executed in serial order.
Note that SQL provides transaction support through COMMIT (which permanently saves changes to disk) and ROLLBACK (which restores the previous database state).

Each SQL transaction is composed of several database requests, each one of which yields I/O operations. A transaction log keeps track of all transactions that modify the database. The transaction log data may be used for recovery (ROLLBACK) purposes.

Next, explain that concurrency control is the management of concurrent transaction execution. (Therefore, a single-user database does not require concurrency control!) Specifically, explore the concurrency control issues concerning lost updates, uncommitted data, and inconsistent retrievals. Note that multi-user DBMSs use a process known as a scheduler to enforce concurrency control. Since concurrency control is made possible through the use of locks, examine the various levels and types of locks. Because the use of locks creates the possibility of producing deadlocks, discuss the detection, prevention, and avoidance strategies that enable the system to manage deadlock conditions.
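The COMMIT/ROLLBACK behavior described above can be demonstrated in class with a short script. The following is an illustrative sketch using Python's built-in sqlite3 module; the ACCOUNT table and its values are borrowed from the credit-sale example used later in this chapter, not prescribed by the text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (acct TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account VALUES ('60120010', 900.0)")
conn.commit()                    # COMMIT: permanently saves the changes

conn.execute("UPDATE account SET balance = balance + 3500 "
             "WHERE acct = '60120010'")
conn.rollback()                  # ROLLBACK: restores the previous database state

balance = conn.execute("SELECT balance FROM account "
                       "WHERE acct = '60120010'").fetchone()[0]
print(balance)   # 900.0 -- the uncommitted update was undone
```

The committed INSERT survives the rollback; only the uncommitted UPDATE is discarded.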
Answers to Review Questions

1. Explain the following statement: a transaction is a logical unit of work.

A transaction is a logical unit of work that must be entirely completed or aborted; no intermediate states are accepted. In other words, a transaction, composed of several database requests, is treated by the DBMS as a unit of work in which all transaction steps must be fully completed if the transaction is to be accepted by the DBMS. Acceptance of an incomplete transaction will yield an inconsistent database state. To avoid such a state, the DBMS ensures that all of a transaction's database operations are completed before they are committed to the database. For example, a credit sale requires a minimum of three database operations:

1. An invoice is created for the sold product.
2. The product's inventory quantity on hand is reduced.
3. The customer accounts payable balance is increased by the amount listed on the invoice.

If only parts 1 and 2 are completed, the database will be left in an inconsistent state. Unless all three parts (1, 2, and 3) are completed, the entire sales transaction is canceled.
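The all-or-nothing behavior of the credit sale can be sketched as follows. This is an illustrative Python sqlite3 example; the schema, product code, and quantities are invented for the demo (step 3 deliberately fails because no CUSTOMER table exists):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice (inv_num INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE product (code TEXT PRIMARY KEY, qoh INTEGER);
    INSERT INTO product VALUES ('100179', 50);
""")
conn.commit()

try:
    # Steps 1 and 2 of the credit sale succeed...
    conn.execute("INSERT INTO invoice VALUES (1001, 110.0)")
    conn.execute("UPDATE product SET qoh = qoh - 10 WHERE code = '100179'")
    # ...but step 3 (update the customer balance) fails: no such table here.
    conn.execute("UPDATE customer SET balance = balance + 110.0")
    conn.commit()
except sqlite3.OperationalError:
    conn.rollback()              # abort the whole unit of work

rows = conn.execute("SELECT COUNT(*) FROM invoice").fetchone()[0]
qoh = conn.execute("SELECT qoh FROM product WHERE code = '100179'").fetchone()[0]
print(rows, qoh)   # 0 50 -- no partial effects survive
```

Because the failure aborts the transaction, neither the invoice row nor the inventory change reaches the database.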
2. What is a consistent database state, and how is it achieved?

A consistent database state is one in which all data integrity constraints are satisfied. To achieve a consistent database state, a transaction must take the database from one consistent state to another. (See the answer to question 1.)

3. The DBMS does not guarantee that the semantic meaning of the transaction truly represents the real-world event. What are the possible consequences of that limitation? Give an example.

The DBMS is designed to verify the syntactic accuracy of the database commands given by the user to be executed by the DBMS. The DBMS will check that the database exists, that the referenced attributes exist in the selected tables, that the attribute data types are correct, and so on. Unfortunately, the DBMS is not designed to guarantee that a syntactically correct transaction accurately represents the real-world event. For example, if the end user sells 10 units of product 100179 (Crystal Vases), the DBMS cannot detect errors such as the operator entering 10 units of product 100197 (Crystal Glasses). The DBMS will execute the transaction, and the database will end up in a technically consistent state but in a real-world inconsistent state because the wrong product was updated.

4. List and discuss the five transaction properties.

The five transaction properties are:

• Atomicity requires that all parts of a transaction must be completed or the transaction is aborted. This property ensures that the database will remain in a consistent state.
• Consistency indicates the permanence of the database's consistent state.
• Isolation means that the data required by an executing transaction cannot be accessed by any other transaction until the first transaction finishes. This property ensures data consistency for concurrently executing transactions.
• Durability indicates that the database will be in a permanent consistent state after the execution of a transaction. In other words, once a consistent state is reached, it cannot be lost.
• Serializability means that a series of concurrent transactions will yield the same result as if they were executed one after another.

All five transaction properties work together to make sure that a database maintains data integrity and consistency for either a single-user or a multi-user DBMS.
5. What does serializability of transactions mean?

Serializability of transactions means that a series of concurrent transactions will yield the same result as if they were executed one after another.

6. What is a transaction log, and what is its function?

The transaction log is a special DBMS table that contains a description of all the database transactions executed by the DBMS. The transaction log plays a crucial role in maintaining database concurrency control and integrity. The information stored in the log is used by the DBMS to recover the database after a transaction is aborted or after a system failure. The transaction log is usually stored on a different disk or on different media (such as tape) to protect against failures caused by a media error.

7. What is a scheduler, what does it do, and why is its activity important to concurrency control?

The scheduler is the DBMS component that establishes the order in which concurrent database operations are executed. The scheduler interleaves the execution of the database operations (belonging to several concurrent transactions) to ensure the serializability of transactions. In other words, the scheduler guarantees that the execution of concurrent transactions will yield the same result as though the transactions were executed one after another. The scheduler is important because it is the DBMS component that ensures transaction serializability: it allows the concurrent execution of transactions while giving end users the impression that they are the DBMS's only users.

8. What is a lock, and how, in general, does it work?

A lock is a mechanism used in concurrency control to guarantee the exclusive use of a data element to the transaction that owns the lock. For example, if the data element X is currently locked by transaction T1, transaction T2 will not have access to X until T1 releases its lock.

Generally speaking, a data item can be in only one of two states: locked (in use by some transaction) or unlocked (not in use by any transaction). To access a data element X, a transaction T1 must first request a lock from the DBMS. If the data element is not in use, the DBMS will lock X for T1's exclusive use. No other transaction will have access to X while T1 executes.

9. What are the different levels of lock granularity?

Lock granularity refers to the size of the database object on which a single lock is placed. Lock granularity can be:

• Database-level: the entire database is locked by one lock.
• Table-level: a table is locked by one lock.
• Page-level: a disk page is locked by one lock.
• Row-level: one row is locked by one lock.
• Field-level: one field in one row is locked by one lock.
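Row-level granularity can be sketched as a lock manager that keeps one lock object per (table, row) pair, so that different rows never contend for the same lock. The class and method names below are hypothetical, invented for this Python illustration:

```python
import threading

class RowLockManager:
    """One lock per (table, row) pair: row-level granularity."""
    def __init__(self):
        self._registry = threading.Lock()   # protects the lock table itself
        self._locks = {}

    def _lock_for(self, table, row_id):
        with self._registry:
            return self._locks.setdefault((table, row_id), threading.Lock())

    def lock(self, table, row_id, blocking=True):
        return self._lock_for(table, row_id).acquire(blocking)

    def unlock(self, table, row_id):
        self._lock_for(table, row_id).release()

mgr = RowLockManager()
mgr.lock("PART", "A")                              # T1 locks row A
ok = mgr.lock("PART", "B")                         # different row: no contention
blocked = mgr.lock("PART", "A", blocking=False)    # same row: request denied
print(ok, blocked)   # True False
```

A table-level design would keep one lock per table instead, trading concurrency for a smaller lock table and less management overhead, which is exactly the trade-off discussed in question 10.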
10. Why might a page-level lock be preferred over a field-level lock?

Smaller lock granularity improves the concurrency of the database by reducing contention for database objects. However, smaller lock granularity also means that more locks must be maintained and managed by the DBMS, requiring more processing overhead and system resources for lock management. Concurrency demands and system resource usage must be balanced to ensure the best overall transaction performance. In some circumstances, page-level locks, which require fewer system resources, may produce better overall performance than field-level locks, which require more system resources.

11. What is concurrency control, and what is its objective?

Concurrency control is the activity of coordinating the simultaneous execution of transactions in a multiprocessing or multi-user database management system. The objective of concurrency control is to ensure the serializability of transactions in a multi-user database management system. (The DBMS's scheduler is in charge of maintaining concurrency control.) Because it helps to guarantee data integrity and consistency in a database system, concurrency control is one of the most critical activities performed by a DBMS. If concurrency control is not maintained, three serious problems may be caused by concurrent transaction execution: lost updates, uncommitted data, and inconsistent retrievals.

12. What is an exclusive lock, and under what circumstances is it granted?

An exclusive lock is one of two lock types used to enforce concurrency control. (A lock can have three states: unlocked, shared (read) lock, and exclusive (write) lock. The "shared" and "exclusive" labels indicate the nature of the lock.) An exclusive lock exists when access to a data item is specifically reserved for the transaction that locked the object. The exclusive lock must be used when a potential for conflict exists, e.g., when one or more transactions must update (WRITE) a data item. Therefore, an exclusive lock is issued only when a transaction must WRITE (update) a data item and no locks are currently held on that data item by any other transaction.

To understand the reasons for having an exclusive lock, look at its counterpart, the shared lock. Shared locks are appropriate when concurrent transactions are granted READ access on the basis of a common lock, because concurrent transactions based on a READ cannot produce a conflict. A shared lock is issued when a transaction must read data from the database and no exclusive locks are held on the data to be read.

13. What is a deadlock, and how can it be avoided? Discuss several strategies for dealing with deadlocks.
Base your discussion on Chapter 10's Section 10-3d, Deadlocks. Start by pointing out that, although locks prevent serious data inconsistencies, their use may lead to two major problems:

1. The transaction schedule dictated by the locking requirements may not be serializable, thus causing data integrity and consistency problems.
2. The schedule may create deadlocks. Database deadlocks are the equivalent of traffic gridlock in a big city and are caused by two transactions waiting for each other to unlock data.

Use Table 10.13 in the text to illustrate the scenario that leads to a deadlock. The table has been reproduced below for your convenience.
TABLE 10.13 How a Deadlock Condition Is Created

Time  Transaction   Reply  Data X    Data Y
0                          Unlocked  Unlocked
1     T1: LOCK(X)   OK     Locked    Unlocked
2     T2: LOCK(Y)   OK     Locked    Locked
3     T1: LOCK(Y)   WAIT   Locked    Locked
4     T2: LOCK(X)   WAIT   Locked    Locked   <- Deadlock
5     T1: LOCK(Y)   WAIT   Locked    Locked
6     T2: LOCK(X)   WAIT   Locked    Locked
7     T1: LOCK(Y)   WAIT   Locked    Locked
8     T2: LOCK(X)   WAIT   Locked    Locked
9     T1: LOCK(Y)   WAIT   Locked    Locked
...   ...           ...    ...       ...
In a real-world DBMS, many more transactions can be executed simultaneously, thereby increasing the probability of generating deadlocks. Note that deadlocks are possible only if one of the transactions wants to obtain an exclusive lock on a data item; no deadlock condition can exist among shared locks. Three basic techniques exist to control deadlocks:
Deadlock Prevention A transaction requesting a new lock is aborted if there is a possibility that a deadlock may occur. If the transaction is aborted, all the changes made by this transaction are rolled back and all locks are released. The transaction is then re-scheduled for execution. Deadlock prevention works because it avoids the conditions that lead to deadlocking.
Deadlock Detection
The DBMS periodically tests the database for deadlocks. If a deadlock is found, one of the transactions (the "victim") is aborted (rolled back and rescheduled) and the other transaction continues. Note particularly the discussion in Section 10-4a, Wait/Die and Wound/Wait Schemes.
Deadlock Avoidance
The transaction must obtain all the locks it needs before it can be executed. This technique avoids rollback of conflicting transactions by requiring that locks be obtained in succession. However, the serial lock assignment required in deadlock avoidance increases response times.

The best deadlock control method depends on the database environment. For example, if the probability of deadlocks is low, deadlock detection is recommended. However, if the probability of deadlocks is high, deadlock prevention is recommended. If response time is not high on the system priority list, deadlock avoidance may be employed.

14. What are some disadvantages of time-stamping methods for concurrency control?

The disadvantages are:
1) Each value stored in the database requires two additional time stamp fields: one for the last time the field was read and one for the last time it was updated.
2) Increased memory and processing overhead requirements.
3) Many transactions may have to be stopped, rescheduled, and restamped.

15. Why might it take a long time to complete transactions when an optimistic approach to concurrency control is used?

Because the optimistic approach assumes that conflict between concurrent transactions is unlikely, it does nothing to avoid or control conflicts. The only test for conflict occurs during the validation phase. If a conflict is detected, the entire transaction restarts. In an environment with few concurrency conflicts, this single checking scheme works well. In an environment where conflicts are common, a transaction may have to be restarted numerous times before it can be written to the database.

16. What are the three types of database critical events that can trigger the database recovery process? Give some examples of each one.

Backup and recovery functions constitute a very important component of today's DBMSs.
Some DBMSs provide functions that allow the database administrator to perform and schedule automatic database backups to permanent secondary storage devices, such as disks or tapes. Critical events include:

• Hardware/software failures. Examples are a hard disk media failure, a bad capacitor on a motherboard, or a failing memory bank. Other causes of errors in this category include application program or operating system errors that cause data to be overwritten, deleted, or lost.
• Human-caused incidents. This type of event can be categorized as unintentional or intentional.
  − An unintentional failure is caused by end-user carelessness. Such errors include deleting the wrong rows from a table, pressing the wrong key on the keyboard, or shutting down the main database server by accident.
  − Intentional events are of a more severe nature and normally indicate that the company data are at serious risk. Under this category are security threats caused by hackers trying to gain unauthorized access to data resources, and virus attacks caused by disgruntled employees trying to compromise database operations and damage the company.
• Natural disasters. This category includes fires, earthquakes, floods, and power failures.
17. What are the four ANSI transaction isolation levels? What type of reads does each level allow?

The four ANSI transaction isolation levels are 1) read uncommitted, 2) read committed, 3) repeatable read, and 4) serializable. These levels allow different "questionable" reads. A read is questionable if it can produce inconsistent results. Read uncommitted isolation allows dirty reads, non-repeatable reads, and phantom reads. Read committed isolation allows non-repeatable reads and phantom reads. Repeatable read isolation allows phantom reads. Serializable isolation does not allow any questionable reads.
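The anomaly matrix above can be captured as a small lookup table, which makes a handy in-class quiz aid. The level names and anomaly labels below are simply strings chosen for this sketch:

```python
# Questionable reads each ANSI isolation level still permits.
ALLOWED = {
    "READ UNCOMMITTED": {"dirty read", "non-repeatable read", "phantom read"},
    "READ COMMITTED":   {"non-repeatable read", "phantom read"},
    "REPEATABLE READ":  {"phantom read"},
    "SERIALIZABLE":     set(),          # no questionable reads at all
}

def permits(level, anomaly):
    """Return True if the given isolation level allows the anomaly."""
    return anomaly in ALLOWED[level]

print(permits("READ COMMITTED", "dirty read"))      # False
print(permits("REPEATABLE READ", "phantom read"))   # True
```

Note the strict containment as the levels get stricter: each level permits a subset of the anomalies permitted by the level below it.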
Problem Solutions

1. Suppose you are a manufacturer of product ABC, which is composed of parts A, B, and C. Each time a new product is created, it must be added to the product inventory, using the PROD_QOH in a table named PRODUCT. Each time the product ABC is created, the parts inventory, using PART_QOH in a table named PART, must be reduced by one each of parts A, B, and C. The sample database contents are shown in Table P10.1.
Table P10.1 The Database for Problem 1

Table name: PRODUCT
PROD_CODE  PROD_QOH
ABC        1,205

Table name: PART
PART_CODE  PART_QOH
A          567
B          98
C          549
Given that information, answer Questions a through e.

a. How many database requests can you identify for an inventory update for both PRODUCT and PART?

Depending on how the SQL statements are written, there are two correct answers: 4 or 2.

b. Using SQL, write each database request you identified in Step a.

The database requests are shown below.
Four SQL statements:

UPDATE PRODUCT
SET PROD_QOH = PROD_QOH + 1
WHERE PROD_CODE = 'ABC';

UPDATE PART
SET PART_QOH = PART_QOH - 1
WHERE PART_CODE = 'A';

UPDATE PART
SET PART_QOH = PART_QOH - 1
WHERE PART_CODE = 'B';

UPDATE PART
SET PART_QOH = PART_QOH - 1
WHERE PART_CODE = 'C';

Two SQL statements:

UPDATE PRODUCT
SET PROD_QOH = PROD_QOH + 1
WHERE PROD_CODE = 'ABC';

UPDATE PART
SET PART_QOH = PART_QOH - 1
WHERE PART_CODE = 'A'
   OR PART_CODE = 'B'
   OR PART_CODE = 'C';
c. Write the complete transaction(s).

The transactions are shown below.

Four SQL statements:

BEGIN TRANSACTION;
UPDATE PRODUCT
SET PROD_QOH = PROD_QOH + 1
WHERE PROD_CODE = 'ABC';
UPDATE PART
SET PART_QOH = PART_QOH - 1
WHERE PART_CODE = 'A';
UPDATE PART
SET PART_QOH = PART_QOH - 1
WHERE PART_CODE = 'B';
UPDATE PART
SET PART_QOH = PART_QOH - 1
WHERE PART_CODE = 'C';
COMMIT;

Two SQL statements:

BEGIN TRANSACTION;
UPDATE PRODUCT
SET PROD_QOH = PROD_QOH + 1
WHERE PROD_CODE = 'ABC';
UPDATE PART
SET PART_QOH = PART_QOH - 1
WHERE PART_CODE = 'A'
   OR PART_CODE = 'B'
   OR PART_CODE = 'C';
COMMIT;
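The two-statement transaction can be run end to end against the Table P10.1 data to verify the resulting quantities. This is an illustrative Python sqlite3 sketch; lowercase table and column names are a choice of the demo, not of the text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (prod_code TEXT PRIMARY KEY, prod_qoh INTEGER);
    CREATE TABLE part (part_code TEXT PRIMARY KEY, part_qoh INTEGER);
    INSERT INTO product VALUES ('ABC', 1205);
    INSERT INTO part VALUES ('A', 567), ('B', 98), ('C', 549);
""")
conn.commit()

# The two-statement form of the transaction, executed as one unit of work.
conn.execute("UPDATE product SET prod_qoh = prod_qoh + 1 "
             "WHERE prod_code = 'ABC'")
conn.execute("UPDATE part SET part_qoh = part_qoh - 1 "
             "WHERE part_code IN ('A', 'B', 'C')")
conn.commit()

prod_qoh = conn.execute("SELECT prod_qoh FROM product").fetchone()[0]
part_qohs = [r[0] for r in conn.execute(
    "SELECT part_qoh FROM part ORDER BY part_code")]
print(prod_qoh)    # 1206
print(part_qohs)   # [566, 97, 548]
```

One product is added, and one unit of each of parts A, B, and C is consumed, matching the expected inventory changes.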
d. Write the transaction log, using Table 10.1 as your template.

We assume that product 'ABC' has a PROD_QOH = 23 at the start of the transaction and that the transaction represents the addition of 1 new product. We also assume that PART components 'A', 'B', and 'C' have a PART_QOH equal to 56, 12, and 45, respectively.

TRL  TRX  PREV  NEXT  OPERATION  TABLE                  ROW ID  ATTRIBUTE  BEFORE  AFTER
ID   NUM  PTR   PTR                                                        VALUE   VALUE
1    1A3  NULL  2     START      ** START TRANSACTION
2    1A3  1     3     UPDATE     PRODUCT                'ABC'   PROD_QOH   23      24
3    1A3  2     4     UPDATE     PART                   'A'     PART_QOH   56      55
4    1A3  3     5     UPDATE     PART                   'B'     PART_QOH   12      11
5    1A3  4     6     UPDATE     PART                   'C'     PART_QOH   45      44
6    1A3  5     NULL  COMMIT     ** END TRANSACTION
e. Using the transaction log you created in Step d, trace its use in database recovery.

Begin with the last trl_id (trl_id 6) for the transaction (trx_num 1A3) and work backward, using the prev_ptr to identify the next step to undo, moving from the end of the transaction back to the beginning.

Trl_ID 6: Nothing to change because it is an end-of-transaction marker.
Trl_ID 5: Change PART_QOH from 44 to 45 for ROW_ID 'C' in the PART table.
Trl_ID 4: Change PART_QOH from 11 to 12 for ROW_ID 'B' in the PART table.
Trl_ID 3: Change PART_QOH from 55 to 56 for ROW_ID 'A' in the PART table.
Trl_ID 2: Change PROD_QOH from 24 to 23 for ROW_ID 'ABC' in the PRODUCT table.
Trl_ID 1: Nothing to change because it is a beginning-of-transaction marker.

2. Describe the three most common problems with concurrent transaction execution. Explain how concurrency control can be used to avoid those problems.

The three main concurrency control problems are triggered by lost updates, uncommitted data, and inconsistent retrievals. These control problems are discussed in detail in Section 10-2. Note particularly Section 10-2a, Lost Updates; Section 10-2b, Uncommitted Data; and Section 10-2c, Inconsistent Retrievals.

3. What DBMS component is responsible for concurrency control? How is this feature used to resolve conflicts?

Severe database integrity and consistency problems can arise when two or more concurrent transactions are executed. To avoid such problems, the DBMS must exercise concurrency control. The DBMS component in charge of concurrency control is the scheduler. The scheduler is discussed in Section 10-2d. Note particularly the Read/Write conflict scenarios illustrated with the help of Table 10.11, Read/Write Conflict Scenarios: Conflicting Database Operations Matrix.
4. Using a simple example, explain the use of binary and shared/exclusive locks in a DBMS.

Discuss Section 10-3, Concurrency Control with Locking Methods. Binary locks are discussed in Section 10-3b, Lock Types.

5. Suppose that your database system has failed. Describe the database recovery process and the use of deferred-write and write-through techniques.

Recovery restores a database from a given state, usually inconsistent, to a previously consistent state. Depending on the type and extent of the failure, the recovery process ranges from a minor short-term inconvenience to a major long-term rebuild. Regardless of the extent of the required recovery process, recovery is not possible without backup.

The database recovery process generally follows a predictable scenario:
1. Determine the type and the extent of the required recovery.
2. If the entire database needs to be recovered to a consistent state, the recovery uses the most recent backup copy of the database in a known consistent state.
3. The backup copy is then rolled forward to restore all subsequent transactions by using the transaction log information.
4. If the database needs to be recovered but the committed portion of the database is still usable, the recovery process uses the transaction log to "undo" all the transactions that were not committed.

Recovery procedures generally make use of deferred-write and write-through techniques. With the deferred-write (or deferred-update) technique, transaction operations do not immediately update the database. Instead:
• All changes (previous and new data values) are first written to the transaction log.
• The database is updated only after the transaction reaches its commit point.
• If the transaction fails before it reaches its commit point, no changes (no rollback or undo) need to be made to the database because the database was never updated.

In contrast, if the write-through (or immediate-update) technique is used:
• The database is immediately updated by transaction operations during the transaction's execution, even before the transaction reaches its commit point.
• The transaction log is also updated, so if a transaction fails, the database uses the log information to roll back ("undo") the database to its previous state.
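The write-through undo just described is a backward walk over the transaction log, restoring each attribute to its before-value. The sketch below is illustrative Python, reusing the values assumed in Problem 1d; the dictionary-based "database" is an invention of the demo:

```python
# State after a failed write-through transaction: changes already applied.
db = {
    ("PRODUCT", "ABC", "PROD_QOH"): 24,
    ("PART", "A", "PART_QOH"): 55,
    ("PART", "B", "PART_QOH"): 11,
    ("PART", "C", "PART_QOH"): 44,
}

# Transaction log records (values from Problem 1d).
log = [
    {"op": "START"},
    {"op": "UPDATE", "table": "PRODUCT", "row": "ABC",
     "attr": "PROD_QOH", "before": 23, "after": 24},
    {"op": "UPDATE", "table": "PART", "row": "A",
     "attr": "PART_QOH", "before": 56, "after": 55},
    {"op": "UPDATE", "table": "PART", "row": "B",
     "attr": "PART_QOH", "before": 12, "after": 11},
    {"op": "UPDATE", "table": "PART", "row": "C",
     "attr": "PART_QOH", "before": 45, "after": 44},
]

def undo(db, log):
    """Roll back an uncommitted transaction: walk its log backward,
    restoring each attribute to its before-value."""
    for rec in reversed(log):
        if rec["op"] == "UPDATE":
            db[(rec["table"], rec["row"], rec["attr"])] = rec["before"]

undo(db, log)
print(db[("PRODUCT", "ABC", "PROD_QOH")])   # 23 -- restored
```

The backward order matters: if the same attribute were updated twice in one transaction, undoing in reverse guarantees the oldest before-value is the one that finally remains.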
ONLINE CONTENT The Ch10_ABC_Markets database is available at www.cengagebrain.com. This database is stored in Microsoft Access format.
6. ABC Markets sells products to customers. The relational diagram shown in Figure P10.6 represents the main entities for ABC's database. Note the following important characteristics:
• A customer may make many purchases, each one represented by an invoice.
  ➢ The CUS_BALANCE is updated with each credit purchase or payment and represents the amount the customer owes.
  ➢ The CUS_BALANCE is increased (+) with every credit purchase and decreased (-) with every customer payment.
  ➢ The date of last purchase is updated with each new purchase made by the customer.
  ➢ The date of last payment is updated with each new payment made by the customer.
• An invoice represents a product purchase by a customer.
  ➢ An INVOICE can have many invoice LINEs, one for each product purchased.
  ➢ The INV_TOTAL represents the total cost of the invoice, including taxes.
  ➢ The INV_TERMS can be "30," "60," or "90" (representing the number of days of credit) or "CASH," "CHECK," or "CC."
  ➢ The invoice status can be "OPEN," "PAID," or "CANCEL."
• A product's quantity on hand (P_QTYOH) is updated (decreased) with each product sale.
• A customer may make many payments. The payment type (PMT_TYPE) can be one of the following:
  ➢ "CASH" for cash payments.
  ➢ "CHECK" for check payments.
  ➢ "CC" for credit card payments.
• The payment details (PMT_DETAILS) are used to record data about check or credit card payments:
  ➢ The bank, account number, and check number for check payments.
  ➢ The issuer, credit card number, and expiration date for credit card payments.
Note: Not all entities and attributes are represented in this example. Use only the attributes indicated.
FIGURE P10.6 The ABC Markets Relational Diagram
Using this database, write the SQL code to represent each one of the following transactions. Use BEGIN TRANSACTION and COMMIT to group the SQL statements in logical transactions.

a. On May 11, 2016, customer '10010' makes a credit purchase (30 days) of one unit of product '11QER/31' with a unit price of $110.00; the tax rate is 8 percent. The invoice number is 10983, and this invoice has only one product line.

BEGIN TRANSACTION;
INSERT INTO INVOICE
  VALUES (10983, '10010', '11-May-2018', 118.80, '30', 'OPEN');
INSERT INTO LINE
  VALUES (10983, 1, '11QER/31', 1, 110.00);
UPDATE PRODUCT
  SET P_QTYOH = P_QTYOH - 1
  WHERE P_CODE = '11QER/31';
UPDATE CUSTOMER
  SET CUS_DATELSTPUR = '11-May-2018',
      CUS_BALANCE = CUS_BALANCE + 118.80
  WHERE CUS_CODE = '10010';
COMMIT;

b. On June 3, 2016, customer '10010' makes a payment of $100 in cash. The payment ID is 3428.

BEGIN TRANSACTION;
INSERT INTO PAYMENTS
  VALUES (3428, '03-Jun-2018', '10010', 100.00, 'CASH', 'None');
UPDATE CUSTOMER
  SET CUS_DATELSTPMT = '03-Jun-2018',
      CUS_BALANCE = CUS_BALANCE - 100.00
  WHERE CUS_CODE = '10010';
COMMIT;

7. Create a simple transaction log (using the format shown in Table 10.13) to represent the actions of the two previous transactions.

The transaction log is shown in Table P10.7.
Table P10.7 The ABC Markets Transaction Log

TRL   TRX  PREV  NEXT  OPERATION  TABLE     ROW ID    ATTRIBUTE       BEFORE      AFTER
ID    NUM  PTR   PTR                                                  VALUE       VALUE
987   101  Null  1023  START      * Start Trx. *
1023  101  987   1026  INSERT     INVOICE   10983                                 10983, 10010, 11-May-2018, 118.80, 30, OPEN
1026  101  1023  1029  INSERT     LINE      10983, 1                              10983, 1, 11QER/31, 1, 110.00
1029  101  1026  1031  UPDATE     PRODUCT   11QER/31  P_QTYOH         47          46
1031  101  1029  1032  UPDATE     CUSTOMER  10010     CUS_BALANCE     345.67      464.47
1032  101  1031  1034  UPDATE     CUSTOMER  10010     CUS_DATELSTPUR  5-May-2014  11-May-2018
1034  101  1032  Null  COMMIT     * End Trx. *
1089  102  Null  1091  START      * Start Trx. *
1091  102  1089  1095  INSERT     PAYMENT   3428                                  3428, 3-Jun-2018, 10010, 100.00, CASH, None
1095  102  1091  1096  UPDATE     CUSTOMER  10010     CUS_BALANCE     464.47      364.47
1096  102  1095  1097  UPDATE     CUSTOMER  10010     CUS_DATELSTPMT  2-May-2014  3-Jun-2018
1097  102  1096  Null  COMMIT     * End Trx. *
Note: Because we have not shown the table contents, the "before" values in the transaction log can be assumed. The "after" value must be computed using the assumed "before" value, plus or minus the transaction value. Also, to save some space, we have combined the "after" values for the INSERT statements into a single cell. Each value could instead be entered in an individual row.

8. Assuming that pessimistic locking is being used, but the two-phase locking protocol is not, create a chronological list of the locking, unlocking, and data manipulation activities that would occur during the complete processing of the transaction described in Problem 6a.

Time  Action
1     Lock INVOICE
2     Insert row 10983 into INVOICE
3     Unlock INVOICE
4     Lock LINE
5     Insert row 10983, 1 into LINE
6     Unlock LINE
7     Lock PRODUCT
8     Update PRODUCT 11QER/31, P_QTYOH from 47 to 46
9     Unlock PRODUCT
10    Lock CUSTOMER
11    Update CUSTOMER 10010, CUS_BALANCE from 345.67 to 464.47
12    Update CUSTOMER 10010, CUS_DATELSTPUR from 05-May-2014 to 11-May-2018
13    Unlock CUSTOMER
9. Assuming that pessimistic locking with the two-phase locking protocol is being used, create a chronological list of the locking, unlocking, and data manipulation activities that would occur during the complete processing of the transaction described in Problem 6a.

Time  Action
1     Lock INVOICE
2     Lock LINE
3     Lock PRODUCT
4     Lock CUSTOMER
5     Insert row 10983 into INVOICE
6     Insert row 10983, 1 into LINE
7     Update PRODUCT 11QER/31, P_QTYOH from 47 to 46
8     Update CUSTOMER 10010, CUS_BALANCE from 345.67 to 464.47
9     Update CUSTOMER 10010, CUS_DATELSTPUR from 05-May-2014 to 11-May-2018
10    Unlock INVOICE
11    Unlock LINE
12    Unlock PRODUCT
13    Unlock CUSTOMER
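The two-phase discipline (acquire every lock before releasing any) can be sketched with a toy lock manager. This is an illustrative Python example; the helper names are invented, while the tables, balance, and quantity values come from Problem 6a:

```python
import threading

locks = {t: threading.Lock() for t in ("INVOICE", "LINE", "PRODUCT", "CUSTOMER")}
db = {"CUS_BALANCE": 345.67, "P_QTYOH": 47}

def two_phase(tables, work):
    # Growing phase: acquire every needed lock before doing any work.
    # (Acquiring in a fixed order also prevents deadlock between transactions.)
    for t in sorted(tables):
        locks[t].acquire()
    try:
        work()    # all data manipulation happens while every lock is held
    finally:
        # Shrinking phase: release locks only after all work is done.
        for t in sorted(tables):
            locks[t].release()

def credit_sale():
    db["P_QTYOH"] -= 1
    db["CUS_BALANCE"] = round(db["CUS_BALANCE"] + 118.80, 2)

two_phase(["INVOICE", "LINE", "PRODUCT", "CUSTOMER"], credit_sale)
print(db)   # {'CUS_BALANCE': 464.47, 'P_QTYOH': 46}
```

Contrast this with the Problem 8 schedule, where each lock is released as soon as its table is updated: that early release is exactly what two-phase locking forbids, because another transaction could read the partially updated data.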
10. Assuming that pessimistic locking is being used, but the two-phase locking protocol is not, create a chronological list of the locking, unlocking, and data manipulation activities that would occur during the complete processing of the transaction described in Problem 6b.

Time  Action
1     Lock PAYMENT
2     Insert row 3428 into PAYMENT
3     Unlock PAYMENT
4     Lock CUSTOMER
5     Update CUSTOMER 10010, CUS_BALANCE from 464.47 to 364.47
6     Update CUSTOMER 10010, CUS_DATELSTPMT from 02-May-2014 to 03-Jun-2018
7     Unlock CUSTOMER
11. Assuming that pessimistic locking with the two-phase locking protocol is being used, create a chronological list of the locking, unlocking, and data manipulation activities that would occur during the complete processing of the transaction described in Problem 6b.

Time  Action
1     Lock PAYMENT
2     Lock CUSTOMER
3     Insert row 3428 into PAYMENT
4     Update CUSTOMER 10010, CUS_BALANCE from 464.47 to 364.47
5     Update CUSTOMER 10010, CUS_DATELSTPMT from 02-May-2014 to 03-Jun-2018
6     Unlock PAYMENT
7     Unlock CUSTOMER
Additional Problems and Answers

The following problems are designed to give your students additional practice … or you can use them as test questions.

1. Write the SQL statements that might be used in transaction management and explain how they work.

The following transaction registers the credit sale of a product to a customer.

BEGIN TRANSACTION;                    -- Start the transaction

INSERT INTO INVOICE                   -- Add a record to the invoice table
  (INV_NUM, INV_DATE, ACCNUM, TOTAL)
VALUES (1020, '15-MAR-2018', '60120010', 3500);

UPDATE INVENTORY                      -- Update the quantity on hand of the product
SET ON_HAND = ON_HAND - 100
WHERE PROD_CODE = '345TYX';

UPDATE ACCREC                         -- Update the customer account balance
SET BALANCE = BALANCE + 3500
WHERE ACCNUM = '60120010';

COMMIT;

The credit sale transaction must do all of the following:
1. Create the invoice record.
2. Update the inventory data.
3. Update the customer account data.

In SQL, the transaction begins automatically with the first SQL statement, or the user can start it explicitly with the BEGIN TRANSACTION statement. The SQL transaction ends when:
• the last SQL statement is found and/or the program ends,
• the user cancels the transaction, or
• a COMMIT or ROLLBACK statement is found.

The DBMS will ensure that all SQL statements are executed and completed before committing all work. If, for any reason, one or more of the SQL statements in the transaction cannot be completed, the entire transaction is aborted and the database is rolled back to its previous consistent state.
2. Starting with a consistent database state, trace the activities that are required to execute a set of transactions to produce an updated consistent database state.

The following example traces the execution of Problem 1's credit sale transaction. We assume that the transaction is based on a currently consistent database state and that it is the only transaction being executed by the DBMS.

Time  Operation  Table      Values                    Comment
0                                                     Database is in a consistent state.
1     Write      INVOICE    INV_NUM = 1020            INSERT the invoice number into the INVOICE table.
2     Read       INVENTORY  ON_HAND = 134             UPDATE the quantity on hand of product 345TYX.
3                           ON_HAND = 134 - 100
4     Write                 ON_HAND = 34
5     Read       ACCREC     ACC_BALANCE = 900         UPDATE the customer account balance.
6                           ACC_BALANCE = 900 + 3500
7     Write                 ACC_BALANCE = 4400
8     COMMIT                                          Permanently saves all changes to the database. The database is in a consistent state.
3. Write an example of a database request.

A database transaction is composed of one or more database requests. A database request is the equivalent of a single SQL statement (SELECT, INSERT, UPDATE, DELETE, etc.). Some database requests are as simple as:

SELECT * FROM ACCOUNT;

A database request can include references to one or more tables. For example, the request

SELECT ACCOUNT.ACCT_NUM, CUS_NUM, INV_NUM, INV_DATE, INV_TOTAL
FROM   ACCOUNT, INVOICE
WHERE  ACCOUNT.ACCT_NUM = INVOICE.ACCT_NUM
AND    ACCOUNT.ACCT_NUM = '60120010'

will list all the invoices for customer account '60120010'. Note that the preceding query accesses two tables: ACCOUNT and INVOICE. Note also that, if an attribute shows up in more than one place (as a primary key in one table and as a foreign key in another), its source must be specified to avoid an ambiguity error message. A database request may also update the database or insert a new row in a table. For example, the database request

INSERT INTO INVOICE (INV_NUM, INV_DATE, ACCT_NUM, INV_TOTAL)
VALUES (1020, '10-Feb-2018', '60120010', 3500);

will insert a new row into the INVOICE table.
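The two-table request above can be run end to end with sqlite3. The schemas and the sample invoice rows are assumptions made for illustration; the point shown is that the shared ACCT_NUM column must be qualified, or SQLite raises an "ambiguous column name" error.

```python
import sqlite3

# Sketch of the two-table database request. Schemas and sample data are
# invented; only the column names come from the example above.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE ACCOUNT (ACCT_NUM TEXT PRIMARY KEY, CUS_NUM TEXT);
CREATE TABLE INVOICE (INV_NUM INTEGER PRIMARY KEY, INV_DATE TEXT,
                      ACCT_NUM TEXT REFERENCES ACCOUNT, INV_TOTAL NUMERIC);
INSERT INTO ACCOUNT VALUES ('60120010', 'C-001');
INSERT INTO INVOICE VALUES (1010, '05-Jan-2018', '60120010', 1200);
INSERT INTO INVOICE VALUES (1020, '10-Feb-2018', '60120010', 3500);
""")
# ACCT_NUM appears in both tables, so it must be qualified to avoid an
# ambiguous-column error; CUS_NUM exists only in ACCOUNT and needs no prefix.
rows = cur.execute("""
    SELECT ACCOUNT.ACCT_NUM, CUS_NUM, INV_NUM, INV_DATE, INV_TOTAL
    FROM   ACCOUNT, INVOICE
    WHERE  ACCOUNT.ACCT_NUM = INVOICE.ACCT_NUM
    AND    ACCOUNT.ACCT_NUM = '60120010'
""").fetchall()
# Both invoices for account '60120010' are returned.
```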
4. Trace the use of the transaction log in database recovery.

The following transaction log traces the database transaction explained in problem 1.

Transaction Log

| TRL ID | TRX NUM | PREV PTR | NEXT PTR | OPERATION     | TABLE   | ROW ID   | ATTRIBUTE    | BEFORE VALUE | AFTER VALUE                             |
|--------|---------|----------|----------|---------------|---------|----------|--------------|--------------|-----------------------------------------|
| 1      | 101     | NULL     | 2        | * Start TRX * |         |          |              |              |                                         |
| 2      | 101     | 1        | 3        | INSERT        | INVOICE |          |              |              | 1020, '10-Feb-2018', '60120010', 3500   |
| 3      | 101     | 2        | 4        | UPDATE        | PRODUCT | 345TYX   | PROD_ON_HAND | 134          | 34                                      |
| 4      | 101     | 3        | 5        | UPDATE        | ACCOUNT | 60120010 | ACCT_BALANCE | 900          | 4,400                                   |
| 5      | 101     | 4        | NULL     | COMMIT        |         |          |              |              |                                         |
* The transaction ID (TRX NUM) is automatically assigned by the DBMS.

The transaction log maintains a record of all database transactions that changed the database. For example, the preceding transaction log records:
• the insertion of a new row into the INVOICE table
• the update to the PROD_ON_HAND attribute for the row identified by '345TYX' in the PRODUCT table
• the update of the ACCT_BALANCE attribute for the row identified by '60120010' in the ACCOUNT table

The transaction log also records the transaction's beginning and end to help the DBMS determine the operations that must be rolled back if the transaction is aborted. Note: only the current transaction may be rolled back, not all previous transactions. If the database must be recovered, the DBMS will:
• Change the BALANCE attribute of the row '60120010' from 4400 to 900 in the ACCREC table.
• Change the ON_HAND attribute of the row '345TYX' from 34 to 134 in the INVENTORY table.
• Delete row 1020 of the INVOICE table.

5. Suppose you are asked to evaluate a DBMS in terms of lock granularity and the different locking levels. Create a simple database environment in which these features would be important.

Lock granularity describes the different lock levels supported by a DBMS. The lock levels are:

Database-level
• The DBMS locks the entire database. If transaction T1 locks the database, transaction T2 cannot access any database tables until T1 releases the lock.
• Database-level locks work well in a batch processing environment where the bulk of the transactions are executed off-line and a batch process is used to update the master database at off-peak times, such as midnight or weekends.
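SQLite happens to lock at the database level, which makes it a convenient way to observe this coarsest granularity in practice: while one connection holds a write lock, a second connection cannot write anywhere in the database. The table name and the short lock timeout below are assumptions made for the demonstration.

```python
import sqlite3, tempfile, os

# Demonstration of database-level locking using SQLite, which locks the
# whole database file for writers. Table name and timeout are invented.
path = os.path.join(tempfile.mkdtemp(), "lockdemo.db")
t1 = sqlite3.connect(path, timeout=0.1, isolation_level=None)
t2 = sqlite3.connect(path, timeout=0.1, isolation_level=None)
t1.execute("CREATE TABLE RESEARCH (ID INTEGER PRIMARY KEY, NOTE TEXT)")

t1.execute("BEGIN IMMEDIATE")  # T1 acquires the database write lock
t1.execute("INSERT INTO RESEARCH VALUES (1, 'held by T1')")
try:
    t2.execute("BEGIN IMMEDIATE")        # T2 cannot get the lock...
    blocked = False
except sqlite3.OperationalError:         # ...and fails with "database is locked"
    blocked = True
t1.execute("COMMIT")                     # T1 releases the lock
t2.execute("INSERT INTO RESEARCH VALUES (2, 'now T2 can write')")
final_count = t1.execute("SELECT COUNT(*) FROM RESEARCH").fetchone()[0]
```

A DBMS with finer granularity (table-, page-, or row-level locks) would have let T2 proceed as long as it touched different data.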
Table-level
• The DBMS locks an entire table within a database. This lock level prevents access to any row by a transaction T2 while transaction T1 is using the table. However, two transactions can access the database as long as they access different tables.
• Table-level locks are only appropriate when table sharing is not an issue. For example, if researchers pursue unrelated research projects within a research database, the DBMS can let such users access their data without affecting the work of other users.

Page-level
• The DBMS locks an entire disk page. A disk page, or page, is the equivalent of a disk block, which may be described as a (referenced) section of a disk. A page has a fixed size, such as 4K, 8K, or 16K. A table may span several pages, and a page may contain several rows of one or more tables. Page-level locks are (currently) the most frequently used of the multiuser DBMS locking devices.
• Page-level locks are particularly appropriate in a multiuser DBMS in which data sharing is a crucial component. For example, page-level locks are common in accounting, sales, and payroll systems. In fact, just about any business DBMS application that runs in a multiuser environment benefits from page-level locking.

Row-level
• The row-level lock is much less restrictive than the locks just discussed. Row-level locks permit the DBMS to allow concurrent transactions to access different rows of the same table, even if those rows are located on the same page. Although the row-level locking approach improves the availability of data, its management requires a high overhead cost because a lock exists for each row in each table of the database.
• Row-level locking has yet to be implemented in most currently available DBMSs, so row-level locks are mostly a curiosity at this point. Nevertheless, their very high degree of shareability makes them a potential option for multiuser database applications like the ones that currently use page-level locking.

Field-level
• The field-level locking approach allows concurrent transactions to access the same row as long as they use different attributes within that row. Although field-level locking clearly yields the most flexible multiuser data access, it requires too much computer overhead to be practical at this point.
Chapter 11 Database Performance Tuning and Query Optimization

Discussion Focus

This chapter focuses on the factors that directly affect database performance. Because performance-tuning techniques can be DBMS-specific, the material in this chapter may not be applicable under all circumstances, nor will it necessarily pertain to all DBMS types. This chapter is designed to build a foundation for the general understanding of database performance-tuning issues and to help you choose appropriate performance-tuning strategies. (For the most current information about tuning your database, consult the vendor's documentation.)

• Start by covering the basic database performance-tuning concepts. Encourage students to use the web to search for information about the internal architecture (internal processes and database storage formats) of various database systems. Focus on the similarities to lay a common foundation.
• Explain how a DBMS processes SQL queries in general terms and stress the importance of indexes in query processing. Emphasize the generation of database statistics for optimum query processing.
• Step through the query processing example in section 11-4, Optimizer Choices.
• Discuss the common practices used to write more efficient SQL code. Emphasize that some practices are DBMS-specific. As technology advances, the query optimization logic becomes increasingly sophisticated and effective; therefore, some of the SQL practices illustrated in this chapter may not improve query performance as dramatically as they do in older systems.
• Finally, illustrate the chapter material using the query optimization example in section 11-8.
Answers to Review Questions

1. What is SQL performance tuning?

SQL performance tuning describes a process, on the client side, that will generate an SQL query to return the correct answer in the least amount of time, using the minimum amount of resources at the server end.

2. What is database performance tuning?

DBMS performance tuning describes a process, on the server side, that will properly configure the DBMS environment to respond to clients' requests in the fastest way possible, while making optimum use of existing resources.

3. What is the focus of most performance tuning activities, and why does that focus exist?

Most performance-tuning activities focus on minimizing the number of I/O operations, because I/O operations are much slower than reading data from the data cache.

4. What are database statistics, and why are they important?

The term database statistics refers to a number of measurements gathered by the DBMS to describe a snapshot of the database objects' characteristics. The DBMS gathers statistics about objects such as tables and indexes, and about available resources such as the number of processors used, processor speed, and temporary space available. Such statistics are used to make critical decisions about improving query processing efficiency.

5. How are database statistics obtained?

Database statistics can be gathered manually by the DBA or automatically by the DBMS. For example, many DBMS vendors support the ANALYZE command in SQL to gather statistics. In addition, many vendors have their own routines to gather statistics. For example, IBM's DB2 uses the RUNSTATS procedure, while Microsoft's SQL Server uses the UPDATE STATISTICS procedure and provides the Auto-Update and Auto-Create Statistics options in its initialization parameters.

6. What database statistics measurements are typical of tables, indexes, and resources?

For tables, typical measurements include the number of rows, the number of disk blocks used, row length, the number of columns in each row, the number of distinct values in each column, the maximum value in each column, the minimum value in each column, which columns have indexes, and so on.

For indexes, typical measurements include the number and name of columns in the index key, the number of key values in the index, the number of distinct key values in the index key, and a histogram of key values in the index.
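One concrete, runnable example of statistics gathering is SQLite's ANALYZE command, which fills the sqlite_stat1 catalog table with per-index row counts that the optimizer consults. The table and index names below are invented for the demonstration.

```python
import sqlite3

# ANALYZE in SQLite gathers index statistics into the sqlite_stat1 table.
# Table/index names and sample data are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE EMPLOYEE (EMP_NUM INTEGER PRIMARY KEY, EMP_AREACODE TEXT)")
cur.executemany("INSERT INTO EMPLOYEE VALUES (?, ?)",
                [(i, '615' if i % 4 == 0 else '901') for i in range(1000)])
cur.execute("CREATE INDEX EMP_NDX1 ON EMPLOYEE(EMP_AREACODE)")
cur.execute("ANALYZE")  # gather the statistics
stats = cur.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
# sqlite_stat1 now holds a row for EMP_NDX1 whose stat string begins with
# the total row count (1000), followed by the average rows per distinct key.
```

This is the SQLite analogue of Oracle's ANALYZE ... COMPUTE STATISTICS, DB2's RUNSTATS, and SQL Server's UPDATE STATISTICS mentioned above.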
For resources, typical measurements include the logical and physical disk block size, the location and size of data files, the number of extents per data file, and so on.

7. How is the processing of SQL DDL statements (such as CREATE TABLE) different from the processing required by DML statements?

A DDL statement actually updates the data dictionary tables or system catalog, while a DML statement (SELECT, INSERT, UPDATE, and DELETE) mostly manipulates end-user data.

8. In simple terms, the DBMS processes queries in three phases. What are those phases, and what is accomplished in each phase?

The three phases are:
1. Parsing. The DBMS parses the SQL query and chooses the most efficient access/execution plan.
2. Execution. The DBMS executes the SQL query using the chosen execution plan.
3. Fetching. The DBMS fetches the data and sends the result set back to the client.

Parsing involves breaking the query into smaller units and transforming the original SQL query into a slightly different version of the original SQL code that is fully equivalent and more efficient. Fully equivalent means that the optimized query results are always the same as those of the original query. More efficient means that the optimized query will almost always execute faster than the original query. (We say almost always because many factors affect the performance of a database, including the network, the client's computer resources, and even other queries running concurrently in the same database.)

After the parsing and execution phases are completed, all rows that match the specified condition(s) have been retrieved, sorted, grouped, and/or, if required, aggregated. During the fetching phase, the rows of the resulting query result set are returned to the client. During this phase, the DBMS may use temporary table space to store temporary data.

9. If indexes are so important, why not index every column in every table? (Include a brief discussion of the role played by data sparsity.)

Indexing every column in every table will tax the DBMS too much in terms of index-maintenance processing, especially if the table has many attributes, many rows, and/or requires many inserts, updates, and/or deletes.

One measure used to determine the need for an index is the data sparsity of the column you want to index. Data sparsity refers to the number of different values a column could possibly have. For example, a STU_SEX column in a STUDENT table can have only two possible values, "M" or "F"; therefore, that column is said to have low sparsity. In contrast, a STU_DOB column that stores the student's date of birth can have many different date values; therefore, that column is said to have high sparsity. Knowing the sparsity helps you decide whether or not the use of an index is appropriate. For
example, when you perform a search in a column with low sparsity, you are very likely to read a high percentage of the table rows anyway; therefore, index processing may be unnecessary work.

10. What is the difference between a rule-based optimizer and a cost-based optimizer?

A rule-based optimizer uses a set of preset rules and points to determine the best approach to execute a query. The rules assign a "cost" to each SQL operation; the costs are then added to yield the cost of the execution plan. A cost-based optimizer uses sophisticated algorithms based on the statistics about the objects being accessed to determine the best approach to execute a query. In this case, the optimizer process adds up the processing cost, the I/O costs, and the resource costs (RAM and temporary space) to come up with the total cost of a given execution plan.

11. What are optimizer hints, and how are they used?

Hints are special instructions for the optimizer that are embedded inside the SQL command text. Although the optimizer generally performs very well under most circumstances, there are cases in which the optimizer may not choose the best execution plan. Remember, the optimizer makes decisions based on the existing statistics. If the statistics are old, the optimizer may not do a good job of selecting the best execution plan. Even with current statistics, the optimizer's choice may not be the most efficient one. On such occasions, the end user can use hints to change the optimizer mode for the current SQL statement.

12. What are some general guidelines for creating and using indexes?

Create indexes for each single attribute used in a WHERE, HAVING, ORDER BY, or GROUP BY clause. If you create indexes for all single attributes used in search conditions, the DBMS will access the table using an index scan instead of a full table scan. For example, if you have an index for P_PRICE, the condition P_PRICE > 10.00 can be solved by accessing the index instead of sequentially scanning all table rows and evaluating P_PRICE for each row. Indexes are also used in join expressions, such as CUSTOMER.CUS_CODE = INVOICE.CUS_CODE.

Do not use indexes in small tables or tables with low sparsity. Remember, small tables and low-sparsity tables are not the same thing. A search condition in a table with low sparsity may return a high percentage of table rows anyway, making the index operation too costly and making the full table scan a viable option. Using the same logic, do not create indexes for tables with few rows and few attributes unless you must ensure the existence of unique values in a column.

Declare primary and foreign keys so the optimizer can use the indexes in join operations. All natural joins and old-style joins will benefit if you declare primary keys and foreign keys, because the optimizer will use the available indexes at join time. (The declaration of a PK or FK will automatically create an index for the declared column.) Also, for the same reason, it is better to write joins using the SQL JOIN syntax. (See Chapter 8, "Advanced SQL.")
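The index-scan-versus-full-table-scan distinction can be observed directly with SQLite's EXPLAIN QUERY PLAN. The PRODUCT table and its sample prices below are invented stand-ins for the P_PRICE example above.

```python
import sqlite3

# Observe whether the optimizer uses an index by inspecting the access plan.
# Table name, data, and index name are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE PRODUCT (P_CODE TEXT PRIMARY KEY, P_PRICE NUMERIC)")
cur.executemany("INSERT INTO PRODUCT VALUES (?, ?)",
                [(f'P{i:04d}', i * 0.75) for i in range(500)])

def plan(sql):
    """Concatenate the detail column of EXPLAIN QUERY PLAN output."""
    return " ".join(row[3] for row in cur.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM PRODUCT WHERE P_PRICE = 10.00"
before = plan(query)   # without an index: a sequential scan of PRODUCT
cur.execute("CREATE INDEX PROD_NDX1 ON PRODUCT(P_PRICE)")
after = plan(query)    # with the index: a direct search using PROD_NDX1
```

Before the index exists, the plan reports a scan of the table; afterward, it reports a search using PROD_NDX1, which is exactly the index-scan behavior the guideline describes.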
Declare indexes in join columns other than PK/FK. If you do join operations on columns other than the primary and foreign keys, you may be better off declaring indexes in those columns.

13. Most query optimization techniques are designed to make the optimizer's work easier. What factors should you keep in mind if you intend to write conditional expressions in SQL code?

Use simple columns or literals as operands in a conditional expression; avoid the use of conditional expressions with functions whenever possible. Comparing the contents of a single column to a literal is faster than comparing to expressions.

Numeric field comparisons are faster than character, date, and NULL comparisons. In search conditions, comparing a numeric attribute to a numeric literal is faster than comparing a character attribute to a character literal. In general, numeric comparisons (integer, decimal) are handled faster by the CPU than character and date comparisons. Because indexes do not store references to null values, NULL conditions involve additional processing and therefore tend to be the slowest of all conditional operands.

Equality comparisons are faster than inequality comparisons. As a general rule, equality comparisons are processed faster than inequality comparisons. For example, P_PRICE = 10.00 is processed faster because the DBMS can do a direct search using the index in the column. If there are no exact matches, the condition is evaluated as false. However, if you use an inequality symbol (>, >=, <, <=), the DBMS must perform additional processing to complete the request, because there will almost always be more "greater than" or "less than" values than exactly "equal" values in the index. The slowest comparison operator (with the exception of NULL) is LIKE with wildcard symbols, as in V_CONTACT LIKE '%glo%'. Also, using the "not equal" symbol (<>) yields slower searches, especially if the sparsity of the data is high; that is, if there are many more different values than equal values.

Whenever possible, transform conditional expressions to use literals. For example, if your condition is P_PRICE - 10 = 7, change it to read P_PRICE = 17. Also, if you have a composite condition such as:

P_QOH < P_MIN AND P_MIN = P_REORDER AND P_QOH = 10

change it to read:

P_QOH = 10 AND P_MIN = P_REORDER AND P_MIN > 10

When using multiple conditional expressions, write the equality conditions first. (Note that we did this in the previous example.) Remember, equality conditions are faster to process than inequality conditions. Although most RDBMSs will automatically do this for you, paying attention to this detail lightens the load for the query optimizer. (The optimizer won't have to do what you have already done.)

If you use multiple AND conditions, write the condition most likely to be false first. If you use this technique, the DBMS will stop evaluating the rest of the conditions as soon as it finds a conditional expression that evaluates to false. Remember, for multiple AND conditions to be found true, all conditions must evaluate as true; if one condition evaluates to false, everything else is evaluated as false. Therefore, if you use this technique, the DBMS won't waste time unnecessarily
evaluating additional conditions. Naturally, the use of this technique implies an implicit knowledge of the sparsity of the data set.

Whenever possible, try to avoid the use of the NOT logical operator. It is best to transform an SQL expression containing a NOT logical operator into an equivalent expression. For example, NOT (P_PRICE > 10.00) can be written as P_PRICE <= 10.00. Also, NOT (EMP_SEX = 'M') can be written as EMP_SEX = 'F'.

14. What recommendations would you make for managing the data files in a DBMS with many tables and indexes?

First, create independent data files for the system, index, and user-data table spaces. Put the data files on separate disks or RAID volumes. This ensures that index operations will not conflict with end-user data or data dictionary table access operations.

Second, put high-usage end-user tables in their own table spaces. By doing this, the database minimizes conflicts with other tables and maximizes storage utilization.

Third, evaluate the creation of indexes based on the access patterns. Identify common search criteria and isolate the most frequently used columns in search conditions. Create indexes on high-usage columns with high sparsity.

Fourth, evaluate the use of aggregate queries in your database. Identify columns used in aggregate functions and determine whether the creation of indexes on such columns will improve response time.

Finally, identify columns used in ORDER BY clauses and make sure there are indexes on those columns.

15. What does RAID stand for, and what are some commonly used RAID levels?

RAID is the acronym for Redundant Array of Independent Disks. RAID systems use multiple disks to create virtual disks (storage volumes) formed by several individual disks, providing a balance between performance improvement and fault tolerance. Table 11.7 in the text shows the commonly used RAID levels. (We have reproduced the table for your convenience.)
TABLE 11.7 Common RAID Configurations

| RAID Level | Description |
|------------|-------------|
| 0 | The data blocks are spread over separate drives (also known as a striped array). Provides increased performance but no fault tolerance. (Fault tolerance means that, in case of failure, data can be reconstructed and retrieved.) Requires a minimum of two drives. |
| 1 | The same data blocks are written (duplicated) to separate drives (also referred to as mirroring or duplexing). Provides increased read performance and fault tolerance via data redundancy. Requires a minimum of two drives. |
| 3 | The data are striped across separate drives, and parity data are computed and stored in a dedicated drive. (Parity data are specially generated data that permit the reconstruction of corrupted or missing data.) Provides good read performance and fault tolerance via parity data. Requires a minimum of three drives. |
| 5 | The data and the parity are striped across separate drives. Provides good read performance and fault tolerance via parity data. Requires a minimum of three drives. |
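The parity reconstruction used by RAID levels 3 and 5 can be sketched with a few lines of XOR arithmetic: the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the surviving blocks plus the parity. The block contents below are invented.

```python
# Toy sketch of RAID parity: parity = XOR of the data blocks, so any one
# lost block is recoverable from the rest. Block contents are invented.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

drive1 = b"DATA-ONE"
drive2 = b"DATA-TWO"
parity = xor_blocks(drive1, drive2)   # stored on the parity drive

# Simulate losing drive2: rebuild it from drive1 and the parity block.
rebuilt = xor_blocks(drive1, parity)  # equals drive2's original contents
```

The same property holds symmetrically: losing drive1 instead would be recovered as XOR(drive2, parity).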
Problem Solutions

Problems 1 and 2 are based on the following query:

SELECT   EMP_LNAME, EMP_FNAME, EMP_AREACODE, EMP_SEX
FROM     EMPLOYEE
WHERE    EMP_SEX = 'F'
AND      EMP_AREACODE = '615'
ORDER BY EMP_LNAME, EMP_FNAME;

1. What is the likely data sparsity of the EMP_SEX column?

Because this column has only two possible values ("M" and "F"), the EMP_SEX column has low sparsity.

2. What indexes should you create? Write the required SQL commands.

You should create an index on EMP_AREACODE and a composite index on EMP_LNAME, EMP_FNAME. In the following solution, we have named the two indexes EMP_NDX1 and EMP_NDX2, respectively. The required SQL commands are:

CREATE INDEX EMP_NDX1 ON EMPLOYEE(EMP_AREACODE);
CREATE INDEX EMP_NDX2 ON EMPLOYEE(EMP_LNAME, EMP_FNAME);
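A useful side effect of the composite EMP_NDX2 index is that it can satisfy the ORDER BY without a separate sort step. SQLite's EXPLAIN QUERY PLAN makes this visible; the EMPLOYEE schema and sample rows below are invented stand-ins.

```python
import sqlite3

# Show that a composite index on (EMP_LNAME, EMP_FNAME) removes the sort
# step from the access plan. Schema and sample data are invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE EMPLOYEE (EMP_NUM INTEGER PRIMARY KEY,
               EMP_LNAME TEXT, EMP_FNAME TEXT, EMP_AREACODE TEXT, EMP_SEX TEXT)""")
cur.executemany("INSERT INTO EMPLOYEE VALUES (?,?,?,?,?)",
                [(i, f'L{i}', f'F{i}', '615' if i % 2 else '901',
                  'F' if i % 3 else 'M') for i in range(200)])
query = """SELECT EMP_LNAME, EMP_FNAME FROM EMPLOYEE
           ORDER BY EMP_LNAME, EMP_FNAME"""

def plan(sql):
    """Concatenate the detail column of EXPLAIN QUERY PLAN output."""
    return " ".join(r[3] for r in cur.execute("EXPLAIN QUERY PLAN " + sql))

without_index = plan(query)  # the plan needs a temporary B-tree to sort
cur.execute("CREATE INDEX EMP_NDX2 ON EMPLOYEE(EMP_LNAME, EMP_FNAME)")
with_index = plan(query)     # the plan walks EMP_NDX2 in order instead
```

This mirrors the SORT-operation rows in the access-plan tables that follow: when the index delivers rows already ordered, the sort step disappears.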
3. Using Table 11.4 as an example, create two alternative access plans. Use the following assumptions:
a. There are 8,000 employees.
b. There are 4,150 female employees.
c. There are 370 employees in area code 615.
d. There are 190 female employees in area code 615.

The solution is shown in Table P11.3.

TABLE P11.3 Comparing Access Plans and I/O Costs

| Plan | Step | Operation | I/O Operations | I/O Cost | Resulting Set Rows | Total I/O Cost |
|------|------|-----------|----------------|----------|--------------------|----------------|
| A | A1 | Full table scan EMPLOYEE; select only rows with EMP_SEX = 'F' and EMP_AREACODE = '615' | 8,000 | 8,000 | 190 | 8,000 |
| A | A2 | SORT operation | 190 | 190 | 190 | 8,190 |
| B | B1 | Index scan range of EMP_NDX1 | 370 | 370 | 370 | 370 |
| B | B2 | Table access by RowID EMPLOYEE | 370 | 370 | 370 | 740 |
| B | B3 | Select only rows with EMP_SEX = 'F' | 370 | 370 | 190 | 930 |
| B | B4 | SORT operation | 190 | 190 | 190 | 1,120 |
As you examine Table P11.3, note that in Plan A the DBMS uses a full table scan of EMPLOYEE. The SORT operation is done to order the output by employee last name and first name. In Plan B, the DBMS uses an index scan range of the EMP_NDX1 index to get the EMPLOYEE RowIDs. After the EMPLOYEE RowIDs have been retrieved, the DBMS uses those RowIDs to get the EMPLOYEE rows. Next, the DBMS selects only those rows with EMP_SEX = 'F'. Finally, the DBMS sorts the result set by employee last name and first name.

Problems 4-6 are based on the following query:

SELECT EMP_LNAME, EMP_FNAME, EMP_DOB, YEAR(EMP_DOB) AS YEAR
FROM   EMPLOYEE
WHERE  YEAR(EMP_DOB) = 1966;

4. What is the likely data sparsity of the EMP_DOB column?

Because the EMP_DOB column stores employees' birth dates, this column is very likely to have high data sparsity.
5. Should you create an index on EMP_DOB? Why or why not?

Creating an index on the EMP_DOB column would not help this query, because the query applies the YEAR function to the column. However, if the same column is used in other queries, you may want to re-evaluate the decision not to create the index.

6. What type of database I/O operations will likely be used by the query? (See Table 11.3.)

This query will more than likely use a full table scan to read all rows of the EMPLOYEE table and generate the required output. We have reproduced the table here to facilitate your discussion:

TABLE 11.3 Sample DBMS Access Plan I/O Operations

| Operation | Description |
|-----------|-------------|
| Table scan (full) | Reads the entire table sequentially, from the first row to the last row, one row at a time (slowest) |
| Table access (row ID) | Reads a table row directly, using the row ID value (fastest) |
| Index scan (range) | Reads the index first to obtain the row IDs and then accesses the table rows directly (faster than a full table scan) |
| Index access (unique) | Used when a table has a unique index in a column |
| Nested loop | Reads and compares a set of values to another set of values, using a nested-loop style (slow) |
| Merge | Merges two data sets (slow) |
| Sort | Sorts a data set (slow) |
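The point of problem 5 can be observed directly in SQLite: wrapping the indexed column in a function keeps the optimizer from using the index and forces a full scan. SQLite has no YEAR() function, so strftime('%Y', ...) stands in for it; the sargable BETWEEN rewrite shown for contrast is our addition, not from the manual.

```python
import sqlite3

# A function applied to an indexed column defeats the index; an equivalent
# range predicate on the bare column does not. Schema/data are invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE EMPLOYEE (EMP_NUM INTEGER PRIMARY KEY, EMP_DOB TEXT)")
cur.executemany("INSERT INTO EMPLOYEE VALUES (?, ?)",
                [(i, f'19{60 + i % 20}-06-15') for i in range(100)])
cur.execute("CREATE INDEX EMP_NDX3 ON EMPLOYEE(EMP_DOB)")

def plan(sql):
    """Concatenate the detail column of EXPLAIN QUERY PLAN output."""
    return " ".join(r[3] for r in cur.execute("EXPLAIN QUERY PLAN " + sql))

by_function = plan("SELECT * FROM EMPLOYEE "
                   "WHERE strftime('%Y', EMP_DOB) = '1966'")  # full scan
by_range = plan("SELECT * FROM EMPLOYEE "
                "WHERE EMP_DOB BETWEEN '1966-01-01' AND '1966-12-31'")
```

The function form scans the whole table, exactly as the answer to problem 6 predicts; the range form searches via EMP_NDX3.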
Problems 7-10 are based on the ER model shown in Figure P11.7 and on the query shown after the figure.

Figure P11.7 The Ch11_SaleCo ER Model

Given the following query:

SELECT P_CODE, P_PRICE
FROM   PRODUCT
WHERE  P_PRICE >= (SELECT AVG(P_PRICE) FROM PRODUCT);

7. Assuming that there are no table statistics, what type of optimization will the DBMS use?

The DBMS will use rule-based optimization.

8. What type of database I/O operations will likely be used by the query? (See Table 11.3.)

The DBMS will likely use a full table scan to compute the average price in the inner subquery. The DBMS is also very likely to use another full table scan of PRODUCT to execute the outer query. (We have reproduced the table for your convenience.)
TABLE 11.3 Sample DBMS Access Plan I/O Operations

| Operation | Description |
|-----------|-------------|
| Table scan (full) | Reads the entire table sequentially, from the first row to the last row, one row at a time (slowest) |
| Table access (row ID) | Reads a table row directly, using the row ID value (fastest) |
| Index scan (range) | Reads the index first to obtain the row IDs and then accesses the table rows directly (faster than a full table scan) |
| Index access (unique) | Used when a table has a unique index in a column |
| Nested loop | Reads and compares a set of values to another set of values, using a nested-loop style (slow) |
| Merge | Merges two data sets (slow) |
| Sort | Sorts a data set (slow) |
9. What is the likely data sparsity of the P_PRICE column?

Because each product is likely to have a different price, the P_PRICE column is likely to have high sparsity.

10. Should you create an index? Why or why not?

Yes, you should create an index, because the P_PRICE column has high sparsity and is very likely to be used in many different SQL queries as part of a conditional expression.

Problems 11-14 are based on the following query:

SELECT   P_CODE, SUM(LINE_UNITS)
FROM     LINE
GROUP BY P_CODE
HAVING   SUM(LINE_UNITS) > (SELECT MAX(LINE_UNITS) FROM LINE);

11. What is the likely data sparsity of the LINE_UNITS column?

The LINE_UNITS column in the LINE table represents the quantity purchased of a given product in a given invoice. This column is likely to have many different values; therefore, the column is very likely to have high sparsity.

12. Should you create an index? If so, what would the index column(s) be, and why would you create that index? If not, explain your reasoning.

Yes, you should create an index on LINE_UNITS. This index is likely to help in the execution of the inner query that computes the maximum value of LINE_UNITS.
13. Should you create an index on P_CODE? If so, write the SQL command to create that index. If not, explain your reasoning.

Yes, creating an index on P_CODE will help in query execution. However, most DBMSs automatically index foreign key columns. If this is not the case in your DBMS, you can manually create an index using the command CREATE INDEX LINE_NDX1 ON LINE(P_CODE); (Note that we have named the index LINE_NDX1.)

14. Write the command to create statistics for this table.

ANALYZE TABLE LINE COMPUTE STATISTICS;

Problems 15-16 are based on the following query:

SELECT P_CODE, P_QOH*P_PRICE
FROM   PRODUCT
WHERE  P_QOH*P_PRICE > (SELECT AVG(P_QOH*P_PRICE) FROM PRODUCT);

15. What is the likely data sparsity of the P_QOH and P_PRICE columns?

The P_QOH and P_PRICE columns are likely to have high data sparsity.

16. Should you create an index? If so, what would the index column(s) be, and why should you create that index?

In this case, creating an index on P_QOH or on P_PRICE will not help the query execute faster, for two reasons: first, the WHERE condition in the outer query uses an expression, and second, the aggregate function also uses an expression. When expressions are used as the operands of a conditional expression, the DBMS will not use indexes available on the columns that appear in the expressions.

Problems 17-19 are based on the following query:

SELECT   V_CODE, V_NAME, V_CONTACT, V_STATE
FROM     VENDOR
WHERE    V_STATE = 'TN'
ORDER BY V_NAME;

17. What indexes should you create and why? Write the SQL command to create the indexes.

You should create an index on the V_STATE column of the VENDOR table, because this query uses that column in its conditional criteria. In addition, you should create an index on V_NAME, because it is used in the ORDER BY clause. The commands to create the indexes are:

CREATE INDEX VEND_NDX1 ON VENDOR(V_STATE);
CREATE INDEX VEND_NDX2 ON VENDOR(V_NAME);
Note that we have used the index names VEND_NDX1 and VEND_NDX2, respectively.

18. Assume that 10,000 vendors are distributed as shown in Table P11.18. What percentage of rows will be returned by the query?

Table P11.18 Distribution of Vendors by State

| State | Number of Vendors | State | Number of Vendors |
|-------|-------------------|-------|-------------------|
| AK | 15    | MS | 47  |
| AL | 55    | NC | 358 |
| AZ | 100   | NH | 25  |
| CA | 3,244 | NJ | 645 |
| CO | 345   | NV | 16  |
| FL | 995   | OH | 821 |
| GA | 75    | OK | 62  |
| HI | 68    | PA | 425 |
| IL | 89    | RI | 12  |
| IN | 12    | SC | 65  |
| KS | 19    | SD | 74  |
| KY | 45    | TN | 113 |
| LA | 29    | TX | 589 |
| MD | 208   | UT | 36  |
| MI | 745   | VA | 375 |
| MO | 35    | WA | 258 |

Given the distribution of values in Table P11.18, the query will return 113 of the 10,000 rows, or 1.13% of the total table rows.

19. What type of I/O database operations would be most likely to be used to execute that query?

Assuming that you create the index on V_STATE and generate the statistics on the VENDOR table, the DBMS is very likely to use an index scan range to access the index data and then use table access by row ID to get the VENDOR rows.
20. Using Table 11.4 as an example, create two alternative access plans.

The two access plans are shown in Table P11.20.
Table P11.20 Comparing Access Plans and I/O Costs

Plan  Step  Operation                              I/O Operations  I/O Cost  Resulting Set Rows  Total I/O Cost
A     A1    Full table scan VENDOR;
            select only rows with V_STATE = 'TN'   10,000          10,000    113                 10,000
A     A2    SORT operation                         113             113       113                 10,113
B     B1    Index scan range of VEND_NDX1          113             113       113                 113
B     B2    Table access by RowID VENDOR           113             113       113                 226
B     B3    SORT operation                         113             113       113                 339
In Plan A, the DBMS uses a full table scan of VENDOR. The SORT operation is done to order the output by vendor name. In Plan B, the DBMS uses an index scan range of the VEND_NDX1 index to get the VENDOR RowIDs. Next, the DBMS uses the RowIDs to get the VENDOR rows. Finally, the DBMS sorts the result set by V_NAME.

21. Assume that you have 10,000 different products stored in the PRODUCT table and that you are writing a Web-based interface to list all products with a quantity on hand (P_QOH) that is less than or equal to the minimum quantity, P_MIN. What optimizer hint would you use to ensure that your query returns the result set to the Web interface in the least time possible? Write the SQL code.

You will write your query using the FIRST_ROWS hint to minimize the time it takes to return the first set of rows to the application. The query would be:

SELECT /*+ FIRST_ROWS */ *
FROM PRODUCT
WHERE P_QOH <= P_MIN;
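The qualitative difference between Plan A's full scan and Plan B's index search can be observed in any DBMS that exposes its access plans. Here is a sketch using SQLite, whose NOT INDEXED clause plays the role of an optimizer hint that forces the full-scan plan (the table and index names follow the problem; SQLite's plan output format differs from Oracle's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE VENDOR (V_CODE INTEGER, V_NAME TEXT, V_STATE TEXT)")
conn.execute("CREATE INDEX VEND_NDX1 ON VENDOR(V_STATE)")

def plan(sql):
    # Collapse the EXPLAIN QUERY PLAN rows into one readable string.
    return " | ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Plan A: force a full table scan, then sort the filtered rows.
plan_a = plan("SELECT * FROM VENDOR NOT INDEXED "
              "WHERE V_STATE = 'TN' ORDER BY V_NAME")

# Plan B: let the optimizer use VEND_NDX1, then sort the smaller result set.
plan_b = plan("SELECT * FROM VENDOR WHERE V_STATE = 'TN' ORDER BY V_NAME")

print("Plan A:", plan_a)   # a SCAN of the whole table
print("Plan B:", plan_b)   # a SEARCH using VEND_NDX1
```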
Problems 22-24 are based on the following query:

SELECT P_CODE, P_DESCRIPT, P_PRICE, P.V_CODE, V_STATE
FROM PRODUCT P, VENDOR V
WHERE P.V_CODE = V.V_CODE
AND V_STATE = 'NY'
AND V_AREACODE = '212'
ORDER BY P_PRICE;
22. What indexes would you recommend?

This query uses the V_STATE and V_AREACODE attributes in its conditional criteria, and both criteria are equality comparisons. Given these conditions, an index on V_STATE and another index on V_AREACODE are highly recommended.

23. Write the commands required to create the indexes you recommended in Problem 22.

CREATE INDEX VEND_NDX1 ON VENDOR(V_STATE);
CREATE INDEX VEND_NDX2 ON VENDOR(V_AREACODE);

Note that we have used the index names VEND_NDX1 and VEND_NDX2, respectively.

24. Write the command(s) used to generate the statistics for the PRODUCT and VENDOR tables.

ANALYZE TABLE PRODUCT COMPUTE STATISTICS;
ANALYZE TABLE VENDOR COMPUTE STATISTICS;

Problems 25 and 26 are based on the following query:

SELECT P_CODE, P_DESCRIPT, P_QOH, P_PRICE, V_CODE
FROM PRODUCT
WHERE V_CODE = '21344'
ORDER BY P_CODE;
25. What index would you recommend, and what command would you use?

This query uses one WHERE condition and one ORDER BY clause. The conditional expression uses the V_CODE column in an equality comparison, so creating an index on the V_CODE attribute is recommended. If V_CODE is declared to be a foreign key, the DBMS may already have created such an index automatically; if not, create one manually. The ORDER BY clause uses the P_CODE column, and creating an index on the columns used in an ORDER BY clause is generally recommended. However, because P_CODE is the primary key of the PRODUCT table, a unique index already exists for that column, and it is not necessary to create another one.
26. How should you rewrite the query to ensure that it uses the index you created in your solution to Problem 25?

In this case, the only index that should be created is the index on the V_CODE column. Assuming that such an index is called PROD_NDX1, you could use an optimizer hint as shown next:

SELECT /*+ INDEX(PROD_NDX1) */ P_CODE, P_DESCRIPT, P_QOH, P_PRICE, V_CODE
FROM PRODUCT
WHERE V_CODE = '21344'
ORDER BY P_CODE;
Problems 27 and 28 are based on the following query:

SELECT P_CODE, P_DESCRIPT, P_QOH, P_PRICE, V_CODE
FROM PRODUCT
WHERE P_QOH < P_MIN
AND P_MIN = P_REORDER
AND P_REORDER = 50
ORDER BY P_QOH;
27. Use the recommendations given in Section 11-5b to rewrite the query to produce the required results more efficiently.

SELECT P_CODE, P_DESCRIPT, P_QOH, P_PRICE, V_CODE
FROM PRODUCT
WHERE P_REORDER = 50
AND P_MIN = 50
AND P_QOH < 50
ORDER BY P_QOH;
This new query rewrites some conditions as follows:
• Because P_REORDER must equal 50, it replaces P_MIN = P_REORDER with P_MIN = 50.
• Because P_MIN must therefore be 50, it replaces P_QOH < P_MIN with P_QOH < 50.
Using literals in the query conditions makes the query more efficient. Note that you still need all three conditions in the query.

28. What indexes would you recommend? Write the commands to create those indexes.

Because the query uses equality comparisons on P_REORDER and P_MIN and a range comparison on P_QOH, you should have indexes on those columns. The commands to create the indexes are:

CREATE INDEX PROD_NDX1 ON PRODUCT(P_REORDER);
CREATE INDEX PROD_NDX2 ON PRODUCT(P_MIN);
CREATE INDEX PROD_NDX3 ON PRODUCT(P_QOH);
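As a sanity check, the original and rewritten queries in Problem 27 are logically equivalent and must return the same rows. A quick test in SQLite with a few invented PRODUCT rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE PRODUCT (
    P_CODE TEXT, P_DESCRIPT TEXT, P_QOH INTEGER,
    P_MIN INTEGER, P_REORDER INTEGER, P_PRICE REAL, V_CODE TEXT)""")
rows = [
    ("P1", "hammer", 20, 50, 50,  9.99, "V1"),  # qualifies on all three conditions
    ("P2", "saw",    60, 50, 50, 19.99, "V1"),  # P_QOH is not below P_MIN
    ("P3", "drill",  10, 40, 40, 89.99, "V2"),  # P_REORDER is not 50
]
conn.executemany("INSERT INTO PRODUCT VALUES (?,?,?,?,?,?,?)", rows)

original = conn.execute("""SELECT P_CODE FROM PRODUCT
    WHERE P_QOH < P_MIN AND P_MIN = P_REORDER AND P_REORDER = 50
    ORDER BY P_QOH""").fetchall()
rewritten = conn.execute("""SELECT P_CODE FROM PRODUCT
    WHERE P_REORDER = 50 AND P_MIN = 50 AND P_QOH < 50
    ORDER BY P_QOH""").fetchall()
print(original, rewritten)  # both return only P1
```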
Problems 29-32 are based on the following query:

SELECT CUS_CODE, MAX(LINE_UNITS * LINE_PRICE)
FROM CUSTOMER NATURAL JOIN INVOICE NATURAL JOIN LINE
WHERE CUS_AREACODE = '615'
GROUP BY CUS_CODE;
29. Assuming that you generate 15,000 invoices per month, what recommendation would you give the designer about the use of derived attributes?

This query uses the MAX aggregate function to compute the maximum invoice line value by customer. Because the table grows at a rate of 15,000 rows per month, the query would take a considerable amount of time to run as the number of invoice rows increases. Furthermore, because the MAX aggregate function uses an expression (LINE_UNITS * LINE_PRICE) instead of a simple table column, the query optimizer is very likely to perform a full table scan in order to compute the maximum invoice line value. One way to speed up the query would be to store the derived attribute LINE_TOTAL in the LINE table and create an index on LINE_TOTAL. That way, the query could use the index.

30. Assuming that you follow the recommendations you gave in Problem 29, how would you rewrite the query?

SELECT CUS_CODE, MAX(LINE_TOTAL)
FROM CUSTOMER NATURAL JOIN INVOICE NATURAL JOIN LINE
WHERE CUS_AREACODE = '615'
GROUP BY CUS_CODE;
31. What indexes would you recommend for the query you wrote in Problem 30, and what SQL commands would you use?

The query will benefit from having an index on CUS_AREACODE and an index on CUS_CODE. Because CUS_CODE is a foreign key in INVOICE, it is very likely that an index on it already exists. In any case, the query uses CUS_AREACODE in an equality comparison, so an index on this column is highly recommended. The command to create this index would be:

CREATE INDEX CUS_NDX1 ON CUSTOMER(CUS_AREACODE);

32. How would you rewrite the query to ensure that the index you created in Problem 31 is used?

You need to use the INDEX optimizer hint:

SELECT /*+ INDEX(CUS_NDX1) */ CUS_CODE, MAX(LINE_TOTAL)
FROM CUSTOMER NATURAL JOIN INVOICE NATURAL JOIN LINE
WHERE CUS_AREACODE = '615'
GROUP BY CUS_CODE;
Chapter 12 Distributed Database Management Systems

Discussion Focus

Discuss the possible data request scenarios in a distributed database environment.

1. Single request accessing a single remote database. (See Figure D12.1.)
Figure D12.1 Single Request to Single Remote DBMS
The most primitive and least effective of the distributed database scenarios is based on a single SQL statement (a "request" or "unit of work") directed to a single remote DBMS. Such a request is known as a remote request. We suggest that you remind the student of the distinction between a request and a transaction:
• A request uses a single SQL statement to request data.
• A transaction is a collection of two or more SQL statements.

2. Multiple requests accessing a single remote database. (See Figure D12.2.)
Figure D12.2 Multiple Requests to a Single Remote DBMS
A unit of work now consists of multiple SQL statements directed to a single remote DBMS. The local user defines the start/stop sequence of the units of work using COMMIT, but the remote DBMS manages the unit of work's processing.

3. Multiple requests accessing multiple remote databases. (See Figure D12.3.)
Figure D12.3 Multiple requests, Multiple Remote DBMSes
A unit of work now may be composed of multiple SQL statements directed to multiple remote DBMSes. However, any one SQL statement may access only one of the remote DBMSes. As was true in the second scenario, the local user defines the start/stop sequence of the units of work, using COMMIT, but the remote DBMS to which the SQL statement was directed manages the unit of work's processing. In this scenario, a two-phase COMMIT must be used to coordinate COMMIT processing for the multiple locations.
4. Multiple requests accessing any combination of multiple remote DBMSes. (See Figure D12.4.)
Figure D12.4 Multiple Requests and any Combination of Remote Databases
A unit of work now may consist of multiple SQL statements addressed to multiple remote DBMSes, and each SQL statement may address any combination of databases. As was true in the third scenario, each local user defines the start/stop sequence of the units of work, using COMMIT, but the remote DBMS to which the SQL statement was directed manages the unit of work's processing. A two-phase COMMIT must be used to coordinate COMMIT processing for the multiple locations.
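The coordination logic behind the two-phase COMMIT mentioned above can be sketched in a few lines: the coordinator first collects votes from every participating remote DP (phase 1) and broadcasts the final COMMIT only if all vote yes (phase 2). This is a toy simulation for classroom discussion, not protocol code:

```python
def two_phase_commit(participants):
    """participants: dict mapping DP site name -> a prepare() callable
    that returns True (vote COMMIT) or False (vote ABORT)."""
    # Phase 1 (preparation): every site must vote to commit.
    votes = {site: prepare() for site, prepare in participants.items()}
    if all(votes.values()):
        # Phase 2: the coordinator broadcasts the final COMMIT to all sites.
        return "COMMIT"
    # A single NO vote (or a site that never answers) forces a global ROLLBACK.
    return "ROLLBACK"

print(two_phase_commit({"B": lambda: True, "C": lambda: True}))   # COMMIT
print(two_phase_commit({"B": lambda: True, "C": lambda: False}))  # ROLLBACK
```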
Remaining discussion focus: The review questions cover a wide range of distributed database concept and design issues. The most important questions to be raised are:
• What is the difference between a distributed database and distributed processing?
• What is a fully distributed database management system?
• Why is there a need for a two-phase commit protocol, and what are its two phases?
• What does "data fragmentation" mean, and what strategies are available to deal with data fragmentation?
• Why and how must data replication be addressed in a distributed database environment? What replication strategies are available, and how do they work?
• Since the current literature abounds with references to file servers and client/server architectures, what do these terms mean? How are file servers different from client/server architectures? Why would you want to know?
We have answered these questions in detail in the Answers to Review Questions section of this chapter. Note particularly the answers to questions 5, 6, 11, and 15-17.
NOTE Many questions raised in this section are more specific -- and certainly more technical -- than the questions raised in the previous chapters. Since the chapter covers the answers to these questions in great detail, we have elected to give you section references to avoid needless duplication.
Answers to Review Questions

1. Describe the evolution from centralized DBMSs to distributed DBMSs.
This question is answered in detail in section 12-1.

2. List and discuss some of the factors that influenced the evolution of the DDBMS.
These factors are listed and discussed in section 12-1.

3. What are the advantages of the DDBMS?
See section 12-2 and Table 12.1.

4. What are the disadvantages of the DDBMS?
See section 12-2 and Table 12.1.

5. Explain the difference between distributed database and distributed processing.
See section 12-3.

6. What is a fully distributed database management system?
See section 12-4.

7. What are the components of a DDBMS?
See section 12-5.

8. List and explain the transparency features of a DDBMS.
See section 12-7.
9. Define and explain the different types of distribution transparency.
See section 12-8.

10. Describe the different types of database requests and transactions.
A database transaction is formed by one or more database requests. Each database request is the equivalent of a single SQL statement. The basic difference between a local transaction and a distributed transaction is that the latter can update or request data from several remote sites on a network. In a DDBMS, a database request and a database transaction can be of two types: remote or distributed.
NOTE The figure references in the discussions refer to the figures found in the text. The figures are not reproduced in this manual.

A remote request accesses data located at a single remote database processor (DP) site. In other words, an SQL statement (or request) can reference data at only one remote DP site. Use Figure 12.9 to illustrate the remote request.

A remote transaction, composed of several requests, accesses data at only a single remote DP site. Use Figure 12.10 to illustrate the remote transaction. As you discuss Figure 12.10, note that both tables are located at a remote DP (site B) and that the complete transaction can reference only one remote DP. Each SQL statement (or request) can reference only one (the same) remote DP at a time; the entire transaction can reference only one remote DP; and it is executed at only one remote DP.

A distributed transaction allows a transaction to reference several different local or remote DP sites. Although each single request can reference only one local or remote DP site, the complete transaction can reference multiple DP sites because each request can reference a different site. Use Figure 12.11 to illustrate the distributed transaction.

A distributed request lets us reference data from several different DP sites. Since each request can access data from more than one DP site, a transaction can access several DP sites. The ability to execute a distributed request requires fully distributed database processing because we must be able to:
1. Partition a database table into several fragments.
2. Reference one or more of those fragments with only one request.
In other words, we must have fragmentation transparency.
The location and partition of the data should be transparent to the end user. Use Figure 12.12 to illustrate the distributed request. As you discuss Figure 12.12, note that the transaction uses a single SELECT statement to reference two tables, CUSTOMER and INVOICE. The two tables are located at two different remote DP sites, B and C.

The distributed request feature also allows a single request to reference a physically partitioned table. For example, suppose that a CUSTOMER table is divided into two fragments, C1 and C2, located at sites B and C, respectively. The end user wants to obtain a list of all customers whose balance exceeds $250.00. Use Figure 12.13 to illustrate this distributed request. Note that full fragmentation support is provided only by a DDBMS that supports distributed requests.

11. Explain the need for the two-phase commit protocol. Then describe the two phases.
See section 12-9c.

12. What is the objective of the query optimization functions?
The objective of the query optimization functions is to minimize the total cost associated with the execution of a database request. The costs associated with a request are a function of:
• the access time (I/O) cost involved in accessing the physical data stored on disk
• the communication cost associated with the transmission of data among nodes in distributed database systems
• the CPU time cost
It is difficult to separate communication and processing costs. Query-optimization algorithms use different parameters, and they assign different weights to each parameter. For example, some algorithms minimize total time, others minimize communication time, and still others do not factor in the CPU time, considering it insignificant relative to the other costs. Query optimization must provide distribution and replica transparency in distributed database systems.

13. To which transparency feature are the query optimization functions related?
Query-optimization functions are associated with the performance transparency features of a DDBMS. In a DDBMS, the query-optimization routines are more complicated because the DDBMS must decide where, and which fragment of, the database to access: data fragments are stored at several sites, and the data fragments may be replicated at several sites.

14. What issues should be considered when resolving data requests in a distributed environment?
A data request could be either a read or a write request, although most requests tend to be read requests. In both cases, resolving data requests in a distributed data environment must consider the following issues:
• Data distribution
• Data replication
• Network and node availability
A more detailed discussion of these factors can be found in section 12-10.

15. Describe the three data fragmentation strategies. Give some examples of each.
See section 12-11a.

16. What is data replication, and what are the three replication strategies?
See section 12-11b.

17. What are the two basic styles of data replication?
There are basically two styles of replication:
• Push replication. The originating DP node sends the changes to the replica nodes to ensure that all data are mutually consistent.
• Pull replication. The originating DP node notifies the replica nodes so they can pull the updates on their own time.
See section 12-11b for more information.

18. What trade-offs are involved in building highly distributed data environments?
In the year 2000, Dr. Eric Brewer stated in a presentation that "in any highly distributed data system there are three common desirable properties: consistency, availability, and partition tolerance. However, it is impossible for a system to provide all three properties at the same time." Therefore, system designers have to balance the trade-offs among these properties in order to provide a workable system. This is what is known as the CAP theorem. For more information, see section 12-12.

19. How does a BASE system differ from a traditional distributed database system?
A traditional database system enforces the ACID properties to ensure that all database transactions yield a database in a consistent state. In a centralized database system, all data reside in a centralized node. However, in a distributed database system, data are located in multiple, geographically dispersed sites connected via a network. In such cases, network latency and network partitioning impose a new level of complexity.
In most highly distributed systems, designers tend to emphasize availability over data consistency and partition tolerance. This trade-off has given way to a new type of database system in which data are basically available, soft state, and eventually consistent (BASE). For more information about BASE systems, see section 12-12.
Problem Solutions

The first problem is based on the DDBMS scenario in Figure P12.1.
Figure P12.1 The DDBMS Scenario for Problem 1

TABLES     FRAGMENTS  LOCATION
CUSTOMER   N/A        A
PRODUCT    PROD_A     A
PRODUCT    PROD_B     B
INVOICE    N/A        B
INV_LINE   N/A        B
1. Specify the minimum type of operation (remote request, remote transaction, distributed transaction, or distributed request) the database must support to perform each of the following operations.
NOTE To answer the following questions, remind the students that the key to each answer is in the number of different data processors that are accessed by each request/transaction. Ask the students to first identify how many different DP sites are to be accessed by the transaction/request. Next, remind the students that a distributed request is necessary if a single SQL statement is to access more than one DP site. Use the following summary:

Operation      1 DP     >1 DP
Request        Remote   Distributed
Transaction    Remote   Distributed

Based on this summary, the questions are answered easily.
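The summary above can be turned into a small classification helper for checking answers (a hypothetical teaching aid, not textbook code). Each SQL statement is represented by the set of DP sites it touches:

```python
def classify(statements):
    """statements: list of sets, each holding the DP sites that one
    SQL statement accesses. Returns the minimum capability required."""
    # A single statement spanning more than one DP forces distributed
    # *request* support, regardless of how many statements there are.
    if any(len(sites) > 1 for sites in statements):
        return "distributed request"
    all_sites = set().union(*statements)
    kind = "request" if len(statements) == 1 else "transaction"
    scope = "remote" if len(all_sites) == 1 else "distributed"
    return f"{scope} {kind}"

print(classify([{"B"}]))              # remote request
print(classify([{"B"}, {"B"}]))       # remote transaction
print(classify([{"B"}, {"C"}]))       # distributed transaction
print(classify([{"A", "B"}]))         # distributed request
```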
At C:

a. SELECT *
   FROM CUSTOMER;
This SQL sequence represents a remote request.

b. SELECT *
   FROM INVOICE
   WHERE INV_TOTAL < 1000;

This SQL sequence represents a remote request.

c. SELECT *
   FROM PRODUCT
   WHERE PROD_QOH < 10;
This SQL sequence represents a distributed request. Note that the distributed request is required when a single request must access two DP sites. The PRODUCT table is composed of two fragments, PROD_A and PROD_B, which are located at sites A and B, respectively.

d. BEGIN WORK;
   UPDATE CUSTOMER
   SET CUS_BALANCE = CUS_BALANCE + 100
   WHERE CUS_NUM = '10936';
   INSERT INTO INVOICE(INV_NUM, CUS_NUM, INV_DATE, INV_TOTAL)
   VALUES ('986391', '10936', '15-FEB-2018', 100);
   INSERT INTO INVLINE(INV_NUM, PROD_CODE, LINE_PRICE)
   VALUES ('986391', '1023', 100);
   UPDATE PRODUCT
   SET PROD_QOH = PROD_QOH - 1
   WHERE PROD_CODE = '1023';
   COMMIT WORK;

This SQL sequence represents a distributed request. Note that the UPDATE CUSTOMER and the two INSERT statements only require remote request capabilities. However, the entire transaction must access more than one remote DP site, so we also need distributed transaction capability. The last UPDATE PRODUCT statement accesses two remote sites because the PRODUCT table is divided into two fragments located at two remote DP sites. Therefore, the transaction as a whole requires distributed request capability.
e. BEGIN WORK;
   INSERT INTO CUSTOMER(CUS_NUM, CUS_NAME, CUS_ADDRESS, CUS_BAL)
   VALUES ('34210', 'Victor Ephanor', '123 Main St', 0.00);
   INSERT INTO INVOICE(INV_NUM, CUS_NUM, INV_DATE, INV_TOTAL)
   VALUES ('986434', '34210', '10-AUG-2018', 2.00);
   COMMIT WORK;

This SQL sequence represents a distributed transaction. Note that, in this transaction, each individual request requires only remote request capabilities. However, the transaction as a whole accesses two remote sites. Therefore, distributed transaction capability is required.

At A:

f. SELECT CUS_NUM, CUS_NAME, INV_TOTAL
   FROM CUSTOMER, INVOICE
   WHERE CUSTOMER.CUS_NUM = INVOICE.CUS_NUM;
This SQL sequence represents a distributed request. Note that the request accesses two DP sites, one local and one remote. Therefore, distributed request capability is needed.

g. SELECT *
   FROM INVOICE
   WHERE INV_TOTAL > 1000;
This SQL sequence represents a remote request, because it accesses only one remote DP site.

h. SELECT *
   FROM PRODUCT
   WHERE PROD_QOH < 10;
This SQL sequence represents a distributed request. In this case, the PRODUCT table is partitioned between two DP sites, A and B. Although the request accesses only one remote DP site, it accesses a table that is partitioned into two fragments: PROD_A and PROD_B. A single request can access a partitioned table only if the DBMS supports distributed requests.

At B:

i. SELECT *
   FROM CUSTOMER;
This SQL sequence represents a remote request.
j. SELECT CUS_NAME, INV_TOTAL
   FROM CUSTOMER, INVOICE
   WHERE INV_TOTAL > 1000
   AND CUSTOMER.CUS_NUM = INVOICE.CUS_NUM;
This SQL sequence represents a distributed request.

k. SELECT *
   FROM PRODUCT
   WHERE PROD_QOH < 10;
This SQL sequence represents a distributed request. (See the explanation for part h.)

2. The following data structure and constraints exist for a magazine publishing company.
a. The company publishes one regional magazine each in Florida (FL), South Carolina (SC), Georgia (GA), and Tennessee (TN).
b. The company has 300,000 customers (subscribers) distributed throughout the four states listed in Part 2a.
c. On the first of each month, an annual subscription INVOICE is printed and sent to each customer whose subscription is due for renewal. The INVOICE entity contains a REGION attribute to indicate the customer's state of residence (FL, SC, GA, TN):

CUSTOMER (CUS_NUM, CUS_NAME, CUS_ADDRESS, CUS_CITY, CUS_STATE, CUS_ZIP, CUS_SUBSDATE)
INVOICE (INV_NUM, INV_REGION, CUS_NUM, INV_DATE, INV_TOTAL)

The company is aware of the problems associated with centralized management and has decided that it is time to decentralize the management of the subscriptions in its four regional subsidiaries. Each subscription site will handle its own customer and invoice data. The management at company headquarters, however, will have access to customer and invoice data to generate annual reports and to issue ad hoc queries, such as:
• List all current customers by region.
• List all new customers by region.
• Report all invoices by customer and by region.
Given these requirements, how must you partition the database?

The CUSTOMER table must be partitioned horizontally by state. (We show the partitions in the answer to 3c.)
3. Given the scenario and the requirements in Problem 2, answer the following questions:

a. What recommendations will you make regarding the type and characteristics of the required database system?

The magazine publishing company requires a distributed system with distributed database capabilities. The system will be distributed among the company locations in South Carolina, Georgia, Florida, and Tennessee. The DDBMS must be able to support distributed transparency features, such as fragmentation transparency, replica transparency, transaction transparency, and performance transparency. Heterogeneous capability is not a mandatory feature, since we assume there is no existing DBMS in place and that the company wants to standardize on a single DBMS.

b. What type of data fragmentation is needed for each table?

The database must be horizontally partitioned, using the STATE attribute for the CUSTOMER table and the REGION attribute for the INVOICE table.

c. What must be the criteria used to partition each database?

The following fragmentation segments reflect the criteria used to partition each database:

Horizontal Fragmentation of the CUSTOMER Table by State
Fragment Name   Location         Condition            Node Name
C1              Tennessee        CUS_STATE = 'TN'     NAS
C2              Georgia          CUS_STATE = 'GA'     ATL
C3              Florida          CUS_STATE = 'FL'     TAM
C4              South Carolina   CUS_STATE = 'SC'     CHA

Horizontal Fragmentation of the INVOICE Table by Region
Fragment Name   Location         Condition              Node Name
I1              Tennessee        REGION_CODE = 'TN'     NAS
I2              Georgia          REGION_CODE = 'GA'     ATL
I3              Florida          REGION_CODE = 'FL'     TAM
I4              South Carolina   REGION_CODE = 'SC'     CHA
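The fragmentation criteria above amount to a simple routing rule. Sketched in Python (the fragment and node names come from the tables above; the helper itself is hypothetical):

```python
# Map each state of residence to its (fragment, node), per the criteria above.
FRAGMENTS = {
    "TN": ("C1", "NAS"),
    "GA": ("C2", "ATL"),
    "FL": ("C3", "TAM"),
    "SC": ("C4", "CHA"),
}

def route_customer(cus_state):
    """Return the (fragment, node) that stores a customer's row."""
    return FRAGMENTS[cus_state]

print(route_customer("GA"))  # ('C2', 'ATL')
```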
d. Design the database fragments. Show an example with node names, location, fragment names, attribute names, and demonstration data.

Note the following fragments:

Fragment C1   Location: Tennessee   Node: NAS
CUS_NUM  CUS_NAME          CUS_ADDRESS        CUS_CITY   CUS_STATE  CUS_SUB_DATE
10884    James D. Burger   123 Court Avenue   Memphis    TN         8-DEC-16
10993    Lisa B. Barnette  910 Eagle Street   Nashville  TN         12-MAR-17

Fragment C2   Location: Georgia   Node: ATL
CUS_NUM  CUS_NAME           CUS_ADDRESS       CUS_CITY  CUS_STATE  CUS_SUB_DATE
11887    Ginny E. Stratton  335 Main Street   Atlanta   GA         11-AUG-16
13558    Anna H. Ariona     657 Mason Ave.    Dalton    GA         23-JUN-17

Fragment C3   Location: Florida   Node: TAM
CUS_NUM  CUS_NAME           CUS_ADDRESS         CUS_CITY  CUS_STATE  CUS_SUB_DATE
10014    John T. Chi        456 Brent Avenue    Miami     FL         18-NOV-16
15998    Lisa B. Barnette   234 Ramala Street   Tampa     FL         23-MAR-17

Fragment C4   Location: South Carolina   Node: CHA
CUS_NUM  CUS_NAME           CUS_ADDRESS          CUS_CITY    CUS_STATE  CUS_SUB_DATE
21562    Thomas F. Matto    45 N. Pratt Circle   Charleston  SC         2-DEC-16
18776    Mary B. Smith      526 Boone Pike       Charleston  SC         28-OCT-17

Fragment I1   Location: Tennessee   Node: NAS
INV_NUM  REGION_CODE  CUS_NUM  INV_DATE   INV_TOTAL
213342   TN           10884    1-NOV-15   45.95
209987   TN           10993    15-FEB-16  45.95
Fragment I2   Location: Georgia   Node: ATL
INV_NUM  REGION_CODE  CUS_NUM  INV_DATE   INV_TOTAL
198893   GA           11887    15-AUG-15  70.45
224345   GA           13558    1-JUN-16   45.95

Fragment I3   Location: Florida   Node: TAM
INV_NUM  REGION_CODE  CUS_NUM  INV_DATE   INV_TOTAL
200915   FL           10014    1-NOV-15   45.95
231148   FL           15998    1-MAR-16   24.95

Fragment I4   Location: South Carolina   Node: CHA
INV_NUM  REGION_CODE  CUS_NUM  INV_DATE   INV_TOTAL
243312   SC           21562    15-NOV-15  45.95
231156   SC           18776    1-OCT-16   45.95
e. What type of distributed database operations must be supported at each remote site?

To answer this question, you must first draw a map of the locations, the fragments at each location, and the type of transaction or request support required to access the data in the distributed database.

Node                              NAS    ATL    TAM    CHA    Headquarters
CUSTOMER fragment                 C1     C2     C3     C4
INVOICE fragment                  I1     I2     I3     I4
Distributed operations required   none   none   none   none   distributed request

Given the problem's specifications, you conclude that no interstate access of CUSTOMER or INVOICE data is required. Therefore, no distributed database access is required at the four nodes. For the headquarters, the manager wants to be able to access the data in all four nodes through a single SQL request. Therefore, the DDBMS must support distributed requests.

f. What type of distributed database operations must be supported at the headquarters site?

See the answer for part e.
Chapter 13 Business Intelligence and Data Warehouses

Discussion Focus

Start by discussing the need for business intelligence in a highly competitive global economy. Note that business intelligence (BI) describes a comprehensive, cohesive, and integrated set of applications used to capture, collect, integrate, store, and analyze data with the purpose of generating and presenting information used to support business decision making. As the name implies, BI is about creating intelligence about a business. This intelligence is based on learning and understanding the facts about a business environment. BI is a framework that allows a business to transform data into information, information into knowledge, and knowledge into wisdom. BI has the potential to positively affect a company's culture by creating "business wisdom" and distributing it to all users in an organization. This business wisdom empowers users to make sound business decisions based on the accumulated knowledge of the business as reflected in recorded facts (historic operational data). Table 13.1 in the text gives some real-world examples of companies that have implemented BI tools (data warehouse, data mart, OLAP, and/or data mining tools) and shows how the use of such tools benefited those companies.

Discuss the need for data analysis and how such analysis is used to make strategic decisions. The computer systems that support strategic decision making are known as decision support systems (DSS). Explain what a DSS is and what its main functional components are. (Use Figure 13.1.) The effectiveness of a DSS depends on the quality of the data gathered at the operational level. Therefore, remind the students of the importance of proper operational database design, and use this reminder to briefly review the major (production database) design issues that were explored in Chapters 3, 4, 5, and 6.
Next, review Section 13-3a to illustrate how operational and decision support data differ (use the summary in Table 13.5), placing special emphasis on the characteristics that form the foundation for decision support analysis:
• timespan
• granularity
• dimensionality
(See Section 13-3a and use Figure 13.2 to illustrate the conversion from operational data to DSS data.)

After a thorough discussion of these three characteristics, students should be able to understand what the main DSS database requirements are. Note how these three requirements match the main characteristics of a DSS and its decision support data. After laying this foundation, introduce the data warehouse concept. A data warehouse is a database that provides support for decision making. Using Section 13-4 as the basis for your discussion, note that a data warehouse database must be:
• Integrated.
• Subject-oriented.
• Time-variant.
• Non-volatile.
After you have explained each one of these four characteristics in detail, your students should understand:
• What the characteristics are of the data likely to be found in a data warehouse.
• How the data warehouse is a part of a BI infrastructure.
Stress that the data warehouse is a major component of the BI infrastructure. Discuss the contents of Table 13.9, “Twelve Rules for a Data Warehouse.” (See Inmon, Bill, and Chuck Kelley, "The Twelve Rules of Data Warehouse for a Client/Server World," Data Management Review, 4(5), May 1994, pp. 6-16.) The data warehouse stores the data needed for decision support. On-Line Analytical Processing (OLAP) refers to a set of tools used by the end users to access and analyze such data. Therefore, the data warehouse and OLAP tools complement each other. By illustrating various OLAP architectures, the instructor will help students see how:
• Operational data are transformed to data warehouse data.
• Data warehouse data are extracted for analysis.
• Multidimensional tools are used to analyze the extracted data.
The OLAP architectures are yet another example of the application of client/server concepts to systems development. Because they are the key to data warehouse design, star schemas constitute the chapter's focal point. Therefore, make sure that the following data warehouse design components are thoroughly understood (see Section 13-5):
• Facts.
• Dimensions.
• Attributes.
• Attribute hierarchies.
These four concepts are used to implement data warehouses in the relational database environment. Carefully explain the chapter's Sales and Orders star schema construction to help ensure that students are equipped to handle the actual design of star schemas. Illustrate the use of performance-enhancing techniques (Section 13-5f) and OLAP (Section 13-6). Introduce data analytics and how it is used to extract knowledge from data.
Explain the use of explanatory and predictive analytics (see Section 13-7). Then, get students involved in doing some hands-on SQL examples using SQL analytic functions. Finally, introduce students to data visualization (see Section 13-9). Illustrate how visualization is used to quickly identify trends, patterns, and relationships using Figures 13.26 and 13.27. Explain how the science of data visualization has evolved and how it is used to discover the “story behind the data”; use Figure 13.28 to show the importance of the science behind data visualization. Use the Vehicle Crash Analysis dashboard in Figure 13.29 to show how data visualization can be used to quickly extract information from data.
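For the hands-on SQL analytics discussion, a minimal sketch such as the following can be run in class. The sales table and its columns are invented for illustration, and the snippet assumes an SQLite build recent enough (3.25+) to support window functions:

```python
import sqlite3

# Hypothetical sales table used to illustrate a SQL analytic (window) function.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, month INTEGER, amount REAL);
INSERT INTO sales VALUES
 ('East', 1, 100), ('East', 2, 150), ('East', 3, 120),
 ('West', 1, 200), ('West', 2, 180), ('West', 3, 220);
""")

# Running total per region -- a typical analytic query that would be
# awkward to express with plain GROUP BY aggregation.
rows = conn.execute("""
    SELECT region, month, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY month)
               AS running_total
    FROM sales
    ORDER BY region, month
""").fetchall()

for r in rows:
    print(r)
```

The last column shows the cumulative amount within each region, which students can compare against a manual tally on the board.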
Answers to Review Questions
ONLINE CONTENT Answers to selected Review Questions and Problems for this chapter are contained in the Premium Website for this book.
1. What is business intelligence? Give some recent examples of BI usage, using the Internet for assistance. What BI benefits have companies found? Business intelligence (BI) is a term used to describe a comprehensive, cohesive, and integrated set of applications used to capture, collect, integrate, store, and analyze data with the purpose of generating and presenting information used to support business decision making. As the name implies, BI is about creating intelligence about a business. This intelligence is based on learning and understanding the facts about a business environment. BI is a framework that allows a business to transform data into information, information into knowledge, and knowledge into wisdom. BI has the potential to positively affect a company's culture by creating “business wisdom” and distributing it to all users in an organization. This business wisdom empowers users to make sound business decisions based on the accumulated knowledge of the business as reflected in recorded facts (historic operational data). Table 13.1 in the text gives some real-world examples of companies that have implemented BI tools (data warehouse, data mart, OLAP, and/or data mining tools) and shows how the use of such tools benefited the companies. Emphasize that the main focus of BI is to gather, integrate, and store business data for the purpose of creating information. BI integrates people and processes using technology in order to add value to the business. Such value is derived from how end users use such information in their daily activities, and in particular, their daily business decision making. Also note that the BI technology components are varied. Examples of BI usage found in web sources: 1) The Dallas Teachers Credit Union (DTCU) used geographical data analysis to increase its customer base from 250,000 professional educators to 3.5 million potential customers, virtually overnight.
The increase gave the credit union the ability to compete with larger banks that had a strong presence in Dallas. [http://www.computerworld.com/s/article/47371/Business_Intelligence?taxonomyId=120] 2) Researchers from the Rand Corporation recently applied business intelligence and analytics technology to determine the dangerous side effects of prescription drugs. [http://www.panorama.com/industry-news/article-view.html?name=Analytics-spots-prescriptionproblems-508338] 3) See the Microsoft Case Study web site for hundreds of cases about business intelligence usage. [http://www.microsoft.com/casestudies/]
2. Describe the BI framework. Illustrate the evolution of BI. BI is not a product by itself, but a framework of concepts, practices, tools, and technologies that help a business better understand its core capabilities, provide snapshots of the company situation, and identify key opportunities to create competitive advantage. In practice, BI provides a well-orchestrated framework for the management of data that works across all levels of the organization. BI involves the following general steps:
1. Collecting and storing operational data
2. Aggregating the operational data into decision support data
3. Analyzing decision support data to generate information
4. Presenting such information to the end user to support business decisions
5. Making business decisions, which in turn generate more data that is collected, stored, etc. (restarting the process)
6. Monitoring results to evaluate outcomes of the business decisions (providing more data to be collected, stored, etc.)
To implement all these steps, BI uses varied components and technologies. Section 13-2 discusses these components and technologies; see Table 13.2. Figure 13.2 illustrates the evolution of BI formats.
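The first three BI steps listed in this answer can be sketched in a few lines of code. The transaction records and field names below are invented for the example; the point is only to show operational rows being aggregated into decision support data and then analyzed:

```python
from collections import defaultdict

# Step 1: operational data -- one record per sale (hypothetical data).
transactions = [
    {"product": "X", "region": "East", "amount": 10.0},
    {"product": "X", "region": "West", "amount": 25.0},
    {"product": "Y", "region": "East", "amount": 5.0},
    {"product": "X", "region": "East", "amount": 15.0},
]

# Step 2: aggregate into decision support data (totals by product/region).
totals = defaultdict(float)
for t in transactions:
    totals[(t["product"], t["region"])] += t["amount"]

# Step 3: analyze the decision support data -- e.g., best-selling product.
by_product = defaultdict(float)
for (product, _region), amount in totals.items():
    by_product[product] += amount
best = max(by_product, key=by_product.get)
print(best, by_product[best])
```

Steps 4 through 6 (presentation, decision making, monitoring) would sit on top of aggregates like these.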
3. What are decision support systems, and what role do they play in the business environment? Decision Support Systems (DSS) are based on computerized tools that are used to enhance managerial decision making. Because complex data and the proper analysis of such data are crucial to strategic and tactical decision making, DSS are essential to the well-being and even survival of businesses that must compete in a global marketplace. 4. Explain how the main components of the BI architecture interact to form a system. Describe the evolution of BI information dissemination formats. Refer the students to Section 13-3 in the chapter. Emphasize that there is no single BI architecture; instead, it ranges from highly integrated applications from a single vendor to a loosely integrated, multi-vendor environment. However, there are some general types of functionality that all BI implementations share. Like any critical business IT infrastructure, the BI architecture is composed of data, people, processes, technology, and the management of those components. Figure 13.1 (in the text) depicts how all those components fit together within the BI framework. Figure 13.2, in Section 13-2c, “Business Intelligence Evolution,” tracks the changes in business intelligence reporting and information dissemination over time. In summary:
1) 1970s: centralized reports running on mainframes, minicomputers, or central server environments. Such reports were predefined and took considerable time to process.
2) 1980s: desktop computers downloaded spreadsheet data from central locations.
3) 1990s: first-generation DSS, centralized reporting, and OLAP.
4) 2000s: BI web-based dashboards and mobile BI.
5) 2010s to present: Big Data, NoSQL, and data visualization.
5. What are the most relevant differences between operational and decision support data? Operational data and decision support data serve different purposes. Therefore, it is not surprising to learn that their formats and structures differ. Most operational data are stored in a relational database in which the structures (tables) tend to be highly normalized. Operational data storage is optimized to support transactions that represent daily operations. For example, each time an item is sold, it must be accounted for. Customer data, inventory data, and so on, are in a frequent update mode. To provide effective update performance, operational systems store data in many tables, each with a minimum number of fields. Thus, a simple sales transaction might be represented by five or more different tables (for example, invoice, invoice line, discount, store, and department). Although such an arrangement is excellent in an operational database, it is not efficient for query processing. For example, to extract a simple invoice, you would have to join several tables. Whereas operational data are useful for capturing daily business transactions, decision support data give tactical and strategic business meaning to the operational data. From the data analyst's point of view, decision support data differ from operational data in three main areas: time span, granularity, and dimensionality. 1. Time span. Operational data cover a short time frame. In contrast, decision support data tend to cover a longer time frame. Managers are seldom interested in a specific sales invoice to customer X; rather, they tend to focus on sales generated during the last month, the last year, or the last five years. 2. Granularity (level of aggregation). Decision support data must be presented at different levels of aggregation, from highly summarized to near-atomic.
For example, if managers must analyze sales by region, they must be able to access data showing the sales by region, by city within the region, by store within the city within the region, and so on. In that case, summarized data to compare the regions is required, but also data in a structure that enables a manager to drill down, or decompose, the data into more atomic components (that is, finer-grained data at lower levels of aggregation). In contrast, when you roll up the data, you are aggregating the data to a higher level. 3. Dimensionality. Operational data focus on representing individual transactions rather than on the effects of the transactions over time. In contrast, data analysts tend to include many data dimensions and are interested in how the data relate over those dimensions. For example, an analyst might want to know how product X fared relative to product Z during the past six months by region, state, city, store, and customer. In that case, both place and time are part of the picture. Figure 13.3 (in the text) shows how decision support data can be examined from multiple dimensions (such as product, region, and year), using a variety of filters to produce each dimension. The ability to analyze, extract, and present information in meaningful ways is one of the differences between decision support data and transaction-at-a-time operational data. The DSS components that form a system are shown in the text's Figure 13.1. Note that:
• The data store component is basically a DSS database that contains business data and business-model data. These data represent a snapshot of the company situation.
• The data extraction and filtering component is used to extract, consolidate, and validate the data store.
• The end user query tool is used by the data analyst to create the queries used to access the database.
• The end user presentation tool is used by the data analyst to organize and present the data.
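The roll-up and drill-down behavior described in this answer can be sketched with two queries over the same facts at different aggregation levels. The table, columns, and figures below are invented for the example:

```python
import sqlite3

# Hypothetical fact rows carrying a location hierarchy (region > city).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, city TEXT, amount REAL);
INSERT INTO sales VALUES
 ('East', 'Boston', 100), ('East', 'Boston', 50),
 ('East', 'Albany', 70),  ('West', 'Denver', 90);
""")

# Rolled-up view: one summary row per region.
by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Drilled-down view: decompose one region into its cities.
east_cities = conn.execute(
    "SELECT city, SUM(amount) FROM sales WHERE region = 'East' "
    "GROUP BY city ORDER BY city"
).fetchall()

print(by_region)
print(east_cities)
```

The same detail rows support both granularities, which is exactly what decision support data structures must allow.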
6. What is a data warehouse, and what are its main characteristics? How does it differ from a data mart? A data warehouse is an integrated, subject-oriented, time-variant, and non-volatile database that provides support for decision-making. (See Section 13-4 for an in-depth discussion of the main characteristics.) The data warehouse is usually a read-only database optimized for data analysis and query processing. Typically, data are extracted from various sources and are then transformed and integrated—in other words, passed through a data filter—before being loaded into the data warehouse. Users access the data warehouse via front-end tools and/or end-user application software to extract the data in usable form. Figure 13.4 in the text illustrates how a data warehouse is created from the data contained in an operational database. You might be tempted to think that the data warehouse is just a big summarized database. But a good data warehouse is much more than that. A complete data warehouse architecture includes support for a decision support data store, a data extraction and integration filter, and a specialized presentation interface. To be useful, the data warehouse must conform to uniform structures and formats to avoid data conflicts and to support decision making. In fact, before a decision support database can be considered a true data warehouse, it must conform to the twelve rules described in Section 13-4b and illustrated in Table 13.9. 7. Give three examples of likely problems when operational data are integrated into the data warehouse. Within different departments of a company, operational data may vary in terms of how they are recorded or in terms of data type and structure. For instance, the status of an order may be indicated with text labels such as "open", "received", "cancel", or "closed" in one department while another department has it as "1", "2", "3", or "4".
The student status can be defined as "Freshman", "Sophomore", "Junior", or "Senior" in the Accounting department and as "FR", "SO", "JR", or "SR" in the Computer Information Systems department. A social security number field may be stored in one database as a string of numbers and dashes ('XXX-XX-XXXX'), in another as a string of numbers without the dashes ('XXXXXXXXX'), and in yet a third as a numeric field (#########). Most of the data transformation problems are related to incompatible data formats, the use of synonyms and homonyms, and the use of different coding schemes.
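A transformation step like the one these examples require can be sketched as follows. The mapping table and helper function are invented for illustration; the codes themselves come from the examples above:

```python
# ETL transformation sketch: reconcile incompatible codes and formats
# before loading rows into the warehouse.
ORDER_STATUS_MAP = {
    "open": "open", "received": "received",
    "cancel": "cancel", "closed": "closed",
    "1": "open", "2": "received", "3": "cancel", "4": "closed",
}

def normalize_ssn(value):
    """Normalize SSNs stored as 'XXX-XX-XXXX', 'XXXXXXXXX', or a number
    into the single dashed format used in the warehouse."""
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    return f"{digits[0:3]}-{digits[3:5]}-{digits[5:9]}"

print(ORDER_STATUS_MAP["2"])          # numeric department code -> label
print(normalize_ssn("123-45-6789"))   # already dashed
print(normalize_ssn(123456789))       # numeric field
```

Each source system keeps its own convention; the filter guarantees that only one convention reaches the warehouse.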
Use the following scenario to answer questions 8 through 14. While working as a database analyst for a national sales organization, you are asked to be part of its data warehouse project team.
8. Prepare a high-level summary of the main requirements for evaluating DBMS products for data warehousing. There are four primary ways to evaluate a DBMS that is tailored to provide fast answers to complex queries:
• the database schema supported by the DBMS
• the availability and sophistication of data extraction and loading tools
• the end user analytical interface
• the database size requirements
Establish the requirements based on the size of the database, the data sources, the necessary data transformations, and the end user query requirements. Determine what type of database is needed, i.e., a multidimensional database or a relational database using the star schema. Other valid evaluation criteria include the cost of acquisition and available upgrades (if any), training, technical and development support, performance, ease of use, and maintenance. 9. Your data warehousing project group is debating whether to create a prototype of a data warehouse before its implementation. The project group members are especially concerned about the need to acquire some data warehousing skills before implementing the enterprise-wide data warehouse. What would you recommend? Explain your recommendations. Knowing that data warehousing requires time, money, and considerable managerial effort, many companies create data marts instead. Data marts use smaller, more manageable data sets that are targeted to fit the special needs of small groups within the organization. In other words, data marts are small, single-subject data warehouse subsets. Data mart development and use costs are lower, and the implementation time is shorter. Once the data marts have demonstrated their ability to serve the DSS, they can be expanded to become data warehouses or they can be migrated into larger existing data warehouses. 10. Suppose you are selling the data warehouse idea to your users. How would you define multidimensional data analysis for them?
How would you explain its advantages to them? Multidimensional data analysis refers to the processing of data in which data are viewed as part of a multidimensional structure, one in which data are related in many different ways. Business decision makers usually view data from a business perspective. That is, they tend to view business data as they relate to other business data. For example, a business data analyst might investigate the relationship between sales and other business variables such as customers, time, product line, and location. The multidimensional view is much more representative of a business perspective. A good way to visualize the data is to use tools such as pivot tables in MS Excel or data visualization products such as MS Power BI, Tableau Software’s Tableau or QlikView. 11. The data warehousing project group has invited you to provide an OLAP overview. The group’s members are particularly concerned about the OLAP client/server architecture requirements and how OLAP will fit the existing environment. Your job is to explain to them the main OLAP client/server components and architectures.
OLAP systems are based on client/server technology and consist of these main modules:
• OLAP Graphical User Interface (GUI)
• OLAP Analytical Processing Logic
• OLAP Data Processing Logic
The location of each of these modules is a function of different client/server architectures. How and where the modules are placed depends on hardware, software, and professional judgment. Any placement decision has its own advantages and disadvantages. However, the following constraints must be met:
• The OLAP GUI is always placed in the end user's computer. The reason it is placed at the client side is simple: this is the main point of contact between the end user and the system. Specifically, it provides the interface through which the end user queries the data warehouse's contents.
• The OLAP Analytical Processing Logic (APL) module can be placed in the client (for speed) or in the server (for better administration and better throughput). The APL performs the complex transformations required for business data analysis, such as multiple dimensions, aggregation, period comparison, and so on.
• The OLAP Data Processing Logic (DPL) maps the data analysis requests to the proper data objects in the data warehouse and is, therefore, generally placed at the server level.
12. One of your vendors recommends using an MDBMS. How would you explain this recommendation to your project leader? Multidimensional On-Line Analytical Processing (MOLAP) provides OLAP functionality using multidimensional databases (MDBMS) to store and analyze multidimensional data. Multidimensional database systems (MDBMS) use special proprietary techniques to store data in matrix-like arrays of n dimensions. 13. The project group is ready to make a final decision between ROLAP and MOLAP. What should be the basis for this decision? Why? The basis for the decision should be the system and end user requirements.
Both ROLAP and MOLAP will provide advanced data analysis tools to enable organizations to generate required information. The selection of one or the other depends on which set of tools will fit best within the company's existing expertise base, its technology and end user requirements, and its ability to perform the job at a given cost. The proper ROLAP/MOLAP selection criteria must include:
• purchase and installation price
• supported hardware and software
• compatibility with existing hardware, software, and DBMS
• available programming interfaces
• performance
• availability, extent, and type of administrative tools
• support for the database schema(s)
• ability to handle current and projected database size
• database architecture
• available resources
• flexibility
• scalability
• total cost of ownership
14. The data warehouse project is in the design phase. Explain to your fellow designers how you would use a star schema in the design. The star schema is a data modeling technique that is used to map multidimensional decision support data into a relational database. The reason for the star schema's development is that existing relational modeling techniques, E-R and normalization, did not yield a database structure that served the advanced data analysis requirements well. Star schemas yield an easily implemented model for multidimensional data analysis while still preserving the relational structures on which the operational database is built. The basic star schema has four components: facts, dimensions, attributes, and attribute hierarchies. Star schemas represent aggregated data for specific business activities. Using the schemas, we will create multiple aggregated data sources that will represent different aspects of business operations. For example, the aggregation may involve total sales by selected time periods, by products, by stores, and so on. Aggregated totals can be total product units, total sales values by products, etc.
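A star schema of this kind can be sketched directly in a relational database. The table and column names below (FACT_SALES, DIM_TIME, DIM_PRODUCT) are illustrative, not the text's own schema:

```python
import sqlite3

# Minimal star schema: one fact table linked to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DIM_TIME    (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE DIM_PRODUCT (prod_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE FACT_SALES  (time_id INTEGER REFERENCES DIM_TIME,
                          prod_id INTEGER REFERENCES DIM_PRODUCT,
                          units INTEGER, revenue REAL);
INSERT INTO DIM_TIME    VALUES (1, 2023, 1), (2, 2023, 2);
INSERT INTO DIM_PRODUCT VALUES (10, 'Widget'), (20, 'Gadget');
INSERT INTO FACT_SALES  VALUES (1, 10, 5, 50.0), (1, 20, 2, 40.0),
                               (2, 10, 3, 30.0);
""")

# Aggregate the numeric facts along the product dimension.
rows = conn.execute("""
    SELECT p.description, SUM(f.units), SUM(f.revenue)
    FROM FACT_SALES f JOIN DIM_PRODUCT p ON f.prod_id = p.prod_id
    GROUP BY p.description ORDER BY p.description
""").fetchall()
print(rows)
```

The facts (units, revenue) live in the center table; the dimensions supply the business perspective used to group them.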
15. Briefly discuss OLAP architectural styles with and without data marts. Section 13-6d, “OLAP Architecture,” details the basic architectural components of an OLAP environment:
• The graphical user interface (GUI front end), always located at the end-user end.
• The analytical processing logic, which could be located in the back end (OLAP server) or split between the back-end and front-end components.
• The data processing logic, used to extract data from the data store; typically located in the back end.
The term OLAP “engine” is sometimes used to refer to the arrangement of the OLAP components as a whole. However, the architecture allows for splitting some of the components in a client/server arrangement, as depicted in Figures 13.16 and 13.17. Figure 13.16 shows a typical OLAP architecture without data marts. In this architecture, the OLAP tool extracts data from the data warehouse and processes the data to be presented by the end-user GUI. The processing of the data takes place mostly on the OLAP engine, which could be located in each client computer or shared from an OLAP server. Figure 13.17 shows a typical OLAP architecture with local (end-user located) data marts. The local data marts are “miniature” data warehouses that focus on a subset of the data in the data warehouse. Normally these data marts are subject-oriented, covering subjects such as customers, products, or sales. The local data marts provide faster processing but require that the data be periodically synchronized with the main data warehouse.
16. What is OLAP, and what are its main characteristics? OLAP stands for On-Line Analytical Processing and uses multidimensional data analysis techniques. OLAP yields an advanced data analysis environment that provides the framework for decision making, business modeling, and operations research activities. Its four main characteristics are: 1. Multidimensional data analysis techniques 2. Advanced database support 3. Easy-to-use end-user interfaces 4. Support for client/server architecture. 17. Explain ROLAP, and list the reasons you would recommend its use in the relational database environment. Relational On-Line Analytical Processing (ROLAP) provides OLAP functionality for relational databases. ROLAP's popularity is based on the fact that it uses familiar relational query tools to store and analyze multidimensional data. Because ROLAP is based on familiar relational technologies, it represents a natural extension for organizations that already use relational database management systems. 18. Explain the use of facts, dimensions, and attributes in the star schema. Facts are numeric measurements (values) that represent a specific business aspect or activity. For example, sales figures are numeric measurements that represent product and/or service sales. Facts commonly used in business data analysis are units, costs, prices, and revenues. Facts are normally stored in a fact table, which is the center of the star schema. The fact table contains facts that are linked through their dimensions. Dimensions are qualifying characteristics that provide additional perspectives to a given fact. Dimensions are of interest to us because business data are almost always viewed in relation to other data. For instance, sales may be compared by product from region to region, and from one time period to the next.
The kind of problem typically addressed by DSS might be "make a comparison of the sales of product units of X by region for the first quarter from 1995 through 2005." In this example, sales have product, location, and time dimensions. Dimensions are normally stored in dimension tables. Each dimension table contains attributes. The attributes are often used to search, filter, or classify facts. Dimensions provide descriptive characteristics about the facts through their attributes. Therefore, the data warehouse designer must define common business attributes that will be used by the data analyst to narrow down a search, group information, or describe dimensions. For example, we can identify some possible attributes for the product, location and time dimensions:
• Product dimension: product ID, description, product type, manufacturer, etc.
• Location dimension: region, state, city, and store number.
• Time dimension: year, quarter, month, week, and date.
These product, location, and time dimensions add a business perspective to the sales facts. The data analyst can now associate the sales figures for a given product, in a given region, and at a given time. The star schema, through its facts and dimensions, can provide the data when they are needed and in the required format, without imposing the burden of additional and unnecessary data (such as order #, po #, status, etc.) that commonly exist in operational databases. In essence, dimensions are the magnifying glass through which we study the facts. 19. Explain multidimensional cubes, and describe how the slice and dice technique fits into this model. To explain the multidimensional cube concept, let's assume a sales fact table with three dimensions: product, location, and time. In this case, the multidimensional data model for the sales example is (conceptually) best represented by a three-dimensional cube. This cube represents the view of sales dimensioned by product, location, and time. (We have chosen a three-dimensional cube because such a cube makes it easier for humans to visualize the problem. There is, of course, no limit to the number of dimensions we can use.) The power of multidimensional analysis resides in its ability to focus on specific slices of the cube. For example, the product manager may be interested in examining the sales of a product, thus producing a slice of the product dimension. The store manager may be interested in examining the sales of a store, thus producing a slice of the location dimension. The intersection of the slices yields smaller cubes, thereby producing the "dicing" of the multidimensional cube. By examining these smaller cubes within the multidimensional cube, we can produce very precise analyses of the variable components and interactions. In short, Slice and dice refers to the process that allows us to subdivide a multidimensional cube. 
Such subdivisions permit a far more detailed analysis than would be possible with the conventional two-dimensional data view. The text's Section 13-5 and Figures 13.5 through 13.9 illustrate the slice and dice concept. To gain the benefits of slice and dice, we must be able to identify each slice of the cube. Slice identification requires the use of the values of each attribute within a given dimension. For example, to slice the location dimension, we can use a STORE_ID attribute in order to focus on a given store.
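The slice-and-dice idea can be sketched without a DBMS at all: a slice fixes one dimension's value, and the intersection of slices dices the cube into a smaller sub-cube. The fact tuples and dimension names below are invented for the example:

```python
# Each fact carries three dimensions (product, store, month) plus an amount.
facts = [
    ("X", "S1", 1, 100), ("X", "S2", 1, 80),
    ("Y", "S1", 1, 60),  ("X", "S1", 2, 120),
]

def slice_cube(facts, **fixed):
    """Return the facts matching every fixed dimension value."""
    dims = ("product", "store", "month")
    return [f for f in facts
            if all(f[dims.index(d)] == v for d, v in fixed.items())]

product_slice = slice_cube(facts, product="X")      # one slice
diced = slice_cube(facts, product="X", store="S1")  # intersection of two slices
print(len(product_slice), len(diced))
```

In a real star schema the same operation is a WHERE clause on dimension attributes (for example, `WHERE STORE_ID = 'S1'`).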
20. In the star schema context, what are attribute hierarchies and aggregation levels and what is their purpose? Attributes within dimensions can be ordered in an attribute hierarchy. The attribute hierarchy yields a top-down data organization that permits both aggregation and drill-down/roll-up data analysis. Use Figure Q13.18 to show how the attributes of the location dimension can be organized into a hierarchy that orders that location dimension by region, state, city, and store.
Figure Q13.18 A Location Attribute Hierarchy
[Diagram: the location attribute hierarchy Region → State → City → Store. Drill-down moves down the hierarchy; roll-up moves up. The attribute hierarchy makes it possible to perform drill-down and roll-up searches.]
The attribute hierarchy gives the data warehouse the ability to perform drill-down and roll-up data searches. For example, suppose a data analyst wants an answer to the query "How does the 2005 total monthly sales performance compare to the 2000 monthly sales performance?" Having performed the query, suppose that the data analyst spots a sharp total sales decline in March 2005. Given this discovery, the data analyst may then decide to perform a drill-down procedure for the month of March to see how this year's March sales by region stack up against last year's. The drill-down results are then used to find out whether the low overall March sales were reflected in all regions or only in a particular region. This type of drill-down operation may even be extended until the data analyst is able to identify the individual store(s) that is (are) performing below the norm. The attribute hierarchy allows the data warehouse and OLAP systems to use a carefully defined path that will govern how data are to be decomposed and aggregated for drill-down and roll-up operations. Of course, keep in mind that it is not necessary for all attributes to be part of an attribute hierarchy; some attributes exist just to provide narrative descriptions of the dimensions. 21. Discuss the most common performance improvement techniques used in star schemas. The following four techniques are commonly used to optimize data warehouse design:
• Normalization of dimensional tables is done to achieve semantic simplicity and to facilitate end user navigation through the dimensions. For example, if the location dimension table contains transitive dependencies between region, state, and city, we can revise these relationships to the third normal form (3NF). By normalizing the dimension tables, we simplify the data filtering operations related to the dimensions.
• We can also speed up query operations by creating and maintaining multiple fact tables related to each level of aggregation.
For example, we may use region, state, and city in the location dimension. These aggregate tables are pre-computed at the data loading phase, rather than at run-time. The purpose of this technique is to save processor cycles at run-time, thereby speeding up data analysis. An end user query tool optimized for decision analysis will then properly access the summarized fact tables, instead of computing the values by accessing a "lower level of detail" fact table.
Chapter 13 Business Intelligence and Data Warehouses

• Denormalizing fact tables is done to improve data access performance and to save data storage space. The latter objective, storage space savings, is becoming less of a factor: data storage costs are declining steadily. DBMS limitations that restrict database and table sizes, record sizes, and the maximum number of records in a single table are far more critical than raw storage costs. Denormalization improves performance by storing in a single record what would normally take many records in different tables. For example, to compute the total sales for all products in all regions, we may have to access the region sales aggregates and summarize all the records in that table. If we have 300,000 product sales records, we wind up summarizing at least 300,000 rows. Although such a summary may not be a very taxing operation for a DBMS initially, a comparison of ten or twenty years' worth of sales is likely to start bogging the system down. In such cases, it is useful to have special denormalized aggregate tables. For example, a YEAR_TOTAL table may contain the following fields:

YEAR_ID, MONTH_1, MONTH_2, ... MONTH_12, YEAR_TOTAL

Such a denormalized YEAR_TOTAL table works well as the basis for year-to-year comparisons at the month level, the quarter level, or the year level. But keep in mind that design criteria such as frequency of use and performance requirements must be weighed against the overhead placed on the DBMS to manage these denormalized relations.
• Table partitioning and replication are particularly important when a DSS is implemented in widely dispersed geographic areas. Partitioning splits a table into subsets of rows or columns. These subsets can then be placed in or near the client computer to improve data access times. Replication makes a copy of a table and places it in a different location for the same reason.
22. What is data analytics? Briefly define explanatory and predictive analytics. Give some examples. Data analytics is a subset of BI functionality that encompasses a wide range of mathematical, statistical, and modeling techniques with the purpose of extracting knowledge from data. Data analytics is used at all levels within the BI framework, including queries and reporting, monitoring and alerting, and data visualization. Hence, data analytics is a “shared” service that is crucial to what BI adds to an organization. Data analytics represents what business managers really want from BI: the ability to extract actionable business insight from current events and foresee future problems or opportunities. Data analytics discovers characteristics, relationships, dependencies, or trends in the organization’s data, and then explains the discoveries and predicts future events based on the discoveries. Data analytics tools can be grouped into two separate (but closely related and often overlapping) areas: • Explanatory analytics focuses on discovering and explaining data characteristics and relationships based on existing data. Explanatory analytics uses statistical tools to formulate hypotheses, test them, and answer the how and why of such relationships—for example, how do past sales relate to previous customer promotions? • Predictive analytics focuses on predicting future data outcomes with a high degree of accuracy. Predictive analytics uses sophisticated statistical tools to help the end user create advanced models
that answer questions about future data occurrences—for example, what would next month's sales be based on a given customer promotion?
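A minimal sketch of a predictive model is an ordinary least-squares trend line fitted to past sales and extrapolated one period ahead. The monthly figures are toy data, and real predictive analytics tools use far richer models than a straight line:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (a minimal predictive model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

months = [1, 2, 3, 4, 5]
sales = [110, 120, 130, 140, 150]   # toy data with a perfectly linear trend
a, b = fit_line(months, sales)
forecast = a + b * 6                # predict month 6's sales from the trend
```

Explanatory analytics would stop at interpreting the fitted slope `b` (how sales relate to time); predictive analytics uses the same fit to produce `forecast`.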
23. Describe and contrast the focus of data mining and predictive analytics. Give some examples.

In practice, data analytics is better understood as a continuous spectrum of knowledge acquisition that goes from discovery to explanation to prediction. The outcomes of data analytics then become part of the information framework on which decisions are built. You can think of data mining (explanatory analytics) as explaining the past and present, while predictive analytics forecasts the future. However, you need to understand that both sciences work together; predictive analytics uses explanatory analytics as a stepping stone to create predictive models. Data mining refers to analyzing massive amounts of data to uncover hidden trends, patterns, and relationships; to form computer models to simulate and explain the findings; and then to use such models to support business decision making. In other words, data mining focuses on the discovery and explanation stages of knowledge acquisition. However, data mining can also be used as the basis to create advanced predictive data models. For example, a predictive model could be used to predict future customer behavior, such as a customer's response to a target marketing campaign. So, what is the difference between data mining and predictive analytics? In fact, data mining and predictive analytics use similar and overlapping sets of tools, but with a slightly different focus. Data mining focuses on answering the "how" and "what" of past data, while predictive analytics focuses on creating actionable models to predict future behaviors and events. In some ways, you can think of predictive analytics as the next logical step after data mining; once you understand your data, you can use the data to predict future behaviors. In fact, most BI vendors are dropping the term data mining and replacing it with the more alluring term predictive analytics. Predictive analytics can be traced back to the banking and credit card industries.
The need to profile customers and predict customer buying patterns in these industries was a critical driving force for the evolution of many modeling methodologies used in BI data analytics today. For example, based on your demographic information and purchasing history, a credit card company can use data-mining models to determine what credit limit to offer, what offers you are more likely to accept, and when to send those offers. As another example, a data mining tool could be used to analyze customer purchase history data. The data mining tool will find many interesting purchasing patterns and correlations among customer demographics, the timing of purchases, and the types of items purchased together. The predictive analytics tool will then use those findings to build a model that predicts, with a high degree of accuracy, when a certain type of customer will purchase certain items and which items are likely to be purchased on certain nights and at certain times.
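A single mined pattern of this kind is often expressed as a rule that a predictive model then applies to new customers. A toy encoding of one such credit rule follows; the thresholds are hypothetical, and a real model would combine many rules (for example, in a decision tree):

```python
def minimum_term(age, income, credit_rating, credit_amount):
    """Apply one hypothetical mined rule; returns the minimum term in years,
    or None when the rule does not fire for this applicant."""
    if age < 30 and income <= 25000 and credit_rating < 3 and credit_amount > 25000:
        return 10
    return None
```

The explanatory step discovers and validates such a rule from historical data; the predictive step applies it to score future applicants.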
24. How does data mining work? Discuss the different phases in the data mining process.

Data mining comprises four phases:
• In the data preparation phase, the main data sets to be used by the data mining operation are identified and cleansed of any data impurities. Because the data in the data warehouse are already integrated and filtered, the data warehouse is usually the target data set for data mining operations.
• The objective of the data analysis and classification phase is to study the data to identify common data characteristics or patterns. During this phase, the data mining tool applies specific algorithms to find:
▪ data groupings, classifications, clusters, or sequences
▪ data dependencies, links, or relationships
▪ data patterns, trends, and deviations
• The knowledge acquisition phase uses the results of the data analysis and classification phase. During this phase, the data mining tool (with possible intervention by the end user) selects the appropriate modeling or knowledge acquisition algorithms. The most typical algorithms used in data mining are based on neural networks, decision trees, rules induction, genetic algorithms, classification and regression trees, memory-based reasoning or nearest neighbor, and data visualization. A data mining tool may use many of these algorithms in any combination to generate a computer model that reflects the behavior of the target data set.
• Although some data mining tools stop at the knowledge acquisition phase, others continue to the prognosis phase. In this phase, the data mining findings are used to predict future behavior and forecast business outcomes. Examples of data mining findings include:
• 65% of customers who did not use the credit card in six months are 88% likely to cancel their account.
• 82% of customers who bought a 27-inch or larger TV are 90% likely to buy an entertainment center within the next four weeks.
• If age < 30 and income <= 25,000 and credit rating < 3 and credit amount > 25,000, the minimum term is 10 years.

The complete set of findings can be represented in a decision tree, a neural net, a forecasting model, or a visual presentation interface, which is then used to project future events or results. For example, the prognosis phase may project the likely outcome of a new product rollout or a new marketing promotion.

25. Describe the characteristics of predictive analytics. What is the impact of Big Data on predictive analytics?

Predictive analytics employs mathematical and statistical algorithms, neural networks, artificial intelligence, and other advanced modeling tools to create actionable predictive models based on available data. The algorithms used to build a predictive model are specific to certain types of problems and work with certain types of data. Therefore, it is important that the end user, who typically is trained in statistics and understands the business, applies the proper algorithms to the problem at hand. However, thanks to constant advances in technology, modern BI tools automatically apply multiple algorithms to find the optimum model. Most predictive analytics models are used in areas such as customer relationships, customer service, customer retention, fraud detection, targeted marketing, and optimized pricing. Predictive analytics can add value to an organization in many different ways; for example, it can help optimize existing processes, identify hidden problems, and anticipate future problems or opportunities. However, predictive analytics is not a "secret sauce" that fixes all business problems. Managers should carefully monitor and evaluate the value of predictive analytics models to determine their return on investment. Predictive analytics received a big stimulus with the advent of social media. Companies turned to data
mining and predictive analytics as a way to harvest the mountains of data stored on social media sites. Google was one of the first companies to offer targeted ads as a way to increase and personalize search experiences. Similar initiatives were used by all types of organizations to increase customer loyalty and drive up sales. Take the example of the airline and credit card industries and their frequent flyer and affinity card programs. Nowadays, many organizations use predictive analytics to profile customers in an attempt to get and keep the right ones, which in turn increases loyalty and sales.

26.
Describe data visualization. What is the goal of data visualization?

Data visualization is the process of abstracting data to provide a visual data representation that enhances the user's ability to comprehend the meaning of the data. The goal of data visualization is to allow the user to quickly and efficiently see the data's big picture by identifying trends, patterns, and relationships.
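As a minimal illustration of the idea (not a real visualization tool), a few lines of Python can already turn a table of numbers into an at-a-glance picture; the labels and values here are hypothetical:

```python
def ascii_bars(data, width=40):
    """Render label -> value pairs as horizontal text bars, scaled to the peak value."""
    peak = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(width * value / peak)   # bar length proportional to value
        lines.append(f"{label:<10}{bar} {value}")
    return "\n".join(lines)

print(ascii_bars({"East": 330, "West": 200}))
```

Even this crude chart makes the East/West gap obvious faster than the raw numbers do, which is precisely the point of visualization.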
27. Is data visualization only useful when used with Big Data? Explain and expand.

It is a mistake to think that data visualization is useful only when dealing with Big Data. Any organization, regardless of size, that collects and uses data in its daily activities can benefit from data analytics and visualization techniques. We have all heard the saying "a picture is worth a thousand words," and it has never been more accurate than in data visualization. Tables with hundreds, thousands, or millions of rows of data cannot be processed by the human mind in a meaningful way. Providing summarized tabular data to managers does not give them enough insight into the meaning of the data to make informed decisions. Data visualization encodes the data into visually rich formats (mostly graphical) that provide at-a-glance insight into overall trends, patterns, and possible relationships. Data visualization techniques range from simple to very complex, and many are familiar. Such techniques include pie charts, line graphs, bar charts, bubble charts, bubble maps, donut charts, scatter plots, Gantt charts, heat maps, histograms, time series plots, step charts, waterfall charts, and many more. The tools used in data visualization range from a simple spreadsheet (such as MS Excel) to advanced data visualization software such as Tableau, Microsoft Power BI, Domo, and Qlik. Common productivity tools such as Microsoft Excel can often provide surprisingly powerful data visualizations. Excel has long included basic charting and PivotTable and PivotChart capabilities for visualizing spreadsheet data. More recently, the introduction of the PowerPivot add-in has eliminated row and column data limitations and allows for the integration of data from multiple sources. This puts powerful data visualization capabilities within reach of most business users.

28.
As a discipline, data visualization can be studied as a group of visual communication techniques used to explore and discover data insights by applying pattern recognition, spatial awareness, and aesthetics.

29. Describe the different types of data and how they map to star schemas and data analysis. Give some examples of the different data types.

In general, there are two types of data:
• Qualitative: describes qualities of the data. This type of data can be subdivided into two subtypes:
– Nominal: data that can be counted but not ordered or aggregated. Examples: sex (male or female); student class (graduate or undergraduate).
– Ordinal: data that can be counted and ordered but not aggregated. Examples: rate your teacher (excellent, good, fair, poor); family income (under 20,000; 20,001 to 40,000; 40,001 to 60,000; 60,001 or more).
• Quantitative: describes numeric facts or measures of the data. This type of data can be counted, ordered, and aggregated. Statisticians refer to this as "interval and ratio" data. Examples of quantitative data include age, GPA, number of accidents, etc.

You can think of qualitative data as the dimensions of a star schema and quantitative data as the facts of a star schema. This is important because it means you must use the correct types of functions and operations with each data type, including the proper way to visually represent it.

30. What five graphical data characteristics does data visualization use to highlight and contrast data findings and convey a story?

Data visualization uses shape, color, size, position, and group/order to represent and highlight data in certain ways. The way you visualize the data tells a story and has an impact on the end users. Some data visualizations can provide unknown insights, and others can draw attention to an issue. When used correctly, data visualization can tell the story behind the data. For example, you can use data visualization to explore data and provide some useful data insights using vehicle crash data for the state of Iowa, available at https://catalog.data.gov/. The data set contains data on car accidents in Iowa from 2010 to early 2015. See Figure 13.29 in the textbook and explain how you can quickly identify some trends and characteristics using either Excel or a tool such as Tableau.

NOTE: The data files for this chapter have a Dashboards folder that contains two complete data sets: H1B Visa data and Vehicle Crash data. In addition, there are sample dashboards in Excel, MS Power BI, and Tableau.
Read the included documentation explaining some of the data transformations applied to the raw data.
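The qualitative/quantitative distinction above dictates which operations are legal on each column of a data set. A tiny sketch of that check, with hypothetical column names and metadata:

```python
# Hypothetical column metadata tagging each column with its data type.
COLUMN_TYPES = {
    "sex": "nominal",            # countable only -> a dimension
    "income_band": "ordinal",    # countable and orderable -> a dimension
    "gpa": "quantitative",       # interval/ratio -> a fact (measure)
}

def can_aggregate(column):
    """Only quantitative (interval/ratio) data supports SUM or AVG;
    nominal and ordinal data can be counted but not summed."""
    return COLUMN_TYPES.get(column) == "quantitative"
```

In star schema terms, columns for which `can_aggregate` is false belong in dimension tables, while the aggregable columns become the measures in the fact table.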
Problem Solutions
ONLINE CONTENT The databases used for this problem set are available at www.cengagebrain.com. These databases are stored in Microsoft Access 2000 format. The databases, named Ch13_P1.mdb, Ch13_P3.mdb, and Ch13_P4.mdb, contain the data for Problems 1, 3, and 4, respectively. The data for Problem 2 are stored in Microsoft Excel format at www.cengagebrain.com. The spreadsheet filename is Ch13_P2.xls.

1. The university computer lab's director keeps track of lab usage, as measured by the number of students using the lab. This particular function is very important for budgeting purposes. The computer lab director assigns you the task of developing a data warehouse in which to keep track of the lab usage statistics. The main requirements for this database are to
• Show the total number of users by different time periods.
• Show usage numbers by time period, by major, and by student classification.
• Compare usage for different majors and different semesters.
Use the Ch13_P1.mdb database, which includes the following tables:
USELOG: contains the student lab access data
STUDENT: a dimension table containing student data
Given the three preceding requirements, and using the Ch13_P1.mdb data, complete the following problems:
a. Define the main facts to be analyzed. (Hint: These facts become the source for the design of the fact table.)
b. Define and describe the appropriate dimensions. (Hint: These dimensions become the source for the design of the dimension tables.)
c. Draw the lab usage star schema, using the fact and dimension structures you defined in Problems 1a and 1b.
d. Define the attributes for each of the dimensions in Problem 1b.
e. Recommend the appropriate attribute hierarchies.
f. Implement your data warehouse design, using the star schema you created in Problem 1c and the attributes you defined in Problem 1d.
g. Create the reports that will meet the requirements listed in this problem's introduction.

Before Problems 1a-g can be answered, the students must create the time and semester dimensions. Looking at the data in the USELOG table, the students should be able to figure out that the data belong to the Fall 2017 and Spring 2018 semesters, so the semester dimension must contain entries for at least these two semesters. The time dimension can be defined in several different ways. It will be very useful to provide class time during which students can explore the benefits of the various ways to represent the time dimension. Regardless of which time dimension representation is selected, it is clear that the date and time entries in USELOG must be transformed to match the TIME and SEMESTER codes. For data analysis purposes, we suggest using the TIME and SEMESTER dimension table configurations shown in Tables P13.1A and P13.1B. (We have used these configurations in the Ch13_P1sol.mdb database located on the CD.)
Table P13.1A The TIME Dimension Table Structure

TIME_ID  TIME_DESCRIPTION  BEGIN_TIME  END_TIME
1        Morning           6:01AM      12:00PM
2        Afternoon         12:01PM     6:00PM
3        Night             6:01PM      6:00AM
Table P13.1B The SEMESTER Dimension Table Structure

SEMESTER_ID  SEMESTER_DESCRIPTION  BEGIN_DATE   END_DATE
FA17         Fall 2017             15-Aug-2017  18-Dec-2017
SP18         Spring 2018           08-Jan-2018  15-May-2018
The USELOG table contains only the date and time of each access, rather than the semester or time IDs. The student must create the TIME and SEMESTER dimension tables and assign the proper TIME_ID and SEMESTER_ID keys to match USELOG's time and date. The students should also create the MAJOR dimension table, using the data already stored in the STUDENT table. Using Microsoft Access, we used a make-table query to produce the MAJOR table. A make-table query lets you create a new table, MAJOR, from query output. In this case, the query must select all unique major codes and descriptions. The same technique can be used to create the student classification dimension table. (In our solution, we have named the student classification dimension table CLASS.) Naturally, you
can use some front-end tool other than Access, but we have found Access to be particularly effective in this environment. To produce the solution stored in the Ch13_P1sol.mdb database, we used the queries listed in Table P13.1C.
Table P13.1C The Queries in the Ch13_P1sol.mdb Database

• Update DATE format in USELOG: The DATE field in USELOG was originally given to us as a character field. This query converts the date text to a date field we can use for date comparisons.
• Update STUDENT_ID format in STUDENT: This query changes the STUDENT_ID format to make it compatible with the format used in USELOG.
• Update STUDENT_ID format in USELOG: This query changes the STUDENT_ID format to make it compatible with the format used in STUDENT.
• Append TEST records from USELOG & STUDENT: Creates a temporary storage table (TEST) used to make some data transformations prior to the creation of the fact table. The TEST table contains the fields that will be used in the USEFACT table, plus other fields used for data transformation purposes.
• Update TIME_ID and SEMESTER_ID in TEST: Before we create the USEFACT table, we must transform the dates and times to match the SEMESTER_ID and TIME_ID keys used in our SEMESTER and TIME dimension tables. This query does that.
• Count STUDENTS sort by Fact Keys (SEM, MAJOR, CLASS, TIME): This query aggregates the data in the TEST table. Its output is used to create the new USEFACT table.
• Populate USEFACT: This query uses the results of the previous query to populate the USEFACT table.
• Compares usage by Semesters by Times: Used to generate Report1.
• Usage by Time, Major and Classification: Used to generate Report2.
• Usage by Major and Semester: Used to generate Report3.
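The date/time-to-key transformations performed by the update queries above can be sketched in Python. The boundaries mirror Tables P13.1A and P13.1B; the sample timestamp is hypothetical:

```python
from datetime import datetime, time, date

def time_id(t):
    """Assign the TIME dimension key from a clock time (Table P13.1A)."""
    if time(6, 1) <= t <= time(12, 0):
        return 1   # Morning
    if time(12, 1) <= t <= time(18, 0):
        return 2   # Afternoon
    return 3       # Night (6:01PM-6:00AM wraps past midnight)

def semester_id(d):
    """Assign the SEMESTER dimension key from an access date (Table P13.1B)."""
    if date(2017, 8, 15) <= d <= date(2017, 12, 18):
        return "FA17"
    if date(2018, 1, 8) <= d <= date(2018, 5, 15):
        return "SP18"
    return None    # date falls outside both semesters

stamp = datetime(2017, 9, 14, 14, 30)   # a hypothetical USELOG access
keys = (semester_id(stamp.date()), time_id(stamp.time()))
```

In Access the same mapping is done with update queries (and the Switch function, as noted below); this sketch simply makes the lookup logic explicit.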
Having completed the preliminary work, we can now present the solutions to the seven problems:
a. Define the main facts to be analyzed. (Hint: These facts become the source for the design of the fact table.)
The main facts are the total number of students by time, major, semester, and student classification.
b. Define and describe the possible dimensions. (Hint: These dimensions become the source for the design of the dimension tables.)
The possible dimensions are semester, major, classification, and time. Each of these dimensions provides an additional perspective on the total-number-of-students fact table. The dimension table names and attributes are shown in the screen shot that illustrates the answer to Problem 1c.
c. Draw the lab usage star schema, using the fact and dimension structures you defined in Problems 1a and 1b.
Figure P13.1c shows the MS Access relational diagram (see the Ch13_P1sol.mdb database in the Student Online Companion) to illustrate the star schema, the relationships, the table names, and the field names used in our solution. The students are given only the USELOG and STUDENT tables; they must produce the fact table and dimension tables.
Figure P13.1c The Microsoft Access Relational Diagram
d. Define the attributes for each of the dimensions in Problem (b).
Given Problem 1c's star schema snapshot, the dimension attributes are easily defined:
Semester dimension: semester_id, semester_description, begin_date, and end_date.
Major dimension: major_code and major_name.
Class dimension: class_id and class_description.
Time dimension: time_id, time_description, begin_time, and end_time.
e. Recommend the appropriate attribute hierarchies.
See the answer to Question 18 and the dimensions shown in Problems 1c and 1d to develop the appropriate attribute hierarchies.
NOTE To create the dimension tables in MS Access, we had to modify the data. These modifications can be examined in the update queries stored in the Ch13_P1sol.mdb database. We used the switch function in MS Access to assign the proper SEMESTER_ID and the TIME_ID values to the USEFACT table.
f. Implement your data warehouse design, using the star schema you created in Problem (c) and the attributes you defined in Problem (d).
The solution is included in the Ch13_P1sol.mdb database on the Instructor's CD.
g. Create the reports that will meet the requirements listed in this problem's introduction.
Use the Ch13_P1sol.mdb database on the Instructor's CD as the basis for the reports. Keep in mind that the Microsoft Access export function can be used to put the Access tables into a different database such as Oracle or DB2.

2. Victoria Ephanor manages a small product distribution company. Because the business is growing fast, Ephanor recognizes that it is time to manage the vast information pool to help guide the accelerating growth. Ephanor, who is familiar with spreadsheet software, currently employs a sales force of four people. She asks you to develop a data warehouse application prototype that will enable her to study sales figures by year, region, salesperson, and product. (This prototype is to be used as the basis for a future data warehouse database.) Using the data supplied in the Ch13_P2.xls file, complete the following seven problems:
a. Identify the appropriate fact table components.
The fact table components are the dimension keys (Year, Region, Agent, and Product) and the Total_Value measure. (These are shown in Figure P13.2c.)
b. Identify the appropriate dimension tables. (These are shown in Figure P13.2c.) c. Draw a star schema diagram for this data warehouse. See Figure P13.2c.
Figure P13.2C The Star Schema for the Ephanor Distribution Company
[Figure: the star schema for the ORDER fact table. The ORDER fact table (Year, Region, Agent, Product, Total_Value) sits at the center of the star, linked to the YEAR, REGION, AGENT, and PRODUCT dimension tables.]

The ORDER fact table contains the total value of the orders for a given year, region, agent, and product. The dimension tables are YEAR, REGION, AGENT, and PRODUCT.
d. Identify the attributes for the dimension tables that will be required to solve this problem. The solution to this problem is presented in the Ch13_P2sol.xls file in the Student Online Companion.
e. Using Microsoft Excel or any other spreadsheet program capable of producing pivot tables, generate a pivot table to show the sales by product and by region. The end user must be able to specify the display of sales for any given year. The sample output is shown in the first pivot table in Figure P13.2E.
FIGURE P13.2E Using a pivot table
The solution to this problem is presented in the Ch13_P2sol.xls file in the Student Online Companion. f. Using Problem 2e as your base, add a second pivot table (see Figure P13.2E) to show the sales by salesperson and by region. The end user must be able to specify sales for a given year or for all years, and for a given product or for all products. The solution to this problem is presented in the Ch13_P2sol.xls file in the Student Online Companion.
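The pivot logic behind these spreadsheet solutions can be sketched in plain Python. The sales rows below are hypothetical stand-ins for the Ch13_P2.xls data, and the optional `year` filter plays the role of the pivot table's page filter:

```python
# Hypothetical sales rows: (year, region, agent, product, value).
SALES_ROWS = [
    (2017, "East", "Lee", "Widget", 100),
    (2017, "West", "Ortiz", "Widget", 150),
    (2017, "East", "Lee", "Gadget", 80),
    (2018, "East", "Lee", "Widget", 120),
]

def pivot(rows, row_key, col_key, year=None):
    """Cross-tabulate total value by two chosen dimensions, optionally for one year."""
    table = {}
    for (yr, region, agent, product, value) in rows:
        if year is not None and yr != year:
            continue    # page-filter behavior: keep only the requested year
        dims = {"year": yr, "region": region, "agent": agent, "product": product}
        cell = (dims[row_key], dims[col_key])
        table[cell] = table.get(cell, 0) + value
    return table

by_product_region = pivot(SALES_ROWS, "product", "region", year=2017)
```

Swapping `row_key` and `col_key` arguments reproduces Problem 2f's salesperson-by-region view without changing the aggregation code.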
g. Create a 3-D bar graph to show sales by salesperson, by product, and by region. (See the sample output in Figure P13.2G.)
FIGURE P13.2G 3-D bar graph showing the relationships among sales person, product, and region
The solution to this problem is presented in the Ch13_P2sol.xls file in the Student Online Companion.
3. David Suker, the inventory manager for a marketing research company, wants to study the use of supplies within the different company departments. Suker has heard that his friend Ephanor has developed a spreadsheet-based data warehouse model that she uses in her analysis of sales data (see Problem 2). Suker is interested in developing a data warehouse model like Ephanor's so he can analyze orders by department and by product. He will use Microsoft Access as the data warehouse DBMS and Microsoft Excel as the analysis tool.
a. Develop the order star schema. Figure P13.3A's MS-Access relational diagram reflects the star schema and its relationships. Note that the students are given only the ORDERS table. The student must study the data set and make the queries necessary to create the dimension tables (TIME, DEPT, VENDOR and PRODUCT) and the ORDFACT fact table.
Figure P13.3A The Marketing Research Company Relational Diagram
b. Identify the appropriate dimension attributes.
The dimensions are TIME, DEPT, VENDOR, and PRODUCT; their attributes are shown in Figure P13.3A.
c. Identify the attribute hierarchies required to support the model.
The main hierarchy used for data-drilling purposes is represented by the TIME-DEPT-VENDOR-PRODUCT sequence. (See Figure P13.3A.) Within this hierarchy, the user can analyze data at different aggregation levels. Additional hierarchies can be constructed in the TIME dimension to account for quarters or, if necessary, daily aggregates. The VENDOR dimension could also be expanded to include geographic information that could be used for drill-down purposes.
d. Develop a crosstab report (in Microsoft Access), using a 3-D bar graph to show sales by product and by department. (The sample output is shown in Figure P13.3.)
FIGURE P13.3 A Crosstab Report: Sales by Product and Department
The solution to this problem is included in the Ch13_P3sol.mdb database in the Student Online Companion.
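The crosstab that Access builds for Problem 3d can be sketched in Python, including the per-product totals a crosstab report typically carries. The order rows are hypothetical:

```python
# Hypothetical order rows for the supplies study: (department, product, amount).
ORDERS = [
    ("Marketing", "Paper", 50),
    ("Marketing", "Toner", 120),
    ("Research", "Paper", 30),
]

def crosstab(rows):
    """Build a product-by-department crosstab, adding a Total column per product."""
    table = {}
    for dept, product, amount in rows:
        row = table.setdefault(product, {})
        row[dept] = row.get(dept, 0) + amount
    for cells in table.values():
        cells["Total"] = sum(cells.values())
    return table

report = crosstab(ORDERS)
```

Each product maps to its per-department amounts plus a row total, which is the same shape as the Access crosstab query output behind Figure P13.3.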
4. ROBCOR, Inc., whose sample data are contained in the database named Ch13_P4.mdb, provides "on demand" aviation charters using a mix of different aircraft and aircraft types. Because ROBCOR has grown rapidly, its owner has hired you as its first database manager. The company, with the help of an outside consulting team, already has a charter database in place to help manage all of its operations. Your first and critical assignment is to develop a decision support system to analyze the charter data. (Review the company's operations in Problems 24-31 in Chapter 3, The Relational Database Model.) The charter operations manager wants to be able to analyze charter data such as cost, hours flown, fuel used, and revenue. She also wants to be able to drill down by pilot, type of airplane, and time periods. Given those requirements, complete the following:
a. Create a star schema for the charter data.
NOTE The students must first create the queries required to filter, integrate, and consolidate the data prior to their inclusion in the Data Warehouse. The Ch13_P4.mdb database contains the data to be used by the students. The Ch13_P4sol.mdb database contains the data and solution to the problems.
The problem requires the creation of the time dimension. Looking at the data in the CHARTER table, the students should figure out that the two attributes in the time dimension should be year and month. Another possible attribute could be day, but since no one pilot or airplane was used more than once a day, including it as an attribute would only reduce the database's efficiency. The analysis to be done on the time dimension can be done on a monthly or yearly basis. The CHARTER table contains the date of the charter. No time IDs exist and the date is contained within a single field. The student must create the TIME dimension table and assign the proper TIME_ID keys and its attributes. A temporary table is created to aid in the creation of the CHARTER_FACT table. The queries in Table P13.4-1 are used in the transformation process:
Table P13.4-1 The ROBCOR Data Warehouse Queries

Query Name: Make a TEMP table from CHARTER, PILOT, and MODEL
Description: Creates a temporary storage table used to make the necessary data transformations before the creation of the fact table.

Query Name: Update TIME_ID in TEMP
Description: Used to create the TIME_ID key used in the TIME dimension table.

Query Name: Update YEAR and MONTH in TEMP
Description: To get the year and month attributes into the TIME dimension, it is necessary to separate that data in the temporary table first. The date is in the TEMP table but will not be in the fact table.

Query Name: Make TIME table from TEMP
Description: Creates the TIME table using the appropriate data from the TEMP table.

Query Name: Aggregate TEMP table by fact keys
Description: Performs data aggregation over the data in the TEMP table. The result is used to create the new CHARTER_FACT table.

Query Name: Populate CHARTER_FACT table
Description: Uses the results of the previous query to populate the CHARTER_FACT table.
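The time-dimension transformation described above can be sketched outside of Access as well. The following sqlite3 sketch derives YEAR, MONTH, and a TIME_ID surrogate key from a single date field; the TEMP/TIME table and column names here are illustrative stand-ins, not the actual Ch13_P4 object names.

```python
import sqlite3

# Sketch of the TIME-dimension transformation, done in SQL via sqlite3.
# Table and column names (TEMP, CHAR_DATE, TIME_ID) are assumptions.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE TEMP (CHAR_DATE TEXT, TIME_ID TEXT, YEAR INT, MONTH INT)")
cur.executemany("INSERT INTO TEMP (CHAR_DATE) VALUES (?)",
                [("2017-02-05",), ("2017-02-09",), ("2017-03-01",)])

# Separate YEAR and MONTH out of the single date field ...
cur.execute("""UPDATE TEMP SET YEAR  = CAST(strftime('%Y', CHAR_DATE) AS INTEGER),
                               MONTH = CAST(strftime('%m', CHAR_DATE) AS INTEGER)""")
# ... then build the TIME_ID surrogate key from them.
cur.execute("UPDATE TEMP SET TIME_ID = printf('%04d%02d', YEAR, MONTH)")

# Make the TIME dimension table from the distinct keys now in TEMP.
cur.execute("CREATE TABLE TIME AS SELECT DISTINCT TIME_ID, YEAR, MONTH FROM TEMP")
print(cur.execute("SELECT * FROM TIME ORDER BY TIME_ID").fetchall())
# [('201702', 2017, 2), ('201703', 2017, 3)]
```

The same sequence mirrors the queries in Table P13.4-1: split the date, assign the key, then select the distinct keys into the dimension table.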
The MS Access relational diagram in Figure P13.4a reflects the star schema, the relationships, the table names, and the field names used in our solution. The students are given only the CHARTER, AIRCRAFT, MODEL, EMPLOYEE, PILOT, and CUSTOMER tables, and they must produce the fact table and the dimension tables.
Figure P13.4A The ROBCOR Relational Diagram
b. Define the dimensions and attributes for the charter operation's star schema.

The dimensions are TIME, MODEL, and PILOT. Each of these dimensions is depicted in Figure P13.4a's star schema. The attributes are:
Time dimension: time id, year, and month.
Model dimension: model code, manufacturer, name, number of seats, etc.
Pilot dimension: employee number, pilot license, pilot ratings, etc.

c. Define the necessary attribute hierarchies.

The main attribute hierarchy is based on the sequence year-month-model-pilot, and the aggregate analysis is based on this hierarchy. We can produce a query to generate revenue, hours flown, and fuel used on a yearly basis. We can then drill down to a monthly time period to generate the aggregate information for each model of airplane, and drill down further to get that information for each pilot.

d. Implement the data warehouse design, using the design components you developed in Problems 4a-4c.

The Ch13_P4sol.mdb database contains the data and solutions for Problems 4a-4c.

e. Generate the reports that will illustrate that your data warehouse is able to meet the specified information requirements.

The Ch13_P4sol.mdb database contains the solution for Problem 4e.

Using the data provided in the Ch13_SaleCo_DW database, solve the following problems.
ONLINE CONTENT The script files used to populate the database are available at www.cengagebrain.com. The script files are available in Oracle, MySQL, and SQL Server formats. MS Access does not have SQL support for the complex grouping required.

5. What is the SQL command to list the total sales by customer and by product, with subtotals by customer and a grand total for all product sales?

Oracle:
SELECT CUS_CODE, P_CODE, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT
GROUP BY ROLLUP (CUS_CODE, P_CODE);

SQL Server and MySQL:
SELECT CUS_CODE, P_CODE, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT
GROUP BY CUS_CODE, P_CODE WITH ROLLUP;
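For engines that lack ROLLUP (such as MS Access or SQLite), the same result can be emulated by UNION-ing the grouping levels. A sketch using Python's sqlite3 with invented fact rows; only the DWDAYSALESFACT name and measure columns come from the problem:

```python
import sqlite3

# What GROUP BY ROLLUP (CUS_CODE, P_CODE) computes, emulated with UNION ALL
# for engines that lack ROLLUP. The fact rows are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE DWDAYSALESFACT (CUS_CODE, P_CODE, SALE_UNITS, SALE_PRICE)")
cur.executemany("INSERT INTO DWDAYSALESFACT VALUES (?,?,?,?)",
                [(10011, 'A1', 2, 5.0), (10011, 'B2', 1, 3.0), (10012, 'A1', 4, 5.0)])

# ROLLUP = detail rows + one subtotal per customer + one grand total.
rows = cur.execute("""
    SELECT CUS_CODE, P_CODE, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
    FROM   DWDAYSALESFACT GROUP BY CUS_CODE, P_CODE
    UNION ALL
    SELECT CUS_CODE, NULL, SUM(SALE_UNITS*SALE_PRICE)
    FROM   DWDAYSALESFACT GROUP BY CUS_CODE
    UNION ALL
    SELECT NULL, NULL, SUM(SALE_UNITS*SALE_PRICE)
    FROM   DWDAYSALESFACT""").fetchall()
for row in rows:
    print(row)  # NULL (None) marks a subtotal level, just as ROLLUP does
```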
6. What is the SQL command to list the total sales by customer, month, and product, with subtotals by customer and by month and a grand total for all product sales?

Oracle:
SELECT CUS_CODE, TM_MONTH, P_CODE, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT S JOIN DWTIME T ON S.TM_ID = T.TM_ID
GROUP BY ROLLUP (CUS_CODE, TM_MONTH, P_CODE);

SQL Server and MySQL:
SELECT CUS_CODE, TM_MONTH, P_CODE, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT S JOIN DWTIME T ON S.TM_ID = T.TM_ID
GROUP BY CUS_CODE, TM_MONTH, P_CODE WITH ROLLUP;
7. What is the SQL command to list the total sales by region and customer, with subtotals by region and a grand total for all sales?

Oracle:
SELECT REG_ID, CUS_CODE, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT S JOIN DWCUSTOMER C ON S.CUS_CODE = C.CUS_CODE
GROUP BY ROLLUP (REG_ID, CUS_CODE);

SQL Server and MySQL:
SELECT REG_ID, CUS_CODE, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT S JOIN DWCUSTOMER C ON S.CUS_CODE = C.CUS_CODE
GROUP BY REG_ID, CUS_CODE WITH ROLLUP;

8. What is the SQL command to list the total sales by month and product category, with subtotals by month and a grand total for all sales?

Oracle:
SELECT TM_MONTH, P_CATEGORY, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT S JOIN DWPRODUCT P ON S.P_CODE = P.P_CODE
     JOIN DWTIME T ON S.TM_ID = T.TM_ID
GROUP BY ROLLUP (TM_MONTH, P_CATEGORY);

SQL Server and MySQL:
SELECT TM_MONTH, P_CATEGORY, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT S JOIN DWPRODUCT P ON S.P_CODE = P.P_CODE
     JOIN DWTIME T ON S.TM_ID = T.TM_ID
GROUP BY TM_MONTH, P_CATEGORY WITH ROLLUP;
9. What is the SQL command to list the number of product sales (number of rows) and total sales by month, with subtotals by month and a grand total for all sales?

Oracle:
SELECT TM_MONTH, COUNT(*) AS NUMPROD, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT S JOIN DWTIME T ON S.TM_ID = T.TM_ID
GROUP BY ROLLUP (TM_MONTH);

SQL Server and MySQL:
SELECT TM_MONTH, COUNT(*) AS NUMPROD, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT S JOIN DWTIME T ON S.TM_ID = T.TM_ID
GROUP BY TM_MONTH WITH ROLLUP;
10. What is the SQL command to list the number of product sales (number of rows) and total sales by month and product category, with subtotals by month and product category and a grand total for all sales?

Oracle:
SELECT TM_MONTH, P_CATEGORY, COUNT(*) AS NUMPROD, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT S JOIN DWPRODUCT P ON S.P_CODE = P.P_CODE
     JOIN DWTIME T ON S.TM_ID = T.TM_ID
GROUP BY ROLLUP (TM_MONTH, P_CATEGORY);

SQL Server and MySQL:
SELECT TM_MONTH, P_CATEGORY, COUNT(*) AS NUMPROD, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT S JOIN DWPRODUCT P ON S.P_CODE = P.P_CODE
     JOIN DWTIME T ON S.TM_ID = T.TM_ID
GROUP BY TM_MONTH, P_CATEGORY WITH ROLLUP;
11. What is the SQL command to list the number of product sales (number of rows) and total sales by month, product category, and product, with subtotals by month and product category and a grand total for all sales?

Oracle:
SELECT TM_MONTH, P_CATEGORY, P_CODE, COUNT(*) AS NUMPROD, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT S JOIN DWTIME T ON S.TM_ID = T.TM_ID
     JOIN DWPRODUCT P ON S.P_CODE = P.P_CODE
GROUP BY ROLLUP (TM_MONTH, P_CATEGORY, P_CODE);

SQL Server and MySQL:
SELECT TM_MONTH, P_CATEGORY, P_CODE, COUNT(*) AS NUMPROD, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT S JOIN DWTIME T ON S.TM_ID = T.TM_ID
     JOIN DWPRODUCT P ON S.P_CODE = P.P_CODE
GROUP BY TM_MONTH, P_CATEGORY, P_CODE WITH ROLLUP;

12. Using the answer to Problem 10 as your base, what command would you need to generate the same output but with subtotals in all columns? (Hint: Use the CUBE command.)

Oracle:
SELECT TM_MONTH, P_CATEGORY, COUNT(*) AS NUMPROD, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT S JOIN DWPRODUCT P ON S.P_CODE = P.P_CODE
     JOIN DWTIME T ON S.TM_ID = T.TM_ID
GROUP BY CUBE (TM_MONTH, P_CATEGORY);

SQL Server:
SELECT TM_MONTH, P_CATEGORY, COUNT(*) AS NUMPROD, SUM(SALE_UNITS*SALE_PRICE) AS TOTSALES
FROM DWDAYSALESFACT S JOIN DWPRODUCT P ON S.P_CODE = P.P_CODE
     JOIN DWTIME T ON S.TM_ID = T.TM_ID
GROUP BY TM_MONTH, P_CATEGORY WITH CUBE;
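Conceptually, CUBE over n columns produces one aggregate for every subset of the grouping columns (2^n grouping sets, versus the n+1 levels that ROLLUP produces). A small Python sketch with invented fact rows makes the point without any SQL engine:

```python
from collections import defaultdict
from itertools import combinations

# Sketch of what GROUP BY CUBE (TM_MONTH, P_CATEGORY) computes: one subtotal
# per subset of the grouping columns. The sample fact rows are invented.
rows = [(1, 'HW', 100.0), (1, 'SW', 50.0), (2, 'HW', 75.0)]  # (month, category, sales)

cube = defaultdict(float)
# Subsets of column positions: (), (0,), (1,), (0, 1) -- four grouping sets.
for keep in [s for n in range(3) for s in combinations((0, 1), n)]:
    for month, cat, sales in rows:
        key = tuple(v if i in keep else None for i, v in enumerate((month, cat)))
        cube[key] += sales

print(cube[(None, None)])   # grand total: 225.0
print(cube[(1, None)])      # subtotal for month 1: 150.0
print(cube[(None, 'HW')])   # subtotal for category HW: 175.0
```

The (None, 'HW') row is the extra "subtotal in all columns" that CUBE adds beyond ROLLUP's output.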
13. Create your own data analysis and visualization presentation. The purpose of this project is for you to search for a publicly available data set using the Internet and create your own presentation using what you have learned in this chapter. a. Search for a data set that may interest you and download it. Some examples of public data sets sources are: • http://www.data.gov • http://data.worldbank.org • http://aws.amazon.com/datasets • http://usgovxml.com/ • https://data.medicare.gov/ • http://www.faa.gov/data_research/ b. Use any tool available to you to analyze the data. You can use tools such as MS Excel Pivot Tables, Pivot Charts, or other free tools, such as Google Fusion tables, Tableau free trial, IBM Many Eyes, etc. c. Create a short presentation to explain some of your findings (what the data sources are, where the data comes from, what the data represents, etc.) There are an incredible number of possible visualizations that students can create for an exercise like this. Most students enjoy the opportunity to express their creativity in producing visually interesting solutions. Attempt to keep the focus on how the visualization might make the data actionable. What can we learn from the visualization, and how might a decision maker be
influenced by it?

Data sources available: There are several public sources of large data sets that students can use to practice visualizations. Some of the most common sources are:
http://catalog.data.gov
http://data.worldbank.org
http://aws.amazon.com/datasets
http://usgovxml.com
https://data.medicare.gov
http://www.faa.gov/data_research/
https://www.cdc.gov/nchs/data_access/
https://data.world/
For some good examples of data visualizations, see the Centers for Disease Control and Prevention Data Visualization Gallery at https://www.cdc.gov/nchs/data-visualization/
NOTE: The data files for this chapter include a Dashboards folder that contains two complete data visualization examples based on two data sets: H1B visa data and vehicle crash data. In addition, there are sample dashboards built in Excel, MS Power BI, and Tableau. Read the included documentation explaining some of the data transformations applied to the raw data. These should serve as a good starting point for students on how to create some simple dashboards. See the sample figures below.
Figure P13.13A H1B Visa Applications Dashboard (Excel)
Figure P13.13B H1B Visa Applications Dashboard (Power BI)
Figure P13.13C H1B Visa Applications Dashboard (Tableau)
Chapter 14 Big Data Analytics and NoSQL
Discussion Focus

Start by explaining that Big Data is a nebulous term. Its definition, and the composition of the techniques and technologies covered under this umbrella term, are constantly changing and being redefined. There is no standardizing body for Big Data or NoSQL, so no one is in charge of making a definitive statement about exactly what qualifies as Big Data. This is made worse by the fact that most technologies for Big Data problems and the NoSQL movement are open source, so even the developers working in the arena often form a loose community without hierarchy or structure. As a generic definition, Big Data is data of such volume, velocity, and/or variety that it is difficult for traditional relational database technologies to store and process it.

Students need to understand that the definition of Big Data is relative, not absolute. We cannot look at a collection of data and state categorically that it is Big Data now and for all time. We may categorize a set of data or a data storage and processing requirement as a Big Data problem today. In three years, or even in one year, relational database technologies may have advanced to the point where that same problem is no longer a Big Data problem. NoSQL has the same problem in terms of its definition. Since Big Data and NoSQL are both defined by a negative statement that says what they are not, instead of a positive statement that says what they are, they both suffer from being ill-defined and overly broad.

Discuss the many V's of Big Data. The basic V's (volume, velocity, and variety) are key to Big Data. Again, because there is no authority to define what Big Data is, other V's are added by writers and thinkers who like to extend the alliteration of the 3 V's. Beyond the 3 V's, the other V's proposed by various sources are often not really unique to Big Data. For example, all data have volume.
Big Data problems involve volumes that are too large for relational database technologies to support. Veracity is the trustworthiness of the data. All data need to be trustworthy, and Big Data problems do not require support for a higher level of trustworthiness than relational database technologies can provide. Therefore, the argument can be made that veracity is a characteristic of all data, not just Big Data. Students should understand that critical thinking about Big Data is necessary when assessing claims and technologies in this fast-changing arena.

Discuss the fact that Hadoop has been the beneficiary of great marketing and widespread buy-in from pundits. Hadoop has become synonymous with Big Data in the minds of many people who are only passingly familiar with data management. However, Hadoop is a very specialized technology aimed at very specific tasks associated with storing and processing very large data sets in nonintegrative ways. This makes the Hadoop ecosystem very important, because the ecosystem can expand the basic HDFS and MapReduce capabilities to support a wider range of needs and allow greater integration of the data.
Stress to students that the NoSQL landscape is constantly changing. There are about 100 products competing in the NoSQL environment at any point in time, with new entrants emerging almost daily and other products disappearing at about the same rate. The text follows the standard categories of NoSQL databases that appear in the literature, as shown below, but many products do not fit neatly into only one category:
• Key-value
• Document
• Column family
• Graph

Each category attempts to deal with non-relational data in different ways. Data analysis focuses on attempting to generate knowledge to expand and inform the organization's decision-making processes. These topics were covered at length in Chapter 13 when analyzing data from transactional databases integrated into data warehouses. In this chapter, exploratory and predictive analytics are applied to non-relational databases.
Answers to Review Questions

1. What is Big Data? Give a brief definition.

Big Data is data of such volume, velocity, and/or variety that it is difficult for traditional relational database technologies to store and process it.

2. What are the traditional 3 Vs of Big Data? Briefly define each.

Volume, velocity, and variety are the traditional 3 Vs of Big Data. Volume refers to the quantity of data that must be stored. Velocity refers to the speed with which new data is generated and enters the system. Variety refers to the variations in the structure, or the lack of structure, in the data being captured.

3. Explain why companies like Google and Amazon were among the first to address the Big Data problem.

In the 1990s, the use of the Internet exploded, and commercial websites helped attract millions of new consumers to online transactions. When the dot-com bubble burst at the end of the 1990s, the millions of new consumers remained, but the number of companies providing them services dropped dramatically. As a result, the surviving companies, like Google and Amazon, experienced exponential growth in a very short time. This led to those companies being among the first to experience the volume, velocity, and variety of data that is associated with Big Data.

4. Explain the difference between scaling up and scaling out.

Scaling up involves improving storage and processing capabilities through the use of improved hardware, software, and techniques without changing the number of servers. Scaling out involves improving storage and processing capabilities through the use of more servers.
5. What is stream processing, and why is it sometimes necessary?

Stream processing is the processing of data inputs in order to make decisions about which data should be stored and which data should be discarded. In some situations, large volumes of data can enter the system at such a rapid pace that it is not feasible to try to store all of the data. The data must be processed and filtered as it enters the system to determine which data to keep and which data to discard.

6. How is stream processing different from feedback loop processing?

Stream processing focuses on inputs, while feedback loop processing focuses on outputs. Stream processing is performed on the data as it enters the system to decide which data should be stored and which should be discarded. Feedback loop processing uses data after it has been stored to conduct analysis for the purpose of making the data actionable by decision makers.

7. Explain why veracity, value, and visualization can also be said to apply to relational databases as well as Big Data.

Veracity of data is an issue with even the smallest of data stores, which is why data management is so important in relational databases. Value of data also applies to traditional, structured data in a relational database. One of the keys to data modeling is that only the data of interest to the users should be included in the data model; data that is not of value should not be recorded in any data store, Big Data or not. Visualization was discussed and illustrated at length in Chapter 13 as an important tool for working with data warehouses, which are often maintained as structured data stores in relational DBMS products.

8. What is polyglot persistence, and why is it considered a new approach?

Polyglot persistence is the idea that an organization's data storage solutions will consist of a range of data storage technologies.
This is a new approach because the relational database has previously dominated the data management landscape to the point that the use of a relational DBMS for data storage was taken for granted in most cases. With Big Data problems, the reliance on only relational databases is no longer valid.
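The polyglot idea in question 8 can be sketched in a few lines: structured, transactional data in a relational store sitting alongside a schema-less key-value store for session data. All names, schemas, and values below are invented for illustration.

```python
import sqlite3

# Toy illustration of polyglot persistence: each store serves the access
# pattern it handles best. Everything here is a made-up example.
relational = sqlite3.connect(":memory:")   # structured, transactional data
relational.execute("CREATE TABLE orders (order_id INTEGER, total REAL)")
relational.execute("INSERT INTO orders VALUES (1, 99.5)")

kv_sessions = {}                           # volatile, schema-less session data
kv_sessions["sess:abc"] = {"cart": [1], "last_seen": "10:00"}

# Aggregate query goes to the relational store; key lookup goes to the KV store.
total = relational.execute("SELECT SUM(total) FROM orders").fetchone()[0]
cart = kv_sessions["sess:abc"]["cart"]
print(total, cart)  # 99.5 [1]
```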
9. What are the key assumptions made by the Hadoop Distributed File System approach?

HDFS is designed around the following assumptions:
• High volume
• Write-once, read-many
• Streaming access
• Fault tolerance

HDFS assumes that massive volumes of data will need to be stored and retrieved. HDFS assumes that data will be written once; that is, there will very rarely be a need to update the data once it has been written to disk. However, the data will need to be retrieved many times. HDFS assumes that when a file is retrieved, the entire contents of the file will need to be streamed in a sequential fashion; HDFS does not work well when only small parts of a file are needed. Finally, HDFS assumes that failures in the servers will be frequent. As the number of servers increases, the probability of a failure increases significantly, so the data must be redundant to avoid loss of data when servers fail.
10. What is the difference between a name node and a data node in HDFS? The name node stores the metadata that tracks where all of the actual data blocks reside in the system. The name node is responsible for coordinating tasks across multiple data nodes to ensure sufficient redundancy of the data. The name node does not store any of the actual user data. The data nodes store the actual user data. A data node does not store metadata about the contents of any data node other than itself.
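The division of labor in question 10 can be sketched with a toy in-memory model; the node names, block IDs, and replication pattern here are invented:

```python
# Toy model of the HDFS split: the name node stores only metadata (which data
# nodes hold each block); data nodes store the actual bytes. All names invented.
name_node = {}                       # file -> list of (block_id, [data nodes])
data_nodes = {"dn1": {}, "dn2": {}, "dn3": {}}

def write_block(filename, block_id, payload, targets):
    for dn in targets:               # redundant copies guard against node failure
        data_nodes[dn][block_id] = payload
    name_node.setdefault(filename, []).append((block_id, targets))

write_block("log.txt", "blk_1", b"chunk-1", ["dn1", "dn2"])
write_block("log.txt", "blk_2", b"chunk-2", ["dn2", "dn3"])

# A read first asks the name node where the blocks live, then streams the
# bytes from the data nodes themselves.
for block_id, locations in name_node["log.txt"]:
    print(block_id, locations, data_nodes[locations[0]][block_id])
```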
11. Explain the basic steps of MapReduce processing. • A client node submits a job to the Job Tracker. • Job Tracker determines where the data to be processed resides. • Job Tracker contacts the Task Tracker on the nodes as close as possible to the data. • Each Task Tracker creates mappers and reducers as needed to complete the processing of each block of data and consolidate that data into a result. • Task Trackers report results back to the Job Tracker when the mappers and reducers are finished. • The Job Tracker updates the status of the job to indicate when it is complete.
12. Briefly explain how HDFS and MapReduce are complementary to each other. Both HDFS and MapReduce rely on the concept of massive, relatively independent, distributions. HDFS decomposes data into large, independent chunks of data that are then distributed across a number of independent servers. MapReduce decomposes processing into independent tasks that are distributed across a number of independent servers. The distribution of data in HDFS is coordinated by a name node server that collects data from each server about the state of the data that it holds. The distribution of processing in MapReduce is coordinated by a job tracker that collects data from each server about the state of the processing it is performing.
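The map-shuffle-reduce flow summarized above can be condensed into the classic word-count sketch: mappers emit key-value pairs, a shuffle groups them by key, and reducers fold each group into a result. This single-process Python sketch only illustrates the data flow; real Hadoop distributes each step across nodes.

```python
from collections import defaultdict

# Minimal, single-process sketch of the MapReduce data flow (word count).
def mapper(line):
    for word in line.split():
        yield (word.lower(), 1)          # emit (key, value) pairs

def reducer(word, counts):
    return (word, sum(counts))           # fold one key's values to a result

blocks = ["Big Data big data", "data warehouse"]   # stand-ins for HDFS blocks

shuffle = defaultdict(list)              # group intermediate pairs by key
for block in blocks:
    for key, value in mapper(block):
        shuffle[key].append(value)

result = dict(reducer(k, v) for k, v in shuffle.items())
print(result)  # {'big': 2, 'data': 3, 'warehouse': 1}
```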
13. What are the four basic categories of NoSQL databases? Key-value database, document databases, column family databases, and graph databases.
14. How are the value components of a key-value database and a document database different?

In a key-value database, the value component is unintelligible to the DBMS. In other words, the DBMS is unaware of the meaning of any of the data in the value component; it is treated as an indecipherable mass of data. All processing of the data in the value component must be accomplished by the application logic. In a document database, the value component is partially interpretable by the DBMS. The DBMS can identify and search for specific tags, or subdivisions, within the value component.
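The contrast in question 14 can be sketched with plain dictionaries; the keys, fields, and values are invented for illustration:

```python
# Key-value store: the value is an opaque blob only the application can decode.
kv_store = {"cust:1001": b"\x00\x01opaque-bytes"}   # engine cannot query inside

# Document store: the engine can see the tags/fields inside the value.
doc_store = {
    "cust:1001": {"name": "Ramas", "state": "TN"},
    "cust:1002": {"name": "Dunne", "state": "GA"},
}

# A document database can answer "which customers are in TN?" by itself:
matches = [key for key, doc in doc_store.items() if doc["state"] == "TN"]
print(matches)  # ['cust:1001']
```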
15. Briefly explain the difference between row-centric and column-centric data storage.

Row-centric storage treats a row as the smallest data storage unit. All of the column values associated with a particular row of data are stored together in physical storage. This is the optimal storage approach for operations that manipulate and retrieve all columns in a row, but only a small number of rows in a table. Column-centric storage treats a row as a divisible collection of values that are stored separately, with the values of a single column across many rows being physically stored together. This is optimal when operations manipulate and retrieve a small number of columns in a row for all rows in the table.

16. What is the difference between a column and a super column in a column family database?

Columns in a column family database are relatively independent of each other. A super column is a group of columns that are logically related. This relationship can be based on the nature of the data in the columns, such as a group of columns that comprise an address, or it can be based on application processing requirements.
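The two layouts in question 15 can be sketched with the same three rows; table contents and column names are invented:

```python
# Row-centric vs. column-centric layouts of the same (invented) table.
rows = [  # (cus_code, name, balance)
    (1001, "Ramas", 120),
    (1002, "Dunne", 0),
    (1003, "Smith", 221),
]

# Row-centric: all values of one row sit together, so fetching a whole row is cheap.
row_store = list(rows)
print(row_store[1])  # (1002, 'Dunne', 0)

# Column-centric: all values of one column sit together, so scanning one
# column across every row (e.g., SUM(balance)) is cheap.
col_store = {"cus_code": [r[0] for r in rows],
             "name":     [r[1] for r in rows],
             "balance":  [r[2] for r in rows]}
print(sum(col_store["balance"]))  # 341
```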
17. Explain why graph databases tend to struggle with scaling out.

Graph databases are designed to address problems with highly related data. The data in a graph database is tightly integrated, and queries that traverse a graph focus on the relationships among the data. Scaling out requires moving data to a number of different servers. As a general rule, scaling out is recommended when the data on each server is relatively independent of the data on other servers. Due to the dependencies among the data on different servers in a graph database, the inter-server communication overhead is very high, which has a significant negative impact on the performance of graph databases in a scaled-out environment.

18. Explain what it means for a database to be aggregate aware.

Aggregate aware means that the designer of the database has to be aware of the way the data in the database will be used, and then design the database around whichever component is central to that usage. Instead of decomposing the data structures to eliminate redundancy, an aggregate aware database collects, or aggregates, all of the data around a central component to minimize the structures required during processing.
Chapter 15 Database Connectivity and Web Technologies
Discussion Focus

Begin by making sure that the students are familiar with the basic vocabulary. The following two questions are a good place to start. (You may want to examine the contents of Appendix F, "Client/Server Systems.")

There is some irony in the Web development arena: the microcomputer was supposed to liberate the end user from the mainframe computer's "fat server, thin client" environment. However, the Web has, in effect, brought us back to the old mainframe structure in which most processing is done on the server side, while the client is the source and recipient of data/information requests and returns.
1. Describe the following: TCP/IP, Router, HTML, HTTP, and URL.

Transmission Control Protocol/Internet Protocol (TCP/IP) is the basic network protocol suite that determines the rules used to create and route "packets" of data between computers in the same or different networks. Each computer connected to the Internet has a unique TCP/IP address. The TCP/IP address is divided into two parts used to identify the network and the computer (or host).
A router is a special hardware/software device that connects multiple and diverse networks. The router is in charge of delivering packets of data from a local network to a remote network. Routers are the traffic cops of the Internet, monitoring all traffic and moving data from one network to another.
HTML stands for Hypertext Markup Language, the standard document-formatting language for Web pages. HTML allows documents to be presented in a Web browser in a standard manner.
URL stands for Uniform Resource Locator. A URL identifies the address of a resource on the Internet. The URL is an abbreviation (ideally easily remembered) that uniquely identifies an Internet resource, for example, www.amazon.com, www.faa.gov, and www.mtsu.edu.
HTTP stands for Hypertext Transfer Protocol. HTTP is the standard protocol used by the Web browser and Web server to communicate, that is, to send requests and replies between servers and browsers. HTTP uses TCP/IP to transmit data between computers.
2. Describe the client/server model for application processing. Client/server is a term used to describe a computing model for the development of computerized systems. This model is based on the distribution of functions between two types of independent and autonomous processes: servers and clients. A client is any process that requests specific services from server processes. A server is a process that provides requested services for clients. Client and server processes can reside in the same computer or in different computers connected by a network.
The client/server model makes possible the division of the application processing tasks into three main components: presentation logic, processing logic, and data storage.

• The presentation logic formats and presents data on output devices, such as the screen, and manages end-user input. The application uses presentation logic to manage the graphical user interface at the client end.
• The processing logic component refers to the application code that performs data validation, error checking, and business logic processing. The processing logic component represents the business rules. For example, the processing logic "knows" that a sales transaction generates an invoice record, an inventory update, and a customer's account receivable update. The processing logic performs several functions, including enforcement of business rules, managing information flows within the business, and mapping real-world business transactions to the actual computer database.
• The data storage component deals with the actual data storage and retrieval from permanent storage devices. For example, the data manipulation logic is used to access the data in a database and to enforce data integrity.
Although there is no methodology to dictate the precise distribution of the logic components among clients and servers, the client/server architectural principles of process distribution (autonomy, resource maximization, scalability, and interoperability) and hardware and software independence facilitate the creation of distributed applications running across multiple servers. Those applications provide services that communicate with each other in order to carry out specific functions; hence the term multi-tier applications. So, where should the services be placed? With the probable exception of the presentation logic, which should go on the client side, each of the remaining service components may be placed on the server side, thus becoming a service for many clients.
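The three components can be illustrated as separate layers of a toy banking function; this sketch is not from the text, and all names, rules, and data are invented:

```python
# Toy sketch of the presentation / processing / data-storage split. Each layer
# could live on a different machine; all names, rules, and data are invented.

_accounts = {"A-1": 100.0}           # data storage component: persistence only

def load_balance(acct):
    return _accounts[acct]

def save_balance(acct, amount):
    _accounts[acct] = amount

def withdraw(acct, amount):          # processing logic: enforces business rules
    balance = load_balance(acct)
    if amount <= 0 or amount > balance:
        raise ValueError("business rule violated")
    save_balance(acct, balance - amount)

def render(acct):                    # presentation logic: formatting only
    return f"Account {acct}: ${load_balance(acct):.2f}"

withdraw("A-1", 25.0)
print(render("A-1"))  # Account A-1: $75.00
```

Only `render` would live on the client; `withdraw` and the storage functions could each sit on a different server tier.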
Answers to Review Questions
1. Give some examples of database connectivity options and what they are used for.

Database connectivity refers to the mechanisms through which application programs connect and communicate with data repositories. The database connectivity software is also known as database middleware, because it is a piece of software that interfaces between the application program and the database. The data repository is also known as the data source, because it represents the data management application (e.g., an Oracle RDBMS, SQL Server DBMS, or IBM DBMS) that will be used to store the data generated by the application program. Ideally, a data source or data repository could be located anywhere and hold any type of data. For example, the data source could be a relational database, a hierarchical database, a spreadsheet, a text data file, and so on. Although there are many different ways to achieve database connectivity, this section covers only the following interfaces: native SQL connectivity (vendor provided), Microsoft's Open Database Connectivity (ODBC), Data Access Objects (DAO) and Remote Data Objects (RDO), Microsoft's Object Linking and Embedding for Databases (OLE-DB), Microsoft's ADO.NET, and Java Database Connectivity (JDBC).
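As an analogy (not part of the chapter), Python's DB-API plays a role similar to ODBC: application code is written against a standard interface, and the driver underneath handles the specific data source. A sketch using the built-in sqlite3 driver and an invented customer table:

```python
import sqlite3

# Middleware idea: the application depends only on the generic API, so the
# driver (and thus the data source) can be swapped. Table and rows invented.
def fetch_names(conn):
    """Application code that depends only on the standard API, not one vendor."""
    cur = conn.cursor()
    cur.execute("SELECT name FROM customer ORDER BY name")
    return [row[0] for row in cur.fetchall()]

conn = sqlite3.connect(":memory:")   # another DB-API driver could slot in here
conn.execute("CREATE TABLE customer (name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?)", [("Ramas",), ("Dunne",)])
print(fetch_names(conn))  # ['Dunne', 'Ramas']
```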
2. What are ODBC, DAO, and RDO? How are they related?

Open Database Connectivity (ODBC) is Microsoft's implementation of a superset of the SQL Access Group Call Level Interface (CLI) standard for database access. ODBC allows any Windows application to access relational data sources using SQL via a standard application programming interface (API). ODBC was the first widely adopted database middleware standard, and it enjoyed rapid adoption in Windows applications. As programming languages evolved, ODBC did not provide significant functionality beyond the ability to execute SQL to manipulate relational-style data, so programmers needed a better way to access data. To answer this need, Microsoft developed two other data access interfaces:

• Data Access Objects (DAO) is an object-oriented API used to access MS Access, MS FoxPro, and dBase databases (using the Jet data engine) from Visual Basic programs. DAO provides an optimized interface that exposes the functionality of the Jet data engine (on which the MS Access database is based) to programmers. The DAO interface can also be used to access other relational-style data sources.
• Remote Data Objects (RDO) is a higher-level, object-oriented application interface used to access remote database servers. RDO uses the lower-level DAO and ODBC for direct access to databases. RDO is optimized to deal with server-based databases, such as MS SQL Server, Oracle, and DB2.
3. What is the difference between DAO and RDO?

DAO uses the MS Jet engine to access file-based relational databases such as MS Access, MS FoxPro, and dBase. In contrast, RDO provides access to relational database servers such as SQL Server, DB2, and Oracle. RDO uses DAO and ODBC to access remote database server data.
4. What are the three basic components of the ODBC architecture?

The basic ODBC architecture is composed of three main components:
• A high-level ODBC API through which application programs access ODBC functionality.
• A Driver Manager component that is in charge of managing all database connections.
• An ODBC driver component that talks directly to the DBMS (data source).
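The same call flow (application API, a manager that hands out connections, a driver that speaks to the data source) appears in most database middleware. As a rough sketch, Python's DB-API follows the same shape; here the standard library's sqlite3 module stands in for an ODBC driver and data source. With a real ODBC source you would instead use a third-party binding such as pyodbc and a DSN (e.g., `pyodbc.connect("DSN=Ch02_InsureCo")`); the table and data below are invented for illustration.

```python
import sqlite3

# "Connect" step: with ODBC, the Driver Manager would locate the driver
# registered for the DSN; here sqlite3 plays both roles.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The application submits SQL through the API; the driver translates the
# calls for the data source.
cur.execute("CREATE TABLE agent (agent_code INTEGER, agent_lname TEXT)")
cur.execute("INSERT INTO agent VALUES (501, 'Alby'), (503, 'Okon')")
conn.commit()

# Query and fetch the result set back through the same layered interface.
cur.execute("SELECT agent_lname FROM agent WHERE agent_code = 503")
rows = cur.fetchall()
print(rows)  # [('Okon',)]
conn.close()
```

The point of the layering is that the application code above would look the same regardless of which driver (and therefore which DBMS) sits underneath.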
5. What steps are required to create an ODBC data source name?

To define a data source, you must create a data source name (DSN) for it. To create a DSN you must provide:
• An ODBC driver. You must identify the driver to use to connect to the data source. The ODBC driver is normally provided by the database vendor, although Microsoft provides several drivers that connect to the most common databases. For example, if you are using an Oracle DBMS, you will select the Oracle ODBC driver provided by Oracle or, if desired, the Microsoft-provided ODBC Driver for Oracle.
• A DSN name. This is a unique name by which the data source will be known to ODBC and, therefore, to the applications. ODBC offers two types of data sources: User and System. User data sources are available only to the user; System data sources are available to all users, including operating system services.
• ODBC driver parameters. Most ODBC drivers require specific parameters in order to establish a connection to the database. For example, if you are using an MS Access database, you must point to the location of the MS Access (.mdb) file and, if necessary, provide the user name and password. If you are using a DBMS server, you must provide the server name, the database name, and the user name and password used to connect to the database. Figure 15.3 shows the ODBC screens required to create a System ODBC data source for an Oracle DBMS. Note that some ODBC drivers use the native driver provided by the DBMS vendor.
6. What is OLE-DB used for, and how does it differ from ODBC?

Although ODBC, DAO, and RDO were widely used, they did not provide support for non-relational data. To answer the need for non-relational data access and to simplify data connectivity, Microsoft developed Object Linking and Embedding for Database (OLE-DB). Based on Microsoft's Component Object Model (COM), OLE-DB is database middleware developed to add object-oriented functionality for access to relational and non-relational data. OLE-DB was the first piece of Microsoft's strategy to provide a unified object-oriented framework for the development of next-generation applications.
7. Explain the OLE-DB model based on its two types of objects.

OLE-DB is composed of a series of COM objects that provide low-level database connectivity for applications. Because OLE-DB is based on the COM object model, the objects contain both data and methods (also known as the interface). The OLE-DB model is better understood when you divide its functionality into two types of objects:
• Consumers are all objects (applications or processes) that request and use data. Data consumers request data by invoking the methods exposed by the data provider objects (the public interface) and passing the required parameters.
• Providers are the objects that manage the connection with a data source and provide data to the consumers. Providers are divided into two categories:
➢ Data providers provide data to other processes. Database vendors create data provider objects that expose the functionality of the underlying data source (relational, object-oriented, text, and so on).
➢ Service providers provide additional functionality to consumers. The service provider is located between the data provider and the consumer: it requests data from the data provider, transforms the data, and provides the transformed data to the data consumer. In other words, the service provider acts as a data consumer of the data provider and as a data provider for the data consumer (the end-user application). For example, a service provider could offer cursor management services, transaction management services, query processing services, indexing services, and so on.

8. How does ADO complement OLE-DB?

OLE-DB provided additional capabilities for applications accessing data. However, it did not provide support for scripting languages, especially those used for Web development, such as Active Server Pages (ASP) and ActiveX. To provide such support, Microsoft developed a new object framework called ActiveX Data Objects (ADO). ADO provides a high-level, application-oriented interface to interact with OLE-DB, DAO, and RDO, giving a unified way to access data from any programming language that uses the underlying OLE-DB objects. Figure 15.5 (reproduced from the text for convenience) illustrates the ADO/OLE-DB architecture and how it interacts with ODBC and native connectivity options.
Figure 15.5 OLE-DB Architecture
(Diagram: client applications such as Access, C++, and Excel act as OLE-DB consumers through ActiveX Data Objects (ADO); OLE-DB service providers supply email, indexing, cursor, and query processing; OLE-DB data providers for Oracle, Exchange, SQL Server, and ODBC connect to the underlying databases, e.g., Oracle via SQL*NET and SQL Server via ODBC.)
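The provider/consumer layering described in question 7 can be sketched in a few lines of code. This is a toy rendering only: the class and method names are invented for illustration and are not real OLE-DB or COM APIs. A data provider exposes rows, a service provider wraps it (here adding sorting, a stand-in for cursor or query-processing services), and the consumer talks only to the outermost provider.

```python
class DataProvider:
    """Plays the role of an OLE-DB data provider: exposes a data source's rows."""
    def __init__(self, rows):
        self._rows = rows

    def get_rows(self):
        return list(self._rows)


class SortingServiceProvider:
    """Plays the role of a service provider: consumes from a data provider,
    adds a service (sorting), and re-exposes the same interface."""
    def __init__(self, provider, key):
        self._provider, self._key = provider, key

    def get_rows(self):
        return sorted(self._provider.get_rows(), key=self._key)


# The consumer side: it cannot tell whether it talks to a data provider
# directly or to a chain of service providers.
data = DataProvider([("Smith", 503), ("Ramas", 502)])
service = SortingServiceProvider(data, key=lambda r: r[1])
rows = service.get_rows()
print(rows)  # [('Ramas', 502), ('Smith', 503)]
```

Because both classes expose the same interface, service providers can be stacked (cursor management over query processing over indexing), which is exactly the composition the OLE-DB model is designed to allow.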
9. What is ADO.NET, and what two new features make it important for application development?

ADO.NET is the data access component of Microsoft's .NET application development framework. The .NET framework is a component-based platform for the development of distributed, heterogeneous, interoperable applications aimed at manipulating any type of data, over any network, under any operating system, and in any programming language. The .NET framework is beyond the scope of this book, so this section introduces only the basic data access component of the .NET architecture, ADO.NET. ADO.NET introduced two new features critical for the development of distributed applications:
• A DataSet, which is a disconnected memory-resident representation of the database.
• XML support: ADO.NET stores all of its internal data in XML format.
10. What is a DataSet, and why is it considered to be disconnected?

A DataSet is a disconnected memory-resident representation of the database; that is, the DataSet contains tables, columns, rows, relationships, and constraints. Once the data are read from a data provider, they are placed in a memory-resident DataSet, and the DataSet is then disconnected from the data provider. The data consumer application interacts with the data in the DataSet object to make changes (inserts, updates, and deletes). Once the processing is done, the DataSet data are synchronized with the data source and the changes are made permanent.

A DataSet is in fact a simple database with tables, rows, and constraints. Even more important, the DataSet does not require a permanent connection to the data source. The DataAdapter uses its SelectCommand to populate the DataSet from a data source, but once the DataSet is populated, it is completely independent of the data source, which is why it is called "disconnected."
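The shape of the disconnected pattern can be sketched without ADO.NET at all: read the rows into plain in-memory structures, drop the connection, edit locally, then reconnect and synchronize. The sketch below uses Python's sqlite3 module with invented sample data; ADO.NET's DataSet and DataAdapter do this with far more machinery (change tracking, constraints, conflict detection), so this shows only the idea.

```python
import os
import sqlite3
import tempfile

# Set up a small on-disk data source (stand-in for the real database).
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO customer VALUES (1, 100.0), (2, 250.0)")
conn.commit()

# 1. Fill the "dataset" and disconnect.
dataset = {cid: bal for cid, bal in
           conn.execute("SELECT id, balance FROM customer")}
conn.close()                 # no connection is held while the user works

# 2. Work offline against the in-memory copy.
dataset[1] += 50.0           # an update made while disconnected

# 3. Reconnect and synchronize the changes back to the data source.
conn = sqlite3.connect(path)
conn.executemany("UPDATE customer SET balance = ? WHERE id = ?",
                 [(bal, cid) for cid, bal in dataset.items()])
conn.commit()
final = conn.execute("SELECT balance FROM customer WHERE id = 1").fetchone()[0]
conn.close()
print(final)  # 150.0
```

The payoff is the same one the DataSet gives ADO.NET: the expensive database connection exists only during steps 1 and 3, not for the whole time the user is editing.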
11. What are Web server interfaces used for? Give some examples.

Web server interfaces are used to extend the functionality of a Web server so that it can provide additional services. If a Web server is to communicate successfully with external programs to provide a service, both programs must use a standard way to exchange messages and respond to requests. A Web server interface defines how a Web server communicates with external programs. Currently there are two well-defined Web server interfaces:
• Common Gateway Interface (CGI)
• Application Programming Interface (API)
Web server interfaces can be used to extend the services of a web server and provide support for access to external databases, fax services, telephony services, directory services, etc.
12. Search the Internet for Web application servers. Choose one and prepare a short presentation for your class.

You are encouraged to use any Web search engine to list multiple vendors. Examples of such vendors and products are Oracle Application Server, IBM WebSphere, Sun Java, Microsoft, and JBoss. We encourage students to visit the products' Web pages and compare the features of at least two products. Some of the many other Web application servers, as of this writing, include WebLogic by BEA Systems, NetDynamics by Sun Microsystems, NetObjects Fusion, Microsoft's Visual Studio .NET, and WebObjects by Apple.
13. What does this statement mean: The Web is a stateless system? What implications does a stateless system have for database applications developers?

Simply put, the label stateless system indicates that, at any given time, a Web server does not know the status of any of the clients communicating with it. That is, there is no open communications line between the server and each client accessing it; that, of course, would be impractical on a worldwide Web. Instead, client and server computers interact in very short "conversations" that follow the request-reply model. For example, the browser is concerned only with the current page, so there is no way for one page to know what was done in a previous page. The only time the client and server computers communicate is when the client requests a page (when the user clicks a link) and the server sends the requested page to the client. Once the client receives the page and its components, the client/server communication ends. Therefore, although you may be browsing a page and think that the communication is still open, you are actually just browsing the HTML document stored in the local cache (temporary directory) of the client browser. The server has no idea what the end user is doing with the document, what data is entered in a form, what option is selected, and so on. On the Web, if we want to act on a client's selection, we need to jump to a new page (go back to the Web server), thereby losing track of whatever was done before.

Not knowing what was done before, or what a client selected before it reached the current page, makes adding business logic to the Web cumbersome. For example, suppose that you need to write a program that performs the following steps: display a data entry screen, capture data, validate data, and save data. This entire sequence can be completed in a single COBOL program because COBOL uses a working storage section that holds in memory all the variables used in the program. Now imagine the same COBOL program, but with each section (PERFORM statement) as a separate program. That is precisely how the Web works. In short, the Web's stateless nature means that the extensive processing required by a program's execution cannot be done in a single Web page; the client browser is limited by its lack of processing ability and the lack of a working storage area to hold variables shared by all pages in a Web site.
The browser does not have computational abilities beyond formatting output text and accepting form field inputs. Even when the browser accepts form field data, there is no way to perform immediate data entry validation. Therefore, to perform such crucial processing in the client, the Web defers to other Web programming languages such as Java, JavaScript, and VBScript.
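A common workaround for statelessness is for the server to issue each client a session ID (usually stored in a cookie) and to keep per-client state on the server side, keyed by that ID. The following minimal sketch uses invented names; real Web frameworks implement the same idea with signed cookies, expiry, and persistent session stores.

```python
import uuid

sessions = {}  # server-side store: session_id -> per-client state


def handle_request(cookie=None, **form_data):
    """Each call models one independent HTTP request, as on the Web."""
    if cookie not in sessions:          # first visit: issue a new session ID
        cookie = str(uuid.uuid4())
        sessions[cookie] = {}
    sessions[cookie].update(form_data)  # remember what this client submitted
    return cookie, dict(sessions[cookie])


# Page 1: the client submits a form; the server issues a cookie.
sid, state = handle_request(name="Ann")
# Page 2: a completely separate request, tied to page 1 only by the cookie.
sid, state = handle_request(cookie=sid, plan="premium")
print(state)  # {'name': 'Ann', 'plan': 'premium'}
```

Nothing about HTTP itself connects the two requests; the continuity exists only because both carried the same session ID and the server kept the "working storage" on its side.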
14. What is a Web application server, and how does it work from a database perspective?

A Web application server extends the functionality of a Web server and provides features such as:
• An integrated development environment with session management and support for persistent application variables.
• Security and authentication of users through user IDs and passwords.
• Computational languages to represent and store business logic in the application server.
• Automatic generation of HTML pages integrated with Java, JavaScript, VBScript, ASP, and so on.
• Performance and fault-tolerance features.
• Database access with transaction management capabilities.
• Access to multiple services, such as file transfers (FTP), database connectivity, electronic mail, and directory services.
The Web application server interfaces with the database connectivity standards to access databases using any of the supported APIs. That is, a Web page is processed by the Web application server, and the application server connects to the database using ADO, OLE-DB, ODBC, or any other standard the application server supports.
15. What are scripts, and what is their function? (Think in terms of database applications development!)

A script is a series of instructions executed in interpreter mode. A script is a plain text file; it is not compiled the way COBOL, C++, or Java programs are. Scripts are normally used in Web application development environments. For instance, ColdFusion scripts contain the code required to connect to, query, and update a database from a Web front end.
16. What is XML, and why is it important? Extensible Markup Language (XML) is a meta-language used to represent and manipulate data elements. XML is designed to facilitate the exchange of structured documents such as orders or invoices over the Internet. The World Wide Web Consortium (W3C) published the first XML 1.0 standard definition in 1998. This standard sets the stage for giving XML the real-world appeal of being a true vendor-independent platform. Therefore, it is not surprising that XML is rapidly becoming the data exchange standard for e-commerce applications. XML is important because it provides the
semantics that facilitate the sharing, exchange, and manipulation of structured documents over organizational boundaries.
17. What are document type definition (DTD) documents and what do they do? Companies that exchange data using XML must have a way to understand and validate each other’s tags. One way to accomplish that task is through the use of Document Type Definitions. A Document Type Definition (DTD) is a file with a .dtd extension that describes XML elements—in effect, a DTD file provides the composition of the database’s logical model, and defines the syntax rules or valid tags for each type of XML document. (The DTD component is very similar to having a public data dictionary for business data.)
18. What are XML schema definition (XSD) documents and what do they do? An XML Schema Definition (XSD) document is an advanced data definition language that is used to describe the structure (elements, data types, relationship types, ranges, and default values) of XML data documents. Unlike a DTD document, which uses a unique syntax, an XML Schema Definition (XSD) file uses a syntax that resembles an XML document. One of the main advantages of an XML schema is that it more closely maps to database terminology and features. For example, an XML schema will be able to define common database types, such as date, integer or decimal, minimum and maximum values, list of valid values, and required elements. Using the XML schema, a company would be able to validate the data for values that may be out of range, incorrect dates, valid values, and so on. For example, a university application must be able to specify that a GPA value must be between zero and 4.0 and it must be able to detect an invalid birth date such as “14/13/1987.” (There is no 14th month.) Many vendors are rapidly adopting this new standard and are supplying tools to translate DTD documents into XML Schema Definition (XSD) documents. It is widely expected that XML schemas will replace DTD as the method to describe XML data.
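An XML schema is itself an XML document, so it can be inspected with ordinary XML tools. The fragment below is a hypothetical XSD constraining a GPA element to the range 0 to 4.0, as in the example above; it is parsed here with Python's standard library purely to show its structure. (Actually validating instance documents against a schema requires a third-party package such as lxml or xmlschema; the standard library does not do schema validation.)

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# A hypothetical schema fragment: GPA must be a decimal between 0.0 and 4.0.
xsd = f"""<?xml version="1.0"?>
<xs:schema xmlns:xs="{XS}">
  <xs:element name="GPA">
    <xs:simpleType>
      <xs:restriction base="xs:decimal">
        <xs:minInclusive value="0.0"/>
        <xs:maxInclusive value="4.0"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>"""

# Because XSD is XML, a plain XML parser can read the schema document itself.
root = ET.fromstring(xsd)
assert root.tag == f"{{{XS}}}schema"

# Pull out the declared range limits, in document order.
limits = [e.get("value") for e in root.iter()
          if e.tag.endswith("minInclusive") or e.tag.endswith("maxInclusive")]
print(limits)  # ['0.0', '4.0']
```

This is the practical advantage over DTDs mentioned above: the range constraint is expressed as ordinary, machine-readable XML rather than in the DTD's separate, non-XML syntax.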
19. What is JDBC, and what is it used for?

JDBC (Java Database Connectivity) is discussed in detail in Section 15-1e. Java is an object-oriented programming language developed by Sun Microsystems that can run inside Web browser software, and it is one of the most common programming languages for Web development. Sun Microsystems created Java as a "write once, run anywhere" environment: a programmer can write a Java application once and then, without any modification, run it in multiple environments (Microsoft Windows, Apple OS X, IBM AIX, etc.). The cross-platform capabilities of Java are based on its portable architecture. Java code is normally stored in pre-processed chunks known as applets that run in a virtual machine environment in the host operating system. This environment has well-defined boundaries, and all interactivity with the host operating system is closely monitored. Sun provides Java runtime environments for most operating systems (from computers to handheld devices to TV set-top boxes). Another advantage of Java is its "on-demand" architecture: when a Java application loads, it can dynamically download all of its modules or required components via the Internet.
When Java applications need to access data outside the Java runtime environment, they use predefined application programming interfaces. Java Database Connectivity (JDBC) is an application programming interface that allows a Java program to interact with a wide range of data sources (relational databases, tabular data sources, spreadsheets, and text files). JDBC allows a Java program to establish a connection with a data source, prepare and send SQL code to the database server, and process the result set. One of the main advantages of JDBC is that it allows a company to leverage its existing investment in technology and personnel training: JDBC allows programmers to use their SQL skills to manipulate the data in the company's databases. JDBC allows direct access to a database server or access via database middleware, and it also provides a way to connect to databases through an ODBC driver. (Figure 15.7 in the text illustrates the basic JDBC architecture and the various database access styles.) The database access architecture in JDBC is very similar to the ODBC/OLE-DB/ADO.NET architecture; all database access middleware shares similar components and functionality. One advantage of JDBC over other middleware is that it requires no configuration on the client side: the JDBC driver is automatically downloaded and installed as part of the Java applet download. Because Java is a Web-based technology, applications can connect to a database directly using a simple URL. Once the URL is invoked, the Java architecture comes into play, the necessary applets are downloaded to the client (including the JDBC database driver and all configuration information), and the applets are executed securely in the client's runtime environment. Every day, more companies are investing resources in developing and expanding their Web presence and finding ways to do more business on the Internet.
Such business will generate increasing amounts of data that will be stored in databases. Java and the .NET framework are part of the trend toward increasing reliance on the Internet as a critical business resource. In fact, it has been said that
the Internet will become the development platform of the future. In the next section you will learn more about Internet databases and how they are used.
20. What is cloud computing, and why is it a "game changer"?

According to the National Institute of Standards and Technology (NIST), cloud computing is "a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." The term cloud services is used in this book to refer to the services provided by cloud computing. Cloud services allow any organization to quickly and economically add information technology services such as applications, storage, servers, processing power, databases, and infrastructure to its IT portfolio (see Figure 15.21). Cloud computing is important for database technologies because it has the potential to become a "game changer." Cloud computing eliminates financial and technological barriers so organizations can leverage database technologies in their business processes with minimal effort and cost. In fact, cloud services have the potential to turn basic IT services into "commodity" services such as electricity, gas, and water, and to enable a revolution that could change not only the way that companies do business, but the IT business itself. As Nicholas Carr put it so vividly: "Cloud computing is for IT what the invention of the power grid was for electricity." For example, imagine that the chief technology officer of a nonprofit organization wants to add e-mail services to the IT portfolio. A few years ago, this proposition would have implied building the e-mail system's infrastructure from the ground up, including hardware, software, setup, configuration, operation, and maintenance. However, in today's cloud computing era, you can use Google Apps for Business or Microsoft Exchange Online and get a scalable, flexible, and more reliable e-mail solution for a fraction of the cost.
The best part is that you do not have to worry about the daily chores of managing and maintaining the IT infrastructure, such as OS updates, patches, security, fault tolerance, and recovery. What used to take months or years to implement can now be done in a matter of minutes.
21. Name and contrast the types of cloud computing implementation.

Section 15-4a describes this item in detail. There are basically three cloud computing implementation types (based on who the target customers are):
• Public cloud. This type of cloud infrastructure is built by a third-party organization to sell cloud services to the general public. The public cloud is the most common type of cloud implementation; examples include Amazon Web Services (AWS), Google Application Engine, and Microsoft Azure. In this model, cloud consumers share resources with other consumers transparently. The public cloud infrastructure is managed exclusively by the third-party provider.
• Private cloud. This type of internal cloud is built by an organization for the sole purpose of servicing its own needs. Private clouds are often used by large, geographically dispersed organizations to add agility and flexibility to internal IT services. The cloud infrastructure could be managed by internal IT staff or an external third party.
• Community cloud. This type of cloud is built by and for a specific group of organizations that share a common trade, such as agencies of the federal government, the military, or higher education. The cloud infrastructure could be managed by internal IT staff or an external third party.
22. Name and describe the most prevalent characteristics of cloud computing services.

Section 15-4b describes the basic characteristics of cloud computing services. In summary, the characteristics are:
• Ubiquitous access via the Internet.
• Shared infrastructure.
• Lower costs and variable pricing.
• Flexible and scalable services.
• Dynamic provisioning.
• Service orientation.
• Managed operations.
23. Using the Internet, search for providers of cloud services. Then, classify the types of services they provide (SaaS, PaaS, and IaaS).

A starting point will be the examples shown in Figure 15.22. Further examples are:
• DropBox.com, a cloud storage service provider (IaaS)
• Carbonite.com, which provides online backup of data (SaaS)
• iCloud.com (Apple), which provides storage and synchronization of Apple device data: contacts, music, apps, photos, documents, and backups (IaaS and SaaS)
• GoodData.com, a business intelligence platform (PaaS)
• Heroku.com, a Ruby Web programming environment service (PaaS)
24. Summarize the main advantages and disadvantages of cloud computing services. Table 15.4 summarizes the main advantages and disadvantages of cloud computing services.
25. Define SQL data services and list their advantages.

SQL data services refer to Internet-based data management services that provide access to hosted relational data management using standard protocols and common programming interfaces. The advantages of SQL data services include:
• High reliability and scalability of relational database capabilities at a low cost
• High level of failure tolerance
• Dynamic and automatic load balancing
• Automated data backup and recovery
• Dynamic creation and allocation of database processes and storage
Problem Solutions
ONLINE CONTENT The databases used in the Problems for this chapter can be found at www.cengagebrain.com.
PROBLEMS In the following exercises, you set up database connectivity using MS Excel.
NOTE: Although the precise steps to set up data connectivity vary slightly according to the version of Excel and operating system platform you are using, in general terms, the steps are as outlined in the sections below.

1. Use MS Excel to connect to the Ch02_InsureCo MS Access database using ODBC, and retrieve all of the AGENTs.

To perform this task, complete the following steps:
• From Excel, select the Data, From Other Sources, From Microsoft Query options to retrieve data from an ODBC data source.
• Select the MS Access Database* option and click OK.
• Select the database file location and click OK.
• Select the table and columns to use in the query (select all columns) and click Next.
• On the Query Wizard – Filter Data screen, click Next.
• On the Query Wizard – Sort Order screen, click Next.
• Select Return Data to Microsoft Office Excel.
• Position the cursor where you want the data to be placed on your spreadsheet and click OK.
The solution is shown in Figure P15.1.
Figure P15.1 Solution to Problem 1 – Retrieve all AGENTs
2. Use MS Excel to connect to the Ch02_InsureCo MS Access database using ODBC, and retrieve all of the CUSTOMERs.

To perform this task, complete the following steps:
• From Excel, select the Data, From Other Sources, From Microsoft Query options to retrieve data from an ODBC data source.
• Select the MS Access Database* option and click OK.
• Select the database file location and click OK.
• Select the table and columns to use in the query (select all columns) and click Next.
• On the Query Wizard – Filter Data screen, click Next.
• On the Query Wizard – Sort Order screen, click Next.
• Select Return Data to Microsoft Office Excel.
• Position the cursor where you want the data to be placed on your spreadsheet and click OK.
The solution is shown in Figure P15.2.
Figure P15.2 Solution to Problem 2 – Retrieve all CUSTOMERs
3. Use MS Excel to connect to the Ch02_InsureCo MS Access database using ODBC, and retrieve the customers whose AGENT_CODE is equal to 503.

To perform this task, complete the following steps:
• From Excel, select the Data, From Other Sources, From Microsoft Query options to retrieve data from an ODBC data source.
• Select the MS Access Database* option and click OK.
• Select the database file location and click OK.
• Select the table and columns to use in the query (select all columns) and click Next.
• On the Query Wizard – Filter Data screen, select the AGENT_CODE column, select "equals" from the left drop-down box, select "503" from the right drop-down box, and then click Next.
• On the Query Wizard – Sort Order screen, click Next.
• Select Return Data to Microsoft Office Excel.
• Position the cursor where you want the data to be placed on your spreadsheet and click OK.
The results are shown in Figure P15.3.
Figure P15.3
Solution to Problem 3 – Retrieve all CUSTOMERs with AGENT_CODE=503
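Behind the wizard, the filter step amounts to generating SQL of the form `SELECT * FROM CUSTOMER WHERE AGENT_CODE = 503`. The same query can be run programmatically; the sketch below uses Python's sqlite3 module with invented sample rows as a stand-in (the real exercise runs against the Ch02_InsureCo Access database over ODBC, e.g., via pyodbc).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (cus_code INTEGER, cus_lname TEXT, agent_code INTEGER)")
# Sample rows invented for illustration only.
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(10010, 'Ramas', 502), (10011, 'Dunne', 501),
                  (10012, 'Smith', 503), (10014, 'Orlando', 503)])

# The wizard's "AGENT_CODE equals 503" filter, expressed as SQL.
matches = conn.execute(
    "SELECT cus_code, cus_lname FROM customer "
    "WHERE agent_code = 503 ORDER BY cus_code").fetchall()
conn.close()
print(matches)  # [(10012, 'Smith'), (10014, 'Orlando')]
```

Seeing the generated SQL is useful because every later step in the wizard (sort order, returned columns) also maps onto a clause of the same SELECT statement.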
4. Create a System DSN ODBC connection called Ch02_SaleCo using the Administrative Tools section of the Windows Control Panel.

To create the DSN, complete the following steps:
• Using Windows XP, open the Control Panel, open Administrative Tools, and open Data Sources (ODBC).
• Click the System DSN tab, click Add, select the Microsoft Access Driver (*.mdb) driver, and click Finish.
• On the ODBC Microsoft Access Setup window, enter Ch02_SaleCo in the Data Source Name field.
• Under Database, click the Select button, browse to the location of the MS Access file, and click OK twice.
• The new System DSN now appears in the list of system data sources.
The results are shown in Figure P15.4.
Figure P15.4 Solution to Problem 4 – Create Ch02_SaleCo System DSN
5. Use MS Excel to list all of the invoice lines for Invoice 1003 using the Ch02_SaleCo System DSN.

To perform this task, complete the following steps:
• From Excel, select the Data, Import External Data, and New Database Query options to retrieve data from an ODBC data source.
• Select the Ch02_SaleCo data source and click OK.
• Select the LINE table, select all columns, and click Next.
• On the Query Wizard – Filter Data screen, select the INV_NUMBER column, select "equals" from the left drop-down box, select "1003" from the right drop-down box, and then click Next.
• On the Query Wizard – Sort Order screen, click Next.
• Select Return Data to Microsoft Office Excel.
• Position the cursor where you want the data to be placed on your spreadsheet and click OK.
The results are shown in Figure P15.5.
Figure P15.5
Solution to Problem 5 – Retrieve all invoice LINEs with INV_NUMBER=1003
6. Create a System DSN ODBC connection called Ch02_TinyCollege using the Administrative Tools section of the Windows Control Panel.

To perform this task, complete the following steps:
• Using Windows XP, open the Control Panel, open Administrative Tools, and open Data Sources (ODBC).
• Click the System DSN tab, click Add, select the Microsoft Access Driver (*.mdb) driver, and click Finish.
• On the ODBC Microsoft Access Setup window, enter Ch02_TinyCollege in the Data Source Name field.
• Under Database, click the Select button, browse to the location of the MS Access file, and click OK twice.
• The new System DSN now appears in the list of system data sources.
7. Use MS Excel to list all classes taught in room KLR200 using the Ch02_TinyCollege System DSN.

To perform this task, complete the following steps:
• From Excel, select the Data, Import External Data, and New Database Query options to retrieve data from an ODBC data source.
• Select the Ch02_TinyCollege data source and click OK.
• Select the CLASS table, select all columns, and click Next.
• On the Query Wizard – Filter Data screen, select the CLASS_ROOM column, select "equals" from the left drop-down box, select "KLR200" from the right drop-down box, and then click Next.
• On the Query Wizard – Sort Order screen, click Next.
• Select Return Data to Microsoft Office Excel.
• Position the cursor where you want the data to be placed on your spreadsheet and click OK.
The results of these actions are shown in Figure P15.7.
Figure P15.7
Solution to Problem 7 – Retrieve all classes taught in room KLR200
8. Create a sample XML document and DTD for the exchange of customer data. The solutions are shown in Figures P15.8a and P15.8b.
Figure P15.8a Customer DTD Solution
Figure P15.8b Customer XML Solution
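Since the solution figures are images, a comparable (and entirely hypothetical) customer DTD and matching XML instance might look like the strings below. The XML is parsed with Python's standard library to confirm it is well formed; actually validating the instance against the DTD would require a validating parser such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD: a list of customers, each with a few PCDATA fields.
customer_dtd = """<!ELEMENT CustomerList (Customer+)>
<!ELEMENT Customer (CusCode, LName, FName, Phone, Balance)>
<!ELEMENT CusCode  (#PCDATA)>
<!ELEMENT LName    (#PCDATA)>
<!ELEMENT FName    (#PCDATA)>
<!ELEMENT Phone    (#PCDATA)>
<!ELEMENT Balance  (#PCDATA)>"""

# A matching instance document (sample values invented).
customer_xml = """<?xml version="1.0"?>
<!DOCTYPE CustomerList SYSTEM "customer.dtd">
<CustomerList>
  <Customer>
    <CusCode>10010</CusCode>
    <LName>Ramas</LName>
    <FName>Alfred</FName>
    <Phone>615-844-2573</Phone>
    <Balance>0.00</Balance>
  </Customer>
</CustomerList>"""

root = ET.fromstring(customer_xml)
names = [c.findtext("LName") for c in root.findall("Customer")]
print(names)  # ['Ramas']
```

The same pattern (one container element, one repeating record element, one child element per column) carries over directly to the product and order documents in Problems 9 and 10.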
9. Create a sample XML document and DTD for the exchange of product and pricing data. The solutions are shown in Figures P15.9a and P15.9b.
Figure P15.9a Product DTD Solution
Figure P15.9b Product XML Solution
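Since the figure images are not reproduced here, a minimal sketch of a Problem 9 solution follows; the element names and sample data are illustrative assumptions rather than the figures' exact content:

```xml
<?xml version="1.0"?>
<!DOCTYPE ProductList [
<!ELEMENT ProductList (Product+)>
<!ELEMENT Product (P_Code, P_Descript, P_QOH, P_Price)>
<!ELEMENT P_Code (#PCDATA)>
<!ELEMENT P_Descript (#PCDATA)>
<!ELEMENT P_QOH (#PCDATA)>
<!ELEMENT P_Price (#PCDATA)>
]>
<ProductList>
   <Product>
      <P_Code>2232/QTY</P_Code>
      <P_Descript>B&amp;D jigsaw, 12-in. blade</P_Descript>
      <P_QOH>8</P_QOH>
      <P_Price>109.92</P_Price>
   </Product>
</ProductList>
```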
10. Create a sample XML document and DTD for the exchange of order data. The solutions are shown in Figures P15.10a and P15.10b.
Figure P15.10a Order DTD Solution
Figure P15.10b Order XML Solution
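Since the figure images are not reproduced here, a minimal sketch of a Problem 10 solution follows; the element names and sample data are illustrative assumptions. Note how the repeating OrderLine element captures the 1:M relationship between an order and its line items:

```xml
<?xml version="1.0"?>
<!DOCTYPE OrderList [
<!ELEMENT OrderList (Order+)>
<!ELEMENT Order (Order_Num, Order_Date, Cus_Code, OrderLine+)>
<!ELEMENT OrderLine (P_Code, Line_Units, Line_Price)>
<!ELEMENT Order_Num (#PCDATA)>
<!ELEMENT Order_Date (#PCDATA)>
<!ELEMENT Cus_Code (#PCDATA)>
<!ELEMENT P_Code (#PCDATA)>
<!ELEMENT Line_Units (#PCDATA)>
<!ELEMENT Line_Price (#PCDATA)>
]>
<OrderList>
   <Order>
      <Order_Num>1001</Order_Num>
      <Order_Date>2024-03-15</Order_Date>
      <Cus_Code>10010</Cus_Code>
      <OrderLine>
         <P_Code>2232/QTY</P_Code>
         <Line_Units>2</Line_Units>
         <Line_Price>109.92</Line_Price>
      </OrderLine>
   </Order>
</OrderList>
```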
11. Create a sample XML document and DTD for the exchange of student transcript data. Use your college transcript as a sample. The solution to Problem 11 follows the same format as the previous solutions. However, because Problem 11 requires students to do some research regarding the information that goes into transcript data, we have not included a specific solution here. Encourage students to use their creativity and analytical skills to research and create a simple XML file containing the data that is customary at your university. Not all fields in the student transcript must be included in this exercise; allow students to represent just the most important fields.
Chapter 16 Database Administration and Security
Discussion Focus

The following discussion sequence is designed to fit the chapter objectives:
• Illustrate the value of data as a corporate asset to be managed.
• Explain the data-information-decision cycle and demonstrate how this cycle may be supported through the use of a DBMS.
• Emphasize the role of databases within an organization and relate this role to the data-information-decision cycle; then show how this role is essential at all managerial levels.
• Discuss the evolution of the data administration (DA) function, starting with the DP department and ending with the MIS department. During this discussion, emphasize the change in managerial emphasis from an operational orientation to a more tactical and strategic orientation. Illustrate how a DBMS can foster a company's success; examples from companies involved in banking, air services, and financial services are particularly illustrative.
• Show the different ways of positioning the DBA function within an organization; emphasize how such positioning is a function of the company's internal organization.
• Contrast the DBA and DA functions.
• Discuss the DBA's technical and managerial roles.
• Explain the importance of data security and database security.
• Show how data dictionaries and CASE tools fit into data administration.
Answers to Review Questions

Note: To ensure in-depth chapter coverage, most of the following questions cover the same material that we covered in detail in the text. Therefore, in most cases, we merely cite the specific section, rather than duplicate the text presentation.

1. Explain the difference between data and information. Give some examples of raw data and information.

Given the importance of the distinction between data and information, we addressed the topic in several chapters. This question was first addressed in Chapter 1, “Database Concepts,” Section 1-1, “Data vs. Information.” Emphasize that one of the key purposes of having an information system is to facilitate the transformation of data into information. In turn, information becomes the basis for decision making. (See Figure 16.1, “The data-information-decision making cycle.”) We revisit the data/information transformation in Chapter 13, “Business Intelligence and Data Warehouses,” Section 13-1, “The Need for Data Analysis.” Section 13-2, “Business Intelligence,” addresses the use of a comprehensive, cohesive, and integrated decision support framework, which is designed to assist managerial decision making within an organization
and which, therefore, includes an extensive data-to-information transformation component. Figures 13.1 (Business Intelligence Framework) and 13.2 (Business Intelligence Components) illustrate the BI framework's main components, so use these figures as the focus for discussion. Finally, review the transformation of operational data into decision support data, using Figure 13.3, “Transforming Operational Data into Decision Support Data,” as the basis for discussion.

Data are raw facts of interest to an end user. Examples of data include a person's date of birth, an employee name, the number of pencils in stock, etc. Data represent a static aspect of a real-world object, event, or thing. Information is processed data. That is, information is the product of applying some analytical process to data. Typically, we represent the information generation process as shown in Figure Q16.1.
Figure Q16.1 Transformation of Data into Information
[Figure: a column of raw numeric readings (e.g., 123.4, 122.8, 130.1, ...) is fed through a process to yield information.]
For example, invoice data may include the invoice number, customer, items purchased, invoice total, etc. The end user can generate information by tabulating such data and computing totals by customer, cash purchase summaries, credit purchase summaries, a list of most-frequently purchased items, etc. Since the data-information transformation is crucial, it is important to keep emphasizing that data stored in the database constitute the raw material for the creation of information. For example, data in a CUSTOMER table might be transformed to provide customer information about age distribution and gender as shown in Figure Q16.2:
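The invoice tabulation described above can be sketched in SQL. The table and column names (INVOICE, CUS_CODE, INV_TOTAL) are hypothetical and serve only to illustrate the data-to-information transformation:

```sql
-- Information: purchase totals per customer, derived from raw invoice data
SELECT CUS_CODE,
       COUNT(*)       AS NUM_INVOICES,
       SUM(INV_TOTAL) AS TOTAL_PURCHASES
FROM INVOICE
GROUP BY CUS_CODE
ORDER BY TOTAL_PURCHASES DESC;
```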
Figure Q16.2 Customer Information Summary
[Figure: bar chart of customer age distribution by gender, with age brackets 0–29, 30–39, 40–49, and 50 and over for males and females.]
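The age-distribution summary shown in Figure Q16.2 could be produced with an aggregate query. The CUSTOMER columns used here (CUS_AGE, CUS_GENDER) are illustrative assumptions:

```sql
-- Bucket customers into the age brackets shown in Figure Q16.2
SELECT CUS_GENDER,
       CASE
          WHEN CUS_AGE < 30 THEN '0-29'
          WHEN CUS_AGE < 40 THEN '30-39'
          WHEN CUS_AGE < 50 THEN '40-49'
          ELSE '50 and over'
       END AS AGE_BRACKET,
       COUNT(*) AS NUM_CUSTOMERS
FROM CUSTOMER
GROUP BY CUS_GENDER,
         CASE
            WHEN CUS_AGE < 30 THEN '0-29'
            WHEN CUS_AGE < 40 THEN '30-39'
            WHEN CUS_AGE < 50 THEN '40-49'
            ELSE '50 and over'
         END;
```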
Similarly, data in a CAR table might be transformed to provide information that relates displacement to horsepower as shown in Figure Q16.3:
Figure Q16.3 Engine Horsepower vs. Displacement
[Figure: chart relating engine horsepower (vertical axis) to displacement (horizontal axis).]
Data transformations into information can be accomplished in many ways, using simple tabulation, graphics, statistical modeling, etc.

2. Define dirty data and identify some of its sources.

Dirty data is data that contains inaccuracies or inconsistencies (i.e., data that lacks integrity). Dirty data may result from a lack of enforcement of integrity constraints, typographical errors, the use of synonyms and homonyms across systems, the use of nonstandard abbreviations, or differences in the decomposition of composite attributes.

3. What is data quality, and why is it important?

Data quality is a comprehensive approach to ensuring the accuracy, validity, and timeliness of the data. Data quality is important because without quality data, accurate and timely information cannot be produced. Without accurate and timely information, it is difficult (if not impossible) to make good decisions; and without good decisions, organizations will fail in their competitive environments.

4. Explain the interactions among end user, data, information, and decision-making. Draw a diagram and explain the interactions.

See Section 16-1. The interactions are illustrated in Figure 16.1. Emphasize the end user's role throughout the process. It is the end user who must analyze data to produce the information that is later used in decision making. Most business decisions create additional data that will be used to monitor and evaluate the company situation. Thus data will be, or should be, recycled in order to produce feedback concerning an action's effectiveness and efficiency.

5. Suppose that you are a DBA. What data dimensions would you describe to top-level managers to obtain their support for endorsing the data administration function?

The first step will be to emphasize the importance of data as a company asset, to be managed as any other asset.
Top-level managers must understand this crucial notion and must be willing to commit company resources to manage data as an organizational asset. The next step is to identify and define the need for and role of the DBMS in the organization. Refer the student to Section 16-2 and apply the concepts discussed in this section to a teacher-selected organization. Managers and end users must understand how the DBMS can enhance and support the work of the organization at all levels (top management, middle management, and operational). Finally, the impact of a DBMS introduction into an organization must be illustrated and explained. Refer to Section 16-3 to accomplish this task. Note particularly the technical, managerial, and cultural aspects of the process.

6. How and why did database management systems become the organizational data management standard in organizations? Discuss some of the advantages of the database approach over the file system approach.
Contrast the file system's "single-ownership" approach with the DBMS's "shared-ownership" approach. Make sure that students are made aware of the change in focus or function when the shift from the file system to the DBMS occurs. In other words, show what happens when the data processing (DP) department becomes a management information systems (MIS) department. Using Section 16-3, discuss how the change from DP to MIS shifts data management from an operational level to a tactical or strategic level.

7. Using a single sentence, explain the role of databases in organizations. Then explain your answer.

The single sentence will be:
The database's predominant role is to support managerial decision making at all levels in the organization. Refer to Section 16-2 for a complete explanation of the role(s) played by an organization's DBMS.

8. Define security and privacy. How are these two concepts related?

Security means protecting the data against accidental or intentional use by unauthorized users. Privacy deals with the rights of people and organizations to determine who accesses the data and when, where, and how the data are to be used. The two concepts are closely related. In a shared system, individual users must ensure that the data are protected from unauthorized use by other individuals. Also, the individual user must have the right to determine who, when, where, and how other users use the data. The DBMS must provide the tools to allow such flexible management of the data security and access rights in a company database.

9. Describe and contrast the information needs at the strategic, tactical, and operational levels in an organization. Use examples to explain your answer.

See Section 16-2 to contrast the different DBMS roles at each managerial level. Use Figures 16.3-16.5 as the basis for your discussions.
10. What special considerations must you take into account when introducing a DBMS into an organization?

See Section 16-3. We suggest that you start a discussion about the special considerations (managerial, technical, and cultural) to be taken into account when a new DBMS is to be introduced in an organization. For example, focus the discussion on questions such as:
• What about retraining requirements for the new system?
➢ Who needs to be retrained?
➢ What must be the type and extent of the retraining?
• Is it reasonable to expect some resistance to change
➢ from the computer services department administrator(s)?
➢ from secretaries?
➢ from technical support personnel?
➢ from other departmental end users?
• How will the resistance in the preceding question be manifested?
• How will you deal with such resistance?

11. Describe the DBA's responsibilities.

The database administrator (DBA) is the person responsible for the control and management of the shared database within an organization. The DBA controls the database administration function within the organization. The data administrator (DA), in contrast, is responsible for managing the overall corporate data resource, both computerized and noncomputerized. Therefore, the DA is given a higher degree of responsibility and authority than the DBA. Depending on organizational style, the DBA and DA roles may overlap and may even be combined in a single position or person. The DBA position requires both managerial and technical skills. Refer to Section 16-5 and Table 16.1 to explain and illustrate the general responsibilities of the DA and DBA functions.

12. How can the DBA function be placed within the organization chart? What effect(s) will such placement have on the DBA function?

The DBA function placement varies from company to company and may be either a staff or line position.
In a staff position, the DBA function creates a consulting environment in which the DBA is able to devise the overall data administration strategy but does not have the authority to enforce it. In a line position, the DBA function has both the responsibility and the authority to plan, define, implement, and enforce the policies, standards, and procedures.
13. Why and how are new technological advances in computers and databases changing the DBA's role?

See Section 16-5, particularly Section 16-5b, "The DBA's Technical Role." Then tie this discussion to the increasing use of web applications. The DBA function is probably one of the most dynamic functions of any organization. New technological developments constantly change the DBA function. For example, note how each of the following has an effect on the DBA function:
• the development of the DDBMS
• the development of the OODBMS
• the increasing use of LANs
• the rapid integration of intranet and extranet applications and their effects on database design, implementation, and management (security issues become especially important!)

14. Explain the DBA department's internal organization, based on the DBLC approach.

See Section 16-4, especially Figures 16.4 and 16.5.

15. Explain and contrast the differences and similarities between the DBA and DA.

See Section 16-5, especially Table 16.1.

16. Explain how the DBA plays an arbitration role for an organization's two main assets. Draw a diagram to facilitate your explanation.

See Section 16-5, especially Figure 16.6.

17. Describe and characterize the skills desired for a DBA.

See Section 16-5, especially Table 16.2.

18. What are the DBA's managerial roles? Describe the managerial activities and services provided by the DBA.

See Section 16-5a, especially Table 16.3.

19. What DBA activities are used to support end users?

See Section 16-5a.
20. Explain the DBA's managerial role in the definition and enforcement of policies, procedures, and standards. See Section 16-5a. 21. Protecting data security, privacy, and integrity are important database functions. What activities are required in the DBA's managerial role of enforcing these functions? See Section 16-5a. 22. Discuss the importance and characteristics of database data backup and recovery procedures. Then describe the actions that must be detailed in backup and recovery plans. See Section 16-5a. 23. Assume that your company assigned you the responsibility of selecting the corporate DBMS. Develop a checklist for the technical and other aspects involved in the selection process. See Section 16-5b. The checklist is shown in the "DBMS and Utilities Evaluation, Selection, and Installation" segment. 24. Describe the activities that are typically associated with the design and implementation services of the DBA technical function. What technical skills are desirable in the DBA's personnel? See Section 16-5b. 25. Why are testing and evaluation of the database and applications not done by the same people who are responsible for the design and implementation? What minimum standards must be met during the testing and evaluation process? See Section 16-5b. Note particularly the material in the "Testing and Evaluation of databases and Applications" segment. 26. Identify some bottlenecks in DBMS performance, and then propose some solutions used in DBMS performance tuning. See Section 16-5b. Also see Chapter 11, “Database Performance Tuning and Query Optimization.” 27. What are the typical activities involved in the maintenance of the DBMS and its utilities and applications? Would you consider application performance tuning to be part of the maintenance activities? Explain your answer. See Section 16-5b. Database performance tuning is part of the maintenance activities. As the database system enters into operation, the database starts to grow. 
Resources initially assigned to the application are sufficient for the initial loading of the database. As the system grows, the database becomes bigger,
and the DBMS requires additional resources to satisfy the demands on the larger database. Database performance will decrease as the database grows and more users access it.

28. How do you normally define security? How is your definition of security similar to or different from the definition of database security in this chapter?

See Section 16-6.

29. What are the levels of data confidentiality?

See Section 16-6. The levels are highly restricted, confidential, and unrestricted.

30. What are security vulnerabilities? What is a security threat? Give some examples of security vulnerabilities that exist in different IS components.

See Section 16-6b.

31. Define the concept of a data dictionary, and discuss the different types of data dictionaries. If you were to manage an organization's entire data set, what characteristics would you look for in the data dictionary?

See Section 16-7a.

32. Using SQL statements, give some examples of how you would use the data dictionary to monitor the security of the database.

NOTE: If you use IBM's DB2, the names of the main tables are SYSTABLES, SYSCOLUMNS, and SYSTABAUTH.
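As a starting point for Problem 32, queries such as the following could be run against Oracle's standard data dictionary views (DBA_USERS, DBA_TAB_PRIVS, DBA_ROLE_PRIVS); the exact columns available depend on the DBMS and version:

```sql
-- Who are the database users, and what is the status of their accounts?
SELECT USERNAME, ACCOUNT_STATUS, CREATED
FROM DBA_USERS;

-- Which object privileges have been granted, by whom, and to whom?
SELECT GRANTEE, OWNER, TABLE_NAME, PRIVILEGE, GRANTOR
FROM DBA_TAB_PRIVS;

-- Which roles have been granted to which users?
SELECT GRANTEE, GRANTED_ROLE, ADMIN_OPTION
FROM DBA_ROLE_PRIVS;
```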
See Section 16-7a.

33. What characteristics do a CASE tool and a DBMS have in common? How can these characteristics be used to enhance the data administration?

See Section 16-7b.

34. Briefly explain the concepts of information engineering (IE) and information systems architecture (ISA). How do these concepts affect the data administration strategy?

See Section 16-8.

35. Identify and explain some of the critical success factors in the development and implementation of a good data administration strategy.
See Section 16-8.

36. How have cloud-based data services affected the DBA's role?

The answer to this question is explained in detail in Section 16-9.

37. What is the tool used by Oracle to create users?

See Section 16-10d. Note the Oracle Security Manager screen in Figure 16.13 and the Create User dialog box in Figure 16.14.

38. In Oracle, what is a tablespace?

See Section 16-10c. The following summary is useful:
• A tablespace is a logical storage space.
• Tablespaces are primarily used to logically group related data.
• Tablespace data are physically stored in one or more datafiles.

39. In Oracle, what is a database role?

See Section 16-10d. A database role is a named collection of database access privileges that authorize a user to perform specified actions on the database. Examples of roles are CONNECT, RESOURCE, and DBA.

40. In Oracle, what is a datafile? How does it differ from a file system's file?

See Section 16-10c. The following summary will be useful:
• A database is composed of one or more tablespaces. Therefore, there is a 1:M relationship between the database and its tablespaces.
• Tablespace data are physically stored in one or more datafiles. Therefore, there is a 1:M relationship between tablespaces and datafiles.
• A datafile physically stores the database data.
• Each datafile is associated with one and only one tablespace. (But each datafile can reside in a different directory on the same hard disk -- or even on different disks.)

In contrast to the datafile, a file system's file is created to store data about a single entity, and the programmer can directly access the file. But file access requires the end user to know the structure of the data that are stored in the file.
While a database is stored as a file, this file is created by the DBMS, rather than by the end user. Because the DBMS handles all file operations, the end user does not know -- nor does that end user need to know -- the database's file structure. When the DBA creates a database -- or, more accurately, uses the Oracle Storage Manager to let Oracle create a database -- Oracle automatically creates the necessary tablespaces and datafiles. We have summarized the basic database components logically in Figure Q16.39.
Figure Q16.39 The Logical Tablespace and Datafile Components of an Oracle Database

[Figure: diagram of a basic Oracle database environment for the ROBCOR database (schema names CCORONEL and PROB; tables EMPLOYEE, CHARTER, PILOT, AIRCRAFT, and MODEL), showing the SYSTEM, USERS, and TEMPORARY tablespaces on disks C, D, and E. Its annotations note the following:]
• Each database can contain many tablespaces. Each tablespace consists of one or more datafiles, and each datafile "belongs" to one tablespace.
• The SYSTEM tablespace contains all object ownership records, the data dictionary, and the names and addresses of all tablespaces, tables, indexes, and clusters.
• All user table definitions are stored in the data dictionary. All user objects carry the username. (User name and schema name are the same thing.) For example, the EMPLOYEE table in the PROB schema is identified as PROB.EMPLOYEE.
• All user tables are located in the USERS tablespace.
• The SYSTEM, USERS, and TEMPORARY tablespaces may be located on the same hard disk. However, the three-disk option shown in the figure is preferred. The datafiles within each of the tablespaces (e.g., SYSTEM1.ORA, 10 MB) are created by the DBA when the database is created.
• A tablespace can contain many tables, indexes, and clusters. If the (fixed-size) tablespace is full, the DBA -- who has resource privileges -- can add datafiles. A database table may have rows in more than one datafile.
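The tablespace-datafile relationship in Questions 38-40 can be illustrated with Oracle DDL; the tablespace and file names below are hypothetical:

```sql
-- Create a tablespace backed by one datafile (a 1:M relationship:
-- more datafiles can be added later with ALTER TABLESPACE)
CREATE TABLESPACE users_data
   DATAFILE 'D:\ORADATA\USERS01.DBF' SIZE 10M;

-- Add a second datafile, possibly on a different disk
ALTER TABLESPACE users_data
   ADD DATAFILE 'E:\ORADATA\USERS02.DBF' SIZE 10M;
```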
41. In Oracle, what is a database profile?

See Section 16-10d. A profile is a named collection of database settings that control how much of the database resource can be used by a given user.
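Roles (Question 39) and profiles (Question 41) can likewise be sketched in Oracle DDL; the role, profile, and user names used here are hypothetical:

```sql
-- A role is a named collection of privileges
CREATE ROLE app_user;
GRANT CREATE SESSION TO app_user;
GRANT SELECT, INSERT, UPDATE ON PROB.EMPLOYEE TO app_user;
GRANT app_user TO jsmith;

-- A profile limits how much of the database resource a user may consume
CREATE PROFILE clerk_profile LIMIT
   SESSIONS_PER_USER   2
   IDLE_TIME           30
   CONNECT_TIME        480;
ALTER USER jsmith PROFILE clerk_profile;
```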