Computer 6th Semester Database Management System
Chapter 1: Introduction
Data, Database and Database Management System

Data
Data are raw facts. Data is a representation of facts or concepts in a formalized manner suitable for communication and processing by automated means or by human beings. Information is data that has been refined and organized by processing and purposeful intelligence. Data and information may seem synonymous, but the distinction between them is that data refers to "what is actually stored in the database", while information refers to "the meaning of the data as understood by some user".

Database
A database is a collection of inter-related data and a set of programs to access that data. A database system is basically a computerized record-keeping system, i.e. a kind of electronic filing cabinet. In other words, it is a repository or container for a collection of computerized data files on which users can perform a variety of operations, e.g. adding new files to the database, inserting data into existing files, retrieving data from existing files, and removing files from the database.

Database Management System (DBMS)
A DBMS is a collection of programs that manages the database, its structure, and controls access to the data stored in the database.

File System
A file system is a method of storing and organizing computer files and the data they contain to make them easy to find and access. A file system may use a data storage device such as a hard disk or CD-ROM and involves maintaining the physical location of files.

History of Database
1950s and early 1960s: Data processing using magnetic tapes for storage; tapes provide only sequential access. Punched cards for input.
Late 1960s and 1970s: Hard disks allow direct access to data. Network and hierarchical data models come into widespread use.
Ted Codd defines the relational data model and would win the ACM Turing Award for this work. IBM Research begins the System R prototype, and UC Berkeley begins the Ingres prototype. High-performance (for the era) transaction processing appears.

Purpose of Database System
In the early days, database applications were built directly on top of the file system. The drawbacks of using a file system to store data are as follows (these are also the advantages of a DBMS):
1. Data Redundancy and Inconsistency: Multiple file formats and duplication of information in different files.
2. Difficulty in accessing data: A new program must be written to carry out each new task.
3. Data Isolation: Data are scattered in different files, which may be in different formats, so it is difficult to write applications that retrieve the appropriate data across all files and formats.
4. Integrity Problem: Data are stored in different files and may be used by different programs. In the long run, if a file format is changed to suit one program, other programs may stop working.
5. Atomicity Problem: Failures may leave the database in an inconsistent state with partial updates carried out. Example: a transfer of funds from one account to another should either complete or not happen at all.
6. Concurrent Access by Multiple Users: Two or more users accessing and updating the same data at the same time can leave the data in an inconsistent state.
7. Security Problem: Not only authorized users but also others can easily see the data stored in data files.
So, database systems offer solutions to all of the above problems.

Disadvantages of Database System
In spite of its benefits, there are situations in which a DBMS may involve unnecessary overhead costs that would not be incurred in a traditional file system. The overhead costs of using a DBMS are due to the following:
1. High initial investment in software, hardware and training.
2. Overhead for providing security, recovery and integrity functions.
3. Cost of maintaining the software, which continues for its lifetime.
4. Cost of backup and recovery.
Data Independence
Data independence can be defined as the capacity to change the schema at one level of a database system without having to change the schema at the next higher level. There are two types of data independence:
1. Logical data independence
This is the capacity to change the conceptual (logical) schema without having to change the external schemas or application programs. We may change the conceptual schema to expand the database (by adding a record type or data item), to change constraints, or to reduce the database. For example, adding an additional attribute to an entity should not affect the external schemas or application programs. If the conceptual schema undergoes a logical reorganization, application programs that reference the external schema constructs must work as before.
2. Physical data independence
This is the capacity to change the internal schema without having to change the conceptual schema. Changes to the internal schema may be needed because some physical files have to be reorganized, for example by creating additional access structures to improve the performance of retrieval or update. For example, providing an access path to improve the retrieval speed of section records by semester and year should not require a query to be changed, although the query should become more efficient by utilizing the access path.
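Both kinds of independence can be illustrated with a minimal SQL sketch (the section table, section_view view and column names here are hypothetical, chosen to match the example above):

Create table section(course_no varchar2(10), semester varchar2(10), year number(4));
Create view section_view as select course_no, semester, year from section;
-- Logical data independence: expanding the conceptual schema does not affect
-- programs that read only through section_view.
Alter table section add (instructor varchar2(30));
-- Physical data independence: adding an access path speeds up retrieval of
-- section records by semester and year without requiring any query to change.
Create index section_idx on section(semester, year);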
Data independence is achieved because when the schema is changed at some level, the schema at the next level remains unchanged; only the mapping* between the two levels is changed. (*Mapping is the process of transforming requests and results between levels.)
View of Data
A database system is a collection of interrelated files and a set of programs that allows users to access and modify these files. The major purpose of a database system is to provide users with an abstract view of the data, i.e. the system hides certain details of how the data are stored and maintained.

Data Abstraction
For the system to be usable, it must retrieve data efficiently. The need for efficiency has led designers to use complex data structures to represent data in the database. Since most database users are not computer-trained, developers hide this complexity from users through several levels of abstraction to simplify the user's interaction with the system. The three levels of abstraction are:
1. Physical Level
The lowest level of abstraction describes how the data are actually stored. The physical level describes complex low-level data structures in detail.
2. Logical Level
The logical level describes what data are stored in the database and what relationships exist among those data. Database administrators, who must decide what information to keep in the database, use the logical level of abstraction.
3. View Level
The highest level of abstraction describes only part of the entire database, as seen by particular users. Different users have access to only a part of the database. The system may provide many views for the same database.
View 1    View 2    View 3
        |
   Logical level
        |
   Physical level
Fig: Levels of Abstraction
Schema and Instances

Schema
The overall design of the database is called the database schema.
E.g.: Employee(name, age, salary)
The database schema is the description of the database specified during database design. It is analogous to the type information of a variable in a program.
Types of schema
1. Physical Schema: The database design at the physical level. It describes the physical storage of the database.
2. Logical Schema: The database design at the logical level. It hides the details of the physical storage structures and concentrates on describing entities, data types, relationships etc.
3. External View: Describes the part of the database that a particular user group is interested in and hides the rest of the database from that user group.
External View 1  ...  External View n
        |
   Logical Schema
        |
   Physical Schema
Fig: The three-schema architecture
Instances
The actual content of the database at a particular point in time is called an instance. It is analogous to the value of a variable.
Schema:   Employee(name, age, salary)
Instance: (ram, 25, 10000)
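In SQL terms (a minimal sketch of the same example), the CREATE TABLE statement defines the schema, while the rows added by INSERT make up the instance at a given point in time:

Create table employee(name varchar2(30), age number(3), salary number(8));
-- The table definition above is the schema; the row below is part of an instance.
Insert into employee values ('ram', 25, 10000);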
Database Users
Database users are the people who use the database.
1. Sophisticated Users
These users interact with the system without writing programs. They form their requests in a database query language. They submit each query to a query processor, whose function is to break down DML statements into instructions that the storage manager understands. E.g.: analysts.
2. Specialized Users
These are sophisticated users who write specialized database applications. Among these applications are computer-aided design systems, knowledge-base and expert systems, and complex data systems (audio/video).
3. Naïve Users
These are unsophisticated users who interact with the system by invoking one of the application programs that have been written previously. E.g.: a bank teller who needs to transfer Rs 5000 from account A to account B invokes a program called transfer.
4. Application Programmers
These are computer professionals who write application programs to interact with the database.
Database Administrator
A person who has central control over the system (both the data and the programs) is called the database administrator (DBA).
Functions of the DBA
1. Schema definition: The DBA creates the original database schema by executing a set of data definition statements in the DDL.
2. Storage structure and access method definition: The DBA decides how data are to be represented by writing the storage structure definition. The associated internal or conceptual schema must also be specified using the data definition language (DDL).
3. Schema and physical organization modification: The DBA carries out changes to the schema and physical organization to reflect the changing needs of the organization, or alters the physical organization to improve performance.
4. Granting of authorization for data access: By granting different types of authorization, the DBA can regulate which parts of the database various users can access.
5. Routine maintenance: The DBA's routine maintenance activities are:
a. Periodically backing up the database, either onto tapes or onto a remote server.
b. Ensuring that enough free disk space is available for normal operations, and upgrading disk space as required.
c. Monitoring jobs running on the database and ensuring that performance does not degrade.

Database System Architecture
A database system is partitioned into modules that deal with each of the responsibilities of the overall system. The functional components of a database system can be broadly divided into the storage manager and the query processor components.
1. Storage manager
The storage manager is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system. The storage manager is responsible for interaction with the file manager: it translates the various DML statements into low-level file system commands. Thus the storage manager is responsible for storing, retrieving and updating data in the database. The storage manager components include:
a. Authorization and integrity manager: Tests for the satisfaction of integrity constraints and checks the authority of users to access data.
b. Transaction manager: Ensures that the database remains in a consistent state despite system failures, and that concurrent transaction executions proceed without conflict.
c. File manager: Manages the allocation of space on disk storage and the data structures used to represent information stored on disk.
d. Buffer manager: Responsible for fetching data from disk storage into main memory.
2. Query processor
a. DDL interpreter: Interprets DDL statements and records the definitions in the data dictionary.
b. DML compiler: Translates DML statements in a query language into an evaluation plan consisting of low-level instructions that the query evaluation engine understands. A query can be translated into a number of alternative evaluation plans that all give the same result. The DML compiler also performs query optimization, i.e. it picks the lowest-cost evaluation plan from among the alternatives.

Application Architecture
Today, database systems usually do not reside at the user's site; they are accessed over a network. We differentiate between the client machine, on which remote database users work, and the server machine, on which the database system runs. Database applications are partitioned into two or three parts:
1. Two-tier architecture
The application is partitioned into a component that resides at the client machine, which invokes database functionality at the server machine through query language statements.
User -> Application -> (network) -> Database
Fig: Two-tier architecture

2. Three-tier architecture
In three-tier architecture, the client machine acts as a front end and does not contain any direct database calls. Instead, the client communicates with an application server through a form interface. The application server in turn communicates with the database system to access data.

User -> Application client -> (network) -> Application server -> Database
Fig: Three-tier architecture
Chapter 9: Crash Recovery

Crash recovery is the process by which the database is moved back to a consistent and usable state. This is done by rolling back incomplete transactions and completing committed transactions that were still in memory when the crash occurred. When a database is in a consistent and usable state, it has attained what is known as a "point of consistency".

Failure Classification
1. Transaction Failure: There are two types of errors that may cause a transaction to fail:
Logical error: The transaction can no longer continue with its normal execution because of some internal condition, such as bad input, data not found, overflow, or resource limit exceeded.
System error: The system has entered an undesirable state (e.g. deadlock), as a result of which a transaction cannot continue with its normal execution. The transaction, however, can be re-executed at a later time.
2. System Crash: There is a hardware malfunction, or a bug in the database software or the operating system, that causes the loss of the content of volatile storage and brings transaction processing to a halt. The content of nonvolatile storage remains intact and is not corrupted.
3. Disk Failure: A disk block loses its content as a result of either a head crash or a failure during a data transfer operation. Copies of the data on other disks, or archival backups on tertiary media, are used to recover from the failure.
To recover from failures, recovery algorithms are used. These usually have two parts:
1. Actions taken during normal transaction processing to ensure that enough information exists to allow recovery from failures.
2. Actions taken after a failure to recover the database contents to a state that ensures database consistency, transaction atomicity, and durability.

Log-Based Recovery
The log is a sequence of log records, recording all the update activities in the database. There are several types of log records. An update log record describes a single database write. It has these fields:
Transaction identifier: the unique identifier of the transaction that performed the write operation.
Data-item identifier: the unique identifier of the data item written. Typically, it is the location on disk of the data item.
Old value: the value of the data item prior to the write.
New value: the value that the data item will have after the write.
Other special log records exist to record significant events during transaction processing, such as the start of a transaction and the commit or abort of a transaction. The various types of log record are denoted as:
<Ti start>          Transaction Ti has started.
<Ti, Xj, V1, V2>    Transaction Ti has performed a write on data item Xj. Xj had value V1 before the write, and will have value V2 after the write.
<Ti commit>         Transaction Ti has committed.
<Ti abort>          Transaction Ti has aborted.
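For example, if a transaction T1 transfers Rs 500 from account A (initial balance 1000) to account B (initial balance 2000), the log would contain the following records (the values are illustrative):
<T1 start>
<T1, A, 1000, 500>
<T1, B, 2000, 2500>
<T1 commit>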
Whenever a transaction performs a write, it is essential that the log record for that write be created before the database is modified. Once a log record exists, we can output the modification to the database if that is desirable. We also have the ability to undo a modification that has already been output to the database, using the old-value field of the log records. For log records to be useful for recovery from system and disk failures, the log must reside in stable storage.

Backup Recovery
A backup of the whole database is created and stored in stable storage from time to time. When a failure occurs, the most recent backup is retrieved to restore the database.

Shadow Paging
The database is partitioned into some number of fixed-length blocks, which are referred to as pages. To find the ith page of the database we use a page table. The page table contains n entries, one for each database page. Each entry contains a pointer to a page on disk. The key idea behind the shadow-paging technique is to maintain two page tables during the life of a transaction: the current page table and the shadow page table. When the transaction starts, both page tables are identical. The shadow page table is never changed over the duration of the transaction. The current page table may be changed when the transaction performs a write operation.
Suppose that the transaction Tj performs a write(X) operation, and that X resides on the ith page. The system executes the write operation as follows:
1. If the ith page (i.e. the page on which X resides) is not already in main memory, then the system issues input(X).
2. If this is the first write performed on the ith page by this transaction, then the system modifies the current page table as follows:
a) It finds an unused page on disk. Usually, the database system has access to a list of unused (free) pages.
b) It deletes the page found in step 2a from the list of free page frames, and copies the contents of the ith page to the page found in step 2a.
c) It modifies the current page table so that the ith entry points to the page found in step 2a.
3. It assigns the new value to X in the buffer page.
Thus, the shadow-paging approach to recovery is to store the shadow page table in nonvolatile storage, so that the state of the database prior to the execution of the transaction can be recovered in the event of a crash or transaction abort. When the transaction commits, the current page table becomes the new shadow page table, and the disk blocks containing the old data are released. If the shadow page table is stored in nonvolatile storage and a system crash occurs, then the shadow page table is copied to the current page table. This guarantees that the shadow page table will point to the database pages corresponding to the state of the database prior to any transaction that was active at the time of the crash, making aborts automatic.
Advantages of the shadow-page technique:
The overhead of log-record output is eliminated. Recovery from crashes is significantly faster since no undo or redo operations are needed.
Drawbacks:
Commit overhead: The commit of a single transaction using shadow paging requires multiple blocks to be output: the actual data blocks, the current page table, and the disk address of the current page table.
Data fragmentation: Shadow paging causes database pages to change location when they are updated. As a result, either we lose the locality property of the pages or we must resort to more complex, higher-overhead schemes for physical storage management.
Garbage collection: Each time a transaction commits, the database pages containing the old version of data changed by the transaction become inaccessible. Such pages are considered garbage, since they are not part of free space and do not contain useful information.
Dumping A database dump contains a record of the table structure and/or the data from a database and is usually in the form of a list of SQL statements. A database dump is most often used for backing up a database so that its contents can be restored in the event of data loss. Corrupted databases can often be recovered by analysis of the dump.
Chapter 2: Data Model
A model is a representation of reality. A data model is a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints. In other words, a data model describes in an abstract way how data is represented in a database.
Logical Data Model
A logical data model describes the data in as much detail as possible, without regard to how they will be physically implemented in the database. Features of a logical data model include:
- Includes all entities and relationships among them.
- All attributes for each entity are specified.
- The primary key for each entity is specified.
- Foreign keys (keys identifying the relationship between different entities) are specified.
- Normalization occurs at this level.
Fig: Logical data model

Physical Data Model
A physical data model represents how the model will be built in the database. A physical database model shows all table structures, including column names, column data types, column constraints, primary keys, foreign keys, and relationships between tables. Features of a physical data model include:
- All tables and columns are specified.
- Foreign keys are used to identify relationships between tables.
- Denormalization may occur based on user requirements.
- Physical considerations may cause the physical data model to be quite different from the logical data model.
- The physical data model will be different for different RDBMSs. For example, the data type for a column may be different between MySQL and SQL Server.
Fig: Physical data model

Conceptual Data Model
A conceptual data model identifies the highest-level relationships between the different entities. Features of a conceptual data model include:
- Includes the important entities and the relationships among them.
- No attribute is specified.
- No primary key is specified.
Fig: Conceptual data model
Logical data model vs Conceptual data model
- In a logical data model, primary keys are present, whereas in a conceptual data model, no primary key is present.
- In a logical data model, all attributes are specified within an entity. No attributes are specified in a conceptual data model.
- Relationships between entities are specified using primary keys and foreign keys in a logical data model. In a conceptual data model, the relationships are simply stated, not specified: we know that two entities are related, but we do not specify which attributes are used for this relationship.
Types of data model

Record-based logical models
Record-based logical models describe data at the conceptual and view levels. They are so named because the database is structured in fixed-format records of several types. Each record type defines a fixed number of fields, or attributes. Record-based models do not include a mechanism for direct representation of code in the database. Data models that represent data by using record structures are called record-based data models. The three most widely accepted models are the relational, network and hierarchical models.
1. Relational Model
This model uses a collection of tables to represent both data and the relationships among those data. Each table has multiple columns, and each column has a unique name. Each table contains records of a particular type, and each record type defines a fixed number of fields, or attributes. The relational model is at a lower level of abstraction than the E-R model. Database designs are often carried out in the E-R model and then translated into the relational model.
Account
acc#   balance
A101   6000
A103   7000
A301   8000

Customer
name   id    address  acc#
ram    980   ktm      A101
hari   981   bkt      A103
gopal  941   ptn      A103
ram    980   ktm      A301
hari   981   bkt      A301
Fig: Relational Model

2. Network data model
Data are represented by collections of records, and relationships among data are represented by links. The organization is that of an arbitrary graph. The following figure shows a sample network database that is the equivalent of the relational database shown before.

customer                  Account
ram   980  ktm    ->    A101  6000
hari  981  bkt    ->    A103  7000
gopal 941  ptn    ->    A301  8000
Fig: Network Model

3. Hierarchical data model
Similar to the network model. The organization of the records is as a collection of trees, rather than arbitrary graphs. The following figure shows a sample hierarchical database that is the equivalent of the relational database shown before.

ram 980 ktm    ->  A101 6000,  A301 8000
hari 981 bkt   ->  A103 7000,  A301 8000
gopal 941 ptn  ->  A103 7000

Fig: Hierarchical Data Model
Demerits
- Redundancy
- Waste of space
Object-based Model
An object-based model focuses on describing the data, the relationships among the data, and any constraints defined. The most commonly used object-based model is the E-R model.
Entity Relationship Model (E-R Model)
It is based on a perception of the real world that consists of a collection of basic objects called entities, and of relationships among those entities. An entity is a thing or object in the real world that is distinguishable from other objects; e.g. each person is an entity. Each entity is described by a set of attributes. A relationship is an association among several entities. The set of all entities of the same type and the set of all relationships of the same type are termed an entity set and a relationship set, respectively. Another essential element of the E-R diagram is the mapping cardinalities, which express the number of entities to which another entity can be associated via a relationship set. The overall logical structure of a database can be expressed graphically by an E-R diagram:
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets and entity sets to relationship sets.
Teacher (id, name, address) --- Teaches --- Student (id, name, address)
Fig: Simple E-R Diagram
Chapter 3: Relational Model

The relational model represents the database as a collection of tables. Both entities and relationships are represented by tables. Each table has multiple columns, each with a unique name. Each table contains records of a particular type, and each record type defines a fixed number of fields, or attributes.

Terminology
Attributes: Attributes are the headers of the columns of the table.
Domain: The domain is the set of permitted values for each attribute.
Tuple: Each row of the relation (table) is called a tuple. A tuple variable is a variable that stands for a tuple; in other words, a tuple variable is a variable whose domain is the set of all tuples.
Arity (Degree): If a relation R contains tuples with m components, then R is said to be of arity m.
Cardinality: If a relation R contains n tuples, the relation R is said to be of cardinality n.

STUDENT
Name    SSN        Address  Age  GPA
Ram     2011CE03   Ktm      23   3.53
Shyam   2012CE05   Bkt      19   3.25
Sita    2010CE09   Ptn      25   2.80
(rows = tuples; columns = attributes; number of rows = cardinality; number of columns = degree)

Fig: Student Table

Structure of relational databases
A relational database consists of a collection of tables, i.e. in this model, entity sets and relationships are all represented by tables. A row in a table represents a relationship among a set of values. Since a table is a collection of such relationships, there is a close correspondence between the concept of a table and the mathematical concept of a relation, from which the relational data model takes its name.
Consider the STUDENT table of the figure shown above. It has five column headers: Name, SSN, Address, Age and GPA. Following the terminology of the relational model, we refer to these headers as attributes. For each attribute there is a set of permitted values, called the domain of that attribute. Let D1 denote the set of all student names, D2 the set of all student SSNs, D3 the set of all addresses, D4 the set of all ages, and D5 the set of all GPAs. Any row of STUDENT must consist of a 5-tuple (v1, v2, v3, v4, v5), where v1 is a student name (i.e. v1 is in domain D1), v2 is a student SSN (i.e. v2 is in domain D2), v3 is a student address (i.e. v3 is in domain D3), and so on. In general, STUDENT will contain only a subset of the set of all possible rows. Therefore, STUDENT is a subset of D1 × D2 × D3 × D4 × D5. In general, a table of n attributes must be a subset of D1 × D2 × … × Dn.

Mathematical Concept
Mathematicians define a relation to be a subset of a Cartesian product of a list of domains. We can see the correspondence with our tables. We will use the terms relation and tuple in place of table and row.

Some more formalities
Let the tuple variable t refer to a tuple of the relation R. We write t ∈ R to denote that the tuple t is in relation R. Then t[Name] = t[1] = the value of t on the Name attribute. So t[Name] = t[1] = "Ram", and t[Address] = t[3] = "Ktm". We also require that the domains of all attributes be indivisible units, i.e. atomic. A domain is atomic if the elements of the domain are considered to be indivisible units. For example, the set of integers is an atomic domain, but the set of all sets of integers is a non-atomic domain.

Relational Algebra
The relational algebra is a procedural query language. It comprises a set of basic operations. An operation is the application of an operator to one or more source (or input) relations to produce a new relation as a result.
Source Relation 1, Source Relation 2, …, Source Relation n -> Relational Operation -> Resultant Relation
Fig: A relational algebra operation

Six basic operators:
Select: σ
Project: π
Union: ∪
Set Difference: −
Cartesian Product: ×
Rename: ρ
The operators take one or two relations as input and produce a new relation as a result. Several other operations are defined in terms of the fundamental operations:
Set intersection: ∩
Natural join: ⋈
Division: ÷
Assignment: ←
Select Operation
The select operation selects tuples that satisfy a given predicate. We use the lowercase Greek letter sigma (σ) to denote selection. The predicate appears as a subscript to σ.
Syntax: σF(R), where R is a relation and F is a predicate (or condition).
E.g.: List all the students who live in ktm:
σ address="ktm" (student)
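For comparison, the same selection written in SQL (a sketch, assuming a student table with an address column):
SELECT * FROM student WHERE address = 'ktm';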
We can combine several predicates into a larger predicate by using the connectives and (∧), or (∨) and not (¬).
E.g.: Employee (eid, name, deptno, address, salary)
Find all the tuples whose employee id is greater than 10 and address is ktm:
σ eid>10 ∧ address="ktm" (Employee)

Project Operation
Whereas a selection operation extracts rows of a relation meeting specified conditions, a projection operation extracts specified columns of a relation. The desired columns are simply specified by name. With the help of this operation, any number of columns can be omitted from a table, or the columns of a table can be rearranged. Projection is denoted by the uppercase Greek letter pi (π).
Syntax: π A,B,…(R), where A, B, … are attributes
and R is a relation.
Ex: List the name and address of all employees who have salary greater than 2500:
π name, address (σ salary>2500 (Employee))
Note: Instead of using the names of attributes we can use the column numbers, as in π 2, 3 (σ salary>2500 (Employee)).
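The combined projection and selection correspond to the SQL column list and WHERE clause (a sketch):
SELECT name, address FROM Employee WHERE salary > 2500;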
Union Operation
The union of relations R1 and R2, denoted by R1 ∪ R2, is the set of tuples, with arity n, that are in R1 or R2 or both. Relations R1 and R2 are union-compatible if they meet the following two conditions:
i) The relations R1 and R2 must be of the same arity. That is, they must have the same number of attributes.
ii) The domain of the ith attribute of R1 and the ith attribute of R2 must be the same for all i.

E.g.: Employee (eid, name, deptno, address, salary)
List the name and address of employees who are working in department 10, or live in pokhara, or both:
π name, address (σ deptno=10 (Employee)) ∪ π name, address (σ address="pokhara" (Employee))
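The same query in SQL (a sketch; SQL's UNION also removes duplicate rows, matching the set semantics of the algebra):
SELECT name, address FROM Employee WHERE deptno = 10
UNION
SELECT name, address FROM Employee WHERE address = 'pokhara';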
Intersection Operation
The intersection operation also requires that the two involved relations be union-compatible. Let R1 and R2 be two relations which are union-compatible. The intersection of R1 and R2, denoted by R1 ∩ R2, is the set of tuples that are in both R1 and R2.
E.g.: List the name and address of employees who are working in department 10 and live in pokhara:
π name, address (σ deptno=10 (Employee)) ∩ π name, address (σ address="pokhara" (Employee))

Difference Operation
The difference operation also requires that the two involved relations be union-compatible. Let R1 and R2 be two relations which are union-compatible. The difference of relations R1 and R2, denoted by R1 − R2, is the set of tuples that are in R1 but not in R2.
E.g.: List the name and address of employees who are working in department 10 and do not live in pokhara:
π name, address (σ deptno=10 (Employee)) − π name, address (σ address="pokhara" (Employee))
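In SQL these correspond to the INTERSECT and set-difference operators (the latter is spelled MINUS in Oracle, EXCEPT in standard SQL); a sketch:
SELECT name, address FROM Employee WHERE deptno = 10
INTERSECT
SELECT name, address FROM Employee WHERE address = 'pokhara';

SELECT name, address FROM Employee WHERE deptno = 10
MINUS
SELECT name, address FROM Employee WHERE address = 'pokhara';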
Cartesian Product Operation
The Cartesian product operation does not require the relations to be union-compatible. It means that the involved relations may have different schemas. Let R1 and R2 be relations that may have different schemas; then the Cartesian product, denoted by R1 × R2, is the set of all possible combinations of tuples of the two operand relations. Each resultant tuple consists of all the attributes of R1 and R2.
E.g.:
Employee E table          Department D table
eno  ename  Dept          dno  dname
1    Ram    A             A    Marketing
2    Shyam  C             B    Sales
3    Hari   A             C    Legal

E × D
eno  ename  Dept  dno  dname
1    Ram    A     A    Marketing
1    Ram    A     B    Sales
1    Ram    A     C    Legal
2    Shyam  C     A    Marketing
2    Shyam  C     B    Sales
2    Shyam  C     C    Legal
3    Hari   A     A    Marketing
3    Hari   A     B    Sales
3    Hari   A     C    Legal
E.g.:
borrower (cname, loan#)
loan (loan#, bname, amount)
Then, borrower × loan = (borrower.cname, borrower.loan#, loan.loan#, loan.bname, loan.amount) = (cname, borrower.loan#, loan.loan#, bname, amount)
Find the names of all customers who have a loan at the ktm branch:
π cname (σ bname="ktm" ∧ borrower.loan#=loan.loan# (borrower × loan))
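The same query in SQL (a sketch; the loan# column is written loan_no, since # is not legal in a standard SQL identifier):
SELECT cname
FROM borrower, loan
WHERE borrower.loan_no = loan.loan_no AND bname = 'ktm';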
Since the Cartesian product takes all possible pairings of one tuple from borrower with one tuple from loan, the result may pair a customer with a loan that is not that customer's. However, if a customer has a loan at the ktm branch, then there is some tuple in borrower × loan that contains the customer's name and in which borrower.loan# = loan.loan#.
It is useful to be able to give names to the results of relational algebra expressions. The rename operation is denoted by the lowercase Greek letter rho (ρ). Given a relational algebra expression E, the expression ρx(E) returns the result of expression E under the name x. We can also apply the rename operation to a relation r to get the same relation under a new name. The rename operation solves the naming problems that occur when performing the Cartesian product of a relation with itself.
E.g.: customer (cname, cstreet, ccity)
Suppose we want to find the names of all customers who live on the same street and in the same city as Ram.
We get the street and city of Ram as:
π cstreet, ccity (σ cname="Ram" (customer))
To find other customers with the same information we need to reference the customer relation again:
σP (customer × (π cstreet, ccity (σ cname="Ram" (customer))))
where P is the selection predicate requiring the cstreet and ccity values to be equal. The rename operation lets us refer to the second copy of customer as cust, so the final result is:
π customer.cname (σ customer.cstreet=cust.cstreet ∧ customer.ccity=cust.ccity (customer × π cstreet, ccity (σ cname="Ram" (ρcust (customer)))))
# deposit (cname, bname, balance)
Find the largest account balance in the bank.
First compute a temporary relation consisting of those balances that are not the largest:
π deposit.balance (σ deposit.balance < d.balance (deposit × ρd (deposit)))
This result contains all balances except the largest one. Now take the set difference:
π balance (deposit) − π deposit.balance (σ deposit.balance < d.balance (deposit × ρd (deposit)))
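In SQL, the rename operation corresponds to a table alias. The largest-balance query can be sketched as follows (MINUS is Oracle's spelling of set difference):
SELECT balance FROM deposit
MINUS
SELECT deposit.balance
FROM deposit, deposit d
WHERE deposit.balance < d.balance;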
Natural Join Operation
It is denoted by ⋈. It is applicable only when there is at least one attribute common to both relations. Let R and S be two relations with at least one attribute in common; then the natural join proceeds as follows:
- Compute R × S.
- Select those tuples from R × S whose components corresponding to the common attributes have the same value.
- Remove duplicate attributes.

E.g.: 1)
A                    B
loan#  cname         loan#  amount
5      Ram           5      500
6      Hari          6      600
7      Sita          8      700
8      Laxman        7      800

A ⋈ B
loan#  cname   amount
5      Ram     500
6      Hari    600
7      Sita    800
8      Laxman  700
2) borrower (cname, loan#)
loan (loan#, bname, amount)
borrower ⋈ loan = (cname, loan#, bname, amount)
Find all customer names that have a loan at the Kathmandu branch:
π cname (σ bname="kathmandu" (borrower ⋈ loan))
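The same query using SQL's NATURAL JOIN (a sketch; loan# again written loan_no):
SELECT cname
FROM borrower NATURAL JOIN loan
WHERE bname = 'kathmandu';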
3) customer (cname, street, city)
account (acc#, bname, balance)
depositor (cname, acc#)
customer ⋈ account ⋈ depositor = (cname, street, city, acc#, bname, balance)
Find the names of all branches with customers that have an account in the bank and who live in ktm:
π bname, cname (σ city="ktm" (customer ⋈ account ⋈ depositor))
Division Operation
It is denoted by ÷. Let S and R be two relations of arity s and r respectively, with s > r, where the attributes of R are the last attributes of S, arranged in the same order. Then S ÷ R is the set of (s − r)-tuples t such that for every tuple u in R, the tuple tu (the concatenation of t and u) is in S.
E.g.: 1)
S
A  B  C  D
m  n  o  p
m  n  r  s
p  q  o  p
p  q  r  s
w  x  y  z

R
C  D
o  p
r  s

S ÷ R
A  B
m  n
p  q
2)
Completed
Student  Task
Ram      Db1
Ram      Db2
Ram      Com1
Hari     Db1
Hari     Com1
Sita     Db1
Sita     Db2

Dbproject
Task
Db1
Db2

Completed ÷ Dbproject
Student
Ram
Sita
3) customer (cname, street, city) account (acc#, bname, balance) depositor (cname, acc#)
Find all customers who have an account at every branch located in Kathmandu.
1st step: Find all branches in ktm:
π bname (σ city="ktm" (branch))
2nd step: Find all (customer, branch) pairs:
π cname, bname (depositor ⋈ account)
3rd step: Divide the 2nd by the 1st:
π cname, bname (depositor ⋈ account) ÷ π bname (σ city="ktm" (branch))
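Relational division has no direct SQL operator; a common encoding uses a double NOT EXISTS, i.e. find customers for whom there is no ktm branch at which they lack an account (a sketch, with acc# written acc_no and assuming a branch(bname, city) table as in step 1):
SELECT DISTINCT d.cname
FROM depositor d
WHERE NOT EXISTS
  (SELECT * FROM branch b
   WHERE b.city = 'ktm'
   AND NOT EXISTS
     (SELECT * FROM depositor d2, account a
      WHERE d2.cname = d.cname
        AND d2.acc_no = a.acc_no
        AND a.bname = b.bname));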
Assignment Operation
The assignment operation, denoted by ←, works like assignment in a programming language.
E.g.: 1) The example listed under the division operator can be summarized using the assignment operator:
R1 ← π bname (σ city="ktm" (branch))
R2 ← π cname, bname (depositor ⋈ account)
R3 ← R2 ÷ R1
2) To find the customer names whose branch name is Teku:
π cname (σ bname="Teku" (depositor ⋈ branch))
R1 ← depositor ⋈ branch
R2 ← σ bname="Teku" (R1)
R3 ← π cname (R2)
It is convenient at times to write a relational algebra expression by assigning parts of it to temporary relation variables.

Outer Join
The outer join operation is an extension of the join operation to deal with missing information. Let us consider the following tables as an example.

Depositor                       Customer
bname       acc#   bal          acc#   cname
Newroad     A200   3000         A200   Ram
Teku        A250   5000         A100   Hari
Baneshwor   A100   2000         A60    Sita
a) Left Outer Join
It is denoted by ⟕. It takes all tuples in the left relation, including those that do not match any tuple in the right relation; unmatched attributes are padded with nulls.

Depositor ⟕ Customer
bname      acc#   bal    cname
Newroad    A200   3000   Ram
Baneshwor  A100   2000   Hari
Teku       A250   5000   -
b) Right Outer Join
It is denoted by ⟖. It takes all tuples in the right relation, including those that do not match any tuple in the left relation; unmatched attributes are padded with nulls.

Depositor ⟖ Customer
bname      acc#   bal    cname
Newroad    A200   3000   Ram
Baneshwor  A100   2000   Hari
-          A60    -      Sita
c) Full Outer Join
It is denoted by ⟗. It does both the left and right outer join operations, padding tuples from the left relation that do not match any from the right relation, as well as tuples from the right relation that do not match any from the left relation, and adding them to the result of the join.

Depositor ⟗ Customer
bname      acc#   bal    cname
Newroad    A200   3000   Ram
Baneshwor  A100   2000   Hari
Teku       A250   5000   -
-          A60    -      Sita
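The corresponding SQL (a sketch; acc# written acc_no and Bal written bal):
SELECT d.bname, d.acc_no, d.bal, c.cname
FROM Depositor d LEFT OUTER JOIN Customer c ON d.acc_no = c.acc_no;
-- RIGHT OUTER JOIN and FULL OUTER JOIN are written analogously.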
Aggregate Function
An aggregate function takes a collection of values and returns a single value as a result.
The relational algebra aggregation operation is denoted by the letter G in calligraphic font (read it as "calligraphic G"); it signifies that aggregation is to be applied, and its subscript specifies the aggregate function to be applied. The general form of the aggregate operation is:

G1, G2, …, Gn 𝒢 F1(A1), F2(A2), …, Fm(Am) (E)

where E is any relational algebra expression; G1, G2, …, Gn constitute a list of attributes on which to group; each Fi is an aggregate function; and each Ai is an attribute name.
1) Sum function
The sum function takes a collection of values and returns the sum of the values.
E.g.: loan (loan#, amount)
Find the sum of all amounts from the loan relation:
𝒢 sum(amount) (loan)

2) Count function
The aggregate function count returns the number of elements in a collection.
E.g.: Count the number of tuples in the loan relation:
𝒢 count(loan#) (loan)

3) Min function
Returns the minimum value from a collection of items.
E.g.: Find the minimum amount from the loan relation:
𝒢 min(amount) (loan)

4) Max function
Returns the maximum value from a collection of items.
E.g.: Find the maximum amount from the loan relation:
𝒢 max(amount) (loan)

5) Avg function
Returns the average value from a collection of items.
E.g.: Find the average amount from the loan relation:
𝒢 avg(amount) (loan)
E.g.: Teachers (tid, tname, address, dept, salary)
Find the average salary of each department:
dept 𝒢 avg(salary) (Teachers)
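The grouped average corresponds directly to SQL's GROUP BY (a sketch):
SELECT dept, AVG(salary)
FROM Teachers
GROUP BY dept;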
Modification of the Database
1. Deletion (Delete Operation)
In relational algebra, a deletion is expressed by
r ← r − E
where r is a relation and E is a relational algebra query.
E.g.: depositor (cname, acc#, bname, balance)
Delete all of Ram's account records:
Depositor ← Depositor − (σ cname="Ram" (Depositor))

2. Insertion (Insert Operation)
The relational algebra expresses an insertion by
r ← r ∪ E
where r is a relation and E is a relational algebra query.
E.g.: Insert that Shyam has Rs 1200 in account A-23E at the 'abc' branch:
Depositor ← Depositor ∪ {("Shyam", "A-23E", "abc", 1200)}

3. Updating (Update Operation)
We can use the generalized projection operator to do this task:
r ← π F1, F2, …, Fn (r)
where each Fi is either the ith attribute of r, if the ith attribute is not updated, or, if the attribute is to be updated, an expression involving only constants and the attributes of r that gives the new value for the attribute.
If we want to select some tuples from r and update only them, we can use the following expression, where P denotes the selection condition that chooses which tuples to update:
r ← π F1, F2, …, Fn (σP (r)) ∪ (r − σP (r))
E.g.:
1) Increase the amount of each depositor by 10%:
depositor ← π cname, acc#, bname, balance+balance*0.1 (depositor)
2) Increase the amount of Shyam by 10%:
depositor ← π cname, acc#, bname, balance+balance*0.1 (σ cname="Shyam" (depositor)) ∪ (depositor − (σ cname="Shyam" (depositor)))
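The corresponding SQL data-manipulation statements (a sketch against the depositor table above):
DELETE FROM depositor WHERE cname = 'Ram';
INSERT INTO depositor VALUES ('Shyam', 'A-23E', 'abc', 1200);
UPDATE depositor SET balance = balance * 1.1 WHERE cname = 'Shyam';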
Tuple Relational Calculus
A nonprocedural query language, where each query is of the form
{t | P(t)}
i.e. the set of all tuples t such that predicate P is true for t. Here t is a tuple variable; t[A] denotes the value of tuple t on attribute A; t ∈ R denotes that tuple t is in relation R; and P is a formula similar to that of the predicate calculus.

A predicate calculus formula is built from:
- A set of attributes and constants
- A set of comparison operators (e.g. <, <=, >, >=, =, ≠)
- A set of connectives: and (∧), or (∨) and not (¬)
- Implication (⇒): x ⇒ y means if x is true, then y is true; x ⇒ y ≡ ¬x ∨ y
- A set of quantifiers: ∃ (there exists) and ∀ (for all)
For example, ∃ t ∈ r (Q(t)) means "there exists" a tuple t in relation r such that predicate Q(t) is true, and ∀ t ∈ r (Q(t)) means Q is true "for all" tuples t in relation r.

branch (branch_name, branch_city, assets)
customer (customer_name, customer_street, customer_city)
account (account_number, branch_name, balance)
loan (loan_number, branch_name, amount)
depositor (customer_name, account_number)
borrower (customer_name, loan_number)

1. Find the loan_number, branch_name and amount for loans of over Rs. 1200:
{t | t ∈ loan ∧ t[amount] > 1200}
2. Find the loan number for each loan of an amount greater than Rs. 1200:
{t | ∃ s ∈ loan (t[loan_number] = s[loan_number] ∧ s[amount] > 1200)}
3. Find the names of all customers having a loan, an account, or both at the bank:
{t | ∃ s ∈ borrower (t[customer_name] = s[customer_name]) ∨ ∃ u ∈ depositor (t[customer_name] = u[customer_name])}
4. Find the names of all customers who have a loan and an account at the bank:
{t | ∃ s ∈ borrower (t[customer_name] = s[customer_name]) ∧ ∃ u ∈ depositor (t[customer_name] = u[customer_name])}
Domain Relational Calculus
It uses domain variables that take on values from an attribute's domain, rather than values for an entire tuple. An expression in the domain relational calculus is of the form
{<x1, x2, x3, …, xn> | P(x1, x2, x3, …, xn)}
where x1, x2, x3, …, xn represent domain variables and P is the predicate.

For example:
1. Find the loan_number, branch_name and amount for loans of over Rs. 1200:
{<l, b, a> | <l, b, a> ∈ loan ∧ a > 1200}
2. Find the loan number for each loan of an amount greater than Rs. 1200:
{<l> | ∃ b, a (<l, b, a> ∈ loan ∧ a > 1200)}
3. Find the names of all customers who have a loan from the Kathmandu branch, and find the loan amount:
{<c, a> | ∃ l (<c, l> ∈ borrower ∧ ∃ b (<l, b, a> ∈ loan ∧ b = "kathmandu"))}
Designing a relational schema
A relational schema is a list of attributes and their corresponding domains. If A1, A2, A3, …, An are the attributes, then R = (A1, A2, A3, …, An) is a relational schema.

Student
Name | SSN | Address | Age | GPA

Student = (Name, SSN, Address, Age, GPA)
Or, Student(Name, SSN, Address, Age, GPA)
Chapter 4: SQL

SQL is a query language. It is capable of more than just retrieving data from relations in the database: it can also handle data updates and even data definition, i.e. add new data, change existing data, delete data, or create new structures. Thus SQL is capable of:
1. Data Query: The content of the database is accessed by a set of commands whereby useful information is returned to the end user.
2. Data Maintenance: The data within relations can be created, corrected, deleted and modified.
3. Data Definition: The structure of the database and its relations can be defined and created.

Data Definition Language (DDL)
DDL describes the portion of SQL that allows us to create, alter and destroy database objects. Database objects include schemas, tables, views etc.
1. CREATE
SQL provides the reserved keyword CREATE to create tables, views, databases etc.
Syntax:
Create Table <table_name> (
    column1_name data-type <constraint>,
    column2_name data-type <constraint>,
    column3_name data-type <constraint>
);
Note: The constraint part is optional; constraints are restrictions on the values a column may hold.
Data Types
1. Char(width): used to store characters. Max. 255 characters; the default width is 1.
2. Varchar2(width): stores variable-length character data. Max. 2000 characters can be stored.
3. Number(width, decimal): supports all types of numeric values. Max. 40 digits allowed. By default, takes 0 for width and 0 for decimal places.
E.g.: Create a table employee having the columns employee no., name, salary, post.

Create table employee(
    employee_no varchar2(30),
    name        varchar2(30),
    salary      number(5,2),
    post        varchar2(20)
);
Constraints
Data types are a way to limit the kind of data that can be stored in a table. For example, a column containing a product price should probably only accept positive values, but there is no data type that accepts only positive numbers. If we want to restrict certain columns to certain values, constraints play an important role. Constraints give us as much control over the data in our tables as we wish.
1. NOT NULL
When the NOT NULL restriction (constraint) is added to a column, it ensures that a value must be passed for the column at the time of entering records.
Syntax: Column-name data-type NOT NULL,
*Note: If NOT NULL is not explicitly added to a column, the default NULL is taken into consideration.
E.g.:
Create table employee(
    Employee_no   number(4),
    Employee_name varchar2(30) NOT NULL
);
2. Primary Key
When the primary key restriction (constraint) is added to a column, it ensures that:
- There is no duplication in the field's values.
- NULL should not exist.
Syntax:
1. Column-name data-type PRIMARY KEY, …
OR
2. Column-name data-type CONSTRAINT <constraint-name> PRIMARY KEY, …

Q. Create a table named student having the following columns: Regno, Name, Class, Address.
By the first method:
Create table student(
    Regno   number(15) primary key,
    Name    varchar2(30) NOT NULL,
    Class   number(5) NOT NULL,
    Address varchar2(30)
);
By the second method:
Create table student(
    Regno   number(15) constraint regnopk primary key,
    Name    varchar2(30) NOT NULL,
    Class   number(5) NOT NULL,
    Address varchar2(30)
);
3. Check Constraint
When the CHECK reserved keyword is added to a column, it ensures that only valid data is entered.
Syntax:
1. Column-name data-type CHECK (column operator value)
OR
2. Column-name data-type CONSTRAINT <constraint-name> CHECK (column operator value)

Operators used in SQL:
1. Relational operators
2. Logical operators
3. Special operators:
   a. IN and NOT IN
   b. BETWEEN and NOT BETWEEN
   c. LIKE and NOT LIKE
1. Relational (comparison) operators
These operators are used to compare two values. The following operators are provided:
>  (greater than)
<  (less than)
>= (greater than or equal to)
<= (less than or equal to)
!=, <> (not equal to)
=  (equal to)

Restriction based on relational operators:
Q. Create a table named employee having the following columns and restrictions:
Empno    primary key
Empname  NOT NULL
Post     NOT NULL
Salary   check (salary > 0)
Ans:
Create table employee(
    Empno   number(15) constraint empnopk primary key,
    Empname varchar2(30) NOT NULL,
    Post    varchar2(10) NOT NULL,
    Salary  number(5) constraint salarychk check (salary > 0)
);

2. Logical operators
These operators are used to combine more than one conditional expression. The following logical operators are provided by SQL:
a. AND: returns true if both conditions are true, otherwise false. It is a binary operator.
b. OR
OR returns true if either one of the conditions is true, otherwise it returns false. It is also a binary operator.
c. NOT: reverses the result, turning true to false and vice versa. It is a unary operator.

Check constraint based on logical operators:
Q. Create a table named employee having the following columns and restrictions:
Empno    primary key; must be between 9000 and 9999
Empname  NOT NULL
Post     programmer, executive, or clerk
Salary   must be greater than or equal to 2000 and less than or equal to 45000
Ans:
Create table employee(
    Empno   number(15) constraint empnopk primary key
            Check (Empno>=9000 and Empno<=9999),
    Empname varchar2(30) NOT NULL,
    Post    varchar2(15) constraint postchk
            Check (post='prog' or post='executive' or post='clerk'),
    Salary  number(9) constraint salchk
            Check (salary>=2000 and salary<=45000)
);

3. Special Operators
a. IN operator: This operator is used to restrict a column to the values specified within parentheses.
Syntax: Column data-type constraint <constraint-name> Check (Column-name in (Value1, Value2, Value3, …))
*Notes:
a. If the specified values are of string type, they must be enclosed within single quotes ('').
b. No quotes are required to enclose numeric values.
c. The IN operator internally works as an OR operator.
Check constraint based on IN and NOT IN operators:
Q. Create a table named student having the following columns and restrictions:
Regno  primary key; >2000 and <8000
Sname  NOT NULL
Class  1, 2, 3, or 4
Sex    male or female
Ans:
Create table student(
    Regno number(15) constraint regnopk primary key
          Check (regno>2000 and regno<8000),
    Sname varchar2(15) NOT NULL,
    Class number(1) constraint classchk Check (class in (1, 2, 3, 4)),
    Sex   varchar2(15) constraint sexchk Check (sex in ('male', 'female'))
);
b. NOT IN operator: This operator is used to restrict values based on the values specified within parentheses; that is, it accepts values that do not match the values specified.
Syntax: Column data-type constraint <constraint-name> Check (column not in (value1, value2, …))
Notes:
1. If the specified values are of char/varchar2 type, they must be enclosed within single quotes.
2. No quotes are required for numeric values.
3. The NOT IN operator internally works as an AND operator.
Check constraint based on IN and NOT IN operators:
Q. Create a table named directors having the following columns and restrictions:
Did     primary key; must be >= 'D001' and <= 'D999'
Dfname  NOT NULL
Dlname  must be Ghai, Verma, or Sharma
Dcity   not Patna, Pune, or Bangalore
Ans:
Create table directors(
    Did    varchar2(15) constraint didpk primary key
           Check (Did>='D001' and Did<='D999'),
    Dfname varchar2(30) NOT NULL,
    Dlname varchar2(20) constraint dlnamechk
           Check (Dlname in ('Ghai', 'Verma', 'Sharma')),
    Dcity  varchar2(30) constraint dcitychk
           Check (Dcity not in ('Patna', 'Pune', 'Bangalore'))
);

c. BETWEEN and NOT BETWEEN operators: The BETWEEN operator is used in SQL*Plus to select values based on a specified range, including the initial and final limits.
Syntax: Column data-type check (column-name between initial-limit and final-limit)
The NOT BETWEEN operator is used in SQL*Plus to select values either less than the initial value or greater than the final value, i.e. excluding the initial and final limits.
Syntax: Column data-type check (column-name not between initial-limit and final-limit)

Check constraint based on BETWEEN and NOT BETWEEN operators:
Q. Create a table named result having the following columns and restrictions:
Regno  primary key; check >=9000 and <=9999
SEM    I, II, or III
Sub1   marks within (0-100)
Sub2   marks must be >=0 and <=200
Total  must be between 200 and 300

Create table result(
    regno number(5) constraint regnopk primary key
          Check (regno between 9000 and 9999),
    SEM   varchar2(5) constraint semchk Check (SEM in ('I', 'II', 'III')),
    Sub1  number(3) constraint sub1chk Check (Sub1 between 0 and 100),
    Sub2  number(3) constraint sub2chk Check (Sub2 between 0 and 200),
    Total number(5) constraint totalchk Check (Total between 200 and 300)
);
d. LIKE operator: This operator is used in SQL to restrict values based on pattern matching. To build patterns, SQL provides two wild-card characters:
1. %: matches zero, one, or more than one character.
E.g.: 'D%' => D, Do, Dabc are true; doo1, A001 are false. 'E%K' => EK, EooK, EoK are true; Ek, KEo1 are false.
2. _ (underscore): matches exactly one character (required).
E.g.: 'D___' => D001, DABC are true; D0001, dEad are false.
Syntax: Column data-type constraint <constraint-name> check (column like 'pattern')

Check constraint based on pattern (LIKE) matching:
Q. Create a table named directors having the following columns and restrictions:
Did     primary key; the director's id must start with 'D', followed by any sequence of characters, but must end with 'k'.
Dfname  NOT NULL
Dlname  must be either Kumar, Sharma, Singh, or Verma.
Dcity   the city of the director must start with M, K, or C.
Ans:
Create table directors(
    Did    varchar2(8) constraint didpk primary key
           Check (Did like 'D%k'),
    Dfname varchar2(30) NOT NULL,
    Dlname varchar2(15) constraint dlnamechk
           Check (Dlname in ('Kumar', 'Sharma', 'Singh', 'Verma')),
Dcity varchar2 (30) constraint dcitychk Check (Dcity like 'M%' or Dcity like 'K%' or Dcity like 'C%'));
e. NOT LIKE operator:
- This operator in SQL is used to accept all values other than those matching the specified pattern.
Syntax: Column data-type constraint <restriction-name> check (column not like 'pattern')
Q. Create a table named movies having the following columns and restrictions:
Mid      primary key; the movie id must start with any three characters but must end with S.
Mname    the movie name's last character must not be K, M, or N.
Category must not be Thriller or Suspense.
Cost     must lie within the range 200000 to 350000.
Ans:
Create table movies(
Mid varchar2 (10) constraint midpk primary key Check (Mid like '___%S'),
Mname varchar2 (30) constraint mnamechk Check (Mname not like '%K' and Mname not like '%M' and Mname not like '%N'),
Category varchar2 (15) constraint categorychk Check (Category not in ('Suspense', 'Thriller')),
Cost number (10) constraint costchk Check (Cost between 200000 and 350000));
• Foreign Key or References: When a foreign key restriction is added to a column, it ensures a parent-child relation; that is, there must be a matching value available in the master table for each foreign key entry.
Syntax:
1. Column data-type constraint <Restriction-name> References <Master-Table> (Column-name)
OR
2. Constraint <Restriction-name> Foreign key (<Column>) References <Master-Table> (<Column>)
E.g.
Create Table directors(
Did Varchar2 (10),
Dfname Varchar2 (20) NOT NULL,
Dlname Varchar2 (10),
Dcity Varchar2 (30),
Constraint Didpk Primary key (Did),
Constraint Didchk Check (Did Like 'D%M'),
Constraint Dcitychk Check (Dcity IN ('Mumbai', 'Chennai', 'Kolkata')));
Create table movies(
Mid varchar2 (10) constraint midpk primary key Check (Mid like 'M____'),
Mname varchar2 (30) NOT NULL,
Category varchar2 (20) constraint catchk Check (Category IN ('Action', 'Drama', 'Suspense')),
Did varchar2 (10) constraint didfk References directors (Did));
2. DROP
- Used to destroy an existing database, table, view etc.
Syntax:
- DROP DATABASE database_name;
- DROP TABLE table_name;
E.g.
- Drop database test;
- Drop table employees;
3. ALTER
- SQL provides the ALTER command to change or modify the existing structure of a table.
Syntax: Alter Table <Table Name> Add/Modify/Drop <Statement>;
- SQL provides three reserved keywords, add, modify, and drop, to do the following jobs:
1. Add: used to add new columns to an existing table.
2. Modify: used with the ALTER command to change the data-type of an existing column.
3. Drop: used with the ALTER command to drop an existing constraint.
NOTE: In ORACLE, Drop can also be used to drop an existing column.
1. ADD: the Add reserved keyword is used with the ALTER TABLE command to add a new column to an existing table.
Syntax:
a. To add a single column: Alter Table <Table-name> Add Column-name data-type <Restriction>;
b. To add multiple columns: Alter Table <Table-name> Add (Column1 Data-type <Restriction>, Column2 ...);
Q. To add column Dfname to the existing table directors.
Ans: Alter Table Directors Add Dfname varchar2 (10) NOT NULL;
Q. To add two columns, Dlname and Dcity, to the existing table.
Ans: Alter Table Directors Add (Dlname varchar2 (20), Dcity varchar2 (10) constraint dcitychk Check (Dcity IN ('Mumbai', 'Chennai', 'Kolkata')));
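Modify and Drop are described above but not shown in use; minimal sketches against the directors table (widening Dfname to 40 characters is an assumed change):
Alter Table Directors Modify (Dfname varchar2 (40));
Alter Table Directors Drop constraint dcitychk;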
• To add a new restriction using the Alter Table command:
Syntax:
a. To add a single constraint: Alter Table <Table-name> Add <Restriction>;
b. To add multiple constraints to the same table at a time: Alter Table <Table-name> Add (<Restriction1>, <Restriction2>, ...);
Note:
- If the table is empty (contains no records), we can add a new column with or without a restriction.
- If the table contains records, we can add a new column without a restriction; we cannot add a new column with a restriction such as NOT NULL (unless a default value is supplied).
Q. To add a new column Dfname to the existing table director.
Ans: If the table is empty: Alter Table Director Add Dfname varchar2 (20) NOT NULL;
If the table contains records: Alter Table Directors Add Dfname varchar2 (20);
Q. To add new columns Dlname and Dcity to the existing table director.
Ans: If the table is empty: Alter Table Director Add (Dlname varchar2 (20) NOT NULL, Dcity varchar2 (10) Constraint Dcitychk Check (Dcity IN ('Mumbai', 'Chennai')));
If the table contains records: Alter Table Director Add (Dlname varchar2 (20), Dcity varchar2 (10));
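Adding a constraint by itself is not exemplified above; a minimal sketch, assuming the directors table already exists (the constraint name dlnameuk is chosen here for illustration):
Alter Table Directors Add constraint dlnameuk unique (Dlname);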
4. TRUNCATE
- TRUNCATE removes all rows from a table. The operation cannot be rolled back and no triggers will be fired. As such, TRUNCATE is faster and doesn't use as much undo space as a DELETE.
Syntax:- TRUNCATE TABLE table_name;
E.g.
- TRUNCATE TABLE employee;
Note: The DROP command removes a table from the database. All the table's rows, indexes and privileges will also be removed.
5. RENAME
- Used to rename a table, view, sequence, or private synonym for a table, view, or sequence.
Syntax:- RENAME old_table TO new_table;
E.g.
- RENAME employee TO new_employee;
Data-Manipulation Language (DML)
• DML describes the portion of SQL that allows us to manipulate or control our data. DML is a family of computer languages used by computer programs or database users to retrieve, insert, delete and update data in a database.
1. INSERT – Insert data into a table.
2. SELECT – Retrieve data from the database.
3. UPDATE – Update existing data within a table.
4. DELETE – Delete records from a table.
1. INSERT
- An SQL INSERT statement adds one or more records to any single table in a relational database.
Syntax:- INSERT INTO table_name (column1, column2, ..., column n) VALUES (value1, value2, ..., value n)
- The number of columns and values must be the same.
- If a column is not specified, the default value for the column is used.
E.g.
- INSERT INTO student (id, name, address) VALUES (1, 'Ram', 'Ktm');
- INSERT INTO student VALUES (1, 'Ram', 'Ktm');
- INSERT INTO student SET id=1, name='Ram', address='Ktm'; (MySQL-specific form)
Copying rows from another table:
- INSERT INTO student1 SELECT * FROM student WHERE id=1;
2. SELECT
An SQL SELECT statement queries data from tables in the database. The statement begins with the SELECT keyword.
Syntax:
SELECT <target-list> FROM <relation> WHERE <predicate>
SELECT col1, col2, ..., col n FROM <relation> WHERE <condition>
SELECT * FROM <relation> WHERE <condition>
* - selects all attributes.
Comparison operators used in the WHERE clause:
1. >  greater than
2. <  less than
3. >= greater than or equal to
4. <= less than or equal to
5. =  equal to
6. <> not equal to
Combining conditions with Boolean operators (OR, AND):
SELECT col1, col2, ..., col n FROM <relation> WHERE <condition 1> OR/AND <condition 2>
Eg:- Find the name of the student whose id is 4 and address is Kathmandu.
SELECT name FROM student WHERE id=4 AND address='Kathmandu';
Removing Duplicates
We use the DISTINCT keyword to remove duplicates.
Eg:- Find the unique names of the students.
SELECT DISTINCT name FROM student;
Range Specification (BETWEEN, IN)
BETWEEN
SELECT col1, col2, ..., col n FROM <relation> WHERE col1 [NOT] BETWEEN value1 AND value2;
Eg:- Find the name of the student whose id is between 5 and 10.
SELECT name FROM student WHERE id BETWEEN 5 AND 10;
IN
It is used to test whether or not a value is 'IN' the list of values provided after the keyword IN.
Syntax:
SELECT col1, col2, ..., col n FROM table_name WHERE col2 IN (list of values)
Eg:- SELECT name, age, address FROM student WHERE name IN ('Ram', 'Hari');
This can be interpreted as:
SELECT name, age, address FROM student WHERE name='Ram' OR name='Hari';
SELECT col1, col2, col3, ..., col n FROM table_name WHERE col2 NOT IN (list of values)
Eg:- SELECT name, age, address FROM student WHERE name NOT IN ('Ram', 'Hari');
This can be interpreted as:
SELECT name, age, address FROM student WHERE name<>'Ram' AND name<>'Hari';
Rename Operation
For renaming both relations and attributes, we use the AS clause.
Syntax: old_name AS new_name
SELECT customer_name, T.loan_number, S.amount FROM borrower AS T, loan AS S WHERE T.loan_number = S.loan_number
LIKE Operator
We describe the pattern by using two special characters.
Percent (%):- The % character matches any substring. Underscore (_):- The _ character matches exactly one character.
To illustrate pattern matching, we consider following example.
'Ab%' matches any string beginning with "Ab". '%ance' matches any string containing "ance" as a substring, e.g. 'maintenance', 'performance' etc. '___' matches any string of exactly three characters. '___%' matches any string of at least three characters.
Eg:- Find the names of the students whose names begin with 'Ra'.
SELECT name FROM student WHERE name LIKE 'Ra%';
Find the names of the students whose names have three characters.
SELECT name FROM student WHERE name LIKE '___';
Ordering the display of tuples
It is used to display the selected tuples in either ascending or descending order with reference to some attribute.
SELECT col1, col2, ..., col n FROM table_name WHERE <conditions> ORDER BY col1 [ASC/DESC];
ASC - ascending order (default)
DESC - descending order
Eg:- Find the name, age and address of students in alphabetical order of name.
SELECT name, age, address FROM student ORDER BY name;
Aggregate Functions
count, min, max, sum, avg
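The aggregate functions are only listed above; a brief sketch against the student table used in the earlier examples (the age column is assumed):
Eg:- Count the students and find their minimum, maximum, total and average age.
SELECT count(*), min(age), max(age), sum(age), avg(age) FROM student;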
3. UPDATE
- An SQL UPDATE statement updates existing data in a particular table.
Syntax:- UPDATE table_name SET col1=value1, col2=value2, ... WHERE <condition>;
- UPDATE student SET age=20, address='Bkt' WHERE name IN ('Ram', 'Hari');
- UPDATE student SET age=20, address='Bkt' WHERE name='Ram';
4. DELETE
- An SQL DELETE statement deletes existing data in a particular table.
Syntax:- DELETE FROM table_name WHERE <condition>;
E.g. DELETE FROM student WHERE name='Hari';
Group By Clause
• The GROUP BY clause can be used in a SELECT statement to collect data across multiple records and group the results by one or more columns.
• The syntax for the GROUP BY clause is:
SELECT column1, column2, ... column_n, aggregate_function (expression) FROM tables WHERE predicates GROUP BY column1, column2, ... column_n;
Example: Employee (eid, dept_id, salary). Find the average salary of each department.
Select dept_id, avg(salary) as "Average Salary" from Employee Group By dept_id;
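For illustration, suppose Employee holds the hypothetical rows (eid, dept_id, salary): (1, 10, 3000), (2, 10, 5000), (3, 20, 6000). The query above would then return one row per department: dept_id 10 with Average Salary 4000, and dept_id 20 with Average Salary 6000.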
Having Clause
• The HAVING clause is used in combination with the GROUP BY clause. It can be used in a SELECT statement to filter the records that a GROUP BY returns.
• The syntax for the HAVING clause is:
SELECT column1, column2, ... column_n, aggregate_function (expression) FROM tables WHERE predicates GROUP BY column1, column2, ... column_n HAVING condition1 ... condition_n;
Example:
Select dept_id, avg(salary) as "Average Salary" from Employee Group By dept_id Having avg(salary) > 4000;
EXISTS and NOT EXISTS Condition
• The EXISTS condition is considered "to be met" if the subquery returns at least one row.
• The syntax for the EXISTS and NOT EXISTS condition is:
SELECT columns FROM tables WHERE [NOT] EXISTS (subquery);
• The EXISTS condition can be used in any valid SQL statement - select, insert, update, or delete.
Example:
Select * from Employee WHERE [NOT] EXISTS (Select * from Department where Employee.dept_id = Department.dept_id);
• This select statement will return all records from the Employee table for which there is at least one record in the Department table with the same dept_id.
JOIN
• A join is used to combine rows from multiple tables. A join is performed whenever two or more tables are listed in the FROM clause of an SQL statement.
LEFT JOIN The LEFT JOIN keyword returns all rows from the left table (table_name1), even if there are no matches in the right table (table_name2). Syntax SQL> SELECT column_name(s) FROM table_name1 LEFT JOIN table_name2
ON table_name1.column_name=table_name2.column_name Ex: SQL> SELECT persons.lastname,persons.firstname,orders.orderno FROM persons LEFT JOIN orders ON persons.p_Id = orders.p_Id ORDER BY persons.lastname;
FULL OUTER JOIN
The FULL JOIN keyword returns rows when there is a match in one of the tables.
Syntax
SQL>SELECT column_name(s) FROM table_name1 FULL JOIN table_name2 ON table_name1.column_name=table_name2.column_name
Ex: SQL> SELECT persons.lastname,persons.firstname,orders.orderno FROM persons FULL OUTER JOIN orders ON persons.p_Id = orders.p_Id ORDER BY persons.lastname;
RIGHT OUTER JOIN
The RIGHT JOIN keyword returns all rows from the right table (table_name2), even if there are no matches in the left table (table_name1).
Syntax SQL> SELECT column_name(s) FROM table_name1 RIGHT OUTER JOIN table_name2 ON table_name1.column_name=table_name2.column_name Ex: SQL> SELECT persons.lastname,persons.firstname,orders.orderno FROM persons RIGHT OUTER JOIN orders ON persons.p_Id = orders.p_Id ORDER BY persons.lastname;
INNER JOIN
The INNER JOIN keyword returns rows when there is at least one match in both tables.
Syntax
SQL>SELECT column_name(s) FROM table_name1 INNER JOIN table_name2 ON table_name1.column_name=table_name2.column_name
Ex: SQL> SELECT persons.lastname, persons.firstname, orders.orderno FROM persons INNER JOIN orders ON persons.p_Id = orders.p_Id ORDER BY persons.lastname;
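To see how the join types differ, suppose persons holds the hypothetical rows (p_Id 1, Hansen) and (p_Id 2, Pettersen), and orders holds only (orderno 7785, p_Id 1). INNER JOIN returns just Hansen's row; LEFT JOIN also returns Pettersen with a NULL orderno; RIGHT JOIN returns only the rows backed by orders (here, Hansen's); FULL OUTER JOIN returns the union of the LEFT and RIGHT results.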
Views
• A database view is a logical table based on a query. It does not store data, but presents it in a format different from the one in which it is stored in the underlying tables.
• A view does not physically exist in the database - it is derived from other database tables.
• The general syntax is:
CREATE [OR REPLACE] VIEW <view-name> [<column(s)>] as <select-statement> [with check option [constraint <name>]];
Example:
CREATE VIEW test1 as select ename, eaddress, dept from Employee;
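The with check option shown in the syntax can be sketched as follows (assuming the same Employee table; the view name sales_emp, the dept value 'Sales' and the constraint name sales_chk are illustrative). The view then rejects inserts or updates that would produce rows it cannot itself select:
CREATE VIEW sales_emp as select ename, eaddress, dept from Employee where dept = 'Sales' with check option constraint sales_chk;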
Triggers
• A trigger is a statement that the system executes automatically as a side effect of a modification to the database. When a trigger executes, it is said to have fired.
• Syntax:
CREATE [OR REPLACE] TRIGGER <trigger_name> {BEFORE|AFTER} {INSERT|DELETE|UPDATE} ON <table_name> [REFERENCING [NEW AS <new_row_name>] [OLD AS <old_row_name>]] [FOR EACH ROW [WHEN (<trigger_condition>)]] <trigger_body>
Example:
CREATE TRIGGER student_insert AFTER INSERT ON student_table
FOR EACH ROW
BEGIN
INSERT INTO student2 (sid, saddress) VALUES (:new.sid, :new.saddress);
END;
• To view a list of all defined triggers, use: SELECT trigger_name FROM user_triggers;
• To drop a trigger: DROP TRIGGER <trigger_name>;
Data Control Language (DCL)
• DCL statements are used for securing the database.
• DCL statements control access to the database.
• The two main DCL statements are GRANT and REVOKE.
CREATING USER
The CREATE USER statement is used to create and configure a database user, which is an account through which we can log in to the database, and to establish the means by which the database permits access by the user.
Syntax: CREATE USER username IDENTIFIED BY password;
Ex: CREATE USER ram IDENTIFIED BY kantipur;
Connecting to database server SQL>CONNECT username/password
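Note that in Oracle a newly created user cannot open a session until granted the CREATE SESSION privilege (granting rights is covered next), e.g.:
SQL>GRANT CREATE SESSION TO ram;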
GRANT Statement
The GRANT privilege (rights which are to be allocated) is used when we want to share our database with other users, granting them certain rights.
Privileges can be granted not only on table objects, but also on views, synonyms, indexes, sequences, etc.
Syntax: Granting access to a new user
GRANT DBA TO username
Granting privileges to tables
GRANT [PRIVILEGES] ON [table_name] TO [user_name]
Ex: 1) SQL>GRANT DBA TO ram
Grants database access to user ram.
2) SQL>CONNECT ram/kantipur
Connects to the database system with username ram and password kantipur.
3) SQL> GRANT select ON employee TO ram;
employee is the table whose access right is being granted to the user logged on as ram.
4) SQL>GRANT insert, select, delete ON employee TO operators;
Users logged on as operators are granted select, insert, and delete rights on the employee table.
5) SQL>GRANT insert (empno, ename, job) ON employee TO endusers;
In some cases we need to hide information from particular users; this can be achieved with GRANT. The above command limits endusers to the listed columns, so the employee salary details remain hidden from them.
REVOKE Statement
The REVOKE privilege (rights which are to be de-allocated) is used when we want our database to stop sharing information with other users, revoking certain rights from them. For instance, if we want our operators to have only select access to our database, we can revoke their other privileges with this command.
Privileges can be revoked not only on table objects, but also on views, synonyms, indexes, sequences, etc.
Syntax: Revoking database access from a certain user
REVOKE DBA FROM username
Revoking privileges on table from user
REVOKE [PRIVILEGES] ON [table_name] FROM [user_name]
Ex: SQL> REVOKE DBA FROM ram; SQL> REVOKE insert, delete ON employee FROM operators;
Transaction Control Language (TCL)
COMMIT Command
The commit command saves all transactions to the database since the last COMMIT or ROLLBACK command.
Syntax COMMIT [work];
The keyword commit is the only mandatory part of the syntax. Keyword work is optional; its only purpose is to make the command more user-friendly.
E.g. SQL>DELETE FROM employee WHERE salary > 7500;
The above command deletes the records of those employees whose salary is above 7500. Though the changes are reflected in the database, they are not yet saved permanently; as explained above, they are held in a temporary area. To make the changes permanent, the COMMIT command is used.
SQL> COMMIT WORK;
The above command makes the changes to the database permanent, since the last COMMIT or ROLLBACK command was issued.
ROLLBACK Command
The rollback command is the transactional control command used to undo transactions that have not already been saved to the database.
Syntax: SQL>ROLLBACK [work];
The keyword ROLLBACK is the only mandatory part of the syntax. Keyword work is optional; its only purpose is to make the command more user-friendly.
Ex: SQL>DELETE FROM employee WHERE salary > 7500;
The above command deletes the records of those employees whose salary is above 7500. Though the changes are reflected in the database, they are not yet saved permanently; as explained above, they are held in a temporary area. To discard the changes made to the database, the ROLLBACK command is used.
SQL> ROLLBACK WORK;
The above command discards the changes made to the database since the last COMMIT or ROLLBACK command was issued.
SAVEPOINT Command
- A SAVEPOINT is a point in a transaction that you can roll the transaction back to without rolling back the entire transaction.
Syntax: SQL>SAVEPOINT savepoint_name;
The SAVEPOINT name should be explanatory.
We should create a SAVEPOINT before performing any further task; if we are later asked to undo the work, we might otherwise lose information.
Ex: SQL>SAVEPOINT increm_sal;
SQL>UPDATE employee SET salary = salary + (salary * 0.10);
This increases the salary of every employee by 10%.
If top-level management later decides that only programmers' salaries should be increased by 10%, all we have to do is roll back to the savepoint taken before the salaries were updated.
SQL>ROLLBACK TO increm_sal;
5. Relational Database Design The goal of relational database design is to generate a set of schemas that allow us to
- Store information without unnecessary redundancy.
- Retrieve information easily (and accurately).
A bad design may have several properties, including:
- Repetition of information.
- Inability to represent certain information.
- Loss of information.
Anomaly
An anomaly is a thing or situation that is different from what is normal or expected. In a database, an anomaly is a violation of consistency - an undesired effect on the relation/table caused by operations such as insert, update and delete.
Eg: Student (name, address, subject, grade)
Student
name    address    subject  grade
Hari    Bhaktapur  DBMS     A
Hari    Bhaktapur  C++      A
Naresh  Kathmandu  C        B
Naresh  Kathmandu  DBMS     C
Update Anomaly: It occurs if changing the value of an attribute leads to an inconsistent database state. Eg: During an update, the new address of a student may not be reflected in all the places it is stored.
Insertion Anomaly: It occurs if we cannot insert a tuple due to some design fault. Eg: Unless a student enrolls in some course, we cannot store his name and address.
Delete Anomaly: It occurs if deleting a tuple results in unexpected loss of information. Eg: If a student withdraws his registration from all subjects, then along with the subjects, his name and address will also be deleted.
Informal guidelines for relational database design
Design a relational schema so that it is easy to explain its meaning. Do not combine attributes from multiple entity types and relationship types into a single relation.
Intuitively, if a relation schema corresponds to one entity type or one relationship type, the meaning tends to be clear.
Design the base relation schemas so that no insertion, deletion, or modification anomalies are present in the relations. If any anomalies are present, note them clearly and make sure that the programs that update the database will operate correctly. Due to improper grouping of attributes into a relation schema, the following problems are encountered:
- Storage wastage
- Insert anomalies
- Delete anomalies
- Modification anomalies
If possible, avoid placing attributes in a base relation whose values may frequently be null. If nulls are unavoidable, make sure they apply in exceptional cases only and not to the majority of tuples in a relation. Problems with null values:
- Waste of disk space
- Problem of understanding the meaning of attributes
- Problems in specifying JOIN operations
- Problems in applying some aggregate functions
- May have multiple interpretations (not applicable, unknown, unavailable)
Design relational schemas so that they can be joined with equality conditions on attributes that are either primary keys or foreign keys, in a way that guarantees that no spurious tuples are generated. Do not have relations that contain matching attributes other than foreign key-primary key combinations. If such relations are unavoidable, do not join them on such attributes, because the join may produce spurious tuples.
Domain and Domain choices
A domain is an expression of the permissible values for a given attribute. The use of domain constraints within a database management system ensures that we are getting data of the right type. The domain choices are as follows:
i) CHAR -> a fixed-length string of text, usually up to 256 characters.
ii) VARCHAR -> a variable-length string of text, usually up to 256 characters.
iii) INT -> an integer, the size of which varies depending on the operating system.
iv) DATE -> a date.
v) TIME -> a time.
vi) DATETIME -> a combination of a date and a time.
vii) BOOLEAN -> a logical value (true or false).
viii) DECIMAL and NUMERIC -> real numbers, with a fractional part to the right of the decimal point. We must specify how many digits the number can contain and how many digits should be to the right of the decimal point.
Join dependency
A table T is subject to a join dependency if T can always be recreated by joining multiple tables each having a subset of the attributes of T.
Integrity Constraints
Integrity constraints ensure that changes made to the database by authorized users do not result in the loss of data consistency. Integrity constraints also guard against accidental changes to the database.
Keys Constraints
Primary Key
A PRIMARY KEY constraint is used to specify the primary key of a table, which is a column or set of columns that uniquely identifies a row. Because it identifies the row, a primary key column can never be NULL.
Syntax: Create table <table_name> (column1 datatype(size), column2 datatype(size), ..., [constraint <constraint name>] primary key (column names separated by commas));
Eg:
CREATE TABLE customer (
first_name char(20) NOT NULL,
mid_init char(1) NULL,
last_name char(20) NOT NULL,
SSN char(11) PRIMARY KEY,
cust_phone char(10) NULL
)
Foreign Key
A FOREIGN KEY constraint defines a foreign key, which identifies a relationship between two tables. The foreign key column or columns in one table reference a candidate key in another table. When a row is inserted into the table with the FOREIGN KEY constraint, the values to be inserted into the column or columns defined as the foreign key are checked against the values in the candidate key of the referenced table. If no row in the referenced table matches the values in the foreign key, the new row cannot be inserted. But if the foreign key values to be inserted into the table do exist in the candidate key of the other table, the new row will be inserted.
Eg:
CREATE TABLE items (
item_name char(15) NOT NULL,
item_id smallint NOT NULL IDENTITY(1,1),
price smallmoney NULL,
item_desc varchar(30) NOT NULL DEFAULT 'none',
CONSTRAINT PK_item_id PRIMARY KEY (item_id)
)
CREATE TABLE inventory
( store_id tinyint NOT NULL, item_id smallint NOT NULL, item_quantity tinyint NOT NULL, CONSTRAINT FK_item_id FOREIGN KEY (item_id) REFERENCES items(item_id) )
Domain constraints
Domain constraints are also known as attribute constraints. Domain types may be integer, character, date, time etc. in SQL. Domain constraints are the most elementary form of integrity constraint. They are tested easily by the system whenever a new data item is entered in the database. So, declaring an attribute to be of a particular domain acts as a constraint on the values it can take. Domain constraints not only allow us to test values inserted in the database, but also permit us to test queries to ensure that the comparisons make sense.
Syntax: Create domain <domain name> datatype(size) constraint <constraint name> check (<conditions>);
Example:
Create domain dollars as numeric (12,2);
Create domain pounds as numeric (12,2);
The above statements define the domains dollars and pounds to be decimal numbers with a total of 12 digits, two of which are placed after the decimal point.
Referential Integrity Constraints
Keeping the database in correct form by taking references is the job of referential integrity constraints. A referential integrity constraint ensures that a value appearing in one relation for a set of attributes also appears for a set of attributes in another relation, i.e. the database should not contain any unmatched foreign key values. If B references A, then A should exist. For example, in a banking database where all branches are in the branch relation and there is an account relation containing all accounts with their branch names, if there is any account in the Kathmandu branch, then the Kathmandu branch should exist in the branch relation.
Referential integrity in SQL: In SQL, referential integrity constraints are maintained by using foreign keys.
Eg:
Create table employee (
Empno number(4) constraint pk_emp primary key,
...
Deptno number(3) constraint fk_deptno references Department
);
Basic Concepts
1. Dangling tuples
- Consider a pair of relations r(R) and s(S), and their natural join r ⋈ s.
- There may be a tuple t in r that does not join with any tuple in s.
- That is, there is no tuple u in s such that t[R ∩ S] = u[R ∩ S]. We call t a dangling tuple.
- Dangling tuples may or may not be acceptable.
2. Suppose there is a tuple in the account relation with the value "Lunartown", but no matching tuple in the branch relation for the Lunartown branch. This is undesirable, as the account should refer to a branch that exists. Now suppose there is a tuple in the branch relation with the value "Mokan", but no matching tuple in the account relation for the Mokan branch. This means that a branch exists for which no accounts exist. This is possible, for example, when a branch is being opened, and we want to allow this situation.
3. Note the distinction between these two situations: bname is the primary key of branch, while it is not for account. In account, bname is a foreign key, being the primary key of another relation.
- Let r1(R1) and r2(R2) be two relations with primary keys K1 and K2 respectively.
- We say that a subset a of R2 is a foreign key referencing K1 in relation r1 if it is required that for every tuple t2 in r2 there must be a tuple t1 in r1 such that t1[K1] = t2[a].
- We call these requirements referential integrity constraints. They are also known as subset dependencies, as we require the set of a-values appearing in r2 to be a subset of the K1-values appearing in r1.
Functional Dependencies
A functional dependency is a constraint between two sets of attributes in a relational database. If X and Y are sets of attributes in the same relation T, then X->Y means that "X functionally determines Y" or "Y is functionally dependent on X", so that:
1. The value of the attributes in X uniquely determines the value of the attributes in Y.
2. For any two tuples t1 and t2 in T, t1[X]=t2[X] implies that t1[Y]=t2[Y].
3. If two tuples in T agree in their X columns, then their Y columns should also be the same.
A functional dependency may also be based on a composite attribute. Eg: X, Y->Z
Note: X->Y does not imply Y->X; that X determines Y does not imply that Y determines X.
Armstrong's Axioms of FDs:
1. Reflexivity: If Y ⊆ X then X->Y
2. Augmentation: If X->Y then XZ->YZ
3. Transitivity: If X->Y and Y->Z then X->Z
4. Union: If X->Y and X->Z then X->YZ
5. Decomposition: If X->YZ then X->Y and X->Z
6. Pseudo-transitivity: If X->Y and WY->Z then XW->Z
7. Accumulation rule: If X->YZ and Z->W then X->YZW
Closure of functional dependencies: Let F be a set of functional dependencies; then the closure of F, denoted by F+, is the set of all functional dependencies logically implied by F. We can compute F+ directly from the formal definition of functional dependencies. F+ is also defined as the set of all FDs derivable from F.
Example:-
R = {A, B, C, H, G, I} and F = {A->B, B->C, CG->H, CG->I, B->H}; then some members of F+ are:
i. A->H (from A->B and B->H, by transitivity)
ii. CG->HI (from CG->H and CG->I, by union)
iii. AG->I (A->C from A->B and B->C; augmenting gives AG->CG; then CG->I)
iv. A->BC (from A->B and A->C, by union)
Equivalence Sets of Functional Dependencies
- Two sets of FDs F and G are equivalent if:
  - every FD in F can be inferred from G, and
  - every FD in G can be inferred from F.
  Hence, F and G are equivalent if F+ = G+.
- Definition: F covers G if every FD in G can be inferred from F (i.e., if G+ is a subset of F+). F and G are equivalent if F covers G and G covers F.
Minimal Sets of Functional Dependencies
A set of FDs F is minimal if it satisfies the following conditions:
(1) Every dependency in F has a single attribute for its RHS.
(2) We cannot remove any dependency from F and still have a set of dependencies that is equivalent to F.
(3) We cannot replace any dependency X -> A in F with a dependency Y -> A, where Y is a proper subset of X, and still have a set of dependencies that is equivalent to F.
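A small worked example of minimizing a set of FDs (not from the notes): let F = {A->BC, B->C, A->B, AB->C}.
- Splitting right-hand sides gives {A->B, A->C, B->C, AB->C}.
- In AB->C, the attribute A is extraneous because B->C already holds, so AB->C reduces to the duplicate B->C and is dropped.
- A->C is redundant, since it follows from A->B and B->C by transitivity.
The minimal set is therefore {A->B, B->C}.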
Multivalued Dependencies
1. Functional dependencies rule out certain tuples from appearing in a relation. If A -> B, then we cannot have two tuples with the same A value but different B values.
2. Multivalued dependencies do not rule out the existence of certain tuples. Instead, they require that other tuples of a certain form be present in the relation.
3. Let R be a relation schema, and let X ⊆ R and Y ⊆ R. The multivalued dependency X ->> Y holds on R if, in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[X] = t2[X], there exist tuples t3 and t4 in r such that:
t1[X] = t2[X] = t3[X] = t4[X]
t3[Y] = t1[Y] and t3[R - Y] = t2[R - Y]
t4[Y] = t2[Y] and t4[R - Y] = t1[R - Y]
4. Figure 1 (a tabular representation of X ->> Y; table omitted) looks horrendously complicated, but is really rather simple. A simple example is a table with the schema (name, address, car), as shown in Figure 2, where name ->> address. Intuitively, name ->> address says that the relationship between name and address is independent of the relationship between name and car.
If the multivalued dependency X ->> Y is satisfied by all relations on schema R, then we say it is a trivial multivalued dependency on schema R. Thus X ->> Y is trivial if Y ⊆ X or X ∪ Y = R.
5. Look at the example relation bc in Figure 3 (Relation bc, an example of redundancy in a BCNF relation). We must repeat the loan number once for each address a customer has, and the address once for each loan the customer has. This repetition is pointless, as the relationship between a customer and a loan is independent of the relationship between a customer and his or her address.
- If a customer, say "Smith", has loan number 23, we want all of Smith's addresses to be associated with that loan.
- Thus the relation of Figure 4 (an illegal bc relation) is illegal.
- If we look at our definition of multivalued dependency, we see that we want the multivalued dependency cname ->> street ccity to hold on BC-schema.
6. Note that if a relation r fails to satisfy a given multivalued dependency, we can construct a relation r' that does satisfy the multivalued dependency by adding tuples to r.
Normalization
Normalization is a design technique that is widely used as a guide in designing relational databases. Normalization is essentially a two-step process that puts data into tabular form by removing repeating groups and then removes duplicated data from the relational tables. Normalization theory is based on the concepts of normal forms. A relational table is said to be in a particular normal form if it satisfies a certain set of constraints. There are currently five normal forms that have been defined. In this section, we will cover the first three normal forms that were defined by E. F. Codd.
Need of Normalization
The goal of normalization is to create a set of relational tables that are free of redundant data and that can be consistently and correctly modified. This means that all tables in a relational database should be in the third normal form (3NF). A relational table is in 3NF if and only if all non-key columns are (a) mutually independent and (b) fully dependent upon the primary key. Mutual independence means that no non-key column is dependent upon any combination of the other columns. The first two normal forms are intermediate steps to achieve the goal of having all tables in 3NF. In order to better understand 2NF and the higher forms, it is necessary to understand the concepts of functional dependencies and lossless decomposition. Normalization:
i) removes redundant data from a relational table.
ii) maintains consistency in the table.
iii) converts a complex table into simple ones.
iv) tries to remove the anomalies as we proceed from 1NF to 3NF.
Functional Dependencies
The concept of functional dependencies is the basis for the first three normal forms. A column, Y, of the relational table R is said to be functionally dependent upon column X of R if and only if each value of X in R is associated with precisely one value of Y at any given time. X and Y may be composite. Saying that column Y is functionally dependent upon X is the same as saying the values of column X identify the values of column Y. If column X is a primary key, then all columns in the relational table R must be functionally dependent upon X. A shorthand notation for describing a functional dependency is:
R.x -> R.y
which can be read as: in the relational table named R, column x functionally determines (identifies) column y.
Full functional dependence applies to tables with composite keys. Column Y in relational table R is fully functionally dependent upon X of R if it is functionally dependent on X and not functionally dependent upon any subset of X. Full functional dependence means that when a primary key is composite, made of two or more columns, the other columns must be identified by the entire key and not just some of the columns that make up the key.
Overview
Simply stated, normalization is the process of removing redundant data from relational tables by decomposing (splitting) a relational table into smaller tables by projection. The goal is to have only primary keys on the left-hand side of a functional dependency. In order to be correct, the decomposition must be lossless. That is, the new tables can be recombined by a natural join to recreate the original table without creating any spurious or redundant data.
Sample Data
Data taken from Date [Date90] is used to illustrate the process of normalization. A company obtains parts from a number of suppliers. Each supplier is located in one city. A city can have more than one supplier located there and each city has a status code associated with it. Each supplier may provide many parts. The company creates a simple relational table to store this information that can be expressed in relational notation as:
FIRST (s#, status, city, p#, qty)
where
s#      Supplier identification number (this is the primary key)
status  Status code assigned to city
city    Name of city where supplier is located
p#      Part number of part supplied
qty     Quantity of parts supplied to date
In order to uniquely associate quantity supplied (qty) with part (p#) and supplier (s#), a composite primary key composed of s# and p# is used.
First Normal Form
A relational table, by definition, is in first normal form. All values of the columns are atomic. That is, they contain no repeating values. Figure 1 shows the table FIRST in 1NF.
Figure 1: Table in 1NF
Although the table FIRST is in 1NF, it contains redundant data. For example, information about the supplier's location and the location's status has to be repeated for every part supplied. Redundancy causes what are called update anomalies. Update anomalies are problems that arise when information is inserted, deleted, or updated. For example, the following anomalies could occur in FIRST:
INSERT. The fact that a certain supplier (s5) is located in a particular city (Athens) cannot be added until they supply a part.
DELETE. If a row is deleted, then not only is the information about quantity and part lost, but also the information about the supplier.
UPDATE. If supplier s1 moved from London to New York, then six rows would have to be updated with this new information.
Second Normal Form
The definition of second normal form states that only tables with composite primary keys can be in 1NF but not in 2NF. A relational table is in second normal form (2NF) if it is in 1NF and every non-key column is fully dependent upon the primary key. That is, every non-key column must be dependent upon the entire primary key. FIRST is in 1NF but not in 2NF because status and city are functionally dependent upon only the column s# of the composite key (s#, p#). This can be illustrated by listing the functional dependencies in the table:
s# -> city, status
city -> status
(s#, p#) -> qty
The process for transforming a 1NF table to 2NF is:
1. Identify any determinants other than the composite key, and the columns they determine.
2. Create and name a new table for each determinant and the unique columns it determines.
3. Move the determined columns from the original table to the new table. The determinant becomes the primary key of the new table.
4. Delete the columns you just moved from the original table except for the determinant, which will serve as a foreign key.
5. The original table may be renamed to maintain semantic meaning.
To transform FIRST into 2NF we move the columns s#, status, and city to a new table called SECOND. The column s# becomes the primary key of this new table. The results are shown below in Figure 2.
Figure 2: Tables in 2NF
Tables in 2NF but not in 3NF still contain modification anomalies. In the example of SECOND, they are:
INSERT. The fact that a particular city has a certain status (Rome has a status of 50) cannot be inserted until there is a supplier in the city.
DELETE. Deleting any row in SUPPLIER destroys the status information about the city as well as the association between supplier and city.
Third Normal Form
The third normal form requires that all columns in a relational table are dependent only upon the primary key. A more formal definition is: A relational table is in third normal form (3NF) if it is already in 2NF and every non-key column is non-transitively dependent upon its primary key. In other words, all non-key attributes are functionally dependent only upon the primary key.
Table PARTS is already in 3NF. The non-key column, qty, is fully dependent upon the primary key (s#, p#). SUPPLIER is in 2NF but not in 3NF because it contains a transitive dependency. A transitive dependency occurs when a non-key column that is determined by the primary key is itself a determinant of other columns. The concept of a transitive dependency can be illustrated by showing the functional dependencies in SUPPLIER:
SUPPLIER.s# -> SUPPLIER.status
SUPPLIER.s# -> SUPPLIER.city
SUPPLIER.city -> SUPPLIER.status
Note that SUPPLIER.status is determined both by the primary key s# and the non-key column city. The process of transforming a table into 3NF is:
1. Identify any determinants, other than the primary key, and the columns they determine.
2. Create and name a new table for each determinant and the unique columns it determines.
3. Move the determined columns from the original table to the new table. The determinant becomes the primary key of the new table.
4. Delete the columns you just moved from the original table except for the determinant, which will serve as a foreign key.
5. The original table may be renamed to maintain semantic meaning.
To transform SUPPLIER into 3NF, we create a new table called CITY_STATUS and move the columns city and status into it. Status is deleted from the original table, city is left behind to serve as a foreign key to CITY_STATUS, and the original table is renamed to SUPPLIER_CITY to reflect its semantic meaning. The results are shown in Figure 3 below.
Figure 3: Tables in 3NF
The results of putting the original table into 3NF have created three tables. These can be represented in "pseudo-SQL" as:
PARTS (s#, p#, qty)
Primary Key (s#, p#)
Foreign Key (s#) references SUPPLIER_CITY.s#
SUPPLIER_CITY (s#, city)
Primary Key (s#)
Foreign Key (city) references CITY_STATUS.city
CITY_STATUS (city, status)
Primary Key (city)
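The pseudo-SQL above can be realized as DDL in the Oracle style used in Unit 4 (the column data types are assumptions):
Create table CITY_STATUS (city varchar2 (20) primary key, status number (3));
Create table SUPPLIER_CITY (s# varchar2 (5) primary key, city varchar2 (20) references CITY_STATUS (city));
Create table PARTS (s# varchar2 (5) references SUPPLIER_CITY (s#), p# varchar2 (5), qty number (5), primary key (s#, p#));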
Advantages of Third Normal Form
The advantage of having relational tables in 3NF is that it eliminates redundant data, which in turn saves space and reduces manipulation anomalies. For example, the improvements to our sample database are:
INSERT. Facts about the status of a city (Rome has a status of 50) can be added even though there is no supplier in that city. Likewise, facts about new suppliers can be added even though they have not yet supplied parts.
DELETE. Information about parts supplied can be deleted without destroying information about a supplier or a city.
UPDATE. Changing the location of a supplier or the status of a city requires modifying only one row.
Advanced Normalization
After 3NF, all normalization problems involve only tables which have three or more columns, all of which are keys. Many practitioners argue that placing entities in 3NF is generally sufficient because it is rare that entities in 3NF are not also in 4NF and 5NF. They further argue that the benefits gained from transforming entities into 4NF and 5NF are so slight that it is not worth the effort. However, the advanced normal forms are presented because there are cases where they are required.
Boyce-Codd Normal Form
Boyce-Codd normal form (BCNF) is a more rigorous version of 3NF that deals with relational tables that have (a) multiple candidate keys, (b) composite candidate keys, and (c) candidate keys that overlap. BCNF is based on the concept of determinants. A determinant column is one on which some of the columns are fully functionally dependent. A relational table is in BCNF if and only if every determinant is a candidate key.
Fourth Normal Form
A relational table is in the fourth normal form (4NF) if it is in BCNF and all multivalued dependencies are also functional dependencies. Fourth normal form (4NF) is based on the concept of multivalued dependencies (MVD). A multivalued dependency occurs when, in a relational table containing at least three columns, one column has multiple rows whose values match a value of a single row of one of the other columns. A more formal definition given by Date is: Given a relational table R with columns A, B, and C, then R.A ->> R.B (column A multidetermines column B) is true if and only if the set of B-values matching a given pair of A-values and C-values in R depends only on the A-value and is independent of the C-value. MVDs always occur in pairs. That is, R.A ->> R.B holds if and only if R.A ->> R.C also holds.
Suppose that employees can be assigned to multiple projects. Also suppose that employees can have multiple job skills. If we record this information in a single table, all three attributes must be used as the key since no single attribute can uniquely identify an instance. The relationship between emp# and prj# is a multivalued dependency because for each pair of emp#/skill values in the table, the associated set of prj# values is determined only by emp# and is independent of skill. The relationship between emp# and skill is also a multivalued dependency, since the set of skill values for an emp#/prj# pair is always dependent upon emp# only. To transform a table with multivalued dependencies into 4NF, move each MVD pair to a new table. The result is shown in Figure 1.
Figure 1: Tables in 4NF
Fifth Normal Form
A table is in the fifth normal form (5NF) if it cannot have a lossless decomposition into any number of smaller tables. While the first four normal forms are based on the concept of functional dependence, the fifth normal form is based on the concept of join dependence. Join dependency means that a table, after it has been decomposed into three or more smaller tables, must be capable of being joined again on common keys to form the original table. Stated another way, 5NF indicates when an entity cannot be further decomposed. 5NF is complex and not intuitive. Most experts agree that tables that are in 4NF are also in 5NF except for "pathological" cases. Teorey suggests that true many-to-many-to-many ternary relations are one such case. Adding an instance to a table that is not in 5NF creates spurious results when the tables are decomposed and then rejoined. For example, let's suppose that we have an employee who uses design skills on one project and programming skills on another. This information is shown below.
emp#  prj#  skill
1211  11    Design
1211  28    Program
Next we add an employee (1544) who uses programming skills on Project 11.
emp#  prj#  skill
1211  11    Design
1211  28    Program
1544  11    Program
Next, we project this information into three tables as we did above. However, when we rejoin the tables, the recombined table contains spurious results.
emp#  prj#  skill
1211  11    Design
1211  11    Program   <<- spurious data
1211  28    Program
1544  11    Design    <<- spurious data
1544  11    Program
By adding one new instance to a table not in 5NF, two false assertions were stated:
Assertion 1
Employee 1211 has been assigned to Project 11. Project 11 requires programming skills. Therefore, Employee 1211 must use programming skills while assigned to Project 11.
Assertion 2
Employee 1544 has been assigned to project 11. Project 11 needs Design skills. Therefore, Employee 1544 must use Design skills in Project 11.
Representation of Information 1. Suppose we have a schema, Lending-schema, Lending-schema = (bname, bcity, assets, cname, loan#, amount) and suppose an instance of the relation is
(Sample lending relation; figure omitted.)
2. A tuple t in the new relation has the following attributes:
- t[assets] is the assets for t[bname]
- t[bcity] is the city for t[bname]
- t[loan#] is the loan number made by branch t[bname] to t[cname]
- t[amount] is the amount of the loan for t[loan#]
3. If we wish to add a loan to our database, the original design would require adding a tuple to borrow: (SFU, L-31, Turner, 1K)
4. In our new design, we need a tuple with all the attributes required for Lending-schema. Thus we need to insert (SFU, Burnaby, 2M, Turner, L-31, 1K)
5. We are now repeating the assets and branch city information for every loan.
- Repetition of information wastes space.
- Repetition of information complicates updating.
6. Under the new design, we need to change many tuples if the branch's assets change.
7. Let's analyze this problem:
- We know that a branch is located in exactly one city.
- We also know that a branch may make many loans.
- The functional dependency bname -> bcity holds on Lending-schema.
- The functional dependency bname -> loan# does not.
- These two facts are best represented in separate relations.
8. Another problem is that we cannot represent the information for a branch (assets and city) unless we have a tuple for a loan at that branch.
9. Unless we use nulls, we can only have this information when there are loans, and must delete it when the last loan is paid off.
Decomposition
1. The previous example might seem to suggest that we should decompose a schema as much as possible. Careless decomposition, however, may lead to another form of bad design.
2. Consider a design where Lending-schema is decomposed into two schemas
Branch-customer-schema = (bname, bcity, assets, cname)
Customer-loan-schema = (cname, loan#, amount)
3. We construct our new relations from lending by projection:
branch-customer = the projection of lending onto (bname, bcity, assets, cname)
customer-loan = the projection of lending onto (cname, loan#, amount)
(The decomposed lending relation.)
4. It appears that we can reconstruct the lending relation by performing a natural join on the two new schemas.
5. The following figure (join of the decomposed relations) shows what we get by computing branch-customer ⋈ customer-loan.
6. We notice that there are tuples in branch-customer ⋈ customer-loan that are not in lending.
7. How did this happen?
- The intersection of the two schemas is cname, so the natural join is made on the basis of equality in the cname.
- If two lendings are for the same customer, there will be four tuples in the natural join. Two of these tuples will be spurious - they will not appear in the original lending relation, and should not appear in the database.
- Although we have more tuples in the join, we have less information.
- Because of this, we call this a lossy or lossy-join decomposition.
- A decomposition that is not lossy-join is called a lossless-join decomposition.
- The only way we could make a connection between branch-customer and customer-loan was through cname.
8. When we decompose Lending-schema into Branch-schema and Branch-loan-schema, we will not have a similar problem.
Branch-schema = (bname, bcity, assets)
Branch-loan-schema = (bname, cname, loan#, amount)
The only way we could represent a relationship between tuples in the two relations is through bname. This will not cause problems.
9. For a given branch name, there is exactly one assets value and exactly one bcity; whereas a similar statement associated with a loan depends on the customer, not on the amount of the loan (which is not unique).
10. We'll make a more formal definition of lossless-join:
- Let R be a relation schema.
- A set of relation schemas {R1, R2, ..., Rn} is a decomposition of R if R = R1 ∪ R2 ∪ ... ∪ Rn.
- That is, every attribute in R appears in at least one Ri.
- Let r be a relation on R, and let ri be the projection of r onto Ri, for i = 1, 2, ..., n.
- That is, {r1, r2, ..., rn} is the database that results from decomposing R into {R1, R2, ..., Rn}.
- It is always the case that: r ⊆ r1 ⋈ r2 ⋈ ... ⋈ rn.
- To see why this is, consider a tuple t in r. When we compute the relations ri, the tuple t gives rise to one tuple ti in each ri. These n tuples combine together to regenerate t when we compute the natural join of the ri. Thus every tuple in r appears in r1 ⋈ r2 ⋈ ... ⋈ rn.
- However, in general, r ≠ r1 ⋈ r2 ⋈ ... ⋈ rn. We saw an example of this inequality in our decomposition of lending into branch-customer and customer-loan.
- In order to have a lossless-join decomposition, we need to impose some constraints on the set of possible relations.
- Let C represent a set of constraints on the database.
- A decomposition {R1, R2, ..., Rn} of a relation schema R is a lossless-join decomposition for R if, for all relations r on schema R that are legal under C: r = r1 ⋈ r2 ⋈ ... ⋈ rn.
11. In other words, a lossless-join decomposition is one in which, for any legal relation r, if we decompose r and then "recompose" r, we get what we started with - no more and no less.
Normalization Using Functional Dependencies
We can use functional dependencies to design a relational database in which most of the problems we have seen do not occur. Using functional dependencies, we can define several normal forms which represent "good" database designs.
Desirable Properties of Decomposition
1. We'll take another look at the schema Lending-schema = (bname, assets, bcity, loan#, cname, amount) which we saw was a bad design.
2. The set of functional dependencies we required to hold on this schema was:
bname -> assets bcity
loan# -> amount bname
3. If we decompose it into
Branch-schema = (bname, assets, bcity)
Loan-info-schema = (bname, loan#, amount)
Borrow-schema = (cname, loan#)
We claim this decomposition has several desirable properties.
Lossless-Join Decomposition
1. We claim the above decomposition is lossless. How can we decide whether a decomposition is lossless?
- Let R be a relation schema.
- Let F be a set of functional dependencies on R.
- Let R1 and R2 form a decomposition of R.
- The decomposition is a lossless-join decomposition of R if at least one of the following functional dependencies is in F+:
1. R1 ∩ R2 -> R1
2. R1 ∩ R2 -> R2
Why is this true? Simply put, it ensures that the attributes involved in the natural join (R1 ∩ R2) are a candidate key for at least one of the two relations. This ensures that we can never get the situation where spurious tuples are generated, as for any value on the join attributes there will be a unique tuple in one of the relations.
2. We'll now show our decomposition is lossless-join by showing a set of steps that generate the decomposition. First we decompose Lending-schema into
Branch-schema = (bname, bcity, assets)
Loan-info-schema = (bname, cname, loan#, amount)
Since bname -> assets bcity, the augmentation rule for functional dependencies implies that
bname -> bname assets bcity
Since Branch-schema ∩ Loan-info-schema = {bname}, our decomposition is lossless-join. Next we decompose Loan-info-schema into
Loan-schema = (bname, loan#, amount)
Borrow-schema = (cname, loan#)
As loan# is the common attribute, and loan# → amount bname, this is also a lossless-join decomposition.
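To make ``decompose, then recompose'' concrete, the following small Python sketch (not part of the original notes; the sample tuples are invented) projects an instance of Loan-info-schema onto the two smaller schemas and natural-joins the projections back:

def project(rows, attrs):
    # Π_attrs(r) with duplicate elimination (set semantics)
    return {tuple(row[a] for a in attrs) for row in rows}

def natural_join(rows1, attrs1, rows2, attrs2):
    # join two sets of tuples on their common attributes
    common = [a for a in attrs1 if a in attrs2]
    out_attrs = attrs1 + [a for a in attrs2 if a not in attrs1]
    out = set()
    for t1 in rows1:
        for t2 in rows2:
            d1, d2 = dict(zip(attrs1, t1)), dict(zip(attrs2, t2))
            if all(d1[a] == d2[a] for a in common):
                merged = {**d1, **d2}
                out.add(tuple(merged[a] for a in out_attrs))
    return out, out_attrs

# An instance of Loan-info-schema = (bname, cname, loan#, amount)
loan_info = [
    {"bname": "Downtown", "cname": "Jones", "loan#": "L-17", "amount": 1000},
    {"bname": "Redwood",  "cname": "Smith", "loan#": "L-23", "amount": 2000},
]

loan   = project(loan_info, ["bname", "loan#", "amount"])  # loan# -> amount bname holds
borrow = project(loan_info, ["cname", "loan#"])

joined, attrs = natural_join(loan, ["bname", "loan#", "amount"], borrow, ["cname", "loan#"])
print(joined == {tuple(r[a] for a in attrs) for r in loan_info})  # True: r = Π_R1(r) ⋈ Π_R2(r)

Because loan# is a key of Loan-schema, each borrow tuple matches exactly one loan tuple and no spurious tuples can appear; a decomposition whose shared attributes are a key for neither side can produce spurious tuples and make the check fail.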
6. Security
The data stored in the database needs protection from unauthorized access and malicious destruction or alteration, in addition to the protection against accidental introduction of inconsistency that integrity constraints provide - in short, protection from malicious attempts to steal or modify data.
Security Violations
• Unauthorized reading of data (theft of information)
• Unauthorized modification of data
• Unauthorized destruction of data
To protect the database we must take security measures at different levels:
Database System: Some database system users may be authorized to access only a limited portion of the database. Other users may be allowed to issue queries, but may be forbidden to modify the data. It is the responsibility of the database system to ensure that these authorization restrictions are not violated.
Operating System: No matter how secure the database system is, weaknesses in operating system security may serve as a means of unauthorized access to the database.
Network: Since almost all database systems allow remote access through terminals or networks, software level security within the network software is as important as physical security, both on the Internet and in private networks.
Physical: Sites with computer systems must be physically secured against armed or surreptitious entry by intruders.
Human: Users must be authorized carefully to reduce the chance of any user giving access to an intruder in exchange for a bribe or other favors.
Domain Constraints
• Integrity constraints guard against accidental damage to the database, by ensuring that authorized changes to the database do not result in a loss of data consistency.
• Domain constraints are the most elementary form of integrity constraint.
• They test values inserted in the database, and test queries to ensure that the comparisons make sense.
Example:
create table EMP (
ENAME varchar2(30) constraint check_name check (ENAME = upper(ENAME)),
SAL number(5,2) constraint check_sal check (SAL >= 500),
DEPTNO number(3) constraint check_deptno check (DEPTNO between 10 and 100)
);
Referential Integrity
Ensures that a value that appears in one relation for a given set of attributes also appears for a certain set of attributes in another relation.
• Primary keys, candidate keys and foreign keys can be specified as part of the SQL create table statement:
– The primary key clause lists attributes that comprise the primary key.
– The unique clause lists attributes that comprise a candidate key.
– The foreign key clause lists the attributes that comprise the foreign key and the name of the relation referenced by the foreign key.
• By default, a foreign key references the primary key attributes of the referenced table:
foreign key (account-number) references account
Authorization
Forms of authorization on parts of the database:
• Read authorization - allows reading, but not modification of data.
• Insert authorization - allows insertion of new data, but not modification of existing data.
• Update authorization - allows modification, but not deletion of data.
• Delete authorization - allows deletion of data.
Forms of authorization to modify the database schema:
• Index authorization - allows creation and deletion of indices.
• Resource authorization - allows creation of new relations.
• Alteration authorization - allows addition or deletion of attributes in a relation.
• Drop authorization - allows deletion of relations.
Defense Mechanism
1. Views
2. Granting of Privileges
3. Encryption Techniques
1. Views
A view can hide data that a user does not need to see. The ability of views to hide data serves both to simplify usage of the system and to enhance security.
create view cust-loan as
(select branch-name, customer-name
from borrower, loan
where borrower.loan-number = loan.loan-number)
Now someone who is allowed to query only the view cust-loan cannot read or modify other customer details such as balance amount, loan number, customer address etc. This is preferable and it enhances the security of the database system. The content of the database is described by the conceptual scheme and the user's views are defined by the subschema. The subschema can be used in the name-dependent security enforcement policy to limit the portion of the database known to, and hence accessible by, a user.
Example:
create view DEPT20 as
select ENAME, JOB
from EMP
where DEPTNO = 20;
2. Granting of Privileges
(Figure: authorization grant graph, with the DBA at the root and users U1-U5 as nodes.)
A user who has been granted some form of authorization may be allowed to pass on this authorization to other users. However, we must be careful how authorization may be passed among users, to ensure that such authorization can be revoked at some future time. The passing of authorization from one user to another can be represented by an authorization grant graph. The nodes of this graph are the users. The graph includes an edge Ui → Uj if user Ui grants update authorization to Uj. The root of the graph is the database administrator. A user has an authorization if and only if there is a path from the root of the authorization graph down to the node representing the user.
3. Encryption
Encryption is the conversion of data into a form, called a ciphertext, that cannot be easily understood by unauthorized people.
• Data may be encrypted when database authorization provisions do not offer sufficient protection.
• Properties of a good encryption technique:
– Relatively simple for authorized users to encrypt and decrypt data.
– Encryption scheme depends not on the secrecy of the algorithm but on the secrecy of a parameter of the algorithm called the encryption key.
– Extremely difficult for an intruder to determine the encryption key.
Different encryption techniques are:
• Data Encryption Standard (DES) substitutes characters and rearranges their order on the basis of an encryption key which is provided to authorized users via a secure mechanism. The scheme is no more secure than the key transmission mechanism, since the key has to be shared. E.g. by substituting each character with the next character in the alphabet, Perryridge becomes Qfsszsjehf (a toy version of this substitution is sketched just after this list).
• Advanced Encryption Standard (AES) is a new standard replacing DES, and is also dependent on shared secret keys.
• Public-key encryption is based on each user having two keys:
– public key – publicly published key used to encrypt data, but cannot be used to decrypt data
– private key – key known only to the individual user, and used to decrypt data. Need not be transmitted to the site doing encryption.
Encryption scheme is such that it is impossible or extremely hard to decrypt data given only the public key.
• The RSA public-key encryption scheme is based on the hardness of factoring a very large number (100's of digits) into its prime components.
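The Perryridge example above is just a one-place alphabet shift. A toy Python sketch of that substitution (illustrative only - real DES/AES operate on bit blocks with far stronger transformations):

def shift_encrypt(plaintext, key=1):
    # substitute each letter with the letter 'key' places later in the alphabet
    out = []
    for ch in plaintext:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr(base + (ord(ch) - base + key) % 26))
        else:
            out.append(ch)
    return "".join(out)

def shift_decrypt(ciphertext, key=1):
    return shift_encrypt(ciphertext, -key)

print(shift_encrypt("Perryridge"))   # Qfsszsjehf
print(shift_decrypt("Qfsszsjehf"))   # Perryridge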
Decryption
The act of restoring an encrypted file to its original state through the use of a key is decryption.
Authentication
Authentication refers to the task of verifying the identity of a person/software connecting to a database. The simplest form of authentication consists of a secret password which must be presented when a connection is opened to a database. This is known as password-based authentication. Other examples of authentication are challenge-response systems and digital signatures.
• Password-based authentication is widely used, but is susceptible to sniffing on a network.
• Challenge-response systems avoid transmission of passwords
– DB sends a (randomly generated) challenge string to user
– User encrypts string and returns result.
– DB verifies identity by decrypting result
– Can use a public-key encryption system, by the DB sending a message encrypted using the user's public key, and the user decrypting and sending the message back
• Digital signatures are used to verify authenticity of data
– E.g. use private key (in reverse) to encrypt data, and anyone can verify authenticity by using public key (in reverse) to decrypt data. Only the holder of the private key could have created the encrypted data.
– Digital signatures also help ensure nonrepudiation: the sender cannot later claim to have not created the data
A sketch of the challenge-response exchange follows.
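A minimal single-process sketch of the challenge-response exchange. As an assumption for illustration, it uses a shared secret with a keyed hash (HMAC) in place of encryption - a common variant of the same idea - and the function names are ours, not a real DBMS API:

import hmac, hashlib, secrets

SHARED_KEY = b"user-secret-key"     # held by both sides, never transmitted

def make_challenge():
    # DB side: a fresh random challenge per login attempt
    return secrets.token_bytes(16)

def respond(challenge, key):
    # user side: prove knowledge of the key without sending it
    return hmac.new(key, challenge, hashlib.sha256).digest()

def verify(challenge, response, key):
    # DB side: recompute and compare in constant time
    expected = hmac.new(key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)

challenge = make_challenge()
response = respond(challenge, SHARED_KEY)
print("authenticated:", verify(challenge, response, SHARED_KEY))   # True

In the public-key variant described above, the DB would instead encrypt the challenge with the user's public key and check the decrypted reply.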
Digital Certificates
• Digital certificates are used to verify authenticity of public keys.
• Problem: when you communicate with a web site, how do you know if you are talking with the genuine web site or an impostor?
– Solution: use the public key of the web site
– Problem: how to verify if the public key itself is genuine?
• Solution:
– Every client (e.g. browser) has public keys of a few root-level certification authorities
– A site can get its name/URL and public key signed by a certification authority: the signed document is called a certificate
– Client can use the public key of the certification authority to verify the certificate
– Multiple levels of certification authorities can exist. Each certification authority
• presents its own public-key certificate signed by a higher-level authority, and
• uses its private key to sign the certificates of other web sites/authorities
Authorization
Authorization means giving the user privileges to access the database content in various manners such as insert, modify, delete, view etc. We may assign a user several forms of authorization on parts of the database. For example,
Read authorization allows reading, but not modification, of data.
Insert authorization allows insertion of new data, but not modification of existing data.
Update authorization allows modification, but not deletion, of data.
Delete authorization allows deletion of data.
We may assign the user all, none or a combination of these types of authorization. In addition to these forms of authorization for access to data, we may grant a user authorization to modify the database schema:
Index authorization allows the creation and deletion of indices.
Resource authorization allows the creation of new relations.
Alteration authorization allows the addition or deletion of attributes in a relation.
Drop authorization allows the deletion of relations.
Privileges in SQL
The SQL standard includes the privileges delete, insert, select, and update. SQL also includes a references privilege that permits a user to declare foreign keys when creating relations.
Grant <privilege list> on <relation or view name> to <user/role list>
Revoke <privilege list> on <relation or view name> from <user/role list>
E.g.
grant select on account to U1, U2, U3
grant update (amount) on loan to U1, U2, U3
grant references (branch-name) on branch to U1
revoke select on account from U1, U2, U3
Roles
A set of roles can be created in the database and authorizations can be granted to roles in exactly the same fashion as they are granted to individual users. Each database user is granted a set of roles that he or she is authorized to perform. In a bank database, examples of roles could include teller, branch-manager, auditor, system-administrator. Creating roles in SQL:
create role teller
Granting privileges to roles:
grant select on account to teller
Assigning roles to users and other roles:
grant teller to john
create role manager
grant teller to manager
grant manager to mary
Access control: Discretionary and Mandatory
Discretionary access control is based on the concept of access rights, or privileges, and mechanisms for giving users such privileges. A privilege allows a user to access some data objects in a certain manner (e.g. to read or modify). A user who creates a database object such as a table or a view automatically gets all applicable privileges on that object. SQL supports discretionary access control through the GRANT and REVOKE commands. The GRANT command gives privileges to a user and the REVOKE command takes away privileges. E.g.:
GRANT privileges on object to user
REVOKE privileges on object from user
In the discretionary access control method, a devious unauthorized user can trick an authorized user into disclosing sensitive data. The mandatory access control method is based on system-wide policies that cannot be changed by individual users. In this approach each database object is assigned a security class, each user is assigned a clearance for a security class, and rules are imposed on the reading and writing of database objects by users. These rules seek to ensure that sensitive data can never be ``passed on'' to a user without the necessary clearance.
Audit Trails
Many secure database applications require an audit trail to be maintained. An audit trail is a log of all changes (inserts/deletes/updates) to the database, along with information such as which user performed the change and when the changes were performed. The audit trail aids security in several ways. For example, if the balance on an account is found to be incorrect, the bank may wish to trace all the updates performed on the account, to find incorrect or fraudulent updates, as well as the persons who carried out the updates. The bank could then also use the audit trail to trace all the updates performed by these persons, in order to find other incorrect or fraudulent updates.
7. Query Processing
Query processing is the procedure of selecting the best plan or strategy to be used in responding to a database request. The plan is then executed to generate a response. Steps involved in processing a query:
1. Parsing and translation
Translate the query into its internal form. This is then translated into relational algebra. The parser checks syntax and verifies relations.
2. Evaluation
The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.
3. A relational algebra expression may have many equivalent expressions.
E.g., σ_balance<2500(Π_balance(account)) is equivalent to Π_balance(σ_balance<2500(account))
4. An annotated expression specifying a detailed evaluation strategy is called an evaluation plan.
E.g., can use an index on balance to find accounts with balance < 2500, or can perform a complete relation scan and discard accounts with balance ≥ 2500.
5. Query Optimization: Amongst all equivalent evaluation plans choose the one with lowest cost. Cost is estimated using statistical information from the database catalog
E.g. number of tuples in each relation, size of tuples, etc.
Query Cost
The cost of query evaluation can be measured in terms of a number of different resources, such as:
- Disk accesses
- CPU time to execute a query
- The cost of communication (in distributed systems)
The response time for a query evaluation plan, assuming that no other activity is going on the computer while the query is being executed, would account for all these costs. Since CPU speeds are much faster than disk access speeds, and estimation of CPU time is relatively hard, disk access cost is considered a reasonable measure of the cost of a query evaluation plan. A measure of the number of block transfers would estimate:
- The number of seek operations performed
- The number of blocks read
- The number of blocks written
The costs of all the algorithms depend on the size of the buffer in main memory. In the best case, all data can be read into the buffers, and the disk does not need to be accessed again. In the worst case, we assume that the buffer can hold only a few blocks of data, approximately one block per relation.
Query Interpretation
1. Why do we need to optimize?
o A high-level relational query is generally non-procedural in nature.
o It says ``what'', rather than ``how'' to find it.
o When a query is presented to the system, it is useful to find an efficient method of finding the answer, using the existing database structure.
o Usually worthwhile for the system to spend some time on strategy selection.
o Typically can be done using information in main memory, with little or no disk access.
o Execution of the query will require disk accesses.
o Transfer of data from disk is slow, relative to the speed of main memory and the CPU.
o It is advantageous to spend a considerable amount of processing to save disk accesses.
2. Do we really optimize?
o Optimizing means finding the best of all possible methods.
o The term ``optimization'' is a bit of a misnomer here.
o Usually the system does not calculate the cost of all possible strategies.
o Perhaps ``query improvement'' is a better term.
3. Two main approaches:
1. Rewriting the query in a more effective manner.
2. Estimating the cost of various execution strategies for the query.
Usually both strategies are combined. The difference in execution time between a good strategy and a bad one may be huge.
o Thus this is an important issue in any DB system.
o As a relational query can be expressed entirely in a relational query language without the use of a host language, it is possible to optimize queries automatically.
o SQL is suitable for human use, but internally a query should be represented in a more useful form, like the relational algebra.
4. So, first the system must translate the query into its internal form. Then optimization begins:
o Find an equivalent expression that is more efficient to execute.
o Select a detailed strategy for processing the query. (Choose specific indices to use, and the order in which tuples are to be processed, etc.)
5. Final choice of a strategy is based primarily on the number of disk accesses required.
Equivalence of Expressions
1. The first step in selecting a query-processing strategy is to find a relational algebra expression that is equivalent to the given query and is efficient to execute.
2. We'll use the following relations as examples:
Customer-scheme = (cname, street, ccity)
Deposit-scheme = (bname, account#, cname, balance)
Branch-scheme = (bname, assets, bcity)
We will use instances customer, deposit and branch of these schemes.
Selection Operation
1. Consider the query to find the assets and branch-names of all banks who have depositors living in Port Chester. In relational algebra, this is
Π_bname,assets(σ_ccity=``Port Chester''(customer ⋈ deposit ⋈ branch))
This expression constructs a huge relation, customer ⋈ deposit ⋈ branch, of which we are only interested in a few tuples. We also are only interested in two attributes of this relation. We can see that we only want tuples for which ccity = ``Port Chester''. Thus we can rewrite our query as:
Π_bname,assets((σ_ccity=``Port Chester''(customer)) ⋈ deposit ⋈ branch)
This should considerably reduce the size of the intermediate relation.
2. Suggested Rule for Optimization:
o Perform select operations as early as possible.
o If our original query was restricted further to customers with a balance over $1000, the selection cannot be done directly to the customer relation above. The new relational algebra query is
Π_bname,assets(σ_ccity=``Port Chester'' ∧ balance>1000(customer ⋈ deposit ⋈ branch))
The selection cannot be applied to customer, as balance is an attribute of deposit.
o We can still rewrite as
Π_bname,assets((σ_ccity=``Port Chester'' ∧ balance>1000(customer ⋈ deposit)) ⋈ branch)
o If we look further at the subquery (the selection over customer ⋈ deposit), we can split the selection predicate in two:
σ_ccity=``Port Chester''(σ_balance>1000(customer ⋈ deposit))
This rewriting gives us a chance to use our ``perform selections early'' rule again.
o We can now rewrite our subquery as:
(σ_ccity=``Port Chester''(customer)) ⋈ (σ_balance>1000(deposit))
3. Second Transformational Rule: Replace expressions of the form σ_P1∧P2(e) by σ_P1(σ_P2(e)), where P1 and P2 are predicates and e is a relational algebra expression.
o Generally, σ_P1(σ_P2(e)) = σ_P2(σ_P1(e)) = σ_P1∧P2(e)
Projection Operation
1. Like selection, projection reduces the size of relations. It is advantageous to apply projections early. Consider this form of our example query:
Π_bname,assets(((σ_ccity=``Port Chester''(customer)) ⋈ deposit) ⋈ branch)
2. When we compute the subexpression
(σ_ccity=``Port Chester''(customer)) ⋈ deposit
we obtain a relation whose scheme is (cname, ccity, bname, account#, balance).
3. We can eliminate several attributes from this scheme. The only ones we need to retain are those that
o appear in the result of the query or
o are needed to process subsequent operations.
4. By eliminating unneeded attributes, we reduce the number of columns of the intermediate result, and thus its size.
5. In our example, the only attribute we need is bname (to join with branch). So we can rewrite our expression as:
Π_bname,assets((Π_bname((σ_ccity=``Port Chester''(customer)) ⋈ deposit)) ⋈ branch)
6. Note that there is no advantage in doing an early project on a relation before it is needed for some other operation:
o We would access every block for the relation to remove attributes.
o Then we access every block of the reduced-size relation when it is actually needed.
o We do more work in total, rather than less.
Natural Join Operation
1. Another way to reduce the size of temporary results is to choose an optimal ordering of the join operations.
2. Natural join is associative:
(r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3)
3. Although these expressions are equivalent, the costs of computing them may differ. Look again at our expression
Π_bname,assets((σ_ccity=``Port Chester''(customer)) ⋈ deposit ⋈ branch)
We see that we can compute deposit ⋈ branch first and then join with the first part.
o However, deposit ⋈ branch is likely to be a large relation as it contains one tuple for every account.
o The other part, σ_ccity=``Port Chester''(customer), is probably a small relation (comparatively).
o So, if we compute (σ_ccity=``Port Chester''(customer)) ⋈ deposit first, we get a reasonably small relation. It has one tuple for each account held by a resident of Port Chester.
o This temporary relation is much smaller than deposit ⋈ branch.
4. Natural join is commutative:
r1 ⋈ r2 = r2 ⋈ r1
Thus we could rewrite our relational algebra expression as:
Π_bname,assets(((σ_ccity=``Port Chester''(customer)) ⋈ branch) ⋈ deposit)
But there are no common attributes between customer and branch, so this is a Cartesian product that results in lots of tuples. If a user entered this expression, we would want to use the associativity and commutativity of natural join to transform this into the more efficient expression we have derived earlier (join with deposit first, then with branch).
Other Operations
1. Some other equivalences for union and set difference:
σ_P(r1 ∪ r2) = σ_P(r1) ∪ σ_P(r2)
σ_P(r1 − r2) = σ_P(r1) − σ_P(r2)
2. We can summarize query optimization as combining the implementation of various sets of operations in order to reduce the size of intermediate relations:
o Combine projects and selects with a Cartesian product or natural join.
o The idea is to do the selection and/or projection while computing the join.
o This saves computing a large intermediate relation that is going to be subsequently reduced by the select or project anyway.
Join Strategies
1. We look at estimating the cost of processing a join.
2. Several factors influence the selection of an optimal strategy:
o Physical order of tuples in a relation.
o Presence of indices and type of index.
o Cost of computing a temporary index for the sole purpose of processing one query.
3. We'll look at computing the expression deposit ⋈ customer assuming no indices exist. We also let
n_deposit = 10,000 (number of deposit tuples)
n_customer = 200 (number of customer tuples)
Simple Iteration
1. If we don't create an index, we must examine every pair of tuples t1 in deposit and t2 in customer. This means examining 10,000 * 200 = 2,000,000 pairs.
2. If we execute this query cleverly, we can cut down the number of block accesses. We use the following method:
for each tuple d ∈ deposit do begin
  for each tuple c ∈ customer do begin
    examine pair (d, c) to see if a tuple should be added to the result
  end
end
o We read each tuple of deposit once. This could require 10,000 block accesses.
o We read each tuple of customer once for each tuple of deposit. This suggests we read each tuple of customer 10,000 times, giving as many as 2,000,000 accesses to read customer tuples!
o The total number of block accesses, if the tuples are not stored together physically, would be 10,000 + 10,000 * 200 = 2,010,000.
o If we put customer in the outer loop, we get 2,000,200 accesses.
o If the tuples of deposit are stored together physically, fewer accesses are required (at 20 per block, 10,000/20 = 500 block accesses). This would give a total of 2,000,500 accesses.
o We can reduce accesses significantly if we store customer tuples together physically. At 20 tuples per block, only 10 accesses are required to read the entire relation (as opposed to 200). Then we only need 10 * 10,000 = 100,000 block accesses for customer. This gives a total of 100,500.
3. If we use customer in the outer loop:
o Now we reference each tuple of deposit once for each tuple of customer.
o If deposit tuples are stored together physically, then since 20 tuples fit on one block, n_deposit/20 = 500 accesses are needed to read the entire relation.
o Since customer has 200 tuples, we read the deposit relation 200 times. Total cost is then 100,000 for the inner loop plus 10 accesses to read the customer relation once, for a total of 100,010.
o Compared to the previous estimate of 100,500, the savings are small (490).
4. Note that we are considering the worst-case number of block reads, where every time a block is needed it is not in the buffer.
Good buffer management can reduce this considerably.
Block-Oriented Iteration
1. If we process tuples on a per-block basis we can save many accesses. The idea is that, if both relations have tuples stored together physically, we can examine all the tuple pairs for a block of each relation at one time. We still need to read all the tuples of one relation for a block of the other relation. The block method algorithm is:
for each block Bd of deposit do begin
  for each block Bc of customer do begin
    for each tuple d in Bd do begin
      for each tuple c in Bc do begin
        test pair (d, c) to see if a tuple should be added to the result
      end
    end
  end
end
o Instead of reading the entire customer relation for each tuple of deposit, we read the entire customer relation once for each block of deposit.
o Since there are 500 blocks of deposit tuples and 10 blocks of customer tuples, reading customer once for each block of deposit requires 10 * 500 = 5000 accesses to customer blocks.
o Total cost is then 5000 + 500 (for accesses to deposit blocks) = 5500.
o This is obviously a significant improvement over the non-block method, which required roughly 100,000 or 2,000,000 accesses.
o Choice of customer for the inner loop is arbitrary, but does provide a potential advantage. Being the smaller relation, it may be possible to keep it all in main memory. If this was the case, we would only require 500 blocks to read deposit plus 10 blocks to read customer into main memory, for a total of 510 block accesses.
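A Python sketch of the block-oriented method, assuming each relation is supplied as a list of blocks, each block a list of tuples whose first field is the join attribute (the counter mirrors the 500-block/10-block arithmetic above):

def block_nested_loop_join(outer_blocks, inner_blocks):
    # read every inner block once per OUTER BLOCK, not once per outer tuple
    result, block_reads = [], 0
    for ob in outer_blocks:            # e.g. 500 deposit blocks
        block_reads += 1
        for ib in inner_blocks:        # e.g. 10 customer blocks, reread per outer block
            block_reads += 1
            for d in ob:
                for c in ib:
                    if d[0] == c[0]:   # match on the join attribute (cname)
                        result.append(d + c[1:])
    return result, block_reads

# With 500 outer blocks and 10 inner blocks this counts
# 500 + 500 * 10 = 5500 block reads, matching the estimate above.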
Merge-Join
1. Suppose neither relation fits in main memory, but both are stored in sorted order on the join attributes. (E.g. both deposit and customer sorted by cname.)
2. We can then perform a merge-join, computed like this:
o Associate one pointer with each relation.
o Initially these pointers point to the first record in each relation.
o As the algorithm proceeds, the pointers move through the relation.
o A group of tuples in one relation with the same value on the join attributes is read.
o Then the corresponding tuples (if any) of the other relation are read.
o Since the relations are in sorted order, tuples with the same value on the join attributes are in consecutive order. This allows us to read each tuple only once.
3. In the case where tuples of the relations are stored together physically in their sorted order, this algorithm allows us to compute the join by reading each block exactly once.
o For deposit ⋈ customer, this is a total of 510 block accesses.
o This is as good as the block-oriented method with the inner loop relation fitting into main memory.
o The disadvantage is that both relations must be sorted physically.
o It may be worthwhile to do this sort to allow a merge-join.
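A minimal in-memory merge-join sketch, assuming both inputs are lists of tuples already sorted on the join key (their first field); the sample tuples are invented:

def merge_join(left, right):
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            i2 = i                      # collect the group of equal-key tuples
            while i2 < len(left) and left[i2][0] == key:
                i2 += 1
            j2 = j
            while j2 < len(right) and right[j2][0] == key:
                j2 += 1
            for l in left[i:i2]:
                for r in right[j:j2]:
                    result.append(l + r[1:])
            i, j = i2, j2               # each input tuple is read only once
    return result

deposit  = [("Jones", "A-101"), ("Jones", "A-217"), ("Smith", "A-215")]
customer = [("Jones", "Main St"), ("Smith", "North St")]
print(merge_join(deposit, customer))
# [('Jones', 'A-101', 'Main St'), ('Jones', 'A-217', 'Main St'),
#  ('Smith', 'A-215', 'North St')]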
Structure of the Query Optimizer 1. There are many query-processing strategies used in database systems. 2. Most systems only implement a few strategies. 3. Some systems make a heuristic guess of a good strategy, in order to minimize the number of strategies to be considered. 4. Then the optimizer considers every possible strategy, but quits as soon as it determines that the cost is greater than the best previously considered strategy. 5. To simplify the strategy selection task, a query may be split into several subqueries. 6. This simplifies strategy selection and permits recognition of common sub-queries (no need to compute them twice). 7. Examination of a query for common sub queries and the estimation of the cost of a large number of strategies impose a substantial overhead on query processing. 8. However, this is usually more than offset by savings at query execution time. 9. Therefore, most commercial systems include relatively sophisticated optimizers. Alternatives for evaluating an entire expression tree
Materialization: generate results of an expression, whose inputs are relations or are already computed, materialize (store) it on disk. Repeat.
Pipelining: pass on tuples to parent operations even as an operation is being executed.
Materialized evaluation: evaluate one operation at a time, starting at the lowest level. Use intermediate results materialized into temporary relations to evaluate next-level operations. E.g., in the expression tree for the query, compute and store
σ_balance<2500(account)
then compute and store its join with customer, and finally compute the projection on customer-name.
Materialized evaluation is always applicable. The cost of writing results to disk and reading them back can be quite high. Our cost formulas for operations ignore the cost of writing results to disk, so:
Overall cost = Sum of costs of individual operations + cost of writing intermediate results to disk
Double buffering: use two output buffers for each operation; when one is full, write it to disk while the other is getting filled. This allows overlap of disk writes with computation and reduces execution time.
Pipelining
Pipelined evaluation: evaluate several operations simultaneously, passing the results of one operation on to the next. E.g., in the previous expression tree, don't store the result of
σ_balance<2500(account)
instead, pass tuples directly to the join. Similarly, don't store the result of the join; pass tuples directly to the projection. Much cheaper than materialization: no need to store a temporary relation to disk. Pipelining may not always be possible – e.g., sort, hash-join. For pipelining to be effective, use evaluation algorithms that generate output tuples even as tuples are received for inputs to the operation. Pipelines can be executed in two ways: demand driven and producer driven.
In demand-driven or lazy evaluation, the system repeatedly requests the next tuple from the top-level operation. Each operation requests the next tuple from its children operations as required, in order to output its next tuple. In between calls, the operation has to maintain ``state'' so it knows what to return next.
In producer-driven or eager pipelining, operators produce tuples eagerly and pass them up to their parents.
A buffer is maintained between operators: the child puts tuples in the buffer, the parent removes tuples from the buffer. If the buffer is full, the child waits till there is space in the buffer, and then generates more tuples. The system schedules operations that have space in their output buffer and can process more input tuples. Alternative name: pull and push models of pipelining.
Implementation of demand-driven pipelining
Each operation is implemented as an iterator implementing the following operations:
open()
– E.g. file scan: initialize file scan; state: pointer to beginning of file
– E.g. merge join: sort relations; state: pointers to beginning of sorted relations
next()
– E.g. for file scan: output next tuple, and advance and store file pointer
– E.g. for merge join: continue with merge from earlier state till next output tuple is found. Save pointers as iterator state.
close()
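Python generators model the open()/next()/close() iterator interface directly: each operator yields a tuple only when its parent demands one, so no intermediate relation is materialized. A sketch over an invented account instance:

def file_scan(relation):                  # leaf operator
    for tup in relation:
        yield tup

def select(pred, child):                  # σ_pred
    for tup in child:
        if pred(tup):
            yield tup

def project(attrs, child):                # Π_attrs
    for tup in child:
        yield tuple(tup[a] for a in attrs)

account = [
    {"account#": "A-101", "bname": "Downtown",   "balance": 500},
    {"account#": "A-215", "bname": "Mianus",     "balance": 700},
    {"account#": "A-102", "bname": "Perryridge", "balance": 4000},
]

plan = project(["account#"],
               select(lambda t: t["balance"] < 2500,
                      file_scan(account)))

for row in plan:        # pulling from the top drives the whole pipeline
    print(row)          # ('A-101',) then ('A-215',)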
Query evaluation plan A sequence of primitive operations that can be used to evaluate a query is a query evaluation plan. Query optimization is a function of many relational database management systems in which multiple query plans for satisfying a query are examined and a good query plan is identified. Optimization technique 1. Cost based optimization 2. Heuristic optimization
Cost based optimization
Using the cost-based approach, the optimizer determines which execution plan is most efficient by considering available access paths and factoring in information based on statistics for the schema objects (tables or indexes) accessed by the SQL statement. The cost-based approach also considers hints, which are optimization suggestions placed in a comment in the statement.
Conceptually, the cost-based approach consists of these steps:
1. The optimizer generates a set of potential execution plans for the SQL statement based on its available access paths and hints.
2. The optimizer estimates the cost of each execution plan based on statistics in the data dictionary for the data distribution and storage characteristics of the tables, indexes, and partitions accessed by the statement. The cost is an estimated value proportional to the expected resource use needed to execute the statement with a particular execution plan. The optimizer calculates the cost of each possible access method and join order based on the estimated computer resources, including (but not limited to) I/O, CPU time, and memory, that are required to execute the statement using the plan. Serial execution plans with greater costs take more time to execute than those with smaller costs. When using a parallel execution plan, however, resource use is not directly related to elapsed time.
3. The optimizer compares the costs of the execution plans and chooses the one with the smallest cost.
A cost-based optimizer generates a range of query-evaluation plans from the given query by using the equivalence rules, and chooses the one with the least cost. For a complex query, the number of different query plans that are equivalent to a given plan can be large. As an illustration, consider finding the best join order for r1 ⋈ r2 ⋈ ... ⋈ rn. There are (2(n – 1))!/(n – 1)! different join orders for the above expression. With n = 7, the number is 665,280; with n = 10, the number is greater than 17.6 billion! (The quick computation below checks these counts.) There is no need to generate all the join orders. Using dynamic programming, the least-cost join order for any subset of {r1, r2, ..., rn} is computed only once and stored for future use. To find the best join tree for a set of n relations:
o To find the best plan for a set S of n relations, consider all possible plans of the form: S1 ⋈ (S – S1), where S1 is any non-empty subset of S.
o Recursively compute costs for joining subsets of S to find the cost of each plan. Choose the cheapest of the 2^n – 1 alternatives.
o Base case for recursion: single relation access plan. Apply all selections on Ri using the best choice of indices on Ri.
o When the plan for any subset is computed, store it and reuse it when it is required again, instead of recomputing it.
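A quick check of the join-order count (2(n – 1))!/(n – 1)! for the two values quoted above:

from math import factorial

def join_orders(n):
    # number of different join orders for r1 ⋈ r2 ⋈ ... ⋈ rn
    return factorial(2 * (n - 1)) // factorial(n - 1)

print(join_orders(7))    # 665280
print(join_orders(10))   # 17643225600, about 17.6 billion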
Dynamic programming
procedure findbestplan(S)
  if (bestplan[S].cost ≠ ∞)   // bestplan[S] has already been computed
    return bestplan[S]
  if (S contains only 1 relation)
    set bestplan[S].plan and bestplan[S].cost based on the best way of accessing S
    /* using selections on S and indices on S */
  else
    for each non-empty subset S1 of S such that S1 ≠ S
      P1 = findbestplan(S1)
      P2 = findbestplan(S – S1)
      A = best algorithm for joining results of P1 and P2
      cost = P1.cost + P2.cost + cost of A
      if cost < bestplan[S].cost
        bestplan[S].cost = cost
        bestplan[S].plan = ``execute P1.plan; execute P2.plan; join results of P1 and P2 using A''
  return bestplan[S]
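A runnable rendering of findbestplan in Python, memoizing on frozensets of relation names. The cardinalities and the cost model (the cost of the join algorithm A is taken to be the estimated size of the join result) are deliberately toy stand-ins for real catalog statistics:

from functools import lru_cache

CARD = {"customer": 200, "deposit": 10000, "branch": 50}   # invented statistics

def result_size(rels):
    # crude estimate: product of cardinalities, shrunk per join
    size = 1
    for r in rels:
        size *= CARD[r]
    return size // (10 ** (len(rels) - 1))

def proper_subsets(s):
    items = sorted(s)
    for mask in range(1, 2 ** len(items) - 1):   # non-empty, proper subsets
        yield frozenset(items[i] for i in range(len(items)) if mask >> i & 1)

@lru_cache(maxsize=None)                 # memo table plays the role of bestplan[S]
def find_best_plan(rels):                # rels: frozenset of relation names
    if len(rels) == 1:
        (r,) = rels
        return CARD[r], r                # base case: scan the single relation
    best_cost, best_plan = float("inf"), None
    for s1 in proper_subsets(rels):
        c1, p1 = find_best_plan(s1)
        c2, p2 = find_best_plan(rels - s1)
        cost = c1 + c2 + result_size(rels)
        if cost < best_cost:
            best_cost, best_plan = cost, "(" + p1 + " join " + p2 + ")"
    return best_cost, best_plan

print(find_best_plan(frozenset(CARD)))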
Heuristic optimization
• Cost-based optimization is expensive, even with dynamic programming.
• Systems may use heuristics to reduce the number of choices that must be made in a cost-based fashion.
• Heuristic optimization transforms the query-tree by using a set of rules that typically (but not in all cases) improve execution performance:
o Perform selection early (reduces the number of tuples)
o Perform projection early (reduces the number of attributes)
o Perform most restrictive selection and join operations (i.e. with smallest result size) before other similar operations.
o Some systems use only heuristics; others combine heuristics with partial cost-based optimization.
• Heuristic optimization converts a declarative query to a canonical algebraic query tree that is then gradually transformed using certain rules.
• The main heuristic is to perform unary relational operations (selection and projection) before binary operations (joins, set theoretic), and aggregate functions with (or without) grouping.
Parse Tree
Suppose we have the following relations:
MovieStar (name, address, gender, birthdate)
StarsIn (title, year, starName)
Find the birthdate and movie title for those female stars who appeared in movies in 1996.
SQL:
SELECT title, birthdate
FROM MovieStar, StarsIn
WHERE year = 1996 AND gender = 'F' AND starName = name;
SQL:
SELECT name FROM MovieStar
UNION ALL
SELECT starName FROM StarsIn;
8. Filing and File Structure
1. We have been looking mostly at the higher-level models of a database. At the conceptual or logical level, the database was viewed as
o A collection of tables (relational model).
o A collection of classes of objects (object-oriented model).
2. The logical model is the correct level for database users to focus on. However, performance depends on the efficiency of the data structures used to represent data in the database, and on the efficiency of operations on these data structures.
Overview of Physical Storage Media
1. Several types of data storage exist in most computer systems. They vary in speed of access, cost per unit of data, and reliability.
o Cache: most costly and fastest form of storage. Usually very small, and managed by the operating system.
o Main Memory (MM): the storage area for data available to be operated on.
- General-purpose machine instructions operate on main memory.
- Contents of main memory are usually lost in a power failure or ``crash''.
- Usually too small (even with megabytes) and too expensive to store the entire database.
o Flash memory: EEPROM (electrically erasable programmable read-only memory).
- Data in flash memory survive power failure.
- Reading data from flash memory takes about 10 nano-secs (roughly as fast as from main memory), and writing data into flash memory is more complicated: a write-once takes about 4-10 microsecs.
- To overwrite what has been written, one has to first erase the entire bank of the memory. It may support only a limited number of erase cycles (10,000 to 1 million).
- It has found its popularity as a replacement for disks for storing small volumes of data (5-10 megabytes).
o Magnetic-disk storage: primary medium for long-term storage.
- Typically the entire database is stored on disk.
- Data must be moved from disk to main memory in order for the data to be operated on.
- After operations are performed, data must be copied back to disk if any changes were made.
- Disk storage is called direct access storage as it is possible to read data on the disk in any order (unlike sequential access).
- Disk storage usually survives power failures and system crashes.
o Optical storage: CD-ROM (compact-disk read-only memory), WORM (write-once read-many) disk (for archival storage of data), and juke box (containing a few drives and numerous disks loaded on demand).
o Tape Storage: used primarily for backup and archival data.
- Cheaper, but much slower access, since tape must be read sequentially from the beginning.
- Used as protection from disk failures.
2. The storage device hierarchy is presented in the figure (omitted here: ``Storage-device hierarchy''), where the higher levels are expensive (cost per bit) and fast (access time), but the capacity is smaller.
3. Another classification: primary, secondary, and tertiary storage.
o Primary storage: the fastest storage media, such as cache and main memory.
o Secondary (or on-line) storage: the next level of the hierarchy, e.g., magnetic disks.
o Tertiary (or off-line) storage: magnetic tapes and optical disk juke boxes.
o Volatility of storage: volatile storage loses its contents when the power is removed. Without power backup, data in the volatile storage (the part of the hierarchy from main memory up) must be written to nonvolatile storage for safekeeping.
Organization of records into blocks
A record is a collection of data items and is the unit for data storage at the logical or file level. The application program usually reads a complete record from the database. A record may consist of different fields and each field corresponds to an attribute of the record.
Name | Course | Section
Ram  | BE     | A
The block may be of fixed size, or may be variable in length. One block may contain multiple records. When the records are restricted to fit within blocks, such records are called unspanned, whereas, for spanned records, portions of a single record may lie in different blocks. Different methods of arranging records in blocks, also called blocking of records, are shown in the following figure.
(Figure omitted: records R1-R5 laid out across block boundaries. Fig: Variable blocking for spanned records)
1. Each file is partitioned into fixed-length storage units, called blocks, which are the units of both storage allocation and data transfer. 2. It is desirable to keep as many blocks as possible in main memory. Usually, we cannot keep all blocks in main memory, so we need to manage the allocation of available main memory space. 3. We need to use disk storage for the database, and to transfer blocks of data between main memory and disk. We also want to minimize the number of such transfers, as they are time-consuming. 4. The buffer is the part of main memory available for storage of copies of disk blocks.
Buffer manager
1. The subsystem responsible for the allocation of buffer space is called the buffer manager.
o The buffer manager handles all requests for blocks of the database.
o If the block is already in main memory, the address in main memory is given to the requester.
o If not, the buffer manager must read the block in from disk (possibly displacing some other block if the buffer is full) and then pass the address in main memory to the requester.
2. The buffer manager must use some sophisticated techniques in order to provide good service:
o Replacement Strategy: when there is no room left in the buffer, some block must be removed to make way for the new one. Typical operating system memory management schemes use a ``least recently used'' (LRU) method. (Simply remove the block least recently referenced.) This can be improved upon for database applications. (A small sketch of an LRU buffer follows.)
o Pinned Blocks: For the database to be able to recover from crashes, we need to restrict the times when a block may be written back to disk. A block not allowed to be written is said to be pinned. Many operating systems do not provide support for pinned blocks, and such a feature is essential if a database is to be ``crash resistant''.
o Forced Output of Blocks: Sometimes it is necessary to write a block back to disk even though its buffer space is not needed (called the forced output of a block). This is due to the fact that main memory contents (and thus the buffer) are lost in a crash, while disk data usually survives.
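A minimal LRU replacement sketch in Python (pinned blocks and forced output are omitted; read_from_disk is a stand-in callback):

from collections import OrderedDict

class BufferManager:
    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk
        self.pool = OrderedDict()           # insertion order tracks recency

    def get_block(self, block_id):
        if block_id in self.pool:           # hit: mark as most recently used
            self.pool.move_to_end(block_id)
            return self.pool[block_id]
        if len(self.pool) >= self.capacity:
            self.pool.popitem(last=False)   # evict the least recently used block
        self.pool[block_id] = self.read_from_disk(block_id)
        return self.pool[block_id]

buf = BufferManager(capacity=2, read_from_disk=lambda b: "<contents of " + b + ">")
buf.get_block("B1"); buf.get_block("B2"); buf.get_block("B1")
buf.get_block("B3")                         # evicts B2, the least recently used
print(list(buf.pool))                       # ['B1', 'B3']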
File Organization
A file is organized logically as a sequence of records. These records are mapped into disk blocks. The file organization techniques to be discussed in this chapter are as follows:
o Sequential file organization
o Indexed sequential file organization
Sequential File Organization In a sequential file records are placed sequentially onto the storage media. In addition, the records in the file are usually ordered according to the values of search key attributes (A search key is any attribute or set of attributes). To permit fast retrieval of records in search key order, we chain together records by pointers. The pointer in each record points to the next records in search key order. Furthermore, to minimize the number of block accesses in sequential file processing, we store records physically in search key order, or as close to search key order as possible.
A101 | Bkt | 600
A208 | Bkt | 500
A104 | Ktm | 200
A201 | Ltp | 800
A301 | Ltp | 300
1. A sequential file is designed for efficient processing of records in sorted order on some search key.
o Records are chained together by pointers to permit fast retrieval in search key order.
o Pointer points to next record in order.
o Records are stored physically in search key order (or as close to this as possible).
o This minimizes the number of block accesses.
2. It is difficult to maintain physical sequential order as records are inserted and deleted.
o Deletion can be managed with the pointer chains.
o Insertion poses problems if there is no space where the new record should go.
o If there is space, use it, else put the new record in an overflow block.
o Adjust pointers accordingly.
o Problem: we now have some records out of physical sequential order.
o If very few records are in overflow blocks, this will work well.
o If order is lost, reorganize the file.
o Reorganizations are expensive and done when system load is low.
3. If insertions rarely occur, we could keep the file in physically sorted order and reorganize when insertion occurs. In this case, the pointer fields are no longer required.
A101 | Bkt   | 600
A208 | Bkt   | 500
A104 | Ktm   | 200
A201 | Ltp   | 800
A301 | Ltp   | 300
A286 | Kavre | 400
Sequential file after an insertion
The indexed sequential file organization
The indexed sequential method of file organization attempts to reduce the access problem inherent in the sequential file, without losing the benefits of a sequential file organization. This type of organization enables the user to process records either in a serial or in a random manner. In a batched mode, the sequential processing of records (such as retrieval of successor records) can be performed just as
efficiently as in sequential file organization. The indexed sequential file, as the name suggests, is provided with indexes associated with the file. The indexes can be used to quickly locate any given record for random processing. Another important feature associated with indexed sequential file is the use of an overflow area to handle insertions to the file.
B+-Tree Index Files
1. The primary disadvantage of index-sequential file organization is that performance degrades as the file grows. This can be remedied by costly re-organizations.
2. The B+-tree file structure maintains its efficiency despite frequent insertions and deletions. It imposes some acceptable update and space overheads.
3. A B+-tree index is a balanced tree in which every path from the root to a leaf is of the same length.
4. Each nonleaf node in the tree must have between ⌈n/2⌉ and n children, where n is fixed for a particular tree.
Structure of a B+-Tree
1. A B+-tree index is a multilevel index but is structured differently from that of multi-level index-sequential files.
2. A typical node contains up to n-1 search key values K1, K2, ..., Kn-1, and n pointers P1, P2, ..., Pn. The search key values in a node are kept in sorted order.
(Figure: typical node of a B+-tree.)
3. For leaf nodes, Pi (i = 1, 2, ..., n-1) points to either a file record with search key value Ki, or a bucket of pointers to records with that search key value. The bucket structure is used if the search key is not a primary key, and the file is not sorted in search key order. Pointer Pn (the nth pointer in the leaf node) is used to chain leaf nodes together in linear order (search key order). This allows efficient sequential processing of the file.
4. Non-leaf nodes form a multilevel index on the leaf nodes. A non-leaf node may hold up to n pointers and must hold at least ⌈n/2⌉ pointers. The number of pointers in a node is called the fan-out of the node.
Consider a node containing m pointers. Pointer Pi (i = 2, ..., m-1) points to a subtree containing search key values ≥ Ki-1 and < Ki. Pointer Pm points to a subtree containing search key values ≥ Km-1. Pointer P1 points to a subtree containing search key values < K1.
5. The figure shows a B+-tree for the deposit file with n = 3.
(Figure: B+-tree for deposit file with n = 3.)
Queries on B+-Trees
1. Suppose we want to find all records with a search key value of k.
o Examine the root node and find the smallest search key value Ki > k.
o Follow pointer Pi to another node.
o If k < K1, follow pointer P1.
o Otherwise, find the appropriate pointer to follow.
o Continue down through non-leaf nodes, looking for the smallest search key value > k and following the corresponding pointer.
o Eventually we arrive at a leaf node, where the pointer will point to the desired record or bucket.
2. In processing a query, we traverse a path from the root to a leaf node. If there are K search key values in the file, this path is no longer than ⌈log⌈n/2⌉(K)⌉. This means that the path is not long, even in large files. For a 4 KB disk block with a search-key size of 12 bytes and a disk pointer of 8 bytes, n is around 200. Even with a conservative n = 100, a look-up over 1 million search-key values takes ⌈log50(1,000,000)⌉ = 4 nodes to be accessed. Since the root is usually in the buffer, typically it takes only 3 or fewer disk reads.
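The path-length bound is easy to check numerically (values from the example above; the helper name is ours):

from math import ceil, log

def btree_path_bound(num_keys, n):
    # worst-case nodes on a root-to-leaf path: ceil(log base ceil(n/2) of K)
    return ceil(log(num_keys, ceil(n / 2)))

print(btree_path_bound(1_000_000, 100))   # 4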
Updates on B+-Trees
1. Insertions and Deletions: Insertion and deletion are more complicated, as they may require splitting or combining nodes to keep the tree balanced. If splitting or combining are not required, insertion works as follows:
o Find the leaf node where the search key value should appear.
o If the value is present, add the new record to the bucket.
o If the value is not present, insert the value in the leaf node (so that search keys are still in order). Create a new bucket and insert the new record.
If splitting or combining are not required, deletion works as follows:
o Deletion: Find the record to be deleted, and remove it from the bucket.
o If the bucket is now empty, remove the search key value from the leaf node.
2. Insertions Causing Splitting: When insertion causes a leaf node to be too large, we split that node. Let's assume we wish to insert a record with a bname value of ``Clearview''.
o There is no room for it in the leaf node where it should appear.
o We now have n values (the n-1 search key values plus the new one we wish to insert).
o We put the first ⌈n/2⌉ values in the existing node, and the remainder into a new node.
o The new node must be inserted into the B+-tree.
o We also need to update search key values for the parent (or higher) nodes of the split leaf node. (Except if the new node is the leftmost one.)
o Order must be preserved among the search key values in each node.
o If the parent was already full, it will have to be split.
o When a non-leaf node is split, the children are divided among the two new nodes.
o In the worst case, splits may be required all the way up to the root. (If the root is split, the tree becomes one level deeper.)
3. To summarize:
o Insertion and deletion are complicated, but require relatively few operations.
o The number of operations required for insertion and deletion is proportional to the logarithm of the number of search keys.
o B+-trees are fast as index structures for databases.
B-Tree Index Files
1. B-tree indices are similar to B+-tree indices.
o The difference is that a B-tree eliminates the redundant storage of search key values.
o A B-tree corresponding to a B+-tree allows search key values to appear only once.
o Thus we can store the index in less space.
(Figure: leaf and nonleaf node of a B-tree.)
2. Advantages:
o Lack of redundant storage (but only marginally different).
o Some searches are faster (the key may be in a non-leaf node).
3. Disadvantages:
o Leaf and non-leaf nodes are of different size (complicates storage).
o Deletion may occur in a non-leaf node (more complicated).
Generally, the structural simplicity of the B+-tree is preferred.
Hashing and Hash Functions
1. Index schemes force us to traverse an index structure. Hashing avoids this.
Hash File Organization
1. Hashing involves computing the address of a data item by computing a function on the search key value.
2. A hash function h is a function from the set of all search key values K to the set of all bucket addresses B.
o We choose a number of buckets to correspond to the number of search key values we will have stored in the database.
o To perform a lookup on a search key value Ki, we compute h(Ki), and search the bucket with that address.
o If two search keys i and j map to the same address, because h(Ki) = h(Kj), then the bucket at the address obtained will contain records with both search key values.
o In this case we will have to check the search key value of every record in the bucket to get the ones we want.
o Insertion and deletion are simple.
Hash Functions
1. A good hash function gives an average-case lookup that is a small constant, independent of the number of search keys.
2. We hope records are distributed uniformly among the buckets.
3. The worst hash function maps all keys to the same bucket.
4. The best hash function maps all keys to distinct addresses.
5. Ideally, the distribution of keys to addresses is uniform and random.
6. Suppose we have 26 buckets, and map names beginning with the ith letter of the alphabet to the ith bucket.
o Problem: this does not give a uniform distribution.
o Many more names will be mapped to ``A'' than to ``X''.
o Typical hash functions perform some operation on the internal binary machine representations of characters in a key.
o For example, compute the sum, modulo the number of buckets, of the binary representations of the characters of the search key.
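A direct rendering of that last hash function in Python - the sum of character codes modulo the bucket count; the bucket count and sample records are illustrative:

NUM_BUCKETS = 26
buckets = [[] for _ in range(NUM_BUCKETS)]

def h(search_key):
    # sum of character codes, modulo the number of buckets
    return sum(ord(ch) for ch in search_key) % NUM_BUCKETS

def insert(record, search_key):
    buckets[h(search_key)].append(record)

def lookup(search_key):
    # every record in the bucket must still be checked (collisions are possible)
    return [r for r in buckets[h(search_key)] if r[0] == search_key]

insert(("Perryridge", 1700), "Perryridge")
insert(("Mianus", 700), "Mianus")
print(lookup("Perryridge"))   # [('Perryridge', 1700)]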
Handling of bucket overflows
1. Open hashing occurs where records are stored in different buckets. Compute the hash function and search the corresponding bucket to find a record.
2. Closed hashing occurs where all records are stored in one bucket. The hash function computes addresses within that bucket. (Deletions are difficult.) Not used much in database applications.
3. Drawback to our approach: the hash function must be chosen at implementation time.
o The number of buckets is fixed, but the database may grow.
o If the number is too large, we waste space.
o If the number is too small, we get too many ``collisions'', resulting in records of many search key values being in the same bucket.
o Choosing the number to be twice the number of search key values in the file gives a good space/performance tradeoff.
Hash Indices
1. A hash index organizes the search keys, with their associated pointers, into a hash file structure.
2. We apply a hash function on a search key to identify a bucket, and store the key and its associated pointers in the bucket (or in overflow buckets).
3. Strictly speaking, hash indices are only secondary index structures, since if a file itself is organized using hashing, there is no need for a separate hash index structure on it.
Dynamic Hashing
1. As the database grows over time, we have three options:
o Choose the hash function based on the current file size: performance degrades as the file grows.
o Choose the hash function based on the anticipated file size: space is wasted initially.
o Periodically reorganize the hash structure as the file grows: this requires selecting a new hash function, recomputing all addresses, and generating new bucket assignments. It is costly, and it shuts down the database.
2. Some hashing techniques allow the hash function to be modified dynamically to accommodate growth or shrinking of the database. These are called dynamic hash functions.
o Extendable hashing is one form of dynamic hashing.
o Extendable hashing splits and coalesces buckets as the database size changes.
o This imposes some performance overhead, but space efficiency is maintained.
o As reorganization works on one bucket at a time, the overhead is acceptably low.
3. Advantages:
o Extendable hashing provides performance that does not degrade as the file grows.
o Minimal space overhead: no buckets need be reserved for future use. The bucket address table contains only one pointer for each hash value of the current prefix length.
4. Disadvantages:
o An extra level of indirection in the bucket address table.
o Added complexity.
5. Summary: a highly attractive technique, provided we accept the added complexity. A sketch follows below.
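The following is a minimal, illustrative sketch of extendable hashing, assuming a tiny bucket capacity and a directory indexed by the low-order bits of the hash value. The class and method names are invented, and a real structure would store buckets on disk blocks rather than in memory:

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = {}                    # search-key value -> record

class ExtendableHashTable:
    def __init__(self, bucket_capacity=2):
        self.global_depth = 1
        self.capacity = bucket_capacity
        self.directory = [Bucket(1), Bucket(1)]

    def _slot(self, key):
        # use the low-order global_depth bits of the hash value
        return hash(key) & ((1 << self.global_depth) - 1)

    def lookup(self, key):
        return self.directory[self._slot(key)].items.get(key)

    def insert(self, key, record):
        bucket = self.directory[self._slot(key)]
        if key in bucket.items or len(bucket.items) < self.capacity:
            bucket.items[key] = record
            return
        # overflow: split only this bucket
        if bucket.local_depth == self.global_depth:
            # double the directory; paired slots share the old buckets
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        distinguishing_bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and (i & distinguishing_bit):
                self.directory[i] = sibling
        # rehash the overflowing bucket's entries, then retry the insert
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            self.directory[self._slot(k)].items[k] = v
        self.insert(key, record)

table = ExtendableHashTable()
for sid in range(10):
    table.insert(sid, {"name": f"record-{sid}"})
print(table.global_depth, table.lookup(7))

Only the overflowing bucket is split on each insertion, which is why the reorganization overhead stays acceptably low.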
Data Dictionary Storage
1. The database also needs to store information about the relations, known as the data dictionary. This includes:
o Names of relations.
o Names of attributes of relations.
o Domains and lengths of attributes.
o Names and definitions of views.
o Integrity constraints (e.g., key constraints).
plus data on the system users:
o Names of authorized users.
o Accounting information about users.
plus (possibly) statistical and descriptive data:
o Number of tuples in each relation.
o Method of storage used for each relation (e.g., clustered or non-clustered).
2. We also need to store information about each index on each relation:
o Name of the index.
o Name of the relation being indexed.
o Attributes the index is on.
o Type of index.
3. This information is, in itself, a miniature database. We can use the database to store data about itself, simplifying the overall structure of the system and allowing the full power of the database to be used for fast access to system data.
4. The exact choice of how to represent system data using relations must be made by the system designer. One possible representation follows (a toy illustration in code appears after it):
System-catalog-schema = (relation-name, number-attrs)
Attr-schema = (attr-name, rel-name, domain-type, position, length)
User-schema = (user-name, encrypted-password, group)
Index-schema = (index-name, rel-name, index-type, index-attr)
View-schema = (view-name, definition)
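As a toy illustration of point 3 (the catalog is itself a miniature database that can describe itself), the schemas above can be populated as ordinary relations and queried like any other data. The Python representation below is an assumption for illustration, using the deposit relation from a later example:

system_catalog = [
    # (relation-name, number-attrs)
    ("deposit", 4),
    ("System-catalog-schema", 2),   # the catalog describes itself, too
]

attr_catalog = [
    # (attr-name, rel-name, domain-type, position, length)
    ("bname",    "deposit", "string",  1, 20),
    ("account#", "deposit", "string",  2, 10),
    ("cname",    "deposit", "string",  3, 20),
    ("balance",  "deposit", "integer", 4,  4),
]

def attributes_of(rel_name):
    # an ordinary query over the catalog answers a metadata question
    return [a for a, r, *_ in attr_catalog if r == rel_name]

print(attributes_of("deposit"))   # ['bname', 'account#', 'cname', 'balance']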
Chapter 9: Concurrency Control
Transaction
Collections of operations that form a single logical unit of work are called transactions.
A database system must ensure proper execution of transactions despite failures: either the entire transaction executes, or none of it does.
A transaction is a unit of program execution that accesses and possibly updates various data items.
To ensure integrity of data, we require that the database system maintain the following properties of the transactions:
Atomicity: Either all operations of the transaction are reflected properly in the database, or none are.
Consistency: Execution of a transaction in isolation (i.e. with no other transaction executing concurrently) preserves the consistency of the database.
Isolation: Even though multiple transactions may execute concurrently, the system guarantees that, for every pair of transactions Ti and Tj , it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished. Thus, each transaction is unaware of other transactions executing concurrently in the system.
Durability: After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.
Transaction State
A transaction must be in one of the following states:
Active: the initial state; the transaction stays in this state while it is executing.
Partially committed: after the final statement has been executed.
Failed: after the discovery that normal execution can no longer proceed.
Aborted: after the transaction has been rolled back and the database has been restored to its state prior to the start of the transaction.
Committed: after successful completion.
[State diagram of a transaction: active → partially committed → committed; active → failed; partially committed → failed; failed → aborted.]
Scheduling and Serializability
The execution sequences of transactions are called schedules. They represent the chronological order in which instructions are executed in the system.
A schedule must preserve the order in which the instructions appear in each individual transaction.
The way in which transactions are scheduled such that they have the same effect as if they were carried out serially (i.e., one by one) is known as serializability. In other words, serializability ensures the equivalence of a schedule with a serial schedule.
Conflict Serializability
We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule.
Let us consider a schedule S in which there are two consecutive instructions Ii and Ij, of transactions Ti and Tj respectively. If Ii and Ij refer to different data items, then we can swap Ii and Ij without affecting the results of any instruction in the schedule. However, if Ii and Ij refer to the same data item Q, then the order of the two steps may matter. Considering only read and write operations, there are four cases (a small conflict test in code follows the list of cases):
1. Ii = read(Q), Ij = read(Q). The order of Ii and Ij does not matter, since the same value of Q is read by Ti and Tj regardless of the order.
2. Ii = read(Q), Ij = write(Q). If Ii comes before Ij, then Ti does not read the value of Q that is written by Tj in instruction Ij. If Ij comes before Ii, then Ti reads the value of Q that is written by Tj. Thus, the order of Ii and Ij matters.
3. Ii = write(Q), Ij = read(Q). The order of Ii and Ij matters for reasons similar to the previous case.
4. Ii = write(Q), Ij = write(Q). Since both are write operations, the order of these instructions does not affect either Ti or Tj. However, the value obtained by the next read(Q) instruction of S is affected, since only the result of the latter of the two write instructions is preserved in the database. If there is no other write(Q) instruction after Ii and Ij in S, then the order of Ii and Ij directly affects the final value of Q in the database state that results from schedule S.
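The swap test in the four cases above amounts to a simple check: two operations conflict iff they access the same data item and at least one of them is a write. A minimal sketch in Python (the operation encoding is an assumption for illustration; operations of the same transaction are never reordered in any case):

def conflicts(op_i, op_j):
    """Each operation is an (action, item) pair, e.g. ('write', 'Q')."""
    act_i, item_i = op_i
    act_j, item_j = op_j
    if item_i != item_j:
        return False               # different data items: always swappable
    return act_i == "write" or act_j == "write"

print(conflicts(("read", "Q"), ("read", "Q")))    # False (case 1)
print(conflicts(("read", "Q"), ("write", "Q")))   # True  (cases 2 and 3)
print(conflicts(("write", "Q"), ("write", "Q")))  # True  (case 4)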
Schedule 1
T1                      T2
Read(A)
Write(A)
                        Read(A)
                        Write(A)
Read(B)
Write(B)
                        Read(B)
                        Write(B)

Schedule 2 (with the non-conflicting instructions Write(A) of T2 and Read(B) of T1 swapped)
T1                      T2
Read(A)
Write(A)
                        Read(A)
Read(B)
                        Write(A)
Write(B)
                        Read(B)
                        Write(B)

Schedule 3, with the following swaps of non-conflicting instructions:
Read(B) of T1 with Read(A) of T2
Write(B) of T1 with Write(A) of T2
Write(B) of T1 with Read(A) of T2
T1                      T2
Read(A)
Write(A)
Read(B)
Write(B)
                        Read(A)
                        Write(A)
                        Read(B)
                        Write(B)

Schedule 3 is a serial schedule. Thus, we have shown that Schedule 1 is equivalent to a serial schedule.

Consider Schedule 4
T3                      T4
Read(Q)
                        Write(Q)
Write(Q)

Schedule 4 is not conflict serializable, since it is equivalent to neither the serial schedule <T3,T4> nor <T4,T3>.

View Serializability
View serializability is a less stringent form of equivalence than conflict serializability. Consider schedules S and S', where the same set of transactions participates in both schedules. The schedules S and S' are view equivalent if three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S', also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and if that value was produced by a write(Q) operation executed by transaction Tj, then the read(Q) operation of transaction Ti must, in schedule S', also read the value of Q that was produced by the same write(Q) operation of transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S'.
Conditions 1 and 2 ensure that each transaction reads the same values in both schedules and, therefore, performs the same computation. Condition 3, coupled with conditions 1 and 2, ensures that both schedules result in the same final state.

Consider Schedule 5
T3                      T4                      T5
Read(Q)
                        Write(Q)
Write(Q)
                                                Write(Q)

Schedule 5 is view serializable, as it is view equivalent to the serial schedule <T3, T4, T5>: the single read(Q) instruction reads the initial value of Q in both schedules, and T5 performs the final write of Q in both schedules. Schedule 5 is not conflict serializable, since every pair of consecutive instructions conflicts, and thus no swapping of instructions is possible. Hence every conflict-serializable schedule is also view-serializable, but there are view-serializable schedules that are not conflict serializable.

Concurrency Control
One of the fundamental properties of a transaction is isolation. When several transactions execute concurrently in the database, however, the isolation property may no longer be preserved. To ensure that it is preserved, the system must control the interaction among the concurrent transactions; this control is achieved through one of a variety of mechanisms called concurrency control schemes.
Why Concurrency Control is needed:
The Lost Update Problem
This occurs when two transactions that access the same database items have their operations interleaved in a way that makes the value of some database item incorrect.
The Temporary Update (or Dirty Read) Problem
This occurs when one transaction updates a database item and then the transaction fails for some reason. The updated item is accessed by another transaction before it is changed back to its original value.
The Incorrect Summary Problem
If one transaction is calculating an aggregate summary function on a number of records while other transactions are updating some of these records, the aggregate function may calculate some values before they are updated and others after they are updated.
Concurrency Control Mechanisms
The different types of concurrency control mechanisms are as follows:
1. Locking: Locking is the process of preventing multiple transactions from accessing a data item concurrently.
2. Timestamp: A timestamp is a unique identifier for each transaction, generated by the system.
3. Multiversion: Multiversion concurrency control keeps the old values of a data item when the data item is updated.
4. Validation: In the validation technique, no checking is done while the transaction executes; the transaction is validated at the end.
5. Granularity: Granularity is the process of determining what portion of the database a data item represents.
Multiple Granularity
In all the concurrency control schemes above, we have used each individual data item as the unit on which synchronization is performed. However, it would be advantageous to group several data items and treat them as one unit. For example, if a transaction Ti needs to access the entire database under a locking protocol, Ti must lock each item in the database, which is time consuming. It would be better if Ti could issue a single lock request to lock the entire database. On the other hand, if transaction Ti needs to access only a few data items, it should not be required to lock the entire database.
A data item can be one of the following:
1. A database record
2. A field value of a database record
3. A disk block
4. A whole file
5. The whole database
The size of a data item is often called the data item granularity. Fine granularity refers to small item sizes, whereas coarse granularity refers to large item sizes. The best item size depends on the types of transactions. A hierarchy of data granularities, in which small granularities are nested within larger ones, can be represented graphically as a tree. In the tree, each node represents an independent data item, and a non-leaf node of the multiple granularity tree represents the data associated with its descendants.
[Granularity hierarchy tree: Level 0 = DB (the entire database), Level 1 = Files, Level 2 = Pages, Level 3 = Records, Level 4 = Fields.]
The highest level represents the entire database, then files, pages, records and fields. With shared and exclusive locks, when a transaction locks a node, all the descendants of that node are implicitly locked in the same lock mode. To make multiple granularity level locking practical, additional types of locks
called intention locks are needed. The idea behind intention locks is for a transaction to indicate, along the path from the root to the desired node, what type of lock it will require on one of the node's descendants. There are three types of intention locks:
1. Intention Shared (IS): indicates that a shared lock will be requested on some descendant node.
2. Intention Exclusive (IX): indicates that an exclusive lock will be requested on some descendant node.
3. Shared Intention Exclusive (SIX): indicates that the current node is locked in shared mode, but an exclusive lock will be requested on some descendant node.
A sketch of the standard lock compatibility check follows below.
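The compatibility between these intention modes and the S/X locks is usually summarized in a lock compatibility matrix. A sketch follows (the matrix is the classical one; the helper function and its names are invented for illustration):

COMPATIBLE = {
    #             IS           IX           S            SIX           X
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

def can_grant(requested, held_modes):
    # a lock can be granted on a node only if it is compatible with
    # every lock currently held on that node by other transactions
    return all(COMPATIBLE[requested][h] for h in held_modes)

print(can_grant("IX", ["IS"]))   # True: intention locks coexist
print(can_grant("X",  ["IS"]))   # False: exclusive conflicts with everything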
Lock Based Protocol
A lock is a mechanism that tells the DBMS whether a particular data item is being used by any transaction for read/write purposes. Since there are two types of operations, read and write, whose basic natures are different, the locks for read and write operations may behave differently.
Read operations performed by different transactions on the same data item pose less of a challenge: the value of the data item, if constant, can be read by any number of transactions at any given time.
Write operations are different. When a transaction writes some value into a data item, the content of that data item remains in an inconsistent state, starting from the moment the writing operation begins up to the moment it is over. If we allow any other transaction to read or write the value of the data item during the write operation, those transactions will read an inconsistent value or overwrite the value being written by the first transaction. In both cases, anomalies will creep into the database.
The simple rule for locking can be derived from here. If a transaction is reading the content of a sharable data item, then any number of other processes can be allowed to read the content of the same data item. But if any transaction is writing into a sharable data item, then no other transaction will be allowed to read or write that same data item.
Depending upon the rules we have found, we can classify the locks into two types.
Shared Lock: A transaction may acquire a shared lock on a data item in order to read its content. The lock is shared in the sense that any other transaction can acquire the shared lock on that same data item for reading purposes.
Exclusive Lock: A transaction may acquire an exclusive lock on a data item in order to both
read and write it. The lock is exclusive in the sense that no other transaction can acquire any kind of lock (either shared or exclusive) on that same data item. The relationship between shared and exclusive locks can be represented by the following table, known as the Lock Matrix (TRUE means the requested lock can be granted):

Lock to be granted      Lock already existing
                        Shared      Exclusive
Shared                  TRUE        FALSE
Exclusive               FALSE       FALSE
How Should Locks be Used?
In a transaction, a data item which we want to read or write should first be locked before the read/write is done. After the operation is over, the transaction should unlock the data item so that other transactions can lock it for their respective use. Suppose a transaction is to transfer Rs 100 from account A to account B. The transaction can be written as follows:

Lock-X (A);    (Exclusive lock: we want to both read A's value and modify it)
Read A;
A = A - 100;
Write A;
Unlock (A);    (Unlocking A after the modification is done)
Lock-X (B);    (Exclusive lock: we want to both read B's value and modify it)
Read B;
B = B + 100;
Write B;
Unlock (B);    (Unlocking B after the modification is done)

And the transaction that deposits 10% of the amount of account A to account C can be written as:

Lock-S (A);    (Shared lock: we only want to read A's value)
Read A;
Temp = A * 0.1;
Unlock (A);    (Unlocking A)
Lock-X (C);    (Exclusive lock: we want to both read C's value and modify it)
Read C;
C = C + Temp;
Write C;
Unlock (C);    (Unlocking C after the modification is done)
Let us see how these locking mechanisms help us to create error-free schedules. Recall the earlier example of an erroneous schedule:

T1                      T2
Read A;
A = A - 100;
                        Read A;
                        Temp = A * 0.1;
                        Read C;
                        C = C + Temp;
                        Write C;
Write A;
Read B;
B = B + 100;
Write B;
We detected the error based only on common sense: the context switch is performed before the new value has been written to A, so T2 reads the old value of A and thus deposits a wrong amount in C. Had we used the locking mechanism, this error could never have occurred. Let us rewrite the schedule using locks.

T1                      T2
Lock-X (A)
Read A;
A = A - 100;
                        Lock-S (A)
                        Read A;
                        Temp = A * 0.1;
                        Unlock (A)
                        Lock-X (C)
                        Read C;
                        C = C + Temp;
                        Write C;
                        Unlock (C)
Write A;
Unlock (A)
Lock-X (B)
Read B;
B = B + 100;
Write B;
Unlock (B)
In fact, we cannot produce a schedule like the one above if locks are used in the transactions. Consider the first statement in T2, which attempts to acquire a shared lock on A. This is impossible because T1 has not yet released its exclusive lock on A, so T2 cannot get the shared lock it wants. T2 must wait until T1 releases the exclusive lock on A, and can begin its execution only after that. So the proper schedule would look like the following:

T1                      T2
Lock-X (A)
Read A;
A = A - 100;
Write A;
Unlock (A)
                        Lock-S (A)
                        Read A;
                        Temp = A * 0.1;
                        Unlock (A)
                        Lock-X (C)
                        Read C;
                        C = C + Temp;
                        Write C;
                        Unlock (C)
Lock-X (B)
Read B;
B = B + 100;
Write B;
Unlock (B)
And this automatically becomes a correct schedule: we need not apply any manual effort to detect or correct the errors that may creep into a schedule when locks are not used.
Two Phase Locking Protocol
The use of locks has helped us to create neat and clean concurrent schedules. The Two Phase Locking Protocol defines the rules of how to acquire the locks on a data item and how to release the locks. The protocol assumes that a transaction can only be in one of two phases.
Growing Phase: In this phase the transaction can only acquire locks, but cannot release any lock. The transaction enters the growing phase as soon as it acquires the first lock it wants. From then on it has no option but to keep acquiring all the locks it will need. It cannot release any lock in this phase, even if it has finished working with a locked data item. Ultimately the transaction reaches a point where all the locks it may need have been acquired. This point is called the Lock Point.
Shrinking Phase: After the Lock Point has been reached, the transaction enters the shrinking phase. In this phase the transaction can only release locks, but cannot acquire any new lock. The transaction enters the shrinking phase as soon as it releases the first lock after crossing the Lock Point. From then on it has no option but to keep releasing all the acquired locks. (A sketch that enforces the two-phase rule follows below.)
There are two different versions of the Two Phase Locking Protocol: the Strict Two Phase Locking Protocol and the Rigorous Two Phase Locking Protocol.
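A minimal sketch of how the two-phase rule could be enforced in code (class and method names are invented; a real lock manager would also handle shared/exclusive modes and waiting):

class TwoPhaseTransaction:
    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False     # becomes True after the first unlock

    def lock(self, item):
        if self.shrinking:
            # acquiring a lock after releasing one violates two-phase locking
            raise RuntimeError(f"{self.name}: cannot lock {item} in shrinking phase")
        self.held.add(item)

    def unlock(self, item):
        self.held.remove(item)
        self.shrinking = True      # the Lock Point has been passed

t = TwoPhaseTransaction("T1")
t.lock("A"); t.lock("B")           # growing phase
t.unlock("A")                      # shrinking phase begins
try:
    t.lock("C")                    # violates 2PL: lock after an unlock
except RuntimeError as e:
    print(e)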
Strict Two Phase Locking Protocol
In this protocol, a transaction may release all the shared locks after the Lock Point has been reached, but it cannot release any of the exclusive locks until the transaction commits. This protocol helps in creating cascadeless schedules.
A cascading schedule is a typical problem faced while creating concurrent schedules. Consider the
following schedule once again.

T1                      T2
Lock-X (A)
Read A;
A = A - 100;
Write A;
Unlock (A)
                        Lock-S (A)
                        Read A;
                        Temp = A * 0.1;
                        Unlock (A)
                        Lock-X (C)
                        Read C;
                        C = C + Temp;
                        Write C;
                        Unlock (C)
Lock-X (B)
Read B;
B = B + 100;
Write B;
Unlock (B)
The schedule is theoretically correct, but a strange kind of problem may arise here. T1 releases the exclusive lock on A, and immediately afterwards the context switch is made. T2 acquires a shared lock on A to read its value, performs a calculation, updates the content of account C and then issues COMMIT. However, T1 is not finished yet. What if the remaining portion of T1 encounters a problem (power failure, disk failure etc.) and cannot be committed? In that case T1 must be rolled back and the old BFIM (before image) value of A restored. T2, which has read the updated (but uncommitted) value of A and calculated the value of C based on it, must then also be rolled back. We have to roll back T2 through no fault of T2 itself, but because we proceeded with T2 depending on a value that had not yet been committed. This phenomenon of rolling back a child transaction when the parent transaction is rolled back is called Cascading Rollback, and it causes a tremendous loss of processing power and execution time.
Using the Strict Two Phase Locking Protocol, cascading rollback can be prevented: a transaction cannot release any of its acquired exclusive locks until it commits. In such a case, T1 would not release the exclusive lock on A until it finally commits, which makes it impossible for T2 to acquire the shared lock on A at a time when A's value has not been committed. This makes cascading schedules impossible.
Rigorous Two Phase Locking Protocol
In the Rigorous Two Phase Locking Protocol, a transaction is not allowed to release any lock (either shared or exclusive) until it commits. This means that until the transaction commits, other transactions may acquire a shared lock on a data item on which the uncommitted transaction has a shared lock, but cannot acquire any lock on a data item on which the uncommitted transaction has an exclusive lock.
Timestamp Ordering Protocol
A timestamp is a tag that can be attached to any transaction or any data item, denoting a specific time at which the transaction or data item was activated in some way.
The scheduler assigns each transaction T a unique number, its timestamp TS(T).
Each transaction is issued a timestamp when it enters the system.
If an old transaction T1 has timestamp TS(T1), a new transaction T2 is assigned timestamp TS (T2) such that TS(T1) < TS(T2).
A timestamp can be implemented in two ways.
The simplest one is to directly assign the current value of the clock to the transaction or the data item.
The other policy is to attach the value of a logical counter that keeps incrementing as new timestamps are required.
The timestamp of a transaction denotes the time when it was first activated. The timestamp of a data item can be of the following two types:
W-timestamp(Q): the latest time at which the data item Q was written.
R-timestamp(Q): the latest time at which the data item Q was read.
These two timestamps are updated each time a successful read/write operation is performed on the
data item Q.
How should timestamps be used? The timestamp ordering protocol ensures that any pair of conflicting read/write operations will be executed in their respective timestamp order. This is an alternative solution to using locks.
Rules:
1. ri(X) is allowed if TS(Ti) >= WT(X); after the read, RT(X) becomes max(RT(X), TS(Ti)).
2. wi(X) is allowed if TS(Ti) >= WT(X) and TS(Ti) >= RT(X); after the write, WT(X) becomes TS(Ti).

Example (we assume that at the beginning, each of the database elements has both a read and a write time of 0):

T1 (TS=200)    T2 (TS=150)    T3 (TS=175)    A              B              C
                                             RT=0 WT=0      RT=0 WT=0      RT=0 WT=0
Read(B)                                                     RT=200
               Read(A)                       RT=150
                              Read(C)                                      RT=175
Write(B)                                                    WT=200
Write(A)                                     WT=200
               Write(C)
               (aborted)
                              Write(A)

T2 tries to write C. However, C was already read by transaction T3, which theoretically executed at time 175, while T2 would have written its value at time 150. Thus T2 is trying to do something that would result in physically unrealizable behavior, and it must be rolled back. (These checks are sketched in code below.)
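A sketch of the two rules applied to the example above (the data structures and names are assumptions; an abort is simplified to an exception):

read_ts = {"A": 0, "B": 0, "C": 0}    # RT(X) for each element
write_ts = {"A": 0, "B": 0, "C": 0}   # WT(X) for each element

def read(ts, x):
    if ts < write_ts[x]:
        raise RuntimeError(f"abort: T with TS={ts} reads {x} written at {write_ts[x]}")
    read_ts[x] = max(read_ts[x], ts)

def write(ts, x):
    if ts < write_ts[x] or ts < read_ts[x]:
        raise RuntimeError(f"abort: T with TS={ts} writes {x} (RT={read_ts[x]}, WT={write_ts[x]})")
    write_ts[x] = ts

read(200, "B"); read(150, "A"); read(175, "C")
write(200, "B"); write(200, "A")
try:
    write(150, "C")    # C was read at time 175 > 150, so T2 must abort
except RuntimeError as e:
    print(e)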
Deadlock Handling
A system is in a deadlock state if there exists a set of transactions such that every transaction in the set is waiting for another transaction in the set. More precisely, there exists a set of waiting transactions {T0, T1, ..., Tn} such that T0 is waiting for a data item that T1 holds, T1 is waiting for a data item that T2 holds, and so on, and Tn is waiting for a data item that T0 holds. None of the transactions can make progress in such a situation. The two principal methods for dealing with the deadlock problem are to use a deadlock prevention protocol, or to use a deadlock detection and recovery scheme.
Deadlock Prevention
There are two approaches to deadlock prevention. One approach ensures that no cyclic wait can occur by ordering the requests for locks, or by requiring all locks to be acquired together. The other approach performs transaction rollback instead of waiting for a lock whenever the wait could potentially result in a deadlock.
The simplest scheme under the first approach requires that each transaction lock all its data items before it begins execution: either all are locked in one step or none are locked. Another approach to preventing deadlocks is to impose an ordering on all data items, and to require that a transaction lock data items only in a sequence consistent with the ordering.
Deadlock Detection and Recovery
To detect deadlocks and recover from them, a system must:
Maintain information about the current allocation of data items to transactions, as well as any outstanding data item requests.
Provide an algorithm that uses this information to determine whether a system has entered a deadlock state.
Recover from the deadlock when the detection algorithm determines that a deadlock exists.
Deadlock Detection
Deadlocks can be described precisely in terms of a directed graph called a wait-for graph. This graph consists of a pair G = (V, E), where V is a set of vertices and E is a set of edges. The set of vertices consists of all the transactions in the system. Each element of the set E of edges is an ordered pair Ti → Tj. If Ti → Tj is in E, then there is a directed edge from transaction Ti to Tj, implying that transaction Ti is waiting for transaction Tj to release a data item that it needs.
When transaction Ti requests a data item currently being held by transaction Tj, the edge Ti → Tj is inserted into the wait-for graph. A deadlock exists if and only if the wait-for graph contains a cycle. To detect deadlocks, the system needs to maintain the wait-for graph and periodically invoke an algorithm that searches for a cycle in the graph (sketched below).
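A sketch of cycle detection on a wait-for graph using depth-first search (the adjacency-map representation is an assumption for illustration):

def has_cycle(wait_for):
    """wait_for maps each transaction to the transactions it waits on."""
    visited, on_stack = set(), set()

    def dfs(t):
        visited.add(t)
        on_stack.add(t)
        for u in wait_for.get(t, []):
            if u in on_stack:          # back edge: a cycle, hence a deadlock
                return True
            if u not in visited and dfs(u):
                return True
        on_stack.discard(t)
        return False

    return any(dfs(t) for t in wait_for if t not in visited)

# T1 waits for T2, T2 waits for T3, T3 waits for T1: deadlock
print(has_cycle({"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}))   # True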
Recovery from Deadlock
When a deadlock detection algorithm determines that a deadlock exists, the system must recover from it. The most common solution is to roll back one or more transactions to break the deadlock. Three actions need to be taken:
1) Selection of a victim: We must determine which transaction(s) to roll back to break the deadlock, rolling back those that will incur the minimum cost. Many factors may determine the cost of a rollback, such as: how long the transaction has computed, and how much longer it will compute before completing its designated task; how many data items the transaction has used; how many more data items it needs in order to complete; and how many transactions will be involved in the rollback.
2) Rollback: We must determine how far the selected transaction should be rolled back. The simplest solution is total rollback; however, it is more effective to perform a partial rollback, although this requires some additional overhead.
3) Starvation: Some transaction may always be selected as a victim and never get a chance to proceed, i.e. it starves. We must ensure that a transaction can be picked as a victim only a finite number of times; for this, the number of rollbacks can be included in the cost factor.
Query by Example (QBE)
Query by Example (QBE) is a method of query creation that allows the user to search for documents based on an example in the form of a selected text string or in the form of a document name or a list of documents. Because the QBE system formulates the actual query, QBE is easier to learn than formal query languages, such as the standard Structured Query Language (SQL), while still enabling powerful searches.
Basic Structure
Queries are expressed by example; the language is non-procedural.
Queries are expressed using skeleton tables: the user selects the skeletons needed and fills them in with example rows.
A P. before a variable causes printing; a P. in front of the row prints all attributes.
Arithmetic expressions and comparisons are allowed.
Example: Let us consider the following relations:
branch (bname, assets, bcity)
borrow (bname, loan#, cname, amount)
customer (cname, street, city)
deposit (bname, account#, cname, balance)
1. Find all customers having an account at the SFU branch
2. Find the names of all branches not located in Burnaby
3. Find all customers having an account at either branch or both
4. Find all account numbers with balances between $1,300 and $1,500
5. Find all account numbers where the balance is between $1,300 and $2,000, but is not $1,500
6. Find the name and city of all customers having a loan at the SFU branch
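As an illustration of the skeleton-table notation (one possible answer, shown in the same layout as the insert/delete examples below), query 1 can be expressed by entering the constant SFU under bname and P. under cname in the deposit skeleton:

deposit | bname | account# | cname | balance
        | SFU   |          | P.    |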
Sailors (sid, sname, rating, age)
1. Print the names and ages of all sailors.
2. Print the names of sailors who are younger than 30 or older than 20.
3. Print the names of sailors who are younger than 30 and have rating > 4.
4. Find all names where the sid is between 1 and 10, but is not 5.
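For instance, query 1 (print the names and ages of all sailors) can be expressed by placing P. under sname and age (one possible skeleton):

Sailors | sid | sname | rating | age
        |     | P.    |        | P.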
Insert, Update and Delete using QBE
Insertion, deletion, and modification of a tuple are specified through the commands I., D., and U., respectively. Consider the relations:
Sailors (sid, sname, rating, age)
Reservation (sid, bid, day)

We can insert a new tuple into the Sailors relation as follows:

Sailors | sid | sname | rating | age
I.      | 74  | Ram   | 7      | 22
We can delete all tuples with rating > 5 from the Sailors relation as follows:

Sailors | sid | sname | rating | age
D.      |     |       | >5     |
We can delete all reservations for sailors with rating < 4 by using:

Sailors     | sid | sname | rating | age
            | _Id |       | <4     |

Reservation | sid | bid | day
D.          | _Id |     |
We can update the age of the sailor with sid 74 to be 42 years by using:

Sailors | sid | sname | rating | age
        | 74  |       |        | U.42
We can also change the age of sailor 74 (e.g. from 21 to 22) by incrementing the age value:

Sailors | sid | sname | rating | age
        | 74  |       |        | U._A+1