CPS221 Lecture: Relational Database Design

last revised October 19, 2010

Objectives:

1. To introduce the anomalies that result from redundant storage of data
2. To introduce the notion of functional dependencies
3. To introduce the basic rules for BCNF normalization

Materials:

1. Handout: progressive normalization of a registration scheme to BCNF
2. Projectable of code for creating database used in SQL Use lab
3. Ability to demonstrate the above using db2
4. Projectable of schema diagram for SQL use lab database

I. Basic Principles of Relational Database Design

A. The topic of relational database design is a complex one, and one we consider in detail in the DBMS course. For now, we look at a few simple principles, which we will make more formal later.

B. One principle is that each relation should have a subset of its attributes which, together, form a PRIMARY KEY for the relation.

1. It is helpful, then, to specify the primary key of each relation as part of the design process.

2. Of course, we need the primary key of an entity in order to create the tables for any relationships in which it participates, since the primary keys of the entities become columns in the table representing the relationship.

3. Good DBMS software will be capable of enforcing a PRIMARY KEY CONSTRAINT - i.e. a primary key can be declared when a table is created, and the DBMS will signal an error if an attempt is made to insert or modify a row in such a way as to create two rows with the same primary key value(s).

C. Another principle is to develop the database scheme in such a way as to avoid storage of redundant information.

1. Often, this will involve decomposing a relation scheme into two or more smaller schemes.


EXAMPLE:

We might be inclined to represent information about student registrations by a scheme like this:

    enrolled(department, course_number, section,
             days, time, room, title,
             student_id, last_name, first_name,
             faculty_id, professor_name)

HANDOUT

2. However, a scheme like this exhibits several serious problems, all arising from REDUNDANCY:

a) The course's id, days, time, room, and title are stored once for each student enrolled - potentially dozens of copies.

b) The student's id, last and first names are stored once for each course the student is enrolled in.

c) The professor's id and name are stored once for each student enrolled in each course he/she teaches.

3. Redundancy is a problem in its own right, since it wastes storage, and increases the time needed to back up or transmit the information. Moreover, redundancy gives rise to some additional problems beyond wasted space and time:

a) The UPDATE ANOMALY.

Suppose the room a course meets in is changed. Every enrolled row for that course must now be updated - one for each student enrolled.

(1) This entails a fair amount of clerical work.

(2) If some rows are updated while others are not, the database will give conflicting answers to the question "where does ____ meet?"

b) An even worse problem is the DELETION ANOMALY.

(1) Suppose that the last student enrolled is dropped from the course. All information about the course in the database is now lost! (One might argue that this is not a problem, since courses with zero enrollment make no sense. However, this could happen early in the registration process - e.g. if a senior is mistakenly registered for a freshman course, and this is corrected before freshmen register. In any case, the decision to delete a course should be made by an appropriate administrator, not by the software!)

(2) Likewise, if a student is dropped from all his/her courses, information about the student is lost. This may not be what is intended.

c) There is a related problem called the INSERTION ANOMALY:

(1) We cannot even store information in the database about a course before some student enrolls - unless we want to create a "dummy" student.

(2) Likewise, we cannot store information about a student until the student is enrolled in at least one course.

(3) Can you think of another example?

ASK

We cannot store information about a faculty member who is not teaching any courses - e.g. a faculty member on sabbatical.
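
The update and deletion anomalies can be seen in a few lines of Python. This is only a sketch: the rows, the student ids, and the room KOS126 are made up for illustration, and only three of the attributes of the single-table scheme are shown.

```python
# Hypothetical rows of the single-table enrolled scheme (most attributes omitted).
enrolled = [
    {"course": "CPS221", "room": "KOS124", "student_id": "1111111"},
    {"course": "CPS221", "room": "KOS124", "student_id": "2222222"},
    {"course": "BIO151", "room": "MAC109", "student_id": "3333333"},
]

# UPDATE ANOMALY: a careless room change that touches only the first matching row.
for row in enrolled:
    if row["course"] == "CPS221":
        row["room"] = "KOS126"
        break

# The table now gives two answers to "where does CPS221 meet?"
rooms = sorted({row["room"] for row in enrolled if row["course"] == "CPS221"})
print(rooms)  # ['KOS124', 'KOS126']

# DELETION ANOMALY: dropping the only BIO151 student also erases the course itself.
enrolled = [row for row in enrolled if row["student_id"] != "3333333"]
courses_known = sorted({row["course"] for row in enrolled})
print(courses_known)  # ['CPS221']
```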

4. A better scheme - though still not a perfect one, as we shall see - would be to break this scheme up into several tables:

    enrolled(department, course_number, section, student_id)
    course(department, course_number, section, days, time, room, title, faculty_id)
    student(student_id, last_name, first_name)
    professor(faculty_id, professor_name)

The process of breaking a large single scheme into two or more smaller schemes is called DECOMPOSITION.

D. Decomposition must be done with care, lest information be lost.

EXAMPLE:

Suppose, in trying to avoid storing redundant information, we had come up with this decomposition (same as above, except with no enrolled scheme, and no faculty_id attribute in course):

    course(department, course_number, section, days, time, room, title)
    student(student_id, last_name, first_name)
    professor(faculty_id, professor_name)

1. It appears that we haven't lost any information - all the data that was stored in the original single scheme is still present in some scheme. Indeed, each value is stored in exactly one table.

2. However, we call such a decomposition a LOSSY-JOIN DECOMPOSITION, because we have actually lost some information. What information have we lost?

ASK

a) We have lost the information about what students are enrolled in what courses.

b) We have lost the information about which faculty member teaches which course.

c) In contrast, our original decomposition was LOSSLESS-JOIN. If we did the following natural join (where |X| stands for natural join):

    enrolled |X| course |X| student |X| professor

we would get back the undecomposed table we started with.

(If we tried to do a similar set of natural joins on our lossy-join decomposition, we would end up with every student enrolled in every course, taught by every professor!)

3. The "acid test" of any decomposition performed to address redundancy is that it must be LOSSLESS-JOIN.
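
The difference between the two decompositions can be sketched in Python. The `natural_join` helper below is written just for this illustration, the data are invented, and the scheme is simplified (a single `course` attribute stands in for department, course_number and section, and the professor table is left out).

```python
def natural_join(r, s):
    """Natural join of two relations, each given as a list of dicts."""
    if not r or not s:
        return []
    common = set(r[0]) & set(s[0])  # attributes the two relations share
    return [{**a, **b} for a in r for b in s
            if all(a[c] == b[c] for c in common)]

enrolled = [{"course": "CPS221", "student_id": "S1"},
            {"course": "BIO151", "student_id": "S2"}]
course = [{"course": "CPS221", "room": "KOS124"},
          {"course": "BIO151", "room": "MAC109"}]
student = [{"student_id": "S1", "last_name": "Aardvark"},
           {"student_id": "S2", "last_name": "Zebra"}]

# With enrolled present, the joins rebuild exactly the two original registrations.
lossless = natural_join(natural_join(enrolled, course), student)
print(len(lossless))  # 2

# Without enrolled, course and student share no attribute, so the natural join
# degenerates into pairing every student with every course.
lossy = natural_join(course, student)
print(len(lossy))  # 4
```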

E. A principle related to using lossless join decompositions to avoid redundancy is the explicit identification of FOREIGN KEYS.

1. In our lossless join decomposition, what made the decomposition work correctly is that the first scheme - enrolled - had foreign keys that referenced the course and student tables; and course had a foreign key that referenced the professor table.

2. All DBMS's allow foreign keys to be declared when a table is created. Most DBMS’s will then enforce the rule that no row can be inserted or modified in such a way as to have foreign key values that do not appear in some row of the table being referenced. (Alas, mysql does not do this for tables that use its MyISAM storage engine!)

e.g. if we made student_id a foreign key in enrolled, referencing the student table, then it would be impossible to insert a row in enrolled containing a student_id that does not appear in student.
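
As a rough illustration of foreign key enforcement, the sketch below uses Python's built-in sqlite3 module rather than db2 or mysql. The table definitions and the sample student are assumptions made for the example, and the foreign key from enrolled to course is omitted for brevity. Note that SQLite itself only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked to
conn.execute("""CREATE TABLE student (
    student_id TEXT PRIMARY KEY, last_name TEXT, first_name TEXT)""")
conn.execute("""CREATE TABLE enrolled (
    department TEXT, course_number TEXT, section TEXT,
    student_id TEXT REFERENCES student(student_id),
    PRIMARY KEY (department, course_number, section, student_id))""")

conn.execute("INSERT INTO student VALUES ('1111111', 'Aardvark', 'Anthony')")
conn.execute("INSERT INTO enrolled VALUES ('CPS', '221', 'A', '1111111')")  # accepted

try:
    # No student has id 9999999, so this row has a dangling foreign key.
    conn.execute("INSERT INTO enrolled VALUES ('CPS', '221', 'A', '9999999')")
    rejected = False
except sqlite3.IntegrityError as err:
    rejected = True
    print("rejected:", err)

row_count = conn.execute("SELECT COUNT(*) FROM enrolled").fetchone()[0]
```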

F. Nulls

1. One interesting question that arises in database design is how we are to handle a situation where we don't have values available for all the attributes of an entity. We have already seen that relational DBMS's provide a special value called null that can be stored in such a column.

2. In designing a database, it will sometimes be necessary to specify that certain columns CANNOT ever contain a null value. This will necessarily be true of any column that is part of the primary key, and may be true of other columns as well. Most DBMS's allow the designer to specify that a given column cannot be null.

3. One wants to think rather carefully about whether to allow a column to contain a null value.

a) A primary key - or portion of a primary key - cannot be null, of course.

b) Allowing null is certainly appropriate if the column represents information one might not have or which might not be meaningful for every row in the table.

Example: A medical records database might want to allow age to be null because this might not be known for a patient treated in the emergency room.

Example: A medical records database might want insurance to be null to allow for uninsured patients.


c) null values are problematic if allowing them is a result of poor design.

Example: In our original scheme for enrolled (the one with all attributes in one table), if we wanted to store information about a student who is not enrolled in any courses we could only do so by recording the student as “enrolled” in a null course. But this is a consequence of bad design, and ceases to be necessary when we decompose the design properly.

d) Sometimes the question is an open one, with good arguments possible for both answers.

Example: Consider the Video Store project you did in CPS122. Suppose you wanted to store the Video Store information in a database. There are two ways of representing rental information:

(1) You could have a separate rented table with attributes customerID, copyID and dateDue.

(2) You could include rentedToCustomerID and dateDue attributes in the copy table, because a given copy can only be rented to one person at a time. (These would have to be null if the item was on the shelf.)

(3) Advantages/disadvantages of the approaches?

ASK

(a) A separate table is cleaner, and has the advantage that recording a rental or return is a matter of inserting/deleting a row, rather than modifying one.

(b) Including rental information in the copy table makes some inquiries more efficient, since it avoids the need for joining two tables.


II. Functional Dependencies

A. Definition: for some relation-scheme R, we say that a set of attributes B (B a subset of R) is functionally dependent on a set of attributes A (A a subset of R) if, for any legal relation on R, if there are two tuples t1 and t2 such that t1[A] = t2[A], then it must be that t1[B] = t2[B].

(This can be stated alternately as follows: there can be no two tuples t1 and t2 such that t1[A] = t2[A] but t1[B] != t2[B].)

B. We denote such a functional dependency as follows:

    A → B

(Read: A determines B)

Example: For our enrolled database, the following FD's certainly hold:

    department, course_number → title
    department, course_number, section → days
    department, course_number, section → time
    department, course_number, section → room
    student_id → last_name
    student_id → first_name
    faculty_id → professor_name
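
The definition translates almost directly into a check on one particular table. The sketch below (with a hypothetical `violates` helper and invented rows) looks for two rows that agree on A but differ on B. Keep in mind that such a check can only refute an FD for a given set of data; whether an FD really holds is a question about the reality being modeled.

```python
def violates(rows, lhs, rhs):
    """Return True if two rows agree on the lhs attributes but differ on the rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[b] for b in rhs)
        # Remember the first rhs value seen for each lhs value; any later
        # disagreement is a counterexample to lhs -> rhs.
        if seen.setdefault(key, val) != val:
            return True
    return False

rows = [
    {"student_id": "1", "last_name": "Smith", "first_name": "Ann"},
    {"student_id": "2", "last_name": "Smith", "first_name": "Bob"},
    {"student_id": "3", "last_name": "Jones", "first_name": "Cal"},
]

print(violates(rows, ["student_id"], ["last_name"]))  # False
print(violates(rows, ["last_name"], ["first_name"]))  # True (two different Smiths)
```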

C. One interesting question is the relationship between department, course_number, and section, on the one hand, and professor on the other hand.

1. Since courses can be team taught, a simple FD would be incorrect - e.g.

    NOT: department, course_number, section → faculty_id

2. However, there is a relationship between sections of a course and faculty teaching the section. The relationship is a more complicated one called a MULTIVALUED DEPENDENCY, which we won't discuss in this course (though we do in the DBMS course.)

3. Note that functional dependencies are defined in terms of the UNDERLYING REALITY that the database models - not some particular set of values in the database.

For example, it happens that, for the students in many courses

    last_name → first_name

(and sometimes first_name → last_name!)

However, this is not inherent in the underlying reality, so we would not include it as an FD in designing a database representation for students in a course.

D. From the FD's, we can determine the candidate keys, and choose primary keys, for the scheme.

1. Formally, we say that some set of attributes K is a SUPERKEY for some relation scheme R if

    K → R

2. We say that K is a CANDIDATE KEY if it has no proper subsets that are superkeys.

3. EXAMPLE: For the scheme

    student(student_id, last_name, first_name)

R - the set of all attributes - is { student_id, last_name, first_name }

{ student_id, last_name } is a superkey, because

    student_id, last_name → student_id, last_name, first_name

but { student_id, last_name } is not a candidate key, because student_id all by itself is a superkey

student_id is a superkey because

    student_id → student_id, last_name, first_name

student_id is also a candidate key, because it obviously has no proper subsets that are superkeys.

4. EXAMPLE: For the same scheme, if we insisted that no two students could have the same full name, we would have

    last_name, first_name → student_id, last_name, first_name

In this case, last_name and first_name would be both a superkey and a candidate key. (In general, though, this is not a good idea!)
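
The superkey and candidate key tests can be mechanized by computing the set of attributes that a given set determines (often called its closure). The following is a minimal sketch, not a definitive implementation: the function names are made up for this example, and each FD is written as a pair of Python sets.

```python
from itertools import combinations

def closure(attrs, fds):
    """All attributes determined by attrs, given FDs as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_superkey(attrs, scheme, fds):
    return closure(attrs, fds) >= scheme

def is_candidate_key(attrs, scheme, fds):
    """A superkey none of whose proper subsets is also a superkey."""
    attrs = set(attrs)
    if not is_superkey(attrs, scheme, fds):
        return False
    return not any(is_superkey(sub, scheme, fds)
                   for n in range(len(attrs))
                   for sub in combinations(attrs, n))

student = {"student_id", "last_name", "first_name"}
fds = [({"student_id"}, {"last_name"}), ({"student_id"}, {"first_name"})]

print(is_superkey({"student_id", "last_name"}, student, fds))       # True
print(is_candidate_key({"student_id", "last_name"}, student, fds))  # False
print(is_candidate_key({"student_id"}, student, fds))               # True
```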

E. It is important to bear in mind that functional dependencies are properties of the underlying reality - not a particular set of data.

Example: For the people in this class, the dependency

    last_name → first_name

appears to hold. But this is not a fundamental property of the reality; it is an accident of a particular set of data. Thus, we would not include this as an FD when designing a relational database to represent enrollment in a class!

F. An important property of FD’s is that, if we decompose a scheme R into two schemes R1 and R2, the join is lossless if and only if at least one of R1 ∩ R2 → R1 or R1 ∩ R2 → R2 holds - i.e. the intersection of the two schemes must be a superkey for at least one of them if the join is to be lossless.

Example: Given the scheme (W, X, Y, Z) with dependencies

    W → X, Y and Y → Z

1. The decomposition (W, X, Y) and (Y, Z) is lossless because the intersection of the two schemes is Y, which is a key of the second scheme.

2. The decomposition (W, X) and (W, Y, Z) is lossless because the intersection of the two schemes is W, which is actually a key of both.

3. The decomposition (W, X, Z) and (X, Y, Z) is not lossless because the intersection of the two schemes is X, Z which is not a key of either.
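
The three cases above can be checked mechanically. In this sketch (helper names are hypothetical, FDs are pairs of Python sets), a two-way decomposition is accepted when the attributes determined by R1 ∩ R2 cover all of R1 or all of R2.

```python
def closure(attrs, fds):
    """All attributes determined by attrs, given FDs as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def lossless(r1, r2, fds):
    """True if R1 ∩ R2 determines all of R1 or all of R2."""
    determined = closure(r1 & r2, fds)
    return r1 <= determined or r2 <= determined

fds = [({"W"}, {"X", "Y"}), ({"Y"}, {"Z"})]

print(lossless({"W", "X", "Y"}, {"Y", "Z"}, fds))       # True
print(lossless({"W", "X"}, {"W", "Y", "Z"}, fds))       # True
print(lossless({"W", "X", "Z"}, {"X", "Y", "Z"}, fds))  # False
```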

III. Normal Forms

A. Relational database theorists have developed a number of normal forms which can be used to develop appropriate designs. These are covered in detail in a DBMS course. For now, we'll just look at them briefly.

1. The tests are applied separately to the design of each entity set.

2. If any design fails a test, it is typically NORMALIZED by decomposing it into two or more entity sets which share a common key.

B. First Normal Form (1NF):

1. A relation scheme R is in 1NF iff, for each tuple t in R, each attribute of t is atomic - i.e. it has a SINGLE, NON-COMPOSITE VALUE.

2. This rules out:

a) Repeating groups

b) Composite columns in which we can access individual components - e.g. dates that can either be treated as a unit or have month, day and year components accessed separately.

3. Example:

Recall the problem that arises because of team teaching (the FD department, course_number, section → faculty_id does NOT hold). We might try to solve this by storing several values in the faculty_id column - e.g. (using names instead of faculty id's, since I don't know the latter!)

Example: In the spring of 2010, BIO151 was team taught by Drs. Blend and Boorse - so we could store it as

    BIO151  MWF  9:10  MAC109  'BIOLOGY II'  (BLEND, BOORSE)

However, this is not in 1NF, since the faculty attribute is not atomic.

4. Any non-1NF scheme can be made 1NF by "flattening" it - e.g.

    BIO151  MWF  9:10  MAC109  'BIOLOGY II'  BLEND
    BIO151  MWF  9:10  MAC109  'BIOLOGY II'  BOORSE

a) i.e. we create two distinct rows, one for each value

b) Of course, this creates new redundancy problems, addressed by the theory of multivalued dependencies.

c) Since we won't be discussing these here, we'll assume for the rest of our example that all courses have a SINGLE faculty member (either no team teaching, or include only one professor in the listing for the course)
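
Flattening is easy to express in code. The sketch below assumes a row is a Python dict and that the non-atomic attribute holds a tuple of values; the `flatten` helper and the attribute names are made up for the example.

```python
def flatten(row, attr):
    """Expand a row whose attr holds several values into one row per value."""
    return [{**row, attr: value} for value in row[attr]]

nested = {"course": "BIO151", "days": "MWF", "time": "9:10", "room": "MAC109",
          "title": "BIOLOGY II", "faculty": ("BLEND", "BOORSE")}

flat = flatten(nested, "faculty")
for r in flat:
    print(r["course"], r["days"], r["time"], r["room"], r["title"], r["faculty"])
```

Every other column is copied into each resulting row, which is exactly the new redundancy mentioned in (b).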

5. 1NF is desirable for most applications, because it guarantees that each attribute in R is functionally dependent on the primary key, and simplifies queries.

6. However, there are some applications for which atomicity may be undesirable - e.g. keyword columns in bibliographic databases, or multimedia databases where a "column" may actually be a movie. Normalization theory for such situations is still being researched.

C. Second Normal Form (2NF):

1. A 1NF relation scheme R is in 2NF iff each non-key attribute of R is FULLY functionally dependent on each candidate key. By FULLY functionally dependent, we mean that it is functionally dependent on the whole candidate key, but not on any proper subset of it.

2. Example: Consider our original enrolled scheme:

    enrolled(department, course_number, section,
             days, time, room, title,
             student_id, last_name, first_name,
             faculty_id, professor_name)

What is/are the candidate key(s)?

ASK

department, course_number, section, student_id is our CK because: department and course_number together determine title; department, course_number, and section together determine days, time, room and faculty_id; student_id determines last_name and first_name, and faculty_id determines professor_name

However, we have several partial dependencies:

a) title depends only on department, course_number

b) days, time, room, and faculty_id (and, through faculty_id, professor_name) depend only on department, course_number, and section

c) last_name and first_name depend only on student_id

3. Any non-2NF scheme can be made 2NF by a decomposition in which we factor out the attributes that are dependent on only a portion of a candidate key, together with the portion they depend on.

Example: The dependencies listed above lead to the following 2NF decomposition

    course(department, course_number, title)
    section(department, course_number, section, days, time, room,
            faculty_id, professor_name)
    student(student_id, last_name, first_name)
    enrolled(department, course_number, section, student_id)
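
Partial dependencies can be found by asking, for each proper subset of the candidate key, which non-key attributes it determines. The sketch below is one hypothetical way to do that (the helper names are not from any library), using the FD's of the enrolled scheme and the single-professor-per-section assumption made earlier.

```python
from itertools import combinations

def closure(attrs, fds):
    """All attributes determined by attrs, given FDs as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def partial_dependencies(key, fds):
    """Map each non-key attribute to a smallest proper subset of key that determines it."""
    found = {}
    key = sorted(key)
    for n in range(1, len(key)):            # proper, non-empty subsets, smallest first
        for sub in combinations(key, n):
            for attr in closure(sub, fds) - set(key):
                found.setdefault(attr, set(sub))
    return found

fds = [
    ({"department", "course_number"}, {"title"}),
    ({"department", "course_number", "section"}, {"days", "time", "room", "faculty_id"}),
    ({"student_id"}, {"last_name", "first_name"}),
    ({"faculty_id"}, {"professor_name"}),
]
key = {"department", "course_number", "section", "student_id"}

partial = partial_dependencies(key, fds)
for attr in sorted(partial):
    print(attr, "depends only on", sorted(partial[attr]))
```

Grouping the attributes by the subset they depend on gives the course, section and student schemes of the decomposition; what is left of the key forms enrolled.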

4. Observe that any 1NF relation scheme which has no COMPOSITE candidate key is, of necessity, in 2NF.

5. 2NF is desirable because it avoids repetition of information that is dependent on part of the primary key, but not the whole key, and thus prevents various anomalies.

D. Third Normal Form (3NF) / Boyce-Codd Normal Form (BCNF)

1. In the history of normalization theory, a definition was developed for what is called third normal form (3NF).

a) It was later shown that this definition does not eliminate all undesirable redundancies (and hence anomalies.) As a result, a new, more stringent definition was proposed.

b) However, there are some unusual cases where the new definition creates a problem that the initial definition did not have. For this reason, both definitions have been retained, and the new definition is called Boyce-Codd Normal Form (BCNF).

c) We will use the more stringent BCNF form (which is actually easier to test).

2. A normalized relation R is in BCNF iff, for every nontrivial dependency of the form A → B (i.e. B is not a subset of A) that holds on R, A is a superkey. (There are no dependencies on attribute sets that are not superkeys.)

3. Example: In the above decomposition, section is not BCNF (or 3NF for that matter) because there is what is called a transitive dependency.

a) The primary key of section is department, course_number, section. Any superkey must include these attributes.

b) But faculty_id by itself determines professor_name - i.e.

    faculty_id → professor_name holds

This is called a transitive dependency because the primary key determines professor_name indirectly - i.e. through another attribute (faculty_id).

c) Any non-BCNF scheme can be decomposed into BCNF schemes by factoring out the attribute(s) that are transitively-dependent on the primary key, and putting them into a new scheme along with the attribute(s) they depend on.

Example: We decompose section to

    section(department, course_number, section, days, time, room, faculty_id)
    professor(faculty_id, professor_name)
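
The BCNF test can be sketched the same way. The simplified check below only examines the FD's that are listed explicitly and that lie wholly inside the scheme being tested, which is enough for this example but is not a complete test in general; the function names are hypothetical.

```python
def closure(attrs, fds):
    """All attributes determined by attrs, given FDs as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violations(scheme, fds):
    """Listed nontrivial FDs inside scheme whose left side is not a superkey of scheme."""
    return [(lhs, rhs) for lhs, rhs in fds
            if lhs <= scheme and rhs <= scheme and not rhs <= lhs
            and not closure(lhs, fds) >= scheme]

fds = [
    ({"department", "course_number", "section"}, {"days", "time", "room", "faculty_id"}),
    ({"faculty_id"}, {"professor_name"}),
]

old_section = {"department", "course_number", "section", "days", "time", "room",
               "faculty_id", "professor_name"}
new_section = old_section - {"professor_name"}
professor = {"faculty_id", "professor_name"}

print(bcnf_violations(old_section, fds))  # the faculty_id -> professor_name FD
print(bcnf_violations(new_section, fds))  # []
print(bcnf_violations(professor, fds))    # []
```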

E. Beyond 3NF/BCNF there are further normal forms called 4NF and 5NF that we won't discuss here, but do discuss in the DBMS course. We aim to ensure that every database design we produce is in the highest normal form (generally at least BCNF, but often higher).

For now, we'll stop at BCNF - which can be summarized by the following rule:

    Every attribute depends on the key, the whole key, and nothing but the key

IV. Integrity Constraints

A. One of the advantages of using a DBMS rather than accessing files directly is that it is possible, when setting up a database, to specify various integrity constraints on the data that are enforced for all accesses to the data.

Data integrity is concerned with ensuring the ACCURACY of the data. In particular, the concern is with protecting the data from ACCIDENTAL inaccuracy, due to causes like:

1. Data entry errors

2. System crashes

3. Anomalies due to concurrent and/or distributed processing

B. We consider this topic much more extensively in CPS352, but for now we will look at some high points. For our demonstrations, we will be using db2 instead of the mysql system you used in lab. At the present time, mysql does not implement all of the kinds of checking that are built into “industrial strength” DBMS’s. (However, we used it in lab because it’s a lot easier to work with!)

(Setup to run db2
• use bash
• be sure command db2 start database manager has been given
• use db2 -t to get use of semicolons as terminators
• connect using connect to cps221 user bjork)

C. One type of constraint we want to note is one having to do with the values that are permitted to occur in a particular column.

1. Of course, certain constraints are implicit in the data type for a column. For example, if a column is declared to be of a numeric type, it is not possible to insert an arbitrary character string into it.

DEMO: insert a row into gradepoints with grade_points not numeric

    insert into gradepoints values('W', 'x');

2. Because this sort of data typing is coarse-grained, relational DBMS’s also support more fine-grained specification of permitted values.

PROJECT: code for creating database used in SQL use lab

Note check constraint on gradepoints

DEMO:

    insert into course_taken values('1111111', 'BCM', '103', '2009FA', 4, 'Z');
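
For readers without db2 at hand, a check constraint can be tried out with Python's built-in sqlite3 module. This is only a sketch: the lab's actual table definitions are not reproduced in these notes, so the gradepoints columns and the 0.0 to 4.0 range below are assumptions. (SQLite is also looser about column types than db2, so the data type demo above does not carry over directly.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical version of gradepoints; the real lab schema may differ.
conn.execute("""CREATE TABLE gradepoints (
    grade TEXT PRIMARY KEY,
    grade_points REAL CHECK (grade_points BETWEEN 0.0 AND 4.0))""")

conn.execute("INSERT INTO gradepoints VALUES ('A', 4.0)")  # within range: accepted

try:
    conn.execute("INSERT INTO gradepoints VALUES ('Q', 7.5)")  # out of range
    rejected = False
except sqlite3.IntegrityError as err:
    rejected = True
    print("rejected:", err)

stored = conn.execute("SELECT grade FROM gradepoints").fetchall()
```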

D. Recall that a relational table models a SET, and therefore each row in it must be unique among all the elements of that set. This translates into the notion of a "key" that we discussed earlier. ENTITY INTEGRITY constraints are concerned with ensuring that each row in a table is distinct from all other rows in the same table in their key values.

1. We have already said that in the process of designing a relational database each table should have a primary key.

2. This can be incorporated into the declaration for the table by means of a PRIMARY KEY constraint.

PROJECT: code for creating database used in SQL use lab.

A primary key constraint can be specified in one of two ways

a) If the primary key consists of a single column, it can be specified as a column constraint.

Example: id column in student

b) If the primary key is composite, it must be expressed as a table constraint

Example: department, course_number in course_offered

c) Note, by the way, that in both cases the relevant column(s) need a not null constraint.

3. When a primary key is declared, the DBMS will enforce the requirement that no two rows in the table will be allowed to have the same value(s) in the specified column(s). (Recall that we demonstrated this as part of the lecture introducing DBMS’s) Let’s look at another example

Demo:

    insert into student values('1111111', 'Zebra', 'Zelda', 'CPS');
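
The same behavior can be reproduced with Python's built-in sqlite3 module. In this sketch the student column names (other than id) are guesses based on the four values in the demo insert, and the first student row is invented so that the id 1111111 is already taken.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Single-column primary key declared as a column constraint (column names assumed).
conn.execute("""CREATE TABLE student (
    id TEXT NOT NULL PRIMARY KEY,
    last_name TEXT, first_name TEXT, major TEXT)""")

conn.execute("INSERT INTO student VALUES ('1111111', 'Aardvark', 'Anthony', 'CPS')")

try:
    # Same id as the existing row, so the primary key constraint is violated.
    conn.execute("INSERT INTO student VALUES ('1111111', 'Zebra', 'Zelda', 'CPS')")
    rejected = False
except sqlite3.IntegrityError as err:
    rejected = True
    print("rejected:", err)

names = conn.execute("SELECT last_name FROM student").fetchall()
```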

4. There is another form of entity integrity constraint - the unique constraint - that we won’t discuss in this course.

E. Entity integrity constraints pertain to data within a single table, and can be enforced by looking at that table alone. The next sort of constraint we need to consider pertains to REFERENTIAL INTEGRITY.

1. It is frequently the case that the logic of a system demands that an entry cannot occur in one table without a related entry occurring in another table:

Example: A course section should not appear in current_term_course unless the corresponding course appears in course_offered, because certain important information about the course (its title and number of credits) appears only there.

2. This is, of course, the notion of a foreign key, which we represented by means of an arrow in our schema diagrams.

PROJECT: SQL use lab database schema

3. The requirement that a matching row occur in another table for each value occurring in a certain column (or set of columns) in a particular table is called REFERENTIAL INTEGRITY. Referential integrity constraints are expressed in SQL by using a FOREIGN KEY constraint, which is specified by the use of the reserved words FOREIGN KEY and REFERENCES.

PROJECT: code for creating database used in SQL use lab.

Note foreign key constraint in course_offered

4. The DBMS will not allow a row to be inserted into the referencing table without the corresponding row existing in the referenced table.

Demo:

    insert into current_term_course values('AAA', '101', ' ', 'MWF', '0800', 'KOS124');

F. When appropriate constraints are incorporated into the design of a database, they are enforced for every program that uses that database. Moreover, if a constraint is changed, the change will be enforced for all future accesses without in any way needing to modify programs accessing the database.
