Lecture 8 Objectives Physical Database Design

Lecture 8 

Physical Database Design for 

Relational Databases 

Physical Database Design 

Process of producing a description of the 

implementation of the database on secondary 

storage; it describes the base relations, file 

organizations, and indexes used to achieve 

efficient access to the data, and any associated 

integrity constraints and security measures. 


Objectives 

• Purpose of physical database design. 

• How to design base relations for target DBMS. 

• How to design enterprise constraints for target 

DBMS. 

• How to select appropriate file organizations based on 

analysis of transactions. 

• When to use secondary indexes to improve 

performance. 

Physical Database Design Methodology 

• Step 1: Translate logical data model for target DBMS 

– Step 1.1 Design base relations 

– Step 1.2 Design representation of derived data 

– Step 1.3 Design enterprise constraints 

• Step 2 Design physical representation 

– Step 2.1 Analyze transactions 

– Step 2.2 Choose file organizations 

– Step 2.3 Choose indexes 

– Step 2.4 Estimate disk space requirements 

• Step 3 Design user views 

• Step 4 Design security mechanisms 

• Step 5 Consider the introduction of controlled redundancy 


Step 1 Translate Logical Data Model for 

Target DBMS 

To produce a relational database schema that can be 

implemented in the target DBMS from the logical data 

model. 

• Need to know functionality of target DBMS such as 

how to create base relations and whether the system 

supports the definition of: 

– PKs, and FKs; 

– required data – i.e. whether system supports NOT NULL; 

– domains; 

– relational integrity constraints; 

– enterprise constraints. 

Example: PropertyForRent Relation 


Step 1.1 Design Base Relations 

To decide how to represent base relations in target DBMS. 

• For each relation, need to define: 

– the name of the relation; 

– a list of simple attributes in brackets; 

– the PK and FKs. 

– a list of any derived attributes and how they should be computed; 

– referential integrity constraints for any FKs identified. 

• For each attribute, need to define: 

– its domain, consisting of a data type, length, and any constraints on the 

domain; 

– an optional default value for the attribute; 

– whether the attribute can hold nulls. 

Step 1.2 Design Representation of Derived Data 

To decide how to represent any derived data present in 

the logical data model in the target DBMS. 

• Produce list of all derived attributes. 

• Derived attribute can be stored in database or calculated 

every time it is needed. Option selected is based on: 

• additional cost to store the derived data and keep it 

consistent with operational data from which it is derived; 

• cost to calculate it each time it is required. 

• Less expensive option is chosen subject to performance 

constraints. 
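The two options for a derived attribute can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table layout is cut down to the columns needed, and the staff and property numbers are made-up sample data.

```python
import sqlite3

# In-memory database with simplified, assumed versions of the two relations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Staff (
        staffNo        TEXT PRIMARY KEY,
        noOfProperties INTEGER NOT NULL DEFAULT 0   -- stored derived attribute
    );
    CREATE TABLE PropertyForRent (
        propertyNo TEXT PRIMARY KEY,
        staffNo    TEXT REFERENCES Staff(staffNo)
    );
""")
conn.executemany("INSERT INTO Staff (staffNo) VALUES (?)", [("SG14",), ("SG37",)])
conn.executemany(
    "INSERT INTO PropertyForRent VALUES (?, ?)",
    [("PG4", "SG14"), ("PG16", "SG14"), ("PG21", "SG37")],
)


def count_on_demand(staff_no):
    """Option 1: calculate the derived value every time it is needed."""
    row = conn.execute(
        "SELECT COUNT(*) FROM PropertyForRent WHERE staffNo = ?", (staff_no,)
    ).fetchone()
    return row[0]


def refresh_stored_counts():
    """Option 2: store the value; it must be refreshed to stay consistent."""
    conn.execute("""
        UPDATE Staff
        SET noOfProperties = (SELECT COUNT(*) FROM PropertyForRent
                              WHERE PropertyForRent.staffNo = Staff.staffNo)
    """)


def stored_count(staff_no):
    row = conn.execute(
        "SELECT noOfProperties FROM Staff WHERE staffNo = ?", (staff_no,)
    ).fetchone()
    return row[0]


refresh_stored_counts()
print(count_on_demand("SG14"), stored_count("SG14"))
```

The stored value is cheap to read but becomes stale as soon as PropertyForRent changes, which is the consistency cost mentioned above.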


PropertyforRent Relation and Staff Relation with 

Derived Attribute noOfProperties 

Step 2 Design Physical Representation 

To determine optimal file organizations to store 

the base relations and the indexes that are 

required to achieve acceptable performance; that 

is, the way in which relations and tuples will be 

held on secondary storage. 


Step 1.3 Design Enterprise Constraints 

To design the enterprise constraints for the 

target DBMS. 

• Some DBMS provide more facilities than others for 

defining enterprise constraints. Example: 

CREATE TABLE PropertyForRent ( 
    … 
    CONSTRAINT StaffNotHandlingTooMuch 
        CHECK (NOT EXISTS (SELECT staffNo 
                           FROM PropertyForRent 
                           GROUP BY staffNo 
                           HAVING COUNT(*) > 100)) 
    …); 
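Not every DBMS accepts a subquery inside a CHECK constraint, so the same rule is often enforced with a trigger instead. The sketch below uses SQLite through Python's sqlite3 module purely as an example target DBMS. The two-column table and the helper add_property are assumptions made for the illustration, and only inserts are covered (an update of staffNo would need a second trigger).

```python
import sqlite3

MAX_PROPERTIES = 100  # limit taken from the constraint above

conn = sqlite3.connect(":memory:")
conn.executescript(f"""
    CREATE TABLE PropertyForRent (
        propertyNo TEXT PRIMARY KEY,
        staffNo    TEXT
    );

    -- Reject an insert if the staff member already manages the maximum.
    CREATE TRIGGER StaffNotHandlingTooMuch
    BEFORE INSERT ON PropertyForRent
    WHEN (SELECT COUNT(*) FROM PropertyForRent
          WHERE staffNo = NEW.staffNo) >= {MAX_PROPERTIES}
    BEGIN
        SELECT RAISE(ABORT, 'staff member already manages the maximum number of properties');
    END;
""")


def add_property(property_no, staff_no):
    """Hypothetical helper: returns True if the insert was accepted."""
    try:
        conn.execute(
            "INSERT INTO PropertyForRent VALUES (?, ?)", (property_no, staff_no)
        )
        return True
    except sqlite3.DatabaseError:
        # Raised when the trigger aborts the statement.
        return False


accepted = [add_property(f"P{i}", "SG14") for i in range(MAX_PROPERTIES)]
print(all(accepted), add_property("P-extra", "SG14"))
```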

Step 2 Design Physical Representation 

• Number of factors that may be used to measure 

efficiency: 

- Transaction throughput: number of transactions 

processed in given time interval. 

- Response time: elapsed time for completion of a single 

transaction. 

- Disk storage: amount of disk space required to store 

database files. 

• Typically, have to trade one factor off against 

another to achieve a reasonable balance. 


Step 2 Design Physical Representation – 

Preliminaries 

• Data is stored in a number of types of media 

– Primary Storage - data here is accessed directly by the CPU, in main memory or in faster but smaller-capacity cache memory 

• provides fast access but has limited capacity and is also volatile (i.e., data is not stored permanently) 

– Secondary Storage - data here is not directly accessed by the CPU, so it needs to be loaded into primary memory 

• roughly 4-5 orders of magnitude slower than primary storage (for magnetic disk), but has far larger capacity and is non-volatile (stores data permanently) 

• Most databases are stored permanently on magnetic disk secondary storage 


Preliminaries 

• Data is organised on the disk as files. 

• File records are stored in disk blocks, because a block is the unit of data transfer between disk and memory. 

• The number of complete records that can fit into a 

block is the blocking factor (bfr). 

• If a block size is B bytes and the record size is R 

bytes, then 

bfr = Int(B/R) 
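As a quick check of the formula, the calculation can be written out in Python. The figures used (1024-byte blocks, 100-byte records, 30,000 records) are illustrative values, and the function names are chosen for this sketch only.

```python
def blocking_factor(block_size, record_size):
    """bfr = Int(B / R): number of complete records that fit in one block."""
    if record_size <= 0 or record_size > block_size:
        raise ValueError("record size must be positive and no larger than a block")
    return block_size // record_size


def blocks_needed(num_records, bfr):
    """Number of blocks needed to hold num_records, rounding up."""
    return -(-num_records // bfr)  # ceiling division on integers


bfr = blocking_factor(1024, 100)
print(bfr, blocks_needed(30_000, bfr))
```

Integer division discards the partial record, so the unused bytes at the end of each block (24 in this case) are simply wasted space.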



Preliminaries 

[Figure: Typical disk configuration, showing the operating system, DB files, index file and recovery log file] 

Step 2.1 Analyze Transactions 

To understand the functionality of the 

transactions that will run on the database and to 

analyze the important transactions. 

• Attempt to identify performance criteria, such as: 

– transactions that run frequently and will have a significant 

impact on performance; 

– transactions that are critical to the business; 

– times during the day/week when there will be a high demand 

made on the database (called the peak load). 

– attributes that are updated in an update transaction; 

– criteria used to restrict tuples that are retrieved in a query. 


Cross-Referencing Transactions and Relations 

Example Transaction Analysis Form 


Transaction Usage Map for Some Sample 

Transactions Showing Expected Occurrences 

Step 2.2 Choose File Organisations 

• Objective: To determine an efficient file organisation for 

each relation, if the DBMS allows this. 

• File organisation is the physical arrangement of data in a 

file into records and pages (blocks) on secondary storage. 

• Existing types of file organisations: 

– Heap (unordered) files 

– Sequential (ordered) files 

– Hash files 

• E.g., if we want to retrieve staff tuples in alphabetical 

order of name, sorting the file by staff name is a good file 

organisation. 


Heap Files 

Here records are placed in the file in the same order as they are inserted 

Insert 

[Figure: a heap file made up of block 1, block 2, block 3] 

Insert new record into last block, creating a new block if necessary. 

Very efficient - last block address is readily available 

Find 

Go through blocks one at a time until required record is found. 

Very inefficient - requires a linear search, which works out at an 

average of b/2 block accesses if the file occupies b blocks 
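A heap file can be modelled in a few lines of Python to see where the b/2 figure comes from. This is a toy in-memory model, not how a DBMS lays out pages: the class name, the blocking factor of 4 and the integer records are all assumptions of the sketch.

```python
class HeapFile:
    """Toy heap file: a list of blocks, each holding at most bfr records."""

    def __init__(self, bfr):
        self.bfr = bfr
        self.blocks = []

    def insert(self, record):
        # Always append to the last block; start a new block when it is full.
        if not self.blocks or len(self.blocks[-1]) == self.bfr:
            self.blocks.append([])
        self.blocks[-1].append(record)

    def find(self, key):
        """Linear search. Returns (found, number_of_block_accesses)."""
        for accesses, block in enumerate(self.blocks, start=1):
            if key in block:
                return True, accesses
        return False, len(self.blocks)


heap = HeapFile(bfr=4)
for record in range(1, 41):      # 40 records -> b = 10 blocks
    heap.insert(record)

average = sum(heap.find(k)[1] for k in range(1, 41)) / 40
print(len(heap.blocks), average)
```

For b = 10 blocks the measured average is (b + 1)/2 = 5.5, which is the b/2 estimate up to rounding. A search for a record that is not in the file always costs all b block accesses.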

Ordered Files 

• Here records are physically ordered (sorted) 

on disk based on the values of one or more 

fields (a.k.a. ordering fields) 

– if the ordering field is guaranteed to have a 

unique value in each record then it is also a key 

field of the file and is a.k.a an ordering key 


Heap Files 

Delete 

block i 

record 1 record 2 record 3 record 4 ….. 

First find then mark for deletion records that are no 

longer required. Very inefficient because records have to be found. 

Reorganise 

block i 

record 1 record 3 ….. 

Repacking needed only occasionally to remove unused space. 

Reasonably efficient. 

Ordered Files 

Insert 

Main file: 1 2 3 4 5 7 8 9 10 11 13 

Overflow file (block 1): 6 12 

Insert new record into overflow file and merge periodically 

Delete 

Mark for deletion records no longer required 


1 2 (3) 4 5 (6) 7 (8) 9 10 11 12 

Reorganise 

Repack records to remove unused space 


1 2 4 5 7 9 10 11 12 


Ordered Files 

Find 

Say we wanted to find record with ordering key value K then we require 

a binary search 

The binary search starts by reading the middle block, M, calculated as: 

M = (1 + B) div 2 where B is the number of blocks 

Suppose: 

the smallest value in block M is called S and 

the largest value in block M is called L 

There are 3 possibilities: 

1. K < S – this means record K is somewhere in blocks 1 to M-1 

2. K > L – this means record K is somewhere in blocks M+1 to B 

3. Neither 1 nor 2, in which case record K (if it exists) is in the current block, M 

In cases 1 and 2 the search continues at the middle block of the remaining range 

For Ordered Files on average we will require about ⌈log2 B⌉ block accesses 

Hash Files 

• Here one or more fields in a record are used to calculate the location of the record for storage and retrieval 

• Unit of storage is a bucket 

– these contain blocks which store up to bfr records 

• e.g. if bfr = 4 then you can store 4 records max per block 

• A hash function is used to calculate location of bucket 

– Enter record into next free space in block 

– If all the blocks are used up then create an overflow bucket and place record in there 

Ordered Files 

Find cont. 

Given the following blocks, find record with key value, K=300 

block 1 block 2 block 3 block 4 block 5 block 6 

1 2 3 4 5 7 8 11 12 14 16 18 20 30 40 50 100 150 200 300 400 

B = 6 so read mid-block number i.e (1 +6) div 2 = block 3 

300 not in block 3 and K>18 so look between blocks 4 to 6 


20 30 40 50 100 150 200 300 400 

Read new mid-block number (4+6) div 2 = block 5 

300 not in block 5 and K>200 so look at block 6 - found 

Correct block found with 3 block accesses, i.e. ⌈log2 6⌉ = 3 

- quicker than a linear search, which requires 6 block accesses 
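The block-level binary search can be sketched in Python as follows. The way the 21 key values are split across the six blocks is an assumption made for this sketch (the slide only lists the values), and find_block is a name invented here.

```python
def find_block(blocks, key):
    """Binary search over sorted blocks.

    Returns (block_number, block_accesses); block_number is 1-based and is
    None when the key is not in the file.
    """
    low, high = 1, len(blocks)
    accesses = 0
    while low <= high:
        mid = (low + high) // 2          # M = (low + high) div 2
        block = blocks[mid - 1]          # "read" block M
        accesses += 1
        smallest, largest = block[0], block[-1]
        if key < smallest:               # case 1: look in the lower half
            high = mid - 1
        elif key > largest:              # case 2: look in the upper half
            low = mid + 1
        else:                            # case 3: key can only be in block M
            return (mid if key in block else None), accesses
    return None, accesses


# Assumed layout of the example values across 6 blocks.
blocks = [
    [1, 2, 3, 4], [5, 7, 8, 11], [12, 14, 16, 18],
    [20, 30, 40], [50, 100, 150], [200, 300, 400],
]
print(find_block(blocks, 300))
```

With this layout the search for K = 300 reads blocks 3, 5 and 6 in turn, matching the three accesses counted above.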

Hash Files 

Insert 

Assume we have 2 blocks per bucket with bfr = 4 

Also assume we have a hash function on K mod 5 – so could have: 

Bucket 1: 1 6 11 16 21 26 31 36 

Bucket 2: 2 7 12 17 22 

Use overflow buckets when there is no room (collisions) 

e.g. insert 41 and 46 

Bucket 1: 1 6 11 16 21 26 31 36 

Bucket 2: 2 7 12 17 22 

Overflow bucket: 41 46 


Hash Files 

Find 

Use hash function to locate bucket then do a linear search of blocks 

- this is 1 block access or maybe more 

Delete and Reorganise 

Remove from bucket and replace from overflow bucket if necessary 

e.g. delete 46 

Bucket 1: 1 6 11 16 21 26 31 36 

Bucket 2: 2 7 12 17 22 

Overflow bucket: 41 46 
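The insert, find and delete behaviour of a hash file can be sketched with a small Python class. It is a simplified in-memory model under assumptions: each primary bucket gets its own overflow list (the slides draw a single overflow bucket), a bucket holds 2 blocks x bfr 4 = 8 records, and the class and method names are made up for the illustration.

```python
class HashFile:
    """Toy hash file using K mod num_buckets as the hash function."""

    def __init__(self, num_buckets=5, blocks_per_bucket=2, bfr=4):
        self.num_buckets = num_buckets
        self.capacity = blocks_per_bucket * bfr   # records per primary bucket
        self.buckets = {i: [] for i in range(num_buckets)}
        self.overflow = {i: [] for i in range(num_buckets)}

    def _bucket_no(self, key):
        return key % self.num_buckets

    def insert(self, key):
        b = self._bucket_no(key)
        if len(self.buckets[b]) < self.capacity:
            self.buckets[b].append(key)       # next free space in the bucket
        else:
            self.overflow[b].append(key)      # collision: bucket is full

    def find(self, key):
        b = self._bucket_no(key)
        return key in self.buckets[b] or key in self.overflow[b]

    def delete(self, key):
        b = self._bucket_no(key)
        if key in self.overflow[b]:
            self.overflow[b].remove(key)
        elif key in self.buckets[b]:
            self.buckets[b].remove(key)
            if self.overflow[b]:
                # Refill the freed slot from the overflow bucket.
                self.buckets[b].append(self.overflow[b].pop(0))


hf = HashFile()
for k in [1, 6, 11, 16, 21, 26, 31, 36, 2, 7, 12, 17, 22, 41, 46]:
    hf.insert(k)
print(hf.buckets[1], hf.buckets[2], hf.overflow[1])
```

Because 41 and 46 both hash to 1 and bucket 1 already holds eight records, they end up in the overflow area, as in the example above.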

Guidelines for selecting a file organisation 

• Ordered: Supports retrievals based on exact 

key match, pattern matching and range of 

values. 

• However its performance deteriorates as the 

relation is updated (loss of access key 

sequence). 



• Heap (unordered): is a good storage structure in 

the following situations: 

– When every tuple in the relation has to be retrieved 

(in any order) every time the relation is accessed 

– When the relation has an additional access structure, such as an index. 

• Heap files are inappropriate when only selected tuples of a relation are to be accessed. 


• Hash: is a good storage structure when tuples 

are retrieved based on an exact match on the 

hash field value. 

• It is not so good in the following situations: 

– When tuples are retrieved based on a pattern match 

or range of the hash field value. E.g., staffno begins 

with “S1”, or salary in range 1000-2000 

– When tuples are retrieved based on a field other 

than the hash field. 

– When the hash field is frequently updated. 


Step 2.3 Choose Indexes 

To determine whether adding indexes will improve the 

performance of the system. 

• An ordering index may be seen as a data structure 

designed to speed up the access to records in a file using 

an indexing field. An index, being much smaller, will be 

quicker to search than the main file. 

• There are 3 different types of indexes: 

– Primary Index 

– Secondary Index 

– Clustering Index 

• An index can be Multilevel. 

Primary Index 

• A Primary index is an ordered file whose records have 

two fields. The first field is of the same type as the 

ordering key field (in the Data file). The second field is 

a pointer to a disk block. 

• There is one index entry (or index record) in the index 

file for each block in the data file. 

– This is an example of a non-dense (sparse) index. 

– The first record in each block of the data file is called the 

anchor record. 


Sparse or Dense Indexes 

• An index can be sparse or dense: 

– A sparse (non-dense) index has an index record for 

only some of the search key values in the file 

– A dense index has an index record for every search 

key value in the file 

• The search key for an index can consist of one 

or more fields. 
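The difference between the two kinds of index can be sketched in Python on a tiny ordered file. The registration numbers are taken from the Cars example in these slides, but placing two records per block is an assumption made to keep the sketch small, and the function names are invented here.

```python
import bisect

# Ordered data file: blocks of records sorted on the search key.
blocks = [
    ["E246WFC", "F651DEK"],
    ["G123RMR", "G551JBA"],
    ["G889VDU", "G994PBR"],
    ["H203PBR", "H266MHU"],
    ["H311MHG", "H626RPG"],
]

# Dense index: one (key, block number) entry for every search key value.
dense_index = [(key, n) for n, block in enumerate(blocks) for key in block]

# Sparse index: one entry per block, holding the first key in that block.
sparse_index = [(block[0], n) for n, block in enumerate(blocks)]


def lookup_dense(key):
    keys = [k for k, _ in dense_index]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return dense_index[i][1]
    return None                      # absence is known without reading data


def lookup_sparse(key):
    anchors = [k for k, _ in sparse_index]
    i = bisect.bisect_right(anchors, key) - 1   # last anchor <= key
    if i < 0:
        return None
    block_no = sparse_index[i][1]
    # The data block must be read to confirm the key is really there.
    return block_no if key in blocks[block_no] else None


print(len(dense_index), len(sparse_index), lookup_sparse("G994PBR"))
```

The sparse index is smaller, but it only works because the data file is ordered on the indexed field.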

Primary Index Example 

Say we had the following relation: 

Cars(reg#, engine#, make, model, price) 

Assume: 

1. There are 30,000 records 

2. Block size is 1024 bytes 

3. reg# (primary key) is 7 bytes long, the block pointer is 6 bytes long 

4. Record length is 100 bytes 

Therefore, bfr = Int(B/R) = Int(1024/100) = 10, so we need 3000 blocks to store all records, i.e. 30,000/10 

- A binary search on the data file needs 12 block accesses, i.e. ⌈log2 3000⌉ 

The size of an index entry = 7 + 6 = 13 bytes, so bfr_i = Int(1024/13) = 78. Number of index entries = 3000 (one per data block). Number of index blocks = ⌈3000/78⌉ = 39 

- A better option is to use the primary index: only 7 block accesses needed, i.e. ⌈log2 39⌉ + 1 = 6 + 1 (the 1 accounts for reading the data file block) 
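The arithmetic of this example can be checked with a short Python calculation. The constants are the assumptions listed above, and the helper name is chosen for this sketch. Note that ⌈log2 39⌉ is 6, so searching the index and then reading the data block takes 7 accesses in total.

```python
import math

BLOCK_SIZE = 1024      # bytes
RECORD_SIZE = 100      # bytes
NUM_RECORDS = 30_000
KEY_SIZE = 7           # reg#, bytes
POINTER_SIZE = 6       # block pointer, bytes


def binary_search_cost(num_blocks):
    """Block accesses for a binary search over num_blocks blocks."""
    return math.ceil(math.log2(num_blocks))


bfr = BLOCK_SIZE // RECORD_SIZE                       # records per data block
data_blocks = math.ceil(NUM_RECORDS / bfr)            # blocks in the data file

index_bfr = BLOCK_SIZE // (KEY_SIZE + POINTER_SIZE)   # entries per index block
index_entries = data_blocks                           # one entry per data block
index_blocks = math.ceil(index_entries / index_bfr)

cost_without_index = binary_search_cost(data_blocks)
cost_with_index = binary_search_cost(index_blocks) + 1   # +1 for the data block

print(bfr, data_blocks, index_bfr, index_blocks)
print(cost_without_index, cost_with_index)
```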


Primary Index Example cont. 

Primary Index File: 3000 records, 13 bytes each (7-byte key, 6-byte block pointer); 39 blocks, 78 records each 

E246WFC 
G123RMR 
G889VDU 
H203PBR 
H311MHG 
… 

Main Data File: 30,000 records, 100 bytes each; 3000 blocks, 10 records each 

reg#     engine# 
E246WFC  7648378 
F651DEK  3096275 
G123RMR  7493874 
G551JBA  2098377 
G889VDU  6587969 
G994PBR  5675789 
H203PBR  4654786 
H266MHU  6345234 
H311MHG  5675489 
H626RPG  5673455 
… 

Block accesses = ⌈log2 39⌉ + 1 = 7, where the 1 is to read the data block 

(compare with 12 for a binary search) 

Clustering Index Example 

Consider the same relation: Cars(reg#, engine#, make, model, price) 

Assume clustering field is: make 

Clustering Index File (block size: 1024 bytes) 

Each entry holds a clustering field value and a block pointer. Field values: 

Fiat 
Ford 
Renault 
Rover 
Vauxhall 

Main Data File (ordered on the clustering field, make) 

reg#     engine#  make 
G551JBA  7648378  Fiat 
F651DEK  3096275  Ford 
G123RMR  7493874  Ford 
E246WFC  2098377  Ford 
G889VDU  6587969  Renault 
H266MHU  5675789  Renault 
H203PBR  4654786  Renault 
G994PBR  6345234  Rover 
H311MHG  5675489  Rover 
H626RPG  5673455  Vauxhall 


Clustering Index 

• If the records of a file are physically ordered on a non-key field, that field is called the clustering field. 

• A clustering index differs from a primary index, which 

requires that the ordering field of the data file have a 

distinct value for each record. 

• There is one entry in the clustering index for each 

distinct value of the clustering field. This is an example 

of a dense index. 

• A data file can have at most one primary index or one 

clustering index. 

Secondary Index 

• A secondary index is also an ordered file with 

two fields. The first field is of the same type as 

some non-ordering field of the data file. The 

second field is a block pointer or a record 

pointer. 

• There can be many secondary indexes for the 

same data file. 


Example 1: Dense Secondary Index 

Say we had the same relation: 

Cars(reg#, engine#, make, model, price) 

Assume that engine# (secondary key) is 7 bytes long. 

A secondary key field has a distinct value for each data record. 

Hence, the secondary index will be dense. 

We want to know whether it would be better to: 

- use a secondary index file constructed with engine# as the secondary index key, or 

- perform a simple linear search on the data file (cost = 1500 accesses on average, i.e. half of the 3000 blocks). 
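The comparison can be worked through in Python using the same assumptions as the primary index example (1024-byte blocks, 100-byte records, 30,000 records, 7-byte key plus 6-byte pointer). The variable names are chosen for this sketch. The difference from the primary index is that a dense secondary index needs one entry per record rather than one per block.

```python
import math

BLOCK_SIZE = 1024
RECORD_SIZE = 100
NUM_RECORDS = 30_000
ENTRY_SIZE = 7 + 6          # engine# value + pointer, in bytes

data_blocks = math.ceil(NUM_RECORDS / (BLOCK_SIZE // RECORD_SIZE))

# Dense index: one entry for every record in the data file.
index_bfr = BLOCK_SIZE // ENTRY_SIZE
index_blocks = math.ceil(NUM_RECORDS / index_bfr)

linear_search_cost = data_blocks // 2                           # average b/2
index_search_cost = math.ceil(math.log2(index_blocks)) + 1      # +1 data block

print(index_blocks, linear_search_cost, index_search_cost)
```

So the index costs about 10 block accesses per lookup against roughly 1500 for the linear search, at the price of storing and maintaining 385 extra blocks.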

Example 2: Non-dense Secondary Index 

• We can also create a secondary index on a non-key field 

of a file. In this case, many records in the data file can 

have the same value for the indexing field. There are 

several options for implementing such an index. 

• The option which is more commonly used is to have a 

single entry for each index field value, but to create an 

extra level of indirection. In this non-dense scheme, the 

secondary index pointer will point to a block of record 

pointers; each record pointer points to one record in the 

data file. 
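The extra level of indirection can be sketched in Python with the ten sample Cars records used in these slides. The in-memory lists and dictionaries stand in for disk blocks, and the names are invented for the illustration.

```python
# Data file, ordered on reg# (not on make): (reg#, engine#, make).
data_file = [
    ("E246WFC", "7648378", "Fiat"),
    ("F651DEK", "3096275", "Rover"),
    ("G123RMR", "7493874", "Ford"),
    ("G551JBA", "2098377", "Rover"),
    ("G889VDU", "6587969", "Renault"),
    ("G994PBR", "5675789", "Vauxhall"),
    ("H203PBR", "4654786", "Rover"),
    ("H266MHU", "6345234", "Ford"),
    ("H311MHG", "5675489", "Renault"),
    ("H626RPG", "5673455", "Fiat"),
]

# Level of indirection: for each distinct make, a block of record pointers
# (here, positions in data_file).
pointer_blocks = {}
for record_no, (_reg, _engine, make) in enumerate(data_file):
    pointer_blocks.setdefault(make, []).append(record_no)

# The secondary index itself has a single entry per distinct field value.
secondary_index = sorted(pointer_blocks)


def find_by_make(make):
    """Index entry -> block of record pointers -> records in the data file."""
    return [data_file[p] for p in pointer_blocks.get(make, [])]


print(secondary_index)
print([reg for reg, _, _ in find_by_make("Rover")])
```

The index stays small (five entries here) however many cars share a make, because the repeated values live in the pointer blocks.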


Example 1: Dense Secondary Index 



Secondary Index File (engine# values in sorted order, each with a pointer) 

2098377 
3096275 
4654786 
5673455 
5675489 
5675789 
6345234 
6587969 
7493874 
7648378 

Main Data File 

reg#     engine# 
E246WFC  7648378 
F651DEK  3096275 
G123RMR  7493874 
G551JBA  2098377 
G889VDU  6587969 
G994PBR  5675789 
H203PBR  4654786 
H266MHU  6345234 
H311MHG  5675489 
H626RPG  5673455 

Block accesses = ⌈log2 385⌉ + 1 = 10, where the 1 is to read the data block 

(Better than a linear search on the data file, which needs 1500 accesses) 

Example 2: Non-dense Secondary Index 


Secondary Index File: 1 block, 30 records in it 

Each entry holds a field value and a pointer to a block of record pointers. Field values: 

Fiat 
Ford 
Renault 
Rover 
Vauxhall 

Main Data File: 3000 blocks, 10 records each 

reg#     engine#  make 
E246WFC  7648378  Fiat 
F651DEK  3096275  Rover 
G123RMR  7493874  Ford 
G551JBA  2098377  Rover 
G889VDU  6587969  Renault 
G994PBR  5675789  Vauxhall 
H203PBR  4654786  Rover 
H266MHU  6345234  Ford 
H311MHG  5675489  Renault 
H626RPG  5673455  Fiat 

Block accesses = ⌈log2 1⌉ + 1 + 2 = 3 (to read block and pointer) 

(Better than a linear search on the data file, which needs 1500 accesses) 



• An index can be created in SQL using the CREATE INDEX statement. 

• To create a primary index: 

CREATE UNIQUE INDEX indexname ON table(attribute); 

• To create a clustering index: 

CREATE INDEX indexname ON table(attribute) CLUSTER; 

• To create a secondary index: 

CREATE INDEX indexname ON table(attribute); 
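These statements can be tried out with SQLite through Python's sqlite3 module, used here only as a convenient example DBMS. SQLite has no CLUSTER option, so only the unique and the secondary index are shown. The column names reg_no and engine_no (in place of reg# and engine#) and the index names are assumptions of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Cars (reg_no TEXT, engine_no TEXT, make TEXT, model TEXT, price REAL)"
)

# Unique index on the key attribute.
conn.execute("CREATE UNIQUE INDEX idx_cars_reg ON Cars(reg_no)")

# Secondary index on a non-key attribute.
conn.execute("CREATE INDEX idx_cars_make ON Cars(make)")


def query_plan(sql, params=()):
    """Return the optimiser's plan for a query as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)   # last column is the detail text


plan = query_plan("SELECT * FROM Cars WHERE make = ?", ("Ford",))
print(plan)
```

Inspecting the plan is a simple way to confirm that a query actually uses an index before deciding to keep it.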

Guidelines for selecting indexes: “wish-list” 

1. Do not index small relations (more efficient to search the 

relation in memory) 

2. Index the primary key of a relation if it is not a key of the file 

organisation 

3. Add a secondary index to a foreign key if it is frequently 

accessed 

4. Add a secondary index to any attribute that is heavily used as a 

secondary key 

5. Add a secondary index on attributes that are frequently involved 

in: selection (WHERE) or join criteria, ORDER BY, GROUP 

BY, sorting (e.g., DISTINCT…). 

6. Avoid indexing an attribute or relation that is frequently updated 

7. Avoid indexing attributes that consist of long character strings 

8. Avoid indexing an attribute if the query will retrieve a 

significant proportion (e.g., 25%) of the tuples in the relation 



• One approach is to keep tuples unordered and create as 

many secondary indexes as necessary. 

• Another approach is to order tuples in the relation by 

specifying a primary or clustering index. 

• In this case, choose the attribute for ordering or 

clustering the tuples as: 

– attribute that is used most often for join operations - this 

makes join operation more efficient, or 

– attribute that is used most often to access the tuples in a 

relation in order of that attribute. 

Removing indexes from the “wish-list” 

• Overhead involved in maintenance and use of secondary 

indexes: 

– adding an index record to every secondary index whenever tuple 

is inserted; 

– updating a secondary index when corresponding tuple is 

updated; 

– increase in disk space needed to store the secondary index; 

– possible performance degradation during query optimization to 

consider all secondary indexes. 

• It is a good idea to experiment to determine whether an index is improving performance, providing very little improvement, or adversely impacting performance. 

• Some DBMSs allow such experiments, e.g., MS Access 

has a Performance Analyser. 
