ECON 381 SC Foundations Of Economic Analysis
2009
John Hillas and Dmitriy Kvasov
University of Auckland
Contents

Chapter 1. Logic, Sets, Functions, and Spaces
1. Logic
2. Sets
3. Binary Relations
4. Functions
5. Spaces
6. Metric Spaces and Continuous Functions
7. Open Sets, Compact Sets, and the Weierstrass Theorem
8. Sequences and Subsequences
9. Linear Spaces

Chapter 2. Linear Algebra
1. The Space R^n
2. Linear Functions from R^n to R^m
3. Matrices and Matrix Algebra
4. Matrices as Representations of Linear Functions
5. Linear Functions from R^n to R^n and Square Matrices
6. Inverse Functions and Inverse Matrices
7. Changes of Basis
8. The Trace and the Determinant
9. Calculating and Using Determinants
10. Eigenvalues and Eigenvectors

Chapter 3. Consumer Behaviour: Optimisation Subject to the Budget Constraint
1. Constrained Maximisation
2. The Implicit Function Theorem
3. The Theorem of the Maximum
4. The Envelope Theorem
5. Applications to Microeconomic Theory

Chapter 4. Topics in Convex Analysis
1. Convexity
2. Support and Separation
CHAPTER 1

Logic, Sets, Functions, and Spaces

1. Logic
All the aspects of logic that we describe in this section are part of what is called propositional (or sentential) logic.
We start by supposing that we have a number of atomic statements, which we denote by lower case letters, p, q, r. Examples of such statements might be

  Consumer 1 is a utility maximiser
  the apple is green
  the price of good 3 is 17.

We assume that each atomic statement is either true or false.
Given these atomic statements we can form other statements using logical connectives. If p is a statement then ¬p, read not p, is the statement that is true precisely when p is false. If both p and q are statements then p ∧ q, read p and q, is the statement that is true when both p and q are true and false otherwise. If both p and q are statements then p ∨ q, read p or q, is the statement that is true when either p or q is true, that is, the statement that is false only if both p and q are false.
We could make do with these three symbols together with brackets to group symbols and tell us what to do first. For example we could have the complicated statement ((p ∧ q) ∨ (p ∧ r)) ∨ ¬s. This means that at least one of two statements is true. The first is that either both p and q are true or both p and r are true. The second is that s is not true.
Exercise 1. Think about the meaning of the statement we have just considered. Can you see a more straightforward statement that would mean the same thing?
While we don’t strictly need any more symbols it is certainly convenient to have at least a couple more. If both p and q are statements then p ⇒ q, read if p then q or p implies q or p is sufficient for q or q is necessary for p, is the statement that is false when p is true and q is false and is true otherwise. Many people find this a bit nonintuitive. In particular, one might wonder about the truth of this statement when p is false and q is true. A simple (and correct) answer is that this is a definition. It is simply what we mean by the symbol and there isn’t any point in arguing about definitions. However there is a sense in which the definition is what is implied by the informal statements. When we say “if p then q” we are saying that in any situation or state in which p is true, q is also true. We are not making any claim about what might or might not be the case when p is not true. So, in states in which p is not true we make no claim about q and so our statement is true whether q is true or false. Instead of p ⇒ q we can write q ⇐ p. In this case we are most likely to read the statement as q if p.
If p ⇒ q and p ⇐ q (that is, q ⇒ p) then we say that p if and only if q or p is necessary and sufficient for q and write p ⇔ q.
One powerful method of analysing logical relationships is by means of truth tables. A truth table lists all possible combinations of the truth values of the atomic statements and the associated truth values of the compound statements. If we have two atomic statements then the following table gives the four possible combinations of truth values.

  p  q
  T  T
  F  T
  T  F
  F  F

Now, we can add a column that would, for each combination of truth values of p and q, give the truth value of p ⇒ q, just as described above.

  p  q  p ⇒ q
  T  T    T
  F  T    T
  T  F    F
  F  F    T
Such truth tables allow us to see the logical relationship between various statements. Suppose we have two compound statements A and B and we form a truth table showing the truth values of A and B for each possible profile of truth values of the atomic statements that constitute A and B. If in each row in which A is true B is also true then statement A implies statement B. If statements A and B have the same truth value in each row then statements A and B are logically equivalent.
For example I claim that the statement p ⇒ q we have just considered is logically equivalent to ¬p ∨ q. We can see this by adding columns to the truth table we have just considered. Let me add a column for ¬p and then one for ¬p ∨ q. (We add the column for ¬p only to make the calculation easier.)

  p  q  p ⇒ q  ¬p  ¬p ∨ q
  T  T    T     F     T
  F  T    T     T     T
  T  F    F     F     F
  F  F    T     T     T

Since the third column and the fifth column contain exactly the same truth values we see that the two statements, p ⇒ q and ¬p ∨ q, are indeed logically equivalent.
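The row-by-row comparison is mechanical enough to automate. The following short Python sketch (our own illustration, not part of the notes; the helper name `implies` is ours) enumerates every truth-value profile and confirms that p ⇒ q and ¬p ∨ q agree in each row:

```python
from itertools import product

# p => q is false exactly when p is true and q is false.
def implies(p, q):
    return not (p and not q)

# All four rows of the truth table for two atomic statements.
rows = list(product([True, False], repeat=2))

# Compare p => q with ¬p ∨ q in every row.
equivalent = all(implies(p, q) == ((not p) or q) for p, q in rows)
print(equivalent)  # True
```

The same loop, with a different expression on each side, can check any claimed equivalence or tautology over a handful of atomic statements.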
Exercise 2. Construct the truth table for the statement ¬(¬p ∨ ¬q). Is it possible to write this statement using fewer logical connectives? Hint: why not start with just one?

Exercise 3. Prove that the following statements are equivalent:
(i) (p ∨ ¬q) ⇒ ((¬p) ∧ q) and ¬(q ⇒ p),
(ii) p ⇒ q and ¬q ⇒ ¬p.

In part (ii) the second statement is called the contrapositive of the first statement. Often if you are asked to prove that p implies q it will be easier to show the contrapositive, that is, that not q implies not p.

Exercise 4. Prove that the following statements are equivalent:
(i) ¬(p ∧ q) and ¬p ∨ ¬q,
(ii) ¬(p ∨ q) and ¬p ∧ ¬q.
These two equivalences are known as De Morgan’s Laws.
A tautology is a statement that is necessarily true. For example if the statements A and B are logically equivalent then the statement A ⇔ B is a tautology. If A logically implies B then A ⇒ B is a tautology. We can check whether a compound statement is a tautology by writing a truth table for this statement. If the statement is a tautology then its truth value should be T in each row of its truth table.
A contradiction is a statement that is necessarily false, that is, a statement A such that ¬A is a tautology. Again, we can see whether a statement is a contradiction by writing a truth table for the statement.
2. Sets

Set theory was developed in the second half of the 19th century and is at the very foundation of modern mathematics. But we shall not be concerned here with the development of the theory. Rather we shall only give the basic language of set theory and outline some of the very basic operations on sets.
We start by defining a set to be a collection of objects or elements. We will usually denote sets by capital letters and their elements by lower case letters. If the element a is in the set A we write a ∈ A. If every element of the set B is also in the set A we call B a subset of the set A and write B ⊂ A. We shall also say that A contains B. If A and B have exactly the same elements then we say they are equal or identical. Alternatively we could say A = B if and only if A ⊂ B and B ⊂ A. If B ⊂ A and B ≠ A then we say that B is a proper subset of A or that A strictly contains B.

Exercise 5. How many subsets does a set with N elements have?

In order to avoid paradoxes, such as Russell’s paradox, we shall always assume that in whatever situation we are discussing there is some given set U called the universal set which contains all of the sets with which we shall deal.
We customarily enclose our specification of a set by braces. In order to specify a set one may simply list the elements. For example to specify the set D which contains the numbers 1, 2, and 3 we may write D = {1, 2, 3}. Alternatively we may define the set by specifying a property that identifies the elements. For example we may specify the same set D by D = {x | x is an integer and 0 < x < 4}. Notice that this second method is more powerful. We could not, for example, list all the integers. (Since there are an infinite number of them we would die before we finished.)
For any two sets A and B we define the union of A and B to be the set which contains exactly all of the elements of A and all the elements of B. We denote the union of A and B by A ∪ B. Similarly we define the intersection of A and B to be that set which contains exactly those elements which are in both A and B. We denote the intersection of A and B by A ∩ B. Thus we have

A ∪ B = {x | x ∈ A or x ∈ B}
A ∩ B = {x | x ∈ A and x ∈ B}.

Exercise 6. Are the oldest mathematician among chess players and the oldest chess player among mathematicians the same person or (possibly) different people?

Exercise 7. Are the best mathematician among chess players and the best chess player among mathematicians the same person or (possibly) different people?
Exercise 8. Every tenth mathematician is a chess player and every fourth chess player is a mathematician. Are there more mathematicians or chess players, and by what factor?

Exercise 9. Prove the distributive laws for the operations of union and intersection.
(i) (A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C)
(ii) (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C)

Just as the number zero is extremely useful, so the concept of a set that has no elements is extremely useful also. This set we call the empty set or the null set and denote by ∅. To see one use of the empty set notice that having such a concept allows the intersection of two sets to be well defined whether or not the sets have any elements in common.
We also introduce the concept of a Cartesian product. If we have two sets, say A and B, the Cartesian product, A × B, is the set of all ordered pairs, (a, b), such that a is an element of A and b is an element of B. Symbolically we write

A × B = {(a, b) | a ∈ A and b ∈ B}.
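These operations are easy to experiment with on small finite sets. The sketch below (the particular sets A, B, C are our own illustrative choices) exercises union, intersection, and the Cartesian product, and checks the distributive laws of Exercise 9 on these particular sets; such a check is evidence, not a proof:

```python
from itertools import product

A = {1, 2, 3}
B = {3, 4}
C = {2, 4, 5}

# Union and intersection, exactly as defined above.
assert A | B == {1, 2, 3, 4}   # A ∪ B
assert A & B == {3}            # A ∩ B

# The distributive laws of Exercise 9, checked on these particular sets.
assert (A & B) | C == (A | C) & (B | C)
assert (A | B) & C == (A & C) | (B & C)

# The Cartesian product A × B as a set of ordered pairs (a, b).
AxB = set(product(A, B))
print(len(AxB))  # 6, i.e. |A| times |B|
```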
3. Binary Relations

There are a number of ways of formulating the notion of a binary relation. We shall pursue one, defining a binary relation on a set X simply as a subset of X × X, the Cartesian product of X with itself.

Definition 1. A binary relation R on the set X is a subset of X × X. If the point (x, y) ∈ R we shall often write xRy instead of (x, y) ∈ R.

Since we have already defined the notions of Cartesian product and subset, there is really nothing new here. However the structure and properties of binary relations that we shall now study are motivated by the informal notion of a “relation” between the elements of X.

Example 1. Suppose that X is a set of boys and girls and the relation xSy is “x is a sister of y.”

Example 2. Suppose that X is the set of natural numbers X = {1, 2, 3, . . . }. There are binary relations >, ≥, and =.

Example 3. Suppose that X is the set of natural numbers X = {1, 2, 3, . . . }. The relations R, P, and I are defined by
xRy if and only if x + 1 ≥ y,
xPy if and only if x > y + 1, and
xIy if and only if −1 ≤ x − y ≤ 1.

Definition 2. The following properties of binary relations have been defined and found to be useful.
(BR1) Reflexivity: For all x in X, xRx.
(BR2) Irreflexivity: For all x in X, not xRx.
(BR3) Completeness: For all x and y in X either xRy or yRx (or both).¹
(BR4) Transitivity: For all x, y, and z in X if xRy and yRz then xRz.
(BR5) Negative Transitivity: For all x, y, and z in X if xRy then either xRz or zRy (or both).
(BR6) Symmetry: For all x and y in X if xRy then yRx.
(BR7) Anti-Symmetry: For all x and y in X if xRy and yRx then x = y.
(BR8) Asymmetry: For all x and y in X if xRy then not yRx.

¹ We shall always implicitly include “or both” when we say “either. . . or.”
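On a finite piece of X each of these properties can be checked by brute force. The sketch below (the helper functions are our own) tests the relations R, P, and I of Example 3 on the finite set {1, . . . , 6}; a property verified on a finite subset is only evidence about the infinite set, though a failure there is a genuine counterexample:

```python
# Property checkers for a binary relation rel on a finite set X,
# following (BR1)-(BR8) above.
def complete(rel, X):
    return all(rel(x, y) or rel(y, x) for x in X for y in X)

def transitive(rel, X):
    return all(not (rel(x, y) and rel(y, z)) or rel(x, z)
               for x in X for y in X for z in X)

def symmetric(rel, X):
    return all(not rel(x, y) or rel(y, x) for x in X for y in X)

def asymmetric(rel, X):
    return all(not (rel(x, y) and rel(y, x)) for x in X for y in X)

X = range(1, 7)  # a finite piece of the natural numbers of Example 3
R = lambda x, y: x + 1 >= y
P = lambda x, y: x > y + 1
I = lambda x, y: -1 <= x - y <= 1

print(complete(R, X), transitive(R, X))    # R is complete but not transitive
print(asymmetric(P, X), transitive(P, X))  # P is asymmetric and transitive
print(symmetric(I, X), transitive(I, X))   # I is symmetric but not transitive
```

For instance 1R2 and 2R3 hold but 1R3 fails, which is the failure of transitivity the last column reports.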
Exercise 10. Show that completeness implies reflexivity, that asymmetry implies anti-symmetry, and that asymmetry implies irreflexivity.

Exercise 11. Which properties does the relation described in Example 1 satisfy?

Exercise 12. Which properties do the relations described in Example 2 satisfy?

Exercise 13. Which properties do the relations described in Example 3 satisfy?
We now define a few particularly important classes of binary relations.

Definition 3. A weak order is a binary relation that satisfies transitivity and completeness.

Definition 4. A strict partial order is a binary relation that satisfies transitivity and asymmetry.

Definition 5. An equivalence is a binary relation that satisfies transitivity and symmetry.

You have almost certainly already met examples of such binary relations in your study of Economics. We normally assume that the weak preference, strict preference, and indifference relations of a consumer are weak orders, strict partial orders, and equivalences, though we actually typically assume a little more about the strict preference.
The following construction is also motivated by the idea of preference. Let us consider some binary relation R which we shall informally think of as a weak preference relation, though we shall not, for the moment, make any assumptions about the properties of R. Consider the relations P defined by xPy if and only if xRy and not yRx, and I defined by xIy if and only if xRy and yRx.

Exercise 14. Show that if R is a weak order then P is a strict partial order and I is an equivalence.

We could also think of starting with a strict preference P and defining the weak preference R in terms of P. We could do so either by defining R as xRy if and only if not yPx, or by defining R as xRy if and only if either xPy or not yPx.

Exercise 15. Show that these two definitions of R coincide if P is asymmetric.

Exercise 16. Show by example that P may be a strict partial order (so, by the previous result, the two definitions of R coincide) but R not a weak order. [Hint: If you cannot think of another example consider the binary relations defined in Example 3.]

Exercise 17. Show that if P is asymmetric and negatively transitive then
(i) P is transitive (and hence a strict partial order), and
(ii) R is a weak order.
4. Functions

Let X and Y be two sets. A function (or a mapping) f from the set X to the set Y is a rule that assigns to each x in X a unique element in Y, denoted by f(x). The notation

f : X → Y

is standard. The set X is called the domain of f and the set Y is called the codomain of f. The set of all values taken by f, i.e., the set

{y ∈ Y | there exists x in X such that y = f(x)}

is called the range of f. The range of a function need not coincide with its codomain Y.
There are several useful ways of visualising functions. A function can be thought of as a machine that operates on elements of the set X and transforms an input x into a unique output f(x). Note that the machine is not required to produce different outputs from different inputs. This analogy helps to distinguish between the function itself, f, and its particular value, f(x). The former is the machine, the latter is the output!² One of the reasons for this confusion is that in practice, to avoid being verbose, people often say things like ‘consider a function U(x, y) = x^α y^β’ instead of saying ‘consider a function defined for every pair (x, y) in R^2 by the equation U(x, y) = x^α y^β’.
A function can also be thought of as a transformation, or a mapping, of the set X into the set Y. In line with this interpretation is the common terminology: it is said that f(x) is the image of x under the function f. Again, it is important to remember that there may be points of Y which are the images of no point of X and that there may be different points of X which have the same images in Y. What is absolutely prohibited, however, is for a point from X to have several images in Y!
Part of the definition of a function is the specification of its domain. However, in applications, functions are quite often defined by an algebraic formula, without explicit specification of the domain. For example, a function may be defined as

f(x) = sin x + 145x^2.

The function f is then the rule that assigns the value sin x + 145x^2 to each value of x. The convention in such cases is that the domain of f is the set of all values of x for which the formula gives a unique value. Thus, if you come, for instance, across the function f(x) = 1/x you should assume that its domain is (−∞, 0) ∪ (0, ∞), unless specified otherwise.
For any subset A of X, the subset f(A) of Y consisting of those y such that y = f(x) for some x in A is called the image of A under f, that is,

f(A) = {y ∈ Y | there exists x in A such that y = f(x)}.

Thus, the range of f can be written as f(X). Similarly, one can define the inverse image. For any subset B of Y, the inverse image f⁻¹(B) of B is the set of x in X such that f(x) is in B, that is,

f⁻¹(B) = {x ∈ X | f(x) ∈ B}.
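For finite sets these definitions translate directly into set comprehensions. The sketch below (our own illustration, with f(x) = x² on a small domain chosen by us) computes an image and an inverse image, and shows that different points of X may share an image:

```python
X = {-2, -1, 0, 1, 2}
f = lambda x: x * x  # f : X -> Y, here squaring

def image(f, A):
    # f(A) = {y | y = f(x) for some x in A}
    return {f(x) for x in A}

def inverse_image(f, B, X):
    # f^{-1}(B) = {x in X | f(x) in B}
    return {x for x in X if f(x) in B}

print(image(f, {-1, 1, 2}))      # {1, 4}
print(inverse_image(f, {1}, X))  # {-1, 1}: two points with the same image
print(image(f, X))               # the range f(X) = {0, 1, 4}
```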
A function f is called a function onto Y (or a surjection) if the range of f is Y, i.e., if for every y ∈ Y there is (at least) one x ∈ X such that y = f(x). In other words, each element of Y is the image of (at least) one element of X. A function f is called one-to-one (or an injection) if f(x_1) = f(x_2) implies x_1 = x_2, that is, for every element y of f(X) there is a unique element x of X such that y = f(x). In other words, a one-to-one function maps different elements of X into different elements of Y. When a function f : X → Y is both onto and one-to-one it is called a bijection.
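On finite sets both properties are decidable by inspection. A minimal sketch (helper names and the example function are ours): a function is injective exactly when its list of values has no repeats, and surjective exactly when its range equals the codomain:

```python
def is_injective(f, X):
    # No two elements of X may share an image.
    values = [f(x) for x in X]
    return len(values) == len(set(values))

def is_surjective(f, X, Y):
    # The range f(X) must be all of Y.
    return {f(x) for x in X} == set(Y)

square = lambda x: x * x

print(is_injective(square, {0, 1, 2}))        # True on this domain
print(is_surjective(square, {0, 1, 2}, {0, 1, 4}))  # True: every y is hit
print(is_injective(square, {-1, 0, 1}))       # False: -1 and 1 share an image
```

Note how the verdict depends on the domain and codomain, not just the formula; the same rule x ↦ x² is injective on one domain and not on another.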
Exercise 18. Suppose that a set X has m elements and a set Y has n ≥ m elements. How many different functions are there from X to Y? From Y to X? How many of them are surjective? How many are injective? How many are bijective?
² Mathematician Robert Bartle put it as follows: “Only a fool would confuse a sausage-grinder with a sausage; however, enough people have confused functions with their values...”
Exercise 19. Find a function f : N → N which is
(i) surjective but not injective,
(ii) injective but not surjective,
(iii) neither surjective nor injective,
(iv) bijective.
If a function f is a bijection then it is possible to define a function g : Y → X such that g(y) = x where y = f(x). Thus, to each element y of Y is assigned the element x in X whose image under f is y. Since f is onto, g is defined for every y of Y, and since f is one-to-one, g(y) is unique. The function g is called the inverse of f and is usually written as f⁻¹. In that case, however, it’s not immediately clear what f⁻¹(x) means. Is it the inverse image of x under f or the image of x under f⁻¹? Happily enough they are the same if f⁻¹ exists!

Exercise 20. Prove that when a function f⁻¹ exists it is both onto and one-to-one and that the inverse of f⁻¹ is the function f itself.

If f : X → Y and g : Y → Z, then the function h : X → Z, defined as h(x) = g(f(x)), is called the composition of g with f and denoted by g ◦ f. Note that even if f ◦ g is well defined it is usually different from g ◦ f.
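A finite bijection can be stored as a dictionary of pairs, and then its inverse is literally the set of flipped pairs, which makes the identities f⁻¹ ◦ f = id and f ◦ f⁻¹ = id easy to verify. A small sketch (the particular sets and values are our own illustration):

```python
# A finite bijection f : {1,2,3} -> {'a','b','c'}, stored explicitly.
f = {1: 'a', 2: 'b', 3: 'c'}
f_inv = {y: x for x, y in f.items()}  # the inverse just flips each pair

assert all(f_inv[f[x]] == x for x in f)      # f^{-1} after f is the identity on X
assert all(f[f_inv[y]] == y for y in f_inv)  # f after f^{-1} is the identity on Y

# Composition h = g ◦ f, i.e. h(x) = g(f(x)).
g = {'a': 10, 'b': 20, 'c': 30}  # g : Y -> Z
h = {x: g[f[x]] for x in f}
print(h)  # {1: 10, 2: 20, 3: 30}
```

Flipping the pairs of a non-injective function would assign two preimages to one key, which is exactly why only bijections have inverses.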
Exercise 21. Let f : X → Y. Prove that there exist a surjection g : X → A where A ⊆ X and an injection h : A → Y such that f = h ◦ g. In other words, prove that any function can be written as a composition of a surjection and an injection.

The set G ⊂ X × Y of ordered pairs (x, f(x)) is called the graph of the function f.³ Of course, the fact that something is called a graph does not necessarily mean that it can be drawn!
5. Spaces

Sets are reasonably interesting mathematical objects to study. But to make them even more interesting (and useful for applications) sets are usually endowed with some additional properties, or structures. These new objects are called spaces. The structures are often modeled after the familiar properties of the space we live in and reflect (in axiomatic form) such notions as order, distance, addition, multiplication, etc.
Probably one of the most intuitive spaces is the space of the real numbers, R. We will briefly look at the axiomatic way of describing some of its properties.
Given the set of real numbers R, the operation of addition is the function + : R × R → R that maps any two elements x and y in R to an element denoted by x + y and called the sum of x and y. Addition satisfies the following axioms for all real numbers x, y, and z.

A1: x + y = y + x.
A2: (x + y) + z = x + (y + z).
A3: There exists an element, denoted by 0, such that x + 0 = x.
A4: For each x there exists an element, denoted by −x, such that x + (−x) = 0.

All the remaining properties of addition can be proven using these axioms. Note also that we can define another operation x − y as x + (−y) and call it subtraction.

³ Some people like the idea of the graph of a function so much that they define a function to be its graph.
Exercise 22. Prove that the axioms for addition imply the following statements.
(i) The element 0 is unique.
(ii) If x + y = x + z then y = z (a cancellation law).
(iii) −(−x) = x.
The operation of multiplication can be axiomatised in a similar way. Given the set of real numbers, R, the operation of multiplication is the function · : R × R → R that maps any two elements x and y in R to an element denoted by x · y and called the product of x and y. Multiplication satisfies the following axioms for all real numbers x, y, and z.

A5: x · y = y · x.
A6: (x · y) · z = x · (y · z).
A7: There exists an element, denoted by 1, such that x · 1 = x.
A8: For each x ≠ 0 there exists an element, denoted by x⁻¹, such that x · x⁻¹ = 1.

One more axiom (a distributive law) brings these two operations, addition and multiplication⁴, together.

A9: x(y + z) = xy + xz for all x, y, and z in R.

Another structure possessed by the real numbers has to do with the fact that the real numbers are ordered. The notion of x less than y can be axiomatised as follows. For any two distinct elements x and y either x < y or y < x and, in addition, if x < y and y < z then x < z.
Another example of a space (a very important and useful one) is n-dimensional real space⁵. Given a natural number n, define R^n to be the set of all possible ordered n-tuples of real numbers, with generic element denoted by x = (x_1, . . . , x_n). Thus, the space R^n is the n-fold Cartesian product of the set R with itself. The real numbers x_1, . . . , x_n are called the coordinates of the vector x. Two vectors x and y are equal if and only if x_1 = y_1, . . . , x_n = y_n. The operation of addition of two vectors is defined as

x + y = (x_1 + y_1, . . . , x_n + y_n).

Exercise 23. Prove that the addition of vectors in R^n satisfies the axioms of addition.

The role of multiplication in this space is played by the operation of multiplication by a real number, defined for all x in R^n and all α in R by

αx = (αx_1, . . . , αx_n).

Exercise 24. Prove that multiplication by a real number satisfies a distributive law.
6. Metric Spaces and Continuous Functions

The notion of a metric is the generalisation of the notion of distance between two real numbers.
Let X be a set and d : X × X → R a function. The function d is called a metric if it satisfies the following properties for all x, y, and z in X.

1. d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y,
2. d(x, y) = d(y, x),
3. d(x, y) ≤ d(x, z) + d(z, y).

⁴ From now on, to go easy on notation, we will follow the standard convention of not writing the symbol for multiplication, that is, of writing xy instead of x · y, etc.
⁵ We haven’t defined what the word dimension means yet, so just treat it as a (fancy) name.
The set X together with the function d is called a metric space, elements of X are usually called points, and the number d(x, y) is called the distance between x and y. The last property of a metric is called the triangle inequality.

Exercise 25. Let X be a non-empty set and d : X × X → R be a function that satisfies the following two properties for all x, y, and z in X.
(i) d(x, y) = 0 if and only if x = y,
(ii) d(x, y) ≤ d(x, z) + d(y, z).
Prove that d is a metric.

Exercise 26. Prove that d(x, y) + d(w, z) ≤ d(x, w) + d(x, z) + d(y, w) + d(y, z) for all x, y, w, and z in X, where d is some metric on X.
An obvious example of a metric space is the set of real numbers, R, together with the ‘usual’ distance, d(x, y) = |x − y|. Another example is n-dimensional Euclidean space R^n with the metric

d(x, y) = √((x_1 − y_1)² + · · · + (x_n − y_n)²).

Note that the same set can be endowed with different metrics, resulting in different metric spaces! For example, the set of all n-tuples of real numbers can be made into a metric space by use of the (non-Euclidean) metric

d_T(x, y) = |x_1 − y_1| + · · · + |x_n − y_n|,

which gives a metric space different from Euclidean R^n. This metric is sometimes called the Manhattan (or taxicab) metric. Another curious metric is the so-called French railroad metric, defined by

d_F(x, y) = 0 if x = y, and d_F(x, y) = d(x, P) + d(y, P) if x ≠ y,

where P is a particular point of R^n (called Paris) and the function d is the Euclidean distance.
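All three metrics are one-liners on tuples. In the sketch below (our own code; we place Paris at the origin purely for illustration) the same pair of points is measured three different ways, which makes concrete the claim that one set can carry several metric-space structures:

```python
import math

def d_euclid(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def d_taxi(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

PARIS = (0.0, 0.0)  # the distinguished point P, chosen here for illustration

def d_french(x, y):
    if x == y:
        return 0.0
    # Any trip between distinct points goes through Paris.
    return d_euclid(x, PARIS) + d_euclid(y, PARIS)

a = (3.0, 4.0)
print(d_euclid(a, (0.0, 0.0)))   # 5.0
print(d_taxi(a, (0.0, 0.0)))     # 7.0
print(d_french(a, (3.0, 4.0)))   # 0.0: the points coincide
print(d_french(a, (1.0, 0.0)))   # 6.0: 5 to Paris plus 1 from Paris
```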
Exercise 27. Prove that the French railroad metric d_F is a metric.

Exercise 28. Let X be a non-empty set and d : X × X → R be the function defined by

d(x, y) = 1 if x ≠ y, and d(x, y) = 0 if x = y.

Prove that d is a metric. (This metric is called the discrete metric.)
Using the notion of a metric it is possible to generalise the idea of a continuous function.
Suppose (X, d_X) and (Y, d_Y) are metric spaces, x_0 ∈ X, and f : X → Y is a function. Then f is continuous at x_0 if for every ε > 0 there exists a δ > 0 such that

d_Y(f(x_0), f(x)) < ε

for all points x ∈ X for which d_X(x_0, x) < δ.
The function f is continuous on X if f is continuous at every point of X.
Let’s prove that the function f(x) = x is continuous on R using the above definition. For all x_0 ∈ R, we have |f(x_0) − f(x)| = |x_0 − x| < ε as long as |x_0 − x| < δ = ε. That is, given any ε > 0 we are always able to find a δ, namely δ = ε, such that all points which are closer to x_0 than δ will have images which are closer to f(x_0) than ε.
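The argument above can be probed numerically. The sketch below (our own helper; it samples finitely many points, so it is supporting evidence for the ε-δ argument rather than a proof of it) takes f(x) = x and the choice δ = ε, samples points strictly inside (x_0 − δ, x_0 + δ), and confirms that every sampled image lies within ε of f(x_0):

```python
f = lambda x: x  # the identity function, as in the proof above

def check_continuity_at(f, x0, eps, delta, samples=1000):
    # Sample points strictly inside (x0 - delta, x0 + delta) and test
    # whether every image lies within eps of f(x0).
    step = 2 * delta / samples
    points = [x0 - delta + (k + 0.5) * step for k in range(samples)]
    return all(abs(f(x0) - f(x)) < eps for x in points)

print(check_continuity_at(f, x0=2.0, eps=0.1, delta=0.1))  # True
```

Running the same check on the function of Exercise 29 at x_0 = 0 fails for small ε no matter how small δ is taken, which is the numerical shadow of its discontinuity there.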
Exercise 29. Let f : R → R be the function defined by

f(x) = 1/x if x ≠ 0, and f(x) = 0 if x = 0.

Prove that f is continuous at every point of R, with the exception of 0.
7. Open Sets, Compact Sets, and the Weierstrass Theorem

Let x be a point in a metric space and r > 0. The open ball B(x, r) of radius r centred at x is the set of all y ∈ X such that d(x, y) < r. Thus, the open ball is the set of all points whose distance from the centre is strictly less than r. The ball is closed if the inequality is weak, d(x, y) ≤ r.
A set S in a metric space is open if for all x ∈ S there exists r ∈ R, r > 0, such that B(x, r) ⊂ S. A set S is closed if its complement

S^c = {x ∈ X | x ∉ S}

is open.
Exercise 30. Prove that an open ball is an open set.

Exercise 31. Prove that the intersection of any finite number of open sets is an open set.

A set S is bounded if there exists a closed ball of finite radius that contains it. Formally, S is bounded if there exists a closed ball B(x, r) such that S ⊂ B(x, r).

Exercise 32. Prove that the set S is bounded if and only if there exists a real number p > 0 such that d(x, x′) ≤ p for all x and x′ in S.

Exercise 33. Prove that the union of two bounded sets is a bounded set.
A collection (possibly infinite) of open sets U 1 , U 2 , . . . in a metric space is an<br />
open cover of the set S if S is contained in its union.<br />
A set S is compact if every open cover of S has a finite subcover. That is from<br />
any open cover can select a finite number of sets U i that still cover S.<br />
Note that the definition does not say that a set is compact if there is a finite<br />
open cover! That wouldn’t be a good definition as you can cover any set with the<br />
whole space, which is just one open set.<br />
Let’s see how to use this definition to show that something is not compact.<br />
Consider the set (0, 1) ∈ R. To prove that it is not compact we need to find an<br />
open cover of (0, 1) from which we cannot select a finite cover. The collection of<br />
open intervals (1/n, 1) for all integers n ≥ 2 is an open cover of (0, 1), because for<br />
any point x ∈ (0, 1) it is always possible to find an integer n such that n > 1/x, thus<br />
x ∈ (1/n, 1). But no finite subcover will do! Let (1/N, 1) be the maximal interval<br />
in a candidate finite subcover; then it is always possible to find a point x ∈ (0, 1) such<br />
that N < 1/x, so that x /∈ (1/N, 1).<br />
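The argument above can be checked numerically. In the following sketch (ours, not part of the notes) we take the finite subcover {(1/n, 1) : 2 ≤ n ≤ N} and exhibit a point of (0, 1) that it misses.<br />

```python
def covered(x, N):
    """True if x lies in some interval (1/n, 1) with 2 <= n <= N."""
    return any(1.0 / n < x < 1.0 for n in range(2, N + 1))

for N in (2, 10, 1000):
    witness = 1.0 / (N + 1)   # a point of (0, 1) below every 1/n in the subcover
    assert 0 < witness < 1
    assert not covered(witness, N)
```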
While this definition of compactness is quite useful for showing that the set<br />
under question is not compact, it is less useful for verifying that a set is indeed compact.<br />
A much more convenient characterisation of compact sets in finite-dimensional<br />
Euclidean space, R n , is given by the following theorem.<br />
Theorem 1. Any closed and bounded subset of R n is compact.<br />
But why are we interested in compactness at all? Because of the following extremely<br />
important theorem, the first version of which was proved by Karl Weierstrass<br />
around 1860.<br />
Theorem 2. Let S be a compact set in a metric space and f : S → R be a<br />
continuous function. Then f attains its maximum and minimum on S.
And why is this theorem important for us? Because many economic problems<br />
are concerned with finding a maximal (or a minimal) value of a function on some set.<br />
The Weierstrass theorem provides conditions under which such a search is meaningful!<br />
This theorem and its implications will be much dwelt upon later in the notes, so<br />
we just give one example here. The consumer utility maximisation problem is the<br />
problem of finding the maximum of a utility function subject to the budget constraint.<br />
According to the Weierstrass theorem, this problem has a solution if the utility function is<br />
continuous and the budget set is compact.<br />
8. Sequences and Subsequences<br />
Let us consider again some metric space (X, d). An infinite sequence of points<br />
in (X, d) is simply a list<br />
x 1 , x 2 , x 3 , . . . ,<br />
where . . . indicates that the list continues “forever.”<br />
We can be a bit more formal about this. We first consider the set of natural<br />
numbers (or counting numbers) 1, 2, 3, . . . , which we denote N. We can now define<br />
an infinite sequence in the following way.<br />
Definition 6. An infinite sequence of elements of X is a function from N to X.<br />
Notation. If we look at the previous definition we see that we might have<br />
a sequence s : N → X which would define s(1), s(2), s(3), . . . or in other words<br />
would define s(n) for any natural number n. Typically when we are referring to<br />
sequences we use subscripts (or sometimes superscripts) instead of parentheses and<br />
write s 1 , s 2 , s 3 , . . . and s n instead of s(1), s(2), s(3), . . . and s(n). Also rather than<br />
saying that s : N → X is a sequence we say that {s n } is a sequence or even that<br />
{s n } ∞ n=1 is a sequence.<br />
Let’s now examine a few examples.<br />
Example 4. Suppose that (X, d) is R the real numbers with the usual metric<br />
d(x, y) = |x − y|. Then {n}, { √ n}, and {1/n} are sequences.<br />
Example 5. Again, suppose that (X, d) is R the real numbers with the usual<br />
metric d(x, y) = |x − y|. Consider the sequence {x n } where<br />
x n = 1 if n is odd, and x n = 0 if n is even.<br />
We see that {n} and { √ n} get arbitrarily large as n gets larger, while in the last<br />
example x n “bounces” back and forth between 0 and 1 as n gets larger. However for<br />
{1/n} the element of the sequence gets closer and closer to 0 (and indeed arbitrarily<br />
close to 0). We say, in this case, that the sequence converges to zero or that the<br />
sequence has limit 0. This is a particularly important concept and so we shall give<br />
a formal definition.<br />
Definition 7. Let {x n } be a sequence of points in (X, d). We say that the<br />
sequence converges to x 0 ∈ X if for any ε > 0 there is N ∈ N such that if n > N<br />
then d(x n , x 0 ) < ε.<br />
Informally we can describe this by saying that if n is large then the distance<br />
from x n to x 0 is small.<br />
If the sequence {x n } converges to x 0 , then we often write x n → x 0 as n → ∞<br />
or lim n→∞ x n = x 0 .
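To make Definition 7 concrete, here is a small sketch (our own illustration, not part of the notes) that, for the sequence x n = 1/n and a given ε, finds an N beyond which d(x n , 0) < ε.<br />

```python
def find_N(eps):
    """Smallest N such that |1/n - 0| < eps for every n > N."""
    N = 1
    while 1.0 / (N + 1) >= eps:
        N += 1
    return N

for eps in (0.5, 0.1, 0.001):
    N = find_N(eps)
    # every later term is within eps of the limit 0
    assert all(abs(1.0 / n) < eps for n in range(N + 1, N + 1000))
```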
Exercise 34. Show that if the sequence {x n } converges to x 0 then it does not<br />
converge to any other value. Another way of saying this is that if<br />
the sequence converges then its limit is unique.<br />
We have now seen a number of examples of sequences. In some the sequence<br />
“runs off to infinity;” in others it “bounces around;” while in others it converges to<br />
a limit. Could a sequence do anything else? Could a sequence, for example, settle<br />
down, each element getting closer and closer to all future elements in the sequence,<br />
but not converge to any particular limit? In fact, depending on what the space<br />
X is, this is indeed possible.<br />
First let us recall the notion of a rational number. A rational number is a<br />
number that can be expressed as the ratio of two integers, that is r is rational if<br />
r = a/b with a and b integers and b ≠ 0. We usually denote the set of all rational<br />
numbers Q (since we have already used R for the real numbers). We now consider<br />
an example in which the underlying space X is Q. Consider the sequence of<br />
rational numbers defined in the following way:<br />
x 1 = 1, x n+1 = (x n + 2)/(x n + 1).<br />
This kind of definition is called a recursive definition. Rather than writing, as a<br />
function of n, what x n is we write what x 1 is and then what x n+1 is as a function<br />
of what x n is. We can obviously find any element of the sequence that we need, as<br />
long as we sequentially calculate all the preceding elements. In our case we’d have<br />
x 1 = 1<br />
x 2 = (1 + 2)/(1 + 1) = 3/2 = 1.5<br />
x 3 = (3/2 + 2)/(3/2 + 1) = 7/5 = 1.4<br />
x 4 = (7/5 + 2)/(7/5 + 1) = 17/12 ≈ 1.416667<br />
x 5 = (17/12 + 2)/(17/12 + 1) = 41/29 ≈ 1.413793<br />
x 6 = (41/29 + 2)/(41/29 + 1) = 99/70 ≈ 1.414286<br />
We see that the sequence goes up and down but that it seems to be “converging.”<br />
What is it converging to? Let’s suppose that it is converging to some value x 0 .<br />
Recall that<br />
x n+1 = (x n + 2)/(x n + 1).<br />
We’ll see later that if f is a continuous function then lim n→∞ f(x n ) = f(lim n→∞ x n ).<br />
In this case that means that<br />
x 0 = lim n→∞ x n+1 = lim n→∞ (x n + 2)/(x n + 1) = (x 0 + 2)/(x 0 + 1).<br />
Thus we have<br />
x 0 = (x 0 + 2)/(x 0 + 1)
and if we solve this we obtain x 0 = ± √ 2. Clearly if x n > 0 then x n+1 > 0 so<br />
our sequence can’t be converging to − √ 2 so we must have x 0 = √ 2. But √ 2 is<br />
not in Q. Thus we have a sequence of elements in Q that are getting very close to<br />
each other but are not converging to any element of Q. (Of course the sequence is<br />
converging to a point in R. In fact one construction of the real number system is<br />
in terms of such sequences in Q.)<br />
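A quick sketch (ours, not the notes’) iterating this recursion in exact rational arithmetic shows the terms approaching √ 2 even though every term is rational:<br />

```python
from fractions import Fraction
import math

x = Fraction(1)
for _ in range(20):
    x = (x + 2) / (x + 1)   # every iterate stays in Q

# after 20 steps the rational iterate is extremely close to sqrt(2),
# which itself is not in Q
assert abs(float(x) - math.sqrt(2)) < 1e-12
```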
Definition 8. Let {x n } be a sequence of points in (X, d). We say that the<br />
sequence is a Cauchy sequence if for any ε > 0 there is N ∈ N such that if n, m > N<br />
then d(x n , x m ) < ε.<br />
Exercise 35. Show that if {x n } converges then {x n } is a Cauchy sequence.<br />
A metric space (X, d) in which every Cauchy sequence converges to a limit in<br />
X is called a complete metric space. The space of real numbers R is a complete<br />
metric space, while the space of rationals Q is not.<br />
Exercise 36. Is N, the space of natural or counting numbers with metric d<br />
given by d(x, y) = |x − y|, a complete metric space?<br />
In Section 6 we defined the notion of a function being continuous at a point.<br />
It is possible to give that definition in terms of sequences.<br />
Definition 9. Suppose (X, d X ) and (Y, d Y ) are metric spaces, x 0 ∈ X, and<br />
f : X → Y is a function. Then f is continuous at x 0 if for every sequence {x n } that<br />
converges to x 0 in (X, d X ) the sequence {f(x n )} converges to f(x 0 ) in (Y, d Y ).<br />
Exercise 37. Show that the function f(x) = (x + 2)/(x + 1) is continuous at<br />
any point x ≠ −1. Show that this means that if x n → x 0 as n → ∞ (with all x n<br />
and x 0 unequal to −1) then<br />
lim n→∞ (x n + 2)/(x n + 1) = (x 0 + 2)/(x 0 + 1).<br />
We can also define the concept of a closed set (and hence the concepts of open<br />
sets and compact sets) in terms of sequences.<br />
Definition 10. Let (X, d) be a metric space. A set S ⊂ X is closed if for any<br />
convergent sequence {x n } with x n ∈ S for all n we have lim n→∞ x n ∈ S. A set is<br />
open if its complement is closed.<br />
Given a sequence {x n } we can define a new sequence by taking only some of<br />
the elements of the original sequence. In the example we considered earlier in which<br />
x n was 1 if n was odd and 0 if n was even we could take only the odd n and thus<br />
obtain a sequence that did converge. The new sequence is called a subsequence of<br />
the old sequence.<br />
Definition 11. Let {x n } be some sequence in (X, d). Let {n j } ∞ j=1 be a<br />
sequence of natural numbers such that for each j we have n j < n j+1 , that is<br />
n 1 < n 2 < n 3 < . . . . The sequence {x nj } ∞ j=1 is called a subsequence of the original<br />
sequence.<br />
The notion of a subsequence is often useful. We often use it in the way that<br />
we briefly referred to above. We initially have a sequence that may not converge,<br />
but we are able to take a subsequence that does converge. Such a subsequence is<br />
called a convergent subsequence.<br />
Definition 12. A subset of a metric space with the property that every sequence<br />
in the subset has a subsequence converging to a point of the subset is called sequentially compact.<br />
Theorem 3. In any metric space any compact set is sequentially compact.
If we restrict attention to finite dimensional Euclidean spaces the situation is<br />
even better behaved.<br />
Theorem 4. Any subset of R n is sequentially compact if and only if it is<br />
compact.<br />
Exercise 38. Verify the following limits.<br />
(i) lim n→∞ n/(n + 1) = 1<br />
(ii) lim n→∞ (n + 3)/ √ (n 2 + 1) = 1<br />
(iii) lim n→∞ ( √ (n + 1) − √ n) = 0<br />
(iv) lim n→∞ (a n + b n ) 1/n = max{a, b} (for a, b > 0)<br />
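These limits can be checked numerically; the following sketch (ours) evaluates each expression at a large n and compares it with the claimed limit.<br />

```python
import math

n = 10**6
assert abs(n / (n + 1) - 1) < 1e-5                    # (i)
assert abs((n + 3) / math.sqrt(n**2 + 1) - 1) < 1e-5  # (ii)
assert abs(math.sqrt(n + 1) - math.sqrt(n)) < 1e-2    # (iii)

a, b = 2.0, 3.0                                       # (iv)
m = 100  # a moderate exponent, to keep a**m and b**m within float range
assert abs((a**m + b**m) ** (1.0 / m) - max(a, b)) < 1e-2
```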
Exercise 39. Consider a sequence {x n } in R. What can you say about the<br />
sequence if it converges and, for each n, x n is an integer?<br />
Exercise 40. Consider the sequence<br />
1/2, 1/3, 2/3, 1/4, 2/4, 3/4, 1/5, 2/5, 3/5, 4/5, 1/6, . . . .<br />
For which values z ∈ R is there a subsequence converging to z?<br />
Exercise 41. Prove that if a subsequence of a Cauchy sequence converges to<br />
a limit z then so does the original Cauchy sequence.<br />
Exercise 42. Prove that any subsequence of a convergent sequence converges.<br />
Finally one somewhat less trivial exercise.<br />
Exercise 43. Prove that if lim n→∞ x n = z then<br />
lim n→∞ (x 1 + · · · + x n )/n = z.<br />
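A numerical sketch (ours) of Exercise 43 for x n = 1 + 1/n, whose limit is z = 1: the running averages approach 1 as well, though more slowly.<br />

```python
def cesaro_mean(seq):
    """Average of the first len(seq) terms of a sequence."""
    return sum(seq) / len(seq)

xs = [1 + 1.0 / n for n in range(1, 100001)]
assert abs(xs[-1] - 1.0) < 1e-4           # the sequence itself converges to 1
assert abs(cesaro_mean(xs) - 1.0) < 1e-3  # and so does the sequence of averages
```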
9. Linear Spaces<br />
The notion of linear space is the axiomatic way of looking at the familiar linear<br />
operations: addition and multiplication. A trivial example of a linear space is the<br />
set of real numbers, R.<br />
What is the operation of addition? One way of answering the question is<br />
to say that the operation of addition is just the list of its properties. So, we will<br />
define the addition of elements from some set X as the operation that satisfies the<br />
following four axioms.<br />
A1: x + y = y + x for all x and y in X.<br />
A2: x + (y + z) = (x + y) + z, for all x, y, and z in X.<br />
A3: There exists an element, denoted by 0, such that x + 0 = x for all x in<br />
X.<br />
A4: For every x in X there exists an element y in X, called the inverse of x, such<br />
that x + y = 0.<br />
And, to make things more interesting, we will also introduce the operation of<br />
‘multiplication by a number’ by adding two more axioms.<br />
A5: 1x = x for all x in X.<br />
A6: α(βx) = (αβ)x for all x in X and for all α and β in R.<br />
Finally, two more axioms relating addition and multiplication.<br />
A7: α(x + y) = αx + αy for all x and y in X and for all α in R.<br />
A8: (α + β)x = αx + βx for all x in X and for all α and β in R.
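As a sketch (ours, not part of the notes), one can spot-check axioms A1–A8 for the space X = R 2 with coordinatewise operations; the values chosen make every computation exact in floating point:<br />

```python
def add(x, y):
    return (x[0] + y[0], x[1] + y[1])

def smul(a, x):
    return (a * x[0], a * x[1])

x, y, z = (1.0, 2.0), (3.0, -1.0), (0.5, 4.0)
a, b = 2.0, -3.0
zero = (0.0, 0.0)

assert add(x, y) == add(y, x)                             # A1
assert add(x, add(y, z)) == add(add(x, y), z)             # A2
assert add(x, zero) == x                                  # A3
assert add(x, smul(-1.0, x)) == zero                      # A4
assert smul(1.0, x) == x                                  # A5
assert smul(a, smul(b, x)) == smul(a * b, x)              # A6
assert smul(a, add(x, y)) == add(smul(a, x), smul(a, y))  # A7
assert add(smul(a, x), smul(b, x)) == smul(a + b, x)      # A8
```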
Elements x, y, . . . , w are linearly dependent if there exist real numbers α, β, . . . , λ,<br />
not all of them equal to zero, such that<br />
αx + βy + · · · + λw = 0.<br />
Otherwise, the elements x, y, . . . , w are linearly independent.<br />
If in a space L it is possible to find n linearly independent elements, but any<br />
n + 1 elements are linearly dependent, then we say that the space L has dimension<br />
n.<br />
A nonempty subset L ′ of a linear space L is called a linear subspace if L ′ forms<br />
a linear space in itself. In other words, L ′ is a linear subspace of L if for any x and<br />
y in L ′ and all α and β in R<br />
αx + βy ∈ L ′ .
CHAPTER 2<br />
Linear Algebra<br />
1. The Space R n<br />
In the previous chapter we introduced the concept of a linear space or a vector<br />
space. We shall now examine in some detail one example of such a space. This is<br />
the space of all ordered n-tuples (x 1 , x 2 , . . . , x n ) where each x i is a real number.<br />
We call this space n-dimensional real space and denote it R n .<br />
Remember from the previous chapter that to define a vector space we not only<br />
need to define the points in that space but also to define how we add such points<br />
and how we multiply such points by scalars. In the case of R n we do this element<br />
by element in the n-tuple or vector. That is,<br />
(x 1 , x 2 , . . . , x n ) + (y 1 , y 2 , . . . , y n ) = (x 1 + y 1 , x 2 + y 2 , . . . , x n + y n )<br />
and<br />
α(x 1 , x 2 , . . . , x n ) = (αx 1 , αx 2 , . . . , αx n ).<br />
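A minimal sketch (ours) of these coordinatewise operations, with vectors in R n as Python tuples:<br />

```python
def vadd(x, y):
    """Coordinatewise sum of two vectors in R^n."""
    return tuple(xi + yi for xi, yi in zip(x, y))

def vscale(alpha, x):
    """Multiply a vector in R^n by the scalar alpha."""
    return tuple(alpha * xi for xi in x)

assert vadd((1, 2, 3), (4, 5, 6)) == (5, 7, 9)
assert vscale(2, (1, 2, 3)) == (2, 4, 6)
```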
Let us consider the case that n = 2, that is, the case of R 2 . In this case we can<br />
visualise the space as in the following diagram. The vector (x 1 , x 2 ) is represented<br />
by the point that is x 1 units along from the point (0, 0) in the horizontal direction<br />
and x 2 units up from (0, 0) in the vertical direction.<br />
[Figure 1: the vector (1, 2) in R 2 , plotted with horizontal axis x 1 and vertical axis x 2 .]<br />
Let us for the moment continue our discussion in R 2 . Notice that we are<br />
implicitly writing a vector (x 1 , x 2 ) as a sum x 1 × v 1 + x 2 × v 2 where v 1 is the<br />
unit vector in the first direction and v 2 is the unit vector in the second direction.<br />
Suppose that instead we considered the vectors u 1 = (2, 1) = 2 × v 1 + 1 × v 2 and<br />
u 2 = (1, 2) = 1 × v 1 + 2 × v 2 . We could have written any vector (x 1 , x 2 ) instead<br />
as z 1 × u 1 + z 2 × u 2 where z 1 = (2x 1 − x 2 )/3 and z 2 = (2x 2 − x 1 )/3. That is, for<br />
any vector in R 2 we can uniquely write that vector in terms of u 1 and u 2 . Is there<br />
anything that is special about u 1 and u 2 that allows us to make this claim? There<br />
must be since we can easily find other vectors for which this would not have been<br />
true. (For example, (1, 2) and (2, 4).)<br />
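The coefficients z 1 = (2x 1 − x 2 )/3 and z 2 = (2x 2 − x 1 )/3 can be verified directly; this sketch (ours) checks that z 1 u 1 + z 2 u 2 reproduces (x 1 , x 2 ):<br />

```python
u1, u2 = (2.0, 1.0), (1.0, 2.0)

def coords_in_u(x1, x2):
    """Coordinates of (x1, x2) relative to the basis {u1, u2}."""
    return (2 * x1 - x2) / 3, (2 * x2 - x1) / 3

for x1, x2 in [(1.0, 2.0), (-3.0, 0.5), (4.0, 4.0)]:
    z1, z2 = coords_in_u(x1, x2)
    v = (z1 * u1[0] + z2 * u2[0], z1 * u1[1] + z2 * u2[1])
    assert abs(v[0] - x1) < 1e-12 and abs(v[1] - x2) < 1e-12
```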
The key property of the pair of vectors u 1 and u 2 is that they are linearly independent.<br />
That is, we cannot write either as a multiple of the other. More generally, in n dimensions<br />
we would say that we cannot write any of the vectors as a linear combination of<br />
the others, or, equivalently, as in the following definition.<br />
Definition 13. The vectors x 1 , . . . , x k all in R n are linearly independent if it<br />
is not possible to find scalars α 1 , . . . , α k not all zero such that<br />
α 1 x 1 + · · · + α k x k = 0.<br />
Notice that we do not as a matter of definition require that k = n or even that<br />
k ≤ n. We state as a result that if k > n then the collection x 1 , . . . , x k cannot<br />
be linearly independent. (In a real maths course we would, of course, have proved<br />
this.)<br />
Comment 1. If you examine the definition above you will notice that there<br />
is nowhere that we actually need to assume that our vectors are in R n . We can<br />
in fact apply the same definition of linear independence to any vector space. This<br />
allows us to define the concept of the dimension of an arbitrary vector space as the<br />
maximal number of linearly independent vectors in that space. In the case of R n<br />
we obtain that the dimension is in fact n.<br />
Exercise 44. Suppose that x 1 , . . . , x k all in R n are linearly independent and<br />
that the vector y in R n is equal to β 1 x 1 + · · · + β k x k . Show that this is the only<br />
way that y can be expressed as a linear combination of the x i ’s. (That is show that<br />
if y = γ 1 x 1 + · · · + γ k x k then β 1 = γ 1 , . . . , β k = γ k .)<br />
The set of all vectors that can be written as a linear combination of the vectors<br />
x 1 , . . . , x k is called the span of those vectors. If x 1 , . . . , x k are linearly independent<br />
and if the span of x 1 , . . . , x k is all of R n then the collection { x 1 , . . . , x k } is called<br />
a basis for R n . (Of course, in this case we must have k = n.) Any vector in R n<br />
can be uniquely represented as a linear combination of the vectors x 1 , . . . , x k . We<br />
shall later see that it can sometimes be useful to choose a particular basis in which<br />
to represent the vectors with which we deal.<br />
It may be that we have a collection of vectors { x 1 , . . . , x k } whose span is not<br />
all of R n . In this case we call the span of { x 1 , . . . , x k } a linear subspace of R n .<br />
Alternatively we say that X ⊂ R n is a linear subspace of R n if X is closed under<br />
vector addition and scalar multiplication. That is, if for all x, y ∈ X the vector<br />
x + y is also in X and for all x ∈ X and α ∈ R the vector αx is in X. If the span<br />
of x 1 , . . . , x k is X and if x 1 , . . . , x k are linearly independent then we say that these<br />
vectors are a basis for the linear subspace X. In this case the dimension of the<br />
linear subspace X is k. In general the dimension of the span of x 1 , . . . , x k is equal<br />
to the maximum number of linearly independent vectors in x 1 , . . . , x k .<br />
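Using NumPy (an assumption of this sketch; the notes themselves do not use any software), the dimension of the span of a collection of vectors is the rank of the matrix whose rows are those vectors:<br />

```python
import numpy as np

vectors = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # a multiple of the first vector
    [0.0, 1.0, 1.0],
])

# dimension of the span = maximal number of linearly independent vectors
assert np.linalg.matrix_rank(vectors) == 2
```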
Finally, we comment that R n is a metric space with metric d : R n × R n → R +<br />
defined by<br />
d((x 1 , . . . , x n ), (y 1 , . . . , y n )) = √ ((x 1 − y 1 ) 2 + · · · + (x n − y n ) 2 ).<br />
There are many other metrics we could define on this space but this is the standard<br />
one.
2. Linear Functions from R n to R m<br />
In the previous section we introduced the space R n . Here we shall discuss<br />
functions from one such space to another (possibly of different dimension). The<br />
concept of continuity that we introduced for metric spaces is immediately applicable<br />
here. We shall be mainly concerned here with an even narrower class of functions,<br />
namely, the linear functions.<br />
Definition 14. A function f : R n → R m is said to be a linear function if it<br />
satisfies the following two properties.<br />
(1) f(x + y) = f(x) + f(y) for all x, y ∈ R n , and<br />
(2) f(αx) = αf(x) for all x ∈ R n and α ∈ R.<br />
Comment 2. When considering functions of a single real variable, that is,<br />
functions from R to R, functions of the form f(x) = ax + b, where a and b are<br />
fixed constants are sometimes called linear functions. It is easy to see that if b ≠ 0<br />
then such functions do not satisfy the conditions given above. We shall call such<br />
functions affine functions. More generally we shall call a function g : R n → R m an<br />
affine function if it is the sum of a linear function f : R n → R m and a constant<br />
b ∈ R m . That is, if for any x ∈ R n g(x) = f(x) + b.<br />
Let us now suppose that we have two linear functions f : R n → R m and<br />
g : R n → R m . It is straightforward to show that the function (f + g) : R n → R m<br />
defined by (f + g)(x) = f(x) + g(x) is also a linear function. Similarly if we have a<br />
linear function f : R n → R m and a constant α ∈ R the function (αf) : R n → R m<br />
defined by (αf)(x) = αf(x) is a linear function. If f : R n → R m and g : R m →<br />
R k are linear functions then the composite function g ◦ f : R n → R k defined by<br />
g ◦ f(x) = g(f(x)) is again a linear function. Finally, if f : R n → R n is not only<br />
linear, but also one-to-one and onto so that it has an inverse f −1 : R n → R n then<br />
the inverse function is also a linear function.<br />
Exercise 45. Prove the facts stated in the previous paragraph.<br />
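A numerical sketch (ours) illustrating the closure properties just listed, for two particular linear maps f, g : R 2 → R 2 (the maps are our own examples):<br />

```python
def f(x):
    return (x[0] + 2 * x[1], 3 * x[0])   # a linear map R^2 -> R^2

def g(x):
    return (x[1], x[0] - x[1])           # another linear map R^2 -> R^2

def add(x, y):
    return (x[0] + y[0], x[1] + y[1])

def smul(a, x):
    return (a * x[0], a * x[1])

x, y, a = (1.0, -2.0), (0.5, 3.0), 4.0

# the composition g o f is again linear; for example it respects
# vector addition and scalar multiplication at these points:
h = lambda v: g(f(v))
assert h(add(x, y)) == add(h(x), h(y))
assert h(smul(a, x)) == smul(a, h(x))
```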
Recall in the previous section we defined the notion of a linear subspace. A<br />
linear function f : R n → R m defines two important subspaces, the image of f,<br />
denoted Im(f) ⊂ R m , and the kernel of f, denoted Ker(f) ⊂ R n . The image of f<br />
is the set of all vectors in R m such that f maps some vector in R n to that vector,<br />
that is,<br />
Im(f) = { y ∈ R m | ∃x ∈ R n such that y = f(x) }.<br />
The kernel of f is the set of all vectors in R n that are mapped by the function f<br />
to the zero vector in R m , that is,<br />
Ker(f) = { x ∈ R n | f(x) = 0 }.<br />
The kernel of f is sometimes called the null space of f.<br />
It is intuitively clear that the dimension of Im(f) is no more than n. (It is of<br />
course no more than m since it is contained in R m .) <strong>Of</strong> course, in general it may be<br />
less than n, for example if m < n or if f mapped all points in R n to the zero vector<br />
in R m . (You should satisfy yourself that this function is indeed a linear function.)<br />
However if the dimension of Im(f) is indeed less than n it means that the function<br />
has mapped the n-dimensional space R n into a linear space of lower dimension and<br />
that in the process some dimensions have been lost. The linearity of f means that<br />
a linear subspace of dimension equal to the number of dimensions that have been<br />
lost must have been collapsed to the zero vector (and that translates of this linear<br />
subspace have been collapsed to single points). Thus we can say that<br />
dim(Im(f)) + dim(Ker(f)) = n.
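With NumPy (again an assumption of this sketch, not something the notes use) the identity dim(Im(f)) + dim(Ker(f)) = n can be checked for the linear map represented by a matrix A:<br />

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])   # represents a map from R^3 to R^2
n = A.shape[1]

s = np.linalg.svd(A, compute_uv=False)
dim_im = int(np.sum(s > 1e-10))   # rank of A = dim(Im f)
dim_ker = n - dim_im              # nullity, read off from the same rank

# the last rows of Vt span the kernel: A maps each of them to zero
_, _, Vt = np.linalg.svd(A)
for v in Vt[dim_im:]:
    assert np.allclose(A @ v, 0.0)

assert dim_im == 1 and dim_ker == 2
assert dim_im + dim_ker == n
```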
In the following section we shall introduce the notion of a matrix and define<br />
various operations on matrices. If you are like me when I first came across matrices,<br />
these definitions may seem somewhat arbitrary and mysterious. However, we shall<br />
see that matrices may be viewed as representations of linear functions and that when<br />
viewed in this way the operations we define on matrices are completely natural.<br />
3. Matrices and Matrix Algebra<br />
A matrix is defined as a rectangular array of numbers. If the matrix contains<br />
m rows and n columns it is called an m × n matrix (read “m by n” matrix). The<br />
element in the ith row and the jth column is called the ijth element. We typically<br />
enclose a matrix in square brackets [ ] and write it as<br />
⎡ a 11 . . . a 1n ⎤<br />
⎢ .    . ..    . ⎥<br />
⎣ a m1 . . . a mn ⎦ .<br />
In the case that m = n we call the matrix a square matrix. If m = 1 the matrix<br />
contains a single row and we call it a row vector. If n = 1 the matrix contains<br />
a single column and we call it a column vector. For most purposes we do not<br />
distinguish between a 1 × 1 matrix [a] and the scalar a.<br />
Just as we defined the operation of vector addition and the multiplication of<br />
a vector by a scalar we define similar operations for matrices. In order to be able<br />
to add two matrices we require that the matrices be of the same dimension. That<br />
is, if matrix A is of dimension m × n we shall be able to add the matrix B to it<br />
if and only if B is also of dimension m × n. If this condition is met then we add<br />
matrices simply by adding the corresponding elements of each matrix to obtain the<br />
new m × n matrix A + B. That is,<br />
⎡ a 11 . . . a 1n ⎤   ⎡ b 11 . . . b 1n ⎤   ⎡ a 11 + b 11 . . . a 1n + b 1n ⎤<br />
⎢ .    . ..    . ⎥ + ⎢ .    . ..    . ⎥ = ⎢ .           . ..           . ⎥<br />
⎣ a m1 . . . a mn ⎦   ⎣ b m1 . . . b mn ⎦   ⎣ a m1 + b m1 . . . a mn + b mn ⎦ .<br />
We can see that this definition of matrix addition satisfies many of the same<br />
properties of the addition of scalars. If A, B, and C are all m × n matrices then<br />
(1) A + B = B + A,<br />
(2) (A + B) + C = A + (B + C),<br />
(3) there is a zero matrix 0 such that for any m×n matrix A we have A+0 =<br />
0 + A = A, and<br />
(4) there is a matrix −A such that A + (−A) = (−A) + A = 0.<br />
Of course, the zero matrix referred to in (3) is simply the m×n matrix consisting<br />
of all zeros (this is called a null matrix) and the matrix −A referred to in 4 is the<br />
matrix obtained from A by replacing each element of A by its negative, that is,<br />
⎡ a 11 . . . a 1n ⎤   ⎡ −a 11 . . . −a 1n ⎤<br />
− ⎢ .    . ..    . ⎥ = ⎢ .      . ..      . ⎥<br />
⎣ a m1 . . . a mn ⎦   ⎣ −a m1 . . . −a mn ⎦ .<br />
Now, given a scalar α in R and an m × n matrix A we define the product of α<br />
and A which we write αA to be the matrix in which each element is replaced by α<br />
times that element, that is,<br />
⎡ a 11 . . . a 1n ⎤   ⎡ αa 11 . . . αa 1n ⎤<br />
α ⎢ .    . ..    . ⎥ = ⎢ .      . ..      . ⎥<br />
⎣ a m1 . . . a mn ⎦   ⎣ αa m1 . . . αa mn ⎦ .
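A sketch (ours) of matrix addition and multiplication by a scalar, with matrices as nested lists:<br />

```python
def mat_add(A, B):
    """Entrywise sum of two m-by-n matrices (same dimensions required)."""
    assert len(A) == len(B) and len(A[0]) == len(B[0])
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_scale(alpha, A):
    """Multiply every entry of A by the scalar alpha."""
    return [[alpha * a for a in row] for row in A]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert mat_add(A, B) == [[6, 8], [10, 12]]
assert mat_scale(2, A) == [[2, 4], [6, 8]]
```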
So far the definitions of matrix operations have all seemed the most natural<br />
ones. We now come to defining matrix multiplication. Perhaps here the definition<br />
seems somewhat less natural. However in the next section we shall see that the definition<br />
we shall give is in fact very natural when we view matrices as representations<br />
of linear functions.<br />
We define matrix multiplication of A times B written as AB where A is an<br />
m × n matrix and B is a p × q matrix only when n = p. In this case the product<br />
AB is defined to be an m × q matrix in which the element in the ith row and jth<br />
column is ∑ n k=1 a ik b kj . That is, to find the term to go in the ith row and the jth<br />
column of the product matrix AB we take the ith row of the matrix A which will<br />
be a row vector with n elements and the jth column of the matrix B which will be<br />
a column vector with n elements. We then multiply each element of the first vector<br />
by the corresponding element of the second and add all these products. Thus<br />
⎡ a 11 . . . a 1n ⎤ ⎡ b 11 . . . b 1q ⎤   ⎡ ∑ n k=1 a 1k b k1 . . . ∑ n k=1 a 1k b kq ⎤<br />
⎢ .    . ..    . ⎥ ⎢ .    . ..    . ⎥ = ⎢ .              . ..              . ⎥<br />
⎣ a m1 . . . a mn ⎦ ⎣ b n1 . . . b nq ⎦   ⎣ ∑ n k=1 a mk b k1 . . . ∑ n k=1 a mk b kq ⎦ .<br />
For example<br />
          ⎡ p q ⎤<br />
⎡ a b c ⎤ ⎢ r s ⎥ = ⎡ ap + br + ct aq + bs + cv ⎤<br />
⎣ d e f ⎦ ⎣ t v ⎦   ⎣ dp + er + ft dq + es + fv ⎦ .<br />
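The worked product above can be checked mechanically; this sketch (ours) implements the row-by-column rule and tests it on the same 2 × 3 and 3 × 2 matrices with numerical entries:<br />

```python
def mat_mul(A, B):
    """Product of an m-by-n matrix A and an n-by-q matrix B."""
    n = len(A[0])
    assert n == len(B)
    q = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(q)]
            for i in range(len(A))]

a, b, c, d, e, f = 1, 2, 3, 4, 5, 6
p, q, r, s, t, v = 7, 8, 9, 10, 11, 12

AB = mat_mul([[a, b, c], [d, e, f]], [[p, q], [r, s], [t, v]])
assert AB == [[a*p + b*r + c*t, a*q + b*s + c*v],
              [d*p + e*r + f*t, d*q + e*s + f*v]]
```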
We define the identity matrix of order n to be the n × n matrix that has 1’s on<br />
its main diagonal and zeros elsewhere, that is, whose ijth element is 1 if i = j and<br />
zero if i ≠ j. We denote this matrix by I n or, if the order is clear from the context,<br />
simply I. That is,<br />
⎡ 1 0 . . . 0 ⎤<br />
I = ⎢ 0 1 . . . 0 ⎥<br />
⎢ . .  . ..  . ⎥<br />
⎣ 0 0 . . . 1 ⎦ .<br />
It is easy to see that if A is an m × n matrix then AI n = A and I m A = A. In fact,<br />
we could equally well define the identity matrix to be that matrix that satisfies<br />
these properties for all such matrices A in which case it would be easy to show that<br />
there was a unique matrix satisfying this property, namely, the matrix we defined<br />
above.<br />
Consider an m × n matrix A. The columns of A are m-dimensional vectors,<br />
that is, elements of R m and the rows of A are elements of R n . Thus we can ask<br />
if the n columns are linearly independent and similarly if the m rows are linearly<br />
independent. In fact we ask: What is the maximum number of linearly independent<br />
columns of A? It turns out that this is the same as the maximum number of linearly<br />
independent rows of A. We call the number the rank of the matrix A.
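The equality of the maximal numbers of independent rows and columns can be illustrated with NumPy (an assumption of this sketch):<br />

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [5.0, 7.0, 9.0]])   # third row = first + second, so rank 2

# the rank of A equals the rank of its transpose:
# max independent columns = max independent rows
assert np.linalg.matrix_rank(A) == np.linalg.matrix_rank(A.T) == 2
```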
4. Matrices as Representations of Linear Functions<br />
Let us suppose that we have a particular linear function f : R n → R m . We have<br />
suggested in the previous section that such a function can necessarily be represented<br />
as multiplication by some matrix. We shall now show that this is true. Moreover<br />
we shall do so by explicitly constructing the appropriate matrix.<br />
Let us write the n-dimensional vector x as a column vector<br />
⎡ x 1 ⎤<br />
x = ⎢ x 2 ⎥<br />
⎢ .   ⎥<br />
⎣ x n ⎦ .<br />
Now, notice that we can write the vector x as a sum ∑ n i=1 x i e i , where e i is the ith<br />
unit vector, that is, the vector with 1 in the ith place and zeros elsewhere. That is,<br />
⎡ x 1 ⎤      ⎡ 1 ⎤      ⎡ 0 ⎤            ⎡ 0 ⎤<br />
⎢ x 2 ⎥ = x 1 ⎢ 0 ⎥ + x 2 ⎢ 1 ⎥ + · · · + x n ⎢ 0 ⎥<br />
⎢ .   ⎥      ⎢ . ⎥      ⎢ . ⎥            ⎢ . ⎥<br />
⎣ x n ⎦      ⎣ 0 ⎦      ⎣ 0 ⎦            ⎣ 1 ⎦ .<br />
Now from the linearity of the function f we can write<br />
f(x) = f( ∑ n i=1 x i e i ) = ∑ n i=1 f(x i e i ) = ∑ n i=1 x i f(e i ).<br />
But, what is f(e i )? Remember that e i is a unit vector in R n and that f maps<br />
vectors in R n to vectors in R m . Thus f(e i ) is the image in R m of the vector e i . Let<br />
us write f(e i ) as<br />
⎡ a 1i ⎤<br />
f(e i ) = ⎢ a 2i ⎥<br />
⎢ .    ⎥<br />
⎣ a mi ⎦ .<br />
Thus<br />
f(x) = ∑ n i=1 x i f(e i )<br />
⎡ a 11 ⎤      ⎡ a 12 ⎤            ⎡ a 1n ⎤<br />
= x 1 ⎢ a 21 ⎥ + x 2 ⎢ a 22 ⎥ + · · · + x n ⎢ a 2n ⎥<br />
⎢ .    ⎥      ⎢ .    ⎥            ⎢ .    ⎥<br />
⎣ a m1 ⎦      ⎣ a m2 ⎦            ⎣ a mn ⎦<br />
⎡ ∑ n i=1 a 1i x i ⎤<br />
= ⎢ ∑ n i=1 a 2i x i ⎥<br />
⎢ .              ⎥<br />
⎣ ∑ n i=1 a mi x i ⎦
and this is exactly what we would have obtained had we multiplied the matrices<br />
⎡ a 11 a 12 . . . a 1n ⎤ ⎡ x 1 ⎤<br />
⎢ a 21 a 22 . . . a 2n ⎥ ⎢ x 2 ⎥<br />
⎢ .    .    . ..    . ⎥ ⎢ .  ⎥<br />
⎣ a m1 a m2 . . . a mn ⎦ ⎣ x n ⎦ .<br />
Thus we have not only shown that a linear function is necessarily represented by<br />
multiplication by a matrix we have also shown how to find the appropriate matrix.<br />
It is precisely the matrix whose n columns are the images under the function of the<br />
n unit vectors in R n .<br />
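This construction is easy to carry out in code. The sketch below (ours; the linear map chosen is just an illustration) recovers the matrix of f : R 2 → R 2 , f(x) = (x 1 + 2x 2 , 3x 1 ), from the images of the unit vectors:<br />

```python
def f(x):
    """An example linear map R^2 -> R^2."""
    return (x[0] + 2 * x[1], 3 * x[0])

e1, e2 = (1, 0), (0, 1)

# the columns of the representing matrix are f(e1) and f(e2)
col1, col2 = f(e1), f(e2)
A = [[col1[0], col2[0]],
     [col1[1], col2[1]]]
assert A == [[1, 2], [3, 0]]

# multiplying by A reproduces f
x = (4, -1)
Ax = (A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1])
assert Ax == f(x)
```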
Exercise 46. Find the matrices that represent the following linear functions<br />
from R 2 to R 2 .<br />
(1) a clockwise rotation of π/2 (90 ◦ ),<br />
(2) a reflection in the x 1 axis,<br />
(3) a reflection in the line x 2 = x 1 (that is, the 45 ◦ line),<br />
(4) a counter clockwise rotation of π/4 (45 ◦ ), and<br />
(5) a reflection in the line x 2 = x 1 followed by a counter clockwise rotation of<br />
π/4.<br />
Recall that in Section 2 we defined, for any f, g : R n → R m and α ∈ R, the functions (f + g) and (αf). In Section 3 we defined the sum of two m × n matrices A and B, and the product of a scalar α with the matrix A. Let us instead define the sum of A and B as follows.

Let f : R n → R m be the linear function represented by the matrix A and g : R n → R m be the linear function represented by the matrix B. Now define the matrix (A + B) to be the matrix that represents the linear function (f + g). Similarly let the matrix αA be the matrix that represents the linear function (αf).

Exercise 47. Prove that the matrices (A + B) and αA defined in the previous paragraph coincide with the matrices defined in Section 3.
We can also see that the definition we gave of matrix multiplication is precisely the right definition if we take the multiplication of matrices to mean the composition of the linear functions that the matrices represent. To be more precise, let f : R n → R m and g : R m → R k be linear functions and let A and B be the m × n and k × m matrices that represent them. Let (g ◦ f) : R n → R k be the composite function defined in Section 2. Now let us define the product BA to be the matrix that represents the linear function (g ◦ f).
Now since the matrix A represents the function f and B represents g we have
\[
(g \circ f)(x) = g(f(x))
= g\left( \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \right)
= g\left( \begin{bmatrix} \sum_{i=1}^n a_{1i}x_i \\ \sum_{i=1}^n a_{2i}x_i \\ \vdots \\ \sum_{i=1}^n a_{mi}x_i \end{bmatrix} \right)
\]
\[
= \begin{bmatrix} b_{11} & b_{12} & \dots & b_{1m} \\ b_{21} & b_{22} & \dots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{k1} & b_{k2} & \dots & b_{km} \end{bmatrix}
\begin{bmatrix} \sum_{i=1}^n a_{1i}x_i \\ \sum_{i=1}^n a_{2i}x_i \\ \vdots \\ \sum_{i=1}^n a_{mi}x_i \end{bmatrix}
= \begin{bmatrix} \sum_{j=1}^m b_{1j} \sum_{i=1}^n a_{ji}x_i \\ \sum_{j=1}^m b_{2j} \sum_{i=1}^n a_{ji}x_i \\ \vdots \\ \sum_{j=1}^m b_{kj} \sum_{i=1}^n a_{ji}x_i \end{bmatrix}
= \begin{bmatrix} \sum_{i=1}^n \sum_{j=1}^m b_{1j}a_{ji}x_i \\ \sum_{i=1}^n \sum_{j=1}^m b_{2j}a_{ji}x_i \\ \vdots \\ \sum_{i=1}^n \sum_{j=1}^m b_{kj}a_{ji}x_i \end{bmatrix}
\]
\[
= \begin{bmatrix} \sum_{j=1}^m b_{1j}a_{j1} & \sum_{j=1}^m b_{1j}a_{j2} & \dots & \sum_{j=1}^m b_{1j}a_{jn} \\ \sum_{j=1}^m b_{2j}a_{j1} & \sum_{j=1}^m b_{2j}a_{j2} & \dots & \sum_{j=1}^m b_{2j}a_{jn} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{j=1}^m b_{kj}a_{j1} & \sum_{j=1}^m b_{kj}a_{j2} & \dots & \sum_{j=1}^m b_{kj}a_{jn} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.
\]
And this last is the product of the matrix we defined in Section 3 to be BA with the column vector x. As we have claimed, the definition of matrix multiplication we gave in Section 3 was not arbitrary but rather was forced on us by our decision to regard the multiplication of two matrices as corresponding to the composition of the linear functions the matrices represent.

Recall that the columns of the matrix A that represents the linear function f : R n → R m are precisely the images of the unit vectors in R n under f. The linearity of f means that the image of any point in R n is in the span of the images of these unit vectors, and similarly that any point in the span of the images is the image of some point in R n . Thus Im(f) is equal to the span of the columns of A. Now, the dimension of the span of the columns of A is equal to the maximum number of linearly independent columns in A, that is, to the rank of A.
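This conclusion is easy to check numerically. The following sketch (NumPy; the random matrices and the seed are our own arbitrary choices) verifies that applying f and then g agrees with multiplying once by the product BA:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))   # represents f : R^2 -> R^3
B = rng.standard_normal((4, 3))   # represents g : R^3 -> R^4

x = rng.standard_normal(2)
# g(f(x)) computed by applying the two functions in turn...
composed = B @ (A @ x)
# ...agrees with multiplying x by the single k x n matrix BA.
print(np.allclose(composed, (B @ A) @ x))  # True
```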
5. Linear Functions from R n to R n and Square Matrices

In the remainder of this chapter we look more closely at an important subclass of linear functions and the matrices that represent them, viz. the functions that map R n to itself. From what we have already said we see immediately that the matrix representing such a linear function will have the same number of rows as it has columns. We call such a matrix a square matrix.
If the linear function f : R n → R n is one-to-one and onto then the function f has an inverse f −1 . In Exercise 45 you showed that this function too is linear. A matrix that represents a linear function that is one-to-one and onto is called a nonsingular matrix. Alternatively we can say that an n × n matrix is nonsingular if the rank of the matrix is n. To see that these two statements are equivalent, note first that if f is one-to-one then Ker(f) = {0}. (This is the trivial direction of Exercise 48.) But this means that dim(Ker(f)) = 0 and so dim(Im(f)) = n. And, as we argued at the end of the previous section, this is the same as the rank of the matrix that represents f.

Exercise 48. Show that the linear function f : R n → R m is one-to-one if and only if Ker(f) = {0}.

Exercise 49. Show that the linear function f : R n → R n is one-to-one if and only if it is onto.
6. Inverse Functions and Inverse Matrices

In the previous section we discussed briefly the idea of the inverse of a linear function f : R n → R n . This allows us a very easy definition of the inverse of a square matrix A. The inverse of A is the matrix that represents the linear function that is the inverse of the linear function that A represents. We write the inverse of the matrix A as A −1 . Thus a matrix will have an inverse if and only if the linear function that the matrix represents has an inverse, that is, if and only if the linear function is one-to-one and onto. We saw in the previous section that this will occur if and only if the kernel of the function is {0}, which in turn occurs if and only if the image of f is of full dimension, that is, is all of R n . This is the same as the matrix being of full rank, that is, of rank n.

As with the ideas we have discussed earlier, we can express the idea of a matrix inverse purely in terms of matrices without reference to the linear functions that they represent. Given an n × n matrix A we define the inverse of A to be a matrix B such that BA = I n where I n is the n × n identity matrix discussed in Section 3. Such a matrix B will exist if and only if the matrix A is nonsingular. Moreover, if such a matrix B exists then it is also true that AB = I n , that is, (A −1 ) −1 = A.

In Section 9 we shall see one method for calculating inverses of general n × n matrices. Here we shall simply describe how to calculate the inverse of a 2 × 2 matrix. Suppose that we have the matrix
\[
A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}.
\]
The inverse of this matrix is
\[
A^{-1} = \left( \frac{1}{ad - bc} \right) \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}.
\]
Exercise 50. Show that the matrix A is of full rank if and only if ad − bc ≠ 0.

Exercise 51. Check that the matrix given is, in fact, the inverse of A.
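As a numerical sanity check (a NumPy sketch; the particular entries are an arbitrary example of ours), the formula above can be compared against a library inverse:

```python
import numpy as np

a, b, c, d = 3.0, 1.0, 1.0, 2.0            # an arbitrary example with ad - bc != 0
A = np.array([[a, b], [c, d]])

# The formula from the text: (1/(ad - bc)) [[d, -b], [-c, a]].
A_inv = (1.0 / (a * d - b * c)) * np.array([[d, -b], [-c, a]])

print(np.allclose(A @ A_inv, np.eye(2)))     # True
print(np.allclose(A_inv, np.linalg.inv(A)))  # True: agrees with NumPy's inverse
```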
7. Changes of Basis

We have until now implicitly assumed that there is no ambiguity when we speak of the vector (x 1 , x 2 , . . . , x n ). Sometimes there may indeed be an obvious meaning to such a vector. However when we define a linear space all that is really specified is "what straight lines are" and "where zero is." In particular, we do not necessarily have defined in an unambiguous way "where the axes are" or "what a unit length along each axis is." In other words we may not have a set of basis vectors specified.

Even when we do have, or have decided on, a set of basis vectors we may wish to redefine our description of the linear space with which we are dealing so as to use a different set of basis vectors. Let us suppose that we have an n-dimensional space, even R n say, with a given set of basis vectors v 1 , v 2 , . . . , v n and that we wish instead to describe the space in terms of the linearly independent vectors b 1 , b 2 , . . . , b n where
\[
b_i = b_{1i}v_1 + b_{2i}v_2 + \cdots + b_{ni}v_n.
\]
Now, if we had the description of a point in terms of the new coordinate vectors, e.g., as
\[
z_1 b_1 + z_2 b_2 + \cdots + z_n b_n,
\]
then we can easily convert this to a description in terms of the original basis vectors. We would simply substitute the formula for b i in terms of the v j 's into the previous formula, giving
\[
\left( \sum_{i=1}^n b_{1i}z_i \right) v_1 + \left( \sum_{i=1}^n b_{2i}z_i \right) v_2 + \cdots + \left( \sum_{i=1}^n b_{ni}z_i \right) v_n
\]
or, in our previous notation,
\[
\begin{bmatrix} \sum_{i=1}^n b_{1i}z_i \\ \sum_{i=1}^n b_{2i}z_i \\ \vdots \\ \sum_{i=1}^n b_{ni}z_i \end{bmatrix}.
\]
But this is simply the product
\[
\begin{bmatrix} b_{11} & b_{12} & \dots & b_{1n} \\ b_{21} & b_{22} & \dots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n1} & b_{n2} & \dots & b_{nn} \end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix}.
\]
That is, if we are given an n-tuple of real numbers that describes a vector in terms of the new basis vectors b 1 , b 2 , . . . , b n and we wish to find the n-tuple that describes the vector in terms of the original basis vectors, we simply multiply the n-tuple we are given, written as a column vector, by the matrix whose columns are the new basis vectors b 1 , b 2 , . . . , b n . We shall call this matrix B. We see, among other things, that changing the basis is a linear operation.

Now, if we were given the information in terms of the original basis vectors and wanted to write it in terms of the new basis vectors what should we do? Since we don't have the original basis vectors written in terms of the new basis vectors this is not immediately obvious. However we do know that if we were to do it and then were to carry out the operation described in the previous paragraph we would be back where we started. Further, we know that the operation is a linear operation that maps n-tuples to n-tuples and so is represented by multiplication by an n × n matrix. That is, we multiply the n-tuple, written as a column vector, by the matrix that when multiplied by B gives the identity matrix, that is, the matrix B −1 . If we are given a vector of the form
\[
x_1 v_1 + x_2 v_2 + \cdots + x_n v_n
\]
and we wish to express it in terms of the vectors b 1 , b 2 , . . . , b n we calculate
\[
\begin{bmatrix} b_{11} & b_{12} & \dots & b_{1n} \\ b_{21} & b_{22} & \dots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n1} & b_{n2} & \dots & b_{nn} \end{bmatrix}^{-1}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.
\]
Suppose now that we consider a linear function f : R n → R n and that we have originally described R n in terms of the standard basis vectors e 1 , e 2 , . . . , e n where e i is the vector with 1 in the ith place and zeros elsewhere. Suppose that with these basis vectors f is represented by the matrix
\[
A = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \dots & a_{nn} \end{bmatrix}.
\]
If we now describe R n in terms of the vectors b 1 , b 2 , . . . , b n how will the linear function f be represented? Let us think about what we want. We shall be given a vector described in terms of the basis vectors b 1 , b 2 , . . . , b n and we shall want to know what the image of this vector under the linear function f is, where we shall again want our answer in terms of the basis vectors b 1 , b 2 , . . . , b n . We know how to do this when we are given the description in terms of the vectors e 1 , e 2 , . . . , e n . Thus the first thing we shall do with our vector is to convert it from a description in terms of b 1 , b 2 , . . . , b n to a description in terms of e 1 , e 2 , . . . , e n . We do this by multiplying the n-tuple by the matrix B. Thus if we call our original n-tuple z we shall now have a description of the vector in terms of e 1 , e 2 , . . . , e n , viz. Bz. Given this description we can find the image of the vector in question under f by multiplying by the matrix A. Thus we shall have A(Bz) = (AB)z. Remember however that this will give us the image vector in terms of the basis vectors e 1 , e 2 , . . . , e n . In order to convert this to a description in terms of the vectors b 1 , b 2 , . . . , b n we must multiply by the matrix B −1 . Thus our final n-tuple will be (B −1 AB)z.

Recapitulating, suppose that we know that the linear function f : R n → R n is represented by the matrix A when we describe R n in terms of the standard basis vectors e 1 , e 2 , . . . , e n and that we have a new set of basis vectors b 1 , b 2 , . . . , b n . Then when R n is described in terms of these new basis vectors the linear function f will be represented by the matrix B −1 AB.
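The recapitulated recipe is easy to sketch numerically (NumPy; the matrices here are arbitrary examples of ours, not those of the exercises below):

```python
import numpy as np

A = np.array([[2.0, 1.0], [0.0, 1.0]])   # f in the standard basis
B = np.array([[1.0, 1.0], [1.0, 2.0]])   # columns are the new basis vectors b_1, b_2

A_new = np.linalg.inv(B) @ A @ B          # f in the new basis: B^{-1} A B

# Check on a coordinate vector z: convert to standard coordinates, apply f,
# and convert back; this must agree with applying A_new directly.
z = np.array([1.0, -2.0])
image_in_new_basis = np.linalg.inv(B) @ (A @ (B @ z))
print(np.allclose(image_in_new_basis, A_new @ z))  # True
```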
Exercise 52. Let f : R n → R m be a linear function. Suppose that with the standard bases for R n and R m the function f is represented by the matrix A. Let b 1 , b 2 , . . . , b n be a new set of basis vectors for R n and c 1 , c 2 , . . . , c m be a new set of basis vectors for R m . What is the matrix that represents f when the linear spaces are described in terms of the new basis vectors?
Exercise 53. Let f : R 2 → R 2 be a linear function. Suppose that with the standard basis for R 2 the function f is represented by the matrix
\[
\begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}.
\]
Let
\[
\begin{bmatrix} 3 \\ 2 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 1 \\ 1 \end{bmatrix}
\]
be a new set of basis vectors for R 2 . What is the matrix that represents f when R 2 is described in terms of the new basis vectors?
Properties of a square matrix that depend only on the linear function that the matrix represents, and not on the particular choice of basis vectors for the linear space, are called invariant properties. We have already seen one example of an invariant property, the rank of a matrix. The rank of a matrix is equal to the dimension of the image space of the function that the matrix represents, which clearly depends only on the function and not on the choice of basis vectors for the linear space.

The idea of a property being invariant can also be expressed in terms only of matrices, without reference to the idea of linear functions. A property is invariant if whenever an n × n matrix A has the property then for any nonsingular n × n matrix B the matrix B −1 AB also has the property. We might think of rank as a function that associates to any square matrix a nonnegative integer. We shall say that such a function is an invariant if the property of having the function take a particular value is invariant for all particular values we may choose.
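For instance (a NumPy sketch; the rank-2 matrix and the seed are our own choices), the rank survives conjugation by any nonsingular B:

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.diag([1.0, 2.0, 0.0])               # rank 2: one zero on the diagonal
B = rng.standard_normal((3, 3))            # a random (almost surely nonsingular) B

similar = np.linalg.inv(B) @ A @ B         # B^{-1} A B
print(np.linalg.matrix_rank(A) == np.linalg.matrix_rank(similar))  # True
```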
Two particularly important invariants are the trace of a square matrix and the determinant of a square matrix. We examine these in more detail in the following section.
8. The Trace and the Determinant

In this section we define two important real valued functions on the space of n × n matrices, the trace and the determinant. Both of these concepts have geometric interpretations. However, while the trace is easy to calculate (much easier than the determinant) its geometric interpretation is rather hard to see. Thus we shall not go into it. On the other hand the determinant, while being somewhat harder to calculate, has a very clear geometric interpretation. In Section 9 we shall examine in some detail how to calculate determinants. In this section we shall be content to discuss one definition and the geometric intuition of the determinant.
Given an n × n matrix A, the trace of A, written tr(A), is the sum of the elements on the main diagonal, that is,
\[
\operatorname{tr} \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \dots & a_{nn} \end{bmatrix} = \sum_{i=1}^n a_{ii}.
\]
Exercise 54. For the matrices given in Exercise 53 confirm that tr(A) = tr(B −1 AB).

It is easy to see that the trace is a linear function on the space of all n × n matrices, that is, that for all n × n matrices A and B and for all α ∈ R
(1) tr(A + B) = tr(A) + tr(B),
and
(2) tr(αA) = α tr(A).

We can also see that if A and B are both n × n matrices then tr(AB) = tr(BA). In fact, if A is an m × n matrix and B is an n × m matrix this is still true. This will often be extremely useful in calculating the trace of a product.
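A quick numerical illustration of tr(AB) = tr(BA) for non-square A and B (a NumPy sketch with arbitrary random matrices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 5))   # m x n
B = rng.standard_normal((5, 2))   # n x m

# AB is 2 x 2 while BA is 5 x 5, yet the two traces agree.
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))  # True
```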
Exercise 55. From the definition of matrix multiplication show that if A is an m × n matrix and B is an n × m matrix then tr(AB) = tr(BA). [Hint: Look at the definition of matrix multiplication in Section 3. Then write the trace of the product matrix using summation notation. Finally change the order of summation.]
The determinant, unlike the trace, is not a linear function of the matrix. It does however have some linear structure. If we fix all columns of the matrix except one and look at the determinant as a function of only this column then the determinant is linear in this single column. Moreover this is true whichever column we choose. Let us write the determinant of the n × n matrix A as det(A). Let us also write the matrix A as [a 1 , a 2 , . . . , a n ] where a i is the ith column of the matrix A. Thus our claim is that for all n × n matrices A, for all i = 1, 2, . . . , n, for all n-vectors b, and for all α ∈ R
\[
(3) \quad \det([a_1, \dots, a_{i-1}, a_i + b, a_{i+1}, \dots, a_n]) = \det([a_1, \dots, a_{i-1}, a_i, a_{i+1}, \dots, a_n]) + \det([a_1, \dots, a_{i-1}, b, a_{i+1}, \dots, a_n])
\]
and
\[
(4) \quad \det([a_1, \dots, a_{i-1}, \alpha a_i, a_{i+1}, \dots, a_n]) = \alpha \det([a_1, \dots, a_{i-1}, a_i, a_{i+1}, \dots, a_n]).
\]
We express this by saying that the determinant is a multilinear function.

Also the determinant is such that any n × n matrix that is not of full rank, that is, not of rank n, has a zero determinant. In fact, given that the determinant is a multilinear function, if we simply require that any matrix in which one column is the same as one of its neighbours has a zero determinant, this implies the stronger statement that we made. We already see one use of calculating determinants: a matrix is nonsingular if and only if its determinant is nonzero.
The two properties of being multilinear and of being zero whenever two neighbouring columns are the same already almost uniquely identify the determinant. Notice however that if the determinant satisfies these two properties then so does any constant times the determinant. To uniquely define the determinant we "tie down" this constant by assuming that det(I) = 1.

Though we haven't proved that it is so, these three properties uniquely define the determinant. That is, there is one and only one function with these three properties. We call this function the determinant. In Section 9 we shall discuss a number of other useful properties of the determinant. Remember that these additional properties are not really additional facts about the determinant. They can all be derived from the three properties we have given here.
Let us now look at the geometric interpretation of the determinant. Let us first think about what linear transformations can do to the space R n . Since we have already said that a linear transformation that is not onto is represented by a matrix with a zero determinant, let us think about linear transformations that are onto, that is, that do not map R n into a linear space of lower dimension. Such transformations can rotate the space around zero. They can "stretch" the space in different directions. And they can "flip" the space over. In the latter case all objects will become "mirror images" of themselves. We call linear transformations that make such a mirror image orientation reversing and those that don't orientation preserving. A matrix that represents an orientation preserving linear function has a positive determinant while a matrix that represents an orientation reversing linear function has a negative determinant. Thus we have a geometric interpretation of the sign of the determinant.
The absolute size of the determinant represents how much bigger or smaller the linear function makes objects. More precisely it gives the "volume" of the image of the unit hypercube under the transformation. The word volume is in quotes because it is the volume with which we are familiar only when n = 3. If n = 2 then it is area, while if n > 3 then it is the full dimensional analogue in R n of volume in R 3 .
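For n = 2 this is easy to verify directly (a NumPy sketch; both matrices are our own examples, chosen to avoid those in the exercises): the determinant equals the signed area of the image of the unit square, and a reflection reverses orientation:

```python
import numpy as np

A = np.array([[2.0, 1.0], [0.0, 3.0]])    # an arbitrary 2x2 example
a1, a2 = A[:, 0], A[:, 1]                  # images of the unit vectors e_1, e_2

# Signed area of the parallelogram spanned by a_1 and a_2 (the image of the
# unit square under A).
signed_area = a1[0] * a2[1] - a1[1] * a2[0]
print(np.isclose(signed_area, np.linalg.det(A)))  # True

# A reflection in the line x_2 = x_1 is orientation reversing: its
# determinant is negative.
R = np.array([[0.0, 1.0], [1.0, 0.0]])
print(np.linalg.det(R) < 0)                # True
```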
Exercise 56. Consider the matrix
\[
\begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}.
\]
In a diagram show the image under the linear function that this matrix represents of the unit square, that is, the square whose corners are the points (0,0), (1,0), (0,1), and (1,1). Calculate the area of that image. Do the same for the matrix
\[
\begin{bmatrix} 4 & 1 \\ -1 & 1 \end{bmatrix}.
\]
In the light of Exercise 53, comment on the answers you calculated.
9. Calculating and Using Determinants

We have already used the concepts of the inverse of a matrix and the determinant of a matrix. The purpose of this section is to cover some of the "cookbook" aspects of calculating inverses and determinants.
Suppose that we have an n × n matrix
\[
A = \begin{bmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nn} \end{bmatrix};
\]
then we shall use |A| or
\[
\begin{vmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nn} \end{vmatrix}
\]
as an alternative notation for det(A). Always remember that
\[
\begin{vmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nn} \end{vmatrix}
\]
is not a matrix but rather a real number. For the case n = 2 we define
\[
\det(A) = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{21}a_{12}.
\]
It is possible to also give a convenient formula for the determinant of a 3 × 3 matrix. However, rather than doing this, we shall immediately consider the case of an n × n matrix.
By the minor of an element of the matrix A we mean the determinant (remember, a real number) of the matrix obtained from the matrix A by deleting the row and column containing the element in question. We denote the minor of the element a ij by the symbol |M ij |. Thus, for example,
\[
|M_{11}| = \begin{vmatrix} a_{22} & \dots & a_{2n} \\ \vdots & \ddots & \vdots \\ a_{n2} & \dots & a_{nn} \end{vmatrix}.
\]
Exercise 57. Write out the minors of a general 3 × 3 matrix.

We now define the cofactor of an element to be either plus or minus the minor of the element, being plus if the sum of the indices of the element is even and minus if it is odd. We denote the cofactor of the element a ij by the symbol |C ij |. Thus |C ij | = |M ij | if i + j is even and |C ij | = −|M ij | if i + j is odd. Or,
\[
|C_{ij}| = (-1)^{i+j} |M_{ij}|.
\]
We now define the determinant of an n × n matrix A,
\[
\det(A) = |A| = \begin{vmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nn} \end{vmatrix},
\]
to be \( \sum_{j=1}^n a_{1j}|C_{1j}| \). This is the sum of n terms, each one of which is the product of an element of the first row of the matrix and the cofactor of that element.
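The definition translates directly into a recursive procedure (a Python sketch of ours; the test matrix is an arbitrary example, and np.linalg.det is used only as an independent check):

```python
import numpy as np

def det_by_cofactors(A):
    """Expand along the first row: det(A) = sum_j a_{1j} |C_{1j}|."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        # Minor |M_{1,j+1}|: delete the first row and the (j+1)th column.
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)
        # Cofactor sign: (-1)^{1+(j+1)} = (-1)^j.
        total += A[0, j] * (-1) ** j * det_by_cofactors(minor)
    return total

M = np.array([[2.0, 0.0, 1.0], [1.0, 3.0, 2.0], [0.0, 1.0, 4.0]])
print(det_by_cofactors(M))                                # 21.0
print(np.isclose(det_by_cofactors(M), np.linalg.det(M)))  # True
```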
Exercise 58. Define the determinant of the 1 × 1 matrix [a] to be a. (What else could we define it to be?) Show that the definition given above corresponds with the definition we gave earlier for 2 × 2 matrices.
Exercise 59. Calculate the determinants of the following 3 × 3 matrices.
\[
\text{(a)} \begin{bmatrix} 1 & 2 & 3 \\ 3 & 6 & 9 \\ 4 & 5 & 7 \end{bmatrix} \quad
\text{(b)} \begin{bmatrix} 1 & 5 & 2 \\ 1 & 4 & 3 \\ 0 & 1 & 2 \end{bmatrix} \quad
\text{(c)} \begin{bmatrix} 1 & 1 & 0 \\ 5 & 4 & 1 \\ 2 & 3 & 2 \end{bmatrix}
\]
\[
\text{(d)} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad
\text{(e)} \begin{bmatrix} 2 & 5 & 2 \\ 1 & 5 & 3 \\ 0 & 1 & 3 \end{bmatrix}
\]
Exercise 60. Show that the determinant of the identity matrix, det(I n ), is 1 for all values of n. [Hint: Show that it is true for I 2 . Then show that if it is true for I n−1 then it is true for I n .]
One might ask what was special about the first row, that we took the elements of that row, multiplied them by their cofactors, and added them up. Why not the second row, or the first column? It will follow from a number of properties of determinants we list below that in fact we could have used any row or column and we would have arrived at the same answer.

Exercise 61. Expand the determinant of the matrix given in Exercise 59(b) in terms of the 2nd and 3rd rows and in terms of each column and check that the resulting answer agrees with the answer you obtained originally.
We now have a way of calculating the determinant of any matrix. To find the determinant of an n × n matrix we have to calculate n determinants of size (n − 1) × (n − 1). This is clearly a fairly computationally costly procedure. However there are often ways to economise on the computation.
Exercise 62. Evaluate the determinants of the following matrices
\[
\text{(a)} \begin{bmatrix} 1 & 8 & 0 & 7 \\ 2 & 3 & 4 & 6 \\ 1 & 6 & 0 & -1 \\ 0 & -5 & 0 & 8 \end{bmatrix} \quad
\text{(b)} \begin{bmatrix} 4 & 7 & 0 & 4 \\ 5 & 6 & 1 & 8 \\ 0 & 0 & 9 & 0 \\ 1 & -3 & 1 & 4 \end{bmatrix}
\]
[Hint: Think carefully about which column or row to use in the expansion.]
We shall now list a number of properties of determinants. These properties imply that, as we stated above, it does not matter which row or column we use to expand the determinant. Further, these properties will give us a series of transformations we may perform on a matrix without altering its determinant. This will allow us to calculate a determinant by first transforming the matrix into one whose determinant is easier to calculate and then calculating the determinant of the easier matrix.
Property 1. The determinant of a matrix equals the determinant of its transpose: |A| = |A′|.
Property 2. Interchanging two rows (or two columns) of a matrix changes its sign but not its absolute value. For example,
\[
\begin{vmatrix} c & d \\ a & b \end{vmatrix} = cb - ad = -(ad - cb) = -\begin{vmatrix} a & b \\ c & d \end{vmatrix}.
\]
Property 3. Multiplying one row (or column) of a matrix by a constant λ will change the value of the determinant λ-fold. For example,
\[
\begin{vmatrix} \lambda a_{11} & \dots & \lambda a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nn} \end{vmatrix}
= \lambda \begin{vmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nn} \end{vmatrix}.
\]
Exercise 63. Check Property 3 for the cases n = 2 and n = 3.

Corollary 1. |λA| = λ^n |A| (where A is an n × n matrix).

Corollary 2. |−A| = |A| if n is even. |−A| = −|A| if n is odd.

Property 4. Adding a multiple of any row (column) to any other row (column) does not alter the value of the determinant.
Exercise 64. Check that
\[
\begin{vmatrix} 1 & 5 & 2 \\ 1 & 4 & 3 \\ 0 & 1 & 2 \end{vmatrix}
= \begin{vmatrix} 1 & 5 + 3 \times 2 & 2 \\ 1 & 4 + 3 \times 3 & 3 \\ 0 & 1 + 3 \times 2 & 2 \end{vmatrix}
= \begin{vmatrix} 1 + (-2) \times 1 & 5 + (-2) \times 4 & 2 + (-2) \times 3 \\ 1 & 4 & 3 \\ 0 & 1 & 2 \end{vmatrix}.
\]
Property 5. If one row (or column) is a constant times another row (or column) then the determinant of the matrix is zero.

Exercise 65. Show that Property 5 follows from Properties 3 and 4.

We can strengthen Property 5 to obtain the following.

Property 5′. The determinant of a matrix is zero if and only if the matrix is not of full rank.

Exercise 66. Explain why Property 5′ is a strengthening of Property 5, that is, why 5′ implies 5.
These properties allow us to calculate determinants more easily. Given an n × n matrix A, the basic strategy one follows is to use the above properties, particularly Property 4, to find a matrix with the same determinant as A in which one row (or column) has only one non-zero element. Then, rather than calculating n determinants of size (n − 1) × (n − 1), one only needs to calculate one. One then does the same thing for the (n − 1) × (n − 1) determinant that needs to be calculated, and so on.
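This strategy is exactly what a computer does. Below is a sketch (our own Python implementation, applied to an arbitrary example matrix of ours) that reduces a matrix to triangular form using Property 4, tracking sign changes from row swaps (Property 2):

```python
import numpy as np

def det_by_elimination(A):
    """Reduce A to upper triangular form; Property 4 leaves the determinant
    unchanged, and each row swap (Property 2) flips its sign."""
    A = np.asarray(A, dtype=float).copy()
    n = A.shape[0]
    sign = 1.0
    for k in range(n):
        pivot = k + np.argmax(np.abs(A[k:, k]))
        if np.isclose(A[pivot, k], 0.0):
            return 0.0                        # no pivot: the matrix is not of full rank
        if pivot != k:
            A[[k, pivot]] = A[[pivot, k]]     # row swap changes the sign
            sign = -sign
        for i in range(k + 1, n):
            A[i] -= (A[i, k] / A[k, k]) * A[k]   # Property 4: determinant unchanged
    return sign * float(np.prod(np.diag(A)))     # product of the pivots, sign-adjusted

M = np.array([[1.0, 2.0, 0.0], [3.0, 1.0, 4.0], [2.0, 0.0, 5.0]])
print(np.isclose(det_by_elimination(M), np.linalg.det(M)))  # True
```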
There are a number of reasons we are interested in determinants. One is that they give us one method of calculating the inverse of a nonsingular matrix. (Recall that there is no inverse of a singular matrix.) They also give us a method, known as Cramer's Rule, for solving systems of linear equations. Before proceeding with this it is useful to state one further property of determinants.
Property 6. If one expands a matrix in terms of one row (or column) and the cofactors of a different row (or column) then the answer is always zero. That is,
\[
\sum_{j=1}^n a_{ij}|C_{kj}| = 0
\]
whenever i ≠ k. Also
\[
\sum_{i=1}^n a_{ij}|C_{ik}| = 0
\]
whenever j ≠ k.
Exercise 67. Verify Property 6 for the matrix
\[
\begin{bmatrix} 4 & 1 & 2 \\ 5 & 2 & 1 \\ 1 & 0 & 3 \end{bmatrix}.
\]
Let us define the matrix of cofactors C to be the matrix [|C ij |] whose ijth element is the cofactor of the ijth element of A. Now we define the adjoint matrix of A to be the transpose of the matrix of cofactors of A. That is,
\[
\operatorname{adj}(A) = C'.
\]
It is straightforward to see (using Property 6) that A adj(A) = |A| I n = adj(A) A. That is,
\[
A^{-1} = \frac{1}{|A|} \operatorname{adj}(A).
\]
Notice that this is well defined if and only if |A| ≠ 0. We now have a method of finding the inverse of any nonsingular square matrix.
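Sketched in Python (our own code; the example matrix is arbitrary, and the minors' determinants are computed with np.linalg.det for brevity):

```python
import numpy as np

def adjugate(A):
    """adj(A): the transpose of the matrix of cofactors of A."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    C = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return C.T

A = np.array([[3.0, 0.0, 2.0], [2.0, 0.0, -2.0], [0.0, 1.0, 1.0]])
A_inv = adjugate(A) / np.linalg.det(A)      # A^{-1} = adj(A) / |A|
print(np.allclose(A @ A_inv, np.eye(3)))    # True
```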
Exercise 68. Use this method to find the inverses of the following matrices
\[
\text{(a)} \begin{bmatrix} 3 & -1 & 2 \\ 1 & 0 & 3 \\ 4 & 0 & 2 \end{bmatrix} \quad
\text{(b)} \begin{bmatrix} 4 & -2 & 1 \\ 7 & 3 & 3 \\ 2 & 0 & 1 \end{bmatrix} \quad
\text{(c)} \begin{bmatrix} 1 & 5 & 2 \\ 1 & 4 & 3 \\ 0 & 1 & 2 \end{bmatrix}.
\]
Knowing how to invert matrices we thus know how to solve a system of n linear equations in n unknowns. For we can express the n equations in matrix notation as Ax = b where A is an n × n matrix of coefficients, x is an n × 1 vector of unknowns, and b is an n × 1 vector of constants. Thus we can solve the system of equations as x = A −1 Ax = A −1 b.

Sometimes, particularly if we are not interested in all of the x's, it is convenient to use another method of solving the equations. This method is known as Cramer's Rule. Let us suppose that we wish to solve the above system of equations, that is, Ax = b. Let us define the matrix A i to be the matrix obtained from A by replacing the ith column of A by the vector b. Then the solution is given by
\[
x_i = \frac{|A_i|}{|A|}.
\]
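The rule is straightforward to sketch in code (our own Python example; the 2 × 2 system is arbitrary and is not one of those in Exercise 70):

```python
import numpy as np

def cramer_solve(A, b):
    """Solve Ax = b by Cramer's Rule: x_i = |A_i| / |A|."""
    A = np.asarray(A, dtype=float)
    det_A = np.linalg.det(A)
    x = np.empty(A.shape[0])
    for i in range(A.shape[0]):
        A_i = A.copy()
        A_i[:, i] = b                       # replace the ith column of A by b
        x[i] = np.linalg.det(A_i) / det_A
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([5.0, 10.0])
print(cramer_solve(A, b))                   # [1. 3.]
```

The answer agrees with solving by matrix inversion, x = A −1 b.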
Exercise 69. Derive Cramer's Rule. [Hint: We know that the solution to the system of equations is given by x = (1/|A|) adj(A) b. This gives a formula for x i . Show that this formula is the same as that given by x i = |A i |/|A|.]
Exercise 70. Solve the following system of equations (i) by matrix inversion<br />
and (ii) by Cramer’s Rule<br />
(a)<br />
2x 1 − x 2 = 2<br />
3x 2 + 2x 3 = 16<br />
5x 1 + 3x 3 = 21<br />
(b)<br />
−x 1 + x 2 + x 3 = 1<br />
x 1 − x 2 + x 3 = 1<br />
x 1 + x 2 + x 3 = 1.
Exercise 71. Recall that we claimed that the determinant was an invariant.<br />
Confirm this by calculating (directly) det(A) and det(B −1 AB) where<br />
B = ⎡ 1  0  1 ⎤   and   A = ⎡ 1 0 0 ⎤<br />
    ⎢ 1 −1  2 ⎥           ⎢ 0 2 0 ⎥<br />
    ⎣ 2  1 −1 ⎦           ⎣ 0 0 3 ⎦ .<br />
Exercise 72. An nth order determinant of the form<br />
∣ a 11   0     0   . . .   0  ∣<br />
∣ a 21  a 22   0   . . .   0  ∣<br />
∣ a 31  a 32  a 33 . . .   0  ∣<br />
∣  ⋮     ⋮     ⋮    ⋱      ⋮  ∣<br />
∣ a n1  a n2  a n3 . . . a nn ∣<br />
is called triangular. Evaluate this determinant. [Hint: Expand the determinant in<br />
terms of its first row. Expand the resulting (n − 1) × (n − 1) determinant in terms<br />
of its first row, and so on.]<br />
10. Eigenvalues and Eigenvectors<br />
Suppose that we have a linear function f : R n → R n . When we look at<br />
how f deforms R n one natural question to look at is: Where does f send some<br />
linear subspace? In particular we might ask if there are any linear subspaces that<br />
f maps to themselves. We call such linear subspaces invariant linear subspaces.<br />
Of course the space R n itself and the zero dimensional space {0} are invariant<br />
linear subspaces. The real question is whether there are any others. Clearly, for<br />
some linear transformations there are no other invariant subspaces. For example,<br />
a clockwise rotation of π/4 in R 2 has no invariant subspaces other than R 2 itself<br />
and {0}.<br />
A particularly important class of invariant linear subspaces are the one dimensional<br />
ones. A one dimensional linear subspace is specified by one nonzero vector,<br />
say ¯x. Then the subspace is {λ¯x | λ ∈ R}. Let us call this subspace L(¯x). If L(¯x)<br />
is an invariant linear subspace of f and if x ∈ L(¯x) then there is some value λ such<br />
that f(x) = λx. Moreover the value of λ for which this is true will be the same<br />
whatever value of x we choose in L(¯x).<br />
Now if we fix the set of basis vectors and thus the matrix A that represents f<br />
we have that if x is in a one dimensional invariant linear subspace of f then there<br />
is some λ ∈ R such that<br />
Ax = λx.<br />
Again we can define this notion without reference to linear functions. Given a<br />
matrix A, if we can find a pair x, λ with x ≠ 0 that satisfies the above equation we<br />
call x an eigenvector of the matrix A and λ the associated eigenvalue. (Sometimes<br />
these are called characteristic vectors and values.)<br />
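For a 2 × 2 matrix the defining equation Ax = λx can be verified directly: the eigenvalues are the roots of the quadratic characteristic polynomial, and an eigenvector can be read off from a row of A − λI. The following sketch (example matrix assumed, not from the text) does exactly that.

```python
# For a 2x2 matrix the eigenvalues solve the characteristic equation
# lam^2 - tr(A)*lam + det(A) = 0, and an eigenvector for each eigenvalue
# can be read off from the first row of A - lam*I.
import math

A = [[2.0, 1.0],
     [1.0, 2.0]]   # a symmetric example, so the eigenvalues are real

tr = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
disc = tr * tr - 4.0 * det          # discriminant of the characteristic polynomial
lams = [(tr + math.sqrt(disc)) / 2.0, (tr - math.sqrt(disc)) / 2.0]

for lam in lams:
    # (A - lam*I)x = 0: the first row gives (a11 - lam)x1 + a12*x2 = 0,
    # so x = (a12, lam - a11) is an eigenvector (when it is nonzero).
    x = (A[0][1], lam - A[0][0])
    Ax = (A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1])
    # Verify Ax = lam*x componentwise.
    assert abs(Ax[0] - lam * x[0]) < 1e-9 and abs(Ax[1] - lam * x[1]) < 1e-9
```

Here the eigenvalues are 3 and 1 with eigenvectors (1, 1) and (1, −1): the two one dimensional invariant subspaces are the diagonals of the plane.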
Exercise 73. Show that the eigenvalues of a matrix are an invariant, that<br />
is, that they depend only on the linear function the matrix represents and not on<br />
the choice of basis vectors. Show also that the eigenvectors of a matrix are not<br />
an invariant. Explain why the dependence of the eigenvectors on the particular<br />
basis is exactly what we would expect and argue that in some sense they are indeed<br />
invariant.<br />
Now we can rewrite the equation Ax = λx as<br />
(A − λI n )x = 0.
If x, λ solve this equation and x ≠ 0 then we have a nonzero linear combination of<br />
the columns of A − λI n equal to zero. This means that the columns of A − λI n are<br />
not linearly independent and so det(A − λI n ) = 0, that is,<br />
det ⎡ a 11 − λ    a 12    . . .    a 1n   ⎤<br />
    ⎢  a 21    a 22 − λ  . . .    a 2n   ⎥<br />
    ⎢   ⋮          ⋮       ⋱        ⋮    ⎥<br />
    ⎣  a n1      a n2    . . . a nn − λ  ⎦ = 0.<br />
Now, the left hand side of this last equation is a polynomial of degree n in<br />
λ, that is, a polynomial in λ in which n is the highest power of λ that appears<br />
with nonzero coefficient. It is called the characteristic polynomial and the equation<br />
is called the characteristic equation. Now this equation may, or may not, have a<br />
solution in real numbers. In general, by the fundamental theorem of algebra the<br />
equation has n solutions, perhaps not all distinct, in the complex numbers. If the<br />
matrix A happens to be symmetric (that is, if a ij = a ji for all i and j) then all of<br />
its eigenvalues are real. If the eigenvalues are all distinct (that is, different from<br />
each other) then we are in a particularly well behaved situation. As a prelude we<br />
state the following result.<br />
Theorem 5. Given an n × n matrix A, suppose that we have m eigenvectors<br />
x 1 , x 2 , . . . , x m of A with corresponding eigenvalues λ 1 , λ 2 , . . . , λ m . If λ i ≠ λ j whenever<br />
i ≠ j then x 1 , x 2 , . . . , x m are linearly independent.<br />
An implication of this theorem is that an n × n matrix cannot have more than<br />
n eigenvectors with distinct eigenvalues. Further this theorem allows us to see that<br />
if an n × n matrix has n distinct eigenvalues then it is possible to find a basis<br />
for R n in which the linear function that the matrix represents is represented by<br />
a diagonal matrix. Equivalently we can find a matrix B such that B −1 AB is a<br />
diagonal matrix.<br />
To see this let b 1 , b 2 , . . . , b n be n linearly independent eigenvectors with associated<br />
eigenvalues λ 1 , λ 2 , . . . , λ n . Let B be the matrix whose columns are the vectors<br />
b 1 , b 2 , . . . , b n . Since these vectors are linearly independent the matrix B has an<br />
inverse. Now<br />
B −1 AB = B −1 [Ab 1 Ab 2 . . . Ab n ]<br />
= B −1 [λ 1 b 1 λ 2 b 2 . . . λ n b n ]<br />
= [λ 1 B −1 b 1 λ 2 B −1 b 2 . . . λ n B −1 b n ]<br />
= ⎡ λ 1   0  . . .  0  ⎤<br />
  ⎢  0   λ 2 . . .  0  ⎥<br />
  ⎢  ⋮    ⋮    ⋱    ⋮  ⎥<br />
  ⎣  0    0  . . . λ n ⎦ .<br />
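The calculation above can be checked numerically. The sketch below (example matrix assumed, not from the text) builds B from two eigenvectors of a 2 × 2 matrix and confirms that B −1 AB comes out diagonal with the eigenvalues on the diagonal.

```python
# A sketch of the diagonalisation argument for a 2x2 case: with B built
# from linearly independent eigenvectors, B^-1 A B is diagonal.
A = [[4.0, 1.0],
     [2.0, 3.0]]
# Eigenvalues of A: lam^2 - 7*lam + 10 = 0, so lam = 5 and lam = 2 (distinct).
# Eigenvectors from the first row of A - lam*I: x = (a12, lam - a11).
b1 = (1.0, 1.0)    # eigenvector for lam = 5
b2 = (1.0, -2.0)   # eigenvector for lam = 2
B = [[b1[0], b2[0]], [b1[1], b2[1]]]   # columns are the eigenvectors

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(M):
    # Inverse of a 2x2 matrix via the adjoint formula.
    d = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

D = matmul(inv2(B), matmul(A, B))
# D should be diag(5, 2): off-diagonal entries (numerically) zero.
assert abs(D[0][0] - 5.0) < 1e-9 and abs(D[1][1] - 2.0) < 1e-9
assert abs(D[0][1]) < 1e-9 and abs(D[1][0]) < 1e-9
```

The same check with any other basis B of non-eigenvectors would leave off-diagonal entries, which is the point of choosing eigenvectors as the new basis.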
CHAPTER 3<br />
Consumer Behaviour: Optimisation Subject to the<br />
Budget Constraint<br />
1. Constrained Maximisation<br />
1.1. Lagrange Multipliers. Consider the problem of a consumer who seeks<br />
to distribute his income across the purchase of the two goods that he consumes,<br />
subject to the constraint that he spends no more than his total income. Let us<br />
denote the amount of the first good that he buys x 1 and the amount of the second<br />
good x 2 , the prices of the two goods p 1 and p 2 , and the consumer’s income y.<br />
The utility that the consumer obtains from consuming x 1 units of good 1 and x 2<br />
units of good 2 is denoted u(x 1 , x 2 ). Thus the consumer’s problem is to maximise<br />
u(x 1 , x 2 ) subject to the constraint that p 1 x 1 + p 2 x 2 ≤ y. (We shall soon write<br />
p 1 x 1 + p 2 x 2 = y, i.e., we shall assume that the consumer must spend all of his<br />
income.) Before discussing the solution of this problem let’s write it in a more<br />
“mathematical” way.<br />
(5)    max x 1 ,x 2  u(x 1 , x 2 )   subject to   p 1 x 1 + p 2 x 2 = y<br />
We read this “Choose x 1 and x 2 to maximise u(x 1 , x 2 ) subject to the constraint<br />
that p 1 x 1 + p 2 x 2 = y.”<br />
Let us assume, as usual, that the indifference curves (i.e., the sets of points<br />
(x 1 , x 2 ) for which u(x 1 , x 2 ) is a constant) are convex to the origin. Let us also<br />
assume that the indifference curves are nice and smooth. Then the point (x ∗ 1, x ∗ 2)<br />
that solves the maximisation problem (5) is the point at which the indifference<br />
curve is tangent to the budget line as given in Figure 1.<br />
One thing we can say about the solution is that at the point (x ∗ 1, x ∗ 2) it must be<br />
true that the marginal utility with respect to good 1 divided by the price of good 1<br />
must equal the marginal utility with respect to good 2 divided by the price of good<br />
2. For if this were not true then the consumer could, by decreasing the consumption<br />
of the good for which this ratio was lower and increasing the consumption of the<br />
other good, increase his utility. Marginal utilities are, of course, just the partial<br />
derivatives of the utility function. Thus we have<br />
(6)    [∂u/∂x 1 (x ∗ 1 , x ∗ 2 )] / p 1 = [∂u/∂x 2 (x ∗ 1 , x ∗ 2 )] / p 2 .<br />
The argument we have just made seems very “economic.” It is easy to give an<br />
alternate argument that does not explicitly refer to the economic intuition. Let x u 2<br />
be the function that defines the indifference curve through the point (x ∗ 1, x ∗ 2), i.e.,<br />
u(x 1 , x u 2(x 1 )) ≡ ū ≡ u(x ∗ 1, x ∗ 2).<br />
Now, totally differentiating this identity gives<br />
∂u/∂x 1 (x 1 , x u 2 (x 1 )) + ∂u/∂x 2 (x 1 , x u 2 (x 1 )) · dx u 2 /dx 1 (x 1 ) = 0.<br />
[Figure 1: The budget line p 1 x 1 + p 2 x 2 = y and the indifference curve u(x 1 , x 2 ) = ū, tangent at the point (x ∗ 1 , x ∗ 2 ).]<br />
That is,<br />
dx u 2 /dx 1 (x 1 ) = − [∂u/∂x 1 (x 1 , x u 2 (x 1 ))] / [∂u/∂x 2 (x 1 , x u 2 (x 1 ))] .<br />
Now x u 2 (x ∗ 1 ) = x ∗ 2 . Thus the slope of the indifference curve at the point (x ∗ 1 , x ∗ 2 ) is<br />
dx u 2 /dx 1 (x ∗ 1 ) = − [∂u/∂x 1 (x ∗ 1 , x ∗ 2 )] / [∂u/∂x 2 (x ∗ 1 , x ∗ 2 )] .<br />
Also, the slope of the budget line is −p 1 /p 2 . Combining these two results again gives<br />
result (6).<br />
Since we also have another equation that (x ∗ 1, x ∗ 2) must satisfy, viz<br />
(7) p 1 x ∗ 1 + p 2 x ∗ 2 = y<br />
we have two equations in two unknowns and we can (if we know what the utility<br />
function is and what p 1 , p 2 , and y are) go happily away and solve the problem.<br />
(This isn’t quite true but we shall not go into that at this point.) What we shall<br />
develop is a systematic and useful way to obtain the conditions (6) and (7). Let us<br />
first denote the common value of the ratios in (6) by λ. That is,<br />
[∂u/∂x 1 (x ∗ 1 , x ∗ 2 )] / p 1 = λ = [∂u/∂x 2 (x ∗ 1 , x ∗ 2 )] / p 2<br />
and we can rewrite this and (7) as<br />
(8)    ∂u/∂x 1 (x ∗ 1 , x ∗ 2 ) − λp 1 = 0<br />
       ∂u/∂x 2 (x ∗ 1 , x ∗ 2 ) − λp 2 = 0<br />
       y − p 1 x ∗ 1 − p 2 x ∗ 2 = 0.<br />
Now we have three equations in x ∗ 1, x ∗ 2, and the new artificial or auxiliary variable<br />
λ. Again we can, perhaps, solve these equations for x ∗ 1, x ∗ 2, and λ. Consider the<br />
following function<br />
(9) L(x 1 , x 2 , λ) = u(x 1 , x 2 ) + λ(y − p 1 x 1 − p 2 x 2 )<br />
This function is known as the Lagrangian. Now, if we calculate ∂L/∂x 1 , ∂L/∂x 2 , and ∂L/∂λ,<br />
and set the results equal to zero we obtain exactly the equations given in (8). We<br />
now describe this technique in a somewhat more general way.<br />
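The first order conditions (6) and (7) can be verified numerically for a concrete utility function. In the sketch below the utility function u(x 1 , x 2 ) = (x 1 x 2 ) 1/2 , the prices, and the income are all assumed for illustration (they are not from the text); for this u the demands are x 1 = y/(2p 1 ) and x 2 = y/(2p 2 ), and the code checks that the two marginal-utility-to-price ratios agree (their common value is λ) and that the budget constraint holds.

```python
# Numerical check of the first order conditions at the maximiser, for the
# assumed utility u(x1, x2) = sqrt(x1 * x2) and assumed prices and income.
import math

p1, p2, y = 2.0, 5.0, 40.0

def u(x1, x2):
    return math.sqrt(x1 * x2)

def partial(f, i, x1, x2, h=1e-6):
    # Central-difference approximation to the partial derivative of f.
    if i == 1:
        return (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
    return (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)

# For u = (x1*x2)^(1/2) the demands split income equally between the goods:
# x1* = y/(2*p1), x2* = y/(2*p2).  We take these demands as given here.
x1s, x2s = y / (2 * p1), y / (2 * p2)

lam1 = partial(u, 1, x1s, x2s) / p1    # marginal utility of good 1 per dollar
lam2 = partial(u, 2, x1s, x2s) / p2    # marginal utility of good 2 per dollar
assert abs(lam1 - lam2) < 1e-6              # condition (6): the ratios agree
assert abs(p1 * x1s + p2 * x2s - y) < 1e-9  # condition (7): the budget holds
```

Perturbing x1s or x2s away from the demand bundle (while staying on the budget line) makes the two ratios diverge, which is exactly the "economic" argument given above.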
Suppose that we have the following maximisation problem<br />
(10)    max x 1 ,...,x n  f(x 1 , . . . , x n )   subject to   g(x 1 , . . . , x n ) = c<br />
and we let<br />
(11) L(x 1 , . . . , x n , λ) = f(x 1 , . . . , x n ) + λ(c − g(x 1 , . . . , x n ))<br />
then if (x ∗ 1 , . . . , x ∗ n ) solves (10) there is a value of λ, say λ ∗ , such that<br />
(12)    ∂L/∂x i (x ∗ 1 , . . . , x ∗ n , λ ∗ ) = 0,    i = 1, . . . , n<br />
(13)    ∂L/∂λ (x ∗ 1 , . . . , x ∗ n , λ ∗ ) = 0.<br />
Notice that the conditions (12) are precisely the first order conditions for<br />
choosing x 1 , . . . , x n to maximise L, once λ ∗ has been chosen. This provides an<br />
intuition into this method of solving the constrained maximisation problem. In<br />
the constrained problem we have told the decision maker that he must satisfy<br />
g(x 1 , . . . , x n ) = c and that he should choose among all points that satisfy this constraint<br />
the point at which f(x 1 , . . . , x n ) is greatest. We arrive at the same answer<br />
if we tell the decision maker to choose any point he wishes but that for each unit by<br />
which he violates the constraint g(x 1 , . . . , x n ) = c we shall take away λ units from<br />
his payoff. Of course we must be careful to choose λ to be the correct value. If we<br />
choose λ too small the decision maker may choose to violate his constraint—e.g.,<br />
if we made the penalty for spending more than the consumer’s income very small<br />
the consumer would choose to consume more goods than he could afford and to<br />
pay the penalty in utility terms. On the other hand if we choose λ too large the<br />
decision maker may violate his constraint in the other direction, e.g., the consumer<br />
would choose not to spend any of his income and just receive λ units of utility for<br />
each unit of his income.<br />
It is possible to give a more general statement of this technique, allowing for<br />
multiple constraints. (Of course, we should always have fewer constraints than we<br />
have variables.) Suppose we have more than one constraint. Consider the problem<br />
max x 1 ,...,x n  f(x 1 , . . . , x n )<br />
subject to g 1 (x 1 , . . . , x n ) = c 1<br />
⋮<br />
g m (x 1 , . . . , x n ) = c m .<br />
Again we construct the Lagrangian<br />
(14)<br />
L(x 1 , . . . , x n , λ 1 , . . . , λ m ) = f(x 1 , . . . , x n )<br />
+ λ 1 (c 1 − g 1 (x 1 , . . . , x n )) + · · · + λ m (c m − g m (x 1 , . . . , x n ))
and again if (x ∗ 1 , . . . , x ∗ n ) solves this problem there are values of the multipliers, say λ ∗ 1 , . . . , λ ∗ m , such that<br />
(15)    ∂L/∂x i (x ∗ 1 , . . . , x ∗ n , λ ∗ 1 , . . . , λ ∗ m ) = 0,    i = 1, . . . , n<br />
        ∂L/∂λ j (x ∗ 1 , . . . , x ∗ n , λ ∗ 1 , . . . , λ ∗ m ) = 0,    j = 1, . . . , m.<br />
1.2. Caveats and Extensions. Notice that we have been referring to the set<br />
of conditions which a solution to the maximisation problem must satisfy. (We call<br />
such conditions necessary conditions.) So far we have not even claimed that there<br />
necessarily is a solution to the maximisation problem. There are many examples of<br />
maximisation problems which have no solution. One example of an unconstrained<br />
problem with no solution is<br />
(16)    max x 2x,<br />
that is, maximise over the choice of x the function 2x. Clearly the greater we make x the<br />
greater is 2x, and so, since there is no upper bound on x there is no maximum.<br />
Thus we might want to restrict maximisation problems to those in which we choose<br />
x from some bounded set. Again, this is not enough. Consider the problem<br />
(17)    max 0≤x≤1 1/x.<br />
The smaller we make x the greater is 1/x and yet at zero 1/x is not even defined.<br />
We could define the function to take on some value at zero, say 7. But then the<br />
function would not be continuous. Or we could leave zero out of the feasible set<br />
for x, say 0 < x ≤ 1. Then the set of feasible x is not closed. Since there would<br />
obviously still be no solution to the maximisation problem in these cases we shall<br />
want to restrict maximisation problems to those in which we choose x to maximise<br />
some continuous function from some closed (and because of the previous example)<br />
bounded set. (We call a set of numbers, or more generally a set of vectors, that<br />
is both closed and bounded a compact set.) Is there anything else that could go<br />
wrong? No! The following result says that if the function to be maximised is<br />
continuous and the set over which we are choosing is both closed and bounded, i.e.,<br />
is compact, then there is a solution to the maximisation problem.<br />
Theorem 6 (The Weierstrass Theorem). Let S be a compact set. Let f be a<br />
continuous function that takes each point in S to a real number. (We usually write:<br />
let f : S → R be continuous.) Then there is some x ∗ in S at which the function is<br />
maximised. More precisely, there is some x ∗ in S such that f(x ∗ ) ≥ f(x) for any<br />
x in S.<br />
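The content of the Weierstrass Theorem can be illustrated by brute force: on a compact set a continuous function really does attain its maximum, and a fine grid over the set gets arbitrarily close to it. The function and the set below are assumed for illustration (not from the text); a grid search is only an approximation device, not part of the theorem.

```python
# Illustration of the Weierstrass Theorem: the continuous function
# f(x) = x(1 - x) on the compact set S = [0, 1] attains its maximum,
# at x* = 1/2 with f(x*) = 1/4.  A fine grid over S locates it.
def f(x):
    return x * (1.0 - x)

n = 10_000
grid = [i / n for i in range(n + 1)]   # a grid over the compact set [0, 1]
x_star = max(grid, key=f)

assert abs(x_star - 0.5) < 1e-9
assert abs(f(x_star) - 0.25) < 1e-9
```

Dropping compactness breaks this: on (0, 1] the function 1/x has no maximiser, and the grid maximum would simply grow without bound as the grid approaches 0.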
Notice that in defining such compact sets we typically use inequalities, such<br />
as x ≥ 0. In Section 1, however, we did not consider such constraints, but rather<br />
considered only equality constraints. Yet even in the example of utility maximisation<br />
at the beginning of this chapter, there were implicitly constraints on x 1<br />
and x 2 of the form<br />
x 1 ≥ 0, x 2 ≥ 0.<br />
A truly satisfactory treatment would make such constraints explicit. It is possible<br />
to explicitly treat the maximisation problem with inequality constraints, at the<br />
price of a little additional complexity. We shall return to this question later in the<br />
book.<br />
Also, notice that had we wished to solve a minimisation problem we could<br />
have transformed the problem into a maximisation problem by simply multiplying<br />
the objective function by −1. That is, if we wish to minimise f(x) we could do<br />
so by maximising −f(x). As an exercise write out the conditions analogous to
the conditions (8) for the case that we wanted to minimise u(x). Notice that if<br />
x ∗ 1, x ∗ 2, and λ satisfy the original equations then x ∗ 1, x ∗ 2, and −λ satisfy the new<br />
equations. Thus we cannot tell whether there is a maximum at (x ∗ 1, x ∗ 2) or a<br />
minimum. This corresponds to the fact that in the case of a function of a single<br />
variable over an unconstrained domain at a maximum we require the first derivative<br />
to be zero, but that to know for sure that we have a maximum we must look at the<br />
second derivative. We shall not develop the analogous conditions for the constrained<br />
problem with many variables here. However, again, we shall return to it later in<br />
the book.<br />
2. The Implicit Function Theorem<br />
In the previous section we said things like: “Now we have three equations<br />
in x ∗ 1, x ∗ 2, and the new artificial or auxiliary variable λ. Again we can, perhaps,<br />
solve these equations for x ∗ 1, x ∗ 2, and λ.” In this section we examine the question<br />
of when we can solve a system of n equations to give n of the variables in terms<br />
of the others. Let us suppose that we have n endogenous variables x 1 , . . . , x n ,<br />
m exogenous variables or parameters, b 1 , . . . , b m , and n equations or equilibrium<br />
conditions<br />
(18)    f 1 (x 1 , . . . , x n , b 1 , . . . , b m ) = 0<br />
        f 2 (x 1 , . . . , x n , b 1 , . . . , b m ) = 0<br />
        ⋮<br />
        f n (x 1 , . . . , x n , b 1 , . . . , b m ) = 0,<br />
or, using vector notation,<br />
f(x, b) = 0,<br />
where f : R n+m → R n , x ∈ R n (that is, x is an n vector), b ∈ R m , and 0 ∈ R n .<br />
When can we solve this system to obtain functions giving each x i as a function<br />
of b 1 , . . . , b m ? As we’ll see below we only give an incomplete answer to this question,<br />
but first let’s look at the case that the function f is a linear function.<br />
Suppose that our equations are<br />
a 11 x 1 + · · · + a 1n x n + c 11 b 1 + · · · + c 1m b m = 0<br />
a 21 x 1 + · · · + a 2n x n + c 21 b 1 + · · · + c 2m b m = 0<br />
⋮<br />
a n1 x 1 + · · · + a nn x n + c n1 b 1 + · · · + c nm b m = 0.<br />
We can write this, in matrix notation, as<br />
[A | C] ⎡ x ⎤ = 0,<br />
        ⎣ b ⎦<br />
where A is an n × n matrix, C is an n × m matrix, x is an n × 1 (column) vector,<br />
and b is an m × 1 vector.<br />
This we can rewrite as<br />
Ax + Cb = 0,<br />
and solve this to give<br />
x = −A −1 Cb.<br />
And we can do this as long as the matrix A can be inverted, that is, as long as the<br />
matrix A is of full rank.<br />
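The linear case is a one-line computation once A is inverted. The sketch below (matrices assumed for illustration, not from the text) carries it out for n = 2, m = 1 with exact rational arithmetic and checks that Ax + Cb = 0 holds at the solution x = −A −1 Cb.

```python
# The linear case: solve Ax + Cb = 0 as x = -A^-1 C b for an assumed
# example with n = 2 endogenous variables and m = 1 parameter.
from fractions import Fraction as F

A = [[F(2), F(1)], [F(1), F(3)]]    # n x n, full rank (det = 5)
C = [[F(4)], [F(-1)]]               # n x m
b = [F(2)]                          # the parameter vector

# Invert the 2x2 matrix A directly via the adjoint formula.
dA = A[0][0] * A[1][1] - A[0][1] * A[1][0]
Ainv = [[A[1][1] / dA, -A[0][1] / dA], [-A[1][0] / dA, A[0][0] / dA]]

# x = -A^-1 C b
Cb = [sum(C[i][j] * b[j] for j in range(1)) for i in range(2)]
x = [-sum(Ainv[i][k] * Cb[k] for k in range(2)) for i in range(2)]

# Check that the original equations Ax + Cb = 0 hold exactly.
for i in range(2):
    assert sum(A[i][k] * x[k] for k in range(2)) + Cb[i] == 0
```

Because the solution is x = −A −1 Cb, each x i here is a linear function of the parameters b, which is exactly what fails to hold globally in the nonlinear case discussed next.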
Our answer to the general question in which the function f may not be linear<br />
is that if there are some values (¯x, ¯b) for which f(¯x, ¯b) = 0, and if we can solve the<br />
linear approximation to f around (¯x, ¯b) as we did<br />
above, then we can solve the true nonlinear system, at least in a neighbourhood of<br />
(¯x, ¯b). By this last phrase we mean that if b is not close to ¯b we may not be able to<br />
solve the system, and that for a particular value of b there may be many values of<br />
x that solve the system, but there is only one close to ¯x.<br />
To see why we can’t, in general, do better than this consider the function<br />
f : R 2 → R given by f(x, b) = g(x) − b, where the function g is graphed in Figure 2.<br />
Notice that the values (¯x, ¯b) satisfy the equation f(x, b) = 0. For all values of b<br />
close to ¯b we can find a unique value of x close to ¯x such that f(x, b) = 0. However,<br />
(1) for each value of b there are other values of x far away from ¯x that also satisfy<br />
f(x, b) = 0, and (2) there are values of b, such as ˜b for which there are no values of<br />
x that satisfy f(x, b) = 0.<br />
[Figure 2: The graph of g, which takes the value ¯b at ¯x (and at other points far from ¯x), and a value ˜b that g never attains.]<br />
Let us consider again the system of equations (18). We say that the function f<br />
is C 1 on some open set A ⊂ R n+m if f has partial derivatives everywhere in A and<br />
these partial derivatives are continuous on A.<br />
Theorem 7. Suppose that f : R n+m → R n is a C 1 function on an open set<br />
A ⊂ R n+m and that (¯x, ¯b) in A is such that f(¯x, ¯b) = 0. Suppose also that<br />
∂f(x, b)/∂x = ⎡ ∂f 1 (x, b)/∂x 1   · · ·   ∂f 1 (x, b)/∂x n ⎤<br />
              ⎢        ⋮             ⋱            ⋮        ⎥<br />
              ⎣ ∂f n (x, b)/∂x 1   · · ·   ∂f n (x, b)/∂x n ⎦<br />
is of full rank. Then there are open sets A 1 ⊂ R n and A 2 ⊂ R m with ¯x in A 1 and<br />
¯b in A2 and A 1 × A 2 ⊂ A such that for each b in A 2 there is exactly one g(b) in A 1<br />
such that f(g(b), b) = 0. Moreover, g : A 2 → A 1 is a C 1 function and<br />
∂g(b)/∂b = − [∂f(g(b), b)/∂x] −1 [∂f(g(b), b)/∂b] .<br />
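The derivative formula of Theorem 7 is easy to check in the scalar case n = m = 1. In the sketch below the function f(x, b) = x³ − b is assumed for illustration (not from the text); its implicit solution near any b > 0 is g(b) = b^(1/3), and the theorem's formula reduces to g′(b) = 1/(3x²) at x = g(b).

```python
# One-equation check of the formula dg/db = -(df/dx)^-1 (df/db) from
# Theorem 7, with the assumed example f(x, b) = x^3 - b, so g(b) = b^(1/3).
def g(b):
    return b ** (1.0 / 3.0)

b = 8.0
x = g(b)                 # g(8) = 2, and f(g(b), b) = 0

# With df/dx = 3x^2 and df/db = -1 the formula gives g'(b) = 1/(3x^2).
formula = -(1.0 / (3.0 * x * x)) * (-1.0)

# Compare with a finite-difference derivative of g.
h = 1e-6
numeric = (g(b + h) - g(b - h)) / (2 * h)
assert abs(formula - numeric) < 1e-6
```

Note that the formula is only valid where ∂f/∂x = 3x² ≠ 0; at x = 0 (that is, b = 0) the full-rank condition of the theorem fails, and indeed g(b) = b^(1/3) is not differentiable there.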
Exercise 74. Consider the general utility maximisation problem<br />
(19)    max x 1 ,x 2 ,...,x n  u(x 1 , x 2 , . . . , x n )   subject to   p 1 x 1 + p 2 x 2 + · · · + p n x n = w.<br />
Suppose that for some price vector ¯p the maximisation problem has a utility maximising<br />
bundle ¯x. Find conditions on the utility function such that in a neighbourhood<br />
of (¯x, ¯p) we can solve for the demand functions x(p). Find the derivatives of<br />
the demand functions, ∂x/∂p.<br />
Exercise 75. Now suppose that there are only two goods and the utility<br />
function is given by<br />
u(x 1 , x 2 ) = (x 1 ) 1/3 (x 2 ) 2/3 .<br />
Solve this utility maximisation problem, as you learned to do in Section 1 of this<br />
Chapter, and then differentiate the demand functions that you find to find the<br />
partial derivative with respect to p 1 , p 2 , and w of each demand function.<br />
Also find the same derivatives using the method of the previous exercise.<br />
3. The Theorem of the Maximum<br />
Often in economics we are not so much interested in what the solution to a<br />
particular maximisation problem is but rather wish to know how the solution to a<br />
parameterised problem depends on the parameters. Thus in our first example of<br />
utility maximisation we might be interested not so much in what the solution to the<br />
maximisation problem is when p 1 = 2, p 2 = 7, and y = 25, but rather in how the<br />
solution depends on p 1 , p 2 , and y. (That is, we might be interested in the demand<br />
function.) Sometimes we shall also be interested in how the maximised function<br />
depends on the parameters—in the example how the maximised utility depends on<br />
p 1 , p 2 , and y.<br />
This raises a number of questions. In order for us to speak meaningfully of a<br />
demand function it should be the case that the maximisation problem has a unique<br />
solution. Further, we would like to know if the “demand” function is continuous—<br />
or even if it is differentiable. Consider again the multi-constraint problem of Section 1, but this time let us<br />
explicitly add some parameters.<br />
(20)    max x 1 ,...,x n  f(x 1 , . . . , x n , a 1 , . . . , a k )<br />
        subject to g 1 (x 1 , . . . , x n , a 1 , . . . , a k ) = c 1<br />
        ⋮<br />
        g m (x 1 , . . . , x n , a 1 , . . . , a k ) = c m<br />
In order to be able to say whether or not the problem has a unique solution<br />
it is useful to know something about the shape or curvature of the functions f<br />
and g. We say a function is concave if for any two points in the domain of the<br />
function the value of the function at a weighted average of the two points is at least as great<br />
as the weighted average of the values of the function at the two points. We say<br />
the function is convex if the value of the function at the average is less than the<br />
average of the values. The following definition makes this a little more explicit. (In<br />
both definitions x = (x 1 , . . . , x n ) is a vector.)<br />
Definition 15. A function f is concave if for any x and x ′ with x ≠ x ′ and<br />
for any t such that 0 < t < 1 we have f(tx + (1 − t)x ′ ) ≥ tf(x) + (1 − t)f(x ′ ). The<br />
function is strictly concave if f(tx + (1 − t)x ′ ) > tf(x) + (1 − t)f(x ′ ).<br />
A function f is convex if for any x and x ′ with x ≠ x ′ and for any t such that<br />
0 < t < 1 we have f(tx + (1 − t)x ′ ) ≤ tf(x) + (1 − t)f(x ′ ). The function is strictly<br />
convex if f(tx + (1 − t)x ′ ) < tf(x) + (1 − t)f(x ′ ).<br />
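Definition 15 can be checked pointwise for a concrete function. In the sketch below the functions −x² (strictly concave) and x² (strictly convex) and the sample points are assumed for illustration; for these functions the gap between the two sides of the inequality is t(1 − t)(x − x′)², which is strictly positive whenever x ≠ x′ and 0 < t < 1.

```python
# Pointwise check of Definition 15: f(x) = -x^2 is strictly concave and
# g(x) = x^2 is strictly convex.  Sample points are assumed for illustration.
points = [(-3.0, 5.0), (0.5, 2.0), (-7.0, -1.0)]   # pairs with x != x'
ts = [0.25, 0.5, 0.9]                               # weights with 0 < t < 1

for x, xp in points:
    for t in ts:
        avg = t * x + (1.0 - t) * xp
        # Strict concavity of -x^2: the value at the average strictly
        # exceeds the average of the values.
        assert -avg ** 2 > t * (-x ** 2) + (1.0 - t) * (-xp ** 2)
        # Strict convexity of x^2: the inequality reverses.
        assert avg ** 2 < t * x ** 2 + (1.0 - t) * xp ** 2
```

A function that is linear, such as f(x) = 2x, satisfies both weak inequalities with equality, so it is concave and convex but neither strictly.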
The result we are about to give is most conveniently stated when our statement<br />
of the problem is in terms of inequality constraints rather than equality constraints.<br />
As mentioned earlier we shall examine this kind of problem later in this course.<br />
However for the moment in order to proceed with our discussion of the problem<br />
involving equality constraints we shall assume that all of the functions with which<br />
we are dealing are increasing in the x variables. (See Exercise 76 for a formal<br />
definition of what it means for a function to be increasing.) In this case if f is<br />
strictly concave and g j is convex for each j then the problem has a unique solution.<br />
In fact the concepts of concavity and convexity are somewhat stronger than is<br />
required. We shall see later in the course that they can be replaced by the concepts<br />
of quasi-concavity and quasi-convexity. In some sense these latter concepts are the<br />
“right” concepts for this result.<br />
Theorem 8. Suppose that f and g j are increasing in (x 1 , . . . , x n ). If f is<br />
strictly concave in (x 1 , . . . , x n ) and g j is convex in (x 1 , . . . , x n ) for j = 1, . . . , m<br />
then for each value of the parameters (a 1 , . . . , a k ) if problem (20) has a solution<br />
(x ∗ 1, . . . , x ∗ n) that solution is unique.<br />
Now let v(a 1 , . . . , a k ) be the maximised value of f when the parameters are<br />
(a 1 , . . . , a k ). Let us suppose that the problem is such that the solution is unique and<br />
that (x ∗ 1(a 1 , . . . , a k ), . . . , x ∗ n(a 1 , . . . , a k )) are the values that maximise the function<br />
f when the parameters are (a 1 , . . . , a k ) then<br />
(21) v(a 1 , . . . , a k ) = f(x ∗ 1(a 1 , . . . , a k ), . . . , x ∗ n(a 1 , . . . , a k ), a 1 , . . . , a k ).<br />
(Notice however that the function v is uniquely defined even if there is not a unique<br />
maximiser.)<br />
The Theorem of the Maximum gives conditions on the problem under which<br />
the function v and the functions x ∗ 1, . . . , x ∗ n are continuous. The constraints in the<br />
problem (20) define a set of feasible vectors x over which the function f is to be<br />
maximised. Let us call this set G(a 1 , . . . , a k ), i.e.,<br />
(22) G(a 1 , . . . , a k ) = {(x 1 , . . . , x n ) | g j (x 1 , . . . , x n , a 1 , . . . , a k ) = c j ∀j}<br />
Now we can restate the problem as<br />
(23)    max x 1 ,...,x n  f(x 1 , . . . , x n , a 1 , . . . , a k )   subject to   (x 1 , . . . , x n ) ∈ G(a 1 , . . . , a k ).<br />
Notice that both the function f and the feasible set G depend on the parameters<br />
a, i.e., both may change as a changes. The Theorem of the Maximum requires<br />
both that the function f be continuous as a function of x and a and that the<br />
feasible set G(a 1 , . . . , a k ) change continuously as a changes. We already know—<br />
or should know—what it means for f to be continuous but the notion of what it<br />
means for a set to change continuously is less elementary. We call G a set valued<br />
function or a correspondence. G associates with any vector (a 1 , . . . , a k ) a subset of<br />
the vectors (x 1 , . . . , x n ). The following two definitions define what we mean by a<br />
correspondence being continuous. First we define what it means for two sets to be<br />
close.<br />
Definition 16. Two sets of vectors A and B are within ɛ of each other if for<br />
any vector x in one set there is a vector x ′ in the other set such that x ′ is within ɛ<br />
of x.<br />
We can now define the continuity of the correspondence G in essentially the<br />
same way that we define the continuity of a single valued function.
Definition 17. The correspondence G is continuous at (a 1 , . . . , a k ) if for any<br />
ɛ > 0 there is δ > 0 such that if (a ′ 1, . . . , a ′ k ) is within δ of (a 1, . . . , a k ) then<br />
G(a ′ 1, . . . , a ′ k ) is within ɛ of G(a 1, . . . , a k ).<br />
It is, unfortunately, not the case that the continuity of the functions g j necessarily<br />
implies the continuity of the feasible set. (Exercise 77 asks you to construct a<br />
counterexample.)<br />
Remark 1. It is possible to define two weaker notions of continuity, which we<br />
call upper hemicontinuity and lower hemicontinuity. A correspondence is in fact<br />
continuous in the way we have defined it if it is both upper hemicontinuous and<br />
lower hemicontinuous.<br />
We are now in a position to state the Theorem of the Maximum. We assume<br />
that f is a continuous function, that G is a continuous correspondence, and that<br />
for any (a 1 , . . . , a k ) the set G(a 1 , . . . , a k ) is compact. The Weierstrass Theorem<br />
thus guarantees that there is a solution to the maximisation problem (23) for any<br />
(a 1 , . . . , a k ).<br />
Theorem 9 (Theorem of the Maximum). Suppose that f(x 1 , . . . , x n , a 1 , . . . , a k )<br />
is continuous (in (x 1 , . . . , x n , a 1 , . . . , a k )), that G(a 1 , . . . , a k ) is a continuous correspondence,<br />
and that for any (a 1 , . . . , a k ) the set G(a 1 , . . . , a k ) is compact. Then<br />
(1) v(a 1 , . . . , a k ) is continuous, and<br />
(2) if (x ∗ 1(a 1 , . . . , a k ), . . . , x ∗ n(a 1 , . . . , a k )) are (single valued) functions then<br />
they are also continuous.<br />
Later in the course we shall see how the Implicit Function Theorem allows us<br />
to identify conditions under which the functions v and x ∗ are differentiable.<br />
Exercises.<br />
Exercise 76. We say that the function f(x 1 , . . . , x n ) is nondecreasing if x ′ i ≥<br />
x i for each i implies that f(x ′ 1, . . . , x ′ n) ≥ f(x 1 , . . . , x n ), is increasing if x ′ i > x i<br />
for each i implies that f(x ′ 1, . . . , x ′ n) > f(x 1 , . . . , x n ) and is strictly increasing if<br />
x ′ i ≥ x i for each i and x ′ j > x j for at least one j implies that f(x ′ 1, . . . , x ′ n) ><br />
f(x 1 , . . . , x n ). Show that if f is nondecreasing and strictly concave then it must be<br />
strictly increasing. [Hint: This is very easy.]<br />
Exercise 77. Show by example that even if the functions g j are continuous<br />
the correspondence G may not be continuous. [Hint: Use the case n = m = k = 1.]<br />
4. The Envelope Theorem<br />
In this section we examine a theorem that is particularly useful in the study<br />
of consumer and producer theory. There is in fact nothing mysterious about this<br />
theorem. You will see that the proof of this theorem is simply calculation and a<br />
number of substitutions. Moreover the theorem has a very clear intuition. It is this:<br />
Suppose we are at a maximum (in an unconstrained problem) and we change the<br />
data of the problem by a very small amount. Now both the solution of the problem<br />
and the value at the maximum will change. However at a maximum the function<br />
is flat (the first derivative is zero). Thus when we want to know by how much the
maximised value has changed it does not matter (very much) whether or not we
take account of how the maximiser changes. See Figure 2. The intuition for
a constrained problem is similar and only a little more complicated.<br />
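This intuition can be checked numerically on a hypothetical unconstrained problem (my example, not from the notes): take f(x, a) = −(x − a)² + 2a, so x*(a) = a and v(a) = 2a. Because f is flat in x at the maximum, the derivative of v equals the partial derivative of f with respect to a with x frozen at x*(a):

```python
# Hypothetical example: f(x, a) = -(x - a)**2 + 2*a, so x*(a) = a, v(a) = 2*a.
# The envelope intuition: dv/da equals df/da holding x fixed at x*(a).

def f(x, a):
    return -(x - a) ** 2 + 2 * a

def v(a, steps=40000, lo=-5.0, hi=5.0):
    """Maximise f(., a) over a fine grid."""
    return max(f(lo + (hi - lo) * i / steps, a) for i in range(steps + 1))

a, h = 1.0, 1e-4
dv_da = (v(a + h) - v(a - h)) / (2 * h)        # total derivative of the value
df_da = (f(a, a + h) - f(a, a - h)) / (2 * h)  # partial derivative, x frozen at x*(a) = a
assert abs(dv_da - 2.0) < 1e-2
assert abs(df_da - 2.0) < 1e-6
```

The two finite differences agree even though the first lets the maximiser move and the second does not.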
To motivate our discussion of the Envelope Theorem we will first consider a<br />
particular case, viz, the relation between short and long run average cost curves.<br />
Recall that, in general we assume that the average cost of producing some good is
[Figure 2: the graphs of f(·, a) and f(·, a′) against x, with maximisers x*(a) and x*(a′), showing the values f(x*(a), a), f(x*(a), a′), and f(x*(a′), a′).]
a function of the amount of the good to be produced. The short run average cost<br />
function is defined to be the function which for any quantity, Q, gives the average<br />
cost of producing that quantity, taking as given the scale of operation, i.e., the size<br />
and number of plants and other fixed capital which we assume cannot be changed<br />
in the short run (whatever that is). The long run average cost function on the<br />
other hand gives, as a function of Q, the average cost of producing Q units of the<br />
good, with the scale of operation selected to be the optimal scale for that level of<br />
production.<br />
That is, if we let the scale of operation be measured by a single variable k,<br />
say, and we let the short run average cost of producing Q units when the scale is<br />
k be given by SRAC(Q, k) and the long run average cost of producing Q units by<br />
LRAC(Q) then we have<br />
LRAC(Q) = min_k SRAC(Q, k).
Let us denote, for a given value Q, the optimal level of k by k(Q). That is, k(Q) is<br />
the value of k that minimises the right hand side of the above equation.<br />
Graphically, for any fixed level of k the short run average cost function can be<br />
represented by a curve (normally assumed to be U-shaped) drawn in two dimensions<br />
with quantity on the horizontal axis and cost on the vertical axis. Now think about<br />
drawing one short run average cost curve for each of the (infinite) possible values of<br />
k. One way of thinking about the long run average cost curve is as the “bottom” or<br />
envelope of these short run average cost curves. Suppose that we consider a point<br />
on this long run or envelope curve. What can be said about the slope of the long<br />
run average cost curve at this point? A little thought should convince you that it
should be the same as the slope of the short run curve through the same point.<br />
(If it were not then that short run curve would come below the long run curve, a
contradiction.) That is,

d LRAC(Q)/dQ = ∂SRAC(Q, k(Q))/∂Q.

See Figure 3.
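The tangency can be checked numerically with a hypothetical cost specification (mine, not the notes'): SRAC(Q, k) = k + Q²/k. Minimising over k gives k(Q) = Q and LRAC(Q) = 2Q, so both slopes at any Q should equal 2:

```python
# Hypothetical specification: SRAC(Q, k) = k + Q**2 / k, so k(Q) = Q,
# LRAC(Q) = 2*Q, and the long and short run slopes coincide at k = k(Q).

def srac(q, k):
    return k + q * q / k

def lrac(q):
    # minimise over a fine grid of scales k
    return min(srac(q, 0.01 + i * 0.001) for i in range(10000))

q, h = 3.0, 1e-3
slope_lrac = (lrac(q + h) - lrac(q - h)) / (2 * h)
slope_srac = (srac(q + h, q) - srac(q - h, q)) / (2 * h)  # k fixed at k(q) = q
assert abs(slope_lrac - 2.0) < 1e-2
assert abs(slope_srac - 2.0) < 1e-9
```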
[Figure 3: cost against Q; the long run average cost curve LRAC as the lower envelope of the short run average cost curves. The SRAC curve for scale k(Q̄) is tangent to LRAC at Q̄, where LRAC(Q̄) = SRAC(Q̄, k(Q̄)).]
The envelope theorem is a general statement of the result of which this is a<br />
special case. We will consider not only cases in which Q and k are vectors, but also<br />
cases in which the maximisation or minimisation problem includes some constraints.<br />
Let us consider again the maximisation problem (20). Recall:<br />
max_{x_1,…,x_n} f(x_1, …, x_n, a_1, …, a_k)
subject to g_1(x_1, …, x_n, a_1, …, a_k) = c_1
⋮
g_m(x_1, …, x_n, a_1, …, a_k) = c_m.

Again let L(x_1, …, x_n, λ_1, …, λ_m; a_1, …, a_k) be the Lagrangian function:

(24) L(x_1, …, x_n, λ_1, …, λ_m; a_1, …, a_k) = f(x_1, …, x_n, a_1, …, a_k) + Σ_{j=1}^m λ_j (c_j − g_j(x_1, …, x_n, a_1, …, a_k)).
Let (x*_1(a_1, …, a_k), …, x*_n(a_1, …, a_k)) and (λ_1(a_1, …, a_k), …, λ_m(a_1, …, a_k)) be
the values of x and λ that solve this problem. Now let

(25) v(a_1, …, a_k) = f(x*_1(a_1, …, a_k), …, x*_n(a_1, …, a_k), a_1, …, a_k).
That is, v(a 1 , . . . , a k ) is the maximised value of the function f when the parameters<br />
are (a 1 , . . . , a k ). The envelope theorem says that the derivative of v is equal to the<br />
derivative of L at the maximising values of x and λ. Or, more precisely
483. CONSUMER BEHAVIOUR: OPTIMISATION SUBJECT TO THE BUDGET CONSTRAINT<br />
Theorem 10 (The Envelope Theorem). If all functions are defined as above
and the problem is such that the functions x* and λ are well defined then

∂v/∂a_h (a_1, …, a_k) = ∂L/∂a_h (x*_1(a_1, …, a_k), …, x*_n(a_1, …, a_k), λ_1(a_1, …, a_k), …, λ_m(a_1, …, a_k), a_1, …, a_k)

= ∂f/∂a_h (x*_1(a_1, …, a_k), …, x*_n(a_1, …, a_k), a_1, …, a_k) − Σ_{j=1}^m λ_j(a_1, …, a_k) ∂g_j/∂a_h (x*_1(a_1, …, a_k), …, x*_n(a_1, …, a_k), a_1, …, a_k)

for all h.
In order to show the advantages of using matrix and vector notation we shall<br />
restate the theorem in that notation before returning to give a proof of the theorem.<br />
(In proving the theorem we shall return to using mainly scalar notation.)<br />
Theorem 10 (The Envelope Theorem). Under the same conditions as above

∂v/∂a (a) = ∂L/∂a (x*(a), λ(a), a) = ∂f/∂a (x*(a), a) − λ(a) ∂g/∂a (x*(a), a).
Proof. From the definition of the function v we have

(26) v(a_1, …, a_k) = f(x*_1(a_1, …, a_k), …, x*_n(a_1, …, a_k), a_1, …, a_k).

Thus

(27) ∂v/∂a_h (a) = ∂f/∂a_h (x*(a), a) + Σ_{i=1}^n ∂f/∂x_i (x*(a), a) ∂x*_i/∂a_h (a).

Now, from the first order conditions (12) we have

∂f/∂x_i (x*(a), a) − Σ_{j=1}^m λ_j(a) ∂g_j/∂x_i (x*(a), a) = 0.

Or

(28) ∂f/∂x_i (x*(a), a) = Σ_{j=1}^m λ_j(a) ∂g_j/∂x_i (x*(a), a).

Also, since x*(a) satisfies the constraints we have, for each j,

g_j(x*_1(a), …, x*_n(a), a_1, …, a_k) ≡ c_j.

And, since this holds as an identity, we may differentiate both sides with respect
to a_h, giving

Σ_{i=1}^n ∂g_j/∂x_i (x*(a), a) ∂x*_i/∂a_h (a) + ∂g_j/∂a_h (x*(a), a) = 0.

Or

(29) Σ_{i=1}^n ∂g_j/∂x_i (x*(a), a) ∂x*_i/∂a_h (a) = −∂g_j/∂a_h (x*(a), a).

Substituting (28) into (27) gives

∂v/∂a_h (a) = ∂f/∂a_h (x*(a), a) + Σ_{i=1}^n [Σ_{j=1}^m λ_j(a) ∂g_j/∂x_i (x*(a), a)] ∂x*_i/∂a_h (a).
Changing the order of summation gives

(30) ∂v/∂a_h (a) = ∂f/∂a_h (x*(a), a) + Σ_{j=1}^m λ_j(a) [Σ_{i=1}^n ∂g_j/∂x_i (x*(a), a) ∂x*_i/∂a_h (a)].

And now substituting (29) into (30) gives

∂v/∂a_h (a) = ∂f/∂a_h (x*(a), a) − Σ_{j=1}^m λ_j(a) ∂g_j/∂a_h (x*(a), a),

which is the required result. □
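The theorem can be checked numerically on a small hypothetical constrained problem (my example, not from the notes): maximise x_1 x_2 subject to x_1 + a x_2 = 2. The first order conditions give x_1 = 1, x_2 = 1/a, and λ = 1/a, so v(a) = 1/a, and the theorem predicts ∂v/∂a = ∂L/∂a = −λ ∂g/∂a = −λ x_2:

```python
# Hypothetical problem: max x1*x2 subject to g(x, a) = x1 + a*x2 = 2.

def solve(a):
    # FOCs: x2 = lam, x1 = lam*a; with the constraint this gives x1 = 1, x2 = 1/a
    x1, x2 = 1.0, 1.0 / a
    lam = x2                      # since df/dx1 = x2 = lam * dg/dx1
    return x1, x2, lam

def v(a):
    x1, x2, _ = solve(a)
    return x1 * x2                # maximised value, v(a) = 1/a

a, h = 2.0, 1e-5
dv_da = (v(a + h) - v(a - h)) / (2 * h)   # numerical derivative of the value
_, x2, lam = solve(a)
envelope = -lam * x2                      # dL/da = -lam * dg/da, with dg/da = x2
assert abs(dv_da - envelope) < 1e-6
```

Note that the check never differentiates x*(a): only the partial derivative of the Lagrangian with respect to the parameter is needed.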
Exercises.<br />
Exercise 78. Rewrite this proof using matrix notation. Go through your proof
and identify the dimension of each of the vectors or matrices you use. For example
f_x is a 1 × n vector and g_x is an m × n matrix.
5. Applications to Microeconomic Theory<br />
5.1. Utility Maximisation. Let us again consider the problem

max_{x_1,x_2} u(x_1, x_2)
subject to p_1 x_1 + p_2 x_2 − y = 0.

Let v(p_1, p_2, y) be the maximised value of u when prices and income are p_1, p_2, and
y. Let us consider the effect of a change in y with p_1 and p_2 remaining constant.
By the Envelope Theorem

∂v/∂y = ∂/∂y {u(x_1, x_2) + λ(y − p_1 x_1 − p_2 x_2)} = 0 + λ · 1 = λ.

This is the familiar result that λ is the marginal utility of income.
5.2. Expenditure Minimisation. Let us consider the problem of minimising
expenditure subject to attaining a given level of utility, i.e.,

min_{x_1,…,x_n} Σ_{i=1}^n p_i x_i
subject to u(x_1, …, x_n) − u_0 = 0.

Let the minimised value of expenditure, the expenditure function, be denoted by
e(p_1, …, p_n, u_0). Then by the Envelope Theorem we obtain

∂e/∂p_i = ∂/∂p_i {Σ_{i=1}^n p_i x_i + λ(u_0 − u(x_1, …, x_n))} = x_i − λ · 0 = x_i

when evaluated at the point which solves the minimisation problem, which we write
as h_i(p_1, …, p_n, u_0) to distinguish this (compensated) value of the demand for good
i as a function of prices and utility from the (uncompensated) value of the demand
for good i as a function of prices and income. This result is known as Hotelling’s
Theorem.
5.3. The Hicks-Slutsky Equations. It can be shown that the compensated
demand at utility u_0, i.e., h_i(p_1, …, p_n, u_0), is equal to the uncompensated demand
at income e(p_1, …, p_n, u_0), i.e., x_i(p_1, …, p_n, e(p_1, …, p_n, u_0)). (This result
is known as the duality theorem.) Thus, totally differentiating the identity

x_i(p_1, …, p_n, e(p_1, …, p_n, u_0)) ≡ h_i(p_1, …, p_n, u_0)

with respect to p_k we obtain

∂x_i/∂p_k + (∂x_i/∂y)(∂e/∂p_k) = ∂h_i/∂p_k,

which by Hotelling’s Theorem gives

∂x_i/∂p_k + (∂x_i/∂y) h_k = ∂h_i/∂p_k.

So

∂x_i/∂p_k = ∂h_i/∂p_k − h_k (∂x_i/∂y)

for all i, k = 1, …, n. These are the Hicks-Slutsky equations.
5.4. The Indirect Utility Function. Again let v(p_1, …, p_n, y) be the indirect
utility function, that is, the maximised value of utility as described in Section 5.1.
Then by the Envelope Theorem

∂v/∂p_i = ∂u/∂p_i − λ x_i(p_1, …, p_n, y) = −λ x_i(p_1, …, p_n, y)

since ∂u/∂p_i = 0. Now, since we have already shown (in Section 5.1) that λ = ∂v/∂y,
we have

x_i(p_1, …, p_n, y) = −(∂v/∂p_i)/(∂v/∂y).

This is known as Roy’s Theorem.
5.5. Profit Functions. Now consider the problem of a firm that maximises
profits subject to technology constraints. Let x = (x_1, …, x_n) be a vector of
netputs, i.e., x_i is positive if the firm is a net supplier of good i, negative if the firm
is a net user of that good. Let us assume that we can write the technology constraints
as F(x) = 0. Thus the firm’s problem is

max_{x_1,…,x_n} Σ_{i=1}^n p_i x_i
subject to F(x_1, …, x_n) = 0.

Let φ_i(p) be the value of x_i that solves this problem, i.e., the net supply of
commodity i when prices are p. (Here p is a vector.) We call the maximised value
the profit function, which is given by

Π(p) = Σ_{i=1}^n p_i φ_i(p).

And so by the Envelope Theorem

∂Π/∂p_i = φ_i(p).

This result is known as Hotelling’s lemma.
5.6. Cobb-Douglas Example. We consider a particular Cobb-Douglas example
of the utility maximisation problem

(31) max_{x_1,x_2} √x_1 √x_2
subject to p_1 x_1 + p_2 x_2 = w.

The Lagrangean is

(32) L(x_1, x_2, λ) = √x_1 √x_2 + λ(w − p_1 x_1 − p_2 x_2)

and the first order conditions are

(33) ∂L/∂x_1 = (1/2) x_1^{−1/2} x_2^{1/2} − p_1 λ = 0
(34) ∂L/∂x_2 = (1/2) x_1^{1/2} x_2^{−1/2} − p_2 λ = 0
(35) ∂L/∂λ = w − p_1 x_1 − p_2 x_2 = 0.
If we divide equation (33) by equation (34) we obtain

x_2/x_1 = p_1/p_2

or

p_1 x_1 = p_2 x_2,

and if we substitute this into equation (35) we obtain

w − p_1 x_1 − p_1 x_1 = 0

or

(36) x_1 = w/(2p_1).

Similarly,

(37) x_2 = w/(2p_2).
Substituting equations (36) and (37) into the utility function gives

(38) v(p_1, p_2, w) = √(w²/(4 p_1 p_2)) = w/(2√(p_1 p_2)).

As a check we can confirm some known properties of the indirect utility
function. For example, it is homogeneous of degree zero, that is, if we multiply p_1,
p_2, and w by the same positive constant, say α, we do not change the value of v.
You should confirm that this is the case.
We now calculate the optimal value of λ from the first order conditions by
substituting equations (36) and (37) into (33), giving

(1/2) (w/(2p_1))^{−1/2} (w/(2p_2))^{1/2} − p_1 λ = 0

or

(1/2) √((2p_1)/w) √(w/(2p_2)) = p_1 λ

or

(1/2) √(p_1/p_2) · (1/p_1) = λ

or

λ = 1/(2√(p_1 p_2)).
Our first application of the Envelope Theorem told us that this value of λ could
be found as the derivative of the indirect utility function with respect to w. We
confirm this by differentiating the function we found above with respect to w:

∂v/∂w = ∂/∂w (w/(2√(p_1 p_2))) = 1/(2√(p_1 p_2)),
as we had found directly above.<br />
Now let us, for the same utility function, consider the expenditure minimisation
problem

min_{x_1,x_2} p_1 x_1 + p_2 x_2
subject to √x_1 √x_2 = u.

The Lagrangian is

(39) L(x_1, x_2, λ) = p_1 x_1 + p_2 x_2 + λ(u − √x_1 √x_2)
and the first order conditions are

(40) ∂L/∂x_1 = p_1 − λ (1/2) x_1^{−1/2} x_2^{1/2} = 0
(41) ∂L/∂x_2 = p_2 − λ (1/2) x_1^{1/2} x_2^{−1/2} = 0
(42) ∂L/∂λ = u − √x_1 √x_2 = 0.
Dividing equation (40) by equation (41) gives

p_1/p_2 = x_2/x_1

or

(43) x_2 = p_1 x_1 / p_2.

And, if we substitute equation (43) into equation (42) we obtain

u − x_1 √(p_1/p_2) = 0

or

x_1 = u √(p_2/p_1).

Similarly,

x_2 = u √(p_1/p_2),
and if we substitute these values back into the objective function we obtain the
expenditure function

e(p_1, p_2, u) = p_1 u √(p_2/p_1) + p_2 u √(p_1/p_2) = 2u √(p_1 p_2).
Hotelling’s Theorem tells us that if we differentiate this expenditure function
with respect to p_i we should obtain the Hicksian demand function h_i:

∂e(p_1, p_2, u)/∂p_1 = ∂/∂p_1 (2u √(p_1 p_2)) = 2u · (1/2) √(p_2/p_1) = u √(p_2/p_1)

as we had already found. And similarly for h_2.
Let us summarise what we have found so far. The Marshallian demand functions
are

x_1(p_1, p_2, w) = w/(2p_1)
x_2(p_1, p_2, w) = w/(2p_2).

The indirect utility function is

v(p_1, p_2, w) = w/(2√(p_1 p_2)).

The Hicksian demand functions are

h_1(p_1, p_2, u) = u √(p_2/p_1)
h_2(p_1, p_2, u) = u √(p_1/p_2),

and the expenditure function is

e(p_1, p_2, u) = 2u √(p_1 p_2).
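These formulas can also be checked numerically at arbitrarily chosen prices and income. The sketch below (standard library only) verifies the duality between e and v, and confirms Hotelling's Theorem and Roy's Theorem by finite differences:

```python
import math

# Numerical checks of the Cobb-Douglas formulas summarised above.

def x1(p1, p2, w): return w / (2 * p1)          # Marshallian demands
def v(p1, p2, w):  return w / (2 * math.sqrt(p1 * p2))  # indirect utility
def h1(p1, p2, u): return u * math.sqrt(p2 / p1)        # Hicksian demand
def e(p1, p2, u):  return 2 * u * math.sqrt(p1 * p2)    # expenditure function

p1, p2, w, eps = 2.0, 5.0, 30.0, 1e-5
u = v(p1, p2, w)

# duality: Marshallian demand at income e(p, u) is the Hicksian demand
assert abs(x1(p1, p2, e(p1, p2, u)) - h1(p1, p2, u)) < 1e-9
# e and v are inverse to one another
assert abs(e(p1, p2, v(p1, p2, w)) - w) < 1e-9
# Hotelling: de/dp1 = h1
de_dp1 = (e(p1 + eps, p2, u) - e(p1 - eps, p2, u)) / (2 * eps)
assert abs(de_dp1 - h1(p1, p2, u)) < 1e-6
# Roy: x1 = -(dv/dp1)/(dv/dw)
dv_dp1 = (v(p1 + eps, p2, w) - v(p1 - eps, p2, w)) / (2 * eps)
dv_dw = (v(p1, p2, w + eps) - v(p1, p2, w - eps)) / (2 * eps)
assert abs(-dv_dp1 / dv_dw - x1(p1, p2, w)) < 1e-6
```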
We now look at the third application, concerning the Hicks-Slutsky decomposition.
First let us confirm that if we substitute the expenditure function for w in
the Marshallian demand function we do obtain the Hicksian demand function:

x_1(p_1, p_2, e(p_1, p_2, u)) = e(p_1, p_2, u)/(2p_1) = 2u√(p_1 p_2)/(2p_1) = u √(p_2/p_1),

as required.
Similarly, if we plug the indirect utility function v into the Hicksian demand<br />
function h i we obtain the Marshallian demand function x i . Confirmation of this is<br />
left as an exercise. [You should do this exercise. If you understand properly it is<br />
very easy. If you understand a bit then doing the exercise will solidify your understanding.<br />
If you can’t do it then it is a message to get some further explanation.]<br />
Let us now check the Hicks-Slutsky decomposition for the effect of a change in
the price of good 2 on the demand for good 1. The Hicks-Slutsky decomposition
tells us that

∂x_1/∂p_2 = ∂h_1/∂p_2 − h_2 (∂x_1/∂w).

Calculating these partial derivatives we have

∂x_1/∂p_2 = 0,
∂x_1/∂w = 1/(2p_1),
∂h_1/∂p_2 = (u/√p_1) · (1/2) · (1/√p_2) = u/(2√(p_1 p_2)),
and

h_2 = u √(p_1/p_2).

Substituting into the right hand side of the Hicks-Slutsky equation above gives

RHS = u/(2√(p_1 p_2)) − u √(p_1/p_2) · (1/(2p_1)) = 0,

which is exactly what we had found for the left hand side of the Hicks-Slutsky
equation.
Finally we check Roy’s Theorem, which tells us that the Marshallian demand
for good 1 can be found as

x_1(p_1, p_2, w) = −(∂v/∂p_1)/(∂v/∂w).

In this case we obtain

x_1(p_1, p_2, w) = −(−(w/4) p_1^{−3/2} p_2^{−1/2}) / ((1/2) p_1^{−1/2} p_2^{−1/2}) = w/(2p_1),

as required.

Exercises.

Exercise 79. Consider the direct utility function

u(x) = Σ_{i=1}^n β_i log(x_i − γ_i),

where β_i and γ_i, i = 1, …, n, are, respectively, positive and nonpositive parameters.
(1) Derive the indirect utility function and show that it is decreasing in its<br />
arguments.<br />
(2) Verify Roy’s Theorem.<br />
(3) Derive the expenditure function and show that it is homogeneous of degree<br />
one and nondecreasing in prices.<br />
(4) Verify Hotelling’s Theorem.<br />
Exercise 80. For the utility function defined in Exercise 79,
(1) Derive the Slutsky equation.<br />
(2) Let d i (p, y) be the demand for good i derived from the above utility function.<br />
Goods i and j are said to be gross substitutes if ∂d i (p, y)/∂p j > 0<br />
and gross complements if ∂d i (p, y)/∂p j < 0. For this utility function are<br />
the various goods gross substitutes, gross complements, or can we not say?<br />
(The two previous exercises are taken from R. Robert Russell and Maurice<br />
Wilkinson, Microeconomics: A Synthesis of Modern and Neoclassical Theory, New<br />
York, John Wiley & Sons, 1979.)<br />
Exercise 81. An electric utility has two generating plants in which total costs
per hour are c_1 and c_2 respectively, where

c_1 = 80 + 2x_1 + 0.001b x_1², with b > 0, and
c_2 = 90 + 1.5x_2 + 0.002x_2²,
where x_i is the quantity generated in the i-th plant. If the utility is required to produce
2000 megawatts in a particular hour, how should it allocate this load between
the plants so as to minimise costs? Use the Lagrangian method and interpret the
multiplier. How do total costs vary as b changes? (That is, what is the derivative
of the minimised cost with respect to b?)
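A numerical sketch of a solution check for this exercise (the closed forms below are my own working from equating marginal costs across the plants, not given in the notes):

```python
# Minimise c1 + c2 subject to x1 + x2 = load. At an interior optimum the
# marginal costs are equal: 2 + 0.002*b*x1 = 1.5 + 0.004*x2, which solves to
# the allocation below; the multiplier is the common marginal cost.

def allocation(b, load=2000.0):
    x1 = (0.004 * load - 0.5) / (0.002 * b + 0.004)
    return x1, load - x1

def total_cost(b, load=2000.0):
    x1, x2 = allocation(b, load)
    return (80 + 2 * x1 + 0.001 * b * x1 ** 2) + (90 + 1.5 * x2 + 0.002 * x2 ** 2)

b = 1.0
x1, x2 = allocation(b)
assert abs(x1 - 1250.0) < 1e-9 and abs(x2 - 750.0) < 1e-9
# the multiplier: marginal cost of one more megawatt, equal across plants
lam = 2 + 0.002 * b * x1
assert abs(lam - (1.5 + 0.004 * x2)) < 1e-9
# envelope theorem: d(min cost)/db = dL/db = 0.001 * x1**2
h = 1e-6
dcost_db = (total_cost(b + h) - total_cost(b - h)) / (2 * h)
assert abs(dcost_db - 0.001 * x1 ** 2) < 1e-2
```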
CHAPTER 4<br />
Topics in Convex <strong>Analysis</strong><br />
1. Convexity<br />
Convexity is one of the most important mathematical properties in economics.<br />
For example, without convexity of preferences, demand and supply functions
need not be continuous, and so competitive markets generally do not have equilibrium
points. The economic interpretation of convex preference sets in consumer theory is
diminishing marginal rates of substitution; the interpretation of convex production<br />
sets is constant or decreasing returns to scale. Considerably less is known about<br />
general equilibrium models that allow non-convex production sets (e.g., economies<br />
of scale) or non-convex preferences (e.g., the consumer prefers a pint of beer or a<br />
shot of vodka alone to any mixture of the two).<br />
Another set of mathematical results closely connected to the notion of convexity<br />
is so-called separation and support theorems. These theorems are frequently used in<br />
economics to obtain a price system that leads consumers and producers to choose<br />
a Pareto-efficient allocation. That is, given the prices, producers are maximizing
profits, and given those profits as income, consumers are maximizing utility subject<br />
to their budget constraints.<br />
1.1. Convex Sets. Given two points x, y ∈ R n , a point z = ax + (1 − a) y,<br />
where 0 ≤ a ≤ 1, is called a convex combination of x and y.<br />
The set of all possible convex combinations of x and y, denoted by [x, y], is<br />
called the interval with endpoints x and y (or, the line segment connecting x and<br />
y):<br />
[x, y] = {ax + (1 − a) y : 0 ≤ a ≤ 1} .<br />
Definition 18. A set S ⊆ R n is convex iff for any x, y ∈ S the interval<br />
[x, y] ⊆ S.<br />
In words: a set is convex if it contains the line segment connecting any two of<br />
its points; or, more loosely speaking, a set is convex if along with any two points it<br />
contains all points between them.<br />
Convex sets in R² include the interiors of triangles, squares, circles, ellipses, and
a host of other sets. Note also that, for example in R³, while the interior of a cube is
a convex set, its boundary is not. The quintessential convex set in Euclidean space
R^n for any n > 1 is the n-dimensional open ball S_R(a) of radius R > 0 about the point
a ∈ R^n, given by

S_R(a) = {x : x ∈ R^n, |x − a| < R}.
More examples of convex sets:<br />
1. Is the empty set convex? Is a singleton convex? Is R n convex?<br />
There are also several standard ways of forming convex sets from convex sets:<br />
2. Let A, B ⊆ R n be sets. The Minkowski sum A + B ⊆ R n is defined as<br />
A + B = {x + y : x ∈ A, y ∈ B} .<br />
When B = {b} is a singleton, the set A + b is called a translation of A. Prove that<br />
A + B is convex if A and B are convex.<br />
57
58 4. TOPICS IN CONVEX ANALYSIS<br />
3. Let A ⊆ R n be a set and α ∈ R be a number. The scaling αA ⊆ R n is<br />
defined as<br />
αA = {αx : x ∈ A} .<br />
When α > 0, the set αA is called a dilation of A. Prove that αA is convex if A is<br />
convex.<br />
4. Prove that the intersection ∩ i∈I S i of any number of convex sets is convex.<br />
5. Show by example that the union of convex sets need not be convex.<br />
It is also possible to define a convex combination of an arbitrary (but finite) number
of points.
Definition 19. Let x_1, …, x_k be a finite set of points from R^n. A point

x = Σ_{i=1}^k α_i x_i,

where α_i ≥ 0 for i = 1, …, k and Σ_{i=1}^k α_i = 1, is called a convex combination of
x_1, …, x_k.
Note that the definition of a convex combination of two points is a special case
of this definition. (Prove it.)

Can we generate ‘superconvex’ sets using Definition 19? No, as the following
lemma shows.
Lemma 1. A set S ⊆ R n is convex iff every convex combination of points of S<br />
is in S.<br />
Proof. If a set contains all convex combinations of its points it is obviously
convex, because it then contains the convex combinations of all pairs of its points. Thus,
we need to show that a convex set contains every convex combination of its points.
The proof is by induction on the number of points in the convex combination.
By definition, a convex set contains all convex combinations of any two of its points.
Suppose that S contains any convex combination of n or fewer of its points and consider
a convex combination of n + 1 points, x = Σ_{i=1}^{n+1} α_i x_i. Since not all α_i = 1, we can relabel them so
that α_{n+1} < 1. Then

x = (1 − α_{n+1}) Σ_{i=1}^n [α_i/(1 − α_{n+1})] x_i + α_{n+1} x_{n+1} = (1 − α_{n+1}) y + α_{n+1} x_{n+1},

where y denotes the first sum. Note that y ∈ S by the induction hypothesis (as a convex combination of n points of
S) and, as a result, so is x, being a convex combination of two points of S. □
But, using Definition 19, we can generate convex sets from non-convex sets!
This operation is very useful, so the resulting set deserves a special name.

Definition 20. Given a set S ⊆ R^n, the set of all convex combinations of
points from S, denoted conv S, is called the convex hull of S.

Note: convince yourself that the adjective ‘convex’ in the term ‘convex hull’ is
well-deserved by proving that the convex hull is indeed convex! Now Lemma 1 can
be stated more succinctly: S = conv S iff S is convex.
1.2. Convex Hulls. The next theorem deals with the following interesting property
of convex hulls: the convex hull of a set S is the intersection of all convex sets
containing S. Thus, in a natural sense, the convex hull of a set S is the ‘smallest’
convex set containing S. In fact, many authors define convex hulls in that way and
then prove our Definition 20 as a theorem.
1. CONVEXITY 59<br />
Theorem 11. Let S ⊆ R^n be a set. Then any convex set containing S also
contains conv S.

Proof. Let A be a convex set such that S ⊆ A. By Lemma 1, A contains all
convex combinations of its points and, in particular, all convex combinations of
points of its subset S; the set of the latter is conv S. □
The next property is quite obvious and, again, frustrates attempts to generate<br />
‘superconvex’ sets, this time by trying to take convex hulls of convex hulls.<br />
1. Prove that convconvS = convS for any S.<br />
2. Prove that if A ⊂ B then convA ⊂ convB.<br />
The next property relates the operation of taking convex hulls to that of taking
Minkowski sums. It does not matter in which order you apply these operations.

3. Prove that conv (A + B) = (conv A) + (conv B).
4. Prove that conv (A ∩ B) ⊆ (convA) ∩ (convB).<br />
5. Prove that (convA) ∪ (convB) ⊆ conv (A ∪ B).<br />
1.3. Caratheodory’s Theorem. Definition 20 implies that any point x
in the convex hull of S is representable as a convex combination of (finitely) many
points of S, but it places no restriction on the number of points of S required
to make the combination. Caratheodory’s Theorem puts an upper bound on the
number of points required: in R^n the number of points never has to be more than
n + 1.
Theorem 12 (Caratheodory, 1907). Let S ⊆ R^n be a non-empty set. Then every
x ∈ conv S can be represented as a convex combination of (at most) n + 1 points
from S.

Note that the theorem does not ‘identify’ the points used in the representation;
their choice depends on x.
Show by example that the constant n + 1 in Caratheodory’s theorem cannot<br />
be improved. That is, exhibit a set S ⊆ R n and a point x ∈ convS that cannot be<br />
represented as a convex combination of fewer than n + 1 points from S.<br />
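To see the bound in action, here is a small illustration in R² with hypothetical data: the centre of a square lies in the convex hull of its four vertices, yet three vertices already suffice, matching n + 1 = 3 for n = 2. The barycentric-coordinate helper below is standard and exact over the rationals:

```python
from fractions import Fraction as F
from itertools import combinations

square = [(F(0), F(0)), (F(2), F(0)), (F(0), F(2)), (F(2), F(2))]
p = (F(1), F(1))  # the centre of the square

def coeffs(p, a, b, c):
    """Barycentric coordinates (u, v, w) of p in triangle abc, or None if degenerate."""
    det = (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])
    if det == 0:
        return None
    v = ((p[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (p[1] - a[1])) / det
    w = ((b[0] - a[0]) * (p[1] - a[1]) - (p[0] - a[0]) * (b[1] - a[1])) / det
    return (1 - v - w, v, w)

found = []
for a, b, c in combinations(square, 3):
    cs = coeffs(p, a, b, c)
    if cs is not None and all(t >= 0 for t in cs):
        u, v, w = cs
        # p = u*a + v*b + w*c is a convex combination of just three points of S
        assert (u * a[0] + v * b[0] + w * c[0], u * a[1] + v * b[1] + w * c[1]) == p
        found.append((a, b, c))
assert found  # at most n + 1 = 3 points suffice, as the theorem guarantees
```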
1.4. Polytopes. The simplest convex sets are those which are convex hulls of<br />
a finite set of points, that is, sets of the form S = conv{x 1 , x 2 , ..., x m }. The convex<br />
hull of a finite set of points in R n is called a polytope.<br />
1. Prove that the set

∆ = {x ∈ R^{n+1} : Σ_{i=1}^{n+1} x_i = 1 and x_i ≥ 0 for all i}

is a polytope. This polytope is called the standard n-dimensional simplex.
2. Prove that the set

C = {x ∈ R^n : 0 ≤ x_i ≤ 1 for all i}

is a polytope. This polytope is called an n-dimensional cube.
3. Prove that the set

O = {x ∈ R^n : Σ_{i=1}^n |x_i| ≤ 1}

is a polytope. This polytope is called a (hyper)octahedron.
1.5. Topology of Convex Sets.<br />
(1) The closure of a convex set is a convex set.<br />
(2) The interior of a convex set (possibly empty) is convex.
60 4. TOPICS IN CONVEX ANALYSIS<br />
1.6. Aside: Helly’s Theorem. While there are not many applications of
Helly’s theorem in economics (in fact, I am aware of only one paper that uses
Helly’s theorem in an economic context), it is definitely one of the most famous results
in convexity.
Theorem 13 (Helly, 1913). Let A 1 , A 2 , ..., A m ⊆ R n be a finite family of convex<br />
sets with m ≥ n + 1. Suppose that every n + 1 sets have a nonempty intersection.<br />
Then all sets have a nonempty intersection.<br />
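For n = 1 the theorem is easy to visualise: convex subsets of the line are intervals, and n + 1 = 2, so pairwise-intersecting intervals must all share a common point. A quick check on a hypothetical family of intervals:

```python
from itertools import combinations

# Helly's theorem in R**1: if every pair of these intervals intersects,
# then they all have a common point.
intervals = [(0, 5), (1, 7), (3, 9), (4, 6), (2, 8)]

# every pair intersects ...
for (a, b), (c, d) in combinations(intervals, 2):
    assert max(a, c) <= min(b, d)

# ... hence they all do: the common part is [max of lows, min of highs]
lo = max(a for a, _ in intervals)
hi = min(b for _, b in intervals)
assert lo <= hi  # here the common intersection is [4, 5]
```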
To prove Helly’s theorem with elegance we first need to formulate a very useful
result obtained by J. Radon.
Theorem 14 (Radon, 1921). Let S ⊆ R n be a set of at least n + 2 points.<br />
Then there are two non-intersecting subsets R ⊂ S (‘red points’) and B ⊂ S (‘blue<br />
points’) such that<br />
convR ∩ convB ≠ ∅.<br />
Proof. Let x_1, …, x_m be m ≥ n + 2 distinct points from S. Consider the
system of n + 1 homogeneous linear equations in the variables γ_1, …, γ_m:

γ_1 x_1 + … + γ_m x_m = 0 and γ_1 + … + γ_m = 0.

Since m ≥ n + 2, there is a nontrivial solution to this system. Let

R = {x_i : γ_i > 0} and B = {x_i : γ_i < 0}.

Then R ∩ B = ∅. Let β = Σ_{i: γ_i > 0} γ_i; then β > 0 and Σ_{i: γ_i < 0} γ_i = −β, since the γ’s sum
up to zero. Moreover,

Σ_{i: γ_i > 0} γ_i x_i = − Σ_{i: γ_i < 0} γ_i x_i,

since Σ γ_i x_i = 0. Let

x = Σ_{i: γ_i > 0} (γ_i/β) x_i = Σ_{i: γ_i < 0} (−γ_i/β) x_i.

Then x is a convex combination of points of R and also a convex combination of
points of B, so x ∈ conv R ∩ conv B. □
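A concrete instance of the construction in the proof, with hypothetical data: for the four corners of a square in R² (so m = 4 ≥ n + 2), γ = (1, −1, −1, 1) solves the homogeneous system, and both convex combinations in the proof land on the same point (1/2, 1/2):

```python
from fractions import Fraction as F

pts = [(F(0), F(0)), (F(1), F(0)), (F(0), F(1)), (F(1), F(1))]
gamma = [F(1), F(-1), F(-1), F(1)]

# the n + 1 = 3 homogeneous equations from the proof
assert sum(gamma) == 0
assert sum(g * p[0] for g, p in zip(gamma, pts)) == 0
assert sum(g * p[1] for g, p in zip(gamma, pts)) == 0

beta = sum(g for g in gamma if g > 0)
# convex combination of the 'red' points {(0,0), (1,1)} ...
red = tuple(sum((g / beta) * p[k] for g, p in zip(gamma, pts) if g > 0)
            for k in (0, 1))
# ... and of the 'blue' points {(1,0), (0,1)}
blue = tuple(sum((-g / beta) * p[k] for g, p in zip(gamma, pts) if g < 0)
             for k in (0, 1))
assert red == blue == (F(1, 2), F(1, 2))  # a common point of the two hulls
```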
2. Support and Separation<br />
2.1. Hyperplanes. The concept of a hyperplane in R^n is a straightforward
generalisation of the notion of a line in R² and of a plane in R³. A line in R² can be
described by an equation<br />
p 1 x 1 + p 2 x 2 = α<br />
where p = (p 1 , p 2 ) is some non-zero vector and α is some scalar. A plane in R 3 can<br />
be described by an equation<br />
p 1 x 1 + p 2 x 2 + p 3 x 3 = α<br />
where p = (p 1 , p 2 , p 3 ) is some non-zero vector and α is some scalar. Similarly, a<br />
hyperplane in R n can be described by an equation<br />
Σ_{i=1}^n p_i x_i = α

where p = (p_1, p_2, …, p_n) is some non-zero vector in R^n and α is some scalar. This can
be written in a more concise way using scalar (also known as inner, or dot) product notation.
Definition 21. A hyperplane is the set<br />
H(p, α) = {x ∈ R n : p · x = α}<br />
where p ∈ R n is a non-zero vector and α is a scalar. The vector p is called the<br />
normal to the hyperplane H.<br />
Suppose that there are two points x ∗ , y ∗ ∈ H(p, α). Then by definition p·x ∗ = α<br />
and p · y ∗ = α. Hence p · (x ∗ − y ∗ ) = 0. In other words, the vector p is orthogonal to<br />
the vector x ∗ − y ∗ , and hence to any line segment lying in H(p, α).<br />
Given a hyperplane H ⊂ R n , points in R n can be classified according to their<br />
positions relative to the hyperplane. The (closed) half-space determined by the hyperplane<br />
H(p, α) is either the set of points ‘below’ H or the set of points ‘above’ H,<br />
i.e., either the set {x ∈ R n : p · x ≤ α} or the set {x ∈ R n : p · x ≥ α}. Open<br />
half-spaces are defined by strict inequalities. (Prove that a closed half-space is closed<br />
and an open half-space is open.)<br />
A straightforward economic example of a half-space is the budget set {x ∈<br />
R n : p · x ≤ α} of a consumer with income α facing the vector of prices p. (It was<br />
rather neat to call the normal vector p, wasn’t it?) By the way, hyperplanes and<br />
half-spaces are convex sets. (Prove it.)<br />
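A quick numerical illustration of classifying bundles relative to the budget hyperplane. The prices p and income α below are hypothetical, chosen only for the example:

```python
import numpy as np

p = np.array([2.0, 3.0])   # prices: the normal to the budget hyperplane
alpha = 12.0               # income

def side(x):
    """Classify a bundle x relative to H(p, alpha): 'below' (affordable
    with slack), 'on' (exactly exhausts income), or 'above' (unaffordable)."""
    v = float(p @ np.asarray(x, dtype=float))
    if np.isclose(v, alpha):
        return "on"
    return "below" if v < alpha else "above"
```

The bundle (3, 2) costs 2·3 + 3·2 = 12 and lies on the hyperplane; (1, 1) lies in the open half-space below it.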
2.2. Support Functions. In this section we give a description of what is<br />
called a dual structure. Consider the set of all closed convex subsets of R n . We<br />
will show that to each such set S we can associate an extended-real valued function<br />
µ S : R n → R ∪ {−∞}, that is, a function that maps each vector in R n to either a real<br />
number or to −∞. Not all such functions can be arrived at in this way. In fact<br />
we shall show that any such function must be concave and homogeneous of degree<br />
1. But once we restrict attention to functions that can be arrived at as a “support<br />
function” for some such closed convex set, we have another set of objects that we<br />
can analyse, and perhaps use to make useful arguments about, the original sets in which<br />
we were interested.<br />
In fact, we shall define the function µ S for any subset of R n , not just the closed<br />
and convex ones. However, if the original set S is not a closed convex one we shall<br />
lose some information about S in going to µ S . In particular, µ S only depends on<br />
the closed convex hull of S, that is, if two sets have the same closed convex hull<br />
they will lead to the same function µ S .<br />
We define µ S : R n → R ∪ {−∞} as<br />
µ S (p) = inf{p · x | x ∈ S},
where inf denotes the infimum or greatest lower bound. It is a property of the<br />
real numbers that any nonempty set of real numbers that is bounded below has an<br />
infimum; for a set that is not bounded below we take the infimum to be −∞. Thus<br />
µ S (p) is well defined for any set S. If the minimum exists, for example if the set S is<br />
compact, then the infimum is the minimum. In other cases the minimum may not exist.<br />
To take a simple one-dimensional example, suppose that the set S is the subset of R<br />
consisting of the numbers 1/n for n = 1, 2, . . . and that p = 2. Then clearly p·x = px<br />
does not have a minimum on the set S. However, 0 is less than px = 2x for every<br />
x in S, but for any number a greater than 0 there is an x in S such that<br />
px < a. Thus 0 is in this case the infimum of the set {p · x | x ∈ S}.<br />
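For a finite set S the infimum is attained, so µ_S is just a minimum over the points of S. The sketch below is our own; it computes µ_S for the corners of the unit square and checks the two properties claimed above: homogeneity of degree 1, and concavity via superadditivity (inf over x of (p+q)·x is at least inf p·x plus inf q·x):

```python
import numpy as np

def mu(S, p):
    """Support function mu_S(p) = min{p . x : x in S} for a finite set S."""
    return float(np.min(np.asarray(S, dtype=float) @ np.asarray(p, dtype=float)))

square = [[0, 0], [1, 0], [0, 1], [1, 1]]
p, q = np.array([1.0, -2.0]), np.array([-3.0, 1.0])

# Positive homogeneity of degree 1: mu(t p) = t mu(p) for t > 0.
homog = np.isclose(mu(square, 5 * p), 5 * mu(square, p))
# Superadditivity, which together with homogeneity gives concavity.
superadd = mu(square, p + q) >= mu(square, p) + mu(square, q)
```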
Recall that we have not assumed that S is convex. However, if we do assume<br />
that S is both convex and closed then the function µ S contains all the information<br />
needed to reconstruct S.<br />
Given any extended-real valued function µ : R n → R ∪ {−∞} let us define the<br />
set S µ as<br />
S µ = {x ∈ R n | p · x ≥ µ(p) for every p ∈ R n }.<br />
That is, for each p with µ(p) > −∞ we define the closed half space<br />
{x ∈ R n | p · x ≥ µ(p)}.<br />
Notice that if µ(p) = −∞ then p · x ≥ µ(p) for any x and so the above set will be<br />
R n rather than a half space. The set S µ is the intersection of all these closed half<br />
spaces. Since the intersection of convex sets is convex and the intersection of closed<br />
sets is closed, the set S µ is, for any function µ, a closed convex set.<br />
Suppose that we start with a set S, define µ S as above and then use µ S to<br />
define the set S µS . If the set S was a closed convex set then S µS will be exactly<br />
equal to S. Since we have seen that S µS is a closed convex set, it must be that if<br />
S is not a closed convex set it will not be equal to S µS . However, S will always be<br />
a subset of S µS , and indeed S µS will be the smallest closed convex set containing S,<br />
that is, S µS is the closed convex hull of S.<br />
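Membership in S_µS can be checked numerically by sampling directions p: a point of the closed convex hull satisfies p·x ≥ µ_S(p) for every p, while for a point outside, some sampled direction violates the inequality with high probability. This sketch is ours; the tolerance and the number of trials are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def looks_in_S_mu(x, S, trials=1000):
    """Test p . x >= mu_S(p) for randomly sampled directions p.
    A False result certifies x is outside the closed convex hull of S;
    True only means no separating direction was found in the sample."""
    S = np.asarray(S, dtype=float)
    x = np.asarray(x, dtype=float)
    for _ in range(trials):
        p = rng.standard_normal(S.shape[1])
        if p @ x < np.min(S @ p) - 1e-9:   # a half-space that excludes x
            return False
    return True
```

For S the four corners of the unit square, (0.5, 0.5) passes every sampled half-space test, while (2, 2) is quickly separated (any p with both coordinates negative works).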
2.3. Separation. We now consider the notion of ‘separating’ two sets by a<br />
hyperplane.<br />
Definition 22. A hyperplane H separates sets A and B if A is contained in<br />
one closed half-space and B is contained in the other. A hyperplane H strictly<br />
separates sets A and B if A is contained in one open half-space and B is contained<br />
in the other.<br />
It is clear that strict separation requires the two sets to be disjoint. For example,<br />
consider two (externally) tangent circles in a plane. Their common tangent line<br />
separates them but does not separate them strictly. On the other hand, although it<br />
is necessary for two sets to be disjoint in order to strictly separate them, this condition<br />
is not sufficient, even for closed convex sets. Let A = {x ∈ R 2 : x 1 > 0 and<br />
x 1 x 2 ≥ 1} and B = {x ∈ R 2 : x 1 ≥ 0 and x 2 = 0}. Then A and B are disjoint<br />
closed convex sets but they cannot be strictly separated by a hyperplane (a line in<br />
R 2 ). Thus the problem of the existence of a separating hyperplane is more involved<br />
than it may appear to be at first.<br />
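The failure in this example comes from the fact that, although A and B are disjoint, the distance between them is zero: the points (t, 1/t) ∈ A and (t, 0) ∈ B get arbitrarily close as t grows, so no gap is available for two open half-spaces. A two-line check of our own:

```python
# d((t, 1/t), (t, 0)) = 1/t shrinks to 0 as t grows, which is why
# strict separation of the hyperbola region and the axis fails.
gaps = [1.0 / t for t in (1, 10, 100, 1000)]
```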
We start with separation of a set and a point.<br />
Theorem 15. Let S ⊆ R n be a convex set and x 0 /∈ S be a point. Then S and<br />
x 0 can be separated. If S is closed then S and x 0 can be strictly separated.<br />
Idea of proof. The proof proceeds in two steps. The first step establishes the<br />
existence of a point a in the closure of S which is the closest to x 0 . The second step<br />
constructs the separating hyperplane using the point a.
STEP 1. There exists a point a ∈ ¯S (the closure of S) such that d(x 0 , a) ≤ d(x 0 , x)<br />
for all x ∈ ¯S, and d(x 0 , a) > 0.<br />
Let ¯B(x 0 ) be a closed ball with centre at x 0 that intersects the closure of S, and<br />
let A = ¯B(x 0 ) ∩ ¯S. The set A is nonempty, closed and bounded (hence<br />
compact). According to Weierstrass’s theorem, the continuous distance function<br />
d(x 0 , ·) achieves its minimum on A at some point a. Since every x ∈ ¯S outside the<br />
ball is farther from x 0 than any point of A, d(x 0 , a) ≤ d(x 0 , x) for all x ∈ ¯S. Note<br />
that d(x 0 , a) > 0.<br />
STEP 2. There exists a hyperplane H(p, α) = {x ∈ R n : p · x = α} such that<br />
p · x ≥ α for all x ∈ ¯S and p · x 0 < α.<br />
Construct a hyperplane which goes through the point a ∈ ¯S and has normal<br />
p = a − x 0 . The proof that this hyperplane is the separating one is done by<br />
contradiction. Suppose there exists a point y ∈ ¯S which is strictly on the same side<br />
of H as x 0 . Consider the point y ′ ∈ [a, y] such that the vector y ′ − x 0 is orthogonal<br />
to y − a. Since d(x 0 , y) ≥ d(x 0 , a), the point y ′ lies between a and y. Thus y ′ ∈ ¯S,<br />
by convexity of ¯S, and d(x 0 , y ′ ) < d(x 0 , a), which contradicts the choice of a. When<br />
S = ¯S, that is, when S is closed, the separation can be made strict by passing the<br />
hyperplane through a point strictly between a and x 0 instead of through a. This is<br />
always possible because d(x 0 , a) > 0.<br />
□<br />
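For sets onto which projection is easy to compute, the two-step construction in this proof is completely explicit. The sketch below is our own illustration: it separates a point from an axis-aligned box, a closed convex set whose nearest point is obtained by clipping coordinate by coordinate:

```python
import numpy as np

def separate_from_box(x0, lo, hi):
    """Separating hyperplane H(p, alpha) between a point x0 outside the
    box [lo, hi]^n and the box itself, following the two steps above:
    a = nearest point of the box to x0, normal p = a - x0, alpha = p . a.
    Then p . x >= alpha on the box while p . x0 < alpha.
    """
    x0 = np.asarray(x0, dtype=float)
    a = np.clip(x0, lo, hi)      # Step 1: the closest point of the box
    p = a - x0                   # Step 2: normal pointing from x0 toward the box
    return p, float(p @ a)
```

For x0 = (2, 3) and the box [0, 1]², the nearest point is a = (1, 1), giving p = (−1, −2) and α = −3: over the box p·x attains its minimum −3 at (1, 1), while p·x0 = −8 < −3.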
Theorem 15 is very useful because separation of a pair of sets can always be<br />
reduced to separation of a set and a point.<br />
Lemma 2. Let A and B be non-empty sets. A and B can be separated<br />
(strictly separated) iff A − B and 0 can be separated (strictly separated).<br />
Proof. The hyperplane H(p, α) separates A and B iff p · a ≥ α ≥ p · b for all a ∈ A<br />
and b ∈ B, and such an α exists iff p · (a − b) ≥ 0 for all a ∈ A and b ∈ B, i.e., iff a<br />
hyperplane through 0 with normal p separates A − B and 0 (and similarly for strict<br />
separation). Moreover, if A and B are convex then A − B is convex; if A is compact<br />
and B is closed then A − B is closed; and 0 /∈ A − B iff A ∩ B = ∅.<br />
□<br />
Theorem 16 (Minkowski, 1911). Let A and B be non-empty convex sets<br />
with A ∩ B = ∅. Then A and B can be separated. If A is compact and B is closed<br />
then A and B can be strictly separated.<br />
2.4. Support. Closely related (though not in a topological sense) to the notion of a<br />
separating hyperplane is the notion of a supporting hyperplane.<br />
Definition 23. The hyperplane H supports the set S at the point x 0 ∈ S if<br />
x 0 ∈ H and S is a subset of one of the half-spaces determined by H.<br />
A convex set can be supported at any of its boundary points; this is an immediate<br />
consequence of Theorem 16. To prove it, consider the sets A and B = {x 0 },<br />
where x 0 is a boundary point of A.<br />
Theorem 17. Let S ⊆ R n be a convex set with nonempty interior and x 0 ∈ S<br />
be a boundary point of S. Then there exists a supporting hyperplane for S at x 0 .<br />
Note that if the boundary of a convex set is smooth (‘differentiable’) at the<br />
given point x 0 then the supporting hyperplane is unique and is just the tangent<br />
hyperplane. If, however, the boundary is not smooth then there can be many<br />
supporting hyperplanes passing through the given point. It is important to note<br />
that conceptually the supporting theorems are connected to calculus. But the<br />
supporting theorems are more powerful (they don’t require smoothness), more direct,<br />
and more set-theoretic.<br />
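For the unit disk, which has a smooth boundary, the supporting hyperplane at a boundary point x0 is the tangent line with normal p = x0 and α = p·x0 = 1: by the Cauchy–Schwarz inequality, x0·x ≤ ‖x0‖‖x‖ ≤ 1 for every x in the disk. A numerical spot-check of our own, sampling the disk by radially pulling Gaussian draws inside:

```python
import numpy as np

x0 = np.array([0.6, 0.8])          # a boundary point of the unit disk (norm 1)
rng = np.random.default_rng(1)

# Sample points of the closed unit disk: draws with norm > 1 are
# rescaled onto the unit circle, the rest are left where they are.
pts = rng.standard_normal((500, 2))
pts /= np.maximum(np.linalg.norm(pts, axis=1, keepdims=True), 1.0)

# The whole sample lies in the half-space {x : x0 . x <= 1}, while
# x0 itself lies on the hyperplane x0 . x = 1.
supported = bool(np.all(pts @ x0 <= 1 + 1e-9))
```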
Certain points on the boundary of a convex set carry a lot of information about<br />
the set.<br />
Definition 24. A point x of a convex set S is an extreme point of S if x is<br />
not an interior point of any line segment in S.
The extreme points of a closed ball and of a closed cube in R 3 are, respectively, its<br />
boundary points and its eight vertices. A half-space has no extreme points even<br />
if it is closed.<br />
An interesting property of extreme points is that an extreme point can be<br />
deleted from the set without destroying convexity of the set. That is, a point x in<br />
a convex set S is an extreme point iff the set S\{x} is convex.<br />
The next theorem is a finite-dimensional version of a quite general and powerful<br />
result by M. G. Krein and D. P. Milman.<br />
Theorem 18 (Krein & Milman, 1940). Let S ⊆ R n be convex and compact.<br />
Then S is the convex hull of its extreme points.
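For the cube [0, 1]^n the theorem is fully constructive: every point is a convex combination of the 2^n vertices (the extreme points), with the product weights used below. The sketch is our own illustration:

```python
import itertools
import numpy as np

def as_vertex_combination(x):
    """Write a point x of the cube [0, 1]^n as a convex combination of
    the cube's extreme points (its vertices), as Theorem 18 guarantees.
    Vertex v receives weight prod_i x_i^{v_i} (1 - x_i)^{1 - v_i};
    these weights are non-negative and sum to one."""
    x = np.asarray(x, dtype=float)
    verts = np.array(list(itertools.product((0, 1), repeat=len(x))), dtype=float)
    weights = np.array([np.prod(np.where(v == 1, x, 1 - x)) for v in verts])
    return verts, weights
```

For x = (0.3, 0.8) the weights on the vertices (0,0), (0,1), (1,0), (1,1) are 0.14, 0.56, 0.06, 0.24, which sum to one and recombine to x.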