# R basics: Working with data

Data basics Data manipulation Basic data analysis Exercises Conclusions

Andreas Alfons

Pieter Schoonees

Erasmus School of Economics, Erasmus University Rotterdam

ERIM Summer Course: Data Analysis with R, July 3, 2013

Software requirements

−→ Data from package eRim are used

R> library("eRim")

−→ Packages foreign and XLConnect are required

R> install.packages(c("foreign", "XLConnect"))

Content

1 Data basics: data types and operations

2 Data manipulation

3 Basic data analysis

4 Exercises

5 Conclusions

Data basics: data types and operations

Univariate data types

vector: three elementary types

- numeric: quantitative variables
- character: text strings (between quotes)
- logical: TRUE or FALSE

factor: qualitative variables with labels for categories

−→ There are no scalars, just vectors of length 1

Vectors

Combine values into a vector with function c():

R> c(2, 4, 5, 421, 915)

[1] 2 4 5 421 915

R> c("articles", "mentor")

[1] "articles" "mentor"

Sequences of values:

R> 3:7

[1] 3 4 5 6 7

R> seq(0, 1, by=0.2)

[1] 0.0 0.2 0.4 0.6 0.8 1.0
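
A minimal sketch of the "no scalars" remark and of combining vectors. The object names `x`, `y` and `s` are illustrative and not taken from the slides.

```r
# A single number is stored as a numeric vector of length 1
x <- 42
is.vector(x)
length(x)

# c() also combines existing vectors into one longer vector
y <- c(x, 3:7)
length(y)

# seq() accepts the number of values instead of the step size
s <- seq(0, 1, length.out=6)
s
```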

Assigning values

Assign values to an object with <-:

R> i <- c(2, 4, 5, 421, 915)

R> i

[1] 2 4 5 421 915

R> keep <- c("articles", "mentor")

R> keep

[1] "articles" "mentor"

−→ If values are assigned to an object, they are not printed

−→ Print values by typing the name of the object

Factors

Encode a vector as a factor with function factor():

R> factor(c("foo", "foo", "bar"))

[1] foo foo bar

Levels: bar foo

Specify factor levels:

R> factor(c("foo", "foo", "bar"),

+ levels=c("foo", "bar"))

[1] foo foo bar

Levels: foo bar
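
A short sketch of what the level order changes in practice: the internal integer codes and the order of tabulated counts follow the levels. The object name `f` is illustrative.

```r
# Same factor as above, with the level order given explicitly
f <- factor(c("foo", "foo", "bar"), levels=c("foo", "bar"))

levels(f)       # level labels in the specified order
as.integer(f)   # internal codes: position of each value in levels(f)
table(f)        # counts are reported in level order
```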

Matrices and data frames

matrix: only vectors of the same type allowed

−→ Most common use case: numeric matrices in arithmetic expressions

data.frame: collection of variables of any type

−→ Data sets are usually in the form of data frames

−→ Modelling functions typically require data frames

Matrices

Create a matrix:

R> A <- matrix(1:6, nrow=3)

R> A

[,1] [,2]

[1,] 1 4

[2,] 2 5

[3,] 3 6

Matrix operations:

R> t(A) %*% A

[,1] [,2]

[1,] 14 32

[2,] 32 77
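
A sketch contrasting elementwise and matrix multiplication, assuming the matrix shown above is filled column by column from the values 1 to 6.

```r
# Matrices are filled column by column unless byrow=TRUE is given
A <- matrix(1:6, nrow=3)
dim(A)          # number of rows and columns

A * A           # elementwise product, same dimensions as A
t(A) %*% A      # matrix product, 2 x 2
crossprod(A)    # computes t(A) %*% A directly
A %*% t(A)      # matrix product in the other order, 3 x 3
```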

Data frames

Create a data frame:

R> df <- data.frame(foo=1:4, bar=c("a", "b", "c", "d"),

+ keep=c(FALSE, TRUE, TRUE, FALSE))

R> df

foo bar keep

1 1 a FALSE

2 2 b TRUE

3 3 c TRUE

4 4 d FALSE
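
A sketch of how a data frame like the one above can be inspected. The column values are copied from the printed output; note that text columns become factors by default in R versions before 4.0.0 and character vectors afterwards.

```r
df <- data.frame(foo=1:4,
                 bar=c("a", "b", "c", "d"),
                 keep=c(FALSE, TRUE, TRUE, FALSE))

str(df)             # compact overview: one line per variable
sapply(df, class)   # type of each variable
df[df$keep, ]       # rows for which the logical variable is TRUE
```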

Special values

NA: not available (represents a missing value)

NaN: not a number (usually the result of the division 0/0)

Inf: positive infinity

-Inf: negative infinity

NULL: represents an undefined value
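
A brief sketch of how these special values behave in computations. The vector `x` is made up for illustration.

```r
x <- c(1, NA, 3)
mean(x)               # NA: missing values propagate
mean(x, na.rm=TRUE)   # 2: missing values are removed first

0/0                   # NaN
is.na(0/0)            # TRUE: NaN also counts as missing
is.nan(NA)            # FALSE: NA is not NaN

1/0                   # Inf
-1/0                  # -Inf
length(NULL)          # 0: NULL has no elements
```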

Basic math

| Operator or function | Operation | Example |
|---|---|---|
| + | addition | x + y |
| - | subtraction | x - y |
| - | unary minus | -x |
| * | multiplication | x * y |
| / | division | x / y |
| ^ | exponentiation | x ^ y |
| abs() | absolute value | abs(x) |
| sqrt() | square root | sqrt(x) |
| log() | natural logarithm | log(x) |
| exp() | exponential function | exp(x) |

−→ Vectorized arithmetic: operations are performed elementwise
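
A small sketch of vectorized arithmetic, including the recycling of a shorter vector. The values are invented for illustration.

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

x + y      # 11 22 33 44
x * y      # 10 40 90 160
x ^ 2      # 1 4 9 16
sqrt(x)    # square root of every element
x + 1      # the length-1 vector is recycled: 2 3 4 5
```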

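Comparisons

| Operator or function | Operation | Example |
|---|---|---|
| == | exactly equal | x == y |
| != | not equal | x != y |
| <= | less or equal | x <= y |
| >= | greater or equal | x >= y |
| < | less | x < y |
| > | greater | x > y |
| is.na() | is missing | is.na(x) |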

−→ Vectorized comparisons: operations are performed elementwise
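
A sketch of two common pitfalls with comparisons: missing values and floating-point equality. The vector `x` is invented for illustration.

```r
x <- c(0.5, 1, NA, 2)

x >= 1       # FALSE TRUE NA TRUE: comparisons with NA yield NA
x == NA      # all NA, so == cannot be used to find missing values
is.na(x)     # FALSE FALSE TRUE FALSE

# == tests exact equality, which can fail for floating-point results
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equality up to a tolerance
```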

Basic logic

| Operator or function | Operation | Example |
|---|---|---|
| ! | not | !is.na(x) |
| & | and | (x > 0) & (x < 1) |
| \| | or | (x < 0) \| (x > 1) |
| any() | any element TRUE? | any(x > 0) |
| all() | all elements TRUE? | all(x > 0) |

−→ Vectorized logic: operations !, & and | are performed elementwise
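
A sketch of elementwise logic next to the summarizing functions any() and all(). The vector `x` and the name `inside` are illustrative.

```r
x <- c(-0.5, 0.2, 0.7, 1.5)

inside <- (x > 0) & (x < 1)   # FALSE TRUE TRUE FALSE
!inside                       # TRUE FALSE FALSE TRUE

any(inside)   # TRUE: at least one element lies in (0, 1)
all(inside)   # FALSE: not every element does
sum(inside)   # 2: TRUE counts as 1 when summing
```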

Data conversion

−→ Functions is.foo() to check for type foo

−→ Functions as.foo() to convert to type foo

Important examples:

is.numeric()

is.factor()

is.matrix()

is.data.frame()

as.numeric()

as.factor()

as.matrix()

as.data.frame()
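
A sketch of a conversion that often surprises: as.numeric() applied to a factor returns the internal level codes, not the labels. The factor `f` is invented for illustration.

```r
f <- factor(c("10", "20", "10"))

is.factor(f)                  # TRUE
is.numeric(f)                 # FALSE
as.numeric(f)                 # 1 2 1: the internal codes
as.numeric(as.character(f))   # 10 20 10: convert the labels instead
```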

Data sets in text files

Read data from a text file:

1. Tools → Import Dataset → From Text File...
2. In the Select File dialog, select the text file
3. In the Import Dataset dialog, specify the name for the data set and the file characteristics, and click Import

Data sets in text files: command line

Read data from a text file:

R> PhDPublications <- read.table("PhDPublications.csv",

+ header=TRUE, sep=",")

Write data to a text file:

R> write.table(PhDPublications,

+ file="PhDPublications.csv",

+ sep=",", row.names=FALSE)

−→ If the text file is not in or should not be saved to the working directory, the full path needs to be specified
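
A self-contained sketch of a write and read round trip that uses a full path in a temporary directory instead of the working directory. The data frame `dat` and the file name `example.csv` are made up for illustration.

```r
dat <- data.frame(id=1:3, score=c(2.5, 3.0, 4.5))

# Build a full path so the working directory does not matter
path <- file.path(tempdir(), "example.csv")

write.table(dat, file=path, sep=",", row.names=FALSE)
dat2 <- read.table(path, header=TRUE, sep=",")
dat2
```

The call getwd() shows the current working directory, which is where relative file names are resolved.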

Data sets in Excel files

−→ Functionality available in package XLConnect

R> library("XLConnect")

R> workbook <- loadWorkbook("PhDPublications.xlsx")

R> PhDPublications <- readWorksheet(workbook, sheet=1)

Data sets in foreign data formats

−→ Functionality available in package foreign

R> library("foreign")

Read data from an SPSS file:

R> PhDPublications <- read.spss("PhDPublications.sav",

+ to.data.frame=TRUE)

Data manipulation

Data dimensions

Number of observations and columns together:

R> dim(PhDPublications)

[1] 915 6

Number of observations and columns separately:

R> nrow(PhDPublications)

[1] 915

R> ncol(PhDPublications)

[1] 6

Names

Variable names:

R> colnames(PhDPublications)

[1] "articles" "gender" "married" "kids"

[5] "prestige" "mentor"

Row names:

R> rownames(PhDPublications)

Data subsets

Take a subset of observations and variables with [,]:

R> i <- c(2, 4, 5, 421, 915)

R> keep <- c("articles", "mentor")

R> PhDPublications[i, keep]

articles mentor

2 0 6

4 0 3

5 0 26

421 1 18

915 19 42

Subsamples

Take a subsample of all variables with [,]:

R> PhDPublications[c(2, 4, 5, 421, 915),]

articles gender married kids prestige mentor

2 0 female no 0 2.05 6

4 0 male yes 1 1.18 3

5 0 female no 0 3.75 26

421 1 female yes 0 4.62 18

915 19 male yes 0 1.86 42

Extracting variables

Extract a variable with [,]:

R> articles <- PhDPublications[, "articles"]

R> mentor <- PhDPublications[, "mentor"]

Extract a variable with \$:

R> articles <- PhDPublications\$articles

R> mentor <- PhDPublications\$mentor

Subscripting revisited

−→ [,] allows any numeric, character or logical expression to specify observations and variables to keep

−→ Negative indices specify items to remove

R> PhDPublications[articles >= 10, -(2:5)]

articles mentor

910 10 18

911 11 7

912 12 35

913 12 5

914 16 21

915 19 42
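
A sketch of the three kinds of subscripts on a small made-up data frame `d`. It stands in for PhDPublications, which requires package eRim.

```r
d <- data.frame(articles=c(0, 12, 3, 19),
                gender=c("male", "female", "female", "male"),
                mentor=c(7, 35, 6, 42))

d[c(1, 3), c("articles", "mentor")]   # numeric rows, character columns
d[d$articles >= 10, ]                 # logical expression selects rows
d[d$articles >= 10, -2]               # negative index removes column 2

d[, "mentor"]                # a single column is returned as a vector
d[, "mentor", drop=FALSE]    # drop=FALSE keeps it as a data frame
```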

Subscripting vectors

Take a subset of a vector with []:

R> articles[c(2, 4, 5, 421, 915)]

[1] 0 0 0 1 19

R> mentor[articles >= 10]

[1] 18 7 35 5 21 42

Modifying subsets of the data

Copy data set before modifications:

R> PhD <- PhDPublications

Modify a subset:

R> PhD[c(2, 4, 5), "articles"] <- c(2, 1, 3)

R> head(PhD)

articles gender married kids prestige mentor

1 0 male yes 0 2.52 7

2 2 female no 0 2.05 6

3 0 female no 0 3.75 6

4 1 male yes 1 1.18 3

5 3 female no 0 3.75 26

6 0 female yes 2 3.59 2

Transforming variables

Transform a quantitative variable:

R> PhD\$logArticles <- log(PhD\$articles + 1)

R> summary(PhD\$logArticles)

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.0000 0.0000 0.6931 0.7897 1.0990 2.9960

Categorizing a quantitative variable

Use function cut() for categories between breakpoints:

R> b <- c(0, 2.5, 3.5, 5)

R> PhD\$prcat <- cut(PhD\$prestige, breaks=b)

R> summary(PhD\$prcat)

(0,2.5] (2.5,3.5] (3.5,5]

279 284 352

−→ Categorization yields loss of information

−→ Always keep the original variable
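
A sketch of how cut() treats values that fall exactly on a breakpoint: intervals are closed on the right by default. The vector `prestige` is made up; the breakpoints match the interval labels shown above.

```r
prestige <- c(0.8, 2.5, 2.6, 3.5, 4.9)
b <- c(0, 2.5, 3.5, 5)

prcat <- cut(prestige, breaks=b)
as.character(prcat)   # 2.5 falls in (0,2.5], 3.5 in (2.5,3.5]
table(prcat)          # counts per category

# Keep the original variable next to the categorized one
data.frame(prestige, prcat)
```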

Recoding factors

Retrieve factor levels with function levels():

R> levels(PhD\$prcat)

[1] "(0,2.5]" "(2.5,3.5]" "(3.5,5]"

Recode factor by combining levels() with assignment <-:

R> levels(PhD\$prcat) <- c("low", "average", "high")

R> summary(PhD\$prcat)

low average high

279 284 352
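
The same mechanism can also merge categories: assigning one label to two levels combines them. A sketch with a made-up factor `f`:

```r
f <- factor(c("a", "b", "c", "a"))

# Rename the levels; new labels are matched by position
levels(f) <- c("low", "average", "high")

# Give "average" the label "low" to merge the two levels
levels(f)[levels(f) == "average"] <- "low"

levels(f)   # "low" "high"
table(f)    # counts after merging
```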

Categorization revisited

−→ Use labels in the categorization:

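R> b <- c(0, 2.5, 3.5, 5)

R> l <- c("low", "average", "high")

R> PhD\$prcat <- cut(PhD\$prestige, breaks=b, labels=l)

R> summary(PhD\$prcat)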

low average high

279 284 352

Data transformations: with() and within()

with(): for computations with variables in a data frame

within(): for modifying a data frame, e.g., transforming variables

−→ Variables do not have to be accessed with \$

−→ Useful if computations require multiple variables
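
A minimal sketch of both functions on a small made-up data frame `d`; the column names mirror those of the course data, but the values are invented.

```r
d <- data.frame(articles=c(0, 1, 4), mentor=c(7, 6, 26))

# with(): evaluate an expression using the columns as variables;
# equivalent to mean(d$mentor[d$articles > 0])
with(d, mean(mentor[articles > 0]))

# within(): returns a modified copy, the original stays unchanged
d2 <- within(d, logArticles <- log(articles + 1))
names(d2)
names(d)
```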

Reordering observations

Use function order() to obtain order of observations:

R> o <- order(PhD\$prestige)

R> head(PhD[o, ])

articles gender married kids prestige mentor

12 0 female no 0 0.755 13

189 0 female no 0 0.755 0

86 0 male yes 2 0.920 1

350 1 female no 0 0.920 4

564 2 female yes 0 0.920 0

70 0 female no 0 1.005 0

logArticles prcat

12 0.0000000 low

189 0.0000000 low

86 0.0000000 low

350 0.6931472 low

564 1.0986123 low

70 0.0000000 low
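
A sketch of sorting with order() on a made-up data frame `d`, including a second sorting key to break ties.

```r
d <- data.frame(id=1:5, prestige=c(3.2, 0.9, 2.1, 0.9, 4.0))

# order() returns the row indices that sort the variable;
# tied values keep their original relative order
o <- order(d$prestige)
o
d[o, ]

# Sort in decreasing order
d[order(d$prestige, decreasing=TRUE), ]

# Break ties by a second key, here id in decreasing order
o2 <- order(d$prestige, -d$id)
d[o2, ]
```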

Basic data analysis

Distributions

Cumulative distribution function of a normal distribution:

R> pnorm(1.645)

[1] 0.9500151

Probability density function of a normal distribution:

R> dnorm(1.645)

[1] 0.1031108

Quantiles of a normal distribution:

R> qnorm(c(0.025, 0.975))

[1] -1.959964 1.959964

−→ See ?Distributions for information on other distributions
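
A sketch of the naming scheme behind these functions: prefix p for the distribution function, d for the density, q for quantiles and r for random numbers, followed by the name of the distribution. The t and binomial examples are additions for illustration.

```r
# qnorm() is the inverse of pnorm()
p <- pnorm(1.645)
qnorm(p)

# Probability of a standard normal value within +/- 1.96
pnorm(1.96) - pnorm(-1.96)

# Same scheme for other distributions
qt(0.975, df=9)                # quantile of a t-distribution
dbinom(2, size=10, prob=0.5)   # binomial probability of 2 successes
```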

Random numbers

Set the seed of the random number generator for reproducibility:

R> set.seed(03072013)

Generate data from a normal distribution:

R> x <- rnorm(10, ...)

R> x

[1] 4.1791331 -0.5020036 -3.5889621 0.4840535

[5] -0.2963982 3.3880320 1.9244055 -0.6760739

[9] 2.2400438 -0.1888176

−→ See ?Distributions for information on other distributions

Random samples

Take a random sample without replacement:

R> sample(10, 5)

[1] 5 1 6 9 7

R> sample(x, 3)

[1] 1.9244055 4.1791331 -0.2963982

Take a random sample with replacement:

R> sample(10, 5, replace=TRUE)

[1] 8 1 3 4 1

R> sample(x, 3, replace=TRUE)

[1] 0.4840535 -0.1888176 -0.5020036

Random permutations

Obtain a permutation:

R> sample(10)

[1] 5 4 2 10 3 8 1 6 7 9

R> sample(x)

[1] -0.6760739 0.4840535 -0.5020036 -0.1888176

[5] 2.2400438 1.9244055 4.1791331 3.3880320

[9] -0.2963982 -3.5889621
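
A sketch showing that resetting the seed reproduces a sample, and how sample() can draw random rows of a data frame. The seed value and the data frame `d` are arbitrary choices for illustration.

```r
# The same seed gives the same sample
set.seed(123)
a <- sample(10, 5)
set.seed(123)
b <- sample(10, 5)
identical(a, b)

# A permutation contains every value exactly once
p <- sample(10)
identical(sort(p), 1:10)

# Random subsample of observations: sample the row indices
d <- data.frame(id=1:6, value=c(2.1, 3.4, 1.8, 5.0, 4.2, 2.9))
d[sample(nrow(d), 3), ]
```

Note that if the first argument is a single number n of at least 1, sample() draws from 1:n rather than from that one value.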

Length and summary

Length of a vector:

R> length(x)

[1] 10

Summary statistics:

R> summary(x)

Min. 1st Qu. Median Mean 3rd Qu. Max.

-3.5890 -0.4506 0.1476 0.6963 2.1610 4.1790

Minimum and maximum

Minimum and maximum separately:

R> min(x)

[1] -3.588962

R> max(x)

[1] 4.179133

Minimum and maximum together:

R> range(x)

[1] -3.588962 4.179133

Quantiles

Default quantiles:

R> quantile(x)

0% 25% 50% 75% 100%

-3.5889621 -0.4506023 0.1476180 2.1611342 4.1791331

Quantiles for specified probabilities:

R> quantile(x, probs=c(0.25, 0.5, 0.75))

25% 50% 75%

-0.4506023 0.1476180 2.1611342

Mean and dispersion

Mean:

R> mean(x)

[1] 0.6963412

Standard deviation and variance:

R> sd(x)

[1] 2.279466

R> var(x)

[1] 5.195965

−→ sd() and var() use denominator n − 1
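
A sketch that checks the denominator by hand on a small made-up vector `v`.

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(v)
ss <- sum((v - mean(v))^2)   # sum of squared deviations, here 32

var(v)          # 32/7
ss / (n - 1)    # same value: denominator n - 1
sqrt(var(v))    # equals sd(v)

ss / n          # 4: version with denominator n, not what var() returns
```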

Confidence interval for the mean

Confidence level 1 − α based on t-distribution:

R> n <- length(x)

R> alpha <- 0.05

R> q <- qt(1 - alpha/2, df=n-1)

R> mean(x) + c(-q, q) * sd(x)/sqrt(n)

[1] -0.9342904 2.3269729

t-test with p-value: $H_0: \mu = 1$, $H_a: \mu \neq 1$

Test statistic:

R> n <- length(x)

R> t0 <- (mean(x) - 1) / (sd(x)/sqrt(n))

p-value:

R> 2 * (1 - pt(abs(t0), df=n-1))

[1] 0.6834458

t-test at significance level: $H_0: \mu = 1$, $H_a: \mu \neq 1$

Critical value at α = 0.05:

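R> alpha <- 0.05

R> q <- qt(1 - alpha/2, df=n-1)

Reject the null hypothesis if the absolute test statistic exceeds the critical value:

R> abs(t0) > q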

[1] FALSE

t-test revisited: $H_0: \mu = 1$, $H_a: \mu \neq 1$

−→ Better to use built-in t-test

R> t.test(x, mu=1)

One Sample t-test

data: x

t = -0.4213, df = 9, p-value = 0.6834

alternative hypothesis: true mean is not equal to 1

95 percent confidence interval:

-0.9342904 2.3269729

sample estimates:

mean of x

0.6963412
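
A sketch that cross-checks the manual computations against t.test() by extracting components from its result. The sample `z` and the hypothesized mean `mu0` are made up for illustration.

```r
z <- c(5.1, 4.9, 6.2, 5.8, 6.0, 5.5, 5.3, 6.1)
n <- length(z)
mu0 <- 5
alpha <- 0.05

# Manual test statistic, p-value and confidence interval
t0 <- (mean(z) - mu0) / (sd(z) / sqrt(n))
p0 <- 2 * (1 - pt(abs(t0), df=n-1))
q <- qt(1 - alpha/2, df=n-1)
ci <- mean(z) + c(-q, q) * sd(z) / sqrt(n)

# Built-in test: the result is a list with named components
res <- t.test(z, mu=mu0)
all.equal(unname(res$statistic), t0)
all.equal(res$p.value, p0)
all.equal(as.numeric(res$conf.int), ci)
```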

Columnwise means

R> keep <- c("articles", "mentor", "prestige")

R> PhD <- PhDPublications[, keep]

R> colMeans(PhD)

articles mentor prestige

1.692896 8.767213 3.103109

Covariance and correlation

Covariance matrix:

R> cov(PhD)

articles mentor prestige

articles 3.7097416 5.587075 0.1390303

mentor 5.5870754 89.944657 2.4308449

prestige 0.1390303 2.430845 0.9687462

Correlation matrix:

R> cor(PhD)

articles mentor prestige

articles 1.00000000 0.3058616 0.07333861

mentor 0.30586164 1.0000000 0.26041413

prestige 0.07333861 0.2604141 1.00000000
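
A sketch of how the two matrices relate: each correlation is the covariance divided by the product of the two standard deviations. The data frame `d` is made up for illustration.

```r
d <- data.frame(u=c(1, 2, 3, 4, 5),
                v=c(2, 1, 4, 3, 6),
                w=c(5, 3, 4, 1, 2))

S <- cov(d)
R <- cor(d)

# Rescale a single covariance by hand
S["u", "v"] / (sd(d$u) * sd(d$v))
R["u", "v"]

# cov2cor() rescales the whole covariance matrix at once
all.equal(cov2cor(S), R)
```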

Contingency tables

One-way contingency table:

R> with(PhDPublications, table(kids))

kids

0 1 2 3

599 195 105 16

Two-way contingency table:

R> with(PhDPublications, table(married, kids))

kids

married 0 1 2 3

no 309 0 0 0

yes 290 195 105 16
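
A sketch of two common follow-up steps for a contingency table: adding margins and converting counts to proportions. The vectors are made up and much smaller than the course data.

```r
married <- c("no", "yes", "yes", "no", "yes", "yes")
kids <- c(0, 1, 2, 0, 0, 1)

tab <- table(married, kids)
tab

addmargins(tab)             # append row and column totals
prop.table(tab, margin=1)   # proportions within each row
```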

Exercises

Conclusions

- Data sets are typically stored as data frames
- Quantitative variables are stored as numeric vectors and qualitative variables as factors
- R offers powerful subscripting tools to access or modify parts of the data
- Interactive data analysis: enter a command and examine the output
- Data import from many sources is possible through contributed packages
