Content from Introduction to R and RStudio


Last updated on 2024-12-04 | Edit this page

Estimated time: 55 minutes

Overview

Questions

  • How to find your way around RStudio?
  • How to interact with R?
  • How to manage your environment?
  • How to install packages?

Objectives

  • Describe the purpose and use of each pane in RStudio
  • Locate buttons and options in RStudio
  • Define a variable
  • Assign data to a variable
  • Manage a workspace in an interactive R session
  • Use mathematical and comparison operators
  • Call functions
  • Manage packages

Before Starting The Workshop


Please ensure you have the latest version of R and RStudio installed on your machine. This is important, as some packages used in the workshop may not install correctly (or at all) if R is not up to date.

Why use R and R studio?


Welcome to the R portion of the Software Carpentry workshop!

Science is a multi-step process: once you’ve designed an experiment and collected data, the real fun begins with analysis! Throughout this lesson, we’re going to teach you some of the fundamentals of the R language as well as some best practices for organizing code for scientific projects that will make your life easier.

Although we could use a spreadsheet in Microsoft Excel or Google sheets to analyze our data, these tools are limited in their flexibility and accessibility. Critically, they also are difficult to share steps which explore and change the raw data, which is key to “reproducible” research.

Therefore, this lesson will teach you how to begin exploring your data using R and RStudio. The R program is available for Windows, Mac, and Linux operating systems, and is a freely-available where you downloaded it above. To run R, all you need is the R program.

However, to make using R easier, we will use the program RStudio, which we also downloaded above. RStudio is a free, open-source, Integrated Development Environment (IDE). It provides a built-in editor, works on all platforms (including on servers) and provides many advantages such as integration with version control and project management.

Overview


We will begin with raw data, perform exploratory analyses, and learn how to plot results graphically. This example starts with a dataset from gapminder.org containing population information for many countries through time. Can you read the data into R? Can you plot the population for Senegal? Can you calculate the average income for countries on the continent of Asia? By the end of these lessons you will be able to do things like plot the populations for all of these countries in under a minute!

Basic layout

When you first open RStudio, you will be greeted by three panels:

  • The interactive R console/Terminal (entire left)
  • Environment/History/Connections (tabbed in upper right)
  • Files/Plots/Packages/Help/Viewer (tabbed in lower right)
RStudio layout

Once you open files, such as R scripts, an editor panel will also open in the top left.

RStudio layout with .R file open

R scripts

Any commands that you write in the R console can be saved to a file to be re-run again. Files containing R code to be ran in this way are called R scripts. R scripts have .R at the end of their names to let you know what they are.

Workflow within RStudio


There are two main ways one can work within RStudio:

  1. Test and play within the interactive R console then copy code into a .R file to run later.
  • This works well when doing small tests and initially starting off.
  • It quickly becomes laborious
  1. Start writing in a .R file and use RStudio’s short cut keys for the Run command to push the current line, selected lines or modified lines to the interactive R console.
  • This is a great way to start; all your code is saved for later
  • You will be able to run the file you create from within RStudio or using R’s source() function.

Tip: Running segments of your code

RStudio offers you great flexibility in running code from within the editor window. There are buttons, menu choices, and keyboard shortcuts. To run the current line, you can

  1. click on the Run button above the editor panel, or
  2. select “Run Lines” from the “Code” menu, or
  3. hit Ctrl+Return in Windows or Linux or +Return on OS X. (This shortcut can also be seen by hovering the mouse over the button). To run a block of code, select it and then Run. If you have modified a line of code within a block of code you have just run, there is no need to reselect the section and Run, you can use the next button along, Re-run the previous region. This will run the previous code block including the modifications you have made.

Introduction to R


Much of your time in R will be spent in the R interactive console. This is where you will run all of your code, and can be a useful environment to try out ideas before adding them to an R script file. This console in RStudio is the same as the one you would get if you typed in R in your command-line environment.

The first thing you will see in the R interactive session is a bunch of information, followed by a “>” and a blinking cursor. In many ways this is similar to the shell environment you learned about during the shell lessons: it operates on the same idea of a “Read, evaluate, print loop”: you type in commands, R tries to execute them, and then returns a result.

Using R as a calculator


The simplest thing you could do with R is to do arithmetic:

R

1 + 100

OUTPUT

[1] 101

And R will print out the answer, with a preceding “[1]”. [1] is the index of the first element of the line being printed in the console. For more information on indexing vectors, see Episode 6: Subsetting Data.

If you type in an incomplete command, R will wait for you to complete it. If you are familiar with Unix Shell’s bash, you may recognize this behavior from bash.

R

> 1 +

OUTPUT

+

Any time you hit return and the R session shows a “+” instead of a “>”, it means it’s waiting for you to complete the command. If you want to cancel a command you can hit Esc and RStudio will give you back the “>” prompt.

Tip: Canceling commands

If you’re using R from the command line instead of from within RStudio, you need to use Ctrl+C instead of Esc to cancel the command. This applies to Mac users as well!

Canceling a command isn’t only useful for killing incomplete commands: you can also use it to tell R to stop running code (for example if it’s taking much longer than you expect), or to get rid of the code you’re currently writing.

When using R as a calculator, the order of operations is the same as you would have learned back in school.

From highest to lowest precedence:

  • Parentheses: (, )
  • Exponents: ^ or **
  • Multiply: *
  • Divide: /
  • Add: +
  • Subtract: -

R

3 + 5 * 2

OUTPUT

[1] 13

Use parentheses to group operations in order to force the order of evaluation if it differs from the default, or to make clear what you intend.

R

(3 + 5) * 2

OUTPUT

[1] 16

This can get unwieldy when not needed, but clarifies your intentions. Remember that others may later read your code.

R

(3 + (5 * (2 ^ 2))) # hard to read
3 + 5 * 2 ^ 2       # clear, if you remember the rules
3 + 5 * (2 ^ 2)     # if you forget some rules, this might help

The text after each line of code is called a “comment”. Anything that follows after the hash (or octothorpe) symbol # is ignored by R when it executes code.

Really small or large numbers get a scientific notation:

R

2/10000

OUTPUT

[1] 2e-04

Which is shorthand for “multiplied by 10^XX”. So 2e-4 is shorthand for 2 * 10^(-4).

You can write numbers in scientific notation too:

R

5e3  # Note the lack of minus here

OUTPUT

[1] 5000

Mathematical functions


R has many built in mathematical functions. To call a function, we can type its name, followed by open and closing parentheses. Functions take arguments as inputs, anything we type inside the parentheses of a function is considered an argument. Depending on the function, the number of arguments can vary from none to multiple. For example:

R

getwd() #returns an absolute filepath

doesn’t require an argument, whereas for the next set of mathematical functions we will need to supply the function a value in order to compute the result.

R

sin(1)  # trigonometry functions

OUTPUT

[1] 0.841471

R

log(1)  # natural logarithm

OUTPUT

[1] 0

R

log10(10) # base-10 logarithm

OUTPUT

[1] 1

R

exp(0.5) # e^(1/2)

OUTPUT

[1] 1.648721

Don’t worry about trying to remember every function in R. You can look them up on Google, or if you can remember the start of the function’s name, use the tab completion in RStudio.

This is one advantage that RStudio has over R on its own, it has auto-completion abilities that allow you to more easily look up functions, their arguments, and the values that they take.

Typing a ? before the name of a command will open the help page for that command. When using RStudio, this will open the ‘Help’ pane; if using R in the terminal, the help page will open in your browser. The help page will include a detailed description of the command and how it works. Scrolling to the bottom of the help page will usually show a collection of code examples which illustrate command usage. We’ll go through an example later.

Comparing things


We can also do comparisons in R:

R

1 == 1  # equality (note two equals signs, read as "is equal to")

OUTPUT

[1] TRUE

R

1 != 2  # inequality (read as "is not equal to")

OUTPUT

[1] TRUE

R

1 < 2  # less than

OUTPUT

[1] TRUE

R

1 <= 1  # less than or equal to

OUTPUT

[1] TRUE

R

1 > 0  # greater than

OUTPUT

[1] TRUE

R

1 >= -9 # greater than or equal to

OUTPUT

[1] TRUE

Tip: Comparing Numbers

A word of warning about comparing numbers: you should never use == to compare two numbers unless they are integers (a data type which can specifically represent only whole numbers).

Computers may only represent decimal numbers with a certain degree of precision, so two numbers which look the same when printed out by R, may actually have different underlying representations and therefore be different by a small margin of error (called Machine numeric tolerance).

Instead you should use the all.equal function.

Further reading: http://floating-point-gui.de/

Variables and assignment


We can store values in variables using the assignment operator <-, like this:

R

x <- 1/40

Notice that assignment does not print a value. Instead, we stored it for later in something called a variable. x now contains the value 0.025:

R

x

OUTPUT

[1] 0.025

More precisely, the stored value is a decimal approximation of this fraction called a floating point number.

Look for the Environment tab in the top right panel of RStudio, and you will see that x and its value have appeared. Our variable x can be used in place of a number in any calculation that expects a number:

R

log(x)

OUTPUT

[1] -3.688879

Notice also that variables can be reassigned:

R

x <- 100

x used to contain the value 0.025 and now it has the value 100.

Assignment values can contain the variable being assigned to:

R

x <- x + 1 #notice how RStudio updates its description of x on the top right tab
y <- x * 2

The right hand side of the assignment can be any valid R expression. The right hand side is fully evaluated before the assignment occurs.

Variable names can contain letters, numbers, underscores and periods but no spaces. They must start with a letter or a period followed by a letter (they cannot start with a number nor an underscore). Variables beginning with a period are hidden variables. Different people use different conventions for long variable names, these include

  • periods.between.words
  • underscores_between_words
  • camelCaseToSeparateWords

What you use is up to you, but be consistent.

It is also possible to use the = operator for assignment:

R

x = 1/40

But this is much less common among R users. The most important thing is to be consistent with the operator you use. There are occasionally places where it is less confusing to use <- than =, and it is the most common symbol used in the community. So the recommendation is to use <-.

Challenge 1

Which of the following are valid R variable names?

R

min_height
max.height
_age
.mass
MaxLength
min-length
2widths
celsius2kelvin

The following can be used as R variables:

R

min_height
max.height
MaxLength
celsius2kelvin

The following creates a hidden variable:

R

.mass

The following will not be able to be used to create a variable

R

_age
min-length
2widths

Vectorization


One final thing to be aware of is that R is vectorized, meaning that variables and functions can have vectors as values. In contrast to physics and mathematics, a vector in R describes a set of values in a certain order of the same data type. For example:

R

1:5

OUTPUT

[1] 1 2 3 4 5

R

2^(1:5)

OUTPUT

[1]  2  4  8 16 32

R

x <- 1:5
2^x

OUTPUT

[1]  2  4  8 16 32

This is incredibly powerful; we will discuss this further in an upcoming lesson.

Managing your environment


There are a few useful commands you can use to interact with the R session.

ls will list all of the variables and functions stored in the global environment (your working R session):

R

ls()

OUTPUT

[1] "x" "y"

Tip: hidden objects

Like in the shell, ls will hide any variables or functions starting with a “.” by default. To list all objects, type ls(all.names=TRUE) instead

Note here that we didn’t give any arguments to ls, but we still needed to give the parentheses to tell R to call the function.

If we type ls by itself, R prints a bunch of code instead of a listing of objects.

R

ls

OUTPUT

function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
    pattern, sorted = TRUE)
{
    if (!missing(name)) {
        pos <- tryCatch(name, error = function(e) e)
        if (inherits(pos, "error")) {
            name <- substitute(name)
            if (!is.character(name))
                name <- deparse(name)
            warning(gettextf("%s converted to character string",
                sQuote(name)), domain = NA)
            pos <- name
        }
    }
    all.names <- .Internal(ls(envir, all.names, sorted))
    if (!missing(pattern)) {
        if ((ll <- length(grep("[", pattern, fixed = TRUE))) &&
            ll != length(grep("]", pattern, fixed = TRUE))) {
            if (pattern == "[") {
                pattern <- "\\["
                warning("replaced regular expression pattern '[' by  '\\\\['")
            }
            else if (length(grep("[^\\\\]\\[<-", pattern))) {
                pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
                warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
            }
        }
        grep(pattern, all.names, value = TRUE)
    }
    else all.names
}
<bytecode: 0x5614be79ee20>
<environment: namespace:base>

What’s going on here?

Like everything in R, ls is the name of an object, and entering the name of an object by itself prints the contents of the object. The object x that we created earlier contains 1, 2, 3, 4, 5:

R

x

OUTPUT

[1] 1 2 3 4 5

The object ls contains the R code that makes the ls function work! We’ll talk more about how functions work and start writing our own later.

You can use rm to delete objects you no longer need:

R

rm(x)

If you have lots of things in your environment and want to delete all of them, you can pass the results of ls to the rm function:

R

rm(list = ls())

In this case we’ve combined the two. Like the order of operations, anything inside the innermost parentheses is evaluated first, and so on.

In this case we’ve specified that the results of ls should be used for the list argument in rm. When assigning values to arguments by name, you must use the = operator!!

If instead we use <-, there will be unintended side effects, or you may get an error message:

R

rm(list <- ls())

ERROR

Error in rm(list <- ls()): ... must contain names or character strings

Tip: Warnings vs. Errors

Pay attention when R does something unexpected! Errors, like above, are thrown when R cannot proceed with a calculation. Warnings on the other hand usually mean that the function has run, but it probably hasn’t worked as expected.

In both cases, the message that R prints out usually give you clues how to fix a problem.

R Packages


It is possible to add functions to R by writing a package, or by obtaining a package written by someone else. As of this writing, there are over 10,000 packages available on CRAN (the comprehensive R archive network). R and RStudio have functionality for managing packages:

  • You can see what packages are installed by typing installed.packages()
  • You can install packages by typing install.packages("packagename"), where packagename is the package name, in quotes.
  • You can update installed packages by typing update.packages()
  • You can remove a package with remove.packages("packagename")
  • You can make a package available for use with library(packagename)

Packages can also be viewed, loaded, and detached in the Packages tab of the lower right panel in RStudio. Clicking on this tab will display all of the installed packages with a checkbox next to them. If the box next to a package name is checked, the package is loaded and if it is empty, the package is not loaded. Click an empty box to load that package and click a checked box to detach that package.

Packages can be installed and updated from the Package tab with the Install and Update buttons at the top of the tab.

Challenge 2

What will be the value of each variable after each statement in the following program?

R

mass <- 47.5
age <- 122
mass <- mass * 2.3
age <- age - 20

R

mass <- 47.5

This will give a value of 47.5 for the variable mass

R

age <- 122

This will give a value of 122 for the variable age

R

mass <- mass * 2.3

This will multiply the existing value of 47.5 by 2.3 to give a new value of 109.25 to the variable mass.

R

age <- age - 20

This will subtract 20 from the existing value of 122 to give a new value of 102 to the variable age.

Challenge 3

Run the code from the previous challenge, and write a command to compare mass to age. Is mass larger than age?

One way of answering this question in R is to use the > to set up the following:

R

mass > age

OUTPUT

[1] TRUE

This should yield a boolean value of TRUE since 109.25 is greater than 102.

Challenge 4

Clean up your working environment by deleting the mass and age variables.

We can use the rm command to accomplish this task

R

rm(age, mass)

Challenge 5

Install the following packages: ggplot2, plyr, gapminder

We can use the install.packages() command to install the required packages.

R

install.packages("ggplot2")
install.packages("plyr")
install.packages("gapminder")

An alternate solution, to install multiple packages with a single install.packages() command is:

R

install.packages(c("ggplot2", "plyr", "gapminder"))

When installing ggplot2, it may be required for some users to use the dependencies flag as a result of lazy loading affecting the install. This suggestion is not tied to any known bug discussion, and is advised based off instructor feedback/experience in resolving stochastic occurences of errors identified through delivery of this workshop:

R

install.packages("ggplot2", dependencies = TRUE)

Key Points

  • Use RStudio to write and run R programs.
  • R has the usual arithmetic operators and mathematical functions.
  • Use <- to assign values to variables.
  • Use ls() to list the variables in a program.
  • Use rm() to delete objects in a program.
  • Use install.packages() to install packages (libraries).

Content from Project Management With RStudio


Last updated on 2024-12-04 | Edit this page

Estimated time: 30 minutes

Overview

Questions

  • How can I manage my projects in R?

Objectives

  • Create self-contained projects in RStudio

Introduction


The scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, and eventually everything is a bit mixed together.

Most people tend to organize their projects like this:

Screenshot of file manager demonstrating bad project organisation

There are many reasons why we should ALWAYS avoid this:

  1. It is really hard to tell which version of your data is the original and which is the modified;
  2. It gets really messy because it mixes files with various extensions together;
  3. It probably takes you a lot of time to actually find things, and relate the correct figures to the exact code that has been used to generate it;

A good project layout will ultimately make your life easier:

  • It will help ensure the integrity of your data;
  • It makes it simpler to share your code with someone else (a lab-mate, collaborator, or supervisor);
  • It allows you to easily upload your code with your manuscript submission;
  • It makes it easier to pick the project back up after a break.

A possible solution


Fortunately, there are tools and packages which can help you manage your work effectively.

One of the most powerful and useful aspects of RStudio is its project management functionality. We’ll be using this today to create a self-contained, reproducible project.

Challenge 1: Creating a self-contained project

We’re going to create a new project in RStudio:

  1. Click the “File” menu button, then “New Project”.
  2. Click “New Directory”.
  3. Click “New Project”.
  4. Type in the name of the directory to store your project, e.g. “my_project”.
  5. If available, select the checkbox for “Create a git repository.”
  6. Click the “Create Project” button.

The simplest way to open an RStudio project once it has been created is to click through your file system to get to the directory where it was saved and double click on the .Rproj file. This will open RStudio and start your R session in the same directory as the .Rproj file. All your data, plots and scripts will now be relative to the project directory. RStudio projects have the added benefit of allowing you to open multiple projects at the same time each open to its own project directory. This allows you to keep multiple projects open without them interfering with each other.

Challenge 2: Opening an RStudio project through the file system

  1. Exit RStudio.
  2. Navigate to the directory where you created a project in Challenge 1.
  3. Double click on the .Rproj file in that directory.

Best practices for project organization


Although there is no “best” way to lay out a project, there are some general principles to adhere to that will make project management easier:

Treat data as read only

This is probably the most important goal of setting up a project. Data is typically time consuming and/or expensive to collect. Working with them interactively (e.g., in Excel) where they can be modified means you are never sure of where the data came from, or how it has been modified since collection. It is therefore a good idea to treat your data as “read-only”.

Data Cleaning

In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. Storing these scripts in a separate folder, and creating a second “read-only” data folder to hold the “cleaned” data sets can prevent confusion between the two sets.

Treat generated output as disposable

Anything generated by your scripts should be treated as disposable: it should all be able to be regenerated from your scripts.

There are lots of different ways to manage this output. Having an output folder with different sub-directories for each separate analysis makes it easier later. Since many analyses are exploratory and don’t end up being used in the final project, and some of the analyses get shared between projects.

Tip: Good Enough Practices for Scientific Computing

Good Enough Practices for Scientific Computing gives the following recommendations for project organization:

  1. Put each project in its own directory, which is named after the project.
  2. Put text documents associated with the project in the doc directory.
  3. Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.
  4. Put source for the project’s scripts and programs in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.
  5. Name all files to reflect their content or function.

Separate function definition and application

One of the more effective ways to work with R is to start by writing the code you want to run directly in a .R script, and then running the selected lines (either using the keyboard shortcuts in RStudio or clicking the “Run” button) in the interactive R console.

When your project is in its early stages, the initial .R script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It’s a good idea to separate these functions into two separate folders; one to store useful functions that you’ll reuse across analyses and projects, and one to store the analysis scripts.

Save the data in the data directory

Now we have a good directory structure we will now place/save the data file in the data/ directory.

Challenge 3

Download the gapminder data from this link to a csv file.

  1. Download the file (right mouse click on the link above -> “Save link as” / “Save file as”, or click on the link and after the page loads, press Ctrl+S or choose File -> “Save page as”)
  2. Make sure it’s saved under the name gapminder_data.csv
  3. Save the file in the data/ folder within your project.

We will load and inspect these data later.

Challenge 4

It is useful to get some general idea about the dataset, directly from the command line, before loading it into R. Understanding the dataset better will come in handy when making decisions on how to load it in R. Use the command-line shell to answer the following questions:

  1. What is the size of the file?
  2. How many rows of data does it contain?
  3. What kinds of values are stored in this file?

By running these commands in the shell:

SH

ls -lh data/gapminder_data.csv

OUTPUT

-rw-r--r-- 1 runner docker 80K Dec  4 15:34 data/gapminder_data.csv

The file size is 80K.

SH

wc -l data/gapminder_data.csv

OUTPUT

1705 data/gapminder_data.csv

There are 1705 lines. The data looks like:

SH

head data/gapminder_data.csv

OUTPUT

country,year,pop,continent,lifeExp,gdpPercap
Afghanistan,1952,8425333,Asia,28.801,779.4453145
Afghanistan,1957,9240934,Asia,30.332,820.8530296
Afghanistan,1962,10267083,Asia,31.997,853.10071
Afghanistan,1967,11537966,Asia,34.02,836.1971382
Afghanistan,1972,13079460,Asia,36.088,739.9811058
Afghanistan,1977,14880372,Asia,38.438,786.11336
Afghanistan,1982,12881816,Asia,39.854,978.0114388
Afghanistan,1987,13867957,Asia,40.822,852.3959448
Afghanistan,1992,16317921,Asia,41.674,649.3413952

Tip: command line in RStudio

The Terminal tab in the console pane provides a convenient place directly within RStudio to interact directly with the command line.

Working directory

Knowing R’s current working directory is important because when you need to access other files (for example, to import a data file), R will look for them relative to the current working directory.

Each time you create a new RStudio Project, it will create a new directory for that project. When you open an existing .Rproj file, it will open that project and set R’s working directory to the folder that file is in.

Challenge 5

You can check the current working directory with the getwd() command, or by using the menus in RStudio.

  1. In the console, type getwd() (“wd” is short for “working directory”) and hit Enter.
  2. In the Files pane, double click on the data folder to open it (or navigate to any other folder you wish). To get the Files pane back to the current working directory, click “More” and then select “Go To Working Directory”.

You can change the working directory with setwd(), or by using RStudio menus.

  1. In the console, type setwd("data") and hit Enter. Type getwd() and hit Enter to see the new working directory.
  2. In the menus at the top of the RStudio window, click the “Session” menu button, and then select “Set Working Directory” and then “Choose Directory”. Next, in the windows navigator that opens, navigate back to the project directory, and click “Open”. Note that a setwd command will automatically appear in the console.

Tip: File does not exist errors

When you’re attempting to reference a file in your R code and you’re getting errors saying the file doesn’t exist, it’s a good idea to check your working directory. You need to either provide an absolute path to the file, or you need to make sure the file is saved in the working directory (or a subfolder of the working directory) and provide a relative path.

Version Control

It is important to use version control with projects. Go here for a good lesson which describes using Git with RStudio.

Key Points

  • Use RStudio to create and manage projects with consistent layout.
  • Treat raw data as read-only.
  • Treat generated output as disposable.
  • Separate function definition and application.

Content from Seeking Help


Last updated on 2024-12-04 | Edit this page

Estimated time: 20 minutes

Overview

Questions

  • How can I get help in R?

Objectives

  • To be able to read R help files for functions and special operators.
  • To be able to use CRAN task views to identify packages to solve a problem.
  • To be able to seek help from your peers.

Reading Help Files


R, and every package, provide help files for functions. The general syntax to search for help on any function, “function_name”, from a specific function that is in a package loaded into your namespace (your interactive R session) is:

R

?function_name
help(function_name)

For example take a look at the help file for write.table(), we will be using a similar function in an upcoming episode.

R

?write.table()

This will load up a help page in RStudio (or as plain text in R itself).

Each help page is broken down into sections:

  • Description: An extended description of what the function does.
  • Usage: The arguments of the function and their default values (which can be changed).
  • Arguments: An explanation of the data each argument is expecting.
  • Details: Any important details to be aware of.
  • Value: The data the function returns.
  • See Also: Any related functions you might find useful.
  • Examples: Some examples for how to use the function.

Different functions might have different sections, but these are the main ones you should be aware of.

Notice how related functions might call for the same help file:

R

?write.table()
?write.csv()

This is because these functions have very similar applicability and often share the same arguments as inputs to the function, so package authors often choose to document them together in a single help file.

Tip: Running Examples

From within the function help page, you can highlight code in the Examples and hit Ctrl+Return to run it in RStudio console. This gives you a quick way to get a feel for how a function works.

Tip: Reading Help Files

One of the most daunting aspects of R is the large number of functions available. It would be prohibitive, if not impossible to remember the correct usage for every function you use. Luckily, using the help files means you don’t have to remember that!

Special Operators


To seek help on special operators, use quotes or backticks:

R

?"<-"
?`<-`

Getting Help with Packages


Many packages come with “vignettes”: tutorials and extended example documentation. Without any arguments, vignette() will list all vignettes for all installed packages; vignette(package="package-name") will list all available vignettes for package-name, and vignette("vignette-name") will open the specified vignette.

If a package doesn’t have any vignettes, you can usually find help by typing help("package-name").

RStudio also has a set of excellent cheatsheets for many packages.

When You Remember Part of the Function Name


If you’re not sure what package a function is in or how it’s specifically spelled, you can do a fuzzy search:

R

??function_name

A fuzzy search is when you search for an approximate string match. For example, you may remember that the function to set your working directory includes “set” in its name. You can do a fuzzy search to help you identify the function:

R

??set

When You Have No Idea Where to Begin


If you don’t know what function or package you need to use CRAN Task Views is a specially maintained list of packages grouped into fields. This can be a good starting point.

When Your Code Doesn’t Work: Seeking Help from Your Peers


If you’re having trouble using a function, 9 times out of 10, the answers you seek have already been answered on Stack Overflow. You can search using the [r] tag. Please make sure to see their page on how to ask a good question.

If you can’t find the answer, there are a few useful functions to help you ask your peers:

R

?dput

Will dump the data you’re working with into a format that can be copied and pasted by others into their own R session.

R

sessionInfo()

OUTPUT

R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.4.2 tools_4.4.2    yaml_2.3.10    knitr_1.49     xfun_0.49
[6] renv_1.0.11    evaluate_1.0.1

Will print out your current version of R, as well as any packages you have loaded. This can be useful for others to help reproduce and debug your issue.

Challenge 1

Look at the help page for the c function. What kind of vector do you expect will be created if you evaluate the following:

R

c(1, 2, 3)
c('d', 'e', 'f')
c(1, 2, 'f')

The c() function creates a vector, in which all elements are of the same type. In the first case, the elements are numeric, in the second, they are characters, and in the third they are also characters: the numeric values are “coerced” to be characters.

Challenge 2

Look at the help for the paste function. You will need to use it later. What’s the difference between the sep and collapse arguments?

To look at the help for the paste() function, use:

R

help("paste")
?paste

The difference between sep and collapse is a little tricky. The paste function accepts any number of arguments, each of which can be a vector of any length. The sep argument specifies the string used between concatenated terms — by default, a space. The result is a vector as long as the longest argument supplied to paste. In contrast, collapse specifies that after concatenation the elements are collapsed together using the given separator, the result being a single string.

It is important to call the arguments explicitly by typing out the argument name e.g sep = "," so the function understands to use the “,” as a separator and not a term to concatenate. e.g.

R

paste(c("a","b"), "c")

OUTPUT

[1] "a c" "b c"

R

paste(c("a","b"), "c", ",")

OUTPUT

[1] "a c ," "b c ,"

R

paste(c("a","b"), "c", sep = ",")

OUTPUT

[1] "a,c" "b,c"

R

paste(c("a","b"), "c", collapse = "|")

OUTPUT

[1] "a c|b c"

R

paste(c("a","b"), "c", sep = ",", collapse = "|")

OUTPUT

[1] "a,c|b,c"

(For more information, scroll to the bottom of the ?paste help page and look at the examples, or try example('paste').)

Challenge 3

Use help to find a function (and its associated parameters) that you could use to load data from a tabular file in which columns are delimited with “\t” (tab) and the decimal point is a “.” (period). This check for decimal separator is important, especially if you are working with international colleagues, because different countries have different conventions for the decimal point (i.e. comma vs period). Hint: use ??"read table" to look up functions related to reading in tabular data.

The standard R function for reading tab-delimited files with a period decimal separator is read.delim(). You can also do this with read.table(file, sep="\t") (the period is the default decimal separator for read.table()), although you may have to change the comment.char argument as well if your data file contains hash (#) characters.

Other Resources


Key Points

  • Use help() to get online help in R.

Content from Data Structures


Last updated on 2024-12-04 | Edit this page

Estimated time: 55 minutes

Overview

Questions

  • How can I read data in R?
  • What are the basic data types in R?
  • How do I represent categorical information in R?

Objectives

  • To be able to identify the 5 main data types.
  • To begin exploring data frames, and understand how they are related to vectors and lists.
  • To be able to ask questions from R about the type, class, and structure of an object.
  • To understand the information of the attributes “names”, “class”, and “dim”.

One of R’s most powerful features is its ability to deal with tabular data - such as you may already have in a spreadsheet or a CSV file. Let’s start by making a toy dataset in your data/ directory, called feline-data.csv:

R

cats <- data.frame(coat = c("calico", "black", "tabby"),
                    weight = c(2.1, 5.0, 3.2),
                    likes_string = c(1, 0, 1))

We can now save cats as a CSV file. It is good practice to call the argument names explicitly so the function knows what default values you are changing. Here we are setting row.names = FALSE. Recall you can use ?write.csv to pull up the help file to check out the argument names and their default values.

R

write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)

The contents of the new file, feline-data.csv:

R

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1

Tip: Editing Text files in R

Alternatively, you can create data/feline-data.csv using a text editor (Nano), or within RStudio with the File -> New File -> Text File menu item.

We can load this into R via the following:

R

cats <- read.csv(file = "data/feline-data.csv")
cats

OUTPUT

    coat weight likes_string
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1

The read.table function is used for reading in tabular data stored in a text file where the columns of data are separated by punctuation characters such as CSV files (csv = comma-separated values). Tabs and commas are the most common punctuation characters used to separate or delimit data points in csv files. For convenience R provides 2 other versions of read.table. These are: read.csv for files where the data are separated with commas and read.delim for files where the data are separated with tabs. Of these three functions read.csv is the most commonly used. If needed it is possible to override the default delimiting punctuation marks for both read.csv and read.delim.

Check your data for factors

In recent times, the default way how R handles textual data has changed. Text data was interpreted by R automatically into a format called “factors”. But there is an easier format that is called “character”. We will hear about factors later, and what to use them for. For now, remember that in most cases, they are not needed and only complicate your life, which is why newer R versions read in text as “character”. Check now if your version of R has automatically created factors and convert them to “character” format:

  1. Check the data types of your input by typing str(cats)
  2. In the output, look at the three-letter codes after the colons: If you see only “num” and “chr”, you can continue with the lesson and skip this box. If you find “fct”, continue to step 3.
  3. Prevent R from automatically creating “factor” data. That can be done by the following code: options(stringsAsFactors = FALSE). Then, re-read the cats table for the change to take effect.
  4. You must set this option every time you restart R. To not forget this, include it in your analysis script before you read in any data, for example in one of the first lines.
  5. For R versions greater than 4.0.0, text data is no longer converted to factors anymore. So you can install this or a newer version to avoid this problem. If you are working on an institute or company computer, ask your administrator to do it.

We can begin exploring our dataset right away, pulling out columns by specifying them using the $ operator:

R

cats$weight

OUTPUT

[1] 2.1 5.0 3.2

R

cats$coat

OUTPUT

[1] "calico" "black"  "tabby" 

We can do other operations on the columns:

R

## Say we discovered that the scale weighs two Kg light:
cats$weight + 2

OUTPUT

[1] 4.1 7.0 5.2

R

paste("My cat is", cats$coat)

OUTPUT

[1] "My cat is calico" "My cat is black"  "My cat is tabby" 

But what about

R

cats$weight + cats$coat

ERROR

Error in cats$weight + cats$coat: non-numeric argument to binary operator

Understanding what happened here is key to successfully analyzing data in R.

Data Types

If you guessed that the last command will return an error because 2.1 plus "black" is nonsense, you’re right - and you already have some intuition for an important concept in programming called data types. We can ask what type of data something is:

R

typeof(cats$weight)

OUTPUT

[1] "double"

There are 5 main types: double, integer, complex, logical and character. For historic reasons, double is also called numeric.

R

typeof(3.14)

OUTPUT

[1] "double"

R

typeof(1L) # The L suffix forces the number to be an integer, since by default R uses float numbers

OUTPUT

[1] "integer"

R

typeof(1+1i)

OUTPUT

[1] "complex"

R

typeof(TRUE)

OUTPUT

[1] "logical"

R

typeof('banana')

OUTPUT

[1] "character"

No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types. This strictness has some really important consequences.

A user has added details of another cat. This information is in the file data/feline-data_v2.csv.

R

file.show("data/feline-data_v2.csv")

R

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1

Load the new cats data like before, and check what type of data we find in the weight column:

R

cats <- read.csv(file="data/feline-data_v2.csv")
typeof(cats$weight)

OUTPUT

[1] "character"

Oh no, our weights aren’t the double type anymore! If we try to do the same math we did on them before, we run into trouble:

R

cats$weight + 2

ERROR

Error in cats$weight + 2: non-numeric argument to binary operator

What happened? The cats data we are working with is something called a data frame. Data frames are one of the most common and versatile types of data structures we will work with in R. A given column in a data frame cannot be composed of different data types. In this case, R does not read everything in the data frame column weight as a double, therefore the entire column data type changes to something that is suitable for everything in the column.

When R reads a csv file, it reads it in as a data frame. Thus, when we loaded the cats csv file, it is stored as a data frame. We can recognize data frames by the first row that is written by the str() function:

R

str(cats)

OUTPUT

'data.frame':	4 obs. of  3 variables:
 $ coat        : chr  "calico" "black" "tabby" "tabby"
 $ weight      : chr  "2.1" "5" "3.2" "2.3 or 2.4"
 $ likes_string: int  1 0 1 1

Data frames are composed of rows and columns, where each column has the same number of rows. Different columns in a data frame can be made up of different data types (this is what makes them so versatile), but everything in a given column needs to be the same type (e.g., vector, factor, or list).

Let’s explore more about different data structures and how they behave. For now, let’s remove that extra line from our cats data and reload it, while we investigate this behavior further:

feline-data.csv:

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1

And back in RStudio:

R

cats <- read.csv(file="data/feline-data.csv")

Vectors and Type Coercion

To better understand this behavior, let’s meet another of the data structures: the vector.

R

my_vector <- vector(length = 3)
my_vector

OUTPUT

[1] FALSE FALSE FALSE

A vector in R is essentially an ordered list of things, with the special condition that everything in the vector must be the same basic data type. If you don’t choose the datatype, it’ll default to logical; or, you can declare an empty vector of whatever type you like.

R

another_vector <- vector(mode='character', length=3)
another_vector

OUTPUT

[1] "" "" ""

You can check if something is a vector:

R

str(another_vector)

OUTPUT

 chr [1:3] "" "" ""

The somewhat cryptic output from this command indicates the basic data type found in this vector - in this case chr, character; an indication of the number of things in the vector - actually, the indexes of the vector, in this case [1:3]; and a few examples of what’s actually in the vector - in this case empty character strings. If we similarly do

R

str(cats$weight)

OUTPUT

 num [1:3] 2.1 5 3.2

we see that cats$weight is a vector, too - the columns of data we load into R data.frames are all vectors, and that’s the root of why R forces everything in a column to be the same basic data type.

Discussion 1

Why is R so opinionated about what we put in our columns of data? How does this help us?

By keeping everything in a column the same, we allow ourselves to make simple assumptions about our data; if you can interpret one entry in the column as a number, then you can interpret all of them as numbers, so we don’t have to check every time. This consistency is what people mean when they talk about clean data; in the long run, strict consistency goes a long way to making our lives easier in R.

Coercion by combining vectors

You can also make vectors with explicit contents with the combine function:

R

combine_vector <- c(2,6,3)
combine_vector

OUTPUT

[1] 2 6 3

Given what we’ve learned so far, what do you think the following will produce?

R

quiz_vector <- c(2,6,'3')

This is something called type coercion, and it is the source of many surprises and the reason why we need to be aware of the basic data types and how R will interpret them. When R encounters a mix of types (here double and character) to be combined into a single vector, it will force them all to be the same type. Consider:

R

coercion_vector <- c('a', TRUE)
coercion_vector

OUTPUT

[1] "a"    "TRUE"

R

another_coercion_vector <- c(0, TRUE)
another_coercion_vector

OUTPUT

[1] 0 1

The type hierarchy

The coercion rules go: logical -> integer -> double (“numeric”) -> complex -> character, where -> can be read as are transformed into. For example, combining logical and character transforms the result to character:

R

c('a', TRUE)

OUTPUT

[1] "a"    "TRUE"

A quick way to recognize character vectors is by the quotes that enclose them when they are printed.

You can try to force coercion against this flow using the as. functions:

R

character_vector_example <- c('0','2','4')
character_vector_example

OUTPUT

[1] "0" "2" "4"

R

character_coerced_to_double <- as.double(character_vector_example)
character_coerced_to_double

OUTPUT

[1] 0 2 4

R

double_coerced_to_logical <- as.logical(character_coerced_to_double)
double_coerced_to_logical

OUTPUT

[1] FALSE  TRUE  TRUE

As you can see, some surprising things can happen when R forces one basic data type into another! Nitty-gritty of type coercion aside, the point is: if your data doesn’t look like what you thought it was going to look like, type coercion may well be to blame; make sure everything is the same type in your vectors and your columns of data.frames, or you will get nasty surprises!

But coercion can also be very useful! For example, in our cats data likes_string is numeric, but we know that the 1s and 0s actually represent TRUE and FALSE (a common way of representing them). We should use the logical datatype here, which has two states: TRUE or FALSE, which is exactly what our data represents. We can ‘coerce’ this column to be logical by using the as.logical function:

R

cats$likes_string

OUTPUT

[1] 1 0 1

R

cats$likes_string <- as.logical(cats$likes_string)
cats$likes_string

OUTPUT

[1]  TRUE FALSE  TRUE

Challenge 1

An important part of every data analysis is cleaning the input data. If you know that the input data is all of the same format, (e.g. numbers), your analysis is much easier! Clean the cat data set from the chapter about type coercion.

Copy the code template

Create a new script in RStudio and copy and paste the following code. Then move on to the tasks below, which help you to fill in the gaps (______).

# Read data
cats <- read.csv("data/feline-data_v2.csv")

# 1. Print the data
_____

# 2. Show an overview of the table with all data types
_____(cats)

# 3. The "weight" column has the incorrect data type __________.
#    The correct data type is: ____________.

# 4. Correct the 4th weight data point with the mean of the two given values
cats$weight[4] <- 2.35
#    print the data again to see the effect
cats

# 5. Convert the weight to the right data type
cats$weight <- ______________(cats$weight)

#    Calculate the mean to test yourself
mean(cats$weight)

# If you see the correct mean value (and not NA), you did the exercise
# correctly!

Instructions for the tasks

Execute the first statement (read.csv(...)). Then print the data to the console

Show the content of any variable by typing its name.

Solution to Challenge 1.1

Two correct solutions:

cats
print(cats)

2. Overview of the data types

The data type of your data is as important as the data itself. Use a function we saw earlier to print out the data types of all columns of the cats table.

In the chapter “Data types” we saw two functions that can show data types. One printed just a single word, the data type name. The other printed a short form of the data type, and the first few values. We need the second here.

Challenge 1 (continued)

Solution to Challenge 1.2

str(cats)

3. Which data type do we need?

The shown data type is not the right one for this data (weight of a cat). Which data type do we need?

  • Why did the read.csv() function not choose the correct data type?
  • Fill in the gap in the comment with the correct data type for cat weight!

Scroll up to the section about the type hierarchy to review the available data types

  • Weight is expressed on a continuous scale (real numbers). The R data type for this is “double” (also known as “numeric”).
  • The fourth row has the value “2.3 or 2.4”. That is not a number but two, and an english word. Therefore, the “character” data type is chosen. The whole column is now text, because all values in the same columns have to be the same data type.

4. Correct the problematic value

The code to assign a new weight value to the problematic fourth row is given. Think first and then execute it: What will be the data type after assigning a number like in this example? You can check the data type after executing to see if you were right.

Revisit the hierarchy of data types when two different data types are combined.

Challenge 1 (continued)

Solution to challenge 1.4

The data type of the column “weight” is “character”. The assigned data type is “double”. Combining two data types yields the data type that is higher in the following hierarchy:

logical < integer < double < complex < character

Therefore, the column is still of type character! We need to manually convert it to “double”. {: .solution}

5. Convert the column “weight” to the correct data type

Cat weight are numbers. But the column does not have this data type yet. Coerce the column to floating point numbers.

The functions to convert data types start with as.. You can look for the function further up in the manuscript or use the RStudio auto-complete function: Type “as.” and then press the TAB key.

Challenge 1 (continued)

Solution to Challenge 1.5

There are two functions that are synonymous for historic reasons:

cats$weight <- as.double(cats$weight)
cats$weight <- as.numeric(cats$weight)

Some basic vector functions

The combine function, c(), will also append things to an existing vector:

R

ab_vector <- c('a', 'b')
ab_vector

OUTPUT

[1] "a" "b"

R

combine_example <- c(ab_vector, 'SWC')
combine_example

OUTPUT

[1] "a"   "b"   "SWC"

You can also make series of numbers:

R

mySeries <- 1:10
mySeries

OUTPUT

 [1]  1  2  3  4  5  6  7  8  9 10

R

seq(10)

OUTPUT

 [1]  1  2  3  4  5  6  7  8  9 10

R

seq(1,10, by=0.1)

OUTPUT

 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
[91] 10.0

We can ask a few questions about vectors:

R

sequence_example <- 20:25
head(sequence_example, n=2)

OUTPUT

[1] 20 21

R

tail(sequence_example, n=4)

OUTPUT

[1] 22 23 24 25

R

length(sequence_example)

OUTPUT

[1] 6

R

typeof(sequence_example)

OUTPUT

[1] "integer"

We can get individual elements of a vector by using the bracket notation:

R

first_element <- sequence_example[1]
first_element

OUTPUT

[1] 20

To change a single element, use the bracket on the other side of the arrow:

R

sequence_example[1] <- 30
sequence_example

OUTPUT

[1] 30 21 22 23 24 25

Challenge 2

Start by making a vector with the numbers 1 through 26. Then, multiply the vector by 2.

R

x <- 1:26
x <- x * 2

Lists

Another data structure you’ll want in your bag of tricks is the list. A list is simpler in some ways than the other types, because you can put anything you want in it. Remember everything in the vector must be of the same basic data type, but a list can have different data types:

R

list_example <- list(1, "a", TRUE, 1+4i)
list_example

OUTPUT

[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i

When printing the object structure with str(), we see the data types of all elements:

R

str(list_example)

OUTPUT

List of 4
 $ : num 1
 $ : chr "a"
 $ : logi TRUE
 $ : cplx 1+4i

What is the use of lists? They can organize data of different types. For example, you can organize different tables that belong together, similar to spreadsheets in Excel. But there are many other uses, too.

We will see another example that will maybe surprise you in the next chapter.

To retrieve one of the elements of a list, use the double bracket:

R

list_example[[2]]

OUTPUT

[1] "a"

The elements of lists also can have names, they can be given by prepending them to the values, separated by an equals sign:

R

another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
another_list

OUTPUT

$title
[1] "Numbers"

$numbers
 [1]  1  2  3  4  5  6  7  8  9 10

$data
[1] TRUE

This results in a named list. Now we have a new function of our object! We can access single elements by an additional way!

R

another_list$title

OUTPUT

[1] "Numbers"

Names


With names, we can give meaning to elements. It is the first time that we do not only have the data, but also explaining information. It is metadata that can be stuck to the object like a label. In R, this is called an attribute. Some attributes enable us to do more with our object, for example, like here, accessing an element by a self-defined name.

Accessing vectors and lists by name

We have already seen how to generate a named list. The way to generate a named vector is very similar. You have seen this function before:

R

pizza_price <- c( pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50 )

The way to retrieve elements is different, though:

R

pizza_price["pizzasubito"]

OUTPUT

pizzasubito
       5.64 

The approach used for the list does not work:

R

pizza_price$pizzafresh

ERROR

Error in pizza_price$pizzafresh: $ operator is invalid for atomic vectors

It will pay off if you remember this error message, you will meet it in your own analyses. It means that you have just tried accessing an element like it was in a list, but it is actually in a vector.

Accessing and changing names

If you are only interested in the names, use the names() function:

R

names(pizza_price)

OUTPUT

[1] "pizzasubito" "pizzafresh"  "callapizza" 

We have seen how to access and change single elements of a vector. The same is possible for names:

R

names(pizza_price)[3]

OUTPUT

[1] "callapizza"

R

names(pizza_price)[3] <- "call-a-pizza"
pizza_price

OUTPUT

 pizzasubito   pizzafresh call-a-pizza
        5.64         6.60         4.50 

Challenge 3

  • What is the data type of the names of pizza_price? You can find out using the str() or typeof() functions.

You get the names of an object by wrapping the object name inside names(...). Similarly, you get the data type of the names by again wrapping the whole code in typeof(...):

typeof(names(pizza))

alternatively, use a new variable if this is easier for you to read:

n <- names(pizza)
typeof(n)

Challenge 4

Instead of just changing some of the names a vector/list already has, you can also set all names of an object by writing code like (replace ALL CAPS text):

names( OBJECT ) <-  CHARACTER_VECTOR

Create a vector that gives the number for each letter in the alphabet!

  1. Generate a vector called letter_no with the sequence of numbers from 1 to 26!
  2. R has a built-in object called LETTERS. It is a 26-character vector, from A to Z. Set the names of the number sequence to this 26 letters
  3. Test yourself by calling letter_no["B"], which should give you the number 2!
letter_no <- 1:26   # or seq(1,26)
names(letter_no) <- LETTERS
letter_no["B"]

Data frames


We have data frames at the very beginning of this lesson, they represent a table of data. We didn’t go much further into detail with our example cat data frame:

R

cats

OUTPUT

    coat weight likes_string
1 calico    2.1         TRUE
2  black    5.0        FALSE
3  tabby    3.2         TRUE

We can now understand something a bit surprising in our data.frame; what happens if we run:

R

typeof(cats)

OUTPUT

[1] "list"

We see that data.frames look like lists ‘under the hood’. Think again what we heard about what lists can be used for:

Lists organize data of different types

Columns of a data frame are vectors of different types, that are organized by belonging to the same table.

A data.frame is really a list of vectors. It is a special list in which all the vectors must have the same length.

How is this “special”-ness written into the object, so that R does not treat it like any other list, but as a table?

R

class(cats)

OUTPUT

[1] "data.frame"

A class, just like names, is an attribute attached to the object. It tells us what this object means for humans.

You might wonder: Why do we need another what-type-of-object-is-this-function? We already have typeof()? That function tells us how the object is constructed in the computer. The class is the meaning of the object for humans. Consequently, what typeof() returns is fixed in R (mainly the five data types), whereas the output of class() is diverse and extendable by R packages.

In our cats example, we have an integer, a double and a logical variable. As we have seen already, each column of data.frame is a vector.

R

cats$coat

OUTPUT

[1] "calico" "black"  "tabby" 

R

cats[,1]

OUTPUT

[1] "calico" "black"  "tabby" 

R

typeof(cats[,1])

OUTPUT

[1] "character"

R

str(cats[,1])

OUTPUT

 chr [1:3] "calico" "black" "tabby"

Each row is an observation of different variables, itself a data.frame, and thus can be composed of elements of different types.

R

cats[1,]

OUTPUT

    coat weight likes_string
1 calico    2.1         TRUE

R

typeof(cats[1,])

OUTPUT

[1] "list"

R

str(cats[1,])

OUTPUT

'data.frame':	1 obs. of  3 variables:
 $ coat        : chr "calico"
 $ weight      : num 2.1
 $ likes_string: logi TRUE

Challenge 5

There are several subtly different ways to call variables, observations and elements from data.frames:

  • cats[1]
  • cats[[1]]
  • cats$coat
  • cats["coat"]
  • cats[1, 1]
  • cats[, 1]
  • cats[1, ]

Try out these examples and explain what is returned by each one.

Hint: Use the function typeof() to examine what is returned in each case.

R

cats[1]

OUTPUT

    coat
1 calico
2  black
3  tabby

We can think of a data frame as a list of vectors. The single brace [1] returns the first slice of the list, as another list. In this case it is the first column of the data frame.

R

cats[[1]]

OUTPUT

[1] "calico" "black"  "tabby" 

The double brace [[1]] returns the contents of the list item. In this case it is the contents of the first column, a vector of type character.

R

cats$coat

OUTPUT

[1] "calico" "black"  "tabby" 

This example uses the $ character to address items by name. coat is the first column of the data frame, again a vector of type character.

R

cats["coat"]

OUTPUT

    coat
1 calico
2  black
3  tabby

Here we are using a single brace ["coat"] replacing the index number with the column name. Like example 1, the returned object is a list.

R

cats[1, 1]

OUTPUT

[1] "calico"

This example uses a single brace, but this time we provide row and column coordinates. The returned object is the value in row 1, column 1. The object is a vector of type character.

R

cats[, 1]

OUTPUT

[1] "calico" "black"  "tabby" 

Like the previous example we use single braces and provide row and column coordinates. The row coordinate is not specified, R interprets this missing value as all the elements in this column and returns them as a vector.

R

cats[1, ]

OUTPUT

    coat weight likes_string
1 calico    2.1         TRUE

Again we use the single brace with row and column coordinates. The column coordinate is not specified. The return value is a list containing all the values in the first row.

Tip: Renaming data frame columns

Data frames have column names, which can be accessed with the names() function.

R

names(cats)

OUTPUT

[1] "coat"         "weight"       "likes_string"

If you want to rename the second column of cats, you can assign a new name to the second element of names(cats).

R

names(cats)[2] <- "weight_kg"
cats

OUTPUT

    coat weight_kg likes_string
1 calico       2.1         TRUE
2  black       5.0        FALSE
3  tabby       3.2         TRUE

Matrices

Last but not least is the matrix. We can declare a matrix full of zeros:

R

matrix_example <- matrix(0, ncol=6, nrow=3)
matrix_example

OUTPUT

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0    0    0    0    0
[2,]    0    0    0    0    0    0
[3,]    0    0    0    0    0    0

What makes it special is the dim() attribute:

R

dim(matrix_example)

OUTPUT

[1] 3 6

And similar to other data structures, we can ask things about our matrix:

R

typeof(matrix_example)

OUTPUT

[1] "double"

R

class(matrix_example)

OUTPUT

[1] "matrix" "array" 

R

str(matrix_example)

OUTPUT

 num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...

R

nrow(matrix_example)

OUTPUT

[1] 3

R

ncol(matrix_example)

OUTPUT

[1] 6

Challenge 6

What do you think will be the result of length(matrix_example)? Try it. Were you right? Why / why not?

What do you think will be the result of length(matrix_example)?

R

matrix_example <- matrix(0, ncol=6, nrow=3)
length(matrix_example)

OUTPUT

[1] 18

Because a matrix is a vector with added dimension attributes, length gives you the total number of elements in the matrix.

Challenge 7

Make another matrix, this time containing the numbers 1:50, with 5 columns and 10 rows. Did the matrix function fill your matrix by column, or by row, as its default behaviour? See if you can figure out how to change this. (hint: read the documentation for matrix!)

Make another matrix, this time containing the numbers 1:50, with 5 columns and 10 rows. Did the matrix function fill your matrix by column, or by row, as its default behaviour? See if you can figure out how to change this. (hint: read the documentation for matrix!)

R

x <- matrix(1:50, ncol=5, nrow=10)
x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row

Challenge 8

Create a list of length two containing a character vector for each of the sections in this part of the workshop:

  • Data types
  • Data structures

Populate each character vector with the names of the data types and data structures we’ve seen so far.

R

dataTypes <- c('double', 'complex', 'integer', 'character', 'logical')
dataStructures <- c('data.frame', 'vector', 'list', 'matrix')
answer <- list(dataTypes, dataStructures)

Note: it’s nice to make a list in big writing on the board or taped to the wall listing all of these types and structures - leave it up for the rest of the workshop to remind people of the importance of these basics.

Challenge 9

Consider the R output of the matrix below:

OUTPUT

     [,1] [,2]
[1,]    4    1
[2,]    9    5
[3,]   10    7

What was the correct command used to write this matrix? Examine each command and try to figure out the correct one before typing them. Think about what matrices the other commands will produce.

  1. matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)
  2. matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)
  3. matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)
  4. matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)

Consider the R output of the matrix below:

OUTPUT

     [,1] [,2]
[1,]    4    1
[2,]    9    5
[3,]   10    7

What was the correct command used to write this matrix? Examine each command and try to figure out the correct one before typing them. Think about what matrices the other commands will produce.

R

matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)

Key Points

  • Use read.csv to read tabular data in R.
  • The basic data types in R are double, integer, complex, logical, and character.
  • Data structures such as data frames or matrices are built on top of lists and vectors, with some added attributes.

Content from Exploring Data Frames


Last updated on 2024-12-04 | Edit this page

Estimated time: 30 minutes

Overview

Questions

  • How can I manipulate a data frame?

Objectives

  • Add and remove rows or columns.
  • Append two data frames.
  • Display basic properties of data frames including size and class of the columns, names, and first few rows.

At this point, you’ve seen it all: in the last lesson, we toured all the basic data types and data structures in R. Everything you do will be a manipulation of those tools. But most of the time, the star of the show is the data frame—the table that we created by loading information from a csv file. In this lesson, we’ll learn a few more things about working with data frames.

Adding columns and rows in data frames


We already learned that the columns of a data frame are vectors, so that our data are consistent in type throughout the columns. As such, if we want to add a new column, we can start by making a new vector:

R

age <- c(2, 3, 5)
cats

OUTPUT

    coat weight likes_string
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1

We can then add this as a column via:

R

cbind(cats, age)

OUTPUT

    coat weight likes_string age
1 calico    2.1            1   2
2  black    5.0            0   3
3  tabby    3.2            1   5

Note that if we tried to add a vector of ages with a different number of entries than the number of rows in the data frame, it would fail:

R

age <- c(2, 3, 5, 12)
cbind(cats, age)

ERROR

Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 4

R

age <- c(2, 3)
cbind(cats, age)

ERROR

Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2

Why didn’t this work? Of course, R wants to see one element in our new column for every row in the table:

R

nrow(cats)

OUTPUT

[1] 3

R

length(age)

OUTPUT

[1] 2

So for it to work we need to have nrow(cats) = length(age). Let’s overwrite the content of cats with our new data frame.

R

age <- c(2, 3, 5)
cats <- cbind(cats, age)

Now how about adding rows? We already know that the rows of a data frame are lists:

R

newRow <- list("tortoiseshell", 3.3, TRUE, 9)
cats <- rbind(cats, newRow)

Let’s confirm that our new row was added correctly.

R

cats

OUTPUT

           coat weight likes_string age
1        calico    2.1            1   2
2         black    5.0            0   3
3         tabby    3.2            1   5
4 tortoiseshell    3.3            1   9

Removing rows


We now know how to add rows and columns to our data frame in R. Now let’s learn to remove rows.

R

cats

OUTPUT

           coat weight likes_string age
1        calico    2.1            1   2
2         black    5.0            0   3
3         tabby    3.2            1   5
4 tortoiseshell    3.3            1   9

We can ask for a data frame minus the last row:

R

cats[-4, ]

OUTPUT

    coat weight likes_string age
1 calico    2.1            1   2
2  black    5.0            0   3
3  tabby    3.2            1   5

Notice the comma with nothing after it to indicate that we want to drop the entire fourth row.

Note: we could also remove several rows at once by putting the row numbers inside of a vector, for example: cats[c(-3,-4), ]

Removing columns


We can also remove columns in our data frame. What if we want to remove the column “age”. We can remove it in two ways, by variable number or by index.

R

cats[,-4]

OUTPUT

           coat weight likes_string
1        calico    2.1            1
2         black    5.0            0
3         tabby    3.2            1
4 tortoiseshell    3.3            1

Notice the comma with nothing before it, indicating we want to keep all of the rows.

Alternatively, we can drop the column by using the index name and the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of cats, and asks, “Does this element occur in the second argument?”

R

drop <- names(cats) %in% c("age")
cats[,!drop]

OUTPUT

           coat weight likes_string
1        calico    2.1            1
2         black    5.0            0
3         tabby    3.2            1
4 tortoiseshell    3.3            1

We will cover subsetting with logical operators like %in% in more detail in the next episode. See the section Subsetting through other logical operations

Appending to a data frame


The key to remember when adding data to a data frame is that columns are vectors and rows are lists. We can also glue two data frames together with rbind:

R

cats <- rbind(cats, cats)
cats

OUTPUT

           coat weight likes_string age
1        calico    2.1            1   2
2         black    5.0            0   3
3         tabby    3.2            1   5
4 tortoiseshell    3.3            1   9
5        calico    2.1            1   2
6         black    5.0            0   3
7         tabby    3.2            1   5
8 tortoiseshell    3.3            1   9

Challenge 1

You can create a new data frame right from within R with the following syntax:

R

df <- data.frame(id = c("a", "b", "c"),
                 x = 1:3,
                 y = c(TRUE, TRUE, FALSE))

Make a data frame that holds the following information for yourself:

  • first name
  • last name
  • lucky number

Then use rbind to add an entry for the people sitting beside you. Finally, use cbind to add a column with each person’s answer to the question, “Is it time for coffee break?”

R

df <- data.frame(first = c("Grace"),
                 last = c("Hopper"),
                 lucky_number = c(0))
df <- rbind(df, list("Marie", "Curie", 238) )
df <- cbind(df, coffeetime = c(TRUE,TRUE))

Realistic example


So far, you have seen the basics of manipulating data frames with our cat data; now let’s use those skills to digest a more realistic dataset. Let’s read in the gapminder dataset that we downloaded previously:

R

gapminder <- read.csv("data/gapminder_data.csv")

Miscellaneous Tips

  • Another type of file you might encounter are tab-separated value files (.tsv). To specify a tab as a separator, use "\\t" or read.delim().

  • Files can also be downloaded directly from the Internet into a local folder of your choice onto your computer using the download.file function. The read.csv function can then be executed to read the downloaded file from the download location, for example,

R

download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
gapminder <- read.csv("data/gapminder_data.csv")
  • Alternatively, you can also read in files directly into R from the Internet by replacing the file paths with a web address in read.csv. One should note that in doing this no local copy of the csv file is first saved onto your computer. For example,

R

gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv")
  • You can read directly from excel spreadsheets without converting them to plain text first by using the readxl package.

  • The argument “stringsAsFactors” can be useful to tell R how to read strings either as factors or as character strings. In R versions after 4.0, all strings are read-in as characters by default, but in earlier versions of R, strings are read-in as factors by default. For more information, see the call-out in the previous episode.

Let’s investigate gapminder a bit; the first thing we should always do is check out what the data looks like with str:

R

str(gapminder)

OUTPUT

'data.frame':	1704 obs. of  6 variables:
 $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
 $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
 $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num  779 821 853 836 740 ...

An additional method for examining the structure of gapminder is to use the summary function. This function can be used on various objects in R. For data frames, summary yields a numeric, tabular, or descriptive summary of each column. Numeric or integer columns are described by the descriptive statistics (quartiles and mean), and character columns by its length, class, and mode.

R

summary(gapminder)

OUTPUT

   country               year           pop             continent
 Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704
 Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character
 Mode  :character   Median :1980   Median :7.024e+06   Mode  :character
                    Mean   :1980   Mean   :2.960e+07
                    3rd Qu.:1993   3rd Qu.:1.959e+07
                    Max.   :2007   Max.   :1.319e+09
    lifeExp        gdpPercap
 Min.   :23.60   Min.   :   241.2
 1st Qu.:48.20   1st Qu.:  1202.1
 Median :60.71   Median :  3531.8
 Mean   :59.47   Mean   :  7215.3
 3rd Qu.:70.85   3rd Qu.:  9325.5
 Max.   :82.60   Max.   :113523.1  

Along with the str and summary functions, we can examine individual columns of the data frame with our typeof function:

R

typeof(gapminder$year)

OUTPUT

[1] "integer"

R

typeof(gapminder$country)

OUTPUT

[1] "character"

R

str(gapminder$country)

OUTPUT

 chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...

We can also interrogate the data frame for information about its dimensions; remembering that str(gapminder) said there were 1704 observations of 6 variables in gapminder, what do you think the following will produce, and why?

R

length(gapminder)

OUTPUT

[1] 6

A fair guess would have been to say that the length of a data frame would be the number of rows it has (1704), but this is not the case; remember, a data frame is a list of vectors and factors:

R

typeof(gapminder)

OUTPUT

[1] "list"

When length gave us 6, it’s because gapminder is built out of a list of 6 columns. To get the number of rows and columns in our dataset, try:

R

nrow(gapminder)

OUTPUT

[1] 1704

R

ncol(gapminder)

OUTPUT

[1] 6

Or, both at once:

R

dim(gapminder)

OUTPUT

[1] 1704    6

We’ll also likely want to know what the titles of all the columns are, so we can ask for them later:

R

colnames(gapminder)

OUTPUT

[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

At this stage, it’s important to ask ourselves if the structure R is reporting matches our intuition or expectations; do the basic data types reported for each column make sense? If not, we need to sort any problems out now before they turn into bad surprises down the road, using what we’ve learned about how R interprets data, and the importance of strict consistency in how we record our data.

Once we’re happy that the data types and structures seem reasonable, it’s time to start digging into our data proper. Check out the first few lines:

R

head(gapminder)

OUTPUT

      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
6 Afghanistan 1977 14880372      Asia  38.438  786.1134

Challenge 2

It’s good practice to also check the last few lines of your data and some in the middle. How would you do this?

Searching for ones specifically in the middle isn’t too hard, but we could ask for a few lines at random. How would you code this?

To check the last few lines it’s relatively simple as R already has a function for this:

R

tail(gapminder)
tail(gapminder, n = 15)

What about a few arbitrary rows just in case something is odd in the middle?

Tip: There are several ways to achieve this.

The solution here presents one form of using nested functions, i.e. a function passed as an argument to another function. This might sound like a new concept, but you are already using it! Remember my_dataframe[rows, cols] will print to screen your data frame with the number of rows and columns you asked for (although you might have asked for a range or named columns for example). How would you get the last row if you don’t know how many rows your data frame has? R has a function for this. What about getting a (pseudorandom) sample? R also has a function for this.

R

gapminder[sample(nrow(gapminder), 5), ]

To make sure our analysis is reproducible, we should put the code into a script file so we can come back to it later.

Challenge 3

Go to file -> new file -> R script, and write an R script to load in the gapminder dataset. Put it in the scripts/ directory and add it to version control.

Run the script using the source function, using the file path as its argument (or by pressing the “source” button in RStudio).

The source function can be used to use a script within a script. Assume you would like to load the same type of file over and over again and therefore you need to specify the arguments to fit the needs of your file. Instead of writing the necessary argument again and again you could just write it once and save it as a script. Then, you can use source("Your_Script_containing_the_load_function") in a new script to use the function of that script without writing everything again. Check out ?source to find out more.

R

download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
gapminder <- read.csv(file = "data/gapminder_data.csv")

To run the script and load the data into the gapminder variable:

R

source(file = "scripts/load-gapminder.R")

Challenge 4

Read the output of str(gapminder) again; this time, use what you’ve learned about lists and vectors, as well as the output of functions like colnames and dim to explain what everything that str prints out for gapminder means. If there are any parts you can’t interpret, discuss with your neighbors!

The object gapminder is a data frame with columns

  • country and continent are character strings.
  • year is an integer vector.
  • pop, lifeExp, and gdpPercap are numeric vectors.

Key Points

  • Use cbind() to add a new column to a data frame.
  • Use rbind() to add a new row to a data frame.
  • Remove rows from a data frame.
  • Use str(), summary(), nrow(), ncol(), dim(), colnames(), head(), and typeof() to understand the structure of a data frame.
  • Read in a csv file using read.csv().
  • Understand what length() of a data frame represents.

Content from Subsetting Data


Last updated on 2024-12-04 | Edit this page

Estimated time: 50 minutes

Overview

Questions

  • How can I work with subsets of data in R?

Objectives

  • To be able to subset vectors, factors, matrices, lists, and data frames
  • To be able to extract individual and multiple elements: by index, by name, using comparison operations
  • To be able to skip and remove elements from various data structures.

R has many powerful subset operators. Mastering them will allow you to easily perform complex operations on any kind of dataset.

There are six different ways we can subset any kind of object, and three different subsetting operators for the different data structures.

Let’s start with the workhorse of R: a simple numeric vector.

R

x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
x

OUTPUT

  a   b   c   d   e
5.4 6.2 7.1 4.8 7.5 

Atomic vectors

In R, simple vectors containing character strings, numbers, or logical values are called atomic vectors because they can’t be further simplified.

So now that we’ve created a dummy vector to play with, how do we get at its contents?

Accessing elements using their indices


To extract elements of a vector we can give their corresponding index, starting from one:

R

x[1]

OUTPUT

  a
5.4 

R

x[4]

OUTPUT

  d
4.8 

It may look different, but the square brackets operator is a function. For vectors (and matrices), it means “get me the nth element”.

We can ask for multiple elements at once:

R

x[c(1, 3)]

OUTPUT

  a   c
5.4 7.1 

Or slices of the vector:

R

x[1:4]

OUTPUT

  a   b   c   d
5.4 6.2 7.1 4.8 

the : operator creates a sequence of numbers from the left element to the right.

R

1:4

OUTPUT

[1] 1 2 3 4

R

c(1, 2, 3, 4)

OUTPUT

[1] 1 2 3 4

We can ask for the same element multiple times:

R

x[c(1,1,3)]

OUTPUT

  a   a   c
5.4 5.4 7.1 

If we ask for an index beyond the length of the vector, R will return a missing value:

R

x[6]

OUTPUT

<NA>
  NA 

This is a vector of length one containing an NA, whose name is also NA.

If we ask for the 0th element, we get an empty vector:

R

x[0]

OUTPUT

named numeric(0)

Vector numbering in R starts at 1

In many programming languages (C and Python, for example), the first element of a vector has an index of 0. In R, the first element is 1.

Skipping and removing elements


If we use a negative number as the index of a vector, R will return every element except for the one specified:

R

x[-2]

OUTPUT

  a   c   d   e
5.4 7.1 4.8 7.5 

We can skip multiple elements:

R

x[c(-1, -5)]  # or x[-c(1,5)]

OUTPUT

  b   c   d
6.2 7.1 4.8 

Tip: Order of operations

A common trip up for novices occurs when trying to skip slices of a vector. It’s natural to try to negate a sequence like so:

R

x[-1:3]

This gives a somewhat cryptic error:

ERROR

Error in x[-1:3]: only 0's may be mixed with negative subscripts

But remember the order of operations. : is really a function. It takes its first argument as -1, and its second as 3, so generates the sequence of numbers: c(-1, 0, 1, 2, 3).

The correct solution is to wrap that function call in brackets, so that the - operator applies to the result:

R

x[-(1:3)]

OUTPUT

  d   e
4.8 7.5 

To remove elements from a vector, we need to assign the result back into the variable:

R

x <- x[-4]
x

OUTPUT

  a   b   c   e
5.4 6.2 7.1 7.5 

Challenge 1

Given the following code:

R

x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
print(x)

OUTPUT

  a   b   c   d   e
5.4 6.2 7.1 4.8 7.5 

Come up with at least 2 different commands that will produce the following output:

OUTPUT

  b   c   d
6.2 7.1 4.8 

After you find 2 different commands, compare notes with your neighbour. Did you have different strategies?

R

x[2:4]

OUTPUT

  b   c   d
6.2 7.1 4.8 

R

x[-c(1,5)]

OUTPUT

  b   c   d
6.2 7.1 4.8 

R

x[c(2,3,4)]

OUTPUT

  b   c   d
6.2 7.1 4.8 

Subsetting by name


We can extract elements by using their name, instead of extracting by index:

R

x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly'
x[c("a", "c")]

OUTPUT

  a   c
5.4 7.1 

This is usually a much more reliable way to subset objects: the position of various elements can often change when chaining together subsetting operations, but the names will always remain the same!

Subsetting through other logical operations


We can also use any logical vector to subset:

R

x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]

OUTPUT

  c   e
7.1 7.5 

Since comparison operators (e.g. >, <, ==) evaluate to logical vectors, we can also use them to succinctly subset vectors: the following statement gives the same result as the previous one.

R

x[x > 7]

OUTPUT

  c   e
7.1 7.5 

Breaking it down, this statement first evaluates x>7, generating a logical vector c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the elements of x corresponding to the TRUE values.

We can use == to mimic the previous method of indexing by name (remember you have to use == rather than = for comparisons):

R

x[names(x) == "a"]

OUTPUT

  a
5.4 

Tip: Combining logical conditions

We often want to combine multiple logical criteria. For example, we might want to find all the countries that are located in Asia or Europe and have life expectancies within a certain range. Several operations for combining logical vectors exist in R:

  • &, the “logical AND” operator: returns TRUE if both the left and right are TRUE.
  • |, the “logical OR” operator: returns TRUE, if either the left or right (or both) are TRUE.

You may sometimes see && and || instead of & and |. These two-character operators only look at the first element of each vector and ignore the remaining elements. In general you should not use the two-character operators in data analysis; save them for programming, i.e. deciding whether to execute a statement.

  • !, the “logical NOT” operator: converts TRUE to FALSE and FALSE to TRUE. It can negate a single logical condition (eg !TRUE becomes FALSE), or a whole vector of conditions(eg !c(TRUE, FALSE) becomes c(FALSE, TRUE)).

Additionally, you can compare the elements within a single vector using the all function (which returns TRUE if every element of the vector is TRUE) and the any function (which returns TRUE if one or more elements of the vector are TRUE).

Challenge 2

Given the following code:

R

x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
print(x)

OUTPUT

  a   b   c   d   e
5.4 6.2 7.1 4.8 7.5 

Write a subsetting command to return the values in x that are greater than 4 and less than 7.

R

x_subset <- x[x<7 & x>4]
print(x_subset)

OUTPUT

  a   b   d
5.4 6.2 4.8 

Tip: Non-unique names

You should be aware that it is possible for multiple elements in a vector to have the same name. (For a data frame, columns can have the same name — although R tries to avoid this — but row names must be unique.) Consider these examples:

R

x <- 1:3
x

OUTPUT

[1] 1 2 3

R

names(x) <- c('a', 'a', 'a')
x

OUTPUT

a a a
1 2 3 

R

x['a']  # only returns first value

OUTPUT

a
1 

R

x[names(x) == 'a']  # returns all three values

OUTPUT

a a a
1 2 3 

Tip: Getting help for operators

Remember you can search for help on operators by wrapping them in quotes: help("%in%") or ?"%in%".

Skipping named elements


Skipping or removing named elements is a little harder. If we try to skip one named element by negating the string, R complains (slightly obscurely) that it doesn’t know how to take the negative of a string:

R

x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly'
x[-"a"]

ERROR

Error in -"a": invalid argument to unary operator

However, we can use the != (not-equals) operator to construct a logical vector that will do what we want:

R

x[names(x) != "a"]

OUTPUT

  b   c   d   e
6.2 7.1 4.8 7.5 

Skipping multiple named indices is a little bit harder still. Suppose we want to drop the "a" and "c" elements, so we try this:

R

x[names(x)!=c("a","c")]

WARNING

Warning in names(x) != c("a", "c"): longer object length is not a multiple of
shorter object length

OUTPUT

  b   c   d   e
6.2 7.1 4.8 7.5 

R did something, but it gave us a warning that we ought to pay attention to - and it apparently gave us the wrong answer (the "c" element is still included in the vector)!

So what does != actually do in this case? That’s an excellent question.

Recycling

Let’s take a look at the comparison component of this code:

R

names(x) != c("a", "c")

WARNING

Warning in names(x) != c("a", "c"): longer object length is not a multiple of
shorter object length

OUTPUT

[1] FALSE  TRUE  TRUE  TRUE  TRUE

Why does R give TRUE as the third element of this vector, when names(x)[3] != "c" is obviously false? When you use !=, R tries to compare each element of the left argument with the corresponding element of its right argument. What happens when you compare vectors of different lengths?

Inequality testing

When one vector is shorter than the other, it gets recycled:

Inequality testing: results of recycling

In this case R repeats c("a", "c") as many times as necessary to match names(x), i.e. we get c("a","c","a","c","a"). Since the recycled "a" doesn’t match the third element of names(x), the value of != is TRUE. Because in this case the longer vector length (5) isn’t a multiple of the shorter vector length (2), R printed a warning message. If we had been unlucky and names(x) had contained six elements, R would silently have done the wrong thing (i.e., not what we intended it to do). This recycling rule can can introduce hard-to-find and subtle bugs!

The way to get R to do what we really want (match each element of the left argument with all of the elements of the right argument) it to use the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of x, and asks, “Does this element occur in the second argument?”. Here, since we want to exclude values, we also need a ! operator to change “in” to “not in”:

R

x[! names(x) %in% c("a","c") ]

OUTPUT

  b   d   e
6.2 4.8 7.5 

Challenge 3

Selecting elements of a vector that match any of a list of components is a very common data analysis task. For example, the gapminder data set contains country and continent variables, but no information between these two scales. Suppose we want to pull out information from southeast Asia: how do we set up an operation to produce a logical vector that is TRUE for all of the countries in southeast Asia and FALSE otherwise?

Suppose you have these data:

R

seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
## read in the gapminder data that we downloaded in episode 2
gapminder <- read.csv("data/gapminder_data.csv", header=TRUE)
## extract the `country` column from a data frame (we'll see this later);
## convert from a factor to a character;
## and get just the non-repeated elements
countries <- unique(as.character(gapminder$country))

There’s a wrong way (using only ==), which will give you a warning; a clunky way (using the logical operators == and |); and an elegant way (using %in%). See whether you can come up with all three and explain how they (don’t) work.

  • The wrong way to do this problem is countries==seAsia. This gives a warning ("In countries == seAsia : longer object length is not a multiple of shorter object length") and the wrong answer (a vector of all FALSE values), because none of the recycled values of seAsia happen to line up correctly with matching values in country.
  • The clunky (but technically correct) way to do this problem is

R

 (countries=="Myanmar" | countries=="Thailand" |
 countries=="Cambodia" | countries == "Vietnam" | countries=="Laos")

(or countries==seAsia[1] | countries==seAsia[2] | ...). This gives the correct values, but hopefully you can see how awkward it is (what if we wanted to select countries from a much longer list?).

  • The best way to do this problem is countries %in% seAsia, which is both correct and easy to type (and read).

Handling special values


At some point you will encounter functions in R that cannot handle missing, infinite, or undefined data.

There are a number of special functions you can use to filter out this data:

  • is.na will return all positions in a vector, matrix, or data.frame containing NA (or NaN)
  • likewise, is.nan, and is.infinite will do the same for NaN and Inf.
  • is.finite will return all positions in a vector, matrix, or data.frame that do not contain NA, NaN or Inf.
  • na.omit will filter out all missing values from a vector

Factor subsetting


Now that we’ve explored the different ways to subset vectors, how do we subset the other data structures?

Factor subsetting works the same way as vector subsetting.

R

f <- factor(c("a", "a", "b", "c", "c", "d"))
f[f == "a"]

OUTPUT

[1] a a
Levels: a b c d

R

f[f %in% c("b", "c")]

OUTPUT

[1] b c c
Levels: a b c d

R

f[1:3]

OUTPUT

[1] a a b
Levels: a b c d

Skipping elements will not remove the level even if no more of that category exists in the factor:

R

f[-3]

OUTPUT

[1] a a c c d
Levels: a b c d

Matrix subsetting


Matrices are also subsetted using the [ function. In this case it takes two arguments: the first applying to the rows, the second to its columns:

R

set.seed(1)
m <- matrix(rnorm(6*4), ncol=4, nrow=6)
m[3:4, c(3,1)]

OUTPUT

            [,1]       [,2]
[1,]  1.12493092 -0.8356286
[2,] -0.04493361  1.5952808

You can leave the first or second arguments blank to retrieve all the rows or columns respectively:

R

m[, c(3,4)]

OUTPUT

            [,1]        [,2]
[1,] -0.62124058  0.82122120
[2,] -2.21469989  0.59390132
[3,]  1.12493092  0.91897737
[4,] -0.04493361  0.78213630
[5,] -0.01619026  0.07456498
[6,]  0.94383621 -1.98935170

If we only access one row or column, R will automatically convert the result to a vector:

R

m[3,]

OUTPUT

[1] -0.8356286  0.5757814  1.1249309  0.9189774

If you want to keep the output as a matrix, you need to specify a third argument; drop = FALSE:

R

m[3, , drop=FALSE]

OUTPUT

           [,1]      [,2]     [,3]      [,4]
[1,] -0.8356286 0.5757814 1.124931 0.9189774

Unlike vectors, if we try to access a row or column outside of the matrix, R will throw an error:

R

m[, c(3,6)]

ERROR

Error in m[, c(3, 6)]: subscript out of bounds

Tip: Higher dimensional arrays

when dealing with multi-dimensional arrays, each argument to [ corresponds to a dimension. For example, a 3D array, the first three arguments correspond to the rows, columns, and depth dimension.

Because matrices are vectors, we can also subset using only one argument:

R

m[5]

OUTPUT

[1] 0.3295078

This usually isn’t useful, and often confusing to read. However it is useful to note that matrices are laid out in column-major format by default. That is the elements of the vector are arranged column-wise:

R

matrix(1:6, nrow=2, ncol=3)

OUTPUT

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

If you wish to populate the matrix by row, use byrow=TRUE:

R

matrix(1:6, nrow=2, ncol=3, byrow=TRUE)

OUTPUT

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

Matrices can also be subsetted using their rownames and column names instead of their row and column indices.

Challenge 4

Given the following code:

R

m <- matrix(1:18, nrow=3, ncol=6)
print(m)

OUTPUT

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    4    7   10   13   16
[2,]    2    5    8   11   14   17
[3,]    3    6    9   12   15   18
  1. Which of the following commands will extract the values 11 and 14?

A. m[2,4,2,5]

B. m[2:5]

C. m[4:5,2]

D. m[2,c(4,5)]

D

List subsetting


Now we’ll introduce some new subsetting operators. There are three functions used to subset lists. We’ve already seen these when learning about atomic vectors and matrices: [, [[, and $.

Using [ will always return a list. If you want to subset a list, but not extract an element, then you will likely use [.

R

xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))
xlist[1]

OUTPUT

$a
[1] "Software Carpentry"

This returns a list with one element.

We can subset elements of a list exactly the same way as atomic vectors using [. Comparison operations however won’t work as they’re not recursive, they will try to condition on the data structures in each element of the list, not the individual elements within those data structures.

R

xlist[1:2]

OUTPUT

$a
[1] "Software Carpentry"

$b
 [1]  1  2  3  4  5  6  7  8  9 10

To extract individual elements of a list, you need to use the double-square bracket function: [[.

R

xlist[[1]]

OUTPUT

[1] "Software Carpentry"

Notice that now the result is a vector, not a list.

You can’t extract more than one element at once:

R

xlist[[1:2]]

ERROR

Error in xlist[[1:2]]: subscript out of bounds

Nor use it to skip elements:

R

xlist[[-1]]

ERROR

Error in xlist[[-1]]: invalid negative subscript in get1index <real>

But you can use names to both subset and extract elements:

R

xlist[["a"]]

OUTPUT

[1] "Software Carpentry"

The $ function is a shorthand way for extracting elements by name:

R

xlist$data

OUTPUT

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Challenge 5

Given the following list:

R

xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars))

Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. Hint: the number 2 is contained within the “b” item in the list.

R

xlist$b[2]

OUTPUT

[1] 2

R

xlist[[2]][2]

OUTPUT

[1] 2

R

xlist[["b"]][2]

OUTPUT

[1] 2

Challenge 6

Given a linear model:

R

mod <- aov(pop ~ lifeExp, data=gapminder)

Extract the residual degrees of freedom (hint: attributes() will help you)

R

attributes(mod) ## `df.residual` is one of the names of `mod`

R

mod$df.residual

Data frames


Remember the data frames are lists underneath the hood, so similar rules apply. However they are also two dimensional objects:

[ with one argument will act the same way as for lists, where each list element corresponds to a column. The resulting object will be a data frame:

R

head(gapminder[3])

OUTPUT

       pop
1  8425333
2  9240934
3 10267083
4 11537966
5 13079460
6 14880372

Similarly, [[ will act to extract a single column:

R

head(gapminder[["lifeExp"]])

OUTPUT

[1] 28.801 30.332 31.997 34.020 36.088 38.438

And $ provides a convenient shorthand to extract columns by name:

R

head(gapminder$year)

OUTPUT

[1] 1952 1957 1962 1967 1972 1977

With two arguments, [ behaves the same way as for matrices:

R

gapminder[1:3,]

OUTPUT

      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007

If we subset a single row, the result will be a data frame (because the elements are mixed types):

R

gapminder[3,]

OUTPUT

      country year      pop continent lifeExp gdpPercap
3 Afghanistan 1962 10267083      Asia  31.997  853.1007

But for a single column the result will be a vector (this can be changed with the third argument, drop = FALSE).

Challenge 7

Fix each of the following common data frame subsetting errors:

  1. Extract observations collected for the year 1957

R

gapminder[gapminder$year = 1957,]
  1. Extract all columns except 1 through to 4

R

gapminder[,-1:4]
  1. Extract the rows where the life expectancy is longer the 80 years

R

gapminder[gapminder$lifeExp > 80]
  1. Extract the first row, and the fourth and fifth columns (continent and lifeExp).

R

gapminder[1, 4, 5]
  1. Advanced: extract rows that contain information for the years 2002 and 2007

R

gapminder[gapminder$year == 2002 | 2007,]

Fix each of the following common data frame subsetting errors:

  1. Extract observations collected for the year 1957

R

# gapminder[gapminder$year = 1957,]
gapminder[gapminder$year == 1957,]
  1. Extract all columns except 1 through to 4

R

# gapminder[,-1:4]
gapminder[,-c(1:4)]
  1. Extract the rows where the life expectancy is longer than 80 years

R

# gapminder[gapminder$lifeExp > 80]
gapminder[gapminder$lifeExp > 80,]
  1. Extract the first row, and the fourth and fifth columns (continent and lifeExp).

R

# gapminder[1, 4, 5]
gapminder[1, c(4, 5)]
  1. Advanced: extract rows that contain information for the years 2002 and 2007

R

# gapminder[gapminder$year == 2002 | 2007,]
gapminder[gapminder$year == 2002 | gapminder$year == 2007,]
gapminder[gapminder$year %in% c(2002, 2007),]

Challenge 8

  1. Why does gapminder[1:20] return an error? How does it differ from gapminder[1:20, ]?

  2. Create a new data.frame called gapminder_small that only contains rows 1 through 9 and 19 through 23. You can do this in one or two steps.

  1. gapminder is a data.frame so needs to be subsetted on two dimensions. gapminder[1:20, ] subsets the data to give the first 20 rows and all columns.

R

gapminder_small <- gapminder[c(1:9, 19:23),]

Key Points

  • Indexing in R starts at 1, not 0.
  • Access individual values by location using [].
  • Access slices of data using [low:high].
  • Access arbitrary sets of data using [c(...)].
  • Use logical operations and logical vectors to access subsets of data.

Content from Creating Publication-Quality Graphics with ggplot2


Last updated on 2024-12-04 | Edit this page

Estimated time: 80 minutes

Overview

Questions

  • How can I create publication-quality graphics in R?

Objectives

  • To be able to use ggplot2 to generate publication-quality graphics.
  • To apply geometry, aesthetic, and statistics layers to a ggplot plot.
  • To manipulate the aesthetics of a plot using different colors, shapes, and lines.
  • To improve data visualization through transforming scales and paneling by group.
  • To save a plot created with ggplot to disk.

Plotting our data is one of the best ways to quickly explore it and the various relationships between variables.

There are three main plotting systems in R, the base plotting system, the lattice package, and the ggplot2 package.

Today we’ll be learning about the ggplot2 package, because it is the most effective for creating publication-quality graphics.

ggplot2 is built on the grammar of graphics, the idea that any plot can be built from the same set of components: a data set, mapping aesthetics, and graphical layers:

  • Data sets are the data that you, the user, provide.

  • Mapping aesthetics are what connect the data to the graphics. They tell ggplot2 how to use your data to affect how the graph looks, such as changing what is plotted on the X or Y axis, or the size or color of different data points.

  • Layers are the actual graphical output from ggplot2. Layers determine what kinds of plot are shown (scatterplot, histogram, etc.), the coordinate system used (rectangular, polar, others), and other important aspects of the plot. The idea of layers of graphics may be familiar to you if you have used image editing programs like Photoshop, Illustrator, or Inkscape.

Let’s start off building an example using the gapminder data from earlier. The most basic function is ggplot, which lets R know that we’re creating a new plot. Any of the arguments we give the ggplot function are the global options for the plot: they apply to all layers on the plot.

R

library("ggplot2")
ggplot(data = gapminder)
Blank plot, before adding any mapping aesthetics to ggplot().

Here we called ggplot and told it what data we want to show on our figure. This is not enough information for ggplot to actually draw anything. It only creates a blank slate for other elements to be added to.

Now we’re going to add in the mapping aesthetics using the aes function. aes tells ggplot how variables in the data map to aesthetic properties of the figure, such as which columns of the data should be used for the x and y locations.

R

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible.

Here we told ggplot we want to plot the “gdpPercap” column of the gapminder data frame on the x-axis, and the “lifeExp” column on the y-axis. Notice that we didn’t need to explicitly pass aes these columns (e.g. x = gapminder[, "gdpPercap"]), this is because ggplot is smart enough to know to look in the data for that column!

The final part of making our plot is to tell ggplot how we want to visually represent the data. We do this by adding a new layer to the plot using one of the geom functions.

R

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
Scatter plot of life expectancy vs GDP per capita, now showing the data points.

Here we used geom_point, which tells ggplot we want to visually represent the relationship between x and y as a scatterplot of points.

Challenge 1

Modify the example so that the figure shows how life expectancy has changed over time:

R

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()

Hint: the gapminder dataset has a column called “year”, which should appear on the x-axis.

Here is one possible solution:

R

ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point()
Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time
Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time

Challenge 2

In the previous examples and challenge we’ve used the aes function to tell the scatterplot geom about the x and y locations of each point. Another aesthetic property we can modify is the point color. Modify the code from the previous challenge to color the points by the “continent” column. What trends do you see in the data? Are they what you expected?

The solution presented below adds color=continent to the call of the aes function. The general trend seems to indicate an increased life expectancy over the years. On continents with stronger economies we find a longer life expectancy.

R

ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color=continent)) +
  geom_point()
Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function
Binned scatterplot of life expectancy vs year with color-coded continents showing value of ‘aes’ function

Layers


Using a scatterplot probably isn’t the best for visualizing change over time. Instead, let’s tell ggplot to visualize the data as a line plot:

R

ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, color=continent)) +
  geom_line()

Instead of adding a geom_point layer, we’ve added a geom_line layer.

However, the result doesn’t look quite as we might have expected: it seems to be jumping around a lot in each continent. Let’s try to separate the data by country, plotting one line for each country:

R

ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
  geom_line()

We’ve added the group aesthetic, which tells ggplot to draw a line for each country.

But what if we want to visualize both lines and points on the plot? We can add another layer to the plot:

R

ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
  geom_line() + geom_point()

It’s important to note that each layer is drawn on top of the previous layer. In this example, the points have been drawn on top of the lines. Here’s a demonstration:

R

ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
  geom_line(mapping = aes(color=continent)) + geom_point()

In this example, the aesthetic mapping of color has been moved from the global plot options in ggplot to the geom_line layer so it no longer applies to the points. Now we can clearly see that the points are drawn on top of the lines.

Tip: Setting an aesthetic to a value instead of a mapping

So far, we’ve seen how to use an aesthetic (such as color) as a mapping to a variable in the data. For example, when we use geom_line(mapping = aes(color=continent)), ggplot will give a different color to each continent. But what if we want to change the color of all lines to blue? You may think that geom_line(mapping = aes(color="blue")) should work, but it doesn’t. Since we don’t want to create a mapping to a specific variable, we can move the color specification outside of the aes() function, like this: geom_line(color="blue").

Challenge 3

Switch the order of the point and line layers from the previous example. What happened?

The lines now get drawn over the points!

R

ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
 geom_point() + geom_line(mapping = aes(color=continent))
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.

Transformations and statistics


ggplot2 also makes it easy to overlay statistical models over the data. To demonstrate we’ll go back to our first example:

R

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

Currently it’s hard to see the relationship between the points due to some strong outliers in GDP per capita. We can change the scale of units on the x axis using the scale functions. These control the mapping between the data values and visual values of an aesthetic. We can also modify the transparency of the points, using the alpha function, which is especially helpful when you have a large amount of data which is very clustered.

R

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) + scale_x_log10()
Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread
Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread

The scale_x_log10 function applied a transformation to the coordinate system of the plot, so that each multiple of 10 is evenly spaced from left to right. For example, a GDP per capita of 1,000 is the same horizontal distance away from a value of 10,000 as the 10,000 value is from 100,000. This helps to visualize the spread of the data along the x-axis.

Tip Reminder: Setting an aesthetic to a value instead of a mapping

Notice that we used geom_point(alpha = 0.5). As the previous tip mentioned, using a setting outside of the aes() function will cause this value to be used for all points, which is what we want in this case. But just like any other aesthetic setting, alpha can also be mapped to a variable in the data. For example, we can give a different transparency to each continent with geom_point(mapping = aes(alpha = continent)).

We can fit a simple relationship to the data by adding another layer, geom_smooth:

R

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm")

OUTPUT

`geom_smooth()` using formula = 'y ~ x'
Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line.

We can make the line thicker by setting the linewidth aesthetic in the geom_smooth layer:

R

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm", linewidth=1.5)

OUTPUT

`geom_smooth()` using formula = 'y ~ x'
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The blue trend line is slightly thicker than in the previous figure.

There are two ways an aesthetic can be specified. Here we set the linewidth aesthetic by passing it as an argument to geom_smooth and it is applied the same to the whole geom. Previously in the lesson we’ve used the aes function to define a mapping between data variables and their visual representation.

Challenge 4a

Modify the color and size of the points on the point layer in the previous example.

Hint: do not use the aes function.

Hint: the equivalent of linewidth for points is size.

Here a possible solution: Notice that the color argument is supplied outside of the aes() function. This means that it applies to all data points on the graph and is not related to a specific variable.

R

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
 geom_point(size=3, color="orange") + scale_x_log10() +
 geom_smooth(method="lm", linewidth=1.5)

OUTPUT

`geom_smooth()` using formula = 'y ~ x'
Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency.

Challenge 4b

Modify your solution to Challenge 4a so that the points are now a different shape and are colored by continent with new trendlines. Hint: The color argument can be used inside the aesthetic.

Here is a possible solution: Notice that supplying the color argument inside the aes() functions enables you to connect it to a certain variable. The shape argument, as you can see, modifies all data points the same way (it is outside the aes() call) while the color argument which is placed inside the aes() call modifies a point’s color based on its continent value.

R

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
 geom_point(size=3, shape=17) + scale_x_log10() +
 geom_smooth(method="lm", linewidth=1.5)

OUTPUT

`geom_smooth()` using formula = 'y ~ x'

Multi-panel figures


Earlier we visualized the change in life expectancy over time across all countries in one plot. Alternatively, we can split this out over multiple panels by adding a layer of facet panels.

Tip

We start by making a subset of data including only countries located in the Americas. This includes 25 countries, which will begin to clutter the figure. Note that we apply a “theme” definition to rotate the x-axis labels to maintain readability. Nearly everything in ggplot2 is customizable.

R

americas <- gapminder[gapminder$continent == "Americas",]
ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap( ~ country) +
  theme(axis.text.x = element_text(angle = 45))

The facet_wrap layer took a “formula” as its argument, denoted by the tilde (~). This tells R to draw a panel for each unique value in the country column of the gapminder dataset.

Modifying text


To clean this figure up for a publication we need to change some of the text elements. The x-axis is too cluttered, and the y axis should read “Life expectancy”, rather than the column name in the data frame.

We can do this by adding a couple of different layers. The theme layer controls the axis text, and overall text size. Labels for the axes, plot title and any legend can be set using the labs function. Legend titles are set using the same names we used in the aes specification. Thus below the color legend title is set using color = "Continent", while the title of a fill legend would be set using fill = "MyTitle".

R

ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
  geom_line() + facet_wrap( ~ country) +
  labs(
    x = "Year",              # x axis title
    y = "Life expectancy",   # y axis title
    title = "Figure 1",      # main title of figure
    color = "Continent"      # title of legend
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Exporting the plot


The ggsave() function allows you to export a plot created with ggplot. You can specify the dimension and resolution of your plot by adjusting the appropriate arguments (width, height and dpi) to create high quality graphics for publication. In order to save the plot from above, we first assign it to a variable lifeExp_plot, then tell ggsave to save that plot in png format to a directory called results. (Make sure you have a results/ folder in your working directory.)

R

lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
  geom_line() + facet_wrap( ~ country) +
  labs(
    x = "Year",              # x axis title
    y = "Life expectancy",   # y axis title
    title = "Figure 1",      # main title of figure
    color = "Continent"      # title of legend
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")

There are two nice things about ggsave. First, it defaults to the last plot, so if you omit the plot argument it will automatically save the last plot you created with ggplot. Secondly, it tries to determine the format you want to save your plot in from the file extension you provide for the filename (for example .png or .pdf). If you need to, you can specify the format explicitly in the device argument.

This is a taste of what you can do with ggplot2. RStudio provides a really useful cheat sheet of the different layers available, and more extensive documentation is available on the ggplot2 website. All RStudio cheat sheets are available from the RStudio website. Finally, if you have no idea how to change something, a quick Google search will usually send you to a relevant question and answer on Stack Overflow with reusable code to modify!

Challenge 5

Generate boxplots to compare life expectancy between the different continents during the available years.

Advanced:

  • Rename y axis as Life Expectancy.
  • Remove x axis labels.

Here a possible solution: xlab() and ylab() set labels for the x and y axes, respectively The axis title, text and ticks are attributes of the theme and must be modified within a theme() call.

R

ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
 geom_boxplot() + facet_wrap(~year) +
 ylab("Life Expectancy") +
 theme(axis.title.x=element_blank(),
       axis.text.x = element_blank(),
       axis.ticks.x = element_blank())

Key Points

  • Use ggplot2 to create plots.
  • Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.

Content from Writing Data


Last updated on 2024-12-04 | Edit this page

Estimated time: 20 minutes

Overview

Questions

  • How can I save plots and data created in R?

Objectives

  • To be able to write out plots and data from R.

Saving plots


You have already seen how to save the most recent plot you create in ggplot2, using the command ggsave. As a refresher:

R

ggsave("My_most_recent_plot.pdf")

You can save a plot from within RStudio using the ‘Export’ button in the ‘Plot’ window. This will give you the option of saving as a .pdf or as .png, .jpg or other image formats.

Sometimes you will want to save plots without creating them in the ‘Plot’ window first. Perhaps you want to make a pdf document with multiple pages: each one a different plot, for example. Or perhaps you’re looping through multiple subsets of a file, plotting data from each subset, and you want to save each plot, but obviously can’t stop the loop to click ‘Export’ for each one.

In this case you can use a more flexible approach. The function pdf creates a new pdf device. You can control the size and resolution using the arguments to this function.

R

pdf("Life_Exp_vs_time.pdf", width=12, height=4)
ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) +
  geom_line() +
  theme(legend.position = "none")

# You then have to make sure to turn off the pdf device!

dev.off()

Open up this document and have a look.

Challenge 1

Rewrite your ‘pdf’ command to print a second page in the pdf, showing a facet plot (hint: use facet_grid) of the same data with one panel per continent.

R

pdf("Life_Exp_vs_time.pdf", width = 12, height = 4)
p <- ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) +
  geom_line() +
  theme(legend.position = "none")
p
p + facet_grid(~continent)
dev.off()

The commands jpeg, png etc. are used similarly to produce documents in different formats.

Writing data


At some point, you’ll also want to write out data from R.

We can use the write.table function for this, which is very similar to read.table from before.

Let’s create a data-cleaning script, for this analysis, we only want to focus on the gapminder data for Australia:

R

aust_subset <- gapminder[gapminder$country == "Australia",]

write.table(aust_subset,
  file="cleaned-data/gapminder-aus.csv",
  sep=","
)

Let’s switch back to the shell to take a look at the data to make sure it looks OK:

BASH

head cleaned-data/gapminder-aus.csv

OUTPUT

"country","year","pop","continent","lifeExp","gdpPercap"
"61","Australia",1952,8691212,"Oceania",69.12,10039.59564
"62","Australia",1957,9712569,"Oceania",70.33,10949.64959
"63","Australia",1962,10794968,"Oceania",70.93,12217.22686
"64","Australia",1967,11872264,"Oceania",71.1,14526.12465
"65","Australia",1972,13177000,"Oceania",71.93,16788.62948
"66","Australia",1977,14074100,"Oceania",73.49,18334.19751
"67","Australia",1982,15184200,"Oceania",74.74,19477.00928
"68","Australia",1987,16257249,"Oceania",76.32,21888.88903
"69","Australia",1992,17481977,"Oceania",77.56,23424.76683

Hmm, that’s not quite what we wanted. Where did all these quotation marks come from? Also the row numbers are meaningless.

Let’s look at the help file to work out how to change this behaviour.

R

?write.table

By default R will wrap character vectors with quotation marks when writing out to file. It will also write out the row and column names.

Let’s fix this:

R

write.table(
  gapminder[gapminder$country == "Australia",],
  file="cleaned-data/gapminder-aus.csv",
  sep=",", quote=FALSE, row.names=FALSE
)

Now lets look at the data again using our shell skills:

BASH

head cleaned-data/gapminder-aus.csv

OUTPUT

country,year,pop,continent,lifeExp,gdpPercap
Australia,1952,8691212,Oceania,69.12,10039.59564
Australia,1957,9712569,Oceania,70.33,10949.64959
Australia,1962,10794968,Oceania,70.93,12217.22686
Australia,1967,11872264,Oceania,71.1,14526.12465
Australia,1972,13177000,Oceania,71.93,16788.62948
Australia,1977,14074100,Oceania,73.49,18334.19751
Australia,1982,15184200,Oceania,74.74,19477.00928
Australia,1987,16257249,Oceania,76.32,21888.88903
Australia,1992,17481977,Oceania,77.56,23424.76683

That looks better!

Challenge 2

Write a data-cleaning script file that subsets the gapminder data to include only data points collected since 1990.

Use this script to write out the new subset to a file in the cleaned-data/ directory.

R

write.table(
  gapminder[gapminder$year > 1990, ],
  file = "cleaned-data/gapminder-after1990.csv",
  sep = ",", quote = FALSE, row.names = FALSE
)

Key Points

  • Save plots from RStudio using the ‘Export’ button.
  • Use write.table to save tabular data.

Content from Data Frame Manipulation with dplyr


Last updated on 2024-12-04 | Edit this page

Estimated time: 55 minutes

Overview

Questions

  • How can I manipulate data frames without repeating myself?

Objectives

  • To be able to use the six main data frame manipulation ‘verbs’ with pipes in dplyr.
  • To understand how group_by() and summarize() can be combined to summarize datasets.
  • Be able to analyze a subset of data using logical filtering.

Manipulation of data frames means many things to many researchers: we often select certain observations (rows) or variables (columns), we often group the data by a certain variable(s), or we even calculate summary statistics. We can do these operations using the normal base R operations:

R

mean(gapminder$gdpPercap[gapminder$continent == "Africa"])

OUTPUT

[1] 2193.755

R

mean(gapminder$gdpPercap[gapminder$continent == "Americas"])

OUTPUT

[1] 7136.11

R

mean(gapminder$gdpPercap[gapminder$continent == "Asia"])

OUTPUT

[1] 7902.15

But this isn’t very nice because there is a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.

The dplyr package


Luckily, the dplyr package provides a number of very useful functions for manipulating data frames in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the dplyr grammar easier to read.

Tip: Tidyverse

dplyr package belongs to a broader family of opinionated R packages designed for data science called the “Tidyverse”. These packages are specifically designed to work harmoniously together. Some of these packages will be covered along this course, but you can find more complete information here: https://www.tidyverse.org/.

Here we’re going to cover 5 of the most commonly used functions as well as using pipes (%>%) to combine them.

  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()

If you have have not installed this package earlier, please do so:

R

install.packages('dplyr')

Now let’s load the package:

R

library("dplyr")

Using select()


If, for example, we wanted to move forward with only a few of the variables in our data frame we could use the select() function. This will keep only the variables you select.

R

year_country_gdp <- select(gapminder, year, country, gdpPercap)

Diagram illustrating use of select function to select two columns of a data frame If we want to remove one column only from the gapminder data, for example, removing the continent column.

R

smaller_gapminder_data <- select(gapminder, -continent)

If we open up year_country_gdp we’ll see that it only contains the year, country and gdpPercap. Above we used ‘normal’ grammar, but the strengths of dplyr lie in combining several functions using pipes. Since the pipes grammar is unlike anything we’ve seen in R before, let’s repeat what we’ve done above using pipes.

R

year_country_gdp <- gapminder %>% select(year, country, gdpPercap)

To help you understand why we wrote that in that way, let’s walk through it step by step. First we summon the gapminder data frame and pass it on, using the pipe symbol %>%, to the next step, which is the select() function. In this case we don’t specify which data object we use in the select() function since in gets that from the previous pipe. Fun Fact: There is a good chance you have encountered pipes before in the shell. In R, a pipe symbol is %>% while in the shell it is | but the concept is the same!

Tip: Renaming data frame columns in dplyr

In Chapter 4 we covered how you can rename columns with base R by assigning a value to the output of the names() function. Just like select, this is a bit cumbersome, but thankfully dplyr has a rename() function.

Within a pipeline, the syntax is rename(new_name = old_name). For example, we may want to rename the gdpPercap column name from our select() statement above.

R

tidy_gdp <- year_country_gdp %>% rename(gdp_per_capita = gdpPercap)

head(tidy_gdp)

OUTPUT

  year     country gdp_per_capita
1 1952 Afghanistan       779.4453
2 1957 Afghanistan       820.8530
3 1962 Afghanistan       853.1007
4 1967 Afghanistan       836.1971
5 1972 Afghanistan       739.9811
6 1977 Afghanistan       786.1134

Using filter()


If we now want to move forward with the above, but only with European countries, we can combine select and filter

R

year_country_gdp_euro <- gapminder %>%
    filter(continent == "Europe") %>%
    select(year, country, gdpPercap)

If we now want to show life expectancy of European countries but only for a specific year (e.g., 2007), we can do as below.

R

europe_lifeExp_2007 <- gapminder %>%
  filter(continent == "Europe", year == 2007) %>%
  select(country, lifeExp)

Challenge 1

Write a single command (which can span multiple lines and includes pipes) that will produce a data frame that has the African values for lifeExp, country and year, but not for other Continents. How many rows does your data frame have and why?

R

year_country_lifeExp_Africa <- gapminder %>%
                           filter(continent == "Africa") %>%
                           select(year, country, lifeExp)

As with last time, first we pass the gapminder data frame to the filter() function, then we pass the filtered version of the gapminder data frame to the select() function. Note: The order of operations is very important in this case. If we used ‘select’ first, filter would not be able to find the variable continent since we would have removed it in the previous step.

Using group_by()


Now, we were supposed to be reducing the error prone repetitiveness of what can be done with base R, but up to now we haven’t done that since we would have to repeat the above for each continent. Instead of filter(), which will only pass observations that meet your criteria (in the above: continent=="Europe"), we can use group_by(), which will essentially use every unique criteria that you could have used in filter.

R

str(gapminder)

OUTPUT

'data.frame':	1704 obs. of  6 variables:
 $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
 $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
 $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num  779 821 853 836 740 ...

R

str(gapminder %>% group_by(continent))

OUTPUT

gropd_df [1,704 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
 $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
 - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ continent: chr [1:5] "Africa" "Americas" "Asia" "Europe" ...
  ..$ .rows    : list<int> [1:5]
  .. ..$ : int [1:624] 25 26 27 28 29 30 31 32 33 34 ...
  .. ..$ : int [1:300] 49 50 51 52 53 54 55 56 57 58 ...
  .. ..$ : int [1:396] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ : int [1:360] 13 14 15 16 17 18 19 20 21 22 ...
  .. ..$ : int [1:24] 61 62 63 64 65 66 67 68 69 70 ...
  .. ..@ ptype: int(0)
  ..- attr(*, ".drop")= logi TRUE

You will notice that the structure of the data frame where we used group_by() (grouped_df) is not the same as the original gapminder (data.frame). A grouped_df can be thought of as a list where each item in the listis a data.frame which contains only the rows that correspond to the a particular value continent (at least in the example above).

Diagram illustrating how the group by function oraganizes a data frame into groups

Using summarize()


The above was a bit on the uneventful side but group_by() is much more exciting in conjunction with summarize(). This will allow us to create new variable(s) by using functions that repeat for each of the continent-specific data frames. That is to say, using the group_by() function, we split our original data frame into multiple pieces, then we can run functions (e.g. mean() or sd()) within summarize().

R

gdp_bycontinents <- gapminder %>%
    group_by(continent) %>%
    summarize(mean_gdpPercap = mean(gdpPercap))
Diagram illustrating the use of group by and summarize together to create a new variable

R

continent mean_gdpPercap
     <fctr>          <dbl>
1    Africa       2193.755
2  Americas       7136.110
3      Asia       7902.150
4    Europe      14469.476
5   Oceania      18621.609

That allowed us to calculate the mean gdpPercap for each continent, but it gets even better.

Challenge 2

Calculate the average life expectancy per country. Which has the longest average life expectancy and which has the shortest average life expectancy?

R

lifeExp_bycountry <- gapminder %>%
   group_by(country) %>%
   summarize(mean_lifeExp = mean(lifeExp))
lifeExp_bycountry %>%
   filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))

OUTPUT

# A tibble: 2 × 2
  country      mean_lifeExp
  <chr>               <dbl>
1 Iceland              76.5
2 Sierra Leone         36.8

Another way to do this is to use the dplyr function arrange(), which arranges the rows in a data frame according to the order of one or more variables from the data frame. It has similar syntax to other functions from the dplyr package. You can use desc() inside arrange() to sort in descending order.

R

lifeExp_bycountry %>%
   arrange(mean_lifeExp) %>%
   head(1)

OUTPUT

# A tibble: 1 × 2
  country      mean_lifeExp
  <chr>               <dbl>
1 Sierra Leone         36.8

R

lifeExp_bycountry %>%
   arrange(desc(mean_lifeExp)) %>%
   head(1)

OUTPUT

# A tibble: 1 × 2
  country mean_lifeExp
  <chr>          <dbl>
1 Iceland         76.5

Alphabetical order works too

R

lifeExp_bycountry %>%
   arrange(desc(country)) %>%
   head(1)

OUTPUT

# A tibble: 1 × 2
  country  mean_lifeExp
  <chr>           <dbl>
1 Zimbabwe         52.7

The function group_by() allows us to group by multiple variables. Let’s group by year and continent.

R

gdp_bycontinents_byyear <- gapminder %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap))

OUTPUT

`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.

That is already quite powerful, but it gets even better! You’re not limited to defining 1 new variable in summarize().

R

gdp_pop_bycontinents_byyear <- gapminder %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop))

OUTPUT

`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.

count() and n()


A very common operation is to count the number of observations for each group. The dplyr package comes with two related functions that help with this.

For instance, if we wanted to check the number of countries included in the dataset for the year 2002, we can use the count() function. It takes the name of one or more columns that contain the groups we are interested in, and we can optionally sort the results in descending order by adding sort=TRUE:

R

gapminder %>%
    filter(year == 2002) %>%
    count(continent, sort = TRUE)

OUTPUT

  continent  n
1    Africa 52
2      Asia 33
3    Europe 30
4  Americas 25
5   Oceania  2

If we need to use the number of observations in calculations, the n() function is useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column. For instance, if we wanted to get the standard error of the life expectency per continent:

R

gapminder %>%
    group_by(continent) %>%
    summarize(se_le = sd(lifeExp)/sqrt(n()))

OUTPUT

# A tibble: 5 × 2
  continent se_le
  <chr>     <dbl>
1 Africa    0.366
2 Americas  0.540
3 Asia      0.596
4 Europe    0.286
5 Oceania   0.775

You can also chain together several summary operations; in this case calculating the minimum, maximum, mean and se of each continent’s per-country life-expectancy:

R

gapminder %>%
    group_by(continent) %>%
    summarize(
      mean_le = mean(lifeExp),
      min_le = min(lifeExp),
      max_le = max(lifeExp),
      se_le = sd(lifeExp)/sqrt(n()))

OUTPUT

# A tibble: 5 × 5
  continent mean_le min_le max_le se_le
  <chr>       <dbl>  <dbl>  <dbl> <dbl>
1 Africa       48.9   23.6   76.4 0.366
2 Americas     64.7   37.6   80.7 0.540
3 Asia         60.1   28.8   82.6 0.596
4 Europe       71.9   43.6   81.8 0.286
5 Oceania      74.3   69.1   81.2 0.775

Using mutate()


We can also create new variables prior to (or even after) summarizing information using mutate().

R

gdp_pop_bycontinents_byyear <- gapminder %>%
    mutate(gdp_billion = gdpPercap*pop/10^9) %>%
    group_by(continent,year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop),
              mean_gdp_billion = mean(gdp_billion),
              sd_gdp_billion = sd(gdp_billion))

OUTPUT

`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.

Connect mutate with logical filtering: ifelse


When creating new variables, we can hook this with a logical condition. A simple combination of mutate() and ifelse() facilitates filtering right where it is needed: in the moment of creating something new. This easy-to-read statement is a fast and powerful way of discarding certain data (even though the overall dimension of the data frame will not change) or for updating values depending on this given condition.

R

## keeping all data but "filtering" after a certain condition
# calculate GDP only for people with a life expectation above 25
gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
    mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop),
              mean_gdp_billion = mean(gdp_billion),
              sd_gdp_billion = sd(gdp_billion))

OUTPUT

`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.

R

## updating only if certain condition is fullfilled
# for life expectations above 40 years, the gpd to be expected in the future is scaled
gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>%
    mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap),
              mean_gdpPercap_expected = mean(gdp_futureExpectation))

OUTPUT

`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.

Combining dplyr and ggplot2


First install and load ggplot2:

R

install.packages('ggplot2')

R

library("ggplot2")

In the plotting lesson we looked at how to make a multi-panel figure by adding a layer of facet panels using ggplot2. Here is the code we used (with some extra comments):

R

# Filter countries located in the Americas
americas <- gapminder[gapminder$continent == "Americas", ]
# Make the plot
ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap( ~ country) +
  theme(axis.text.x = element_text(angle = 45))

This code makes the right plot but it also creates an intermediate variable (americas) that we might not have any other uses for. Just as we used %>% to pipe data along a chain of dplyr functions we can use it to pass data to ggplot(). Because %>% replaces the first argument in a function we don’t need to specify the data = argument in the ggplot() function. By combining dplyr and ggplot2 functions we can make the same figure without creating any new variables or modifying the data.

R

gapminder %>%
  # Filter countries located in the Americas
  filter(continent == "Americas") %>%
  # Make the plot
  ggplot(mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap( ~ country) +
  theme(axis.text.x = element_text(angle = 45))

More examples of using the function mutate() and the ggplot2 package.

R

gapminder %>%
  # extract first letter of country name into new column
  mutate(startsWith = substr(country, 1, 1)) %>%
  # only keep countries starting with A or Z
  filter(startsWith %in% c("A", "Z")) %>%
  # plot lifeExp into facets
  ggplot(aes(x = year, y = lifeExp, colour = continent)) +
  geom_line() +
  facet_wrap(vars(country)) +
  theme_minimal()

Advanced Challenge

Calculate the average life expectancy in 2002 of 2 randomly selected countries for each continent. Then arrange the continent names in reverse order. Hint: Use the dplyr functions arrange() and sample_n(), they have similar syntax to other dplyr functions.

R

lifeExp_2countries_bycontinents <- gapminder %>%
   filter(year==2002) %>%
   group_by(continent) %>%
   sample_n(2) %>%
   summarize(mean_lifeExp=mean(lifeExp)) %>%
   arrange(desc(mean_lifeExp))

Other great resources


Key Points

  • Use the dplyr package to manipulate data frames.
  • Use select() to choose variables from a data frame.
  • Use filter() to choose data based on values.
  • Use group_by() and summarize() to work with subsets of data.
  • Use mutate() to create new variables.

Content from Data Frame Manipulation with tidyr


Last updated on 2024-12-04 | Edit this page

Estimated time: 45 minutes

Overview

Questions

  • How can I change the layout of a data frame?

Objectives

  • To understand the concepts of ‘longer’ and ‘wider’ data frame formats and be able to convert between them with tidyr.

Researchers often want to reshape their data frames from ‘wide’ to ‘longer’ layouts, or vice-versa. The ‘long’ layout or format is where:

  • each column is a variable
  • each row is an observation

In the purely ‘long’ (or ‘longest’) format, you usually have 1 column for the observed variable and the other columns are ID variables.

For the ‘wide’ format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observation of multiple variables (or a mix of both). You may find data input may be simpler or some other applications may prefer the ‘wide’ format. However, many of R‘s functions have been designed assuming you have ’longer’ formatted data. This tutorial will help you efficiently transform your data shape regardless of original format.

Diagram illustrating the difference between a wide versus long layout of a data frame

Long and wide data frame layouts mainly affect readability. For humans, the wide format is often more intuitive since we can often see more of the data on the screen due to its shape. However, the long format is more machine readable and is closer to the formatting of databases. The ID variables in our data frames are similar to the fields in a database and observed variables are like the database values.

Getting started


First install the packages if you haven’t already done so (you probably installed dplyr in the previous lesson):

R

#install.packages("tidyr")
#install.packages("dplyr")

Load the packages

R

library("tidyr")
library("dplyr")

First, lets look at the structure of our original gapminder data frame:

R

str(gapminder)

OUTPUT

'data.frame':	1704 obs. of  6 variables:
 $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
 $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
 $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num  779 821 853 836 740 ...

Challenge 1

Is gapminder a purely long, purely wide, or some intermediate format?

The original gapminder data.frame is in an intermediate format. It is not purely long since it had multiple observation variables (pop,lifeExp,gdpPercap).

Sometimes, as with the gapminder dataset, we have multiple types of observed data. It is somewhere in between the purely ‘long’ and ‘wide’ data formats. We have 3 “ID variables” (continent, country, year) and 3 “Observation variables” (pop,lifeExp,gdpPercap). This intermediate format can be preferred despite not having ALL observations in 1 column given that all 3 observation variables have different units. There are few operations that would need us to make this data frame any longer (i.e. 4 ID variables and 1 Observation variable).

While using many of the functions in R, which are often vector based, you usually do not want to do mathematical operations on values with different units. For example, using the purely long format, a single mean for all of the values of population, life expectancy, and GDP would not be meaningful since it would return the mean of values with 3 incompatible units. The solution is that we first manipulate the data either by grouping (see the lesson on dplyr), or we change the structure of the data frame. Note: Some plotting functions in R actually work better in the wide format data.

From wide to long format with pivot_longer()


Until now, we’ve been using the nicely formatted original gapminder dataset, but ‘real’ data (i.e. our own research data) will never be so well organized. Here let’s start with the wide formatted version of the gapminder dataset.

Download the wide version of the gapminder data from this link to a csv file and save it in your data folder.

We’ll load the data file and look at it. Note: we don’t want our continent and country columns to be factors, so we use the stringsAsFactors argument for read.csv() to disable that.

R

gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE)
str(gap_wide)

OUTPUT

'data.frame':	142 obs. of  38 variables:
 $ continent     : chr  "Africa" "Africa" "Africa" "Africa" ...
 $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
 $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
 $ gdpPercap_1957: num  3014 3828 960 918 617 ...
 $ gdpPercap_1962: num  2551 4269 949 984 723 ...
 $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
 $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
 $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
 $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
 $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
 $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
 $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
 $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
 $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
 $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
 $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
 $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
 $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
 $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
 $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
 $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
 $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
 $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
 $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
 $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
 $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
 $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
 $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
 $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
 $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
 $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
 $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
 $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
 $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
 $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
 $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
 $ pop_2002      : int  31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
 $ pop_2007      : int  33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
Diagram illustrating the wide format of the gapminder data frame

To change this very wide data frame layout back to our nice, intermediate (or longer) layout, we will use one of the two available pivot functions from the tidyr package. To convert from wide to a longer format, we will use the pivot_longer() function. pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns, or ‘lengthening’ your observation variables into a single variable.

Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format

R

gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
    names_to = "obstype_year", values_to = "obs_values"
  )
str(gap_long)

OUTPUT

tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
 $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
 $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
 $ obstype_year: chr [1:5112] "pop_1952" "pop_1957" "pop_1962" "pop_1967" ...
 $ obs_values  : num [1:5112] 9279525 10270856 11000948 12760499 14760787 ...

Here we have used piping syntax which is similar to what we were doing in the previous lesson with dplyr. In fact, these are compatible and you can use a mix of tidyr and dplyr functions by piping them together.

We first provide to pivot_longer() a vector of column names that will be pivoted into longer format. We could type out all the observation variables, but as in the select() function (see dplyr lesson), we can use the starts_with() argument to select all variables that start with the desired character string. pivot_longer() also allows the alternative syntax of using the - symbol to identify which variables are not to be pivoted (i.e. ID variables).

The next arguments to pivot_longer() are names_to for naming the column that will contain the new ID variable (obstype_year) and values_to for naming the new amalgamated observation variable (obs_value). We supply these new column names as strings.

Diagram illustrating the long format of the gapminder data

R

gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(-continent, -country),
    names_to = "obstype_year", values_to = "obs_values"
  )
str(gap_long)

OUTPUT

tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
 $ continent   : chr [1:5112] "Africa" "Africa" "Africa" "Africa" ...
 $ country     : chr [1:5112] "Algeria" "Algeria" "Algeria" "Algeria" ...
 $ obstype_year: chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
 $ obs_values  : num [1:5112] 2449 3014 2551 3247 4183 ...

That may seem trivial with this particular data frame, but sometimes you have 1 ID variable and 40 observation variables with irregular variable names. The flexibility is a huge time saver!

Now obstype_year actually contains 2 pieces of information, the observation type (pop,lifeExp, or gdpPercap) and the year. We can use the separate() function to split the character strings into multiple variables

R

gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
gap_long$year <- as.integer(gap_long$year)

Challenge 2

Using gap_long, calculate the mean life expectancy, population, and gdpPercap for each continent. Hint: use the group_by() and summarize() functions we learned in the dplyr lesson

R

gap_long %>% group_by(continent, obs_type) %>%
   summarize(means=mean(obs_values))

OUTPUT

`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.

OUTPUT

# A tibble: 15 × 3
# Groups:   continent [5]
   continent obs_type       means
   <chr>     <chr>          <dbl>
 1 Africa    gdpPercap     2194.
 2 Africa    lifeExp         48.9
 3 Africa    pop        9916003.
 4 Americas  gdpPercap     7136.
 5 Americas  lifeExp         64.7
 6 Americas  pop       24504795.
 7 Asia      gdpPercap     7902.
 8 Asia      lifeExp         60.1
 9 Asia      pop       77038722.
10 Europe    gdpPercap    14469.
11 Europe    lifeExp         71.9
12 Europe    pop       17169765.
13 Oceania   gdpPercap    18622.
14 Oceania   lifeExp         74.3
15 Oceania   pop        8874672. 

From long to intermediate format with pivot_wider()


It is always good to check work. So, let’s use the second pivot function, pivot_wider(), to ‘widen’ our observation variables back out. pivot_wider() is the opposite of pivot_longer(), making a dataset wider by increasing the number of columns and decreasing the number of rows. We can use pivot_wider() to pivot or reshape our gap_long to the original intermediate format or the widest format. Let’s start with the intermediate format.

The pivot_wider() function takes names_from and values_from arguments.

To names_from we supply the column name whose contents will be pivoted into new output columns in the widened data frame. The corresponding values will be added from the column named in the values_from argument.

R

gap_normal <- gap_long %>%
  pivot_wider(names_from = obs_type, values_from = obs_values)
dim(gap_normal)

OUTPUT

[1] 1704    6

R

dim(gapminder)

OUTPUT

[1] 1704    6

R

names(gap_normal)

OUTPUT

[1] "continent" "country"   "year"      "gdpPercap" "lifeExp"   "pop"      

R

names(gapminder)

OUTPUT

[1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"

Now we’ve got an intermediate data frame gap_normal with the same dimensions as the original gapminder, but the order of the variables is different. Let’s fix that before checking if they are all.equal().

R

gap_normal <- gap_normal[, names(gapminder)]
all.equal(gap_normal, gapminder)

OUTPUT

[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
[2] "Attributes: < Component \"class\": 1 string mismatch >"
[3] "Component \"country\": 1704 string mismatches"
[4] "Component \"pop\": Mean relative difference: 1.634504"
[5] "Component \"continent\": 1212 string mismatches"
[6] "Component \"lifeExp\": Mean relative difference: 0.203822"
[7] "Component \"gdpPercap\": Mean relative difference: 1.162302"                           

R

head(gap_normal)

OUTPUT

# A tibble: 6 × 6
  country  year      pop continent lifeExp gdpPercap
  <chr>   <int>    <dbl> <chr>       <dbl>     <dbl>
1 Algeria  1952  9279525 Africa       43.1     2449.
2 Algeria  1957 10270856 Africa       45.7     3014.
3 Algeria  1962 11000948 Africa       48.3     2551.
4 Algeria  1967 12760499 Africa       51.4     3247.
5 Algeria  1972 14760787 Africa       54.5     4183.
6 Algeria  1977 17152804 Africa       58.0     4910.

R

head(gapminder)

OUTPUT

      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
6 Afghanistan 1977 14880372      Asia  38.438  786.1134

We’re almost there, the original was sorted by country, then year.

R

gap_normal <- gap_normal %>% arrange(country, year)
all.equal(gap_normal, gapminder)

OUTPUT

[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
[2] "Attributes: < Component \"class\": 1 string mismatch >"                                

That’s great! We’ve gone from the longest format back to the intermediate and we didn’t introduce any errors in our code.

Now let’s convert the long all the way back to the wide. In the wide format, we will keep country and continent as ID variables and pivot the observations across the 3 metrics (pop,lifeExp,gdpPercap) and time (year). First we need to create appropriate labels for all our new variables (time*metric combinations) and we also need to unify our ID variables to simplify the process of defining gap_wide.

R

gap_temp <- gap_long %>% unite(var_ID, continent, country, sep = "_")
str(gap_temp)

OUTPUT

tibble [5,112 × 4] (S3: tbl_df/tbl/data.frame)
 $ var_ID    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
 $ obs_type  : chr [1:5112] "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
 $ year      : int [1:5112] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...

R

gap_temp <- gap_long %>%
    unite(ID_var, continent, country, sep = "_") %>%
    unite(var_names, obs_type, year, sep = "_")
str(gap_temp)

OUTPUT

tibble [5,112 × 3] (S3: tbl_df/tbl/data.frame)
 $ ID_var    : chr [1:5112] "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" "Africa_Algeria" ...
 $ var_names : chr [1:5112] "gdpPercap_1952" "gdpPercap_1957" "gdpPercap_1962" "gdpPercap_1967" ...
 $ obs_values: num [1:5112] 2449 3014 2551 3247 4183 ...

Using unite() we now have a single ID variable which is a combination of continent,country,and we have defined variable names. We’re now ready to pipe in pivot_wider()

R

gap_wide_new <- gap_long %>%
  unite(ID_var, continent, country, sep = "_") %>%
  unite(var_names, obs_type, year, sep = "_") %>%
  pivot_wider(names_from = var_names, values_from = obs_values)
str(gap_wide_new)

OUTPUT

tibble [142 × 37] (S3: tbl_df/tbl/data.frame)
 $ ID_var        : chr [1:142] "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
 $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
 $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
 $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
 $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
 $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
 $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
 $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
 $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
 $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
 $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
 $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
 $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
 $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
 $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
 $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
 $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
 $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
 $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
 $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
 $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
 $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
 $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
 $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
 $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
 $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
 $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
 $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
 $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
 $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
 $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
 $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
 $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
 $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
 $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
 $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
 $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...

Challenge 3

Take this 1 step further and create a gap_ludicrously_wide format data by pivoting over countries, year and the 3 metrics? Hint this new data frame should only have 5 rows.

R

gap_ludicrously_wide <- gap_long %>%
   unite(var_names, obs_type, year, country, sep = "_") %>%
   pivot_wider(names_from = var_names, values_from = obs_values)

Now we have a great ‘wide’ format data frame, but the ID_var could be more usable, let’s separate it into 2 variables with separate()

R

gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep="_")
gap_wide_betterID <- gap_long %>%
    unite(ID_var, continent, country, sep = "_") %>%
    unite(var_names, obs_type, year, sep = "_") %>%
    pivot_wider(names_from = var_names, values_from = obs_values) %>%
    separate(ID_var, c("continent","country"), sep = "_")
str(gap_wide_betterID)

OUTPUT

tibble [142 × 38] (S3: tbl_df/tbl/data.frame)
 $ continent     : chr [1:142] "Africa" "Africa" "Africa" "Africa" ...
 $ country       : chr [1:142] "Algeria" "Angola" "Benin" "Botswana" ...
 $ gdpPercap_1952: num [1:142] 2449 3521 1063 851 543 ...
 $ gdpPercap_1957: num [1:142] 3014 3828 960 918 617 ...
 $ gdpPercap_1962: num [1:142] 2551 4269 949 984 723 ...
 $ gdpPercap_1967: num [1:142] 3247 5523 1036 1215 795 ...
 $ gdpPercap_1972: num [1:142] 4183 5473 1086 2264 855 ...
 $ gdpPercap_1977: num [1:142] 4910 3009 1029 3215 743 ...
 $ gdpPercap_1982: num [1:142] 5745 2757 1278 4551 807 ...
 $ gdpPercap_1987: num [1:142] 5681 2430 1226 6206 912 ...
 $ gdpPercap_1992: num [1:142] 5023 2628 1191 7954 932 ...
 $ gdpPercap_1997: num [1:142] 4797 2277 1233 8647 946 ...
 $ gdpPercap_2002: num [1:142] 5288 2773 1373 11004 1038 ...
 $ gdpPercap_2007: num [1:142] 6223 4797 1441 12570 1217 ...
 $ lifeExp_1952  : num [1:142] 43.1 30 38.2 47.6 32 ...
 $ lifeExp_1957  : num [1:142] 45.7 32 40.4 49.6 34.9 ...
 $ lifeExp_1962  : num [1:142] 48.3 34 42.6 51.5 37.8 ...
 $ lifeExp_1967  : num [1:142] 51.4 36 44.9 53.3 40.7 ...
 $ lifeExp_1972  : num [1:142] 54.5 37.9 47 56 43.6 ...
 $ lifeExp_1977  : num [1:142] 58 39.5 49.2 59.3 46.1 ...
 $ lifeExp_1982  : num [1:142] 61.4 39.9 50.9 61.5 48.1 ...
 $ lifeExp_1987  : num [1:142] 65.8 39.9 52.3 63.6 49.6 ...
 $ lifeExp_1992  : num [1:142] 67.7 40.6 53.9 62.7 50.3 ...
 $ lifeExp_1997  : num [1:142] 69.2 41 54.8 52.6 50.3 ...
 $ lifeExp_2002  : num [1:142] 71 41 54.4 46.6 50.6 ...
 $ lifeExp_2007  : num [1:142] 72.3 42.7 56.7 50.7 52.3 ...
 $ pop_1952      : num [1:142] 9279525 4232095 1738315 442308 4469979 ...
 $ pop_1957      : num [1:142] 10270856 4561361 1925173 474639 4713416 ...
 $ pop_1962      : num [1:142] 11000948 4826015 2151895 512764 4919632 ...
 $ pop_1967      : num [1:142] 12760499 5247469 2427334 553541 5127935 ...
 $ pop_1972      : num [1:142] 14760787 5894858 2761407 619351 5433886 ...
 $ pop_1977      : num [1:142] 17152804 6162675 3168267 781472 5889574 ...
 $ pop_1982      : num [1:142] 20033753 7016384 3641603 970347 6634596 ...
 $ pop_1987      : num [1:142] 23254956 7874230 4243788 1151184 7586551 ...
 $ pop_1992      : num [1:142] 26298373 8735988 4981671 1342614 8878303 ...
 $ pop_1997      : num [1:142] 29072015 9875024 6066080 1536536 10352843 ...
 $ pop_2002      : num [1:142] 31287142 10866106 7026113 1630347 12251209 ...
 $ pop_2007      : num [1:142] 33333216 12420476 8078314 1639131 14326203 ...

R

all.equal(gap_wide, gap_wide_betterID)

OUTPUT

[1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
[2] "Attributes: < Component \"class\": 1 string mismatch >"                                

There and back again!

Other great resources


Key Points

  • Use the tidyr package to change the layout of data frames.
  • Use pivot_longer() to go from wide to longer layout.
  • Use pivot_wider() to go from long to wider layout.

Content from Basic Statistics: describing, modelling and reporting


Last updated on 2024-12-04 | Edit this page

Estimated time: 80 minutes

Overview

Questions

  • How can I detect the type of data I have?
  • How can I make meaningful summaries of my data?

Objectives

  • To be able to describe the different types of data
  • To be able to do basic data exploration of a real dataset
  • To be able to calculate descriptive statistics
  • To be able to perform statistical inference on a dataset

Content


  • Types of Data
  • Exploring your dataset
  • Descriptive Statistics
  • Inferential Statistics

Data


R

# We will need these libraries and this data later.
library(tidyverse)
library(lubridate)
library(gapminder)
# create a binary membership variable for europe (for later examples)
gapminder <- gapminder %>%
  mutate(european = continent == "Europe")

We are going to use the data from the gapminder package. We have added a variable European indicating if a country is in Europe.

The big picture


  • Research often seeks to answer a question about a larger population by collecting data on a small sample
  • Data collection:
    • Many variables
    • For each person/unit.
  • This procedure, sampling, must be controlled so as to ensure representative data.

Descriptive and inferential statistics


Callout

Just as data in general are of different types - for example numeric vs text data - statistical data are assigned to different levels of measure. The level of measure determines how we can describe and model the data.

Describing data

  • Continuous variables
  • Discrete variables

Callout

How do we convey information on what your data looks like, using numbers or figures?

Describing continuous data.

First establish the distribution of the data. You can visualise this with a histogram.

R

ggplot(gapminder, aes(x = gdpPercap)) +
  geom_histogram()

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What is the distribution of this data?

What is the distribution of population?

The raw values are difficult to visualise, so we can take the log of the values and log those. Try this command

R

ggplot(data = gapminder, aes(log(pop))) +
  geom_histogram()

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What is the distribution of this data?

Parametric vs non-parametric analysis

  • Parametric analysis assumes that
    • The data follows a known distribution
    • It can be described using parameters
    • Examples of distributions include, normal, Poisson, exponential.
  • Non parametric data
    • The data can’t be said to follow a known distribution

Emphasise that parametric is not equal to normal.

Describing parametric and non-parametric data

How do you use numbers to convey what your data looks like.

  • Parametric data
    • Use the parameters that describe the distribution.
    • For a Gaussian (normal) distribution - use mean and standard deviation
    • For a Poisson distribution - use average event rate
    • etc.
  • Non Parametric data
    • Use the median (the middle number when they are ranked from lowest to highest) and the interquartile range (the number 75% of the way up the list when ranked minus the number 25% of the way)
  • You can use the command summary(data_frame_name) to get these numbers for each variable.

Mean versus standard deviation

  • What does standard deviation mean?
  • Both graphs have the same mean (center), but the second one has data which is more spread out.

R

# small standard deviation
dummy_1 <- rnorm(1000, mean = 10, sd = 0.5)
dummy_1 <- as.data.frame(dummy_1)
ggplot(dummy_1, aes(x = dummy_1)) +
  geom_histogram()

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

R

# larger standard deviation
dummy_2 <- rnorm(1000, mean = 10, sd = 200)
dummy_2 <- as.data.frame(dummy_2)
ggplot(dummy_2, aes(x = dummy_2)) +
  geom_histogram()

OUTPUT

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Get them to plot the graphs. Explain that we are generating random data from different distributions and plotting them.

Calculating mean and standard deviation

R

mean(gapminder$pop, na.rm = TRUE)

OUTPUT

[1] 29601212

Calculate the standard deviation and confirm that it is the square root of the variance:

R

sdpopulation <- sd(gapminder$pop, na.rm = TRUE)
print(sdpopulation)

OUTPUT

[1] 106157897

R

varpopulation <- var(gapminder$pop, na.rm = TRUE)
print(varpopulation)

OUTPUT

[1] 1.12695e+16

R

sqrt(varpopulation) == sdpopulation

OUTPUT

[1] TRUE

The na.rm argument tells R to ignore missing values in the variable.

Calculating median and interquartile range

R

median(gapminder$pop, na.rm = TRUE)

OUTPUT

[1] 7023596

R

IQR(gapminder$gdpPercap, na.rm = TRUE)

OUTPUT

[1] 8123.402

Again, we ignore the missing values.

Describing discrete data

  • Frequencies

R

table(gapminder$continent)

OUTPUT


  Africa Americas     Asia   Europe  Oceania
     624      300      396      360       24 
  • Proportions

R

continenttable <- table(gapminder$continent)
prop.table(continenttable)

OUTPUT


    Africa   Americas       Asia     Europe    Oceania
0.36619718 0.17605634 0.23239437 0.21126761 0.01408451 

Contingency tables of frequencies can also be tabulated with table(). For example:

R

table(
  gapminder$country[gapminder$year == 2007],
  gapminder$continent[gapminder$year == 2007]
)

OUTPUT


                           Africa Americas Asia Europe Oceania
  Afghanistan                   0        0    1      0       0
  Albania                       0        0    0      1       0
  Algeria                       1        0    0      0       0
  Angola                        1        0    0      0       0
  Argentina                     0        1    0      0       0
  Australia                     0        0    0      0       1
  Austria                       0        0    0      1       0
  Bahrain                       0        0    1      0       0
  Bangladesh                    0        0    1      0       0
  Belgium                       0        0    0      1       0
  Benin                         1        0    0      0       0
  Bolivia                       0        1    0      0       0
  Bosnia and Herzegovina        0        0    0      1       0
  Botswana                      1        0    0      0       0
  Brazil                        0        1    0      0       0
  Bulgaria                      0        0    0      1       0
  Burkina Faso                  1        0    0      0       0
  Burundi                       1        0    0      0       0
  Cambodia                      0        0    1      0       0
  Cameroon                      1        0    0      0       0
  Canada                        0        1    0      0       0
  Central African Republic      1        0    0      0       0
  Chad                          1        0    0      0       0
  Chile                         0        1    0      0       0
  China                         0        0    1      0       0
  Colombia                      0        1    0      0       0
  Comoros                       1        0    0      0       0
  Congo, Dem. Rep.              1        0    0      0       0
  Congo, Rep.                   1        0    0      0       0
  Costa Rica                    0        1    0      0       0
  Cote d'Ivoire                 1        0    0      0       0
  Croatia                       0        0    0      1       0
  Cuba                          0        1    0      0       0
  Czech Republic                0        0    0      1       0
  Denmark                       0        0    0      1       0
  Djibouti                      1        0    0      0       0
  Dominican Republic            0        1    0      0       0
  Ecuador                       0        1    0      0       0
  Egypt                         1        0    0      0       0
  El Salvador                   0        1    0      0       0
  Equatorial Guinea             1        0    0      0       0
  Eritrea                       1        0    0      0       0
  Ethiopia                      1        0    0      0       0
  Finland                       0        0    0      1       0
  France                        0        0    0      1       0
  Gabon                         1        0    0      0       0
  Gambia                        1        0    0      0       0
  Germany                       0        0    0      1       0
  Ghana                         1        0    0      0       0
  Greece                        0        0    0      1       0
  Guatemala                     0        1    0      0       0
  Guinea                        1        0    0      0       0
  Guinea-Bissau                 1        0    0      0       0
  Haiti                         0        1    0      0       0
  Honduras                      0        1    0      0       0
  Hong Kong, China              0        0    1      0       0
  Hungary                       0        0    0      1       0
  Iceland                       0        0    0      1       0
  India                         0        0    1      0       0
  Indonesia                     0        0    1      0       0
  Iran                          0        0    1      0       0
  Iraq                          0        0    1      0       0
  Ireland                       0        0    0      1       0
  Israel                        0        0    1      0       0
  Italy                         0        0    0      1       0
  Jamaica                       0        1    0      0       0
  Japan                         0        0    1      0       0
  Jordan                        0        0    1      0       0
  Kenya                         1        0    0      0       0
  Korea, Dem. Rep.              0        0    1      0       0
  Korea, Rep.                   0        0    1      0       0
  Kuwait                        0        0    1      0       0
  Lebanon                       0        0    1      0       0
  Lesotho                       1        0    0      0       0
  Liberia                       1        0    0      0       0
  Libya                         1        0    0      0       0
  Madagascar                    1        0    0      0       0
  Malawi                        1        0    0      0       0
  Malaysia                      0        0    1      0       0
  Mali                          1        0    0      0       0
  Mauritania                    1        0    0      0       0
  Mauritius                     1        0    0      0       0
  Mexico                        0        1    0      0       0
  Mongolia                      0        0    1      0       0
  Montenegro                    0        0    0      1       0
  Morocco                       1        0    0      0       0
  Mozambique                    1        0    0      0       0
  Myanmar                       0        0    1      0       0
  Namibia                       1        0    0      0       0
  Nepal                         0        0    1      0       0
  Netherlands                   0        0    0      1       0
  New Zealand                   0        0    0      0       1
  Nicaragua                     0        1    0      0       0
  Niger                         1        0    0      0       0
  Nigeria                       1        0    0      0       0
  Norway                        0        0    0      1       0
  Oman                          0        0    1      0       0
  Pakistan                      0        0    1      0       0
  Panama                        0        1    0      0       0
  Paraguay                      0        1    0      0       0
  Peru                          0        1    0      0       0
  Philippines                   0        0    1      0       0
  Poland                        0        0    0      1       0
  Portugal                      0        0    0      1       0
  Puerto Rico                   0        1    0      0       0
  Reunion                       1        0    0      0       0
  Romania                       0        0    0      1       0
  Rwanda                        1        0    0      0       0
  Sao Tome and Principe         1        0    0      0       0
  Saudi Arabia                  0        0    1      0       0
  Senegal                       1        0    0      0       0
  Serbia                        0        0    0      1       0
  Sierra Leone                  1        0    0      0       0
  Singapore                     0        0    1      0       0
  Slovak Republic               0        0    0      1       0
  Slovenia                      0        0    0      1       0
  Somalia                       1        0    0      0       0
  South Africa                  1        0    0      0       0
  Spain                         0        0    0      1       0
  Sri Lanka                     0        0    1      0       0
  Sudan                         1        0    0      0       0
  Swaziland                     1        0    0      0       0
  Sweden                        0        0    0      1       0
  Switzerland                   0        0    0      1       0
  Syria                         0        0    1      0       0
  Taiwan                        0        0    1      0       0
  Tanzania                      1        0    0      0       0
  Thailand                      0        0    1      0       0
  Togo                          1        0    0      0       0
  Trinidad and Tobago           0        1    0      0       0
  Tunisia                       1        0    0      0       0
  Turkey                        0        0    0      1       0
  Uganda                        1        0    0      0       0
  United Kingdom                0        0    0      1       0
  United States                 0        1    0      0       0
  Uruguay                       0        1    0      0       0
  Venezuela                     0        1    0      0       0
  Vietnam                       0        0    1      0       0
  West Bank and Gaza            0        0    1      0       0
  Yemen, Rep.                   0        0    1      0       0
  Zambia                        1        0    0      0       0
  Zimbabwe                      1        0    0      0       0

Which leads quite naturally to the consideration of any association between the observed frequencies.

Inferential statistics

Meaningful analysis

  • What is your hypothesis - what is your null hypothesis?

Callout

Always: the level of the independent variable has no effect on the level of the dependent variable.

  • What type of variables (data type) do you have?

  • What are the assumptions of the test you are using?

  • Interpreting the result

Testing significance

  • p-value

  • <0.05

  • 0.03-0.049

    • Would benefit from further testing.

0.05 is not a magic number.

Comparing means

It all starts with a hypothesis

  • Null hypothesis
    • “There is no difference in mean height between men and women” \[mean\_height\_men - mean\_height\_women = 0\]
  • Alternate hypothesis
    • “There is a difference in mean height between men and women”

More on hypothesis testing

  • The null hypothesis (H0) assumes that the true mean difference (μd) is equal to zero.

  • The two-tailed alternative hypothesis (H1) assumes that μd is not equal to zero.

  • The upper-tailed alternative hypothesis (H1) assumes that μd is greater than zero.

  • The lower-tailed alternative hypothesis (H1) assumes that μd is less than zero.

  • Remember: hypotheses are never about data, they are about the processes which produce the data. The value of μd is unknown. The goal of hypothesis testing is to determine the hypothesis (null or alternative) with which the data are more consistent.

Comparing means

Is there an absolute difference between the populations of European vs non-European countries?

R

gapminder %>%
  group_by(european) %>%
  summarise(av.popn = mean(pop, na.rm = TRUE))

OUTPUT

# A tibble: 2 × 2
  european   av.popn
  <lgl>        <dbl>
1 FALSE    32931064.
2 TRUE     17169765.

Is the difference between heights statistically significant?

t-test

Assumptions of a t-test

  • One independent categorical variable with 2 groups and one dependent continuous variable

  • The dependent variable is approximately normally distributed in each group

  • The observations are independent of each other

  • For students’ original t-statistic, that the variances in both groups are more or less equal. This constraint should probably be abandoned in favour of always using a conservative test.

Doing a t-test

R

t.test(pop ~ european, data = gapminder)$statistic

OUTPUT

       t
4.611907 

R

t.test(pop ~ european, data = gapminder)$parameter

OUTPUT

      df
1585.104 

Notice that the summary()** of the test contains more data than is output by default.

Write a paragraph in markdown format reporting this test result including the t-statistic, the degrees of freedom, the confidence interval and the p-value to 4 places. To do this include your r code inline with your text, rather than in an R code chunk.

t-test result

Testing supported the rejection of the null hypothesis that there is no difference between mean populations of European and non-European participants (t=4.6119, df= 1585.1044, p= 0).

(Can you get p to display to four places? Cf format().)

More than two levels of IV

While the t-test is sufficient where there are two levels of the IV, for situations where there are more than two, we use the ANOVA family of procedures. To show this, we will create a variable that subsets our data by per capita GDP levels. If the ANOVA result is statistically significant, we will use a post-hoc test method to do pairwise comparisons (here Tukey’s Honest Significant Differences.)

R

quantile(gapminder$gdpPercap)

OUTPUT

         0%         25%         50%         75%        100%
   241.1659   1202.0603   3531.8470   9325.4623 113523.1329 

R

IQR(gapminder$gdpPercap)

OUTPUT

[1] 8123.402

R

gapminder$gdpGroup <- cut(gapminder$gdpPercap, breaks = c(241.1659, 1202.0603, 3531.8470, 9325.4623, 113523.1329), labels = FALSE)

gapminder$gdpGroup <- factor(gapminder$gdpGroup)

anovamodel <- aov(gapminder$pop ~ gapminder$gdpGroup)
summary(anovamodel)

OUTPUT

                     Df    Sum Sq   Mean Sq F value Pr(>F)
gapminder$gdpGroup    3 1.066e+17 3.553e+16   3.163 0.0237 *
Residuals          1699 1.908e+19 1.123e+16
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1 observation deleted due to missingness

R

TukeyHSD(anovamodel)

OUTPUT

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = gapminder$pop ~ gapminder$gdpGroup)

$`gapminder$gdpGroup`
         diff       lwr        upr     p adj
2-1  -4228756 -22914519 14457007.3 0.9375254
3-1 -19586897 -38272660  -901133.5 0.0357045
4-1 -15053430 -33739193  3632332.8 0.1628242
3-2 -15358141 -34032922  3316640.4 0.1487248
4-2 -10824674 -29499456  7850106.7 0.4433887
4-3   4533466 -14141315 23208247.5 0.9243090

Regression Modelling

The most common use of regression modelling is to explore the relationship between two continuous variables, for example between gdpPercap and lifeExp in our data. We can first determine whether there is any significant correlation between the values, and if there is, plot the relationship.

R

cor.test(gapminder$gdpPercap, gapminder$lifeExp)

OUTPUT


	Pearson's product-moment correlation

data:  gapminder$gdpPercap and gapminder$lifeExp
t = 29.658, df = 1702, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5515065 0.6141690
sample estimates:
      cor
0.5837062 

R

ggplot(gapminder, aes(gdpPercap, log(lifeExp))) +
  geom_point() +
  geom_smooth()

OUTPUT

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Having decided that a further investigation of this relationship is worthwhile, we can create a linear model with the function lm().

R

modelone <- lm(gapminder$gdpPercap ~ gapminder$lifeExp)
summary(modelone)

OUTPUT


Call:
lm(formula = gapminder$gdpPercap ~ gapminder$lifeExp)

Residuals:
   Min     1Q Median     3Q    Max
-11483  -4539  -1223   2482 106950

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       -19277.25     914.09  -21.09   <2e-16 ***
gapminder$lifeExp    445.44      15.02   29.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8006 on 1702 degrees of freedom
Multiple R-squared:  0.3407,	Adjusted R-squared:  0.3403
F-statistic: 879.6 on 1 and 1702 DF,  p-value: < 2.2e-16

Regression with a categorical IV (the t-test)

Run the following code chunk and compare the results to the t test conducted earlier.

R

gapminder %>%
  mutate(european = factor(european))

OUTPUT

# A tibble: 1,704 × 8
   country     continent  year lifeExp      pop gdpPercap european gdpGroup
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <fct>    <fct>
 1 Afghanistan Asia       1952    28.8  8425333      779. FALSE    1
 2 Afghanistan Asia       1957    30.3  9240934      821. FALSE    1
 3 Afghanistan Asia       1962    32.0 10267083      853. FALSE    1
 4 Afghanistan Asia       1967    34.0 11537966      836. FALSE    1
 5 Afghanistan Asia       1972    36.1 13079460      740. FALSE    1
 6 Afghanistan Asia       1977    38.4 14880372      786. FALSE    1
 7 Afghanistan Asia       1982    39.9 12881816      978. FALSE    1
 8 Afghanistan Asia       1987    40.8 13867957      852. FALSE    1
 9 Afghanistan Asia       1992    41.7 16317921      649. FALSE    1
10 Afghanistan Asia       1997    41.8 22227415      635. FALSE    1
# ℹ 1,694 more rows

R

modelttest <- lm(gapminder$pop ~ gapminder$european)

summary(modelttest)

OUTPUT


Call:
lm(formula = gapminder$pop ~ gapminder$european)

Residuals:
       Min         1Q     Median         3Q        Max
 -32871053  -29780936  -22066032   -7948269 1285752032

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)             32931064    2891217  11.390   <2e-16 ***
gapminder$europeanTRUE -15761300    6290196  -2.506   0.0123 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.06e+08 on 1702 degrees of freedom
Multiple R-squared:  0.003675,	Adjusted R-squared:  0.00309
F-statistic: 6.278 on 1 and 1702 DF,  p-value: 0.01231

Regression with a categorical IV (ANOVA)

Use the lm() function to model the relationship between gapminder$gdpGroup and gapminder$pop. Compare the results with the ANOVA carried out earlier.

Lunch

  • Feel free to explore the handout and go through the exercises again.

Content from Producing Reports With knitr


Last updated on 2024-12-04 | Edit this page

Estimated time: 75 minutes

Overview

Questions

  • How can I integrate software and reports?

Objectives

  • Understand the value of writing reproducible reports
  • Learn how to recognise and compile the basic components of an R Markdown file
  • Become familiar with R code chunks, and understand their purpose, structure and options
  • Demonstrate the use of inline chunks for weaving R outputs into text blocks, for example when discussing the results of some calculations
  • Be aware of alternative output formats to which an R Markdown file can be exported

Data analysis reports


Data analysts tend to write a lot of reports, describing their analyses and results, for their collaborators or to document their work for future reference.

Many new users begin by first writing a single R script containing all of their work, and then share the analysis by emailing the script and various graphs as attachments. But this can be cumbersome, requiring a lengthy discussion to explain which attachment was which result.

Writing formal reports with Word or LaTeX can simplify this process by incorporating both the analysis report and output graphs into a single document. But tweaking formatting to make figures look correct and fixing obnoxious page breaks can be tedious and lead to a lengthy “whack-a-mole” game of fixing new mistakes resulting from a single formatting change.

Creating a report as a web page (which is an html file) using R Markdown makes things easier. The report can be one long stream, so tall figures that wouldn’t ordinarily fit on one page can be kept at full size and easier to read, since the reader can simply keep scrolling. Additionally, the formatting of and R Markdown document is simple and easy to modify, allowing you to spend more time on your analyses instead of writing reports.

Literate programming


Ideally, such analysis reports are reproducible documents: If an error is discovered, or if some additional subjects are added to the data, you can just re-compile the report and get the new or corrected results rather than having to reconstruct figures, paste them into a Word document, and hand-edit various detailed results.

The key R package here is knitr. It allows you to create a document that is a mixture of text and chunks of code. When the document is processed by knitr, chunks of code will be executed, and graphs or other results will be inserted into the final document.

This sort of idea has been called “literate programming”.

knitr allows you to mix basically any type of text with code from different programming languages, but we recommend that you use R Markdown, which mixes Markdown with R. Markdown is a light-weight mark-up language for creating web pages.

Creating an R Markdown file


Within RStudio, click File → New File → R Markdown and you’ll get a dialog box like this:

Screenshot of the New R Markdown file dialogue box in RStudio

You can stick with the default (HTML output), but give it a title.

Basic components of R Markdown


The initial chunk of text (header) contains instructions for R to specify what kind of document will be created, and the options chosen. You can use the header to give your document a title, author, date, and tell it what type of output you want to produce. In this case, we’re creating an html document.

---
title: "Initial R Markdown document"
author: "Karl Broman"
date: "April 23, 2015"
output: html_document
---

You can delete any of those fields if you don’t want them included. The double-quotes aren’t strictly necessary in this case. They’re mostly needed if you want to include a colon in the title.

RStudio creates the document with some example text to get you started. Note below that there are chunks like

```{r}
summary(cars)
```

These are chunks of R code that will be executed by knitr and replaced by their results. More on this later.

Markdown


Markdown is a system for writing web pages by marking up the text much as you would in an email rather than writing html code. The marked-up text gets converted to html, replacing the marks with the proper html code.

For now, let’s delete all of the stuff that’s there and write a bit of markdown.

You make things bold using two asterisks, like this: **bold**, and you make things italics by using underscores, like this: _italics_.

You can make a bulleted list by writing a list with hyphens or asterisks with a space between the list and other text, like this:

A list:

* bold with double-asterisks
* italics with underscores
* code-type font with backticks

or like this:

A second list:

- bold with double-asterisks
- italics with underscores
- code-type font with backticks

Each will appear as:

  • bold with double-asterisks
  • italics with underscores
  • code-type font with backticks

You can use whatever method you prefer, but be consistent. This maintains the readability of your code.

You can make a numbered list by just using numbers. You can even use the same number over and over if you want:

1. bold with double-asterisks
1. italics with underscores
1. code-type font with backticks

This will appear as:

  1. bold with double-asterisks
  2. italics with underscores
  3. code-type font with backticks

You can make section headers of different sizes by initiating a line with some number of # symbols:

# Title
## Main section
### Sub-section
#### Sub-sub section

You compile the R Markdown document to an html webpage by clicking the “Knit” button in the upper-left.

Challenge 1

Create a new R Markdown document. Delete all of the R code chunks and write a bit of Markdown (some sections, some italicized text, and an itemized list).

Convert the document to a webpage.

In RStudio, select File > New file > R Markdown…

Delete the placeholder text and add the following:

# Introduction

## Background on Data

This report uses the *gapminder* dataset, which has columns that include:

* country
* continent
* year
* lifeExp
* pop
* gdpPercap

## Background on Methods

Then click the ‘Knit’ button on the toolbar to generate an html document (webpage).

A bit more Markdown


You can make a hyperlink like this: [Carpentries Home Page](https://carpentries.org/).

You can include an image file like this: ![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)

You can do subscripts (e.g., F2) with F~2~ and superscripts (e.g., F2) with F^2^.

If you know how to write equations in LaTeX, you can use $ $ and $$ $$ to insert math equations, like $E = mc^2$ and

$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$

You can review Markdown syntax by navigating to the “Markdown Quick Reference” under the “Help” field in the toolbar at the top of RStudio.

R code chunks


The real power of Markdown comes from mixing markdown with chunks of code. This is R Markdown. When processed, the R code will be executed; if they produce figures, the figures will be inserted in the final document.

The main code chunks look like this:

```{r load_data}
gapminder

That is, you place a chunk of R code between ```{r chunk_name} and ```. You should give each chunk a unique name, as they will help you to fix errors and, if any graphs are produced, the file names are based on the name of the code chunk that produced them. You can create code chunks quickly in RStudio using the shortcuts Ctrl+Alt+I on Windows and Linux, or Cmd+Option+I on Mac.

Challenge 2

Add code chunks to:

  • Load the ggplot2 package
  • Read the gapminder data
  • Create a plot
```{r load-ggplot2}
library("ggplot2")
```
```{r read-gapminder-data}
gapminder
```{r make-plot}
plot(lifeExp ~ year, data = gapminder)
```

How things get compiled


When you press the “Knit” button, the R Markdown document is processed by knitr and a plain Markdown document is produced (as well as, potentially, a set of figure files): the R code is executed and replaced by both the input and the output; if figures are produced, links to those figures are included.

The Markdown and figure documents are then processed by the tool pandoc, which converts the Markdown file into an html file, with the figures embedded.

Chunk options


There are a variety of options to affect how the code chunks are treated. Here are some examples:

  • Use echo=FALSE to avoid having the code itself shown.
  • Use results="hide" to avoid having any results printed.
  • Use eval=FALSE to have the code shown but not evaluated.
  • Use warning=FALSE and message=FALSE to hide any warnings or messages produced.
  • Use fig.height and fig.width to control the size of the figures produced (in inches).

So you might write:

```{r load_libraries, echo=FALSE, message=FALSE}
library("dplyr")
library("ggplot2")
```

Often there will be particular options that you’ll want to use repeatedly; for this, you can set global chunk options, like so:

```{r global_options, echo=FALSE}
knitr::opts_chunk$set(fig.path="Figs/", message=FALSE, warning=FALSE,
                      echo=FALSE, results="hide", fig.width=11)
```

The fig.path option defines where the figures will be saved. The / here is really important; without it, the figures would be saved in the standard place but just with names that begin with Figs.

If you have multiple R Markdown files in a common directory, you might want to use fig.path to define separate prefixes for the figure file names, like fig.path="Figs/cleaning-" and fig.path="Figs/analysis-".

Challenge 3

Use chunk options to control the size of a figure and to hide the code.

```{r echo = FALSE, fig.width = 3}
plot(faithful)
```

You can review all of the R chunk options by navigating to the “R Markdown Cheat Sheet” under the “Cheatsheets” section of the “Help” field in the toolbar at the top of RStudio.

Inline R code


You can make every number in your report reproducible. Use `r and ` for an in-line code chunk, like so: `r round(some_value, 2)`. The code will be executed and replaced with the value of the result.

Don’t let these in-line chunks get split across lines.

Perhaps precede the paragraph with a larger code chunk that does calculations and defines variables, with include=FALSE for that larger chunk (which is the same as echo=FALSE and results="hide").

Rounding can produce differences in output in such situations. You may want 2.0, but round(2.03, 1) will give just 2.

The myround function in the R/broman package handles this.

Challenge 4

Try out a bit of in-line R code.

Here’s some inline code to determine that 2 + 2 = 4.

Other output options


You can also convert R Markdown to a PDF or a Word document. Click the little triangle next to the “Knit” button to get a drop-down menu. Or you could put pdf_document or word_document in the initial header of the file.

Tip: Creating PDF documents

Creating .pdf documents may require installation of some extra software. The R package tinytex provides some tools to help make this process easier for R users. With tinytex installed, run tinytex::install_tinytex() to install the required software (you’ll only need to do this once) and then when you knit to pdf tinytex will automatically detect and install any additional LaTeX packages that are needed to produce the pdf document. Visit the tinytex website for more information.

Tip: Visual markdown editing in RStudio

RStudio versions 1.4 and later include visual markdown editing mode. In visual editing mode, markdown expressions (like **bold words**) are transformed to the formatted appearance (bold words) as you type. This mode also includes a toolbar at the top with basic formatting buttons, similar to what you might see in common word processing software programs. You can turn visual editing on and off by pressing the Icon for turning on and off the visual editing mode in RStudio, which looks like a pair of compasses button in the top right corner of your R Markdown document.

Resources


Key Points

  • Mix reporting written in R Markdown with software written in R.
  • Specify chunk options to control formatting.
  • Use knitr to convert these documents into PDF and other formats.

Content from Writing Good Software


Last updated on 2024-12-04 | Edit this page

Estimated time: 15 minutes

Overview

Questions

  • How can I write software that other people can use?

Objectives

  • Describe best practices for writing R and explain the justification for each.

Structure your project folder


Keep your project folder structured, organized and tidy, by creating subfolders for your code files, manuals, data, binaries, output plots, etc. It can be done completely manually, or with the help of RStudio’s New Project functionality, or a designated package, such as ProjectTemplate.

Tip: ProjectTemplate - a possible solution

One way to automate the management of projects is to install the third-party package, ProjectTemplate. This package will set up an ideal directory structure for project management. This is very useful as it enables you to have your analysis pipeline/workflow organised and structured. Together with the default RStudio project functionality and Git you will be able to keep track of your work as well as be able to share your work with collaborators.

  1. Install ProjectTemplate.
  2. Load the library
  3. Initialise the project:

R

install.packages("ProjectTemplate")
library("ProjectTemplate")
create.project("../my_project_2", merge.strategy = "allow.non.conflict")

For more information on ProjectTemplate and its functionality visit the home page ProjectTemplate

Make code readable


The most important part of writing code is making it readable and understandable. You want someone else to be able to pick up your code and be able to understand what it does: more often than not this someone will be you 6 months down the line, who will otherwise be cursing past-self.

Documentation: tell us what and why, not how


When you first start out, your comments will often describe what a command does, since you’re still learning yourself and it can help to clarify concepts and remind you later. However, these comments aren’t particularly useful later on when you don’t remember what problem your code is trying to solve. Try to also include comments that tell you why you’re solving a problem, and what problem that is. The how can come after that: it’s an implementation detail you ideally shouldn’t have to worry about.

Keep your code modular


Our recommendation is that you should separate your functions from your analysis scripts, and store them in a separate file that you source when you open the R session in your project. This approach is nice because it leaves you with an uncluttered analysis script, and a repository of useful functions that can be loaded into any analysis script in your project. It also lets you group related functions together easily.

Break down problem into bite size pieces


When you first start out, problem solving and function writing can be daunting tasks, and hard to separate from code inexperience. Try to break down your problem into digestible chunks and worry about the implementation details later: keep breaking down the problem into smaller and smaller functions until you reach a point where you can code a solution, and build back up from there.

Know that your code is doing the right thing


Make sure to test your functions!

Don’t repeat yourself


Functions enable easy reuse within a project. If you see blocks of similar lines of code through your project, those are usually candidates for being moved into functions.

If your calculations are performed through a series of functions, then the project becomes more modular and easier to change. This is especially the case for which a particular input always gives a particular output.

Remember to be stylish


Apply consistent style to your code.

Key Points

  • Keep your project folder structured, organized and tidy.
  • Document what and why, not how.
  • Break programs into short single-purpose functions.
  • Write re-runnable tests.
  • Don’t repeat yourself.
  • Be consistent in naming, indentation, and other aspects of style.