Open this document in RStudio.

Click on the blue “Knit” icon above to view as HTML.

Questions:

  1. How can I manipulate a data frame?

Objectives:

  1. Add and remove rows or columns to/from a data frame
  2. Remove rows with NA values fromn a data frame
  3. Append two data frames
  4. Display basic properties of data frames including size and class of the columns, names, and first few rows.

A brief review of data structures in R.

  1. Vector - values of the same type
  2. Matrix - two-dimensional structure of elements of the same type
  3. Lists - several dimensions, any data type
  4. Data frame - two-dimensional structure that can store values of any data type or object

And the basic data types:

  1. Logical (TRUE, FALSE, T, F)
  2. Numeric (integer, 2L and double (“double precision floating point numbers”), 3.14)
  3. Character (string / text, like “Hello World!”)

A vector is created with the function “c()”, which means “concatenate”. It receives a sequence of values of the same type separated by a comma (,).

To create a numeric vector of integers:

my_vector <- c(2L, 5L, 6L, 8L)

Last week we created a data frame containing information on “cats”. You can re-create it here by copying and pasting this code into the Console window on RStudio. We are creating a dataframe with 3 rows and 3 columns.

cats <- data.frame(coat = c("calico", "black", "tabby"),
                    weight = c(2.1, 5.0, 3.2),
                    likes_string = c(1, 0, 1))

Adding columns and rows in data frames

To add a column to an existing data frame

Create a new vector (copy and paste, or type into RStudio Console window at prompt).

age <- c(2,3,5)

use “cbind()” function to add this vector as a column to the existing “cats” table

cbind(cats,age)

If we try to add a vector of ages with a different number of entries than the number of rows in the dataframe, it would produce an error.

age <- c(2,3,5,12)
cbind(cats,age)

or

age <- c(2,3)
cbind(cats,age)

This did not work as R wants to have one element in new column for every row in the table. Check the number of rows in the “cats” table with “nrows()”

nrows(cats)

and what is the length of the “age” vector we just created? Use “length” to find out.

length(age)

There are only 2 elements in the age vector, but 3 rows in the “cats” dataframe. Let’s do it correctly now.

age <- c(2,3,5)
cats <- cbind(cats,age)

What if we wanted to add some rows? Since the rows of a data frame are lists, we just need to add a new row as a list.

newRow <- list("tortoiseshell", 3.3, TRUE, 9)

and then use the “rbind()” function to add it to the data frame.

cats <- rbind(cats, newRow)

This did not produce the Error message/warning as shown in the example. Why?

Removing rows

Let’s try removing the row we just added. We can use indices to do this. So, to get the 4th row, we specify:

cats[-4,]

By using just the comma with nothing after it, we indicate we want to drop the entire 4th row.

Removing columns

We can remove columns from a data frame by using indices as we did with rows.

cats[,-4]

By using the comma with nothing in front of it - we indicate we want to keep all the rows.

Subsetting with the logical operator “%in%” will be covered more in the next section.

Appending to a data frame

Remember, columns are vectors and rows are lists.

Two data frames can be glued together with “rbind”.

cats <- rbind(cats,cats)
cats

In this case, we have “glued” or “appended” the cats data frame onto another copy of the cats data frame.

To remove the row names (numbers to the left)…we can set them to NULL.

rownames(cats) <- NULL
cats

and they will be re-named (in my version of R this happens without having to set the rownames to NULL).

Realistic example with “gapminder”

install.packages("gapminder")
library("gapminder")

Let’s check out what the data looks like?

str(gapminder)

To see that “gapminder” is a data frame (in this case a tibble data frame).

What is a tibble?

Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors). cran.r-project.org

  1. it does not change input type - reduces error messages
  2. does not adjust the names of variables
  3. evaluates arguments “lazily” and sequentially
  4. does not use “row.names()”
  5. only recycles vectors of length 1 (recycling vectors can cause bugs)

Differences between tibble data frames and standard data frames

  1. Printing - tibble only shows first ten rows and columns that fit on one screen
  2. Sub-setting tibbles always returns another tibble
  3. Recycling - only values of length 1 are recycled

We can also see a summary of each column by using “summary()”, which gives a numeric, tabular or descriptive summary of each column.

summary(gapminder$country)
summary(gapminder$year)
summary(gapminder$lifeExp)
summary(gapminder)

For data frames, summary yields a numeric, tabular, or descriptive summary of each column. Factor columns are summarized by the number of items in each level, numeric or integer columns by the descriptive statistics (quartiles and mean), and character columns by its length, class, and mode.

How do the different examples of “summary()” on the gapminder differ?

We can examine the individual columns of the data frame with “typeof()” function

typeof(gapminder$year)
typeof(gapminder$country)

Both of these should return type “integer”.

str(gapminder$country)

Does not return “integer”. What is the data type of “country” in “gapminder” data?

Let’s look at the dimensions of the data frame.

length(gapminder)

The “gapminder” data is built out of a list of 6 columns. Does the output reflect this?

typeof(gapminder)

Let’s check the numbers of rows and columns.

nrow(gapminder)
ncol(gapminder)
dim(gapminder)
colnames(gapminder)
head(gapminder)

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.