Creating new columns

Last updated on 2025-06-24 | Edit this page

Estimated time: 30 minutes

Overview

Questions

What function is used to create new columns in a dataset?
How can you generate a unique ID column for each row?
How can you modify an existing column?
How can you compute an average value over many columns in your data?

Objectives

Understand how to use mutate() to create new columns.
Learn how to add constant values, computed values, and conditionally assigned values to new columns.
Use ifelse() to create categorical columns based on conditions.
Learn efficient methods for computing row-wise sums and averages using rowSums() and rowMeans().
Understand the difference between sum() and rowSums() for column-wise and row-wise operations.

Making new columns

So far we have dealt with existing columns and worked on them. We have computed descriptive statistics, plotted basic overviews of data and cut the data just the way we like it using filter() and select(). Now, we will unlock a new power… the power of creation! Muhahahahaha.

Well, actually it’s just adding new columns to the dataset, but that’s still cool! Let’s start by reading in our DASS data again and loading the package dplyr.

R

library(dplyr)

OUTPUT


Attaching package: 'dplyr'

OUTPUT

The following objects are masked from 'package:stats':

    filter, lag

OUTPUT

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

R

dass_data <- read.csv("data/kaggle_dass/data.csv")

To add a new column, we need the dplyr function mutate(). It changes - or mutates - columns in the dataset, like adding entirely new ones. The syntax is quite simple. The first argument is the data, allowing easy use of pipes, and the second argument is the code for the new column you want to have.

For example, if we want to create a new column that carries just a single value, we type the following:

R

dass_data <- dass_data %>% 
  mutate(single_value_column = 42)

To check if this worked:

R

unique(dass_data$single_value_column)

OUTPUT

[1] 42

Saving progress

Notice how we had to assign the dataset to itself in order to save the progress we have made. If you do not do this, and just write down the pipeline, the end result will be outputted to the console, not stored in an object that we can use later.

R

dass_data %>% 
  head(6) %>% 
  mutate(some_column = "I have a value!") %>% 
  select(country, some_column)

OUTPUT

  country     some_column
1      IN I have a value!
2      US I have a value!
3      PL I have a value!
4      US I have a value!
5      MY I have a value!
6      US I have a value!

You can easily assign the new data to something with the same name, overriding the previous status. But be careful with this, I recommend you use new names for the data after filtering, selecting, and mutating existing columns. These operations change the original data (they remove some information) and should thus be stored in a new object, like filtered_dass_data.

We can do much more interesting things than just placing a constant in the new column. For example, we can use row_number() to create a running number of entries in our data - an id of sort.

R

dass_data <- dass_data %>% 
  mutate(entry_id = row_number())

dass_data %>% 
  select(country, testelapse, entry_id) %>% 
  head()

OUTPUT

  country testelapse entry_id
1      IN        167        1
2      US        193        2
3      PL        271        3
4      US        261        4
5      MY        164        5
6      US        349        6

Mutating existing columns

We can also base our mutation on existing columns. testelapse is in seconds right now, let’s add a new column that is the elapsed test time in minutes.

R

dass_data <- dass_data %>% 
  mutate(
    testelapse_min = testelapse / 60
  )

mean(dass_data$testelapse)

OUTPUT

[1] 2684.843

R

mean(dass_data$testelapse_min)

OUTPUT

[1] 44.74738

Another function I often use when creating new columns is ifelse(). It works using three arguments.

R

a = 5
b = 3

ifelse(
  a > b, # The condition
  "a is greater", # IF yes
  "a is smaller" # IF no
)

OUTPUT

[1] "a is greater"

We can use this to make a new column that tells us whether an entry came from the United States.

R

dass_data <- dass_data %>% 
  mutate(is_from_us = ifelse(country == "US", "yes", "no"))

table(dass_data$is_from_us)

OUTPUT


   no   yes
31566  8207

Adding existing columns

One of the most important uses of mutate() is adding together multiple columns. In our data, we have three columns indicating the length of elapsed time during survey completion, introelapse, testelapse and surveyelapse.

We can compute the sum of these three columns to get the total amount of elapsed time.

R

dass_data <- dass_data %>%
  mutate(
    total_elapse_seconds = introelapse + testelapse + surveyelapse
  )

Hint: Conduct Sanity Checks

When programming, you will make mistakes. There just is not a way around it. The only way to deal with it is trying to catch your mistakes as early as possible. Sometimes the computer will help you out by issuing an error or a warning. Sometimes it does not, because the code you wrote is technically valid but still does not exactly do what you want it to.

Therefore, it is important to conduct sanity checks! Try looking at the data and figure out if what you just did makes sense. In our case, let’s inspect the new column:

R

dass_data %>% 
  select(introelapse, testelapse, surveyelapse, total_elapse_seconds) %>% 
  head()

OUTPUT

  introelapse testelapse surveyelapse total_elapse_seconds
1          19        167          166                  352
2           1        193          186                  380
3           5        271          122                  398
4           3        261          336                  600
5        1766        164          157                 2087
6           4        349          213                  566

This looks good so far!

Now, this approach works well for a few columns, but gets tedious quickly when attempting to compute the sum or average over multiple columns. For this there exists a special fuction called rowSums(). This function takes a dataset, and computes the sums across all columns.

R

example_data <- data.frame(
  x = 1:4,
  y = 2:5,
  z = 3:6
)
example_data$sum <- rowSums(example_data)

example_data

OUTPUT

  x y z sum
1 1 2 3   6
2 2 3 4   9
3 3 4 5  12
4 4 5 6  15

However, in our case we do not want a sum across all columns of our data, but just some of the columns! To do this, we need to use across() inside the rowSums function to select specific columns.

R

dass_data %>%
  mutate(elapse_sum = rowSums(across(c("introelapse", "testelapse", "surveyelapse")))) %>% 
  select(introelapse, testelapse, surveyelapse, elapse_sum) %>% 
  head()

OUTPUT

  introelapse testelapse surveyelapse elapse_sum
1          19        167          166        352
2           1        193          186        380
3           5        271          122        398
4           3        261          336        600
5        1766        164          157       2087
6           4        349          213        566

Because across() also accepts the same input as select() we can use the special functions like starts_with(), ends_with() or contains().

R

dass_data %>%
  mutate(elapse_sum = rowSums(across(ends_with("elapse")))) %>% 
  select(ends_with("elapse"), elapse_sum) %>% 
  head()

OUTPUT

  introelapse testelapse surveyelapse elapse_sum
1          19        167          166        352
2           1        193          186        380
3           5        271          122        398
4           3        261          336        600
5        1766        164          157       2087
6           4        349          213        566

Similarly, we can use rowMeans() to compute the average test duration. The beautiful thing about mutate() is, that we can create multiple columns in one function call!

R

dass_data %>%
  mutate(
    constant_value = "I am a value!",
    id = row_number(),
    elapse_sum = rowSums(across(ends_with("elapse"))),
    elapse_mean = rowMeans(across(ends_with("elapse")))
  ) %>% 
  select(id, constant_value, ends_with("elapse"), elapse_sum, elapse_mean) %>% 
  head()

OUTPUT

  id constant_value introelapse testelapse surveyelapse elapse_sum elapse_mean
1  1  I am a value!          19        167          166        352    117.3333
2  2  I am a value!           1        193          186        380    126.6667
3  3  I am a value!           5        271          122        398    132.6667
4  4  I am a value!           3        261          336        600    200.0000
5  5  I am a value!        1766        164          157       2087    695.6667
6  6  I am a value!           4        349          213        566    188.6667

Pro Tip: mutate() + across()

Sometimes, you want to mutate multiple columns at once, applying the same function to all of them. For example, when you want to recode some values across all columns that share the incorrect format. Here, you can use mutate() in combination with across() to apply a function to all of them. In our case, lets compute the elapse time in minutes for all elapse columns.

R

dass_data %>% 
  mutate(
    across(
      ends_with("elapse"), # Select the columns you want
      ~ .x / 60, # Write the function to apply to all of them
      # Use this special function syntax ~ and a .x as the argument to be filled
      .names = "{.col}_minutes" # Give out new names by using the old name
      # which will take the place of {.col} and some new value.
    )
  ) %>% 
  select(contains("elapse")) %>% 
  head()

OUTPUT

  introelapse testelapse surveyelapse testelapse_min total_elapse_seconds
1          19        167          166       2.783333                  352
2           1        193          186       3.216667                  380
3           5        271          122       4.516667                  398
4           3        261          336       4.350000                  600
5        1766        164          157       2.733333                 2087
6           4        349          213       5.816667                  566
  introelapse_minutes testelapse_minutes surveyelapse_minutes
1          0.31666667           2.783333             2.766667
2          0.01666667           3.216667             3.100000
3          0.08333333           4.516667             2.033333
4          0.05000000           4.350000             5.600000
5         29.43333333           2.733333             2.616667
6          0.06666667           5.816667             3.550000

You don’t have to apply this straight away, and using this correctly takes quite a bit of practice. But I wanted you to know that this exists, so you might remember it when you need it.

Challenges

As you complete the challenges, make use of select() to conduct sanity checks on the newly created columns!

Challenge 1

Create and save a new column called participant_id that includes a unique number per entry in the data.

Challenge 2

Create a new column that includes the total time the participant took to complete all surveys in minutes! Afterwards, filter this data to exclude all participants who took more than 1 hour. Work with this dataset for all following challenges.

Challenge 3

Create a column that indicates whether participants completed a university degree or not.

Challenge 4

Use a new function called case_when() to create a column that has a translation from the numbers in the education column to the verbal represenation. Use ?case_when() to learn about this function. If that is not enough, try Google / ChatGPT. Write a short description of what it does (1-2 sentences in comments of script).

Challenge 5

The following code shows a common error when computing sums across columns.

R

# DO NOT USE THIS!
wrong_data <- dass_data %>% 
  mutate(wrong_elapse = sum(introelapse, testelapse, surveyelapse))

This code will execute without any errors and it looks correct. Can you figure out what went wrong here?

Challenge 6

Compute a column that contains the sum of all answers to the DASS-questionnaire. Use rowSums() and across().

Show me the solution

R

dass_data <- dass_data %>% 
  mutate(
    dass_score = rowSums(across(starts_with("Q") & ends_with("A")))
  )

Challenge 7

Compute a column that shows the mean response to the DASS-questionnaire. Visualize the distribution across the dataset.

Key Points

Use mutate() to create and modify columns in a dataset.
Assign constant values, compute values from other columns, or use conditions to define new columns.
Use ifelse() for conditional column creation.
Compute row-wise sums and means efficiently using rowSums() and rowMeans().