Creating new columns
Last updated on 2025-06-24 | Edit this page
Estimated time: 30 minutes
Overview
Questions
- What function is used to create new columns in a dataset?
- How can you generate a unique ID column for each row?
- How can you modify an existing column?
- How can you compute an average value over many columns in your data?
Objectives
- Understand how to use
mutate()
to create new columns. - Learn how to add constant values, computed values, and conditionally assigned values to new columns.
- Use
ifelse()
to create categorical columns based on conditions. - Learn efficient methods for computing row-wise sums and averages
using
rowSums()
androwMeans()
. - Understand the difference between
sum()
androwSums()
for column-wise and row-wise operations.
Making new columns
So far we have dealt with existing columns and worked on them. We
have computed descriptive statistics, plotted basic overviews of data
and cut the data just the way we like it using filter()
and
select()
. Now, we will unlock a new power… the power of
creation! Muhahahahaha.
Well, actually it’s just adding new columns to the dataset, but
that’s still cool! Let’s start by reading in our DASS data again and
loading the package dplyr
.
R
library(dplyr)
OUTPUT
Attaching package: 'dplyr'
OUTPUT
The following objects are masked from 'package:stats':
filter, lag
OUTPUT
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
R
dass_data <- read.csv("data/kaggle_dass/data.csv")
To add a new column, we need the dplyr
function
mutate()
. It changes - or mutates - columns in the
dataset, like adding entirely new ones. The syntax is quite simple. The
first argument is the data, allowing easy use of pipes, and the second
argument is the code for the new column you want to have.
For example, if we want to create a new column that carries just a single value, we type the following:
R
dass_data <- dass_data %>%
mutate(single_value_column = 42)
To check if this worked:
R
unique(dass_data$single_value_column)
OUTPUT
[1] 42
Saving progress
Notice how we had to assign the dataset to itself in order to save the progress we have made. If you do not do this, and just write down the pipeline, the end result will be outputted to the console, not stored in an object that we can use later.
R
dass_data %>%
head(6) %>%
mutate(some_column = "I have a value!") %>%
select(country, some_column)
OUTPUT
country some_column
1 IN I have a value!
2 US I have a value!
3 PL I have a value!
4 US I have a value!
5 MY I have a value!
6 US I have a value!
You can easily assign the new data to something with the same name,
overriding the previous status. But be careful with this, I recommend
you use new names for the data after filtering, selecting, and mutating
existing columns. These operations change the original data (they remove
some information) and should thus be stored in a new object, like
filtered_dass_data
.
We can do much more interesting things than just placing a constant
in the new column. For example, we can use row_number()
to
create a running number of entries in our data - an id of sort.
R
dass_data <- dass_data %>%
mutate(entry_id = row_number())
dass_data %>%
select(country, testelapse, entry_id) %>%
head()
OUTPUT
country testelapse entry_id
1 IN 167 1
2 US 193 2
3 PL 271 3
4 US 261 4
5 MY 164 5
6 US 349 6
Mutating existing columns
We can also base our mutation on existing columns.
testelapse
is in seconds right now, let’s add a new column
that is the elapsed test time in minutes.
R
dass_data <- dass_data %>%
mutate(
testelapse_min = testelapse / 60
)
mean(dass_data$testelapse)
OUTPUT
[1] 2684.843
R
mean(dass_data$testelapse_min)
OUTPUT
[1] 44.74738
Another function I often use when creating new columns is
ifelse()
. It works using three arguments.
R
a = 5
b = 3
ifelse(
a > b, # The condition
"a is greater", # IF yes
"a is smaller" # IF no
)
OUTPUT
[1] "a is greater"
We can use this to make a new column that tells us whether an entry came from the United States.
R
dass_data <- dass_data %>%
mutate(is_from_us = ifelse(country == "US", "yes", "no"))
table(dass_data$is_from_us)
OUTPUT
no yes
31566 8207
Adding existing columns
One of the most important uses of mutate()
is adding
together multiple columns. In our data, we have three columns indicating
the length of elapsed time during survey completion,
introelapse
, testelapse
and
surveyelapse
.
We can compute the sum of these three columns to get the total amount of elapsed time.
R
dass_data <- dass_data %>%
mutate(
total_elapse_seconds = introelapse + testelapse + surveyelapse
)
Hint: Conduct Sanity Checks
When programming, you will make mistakes. There just is not a way around it. The only way to deal with it is trying to catch your mistakes as early as possible. Sometimes the computer will help you out by issuing an error or a warning. Sometimes it does not, because the code you wrote is technically valid but still does not exactly do what you want it to.
Therefore, it is important to conduct sanity checks! Try looking at the data and figure out if what you just did makes sense. In our case, let’s inspect the new column:
R
dass_data %>%
select(introelapse, testelapse, surveyelapse, total_elapse_seconds) %>%
head()
OUTPUT
introelapse testelapse surveyelapse total_elapse_seconds
1 19 167 166 352
2 1 193 186 380
3 5 271 122 398
4 3 261 336 600
5 1766 164 157 2087
6 4 349 213 566
This looks good so far!
Now, this approach works well for a few columns, but gets tedious
quickly when attempting to compute the sum or average over multiple
columns. For this there exists a special fuction called
rowSums()
. This function takes a dataset, and computes the
sums across all columns.
R
example_data <- data.frame(
x = 1:4,
y = 2:5,
z = 3:6
)
example_data$sum <- rowSums(example_data)
example_data
OUTPUT
x y z sum
1 1 2 3 6
2 2 3 4 9
3 3 4 5 12
4 4 5 6 15
However, in our case we do not want a sum across all columns
of our data, but just some of the columns! To do this, we need to use
across()
inside the rowSums
function to select
specific columns.
R
dass_data %>%
mutate(elapse_sum = rowSums(across(c("introelapse", "testelapse", "surveyelapse")))) %>%
select(introelapse, testelapse, surveyelapse, elapse_sum) %>%
head()
OUTPUT
introelapse testelapse surveyelapse elapse_sum
1 19 167 166 352
2 1 193 186 380
3 5 271 122 398
4 3 261 336 600
5 1766 164 157 2087
6 4 349 213 566
Because across()
also accepts the same input as
select()
we can use the special functions like
starts_with()
, ends_with()
or
contains()
.
R
dass_data %>%
mutate(elapse_sum = rowSums(across(ends_with("elapse")))) %>%
select(ends_with("elapse"), elapse_sum) %>%
head()
OUTPUT
introelapse testelapse surveyelapse elapse_sum
1 19 167 166 352
2 1 193 186 380
3 5 271 122 398
4 3 261 336 600
5 1766 164 157 2087
6 4 349 213 566
Similarly, we can use rowMeans()
to compute the average
test duration. The beautiful thing about mutate()
is, that
we can create multiple columns in one function call!
R
dass_data %>%
mutate(
constant_value = "I am a value!",
id = row_number(),
elapse_sum = rowSums(across(ends_with("elapse"))),
elapse_mean = rowMeans(across(ends_with("elapse")))
) %>%
select(id, constant_value, ends_with("elapse"), elapse_sum, elapse_mean) %>%
head()
OUTPUT
id constant_value introelapse testelapse surveyelapse elapse_sum elapse_mean
1 1 I am a value! 19 167 166 352 117.3333
2 2 I am a value! 1 193 186 380 126.6667
3 3 I am a value! 5 271 122 398 132.6667
4 4 I am a value! 3 261 336 600 200.0000
5 5 I am a value! 1766 164 157 2087 695.6667
6 6 I am a value! 4 349 213 566 188.6667
Pro Tip: mutate() + across()
Sometimes, you want to mutate multiple columns at once, applying the
same function to all of them. For example, when you want to recode some
values across all columns that share the incorrect format. Here, you can
use mutate()
in combination with across()
to
apply a function to all of them. In our case, lets compute the elapse
time in minutes for all elapse columns.
R
dass_data %>%
mutate(
across(
ends_with("elapse"), # Select the columns you want
~ .x / 60, # Write the function to apply to all of them
# Use this special function syntax ~ and a .x as the argument to be filled
.names = "{.col}_minutes" # Give out new names by using the old name
# which will take the place of {.col} and some new value.
)
) %>%
select(contains("elapse")) %>%
head()
OUTPUT
introelapse testelapse surveyelapse testelapse_min total_elapse_seconds
1 19 167 166 2.783333 352
2 1 193 186 3.216667 380
3 5 271 122 4.516667 398
4 3 261 336 4.350000 600
5 1766 164 157 2.733333 2087
6 4 349 213 5.816667 566
introelapse_minutes testelapse_minutes surveyelapse_minutes
1 0.31666667 2.783333 2.766667
2 0.01666667 3.216667 3.100000
3 0.08333333 4.516667 2.033333
4 0.05000000 4.350000 5.600000
5 29.43333333 2.733333 2.616667
6 0.06666667 5.816667 3.550000
You don’t have to apply this straight away, and using this correctly takes quite a bit of practice. But I wanted you to know that this exists, so you might remember it when you need it.
Challenges
As you complete the challenges, make use of select()
to
conduct sanity checks on the newly created columns!
Challenge 1
Create and save a new column called participant_id that includes a unique number per entry in the data.
Challenge 2
Create a new column that includes the total time the participant took to complete all surveys in minutes! Afterwards, filter this data to exclude all participants who took more than 1 hour. Work with this dataset for all following challenges.
Challenge 3
Create a column that indicates whether participants completed a university degree or not.
Challenge 4
Use a new function called case_when()
to create a column
that has a translation from the numbers in the education
column to the verbal represenation. Use ?case_when()
to
learn about this function. If that is not enough, try Google / ChatGPT.
Write a short description of what it does (1-2 sentences in comments of
script).
Challenge 5
The following code shows a common error when computing sums across columns.
R
# DO NOT USE THIS!
wrong_data <- dass_data %>%
mutate(wrong_elapse = sum(introelapse, testelapse, surveyelapse))
This code will execute without any errors and it looks correct. Can you figure out what went wrong here?
Challenge 6
Compute a column that contains the sum of all answers to the
DASS-questionnaire. Use rowSums()
and
across()
.
R
dass_data <- dass_data %>%
mutate(
dass_score = rowSums(across(starts_with("Q") & ends_with("A")))
)
Challenge 7
Compute a column that shows the mean response to the DASS-questionnaire. Visualize the distribution across the dataset.
Key Points
- Use
mutate()
to create and modify columns in a dataset. - Assign constant values, compute values from other columns, or use conditions to define new columns.
- Use
ifelse()
for conditional column creation. - Compute row-wise sums and means efficiently using
rowSums()
androwMeans()
.