Vectors and variable types

Last updated on 2025-06-24

Estimated time: 25 minutes

Overview

Questions

  • How do you use scripts in R to document your work?
  • How can you organize scripts effectively?
  • What is a vector?

Objectives

  • Show how to use scripts to generate reproducible analyses
  • Explain what a vector is
  • Explain variable types
  • Show examples of sum() and mean()

Scripts in R


So far, we have only been using the R console to execute commands. This works great for quick calculations, but is unsuited for more complex analyses. If you want to reproduce the exact steps you took in data cleaning and analysis, you need to write them down, like a recipe.

This is where R scripts come in. They are plain text files (you can think of them as a very simple Word document) in which you write down all the steps you took. This makes it possible to retrace them and produce the exact same result over and over again.

In order to use R scripts effectively, you need to do two things. Firstly, you need to write them in a way that is understandable in the future. We will learn more about how to write clean code in future lessons. Secondly (and maybe more importantly), you need to actually save these scripts on your computer, ideally in a place where you can find them again. Both your scripts and especially your data should live in a sensibly named folder in a sensible location. For example, this script is saved in a sub-folder episodes/ of the workshop folder r_for_empra/. This makes it easy for humans to find the script. Aim for similar clarity in your folder structure!

For this lesson, create a folder called r_for_empra somewhere on your computer where it makes sense for you. Then, create a new R script by clicking File > New File > R Script or hitting Ctrl + Shift + N. Save this script as episode_scripts_and_vectors.R in the folder r_for_empra created above. Follow along with the episode and note down the commands in your script.

Using a script


Let’s start learning about some basic elements of R programming. We have already tried assigning values to variables using the <- operator. For example, we might assign the constant km_to_m the value 1000.

R

km_to_m <- 1000

Now, if we had data on distances in km, we could also use this value to compute the distance in meters.

R

distance_a_to_b_km <- 7.56

distance_a_to_b_m <- distance_a_to_b_km * km_to_m

If we have multiple distances and want to transform them from km to m at the same time, we can make use of a vector. A vector is just a collection of elements. We can create a vector using the function c() (for combine).
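As a minimal sketch of this idea, the block below converts several distances at once. The three distance values and the name distances_km are made up for illustration; only km_to_m comes from the example above.

```r
km_to_m <- 1000

# Hypothetical distances in km, combined into one vector with c()
distances_km <- c(7.56, 12.3, 0.8)

# Multiplying a vector by a single number multiplies every element
distances_m <- distances_km * km_to_m

distances_m  # roughly 7560, 12300 and 800 meters
```

The multiplication is applied to each element in turn, so there is no need to write one line per distance.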

Tip: Running segments of your code

RStudio offers you great flexibility in running code from within the editor window. There are buttons, menu choices, and keyboard shortcuts. To run the current line, you can 1. click on the Run button just above the editor panel, or 2. select “Run Lines” from the “Code” menu, or 3. hit Ctrl + Enter on Windows or Linux or Cmd + Enter on macOS. (This shortcut can also be seen by hovering the mouse over the button.) To run a block of code, select it and then run it via the button or the keyboard shortcut Ctrl + Enter.

Variable Types


Vectors can only contain values of the same type. There are two basic types of variables that you will have to interact with - numeric and character variables. Numeric variables are any numbers, character variables are bits of text.

R

# These are numeric variables:
vector_of_numeric_variables <- c(4, 7.642, 9e5, 1/97) # recall, 9e5 = 9*10^5

# Show the output
vector_of_numeric_variables

OUTPUT

[1] 4.000000e+00 7.642000e+00 9.000000e+05 1.030928e-02

R

# These are character variables:
vector_of_character_variables <- c("This is a character variable", "A second variable", "And another one")

# Show the output
vector_of_character_variables

OUTPUT

[1] "This is a character variable" "A second variable"
[3] "And another one"             

We can not only combine single elements into a vector, but also combine multiple vectors into one long vector.

R

numeric_vector_1 <- c(1, 2, 3, 4, 5)
numeric_vector_2 <- c(6:10) # 6:10 generates the values 6, 7, 8, 9, 10

combined_vector <- c(numeric_vector_1, numeric_vector_2)
combined_vector

OUTPUT

 [1]  1  2  3  4  5  6  7  8  9 10

Recall that all elements have to be of the same type. If you try to combine numeric vectors with character vectors, R will automatically convert everything to a character vector (as this has fewer restrictions: anything can be written as a character).

R

character_vector_1 <- c("6", "7", "8")
# Note that the numbers are now in quotation marks!
# They will be treated as characters, not numerics!

combining_numeric_and_character <- c(numeric_vector_1, character_vector_1)
combining_numeric_and_character

OUTPUT

[1] "1" "2" "3" "4" "5" "6" "7" "8"

We can fix this issue by converting the vectors to the same type. Note how the characters are also just numbers, but in quotation marks. Sometimes, this happens in real data, too. Some programs store every variable as a character (sometimes also called string). We then have to convert the numbers back to the number format:

R

converted_character_vector_1 <- as.numeric(character_vector_1)

combining_numeric_and_converted_character <- c(numeric_vector_1, converted_character_vector_1)

combining_numeric_and_converted_character

OUTPUT

[1] 1 2 3 4 5 6 7 8

But be careful, this conversion does not always work! If R does not know how to convert a specific character to a number, it will replace that entry with NA and issue a warning.

R

character_vector_2 <- c("10", "11", "text")
# The value "text", can not be interpreted as a number

as.numeric(character_vector_2)

WARNING

Warning: NAs introduced by coercion

OUTPUT

[1] 10 11 NA
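To find out which entries failed to convert, one option is to check the result with is.na(). This is only a sketch: the name converted_vector_2 is introduced here for illustration, and suppressWarnings() is used just to keep the output tidy.

```r
character_vector_2 <- c("10", "11", "text")

# Convert, silencing the coercion warning we already know about
converted_vector_2 <- suppressWarnings(as.numeric(character_vector_2))

# TRUE marks the positions that could not be converted
is.na(converted_vector_2)

# Show the original text at those positions
character_vector_2[is.na(converted_vector_2)]
```

Looking at the offending entries before deciding what to do with them is usually safer than ignoring the warning.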

Inspecting the type of a variable

You can use the function str() to learn about the structure of a variable. The first entry of the output tells us about the type of the variable.

R

str(vector_of_numeric_variables)

OUTPUT

 num [1:4] 4.00 7.64 9.00e+05 1.03e-02

This tells us that there is a numeric vector, hence num, with 4 elements.

R

str(vector_of_character_variables)

OUTPUT

 chr [1:3] "This is a character variable" "A second variable" ...

This tells us that there is a character vector, hence chr, with 3 elements.

R

str(1:5)

OUTPUT

 int [1:5] 1 2 3 4 5

Note that this prints int and not the usual num. This is because the : operator creates integers, that is, whole numbers. Integers are stored in a special type that takes up less memory, because no fractional part needs to be stored. You can treat an integer vector much like a numeric vector, and do all the same wonderful things with it!
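A small sketch of the difference between num and int follows. Whole numbers typed by hand are stored as num; the : operator and the L suffix produce int. The variable names are made up for illustration.

```r
# Whole numbers typed directly are still stored as numeric (num)
typed_whole_numbers <- c(1, 2, 3)
str(typed_whole_numbers)

# The : operator creates integers (int)
sequence_of_integers <- 1:3
str(sequence_of_integers)

# The L suffix also marks a value as an integer
str(c(1L, 2L, 3L))

# For everyday computations the two behave the same
mean(typed_whole_numbers) == mean(sequence_of_integers)
```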

Simple functions for vectors


Let’s use something more exciting than a sequence from 1 to 10 as an example vector. Here, we use the mtcars data that you already got to know in an earlier lesson. mtcars carries information about cars, like their name, fuel usage, weight, etc. This information is stored in a data frame. A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). In order to extract vectors from a data frame we can use the $ operator. data$column extracts a vector.

R

mtcars_weight_tons <- mtcars$wt 

# note that it is a good idea to include the unit in the variable name
# (strictly speaking, mtcars records wt in units of 1000 lbs; we loosely call these tons)
mtcars_weight_tons

OUTPUT

 [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070
[13] 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840
[25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780

Let’s start learning some basic information about this vector:

R

str(mtcars_weight_tons)

OUTPUT

 num [1:32] 2.62 2.88 2.32 3.21 3.44 ...

The vector is of type numeric and contains 32 entries. We can double check this by looking into our environment tab, where num [1:32] indicates just that. Similarly, we can get the length of a vector by using length():

R

length(mtcars_weight_tons)

OUTPUT

[1] 32

We can compute some basic descriptive information of the weight of cars in the mtcars data using base-R functions:

R

# Mean weight:
mean(mtcars_weight_tons)

OUTPUT

[1] 3.21725

R

# Median weight:
median(mtcars_weight_tons)

OUTPUT

[1] 3.325

R

# Standard deviation:
sd(mtcars_weight_tons)

OUTPUT

[1] 0.9784574

To get the minimum and maximum value, we can use min() and max().
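As a short sketch, these functions can be applied to the weight vector like this. range() is an additional base-R function, not mentioned above, that returns both values at once.

```r
mtcars_weight_tons <- mtcars$wt

# Smallest and largest weight in the data
min(mtcars_weight_tons)  # 1.513
max(mtcars_weight_tons)  # 5.424

# range() returns the minimum and the maximum together
range(mtcars_weight_tons)
```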

We can also get a quick overview of the weight distribution using hist(), which generates a simple histogram.

R

hist(mtcars_weight_tons)

Histograms

Histograms are more powerful tools than they seem at first glance. They allow you to quickly learn about the distribution of your data. This is especially important for us psychologists. Do we have ceiling effects in the task? Were some response options never used? How is response time distributed?

If your histogram is not detailed enough, try adjusting the breaks parameter. This tells hist() roughly how many bars to draw. R treats a single number as a suggestion and may pick a slightly different count to get round bin boundaries.

R

hist(mtcars_weight_tons, breaks = 30)

The golden rule for the number of breaks is simple: try it until it looks good! You are free to explore here.

Another useful function is unique(). This function removes duplicates and only returns unique values. I use it a lot in experimental data. Since every participant contributes multiple trials, but I am sometimes interested in values a participant only contributes once, I can use unique() to only retain each value once.

In the mtcars data, we might want to see which cylinder counts occur. Compared to the full column, unique(mtcars$cyl) is much easier to read at a glance.

R

mtcars$cyl

OUTPUT

 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

R

unique(mtcars$cyl)

OUTPUT

[1] 6 4 8

Tip: Unique + Length = Quick Count

Combining unique() and length() leads to a really useful property. This returns the number of unique entries in a vector. I use it almost daily! For example to figure out how many participants are in a given data frame. unique(data$participant_id) returns all the different participant ids and length(unique(data$participant_id)) then gives me the number of different ids, so the number of participants in my data.
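A minimal sketch of this pattern is shown below. The first part uses mtcars; the second part uses a small made-up data frame (trial_data and its columns are hypothetical) to mimic trial-level experimental data.

```r
# How many different cylinder counts occur in mtcars?
number_of_cylinder_counts <- length(unique(mtcars$cyl))
number_of_cylinder_counts  # 3

# Hypothetical trial-level data: each participant contributes several rows
trial_data <- data.frame(
  participant_id = c("p01", "p01", "p02", "p03", "p03", "p03"),
  response_time_ms = c(512, 634, 498, 701, 655, 590)
)

# Number of different participants, not number of rows
number_of_participants <- length(unique(trial_data$participant_id))
number_of_participants  # 3
```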

Indexing and slicing vectors


There is one more important thing to learn about vectors before we move on to more complicated data structures. Most often, you are going to use the whole vector for computation. But sometimes you may only want the first 150 entries, or only entries belonging to a certain group. Here, you must be able to access specific elements of the vector.

In the simplest case, you can access the \(n\)th element in a vector by using vector[n]. To access the first entry, use vector[1], and so on.

R

test_indexing_vector <- seq(1, 32, 1)
# seq() generates a sequence from the first argument (1) to the second argument (32)
# The size of the steps is given by the third argument

test_indexing_vector

OUTPUT

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32

R

test_indexing_vector[1]

OUTPUT

[1] 1

R

test_indexing_vector[c(1, 2)]

OUTPUT

[1] 1 2

You can also remove an element (or multiple) from a vector by using a negative sign -.

R

test_indexing_vector[-c(11:32)] # this removes all indexes from 11 to 32

OUTPUT

 [1]  1  2  3  4  5  6  7  8  9 10

The same rules apply for all types of vectors (numeric and character).

Tip: Think about the future

Excluding by index is a simple way to clean vectors. However, think about what happens if you accidentally run the command vector <- vector[-1] twice, instead of once?

The first element is going to be excluded twice. This means that the vector will lose two elements. In principle, it is a good idea to write code that you can run multiple times without causing unwanted issues. To achieve this, either use a different way to exclude the first element that we will talk about later, or simply assign the cleaned vector a new name:

R

cleaned_vector <- vector[-1]

Now, no matter how often you run this piece of code, cleaned_vector will always be vector without the first entry.
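The difference can be sketched with a small made-up vector. The names raw_vector and overwritten_vector are introduced here only for illustration.

```r
raw_vector <- c(10, 20, 30, 40)

# Overwriting: the result depends on how often the line is run
overwritten_vector <- raw_vector
overwritten_vector <- overwritten_vector[-1]
overwritten_vector <- overwritten_vector[-1]  # accidental second run
overwritten_vector  # 30 40: two elements were lost

# Assigning to a new name: running the line twice changes nothing
cleaned_vector <- raw_vector[-1]
cleaned_vector <- raw_vector[-1]  # accidental second run
cleaned_vector  # 20 30 40
```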

Filtering vectors


In most cases, you don’t know the position of the element you want to exclude. You might know that some values are impossible or not of interest, but don’t know where they are. For example, the accuracy vector of a response time task might look like this:

R

accuracy_coded <- c(0, 1, 1, 1, 1, -2, 99, 1, 11, 0)
accuracy_coded

OUTPUT

 [1]  0  1  1  1  1 -2 99  1 11  0

-2 and 99 are often used to indicate invalid button-presses or missing responses. 11 in this case is a wrong entry that should be corrected to 1.

If we were to compute the average accuracy now using mean(), we would get a wrong result.

R

mean(accuracy_coded)

OUTPUT

[1] 11.3

Therefore, we need to exclude all invalid values before continuing our analysis. Before I show you how I would go about doing this, think about it yourself. You do not need to know the code, yet. Just think about some rules and steps that need to be taken to clean up the accuracy vector. What comparisons could you use?

Important: The silent ones are the worst

The above example illustrates something very important. R will not throw an error every time you do something that doesn’t make sense. You should be really careful of “silent” errors. The mean() function above works exactly as intended, but returns a completely nonsensical value. You should always conduct sanity checks. Can mean accuracy be higher than 1? How many entries am I expecting in my data?
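One possible sanity check, sketched here with functions from earlier in this lesson plus range(), is to look at the smallest and largest value and at the number of entries before trusting a summary statistic.

```r
accuracy_coded <- c(0, 1, 1, 1, 1, -2, 99, 1, 11, 0)

# Accuracy should only be 0 or 1, so both ends of the range are suspicious
range(accuracy_coded)  # -2 99

# Do we have as many entries as we expect?
length(accuracy_coded)  # 10
```

Checks like these do not fix anything by themselves, but they make silent problems visible early.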

Now, back to the solution to the wacky accuracy data. Note that R gives us the opportunity to do things in a million different ways. If you came up with something different from what was presented here, great! Make sure it works and if so, use it!

First, we need to recode the wrong entry. That 11 was supposed to be a 1, but someone entered the wrong value in the Excel sheet. To do this, we can find the index, or the element number, where accuracy is 11. Then, we can replace that entry with 1.

R

index_where_accuracy_11 <- which(accuracy_coded == 11)

accuracy_coded[index_where_accuracy_11] <- 1
# or in one line:
# accuracy_coded[which(accuracy_coded == 11)] <- 1

accuracy_coded

OUTPUT

 [1]  0  1  1  1  1 -2 99  1  1  0

Now, we can simply exclude all values that are not equal to 1 or 0. We do this using the - operator together with which():

R

accuracy_coded[-which(accuracy_coded != 0 & accuracy_coded != 1)]

OUTPUT

[1] 0 1 1 1 1 1 1 0
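An alternative sketch, not used elsewhere in this lesson, keeps the valid values with a logical vector and the %in% operator. It also sidesteps a pitfall of -which(): if no value matches the condition, -which() returns an empty vector instead of the full one. The names is_valid and all_valid are made up for illustration.

```r
accuracy_coded <- c(0, 1, 1, 1, 1, -2, 99, 1, 1, 0)

# TRUE where the value is 0 or 1, FALSE otherwise
is_valid <- accuracy_coded %in% c(0, 1)
valid_accuracy <- accuracy_coded[is_valid]
valid_accuracy

# Pitfall: a vector in which every value is already valid
all_valid <- c(0, 1, 1)
with_minus_which <- all_valid[-which(all_valid != 0 & all_valid != 1)]
with_minus_which  # empty: nothing matched, so nothing is kept

with_logical <- all_valid[all_valid %in% c(0, 1)]
with_logical  # 0 1 1, as intended
```

Either way, the invalid entries are dropped from the result.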

However, note that excluding values reduces the number of entries we have in our vector. This may not always be advisable. Therefore, it is often better to replace invalid values with NA. The value NA (not available) indicates that a value is missing.

R

# Note that now we are not using the - operator
# We want to replace values that are not 0 or 1
accuracy_coded[which(accuracy_coded != 0 & accuracy_coded != 1)] <- NA

Now, we can finally compute the average accuracy in our fictional experiment.

R

mean(accuracy_coded)

OUTPUT

[1] NA

Challenges


Challenge 1:

We did not get a number in the above code. Figure out why and how to fix this. You are encouraged to seek help on the internet.

Challenge 2:

Below is a vector of response times. Compute the mean response time, the standard deviation, and get the number of entries in this vector.

R

response_times_ms <- c(
  230.7298, 620.6292, 188.8168, 926.2940, 887.4730,
  868.6299, 834.5548, 875.2154, 239.7057, 667.3095,
  -142.891, 10000, 876.9879
  )

Challenge 3:

There might be some wrong values in the response times vector from above. Use hist() to inspect the distribution of response times. Figure out which values are implausible and exclude them. Recompute the mean response time, standard deviation, and the number of entries.

Challenge 4:

Get the average weight (column wt) of cars in the mtcars data. Can you spot any outliers in the histogram?

Exclude the “outlier” and rerun the analyses.

Challenge 5:

Get the mean values of responses to this fictional questionnaire:

R

item_15_responses <- c("1", "3", "2", "1", "2", "none", "4", "1", "3", "2", "1", "1", "none")

Challenge 6 (difficult):

Compute the average of responses that are valid as indicated by the vector is_response_valid:

R

response_values <- c(91, 82, 69, 74, 62, 19, 84, 61, 92, 53)
is_response_valid <- c("yes", "yes", "yes", "yes", "yes",
                       "no", "yes", "yes", "yes", "no")

R

valid_responses <- response_values[which(is_response_valid == "yes")]
mean(valid_responses)

OUTPUT

[1] 76.875

Key Points

  • Scripts facilitate reproducible research
  • Create vectors using c()
  • Numeric variables are used for computations, character variables often contain additional information
  • You can index vectors by using vector[index] to return or exclude specific indices
  • Use which() to filter vectors based on specific conditions