R Cheat Sheet Tidyverse


Approximate time: 75 minutes
rstudio::conf 2019. Using R, the Tidyverse, H2O, and Shiny to reduce employee attrition. January 25, 2019. An organization that loses 200 high-performing employees per year has a lost productivity cost of about $15M/year.
The goal of readr is to provide a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes. If you are new to readr, the best place to start is the data import chapter in R for Data Science.
The Tidyverse suite of integrated packages is designed to work together to make common data science operations more user-friendly. The packages have functions for data wrangling, tidying, reading/writing, parsing, and visualizing, among others. There is a freely available book, R for Data Science, with detailed descriptions and practical examples of the tools available and how they work together. We will explore the basic syntax for working with these packages, as well as specific functions for data wrangling with the 'dplyr' package, data tidying with the 'tidyr' package, and data visualization with the 'ggplot2' package.
All of these packages use the same style of code, which is snake_case formatting for all function names and arguments. The tidy style guide is available for perusal.
Adding files to your working directory
We have three files that we need to bring in for this lesson:
A normalized counts file (gene expression counts normalized for library size)
A metadata file corresponding to the samples in our normalized counts dataset
The differential expression results output from our DE analysis using DESeq2
Download the files to the data folder by right-clicking the links below:
Normalized counts file: right-click here
Differential expression results: right-click here
Choose to Save Link As or Download Linked File As and navigate to your Visualizations-in-R/data folder. You should now see the files appear in the data folder in the RStudio file directory.
Reading in the data files
Let's read in all of the files we have downloaded:
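The lesson's actual file names are not shown here, so the following is only a minimal sketch of reading a delimited file with base R. The file example_counts.csv and its contents are hypothetical and are written to a temporary directory so that the snippet runs on its own; substitute the path to a file in your data folder.

```r
# Hypothetical stand-in for a downloaded counts file (not the lesson's real data)
counts_path <- file.path(tempdir(), "example_counts.csv")
writeLines(
  c("gene,sample1,sample2",
    "geneA,10.5,12.1",
    "geneB,0,3.2"),
  counts_path
)

# row.names = 1 uses the first column (gene IDs) as the row names
counts <- read.csv(counts_path, row.names = 1)

# Quick look at the first rows
head(counts)
```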
Tidyverse basics
Because it is difficult to change how fundamental base R structures and functions work, the Tidyverse suite of packages creates and uses its own data structures, functions and operators to make working with data more intuitive. The two most basic changes are in the use of pipes and tibbles.
Pipes

Stringing together commands in R can be quite daunting. Also, trying to understand code that has many nested functions can be confusing.
To make R code more human readable, the Tidyverse tools use the pipe, %>%, which comes from the 'magrittr' package and is installed automatically with the Tidyverse. The pipe allows the output of a previous command to be used as input to another command instead of using nested functions.
NOTE: In RStudio, the keyboard shortcut to write the pipe is Shift + Command + M on macOS (Ctrl + Shift + M on Windows and Linux).
An example of using the pipe to run multiple commands:
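As a minimal sketch, the same calculation can be written as a nested call and as a piped chain. The number 83 is an arbitrary illustrative value, and magrittr is loaded directly here (loading the tidyverse would also make %>% available).

```r
library(magrittr)  # provides the %>% operator

# Nested call: has to be read from the inside out
nested <- round(sqrt(83), digits = 2)

# Piped version: the output of sqrt() becomes the first argument of round()
piped <- sqrt(83) %>% round(digits = 2)

# Both approaches give the same result
identical(nested, piped)
```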

The pipe represents a much easier way of writing and deciphering R code, and we will be taking advantage of it for all future activities.

Exercises
Extract the replicate column from the metadata data frame (use the $ notation) and save the values to a vector named rep_number.
Use the pipe (%>%) to perform two steps in a single line:
Turn rep_number into a factor.
Use the head() function to return the first six values of the rep_number factor.
Tibbles

A core component of the tidyverse is the tibble. Tibbles are a modern rework of the standard data.frame, with some internal improvements to make code more reliable. They are data frames, but do not follow all of the same rules. For example, tibbles can have column names that are not normally allowed, such as numbers/symbols.
Important: the tidyverse is very opinionated about row names. These packages insist that all column data (e.g. in a data.frame) be treated equally, and that the special designation of a column as rownames should be deprecated. The tibble package provides simple utility functions to handle rownames: rownames_to_column() and column_to_rownames(). More help for dealing with row names can be found in the tibble package documentation.
Tibbles can be created directly using the tibble() function or data frames can be converted into tibbles using as_tibble(name_of_df).
NOTE: The function as_tibble() will ignore row names, so if a column representing the row names is needed, then the function rownames_to_column(name_of_df) should be run prior to turning the data.frame into a tibble. Also, as_tibble() will not coerce character vectors to factors by default.
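A short sketch of both routes, using toy vectors and the mtcars dataset that ships with R (neither is part of this lesson's data; the object names toy_tbl and cars_tbl and the column name "car" are chosen only for illustration):

```r
library(tibble)
library(magrittr)

# Route 1: build a tibble directly from vectors (toy values)
toy_tbl <- tibble(id = c("a", "b", "c"), value = c(1.5, 2.5, 3.5))

# Route 2: convert an existing data.frame.
# mtcars stores the car names as row names, so move them into a
# regular column first, then convert to a tibble.
cars_tbl <- mtcars %>%
  rownames_to_column(var = "car") %>%
  as_tibble()

cars_tbl
```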
Exercises
Create a tibble called df_tibble using the tibble() function to combine the vectors species and glengths.
Change the metadata data frame to a tibble called meta_tibble. Use the rownames_to_column() function to preserve the rownames combined with using %>% and the as_tibble() function.
Differences between tibbles and data.frames
The main differences between tibbles and data.frames relate to printing and subsetting.
Printing
A nice feature of a tibble is that when printing a variable to screen, it will show only the first 10 rows and the columns that fit to the screen by default. This is nice since you don't have to call head() to take a quick look at your dataset. If you want to view more of the dataset, the print() function can change the number of rows or columns displayed.
Subsetting
When subsetting base R data.frames, the default behavior is to simplify the output to the simplest data structure. Therefore, if subsetting a single column from a data.frame with [ ], R will output a vector (unless drop=FALSE is specified). In contrast, subsetting a single column of a tibble with [ ] will by default return another tibble, not a vector; use $ or [[ ]] to extract the column as a vector.
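For instance, assuming a tibble made from the built-in mtcars dataset (used here only for illustration), the n and width arguments of print() control how much is shown:

```r
library(tibble)

cars_tbl <- as_tibble(mtcars)

# Default printing: first 10 rows and only the columns that fit the console
cars_tbl

# Show the first 15 rows instead
print(cars_tbl, n = 15)

# Show all columns, wrapping them across the console if necessary
print(cars_tbl, width = Inf)
```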
Due to this behavior, some older functions do not work with tibbles, so if you need to convert a tibble to a data.frame, the function as.data.frame(name_of_tibble) will easily convert it.
Also note that if you use piping to subset a data frame, then the notation is slightly different, requiring a placeholder . prior to the [ ] or $.
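The subsetting behavior described above can be checked with a small sketch. It uses the built-in mtcars dataset rather than the lesson's data, and all object names are illustrative.

```r
library(tibble)
library(magrittr)

cars_df  <- mtcars
cars_tbl <- as_tibble(mtcars)

# Single-column subsetting with [ ]
df_col  <- cars_df[, "mpg"]    # data.frame: simplified to a numeric vector
tbl_col <- cars_tbl[, "mpg"]   # tibble: still a one-column tibble

# $ returns the column as a vector for both structures
tbl_vec <- cars_tbl$mpg

# With the pipe, put the placeholder . before [ ] or $
piped_cols <- cars_tbl %>% .[, 1:2]
piped_vec  <- cars_tbl %>% .$mpg

# If an older function needs a plain data.frame, convert back
back_to_df <- as.data.frame(cars_tbl)
```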
Tidyverse tools
While all of the tools in the Tidyverse suite are deserving of being explored in more depth, we are going to investigate only the tools we will be using most for data wrangling and tidying.
Dplyr
The most useful tool in the tidyverse is dplyr. It's a Swiss Army knife for data wrangling. dplyr has many handy functions that we recommend incorporating into your analysis:
select() extracts columns and returns a tibble.
arrange() changes the ordering of the rows.
filter() picks cases based on their values.
mutate() adds new variables that are functions of existing variables.
rename() changes the name of one or more columns.
summarise() reduces multiple values down to a single summary.
pull() extracts a single column as a vector.
_join() is a group of functions that merge two data frames together: inner_join(), left_join(), right_join(), and full_join().
Note: dplyr underwent a massive revision in 2017, switching versions from 0.5 to 0.7. If you consult other dplyr tutorials online, note that many materials developed prior to 2017 are no longer correct. In particular, this applies to writing functions with dplyr (see the Programming notes section below).
select()
To extract columns from a tibble we can use select().
Conversely, you can remove columns you don't want with negative selection.
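A minimal sketch of both forms of selection. The tibble res_tbl and its values are hypothetical stand-ins for a differential expression results table; the column names are assumptions for illustration.

```r
library(dplyr)

# Hypothetical DE results (toy values, not the lesson's data)
res_tbl <- tibble(
  symbol         = c("geneA", "geneB", "geneC", "geneD"),
  baseMean       = c(150.2, 20.5, 0, 980),
  log2FoldChange = c(2.1, -0.4, NA, -3.0),
  padj           = c(0.0005, 0.2, NA, 0.00001)
)

# Keep only the named columns
kept <- res_tbl %>% select(symbol, baseMean, padj)

# Negative selection: drop a column and keep everything else
dropped <- res_tbl %>% select(-log2FoldChange)
```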
arrange()
Note that the rows are sorted by the gene symbol. Let's fix that and sort them by adjusted P value instead with arrange().
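Sorting by adjusted P value might look like the sketch below, again with a hypothetical res_tbl of toy values. Note that arrange() places missing values last.

```r
library(dplyr)

# Hypothetical DE results, initially sorted by gene symbol (toy values)
res_tbl <- tibble(
  symbol   = c("geneA", "geneB", "geneC", "geneD"),
  baseMean = c(150.2, 20.5, 0, 980),
  padj     = c(0.0005, 0.2, NA, 0.00001)
)

# Sort rows by ascending adjusted P value; rows with NA go to the end
res_sorted <- res_tbl %>% arrange(padj)

res_sorted
```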
filter()
Let's keep only genes that are expressed (baseMean above 0) with an adjusted P value below 0.01. You can perform multiple filter() operations together in a single command.
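A sketch of that filter on a hypothetical res_tbl (toy values). Conditions separated by commas are combined with AND, and rows where a condition evaluates to NA are dropped.

```r
library(dplyr)

# Hypothetical DE results (toy values, not the lesson's data)
res_tbl <- tibble(
  symbol   = c("geneA", "geneB", "geneC", "geneD"),
  baseMean = c(150.2, 20.5, 0, 980),
  padj     = c(0.0005, 0.2, NA, 0.00001)
)

# Keep expressed genes with adjusted P value below 0.01
res_sig <- res_tbl %>% filter(baseMean > 0, padj < 0.01)

res_sig
```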
mutate()
mutate() enables you to create a new column from an existing column. Let's generate log10 calculations of our baseMeans for each gene.
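As an illustration, with a hypothetical tibble and a new column name (log10BaseMean) chosen for this sketch:

```r
library(dplyr)

# Hypothetical DE results with round baseMean values (toy data)
res_tbl <- tibble(
  symbol   = c("geneA", "geneB", "geneD"),
  baseMean = c(100, 10, 1000)
)

# Add a column computed from an existing one; the original column is kept
res_log <- res_tbl %>% mutate(log10BaseMean = log10(baseMean))

res_log
```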
rename()
You can quickly rename an existing column with rename(). The syntax is new_name = old_name.
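For example, on a hypothetical tibble where the column symbol is renamed to gene (both names are illustrative):

```r
library(dplyr)

# Hypothetical DE results (toy values)
res_tbl <- tibble(
  symbol = c("geneA", "geneB"),
  padj   = c(0.0005, 0.2)
)

# new_name = old_name
res_renamed <- res_tbl %>% rename(gene = symbol)

names(res_renamed)
```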
summarise()
You can perform column summarization operations with summarise().
Advanced: summarise() is particularly powerful in combination with the group_by() function, which allows you to group related rows together.
Note: summarize() also works if you prefer to use American English. This applies across the board to any tidy functions, including in ggplot2 (e.g. color in place of colour).
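A small sketch of an overall summary and a grouped summary. The tibble expr_tbl, its condition grouping and its counts are hypothetical toy data.

```r
library(dplyr)

# Hypothetical expression values for two conditions (toy data)
expr_tbl <- tibble(
  condition = c("control", "control", "treated", "treated"),
  count     = c(10, 20, 30, 50)
)

# One summary value for the whole table
overall <- expr_tbl %>% summarise(mean_count = mean(count))

# One summary value per group
by_condition <- expr_tbl %>%
  group_by(condition) %>%
  summarise(mean_count = mean(count))

by_condition
```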
pull()
In the dplyr 0.7 update, pull() was added as a quick way to access column data as a vector. This is very handy in chain operations with the pipe operator.
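For instance, pull() can end a piped chain to return a plain vector rather than a tibble. The data below are hypothetical toy values.

```r
library(dplyr)

# Hypothetical DE results (toy values)
res_tbl <- tibble(
  symbol = c("geneA", "geneB", "geneD"),
  padj   = c(0.0005, 0.2, 0.00001)
)

# Filter, then extract the gene symbols as a character vector
sig_genes <- res_tbl %>%
  filter(padj < 0.01) %>%
  pull(symbol)

sig_genes
```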
_join()
dplyr has a powerful group of join operations, which join together a pair of data frames based on a variable or set of variables present in both data frames that uniquely identify all observations. These variables are called keys.
inner_join: Only the rows with keys present in both datasets will be joined together.
left_join: Keeps all the rows from the first dataset, whether or not they have a match in the second dataset, and adds the columns from the second for rows whose keys match.
right_join: Keeps all the rows from the second dataset, whether or not they have a match in the first dataset, and adds the columns from the first for rows whose keys match.
full_join: Keeps all rows in both datasets. Rows without matching keys will have NA values for those variables from the other dataset.
To practice with the join functions, we can use a couple of built-in R datasets.
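The text does not name the datasets it uses, so this sketch assumes the small band_members and band_instruments example datasets that ship with dplyr. Both share the key column name.

```r
library(dplyr)

# band_members:     Mick, John, Paul  (name, band)
# band_instruments: John, Paul, Keith (name, plays)

# Only names present in both tables
inner <- inner_join(band_members, band_instruments, by = "name")

# All rows of band_members; plays is NA where there is no match
left <- left_join(band_members, band_instruments, by = "name")

# All rows of band_instruments; band is NA where there is no match
right <- right_join(band_members, band_instruments, by = "name")

# Every name from either table
full <- full_join(band_members, band_instruments, by = "name")
```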
Tidyr
The purpose of tidyr is to produce well-organized, or tidy, data, which the Tidyverse defines as having:
Each variable in a column
Each observation in a row
Each value as a cell
There are two main functions in Tidyr, gather() and spread(). These functions allow for conversion between long data format and wide data format. The downstream use of the data will determine which format is required.
gather()
The gather() function changes a wide data format into a long data format. This function is particularly helpful when using 'ggplot2' to get all of the values to plot into a single column.
To use this function, you need to give the columns in the data frame you would like to gather together as a single column. Then, provide a name to give the column where all of the column names will be present using the key argument, and the name to give the column where all of the values will be present using the value argument.
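A minimal sketch of gather() on a hypothetical wide table of counts. The object names, column names and values are illustrative only.

```r
library(tidyr)

# Hypothetical wide data: one column per sample (toy values)
wide_counts <- tibble::tibble(
  gene    = c("geneA", "geneB"),
  sample1 = c(10, 0),
  sample2 = c(12, 3)
)

# key   = name of the new column holding the old column names
# value = name of the new column holding the cell values
# The remaining arguments are the columns to gather
long_counts <- gather(wide_counts, key = "sample", value = "count",
                      sample1, sample2)

long_counts
```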
spread()
The spread() function is the reverse of the gather() function. The categories of the key column will become separate columns, and the values in the value column are split across the associated key columns.
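Going the other way, a sketch of spread() on a hypothetical long table (toy values; names are illustrative):

```r
library(tidyr)

# Hypothetical long data: one row per gene and sample (toy values)
long_counts <- tibble::tibble(
  gene   = c("geneA", "geneB", "geneA", "geneB"),
  sample = c("sample1", "sample1", "sample2", "sample2"),
  count  = c(10, 0, 12, 3)
)

# Each distinct value of the key column becomes its own column,
# filled with the matching entries from the value column
wide_counts <- spread(long_counts, key = sample, value = count)

wide_counts
```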
Programming notes
Underneath the hood, tidyverse packages build upon the base R language using rlang, which is a complete rework of how functions handle variable names and evaluate arguments. This is achieved through the tidyeval framework, which interprets command operations using tidy evaluation. This is outside of the scope of the course, but is explained in detail in the Programming with dplyr vignette, in case you'd like to understand how these new tools behave differently from base R.
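To give a flavor of what tidy evaluation means in practice, here is a small sketch of a function that forwards column names to dplyr verbs. It assumes a reasonably recent dplyr and rlang, where the embrace operator {{ }} is available (it was introduced after the dplyr 0.7 release discussed above). The helper mean_by() is hypothetical, and mtcars is used only as convenient built-in data.

```r
library(dplyr)

# Hypothetical helper: group by one column and average another.
# {{ }} passes the unquoted column names through to dplyr.
mean_by <- function(data, group_col, value_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(mean_value = mean({{ value_col }}))
}

# Column names are supplied unquoted, just as in a direct dplyr call
mpg_by_cyl <- mean_by(mtcars, cyl, mpg)

mpg_by_cyl
```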
pivot_longer() 'lengthens' data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider().
Learn more in vignette('pivot').
Arguments
data: A data frame to pivot.
cols: <tidy-select> Columns to pivot into longer format.
names_to: A string specifying the name of the column to create from the data stored in the column names of data.
Can be a character vector, creating multiple columns, if names_sep or names_pattern is provided. In this case, there are two special values you can take advantage of:
NA will discard that component of the name.
.value indicates that component of the name defines the name of the column containing the cell values, overriding values_to.
names_prefix: A regular expression used to remove matching text from the start of each variable name.
names_sep, names_pattern: If names_to contains multiple values, these arguments control how the column name is broken up.
names_sep takes the same specification as separate(), and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on).
names_pattern takes the same specification as extract(), a regular expression containing matching groups (()).
If these arguments do not give you enough control, use pivot_longer_spec() to create a spec object and process manually as needed.
names_ptypes, values_ptypes: A list of column name-prototype pairs. A prototype (or ptype for short) is a zero-length vector (like integer() or numeric()) that defines the type, class, and attributes of a vector. Use these arguments if you want to confirm that the created columns are the types that you expect. Note that if you want to change (instead of confirm) the types of specific columns, you should use names_transform or values_transform instead.
names_transform, values_transform: A list of column name-function pairs. Use these arguments if you need to change the types of specific columns. For example, names_transform = list(week = as.integer) would convert a character variable called week to an integer.
If not specified, the type of the columns generated from names_to will be character, and the type of the variables generated from values_to will be the common type of the input columns used to generate them.
names_repair: What happens if the output has invalid column names? The default, 'check_unique', is to error if the columns are duplicated. Use 'minimal' to allow duplicates in the output, or 'unique' to de-duplicate by adding numeric suffixes. See vctrs::vec_as_names() for more options.
values_to: A string specifying the name of the column to create from the data stored in cell values. If names_to is a character vector containing the special .value sentinel, this value will be ignored, and the name of the value column will be derived from part of the existing column names.
values_drop_na: If TRUE, will drop rows that contain only NAs in the values_to column. This effectively converts explicit missing values to implicit missing values, and should generally be used only when missing values in data were created by its structure.
...: Additional arguments passed on to methods.
Details
pivot_longer() is an updated approach to gather(), designed to be both simpler to use and to handle more use cases. We recommend you use pivot_longer() for new code; gather() isn't going away but is no longer under active development.
Examples
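The original examples are not reproduced here. As a stand-in, the sketch below applies several of the arguments described above to a hypothetical table with one column per week (the data and all names are illustrative). It assumes a tidyr version that supports names_transform.

```r
library(tidyr)

# Hypothetical wide data: one column per week (toy values)
wide <- tibble::tibble(
  gene = c("geneA", "geneB"),
  wk1  = c(5, 7),
  wk2  = c(6, NA)
)

long <- pivot_longer(
  wide,
  cols            = starts_with("wk"),         # columns to pivot
  names_to        = "week",                    # column for the old names
  names_prefix    = "wk",                      # strip "wk" from the names
  names_transform = list(week = as.integer),   # "1", "2" -> 1L, 2L
  values_to       = "count",                   # column for the cell values
  values_drop_na  = TRUE                       # drop the geneB / wk2 row
)

long

# pivot_wider() performs the inverse transformation
wide_again <- pivot_wider(long, names_from = week, values_from = count,
                          names_prefix = "wk")
```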
