---
title: "Using matsindf for principal components analysis"
author: "Alexander Davis"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using matsindf for principal components analysis}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(datasets)
library(dplyr)
library(ggplot2)
library(matsindf)
library(tidyr)
```

## Introduction

When working with tidy data, it can be challenging to use R operations that take matrices as input.
The functions in `matsindf` make this easier by converting between tidy data frames and matrices.

## Data

We will illustrate how to handle these cases with `matsindf` functions
by doing principal components analysis (PCA) on the classic Fisher iris dataset,
which is often used to illustrate PCA.
We will use a "long" input table, in which each row is a single measurement rather than a single flower.

```{r}
long_iris <- datasets::iris %>%
  # Give every flower a unique identifier to use as a matrix row name later.
  dplyr::mutate(flower = sprintf("flower_%d", 1:nrow(datasets::iris))) %>%
  tidyr::pivot_longer(
    cols = c(-Species, -flower),
    names_to = "dimension",
    values_to = "length"
  ) %>%
  dplyr::rename(species = Species) %>%
  dplyr::select(flower, species, dimension, length) %>%
  dplyr::mutate(species = as.character(species))
head(long_iris, n = 5)
```

## Generate PCA results

Using `matsindf`, we can convert the long table to a matrix, apply PCA, and then convert back to a long-format table.

```{r}
long_pca_embeddings <- long_iris %>%
  # Long table -> matrix with flowers in rows and dimensions in columns.
  collapse_to_matrices(
    rownames = "flower",
    colnames = "dimension",
    matvals = "length"
  ) %>%
  # Run PCA on the matrix and keep the projected coordinates.
  dplyr::transmute(projection = lapply(length, function(mat)
    stats::prcomp(mat, center = TRUE, scale. = TRUE)$x
  )) %>%
  # Matrix of coordinates -> long table.
  expand_to_tidy(
    rownames = "flower",
    colnames = "component",
    matvals = "projection"
  )
head(long_pca_embeddings, n = 5)
```

The result is the coordinates of the iris data along the principal components, as a long-format table.
We just need to add back the species column ...

```{r}
long_pca_withspecies <- long_iris %>%
  dplyr::select(flower, species) %>%
  dplyr::distinct() %>%
  dplyr::left_join(long_pca_embeddings, by = "flower")
head(long_pca_withspecies, n = 5)
```

... followed by the familiar PCA plot.

```{r, fig.width=7, fig.align='center', fig.retina=2}
long_pca_withspecies %>%
  tidyr::pivot_wider(
    id_cols = c(flower, species),
    names_from = component,
    values_from = projection
  ) %>%
  ggplot2::ggplot(ggplot2::aes(x = PC1, y = PC2, colour = species)) +
  ggplot2::geom_point() +
  ggplot2::labs(colour = ggplot2::element_blank()) +
  ggplot2::theme_bw() +
  ggplot2::coord_equal()
```

As expected, we see that the distribution of measurements differs across the three species of iris.

## Conclusion

`matsindf` simplifies applying matrix-based operations such as PCA to tidy data:
collapse the long table to a matrix, run the operation, and expand the result back to a long table.
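
For comparison, the same long-to-matrix-to-long round trip can be sketched in base R alone. This is only an illustration of the bookkeeping that the `matsindf` functions take care of, not part of the `matsindf` workflow. The object names (`iris_df`, `long_tbl`, `mat`, `scores`, `long_scores`) are made up for this sketch, and the long table is rebuilt by hand so that the chunk does not depend on any package beyond `datasets` and `stats`.

```{r}
# Base R sketch (assumed names; not the matsindf API).
iris_df <- datasets::iris
iris_df$flower <- sprintf("flower_%d", seq_len(nrow(iris_df)))

# Build a long table by hand: one row per measurement.
measure_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
long_tbl <- data.frame(
  flower = rep(iris_df$flower, times = length(measure_cols)),
  dimension = rep(measure_cols, each = nrow(iris_df)),
  length = unlist(iris_df[measure_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Step 1: long table -> matrix (flowers in rows, dimensions in columns).
# Each (flower, dimension) cell holds exactly one value, so mean() returns it.
mat <- tapply(long_tbl$length, list(long_tbl$flower, long_tbl$dimension), mean)

# Step 2: PCA on the matrix; keep the projected coordinates.
scores <- stats::prcomp(mat, center = TRUE, scale. = TRUE)$x

# Step 3: matrix -> long table. as.vector() reads the matrix column by column,
# so row names repeat once per column and each column name repeats per row.
long_scores <- data.frame(
  flower = rep(rownames(scores), times = ncol(scores)),
  component = rep(colnames(scores), each = nrow(scores)),
  projection = as.vector(scores),
  stringsAsFactors = FALSE
)
head(long_scores, n = 5)
```

Keeping the row names, column names, and values aligned by hand in steps 1 and 3 is the part that `collapse_to_matrices()` and `expand_to_tidy()` handle in the pipeline shown earlier.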