--- title: "Use Cases and Examples for matsindf" author: "Matthew Kuperus Heun" date: "`r Sys.Date()`" header-includes: - \usepackage{amsmath} output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Use Cases and Examples for matsindf} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console bibliography: References.bib --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(dplyr) library(ggplot2) library(matsbyname) library(matsindf) library(tidyr) library(tibble) ``` \newcommand{\transpose}[1]{#1^\mathrm{T}} \newcommand{\inverse}[1]{#1^{\mathrm{-}1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\colvec}[1]{\mathbf{#1}} \newcommand{\rowvec}[1]{\transpose{\colvec{#1}}} \newcommand{\inversetranspose}[1]{\transpose{\left( \inverse{\mat{#1}} \right)}} \newcommand{\transposeinverse}[1]{\inverse{\left( \transpose{\mat{#1}} \right)}} \newcommand{\hatinv}[1]{\inverse{\widehat{#1}}} ## Introduction Matrices are important mathematical objects, and they often describe networks of flows among nodes. Example networks are given in the following table. | System type | Flows | Nodes |:--------------|:-----------|:--------- | Ecological | nutrients | organisms | Manufacturing | materials | factories | Economic | money | economic sectors The power of matrices lies in their ability to organize network-wide calculations, thereby simplifying the work of analysts who study entire systems. But [wouldn't it be nice](https://en.wikipedia.org/wiki/Wouldn%27t_It_Be_Nice) if there were an easy way to create `R` data frames whose entries were not numbers but entire matrices? If that were possible, matrix algebra could be performed on columns of similar matrices. That's the reason for `matsindf`. It provides functions to convert a suitably-formatted [tidy](https://tidyr.tidyverse.org/articles/tidy-data.html) data frame into a data frame containing a column of matrices. Furthermore, `matsbyname` is a sister package that * provides matrix algebra functions that respect names of matrix rows and columns (`dimnames` in `R`) to free the analyst from the task of aligning rows and columns of operands (matrices) passed to matrix algebra functions and * allows matrix algebra to be conducted within data frames using [dplyr](https://dplyr.tidyverse.org), [tidyr](https://tidyr.tidyverse.org), and other [tidyverse](https://www.tidyverse.org) functions. When used together, `matsindf` and `matsbyname` allow analysts to wield simultaneously the power of both [matrix mathematics](https://en.wikipedia.org/wiki/Matrix_(mathematics)) and [tidyverse](https://www.tidyverse.org) functional programming. This vignette demonstrates the use of these packages and suggests a workflow to accomplish sophisticated analyses using *matrices in data frames* (`matsindf`). ## Data: `UKEnergy2000` To demonstrate the use of `matsindf` functions, consider a network of energy flows from the environment, through transformation and distribution processes, and, ultimately, to final demand. Such energy flow networks are called energy conversion chains (ECCs), and this example is based on an approximation to a portion of the UK's ECC circa 2000. (Note that these data are to be used for demonstration purposes only and have been rounded to 1--2 significant digits.) These example data first appeared in Figures 3 and 4 of Heun, Owen, and Brockway [-@Heun:2018]. ```{r} head(UKEnergy2000, 2) ``` `Country` and `Year` contain only one value each, `GB` and `2000` respectively. Following conventions of the [International Energy Agency](https://www.iea.org)'s [energy balance tables](https://www.iea.org/data-and-statistics/data-product/world-energy-balances), * `Ledger.side` indicates `Supply` or `Consumption`; * `Flow.aggregation.point` indicates how data are to be aggregated; * `Flow` indicates the industry, machine, or final demand sector for this flow; * `Product` indicates the energy carrier for this flow; and * `E.ktoe` gives the magnitude of this flow in [units](https://www.eia.gov/energyexplained/units-and-calculators/energy-conversion-calculators.php) of kilotons of oil equivalent (ktoe). Each flow is its own observation (its own row) in the `UKEnergy2000` data frame, making it [tidy](https://tidyr.tidyverse.org/articles/tidy-data.html). The remainder of this vignette demonstrates an analysis conducted using the `UKEnergy2000` data frame as a basis. It: * shows how to *collapse* and spread the data into appropriate matrices stored in columns of a data frame, * demonstrates analyzing the matrices with `matsbyname` functions, * illustrates *expand*ing the matrices back into a tidy data frame, and * uses [ggplot](https://ggplot2.tidyverse.org) to graph the results. ## Suggested workflow ### Prepare for *collapse* The `EnergyUK2000` data frame is similar to "cleaned" data from an external source: there are no missing entries, and it is [tidy](https://tidyr.tidyverse.org/articles/tidy-data.html). But the data are not organized as matrices, and additional metadata is needed. The `collapse_to_matrices` function converts a tidy data frame into a `matsindf` data frame using using information within the tidy data frame. So the first task is to prepare for *collapse* by adding metadata columns. `collapse_to_matrices` needs the following information: | argument to `collapse_to_matrices` | identifies |--------------------------------------------:|:-------------------------------- | `matnames` | Name of the input column of matrix names | `values` | Name of the input column of matrix entries | `rownames` | Name of the input column of matrix row names | `colnames` | Name of the input column of matrix column name | `rowtypes` | Optional name of the input column of matrix row types | `coltypes` | Optional name of the input column of matrix column types The following code gives the approach to adding metadata, appropriate for this application, relying on `Ledger.side`, the sign of `E.ktoe`, and knowledge about the rows and columns for each matrix. Each type of network will have its own algorithm for identifying row names, column names, row types, and column types in a tidy data frame. ```{r} UKEnergy2000_with_metadata <- UKEnergy2000 %>% # Add a column indicating the matrix in which this entry belongs (U, V, or Y). matsindf:::add_UKEnergy2000_matnames() %>% # Add columns for row names, column names, row types, and column types. matsindf:::add_UKEnergy2000_row_col_meta() %>% mutate( # Eliminate columns we no longer need Ledger.side = NULL, Flow.aggregation.point = NULL, Flow = NULL, Product = NULL, # Ensure that all energy values are positive, as required for analysis. E.ktoe = abs(E.ktoe) ) head(UKEnergy2000_with_metadata, 2) ``` ### Collapse With the metadata now in place, `UKEnergy2000_with_metadata` can be collapsed to a `matsindf` data frame by the `collapse_to_matrices` function. Much like `dplyr::summarise`, `collapse_to_matrices` relies on grouping to indicate which rows of the tidy data frame belong to which matrices. The usual approach is to `tidyr::group_by` the `matnames` column and any other columns to be preserved in the output, in this case `Country` and `Year`. ```{r} EnergyMats_2000 <- UKEnergy2000_with_metadata %>% group_by(Country, Year, matname) %>% collapse_to_matrices(matnames = "matname", matvals = "E.ktoe", rownames = "rowname", colnames = "colname", rowtypes = "rowtype", coltypes = "coltype") %>% rename(matrix.name = matname, matrix = E.ktoe) # The remaining columns are Country, Year, matrix.name, and matrix glimpse(EnergyMats_2000) # To access one of the matrices, try one of these approaches: (EnergyMats_2000 %>% filter(matrix.name == "U"))[["matrix"]] # The U matrix EnergyMats_2000$matrix[[2]] # The V matrix EnergyMats_2000$matrix[[3]] # The Y matrix ``` ### Duplicate (for purposes of illustration) Larger studies will include data for multiple countries and years. The ECC data from UK in year `2000` can be duplicated for `2001` and for a fictitious country `AB`. Although the data are unchanged, the additional rows serve to illustrate the functional programming aspects of the `matsindf` and `matsbyname` packages. ```{r} Energy <- EnergyMats_2000 %>% # Create rows for a fictitious country "AB". # Although the rows for "AB" are same as the "GB" rows, # they serve to illustrate functional programming with matsindf. rbind(EnergyMats_2000 %>% mutate(Country = "AB")) %>% spread(key = Year, value = matrix) %>% mutate( # Create a column for a second year (2001). `2001` = `2000` ) %>% gather(key = Year, value = matrix, `2000`, `2001`) %>% # Now spread to put each matrix in a column. spread(key = matrix.name, value = matrix) glimpse(Energy) ``` ### Verify data An important step in any analysis is data verification. For an ECC analysis, it is important to verify that energy is conserved (i.e., energy is in balance) across all industries. Equations 1 and 2 in Heun, Owen, and Brockway [-@Heun:2018] show that energy balance is verified by $$\mat{W} = \transpose{\mat{V}} - \mat{U},$$ and $$\mat{W}\colvec{i} - \mat{Y}\colvec{i} = \colvec{0}.$$ Energy balance verification can be implemented with `matsbyname` functions and `tidyverse` functional programming: ```{r} Check <- Energy %>% mutate( W = difference_byname(transpose_byname(V), U), # Need to change column name and type on y so it can be subtracted from row sums of W err = difference_byname(rowsums_byname(W), rowsums_byname(Y) %>% setcolnames_byname("Industry") %>% setcoltype("Industry")), EBalOK = iszero_byname(err) ) Check %>% select(Country, Year, EBalOK) all(Check$EBalOK %>% as.logical()) ``` This example demonstrates that energy balance can be verified for *all* combinations of Country and Year with a few lines of code. In fact, the exact same code can be applied to the `Energy` data frame, regardless of the number of rows in it. Secure in the knowledge that energy is conserved across all ECCs in the `Energy` data frame, other analyses can proceed. ### Efficiencies To further illustrate the power of `matsbyname` functions in the context of `matsindf`, consider the calculation of the efficiency of every industry in the ECC as column vector $\eta$ as shown by Equation 11 of Heun, Owen, and Brockway [-@Heun:2018]. $$\colvec{g} = \mat{V}\colvec{i}$$ $$\colvec{\eta} = \hatinv{\transpose{\mat{U}} \colvec{i}} \colvec{g}$$ ```{r} Etas <- Energy %>% mutate( g = rowsums_byname(V), eta = transpose_byname(U) %>% rowsums_byname() %>% hatize_byname(keep = "rownames") %>% invert_byname() %>% matrixproduct_byname(g) %>% setcolnames_byname("eta") %>% setcoltype("Efficiency") ) %>% select(Country, Year, eta) Etas$eta[[1]] ``` Note that only a few lines of code are required to perform the same series of matrix operations on every combination of `Country` and `Year`. In fact, the same code will be used to calculate the efficiency of any number of industries in any number of countries and years! ### Expand Plotting values from a `matsindf` data frame can be accomplished by *expand*ing the matrices of the `matsindf` data frame (in this example, `Etas`) back out to a tidy data frame. *Expand*ing is the reverse of *collapse*-ing, and the following information must be supplied to the `expand_to_tidy` function: | argument to `expand_to_tidy` | identifies |--------------------------------------------:|:-------------------------------- | `matnames` | Name of the input column of matrix names | `matvals` | Name of the input column of matrices to be expanded | `rownames` | Name of the output column of matrix row names | `colnames` | Name of the output column of matrix column name | `rowtypes` | Optional name of the output column of matrix row types | `coltypes` | Optional name of the output column of matrix column types | `drop` | Optional value to be dropped from output (often 0) Prior to `expand`ing, it is usually necessary to `gather` columns of matrices. ```{r} etas_forgraphing <- Etas %>% gather(key = matrix.names, value = matrix, eta) %>% expand_to_tidy(matnames = "matrix.names", matvals = "matrix", rownames = "Industry", colnames = "etas", rowtypes = "rowtype", coltypes = "Efficiencies") %>% mutate( # Eliminate columns we no longer need. matrix.names = NULL, etas = NULL, rowtype = NULL, Efficiencies = NULL ) %>% rename( eta = matrix ) # Compare to Figure 8 of Heun, Owen, and Brockway (2018) etas_forgraphing %>% filter(Country == "GB", Year == 2000) ``` `etas_forgraphing` is a data frame of efficiencies, one for each Country, Year, and Industry, in a format that is amenable to plotting with packages such as [ggplot](https://ggplot2.tidyverse.org). ### Report The following code creates a bar graph of efficiency results for the UK in 2000: ```{r} etas_UK_2000 <- etas_forgraphing %>% filter(Country == "GB", Year == 2000) etas_UK_2000 %>% ggplot(mapping = aes_string(x = "Industry", y = "eta", fill = "Industry", colour = "Industry")) + geom_bar(stat = "identity") + labs(x = NULL, y = expression(eta[UK*","*2000]), fill = NULL) + scale_y_continuous(breaks = seq(0, 1, by = 0.2)) + scale_fill_manual(values = rep("white", nrow(etas_UK_2000))) + scale_colour_manual(values = rep("gray20", nrow(etas_UK_2000))) + guides(fill = FALSE, colour = FALSE) + theme(axis.text.x = element_text(angle = 90, vjust = 0.4, hjust = 1)) ``` ## Conclusion This vignette demonstrated the use of the `matsindf` and `matsbyname` packages and suggested a workflow to accomplish sophisticated analyses using *matrices in data frames* (`matsindf`). The workflow is as follows: * Reshape data into a tidy data frame with columns for matrix name, element value, row name, column name, row type, and column type, similar to `UKEnergy2000` above. * Use `collapse_to_matrices` to create a data frame of matrices with columns for matrix names and matrices themselves, similar to `EnergyMats_2000` above. * `tidyr::spread` the matrices to obtain a data frame with columns for each matrix, similar to `Energy` above. * Validate the data, similar to `Check` above. * Perform matrix algebra operations on the columns of matrices using `matsbyname` functions in a manner similar to the process of generating the `Etas` data frame above. * `tidyr::gather` the columns to obtain a tidy data frame of matrices. * Use `expand_to_tidy` to create a tidy data frame of matrix elements, similar to `etas_forgraphing` above. * Plot and report results as demonstrated by the graph above. Data frames of matrices, such as those created by `matsindf`, are like magic spreadsheets in which single cells contain entire matrices. With this data structure, analysts can wield simultaneously the power of both [matrix mathematics](https://en.wikipedia.org/wiki/Matrix_(mathematics)) and [tidyverse](https://www.tidyverse.org) functional programming. ## References