--- title: "RWDataPlyr Workflow" author: "Alan Butlerr" date: "`r Sys.Date()`" output: bookdown::html_document2: base_format: rmarkdown::html_vignette number_sections: false vignette: > %\VignetteIndexEntry{RWDataPlyr Workflow} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` The RWDataPlyr package provides an interface to read, aggregate, and summarize data from one or more [RiverWare^TM^](http://riverware.org) simulations and then work with those data in a dplyr pipeline. As RiverWare (and RiverSMART) produce data in multiple formats, and these data can be combined using RiverSMART in multiple ways, there are several different workflows that may make sense depending on the goals of the analysis. This vignette provides details on the different workflows that are covered by v0.6.0 of RWDataPlyr. As of version 0.6.0, the package assumes the following: * Data are monthly or annual. * All data start in January of one year and goes through December of another year, e.g., January 2014 through December 2060. *Note that only the summary and aggregation functions include these implicit assumptions, i.e., reading rdfs should work regardless of timestep size, start timestep, and end timestep. However, only data conforming to these standards has been tested.* ## Background [RiverWare^TM^](http://riverware.org) is a generalized river system modeling tool that represents physical processes and basin features, while allowing for complex reservoir operation policies to be expressed in rule-based logic. RiverWare is widely used across the U.S. by the Bureau of Reclamation, Tennessee Valley Authority, and U.S. Army Corps to assist their water management missions, and has applications worldwide, e.g., the Nile Basin. One of the strengths of RiverWare is its "multiple run" mode, where many unique realizations, or traces, of hypothetical operations can be simulated with different input assumptions, e.g., hydrology. Multiple traces can be grouped together to form a single scenario, and it is often desirable to compare multiple scenarios together to understand how changes in assumptions affect model results. The RiverSMART tool is a wrapper to RiverWare (and other tools) that helps in the simulation and analysis of multiple scenarios. RiverWare stores important data in "slots" on "objects" in the RiverWare model file. Any subset of these data can be output and saved in ASCII text files, known as RiverWare data files, with a .rdf extension; csv files; and/or NetCDF files. When multiple scenarios are run using RiverSMART, these output files are saved in unique scenario folders. RiverSMART can also combine csv output from multiple scenarios into a single csv file to expedite the analysis. This setup is summarized in figure \@ref(fig:rwFlowChart). ```{r, rwFlowChart, fig.cap="RiverWare and RiverSMART output flow chart. Part A shows a single RiverWare run; part B shows how RiverSMART wraps individual RiverSMART runs to produce output, with those output becoming input to RiverSMART for post-processing. The color of the file extensions correspond to the color of the functions that read those file types; yellow indicates that reading NetCDF files is not yet supported by RWDataPlyr.", echo=FALSE} knitr::include_graphics("RiverWareFlowChart.png") ``` When using RiverSMART to simulate multiple scenarios, each with hundreds or thousands of traces of data, it is increasingly easy to generate large amounts of data from RiverWare simulations. This package provides a tool to read these data into R, manipulate and summarize the data, and aggregate multiple scenarios' data together. As there are multiple ways RiverWare and RiverSMART can generate the data, there are also different workflows that may expedite the analysis process. This vignette documents these different workflows. ## RWDataPlyr Workflows RiverWare and RiverSMART can create data (rdf, csv, nc) for one or more scenarios, and depending on the analysis different workflows will provide advantages to the user. There are currently three different workflows supported by RWDataPlyr that are presented in the following sections. 1. Reading and manipulating a single scenario * Fast * Best for inspecting a single slot * If comparing scenarios, must manually repeat for each scenario * Relies on `read_rdf()` and `read_rw_csv()` 2. Summarizing multiple slots of data from a single scenario * Repeatable; allows user to process many slots at once * Best for producing "polished" analyses of a single scenario * Relies on `rdf_aggregate()` and user specified `rwd_agg` object 3. Aggregating and summarizing many scenarios * Repeatable; allows user to process many slots for many scenarios at once * Repeats summary of a single scenario on multiple scenarios and combines results together * Relies on `rw_scen_aggregate()` and user specified `rwd_agg` object **As of v0.6.0, RWDataPlyr can only read rdf and csv files, and the summary and aggregation can only be performed on rdf files.** ### Reading and Manipulating a Single Scenario Both rdf and csv files can be read into R using `read_rdf()` and `read_rw_csv()`. `read_rdf()` returns an `rdf` object; to be useful, the rdf object must be converted to a `tbl_df` using `rdf_to_rwtbl2()` or individual slot data can be obtained using `rdf_get_slot()`. `rdf_get_slot()` creates a timestep by traces matrix for a single slot. Several useful summary functions are available to summarize a single slot: `rwslot_annual_sum()`, `rwslot_annual_min()`, `rwslot_annual_max()`. A typical workflow to obtain the minimum annual Mead pool elevation from the included sample data might be: ```{r oneScen} library(RWDataPlyr) suppressPackageStartupMessages(library(dplyr)) rdf <- read_rdf(system.file( "extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/KeySlots.rdf", package = "RWDataPlyr" )) # ensure the slot you want is in the rdf: rdf_slot_names(rdf) # then get the minimum annual Mead pool elevation for all 5 years and 4 traces rdf %>% rdf_get_slot("Mead.Pool Elevation") %>% rwslot_annual_min() ``` `read_rw_csv()` returns a `tbl_df` that can easily be manipulated using the dplyr framework. There are no analogous functions to `rdf_get_slot()` or `rwslot_annual_*()`, but `dplyr::filter()` and `dplyr::summarise()` work well instead. The following will obtain the same minimum annual Mead pool elevation data from the example csv file with the added benefit of having years and trace number (note that the output is converted to a data.frame to avoid the rounding in the current tibble print method): ```{r csvExmpl} library(tidyr) read_rw_csv(system.file( "extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/KeySlots.csv", package = "RWDataPlyr" )) %>% filter(ObjectSlot == "Mead.Pool Elevation") %>% group_by(Year, TraceNumber) %>% summarise(Value = min(Value)) %>% spread(TraceNumber, Value) %>% as.data.frame() ``` The fields that are stored in csv files are controlled by the user in RiverWare. Depending on which fields the user specifies in RiverWare and the `keep_cols` options provided by the user, `read_rw_csv(x)` is approximately equal to `rdf_to_rwtbl2(read_rdf(y), keep_cols = TRUE)` if `x` is a csv file and `y` is an rdf file that contain the same RiverWare slots. ### Summarizing Multiple Slots of Data From a Single Scenario To summarize and process multiple slots of data from a single scenario at once, the user can define a RiverWare data aggregator object (`rwd_agg`) and provide that to `rdf_aggregate()`, which returns a `tbl_df` of the summarized data. Slots can be from different rdf files, but it is up to the user to know which slots come from which rdf file. **As of v0.6.0, the aggregation and summarization will only work for rdf files.** #### rwd_agg `rwd_agg` objects are data.frames with specific column names and expected values in those columns. The expected column names are `file`, `slot`, `period`, `summary`, `eval`, `t_s`, and `variable` in that order. It is up to the user to specify which rdf file contains each slot. In a general case, the user specifies the `slot` that is found in a specific rdf file (`file`). A `summary` function is applied to a subset `period` of the `slot` data, and then compared (`eval`) to a threshold (`t_s`) and saved as the `variable`. `rwd_agg` objects can be specified in three ways: 1. By providing a data.frame with the expected column names (see below). 2. By providing a vector of rdf file names using the `rdfs` parameter. If specified in this manor, all of the slots in each rdf file will be read into a `rwtbl`, but will not be aggregated/summarized. 3. By reading in a csv file, with the expected column names (see below), using `read_rdf_agg()`. The user may first use `rdf_agg_template()` to create a blank template, to help ensure it has the correct column names. In order to specify how each slot should be aggregated, each column should include specific keywords, which are described below. * *file:* specifies the rdf file that contains the slot. * *slot:* the full RiverWare slot name, i.e., "Object.Slot". * *period:* the period that the slot should be summarized over. This should either be a function name, a full month name (found in `month.name`), or the keyword "`asis`". - *function name:* Specifying a function name allows for pre-specified or custom functions to group together several months in the *period*. This package provides the following functions: `cy()`, `wy()`, `eocy()`, and `eowy()`. `cy()` indicates the data will be summarized over the calendar year, i.e., January - December, while `wy()` summarizes over the water year, i.e., October - September. `eocy()` selects the end of the calendar year, and `eowy()` selects the end of the water year. When specified in the `slot_agg` object, leave off the parenthesis, i.e., only specify the function name. If `wy()` is specified, the function will remove data for any water years that have less than 7 months of data. This "tolerance" is specified by the `rwdataplyr.wy_month_tol` option, and can be modified by updating this option to another number. For standard monthly data that starts in January and ends in December, this results in keeping the first water year, since it includes 9 months of data, and removing the final water year, since it includes only three months of data. Setting this option to 0 will result in keeping any water year data that has at least one month of data; setting this option to 11, ensures that there must be a full water year of data for that year to be kept. This can also be a user specified custom function; see the *Custom Period Functions* section for details on constructing the custom functions. - *full month name:* When the full month name is specified, data will be filtered to only include data for that particular month. To select multiple months of data, use a function as described above. If the month specified is not found in `month.name`, an error will occur. - *asis:* If the keyword "asis" is specified, the data is returned for its native timestep, i.e, monthly data will return monthly data and annual data will return annual. * *summary:* the summary function that should be applied to the period specified as a function name, or `NA`. If the `period` specified is "asis" or returns only one month, e.g., `eocy()`, then the summary should be `NA`. The summary function should only return one value; for that reason, most of the `Summary` S4groupGenerics work. Notably, `range()` will not since it returns two values. There is no reason that a custom function will not work here, but it has not been tested. * *eval:* the comparison operator to use (see `Compare` S4groupGenerics). If no comparison is desired, then `NA` should be used. If `eval` is specified the value returned from applying the `summary` to the `period` will be compared to the threshold specified by `t_s`. The results of the comparison are returned as 0 and 1 instead of `TRUE` and `FALSE`. * *t_s:* either the threshold to be compared to if `eval` is not `NA` or a value to scale the result by, e.g,. 0.001 to convert from acre-ft to thousand acre-ft. `NA` can also be specified to not scale the data. * *variable:* the variable name that will be used to identify the results of applying the period, summary, comparison/scaling to. All variable names should be unique. #### Custom Period Functions The `period` in a `slot_agg` object that a slot will be summarized over can be a function that is provided by RWDataPlyr, e.g., `wy()` for summarizing over a water year, or it can be a user defined custom function to allow for more complex time aggregations. The custom function should return a list with three elements: * `fun` - a function that will modify a rwtbl and properly determine the new `Year` based on the custom period. Necessary for more complex periods that might span different calendar years, e.g., water years. This function should modify the `Year` column so that months that should be grouped together have the same year value. * `filter_months` - the months that should be grouped together. * `group_tbl` - how to group the returned rwtbl; likely either `c("Year")` or `c("Year", "Month")` For example, a user might want to aggregate all summer months together (June - August). A relatively simple custom function for this would be: ```{r customSummer} summer <- function() { list( fun = function(x) x, filter_months = c("June", "July", "August"), group_tbl = c("Year") ) } ``` The `fun` in the summer example simply returns the original data, because it does not need to provide any complex logic for grouping different years together. A more complex example would be summarizing all winter months together (December - February). In this example December from year 0 should be grouped with January and February of year 1 for the winter of year 1 summary. The `djf()` custom function would be: ```{r customWinter} djf <- function() { djf_convert <- function(rwtbl) { rwtbl %>% dplyr::mutate_at( "Timestep", .funs = list("Year" = zoo::as.yearmon) ) %>% # can use the ym_get_wateryear b/c djf are all in same water year dplyr::mutate_at("Year", list(ym_get_wateryear)) } list( fun = djf_convert, filter_months = month.name[c(12, 1, 2)], group_tbl = c("Year") ) } ``` In this example, `fun` now includes a function that takes in the original data (and expects a `Timestep` column). The time step is converted to a `yearmon` object, and then the years are re-assigned based on their water year (since December - February are in the same water year). #### `rdf_aggregate()` Example To summarize multiple slots at once, the first step is to define the `rwd_agg` object. In this example, the following summaries are desired: * a flag of each occurrence that Mead's minimum annual elevation is below elevation 1,060' * the end-of-calendar year Mead elevation * the winter release from Powell (sum of December - February) in thousands acre-ft (this uses our custom period function defined in the previous section) * the July release from Powell * the water year total release from Powell in million acre-ft * the shortage flag (already annual no summary required) For this, we want to create the following table: ```{r echo=FALSE, results="asis"} knitr::kable(data.frame( file = c(rep("KeySlots.rdf", 5), "SystemConditions.rdf"), slot = c( rep("Mead.Pool Elevation", 2), rep("Powell.Outflow", 3), "SummaryOutputData.LBShortageConditions" ), period = c("cy", "eocy", "djf", "July", "wy", "asis"), summary = c("min", NA, "sum", NA, "sum", NA), eval = c("<=", rep(NA, 5)), t_s = formatC( c(1060, NA, 0.001, NA, 0.000001, NA), big.mark = ",", drop0trailing = TRUE ), variable = c("meadLt1060", "meadEocy", "powellDjfRel", "powellJulRel", "powellWyRel", "short") )) ``` Often times, it will be easier to define the table in Excel and then read it in as a csv using `read_rwd_agg(x)` but below shows an example of defining it directly in R. ```{r rwdAgg} rwa1 <- rwd_agg(data.frame( file = c(rep("KeySlots.rdf", 5), "SystemConditions.rdf"), slot = c( rep("Mead.Pool Elevation", 2), rep("Powell.Outflow", 3), "SummaryOutputData.LBShortageConditions" ), period = c("cy", "eocy", "djf", "July", "wy", "asis"), summary = c("min", NA, "sum", NA, "sum", NA), eval = c("<=", rep(NA, 5)), t_s = c(1060, NA, 0.001, NA, 0.000001, NA), variable = c("meadLt1060", "meadEocy", "powellDjfRel", "powellJulRel", "powellWyRel", "short"), stringsAsFactors = FALSE )) ``` After the `rwd_agg` is specified the object is passed to `rdf_aggregate()` along with a few parameters specifying where the data are stored, and a tbl_df is returned: ```{r rdfAgg} rdf_aggregate( rwa1, rdf_dir = system.file( "extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/", package = "RWDataPlyr" ) ) ``` To keep all of the columns (attribute info), specify `keep_cols = TRUE`: ```{r rdfAgg2} rdf_aggregate( rwa1, rdf_dir = system.file( "extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/", package = "RWDataPlyr" ), keep_cols = TRUE ) ``` ### Aggregating and Summarizing Many Scenarios `rw_scen_aggregate()` will aggregate and summarize multiple scenarios, essentially calling `rdf_aggregate()` for each scenario. Similar to `rdf_aggregate()` it relies on a user specified `rwd_agg` object to know how to summarize and process the scenarios. `rw_scen_aggregate()` is setup to work with a directory structure typical to a RiverSMART study. RiverSMART manages the output of multiple scenarios by creating unique scenario folders in a "Scenario" directory. The `scen_dir` argument should point to this scenario directory in a typical study setup. For example, if the file system is setup as shown in figure \@ref(fig:dirStruct), `scen_dir` should point to "C:/user/crss/CRSS.Jan2017/Scenario" (or the correct relative path to the same location). ```{r dirStruct, results="asis",fig.cap="Sample directory structure", echo=FALSE} knitr::include_graphics("dir_structure.png") ``` `rw_scen_aggregate()` also needs to know the scenario folders and allows the user to specify scenario names that may differ from the folder name so they are easier to use in R. In this example, the `scenarios` argument would be `c("ISM1988_2014,2007Dems,IG,Most", "ISM1988_2014,2007Dems,IG,2002")`. The scenario names can be specified either through the `scen_names` argument, or by naming the `scenarios` vector. The following will obtain the same results as defined by the earlier `rwd_agg` object for both of these scenarios. ```{r scenAgg`} my_scens <-c("ISM1988_2014,2007Dems,IG,Most", "ISM1988_2014,2007Dems,IG,2002") names(my_scens) <- c("most", "yr2002") scen_res <- rw_scen_aggregate( my_scens, agg = rwa1, scen_dir = system.file("extdata/Scenario", package ="RWDataPlyr") ) ``` To compare results from multiple studies, i.e., the different top level CRSS.JanNNNN folders in figure \@ref(fig:dirStruct), `scen_dir` should point to "C:/user/crss" and `scenarios` should be assigned to `c("CRSS.Jan2017/Scenario/ISM1988_2014,2007Dems,IG,Most", "CRSS.Jan2018/Scenario/ISM1988_2014,2007Dems,IG,Most")`. If `scenarios` has a length of one, it will match the results of `rdf_aggregate()` if the `scenario` parameter is used and the same as the `scen_names` parameter in `rw_scen_aggregate()`. #### Legacy Workflow `getDataForAllScens()` is the legacy function that preceded `rw_scen_aggregate()` and `rdf_aggregate()` in RWDataPlyr v0.5.0 and earlier. It works similarly to `rw_scen_aggregate()` and relies on a `slot_agg_list`, which is analogous to the `rwd_agg` object, though it only allows for certain pre-specified aggregation methods. The help pages include information on these functions, though `rw_scen_aggregate()` is superior and preferred as of v0.6.0. `getDataForAllScens()` treats `NaN`s as 0s in its aggregation methods. The `nans_are` parameter defaults to `"0"` in `rdf_aggregate()` and ` rw_scen_aggregate()` and thus matches the way `getDataForAllScens()` treats `NaN`s. ## Utilizing RWDataPlyr within RiverSMART RiverSMART includes a [plugin](http://riverware.org/RiverSmart/RiverSmartSoftwareSuiteHelp.html#R_Plugin) that allows R scripts to be executed from RiverSMART for post-processing scenario(s). The exact configuration of the script and the RiverSMART R plugin is beyond the scope of this vignette, but a user can pass scenario folders and names to `rw_scen_aggregate()` from within RiverSMART, thus streamlining the post-processing steps.