The RWDataPlyr package provides an interface to read, aggregate, and summarize data from one or more RiverWareTM simulations and then work with those data in a dplyr pipeline. As RiverWare (and RiverSMART) produce data in multiple formats, and these data can be combined using RiverSMART in multiple ways, there are several different workflows that may make sense depending on the goals of the analysis. This vignette provides details on the different workflows that are covered by v0.6.0 of RWDataPlyr.
As of version 0.6.0, the package assumes the following:
Note that only the summary and aggregation functions include these implicit assumptions, i.e., reading rdfs should work regardless of timestep size, start timestep, and end timestep. However, only data conforming to these standards has been tested.
RiverWareTM is a generalized river system modeling tool that represents physical processes and basin features, while allowing for complex reservoir operation policies to be expressed in rule-based logic. RiverWare is widely used across the U.S. by the Bureau of Reclamation, Tennessee Valley Authority, and U.S. Army Corps to assist their water management missions, and has applications worldwide, e.g., the Nile Basin.
One of the strengths of RiverWare is its “multiple run” mode, where many unique realizations, or traces, of hypothetical operations can be simulated with different input assumptions, e.g., hydrology. Multiple traces can be grouped together to form a single scenario, and it is often desirable to compare multiple scenarios together to understand how changes in assumptions affect model results. The RiverSMART tool is a wrapper to RiverWare (and other tools) that helps in the simulation and analysis of multiple scenarios.
RiverWare stores important data in “slots” on “objects” in the RiverWare model file. Any subset of these data can be output and saved in ASCII text files, known as RiverWare data files, with a .rdf extension; csv files; and/or NetCDF files. When multiple scenarios are run using RiverSMART, these output files are saved in unique scenario folders. RiverSMART can also combine csv output from multiple scenarios into a single csv file to expedite the analysis. This setup is summarized in figure @ref(fig:rwFlowChart).
When using RiverSMART to simulate multiple scenarios, each with hundreds or thousands of traces of data, it is increasingly easy to generate large amounts of data from RiverWare simulations. This package provides a tool to read these data into R, manipulate and summarize the data, and aggregate multiple scenarios’ data together.
As there are multiple ways RiverWare and RiverSMART can generate the data, there are also different workflows that may expedite the analysis process. This vignette documents these different workflows.
RiverWare and RiverSMART can create data (rdf, csv, nc) for one or more scenarios, and depending on the analysis different workflows will provide advantages to the user. There are currently three different workflows supported by RWDataPlyr that are presented in the following sections.
read_rdf()
and
read_rw_csv()
rdf_aggregate()
and user specified
rwd_agg
objectrw_scen_aggregate()
and user specified
rwd_agg
objectAs of v0.6.0, RWDataPlyr can only read rdf and csv files, and the summary and aggregation can only be performed on rdf files.
Both rdf and csv files can be read into R using
read_rdf()
and read_rw_csv()
.
read_rdf()
returns an rdf
object; to be
useful, the rdf object must be converted to a tbl_df
using
rdf_to_rwtbl2()
or individual slot data can be obtained
using rdf_get_slot()
. rdf_get_slot()
creates a
timestep by traces matrix for a single slot. Several useful summary
functions are available to summarize a single slot:
rwslot_annual_sum()
, rwslot_annual_min()
,
rwslot_annual_max()
. A typical workflow to obtain the
minimum annual Mead pool elevation from the included sample data might
be:
library(RWDataPlyr)
suppressPackageStartupMessages(library(dplyr))
rdf <- read_rdf(system.file(
"extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/KeySlots.rdf",
package = "RWDataPlyr"
))
# ensure the slot you want is in the rdf:
rdf_slot_names(rdf)
#> [1] "Powell.Outflow" "Mead.Pool Elevation"
# then get the minimum annual Mead pool elevation for all 5 years and 4 traces
rdf %>%
rdf_get_slot("Mead.Pool Elevation") %>%
rwslot_annual_min()
#> [,1] [,2] [,3] [,4]
#> [1,] 1074.58 1070.91 1069.84 1073.80
#> [2,] 1069.01 1062.59 1068.99 1074.16
#> [3,] 1046.61 1044.63 1057.35 1086.97
#> [4,] 1029.81 1034.33 1065.68 1070.34
#> [5,] 1012.75 1045.17 1064.05 1076.88
read_rw_csv()
returns a tbl_df
that can
easily be manipulated using the dplyr framework. There are no analogous
functions to rdf_get_slot()
or
rwslot_annual_*()
, but dplyr::filter()
and
dplyr::summarise()
work well instead. The following will
obtain the same minimum annual Mead pool elevation data from the example
csv file with the added benefit of having years and trace number (note
that the output is converted to a data.frame to avoid the rounding in
the current tibble print method):
library(tidyr)
read_rw_csv(system.file(
"extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/KeySlots.csv",
package = "RWDataPlyr"
)) %>%
filter(ObjectSlot == "Mead.Pool Elevation") %>%
group_by(Year, TraceNumber) %>%
summarise(Value = min(Value)) %>%
spread(TraceNumber, Value) %>%
as.data.frame()
#> `summarise()` has grouped output by 'Year'. You can override using the
#> `.groups` argument.
#> Year 1 2 3 4
#> 1 2018 1074.58 1070.91 1069.84 1073.80
#> 2 2019 1069.01 1062.59 1068.99 1074.16
#> 3 2020 1046.61 1044.63 1057.35 1086.97
#> 4 2021 1029.81 1034.33 1065.68 1070.34
#> 5 2022 1012.75 1045.17 1064.05 1076.88
The fields that are stored in csv files are controlled by the user in
RiverWare. Depending on which fields the user specifies in RiverWare and
the keep_cols
options provided by the user,
read_rw_csv(x)
is approximately equal to
rdf_to_rwtbl2(read_rdf(y), keep_cols = TRUE)
if
x
is a csv file and y
is an rdf file that
contain the same RiverWare slots.
To summarize and process multiple slots of data from a single
scenario at once, the user can define a RiverWare data aggregator object
(rwd_agg
) and provide that to rdf_aggregate()
,
which returns a tbl_df
of the summarized data. Slots can be
from different rdf files, but it is up to the user to know which slots
come from which rdf file.
As of v0.6.0, the aggregation and summarization will only work for rdf files.
rwd_agg
objects are data.frames with specific column
names and expected values in those columns. The expected column names
are file
, slot
, period
,
summary
, eval
, t_s
, and
variable
in that order. It is up to the user to specify
which rdf file contains each slot. In a general case, the user specifies
the slot
that is found in a specific rdf file
(file
). A summary
function is applied to a
subset period
of the slot
data, and then
compared (eval
) to a threshold (t_s
) and saved
as the variable
.
rwd_agg
objects can be specified in three ways:
rdfs
parameter. If specified in this manor, all of the slots in each rdf file
will be read into a rwtbl
, but will not be
aggregated/summarized.read_rdf_agg()
. The user may first use
rdf_agg_template()
to create a blank template, to help
ensure it has the correct column names.In order to specify how each slot should be aggregated, each column should include specific keywords, which are described below.
month.name
), or the keyword “asis
”.
function name: Specifying a function name allows for
pre-specified or custom functions to group together several months in
the period. This package provides the following functions:
cy()
, wy()
, eocy()
, and
eowy()
. cy()
indicates the data will be
summarized over the calendar year, i.e., January - December, while
wy()
summarizes over the water year, i.e., October -
September. eocy()
selects the end of the calendar year, and
eowy()
selects the end of the water year. When specified in
the slot_agg
object, leave off the parenthesis, i.e., only
specify the function name. If wy()
is specified, the
function will remove data for any water years that have less than 7
months of data. This “tolerance” is specified by the
rwdataplyr.wy_month_tol
option, and can be modified by
updating this option to another number. For standard monthly data that
starts in January and ends in December, this results in keeping the
first water year, since it includes 9 months of data, and removing the
final water year, since it includes only three months of data. Setting
this option to 0 will result in keeping any water year data that has at
least one month of data; setting this option to 11, ensures that there
must be a full water year of data for that year to be kept.
This can also be a user specified custom function; see the Custom Period Functions section for details on constructing the custom functions.
full month name: When the full month name is specified,
data will be filtered to only include data for that particular month. To
select multiple months of data, use a function as described above. If
the month specified is not found in month.name
, an error
will occur.
asis: If the keyword “asis” is specified, the data is returned for its native timestep, i.e, monthly data will return monthly data and annual data will return annual.
NA
. If the
period
specified is “asis” or returns only one month, e.g.,
eocy()
, then the summary should be NA
. The
summary function should only return one value; for that reason, most of
the Summary
S4groupGenerics work. Notably,
range()
will not since it returns two values. There is no
reason that a custom function will not work here, but it has not been
tested.Compare
S4groupGenerics). If no comparison is desired, then
NA
should be used. If eval
is specified the
value returned from applying the summary
to the
period
will be compared to the threshold specified by
t_s
. The results of the comparison are returned as 0 and 1
instead of TRUE
and FALSE
.eval
is not NA
or a value to scale the result
by, e.g,. 0.001 to convert from acre-ft to thousand acre-ft.
NA
can also be specified to not scale the data.The period
in a slot_agg
object that a slot
will be summarized over can be a function that is provided by
RWDataPlyr, e.g., wy()
for summarizing over a water year,
or it can be a user defined custom function to allow for more complex
time aggregations. The custom function should return a list with three
elements:
fun
- a function that will modify a rwtbl and properly
determine the new Year
based on the custom period.
Necessary for more complex periods that might span different calendar
years, e.g., water years. This function should modify the
Year
column so that months that should be grouped together
have the same year value.filter_months
- the months that should be grouped
together.group_tbl
- how to group the returned rwtbl; likely
either c("Year")
or c("Year", "Month")
For example, a user might want to aggregate all summer months together (June - August). A relatively simple custom function for this would be:
summer <- function()
{
list(
fun = function(x) x,
filter_months = c("June", "July", "August"),
group_tbl = c("Year")
)
}
The fun
in the summer example simply returns the
original data, because it does not need to provide any complex logic for
grouping different years together. A more complex example would be
summarizing all winter months together (December - February). In this
example December from year 0 should be grouped with January and February
of year 1 for the winter of year 1 summary. The djf()
custom function would be:
djf <- function()
{
djf_convert <- function(rwtbl)
{
rwtbl %>%
dplyr::mutate_at(
"Timestep",
.funs = list("Year" = zoo::as.yearmon)
) %>%
# can use the ym_get_wateryear b/c djf are all in same water year
dplyr::mutate_at("Year", list(ym_get_wateryear))
}
list(
fun = djf_convert,
filter_months = month.name[c(12, 1, 2)],
group_tbl = c("Year")
)
}
In this example, fun
now includes a function that takes
in the original data (and expects a Timestep
column). The
time step is converted to a yearmon
object, and then the
years are re-assigned based on their water year (since December -
February are in the same water year).
rdf_aggregate()
ExampleTo summarize multiple slots at once, the first step is to define the
rwd_agg
object. In this example, the following summaries
are desired:
For this, we want to create the following table:
file | slot | period | summary | eval | t_s | variable |
---|---|---|---|---|---|---|
KeySlots.rdf | Mead.Pool Elevation | cy | min | <= | 1,060 | meadLt1060 |
KeySlots.rdf | Mead.Pool Elevation | eocy | NA | NA | NA | meadEocy |
KeySlots.rdf | Powell.Outflow | djf | sum | NA | 0.001 | powellDjfRel |
KeySlots.rdf | Powell.Outflow | July | NA | NA | NA | powellJulRel |
KeySlots.rdf | Powell.Outflow | wy | sum | NA | 1e-06 | powellWyRel |
SystemConditions.rdf | SummaryOutputData.LBShortageConditions | asis | NA | NA | NA | short |
Often times, it will be easier to define the table in Excel and then
read it in as a csv using read_rwd_agg(x)
but below shows
an example of defining it directly in R.
rwa1 <- rwd_agg(data.frame(
file = c(rep("KeySlots.rdf", 5), "SystemConditions.rdf"),
slot = c(
rep("Mead.Pool Elevation", 2),
rep("Powell.Outflow", 3),
"SummaryOutputData.LBShortageConditions"
),
period = c("cy", "eocy", "djf", "July", "wy", "asis"),
summary = c("min", NA, "sum", NA, "sum", NA),
eval = c("<=", rep(NA, 5)),
t_s = c(1060, NA, 0.001, NA, 0.000001, NA),
variable = c("meadLt1060", "meadEocy", "powellDjfRel", "powellJulRel",
"powellWyRel", "short"),
stringsAsFactors = FALSE
))
After the rwd_agg
is specified the object is passed to
rdf_aggregate()
along with a few parameters specifying
where the data are stored, and a tbl_df is returned:
rdf_aggregate(
rwa1,
rdf_dir = system.file(
"extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/",
package = "RWDataPlyr"
)
)
#> Processing 5 slots in KeySlots.rdf
#> Processing 1 slot in SystemConditions.rdf
#> # A tibble: 124 × 5
#> # Groups: Year [6]
#> Year Month TraceNumber Variable Value
#> * <dbl> <chr> <int> <chr> <dbl>
#> 1 2018 December 1 meadLt1060 0
#> 2 2018 December 2 meadLt1060 0
#> 3 2018 December 3 meadLt1060 0
#> 4 2018 December 4 meadLt1060 0
#> 5 2019 December 1 meadLt1060 0
#> 6 2019 December 2 meadLt1060 0
#> 7 2019 December 3 meadLt1060 0
#> 8 2019 December 4 meadLt1060 0
#> 9 2020 December 1 meadLt1060 0
#> 10 2020 December 2 meadLt1060 0
#> # ℹ 114 more rows
To keep all of the columns (attribute info), specify
keep_cols = TRUE
:
rdf_aggregate(
rwa1,
rdf_dir = system.file(
"extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/",
package = "RWDataPlyr"
),
keep_cols = TRUE
)
#> Processing 5 slots in KeySlots.rdf
#> Processing 1 slot in SystemConditions.rdf
#> # A tibble: 124 × 11
#> # Groups: Year, ObjectName, SlotName, ObjectType, Unit, TraceNumber,
#> # RulesetFileName [64]
#> Year Month ObjectName SlotName ObjectType Unit TraceNumber RulesetFileName
#> * <dbl> <chr> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 2018 Decem… Mead Pool El… LevelPowe… ft 1 "$CRSS_DIR\\ru…
#> 2 2018 Decem… Mead Pool El… LevelPowe… ft 2 "$CRSS_DIR\\ru…
#> 3 2018 Decem… Mead Pool El… LevelPowe… ft 3 "$CRSS_DIR\\ru…
#> 4 2018 Decem… Mead Pool El… LevelPowe… ft 4 "$CRSS_DIR\\ru…
#> 5 2019 Decem… Mead Pool El… LevelPowe… ft 1 "$CRSS_DIR\\ru…
#> 6 2019 Decem… Mead Pool El… LevelPowe… ft 2 "$CRSS_DIR\\ru…
#> 7 2019 Decem… Mead Pool El… LevelPowe… ft 3 "$CRSS_DIR\\ru…
#> 8 2019 Decem… Mead Pool El… LevelPowe… ft 4 "$CRSS_DIR\\ru…
#> 9 2020 Decem… Mead Pool El… LevelPowe… ft 1 "$CRSS_DIR\\ru…
#> 10 2020 Decem… Mead Pool El… LevelPowe… ft 2 "$CRSS_DIR\\ru…
#> # ℹ 114 more rows
#> # ℹ 3 more variables: InputDMIName <chr>, Variable <chr>, Value <dbl>
rw_scen_aggregate()
will aggregate and summarize
multiple scenarios, essentially calling rdf_aggregate()
for
each scenario. Similar to rdf_aggregate()
it relies on a
user specified rwd_agg
object to know how to summarize and
process the scenarios.
rw_scen_aggregate()
is setup to work with a directory
structure typical to a RiverSMART study. RiverSMART manages the output
of multiple scenarios by creating unique scenario folders in a
“Scenario” directory. The scen_dir
argument should point to
this scenario directory in a typical study setup. For example, if the
file system is setup as shown in figure @ref(fig:dirStruct),
scen_dir
should point to
“C:/user/crss/CRSS.Jan2017/Scenario” (or the correct relative path to
the same location).
rw_scen_aggregate()
also needs to know the scenario
folders and allows the user to specify scenario names that may differ
from the folder name so they are easier to use in R. In this example,
the scenarios
argument would be
c("ISM1988_2014,2007Dems,IG,Most", "ISM1988_2014,2007Dems,IG,2002")
.
The scenario names can be specified either through the
scen_names
argument, or by naming the
scenarios
vector. The following will obtain the same
results as defined by the earlier rwd_agg
object for both
of these scenarios.
my_scens <-c("ISM1988_2014,2007Dems,IG,Most", "ISM1988_2014,2007Dems,IG,2002")
names(my_scens) <- c("most", "yr2002")
scen_res <- rw_scen_aggregate(
my_scens,
agg = rwa1,
scen_dir = system.file("extdata/Scenario", package ="RWDataPlyr")
)
#> Processing 2 total scenarios.
#> ------------------
#> ** Starting scenario 1 of 2: ISM1988_2014,2007Dems,IG,Most
#> Processing 5 slots in KeySlots.rdf
#> Processing 1 slot in SystemConditions.rdf
#> ** Starting scenario 2 of 2: ISM1988_2014,2007Dems,IG,2002
#> Processing 5 slots in KeySlots.rdf
#> Processing 1 slot in SystemConditions.rdf
To compare results from multiple studies, i.e., the different top
level CRSS.JanNNNN folders in figure @ref(fig:dirStruct),
scen_dir
should point to “C:/user/crss” and
scenarios
should be assigned to
c("CRSS.Jan2017/Scenario/ISM1988_2014,2007Dems,IG,Most", "CRSS.Jan2018/Scenario/ISM1988_2014,2007Dems,IG,Most")
.
If scenarios
has a length of one, it will match the
results of rdf_aggregate()
if the scenario
parameter is used and the same as the scen_names
parameter
in rw_scen_aggregate()
.
getDataForAllScens()
is the legacy function that
preceded rw_scen_aggregate()
and
rdf_aggregate()
in RWDataPlyr v0.5.0 and earlier. It works
similarly to rw_scen_aggregate()
and relies on a
slot_agg_list
, which is analogous to the
rwd_agg
object, though it only allows for certain
pre-specified aggregation methods. The help pages include information on
these functions, though rw_scen_aggregate()
is superior and
preferred as of v0.6.0.
getDataForAllScens()
treats NaN
s as 0s in
its aggregation methods. The nans_are
parameter defaults to
"0"
in rdf_aggregate()
and
rw_scen_aggregate()
and thus matches the way
getDataForAllScens()
treats NaN
s.
RiverSMART includes a plugin
that allows R scripts to be executed from RiverSMART for post-processing
scenario(s). The exact configuration of the script and the RiverSMART R
plugin is beyond the scope of this vignette, but a user can pass
scenario folders and names to rw_scen_aggregate()
from
within RiverSMART, thus streamlining the post-processing steps.