rdf_to_rwtbl2() nor by
rdf_aggregate() or by rw_scen_aggregate(),
which both rely on rdf_to_rwtbl2().

bigrdf_to_rwtbl() works similarly to rdf_to_rwtbl2(), creating the same ‘long’ data frame, but does not store it in memory.

bigrdf_to_rwtbl() returns a connection to the parquet file, but not the data itself.

This might take ~5 minutes to run. It will process 20 traces at a time, and will report out as it starts each chunk of 20 traces.
library(RWDataPlyr)
library(dplyr)
library(stringr)

# path to the large rdf file
rdf_path <- "//manoa.colorado.edu/bor/Shared/P26/Dec202024_NA_CCS_FA_Final_Runs/2016Dems,CRMMS_Trace12,ICSon,SuperEnsembleV3,CCS.9087.mdl,CCS.9047.rls/AZWU.rdf"
# parse the rdf 20 traces at a time; zz is a connection to the parquet file, not the data
zz <- bigrdf_to_rwtbl(rdf_path, scenario = 'CCS', n_trace_per_chunk = 20)
# print the connection
zz

Then, you can work with the data in a typical dplyr pipeline, and move it into memory once it is smaller. Use collect() to move it into memory.
# get the annual depletion requested for each user
df <- zz |>
  filter(str_ends(ObjectSlot, '\\.Depletion Requested')) |>
  group_by(Scenario, ObjectSlot, Year) |>
  summarise(Value = sum(Value)) |>
  collect()

Now df is a reasonable size (0.1 MB) to keep in memory and use as you normally would.
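The same reduce-then-collect pattern can be tried without the large rdf file. The sketch below is only an illustration under stated assumptions: the toy data frame, its values, and the `toy_rwtbl` directory name are hypothetical, and it uses the arrow package directly rather than bigrdf_to_rwtbl(). Grouping by TraceNumber is a choice made for this sketch so that each trace keeps its own annual total.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data in a 'long' layout: 2 traces, 2 slots, 2 years, 12 months
toy <- expand.grid(
  TraceNumber = 1:2,
  ObjectSlot = c("UserA.Depletion Requested", "UserA.Diversion Requested"),
  Year = 2025:2026,
  Month = 1:12,
  stringsAsFactors = FALSE
)
attr(toy, "out.attrs") <- NULL  # drop expand.grid bookkeeping before writing
toy$Scenario <- "CCS"
toy$Value <- 1

# Write the toy data as a parquet dataset in a temporary directory,
# then open a connection to it (no data is loaded yet)
toy_dir <- file.path(tempdir(), "toy_rwtbl")
write_dataset(toy, toy_dir, format = "parquet")
con <- open_dataset(toy_dir)

# The query is only evaluated when collect() is called
small <- con |>
  filter(endsWith(ObjectSlot, ".Depletion Requested")) |>
  group_by(Scenario, ObjectSlot, TraceNumber, Year) |>
  summarise(Value = sum(Value)) |>
  collect()

# Check how much memory the collected result uses
print(object.size(small), units = "Kb")
```

Only the summarised rows are pulled into memory by collect(); the full table stays on disk.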
collect(zz), but this will take ~14 GB of free memory

n_trace_per_chunk variable controls how many traces are parsed on each call to the C++ code.
rdf_to_rwtbl2()).

bigrdf_move(df, 'path/to/move/to')

arrow::open_dataset('path/to/move/to') to reconnect to these data in the future

rdf_to_rwtbl2() is faster than bigrdf_to_rwtbl(), so for smaller rdf files that’s probably still the preferred method for getting the data into R.

If this seems to help/work, then rdf_aggregate() and rw_scen_aggregate() will be enhanced to also be able to work with ‘big’ rdfs.
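To picture what n_trace_per_chunk does, the following base R sketch splits a set of trace numbers into chunks of 20. It is not the package's internal implementation; the total of 100 traces is a hypothetical value chosen for illustration.

```r
# Hypothetical total number of traces in an rdf file
n_traces <- 100
n_trace_per_chunk <- 20

trace_ids <- seq_len(n_traces)

# Assign each trace to a chunk index (1, 1, ..., 2, 2, ...), then split
chunk_id <- ceiling(trace_ids / n_trace_per_chunk)
chunks <- split(trace_ids, chunk_id)

# Report as each chunk starts, similar in spirit to the progress messages
for (i in seq_along(chunks)) {
  message("Chunk ", i, ": traces ", min(chunks[[i]]), "-", max(chunks[[i]]))
}
```

A smaller chunk size would mean more, smaller chunks, and a larger one fewer, bigger chunks.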