Introduction
This article explains how to import and process data with the
annex
package when all required data is available the XLSX
format.
For this purpose, a demo data set called demo_UIBK.xlsx is used (shipped with the package) which contains two sheets:
- measurements: The data (e.g., from a logger or pre-processed)
- annex_configuration: Configuration (see article Config file) Save the demo file and/or your xlsx data file in your R working directory.
Importing the data into R is up to the user, the annex
package does not provide import functionality. For XLSX files a
multitude of R packages exist; this article is making use of the readxl
package, however, any other package which allows to read data from XLSX
files and return an object of class data.frame
will work as
well.
Reading the data
The first step is to import both (i) the measurement data (stored on
raw_df
) and (ii) the configuration (stored on
config
):
library("readxl")
raw_df <- read_excel("demo_UIBK.xlsx", sheet = "measurements")
config <- read_excel("demo_UIBK.xlsx", sheet = "annex_configuration")
# Class and dimension of the objects
c("raw_df" = is.data.frame(raw_df), "config" = is.data.frame(config))
## raw_df config
## TRUE TRUE
## raw_df config
## [1,] 1500 106
## [2,] 106 7
Both objects are of class data.frame
(tibble data frames
to be precise) with a dimension of \(1500
\times 106\) (raw_df
) and \(106 \times 7\) (config
)
respectively.
The first few observations (rows) of the two objects look as follows:
head(raw_df[, 1:3], n = 3) # First three columns only
## # A tibble: 3 × 3
## `Zeitraum ab` `T-WZ_W1 [°C] avg` `rH-WZ_W1 [% rH] avg`
## <dttm> <dbl> <dbl>
## 1 2011-10-01 00:05:00 22.6 55.6
## 2 2011-10-01 00:10:00 22.6 55.5
## 3 2011-10-01 00:15:00 22.6 55.4
head(config, n = 3)
## # A tibble: 3 × 7
## process column variable unit study home room
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 Zeitraum ab datetime NA NA NA NA
## 2 1 T-WZ_W1 [°C] avg T C demo W1 LIV
## 3 1 rH-WZ_W1 [% rH] avg rH % demo W1 LIV
The object raw_df
contains variables (columns) named
“Zeitraum ab”, “T-WZ_W1 [°C] avg”, “rH-WZ_W1 [% rH] avg” which are the
original names from the XLSX sheet, the config
object
contains the definition what the columns in raw_df
contains
and where they belong to. For more details read the article about the Config file.
The variable config$process
is a custom column
containing a logical value whether or not the column should be used for
further processing. We would like to only process the rows where
config$proces == TRUE
.
config <- subset(config, process == TRUE) # Remove all rows process = FALSE
As config
is a pure data.frame
, any
subsetting can be used.
Checking the config object
To check whether or not the config
object is as expected
by the annex
package, the function
annex_check_config()
can be used. In case problems would be
detected, an error will be thrown (see Config
file). Else, the function is silent as in this example:
library("annex")
annex_check_config(config)
… no errors, the config
object meets the
annex
requirements. Note that this step is not
necessary as it will be performed automatically when calling
annex_prepare()
but can be handy during development.
Preparing data
Based on raw_df
which still contains the raw data set
read from the XLSX file and the config
object with the
information about the renaming (and which columns to be used), the data
set for further processing can be created.
prepared_df <- annex_prepare(raw_df, config, quiet = TRUE)
head(prepared_df)
## datetime study home room RH SolRad T CO2 Other
## 1 2011-10-01 00:05:00 demo General AMB 88.4 3 13.35 NA NA
## 2 2011-10-01 00:10:00 demo General AMB 88.9 3 13.25 NA NA
## 3 2011-10-01 00:15:00 demo General AMB 89.2 3 13.17 NA NA
## 4 2011-10-01 00:20:00 demo General AMB 90.2 3 13.00 NA NA
## 5 2011-10-01 00:25:00 demo General AMB 90.9 3 12.86 NA NA
## 6 2011-10-01 00:30:00 demo General AMB 90.7 3 12.80 NA NA
annex_prepare()
performs a series of tasks:
- Checking the
config
object (callsannex_check_config()
internally). If theconfig
object is valid, - the variables (columns) in
raw_df
are renamed and checked to be of the correct class, - informs the user if there are any columns in
raw_df
not included inconfig
(just a note) and additional columns defined inconfig
which do not occur inraw_df
, and returns the modified (possibly subsetted) object, - ensures that
datetime
is a proper datetime object (POSIXt
).
The checks of missing/additional definitions in config
are intended to inform the user about possible misspecifications and
will not result in an error.
Ensure correct date and time (timezone)
Note: read_excel()
(from the readxl
)
package may automartically detect and convert date and time information
into POSIXct
(date and time object), however, as Excel has
no notion of timezones, the information returned will always be in
UTC!
Investigate the class and the time zone attribute attached:
class(prepared_df$datetime) # Class of our first variable (Zeitraum ab)
## [1] "POSIXct" "POSIXt"
attr(prepared_df$datetime, "tzone") # Time zone used
## [1] "UTC"
To enforce a different time zone (assuming the data has been recorded
in a time zone different than UTC) one need to acoerce the information
back into a character without time zone information (using
format()
) and back into POSIXct
which allows
to set the correct time zone. For demonstration (to show the command) we
convert the information to ‘US/Central’ and back to ‘UTC’.
head(prepared_df$datetime, n = 2) # First three entries
## [1] "2011-10-01 00:05:00 UTC" "2011-10-01 00:10:00 UTC"
# Coerce to 'US/Central'
prepared_df$datetime <- as.POSIXct(format(prepared_df$datetime), tz = "US/Central")
head(prepared_df$datetime, n = 2) # First three entries
## [1] "2011-10-01 00:05:00" "2011-10-01 00:10:00"
attr(prepared_df$datetime, "tzone")
## [1] "US/Central"
# Coerce back to 'UTC'
prepared_df$datetime <- as.POSIXct(format(prepared_df$datetime), tz = "UTC")
head(prepared_df$datetime, n = 2) # First three entries
## [1] "2011-10-01 00:05:00 UTC" "2011-10-01 00:10:00 UTC"
attr(prepared_df$datetime, "tzone")
## [1] "UTC"
The time zone conversion can also be done on the raw_df
object (if needed) but needs to be addressed before calling
annex()
(see Calculating
statistics (analysis)) to ensure that this information is processed
properly.