Prepare data from XLSX

Introduction

This article explains how to import and process data with the annex package when all required data is available the XLSX format.

For this purpose, a demo data set called demo_UIBK.xlsx is used (shipped with the package) which contains two sheets:

measurements: The data (e.g., from a logger or pre-processed)
annex_configuration: Configuration (see article Config file) Save the demo file and/or your xlsx data file in your R working directory.

Importing the data into R is up to the user, the annex package does not provide import functionality. For XLSX files a multitude of R packages exist; this article is making use of the readxl package, however, any other package which allows to read data from XLSX files and return an object of class data.frame will work as well.

Reading the data

The first step is to import both (i) the measurement data (stored on raw_df) and (ii) the configuration (stored on config):

library("readxl")
raw_df <- read_excel("demo_UIBK.xlsx", sheet = "measurements")
config <- read_excel("demo_UIBK.xlsx", sheet = "annex_configuration")

# Class and dimension of the objects
c("raw_df" = is.data.frame(raw_df), "config" = is.data.frame(config))

## raw_df config 
##   TRUE   TRUE

cbind("raw_df" = dim(raw_df), "config" = dim(config))

##      raw_df config
## [1,]   1500    106
## [2,]    106      7

Both objects are of class data.frame (tibble data frames to be precise) with a dimension of $1500 \times 106$ (raw_df) and $106 \times 7$ (config) respectively.

The first few observations (rows) of the two objects look as follows:

head(raw_df[, 1:3], n = 3) # First three columns only

## # A tibble: 3 × 3
##   `Zeitraum ab`       `T-WZ_W1 [°C] avg` `rH-WZ_W1 [% rH] avg`
##   <dttm>                           <dbl>                 <dbl>
## 1 2011-10-01 00:05:00               22.6                  55.6
## 2 2011-10-01 00:10:00               22.6                  55.5
## 3 2011-10-01 00:15:00               22.6                  55.4

head(config, n = 3)

## # A tibble: 3 × 7
##   process column              variable unit  study home  room 
##     <dbl> <chr>               <chr>    <chr> <chr> <chr> <chr>
## 1       1 Zeitraum ab         datetime NA    NA    NA    NA   
## 2       1 T-WZ_W1 [°C] avg    T        C     demo  W1    LIV  
## 3       1 rH-WZ_W1 [% rH] avg rH       %     demo  W1    LIV

The object raw_df contains variables (columns) named “Zeitraum ab”, “T-WZ_W1 [°C] avg”, “rH-WZ_W1 [% rH] avg” which are the original names from the XLSX sheet, the config object contains the definition what the columns in raw_df contains and where they belong to. For more details read the article about the Config file.

The variable config$process is a custom column containing a logical value whether or not the column should be used for further processing. We would like to only process the rows where config$proces == TRUE.

config <- subset(config, process == TRUE) # Remove all rows process = FALSE

As config is a pure data.frame, any subsetting can be used.

Checking the config object

To check whether or not the config object is as expected by the annex package, the function annex_check_config() can be used. In case problems would be detected, an error will be thrown (see Config file). Else, the function is silent as in this example:

library("annex")
annex_check_config(config)

… no errors, the config object meets the annex requirements. Note that this step is not necessary as it will be performed automatically when calling annex_prepare() but can be handy during development.

Preparing data

Based on raw_df which still contains the raw data set read from the XLSX file and the config object with the information about the renaming (and which columns to be used), the data set for further processing can be created.

prepared_df <- annex_prepare(raw_df, config, quiet = TRUE)
head(prepared_df)

##              datetime study    home room   RH SolRad     T CO2 Other
## 1 2011-10-01 00:05:00  demo General  AMB 88.4      3 13.35  NA    NA
## 2 2011-10-01 00:10:00  demo General  AMB 88.9      3 13.25  NA    NA
## 3 2011-10-01 00:15:00  demo General  AMB 89.2      3 13.17  NA    NA
## 4 2011-10-01 00:20:00  demo General  AMB 90.2      3 13.00  NA    NA
## 5 2011-10-01 00:25:00  demo General  AMB 90.9      3 12.86  NA    NA
## 6 2011-10-01 00:30:00  demo General  AMB 90.7      3 12.80  NA    NA

annex_prepare() performs a series of tasks:

Checking the config object (calls annex_check_config() internally). If the config object is valid,
the variables (columns) in raw_df are renamed and checked to be of the correct class,
informs the user if there are any columns in raw_df not included in config (just a note) and additional columns defined in config which do not occur in raw_df, and returns the modified (possibly subsetted) object,
ensures that datetime is a proper datetime object (POSIXt).

The checks of missing/additional definitions in config are intended to inform the user about possible misspecifications and will not result in an error.

Ensure correct date and time (timezone)

Note: read_excel() (from the readxl) package may automartically detect and convert date and time information into POSIXct (date and time object), however, as Excel has no notion of timezones, the information returned will always be in UTC!

Investigate the class and the time zone attribute attached:

class(prepared_df$datetime)         # Class of our first variable (Zeitraum ab)

## [1] "POSIXct" "POSIXt"

attr(prepared_df$datetime, "tzone") # Time zone used

## [1] "UTC"

To enforce a different time zone (assuming the data has been recorded in a time zone different than UTC) one need to acoerce the information back into a character without time zone information (using format()) and back into POSIXct which allows to set the correct time zone. For demonstration (to show the command) we convert the information to ‘US/Central’ and back to ‘UTC’.

head(prepared_df$datetime, n = 2) # First three entries

## [1] "2011-10-01 00:05:00 UTC" "2011-10-01 00:10:00 UTC"

# Coerce to 'US/Central'
prepared_df$datetime <- as.POSIXct(format(prepared_df$datetime), tz = "US/Central")
head(prepared_df$datetime, n = 2) # First three entries

## [1] "2011-10-01 00:05:00 CDT" "2011-10-01 00:10:00 CDT"

attr(prepared_df$datetime, "tzone")

## [1] "US/Central"

# Coerce back to 'UTC'
prepared_df$datetime <- as.POSIXct(format(prepared_df$datetime), tz = "UTC")
head(prepared_df$datetime, n = 2) # First three entries

## [1] "2011-10-01 00:05:00 UTC" "2011-10-01 00:10:00 UTC"

attr(prepared_df$datetime, "tzone")

## [1] "UTC"

The time zone conversion can also be done on the raw_df object (if needed) but needs to be addressed before calling annex() (see Calculating statistics (analysis)) to ensure that this information is processed properly.

Next steps

After performing the data preparation, the following steps can be performed: