Introduction
This article explains how to import and process data with the
annex
package when the required data is available as tabular
text files (CSV).
Two files are used for demonstration: demo_Bedroom.txt (contains the measurement data) and demo_Bedroom_config.TXT (contains the configuration; see article Config file).
Both files can easily be read using base R functions, namely
read.table()
and the wrapper functions built around it, such as
read.csv()
, utils::read.delim()
etc. (see
?read.table
for more details).
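As a small illustration of how these functions relate, the sketch below writes a tiny comma-separated file to a temporary location and reads it once with read.csv() and once with the corresponding read.table() call. The file and its contents are made up for this example and are not the demo files.

```r
# Hypothetical toy data written to a temporary file (not the demo files)
tmp <- tempfile(fileext = ".txt")
writeLines(c("X,temp,humidity",
             "2011-01-01 00:01:26,18.8,51",
             "2011-01-01 00:06:25,18.8,51"), tmp)

# read.csv() is read.table() with different defaults (header, separator, ...)
a <- read.csv(tmp)
b <- read.table(tmp, header = TRUE, sep = ",", quote = "\"",
                dec = ".", fill = TRUE, comment.char = "")

identical(a, b)  # both calls should yield the same data frame
unlink(tmp)      # remove the temporary file again
```

Knowing which defaults differ helps when a file uses another separator or decimal mark, as only the corresponding argument needs to change.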
Reading the data
The first step is to import both (i) the measurement data (stored in
raw_df
) and (ii) the configuration (stored in
config
):
raw_df <- read.csv("demo_Bedroom.txt")
config <- read.table("demo_Bedroom_config.TXT",
comment.char = "#", sep = "",
header = TRUE, na.strings = c("NA", "empty"))
# see ?read.table for details
# Class and dimension of the objects
c("raw_df" = is.data.frame(raw_df), "config" = is.data.frame(config))
## raw_df config
## TRUE TRUE
cbind("raw_df" = dim(raw_df), "config" = dim(config))
## raw_df config
## [1,] 51890 7
## [2,] 8 6
Both objects are plain data frames (class data.frame
) with dimensions of \(51890 \times 8\) (raw_df
) and \(7 \times 6\) (config
),
respectively.
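The arguments passed to read.table() for the configuration file can be tried out in isolation. The following minimal sketch uses a made-up, config-like text (the values are assumptions for illustration, not the content of demo_Bedroom_config.TXT) to show the effect of comment.char, sep = "" (any whitespace) and na.strings.

```r
# Hypothetical config-like text; values are made up for illustration
txt <- c("# lines starting with '#' are ignored",
         "column    variable  study      unit",
         "X         datetime  empty      empty",
         "co2       CO2       DEMO_STUD  ppm",
         "temp      T         DEMO_STUD  NA")

cfg <- read.table(text = txt, comment.char = "#", sep = "",
                  header = TRUE, na.strings = c("NA", "empty"))

dim(cfg)          # number of rows and columns
is.na(cfg$study)  # which entries were read as missing
```

Both "NA" and "empty" end up as missing values, which is why the config object shown in this article contains <NA> in the row for the datetime column.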
The first few observations (rows) of the two objects look as follows:
head(raw_df[, 1:4], n = 3) # First four columns only
## X radonShortTermAvg temp humidity
## 1 2011-01-01 00:01:26 151 18.8 51
## 2 2011-01-01 00:06:25 151 18.8 51
## 3 2011-01-01 00:11:25 151 18.8 51
head(config, n = 3)
## column variable study unit home room
## 1 X datetime <NA> <NA> <NA> <NA>
## 2 co2 CO2 DEMO_STUD ppm Casa_Blanca Bed1
## 3 humidity rH DEMO_STUD % Casa_Blanca Bed1
The object raw_df
contains variables (columns) named
“X”, “radonShortTermAvg”, “temp”, “humidity”, among others, which are the original
names from the text file. The config
object defines what the columns in raw_df
contain and where
they belong. For more details read the article about the Config file.
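To illustrate how such a definition can be used to map original column names to new variable names, here is a minimal base R sketch based on match(). The objects df and map and their values are made up, and this is not necessarily how the annex package performs the renaming internally.

```r
# Toy data and mapping (made-up values); note the different row order in 'map'
df  <- data.frame(X = 1:2, co2 = c(470, 477), humidity = c(51, 51))
map <- data.frame(column   = c("humidity", "X", "co2"),
                  variable = c("RH", "datetime", "CO2"))

# For each column of 'df', find the matching row in 'map' and take its new name
idx <- match(names(df), map$column)
names(df) <- map$variable[idx]
names(df)
```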
Checking the config object
To check whether or not the config
object is as expected
by the annex
package, the function
annex_check_config()
can be used. If problems are detected, an error is thrown (see Config
file). Otherwise, the function is silent, as in this example:
library("annex")
annex_check_config(config)
… no errors, the config
object meets the
annex
requirements. Note that this step is optional, as the check is
performed automatically when calling
annex_prepare()
, but it can be handy during development.
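To make the idea of such a check more concrete, the following is a stand-alone sketch and not the implementation used by annex_check_config(). The helper check_config_sketch() is hypothetical, and the rules it applies (the six column names shown in the config output above must be present, and the entries in column must be unique) are assumptions made for illustration only.

```r
# Hypothetical helper; NOT the annex implementation.
# Assumed rules: required column names present, 'column' entries unique.
check_config_sketch <- function(config) {
    required <- c("column", "variable", "study", "unit", "home", "room")
    missing  <- setdiff(required, names(config))
    if (length(missing) > 0)
        stop("missing columns in config: ", paste(missing, collapse = ", "))
    if (anyDuplicated(config$column) > 0)
        stop("entries in `column` must be unique")
    invisible(TRUE)  # silent if everything is fine
}

# Toy config (made-up values) that passes the check
ok <- data.frame(column = c("X", "co2"), variable = c("datetime", "CO2"),
                 study = c(NA, "DEMO_STUD"), unit = c(NA, "ppm"),
                 home = c(NA, "Casa_Blanca"), room = c(NA, "Bed1"))
check_config_sketch(ok)

# Dropping a column triggers an error, captured here as a message
res <- tryCatch(check_config_sketch(ok[, -6]),
                error = function(e) conditionMessage(e))
res
```

Wrapping the call in tryCatch() is one way to inspect the error message during development without interrupting a script.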
Preparing data
While raw_df
contains the raw data set, the
config
object contains the information on how to rename the
columns and where the observations belong.
annex_prepare()
is a helper function to prepare the data
set for further steps.
prepared_df <- annex_prepare(raw_df, config, quiet = TRUE)
## [1] "datetime" "study" "home" "room" "CO2" "Pressure"
## [7] "Radon" "RH" "T" "VOC"
## Error in annex_prepare(raw_df, config, quiet = TRUE): variable `datetime` (originally column `X`) must be of class POSIXt
At this point we get an error because the variable containing the date
and time information is not a proper datetime object (object of class
POSIXt
) but a character vector. As the information comes in a
proper ISO format, we simply convert the column (column X
in raw_df
) and call annex_prepare()
again.
# see ?as.POSIXct for details and options
raw_df <- transform(raw_df, X = as.POSIXct(X, tz = "UTC"))
class(raw_df$X)
## [1] "POSIXct" "POSIXt"
prepared_df <- annex_prepare(raw_df, config, quiet = TRUE)
head(prepared_df)
## datetime study home room CO2 Pressure Radon RH T VOC
## 1 2011-01-01 00:01:26 DEMO_STUD Casa_Blanca BED1 470 1026.5 151 51 18.8 136
## 2 2011-01-01 00:06:25 DEMO_STUD Casa_Blanca BED1 477 1026.5 151 51 18.8 142
## 3 2011-01-01 00:11:25 DEMO_STUD Casa_Blanca BED1 483 1026.5 151 51 18.8 131
## 4 2011-01-01 00:16:25 DEMO_STUD Casa_Blanca BED1 477 1026.5 151 51 18.8 140
## 5 2011-01-01 00:21:25 DEMO_STUD Casa_Blanca BED1 481 1026.4 151 51 18.8 135
## 6 2011-01-01 00:26:25 DEMO_STUD Casa_Blanca BED1 483 1026.4 168 51 18.7 131
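The datetime conversion used above can also be checked on its own. The sketch below uses two made-up timestamps in the same format as column X and passes an explicit format string, which is an optional choice here and not something the original call requires.

```r
# Made-up timestamps in the same format as column X
x <- c("2011-01-01 00:01:26", "2011-01-01 00:06:25")

# An explicit format avoids guessing; unparsable strings become NA
dt <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

inherits(dt, "POSIXt")                # the class asked for in the error above
anyNA(dt)                             # TRUE would indicate strings that failed to parse
as.numeric(diff(dt), units = "secs")  # time step between the two records in seconds
```

Checking for NA values after the conversion is a cheap way to spot timestamps that do not follow the expected format before passing the data on.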
annex_prepare()
performs a series of tasks:
- checks the config object (calls annex_check_config() internally); if the config object is valid, it
- renames the variables (columns) in raw_df and checks that they are of the correct class,
- informs the user if there are any columns in raw_df not included in config (just a note) or additional columns defined in config which do not occur in raw_df,
- ensures that datetime is a proper datetime object (POSIXt), and
- returns the modified (possibly subsetted) object.
The checks of missing/additional definitions in config
are intended to inform the user about possible misspecifications and
will not result in an error.
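A comparable check of missing and additional definitions can be sketched in base R with setdiff(). The two name vectors below are made up for illustration, and the messages issued by annex_prepare() itself may be worded and structured differently.

```r
# Toy name vectors (made up): 'extra' has no definition in the config,
# 'voc' is defined in the config but absent from the data
raw_names <- c("X", "co2", "temp", "extra")
cfg_names <- c("X", "co2", "temp", "voc")

setdiff(raw_names, cfg_names)  # columns in the data without a definition
setdiff(cfg_names, raw_names)  # definitions without a matching column
```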