Prepare data from text files

Introduction

This article explains how to import and process data with the annex package when the required data is available as tabular text files (e.g., CSV).

To demonstrate this, two files are used: demo_Bedroom.txt (contains the measurement data) and demo_Bedroom_config.TXT (contains the configuration; see the article Config file).

Both files can easily be read using base R functions, namely read.table() and its wrappers such as read.csv() and read.delim() (see ?read.table for more details).
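To illustrate how read.csv() relates to read.table(), the following minimal sketch writes a tiny CSV file to a temporary location and reads it both ways. The file content is made up for illustration and is not the demo data set.

```r
# Write a tiny CSV to a temporary file (made-up values for illustration)
tmp <- tempfile(fileext = ".txt")
writeLines(c("X,temp,humidity",
             "2011-01-01 00:01:26,18.8,51",
             "2011-01-01 00:06:25,18.8,51"), tmp)

# read.csv() is read.table() with CSV-friendly defaults
a <- read.csv(tmp)
b <- read.table(tmp, header = TRUE, sep = ",", quote = "\"",
                dec = ".", fill = TRUE, comment.char = "")

# Both calls should yield the same data.frame
identical(a, b)
```

If a file uses a different separator or decimal mark, calling read.table() directly with explicit sep and dec arguments is often the most transparent option.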

Reading the data

The first step is to import both (i) the measurement data (stored in raw_df) and (ii) the configuration (stored in config):

raw_df <- read.csv("demo_Bedroom.txt")
config <- read.table("demo_Bedroom_config.TXT",
                     comment.char = "#", sep = "",
                     header = TRUE, na.strings = c("NA", "empty"))
                     # see ?read.table for details

# Class and dimension of the objects
c("raw_df" = is.data.frame(raw_df), "config" = is.data.frame(config))

## raw_df config 
##   TRUE   TRUE

cbind("raw_df" = dim(raw_df), "config" = dim(config))

##      raw_df config
## [1,]  51890      7
## [2,]      8      6

Both objects are of class data.frame, with dimensions $51890 \times 8$ (raw_df) and $7 \times 6$ (config), respectively.
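The arguments passed to read.table() for the configuration file can be tried out on a small stand-in file. The sketch below assumes a whitespace-separated file with a comment line and the placeholder "empty" for missing entries; the file content is hypothetical and only mimics the structure of the demo configuration.

```r
# Hypothetical stand-in for a config file (not the actual demo file)
tmp <- tempfile(fileext = ".TXT")
writeLines(c("# comment lines are skipped",
             "column  variable  study      unit   home         room",
             "X       datetime  empty      empty  empty        empty",
             "co2     CO2       DEMO_STUD  ppm    Casa_Blanca  Bed1"), tmp)

cfg <- read.table(tmp,
                  comment.char = "#",             # skip lines starting with '#'
                  sep = "",                       # any amount of whitespace separates fields
                  header = TRUE,                  # first non-comment line holds the names
                  na.strings = c("NA", "empty"))  # both are read as missing values
cfg
```

With sep = "", columns may be aligned with an arbitrary number of spaces or tabs, which keeps a hand-edited config file readable.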

The first few observations (rows) of the two objects look as follows:

head(raw_df[, 1:4], n = 3) # First four columns only

##                     X radonShortTermAvg temp humidity
## 1 2011-01-01 00:01:26               151 18.8       51
## 2 2011-01-01 00:06:25               151 18.8       51
## 3 2011-01-01 00:11:25               151 18.8       51

head(config, n = 3)

##     column variable     study unit        home room
## 1        X datetime      <NA> <NA>        <NA> <NA>
## 2      co2      CO2 DEMO_STUD  ppm Casa_Blanca Bed1
## 3 humidity       rH DEMO_STUD    % Casa_Blanca Bed1

The object raw_df contains variables (columns) named “X”, “radonShortTermAvg”, “temp”, “humidity”, etc., which are the original names from the text file. The config object defines what the columns in raw_df contain and which study, home, and room they belong to. For more details read the article about the Config file.

Checking the config object

To check whether the config object is as expected by the annex package, the function annex_check_config() can be used. If a problem is detected, an error is thrown (see Config file). Otherwise, the function is silent, as in this example:

library("annex")
annex_check_config(config)

No error is thrown, so the config object meets the annex requirements. Note that this step is optional, as the check is performed automatically when calling annex_prepare(), but it can be handy during development.
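During development it can be convenient to catch such an error rather than letting it stop a script. The sketch below shows the pattern with tryCatch() using a hypothetical stand-in function, check_config_sketch(). It is not the annex implementation, and the set of required columns is an assumption made for illustration only.

```r
# Hypothetical stand-in for a config check; NOT the annex implementation.
# The 'required' column names are an assumption for this sketch.
check_config_sketch <- function(config,
                                required = c("column", "variable", "study", "home", "room")) {
    stopifnot(is.data.frame(config))
    missing_cols <- setdiff(required, names(config))
    if (length(missing_cols) > 0)
        stop("config is missing column(s): ", paste(missing_cols, collapse = ", "))
    if (anyDuplicated(config$column) > 0)
        stop("entries in config$column must be unique")
    invisible(TRUE)  # silent on success
}

good <- data.frame(column   = c("X", "co2"),
                   variable = c("datetime", "CO2"),
                   study    = c(NA, "DEMO_STUD"),
                   home     = c(NA, "Casa_Blanca"),
                   room     = c(NA, "Bed1"),
                   stringsAsFactors = FALSE)
bad <- good[, c("column", "variable")]  # study/home/room removed

check_config_sketch(good)  # silent, no error

# Catch the error and keep its message instead of aborting
res <- tryCatch(check_config_sketch(bad),
                error = function(e) conditionMessage(e))
res
```

The same tryCatch() pattern can be wrapped around annex_check_config(config) to log a problem and continue, for example when looping over several config files.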

Preparing data

While raw_df contains the raw data set, the config object contains the information on how to rename the columns and where the observations belong. annex_prepare() is a helper function to prepare the data set for further steps.

prepared_df <- annex_prepare(raw_df, config, quiet = TRUE)

##  [1] "datetime" "study"    "home"     "room"     "CO2"      "Pressure"
##  [7] "Radon"    "RH"       "T"        "VOC"

## Error in annex_prepare(raw_df, config, quiet = TRUE): variable `datetime` (originally column `X`) must be of class POSIXt

At this point we get an error, as the variable containing the date and time information is not a proper datetime object (an object of class POSIXt) but a character vector. As the information comes in a standard ISO format, we simply convert the column (column X in raw_df) and call annex_prepare() again.

# see ?as.POSIXct for details and options
raw_df <- transform(raw_df, X = as.POSIXct(X, tz = "UTC"))
class(raw_df$X)

## [1] "POSIXct" "POSIXt"

prepared_df <- annex_prepare(raw_df, config, quiet = TRUE)
head(prepared_df)

##              datetime     study        home room CO2 Pressure Radon RH    T VOC
## 1 2011-01-01 00:01:26 DEMO_STUD Casa_Blanca BED1 470   1026.5   151 51 18.8 136
## 2 2011-01-01 00:06:25 DEMO_STUD Casa_Blanca BED1 477   1026.5   151 51 18.8 142
## 3 2011-01-01 00:11:25 DEMO_STUD Casa_Blanca BED1 483   1026.5   151 51 18.8 131
## 4 2011-01-01 00:16:25 DEMO_STUD Casa_Blanca BED1 477   1026.5   151 51 18.8 140
## 5 2011-01-01 00:21:25 DEMO_STUD Casa_Blanca BED1 481   1026.4   151 51 18.8 135
## 6 2011-01-01 00:26:25 DEMO_STUD Casa_Blanca BED1 483   1026.4   168 51 18.7 131
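The conversion to POSIXct used above relies on the timestamps being in a standard format. As a minimal sketch (with made-up values), an explicit format can be passed to as.POSIXct(), in which case entries that cannot be parsed become NA and can be located before calling annex_prepare():

```r
# Made-up timestamps; the third entry is deliberately invalid
x <- c("2011-01-01 00:01:26", "2011-01-01 00:06:25", "not a date")

# With an explicit format, unparsable entries are turned into NA
dt <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

inherits(dt, "POSIXt")  # check for the class required for 'datetime'
which(is.na(dt))        # positions of entries that failed to parse
```

Setting tz explicitly avoids timestamps being interpreted in the local time zone of the machine running the code.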

annex_prepare() performs a series of tasks:

Checks the config object (calls annex_check_config() internally).
If the config object is valid, renames the variables (columns) in raw_df and checks that they are of the correct class.
Informs the user (just a note) about columns in raw_df that are not included in config, and about columns defined in config that do not occur in raw_df.
Ensures that datetime is a proper datetime object (POSIXt).
Returns the modified (possibly subsetted) object.

The checks for missing or additional definitions in config are intended to inform the user about possible misspecifications and do not result in an error.
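The renaming and reporting steps described above can be sketched in base R. This is only an illustration of the idea under simplifying assumptions, not the actual implementation of annex_prepare(); the objects raw and cfg are small made-up stand-ins for raw_df and config.

```r
# Small stand-ins for raw_df and config (made-up values)
raw <- data.frame(X = as.POSIXct(c("2011-01-01 00:01:26",
                                   "2011-01-01 00:06:25"), tz = "UTC"),
                  co2 = c(470, 477),
                  battery = c(98, 98))             # not defined in cfg
cfg <- data.frame(column   = c("X", "co2", "humidity"),  # humidity not in raw
                  variable = c("datetime", "CO2", "RH"),
                  stringsAsFactors = FALSE)

# Report mismatches as notes only, without raising an error
not_in_config <- setdiff(names(raw), cfg$column)
not_in_data   <- setdiff(cfg$column, names(raw))
if (length(not_in_config) > 0)
    message("Columns not defined in config: ", paste(not_in_config, collapse = ", "))
if (length(not_in_data) > 0)
    message("Columns in config missing in data: ", paste(not_in_data, collapse = ", "))

# Keep the columns defined in both and rename them as specified in cfg
keep <- intersect(cfg$column, names(raw))
out  <- raw[, keep, drop = FALSE]
names(out) <- cfg$variable[match(keep, cfg$column)]

# The datetime variable must be a proper datetime object
stopifnot(inherits(out$datetime, "POSIXt"))
names(out)
```

Using message() rather than stop() for the mismatches mirrors the behaviour described above, where such cases are reported but do not abort the preparation.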

Next steps

After performing the data preparation, the following steps can be performed:

Reto Stauffer
