Skip to contents

Introduction

This article once again shows how to import and prepare a data set to calculate the annex statistics, focusing on the following two aspects:

  • Time zone: How to deal with time zones
  • Unit conversion: Automatically convert units of the observations.

For this demonstration a file called demo_US.xlsx is used which already contains two sheets, one which contains simulated observations, and one containing the configuration (meta info) which allows to translate the data in the ‘observations sheet’ into the object required by annex().

Please also have a look at the two articles Prepare data from XLSX and Prepare data from textfiles for more examples.

Scenario (1)

Imagine you got observations from a logger which was recording the date and time in the “US/Central” (CST) time and contains data from four different sensors:

  • Kitchen room temperature (indoor); Fahrenheit
  • Kitchen relative humidity (indoor); no unit (0.0 to 1.0)
  • An exhaust temperature (heating system); degrees Celsius
  • Ambient air pressure; mmHg (millimetre of mercury)

Reading the data

Step one is to import the data; using the readxl package once again:

library("readxl")
excel_sheets("demo_US.xlsx") # Checking sheet names
## [1] "meta_config"     "airquality_data"
# Reading observations
raw_df <- read_excel("demo_US.xlsx", sheet = "airquality_data")
raw_df <- as.data.frame(raw_df) # Convert to simple data.frame
# Reading configuration
config <- read_excel("demo_US.xlsx", sheet = "meta_config")
config <- as.data.frame(config) # Convert to simple data.frame

A quick look at the first \(3\) observations (rows) of the object raw_df shows that we got a variable loggertime which seems to contain the date and time of the observation in the US format (mm/dd/yyyy HH:MM:SS) which we know represents date and time in the “US/Central” time. However, when looking at the class of the variable raw_df$loggertime we can see that it is of class character (chr; text) wherefore we need to properly convert it before making use of it.

head(raw_df, n = 3)
##            loggertime temperature.kitchen rel.hum.kitchen
## 1 12/28/2028 06:30:00            72.01654            0.89
## 2 12/28/2028 06:40:00            72.31357            0.96
## 3 12/28/2028 06:50:00            72.49413            0.75
##   exhaust.temperature..Cels. air.pressure
## 1                   47.24872     675.4629
## 2                   46.95983     676.5550
## 3                   46.62533     676.1809
class(raw_df$loggertime)
## [1] "character"

In addition, this simulated data set contains a periods with missing data as well as some periods where not all but single sensors did not report data which will be shown later.

Convert loggertime

To convert the date and time information into a proper datetime object (an object of class POSIXct) including the proper time zone, we can call as.POSIXct().

Arguments:

  • x: The character vector raw_df$loggertime as we have it right now in our object.
  • format: The format of the date and time information (see ?strftime to get a list of all available options).
  • tz: Time zone definition. Can be in the long format (e.g., "US/Central") or short (e.g., "CST"). Call OlsonNames() to get a list of time zones (long format).
# Convert and overwrite existing column
raw_df$loggertime <- as.POSIXct(x = raw_df$loggertime,
                                format = "%m/%d/%Y %H:%M:%S",
                                tz = "US/Central")
# Checking the variable
head(raw_df$loggertime)
## [1] "2028-12-28 06:30:00" "2028-12-28 06:40:00" "2028-12-28 06:50:00"
## [4] "2028-12-28 07:00:00" "2028-12-28 07:10:00" "2028-12-28 07:20:00"
attr(raw_df$loggertime, "tz")   # Shows time zone abbreviation
## [1] "US/Central"

This step is important to ensure annex() later on knows how to interpret the date and time information for the aggregation. Note that the time zone of the raw data and the time zone when calculating the statistics can be different.

The config object

Next, let us have a look at the config object.

config
##                       column variable   study     home room unit
## 1                 loggertime datetime demo_US top_flat <NA> <NA>
## 2        temperature.kitchen        T demo_US top_flat  KIT    F
## 3            rel.hum.kitchen       RH demo_US top_flat  KIT    -
## 4 exhaust.temperature..Cels.        T demo_US top_flat  EHA    C
## 5               air.pressure Pressure demo_US top_flat  AMB mmHg

As described in detail in Config file/objects this object contains information on how to interpret the raw data set. As shown, the unit column defines that one temperature is recorded in Fahrenheit (unit = "F"), the relative humidity without unit (unit = "-"), and the air pressure in millimeter of mercury (unit = 'mmHg') in the raw data.

This becomes important later when calling annex_prepare().

Data overview

To better understand what happens let us have a quick look at the data.

summary(raw_df)
##    loggertime                     temperature.kitchen rel.hum.kitchen 
##  Min.   :2028-12-28 06:30:00.00   Min.   :60.49       Min.   :0.3300  
##  1st Qu.:2028-12-29 12:20:00.00   1st Qu.:62.86       1st Qu.:0.8200  
##  Median :2028-12-31 09:00:00.00   Median :67.93       Median :0.8900  
##  Mean   :2028-12-31 03:55:05.44   Mean   :68.00       Mean   :0.8928  
##  3rd Qu.:2029-01-01 14:50:00.00   3rd Qu.:73.15       3rd Qu.:0.9600  
##  Max.   :2029-01-03 00:20:00.00   Max.   :75.72       Max.   :1.1000  
##                                   NA's   :8           NA's   :33      
##  exhaust.temperature..Cels.  air.pressure  
##  Min.   :39.52              Min.   :647.8  
##  1st Qu.:40.16              1st Qu.:677.2  
##  Median :42.38              Median :680.1  
##  Mean   :46.10              Mean   :678.8  
##  3rd Qu.:52.08              3rd Qu.:685.4  
##  Max.   :60.20              Max.   :701.2  
##                             NA's   :376

summary() returns a five number summary plus the arithmetic mean of each of the variables in our raw_df object. As shown, the observations for the kitchen temperature (second variable) has a mean of NA which corresponds to about 20 degrees Celsius. The average relative humidity is NA which corresponds to about NA percent.

Alternatively we can quickly convert the raw_df into a time series object (now the loggertime has been prepared correctly) to check the data visually which clearly shows the gaps in the data:

library("zoo")
raw_zoo <- zoo(raw_df[, -1], raw_df[, 1])
plot(raw_zoo)

Prepare the data

We can now use raw_df in combination with our config object or prepare the data for the next step. In this step, the units will automatically be converted (for those variables the conversion is defined).

library("annex")
prepared_df <- annex_prepare(raw_df, config)
head(prepared_df)
##              datetime   study     home room Pressure  T RH
## 1 2028-12-28 06:30:00 demo_US top_flat  AMB 900.5433 NA NA
## 2 2028-12-28 06:40:00 demo_US top_flat  AMB 901.9993 NA NA
## 3 2028-12-28 06:50:00 demo_US top_flat  AMB 901.5005 NA NA
## 4 2028-12-28 07:00:00 demo_US top_flat  AMB 904.6508 NA NA
## 5 2028-12-28 07:10:00 demo_US top_flat  AMB 905.8652 NA NA
## 6 2028-12-28 07:20:00 demo_US top_flat  AMB 903.9762 NA NA
summary(prepared_df)
##     datetime                         study               home          
##  Min.   :2028-12-28 06:30:00.00   Length:2151        Length:2151       
##  1st Qu.:2028-12-29 12:20:00.00   Class :character   Class :character  
##  Median :2028-12-31 09:00:00.00   Mode  :character   Mode  :character  
##  Mean   :2028-12-31 03:55:05.44                                        
##  3rd Qu.:2029-01-01 14:50:00.00                                        
##  Max.   :2029-01-03 00:20:00.00                                        
##                                                                        
##      room              Pressure           T               RH        
##  Length:2151        Min.   :863.7   Min.   :15.83   Min.   : 33.00  
##  Class :character   1st Qu.:902.8   1st Qu.:20.03   1st Qu.: 82.00  
##  Mode  :character   Median :906.7   Median :39.57   Median : 89.00  
##                     Mean   :904.9   Mean   :33.12   Mean   : 89.28  
##                     3rd Qu.:913.8   3rd Qu.:42.39   3rd Qu.: 96.00  
##                     Max.   :934.9   Max.   :60.20   Max.   :110.00  
##                     NA's   :1810    NA's   :725     NA's   :1467

The function also reshapes the data, however, one can see that the temperature data (contains temperature data from the exhaust as well as the kitchen) now all is in degrees Celsius; the relative humidity has been automatically converted to percent.

Calculating statistics

Once the data set is prepared, annex() can be called to prepare the data set for the final calculations. Note: So far, the date and time information is still in the time zone of the logger. When calling annex() we can specify in which time zone the statistics should be calculated.

Correct solution: According to our scenario we know that our study area is located in the “US/Central” time zone. This is what we need to tell annex() to correctly do the conversion.

annex_df <- annex(T + RH + Pressure ~ datetime | study + home + room,
                  data = prepared_df, tz = "US/Central")
head(annex_df)
##              datetime   study     home room year month   tod  T RH Pressure
## 1 2028-12-28 06:30:00 demo_US top_flat  AMB 2028    12 23-07 NA NA 900.5433
## 2 2028-12-28 06:40:00 demo_US top_flat  AMB 2028    12 23-07 NA NA 901.9993
## 3 2028-12-28 06:50:00 demo_US top_flat  AMB 2028    12 23-07 NA NA 901.5005
## 4 2028-12-28 07:00:00 demo_US top_flat  AMB 2028    12 07-23 NA NA 904.6508
## 5 2028-12-28 07:10:00 demo_US top_flat  AMB 2028    12 07-23 NA NA 905.8652
## 6 2028-12-28 07:20:00 demo_US top_flat  AMB 2028    12 07-23 NA NA 903.9762

As shown, the first three observations (before 7 am local time) fall into the period tod = "23-07" while the next ones fall into the period tod = "07-23".

Scenario (2)

For illustration, let us imagine the same situation where our study is located in the US/Central time zone, however, imagine the loggertime is reported in UTC.

Thus, what is needed is to (i) convert the loggertime into UTC, but keep the further analysis steps relative to the US/Central time zone (when calling annex()).

# Re-reading the original data set; Store it on a new object raw_UTC
raw_UTC <- read_excel("demo_US.xlsx", sheet = "airquality_data")
# Convert loggertime; this time with 'tz = "UTC"'
raw_UTC$loggertime <- as.POSIXct(x = raw_UTC$loggertime,
                                 format = "%m/%d/%Y %H:%M:%S",
                                 tz = "UTC")
# Prepare the data relative to US/Central; same 'config' object as before
prepared_UTC <- annex_prepare(raw_UTC, config)
annex_UTC    <- annex(T + RH + Pressure ~ datetime | study + home + room,
                      data = prepared_UTC, tz = "US/Central")

When comparing the first six observations (rows) of our object annex_df (all handled in US/Central time) and annex_UTC (raw data in UTC, processing in US/Central) one can see the difference in the tod variable (also affects year and month) …

head(annex_df)
##              datetime   study     home room year month   tod  T RH Pressure
## 1 2028-12-28 06:30:00 demo_US top_flat  AMB 2028    12 23-07 NA NA 900.5433
## 2 2028-12-28 06:40:00 demo_US top_flat  AMB 2028    12 23-07 NA NA 901.9993
## 3 2028-12-28 06:50:00 demo_US top_flat  AMB 2028    12 23-07 NA NA 901.5005
## 4 2028-12-28 07:00:00 demo_US top_flat  AMB 2028    12 07-23 NA NA 904.6508
## 5 2028-12-28 07:10:00 demo_US top_flat  AMB 2028    12 07-23 NA NA 905.8652
## 6 2028-12-28 07:20:00 demo_US top_flat  AMB 2028    12 07-23 NA NA 903.9762
head(annex_UTC)
##              datetime   study     home room year month   tod  T RH Pressure
## 1 2028-12-28 06:30:00 demo_US top_flat  AMB 2028    12 23-07 NA NA 900.5433
## 2 2028-12-28 06:40:00 demo_US top_flat  AMB 2028    12 23-07 NA NA 901.9993
## 3 2028-12-28 06:50:00 demo_US top_flat  AMB 2028    12 23-07 NA NA 901.5005
## 4 2028-12-28 07:00:00 demo_US top_flat  AMB 2028    12 07-23 NA NA 904.6508
## 5 2028-12-28 07:10:00 demo_US top_flat  AMB 2028    12 07-23 NA NA 905.8652
## 6 2028-12-28 07:20:00 demo_US top_flat  AMB 2028    12 07-23 NA NA 903.9762