Introduction
This article once again shows how to import and prepare a data set to
calculate the annex
statistics, focusing on the following
two aspects:
- Time zone: How to deal with time zones
- Unit conversion: Automatically convert units of the observations.
For this demonstration a file called
demo_US.xlsx
is used which already contains two sheets, one which contains simulated
observations, and one containing the configuration (meta info) which
allows to translate the data in the ‘observations sheet’ into the object
required by annex()
.
Please also have a look at the two articles Prepare data from XLSX and Prepare data from textfiles for more examples.
Scenario (1)
Imagine you got observations from a logger which was recording the date and time in the “US/Central” (CST) time and contains data from four different sensors:
- Kitchen room temperature (indoor); Fahrenheit
- Kitchen relative humidity (indoor); no unit
(
0.0
to1.0
) - An exhaust temperature (heating system); degrees Celsius
- Ambient air pressure; mmHg (millimetre of mercury)
Reading the data
Step one is to import the data; using the readxl
package once again:
library("readxl")
excel_sheets("demo_US.xlsx") # Checking sheet names
## [1] "meta_config" "airquality_data"
# Reading observations
raw_df <- read_excel("demo_US.xlsx", sheet = "airquality_data")
raw_df <- as.data.frame(raw_df) # Convert to simple data.frame
# Reading configuration
config <- read_excel("demo_US.xlsx", sheet = "meta_config")
config <- as.data.frame(config) # Convert to simple data.frame
A quick look at the first \(3\)
observations (rows) of the object raw_df
shows that we got
a variable loggertime
which seems to contain the date and
time of the observation in the US format
(mm/dd/yyyy HH:MM:SS
) which we know represents date and
time in the “US/Central” time. However, when looking at the class of the
variable raw_df$loggertime
we can see that it is of class
character (chr
; text) wherefore we need to properly convert
it before making use of it.
head(raw_df, n = 3)
## loggertime temperature.kitchen rel.hum.kitchen
## 1 12/28/2028 06:30:00 72.01654 0.89
## 2 12/28/2028 06:40:00 72.31357 0.96
## 3 12/28/2028 06:50:00 72.49413 0.75
## exhaust.temperature..Cels. air.pressure
## 1 47.24872 675.4629
## 2 46.95983 676.5550
## 3 46.62533 676.1809
class(raw_df$loggertime)
## [1] "character"
In addition, this simulated data set contains a periods with missing data as well as some periods where not all but single sensors did not report data which will be shown later.
Convert loggertime
To convert the date and time information into a proper datetime
object (an object of class POSIXct
) including the proper
time zone, we can call as.POSIXct()
.
Arguments:
-
x
: The character vectorraw_df$loggertime
as we have it right now in our object. -
format
: The format of the date and time information (see?strftime
to get a list of all available options). -
tz
: Time zone definition. Can be in the long format (e.g.,"US/Central"
) or short (e.g.,"CST"
). CallOlsonNames()
to get a list of time zones (long format).
# Convert and overwrite existing column
raw_df$loggertime <- as.POSIXct(x = raw_df$loggertime,
format = "%m/%d/%Y %H:%M:%S",
tz = "US/Central")
# Checking the variable
head(raw_df$loggertime)
## [1] "2028-12-28 06:30:00" "2028-12-28 06:40:00" "2028-12-28 06:50:00"
## [4] "2028-12-28 07:00:00" "2028-12-28 07:10:00" "2028-12-28 07:20:00"
attr(raw_df$loggertime, "tz") # Shows time zone abbreviation
## [1] "US/Central"
This step is important to ensure annex()
later on knows
how to interpret the date and time information for the aggregation. Note
that the time zone of the raw data and the time zone when calculating
the statistics can be different.
The config object
Next, let us have a look at the config
object.
config
## column variable study home room unit
## 1 loggertime datetime demo_US top_flat <NA> <NA>
## 2 temperature.kitchen T demo_US top_flat KIT F
## 3 rel.hum.kitchen RH demo_US top_flat KIT -
## 4 exhaust.temperature..Cels. T demo_US top_flat EHA C
## 5 air.pressure Pressure demo_US top_flat AMB mmHg
As described in detail in Config
file/objects this object contains information on how to interpret
the raw data set. As shown, the unit column defines that one temperature
is recorded in Fahrenheit (unit = "F"
), the relative
humidity without unit (unit = "-"
), and the air pressure in
millimeter of mercury (unit = 'mmHg'
) in the raw data.
This becomes important later when calling
annex_prepare()
.
Data overview
To better understand what happens let us have a quick look at the data.
summary(raw_df)
## loggertime temperature.kitchen rel.hum.kitchen
## Min. :2028-12-28 06:30:00.00 Min. :60.49 Min. :0.3300
## 1st Qu.:2028-12-29 12:20:00.00 1st Qu.:62.86 1st Qu.:0.8200
## Median :2028-12-31 09:00:00.00 Median :67.93 Median :0.8900
## Mean :2028-12-31 03:55:05.44 Mean :68.00 Mean :0.8928
## 3rd Qu.:2029-01-01 14:50:00.00 3rd Qu.:73.15 3rd Qu.:0.9600
## Max. :2029-01-03 00:20:00.00 Max. :75.72 Max. :1.1000
## NA's :8 NA's :33
## exhaust.temperature..Cels. air.pressure
## Min. :39.52 Min. :647.8
## 1st Qu.:40.16 1st Qu.:677.2
## Median :42.38 Median :680.1
## Mean :46.10 Mean :678.8
## 3rd Qu.:52.08 3rd Qu.:685.4
## Max. :60.20 Max. :701.2
## NA's :376
summary()
returns a five number summary plus the
arithmetic mean of each of the variables in our raw_df
object. As shown, the observations for the kitchen temperature (second
variable) has a mean of NA which corresponds to about 20 degrees
Celsius. The average relative humidity is NA which corresponds to about
NA percent.
Alternatively we can quickly convert the raw_df
into a
time series object (now the loggertime
has been prepared
correctly) to check the data visually which clearly shows the gaps in
the data:
Prepare the data
We can now use raw_df
in combination with our
config
object or prepare the data for the next step. In
this step, the units will automatically be converted (for those
variables the conversion is defined).
library("annex")
prepared_df <- annex_prepare(raw_df, config)
head(prepared_df)
## datetime study home room Pressure T RH
## 1 2028-12-28 06:30:00 demo_US top_flat AMB 900.5433 NA NA
## 2 2028-12-28 06:40:00 demo_US top_flat AMB 901.9993 NA NA
## 3 2028-12-28 06:50:00 demo_US top_flat AMB 901.5005 NA NA
## 4 2028-12-28 07:00:00 demo_US top_flat AMB 904.6508 NA NA
## 5 2028-12-28 07:10:00 demo_US top_flat AMB 905.8652 NA NA
## 6 2028-12-28 07:20:00 demo_US top_flat AMB 903.9762 NA NA
summary(prepared_df)
## datetime study home
## Min. :2028-12-28 06:30:00.00 Length:2151 Length:2151
## 1st Qu.:2028-12-29 12:20:00.00 Class :character Class :character
## Median :2028-12-31 09:00:00.00 Mode :character Mode :character
## Mean :2028-12-31 03:55:05.44
## 3rd Qu.:2029-01-01 14:50:00.00
## Max. :2029-01-03 00:20:00.00
##
## room Pressure T RH
## Length:2151 Min. :863.7 Min. :15.83 Min. : 33.00
## Class :character 1st Qu.:902.8 1st Qu.:20.03 1st Qu.: 82.00
## Mode :character Median :906.7 Median :39.57 Median : 89.00
## Mean :904.9 Mean :33.12 Mean : 89.28
## 3rd Qu.:913.8 3rd Qu.:42.39 3rd Qu.: 96.00
## Max. :934.9 Max. :60.20 Max. :110.00
## NA's :1810 NA's :725 NA's :1467
The function also reshapes the data, however, one can see that the temperature data (contains temperature data from the exhaust as well as the kitchen) now all is in degrees Celsius; the relative humidity has been automatically converted to percent.
Calculating statistics
Once the data set is prepared, annex()
can be called to
prepare the data set for the final calculations. Note: So far,
the date and time information is still in the time zone of the logger.
When calling annex()
we can specify in which time zone the
statistics should be calculated.
Correct solution: According to our scenario we know
that our study area is located in the “US/Central” time zone. This is
what we need to tell annex()
to correctly do the
conversion.
annex_df <- annex(T + RH + Pressure ~ datetime | study + home + room,
data = prepared_df, tz = "US/Central")
head(annex_df)
## datetime study home room year month tod T RH Pressure
## 1 2028-12-28 06:30:00 demo_US top_flat AMB 2028 12 23-07 NA NA 900.5433
## 2 2028-12-28 06:40:00 demo_US top_flat AMB 2028 12 23-07 NA NA 901.9993
## 3 2028-12-28 06:50:00 demo_US top_flat AMB 2028 12 23-07 NA NA 901.5005
## 4 2028-12-28 07:00:00 demo_US top_flat AMB 2028 12 07-23 NA NA 904.6508
## 5 2028-12-28 07:10:00 demo_US top_flat AMB 2028 12 07-23 NA NA 905.8652
## 6 2028-12-28 07:20:00 demo_US top_flat AMB 2028 12 07-23 NA NA 903.9762
As shown, the first three observations (before 7 am local time) fall
into the period tod = "23-07"
while the next ones fall into
the period tod = "07-23"
.
Scenario (2)
For illustration, let us imagine the same situation where our study
is located in the US/Central time zone, however, imagine the
loggertime
is reported in UTC.
Thus, what is needed is to (i) convert the loggertime
into UTC, but keep the further analysis steps relative to the US/Central
time zone (when calling annex()
).
# Re-reading the original data set; Store it on a new object raw_UTC
raw_UTC <- read_excel("demo_US.xlsx", sheet = "airquality_data")
# Convert loggertime; this time with 'tz = "UTC"'
raw_UTC$loggertime <- as.POSIXct(x = raw_UTC$loggertime,
format = "%m/%d/%Y %H:%M:%S",
tz = "UTC")
# Prepare the data relative to US/Central; same 'config' object as before
prepared_UTC <- annex_prepare(raw_UTC, config)
annex_UTC <- annex(T + RH + Pressure ~ datetime | study + home + room,
data = prepared_UTC, tz = "US/Central")
When comparing the first six observations (rows) of our object
annex_df
(all handled in US/Central time) and
annex_UTC
(raw data in UTC, processing in US/Central) one
can see the difference in the tod
variable (also affects
year
and month
) …
head(annex_df)
## datetime study home room year month tod T RH Pressure
## 1 2028-12-28 06:30:00 demo_US top_flat AMB 2028 12 23-07 NA NA 900.5433
## 2 2028-12-28 06:40:00 demo_US top_flat AMB 2028 12 23-07 NA NA 901.9993
## 3 2028-12-28 06:50:00 demo_US top_flat AMB 2028 12 23-07 NA NA 901.5005
## 4 2028-12-28 07:00:00 demo_US top_flat AMB 2028 12 07-23 NA NA 904.6508
## 5 2028-12-28 07:10:00 demo_US top_flat AMB 2028 12 07-23 NA NA 905.8652
## 6 2028-12-28 07:20:00 demo_US top_flat AMB 2028 12 07-23 NA NA 903.9762
head(annex_UTC)
## datetime study home room year month tod T RH Pressure
## 1 2028-12-28 06:30:00 demo_US top_flat AMB 2028 12 23-07 NA NA 900.5433
## 2 2028-12-28 06:40:00 demo_US top_flat AMB 2028 12 23-07 NA NA 901.9993
## 3 2028-12-28 06:50:00 demo_US top_flat AMB 2028 12 23-07 NA NA 901.5005
## 4 2028-12-28 07:00:00 demo_US top_flat AMB 2028 12 07-23 NA NA 904.6508
## 5 2028-12-28 07:10:00 demo_US top_flat AMB 2028 12 07-23 NA NA 905.8652
## 6 2028-12-28 07:20:00 demo_US top_flat AMB 2028 12 07-23 NA NA 903.9762