Calculate Statistics on Annex object

Usage

annex_stats(object, format = "wide", ..., probs = NULL)

Arguments

object: an object of class annex.
format: character, either "wide" (default) or "long".
...: currently unused.
probs: NULL (default; see Details) or a numeric vector of probabilities with values in [0, 1]. Values are rounded to three digits.

Value

Returns an object of class c("annex_stats", "data_frame").

Details

The statistics can be returned in either wide or long format. Both can be used when calling annex_write_stats(), but having the choice between long and wide format can be handy for custom applications (e.g., plotting).

Argument probs is forwarded to stats::quantile(). If probs = NULL (default), the empirical quantiles are calculated from 0 (the minimum) to 1 (the maximum) in steps of 0.01 (one percent), plus the quantiles 0.005, 0.025, 0.975 and 0.995. The user can specify probs differently if needed; however, this no longer yields the standard statistics and the validation will report a problem.
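The default set of probabilities described above can be reproduced in a few lines. This is a minimal sketch, not the package's internal code: the example data x is hypothetical, and the way the vector is built here (seq() plus four extra values, rounded to three digits) is an assumption based on the description.

```r
# Default probabilities as described above: 0 to 1 in steps of 0.01,
# plus 0.005, 0.025, 0.975 and 0.995.
probs <- sort(unique(round(c(seq(0, 1, by = 0.01),
                             0.005, 0.025, 0.975, 0.995), 3)))

# Hypothetical example data (not taken from the package).
set.seed(1)
x <- rnorm(1000, mean = 21, sd = 2)

# Empirical quantiles; type = 7 is the default of stats::quantile().
q <- stats::quantile(x, probs = probs, type = 7, na.rm = TRUE)

length(probs)  # number of quantiles calculated
```

With these settings the first element of q equals the minimum of x and the last element equals the maximum.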

Statistics

Grouping: Statistics are calculated on different subsets (or groups), typically study, home, room, year, month, and tod (time of day). However, this set can vary depending on the user's call to annex (see argument formula).

annex_stats calculates a series of data/quality flags as well as statistical measures.

Quality: quality_lower and quality_upper contain the fraction of observations (in percent) falling below the lower and above the upper defined threshold, respectively (see annex_variable_definition). quality_start and quality_end contain the day (date only) on which the first and last non-missing observations were recorded for the current group; these are used to estimate Nestim (see below).
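As a rough illustration of the quality flags, the sketch below computes them for a small hypothetical series. The values, timestamps and thresholds are made up, and two details are assumptions rather than documented behavior: that the upper flag counts values above the upper threshold, and that the percentage is relative to the non-missing observations.

```r
# Hypothetical observations and thresholds (illustration only).
x     <- c(14.2, 18.5, 21.0, 23.4, 31.7, NA, 19.9, 35.1)
lower <- 15
upper <- 30

# Share of non-missing observations outside the thresholds, in percent.
quality_lower <- 100 * mean(x < lower, na.rm = TRUE)
quality_upper <- 100 * mean(x > upper, na.rm = TRUE)

# Hypothetical timestamps, one observation per day.
time <- as.POSIXct("2023-03-01 00:00:00", tz = "UTC") +
  (seq_along(x) - 1) * 86400

# Date (without time) of the first and last non-missing observation.
quality_start <- as.Date(min(time[!is.na(x)]))
quality_end   <- as.Date(max(time[!is.na(x)]))
```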

Interval: Time increments between all non-missing observations are calculated in seconds. The interval_ columns show the five-number summary plus the arithmetic mean of these increments. interval_Median is used to calculate the estimate Nestim (see below).
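The interval summary can be sketched as follows. The timestamps are hypothetical, and the use of stats::fivenum() (Tukey's hinges) and the element names are assumptions; the package may calculate the quartiles differently.

```r
# Hypothetical timestamps: 5-minute sampling with one gap (illustration only).
time <- as.POSIXct("2023-03-01 07:00:00", tz = "UTC") +
  c(0, 300, 600, 900, 1800, 2100)

# Time increments between consecutive observations, in seconds.
increments <- as.numeric(diff(time), units = "secs")

# Five-number summary plus the arithmetic mean.
interval <- c(stats::fivenum(increments), mean(increments))
names(interval) <- c("Min", "Q1", "Median", "Q3", "Max", "Mean")
interval
```

The single 15-minute gap raises the mean but leaves the median at 300 seconds, which suggests why the median is the more robust basis for Nestim.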

Nestim: Number of estimated observations (see section below).
N: Number of non-missing observations.
NAs: Number of missing observations (NA in the data set).
Mean: $$\bar{x} = \frac{1}{N} \sum_{i = 1}^N x_i$$ (arithmetic mean).
Sd: $$\text{sd}(x) = \sqrt{\frac{1}{N - 1} \sum_{i = 1}^N (x_i - \bar{x})^2}$$ (standard deviation).
p: Probabilities for different quantiles. p00 represents the overall minimum, p50 the median, p100 the overall maximum of all non-missing values. Uses the empirical quantile function with type = 7 (default; see quantile).

Note: If N - NAs is lower than 30, both Mean and Sd are set to NA.
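The measures listed above can be written out directly from their definitions. This is a sketch on hypothetical data, not the package's implementation; the 30-observation rule from the note is deliberately left out.

```r
# Hypothetical sample with two missing values (illustration only).
set.seed(42)
x <- c(rnorm(50, mean = 21, sd = 2), NA, NA)

N   <- sum(!is.na(x))  # number of non-missing observations
NAs <- sum(is.na(x))   # number of missing observations

xv   <- x[!is.na(x)]                        # non-missing values only
Mean <- sum(xv) / N                         # arithmetic mean
Sd   <- sqrt(sum((xv - Mean)^2) / (N - 1))  # standard deviation

# p00 (minimum), p50 (median) and p100 (maximum) via type = 7 quantiles.
p <- stats::quantile(xv, probs = c(0, 0.5, 1), type = 7)
```

Mean and Sd computed this way agree with base R's mean() and sd() on the non-missing values.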

Estimated number of observations

The value Nestim contains an estimate of the number of possible observations for a specific group. This estimate is based on the first and last date on which a non-missing observation was available, on the year, month, and tod of the group, and on interval_Median.

As an example: Imagine the statistics for temperature observations for one specific year and month (monthly level aggregation) with tod = "07-23". The first non-missing value was reported on the first day of the month, the last one on day 10. Given that tod = "07-23" covers 16 hours, observations could be available for 16 hours per day over 10 days = 160 hours in total. Together with the best guess interval_Median, this allows Nestim to be calculated. E.g., if the median interval is 300 (300 seconds = 5 minutes), this leads to a possible number of observations Nestim = 10 days * 16 hours per day * 3600 seconds per hour / 300 seconds = 1920. Keep in mind that this is only an estimate or best guess!
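The arithmetic of this example can be checked with a few lines of R. The variable names are illustrative only and do not correspond to package internals.

```r
days            <- 10   # day 1 to day 10 with non-missing observations
hours_per_day   <- 16   # tod = "07-23" covers 16 hours
interval_median <- 300  # median interval in seconds (5 minutes)

# Possible number of observations within the covered period.
Nestim <- days * hours_per_day * 3600 / interval_median
Nestim  # 1920
```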

Author

Reto Stauffer