Calculate Statistics on Annex object
Details
The function can return the statistics in either a wide or a long format.
Both can be used when calling annex_write_stats(), but the long/wide
format can be handy for custom applications (e.g., plotting, ...).
Argument probs will be forwarded to the stats::quantile() function.
If probs = NULL (default) the empirical quantiles will be calculated
from 0 (the minimum) up to 1 (the maximum) in steps of
0.01 (one percent steps), plus the quantiles 0.005,
0.025, 0.975 and 0.995. The user can specify different probabilities
if needed; however, this no longer yields the standard statistics
and the validation will report a problem.
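The default set of probabilities described above can be reproduced with a few lines of base R. This is a minimal sketch reconstructed from the text, not taken from the package source; the data vector x is made up for illustration.

```r
# Default probabilities as described in the text (assumption: reconstructed
# from the description, not copied from the annex source code).
probs <- sort(unique(c(0.005, 0.025, 0.975, 0.995, seq(0, 1, by = 0.01))))

# Illustrative data only; any numeric vector works.
set.seed(1)
x <- rnorm(500, mean = 21, sd = 2)

# type = 7 is the default of stats::quantile()
q <- stats::quantile(x, probs = probs, type = 7, na.rm = TRUE)

length(probs)   # 101 one-percent steps plus the 4 additional probabilities
```

The first and last elements of q correspond to the minimum and maximum of x, matching the description of p00 and p100.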
Statistics
Grouping: Statistics are calculated on different subsets (or groups),
typically study, home, room, year, month,
tod (time of day). However, this set can vary depending on the user's
function call to annex (see argument formula).
annex_stats calculates a series of data/quality flags as well as statistical
measures.
Quality: quality_lower and quality_upper contain the fraction of
observations (in percent) falling below the lower and above the upper defined
threshold, respectively (see annex_variable_definition).
quality_start and quality_end contain the day (date only)
on which the first and the last non-missing observation was given for the
current group; used to estimate Nestim (see below).
Interval: Time increments of all non-missing observations are calculated in seconds.
The interval_ columns show the five-number summary plus the arithmetic mean of these
intervals. interval_Median is used to estimate Nestim (see below).
Nestim: Number of estimated observations (see section below)
N: Number of non-missing observations
NAs: Number of missing observations (NA in the data set)
Mean: $$\bar{x} = \frac{1}{N} \sum_{i = 1}^N x_i$$ (arithmetic mean)
Sd: $$\text{sd}(x) = \sqrt{\frac{1}{N - 1} \sum_{i = 1}^N \big( (x_i - \bar{x})^2\big)}$$
p: Empirical quantiles for different probabilities. p00 represents the overall minimum,
p50 the median, p100 the overall maximum of all non-missing values. Uses
the empirical quantile function with type = 7 (default; see quantile).
Note: If N - NAs is lower than 30, both Mean and Sd will be set to NA!
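To make the measures listed above more concrete, the following sketch computes a few of them for a single group in base R. The helper sketch_stats() is hypothetical and not part of the annex package. It makes three assumptions: percentages are taken relative to the non-missing observations, quality_upper counts values above the upper threshold, and the minimum sample size of 30 is applied to the number of non-missing values. The package itself may differ in these details.

```r
# Hypothetical helper for illustration only (not the annex implementation).
sketch_stats <- function(x, datetime, lower, upper, min_n = 30) {
    stopifnot(is.numeric(x), inherits(datetime, "POSIXct"),
              length(x) == length(datetime))
    ok <- !is.na(x)
    N  <- sum(ok)
    # Time increments (in seconds) between consecutive non-missing observations
    dt <- diff(as.numeric(sort(datetime[ok])))
    list(
        # Assumption: percent of non-missing values outside the thresholds
        quality_lower   = 100 * mean(x[ok] < lower),
        quality_upper   = 100 * mean(x[ok] > upper),
        interval_Median = if (length(dt) > 0) stats::median(dt) else NA_real_,
        N    = N,
        NAs  = sum(!ok),
        # Assumption: Mean and Sd need at least min_n non-missing values
        Mean = if (N >= min_n) mean(x[ok]) else NA_real_,
        Sd   = if (N >= min_n) stats::sd(x[ok]) else NA_real_
    )
}

# 40 observations at 5-minute steps, one of them missing
datetime <- as.POSIXct("2023-01-01 00:00:00", tz = "UTC") +
            seq(0, by = 300, length.out = 40)
x   <- c(rep(20, 19), NA, rep(22, 20))
res <- sketch_stats(x, datetime, lower = 21, upper = 30)
str(res)
```

Using the median rather than the mean of the time increments keeps the interval estimate robust against occasional gaps, such as the single missing value in this example.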
Estimated number of observations
The value Nestim contains an estimate for the number of possible observations
for a specific group. This estimate is based on the first/last date an observation
was available (non-missing) as well as the year, month, and tod. Last but not least
the interval_Median is used.
As an example: Imagine the statistics for temperature observations for one specific
year and month (monthly level aggregation) with tod = "07-23". The first non-missing
value has been reported on the first day of the month, the last one on day 10.
Given that tod = "07-23" covers 16 hours, this indicates that observations could
be available 16 hours over 10 days = 160 hours in total. Based on the best guess
for interval_Median this allows Nestim to be calculated. E.g., if the median interval
is 300 (300 seconds = 5 minutes) this would lead to a possible number of observations
Nestim = 10 days * 16 hours per day * 3600 seconds per hour / 300 seconds = 1920.
Keep in mind that this is only an estimate or best guess!
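The arithmetic of the example can be written out in R as follows. This is a sketch of the calculation described in the text, not the package's internal code; all variable names are chosen for illustration.

```r
# Worked example from the text; variable names are illustrative.
tod             <- "07-23"   # time-of-day window
n_days          <- 10        # first to last day with non-missing data
interval_median <- 300       # median time increment in seconds

# Hours covered by the time-of-day window ("07-23" covers 16 hours)
hours         <- as.integer(strsplit(tod, "-")[[1]])
hours_per_day <- diff(hours)

# Estimated number of possible observations
Nestim <- n_days * hours_per_day * 3600 / interval_median
Nestim
```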
