The goal of the modeling and prediction engine is to determine a model, using a set of past sensor observations, to forecast future values. The key premise is that the physical phenomena observed by sensors exhibit long-term and short-term correlations and past values can be used to predict the future. This is true for weather phenomena such as temperature that exhibit long-term seasonal variations as well as short-term time-of-day and hourly variations. Similarly phenomena such as traffic at an intersection exhibit correlations based on the hour of the day (e.g., traffic peaks during ``rush'' hours) and day of the week (e.g., there is less traffic on weekends). PRESTO proxies rely on seasonal ARIMA models; ARIMA is a popular family of time-series models that are commonly used for studying weather and stock market data. Seasonal ARIMA models (also known as SARIMA) are a class of ARIMA models that are suitable for data exhibiting seasonal trends and are well-suited for sensor data. Further they offer a way to deal with non-stationary data i.e. whose statistical properties change over time [1]. Last, as we demonstrate later, while seasonal ARIMA models are computationally expensive to construct, they are inexpensive to check at the remote sensors--an important property we seek from our system. The rest of this section presents the details of our SARIMA model and its use within PRESTO.
Prediction Model: A discrete time series can be represented by a
set of time-ordered data
, resulting
from observation of some temporal physical phenomenon such as
temperature or humidity. Samples are assumed to be taken at discrete
time instants
. The goal of time-series analysis is
to obtain the parameters of the underlying physical process that
governs the observed time-series and use this model to forecast future
values.
PRESTO models the time series of observations at a sensor as an Autoregressive Integrated Moving Average (ARIMA) process. In particular, the data is assumed to conform to the Box-Jenkins SARIMA model [1]. While a detailed discussion of SARIMA models is outside the scope of this paper, we provide the intuition behind these models for the benefit of the reader. An SARIMA process has four components: auto-regressive (AR), moving-average (MA), one-step differencing, and seasonal differencing. The AR component estimates the current sample as a linear weighted sum of previous samples; the MA component captures relationship between prediction errors; the one-step differencing component captures relationship between adjacent samples; and the seasonal differencing component captures the diurnal, monthly, or yearly patterns in the data. In SARIMA, the MA component is modeled as a zero-mean, uncorrelated Gaussian random variable (also referred to as white noise). The AR component captures the temporal correlation in the time series by modeling a future value as a function of a number of past values.
In its most general form, the Box-Jenkins seasonal model is said to
have an order
; the order of the model
captures the dependence of the predicted value on prior values. In
SARIMA,
and
are the orders of the auto-regressive (AR) and
moving average (MA) processes,
and
are orders of the seasonal
AR and MA components,
is the order of differencing,
is the
order of seasonal differencing, and
is the seasonal period of the
series. Thus, SARIMA is family of models depending on the integral
values of
.
2
Model Identification and Parameter Estimation: Given the general
SARIMA model, the proxy needs to determine the order of the model,
including the order of differential and the order of auto-regression
and moving average. That is, the values of ,
,
,
,
and
need to be determined. This step is called model identification
and is typically performed once during system initialization. Model
identification is well documented in most time series textbooks
[1] and we only provide a high level overview
here. Intuitively, since the general model is actually a family of
models, depending on the values of
,
, etc., this phase
identifies a particular model from the family that best captures the
variations exhibited by the underlying data. It is somewhat analogous
to fitting a curve on a set of data values. Model identification
involves collecting a sample time series from the field and computing
its auto-correlation function (ACF) and partial auto-correlation
function (PACF). A series of tests are then performed on the ACF and
the PACF to determine the order of the model [1].
Our analysis of temperature traces has shown that the best model
for temperature data is a Seasonal ARIMA of order
. The general model in Equation
1 reduces to
When employed for a temperature monitoring application, PRESTO proxies
are seeded with a
SARIMA model. The
seasonal period
is also seeded. The parameters
and
are then computed by the proxy during the initial training phase
before the system becomes operational. The training phase involves
gathering a data set from each sensor and using the least squares method
to estimate the values of parameters
and
on a
per-sensor basis (see [1] for the detailed procedure
for estimating these parameters). The order of the model and the
values of
and
are then conveyed to each sensor.
Section 5 explains how
and
can be
periodically refined to adapt to any long-term changes in the sensed
data that occurs after the initial training phase.
Model-based Predictions: Once the model order and its parameters
have been determined, using it for predicting future values is a
simple task. The predicted value for time
is simply given
as:
Since PRESTO sensors push a value to the proxy only when it deviates
from the prediction by more than a threshold, the actual values of
,
and
seen at the sensor may not be
known to the proxy. However, since the lack of a push indicates that
the model predictions are accurate,
the proxy can simply
use the corresponding model predictions as an
approximation for the actual values in Equation 3. In
this case, the corresponding prediction error
is set to
zero. In the event
,
or
were either
pushed by the sensor or pulled by the proxy,
the actual values and the actual prediction errors can used in
Equation 3.
Both the proxy and the sensors use Equation 3 to
predict each sampled value. At the proxy, the predictions serve as a
substitute for the actual values seen by the sensor and are used to
answer queries that might request the data. At the sensor, the
prediction is used to determine whether to push--the sensed value is
pushed only if the prediction error exceeds a threshold .
Finally, we note the asymmetric property of our model. The initial model identification and parameter estimation is a compute-intensive task performed by the proxy. Once determined, predicting a value using the model involves no more than eight floating point operations (three multiplications and five additions/subtractions, as shown in Equation 3). This is inexpensive even on resource-poor sensor nodes such as Motes and can be approximated using fixed point arithmetic.