The goal of the modeling and prediction engine is to determine a model from a set of past sensor observations and use it to forecast future values. The key premise is that the physical phenomena observed by sensors exhibit long-term and short-term correlations, so past values can be used to predict future ones. This is true for weather phenomena such as temperature, which exhibit long-term seasonal variations as well as short-term time-of-day and hourly variations. Similarly, phenomena such as traffic at an intersection exhibit correlations based on the hour of the day (e.g., traffic peaks during ``rush'' hours) and the day of the week (e.g., there is less traffic on weekends). PRESTO proxies rely on seasonal ARIMA models; ARIMA is a popular family of time-series models commonly used for studying weather and stock market data. Seasonal ARIMA models (also known as SARIMA) are a class of ARIMA models suited to data exhibiting seasonal trends, which makes them a good fit for sensor data. Further, they offer a way to deal with non-stationary data, i.e., data whose statistical properties change over time [1]. Finally, as we demonstrate later, while seasonal ARIMA models are computationally expensive to construct, they are inexpensive to check at the remote sensors, an important property we seek from our system. The rest of this section presents the details of our SARIMA model and its use within PRESTO.
Prediction Model: A discrete time series can be represented by a set of time-ordered data $X_{t_1}, X_{t_2}, \ldots, X_{t_n}$, resulting from observation of some temporal physical phenomenon such as temperature or humidity. Samples are assumed to be taken at discrete time instants $t_1, t_2, \ldots, t_n$. The goal of time-series analysis is to obtain the parameters of the underlying physical process that governs the observed time series and to use this model to forecast future values.
PRESTO models the time series of observations at a sensor as an Autoregressive Integrated Moving Average (ARIMA) process. In particular, the data is assumed to conform to the Box-Jenkins SARIMA model [1]. While a detailed discussion of SARIMA models is outside the scope of this paper, we provide the intuition behind these models for the benefit of the reader. A SARIMA process has four components: auto-regressive (AR), moving-average (MA), one-step differencing, and seasonal differencing. The AR component estimates the current sample as a linear weighted sum of previous samples; the MA component captures the relationship between prediction errors; the one-step differencing component captures the relationship between adjacent samples; and the seasonal differencing component captures the diurnal, monthly, or yearly patterns in the data. In SARIMA, the prediction errors that drive the MA component are modeled as zero-mean, uncorrelated Gaussian random variables (also referred to as white noise). The AR component captures the temporal correlation in the time series by modeling a future value as a function of a number of past values.
In its most general form, the Box-Jenkins seasonal model is said to have an order $(p,d,q)\times(P,D,Q)_S$; the order of the model captures the dependence of the predicted value on prior values. In SARIMA, $p$ and $q$ are the orders of the auto-regressive (AR) and moving-average (MA) processes, $P$ and $Q$ are the orders of the seasonal AR and MA components, $d$ is the order of differencing, $D$ is the order of seasonal differencing, and $S$ is the seasonal period of the series. Thus, SARIMA is a family of models, one for each combination of integer values of $p, q, P, Q, d, D$ and $S$.
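For reference, the multiplicative seasonal model is commonly written in textbooks in backshift-operator form, as sketched below. The notation is an assumption here and follows the usual Box-Jenkins convention: $B$ is the backshift operator with $B^i X_t = X_{t-i}$, $e_t$ is the white-noise error term, and $\phi_p$, $\Phi_P$, $\theta_q$, $\Theta_Q$ are polynomials of the indicated degree. Sign and symbol choices vary between texts, so this may differ in detail from the form used elsewhere in the paper.

```latex
\phi_p(B)\,\Phi_P(B^S)\,(1-B)^d\,(1-B^S)^D\,X_t \;=\; \theta_q(B)\,\Theta_Q(B^S)\,e_t,
\qquad\text{where}\qquad
\begin{aligned}
\phi_p(B)     &= 1 - \phi_1 B - \cdots - \phi_p B^p, &
\Phi_P(B^S)   &= 1 - \Phi_1 B^S - \cdots - \Phi_P B^{PS},\\
\theta_q(B)   &= 1 - \theta_1 B - \cdots - \theta_q B^q, &
\Theta_Q(B^S) &= 1 - \Theta_1 B^S - \cdots - \Theta_Q B^{QS}.
\end{aligned}
```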
Model Identification and Parameter Estimation: Given the general SARIMA model, the proxy needs to determine the order of the model, including the order of differencing and the orders of auto-regression and moving average. That is, the values of $p$, $d$, $q$, $P$, $D$ and $Q$ need to be determined. This step is called model identification and is typically performed once during system initialization. Model identification is well documented in most time-series textbooks [1], and we only provide a high-level overview here. Intuitively, since the general model is actually a family of models, depending on the values of $p$, $q$, etc., this phase identifies the particular model from the family that best captures the variations exhibited by the underlying data. It is somewhat analogous to fitting a curve to a set of data values. Model identification involves collecting a sample time series from the field and computing its auto-correlation function (ACF) and partial auto-correlation function (PACF). A series of tests is then performed on the ACF and the PACF to determine the order of the model [1].
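The ACF computation that underlies model identification can be sketched in a few lines. This is a minimal illustration, not the identification procedure of [1]: the function names `sample_acf` and `difference` are hypothetical, only the sample ACF is shown (the PACF and the order-selection tests are omitted), and the example series is synthetic.

```python
def sample_acf(series, max_lag):
    """Sample auto-correlation of `series` for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance = sum(c * c for c in centered)
    if variance == 0:
        raise ValueError("series is constant; the ACF is undefined")
    acf = []
    for lag in range(max_lag + 1):
        # Covariance between the series and a copy of itself shifted by `lag`.
        cov = sum(centered[t] * centered[t - lag] for t in range(lag, n))
        acf.append(cov / variance)
    return acf


def difference(series, lag=1):
    """Differencing: lag=1 is one-step, lag=S is seasonal differencing."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


if __name__ == "__main__":
    # Synthetic series with a period of 4 samples (an assumption for the demo).
    series = [0, 1, 0, -1] * 10
    acf = sample_acf(series, max_lag=8)
    # A strong positive peak at lag 4 hints at a seasonal period of 4.
    print([round(r, 2) for r in acf])
    # Seasonal differencing at that lag removes the repeating pattern.
    print(difference(series, lag=4)[:8])
```

A peak in the ACF at some lag suggests a seasonal period, and an ACF that decays slowly suggests that differencing is needed before the remaining orders are chosen.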
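Our analysis of temperature traces has shown that the best model for temperature data is a seasonal ARIMA of order $(0,1,1)\times(0,1,1)_S$. The general model in Equation 1 reduces to
$$(1-B)(1-B^S)\,X_t = (1-\theta B)(1-\Theta B^S)\,e_t \tag{2}$$
where $B$ is the backshift operator ($B^i X_t = X_{t-i}$), $X_t$ is the data value at time $t$, $e_t$ is the prediction error at time $t$, and $\theta$ and $\Theta$ are the parameters of the model.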
When employed for a temperature monitoring application, PRESTO proxies are seeded with a $(0,1,1)\times(0,1,1)_S$ SARIMA model. The seasonal period $S$ is also seeded. The parameters $\theta$ and $\Theta$ are then computed by the proxy during the initial training phase, before the system becomes operational. The training phase involves gathering a data set from each sensor and using the least squares method to estimate the values of $\theta$ and $\Theta$ on a per-sensor basis (see [1] for the detailed procedure for estimating these parameters). The order of the model and the values of $\theta$ and $\Theta$ are then conveyed to each sensor. Section 5 explains how $\theta$ and $\Theta$ can be periodically refined to adapt to any long-term changes in the sensed data that occur after the initial training phase.
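To make the least-squares idea concrete, the sketch below picks the two MA parameters that minimize the sum of squared one-step prediction errors over a training series. It is a simplified stand-in for the estimation procedure in [1], not a reproduction of it: it uses a coarse grid search, assumes a model with one-step and seasonal differencing plus one non-seasonal and one seasonal MA term, assumes the Box-Jenkins sign convention, and treats errors before the first predictable sample as zero. All function names are hypothetical.

```python
def one_step_errors(series, theta, big_theta, period):
    """Prediction errors obtained by running the model over a training series."""
    errors = [0.0] * len(series)  # assumption: unknown early errors are zero
    for t in range(period + 1, len(series)):
        predicted = (series[t - 1] + series[t - period] - series[t - period - 1]
                     - theta * errors[t - 1]
                     - big_theta * errors[t - period]
                     + theta * big_theta * errors[t - period - 1])
        errors[t] = series[t] - predicted
    return errors


def sum_squared_error(series, theta, big_theta, period):
    return sum(e * e for e in one_step_errors(series, theta, big_theta, period))


def fit_parameters(series, period):
    """Return (theta, big_theta, sse) with the smallest squared error on a grid."""
    if len(series) <= period + 1:
        raise ValueError("need more than period + 1 training samples")
    grid = [k / 10 for k in range(-9, 10)]  # -0.9 .. 0.9 in steps of 0.1
    best = None
    for theta in grid:
        for big_theta in grid:
            sse = sum_squared_error(series, theta, big_theta, period)
            if best is None or sse < best[2]:
                best = (theta, big_theta, sse)
    return best


if __name__ == "__main__":
    # Synthetic training trace: trend + seasonal pattern + deterministic jitter.
    period = 4
    trace = [20 + 0.1 * t + [0, 2, 3, 1][t % period] + ((t * 37) % 11 - 5) / 10
             for t in range(80)]
    print(fit_parameters(trace, period))
```

A grid search keeps the example short; a real proxy would more likely use an iterative non-linear least-squares routine, since the errors depend non-linearly on the two parameters.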
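Model-based Predictions: Once the model order and its parameters have been determined, using the model to predict future values is a simple task. The predicted value $\hat{X}_t$ for time $t$ is given by:
$$\hat{X}_t = X_{t-1} + X_{t-S} - X_{t-S-1} - \theta e_{t-1} - \Theta e_{t-S} + \theta\Theta e_{t-S-1} \tag{3}$$
where $e_t = X_t - \hat{X}_t$ denotes the prediction error at time $t$.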
Since PRESTO sensors push a value to the proxy only when it deviates from the prediction by more than a threshold, the actual values of $X_{t-1}$, $X_{t-S}$ and $X_{t-S-1}$ seen at the sensor may not be known to the proxy. However, since the lack of a push indicates that the model predictions are accurate, the proxy can simply use the corresponding model predictions as an approximation for the actual values in Equation 3. In this case, the corresponding prediction error is set to zero. If $X_{t-1}$, $X_{t-S}$ or $X_{t-S-1}$ was either pushed by the sensor or pulled by the proxy, the actual values and the actual prediction errors can be used in Equation 3.
Both the proxy and the sensors use Equation 3 to predict each sampled value. At the proxy, the predictions serve as a substitute for the actual values seen by the sensor and are used to answer queries that might request the data. At the sensor, the prediction is used to determine whether to push: the sensed value is pushed only if the prediction error exceeds a threshold $\delta$.
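The interplay between prediction, substitution, and pushing can be illustrated with a small simulation. This is a sketch under stated assumptions rather than PRESTO's implementation: the class `SarimaPredictor` and the function `run_sensor` are hypothetical names, the recursion assumes the Box-Jenkins sign convention for the two MA parameters, and the sensor is assumed to mirror the proxy's state (substituting its own prediction for any value it did not push) so that both sides compute identical predictions.

```python
class SarimaPredictor:
    """Keeps the value and error history needed for one-step prediction."""

    def __init__(self, theta, big_theta, period, initial_values):
        if len(initial_values) < period + 1:
            raise ValueError("need at least period + 1 initial values")
        self.theta = theta
        self.big_theta = big_theta
        self.period = period
        self.values = list(initial_values)
        self.errors = [0.0] * len(initial_values)  # assumed zero at start-up

    def predict(self):
        v, e, s = self.values, self.errors, self.period
        t = len(v)  # index of the sample about to be predicted
        return (v[t - 1] + v[t - s] - v[t - s - 1]
                - self.theta * e[t - 1]
                - self.big_theta * e[t - s]
                + self.theta * self.big_theta * e[t - s - 1])

    def record(self, actual=None):
        """Advance one step; with no actual value, fall back on the prediction."""
        predicted = self.predict()
        if actual is None:
            self.values.append(predicted)
            self.errors.append(0.0)  # no push: the error is taken as zero
        else:
            self.values.append(actual)
            self.errors.append(actual - predicted)


def run_sensor(readings, sensor_model, proxy_model, threshold):
    """Return (pushed samples, the proxy's view of the series)."""
    pushes, proxy_view = [], []
    for step, reading in enumerate(readings):
        if abs(reading - sensor_model.predict()) > threshold:
            sensor_model.record(reading)
            proxy_model.record(reading)  # stands in for the radio transmission
            pushes.append((step, reading))
        else:
            sensor_model.record()
            proxy_model.record()
        proxy_view.append(proxy_model.values[-1])
    return pushes, proxy_view


if __name__ == "__main__":
    history = [10, 12, 11, 9, 10]  # period + 1 seed values, period = 4
    sensor = SarimaPredictor(0.5, 0.3, 4, history)
    proxy = SarimaPredictor(0.5, 0.3, 4, history)
    pushes, view = run_sensor([12, 11, 15, 13.5], sensor, proxy, threshold=1.0)
    print(pushes, view)
```

In this run only the third reading deviates from its prediction by more than the threshold, so it is the only one transmitted; for the others the proxy's substituted values stay within the threshold of what the sensor actually observed.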
Finally, we note the asymmetric property of our model. The initial model identification and parameter estimation are compute-intensive tasks performed by the proxy. Once the model is determined, predicting a value with it involves no more than eight floating-point operations (three multiplications and five additions/subtractions, as shown in Equation 3). This is inexpensive even on resource-poor sensor nodes such as Motes and can be approximated using fixed-point arithmetic.
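The fixed-point approximation can be sketched as follows. Python is used only to illustrate the integer arithmetic; a sensor node would implement the same steps in its own language. The choice of 8 fractional bits, the helper names, and the sign convention of the recursion are all assumptions made for this example, and the product of the two MA parameters is assumed to be precomputed once so that each prediction costs three multiplications and five additions or subtractions.

```python
FRAC_BITS = 8            # assumed format: 8 fractional bits
SCALE = 1 << FRAC_BITS   # 256


def to_fixed(x):
    """Convert a real number to its scaled-integer representation."""
    return int(round(x * SCALE))


def fixed_mul(a, b):
    """Multiply two fixed-point numbers and rescale the result."""
    return (a * b) >> FRAC_BITS


def predict_fixed(x1, xs, xs1, e1, es, es1, theta, big_theta, theta_big_theta):
    """One-step prediction; every argument is already in fixed-point form.

    x1, xs, xs1 are the values at t-1, t-S and t-S-1; e1, es, es1 are the
    matching prediction errors; theta_big_theta is the precomputed product.
    """
    return (x1 + xs - xs1
            - fixed_mul(theta, e1)
            - fixed_mul(big_theta, es)
            + fixed_mul(theta_big_theta, es1))


if __name__ == "__main__":
    theta, big_theta = 0.45, 0.3
    inputs = (21.37, 20.9, 20.4, 0.8, -0.5, 0.25)
    fixed = predict_fixed(*(to_fixed(v) for v in inputs),
                          to_fixed(theta), to_fixed(big_theta),
                          to_fixed(theta * big_theta))
    x1, xs, xs1, e1, es, es1 = inputs
    exact = (x1 + xs - xs1 - theta * e1 - big_theta * es
             + theta * big_theta * es1)
    print(fixed / SCALE, exact)
```

The two printed values differ only by quantization error, which shrinks as more fractional bits are used; the right word size depends on the range and resolution of the sensed quantity.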