To illustrate QML and demonstrate its utility, we use it to specify the QoS properties of an example system. The example shows how QML can help designers decompose application level QoS requirements into QoS properties for application components. The example also demonstrates that different QoS trade-offs can give rise to different designs.
This example is a simplified version of a system for executing telephony services, such as telephone banking, ordering, etc. The purpose of having such an execution system is to allow rapid development and installation of new telephony services. The system must be scalable in order to be useful both in small businesses and for servicing several hundred simultaneous calls. More importantly---especially from the perspective of this paper---the system needs to provide services with sufficient availability.
Executing a service typically involves playing messages for the caller, reacting to key strokes, recording responses, retrieving and updating databases, etc. It should be possible to dynamically install new telephone services and upgrade them at runtime without shutting down the system. The system answers incoming telephone calls and selects a service based on the phone number that was called. The executed service may, for example, play messages for the caller and react to events from the caller or events from resources allocated to handle the call.
Telephone users generally expect plain old telephony to be reliable, and they commonly have the same expectations for telephony services. A telephony service that is unavailable will have a severe impact on customer satisfaction; in addition, the service company will lose business. Consequently, the system needs to be highly available.
Following the categorization by Gray et al. [], we want the telephony service to be a highly-available system, which means it should have a total maximum down-time of 5 minutes per year. The corresponding availability measure is approximately 0.99999. We assume the system is built on a general purpose computer platform with specialized computer telephony hardware. The system is built using a CORBA [] Object Request Broker (ORB) to achieve scalability and reliability through distribution.
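The arithmetic behind the 0.99999 figure can be checked directly. The following is a minimal sketch that assumes a 365-day year; the function name is illustrative and not part of QML.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def availability_from_downtime(downtime_minutes_per_year: float) -> float:
    """Fraction of the year during which the system is up."""
    return 1.0 - downtime_minutes_per_year / MINUTES_PER_YEAR


if __name__ == "__main__":
    # 5 minutes of down-time per year, as required for a highly-available system
    print(f"{availability_from_downtime(5):.6f}")
```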
We call the service execution system module PhoneServiceSystem. As illustrated by Figure 19, it uses an EventSystem module and a TraderService module.
Figure: High-level architecture
Opening up the PhoneServiceSystem module in Figure 18, we see its main classes and interfaces. Classes are drawn as rectangles and interfaces as circles. Classes implement and use interfaces. As an example, the diagram shows that ServiceExecutor implements ServiceI and uses TraderI. In the diagram we have included references to QML profiles, such as PlayerProfile_P, a subset of which will be described in section 6.2. To ease the reading of the diagram, we have named required and provided profiles so that they end with the letters R and P respectively. We have omitted some interrelationships to keep the diagram simple.
CallHandlerI, ServiceI, and ResourceI are three important interfaces of the system. The model also shows that the system uses interfaces provided by the EventService and TraderService.
Figure: Class diagram for PhoneServiceSystem
When a call is made, the CallHandlerImpl receives the incoming call through the CallHandlerI interface and invokes the ServiceExecutor through the ServiceI interface. CallHandlerImpl receives the telephone number as an argument and maps it to a service identifier. When CallHandlerImpl calls the ServiceExecutor, it supplies the service identifier and a CallHandle as arguments. The CallHandle contains information about the call, such as the speech channel, that is needed during the execution of the service. A new instance of CallHandle is created and initialized by the CallHandler when an incoming call is received. The information in the CallHandle remains unchanged for the remainder of the call.
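The call flow just described can be sketched in a few lines. This is an illustration only, not the system's actual code: the method name `incoming_call`, the `speech_channel` field, and the lookup table passed to the constructor are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the call data stays unchanged for the whole call
class CallHandle:
    speech_channel: int  # hypothetical field; the real handle carries more data


class ServiceExecutor:
    """Stands in for the class that implements ServiceI."""

    def __init__(self):
        self.executed = []

    def execute(self, service_id: str, handle: CallHandle) -> None:
        # A real executor would run the service; here we only record the request.
        self.executed.append((service_id, handle))


class CallHandlerImpl:
    """Stands in for the class that implements CallHandlerI."""

    def __init__(self, executor: ServiceExecutor, number_to_service: dict):
        self._executor = executor
        self._number_to_service = number_to_service

    def incoming_call(self, called_number: str, speech_channel: int) -> None:
        service_id = self._number_to_service[called_number]  # number -> service
        handle = CallHandle(speech_channel)  # one new handle per incoming call
        self._executor.execute(service_id, handle)
```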
In order to execute a service, the ServiceExecutor retrieves the service description associated with the received service identifier. It also needs to allocate resources such as databases, players, recorders, etc. To obtain resources, the ServiceExecutor calls the Trader. Each resource offers its services when it is initially started by contacting the trader and registering its offer. To reduce the complexity of the diagram, we do not show that resources use the trader.
ServiceExecutor uses the PushSupplier and implements the PushConsumer interface in the EventService module. Resources connect to the event service by using the PushConsumer interfaces. The communication between the service executor and its resources is asynchronous. When the service executor needs a resource to perform an operation, it invokes the resource which returns immediately. The service executor will then continue executing the service or stop to wait for events. When the resource has finished its operation, it notifies the service executor by sending an event through the event service. This communication model allows the service executor to listen for events from many sources at the same time, which is essential if, for example, the service executor simultaneously initiates the playing of menu alternatives and waits for responses from the caller.
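The asynchronous pattern above can be illustrated with a toy event channel. The sketch below does not use the CORBA event service; `EventChannel`, `next_event`, and the event tuples are stand-ins invented for this illustration, with a thread-safe queue playing the role of the event service.

```python
import queue
import threading


class EventChannel:
    """Toy stand-in for the event service: suppliers push, a consumer pulls."""

    def __init__(self):
        self._events = queue.Queue()

    def push(self, event):
        self._events.put(event)

    def next_event(self, timeout=1.0):
        return self._events.get(timeout=timeout)


class Player:
    def __init__(self, channel):
        self._channel = channel

    def play(self, message):
        # Return immediately; completion is reported later as an event.
        worker = threading.Thread(target=self._run, args=(message,))
        worker.start()
        return worker

    def _run(self, message):
        # (Playing the message would happen here.)
        self._channel.push(("played", message))


def run_service():
    channel = EventChannel()
    worker = Player(channel).play("menu")   # start playing the menu
    channel.push(("key_pressed", "1"))      # meanwhile, the caller presses a key
    # The executor listens to both sources through the same channel.
    seen = {channel.next_event(), channel.next_event()}
    worker.join()
    return seen
```

The order in which the two events arrive is not fixed, which is why the executor collects them rather than expecting a particular sequence.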
Figure 18 also includes references to QoS profiles. In new designs, clients and services are usually designed to match each other's needs; therefore, the same profile often specifies both what clients expect and what services provide. When clients and services refer to the same profiles, it becomes trivial to ensure that the requirements of a client are satisfied by the service. For example, CallHandlerImpl requires that the ServiceI interface is implemented with the QoS properties defined by SEProfile_P, and at the same time ServiceExecutor provides ServiceI according to the same QoS profile.
In other cases, components such as the Trader are expected to preexist and therefore have previously specified QoS properties. In those situations we have one contract specifying the required properties and another contract specifying what is provided. Consequently, we need to make sure the provided characteristics satisfy the required ones; this is referred to as conformance and is discussed in section 4.8.
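To make the idea of conformance concrete, the following sketch checks a provided contract against a required one. It is a deliberate simplification and not QML's conformance definition: contracts are reduced to one numeric bound per dimension, and the dimension names and their ordering in `INCREASING` are assumptions made for this illustration.

```python
# True when larger values are stronger (better) for that dimension.
INCREASING = {"availability": True, "MTTF": True, "TTR": False, "delay": False}


def conforms(provided: dict, required: dict) -> bool:
    """True if every required constraint is met by an equal or stronger one."""
    for dimension, needed in required.items():
        if dimension not in provided:
            return False  # nothing is promised for a required dimension
        offered = provided[dimension]
        if INCREASING[dimension]:
            if offered < needed:
                return False
        elif offered > needed:
            return False
    return True
```

A provided contract may constrain more dimensions than the client requires; only the required dimensions are checked.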
We will now present simplified versions of three main interfaces in the design. The ServiceI interface provides an operation, called execute, to start the execution of a service. The service identifier is obtained from a table that maps phone numbers to services. The CallHandle argument contains channel identifiers and other data necessary to execute the service.
Figure: The ServiceI interface
The Trader allows resources to offer and withdraw their services. Service executors can invoke the find or findAll operations on the Trader to locate the resources they need. Using a trader allows us to decouple ServiceExecutors and resources. This decoupling makes it possible to smoothly introduce new resources and remove malfunctioning or deprecated resources. Observe that this is a much simplified trader for the purpose of this paper.
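A minimal in-memory sketch of such a trader is shown below. The operation names follow the text (with findAll written as `find_all` in Python style), but the parameter lists, return values, and the use of integer offer identifiers are assumptions, since the actual TraderI signatures are not reproduced here.

```python
import itertools


class Trader:
    """In-memory sketch: each offer maps an identifier to (type, reference)."""

    def __init__(self):
        self._offers = {}
        self._ids = itertools.count(1)

    def offer(self, service_type, reference):
        """Register a resource and return the identifier of the new offer."""
        offer_id = next(self._ids)
        self._offers[offer_id] = (service_type, reference)
        return offer_id

    def withdraw(self, offer_id):
        del self._offers[offer_id]

    def find_all(self, service_type):
        """All references currently offered for the given type."""
        return [ref for kind, ref in self._offers.values() if kind == service_type]

    def find(self, service_type):
        """One matching reference, or None when nothing is offered."""
        matches = self.find_all(service_type)
        return matches[0] if matches else None
```

Because executors only ever ask the trader for a type, a malfunctioning player can be withdrawn and a replacement offered without any change to the executors.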
Finally, we have the PlayerI that represents a simple player resource. Players allow us to play a sequence of messages on the connection associated with the supplied CallHandle. The idea is that a complete message can be built up by a sequence of smaller phrases. The interface allows the service executor to interrupt the playing of messages by calling stop.
We have already shown in Figure 18 how profiles are associated with uses and implements relationships between interfaces and classes. We will now discuss in more depth what the QoS profiles and contracts should be for this particular design. For the contracts we will use the dimensions proposed in section 3. We will not present a development process for identifying important profiles and their content.
To meet end-to-end reliability requirements, the underlying communications infrastructure, as well as the execution system, must meet reliability expectations. We assume that the communications infrastructure is reliable, and focus on the reliability of the service execution system.
From a telephone user's perspective, the interface CallHandlerI represents the peer on the other side of the line. Thus, to provide high availability to telephone users, the CallHandlerI service must be highly available.
To provide a highly-available telephone service, we require that the CallHandlerImpl has a very short recovery time and a long time between failures. Due to the expected shopping behavior of telephone service users, we require that the repair time (MTTR) not significantly exceed 2 minutes and that its variance be small.
The CallHandler does not provide any sophisticated failure masking, but it has a special kind of object reference that does not require rebinding after a failure. We are prepared to accept on average 2 failures per year. If the service fails, any executing and pending requests are discontinued and removed. This means we have at-most-once operation semantics. The contract and profile of CallHandlerI as provided by CallHandlerImpl are described in Figure 23.
Figure: Contract and binding for CallHandler
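The two numbers given for the CallHandler, about 2 failures per year and a repair time of about 2 minutes, can be checked against the 5-minute yearly down-time budget. The sketch below assumes that expected down-time is simply the number of failures multiplied by the mean repair time and that a year has 365 days; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assuming a 365-day year


def expected_downtime(failures_per_year: float, mttr_minutes: float) -> float:
    """Expected minutes of down-time per year."""
    return failures_per_year * mttr_minutes


def availability(failures_per_year: float, mttr_minutes: float) -> float:
    """Fraction of the year during which the service is up."""
    downtime = expected_downtime(failures_per_year, mttr_minutes)
    return 1.0 - downtime / MINUTES_PER_YEAR
```

Under these assumptions, 2 failures of 2 minutes each give 4 minutes of down-time per year, which stays within the 5-minute budget.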
From Figure 18 we can see that the reliability of CallHandlerI directly depends on the reliability of the service defined by ServiceI. ServiceExecutor cannot provide any services without resources. Unless ServiceExecutor can handle failing traders and resources, its reliability depends directly on the reliability of TraderI and any resources it uses. In this example we want to keep the ServiceExecutor as small and simple as possible; therefore, we propagate high-availability requirements from CallHandlerI to the trader and the resources. This is certainly a major design decision that will affect the design and implementation of the other components of the system.
We expect the ServiceExecutor to have a short recovery time since it holds no information that we wish to recover. If it fails, the service interactions it currently executes will be discontinued. We assume that users consider it more annoying if a session is interrupted due to a failure than if they are unable to connect to the service. We therefore require the ServiceExecutor to be reliable in the sense that it should function adequately over the duration of a typical service call. Calls are estimated to last 3 minutes on average, with 80% of the calls lasting less than 5 minutes. With this in mind, we will require that the service executor provides high continuous availability with a time period of 5 minutes.
Since the recovery time is short, we can allow more frequent failures without compromising the availability requirements.
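The trade-off between recovery time and failure frequency can be made explicit with the standard steady-state relation availability = MTTF / (MTTF + MTTR). Using this formula is an assumption here, since the text does not state which model it relies on, and the function name is illustrative.

```python
def min_mttf(availability: float, mttr: float) -> float:
    """Smallest MTTF (in the same unit as mttr) that meets the availability target.

    Obtained by solving availability = MTTF / (MTTF + MTTR) for MTTF.
    """
    return mttr * availability / (1.0 - availability)
```

For a target of 0.99999, a 2-minute recovery time calls for an MTTF of roughly 200,000 minutes (about 139 days). With a hypothetical recovery time of 10 seconds, the required MTTF drops to about 12 days, so failures can be an order of magnitude more frequent.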
The ServiceExecutor recovers to a well-defined initial state and will forget about all executions that were going on at the time of the failure. The contract states that rebinding is necessary, which means that when the service executor is restarted, the CallHandler receives a notification that it can obtain a reference to the ServiceExecutor by rebinding. Pending requests are executed at most once in case of a failure; most likely they are not executed at all, which is considered acceptable for this system. The contract and profile used for ServiceI are described in Figure 24.
Figure: Contract and binding for service
Although the ServiceExecutor itself can recover rapidly, it still depends on the Trader and the resources.
We expect the Trader to have a relatively short recovery time, which relaxes the mean time to failure requirements slightly. We insist that all types of telephony services can be executed when the system is up, which means that all resources must be available and consequently satisfy the high-availability requirements.
The reliability contract for the Trader (Figure 26) is based on a general contract (HAServiceReliability) for highly-available services. The contract is abstract in the sense that it only states the availability requirements and leaves several of the other dimensions unspecified. The Trader profile refines it by stating that the recovery time should be short.
In addition, we state that offer identifiers and object references returned by the trader are valid even after a failure. This means that an offer identifier returned before a failure can be used to withdraw an offer after the Trader has recovered. Also, any references returned by the Trader are valid during the Trader's down period as well as after it has recovered, assuming, of course, that the services referred to by the references have not failed.
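The way the Trader's reliability contract builds on HAServiceReliability can be mimicked with plain dictionaries. This is only an analogy for QML refinement, which has richer rules: in this sketch a refinement may only fill in dimensions that the base contract leaves unspecified, and the dimension names and the 30-second recovery bound are hypothetical values chosen for illustration.

```python
def refine(base: dict, **constraints) -> dict:
    """Return a new contract: the base constraints plus the added ones."""
    overlap = base.keys() & constraints.keys()
    if overlap:
        # Simplification: this sketch does not model strengthening a constraint.
        raise ValueError(f"already specified in base contract: {sorted(overlap)}")
    return {**base, **constraints}


# Abstract contract: states only the availability requirement.
HA_SERVICE_RELIABILITY = {"availability": 0.99999}

# The Trader's contract adds a (hypothetical) bound on recovery time.
TRADER_RELIABILITY = refine(HA_SERVICE_RELIABILITY, max_recovery_seconds=30)
```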
The start-up time for a service execution is very important; the time between when a call is answered and when the service starts executing must be short and definitely not more than one second. A start-up time that exceeds one second can make users believe there is a problem with the connection and therefore hang up the phone, the consequence being both an unsatisfied customer and a lost business opportunity.
Having analyzed and estimated the execution times in the start-up execution path, we require that the find and findAll operations on the Trader respond quickly. We do not anticipate the throughput to constitute a bottleneck in this case.
We can relax the performance requirements for the offer and withdraw operations on the Trader, because these operations are not time critical from the service execution point of view. We specify the performance in Figure 26 as part of the TraderProfile_P profile.
The performance profile makes it clear that the implementation of TraderI should give invocations of find and findAll higher priority than invocations of offer and withdraw.
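One way an implementation might honor this priority is to queue incoming requests by operation class. The sketch below is an assumption about a possible implementation strategy, not something the profile prescribes; `RequestQueue` and its methods are invented for this illustration.

```python
import heapq
import itertools

# Lower number is served first: lookups before registrations.
PRIORITY = {"find": 0, "findAll": 0, "offer": 1, "withdraw": 1}


class RequestQueue:
    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker that keeps arrival order

    def submit(self, operation, argument=None):
        entry = (PRIORITY[operation], next(self._arrival), operation, argument)
        heapq.heappush(self._heap, entry)

    def next_request(self):
        """Pop the most urgent request as an (operation, argument) pair."""
        _, _, operation, argument = heapq.heappop(self._heap)
        return operation, argument
```

Requests with equal priority are served in arrival order, so a burst of offer calls cannot delay a find that arrives after them.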
A resource service represents a pool of hardware and software resources that are expected to be highly-available. If a resource service is down, it is likely that there are major hardware or software problems that will take a long time to repair. Since failing resource services are expected to have long recovery times, they need to have, in principle, infinite MTTF to satisfy high availability requirements. This does not mean that individual resources cannot fail, but it does mean that there must be sufficient redundancy to mask failures.
In Figure 25 we define a general contract, called ResourceReliability, for ResourceI. The contract captures that resources need to be highly available. Each specific resource type, such as PlayerReliability, will then refine this general contract to specify its individual QoS properties.
Figure: Contract and binding for resources
Figure: Contract and binding for the Trader
The specification of reliability and performance contracts, and the analysis of inter-component QoS dependencies, have given us many insights and important guidance. As an example, it has helped us realize that the Trader needs to support fast fail-over and use reliable storage. We also found that the reliability of resources is essential, and that, in this example system, resource services should be responsible for their own reliability. The explicit specification also allows us to assign well-defined values to various dimensions, which makes design goals and requirements clearer.
QML allows detailed descriptions of the QoS associated with operations, attributes, and operation parameters of interfaces. This level of detail is essential to clearly specify and divide the responsibilities among client and service implementations. The refinement mechanism is also essential. Refinement allows us to form hierarchies of contracts and profiles, which allows us to capture QoS requirements at various levels of abstraction.
Due to the limited space of this paper, we have not been able to include a full analysis or specification of the example system. In a real design, we also need to study what happens when various components fail, estimate the frequency of failures due to programming errors, etc. We also need to ensure that the QoS contracts provided by components actually allow the clients to satisfy the requirements imposed on them. There are various modeling techniques available that are applicable to selected types of systems; see Reibman et al. [] for an overview.
In our case, high availability requirements for CallHandler have resulted in strong demands on other services in the application. Another design alternative would be to demand that components such as the ServiceExecutor can handle failing resources and switch to other resources when needed. This would require more from the ServiceExecutor, but allow resource services to be less reliable.
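The alternative design can be sketched as a small fail-over loop in the executor. This is a hypothetical illustration of that alternative, not part of the design presented in this paper: the `ResourceFailed` exception, the `play` signature, and the two stub player classes are all assumptions.

```python
class ResourceFailed(Exception):
    """Raised by a resource that cannot complete an operation (hypothetical)."""


class BrokenPlayer:
    def play(self, phrases):
        raise ResourceFailed("player is down")


class WorkingPlayer:
    def __init__(self):
        self.played = []

    def play(self, phrases):
        self.played.append(list(phrases))


def play_with_failover(players, phrases):
    """Try each candidate player in turn; return the first one that succeeds."""
    for player in players:
        try:
            player.play(phrases)
            return player
        except ResourceFailed:
            continue  # switch to the next candidate
    raise ResourceFailed("no working player available")
```

The extra logic lives in the executor, which is exactly the added burden the text mentions; in exchange, a single failing player no longer interrupts the call.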
Despite the limitations of our example, we believe that it demonstrates three important points: QoS should be considered during the design of distributed systems; QoS requires appropriate language support; QML is useful as a QoS specification language.
Firstly, we want to stress that considering QoS during design is both useful and necessary. It will directly impact the design and make developers aware of non-functional requirements.
Secondly, QoS cannot be effectively considered without appropriate language support. We need a language that helps designers capture QoS requirements and associate these with interfaces at a detailed level. We also need to make QoS requirements and offers first class citizens from a design language point of view.
Finally, we believe the example shows that QML is suitable to support designers in involving QoS considerations in the design phase.