Thursday, July 18, 2019

Fault tolerance in integration flows - handling target system availability problems

An important non-functional property of any software system is "Availability". In the ISO/IEC 25010:2011 product quality model, this is grouped under an overall category of "Reliability". 
Fault tolerance is a closely associated property also grouped under "Reliability". 

System downtime could be due either to scheduled maintenance or to unexpected causes (server crashes, network failures, DNS misconfiguration). Normally, consumer systems expect some form of service level agreement (SLA) around system availability, measured by values like downtime percentage over a period, maximum response times, and average response times.

Caller systems, on the other hand, need to be resilient to sporadic availability problems in the systems they call (apart from maintaining their own availability). Various strategies are applied to maintain availability, such as high-quality networks and high-availability or clustered configurations with load balancing. In complex web-based systems, core services are also fronted by highly available proxy services or a web-tier layer. This layer ensures that callers receive at least some response (such as an HTTP 503 status code and message) even when core services are unreachable.
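
As a rough illustration of the web-tier idea, the sketch below wraps a call to a core service and answers with a 503 when the core service cannot be reached. The names web_tier and call_core_service are hypothetical and stand in for whatever proxy product and backend call are actually in use.

```python
def web_tier(call_core_service):
    """Return (status, body); answer 503 when the core service cannot be reached."""
    try:
        return 200, call_core_service()
    except ConnectionError:
        # The core service is down, but the caller still gets a prompt, well-formed reply.
        return 503, "Service Unavailable"
```

The point of the design is that the caller receives a definite answer quickly instead of having to infer a failure from silence.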

For fault tolerance, most middleware and integration products have two very important features: timeouts and retries.
The diagrams below illustrate the importance of these features in ensuring fault tolerance.

We start with an ideal operational scenario where the caller sends a request and expects a synchronous response back. Dark black lines indicate the "lifetime" of an object - for example, it could represent the duration for which a request thread waits for a response. 

Figure 1 - normal operational scenario

Occasionally, the target system could be slow to respond to a synchronous invocation, but still within an agreed "maximum response time". 

Figure 2 - Target system slow

We discussed how a web tier could "inform" caller systems that a more complex core system is unavailable. This is good practice and covers most cases of system downtime. However, there can be situations where even the target system's web layer is inaccessible (for instance, due to a network outage or DNS misconfiguration).
This is where a timeout feature comes into the picture. In this case, the caller system needs to be able to detect a connection problem and fail within a finite amount of time. In integration tools, this duration is typically defined by a "connect timeout" property.


Figure 3 - target system downtime
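
A connect timeout can be sketched with Python's standard socket module. This is a minimal illustration rather than how any particular integration product implements it; the function name try_connect and the result labels are made up for the example.

```python
import socket

def try_connect(host, port, connect_timeout_s):
    """Try to open a TCP connection, giving up after connect_timeout_s seconds."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout_s):
            return "connected"
    except socket.timeout:
        return "timed out"      # no answer at all within the connect timeout
    except ConnectionRefusedError:
        return "refused"        # host reachable, but nothing is listening on the port
    except OSError:
        return "unreachable"    # e.g. DNS lookup failure or no route to the host
```

Whatever the outcome, the caller gets control back after a bounded time and can decide what to do next.
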
Finally, the diagram below represents a scenario where an HTTP connection to the target system can be established, but the target never replies. This could happen for various reasons: the system could have crashed after receiving the request, or there could be a network firewall problem. In such scenarios, it is important for the caller system to stop waiting after a certain time has passed. This duration is typically defined by a "read timeout" property and would be equal to or slightly greater than the "maximum response time" of the target system.

Figure 4 - target system connects but doesn't respond
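
The read timeout case can be sketched in the same way. Here the connection succeeds, the request is sent, and the caller waits only a limited time for the first bytes of a reply. The function name call_with_read_timeout is hypothetical, and a real HTTP client would also parse the response rather than return raw bytes.

```python
import socket

def call_with_read_timeout(host, port, request, connect_timeout_s, read_timeout_s):
    """Send request bytes and wait at most read_timeout_s for the first reply bytes."""
    with socket.create_connection((host, port), timeout=connect_timeout_s) as conn:
        conn.settimeout(read_timeout_s)  # from here on, bounds each send and receive
        conn.sendall(request)
        try:
            return conn.recv(4096)
        except socket.timeout:
            return None  # the target accepted the connection but never replied
```

Keeping the two timeouts separate lets the connect timeout stay short while the read timeout follows the target's agreed maximum response time.
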
The important thing to note is that timeouts ensure that caller system resources are freed up after a finite amount of time. The alternative could be a system overload: for example, too many caller system threads simply waiting for a response from an unresponsive target system could exhaust resources like CPU and memory, affecting the caller's own availability.
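
A back-of-the-envelope estimate shows why the timeout value matters. Assuming a steady request rate and a target that has stopped responding, the number of caller threads waiting at any moment is roughly the request rate multiplied by the time each thread waits (Little's law). The figures below are purely illustrative.

```python
def threads_blocked(requests_per_second, wait_seconds):
    """Approximate number of caller threads waiting at once: rate * wait time."""
    return requests_per_second * wait_seconds

# 20 requests per second against an unresponsive target:
print(threads_blocked(20, 30))  # 30 s read timeout -> about 600 waiting threads
print(threads_blocked(20, 5))   # 5 s read timeout  -> about 100 waiting threads
```

With no timeout at all, the wait is unbounded and so is the number of blocked threads.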

For fault tolerance, another feature closely related to timeouts is retries. Many sporadic availability issues can be overcome if the caller system simply retries its invocation. The appropriate number of retries and the time gap between consecutive retries depend on the specific systems and the use case. Often, for sensitive transactional operations like payments, it might not be appropriate to retry without human intervention, so the system design might need to incorporate idempotence, a human workflow, or the standard "error hospital" functionality available in Oracle SOA Suite and Oracle Integration Cloud.
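
A simple automated retry with a fixed gap between attempts might look like the sketch below. TransientError and call_with_retries are hypothetical names, and the defaults of three attempts and a one-second gap are placeholders rather than recommendations. The sketch assumes the operation is idempotent, so repeating it is safe.

```python
import time

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as a timeout."""

def call_with_retries(operation, max_attempts=3, delay_s=1.0, sleep=time.sleep):
    """Call operation(); on TransientError try again, up to max_attempts in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: leave the failure for manual recovery
            sleep(delay_s)  # fixed gap between consecutive attempts
```

Once the attempts are exhausted the error is re-raised, which is the point where a human workflow or error hospital would take over.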


I have used "timeout" as an abstract term, mostly from the perspective of a synchronous HTTP request. Other types of connections have timeouts of their own: for example, message-queuing system connections and database connections have similar timeouts defined. Transaction timeouts are also an important property. Where a product provides configurable timeouts, it is important to set sensible values depending on the type of system and the timeouts defined by downstream systems, if known.

Timeout and automated retry properties in some leading integration products:
1) Oracle SOA Suite or SOA-CS: How to timeout the HTTP connections for a SOA composite globally (Oracle Support Doc ID 2105692.1)
2) Oracle Service Bus: see Table 29-3, HTTP Transport Properties for Business Services
3) Client timeout in Apache Camel: https://camel.apache.org/http.html

Manual retry and error hospital features
1) Manually re-submitting failed messages in Oracle Integration Cloud: 
2) Error hospital for manual error recovery in Oracle SOA Suite: