Problem
- How do you prevent a network or service failure from cascading to other services?
Solution
- A service client should invoke a remote service via a proxy that functions in a similar fashion to an electrical circuit breaker.
- When the number of consecutive failures crosses a threshold, the circuit breaker trips, and for the duration of a timeout period, all attempts to invoke the remote service will fail immediately.
- After the timeout expires, the circuit breaker allows a limited number of test requests to pass through. If those requests succeed, the circuit breaker resumes normal operation. Otherwise, if there is a failure, the timeout period begins again.
The Different States of the Circuit Breaker
1. Closed
- When everything is normal, the circuit breaker remains in the Closed state and all calls pass through to the services. When the number of failures exceeds a predetermined threshold, the breaker trips and goes into the Open state.
2. Open
- While the circuit is open, all calls to the remote service fail immediately (or are routed to a fallback) without invoking the service. After a timeout period, the breaker moves to the Half-Open state.
3. Half-Open
- After a timeout period, the circuit switches to a Half-Open state to test whether the underlying problem still exists. If a single call fails in this Half-Open state, the breaker is tripped once again. If it succeeds, the circuit breaker resets back to the normal, Closed state. A simplified sketch of these transitions follows this list.
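To make these transitions concrete, here is a simplified, illustrative sketch of a circuit breaker state machine in plain Java (the class, threshold, and timeout values below are assumptions for this example; in practice you would use a library such as Hystrix or Resilience4j, shown later):

import java.time.Duration;
import java.time.Instant;

// Minimal illustration of the Closed/Open/Half-Open transitions; not production code.
public class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;
    private final int failureThreshold = 5;                      // trip after 5 consecutive failures
    private final Duration openTimeout = Duration.ofSeconds(30); // how long to stay Open

    public synchronized boolean allowRequest() {
        if (state == State.OPEN && Instant.now().isAfter(openedAt.plus(openTimeout))) {
            state = State.HALF_OPEN;   // timeout expired: let test requests through
        }
        return state != State.OPEN;    // Open fails fast; Closed and Half-Open pass through
    }

    public synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;          // a successful (test) call closes the circuit
    }

    public synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;        // trip the breaker and restart the timeout
            openedAt = Instant.now();
        }
    }
}

A caller would check allowRequest() before invoking the remote service and report the outcome via recordSuccess() or recordFailure(); metrics and limiting the number of Half-Open test calls are left to a real library.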
Use Case of Circuit Breaker Pattern
Let’s take an example to understand where we can apply the Circuit Breaker pattern in a microservices architecture.
Scenario:
- Assume there are 5 different services in a microservices application. Whenever the web server receives a request, it allocates one thread to call the particular service. If that service is slightly delayed due to some failure, the thread waits. That is still acceptable as long as only one thread is waiting for that service.
- But if the service is a high-demand service that receives many requests, holding threads is not good: more and more threads will be allocated to this service over time, and all of them will have to wait.
- As a result, the remaining requests coming to your service will be blocked or queued. Even after the service recovers, the web server keeps trying to process the requests already in the queue, and it never catches up because new requests keep arriving.
- Eventually this can lead to cascading failures throughout the application. Scenarios like this can crash your services and even the whole application.
Solution:
- The above scenario is a perfect candidate for the Circuit Breaker pattern. Assume you have defined a threshold for a particular service: it should respond within 200 ms. As mentioned above, that service is a high-demand service that continuously receives requests. If 75% of those requests are approaching the upper threshold (150 ms to 200 ms), the service is going to fail soon (a possible configuration for such slow-call thresholds is sketched after this list).
- If several requests exceed the maximum threshold (200 ms), the service is not responding anymore. The call then fails fast back to the consumer, informing it that this particular service is not available. In terms of the states described above, we move from the Closed state to the Open state.
- As a result, requests coming to that service no longer wait. After a timeout, the Circuit Breaker sends test requests to the service in the background; this is the Half-Open state of the pattern. If these requests succeed, the Circuit Breaker allows requests to that service again.
- So you can use the Circuit Breaker pattern to improve the fault tolerance and resilience of a microservice architecture, and to prevent failures from cascading to other microservices.
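As a rough illustration of how such response-time thresholds could be expressed, Resilience4j's circuit breaker supports slow-call settings; the instance name slowServiceCB and the exact values below are assumptions chosen to mirror the 200 ms / 75% figures in the scenario, not part of the original example:

resilience4j.circuitbreaker.instances.slowServiceCB.slow-call-duration-threshold=200ms
resilience4j.circuitbreaker.instances.slowServiceCB.slow-call-rate-threshold=75
resilience4j.circuitbreaker.instances.slowServiceCB.wait-duration-in-open-state=10s

With this configuration, calls slower than 200 ms count as slow, and once 75% of recorded calls are slow the breaker opens and fails fast until the wait duration elapses and the Half-Open test begins.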
Different aspects we can use for fault tolerance (including the Circuit Breaker pattern)
- Circuit Breaker
- Retry
- Rate Limiter
- Bulkhead
- Time Limiter
1. Circuit Breaker
- The circuit breaker has three distinct states, as described above: Closed, Open, and Half-Open.
- You can implement the Circuit Breaker pattern with Netflix Hystrix. The following code illustrates the solution.
Example
The microservice below recommends a reading list to the customer:
package hello;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@SpringBootApplication
public class BookstoreApplication {

    // Returns the recommended reading list for the customer
    @RequestMapping(value = "/recommended")
    public String readingList() {
        return "Spring in Action";
    }

    public static void main(String[] args) {
        SpringApplication.run(BookstoreApplication.class, args);
    }
}
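For the client shown next to reach this service at http://localhost:8090/recommended, the bookstore application needs to listen on port 8090. One way to do that, assuming standard Spring Boot configuration, is to set the port in the bookstore service's application.properties:

server.port=8090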
Client application code which will call the reading list recommendation service:
package hello;

import java.net.URI;

import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;

@Service
public class BookService {

    private final RestTemplate restTemplate;

    public BookService(RestTemplate rest) {
        this.restTemplate = rest;
    }

    // Calls the remote recommendation service; falls back to reliable() on failure
    @HystrixCommand(fallbackMethod = "reliable")
    public String readingList() {
        URI uri = URI.create("http://localhost:8090/recommended");
        return this.restTemplate.getForObject(uri, String.class);
    }

    public String reliable() {
        return "Cloud Native Java (O'Reilly)";
    }
}
- In the code above, the readingList() method calls the remote microservice API to get the reading list recommendation.
- Note the @HystrixCommand annotation on readingList(): it specifies a fallback method named "reliable". If the remote API does not respond in time, the "reliable" method is called instead and serves the request.
- In the fallback method, you can return a default value or even call some other remote or local API to serve the request. The wiring needed to run this client is sketched below.
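For completeness, the Hystrix example also needs a RestTemplate bean and circuit-breaker support enabled in the client application. A minimal sketch, assuming the Spring Cloud Netflix Hystrix starter is on the classpath (the class name ReadingApplication is an assumption, not from the original code):

package hello;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.cloud.client.circuitbreaker.EnableCircuitBreaker;
import org.springframework.context.annotation.Bean;
import org.springframework.web.client.RestTemplate;

@EnableCircuitBreaker
@SpringBootApplication
public class ReadingApplication {

    // RestTemplate bean injected into BookService's constructor
    @Bean
    public RestTemplate rest(RestTemplateBuilder builder) {
        return builder.build();
    }

    public static void main(String[] args) {
        SpringApplication.run(ReadingApplication.class, args);
    }
}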
Update application.properties. The properties below use Resilience4j's circuit breaker (an alternative to Hystrix) and configure an instance named getInvoiceCB:
resilience4j.circuitbreaker.instances.getInvoiceCB.failure-rate-threshold=80
resilience4j.circuitbreaker.instances.getInvoiceCB.sliding-window-size=10
resilience4j.circuitbreaker.instances.getInvoiceCB.sliding-window-type=COUNT_BASED
resilience4j.circuitbreaker.instances.getInvoiceCB.minimum-number-of-calls=5
resilience4j.circuitbreaker.instances.getInvoiceCB.automatic-transition-from-open-to-half-open-enabled=true
resilience4j.circuitbreaker.instances.getInvoiceCB.permitted-number-of-calls-in-half-open-state=4
resilience4j.circuitbreaker.instances.getInvoiceCB.wait-duration-in-open-state=1s
- 'failure-rate-threshold=80'
- Indicates that if 80% of the recorded calls fail, the circuit is opened, i.e. the Circuit Breaker moves to the Open state.
- 'sliding-window-size=10'
- Indicates that the failure rate is calculated over the last 10 calls, so if 8 out of 10 (80%) fail, the circuit opens.
- 'sliding-window-type=COUNT_BASED'
- Indicates that we are using a COUNT_BASED sliding window. The other type is TIME_BASED.
- 'minimum-number-of-calls=5'
- Indicates that at least 5 calls are required before the failure rate is calculated.
- 'automatic-transition-from-open-to-half-open-enabled=true'
- Indicates that the breaker automatically moves from the Open state to the Half-Open state once the wait duration has elapsed, instead of staying Open until the next call arrives.
- 'permitted-number-of-calls-in-half-open-state=4'
- Indicates that in the Half-Open state, 4 test calls are permitted. If 80% of them fail, the Circuit Breaker switches back to the Open state.
- 'wait-duration-in-open-state=1s'
- Indicates how long the breaker waits in the Open state before transitioning to the Half-Open state. A sketch of the method these properties apply to follows this list.
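These properties take effect on a method annotated with Resilience4j's @CircuitBreaker whose name matches the configured instance. A minimal sketch, assuming a hypothetical invoice service (the class, method, and URL below are illustrative, not from the original example):

import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;

@Service
public class InvoiceService {

    private final RestTemplate restTemplate = new RestTemplate();

    // The name must match the instance configured in application.properties
    @CircuitBreaker(name = "getInvoiceCB", fallbackMethod = "getInvoiceFallback")
    public String getInvoice(String invoiceId) {
        return restTemplate.getForObject("http://localhost:8090/invoices/" + invoiceId, String.class);
    }

    // Fallback takes the same arguments plus the exception that triggered it
    public String getInvoiceFallback(String invoiceId, Throwable t) {
        return "Invoice service is currently unavailable; please try again later.";
    }
}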
2. Retry
- Suppose Microservice ‘A’ depends on another Microservice ‘B’. Let’s assume Microservice ‘B’ is a faulty service whose success rate is only about 50-60%.
- The fault may be due to any reason, such as the service being unavailable, a buggy service that sometimes responds and sometimes doesn’t, or an intermittent network failure.
- In this case, if Microservice ‘A’ retries the request 2 to 3 times, the chance of getting a response increases. We can achieve this functionality with the @Retry annotation provided by Resilience4j, without writing the retry logic explicitly.
- Here, we implement the Retry mechanism in Microservice ‘A’. We call Microservice ‘A’ fault tolerant, as it participates in tolerating the fault. Note that a retry takes place only on failure, not on success.
- By default, the retry happens 3 times. We can configure how many attempts to make as per our requirement.
In the Circuit Breaker example, replace
- @HystrixCommand(fallbackMethod = "reliable") with
- @Retry(name = "getInvoiceRetry", fallbackMethod = "reliable")
Note that Resilience4j annotations require a name that matches the instance configured in application.properties.
Update application.properties.
resilience4j.retry.instances.getInvoiceRetry.max-attempts=5
resilience4j.retry.instances.getInvoiceRetry.wait-duration=2s
resilience4j.retry.instances.getInvoiceRetry.retry-exceptions=org.springframework.web.client.ResourceAccessException
- By default, the retry mechanism makes 3 attempts if the service fails on the first call.
- Here, however, we have configured 5 attempts, each after a 2-second interval.
- Additionally, if the business requires retrying only when a specific exception occurs, that can be configured as shown above.
- If we want Resilience4j to retry on any type of exception, we simply omit the ‘retry-exceptions’ property. A sketch of the annotated method follows below.
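A minimal sketch of the client method with the Retry aspect applied, assuming the same BookService from the Hystrix example above (only the import and annotation change; the name matches the getInvoiceRetry instance configured above):

package hello;

import java.net.URI;

import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

import io.github.resilience4j.retry.annotation.Retry;

@Service
public class BookService {

    private final RestTemplate restTemplate;

    public BookService(RestTemplate rest) {
        this.restTemplate = rest;
    }

    // Retries the remote call up to max-attempts times before invoking the fallback
    @Retry(name = "getInvoiceRetry", fallbackMethod = "reliable")
    public String readingList() {
        URI uri = URI.create("http://localhost:8090/recommended");
        return this.restTemplate.getForObject(uri, String.class);
    }

    // Fallback receives the exception that caused the final failure
    public String reliable(Throwable t) {
        return "Cloud Native Java (O'Reilly)";
    }
}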
3. Rate Limiter
- A Rate Limiter limits the number of requests within a given period. Let’s assume that we want to limit the number of requests to a REST API and fix that limit for a particular duration.
- There are various reasons to limit the number of requests an API can handle, such as protecting resources from spammers, minimizing overhead, meeting a service-level agreement, and many others.
- We can achieve this functionality with the @RateLimiter annotation provided by Resilience4j, without writing the limiting code explicitly.
In the Circuit Breaker example, replace
- @HystrixCommand(fallbackMethod = "reliable") with
- @RateLimiter(name = "getMessageRateLimit", fallbackMethod = "reliable")
Update application.properties.
resilience4j.ratelimiter.instances.getMessageRateLimit.limit-for-period=2
resilience4j.ratelimiter.instances.getMessageRateLimit.limit-refresh-period=5s
resilience4j.ratelimiter.instances.getMessageRateLimit.timeout-duration=0
- The above properties specify that only 2 requests are allowed in each 5-second refresh window.
- Also, ‘timeout-duration=0’ means a request over the limit does not wait for a permit: it fails (to the fallback) immediately, and once the 5-second window refreshes, the user can send requests again. A usage sketch follows below.
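A minimal sketch of an endpoint guarded by this rate limiter, assuming a hypothetical controller (class, endpoint, and messages are illustrative; the instance name matches getMessageRateLimit above):

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import io.github.resilience4j.ratelimiter.annotation.RateLimiter;

@RestController
public class MessageController {

    // Only 2 calls are permitted per 5-second window; extra calls go straight to the fallback
    @GetMapping("/getMessageRL")
    @RateLimiter(name = "getMessageRateLimit", fallbackMethod = "getMessageFallback")
    public String getMessage() {
        return "Message fetched successfully.";
    }

    public String getMessageFallback(Throwable t) {
        return "Too many requests: please try again after some time.";
    }
}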
4. Bulkhead
- In the context of fault-tolerance mechanisms, if we want to limit the number of concurrent requests, we can use Bulkhead as an aspect. Using Bulkhead, we can limit the number of calls that execute at the same time.
- Please note the difference between Bulkhead and Rate Limiter: a Rate Limiter never talks about concurrent requests, it limits the total number of requests within a particular period, whereas Bulkhead limits how many requests run concurrently.
- Hence, using Bulkhead we can limit the number of concurrent requests. We can achieve this functionality easily with the @Bulkhead annotation, without writing specific code.
In the Circuit Breaker example, replace
- @HystrixCommand(fallbackMethod = "reliable") with
- @Bulkhead(name = "getMessageBH", fallbackMethod = "reliable")
Update application.properties.
resilience4j.bulkhead.instances.getMessageBH.max-concurrent-calls=5
resilience4j.bulkhead.instances.getMessageBH.max-wait-duration=0
- ‘max-concurrent-calls=5’ indicates that if the number of concurrent calls exceeds 5, the fallback method is activated.
- ‘max-wait-duration=0’ indicates that an excess call does not wait for a free slot at all; the response (from the fallback) is returned immediately. A usage sketch follows below.
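A minimal sketch of a method protected by this bulkhead, assuming a hypothetical controller (names and messages are illustrative; the instance name matches getMessageBH above):

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import io.github.resilience4j.bulkhead.annotation.Bulkhead;

@RestController
public class BulkheadController {

    // At most 5 calls may execute concurrently; additional calls fall back immediately
    @GetMapping("/getMessageBH")
    @Bulkhead(name = "getMessageBH", fallbackMethod = "getMessageFallback")
    public String getMessage() {
        return "Message fetched successfully.";
    }

    public String getMessageFallback(Throwable t) {
        return "Too many concurrent requests: please try again later.";
    }
}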
5. Time Limiter
- Time limiting is the process of setting a time limit for a microservice to respond. Suppose Microservice ‘A’ sends a request to Microservice ‘B’; it sets a time limit within which Microservice ‘B’ must respond.
- If Microservice ‘B’ doesn’t respond within that time limit, it is considered to have a fault. We can achieve this functionality easily with the @TimeLimiter annotation, without writing specific code.
import java.util.concurrent.CompletableFuture;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import io.github.resilience4j.timelimiter.annotation.TimeLimiter;

@RestController
public class TimeLimiterController {

    Logger logger = LoggerFactory.getLogger(TimeLimiterController.class);

    @GetMapping("/getMessageTL")
    @TimeLimiter(name = "getMessageTL")
    public CompletableFuture<String> getMessage() {
        return CompletableFuture.supplyAsync(this::getResponse);
    }

    private String getResponse() {
        if (Math.random() < 0.4) { // Responds quickly about 40% of the time; otherwise delays and times out
            return "Executing Within the time Limit...";
        } else {
            try {
                logger.info("Getting Delayed Execution");
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        return "Exception due to Request Timeout.";
    }
}
Update application.properties.
resilience4j.timelimiter.instances.getMessageTL.timeout-duration=1ms
resilience4j.timelimiter.instances.getMessageTL.cancel-running-future=false
- ‘timeout-duration=1ms’ indicates that the maximum amount of time a request may take to respond is 1 millisecond.
- ‘cancel-running-future=false’ indicates that the running CompletableFuture should not be cancelled after the timeout.