Circuit breakers¶
This lab demonstrates how to configure circuit breaking, both with and without outlier detection, in Istio.
Prerequisites and setup¶
- Kubernetes with Istio and other tools (Prometheus, Zipkin, Grafana) installed
- `web-frontend` and `customers` workloads already deployed and running
Revise the Istio installation configuration¶
Modify the Istio installation to use the demo profile, which enables a high level of tracing; this is convenient for this lab.
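One way to switch to the demo profile, assuming `istioctl` is available on your PATH:

```shell
istioctl install --set profile=demo -y
```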
Install Fortio¶
Let's generate some load against the `web-frontend` workload and look at the distribution of responses. We'll use Fortio to generate the load.
Deploy Fortio. Notice the annotation in the Deployment below, which configures the inclusion of additional Envoy metrics (aka statistics), including those for circuit breaking. Save the manifest to `fortio.yaml` and deploy it.
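A minimal sketch of such a Deployment, assuming the `fortio/fortio` image and Istio's `proxy.istio.io/config` annotation with a `proxyStatsMatcher` to include the extra statistics (the image tag and inclusion prefixes are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fortio
spec:
  selector:
    matchLabels:
      app: fortio
  template:
    metadata:
      labels:
        app: fortio
      annotations:
        # Serve additional Envoy statistics, including circuit-breaker metrics
        proxy.istio.io/config: |-
          proxyStatsMatcher:
            inclusionPrefixes:
            - "cluster.outbound"
            - "cluster_manager"
            - "listener_manager"
            - "server"
            - "cluster.xds-grpc"
    spec:
      containers:
      - name: fortio
        image: fortio/fortio:latest_release
        ports:
        - containerPort: 8080
          name: http-fortio
```

Then apply it:

```shell
kubectl apply -f fortio.yaml
```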
Make a single request to make sure everything is working. The command below should result in an HTTP 200 "OK" response from the `web-frontend` app.
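A single request can be made with Fortio's built-in curl, assuming the Deployment above:

```shell
kubectl exec deploy/fortio -c fortio -- fortio curl http://web-frontend
```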
With Fortio, we can generate a load of 50 requests over two concurrent connections, as shown below. All 50 requests should succeed; that is the meaning of `Code 200 : 50` in the output.
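One way to run that load (`-qps 0` means no rate limiting, i.e. as fast as possible):

```shell
kubectl exec deploy/fortio -c fortio -- \
  fortio load -c 2 -qps 0 -n 50 -quiet http://web-frontend
```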
Info
Fortio also has a GUI. To access it:
- Port-forward the deployment's port (see the command below)
- In a browser, visit http://localhost:8080/fortio
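A port-forward command for the first step, assuming Fortio serves its UI on port 8080:

```shell
kubectl port-forward deploy/fortio 8080:8080
```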
Circuit breaker - connection pool settings¶
Study the following DestinationRule:
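A sketch of `cb-web-frontend.yaml`, assuming `web-frontend` runs in the `default` namespace; each threshold is set to 1, and the fields are described in the list that follows:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-frontend
spec:
  host: web-frontend.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1
        http2MaxRequests: 1
        maxRequestsPerConnection: 1
```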
- `http1MaxPendingRequests`: the maximum number of pending HTTP requests to a destination.
- `http2MaxRequests`: the maximum number of concurrent requests to a destination.
- `maxRequestsPerConnection`: the maximum number of requests per connection.
It configures the connection pool for `web-frontend` with very low thresholds, to easily trigger the circuit breaker.
Save the above YAML to `cb-web-frontend.yaml` and apply the changes:
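Assuming the file was saved in the current directory:

```shell
kubectl apply -f cb-web-frontend.yaml
```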
Since all values are set to 1, we won't trigger the circuit breaker if we send requests using a single connection at one request per second. If we increase the number of connections and send more requests (for example, two workers sending 50 requests concurrently), we'll start getting errors.
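Re-running the earlier load should now produce some 503 responses:

```shell
kubectl exec deploy/fortio -c fortio -- \
  fortio load -c 2 -qps 0 -n 50 -quiet http://web-frontend
```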
The errors happen because `http2MaxRequests` is set to 1 and we have more than one concurrent request being sent. Additionally, we're exceeding the `maxRequestsPerConnection` limit.
Tip
To reset the metric counters, run:
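One way to reset the counters is through the Envoy admin interface's `reset_counters` endpoint, via `pilot-agent` in the sidecar:

```shell
kubectl exec deploy/fortio -c istio-proxy -- \
  pilot-agent request POST reset_counters
```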
The x-envoy-overloaded header¶
When a request is dropped due to circuit breaking, the response will contain the response header `x-envoy-overloaded` with the value "true".
One way to see this header is to run a fortio load with two concurrent connections in one terminal for a couple of minutes:
kubectl exec deploy/fortio -c fortio -- \
fortio load -c 2 -qps 0 -t 2m --allow-initial-errors -quiet http://web-frontend
In a separate terminal, invoke a single request:
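Fortio's curl prints the response status and headers, so a dropped request can be seen directly:

```shell
kubectl exec deploy/fortio -c fortio -- fortio curl http://web-frontend
```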
Here is an example response to a dropped request:
> HTTP/1.1 503 Service Unavailable
> x-envoy-overloaded: true
> content-length: 81
> content-type: text/plain
> date: Thu, 10 Aug 2023 18:25:37 GMT
> server: envoy
>
> upstream connect error or disconnect/reset before headers. reset reason: overflow
command terminated with exit code 1
Then press Ctrl+C to interrupt the load generation.
Observe failures in Zipkin¶
Open the Zipkin dashboard:
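One way to open it is with istioctl:

```shell
istioctl dashboard zipkin
```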
In the Zipkin UI, list failing traces by clicking the "+" button in the search field and specifying the query `tagQuery=error`. Then click the Run Query button.
Pick a failing trace to view the details.
The requests are failing because the circuit breaker is tripped. The response flags are set to `UO` (Upstream Overflow) and the status code is 503 (Service Unavailable).
Prometheus metrics¶
Another option is looking at the Prometheus metrics directly.
Open the Prometheus dashboard:
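As with Zipkin, istioctl can open it:

```shell
istioctl dashboard prometheus
```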
Apply the following PromQL query:
envoy_cluster_upstream_rq_pending_overflow{app="fortio", cluster_name="outbound|80||web-frontend.default.svc.cluster.local"}
The query shows the metrics for requests originating from the `fortio` app and going to the `web-frontend` service.
The `upstream_rq_pending_overflow` and other metrics are described in the Envoy documentation. Noteworthy are the circuit-breaking-specific metrics showing the state of the various circuit breakers. For example, `rq_open` indicates whether the "requests" circuit breaker is open, and its companion `remaining_rq` indicates how many requests remain before the corresponding circuit breaker trips.
We can also look at the metrics directly from the `istio-proxy` container in the Fortio Pod:
kubectl exec deploy/fortio -c istio-proxy -- \
pilot-agent request GET stats | grep web-frontend | grep pending
cluster.outbound|80||web-frontend.default.svc.cluster.local.circuit_breakers.default.remaining_pending: 1
cluster.outbound|80||web-frontend.default.svc.cluster.local.circuit_breakers.default.rq_pending_open: 0
cluster.outbound|80||web-frontend.default.svc.cluster.local.circuit_breakers.high.rq_pending_open: 0
cluster.outbound|80||web-frontend.default.svc.cluster.local.upstream_rq_pending_active: 0
cluster.outbound|80||web-frontend.default.svc.cluster.local.upstream_rq_pending_failure_eject: 0
cluster.outbound|80||web-frontend.default.svc.cluster.local.upstream_rq_pending_overflow: 26
cluster.outbound|80||web-frontend.default.svc.cluster.local.upstream_rq_pending_total: 24
Info
Yet another convenient way to look at the stats emitted by an Envoy sidecar is via the Envoy dashboard:
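For example, for the Fortio workload (substitute an actual Pod name if your istioctl version doesn't accept a deployment reference):

```shell
istioctl dashboard envoy deploy/fortio
```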
In the web UI, click the "stats" endpoint and filter on the outbound cluster for "web-frontend".
Resolving the errors¶
To resolve these errors, we can adjust the circuit breaker settings.
Increase the maximum number of concurrent requests to 2 (`http2MaxRequests`), as shown below:
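The updated DestinationRule might look like this sketch; only `http2MaxRequests` changes:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-frontend
spec:
  host: web-frontend.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1
        http2MaxRequests: 2
        maxRequestsPerConnection: 1
```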
Save the above YAML to `cb-web-frontend.yaml` and apply the changes:
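Assuming the same file name as before:

```shell
kubectl apply -f cb-web-frontend.yaml
```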
If we re-run Fortio with the same parameters, we'll notice fewer failures this time.
Since we're sending more than one request per connection, we can increase `maxRequestsPerConnection` to 2:
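A sketch of the next revision, bumping `maxRequestsPerConnection`:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-frontend
spec:
  host: web-frontend.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1
        http2MaxRequests: 2
        maxRequestsPerConnection: 2
```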
Save the above YAML to `cb-web-frontend.yaml` and apply the changes:
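Apply it again:

```shell
kubectl apply -f cb-web-frontend.yaml
```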
If we re-run Fortio this time, we'll get zero or close to zero HTTP 503 responses. Even if we increase the number of requests per second, we should only get a small number of 503 responses. To get rid of the remaining failing requests, we can increase `http1MaxPendingRequests` to 2:
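A sketch of the final revision, with all three settings raised to 2:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-frontend
spec:
  host: web-frontend.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 2
        http2MaxRequests: 2
        maxRequestsPerConnection: 2
```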
With these settings (assuming 2 concurrent connections), we can easily handle a higher number of requests.
To be clear, the numbers used in these settings are just examples and are not realistic; we set them intentionally low to make the circuit breaker easier to trip.
Before continuing, delete the DestinationRule:
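For example, using the file it was created from:

```shell
kubectl delete -f cb-web-frontend.yaml
```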
Reset the metric counters:
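The same reset command from the earlier tip works here:

```shell
kubectl exec deploy/fortio -c istio-proxy -- \
  pilot-agent request POST reset_counters
```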
Outlier detection¶
The circuit breaker is great when we want to protect the services from a sudden burst of requests. However, how can we protect the services in case of failures?
For example, if we have a service that is still failing after multiple requests, it doesn't make sense to send even more requests to it. Instead, we can remove the instance of the failing service from the load balancing pool for a certain period of time. That way, we know that the requests will go to other instances of the service. After a pre-defined period of time, we can bring the failing service back into the load balancing pool.
This process is called outlier detection. Just like in the connection pool settings, we can configure outlier detection in the DestinationRule.
To see outlier detection in action, we need a service that is failing. We'll create a `web-frontend-failing` deployment and configure it to return HTTP 503 responses:
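The exact manifest depends on your environment; the sketch below illustrates the idea with plain nginx configured to return 503 for every request. It assumes the `web-frontend` Service selects Pods with the label `app: web-frontend` and targets container port 8080; adjust both to match your setup.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: web-frontend-failing-nginx
data:
  default.conf: |
    server {
      listen 8080;
      location / {
        return 503;
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend-failing
spec:
  # Assumption: more replicas than the healthy deployment (e.g. 4 vs 1),
  # so roughly 80% of requests fail
  replicas: 4
  selector:
    matchLabels:
      app: web-frontend
      version: failing
  template:
    metadata:
      labels:
        # Assumption: the web-frontend Service selects on app=web-frontend,
        # so these Pods join its endpoints next to the healthy replica(s)
        app: web-frontend
        version: failing
    spec:
      containers:
      - name: web-frontend
        image: nginx:1.25
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: nginx-conf
          mountPath: /etc/nginx/conf.d
      volumes:
      - name: nginx-conf
        configMap:
          name: web-frontend-failing-nginx
```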
Save the above YAML to `web-frontend-failing.yaml` and apply it to the cluster:
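Assuming the file name above:

```shell
kubectl apply -f web-frontend-failing.yaml
```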
If we run Fortio, we'll see that the majority (roughly 80%) of the requests fail. That's because the `web-frontend-failing` deployment has more replicas than the "good" deployment.
Let's look at an example of outlier detection configuration:
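A sketch of `outlier-web-frontend.yaml`; the 60-second `baseEjectionTime` matches the explanation further below, while the other values are illustrative. The fields are described in the list that follows:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-frontend
spec:
  host: web-frontend.default.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 10s
      baseEjectionTime: 60s
      maxEjectionPercent: 100
```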
- `consecutive5xxErrors`: the number of 5xx errors in a row that will trigger outlier detection.
- `interval`: the interval at which hosts are checked to determine whether they need to be ejected.
- `baseEjectionTime`: the duration of time an outlier is ejected from the load balancing pool. If the same host is ejected multiple times, the ejection time increases by multiplying the base ejection time by the number of times the host has been ejected.
- `maxEjectionPercent`: the maximum percentage of hosts that can be ejected.
Save the YAML to `outlier-web-frontend.yaml` and apply it:
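Assuming the file name above:

```shell
kubectl apply -f outlier-web-frontend.yaml
```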
If we repeat the test, we might get a similar distribution of responses the first time. However, if we repeat the command once the outliers have been ejected, we'll get a much better distribution.
The reason for more HTTP 200 responses is that as soon as the failing hosts (the failing Pods from the `web-frontend-failing` deployment) are ejected, requests are sent only to the host that doesn't fail. If we waited until the 60-second `baseEjectionTime` expired, the failing hosts would be brought back into the load balancing pool and we'd again get a distribution similar to before (with the majority of requests failing).
We can also look at the metrics from the outlier detection in the same way we did for the circuit breakers:
kubectl exec deploy/fortio -c istio-proxy -- \
pilot-agent request GET stats | grep web-frontend | grep ejections_total
The output shows the total number of outlier ejections for the `web-frontend` cluster.
Note
Other metrics that we can look at are `ejections_consecutive_5xx`, `ejections_enforced_total`, or any other metric with `outlier_detection` in its name. The full list of metric names and their descriptions can be found in the Envoy documentation.
Cleanup¶
To clean up resources created in this lab, run:
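Assuming the manifests created during the lab are still in the current directory, the cleanup might look like this:

```shell
kubectl delete -f outlier-web-frontend.yaml
kubectl delete -f web-frontend-failing.yaml
kubectl delete -f fortio.yaml
```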