Error 503 URX

API calls or cray command invocations sometimes fail, often intermittently, with an HTTP 503 error code.
If this occurs, attempt to remediate the issue by taking the actions below that correspond to the specific error codes
found in the pod or Envoy container log.

The Envoy container is typically named istio-proxy, and it runs as a sidecar for pods that are part of the Istio mesh.
For pods with this sidecar, the logs can be viewed by running a command similar to the following:

ncn-mw# kubectl logs <podname> -n <namespace> -c istio-proxy | grep 503

For general Kubernetes troubleshooting information, including more information on viewing pod logs, see
Kubernetes troubleshooting topics.
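
When many 503 lines are present, it can help to see which response flags dominate before choosing a section of this page. The following is a minimal sketch, not part of the documented procedure: the function name summarize_503_flags is made up for illustration, and it assumes the default Envoy access log layout, in which the response flags follow the status code.

```shell
# Hypothetical helper: reads Envoy access log lines on stdin and prints
# "<flags> <count>" for every distinct flag set seen on a 503 response.
# Assumes the default log layout: [time] "METHOD PATH PROTOCOL" STATUS FLAGS ...
summarize_503_flags() {
    sed -n 's/^[^"]*"[^"]*" 503 \([^ ]*\) .*/\1/p' |
        LC_ALL=C sort |
        uniq -c |
        awk '{ print $2, $1 }'
}

# Example usage (placeholders as in the command above):
#   kubectl logs <podname> -n <namespace> -c istio-proxy | summarize_503_flags
```

Lines with any other status code are ignored, so the output lists only the flag sets that accompany 503 responses.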

This page is broken into different sections, based on the errors found in the log.

  • UF,URX with TLS error
    • Symptom
    • Description
    • Remediation
  • UAEX
    • Symptom
    • Description
    • Remediation
  • Other error codes

UF,URX with TLS error

Symptom (UF,URX with TLS error)

[2022-05-10T16:27:29.232Z] "POST /apis/hbtd/hmi/v1/heartbeat HTTP/2" 503 UF,URX "-" "TLS error: Secret is not supplied by SDS"

Description (UF,URX with TLS error)

Envoy containers can occasionally get into this state when NCNs are being rebooted or upgraded, as well as when many deployments
are being created.

Remediation (UF,URX with TLS error)

Delete the pod or perform a rolling restart:

  • If only a single replica exists, then delete the pod.

  • If multiple replicas are exhibiting the issue, then perform a rolling restart of the Deployment or StatefulSet.

    Here is an example of how to do that for the istio-ingressgateway deployment in the istio-system namespace.

    1. Initiate a rolling restart of the deployment.

      ncn-mw# kubectl rollout restart -n istio-system deployment istio-ingressgateway
      
    2. Wait for the restart to complete.

      ncn-mw# kubectl rollout status -n istio-system deployment istio-ingressgateway
      

Once the rollout is complete, or the new pod is running, the HTTP 503 message should clear.
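
The choice between the two actions can be expressed as a small shell sketch. The function name and its arguments are hypothetical; it only prints the kubectl commands suggested by the steps above, so that they can be reviewed before being run.

```shell
# Hypothetical helper: prints the remediation command(s) described above.
# Arguments: <replica-count> <namespace> <kind> <name> [<pod-name>]
# The replica count can be read with, for example:
#   kubectl get deployment <name> -n <namespace> -o jsonpath='{.spec.replicas}'
suggest_503_remediation() {
    replicas="$1"
    namespace="$2"
    kind="$3"
    name="$4"
    pod="${5:-PODNAME}"
    if [ "$replicas" -le 1 ]; then
        # Single replica: delete the pod and let its controller recreate it.
        echo "kubectl delete pod -n ${namespace} ${pod}"
    else
        # Multiple replicas: perform a rolling restart, then wait for it to finish.
        echo "kubectl rollout restart -n ${namespace} ${kind} ${name}"
        echo "kubectl rollout status -n ${namespace} ${kind} ${name}"
    fi
}

# Example: suggest_503_remediation 3 istio-system deployment istio-ingressgateway
```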

UAEX

Symptom (UAEX)

[2022-06-24T14:16:27.229Z] "POST /apis/hbtd/hmi/v1/heartbeat HTTP/2" 503 UAEX "-" 131 0 30 - "10.34.0.0" "-" "1797b0d3-56f0-4674-8cf2-a8a61f9adaea" "api-gw-service-nmn.local" "-" - - 10.40.0.29:443 10.34.0.0:15995 api-gw-service-nmn.local -

Description (UAEX)

This error code typically indicates an issue with the authorization service (for example, Spire).

Remediation (UAEX)

  1. Initiate a rolling restart of Spire.

    ncn-mw# kubectl rollout restart -n spire statefulset spire-postgres spire-server
    ncn-mw# kubectl rollout restart -n spire daemonset spire-agent request-ncn-join-token
    ncn-mw# kubectl rollout restart -n spire deployment spire-jwks spire-postgres-pooler
    
  2. Wait for all of the restarts to complete.

    ncn-mw# kubectl rollout status -n spire statefulset spire-postgres
    ncn-mw# kubectl rollout status -n spire statefulset spire-server
    ncn-mw# kubectl rollout status -n spire daemonset spire-agent
    ncn-mw# kubectl rollout status -n spire daemonset request-ncn-join-token
    ncn-mw# kubectl rollout status -n spire deployment spire-jwks
    ncn-mw# kubectl rollout status -n spire deployment spire-postgres-pooler
    

Once the restarts are all complete, the HTTP 503 message should clear.

Other error codes

Although the above codes are most common, various other issues such as networking or application errors can cause different errors in
the pod or sidecar logs. Refer to the Envoy access log documentation
for a list of possible Envoy response flags. In general, running a rolling restart of the application itself to see if it clears the error is a good practice.
If that does not resolve the problem, then an understanding of what the error message or response flag means is required to further troubleshoot the issue.

I am creating an Istio service mesh and trying to call an external service from a pod in the mesh.

I followed the steps in

https://istio.io/docs/tasks/traffic-management/egress/egress-gateway-tls-origination/

up to

2. Verify that your ServiceEntry was applied correctly by sending a request to http://edition.cnn.com/politics.

but used my own service in place of "edition.cnn.com".

When I run curl inside my pod, I get the error below.

[2020-02-02T10:02:52.465Z] "GET / HTTP/1.1" 503 UF,URX "-" "-" 0 91 150 - "-" "curl/7.58.0" "fafa8680-bdf1-468a-b50f-1a4430707ceb" "service.abc.com" "173.25.13.66:80" outbound|80||service.abc.com - 173.25.13.66:80 10.44.0.6:47544 - default

I can ping service.abc.com, but how do I debug this error, and how can I get more logs for analysis? The link above did not mention creating mTLS settings or destination rules, so I did not create them.

Note: I have no issue with edition.cnn.com; the problem occurs only with my service, which is external to the mesh and runs on another server within my company network.

Requests are rejected by Envoy

Requests may be rejected for various reasons. The best way to understand why requests are being rejected is
by inspecting Envoy’s access logs. By default, access logs are output to the standard output of the container.
Run the following command to see the log:

$ kubectl logs PODNAME -c istio-proxy -n NAMESPACE

In the default access log format, Envoy response flags are located after the response code.
If you are using a custom log format, make sure to include %RESPONSE_FLAGS%.

Refer to the Envoy response flags
for details of response flags.

Common response flags are:

  • NR: No route configured. Check your DestinationRule or VirtualService.
  • UO: Upstream overflow with circuit breaking. Check your circuit breaker configuration in DestinationRule.
  • UF: Failed to connect to upstream. If you’re using Istio authentication, check for a
    mutual TLS configuration conflict.
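
For quick triage, the flags above can be turned into a small lookup. This is a sketch only: the function name explain_flags is hypothetical, the hints for NR, UO, and UF restate the list above, and the short descriptions of URX, UC, UAEX, and DPE paraphrase the Envoy response flag documentation rather than quote it.

```shell
# Hypothetical helper: prints one hint per flag in a comma-separated
# response flag list such as "UF,URX".
explain_flags() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r flag; do
        case "$flag" in
            NR)   echo "NR: no route configured; check the DestinationRule or VirtualService" ;;
            UO)   echo "UO: upstream overflow (circuit breaking); check the DestinationRule circuit breaker settings" ;;
            UF)   echo "UF: upstream connection failure; check for a mutual TLS configuration conflict" ;;
            URX)  echo "URX: upstream retry limit or maximum connect attempts reached" ;;
            UC)   echo "UC: upstream connection termination" ;;
            UAEX) echo "UAEX: denied by the external authorization service" ;;
            DPE)  echo "DPE: downstream protocol error" ;;
            *)    echo "${flag}: not covered here; see the Envoy response flag reference" ;;
        esac
    done
}

# Example: explain_flags "UF,URX"
```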

Route rules don’t seem to affect traffic flow

With the current Envoy sidecar implementation, up to 100 requests may be required for weighted
version distribution to be observed.

If route rules are working perfectly for the Bookinfo sample,
but similar version routing rules have no effect on your own application, it may be that
your Kubernetes services need to be changed slightly.
Kubernetes services must adhere to certain restrictions in order to take advantage of
Istio’s L7 routing features.
Refer to the Requirements for Pods and Services
for details.

Another potential issue is that the route rules may simply be slow to take effect.
The Istio implementation on Kubernetes utilizes an eventually consistent
algorithm to ensure all Envoy sidecars have the correct configuration
including all route rules. A configuration change will take some time
to propagate to all the sidecars. With large deployments the
propagation will take longer and there may be a lag time on the
order of seconds.

503 errors after setting destination rule

If requests to a service immediately start generating HTTP 503 errors after you applied a DestinationRule
and the errors continue until you remove or revert the DestinationRule, then the DestinationRule is probably
causing a TLS conflict for the service.

For example, if you configure mutual TLS in the cluster globally, the DestinationRule must include the following trafficPolicy:

trafficPolicy:
  tls:
    mode: ISTIO_MUTUAL

Otherwise, the mode defaults to DISABLE, causing client proxy sidecars to make plain HTTP requests
instead of TLS encrypted requests. These requests then conflict with the server proxy, because the server proxy expects
encrypted requests.

Whenever you apply a DestinationRule, ensure the trafficPolicy TLS mode matches the global server configuration.

Route rules have no effect on ingress gateway requests

Let’s assume you are using an ingress Gateway and corresponding VirtualService to access an internal service.
For example, your VirtualService looks something like this:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - "myapp.com" # or maybe "*" if you are testing without DNS using the ingress-gateway IP (e.g., http://1.2.3.4/hello)
  gateways:
  - myapp-gateway
  http:
  - match:
    - uri:
        prefix: /hello
    route:
    - destination:
        host: helloworld.default.svc.cluster.local
  - match:
    ...

You also have a VirtualService which routes traffic for the helloworld service to a particular subset:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: helloworld
spec:
  hosts:
  - helloworld.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: helloworld.default.svc.cluster.local
        subset: v1

In this situation you will notice that requests to the helloworld service via the ingress gateway will
not be directed to subset v1 but instead will continue to use default round-robin routing.

The ingress requests are using the gateway host (e.g., myapp.com)
which will activate the rules in the myapp VirtualService that routes to any endpoint of the helloworld service.
Only internal requests with the host helloworld.default.svc.cluster.local will use the
helloworld VirtualService which directs traffic exclusively to subset v1.

To control the traffic from the gateway, you need to also include the subset rule in the myapp VirtualService:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - "myapp.com" # or maybe "*" if you are testing without DNS using the ingress-gateway IP (e.g., http://1.2.3.4/hello)
  gateways:
  - myapp-gateway
  http:
  - match:
    - uri:
        prefix: /hello
    route:
    - destination:
        host: helloworld.default.svc.cluster.local
        subset: v1
  - match:
    ...

Alternatively, you can combine both VirtualServices into one unit if possible:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp.com # cannot use "*" here since this is being combined with the mesh services
  - helloworld.default.svc.cluster.local
  gateways:
  - mesh # applies internally as well as externally
  - myapp-gateway
  http:
  - match:
    - uri:
        prefix: /hello
      gateways:
      - myapp-gateway #restricts this rule to apply only to ingress gateway
    route:
    - destination:
        host: helloworld.default.svc.cluster.local
        subset: v1
  - match:
    - gateways:
      - mesh # applies to all services inside the mesh
    route:
    - destination:
        host: helloworld.default.svc.cluster.local
        subset: v1

Envoy is crashing under load

Check your ulimit -a. Many systems have a 1024 open file descriptor limit by default which will cause Envoy to assert and crash with:

[2017-05-17 03:00:52.735][14236][critical][assert] assert failure: fd_ != -1: external/envoy/source/common/network/connection_impl.cc:58

Make sure to raise your ulimit. Example: ulimit -n 16384
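
A small shell sketch of that check follows. The function name check_fd_limit is illustrative, and the default threshold simply reuses the 16384 value from the example above. It reports the limit of the shell that runs it, which is only a proxy for the limit Envoy itself inherits; on Linux, the limits of a running process can be read from /proc/<pid>/limits.

```shell
# Hypothetical helper: compares the soft open-file limit of the current shell
# with a required minimum (default 16384) and returns non-zero if it is too low.
check_fd_limit() {
    required="${1:-16384}"
    current="$(ulimit -n)"
    if [ "$current" = "unlimited" ]; then
        echo "ok: open file limit is unlimited"
        return 0
    fi
    if [ "$current" -ge "$required" ]; then
        echo "ok: open file limit is ${current}"
        return 0
    fi
    echo "too low: open file limit is ${current}, wanted at least ${required}"
    return 1
}

# Example: check_fd_limit 16384 || ulimit -n 16384
```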

Envoy won’t connect to my HTTP/1.0 service

Envoy requires HTTP/1.1 or HTTP/2 traffic for upstream services. For example, when using NGINX for serving traffic behind Envoy, you
will need to set the proxy_http_version directive in your NGINX configuration to be “1.1”, since the NGINX default is 1.0.

Example configuration:

upstream http_backend {
    server 127.0.0.1:8080;

    keepalive 16;
}

server {
    ...

    location /http/ {
        proxy_pass http://http_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        ...
    }
}

503 error while accessing headless services

Assume Istio is installed with the following configuration:

  • mTLS mode set to STRICT within the mesh
  • meshConfig.outboundTrafficPolicy.mode set to ALLOW_ANY

Suppose nginx is deployed as a StatefulSet in the default namespace, with a corresponding headless Service defined as shown below:

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: http-web  # Explicitly defining an http port
  clusterIP: None   # Creates a Headless Service
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web

The port name http-web in the Service definition explicitly specifies the http protocol for that port.

Assume there is also a sleep pod Deployment in the default namespace.
When nginx is accessed from this sleep pod using its Pod IP (one common way to access a headless service), the request goes via the PassthroughCluster to the server side, but the sidecar proxy on the server side fails to find a route entry for nginx and returns HTTP 503 UC.

$ export SOURCE_POD=$(kubectl get pod -l app=sleep -o jsonpath='{.items..metadata.name}')
$ kubectl exec -it $SOURCE_POD -c sleep -- curl 10.1.1.171 -s -o /dev/null -w "%{http_code}"
  503

10.1.1.171 is the Pod IP of one of the replicas of nginx and the service is accessed on containerPort 80.

Here are some of the ways to avoid this 503 error:

  1. Specify the correct Host header:

    The Host header in the curl request above will be the Pod IP by default. Specifying the Host header as nginx.default in our request to nginx successfully returns HTTP 200 OK.

    $ export SOURCE_POD=$(kubectl get pod -l app=sleep -o jsonpath='{.items..metadata.name}')
    $ kubectl exec -it $SOURCE_POD -c sleep -- curl -H "Host: nginx.default" 10.1.1.171 -s -o /dev/null -w "%{http_code}"
      200
    
  2. Set port name to tcp or tcp-web or tcp-<custom_name>:

    Here the protocol is explicitly specified as tcp. In this case, only the TCP Proxy network filter on the sidecar proxy is used, on both the client side and the server side. The HTTP Connection Manager is not used at all, so no headers are expected in the request.

    A request to nginx with or without explicitly setting the Host header successfully returns HTTP 200 OK.

    This is useful in certain scenarios where a client may not be able to include header information in the request.

    $ export SOURCE_POD=$(kubectl get pod -l app=sleep -o jsonpath='{.items..metadata.name}')
    $ kubectl exec -it $SOURCE_POD -c sleep -- curl 10.1.1.171 -s -o /dev/null -w "%{http_code}"
      200
    
    $ kubectl exec -it $SOURCE_POD -c sleep -- curl -H "Host: nginx.default" 10.1.1.171 -s -o /dev/null -w "%{http_code}"
      200
    
  3. Use domain name instead of Pod IP:

    A specific instance of a headless service can also be accessed using just the domain name.

    $ export SOURCE_POD=$(kubectl get pod -l app=sleep -o jsonpath='{.items..metadata.name}')
    $ kubectl exec -it $SOURCE_POD -c sleep -- curl web-0.nginx.default -s -o /dev/null -w "%{http_code}"
      200
    

    Here web-0 is the pod name of one of the 3 replicas of nginx.
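
The host name used above follows the pattern <pod name>.<service name>.<namespace>, where each StatefulSet pod name is the StatefulSet name plus an ordinal. As a sketch, with a hypothetical helper name, the host names for all replicas can be generated like this:

```shell
# Hypothetical helper: prints the per-pod host names of a StatefulSet
# that sits behind a headless Service.
# Arguments: <statefulset-name> <replicas> <service-name> <namespace>
statefulset_pod_hosts() {
    i=0
    while [ "$i" -lt "$2" ]; do
        echo "${1}-${i}.${3}.${4}"
        i=$((i + 1))
    done
}

# For the example above: statefulset_pod_hosts web 3 nginx default
```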

Refer to this traffic routing page for some additional information on headless services and traffic routing behavior for different protocols.

TLS configuration mistakes

Many traffic management problems
are caused by incorrect TLS configuration.
The following sections describe some of the most common misconfigurations.

Sending HTTPS to an HTTP port

If your application sends an HTTPS request to a service declared to be HTTP,
the Envoy sidecar will attempt to parse the request as HTTP while forwarding the request,
which will fail because the traffic is unexpectedly encrypted.

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: httpbin
spec:
  hosts:
  - httpbin.org
  ports:
  - number: 443
    name: http
    protocol: HTTP
  resolution: DNS

Although the above configuration may be correct if you are intentionally sending plaintext on port 443 (e.g., curl http://httpbin.org:443),
generally port 443 is dedicated for HTTPS traffic.

Sending an HTTPS request like curl https://httpbin.org, which defaults to port 443, will result in an error like
curl: (35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number.
The access logs may also show an error like 400 DPE.

To fix this, you should change the port protocol to HTTPS:

spec:
  ports:
  - number: 443
    name: https
    protocol: HTTPS

Gateway to virtual service TLS mismatch

There are two common TLS mismatches that can occur when binding a virtual service to a gateway.

  1. The gateway terminates TLS while the virtual service configures TLS routing.
  2. The gateway does TLS passthrough while the virtual service configures HTTP routing.

Gateway with TLS termination

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    hosts:
      - "*"
    tls:
      mode: SIMPLE
      credentialName: sds-credential
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: httpbin
spec:
  hosts:
  - "*.example.com"
  gateways:
  - istio-system/gateway
  tls:
  - match:
    - sniHosts:
      - "*.example.com"
    route:
    - destination:
        host: httpbin.org

In this example, the gateway is terminating TLS while the virtual service is using TLS based routing.
The TLS route rules will have no effect since the TLS is already terminated when the route rules are evaluated.

With this misconfiguration, you will end up getting 404 responses because the requests will be
sent to HTTP routing but there are no HTTP routes configured.
You can confirm this using the istioctl proxy-config routes command.

To fix this problem, you should switch the virtual service to specify http routing, instead of tls:

spec:
  ...
  http:
  - match: ...

Gateway with TLS passthrough

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - "*"
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      mode: PASSTHROUGH
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: virtual-service
spec:
  gateways:
  - gateway
  hosts:
  - httpbin.example.com
  http:
  - route:
    - destination:
        host: httpbin.org

In this configuration, the virtual service is attempting to match HTTP traffic against TLS traffic passed through the gateway.
This will result in the virtual service configuration having no effect. You can observe that the HTTP route is not applied using
the istioctl proxy-config listener and istioctl proxy-config route commands.

To fix this, you should switch the virtual service to configure tls routing:

spec:
  tls:
  - match:
    - sniHosts: ["httpbin.example.com"]
    route:
    - destination:
        host: httpbin.org

Alternatively, you could terminate TLS, rather than passing it through, by switching the tls configuration in the gateway:

spec:
  ...
    tls:
      credentialName: sds-credential
      mode: SIMPLE

Double TLS (TLS origination for a TLS request)

When configuring Istio to perform TLS origination, you need to make sure
that the application sends plaintext requests to the sidecar, which will then originate the TLS.

The following DestinationRule originates TLS for requests to the httpbin.org service,
but the corresponding ServiceEntry defines the protocol as HTTPS on port 443.

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: httpbin
spec:
  hosts:
  - httpbin.org
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  resolution: DNS
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: originate-tls
spec:
  host: httpbin.org
  trafficPolicy:
    tls:
      mode: SIMPLE

With this configuration, the sidecar expects the application to send TLS traffic on port 443
(e.g., curl https://httpbin.org), but it will also perform TLS origination before forwarding requests.
This will cause the requests to be double encrypted.

For example, sending a request like curl https://httpbin.org will result in an error:
(35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number.

You can fix this example by changing the port protocol in the ServiceEntry to HTTP:

spec:
  hosts:
  - httpbin.org
  ports:
  - number: 443
    name: http
    protocol: HTTP

Note that with this configuration your application will need to send plaintext requests to port 443,
like curl http://httpbin.org:443, because TLS origination does not change the port.
However, starting in Istio 1.8, you can expose HTTP port 80 to the application (e.g., curl http://httpbin.org)
and then redirect requests to targetPort 443 for the TLS origination:

spec:
  hosts:
  - httpbin.org
  ports:
  - number: 80
    name: http
    protocol: HTTP
    targetPort: 443

404 errors occur when multiple gateways configured with same TLS certificate

Configuring more than one gateway using the same TLS certificate will cause browsers
that leverage HTTP/2 connection reuse
(i.e., most browsers) to produce 404 errors when accessing a second host after a
connection to another host has already been established.

For example, let’s say you have 2 hosts that share the same TLS certificate like this:

  • Wildcard certificate *.test.com installed in istio-ingressgateway
  • Gateway configuration gw1 with host service1.test.com, selector istio: ingressgateway, and TLS using gateway’s mounted (wildcard) certificate
  • Gateway configuration gw2 with host service2.test.com, selector istio: ingressgateway, and TLS using gateway’s mounted (wildcard) certificate
  • VirtualService configuration vs1 with host service1.test.com and gateway gw1
  • VirtualService configuration vs2 with host service2.test.com and gateway gw2

Since both gateways are served by the same workload (i.e., selector istio: ingressgateway) requests to both services
(service1.test.com and service2.test.com) will resolve to the same IP. If service1.test.com is accessed first, it
will return the wildcard certificate (*.test.com) indicating that connections to service2.test.com can use the same certificate.
Browsers like Chrome and Firefox will consequently reuse the existing connection for requests to service2.test.com.
Since the gateway (gw1) has no route for service2.test.com, it will then return a 404 (Not Found) response.

You can avoid this problem by configuring a single wildcard Gateway, instead of two (gw1 and gw2).
Then, simply bind both VirtualServices to it like this:

  • Gateway configuration gw with host *.test.com, selector istio: ingressgateway, and TLS using gateway’s mounted (wildcard) certificate
  • VirtualService configuration vs1 with host service1.test.com and gateway gw
  • VirtualService configuration vs2 with host service2.test.com and gateway gw

Configuring SNI routing when not sending SNI

An HTTPS Gateway that specifies the hosts field will perform an SNI match on incoming requests.
For example, the following configuration would only allow requests that match *.example.com in the SNI:

servers:
- port:
    number: 443
    name: https
    protocol: HTTPS
  hosts:
  - "*.example.com"

This may cause certain requests to fail.

For example, if you do not have DNS set up and are instead directly setting the host header, such as curl 1.2.3.4 -H "Host: app.example.com", no SNI will be set, causing the request to fail.
Instead, you can set up DNS or use the --resolve flag of curl. See the Secure Gateways task for more information.
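
As a sketch of the --resolve approach, the helper below (a hypothetical name) pins the host name to the gateway address so that curl still sends the host name as SNI and as the Host header. It assumes the gateway listens on port 443 and that its certificate is trusted by curl; app.example.com and 1.2.3.4 are the placeholder values from the example above.

```shell
# Hypothetical helper: request https://<host><path> through a specific gateway IP.
# --resolve maps <host>:443 to <ip> for this request only, so no DNS record is needed.
# Prints only the HTTP status code of the response.
curl_via_gateway_ip() {
    host="$1"
    ip="$2"
    path="${3:-/}"
    curl -sS -o /dev/null -w '%{http_code}\n' \
        --resolve "${host}:443:${ip}" "https://${host}${path}"
}

# Example: curl_via_gateway_ip app.example.com 1.2.3.4 /hello
```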

Another common issue is load balancers in front of Istio.
Most cloud load balancers will not forward the SNI, so if you are terminating TLS in your cloud load balancer you may need to do one of the following:

  • Configure the cloud load balancer to pass through the TLS connection instead
  • Disable SNI matching in the Gateway by setting the hosts field to *

A common symptom of this is for the load balancer health checks to succeed while real traffic fails.

Unchanged Envoy filter configuration suddenly stops working

An EnvoyFilter configuration that specifies an insert position relative to another filter can be very
fragile because, by default, the order of evaluation is based on the creation time of the filters.
Consider a filter with the following specification:

spec:
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: SIDECAR_OUTBOUND
      listener:
        portNumber: 443
        filterChain:
          filter:
            name: istio.stats
    patch:
      operation: INSERT_BEFORE
      value:
        ...

To work properly, this filter configuration depends on the istio.stats filter having an older creation time
than it. Otherwise, the INSERT_BEFORE operation will be silently ignored. There will be nothing in the
error log to indicate that this filter has not been added to the chain.

This is particularly problematic when matching filters, like istio.stats, that are version
specific (i.e., that include the proxyVersion field in their match criteria). Such filters may be removed
or replaced by newer ones when upgrading Istio. As a result, an EnvoyFilter like the one above may initially
be working perfectly but after upgrading Istio to a newer version it will no longer be included in the network
filter chain of the sidecars.

To avoid this issue, you can either change the operation to one that does not depend on the presence of
another filter (e.g., INSERT_FIRST), or set an explicit priority in the EnvoyFilter to override the
default creation time-based ordering. For example, adding priority: 10 to the above filter will ensure
that it is processed after the istio.stats filter which has a default priority of 0.

Virtual service with fault injection and retry/timeout policies not working as expected

Currently, Istio does not support configuring fault injections and retry or timeout policies on the
same VirtualService. Consider the following configuration:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: helloworld
spec:
  hosts:
    - "*"
  gateways:
  - helloworld-gateway
  http:
  - match:
    - uri:
        exact: /hello
    fault:
      abort:
        httpStatus: 500
        percentage:
          value: 50
    retries:
      attempts: 5
      retryOn: 5xx
    route:
    - destination:
        host: helloworld
        port:
          number: 5000

You would expect that given the configured five retry attempts, the user would almost never see any
errors when calling the helloworld service. However since both fault and retries are configured on
the same VirtualService, the retry configuration does not take effect, resulting in a 50% failure
rate. To work around this issue, you may remove the fault config from your VirtualService and
inject the fault to the upstream Envoy proxy using EnvoyFilter instead:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: hello-world-filter
spec:
  workloadSelector:
    labels:
      app: helloworld
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND # will match inbound listeners in the sidecars of the selected workload
      listener:
        filterChain:
          filter:
            name: "envoy.filters.network.http_connection_manager"
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.fault
        typed_config:
          "@type": "type.googleapis.com/envoy.extensions.filters.http.fault.v3.HTTPFault"
          abort:
            http_status: 500
            percentage:
              numerator: 50
              denominator: HUNDRED

This works because this way the retry policy is configured for the client proxy while the fault
injection is configured for the upstream proxy.
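
To put a number on the expectation described above: assuming each attempt is aborted independently with probability p = 0.5, and that attempts: 5 allows up to five retries after the initial request, a caller sees an error only if all six attempts are aborted. This is a rough estimate under those assumptions, not a measured result:

```latex
P(\text{error seen by caller}) = p^{\,1+5} = 0.5^{6} = \frac{1}{64} \approx 1.6\%
```

By contrast, when the retry policy does not take effect, the caller sees the raw abort rate of 50%.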

Bug description
I’m following https://istio.io/latest/docs/tasks/traffic-management/egress/egress-gateway-tls-origination/#perform-tls-origination-with-an-egress-gateway to perform TLS origination, in order to "convert" HTTP traffic directed to an external endpoint to HTTPS.
If I try the domain from the example (edition.cnn.com), things seem to work pretty well. However, with other domains the behaviour is not really "deterministic": for some of them it works, for others it doesn’t.
For instance: after setting up a new ServiceEntry, VirtualService, and DestinationRule (actually two: one for sending traffic to the egress gateway and one for sending traffic from the egress gateway to www.facebook.com, as indicated in the documentation) for www.facebook.com, it works fine. But if I try something like www.gemuese.ch (after configuring the same set of resources for this domain), it does not:

root@command-demo-privileged-57c9c99d77-kdc4z:/# curl www.gemuese.ch
upstream connect error or disconnect/reset before headers. reset reason: connection failure

The access logs of the sidecar container of command-demo-privileged-57c9c99d77-kdc4z:

[2021-05-17T22:45:12.675Z] "GET / HTTP/1.1" 503 URX via_upstream - "-" 0 91 412 412 "-" "curl/7.64.0" "b3152ea2-5034-44c0-a86e-4567bb36b1dc" "www.gemuese.ch" "10.32.22.76:8080" outbound|80|gemuese|istio-egressgateway-http.istio-system.svc.cluster.local 10.32.18.17:50088 88.99.98.200:80 10.32.18.17:46026 - -

The access logs of the egress gateway:

[2021-05-17T23:02:00.311Z] "GET / HTTP/2" 503 UF,URX upstream_reset_before_response_started{connection_failure} - "-" 0 91 130 - "10.32.18.17" "curl/7.64.0" "886573f2-df9b-9682-b53d-71d626a61d67" "www.gemuese.ch" "88.99.98.200:443" outbound|443||www.gemuese.ch - 10.32.22.76:8080 10.32.18.17:43276 www.gemuese.ch -
[2021-05-17T23:02:00.452Z] "GET / HTTP/2" 503 UF,URX upstream_reset_before_response_started{connection_failure} - "-" 0 91 137 - "10.32.18.17" "curl/7.64.0" "886573f2-df9b-9682-b53d-71d626a61d67" "www.gemuese.ch" "88.99.98.200:443" outbound|443||www.gemuese.ch - 10.32.22.76:8080 10.32.18.17:43276 www.gemuese.ch -
[2021-05-17T23:02:00.630Z] "GET / HTTP/2" 503 UF,URX upstream_reset_before_response_started{connection_failure} - "-" 0 91 111 - "10.32.18.17" "curl/7.64.0" "886573f2-df9b-9682-b53d-71d626a61d67" "www.gemuese.ch" "88.99.98.200:443" outbound|443||www.gemuese.ch - 10.32.22.76:8080 10.32.18.17:43276 www.gemuese.ch -

If I enable debug logs on the egress gateway:

istioctl pc log --level "admin:debug,aws:debug,assert:debug,backtrace:debug,client:debug,config:debug,connection:debug,conn_handler:debug,dubbo:debug,file:debug,filter:debug,forward_proxy:debug,grpc:debug,hc:debug,health_checker:debug,http:debug,http2:debug,hystrix:debug,init:debug,io:debug,jwt:debug,main:debug,misc:debug,quic:debug,pool:debug,rbac:trace,router:debug,runtime:debug,stats:debug,secret:debug,tap:debug,testing:debug,tracing:debug,upstream:debug,udp:debug,envoy_bug:debug,ext_authz:debug,cache_filter:debug" istio-egressgateway-http-55c6c5d89d-fnvlh.istio-system

kubectl logs -n istio-system -f istio-egressgateway-http-55c6c5d89d-fnvlh --tail=100
...
...
2021-05-17T23:02:00.311946Z	debug	envoy pool	queueing stream due to no available connections
2021-05-17T23:02:00.311952Z	debug	envoy pool	creating a new connection
2021-05-17T23:02:00.336836Z	debug	envoy misc	Unknown error code 104 details Connection reset by peer
2021-05-17T23:02:00.336889Z	debug	envoy pool	[C10477] client disconnected, failure reason:
2021-05-17T23:02:00.336903Z	debug	envoy router	[C10231][S3091061571089400553] upstream reset: reset reason: connection failure, transport failure reason:
2021-05-17T23:02:00.356083Z	debug	envoy router	[C10231][S3091061571089400553] performing retry
2021-05-17T23:02:00.356132Z	debug	envoy pool	queueing stream due to no available connections
2021-05-17T23:02:00.356138Z	debug	envoy pool	creating a new connection
2021-05-17T23:02:00.380846Z	debug	envoy misc	Unknown error code 104 details Connection reset by peer
2021-05-17T23:02:00.380893Z	debug	envoy pool	[C10478] client disconnected, failure reason:
2021-05-17T23:02:00.380905Z	debug	envoy router	[C10231][S3091061571089400553] upstream reset: reset reason: connection failure, transport failure reason:
2021-05-17T23:02:00.417241Z	debug	envoy router	[C10231][S3091061571089400553] performing retry
2021-05-17T23:02:00.417303Z	debug	envoy pool	queueing stream due to no available connections
2021-05-17T23:02:00.417313Z	debug	envoy pool	creating a new connection
2021-05-17T23:02:00.442163Z	debug	envoy misc	Unknown error code 104 details Connection reset by peer
2021-05-17T23:02:00.442211Z	debug	envoy pool	[C10479] client disconnected, failure reason:
2021-05-17T23:02:00.442224Z	debug	envoy router	[C10231][S3091061571089400553] upstream reset: reset reason: connection failure, transport failure reason:
2021-05-17T23:02:00.442265Z	debug	envoy http	[C10231][S3091061571089400553] Sending local reply with details upstream_reset_before_response_started{connection failure}
2021-05-17T23:02:00.442317Z	debug	envoy http	[C10231][S3091061571089400553] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '91'
'content-type', 'text/plain'
'date', 'Mon, 17 May 2021 23:02:00 GMT'
'server', 'istio-envoy'

2021-05-17T23:02:00.442532Z	debug	envoy filter	Called AuthenticationFilter : onDestroy
2021-05-17T23:02:00.452372Z	debug	envoy http	[C10231] new stream
2021-05-17T23:02:00.452463Z	debug	envoy http	[C10231][S387142506190779831] request headers complete (end_stream=true):
':authority', 'www.gemuese.ch'
':path', '/'
':method', 'GET'
':scheme', 'https'
'user-agent', 'curl/7.64.0'
'accept', '*/*'
'x-forwarded-proto', 'http'
'x-request-id', '886573f2-df9b-9682-b53d-71d626a61d67'
'x-envoy-decorator-operation', 'istio-egressgateway-http.istio-system.svc.cluster.local:80/*'

I’m not sure how to debug the "Unknown error code 104 details Connection reset by peer" message.
I don’t get why a different URL gives this kind of error: as far as I understood, the target server just sees the HTTPS request. That’s why I don’t see why the behaviour should be different for different domains.
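
As a note on reading the debug log above: error code 104 is the Linux errno `ECONNRESET`, meaning the remote side reset the TCP connection. Each reset is followed by a retry until Envoy sends the local 503 reply. The sketch below tallies those events per stream. It assumes the debug log line shapes shown above; the function name `summarize` and the trimmed sample lines are illustrative only.

```python
import re
from collections import defaultdict

# Router/http debug lines carry a "[C<connection>][S<stream>]" prefix.
STREAM_RE = re.compile(r"\[C(?P<conn>\d+)\]\[S(?P<stream>\d+)\] (?P<msg>.*)")
LOCAL_REPLY_PREFIX = "Sending local reply with details"


def summarize(lines):
    """Count upstream resets and retries per stream ID."""
    summary = defaultdict(lambda: {"resets": 0, "retries": 0, "local_reply": None})
    for line in lines:
        m = STREAM_RE.search(line)
        if m is None:
            continue  # pool/misc lines have no stream ID
        entry = summary[m["stream"]]
        msg = m["msg"]
        if msg.startswith("upstream reset:"):
            entry["resets"] += 1
        elif msg.startswith("performing retry"):
            entry["retries"] += 1
        elif msg.startswith(LOCAL_REPLY_PREFIX):
            entry["local_reply"] = msg[len(LOCAL_REPLY_PREFIX):].strip()
    return dict(summary)


# Lines copied from the excerpt above (timestamps and pool lines trimmed).
S = "[C10231][S3091061571089400553]"
excerpt = [
    "debug\tenvoy misc\tUnknown error code 104 details Connection reset by peer",
    "debug\tenvoy pool\t[C10477] client disconnected, failure reason:",
    f"debug\tenvoy router\t{S} upstream reset: reset reason: connection failure, transport failure reason:",
    f"debug\tenvoy router\t{S} performing retry",
    f"debug\tenvoy router\t{S} upstream reset: reset reason: connection failure, transport failure reason:",
    f"debug\tenvoy router\t{S} performing retry",
    f"debug\tenvoy router\t{S} upstream reset: reset reason: connection failure, transport failure reason:",
    f"debug\tenvoy http\t{S} Sending local reply with details upstream_reset_before_response_started{{connection failure}}",
]

for stream, counts in summarize(excerpt).items():
    print(stream, counts)
```

Within the visible excerpt this yields three upstream resets and two retries for the stream before the local reply. Lines elided by `...` in the original log are not counted.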

[ ] Docs
[ ] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[X] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
[ ] Upgrade

Expected behavior
Running curl www.gemuese.ch should return the same content as the HTTPS page.

Steps to reproduce the bug
Follow https://istio.io/latest/docs/tasks/traffic-management/egress/egress-gateway-tls-origination/#perform-tls-origination-with-an-egress-gateway but use another domain (e.g., www.gemuese.ch).

Version (include the output of istioctl version --remote and kubectl version --short and helm version --short if you used Helm)

istioctl  version --remote
client version: 1.9.0
control plane version: 1.9.2
data plane version: 1.9.2 (159 proxies)

kubectl version --short
Client Version: v1.19.3
Server Version: v1.19.7

helm version --short
v3.4.1+gc4e7485

How was Istio installed?
With the istio operator, via the helm chart on https://github.com/istio/istio.git (under the manifests/charts/istio-operator folder).

Environment where the bug was observed (cloud vendor, OS, etc)
AKS.

Additionally, please consider running istioctl bug-report and attach the generated cluster-state tarball to this issue.
Refer to the cluster state archive for more details.

Bug description

We have been seeing external health checks randomly fail with 503 URX reported by the ingress gateway. This failure happens infrequently (once or twice a day per service). I have analysed two cases, and in both I was able to narrow down the cause to 503s being reported by the sidecar itself. The sidecar is reporting 503s with the response flag "NR".

These failures are affecting our stability (sometimes non-health-check requests will fail) as well as adding noise to our monitoring.

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior

Given that the cluster was stable during this time and no pods have come up or down, I do not expect to have any requests fail within the mesh due to the sidecar configuration.

Steps to reproduce the bug

  • Run a cluster with a retry policy and external monitoring
  • Check for 503s.

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

istio:

client version: 1.4.2
control plane version: 1.4.3
data plane version: 1.4.3 (11 proxies), 1.4.2 (3 proxies)

Note: we did the 1.4.3 upgrade last week. I guess not every pod has rolled over yet (hence the 3 on 1.4.2). Either way, the pod that had the problem is running 1.4.3:

2020-01-21T14:21:08.861754Z info Version 1.4.3-17f6bfc3d7121ad527c2d617ffc27c758d6a7241-Clean

Further, we have seen this issue prior to the 1.4.3 upgrade, so I don’t think the version skew is at fault.

kubernetes:

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-10T03:03:57Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.11-gke.14", GitCommit:"56d89863d1033f9668ddd6e1c1aea81cd846ef88", GitTreeState:"clean", BuildDate:"2019-11-07T19:12:22Z", GoVersion:"go1.12.11b4", Compiler:"gc", Platform:"linux/amd64"}

helm:

version.BuildInfo{Version:"v3.0.2", GitCommit:"19e47ee3283ae98139d98460de796c1be1e3975f", GitTreeState:"clean", GoVersion:"go1.13.5"}

How was Istio installed?
helm template, followed by kustomize.

Environment where bug was observed (cloud vendor, OS, etc)

Google Kubernetes Engine (GKE)

Extra info

The route getting us there (from the virtual service):

    match:
    - uri:
        prefix: /v1/tokens/healthz
    retries:
      attempts: 3
      retryOn: connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,5xx,retriable-status-codes
    rewrite:
      uri: /healthz
    route:
    - destination:
        host: v1-tokens.api.svc.cluster.local
        port:
          number: 5005

I have some logs that show the issue. Note how the same path and authority worked for a request started at 14:53:45.167Z but then failed for one started at 14:53:45.548Z.

[Envoy (Epoch 0)] [2020-01-21 14:53:44.957][20][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 13,
[Envoy (Epoch 0)] [2020-01-21 14:53:45.286][20][warning][filter] [src/envoy/http/authn/http_filter_factory.cc:102] mTLS PERMISSIVE mode is used, connection can be either plaintext or TLS, and client cert can be omitted. Please consider to upgrade to mTLS STRICT mode for more secure configuration that only allows TLS connection with client cert. See https://istio.io/docs/tasks/security/mtls-migration/
[Envoy (Epoch 0)] [2020-01-21 14:53:45.292][20][warning][filter] [src/envoy/http/authn/http_filter_factory.cc:102] mTLS PERMISSIVE mode is used, connection can be either plaintext or TLS, and client cert can be omitted. Please consider to upgrade to mTLS STRICT mode for more secure configuration that only allows TLS connection with client cert. See https://istio.io/docs/tasks/security/mtls-migration/
[Envoy (Epoch 0)] [2020-01-21 14:53:45.316][20][warning][filter] [src/envoy/http/authn/http_filter_factory.cc:102] mTLS PERMISSIVE mode is used, connection can be either plaintext or TLS, and client cert can be omitted. Please consider to upgrade to mTLS STRICT mode for more secure configuration that only allows TLS connection with client cert. See https://istio.io/docs/tasks/security/mtls-migration/
[Envoy (Epoch 0)] [2020-01-21 14:53:45.329][20][warning][filter] [src/envoy/http/authn/http_filter_factory.cc:102] mTLS PERMISSIVE mode is used, connection can be either plaintext or TLS, and client cert can be omitted. Please consider to upgrade to mTLS STRICT mode for more secure configuration that only allows TLS connection with client cert. See https://istio.io/docs/tasks/security/mtls-migration/
{"upstream_cluster":"inbound|5005|http|v1-tokens.api.svc.cluster.local","downstream_remote_address":"10.162.15.223:58308","authority":"10.24.11.30:5005","path":"/healthz","protocol":"HTTP/1.1","upstream_service_time":"3","upstream_local_address":"-","duration":"6","downstream_local_address":"10.24.11.30:5005","upstream_transport_failure_reason":"-","route_name":"default","response_code":"200","user_agent":"kube-probe/1.13+","response_flags":"-","start_time":"2020-01-21T14:53:43.772Z","method":"GET","request_id":"5ec4d28e-753e-4c3d-91de-ad42965f2499","upstream_host":"127.0.0.1:5005","x_forwarded_for":"-","requested_server_name":"-","bytes_received":"0","istio_policy_status":"-","bytes_sent":"2"}
{"upstream_cluster":"inbound|5005|http|v1-tokens.api.svc.cluster.local","downstream_remote_address":"10.24.11.15:55374","authority":"10.24.11.30:5005","path":"/metrics","protocol":"HTTP/1.1","upstream_service_time":"5","upstream_local_address":"-","duration":"8","downstream_local_address":"10.24.11.30:5005","upstream_transport_failure_reason":"-","route_name":"default","response_code":"200","user_agent":"Prometheus/2.12.0","response_flags":"-","start_time":"2020-01-21T14:53:43.772Z","method":"GET","request_id":"b459b4c3-4836-4943-8fed-52c622dad5d0","upstream_host":"127.0.0.1:5005","x_forwarded_for":"-","requested_server_name":"-","bytes_received":"0","istio_policy_status":"-","bytes_sent":"6141"}
{"bytes_sent":"2","upstream_cluster":"inbound|5005|http|v1-tokens.api.svc.cluster.local","downstream_remote_address":"35.236.207.68:0","authority":"api.agilicus.com","path":"/v1/tokens/healthz","protocol":"HTTP/1.1","upstream_service_time":"1","upstream_local_address":"-","duration":"4","downstream_local_address":"10.24.11.30:5005","upstream_transport_failure_reason":"-","route_name":"default","response_code":"200","user_agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","response_flags":"-","start_time":"2020-01-21T14:53:45.167Z","method":"GET","request_id":"528e0cfe-19c4-42bf-966a-ee3ae2d18e74","upstream_host":"127.0.0.1:5005","x_forwarded_for":"35.236.207.68","requested_server_name":"-","bytes_received":"0","istio_policy_status":"-"}
{"bytes_sent":"0","upstream_cluster":"-","downstream_remote_address":"XXX.XXX.XXX.XXX:0","authority":"api.agilicus.com","path":"/v1/tokens/healthz","protocol":"HTTP/1.1","upstream_service_time":"-","upstream_local_address":"-","duration":"3","downstream_local_address":"10.24.11.30:5005","upstream_transport_failure_reason":"-","route_name":"default","response_code":"503","user_agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","response_flags":"NR","start_time":"2020-01-21T14:53:45.548Z","method":"GET","request_id":"a3f7c112-2ae1-4494-be7f-e5248fbbe530","upstream_host":"-","x_forwarded_for":"XXX.XXX.XXX.XXX","requested_server_name":"-","bytes_received":"0","istio_policy_status":"-"}
{"response_code":"503","user_agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","response_flags":"NR","start_time":"2020-01-21T14:53:45.577Z","method":"GET","request_id":"a3f7c112-2ae1-4494-be7f-e5248fbbe530","upstream_host":"-","x_forwarded_for":"XXX.XXX.XXX.XXX","requested_server_name":"-","bytes_received":"0","istio_policy_status":"-","bytes_sent":"0","upstream_cluster":"-","downstream_remote_address":"XXX.XXX.XXX.XXX:0","authority":"api.agilicus.com","path":"/v1/tokens/healthz","protocol":"HTTP/1.1","upstream_service_time":"-","upstream_local_address":"-","duration":"2","downstream_local_address":"10.24.11.30:5005","upstream_transport_failure_reason":"-","route_name":"default"}
{"istio_policy_status":"-","bytes_sent":"0","upstream_cluster":"-","downstream_remote_address":"XXX.XXX.XXX.XXX:0","authority":"api.agilicus.com","path":"/v1/tokens/healthz","protocol":"HTTP/1.1","upstream_service_time":"-","upstream_local_address":"-","duration":"3","downstream_local_address":"10.24.11.30:5005","upstream_transport_failure_reason":"-","route_name":"default","response_code":"503","user_agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","response_flags":"NR","start_time":"2020-01-21T14:53:45.592Z","method":"GET","request_id":"a3f7c112-2ae1-4494-be7f-e5248fbbe530","upstream_host":"-","x_forwarded_for":"XXX.XXX.XXX.XXX","requested_server_name":"-","bytes_received":"0"}

Relevant log on the ingress:

{"upstream_local_address":"-","duration":"94","downstream_local_address":"10.24.11.20:443","upstream_transport_failure_reason":"-","route_name":"-","response_code":"503","user_agent":"GoogleStackdriverMonitoring-UptimeChecks(https://cloud.google.com/monitoring)","response_flags":"URX","start_time":"2020-01-21T14:53:45.548Z","method":"GET","request_id":"a3f7c112-2ae1-4494-be7f-e5248fbbe530","upstream_host":"10.24.11.30:5005","x_forwarded_for":"XXX.XXX.XXX.XXX","requested_server_name":"api.agilicus.com","bytes_received":"0","istio_policy_status":"-","bytes_sent":"0","upstream_cluster":"outbound|5005||v1-tokens.api.svc.cluster.local","downstream_remote_address":"XXX.XXX.XXX.XXX:38317","authority":"api.agilicus.com","path":"/v1/tokens/healthz","protocol":"HTTP/2","upstream_service_time":"-"}
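
The sidecar and ingress entries above share a `request_id`, so the two logs can be joined on that field to see which sidecar responses sit behind one ingress failure. The sketch below assumes JSON access logging as shown. The entries are trimmed to the fields used here, and the helper names `load` and `failures_by_request` are hypothetical.

```python
import json
from collections import defaultdict


def load(lines, source):
    """Parse JSON access log lines, skipping non-JSON lines such as Envoy warnings."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        entry = json.loads(line)
        entry["source"] = source  # remember which proxy logged the entry
        entries.append(entry)
    return entries


def failures_by_request(entries):
    """Group entries with response_code 503 by request_id."""
    by_id = defaultdict(list)
    for entry in entries:
        if entry.get("response_code") == "503":
            by_id[entry["request_id"]].append(entry)
    return dict(by_id)


# Trimmed copies of the entries above (only the fields used here).
sidecar_lines = [
    "[Envoy (Epoch 0)] [2020-01-21 14:53:44.957][20][warning][config] gRPC config stream closed: 13,",
    '{"response_code":"200","response_flags":"-","start_time":"2020-01-21T14:53:45.167Z","request_id":"528e0cfe-19c4-42bf-966a-ee3ae2d18e74"}',
    '{"response_code":"503","response_flags":"NR","start_time":"2020-01-21T14:53:45.548Z","request_id":"a3f7c112-2ae1-4494-be7f-e5248fbbe530"}',
    '{"response_code":"503","response_flags":"NR","start_time":"2020-01-21T14:53:45.577Z","request_id":"a3f7c112-2ae1-4494-be7f-e5248fbbe530"}',
    '{"response_code":"503","response_flags":"NR","start_time":"2020-01-21T14:53:45.592Z","request_id":"a3f7c112-2ae1-4494-be7f-e5248fbbe530"}',
]
ingress_lines = [
    '{"response_code":"503","response_flags":"URX","start_time":"2020-01-21T14:53:45.548Z","request_id":"a3f7c112-2ae1-4494-be7f-e5248fbbe530"}',
]

entries = load(sidecar_lines, "sidecar") + load(ingress_lines, "ingress")
failed = failures_by_request(entries)
for request_id, group in failed.items():
    print(request_id)
    for e in sorted(group, key=lambda e: e["start_time"]):
        print(f'  {e["start_time"]}  {e["source"]:8s} {e["response_flags"]}')
```

For the sample above, the single `URX` entry at the ingress lines up with three `NR` responses from the sidecar for the same `request_id`, all within about 50 ms.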
