Pod Network Latency
Introduction¶
- It injects network latency on the specified container by starting a traffic control (tc) process with netem rules to add egress delays (see the sketch below).
- It can test the application's resilience to a lossy/flaky network.
Scenario: Induce latency in the network of the target pod
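Under the hood, the injected delay corresponds to a netem rule applied with tc inside the target container's network namespace. The commands below are a simplified sketch of such a rule (assuming the default eth0 interface and the default 2000 ms latency), not the exact invocation the experiment constructs:

# add a 2000ms egress delay on eth0 (netem also accepts an optional jitter value)
tc qdisc replace dev eth0 root netem delay 2000ms

# inspect the active qdisc
tc qdisc show dev eth0

# remove the rule to revert the chaos
tc qdisc delete dev eth0 root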
Uses¶
View the uses of the experiment
The experiment causes network degradation without the pod being marked unhealthy/unworthy of traffic by kube-proxy (unless you have a liveness probe of sorts that measures latency and restarts/crashes the container). The idea of this experiment is to simulate issues within your pod network, or in microservice communication across services in different availability zones/regions, and so on.
Mitigation (in this case, keeping the timeout, i.e., access latency, low) could be via some middleware that can switch traffic based on certain SLOs/performance parameters. If such an arrangement is not available, the next best thing would be to verify that such degradation is highlighted via notifications/alerts, so the admin/SRE has the opportunity to investigate and fix things. Another utility of the test is to see the extent of impact on the end user, or on the last point in the app stack, when access to a downstream/dependent microservice degrades, and whether that is acceptable or breaks the system to an unacceptable degree. The experiment provides DESTINATION_IPS and DESTINATION_HOSTS so that you can limit the chaos to specific services within or outside the cluster.
The applications may stall or get corrupted while they wait endlessly for a packet. The experiment limits the impact (blast radius) to only the traffic you want to test by specifying IP addresses or application information. This experiment will help improve the resilience of your services over time.
Prerequisites¶
Verify the prerequisites
- Ensure that Kubernetes Version > 1.16
- Ensure that the Litmus Chaos Operator is running by executing kubectl get pods in the operator namespace (typically, litmus). If not, install from here.
- Ensure that the pod-network-latency experiment resource is available in the cluster by executing kubectl get chaosexperiments in the desired namespace. If not, install from here. (See the example commands below.)
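For reference, the two checks above can be run as follows (assuming the operator runs in the litmus namespace and the experiment CR should be present in the default namespace):

# verify that the chaos operator pod is running
kubectl get pods -n litmus

# verify that the pod-network-latency ChaosExperiment resource is available
kubectl get chaosexperiments -n default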
Default Validations¶
View the default validations
The application pods should be in running state before and after chaos injection.
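For instance, with the sample application used in the examples below (label app=nginx in the default namespace, an assumption taken from those examples), the pod state can be checked before and after the run:

kubectl get pods -n default -l app=nginx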
Minimal RBAC configuration example (optional)¶
NOTE
If you are using this experiment as part of a litmus workflow scheduled, constructed, and executed from chaos-center, then you may be making use of the litmus-admin RBAC, which is pre-installed in the cluster as part of the agent setup.
View the Minimal RBAC permissions
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-network-latency-sa
  namespace: default
  labels:
    name: pod-network-latency-sa
    app.kubernetes.io/part-of: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-network-latency-sa
  namespace: default
  labels:
    name: pod-network-latency-sa
    app.kubernetes.io/part-of: litmus
rules:
  # Create and monitor the experiment & helper pods
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create","delete","get","list","patch","update","deletecollection"]
  # Performs CRUD operations on the events inside chaosengine and chaosresult
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create","get","list","patch","update"]
  # Fetch configmap details and mount them to the experiment pod (if specified)
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get","list"]
  # Track and get the runner, experiment, and helper pods' logs
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get","list","watch"]
  # for creating and managing exec sessions to run commands inside the target container
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get","list","create"]
  # deriving the parent/owner details of the pod (if the parent is any of deployment, statefulset, replicaset, daemonset)
  - apiGroups: ["apps"]
    resources: ["deployments","statefulsets","replicasets","daemonsets"]
    verbs: ["list","get"]
  # deriving the parent/owner details of the pod (if the parent is a deploymentConfig)
  - apiGroups: ["apps.openshift.io"]
    resources: ["deploymentconfigs"]
    verbs: ["list","get"]
  # deriving the parent/owner details of the pod (if the parent is a deploymentConfig)
  - apiGroups: [""]
    resources: ["replicationcontrollers"]
    verbs: ["get","list"]
  # deriving the parent/owner details of the pod (if the parent is argo-rollouts)
  - apiGroups: ["argoproj.io"]
    resources: ["rollouts"]
    verbs: ["list","get"]
  # for configuring and monitoring the experiment job by the chaos-runner pod
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create","list","get","delete","deletecollection"]
  # for creation, status polling and deletion of litmus chaos resources used within a chaos workflow
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines","chaosexperiments","chaosresults"]
    verbs: ["create","list","get","patch","update","delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-network-latency-sa
  namespace: default
  labels:
    name: pod-network-latency-sa
    app.kubernetes.io/part-of: litmus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-network-latency-sa
subjects:
  - kind: ServiceAccount
    name: pod-network-latency-sa
    namespace: default
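If the manifest above is saved to a file, for example rbac.yaml (a hypothetical filename), it can be applied and verified as follows:

kubectl apply -f rbac.yaml
kubectl get serviceaccount,role,rolebinding -n default | grep pod-network-latency-sa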
Experiment tunables¶
Check the experiment tunables
Optional Fields
Variables | Description | Notes |
---|---|---|
NETWORK_INTERFACE | Name of ethernet interface considered for shaping traffic | Default value: eth0 |
TARGET_CONTAINER | Name of container which is subjected to network latency | Applicable for containerd & CRI-O runtime only. Even with these runtimes, if the value is not provided, it injects chaos on the first container of the pod |
NETWORK_LATENCY | The latency/delay in milliseconds | Default 2000, provide numeric value only |
JITTER | The network jitter value in ms | Default 0, provide numeric value only |
CONTAINER_RUNTIME | container runtime interface for the cluster | Defaults to containerd, supported values: docker, containerd and crio for litmus and only docker for pumba LIB |
SOCKET_PATH | Path of the containerd/crio/docker socket file | Defaults to /run/containerd/containerd.sock |
TOTAL_CHAOS_DURATION | The time duration for chaos insertion (seconds) | Default (60s) |
TARGET_PODS | Comma separated list of application pod names subjected to pod network latency chaos | If not provided, it will select target pods randomly based on provided appLabels |
DESTINATION_IPS | IP addresses of the services or pods or the CIDR blocks(range of IPs), the accessibility to which is impacted | comma separated IP(S) or CIDR(S) can be provided. if not provided, it will induce network chaos for all ips/destinations |
DESTINATION_HOSTS | DNS Names/FQDN names of the services, the accessibility to which, is impacted | if not provided, it will induce network chaos for all ips/destinations or DESTINATION_IPS if already defined |
SOURCE_PORTS | ports of the target application, the accessibility to which is impacted | comma separated port(s) can be provided. If not provided, it will induce network chaos for all ports |
DESTINATION_PORTS | ports of the destination services or pods or the CIDR blocks(range of IPs), the accessibility to which is impacted | comma separated port(s) can be provided. If not provided, it will induce network chaos for all ports |
PODS_AFFECTED_PERC | The Percentage of total pods to target | Defaults to 0 (corresponds to 1 replica), provide numeric value only |
LIB | The chaos lib used to inject the chaos | Default value: litmus, supported values: pumba and litmus |
TC_IMAGE | Image used for traffic control in linux | default value is gaiadocker/iproute2 |
LIB_IMAGE | Image used to run the netem command | Defaults to litmuschaos/go-runner:latest |
RAMP_TIME | Period to wait before and after injection of chaos in sec | |
SEQUENCE | It defines sequence of chaos execution for multiple target pods | Default value: parallel. Supported: serial, parallel |
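To cross-check the defaults shipped with the experiment (assuming the ChaosExperiment CR is installed in the default namespace), the env list of the CR can be inspected directly:

# list the default env values defined in the pod-network-latency ChaosExperiment CR
kubectl get chaosexperiment pod-network-latency -n default -o jsonpath='{.spec.definition.env}'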
Experiment Examples¶
Common and Pod specific tunables¶
Refer to the common attributes and pod-specific tunables to tune the common tunables for all experiments and the pod-specific tunables.
Network Latency¶
It defines the network latency (in ms) to be injected into the targeted application. It can be tuned via the NETWORK_LATENCY ENV.
Use the following example to tune this:
# it injects network latency for the egress traffic
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  chaosServiceAccount: pod-network-latency-sa
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            # network latency to be injected
            - name: NETWORK_LATENCY
              value: '2000' # in ms
            - name: TOTAL_CHAOS_DURATION
              value: '60'
Destination IPs And Destination Hosts¶
The network experiments interrupt traffic for all the IPs/hosts by default. The interruption of specific IPs/hosts can be tuned via the DESTINATION_IPS and DESTINATION_HOSTS ENVs.
- DESTINATION_IPS: It contains the IP addresses of the services or pods or the CIDR blocks (range of IPs), the accessibility to which is impacted.
- DESTINATION_HOSTS: It contains the DNS names/FQDN names of the services, the accessibility to which is impacted.
Use the following example to tune this:
# it injects the chaos on the egress traffic for specific IPs/hosts
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  chaosServiceAccount: pod-network-latency-sa
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            # supports comma separated destination ips
            - name: DESTINATION_IPS
              value: '8.8.8.8,192.168.5.6'
            # supports comma separated destination hosts
            - name: DESTINATION_HOSTS
              value: 'nginx.default.svc.cluster.local,google.com'
            - name: TOTAL_CHAOS_DURATION
              value: '60'
Source And Destination Ports¶
The network experiments interrupt traffic for all the source & destination ports by default. The interruption of specific port(s) can be tuned via the SOURCE_PORTS and DESTINATION_PORTS ENVs.
- SOURCE_PORTS: It contains the ports of the target application, the accessibility to which is impacted.
- DESTINATION_PORTS: It contains the ports of the destination services or pods, the accessibility to which is impacted.
Use the following example to tune this:
# it injects the chaos on the ingress and egress traffic for specific ports
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  chaosServiceAccount: pod-network-latency-sa
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            # supports comma separated source ports
            - name: SOURCE_PORTS
              value: '80'
            # supports comma separated destination ports
            - name: DESTINATION_PORTS
              value: '8080,9000'
            - name: TOTAL_CHAOS_DURATION
              value: '60'
Blacklist Source and Destination Ports¶
By default, the network experiments disrupt traffic for all the source and destination ports. Specific ports can be blacklisted via the SOURCE_PORTS and DESTINATION_PORTS ENVs.
- SOURCE_PORTS: Provide the comma-separated source ports, preceded by !, that you'd like to blacklist from the chaos.
- DESTINATION_PORTS: Provide the comma-separated destination ports, preceded by !, that you'd like to blacklist from the chaos.
Use the following example to tune this:
# blacklist the source and destination ports
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  chaosServiceAccount: pod-network-latency-sa
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            # it will blacklist 80 and 8080 source ports
            - name: SOURCE_PORTS
              value: '!80,8080'
            # it will blacklist 8080 and 9000 destination ports
            - name: DESTINATION_PORTS
              value: '!8080,9000'
            - name: TOTAL_CHAOS_DURATION
              value: '60'
Network Interface¶
It defines the name of the ethernet interface considered for shaping traffic. It can be tuned via the NETWORK_INTERFACE ENV. Its default value is eth0.
Use the following example to tune this:
# provide the network interface
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  chaosServiceAccount: pod-network-latency-sa
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            # name of the network interface
            - name: NETWORK_INTERFACE
              value: 'eth0'
            - name: TOTAL_CHAOS_DURATION
              value: '60'
Jitter¶
It defines the jitter (in ms), a parameter that introduces variation in the network delay. It can be tuned via the JITTER ENV. Its default value is 0.
Use the following example to tune this:
# provide the network latency jitter
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  chaosServiceAccount: pod-network-latency-sa
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            # value of the network latency jitter (in ms)
            - name: JITTER
              value: '200'
Container Runtime Socket Path¶
It defines the CONTAINER_RUNTIME and SOCKET_PATH ENVs to set the container runtime and socket file path.
- CONTAINER_RUNTIME: It supports docker, containerd, and crio runtimes. The default value is containerd.
- SOCKET_PATH: It contains the path of the containerd socket file by default (/run/containerd/containerd.sock). For other runtimes, provide the appropriate socket path.
Use the following example to tune this:
## provide the container runtime and socket file path
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  chaosServiceAccount: pod-network-latency-sa
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            # runtime for the container
            # supports docker, containerd, crio
            - name: CONTAINER_RUNTIME
              value: 'containerd'
            # path of the socket file
            - name: SOCKET_PATH
              value: '/run/containerd/containerd.sock'
            - name: TOTAL_CHAOS_DURATION
              value: '60'
Pumba Chaos Library¶
It specifies the Pumba chaos library for the chaos injection. It can be tuned via the LIB ENV. The default chaos library is litmus.
Provide the traffic control image via the TC_IMAGE ENV for the pumba library.
Use the following example to tune this:
# use pumba chaoslib for the network chaos
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  chaosServiceAccount: pod-network-latency-sa
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            # name of the chaoslib
            # supports litmus and pumba lib
            - name: LIB
              value: 'pumba'
            # image used for the traffic control in linux
            # applicable for pumba lib only
            - name: TC_IMAGE
              value: 'gaiadocker/iproute2'
            - name: TOTAL_CHAOS_DURATION
              value: '60'