Ansible AWX Pod Crash Issue in Kubernetes

Mindwatering Incorporated

Author: Tripp W Black

Created: 11/07/2024 at 10:43 PM

Category:
Linux
Kubernetes

Issue:
Jobs that run longer and use more memory crash without logs in AT/AAP.
The job template logs are generally empty.

Job Error Messages:

Tower or AAP UI:
Job terminated due to error
or
Error with pod's stdout: unexpected EOF

The awx-task container logs show the error:
Task was destroyed but it is pending

Cause 1:
The container log size exceeds the kubelet limit.

The default container log size limit (containerLogMaxSize) is 10Mi.

A larger limit can be tested by passing these flags to the kubelet node agent:
kubelet --container-log-max-size=200Mi --container-log-max-files=10
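Passing these options on the command line is deprecated, so they are typically set in the kubelet YAML config file. Per the documentation, overriding the default for one evictionHard parameter means the other parameters do not inherit their defaults, so all of them must be set explicitly as well.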

The drop-in config directory is specified via the kubelet flag:
--config-dir=/etc/kubernetes/kubelet.conf.d

Example from the kubelet-config-file documentation page. The evictionHard thresholds mean:
- memory.available: evict pods when available memory drops below this threshold
- nodefs.available: evict pods when the node filesystem's available space is less than this threshold
- nodefs.inodesFree: evict pods when the node filesystem's free inodes are less than this threshold
- imagefs.available: evict pods when the image filesystem's available space is less than this threshold

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: "192.168.0.8"
port: 20250
serializeImagePulls: false
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
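
The documentation example above does not include the log settings themselves. As a minimal sketch, a drop-in file carrying the two limits from the command-line test might look like the following. The file name 90-container-logs.conf is hypothetical (drop-in files need the .conf suffix), and the values simply mirror the test command; verify the field names against the kubelet version in use.

```yaml
# /etc/kubernetes/kubelet.conf.d/90-container-logs.conf  (hypothetical file name)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Maximum size of a container log file before it is rotated (default: 10Mi)
containerLogMaxSize: "200Mi"
# Maximum number of log files kept per container (default: 5)
containerLogMaxFiles: 10
```

The kubelet has to be restarted on each node before the new limits take effect.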

Additional environment variable, so that Receptor can reconnect to the job pod's log stream if it is interrupted:
$ vi awx.yml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  service_type: nodeport
  ingress_type: none
  hostname: awx.mindwatering.net
  . . .
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled

<esc>:wq (to save)

$ kubectl apply -f awx.yml
<awx created message>

Note:
By default, Receptor releases the work unit when the run completes, which deletes the job pod. Disable the release to keep the pod around and see why it died, e.g. OOMKilled.
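
A minimal sketch of disabling the release through the AWX custom resource. It assumes the operator's extra_settings field and the AWX setting RECEPTOR_RELEASE_WORK are available in the installed versions; check both before relying on it.

```yaml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  # Assumption: RECEPTOR_RELEASE_WORK is the AWX setting that controls the release.
  # "False" keeps the finished job pod so its status (e.g. OOMKilled) can be inspected.
  extra_settings:
    - setting: RECEPTOR_RELEASE_WORK
      value: "False"
```

Revert the setting once troubleshooting is done, since kept pods are not cleaned up and have to be deleted by hand.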

Cause 2:
Memory was exhausted.

AWX --> Instance Groups --> Customize pod specification:
. . .
resources:
  requests:
    cpu: 2
    memory: "20G"
. . .
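
For context, a sketch of where the resources block sits in a container group pod specification. The namespace, service account, image, and args are assumptions modeled on a typical AWX default and are not values from this environment; only the request values come from the snippet above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: awx                            # assumption: namespace of the AWX deployment
spec:
  serviceAccountName: default               # assumption
  automountServiceAccountToken: false
  containers:
    - name: worker
      image: quay.io/ansible/awx-ee:latest  # assumption: execution environment image
      args:
        - ansible-runner
        - worker
        - '--private-data-dir=/runner'
      resources:
        requests:
          cpu: 2
          memory: "20G"                     # request sized for the memory-heavy jobs
```

The node must have at least the requested memory free, otherwise the job pod stays in Pending instead of starting.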
