Issue:
Jobs that run long or use a lot of memory crash without logs in Ansible Tower (AT) / AAP.
The job template logs are generally empty.
Job Error Messages:
Tower or AAP UI:
Job terminated due to error
or
Error with pod's stdout: unexpected EOF
The awx-task container logs show the error:
Task was destroyed but it is pending
Cause 1:
The job's log output exceeds the container log size limit.
The default log limit is 10Mi.
To test, pass larger limits to the kubelet node agent:
kubelet --container-log-max-size=200Mi --container-log-max-files=10
Passing these options on the command line is deprecated, so they are typically set in the kubelet YAML config file. Per the documentation, overriding the default for one eviction threshold means the other thresholds do not inherit their defaults, so set them as well.
The config file directory is specified via:
--config-dir=/etc/kubernetes/kubelet.conf.d
Example from the kubelet-config-file documentation page:
- memory.available: evict pods when available memory drops below this threshold
- nodefs.available: evict pods when available filesystem space drops below this threshold
- nodefs.inodesFree: evict pods when free filesystem inodes drop below this threshold
- imagefs.available: evict pods when available image filesystem space drops below this threshold
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: "192.168.0.8"
port: 20250
serializeImagePulls: false
evictionHard:
    memory.available: "100Mi"
    nodefs.available: "10%"
    nodefs.inodesFree: "5%"
    imagefs.available: "15%"
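For the log limit in Cause 1, the test flags shown earlier correspond to the KubeletConfiguration fields containerLogMaxSize and containerLogMaxFiles. A minimal sketch, assuming a drop-in file in the config directory and a kubelet version that supports drop-ins (the file name 90-container-logs.conf is hypothetical, and the values simply mirror the test flags):

```yaml
# Hypothetical drop-in file: /etc/kubernetes/kubelet.conf.d/90-container-logs.conf
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Maximum size of a container log file before it is rotated (default 10Mi)
containerLogMaxSize: "200Mi"
# Maximum number of log files kept per container
containerLogMaxFiles: 10
```

The kubelet typically needs a restart to pick up the change.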
Additional environment variable:
$ vi awx.yml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  service_type: nodeport
  ingress_type: none
  hostname: awx.mindwatering.net
  . . .
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
<esc>:wq (to save)
$ kubectl apply -f awx.yml
<awx created message>
Note:
The receptor release deletes the pod when the run completes. Disable it to keep the pod around and see why it died, e.g. OOMKilled.
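One possible way to disable the release, sketched here under two assumptions that are not stated in these notes: that the AWX setting is named RECEPTOR_RELEASE_WORK, and that the operator's extra_settings list is used to set it. Verify both against your AWX and operator versions.

```yaml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  # Assumption: RECEPTOR_RELEASE_WORK controls whether receptor releases
  # (and so removes) the job pod after the run finishes.
  extra_settings:
    - setting: RECEPTOR_RELEASE_WORK
      value: "False"
```

Re-enable the release after debugging so finished pods do not pile up.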
Cause 2:
Memory was exhausted.
AWX --> Instance Groups --> Customize pod specification:
. . .
      resources:
        requests:
          cpu: 2
          memory: "20G"
. . .
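For orientation, a sketch of where the resources block sits in a container group pod specification. The surrounding fields (namespace, image, container name, args) are assumed from a typical AWX pod spec and are not taken from these notes; the limits entry is an optional, assumed addition.

```yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: awx                  # assumed namespace
spec:
  containers:
    - name: worker                # assumed container name
      image: quay.io/ansible/awx-ee:latest   # assumed execution environment image
      args:
        - ansible-runner
        - worker
        - '--private-data-dir=/runner'
      resources:
        requests:
          cpu: 2
          memory: "20G"
        limits:
          memory: "20G"           # assumed; the container is OOMKilled above this
```

The node must have at least the requested memory free, or the pod will stay unscheduled.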