AAP HA Patching VM Reboot Notes for Limited or No AAP Disruption

Mindwatering Incorporated

Author: Tripp W Black

Created: 10/01 at 12:03 PM

 

Category:
Linux
RH AAP

Task: Limit AAP HA reboot disruption from post-OS patching.

Environment:
We have the following nodes/VMs:
- 3 AAP Controllers (aap.mindwatering.net VIP with aap1, aap2, and aap3)
- 2 Gateways (aapgw.mindwatering.net with aapgw1, aapgw2)
- 2 Hubs (aaphub1, aaphub2)
- 2 EDAs (aapeda1, aapeda2)
- 4 Execution nodes, 2 per region, East and West (aapee1, aapee2, and aapew1, aapew2)
- 2 PostgreSQL (aapdb1, aapdb2) in an active/replication setup

Note:
A reboot of the primary PostgreSQL node will require a complete shutdown. The replica node can be rebooted on its own, but confirm afterwards that data replication has resumed (a quick check is sketched below).
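
To confirm replication, a minimal sketch, assuming a standard PostgreSQL streaming-replication pair and shell access as the postgres OS user (adjust host names and authentication to your environment):

On the primary (aapdb1), confirm the replica is connected and streaming:
$ ssh myadminid@aapdb1.mindwatering.net
$ sudo -u postgres psql -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"
<expect one row for aapdb2 with state: streaming>
$ exit

On the replica (aapdb2), confirm it is still in recovery and receiving WAL:
$ ssh myadminid@aapdb2.mindwatering.net
$ sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
<expect t>
$ sudo -u postgres psql -c "SELECT status, last_msg_receipt_time FROM pg_stat_wal_receiver;"
<expect status: streaming, with a recent timestamp>
$ exit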


---
Steps for reboot of AAP Controller VMs:
Notes:
- AAP Controllers are configured in an HA configuration; they can be rebooted one at a time with verification in-between. Start with Controller 3, then 2, and finally 1.
- The automation controller documentation says to run awx-manage commands as the service account user, which is awx for us.
- awx-manage is missing the old Tower update_inventory and list_jobs commands, and we can no longer disable a node via awx-manage, so we'll use the GUI (a REST API alternative is sketched after these notes).
- After a node is disabled from running jobs, any jobs already running on it are not cancelled. Wait for the node to "drain".
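
If you would rather script the disable/enable than click through the GUI, the controller REST API exposes an enabled flag on each instance. A minimal sketch, assuming an admin OAuth token in $TOKEN and an illustrative instance id of 3 (look up the real id first):
$ curl -ks -H "Authorization: Bearer $TOKEN" "https://aap.mindwatering.net/api/v2/instances/?hostname=aap3.mindwatering.net"
<note the id field of the matching instance, e.g. 3>
$ curl -ks -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -d '{"enabled": false}' "https://aap.mindwatering.net/api/v2/instances/3/"
<enabled should now show false; PATCH it back to true after the reboot>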

1. Via browser, disable the controller node from running jobs:
a. Browser --> aap.mindwatering.net
<login>

b. If this controller can run jobs in the "default" instance group, disable default for the current controller:
Administration (left menu) --> Instance Groups (view) --> open default (name column) --> Instances (tab) --> on the row of the controller node (e.g. aap3.mindwatering.net), under Actions, toggle Enabled to Disabled

c. Disable the control plane for this controller:
Administration (left menu) --> Instance Groups (view) --> open controlplane (name column) --> Instances (tab) --> on the row of the controller node (e.g. aap3.mindwatering.net), under Actions, toggle Enabled to Disabled

d. Verify the controller nodes/instances are healthy and the current controller is disabled:
$ ssh myadminid@aap3.mindwatering.net
$ sudo su -
# su - awx
$ awx-manage list_instances
<shows control/hybrid mode, capacity, heartbeat timestamps, and if nodes are enabled:true or enabled:false>
$ exit
<back to root session>

2. Wait for Node to Drain:
In the browser, return to the controlplane instance group --> Instances (tab) --> open the controller node, and verify that the Running Jobs field shows 0 (a command-line alternative is sketched below).
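
Alternatively, the same figure can be read from the controller API. A minimal sketch, assuming an admin OAuth token in $TOKEN and jq installed:
$ curl -ks -H "Authorization: Bearer $TOKEN" "https://aap.mindwatering.net/api/v2/instances/?hostname=aap3.mindwatering.net" | jq '.results[0].jobs_running'
<repeat until it returns 0>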

3. Back in the terminal, now that no jobs are left, shut down the controller node services:
# systemctl stop automation-controller
<wait>

Confirm it is stopped with:
# systemctl status automation-controller
<confirm inactive/not running>

Note:
For non-clustered controllers, use automation-controller-service stop instead (a minimal example follows).
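
A minimal sketch of that utility on a standalone controller; it stops and starts the controller's bundled services together (exact service list depends on the install):
# automation-controller-service stop
<wait>
# automation-controller-service start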

4. Still in the terminal, reboot:
# reboot

5. Wait for the controller to come back up. After it boots, verify with:
a. In the web browser, open the GUI to confirm it is running, e.g. directly against the rebooted node:
Browser --> aap3.mindwatering.net
<login>

b. Back in the terminal, confirm the services are running okay (an API ping check is sketched after this step):
$ ssh myadminid@aap3.mindwatering.net
$ sudo systemctl status automation-controller
<view output, should have Active:running status>
$ sudo systemctl status redis
<view output, should be Active, as well>
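
As an additional check, the controller exposes a ping endpoint that reports the cluster's instances and their heartbeats. A minimal sketch (the endpoint is unauthenticated on a default install; jq just pretty-prints):
$ curl -ks https://aap3.mindwatering.net/api/v2/ping/ | jq .
<confirm the rebooted node appears under instances with a recent heartbeat>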

6. Re-enable the controller node:
a. Browser --> aap.mindwatering.net
<login>

b. If this controller can run jobs in the "default" instance group, re-enable default for the current controller:
Administration (left menu) --> Instance Groups (view) --> open default (name column) --> Instances (tab) --> on the row of the controller node (e.g. aap3.mindwatering.net), under Actions, toggle Disabled to Enabled

c. Re-enable the control plane for this controller:
Administration (left menu) --> Instance Groups (view) --> open controlplane (name column) --> Instances (tab) --> on the row of the controller node (e.g. aap3.mindwatering.net), under Actions, toggle Disabled to Enabled

7. Verify the node is available for jobs:
$ ssh myadminid@aap3.mindwatering.net
$ sudo su -
# su - awx
$ awx-manage list_instances
<shows control/hybrid mode, capacity, heartbeat timestamps, confirm no errors, and that nodes are enabled:true>

If the node does not show enabled, restart the controller services and check again:
$ exit
<back to root session>
# systemctl restart automation-controller
<wait>
# systemctl status automation-controller
<confirm Active:running>
# su - awx
$ awx-manage list_instances
<shows control/hybrid mode, capacity, heartbeat timestamps, confirm no errors, and that nodes are enabled:true>
$ exit
<back to root session>
# exit
$ exit

8. When the node is healthy and accepting jobs, proceed with the reboot of the next controller.


---
Steps for reboot of AAP Gateway VMs:
Notes:
- AAP Gateways are configured in an HA configuration; they can be rebooted one at a time with verification in-between. Start with Gateway 2, and then 1, although they can be done in either order.
- If the VIP is round-robin and does not health-check its members, update the VIP redirection before each node is rebooted and restore it afterwards (a simple response check is sketched after these notes).
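
To confirm a gateway node is answering before pulling it from, or returning it to, the VIP, a simple HTTPS response check is enough. A minimal sketch (-k skips certificate verification; drop it if the CA is trusted):
$ curl -ks -o /dev/null -w '%{http_code}\n' https://aapgw2.mindwatering.net/
<expect 200, or a 30x redirect to the login page>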

1. Verify the status of the gateway node:
$ ssh myadminid@aapgw2.mindwatering.net
$ sudo su -
# systemctl status automation-gateway-service
<confirm Active:running>
# systemctl status supervisord
<confirm active, as well>

2. Shut down the Gateway services (optional):
# systemctl stop automation-gateway-service
<wait>
# systemctl stop supervisord
<wait>

3. Reboot:
# reboot
<wait>

4. Verify the services' status again:
$ ssh myadminid@aapgw2.mindwatering.net
$ sudo su -
# systemctl status automation-gateway-service supervisord
<view status of each of the services>
# exit
$ exit

5. Repeat with the other gateway VM (e.g. aapgw1).


---
Steps for reboot of AAP Hub VMs:
Notes:
- AAP Hubs are configured in an HA configuration; they can be rebooted one at a time with verification in-between. Start with Hub 2, and then 1, although they can be done in either order.
- By default, pulpcore sends logs to /var/log/messages. It can be customized by creating /etc/rsyslog.d/pulp.conf per RH technote 7004301. In our case, we include both the /var/log/messages and /var/log/pulp locations.
- For step 2, the number of pulpcore-worker units can vary per install (a quick way to list them is sketched after these notes).
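
To see how many pulpcore-worker units actually exist on a hub before stopping them, list them with systemctl. A minimal sketch:
$ ssh myadminid@aaphub2.mindwatering.net
$ sudo systemctl list-units 'pulpcore-worker@*' --all
<note the instance numbers, e.g. pulpcore-worker@1 and pulpcore-worker@2, and adjust the stop/status commands in steps 2 and 4 to match>
$ exit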

1. Verify in the logs that the hub is not currently serving ansible-galaxy or ansible-builder pulls.
$ ssh myadminid@aaphub2.mindwatering.net
$ sudo view /var/log/messages
<check log for recent errors or activity>

or, if custom logging is configured:
$ sudo tail -n 100 /var/log/pulp/pulpcore-api.log
<view output>
$ sudo tail -n 100 /var/log/pulp/pulpcore-content.log
<view output>
$ sudo tail -n 100 /var/log/pulp/pulpcore-worker.log
<view output>

2. Stop services:
$ sudo systemctl stop pulpcore.service pulpcore-api.service pulpcore-content.service pulpcore-worker@1.service pulpcore-worker@2.service nginx.service redis.service

3. Reboot:
$ sudo reboot
<wait>

4. Verify the services are running okay:
$ ssh myadminid@aaphub2.mindwatering.net
$ sudo systemctl status pulpcore.service pulpcore-api.service pulpcore-content.service pulpcore-worker@1.service pulpcore-worker@2.service nginx.service redis.service
<view status of each of the services>
$ exit

5. In the web GUI, verify you can log in and that collections, repositories, and execution environments are shown (an optional API check is sketched after this step).
a. Browser --> aaphub2.mindwatering.net
<login>

b. Collections (left menu)
<view collections>

c. Collections --> Repositories
<view repositories and any sync status>

d. Execution Environments --> Execution Environments
<view execution environment images>
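
If you prefer a command-line check, Pulp 3 exposes a status API that reports its online workers and content apps. A minimal sketch, assuming the default /pulp/api/v3/ prefix (the prefix can vary by hub version/configuration, so adjust the URL to your install):
$ curl -ks https://aaphub2.mindwatering.net/pulp/api/v3/status/ | jq '.online_workers, .online_content_apps'
<confirm the workers and content apps are listed with recent heartbeats>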

6. Repeat with the other hub VM (e.g. aaphub1).


---
Steps for reboot of AAP EDA VMs:
Notes:
- AAP EDA nodes (in 2.5) are configured in an HA configuration; in version 2.4, there could be only one. They can be rebooted one at a time with verification in-between. Start with EDA 2, and then 1, although they can be done in either order.
- EDA nodes run Ansible rulebooks, which typically run as persistent processes. You'll need to disable them before the reboot. Note which rulebook activations you disable so you can re-enable them afterwards (a quick way to record them is sketched after these notes).
- EDA nodes typically run ansible-eda.service, redis.service, and nginx.service. We run PostgreSQL externally.
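
To record what is running before disabling anything, a simple process capture is enough. A minimal sketch (the output file name is just an example):
$ ssh myadminid@aapeda2.mindwatering.net
$ ps aux | grep [a]nsible-rulebook | sudo tee /root/eda-rulebooks-before-reboot.txt
<one line per running rulebook process; compare against this list when re-enabling in step 6>
$ exit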

1. Verify the status of the EDA node:
$ ssh myadminid@aapeda2.mindwatering.net
$ sudo su -
# systemctl list-units --type=service | grep eda
<confirm services Active:running>
# ps aux | grep ansible-rulebook
<confirm which persistent rulebook processes are running, if any>

2. Disable running rulebooks:
Browser --> aap.mindwatering.net
<login>
Views (menu) --> Ansible Activations (menu option) --> select rulebook --> Slider to Disabled

3. Shut down the EDA services (optional):
# systemctl stop ansible-eda.service
<wait>
# systemctl stop redis
<wait>
# systemctl stop nginx
<wait>

4. Reboot:
# reboot
<wait>

5. Verify the services' status again:
$ ssh myadminid@aapeda2.mindwatering.net
$ sudo su -
# systemctl status ansible-eda.service redis nginx
<view status of each of the services>


6. Re-enable the disabled rulebooks:
Browser --> aap.mindwatering.net
<login>
Views (menu) --> Ansible Activations (menu option) --> select rulebook --> Slider to Enabled

7. Check logs for issues:
# journalctl -u ansible-eda.service
<check recent entries for errors>
# tail -n 100 /var/log/messages
<check for errors/issues>
# exit
$ exit

8. Repeat with the other EDA VM.


---
Steps for reboot of AAP Execution Nodes/VMs:
Notes:
- AAP Execution nodes are configured in an HA configuration; they can be rebooted one at a time with verification in-between. Start with execution node 2, and then 1, in each set, although they can be done in either order. Just don't take both nodes of a region down at the same time.
- In this scenario, we have 2 in the East region, and 2 in the West region.

1. Via browser, disable the execution node from running jobs:
a. Browser --> aap.mindwatering.net
<login>

b. Locate the controller's execution instance groups. In our case, they are "east" and "west". Disable one execution node in each:
Administration (left menu) --> Instance Groups (view) --> open east (name column) --> Instances (tab) --> on the row of the execution node (e.g. aapee2.mindwatering.net), under Actions, toggle Enabled to Disabled

Administration (left menu) --> Instance Groups (view) --> open west (name column) --> Instances (tab) --> on the row of the execution node (e.g. aapew2.mindwatering.net), under Actions, toggle Enabled to Disabled

2. Wait for jobs to finish.
- Monitor jobs either through the Controller GUI (the Running Jobs count on each instance), or from a controller terminal by running the awx-manage list_instances command and reviewing the output for the "east" and "west" nodes being rebooted (a command-line check is also sketched below).
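
The drain can also be watched the same way as the controller drain check above. A minimal sketch, assuming an admin OAuth token in $TOKEN and jq installed:
$ curl -ks -H "Authorization: Bearer $TOKEN" "https://aap.mindwatering.net/api/v2/instances/?hostname=aapee2.mindwatering.net" | jq '.results[0].jobs_running'
<repeat for aapew2.mindwatering.net; proceed when both return 0>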

3. Reboot:
$ ssh myadminid@aapee2.mindwatering.net
$ sudo su -
# reboot
<wait>

$ ssh myadminid@aapew2.mindwatering.net
$ sudo su -
# reboot
<wait>

4. Verify the execution nodes/instances are healthy (an optional receptor check is sketched after this step):
$ ssh myadminid@aapee2.mindwatering.net
$ sudo systemctl status receptor
<confirm Active:running status>
$ exit

$ ssh myadminid@aapew2.mindwatering.net
$ sudo systemctl status receptor
<confirm Active:running status>
$ exit
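
If the receptor CLI is installed on the execution nodes, its status output gives a deeper check than the service state alone. A minimal sketch, assuming the usual AAP receptor socket path (check /etc/receptor/receptor.conf if it differs):
$ ssh myadminid@aapee2.mindwatering.net
$ sudo receptorctl --socket /run/receptor/receptor.sock status
<confirm the node lists its known connections back to the control plane>
$ exit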

5. Re-enable the execution nodes/instances:
a. Browser --> aap.mindwatering.net
<login>

b. Locate the controller's execution instance groups. In our case, they are "east" and "west". Re-enable the execution node in each:
Administration (left menu) --> Instance Groups (view) --> open east (name column) --> Instances (tab) --> on the row of the execution node (e.g. aapee2.mindwatering.net), under Actions, toggle Disabled to Enabled

Administration (left menu) --> Instance Groups (view) --> open west (name column) --> Instances (tab) --> on the row of the execution node (e.g. aapew2.mindwatering.net), under Actions, toggle Disabled to Enabled

6. Confirm the execution nodes are available:
$ ssh myadminid@aap.mindwatering.net
$ sudo su -
# su - awx
$ awx-manage list_instances
<shows control/hybrid mode, capacity, heartbeat timestamps, and if these execution nodes are enabled:true or enabled:false>
$ exit
<back to root session>
# exit
$ exit

7. Repeat with the other execution nodes (e.g. aapee1 and aapew1).







