Manage a High Availability Deployment

This topic describes how to manage and troubleshoot Controllers as a high availability pair.

Set Up Monitoring for the HA Pair

You can set up monitoring for your HA pair by installing another Controller to act as the monitoring Controller. This provides the same administrative functionality as the HA toolkit used in 4.3 and earlier versions.

If you do not already have an HA pair, set one up.
Install the monitoring Controller on the Enterprise Console host in a new platform by selecting Custom Install:
1. Create a platform (for example, Controller Monitor Platform).
  
  This platform should not be used for installing any other services.
2. Install a Controller.
3. Make sure to unselect the Install Events Service option before clicking Install.
Complete the monitoring setup by installing and configuring the App Agents and Machine Agents on your HA pair:
- Set Up App Agents for Monitoring
- Install and Set Up Machine Agents for Monitoring

Set Up App Agents for Monitoring

You can set up App Agents, which are automatically installed on the Controller hosts by the Enterprise Console, on both Controllers of an HA pair to report to the monitoring Controller. This can be done by updating the JVM options of your HA pair platform. To set up your App Agents using the Enterprise Console, perform the following steps:

SSH into the primary Controller box and update the primary Controller App Agent's controller-info.xml by running the following commands:
```
cd <controller-install-dir>/appserver/glassfish/domains/domain1/appagent
cp conf/controller-info.xml ver<version#>/conf/
```
Repeat step 1 for the secondary Controller.
In the Enterprise Console UI, select your HA pair platform, and navigate to the JVM Options section by clicking Configurations, Controller Settings, and Appserver Configurations.
Make the following updates to JVM Options:
1. Update the appdynamics.controller.hostName to the monitoring Controller's IP.
2. Add the following required jvm-options for monitoring:
```
-Dappdynamics.agent.applicationName=<app_name>, -Dappdynamics.agent.tierName=<tier_name>, 
-Dappdynamics.agent.nodeName=<node_name>, -Dappdynamics.agent.accountName=<account_name>, 
-Dappdynamics.agent.accountAccessKey=<access_key>
```
  You can get your access key from the Controller UI: navigate to Settings, License, and Account. Then click to show your access key. Note, when you log in to the Controller, use the account specified in appdynamics.agent.accountName.
Scroll down the page and click Save. The job will apply these properties and restart both the primary and secondary Controllers.
In the Enterprise Console UI, select your Controller Monitor Platform, and navigate to the Controller page.
Click on External URL on the widget to open the monitoring Controller's UI.
Log in to the Controller. You should be able to see the monitoring application for both the primary and secondary Controllers.

Install and Set Up Machine Agents for Monitoring

You must install Machine Agents on both Controllers of an HA pair to report to the monitoring Controller. These agents are Java programs that collect hardware metrics. To install and set up your machine agents, perform the following steps:

Install the Machine Agent on the primary Controller box. Do not start the agent.
Repeat step 1 for the secondary Controller.
Configure the Machine Agent properties for both Machine Agents by editing the controller-info.xml file located in the <machine_agent_home>/conf directory.
1. Update the <controller-host> to the monitoring Controller's IP.
2. Model the rest of your controller-info.xml file after the Example Configuration.
Start both Machine Agents.
In the Enterprise Console UI, select your Controller Monitor Platform, and navigate to the Controller page.
Click on External URL on the widget to open the monitoring Controller's UI.
Log in to the Controller. You should be able to see the monitoring application for both the primary and secondary Controllers.

Bouncing the Primary Controller Without Triggering Failover

The Enterprise Console does not allow you to stop and start the primary Controller without initiating failover. To work around this, perform the following steps:

Log in to the Enterprise Console and navigate to the Appserver Configurations page by clicking through Configurations, followed by Controller Settings.
Deselect Enable Auto Failover and click Save.
SSH to the Controller machine where the Controller is installed.
Run the following commands on the Enterprise Console host:
```
bin/platform-admin.sh stop-controller-appserver
bin/platform-admin.sh start-controller-appserver
```
This will bounce the primary Controller in HA mode.
Re-enable auto failover on the Enterprise Console Appserver Configurations page.
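
The steps above can also be sketched as a script. This is a minimal sketch, not an official tool. It assumes it is run from the Enterprise Console installation directory on the Enterprise Console host, that the CLI `update-configs` job with `enableAutoFailover` (shown under Automatic Failover) is an acceptable substitute for the UI check box, and that each `platform-admin.sh` command returns only after its job has finished. The names `DRY_RUN`, `run`, `set_auto_failover`, and `bounce_primary` are hypothetical and exist only in this example.

```shell
#!/usr/bin/env bash
# Sketch: bounce the primary Controller without triggering failover.
# Assumption: run from the Enterprise Console install directory.
# DRY_RUN is a hypothetical switch: 1 = print commands, 0 = execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = platform name, $2 = true|false
set_auto_failover() {
  run bin/platform-admin.sh submit-job --service=controller --job update-configs \
    --platform-name "$1" --args "enableAutoFailover=$2"
}

# $1 = platform name
bounce_primary() {
  set_auto_failover "$1" false || return 1
  run bin/platform-admin.sh stop-controller-appserver || return 1
  run bin/platform-admin.sh start-controller-appserver || return 1
  set_auto_failover "$1" true
}
```

Calling `bounce_primary <name_of_the_platform>` with the default `DRY_RUN=1` only prints the four commands in order, which is a safe way to review them first. Note that if the stop or start step fails, the function returns early and auto failover stays disabled until you re-enable it yourself.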

Starting and Stopping the Controller

The Enterprise Console does not allow you to shut down the primary Controller. However, you can restart the secondary Controller via the start and stop Controller commands.

To start or stop the Controller manually, use the following commands:

To start:

bin/platform-admin.sh start-controller-appserver --with-db

To stop:

bin/platform-admin.sh stop-controller-appserver --with-db

Automatic Failover

The Enterprise Console monitors the health of the primary Appserver and database. If the Appserver or database is unresponsive, the Enterprise Console will by default wait for five minutes before initiating a failover. This interval can be configured by updating the default value in the Domain Protocol text field on the Appserver Configurations page under Controller settings.

You can also disable or enable automatic failover through the CLI.

Enterprise Console versions 4.5.14 and later include the High Availability (HA) module, which uses the Controller Watchdog for auto-failover. The watchdog script must be running for auto-failover to be enabled, and stopped for auto-failover to be disabled.

To stop and start the Controller Watchdog from the CLI, use the following commands:

Stop the Controller Watchdog:

./platform-admin.sh submit-job --job stop-controller-watchdog --service controller

Start the Controller Watchdog:

./platform-admin.sh submit-job --job start-controller-watchdog --service controller

To disable automatic failover, run the following command on the Enterprise Console host:

bin/platform-admin.sh submit-job --service=controller --job update-configs --platform-name <name_of_the_platform> --args "enableAutoFailover=false"

To enable automatic failover, run the following command on the Enterprise Console host:

bin/platform-admin.sh submit-job --service=controller --job update-configs --platform-name <name_of_the_platform> --args "enableAutoFailover=true"

Performing a Manual Failover and Failback

To fail over from the primary to the secondary manually, click the HA Failover option on the Controller page of the Enterprise Console or run the following command on the Enterprise Console host:

bin/platform-admin.sh submit-job --service controller --job ha-failover --platform-name <name_of_the_platform>

This makes the Appserver on the secondary the new primary and the database on the secondary the replication master. It also makes the old primary the new secondary.

The process for performing a failback to the old primary is the same as failing over to the secondary. Simply run the following command on the Enterprise Console host:

bin/platform-admin.sh submit-job --service controller --job ha-failover --platform-name <name_of_the_platform>

Note that if the old primary has been down for more than seven days, you need to revive its database, as described in Re-enable Controller Database Replication.

Initiate Controller Database Incremental Replication

Re-enable Broken Replication

Incremental replication (replication via rsync while the primary database is up) is required when database replication on the secondary Controller is lagging behind the primary Controller by more than three days. This type of replication allows the primary Controller to keep operating while the disk contents are copied to the secondary node.

To initiate incremental replication:

Run the following command on the Enterprise Console host:
```
bin/platform-admin.sh submit-job --service controller --job incremental-replication
```
This launches a continuously running background job.
Make sure replication occurs four or more times by running either one of the following commands:
1. ```
cd <controller_home>/controller-ha
./ha_replicate.sh -r status
```
2. ```
cd <controller_home>/controller-ha/tmp
cat replication.status
```
If replication fails, go to the secondary host and stop all rsync and ha-replicate.sh processes. Then try running the incremental-replication job again.
Finalize the job by running the following command on the Enterprise Console host:
```
bin/platform-admin.sh submit-job --service controller --job finalize-replication
```
This stops the incremental replication loop. The command will restart the primary Controller, resulting in downtime.
Make sure replication is working by checking that there is no significant gap between the primary and secondary Controllers. You can run the following command on the Enterprise Console host to check the replication status:
```
bin/platform-admin.sh show-service-status --platform-name <platform_name> --service controller
```
It may take a few minutes for the secondary status to catch up.

Add a Secondary Controller Using Incremental Replication

You can convert a single Controller with a large amount of data to an HA pair by using incremental replication. This way, you can rsync most of the Controller's data while the Controller is still running, limiting the downtime of adding a secondary Controller.

To add a secondary Controller using incremental replication:

Start the incremental replication, giving host and rsync parameters:

bin/platform-admin.sh submit-job --service controller --job incremental-replication --args controllerSecondaryHost=1.1.1.1 rsyncThrottle=40000 rsyncCompress=true

This launches a continuously running background job.

Make sure replication occurs four or more times, by checking mysqlDir/incremental_sync.status on the primary database host.
Sample rsync status file output:
```
rsync started at Mon Mar  5 11:49:56 PST 2018
rsync completed at Mon Mar  5 11:50:56 PST 2018
rsync started at Mon Mar  5 11:51:01 PST 2018
rsync completed at Mon Mar  5 11:51:11 PST 2018
```
If replication fails, go to the secondary host and stop all rsync and ha-replicate.sh processes. Then try running the incremental-replication job again.
Run the add secondary job. The Enterprise Console will perform a final rsync and add the secondary.
```
bin/platform-admin.sh submit-job --service controller --job add-secondary --args controllerSecondaryHost=secondary mysqlRootPassword='password'
```
The command will restart the primary Controller, resulting in downtime.

Until you trigger the add-secondary command, the secondary Controller is not added to the Enterprise Console platform. Therefore, the Enterprise Console will not be able to perform any other operations on the secondary Controller.

If you need to stop replication, you can run the following command:

bin/platform-admin.sh submit-job --service controller --job stop-incremental-replication
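
The "four or more times" check in the steps above can be automated by counting completed passes in the rsync status file. The following is a minimal sketch that assumes the file contains lines in the same format as the sample output shown earlier in this section. The function names `count_rsync_passes` and `enough_rsync_passes` are hypothetical and are not part of the Enterprise Console.

```shell
#!/usr/bin/env bash
# Sketch: count completed rsync passes in an incremental_sync.status file.
# Assumption: each completed pass logs a line starting with "rsync completed at",
# as in the sample status output.

# $1 = path to the status file; prints the number of completed passes.
count_rsync_passes() {
  grep -c '^rsync completed at' "$1"
}

# $1 = path to the status file, $2 = minimum passes (defaults to 4).
# Returns 0 when enough passes have completed, non-zero otherwise.
enough_rsync_passes() {
  local file="$1" min="${2:-4}" n
  # grep exits non-zero on zero matches or a missing file; treat both as 0 passes.
  n="$(count_rsync_passes "$file" 2>/dev/null)" || n=0
  [ "$n" -ge "$min" ]
}
```

For example, `enough_rsync_passes mysqlDir/incremental_sync.status && echo "ready for add-secondary"` prints the message only once at least four passes are recorded. For the four-line sample output above, which records two completed passes, the check fails.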

Set Replication Factors for Rsync Threads

Using the Enterprise Console UI or the CLI, you can set the number of parallel rsync threads as a job parameter when you perform incremental or finalize replication.

From the Enterprise Console UI:
1. Log in to the Enterprise Console and access the Controller page.
2. From the More menu, based on which replication you are performing, select either Incremental Replication or Finalize Replication.
3. Enter a number in the Number of parallel rsync threads field and select Submit. The default value is 1.

From the CLI, based on which replication you are performing, run either of the following commands from the Enterprise Console host and set the numberThreadForRsync argument.

bin/platform-admin.sh submit-job --job incremental-replication --args numberThreadForRsync=<number>
bin/platform-admin.sh submit-job --job finalize-replication --args numberThreadForRsync=<number>
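
If you script these jobs, it can help to validate the thread count before submitting. The sketch below only prints the command to run; it does not execute anything. `replication_cmd` is a hypothetical helper name, and the default of 1 mirrors the default stated for the UI field.

```shell
#!/usr/bin/env bash
# Sketch: print the replication command with a validated rsync thread count.
# replication_cmd is a hypothetical helper; it prints the command and runs nothing.

# $1 = incremental|finalize, $2 = number of parallel rsync threads (default 1)
replication_cmd() {
  local kind="$1" threads="${2:-1}"
  case "$kind" in
    incremental|finalize) ;;
    *) echo "unknown replication type: $kind" >&2; return 1 ;;
  esac
  # Reject empty or non-numeric values, then reject zero.
  case "$threads" in
    ''|*[!0-9]*) echo "thread count must be a positive integer" >&2; return 1 ;;
  esac
  [ "$threads" -ge 1 ] || { echo "thread count must be a positive integer" >&2; return 1; }
  echo "bin/platform-admin.sh submit-job --job ${kind}-replication --args numberThreadForRsync=${threads}"
}
```

Running `replication_cmd incremental 4` prints the incremental-replication command with `numberThreadForRsync=4`, which you can review and then run on the Enterprise Console host.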

Enable MySQL 5.7 Parallel Replication

Using the Enterprise Console UI or the CLI, you can enable MySQL 5.7 parallel replication when you perform finalize replication.

From the Enterprise Console UI:
1. Log in to the Enterprise Console and access the Controller page.
2. From the More menu, select Finalize Replication.
3. Select the Database parallel replication check box to enable parallel replication with the MySQL 5.7 database.
4. Select Submit.
From the CLI, run the following command from the Enterprise Console host to enable MySQL 5.7 parallel replication. The default value is true.
```
bin/platform-admin.sh submit-job --job finalize-replication --args dbParallelReplication=true
```

Troubleshooting the Incremental Replication Status

If your first incremental replication run is taking longer than usual, you can check the replication status by running either of the following commands:

cd <controller_home>/controller-ha
./ha_replicate.sh -r status

cd <controller_home>/controller-ha/tmp
cat replication.status

Re-enable Controller Database Replication

The Controller databases can be synchronized using the replicate script if they have been out of sync for more than seven days. Synchronizing a database that is more than seven days behind a master is considered reviving a Controller database. Reviving a database involves essentially the same procedure as adding a new secondary Controller to an existing production Controller, as described in Set Up the Secondary Controller and Initiate Replication. You can also follow these steps in the case of an HA failover that failed at replication.

To re-enable replication or revive a Controller database:

On the Controller page, click on Remove Controller, or run the following command on the Enterprise Console host:
```
bin/platform-admin.sh submit-job --job remove --service controller
```
Enter the database root credentials.

Check Remove Binaries, or run the following command on the Enterprise Console host:

bin/platform-admin.sh submit-job --job remove --service controller --args removeBinaries=true

Uncheck Remove Controller Cluster. If it is already unchecked, remove the secondary server.
Click Submit.
Add a secondary controller from the Controller page, or run the following command on the Enterprise Console host:
```
bin/platform-admin.sh submit-job --service controller --job add-secondary --args controllerSecondaryHost=secondary mysqlRootPassword='password'
```
The command will restart the primary Controller, resulting in downtime.

The Enterprise Console will onboard the secondary Controller and re-enable replication.

Backing Up and Restoring Controller Data in an HA Pair

An HA deployment makes backing up Controller data relatively straightforward since the secondary Controller offers a complete set of production data on which you can perform a cold backup without disrupting the primary Controller service.

After setting up HA, perform a backup by stopping the secondary Controller on the Enterprise Console and performing a file-level copy of the AppDynamics home directory (that is, a cold backup). When finished, restart the Controller from the Enterprise Console. The secondary will then catch up with the primary.

When restoring the database from a backup in an HA or standalone environment, check that the primary and secondary servers' ha.type and ha.mode are set properly to active and passive, respectively.

Updating the Configuration in an HA Pair

The Enterprise Console will copy any file-level configuration customizations made on the primary controller to the secondary controller, such as changes in domain.xml and db.cnf.

Over time, if you need to modify the Controller configuration, always make those changes in the Enterprise Console on the Controller Settings page under Configurations. These changes will be preserved during upgrades. Any changes made outside the Enterprise Console will not be preserved after an upgrade.

Troubleshooting HA

Controller Diagnostic Data

The Enterprise Console writes log messages pertaining to HA to the platform-admin-server.log on the Enterprise Console host.

To diagnose the Controller, run the following command:

bin/platform-admin.sh submit-job --platform-name <name_of_the_platform> --job diagnosis --service controller

Refer to the Controller diagnostic data in the platform-admin-server.log.

Sample Controller diagnostic data

Linux

Controller diagnostic data:
123.45.0.1:
controller_database: running
controller_appserver: running
reports_service: running
operating_system: Linux
controller_version: 004-004-001-000
controller_performance_profile: small
controller_ha_type: primary
controller_appserver_mode: active
controller_metric_data_per_min: N/A
slave_io_state: Waiting for master to send event
seconds_behind_master: 0
master_server_id: 567.
master_host: controller-secondary
master_ssl_allowed: No

123.45.0.2:
controller_database: running
controller_appserver: not running
reports_service: running
operating_system: Linux
controller_version: 004-004-001-000
controller_performance_profile: small
controller_ha_type: secondary
controller_appserver_mode: passive
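
When reading this output, the `controller_ha_type` value of each host is usually the first thing to check. The following minimal sketch extracts the host and role pairs and checks that exactly one primary and one secondary are present. It assumes the diagnostic block has been copied into its own file in the layout of the sample above (a host line ending in a colon, followed by `key: value` lines). `ha_roles` and `roles_valid` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: read HA roles from a saved copy of the Controller diagnostic data.
# Assumption: the file follows the sample layout (host line, then key: value lines).

# $1 = file with diagnostic data; prints "<host> <ha_type>" per Controller.
ha_roles() {
  awk '
    # A host line is a single token ending in ":" (for example "123.45.0.1:").
    NF == 1 && $1 ~ /:$/ { host = substr($1, 1, length($1) - 1); next }
    $1 == "controller_ha_type:" { print host, $2 }
  ' "$1"
}

# Returns 0 only when the roles are exactly one primary and one secondary.
roles_valid() {
  local roles
  roles="$(ha_roles "$1" | awk '{ print $2 }' | sort | tr '\n' ' ')"
  [ "$roles" = "primary secondary " ]
}
```

For the sample data, `ha_roles` prints one line per host with its role, and `roles_valid` succeeds. A primary/primary or secondary/secondary pair makes `roles_valid` fail.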

Invalid HA Controller Roles

If your HA Controller roles in the Controller databases are incorrect, the Enterprise Console will prevent discover and upgrade jobs. An invalid HA Controller state is when both of your Controller role types are identical, such as in a primary/primary or secondary/secondary case.

To fix this issue:

Identify which server is the primary.
1. Log in to one of the Controller databases by running the following command in the Controller installation directory:
```
bin/controller.sh login-db
```
2. Run the following command:
```
select * from global_configuration_local where name='ha.controller.type';
```

Ensure that ha.controller.type is set correctly in the database.

Log in to the Controller database you would like to change by running the following command in the Controller installation directory:
```
bin/controller.sh login-db
```

Run the following commands to set the database to the primary or secondary.

To set the database to primary:

use controller;
update global_configuration_local set value='primary' where name='ha.controller.type';
update global_configuration_local set value='active' where name='appserver.mode';

To set the database to secondary:

use controller;
update global_configuration_local set value='secondary' where name='ha.controller.type';
update global_configuration_local set value='passive' where name='appserver.mode';

Restart the database for the change to take effect on the Appserver:
```
bin/platform-admin.sh stop-controller-appserver --with-db
bin/platform-admin.sh start-controller-appserver --with-db
```
If the secondary Appserver is already in a shutdown state, then there is no need to restart the database.
Verify the replication is healthy:
```
show slave status\G
```
Slave_IO_Running and Slave_SQL_Running should show Yes.

You may now retry the discover and upgrade job.
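
The replication check in the steps above can be scripted against saved `show slave status\G` output. This is a minimal sketch that assumes the standard MySQL vertical output format, in which each field appears on its own line as `Name: value`. `replication_healthy` is a hypothetical helper name, and how you capture the output is left to your environment.

```shell
#!/usr/bin/env bash
# Sketch: check the two replication thread flags in `show slave status\G` output.
# Reads the output on stdin; returns 0 only when both flags are "Yes".
replication_healthy() {
  awk -F': *' '
    {
      key = $1
      gsub(/^[ \t]+/, "", key)   # strip the left padding MySQL adds to field names
    }
    key == "Slave_IO_Running"  { io  = $2 }
    key == "Slave_SQL_Running" { sql = $2 }
    END { exit !(io == "Yes" && sql == "Yes") }
  '
}
```

For example, after saving the output to a file, `replication_healthy < slave_status.txt && echo healthy` prints `healthy` only when both Slave_IO_Running and Slave_SQL_Running are Yes. The file name here is illustrative.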

Failover Prevention

If failover is prevented on your Controller HA configuration, it may be due to one of two scenarios:

The secondary database is down. Failover cannot occur when the secondary database is not running.
To fix this issue:
1. Restart the secondary database by running the following command on the secondary host:
```
bin/controller.sh start-db
```
If this does not enable failover, then it may be due to the second scenario.

Database replication is not healthy. Failover is not allowed when the database replication is not healthy.
There are various reasons why this may be the case. Please work closely with your AppDynamics account representative to correct the issue.