The HA toolkit consists of bash scripts that automate HA-related setup and administration tasks for the Linux operating system. It works with most flavors of Linux, including Ubuntu and Red Hat/CentOS.
You can use the toolkit to:
Deploying Controllers as an HA pair ensures that service downtime in the event of a Controller machine failure is minimized. It also facilitates other administrative tasks, such as backing up data. For more background information, including the benefits of HA, see Controller High Availability (HA).
The toolkit works on Linux systems only. However, even if you cannot use the HA toolkit directly (due to a different operating system or because of site-specific requirements), you are likely to benefit from learning how the toolkit works since it can provide a model for your own scripts and processes.
You can download the HA toolkit from the following GitHub link: https://github.com/Appdynamics/HA-toolkit/releases.
In an HA deployment, the HA toolkit on one Controller host needs to be able to interact with the other Controller host. The toolkit relies on certain conditions in the environment to support this interaction, along with the other operations it performs.
General guidelines and requirements for the environment in which you deploy HA are:
If the AppDynamics Controller is run by a non-root user on the system, the HA toolkit process must be able to escalate its privilege level to accomplish certain tasks, including replication, failover and assassin tasks.
The script uses one of the following two mechanisms to accomplish privilege escalation. The toolkit installation adds artifacts required for both mechanisms; however, the one that it actually uses is determined at runtime based on dependencies in the environment, as follows:
/sbin/appdservice is a setuid root program distributed in source form in HA/appdservice.c. It is written explicitly to support auditing by security audit systems. The install-init.sh script compiles and installs the program, which is executable only by the AppDynamics user and the root user. Because the program is compiled during installation, install-init.sh requires a C compiler to be available on the system. You can install a C compiler using the package manager for your operating system. For example, on Yum-based Linux distributions, you can use the following command to install the GNU Compiler Collection (GCC), which includes a C compiler:
After deploying the HA Controller pair, you will be able to test support for one of these functions (without causing system changes) by running these commands as the appd user:
At least one of the commands must successfully return the Controller status for the toolkit to work.
Before setting up HA, a reverse proxy or load balancer needs to be available and configured to route traffic to the active Controller in the HA pair. Using a load balancer to route traffic between Controllers (rather than other approaches, such as DNS manipulation) ensures that a failover can occur quickly, without, for example, delays due to DNS caching on the agent machines.
An HA deployment requires the following IP addresses:
When configuring replication, you specify the external address at which Controller clients, such as app agents and UI users, reach the Controller at the load balancer. The Controllers themselves need to be able to reach this address as well. If the Controllers reside within a protected network relative to the load balancer, which prevents them from reaching this address, there needs to be an internal VIP on the protected side that proxies the active Controller from within the network. You specify the internal VIP using the -i parameter.
The load balancer can check the availability of the Controller at the following address:
If the Controller is active, it responds to a GET request at this URL with an HTTP 200 response. The body of the response indicates the status of the Controller in the following manner:
Ensure that the load balancer policy you configure for the Controller pair can send traffic to only a single Controller in the pair at a time (i.e., do not use round-robin or similar routing distribution policy at the load balancer). For more information about setting up a load balancer for the Controller, see Use a Reverse Proxy.
Setting up high availability involves the following steps:
The following sections provide more information on how to configure a few of the system requirements. They describe how to configure the settings on Red Hat Linux for a sample deployment. The specific steps for configuring these requirements may differ on other systems; consult the documentation for your system for details.
Reliable symmetrical reverse host lookup needs to be set up on each machine. The best way to accomplish this is by placing the host names of the pair into the hosts file (/etc/hosts) on each machine. This is preferable to other approaches, such as reverse DNS, which add a point of failure.
In the /etc/hosts file on each machine, add an entry for each host in the HA pair, as in the following example:
192.168.144.128 host1.domain.com host1
192.168.144.137 host2.domain.com host2
It is important to adhere to the correct format of /etc/hosts files in order to reduce errors. If you have both dotted hostnames and short versions, you need to list the dotted hostnames with the most dots first and the other versions subsequently. This should be done consistently for both HA server entries in each of the two /etc/hosts files. Note in the examples above that the aliases are listed last.
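The ordering rule above can be checked mechanically. The following is a minimal sketch and not part of the HA toolkit; the function name check_hosts_order is hypothetical. It reads hosts-file entries on standard input and reports any line where a name with more dots follows a name with fewer dots.

```shell
#!/bin/sh
# Hypothetical helper (not part of the HA toolkit): read /etc/hosts-style
# entries on stdin and flag lines where a hostname with more dots appears
# after one with fewer dots (for example, a short alias listed first).
check_hosts_order() {
    awk '
        /^[ \t]*(#|$)/ { next }                 # skip comments and blank lines
        {
            prev = -1
            for (i = 2; i <= NF; i++) {         # field 1 is the IP address
                if ($i ~ /^#/) break            # ignore a trailing comment
                name = $i
                dots = gsub(/\./, ".", name)    # gsub returns the dot count
                if (prev >= 0 && dots > prev) {
                    print "bad order: " $0
                    bad = 1
                    break
                }
                prev = dots
            }
        }
        END { exit bad }
    '
}
```

Because loopback lines often list localhost before localhost.localdomain, pass only the HA pair entries to the function, for example: grep -E 'host1|host2' /etc/hosts | check_hosts_order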
SSH must be installed on both hosts in a way that gives the user who runs the Controller passwordless SSH access to the other Controller system in the HA pair. You can accomplish this by generating a key pair on each node, and placing the public key of the other Controller into the authorized keys (authorized_keys) file on each Controller.
The following steps illustrate how to perform this configuration. The instructions assume an AppDynamics user named appduser and Controller hostnames of node1 (the active primary) and node2 (the secondary). Adjust the instructions for your particular environment. Also note that you may not need to perform every step (for example, you may already have the .ssh directory and don't need to create a new one).
Although not shown here, some of the steps may prompt you for a password.
Change to the AppDynamics user, appduser in our example:
Create a directory for SSH artifacts (if it doesn't already exist) and set permissions on the directory, as follows:
Generate the RSA-formatted key:
Secure copy the key to the other Controller:
As you did for node1, run these commands:
Add the public key of node1 that you previously copied to the secondary Controller host's authorized keys and set permissions on the authorized keys file:
Move the secondary's public key to the authorized keys file:
To test the configuration, try this:
Make sure the echo command succeeds.
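The directory and authorized keys steps above can be sketched as two small shell functions. This is an illustrative sketch, not part of the HA toolkit: the function names and the /tmp/node1.pub staging path in the comments are assumptions, and the ssh-keygen, scp, and ssh commands are left as comments because they act on the real account and network.

```shell
#!/bin/sh
# Sketch of the passwordless-SSH file setup for one node.
# Function names are illustrative; they are not part of the HA toolkit.

# Create the .ssh directory under the home directory given as $1 (if it
# does not already exist) and restrict it to the owner.
prepare_ssh_dir() {
    mkdir -p "$1/.ssh" && chmod 700 "$1/.ssh"
}

# Append the other node's public key file ($2) to authorized_keys under
# the home directory $1, skipping the key if it is already present.
authorize_key() {
    auth="$1/.ssh/authorized_keys"
    touch "$auth" || return 1
    chmod 600 "$auth" || return 1
    grep -qxF -f "$2" "$auth" || cat "$2" >> "$auth"
}

# Typical use on node1, run as appduser:
#   prepare_ssh_dir "$HOME"
#   ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa"
#   scp "$HOME/.ssh/id_rsa.pub" node2:/tmp/node1.pub   # staging path is assumed
# Then on node2:
#   prepare_ssh_dir "$HOME"
#   authorize_key "$HOME" /tmp/node1.pub
# Repeat in the other direction, then verify from node1:
#   ssh node2 echo ok
```

Calling authorize_key more than once with the same key leaves a single entry in the file, so the steps are safe to repeat.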
With your environment configured, you can get and install the HA Toolkit. The toolkit is packaged as a shar (shell archive) file, which, when executed, extracts the toolkit's scripts.
Change to the Controller home directory:
Create the HA directory in the Controller home:
Make the entire directory writable and change into the directory:
Make the file executable:
Run the shell archive script:
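The unpacking steps above can be combined into one function. This is a minimal sketch under stated assumptions: the function name unpack_toolkit is hypothetical, the archive path is whatever file you downloaded from the releases page, and the copy into the HA directory is added here for convenience.

```shell
#!/bin/sh
# Sketch: unpack the toolkit shell archive into <controller_home>/HA.
# $1 = Controller home directory, $2 = path to the downloaded shar file.
# The function name is illustrative; it is not part of the toolkit.
unpack_toolkit() {
    controller_home="$1"
    shar_file="$2"
    shar_name=$(basename "$shar_file")

    mkdir -p "$controller_home/HA" || return 1
    chmod -R u+w "$controller_home/HA" || return 1
    cp "$shar_file" "$controller_home/HA/" || return 1

    # Run in a subshell so the caller's working directory is unchanged.
    (
        cd "$controller_home/HA" || exit 1
        chmod u+x "$shar_name" || exit 1
        "./$shar_name"
    )
}
```

Run it as the AppDynamics user so that the extracted scripts are owned by the same account that runs the Controller.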
Once you have set up the environment and downloaded the toolkit to the primary Controller, you can set up the primary Controller for high availability. The steps for setting up HA differ if you are deploying a Controller for the first time or adding an HA secondary to an existing Controller, i.e., one that has already accumulated application data:
Once installation is finished, ensure that the Controller home and data directories are writable by the AppDynamics user.
Use the ls command to verify write privileges. The output should look similar to the output below, which shows the current privileges for the sample AppDynamics user, appduser.
After preparing the primary, you can set up replication, as described next.
To set up the secondary Controller, you run the replicate.sh script on the primary Controller machine. This script is the primary entry point for the toolkit. It performs these functions, among others:
This script can only be run on the primary Controller. If you run the replicate script with superuser (sudo) privileges, it performs the complete HA setup: installing the secondary Controller, copying data to the secondary, and setting up master-master database replication. If you do not run the script as a superuser, you will need to perform some additional configuration tasks later to install system services. To perform those tasks, run the install-init.sh script as described in Installing as a Service.
You will need to run the replicate script at least twice. On the first pass, the script performs initial setup and replication tasks. The final replication pass (which is specified using the -f flag) completes replication and restarts the Controller. This results in a brief service downtime.
For an existing Controller deployment with a large amount of data to replicate, you may wish to execute replication multiple times before finalizing replication. The first time you run the script, it can take a significant amount of time to complete, possibly days. The second pass replicates the data changes accumulated while the first pass executed. If performed immediately after the first pass, the subsequent pass should take considerably less time. You can run the replication pass again until the amount of time it takes for the replicate script to complete fits within what would be an acceptable downtime window.
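The repeat-until-it-fits approach described above can be sketched as a loop that times each pass. This is an illustration only: the function name replicate_until_fits is hypothetical, and the replication command itself is supplied by you as arguments, so no toolkit options are assumed here.

```shell
#!/bin/sh
# Sketch: repeat a non-final replication pass until one pass completes
# within the acceptable downtime window.
# $1 = window in seconds, $2 = maximum number of passes,
# remaining arguments = the replication command to run (supplied by you).
replicate_until_fits() {
    window="$1"
    max_passes="$2"
    shift 2
    pass=1
    while [ "$pass" -le "$max_passes" ]; do
        start=$(date +%s)
        "$@" || return 2                     # stop if a pass fails
        elapsed=$(( $(date +%s) - start ))
        echo "pass $pass took ${elapsed}s"
        if [ "$elapsed" -le "$window" ]; then
            return 0                         # short enough for the final pass
        fi
        pass=$(( pass + 1 ))
    done
    return 1                                 # still too slow after max_passes
}
```

For example, replicate_until_fits 1800 5 followed by your non-final replicate command runs up to five passes and stops as soon as one finishes within 30 minutes. A return value of 0 indicates that the last pass fit the window, at which point you can schedule the finalization pass.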
For the initial replication steps, invoke the replicate script, passing the hostname of the secondary and the virtual IP for the Controller pair at the load balancer. The command should be in the following form:
Command options are:
For all available options to the script, run the replicate script with no arguments.
If running as non-root, the command asks that you run the install-init script manually as root to complete the installation.
If replicating a large amount of data, repeat the previous command to minimize the time required for the finalization pass in which the Controller is restarted.
When ready to finalize replication, run the script again, this time passing the -w and -f flags:
The flags have the following effect:
-w – Starts the watchdog process on the secondary (see below for more information about the watchdog)
-f – Causes the script to finalize setup, restarting the Controllers.
Log in to the secondary Controller database:
Execute the following command:
The command should return the following result:
If you get a non-zero number for this test, wait until the number becomes zero.
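If you want to script the wait, the check can be automated along the following lines. This sketch assumes that the number in question is the Seconds_Behind_Master field reported by MySQL's SHOW SLAVE STATUS statement; that is an assumption about this step, and the function name replication_caught_up is hypothetical.

```shell
#!/bin/sh
# Sketch: read the output of "SHOW SLAVE STATUS\G" on stdin and succeed
# only when Seconds_Behind_Master is 0.
# ASSUMPTION: the value checked in this step is MySQL's replication lag
# field; adjust the pattern if your step uses a different query.
replication_caught_up() {
    lag=$(awk -F': *' '/Seconds_Behind_Master/ { print $2; exit }')
    echo "Seconds_Behind_Master: ${lag:-unknown}"
    [ "$lag" = "0" ]
}
```

Pipe the statement's output from your MySQL client session on the secondary into the function, and repeat the check until it succeeds. A value of NULL, which MySQL reports when replication is not running, is treated as not caught up.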
The watchdog process is a background process that runs on the secondary. It monitors the availability of the primary Controller, and, if it detects that the primary is unavailable, automatically initiates a failover. If you passed the -w flag to the replicate script when setting up HA, the watchdog was started for you. You can start the watchdog manually and configure its default settings as described here.
The watchdog performs a health check every 10 seconds. You can configure how long the watchdog waits after detecting that the primary is down before it considers the condition a failure and initiates failover. By default, it waits 5 minutes, rechecking for availability every 10 seconds. Since the watchdog should not take over for a primary while it is in the process of shutting down or starting up, there are individual wait times for these operations as well.
DOWNLIMIT: Length of time that the primary is detected as unavailable before the watchdog on the secondary initiates failover.
FALLINGLIMIT: Length of time that the primary reports itself as shutting down before the watchdog on the secondary initiates failover. The secondary needs to allow the primary to shut down without initiating failover, so this setting specifies the length of time after which the primary may be considered "stuck" in that state, at which point the secondary takes over.
RISINGLIMIT: Length of time that the primary reports itself as starting up before the watchdog on the secondary initiates failover. The secondary needs to allow the primary to start up without initiating failover, so this setting specifies the length of time after which the primary may be considered "stuck" in that state, at which point the secondary takes over.
DBDOWNLIMIT: Length of time that the primary database is detected as unavailable before the watchdog on the secondary initiates failover.
PINGLIMIT: Length of time ping attempts of the primary fail before the secondary initiates failover.
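To make the effect of a limit concrete, the following sketch models how a time limit interacts with the polling interval. It is an illustrative model only, not the toolkit's watchdog code; the function name watchdog_model and the "up"/"down" input format are assumptions made for the example.

```shell
#!/bin/sh
# Illustrative model of a time-limit check, NOT the toolkit's watchdog code.
# Reads one health-check result per line ("up" or "down") from stdin; each
# line stands for one poll, and polls are $2 seconds apart.
# $1 = limit in seconds (for example, the DOWNLIMIT value).
# Prints "failover" and returns 0 once the primary has been down for at
# least $1 seconds in a row; returns 1 if the input ends first.
watchdog_model() {
    downlimit="$1"
    interval="$2"
    down_for=0
    while read -r status; do
        if [ "$status" = "down" ]; then
            down_for=$(( down_for + interval ))   # each failed poll adds one interval
        else
            down_for=0                            # a successful check resets the timer
        fi
        if [ "$down_for" -ge "$downlimit" ]; then
            echo "failover"
            return 0
        fi
    done
    return 1
}
```

In this model, the default values described above (a 5-minute wait with a 10-second interval) correspond to 30 consecutive failed checks; a single successful check in between resets the count.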
Create the control file that enables the watchdog. For example:
Enable read/write permissions on the file:
Start the service:
Note: Running the replicate.sh script with the -w option at final activation creates the watchdog control file automatically.
Removing the WATCHDOG_ENABLE file causes the watchdog to exit.
The Controllers are now configured for high availability. The following sections describe how to perform various administrative tasks in a high availability environment.
To stop and start the primary Controller without initiating failover, remove the watchdog file on the secondary before stopping or initiating the restart on the primary. This causes the secondary to stop watching the primary, so that it doesn't initiate failover when the primary is briefly unavailable.
When the primary is finished restarting, you can add the file back to resume the watchdog process. The file is:
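The remove-and-restore steps above can be wrapped in two small functions to run on the secondary. This is a sketch under stated assumptions: the function names are hypothetical, the control file path is passed as an argument because it depends on your installation, and the permission mode shown is an assumed reading of "read/write permissions".

```shell
#!/bin/sh
# Sketch: suspend and resume the watchdog around a planned primary restart.
# $1 is the full path of the WATCHDOG_ENABLE control file on the secondary;
# it is passed in because the location depends on your installation.

# Remove the control file so the watchdog exits and cannot trigger failover.
pause_watchdog() {
    rm -f "$1"
}

# Re-create the control file once the primary is back up.
# ASSUMPTION: owner read/write is sufficient; widen the mode if your
# watchdog runs under a different account.
resume_watchdog() {
    touch "$1" && chmod u+rw "$1"
}
```

Call pause_watchdog before stopping or restarting the primary, and resume_watchdog after the primary is available again.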
After you have set up HA, the Controller is automatically started at boot time and shut down when the system is halted. You can start and stop the Controller service and HA facility manually at any time using the Linux service command as root user.
To start or stop the Controller manually, use the following commands:
The replicate script installs the Controller as a service automatically if you run it as the root user. If you did not run the replicate script as root, run the install-init.sh script manually after the replicate script finishes to complete the installation. Run install-init.sh with one of the following options to choose how to elevate the user privilege:
-c  # use setuid C wrapper
-p  # use pbrun wrapper
-x  # use user privilege wrapper
-a <Machine Agent install directory>. This option configures the Machine Agent to report to the Controller's self-monitoring account and installs an init script for it.
If you need to uninstall the service later, use the
Once installed as a service, the Linux service utility can be run on either node to report the current state of the replication, background processes, and the Controller itself.
To check its status, use this command:
The toolkit also writes status and progress messages from its various components to the log files.
To fail over from the primary to the secondary manually, run the failover.sh script on the secondary. This kills the watchdog process, starts the app server on the secondary, and makes the database on the secondary the replication master. If you have custom procedures you want to add to the failover process (such as updating a dynamic DNS service or notifying a load balancer or proxy), you can add them to this script or call them from it.
Should the secondary be unable to reach the MySQL database on the primary, it will then try to kill the app server process on the primary Controller, avoiding the possibility of having two Controllers active at the same time. This function is performed by the assassin.sh script, which continues to watch for the former primary Controller process to ensure that there aren't two active, replicating Controllers.
The process for performing a failback to the old primary is the same as failing over to the secondary. Run failover.sh on the machine whose Controller you want to restore as the primary. Note that if that Controller has been down for more than seven days, you need to revive its database, as described in the following section.
The Controller databases can be synchronized using the replicate script if they have been out of sync for more than seven days. Synchronizing a database that is more than seven days behind a master is considered reviving a database. Reviving a database involves essentially the same procedure as adding a new secondary Controller to an existing production Controller, as described in Set Up the Secondary Controller and Initiate Replication.
In short, you run replicate.sh without the -f switch multiple times on the primary. Once the replication time has been reduced to a duration that fits within an acceptable service window, and such a window is available, take the primary Controller out of service (stop the app server) and allow data synchronization to catch up.
An HA deployment makes backing up Controller data relatively straightforward, since the secondary Controller offers a complete set of production data on which you can perform a cold backup without disrupting the primary Controller service.
After setting up HA, perform a backup by stopping the appdcontroller service on the secondary and performing a file-level copy of the AppDynamics home directory (i.e., a cold backup). When finished, simply restart the service. The secondary will then catch up its data to the primary.
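The backup procedure can be scripted along the following lines. This is a minimal sketch that mirrors the steps as described: the function name cold_backup is hypothetical, and using tar for the file-level copy is an assumption; any file-level copy tool works.

```shell
#!/bin/sh
# Sketch of a cold backup on the secondary. Run as root.
# $1 = AppDynamics home directory, $2 = destination archive (.tar.gz).
# ASSUMPTION: tar is used for the file-level copy.
cold_backup() {
    appd_home="$1"
    archive="$2"

    service appdcontroller stop || return 1

    # Archive the home directory relative to its parent so the archive
    # contains a single top-level directory.
    rc=0
    tar -czf "$archive" -C "$(dirname "$appd_home")" "$(basename "$appd_home")" || rc=$?

    # Restart the service even if the copy failed, so the secondary can
    # resume catching up to the primary.
    service appdcontroller start || return 1
    return "$rc"
}
```

Write the archive to a location outside the AppDynamics home directory so that the backup does not include itself.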
When you run the replicate script, the toolkit copies any file-level configuration customizations made on the primary Controller, such as configuration changes in the domain.xml file, to the secondary.
Over time, you may need to make modifications to the Controller configuration. After you do so, you can use the -j switch to replicate only the configuration changes from the primary to the secondary. For example:
The HA toolkit writes log messages to the log files located in the same directory as other Controller logs, <controller_home>/logs by default. The files include: