Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment.

Cisco Cloud Observability supports monitoring the following Amazon SageMaker entities:

  • Endpoint: Used to serve real-time inference requests from deployed models.
  • Training Job: Used to train machine learning models.
  • Processing Job: Used to analyze data and evaluate machine learning models.

You must configure cloud connections to monitor these entities. See Set up Cisco AppDynamics Cloud Collectors to Monitor AWS.

Cisco Cloud Observability displays AWS entities on the Observe page. Metrics are displayed for specific entity instances in the list and detail views.

This document contains references to third-party documentation. Splunk AppDynamics does not own any rights and assumes no responsibility for the accuracy or completeness of such third-party documentation.

Detail View

To display the detail view for an Amazon SageMaker instance:

Endpoints

  1. Navigate to the Observe page.
  2. Under App Integrations, click AWS SageMaker Endpoints.
    The list view now displays.
  3. From the list, click an instance Name to display the detail view.
    The detail view displays the metrics, key performance indicators, and properties (attributes) related to the instance you selected.

Processing Jobs

  1. Navigate to the Observe page.
  2. Under App Integrations, click AWS SageMaker Jobs.
    The list view now displays. The Processing Jobs tab is selected by default.
  3. From the list, click an instance Name to display the detail view.
    The detail view displays the metrics, key performance indicators, and properties (attributes) related to the instance you selected.

Training Jobs

  1. Navigate to the Observe page.
  2. Under App Integrations, click AWS SageMaker Jobs.
    The list view now displays. Click the Training Jobs tab.
  3. From the list, click an instance Name to display the detail view.
    The detail view displays the metrics, key performance indicators, and properties (attributes) related to the instance you selected.

Metrics and Key Performance Indicators

Cisco Cloud Observability displays the following metrics and key performance indicators (KPIs) for Amazon SageMaker. For more information, see Monitor Amazon SageMaker with Amazon CloudWatch.

| Display Name | Source Metric Name | Description |
|---|---|---|
| CPU Utilization (%) | CPUUtilization | The sum of each individual CPU core's utilization. Each core's utilization ranges from 0%–100%; for example, if there are four CPUs, the CPUUtilization range is 0%–400%. For processing jobs, the value is the CPU utilization of the processing container on the instance. For endpoint variants, the value is the sum of the CPU utilization of the primary and supplementary containers on the instance. |
| Disk Utilization (%) | DiskUtilization | The percentage of disk space used by the containers on an instance. The value ranges from 0%–100%. This metric is not supported for batch transform jobs. For endpoint variants, the value is the sum of the disk space utilization of the primary and supplementary containers on the instance. |
| Memory Utilization (%) | MemoryUtilization | The percentage of memory used by the containers on an instance. The value ranges from 0%–100%. For endpoint variants, the value is the sum of the memory utilization of the primary and supplementary containers on the instance. |
| GPU Utilization (%) | GPUUtilization | The percentage of GPU units used by the containers on an instance. Each GPU ranges from 0%–100%, so the value is scaled by the number of GPUs; for example, if there are four GPUs, the GPUUtilization range is 0%–400%. For endpoint variants, the value is the sum of the GPU utilization of the primary and supplementary containers on the instance. |
| GPU Memory Utilization (%) | GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. Each GPU ranges from 0%–100%, so the value is scaled by the number of GPUs; for example, if there are four GPUs, the GPUMemoryUtilization range is 0%–400%. For endpoint variants, the value is the sum of the GPU memory utilization of the primary and supplementary containers on the instance. |
| Invoke Endpoint Requests (Count) | Invocations | The number of InvokeEndpoint requests sent to a model endpoint. |
| Invoke Endpoint Errors (Count) | Invocation4XXErrors | The number of InvokeEndpoint requests where the model returned a 4xx HTTP response code. For each 4xx response, 1 is sent; otherwise, 0 is sent. |
| Invoke Endpoint Errors (Count) | Invocation5XXErrors | The number of InvokeEndpoint requests where the model returned a 5xx HTTP response code. For each 5xx response, 1 is sent; otherwise, 0 is sent. |
| Model Cache Hits (Count) | ModelCacheHit | The number of InvokeEndpoint requests sent to the multi-model endpoint for which the model was already loaded. |
| Model Latency (ms) | ModelLatency | The interval of time taken by a model to respond to a SageMaker Runtime API request. This interval includes the local communication time taken to send the request and fetch the response from the model container, and the time taken to complete the inference in the container. |
| Model Latency Overhead (ms) | OverheadLatency | The interval of time added by SageMaker overhead to the time taken to respond to a client request. This interval is measured from the time SageMaker receives the request until it returns a response to the client, minus the ModelLatency. Overhead latency can vary depending on multiple factors, including request and response payload sizes, request frequency, and authentication/authorization of the request. |
| Model Operation Time (ms) | ModelLoadingTime | The interval of time that it took to load the model through the container's LoadModel API call. |
| Model Operation Time (ms) | ModelDownloadingTime | The interval of time that it took to download the model from Amazon Simple Storage Service (Amazon S3). |
| Model Operation Time (ms) | ModelUnloadingTime | The interval of time that it took to unload the model through the container's UnloadModel API call. |
| Model Operation Time (ms) | ModelLoadingWaitTime | The interval of time that an invocation request waited for the target model to be downloaded, loaded, or both, before performing inference. |
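These source metrics are published to Amazon CloudWatch under the AWS/SageMaker namespace, with EndpointName and VariantName dimensions for the invocation metrics. As a rough illustration of how one of them could be retrieved outside Cisco Cloud Observability, the sketch below builds the parameter dict for a CloudWatch GetMetricStatistics call for the Invocations metric; the endpoint and variant names are hypothetical, and the resulting dict could be passed to boto3's `cloudwatch.get_metric_statistics(**params)`.

```python
from datetime import datetime, timedelta, timezone

def invocations_query(endpoint_name, variant_name, hours=1):
    """Build GetMetricStatistics parameters for the Invocations metric
    of a single SageMaker endpoint variant (AWS/SageMaker namespace)."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocations",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,          # 5-minute buckets
        "Statistics": ["Sum"],  # Invocations is a count, so Sum is the usual choice
    }

# Hypothetical endpoint and variant names for illustration:
params = invocations_query("my-endpoint", "AllTraffic")
```

The same parameter shape works for the other invocation metrics by swapping the MetricName (for example, Invocation4XXErrors or ModelLatency, the latter typically queried with the Average statistic instead of Sum).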
| Display Name | Source Metric Name | Description |
|---|---|---|
| Normalized CPU Utilization (%) | CPUUtilizationNormalized | The normalized sum of each individual CPU core's utilization. The value ranges from 0%–100%. For example, if there are four CPUs and the CPUUtilization metric is 200%, the CPUUtilizationNormalized metric is 50%. |
| Normalized GPU Utilization (%) | GPUUtilizationNormalized | The normalized percentage of GPU units used by the containers on an instance. The value ranges from 0%–100%. For example, if there are four GPUs and the GPUUtilization metric is 200%, the GPUUtilizationNormalized metric is 50%. |
| Normalized GPU Memory Utilization (%) | GPUMemoryUtilizationNormalized | The normalized percentage of GPU memory used by the containers on an instance. The value ranges from 0%–100%. For example, if there are four GPUs and the GPUMemoryUtilization metric is 200%, the GPUMemoryUtilizationNormalized metric is 50%. |
| CPU Utilization (%) | CPUUtilization | The sum of each individual CPU core's utilization. Each core's utilization ranges from 0%–100%; for example, if there are four CPUs, the CPUUtilization range is 0%–400%. For processing jobs, the value is the CPU utilization of the processing container on the instance. |
| Disk Utilization (%) | DiskUtilization | The percentage of disk space used by the containers on an instance. The value ranges from 0%–100%. This metric is not supported for batch transform jobs. For processing jobs, the value is the disk space utilization of the processing container on the instance. |
| Memory Utilization (%) | MemoryUtilization | The percentage of memory used by the containers on an instance. The value ranges from 0%–100%. For processing jobs, the value is the memory utilization of the processing container on the instance. |
| GPU Utilization (%) | GPUUtilization | The percentage of GPU units used by the containers on an instance. Each GPU ranges from 0%–100%, so the value is scaled by the number of GPUs; for example, if there are four GPUs, the GPUUtilization range is 0%–400%. For processing jobs, the value is the GPU utilization of the processing container on the instance. |
| GPU Memory Utilization (%) | GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. Each GPU ranges from 0%–100%, so the value is scaled by the number of GPUs; for example, if there are four GPUs, the GPUMemoryUtilization range is 0%–400%. For processing jobs, the value is the GPU memory utilization of the processing container on the instance. |
| CPU Reservation (%) | CPUReservation | The sum of CPUs reserved by containers on an instance. The value ranges from 0%–100%. In the settings for an inference component, you set the CPU reservation with the NumberOfCpuCoresRequired parameter. For example, if there are 4 CPUs and 2 are reserved, the CPUReservation metric is 50%. |
| GPU Reservation (%) | GPUReservation | The sum of GPUs reserved by containers on an instance. The value ranges from 0%–100%. In the settings for an inference component, you set the GPU reservation with the NumberOfAcceleratorDevicesRequired parameter. For example, if there are 4 GPUs and 2 are reserved, the GPUReservation metric is 50%. |
| Memory Reservation (%) | MemoryReservation | The sum of memory reserved by containers on an instance. The value ranges from 0%–100%. In the settings for an inference component, you set the memory reservation with the MinMemoryRequiredInMb parameter. For example, if a 32 GiB instance reserves 1,024 MB, the MemoryReservation metric is about 3%. |
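The relationship between the raw and normalized utilization metrics, and between reserved and total capacity, is simple arithmetic: raw utilization sums to 100% per device, so dividing by the device count yields the normalized 0%–100% figure, and a reservation percentage is the reserved share of total capacity. A minimal sketch of both conversions (the function names are illustrative, not part of any API):

```python
def normalize_utilization(raw_percent, device_count):
    """Convert a raw per-device utilization sum (0% to 100% * device_count)
    to its normalized 0%-100% equivalent, e.g. CPUUtilizationNormalized."""
    if device_count <= 0:
        raise ValueError("device_count must be positive")
    return raw_percent / device_count

def reservation_percent(reserved, total):
    """Compute a reservation metric such as GPUReservation:
    the reserved share of total capacity, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * reserved / total

# Matches the documented examples:
print(normalize_utilization(200.0, 4))  # 200% CPUUtilization on 4 CPUs -> 50.0
print(reservation_percent(2, 4))        # 2 of 4 GPUs reserved -> 50.0
```

The same arithmetic explains the memory example: 1,024 MB reserved on a 32 GiB (32,768 MiB) instance works out to roughly 3%.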


Properties (Attributes)

Cisco Cloud Observability displays the following properties for Amazon SageMaker.

| Display Name | Property Name | Description |
|---|---|---|
| Name | aws.sagemaker_endpoint.name | The name of the SageMaker endpoint. |
| ARN | aws.sagemaker_endpoint.arn | The ARN of the SageMaker endpoint. |
| Config Name | aws.sagemaker_endpoint.config_name | The name of the endpoint configuration associated with this endpoint. |
| Current Sampling % | aws.sagemaker_endpoint.current_sampling_percentage | The percentage of requests being captured by the endpoint. |
| Production Variant Names | aws.sagemaker_endpoint.production_variant_names | The names of the production variants. |
| Shadow Variant Names | aws.sagemaker_endpoint.shadow_variant_names | The names of the shadow variants. |

| Display Name | Property Name | Description |
|---|---|---|
| Job Name | aws.sagemaker_job.name | The name of the SageMaker job. |
| ARN | aws.sagemaker_job.arn | The ARN of the SageMaker job. |
| S3 Source ARNs | aws.sagemaker_job.s3_source_arns | The S3 locations of the data sources. |
| Athena Workgroup Name | aws.sagemaker_processing_job.athena_workgroup_name | The name of the workgroup in which the Athena query is started. |
| S3 Destination ARN | aws.sagemaker_processing_job.destination_s3_arn | A URI that identifies the Amazon S3 bucket where you want Amazon SageMaker to save the results of a processing job. |
| IAM Role ARN | aws.sagemaker_processing_job.iam_role_arn | The Amazon Resource Name (ARN) of an IAM role that Amazon SageMaker can assume to perform tasks. |
| Training Job ARN | aws.sagemaker_processing_job.training_job_arn | The ARN of a training job associated with this processing job. |

| Display Name | Property Name | Description |
|---|---|---|
| Job Name | aws.sagemaker_job.name | The name of the SageMaker job. |
| ARN | aws.sagemaker_job.arn | The ARN of the SageMaker job. |
| S3 Source ARNs | aws.sagemaker_job.s3_source_arns | The S3 locations of the data sources. |
| Artifacts Destination S3 ARN | aws.sagemaker_training_job.destination_s3_arn | Identifies the S3 path where you want SageMaker to store the model artifacts. |
| Checkpoints S3 ARN | aws.sagemaker_training_job.checkpoints_s3_arn | Identifies the S3 path where you want SageMaker to store checkpoints. |
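Most of the endpoint properties above correspond to fields in the response of the SageMaker DescribeEndpoint API (as returned, for example, by boto3's `sagemaker.describe_endpoint`). The sketch below maps a response-shaped dict to the displayed property names; the sample values are hypothetical, and the helper is illustrative rather than part of the product:

```python
def endpoint_properties(desc):
    """Map a DescribeEndpoint-style response dict to the displayed
    endpoint property names."""
    return {
        "aws.sagemaker_endpoint.name": desc["EndpointName"],
        "aws.sagemaker_endpoint.arn": desc["EndpointArn"],
        "aws.sagemaker_endpoint.config_name": desc["EndpointConfigName"],
        # Present only when data capture is configured on the endpoint.
        "aws.sagemaker_endpoint.current_sampling_percentage":
            desc.get("DataCaptureConfig", {}).get("CurrentSamplingPercentage"),
        "aws.sagemaker_endpoint.production_variant_names":
            [v["VariantName"] for v in desc.get("ProductionVariants", [])],
        "aws.sagemaker_endpoint.shadow_variant_names":
            [v["VariantName"] for v in desc.get("ShadowProductionVariants", [])],
    }

# Hypothetical DescribeEndpoint response fragment:
sample = {
    "EndpointName": "churn-model",
    "EndpointArn": "arn:aws:sagemaker:us-east-1:123456789012:endpoint/churn-model",
    "EndpointConfigName": "churn-model-config",
    "DataCaptureConfig": {"CurrentSamplingPercentage": 20},
    "ProductionVariants": [{"VariantName": "AllTraffic"}],
    "ShadowProductionVariants": [],
}
props = endpoint_properties(sample)
```

The job properties map similarly onto the DescribeTrainingJob and DescribeProcessingJob responses (for example, the training job's artifacts destination comes from its output data configuration).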


Retention and Purge Time-To-Live (TTL)

For all cloud and infrastructure entities, the retention TTL is 180 minutes (3 hours) and the purge TTL is 525,600 minutes (365 days). 

Amazon Web Services, the AWS logo, AWS, and any other AWS Marks used in these materials are trademarks of Amazon.com, Inc. or its affiliates in the United States and/or other countries.