Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment.

Cisco Cloud Observability supports monitoring the following Amazon SageMaker entities:

  • Endpoint: Used to serve real-time inference requests from deployed models.
  • Training Job: Used to train machine learning models.
  • Processing Job: Used to analyze data and evaluate machine learning models.

You must configure cloud connections to monitor these entities. See Set up Cisco AppDynamics Cloud Collectors to Monitor AWS.

Cisco Cloud Observability displays AWS entities on the Observe page. Metrics are displayed for specific entity instances in the list and detail views.

This document contains references to third-party documentation. Splunk AppDynamics does not own any rights and assumes no responsibility for the accuracy or completeness of such third-party documentation.

Detail View

To display the detail view for an Amazon SageMaker instance:

Endpoints

  1. Navigate to the Observe page.
  2. Under App Integrations, click AWS SageMaker Endpoints.
    The list view now displays.
  3. From the list, click an instance Name to display the detail view.
    The detail view displays the metrics, key performance indicators, and properties (attributes) related to the instance you selected.

Processing Jobs

  1. Navigate to the Observe page.
  2. Under App Integrations, click AWS SageMaker Jobs.
    The list view now displays. The Processing Jobs tab is selected by default.
  3. From the list, click an instance Name to display the detail view.
    The detail view displays the metrics, key performance indicators, and properties (attributes) related to the instance you selected.

Training Jobs

  1. Navigate to the Observe page.
  2. Under App Integrations, click AWS SageMaker Jobs.
    The list view now displays. Click the Training Jobs tab.
  3. From the list, click an instance Name to display the detail view.
    The detail view displays the metrics, key performance indicators, and properties (attributes) related to the instance you selected.

Metrics and Key Performance Indicators

Cisco Cloud Observability displays the following metrics and key performance indicators (KPIs) for Amazon SageMaker. For more information, see Monitor Amazon SageMaker with Amazon CloudWatch.

| Display Name | Source Metric Name | Description |
|---|---|---|
| CPU Utilization (%) | CPUUtilization | The sum of each individual CPU core's utilization. Each core's utilization ranges from 0%–100%; for example, if there are four CPUs, the CPUUtilization range is 0%–400%. For processing jobs, the value is the CPU utilization of the processing container on the instance. For endpoint variants, the value is the sum of the CPU utilization of the primary and supplementary containers on the instance. |
| Disk Utilization (%) | DiskUtilization | The percentage of disk space used by the containers on an instance. The value ranges from 0%–100%. This metric is not supported for batch transform jobs. For endpoint variants, the value is the sum of the disk space utilization of the primary and supplementary containers on the instance. |
| Memory Utilization (%) | MemoryUtilization | The percentage of memory used by the containers on an instance. The value ranges from 0%–100%. For endpoint variants, the value is the sum of the memory utilization of the primary and supplementary containers on the instance. |
| GPU Utilization (%) | GPUUtilization | The percentage of GPU units used by the containers on an instance. Each GPU ranges from 0%–100%, so the value is scaled by the number of GPUs; for example, if there are four GPUs, the GPUUtilization range is 0%–400%. For endpoint variants, the value is the sum of the GPU utilization of the primary and supplementary containers on the instance. |
| GPU Memory Utilization (%) | GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. Each GPU ranges from 0%–100%, so the value is scaled by the number of GPUs; for example, if there are four GPUs, the GPUMemoryUtilization range is 0%–400%. For endpoint variants, the value is the sum of the GPU memory utilization of the primary and supplementary containers on the instance. |
| Invoke Endpoint Requests (Count) | Invocations | The number of InvokeEndpoint requests sent to a model endpoint. |
| Invoke Endpoint Errors (Count) | Invocation4XXErrors | The number of InvokeEndpoint requests where the model returned a 4xx HTTP response code. For each 4xx response, 1 is sent; otherwise, 0 is sent. |
| Invoke Endpoint Errors (Count) | Invocation5XXErrors | The number of InvokeEndpoint requests where the model returned a 5xx HTTP response code. For each 5xx response, 1 is sent; otherwise, 0 is sent. |
| Model Cache Hits (Count) | ModelCacheHit | The number of InvokeEndpoint requests sent to the multi-model endpoint for which the model was already loaded. |
| Model Latency (ms) | ModelLatency | The interval of time taken by a model to respond to a SageMaker Runtime API request. This interval includes the local communication time taken to send the request and fetch the response from the model container, and the time taken to complete the inference in the container. |
| Model Latency Overhead (ms) | OverheadLatency | The interval of time added by SageMaker overhead to the time taken to respond to a client request. This interval is measured from the time SageMaker receives the request until it returns a response to the client, minus the ModelLatency. Overhead latency can vary depending on multiple factors, including request and response payload sizes, request frequency, and authentication/authorization of the request. |
| Model Operation Time (ms) | ModelLoadingTime | The interval of time that it took to load the model through the container's LoadModel API call. |
| Model Operation Time (ms) | ModelDownloadingTime | The interval of time that it took to download the model from Amazon Simple Storage Service (Amazon S3). |
| Model Operation Time (ms) | ModelUnloadingTime | The interval of time that it took to unload the model through the container's UnloadModel API call. |
| Model Operation Time (ms) | ModelLoadingWaitTime | The interval of time that an invocation request waited for the target model to be downloaded, loaded, or both, before performing inference. |
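These source metrics are published to Amazon CloudWatch under the AWS/SageMaker namespace, with EndpointName and VariantName dimensions for the invocation metrics. As a rough illustration of how one of them could be retrieved outside Cisco Cloud Observability, the sketch below builds the parameter dict for a CloudWatch GetMetricStatistics call for the Invocations metric; the endpoint and variant names are hypothetical, and the resulting dict could be passed to boto3's `cloudwatch.get_metric_statistics(**params)`.

```python
from datetime import datetime, timedelta, timezone

def invocations_query(endpoint_name, variant_name, hours=1):
    """Build GetMetricStatistics parameters for the Invocations metric
    of a single SageMaker endpoint variant (AWS/SageMaker namespace)."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocations",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,          # 5-minute buckets
        "Statistics": ["Sum"],  # Invocations is a count, so Sum is the usual choice
    }

# Hypothetical endpoint and variant names for illustration:
params = invocations_query("my-endpoint", "AllTraffic")
```

The same parameter shape works for the other invocation metrics by swapping the MetricName (for example, Invocation4XXErrors or ModelLatency, the latter typically queried with the Average statistic instead of Sum).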
| Display Name | Source Metric Name | Description |
|---|---|---|
| Normalized CPU Utilization (%) | CPUUtilizationNormalized | The normalized sum of each individual CPU core's utilization. The value ranges from 0%–100%. For example, if there are four CPUs and the CPUUtilization metric is 200%, the CPUUtilizationNormalized metric is 50%. |
| Normalized GPU Utilization (%) | GPUUtilizationNormalized | The normalized percentage of GPU units used by the containers on an instance. The value ranges from 0%–100%. For example, if there are four GPUs and the GPUUtilization metric is 200%, the GPUUtilizationNormalized metric is 50%. |
| Normalized GPU Memory Utilization (%) | GPUMemoryUtilizationNormalized | The normalized percentage of GPU memory used by the containers on an instance. The value ranges from 0%–100%. For example, if there are four GPUs and the GPUMemoryUtilization metric is 200%, the GPUMemoryUtilizationNormalized metric is 50%. |
| CPU Utilization (%) | CPUUtilization | The sum of each individual CPU core's utilization. Each core's utilization ranges from 0%–100%; for example, if there are four CPUs, the CPUUtilization range is 0%–400%. For processing jobs, the value is the CPU utilization of the processing container on the instance. |
| Disk Utilization (%) | DiskUtilization | The percentage of disk space used by the containers on an instance. The value ranges from 0%–100%. This metric is not supported for batch transform jobs. For processing jobs, the value is the disk space utilization of the processing container on the instance. |
| Memory Utilization (%) | MemoryUtilization | The percentage of memory used by the containers on an instance. The value ranges from 0%–100%. For processing jobs, the value is the memory utilization of the processing container on the instance. |
| GPU Utilization (%) | GPUUtilization | The percentage of GPU units used by the containers on an instance. Each GPU ranges from 0%–100%, so the value is scaled by the number of GPUs; for example, if there are four GPUs, the GPUUtilization range is 0%–400%. For processing jobs, the value is the GPU utilization of the processing container on the instance. |
| GPU Memory Utilization (%) | GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. Each GPU ranges from 0%–100%, so the value is scaled by the number of GPUs; for example, if there are four GPUs, the GPUMemoryUtilization range is 0%–400%. For processing jobs, the value is the GPU memory utilization of the processing container on the instance. |
| CPU Reservation (%) | CPUReservation | The sum of CPUs reserved by containers on an instance. The value ranges from 0%–100%. In the settings for an inference component, you set the CPU reservation with the NumberOfCpuCoresRequired parameter. For example, if there are 4 CPUs and 2 are reserved, the CPUReservation metric is 50%. |
| GPU Reservation (%) | GPUReservation | The sum of GPUs reserved by containers on an instance. The value ranges from 0%–100%. In the settings for an inference component, you set the GPU reservation with the NumberOfAcceleratorDevicesRequired parameter. For example, if there are 4 GPUs and 2 are reserved, the GPUReservation metric is 50%. |
| Memory Reservation (%) | MemoryReservation | The sum of memory reserved by containers on an instance. The value ranges from 0%–100%. In the settings for an inference component, you set the memory reservation with the MinMemoryRequiredInMb parameter. For example, if a 32 GiB instance reserves 1,024 MB, the MemoryReservation metric is about 3%. |
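The relationship between the raw and normalized utilization metrics, and between reserved and total capacity, is simple arithmetic: raw utilization sums to 100% per device, so dividing by the device count yields the normalized 0%–100% figure, and a reservation percentage is the reserved share of total capacity. A minimal sketch of both conversions (the function names are illustrative, not part of any API):

```python
def normalize_utilization(raw_percent, device_count):
    """Convert a raw per-device utilization sum (0% to 100% * device_count)
    to its normalized 0%-100% equivalent, e.g. CPUUtilizationNormalized."""
    if device_count <= 0:
        raise ValueError("device_count must be positive")
    return raw_percent / device_count

def reservation_percent(reserved, total):
    """Compute a reservation metric such as GPUReservation:
    the reserved share of total capacity, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * reserved / total

# Matches the documented examples:
print(normalize_utilization(200.0, 4))  # 200% CPUUtilization on 4 CPUs -> 50.0
print(reservation_percent(2, 4))        # 2 of 4 GPUs reserved -> 50.0
```

The same arithmetic explains the memory example: 1,024 MB reserved on a 32 GiB (32,768 MiB) instance works out to roughly 3%.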


Properties (Attributes)

Cisco Cloud Observability displays the following properties for Amazon SageMaker.

| Display Name | Property Name | Description |
|---|---|---|
| Name | aws.sagemaker_endpoint.name | The name of the SageMaker endpoint. |
| ARN | aws.sagemaker_endpoint.arn | The ARN of the SageMaker endpoint. |
| Config Name | aws.sagemaker_endpoint.config_name | The name of the endpoint configuration associated with this endpoint. |
| Current Sampling % | aws.sagemaker_endpoint.current_sampling_percentage | The percentage of requests being captured by the endpoint. |
| Production Variant Names | aws.sagemaker_endpoint.production_variant_names | The names of the production variants. |
| Shadow Variant Names | aws.sagemaker_endpoint.shadow_variant_names | The names of the shadow variants. |

| Display Name | Property Name | Description |
|---|---|---|
| Job Name | aws.sagemaker_job.name | The name of the SageMaker job. |
| ARN | aws.sagemaker_job.arn | The ARN of the SageMaker job. |
| S3 Source ARNs | aws.sagemaker_job.s3_source_arns | The S3 locations of the data sources. |
| Athena Workgroup Name | aws.sagemaker_processing_job.athena_workgroup_name | The name of the workgroup in which the Athena query is started. |
| S3 Destination ARN | aws.sagemaker_processing_job.destination_s3_arn | A URI that identifies the Amazon S3 bucket where you want Amazon SageMaker to save the results of a processing job. |
| IAM Role ARN | aws.sagemaker_processing_job.iam_role_arn | The Amazon Resource Name (ARN) of an IAM role that Amazon SageMaker can assume to perform tasks. |
| Training Job ARN | aws.sagemaker_processing_job.training_job_arn | The ARN of a training job associated with this processing job. |

| Display Name | Property Name | Description |
|---|---|---|
| Job Name | aws.sagemaker_job.name | The name of the SageMaker job. |
| ARN | aws.sagemaker_job.arn | The ARN of the SageMaker job. |
| S3 Source ARNs | aws.sagemaker_job.s3_source_arns | The S3 locations of the data sources. |
| Artifacts Destination S3 ARN | aws.sagemaker_training_job.destination_s3_arn | Identifies the S3 path where you want SageMaker to store the model artifacts. |
| Checkpoints S3 ARN | aws.sagemaker_training_job.checkpoints_s3_arn | Identifies the S3 path where you want SageMaker to store checkpoints. |
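Most of the endpoint properties above correspond to fields in the response of the SageMaker DescribeEndpoint API (as returned, for example, by boto3's `sagemaker.describe_endpoint`). The sketch below maps a response-shaped dict to the displayed property names; the sample values are hypothetical, and the helper is illustrative rather than part of the product:

```python
def endpoint_properties(desc):
    """Map a DescribeEndpoint-style response dict to the displayed
    endpoint property names."""
    return {
        "aws.sagemaker_endpoint.name": desc["EndpointName"],
        "aws.sagemaker_endpoint.arn": desc["EndpointArn"],
        "aws.sagemaker_endpoint.config_name": desc["EndpointConfigName"],
        # Present only when data capture is configured on the endpoint.
        "aws.sagemaker_endpoint.current_sampling_percentage":
            desc.get("DataCaptureConfig", {}).get("CurrentSamplingPercentage"),
        "aws.sagemaker_endpoint.production_variant_names":
            [v["VariantName"] for v in desc.get("ProductionVariants", [])],
        "aws.sagemaker_endpoint.shadow_variant_names":
            [v["VariantName"] for v in desc.get("ShadowProductionVariants", [])],
    }

# Hypothetical DescribeEndpoint response fragment:
sample = {
    "EndpointName": "churn-model",
    "EndpointArn": "arn:aws:sagemaker:us-east-1:123456789012:endpoint/churn-model",
    "EndpointConfigName": "churn-model-config",
    "DataCaptureConfig": {"CurrentSamplingPercentage": 20},
    "ProductionVariants": [{"VariantName": "AllTraffic"}],
    "ShadowProductionVariants": [],
}
props = endpoint_properties(sample)
```

The job properties map similarly onto the DescribeTrainingJob and DescribeProcessingJob responses (for example, the training job's artifacts destination comes from its output data configuration).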


Retention and Purge Time-To-Live (TTL)

For all cloud and infrastructure entities, the retention TTL is 180 minutes (3 hours) and the purge TTL is 525,600 minutes (365 days). 

Amazon Web Services, the AWS logo, AWS, and any other AWS Marks used in these materials are trademarks of Amazon.com, Inc. or its affiliates in the United States and/or other countries.