Introduction

An Azure Machine Learning (Azure ML) Workspace is a centralized platform within Azure Machine Learning Services that enables data scientists and developers to efficiently manage machine learning (ML) projects. It acts as a collaborative environment for building, training, deploying, and monitoring ML models while ensuring security and scalability.

Use the OpsRamp Azure Public Cloud Integration to discover Machine Learning Services Workspaces and collect metrics for them.

Setup

To set up the Azure integration and discover the Azure Machine Learning Services Workspaces resources, do the following:

  1. Create an Azure Integration if one is not already among your installed integrations. For more information on how to install the Azure Integration, refer to Install Azure Integration.
  2. Create a discovery profile. For more information on how to create a discovery profile, refer to Create Discovery Profile.
  3. Select Machine Learning Services Workspaces under the Filter Criteria in the Edit Discovery Profile page.
  4. Save the discovery profile to make it available in the list of Discovery Profiles.
  5. Run a scan to discover the resources. You can scan at any time, independent of the predefined schedule.
  6. Once the scan is completed, you can view the Machine Learning Services Workspaces resources under the Infrastructure > Resources > Microsoft Azure category.

Event support

OpsRamp supports Azure events for Machine Learning Services Workspaces. Configure Azure Events in the OpsRamp Azure integration discovery profile.

See Process Azure Events for more information on how to configure Azure events.

Supported metrics

| OpsRamp Metric | Azure Metric | Metric Display Name | Unit | Aggregation Type | Description |
| --- | --- | --- | --- | --- | --- |
| azure_ml_services_workspaces_Agents | Agents | Agents | Count | Average | Number of events for AI Agents in this workspace |
| azure_ml_services_workspaces_IndexedFiles | IndexedFiles | IndexedFiles | Count | Average | Number of files indexed for file search in this workspace |
| azure_ml_services_workspaces_Messages | Messages | Messages | Count | Average | Number of events for AI Agent messages in this workspace |
| azure_ml_services_workspaces_Runs | Runs | Runs | Count | Average | Number of runs by AI Agents in this workspace |
| azure_ml_services_workspaces_Threads | Threads | Threads | Count | Average | Number of events for AI Agent threads in this workspace |
| azure_ml_services_workspaces_Tokens | Tokens | Tokens | Count | Average | Count of tokens by AI Agents in this workspace |
| azure_ml_services_workspaces_ToolCalls | ToolCalls | ToolCalls | Count | Average | Tool calls made by AI Agents in this workspace |
| azure_ml_services_workspaces_Model_Deploy_Failed | Model Deploy Failed | Model Deploy Failed | Count | Total | Number of model deployments that failed in this workspace |
| azure_ml_services_workspaces_Model_Deploy_Started | Model Deploy Started | Model Deploy Started | Count | Total | Number of model deployments started in this workspace |
| azure_ml_services_workspaces_Model_Deploy_Succeeded | Model Deploy Succeeded | Model Deploy Succeeded | Count | Total | Number of model deployments that succeeded in this workspace |
| azure_ml_services_workspaces_Model_Register_Failed | Model Register Failed | Model Register Failed | Count | Total | Number of model registrations that failed in this workspace |
| azure_ml_services_workspaces_Model_Register_Succeeded | Model Register Succeeded | Model Register Succeeded | Count | Total | Number of model registrations that succeeded in this workspace |
| azure_ml_services_workspaces_Active_Cores | Active Cores | Active Cores | Count | Average | Number of active cores |
| azure_ml_services_workspaces_Active_Nodes | Active Nodes | Active Nodes | Count | Average | Number of active nodes. These are the nodes which are actively running a job |
| azure_ml_services_workspaces_Idle_Cores | Idle Cores | Idle Cores | Count | Average | Number of idle cores |
| azure_ml_services_workspaces_Idle_Nodes | Idle Nodes | Idle Nodes | Count | Average | Number of idle nodes. Idle nodes are the nodes which are not running any jobs but can accept a new job if available |
| azure_ml_services_workspaces_Leaving_Cores | Leaving Cores | Leaving Cores | Count | Average | Number of leaving cores |
| azure_ml_services_workspaces_Leaving_Nodes | Leaving Nodes | Leaving Nodes | Count | Average | Number of leaving nodes. Leaving nodes are the nodes which just finished processing a job and will go to Idle state |
| azure_ml_services_workspaces_Preempted_Cores | Preempted Cores | Preempted Cores | Count | Average | Number of preempted cores |
| azure_ml_services_workspaces_Preempted_Nodes | Preempted Nodes | Preempted Nodes | Count | Average | Number of preempted nodes. These nodes are the low priority nodes which are taken away from the available node pool |
| azure_ml_services_workspaces_Quota_Utilization_Percentage | Quota Utilization Percentage | Quota Utilization Percentage | Count | Average | Percent of quota utilized |
| azure_ml_services_workspaces_Total_Cores | Total Cores | Total Cores | Count | Average | Number of total cores |
| azure_ml_services_workspaces_Total_Nodes | Total Nodes | Total Nodes | Count | Average | Number of total nodes. This total includes some of Active Nodes, Idle Nodes, Unusable Nodes, Preempted Nodes, Leaving Nodes |
| azure_ml_services_workspaces_Unusable_Cores | Unusable Cores | Unusable Cores | Count | Average | Number of unusable cores |
| azure_ml_services_workspaces_Unusable_Nodes | Unusable Nodes | Unusable Nodes | Count | Average | Number of unusable nodes. Unusable nodes are not functional due to some unresolvable issue. Azure will recycle these nodes |
| azure_ml_services_workspaces_CpuCapacityMillicores | CpuCapacityMillicores | CpuCapacityMillicores | Count | Average | Maximum capacity of a CPU node in millicores. Capacity is aggregated in one minute intervals |
| azure_ml_services_workspaces_CpuMemoryCapacityMegabytes | CpuMemoryCapacityMegabytes | CpuMemoryCapacityMegabytes | Count | Average | Maximum memory utilization of a CPU node in megabytes. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_CpuMemoryUtilizationMegabytes | CpuMemoryUtilizationMegabytes | CpuMemoryUtilizationMegabytes | Count | Average | Memory utilization of a CPU node in megabytes. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_CpuMemoryUtilizationPercentage | CpuMemoryUtilizationPercentage | CpuMemoryUtilizationPercentage | Count | Average | Memory utilization percentage of a CPU node. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_CpuUtilization | CpuUtilization | CpuUtilization | Count | Average | Percentage of utilization on a CPU node. Utilization is reported at one minute intervals |
| azure_ml_services_workspaces_CpuUtilizationMillicores | CpuUtilizationMillicores | CpuUtilizationMillicores | Count | Average | Utilization of a CPU node in millicores. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_CpuUtilizationPercentage | CpuUtilizationPercentage | CpuUtilizationPercentage | Count | Average | Utilization percentage of a CPU node. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_DiskAvailMegabytes | DiskAvailMegabytes | DiskAvailMegabytes | Count | Average | Available disk space in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_DiskReadMegabytes | DiskReadMegabytes | DiskReadMegabytes | Count | Average | Data read from disk in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_DiskUsedMegabytes | DiskUsedMegabytes | DiskUsedMegabytes | Count | Average | Used disk space in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_DiskWriteMegabytes | DiskWriteMegabytes | DiskWriteMegabytes | Count | Average | Data written to disk in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_GpuCapacityMilliGPUs | GpuCapacityMilliGPUs | GpuCapacityMilliGPUs | Count | Average | Maximum capacity of a GPU device in milli-GPUs. Capacity is aggregated in one minute intervals |
| azure_ml_services_workspaces_GpuEnergyJoules | GpuEnergyJoules | GpuEnergyJoules | Count | Average | Interval energy in Joules on a GPU node. Energy is reported at one minute intervals |
| azure_ml_services_workspaces_GpuMemoryCapacityMegabytes | GpuMemoryCapacityMegabytes | GpuMemoryCapacityMegabytes | Count | Average | Maximum memory capacity of a GPU device in megabytes. Capacity is aggregated in one minute intervals |
| azure_ml_services_workspaces_GpuMemoryUtilization | GpuMemoryUtilization | GpuMemoryUtilization | Count | Average | Percentage of memory utilization on a GPU node. Utilization is reported at one minute intervals |
| azure_ml_services_workspaces_GpuMemoryUtilizationMegabytes | GpuMemoryUtilizationMegabytes | GpuMemoryUtilizationMegabytes | Count | Average | Memory utilization of a GPU device in megabytes. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_GpuMemoryUtilizationPercentage | GpuMemoryUtilizationPercentage | GpuMemoryUtilizationPercentage | Count | Average | Memory utilization percentage of a GPU device. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_GpuUtilization | GpuUtilization | GpuUtilization | Count | Average | Percentage of utilization on a GPU node. Utilization is reported at one minute intervals |
| azure_ml_services_workspaces_GpuUtilizationMilliGPUs | GpuUtilizationMilliGPUs | GpuUtilizationMilliGPUs | Count | Average | Utilization of a GPU device in milli-GPUs. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_GpuUtilizationPercentage | GpuUtilizationPercentage | GpuUtilizationPercentage | Count | Average | Utilization percentage of a GPU device. Utilization is aggregated in one minute intervals |
| azure_ml_services_workspaces_IBReceiveMegabytes | IBReceiveMegabytes | IBReceiveMegabytes | Count | Average | Network data received over InfiniBand in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_IBTransmitMegabytes | IBTransmitMegabytes | IBTransmitMegabytes | Count | Average | Network data sent over InfiniBand in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_NetworkInputMegabytes | NetworkInputMegabytes | NetworkInputMegabytes | Count | Average | Network data received in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_NetworkOutputMegabytes | NetworkOutputMegabytes | NetworkOutputMegabytes | Count | Average | Network data sent in megabytes. Metrics are aggregated in one minute intervals |
| azure_ml_services_workspaces_StorageAPIFailureCount | StorageAPIFailureCount | StorageAPIFailureCount | Count | Average | Azure Blob Storage API calls failure count |
| azure_ml_services_workspaces_StorageAPISuccessCount | StorageAPISuccessCount | StorageAPISuccessCount | Count | Average | Azure Blob Storage API calls success count |
| azure_ml_services_workspaces_Cancel_Requested_Runs | Cancel Requested Runs | Cancel Requested Runs | Count | Total | Number of runs where cancel was requested for this workspace. Count is updated when a cancellation request has been received for a run |
| azure_ml_services_workspaces_Cancelled_Runs | Cancelled Runs | Cancelled Runs | Count | Total | Number of runs cancelled for this workspace. Count is updated when a run is successfully cancelled |
| azure_ml_services_workspaces_Completed_Runs | Completed Runs | Completed Runs | Count | Total | Number of runs completed successfully for this workspace. Count is updated when a run has completed and output has been collected |
| azure_ml_services_workspaces_Errors | Errors | Errors | Count | Total | Number of run errors in this workspace. Count is updated whenever a run encounters an error |
| azure_ml_services_workspaces_Failed_Runs | Failed Runs | Failed Runs | Count | Total | Number of runs failed for this workspace. Count is updated when a run fails |
| azure_ml_services_workspaces_Finalizing_Runs | Finalizing Runs | Finalizing Runs | Count | Total | Number of runs entered finalizing state for this workspace. Count is updated when a run has completed but output collection is still in progress |
| azure_ml_services_workspaces_Not_Responding_Runs | Not Responding Runs | Not Responding Runs | Count | Total | Number of runs not responding for this workspace. Count is updated when a run enters Not Responding state |
| azure_ml_services_workspaces_Not_Started_Runs | Not Started Runs | Not Started Runs | Count | Total | Number of runs in Not Started state for this workspace. Count is updated when a request is received to create a run but run information has not yet been populated |
| azure_ml_services_workspaces_Preparing_Runs | Preparing Runs | Preparing Runs | Count | Total | Number of runs that are preparing for this workspace. Count is updated when a run enters Preparing state while the run environment is being prepared |
| azure_ml_services_workspaces_Provisioning_Runs | Provisioning Runs | Provisioning Runs | Count | Total | Number of runs that are provisioning for this workspace. Count is updated when a run is waiting on compute target creation or provisioning |
| azure_ml_services_workspaces_Queued_Runs | Queued Runs | Queued Runs | Count | Total | Number of runs that are queued for this workspace. Count is updated when a run is queued in compute target. Can occur when waiting for required compute nodes to be ready |
| azure_ml_services_workspaces_Started_Runs | Started Runs | Started Runs | Count | Total | Number of runs running for this workspace. Count is updated when a run starts running on required resources |
| azure_ml_services_workspaces_Starting_Runs | Starting Runs | Starting Runs | Count | Total | Number of runs started for this workspace. Count is updated after request to create run and run info, such as the Run Id, has been populated |
| azure_ml_services_workspaces_Warnings | Warnings | Warnings | Count | Total | Number of run warnings in this workspace. Count is updated whenever a run encounters a warning |
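
Judging from the names listed in this table, each OpsRamp metric name appears to be the Azure metric name with spaces replaced by underscores and the prefix azure_ml_services_workspaces_ added. This is an observation drawn from the listed rows, not a documented OpsRamp rule. The following minimal Python sketch illustrates that mapping; the helper name to_opsramp_metric is hypothetical and is not part of any OpsRamp or Azure API.

```python
# Prefix shared by every OpsRamp metric name listed in the table above.
OPSRAMP_PREFIX = "azure_ml_services_workspaces_"


def to_opsramp_metric(azure_metric: str) -> str:
    """Hypothetical helper: derive the OpsRamp metric name from an Azure metric name.

    Assumes the pattern seen in the table: spaces become underscores and
    the shared prefix is prepended. Casing is left unchanged.
    """
    return OPSRAMP_PREFIX + azure_metric.strip().replace(" ", "_")


if __name__ == "__main__":
    # Azure metric names taken from the table above.
    for name in ("Model Deploy Failed", "CpuUtilization", "Quota Utilization Percentage"):
        print(f"{name} -> {to_opsramp_metric(name)}")
```

A helper like this can be convenient when cross-referencing metric names seen in the Azure portal against the names shown in OpsRamp, but check the result against the table before relying on it.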