Camunda SaaS health monitoring and runtime error handling

Overview

Apr 2026: Updated for Camunda 8.9 SaaS

In this article we review infrastructure monitoring within the Camunda 8 SaaS environment. Monitoring for Camunda Self-Managed is a separate topic and is not covered here.

It's essential to recognise that monitoring in a SaaS model, especially with a platform like Camunda, comes with a predefined scope. This scope is primarily determined by the metrics and monitoring capabilities that Camunda chooses to expose to its users.

By opting for a SaaS solution, the responsibility of maintaining the platform's health and performance is largely assumed by Camunda itself. This model frees us from much of the manual, low-level monitoring work, allowing us to focus on what matters most: process efficiency.

One of the features of Camunda SaaS is its built-in high availability and fault tolerance. Through a replication mechanism, the platform maintains continuity and resilience against failures, whether they come from hardware malfunctions or software anomalies. This level of redundancy is crucial: it is designed to keep downtime minimal and protect against data loss, while requiring little to no manual intervention from our end.

What's new since the original article was published?

Since this article was first published, Camunda has significantly expanded the monitoring capabilities available to SaaS customers. The most notable addition is the Cluster Metrics Endpoint - a dedicated, per-cluster Prometheus-compatible metrics endpoint that allows teams to scrape application-level metrics (Zeebe throughput, latency, disk and memory usage, partition health, and more) directly into their existing observability stacks. This was previously not available, and the monitoring landscape was limited to the Management API health status, the Operate API, and Camunda's built-in Alerts.

In this updated article, we'll cover the original monitoring strategies (Management API, Operate API, Camunda Alerts) and introduce the new Cluster Metrics Endpoint, along with guidance on integrating it with Prometheus, Grafana, Dynatrace, and Datadog.

Camunda Management APIs

These APIs serve as the backbone for programmatically managing various components of the Camunda SaaS environment. You may find information about available APIs in the Swagger UI. Information on how to authenticate can be found in the official Camunda documentation.

For monitoring purposes, the most relevant capability is retrieving the health status of clusters, the workflow engine, and the stock applications running inside the cluster.

It allows us to programmatically assess the operational health of our Camunda infrastructure. Through a simple API call, we gain insights into the status of Camunda SaaS clusters.

While the Management API offers critical insights into the health status of the clusters, it's important to note that its scope is limited to a high-level Healthy / Unhealthy status. It does not expose granular performance metrics such as throughput, latency, or resource utilisation. For deeper observability, see the new Cluster Metrics Endpoint section below.

Monitoring health state of an environment/cluster

For this we can use the GET /clusters/{clusterUuid} API. The response contains the overall ready state of the cluster:

// GET {{baseUrl}}/clusters/:clusterUuid

{
  "status": {
    "ready": "Healthy"
  }
}

Monitoring health of workflow engine and cluster stock applications

The same API endpoint returns the health of the individual cluster components (Zeebe engine, Operate, Optimize, and Tasklist):

{
  "zeebeStatus":    "Healthy",
  "operateStatus":  "Healthy",
  "tasklistStatus": "Healthy",
  "optimizeStatus": "Healthy"
}
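A minimal sketch of how a script might interpret this response. It assumes the component fields sit inside the same status object as ready; the two excerpts above do not show the nesting, so treat that as an assumption. The function name unhealthy_components is illustrative.

```python
# Component fields shown in the response excerpts above.
COMPONENT_FIELDS = ("zeebeStatus", "operateStatus", "tasklistStatus", "optimizeStatus")


def unhealthy_components(cluster: dict) -> list[str]:
    """Return the names of status fields that are not reported as "Healthy".

    Assumption: component fields live under "status" next to "ready".
    A missing field is reported as unhealthy rather than ignored.
    """
    status = cluster.get("status", {})
    fields = ("ready",) + COMPONENT_FIELDS
    return [name for name in fields if status.get(name) != "Healthy"]


# Illustrative sample, shaped like the excerpts above.
sample = {
    "status": {
        "ready": "Healthy",
        "zeebeStatus": "Healthy",
        "operateStatus": "Unhealthy",
        "tasklistStatus": "Healthy",
        "optimizeStatus": "Healthy",
    }
}
print(unhealthy_components(sample))  # ['operateStatus']
```

An empty result means every reported field is Healthy; anything else can be forwarded to the alerting channel of your choice.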

Example

The following example shows how to use this API in a monitoring tool such as Dynatrace.

High-level procedure:

1) Set up a "Synthetic Classic" HTTP Monitor in Dynatrace.

2) Set up two HTTP requests: one to obtain the OAuth 2.0 token for authentication, and one to call the health monitoring API.

3) Set the frequency according to your requirements.

4) Create a dashboard based on this HTTP Monitor.
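Outside Dynatrace, the same two-request sequence can be sketched in Python with only the standard library. The Management API host and the OAuth audience used here are assumptions that do not come from this article, so verify both against the Camunda documentation. All function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

# The token URL matches the one used later in this article.
# MANAGEMENT_API and MANAGEMENT_AUDIENCE are assumptions; verify before use.
TOKEN_URL = "https://login.cloud.camunda.io/oauth/token"
MANAGEMENT_API = "https://api.cloud.camunda.io"
MANAGEMENT_AUDIENCE = "api.cloud.camunda.io"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Request 1: exchange client credentials for an access token."""
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "audience": MANAGEMENT_AUDIENCE,
    }).encode()
    return urllib.request.Request(
        TOKEN_URL,
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def build_health_request(token: str, cluster_uuid: str) -> urllib.request.Request:
    """Request 2: read the cluster health using the bearer token."""
    return urllib.request.Request(
        f"{MANAGEMENT_API}/clusters/{cluster_uuid}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def fetch_json(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

A scheduled job would call fetch_json(build_token_request(...)) first, read access_token from the result, and then call fetch_json(build_health_request(...)) with it.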

Cluster Metrics Endpoint (Prometheus) - New

As of early 2026, Camunda exposes a dedicated, per-cluster Prometheus-compatible metrics endpoint that provides application-level metrics such as Zeebe throughput, processing latency, backpressure, disk and memory usage, partition health, and more. This represents a major improvement over the Management API's binary health status.

With the Cluster Metrics endpoint, Camunda introduces a supported and secure way to integrate Camunda 8 SaaS metrics into your existing observability stack. Most Camunda customers already rely on Prometheus-compatible tooling - whether Prometheus itself, Grafana Agent, or platforms such as Datadog. Previously, there was no direct access to Camunda application-level metrics for alerting, correlation, and root-cause analysis alongside the rest of the infrastructure.

The endpoint is scoped strictly to the customer's own cluster namespace. This means teams can monitor Camunda using the same processes and standards already applied to their other production systems, without needing to understand Camunda's internal platform topology or depend on Camunda-managed dashboards.

Enabling the endpoint

The Cluster Metrics endpoint is enabled per Orchestration cluster via the Camunda Console:

1) Sign in to Camunda Console.

2) Navigate to Clusters → select your cluster.

3) Open the Monitoring tab.

4) Click Activate monitoring endpoint.

5) Enter a username for the monitoring credentials.

6) Click Activate, then copy and securely store the generated password (it will not be shown again).

7) Wait for the Monitoring endpoint to reach ready state. Note the Monitoring endpoint URL.

⚠️ Important

Copy and safely store the password when it is displayed. The password is not shown again after you close the dialog. If you lose it, you will need to generate a new password. You can create up to 20 credentials per cluster.

Prometheus scrape configuration

Once enabled, integrating with Prometheus is straightforward. The endpoint behaves like any other HTTPS scrape target and can be added to an existing Prometheus configuration using Basic Auth credentials.

Note that the monitoring endpoint URL should be split into the server name (<c8-location-code>.monitoring.camunda.io) used as the target, and the cluster ID used in metrics_path.

prometheus.yml

scrape_configs:
  - job_name: c8-cluster
    scheme: https
    metrics_path: /<cluster-id>
    basic_auth:
      username: "<MONITORING_USERNAME>"
      password: "<MONITORING_PASSWORD>"
    static_configs:
      - targets:
          - <c8-location-code>.monitoring.camunda.io
    scrape_interval: 30s
    scrape_timeout: 5s

After this is in place, Camunda metrics appear alongside your other monitored services. Teams can define alerts, build dashboards, or ingest the metrics to downstream systems without special handling.
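The URL split described above (server name as the target, cluster ID as metrics_path) can be sketched as a small helper. The function name and the sample URL are hypothetical and only serve as illustration.

```python
from urllib.parse import urlsplit


def split_monitoring_url(url: str) -> tuple[str, str]:
    """Split a monitoring endpoint URL into (target, metrics_path)."""
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.netloc:
        raise ValueError(f"expected an https URL, got {url!r}")
    path = "/" + parts.path.strip("/")
    if path == "/":
        raise ValueError("URL has no cluster ID path segment")
    return parts.netloc, path


# Hypothetical URL, for illustration only.
target, metrics_path = split_monitoring_url(
    "https://xyz-1.monitoring.camunda.io/00000000-0000-0000-0000-000000000000"
)
print(target)        # xyz-1.monitoring.camunda.io
print(metrics_path)  # /00000000-0000-0000-0000-000000000000
```

The first value goes under targets and the second under metrics_path in the scrape configuration.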

Grafana dashboards

Camunda provides a prebuilt Grafana dashboard you can import directly from the camunda/camunda GitHub repository. It highlights key performance indicators and metrics for each supported Camunda version, including cluster topology, throughput, handled requests, exported events per second, disk and memory usage, and more.

ℹ️ Note

The sample dashboards are intended as a reference and may rely on metrics from additional sources such as kube-state-metrics and node-exporter. Metric sets may vary between Camunda versions. You are free to adapt the dashboards to your own conventions and operational needs.

Non-Prometheus systems (Dynatrace, Datadog)

If you do not run Prometheus itself, you can still integrate with the Cluster Metrics endpoint. Common approaches include:

Datadog: Datadog natively supports Prometheus scraping via its OpenMetrics integration. Configure the Datadog Agent to scrape the Cluster Metrics endpoint URL with the Basic Auth credentials.
Dynatrace: Use a script (similar to the approach described in the Management API section) to periodically scrape the Prometheus endpoint and push the data to Dynatrace via its Metrics API v2. This allows you to build dashboards, alerts, and notifications on top of the rich metrics data.
Grafana Agent / Alloy: These agents natively support Prometheus remote-write and can scrape the endpoint directly.

See the Camunda documentation on non-Prometheus integrations for additional details.
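For the script-based Dynatrace approach, the conversion step might look like the sketch below: it parses Prometheus text-format samples and renders them as lines for the Dynatrace metric ingestion protocol. This is a simplification under stated assumptions: every sample is sent as a plain gauge value (counters are not converted to deltas), non-finite values are dropped, the "camunda" key prefix is arbitrary, and the metric names in the sample are invented and are not real Camunda metric names.

```python
import math
import re

# name{labels} value [timestamp]
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_prometheus_text(text: str):
    """Yield (name, labels, value) for each sample line; skip comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, raw_value = match.groups()
        try:
            value = float(raw_value)
        except ValueError:
            continue
        yield name, dict(LABEL_RE.findall(raw_labels or "")), value


def to_dynatrace_lines(samples, prefix: str = "camunda") -> list[str]:
    """Render samples as 'key,dim="value" number' lines."""
    lines = []
    for name, labels, value in samples:
        if not math.isfinite(value):
            continue  # NaN and Inf samples are dropped in this sketch
        key = f"{prefix}.{name}".replace(":", "_").lower()
        dims = "".join(f',{k.lower()}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{key}{dims} {value}")
    return lines


# Invented metric names, for illustration only.
sample_text = """# TYPE example_requests_total counter
example_requests_total{partition="1"} 42
example_memory_bytes 1024.5
"""
for line in to_dynatrace_lines(parse_prometheus_text(sample_text)):
    print(line)
```

The resulting lines would then be posted as plain text to the Dynatrace metrics ingest endpoint with an API token; check the Dynatrace documentation for the exact key and dimension rules before relying on this format.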

Authentication and IP allowlisting

The Cluster Metrics endpoint enforces both authentication and network restrictions:

Restriction	Description
Authentication	Basic Authentication (username + password managed in Console). Up to 20 credentials per cluster.
IP Allowlisting	Enforces the cluster-level IP allowlist. Requests from non-allowlisted IPs are rejected with `403`.
Credential rotation	Old credentials are invalidated within ~5 minutes. Create multiple credentials and rotate without downtime.

Error responses follow standard HTTP status codes: 401 Unauthorized, 403 Forbidden, 429 Too Many Requests, 503 Service Unavailable, 504 Gateway Timeout.
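A custom scraper could map these status codes to follow-up actions roughly as sketched below. The action labels, the backoff parameters, and the function names are illustrative choices. Only the link between 403 and the IP allowlist is stated above; treating 401 as a credentials problem and 429, 503, and 504 as retryable follows the usual meaning of those codes.

```python
RETRYABLE = {429, 503, 504}


def classify_scrape_status(status: int) -> str:
    """Map an HTTP status from the metrics endpoint to a suggested action."""
    if 200 <= status < 300:
        return "ok"
    if status == 401:
        return "check-credentials"
    if status == 403:
        return "check-ip-allowlist"
    if status in RETRYABLE:
        return "retry-with-backoff"
    return "investigate"


def backoff_seconds(attempt: int, base: float = 5.0, cap: float = 300.0) -> float:
    """Exponential backoff: base * 2^attempt, limited to cap seconds."""
    return min(cap, base * (2 ** attempt))


for code in (200, 401, 403, 429, 503, 504):
    print(code, classify_scrape_status(code))
```

Separating configuration errors (401, 403) from transient ones keeps a script from retrying requests that can only succeed after someone changes credentials or the allowlist.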

Camunda SaaS status page

Camunda SaaS runs on Google Cloud Platform (GCP) or Amazon Web Services (AWS).

Like any service, it might occasionally undergo availability changes. When availability changes, Camunda makes sure to provide you with a current service status. You can get notified about changes to the service status automatically via Atom or RSS feeds.

An external script can periodically check the RSS or Atom feed, parse the relevant status information, and send it to the monitoring tool of your choice through that tool's API. For instance, you can use the Dynatrace API to send this data to Dynatrace as custom metrics or events. Scheduling the script to run at regular intervals lets Dynatrace alert you when the status changes.
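The parsing step of such a script might look like the following sketch for an Atom feed. The sample feed is invented; the real feed's entries may carry more elements than the id, title, and updated fields read here. Fetching the feed and pushing results to a monitoring tool are left out.

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def atom_entries(feed_xml) -> list[dict]:
    """Extract id, title, and updated from each entry of an Atom feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        entries.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM),
            "title": entry.findtext("atom:title", default="", namespaces=ATOM),
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM),
        })
    return entries


def new_entries(entries: list[dict], seen_ids: set) -> list[dict]:
    """Keep only entries whose id has not been reported before."""
    return [entry for entry in entries if entry["id"] not in seen_ids]


# Invented feed content, for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example status feed</title>
  <entry>
    <id>incident-2</id>
    <title>Degraded performance (example)</title>
    <updated>2026-04-02T10:00:00Z</updated>
  </entry>
  <entry>
    <id>incident-1</id>
    <title>Resolved (example)</title>
    <updated>2026-04-01T09:00:00Z</updated>
  </entry>
</feed>"""

for entry in new_entries(atom_entries(SAMPLE_FEED), seen_ids={"incident-1"}):
    print(entry["updated"], entry["title"])
```

Persisting the set of seen ids between runs is what turns the periodic poll into a change notification.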

Runtime error handling (Updated for Camunda 8.9 release)

Another important part of monitoring is the runtime monitoring of the process instances that run on your Camunda clusters. This section focuses on runtime behavior and error handling within process instances. Understanding how to manage and mitigate errors is important for maintaining the lifecycle of your process instances.

The main application for this kind of work is Camunda Operate. In this article, however, I’d like to focus on the capabilities of the Orchestration Cluster API.

With Camunda 8.8, the dedicated Operate and Tasklist REST APIs (v1) entered a formal deprecation period, and teams can begin migrating their queries to the Orchestration Cluster REST API. The deprecated APIs remain available in Camunda 8.8 and 8.9, but are not recommended for new implementations.

The planned removal timeline is as follows:

Version	Status
8.8	Operate API deprecated — migration recommended
8.9	Operate API still available, deprecation continued
8.10	Operate API removed

If you are building new integrations or updating existing monitoring scripts, use the Orchestration Cluster REST API (v2) described below.

Orchestration Cluster REST API - incidents monitoring

The Orchestration Cluster REST API lets you interact programmatically with process orchestration capabilities in Camunda 8 including starting, managing, and querying process instances, completing user tasks, resolving incidents, and managing variables. This is now the single unified API replacing the former Operate, Tasklist, and Zeebe APIs.

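For SaaS, the base URL pattern is https://${REGION_ID}.api.camunda.io/${CLUSTER_ID}/v2/. Your Region Id and Cluster Id can be found in the Cluster Details in the Camunda Console. In the endpoint paths below, {BASE_URL} stands for the part before /v2, that is https://${REGION_ID}.api.camunda.io/${CLUSTER_ID}.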

Searching for incidents:

POST {BASE_URL}/v2/incidents/search

The request body uses a consistent filter/sort/page structure shared across all Orchestration Cluster API search endpoints:

{
  "filter": {
    "state": "ACTIVE"
  },
  "sort": [
    { "field": "creationTime", "order": "DESC" }
  ],
  "page": {
    "limit": 50
  }
}

Fetching a single incident by key:

GET {BASE_URL}/v2/incidents/{incidentKey}

Resolving an incident:

POST {BASE_URL}/v2/incidents/{incidentKey}/resolution

Searching incidents for a specific process instance:

POST {BASE_URL}/v2/process-instances/{processInstanceKey}/incidents/search

Note the key naming change from the old Operate API: identifiers are now suffixed with Key (e.g. incidentKey, processInstanceKey) and are returned as strings rather than integers in v2.
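To keep the endpoint paths above in one place, a monitoring script might assemble them with a helper like the one sketched here. The function names are illustrative and the region and cluster values are placeholders. Keys are passed as strings, in line with the note above.

```python
def cluster_base_url(region_id: str, cluster_id: str) -> str:
    """Base URL without the /v2 suffix, matching {BASE_URL} in the paths above."""
    return f"https://{region_id}.api.camunda.io/{cluster_id}"


def incident_endpoints(base_url: str, incident_key: str, process_instance_key: str) -> dict:
    """Return (method, url) pairs for the incident endpoints listed above."""
    return {
        "search": ("POST", f"{base_url}/v2/incidents/search"),
        "get": ("GET", f"{base_url}/v2/incidents/{incident_key}"),
        "resolve": ("POST", f"{base_url}/v2/incidents/{incident_key}/resolution"),
        "search_for_instance": (
            "POST",
            f"{base_url}/v2/process-instances/{process_instance_key}/incidents/search",
        ),
    }


def incident_search_body(state: str = "ACTIVE", limit: int = 50) -> dict:
    """Build the filter/sort/page request body shown above."""
    return {
        "filter": {"state": state},
        "sort": [{"field": "creationTime", "order": "DESC"}],
        "page": {"limit": limit},
    }


# Placeholder region, cluster, and keys.
base = cluster_base_url("xyz-1", "my-cluster-id")
for name, (method, url) in incident_endpoints(base, "123", "456").items():
    print(name, method, url)
```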

Authentication

Authentication for the Orchestration Cluster REST API depends on your environment. For SaaS, use the OAuth 2.0 client credentials flow — the same credential model used for Zeebe and other Camunda SaaS APIs. If you are migrating existing scripts, you can reuse the same OAuth token retrieval step; only the base URL and endpoint paths need updating.

Example: Python script for incident monitoring

At a high level, the script retrieves an OAuth token, calls the incidents search endpoint, parses the response, and routes a summary to an operator:

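import os

import requests

# Credentials and cluster coordinates are read from environment variables
# (the variable names are illustrative).
CLIENT_ID = os.environ["CAMUNDA_CLIENT_ID"]
CLIENT_SECRET = os.environ["CAMUNDA_CLIENT_SECRET"]
REGION = os.environ["CAMUNDA_REGION_ID"]
CLUSTER_ID = os.environ["CAMUNDA_CLUSTER_ID"]

# Step 1: Get OAuth token
token_response = requests.post(
    "https://login.cloud.camunda.io/oauth/token",
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        # Confirm the audience against the API client credentials shown in Console
        "audience": "api.camunda.io"
    },
    timeout=10,
)
token_response.raise_for_status()  # fail early on a rejected token request
token = token_response.json()["access_token"]

# Step 2: Search for active incidents via Orchestration Cluster API v2
response = requests.post(
    f"https://{REGION}.api.camunda.io/{CLUSTER_ID}/v2/incidents/search",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "filter": { "state": "ACTIVE" },
        "sort": [{ "field": "creationTime", "order": "DESC" }],
        "page": { "limit": 100 }
    },
    timeout=10,
)
response.raise_for_status()  # surface 4xx/5xx responses instead of parsing them

incidents = response.json().get("items", [])
# Step 3: Parse, prioritise, and send operator summary...
...
...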

The response items array contains incident objects. Key fields include incidentKey, processInstanceKey, processDefinitionId, errorType, errorMessage, state, and creationTime.
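One possible shape for the parse-and-summarise step is sketched below, using only the fields listed above. Grouping by process definition and error type is one prioritisation choice among many, and the sample items are invented.

```python
from collections import Counter


def summarise_incidents(items: list[dict], top: int = 5) -> str:
    """Build a short text summary grouped by process definition and error type."""
    by_group = Counter(
        (item.get("processDefinitionId", "unknown"), item.get("errorType", "unknown"))
        for item in items
    )
    lines = [f"{len(items)} active incident(s)"]
    for (process_id, error_type), count in by_group.most_common(top):
        lines.append(f"{count:>4}  {process_id}  {error_type}")
    return "\n".join(lines)


# Invented sample items carrying a subset of the fields listed above.
sample_items = [
    {"incidentKey": "1", "processDefinitionId": "order-process", "errorType": "JOB_NO_RETRIES"},
    {"incidentKey": "2", "processDefinitionId": "order-process", "errorType": "JOB_NO_RETRIES"},
    {"incidentKey": "3", "processDefinitionId": "invoice", "errorType": "IO_MAPPING_ERROR"},
]
print(summarise_incidents(sample_items))
```

The returned text can go into an email body or a chat message, which avoids sending one notification per incident.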

Operate API - incidents monitoring (deprecated)

⚠️ Deprecated. The Operate API endpoint below is deprecated as of Camunda 8.8 and will be removed in Camunda 8.10. New implementations should use the Orchestration Cluster REST API v2 described above. Existing scripts should be migrated before upgrading to 8.10.

The legacy endpoint was:

POST {SERVER-URL}/v1/incidents/search

The pattern from the previous version of this article (retrieve OAuth token → call Operate API → parse → email summary) remains valid as a pattern; only the endpoint URL and field names differ in the new API.

Use Orchestration Cluster API in Dynatrace or any other monitoring tool of your choice

Use a script (or Dynatrace Synthetic HTTP Monitor) to call the /v2/incidents/search endpoint with an OAuth token, then push the incident count and details to Dynatrace via its Metrics API v2 or use them to drive dashboards, alerts, and notifications.

Camunda Alerts

Camunda offers a feature to notify you when process instances stop with an error.

There are two forms of notification:

By email to the email address of your user account
By webhook (you provide payload URL)

Alerts are configured at the cluster level (log in to SaaS Console -> Clusters -> select a cluster -> switch to the Alerts tab).
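For the webhook option, the payload URL needs something listening behind it. The sketch below is a bare-bones receiver built on the Python standard library. It makes no assumption about the alert payload schema and simply logs whatever JSON arrives; the port, the response codes, and all names are illustrative. A production receiver would also need TLS and some form of request verification.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_alert_body(raw: bytes):
    """Decode a webhook body as JSON; return None if it is not valid JSON."""
    try:
        return json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = parse_alert_body(self.rfile.read(length))
        if payload is None:
            self.send_response(400)  # body was not valid JSON
        else:
            # The payload schema is not assumed here; log it and inspect it.
            print("alert received:", json.dumps(payload))
            self.send_response(204)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Run the receiver until interrupted."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

Calling serve() starts the listener; the logged payloads show which fields are available for routing or filtering.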

Alerts in Camunda are a promising idea, but I believe there is room for improvement, so let's review the pros and cons of this approach to incident monitoring.

PROS:
Easy to configure
Suitable for Production with low volumes of incidents
OOTB (Out-of-the-box) feature
Real-time monitoring and notification
Notification methods (Email and Webhook)

CONS:
Separate email/notification for each Incident
Lack of filtering or distinguishing between deployed processes
Notifications may become spam/noise in high-volume environments
Limited customization of Alert conditions

Monitoring Approaches Comparison - New

The following table summarises the different monitoring approaches available for Camunda 8 SaaS, helping you decide which combination is right for your environment:

Approach	What it provides	Integration	Best for
Management API (Health Status)	Binary Healthy / Unhealthy status for cluster and each component (Zeebe, Operate, Optimize, Tasklist)	REST API → any tool via HTTP synthetic monitoring	Simple up/down availability monitoring
Cluster Metrics Endpoint (new)	Prometheus-formatted application-level metrics: throughput, latency, backpressure, disk, memory, partition health, exported events, and more	Native Prometheus scraping → Grafana, Datadog, Dynatrace, etc.	Deep performance observability, alerting, capacity planning
Orchestration Cluster REST API, or the deprecated Operate API (Incident Search)	Active/resolved incidents per process instance, error messages, creation times	REST API → custom scripts, email summaries, push to monitoring tools	Runtime error tracking, SLA monitoring, operator workflows
Camunda Alerts	Real-time notifications when process instances hit an error	Email or Webhook (OOTB)	Low-volume environments, quick setup, immediate notification
Status Page (RSS / Atom)	Platform-wide availability changes across all Camunda SaaS services	RSS/Atom feed → custom scripts → monitoring tool APIs	Platform outage awareness, SLA tracking

💡 Recommendation

For a comprehensive monitoring strategy, combine the Cluster Metrics Endpoint (for infrastructure and performance observability) with the Orchestration Cluster REST API (for business-level incident tracking) and the Status Page feed (for platform-wide awareness). This gives you coverage across all three layers: infrastructure health, application performance, and runtime error management.

Conclusion

The monitoring capabilities provided by Camunda SaaS have improved significantly since this article was first published. What was once limited to a binary health status via the Management API has evolved into a comprehensive monitoring ecosystem.

The introduction of the Cluster Metrics Endpoint is the most significant change - it gives SaaS customers the same level of metrics visibility that was previously only available in Self-Managed deployments. Teams can now integrate Zeebe, Operate, Tasklist, and Optimize metrics directly into their existing Prometheus-compatible observability stacks (Grafana, Datadog, Dynatrace, etc.), enabling proper alerting, capacity planning, and root-cause analysis alongside other production systems.

Combined with the Management API for high-level health checks, the Orchestration Cluster REST API (or, until its removal, the deprecated Operate API) for runtime incident monitoring, Camunda's built-in Alerts for immediate notifications, and the Status Page feeds for platform-wide awareness, organizations now have a solid foundation for end-to-end operational visibility of their Camunda SaaS environment.

As Camunda continues to evolve, it will be essential for organizations to stay informed of enhancements to the platform's monitoring capabilities. By leveraging these tools effectively, businesses can ensure that their process automation efforts are both resilient and aligned with their broader operational objectives.
