Overview
In this article, we review infrastructure monitoring within the Camunda SaaS environment. We will not cover monitoring for Camunda Self-Managed environments, because that is a completely different topic.
It's essential to recognise that monitoring in a SaaS model, especially with a platform like Camunda, comes with a predefined scope. This scope is determined primarily by the metrics and monitoring capabilities that Camunda chooses to expose to its users. In this regard, Camunda SaaS changes how we perceive infrastructure management and monitoring.
When we opt for a SaaS solution, responsibility for maintaining the platform's health and performance is largely assumed by Camunda itself. This model frees us from the details of manual monitoring tasks and lets us focus on what matters most: process efficiency.
One of the features of Camunda SaaS is its inherent high availability and fault tolerance. Through a sophisticated replication mechanism, the platform ensures continuity and resilience against failures – be they from hardware malfunctions or software anomalies. This level of redundancy is crucial, as it guarantees minimal downtime and prevents data loss, all while requiring little to no manual intervention from our end.
Despite these advanced capabilities, there remains a range of monitoring possibilities that we explore in this article. We'll delve into strategies and practices for monitoring the health of your Camunda clusters and their stock applications.
Additionally, in one of the sections of this article we will focus on runtime behaviors and error handling within process instances. Understanding how to manage and mitigate errors is important in maintaining the integrity of your workflows and ensuring a seamless operational model.
Camunda Management APIs
These APIs serve as the backbone for programmatically managing various components of the Camunda SaaS environments.
The available APIs are listed in the Swagger UI. For how to authenticate against them, see the official Camunda documentation.
The API of most interest for monitoring is the one that retrieves the health status of clusters, the workflow engine, and the stock applications running inside the cluster.
It allows us to programmatically assess the operational health of our Camunda infrastructure: a single API call returns the status of a Camunda SaaS cluster.
While the Management API offers critical insights into the health status of the clusters, it's important to note that the scope of monitoring with Camunda SaaS is somewhat limited. Beyond the health status, Camunda SaaS does not expose additional metrics for monitoring purposes. This limitation again reflects the SaaS model's philosophy, where Camunda assumes a significant portion of the operational burden, including monitoring of the actual Kubernetes clusters where Camunda clusters are running.
As such, it becomes essential to use the available APIs to their fullest extent. The health status provided by the Management API, while limited, is an important data point. By integrating these API calls into our monitoring tools or scripts, we can establish a monitoring mechanism.
In this article we will explore integrating this API call into the Dynatrace monitoring tool.
Monitoring health state of an environment/cluster
For this we can use the "{{baseUrl}}/clusters/:clusterUuid" API.
Monitoring health of workflow engine and cluster stock applications
The same API ("{{baseUrl}}/clusters/:clusterUuid") can be used for this as well, because the response contains the health of the cluster and its components (Zeebe engine, Operate, Optimize, Tasklist).
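As a rough illustration, the call could be scripted in Python along the following lines. This is a minimal sketch, not an official client: the token URL, base URL, audience, the "status" field names, and the "Healthy" value are assumptions to be checked against the Swagger UI and the Camunda documentation for your organisation.

```python
import json
import urllib.parse
import urllib.request

# Assumed Camunda SaaS endpoints; verify them in the official documentation.
TOKEN_URL = "https://login.cloud.camunda.io/oauth/token"
BASE_URL = "https://api.cloud.camunda.io"
AUDIENCE = "api.cloud.camunda.io"


def get_token(client_id, client_secret, audience=AUDIENCE):
    """Request an OAuth 2.0 access token with the client-credentials grant."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "audience": audience,
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode()
    req = urllib.request.Request(TOKEN_URL, data=body, method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["access_token"]


def get_cluster(token, cluster_uuid):
    """Call GET {baseUrl}/clusters/:clusterUuid and return the parsed JSON."""
    req = urllib.request.Request(
        f"{BASE_URL}/clusters/{cluster_uuid}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def unhealthy_components(cluster):
    """Return the status fields whose value is not 'Healthy'.

    Assumes a response shaped like
    {"status": {"ready": "Healthy", "zeebeStatus": "Healthy", ...}}.
    """
    status = cluster.get("status", {})
    return sorted(name for name, value in status.items() if value != "Healthy")
```

A scheduled job could call get_token, then get_cluster, and raise an alert whenever unhealthy_components returns a non-empty list.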
Example
The following is an example of how you can use the API we discussed in a monitoring tool like Dynatrace.
High-level procedure:
1) Set up a "Synthetic Classic" HTTP Monitor in Dynatrace.
2) Set up two HTTP requests: one to get the OAuth 2.0 token for authentication and one to call our health monitoring API.
3) Set the frequency according to your requirements.
4) Create a dashboard based on this HTTP Monitor.
Camunda SaaS status page
As you probably already know, the Camunda SaaS stack runs on the Google Cloud Platform (GCP).
Like any service, it might occasionally undergo availability changes. When availability changes, Camunda publishes the current service status. You can get notified about changes to the service status automatically via Atom or RSS feeds.
An external script can periodically check the RSS or Atom feed, parse the relevant status information, and send it to the monitoring tool of your choice through that tool's APIs. For instance, you can use the Dynatrace API to send this data to Dynatrace as custom metrics or events. Scheduled to run at regular intervals, the script pushes data to Dynatrace, which can then alert you when the status changes.
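The parsing part of such a script might look like the sketch below, which uses only the Python standard library. The feed URL is passed in by the caller because the exact address of the status feed is not given here; the function names and the idea of tracking already-seen entry ids are illustrative assumptions, not part of any Camunda tooling.

```python
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by Atom feeds (RFC 4287).
ATOM = "{http://www.w3.org/2005/Atom}"


def fetch_feed(feed_url):
    """Download the raw feed; feed_url must be supplied by the caller."""
    with urllib.request.urlopen(feed_url, timeout=30) as resp:
        return resp.read()


def parse_atom_entries(feed_xml):
    """Return a list of dicts (id, title, updated), one per Atom entry."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall(f"{ATOM}entry"):
        entries.append({
            "id": entry.findtext(f"{ATOM}id", default=""),
            "title": entry.findtext(f"{ATOM}title", default=""),
            "updated": entry.findtext(f"{ATOM}updated", default=""),
        })
    return entries


def new_entries(entries, seen_ids):
    """Keep only entries whose id was not reported in a previous run."""
    return [e for e in entries if e["id"] not in seen_ids]
```

Each run would persist the ids it has already reported (for example in a small file) and forward only what new_entries returns to the monitoring tool.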
Runtime error handling
Another important part of monitoring is the runtime monitoring of the process instances that run on your Camunda clusters. As such, let's focus on runtime behaviors and error handling within process instances. Understanding how to manage and mitigate errors is important in maintaining the lifecycle of your process instances.
The main application for this type of exercise is Camunda Operate. But in this article, I’d like to talk about the capabilities of the Operate API. The Operate API allows us to search, retrieve, and modify data in Operate with requests and responses formatted in the JSON notation. One of the critical functionalities provided by the Operate API is its ability to retrieve process instance incidents. This feature is instrumental for monitoring incidents arising from critical business processes. It allows us to not just passively observe but actively respond to these incidents.
To fully harness the Operate API's capabilities, I encourage exploring the interactive Operate API Explorer. This tool offers comprehensive specifications, example requests, responses, and even code samples for interacting with the API.
Monitoring requirements can vary significantly across different environments and use cases. Therefore, it's crucial to evaluate how and where to retrieve data for monitoring purposes. Whether it's through automated systems that alert us to incidents in real-time or through regular email summaries, understanding these requirements is key to implementing an effective monitoring strategy for the incidents.
In practice, our clients have adopted a range of approaches to monitor their processes. While some prefer the hands-on approach of manual monitoring through the Operate application, others lean towards automation, setting up alerts or scheduled reports that highlight critical incidents needing immediate attention.
In conclusion, the Operate API offers a flexible and powerful interface for monitoring Camunda workflows and responding to incidents.
At BP3 Production Operations we have developed a set of scripts, including one that polls incidents periodically and sends a summary to an Operator. The summary can also carry priority information, so that incidents in more critical processes are handled immediately.
Operate API - incidents monitoring
Use the Operate search API to build your custom monitoring solution:
POST {SERVER-URL}/v1/incidents/search
Example:
Below is a simple Python script that retrieves a token from Camunda for the Operate API, calls the incident search, parses the response, and sends a summary email to an Operator, who then takes action either via an API call or in the Operate application:
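A minimal sketch of such a script, not the production version, might look like the following. The token URL, the "operate.camunda.io" audience, the {"filter": {"state": "ACTIVE"}} request body, and the incident field names ("type", "processInstanceKey", "message") are assumptions to be confirmed in the Operate API Explorer; the SMTP host, sender, and recipient are placeholders supplied by the caller.

```python
import json
import smtplib
import urllib.parse
import urllib.request
from collections import Counter
from email.message import EmailMessage

TOKEN_URL = "https://login.cloud.camunda.io/oauth/token"  # assumed
OPERATE_AUDIENCE = "operate.camunda.io"                   # assumed


def get_token(client_id, client_secret):
    """Fetch an OAuth 2.0 token for the Operate API (client credentials)."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "audience": OPERATE_AUDIENCE,
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode()
    req = urllib.request.Request(TOKEN_URL, data=body, method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["access_token"]


def search_active_incidents(server_url, token, size=100):
    """POST {SERVER-URL}/v1/incidents/search and return the 'items' list."""
    body = json.dumps({"filter": {"state": "ACTIVE"}, "size": size}).encode()
    req = urllib.request.Request(
        f"{server_url}/v1/incidents/search",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp).get("items", [])


def build_summary(incidents):
    """Turn a list of incident dicts into a plain-text summary."""
    if not incidents:
        return "No active incidents."
    by_type = Counter(i.get("type", "UNKNOWN") for i in incidents)
    lines = [f"{len(incidents)} active incident(s):"]
    for incident_type, count in by_type.most_common():
        lines.append(f"  {incident_type}: {count}")
    lines.append("")
    for i in incidents:
        lines.append(f"- instance {i.get('processInstanceKey')}: {i.get('message', '')}")
    return "\n".join(lines)


def send_summary(summary, sender, recipient, smtp_host, smtp_port=25):
    """Email the summary to the Operator through a plain SMTP relay."""
    msg = EmailMessage()
    msg["Subject"] = "Camunda Operate incident summary"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(summary)
    with smtplib.SMTP(smtp_host, smtp_port) as smtp:
        smtp.send_message(msg)
```

The search returns at most "size" items per call, so a cluster with many incidents would need paging or a narrower filter.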
Use Operate API in Dynatrace
Since we talked about Dynatrace in one of the previous sections of this article, let's review how one could use the Operate API with this tool.
For this purpose, the easiest route is Custom Metrics for External Data in Dynatrace. You may then create a Dashboard for Custom Metrics and use Automation and Real-Time Data.
If we look at the script from the previous section, essentially the same (or a similar) script can push the data to the Dynatrace Metrics API; dashboards, notifications, and alerts can then be built on top of it.
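As a hedged sketch of that last step, the snippet below turns a list of incidents into lines for the Dynatrace Metrics API v2 ingest endpoint and posts them. The metric key "camunda.operate.incidents.active" and the "incident_type" dimension are names invented for this example, and the incident "type" field is the same assumption as before; the environment URL and API token come from your own Dynatrace setup.

```python
import urllib.request
from collections import Counter


def to_metric_lines(incidents, metric_key="camunda.operate.incidents.active"):
    """Build one ingest line per incident type: key,dimension="value" count."""
    counts = Counter(str(i.get("type", "UNKNOWN")) for i in incidents)
    lines = []
    for incident_type, count in sorted(counts.items()):
        # Escape backslashes and quotes so the quoted dimension value stays valid.
        value = incident_type.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{metric_key},incident_type="{value}" {count}')
    return lines


def push_metrics(environment_url, api_token, lines):
    """POST the lines to {environment_url}/api/v2/metrics/ingest."""
    req = urllib.request.Request(
        f"{environment_url}/api/v2/metrics/ingest",
        data="\n".join(lines).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Api-Token {api_token}",
            "Content-Type": "text/plain; charset=utf-8",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status
```

Note that an empty incident list produces no lines at all, so a dashboard would show a gap rather than a zero unless the script sends an explicit zero value.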
Camunda Alerts
Camunda offers a feature to notify you when process instances stop with an error.
There are two forms of notification:
- By email to the email address of your user account
- By webhook (you provide payload URL)
Alerts are configured at the cluster level (log in to the SaaS Console -> Clusters -> select a cluster -> switch to the Alerts tab).
The idea behind Alerts in Camunda is a good one, but I believe there is room for improvement, so let's review the pros and cons of this approach to incident monitoring.
PROS:
- Easy to configure
- Suitable for Production with low volumes of incidents
- OOTB (Out-of-the-box) feature
- Real-Time monitoring and notification
- Notification Methods (Email and Webhook)
CONS:
- Separate email/notification for each Incident
- Lack of filtering or distinguishing between deployed processes
- Notifications may become spam/noise in high-volume environments
- Limited customization of Alert conditions
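Some of these drawbacks can be softened on the receiving side of the webhook, for example by grouping incoming alerts per process before notifying anyone. The sketch below shows only that grouping step. The payload shape (a "clusterName" field and an "alerts" list whose items carry "processName") is an assumption about the webhook body and should be checked against a real delivery.

```python
from collections import defaultdict


def group_alerts(payload):
    """Group the alerts of one webhook payload by process name."""
    grouped = defaultdict(list)
    for alert in payload.get("alerts", []):
        grouped[alert.get("processName", "unknown")].append(alert)
    return dict(grouped)


def digest(payload):
    """Render one short text digest per payload instead of one note per incident."""
    cluster = payload.get("clusterName", "unknown cluster")
    lines = [f"Alerts for {cluster}:"]
    for process, alerts in sorted(group_alerts(payload).items()):
        lines.append(f"  {process}: {len(alerts)} incident(s)")
    return "\n".join(lines)
```

A small HTTP endpoint registered as the payload URL could collect payloads for a few minutes and send one digest, filtered to the processes an Operator cares about.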
Conclusion
The monitoring capabilities provided by Camunda SaaS offer a good foundation for ensuring the health and efficiency of process orchestration in a SaaS environment. Through the use of Management APIs, the Operate API, and the inherent high availability and fault tolerance features, Camunda SaaS enables organizations to maintain their operational workflows with minimal manual intervention. The ability to monitor the health state of an environment or cluster, alongside real-time error handling and incident monitoring, empowers businesses to proactively address issues and maintain continuous process improvements.
As we've explored, integrating monitoring capabilities with tools like Dynatrace extends the visibility into the operational health of the Camunda SaaS environments.
As Camunda continues to evolve, it will be essential for organizations to stay informed of enhancements to the platform's monitoring capabilities. By leveraging these tools effectively, businesses can ensure that their process automation efforts are both resilient and aligned with their broader operational objectives.
In closing, while Camunda SaaS monitoring features present a good foundation for operational oversight, there remains a path for continuous improvement. By addressing the outlined challenges and leveraging the full potential of the platform's monitoring capabilities, organizations can achieve an optimal balance between automation efficiency and operational transparency.