The IBM BPM Instrumentation Monitor page and instrumentation logging are often underestimated because they can seem like rocket science. There is a good technote that covers the basics of instrumentation logging: how to generate it using the Instrumentation monitor page and how to analyze it. Before proceeding with this article, make sure to go through that technote, generate some instrumentation logging on your system, convert it to a text file, and analyze it as described there.
Instrumentation monitor page:
In this article I would like to go beyond the basics and cover how the Instrumentation page itself can be used to monitor your BPM system, and how instrumentation logging can be used to troubleshoot performance issues.
Unfortunately, not all BPM methods are measured in the Instrumentation monitor, but the most important ones, primarily those related to BPM Standard, can help you analyze and pinpoint specific bottlenecks in your environment.
The Instrumentation monitor page itself can be used to identify bottlenecks in:
- Save execution context
- User role management
- Database timings, transaction
- BPD/Service Engine timings
- Repository timings
- FIFO's caches, Generic Caches, PO caches, Cache misses
- Managed asset classloader
- EJB Calls
- Event Manager, Task Loader
- Standard WebServices calls (inbound/outbound)
First of all, you need to know the exact URL for accessing all of the above statistics in the Instrumentation monitor. If you go to Process Admin Console -> Monitoring -> Instrumentation, you will see most of the data, but not all of it.
To see the extended data in the Instrumentation monitor page, use the following URL:
What values to examine first:
Pay the most attention to the following calls:
- findByPrimaryKey calls - these load persisted objects from the database. They are critical for BPM to operate, so it's important that these queries be as fast as possible. There is a whole findByPrimaryKey section, so you may want to review the values for all the types under it.
- BPD Engine - this is a section under BPD -> Engine; pay the most attention to:
- Acquire BPD Instance lock - this call tries to acquire a lock on the LSW_BPD_INSTANCE table in the Process Center/Server database. It should be a fast call. If it isn't, consider tuning on the database side.
- Failed Notification Load - there is a special kind of BPD task called "BPD Notification..." - a task run by the Event Manager every time a BPD instance moves a token to the next activity. If you see many failures in this call, something is wrong with your solution or environment, and you might have a lot of stuck tokens.
- Hold BPD Instance lock - how long the lock is held on the LSW_BPD_INSTANCE table. The BPD engine is a highly parallel engine that can work on multiple instances at the same time. When working on a particular instance, however, the engine works on one token at a time per instance. Let's say you have 2 instances of a BPD (instance A and instance B). The BPD engine can work on A and B at the same time, but within one instance it can only work on one token at a time. If instance A has 3 tokens that can advance and instance B has 2, the engine will work on one token from A and one token from B concurrently, but will never work on two tokens of the same instance at the same time.
- While doing this work, the BPD Engine has to make sure that no other threads or BPD Engines (running on other process server nodes) will be working on the same instance at the same time. To do that, the BPD Engine puts an update lock on the row in the LSW_BPD_INSTANCE table that corresponds to the BPD instance it is working on. This lock is highly granular and fully indexed. The suggestion here is to check for long delays in holding the lock; if you see them, investigate further on both the database and BPM sides.
- Load execution context - the BPD execution context is stored in the database. The execution context is a LOB that contains the values of all the BPD layer variables. This call should also be fast. If it isn't, consider tuning LOBs in the database, for example by enabling LOB caching or using dedicated tablespaces.
- PO (persistent objects) caches - there are two types of PO caches in BPM - versioned and unversioned.
- Versioned POs - The defining characteristic of versioned persistent objects is that multiple versions of the object data are stored in the database as changes are made to the objects over time. Each committed version of an object is stored as a separate immutable entity. BPM model objects (processes, variable types, etc.) are versioned in BPM. Other pieces of data such as object metadata, references, and dependencies are also versioned -- though not necessarily as root persistent objects.
- Unversioned POs - In contrast, unversioned persistent objects have only one current representation in the database at any given time. Old versions of the data are not saved as changes are made. BPM runtime data such as process instances, tasks, users, and groups are unversioned persistent objects. Unversioned POs may refer to versioned POs, in which case they will use the runtime references.
- In the Instrumentation monitor page, pay attention to the whole section called "PO Factory". You will find the following sub-sections under it:
- Cache Bypasses, Cache Hits, Cache Misses, Cache Size
- Monitor cache misses most of the time. If you see many cache misses for a particular type or a number of types, you might want to increase the versioned or unversioned cache size accordingly, based on your findings. The corresponding values in the BPM config files are:
- default-unversioned-po-cache-size (default value: 500)
- default-versioned-po-cache-size (default value: 500)
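As a sketch of how such an override might look: BPM configuration properties are typically defined in 99Local.xml and overridden in 100Custom.xml using merge attributes. The parent element and the value 2000 below are assumptions for illustration; locate these two properties in your own 99Local.xml to confirm where they live before copying this.

```xml
<!-- 100Custom.xml - hypothetical override; verify the exact parent
     element by finding these properties in your 99Local.xml first -->
<properties>
  <common merge="mergeChildren">
    <!-- raise the PO cache sizes from the default of 500;
         2000 is an illustrative value, not a recommendation -->
    <default-versioned-po-cache-size merge="replace">2000</default-versioned-po-cache-size>
    <default-unversioned-po-cache-size merge="replace">2000</default-unversioned-po-cache-size>
  </common>
</properties>
```

A server restart is required for configuration file changes like this to take effect.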
- Service Engine - the corresponding calls for the service engine can be found under the section called "Workflow Engine".
Pay the most attention to the following calls under this section:
- Load Execution Context - as with the BPD execution context, it is stored in the database. The execution context is a LOB (in LSW_TASK_EXECUTION_CONTEXT) that contains the values of all the service layer variables. This call should also be fast. If it isn't, consider tuning LOBs in the database, for example by enabling LOB caching or using dedicated tablespaces.
- Save Execution Context - if this is slow, consider tuning on the database side (LSW_TASK_EXECUTION_CONTEXT).
- Resume Workflow Engine - this call is essentially a call to run a coach. If it's slow, it's worth investigating further on the database side (load execution context) and on the client side.
What else to monitor?
There are a number of very useful statistics in the Instrumentation monitor page; above you can find those that matter most for a performant and healthy IBM BPM environment. What to monitor next should be based on your experience with your IBM BPM environment. If you use a lot of web services, look at the corresponding "Webservices" section, which covers inbound and outbound web service calls. If you want to monitor database transactions, look at the "Database" section and its corresponding calls.
And so on. Again, the Instrumentation monitor page is a great resource for monitoring BPM operations and finding bottlenecks.
As mentioned at the beginning of this article, there is an IBM technote that explains how to capture instrumentation data in a binary format and then convert it to a text file for further analysis. You might want to start and stop logging from automated scripts and wonder whether that's doable. The short answer is yes, but unfortunately there is no official API or wsadmin command for it. The only way I have found is to call the following URLs:
http(s)://your_BPM_host:port/teamworks/instrumentation/startLogging - starts instrumentation logging and generates a corresponding .dat file in the profile logs directory.
http(s)://your_BPM_host:port/teamworks/instrumentation/stopLogging - stops logging and stops writing to the .dat file, which can then be used for analysis of the gathered data.
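The two URLs above can be wrapped in a small shell sketch like the following. The host, port, and credential variables are placeholders, and the script assumes your Process Admin credentials are accepted via HTTP basic auth; depending on your security configuration you may instead need a form-based login with a cookie jar (curl's -c/-b options). The functions are defined but not invoked, so you can source this file and call them from your own automation.

```shell
#!/bin/sh
# Sketch of automated instrumentation capture.
# BPM_HOST, BPM_PORT, BPM_USER, BPM_PASS are placeholders to override
# in your environment before calling start_logging/stop_logging.
BPM_HOST="${BPM_HOST:-your_BPM_host}"
BPM_PORT="${BPM_PORT:-9080}"
BASE_URL="http://${BPM_HOST}:${BPM_PORT}/teamworks/instrumentation"

start_logging() {
  # -u sends basic-auth credentials; a cell with form-based login
  # would need curl -c/-b with a login request first instead.
  curl -s -u "${BPM_USER}:${BPM_PASS}" "${BASE_URL}/startLogging"
}

stop_logging() {
  curl -s -u "${BPM_USER}:${BPM_PASS}" "${BASE_URL}/stopLogging"
}

# Show the URLs that would be called.
echo "start URL: ${BASE_URL}/startLogging"
echo "stop  URL: ${BASE_URL}/stopLogging"
```

A typical capture run would then be: call start_logging, exercise the scenario you want to measure, call stop_logging, and pick up the resulting .dat file from the profile logs directory for conversion and analysis.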
Part II can be found here.