IBM BPM and Multi-Instance Loops

While some BPM practitioners can work for years without having to use or understand Multi-Instance Loops (MILs), they can be exceptionally useful for specific use cases.  Many practitioners have told me that MILs are bad.  Generally, this is because the way IBM BPM implements MILs can cause significant problems if you don’t understand the shortcomings of the tool.  This post is an attempt to demystify these shortcomings in the hope that people will be less scared of MILs and can avoid the pitfalls.

Use Case

A good use case for a MIL is any process where, at a given point, you need multiple users to work with the process simultaneously, but the number of users is determined at runtime from the process data.  For example, consider a process where a person can fill in a request for many different things, and each requested item should start its own approval and execution flow without blocking the others.  Additionally, let’s assume that further activities need to take place after all the requested items have been addressed.

It would be possible to meet this use case through a sophisticated use of Events; however, that level of abstraction can make it difficult to understand exactly what is going on in the process, and it makes the current state of the process hard to determine. A better answer is to use a MIL.

Shortcomings

Before jumping in with a MIL we need to understand the shortcomings that cause people to avoid them.  By understanding these we will be able to properly use the MIL in a way that will ensure our BPM system continues to run efficiently.

In order to avoid a myriad of very complex problems, the authors of the BPD engine in IBM BPM decided that the movement and execution of tokens within a BPD Instance should be single-threaded.  This means that even if you use a split to show that 3 system activities execute in parallel, if you watch the actual execution of the system activities for a given instance you will see them execute one at a time.  This avoids having to deal with different threads attempting to update the same data and causing data coherency issues.

The problem arises not due to this choice, but rather due to how the threading was accomplished.  As of my most recent tests, in our scenario above the BPD Engine will place 3 events in the Event Manager (EM), one for the execution of each system service.  However, a DB lock on the BPD Instance entry will be created by the one that executes first.  This will block the other 2 from executing until the lock is released.
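The effect of that lock can be sketched in a few lines of Python. This is only an illustration, not IBM BPM code: a `threading.Lock` stands in for the DB lock on the BPD Instance entry, and the function name and timings are made up. Three "parallel" branches are started on separate threads, yet no more than one is ever inside its system service at a time.

```python
import threading
import time

def run_parallel_branches(branch_count, work_seconds=0.01):
    """Start branch_count threads that all need the same instance lock."""
    instance_lock = threading.Lock()   # stand-in for the DB lock on the BPD instance
    counter_lock = threading.Lock()    # protects the bookkeeping below
    state = {"active": 0, "max_active": 0, "completed": 0}

    def system_service():
        with instance_lock:            # every branch blocks here until the lock is free
            with counter_lock:
                state["active"] += 1
                state["max_active"] = max(state["max_active"], state["active"])
            time.sleep(work_seconds)   # pretend to do the service's work
            with counter_lock:
                state["active"] -= 1
                state["completed"] += 1

    threads = [threading.Thread(target=system_service) for _ in range(branch_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state

result = run_parallel_branches(3)
print(result["completed"], "branches ran, at most", result["max_active"], "at a time")
```

All three branches finish, but the total elapsed time is the sum of the three service times rather than the longest one.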

In our 3 activity scenario this is a relatively minor problem as long as the system services all complete in a timely manner (essentially, more quickly than the blocked threads will trigger a DB timeout).  However, if you create a MIL whose first step is a system service, you can set yourself up for a potentially bad problem.

In the MIL scenario, where the first step in the MIL implementation is a System Service, if you were to spin up, let’s say, 1000 parallel instances, each one will get an entry in the Event Manager.  The Event Manager will assign each of its threads to processing one of these entries.  So your Event Manager now has 1000 entries, all to perform BPD operations on the same BPD instance.  Of these there will be one on every Event Manager thread, but due to the DB lock only one will be executing and all the others will be blocked.

The first effect of this is that your BPD engine effectively grinds to a halt.  All of the BPD processing threads are now dedicated to attempting to handle this MIL.  Any new BPD Event requests will queue up behind this EM log jam and will not be processed until nearly all of our 1000 MIL system service calls have been processed.
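To see why unrelated work waits so long, here is a small Python model of the log jam. It is a sketch under simplifying assumptions, not a description of the Event Manager's internals: the EM is treated as a FIFO queue served by a fixed pool of threads, every MIL entry needs the same instance lock, and the function name, thread count, and timings are made up for illustration.

```python
from collections import deque

def unrelated_event_start_time(mil_jobs, em_threads, service_seconds):
    """Return when a job queued behind the MIL entries first gets a thread."""
    queue = deque(["mil"] * mil_jobs + ["other"])
    held = 0        # threads held by MIL jobs (one running, the rest blocked on the lock)
    clock = 0.0

    # The free threads pick up jobs in queue order.
    while held < em_threads:
        if queue.popleft() == "other":
            return clock               # a thread was free straight away
        held += 1

    # From here on a thread is only freed when a MIL job finishes its service call.
    while True:
        clock += service_seconds       # the lock holder completes
        if queue.popleft() == "other":
            return clock               # the freed thread finally takes the unrelated job
        # otherwise the freed thread picks up yet another MIL job and blocks on the lock

print(unrelated_event_start_time(mil_jobs=1000, em_threads=10, service_seconds=3))
print(unrelated_event_start_time(mil_jobs=3, em_threads=10, service_seconds=3))
```

In this model, with 10 threads and 1000 MIL entries, the unrelated job only gets a thread after 991 of the 1000 service calls have completed, which is the "nearly all" described above. With only 3 entries there are spare threads and it starts immediately.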

If the system service we call is non-trivial, meaning it can take a few seconds to run, our BPD instance is now in jeopardy.  The problem is that it is non-deterministic which of the waiting threads gets the DB lock once the processing thread is done.  If one of the waiting threads consistently loses the “DB Lock Lottery”, there is a chance that it will exceed the DB timeout permitted for waiting on a lock.  This will throw a SQL Exception to the thread, and the thread will raise this exception to the BPD instance engine, marking the BPD as failed.
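The lottery can be modelled roughly as follows. This is a toy simulation, not a description of how any particular database grants locks: it assumes the next lock holder is picked uniformly at random from the blocked threads, and the function name, the 40 threads, and the 60 second lock timeout are all hypothetical values chosen for illustration.

```python
import random

def count_lock_timeouts(jobs, em_threads, service_seconds, lock_timeout_seconds, seed=0):
    """Count waiters that give up because they waited too long for the lock."""
    rng = random.Random(seed)
    remaining = jobs        # jobs not yet picked up by an Event Manager thread
    waiting = []            # wait-start time of every job currently blocked on the lock
    clock = 0.0
    timeouts = 0

    def fill_threads():
        nonlocal remaining
        while remaining > 0 and len(waiting) < em_threads:
            waiting.append(clock)
            remaining -= 1

    fill_threads()
    while waiting:
        # The "DB Lock Lottery": a random waiter gets the lock and runs its service.
        waiting.pop(rng.randrange(len(waiting)))
        clock += service_seconds
        # Anyone who has now been blocked longer than the timeout fails.
        survivors = [t for t in waiting if clock - t <= lock_timeout_seconds]
        timeouts += len(waiting) - len(survivors)
        waiting[:] = survivors
        fill_threads()      # freed threads pick up the next queued jobs
    return timeouts

print("3 s service: ", count_lock_timeouts(1000, 40, 3.0, 60.0), "timeouts")
print("3 ms service:", count_lock_timeouts(1000, 40, 0.003, 60.0), "timeouts")
```

With a 3 second service, 40 blocked threads, and a 60 second timeout, some waiters are bound to time out, because only about 20 of them can be served before the timeout expires. With a 3 ms step the whole loop finishes long before any timeout could fire. In the real engine a single timeout is enough to mark the instance as failed.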

I believe the main reason so many people avoid the MIL is that the real danger here lies in the following “perfect storm”:

  • Dev and UAT testing rarely happens at production scale, so the MIL creates far fewer instances than it will in the real world.
  • The data processed by the system service is frequently less complex in non-production environments than in production environments.
  • The failure is non-deterministic.  You can run one MIL with 200 items and see no errors because every thread obtained the lock without a problem, then run another with 200 where one loses the “DB Lock Lottery” and throws an exception.

How to solve it

The key to solving this is to create a MIL that avoids the bad pattern.  The most important rule is that the first “real” activity in the MIL should be a Human Service or an Intermediate Message Event. You will still flood your EM with the same number of requests as in our bad example and still have an adverse impact on the BPD engine, but each thread should complete in a few milliseconds. For 1000 instances, 3 ms of processing time each (a totally made-up number, likely low) means the BPD Engine slows down for approximately 3 seconds. Contrast this with a system service that takes 3 seconds to marshal and execute (another made-up number, but likely realistic), which will take down your BPD execution for nearly 50 minutes.
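The arithmetic behind that comparison is simple enough to check in a few lines of Python. The per-instance timings are the same made-up numbers as in the text, and the function name is just for illustration; the only assumption is that the instances are processed strictly one at a time.

```python
def engine_stall_seconds(instances, per_instance_ms):
    """Total time the BPD engine is tied up when instances run one at a time."""
    return instances * per_instance_ms / 1000

fast = engine_stall_seconds(1000, 3)       # first step is a Human Service, ~3 ms each
slow = engine_stall_seconds(1000, 3000)    # first step is a 3 second System Service

print(f"fast first step: {fast:.0f} s")
print(f"slow first step: {slow:.0f} s ({slow / 60:.0f} min)")
```

The thousand-fold difference in per-instance time carries straight through to the stall: about 3 seconds in the first case against 3000 seconds, or 50 minutes, in the second.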

In a case where each task needs to marshal data for the human service you have two choices. If this data is required as an input to the human service but is not required to create the human service (e.g. none of the data is used in the task subject or narrative), the data marshalling can be moved into the Human Service as the first step in the Service execution.

If this data is used in the creation of the task, then you should marshal the data for each task prior to entering the MIL so that each instance is simply handed the “correct” data to generate the task. Note that if you really do potentially have 1000s of tasks in your MIL you may need to revisit this requirement, as you will be adding a significant amount of data to the BPD’s execution context, which does impact DB space and performance.
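The shape of that second option can be sketched in Python. This is not IBM BPM code (in a real BPD the data would live in process variables), and every name here, including `marshal_task_data`, `create_task`, and the fake lookup, is hypothetical. The point is only that the expensive call happens once per item before the loop, so each loop instance just reads its own record.

```python
def marshal_task_data(requested_items, lookup_details):
    """Build one ready-to-use record per requested item before the loop starts."""
    task_data = []
    for item in requested_items:
        details = lookup_details(item)   # the slow call, made up front, outside the MIL
        task_data.append({
            "subject": f"Approve request for {item}",
            "details": details,
        })
    return task_data

def create_task(task_data, instance_index):
    """What each loop instance does: pick up its own record, no service call needed."""
    record = task_data[instance_index]
    return record["subject"]

def fake_lookup(item):
    # Hypothetical stand-in for whatever system service fetches the data.
    return {"item": item, "cost_center": "unknown"}

prepared = marshal_task_data(["laptop", "badge", "phone"], fake_lookup)
for index in range(len(prepared)):
    print(create_task(prepared, index))
```

The trade-off is the one noted above: the whole `prepared` list is carried in the process data, so its size grows with the number of tasks.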
