I recently got engaged by a client who ran an increasing load test against their load-balanced application. They brought me in because they had observed a phenomenon they couldn't explain - here is an excerpt of their email:
We have a jBoss application with mySQL that runs stable in a load testing environment with let’s say 20 threads. At one point this suddenly changes and jBoss uses up to 100 threads for handling requests (or whatever the configured max number of threads in jBoss might be) – until now we have not been able to pinpoint what causes this issue
I requested their collected performance metrics to take a quick look. Here are the steps I took to analyze their problem.
Step 1: Verify what they said
Of course I trusted what I read in the email - but it is always good to confirm and verify. They subscribed to several JBoss measures such as Thread Pool Current Threads Busy. Charting them verified what they said:
Looking at the busy worker threads shows us the load behavior on the load balanced web servers
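As a side note: if you don't have a monitoring solution subscribed to these measures, the same thread pool values can be read directly via JMX. Here is a minimal sketch in Java - the JMX service URL and the ObjectName of the connector thread pool are assumptions and will differ depending on your JBoss version and connector configuration:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BusyThreadsProbe {
    public static void main(String[] args) throws Exception {
        // JMX endpoint of the app server - host and port are assumptions, adjust to your setup
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://appserver1:9999/jmxrmi");

        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // ObjectName of the HTTP connector thread pool - depends on JBoss version and connector name
            ObjectName pool = new ObjectName("jboss.web:type=ThreadPool,name=http-0.0.0.0-8080");

            Object busy = mbs.getAttribute(pool, "currentThreadsBusy");
            Object max  = mbs.getAttribute(pool, "maxThreads");
            System.out.println("Busy worker threads: " + busy + " / " + max);
        } finally {
            connector.close();
        }
    }
}
```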
The initial high number of worker threads is probably caused by the application having to "warm up" - meaning that individual requests were slow, so more worker threads were needed to handle the initial load. Throughout the test the 2nd app server consistently had more threads than the first. At the end of the test we see that both servers spike in load, with server 2 maxing out its worker threads. This confirms what they told me in the email. Let's see why that is.
The next thing I do is look at the actual web request throughput, CPU utilization and response times of the requests. I chart these additional metrics on my dashboard. These metrics are provided by the application server or by the performance management solution we used in this case. Let's look at the dashboard, including the red marks around the areas I want to focus on:
Correlating Thread Count with CPU, Throughput and Response Time
The top graph shows the busy worker threads again with a red mark around the problematic timeframe. The second graph shows the number of successful web requests - which is basically the throughput. The throughput increased in the beginning, which is explained by the increasing workload. But even before we see the worker threads go up, we can see that throughput stops increasing and stays flat. On server 2 (brown line) we even see a drop in throughput that lasts until the end of the test. So - even though we have more worker threads, we actually handle fewer requests successfully.
The CPU graph actually explains the root cause of the problem. I chose to split the chart into a separate graph for each app server. Both app servers max out their CPU at the point I marked in red. This correlates with the time when we see the throughput become flat, and it is also the time when we see more worker threads being used. The problem, though, is that new worker threads won't help handle the still-increasing load because the servers are simply out of CPU.
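By the way, host CPU utilization does not strictly require an external monitor either - it can be sampled from inside the JVM. The following sketch uses the com.sun.management extension of OperatingSystemMXBean (HotSpot/OpenJDK specific) to log process and system CPU load; the 5-second interval is an arbitrary choice:

```java
import java.lang.management.ManagementFactory;

public class CpuSampler {
    public static void main(String[] args) throws InterruptedException {
        // The cast to the com.sun.management interface is HotSpot/OpenJDK specific
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

        while (true) {
            // Values are fractions between 0.0 and 1.0 (or negative if not yet available)
            double process = os.getProcessCpuLoad();
            double system  = os.getSystemCpuLoad();
            System.out.printf("process CPU: %.0f%%, system CPU: %.0f%%%n",
                    process * 100, system * 100);
            Thread.sleep(5000); // 5s sampling interval - arbitrary choice
        }
    }
}
```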
The bottom graph shows the response time of those requests that are handled successfully - again split by the two app servers. The red area shows the time spent on CPU, the blue area the total execution time. We can again observe that the CPU contribution flattens once we have reached the CPU limit. From there on we only see increasing execution time - time the app spends waiting on resources such as I/O, network or even CPU cycles. What's interesting here is that the 2nd app server has a much higher execution time than app server 1. My explanation is that the 2nd server gets additional requests from the load balancer, which results in more worker threads that all compete for the same scarce resources.
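To make the difference between CPU Time and Execution Time more tangible: Execution Time is wall-clock time, CPU Time is what the thread actually burned on the processor, and the gap between the two is time spent waiting. The small sketch below measures both for a piece of work using ThreadMXBean; the two workloads are made up purely for illustration:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuVsExecutionTime {

    // Wraps a piece of work and reports wall-clock vs. CPU time of the current thread.
    // getCurrentThreadCpuTime() may return -1 on platforms without thread CPU time support.
    static void measure(String label, Runnable work) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        long wallStart = System.nanoTime();
        long cpuStart  = threads.getCurrentThreadCpuTime();

        work.run();

        long wallNanos = System.nanoTime() - wallStart;
        long cpuNanos  = threads.getCurrentThreadCpuTime() - cpuStart;

        System.out.printf("%s: execution time %d ms, CPU time %d ms, wait %d ms%n",
                label, wallNanos / 1_000_000, cpuNanos / 1_000_000,
                (wallNanos - cpuNanos) / 1_000_000);
    }

    public static void main(String[] args) {
        // CPU-bound work: execution time and CPU time are roughly equal
        measure("cpu-bound", () -> {
            long sum = 0;
            for (int i = 0; i < 50_000_000; i++) sum += i * 31L;
            if (sum == 0) System.out.println("unexpected"); // keep the JIT from removing the loop
        });

        // Wait-bound work: execution time grows but CPU time stays near zero
        measure("wait-bound", () -> {
            try { Thread.sleep(500); } catch (InterruptedException ignored) { }
        });
    }
}
```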
Summarizing the problems: it seems we have a CPU problem in combination with a load balancer configuration problem. As we run out of CPU the throughput stalls and execution times get higher, but the load balancer still forwards incoming requests to the already overloaded application servers. It also seems that the load balancer distributes the load unevenly, causing even more problems on server 2.
Is it really a CPU Problem?
Based on the analysis it seems that CPU is our main problem here. The question is whether there is something we can do about it (fix a problem in the app) or whether we have simply reached the limits of the hardware (fix it by adding more machines).
The customer used dynaTrace with a detailed level of instrumentation, which allows me to analyze which methods consume most of the CPU. The following illustration shows what we call the Methods Dashlet. It allows me to look at the method executions that were traced while executing the load test. Sorting by CPU Time shows that there are two methods called getXML which spend most of their time on CPU:
There are two different implementations of getXML that consume most of the CPU
The first getXML method consumes far more CPU than all other methods combined. We need to look at the code to see what is going on in there. Looking at the parameters, I assume it is reading content from a file and then returning it as a String. The file access explains the difference between Execution Time and CPU Time. The CPU time is then probably spent processing the file and generating an XML representation of the content. Inefficient usage of an XML parser or String concatenation would be a good explanation.
For me this looks like a good chance to optimize this method, save CPU and be able to handle more load on that machine.
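We don't see the actual getXML source code, but the profile - heavy file access plus heavy CPU while producing an XML String - is typical for code that concatenates Strings in a loop. The following is purely a hypothetical before/after sketch (method, element and file handling are made up): the first variant copies the growing String on every iteration, the second appends into a StringBuilder and avoids those copies:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class GetXmlSketch {

    // Hypothetical "before": each += copies the whole String built so far,
    // so CPU cost grows quadratically with the file size.
    static String buildXmlConcat(Path file) throws IOException {
        String xml = "<records>";
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            xml += "<record>" + line + "</record>";
        }
        return xml + "</records>";
    }

    // Hypothetical "after": a StringBuilder appends in place and avoids the copies.
    static String buildXmlBuilder(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        StringBuilder xml = new StringBuilder(lines.size() * 32);
        xml.append("<records>");
        for (String line : lines) {
            xml.append("<record>").append(line).append("</record>");
        }
        return xml.append("</records>").toString();
    }
}
```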
What about these spikes on server 2?
In the response time chart we could see that server 2 had a totally different response time behavior than server 1, with very high execution times. Let's take a closer look at some of these long-running requests. The following illustration shows part of a PurePath. A PurePath is the transactional trace of one request that got executed during the load test. In this case it was a request executed against server 2 that took a total of 68s. It calls an external web service that returns invalid content after 60 seconds. I assume the web service simply ran into a timeout and after 60 seconds returned an HTML error page instead of SOAP content:
External Web Service Call runs into a timeout and returns an error page instead of SOAP Content
Looking at other PurePaths from server 2, it seems that many transactions ran into the same problem, causing the long execution times. The PurePath contains additional context information such as method arguments and parameters passed to the web service. This information can be used when talking to the external service provider to narrow down the problematic calls.
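We don't know which client library the application uses to call that web service, but two defensive measures would have limited the damage: an explicit, shorter read timeout and a check of the response before handing it to the SOAP parser. Here is a minimal sketch with plain HttpURLConnection - the endpoint and the timeout values are assumptions:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ExternalServiceCall {

    static InputStream callService(String endpoint) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();

        // Fail fast instead of hanging on an overloaded external service
        conn.setConnectTimeout(5_000);   // 5s to establish the connection
        conn.setReadTimeout(10_000);     // 10s for the response - both values are assumptions

        int status = conn.getResponseCode();
        String contentType = conn.getContentType();

        // A timed-out or failing service often answers with an HTML error page;
        // reject it here instead of feeding it to the SOAP parser
        if (status != HttpURLConnection.HTTP_OK
                || contentType == null || !contentType.contains("xml")) {
            conn.disconnect();
            throw new IOException("Unexpected response: HTTP " + status + ", " + contentType);
        }
        return conn.getInputStream();
    }
}
```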
The most important thing to remember is to monitor enough metrics: CPU, I/O, memory, execution times, throughput, application server specifics, ... These metrics allow you to figure out whether you have problems with CPU or other resources. Once that is figured out, you need more in-depth data to identify the actual root cause. We have also learned that it is not always our own application code that causes problems. External services or frameworks that we use are also places to look.