One day, one of our testers comes up to our table and tells us: “The testing environment (Weblogic container on IBM JDK 1.4.2) is hanging”… So we kill Weblogic, fire it up again and have the tester replay his scenario. He manages to reproduce the hanging behaviour. So on come the Java Forces to find the problem. As the JDK is an 1.4.2 JDK we cannot use the Java5 jconsole to inspect the JVM. The easiest then is to first create a stack dump, as we want to see what the threads are doing. Upon careful inspection of the dump file we first find this disturbing snippet.
3LKMONOBJECT com.xebia.client.domain.impl.ReleaseRunImpl@189DDCA8/189DDCB0: Flat locked by thread ident 0x23, entry count 1
3LKNOTIFYQ Waiting to be notified:
3LKWAITNOTIFY "ExecuteThread: '9' for queue: 'weblogic.kernel.Default'" (0x6E6AA08)
3LKWAITNOTIFY "ExecuteThread: '21' for queue: 'weblogic.kernel.Default'" (0x69512F0)
3LKWAITNOTIFY "ExecuteThread: '17' for queue: 'weblogic.kernel.Default'" (0x694C0B0)
3LKWAITNOTIFY "ExecuteThread: '5' for queue: 'weblogic.kernel.Default'" (0x6937598)
3LKWAITNOTIFY "ExecuteThread: '6' for queue: 'weblogic.kernel.Default'" (0x6937988)
3LKWAITNOTIFY "ExecuteThread: '18' for queue: 'weblogic.kernel.Default'" (0x694D538)
3LKWAITNOTIFY "ExecuteThread: '10' for queue: 'weblogic.kernel.Default'" (0x69431D0)
3LKWAITNOTIFY "ExecuteThread: '22' for queue: 'weblogic.kernel.Default'" (0x6952788)
3LKWAITNOTIFY "ExecuteThread: '14' for queue: 'weblogic.kernel.Default'" (0x6948318)
3LKWAITNOTIFY "ExecuteThread: '15' for queue: 'weblogic.kernel.Default'" (0x69497A0)
3LKWAITNOTIFY "ExecuteThread: '23' for queue: 'weblogic.kernel.Default'" (0x6953C20)
3LKWAITNOTIFY "ExecuteThread: '19' for queue: 'weblogic.kernel.Default'" (0x694E9C0)
3LKWAITNOTIFY "ExecuteThread: '11' for queue: 'weblogic.kernel.Default'" (0x6944580)
3LKWAITNOTIFY "ExecuteThread: '7' for queue: 'weblogic.kernel.Default'" (0x6E6A1E8)
3LKWAITNOTIFY "ExecuteThread: '8' for queue: 'weblogic.kernel.Default'" (0x6E6A5D8)
3LKWAITNOTIFY "ExecuteThread: '16' for queue: 'weblogic.kernel.Default'" (0x694AC28)
3LKWAITNOTIFY "ExecuteThread: '4' for queue: 'weblogic.kernel.Default'" (0x69371A8)
3LKWAITNOTIFY "ExecuteThread: '20' for queue: 'weblogic.kernel.Default'" (0x694FE58)
A lot of threads are waiting for the Thread with ident 0x23 to release an instance of the ReleaseRunImpl. Upon further investigation of the dump we quickly discover which thread this is and what it is doing:
2LKFLATMON ident 0x23 "ExecuteThread: '24' for queue: 'weblogic.kernel.Default'" (0x69550B8) ee 0x06954EE0
....
3XMTHREADINFO "ExecuteThread: '24' for queue: 'weblogic.kernel.Default'" (TID:0x101E8930, sys_thread_t:0x69550B8, state:R, native ID:0x114C) prio=5
4XESTACKTRACE at org.joda.time.format.PeriodFormatterBuilder$FieldFormatter.isSupported(PeriodFormatterBuilder.java(Compiled Code))
4XESTACKTRACE at org.joda.time.format.PeriodFormatterBuilder$FieldFormatter.getFieldValue(PeriodFormatterBuilder.java(Compiled Code))
4XESTACKTRACE at org.joda.time.format.PeriodFormatterBuilder$FieldFormatter.countFieldsToPrint(PeriodFormatterBuilder.java:1017)
4XESTACKTRACE at org.joda.time.format.PeriodFormatterBuilder$Composite.countFieldsToPrint(PeriodFormatterBuilder.java:1719)
4XESTACKTRACE at org.joda.time.format.PeriodFormatterBuilder$Separator.calculatePrintedLength(PeriodFormatterBuilder.java:1568)
4XESTACKTRACE at org.joda.time.format.PeriodFormatterBuilder$Separator.calculatePrintedLength(PeriodFormatterBuilder.java:1564)
4XESTACKTRACE at org.joda.time.format.PeriodFormatterBuilder$Separator.calculatePrintedLength(PeriodFormatterBuilder.java:1564)
4XESTACKTRACE at org.joda.time.format.PeriodFormatter.print(PeriodFormatter.java:240)
4XESTACKTRACE at com.xebia.client.domain.impl.AbstractRun.formatMillisToStringPeriod(AbstractRun.java:395)
The thread with id 24 seems busy printing a human readable form of milliseconds using the joda-time library. We are actually already using the latest stable (1.4) version of this library. Now what to do? Why is formatMillisToStringPeriod causing the PeriodFormatter to hang? It isn’t always hanging, so it might be just for specific values, but which values then? Let’s create a JUnit testcase which closely resembles the behaviour of the formatMillisToStringPeriod, so we can try a few values.
public class PeriodFormatterTest extends TestCase { private PeriodFormatter pf; protected void setUp() throws Exception { super.setUp(); pf = new PeriodFormatterBuilder().printZeroAlways().printZeroRarelyFirst().appendWeeks() .appendSuffix(" week", " weken").appendSeparator(" ").appendDays().appendSuffix(" dag", " dagen").appendSeparator(" ").appendHours() .appendSuffix(" uur", " uur").appendSeparator(" ").minimumPrintedDigits(2).printZeroAlways().appendMinutes().appendSuffix(":", ":") .appendSeconds().appendSuffix(" minuut", " minuten").toFormatter(); } public void testPeriodFormatter_aPeriod() { formatMillisToStringPeriod(4000, "PT4S", "00:04 minuten"); } public void testPeriodFormatter_0Period() { formatMillisToStringPeriod(0, "PT0S", "00:00 minuten"); } private void formatMillisToStringPeriod(long periodTime, String toStringExpected, String printExpected) { DateTime start = new DateTime(); Period p = new Period(start, start.plus(periodTime)); assertEquals(toStringExpected, p.toString()); assertEquals(printExpected, pf.print(p)); } }
Running this test in Eclipse, doesn’t show any problems, so we try the Maven build. Lo and behold, now Maven is suddenly hanging on the execution of the 2nd testcase in the JUnit test. Doing a process listing we suddenly realize our mistake, the Maven build is running with the same IBM JDK 1.4.2 that our container is using, but Eclipse is using its default Java5 JDK. So we back up and reconfigure Eclipse. This is better, now it also hangs on the execution of the JUnit test in Eclipse. Furthermore, we notice another thing: The CPU is going berserk at 100% usage.
We need to look into the joda-time code to see what is going awry. The best possible candidate we can look for is an infinite loop, as this would explain the CPU being used 100%. A close inspection of the methods which we see in the stacktrace points us into the direction of the following code snippet (taken from the method FieldFormatter.getFieldValue, lines 1437:1445 of PeriodFormatterBuilder.java):
if (isZero(period) && iFieldFormatters[iFieldType] == this) { for (int i = Math.min(iFieldType, 8 ) - 1; i >= 0; i++) { if (isSupported(type, i) && iFieldFormatters[i] != null) { return Long.MAX_VALUE; } } } else { return Long.MAX_VALUE; }
We see that this method has an open-ended loop. The int field iFieldType has a range of 0 upwards. This means that the loop only ends (immediately) for iFieldType with value 0. For any other value of iFieldType the loop in itself will never end. The only way to break out of this loop is to have the if-statement evaluate to true. Let’s have a look at what isSupported does to get a feel whether this mighet pose a problem.
boolean isSupported(PeriodType type, int field) { switch (field) { default: return false; case YEARS: // 0 return type.isSupported(DurationFieldType.years()); case MONTHS: // 1 return type.isSupported(DurationFieldType.months()); case WEEKS: // 2 return type.isSupported(DurationFieldType.weeks()); case DAYS: // 3 return type.isSupported(DurationFieldType.days()); case HOURS: // 4 return type.isSupported(DurationFieldType.hours()); case MINUTES: // 5 return type.isSupported(DurationFieldType.minutes()); case SECONDS: // 6 return type.isSupported(DurationFieldType.seconds()); case MILLIS: // 7 return type.isSupported(DurationFieldType.millis()); case SECONDS_MILLIS: // drop through // 8 case SECONDS_OPTIONAL_MILLIS: // 9 return type.isSupported(DurationFieldType.seconds()) || type.isSupported(DurationFieldType.millis()); } }
Ok, the problem becomes obvious. If for some reason iFieldType falls out of the region of [0,9], isSupported will always return false, which in turn would mean that the above loop would never end. Ok, we might have something here, let’s set a few breakpoints to see what’s going wrong… Unfortunately this doesn’t seem to work, stepping through the code using a debugger somehow magically fixes the problem, and the tests end normally as we would have expected in the first place. This poses a problem. We decide to revert to old-style debugging, namely adding logging to spot the problem.
for (int i = Math.min(iFieldType, 8 ) - 1; i >= 0; i++) { log.info("Unbounded looping i = " + i + ", and iFieldType = " + iFieldType); if (isSupported(type, i) && iFieldFormatters[i] != null) { return Long.MAX_VALUE; } }
Having modified the loop in this way, we see the following statement repeatedly scrolling past in the log:
Feb 14, 2007 11:37:27 AM org.joda.time.format.PeriodFormatterBuilder$FieldFormatter getFieldValue
INFO: Unbounded looping i = 2147483647, and iFieldType = 4
This is weird, somehow the value of our loop-variable seems to have jumped from 3 (the minimum of 4 and 8, minus 1) to Integer.MAX_VALUE in an instant. Even in a normal small loop it would take quite a lot of time to do over 2 billion iterations. However closer inspection of the log shows that it somehow starts out at Integer.MAX_VALUE. Luckily we can conclude that the IBM JDK guys implemented Math.min correctly, so there must be some other problem. Rewriting the code to resemble the next snippet solves the problem, but sheds no light on the cause.
int i = Math.min(iFieldType, 8 ); i--; for (; i >= 0; i++) { log.info("Unbounded looping i = " + i + ", and iFieldType = " + iFieldType); if (isSupported(type, i) && iFieldFormatters[i] != null) { return Long.MAX_VALUE; } }
This of course does not fix the problem that the loop still is potentially unbounded. The creator of joda-time should have checked that the loop variable doesn’t take a value outside the domain in which it has meaning. If he had done that, the problem we have investigated here would never have occurred. With the knowledge gained from this session, we have created a patch and submitted it to the creators, which makes the offending lines look like this:
int i = Math.min(iFieldType, 8 ); i--; for (; i >= 0 && i < 10; i++) { log.info("Unbounded looping i = " + i + ", and iFieldType = " + iFieldType); if (isSupported(type, i) && iFieldFormatters[i] != null) { return Long.MAX_VALUE; } }
The result of this debugging session gave us mixed feelings. On the one hand we managed to find the problem, and create a solution for it. However we still have no direct clue to what the difference is between initializing the loop variable in 2 statements, and doing it inside the for-statement. It seems safe to assume that the IBM JDK generates different bytecode for these two cases in such a way that the loop variable is suddenly correctly initialized in the new case.