Pentaho Kettle and Integration Testing

30 Sep, 2009

Recently, for our project, we started using Kettle for ETL purposes. Pentaho Kettle provides a UI-based tool. Initially it takes quite some time to get used to the Kettle UI, as it is difficult to visualize how to orchestrate the available Kettle steps to solve a business problem. Once you know how to use it, it’s all about dragging and dropping a step and configuring it through the UI. In our experience, it’s pretty easy to design 90% of a solution quickly, but the remaining 10% involves a lot of research and, in the end, some hacks that we never liked.

As we created Kettle transformations and jobs, we were not very sure about their testability. After some research we found that we could use the BlackBoxTests class available in the Kettle distribution for test purposes. The fundamentals are quite simple: you pass some inputs and define an expected file, and after executing the Kettle transformation you get an actual output file. BlackBoxTests asserts that the expected file matches the actual file. So, for instance, if you have a Sample.ktr under test, BlackBoxTests will expect Sample.expected.<txt/xml/csv> as the expected file and Sample.actual.<txt/xml/csv> as the actual file. It tests all available transformations under a folder and its subfolders.
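The naming convention and comparison can be sketched in plain Java. This is an illustrative sketch, not the actual BlackBoxTests API; the class and method names below are my own:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch of the expected-vs-actual check that BlackBoxTests
// performs; names here are illustrative, not Kettle's API.
public class BlackBoxCheck {

    // Derive the companion file name from a transformation file, following
    // the Sample.ktr -> Sample.expected.csv / Sample.actual.csv convention.
    static String companion(String ktrName, String role, String ext) {
        String base = ktrName.replaceFirst("\\.ktr$", "");
        return base + "." + role + "." + ext;
    }

    // Compare the actual output to the expected file, line by line.
    static boolean matches(Path expected, Path actual) throws IOException {
        List<String> exp = Files.readAllLines(expected);
        List<String> act = Files.readAllLines(actual);
        return exp.equals(act);
    }
}
```

A file-per-transformation convention like this is what lets the test runner walk a folder tree and check every .ktr it finds.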
By default, Kettle uses kettle.properties (available under the $HOME/.kettle folder), which creates complications from a testing point of view; you should be able to test a Kettle transformation in isolation. That’s why, instead of using kettle.properties, we planned to use an application-specific property file and pass it to the TransMeta class through the available injectVariables() method. We were somewhat successful, but later found out that Kettle still uses kettle.properties even if we use a different property file.
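A minimal sketch of this approach, assuming an application-specific properties file. Only the plain-Java loading is shown; the TransMeta.injectVariables() call appears as a comment, since it needs the Kettle jars on the classpath, and the property names are illustrative:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Sketch of the approach described above: load an application-specific
// properties file instead of relying on the global kettle.properties,
// and hand the result to the transformation.
public class IsolatedVariables {

    // Convert a java.util.Properties source into the Map<String, String>
    // shape that TransMeta.injectVariables(Map) accepts.
    static Map<String, String> load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        Map<String, String> vars = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            vars.put(name, props.getProperty(name));
        }
        return vars;
    }

    public static void main(String[] args) {
        Map<String, String> vars =
            load(new StringReader("db.host=localhost\ndb.port=5432\n"));
        // With the Kettle API on the classpath, one would then do:
        //   TransMeta transMeta = new TransMeta("Sample.ktr");
        //   transMeta.injectVariables(vars);
        System.out.println(vars.get("db.host")); // prints "localhost"
    }
}
```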
After a lot of debugging we found the culprit. BlackBoxTests uses EnvUtil.environmentInit(), which does all the magic: it loads kettle.properties by default and, to our horror, loads it into java.lang.System.
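The effect can be reproduced with the standard library alone. This mimics what EnvUtil.environmentInit() does to the JVM; it is not Kettle code, and the property name is made up:

```java
import java.util.Properties;

// Stdlib-only demonstration of the behaviour described above: merging a
// properties file into java.lang.System makes its values visible to every
// class in the JVM, which is exactly what breaks test isolation.
public class SystemPollution {

    static void mergeIntoSystem(Properties props) {
        for (String name : props.stringPropertyNames()) {
            System.setProperty(name, props.getProperty(name));
        }
    }

    public static void main(String[] args) {
        // Stand-in for the contents of kettle.properties.
        Properties fromKettleProperties = new Properties();
        fromKettleProperties.setProperty("DB_HOST", "prod-db");

        mergeIntoSystem(fromKettleProperties);

        // Every subsequent System.getProperty("DB_HOST") anywhere in the
        // JVM now returns "prod-db", regardless of what a test injected.
        System.out.println(System.getProperty("DB_HOST")); // prints "prod-db"
    }
}
```

Because System properties are global and mutable, tests that run after such a merge can no longer assume a clean environment.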
We quickly got rid of EnvUtil, but found that this still wasn’t enough to pass the properties from outside. It works for the current transformation, but somehow Kettle is not able to pass these properties on to embedded sub-transformations. It worked earlier only because EnvUtil.environmentInit() loads the properties into java.lang.System.
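A stdlib-only sketch of that scoping problem, with a hypothetical child resolver standing in for the embedded sub-transformation (none of this is Kettle code):

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of the scoping problem described above: variables injected
// into the parent live in a local map, while the embedded sub-transformation
// stand-in resolves only against java.lang.System, so the injected values
// never reach it.
public class ScopeDemo {

    // Parent scope: variables injected from an external properties file.
    static final Map<String, String> parentVars = new HashMap<>();

    static String resolveInParent(String name) {
        String v = parentVars.get(name);
        return v != null ? v : System.getProperty(name);
    }

    // Child scope stand-in: sees only System properties.
    static String resolveInChild(String name) {
        return System.getProperty(name);
    }

    public static void main(String[] args) {
        parentVars.put("OUTPUT_DIR", "/tmp/test-run");
        System.out.println(resolveInParent("OUTPUT_DIR")); // /tmp/test-run
        System.out.println(resolveInChild("OUTPUT_DIR"));  // null
    }
}
```

Loading everything into System "fixes" the child lookup, but at the cost of the global pollution shown earlier.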
Overall, though we were finally able to do the testing with BlackBoxTests in isolation with some hacks, we concluded that the Kettle code is not designed to be testable and can be termed legacy code in Michael Feathers’s sense.

Matt Casters
12 years ago

It’s always good fun to hear about hard-core programmers that try to solve business intelligence issues.
If you don’t want to load the information in the kettle.properties file, here’s some advice: don’t put anything in there! The Kettle variable or named parameter system does indeed NOT put anything in java.lang.System.
Not testable? You got to be kidding me.
Just because you have problems grasping a few basic concepts, that doesn’t mean you have the right to call Kettle “legacy code” or throw around other insults. Try to find another way to vent your frustrations.

12 years ago

Great post. We are currently looking into the same issue — the ability to test Kettle transforms in isolation — and this is a better explanation than I’ve seen anywhere else (including the Pentaho wiki).
Matt – your solution of not using kettle.properties is a good one, and I’d agree that Kettle is far more testable than most other ETL tools… no need to get so defensive, though. The poster pointed out some legitimate issues, and this is good feedback for your community.

Shrikant Vashishtha
12 years ago

@Matt – We already used a different properties file and passed it in using TransMeta.injectVariables(). It works fine until we have a sub-transformation underneath; somehow the transformation is not able to pass properties to the sub-transformation. If you use EnvUtil.environmentInit(), it overrides the properties passed in with the ones existing in kettle.properties (it should have been the other way round).
I may have been a bit harsh in calling the Kettle code “legacy code”, but I could see source files of 6000+ lines which are hard to understand and certainly not designed for testability.
While working with Kettle I found the following roadblocks, for which solutions may exist but I could not find them in the available resources:
1. Manual restart after failure (the ability to restart from where it failed)
2. Transactions spanning multiple insert steps
3. Automatic retry (for instance, for an HTTP service or web service) and recovery for items that have exhausted their retry count
4. Integration testing with database independence (we are using an actual db instead of an in-memory db right now)
5. Web services portability. It doesn’t work for certain standards, and it’s difficult to ask a web-service vendor to change the web service itself.
Many times we hit a wall and had to find some workaround. The integration testing (using Continuous Integration) example I mentioned in the blog is one of them.

Max Hofer
11 years ago

ShriKant, have you made any progress in this direction?
I’m also trying to figure out how to test transformations/jobs in an automated way.

Shrikant Vashishtha
11 years ago

Hi Max,
We felt that it would be much more difficult to do and would require a lot of time and resources, as there was not much documentation available. Also, we didn’t feel very confident that we were in control while implementing an enterprise-level solution with Kettle, because we kept seeing problems crop up. We actually stopped using Kettle and adopted a new strategy based on a Spring Integration implementation, and it worked for us.

9 years ago

I ran into a similar problem recently. Our application was based on Spring Framework and Hibernate, and I based the testing on Spring’s JUnit integration tests. I put some notes here:


[…] an article from 2007 talking about a framework for PDI testing, but has no code.  Here’s a blog post with some comments about this […]

Dan Moore
9 years ago

I wasn’t able to find the BlackBoxTests classes you mention. Do you know if those are still distributed with Kettle?
Also, I’ve written a series of blog posts on how to test Kettle transformations. Basically, you build a parallel Kettle job that exercises your logic and compares it to a golden set of values. (My post focuses on using file-based golden data, but it can easily be extended to database tables.) The most recent post is here:
It’s a bit different than JK’s solution in that you don’t have to write Java code (but you do have to write some ETL code).
