For my current project we use Kettle to process data from a number of sources and store it in a shared database. Kettle has great support for parsing data from a variety of sources, transforming it and writing it to a variety of destinations. One problem that often arises when inserting data in a relational database is the need for a syntactic, unique key that will be generated when a new row is inserted and later in the process used for other rows that refer the primary row. There are many solutions for this problem, both in the RDBMS and in the Java space. This blog reports on a search through several code bases to come up with a good solution in kettle which, unfortunately, still evades me.
A complexity that adds to the problem is that it should work both in the production setup as well as in our kettle unit tests. For production we use Postgresql and for testing we use HsqlDB. In it self, this is a challenge with Kettle, but by using a generic database connector and injecting the JDBC properties through variables this is feasible.
Both Postgresql and HsqlDB have (some form) of support for generated keys:
Postgresql To create an auto-increment column in Postgresql you use the (big)serial datatype. This will generate a simple numeric column owning a sequence that is used for its default value. The name of the sequence is tablename_columnname_seq.
HsqlDB An auto-increment using the following column definition: … generated by default as identity (start with 1).
When a row is inserted that does not specify a value for the auto-increment column, its value is generated, so far so good.
The problem arises when we need the generated value for the just inserted row. And thus the quest begins…
JDBC3 — Generated Keys
From the surface, this should be an easy problem, since JDBC (since v.3) has support for generated keys, through several interfaces:
Kettle’s TableOutput step uses this feature to return the generated keys for each row inserted.
Unfortunately neither HsqlDB (latest version 1.8.0.10) nor Postgres (latest version 8.4) support these interfaces.
Hibernate — Inspiration
In the same project we use Hibernate to generate the schema and to work with the data that has been gathered and integrated. Hibernate handles the situation quite gracefully. Using the @GeneratedValue(strategy=IDENTITY) it generated the correct schema for both HsqlDB and Postgresql through its specialized dialects.
To fetch the generated value, it uses the identity() function on HsqlDB and it queries Postgresql for the current value of the sequence. This works perfectly, although in highly concurrent situation one might expect problems.
Unfortunately, on Kettle this wont work, because separating the insert statement and the query for the generated key in two steps, will make queries run in separate threads. This is how Kettle works. To be able to fetch the last sequence value for each row, it has to happen within the same step.
Postgresql specific — Inline solutions
So, does Postgres, being a cutting edge database not support a feature that allows you to generate an id and then use it? Isn’t this something that is quite useful? Of course it does! It uses a custom syntax using the following statement:
INSERT INTO table (col1, col2, …) VALUES (…) RETURNING col0
Thus you instruct Postgres to return the value generated for the auto-increment column in a ResultSet. This statement should thus be executed using the executeQuery() method in stead of the executeUpdate() method and the TableOutput step has no support for this. The implementation of this step is on the whole quite rigid: there are no points to specify and custom SQL to execute.
And of course HsqlDB does not support this syntax, so we would have to write some logic to determine what to execute when…
Query the sequence before inserting
This approach will query the database sequence in a separate step before inserting the data in the table. It works fine (for postgresql) although it bypasses the sequence linked to the column. We have to manually make sure to call out to the correct sequence to keep the two in synch.
On HsqlDB, a sequence is not automatically created for an identity column. We would have to manually add the sequences to the schema. Also the syntax for querying a sequence is different on HsqlDB.
The problem with this approach is that Kettle does not support sequences in the Generic Database adapter. To implement this solution we would have to enhance the generic database adapter to support sequences and also make it flexible enough to work on both databases.
Conclusion
Although this seems like a simple and common problem, a good solution in Kettle does not seem to exist. The quest continues…