At my current project we have to parse a large CSV-file with related data. Each line contains an "entity" or "object" and most lines are related through some references to other lines. The data needs to be send to the server, but some validations and manipulations have to be performed beforehand. Files could become so large, that holding all objects in memory at the client machine could be a problem. We decided to use a Database to temporarily hold the data and to search for related lines. The lines are not necessarily ordered, so we need to wait for the last line to be parsed before validating and manipulating the data. We would like the data to be captured in a nice object oriented model. We used Hibernate and Apache Derby for our implementation and it turned out to be quite convenient!
Schema generation
Hibernate allows for runtime schema generation. This feature is mostly used for tests, but in this case it is very usefull. We don’t need the database after the data has been sent, so we create the schema when we setup the Hibernate SessionFactory and drop the schema just before closing it. This will also keep the disk usage after running the application at a minimum.
Apache Derby is a fairly mature database that can be run in process. It will use the disk for storage. We tried using HSQLDB, but searching a large database proved to be very slow.
Inserting data
Now let’s say we have lines like these:
[code]10;hospitalId;…
20;hospitalId;patientId;…
30;hospitalId;patientId;treatmentId;…
30;hospitalId;patientId;treatmentId;…
20;hospitalId;patientId;…
30;hospitalId;patientId;treatmentId;…
10;hospitalId;…
20;hospitalId;patientId;…
30;hospitalId;patientId;treatmentId;…
…
[/code]
We first want to parse this into objects like these:
[java]
class Hospital {
int hospitalId;
String location;
…
}
class Patient {
int hospitalId;
int patientId;
String name;
…
}
class Treatment {
int hospitalId;
int patientId;
int treatmentId;
int numberOfDays;
…
}
[/java]
Each line represents an object. The first part of the line identified the type of the object. The hospitalId uniquely identifies an Hospital, but the patientId might not be unique across hospitals. The same holds for the treatmentId across patients. Since we cannot ensure that the rows will actually be in order, we cannot make the associations between Hospitals, Patients and Treatments on a line-by-lnie basis. Each Treatment thus has to hold on to it’s hospitalId, to enable making the association later on.
The Hibernate mapping to insert this data is rather simple:
[xml]
<hibernate-mapping default-access="field">
<class name="Hospital">
<id type="int" column="ID">
<generator class="native"/>
</id>
<property name="hospitalId"/>
<property name="location"/>
</class>
<class name="Patient">
<id type="int" column="ID">
<generator class="native"/>
</id>
<property name="hospitalId"/>
<property name="patientId"/>
<property name="name"/>
</class>
<class name="Hospital">
<id type="int" column="ID">
<generator class="native"/>
</id>
<property name="hospitalId"/>
<property name="patientId"/>
<property name="treatmentId"/>
<property name="numberOfDays"/>
</class>
</hibernate-mapping>
[/xml]
We do not worry about column names, since we will only access the database through Hibernate. Notice furthermore the absence of the name property on the id tag. This means that Hibernate will worry about the ids and that out code does not have to handle them, so there is no id field in any of the classes. This has some far reaching consequences, that make it only usefull in a small number of cases, but our case happens to be one of them (later on in the project we might have to introduce the field nonetheless…).
Reading data
Now, when reading data we would like to use object associations between Patient objects and Hospitals objects and Treatment objects and Patient objects. Actually, a Treatment has to know it’s Hospital intimitally too, since the Patient data might be corrupted. The classes need to be adapted like this:
[java]
class Hospital {
int hospitalId;
String location;
…
}
class Patient {
int hospitalId;
int patientId;
Hospital hospital;
String name;
…
}
class Treatment {
int hospitalId;
int patientId;
int treatmentId;
Hospital hospital;
Patient patient;
int numberOfDays;
…
}
[/java]
Now we would like Hibernate to fill in those fields with the related objects. Let’s first look at the Patient mapping. We need some sort of many-to-one association in the mapping:
[xml]
…
<class name="Patient">
<id type="int" column="ID">
<generator class="native"/>
</id>
<many-to-one name="Hospital" column="hospitalId" property-ref="hospitalId" update="false" insert="false" foreign-key="none">
<property name="hospitalId"/>
<property name="patientId"/>
<property name="name"/>
</class>
…
[/xml]
The association mapping is quite intricate:
- The column attribute indicates that the association is mapped on the Patient.hospitalId column.
- The property-ref attribute indicates that the other end of the association is mapped on the Hospital.hospitalId column. This is not the id property of the Hospital entity!
- The hospitalId property is already mapped. Hibernate will complain if you map a column twice unless the mapping is attributed with update=”false” insert=”false”
- The foreign-key attribute set to “none” will prohibit Hibernate generating a foreign key constraint between the columns. Since the order in which the rows will be inserted is uncertain, the constraint would be violated when the Client record is inserted first. Furthermore, corruptions in the data would prevent records from being inserted, while we want to validate the data after it has been inserted in the database.
Pff, there’s quite a lot going on in that single line of mapping!
Now let’s take a look at the Treatment mapping. Treatment maps onto Patient, but the reference is on two columns. We need to specify these columns as a special construct in the Client mapping.
[xml]
<class name="Patient">
<id type="int" column="ID">
<generator class="native"/>
</id>
<many-to-one name="hospital" column="hospitalId" property-ref="hospitalId" update="false" insert="false" foreign-key="none"/>
<properties name="reference">
<property name="hospitalId" index="IDX_PATIENT_REFERENCE"/>
<property name="patientId" index="IDX_PATIENT_REFERENCE"/>
</properties>
<property name="name"/>
</class>
<class name="Hospital">
<id type="int" column="ID">
<generator class="native"/>
</id>
<many-to-one name="hospital" column="hospitalId" property-ref="hospitalId" update="false" insert="false" foreign-key="none"/>
<many-to-one name="patient" property-ref="reference" update="false" insert="false" foreign-key="none">
<column name="hospitalId"/>
<column name="patientId"/>
</many-to-one>
<property name="hospitalId"/>
<property name="patientId"/>
<property name="treatmentId"/>
<property name="numberOfDays"/>
</class>
[/xml]
In essence, the mapping is the same. The only difference is the property-ref is pointing to a properties construct in the Patient mapping. This construct groups a number of properties. The many-to-one mapping needs the same number of columns as subelements in the same order. Adding the same index name to the properties in the Patient mapping will ensure fast lookup.
Your world is now connected!
So now, we can insert non related objects into the database using Hibernate, but when we fetch them from the database using Hibernate… BAM!… Your world is connected! All data manipulation logic can now safely assume that the associated entities are present. That is of course after validating that the references exist. There is one slight problem: We are not yet protected against duplicates. When hibernate tries to load a Treatment object and it finds that there are two Patient objects that fullfill the reference, it will throw an exception. This cannot be avoided: Hibernate cannot decide which row to use to read the Patient object. In our case we ignore the errors, since there is a validation that checks the uniqueness of patientId. The validation will register an error and further processing of the records is useless. In other cases it might be more intricate.