Monday, April 26, 2004

Data access sanity check

(Cross posted from my blog)

One of the first tasks that I performed for my employer was to diagnose and resolve a several minutes long CPU spike on the database server for one of our J2EE applications. All of our servers were well monitored, and without much ado we were able to pin the spike on a specific use case. As it turns out, the culprit was a use case for exporting Loan Application records from our company to a client. To accomplish this task several thousand Entity Beans were instantiated, data was extracted from the objects, and a comma-delimited output file was generated. Adding insult to injury, in addition to instantiating thousands of EJBs, the collection exceeded our cache size resulting in the passivation and activation of beans (to and from the database) as we traversed the collection.

In retrospect it's hard to fathom how this implementation strategy got past a design review, but in the heat of battle all sorts of less-then-optimal solutions creep into most products.

The first tack that I took to resolve this issue was to pursue a JDBC rather then an Entity Bean approach (inspired by the Fast-Lane Reader pattern), and this resulted in a substantial performance gain (the use case executed in a third of the original time). Fortunately, my colleagues are way more SQL savvy then I am, and they suggested pursuing a stored procedure approach. The stored-procedure implementation of the use case executes in about 1/100th of the time required for the original EJB-centric solution.

This is one of those great "war stories" that can be used to make all sorts of points. It speaks to inadequate design reviews, the need for system monitoring, the misuse of Entity Beans, the value of teams with diverse skill sets, and numerous other "soap box issues" that I've been known to pontificate about (a former co-worker coined the tern "johntification" to refer to my frequent monologues).

Today I would like to use my "Entity Bean Based Loan Export" war story to talk about optimizing data access, and how we really ought to code in a way that enables it.

The goal of our Loan Export use case was well defined:

Produce an output file that contains data from Loans that meet specified criteria.

Note that the use case concerns Loans; not Java objects; not database records. This is a key point to remember. Depending on the current state of the system, the data that constitutes the "Loan" could be on a hard disk, in the cache of the database system, or in the application's memory (real or virtual).

The best strategy for collecting data can vary wildly based on where the data currently resides. Using my war story as an example; if Entity Beans are already instantiated for all of the Loans to be exported, then producing an output file from the Entity Beans will generate no additional load on our database server and should be pretty zippy. If the Loan data is still exclusively on disk, then the stored procedure approach is the way to go (assuming that I'm using a single RDBMS).

I am not sure how to clearly express the point that I want to make, but it has something to do with optimizing for the present and planning for the future. One solution may be optimal if all objects can reside in memory, while another may be optimal if the number of objects exceeds some threshold. We need to code in a manner that allows an "optimized" solution to be injected without disrupting or confusing our intent.

These thoughts gell with the goals of SQL query optimization. In some database systems, the SQL that you submit is not the SQL that is executed. Behind-the-scenes query optimizations are applied by the database engine, resulting in better overall performance.

In the SQL research world, the goal is along the lines:

The query that you specified is sufficient for the system to determine the records that you want to retrieve. The procedure by which those records are obtained is an implementation detail that you need not worry about.
Wouldn't it be delightful to write Java data access code along similar lines? Consider a "collection populator" service. Specify the type of objects the collection should hold, specify the criteria that the objects within the collection must meet, and let the service worry about the details of populating the collection.

Of course there's no such thing as a free lunch: You are going to have to write all of the methods of your "collection populator" service. The advantage will come later if your data sources change or you need to develop a more efficient implementation.


A Brief Introduction to IoC by Sam Newman -- provides a good example of using IOC to inject specific DAOs.

Post a Comment