Friday, April 30, 2010

Spring Batch integration module for GridGain

To use Spring Batch in a scalable, distributed manner for processing huge amounts of data, I am currently developing some components that make it easier to integrate Spring Batch with a compute/data grid.

Spring Batch offers several ways to provide scalability; the one that best suits my needs is remote chunking.

Since I had already done some investigation with GridGain, I chose this framework to implement a distributed remote chunking system that can be easily integrated into any existing Spring Batch system.

Using GridGain is really straightforward, and setting up a grid on a development machine doesn't require much configuration.

The only issue I faced comes from the fact that GridGain uses serialization to deploy tasks on nodes. To be deployable, a remote ChunkProcessor must contain a serializable ItemProcessor and ItemWriter, which unfortunately is not the case by default.

So instead of creating new interfaces, I made a SerializableChunkProcessor which only accepts a serializable ItemProcessor and ItemWriter. It's surely not the smartest solution, but since I can't modify the default interfaces in Spring Batch and I don't want to create my own, this workaround will suffice.
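To illustrate the idea, here is a minimal sketch of one way to accept only serializable collaborators without introducing new interfaces. It is not the module's actual code: `Processor`, `Writer`, `SerializableChunk`, `UpperCase` and `CollectingWriter` are hypothetical stand-ins declared locally so the example compiles without Spring Batch or GridGain, and the in-memory round trip only imitates what a grid does when it ships a task to a node.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SerializableChunkSketch {

    // Hypothetical stand-ins for Spring Batch's ItemProcessor / ItemWriter.
    interface Processor<I, O> { O process(I item); }
    interface Writer<O> { void write(List<O> items); }

    // Assumption: the constraint is expressed with intersection bounds on a
    // generic constructor, so non-serializable collaborators fail to compile.
    static final class SerializableChunk<I, O> implements Serializable {
        private static final long serialVersionUID = 1L;
        private final Processor<I, O> processor;
        private final Writer<O> writer;

        <P extends Processor<I, O> & Serializable, W extends Writer<O> & Serializable>
        SerializableChunk(P processor, W writer) {
            this.processor = processor;
            this.writer = writer;
        }

        // Process every item of the chunk, then hand the results to the writer.
        void process(List<I> chunk) {
            List<O> processed = new ArrayList<>();
            for (I item : chunk) {
                processed.add(processor.process(item));
            }
            writer.write(processed);
        }

        Writer<O> writer() { return writer; }
    }

    // Example collaborators: both implement Serializable explicitly.
    static final class UpperCase implements Processor<String, String>, Serializable {
        private static final long serialVersionUID = 1L;
        public String process(String item) { return item.toUpperCase(); }
    }

    static final class CollectingWriter implements Writer<String>, Serializable {
        private static final long serialVersionUID = 1L;
        final List<String> written = new ArrayList<>();
        public void write(List<String> items) { written.addAll(items); }
    }

    // Serialize and deserialize in memory, imitating a transfer to another node.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    // Build a chunk, copy it through serialization, and run the copy.
    static List<String> runOnCopy(List<String> items) {
        SerializableChunk<String, String> chunk =
            new SerializableChunk<String, String>(new UpperCase(), new CollectingWriter());
        SerializableChunk<String, String> remoteCopy = roundTrip(chunk);
        remoteCopy.process(items);
        return ((CollectingWriter) remoteCopy.writer()).written;
    }

    public static void main(String[] args) {
        System.out.println(runOnCopy(Arrays.asList("a", "b"))); // prints [A, B]
    }
}
```

With this shape, passing a processor or writer that does not implement Serializable is rejected by the compiler instead of failing later with a NotSerializableException when the task is deployed.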

Usage
In the job application context used for the integration test, the 'real' ItemProcessor and ItemWriter are injected into the GridGain chunk writer.

Download
You can download the spring-batch-integration-gridgain module here:
http://github.com/downloads/aloiscochard/spring-batch-integration-gridgain/spring-batch-integration-gridgain-0.0.1-SNAPSHOT.jar

If you want to see a full working sample, take a look at the integration test. The full project sources can be downloaded here:
http://github.com/aloiscochard/spring-batch-integration-gridgain

3 comments:

  1. Hi Alois,

    your integration looks very nice, just one question: did you previously evaluate Hadoop to perform scalable batch processing? If so, why did you choose Spring Batch over Hadoop?

  2. Are there a lot of people using Spring Batch? The site looks very *ahem* different from the main Spring website.

  3. @Sergio
    It's a long story, and the answer could fill a whole blog entry ;)
    In a nutshell, we wanted the capability to use different data sources.
    Hadoop is a fantastic framework, but when using Hadoop you are tightly coupled to the storage mechanisms of Hadoop (HBase).
    In this project we need to be able to process data from a variety of sources: a data grid (Terracotta), but also RDBMSs like Oracle, or other sources like Documentum DocBase.
    Using GridGain with Hadoop gives a lot of flexibility for processing data.
    Some indexes are stored in Terracotta using Compass (Lucene behind the scenes), but I'm really interested in your framework and the possibility of using it with GridGain (as Compass is going to have integration with elastic-search soon).

    @Ashwin
    I don't work for Spring, so I can't give you any stats about its usage. But as far as I know from what I read online, spring-batch seems to have a strong user base.
    Batch processing is something of a 'niche' market, though. In our project we need to process XML files of hundreds of gigabits, and other things of that kind.
    Spring Batch enables us to process this data with great control over the flow and the ability to recover from failures.
    Don't hesitate to ask your question on the spring-batch forum; I'm sure Dave Syer or other spring-batch users/developers will be happy to answer it!
