Running long tasks in App Engine for Java

Processing an unknown number of results in App Engine is a problem. "Unknown" means "potentially long-lived", and "potentially long-lived" is a ticket to pain.

The problem

App Engine is designed to scale by enforcing some requirements like **short-lived requests**: any query will timeout after 60 seconds for front-end requests and 10 minutes for Task Queues. Any Datastore queries should be processed before this limit expires, and any pending work postponed using cursors and Task Queues. These limits do not apply to backends, but neither do automatic scaling; with backends, you must manage scaling yourself.

And then, there is the expiration timeout for queries: as a protection from getting stale data, **queries have a 30-second timeout**. Forget about 60-sec front-end, 10-min queues and unlimited backends, your queries will timeout in all these cases after 30 seconds.

Since we are going to be forced to use cursors and postpone work, we can go with task queues and get automatic retries and exponential backoff for free.

Postponing work

Let's say that we are changing a `User.name` attribute in the Datastore, and we need to propagate the change to other denormalized entities bound to it.

public class User {
   private Key key;
   private String name;
}
public class Ticket {
   private Key key;
   /** the owner of this ticket */
   private Key userKey;
   /** DENORMALIZED: The user name of the owner of this ticket */
   private String userName;
}

This "cascade update" is usually done in seconds, but you never know. As a general rule you don't want to design a system that fails on your best customers.

Through trial-and-error, this is the workflow that we devised:

  • Execute a query and start looping through results. For each result:
    * If still below the 30-second limit, process result.
    • If the 30-second limit has been exceeded, grab a cursor and execute the query again immediately, starting on that cursor.
    • If the 10-minute limit has been exceeded, grab a cursor and post the task again, starting on that cursor.
  • When there are no more results to process, end.

    Wrap it up in a library, label it "open source", and we get queue4gae:

public class UpdateTicketsTask extends CursorTask {
    private Key userKey;
    private String newUserName;
    private UpdateTicketsTask() {
        // for Jackson deserialization
    }
    public UpdateTicketsTask(Key userKey, String newUserName) {
        super("user-changes"); // the name of the Task Queue to use
        this.userKey = userKey;
        this.newUserName = newUserName;
    }
    @Override
    protected Cursor runQuery(Cursor startCursor) {
        EntityManager entityManager = EntityManagerFactory.getEntityManager();
        CursorIterator it = entityManager.createQuery(Ticket.class)
            .equal("userKey", userKey)
            .withStartCursor(startCursor)
            .withChunkSize(300)
            .withPrefetchSize(300)
            .asIterator();
        while (it.hasNext() && !queryTimeOut()) {
            Ticket t = it.next();
            t.setUserName(newUserName);
            entityManager.put(t);
        }
        return it.hasNext()? it.getCursor() : null;
    }
}
queueService.post(new UpdateTicketsTask(userKey, "Foobar from Hell"));

As long as `runQuery()` returns a not-null `Cursor` the method will be invoked again, passing the cursor as an argument (null on its first invocation). The Task class should keep an eye on `queryTimeout()` to check if the 30-second time limit is close to expire. Anything else (re-executing immediately, reposting to execute later, JSON serialization) is being taken care of automatically.

Queue4gae is **framework-agnostic**: it works with any web and persistence framework for App Engine (as long as the later supports Cursors) and uses Jackson for serialization to be able to inspect tasks in the App Engine Console.

Some closing thoughts

**This is a simple example**: specifically, transactions and batch updates have been left out. You may want to add those for real-world code.

**Is this implementation fool-proof?** No. If a user starts changing names like crazy there is the possibility that two tasks will collide, leaving a ticket instance in an inconsistent state. If this is important, consider adding some controls to delay a task until the previous one has finished.

**Is this implementation super-fast?** No. This implementation will not run over a Map-Reduce but use linear execution, and it will require some time to process a large result set. If a new release of your application requires updating everything you will be better with [appengine-mapreduce](https://code.google.com/p/appengine-mapreduce/). On the other side, queue4gae is optimized for cases that are "usually done in seconds, but maybe not", consuming less quota.

**Tasks should be idempotent**: if anything fails, the same task could be executed more than once. You should plan for that.

First post as GDE!

This is my first post as [Google Developer Expert](http://blog.icoloma.com/2014/01/im-google-developer-expert.html) for the Cloud Platform. Let me know if you like or hate it!