
Implementing Ruby jobs in the background

Published: November 06, 2009

I needed a way to kick off a background job that was triggered by an end-user on my Ramaze-backed website and could run for almost two hours! This article brings together all the elements to get the job done.

Preamble (a.k.a. Ramblings)

I can hear you exclaiming, “Two hours!? Are you nuts? You really don’t want to do that on a web application,” and you would be correct; I did not. But alas, my users wanted it and they waved money in front of my face to get it done. What’s a poor sap like me to do but to comply? So, what is this crazy task that needs to run so urgently it couldn’t wait for a midnight cron job to kick it off? Well, the task is basically exporting data from the live production database to a reporting database, and the users sometimes have to generate those reports as soon as humanly possible against the then-current production data. The problem is, the export process takes anywhere from 15 minutes to almost two hours, depending on how much data has changed since the last midnight run. Such a long-running process is clearly outside the normal bounds of a website’s usual request/response lifecycle and, if you’re like me and fronting your web app with Phusion’s Passenger, you run a real risk of Passenger’s process recycling killing the task mid-stream and losing your work in progress.

I didn’t quite know what I was going after when I started solving this problem, and the main thing I could think to google for was “background jobs ruby,” which led to a rash of articles about BackgrounDRb and many, many Rubyists complaining about its shortcomings and creating their own spin on what is, at its core, known as a message queue. Many of the articles I read just did not cover the whole scope of the problem I faced and usually offered simple solutions that lived well within the boundaries of their Rails (or Ramaze) framework’s inner workings. Since I had a process capable of consuming resources for upwards of two hours, I knew I had to completely escape the bounds of the framework and do the heavy lifting without fear that the parent process’s lifespan would get the job terminated mid-stream.

I also didn’t need a Rails plugin because, well, I’m building a Ramaze application and those fancy plugins just aren’t much help to me there. One of the things I’m loving about Ramaze is that it’s forcing me to learn the Ruby language, and it’s also forcing me to work harder to understand the components I’m pulling together in a solution. With Rails, I just pulled a plugin out of thin air, plugged it in and moved along. That’s great for rapid development, but more often than not I later found myself in a bind after the site went live, discovering that the wonderful plugin ran way too slow and not knowing a darn thing about how to fix the problem short of tearing up the whole plugin with a complete rewrite (in many cases). With Ramaze, I feel much closer to the language, and I’m also finding that there’s just a lot of fluff I flat out don’t need that the Rails framework brings along, or rather, encourages developers to toss into the soup. On the other hand, Ramaze’s philosophical approach of squeezing every ounce of functionality out of every line of code has really pushed me towards being a better Ruby programmer, and the same old time crunch nearly every developer faces forces me to get the job done with less code, using only the components actually critical to the task. That amounts to a one-two knockout punch for me, because the solutions I deploy are consistently rock solid and I fully understand what I just put out there end-to-end.

Getting Started

The full project source can be found on github.com, and you will also find my database enhancements to ProgressBar on github.com as well. The project pulls together Ramaze, beanstalkd (with the beanstalk-client gem), the daemons gem, Sequel, ProgressBar (extended into a DbProgressBar), the json gem, and jQuery on the front end.

The Players

With the ramblings out of the way, it’s time to get to the heart of the matter. There are three players in this game:

  1. The trigger - This is the process that starts the whole chain reaction: the web app where the user can click a link to start the export. In my case, that web app was built with Ramaze.
  2. The message queue - This is the middleman in the game. He sits around, accepts the message from the trigger and passes the buck to the worker. We call him “beanstalkd.”
  3. The worker - This is the process doing all the work. Amazingly, Ruby was up for the task, and the daemons gem is the ticket to making it work.

So let’s cover these in more detail…

The Worker (sort of…)

The worker has got to always be around to do the job whenever that job comes in, so this suggested to me that I needed a daemonized process watching the message queue for incoming jobs. Many solutions that I saw called for simple cron jobs running every minute or every 5 minutes or whatever, but I couldn’t fathom why I’d want to resort to such hackery. How are you going to scale that? What happens if two cron job instances fire and grab the same job request? What if the job crashes or the server dies mid-job? How is it reliably restarted? How do I ensure long-running jobs don’t pile up and consume all memory and CPU resources (e.g. every minute you start processing the next big job, and next thing you know ten of ‘em are running when you only have the capacity to run one or two)? Simply put: cron-job workers are a brute-force approach, rife with potential pitfalls.

Point being, I rapidly came to understand that I needed a message queue to serialize the jobs until a worker could take them and run them, and to guarantee each job runs before it gets flushed away. What was surprising was the number of message queue implementations out there in the wild! Luckily, beanstalkd existed and had a very elegant implementation and an API well-suited for Ruby. I installed it from source on an Ubuntu server, installed it via MacPorts on my MacBook Pro laptop, installed the Ruby client AND tested the small demos in irb, all in about 30 minutes. Sadly I can’t say the same for RabbitMQ, Starling/Workling, Apache ActiveMQ, and the others I tried in vain to get up and going. So, beanstalkd won out by default. Go to the beanstalkd website and follow their instructions. It’s that simple.
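Just to give a flavor of how simple the client API is, here is roughly the kind of smoke test you can run in irb once beanstalkd is up and the beanstalk-client gem is installed (the payload is just a placeholder, not the real export job):

require 'rubygems'
require 'beanstalk-client'

# connect to a locally running beanstalkd
beanstalk = Beanstalk::Pool.new(["127.0.0.1:11300"])

# push a trivial job onto the queue as a YAML-encoded hash
beanstalk.yput({:job_id => 42})

# pull it back off; reserve blocks until a job is available
job = beanstalk.reserve
puts job.ybody.inspect   # => {:job_id=>42}
job.delete               # tell beanstalkd we're done with it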

How are we gonna keep up with the jobs?

Although the message queue has a way of peeking in to learn the number of jobs, what’s running, and the status of said jobs, the background processes I was going to be running would run long enough that I wanted to report exactly what stage the job was crunching through, and this meant building periodic progress updates into the actual background job itself. The problem then is, how do we pass that information cleanly between worker and web app? After some thinking on the matter, I chose to pass along progress updates via the database, since all the headaches of I/O contention, persistence and deadlocking had already been solved there. I wasn’t quite sure how I was going to format my status updates and post the info to the database until I thought about the great little ProgressBar gem that I was already using for the task when I kicked it off at the console. After a few minutes of looking into the internals of ProgressBar, I knew I could easily extend it to report its status to the database rather than emitting to the console. An hour later, I had my DbProgressBar and another hour later I had my unit tests written and passing.
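To make that concrete, here is a minimal sketch of what a Sequel-backed output target for a progress bar could look like. It is an illustration only; the real DbProgressBar/SequelOutput code on github differs, and while the table and column names below match what the rest of this article uses, the class and method names are my assumptions:

require 'rubygems'
require 'sequel'

# Illustrative sketch only; see the real SequelOutput on github for the actual code.
class SequelOutputSketch
  # make sure the "progress_bars" table exists before the worker starts looping
  def self.prepare_database(db)
    db.create_table?(:progress_bars) do
      primary_key :id
      String    :job_id
      String    :progress_text
      Integer   :progress_percent
      TrueClass :job_finish, :default => false
    end
  end

  def initialize(job_id, db)
    @table  = db[:progress_bars]
    @row_id = @table.insert(:job_id => job_id.to_s, :progress_percent => 0)
  end

  # the progress bar writes its rendered bar and percent here instead of the console
  def print(text, percent)
    @table.where(:id => @row_id).update(
      :progress_text => text, :progress_percent => percent)
  end

  # flag the row so the web app knows it can stop polling
  def finish
    @table.where(:id => @row_id).update(:job_finish => true)
  end
end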

Oh, yeah, the Worker…

I digressed from talking about the worker to covering the message queue and passing messages around, so let’s get back to that worker script. As I pointed out earlier, I needed a worker script to do the heavy lifting for the job at hand and I wanted that puppy daemonized. One thing was perfectly clear in my mind: I didn’t want cron jobs, nor threads that depended on the parent process staying alive (remember how aggressively Passenger scrubs Ruby processes from memory!), nor any other means that depended on a parent process within the Ramaze framework kicking off the process directly. The “daemons” rubygem fits the bill perfectly! I was really surprised by this little gem of a gem. Daemons makes it extremely easy to wrap ordinary Ruby scripts in the usual start|stop|restart|status world you find inside the typical /etc/init.d/ Linux folder. What blew my mind was that I didn’t have to write the usual long-winded case statement for all those daemonizing commands. Just focus on writing the main task for the worker, wrap that task in an infinite loop, install it in the rc.d folders as appropriate, and you’re pretty much done. The following shows an example worker script:

require 'rubygems'
require 'daemons'
require 'beanstalk-client' 
require 'sequel'

ROOT_DIR = File.dirname(File.expand_path(__FILE__))
$LOAD_PATH.unshift(File.join(ROOT_DIR, 'lib'))

require 'dbprogressbar'
require 'sequel_output'

Daemons.run_proc('process_astroid_jobs', :log_output => true) do
  DB = Sequel.sqlite(File.join(ROOT_DIR, 'demo.db'))
  SequelOutput.prepare_database(DB)
  beanstalk = Beanstalk::Pool.new(["0.0.0.0:11300"])

  loop do
    job = beanstalk.reserve 
    jh = job.ybody
    puts "processing #{jh[:job_id]}"

    total = 15
    pbar = DbProgressBar.new("job ##{jh[:job_id]}", total, 
      SequelOutput.new(jh[:job_id], DB))

    total.times do
      pbar.inc
      sleep(1)
    end

    pbar.finish
    puts "finished #{jh[:job_id]}"
    job.delete
  end
end

There are a few key things to note about the above script. Firstly, wrapping it in Daemons.run_proc tells the daemons gem we want to run this like we’d run a typical server. SequelOutput is the extension I wrote for ProgressBar to get it to output to the database, and at the start of the script we ensure the database is ready (namely, that the “progress_bars” table is created) before we go into our infinite loop to process jobs. Next up, the loop itself. It’s not a runaway infinite loop that’ll end up consuming all resources, because we are calling the beanstalk queue with “reserve”, which is a blocking call; thus we sit and wait until a job comes through. As soon as we get a new job request, we create the DbProgressBar and get to work processing….twiddling our thumbs for 15 seconds in this case! When the job’s completed, we remove it from the message queue and begin again.
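One more nicety: because Daemons.run_proc handles the command-line arguments for you, controlling the worker feels just like controlling any init.d service. Assuming the script above is saved as process_astroid_jobs.rb (the filename is my assumption), “ruby process_astroid_jobs.rb start” daemonizes it, “status” tells you whether it is running, “stop” shuts it down, and “run” keeps it in the foreground, which is handy while debugging.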

The Trigger

Finally, with the back-end done, we could focus on enabling the user to trigger the job on the front-end. Of course, that was as simple as firing up a connection to the beanstalk server and pushing a job onto the queue in a controller’s action (a hypothetical sketch of that trigger action appears at the end of this section). But how do we watch the job and report to the user what’s going on without overwhelming our server? The answer came along with a “Comet with Ramaze” example by Pistos. What a lovely idea: let the server intentionally hold onto the client’s request until it has something to report back! I knew it’d never hold onto the request for more than 2 or 3 seconds, so this was a perfectly workable model for keeping traffic levels sane and timely. This was also my first time playing around with JSON in AJAX calls, so I got to spend some time flexing my jQuery muscles a bit and also discovered the Ruby json gem that makes it dirt simple to turn a hash into a JSON response. This is just a small excerpt from the controller/main.rb check_job_progress action:

# report status as finished when the job is completed
if progress[:job_finish]
  return {:progress_text => "finished.", :progress_status => "finished"}.to_json

# otherwise report the percent done as the job status 
else
  return {:progress_text => progress[:progress_text].gsub(' ', '&nbsp;'),
      :progress_status => progress[:progress_percent]}.to_json
end

Notice the call to to_json to convert that hash into a JSON response, and how clean the Ruby code is even though we’re passing a rich data object back to our browser client. I substituted blanks with the &nbsp; character to keep the progress bar from collapsing all those empty blanks between the two vertical bars. I decided to pass along a progress status so that I could check when things were done and thus stop polling the server for job updates. This is just a demo, so try not to snicker too much at my use of plain English to convey critical job statuses. At least the intent is clear! The jQuery script is also correspondingly simple.

function check_progress() {
  var job_progress_element = $('#job-progress');
  $.getJSON('/check_job_progress', {job_id: job_progress_element.attr('value')},
    function(data){
      job_progress_element.html(data.progress_text);
      if (data.progress_status != "finished") check_progress();
    });
}

$(document).ready(function() {
  if($('#job-progress').length) check_progress();
});

As you can see, it simply asks the server for an update and when it receives one, updates itself and asks again! It will check the job status after each update to see if the job has completed or not, thus stopping its persistent nagging once we’ve reached the Kingdom of Far, Far Away.
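For completeness, here is roughly what the trigger action mentioned earlier boils down to. This is a hypothetical sketch, not the project’s actual controller code; the class name, action name, and job id scheme are all my assumptions.

require 'rubygems'
require 'ramaze'
require 'beanstalk-client'
require 'json'

class MainController < Ramaze::Controller
  # The user clicks the link; we hand the job off to beanstalkd and return
  # immediately so the page can begin polling check_job_progress.
  def start_export
    beanstalk = Beanstalk::Pool.new(["127.0.0.1:11300"])
    job_id    = Time.now.to_i             # placeholder job identifier
    beanstalk.yput({:job_id => job_id})   # the worker's blocking reserve picks this up
    {:job_id => job_id}.to_json
  end
end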

