Terminator - Timeout without Mercy
September 11th, 2008
If you have been following my posts on the Ruby-Talk, Ruby on Rails and even RSpec mailing lists (and who wouldn’t?! I mean, aside from my mother) then you will have noticed I have been banging my head against a brick wall on the subject of system calls not being handled by the Timeout library in Ruby…
I’ve done a lot of work with this. I mean a lot.
So I thought I would share.
It all starts with a basic requirement I have, which is to replicate a data set from one side of the planet to the other. This involves logging into a database server remotely using ActiveRecord across some encrypted VPNs, querying some tables, and then inside of a transaction, replicating down certain data on certain rows.
The solution I have works well: I can replicate about 2-5 rows per second across the link, which is more than adequate (the dataset changes up to about 100 rows per minute). The problem is that occasionally, ActiveRecord’s link to the remote database will hit a snag, then Ruby will wait, and wait, and wait to time out…. This can take hours, and in fact, sometimes it never does time out, leaving a zombie process.
To avoid multiple replication problems, I have lock files which prevent further copies of the replicator firing up if it detects that one is already running.
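For the curious, a lock file guard can be sketched roughly like this. It is a minimal sketch and not the replicator's actual code: the helper name with_lock_file and the use of File#flock are my assumptions, and it assumes a POSIX system where flock is available.

```ruby
# Hypothetical helper (not from the replicator itself): run the block only if
# no other process currently holds the lock on `path`.
# Returns the block's value, or nil if the lock is already taken.
def with_lock_file(path)
  File.open(path, File::RDWR | File::CREAT, 0644) do |file|
    # LOCK_NB makes flock return false straight away instead of waiting.
    return nil unless file.flock(File::LOCK_EX | File::LOCK_NB)

    begin
      # Record our PID so a human can see who holds the lock.
      file.truncate(0)
      file.write(Process.pid.to_s)
      file.flush
      yield
    ensure
      file.flock(File::LOCK_UN)
    end
  end
end
```

Note that a process that hangs inside the block keeps holding the lock, so every later run is refused until the hung process goes away.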
So, in the end, I get stuck with several replicators “running”, all hung, all waiting for a link to come back that never will, and no replication happening. In the meantime, the rows are still changing and I am getting backlogged. Leave it for a day and you are 80,000 rows backlogged and people start asking questions and I start looking for old travel tickets that I had ‘forgotten about’.
But why does Ruby do this?
Shouldn’t the Timeout library protect us from such evilness?
Well, yes and no. The problem is that Ruby uses green threads. Green threads mean that there is only one real Ruby process running on the server, and it “internally” makes other “threads” that it schedules within the main Ruby process’ kernel time. Green threads are very efficient (they save a lot of context switching), but if you make a system call that blocks, the kernel stops servicing your main Ruby process until that call completes, and THIS means the timer thread behind the little ‘Timeout.timeout(5) { my code }’ block you put in never gets scheduled, so the timeout never fires…
And then you get a hang… forever.
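For reference, this is the kind of Timeout wrapper being discussed. It is only a sketch: the helper name try_with_timeout is made up for illustration. On a current Ruby with native threads the sleep below is interrupted as expected; the failure described above shows up on a green-threaded Ruby when a blocking system call sits where the sleep is.

```ruby
require 'timeout'

# Hypothetical helper: returns the block's value, or :timed_out if the block
# runs for longer than `seconds`.
def try_with_timeout(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  # Only reached if the interpreter gets a chance to fire the timer.
  :timed_out
end
```

Calling try_with_timeout(5) { my_code } is the shape of the ‘Timeout.timeout(5) { my code }’ block above, with the rescue folded in.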
I tried a couple of approaches. The first was the SystemTimer library by Philippe Hanrigou. His entry on it actually has a really good and simple example of what green threads and blocking system calls mean. If you want to brush up, it is a good read.
But SystemTimer didn’t work for me. It relies on alarm signals, which can have problems of their own, and in my case I still got calls that never timed out.
So I asked again on the Ruby-Talk mailing list and Ara T. Howard came back with a quick script that created little homicidal external ruby processes to kill the ruby process I was in if it didn’t “make it” in time.
The approach was actually very clever, and it works!
What he did was go “We can’t be sure that the existing Ruby process will be able to nuke itself, so instead, let’s start up another Ruby instance, running on its own, that has the PID of the Ruby instance that spawned it, and if the time runs out, do a system-level kill TERM on that PID.”
You could say that our little ruby processes have a license to kill. Literally.
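The idea can be sketched roughly as follows. This is my own minimal illustration of the approach, not Terminator's actual source: the names with_watchdog and WatchdogTimeout are hypothetical, and it assumes a POSIX system with signals.

```ruby
require 'rbconfig'

# Hypothetical error class for this sketch (Terminator has its own).
class WatchdogTimeout < StandardError; end

# Run the block; if it takes longer than `seconds`, a separate Ruby process
# sends TERM to this process, which we turn into an exception.
def with_watchdog(seconds)
  parent_pid = Process.pid

  # Turn an incoming TERM into an exception raised in the main thread.
  previous = Signal.trap('TERM') do
    raise WatchdogTimeout, "block did not finish within #{seconds}s"
  end

  # The watchdog is a separate Ruby process: it sleeps, then kills us.
  script = "sleep #{Float(seconds)}; Process.kill('TERM', #{parent_pid})"
  watchdog = Process.spawn(RbConfig.ruby, '-e', script)

  begin
    yield
  ensure
    # Call off the watchdog if it is still waiting, then reap it.
    # (A TERM arriving right here is a small race this sketch ignores.)
    begin
      Process.kill('KILL', watchdog)
    rescue Errno::ESRCH
      # The watchdog has already gone away.
    end
    Process.wait(watchdog)
    Signal.trap('TERM', previous || 'DEFAULT')
  end
end
```

Because the timer lives in another process, it keeps counting even when the interpreter that called with_watchdog is stuck. The cost is one extra Ruby process per guarded block.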
So I implemented his idea, found a couple of problems, gave back some ideas, and he coded, I spec’d, tested and described and we released a brand new gem…
The Terminator.
You can get it with:
gem install terminator
Or grab the source code.
And how do you use it? Simple:
require 'terminator'
Terminator.terminate 1 do
  sleep 2
  puts "I'll never print"
end
This will never print because the terminator times out after 1 second, which is before the sleep of 2 seconds inside the block. This will raise a Terminator::Error, which you could catch and try again if you want.
This will always work, because we are starting a separate Ruby process (which has some minor overhead) that waits the specified number of seconds and then simply sends a TERM signal to our misbehaving process.
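The “catch and try again” idea mentioned above could look something like this. It is a sketch under assumptions: with_retries is a hypothetical helper, not part of Terminator, and it takes the error class as an argument so it runs without the gem installed.

```ruby
# Hypothetical helper: run the block, retrying when it raises `error_class`,
# up to `attempts` tries in total. The last failure is re-raised.
# The block receives the current attempt number (starting at 1).
def with_retries(attempts, error_class)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue error_class
    retry if tries < attempts
    raise
  end
end
```

With the gem loaded you would pass Terminator::Error as the error class and put the Terminator.terminate call inside the block.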
Why should you use it?
Well, if you are making ANY calls to external web services, external databases, OpenID, YouTube, Google Maps… anything, then you should have a fail-fast policy in place and time out rapidly if these fail. As these are external system calls, they will most likely not be caught by Ruby’s timeout.rb library… and that means that your application will just hang, waiting for the call that never comes back.
It is much better to go “Ok, 2 seconds are up, no response, let’s tell the user to try again in a minute” and render an appropriate message, than have the user frustratingly whack the reload button a few hundred more times wondering why your application is not responding.
So yes, I think you should use Terminator.
Besides, how many other gems do you know that have the method ‘plot_to_kill’ ?
blogLater
Mikel
September 12th, 2008 at 12:19 AM
I just gave it a quick test on windows and it doesn’t seem to work. Do you happen to know if it’s supposed to work on windows?
Thanks!
September 12th, 2008 at 09:53 AM
You know, I thought it might, but I just tested it as well and found the same thing. I’ll look into it.
Mikel
September 12th, 2008 at 08:39 PM
You should maybe link to the source code ( http://codeforpeople.com/lib/ruby/terminator ) for those that want to peruse the code online. :)
September 13th, 2008 at 01:06 AM
Thanks!
I’ve got some code that does a similar thing on Windows, but it’s not nearly as complete as terminator, just a quick hack. If I see anything useful, I’ll let you know.
Are you guys using git? I have to admit that when I found that the project wasn’t on github (and I couldn’t fork it there), my enthusiasm for possibly contributing a patch dropped off significantly.
September 13th, 2008 at 11:49 AM
@pistos: Good idea, fixed it :)
@Mike: Let me know how it goes on Windows? I am having mixed reactions. On Git, no it’s a tar file right now. Maybe I’ll get off my butt and go gitify it :)
Mikel
September 16th, 2008 at 01:00 AM
This is super helpful, I have noticed some errors when using Ruby’s timeout on a typical XML call.
Thanks Mikel!
September 17th, 2008 at 07:09 AM
When I do this:
it throws ‘Terminator::Error’ and terminates my web server process which is not the behavior I want :-).
Any ideas?
September 17th, 2008 at 01:38 PM
@Justin, you need to:
Instead of:
Mikel
September 18th, 2008 at 01:01 AM
So .. a ruby process will be spawned for every request to a web service .. ?
September 18th, 2008 at 02:43 AM
What if the call finishes within n seconds? SystemTimer doesn’t seem to work in that case, and Terminator seems to terminate the process completely even if the call (system “ls”) was successful.
Any ideas ?
September 21st, 2008 at 11:07 AM
@Ahsan: That was working here. I just did a clean up on the initial code and added documentation for a 0.4.4 release. Maybe update and try again.
September 24th, 2008 at 03:43 AM
@Mikel: Ok, I upgraded to 0.4.4, but the problem persists.
September 24th, 2008 at 04:39 AM
@Ahsan, could you pastie your code?
October 2nd, 2008 at 07:41 AM
The simple example of:
begin
  Terminator.terminate(1) do
    sleep 3
  end
end
does not work for me. I’m using rails 1.2.2, ruby 1.8.5, fattr 1.0.3, and terminator 0.4.4.
Any help is appreciated.
November 22nd, 2008 at 02:36 AM
I’m having a similar problem as Phil. I can rescue Terminator::Error in Rails script/console but not when rails is being run with mongrel. Any
November 23rd, 2008 at 03:35 AM
@Doug and @Phil
Hmm… are you guys running on Windows?
Mikel
November 23rd, 2008 at 08:48 PM
@Mikel:
* Ubuntu (tried on both 7.10 and 8.04)
* ruby 1.8.6
* Mongrel 1.1.5
* Rails 2.1.1
I tried using a different signal in Terminator and commenting out the TERM trap in Mongrel, both with no luck. Thanks for the gem, I really like it in theory ;).
Doug