Brady Catherman

Cron in production? That is a double-edged sword!


Cron is a Unix tool for launching processes at given time intervals. It is incredibly useful, but its dangers are often overlooked when selecting it as a solution. In this post I would like to cover some of the pitfalls I have seen when using cron, as well as solutions to work around them.

I am going to rely heavily on an example from a previous employer, as it demonstrates several major shortcomings of cron in one pass while also being a good example of cron's power. We had an application that would sync user accounts across all of our production systems. This app would query the LDAP server to get a list of all expected users, and would then add, update, or delete users on the local system to match LDAP. This allowed us to run in production without the stability issues that linking directly to LDAP can cause, while removing the need to manually manage our cluster. This little tool was brilliant and solved lots of problems for us.

Email, email everywhere!

Cron will email the output of a given job to the user the job runs as. This can be very useful when some aspect of the job requires alerting the user, which is why cron was designed this way. You can spawn a job at midnight that counts the total number of requests for the day and emails you the result, for example.

The problem is that you very often do not want emails from cron jobs, and once they start they can flood your inbox very fast. In our example app above, we made sure that the process would never output anything. Everything ran along smoothly right up until we took our LDAP server down for maintenance. This promptly caused an exception in Python, which in turn caused every job to mail us every time it ran. This made our mailboxes unusable very fast (and worse, started causing delivery delays for all of our email) right in the middle of an important maintenance. In this case it was the non-zero exit code that was causing the email.

Cron supports setting MAILTO="" in the crontab to disable email, or setting it to a different address to redirect it. That only helps if you are editing the crontab directly; if you are running a helper script you will need to redirect output and control exit codes yourself. This post was initially written to address scripts, so I avoided relying on MAILTO here.
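For completeness, here is roughly what the MAILTO approach looks like in a crontab (the address below is just a placeholder):

MAILTO=""
0 * * * * /command/to/run
MAILTO="ops-alerts@example.com"
0 0 * * * /command/to/run

In Vixie cron, MAILTO applies to the lines that follow it, so you can re-enable or redirect mail for later jobs by setting it again.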

Our solution was to redirect the output (both stdout and stderr) to /dev/null, and to ensure that the command always returned a zero exit code by running true afterwards if the command failed.

Solution: Prevent output and bad exit codes in the cron command line.
0 * * * * /command/to/run > /dev/null 2>&1 || true

Did my job actually run?

Because the typical way to monitor cron jobs is via email, it is often extremely hard to find out whether a job has actually run successfully. Ideally your monitoring shouldn't depend on people being awake and willing to read automatically generated emails, especially if the cron job is important.

So how do you monitor cron jobs with something like Nagios? I use a simple tactic: at the end of the command, touch a file whose age I can then check. When using this approach, keep in mind the job's typical runtime and how many runs you want to fail before alerting. For a job that runs every hour, I check that the touched file is at most three hours and a few minutes old, so I know the job has failed at least twice before I am alerted. For human readability I often write the output of date into the file, but that is not strictly required. A minimal check of this kind is sketched below, after the combined command.

Solution: Touch a file and alert on the file’s age.
0 * * * * /command/to/run && date > /var/run/last_successful_run

When combined with the above issue, the command should look like this:
0 * * * * ( /command/to/run && date > /var/run/last_successful_run ) > /dev/null 2>&1 || true
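On the monitoring side, here is a minimal sketch of a Nagios-style check for the marker file's age (the path and the three-hour threshold simply mirror the example above):

#!/bin/sh
# Exit 2 (CRITICAL) if the marker file is missing or older than
# 3 hours and 5 minutes, otherwise exit 0 (OK).
FILE=/var/run/last_successful_run
MAX_AGE=$((3 * 3600 + 300))
[ -f "$FILE" ] || { echo "CRITICAL: $FILE missing"; exit 2; }
age=$(( $(date +%s) - $(stat -c %Y "$FILE") ))   # GNU stat; BSD uses 'stat -f %m'
if [ "$age" -gt "$MAX_AGE" ]; then
    echo "CRITICAL: last successful run was ${age}s ago"
    exit 2
fi
echo "OK: last successful run was ${age}s ago"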

A thundering herd is coming!

One of the biggest dangers of using cron in a clustered environment is the thundering herd problem: every machine wakes up and queries a backend at the same time. Machines in a cluster are typically synchronized via NTP, so their clocks agree to within milliseconds of each other, or better.

In our account manager example, we had initially configured the tool to run at the top of the hour on each machine. This meant that every machine in the cluster would start and query LDAP for a complete listing of all users within milliseconds of each other. Soon enough this caused the LDAP server to slow down, or even start failing requests.

To solve this issue we first set up our machines to randomize which minute they ran at. So that the choice was stable per machine, we used the MD5 sum of the hostname modulo 60, which gives a roughly even distribution across the hour. This reduced the load on the LDAP server from 100% of machines at the top of each hour to roughly 1/60th of them at the top of each minute. Eventually this too became a problem, so we had to add a random sleep at the start of the command on top of the per-host minute trick.
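As a sketch of the per-host minute trick (the exact command here is my own; the original tooling may have computed it differently), something like this works in bash when generating the crontab entry:

# Stable, roughly uniform minute offset derived from the hostname.
MINUTE=$(( 0x$(hostname | md5sum | cut -c1-8) % 60 ))
echo "$MINUTE * * * * /command/to/run"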

As a side note, a friend told me they were avoiding the thundering herd by using Redis, and within a week the cron jobs overran the Redis instance's listen queue and caused a minor production issue. Never assume you are immune just because each individual query is cheap.

Solution: Use minute offsets and sleep to avoid running all the jobs at the same time:
15 * * * * sleep `perl -e 'print int(rand(60))'` && /command/to/run

When combined with the above issues, the command should look like this:
15 * * * * ( sleep `perl -e 'print int(rand(60))'` && /command/to/run && date > /var/run/last_successful_run ) > /dev/null 2>&1 || true

Why is my latency so volatile?

This is another form of the thundering herd, only instead of competing for a shared cluster resource, the cron jobs compete for shared system resources on the same machine. By far the most common example is log rotation that compresses log files, which consumes I/O to read and write the files and CPU to compress them. In my experience this goes unnoticed until it becomes a problem.

While debugging an issue where our site latency seemed to get wildly unpredictable I noticed that our hourly latency graph ended up looking like this:
[Graph: site latency over several hours, with a sharp spike at the top of each hour]

Note the hourly cycle in the graph? At the top of every hour our latency would increase drastically, and for the remaining 59 minutes or so it would return to expected levels. After digging a bit it became clear that logrotate was the culprit; the CPU it consumed would have been better spent serving user requests. The problem is not limited to logrotate: any job that consumes significant resources can cause it.

Unix provides some very nice solutions to this issue. The best solution is still cgroups, but that is beyond the scope of this post; instead I will explain a far simpler approach. There is a tool called nice on Linux (and most Unixes) that tells the operating system to lower the scheduling priority of the job it runs. The job gets whatever CPU is left over after the other applications have run (though in reality it is still given some CPU time to ensure it eventually finishes). There is also another tool, ionice, which does the same thing for I/O, but for this example I will stick to nice.

Solution: Use the nice command:
0 * * * * nice /command/to/run

When combined with the above issues, the command should look like this:
15 * * * * ( sleep `perl -e 'print int(rand(60))'` && nice /command/to/run && date > /var/run/last_successful_run ) > /dev/null 2>&1 || true
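If the job is heavy on disk rather than CPU (logrotate being the classic case), ionice can be layered on in exactly the same way; a sketch, assuming the util-linux ionice is installed:

0 * * * * nice ionice -c3 /command/to/run

Class 3 (idle) only gives the job disk time when nothing else is asking for it, and it slots into the combined command in the same position as nice.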

Cannot fork a new process?

We thought we would be smart and make our LDAP application retry until it succeeded. This was an easy way to ride out delays, crashes, upstream server outages, and so on. It seemed like a really good idea right up until we were paged because machines had thousands of these processes running, all doing nothing but sleeping and retrying the TCP connection. A network blip had caused all the connections to fail, and soon every machine was retrying, which caused a thundering herd against the LDAP server. Even worse, the outage lasted long enough that multiple copies of the tool were running at once, which made the herd worse still, and new copies were spawning faster than old ones finished. Soon enough the machines were overrun with them.

Cron makes no promise that one, and only one, copy of a job will run at a time; it will just keep spinning up new ones whenever the time matches the schedule. It is up to you to ensure that you don't allow more than one process to run at once. Luckily there is an easy way to do this with the flock command on Linux: lock a file exclusively, which prevents a second process from starting while an existing one is still running.

Solution: Use a lock file and the flock command to prevent duplicate commands from running:
0 * * * * ( flock -w 0 200 && /command/to/run ) 200> /var/run/cron_job_lock
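If your flock supports running a command directly (most util-linux builds do), a simpler form that skips the file-descriptor redirection is also possible; a sketch:

0 * * * * flock -n /var/run/cron_job_lock /command/to/run

Here -n makes flock give up immediately instead of waiting when the lock is already held, equivalent to the -w 0 above.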

When combined with the above issues, the command should look like this:
15 * * * * ( flock -w 0 200 && sleep `perl -e 'print int(rand(60))'` && nice /command/to/run && date > /var/run/last_successful_run ) > /dev/null 2>&1 200> /var/run/cron_job_lock || true

When did THAT happen?

Sometimes it's prudent to redirect the output of the program into a file, which is useful for capturing errors and other such events. The problem is that the raw output usually gives you no way to establish when something happened; a stack trace in the log by itself doesn't tell you much. For readability it's often desirable to attach a timestamp to each line, and luckily you can accomplish this with bash fairly easily.

Solution: Use a shell wrapper to attach timestamps.
0 * * * * /command/to/run | while read line ; do echo `date` "$line" ; done > /path/to/the/log
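If the moreutils package is available, its ts utility does the same job with less typing; a sketch:

0 * * * * /command/to/run | ts > /path/to/the/log

ts prepends a timestamp to each line it reads, so the while loop above becomes unnecessary.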

When combined with the above issues, the command should look like this:
15 * * * * ( flock -w 0 200 && sleep `perl -e 'print int(rand(60))'` && nice /command/to/run && date > /var/run/last_successful_run ) 2>&1 200> /var/run/cron_job_lock | while read line ; do echo `date` "$line" ; done > /path/to/the/log || true

Simplicity

That combined command line is anything but simple. Luckily it's not too complicated to wrap it all up in a little script; it can do everything except set the minute offset. Download cron_helper.sh and put it in /usr/local/bin. This lets us express the complex definition above like this:
15 * * * * /usr/local/bin/cron_helper -c -n job_name -i -s -t /command/to/run
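The script itself is not reproduced in this post, but to give a rough idea, here is a stripped-down wrapper of my own (not the actual cron_helper.sh, and without its option flags) that combines the tricks above:

#!/bin/sh
# Hypothetical wrapper, for illustration only: random sleep, lock, nice,
# success marker, and silenced output/exit code, all in one place.
# usage: cron_wrapper job_name /command/to/run [args...]
NAME="$1"; shift
(
    flock -w 0 9 || exit 0                  # another copy is still running
    sleep `perl -e 'print int(rand(60))'`
    nice "$@" && date > "/var/run/${NAME}_last_successful_run"
) 9> "/var/run/${NAME}.lock" > /dev/null 2>&1 || true

Which would be scheduled as something like:
15 * * * * /usr/local/bin/cron_wrapper job_name /command/to/run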