various lengthy cron jobs mysteriously stopping
  • I'm definitely a noobie with this stuff, but I'm usually good about finding answers online. This one has been all misses for 2 days so I'm giving in and asking.

    I have lengthy (1-5 hour) cron jobs dying consistently (sort of).

    The exact same code, cron commands, and data work on my former host (dreamhost). I know for certain that they worked when I first got here (I have emails from cron for 3 hour+ jobs). Within the last week, they all fail within an hour of starting. Of course, the first question is: what did I change in the last week? Unfortunately I've been working like crazy and I can't for the life of me remember changing anything that I haven't tested. So I'm pretty desperate for help.

    What's cron running?
    I currently have it starting php processes which retrieve data from a remote source at regular intervals and do different things in various stages.

    Things I've checked with more than a day's worth of cron job trials
    - searching google for a number of terms including: cron, cron limit, cron error, cron log, cron time limit, cron jobs, cron tab, etc
    - every log in /var/log/
    - php memory usage every loop while the script is running (stays in same range the entire duration of the process, including when it dies)
    - measured app specific metrics (no correlation)
    - total duration of script (the same script typically dies after the same duration +/- 1 second)
    - iterations of processes (same number +/-1 in most cases)
    - duration of each loop/process
    - start and stop times coinciding with events in logs in /var/log/ (none)
    - start and stop times coinciding with other cron jobs (none)
    - removing any influence of app specific factors by creating a dummy php process which sleeps for 1 second, 5 seconds, 60 seconds, and 5 minutes before kicking off another process (each of those dies after a specific duration +/- 1 second)
    - not sure what other lengthy/repeating processes could be launched aside from php
    - php cli error logging (nothing, except purposely triggered errors to confirm logging worked)
    - php environment vars (such as memory, max execution time, etc)
    - application/logging output file sizes (no correlation)

    In all of these cases, between 46 and 3400 iterations (depending on the specific php script) completed successfully before it just stops. Mysteriously, in all cases they stop at exactly the same spot in the php script, but when I change the script, they display the same behavior at a different step.


    - Are there default or configurable cron limitations I don't know about?
    - Where are events like a process getting killed logged? I tried killing my own processes and didn't find them in syslog.

    I have this lengthy spreadsheet where I've documented the various experiments to try to figure this out and would greatly appreciate if anyone could help me solve this one.

    Thanks very much for reading and giving me your thoughts.
  • I forgot to mention the big one.

    Starting the processes with the exact same command as in the crontab runs successfully for hours until it normally ends.
  • A pretty common issue with scripts running under cron is that some command in the script expects a terminal or relies on an interactive stdin (a common example: vi in batch mode). Cron jobs run without a controlling terminal, and their stdin isn't connected to anything you can type into, so if your script depends on either, it will die as a cron job. The same script would run successfully in a login shell session, which has both a terminal and a usable stdin.
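One way to check this from inside a job is to test whether stdin and stdout are attached to a terminal. This is only a minimal sketch, assuming a POSIX shell; `describe_fd` is a made-up helper name, not something from the scripts discussed in this thread.

```shell
#!/bin/sh
# Report whether a file descriptor is attached to a terminal.
# describe_fd is a hypothetical helper name used for illustration only.
describe_fd() {
    # $1 = file descriptor number (0 = stdin, 1 = stdout)
    if [ -t "$1" ]; then
        echo "fd $1: terminal"
    else
        echo "fd $1: not a terminal"
    fi
}

# Whatever this prints under cron ends up in the mail cron sends.
describe_fd 0
describe_fd 1
```

Running it once from a login shell and once from the crontab should show the difference described above.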

    Hard to say if this is relevant to you, at least without details of the code around the point where the script(s) die. The only thing that bothers me about this possibility is that you say the script runs x iterations successfully before dying, which suggests to me that this isn't what is causing your problem, but one never knows.

    Have you changed anything in the scripts or in your shell startup script (.bashrc or whatever) recently? Is each iteration exactly the same from a code perspective? Are there any conditionally executed commands (within the loop) that could write to the (non-existent) terminal under specific conditions? Do your scripts modify (or rely on) any environment variables? Cron jobs don't inherit the login environment of the job owner either.
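To see how the cron environment differs from the login environment, the variables can be dumped from both contexts and compared. A rough sketch, assuming a POSIX shell; `dump_env` and the `/tmp/env.*` file names are placeholders chosen for this example.

```shell
#!/bin/sh
# Write the current environment, sorted, to a file so two runs can be diffed.
# dump_env is a hypothetical helper name; the paths below are placeholders.
dump_env() {
    # $1 = output file
    env | sort > "$1"
}

# Interactively:   sh envdump.sh /tmp/env.login
# From crontab:    * * * * * sh /path/to/envdump.sh /tmp/env.cron
# Then compare:    diff /tmp/env.login /tmp/env.cron
if [ "$#" -gt 0 ]; then
    dump_env "$1"
fi
```

Variables such as PATH, HOME and SHELL are the usual suspects when the two files differ.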
  • Hi combhua

    Is the job running under the same user account when it's run from cron as when it's run interactively? How about the shell: is it the same interactively and in batch?

    You could try ulimit -a to see if there are any resource limits the job might be hitting.
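Limits reported in a login shell are not necessarily the ones a cron job gets, so it may be worth capturing them from inside the job itself. A sketch under that assumption; `snapshot_context` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the user, the SHELL variable and the resource limits of the
# current process. snapshot_context is a hypothetical helper name.
snapshot_context() {
    echo "user: $(id -un)"
    echo "SHELL: ${SHELL:-unset}"
    echo "--- ulimit -a ---"
    ulimit -a
}

# Run this both interactively and from the crontab, then compare the output.
snapshot_context
```

The exact wording of the `ulimit -a` lines depends on which shell runs the script, so compare outputs produced by the same interpreter.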

    When you say 'retrieval from a remote source...' is there any chance that the remote source could be killing the connection - causing the job to die?

    How about /var/log/crond - I don't suppose there's anything in there?

    When you say 'PHP memory usage ...' - I don't really know PHP. Is that just monitoring how much memory PHP thinks it's using? Would it be possible to run something like 'sar -r' while the job was running to make sure the vps wasn't running out of memory?
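Where sar isn't available, a rough substitute is to sample /proc/meminfo at intervals while the job runs. This sketch assumes a Linux system that exposes /proc/meminfo; `mem_sample` and the log path are made-up names for illustration.

```shell
#!/bin/sh
# Print one timestamped line with the free memory and free swap figures
# taken from /proc/meminfo (Linux only). mem_sample is a hypothetical name.
mem_sample() {
    printf '%s' "$(date '+%Y-%m-%d %H:%M:%S')"
    awk '/^(MemFree|SwapFree):/ { printf " %s %s kB", $1, $2 }
         END { printf "\n" }' /proc/meminfo
}

# Example use while the cron job is running (log path is a placeholder):
#   while true; do mem_sample >> /tmp/mem.log; sleep 10; done
```

If the last samples before the job disappears show free memory collapsing, that points at the system rather than at PHP's own accounting.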

    Just a few thoughts ...

    Matt
  • @gadget
    Thanks for the suggestions. You're right that the baffling part is the successful iterations before failing. The failures occur long before there is any change in the processing. Aside from the experiments where I deliberately controlled iteration duration via sleep(), iterations were always less than 13 seconds, usually hovering at 10 seconds. In the actual runs, we're seeing 46-47 iterations, then nothing. In test runs, each iteration varies from 1 second to 5 minutes long, and both total duration and number of iterations vary even more widely. As far as I can tell, there's no relationship between them.

    As for the php app itself, there's nothing special it does that changed from the first 46, etc iterations. In fact, the only outside interaction is writing to a log file, which it successfully does the first x iterations.

    Also, the cron jobs are run as the same user as when I manually start them from the command line and calls to the command are done via absolute paths.

    @daintree
    Thanks also for thinking through this.
    They're definitely running as the same user. In addition to filing the cron job under the right user, I also confirm this by seeing the same user listed in 'top' while watching both situations run.

    I'm unfamiliar with ulimit, but from what I could tell, it wasn't indicating anything of concern:
    $ ulimit -a
    core file size (blocks, -c) 0
    data seg size (kbytes, -d) unlimited
    max nice (-e) 0
    file size (blocks, -f) unlimited
    pending signals (-i) unlimited
    max locked memory (kbytes, -l) unlimited
    max memory size (kbytes, -m) unlimited
    open files (-n) 1024
    pipe size (512 bytes, -p) 8
    POSIX message queues (bytes, -q) unlimited
    max rt priority (-r) 0
    stack size (kbytes, -s) 8192
    cpu time (seconds, -t) unlimited
    max user processes (-u) unlimited
    virtual memory (kbytes, -v) unlimited
    file locks (-x) unlimited


    As for the remote source killing the connection, I actually have code that successfully catches those exceptions. It does so in previous iterations, being tolerant of timeouts, failure to connect, unresponsive host, etc. Even if I had no network connection, it would log and run its duration, showing in-app failures and why.

    Yeah, it's what php thinks is the amt of memory it's using. I don't seem to have sar on my system, but monitoring top and free over the last 2 days shows very consistent memory usage and plenty of free memory even as it crashes.


    ---

    I just went through and compared php ini's line by line and the only thing I found is something about garbage collection every 24 minutes, but they're cron'd at times not coinciding with my failures. Even so, I disabled it and still have the same problem.

    Unfortunately cron wasn't logging. After a bit of digging on google, I figured out to check syslog.conf and found it was commented out. I removed the comment and it still doesn't seem to be running so I'll need to figure out how to 'restart cron' or something similar to tell it to take the changes in the .conf.

    poopy
  • Duh:
    # /etc/init.d/crond restart
    Err... maybe this is it:
    # /etc/init.d/sysklogd restart
  • Restarted sysklogd, and it started logging. The only thing in my log is an entry that it started. Nothing about it stopping :(

    The only thing I found in php.ini that differed was the memory_limit setting, which I changed from 16M to 128M to match my other server. That didn't work either.

    Also, I disabled the php cron job which ran garbage cleanup, no effect.

    However, on one test run, it ran 70 iterations instead of 46-47. Still died, but different count. Very next attempt, back to 47. All metrics recorded were reasonable increases due to increased iterations (duration, log file size, etc).

    /cry
  • combhua, is every iteration *internal* to a single script instance (internal loop), or does each iteration invoke a *new* instance of the same script?
  • The process went away? What was its exit status? You should be able to determine this by running it within a simple shell wrapper that prints $? afterward; cron will email you everything printed to stdout or stderr. In particular, it might be interesting to know success vs failure vs terminating on signal. This would also be a confirmation that if the job produces output you'll see it.
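A wrapper along those lines might look like the sketch below. It assumes a POSIX shell and the common convention that a status above 128 means "killed by signal (status - 128)"; `run_and_report` and the paths in the comments are hypothetical.

```shell
#!/bin/sh
# Run a command, then print how it ended. Statuses 129-192 are treated as
# "killed by signal"; anything else non-zero is reported as a plain failure
# (PHP, for instance, exits with 255 on a fatal error).
# run_and_report is a hypothetical helper name.
run_and_report() {
    status=0
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
        echo "exit status 0 (success)"
    elif [ "$status" -gt 128 ] && [ "$status" -le 192 ]; then
        echo "exit status $status (killed by signal $(($status - 128)))"
    else
        echo "exit status $status (failure)"
    fi
    return "$status"
}

# Example crontab command (paths are placeholders):
#   sh /path/to/wrapper.sh /usr/bin/php /path/to/job.php
if [ "$#" -gt 0 ]; then
    run_and_report "$@"
fi
```

Because the report goes to stdout, cron should mail it along with anything the job itself printed. A signal number of 9 (SIGKILL) would suggest something outside the script is killing it.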

    How do you know they die at the same spot of the script? You see some sort of backtrace or you just know from results that they got to step A but not step B?

    What is your script doing at that spot?