Track stuck long running jobs

From: Niels Kristian Schjødt
Date: 2014-01-09 @ 12:52
Hi,

I’m trying to find a good way of debugging some long-running jobs that are
taking unusually long (it seems like they are not finishing for some
reason). The situation is as follows:

I process a lot of different jobs in Sidekiq (around 300,000 per day). Some
are fast (a few ms) and some are slow (up to 30 minutes), because they rely
on external HTTP lookups with throughput limitations and so on. Because of
this, I have changed how Sidekiq is stopped when deploying: it is sent a
kill -15 (SIGTERM) signal with a timeout of a couple of hours, so that
deploys do not interfere with long-running tasks.
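
Roughly, the deploy-time shutdown looks like this (a simplified Ruby
sketch; the PID handling and the two-hour grace period are only
illustrative, not my exact setup):

    # Ask Sidekiq to stop (kill -15 / SIGTERM), then give it up to `timeout`
    # seconds to finish its in-flight jobs before hard-killing it.
    def stop_sidekiq(pid, timeout: 2 * 60 * 60)
      Process.kill("TERM", pid)       # stop fetching new jobs, finish current ones
      deadline = Time.now + timeout
      loop do
        Process.kill(0, pid)          # raises Errno::ESRCH once the process has exited
        if Time.now > deadline
          Process.kill("KILL", pid)   # last-resort hard kill after the grace period
          break
        end
        sleep 5
      end
    rescue Errno::ESRCH
      # Sidekiq exited cleanly within the timeout
    end
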
Recently I have discovered that some of my processes hang at status
“stopping” ("ps aux" on the server) for hours until I kill them hard,
because some of my jobs “never” finish. This is almost certainly a bug in
my job code, but I have a very hard time figuring out which of my jobs are
responsible for the issue. The Sidekiq dashboard has the “working” tab, but
by the time the workers have been stuck for hours trying to finish and
eventually restart, there is NO information about currently running tasks
in the dashboard. To me it looks like those stats are somehow being
cleared, even though the workers really haven’t finished yet.

Any ideas on how I could debug this and find out which of my tasks/jobs are
causing my workers to hang around in limbo forever?

Thanks!

Re: [sidekiq] Track stuck long running jobs

From: Ken Mayer
Date: 2014-01-09 @ 16:57
There's another kill signal that will do a stack dump of all running
threads. I've used that in the past to figure out what's going on.
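
I think it is TTIN in recent Sidekiq versions; it makes Sidekiq log a
backtrace of every live thread, which usually shows exactly where a stuck
job is blocked. Something along these lines will send it to every Sidekiq
process on a box (the pgrep pattern is just an example):

    # Send TTIN to each running Sidekiq process so it dumps a backtrace of
    # all its threads to its log file.
    `pgrep -f sidekiq`.split.map(&:to_i).each do |pid|
      Process.kill("TTIN", pid)
    end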

-- 
Ken Mayer | Engineering Manager | Pivotal Labs
ken@pivotallabs.com | 808.722.6142 (c) | 875 Howard St, San Francisco, CA
94103 <http://goo.gl/maps/7eGVl>

Re: [sidekiq] Track stuck long running jobs

From: Shane Emmons
Date: 2014-01-09 @ 13:26
I’ve used sidekiq-status [1] to pretty good effect when debugging issues
like this. I insert various debugging messages inside the job and use the
console to inspect them.

[1] https://github.com/utgarda/sidekiq-status
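
Something along these lines, going from memory of the gem’s README, so
double-check the exact API; the worker class and keys here are made up:

    # Rough sketch: record breadcrumbs from inside the job with sidekiq-status,
    # then inspect them from a console while the job is (supposedly) running.
    class LongLookupWorker
      include Sidekiq::Worker
      include Sidekiq::Status::Worker   # adds the store/retrieve/at/total helpers

      def perform(url)
        store stage: "before lookup", url: url
        # ... slow external HTTP lookup ...
        store stage: "done"
      end
    end

    # Later, from a Rails/IRB console, for a suspicious jid:
    Sidekiq::Status.status(jid)       # => :working, :complete, :failed, ...
    Sidekiq::Status.get(jid, :stage)  # => "before lookup" if it never got past the call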


- shane

