librelist archives

Cooperative precomputation

From: Dag Sverre Seljebotn
Date: 2011-03-17 @ 13:31
I have this scenario:

  - I invoke each process separately with different parameters (done by 
a cluster scheduling system -- my jobs could sit and wait for five 
hours, and then suddenly 20 of them are launched simultaneously -- or 
they could go one after another)

  - Each job needs some precomputed data that only depends on a few of 
the parameters. I'd like the caching of this to be automated.

How about using joblib, and pointing each process to the same cache 
directory on a network drive? (I'm very CPU-bound.) I can't find 
anything hinting at locking in the source, so I'd think that I'm almost 
guaranteed to get race conditions, right?

The solution would be to refactor Memory to use a Store, and implement 
one that does file-based locking. "lockf" (NOT "flock") works reliably 
across nodes on our cluster (GPFS), and also works locally and on some 
NFS setups.
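
A minimal sketch of what I mean by a lockf-based lock (the helper names
and lock file path are illustrative; `fcntl.lockf` is Python's wrapper
around POSIX `lockf`):

```python
import fcntl
import os

def acquire_lock(path):
    # lockf (POSIX record locking), NOT flock: record locks are what
    # GPFS and some NFS setups honour across nodes.
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks until the lock is free
    return fd

def release_lock(fd):
    fcntl.lockf(fd, fcntl.LOCK_UN)
    os.close(fd)
```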

Given a Store that locks, the default behaviour would be to sleep until 
the other process finished the computation, and then fetch its results. 
An improvement over this is doing work-sharing, so that you can specify 
a list of tasks that can be processed in any order; sequentially, and in 
the same process, but so that one can skip to the next one when a given 
task is locked (because somebody else is working on it).
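
A sketch of that work-sharing loop, assuming the same lockf mechanism
(the task representation and the non-blocking `LOCK_NB` retry policy
are my own illustration):

```python
import fcntl
import os

def run_tasks(tasks, lock_dir):
    # tasks: list of (name, thunk). Try each task's lock without
    # blocking; if another process holds it, skip ahead and retry
    # the skipped tasks on the next pass.
    pending = list(tasks)
    while pending:
        skipped = []
        for name, thunk in pending:
            fd = os.open(os.path.join(lock_dir, name + '.lock'),
                         os.O_CREAT | os.O_RDWR)
            try:
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                os.close(fd)
                skipped.append((name, thunk))  # busy elsewhere
                continue
            try:
                thunk()
            finally:
                fcntl.lockf(fd, fcntl.LOCK_UN)
                os.close(fd)
        if len(skipped) == len(pending):
            break  # everything left is being computed elsewhere
        pending = skipped
```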

Would this be a welcome addition to joblib? (I'm basically evaluating 
where I should put in the necessary work to do this.) It seems to me 
that it should integrate nicely with what is already there, and it could 
be optional (although when stable I don't think making lockf-based 
locking the default would hurt). Any feedback welcome.

Dag Sverre

Re: [joblib] Cooperative precomputation

From: Gael Varoquaux
Date: 2011-03-17 @ 14:44
Hi Dag,

Good to see you on this (very quiet) mailing list.

On Thu, Mar 17, 2011 at 02:31:32PM +0100, Dag Sverre Seljebotn wrote:
> How about using joblib, and point each process to the same cache 
> directory on a network drive? (I'm very CPU-bound.) I can't find 
> anything hinting at locking in the source, so I'd think that I'm almost 
> guaranteed to get race conditions, right?

You will. I use joblib's cache on my 12 CPUs box all the time. I do hit
race conditions all the time. That said, I have tried adding locks, and
it ended up slowing down everything. So I have adopted a 'better ask for
forgiveness than permission' approach: any operation that could fail is
in a try/except block (or should be, I sometimes find places where there
is a race I didn't see, in which case I fix it, but more and more
infrequently).

The limitation of this approach is that the current codebase, when run in
parallel, might have 100 jobs computing the same item to be cached, and
using up resources for no good reason. It won't crash, it will just
gobble up resources. That's where a lock might come in handy, if there is
some kind of scheduling system able to make use of the resources freed.

> The solution would be to refactor Memory to use a Store, and implement 
> one that does file-based locking. "lockf" (NOT "flock") works reliably 
> across nodes on our cluster (GPFS), and also works locally and on some 
> NFS setups.

Refactoring to use a store would be a good idea anyhow. A better design
than the current one would make use of a sub object of memory that would
expose a store API (to be defined). That way we could easily replace
disk-based storage by memory storage, or storage in a key-value database,
such as memcached, couchdb or mongodb. That would certainly be a great
addition. In any case, I am more than open to refactoring the codebase
to make such modifications easier.

I am planning to get there, but I wanted to get a cache replacement
policy in place before. It turned out that this feature pretty much
requires a lock, and when I tried implementing it the first time, I ended
up slowing down joblib too when running many jobs in parallel. I think I
know how to avoid this problem (this would be the topic for another
discussion) and I have started work on this, but it turned out to be too
much work for my current available time.

> Would this be a welcome addition to joblib? (I'm basically evaluating 
> where I should put in the necessary work to do this.) It seems to me 
> that it should integrate nicely with what is already there, and it could 
> be optional (although when stable I don't think making lockf-based 
> locking the default would hurt). Any feedback welcome.

I don't know whether making the lock the default would hurt or not. I have
three worries.

 1. The portability of the locking operation. From what I had read when 
    I was worrying about locks, filesystem locking is neither
    portable nor reliable. I was advised to use a mkdir as a lock. Indeed
    mkdir is an atomic operation on all file systems and all OSs that either
    succeeds, or fails if the directory exists. Thus acquiring the lock can
    be achieved by::

	import errno, os
	try:
	    os.mkdir('foo.lock')
	except OSError as e:
	    # Check e.errno to confirm that this is indeed a 'file exists'
	    # error; if so, the lock is held elsewhere.
	    if e.errno != errno.EEXIST:
	        raise

    releasing the lock is done by removing the directory (also atomic if it's
    empty).

 2. The slow down due to the lock (in particular in the memory access,
    rather than the memory write). This problem can be reduced as much
    as possible by having many locks, rather than a global lock, which
    always ends up being a bottleneck. One option would be to have a lock
    per function+arguments hash. That would probably work nicely with the
    current model.

 3. Dead-locks. Joblib is designed to be resilient to hard crashes:
    segfaults in the Python code (hell, code I write does segfault),
    machine running out of memory in the middle of a persistence
    operation. If we add locks, I'd like joblib to be able to detect
    deadlocks caused by a dead process as much as possible. I'd also 
    like a timeout on the locks. In a single-machine environment,
    detecting dead processes can be done by saving the PID of the
    process in the lock, and having code that checks if this process is
    still alive (and is still running Python, to cater for PID reuse).
    In a multi-machine environment, you need on top of that to save a
    hash that enables the identification of the node. I think it is OK to
    give up on testing whether the process is still alive if the node that
    wrote the lock is not the same node as the node that is trying to
    acquire it. However, I think that parallel computing on a single
    machine is more and more frequent. I have 12 CPUs at the lab; in two
    years, I might have 24. I want to be able to use them, and with
    multiprocessing, from the standard library, I can easily achieve that
    quite well. Thus, I'd like us to cater as much as possible for the
    single-machine case.
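
For the single-machine part of point 3, the liveness probe could look
like this (a sketch; signal 0 performs error checking only, and this
does not yet cater for PID reuse):

```python
import errno
import os

def pid_alive(pid):
    # Sending signal 0 delivers nothing; it only checks that the
    # process exists and that we are allowed to signal it.
    try:
        os.kill(pid, 0)
    except OSError as e:
        if e.errno == errno.ESRCH:   # no such process
            return False
        if e.errno == errno.EPERM:   # exists, but owned by another user
            return True
        raise
    return True
```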
 
Sorry, that's a pretty lengthy email: I had already given this problem a
thought. My advice would be to play around as much as possible to see
what can be achieved without writing too complex code before making any
plans. The strategy that worked so far with joblib has been to avoid hard
problems, and tackle only the simple/big gain ones. There were previous
versions of joblib (0.1 and 0.2) that did dependency tracking. They never
made it to a release on the Internet, as they were too fragile and I would
break them.

In any case, if you start exploring any ideas, you should work only from
the 0.5.X branch, and not the master branch, as the latter has features
required for the cache replacement policy that are not needed for your work,
and will probably do more harm than good. I should have kept them in a
feature branch, but I didn't foresee the locking problems (even though I
was warned by friends :$).

Thanks for your interest, it's exciting!

Gael

Re: [joblib] Cooperative precomputation

From: Olivier Grisel
Date: 2011-03-17 @ 17:23
2011/3/17 Gael Varoquaux <gael.varoquaux@normalesup.org>:
>
>  2. The slow down due to the lock (in particular in the memory access,
>    rather than the memory write). This problem can be reduced as much
>    as possible by having many locks, rather than a global lock, which
>    always ends up being a bottleneck. One option would be to have a lock
>    per function+arguments hash. That would probably work nicely with the
>    current model.

That sounds like a potential simple yet useful strategy for
scikit-learn. Suppose all calls to fit / transform inside a pipeline
are wrapped with a @Memory(cachedir="always/the/same/path") memoizer
before being passed to a GridSearchCV with n_jobs != 1 so that
joblib.Parallel is used to run the grid search. Making the early stage
of the pipeline lock would avoid saturating the CPU computing the
exact same thing several times at the beginning.

Other than that I am also in favor of keeping things as simple as possible.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: [joblib] Cooperative precomputation

From: Dag Sverre Seljebotn
Date: 2011-03-17 @ 15:30
On 03/17/2011 03:44 PM, Gael Varoquaux wrote:
> Hi Dag,
>
> Good to see you on this (very quiet) mailing list.
>
> On Thu, Mar 17, 2011 at 02:31:32PM +0100, Dag Sverre Seljebotn wrote:
>> How about using joblib, and point each process to the same cache
>> directory on a network drive? (I'm very CPU-bound.) I can't find
>> anything hinting at locking in the source, so I'd think that I'm almost
>> guaranteed to get race conditions, right?
> You will. I use joblib's cache on my 12 CPUs box all the time. I do hit
> race conditions all the time. That said, I have tried adding locks, and
> it ended up slowing down everything. So I have adopted a 'better ask for
> forgiveness than permission' approach: any operation that could fail is
> in a try/except block (or should be, I sometimes find places where there
> is a race I didn't see, in which case I fix it, but more and more
> infrequently).
>
> The limitation of this approach is that the current codebase, when ran in
> parallel, might have 100 jobs computing the same item to be cached, and
> using up resources for no good reason. It won't crash, it will just
> gobble up resources. That's where a lock might come in handy, if there is
> some kind of scheduling system able to make us of the resources freed.

My tasks would have a granularity of minutes rather than milliseconds, 
so the locking overhead is the least of my worries. But it certainly 
means things should go via refactor and store API.

I'm surprised that an FS stat (or process PID lookup?) drowns out hashing 
a NumPy array (if that is what is going on). What kind of jobs were 
these, and how many?

>
> I don't know whether making the lock by default would hurt or not. I have
> three worries.
>
>   1. The portability of the locking operation. From what I had read when
>      I was worrying about locks, filesystem locking is neither
>      portable, nor reliable. I was adviced to use as a lock a mkdir. Indeed
>      mkdir is an atomic operations on all file systems and all OSs that either
>      succeeds, or fails if the directory exists. Thus acquiring the lock can
>      be achieved by::
>
> 	try:
> 	    os.mkdir('foo.lock')
> 	except OSError, e:
> 	    # Here use code that checks the number of the OSError to assert
> 	    # that it indeed is a 'file exists' error, if so the lock is
> 	    # acquired elsewhere.
>
>      releasing the lock is done by removing the directory (also atomic if it's
>      empty).

Interesting. I'm guessing this should have about the same performance as 
lockf (although I guess only testing could tell for sure). The downside 
is what you write below: Possible dead-locks etc.

Within a single host I believe lockf should be rather safe, and those 
automatically disappear when a process dies. More below.

>   2. The slow down due to the lock (in particular in the memory access,
>      rather than the memory write). This problem can be reduced as much
>      as possible by having many locks, rather than a global lock, which
>      always ends up being a bottleneck. One option would be to have a lock
>      per function+arguments hash. That would probably work nicely with the
>      current model.
>
>   3. Dead-locks. Joblib is designed to be resilient to hack crash:
>      segfaults in the Python code (hell, code I write does segfault),
>      machine running out of memory in the middle of a persistence
>      operation. If we add locks, I'd like joblib to be able to detect
>      deadlocks because of a dead process as much as possible. I'd like
>      also a time out on the locks. In a single machine environment,
>      detecting dead processes can be done by saving the PID of the
>      process in the lock, and having code that checks if this process is
>      still alive (and is still running Python, to catter for PID reuse).
>      In a multi-machine environement, you need on top of that to save a
>      hash that enables the identification of the node. I think it is OK to
>      give up on testing of the process is still alive if the node that
>      wrote the lock is not the same node as the node that is trying to
>      acquire it. However, I think that parallel computing on a single
>      machine is more and more frequent. I have 12 CPUs at the lab, in two
>      years, I might have 24. I want to be able to use them, and with
>      multiprocessing, from the standard library, I can easily achieve that
>      quite well. Thus, I'd like us to cater as much as possible for the
>      single machine case.

(I'm afraid 24 cores won't do me much good :-) )

Indeed, avoiding dead-locks on segfaults is the single most important 
thing to get right.

It seems to me that a number of approaches are necessary. Locking is 
intrinsically tied to what hardware, OS and software one is on, and what 
performance one needs. One could imagine scenarios where joblib would be 
unusable unless there's an MPI store backend that acts as a cache using 
MPI -- or Internet-distributed stores using ZMQ -- but first things 
first :-)

I'm just saying that perhaps two approaches -- the one that suits me 
(cluster), and the one that suits you (single host) -- are needed. But 
let's see. Two approaches:

1) Use mkdir for lock + probe whether process has died or not. I think 
this last step needs support for plugging in the appropriate solution. A 
few options (where we should take as few as possible, but probably at 
least 2):

  i) PID's for localhost, as you said
  ii) For multi-host, SSH in and check PID (Linux-centric etc., but 
still fairly portable)
  iii) Specific cluster support (in my case, by far the most reliable is 
to record $SLURM_JOB_ID, and ask the queue system whether the job still 
lives)
  iv) Each process publishes a ZMQ port that can be pinged and that 
responds with a unique hash. Runs in its own thread outside of 
CPython/GIL control.
  v) For MPI jobs, do the locking over MPI
  vi) Some time based solution where a process is required to write a 
time stamp regularly. Pretty fragile.

2) Use lockf. This a) should be reliable on single-host, and b) is only 
supported on some network file systems, but where it is, it should do 
the job (why else leave an NFS lockd running...).

This has the advantage of the OS taking care of releasing when the 
process dies. I'm thinking of the following algorithm:

  a) Check if store/ab1234 exists. If so, result is computed, go on and 
use it.
  b) Open store/ab1234.lock (in "create but do not overwrite" mode) and 
attempt to get lock with lockf.
     b1) If a lock is acquired (= either we're first, or other process 
died):
       - Write some information to the lockfile that identifies the process.
       - Compute results to store/tmp-$hostname-$pid-ab1234.
       - When done, mv it to store/ab1234
       - Release lock and delete store/ab1234.lock
     b2) If a lock is not acquired, it means some other process is alive 
and processing, so wait (or do something else)
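
The steps above, sketched in Python (the store layout and names like 
`ab1234` follow the description; the extra re-check of the result under 
the lock guards against another process finishing just before we 
acquired it):

```python
import fcntl
import os
import socket
import time

def get_or_compute(store, key, compute):
    result = os.path.join(store, key)
    lock_path = result + '.lock'
    while not os.path.exists(result):                    # step a)
        fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)  # step b)
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)        # b2) someone else is computing: wait
            time.sleep(0.1)
            continue
        try:
            if os.path.exists(result):
                break           # another process finished just before us
            # b1) identify ourselves in the lock file
            os.write(fd, ('%s:%d' % (socket.gethostname(),
                                     os.getpid())).encode())
            tmp = os.path.join(store, 'tmp-%s-%d-%s'
                               % (socket.gethostname(), os.getpid(), key))
            with open(tmp, 'wb') as f:
                f.write(compute())
            os.rename(tmp, result)   # atomic publish
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
            os.close(fd)
            try:
                os.unlink(lock_path)
            except OSError:
                pass
    with open(result, 'rb') as f:
        return f.read()
```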

It's possible to do a bit of forking and ping-pong to test the lock up 
front, at least on a single host. It remains to be seen what is 
supported on Windows, though.

BTW, from what you say I'm glad the first joblib attempts died. What I 
like so much is exactly using simple mechanisms. But introducing 
(optional) locking seems necessary.

Dag Sverre

Re: [joblib] Cooperative precomputation

From: Dag Sverre Seljebotn
Date: 2011-03-17 @ 15:36
On 03/17/2011 04:30 PM, Dag Sverre Seljebotn wrote:
> [...]
> This has the advantage of the OS taking care of releasing when the
> process dies. I'm thinking of the following algorithm:
>
>    a) Check if store/ab1234 exists. If so, result is computed, go on and
> use it.
>    b) Open store/ab1234.lock (in "create but do not overwrite" mode) and
> attempt to get lock with lockf.
>       b1) If a lock is acquired (= either we're first, or other process
> died):

Insert here: - Check again, having the lock, that store/ab1234 doesn't 
exist now (which means that another process finished it a microsecond 
ago, in which case go to a))

>         - Write some information to the lockfile that identifies the process.
>         - Compute results to store/tmp-$hostname-$pid-ab1234.
>         - When done, mv it to store/ab1234
>         - Release lock and delete store/ab1234.lock
>       b2) If a lock is not acquired, it means some other process is alive
> and processing, so wait (or do something else)
>
> It's possible to do a bit of forking and ping-pong to test the lock up
> front, at least on single-host. Remains to look what is supported on
> Windows though.
>
> BTW, from what you say I'm glad the first joblib attempts died. What I
> like so much is exactly using simple mechanisms. But introducing
> (optional) locking seems necessary.
>
> Dag Sverre

Re: [joblib] Cooperative precomputation

From: Dag Sverre Seljebotn
Date: 2011-03-17 @ 16:01
I think this is actually a lot simpler than all I wrote below:

The downside of mkdir is that it allows deadlocks. The upside of mkdir is 
that it is more portable. lockf is deadlock-safe but unportable.

So it should probably be possible to use both kinds of locks: pessimistic
(and deadlock-safe) lockf locks on OSes that support it, and optimistic,
CPU-wasting mkdir locks when the lockf mechanism fails 
for some reason.
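
A sketch of that hybrid, with made-up helper names (detecting a lockf
that silently fails at runtime is harder; here the fallback only covers
the case where `fcntl` is unavailable):

```python
import errno
import os
import time

try:
    import fcntl
except ImportError:          # e.g. Windows: no lockf
    fcntl = None

def acquire(path):
    if fcntl is not None:
        # Pessimistic, deadlock-safe: the OS releases the lock if the
        # process dies.
        fd = os.open(path, os.O_CREAT | os.O_RDWR)
        fcntl.lockf(fd, fcntl.LOCK_EX)
        return ('lockf', fd)
    while True:              # optimistic mkdir fallback
        try:
            os.mkdir(path + '.d')    # atomic on all filesystems
            return ('mkdir', path + '.d')
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
            time.sleep(0.05)         # held elsewhere: poll

def release(lock):
    kind, handle = lock
    if kind == 'lockf':
        fcntl.lockf(handle, fcntl.LOCK_UN)
        os.close(handle)
    else:
        os.rmdir(handle)
```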
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Dag Sverre Seljebotn <d.s.seljebotn@astro.uio.no> wrote:

On 03/17/2011 04:30 PM, Dag Sverre Seljebotn wrote: > On 03/17/2011 03:44 
PM, Gael Varoquaux wrote: >> Hi Dag, >> >> Good to see you on this (very 
quiet) mailing list. >> >> On Thu, Mar 17, 2011 at 02:31:32PM +0100, Dag 
Sverre Seljebotn wrote: >>> How about using joblib, and point each process
to the same cache >>> directory on a network drive? (I'm very CPU-bound.) 
I can't find >>> anything hinting at locking in the source, so I'd think 
that I'm almost >>> guaranteed to get race conditions, right? >> You will.
I use joblib's cache on my 12 CPUs box all the time. I do hit >> race 
conditions all the time. That said, I have tried adding locks, and >> it 
ended up slowing down everything. So I have adopted a 'better ask for >> 
forgiveness than permission' approach: any operation that could fail is >>
in a try/except block (or should be, I sometimes find places where there 
>> is a race I didn't see, in which case I fix it, but more and more >> 
infrequently). >> >> The limitation of this
approach is that the current codebase, when ran in >> parallel, might have
100 jobs computing the same item to be cached, and >> using up resources 
for no good reason. It won't crash, it will just >> gobble up resources. 
That's where a lock might come in handy, if there is >> some kind of 
scheduling system able to make us of the resources freed. > My tasks would
have a granularity of minutes rather than milliseconds, > so the locking 
overhead is the least of my worries. But it certainly > means things 
should go via refactor and store API. > > I'm surprised that an FS stat 
(or process PID lookup?) drowns hashing a > NumPy array (if that is what 
is going on?) What kind of jobs where > these, and how many? > >> I don't 
know whether making the lock by default would hurt or not. I have >> three
worries. >> >> 1. The portability of the locking operation. From what I 
had read when >> I was worrying about locks, filesystem locking is neither
>> portable, nor reliable. I was adviced to use
as a lock a mkdir. Indeed >> mkdir is an atomic operations on all file 
systems and all OSs that either >> succeeds, or fails if the directory 
exists. Thus acquiring the lock can >> be achieved by:: >> >> 	try: >> 	 
os.mkdir('foo.lock') >> 	except OSError, e: >> 	 # Here use code that 
checks the number of the OSError to assert >> 	 # that it indeed is a 
'file exists' error, if so the lock is >> 	 # acquired elsewhere. >> >> 
releasing the lock is done by removing the directory (also atomic if it's 
>> empty). > Interesting. I'm guessing this should have about the same 
performance as > lockf (although I guess only testing could tell for 
sure). The downside > is what you write below: Possible dead-locks etc. > 
> Within a single host I believe lockf should be rather safe, and those > 
automatically disappear when a process dies. More below. > >> 2. The slow 
down due to the lock (in particular in the memory access, >> rather than 
the memory write). This problem can be reduced as much >> as
possible by having many locks, rather than a global lock, which >> always 
ends up being a bottleneck. One option would be to have a lock >> per 
function+arguments hash. That would probably work nicely with the >> 
current model. >> >> 3. Dead-locks. Joblib is designed to be resilient to 
hack crash: >> segfaults in the Python code (hell, code I write does 
segfault), >> machine running out of memory in the middle of a persistence
>> operation. If we add locks, I'd like joblib to be able to detect >> 
deadlocks because of a dead process as much as possible. I'd like >> also 
a time out on the locks. In a single machine environment, >> detecting 
dead processes can be done by saving the PID of the >> process in the 
lock, and having code that checks if this process is >> still alive (and 
is still running Python, to catter for PID reuse). >> In a multi-machine 
environement, you need on top of that to save a >> hash that enables the 
identification of the node. I think it is OK to >> give up on
testing of the process is still alive if the node that >> wrote the lock 
is not the same node as the node that is trying to >> acquire it. However,
I think that parallel computing on a single >> machine is more and more 
frequent. I have 12 CPUs at the lab, in two >> years, I might have 24. I 
want to be able to use them, and with >> multiprocessing, from the 
standard library, I can easily achieve that >> quite well. Thus, I'd like 
us to cater as much as possible for the >> single machine case. > (I'm 
afraid 24 cores won't do me much good :-) ) > > Indeed, avoiding 
dead-locks on segfaults is the single most important > thing to get right.
> > It seems to me that a number of approaches is necessary. Locking is > 
intrinsically tied to what hardware, OS and software one is on, and what >
performance one needs. One could imagine scenarios where joblib would be >
unusable unless there's an MPI store backend that acts as cache using > 
> MPI -- or internet distributed stores using ZMQ -- but first things
> first :-)
>
> I'm just saying that perhaps two approaches -- the one that suits me
> (cluster), and the one that suits you (single host) -- are needed. But
> let's see. Two approaches:
>
> 1) Use mkdir for the lock + probe whether the process has died or not.
> I think this last step needs support for plugging in the appropriate
> solution. A few options (where we should take as few as possible, but
> probably at least 2):
>
>   i) PIDs for localhost, as you said
>   ii) For multi-host, SSH in and check the PID (Linux-centric etc., but
> still fairly portable)
>   iii) Specific cluster support (in my case, by far the most reliable
> is to record $SLURM_JOB_ID, and ask the queue system whether the job
> still lives)
>   iv) Each process publishes a ZMQ port that can be ping-ed and that
> responds with a unique hash. Run in its own thread outside of
> CPython/GIL control.
>   v) For MPI jobs, do the locking over MPI
>   vi) Some time-based solution where a process is required to write a
> time stamp regularly. Pretty fragile.
>
> 2) Use lockf. This a) should be reliable on single-host, b) is only
> supported on some network filesystems, but when it is it should better
> do the job (why else leave an NFS lockd running...).
>
> This has the advantage of the OS taking care of releasing the lock when
> the process dies. I'm thinking of the following algorithm:
>
> a) Check if store/ab1234 exists. If so, the result is computed; go on
> and use it.
> b) Open store/ab1234.lock (in "create but do not overwrite" mode) and
> attempt to get a lock with lockf.
> b1) If a lock is acquired (= either we're first, or the other process
> died):

Insert here:

- Check again, while holding the lock, that store/ab1234 doesn't exist
now (which would mean that another process finished it a microsecond
ago, in which case go to a))

> - Write some information to the lockfile that identifies the process.
> - Compute results to store/tmp-$hostname-$pid-ab1234.
> - When done, mv it to store/ab1234
> - Release the lock and delete store/ab1234.lock
> b2) If a lock is not acquired, it means some other process is alive
> and processing, so wait (or do something else)
>
> It's possible to do a bit of forking and ping-pong to test the lock up
> front, at least on single-host. It remains to look at what is supported
> on Windows though.
>
> BTW, from what you say I'm glad the first joblib attempts died. What I
> like so much is exactly using simple mechanisms. But introducing
> (optional) locking seems necessary.
>
> Dag Sverre

Re: [joblib] Cooperative precomputation

From:
Gael Varoquaux
Date:
2011-03-17 @ 17:00
Hey Dag,

Replying at this mail, rather than the following, but I'll try to take
them in account.

On Thu, Mar 17, 2011 at 04:30:28PM +0100, Dag Sverre Seljebotn wrote:
> > You will. I use joblib's cache on my 12 CPUs box all the time. I do hit
> > race conditions all the time. That said, I have tried adding locks, and
> > it ended up slowing down everything.

> My tasks would have a granularity of minutes rather than milliseconds, 
> so the locking overhead is the least of my worries. But it certainly 
> means things should go via refactor and store API.

Fair enough, but this might not apply to everybody. Of course we could
have a switch, but I'd like the numbers of switch to be kept to a
minimum: the package should do the right thing by default. The slowdown
was noticeable on a task like using dynamical programming to speed up
computations of a Fibonacci series. It also multiplied the time to run the
test suite of joblib by three.

> I'm surprised that an FS stat (or process PID lookup?) drowns hashing a 
> NumPy array (if that is what is going on?)

What was going on was simply that multiple processes were trying to grab
the lock. Thus all but one of them ended up having to wait. This waiting
was the problem, not the time to acquire or release the lock. That said,
it was a global lock, so the problem was quite different. Also, the
processes were doing polling on the lock, so the polling time gave the
speed limit. The lock was a sqlite lock, so I couldn't avoid the polling.
Will a lockf-based lock allow releasing the lock without polling (e.g. with
a blocking call that unblocks when the lock is released, à la 'select')?
I am not convinced that a well-coded lock to answer the problem that you
are trying to answer will necessarily lead to slow downs. I think it's a
matter of coding and trying. I am just raising the point that we need to
be careful about this possible issue.

> Within a single host I believe lockf should be rather safe, and those 
> automatically disappear when a process dies. More below.

> [lots of complicated options, involving MPI, ZMQ, ssh, mkdirs, lockf]

Wow. You've convinced me. Let's go for the simple option :P.

More seriously, the complexity of trying to cater for all the cases
frightens me. Here is what I propose: on systems that support lockf, use
it. It seems (from what you say) that it should be resilient to deadlocks
and fast. I believe that most people doing highly parallel computing
will be on such systems. Does that match your intuition too? For people
doing moderately parallel computing, I don't think that the races should
be any problem. They are try/excepted and the worst thing that can happen
is to force a bit of extra computation. I've been doing a lot of 12-job
computation without any problems.

I believe that this compromise should keep a simple and efficient
codebase, and solve the problem for 99% of the users. We can worry about
the rest later, if the need ever arises.

> BTW, from what you say I'm glad the first joblib attempts died.

Retrospectively me too :). The golden rule of coding is really to _throw
away all your prototypes_. It's crazy how important it is to step away
from a codebase and think about the problems that it is trying to solve,
at some point.

> What I like so much is exactly using simple mechanisms.

I'd really like to keep joblib as simple as possible. It is its value, in
my eyes.

> But introducing (optional) locking seems necessary.

I still do wonder... given the try/except blocks. I'd be interested in a
real-world experiment in a highly parallel environment.

Thanks for your thoughts,

Gael

Re: [joblib] Cooperative precomputation

From:
Dag Sverre Seljebotn
Date:
2011-03-17 @ 19:53
On 03/17/2011 06:00 PM, Gael Varoquaux wrote:
>
>
>> I'm surprised that an FS stat (or process PID lookup?) drowns hashing a
>> NumPy array (if that is what is going on?)
> What was going on was simply that multiple processes were trying to grab
> the lock. Thus all but one of them ended up having to wait. This waiting
> was the problem, not the time to acquire or release the lock. That said,
> it was a global lock, so the problem was quite different. Also, the

Ah OK. Yes, a global lock would be rather disastrous :-)

And we're in luck -- Python's fcntl.lockf (NOT to be confused with 
fcntl.flock, which will not work for this at all...) can be called in 
either blocking or nonblocking (probing) mode, and when called in blocking 
mode it blocks at the OS level.
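[Editor's note: the two modes can be demonstrated with a small fork-based
"ping-pong" sketch, in the spirit of the up-front lock test mentioned in the
previous mail. POSIX-only and illustrative; `try_lock` and the temp path are
made up for the demo.]

```python
import fcntl
import os
import tempfile

def try_lock(fd):
    """Nonblocking (probing) mode: fail immediately instead of waiting."""
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        return False

path = tempfile.mktemp(suffix=".lock")  # throwaway path for the demo
fd = os.open(path, os.O_CREAT | os.O_RDWR)

# Blocking mode: if another process held the lock, the kernel would put
# us to sleep right here until release -- no polling loop, no poll interval.
fcntl.lockf(fd, fcntl.LOCK_EX)

# fcntl locks are per-process, so contention only shows up from another
# process: fork a child and let it probe while we hold the lock.
pid = os.fork()
if pid == 0:
    child_fd = os.open(path, os.O_RDWR)
    os._exit(0 if not try_lock(child_fd) else 1)

_, status = os.waitpid(pid, 0)
child_saw_lock = os.WEXITSTATUS(status) == 0  # child's probe failed, as expected

fcntl.lockf(fd, fcntl.LOCK_UN)
os.close(fd)
os.unlink(path)
```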

> More seriously, the complexity of trying to cater for all the cases
> frightens me. Here is what I propose: on systems that support lockf, use
> it. It seems (from what you say) that it should be resilient to deadlocks
> and fast. I believe that most people doing highly parallel computing
> will be on such systems. Does that match your intuition too? For people

Yes. Except some poor souls are on Windows, which has a completely 
different API, but that's a rather trivial case to add when somebody has 
an interest, and optimistic locking w/ mkdir will still work until then.
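[Editor's note: the mkdir fallback works because directory creation is atomic
on essentially every filesystem, so exactly one process can win. A minimal
sketch; the function names are illustrative.]

```python
import errno
import os

def acquire_mkdir_lock(lock_dir):
    """Optimistic lock via atomic os.mkdir.
    Returns True if we now own the lock, False if it already exists."""
    try:
        os.mkdir(lock_dir)
        return True
    except OSError as e:
        if e.errno == errno.EEXIST:
            return False  # held by someone else -- possibly a dead process
        raise

def release_mkdir_lock(lock_dir):
    os.rmdir(lock_dir)
```

Unlike lockf, nothing is released when the holder dies: a crashed process
leaves the directory behind, which is why the pluggable liveness probes
discussed earlier (local PID, SSH + PID, $SLURM_JOB_ID, ...) are needed to
detect and break stale locks.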

I'll get to it then.

Dag Sverre