librelist archives

Re: joblib bug? Mutating argument in cached function (fwd)

From:
Gael Varoquaux
Date:
2010-08-29 @ 12:21
I am forwarding to the mailing list some discussions I have had with
various users in which there were interesting proposals and conclusions.

The reason I am forwarding this is to open up development a bit, so that
it is easy for other people to see what has been discussed and proposed.
A few more will follow...

----- Forwarded message from Dag Sverre Seljebotn 
<dagss@student.matnat.uio.no> -----

Date: Thu, 29 Jul 2010 14:40:15 +0200
From: Dag Sverre Seljebotn <dagss@student.matnat.uio.no>
To: Gael Varoquaux <gael.varoquaux@normalesup.org>
Subject: Re: joblib bug? Mutating argument in cached function

Gael Varoquaux wrote:
> On Thu, Jul 29, 2010 at 01:56:45PM +0200, Dag Sverre Seljebotn wrote:

>> The patch is in fact 100% identical, so consider it confirmed.

>
> Thanks for the feedback,
>

>> I'm now successfully passing in np.random.RandomState and am able to
>> get cached MCMC results back even if the random state is modified. (I
>> needed to fix RandomState pickling as well; not a joblib issue.)

>> I still need to construct one random state per call though
>> ("memoized_func(fork_rng(rng))"), so that the random state isn't
>> affected by caching. I'm contemplating adding magic to joblib for
>> np.random.RandomState to make this transparent and safer; but most of
>> the time I feel that being explicit is better... *shrug*. If you think
>> it should be automagic and would like a patch like this:

>
> It took me a while to understand what the problem is. These things are
> tricky!
>
> To sum up, and check that I have indeed understood, the idea is that the
> function calls methods on the RandomState object, and thus has a side
> effect. Thus whether the function is executed or not changes the outcome
> of further calculation.

Yes. Sorry, I could have been clearer.
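
The side effect summarised above can be reproduced with a small sketch. It is only an illustration: the in-memory `memoize_by_state` decorator is a hypothetical stand-in for joblib's Memory.cache, and Python's `random.Random` stands in for np.random.RandomState.

```python
import random

_cache = {}  # toy in-memory cache (assumption: stands in for the on-disk cache)


def memoize_by_state(func):
    """Hypothetical memoizer keyed on the RNG state at call time."""
    def wrapper(rng):
        key = rng.getstate()
        if key not in _cache:
            _cache[key] = func(rng)  # the function runs and advances rng
        return _cache[key]           # on a cache hit rng is left untouched
    return wrapper


@memoize_by_state
def sample(rng):
    return rng.random()


def run():
    rng = random.Random(0)
    result = sample(rng)
    follow_up = rng.random()  # depends on whether sample() really ran
    return result, follow_up


first = run()   # cold cache: sample() executes
second = run()  # warm cache: sample() is skipped
print(first[0] == second[0], first[1] == second[1])
```

The cached result is the same on both runs, but the draw that follows it differs, because the random state was only advanced on the run that actually executed the function.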
> Here is a proposal: implement a 'modifies' keyword argument to
> Memory.cache, just like there is an 'ignore' one. This keyword argument
> would take a list of objects modified by the function. The state of these
> objects after the first run would be grabbed by calling their
> '__getstate__' and stored in a separate file in the cache dir, and after
> it would be set using a '__setstate__'.
>
> Do you believe this would give you a solution for your problem? It would
> be a fairly general solution that could be reused in other cases.

Yes, that's a lot better!

Thinking more about this, two more solutions though:

1) I could return my rng explicitly, and reassign it in the caller, for
the same effect. I do like to push for a more functional style of
programming... perhaps this is better after all.
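
A minimal sketch of option 1, assuming a toy in-memory cache and `random.Random` in place of np.random.RandomState (the names `draw` and `_cache` are illustrative, not part of joblib). The function never mutates its argument; it returns the advanced generator so that the caller can reassign it.

```python
import random

_cache = {}  # toy in-memory cache, keyed on the RNG state at call time


def draw(rng):
    """Return a sample and the advanced RNG, leaving `rng` itself untouched."""
    key = rng.getstate()
    if key not in _cache:
        local = random.Random()
        local.setstate(key)  # work on a private copy of the state
        value = local.random()
        _cache[key] = (value, local.getstate())
    value, new_state = _cache[key]
    new_rng = random.Random()
    new_rng.setstate(new_state)
    return value, new_rng


def run():
    rng = random.Random(0)
    value, rng = draw(rng)  # the caller reassigns the returned RNG
    return value, rng.random()


cold = run()  # executes the body
warm = run()  # served from the cache
print(cold == warm)
```

Because the new state is part of the return value, it is cached along with the result, and the follow-up draw is the same whether or not the body ran.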

2) Sticking with imperative programming, is this better?:

- Get a hash per argument (instead of the total) before executing function
- Get a hash per argument afterwards as well.
- Cache (and later __setstate__) all arguments that changed by default
- One can optionally turn off the caching of arguments (so default is on 
instead of off).

This would mean that joblib would always cache return values *and* changes 
to mutable arguments. Which feels a lot safer to me.

One would need to explicitly declare when one "gives up", say, a large
NumPy array, but I feel it is better to be explicit about those cases
because one then needs to make sure they are not used after calling the
joblib-ed function.

Of  course, nothing in 2) prevents 1), it then becomes a matter of  
personal taste.

<digression>
Puristically, whether or not you give up a NumPy array is a decision that 
belongs with the caller, not the implemented function. Something like this:

joblibed_func(arr1, give_up(arr2))

where give_up would signal that the scope where give_up is called would  
never more use arr2.

That needs a lot more thought though, and I don't need it myself.
</digression>


Dag Sverre

----- End forwarded message -----

-- 
    Gael Varoquaux
    Research Fellow, INRIA
    Laboratoire de Neuro-Imagerie Assistee par Ordinateur
    NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
    Phone:  ++ 33-1-69-08-78-35
    Mobile: ++ 33-6-28-25-64-62
    http://gael-varoquaux.info

Re: [joblib] Re: joblib bug? Mutating argument in cached function (fwd)

From:
Gael Varoquaux
Date:
2010-08-29 @ 12:23
----- Forwarded message from Gael Varoquaux <gael.varoquaux@normalesup.org> -----

Date: Thu, 29 Jul 2010 15:10:25 +0200
From: Gael Varoquaux <gael.varoquaux@normalesup.org>
To: Dag Sverre Seljebotn <dagss@student.matnat.uio.no>
Subject: Re:  joblib bug? Mutating argument in cached function

On Thu, Jul 29, 2010 at 02:40:15PM +0200, Dag Sverre Seljebotn wrote:
>>> I'm now successfully passing in np.random.RandomState and am able to
>>> get cached MCMC results back even if the random state is modified.
>>> (I needed to fix RandomState pickling as well; not a joblib issue.)

>>> I still need to construct one random state per call though
>>> ("memoized_func(fork_rng(rng))"), so that the random state isn't
>>> affected by caching. I'm contemplating adding magic to joblib for
>>> np.random.RandomState to make this transparent and safer; but most of
>>> the time I feel that being explicit is better... *shrug*. If you
>>> think it should be automagic and would like a patch like this:


>> It took me a while to understand what the problem is. These things are
>> tricky!

>> To sum up, and check that I have indeed understood, the idea is that the
>> function calls methods on the RandomState object, and thus has a side
>> effect. Thus whether the function is executed or not changes the outcome
>> of further calculation.

> Yes. Sorry, I could have been clearer.
>> Here is a proposal: implement a 'modifies' keyword argument to
>> Memory.cache, just like there is an 'ignore' one. This keyword argument
>> would take a list of objects modified by the function. The state of these
>> objects after the first run would be grabbed by calling their
>> '__getstate__' and stored in a separate file in the cache dir, and after
>> it would be set using a '__setstate__'.

>> Do you believe this would give you a solution for your problem? It would
>> be a fairly general solution that could be reused in other cases.

> Yes, that's a lot better!
>
> Thinking more about this, two more solutions though:
>
> 1) I could return my rngs explicitly, and reassign it in the caller, for  
> the same effect. I do like to push for a more functional style of  
> programming... perhaps this is better after all.

Functional programming is a great idea, however I have found that
everybody, including me, hates changing a bit of code to satisfy a
framework.

> 2) Sticking with imperative programming, is this better?:
>
> - Get a hash per argument (instead of the total) before executing function
> - Get a hash per argument afterwards as well.
> - Cache (and later __setstate__) all arguments that changed by default
> - One can optionally turn off the caching of arguments (so default is on 
> instead of off).
>
> This would mean that joblib would always cache return values *and*  
> changes to mutable arguments. Which feels a lot safer to me.

Yes, this sounds good. I like that. I would give an option to the memory
object and the .cache method to turn this feature off (implemented just
like the mmap_mode option), but it seems like a net gain.

I'd love to take a patch for this feature, unless of course, you find
that the patch brings in a lot of complexity and magic. Also, I would
try/except the __setstate__, because some objects don't accept
__setstate__. If it fails, I would simply raise a warning and go on (one
of the design goals of joblib is not to get in your way).
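
The guarded restore could look something like the sketch below. The helper name `restore_state` is hypothetical, and `random.Random` is used only as a convenient object that supports `__getstate__` and `__setstate__`; a plain list serves as an example of an object that does not accept `__setstate__`.

```python
import random
import warnings


def restore_state(obj, state):
    """Try to restore `state` on `obj`; warn and carry on if that fails."""
    try:
        obj.__setstate__(state)
    except Exception as exc:
        warnings.warn("could not restore state of %s: %r"
                      % (type(obj).__name__, exc))
        return False
    return True


source = random.Random(3)
target = random.Random(4)
ok = restore_state(target, source.__getstate__())  # succeeds
same_draw = target.random() == source.random()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    failed = restore_state([], None)  # lists have no __setstate__
print(ok, same_draw, failed, len(caught))
```

The failure is reported as a warning rather than an exception, so a cached call still returns its result even when an argument cannot be restored.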

> One would need to explicitly declare when one "gives up", say, a large
> NumPy array, but I feel it is better to be explicit about those cases
> because one then needs to make sure they are not used after calling the
> joblib-ed function.

Yup. You'd be surprised how fast joblib is on large numpy arrays. I use
it with GB-sized arrays.

> Of  course, nothing in 2) prevents 1), it then becomes a matter of  
> personal taste.

> <digression>
> Puristically, whether or not you give up a NumPy array is a decision  
> that belongs with the caller, not the implemented function. Something  
> like this:
>
> joblibed_func(arr1, give_up(arr2))
>
> where give_up would signal that the scope where give_up is called would  
> never more use arr2.
>
> That needs a lot more thought though, and I don't need it myself.
> </digression>

Right, I agree that explicit is good. I don't think I need it myself
either, so I would tend to just put it on the side until someone asks for
it. The syntax I would favor would be one similar to the ignore argument
for the cache method.

Thanks for your contributions, they are very useful,

Gaël

----- End forwarded message -----
