Re: [joblib] Re: joblib bug? Mutating argument in cached function (fwd)
From: Gael Varoquaux
Date: 2010-08-29 12:23
----- Forwarded message from Gael Varoquaux <gael.varoquaux@normalesup.org> -----
Date: Thu, 29 Jul 2010 15:10:25 +0200
From: Gael Varoquaux <gael.varoquaux@normalesup.org>
To: Dag Sverre Seljebotn <dagss@student.matnat.uio.no>
Subject: Re: joblib bug? Mutating argument in cached function
On Thu, Jul 29, 2010 at 02:40:15PM +0200, Dag Sverre Seljebotn wrote:
>>> I'm now successfully passing in np.random.RandomState and am able to
>>> get cached MCMC results back even if the random state is modified.
>>> (I needed to fix RandomState pickling as well; not a joblib issue.)
>>> I still need to construct one random state per call though
>>> ("memoized_func(fork_rng(rng))"), so that the random state isn't
>>> affected by caching. I'm contemplating adding magic to joblib for
>>> np.random.RandomState to make this transparent and safer; but most of
>>> the time I feel that being explicit is better... *shrug*. If you
>>> think it should be automagic and would like a patch like this:
>> It took me a while to understand what the problem is. These things are
>> tricky!
>> To sum up, and check that I have indeed understood, the idea is that the
>> function calls methods on the RandomState object, and thus has a side
>> effect. Thus whether the function is executed or not changes the outcome
>> of further calculation.
> Yes. Sorry, I could have been clearer.
>> Here is a proposal: implement a 'modifies' keyword argument to
>> Memory.cache, just like there is an 'ignore' one. This keyword argument
>> would take a list of objects modified by the function. The state of these
>> objects after the first run would be grabbed by calling their
>> '__getstate__' and stored in a separate file in the cache dir, and after
>> it would be set using a '__setstate__'.
>> Do you believe this would give you a solution for your problem? It would
>> be a fairly general solution that could be reused in other cases.
> Yes, that's a lot better!
>
> Thinking more about this, two more solutions though:
>
> 1) I could return my rngs explicitly, and reassign it in the caller, for
> the same effect. I do like to push for a more functional style of
> programming... perhaps this is better after all.
Functional programming is a great idea, however I have found that
everybody, including me, hates changing a bit of code to satisfy a
framework.
> 2) Sticking with imperative programming, is this better?:
>
> - Get a hash per argument (instead of the total) before executing function
> - Get a hash per argument afterwards as well.
> - Cache (and later __setstate__) all arguments that changed by default
> - One can optionally turn off the caching of arguments (so default is on
> instead of off).
>
> This would mean that joblib would always cache return values *and*
> changes to mutable arguments. Which feels a lot safer to me.
Yes, this sounds good. I like that. I would give an option to the memory
object and the .cache method to turn this feature off (implemented just
like the mmap_mode option), but it seems like a net gain.
I'd love to take a patch for this feature, unless of course, you find
that the patch brings in a lot of complexity and magic. Also, I would
try/except the __setstate__, because some objects don't accept
__setstate__. If it fails, I would simply raise a warning and go on (one
of the design goals of joblib is not to get in your way).
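A minimal sketch of what option 2 could look like, assuming a toy in-memory cache rather than joblib's actual internals: hash each argument before and after the call, store the state of those that changed, and replay it with __setstate__ on a cache hit, warning instead of failing. The names cache_with_mutations and _digest are hypothetical, and random.Random again stands in for np.random.RandomState.

```python
import hashlib
import pickle
import random
import warnings


def _digest(obj):
    """Hash an object via its pickle (stand-in for joblib's hashing)."""
    return hashlib.sha1(pickle.dumps(obj)).hexdigest()


def cache_with_mutations(func):
    """Toy in-memory memoizer that also replays argument mutations."""
    store = {}

    def wrapper(*args):
        before = [_digest(a) for a in args]
        key = tuple(before)
        if key in store:
            result, changed = store[key]
            for i, state in changed.items():
                try:
                    args[i].__setstate__(state)
                except Exception as exc:  # do not get in the caller's way
                    warnings.warn(f"could not restore argument {i}: {exc}")
            return result
        result = func(*args)
        changed = {}
        for i, arg in enumerate(args):
            if _digest(arg) != before[i]:  # argument was mutated by func
                try:
                    changed[i] = arg.__getstate__()
                except AttributeError:
                    warnings.warn(f"cannot capture state of argument {i}")
        store[key] = (result, changed)
        return result

    return wrapper


calls = []


@cache_with_mutations
def draw(rng, n):
    calls.append(n)
    return [rng.random() for _ in range(n)]


rng_a = random.Random(0)
first = draw(rng_a, 3)      # executed; rng_a's new state is recorded
next_a = rng_a.random()

rng_b = random.Random(0)
second = draw(rng_b, 3)     # cache hit; rng_b is advanced via __setstate__
next_b = rng_b.random()
```

Here the second call never runs the function body, yet rng_b ends up in the same state as rng_a did, so the draws that follow agree. A real implementation would also need the option to switch the behaviour off, as discussed above.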
> One would need to explicitly declare when one "gives up", say, a large
> NumPy array, but I feel it is better to be explicit about those cases
> because one then needs to make sure they are not used after calling the
> joblib-ed function.
Yup. You'd be surprised how fast joblib is on large numpy arrays. I use
it with GB-sized arrays.
> Of course, nothing in 2) prevents 1), it then becomes a matter of
> personal taste.
> <digression>
> Puristically, whether or not you give up a NumPy array is a decision
> that belongs with the caller, not the implemented function. Something
> like this:
>
> joblibed_func(arr1, give_up(arr2))
>
> where give_up would signal that the scope where give_up is called would
> never more use arr2.
>
> That needs a lot more thought though, and I don't need it myself.
> </digression>
Right, I agree that explicit is good. I don't think I need it myself
either, so I would tend to just put it on the side until someone asks for
it. The syntax I would favor would be one similar to the ignore argument
for the cache method.
Thanks for your contributions, they are very useful,
Gaël
----- End forwarded message -----
--
Gael Varoquaux
Research Fellow, INRIA
Laboratoire de Neuro-Imagerie Assistee par Ordinateur
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-78-35
Mobile: ++ 33-6-28-25-64-62
http://gael-varoquaux.info