librelist archives

Faster function hashing

From: Gael Varoquaux
Date: 2010-08-29 @ 12:38

Here is a discussion I had a while ago with Jorik Blaas about making
the hashing of Python functions in joblib faster.

The suggestion is very sensible and would not only make joblib faster,
but also make it robust to the situation in which the file is changed
on disk but not reloaded.

Gaël

----- Forwarded message from Gael Varoquaux <gael.varoquaux@normalesup.org> -----

Date: Wed, 3 Feb 2010 11:39:57 +0100
From: Gael Varoquaux <gael.varoquaux@normalesup.org>
To: Jorik Blaas <jorik@scivis.net>
Subject: Re:  joblib question

On Wed, Feb 03, 2010 at 11:30:44AM +0100, Jorik Blaas wrote:
> I've experimented a bit with this in the past, and in CPython you can at  
> least do the following:
>
> def a():
>     return 4
>
> and then use a.__code__.__hash__(), which gives -6234642185175670730
>
> If you change the function, the hash changes.
>
> However the name is included in the hash code, so if you have two  
> functions that are exactly the same, but have a different name, their  
> hashes will differ.  But this is not really an issue.
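
The behaviour described above can be tried out with a small sketch. The
helper names load_function and code_hash are made up for this
illustration and are not part of joblib. The function is built from a
source string only so that the same definition can be loaded twice and
then loaded again in a modified form. The actual hash values depend on
the interpreter, and in Python 3 string hashing is randomized per
process by default, so the value should only be compared within a
single process.

```python
SOURCE_V1 = "def a():\n    return 4\n"
SOURCE_V2 = "def a():\n    return 5\n"  # same name, different body


def load_function(source, name="a"):
    """Hypothetical helper: compile `source` and return the function `name`."""
    namespace = {}
    exec(compile(source, "<example>", "exec"), namespace)
    return namespace[name]


def code_hash(func):
    """Hash of the function's code object (same value as func.__code__.__hash__())."""
    return hash(func.__code__)


original = load_function(SOURCE_V1)
reloaded = load_function(SOURCE_V1)   # identical definition, loaded again
modified = load_function(SOURCE_V2)   # body changed

# Identical definitions compare equal, so their hashes are equal too.
print(code_hash(original) == code_hash(reloaded))
# The modified body gives a different code object.
print(original.__code__ == modified.__code__)
print(code_hash(original), code_hash(modified))
```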

That approach sounds sensible. I still want to fall back to checking the
stored code if the hash is different, and to persist the code, rather
than the hash, between processes, but I think your idea may work and be
a good one. It's nice to have another pair of eyes looking at the
project and coming up with good suggestions.
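
A minimal sketch of that check order, under stated assumptions: the
class CodeChangeChecker, its attributes and the get_source parameter are
hypothetical names for this illustration and do not describe joblib's
actual implementation. The hash of the code object is kept in memory
only and serves as a fast path. When it differs from the last one seen,
the check falls back to comparing the source against the stored code,
which is the only thing that would be persisted.

```python
import inspect


class CodeChangeChecker:
    """Hypothetical helper, not part of joblib's API."""

    def __init__(self, stored_source, get_source=inspect.getsource):
        self.stored_source = stored_source  # stands in for the code kept on disk
        self.get_source = get_source        # assumed way to obtain the source
        self._last_hash = None              # in-process only, never persisted
        self.slow_checks = 0                # counts fallbacks, for illustration

    def has_changed(self, func):
        current_hash = hash(func.__code__)
        if current_hash == self._last_hash:
            return False                    # fast path: same code object hash
        # Hash unknown or different: fall back to the stored code.
        self.slow_checks += 1
        current_source = self.get_source(func)
        changed = current_source != self.stored_source
        self.stored_source = current_source
        self._last_hash = current_hash
        return changed


def job():
    return 4


# The source is supplied by a stub here so the sketch does not depend on
# where the function was defined.
JOB_SOURCE = "def job():\n    return 4\n"
checker = CodeChangeChecker(JOB_SOURCE, get_source=lambda func: JOB_SOURCE)
first = checker.has_changed(job)    # slow path: compares against stored code
second = checker.has_changed(job)   # fast path: hash already seen
print(first, second, checker.slow_checks)
```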

> So we could check this before checking the source, which would
> probably make my runs quite a bit faster, as each of my jobs only runs
> for less than a second.

OK. You'll have to profile where the joblib time is spent. I fear that
quite a lot will be spent looking up the stored data. I absolutely want
to keep the data stored on the disk, for several reasons. I used to be
lazy about storing/reading the data, but that was bad: when a process
crashed (which does happen when you are experimenting with scientific
computing code), or when I used multi-processing, I would end up in
inconsistent situations. I currently use joblib to do computations on
grids, using the fact that the grid I have has a shared disk. It could
be improved for this purpose, but it is currently solid enough to handle
this situation mostly fine.


Cheers,

Gaël

----- End forwarded message -----