librelist archives

joblib rocks

From: Ondrej Certik
Date: 2011-05-10 @ 00:46
Hi Gael and others,

I just wanted to say thanks for this fantastic lib. Dag pointed it out
to me a couple of days ago, and just yesterday I gave it a shot on my
research code, which I wrote before hearing about joblib. See for
yourself:

Qsnake: certik1@pike:~/repos/dftatom(master)$ time python examples/optimize.py
a =  1000000000.0 N0 =  3000
a =  100000000.0 N0 =  2600
a =  10000000.0 N0 =  2300
a =  1000000.0 N0 =  6000

real	4m9.032s
user	4m8.490s
sys	0m0.520s
Qsnake: certik1@pike:~/repos/dftatom(master)$ time python examples/optimize.py
a =  1000000000.0 N0 =  3000
a =  100000000.0 N0 =  2600
a =  10000000.0 N0 =  2300
a =  1000000.0 N0 =  6000

real	0m2.191s
user	0m2.000s
sys	0m0.200s


So that's a 114x speedup right away. Just before discovering joblib, I
had been thinking that I needed to store the computed data on disk, so
that I wouldn't have to painfully recalculate it over and over again.
Here is the main part of the code:

from dftatom import atom_lda
from joblib import Memory

memory = Memory(cachedir="/tmp/optimize", verbose=0)

@memory.cache
def atom(Z, rmin, rmax, a, N, eps1, eps2):
    # Cached wrapper around the Fortran solver: identical arguments
    # return the stored result instead of recomputing.
    return atom_lda(Z, rmin, rmax, a, N, eps1, eps2)

def find_N0(rmin, rmax, a, E):
    # Scan increasing grid sizes N, then walk backwards to find the
    # smallest N whose total energy is within 1e-6 of the converged
    # value E.
    data = []
    for i in range(1, 100):
        N = 500 + i*100
        E_tot, ks_energies, n, l, f, R, V_tot, density, orbitals = \
                atom(92, rmin, rmax, a, N, 1e-11, 1e-11)
        data.append((N, E_tot))
        E_band = sum(ks_energies * f)  # band energy (not used below)
    i = len(data)-1
    assert abs(data[i][1]-E) < 1e-6
    while abs(data[i][1]-E) < 1e-6:
        i -= 1
    N0 = data[i+1][0]
    print "a = ", a, "N0 = ", N0
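
(A side note on the wrapper: since atom only forwards its arguments,
the same cache could in principle be created by calling memory.cache
directly on the Fortran wrapper -- a one-line sketch, assuming joblib
can introspect the Cython-wrapped function:

atom = memory.cache(atom_lda)

If that introspection fails, the explicit def above is the safe form.)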

atom_lda is written in modern optimized Fortran, wrapped using Cython.
It returns the total energy, Kohn-Sham energies (numpy array),
occupations (n, l, f -- numpy arrays), grid (R, numpy array), final SCF
potential + density (numpy arrays) and finally the orbitals (numpy 2D
array). So it's quite a bit of work to make sure everything is saved
correctly. With joblib, everything seems to be saved properly out of
the box.

I have one related question --- is there any way to convert the disk
"cache" into some platform-independent format (hdf5?) that I can just
copy onto another machine and reuse? Currently I need to rerun the
calculation once at home and once at work, even though it produces
exactly the same results.

Ondrej

Re: [joblib] joblib rocks

From: Gael Varoquaux
Date: 2011-05-10 @ 05:43
On Mon, May 09, 2011 at 05:46:34PM -0700, Ondrej Certik wrote:
> I just wanted to say thanks for this fantastic lib.

Yey! Thanks. I was waiting to have a working cache replacement policy
before announcing it to the world.

> I have one related question --- is there any way to convert the disk
> "cache" into some platform independent format (hdf5?), that I can just
> copy onto another machine and reuse? Currently I need to rerun the
> calculation once at home and once at work, even though it produces
> exactly the same results.

Good question. It calls for two answers:

 1. I tend to consider this behavior a bug rather than a feature. I
    believe that it comes from the fact that the same array with 2
    different versions of numpy gets a different hash (in joblib.hash). I
    would be happy taking a patch fixing that behavior. I am OK with
    special-casing numpy arrays in the fix, because I believe that it
    is important enough. If you investigate and pinpoint where the
    problem happens exactly, it might even be enough to get me to write
    the fix.

 2. Storing to different outputs should be possible. So far it wasn't,
    because the persistence code was not factored out. Dag has started
    work on defining a JobStore object. Once this is ready, I would
    gladly accept having an HDF5JobStore object. However, I am not sure
    that this would fix your problem. Indeed, to be able to persist any
    Python object (which is a core design goal for joblib), I suspect
    that we will have to rely on pickle at some point.
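
To make point 2 concrete, here is a purely hypothetical sketch of what
such a store object might look like -- the class name and methods below
are illustrative only, not Dag's actual design:

import os
import pickle

class JobStore(object):
    # Hypothetical interface: map the hash of (function, arguments)
    # to a stored output. An HDF5JobStore would implement the same
    # three methods on top of an HDF5 file instead of one pickle
    # file per cached output.
    def __init__(self, cachedir):
        self.cachedir = cachedir
        if not os.path.exists(cachedir):
            os.makedirs(cachedir)

    def _path(self, key):
        return os.path.join(self.cachedir, key + ".pkl")

    def save(self, key, output):
        with open(self._path(key), "wb") as f:
            pickle.dump(output, f, protocol=2)

    def load(self, key):
        with open(self._path(key), "rb") as f:
            return pickle.load(f)

    def contains(self, key):
        return os.path.exists(self._path(key))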

Cheers,

Gael

@Dag: I'll review your branch on Friday. I have a deadline Friday
morning, and I'd rather avoid working on other (more interesting)
things before that.

Re: [joblib] joblib rocks

From: Ondrej Certik
Date: 2011-05-10 @ 06:24
On Mon, May 9, 2011 at 10:43 PM, Gael Varoquaux
<gael.varoquaux@normalesup.org> wrote:
[...]
>> I have one related question --- is there any way to convert the disk
>> "cache" into some platform-independent format (hdf5?) that I can just
>> copy onto another machine and reuse? Currently I need to rerun the
>> calculation once at home and once at work, even though it produces
>> exactly the same results.
>
> Good question. It calls for two answers:
>
>  1. I tend to consider this behavior a bug rather than a feature. I
>    believe that it comes from the fact that the same array with 2
>    different versions of numpy gets a different hash (in joblib.hash). I
>    would be happy taking a patch fixing that behavior. I am OK with
>    special-casing numpy arrays in the fix, because I believe that it
>    is important enough. If you investigate and pinpoint where the
>    problem happens exactly, it might even be enough to get me to write
>    the fix.
>
>  2. Storing to different outputs should be possible. So far it wasn't,
>    because the persistence code was not factored out. Dag has started
>    work on defining a JobStore object. Once this is ready, I would
>    gladly accept having an HDF5JobStore object. However, I am not sure
>    that this would fix your problem. Indeed, to be able to persist any
>    Python object (which is a core design goal for joblib), I suspect
>    that we will have to rely on pickle at some point.

Right. I didn't realize that arbitrary Python objects are supported
(which of course makes sense).
So it would only work for things that hdf5 can support directly.
It brings up two points as well:

1) one is to make things work cross-platform with ints, floats and
numpy arrays in terms of hashes. That's your point 1., and that should
be possible to fix

2) second is to make sure that the cache format itself is
multiplatform. I just saw that Python pickles are used so far. I
think that pickles are platform dependent? Or can they be quite safely
transferred across computers?


Ondrej

Re: [joblib] joblib rocks

From: Gael Varoquaux
Date: 2011-05-10 @ 07:03
On Mon, May 09, 2011 at 11:24:40PM -0700, Ondrej Certik wrote:
> Right. I didn't realize that arbitrary Python objects are supported
> (which of course makes sense).

Yeah, if one doesn't do that, it breaks quite quickly. On the other
hand, I am open to some special-casing (I have already special-cased
numpy arrays).

> So it would only work for things that hdf5 can support directly.

As you can stick strings in HDF5, everything can be pickled into it. I
think that Pauli had written an hdf5pickle; you can check what its
status is.
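
A minimal sketch of that idea, assuming the h5py bindings (the helper
names dump_to_hdf5/load_from_hdf5 are made up for illustration):

import pickle
import numpy as np
import h5py

def dump_to_hdf5(filename, name, obj):
    # Pickle the object to a byte string and store it as an opaque
    # scalar dataset; np.void keeps HDF5 from interpreting the bytes.
    with h5py.File(filename, "a") as f:
        f.create_dataset(name, data=np.void(pickle.dumps(obj, protocol=2)))

def load_from_hdf5(filename, name):
    with h5py.File(filename, "r") as f:
        return pickle.loads(f[name][()].tobytes())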

> 2) second is to make sure that the cache format itself is
> multiplatform. I just saw that Python pickles are used so far. I
> think that pickles are platform dependent? Or can they be quite safely
> transferred across computers?

Pickles are pretty much a dump of the internal structure as a string. As
such they are not guaranteed to be consistent across platforms. In my
experience cross-platform is not a big deal. Python versions can be.
Another problem that I think we encounter a lot in the scientific Python
world is that the descriptors of numpy arrays might have changed between
versions.

I suspect that the difference is in the input hash. If that is the
case, you can try pinpointing which object makes a difference by
running joblib.hash manually on each object on the different computers.
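
For instance, a minimal check along those lines (the argument values
below are made up -- substitute the actual rmin, rmax, etc. from your
find_N0 call):

import numpy as np
import joblib

# Hash each input argument separately on both machines; any argument
# whose hash differs between the two is the culprit.
for arg in [92, 1e-9, 50.0, 1e9, 3000, 1e-11, 1e-11]:
    print joblib.hash(arg)
# Arrays are the most likely suspects:
print joblib.hash(np.linspace(0.0, 1.0, 10))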

Gael

Re: [joblib] joblib rocks

From: Olivier Grisel
Date: 2011-05-10 @ 07:12
2011/5/10 Gael Varoquaux <gael.varoquaux@normalesup.org>:
> On Mon, May 09, 2011 at 11:24:40PM -0700, Ondrej Certik wrote:
>> Right. I didn't realize that arbitrary Python objects are supported
>> (which of course makes sense).
>
> Yeah, if one doesn't do that, it breaks quite quickly. On the other
> hand, I am open to some special-casing (I have already special-cased
> numpy arrays).
>
>> So it would only work for things that hdf5 can support directly.
>
> As you can stick strings in HDF5, everything can be pickled into it. I
> think that Pauli had written an hdf5pickle; you can check what its
> status is.
>
>> 2) second is to make sure that the cache format itself is
>> multiplatform. I just saw that Python pickles are used so far. I
>> think that pickles are platform dependent? Or can they be quite safely
>> transferred across computers?
>
> Pickles are pretty much a dump of the internal structure as a string. As
> such they are not guaranteed to be consistent across platforms. In my
> experience cross-platform is not a big deal.

Are you sure about that? I think the pickle formats (there are several
generations) are documented and the same across platforms (need to
check though).
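
For what it's worth, the stdlib even ships a disassembler for the
pickle wire format, so this is easy to check -- a small sketch:

import pickle
import pickletools

# The pickle stream is a documented opcode format; pickletools.dis
# prints the opcodes, so streams produced on two machines can be
# compared directly.
payload = pickle.dumps({"N0": 3000, "a": 1e9}, protocol=2)
pickletools.dis(payload)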

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: [joblib] joblib rocks

From: Gael Varoquaux
Date: 2011-05-10 @ 07:18
On Tue, May 10, 2011 at 09:12:10AM +0200, Olivier Grisel wrote:
> > Pickles are pretty much a dump of the internal structure as a string. As
> > such they are not guaranteed to be consistent across platforms. In my
> > experience cross-platform is not a big deal.

> Are you sure about that? I think the pickle formats (there are several
> generations) are documented and the same across platforms (need to
> check though).

If you have an object that is instantiated differently across platforms
(it does happen), then its pickle will change.

Pickling is reasonably portable. The reason it is not fully portable is
that it does not go through hoops to account for changes in
representation. For instance, inspect.getargspec will return a
different object under 2.5 and 2.6 (2.6: a namedtuple). This is the
kind of change that will trigger problems with pickling.
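
A quick way to see this kind of difference for yourself (f below is
just a made-up example function):

import inspect
import pickle

def f(a, b=1, *args, **kwargs):
    pass

# Under 2.5 this is a plain tuple; under 2.6+ it is an ArgSpec
# namedtuple, so the resulting pickles differ between versions.
spec = inspect.getargspec(f)
print type(spec)
print len(pickle.dumps(spec))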

G