joblib.hpc: Python functions as cluster jobs

From: Dag Sverre Seljebotn
Date: 2011-04-29 @ 13:45
Hi list,

my efforts to use a cluster with long-running jobs while attempting to 
maintain sanity seem finally to be going somewhere. I'm just very bad 
with anything involving "keeping logs" or "registering runs in 
databases" -- what I want is to keep the iteratively explorative 
approach, with some figure-producing scripts gradually being built, even 
if the computations must run for days or weeks.

Partly for historical reasons I've been developing in a package 
"joblib.hpc". This is convenient because I have rather hard dependencies 
on joblib. Still, my approach had to be a bit different, so it is also 
in a sense a reimplementation.

My main question is: Is what I'll describe below something people would 
like to (eventually, when stable) see distributed with joblib itself? 
What are the ambitions of the joblib project: Small and for one task, or 
umbrella for any kind of pipeline jobs (from in-memory caching to 
distributed computing)? Should I create a new project for this (with a 
hard joblib dependency)?

Secondary question: input on this is solicited. There's a lot to this I'm 
not writing below -- this is merely an overview -- so feel free to ask 
about specifics.

The idea is joblib meets concurrent.futures meets Nix [1]. Some of the 
stuff (a caching version of concurrent.futures; @versioned) I think 
could in time be refactored into joblib proper, while some things would 
obviously stay hpc-specific.

1) Rather than @memory.cache, I decouple the issue of versioning a 
function from computation/caching:

@versioned()
def func(x, y): ...

Note that for week-long runs I don't rerun because of a refactor, but I 
need to reliably trigger just the right reruns when I fix a critical 
bug. By default, it takes the joblib approach of hashing the function 
source, but you can override it:

@versioned(2) # increment manually each time a critical bug is found
def func(x, y):
     ...

Currently, one is also required to pass "deps=False", to not track 
*called* functions. However, I've made sure the algorithm/design can be 
extended to encompass build-system-like dependency tracking, so that 
changing a @versioned function deep in the program triggers necessary 
re-runs. (Functions without @versioned are always ignored, I think.)

The core insight leading to this comes from an off-hand comment in 
Konrad Hinsen's EuroScipy2010 presentation: Often the caller must decide 
whether the function should be run asynchronously, not the function. 
(Which is not to say that @memory.cache isn't convenient, I just need 
something more fine-grained myself.)

2) Use the concurrent.futures API to submit and cache jobs.

from joblib.hpc.clusters.titan_oslo import TitanOsloExecutor
ex = TitanOsloExecutor(account='astro', logger=logger)
job1 = ex.submit(func1, 2, 3)
job2 = ex.submit(func2, 2, 3)
print job1.result() # waits for cluster job to finish

  a) Results are cached, like joblib and unlike default concurrent.futures
  b) You need to integrate with whatever queue system the cluster uses 
(rather easy)
  c) Jobs are really spawned; you can kill the launching process without 
stopping the jobs. So since I don't bother to let my script wait for 
days, I'll hit Ctrl+C. Then when the job is run, I can simply restart 
the script, which will find the results in the cache and continue immediately.

3) No messy persistent servers, databases, custom scheduling etc. In my 
case, the above really just looks for or produces these files:

$JOBSTORE/func1-jDdfJt_1A2mMPIOvlrifWoQY0PoA/input.pkl
$JOBSTORE/func1-jDdfJt_1A2mMPIOvlrifWoQY0PoA/sbatchscript

and if it "came there first", runs a command "sbatch 
$JOBSTORE/func1-jDdfJt_1A2mMPIOvlrifWoQY0PoA/sbatchscript", which in 
addition produces "output.pkl" +  auxiliary files from the job.

If a job goes wrong, I'll simply look at the log, resubmit it etc. with 
my usual tools, and joblib.hpc is none the wiser (although it could make 
that more convenient in time).
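
To make the protocol concrete, here is a minimal sketch of the "came 
there first" logic (the submit_job helper and the runner command are 
hypothetical, invented for illustration; they are not the actual 
joblib.hpc code):

import base64, hashlib, os, pickle, subprocess

def submit_job(jobstore, func, args):
    # Job identity: function name + base32-encoded hash of pickled arguments
    blob = pickle.dumps((func.__name__, args))
    digest = base64.b32encode(hashlib.sha1(blob).digest()).decode()
    jobdir = os.path.join(jobstore, '%s-%s' % (func.__name__, digest))
    if not os.path.exists(jobdir):
        # We came there first: materialize input.pkl and the sbatch script
        os.makedirs(jobdir)
        with open(os.path.join(jobdir, 'input.pkl'), 'wb') as f:
            f.write(blob)
        with open(os.path.join(jobdir, 'sbatchscript'), 'w') as f:
            f.write('#!/bin/sh\n'
                    '# hypothetical runner: unpickles input.pkl, writes output.pkl\n'
                    'python -m gridlib.runjob %s\n' % jobdir)
        subprocess.check_call(['sbatch', os.path.join(jobdir, 'sbatchscript')])
    return jobdir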

4) Status? I'm still hacking on this; only the basic functionality you 
see above is working. Before pushing this anywhere I'd like to eat my 
own dogfood with some real jobs for a month or so.

https://github.com/dagss/joblib/tree/hpc

Finally, thanks to Jon Olav for useful discussions and inspiration.

Dag Sverre

[1] Another big inspiration is nix for building software:
http://nixos.org
My take on Nix:
https://github.com/dagss/scidist/blob/master/ideas.rst

Re: joblib.hpc: Python functions as cluster jobs

From:
Dag Sverre Seljebotn
Date:
2011-04-29 @ 13:50
On 04/29/2011 03:45 PM, Dag Sverre Seljebotn wrote:
> [...]


I forgot a point that is *very* important to me:

5) The "cache" is never cleared. The idea is that I can use git to go in 
different "directions of exploration" with my script, and when switching 
branch, all the (older/different) computed results are also present in 
the same store.

In general you are careful about throwing away the results of massive 
amounts of CPU hours.

To clear things out, I'd actually use garbage collection:

$ joblib gc mark
$ git checkout assumption_1
$ python produce_paper_figures.py 1.png
$ git checkout assumption_2
$ python produce_paper_figures.py 2.png
# Now, get rid of runs only needed for assumption_3, which had no value
$ joblib gc sweep
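
A sketch of how the mark-and-sweep could work on a flat job store (the 
"joblib gc" command is hypothetical; this is just the idea): mark records 
a timestamp, every cache hit afterwards touches the job directory, and 
sweep deletes the directories that were never touched:

import os, shutil, time

def gc_mark(store):
    with open(os.path.join(store, '.gc_mark'), 'w') as f:
        f.write(str(time.time()))

def gc_touch(jobdir):
    # Called on every cache hit, so entries in use survive the sweep
    os.utime(jobdir, None)

def gc_sweep(store):
    with open(os.path.join(store, '.gc_mark')) as f:
        mark = float(f.read())
    for name in os.listdir(store):
        jobdir = os.path.join(store, name)
        if os.path.isdir(jobdir) and os.path.getmtime(jobdir) < mark:
            shutil.rmtree(jobdir)  # not needed by any run since the mark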

DS



Re: [joblib] joblib.hpc: Python functions as cluster jobs

From: Gael Varoquaux
Date: 2011-04-29 @ 19:28
Hi,

Thanks for your input. What follows is a really long email. You started it :).

Your suggestions are partly going in the same direction as my vision
for joblib, and partly moving away from it. First let me state some core
design principles that I would like to underpin joblib:

 1. Simplicity: if a problem is hard, I try to find a way of not
    solving it.

 2. Robustness: I'd rather have quality over quantity. I use a lot of
    code for data and computation management that I do not push to
    joblib because I do not want to make it more fragile. Although this
    is not currently the case, Joblib should never ever break (this is also 
    a reason for point 1), because if it does, it ends up getting in the
    way (preferably before a deadline).

 3. Seamlessness: joblib should be usable by people who don't understand
    how it works. Joblib is also meant to be used as a library, enclosed
    in end-user programs. We use it in scikits.learn, for instance.

 4. General purpose: I don't want to encode my usage pattern in joblib,
    because if I keep it general it has more chances to get users and
    contributors, and thus become a Good library.

That said, I think that joblib has the vocation of doing more than it
currently does. That is the reason that Parallel and Memory are in the
same package. They should interplay at some point. What you describe
here:

> I'm just very bad with anything involving "keeping logs" or
> "registering runs in databases" -- what I want is to keep the
> iteratively explorative approach, with some figure-producing scripts
> gradually being built, even if the computations must run for days or
> weeks.

matches very well my goals. The reason it does not solve all these
problems is that I found, from failed previous attempts, that I needed
to slowly extract robust patterns solving atomic problems that I
encountered in day-to-day work, and iron them out, rather than trying to
tackle the big picture, which was too hard a problem.

> Partly for historical reasons I've been developing in a package 
> "joblib.hpc". This is convenient because I have rather hard dependencies 
> on joblib.

For joblib, I frown on the name HPC: it is very much anchored in the
scientific computing community, and actually in a sub-community. Joblib
has the vocation of targeting a much wider community.

> My main question is: Is what I'll describe below something people would 
> like to (eventually, when stable) see distributed with joblib itself? 

Partly. I would encourage you to sort out ideas that are general and
reusable from those more specific to your current usecase. I think that
this would be beneficial for you also, as you will find that when you
move to new problems, you will be able to reuse more code.

> What are the ambitions of the joblib project: Small and for one task, or 
> umbrella for any kind of pipeline jobs (from in-memory caching to 
> distributed computing)?

In between. On the one hand, I would like joblib to host many patterns
that make pipelining operations of big data or long running jobs easier.
On the other hand, distributed computing tends to bring in very
challenging problems that require strong expertise. These I want to stay
away from. Let's say that in terms of distributed computing, I am open to
anything that:

 1. Leads to a simple, readable API (we will probably need to iterate on
    that).

 2. Can be implemented using multiprocessing. I think that it would be
    interesting to have an optional IPython backend for all these
    operations, the message passed on to users being that you could 
    write for multiprocessing during the debugging part, and use IPython 
    for the real thing.

> The idea is joblib meets concurrent.futures meets Nix [1]. Some of the 
> stuff (a caching version of concurrent.futures; @versioned) I think 
> could in time be refactored into joblib proper, while some things would 
> obviously stay hpc-specific.

I think that things that are HPC specific should be in a different
project. "hpclib" sounds good. "gridlib" maybe even cooler.

Below I discuss your ideas, converting them into 'proposals' that I can
directly relate to the code base. I am sorry for thinking in these terms,
rather than in a grand vision, but it really helps me envisage how to
move forward and empirically try ideas to see if they work.

> 1) Rather than @memory.cache, I decouple the issue of versioning a 
> function from computation/caching:

> @versioned()
> def func(x, y): ...

> Note that for week-long runs I don't rerun because of a refactor, but I 
> need to reliably trigger just the right reruns when I fix a critical 
> bug. By default, it takes the joblib approach of hashing the function 
> source, but you can override it:

> @versioned(2) # increment manually each time a critical bug is found
> def func(x, y):
>      ...

Sounds like a good idea. It meets an important usecase, and I am all in
favor of this. I must admit that I had been thinking along similar lines.
Let me however suggest a variant around this idea.

First, a bit of background on why joblib works the way it works. I tried
explicit trajectory tracking in the early days. It didn't work because
it led to convoluted code that would fail too often. From this experience
I learned my 3 first design principles listed at the beginning of this
mail. This is why I fell back on hashes: hashes avoid having to maintain
a dependency graph, as git has shown us. They enable us to solve a
problem that is local to the execution and not global.

**Proposal a.** First, I am perfectly happy to take a patch that adds an
option to the Memory object (and to its cache method) to turn tracking
of function source code off. This would probably answer 90% of your
usecases.

**Proposal b** Second, if we want actual versioning, that is the ability
to go back to results computed with previous versions of the code, we
could change the way the memory object computes its hash, and simply add
the function code to the hash. In this sense, the function source code
becomes just like another argument. I would like this to happen in a
subclass of the memory object, because I find that it is already too
complicated.

**Proposal c** The previous proposal raises an interesting issue, which is
that you might want to recall previously computed results. For
instance you want to compare a previous run with an old version of a
function to the current version. Right now, this is tedious: you have to
navigate through the hash directories to find which hash corresponds to
what, before you are able to reload the data using the private APIs of
Memory. This problem is not specific to versioning functions, but can
also be useful to recall previous arguments, that you might not have
stored. There is a simple solution to that problem: add the idea of a 'tag'
on the hashes, exactly like git does. I am not sure what the API should
be for this, but I think that this tagging idea is probably very generic
and can come in handy in many contexts. It seems fairly easy to code (the
tag table will probably need some garbage collection, but that's
trivial). One option that I see is to add a method to retrieve results by
tags and another to call specifying a tag.
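
For concreteness, a sketch of what such a tag table could look like (the
API is invented for illustration; it maps human-readable names to cache
hashes, exactly like git refs):

import json, os

class TagTable(object):
    def __init__(self, cachedir):
        self.path = os.path.join(cachedir, 'tags.json')

    def _load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def set(self, tag, digest):
        tags = self._load()
        tags[tag] = digest
        with open(self.path, 'w') as f:
            json.dump(tags, f)

    def resolve(self, tag):
        return self._load()[tag]

A Memory subclass could then expose something like
mem.get(tag='paper-figure-3') to reload a tagged result (method name
hypothetical).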

**Proposal d** Finally, something even more general and useful is to
have a method on the memory object that returns a full 'Result' object
that knows its hash, the directory it is stored in, and other information.
This would be useful because a DeferredResult could come in handy in the
parallel computing part, and we could have a partly unifying API.
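
Such a Result object could be as simple as this sketch (field and method
names invented):

import os, pickle

class Result(object):
    def __init__(self, digest, directory):
        self.digest = digest        # hash identifying the cached call
        self.directory = directory  # where the pickled output lives

    def is_computed(self):
        return os.path.exists(os.path.join(self.directory, 'output.pkl'))

    def get(self):
        with open(os.path.join(self.directory, 'output.pkl'), 'rb') as f:
            return pickle.load(f)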

> Currently, one is also required to pass "deps=False", to not track 
> *called* functions.

Hum, which joblib are we talking about? This is a feature I'd love to
have, but I don't see where it is. Have you implemented it in a branch?

> The core insight leading to this comes from an off-hand comment in 
> Konrad Hinsen's EuroScipy2010 presentation: Often the caller must decide 
> whether the function should be run asynchronously, not the function. 
> (Which is not to say that @memory.cache isn't convenient, I just need 
> something more fine-grained myself.)

+1. I fully agree with that vision. I just want to find abstractions and
code layouts so that the joblib code base stays easily tractable, which is
very high on my priority list.

> 2) Use the concurrent.futures API to submit and cache jobs.

> from joblib.hpc.clusters.titan_oslo import TitanOsloExecutor
> ex = TitanOsloExecutor(account='astro', logger=logger)
> job1 = ex.submit(func1, 2, 3)
> job2 = ex.submit(func2, 2, 3)
> print job1.result() # waits for cluster job to finish

>   a) Results are cached, like joblib and unlike default concurrent.futures
>   b) You need to integrate with whatever queue system the cluster uses 
> (rather easy)
>   c) Jobs are really spawned; you can kill the launching process without 
> stopping the jobs. So since I don't bother to let my script wait for 
> days, I'll hit Ctrl+C. Then when the job is run, I can simply restart 
> the script, which will find the results in cache and continue immediately

I can partly buy that, although I think we are pushing fairly far to be
able to implement this robustly. Note that I don't think that it is
unmanageable, I just think that we need to walk slowly when going in this
direction. A few remarks:

 1. How do you handle job submission? For joblib to accept it, I would
    like an implementation using multiprocessing that only knows how to
    submit to a multiprocessing pool. Optionally, I would be happy
    taking code that uses IPython to submit on a remote server, but I
    would really insist on the following:

     a. Provide a joblib.distributed.Parallel, that implements Parallel using
	IPython, so as to have seamless features with and without IPython

     b. Anything that needs complex code, for instance a job scheduler,
	should live in IPython, not joblib. Distributed computing is the 
	expertise of IPython, not joblib.

 2. In the case of multiprocessing, Ctrl-C would probably run into
    problems, but by virtue of principle 1 (simplicity) I am perfectly
    happy deciding that it is not a feature.

**Proposal e** I'd love an implementation of such a method in a
ParallelMemory (DistributedMemory for the IPython version), but I would
prefer the method to be called 'deferred' or maybe even 'deferred_cache',
rather than 'submit'. It could have almost the same API as the call that
I suggest in proposal d, but returning a DeferredResult.

> 3) No messy persistent servers, databases, custom scheduling etc. In my 
> case, the above really just looks for or produces these files:

> $JOBSTORE/func1-jDdfJt_1A2mMPIOvlrifWoQY0PoA/input.pkl
> $JOBSTORE/func1-jDdfJt_1A2mMPIOvlrifWoQY0PoA/sbatchscript

> and if it "came there first", runs a command "sbatch 
> $JOBSTORE/func1-jDdfJt_1A2mMPIOvlrifWoQY0PoA/sbatchscript", which in 
> addition produces "output.pkl" +  auxiliary files from the job.

Sounds good, except that I'd like to stay away from shell/command-line
programs: in my opinion they give code that is most often untested and
can give rise to an explosion of options. I'd rather push people to
implement small Python scripts solving their needs, and provide a rich,
well-tested, Python API.

> If a job goes wrong, I'll simply look at the log, resubmit it etc. with 
> my usual tools, and joblib.hpc is none the wiser (although it could make 
> that more convenient in time).

Yes, we are absolutely on the same page here. However, the logging
framework of joblib is almost non-existent. In my experience, it is
necessary to develop a real logging framework to address these problems.
This problem is somewhat independent from the above, so it might be good
to implement it in a separate branch, as I fear that the list of
features to implement is getting fairly huge.

**Proposal f** Implement proper logging methods on the Logger object, and
use the logging Python core module to be able to log to a file.
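
A minimal sketch of this with the standard logging module (how it would
be wired into joblib's Logger object is left open):

import logging

def make_logger(logfile):
    logger = logging.getLogger('joblib')
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(logfile)
    handler.setFormatter(
        logging.Formatter('%(asctime)s %(name)s %(levelname)s: %(message)s'))
    logger.addHandler(handler)
    return logger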

> 4) Status? I'm still hacking on this, only the basic functionality you 
> see above is working. Before pushing this anywhere I'd like to eat my 
> own dogfood with some real jobs for a month or so.

> https://github.com/dagss/joblib/tree/hpc

It would really be great if we could interact and merge partial
functionality rather than a huge patch (see below).

> 5) The "cache" is never cleared. The idea is that I can use git to go
> in 
> different "directions of exploration" with my script, and when
> switching 
> branch, all the (older/different) computed results are also present in 
> the same store.

Well, that will not really be an option for everybody. I run computations
on massive amounts of data. My number one pet peeve with joblib is that it
does not have a cache replacement policy. My number one problem with it
(violating 3) is that I start computations on a Friday evening, and when
I come back to work on Monday, I have blown up all the disk space storing
results. 

> To clear things out, I'd actually use garbage collection:

> $ joblib gc mark
> $ git checkout assumption_1
> $ python produce_paper_figures.py 1.png
> $ git checkout assumption_2
> $ python produce_paper_figures.py 2.png
> # Now, get rid of runs only needed for assumption_3, which had no value
> $ joblib gc sweep

That's interesting. We could debate the API: I'd rather not have a
command-line module for joblib. But the real question is: how to
implement garbage collection. For me, garbage collection needs the
information of dependency, and we don't have a way of storing
dependencies in joblib. Now if you come up with a robust way of doing
this that is scalable, I am really interested.

To summarize, I think that we have a lot of ideas in common. I would like
to see joblib become a bit more like a programmatic key-value store that
associates computational jobs with their results with an imperative syntax.

However, in order to stay pragmatic, and make sure that good code
actually comes from all these crazy ideas, and that we can use it to
solve our day-to-day problems in the long run, I think that it would be
very helpful to focus on a few ideas and move them forward. These last
years my strategy to be able to publish papers quickly and yet build a
code base has been to have two codebases: one in which I implemented
quickly what I needed to get my work done, and another in which I
transferred the good ideas, months after writing them, and most often
reimplementing from scratch. This is actually how joblib was born. I
guess I am suggesting that you shouldn't tie your day work to the
evolution of joblib, but rather let them live in parallel. 

Finally, let me strongly advocate for you submitting small, focused pull
requests implementing one feature at a time. It will be much easier to
review and iterate from these, and it will hopefully let me pitch in to
give you some help in your endeavor. I had a quick look at your fork. You probably
have done very valuable work in it that should already be in joblib. Some
of the work is clearly not directly related to the more recent HPC stuff
that you have introduced. I am perfectly happy reviewing it and trying to
merge it (I haven't looked at it yet). You could branch out origin/master
in a new branch, cherry pick these changes, and send me a pull request. I
happen to have some time tomorrow (hint, hint).

Thanks a lot for pitching in. I need help to go beyond the current state
of joblib, and it feels great to have you on board.

G

Re: [joblib] joblib.hpc: Python functions as cluster jobs

From: Dag Sverre Seljebotn
Date: 2011-04-29 @ 21:39
On 04/29/2011 09:28 PM, Gael Varoquaux wrote:
> Hi,
>
> Thanks for your input. What follows is a really long email. You started it :).
>    

Thanks for a very helpful and informative reply. I'll try my best to 
make this one short. We'll see how I manage :-)

First, my branch wasn't at all meant as a "pull request" of any sort. It 
was indeed a snapshot of my "research fork", mentioned just in case 
somebody had a personal research interest in it (which I know you don't).

>> The idea is joblib meets concurrent.futures meets Nix [1]. Some of the
>> stuff (a caching version of concurrent.futures; @versioned) I think
>> could in time be refactored into joblib proper, while some things would
>> obviously stay hpc-specific.
>>      
> I think that things that are HPC specific should be in a different
> project. "hpclib" sounds good. "gridlib" maybe even cooler.
>    

OK, given what you wrote above I think that makes sense.

>> 1) Rather than @memory.cache, I decouple the issue of versioning a
>> function from computation/caching:
>>      
>> @versioned()
>> def func(x, y): ...
>>      
>> Note that for week-long runs I don't rerun because of a refactor, but I
>> need to reliably trigger just the right reruns when I fix a critical
>> bug. By default, it takes the joblib approach of hashing the function
>> source, but you can override it:
>>      
>> @versioned(2) # increment manually each time a critical bug is found
>> def func(x, y):
>>       ...
>>      
> Sounds like a good idea. It meets an important usecase, and I am all in
> favor of this. I must admit that I had been thinking along similar lines.
> Let me however suggest a variant around this idea.
>
> First, a bit of background on why joblib works the way it works. I tried
> explicit trajectory tracking in the early days. It didn't work because
> it led to convoluted code that would fail too often. From this experience
> I learned my 3 first design principles listed at the beginning of this
> mail. This is why I fell back on hashes: hashes avoid having to maintain
> a dependency graph, as git has shown us. They enable us to solve a
> problem that is local to the execution and not global.
>
> **Proposal a.** First, I am perfectly happy to take a patch that adds an
> option to the Memory object (and to its cache method) to turn tracking
> of function source code off. This would probably answer 90% of your
> usecases.
>    

That's a patch I have little interest in myself. It would lead me to a 
situation where "if I find a bug, to get reliable results, I need to 
clear my entire store". Clearing too much of the store would literally 
be throwing money down the toilet. And I don't run out of disk space. I 
guess we're on opposite sides on the disk space usecase.

> **Proposal b** Second, if we want actual versioning, that is the ability
> to go back to results computed with previous versions of the code, we
> could change the way the memory object computes its hash, and simply add
> the function code to the hash. In this sense, the function source code
> becomes just like another argument. I would like this to happen in a
> subclass of the memory object, because I find that it is already too
> complicated.
>    

In "gridlib", I very much let the function revision be part of the hash; 
that's a very important part of it.

But: @versioned simply annotates a function with version information. It 
doesn't tell you how to use it.

So I actually thought this would make Memory simpler, not more 
complicated. What I was thinking, in order to turn this into a pull 
request for joblib down the line, is:

  a) Have @memory.cache imply @versioned (call it on argument if needed)
  b) Remove all the source code comparison-stuff from Memory, and simply 
check if the function version is the same as the last time (since Memory 
operates on an "only-newest" basis).

I agree that Memory should not start keeping multiple versions by default.

Here's how @versioned works (a code sketch follows the list):

  - A version is a string (but str() is applied to the argument)
  - If no version is given, use the base32-encoded hash of the source as 
the version. (So if you decide to "freeze" the function after-the-fact, 
there's a way to do it, without needing a feature for it.)
  - Then the combination of the fully-qualified name and the version is 
used for the digest/hash.
  - The function gets an extra version_info attribute (a dict) 
containing the digest, version information, etc. (I may instead stick it 
in a global WeakKeyDictionary if I get problems here, we'll see)
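
A sketch of a decorator implementing these rules (my reading of the
description above, not the actual gridlib code):

import base64, hashlib, inspect

def versioned(version=None, deps=False):
    assert deps == False  # tracking of called functions not implemented yet
    def decorator(func):
        if version is None:
            # Default: base32-encoded hash of the source is the version
            src = inspect.getsource(func)
            ver = base64.b32encode(hashlib.sha1(src.encode()).digest()).decode()
        else:
            ver = str(version)
        # The digest combines the fully-qualified name and the version
        key = '%s.%s:%s' % (func.__module__, func.__name__, ver)
        digest = hashlib.sha1(key.encode()).hexdigest()
        func.version_info = dict(version=ver, digest=digest)
        return func
    return decorator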


> **Proposal c** The previous proposal raises an interesting issue, which is
> that you might want to recall previously computed results. For
> instance you want to compare a previous run with an old version of a
> function to the current version. Right now, this is tedious: you have to
> navigate through the hash directories to find which hash corresponds to
> what, before you are able to reload the data using the private APIs of
> Memory. This problem is not specific to versioning functions, but can
> also be useful to recall previous arguments, that you might not have
> stored. There is a simple solution to that problem: add the idea of a 'tag'
> on the hashes, exactly like git does. I am not sure what the API should
> be for this, but I think that this tagging idea is probably very generic
> and can come in handy in many contexts. It seems fairly easy to code (the
> tag table will probably need some garbage collection, but that's
> trivial). One option that I see is to add a method to retrieve results by
> tags and another to call specifying a tag.
>    

This is exactly what my 5) is about. I'm not sure why you think we need 
explicit tags at all though.

You simply use git on the source code. Checking out a specific revision 
of the script in git, and running the script, fetches the old results.

(Since function version is part of the hash, and I never delete 
anything, this just works.)

Ahh, perhaps I see where we're not having the same idea... I work with 2 
git repositories for my project:

  - A library repository for "code"
  - A "runs" repository where I check in the concrete uses of my code 
(link code with data).

So to pull out an earlier run that's not currently available for some 
reason, I simply check out another revision of the "runs" repository.

Similarly, I can use git branches to pursue different directions of exploration.

And fetching data to analyse from a central repository to the local store is 
just another job too (run about once).


> **Proposal d** Finally, something even more general and useful is to
> have a method on the memory object that returns a full 'Result' object
> that knows its hash, the directory it is stored in, and other information.
> This would be useful because a DeferredResult could come in handy in the
> parallel computing part, and we could have a partly unifying API.
>    

That's what I've implemented for gridlib, using the Python 3 
concurrent.futures API for which there is a backport.

My implementation is for the cluster, not for multiprocessing, but since 
concurrent.futures has a multiprocessing implementation this isn't too 
difficult.

So I guess whatever is abstract enough to be useful with 
joblib+multiprocessing could find its way in, with the rest staying in gridlib.

>    
>> Currently, one is also required to pass "deps=False", to not track
>> *called* functions.
>>      
> Hum, which joblib are we talking about? This is a feature I'd love to
> have, but I don't see where it is. Have you implemented it in a branch?
>    

I was describing "gridlib". Sorry about the confusion.

I basically just require code I write now to be forward-compatible with 
turning on such tracking in the future, and make sure my design supports 
it. It's simply "assert deps==False" in the @versioned decorator.

>> 2) Use the concurrent.futures API to submit and cache jobs.
>>      
>> from joblib.hpc.clusters.titan_oslo import TitanOsloExecutor
>> ex = TitanOsloExecutor(account='astro', logger=logger)
>> job1 = ex.submit(func1, 2, 3)
>> job2 = ex.submit(func2, 2, 3)
>> print job1.result() # waits for cluster job to finish
>>      
>>    a) Results are cached, like joblib and unlike default concurrent.futures
>>    b) You need to integrate with whatever queue system the cluster uses
>> (rather easy)
>>    c) Jobs are really spawned; you can kill the launching process without
>> stopping the jobs. So since I don't bother to let my script wait for
>> days, I'll hit Ctrl+C. Then when the job is run, I can simply restart
>> the script, which will find the results in cache and continue immediately
>>      
> I can partly buy that, although I think we are pushing fairly far to be
> able to implement this robustly. Note that I don't think that it is
> unmanageable, I just think that we need to walk slowly when going in this
> direction. A few remarks:
>
>   1. How do you handle job submission? For joblib to accept it, I would
>      like an implementation using multiprocessing that only knows how to
>      submit to a multiprocessing pool. Optionally, I would be happy
>      taking code that uses IPython to submit on a remote server, but I
>      would really insist on the following:
>
>       a. Provide a joblib.distributed.Parallel, that implements Parallel using
> 	IPython, so as to have seamless features with and without IPython
>
>       b. Anything that needs complex code, for instance a job scheduler,
> 	should live in IPython, not joblib. Distributed computing is the
> 	expertise of IPython, not joblib.
>    

I definitely don't implement job scheduling myself, and have no 
ambitions in that direction. I think you answered this above, by wanting 
HPC-specific stuff to be in a separate "gridlib" (joblib is not at the 
umbrella extreme). (In my case, I submit a shell command over SSH; each 
cluster would be different here.)

I think the only integration I'd like with joblib is for joblib to 
provide a "caching concurrent.futures implementation" (probably an 
adapter that can be used on top of either ThreadExecutor or 
ProcessExecutor) so that I could write code that would dispatch either 
to cluster or to local machine depending on what executor I pass in.
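
Such an adapter could look roughly like this sketch (CachingExecutor is
an invented name; the concrete classes in concurrent.futures are
ThreadPoolExecutor and ProcessPoolExecutor):

import hashlib, os, pickle
from concurrent.futures import Future

class CachingExecutor(object):
    def __init__(self, executor, cachedir):
        self.executor = executor  # any concurrent.futures executor
        self.cachedir = cachedir

    def submit(self, func, *args):
        blob = pickle.dumps((func.__module__, func.__name__, args))
        path = os.path.join(self.cachedir,
                            hashlib.sha1(blob).hexdigest() + '.pkl')
        if os.path.exists(path):
            # Cache hit: hand back an already-resolved future
            future = Future()
            with open(path, 'rb') as f:
                future.set_result(pickle.load(f))
            return future
        future = self.executor.submit(func, *args)
        def store(fut):
            if fut.exception() is None:
                with open(path, 'wb') as f:
                    pickle.dump(fut.result(), f)
        future.add_done_callback(store)
        return future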

>   2. In the case of multiprocessing, Ctrl-C would probably run into
>      problems, but by virtue of principle 1 (simplicity) I am perfectly
>      happy deciding that it is not a feature.
>
> **Proposal e** I'd love an implementation of such a method in a
> ParallelMemory (DistributedMemory for the IPython version), but I would
> prefer the method to be called 'deferred' or maybe even 'deferred_cache',
> rather than 'submit'. It could have almost the same API as the call that
> I suggest in proposal d, but returning a DeferredResult.
>    

I don't control the API. Guido does.

http://docs.python.org/dev/library/concurrent.futures.html
http://pypi.python.org/pypi/futures

> **Proposal f** Implement proper logging methods on the Logger object, and
> use the logging Python core module to be able to log to a file.
>    

gridlib already takes logger arguments everywhere. I'm itching to add 
them to joblib, but there's only time for so much; perhaps it will happen.


> Finally, let me strongly advocate for you submitting small, focused pull
> requests implementing one feature at a time. It will be much easier to
> review and iterate from these, and it will hopefully let me pitch in to
> give you some help in your endeavor. I had a quick look at your fork. You probably
> have done very valuable work in it that should already be in joblib. Some
> of the work is clearly not directly related to the more recent HPC stuff
> that you have introduced. I am perfectly happy reviewing it and trying to
> merge it (I haven't looked at it yet). You could branch out origin/master
> in a new branch, cherry pick these changes, and send me a pull request. I
> happen to have some time tomorrow (hint, hint).
>    

I'll basically be working on this most of tomorrow, so I might take you 
up on that, I'll think about it again tomorrow morning. But my focus is 
on the "research branch"/gridlib rather than upstream joblib at the moment.

Thanks again for your comments, it's been very helpful -- this sort of 
high-level discussion is exactly what I wanted, I had no intention of 
pushing anything upstream yet.

Dag Sverre

Re: [joblib] joblib.hpc: Python functions as cluster jobs

From: Dag Sverre Seljebotn
Date: 2011-04-29 @ 21:43
On 04/29/2011 11:39 PM, Dag Sverre Seljebotn wrote:
> [...]
> That's a patch I have little interest in myself. It would lead me to a
> situation where "if I find a bug, to get reliable results, I need to
> clear my entire store". Clearing too much of the store would literally
> be throwing money down the toilet. And I don't run out of disk space. I
> guess we're on opposite sides on the disk space usecase.
>    

Don't bother to tell me why I'm totally wrong here, I just realized.

Dag Sverre

Re: [joblib] joblib.hpc: Python functions as cluster jobs

From: Gael Varoquaux
Date: 2011-04-29 @ 22:54
On Fri, Apr 29, 2011 at 11:39:48PM +0200, Dag Sverre Seljebotn wrote:
> Thanks for a very helpful and informative reply. I'll try my best to 
> make this one short.

I'll do the same: keeping a high SNR.

> > **Proposal a.** First, I am perfectly happy to take a patch that adds an
> > option to the Memory object (and to its cache method) to turn tracking
> > of function source code off. This would probably answer 90% of your
> > usecases.

> That's a patch I have little interest in myself.

Fine. It goes down the drain then. Someone interested in such
functionality can implement it.

> I guess we're on opposite sides on the disk space usecase.

Great. I enjoy working with people having different usecases. It's good
for the code.

> [second mail]
> Don't bother to tell me why I'm totally wrong here, I just realized.

Please tell me why. I am in all honesty interested. I don't think there
are any clear-cut arguments, so I am interested in reading your
thoughts.

>   a) Have @memory.cache imply @versioned (call it on argument if needed)
>   b) Remove all the source code comparison-stuff from Memory, and simply 
> check if the function version is the same as the last time (since Memory 
> operates on an "only-newest" basis).

That's an option. My instinct would be to prefer including function code
in the hashing, as I see no asymmetry between function code and arguments,
if we start not flushing automatically. Also, I don't think that this
option should make the Memory code much longer, maybe even simpler.

> Here's how @versioned works:

>   - A version is a string (but str() is applied to argument)
>   - If no version is given, use the base32-encoded hash of the source as 
> the version. (So if you decide to "freeze" the function after-the-fact, 
> there's a way to do it, without needing a feature for it.)

Very useful feature. With a hash-based solution I guess the option would
be to always include the function code in the hash, but to have a flag
controlling whether the directory is flushed when the function code
changes. That way you can freeze a posteriori.

>   - Then the combination of the fully-qualified name and the version is 
> used for the digest/hash.

One thing to keep in mind: the inode limit. When doing computations with
a lot of stored results, I do sometimes hit the inode limit. This does
surprise me now that I mention it, as the limit to the number of inodes
per directory is fairly large on most file systems. However, I have
indeed seen the error. I am not sure on which file system, though. This
is why I prefer nested to flat in general.

>   - The function gets an extra version_info attribute (a dict) 
> containing the digest, version information, etc. (I may instead stick it 
> in a global WeakKeyDictionary if I get problems here, we'll see)

Two questions/remarks:

 1. What's your retrieval API? I want to keep a light API that does not
    rely on decorators. Most of my code now looks like::

	foo = mem.cache(function)(args)

 2. Your proposal requires the user to manually increment the tag pointing
    to the function. I fear this might lead to errors. One thing that I
    try really hard to avoid with joblib is people getting false results
    because of their use of joblib. On the other hand, your proposal
    probably puts less stress on disk space.

> This is exactly what my 5) is about. I'm not sure why you think we need 
> explicit tags at all though.

> You simply use git on the source code. Checking out a specific revision 
> of the script in git, and running the script, fetches the old results.

That's the difference between you and me :). My data is huge. I cannot
keep intermediate results on the disk, let alone in version control. Thus
tags would be useful to retrieve final results even without the intermediate
steps. It's one way of recalling something that you have already done at
some point in time.

> And fetching data to analyse from a central repository to local store is 
> just another job too (run about once).

I'd like that to be true, but unfortunately it is not always the case.
Besides, as I want my local cache to be flushed (unlike you, fair enough),
I might not have the data locally after a few weeks of runs. Tags are
incredibly useful in this regard. 

Basically, if you have screwed up everything: original data, intermediate
steps, and function code, you can still retrieve the final results that
you published in a paper for inspection afterwards, as long as you have
not flushed the joblib cache. And I do agree that we should have options
to make this cache flushing harder. This could be really useful for
regression testing.

> [deferred]
> So I guess whatever is abstract enough to be useful with
> joblib+multiprocessing could find its way in, with the rest staying in gridlib.

As I said, I am completely happy having code relying on IPython for
grid-level computing in joblib. I think that it would be a great selling
point. I do see that you want to implement your own job distribution
framework. If you decide in the end to use IPython and you are able to
have fairly simple code, there is room for such code in joblib.

> >> Currently, one is also required to pass "deps=False", to not track
> >> *called* functions.

> > Hum, which joblib are we talking about? This is a feature I'd love to
> > have, but I don't see where it is. Have you implemented it in a branch?

> I was describing "gridlib". Sorry about the confusion.

Sounds interesting, though :)

> I basically just require code I write now to be forward-compatible with 
> turning on such tracking in the future, and make sure my design supports 
> it. It's simply "assert deps==False" in the @versioned decorator.

I am not sure that I understand you here.

> I definitely don't implement job scheduling myself, and have no 
> ambitions in that direction. I think you answered this above, by wanting 
> HPC-specific stuff be in a seperate "gridlib" (joblib is not at the 
> umbrella extreme). (In my case, I submit a shell command over SSH; each 
> cluster would be different here.)

OK. These features should live in a different codebase, I believe. Not
that I might not be interested in them for my own personal consumption
:). I do advise you to look carefully at prior art here. Doing these
things well is challenging, and I believe that there is a lot of prior
art.

> I think the only integration I'd like with joblib is for joblib to 
> provide a "caching concurrent.futures implementation" (probably an 
> adapter that can be used on top of either ThreadExecutor or 
> ProcessExecutor) so that I could write code that would dispatch either 
> to cluster or to local machine depending on what executor I pass in.

Sounds good. Something like this was on my mind also.

> > I would prefer the method to be called 'deferred' or maybe even
> > 'deferred_cache', rather than 'submit'. It could have almost the same
> > API as the call that I suggest in proposal d, but returning a
> > DeferredResult.

> I don't control the API. Guido does.

> http://docs.python.org/dev/library/concurrent.futures.html
> http://pypi.python.org/pypi/futures

Point taken. I believe that the API that we are discussing right now is
about something different, and that we should not try to stick to the
futures API, but you have a point, and I am willing to reconsider my
stance.

> > **Proposal f** Implement proper logging methods on the Logger object, and
> > use the logging Python core module to be able to log to a file.

> gridlib already take logger arguments everywhere. I'm itching to add 
> them to joblib, but there's only time for so much; perhaps it will happen.

Hum, interesting. I understand the time argument very well. Could you
please submit an enhancement bug report to the joblib bug tracker on
github with a link to the corresponding code, if you think it could be
backported? That way someone could pitch in.

> I'll basically be working on this most of tomorrow, so I might take you 
> up on that, I'll think about it again tomorrow morning.

Excellent. I am definitely planning to spend most of my day tomorrow on
software, so I can free time for it. I also have an overdue review :(.

> But my focus is on the "research branch"/gridlib rather than upstream
> joblib at the moment.

Good. Focusing on paying jobs is a great way to stay on track. Besides, I
am sure that you will gain a lot of insights from these.

> Thanks again for your comments, it's been very helpful -- this sort of 
> high-level discussion is exactly what I wanted, I had no intention of 
> pushing anything upstream yet.

Thanks a lot for your interest. It is great to see that we are sharing
similar goals.

G

Re: [joblib] joblib.hpc: Python functions as cluster jobs

From: Dag Sverre Seljebotn
Date: 2011-04-30 @ 08:54
On 04/30/2011 12:54 AM, Gael Varoquaux wrote:
> On Fri, Apr 29, 2011 at 11:39:48PM +0200, Dag Sverre Seljebotn wrote:
>> Thanks for a very helpful and informative reply. I'll try my best to
>> make this one short.
>
> I'll do the same: keeping a high SNR.
>
>>> **Proposal a.** First, I am perfectly happy to take a patch that adds an
>>> option to the Memory object (and to its cache method) to turn tracking
>>> of function source code off. This would probably answer 90% of your
>>> usecases.
>
>> That's a patch I have little interest in myself.
>
> Fine. It goes down the drain then. Someone interested in such
> functionality can implement it.
>
>> I guess we're on opposite sides on the disk space usecase.
>
> Great. I enjoy working with people having different usecases. It's good
> for the code.
>
>> [second mail]
>> Don't bother to tell me why I'm totally wrong here, I just realized.
>
> Please tell me why. I am in all honesty interested. I don't think there
> are any clear-cut arguments, so I am interested in reading your
> thoughts.

My thoughts are that if you discover a bug in a function whose tracking 
is turned off, you can either momentarily turn it on again, or clear the 
cache of that single function (by calling .call(), or by entering the 
directory tree and wiping out the cache).

You need to do that anyway if a dependency changes.

However, "manual targetet clearing of the cache" is the kind of thing I 
find pretty boring and frustrating, and I would much prefer keeping a 
manual revision integer in a decorator and increment it.

Of course, I have to remember to increment it, but that's about exactly as 
hard as remembering to clear the cache of the function, so that's no argument 
on dont-track vs. @versioned, only an argument against both.

In summary, I'm still opposed to this one:
  - Version counters are no harder or less safe than turning tracking off
  - Version counters can be used for other purposes as well, such as 
letting multiple revisions tracked by git coexist

>
>>    a) Have @memory.cache imply @versioned (call it on argument if needed)
>>    b) Remove all the source code comparison-stuff from Memory, and simply
>> check if the function version is the same as the last time (since Memory
>> operates on an "only-newest" basis).
>
> That's an option. My instinct would be to prefer including function code
> in the hashing, as I see no asymmetry between function code and arguments,
> if we start not flushing automatically. Also, I don't think that this
> option should make the Memory code much longer, maybe even simpler.

To clarify, @versioned:

  a) By default, the behaviour is simply to hash the function source 
once (on definition) rather than many times (on call), as an 
optimization (useful for scaling down to in-memory caching, when speed 
starts to matter).

  The default behaviour *is* "include function code in hashing".

  b) There's an optional parameter that I expect will be rarely used, 
that serves as (IMO) a more elegant way of turning tracking off: Manage 
the revision counter manually.


>> Here's how @versioned works:
>
>>    - A version is a string (but str() is applied to argument)
>>    - If no version is given, use the base32-encoded hash of the source as
>> the version. (So if you decide to "freeze" the function after-the-fact,
>> there's a way to do it, without needing a feature for it.)
>
> Very useful feature. With a hash-based solution I guess the option would
> be to always include the function code in the hash, but to have a flag
> controlling whether the directory is flushed when the function code
> changes. That way you can freeze a posteriori.

  c) @versioned also allows me (not you) to keep multiple versions, at 
no extra complexity.

Thinking in terms of "flushing or not" is an inherently single-revision 
framework. So it simply seems more complicated and less elegant to me.

In your case, wouldn't it be about the same to think "only keep the last 
revision, which I'll manually track with a counter" as "do not flush on 
source code change"?

(last meaning "last computed", we never compare on revision strings of 
course)

>
>>    - Then the combination of the fully-qualified name and the version is
>> used for the digest/hash.
>
> One thing to keep in mind: the inode limit. When doing computations with
> a lot of stored results, I do sometimes hit the inode limit. This does
> surprise me now that I mention it, as the limit to the number of inodes
> per directory is fairly large on most file systems. However, I have
> indeed seen the error. I am not sure on which file system, though. This
> is why I prefer nested to flat in general.

Git solves this by making subdirs using the first two chars of the hash, 
which is a bit more generic (works also for >15k requests to the same 
function).
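
That is, something like (a layout sketch, not existing code):

import os

def entry_path(store, digest):
    # git-style sharding: the first two chars become a subdirectory
    return os.path.join(store, digest[:2], digest[2:])

# entry_path('/store', 'abcd1234...') -> '/store/ab/cd1234...'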

Hmm. I really don't like the nested layout for my purposes. If you want 
to join my bikeshedding, here are my thoughts on this:

  - First, I decided on base32 because git uses base16, so it visually 
distinguishes the hashes, and brings them down from 40 chars to 32.

  - I like the idea of identifying a "job" (function hash + args hash) 
by a single joblib (or, perhaps, gridlib) hash. This is more important 
to me, since I would like to, say, query the cluster for the status of "job ...".

print myfuture.hash() # and go to cluster and see how it goes

  - I don't need (or like) the current function-name-based structure. 
But I can see why you need it.

Anyway, I'd like for this to be pluggable through a subclass in joblib.

>>    - The function gets an extra version_info attribute (a dict)
>> containing the digest, version information, etc. (I may instead stick it
>> in a global WeakKeyDictionary if I get problems here, we'll see)
>
> Two questions/remarks:
>
>   1. What's your retrieval API? I want to keep a light API that does not
>      rely on decorators. Most of my code now looks like::
>
> 	foo = mem.cache(function)(args)

Currently you could do

versioned(function).version_info['version']

to get the base32-encoded hash of the source code. I'm thinking this 
must be changed to

get_function_version_info(versioned(function))['version']

or the equivalent

versioned(function)
get_function_version_info(function)['version']

with a WeakKeyDictionary implementation to avoid mutating functions.
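
The WeakKeyDictionary variant would look something like this sketch:

import weakref

_version_info = weakref.WeakKeyDictionary()

def register_version_info(func, info):
    _version_info[func] = info  # entry vanishes when func is collected

def get_function_version_info(func):
    return _version_info[func]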

>   2. Your proposal requires the user to manually increment the tag pointing
>      to the function. I fear this might lead to errors. One thing that I
>      try really hard to avoid with joblib is people getting false results
>      because of their use of joblib. On the other hand, your proposal
>      probably puts less stress on disk space.

And CPU time.

It only requires incrementing the tag *if* you want to "freeze" it 
(alternative to your proposed do_not_check_source flag). By all means, 
leave the tag out.

Look, my goal is to automate some of what I currently must do manually 
w.r.t. queuing cluster jobs. My current (and most others') process is: 
"OK, so I found this bug, that affects runs A, B, F, G, so I'll schedule 
a rerun, but definitely not C, D, E, so I'll leave those."

I just want a more convenient way of encoding (and tracking!) such 
choices. With the amount of CPU time involved, there's just no 
alternative to some manual tracking; it's "part of my job".

And it would be used in situations where the only real alternative is to 
not use joblib at all. You simply do not submit a week-long cluster job 
again because you added some logging to a function.

When I started on this, my first thought was: "My, I wish Python was 
functional, that would make this more elegant". Then I realized that, 
no: If I change to a computationally more efficient numerical 
scheme, the numerical errors will be different, but I still don't want 
to have jobs resubmitted. Only if I doubt the validity of the result do 
I want it resubmitted. (This assumes arrays are not the inputs to the 
functions, of course.)

>> This is exactly what my 5) is about. I'm not sure why you think we need
>> explicit tags at all though.
>
>> You simply use git on the source code. Checking out a specific revision
>> of the script in git, and running the script, fetches the old results.
>
> That's the difference between you and me :). My data is huge. I cannot
> keep intermediate results on the disk, let alone in version control. Thus
> tags would be useful to retrieve final results even without the
> intermediate steps. It's one way of recalling something that you have
> already done at some point in time.

Well... let's say you start with this code:

data = memory.cache(get_data)()
result = memory.cache(process_data)(data, parameters)
plots = memory.cache(make_plots)(result)

I agree that my scheme (keep everything + GC) would waste too much disk 
for you here. But, you could just change it to this (even after the 
fact, before moving on):

@memory.cache
def myjob(parameters):
     data = memory.cache(get_data)()
     result = memory.cache(process_data)(data, parameters)
     return memory.cache(make_plots)(result)

Then, call the function in combination with GC to preserve only the 
plots for posterity and wipe the intermediate results.

I think something like

myjob.call_and_discard_temporaries(parameters)

could work -- you basically discard any joblib result used while running 
the function upon return (but not the result itself). Or:

  i) call myjob, ii) gc mark, iii) call myjob again, iv) gc sweep.

Which deletes everything except the result of myjob. This could have a 
utility function as well.
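
Sketched out (gc_mark/gc_sweep are hypothetical names -- none of this 
exists yet in joblib or gridlib):

def call_and_discard_temporaries(memory, func, *args, **kwargs):
    # func is assumed to be a memory.cache'd function, like myjob above
    func(*args, **kwargs)             # i) populate the cache
    memory.gc_mark()                  # ii) mark: nothing touched yet
    result = func(*args, **kwargs)    # iii) top-level cache hit, so the
                                      #      intermediates stay untouched
    memory.gc_sweep()                 # iv) delete everything not touched
                                      #     since the mark
    return result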

OK, it definitely takes some getting used to. OTOH, tagging adds a lot 
of complexity too. And perhaps the above 4-step routine could get a 
convenience function, perhaps even a more targeted one 
(invoke_function_and_discard_its_temporaries).

But, yes, our usecases are very different.

>> [deferred]
>> So I guess I could see whatever is abstract enough to be useful with
>> joblib+multiprocessing to find its way, with the rest staying in gridlib.
>
> As I said, I am completely happy having code relying on IPython for
> grid-level computing in joblib. I think that it would be a great selling
> point. I do see that you want to implement your own job distribution
> framework. If you decide in the end to use IPython and you are able to
> have fairly simple code, there is room for such code in joblib.

I should have been more clear: I'm on a time-sharing system with 
hundreds of other people I don't know. I need to use the command-line 
command set up by administrators to allocate a time-slot (and say how 
much memory and CPU I'll use). The job may be executed at some point in 
the future.

With my job granularity, submitting jobs that are simply IPython 
workers, and synchronize them somehow with a master, would be a massive 
complication and a total waste of my time. Although I do know people who 
do something similar, with a different job pattern.

So note that unlike joblib, I need to pickle the input (input.pkl).
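
That is, roughly (the paths and helper names here are just 
illustration):

import os, pickle

def write_job(job_dir, func, args, kwargs):
    # the submitting process serializes the call...
    with open(os.path.join(job_dir, 'input.pkl'), 'wb') as f:
        pickle.dump((func, args, kwargs), f, protocol=2)

def run_job(job_dir):
    # ...and the spawned cluster job loads and runs it, long after
    # the submitting process has been Ctrl+C'd
    with open(os.path.join(job_dir, 'input.pkl'), 'rb') as f:
        func, args, kwargs = pickle.load(f)
    result = func(*args, **kwargs)
    with open(os.path.join(job_dir, 'output.pkl'), 'wb') as f:
        pickle.dump(result, f, protocol=2)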


>
>>>> Currently, one is also required to pass "deps=False", to not track
>>>> *called* functions.
>
>>> Hum, which joblib are we talking about? This is a feature I'd love to
>>> have, but I don't see where it is. Have you implemented it in a branch?
>
>> I was describing "gridlib". Sorry about the confusion.
>
> Sounds interesting, though :)
>> I basically just require code I write now to be forward-compatible with
>> turning on such tracking in the future, and make sure my design supports
>> it. It's simply "assert deps==False" in the @versioned decorator.
>
> I am not sure that I understand you here.

I'll return to this in the future, I think; there's too much on the 
plate in this thread currently.

>> I definitely don't implement job scheduling myself, and have no
>> ambitions in that direction. I think you answered this above, by wanting
>> HPC-specific stuff to be in a separate "gridlib" (joblib is not at the
>> umbrella extreme). (In my case, I submit a shell command over SSH; each
>> cluster would be different here.)
>
> OK. These features should live in a different codebase, I believe. Not
> that I might not be interested in them for my own personal consumption
> :). I do advise you to look carefully at prior art here. Doing these
> things well is challenging, and I believe that there is a lot of prior
> art.

The choice of job submission system is not up to me, but up to the 
cluster admins.

I'm not sure if job submission to our cluster has a lot of prior art, 
although I know of some people who do the same as me: 10 lines of Python 
code invoking some bash code and parsing the result.
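
It is essentially this shape (hedged: the submission command and its 
output format differ per cluster; 'qsub' below is just a PBS-style 
stand-in):

import re, subprocess

def submit(host, script_path):
    # run the admin-provided submission command remotely...
    p = subprocess.Popen(['ssh', host, 'qsub', script_path],
                         stdout=subprocess.PIPE)
    out = p.communicate()[0]
    # ...and parse the job id out of whatever it prints
    m = re.search(r'\d+', out)
    if m is None:
        raise RuntimeError('could not parse job id from %r' % out)
    return m.group(0)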

>>> I would prefer the method to be called 'deferred' or maybe even
>>> 'deferred_cache', rather than 'submit'. I could have almost the same
>>> API as the call that I suggest in proposal d, but returning a
>>> DeferredResult.
>
>> I don't control the API. Guido does.
>
>> http://docs.python.org/dev/library/concurrent.futures.html
>> http://pypi.python.org/pypi/futures
>
> Point taken. I believe that the API that we are discussing right now is
> about something different, and that we should not try to stick to the
> futures API, but you have a point, and I am willing to reconsider my
> stance.

The pro argument is: It is possible to write code like this:

def dosomething(executor):
     ...

that would work well both with a caching executor using joblib, and a 
more traditional non-caching executor.

So to facilitate this, I think we should implement a superset of this API.

If 'dosomething' relies on side-effects you do break things, but I still 
believe this is useful enough not to warrant breaking compatibility as 
an arbitrary decision.
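
For instance (using ThreadPoolExecutor from the futures backport linked 
above; a caching executor would drop in unchanged):

from concurrent.futures import ThreadPoolExecutor

def expensive_step(i):
    return i ** 2           # stand-in for real work

def dosomething(executor):
    # only submit() and result() are assumed, so any executor
    # implementing the futures API -- caching or not -- will do
    futures = [executor.submit(expensive_step, i) for i in range(4)]
    return [f.result() for f in futures]

print dosomething(ThreadPoolExecutor(max_workers=2))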


>>> **Proposal f** Implement proper logging methods on the Logger object, and
>>> use the logging Python core module to be able to log to a file.
>
>> gridlib already takes logger arguments everywhere. I'm itching to add
>> them to joblib, but there's only time for so much; perhaps it will happen.
>
> Hum, interesting. I understand the time argument very well. Could you
> please submit an enhancement bug report to the joblib bug tracker on
> github with a link to the corresponding code, if you think it could be
> backported. That way someone could pitch in.

I meant that it's the new stuff in gridlib, which has no analog in 
joblib, that I added loggers to.

DS

Re: [joblib] joblib.hpc: Python functions as cluster jobs

From:
Gael Varoquaux
Date:
2011-04-30 @ 12:48
On Sat, Apr 30, 2011 at 10:54:30AM +0200, Dag Sverre Seljebotn wrote:
> >>> **Proposal a.** First, I am perfectly happy to take a patch that adds an
> >>> option to the Memory object (and to its cache method) to turn tracking
> >>> of function source code off.

> >> That's a patch I have little interest in myself.

> However, "manual targeted clearing of the cache" is the kind of thing I 
> find pretty boring and frustrating, and I would much prefer keeping a 
> manual revision integer in a decorator and incrementing it.

Fine. Down the drain, as I said. No point losing time on something you
don't believe in.

> To clarify, @versioned

>   a) By default, the behaviour is simply to hash the function source 
> once (on definition) rather than many times (on call), as an 
> optimization (useful for scaling down to in-memory caching, when speed 
> starts to matter).

>   The default behaviour *is* "include function code in hashing".

>   b) There's an optional parameter that I expect will be rarely used, 
> that serves as (IMO) a more elegant way of turning tracking off: Manage 
> the revision counter manually.

I guess version counters are your solution to the problem I'd like to
solve with tags. It seems to me that the differences we are discussing
mostly lie in implementation, rather than concepts.

A notable difference, though, is that you are using the label (or tag) in
the hash, whereas I advocate using the function code. I can see value in
your suggestion; in particular, it really makes it easy to completely
disconnect the results from the function code.

On the other hand, it makes it hard to have an equivalence between
implicit tagging, i.e. not assigning labels and having the system avoid
recomputation, and explicit tagging: going from the function code to the
stored result without knowing the tag is hard. In addition, suppose I
have been computing without versioning/tagging and I want to assign a
version/tag: in your current implementation, my version label is going to
be a hash, which is quite unreadable.

An important issue here is the danger of making errors in what is being
called: I need to know what the code I am calling does, otherwise I'll
start getting false results. One of the reasons that prompted me to write
joblib was that people were doing this tracking manually, and getting it
wrong. This is why I advocate being able to easily do implicit but safe
versioning.

Here is my suggestion, which is somewhat a compromise between your
approach and mine:

  * Directory layout changes: have an extra directory between the
    function directory and the results hash, in which the
    sub-directories are named by a function code hash.

  * When an untagged/unversioned function is called, a hash is created
    from its source code, and this gives us the name of the directory
    the results are stored in, as currently.

  * When a tagged/versioned function is called, it checks for the
    presence of the tag in a correspondence table. If the tag is
    present, it points to a hash, and thus we know the directory in
    which the results must be fetched. If a tag is not present, the
    directory is created as above, and the tag is recorded in the
    correspondence table. Note that on systems allowing symlinks, this
    correspondence table should probably simply be symlinks.

  * A 'head' tag is kept, and always moved to the last
    untagged/unversioned function directory.

  * A garbage collection process removes dangling directories that are
    not referred to by any tags. The aggressiveness of this process tunes
    whether we keep all the history, a bit of it, or none (as in the
    current behavior).
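
A rough sketch of the tag lookup, with made-up layout names, to fix 
ideas (it assumes a 'tags' directory exists under each function 
directory):

import os

def code_dir_for(func_dir, code_hash, tag=None):
    if tag is not None:
        link = os.path.join(func_dir, 'tags', tag)
        if os.path.lexists(link):
            # known tag: follow the symlink to the results directory
            return os.path.realpath(link)
    # unknown tag, or untagged call: the code hash names the directory
    target = os.path.join(func_dir, code_hash)
    if not os.path.isdir(target):
        os.makedirs(target)
    if tag is not None:
        os.symlink(target, link)    # record the new tag
    return target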

> Hmm. I really don't like the nested layout for my purposes. If you want 
> to join my bikeshedding, here are my thoughts on this:

>   - First, I decided on base32, because git uses base16 so it visually 
> distinguishes the hashes + down from 40 chars to 32 chars.

Sure, no issue with that in general. What are the chances of collisions,
though?

>   - I like the idea of identifying a "job" (function hash + args hash) 
> by a single joblib (or, perhaps, gridlib) hash. This matters more to me,
> since I would like to, say, query the cluster for the status of "job ...".

> print myfuture.hash() # and go to cluster and see how it goes

That's a perfectly valid usecase, but it doesn't mean a flat layout, IMHO
:), and it doesn't mean that the identifier shouldn't contain the
function name rather than being a simple hash.

>   - I don't need (or like) the current function-name-based structure. 
> But I can see why you need it.

Yeah, it's pretty important for me, the reason being that I would like
joblib to be as little of a black box as possible.

> Anyway, I'd like for this to be pluggable through a subclass in joblib.

Absolutely. If the method layout needs to be changed for that, don't
hesitate to propose changes.

> >   2. Your proposal requires the user to manually increment the tag pointing to
> >      the function. I fear this might lead to errors. One thing that I am
> >      trying to avoid really hard with joblib is people getting false results
> >      because of their use of joblib. On the other hand, your proposal
> >      probably puts less stress on disk space.

> And CPU time.

> It only requires incrementing the tag *if* you want to "freeze" it 
> (an alternative to your proposed do_not_check_source flag).

Yes, but in return, it is obscure: if you have not specified an
explicit tag, you get the hash as a tag. I think that the additional
computation time should be really small in my proposal, especially if
you are on a system that has symlinks.

> I just want a more convenient way of encoding (and tracking!) such 
> choices. With the amount of CPU time involved, there's just no 
> alternative to some manual tracking; it's "part of my job".

Yes, I agree, but the solution should work for people who are not in this
extreme case.

> And it would be used in situations where the only real alternative is to 
> not use joblib at all. You simply do not submit a week-long cluster job 
> again because you added some logging to a function.

Yes, but it's also very dangerous. I don't want to let loose a framework
that makes it easy to get incorrect results. In the couple of years of
using joblib, I have sometimes spent way too long trying to debug
something in my code that was actually due to old versions of the code
being called by joblib. You are in an extreme case in which the cost of
recomputation is huge. For people in a more moderate situation, the cost
of losing a few days with incorrect results might be higher than the
cost of a few hours of recomputation.

> > That's the difference between you and me :). My data is huge. I cannot
> > keep intermediate results on the disk, let alone in version control. Thus
> > tags would be useful to retrieve final results even without the
> > intermediate steps. It's one way of recalling something that you have
> > already done at some point in time.

> data = memory.cache(get_data)()
> result = memory.cache(process_data)(data, parameters)
> plots = memory.cache(make_plots)(result)

> I agree that my scheme (keep everything + GC) would waste too much disk 
> for you here. But, you could just change it to this (even after the 
> fact, before moving on):

First, a big principle in joblib is that it should not demand adaptation
of the user's working pattern. Once again, you are in a situation in which
you seem to be able to justify a lot to avoid recomputation. Most people
are in a situation in which operator time is more expensive than CPU time.
And it's going to be more and more the case.

> @memory.cache
> def myjob(parameters):
>      data = memory.cache(get_data)()
>      result = memory.cache(process_data)(data, parameters)
>      return memory.cache(make_plots)(result)

> Then, call the function in combination with GC to preserve only the 
> plots for posterity and wipe the intermediate results.

> I think something like

> myjob.call_and_discard_temporaries(parameters)

That's nice, but you need a notion of sub-jobs. It's not there. It would
be interesting, but I wonder how complicated it would become, and how
often it would break. Also, GC is not there yet, and I suspect that it is
a fairly challenging problem.

> could work -- you basically discard any joblib result used while running 
> the function upon return (but not the result itself). Or:

>   i) call myjob, ii) gc mark, iii) call myjob again, iv) gc sweep.

I prefer tagging. I find that it is easier to understand for naive users
than explicit garbage collection.

> OTOH, tagging adds a lot of complexity too. 

Not sure why you think so. It seems very close to what you are
suggesting. It seems to me that the only difference is the need to keep a
table with a correspondence between tags and hashes.

> With my job granularity, submitting jobs that are simply IPython 
> workers, and synchronize them somehow with a master, would be a massive 
> complication and a total waste of my time. Although I do know people who 
> do something similar, with a different job pattern.

Fair enough. I know this usage pattern. I think that it is a bit off
topic for joblib, but it would be good if joblib objects could expose an
API that makes them easy to subclass in order to implement the
functionality that you need.

> So note that unlike joblib, I need to pickle the input (input.pkl).

I never do that, because my inputs are huge arrays.

> > Point taken. I believe that the API that we are discussing right now is
> > about something different, and that we should not try to stick to the
> > futures API, but you have a point, and I am willing to reconsider my
> > stance.

> The pro argument is: It is possible to write code like this:

> def dosomething(executor):
>      ...

> that would work well both with a caching executor using joblib, and a 
> more traditional non-caching executor.

> So to facilitate this, I think we should implement a superset of this API.

OK, I am fine with that, although it is not a high priority for me.

G

Re: [joblib] joblib.hpc: Python functions as cluster jobs

From:
Dag Sverre Seljebotn
Date:
2011-04-30 @ 13:43
I think we're getting close to understanding one another now. I think I 
have a fair understanding of what kind of changes to joblib you'd 
accept, and can design gridlib (thanks for the name) around that.

I do have motivation to integrate things well with joblib because I 
also need to use joblib in-process for sub-tasks, and I'd like for that 
to not be separate worlds, but just a matter of how you call the 
function/what executor is used.

So, hopefully, the next thing you hear from me is an isolated pull 
request of some sort.

OK, I just checked, and my response below is actually short this time:

On 04/30/2011 02:48 PM, Gael Varoquaux wrote:
> On Sat, Apr 30, 2011 at 10:54:30AM +0200, Dag Sverre Seljebotn wrote:
> Here is my suggestion, which is somewhat a compromise between your
> approach and mine:
>
>    * Directory layout changes: have an extra directory between the
>      function directory and the results hash, in which the
>      sub-directories are named by a function code hash.

I take "results hash" to mean "input hash".

>
>    * When an untagged/unversioned function is called, a hash is created
>      from its source code, and this gives us the name of the directory
>      the results are stored in, as currently.
>
>    * When a tagged/versioned function is called, it checks for the
>      presence of the tag in a correspondence table. If the tag is
>      present, it points to a hash, and thus we know the directory in
>      which the results must be fetched. If a tag is not present, the
>      directory is created as above, and the tag is recorded in the
>      correspondence table. Note that on systems allowing symlinks, this
>      correspondence table should probably simply be symlinks.
>
>    * A 'head' tag is kept, and always moved to the last
>      untagged/unversioned function directory.
>
>    * A garbage collection process removes dangling directories that are
>      not referred to by any tags. The aggressiveness of this process tunes
>      whether we keep all the history, a bit of it, or none (as in the
>      current behavior).

I like this idea and I'll keep it in mind.

>> Hmm. I really don't like the nested layout for my purposes. If you want
>> to join my bikeshedding, here are my thoughts on this:
>
>>    - First, I decided on base32, because git uses base16 so it visually
>> distinguishes the hashes + down from 40 chars to 32 chars.
>
> Sure, no issue with that in general. What are the chances of collisions,
> though?

It's still a sha1, so the chance of collisions is the same. 160 bits + 
birthday paradox.
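
To spell it out (plain Python, nothing gridlib-specific):

import base64, hashlib

digest = hashlib.sha1('function hash + args hash').digest()  # 20 bytes
print base64.b32encode(digest)   # 32 chars
print digest.encode('hex')       # the 40-char base16 form, for contrast
# same 160 bits either way, so identical collision odds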

>>    - I like the idea of identifying a "job" (function hash + args hash)
>> by a single joblib (or, perhaps, gridlib) hash. This matters more to me,
>> since I would like to, say, query the cluster for the status of "job ...".
>
>> print myfuture.hash() # and go to cluster and see how it goes
>
> That's a perfectly valid usecase, but it doesn't mean a flat layout, IMHO
> :), and it doesn't mean that the identifier shouldn't contain the
> function name rather than being a simple hash.
>
>>    - I don't need (or like) the current function-name-based structure.
>> But I can see why you need it.
>
> Yeah, it's pretty important for me, the reason being that I would like
> joblib to be as little of a black box as possible.

This may well only be a matter of taste in the end. I'll think about it.

> First, a big principle in joblib is that it should not demand adaptation
> of the user's working pattern. Once again, you are in a situation in which
> you seem to be able to justify a lot to avoid recomputation. Most people
> are in a situation in which operator time is more expensive than CPU time.
> And it's going to be more and more the case.

Going OT, and regarding your last sentence, this really depends on the 
field. For many it is indeed less and less the case. Titus Brown has a 
fun slide and commentary where you can see that the amount of data they 
have grows faster than Moore's law:

http://pycon.blip.tv/file/4881076/

Cosmology is similar. The problem sizes we're presented with grow much 
faster than the available CPU, just from new data coming in. Some years 
ago you could do it on a couple of machines -- not any longer.

>> could work -- you basically discard any joblib result used while running
>> the function upon return (but not the result itself). Or:
>
>>    i) call myjob, ii) gc mark, iii) call myjob again, iv) gc sweep.
>
> I prefer tagging. I find that it is easier to understand for naive users
> than explicit garbage collection.
>
>> OTOH, tagging adds a lot of complexity too.
>
> Not sure why you think so. It seems very close to what you are
> suggesting. It seems to me that the only difference is the need to keep a
> table with a correspondence between tags and hashes.

Just to be clear: This is not the same table as the one keeping function 
tags?

So basically, just a directory of symlinks to job paths, whose targets 
are kept across cache clears? I like that idea, although I'd want it in 
addition to GC (which I don't believe to be hard, but "show me the code" 
and all that -- and I can stick the GC in gridlib).
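
Something as simple as this, say (hypothetical names, reusing the flat 
job-hash layout I described):

import os

def pin(store_root, name, job_hash):
    # a plain symlink under pinned/; cache clears (or the GC) would
    # treat anything reachable from here as live
    os.symlink(os.path.join(store_root, job_hash),
               os.path.join(store_root, 'pinned', name))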

DS