librelist archives


Memory usage


From:
Sebastien Campion
Date:
2012-01-18 @ 12:25
Hi,
we have observed excessive memory usage when our job is parallelized.
In a single thread, our job uses about 1 GB; using 2 threads
we use 2 or 3 GB.
I would like to know how the parallel job is done; the problem may be
in the multiprocessing usage (?)

If you have an idea or a suggestion, I'll take it.

Sebastien

PS: Has anybody tried to replace the parallel function with the IPython
parallel mechanism (ipcontroller and ipengine)?


-- 
Sébastien Campion

Re: [joblib] Memory usage

From:
Olivier Grisel
Date:
2012-01-18 @ 13:51
2012/1/18 Sebastien Campion <sebastien.campion@inria.fr>:
> Hi,
> we have observed excessive memory usage when our job is parallelized.
> In a single thread, our job uses about 1 GB; using 2 threads
> we use 2 or 3 GB.
> I would like to know how the parallel job is done; the problem may be
> in the multiprocessing usage (?)

Have a look at the pre_dispatch option in:

   http://packages.python.org/joblib/parallel.html

Maybe try "2*n_jobs" for instance.
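To make the suggestion concrete, here is a minimal sketch of how pre_dispatch is passed to joblib.Parallel (the `square` task is just a hypothetical stand-in for a memory-hungry job):

```python
from joblib import Parallel, delayed

def square(x):
    # stand-in for a task whose inputs are large in memory
    return x * x

# pre_dispatch bounds how many tasks (and thus input objects) are
# queued up ahead of the workers, instead of materializing them all
results = Parallel(n_jobs=2, pre_dispatch="2*n_jobs")(
    delayed(square)(i) for i in range(10)
)
print(results)
```

With a generator of jobs, only about 2*n_jobs inputs are alive at any time, which is what caps the memory overhead.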

> If you have an idea or a suggestion, I'll take it.
>
> Sebastien
>
> PS: Has anybody tried to replace the parallel function with the IPython
> parallel mechanism (ipcontroller and ipengine)?

AFAIK nobody did, although I think Gael had in mind to make it possible
to use IPython as an alternative backend (instead of multiprocessing).
In the meantime you can use the IPython.parallel API directly. The
IPython.parallel.Client + LoadBalancedView API is quite high-level and
quite comparable to joblib.Parallel.
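For reference, a rough sketch of that IPython.parallel equivalent. This assumes a cluster was started beforehand with ipcontroller and some ipengine processes (or `ipcluster start`), so it is not runnable on its own:

```python
# Assumes a running IPython cluster (ipcontroller + ipengines).
from IPython.parallel import Client

def square(x):
    return x * x

rc = Client()                   # connect to the running controller
view = rc.load_balanced_view()  # dynamic load balancing across engines

# roughly the counterpart of Parallel(n_jobs=-1)(delayed(square)(i) ...)
results = view.map_sync(square, range(10))
print(results)
```

The load-balanced view dispatches each task to whichever engine is free, which is close in spirit to what joblib.Parallel does with a process pool.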

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Re: [joblib] Memory usage

From:
Sebastien Campion
Date:
2012-01-18 @ 14:23
On 01/18/12 14:51, Olivier Grisel wrote:
> 2012/1/18 Sebastien Campion <sebastien.campion@inria.fr>:
>> Hi,
>> we have observed excessive memory usage when our job is parallelized.
>> In a single thread, our job uses about 1 GB; using 2 threads
>> we use 2 or 3 GB.
>> I would like to know how the parallel job is done; the problem may be
>> in the multiprocessing usage (?)
> 
> Have a look at the pre_dispatch option in:
> 
>    http://packages.python.org/joblib/parallel.html
> 
> Maybe try "2*n_jobs" for instance.
Thank you, I missed it during my first reading.

> 
>> If you have an idea or a suggestion, I'll take it.
>>
>> Sebastien
>>
>> PS: Has anybody tried to replace the parallel function with the IPython
>> parallel mechanism (ipcontroller and ipengine)?
> 
> AFAIK nobody did, although I think Gael had in mind to make it possible
> to use IPython as an alternative backend (instead of multiprocessing).
> In the meantime you can use the IPython.parallel API directly. The
> IPython.parallel.Client + LoadBalancedView API is quite high-level and
> quite comparable to joblib.Parallel.


-- 
Sébastien Campion
Research Engineer
SED    - Service of Experimental plateforms and Development
TEXMEX - Research Team

INRIA Rennes - Campus de Beaulieu
35042 Rennes Cedex - France
http://www.irisa.fr/
phone  : +33 2 99 84 75 53 - Fax. +33 2 99 84 71 71

mailto/jabber   | sebastien.campion@inria.fr
GPG fingerprint | 6395 9C87 B5E5 23CC 90DC F695 7C39 6C33 6044 C34A
Web Page        | http://www.irisa.fr/prive/Sebastien.Campion/

Re: [joblib] Memory usage

From:
Gael Varoquaux
Date:
2012-01-18 @ 15:48
On Wed, Jan 18, 2012 at 03:23:04PM +0100, Sebastien Campion wrote:
> > Have a look at the pre_dispatch option in:

> >    http://packages.python.org/joblib/parallel.html

> > Maybe try "2*n_jobs" for instance.
> Thank you, I missed it during my first reading.

Tell us if it solves your problem.

If anybody has an idea of how to guess (in a portable way) the size of
the input objects, so as to pick a reasonable default for this
parameter, that would be helpful.
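One possible starting point, sketched here as an assumption rather than anything joblib provides: numpy arrays expose their buffer size via `.nbytes`, and `sys.getsizeof` gives a (shallow) lower bound for everything else. A hypothetical `rough_size` helper could combine the two:

```python
import sys
import numpy as np

def rough_size(obj):
    """Rough, portable estimate of an object's in-memory size in bytes.

    numpy arrays report their data buffer via .nbytes; containers are
    walked recursively; anything else falls back to sys.getsizeof,
    which ignores referenced objects and so only gives a lower bound.
    """
    if isinstance(obj, np.ndarray):
        return obj.nbytes
    if isinstance(obj, (list, tuple)):
        return sys.getsizeof(obj) + sum(rough_size(x) for x in obj)
    return sys.getsizeof(obj)

# 1000 float64 elements -> 8000 bytes of array data
print(rough_size(np.zeros(1000)))
```

Such an estimate on the first dispatched arguments could then drive a default pre_dispatch, at the cost of being wrong for objects that share buffers or hold references getsizeof cannot see.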

Gael

PS: Thanks Olivier for answering.

Re: [joblib] Memory usage

From:
Olivier Grisel
Date:
2012-01-18 @ 17:01
2012/1/18 Gael Varoquaux <gael.varoquaux@normalesup.org>:
> On Wed, Jan 18, 2012 at 03:23:04PM +0100, Sebastien Campion wrote:
>> > Have a look at the pre_dispatch option in:
>
>> >    http://packages.python.org/joblib/parallel.html
>
>> > Maybe try "2*n_jobs" for instance.
>> Thank you, I missed it during my first reading.
>
> Tell us if it solves your problem.
>
> If anybody has an idea of how to guess (in a portable way) the size of
> the input objects, so as to pick a reasonable default for this
> parameter, that would be helpful.

I am not familiar with the joblib code base, so it's hard for me to
understand what "pre_dispatching" means in terms of actual memory
allocation in which process at which time.

Also, another trick for reducing memory usage that should be
highlighted in the doc:

>>> import os
>>> import numpy as np
>>> import joblib

# allocate 10 arrays of 100MB each (total 1GB)
>>> arrays = [np.zeros(int(1e8 / 8)) for i in range(10)]

# dump them using joblib to a temporary folder
>>> os.makedirs('/tmp/joblib')
>>> joblib.dump(arrays, '/tmp/joblib/arrays')

# reload them in read-only memmap mode: the previous arrays will be
# garbage collected
>>> arrays = joblib.load('/tmp/joblib/arrays', mmap_mode='r')

Now you can use the joblib.Parallel tool to fork python processes that
will be much more lightweight and that will fetch only the data for
their own jobs.

Note that instead of calling dump and load explicitly you can use the
joblib.Memory(cachedir='/tmp', mmap_mode='r').cache memoizer.
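Putting the whole trick together, here is a runnable sketch with small arrays and a temporary directory instead of the 1 GB /tmp/joblib example above (the `array_sum` worker is a hypothetical task for illustration):

```python
import os
import tempfile

import numpy as np
import joblib
from joblib import Parallel, delayed

def array_sum(arrays, i):
    # each worker reads only the memmapped pages of the array it uses
    return float(arrays[i].sum())

# dump the arrays once, then rebind the name to read-only memmaps so the
# original in-memory copies can be garbage collected
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "arrays")
arrays = [np.ones(1000) for _ in range(4)]
joblib.dump(arrays, path)
arrays = joblib.load(path, mmap_mode="r")

# the worker processes map the same file instead of copying the data
sums = Parallel(n_jobs=2)(delayed(array_sum)(arrays, i) for i in range(4))
print(sums)
```

With real 100 MB arrays the parent's resident memory stays close to zero after the reload, since the data lives in the page cache backing the dump file.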

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel