Job batch mechanism for Parallel

From: Christian Jauvin
Date: 2012-11-04 @ 18:42

Hi,

Disclaimer: I am new to both joblib and sklearn, so please bear with me.

As I was studying the implementation of the sklearn CountVectorizer,
my attention was drawn to a comment in the code saying that its main
loop could not be efficiently parallelized with joblib:


https://github.com/scikit-learn/scikit-learn/blob/33a5911b55617e919fcfabc283c24784deaed686/sklearn/feature_extraction/text.py#L469

I don't know much about the internals of multiprocessing, but I
imagined that there might be a tradeoff between the size of individual
jobs and the number of times that a process in the pool is dispatched
a new job. For instance, if the vectorizer is passed a very long list
of very short documents, then it would seem possible that the
dispatching overhead makes it very suboptimal. Perhaps a smaller
number of longer jobs would work better in that case?
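
To make that concrete, the kind of pattern I have in mind looks
roughly like this (just a sketch with a stand-in analyzer and made-up
sizes, not the actual CountVectorizer code):

from joblib import Parallel, delayed

def analyze(doc):
    # stand-in for the per-document analyzer; each call is very cheap
    return doc.lower().split()

docs = ["a very short document"] * 100000

# one dispatched job per document: with jobs this cheap, the
# dispatching/IPC overhead can easily dominate the actual work
results = Parallel(n_jobs=2)(delayed(analyze)(d) for d in docs)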

To explore this hypothesis, I had the simple idea of chaining jobs
together into batches (each executed by a single process) and
dispatching those, instead of individual jobs. The user can then
experiment with different batch sizes, trying to find the sweet spot.
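
Done by hand, the batching would look something like this
(analyze_batch and batch_size are names I'm making up here; my branch
tries to do the equivalent inside Parallel itself):

from joblib import Parallel, delayed

def analyze(doc):
    return doc.lower().split()

def analyze_batch(batch):
    # a single dispatched job now processes a whole chunk of documents
    return [analyze(doc) for doc in batch]

docs = ["a very short document"] * 100000
batch_size = 1000  # the knob to tune for the sweet spot

batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
nested = Parallel(n_jobs=2)(delayed(analyze_batch)(b) for b in batches)
results = [r for batch in nested for r in batch]  # flatten back to per-document results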

Here are the results from simple experiments with this idea, using my
old dual core Thinkpad:

https://gist.github.com/4012764

Admittedly, there are a lot of unknowns with this idea: I don't know
whether it really makes sense, whether the issue has already been
studied (or perhaps even solved by another mechanism), or whether my
implementation is correct:

https://github.com/cjauvin/joblib/compare/parallel_job_batch

and finally, I don't know whether it would play well with the rest of
joblib, i.e. whether it would introduce problems that I haven't
thought of.

Christian

Re: [joblib] Job batch mechanism for Parallel

From: Dag Sverre Seljebotn
Date: 2012-11-04 @ 20:00

On 11/04/2012 07:42 PM, Christian Jauvin wrote:
> Hi,
>
> Disclaimer: I am new to both joblib and sklearn, so please bear with me.
>
> As I was studying the implementation of the sklearn CountVectorizer,
> my attention was drawn to a comment in the code saying that its main
> loop could not be efficiently parallelized with joblib:
>
> https://github.com/scikit-learn/scikit-learn/blob/33a5911b55617e919fcfabc283c24784deaed686/sklearn/feature_extraction/text.py#L469
>
> I don't know much about the internals of multiprocessing, but I
> imagined that there might be a tradeoff between the size of individual
> jobs and the number of times that a process in the pool is dispatched
> a new job. For instance, if the vectorizer is passed a very long list
> of very short documents, then it would seem possible that the
> dispatching overhead makes it very suboptimal. Perhaps a smaller
> number of longer jobs would work better in that case?
>
> To explore this hypothesis, I had the simple idea of chaining jobs
> together into batches (each executed by a single process) and
> dispatching those, instead of individual jobs. The user can then
> experiment with different batch sizes, trying to find the sweet spot.
>
> Here are the results from simple experiments with this idea, using my
> old dual core Thinkpad:
>
> https://gist.github.com/4012764
>
> Admittedly, there are a lot of unknowns with this idea: I don't know
> whether it really makes sense, whether the issue has already been
> studied (or perhaps even solved by another mechanism), or whether my
> implementation is correct:
>
> https://github.com/cjauvin/joblib/compare/parallel_job_batch
>
> and finally, I don't know whether it would play well with the rest of
> joblib, i.e. whether it would introduce problems that I haven't
> thought of.

It seems like this really only eliminates the overhead of the 
multiprocessing inter-process message passing; if that is indeed a 
bottleneck, I'd think that using pyzmq as the transport instead of 
multiprocessing should give the same gain without having to do anything 
(manually) about batching messages. (Of course, one may not want the 
pyzmq dependency, I'm just chiming in.)
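
(Roughly the classic PUSH/PULL pattern, i.e. something like the
sketch below; the address and the worker loop are made up, and this
is of course not how joblib is wired today.)

import zmq

# dispatcher side: stream tasks to workers over a PUSH socket
ctx = zmq.Context()
sender = ctx.socket(zmq.PUSH)
sender.bind("tcp://127.0.0.1:5557")   # address chosen arbitrarily for the sketch

for doc in ["a very short document"] * 1000:
    sender.send_pyobj(doc)            # pickled and queued; no per-job dispatch round trip

# worker side (a separate process): pull tasks and do the work
# ctx = zmq.Context()
# receiver = ctx.socket(zmq.PULL)
# receiver.connect("tcp://127.0.0.1:5557")
# while True:
#     doc = receiver.recv_pyobj()
#     tokens = doc.lower().split()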

Dag Sverre