librelist archives

Accessing a shared cache from Parallel

From:
Miroslav Batchkarov
Date:
2013-07-30 @ 10:51
Hi,

are there any plans for joblib to support concurrent access to a shared 
cache from Parallel? The example below shows a race condition:

from joblib import Memory, delayed, Parallel

def f(param):
    # Printed only when the value is actually recomputed (cache miss)
    print('recalculating for %d' % param)
    return param


memory = Memory(cachedir='.', verbose=0)
cf = memory.cache(f)

# 100 repetitions of 0..9, so each value is requested many times
params = []
for i in range(100):
    params.extend(range(10))

for i in range(10):
    res = Parallel(n_jobs=-1)(delayed(cf)(x) for x in params)
    print(res)

I have seen some talk about using a database (SQLite/MongoDB) for
storage. I think this would eliminate the race condition above. Has any
work on database integration been done?

Best,
Miroslav

---
Miroslav Batchkarov
PhD Student,
Text Analysis Group,
Department of Informatics,
University of Sussex


Re: [joblib] Accessing a shared cache from Parallel

From:
Olivier Grisel
Date:
2013-07-30 @ 11:11
I think this can be considered a bug. Do you get a traceback or an
invalid output?

When I run this snippet it seems that the output is consistent. It's
just being built concurrently 10 times on my mac (with 2 cores and 4
hyperthreads).

Please report it on the issue tracker:
https://github.com/joblib/joblib/issues . Please also specify which
operating system you are using and the filesystem name if you use a
non-default FS partition. Also make the race condition more explicit
by adding an assertion on the expected vs computed values if you get
no exception with traceback but an invalid output.
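
A minimal sketch of what such an assertion could look like, under some
assumptions not taken from the original report: a recent joblib release
where the cache directory is passed as `location` (the snippet in the
first message uses `cachedir`), a temporary directory instead of '.',
and a hypothetical helper named run_check. Since f returns its argument
unchanged, the expected output is simply the list of parameters.

```python
import tempfile

from joblib import Memory, Parallel, delayed


def f(param):
    # Stand-in for an expensive computation; the result is trivial to check.
    return param


def run_check(n_rounds=3, n_jobs=2):
    # Assumption: recent joblib, where Memory takes `location`
    # (the 2013 snippet passes `cachedir` instead).
    with tempfile.TemporaryDirectory() as cache_dir:
        memory = Memory(location=cache_dir, verbose=0)
        cf = memory.cache(f)
        params = list(range(10)) * 10
        for _ in range(n_rounds):
            res = Parallel(n_jobs=n_jobs)(delayed(cf)(x) for x in params)
            # A corrupted or half-written cache entry would break this equality.
            assert res == params, "cached results differ from expected values"
    return True


if __name__ == "__main__":
    run_check()
```

If the assertion never fires, the workers are only duplicating work
(several of them recomputing the same entry), which is wasteful but not
an invalid output.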

I am not sure that sqlite would easily deal correctly with write
concurrency with multiple python processes. There is no database
backend integration in joblib. joblib is unlikely to depend on a
specific database client, but if you plan to work on an integration of
your own, it would be interesting to report whether the current public
API is enough (for instance by subclassing some joblib classes like
Memory) or whether you would need to expose some internals to make it
easier / more efficient.

Re: [joblib] Accessing a shared cache from Parallel

From:
Gael Varoquaux
Date:
2013-07-30 @ 11:17
> are there any plans for joblib to support concurrent access to a shared
> cache from Parallel?

It should be already possible.

> The example below shows a race condition:

What makes you think that it displays a race condition? I don't see one.
If you are seeing one, could you give us more details, please?

> I have seen some talk about using a database (SQLite/MongoDB) for
> storage. I think this would eliminate the race condition above.

We do not want to depend on such storage. SQLite wouldn't work. The first
version of joblib was implemented using SQLite, but SQLite has too many
locks that grind it to a stop in parallel use. MongoDB is a big
dependency and would require a setup phase as well as open ports,
which is not the philosophy of joblib.