I'm running a very basic Flask setup with Apache as my webserver. I
have a Python list that lives at the global level (aka at the same
nesting level as my function declarations). Looks like this:
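    from flask import Flask

    app = Flask(__name__)
    mylist = []  # module-level list, shared by all threads in this process

    @app.route('/append_to_mylist')
    def append_to_mylist():
        mylist.append(1)
        return str(mylist)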
So as http://www.mysite.com/append_to_mylist is invoked, it should
return a growing list of ones (unless the server is restarted, in
which case I am back to the beginning).
I'm trying to understand if I'm inadvertently screwing myself in the
case where my Flask setup is multi-threaded and mylist could somehow
be accessed by more than one thread, potentially corrupting it. Is
there some other construct I should be using to handle that situation?
I know I could stick mylist in memcached, but that seems to be
overkill for my needs.
I'm not even sure how to tell if my configuration is multi-threaded.
The only part that mentioned threads during configuration was in my
httpd.conf file:
    WSGIDaemonProcess jomit user=ubuntu group=ubuntu threads=5
This seems to indicate that Apache might start as many as 5 WSGI
threads, so I guess my code is actually multi-threaded?
Thanks and apologies for the n00b questions,
John
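On the question of how to tell whether a deployment is multi-threaded: the WSGI specification has the server put two flags into each request's environ, wsgi.multithread and wsgi.multiprocess. The sketch below is a plain WSGI app that reports them. The names describe_concurrency and app are made up for illustration. In a Flask view the same dictionary should be reachable as request.environ, and with threads=5 and no processes option I would expect multithread to be true and multiprocess to be false, but check on your own setup rather than take that on trust.

```python
def describe_concurrency(environ):
    """Summarise the concurrency flags a WSGI server puts in the request environ."""
    return {
        "multithread": bool(environ.get("wsgi.multithread")),
        "multiprocess": bool(environ.get("wsgi.multiprocess")),
    }


def app(environ, start_response):
    """Plain WSGI app (hypothetical, for illustration) that reports the flags as text."""
    flags = describe_concurrency(environ)
    text = "multithread=%(multithread)s multiprocess=%(multiprocess)s" % flags
    body = text.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```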
On 03/11/2011 09:16, John Fries wrote:
> I'm trying to understand if I'm inadvertently screwing myself in the
> case where my Flask setup is multi-threaded and mylist could somehow
> be accessed by more than one thread, potentially corrupting it. Is
> there some other construct I should be using to handle that situation?
> I know I could stick mylist in memcached, but that seems to be
> overkill for my needs.
>
> I'm not even sure how to tell if my configuration is multi-threaded.
> The only part that mentioned threads during configuration was in my
> httpd.conf file:
> WSGIDaemonProcess jomit user=ubuntu group=ubuntu threads=5

Hi,

In short: don’t do that. Use some kind of shared data store to keep data
across requests and clients.

This configuration does use multiple threads, but the Python GIL (global
interpreter lock) makes sure that e.g. a list will never get corrupted.
However, more complex code may fail in subtle ways. For example,
some_global += 1 is actually three operations: read, increment and write.
Each of these is atomic, but the thread may be interrupted in between. If
another thread changes the value between a read and the matching write,
you get incorrect results. So in general you should use locks to protect
the relevant code areas:
http://docs.python.org/library/threading.html#lock-objects

However, using global state is considered a bad idea anyway. This does not
seem to be the case with the configuration you pasted, but if your server
has multiple processes, they will each have their own version of the list.
Each process does not know about the others.

Regards,
--
Simon Sapin
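To make the read/increment/write race described above concrete, here is a minimal sketch of protecting a shared counter with threading.Lock. The names counter, counter_lock, increment and run are made up for illustration. It only covers threads inside one process, so it does nothing for the multiple-process case.

```python
import threading

counter = 0
counter_lock = threading.Lock()  # hypothetical name; guards every update to `counter`


def increment(times):
    """Add 1 to the shared counter `times` times, holding the lock for each update."""
    global counter
    for _ in range(times):
        # The read, the increment and the write all happen while the lock is held,
        # so no other thread can slip in between them.
        with counter_lock:
            counter += 1


def run(num_threads=5, times=10000):
    """Reset the counter, run `num_threads` workers to completion, return the total."""
    global counter
    counter = 0
    workers = [threading.Thread(target=increment, args=(times,))
               for _ in range(num_threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter
```

With the lock in place, run(5, 10000) should always come back as num_threads * times. Without it, the total can occasionally come up short.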
On Nov 3, 2011, at 9:06, Simon Sapin wrote:
> In short: don’t do that. Use some kind of shared data store to keep data
> across requests and clients.

I would recommend Redis (http://redis.io/, use the redis-py library at
https://github.com/andymccurdy/redis-py) for this. It's easy to set up,
and has really fast operations on lists, sets, counters, etc. etc.
Persistent too.

Thanks,
Matthew Frazier
http://leafstorm.us/
+1 for Redis. It has great list support and is surprisingly easy to begin
using.

On Thu, Nov 3, 2011 at 9:17 AM, Matthew Frazier <leafstormrush@gmail.com> wrote:
> I would recommend Redis (http://redis.io/, use the redis-py library at
> https://github.com/andymccurdy/redis-py) for this. It's easy to set up,
> and has really fast operations on lists, sets, counters, etc. etc.
> Persistent too.
>
> Thanks,
> Matthew Frazier
> http://leafstorm.us/
Redis is a great solution.

You can also use mongodb, as it has atomic operations.
http://www.mongodb.org/display/DOCS/Atomic+Operations

-Jonathan

On Thu, Nov 3, 2011 at 7:46 AM, Joe Esposito <espo58@gmail.com> wrote:
> +1 for Redis. It has great list support and is surprisingly easy to begin
> using.
I'm less concerned about inconsistency between processes/threads than
I am about the global list just getting flat-out corrupted (although
Simon says that the GIL will protect me from that).

I understand that redis or mongodb as an off-process atomic cache is a
natural solution for this problem. However, my concern is performance.
It seems that even an ideal atomic store is going to take at least
100ms round-trip, so it seems inefficient to cache data there without
first checking some smaller in-process cache. Does anyone see a flaw
in my reasoning in the case where eventual consistency between
processes is acceptable? It seems surprising to me that this is not a
more common pattern.

On Thu, Nov 3, 2011 at 1:10 PM, Cheng-Han Lee <lee.chenghan@gmail.com> wrote:
> Redis is a great solution.
>
> You can also use mongodb, as it has atomic operations.
> http://www.mongodb.org/display/DOCS/Atomic+Operations
>
> -Jonathan
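The "check a smaller in-process cache first" idea could be sketched roughly as below. Everything here is hypothetical: ReadThroughCache is an illustrative name, and fetch stands in for whatever call reaches the external store (Redis, memcached, ...). Each process would hold its own copy, so entries can be stale for up to ttl seconds, which is the eventual-consistency trade-off mentioned above.

```python
import threading
import time


class ReadThroughCache:
    """Hypothetical in-process cache in front of a slower store.

    Entries expire `ttl` seconds after they were fetched.
    """

    def __init__(self, fetch, ttl=5.0, clock=time.monotonic):
        self._fetch = fetch      # callable(key) -> value; stands in for the external store
        self._ttl = ttl
        self._clock = clock      # injectable so the expiry logic can be tested
        self._entries = {}       # key -> (expiry time, value)
        self._lock = threading.Lock()

    def get(self, key):
        now = self._clock()
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None and entry[0] > now:
                return entry[1]
        # Miss or stale entry: fetch outside the lock so a slow lookup does not
        # block other threads. Two threads may both fetch the same key; the
        # later write simply wins.
        value = self._fetch(key)
        with self._lock:
            self._entries[key] = (now + self._ttl, value)
        return value
```

Whether this is worth the extra moving parts depends on how slow the external store really is, which is something to measure rather than assume.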
On 04 Nov 2011, at 20:21, John Fries wrote:
> It seems that even an ideal atomic store is going to take at least
> 100ms round-trip,

Where does that number come from?

> so it seems inefficient to cache data there without first checking
> some smaller in-process cache.

I'd guess (no scientific reasoning here, you've been warned) that
interacting with a well-written daemon on the same system will be faster
than waiting on a poorly written atomic store in the same process.

Anyway, just in case you haven't noticed, Apache and mod_wsgi themselves
use multiple processes and pass requests around between them, so it
can't be that slow. Your HTTP request will probably come in to an Apache
process, which passes the request to another Apache process (the prefork
MPM child), which passes the request to another process (the mod_wsgi
DaemonProcess you mentioned in your first post).

> I'm not even sure how to tell if my configuration is multi-threaded.
> The only part that mentioned threads during configuration was in my
> httpd.conf file:
> WSGIDaemonProcess jomit user=ubuntu group=ubuntu threads=5

Your code is indeed multithreaded, as you noted. And you have no choice
about it, because if you switch to the multiprocess model (setting the
'processes' parameter of the WSGIDaemonProcess directive), your code would
screw up, as different processes would have different sets of global data
(i.e. no shared lists or anything between them), so different HTTP
requests would be answered by processes of your app with different
internal state.

Your only choices are:
1- use an external store as already suggested
2- strictly stick to the single-process mod_wsgi mode and implement a good
   multithreaded store (maybe look for an existing implementation, there
   must be something somewhere).

--
Luca Lesinigo
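For the second choice above, a very small "multithreaded store" might look like the sketch below. ThreadSafeList is a hypothetical name, not an existing library, and it only helps while the app stays in a single process. Every access goes through one lock, and readers get a copy rather than the live list.

```python
import threading


class ThreadSafeList:
    """Hypothetical wrapper: a list whose every access goes through one lock."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def append(self, item):
        with self._lock:
            self._items.append(item)

    def append_and_snapshot(self, item):
        """Append, then copy the list under the same lock, so the pair is atomic."""
        with self._lock:
            self._items.append(item)
            return list(self._items)

    def snapshot(self):
        """Return a copy, so callers never iterate over a list that is changing."""
        with self._lock:
            return list(self._items)
```

In the original view function, the append followed by str(mylist) would become a single append_and_snapshot call, so the response reflects the list exactly as it was right after this request's append.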
Hmm, your feedback is sinking in now.
http://www.quora.com/What-are-the-numbers-that-every-computer-engineer-should-know-according-to-Jeff-Dean

I'm used to dealing with key-value stores that are across datacenters, so
mentally I budget about 100 to 150ms for these. But looking at it more
closely, I realize that most of this is round-trip time, and within a
datacenter we could probably get something close to 1ms latency (half a ms
for round-trip time between machines, plus however much time redis needs
to do its thing).

Is 1ms latency in line with what people are seeing on their redis or
memcached installations? If so, I don't think I will miss having an
in-process cache, if it ends up causing me more concurrency headaches.

On Fri, Nov 4, 2011 at 3:23 PM, Luca Lesinigo <luca@lesinigo.it> wrote:
> On 04 Nov 2011, at 20:21, John Fries wrote:
>> It seems that even an ideal atomic store is going to take at least
>> 100ms round-trip,
> Where does that number come from?
If you are set on using a native Python list, you need to make sure the
critical sections of your code are wrapped in synchronization primitives
(e.g. semaphores or mutexes), so those sections are guaranteed to execute
atomically.

But you'll run into a whole new set of issues when writing synchronized
code. You need to account for issues such as deadlocks, thread starvation,
and such. Not to mention, debugging multi-threaded code can be a painful
experience.

Hope this helps.

On Fri, Nov 4, 2011 at 12:21 PM, John Fries <john.a.fries@gmail.com> wrote:
> I'm less concerned about inconsistency between processes/threads than
> I am in the global list just getting flat-out corrupted (although
> Simon says that the GIL will protect me from that).
On 04/11/2011 20:21, John Fries wrote:
> However, my concern is performance.
Measure, measure, measure.
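One way to do that measuring, as a rough sketch: time the operation many times and look at the median and the worst case. measure_latency_ms is a made-up helper name. The in-process list append is only a baseline, so swap the lambda for a call to whichever store you are evaluating to compare real numbers on your own machines.

```python
import statistics
import time


def measure_latency_ms(operation, repeats=1000):
    """Call `operation` `repeats` times; return (median, worst) latency in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), max(samples)


if __name__ == "__main__":
    mylist = []
    # Baseline: a plain in-process append. Replace the lambda with a call to the
    # external store under test to compare round-trip times.
    median_ms, worst_ms = measure_latency_ms(lambda: mylist.append(1))
    print("in-process append: median %.4f ms, worst %.4f ms" % (median_ms, worst_ms))
```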