librelist archives

How much layer metadata should we replicate in the Django models?

From:
Sebastian Benthall
Date:
2010-06-14 @ 15:59
I wanted to bring up an issue raised in this ticket: how much of each
layer's metadata should we replicate in the Django models?

http://projects.opengeo.org/CAPRA/ticket/498

My understanding is that there are a lot of places in the web
application where Django needs the metadata in hand, but that it may
not be scalable to keep replicating those fields as needed.

Is there some technical way we can solve this problem more cleanly?

-- 
Sebastian Benthall
OpenGeo - http://opengeo.org

Re: [geonode] How much layer metadata should we replicate in the Django models?

From:
Ariel Nunez
Date:
2010-06-14 @ 16:06
Short story:
http://github.com/sebleier/django-redis-cache

Long story:
IMHO, the best idea is to just cache the metadata in RAM. Unlike
memcached, Redis also writes a backup to disk periodically and can
keep the data across restarts. We would then either write key/value
pairs or just store a GeoJSON dict for a given layer.

Here is some code I wrote a while ago that uses redis in a very simple
yet effective way to cache an expensive operation:

http://github.com/ingenieroariel/dondevoto/blob/master/server.py#L16
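
A minimal sketch of the caching idea described above, not the code behind the link. It assumes the cache exposes only `get` and `set`; `DictCache` is a hypothetical in-memory stand-in for a Redis client, and `fetch_layer_metadata` is a placeholder for whatever slow lookup is being cached.

```python
import json


class DictCache:
    """In-memory stand-in for a Redis client (hypothetical; get/set only)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


def fetch_layer_metadata(layer_name):
    """Placeholder for the expensive lookup; counts how often it runs."""
    fetch_layer_metadata.calls += 1
    return {"name": layer_name, "bbox": [-180, -90, 180, 90]}


fetch_layer_metadata.calls = 0


def cached_layer_metadata(cache, layer_name):
    """Return a layer's metadata, taking the slow path only on a cache miss."""
    key = "layer-metadata:" + layer_name
    raw = cache.get(key)
    if raw is not None:
        return json.loads(raw)
    metadata = fetch_layer_metadata(layer_name)
    # Store one JSON document per layer, as suggested above.
    cache.set(key, json.dumps(metadata))
    return metadata
```

Serializing to JSON keeps the stored value a plain string, so the same function should work with any backend that stores strings under string keys.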

Re: [geonode] How much layer metadata should we replicate in the Django models?

From:
David Winslow
Date:
2010-06-14 @ 18:45
I don't think we are yet at the point where we need to start worrying
about RAM caches; there is still a lot of room for using the database in
a smarter (or rather, less braindead) way. In the specific example of
displaying GeoNetwork search results, we are hitting the database
separately for each search result to grab its metadata. Luke has
implemented a workaround that cleverly sidesteps this issue by holding
off on loading data from the database until the user expands a JS
widget, but really we should be able to work out a way to batch up those
requests if we think about it for a minute.
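
As a rough illustration of batching those requests, the sketch below contrasts one query per search result with a single `IN` query for the whole page. It uses the standard-library `sqlite3` module rather than Django, and the `layer` table and its columns are assumptions made up for the example.

```python
import sqlite3


def make_db():
    """Build a throwaway in-memory table with hypothetical columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE layer (uuid TEXT PRIMARY KEY, title TEXT, owner TEXT)")
    conn.executemany(
        "INSERT INTO layer VALUES (?, ?, ?)",
        [("a1", "Roads", "jorge"), ("b2", "Rivers", "jorge"), ("c3", "Parcels", "ana")],
    )
    return conn


def metadata_one_by_one(conn, uuids):
    """One query per search result (the pattern described above)."""
    rows = {}
    for uuid in uuids:
        row = conn.execute(
            "SELECT uuid, title, owner FROM layer WHERE uuid = ?", (uuid,)
        ).fetchone()
        if row is not None:
            rows[row[0]] = {"title": row[1], "owner": row[2]}
    return rows


def metadata_batched(conn, uuids):
    """A single query for the whole page of search results."""
    uuids = list(uuids)
    if not uuids:
        return {}
    placeholders = ", ".join("?" for _ in uuids)
    query = "SELECT uuid, title, owner FROM layer WHERE uuid IN (%s)" % placeholders
    return {u: {"title": t, "owner": o} for u, t, o in conn.execute(query, uuids)}
```

With the Django ORM the batched form would presumably be a filter using the `__in` lookup on whatever field identifies the layer; the field name `uuid` here is a guess.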

Conceptually, I'm not sure we even need to hit Django's DB for this at
all: all the metadata needed to display search results could probably
live in GeoNetwork (and the more metadata that lives in GeoNetwork, the
better, since GeoNetwork has its own search interface and can be
federated with other GeoNetworks). However, mirroring this stuff in
Django is going to be fairly important if we want to do the kinds of
things we've been talking up for GeoNode: having user profiles
influence layer metadata, etc.

I suppose one formulation of the problem is in a use case:

    Jorge has uploaded several dozen layers to GeoNode.  Since he has
    filled out the profile for his GeoNode account, he has been able to
    avoid a lot of repetitive work filling out the descriptions for each
    of these layers.  Now, however, he's been promoted and needs to
    change his title from Data Wrangler to Poobah of Informatology ...
    on 200 layers.  GeoNode to the rescue!  He simply edits his profile
    and GeoNode updates all the metadata documents that reference him as
    provider or metadata maintainer with current contact information.


So, we need some sort of data architecture that can
* figure out which layers need updating after a user profile changes
* update just the fields corresponding to that user profile (actually, 
GN is basically storing the metadata documents as blobs so we will have 
to overwrite everything... but we need to make sure that we don't 
clobber the fields that aren't being modified)

One possible implementation would be to have a more relational model in 
GeoNode and use the typical "WHERE owner.uid = updated_profile.uid" kind 
of query to figure out what documents to update, and then just generate 
entire new metadata documents to clobber the pre-existing ones.  To 
preserve the fields that aren't coming from GeoNetwork, we'd probably 
want to store everything in the layer's Django representation.
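
A small sketch of that approach, under loud assumptions: plain dicts stand in for the Django tables, every field name is invented, and the XML element names are made up rather than taken from any metadata standard. It shows the two steps only: select the layers owned by the changed profile, then regenerate each document whole from stored fields so that fields unrelated to the profile come out unchanged.

```python
from xml.sax.saxutils import escape

# Hypothetical stand-ins for the Django tables; all field names are illustrative.
PROFILES = {
    7: {"name": "Jorge", "position": "Data Wrangler"},
    9: {"name": "Ana", "position": "Cartographer"},
}
LAYERS = [
    {"uuid": "a1", "title": "Roads", "abstract": "Road centrelines", "owner_uid": 7},
    {"uuid": "b2", "title": "Rivers", "abstract": "Major rivers", "owner_uid": 7},
    {"uuid": "c3", "title": "Parcels", "abstract": "Land parcels", "owner_uid": 9},
]


def layers_owned_by(layers, uid):
    """Equivalent of the "WHERE owner.uid = updated_profile.uid" query."""
    return [layer for layer in layers if layer["owner_uid"] == uid]


def render_document(layer, profile):
    """Build a complete document from stored fields (made-up element names)."""
    values = {
        "title": escape(layer["title"]),
        "abstract": escape(layer["abstract"]),
        "name": escape(profile["name"]),
        "position": escape(profile["position"]),
    }
    return (
        "<metadata><title>%(title)s</title><abstract>%(abstract)s</abstract>"
        "<contact><name>%(name)s</name><position>%(position)s</position></contact>"
        "</metadata>" % values
    )


def documents_for_profile(layers, profiles, uid):
    """Map uuid -> regenerated document for every layer the profile owns."""
    profile = profiles[uid]
    return {
        layer["uuid"]: render_document(layer, profile)
        for layer in layers_owned_by(layers, uid)
    }
```

After changing `PROFILES[7]["position"]`, calling `documents_for_profile(LAYERS, PROFILES, 7)` yields new documents for layers `a1` and `b2` only, each with the original title and abstract intact.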

--
David Winslow
OpenGeo - http://opengeo.org/


Re: [geonode] How much layer metadata should we replicate in the Django models?

From:
Ariel Nunez
Date:
2010-06-14 @ 20:51
>
> So, we need some sort of data architecture that can
> * figure out which layers need updating after a user profile changes
I suggest we hook up the post_save[1] signal handler for the Profile
model and take ``self.user.layer_set.all()`` as the list of layers to
be updated. If we can do bulk updates to GeoNetwork, it would be great
to add that as a LayerManager method.

Which makes me think: is this update operation expected to be
expensive (in terms of time)? If it is, then we had better take it out
of the request/response cycle, for example by creating an
``update_geonetwork`` management command that runs every minute and
sees if there are pending updates (by checking a PendingUpdates table
or similar).

[1] 
http://docs.djangoproject.com/en/dev/ref/signals/#django.db.models.signals.post_save
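
A rough sketch of the pending-updates idea, kept free of Django so it runs on its own. `PendingUpdates` here is a hypothetical in-memory stand-in for the table, `on_profile_saved` plays the part of the post_save handler (in Django the handler receives the saved object as its `instance` argument), and `push` is a placeholder for whatever call writes one document to GeoNetwork.

```python
from collections import OrderedDict


class PendingUpdates:
    """Stand-in for a PendingUpdates table: an ordered set of layer ids."""

    def __init__(self):
        self._ids = OrderedDict()

    def add(self, layer_id):
        # Adding the same layer twice keeps a single pending entry.
        self._ids[layer_id] = True

    def pop_all(self):
        ids = list(self._ids)
        self._ids.clear()
        return ids

    def __len__(self):
        return len(self._ids)


def on_profile_saved(pending, layer_ids):
    """What the signal handler would do: record the work and return at once."""
    for layer_id in layer_ids:
        pending.add(layer_id)


def run_pending_updates(pending, push):
    """What the periodic command would do: push each layer, re-queue failures."""
    done = []
    for layer_id in pending.pop_all():
        try:
            push(layer_id)
        except Exception:
            # Leave the layer queued so the next run retries it.
            pending.add(layer_id)
        else:
            done.append(layer_id)
    return done
```

Keeping the handler down to a queue insert means saving a profile stays fast no matter how many layers it touches; the slow per-document writes happen in the periodic run.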


> * update just the fields corresponding to that user profile (actually, GN is
> basically storing the metadata documents as blobs so we will have to
> overwrite everything... but we need to make sure that we don't clobber the
> fields that aren't being modified)
From your comments I get that this is not feasible, am I correct?
Which one is supposed to be the authoritative data source for
metadata, our GeoNode or GeoNetwork? Can we safely assume that every
GeoNode instance starts off with a fresh GeoNetwork?

> One possible implementation would be to have a more relational model in
> GeoNode and use the typical "WHERE owner.uid = updated_profile.uid" kind of
> query to figure out what documents to update, and then just generate entire
> new metadata documents to clobber the pre-existing ones.  To preserve the
> fields that aren't coming from GeoNetwork, we'd probably want to store
> everything in the layer's Django representation.
BTW, If we are going to replicate a lot of the GeoNetwork metadata in
the Django db, I wonder why we still need GeoNetwork, only for
searching?

Ariel

Re: [geonode] How much layer metadata should we replicate in the Django models?

From:
David Winslow
Date:
2010-06-14 @ 21:17
On 06/14/2010 04:51 PM, Ariel Nunez wrote:
>> So, we need some sort of data architecture that can
>> * figure out which layers need updating after a user profile changes
>>      
> I suggest we hook up the post_save[1] signal handler for the Profile
> model and take ``self.user.layer_set.all()`` as the list of layers to
> be updated. If we can do bulk updates to GeoNetwork, it would be great
> to add that as a LayerManager method.
>
> Which makes me think: is this update operation expected to be
> expensive (in terms of time)? If it is, then we had better take it out
> of the request/response cycle, for example by creating an
> ``update_geonetwork`` management command that runs every minute and
> sees if there are pending updates (by checking a PendingUpdates table
> or similar).
>
> [1]
> http://docs.djangoproject.com/en/dev/ref/signals/#django.db.models.signals.post_save
>
>
>    
Yes, something like this would make sense. Updating GeoNetwork would
likely take some time, since afaik we can only write metadata documents
one at a time (a separate HTTP request per document). At some point we
might want to modify GeoNetwork with some facilities for better
supporting our usage, depending on how receptive the GeoNetwork project
is to such changes.
>> * update just the fields corresponding to that user profile (actually, GN is
>> basically storing the metadata documents as blobs so we will have to
>> overwrite everything... but we need to make sure that we don't clobber the
>> fields that aren't being modified)
>>      
>  From your comments I get that this is not feasible, am I correct?
> Which one is supposed to be the authoritative data source for
> metadata, our GeoNode or GeoNetwork? Can we safely assume that every
> GeoNode instance starts off with a fresh GeoNetwork?
>
>    

I think we can for now. Later, a mass import to "upgrade" a standalone
GeoNetwork to a GeoNode site will be very desirable. We will need to at
least be able to handle 'foreign' layers reasonably well, so that if you
have a federated GeoNetwork setup, GeoNode doesn't try to modify metadata
for layers for which it is not the authoritative provider.
>> One possible implementation would be to have a more relational model in
>> GeoNode and use the typical "WHERE owner.uid = updated_profile.uid" kind of
>> query to figure out what documents to update, and then just generate entire
>> new metadata documents to clobber the pre-existing ones.  To preserve the
>> fields that aren't coming from GeoNetwork, we'd probably want to store
>> everything in the layer's Django representation.
>>      
> BTW, If we are going to replicate a lot of the GeoNetwork metadata in
> the Django db, I wonder why we still need GeoNetwork, only for
> searching?
>    

Yes, searching.  GeoNetwork is basically playing the role of a 
BBOX-aware full-text search engine for us right now.  It could also be a 
mechanism for publishing GeoNode data to other GeoNetwork and GeoNode 
sites via CSW federation (one GeoNetwork instance can crawl another and 
mirror the metadata records).