librelist archives

Some questions

From: Tomas Skäre
Date: 2011-04-18 @ 05:30

Hi,

I'm interested in using libgit2 for storage for a project, and have some
questions that I hope you can answer.

The project would have many small blob objects. A blob would typically
be in the size range of 50-250 bytes, and the estimated number of objects
can be about 1.5-2 million objects, organized as a tree (not files, but
similar structure).

Do you foresee any performance problems with that many objects?

Do you foresee any performance problems if many objects are placed
in the same tree directory? At what point can one expect performance
issues, 100 objects/tree, 1000, 10000, 100000?

When making changes, I'd like to avoid reconstructing the whole tree
on each commit. Is it possible to keep a git_tree struct in memory and
modify the entries that are modified? I couldn't find any API for
adding/modifying/deleting single tree entries in a git_tree. Are there
plans for such API, or can the tree struct be modified manually
before commit?

Or is the index2tree (when implemented) optimized to quickly handle
a few isolated changes in a large tree?


Thank you for any answers,

Tomas

Re: [libgit2] Some questions

From: Vicent Marti
Date: 2011-04-18 @ 06:55

Hey Tomas,

On Mon, Apr 18, 2011 at 8:30 AM, Tomas Skäre <tomas.skare@gmail.com> wrote:
> I'm interested in using libgit2 for storage for a project, and have some
> questions that I hope you can answer.

That's neat!

> The project would have many small blob objects. A blob would typically
> be in the size range of 50-250 bytes, and the estimated number of objects
> can be about 1.5-2 million objects, organized as a tree (not files, but
> similar structure).
>
> Do you foresee any performance problems with that many objects?

I've never tried to store so many (small) objects at the same time,
but the library is quite tuned for speed, so it shouldn't be
irrationally slow. If anything, you may find a bottleneck in disk
lookups for loose objects: my suggestion would be to use our custom
SQLite backend for the ODB. With so many small objects stored at the
same time, you would be reaping all the speed benefits that SQLite
offers.

> Do you foresee any performance problems if many objects are placed
> in the same tree directory? At what point can one expect performance
> issues, 100 objects/tree, 1000, 10000, 100000?

Not necessarily, but of course the more objects you place in a single
tree, the slower the lookups by name will be. It would be smart to
subdivide the objects into subtrees (= subfolders) instead of having
them all in the root. If you can do this, there should be no
significant performance hits.
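
As a rough illustration of the subdivision idea, here is a minimal Python sketch. The `sharded_path` helper and the choice of two levels of two hex digits are assumptions made for this example, not anything libgit2 prescribes. It derives the subtree names from a SHA-1 of the object's name so that entries spread roughly evenly across subtrees.

```python
import hashlib

def sharded_path(name: str, levels: int = 2, width: int = 2) -> str:
    """Map a flat object name to a nested path such as 'ab/cd/name'.

    Hypothetical helper: the directory components come from the hex SHA-1
    of the name, so entries spread roughly evenly across subtrees.
    """
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(parts + [name])

# Two levels of two hex digits give 256 * 256 = 65,536 leaf trees, so
# 2 million objects average roughly 30 entries per leaf tree.
leaf_trees = 256 ** 2
print(sharded_path("object-000001"))
print(2_000_000 // leaf_trees)  # 30
```

If the data already has a natural hierarchy, using that hierarchy for the subtrees is likely simpler than hashing; the hash-based layout only helps when the names would otherwise pile up in one tree.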

> When making changes, I'd like to avoid reconstructing the whole tree
> on each commit. Is it possible to keep a git_tree struct in memory and
> modify the entries that are modified? I couldn't find any API for
> adding/modifying/deleting single tree entries in a git_tree. Are there
> plans for such API, or can the tree struct be modified manually
> before commit?

You seem to be using the `master` branch of libgit2. Check the
`development` branch, which now includes a `git_treebuilder`
specifically tailored for this. You can keep the tree builder in
memory, and keep calling `git_treebuilder_write()` after each set of
modifications to write a new tree to disk.
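
To make that workflow concrete, here is a toy Python model of it. This is not libgit2 code: `ToyTreeBuilder` and `object_id` are names invented for this sketch, it handles only a flat list of blob entries, and `write()` only computes the tree's object id instead of storing anything. In libgit2 itself the builder is exposed as `git_treebuilder`, with functions such as `git_treebuilder_insert()`, `git_treebuilder_remove()` and `git_treebuilder_write()`.

```python
import hashlib

def object_id(kind: str, payload: bytes) -> str:
    """Hash an object the way Git does: SHA-1 over '<kind> <size>', a NUL byte, then the payload."""
    header = f"{kind} {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()

class ToyTreeBuilder:
    """In-memory stand-in for a tree builder (illustration only, blobs only)."""

    def __init__(self):
        self.entries = {}  # name -> (mode, hex object id)

    def insert(self, name: str, oid: str, mode: str = "100644") -> None:
        self.entries[name] = (mode, oid)  # adds or replaces one entry

    def remove(self, name: str) -> None:
        del self.entries[name]

    def write(self) -> str:
        """Serialize the current entries, sorted by name, and return the tree id."""
        payload = b"".join(
            f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(oid)
            for name, (mode, oid) in sorted(self.entries.items())
        )
        return object_id("tree", payload)

builder = ToyTreeBuilder()
builder.insert("a.txt", object_id("blob", b"first\n"))
builder.insert("b.txt", object_id("blob", b"second\n"))
before = builder.write()

# Change a single entry; the builder stays in memory between writes.
builder.insert("b.txt", object_id("blob", b"second, edited\n"))
after = builder.write()
print(before != after)  # True
```

Only the tree that contains the changed entry, plus its parent trees up to the root, gets a new id, which is why splitting a large tree into subtrees keeps each write small.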

> Or is the index2tree (when implemented) optimized to quickly handle
> a few isolated changes in a large tree?

Index2tree is already implemented in the development branch, but you'd
be much better off using the tree builder for your intended use case.

Thanks for your interest in libgit2,
Vicent

Re: [libgit2] Some questions

From: Tomas Skäre
Date: 2011-05-02 @ 14:55

On Mon, Apr 18, 2011 at 08:55, Vicent Marti <vicent@github.com> wrote:

> I've never tried to store so many (small) objects at the same time,
> but the library is quite tuned for speed, so it shouldn't be
> irrationally slow. If anything, you may find a bottleneck in disk
> lookups for loose objects: my suggestion would be to use our custom
> SQLite backend for the ODB. With so many small objects stored at the
> same time, you would be reaping all the speed benefits that SQLite
> offers.

I see. Yes, I guess file storage with that many files will not be a
good choice. We already have other means of storing our objects
to disk, so we may end up adapting that as a custom backend.

> You seem to be using the `master` branch of libgit2. Check the
> `development` branch, which now includes a `git_treebuilder`
> specifically tailored for this. You can keep the tree builder in
> memory, and keep calling `git_treebuilder_write()` after each set of
> modifications to write a new tree to disk.

Oh, I see. I had actually checked out "development", but I read
the documentation that was available on the web page, which
was probably from master. I generated new doc from the source
instead, and can see the new functions. Looks promising, so I
will read up more on that.

>> Or is the index2tree (when implemented) optimized to quickly handle
>> a few isolated changes in a large tree?
>
> Index2tree is already implemented in the development branch, but you'd
> be much better off using the tree builder for your intended use case.

I see, I'll try that.


Thanks a lot for the info, I guess I have some experimenting ahead of me.


Regards,

Tomas