[elephant-devel] Backend musings

Fri Feb 29 19:31:47 UTC 2008

On Fri, Feb 29, 2008 at 2:48 PM, Ian Eslick <eslick at media.mit.edu> wrote:
>  An elephant data store has to:
>  - Support persistent slot writes
>  - Expose a persistent key-value store, indexed key-value store and
>    iterator/cursor interfaces over them
>  - Support a transaction model that guarantees the ACID properties
>  - Work correctly in a threading model
>  - Ideally would also work in a multi-process, shared-memory scenario

To me the last point is is very important. Even more important than
multithreading, and you will probably get multithreading support at
the same time if you implement it correctly. A database that can't be
used from several computers is not much of a database. It can be
tricky, but should be doable.

If you think conceptually dividing client and server code, and make a
generic interface between these parts. Then you can have different
ways of implementing the client-server communication protocol. For a
single process application, the client-server protocol can of course
be simple function calls between the server code and the client code
in the same lisp process. Then for a multi process setup you can
probably want different ways of communicating. One "simple" and
straightforward socket interface that works on all platforms, then
perhaps you (at least I will) want to add another performance
optimized communication protocol that might be less generic. From the
start, what need to be done is to mentally see the different client
and server parts, and make sure no state (variables and memory areas)
are shared between them. Perhaps have an eye on the design so if
possible trying to keep the number of roundtrips betwen client and
server down, Then, it probably becomes easy to add a basic simple
client-server socket solution.

As an extra bonus, having a clear client/server mindset from the start
might make the design clearer and better.

>
>  I think the big design decisions for a lisp data store include:
>  - Backing-store model (prevalence-style logging, binary paging,
>  Rucksack style maps, etc)
>  - Multi-threaded transaction support (page/object locking vs. per-txn
>  replication)
>  - Multi-process support?

Strong vote for...

>  It would also be nice, eventually, to have a lisp server model where
>  we can launch a lisp instance on the same or a remote machine to serve
>  access to the underlying store.

See my comments above.

>  My latest thinking is that we can spend disk space to save time/
>  complexity and assume that we have lots of main memory available.
>  This can allow us to avoid messing with fixed-sized pages and locking.

Yes, don't optimize these things too early.

>  So an initial approach might be the following:
>  - BTree-style nodes and slot value writes are duplicated in a per-
>  transaction memory log.
>    (btree ops are compressed into edit operations, but the altered
>  page is also duplicated for later reads)
>  - Any reads to these subsystems first look up data in this transaction
>  cache
>  - We serialize commit ops and cancel active transactions that have
>  read or written data
>    written by the commiting transaction.
>  - Thus, on commit, we just write to the log file, update the btree
>  pages, and write to the slot store

In this model, isn't there a risk of problems:
Transaction A starts, the world is in a coherent state and you want to
get data from different b-trees that are related.
Transaction B starts, does his things quick and writes new things to
the database.
But now, if transaction A wants to read some data that B has updated,
it will suddently get the brand new fresh data from transaction B,
which was not intended since A wants to read all data as the world
looked from the beginning.

Contrasted by a version numbered transaction model that don't
overwrite data (like Rucksack does), transaction A can always get
coherent data since B doesn't overwrite the db, it just write a new
version.

Am I right, is this a problem?

>  - Thus, on commit, we just write to the log file, update the btree
>  pages, and write to the slot store

If we do a client/server solution, these kind of things can be
performance optimized perhaps by splitting it into different
subprocesses.
The server can maybe just update the log file, save data in its cache,
then return OK/Pending to the client, so the client can continue. Then
the server process can continue doing the rest of the work until it is
finished. This could be done with a multithreaded design as well. But,
we can leave it for later and start out with something simple. But it
could be something to think about in the design.

>
>  We can make serialization of btree nodes easier by just serializing
>  the lisp structure and having bins of differently sized disk regions
>  to write them to rather than maintaining fixed sized pages.  Later, we
>  could choose to enhance the system by converting to a low-level read/
>  write into the C byte stream that we fetch from files.

Agree. Better hunt performance later. It is not obvious what is the
most limiting factor for performance from the start.

About other B+tree implementations, there is one in rucksack as well,
memory based but a clean design that can be adapted for disk usage. It
is also not astrophysics to make one from scratch, or translate from
one of the many implementations for other languages.

About project management, how do we discuss the design and what needs
to be done in the best way? Mails like these? The only problem with
that is that it might be difficult to contribute to the discussion
when you don't have the time to respond to these mails, so a design
document or wiki could also be good because you can contribute more
when you have time rather than when the discussion is ongoing on the
list. Or is the mailing list enough, I am not sure...

/Henrik