[elephant-devel] Querying Elephant

Ian Eslick eslick at csail.mit.edu
Fri Jul 28 21:12:20 UTC 2006


Here is a start on a clearer statement of the standard thinking about
designing
objects and writing query functions.  I invite hacking on this but questions
will probably go unanswered for the next week or so.

There are two operations that all queries in Elephant stem from:
1) A BTree lookup of a value given a key
2) A linear traversal of a BTree in an order defined by the native key
ordering

There are two types of BTrees
1) Standard BTrees require that all keys are unique
2) Secondary BTrees allow duplicate key values, ordering of values
    with duplicate keys is undefined
    (but in practice I believe it's the inverse order of insertion)

Given two classes:

(defpclass blog-entry
    ((user <class-ref> :index t)
     (category <string>)
     (title <string>)
     (date <utime-fixnum> :index t)
     (content <string>)))

(defpclass user
    ((name <string> :index t)))

Question to think about: Why didn't I put in an index for category?

Now say we want to generate the following blog entry summary pages:

1) All a user's entries
2) All a user's entries under a given category
3) Users blog entries for the past month
4) Above filtered by category

Simple index query for a unique element:
=============================

(defun get-user (name)
    (get-instance-by-value 'user 'name name))

Simple index query for a set of elements:
=============================

(defun user-blog-entries (name)
    (get-instances-by-value 'blog-entry 'user (get-user name)))

Query a set of objects from an index, but filter by some property:
==============================================

The answer to the above question, why there is no index for category
is that I know in advance that there are lots of instances for each category
so paying the cost of grouping them into an index isn't worthwhile.  I
rarely ask for all users from a category (see #5) so can pay linear cost
in that case.  If I do that alot then go ahead and add the index!

There are two ways to do this one.  Both require serializing the values,
in this case the blog entry records into memory and both require accessing
the slot value.  Neither require deserializing the larger content until it's
needed for rendering.  The only real difference is in the consing.  The
first one conses more than the other but for small query sets this isn't a
problem and is certainly easier to read.

(defun user-blog-entries-by-category (name category)
    (select-if (lambda (entry)
                      (eq blog-entry-category category))
                   (user-blog-entries name)))

(defun user-blog-entries-by-category (name category &aux results)
    "These loops need some nice macros to clean them up.  This may
      not run and was written from memory without testing but is
      indicative"
    (flet ((get-next (cur)
               (multiple-value-bind (exists? skey val pkey) (cursor-next
cur)
                  (when exists?
                     (if (eq (blog-entry-category val) category)
                         (push val results)
                         exists?))))
    (with-transaction ()
       (with-inverted-cursor (cur blog-entry (get-user name))
          (loop while (get-next cur))
          (nreverse results)))))

;; (defun select-if (fn data &aux results)
;;    (labels ((rec (in out)
;;                   (cond ((null in)
;;                               (nreverse out))
;;                             ((funcall fn (car in))
;;                               (rec (cdr in) (cons (car in) out)))
;;                             (t (rec (cdr in) out)))))
;;      (rec data nil)))

Query a range of objects, then filter by user
===============================

Which is bigger?  Get the likely small set and then filter by the
criterion for the larger set. 
(The general set intersection compilation optimization when set sizes
are known apriori)

(defun get-user-entries-for-dates (name startdate enddate)
   (let ((user (get-user name)))
   (select-if (lambda (entry)
                    (eq (blog-entry-user entry) user))
                 (get-instances-by-range 'blog-entry 'date startdate
enddate))))


Query and constrain multiple values
==========================

To query all entries for a user and in a given category between two dates do
something like the above, but select-if over, say, username and category.

The general principle is select the likely smallest set and then filter
those elements by the other criteria.  This works good in most cases.


If you really want to speed up looking up the conjunction of two or more
features,
you can build an index which orders these totally.  That is to say, make
a function
provided to (make-derived-index ) that given an object computes a number (or
string) that creates a new namespace ordered in the appropriate way.

Let's say I do alot of category based user and date queries to generate
blog pages
from a server database.  Big amount of records for date, category and
maybe for
a given user as well (I would just look up by user and then filter, but
over several
years this probably becomes a large set).

You would like an index that implements: USER | CATEGORY | DATE

For example, I could make a bignum with a 32-bit user ID, 16-bit
category ID and
a 32-bit date and map/shift the specific values appropriately.  I can
also put these same
fields into a fixed field delimited string.  Slow lookup, but easy to debug.

"IAN               |Hacking       |2423156663"

I can now do range queries over conjunctions of NAME+CATEGORY over a given
date range in time approximately linear in the result set.

I'm sure there's a cleaner solution to this problem as this is common in
the RDB/SQL
implementation world (join tables?) but this works well for most common
applications
people are likely to be using elephant for.

Enjoy, hack up or critique at your leisure.

Ian












More information about the elephant-devel mailing list