[elephant-devel] Sorting

Ian Eslick eslick at csail.mit.edu
Mon Oct 22 20:38:49 UTC 2007


Well here are the open decisions:

- Characters as an independent type or a numeric type?
- Case sensitive vs. insensitive string/symbol/pathname sorting?
   (insensitive was the policy implemented by the original developers)
- Settle on a standard type ordering and find some way to do a range  
query over a type, or develop an alternative data structure model

Actually the btree vs. dictionary argument isn't quite right.  The  
unsortability of certain ranges of the current btree abstraction is a  
design tradeoff stemming from a conflict between dynamic typing and  
the the lisp definition of sortable.  Not all serializable lisp  
objects are sortable, so we have two classes of objects.

[BTrees are an implemenation, not an abstraction, so perhaps should  
be an optional, low-level interface only experts use?  The arbitrary  
sort order used in the BTree 'implementation' is still deterministic  
and thus enables log(N) insert/delete and linear time traversal.]

There are several ways we can deal with the two-class problem:

1) BTrees can only store a single, statically declared type (perhaps  
we allow all 'number' subclasses as a type)

2) We allow BTrees to store anything, as outlined in the original  
writeup below.

3) We allow BTrees to store anything, but we bend over backwards to  
document how each unsortable type is actually sorted and further wack  
away at all the data stores so they support the defined sort order  
for unsortable types.  This strikes me as a huge headache (Robert,  
remember the headache in duplicate sorting differences between BDB  
and SQL?  Multiply that by N data stores...:)

4) BTrees can only store sortable objects and we define the ordering  
of different type ranges.  We add a dictionary type to store both  
sortable and unsortable objects.  There are some implications, though.

This means that indexed slot values must be sortable and you get a  
type violation error if you try to write an unsorted value.  This is  
a problem if you add an index to an existing class with slots that  
have values that are not sortable...

Now lisp has no notion of ordering between types and it is hard to  
index into the first or last member of any given type (given the  
current abstractions in all data store implementations).  Do we  
disallow numbers and strings to reside in the same tree?   If not, is  
42 < "deep thought" or is "deep thought" < 42?  :)  The logical  
outcome of a cleaner BTree abstraction is that BTrees and slot  
indices become statically typed (just like CLSQL classes) and the  
dictionary type is more like the current BTree, but doesn't have  
cursors, succ or pred.   I think lisp users can deal with not relying  
on the enumeration ordering for a dictionary since they deal with  
hash-table enumeration just fine today.

In my mind only options 2 and 4 are viable.  The difference is  
whether you want the flexibility of writing anything to btrees and  
indexed slots verses cleaner data structure abstractions and range  
queries.

We could also keep the current btree model and clean up the  
abstractions by building the structures in #2 on top for users  
desiring less power and more predictability.  This would mean the  
current BTree with a complex definition of order, a 'tree' data  
structure with pred/succ that only takes a single type and a  
'dictionary' data structure without it.  This last option is pretty  
easy to build on top of btrees, actually.

Finally, class indexing could be extended to enforce types so you  
could choose a BTree index or a cleaner 'tree' index (what is the  
canonical name for a totally ordered set with pred/succ on it?)

(defpclass my-class ()
   ((slot1 :accessor slot1 :initarg :slot1 :index t :index-type number)
    (slot2 :accessor slot2 :initarg :slot2 :index t)))

(make-instance 'my-class :slot1 1 :slot2 'foo)
(make-instance 'my-class :slot1 2.0 :slot2 "test")
(make-instance 'my-class :slot1 (the bignum 2000000000000000) :slot2  
'foo)

(get-instances-by-value 'my-class 'slot2 "test")
=> (#<MY-CLASS s1: 2.0 s2: "test">)
(get-instances-by-range 'my-class 'slot1 0 10)
=> (#<MY-CLASS s1: 1 s2: 'foo> #<MY-CLASS s1: 2.0 s2: "test">)
(get-instances-by-value 'my-class 'slot2 'foo)
=> (#<MY-CLASS s1: 1 s2: 'foo> #<MY-CLASS s1: 200000000000000 s2: 'foo>)

(make-instance 'my-class :slot1 'foo :slot2 'foo) => (error "Invalid  
slot-type for typed index slot")

If we keep the current BTree abstraction in any form, we should  
definitely provide a way to map over a type range within the btree as  
opposed to only the entire tree.  At least for now I can stub in a  
linear version of this vs. the ideal log(N) + M.

Whew...

Ian



On Oct 22, 2007, at 3:55 PM, Robert L. Read wrote:

> I think this is the best policy that we can generate.
>
> I personally don't like it, as I've mentioned before.  I don't think
> we should have an arbitrary order. If you don't have a specified  
> order,
> then you don't really have a btree in meaningful sense.  You can't  
> do a
> range query over something that is arbitrary, and if you don't have a
> range query, why call it a btree? You should just call it a  
> dictionary,
> (which happens to be implemented with btrees) equip it with an
> enumeration, and have something just as efficient and simpler than a
> btree, which furthermore decreases the chances of someone accidentally
> depending on the order, since you never provide a way to utilize it
> (although of course humans, being human, will accidentally depend  
> on the
> ordering of objects returned by the enumeration, so the problem never
> goes away from a software engineering point of view.)
>
> However, I don't have time to reorganize things to present a
> "dictionary" in addition to a "btree" abstraction; and even if someone
> volunteered to do this work, it would be worthy of a new release,  
> rather
> than a discussion of the current release.  And since Ian has done much
> more work than I have in the last year, I will defer to his judgment.
>
> Is there a reason that string sorting is case insensitive?  This seems
> like asking for trouble --- that's something that people are likely to
> depend on and be surprised when they switch repositories and find  
> it is
> different.
>
>
> On Mon, 2007-10-22 at 13:18 -0400, Ian Eslick wrote:
>> There have been a number of discussions about sorting issues and
>> guarantees among the various data store implementations.  This is
>> difficult due to different backend serialized representations.  This
>> is my understanding of the current guarantees.  Let's commit to this
>> as part of locking down 0.9.1.
>>
>> Any ordered data structure, such as btree or indexed-btrees, has to
>> make some guarantees about the function that defines the ordering.
>> Currently, the BDB data store and Elephant core implements the
>> following policy (I believe this is also true for CL-SQL stores as
>> well):
>>
>> General policies:
>>
>> In a BTree with multiple types, each type is sorted as an independent
>> subsequence.  Types are guaranteed to be separated, but the order of
>> the type sub-sequences is not guaranteed.
>>
>> In an Index with multiple values, there is no guarantee about the
>> ordering of duplicate values, the the index is treated as a sorted
>> sequence of unordered equivalence classes.  Tests and users should
>> not depend on this.
>>
>> Specific types:
>>
>> Numeric types (fixnum, bignum, float, double, rational and complex):
>> Sorted properly over the continuous real numbers, smallest to biggest
>>
>> Characters:
>> Currently considered and sorted in with all the above numeric types
>> (this is probably wrong)
>>
>> Strings:
>> Sorted alphabetically, case-INSENSITIVE
>>
>> Symbols/Pathnames:
>> Sorted as strings above, but symbols and paths are sorted
>> independently as type classes.
>>
>> Persistent objects are sorted according to OID, but the OID ordering
>> may change at any time, so long as object equivalency is maintained
>> (for example, if we implement GC).  Users should not depend on the
>> OID for anything other than uniqueness unless they are implemented a
>> new low-level persistent-collection data structure like (N:1 and N:N)
>> associations that is GC aware.  Otherwise the order can only be
>> guaranteed within a given transaction.
>>
>> All other types have no user-visible ordering guarantees.  They have
>> an arbitrary ordering within the data store, but there is no lisp
>> ordering that makes sense for hash tables, standard objects, etc.
>> (You can create derived indices of an indexed-btree that compute a
>> number, string, etc that can serve as an ordering for arbitrary
>> objects).
>>
>> Open issues:
>> - Characters as independent class, or as numeric codes?
>> - If/how to map over a type subset in an index or btree
>>
>>
>> Anything I missed?
>>
>> Ian
>>
>>
>> _______________________________________________
>> elephant-devel site list
>> elephant-devel at common-lisp.net
>> http://common-lisp.net/mailman/listinfo/elephant-devel
>
> _______________________________________________
> elephant-devel site list
> elephant-devel at common-lisp.net
> http://common-lisp.net/mailman/listinfo/elephant-devel




More information about the elephant-devel mailing list