[elephant-devel] Sorting

Tue Oct 23 02:45:13 UTC 2007

On Mon, 2007-10-22 at 16:38 -0400, Ian Eslick wrote:
> Well here are the open decisions:
> 
> - Characters as an independent type or a numeric type?
Can we usefully think of Unicode characters as a numeric type?
The real problem is that each culture uses a different collation
order, so it is hard to imagine an useful order for Latin-1 or 
Latin-3 characters that isn't specific to one culture.  So it 
seems to me that a numeric type is more culture-neutral, even if
it isn't necessarily more useful.

> - Case sensitive vs. insensitive string/symbol/pathname sorting?
>    (insensitive was the policy implemented by the original developers)

Do we have to treat strings and symbol/pathnames the same way? 

> - Settle on a standard type ordering and find some way to do a range  
> query over a type, or develop an alternative data structure model
> 
> Actually the btree vs. dictionary argument isn't quite right.  The
>   
Touche a moi.
> unsortability of certain ranges of the current btree abstraction is a  
> design tradeoff stemming from a conflict between dynamic typing and  
> the the lisp definition of sortable.  Not all serializable lisp  
> objects are sortable, so we have two classes of objects.
> 
> [BTrees are an implemenation, not an abstraction, so perhaps should  
> be an optional, low-level interface only experts use?  The arbitrary  
> sort order used in the BTree 'implementation' is still deterministic  
> and thus enables log(N) insert/delete and linear time traversal.]

I think you've nailed the most important point here.  I doubt we can 
change it for the current release, but our ideal should not be to offer
a "btree", which is, as you mention, an implementation rather than an
Abstract Data Type.  As you mention, we can let people code against it,
but it should not be the default.  Most users you should use "Set" or
"Ordered Set", if the type of the set is easily ordered.  People may
prefer to call these "Set" and "List", although "List" collides terribly
with LISP notions.  The fact that we may in fact use BTree with
implementation dependent orders is an internal convenience.

> 
> There are several ways we can deal with the two-class problem:
> 
> 1) BTrees can only store a single, statically declared type (perhaps  
> we allow all 'number' subclasses as a type)
> 
> 2) We allow BTrees to store anything, as outlined in the original  
> writeup below.
> 
> 3) We allow BTrees to store anything, but we bend over backwards to  
> document how each unsortable type is actually sorted and further wack  
> away at all the data stores so they support the defined sort order  
> for unsortable types.  This strikes me as a huge headache (Robert,  
> remember the headache in duplicate sorting differences between BDB  
> and SQL?  Multiply that by N data stores...:)
> 
> 4) BTrees can only store sortable objects and we define the ordering  
> of different type ranges.  We add a dictionary type to store both  
> sortable and unsortable objects.  There are some implications, though.
> 
> This means that indexed slot values must be sortable and you get a  
> type violation error if you try to write an unsorted value.  This is  
> a problem if you add an index to an existing class with slots that  
> have values that are not sortable...

I don't quite follow this last paragraph.  
An indexed slot value doesn't support range
queries unless it has a type for which an order is defined.  However,
you should still be able to do O(1) look ups on the value against a 
hash structure, or O(log n) against a BTree implemnation.  For example,
if I have a slot representing social security numbers, I should be able
to look an object up by ssn, in O(log n), but not do a range query over
SSN.  If I expect to do range queries, then I define SSN to be a type
that is ordered (which is, in this example, easy, and will in practice
be free to our users.)  So in designing a class it is still reasonable
to think: Do I want to index this slot or not? And then: Do I need
to support enumeration in a particular order, or range queries?

> 
> Now lisp has no notion of ordering between types and it is hard to  
> index into the first or last member of any given type (given the  
> current abstractions in all data store implementations).  Do we  
> disallow numbers and strings to reside in the same tree?   If not, is  
> 42 < "deep thought" or is "deep thought" < 42?  :)  The logical  
> outcome of a cleaner BTree abstraction is that BTrees and slot  
> indices become statically typed (just like CLSQL classes) and the  
> dictionary type is more like the current BTree, but doesn't have  
> cursors, succ or pred.   I think lisp users can deal with not relying  
> on the enumeration ordering for a dictionary since they deal with  
> hash-table enumeration just fine today.

I agree the last part.

> 
> In my mind only options 2 and 4 are viable.  The difference is  
> whether you want the flexibility of writing anything to btrees and  
> indexed slots verses cleaner data structure abstractions and range  
> queries.

I don't think we can tell people they can't index a certain slot in the
sense of having rapid lookup of a specific value.  We might be able to
tell them they can have "pred/succ" on a slot unless they give us some
more information on how to implement those.

> 
> We could also keep the current btree model and clean up the  
> abstractions by building the structures in #2 on top for users  
> desiring less power and more predictability.  This would mean the  
> current BTree with a complex definition of order, a 'tree' data  
> structure with pred/succ that only takes a single type and a  
> 'dictionary' data structure without it.  This last option is pretty  
> easy to build on top of btrees, actually.

I think this is the ultimate solution.  We are moving toward abstraction
without taking any functionality or performance away from anyone.
I think if when defining a slot the use can sort of punt and say:
(defclass not-sure-class () 
	((slot1 :accessor :slot :index-type 'implemenation-arbitrary))),
where 'implemenation-arbitrary means the "current" order that you have
defined (or at least clarified), then we are allowing the user to do
anything that they can do today, but offering more abstract ways of
doing the same thing.

> 
> Finally, class indexing could be extended to enforce types so you  
> could choose a BTree index or a cleaner 'tree' index (what is the  
> canonical name for a totally ordered set with pred/succ on it?)

"Ordered Set/OSet" is the best canonical name I can come up with.
Both the Wikipedia and _The Design and Analysis of Computer Algorithms_
by Aho, Hopcroft and Ullman are silent on the issue.  Java (shudder!)
uses Set and SortedSet, which is actually not too bad.

> 
> (defpclass my-class ()
>    ((slot1 :accessor slot1 :initarg :slot1 :index t :index-type number)
>     (slot2 :accessor slot2 :initarg :slot2 :index t)))
> 
> (make-instance 'my-class :slot1 1 :slot2 'foo)
> (make-instance 'my-class :slot1 2.0 :slot2 "test")
> (make-instance 'my-class :slot1 (the bignum 2000000000000000) :slot2  
> 'foo)
> 
> (get-instances-by-value 'my-class 'slot2 "test")
> => (#<MY-CLASS s1: 2.0 s2: "test">)
> (get-instances-by-range 'my-class 'slot1 0 10)
> => (#<MY-CLASS s1: 1 s2: 'foo> #<MY-CLASS s1: 2.0 s2: "test">)
> (get-instances-by-value 'my-class 'slot2 'foo)
> => (#<MY-CLASS s1: 1 s2: 'foo> #<MY-CLASS s1: 200000000000000 s2: 'foo>)
> 
> (make-instance 'my-class :slot1 'foo :slot2 'foo) => (error "Invalid  
> slot-type for typed index slot")
> 
> If we keep the current BTree abstraction in any form, we should  
> definitely provide a way to map over a type range within the btree as  
> opposed to only the entire tree.  At least for now I can stub in a  
> linear version of this vs. the ideal log(N) + M.

I think I know what you mean, but I think this impossible in the most
general sense of the word "type", given that lisp supports types like:
(integer 3 37) as a range or '(or integer string).  But I think I know
what you mean; in practice we can usefully do it over must types that 
are simple symbols.  We can support a lot of simple types that coincide
with our "complex type" almost effortlessly---those that you have
already mentioned, for example.

> 
> Whew...

I'm sorry this complicated, and I know my thoughts aren't always
perfectly clear.  But if we want to make progress in the long term we
have to have these kinds of discussions.

Allow me to make a few other points:

1)  You have convinced me that my original idea of allowing the user to 
define their own orders, while mathematically reasonable, is not
practical in our current situation.  (If we had a pure-lisp solution
where we could write our own tree implementation, then it would be
different....)

2)  Originally Elephant was an interface to BDB, and it served that
purpose well.  I personally would rather see it be the basic persistent
object store solution for LISP, at a minimum for rapid prototyping.  I
therefore think growing away for "btree" and toward more mathematical
definitions is a very positive idea.  (Of course, I am also the kind of
guy is who offended that "relational" database rarely have duplicate
elimination turned on, and therefore implement "multisets" rather than
"sets" in many cases.)

3)  Even if we hammer out a complete consensus on what to do, it doesn't
mean we have to do it the next release---we got a very good postmodern
implementation that deserves its own release.

> 
> Ian
>