[cffi-devel] a thought on string encodings

James Bielman jamesjb at jamesjb.com
Tue Jan 3 11:44:04 UTC 2006


On Mon, 2006-01-02 at 16:03 +0100, Hoehle, Joerg-Cyril wrote:

> How are you going to represent the encodings?
> - use the implementation's objects?
> - introduce your own names?
> - restrict yourself to known encodings, e.g. UTF-8 (users on MS-Windows would appreciate UTF-16)?

Currently the plan is to represent encodings with keywords, which we can
map in CFFI-SYS to whatever objects are necessary for the
implementation.  Ultimately, you would be able to do stuff like:

(defcfun "use_a_latin1_string" :void
  (s (:string :encoding :iso-8859-1)))

(defctype utf8-string (:string :encoding :utf-8))

(defcfun "getenv" utf8-string
  (name utf8-string))

I'd like to see at least :ascii, :iso-8859-1, :utf-8, and :utf-16 to
start with.  Then we can start adding support for extra encodings.  It
looks like CLISP supports the most encodings currently, so it will
probably be the main test platform once this moves beyond the basics.

Also, I think plain :string will be modified to use whatever the default
encoding is set to in an implementation-specific manner (possibly based
on the user's locale, or whatever the Windows equivalent is, etc).  I
haven't thought about this too hard yet.

I don't think CFFI will support much beyond :ascii or :iso-8859-1 in
non-Unicode Lisps---I don't want to encumber it with a bunch of
character code tables.

> Yeah, I should really add this string stuff to clisp instead.

My plan for the CFFI-SYS interface so far looks like (modulo some %
prefixes):

Function: LIST-ENCODINGS

Return a list of CFFI encodings (keyword symbols) supported by this
implementation.

Function: FOREIGN-STRING-LENGTH pointer encoding &key (offset 0)

Return the length in octets of the null terminated foreign string at
POINTER plus OFFSET octets, assumed to be encoded in ENCODING, a CFFI
encoding.

This should be smart enough to look for 8-bit vs 16-bit null
terminators, as appropriate for the encoding.
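
To illustrate the terminator-width point, here is a rough sketch in
portable Common Lisp.  It is not the CFFI-SYS implementation: a vector of
octets stands in for foreign memory, and the names TERMINATOR-WIDTH and
OCTETS-STRING-LENGTH are hypothetical, made up for this example.  It
assumes :utf-16 strings are aligned on 2-octet code units starting at
OFFSET.

```lisp
;; Hypothetical helper: width in octets of the null terminator.
(defun terminator-width (encoding)
  (ecase encoding
    ((:ascii :iso-8859-1 :utf-8) 1)
    (:utf-16 2)))

;; Hypothetical stand-in for FOREIGN-STRING-LENGTH that scans an octet
;; vector instead of foreign memory.  Returns the length in octets,
;; not counting the terminator.
(defun octets-string-length (octets encoding &key (offset 0))
  (let ((width (terminator-width encoding)))
    (loop for i from offset by width
          while (<= (+ i width) (length octets))
          ;; A terminator is WIDTH consecutive zero octets on a
          ;; code-unit boundary.
          when (loop for j from i below (+ i width)
                     always (zerop (aref octets j)))
            return (- i offset)
          finally (error "No null terminator found."))))
```

The reason for stepping by the code-unit width: little-endian UTF-16
text such as "hi" is the octets 104 0 105 0 0 0, so an 8-bit scan would
stop at the second octet and report a length of 1 instead of 4.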

Function: LISP-STRING-OCTET-LENGTH string encoding &key start end

Return the length of STRING from START to END, converted to ENCODING, in
octets.  This can be used to preallocate a buffer to pass to
LISP-STRING-TO-FOREIGN.
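
As a rough idea of what the octet-length computation could look like for
one encoding, here is a sketch for :utf-8 only.  The name
UTF-8-OCTET-LENGTH is hypothetical, and the sketch assumes each Lisp
character is a full Unicode code point (no surrogate pairs stored in
strings).  Whether the result should include the null terminator is not
settled above; this sketch does not count it.

```lisp
;; Hypothetical sketch: octets needed to encode STRING from START to
;; END as UTF-8, excluding any null terminator.
(defun utf-8-octet-length (string &key (start 0) end)
  (loop for i from start below (or end (length string))
        for code = (char-code (char string i))
        sum (cond ((< code #x80) 1)       ; ASCII range
                  ((< code #x800) 2)      ; two-octet sequences
                  ((< code #x10000) 3)    ; rest of the BMP
                  (t 4))))                ; supplementary planes
```

A caller preallocating a buffer with this would need to add room for the
terminator itself.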

Function: LISP-STRING-TO-FOREIGN string encoding &key start end buffer => pointer

Convert characters from START to END (character indices) in STRING to a
foreign string, encoded in ENCODING, a CFFI encoding.

If BUFFER, a pointer, is supplied, the foreign string will be written to
that location.  BUFFER must be large enough to accommodate the foreign
string---this can be queried with LISP-STRING-OCTET-LENGTH.

If BUFFER is not supplied, a pointer to a freshly allocated foreign
string will be returned.  Free this string with CFFI:FOREIGN-FREE.

Function: FOREIGN-STRING-TO-LISP pointer encoding &key start end => string

Convert octets from START to END (octet indices) from POINTER, assumed
to be encoded in ENCODING, to a Lisp string.  If not supplied, END
should default to:

  (foreign-string-length pointer encoding :offset start)
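
For the simplest case, decoding could look roughly like the sketch
below.  It handles :iso-8859-1 only, where every octet maps directly to
the character with the same code, and it reads from an octet vector
rather than a foreign pointer.  The name LATIN1-OCTETS-TO-STRING is
hypothetical.  Here END, when not supplied, is taken to be the index of
the first zero octet at or after START.

```lisp
;; Hypothetical sketch of the decoding direction for ISO-8859-1 only,
;; with an octet vector standing in for foreign memory.
(defun latin1-octets-to-string (octets &key (start 0) end)
  (let ((end (or end
                 ;; Default END: position of the null terminator.
                 (position 0 octets :start start)
                 (error "No null terminator found."))))
    ;; Each Latin-1 octet is its own character code.
    (map 'string #'code-char (subseq octets start end))))
```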


I think CLISP has enough to implement these fairly efficiently, but any
additional primitives you want to add or comments on this interface will
certainly be helpful!

James




