From jeffrey at jkcunningham.com  Mon Sep 24 17:01:47 2012
From: jeffrey at jkcunningham.com (Jeff Cunningham)
Date: Mon, 24 Sep 2012 10:01:47 -0700
Subject: [drakma-devel] charset errors question
Message-ID: <506091FB.6040400@jkcunningham.com>

I've been running into some trouble using drakma to retrieve pages from
certain commercial websites. It is very likely the HTML they are generating
is broken one way or another. But the problem still remains as to how one
can retrieve their pages using drakma.

For example, if you try this simple case:

    (http-request "http://www.walmart.com")

It will display the following:

    WARNING: Problems determining charset (falling back to binary):
    Corrupted Content-Type header:
    Read character #\;, but expected #\=.

And the returned body is binary-encoded ascii. This can be converted to real
ascii, of course, but it is inconvenient to say the least.

Often the problem is that their metatag for the charset is simply wrong.
Sometimes I can figure out what it is and supply this information, like
this:

    (http-request "http://www.walmart.com" :external-format-in :UTF-8)

and it will solve the problem. But this particular example does not lend
itself to this, at least using the following charsets:

    :UTF-8
    :UTF-7
    :iso-8859-1
    :iso-8859-2
    :iso-8859-3
    :iso-8859-4
    :iso-8859-5
    :iso-8859-6
    :iso-8859-7
    :iso-8859-8
    :iso-8859-9
    :BIG5
    :US-ASCII
    :UTF-16
    :UTF-32

I have no idea what their server is actually sending - it appears to be
invalid for any of these charsets.

Is there any way to get around this problem?

Best regards,
Jeff Cunningham

From hans.huebner at gmail.com  Mon Sep 24 17:47:59 2012
From: hans.huebner at gmail.com (Hans Hübner)
Date: Mon, 24 Sep 2012 19:47:59 +0200
Subject: [drakma-devel] charset errors question
In-Reply-To: <506091FB.6040400@jkcunningham.com>
References: <506091FB.6040400@jkcunningham.com>
Message-ID:

Jeff,

you can use the :FORCE-BINARY keyword argument to have DRAKMA return
the octets constituting the response, and then call
FLEXI-STREAMS:OCTETS-TO-STRING to decode them with an explicit
external format of your choice, like so:

    (flexi-streams:octets-to-string
      (drakma:http-request "http://www.walmart.com" :force-binary t)
      :external-format :ascii)

HTH,
Hans

On Mon, Sep 24, 2012 at 7:01 PM, Jeff Cunningham wrote:
> I've been running into some trouble using drakma to retrieve pages from
> certain commercial websites. It is very likely the HTML they are generating
> is broken one way or another. But the problem still remains as to how one
> can retrieve their pages using drakma.
>
> For example, if you try this simple case:
>
>     (http-request "http://www.walmart.com")
>
> It will display the following:
>
>     WARNING: Problems determining charset (falling back to binary):
>     Corrupted Content-Type header:
>     Read character #\;, but expected #\=.
>
> And the returned body is binary-encoded ascii. This can be converted to real
> ascii, of course, but it is inconvenient to say the least.
>
> Often the problem is that their metatag for the charset is simply wrong.
> Sometimes I can figure out what it is and supply this information, like
> this:
>
>     (http-request "http://www.walmart.com" :external-format-in :UTF-8)
>
> and it will solve the problem. But this particular example does not lend
> itself to this, at least using the following charsets:
>
>     :UTF-8
>     :UTF-7
>     :iso-8859-1
>     :iso-8859-2
>     :iso-8859-3
>     :iso-8859-4
>     :iso-8859-5
>     :iso-8859-6
>     :iso-8859-7
>     :iso-8859-8
>     :iso-8859-9
>     :BIG5
>     :US-ASCII
>     :UTF-16
>     :UTF-32
>
> I have no idea what their server is actually sending - it appears to be
> invalid for any of these charsets.
>
> Is there any way to get around this problem?
>
> Best regards,
> Jeff Cunningham
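
The :FORCE-BINARY approach can be combined with the idea of trying
several charsets in turn. What follows is only a minimal sketch, not
something from the thread: DECODE-OCTETS and FETCH-AS-STRING are
hypothetical helper names, the candidate list is an assumption, and
Quicklisp is assumed to be available for loading DRAKMA and
FLEXI-STREAMS. It relies on FLEXI-STREAMS signalling
EXTERNAL-FORMAT-ENCODING-ERROR when octets cannot be decoded in the
requested external format.

```lisp
;; Sketch under stated assumptions; helper names are hypothetical.
(eval-when (:compile-toplevel :load-toplevel :execute)
  ;; Assumes Quicklisp is installed and loaded in this image.
  (ql:quickload '(:drakma :flexi-streams)))

(defun decode-octets (octets &optional (candidates '(:utf-8 :iso-8859-1)))
  "Decode OCTETS with the first external format in CANDIDATES that does not
signal a decoding error. Returns the string and the format that worked, or
NIL and NIL if every candidate failed."
  (dolist (external-format candidates (values nil nil))
    (handler-case
        (return (values (flexi-streams:octets-to-string
                         octets :external-format external-format)
                        external-format))
      ;; Signalled by FLEXI-STREAMS when the octets are invalid for the
      ;; external format being tried; move on to the next candidate.
      (flexi-streams:external-format-encoding-error ()
        nil))))

(defun fetch-as-string (url)
  "Fetch URL as raw octets, bypassing DRAKMA's charset detection, then decode."
  (decode-octets (drakma:http-request url :force-binary t)))
```

Calling (fetch-as-string "http://www.walmart.com") would then return the
decoded body together with the external format that was used. Putting
:iso-8859-1 last acts as a catch-all, because every octet is a valid
character in that encoding; the price is that text in some other charset
may come back with the wrong characters instead of an error.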