[hunchentoot-devel] googlebot revisitation rate excessive?

Hans Hübner hans at huebner.org
Fri Jul 4 15:40:51 UTC 2008


On 7/4/08, Jeff Cunningham <jeffrey at cunningham.net> wrote:
>  Before I block them altogether, there is one thing I don't understand that
> I'm hoping someone can explain to me. What does it mean exactly when I get a
> "No session for session identifier" INFO message in my error_log? There is
> one of these for each of the Googlebot hits.

It means that Googlebot presented a session identifier string as a
hunchentoot-session parameter, and that identifier is not valid.  You
are probably using sessions very frequently, and the Google crawler
managed to hit one of the URLs of your server that starts a session.
As the crawler did not accept the cookie that Hunchentoot sent,
Hunchentoot fell back to attaching the session identifier to all URLs
in the outgoing HTML as a parameter.  The crawler saved the URLs it
saw, including the session identifier, and now tries to crawl using
these identifiers, which are probably old and no longer valid.

First off, I would recommend that you switch off URL-REWRITE
(http://weitz.de/hunchentoot/#*rewrite-for-session-urls*).  I am not
using it myself precisely because it confuses simple crawlers.  If a
user does not accept the cookies my site sends, they will not be able
to use it with sessions.  For me, this has never been a problem.  This
will probably not help you with your current problem, but it will make
things easier in the future.
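
A minimal sketch of switching it off, assuming Hunchentoot is already
loaded and that this form is evaluated before the server starts
handling requests (where exactly you put it in your startup code is up
to you):

```lisp
;; With this set to NIL, Hunchentoot no longer appends the session
;; identifier to URLs in outgoing HTML; sessions then depend on the
;; client accepting the session cookie.
(setf hunchentoot:*rewrite-for-session-urls* nil)
```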

In general, crawlers do not support cookies or session ids in GET
parameters.  Thus, if you want to support crawlers, you need to make
your pages work for them without sessions.  Note that if you do
nothing except switch off URL-REWRITE, every request from a crawler
will create a new session.  This may or may not be a problem.
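
One way to avoid creating a session per crawler request is to check
the User-Agent header before starting one.  The following is only a
sketch: the marker list and the names *CRAWLER-MARKERS*,
CRAWLER-USER-AGENT-P and MAYBE-START-SESSION are made up for
illustration and are not part of Hunchentoot.

```lisp
;; Hypothetical list of substrings that identify crawlers; adjust it
;; to whatever actually shows up in your logs.
(defparameter *crawler-markers* '("googlebot" "slurp"))

(defun crawler-user-agent-p (user-agent)
  "Return T if USER-AGENT (a string or NIL) contains a known marker.
The comparison is case-insensitive."
  (and (stringp user-agent)
       (some (lambda (marker)
               (search marker user-agent :test #'char-equal))
             *crawler-markers*)
       t))

(defun maybe-start-session (user-agent start-fn)
  "Call START-FN and return its value unless USER-AGENT looks like a
crawler, in which case return NIL without starting anything."
  (unless (crawler-user-agent-p user-agent)
    (funcall start-fn)))
```

In a handler this could be called as
(maybe-start-session (hunchentoot:user-agent) #'hunchentoot:start-session),
and the handler then has to cope with there being no session when the
result is NIL.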

I guess that Google now has a lot of your URLs it wants to crawl
because the different session identifiers made it think that all of
them point to different resources.  I am wondering whether that is
standard Googlebot behaviour.

Lastly, I would vote for switching off URL-REWRITE by default.

-Hans