Monday, March 24, 2014

Digital Historia Numorum currently offline; need advice

I took my digital copy of Barclay Head's Historia Numorum offline.

I started getting email from my provider that my web site was using large amounts of transfer. I looked at my server's logs, and found hundreds of requests for a valid URL, followed by / and some other part of the web site, like this:

/coins/hn/peloponnesus.html/nc/bmc/nc/library/nc/library/bmc/peloponnesus/
/coins/hn/peloponnesus.html/nc/nc/nc/nc/nc/nc/bmc/peloponnesus/
/coins/hn/peloponnesus.html/bmc/nc/nc/nc/nc/nc/nc/bmc/peloponnesus/
/coins/hn/peloponnesus.html/nc/library/nc/bmc/library/bmc/peloponnesus/
/coins/hn/peloponnesus.html/bmc/nc/nc/nc/nc/nc/library/bmc/peloponnesus/
/coins/hn/peloponnesus.html/library/library/nc/nc/bmc/bmc/peloponnesus/
… 1000s more …
/coins/hn/peloponnesus.html/library/nc/nc/nc/nc/bmc/bmc/peloponnesus/
/coins/hn/peloponnesus.html/library/nc/nc/nc/bmc/nc/bmc/peloponnesus/

I suspect that some bot is downloading a page, seeing my relative URLs to elsewhere on the site (e.g. "../bmc/index.html"), but then throwing away the .. and miscalculating the link as /bmc/index.html. But on my site, anything after a valid filename.html returns the same page as filename.html itself. The bot thinks it is spidering a giant site but is actually getting the same page over and over.
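To make the suspected miscalculation concrete, here is a small Python sketch. It is only an illustration: example.com is a placeholder host, and the "buggy" line is one guess at what such a bot might be doing, not its actual code.

```python
from urllib.parse import urljoin

# Placeholder host; the path is taken from the log lines above.
page = "http://example.com/coins/hn/peloponnesus.html"
href = "../bmc/index.html"  # a relative link as it appears in the page

# Standard resolution: drop the filename, then apply ".." to the directory.
correct = urljoin(page, href)

# Hypothetical buggy crawler: discards ".." and treats the page as a directory.
buggy = page + "/" + href.replace("../", "")

print(correct)  # http://example.com/coins/bmc/index.html
print(buggy)    # http://example.com/coins/hn/peloponnesus.html/bmc/index.html
```

Because the server answers the second URL with the same page, every relative link on it can be mangled again, which would explain the ever-growing paths in the log.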

Anyone know how to get Apache to return a 404 for requests that have a / and extra path after a valid filename.html? I have been running this site for ten years. If I can't get it working I will move the whole site to Amazon S3, but I'd rather not take that step this month.

1 comment:

Ed Snible said...

The web site is back up.

xanthos on Forum showed me how to use Apache's .htaccess file to rewrite URLs so that requests return 403 if I don't like their shape:

RewriteEngine On
RewriteRule ^.*\.html/.*$ - [forbidden,last]
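
For anyone who wants to check which paths that pattern catches before trying it on a live server, here is a rough Python sketch. It assumes Python's re module behaves like Apache's regex engine for this simple pattern, and that in .htaccess the rule sees the path without its leading slash; the helper name would_be_forbidden is made up for the illustration.

```python
import re

# Same pattern as the RewriteRule above; Python's re stands in for Apache here.
PATTERN = re.compile(r"^.*\.html/.*$")

def would_be_forbidden(path: str) -> bool:
    """Return True if the path has a '/' right after some '.html'."""
    return PATTERN.search(path) is not None

samples = [
    "coins/hn/peloponnesus.html",                      # normal page
    "coins/hn/peloponnesus.html/nc/bmc/peloponnesus/", # bogus path from the log
    "coins/bmc/index.html",                            # normal page
]

for path in samples:
    print(path, would_be_forbidden(path))
```

Only the middle sample should come back True, so ordinary pages keep working while the bogus paths get the 403.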