- From: Brian Wilson <bloo@blooberry.com>
- Date: Thu, 28 Feb 2008 04:09:38 +0100
- To: Nikita The Spider The Spider <nikitathespider@gmail.com>
- CC: www-validator@w3.org
[this got lost in the shuffle, many sorries for the delay]

Nikita The Spider The Spider wrote:
> On Feb 6, 2008 12:17 PM, Brian Wilson <bloo@blooberry.com> wrote:
>> On Wed, 6 Feb 2008, olivier Thereaux wrote:
>>
>>> * stats on the documents themselves. Doctype, mime type, charset.
>>> Ideally, whether charset is in HTTP, XML decl, meta. There are
>>> existing studies about these, but another study made on a different
>>> sample would bring more perspective.
>
> Out of curiosity, where do you see these statistics being published?
> Time permitting, I'd be happy to contribute results from my validator.
> I've already been collecting statistics on robots.txt files (an
> obscure hobby to be sure).
>
> If anyone else is interested in the robots.txt files, the most recent
> data is here:
> http://NikitaTheSpider.com/articles/RobotsTxt2007.html

It will live somewhere on opera.com (I work in QA at Opera).

I found this data very interesting, but it might not intersect that
well with what I was looking at... actually, I didn't respect
robots.txt in my crawling. [maybe for that reason, the two studies
complement each other =)]

Not consulting robots.txt was an omission on my part at first, but when
I considered the issue, I decided to keep using the process I already
had in place:

- The entire set of URLs was randomized, so the chance of violating a
  robots.txt crawl delay was pretty low.
- The crawl used the DMoz URL set, with ___domain limiting (a cap of 30
  URLs per ___domain). This avoided hammering any single server (the
  capping step is sketched in the P.S. below).

I'd love to discuss more about any potential cross-talk between these
studies, though.

-Brian
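P.S. For concreteness, the ___domain-limiting and randomization steps
amount to something like the following. This is a minimal Python
sketch, not my actual crawler code; the function name, the seed
parameter, and the helper structure are made up for the example,
though the 30-per-___domain cap matches what I described above.

    import random
    from collections import defaultdict
    from urllib.parse import urlsplit

    def cap_and_shuffle(urls, per_domain_cap=30, seed=None):
        """Keep at most per_domain_cap URLs per host, then shuffle
        the whole list so consecutive requests rarely hit the same
        server."""
        by_host = defaultdict(list)
        for url in urls:
            host = urlsplit(url).hostname or ""
            # Drop any URLs beyond the per-host cap (30 by default).
            if len(by_host[host]) < per_domain_cap:
                by_host[host].append(url)
        # Flatten the per-host buckets back into one list...
        capped = [u for bucket in by_host.values() for u in bucket]
        # ...and randomize the crawl order across all domains.
        random.Random(seed).shuffle(capped)
        return capped

With a large enough ___domain mix, plain shuffling already spreads the
load thinly enough that per-host delays become a non-issue, which is
why I felt comfortable skipping robots.txt entirely.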
Received on Thursday, 28 February 2008 03:10:10 UTC