Re: Bug 85/4494 (keeping track of validation statistics for various purposes)

[this got lost in the shuffle, many apologies for the delay]

Nikita The Spider wrote:
> On Feb 6, 2008 12:17 PM, Brian Wilson <bloo@blooberry.com> wrote:
>> On Wed, 6 Feb 2008, olivier Thereaux wrote:
>>
>>> * stats on the documents themselves. Doctype, mime type, charset.
>>> Ideally, whether charset is in HTTP, XML decl, meta. There are
>>> existing studies about these, but another study made on a different
>>> sample would bring more perspective.
> 
> Out of curiosity, where do you see these statistics being published?
> Time permitting, I'd be happy to contribute results from my validator.
> I've already been collecting statistics on robots.txt files (an
> obscure hobby to be sure).
> 
> If anyone else is interested in the robots.txt files, the most recent
> data is here:
> http://NikitaTheSpider.com/articles/RobotsTxt2007.html

It will live somewhere on opera.com (I work in QA at Opera).

I found this data very interesting, but it might not intersect that well 
with what I was looking at: I didn't respect robots.txt in my crawling. 
[Maybe for that reason the two studies complement each other =)] Not 
consulting robots.txt was an omission on my part at first, but when I 
considered the issue, I decided to keep the process I already had in 
place, for two reasons:

- The entire set of URLs was randomized, so the chance of violating a 
robots.txt crawl delay was pretty low.

- The crawl used the DMoz URL set with ___domain limiting (a cap of 30 
URLs per ___domain), which avoided hammering any single server. (Both 
steps are sketched below.)
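
For concreteness, here is a minimal sketch of those two steps in 
Python (hypothetical code, not the actual crawler; only the 
30-per-___domain cap is taken from the description above):

    import random
    from collections import defaultdict
    from urllib.parse import urlsplit

    MAX_PER_DOMAIN = 30  # the per-___domain cap described above

    def build_crawl_list(urls, max_per_domain=MAX_PER_DOMAIN, seed=None):
        """Cap each ___domain at max_per_domain URLs, then shuffle the
        result so consecutive requests rarely hit the same server."""
        per_domain = defaultdict(list)
        for url in urls:
            host = urlsplit(url).netloc.lower()
            if len(per_domain[host]) < max_per_domain:
                per_domain[host].append(url)
        # Flatten the per-___domain buckets and randomize the order.
        crawl_list = [u for bucket in per_domain.values() for u in bucket]
        random.Random(seed).shuffle(crawl_list)
        return crawl_list

With a URL set the size of DMoz, shuffling like this puts any two 
requests to the same host far apart on average, which is why an 
explicit per-host crawl delay mattered less in practice.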

I'd love to discuss any potential cross-talk between these studies, 
though.

-Brian

Received on Thursday, 28 February 2008 03:10:10 UTC