rone: (Default)
[personal profile] rone

Regarding a slight problem at Google Blogsearch that [livejournal.com profile] tongodeon mentioned:

Thank you for your note. We understand your concern in this matter, and
we'd like to reassure you that Blog Search (beta) was designed to respect
robots.txt files and both the NOINDEX and NOFOLLOW meta tags.

We're aware of the problem you're reporting, and our engineers are
currently investigating. We hope to have this fixed soon. We appreciate
your patience, and thanks for taking the time to write.

Regards,
The Google Team

Date: 2005-09-16 01:03 am (UTC)
From: [identity profile] zonereyrie.livejournal.com
My understanding of the problem is that Google Blog Search is indexing feeds - RSS, Atom, etc. The feeds don't have a way to indicate noindex/nofollow.

Security through obscurity is never a great plan anyway - while the normal Google Bot may be well behaved, not all bots are, so even if you use the bot blocking option on LJ some bots will index anyway.

Date: 2005-09-16 06:25 am (UTC)
From: [identity profile] ronebofh.livejournal.com
Security through obscurity is never a great plan anyway - while the normal Google Bot may be well behaved, not all bots are, so even if you use the bot blocking option on LJ some bots will index anyway.

Maybe so, but this convention has hung around for a long time and has been respected; perhaps an extension or encapsulation of robots.txt can be thrown into the RSS feed, or something along those lines. I'd hesitate to call it "security through obscurity" except in the broadest of senses.

Date: 2005-09-16 03:57 pm (UTC)
From: [identity profile] nothings.livejournal.com
Isn't a feed clearly being read by a bot? So it's not an issue of there needing to be a robots.txt _in_ the RSS feed; instead, LJ just shouldn't be generating RSS feeds if people have marked no-robots.

Date: 2005-09-16 04:17 pm (UTC)
From: [identity profile] ronebofh.livejournal.com
Hmmm, tricky. I think that's definitely sensible. Ideally, those of us who want to keep robots out but wouldn't mind being syndicated should be able to allow people to download our RSS feed.

Date: 2005-09-16 04:40 pm (UTC)
From: [identity profile] nothings.livejournal.com
Ok, but now you're saying "there's bots and there's bots". This used to be well-defined: a bot was something which downloaded stuff without human intervention. I guess a directly-read RSS feed is still a human, but as soon as you aggregate it's a bot?

It sounds like people actually want something different from robots.txt, then (which requires you to whitelist and blacklist by 'bot user-agent' names); you really want 'do not allow searching', 'do not allow full-text searching', or some other thing. ('do not allow aggregation'?)
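To make the user-agent point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot names (`ExampleFeedBot`, `SomeOtherBot`), the paths, and the `example.com` URLs are made up for illustration; nothing here reflects what LJ or Google actually serve or send. It only shows that robots.txt rules are keyed on the crawler's name, not on what the crawler intends to do with the content.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: bot names and paths are illustrative assumptions.
ROBOTS_TXT = """\
User-agent: ExampleFeedBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
feed_url = "http://example.com/data/rss"

# The named bot is shut out of everything, including the feed.
feed_bot_allowed = parser.can_fetch("ExampleFeedBot", feed_url)

# Any other bot falls through to the "*" rules, which only block /private/.
other_bot_allowed = parser.can_fetch("SomeOtherBot", feed_url)
private_allowed = parser.can_fetch(
    "SomeOtherBot", "http://example.com/private/entry.html"
)

print(feed_bot_allowed, other_bot_allowed, private_allowed)
```

Note that there is no way in this format to say "you may fetch the feed but not index it for search"; the only knobs are which agent and which path.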

Date: 2005-09-16 04:45 pm (UTC)
From: [identity profile] ronebofh.livejournal.com
Yeah, we just might have outgrown robots.txt or something. I guess nobody before Google had decided to suck down sites via RSS and index them to allow searches.

I guess a directly-read RSS feed is still a human, but as soon as you aggregate it's a bot?

Nah, that's a dumb distinction, you're right. Either the multiple versions of RSS change to adopt a convention similar to robots.txt that other places can then follow or ignore, or the RSS feeder needs to lock that down so people need to request a feed. The latter is by far the more feasible one.
