rone | they'd better have it fixed soon


Regarding a slight problem at Google Blogsearch that tongodeon mentioned:

Thank you for your note. We understand your concern in this matter, and
we'd like to reassure you that Blog Search (beta) was designed to respect
robots.txt files and both the NOINDEX and NOFOLLOW meta tags.

We're aware of the problem you're reporting, and our engineers are
currently investigating. We hope to have this fixed soon. We appreciate
your patience, and thanks for taking the time to write.

Regards,
The Google Team

Current Music: Patty Larkin - Dear Diary


From:

zonereyrie.livejournal.com

My understanding of the problem is that Google Blog Search is indexing feeds - RSS, Atom, etc. The feeds don't have a way to indicate noindex/nofollow.

Security through obscurity is never a great plan anyway - while the normal Google Bot may be well behaved, not all bots are, so even if you use the bot blocking option on LJ some bots will index anyway.
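To make the gap concrete, here is a minimal sketch of how an indexer might read the NOINDEX/NOFOLLOW meta tag from an HTML page. The class and function names are made up for illustration and this is not how Google or LJ actually does it; the point is only that an RSS or Atom document has no equivalent element for a crawler to find.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(markup):
    """Hypothetical helper: return the set of robots meta directives found."""
    parser = RobotsMetaParser()
    parser.feed(markup)
    return parser.directives


# An HTML page can carry the tag...
page = '<html><head><meta name="robots" content="NOINDEX, NOFOLLOW"></head></html>'
# ...but a feed has nowhere standard to put it, so nothing is found.
feed = "<rss><channel><title>example</title></channel></rss>"

page_directives = robots_directives(page)
feed_directives = robots_directives(feed)
```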

From:

ronebofh.livejournal.com

Security through obscurity is never a great plan anyway - while the normal Google Bot may be well behaved, not all bots are, so even if you use the bot blocking option on LJ some bots will index anyway.

Maybe so, but this convention has hung around for a long time and has been respected; perhaps an extension or encapsulation of robots.txt can be thrown into the RSS feed, or something along those lines. I'd hesitate to call it "security through obscurity" except in the broadest of senses.

From:

nothings.livejournal.com

Isn't a feed clearly being read by a bot? So it's not an issue of there needing to be a robots.txt _in_ the RSS feed; instead, LJ just shouldn't be generating RSS feeds if people have marked no-robots.

From:

ronebofh.livejournal.com

Hmmm, tricky. I think that's definitely sensible. Ideally, those of us who want to keep robots out but wouldn't mind being syndicated should be able to allow people to download our RSS feed.

From:

nothings.livejournal.com

Ok, but now you're saying "there's bots and there's bots". This used to be well-defined: a bot was something which downloaded stuff without human intervention. I guess a directly-read RSS feed is still a human, but as soon as you aggregate it's a bot?

It sounds like people actually want something different from robots.txt, then (which requires you to whitelist and blacklist by 'bot user-agent' names); you really want 'do not allow searching', 'do not allow full-text searching', or some other thing. ('do not allow aggregation'?)
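As a rough illustration of the user-agent point, here is a small sketch using Python's standard `urllib.robotparser`. The bot names, rules, and URLs are invented for the example; it only shows that robots.txt decides by who is asking and which path, and has no way to say "no searching" or "no aggregation".

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot is shut out entirely,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleFeedBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_fetch(agent, url):
    """Return True if the rules above allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


results = {
    "feedbot": can_fetch("ExampleFeedBot", "http://example.com/data/rss"),
    "other": can_fetch("SomeOtherBot", "http://example.com/data/rss"),
    "other_private": can_fetch("SomeOtherBot", "http://example.com/private/x"),
}
```

The rules only ever see a user-agent string and a path, so what the bot does with the content afterwards (index it, aggregate it, just display it) is invisible to them.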

From:

ronebofh.livejournal.com

Yeah, we just might have outgrown robots.txt or something. I guess nobody before Google decided to suck down sites via RSS and index them to allow searches.

I guess a directly-read RSS feed is still a human, but as soon as you aggregate it's a bot?

Nah, that's a dumb distinction, you're right. Either the multiple versions of RSS change to adopt a convention similar to robots.txt that other places can then follow or ignore, or the RSS feeder needs to lock that down so people need to request a feed. The latter is by far the more feasible one.