Addition of robots.txt is breaking the Internet Archive

This topic is locked from further discussion.

#1 Posted by tOrchie (261 posts) -

Is there any way the use of robots.txt could be undone? It is breaking all archived pages of this site, some of which are for pages that no longer exist except in archive form. See here. This is causing havoc at Wikipedia, where hundreds of archive references are now broken. Please see this conversation for more details, and please, someone fix this. Years of video game history are at stake.

#2 Edited by RobotOpBuddy (65497 posts) -

GS has always had a robots.txt to my knowledge, though it was changed significantly last month (and, looking at it, there were also changes at some point before that since I last looked). It now appears to whitelist specific bots/crawlers and then block every other one (the last entry in the robots.txt at this point in time), which would likely cause such problems. One possible fix would be whitelisting the specific crawlers for those sites, but obviously that's down to the GS staff and the archive sites to sort out, if such a thing is going to happen.

#3 Edited by machinesmith (25 posts) -

Hi! I'm glad someone's beaten me to this. I ran into this issue on the Wikipedia page for Tobal 2. The GS page in question is part of an article called "Games You'll Never Play That You Should", an article that can be found *NOWHERE* except in the archive. I took a gander at the robots.txt in question and I DO know how to fix this: just as GS has whitelisted quite a few sites' crawlers, doing the same for the Internet Archive is as simple as adding the following to the file:

User-agent: ia_archiver
Allow: /

While it could be a conscious decision, I feel this was more of a simple oversight on GS's part (i.e. they would've fixed this if they'd known this would happen) and equally a result of IA's exclusion policy; at least I hope that's the case. Either way, the only thing I'm not sure of is how to get this info/request out to the webmaster / robots.txt wrangler. Any help would be appreciated!

P.S. Of note, the EXACT page I pasted for "Games You'll Never Play That You Should" *is* archived on archive.today. It basically shows Tobal 2 and a small blurb; however, the REST of the article / countdown post is missing.

P.P.S. The issue, in case it isn't clear, is that before GameSpot's recent robots.txt update a user could access a crapton of articles and posts that Wikipedia uses as references. These articles / posts are STILL in the archive, but because of the change in that damn txt file, access is denied.
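
For anyone who wants to sanity-check the suggested entry, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not GameSpot's actual robots.txt: both rule sets below are made-up stand-ins for the "whitelist a few crawlers, then block everyone else" layout described in this thread, the `Googlebot` entry and the example URL are placeholders, and the helper name `is_allowed` is hypothetical. It also only shows how Python's parser reads the rules, which may differ from how the Internet Archive's own crawler interprets them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one whitelisted crawler, everyone else blocked.
# This is an illustrative stand-in, NOT the real GameSpot file.
WHITELIST_ONLY = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

# Same file with the entry proposed in this thread added before the catch-all.
WITH_ARCHIVE_ENTRY = """\
User-agent: Googlebot
Allow: /

User-agent: ia_archiver
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/some-feature"  # placeholder URL
    print("before:", is_allowed(WHITELIST_ONLY, "ia_archiver", url))
    print("after: ", is_allowed(WITH_ARCHIVE_ENTRY, "ia_archiver", url))
```

Under these assumptions the catch-all `User-agent: *` block is what shuts out `ia_archiver` in the first file, and the dedicated entry is what lets it back in for the second.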

#4 Posted by digitaldame (5401 posts) -

I pinged a developer about this; hopefully we'll get some info.

#5 Posted by RobotOpBuddy (65497 posts) -

I rechecked the robots.txt; it seems they've added this particular one, as well as a few others. It may take a little while for the archive site itself to update as necessary, however.

#6 Edited by machinesmith (25 posts) -

@DigitalDame and @robotopbuddy - I love you both so much! It works. Thank you for helping out!