[meta]

Finding and fixing dead links on this blog

    Table of contents
  1. Archiving linked pages
  2. Fixing dead links

I said from the beginning that gwern.net was a source of motivation for me to start writing, and I specifically linked to his article about archiving the URLs to which his blog links. It's time for me to start doing the same! I don't have quite as much to say, but I thought it'd be worth documenting. As it happens, Gwern recently said he's working on a new archiving system.

Archiving linked pages (§)

To preserve third-party pages, I use a browser extension called SingleFile. It saves the entire page into a single .html file by using data: URIs to embed images and other resources. Then I just upload that file along with my own article and link to it. At the moment I am not very concerned about cloning entire websites — I just need one page at a time.
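
I didn't write SingleFile, and this isn't its actual code, but the core trick is easy to sketch in Python: read each resource, base64 it, and put the result where the URL used to be.

    import base64
    import mimetypes

    def to_data_uri(path):
        # Guess the MIME type from the file extension, falling back to generic binary.
        mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
        with open(path, "rb") as handle:
            payload = base64.b64encode(handle.read()).decode("ascii")
        return f"data:{mime};base64,{payload}"

A page that referenced <img src="photo.jpg"> can instead carry the image inside itself as <img src="data:image/jpeg;base64,/9j/4AAQ...">.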

At first I was using web.archive.org to provide archive links, but now I prefer self-hosting SingleFile pages for a few reasons:

  1. While I expect IA to be around for a while, my goal is to reduce dependency on third parties. In some ways, depending on IA as a single third party instead of numerous third parties is better; in some ways it's worse. A single point of trust is also a single point of failure.

  2. IA does not give the reader an easy way to download the archived page for themselves. CTRL+S does not produce a good replica in most cases. With my SingleFile pages, you can just CTRL+S and get the exact same file that I have. My goal is to share information with you, so that's kind of important.

  3. IA might receive a DMCA takedown request and lose the content. If I want to preserve the linked page, I'll need to make my own copy of it anyway, so why do both IA and SingleFile when I can just do one?

  4. IA is fairly slow to load. I mean no offense to them, as they are a fantastic resource and I'm okay with them prioritizing capacity over speed, but it is true.

Sometimes I need to link to a newspaper or other blogger. These articles are often professionally written and need to be pasteurized for consumption by a sane audience. Hosting the HTML myself allows me to do this. Here's an article before (4.11 MB) and after (0.14 MB) I cleaned it up [1]. Oh, and that's with SingleFile's "remove scripts" option enabled. It was 9.75 MB with scripts, and I don't want to waste your bandwidth or mine by including that here. I shouldn't be too harsh on them; putting two dozen paragraphs of text into a document is really hard, and doing it efficiently requires a great engineering team like mine.
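
The cleanup itself is mostly editorial judgment, but the mechanical part can be scripted. As a rough sketch (not my actual workflow, and assuming the lxml library), stripping leftover scripts from a saved page looks something like:

    import lxml.html

    def strip_scripts(in_path, out_path):
        # Parse the saved page and remove every <script> element in one pass.
        doc = lxml.html.parse(in_path).getroot()
        for script in list(doc.iter("script")):
            script.drop_tree()
        with open(out_path, "wb") as handle:
            handle.write(lxml.html.tostring(doc))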

For what it's worth, others and I have noticed that a lot of news sites today are deathly afraid of including external links in their text; they'd rather provide a useless link to themselves than a useful link to a third party. I don't want to be like that, so I'll either include the original link alongside the archived one, or I'll edit the archived page to make its above-the-fold title a link back to the original URL.

[1] Base64 encoding of embedded resources makes them take up about a third more space, since every 3 bytes of input become 4 characters of output. That's the tradeoff for the convenience of having it packed into a single .html file. The original source for that article costs me about 3.7 MB over the wire.
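
That one-third figure is arithmetic, not an estimate; you can verify it yourself:

    import base64

    blob = b"\x00" * 3_000_000         # pretend this is a 3 MB image
    encoded = base64.b64encode(blob)
    print(len(encoded) / len(blob))    # 1.333..., one third larger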

Fixing dead links (§)

I wrote my own linkchecker.py because, for some reason, I like writing my own solutions instead of using other people's. It gives me a report organized by HTTP status and domain, and if a link has a problem, it tells me which article it's on.
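
I won't reproduce the whole script here, but the idea fits in a dozen lines. This is a simplified sketch rather than the real linkchecker.py; the names and the requests dependency are illustrative:

    import collections
    from urllib.parse import urlsplit

    import requests

    def check_links(links):
        # links: an iterable of (article, url) pairs scraped from my pages.
        # The report is grouped by HTTP status, then by domain.
        report = collections.defaultdict(lambda: collections.defaultdict(list))
        for (article, url) in links:
            try:
                response = requests.head(url, allow_redirects=True, timeout=10)
                status = response.status_code
            except requests.RequestException as exc:
                status = type(exc).__name__
            report[status][urlsplit(url).netloc].append((url, article))
        return report

Grouping by domain makes it obvious when an entire site has died, as opposed to when I've fat-fingered a single filename.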

The linkchecker immediately found multiple problematic links, all of which were my fault. My engineering team dropped the ball on this because they were too focused on innovating new ways to put paragraphs into documents, but the checker should reduce these incidents in the future.

Actually, it's a good thing if all the dead links are my fault, because those are easy to fix. As long as I continue making SingleFile archives, the majority of dead links I encounter should be resources that I renamed or images that I forgot to upload. I should be able to run the linkchecker on a cronjob and use operatornotify to get emails about it, after I fine-tune the error / warning levels. I tend to link to a lot of youtube videos which, of course, I won't be rehosting here due to their large file size, but if it's a video I really care about, I'll have downloaded a copy to my personal computer and can find some way of getting it to you.
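
I haven't set this up yet, so treat this crontab line as hypothetical; the path is made up, and the operatornotify wiring still depends on tuning those levels:

    # Hypothetical: check links every Monday at 06:00, with operatornotify
    # emailing me about anything noteworthy in the output.
    0 6 * * 1 cd /srv/voussoir.net && python linkchecker.py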

So, that's where we're at for now. If you see any other problems that I missed, you can send me an email.



Contact me: writing@voussoir.net
