The Code Archive

Presented at The Eleventh HOPE (2016), July 22, 2016, 6 p.m. (60 minutes)

Archiving web pages is hard. Crawling, images, assets... JavaScript! But archiving code is not. It comes as content-addressed objects neatly packaged in repositories and tagged with refs. It compresses well. Changes can be detected in real time with the GitHub Firehose API. Nevertheless, we need to do it today while the host is healthy, and not wait for it to start bundling adware or slowly fade away. Otherwise, in ten years we'll find ourselves running unreproducible binaries on JavaScript emulators, or unable to build the software that could recover all our pictures because that one dependency is missing.

This is a talk about building The Code Archive, a Wayback Machine for git. Every time a repository changes on GitHub, Code Archive systems fetch it and archive all the files, commits, tags, and branches as they were at that time. Then you can clone a repository as it was at any point in time, even if the original has been rebased, has disappeared, or GitHub is down. There's a lot of fun to be had when (ab)using the git protocol to clone and pull millions of repositories to the same database. Speakers will show what git looks like on the wire and how fetches are optimized. Also, all the Go code powering the Archive is available... on GitHub.
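
To make "content-addressed objects" concrete, here is a minimal sketch in Go of how git derives the ID of a file (a "blob"): it hashes a short header followed by the file's bytes with SHA-1. This is an illustration only, not code from The Code Archive; the function name gitBlobID is hypothetical.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// gitBlobID (hypothetical helper name) returns the SHA-1 object ID that git
// assigns to a blob: the hash of "blob <size>\x00" followed by the content.
func gitBlobID(content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(content)) // object header
	h.Write(content)                            // raw file bytes
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Should match the output of: echo hello | git hash-object --stdin
	fmt.Println(gitBlobID([]byte("hello\n")))
}
```

Because the ID depends only on the content, identical files in different repositories map to the same object, which is one reason many repositories can plausibly share a single store.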

Presenters:

  • Filippo Valsorda
    Filippo Valsorda is a systems and cryptography engineer at CloudFlare, where he kicked DNSSEC until it became something deployable. Nevertheless, he's probably best known for making popular online vulnerability tests, including the original Heartbleed test. He's really supposed to implement cryptosystems, not break them, but you know how it is.
  • Salman Aljammaz
    Salman Aljammaz is a programmer and occasional unicyclist. He's one of the developers of Camlistore, the content-addressed storage system on which The Code Archive runs.
