Gwtar: a static efficient single-file HTML format
Gwtar is a new polyglot HTML archival format which provides a single, self-contained HTML file that can still be efficiently lazy-loaded by a web browser. This is done by JavaScript in the file's header making HTTP range requests. It is used on Gwern.net to serve large HTML archives.
- Background
- HTML Trilemma
- Trisecting
- Download Stopping Mechanisms
- Concatenated Archive Design
- Creation
- Implementation
- Header
- Details
- Fallback
- Compression
- Limitations
- Local Viewing
- Range Request Support
- Cloudflare Is Broken
- Optional Trailing Data
- FEC
- Signing
- Metadata
- IP
- Further Work
Archiving HTML files faces a trilemma: it is easy to create an archival format which is any two of static (self-contained ie. all assets included, no special software or server support), a single file (when stored on disk), and efficient (lazy-loads assets only as necessary to display to a user), but no known format allows all 3 simultaneously.
We introduce a new format, Gwtar (pronounced "guitar"; .gwtar.html extension), which achieves all 3 properties simultaneously. A Gwtar is a classic fully-inlined HTML file which has been processed into a self-extracting concatenated file: an HTML + JavaScript header followed by a tarball of the original HTML and assets. The header's JS stops web browsers from loading the rest of the file, loads just the original HTML, and then hooks requests and turns them into range requests into the tarball part of the file. Thus, a regular web browser loads what seems to be a normal HTML file, and all assets download only when they need to. In this way, a static HTML page can inline anything--such as gigabyte-size media files--but those will not be downloaded until necessary, even while the server sees just a single large HTML file which it serves as normal. And because it is self-contained in this way, it is forwards-compatible: no future user or host of a Gwtar file needs to treat it specially, as all functionality required is old, standardized web browser/server functionality.
Gwtar allows us to easily and reliably archive even the largest HTML pages, while still being user-friendly to read.
Example pages: "The Secret of Psalm 46" (vs original SingleFile archive--warning: 286MB download).
Background
Linkrot is one of the biggest challenges for long-term websites. Gwern.net makes heavy use of web page archiving to solve this; and due to quality problems and long-term reliability concerns, simply linking to the Internet Archive is not enough, so I try to create & host my own web page archives of everything I link.
There are 3 major properties we would like of an HTML archive format, beyond the basics of actually capturing a page in the first place: it should not depend in any way on the original web page, because then it is not an archive and will inevitably break; it should be easy to manage and store, so you can scalably create them and store them for the long run; and it should be efficient, which for HTML largely means that readers should be able to download only the parts they need in order to view the current page.
HTML Trilemma
No current format achieves all 3. The built-in web browser save-as-HTML format achieves single and efficient, but not static; save-as-HTML-with-directory achieves static (partially) and efficient, but not single; MHTML, MAFF, SingleFile, & SingleFileZ (a ZIP-compressed variant) achieve static and single, but not efficient; WARCs/WACZs achieve static and efficient, but not single (because while the WARC is a single file, it relies on a complex software installation like WebRecorder/ReplayWeb.page to display).
An ordinary 'save as page HTML' browser command doesn't work because "Web Page, HTML Only" leaves out most of a web page; even "Web Page, Complete" is inadequate because a lot of assets are dynamic and only appear when you interact with the page--especially images. If you want a static HTML archive, one which has no dependency on the original web page or domain, you have to use a tool specifically designed for this. I usually use SingleFile. SingleFile produces a static snapshot of the live web page, while making sure that lazy-loaded images are first loaded, so they are included in the snapshot.
SingleFile often produces a useful static snapshot. It also achieves another nice property: the snapshot is a single file, just a simple single .html file, which makes life so much easier in terms of organizing and hosting. Want to mirror a web page? SingleFile it, and upload the resulting single file to a convenient directory somewhere, boom--done forever. Being a single file is important on Gwern.net, where I must host so many files, and I run so many lints and checks and automated tools and track metadata etc. and where other people may rehost my archives.
However, a user of SingleFile quickly runs into a nasty drawback: snapshots can be surprisingly large. In fact, some snapshots on Gwern.net are over half a gigabyte! For example, the homepage for the research project "PaintsUndo: A Base Model of Drawing Behaviors in Digital Paintings" is 485MB after size optimization, while the raw HTML is 0.6MB. It is common for an ordinary somewhat-fancy Web 2.0 blog post like a Medium.com post to be >20MB once fully archived. This is because such web pages wind up importing a lot of fonts, JS, widgets and icons etc., all of which assets must be saved to ensure it is fully static; and then there is additional wasted space overhead due to converting assets from their original binary encoding into Base64 text which can be interleaved with the original HTML.
This is especially bad because, unlike with the original web page, anyone viewing a snapshot must download the entire thing. The original 500MB web page is possibly OK because a reader only downloads the images that they are looking at; but a reader of the archived version must download everything. A web browser has to download the entire page, after all, to display it properly; and there is no lazy-loading or ability to optionally load 'other' files--there are no other files 'elsewhere', that was the whole point of using SingleFile!
Hence, a SingleFile archive is static, and a single file, but it is not efficient: viewing it requires downloading unnecessary assets.
So, for some archives, we 'split' or 'deconstruct' the static snapshot back into a normal HTML file and a directory of asset files, using deconstruct_singlefile.php (which incidentally makes it easy to re-compress all the images, which produces large savings as many websites are surprisingly bad at basic stuff like PNG/JPG/GIF compression); then we are back to a static, efficient, but not single file, archive.
This is fine for our auto-generated local archives because they are stored in their own directory tree which is off-limits to most Gwern.net infrastructure (and off-limits to search engines & agents or off-site hotlinking), and it doesn't matter too much if they litter tens of thousands of directories and files. It is not fine for HTML archives I would like to host as first-class citizens, and expose to Google, and hope people will rehost someday when Gwern.net inevitably dies.
So, we could either host a regular SingleFile archive, which is static, single, and inefficient; or a deconstructed archive, which is static, multiple, and efficient--but either way, not all 3 properties.
This issue came to a head in January 2026 when I was archiving the Internet Archive snapshots of Brian Moriarty's famous lectures "Who Buried Paul?" and "The Secret of Psalm 46", since I noticed while writing an essay drawing on them that his whole website had sadly gone down. I admire them and wanted to host them properly so people could easily find my fast reliable mirrors (unlike the slow, hard-to-find, unreliable IA versions), but realized I was running into our long-standing dilemma: they would be efficient in the local archive system after being split, but unfindable; or if findable, inefficiently large and reader-unfriendly. Specifically, the video of "Who Buried Paul?" was not a problem because it had been linked as a separate file, so I simply converted it to MP4 and edited the link; but "The Secret of Psalm 46" turned out to inline the OGG/MP3 recordings of the lecture and abruptly increased from <1MB to 286MB.
I discussed it with Said Achmiz, and he began developing a fix.
Trisecting
To achieve all 3, we need some way to download only part of a file, and selectively download the rest. This lets us have a single static archive of potentially arbitrarily large size, which can safely store every asset which might be required.
HTTP already easily supports selective downloading via the ancient HTTP Range query feature, which allows one to query for a precise range of bytes inside a URL. This is mostly used to do things like resume downloads, but you can also do interesting things like run databases in reverse: a web browser client can run a database application locally which reads a database file stored on a server, because Range queries let the client download only the exact parts of the database file it needs at any given moment, as opposed to the entire thing (which might be terabytes in size).
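For concreteness, here is a minimal sketch of a client-side Range query using fetch(); the URL and byte offsets are placeholders for illustration, not part of any Gwtar spec:

    // Fetch only bytes `start` through `end` (inclusive) of a remote file.
    async function fetchRange(url, start, end) {
        const response = await fetch(url, {
            headers: { "Range": `bytes=${start}-${end}` },
        });
        // A server honoring Range requests answers "206 Partial Content";
        // one that ignores them answers "200 OK" with the whole file.
        if (response.status !== 206)
            console.warn("Server returned a full (non-partial) response");
        return new Uint8Array(await response.arrayBuffer());
    }

    // Example: pull a 3KB slice out of a possibly multi-gigabyte archive.
    fetchRange("/docs/example-archive.bin", 1024, 4095)
        .then(bytes => console.log(`got ${bytes.length} bytes`));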
This is how formats like WARC can render efficiently: host a WARC as a normal file, and then simply range-query the parts displayed at any moment.
The challenge is the first part: how do we download only the original HTML and subsequently only the displayed assets? If we have a single HTML file and then a separate giant archive file, we could easily just rewrite the HTML using JS to point to the equivalent ranges in the archive file (or do something server-side), but that would achieve only static and efficiency, not single file. If we combine them, like SingleFile, we are back to static and single file, but not efficiency.
The simplest solution here would be to complicate the server itself and do the equivalent of deconstruct_singlefile.php on the fly: an HTML request, perhaps detected by some magic string in the URL like .singlefile.html, is handed to a CGI proxy process, which splits the original single HTML file into a normal HTML file with lazy-loaded references. The client browser sees a normal, multiple-file, efficient HTML page, while everything on the server side sees a static, single, inefficient HTML file. (A possible example is WWZ.)
While this solves the immediate Gwern.net problem, it does so at the permanent cost of server complexity, and does not do much to help anyone else. (It is unrealistic to expect more than a handful of people to modify their servers this invasively.) I also considered taking the WARC red pill and going full WebRecorder, but quailed.
Download Stopping Mechanisms
How can we trick an HTML file into acting like a tarball or ZIP file, with partial random access?
Our initial approach was to ship an HTML + JS header with an appended archive, where the JS would do HTTP Range queries into the appended binary archive; the challenge, however, was to stop the file from downloading past the header. To do this, we considered some approaches 'outside' the page, like encoding the archive index into the filename/URL itself (ie. foo.gwtar-$N.html) and requiring the server to parse $N out and slice the archive down to just the header, which then handled the range requests; this minimized how much special handling the server did, while being backwards/forwards-compatible with non-compliant servers (which would ignore the index and simply return the entire file, and be no worse than before). This worked in our prototypes, but required at least some server-side support and also required that the header be fixed-length (because any change in length would invalidate the index).
Eventually, Achmiz realized that you can stop downloading from within an HTML page, using the JS command window.stop()! MDN (>96% support, spec):
The window.stop() method stops further resource loading in the current browsing context, equivalent to the stop button in the browser. Because of how scripts are executed, this method cannot interrupt its parent document's loading, but it will stop its images, new windows, and other still-loading objects.
This is precisely what we need, and the design falls into place.
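As a rough sketch (not the actual Gwern.net header code), the header boils down to an inline script that calls window.stop() as soon as it executes, so the appended archive data is never downloaded wholesale:

    <html>
    <head>
    <meta charset="utf-8">
    <script>
      /* Sketch only: halt all further loading of this page the moment the
         parser reaches this script, so the archive payload appended after
         the header is not downloaded by default. The real header would then
         range-request the original HTML out of the payload and render it. */
      window.stop();
    </script>
    </head>
    <body></body>
    </html>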
Concatenated Archive Design
A Gwtar is an HTML file with an HTML + JS + JSON header followed by a tarball and possibly further assets. (A Gwtar could be seen as almost a polyglot file, ie. a file valid as more than one format: in this case, a .html file that is also a .tar archive, and possibly a .par2. But strictly speaking, it is not.)
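To illustrate (this is a sketch, not the actual Gwtar header or index format), the header JS can keep a small JSON index mapping asset paths to byte ranges inside the appended tarball, and lazily materialize each asset as a Blob URL via a range request against the Gwtar file itself; all names and offsets below are made up:

    // Hypothetical index embedded in the header (not the real Gwtar format):
    // byte offset where the tarball begins, plus per-asset offsets/lengths.
    const index = {
        tarOffset: 123456,
        assets: {
            "images/figure1.png": { offset: 2048, length: 51200, type: "image/png" },
        },
    };

    // Range-fetch one asset out of this very file and wrap it in a Blob URL.
    async function loadAsset(name) {
        const { offset, length, type } = index.assets[name];
        const start = index.tarOffset + offset;
        const end = start + length - 1;           // Range is inclusive of `end`
        const resp = await fetch(location.href, {
            headers: { "Range": `bytes=${start}-${end}` },
        });
        return URL.createObjectURL(new Blob([await resp.arrayBuffer()], { type }));
    }

    // Hook: when an <img data-gwtar-src="..."> actually needs to display,
    // swap in the real bytes.
    async function hydrate(img) {
        img.src = await loadAsset(img.dataset.gwtarSrc);
    }

A real header must also handle details this sketch ignores, such as tar record headers, compression, and servers that do not support Range requests (see the Fallback, Compression, & Range Request Support sections).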
Creation
We provide a reference PHP script, deconstruct_singlefile.php, which creates Gwtars from SingleFile HTML snapshots.
It additionally tries to recompress JPGs/PNGs/GIFs before storing them in the Gwtar, and then appends PAR2 FEC.
Example command to replace the original 2010-02-brianmoriarty-thesecretofpsalm46.html by 2010-02-brianmoriarty-thesecretofpsalm46.gwtar.html with PAR2 FEC:
php ./static/build/deconstruct_singlefile.php --create-gwtar --add-fec-data \
    2010-02-brianmoriarty-thesecretofpsalm46.html
Implementation
Header
The first line of the header is the magic HTML string <html>, and the final line is another magic HTML string.
[The remaining sections (Details through Further Work) survive here only as fragments of a shell snippet, apparently from the Signing section, which embeds a GPG signature in the Gwtar inside an HTML comment delimited by GWTAR-GPG-SIG markers and later extracts and verifies it with sed/grep.]