Imagine the long hours old programmers spent hacking Z80 assembly on their Spectrums. All the discussions about the first generation of the Internet. The subcultures that appeared during the 90s. All of this is getting lost, piece by piece.
And what about personal blogs? Pieces of the lives of single individuals who dumped part of their consciousness on the Internet. Scientific papers and processes lost forever as publishers fail and their websites shut down. Early digital art, video games, climate data once published on the Internet and now gone, and many sources of news as well.
This is a known issue, and I believe the obvious approach of trying to preserve everything is going to fail, for practical reasons: it takes a lot of effort for zero economic gain, and the current version of the world is not exactly the best place for efforts that cost a lot of money and pay nothing back. This is why I believe that the LLMs' ability to compress information, even if imprecise, hallucinated, incomplete, is better than nothing. DeepSeek V3 is already a publicly available lossy compressed view of the Internet, as are other very large state-of-the-art models.
This will not bring back all the things we are losing, and we should try hard to support The Internet Archive and other similar institutions and efforts. But, at the same time, we should focus on a much simpler effort: making sure that the weights of publicly released LLMs do not get lost, and also making sure that the Archive is part of the pre-training set.