Best File Formats for Archiving
This guide compares common file formats for the purpose of digital archiving and preservation. It also discusses how to choose a resolution for images, and how to choose a sampling rate and a bit rate for MP3 audio files.

Disclaimer: All of the below is provided as my personal opinion only, without guarantee for completeness or correctness.

 


Why archive your data?

We do not often realize it, but much of our life is nowadays digital.

We often take it for granted that this data will live on, just as paper does. The paper diaries from our childhood still exist. Old books and documents survive for centuries. We can still see the original US Declaration of Independence (more than 200 years old), the Gutenberg Bible (more than 500 years old), and even the Cuthbert Gospel (1300 years old). Scriptures carved in stone survive for millennia (the Code of Hammurabi is about 3700 years old).

Digital data does not survive in this way. There is first the problem of the storage medium. Physical storage devices change roughly every 10 years: it used to be floppy disks, then it was CDs, then DVDs, then flash drives (USB sticks), and now the cloud. Whenever a new technology comes up, support for the older technologies fades out. Today, floppy drives have all but disappeared. Furthermore, the devices themselves have a life span of about 10 years at best. After that time, they forget their data. Hard drives, for example, typically last around 5 years before they start becoming faulty. We may think that the cloud is the solution. However, cloud companies, likewise, may cease to exist. The only way to keep your data alive despite these changes is to constantly copy it from the older technology to the newer one.

At the same time, media technology changes so rapidly that even long-lived media is likely to be threatened by obsolescence before its useful life is over. Think of file formats: Have you ever tried to open a “WRI” file on your computer? This was a popular document format. Today, few programs can read such files. One day, they may become inaccessible to the average user. The same fate may one day strike today’s MP3, DOCX, or JPG files. This leads to what has been called the Digital dark age: the inability to read historical electronic documents and multimedia, because they have been recorded in an obsolete and obscure file format. To prevent this from happening, you have to choose your archiving formats wisely. This guide will help you.

Types of File Formats

Open vs Proprietary File Formats

Proprietary File Formats

A “proprietary” file format is a format that is developed by one particular company. Examples are Microsoft Office documents (DOCX, XLSX, and the like), or Adobe Flash movies (SWF). These formats come in various flavors:
  1. Undocumented formats have no public documentation. Thus, nobody can easily write software that reads these files. The files can be read only by the software of the company that created the format. Examples are the archive format RAR, the Corel drawing format CDR, or Microsoft’s WMA audio format.
  2. Documented file formats have documentation. However, this documentation might not be available for free. For example, the standards of the International Organization for Standardization (ISO) are not free.
  3. For other documented file formats, the documentation is available for free. Thus, people can write software to process these files. However, these file formats may be encumbered by software licenses, patents, or intellectual property rights. Thus, whoever writes software to read these formats may have to pay the company. This is the case for MP3 and HEVC.
  4. Some formats were supposedly free of known patent issues, but then some other company started claiming intellectual property rights in retrospect. This has been the case for JPG.
  5. Some formats are known to be free of patent issues, either because claims to intellectual property have been rejected, or because the company has renounced its claims. Still, the company may implement the format slightly differently from what has been documented, or may decide to change the format in the future. Therefore, the documents often display differently in software from other developers.
  6. Some formats are standardized. This means that they have been submitted to a standards organization, which has documented the format. This makes the format a bit more resilient to unilateral change.

Still, proprietary file formats are de facto under the control of a single company. Software by other vendors often does a worse job at displaying them. So if you don’t have the main company’s software, if it is not available for your operating system, or if the company stops producing it, you lose access to your data.

Open File Formats

In view of the shortcomings of proprietary file formats, people have developed “open” file formats. These come again in a variety of flavors. The main criteria of an open format are:
  1. The format is fully documented and publicly available.
  2. The format is free from copyright restrictions, intellectual property claims, or restrictive licenses.
  3. The further development of the format is decided by a vendor-independent standards organization or a community (e.g., in the form of an open source development community).

Theoretically, anybody can write software to read open file formats. This reduces the dependency on a single company. Furthermore, an open standard is developed by a standards organization or a community, with the goal of making the format as general and well thought out as possible. Examples of open file formats are the open office formats (the document format ODT, the spreadsheet format ODS, etc.), the Web formats (the image format SVG, the document format HTML, etc.), and a number of other formats (such as the document format PDF or the image format PNG).

Quasi-Open File Formats

Open file formats are free from royalty claims. They are developed by a vendor-independent standards organization or a community. There are file formats that do not exactly fulfil these requirements, but that are equivalent to open file formats for all practical purposes.

Some of these file formats are not developed by a vendor-independent standards organization or a community. However, their development has practically stabilized. They do not change any more. Thus, the formats are equivalent to open formats for practical purposes. This includes formats such as the archiving format ZIP.

Some formats are under intellectual property claims, but these are disputed or unenforced. Thus, most common users are unaware of these claims, or choose to ignore them. Such formats include the audio format MP3, and the movie format MP4.

I call these formats “quasi-open file formats”. From a user’s point of view, they are roughly equivalent to the open file formats.

Recommendation

Generally, open and quasi-open file formats are more suitable for archiving purposes. I group these together as “non-proprietary formats”.

Some open formats have a hard time catching on, because the proprietary competitor formats are being developed and pushed by big companies. At the same time, the boundary between open and proprietary formats is nowadays fuzzy: the big companies often standardize their formats publicly, they promise to abstain from patent claims, and they also support the open formats. Conversely, due to the ubiquity of the proprietary formats, open software can usually read the proprietary formats, and even the software of one company can often read the proprietary formats of the other companies. Some proprietary formats are so standardized and so well-established (most notably the Microsoft Office formats) that they will most likely remain supported by software for years to come.

We will discuss the best file formats for each type of data below.

Maturity

Established File Formats

When you choose a file format, you want to make sure that it stands the test of time. It is hard to predict whether a file format will still be around in the future, but the following can be indicators:
  1. The file format has already been around for a long time.
  2. The file format is supported by several vendors (and not just by a single company).
  3. The file format is platform-independent, i.e., it enjoys support on Windows machines, Macs, and Unix-based systems.

I call file formats that meet these criteria “established”. They are the main focus of this guide. Established file formats are, e.g., the image format JPG, the audio format MP3, or the document format HTML. Some proprietary formats are also quite established, in particular the Microsoft Office formats (DOCX, XLSX, etc.).

Open Browser Formats

There is a special class of file formats that I call “Open browser formats”. These are open file formats, i.e., they are developed by a community or an association of several parties. Their main strong point is that they can be displayed by the major Web browsers. No other software is needed to read the files. Since every major operating system nowadays has one or several Web browsers, these file formats are platform-independent and vendor-independent.

Some of these open browser formats have only recently been pushed into the limelight by important players such as Google, Mozilla, Wikipedia, or Apple. These include, e.g., the audio format Opus or the movie format WebM. Thanks to the support by the big players, software for editing these formats has already been written or is at least under way. All of this may indicate that the formats have good chances of becoming established in the future. That said, implementation by the major browsers is no guarantee for the future. For example, the browsers have stopped supporting SVG fonts. Likewise, OGG+Theora was first recommended by the W3C, and that recommendation was then retracted.

Recommendation

Generally, one should use only established file formats for archiving.

Idealists may also consider the Open Browser Formats. We will discuss the best file formats for each type of data below.

Lossy vs Lossless File Formats

Lossless file formats

A lossless file format stores the data exactly as it was originally produced or obtained. The majority of file formats are lossless. An office document, e.g., will store the text exactly as you typed it. But think of audio file formats: Some frequency combinations are inaudible to the human ear. So should we really store them in the file? If we remove them, the file becomes around 10 times smaller. That is what lossy file formats do. Lossless formats, in contrast, keep every detail, even if it is imperceptible. The choice between lossy and lossless formats applies generally to image data, video data, and audio data.
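The “around 10 times smaller” figure can be checked with a bit of arithmetic. The sketch below is my own back-of-the-envelope calculation, not a measurement: it assumes CD-quality uncompressed audio (44,100 samples per second, 16 bits, 2 channels) and an MP3 at 128 kbit/s. Both are illustrative assumptions; other bit rates give other ratios.

```python
# Assumed source: CD-quality PCM (44,100 samples/s, 16 bits per sample, stereo).
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2
MP3_KBPS = 128  # assumed MP3 bit rate in kbit/s


def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Bit rate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def size_mb(bitrate_kbps, seconds):
    """File size in megabytes (10^6 bytes) at a constant bit rate."""
    return bitrate_kbps * 1000 * seconds / 8 / 1_000_000


uncompressed_kbps = pcm_bitrate_kbps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
print(uncompressed_kbps)                       # 1411.2
print(round(uncompressed_kbps / MP3_KBPS, 1))  # 11.0
```

Under these assumptions, a four-minute song shrinks from roughly 42 MB to a bit under 4 MB, which matches the order of magnitude given above.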

Common lossless formats include PNG for images, and FLAC for audio data. Lossless formats are interesting for archiving for two main reasons: First, they allow future use of the data for applications that were not originally envisaged. In the example of the audio file, a DJ may want to artificially slow down the record, and mix it with other audio files. Data losses that were once imperceptible will then suddenly be striking. The second argument for lossless formats is that we can never be sure how long file formats will persist in the future. One day, we may be obliged to convert our files into a newer format. If the newer format is also lossy, then the little losses will add up, ultimately degrading the quality of the file.
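To see how losses can add up across conversions, here is a toy model. It is only a sketch: real codecs such as MP3 or JPG work very differently, and the “formats” A and B below are hypothetical stand-ins that simply round values to a coarse grid (steps of 50 and 70, chosen arbitrarily).

```python
def quantize(value, step):
    """Toy lossy codec: round a non-negative integer to the nearest multiple of step."""
    return (value + step // 2) // step * step


def max_error(original, decoded):
    """Largest absolute deviation between original and decoded values."""
    return max(abs(a - b) for a, b in zip(original, decoded))


original = [120, 37, 64]

# One lossy step: original -> hypothetical format B (step 70).
direct = [quantize(v, 70) for v in original]

# Two lossy steps: original -> hypothetical format A (step 50) -> format B (step 70).
via_a = [quantize(quantize(v, 50), 70) for v in original]

print(max_error(original, direct))  # 33
print(max_error(original, via_a))   # 50
```

In this example, the detour through a second lossy format leaves the data further from the original than a single conversion does.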

Thus, lossless formats are generally to be preferred for archiving purposes.

Resolution of lossless file formats

Lossless file formats keep every detail of the data, even if it is imperceptible to humans. This applies in particular to audio files, video files, and images. However, even lossless file formats cannot mirror reality completely. This is because reality is analog, and file formats are digital. To see this, think of the sine waveform produced by a sound (shown on the right). A vinyl record of that sound will contain an engraving of exactly that waveform. A computer cannot do that: it has to digitize the sound wave, i.e., to break it down into small steps. Even the best digital recorders have to do that.
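The idea of breaking a wave down into small steps can be sketched in a few lines. This is a simplified illustration under my own assumptions (a pure sine tone, plain rounding to signed integers); the function name and parameters are made up for this example, and real recorders do considerably more.

```python
import math


def digitize_sine(freq_hz, sample_rate_hz, bits, n_samples):
    """Sample a sine wave and round each sample to a signed integer of `bits` bits."""
    max_level = 2 ** (bits - 1) - 1  # e.g. 32767 for 16 bits
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                        # time of the n-th sample
        analog = math.sin(2 * math.pi * freq_hz * t)  # "real" value in [-1, 1]
        samples.append(round(analog * max_level))     # nearest representable step
    return samples


# Same 440 Hz tone at two resolutions: few coarse steps vs many fine steps.
coarse = digitize_sine(440, 8000, 4, 8)
fine = digitize_sine(440, 44100, 16, 8)
print(coarse)
print(fine)
```

The coarse version can only take values between -7 and 7, so neighboring samples jump visibly. The fine version has 65,535 possible levels and more samples per second, so it follows the wave much more closely, at the price of more data.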

The same is true for digital cameras: They can only mirror reality with the number of pixels they have. Anything that is smaller than one of their pixels cannot be captured.

Thus, even lossless file formats can mirror reality only up to a certain degree of precision. I call this degree the “resolution”. The higher this resolution, the better reality is captured — and the larger the file will be. Generally, one aims at a resolution that is so high that the human cannot distinguish the recording from reality. Lossless formats are then lossless in this sense.

In some cases, the source has already been digitized. Think of a CD that you want to rip to your computer. In these cases, a lossless copy of the CD to your hard drive is completely lossless with respect to that source.

Lossy file formats

As we have seen, lossless file formats mirror the input as closely as possible, up to a certain resolution. A lossy file format loses even more data: it throws away details of the data that can hardly be perceived anyway by a human. For example, the audio format MP3 is a lossy file format: it removes frequency combinations that cannot be perceived by a human anyway. This results in smaller file sizes. The image format JPG does the same thing for images: It throws away details in a picture that humans are unable or unlikely to perceive. These file formats typically let the user choose the compression ratio, i.e., the amount of detail that is thrown away. Higher compression ratios produce smaller files and throw away more details.

As we have argued before, lossy file formats are generally less adequate for archiving. The only point to be made in favor of lossy file formats is their smaller size.

That said, it does not make sense to convert lossy file formats into lossless ones. Data that has been removed will never come back anyway. Thus, if the primary form of the file you have at hand is a lossy file format, you can just keep it the way it is.

Vector file formats

Lossless file formats can mirror reality up to a certain resolution. They cannot, e.g., mirror a sine waveform in infinite detail. In the same way, a digital camera cannot take a perfect picture of a circle. There will always be pixels when we zoom in. But what if we knew that the picture should contain a circle? Couldn’t we just simply tell the computer “It’s a circle”?

It turns out that this is possible to some degree. It is not possible when taking pictures of nature. But it is possible when we do drawings on the computer — e.g., for a slide presentation or in a drawing program. We draw a circle, and tell the machine “it’s a circle”. The file then stores “it’s a circle”, and the next time we open the file, the machine draws a circle. Since the machine knows it’s a circle, we can zoom in infinitely without pixels ever appearing. This is what the vector image format SVG does.
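To make this concrete, here is a small sketch that writes such an “it’s a circle” description as SVG. The helper name and the drawing size are my own choices for illustration; the point is only that the file stores a center and a radius rather than pixels.

```python
import xml.etree.ElementTree as ET


def circle_svg(cx, cy, r, size=100):
    """Return a minimal SVG document that describes a single circle."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {size} {size}">'
        f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="none" stroke="black"/>'
        "</svg>"
    )


svg = circle_svg(50, 50, 40)
ET.fromstring(svg)  # raises an error if the markup is not well-formed XML
print(svg)
print(len(svg.encode("utf-8")), "bytes")
```

Saved to a file with the extension .svg, this text can be opened in a Web browser. The description stays the same size no matter how far you zoom in.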

The same thing can be achieved to some degree with music. If we know that a piano plays a certain sequence of notes, then there is no need to digitize the wave form. We can just tell the machine “A piano plays this sequence of notes”. This is what the audio format MIDI stores.

I will call these formats “Vector formats”. Compared to lossy and lossless formats, vector formats are ideal for archiving. First, they keep every detail. Second, they usually produce considerably smaller files than the lossy or lossless formats. However, vector formats can be used only when the image or sound is described explicitly.

Recommendation

For audio and image material, we have the choice between lossless formats, lossy formats, and vector formats. Generally, vector formats are the best formats for archiving, because they mirror the data exactly. However, they can only work if the underlying data is vectorized. If that is not the case, lossless formats are the way to go. Their resolution should be chosen so high that a human cannot perceive the difference to the original. Finally, lossy formats should generally be avoided for archiving. However, if you have files in lossy formats lying around, you can just keep them. It does not make sense to convert them to lossless formats, because the lost details will never come back anyway. In particular, it does not make sense to convert lossy file formats to other lossy file formats, because this will only amplify the losses.

We will discuss the best file formats for each type of data below.

Locked-in file formats

Locked-in file formats

For most file formats, there is software that can edit and modify the files. Yet, for some file formats, this is not the case: either the format does not allow it, or there is no such software, or editing the file would result in loss of information. Take for example PDF documents. It is very hard to edit the text of a PDF document. You can add comments, and you can fill forms, but you cannot easily change the text of a PDF document. Thus, PDF is not modifiable.

Often, you cannot even copy-paste properly from a PDF document. This means that you cannot extract the text from the document. There is software that can extract the text, but this often results in garbled layout, and messed up ligatures. I call such file formats “locked-in”.

Other locked-in file formats are the lossy file formats, such as MP3 or JPG. There exists software to edit such files. However, each modification aggravates the loss of data. If you repeatedly edit a JPG file, the picture will ultimately suffer. The same is true if you try to export the data to another format. Thus, lossy file formats are locked-in, too.

Recommendation

For archiving, locked-in file formats should be avoided. This is not just because they disallow the modification of data. It is also because they do not permit transferring the data into another file format. Such a transfer may become necessary if the file format becomes obsolete one day. In such a case, locked-in file formats can result in a loss of data.

The lossy file formats are all locked-in in our sense. None of the other file formats discussed here is locked-in, unless this is explicitly mentioned.

Self-contained file formats

Self-contained file formats

A file format is self-contained if all data is stored in a single file. This is obviously the case for most file formats, including JPG images, Word documents, or MP3 music files. However, there are some file formats that require other files to be present. For example, an HTML file may contain images. These images are usually stored in external files. Thus, if you want to send the HTML file, you also have to send the images along with the file. The same is true for SVG files, Beamer presentations, TEX projects, and MIDI files. PDF files, too, require font files if these are non-standard.

Bundles

A bundle is a folder that contains several files, but that acts as a single document. The best-known example is an HTML folder, which contains the main HTML file (usually index.html), together with all embedded images and resources. Another example is a TEX project, which contains a main file and several resources.

The advantage of a bundle is that it is in some way self-contained. At the same time, it keeps the external resources separate and visible, so that they can be used independently.

Recommendation

For archiving, files that are not self-contained can cause problems. This is because the link between the main file and the external resources is often not obvious. Thus, you might accidentally delete or move the external resources, move the main file without moving the resources, or copy the main file without the external resources. All of these destroy the original file.

Therefore, preference is to be given to self-contained file formats — or at least to bundles. All file formats discussed here are self-contained, unless otherwise mentioned.
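If you are unsure whether an HTML file is self-contained, you can list the resources it refers to. The following is a rough sketch using only Python’s standard library. The class name and the list of tag/attribute pairs are my own, hand-picked and incomplete (for example, images referenced from style sheets are not detected).

```python
from html.parser import HTMLParser


class ExternalResourceFinder(HTMLParser):
    """Collect src/href targets that point outside the HTML file itself."""

    # (tag, attribute) pairs that load a resource; a hand-picked, incomplete list.
    RESOURCE_ATTRS = {
        ("img", "src"), ("script", "src"), ("link", "href"),
        ("audio", "src"), ("video", "src"), ("source", "src"),
    }

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (tag, name) in self.RESOURCE_ATTRS and value:
                # "data:" values are embedded in the file, so they are not external.
                if not value.startswith("data:"):
                    self.resources.append(value)


def external_resources(html_text):
    """Return the external resources referenced by the given HTML text."""
    finder = ExternalResourceFinder()
    finder.feed(html_text)
    return finder.resources


page = '<html><body><img src="cat.png"><img src="data:image/png;base64,AAAA"></body></html>'
print(external_resources(page))  # ['cat.png']
```

An empty list suggests (but, given the limitations above, does not prove) that the file can be moved or copied on its own.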

Recommended File Formats

The following sections discuss different file formats for documents, slide presentations, spreadsheets, audio, video, images, and compressed files. Each list of formats is followed by a recommendation in the end.

Documents

Plain Text File Formats

A plain text file (file extension ".txt") is the simplest way to store text. No special software is needed to read it, apart from a text editor (such as Notepad, vi, or TextEdit, which can be found on any operating system). The format is thus open, established, and completely safe for archiving. The only disadvantage is that plain text files do not support any formatting (bold text, headlines, etc.).

There are a number of other file types that are similar to plain text files, and similarly archivable. These include CSV files, TSV files, and code files (such as code of Java, Basic, Pascal, etc). There is a caveat for code files, though: The code itself will survive in a plain text file. However, the corresponding compiler may not. Thus, you may find yourself with code that you can no longer run.

All plain text files share the problem of character encoding. The character encoding is the method that is used to store umlaut characters, accents, and non-Latin characters. There used to be a variety of different encodings, and text stored in one encoding shows up garbled when read in a different encoding. Thankfully, the world has now settled on UTF-8. It is the dominant encoding on the Web nowadays, it is the recommended setting for emails, it has been around since 1993, it’s backwards compatible with ASCII, and it’s space-efficient for Western characters. Thus, if you write a plain text document, you should make sure that you are using UTF-8.
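The garbling is easy to reproduce. In the following sketch, the sample text and the choice of Latin-1 as the “wrong” encoding are my own; any mismatch between the writing and the reading encoding produces a similar effect.

```python
text = "Müller café"
utf8_bytes = text.encode("utf-8")

# Reading UTF-8 bytes with the wrong encoding garbles the non-ASCII characters.
garbled = utf8_bytes.decode("latin-1")
print(garbled)   # MÃ¼ller cafÃ©

# Reading them with the right encoding restores the text.
restored = utf8_bytes.decode("utf-8")
print(restored)  # Müller café

# Backwards compatibility: pure ASCII text has the same bytes in ASCII and UTF-8.
print("archive".encode("ascii") == "archive".encode("utf-8"))  # True
```

When writing files from a program, it is safest to name the encoding explicitly, e.g. open(path, "w", encoding="utf-8") in Python, rather than to rely on the system default.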

For this purpose, you have to make sure that your text editor is set to UTF-8. Unfortunately, Windows does not use UTF-8 by default in Notepad or WordPad. Hence, it is very cumbersome to use UTF-8 text files on Windows. Furthermore, Notepad does not display non-Windows line breaks correctly. Therefore, I recommend using Notepad++. It solves all of these issues, and is the most popular text editor on Windows.

To display UTF-8 text files in Firefox, you have to press F10, then click View -> Encoding -> Unicode.

Microsoft Office Documents

Microsoft is the leading producer of office software. The Microsoft Office suite contains the software called Word for documents, the software Excel for spreadsheets, and the software PowerPoint for slide presentations. The file formats are DOCX, XLSX, and PPTX, respectively. Microsoft products and file formats are well established and are the de facto standard in the office world.

Microsoft formats are definitely established. However, they are also proprietary.

Open Office Formats

The Open Office project started with the goal to provide an open alternative to Microsoft products. Its file formats are called “Open Office formats”: ODT for documents, ODS for spreadsheets, and ODP for slide presentations. Confusingly, Microsoft has decided to call its own proprietary formats “Office Open XML”. Furthermore, the original OpenOffice software has been discontinued. It lives on in the LibreOffice project.

All that said, Open Office software and file formats are established and open. They are thus the way to go if you want to create archivable documents. The LibreOffice software to work with such documents can be downloaded for free.

HTML for Documents

I will now discuss some less frequent choices of document file formats. We start with using HTML for documents.

HTML is a file format for text with layout. It is an open format, developed by the World Wide Web Consortium. It has been around for more than 20 years, and it can be displayed on nearly any device with a display. The format can thus be considered sufficiently established. Conveniently, the main office software suites support exporting documents to HTML. They also support editing HTML documents. Thus, in principle, HTML would be the ideal file format for documents.

There are two caveats: First, editing support for HTML documents is not always perfect. An HTML document edited in LibreOffice will not always show in the same way in Microsoft Office and vice versa. Geeky people such as myself edit the HTML file by hand. This, however, requires some knowledge of HTML.

The second caveat is that there is no universally accepted way to integrate embedded material (such as images) into such documents. Thus, HTML is not self-contained. You basically have two choices:

  1. You make a bundle, i.e., a single folder that holds all embedded material as well as the main file (called index.html). This is the safest and preferred way to go for archiving. The disadvantage is that one has a folder instead of a file.
  2. You use Data URIs, i.e., you embed the external resources straight into the HTML file in base64 encoding. The advantage is that you get a neat self-contained HTML document. The disadvantage is that the document becomes hard to edit by hand. Also, the office software typically does not allow this option when exporting. Thus, the file requires post-processing to embed the images in this way.
If you wish, you can use my tool to convert between these two options — but it’s a layman’s tool.
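For illustration, the second option (Data URIs) can be sketched as follows. This is not the tool mentioned above, just a minimal example with hypothetical helper names. It handles only the simple case where the image is referenced as src="filename" with double quotes.

```python
import base64
import mimetypes


def to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw file content as a base64 Data URI."""
    mime, _ = mimetypes.guess_type(filename)  # guessed from the file extension
    if mime is None:
        mime = "application/octet-stream"     # generic fallback
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def inline_image(html_text: str, filename: str, data: bytes) -> str:
    """Replace src="filename" by an embedded Data URI (plain string matching)."""
    return html_text.replace(
        f'src="{filename}"', f'src="{to_data_uri(data, filename)}"'
    )


# Usage with placeholder bytes instead of a real image file:
html = '<p>My cat:</p><img src="cat.png">'
print(inline_image(html, "cat.png", b"placeholder image bytes"))
```

Base64 encoding makes the embedded data about one third larger than the original image file, which is part of the price of getting a single self-contained file.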

For completeness, we also list the other options to deal with external resources in HTML documents:

  1. A link to the external image, leaving it where it is (LibreOffice). The disadvantage is that the link is not obvious. Deleting or moving the external image, or copying the HTML document without copying the external image will destroy the document.
  2. A separate folder, which contains copies of all embedded material. This is what happens when you click “Save complete Web page” in your browser. It is a well-supported mechanism across all browsers. The disadvantage is that the HTML file and the embedded material are physically separated. The HTML file is not self-contained, and it is not even a bundle. Hence, there is the risk of deleting, renaming, copying, sharing, or moving one resource without the other.
  3. MHTML, which is basically a self-contained file that contains a sequence of files, much like MIME Emails. The disadvantage is that native support remains limited to Internet Explorer and Opera. Firefox and Safari require an extension.
  4. The Mozilla Archive Format (MAF) of Firefox — basically a self-contained ZIP file with the markup and images, with metadata saved as RDF. The disadvantage is that no other browser besides Firefox supports this format, and this support is being discontinued.
  5. Printing the entire file to PDF, which is self-contained, but an altogether different file format. The disadvantage is that PDF is essentially locked-in.
  6. Compressing the file to EPUB. The disadvantage is that images are re-scaled by default, that the formatting changes slightly, that EPUB is less established than HTML, and that the format is locked-in without the appropriate software.

For these reasons, HTML has led a niche existence as a document format. Still, geeky idealists can use it to write text documents. Personally, I use HTML to write most of my documents, doing the markup by hand. I use bundles to group the embedded material. I also use HTML as an archiving format for documents. In that case, I opt for Data URIs, because they produce a neat single file.

EPUB for Documents

EPUB is an e-book format that is used for e-readers. The format is developed by the World Wide Web Consortium. It is thus an open format. The format has been around since 2007. Several programs can display e-books: Android and the iPhone each have a reader application, and so does the Mac. Chrome and Firefox each have a plugin that can display EPUB files. The format is thus to some degree established, although nowhere near as established as HTML or PDF.

EPUB is basically a self-contained ZIP file that contains HTML and image files. It thus solves the problem of bundling images with HTML files. At the same time, common EPUB converters will reduce the image resolution of the included images, and also change the formatting. Thus, the format is lossy if produced with these programs. Furthermore, EPUB documents cannot be edited with pre-installed software on common operating systems. You would have to install the free software Calibre. Another option is to unzip the file, and to do the editing by hand. This is a tedious endeavor, as EPUB is frustratingly complicated (with constraints on file names, XML tags, and XML namespaces) and redundant (the ID of the document has to be referenced; the content of the ZIP file is redundant with the manifest; the spine is redundant with the navigation document). If this is not an option, the EPUB format becomes locked-in.
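Because an EPUB is a ZIP file, any ZIP library can look inside it. The sketch below lists the contents of an archive with Python’s standard zipfile module. To keep the example self-contained, it builds a tiny stand-in archive in memory; that stand-in is not a valid EPUB (it lacks the manifest and the other required parts), and its file names are made up.

```python
import io
import zipfile


def list_epub(epub_bytes: bytes):
    """Return the names of the files stored inside an EPUB (a ZIP archive)."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        return archive.namelist()


# Build a tiny stand-in archive in memory (NOT a valid EPUB, for illustration only).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("chapter1.xhtml", "<html><body><p>Hello</p></body></html>")
    archive.writestr("images/cover.png", b"placeholder bytes")

print(list_epub(buffer.getvalue()))
# ['mimetype', 'chapter1.xhtml', 'images/cover.png']
```

For a real e-book, you would read the bytes from the .epub file instead. Extracting the text and images this way is straightforward; it is putting a valid EPUB back together after editing that is tedious.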

PDF for Documents

PDF is a file format for documents that was originally developed by Adobe. The format has come a long way since its inception. Nowadays, it is no longer under the control of Adobe, but under the control of a standards organization (in which Adobe is a mere member). Thus, it is an open file format. Furthermore, software to display PDF documents is ubiquitous: PDF documents can be displayed on all major operating systems (often natively), and in all major browsers. It is the de facto standard for sharing printable material. Thus, it is an established file format. Its archival variant, PDF/A, differs from PDF in the requirement of XML-based metadata and the elimination of elements likely to complicate decoding and accelerate obsolescence, such as audio and video, JavaScript, unembedded fonts, and device-dependent colorspaces. With this, PDF/A files are self-contained.

The main drawback of PDF is that it’s hard to edit. There are not many software tools that allow users to modify PDF documents. Sure, you can add comments, you can fill out forms, and you can sign them, but you cannot easily change the text of a PDF document. In fact, you often cannot even properly copy/paste from a PDF document. If you try to extract the text from a PDF document (by hand or by software), the result is usually messy — with garbled layout and missing characters. The reason for all this is that PDF was designed as a page layout language. It essentially tells the printer where to place certain text elements — with no respect for the actual flow of text. Thus, PDF is locked-in in our sense.

This means that, while PDF is certainly a great choice for archiving unmodifiable documents, it is a poor choice for archiving anything that you wish to modify in the future, or re-use in another form. It is the end of the line for data.

LaTeX + PDF for Documents

In the academic world, it is customary to produce documents with LaTeX. You basically write a plain text document, and sprinkle in little magic commands to define the layout. For example, you’d write “This is \textbf{Lisa}” to have “Lisa” appear in bold. You can also include external resources such as images. For this purpose, the TEX file has to live in a bundle. There is software that can help you manage such a project. That software can then compile the TEX file into a PDF file. This software is established and open. PDF files, in turn, can be displayed in any major browser and on any major operating system. Furthermore, the underlying LaTeX code can always be edited. Thus, LaTeX is not locked-in. Hence, it has all the advantages of PDF without PDF’s main drawback.

Where is the catch? The catch is that it is a pain to write LaTeX documents. Of course, seasoned LaTeX users will rave about how great the layout of LaTeX is, and that may even be true. At the same time, you are sure to spend as much time on Stack Overflow (searching for the correct way to say things in LaTeX) as you spend actually writing the text. Take something as simple as inserting a blank line in your text. There are several pages of discussion about the best way to do that. The easiest way to do that (with 5 backslashes: “\\\ \\”) is not even mentioned. Therefore, I personally do not recommend LaTeX for everyday use.

Recommendation

For creating new documents with an eye on archival, I recommend: The following documents are also safe for archiving: For geeks, I also mention the following:

If you have documents in any other file format, it may make sense to convert them to one of the above. One way to do that is to open the file (by double-clicking it), and to choose “Save as” or “Export”. Then you can pick your target file format. If you want to automate the process, install Libre Office. Open a terminal and type
soffice --convert-to targetFormat inputFile --outdir folder --headless
On Mac, it’s
/Applications/LibreOffice.app/Contents/MacOS/soffice --convert-to targetFormat inputFile --outdir folder --headless
Here, folder is the folder where the converted file shall be stored. The target format can be, e.g., “html” or “odt”. If you choose HTML, be warned that Libre Office produces very verbose HTML documents with external resources. You may want to clean up the HTML files and/or inline the resources into Data URIs.

Be advised that these conversions usually lose layout details. You should therefore keep the original files.
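
If many files need converting, the command line above can be assembled by a small script. The following is a minimal Python sketch, not a tested tool: the function name `soffice_command` and the file names are made up for illustration, and the script only builds and prints the argument lists. Nothing is converted until a list is handed to `subprocess.run`.

```python
# Location of the soffice binary on a Mac, as given in the text above.
MAC_SOFFICE = "/Applications/LibreOffice.app/Contents/MacOS/soffice"


def soffice_command(input_file, target_format, out_dir, soffice="soffice"):
    """Build the argument list for one headless conversion."""
    return [
        soffice,
        "--headless",
        "--convert-to", target_format,
        "--outdir", out_dir,
        input_file,
    ]


# Hypothetical file names, for illustration only.
for name in ["letter.doc", "report.wri"]:
    print(" ".join(soffice_command(name, "odt", "converted")))
```

On a Mac, pass `soffice=MAC_SOFFICE` to use the full path of the program.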

Slide Presentations

Open Office and Microsoft Office

For slide presentations, you have again basically the choice between the Microsoft file format PPTX and the Open Office file format ODP. The same discussion as for documents applies: Both file formats are established. However, only the Open Office format is open. The accompanying software, Libre Office, can be downloaded for free.

HTML/SVG for Presentations

We now come to geeky alternatives for slide presentation formats. One of them is SVG. SVG is originally a file format for images. However, since slides are basically images, SVG can in theory also be used to store slide presentations. Each slide can be an SVG image, and these can live together in a self-contained HTML file. SVG and HTML are both open formats, developed by the World Wide Web Consortium. They have been around for about 20 years, and can be displayed by all major browsers. The combination can thus be considered sufficiently established. Furthermore, the format is human-readable and could in the worst case even be manipulated by hand. Hence, it looks like an ideal alternative to Open Office and Microsoft Office file formats, which always require a software installation for viewing and editing.

The catch is that there is no established software that can produce HTML slide presentations. As far as I know, there is only my own tool, PowerLine. The tool works well, and presentations done with PowerLine can be displayed in any browser out-of-the-box (see here for an example). However, the software remains a layman’s tool. Therefore, HTML has not really caught on for slide show presentations. In other words, the format is open, lossless, and established, but locked-in.

Beamer

As we have seen before, LaTeX+PDF is a very popular combination for writing documents in the academic world. Unfortunately, it is very painful to use. Interestingly, it is possible to experience this pain also when creating slide presentations. A widely used package for this purpose (but not the only one) is Beamer. Beamer presentations have a beautiful layout, but they are often based mainly on bulleted lists. Anything else is more difficult to do in Beamer.

Beamer presentations have the same advantages and disadvantages as LaTeX+PDF. They are established and open, but painful to use.

Recommendation

For creating new slide presentations with an eye on archival, I recommend: The following formats are also safe for archival: For geeky people, I also mention the following:

If you have presentations in any other file format, it may make sense to convert them to one of the above. Proceed as for documents.

Spreadsheets

Open Office and Microsoft Office

For spreadsheets, you have again basically the choice between the Microsoft file format XLSX and the Open Office file format ODS. The same discussion as for documents applies: Both file formats are established. However, only the Open Office format is open.

HTML for Spreadsheets

HTML is a file format for text with layout. It is an open format. It is also established, most notably because it can be displayed on nearly any device with a display. HTML can in principle also be used to store spreadsheets. All major office applications support exporting spreadsheets to HTML. Some also allow editing them. Thus, HTML seems like the perfect format for spreadsheets.

The problem is that a simple HTML export of a spreadsheet keeps the cell values, but loses the formulas that were used to compute them (at least in Libre Office). Thus, the spreadsheet becomes effectively locked-in for anything that goes beyond simple values in a table. For archiving purposes, this may be sufficient, but for everyday use, it is not.

An alternative is to use my tool Spreadshit. It is a spreadsheet program that runs as Javascript inside an HTML document. These HTML documents are self-contained and safe for archiving. However, the tool is amateur software, and thus not ready for heavy use.

Recommendation

For creating new spreadsheets, I recommend: The following format is also safe for archiving: For geeky people, I mention:

If you have spreadsheets in any other file format, it may make sense to convert them to one of the above. Proceed as for documents.

Audio Formats

FLAC

FLAC is a lossless audio format. It is also an open format, which distinguishes it from the proprietary formats ATRAC (Sony), ALAC (Apple; sometimes with file extension M4A), SACD (Sony and Philips), and Windows Media Audio Lossless (Microsoft). FLAC can be played natively on Windows machines and in all major browsers. Apple products (iOS, Mac, Safari) do not natively support FLAC — quite possibly because Apple has its own lossless audio format, ALAC. However, players for FLAC can be easily found also for Apple systems. FLAC is thus an established file format. This distinguishes it from the less well supported formats “Monkey’s Audio” (APE), WavPack, TTA, MPEG-4 SLS, and SHN. Finally, FLAC compresses the data, thus making the files smaller without losing information. This distinguishes FLAC from non-compressing file formats such as WAV, AIFF, AU or raw PCM. FLAC is thus the primary choice for archiving audio data.

FLAC can encode different sampling rates (“resolutions”): higher sampling rates are more truthful to the original, but produce larger file sizes. Based on what humans can hear, the standard sampling rate for everyday use is commonly 44,100 Hz. This is the sampling rate that is used on Audio CDs. Some vendors advertise higher sampling rates, most notably with the DVD Audio format or the competing Super Audio CD format. However, blind tests have shown that humans cannot hear the difference between a sampling rate of 44,100 Hz and anything higher (except at very loud volume). Thus, there is no need to go beyond a sampling rate of 44,100 Hz. Vice versa, given that disk space is cheap nowadays, there is also no reason to go below that rate either.

Professionals may choose a higher sampling rate, if they plan to edit the audio material later on, e.g., by slowing it down, or transposing it. However, it does not make sense to rip an audio CD at a sampling rate higher than 44,100 Hz. The result can never be of better quality than the original.

MP3

FLAC is a great choice for archiving audio material, because it is lossless. However, it also requires a lot of space. For this reason, people have developed lossy audio formats. These cut away little details in the audio material that cannot be heard by humans. By far the most popular lossy audio file format is MP3. That makes MP3 one of the most established file formats for audio.

There are ongoing disputes about whether MP3 is free of patents or not. Technicolor maintains that software that treats MP3 files has to pay a fee. However, these disputes matter little to everyday users. Therefore, MP3 is a quasi-open format. This distinguishes it from proprietary formats such as WMA.

MP3 can encode the data at different sampling rates. As I have argued before, a sampling rate of 44,100 Hz is a good choice, and this is indeed the default sampling rate. On top of that, MP3 supports different bit rates. A higher bit rate means a lower compression ratio and a more truthful encoding, at the expense of larger files. Common values for the bit rate are 128, 192, 256, and 320 kbit/s. Blind tests have shown that even trained ears cannot distinguish a bit rate of 256 kbit/s from the original. Therefore, there is no need to go beyond a bit rate of 256 kbit/s. At the same time, there is also no need to go below that bit rate, because disk space is cheap nowadays. Thus, 256 kbit/s is a good choice.
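
To get a feeling for what these bit rates mean in disk space, the arithmetic can be sketched in a few lines of Python. The helper name `megabytes_per_minute` is invented for this example. The CD figures assume the usual 16 bits per sample and 2 channels, and the MP3 bit rates are treated as constant.

```python
def megabytes_per_minute(bits_per_second):
    """Size of one minute of audio in megabytes (1 MB = 1,000,000 bytes)."""
    return bits_per_second * 60 / 8 / 1_000_000


# Uncompressed CD audio: 44,100 samples/s, 16 bits per sample, 2 channels.
CD_BITS_PER_SECOND = 44_100 * 16 * 2

print(f"CD audio (uncompressed): {megabytes_per_minute(CD_BITS_PER_SECOND):.2f} MB/min")
for kbit in (128, 192, 256, 320):
    size = megabytes_per_minute(kbit * 1000)
    ratio = CD_BITS_PER_SECOND / (kbit * 1000)
    print(f"MP3 at {kbit} kbit/s: {size:.2f} MB/min, about {ratio:.1f}x smaller")
```

At 256 kbit/s, one minute of music takes about 1.9 MB, compared to about 10.6 MB for the uncompressed CD data.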

You can see the sampling rate and the bit rate of an MP3 file on a Mac or a Linux system by opening a terminal and typing
file your-file.mp3

MP4+AAC

The AAC format was designed as a successor to MP3. It is a lossy format. Technically, AAC is a codec that has to live in a container file. The most common container for this purpose is MP4. The file extension is then “mp4”. Since this extension is also used for MP4 videos, the extension is sometimes changed to “m4a”. I will refer to the combination of MP4 with AAC as MP4+AAC. This format has been around since the early 2000’s, and it enjoys widespread support on all major platforms and all major software implementations. It can thus be considered established.

MP4+AAC is developed by a standards consortium. Unfortunately, it is encumbered by patent restrictions. However, the format is free for consumers, and it can thus be considered quasi-open.

With all of this, there is no particular reason to choose MP4+AAC over MP3.

OGG+Opus

The Vorbis project set out to create a new lossy audio format that would definitively be open and free from patents. Its successor is called Opus. Opus is just a “codec”, i.e., a way to encode audio data. It is not an actual file format. The encoding has to live in what is called a container file format. There are different container file formats that can contain Opus, most notably OGG, Matroska, and WebM. Vice versa, these containers can contain other encodings than just Opus. However, OGG is the most frequent choice for Opus, and hence the format is known as “OGG+Opus”. Common file endings are OGG and OGA.

Opus is an open and lossy audio format. It can be played in the browsers Firefox, Chrome, and Opera, and in a number of other programs. Most notably, Wikipedia encourages the use of Opus. Google uses Opus in its video format WebM+VP9. The format thus falls under the open browser formats. It is established to some degree, but it remains much less ubiquitous than MP3.

MIDI

MIDI is a file format for audio data. It is not an actual recording, but a sequence of instructions. You can imagine it as a note sheet, together with the information which instrument plays which lines. When you open the file, the computer will play the lines like an orchestra would. Thus, MIDI is a vector format in our sense. This makes MIDI files lossless and very small. The format is developed by the MIDI Manufacturers Association, which makes it an open standard for all practical purposes. MIDI files date back to the 1980’s and they are very popular in the digital instrument community. There is software support in one way or the other on all major operating systems. Thus, the format can be considered reasonably established in our sense.

All of this said, MIDI cannot be used to record audio data. It can only be used with explicitly “vectorized” types of sounds. In particular, it cannot replace a recording of a human playing the piano, let alone of an orchestra playing a piece of music. This is because MIDI cannot express the variations in force, distance, perfection, and volume that characterize a piece of music played by humans. Another drawback is that MIDI is not self-contained, because it does not contain the instrument recordings. Thus, a MIDI with piano notes will sound different on different devices.

Recommendation

If you create new audio data, I recommend: The following formats are also safe for archiving:

If you have audio files in any other file format, you’d better convert them into one of the above. One way to do that is with the free software FFmpeg. Once installed, open a terminal and type
ffmpeg -i filename.old filename.new
Here, old is the old file extension (e.g., wma) and new is the new one (e.g., mp3). If you want MP3 with 256 kbit/s (as I suggest), add the option “-b:a 256k” before the output file name. In any case, you should keep the original files.
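
For a whole collection of files, the FFmpeg call can be generated per file. This is a minimal sketch in Python under a few assumptions: the function name `mp3_command` and the file names are hypothetical, and the script only prints the commands instead of running them.

```python
from pathlib import Path


def mp3_command(source, bitrate="256k"):
    """Build an ffmpeg call that converts one audio file to MP3."""
    source = Path(source)
    target = source.with_suffix(".mp3")
    # The bit rate option belongs to the output, so it comes after the input file.
    return ["ffmpeg", "-i", str(source), "-b:a", bitrate, str(target)]


# Hypothetical file names, for illustration only.
for name in ["interview.wma", "concert.wma"]:
    print(" ".join(mp3_command(name)))
```

To actually convert, each list could be passed to `subprocess.run(command, check=True)`.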

Video

Container formats

There are two different choices to make when encoding video:
  1. The “codec”, i.e., the way in which the video or audio is encoded. An example of a popular video codec is AVC.
  2. The “container”, i.e., the actual file format that contains the codec.

A container can contain several codecs at the same time — for example one for the video data, one for the accompanying audio, and one for the subtitles. Popular container formats are:

Resolution

Just like images, videos have a resolution. In principle, the guidelines for image resolution apply for video as well. In practice, however, file size is the limiting factor. Common resolutions are 320x240 (for mobile devices), 1920x1080 (1080p Full HD), 4096x2160 (4K Digital cinema, iPhone), 7680x4320 (8K UHD, maximum on YouTube), and anything in between.

In addition to a spatial resolution, video also has a temporal resolution: the number of pictures (or “frames”) per second. Common values are between 24 (as used in cinema) and 30.
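
Why file size becomes the limiting factor can be seen from a rough calculation of the uncompressed data rate. This is only a sketch: it assumes 3 bytes per pixel (24-bit color) and ignores audio, and the function name is invented for this example.

```python
def raw_megabytes_per_second(width, height, frames_per_second, bytes_per_pixel=3):
    """Uncompressed video data rate in megabytes per second (1 MB = 1,000,000 bytes)."""
    return width * height * bytes_per_pixel * frames_per_second / 1_000_000


# The resolutions listed in the text, at the cinema rate of 24 frames per second.
for width, height in [(320, 240), (1920, 1080), (4096, 2160), (7680, 4320)]:
    rate = raw_megabytes_per_second(width, height, 24)
    print(f"{width}x{height} at 24 frames/s: {rate:,.1f} MB/s uncompressed")
```

Full HD at 24 frames per second already comes to roughly 149 MB per second before compression, which is why practically all video formats discussed here are lossy.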

MPG

MPG (or MPEG) is the oldest common video file format. It is lossy, and defines both a container and a codec. The first variant of MPG is known as MPEG-1, and the newer one as MPEG-2. Due to its age, all known patents have expired, and the format is nowadays open. Today, MPG is the most widely compatible lossy audio/video format in the world. However, it cannot be played on a Mac without additional software. Thus, it is reasonably well established, but not fully.

MPG has since been superseded by MP4+AVC, which offers higher video quality. Hence, preference should generally be given to MP4+AVC. At the same time, it usually does not make sense to convert existing MPG files to any other format, if they play on the computer, because this would entail a small loss of quality due to the conversion.

Video DVDs

Video DVDs can contain several videos, subtitles, and menus to choose between these. This information is stored in several files in several folders on the DVD. The main video files are usually in the folder VIDEO_TS. This folder contains several files with meta information, as well as the main video files. The main video files are called VTS_01_x.VOB (where x=1,2,3...). The file type VOB is a container format. Such containers usually contain MPG videos.

Due to their folder structure, video DVDs (or their copies on a hard drive) are usually clumsy to use on a modern computer. Since VOB files are just containers for MPG movies, they can be converted without loss of video quality to the simpler MPG files. You can use the free software FFmpeg for this purpose. Once installed, open a terminal and type
ffmpeg -i VIDEO_TS/VTS_01_x.VOB -vcodec copy -acodec mp2 filename.mpg
This conversion is lossless for the video. For the audio, it converts the native AC3 codec to the MP2 codec. Otherwise, the audio will not play in all players.

MOV

MOV is a container format that typically uses the same codecs as MP4+AVC to store video and audio. The format was originally developed by Apple for its Quicktime software. It is thus proprietary.

MOV has been developed further to become the MP4 container format. This new format is the standard for movies nowadays, and thus preferable to MOV. Since both formats use the same codecs, they can be transformed into one another without re-encoding the video (i.e., without degrading the quality). You can use the free software FFmpeg for this purpose. Once installed, open a terminal and type
ffmpeg -i filename.mov -vcodec copy filename.mp4
This will translate the MOV file to an MP4 file without loss of video quality. (In principle, you could also translate the audio losslessly with -acodec copy, but not all codecs can be translated in this way.) The same method works for 3GP files: They often just wrap AVC movies, and can thus be transformed without loss of quality to MP4+AVC with the above command.

MP4+AVC

One of the most popular video codecs nowadays is “MPEG-4 Part 10 (H.264)”. It is also known, equally bulky, as “H.264/MPEG-4 AVC”. This is a lossy encoding for video data. With a compression rate set to 0, it is also lossless, but this is less common because it consumes an extraordinary amount of space. AVC is a proprietary encoding, encumbered by patent litigations. However, the format is free to use for end-users. In any case, this discussion has had little impact on common users, and AVC is nowadays the de facto standard for movies. Thus, it can be considered quasi-open.

The codec typically lives in an MP4 container, and I will call that combination MP4+AVC. The accompanying audio is usually AAC, a lossy quasi-open audio format that was developed as a successor to MP3. This combination is one of the most established file formats for video data. It carries the file extension “mp4”, or, equivalently, “m4v”.

To convert a movie to MP4+AVC, you can use the free software FFmpeg. Type
ffmpeg -i filename.old -crf 18 filename.mp4
Here, old is the old file extension (e.g., avi). The option “-crf 18” sets a quality level that is visually nearly lossless. It has to come after the input file, because FFmpeg applies each option to the file that follows it. Since this transformation is lossy, you should keep the original files.

HEVC

HEVC is a lossy video codec that is developed with the goal to replace AVC. Most notably, it has a higher compression rate than AVC. HEVC is not a free format: it uses a number of patents, and thus the use of HEVC requires the payment of royalties to their owners — although probably not by the end users. This cost has curbed the acceptance of the standard, most notably on the Web. No major browser supports the format. Nevertheless, Microsoft Windows and Apple’s operating systems support HEVC out of the box.

The standard is thus proprietary, lossy, and to some degree established, but much less established than the ubiquitous MP4+AVC. In particular, there is an open competitor to HEVC in the making, called WebM. Unlike HEVC, WebM has the support of all major players in the field. Thus, WebM looks like the more future-proof choice.

Newer iPhones will automatically use the HEVC format for capturing videos. This can be changed to MOV. This format, in turn, is nearly equivalent to MP4.

Theora

Theora is a lossy video codec, which usually lives in an OGG container (with the extension OGV). It was specifically designed to be open and unencumbered by patents. It is developed by the Xiph.org foundation. The format has been around since 2004, and it is supported by all major browsers except Safari. It is thus kind of established, and falls under the open browser formats.

Theora was one of the inspirations for the WebM format. WebM+VP9 can thus be seen as a modern alternative to OGG+Theora. Wikipedia, e.g., clearly recommends WebM over Theora, due to its better space efficiency.

WebM

WebM is a container format for videos that is being pushed by Google with the goal to provide a more efficient alternative to MP4+AVC, and a more open alternative to HEVC. The format is open and unencumbered by patents. The audio of WebM movies is usually encoded in Opus. The video is mostly encoded in the VP9 codec, or its successor, AV1. Both video codecs are lossy, but can also store lossless video if the compression rate is set to zero. We will refer to the WebM container format with Opus audio and either VP9 video or AV1 video collectively as “WebM”.

WebM is relatively young, and it is thus not established. In particular, the move towards AV1 shows the volatility of the format. At the same time, WebM is championed by the Alliance for Open Media. This is a cooperation of all major players in the field (Amazon, ARM, Cisco, Facebook, Google, IBM, Intel Corporation, Microsoft, Mozilla, Netflix, and Nvidia). Even Apple, previously an adherent of HEVC, has now joined. As a consequence, WebM already works in all major browsers, except Safari (which might be about to change). Hence, WebM falls under the open browser formats. In particular, WebM+VP9 is used by YouTube, and encouraged by Wikipedia. WebM may thus be on the way to becoming a new standard. WebM can be interesting for archiving if disk space is a concern.

To convert a movie to WebM+VP9, you can use the free software FFmpeg. Type
ffmpeg -i filename.old -c:v libvpx-vp9 -crf 24 -b:v 0 filename.webm
Here, the “-c:v libvpx-vp9” tells ffmpeg to convert to VP9. The combination “-b:v 0 -crf 24” enforces a constant quality across the video, using more bits for the frames where this is necessary. The option “-crf” indicates the quality. For videos of height 1440 pixels, the recommended value is 24. However, I have also had good experiences with the default value of 32, in which case you can omit “-crf 24”. This saves a considerable amount of space at a quality that I cannot distinguish from the original.

The new codec AV1 does not yet enjoy as much support as VP9. AV1 is much slower to encode than VP9: encoding 1 minute of video may take 2 hours. In return, it is 10%-30% more space efficient than VP9. Encoding speed and disk space are difficult to trade off, but in my experience, AV1 is not worth the pain for personal purposes. If you want to convert a movie to WebM+AV1 nevertheless, you can again use FFmpeg. Type
ffmpeg -i filename.old -c:v libaom-av1 -b:v 0 -crf 27 filename.webm
Here, the “-c:v libaom-av1” tells ffmpeg to convert to AV1. The combination “-b:v 0 -crf 27” enforces a constant quality across the video, using more bits for the frames where this is necessary. The option “-crf” indicates the quality, and the value 27 is near lossless (omit “-crf 27” to go with the default value of 32).

GIF

The Graphics Interchange Format (GIF) is technically a lossless raster image format. For this application case, it has largely been surpassed by the newer PNG format, which generally produces smaller files. PNG, in turn, is being surpassed by WebP, which produces even smaller files. The conversions can be lossless.

However, there is a niche for GIF: It allows playing several images in a small video. This ability is called animated GIF. The compression is weak compared to real video formats, there are only 256 colors, there is no sound, and there is no way to control that small video. Thus, an animated GIF can be used only for small sequences of raster images. For these, however, GIF is well suited, because it is one of the oldest file formats. Thus, it is very established, and can be played in nearly any interface. In print, an animated GIF defaults to a still image. All relevant patents have expired, and the format has not changed in 25 years, and so it is today quasi-open. The alternatives to animated GIF are WebP (which also allows for animation), WebM, and Mozilla's animated PNG format “APNG”. Of these, WebM generally produces the smallest file size, although a conversion would be lossy.

Flash

Flash is a software suite by Adobe for production of animations, browser games, rich Internet applications, desktop applications, mobile applications and mobile games. It comes with several file formats. Flash is proprietary software. It can be played in all major browsers via a plug-in from Adobe. It used to be ubiquitous on the Web, and was thus established in our sense. However, recently the tide has turned against Flash: People criticise the dependence on a single vendor (Adobe), a number of security flaws of Flash, as well as the possibility of tracking users with the help of so-called Flash cookies.

For all of these reasons, the Web community (and Google in particular) has been pushing for alternative file formats. Hence, Flash is nowadays on its way out. Adobe itself announced the end of the format for 2020. Therefore, Flash files should be converted to more modern formats, as described below.

Recommendation

If you create new video content, I recommend: The following formats are also safe for archiving:

If you have video files in any other file format, you’d better convert them into one of the above. One way to do that is with the free software VLC player. In File->Convert, you can convert any video file to any other video file. Click “customize” to choose the correct Video codec, Audio codec, and container format. Try to keep the original audio and video codec wherever possible in order to avoid a re-encoding (which could lead to a loss of quality).

If you want to automate the process, you can use the free software FFmpeg. To convert MOV or 3GP to MP4+AVC, proceed as described above. To convert video DVDs, also proceed as described above. Since these transformations are simple rewrappings, you do not need to keep the original files.

To archive other file types, convert them to MP4+AVC or WebM, as described there. Since these transformations are lossy, you should keep the originals.

Images

SVG

SVG is a file format for images. It is a vector format and thus lossless. Like HTML, it is an open format, developed by the World Wide Web Consortium. It has been around for about 20 years, and it can be displayed by nearly all browsers. The format is thus established. It has superseded older vector formats such as CGM.

SVG is generally the way to go if you have images in vector form.

PNG

PNG is a lossless image file format. It is supported by all major browsers, and can be displayed and edited on all major operating systems. It is the most widely used lossless image compression format on the Internet. It is thus a very established format. Furthermore, it is an open file format, developed by the PNG Working Group. This distinguishes it from the proprietary (and more space consuming) BMP and GIF formats, as well as from the vendor-dependent RAW format.

PNG is thus the format of choice for non-vectorized lossless images.

TIFF

TIFF is a container format for images. It can contain lossy and lossless image encodings. Most often, however, it is used as a lossless image file format. It is widely used by graphic artists, in the publishing industry, and by photographers. It is supported by a wide range of software, and is thus very established. The format was developed by Adobe, and it is thus a proprietary file format. Adobe holds the copyright on the TIFF specification. However, there are no known intellectual property litigations. Also, the format stems from the 1980s. Thus the format can be considered quasi-open.

Compared to PNG, TIFF is less open. It is also less established, because there is more widespread software support for PNG than for TIFF. This is because TIFF is a more complicated format that is more difficult to implement. Finally, TIFF images are often stored uncompressed, which makes them roughly twice as large as PNG images at the same quality setting. Thus, I generally recommend PNG over TIFF.

However, there are cases where TIFF is preferable over PNG. First, TIFF supports the CMYK color model. This is the color model that is used in printing. Certain colors cannot be displayed on the screen (in the RGB model), but only in print (in the CMYK model). If you have documents in the print color model, you should favor TIFF. At the same time, both scanners and digital cameras nowadays work in RGB — so that PNG is just as good as TIFF. The second advantage of TIFF is that it allows multi-page documents, while PNG does not.

Image Resolution

PNG and TIFF are lossless image formats. Still, since they are not vector formats, they can mirror reality only up to a certain resolution. The resolution that you would want depends on how you want to use your image: A photo that you hold in your hand can have a smaller resolution than a poster on your wall.

There is a lot to discuss here, trading off resolution with file size in different use cases. However, to cut all of this short, here is a simple rule of thumb: If your image has 6000 pixels from top to bottom, you’re completely safe.

Let’s see why I’m saying this. The underlying assumption is that you are always at least as far from the picture as the picture is high. Consider a sheet of paper. It’s 30cm high, and you generally do not hold it closer to your nose than 30cm. Consider a smart phone: It’s 10cm high, and you generally do not hold it closer to your nose than 10cm. Consider a poster. It’s 1m high, and you generally stand 1m away from it when you look at it. Consider an advertising board. It can be 3m high if it’s on the wall of a high-rise building — but then you generally stand at least 3m away from it. Now if the picture has a height of x, and if your distance to the picture is at least x, then the picture spans a vertical angle of your field of view of arctan((x/2)/x)*2=53 degrees. Now each of your eye cells covers an angle of 31.5 arc seconds. This means that the eye can distinguish 6057 pixels top-to-bottom in your image. This holds independently of the scaling: As long as you’re standing at least as far away from the picture as the print-out is high, you cannot distinguish more than 6000 pixels.
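
The derivation can be checked numerically. The following Python lines are a small sketch that uses only the two figures from the paragraph above (a viewing distance equal to the picture height, and 31.5 arc seconds per eye cell). The variable names are chosen for this example.

```python
import math

# Assumption from the text: the viewer is as far away as the picture is high.
height = 1.0
distance = 1.0

# Vertical angle that the picture covers in the field of view, in degrees.
angle_degrees = 2 * math.degrees(math.atan((height / 2) / distance))

# Angle covered by one eye cell, as stated in the text.
ARCSEC_PER_CELL = 31.5

# One degree has 3600 arc seconds.
pixels = angle_degrees * 3600 / ARCSEC_PER_CELL
print(f"angle: {angle_degrees:.1f} degrees, distinguishable pixels: {pixels:.0f}")
```

Without rounding the angle down to 53 degrees, the result is a few pixels higher than 6057. Either way, 6000 is a fair round figure.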

Formulas for the required resolutions
In practice, you are usually even farther away from the picture than the height of the picture. Then you only need a proportion of the 6000 pixels. The figure on the right gives the number of pixels, the required Dots per Inch (DPI), and the mega pixel resolutions that you need for a given distance. Examples are:
  1. If you have a sheet of paper that’s 30cm high and you hold it at 30cm from your nose, you need a scanning resolution of at most 6000/30cm×2.54cm = 500 DPI.
  2. If you want a poster that’s 100cm high, but expect people to hold their nose at 50cm from it, you need a print resolution of 300 DPI, and 12000 vertical pixels.
  3. If you hold a picture of 10cm height 30cm away from your nose, you need (1/3×6)²×1.5 = 6 Mega Pixels.
  4. If you want a Retina display that is 20cm high, with your nose 40cm away, you need 3000 pixels vertically (the MacBook Pro has 2500).
Higher resolutions are reasonable only if you plan to zoom into parts of the picture, or if you want to transform the picture in some way.
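
The examples above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: it uses the proportional rule implied by the examples (6000 pixels scaled by height divided by distance), it assumes a 3:2 picture for the Mega Pixel count, and the function names are invented for illustration.

```python
MAX_PIXELS = 6000  # rule of thumb from the text, for distance equal to height
CM_PER_INCH = 2.54


def vertical_pixels(height_cm, distance_cm):
    """Vertical pixels needed for a picture of this height seen from this distance."""
    return MAX_PIXELS * height_cm / distance_cm


def dots_per_inch(height_cm, distance_cm):
    """Print or scan resolution that yields the required vertical pixels."""
    return vertical_pixels(height_cm, distance_cm) / height_cm * CM_PER_INCH


def megapixels(height_cm, distance_cm, aspect_ratio=1.5):
    """Total pixel count in millions, assuming width = aspect_ratio x height."""
    rows = vertical_pixels(height_cm, distance_cm)
    return rows * rows * aspect_ratio / 1_000_000


print(f"Sheet, 30cm at 30cm:    {dots_per_inch(30, 30):.0f} DPI")
print(f"Poster, 100cm at 50cm:  {vertical_pixels(100, 50):.0f} pixels, {dots_per_inch(100, 50):.0f} DPI")
print(f"Picture, 10cm at 30cm:  {megapixels(10, 30):.0f} Mega Pixels")
print(f"Display, 20cm at 40cm:  {vertical_pixels(20, 40):.0f} pixels")
```

The computed values (508 DPI, 12000 pixels at 305 DPI, 6 Mega Pixels, 3000 pixels) agree with the rounded figures in the list.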

JPG

PNG and TIFF are lossless image formats. They are more truthful to reality, but consume more disk space. The most popular lossy image format is JPG (also: JPEG). It is the most common format for photographic images on the Web. The format has been around since the 1990’s, it is supported on all major operating systems, it can be read by all major browsers, and it is the default format for digital cameras. It is thus one of the most established formats of all.

There are a number of patent issues surrounding the JPEG format, but these are irrelevant for all practical use cases by common users. The format is standardized by the Joint Photographic Experts Group. It is thus an open standard.

Being a lossy image format, JPG allows the choice of a compression ratio. The higher the compression, the smaller the file, and the less truthful the image. At very high compression ratios, artifacts start popping up in the image. The lowest compression ratio is thus the safest choice for archiving purposes. Some programs allow the user to choose the “image quality”, which is simply the inverse of the compression ratio (highest image quality = lowest compression ratio).

WebP



Original picture (top), 500x zoom (middle) and WebP version (bottom).
WebP is an open file format that is promoted by Google as a space-efficient image format. It can store lossy images (and thus competes with JPG). It can store lossless images, and thus competes with PNG. Finally, it can store animations, and thus competes with GIF. WebP has been around since 2010, and is supported by all major browsers. It is thus to some degree established, but in no way as established as JPG. This is also evident from the fact that its successor, AVIF, is already in the making. WebP is an open browser format.

For archiving, WebP can be interesting because it consumes less space than other formats. In my experience, a lossless conversion from PNG to WebP, for example, often consumes just 10% of the disk space. To convert a PNG file losslessly to a WebP file, use cwebp, and type
cwebp input.png -lossless -o output.webp
For JPEG, the space gain is equally impressive: The WebP file size is 30% of the JPEG file size, while the pictures are nearly identical (see example). The problem is that cwebp (1) does not follow the EXIF rotation of the picture and (2) does not transfer the geotagging correctly. If you want to convert JPEG to WebP, you thus need Exiv2 (for Windows, download the version that is confusingly called “msvc64”) and ImageMagick. Both are free and open source programs. Then execute the following sequence of commands:
magick input.jpg -auto-orient temp.jpg
cwebp temp.jpg -o temp.webp
exiv2 --force ex temp.jpg
exiv2 in temp.webp
move /Y temp.webp output.webp
del temp.*
The first line rotates the input picture correctly. The second line converts the rotated picture to WebP (the cwebp program has some parameters, but in my experience, the default parameters are optimal). The third line exports the meta-data of the rotated JPEG picture (including the geotagging), and the fourth line attaches it to the WebP picture. The last two lines move the file to the correct destination and erase the temporary files (this detour is necessary since the Windows exiv2 command does not read the command line arguments of the “--insert” option correctly).
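
For converting many pictures, the same sequence can be wrapped in a small script. The following is only a minimal Python sketch, assuming that magick, cwebp, and exiv2 are installed and can be found on the PATH. The function names jpeg_to_webp_commands and jpeg_to_webp are made up for this illustration. Instead of the Windows commands move and del, the sketch uses a temporary folder and Python's own file operations, so it is not tied to Windows.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def jpeg_to_webp_commands(src, workdir):
    """Build the four external commands from the text as argument lists."""
    temp_jpg = str(Path(workdir) / "temp.jpg")
    temp_webp = str(Path(workdir) / "temp.webp")
    return [
        ["magick", str(src), "-auto-orient", temp_jpg],  # rotate according to EXIF
        ["cwebp", temp_jpg, "-o", temp_webp],            # convert with default parameters
        ["exiv2", "--force", "ex", temp_jpg],            # export the metadata of temp.jpg
        ["exiv2", "in", temp_webp],                      # attach the metadata to temp.webp
    ]


def jpeg_to_webp(src, dst):
    """Run the commands in a temporary folder, then move the result to dst."""
    with tempfile.TemporaryDirectory() as workdir:
        for command in jpeg_to_webp_commands(src, workdir):
            subprocess.run(command, check=True)  # raises an error if a step fails
        # Stands in for "move /Y"; the temporary files vanish with the folder.
        shutil.move(str(Path(workdir) / "temp.webp"), str(dst))
```

A call such as jpeg_to_webp("input.jpg", "output.webp") would then perform the whole conversion. Building the commands in a separate function makes it possible to inspect them before anything is run.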

If the recipient of the picture cannot open WebP files, you can convert your WebP picture to PNG. PNG is lossless and established. The conversion to PNG is lossless (e.g., with Microsoft Paint), and thus you incur the conversion loss only when you convert to WebP, not when you convert from WebP.

AVIF

AVIF is an open file format that is promoted by the Alliance for Open Media (which includes heavyweights such as Google) as a successor to WebP. Just like WebP, AVIF is open, and it can store lossy images, lossless images, and animations.

AVIF is an open browser format. Although most browsers support the format, it is not as established as WebP, let alone JPG, GIF, or PNG. There are few programs that can open or edit AVIF files. Finally, the encoding is very slow: Saving a JPG image as AVIF can take several minutes. AVIF's strongest competitor is JPEG XL.

JPEG XL

JPEG XL is a competitor of AVIF and WebP that is developed by the Joint Photographic Experts Group (which invented JPG). Just like AVIF and WebP, JPEG XL is open, and it can store lossy images, lossless images, and animations.

There is an ongoing dispute about which format compresses images better at the same quality, AVIF or JPEG XL. However, there is at least one clear advantage of JPEG XL: It allows lossless compression of JPG images (a process called transcoding). This process reduces the file size by about 20%, but allows the bit-by-bit reconstruction of the original JPG file without any loss of quality. JPEG XL also has a number of other advantages that are less likely to impact everyday users.

JPEG XL was on a good path to become an open browser format. However, since JPEG XL competes with AVIF, and AVIF is promoted by Google, Google has decided to remove browser support for JPEG XL from Chromium (the software that forms the basis of many modern browsers, most notably Chrome). In other browsers, support for JPEG XL is hidden behind a flag (in Firefox, type "about:config" in the address bar, accept the risk, search for "jxl", and set it to "true"). There is also very limited software support for JPEG XL so far (e.g., no tool that can perform JPEG transcoding on Windows). All of this entails that JPEG XL is less established than AVIF.

RAW

Most digital cameras produce JPG images nowadays. JPG is a lossy file format. Some cameras allow getting hold of the original, lossless version of the picture. The file format is called RAW — even though it is not a single file format. Rather, each camera vendor has their own proprietary file format for RAW images.

Thus, while RAW images are lossless, they are also not good for everyday use. They have to be converted to an established format — usually JPG, TIFF, or PNG.

Recommendation

If you create new content, I recommend: The following are also safe for archiving:

If you have images in any other format, it makes sense to convert them to one of the above. One way to do that is to open the file (by double-clicking it), and then to choose “save as” or “export”. Then choose a target file format. You should keep the original files.

Compression formats

Compression

A compression file format makes a file consume less space on the disk (without losing information). An archiving file format combines several files into a single file. Many file formats combine both actions, and so the process itself is known as “archiving”, “compression”, or “packing”.

There are numerous file formats for archiving, for compression, and for both. They differ in many features, and in particular in their compression ratio. Since the difference in compression ratio depends on the content you compress, we concentrate on the other differences here.

ZIP

One of the most common archiving and compression formats is ZIP. ZIP was first released in 1989, and it is supported natively by all major operating systems. It is thus one of the most established file formats of all. It is so prevalent that “to zip” has come to mean “to compress with ZIP”.

Technically, ZIP is a proprietary file format, because it is developed by PKWARE. However, there are no known license issues, and the format is so ubiquitous that it can be considered quasi-open.
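
To illustrate how ZIP combines archiving and lossless compression, here is a minimal sketch with the zipfile module from Python's standard library. The file names and contents are invented for the example, and everything happens in memory rather than on disk.

```python
import io
import zipfile

# Two made-up files with repetitive content, which compresses well.
files = {
    "diary.txt": b"Dear diary, today I archived my files.\n" * 200,
    "notes.txt": b"ZIP combines archiving and compression.\n" * 200,
}

# Pack both files into a single ZIP archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name, data in files.items():
        archive.writestr(name, data)

archive_size = len(buffer.getvalue())
original_size = sum(len(data) for data in files.values())

# Unpack again. The directory of the archive lists the original sizes.
with zipfile.ZipFile(buffer) as archive:
    listed_sizes = {info.filename: info.file_size for info in archive.infolist()}
    restored = {name: archive.read(name) for name in archive.namelist()}

print("archive is smaller:", archive_size < original_size)
print("content is identical:", restored == files)
```

The archive is a single file that contains both documents, and unpacking returns exactly the bytes that went in. How much space is saved depends on the content.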

BZIP2

BZIP2 is an open file compression format. It was designed to be more space-efficient than ZIP — much like a plethora of other compression formats. BZIP2 can store only one file per archive. It also requires the installation of extra software on non-Linux machines. Even on Linux machines, there is no way to know the decompressed size without actually decompressing the data. Thus, the format falls behind ZIP in maturity.
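
The single-file nature of BZIP2 can be seen in the following minimal sketch, which uses the bz2 module from Python's standard library. The sample data is invented for the example. Note that the sketch compresses a bare stream of bytes: no file name and no list of contents are involved, and the decompressed size is measured only after decompressing.

```python
import bz2

# Made-up, repetitive sample data.
data = b"The same sentence, repeated many times.\n" * 1000

compressed = bz2.compress(data)        # one stream in, one stream out
restored = bz2.decompress(compressed)  # the size is known only after this step

print("compressed bytes:", len(compressed))
print("restored bytes:", len(restored))
print("content is identical:", restored == data)
```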

TAR+GZIP

TAR is a pure archiving format. It stores several files in a single file without compressing them. The archive file is then often compressed using GZIP. This yields files with the suffix “.tar.gz”. Both TAR and GZIP are open file formats and extremely popular on Linux systems. The GZIP format is also used in HTTP compression on the Internet.

That said, the format is a bit cumbersome to use. On Linux systems, the magic formula to uncompress a TAR+GZIP file is
tar -xzf file
On non-Linux systems, additional software is required. Thus, TAR+GZIP is not established in our sense.
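
The two-step nature of the format (first archive with TAR, then compress with GZIP) can be sketched with the tarfile module from Python's standard library. This is only an illustration with invented file names, done in memory. The mode "w:gz" performs both steps at once, and "r:gz" reverses them.

```python
import io
import tarfile

# Made-up files to archive.
files = {
    "letter.txt": b"A letter worth keeping.\n" * 100,
    "recipe.txt": b"Flour, water, salt.\n" * 100,
}

# Write a .tar.gz: TAR bundles the files, GZIP compresses the bundle.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    for name, data in files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))

# Read it back, which is roughly the job of the tar command shown above.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:gz") as archive:
    restored = {
        member.name: archive.extractfile(member).read()
        for member in archive.getmembers()
    }

print("content is identical:", restored == files)
```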

RAR

RAR is a proprietary file format for archiving and compression. The format is widely used and supported, and can thus be considered established.

At the same time, the format is not open. RAR files can be created only with the commercial software WinRAR and RAR, and with other software that has written permission from the creator of RAR. Thus, RAR falls behind ZIP for archiving purposes.

7Z

7Z is an open file format for compression and archiving. It is not as ubiquitous and well-supported as ZIP, and thus falls behind ZIP in terms of maturity. On the other hand, it offers better compression ratios than ZIP, as well as encryption with the AES-256 standard. AES is the most widely used symmetric encryption method nowadays, and AES-256 is the state-of-the-art variant of it.

Recommendation

ZIP is established and quasi-open. All other formats are either less established or less open. Thus, there is generally no reason to deviate from ZIP. The only interesting alternative in my view is 7Z. It is less established, but truly open, it supports AES-256 encryption, and it offers better compression ratios than ZIP.

If you have archive files in any other format that you care about, it may make sense to convert them to ZIP (or 7Z): Unpack them, and then re-pack them to ZIP. You do not need to keep the originals.
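
The unpack-and-repack step can be automated. Here is a minimal Python sketch that uses only the standard library; the function name repack_as_zip is made up for this illustration. It assumes that the source archive is in a format that Python's shutil module can unpack on its own (such as .tar.gz or .tar.bz2). RAR and 7Z files are not covered and would need additional software.

```python
import shutil
import tempfile
import zipfile


def repack_as_zip(archive_path, zip_base):
    """Unpack archive_path into a temporary folder and re-pack it as zip_base + '.zip'."""
    with tempfile.TemporaryDirectory() as workdir:
        # The format is guessed from the file extension.
        shutil.unpack_archive(str(archive_path), workdir)
        zip_path = shutil.make_archive(str(zip_base), "zip", root_dir=workdir)
    # Check the new archive before the old one is thrown away.
    with zipfile.ZipFile(zip_path) as archive:
        damaged = archive.testzip()
    if damaged is not None:
        raise ValueError("damaged file in new archive: " + damaged)
    return zip_path
```

A call such as repack_as_zip("old.tar.gz", "old") would create old.zip next to the original. The integrity check at the end is a design choice of this sketch: since the originals are not kept, it seems prudent to verify the new archive first.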