In recent years, as the need for long-term archiving and long-term preservation of “digital born” information has increased, so has the call for a universal archiving format. The idea is to have a file format that lets you permanently store and read information without any loss, ideally including the “graphical” information, in case, for example, the text color or the font used conveys meaning. Enter PDF/A, the solution to it all. Or not? “What, it’s not the solution to all our archiving needs?” you might ask. The two most important words in my first sentence are “digital born”. I’ll try to explain.
For starters, there are a lot of misconceptions or myths with regard to the format:
- We can convert everything to PDF/A
- We should convert every file that we need to archive to PDF/A
- PDF/A will help us to prove that the file has not been modified during the retention period
Nowhere have I found any information or article to contradict these misconceptions.
Should you convert every file to PDF/A?
The question is not whether you should convert the original file to PDF/A, but whether you should convert it to an archiving format at all. If the retention period that applies to the file does not put its legibility at risk (“Will I still be able to open or read the file in 5 years’ time?”), then the answer is no.
PDF/A will help us to prove that the file has not been modified during the retention period.
Unfortunately, it will not. It is a general misconception that PDF/A creates a non-repudiable, unalterable, immutable version of the document. Like any other PDF file, a PDF/A file can be edited. We once even received a request from a customer to create a custom module that would add a specific header to the PDF/A, with information that wasn’t included in the original files. For one reason or another, a PDF/A file is often mistaken for a PDF file that has been digitally sealed (sealed, not signed). If you want to prove that your file has not been tampered with during the retention period, you will need one of the following:
- Digital sealing (can be achieved via this solution)
- Audit trailing on your ECM system
- Auditing on your Trustworthy Repository
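To make the idea of tamper evidence concrete, here is a minimal sketch of a fixity check: record a SHA-256 digest when the file enters the archive and compare it later. The function names `sha256_of` and `is_unchanged` are made up for this illustration. A bare digest is not a qualified digital seal and not an audit trail; it only helps if the recorded value itself is stored somewhere trustworthy.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 64 KiB chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_digest: str) -> bool:
    """Compare the current digest with the one recorded at ingest."""
    return sha256_of(path) == recorded_digest
```

The point of the sketch is that the proof lives outside the file format: it works the same for a PDF, a PDF/A or the original Word document.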
Can we convert everything to PDF/A?
No, you can’t. Some file formats simply cannot be converted to PDF/A. Most surprisingly, a lot of PDF files cannot be properly converted to PDF/A either. This is a major issue for many organisations, as they have very often stored tons of files as PDF (and not as PDF/A).
PDF/A was created as a format, and quickly became an ISO standard, that stores everything needed for a correct visual representation of the original file in the file itself. This means that, regardless of the viewing application or client environment used, the file should look identical now and in, let’s say, 50 years. It was mainly aimed at digital born documents, often digital born text documents. Converting a file to PDF/A-1a or PDF/A-1b (the most widespread versions of the PDF/A format) imposes some requirements and limitations on the source format.
In short, these are:
- Fonts must be embedded (the fonts used are stored in the file itself).
- Device-independent color spaces must be used.
- Metadata must be stored as Extensible Metadata Platform (XMP) metadata.
The limitations (mainly in PDF/A-1) are:
The files should not contain any:
- LZW compression (replaced by ZIP, i.e. Flate, compression)
- Embedded files (PDF/A-2 allows embedded PDF/A files; PDF/A-3 allows embedded files of any type)
- External content references
- PDF Transparency
Creating a correct PDF/A file is more than just adding a “flag”: it should be checked whether the file complies with the requirements and limitations. Be aware that the original (digital) file is often necessary in order to create a target file that meets all requirements. The file should also be created using a tool (or machine) that has all the necessary information (such as the fonts). This page contains a very good description of most of the ways in which a correct PDF/A can be created.
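The difference between the “flag” and real compliance can be illustrated with a rough sketch. The flag is the `pdfaid:part` property in the XMP metadata; the sketch below reads that claim and then does a naive byte search for two of the PDF/A-1 limitations listed above. The names `quick_scan`, `SUSPECT_MARKERS` and `PDFA_CLAIM` are invented for this example. This is an assumption-laden heuristic, not a validator: markers hidden in compressed object streams are missed, and fonts, color spaces and transparency are not checked at all.

```python
import re

# Byte markers that hint at features PDF/A-1 does not allow.
# Assumption: a plain byte search; anything inside compressed object
# streams will be missed, so this is a hint, not a validation.
SUSPECT_MARKERS = {
    b"/LZWDecode": "LZW compression",
    b"/EmbeddedFiles": "embedded files",
}

# The PDF/A "flag" is an XMP property, written as an attribute
# (pdfaid:part="1") or as an element (<pdfaid:part>1</pdfaid:part>).
PDFA_CLAIM = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")


def quick_scan(data: bytes) -> dict:
    """Report the claimed PDF/A part (if any) and suspicious markers found."""
    match = PDFA_CLAIM.search(data)
    claimed = int(match.group(1)) if match else None
    findings = [label for marker, label in SUSPECT_MARKERS.items() if marker in data]
    return {"claimed_part": claimed, "findings": findings}
```

A file for which this returns a claimed part of 1 together with a non-empty list of findings carries the flag but most likely breaks the rules behind it. A clean result proves nothing either way; only a dedicated PDF/A validator can confirm conformance.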
The most commonly used version of PDF/A is PDF/A-1, which is based on PDF 1.4. If you want to convert a PDF file created in a later PDF version to PDF/A-1a or PDF/A-1b, you will have to remove all features that are not already part of the PDF 1.4 format. The information loss is smaller with PDF/A-2, which is based on PDF 1.7, but it is still not good archiving practice to store your documents as PDF/A-2 without keeping the original format.
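As a first, rough indication of whether a PDF goes beyond PDF 1.4, you can read the version declared in its `%PDF-x.y` header. The sketch below does only that; `header_version` and `exceeds_pdf_1_4` are hypothetical names. Treat the result as a hint: the version can also be overridden by a `/Version` entry in the document catalog, and a file declaring a later version does not necessarily use any later features.

```python
import re

# The header is expected near the start of the file, e.g. b"%PDF-1.7".
HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def header_version(data: bytes):
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    match = HEADER.search(data[:1024])
    return (int(match.group(1)), int(match.group(2))) if match else None


def exceeds_pdf_1_4(data: bytes) -> bool:
    """True when the header declares a version newer than PDF 1.4."""
    version = header_version(data)
    if version is None:
        raise ValueError("no %PDF header found")
    # Tuple comparison: (1, 7) > (1, 4) and (2, 0) > (1, 4).
    return version > (1, 4)
```

Files for which `exceeds_pdf_1_4` returns `True` are the ones most likely to lose information on their way to PDF/A-1, so they are the first candidates for a closer look before conversion.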
Why shouldn’t we convert all files to PDF/A?
So, if PDF/A offers so many advantages, why shouldn’t we convert all files to PDF/A?
There are several reasons not to. This might sound a bit bizarre, as we offer our own solution to convert files to PDF/A, and to correct and validate them. But our DocShifter solution also allows your files to be converted to other formats.
- Image Files: There is not much added value in converting image files (e.g. TIFF files) to PDF/A. These files generally already are a correct graphical and content representation of themselves. In fact, when other file types cannot be converted correctly to PDF/A, they are often, as a last resort, converted to an image and stored in a PDF/A container.
- Scanned Documents: Scanned documents are, in the end, image files (unless they are converted to a document via OCR techniques). So the previous bullet point applies to them as well.
“Functional” or Compliance Limitations
As said earlier in this post, people consider PDF/A a best practice format for archiving purposes. It is usually expected to serve one of the following goals:
- Demonstrating the file has not been tampered with. This is not something that can be achieved via PDF/A, but via solutions that either create an audit trail for the file or add a digital seal to the document.
- Audit trailing: This can be done via an ECM solution with the right functionalities or via a solution such as Kazeon.
- Digital Sealing: A qualified digital seal can prove that the file has not been tampered with, under the right conditions.
- Ensuring legibility of the digitally preserved file: The international standard in this field is the OAIS standard. OAIS proposes a set of best practices that should be implemented, but does not impose any specific format. The most important thing to know is that you always, ALWAYS, need to preserve the original format. The migrated (or converted) format is there to make sure you are still able to read the information. Over time the “secondary” format might change. Also, at any time, you need to be able to verify or validate the secondary format against the primary format.
What if something went wrong with the conversion? What will you do then?
Always keep the original file format!!! Always be original!