In recent years, as the need for long-term archiving and long-term preservation of “Digital Born” information has increased, so has the call for a universal archiving format. The idea is to have a file format that lets you permanently store and read information without any loss.
If possible, this includes even the “graphical” information, in case e.g. the text color or the font used conveys meaning. Enter PDF/A, the solution to it all. Or not?
The two most important words in my first sentence are “Digital Born”. “What, it’s not the solution to all our archiving needs?” you might ask. I’ll try to explain.
For starters, there are a lot of misconceptions or myths with regard to the format:
- We can convert everything to PDF/A
- We should convert every file that we need to archive to PDF/A
- PDF/A will help us to prove that the file has not been modified during the retention period
Nowhere have I found any information or article to contradict these misconceptions.
Should you convert every file to PDF/A?
If the retention period that applies to the file does not create a risk with regard to the legibility of the file (“Will I still be able to open or read the file in 5 years’ time?”), then the answer is no.
PDF/A will help us to prove that the file has not been modified during the retention period:
As with any other PDF file, the file can be edited. We once even received a request from a customer to create a custom module that would add a specific header to the PDF/A, with information that wasn’t included in the original files.
For one reason or another, a PDF/A file is often mistaken for a PDF file that has been digitally sealed (sealed, not signed). If you want to prove that your file has not been tampered with during the retention period, you will need one of the following:
- Digital Sealing (can be achieved via this solution)
- Via Audit Trailing on your ECM system
- Via auditing on your Trustworthy Repository
And this is a major issue for a lot of organizations, as they very often have stored tons of files as PDF (and not as PDF/A).
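As a rough illustration of the audit-trail idea, here is a minimal Python sketch that records a SHA-256 digest when a file is ingested and compares it again later. The function names and the record layout are hypothetical, not taken from any ECM product, and a bare digest is not a qualified digital seal: it only helps if the record itself is stored somewhere trustworthy.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_audit_record(path: Path) -> dict:
    """Build a hypothetical audit-trail entry for a file at ingest time."""
    return {
        "file": path.name,
        "sha256": sha256_of(path),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def is_unchanged(path: Path, record: dict) -> bool:
    """Compare the file's current digest with the one stored at ingest."""
    return sha256_of(path) == record["sha256"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "contract.pdf"
        sample.write_bytes(b"%PDF-1.4\noriginal content\n")
        record = make_audit_record(sample)
        print(json.dumps(record, indent=2))
        print(is_unchanged(sample, record))  # file untouched since ingest
        sample.write_bytes(b"%PDF-1.4\nedited content\n")
        print(is_unchanged(sample, record))  # file edited after ingest
```

Note that the check works the same way for a PDF, a PDF/A, or any other format, which is exactly the point: the file format plays no role in proving integrity.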
PDF/A was created as a format, and quickly became an ISO standard, in which all the information needed for a correct visual representation of the original file is stored in the file itself.
This means that, independent of the viewing application or the client environment used, the file should look identical now and in, let’s say, 50 years.
This was mainly aimed at digital-born documents, often digital-born text documents.
Converting a file to PDF/A-1a or PDF/A-1b (the most widespread versions of the PDF/A format) imposes some requirements and limitations on the source format.
In short, these are:
- The fonts should be embedded (the fonts used are stored in the file itself).
- Device-independent color schemes should be used.
- Extensible Metadata Platform (XMP) metadata should be included.
The limitations (mainly in PDF/A-1) are that the following are not allowed:
- LZW Compression (replaced by ZIP)
- Embedded files (allowed in PDF/A-2 and PDF/A-3)
- External content references
- PDF Transparency
Creating a correct file is more than just adding a “flag”: it should be checked whether the file complies with the requirements and limitations.
One has to be aware that the original (digital) file is often needed in order to create a target file that meets all requirements.
The file should also be created using a tool (or machine) that has all the necessary information available (such as the fonts). This page contains a very good description of most of the ways in which a correct PDF/A can be created.
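To make the difference between a “flag” and actual compliance a bit more concrete, here is a deliberately crude Python sketch. It reads the PDF/A part a file claims in its XMP metadata (the `pdfaid:part` property) and then scans the raw bytes for two of the features listed above as not allowed in PDF/A-1. The function names are made up for this example, and byte scanning is only a heuristic: it is in no way a substitute for a real PDF/A validator, which parses the file and checks every requirement.

```python
import re

# Byte markers for two features listed above as not allowed in PDF/A-1.
# Assumption: they appear uncompressed in the file, which is not guaranteed.
FORBIDDEN_MARKERS = {
    "LZW compression": b"/LZWDecode",
    "embedded files": b"/EmbeddedFile",
}

# The XMP packet declares the claimed part either as an element,
# <pdfaid:part>1</pdfaid:part>, or as an attribute, pdfaid:part="1".
PDFA_FLAG = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")


def claimed_pdfa_part(data: bytes):
    """Return the PDF/A part the file claims to be, or None if no flag is found."""
    match = PDFA_FLAG.search(data)
    return int(match.group(1)) if match else None


def forbidden_features(data: bytes) -> list:
    """Return the names of forbidden PDF/A-1 features whose markers occur in the bytes."""
    return [name for name, marker in FORBIDDEN_MARKERS.items() if marker in data]


if __name__ == "__main__":
    # A fabricated fragment: flagged as PDF/A-1, yet it uses LZW compression.
    fragment = b"<pdfaid:part>1</pdfaid:part> ... << /Filter /LZWDecode >>"
    print(claimed_pdfa_part(fragment))
    print(forbidden_features(fragment))
```

A file like the fabricated fragment carries the flag but would still fail validation, which is why the flag on its own should never be trusted.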
The most commonly used version of PDF/A is PDF/A-1, which is based on PDF 1.4. If you want to convert a PDF file that uses a later PDF version than 1.4 to PDF/A-1a or PDF/A-1b, you will have to remove all features that are not already part of the PDF 1.4 format.
As PDF/A-2 is based on PDF 1.7, that information loss is reduced, but it is still not good archiving practice to store your documents as PDF/A-2 without keeping the original format.
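As a small sketch of how you might spot candidates for such feature loss, the Python snippet below reads the version from the `%PDF-x.y` header at the start of a file. The function names are hypothetical. Keep in mind that the header is only a hint: a `/Version` entry in the document catalog can declare a later version than the header, and a file with a 1.7 header does not necessarily use any feature beyond PDF 1.4.

```python
import re

# A PDF file starts with a header such as "%PDF-1.4" or "%PDF-1.7".
HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data: bytes):
    """Return the (major, minor) version from the PDF header, or None if absent."""
    match = HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def may_lose_features_in_pdfa1(data: bytes) -> bool:
    """True if the header declares a version later than PDF 1.4."""
    version = pdf_header_version(data)
    return version is not None and version > (1, 4)


if __name__ == "__main__":
    for header in (b"%PDF-1.4\n", b"%PDF-1.7\n"):
        print(pdf_header_version(header), may_lose_features_in_pdfa1(header))
```

In practice you would pass in the first kilobyte of the file, e.g. `open(path, "rb").read(1024)`, and use the result only to decide which files deserve a closer look before conversion.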
Why shouldn’t we convert all files to PDF/A?
There are several reasons not to do so. This might sound a bit bizarre, as we offer our own solution to convert files to PDF/A, and to correct and validate them. But our DocShifter solution also allows your files to be converted to other formats.
- Image Files: There is not much added value in converting image files (e.g. TIFF files) to PDF/A. These files generally already are a correct graphical and content representation of themselves. In fact, it works the other way around: if other file types cannot be converted correctly to PDF/A, they are often converted to an image and stored as a last resort.
- Scanned Documents: Scanned documents are, in the end, image files (unless they are converted to a document via OCR techniques). So the previous bullet point on images applies.
“Functional” or Compliance Limitations
- Demonstrating the file has not been tampered with. This is not something that can be achieved via PDF/A, but via solutions that either create an audit trail for the file or add a digital seal to the document.
- Audit trailing: This can be done via an ECM solution with the right functionalities or via a solution like Kazeon.
- Digital Sealing: A qualified digital seal can prove that the file has not been tampered with, under the right conditions.
- Ensuring legibility of the digitally preserved file: The international standard in this field is the OAIS standard. OAIS proposes a set of best practices that should be implemented but does not impose any specific format.
- The most important thing to know is that you always, ALWAYS, need to preserve the original format. The migrated (or converted) format is there to make sure you are still able to read the information. Over time the “secondary” format might change. Also, at any time, you need to be able to verify or validate the secondary format against the primary format.
What if anything went wrong with the conversion? What will you do then? Always keep the original file format!!! Always be original!