In our Intelligent Document Processing (IDP) platform, we make use of Machine Learning (ML) technology. An ML-based approach keeps the initial effort low while continuously improving its efficiency. This is where continuous training comes into play.
A more in-depth description of IDP was given in a previous article. As a short recap, these are the main steps of an IDP solution (a minimal code sketch follows the list):
- Document collection: scanning documents, collecting e-mails, reading WhatsApp messages, etc. This step also includes background tasks such as automated cleaning, denoising, cropping, and rotating of the document, and the application of OCR
- Classification: automatically categorize the documents into predefined categories
- Extraction: automatically extract relevant information such as names, addresses, numbers, etc.
- Validation: the extracted information is validated either automatically, through integrations with external systems, or by a human in the loop.
- Routing: once the information has been extracted and interpreted, it is routed to the correct destination, either a person or a line-of-business application.
- Handover: in the final step, the business process takes over and manages the rest of the information flow.
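To make these steps concrete, here is a minimal sketch of such a flow in Python. The `Document` class and the `classify`, `extract`, and `validate` functions are toy stand-ins for the actual ML microservices, not our real API:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """One incoming item (scan, e-mail, chat message) after OCR."""
    source: str                                  # e.g. "email", "scan"
    text: str                                    # OCR'd or raw text content
    category: str | None = None                  # set by classification
    fields: dict = field(default_factory=dict)   # set by extraction

# Toy stand-ins for the real ML microservices, just to make the flow runnable.
def classify(doc: Document) -> str:
    return "invoice" if "invoice" in doc.text.lower() else "unknown"

def extract(doc: Document) -> dict:
    return {"word_count": len(doc.text.split())}

def validate(doc: Document) -> bool:
    return doc.category != "unknown"

def pipeline(doc: Document) -> Document:
    """Steps 2-5 of the flow: classify, extract, validate, then route."""
    doc.category = classify(doc)
    doc.fields = extract(doc)
    if validate(doc):
        print(f"routed '{doc.category}' to a line-of-business application")
    else:
        print("sent to a human in the loop for further processing")
    return doc

pipeline(Document(source="email", text="Invoice 123 for 250 EUR"))
```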
Steps 2 and 3, classification and extraction, make extensive use of ML technologies. The reason for this is the vast variety of input documents and messages that arrive and need to be handled.
While you may ask your suppliers to send their invoices to an invoices@ mailbox and have customers send purchase orders to a PO@ mailbox, it is impossible to have everything pre-filtered.
And even if some filtering is already going on, your internal processing most likely still requires many different categories and handling rules.
As an insurer, you may use a generic claims@ mailbox. However, claims for car, healthcare, building, and life insurance are often handled by different departments and agents in your organization.
The categorization of information will be of tremendous help there. Additionally, the categorization of documents is important when deciding what information to extract from the data.
If the document type is an ID card, you want to extract different information from the document than when you are processing an invoice or an insurance claim.
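In practice, this often comes down to a mapping from document category to an extraction schema. A minimal sketch; the categories and field names below are made up for the example:

```python
# Hypothetical mapping from document category to the fields to extract.
# The categories and field names are illustrative, not our actual schema.
EXTRACTION_SCHEMAS = {
    "id_card":         ["surname", "given_names", "birth_date", "id_number"],
    "invoice":         ["invoice_number", "supplier", "total_amount", "due_date"],
    "insurance_claim": ["policy_number", "claim_date", "claim_type", "claimant"],
}

def fields_for(category: str) -> list[str]:
    """The classification result decides what extraction should look for."""
    return EXTRACTION_SCHEMAS.get(category, [])

print(fields_for("id_card"))  # ['surname', 'given_names', 'birth_date', 'id_number']
```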
Classification of information applies to multiple aspects of incoming data.
- What is the information about? E.g., an insurance claim for a car accident vs. an insurance claim on a life insurance policy
- What type of documents and information are we working with? An ID card, passport, European Accident Statement, invoice, …
The extraction of information applies to:
- Very specific information to extract: an invoice number, the name of a person or organization, a unique identifier, a barcode, an amount on a bank statement, etc.
- Contextual information: identifying specific parts of a document, terminology, or a combination thereof that can be used to further categorize or refine the categorization of information.
- Finding links between different data values in one or multiple documents. E.g., is the birth date on the ID card the same as the one on the birth certificate that was provided? (A minimal sketch of such a check follows this list.)
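A consistency check like the birth-date example can be expressed as a simple rule over the extracted fields. A minimal sketch, assuming extraction has already produced a field dictionary per document; the values are made up:

```python
from datetime import date

# Extracted fields per document; the values are illustrative.
id_card = {"birth_date": date(1985, 3, 7), "surname": "Janssens"}
birth_certificate = {"birth_date": date(1985, 3, 7)}

def consistent(a: dict, b: dict, key: str) -> bool:
    """Check that the same field matches across two documents."""
    return key in a and key in b and a[key] == b[key]

if not consistent(id_card, birth_certificate, "birth_date"):
    print("birth date mismatch: route to a human for review")
else:
    print("birth dates match across documents")
```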
Although we conceptually distinguish between classification and information extraction, the two are closely linked, and combining them allows us to further improve the results and the level of automation we can achieve.
The advantages of using ML
In our platform, we have a number of ML-based microservices that we use in our solutions.
Other solutions often use templates to recognize documents or extract information that is always in the same spot in a document.
Unfortunately, this doesn't scale very well. For every new type of document, you need to create a new template; otherwise, no information will be extracted at all.
Using an ML-based solution, you can “train” the system to identify and recognize documents, just like a human would, and interpret the information.
This means that a document with a changed layout, or an entirely new type of document, can still be recognized and processed. Maybe with a lower confidence, but the outcome is never just “true” or “false” as with a template-based solution.
Additionally, an ML-based solution can process unstructured data such as e-mails and direct messages, and classify it and extract relevant information without the need for any template or complex regular expressions.
Using ML does seem to be the way to go, as it solves common challenges of more traditional approaches, such as using templates and regular expressions to extract information, or classifying documents based on simple keywords.
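To illustrate the difference, the sketch below contrasts a brittle regex rule with a hypothetical ML classifier. `ml_classify` is a stand-in for a trained model, not a real library call:

```python
import re

email_body = "Hi, please find attached our invoice INV-2024-0042, due next month."

# Template/regex approach: works only while the pattern holds exactly.
match = re.search(r"INV-\d{4}-\d{4}", email_body)
print(match.group(0) if match else "nothing extracted")  # INV-2024-0042

# A slightly different supplier format breaks the rule completely:
other_body = "Invoice no. 2024/42 attached."
match = re.search(r"INV-\d{4}-\d{4}", other_body)
print(match.group(0) if match else "nothing extracted")  # nothing extracted

# An ML model returns a label with a confidence instead of a hard true/false.
def ml_classify(text: str) -> tuple[str, float]:
    ...  # hypothetical: in reality a tokenizer plus a trained classifier
    return ("invoice", 0.87)

label, confidence = ml_classify(other_body)
print(label, confidence)  # perhaps a lower confidence, but still a usable answer
```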
Challenges of using ML
Even though an ML-based approach has huge advantages, it is unfortunately not perfect. As always, the challenge with any ML-based technology is the availability of input data to train the model.
Before you can train an ML model, it is necessary to annotate data that can serve as examples for the model to learn from. The more input data is available, the better the expected outcome.
Creating annotated data, however, is rarely a fun task, and it can usually only be done by someone with the relevant domain expertise. As with any type of training a human receives, you need to be taught by someone knowledgeable. Learning is done by example, and this is no different for a typical ML model.
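What annotated data looks like depends on the tooling, but a common, simple representation is a text with labeled character spans. The structure below is purely illustrative, not a specific tool's schema:

```python
# One annotated training example: the raw text plus labeled character spans.
# The format is illustrative; annotation tools each have their own schema.
example = {
    "text": "Claim for policy P-99231, accident on 12/05/2023 in Ghent.",
    "label": "insurance_claim",                    # classification target
    "entities": [                                  # extraction targets
        {"start": 17, "end": 24, "label": "policy_number"},  # "P-99231"
        {"start": 38, "end": 48, "label": "accident_date"},  # "12/05/2023"
    ],
}

# Sanity check: the spans must point at the text they claim to label.
for ent in example["entities"]:
    print(ent["label"], "->", example["text"][ent["start"]:ent["end"]])
```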
An exception is Deep Learning, where the machine can learn from vast amounts of raw data. Yet this is not feasible for most information extraction purposes due to the lack of available data and the huge amount of processing power required.
Additionally, it is especially suitable for problems where it is possible to run a lot of simulations, such as specific games (see the examples of Deep Learning systems becoming Chess or Go champions).
An important challenge with using a ML-based solution is managing the expectations. It is important to know that:
- It requires an initial effort to create input data for training a first model
- The model will, as any human would, make mistakes.
Continuous Training:
Fortunately, there is a way to reap all the benefits of an ML-based approach while keeping the initial effort low and still continuously improving its efficiency. This is where continuous training comes into play.
Human in the loop:
First, we should clearly state our aim: to increase the level of automation and, especially, to reduce the amount of tedious work for the humans in the loop. We do not want to spend our time copying data fields or classifying documents. Our main goal is to provide Computer-Aided AI solutions, where the ML models try to pre-fill as much data as possible.
As such, the person processing the documents can quickly glance at what has been pre-filled, make corrections if needed and add missing data.
This is already a first optimization compared to a fully manual flow. The logical extension, which can often be implemented immediately, is to identify the documents for which not all desired information could be extracted, and to present only these to a real person for further processing.
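That routing decision is essentially a confidence threshold over the extracted fields. A minimal sketch, with made-up extraction results and an assumed threshold value:

```python
# Extraction output: each field comes with a model confidence (made-up values).
extracted = {
    "invoice_number": ("INV-2024-0042", 0.98),
    "total_amount":   ("250.00", 0.93),
    "due_date":       (None, 0.12),          # the model was not sure here
}

CONFIDENCE_THRESHOLD = 0.80  # assumed value; in practice tuned per field

def needs_human(fields: dict) -> bool:
    """Present a document to a person only if some field is missing or uncertain."""
    return any(value is None or conf < CONFIDENCE_THRESHOLD
               for value, conf in fields.values())

if needs_human(extracted):
    print("queue for human review: pre-filled fields shown, gaps highlighted")
else:
    print("fully automated: route straight to the business process")
```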
There is however an important distinction in how this is communicated:
- We implemented an AI-based solution that tries to help you, the agent, perform your work by taking away some of the tedious tasks of the job, so you can focus on the actual job at hand, e.g., helping our customers.
- We implemented an AI-based solution, and it's your job to fix the mistakes it made.
The latter framing puts the AI solution above the human in the loop instead of the other way around, which is not how it should be.
Continuous Training you say?
An ML project should not end after the initial training. All too often, an ML project starts with gathering data, training a model, and then deploying it into production once a certain success rate is reached. In reality, this results in degrading performance over time due to changes in the input.
As discussed, the human in the loop adds additional information and corrects the mistakes the ML model made. In doing so, you are actually creating additional labeled and corrected data, which in turn can be used to further improve any trained model.
When these additional annotations are taken into account, it becomes possible to continuously improve the quality of the trained ML models. Every extra annotation can, and will, further improve the quality of the model. This creates a positive spiral in which the amount of information that can be processed automatically keeps increasing.
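Put together, the feedback loop looks roughly like this. A minimal sketch; the correction store and the commented-out training call are hypothetical placeholders for the real services:

```python
corrections: list[dict] = []   # filled in by the human-in-the-loop UI

def record_correction(text: str, corrected_label: str) -> None:
    """Every fix an agent makes doubles as a new labeled training example."""
    corrections.append({"text": text, "label": corrected_label})

def maybe_retrain(min_examples: int = 100) -> None:
    """Kick off a new training iteration once enough corrections pile up."""
    if len(corrections) >= min_examples:
        print(f"retraining on {len(corrections)} new examples")
        # train_and_deploy(corrections)  # hypothetical training pipeline
        corrections.clear()

record_correction("Invoice no. 2024/42 attached.", "invoice")
maybe_retrain(min_examples=1)   # low threshold just for the demo
```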
Using this approach, we are effectively creating a self-service solution where new types of documents and data can be added to the flow and included in new iterations of the ML models.
Conclusions:
Implementing ML in an IDP flow has huge advantages for optimizing document processing. The biggest advantage is that only a small change is needed to enable continuous training of the platform, meaning that every correction or change made by a user contributes to a better-performing solution.