To understand the capabilities of ChatGPT-4V (aka GPT-4V), a bit of context is needed around LLMs and multimodal LLMs.
Large language models (LLMs) like the original version of ChatGPT, based on the GPT-3.5 architecture, are designed to process and generate human-like text. Most current LLMs are text-only, meaning they excel only at text-based applications and have limited ability to understand other types of data.
A more sophisticated type of LLM, called Multimodal LLM, combines additional data types, such as audio, images, and video, along with the text. Integrating multimodality into LLMs solves for limitations experienced with current text-only models and paves the way for new, innovative applications.
ChatGPT-4 (GPT-4) is the most popular example of a multimodal LLM. In early 2023, OpenAI announced GPT-4 with vision (GPT-4V). It can handle prompts incorporating text and images, enabling users to specify any vision or language task. GPT-4V has proven its value across multiple domains, including diagrams, documents that contain text, photographs, and screenshots. It can also generate responses like code and natural language.
Prior to GPT-4, ChatGPT-3.5 already performed well in analyzing documents by leveraging its advanced NLP capabilities that enable it to understand the context of text data within a document. Whether ingesting a PDF file or a Word document, GPT-3.5 has always been able to read and understand the content as well as extract key insights and summarize information. That makes it well-suited to analyzing large volumes of text and data rapidly and with high accuracy. It also helps by identifying themes and providing insights.
In the world of IDP, LLMs can greatly improve data extraction accuracy because of their advanced language processing capabilities. Imagine having the ability to eliminate OCR errors and gaining perfect entity extraction due to a solution understanding all presented information in context. This can have significant implications for industries that rely heavily on document processing, such as finance, insurance, and healthcare.
Another way that LLMs can improve IDP solutions is through better document classification. LLMs are capable of understanding context and nuance in language, which makes them ideal for performing core IDP tasks like identifying and categorizing different types of documents.
GPT-4 and GPT-4V not ready for prime-time IDP
In a 2023 study, called How Is ChatGPT’s Behavior Changing over Time?, users tested the capability of GPT-4 between March and June of 2023. Researchers observed that GPT-4 dropped significantly in its response accuracy rate, from 97.6% accuracy in March to just 2.4% accuracy in June. One theory for the sharp decline in accuracy is that OpenAI might be using smaller and specialized GPT-4 models that act similarly to a large model but are less expensive to run. This cheaper and faster option might be leading to a drop in the quality of GPT-4 responses.
The researchers noted that the lower accuracy may also be due to a change in the behavior of GPT-4 based on a recent update. They pointed out that a behavior change doesn’t equate to a reduction in capability, stating, “a model that has a capability may or may not display that capability in response to a particular prompt.” This focus on the prompt means that users will have to become better at the prompts they use to obtain the best responses (or desired responses) from GPT-4 and GPT-4V.
Another possible cause for the decline in response accuracy might be a move away from Microsoft Azure AI supercomputers. According to Digital Trends, “When GPT-4 was first announced, OpenAI detailed its use of Microsoft Azure AI supercomputers to train the language model for six months, claiming that the result was a 40% higher likelihood of generating the ‘desired information from user prompts.’”
Regardless of the cause, the underlying takeaway is that an IDP solution can benefit from LLMs but should not rely on them as the core to the IDP solution. LLMs still need a lot of help to do their work effectively. This is especially true in areas like banking compliance and healthcare where reliance on incorrect responses can lead to horrible outcomes.
Another drawback to GPT-4V is that the model does store the data that users put into it, subsequently tapping that data to improve the model’s accuracy. This is unacceptable in industries like banking and healthcare where much of the data must be kept private. While there are a few exceptions to this use of users’ data inputs, most GPT-4V services present too much risk around data privacy for many organizations.
Mitigating IDP risk when using GPT-4V
It’s clear that any IDP solution that taps into GPT-4V must incorporate its own human-in-the-loop (HITL) capabilities to validate the model(s) and the responses to a prompt. HITL features are inherent to the best IDP solutions that already leverage AI and machine learning. Having humans review results will help to guarantee that an organization is not experiencing a lot of responses that are incorrect as a result of hallucinated data – a common problem with generative AI solutions like GPT-4V.
Of course, you don’t want too much human intervention as that defeats the whole purpose of streamlining IDP processes via AI and automation. Instead, your IDP solution should reflect a significant amount of domain expertise related to the subject matter of your documents (e.g., banking instruments, contracts, invoices). For example, WorkFusion’s Work.AI platform has built into it a critical mass of knowledge in the area of financial crime compliance for the banking industry. It also reflects the notion that our own FinCrime compliance experts have provided enough teachings to our IDP model so that we have properly engineered the right prompts to ensure GPT-4V yields the most useful responses when reviewing documents presented by banking compliance teams.
By consuming multimodal LLMs like GPT-4V within a platform that orchestrates all the tools, subject matter expertise, and HITL features needed to solve your industry-specific problems, IDP can indeed be improved via LLMs.
Find out why WorkFusion was recognized as a leader in IDP for five consecutive years and is a leader in unstructured document processing in Everest Group’s PEAK Matrix Assessment. Download the full report here.