The advent of large language models (LLMs) represents a transformative technological era. These models, like the one at the heart of the famous ChatGPT, are powered by deep learning and neural networks, which enables them to understand and generate human-like text. Their ability to process vast amounts of data has revolutionized natural language processing, content generation, customer service, and even coding, and companies across many sectors already benefit from their applications.
AI AND THE FINANCIAL INDUSTRY
For regulated industries like finance, large language models appear to be a beacon of hope. Chatbots, for example, hold the promise of swiftly extracting key figures and analyzing financial narratives. But is that actually the case?
In a twist of expectations, new research finds that GPT-4 and similar large language models cannot reliably answer questions derived from Securities and Exchange Commission (SEC) filings.
SEC filings contain data that is critical to the financial industry, so an AI model that could analyze them or answer questions from them correctly would be a game-changer. Yet the models evaluated in recent research by Patronus AI, a startup that studies how models generate answers and works toward responsible AI development, fell well short of that goal. The startup’s founders told CNBC that even OpenAI’s GPT-4-Turbo, the best-performing model configuration they evaluated, correctly answered only 79% of the questions on Patronus AI’s new test, despite being given nearly an entire filing to read alongside each question.
Anand Kannappan and Rebecca Qian, Patronus AI’s founders, explained that the test dataset comprises more than 10,000 questions and answers derived from SEC filings of major publicly traded companies. Four language models (Anthropic’s Claude 2, Meta’s Llama 2, and OpenAI’s GPT-4 and GPT-4-Turbo) were tested on a subset of those questions. The results were disappointing: some models gave wrong answers up to 70% of the time, even when pointed to where the right answers were. The founders also noted that the models refused to answer surprisingly often, even when the relevant context was provided. According to the founders, the test was meant to establish a “minimum performance standard” for language AI in the financial sector.
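Patronus AI has not published its evaluation code, but a minimal sketch can illustrate how accuracy and refusal rates are typically computed for a filings-QA benchmark of this kind. Everything below is an assumption for illustration, not Patronus AI’s actual method: the FilingsQA record layout, the model_fn callable standing in for a real API call, the refusal heuristic, and the naive substring grading rule.

```python
# Hypothetical sketch of scoring a filings-QA benchmark like the one
# described above. The dataset layout, model_fn stand-in, refusal
# heuristic, and grading rule are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class FilingsQA:
    question: str         # e.g. a question about a figure in a 10-K
    context: str          # the relevant excerpt (or nearly the full filing)
    expected_answer: str  # ground-truth answer derived from the filing

# Crude phrase list for detecting the "model declines to answer" case.
REFUSAL_PHRASES = ("i cannot", "i can't", "unable to", "not provided")

def is_refusal(answer: str) -> bool:
    text = answer.lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)

def score(dataset: list[FilingsQA],
          model_fn: Callable[[str, str], str]) -> dict:
    """Run each question through model_fn and tally correct / refused / wrong."""
    correct = refused = 0
    for item in dataset:
        answer = model_fn(item.question, item.context)
        if is_refusal(answer):
            refused += 1
        elif item.expected_answer.lower() in answer.lower():
            correct += 1  # naive substring match; real graders are stricter
    n = len(dataset)
    return {
        "accuracy": correct / n,
        "refusal_rate": refused / n,
        "error_rate": (n - correct - refused) / n,
    }

# Usage with a dummy model in place of a real LLM API call:
if __name__ == "__main__":
    data = [FilingsQA("What was FY2022 revenue?",
                      "...filing text...", "$4.2 billion")]
    print(score(data, lambda q, ctx: "Revenue was $4.2 billion."))
```

A harness along these lines would report the two figures the founders cite: the share of questions answered correctly and the share the model refused outright.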
Commenting on the poor results, co-founder Anand Kannappan said, “That type of performance rate is just absolutely unacceptable.” In his view, for AI models to be deployed in an “automated and production-ready way,” functionality and accuracy have to be much higher. Co-founder Rebecca Qian added that “there just is no margin for error that’s acceptable, because, especially in regulated industries, even if the model gets the answer wrong 1 out of 20 times, that’s still not high enough accuracy.”
As much as Kannappan and Qian believe in the promise of artificial intelligence, they advise companies to keep a human in place to analyze financial narratives and manage the workflow.