Key Takeaways:
- New benchmarks are being developed to evaluate the ability of large language models (LLMs) to perform legal work in the real world.
- Current LLMs show critical gaps in reliability for professional adoption; in one benchmark, the best-performing model scored only 37% on the most difficult legal problems.
- LLMs frequently make inaccurate legal judgments and often reach correct conclusions through incomplete or opaque reasoning processes.
- Professional benchmarks may still not capture the complexity of real-world legal work, which often involves subjective and challenging questions.
- LLMs may not be trained to think like lawyers, lacking a mental model of the world and the ability to simulate scenarios and predict outcomes.
Introduction to LLMs in Legal Work
The use of large language models (LLMs) in legal work has drawn growing attention in recent years, with many believing that these models have the potential to revolutionize the field. New benchmarks, however, aim to measure how well the models actually perform legal work in the real world. The Professional Reasoning Benchmark, published by Scale AI in November, evaluated leading LLMs on legal and financial tasks designed by professionals in those fields. The study found critical gaps in the models’ reliability for professional adoption: the best-performing model scored only 37% on the most difficult legal problems, meaning it earned just over a third of the possible points on the evaluation criteria.
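To make that kind of score concrete, here is a minimal sketch of how a rubric-style benchmark percentage can be computed. It is a hypothetical illustration, not Scale AI’s actual methodology; the Criterion fields, point weights, and example rubric items are invented. The idea is simply that each task is graded against expert-written criteria, and the reported figure is the share of possible points the model’s answer earned.

```python
# Hypothetical illustration of rubric-based scoring; not the actual methodology
# of the Professional Reasoning Benchmark or the AI Productivity Index.
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    points: int        # weight assigned by the expert who wrote the rubric
    satisfied: bool    # whether a grader judged the model's answer to meet it

def rubric_score(criteria: list[Criterion]) -> float:
    """Return the share of possible points earned, as a percentage."""
    possible = sum(c.points for c in criteria)
    earned = sum(c.points for c in criteria if c.satisfied)
    return 100.0 * earned / possible if possible else 0.0

# Invented example: one legal task graded on three expert-written criteria.
answer_rubric = [
    Criterion("Identifies the controlling statute", 2, satisfied=True),
    Criterion("Applies the correct limitations period", 2, satisfied=False),
    Criterion("Flags the key factual ambiguity for the client", 1, satisfied=False),
]

print(f"{rubric_score(answer_rubric):.1f}%")  # 40.0%, i.e. two of five possible points
```

Under a scheme like this, a 37% score means the model earned just over a third of the available points across the rubrics, not that it answered 37% of the questions correctly end to end.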
Limitations of Current LLMs
The study’s findings are consistent with other benchmarks that measure model performance on economically valuable work. The AI Productivity Index, published by the data firm Mercor in September and updated in December, found that the models have "substantial limitations" in performing legal work. The best-performing model scored 77.9% on legal tasks, meaning it satisfied roughly four out of five evaluation criteria. A model with such a score might generate substantial economic value in some industries, but it may be of little use in fields where errors are costly, underscoring the need for models accurate and reliable enough to do legal work.
Challenges of Legal Reasoning
Unlike math or coding, where LLMs have made significant progress, legal reasoning may be hard for the models to learn. The law deals with messy real-world problems, riddled with ambiguity and subjectivity, that often have no single right answer. Making matters worse, a lot of legal work isn’t recorded in ways that can be used to train the models. When it is, the relevant documents can span hundreds of pages, and the governing law is scattered across statutes, regulations, and court cases that sit in a complex hierarchy. That complexity makes legal reasoning difficult for LLMs to learn and apply, and overcoming it may require significant advances in natural language processing and machine learning.
Shortcomings of Current LLM Training
A more fundamental limitation of current LLMs may be that they are simply not trained to think like lawyers. "The reasoning models still don’t fully reason about problems like we humans do," says Julian Nyarko, a law professor at Stanford Law School. The models may lack a mental model of the world—the ability to simulate a scenario and predict what will happen—and that capability could be at the heart of complex legal reasoning. It’s possible that the current paradigm of LLMs trained on next-word prediction gets us only so far, and that new approaches are needed to develop LLMs that can truly think like lawyers.
Future Directions
Building more accurate and reliable LLMs for legal work may require significant advances in natural language processing and machine learning, and perhaps a fundamental shift in how the models are trained, with a focus on models that can simulate scenarios and predict outcomes. More comprehensive benchmarks that capture the complexity of real-world legal work will also be essential for evaluating progress. If those challenges can be addressed, LLMs may genuinely support lawyers and other legal professionals and help improve the efficiency and accuracy of legal services. For now, though, LLMs are not ready to replace human lawyers, and significant work remains before the models can truly think like lawyers.