The Electric Power Research Institute (EPRI) on Dec. 9 released first-of-its-kind, domain-specific benchmarking results for the electric power sector.

"This initial application included multiple-choice and open-ended questions rooted in real-world utility topics, providing a more realistic view of how large language models (LLM) perform. Results indicate expert oversight remains imperative, especially with open-ended questions, which could result in less than 50% accuracy in some cases," EPRI said.

Many existing benchmarks assess broad academic knowledge, such as math, science, and coding, and may not capture the operational and contextual complexity of real-world utility environments, EPRI said.

Benchmarking with electric power-specific questions, such as inquiries related to generation, transmission, and distribution assets, helps assess how well LLMs understand and respond to the technical, regulatory, and operational questions that utilities face, it noted.

“As utilities integrate AI into power system planning and operations, this benchmarking establishes a critical foundation for evaluating domain-specific tools and models. Accuracy is paramount, as errors can lead to significant operational and reliability consequences,” said EPRI Vice President of AI Transformation and Chief AI Officer Remi Raphael. “Independent benchmarking by EPRI ensures the utility industry can trust and act on unbiased, credible insight.”

Key takeaways from EPRI’s initial benchmarking report included:

  • Open-ended questions exposed a reliability gap. When the same questions were asked in open-ended form instead of as multiple-choice questions (MCQs), accuracy dropped by an average of 27 percentage points. On expert-level questions, top models scored only 46–71%.
  • MCQs provide a strong but incomplete baseline. On EPRI’s MCQs, leading frontier models scored 83–86%, broadly consistent with their performance on external math and science benchmarks, but these scores benefit from the structure of MCQs.
  • Open-weight models are closing the gap. These are LLMs whose trained parameters — known as weights — are publicly available. While typically one generation behind proprietary frontier systems, they are rapidly improving. Their ability to be self-hosted can give utilities valuable deployment flexibility.
  • Web search modestly improves accuracy. Allowing models to search the web boosted scores slightly (2–4%), while also introducing the risk of retrieving irrelevant or misleading information.

EPRI used a dataset of more than 2,100 questions and answers covering 35 power sector topics, generated by 94 power sector experts drawing from publicly available sources, including the institute’s reports.

The benchmarking tested capabilities in three phases, run reproducibly across multiple LLMs, including GPT-5, Grok 4, and Gemini 2.5 Pro.

Phase 1 measured model knowledge through multiple-choice questions, Phase 2 repeated the tests with web search enabled, and Phase 3 assessed open-ended responses using both internal knowledge and search. Each phase included three runs per model, with confidence intervals reported to capture variability.
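EPRI’s report does not publish its evaluation code, but the multi-run, per-phase protocol it describes can be illustrated with a short sketch. The Python example below is a hypothetical approximation only: the `ask_model` helper, the question format, and the simple exact-match scoring are assumptions for illustration, and the open-ended phase in practice would require grading free-form answers rather than exact matching.

```python
# Hypothetical sketch of a multi-run benchmark phase (not EPRI's actual code).
# Assumes an ask_model(model, prompt) helper that returns the model's answer string.
import statistics


def score_run(model, questions, ask_model):
    """Return the fraction of questions answered correctly in a single run."""
    correct = sum(
        1 for q in questions
        if ask_model(model, q["prompt"]).strip() == q["answer"]
    )
    return correct / len(questions)


def benchmark_phase(models, questions, ask_model, runs=3, z=1.96):
    """Run each model several times; report mean accuracy and a rough 95% interval."""
    results = {}
    for model in models:
        scores = [score_run(model, questions, ask_model) for _ in range(runs)]
        mean = statistics.mean(scores)
        # Standard error of the mean across runs; the interval is illustrative only.
        sem = statistics.stdev(scores) / len(scores) ** 0.5 if runs > 1 else 0.0
        results[model] = (mean, z * sem)
    return results
```

Reporting a mean over several runs, as the sketch does, is what lets a confidence interval capture run-to-run variability in model outputs.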

The effort stems from EPRI’s Open Power AI Consortium, launched earlier this year to drive the development and deployment of AI approaches tailored for the power sector, including future domain-augmented tools.

Future phases of EPRI’s benchmarking effort will build on this foundation by evaluating domain-augmented tools and models and expanding beyond generic tests into real utility applications.

The full report is available here: Benchmarking Large Language Models for the Electric Power Sector. An interactive site is available here: WattWorks: The Power Sector’s AI Benchmarking Hub.
 
