Choosing the Right LLM for the Job: How We Benchmarked Precision, Recall, and F1

When building an AI system that delivers accurate, structured product expertise, choosing the right large language model (LLM) isn’t just important - it’s mission-critical. Pick the wrong one and hallucinations creep in, and hallucinated entities mean garbage in, garbage out.

At Implicit, our platform relies on Named Entity Recognition (NER) to extract and structure technical details from unstructured content. This includes things like components, error types, actions, and affected systems - each tied to a strict taxonomy. It’s not open-ended chat; there are right and wrong answers. That’s why we didn’t just rely on hype or leaderboards. We ran a full internal benchmark across several open-source models to test which could actually deliver the precision we need in a high-stakes environment.

Here’s how we did it, and why LLaMA 8B came out on top.

What We Actually Evaluated: NER, Not Just Chat

While most LLM benchmarks focus on open-ended QA or summarization, our use case is more specific: structured extraction of entities that must align with our internal taxonomy. We evaluated each model’s ability to accurately identify and label the correct entities in a document - a classic NER task.

That means we were able to compare performance based on objectively right answers, not subjective human judgments.
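To make that concrete, here is a rough sketch of how a single prediction gets scored. The entity labels, spans, and document content below are hypothetical illustrations, not our actual taxonomy or data: the point is only that a model's output reduces to a set of (span, label) pairs that either match the gold annotations or don't.

```python
# Hypothetical example: labels and text are illustrative, not Implicit's real taxonomy.
# Each entity is a (surface span, taxonomy label) pair extracted from a document.

gold_entities = {
    ("fuel pump relay", "COMPONENT"),
    ("intermittent stall", "ERROR_TYPE"),
    ("replace relay", "ACTION"),
    ("ignition system", "AFFECTED_SYSTEM"),
}

predicted_entities = {
    ("fuel pump relay", "COMPONENT"),     # correct -> true positive
    ("intermittent stall", "COMPONENT"),  # wrong label -> false positive (gold pair missed)
    ("replace relay", "ACTION"),          # correct -> true positive
    ("wiring harness", "COMPONENT"),      # not in the gold set -> false positive
}

true_positives = gold_entities & predicted_entities
false_positives = predicted_entities - gold_entities
false_negatives = gold_entities - predicted_entities
print(len(true_positives), len(false_positives), len(false_negatives))  # 2 2 2
```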

The Metrics: What Do Precision, Recall, and F1 Mean?

To evaluate performance, we used three standard NER metrics:

Precision

Of all the entities the model predicted, how many were actually correct? High precision = fewer false positives (hallucinated or incorrectly tagged entities).

Recall

Of all the entities it should have predicted, how many did it actually find? High recall = fewer false negatives (missed relevant entities).

F1 Score

The harmonic mean of precision and recall: 2 × (precision × recall) / (precision + recall). It rewards models that are both accurate and thorough, and gives a balanced view of performance, especially when one metric is stronger than the other.
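As a minimal sketch of the arithmetic, here are the three metrics computed from raw counts. The counts plugged in at the end come from the toy example above, not from our benchmark data.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard entity-level NER metrics from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # fraction of predictions that were correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # fraction of gold entities that were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean of the two
    return precision, recall, f1

# Using the toy counts from the earlier sketch (2 TP, 2 FP, 2 FN):
print(precision_recall_f1(tp=2, fp=2, fn=2))  # (0.5, 0.5, 0.5)
```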

Our Benchmark Results

| Model | Precision | Recall | F1 Score |
| --- | --- | --- | --- |
| LLaMA 8B (Current) | 0.41 | 0.41 | 0.41 |
| Gemma 4B | 0.44 | 0.39 | 0.42 |
| Mistral 7B | 0.33 | 0.22 | 0.26 |
| LLaMA 3B | 0.35 | 0.36 | 0.35 |
| phi-4 | 0.55 | 0.19 | 0.29 |

Why We Stuck with LLaMA 8B

On paper, Gemma 4B slightly outperformed LLaMA 8B in F1. But in practice, it fell short in a few areas that matter in production:

Repetitive outputs

Often echoed the same entity or phrase multiple times (e.g., <output>start activity</output> repeated unnecessarily).

Schema drift

Occasionally generated tags not defined in our taxonomy, requiring additional post-processing (a validation sketch follows this list).

Formatting inconsistency

Struggled to follow output structure rules on the first attempt. LLaMA was much more reliable here.
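Schema drift in particular forces a guardrail between the model and anything downstream. Here is a minimal sketch of that kind of check, assuming a hypothetical ALLOWED_LABELS set and a simple tagged-output convention; neither is Implicit's real schema. Off-taxonomy tags are separated out rather than silently passed through.

```python
import re

# Hypothetical taxonomy; the real one is larger and stricter.
ALLOWED_LABELS = {"COMPONENT", "ERROR_TYPE", "ACTION", "AFFECTED_SYSTEM"}

# Assumed output convention for illustration: <LABEL>surface text</LABEL>
TAG_PATTERN = re.compile(r"<(?P<label>[A-Z_]+)>(?P<text>.*?)</(?P=label)>")

def validate_output(raw_output: str) -> tuple[list[tuple[str, str]], list[str]]:
    """Split model output into in-taxonomy entities and off-schema labels."""
    entities, drifted = [], []
    for match in TAG_PATTERN.finditer(raw_output):
        label, text = match.group("label"), match.group("text").strip()
        if label in ALLOWED_LABELS:
            entities.append((text, label))
        else:
            drifted.append(label)  # schema drift: tag not defined in the taxonomy
    return entities, drifted

entities, drifted = validate_output(
    "<COMPONENT>fuel pump relay</COMPONENT> <SEVERITY>high</SEVERITY>"
)
print(entities)  # [('fuel pump relay', 'COMPONENT')]
print(drifted)   # ['SEVERITY']
```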

phi-4 posted high precision but extremely low recall, meaning it often missed relevant entities altogether. Mistral 7B suffered from both low precision and low recall, while LLaMA 3B performed decently but lacked the robustness of the 8B variant.

And while LLaMA 4 was highly anticipated, we found its real-world performance disappointing. We’re not alone here. Community feedback has echoed similar concerns, and some have questioned whether published benchmarks were overly optimistic.

The Bottom Line

We selected LLaMA 8B because it delivered the most reliable, balanced performance for our domain-specific NER tasks. It plays nicely with our strict taxonomy, follows structural formatting instructions consistently, and avoids hallucinations that could lead to garbage in/garbage out.

For powering Implicit’s AI expertise engine, that consistency matters more than a few percentage points in one metric. Because when your customers are dealing with complex systems or edge-case issues, you don’t just want “smart.”

You want right. Every time.