Choosing the Right LLM for the Job: How We Benchmarked Precision, Recall, and F1

When building an AI system that delivers accurate, structured product expertise, choosing the right large language model (LLM) isn’t just important - it’s mission-critical. Pick the wrong one and hallucinations creep in, and hallucinated entities mean garbage in, garbage out.

At Implicit, our platform relies on Named Entity Recognition (NER) to extract and structure technical details from unstructured content. This includes things like components, error types, actions, and affected systems - each tied to a strict taxonomy. It’s not open-ended chat; there are right and wrong answers. That’s why we didn’t just rely on hype or leaderboards. We ran a full internal benchmark across several open-source models to test which could actually deliver the precision we need in a high-stakes environment.

Here’s how we did it, and why LLaMA 8B came out on top.

What We Actually Evaluated: NER, Not Just Chat

While most LLM benchmarks focus on open-ended QA or summarization, our use case is more specific: structured extraction of entities that must align with our internal taxonomy. We evaluated each model’s ability to accurately identify and label the correct entities in a document - a classic NER task.

That means we were able to compare performance based on objectively right answers, not subjective human judgments.
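To make that concrete, here is a rough sketch of how a single prediction gets scored. The entity labels, spans, and document content below are hypothetical illustrations, not our actual taxonomy or data: the point is only that a model's output reduces to a set of (span, label) pairs that either match the gold annotations or don't.

```python
# Hypothetical example: labels and text are illustrative, not Implicit's real taxonomy.
# Each entity is a (surface span, taxonomy label) pair extracted from a document.

gold_entities = {
    ("fuel pump relay", "COMPONENT"),
    ("intermittent stall", "ERROR_TYPE"),
    ("replace relay", "ACTION"),
    ("ignition system", "AFFECTED_SYSTEM"),
}

predicted_entities = {
    ("fuel pump relay", "COMPONENT"),     # correct -> true positive
    ("intermittent stall", "COMPONENT"),  # wrong label -> false positive (gold pair missed)
    ("replace relay", "ACTION"),          # correct -> true positive
    ("wiring harness", "COMPONENT"),      # not in the gold set -> false positive
}

true_positives = gold_entities & predicted_entities
false_positives = predicted_entities - gold_entities
false_negatives = gold_entities - predicted_entities
print(len(true_positives), len(false_positives), len(false_negatives))  # 2 2 2
```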

The Metrics: What Do Precision, Recall, and F1 Mean?

To evaluate performance, we used three standard NER metrics:

Precision

Of all the entities the model predicted, how many were actually correct? High precision = fewer false positives (hallucinated or incorrectly tagged entities).

Recall

Of all the entities it should have predicted, how many did it actually find? High recall = fewer false negatives (missed relevant entities).

F1 Score

The harmonic mean of precision and recall: 2 × (precision × recall) / (precision + recall). It rewards models that are both accurate and thorough, and gives a balanced view of performance, especially when one metric is stronger than the other.
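As a minimal sketch of the arithmetic, here are the three metrics computed from raw counts. The counts plugged in at the end come from the toy example above, not from our benchmark data.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard entity-level NER metrics from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # fraction of predictions that were correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # fraction of gold entities that were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean of the two
    return precision, recall, f1

# Using the toy counts from the earlier sketch (2 TP, 2 FP, 2 FN):
print(precision_recall_f1(tp=2, fp=2, fn=2))  # (0.5, 0.5, 0.5)
```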

Our Benchmark Results

| Model | Precision | Recall | F1 Score |
| --- | --- | --- | --- |
| LLaMA 8B (Current) | 0.41 | 0.41 | 0.41 |
| Gemma 4B | 0.44 | 0.39 | 0.42 |
| Mistral 7B | 0.33 | 0.22 | 0.26 |
| LLaMA 3B | 0.35 | 0.36 | 0.35 |
| phi-4 | 0.55 | 0.19 | 0.29 |

Why We Stuck with LLaMA 8B

On paper, Gemma 4B slightly outperformed LLaMA 8B in F1. But in practice, it fell short in a few areas that matter in production:

Repetitive outputs

Often echoed the same entity or phrase multiple times (e.g., <output>start activity</output> repeated unnecessarily).

Schema drift

Occasionally generated tags not defined in our taxonomy, requiring additional post-processing (a validation sketch follows this list).

Formatting inconsistency

Struggled to follow output structure rules on the first attempt. LLaMA was much more reliable here.
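Schema drift in particular forces a guardrail between the model and anything downstream. Here is a minimal sketch of that kind of check, assuming a hypothetical ALLOWED_LABELS set and a simple tagged-output convention; neither is Implicit's real schema. Off-taxonomy tags are separated out rather than silently passed through.

```python
import re

# Hypothetical taxonomy; the real one is larger and stricter.
ALLOWED_LABELS = {"COMPONENT", "ERROR_TYPE", "ACTION", "AFFECTED_SYSTEM"}

# Assumed output convention for illustration: <LABEL>surface text</LABEL>
TAG_PATTERN = re.compile(r"<(?P<label>[A-Z_]+)>(?P<text>.*?)</(?P=label)>")

def validate_output(raw_output: str) -> tuple[list[tuple[str, str]], list[str]]:
    """Split model output into in-taxonomy entities and off-schema labels."""
    entities, drifted = [], []
    for match in TAG_PATTERN.finditer(raw_output):
        label, text = match.group("label"), match.group("text").strip()
        if label in ALLOWED_LABELS:
            entities.append((text, label))
        else:
            drifted.append(label)  # schema drift: tag not defined in the taxonomy
    return entities, drifted

entities, drifted = validate_output(
    "<COMPONENT>fuel pump relay</COMPONENT> <SEVERITY>high</SEVERITY>"
)
print(entities)  # [('fuel pump relay', 'COMPONENT')]
print(drifted)   # ['SEVERITY']
```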

phi-4 posted high precision but extremely low recall, meaning it often missed relevant entities altogether. Mistral 7B suffered from both low precision and low recall, while LLaMA 3B performed decently but lacked the robustness of the 8B variant.

And while LLaMA 4 was highly anticipated, we found its real-world performance disappointing. We’re not alone here. Community feedback has echoed similar concerns, and some have questioned whether published benchmarks were overly optimistic.

The Bottom Line

We selected LLaMA 8B because it delivered the most reliable, balanced performance for our domain-specific NER tasks. It plays nicely with our strict taxonomy, follows structural formatting instructions consistently, and avoids hallucinations that could lead to garbage in/garbage out.

For powering Implicit’s AI expertise engine, that consistency matters more than a few percentage points in one metric. Because when your customers are dealing with complex systems or edge-case issues, you don’t just want “smart.”

You want right. Every time.