Uncertainty Quantification for LLMs: Teaching with UQLM

By Yangming Li


1. Why Teach Uncertainty Quantification?

Large Language Models (LLMs) like GPT and Gemini are powerful, but they can "hallucinate"—i.e., produce plausible-sounding yet incorrect or nonsensical outputs. In educational settings, it's crucial for students to:

  • Understand why hallucinations occur
  • Measure how confident a model is in its outputs
  • Mitigate hallucinations to build more reliable applications

UQLM (Uncertainty Quantification for Language Models) is an open-source Python package that provides off-the-shelf tools for scoring LLM outputs with a variety of uncertainty metrics, making it ideal for hands-on teaching and experimentation (see the official docs for details).

2. Getting Started: Installation

Install UQLM via pip:

pip install uqlm

Tip for Students: Always install in a fresh virtual environment to avoid dependency conflicts.
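
For example, a minimal setup might look like this (the environment name is arbitrary):

python -m venv uqlm-env
source uqlm-env/bin/activate   # on Windows: uqlm-env\Scripts\activate
pip install uqlm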

3. Core Concepts & Scorer Types

UQLM categorizes uncertainty scorers into four families. As you walk students through each, encourage them to compare trade-offs in latency, cost, and compatibility:

Scorer Family | Latency     | Cost     | Compatibility              | Teaching Focus
--------------|-------------|----------|----------------------------|-----------------------------------
Black-Box     | Medium–High | High     | Any LLM                    | Consistency through multiple calls
White-Box     | Minimal     | None     | Token-probability–enabled  | Directly leverage model internals
LLM-as-Judge  | Low–Medium  | Low–High | Any LLM as judge           | Prompt-engineering for judges
Ensemble      | Flexible    | Flexible | Flexible                   | Combining multiple scorers

4. Hands-On Demo: Black-Box Uncertainty

Objective: Show how variations in multiple generations reveal uncertainty.

from langchain_google_vertexai import ChatVertexAI
from uqlm import BlackBoxUQ

# Initialize an LLM (any LangChain-compatible chat model works)
llm = ChatVertexAI(model='gemini-pro')

# Set up a Black-Box scorer
bbuq = BlackBoxUQ(llm=llm, scorers=["semantic_negentropy"], use_best=True)

# Generate and score 5 responses to a single prompt
# (await requires an async context, e.g. a Jupyter notebook or an asyncio.run wrapper)
results = await bbuq.generate_and_score(
    prompts=["Explain why the sky is green."],
    num_responses=5
)

print(results.to_df())

Discussion point: Why might "semantic negentropy" flag "green sky" as highly uncertain?
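
To make the "green sky" intuition concrete, here is a toy sketch of the semantic-entropy idea behind this scorer (not UQLM's actual implementation; in practice responses are clustered by meaning, e.g. with an NLI model, and the details differ):

import math
from collections import Counter

def normalized_entropy(cluster_labels):
    """Shannon entropy over semantic clusters, normalized to [0, 1]."""
    counts = Counter(cluster_labels)
    n = len(cluster_labels)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(n) if n > 1 else 0.0

# Five generations for the same prompt, bucketed by meaning (labels hand-assigned here)
consistent = ["sky-is-not-green"] * 5
inconsistent = ["rayleigh", "aliens", "not-sure", "rayleigh", "pollution"]

print(1 - normalized_entropy(consistent))    # ~1.0: responses agree, so confidence is high
print(1 - normalized_entropy(inconsistent))  # ~0.17: responses disagree, so confidence is low

The toy only shows the mechanics; a good follow-up is to ask whether a false-premise prompt really does scatter the model's answers in practice.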

Class exercise: Compare with the "exact_match" scorer and discuss the differences (additional scorers are listed in the GitHub repo).

5. Diving Deeper: White-Box & LLM-as-Judge

White-Box Scorer

White-box scorers use token probabilities directly. They add no extra latency or cost, but they require a model that exposes token log-probabilities.

from uqlm import WhiteBoxUQ

wbuq = WhiteBoxUQ(llm=llm, scorers=["min_probability"])
results = await wbuq.generate_and_score(prompts=prompts)  # reuse the prompts list from Section 4
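
For a classroom illustration of what these scorers compute, here is a toy sketch (it assumes "min_probability" reflects the weakest token and that a length-normalized score is a geometric mean; UQLM's implementation may differ):

import math

# Hypothetical per-token probabilities for one generated answer
token_probs = [0.98, 0.95, 0.40, 0.99]

min_probability = min(token_probs)  # a single weak token drags the score down
length_normalized = math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))  # geometric mean

print(round(min_probability, 3), round(length_normalized, 3))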

LLM-as-Judge

Leverages a second LLM, or a panel of them, to "judge" each response. Excellent for teaching prompt engineering and human-in-the-loop concepts.

from uqlm import LLMPanel
judges = [ChatVertexAI(model=m) for m in ["gemini-1.0-pro", "gemini-1.5-pro-001"]]
panel = LLMPanel(llm=llm, judges=judges)
results = await panel.generate_and_score(prompts=prompts)
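
To anchor the prompt-engineering discussion, you can show students the kind of grading prompt a judge model might receive (purely illustrative, not the template UQLM itself uses):

# Illustrative judge prompt for classroom discussion (hypothetical, not UQLM's built-in template)
judge_prompt = """You are grading another model's answer.
Question: {question}
Proposed answer: {answer}
On a scale from 0 to 1, how likely is the proposed answer to be correct?
Reply with a single number."""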

6. Building an Ensemble

Show students how combining multiple scorers often yields the most robust uncertainty estimates:

from uqlm import UQEnsemble

scorers = ["exact_match", "noncontradiction", "min_probability", llm]  # two black-box scorers, one white-box scorer, and an LLM judge
uqe = UQEnsemble(llm=llm, scorers=scorers)

# Tune scorer weights on a small set of prompts with known answers
# (tuning_prompts and ground_truth_answers are parallel lists you supply)
tune_results = await uqe.tune(
    prompts=tuning_prompts,
    ground_truth_answers=ground_truth_answers
)

# Generate scored outputs
final_results = await uqe.generate_and_score(prompts=prompts)
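
Conceptually, the ensemble reduces the component scores to a single confidence value; a weighted average is the simplest mental model. Here is a toy sketch with hypothetical scores and weights (UQLM's tune step is what actually determines the weights):

# Toy illustration of blending component scores (hypothetical values and weights)
component_scores = {"exact_match": 0.6, "noncontradiction": 0.8, "min_probability": 0.4, "judge": 0.7}
weights = {"exact_match": 0.2, "noncontradiction": 0.4, "min_probability": 0.1, "judge": 0.3}
confidence = sum(weights[name] * score for name, score in component_scores.items())
print(confidence)  # one combined confidence in [0, 1]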

Class Challenge: Have students design their own ensemble, justify scorer selection, and evaluate on a real QA dataset.

7. Next Steps & Resources

  • Explore the official docs for API details and advanced demos.
  • Assign students to run the example notebooks under examples/ and present findings.
  • Encourage a small project: integrate UQLM into a simple chatbot and compare user satisfaction with and without uncertainty filtering (a minimal filtering sketch follows below).
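
A minimal sketch of such uncertainty filtering, assuming the scored DataFrame has a response column and a column named after the scorer (both column names are assumptions, so have students check results.to_df().columns first):

# Hypothetical confidence gate reusing the BlackBoxUQ scorer from Section 4
async def answer_with_filtering(bbuq, prompt, threshold=0.7):
    result = await bbuq.generate_and_score(prompts=[prompt], num_responses=5)
    row = result.to_df().iloc[0]
    if row["semantic_negentropy"] < threshold:  # low confidence: abstain instead of guessing
        return "I'm not confident enough to answer that; please verify independently."
    return row["response"]  # assumed column holding the selected answer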

Conclusion

By guiding learners through installation, core concepts, and incremental hands-on exercises, UQLM becomes not just a tool, but a platform for teaching principled uncertainty quantification in modern NLP. Happy teaching!