1. Why Teach Uncertainty Quantification?
Large Language Models (LLMs) like GPT and Gemini are powerful, but they can "hallucinate"—i.e., produce plausible-sounding yet incorrect or nonsensical outputs. In educational settings, it's crucial for students to:
- Understand why hallucinations occur
- Measure how confident a model is in its outputs
- Mitigate hallucinations to build more reliable applications
UQLM (Uncertainty Quantification for Language Models) is an open-source Python package that provides off-the-shelf tools for scoring LLM outputs with a range of uncertainty metrics, making it ideal for hands-on teaching and experimentation; see the official documentation for details.
2. Getting Started: Installation
Install UQLM via pip:
```bash
pip install uqlm
```

Tip for Students: Always install in a fresh virtual environment to avoid dependency conflicts.
3. Core Concepts & Scorer Types
UQLM categorizes uncertainty scorers into four families. As you walk students through each, encourage them to compare trade-offs in latency, cost, and compatibility:
| Scorer Family | Latency | Cost | Compatibility | Teaching Focus |
|---|---|---|---|---|
| Black-Box | Medium–High | High | Any LLM | Consistency through multiple calls |
| White-Box | Minimal | None | Token-probability–enabled | Directly leverage model internals |
| LLM-as-Judge | Low–Medium | Low–High | Any LLM as judge | Prompt-engineering for judges |
| Ensemble | Flexible | Flexible | Flexible | Combining multiple scorers |
4. Hands-On Demo: Black-Box Uncertainty
Objective: Show how variations in multiple generations reveal uncertainty.
```python
from langchain_google_vertexai import ChatVertexAI
from uqlm import BlackBoxUQ

# Initialize an LLM (any LangChain-compatible chat model works)
llm = ChatVertexAI(model="gemini-pro")

# Set up a Black-Box scorer
bbuq = BlackBoxUQ(llm=llm, scorers=["semantic_negentropy"], use_best=True)

# Generate and score 5 responses to a single prompt
# (generate_and_score is async, so run this in a notebook or inside an async function)
results = await bbuq.generate_and_score(
    prompts=["Explain why the sky is green."],
    num_responses=5
)
print(results.to_df())
```

Discussion point: Why might "semantic negentropy" flag "green sky" as highly uncertain?
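To seed that discussion, the intuition can be demonstrated with a short, dependency-free sketch. This is a toy illustration of the idea behind semantic negentropy (group the sampled responses by meaning, then measure how concentrated the groups are), not UQLM's actual implementation, and the cluster labels below are invented:

```python
import math
from collections import Counter

def toy_semantic_negentropy(cluster_labels):
    """Toy illustration: higher value = sampled responses agree more (higher confidence)."""
    counts = Counter(cluster_labels)          # size of each semantic-equivalence cluster
    n = len(cluster_labels)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(n)                 # every response semantically distinct
    return 1 - entropy / max_entropy if max_entropy > 0 else 1.0

# Five responses to "Explain why the sky is green": most push back on the false
# premise, but they do not all say the same thing, so confidence is low.
print(toy_semantic_negentropy(["sky_is_blue", "sky_is_blue", "rayleigh", "green_flash", "refusal"]))
# A factual prompt where all five responses agree yields a score of 1.0.
print(toy_semantic_negentropy(["blue_rayleigh"] * 5))
```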
Class exercise: Compare with the "exact_match" scorer and discuss the differences; the UQLM GitHub repository lists the full set of available scorers.
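A possible starting point for that exercise, assuming both scorer names can be passed in the same list (mirroring the ensemble example below), is to score the same prompt with both scorers side by side:

```python
# Starter code for the class exercise: compare two black-box scorers on one prompt
bbuq_compare = BlackBoxUQ(llm=llm, scorers=["exact_match", "semantic_negentropy"])
compare_results = await bbuq_compare.generate_and_score(
    prompts=["Explain why the sky is green."],
    num_responses=5
)
print(compare_results.to_df())
```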
5. Diving Deeper: White-Box & LLM-as-Judge
White-Box Scorer
White-box scorers use token probabilities directly. They are extremely fast and add no extra generation cost, but they require a model that exposes token probabilities.
```python
from uqlm import WhiteBoxUQ

# `llm` is the chat model from the black-box demo; `prompts` is any list of question strings
wbuq = WhiteBoxUQ(llm=llm, scorers=["min_probability"])
results = await wbuq.generate_and_score(prompts=prompts)
```
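The intuition behind "min_probability" can be shown without calling a model at all. The sketch below is a toy illustration with invented log-probabilities, not UQLM's internal code: a generation is only as confident as its least likely token.

```python
import math

def min_token_probability(token_logprobs):
    """Toy illustration: the weakest token bounds the confidence of the whole answer."""
    return math.exp(min(token_logprobs))

# Hypothetical per-token log-probabilities for two short answers
confident_answer = [-0.05, -0.10, -0.02, -0.08]
shaky_answer = [-0.05, -2.90, -0.02, -0.08]   # one very unlikely token

print(min_token_probability(confident_answer))  # ~0.90
print(min_token_probability(shaky_answer))      # ~0.06
```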
LLM-as-Judge
Leverages a second LLM (or a panel of LLMs) to "judge" each response. Excellent for teaching prompt-engineering and human-in-the-loop concepts.
```python
from uqlm import LLMPanel

# The panel of judge models scores each response produced by `llm`
judges = [ChatVertexAI(model=m) for m in ["gemini-1.0-pro", "gemini-1.5-pro-001"]]
panel = LLMPanel(llm=llm, judges=judges)
results = await panel.generate_and_score(prompts=prompts)
```

6. Building an Ensemble
Show students how combining multiple scorers often yields the most robust uncertainty estimates:
```python
from uqlm import UQEnsemble

# Mix black-box scorers, a white-box scorer, and an LLM judge in one ensemble
scorers = ["exact_match", "noncontradiction", "min_probability", llm]
uqe = UQEnsemble(llm=llm, scorers=scorers)

# Tune scorer weights on a small set of known question-answer pairs
# (tuning_prompts / ground_truth_answers: a labeled QA set you provide)
tune_results = await uqe.tune(
    prompts=tuning_prompts,
    ground_truth_answers=ground_truth_answers
)

# Generate scored outputs
final_results = await uqe.generate_and_score(prompts=prompts)
```

Class Challenge: Have students design their own ensemble, justify their scorer selection, and evaluate it on a real QA dataset.
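One way to frame the evaluation part of the challenge: treat the ensemble's confidence score as a filter and check how accuracy changes as low-confidence responses are dropped. The sketch below uses invented scores and correctness flags; students would substitute their ensemble's scores and their own grading against the QA dataset:

```python
def filtered_accuracy(confidences, is_correct, threshold):
    """Accuracy on the subset of responses whose confidence is at least `threshold`."""
    kept = [ok for conf, ok in zip(confidences, is_correct) if conf >= threshold]
    return sum(kept) / len(kept) if kept else float("nan")

# Hypothetical ensemble confidences and correctness flags for a small QA set
confidences = [0.91, 0.42, 0.88, 0.15, 0.77, 0.30]
is_correct  = [1,    0,    1,    0,    1,    1]

for t in (0.0, 0.5, 0.8):
    print(f"threshold={t:.1f}  accuracy={filtered_accuracy(confidences, is_correct, t):.2f}")
```

If low-confidence answers really are the wrong ones, accuracy on the retained subset should rise as the threshold increases.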
Conclusion
By guiding learners through installation, core concepts, and incremental hands-on exercises, UQLM becomes not just a tool, but a platform for teaching principled uncertainty quantification in modern NLP. Happy teaching!