What topics does this blog cover?

This site covers applied AI systems, AI system architecture, LLM system design, statistical ML, data engineering, data products, and experiment infrastructure, with practical writing across AI/ML, product, and engineering.

Yangming Li is an AI Engineer and Product Builder based in Vancouver, focused on LLM systems, statistical ML, data engineering, data products, and experiment infrastructure for healthcare, finance, and enterprise teams.

How can I contact Yangming Li?

You can contact Yangming Li via email at dleeym95@gmail.com, through LinkedIn, or by using the contact form on the website.

Yangming Li | Applied AI Systems, LLM Evaluation, Data Products

Runtime Snapshot

class YangmingLi:
    @staticmethod
    def builds() -> list:
        return ["AI system architecture", "LLM systems", "Statistical ML", "Data engineering", "Experiment infrastructure"]

    @staticmethod
    def serves() -> list:
        return ["Healthcare teams", "Finance teams", "Enterprise teams"]

    @staticmethod
    def outcomes() -> list:
        return ["Production AI systems", "Faster decision loops", "Reusable internal tooling"]

# Instantiate Yangming Li
builder = YangmingLi()
print(f"Builds: {builder.builds()}")
print(f"Serves: {builder.serves()}")
print(f"Outcomes: {builder.outcomes()}")

>focus_areas = 5

>industries_covered = 4

>delivery_mode = "prototype to production"

Evidence Strip

A faster read on trust, fit, and delivery

Built for teams that care about adoption, auditability, and production impact, not just model demos. The homepage now keeps the proof points visible and moves side quests into quieter corners.

Industries

Healthcare operations
Finance and risk
Public sector delivery
Research environments

Capability Areas

LLM systems
Statistical ML
Data engineering
Decision-support products

Delivery Modes

Dashboards and scorecards
APIs and microservices
Copilots and RAG systems
Monitoring-ready pipelines

Public Artifacts

Peer-reviewed publication
Technical guides
Selected portfolio work
CFA and FRM credentials

Working Style

Tool Builder Mindset

"I'm a tool builder. That's how I think of myself. I want to build really good tools that I know in my gut and my heart will be valuable. And then, whatever happens, is... you can't really predict exactly what will happen, but you can feel the direction that we're going. And that's about as close as you can get. Then you just stand back and get out of the way, and these things take on a life of their own."

Lab & Notes

More ways to explore

Beyond the main work and projects, I also keep study notes, essays, small experiments, investing notes, and certificates here. They give extra context on how I learn, think, and build.

About Projects Blog Index Copilot Test Sets A/B Sample Size Causal Inference Model Calibration Resume Contact Notes Essays Investing Lab Certificates

If you are here for collaboration or hiring, start with About, Projects, Blog, Resume, or Contact. If you are curious, the other links are open too.

AI Agent Evaluation

Building an AI agent?

Download a practical launch checklist for evaluating Copilot Studio, RAG, document AI, and enterprise AI agents before production.

Download the Checklist

Selected Writing

Browse writing by topic. AI/ML, product, engineering, and investing now live under one blog view. For production AI systems, visit the dedicated AI Engineering column.

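[AI Engineering] From Text2SQL to Benefit Redemption: An Engineering Retrospective on Shipping a Business Data Query Agent

[AI Engineering] Three-tier RAG, PGVector, multi-turn context, SQL safety and evaluation, plus the full engineering loop of RocketMQ, idempotency, and eventual consistency.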
[AI Engineering] Building a Practical Enterprise Data Agent with Microsoft Fabric Data Agent and Azure AI Foundry

[AI Engineering] A practical architecture guide for governed enterprise data agents using Fabric Data Agent, Fabric IQ ontology, MCP endpoints, and Microsoft Foundry Agent Service.
[Applied ML] A 90% Accurate Model That Still Loses Money: Why Churn Prediction Fails Without Uplift Thinking

[Applied ML] Why high-accuracy churn prediction can improve renewal rate while losing revenue, and how uplift thinking targets incremental renewals instead of risk scores.
[Applied ML] Beyond A/B Testing: A Practical Guide to Uplift Modeling in Industry

[Applied ML] How uplift modeling moves from average A/B test impact to user-level incremental effect, with CATE, meta-learners, Qini, AUUC, decile lift, and online ROI.
[LLM Evaluation] RAG Evaluation Guide: Metrics, Frameworks, and Python Examples

[LLM Evaluation] A practical guide to retrieval metrics, golden test sets, generation faithfulness, citations, no-answer behavior, and production monitoring.
[AI Engineering] Copilot Agent Golden Test Set: Cases, Rubrics, and Regression Gates

[AI Engineering] A practical guide to reusable Copilot agent test cases, rubrics, schema checks, regression gates, and release decisions.
[Experimentation] A/B Testing Sample Size in Python: Power, MDE, and Guardrails

[Experimentation] Calculate sample size, minimum detectable effect, statistical power, and guardrail planning before product experiments launch.
[Product Analytics] Causal Inference for Product Analytics

[Product Analytics] Connect experiments, observational data, ATE, CATE, uplift modeling, guardrails, and product decision quality.
[Trustworthy ML] Model Calibration in Python: Reliability Diagrams, ECE, and Decision Thresholds

[Trustworthy ML] Evaluate calibrated probabilities with reliability diagrams, expected calibration error, confidence bins, thresholds, and monitoring checks.
[LLM Evaluation] The Most Important Part of AI Agents Is Not Prompting. It Is Evaluation.

[LLM Evaluation] Why production AI agents need custom eval sets, trajectory checks, calibrated judges, regression tests, and business-ready metrics.
[AI Engineering] Testing and Evaluating Copilot Agents: From Demo to Reliable AI System

[AI Engineering] A schema-first guide to testing Copilot Studio agents with evaluation sets, custom graders, validation gates, human review, and post-publish monitoring.
PyTorch Review 01 (Popularized 2018-2020)

[ML] A practical PyTorch review of tensors, shapes, broadcasting, reshaping, and torch.distributions.
[MLOps] Leveraging Docker in Machine Learning and Data Science (Popularized 2014-2016)

[MLOps] A comprehensive guide to using Docker for ML/DS projects - from development to deployment
[MLOps] A Must-Have Skill for Efficient Model Management and Deployment (Popularized 2019-2021)

[MLOps] Understanding MLOps principles and implementation with MLflow and Weights & Biases
[ML] Trustworthy Machine Learning (Popularized 2018-2021)

[ML] An overview of what trustworthy machine learning is.
[ML] Fine-Tune BERT for Sentiment Analysis (Popularized 2018-2020)

[ML] Supervised fine-tuning (SFT) of a large language model.
[ML] Fine-Tune BERT for Sentiment Analysis 02 (Popularized 2018-2020)

[ML] Supervised fine-tuning (SFT) of a large language model, part 2.
[MLOps] Ray for Distributed ML (Popularized 2019-2021)

[MLOps] An overview of what Ray is.
[ML] Why Decoder-Only Architectures Became Standard in LLMs (Popularized 2020-2023)

[ML] Understanding the dominance of decoder-only architectures in modern LLMs
[ML] Building an Enterprise-Level AI Agent for Document Transformation (Popularized 2023-2025)

[ML] A comprehensive guide to building document processing AI agents using LlamaReport and LlamaCloud
[AI] Model Context Protocol (MCP): Connected AI Tools (Popularized 2024-2025)

[ML] A comprehensive guide to understanding and implementing the Model Context Protocol for AI integration
[ML] Uncertainty Quantification for LLMs with UQLM (Popularized 2024-2025)

[ML] A hands-on teaching guide for using UQLM to quantify and understand uncertainty in large language models.
[AI] Agentic AI Systems with n8n (Popularized 2023-2025)

[ML] A comprehensive guide to agentic AI system architecture, tool use, and production integration.
[Classical ML] Random Forest: Ensemble Learning Explained (Popularized 2001-2005)

[Classical ML] Dive into the mechanics, advantages, and applications of Random Forest - an ensemble learning algorithm that combines multiple decision trees for robust classification and regression tasks.
[ML] Machine Unlearning: A Complete Technical Guide (Popularized 2019-2023)

[ML] A step-by-step guide to implementing machine unlearning systems with algorithms, APIs, and monitoring processes.

Yangming's Product Blog

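When Optimizing Product and Engineering, Don't Rush to Automate

Turns "question the requirement first, then delete steps, and only then optimize, speed up, and automate" into a Chinese-language methodology that product, engineering, and process teams can apply directly.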
Building a Product That Scales into a Company: Lessons from the 4U Framework

A comprehensive guide to scaling products into successful companies using the 4U Framework and other strategic approaches.
The Essence of a Successful Product: Insights for Product Managers

A deep dive into what truly makes a product successful from a product manager's perspective.
Jira for Agile Project Management (Popularized 2004-2008)

A comprehensive guide to using Jira for agile project management and team collaboration.

Yangming's Engineering Blog

[Statistics] Statistical Tests for Survey Analysis (Popularized 1930-1950)

[BI] A comprehensive guide to statistical tests for analyzing survey data
[Data Engineering] Databricks Lakehouse Guide (Popularized 2015-2018)

[Data Engineering] A deep dive into Databricks features and implementation with practical examples
[Engineering] Kubernetes: A Comprehensive Guide (Popularized 2016-2018)

[Engineering] A deep dive into Kubernetes architecture, components, and practical implementation
[Engineering] Polars: Fast DataFrames in Python (Popularized 2021-2023)

A comprehensive guide to using Polars for high-performance data processing in Python
[Machine Learning] Deep Neural Networks Explained (Popularized 2012-2016)

A comprehensive guide to understanding deep neural networks (DNNs), including forward and backward propagation, optimization algorithms, and PyTorch implementation
[Engineering] Deep Learning Engineering with JAX (Popularized 2020-2022)

A comprehensive guide to using JAX for high-performance machine learning and numerical computing
[Engineering] Feature Flags as Experiment Infrastructure (Popularized 2010-2015)

[Engineering] A deep dive into productionizing A/B testing—from 5 experiments/year to 50,000/year—covering CUPED variance reduction, sequential testing, Bayesian vs. frequentist methods, plus a complete SOP template and resource checklist.
[Engineering] Building and Publishing a Python Package (Popularized 2018-2021)

[Engineering] A comprehensive guide to creating, testing, documenting, and publishing Python packages following modern best practices
[Statistics] Generalized Linear Models (GLM) (Popularized 1972-1985)

[Statistics] A deep dive into GLMs, their applications, and implementation in both statistical and machine learning contexts

Notes

Working notes, study artifacts, and lower-priority references that support the main body of work.

Carnegie Mellon University Advanced NLP Course Notes (Popularized 2018-2023)

These are my study notes from CMU's Advanced Natural Language Processing course. The notes cover fundamental concepts and advanced topics in NLP.
MIT Data Structure and Algorithms Course Notes (Popularized 1970-1990)

These are my study notes from MIT's Data Structure and Algorithms course. The notes cover fundamental algorithms, data structures, and their practical implementations.
MIT Principles of Computer Systems (6.826) Course Notes (Popularized 1960-1980)

These are my study notes from MIT's Principles of Computer Systems course. The notes cover distributed systems, concurrency, fault tolerance, and system design principles.
MIT Computation Structures (6.004) Course Notes (Popularized 1970-1985)

These are my study notes from MIT's Computation Structures course. The notes cover digital systems design, Boolean logic, computer architecture, and assembly language programming.

Selected Work

Representative work themes across healthcare, finance, and enterprise teams, centered on production AI systems that move beyond demos.

Featured Interactive Experience

Focus Room

A premium SwiftUI focus app prototype for deep work: a soft hold-to-enter threshold, layered ambient sound mixing, a subtle timer, and a fullscreen study room that slowly deepens as the session unfolds.

Hold-to-enter ritual
Ambient mixer
Ghost UI
Session evolution
Local persistence

Open Focus Room
View Focus Room SwiftUI source

Ambient Layers

Piano vinyl warmth

Rain window hush

Brown deep bed

Cafe soft distance

White clean edge

Investing

Notes on capital allocation, market structure, and the quieter parts of long-term decision-making.

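Venture Capital Is Really "Risk-Prevention Investing"

A long-form piece from 张师傅的退休实验室 (Master Zhang's Retirement Lab) on the nature of VC, vintage, betting on people, the differences between primary and secondary markets, and investing wisdom.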

Essays & References

A quieter corner for essays, references, and ideas that inform how I build.

Why "Taste" Matters in Science — and in Technology

Exploring Nobel laureate Yang Zhenning's concept of "taste" in research, how it applies to technology and product development, and what separates the merely competent from the truly visionary in science and tech.
Interesting Resource: Calculating Empires

I recently discovered a fascinating interactive resource called "Calculating Empires: A Genealogy of Technology and Power Since 1500". This comprehensive visualization maps out the intricate relationships between technology, power, and human history over the past 500 years.
Knowledge Flow

An interactive platform for visualizing and exploring connected knowledge across various domains. Knowledge Flow helps discover relationships between concepts and ideas in a structured format.

Lab

A smaller corner for interactive prototypes and playful experiments. The slot machine stays here as a lightweight demo, not part of the main positioning story.

Risk Forecast Studio

Turn Monte Carlo into a finance or delivery risk story

Treat each run like a committee review: one path could be a quarterly portfolio outcome, another a launch program under delivery pressure. The upper chart shows how scenarios drift apart over time, and the histogram reveals where the ending cases really cluster.

Legend: 10-90 risk band, sample cases, required hurdle

A balanced review setup where the hurdle still feels reachable, but the tail risk is visible enough to force a real decision.

Situation: Quarter-close risk review

A lead is checking whether the plan still clears its hurdle before committing more capital, timeline, or scope.

Main Risk: Variance can overpower the base case

The center line may look calm, but a few bad shocks widen the tail quickly and change the story for stakeholders.

Decision Lens: Adjust before review day

Lower the hurdle, extend the horizon, or reduce exposure and scope if the success odds drift too low.

Scenario Paths

How the forecast can unfold

Distribution

Where the ending scenarios cluster

Summary metrics: Beat Hurdle, Expected Finish, Median Case, 10-90 Band

Baseline level 100

Opening portfolio value, buffer, or program progress before the uncertainty starts compounding.

Expected gain / velocity +0.45%

The average step change if the plan is working, before market shocks or delivery friction show up.

Shock / delivery noise 1.80%

Use this for market swings, scope creep, coordination drag, and other sources of variance.

Review horizon 72

More steps can help a plan recover, but they also give risk more time to compound.

Scenarios 900

More scenarios make the probability story steadier and the tails easier to trust.

Required hurdle 135

The level you need to hit for the review to be called a win, whether that means return, runway, or launch readiness.

The point is not to chase one perfect forecast. It is to see whether the plan still holds together once you let uncertainty show up honestly.
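
As a rough illustration of what a simulator like this computes, here is a minimal Python sketch that uses the default values listed above. The step model (each step multiplies the level by one plus a normally distributed change) and the function names simulate_paths and summarize are assumptions made for illustration; the page does not say which process the live tool uses.

```python
import random
import statistics


def simulate_paths(baseline=100.0, drift=0.0045, noise=0.018,
                   horizon=72, scenarios=900, seed=0):
    """Return the final level of each simulated path.

    Assumption: every step multiplies the level by (1 + r), where r is
    drawn from a normal distribution with mean `drift` and standard
    deviation `noise`. This is a hypothetical model, not the site's code.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    finals = []
    for _ in range(scenarios):
        level = baseline
        for _ in range(horizon):
            level *= 1.0 + rng.gauss(drift, noise)
        finals.append(level)
    return finals


def summarize(finals, hurdle=135.0):
    """Compute the four summary figures named in the panel above."""
    deciles = statistics.quantiles(finals, n=10)  # 9 cut points, 10%..90%
    return {
        "beat_hurdle": sum(f >= hurdle for f in finals) / len(finals),
        "expected_finish": statistics.fmean(finals),
        "median_case": statistics.median(finals),
        "band_10_90": (deciles[0], deciles[-1]),
    }


if __name__ == "__main__":
    for name, value in summarize(simulate_paths()).items():
        print(name, value)
```

Raising noise or shortening horizon in simulate_paths is a quick way to see how the share of paths that clear the hurdle responds, which is the comparison the controls above are meant to encourage.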

Decision Simulators

Three more playable tools for experiment design, AI economics, and representation learning

These are not generic calculators. One helps answer "can I trust this experiment read?", another stress-tests production AI economics, and the third turns high-dimensional structure into a neighborhood map you can actually interrogate.

Experiment Design Studio

A/B Test Power Simulator

Model the decision pressure behind a launch review: how much sample, how much noise, and how much real lift you need before a "winner" deserves trust.

Evidence Map

Null vs. uplift distribution

Runtime Pressure

How long you need to sit on the test

Scale: 1 day, 1 week, 2 weeks, 1 month+

Short tests feel faster, but they usually buy speed by borrowing confidence from the future.
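
One common way to put numbers on that trade-off is the normal-approximation sample size for a two-proportion test. The sketch below is illustrative only: the baseline rate, relative lift, and daily traffic are hypothetical inputs, and the function names are not taken from the simulator itself.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-proportion z-test.

    Uses the normal approximation with unpooled variance:
    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    p1 = base_rate
    p2 = base_rate * (1.0 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def runtime_days(n_per_arm, daily_visitors, arms=2):
    """Days of traffic needed if visitors are split evenly across arms."""
    return math.ceil(n_per_arm * arms / daily_visitors)


if __name__ == "__main__":
    # Hypothetical scenario: 10% baseline conversion, 10% relative lift,
    # 2,000 eligible visitors per day.
    n = sample_size_per_arm(base_rate=0.10, relative_lift=0.10)
    print("per-arm sample size:", n)
    print("runtime in days:", runtime_days(n, daily_visitors=2000))
```

Because the required sample grows with the inverse square of the lift, halving the effect you want to detect roughly quadruples the runtime.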

Applied AI Economics

LLM Cost-Latency Simulator

Stress-test an AI system the way a platform lead would: traffic, context size, retries, and cache behavior all fight over the same latency and budget envelope.

Cost Stack

Where the monthly spend really goes

Illustrative economics

Latency Budget

How queueing and retries bend the p95

Summary metrics: Monthly Spend, P95 Latency, Failure Rate, Cache Savings

Model class

Average QPS 1.2

Steady state throughput across the month, not just the homepage demo spike.

Input context tokens 7,000

Prompt, system instructions, retrieved docs, and conversation history combined.

Output tokens 900

The average answer length or tool-augmented completion size.

Cache hit rate 35%

Shared prompts, reusable context, and semantic caching can shave cost without changing the UX.

Retry share 6%

The fraction of requests that need a retry because of timeouts, transient failures, or guardrail retries.

These numbers are illustrative economics, not live vendor quotes, but they are useful for understanding how quickly context size and retries turn into budget pressure.
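
A back-of-envelope version of the spend calculation might look like the sketch below. The token prices are placeholder values, not vendor quotes, and two modeling choices are assumptions: a cache hit skips the model call entirely, and every retried call is billed again in full. The function name monthly_llm_cost is hypothetical.

```python
def monthly_llm_cost(qps=1.2, input_tokens=7000, output_tokens=900,
                     cache_hit_rate=0.35, retry_share=0.06,
                     price_in_per_m=3.00, price_out_per_m=15.00, days=30):
    """Estimate monthly spend and cache savings.

    Prices are hypothetical dollars per million tokens. Traffic and
    token defaults mirror the control values listed above.
    """
    requests = qps * 86_400 * days  # seconds per day times days

    # Cost of one billed model call at the placeholder prices.
    cost_per_call = (input_tokens * price_in_per_m
                     + output_tokens * price_out_per_m) / 1_000_000

    # Assumption: retries add `retry_share` extra billed calls.
    retry_factor = 1.0 + retry_share

    # Assumption: cache hits never reach the model.
    spend = requests * (1.0 - cache_hit_rate) * retry_factor * cost_per_call
    spend_without_cache = requests * retry_factor * cost_per_call

    return {
        "requests": requests,
        "monthly_spend": spend,
        "cache_savings": spend_without_cache - spend,
    }


if __name__ == "__main__":
    for name, value in monthly_llm_cost().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions spend scales linearly with input context size, so trimming retrieved documents is often the cheapest lever to test first.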

Representation Learning Studio

UMAP / HDBSCAN Manifold Simulator

Compress a synthetic high-dimensional population into a neighborhood map and watch density structure survive, split, or dissolve as overlap, local scale, and minimum cluster size shift.

Embedding Surface

How local neighborhoods fold into 2D

Density Frontier

Where dense structure becomes noise

Summary metrics: Groups Found, Noise Share, Separation, NN Trust

Latent dimensions 18

The hidden feature space before the embedding compresses the geometry into a readable plane.

Seeded groups 4

How many underlying structures the synthetic population starts with before density blur and overlap kick in.

Overlap / noise 0.55x

Raise this to make manifolds smear together, inject border cases, and force the clustering logic to work harder.

Local neighborhood 12

Smaller values emphasize fine local detail; larger values smooth the embedding and favor broader structure.

Min cluster size 10

Acts like the patience of a density clusterer: tiny islands can disappear if they cannot support enough local evidence.

Manifold curl 0.85

Bends the hidden geometry so the map stops looking like clean Gaussian blobs and starts feeling more like real representation space.

Use this like a UMAP and density-clustering intuition board: local neighborhoods can survive the projection even while cluster boundaries become contestable.

Below it, the slot machine keeps the lighter arcade energy for visitors who want something more playful than probabilistic forecasting.

AI Slot Machine Demo

CREDITS

1000

WINS

BET

PAYLINES

MULTIPLIER

FREE SPINS

AI Pioneers & Payouts:

Geoffrey Hinton
"Godfather of AI"
x5 = 50

Yann LeCun
Facebook AI
x5 = 75

Andrej Karpathy
ML Educator
x5 = 100

Yoshua Bengio
Deep Learning Pioneer
x5 = 150

AI Research
Innovation
x5 = 200

Jensen Huang
NVIDIA CEO
WILD

Steve Jobs
BONUS

Contact

Email dleeym95@gmail.com
LinkedIn Yangming Li
GitHub github.com/yml-blog
Location: Vancouver, BC, Canada
Book a Meeting

Pick a time directly on my calendar:

Prefer email? dleeym95@gmail.com
Leave a Message

Explore Yangming Li's work