Unit 6/Lesson 1 of 3

How AI Forecasting Products Are Built & Evaluated

Demand forecasting is a fundamentally different AI problem from building chatbots or classifiers. Here's what the PM needs to understand about how it works, how it fails, and how to evaluate whether the model is actually getting better.

Skills: AI/ML Fundamentals, Model Evaluation, Forecasting

Most PM advice about working with AI teams assumes you're dealing with classifiers, recommendation systems, or generative models. Demand forecasting is different — structurally, operationally, and in how you evaluate it. A PM who understands these differences earns trust from the data science team and makes better prioritization calls.

Why Demand Forecasting Is a Different AI Problem

Time-series structure: Demand data is sequential. Last week's demand carries information about this week's in a way that independently sampled data does not. Models must respect time ordering: if you train on future data and test on past data, the model sees information it would never have in production (look-ahead bias). This is why off-the-shelf ML workflows that shuffle rows into random train/test splits don't work well for forecasting.

Sparse data and the cold-start problem: A new SKU has zero historical sales data. What does the model forecast? Most tools fall back to category averages or product attribute-based priors. Tightly's approach (using Shopify tags, variant attributes, and similar SKU performance) is exactly right — but getting cold-start wrong is the most common cause of early-life forecast errors.
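The fallback idea above can be sketched as follows. This is a minimal illustration, not Tightly's implementation: the function name, the 14-day threshold, and the linear blend between the SKU's own history and the similar-SKU prior are all assumptions made for the example.

```python
from statistics import mean

def cold_start_forecast(sku_daily_sales, similar_sku_daily_sales, min_history_days=14):
    """Forecast daily demand, leaning on similar SKUs while history is short.

    sku_daily_sales: list of daily unit sales for the SKU (may be empty).
    similar_sku_daily_sales: list of daily-sales lists for comparable SKUs
        (hypothetical input, e.g. SKUs sharing tags or attributes).
    min_history_days: assumed threshold after which the SKU's own history is trusted.
    """
    # Enough history: use the SKU's own average daily rate.
    if len(sku_daily_sales) >= min_history_days:
        return mean(sku_daily_sales)

    # Prior: average daily rate across the similar SKUs that have any data.
    prior_rates = [mean(h) for h in similar_sku_daily_sales if h]
    if not prior_rates:
        return 0.0
    prior = mean(prior_rates)

    # Blend: the more days observed, the more weight on the SKU's own rate.
    weight_own = len(sku_daily_sales) / min_history_days
    own_rate = mean(sku_daily_sales) if sku_daily_sales else 0.0
    return weight_own * own_rate + (1 - weight_own) * prior
```

With no history at all the forecast is the prior; after seven days (under the assumed 14-day threshold) it is an even blend of the SKU's own rate and the prior.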

Distribution shift: The demand distribution is not stationary. Promotions, seasonality, viral social media moments, and supply chain disruptions all shift the underlying pattern. A model trained on 18 months of COVID-era data has learned a very different demand distribution from one trained on pre-2020 data. Models need mechanisms to detect and adapt to distribution shift, not just assume the future looks like the past.

The sparse/lumpy demand problem: Many DTC SKUs don't sell every day. An item that sells 0-0-0-12-0-0-3-0-0 units across nine days is structurally different from one that sells 2-3-1-2-4-1-3-2-2. Most simple forecasting models perform poorly on intermittent demand. More sophisticated approaches (Croston's method, or neural forecasting with intermittency handling) matter for accessories, limited-edition products, and B2B-adjacent SKUs.
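To make Croston's method concrete, here is a minimal sketch. It smooths the size of non-zero demands and the gap between them separately, then forecasts their ratio as a per-period rate. The initialization (first non-zero demand and the periods elapsed up to it) is one common convention and is an assumption here, as is the smoothing parameter.

```python
def croston(demand, alpha=0.1):
    """Per-period demand rate via a basic Croston's method.

    demand: sequence of non-negative unit sales per period.
    alpha: smoothing parameter in (0, 1]; the default is illustrative.
    """
    size = None          # smoothed size of non-zero demands
    interval = None      # smoothed number of periods between non-zero demands
    periods_since = 1    # periods elapsed since the last non-zero demand

    for d in demand:
        if d > 0:
            if size is None:
                # Initialize from the first observed non-zero demand.
                size, interval = float(d), float(periods_since)
            else:
                size += alpha * (d - size)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1

    if size is None:
        return 0.0  # no sales observed at all
    return size / interval


lumpy = [0, 0, 0, 12, 0, 0, 3, 0, 0]
rate = croston(lumpy, alpha=0.5)  # smoothed size 7.5 / smoothed interval 3.5
```

On the lumpy series from the text, the estimates only update on the two days with sales, so the long runs of zeros do not drag the forecast toward zero the way a plain moving average would.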

Model Evaluation Metrics: What to Use and When

The choice of evaluation metric is a product decision as much as a technical one — it determines what the data science team optimizes for.

MAPE (Mean Absolute Percentage Error)
|Actual - Forecast| ÷ Actual × 100, averaged across SKUs.
MAPE is intuitive ("we're off by 18% on average") and easy to explain to customers. But it has a critical flaw: it explodes on low-volume SKUs and is undefined when actual sales are zero. If a SKU sells 1 unit and you forecast 2, the percentage error is 100%. If it sells 100 and you forecast 80, the percentage error is 20%, but the second error is a much bigger business problem. MAPE penalizes errors on low-volume SKUs disproportionately and makes aggregate accuracy look worse than it is.
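The two-SKU example above can be reproduced with a short sketch. Skipping SKUs with zero actual sales is an assumption made here to avoid dividing by zero; other conventions exist.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error, in percent.

    SKUs with zero actual sales are skipped (an assumption of this sketch),
    because the percentage error is undefined for them.
    """
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual is zero")
    return 100 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


low_volume = mape([1], [2])             # off by 1 unit
high_volume = mape([100], [80])         # off by 20 units
combined = mape([1, 100], [2, 80])      # simple average of the two
```

The one-unit miss scores 100% and the twenty-unit miss scores 20%, so the combined figure lands around 60% even though almost all of the volume sits in the second SKU.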

WMAPE (Weighted Mean Absolute Percentage Error)
Weights the error by the SKU's actual volume: Σ|Actual - Forecast| ÷ Σ Actual × 100. This naturally reduces the influence of low-volume SKUs and gives you a volume-weighted view of accuracy (a reasonable proxy for a revenue-weighted view when unit prices are similar across SKUs). WMAPE is the better production metric: it tells you how accurate you are on the SKUs that actually drive the business. When Tightly says "we target <15% MAPE on high-velocity SKUs," they're really talking about a WMAPE for the top revenue-contributing SKUs.
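A minimal sketch of the WMAPE formula, applied to a hypothetical two-SKU portfolio (one SKU selling 1 unit with a forecast of 2, one selling 100 units with a forecast of 80):

```python
def wmape(actuals, forecasts):
    """Weighted MAPE in percent: total absolute error over total actual volume."""
    total_actual = sum(actuals)
    if total_actual == 0:
        raise ValueError("WMAPE is undefined when total actual volume is zero")
    total_error = sum(abs(a - f) for a, f in zip(actuals, forecasts))
    return 100 * total_error / total_actual


# 21 units of total error over 101 units of actual volume.
portfolio_wmape = wmape([1, 100], [2, 80])
```

The result is roughly 21%, close to the high-volume SKU's own 20% error, because the one-unit SKU contributes almost nothing to the totals.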

MAE (Mean Absolute Error)
Σ|Actual - Forecast| ÷ n. Measures error in units, not percentage. Useful for comparing across time periods without the percentage distortion, but not intuitive for customers who think in percentages.

Bias
Σ(Forecast - Actual) ÷ n. Bias tells you the direction of your errors: are you consistently over-forecasting (positive bias → overstock risk) or under-forecasting (negative bias → stockout risk)? A model with low MAPE but high positive bias is systematically bullish and will cause overstock. Bias should be tracked alongside accuracy metrics.
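The MAE and bias formulas above differ only in whether the sign of each error is kept. A short sketch with made-up numbers shows why both are worth tracking:

```python
def mae(actuals, forecasts):
    """Mean absolute error, in units."""
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)

def bias(actuals, forecasts):
    """Mean signed error (forecast minus actual); positive means over-forecasting."""
    return sum(f - a for a, f in zip(actuals, forecasts)) / len(actuals)


actuals = [10, 20, 30]                  # illustrative values only
always_high = [12, 22, 32]              # every forecast is 2 units too high
mixed = [12, 18, 30]                    # one high, one low, one exact

mae_high, bias_high = mae(actuals, always_high), bias(actuals, always_high)
mae_mixed, bias_mixed = mae(actuals, mixed), bias(actuals, mixed)
```

The first forecast set has an MAE of 2 and a bias of +2: every error points the same way, which is the overstock pattern. The second has an MAE of about 1.33 and a bias of 0, because its errors cancel.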

How to Run Evals for a Forecasting Product

Holdout set methodology (leave-last-N-weeks out)
The standard eval approach: train the model on all data up to week W, then test it on weeks W+1 through W+N. Never expose the test period to the training data. The PM implication: when the data science team says "our model improved," ask "what was the holdout window?" A 4-week holdout catches short-term improvements; a 12-week holdout catches seasonal generalization. For a DTC brand, 12-week holdouts that include a promotional period are the most realistic test.
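A leave-last-N-weeks-out split can be sketched in a few lines. The row format (a week index plus units sold) and the function name are assumptions for illustration; the point is that the split is made on time, never at random.

```python
def time_ordered_split(weekly_rows, holdout_weeks):
    """Split (week_index, units) rows into train and test by time.

    Everything up to and including week W goes to training; the last
    `holdout_weeks` distinct weeks go to the test set.
    """
    rows = sorted(weekly_rows, key=lambda r: r[0])
    weeks = sorted({week for week, _ in rows})
    if holdout_weeks <= 0 or holdout_weeks >= len(weeks):
        raise ValueError("holdout_weeks must leave at least one training week")

    cutoff = weeks[-holdout_weeks - 1]  # week W, the last training week
    train = [r for r in rows if r[0] <= cutoff]
    test = [r for r in rows if r[0] > cutoff]
    return train, test


history = [(week, 10 * week) for week in range(1, 21)]  # 20 weeks of toy data
train, test = time_ordered_split(history, holdout_weeks=4)
```

Here weeks 1 through 16 train the model and weeks 17 through 20 are held out, so no test-period observation can leak into training.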

Cohort-based eval: new SKUs vs. mature SKUs
Don't evaluate forecasting performance in aggregate, because it hides the cold-start problem. Split the evaluation into two cohorts: SKUs with at least 90 days of history (mature) and SKUs with fewer than 90 days of history (new). The model likely performs much better on mature SKUs. The gap between cohorts tells you how much cold-start is costing you. Closing that gap, through better feature engineering on new SKUs, is often the highest-leverage model improvement available.
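The cohort split can be sketched as a WMAPE computed per cohort. The record fields (`days_of_history`, `actual`, `forecast`) are hypothetical names, and treating a SKU with exactly 90 days of history as mature is an assumption of this sketch.

```python
def wmape_by_cohort(records, maturity_days=90):
    """WMAPE (percent) for mature vs. new SKUs.

    records: iterable of dicts with hypothetical keys
        'days_of_history', 'actual', 'forecast'.
    Returns None for a cohort with no actual volume.
    """
    totals = {"mature": [0.0, 0.0], "new": [0.0, 0.0]}  # [abs error, actual]
    for r in records:
        cohort = "mature" if r["days_of_history"] >= maturity_days else "new"
        totals[cohort][0] += abs(r["actual"] - r["forecast"])
        totals[cohort][1] += r["actual"]
    return {
        cohort: (100 * err / act if act else None)
        for cohort, (err, act) in totals.items()
    }


sample = [  # illustrative numbers only
    {"days_of_history": 200, "actual": 100, "forecast": 90},
    {"days_of_history": 30, "actual": 10, "forecast": 16},
]
by_cohort = wmape_by_cohort(sample)
```

In this toy data the mature cohort scores 10% and the new cohort 60%, while the blended figure would be about 14.5% and would hide the gap entirely.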

Eval against naive baselines
Before celebrating a WMAPE of 18%, check it against naive baselines: a simple moving average (last 4-week average) and a seasonal naive (same-week-last-year). If your model is barely beating a moving average on the majority of SKUs, the sophistication isn't adding value. If it consistently beats seasonal naive on seasonal brands, that's a meaningful result worth communicating to customers.
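Both baselines are simple enough to write in a few lines, which is the point: a production model has to clear this bar. The sketch below assumes weekly history with the most recent week last and a 52-week year; the function names are illustrative.

```python
def moving_average_forecast(history, window=4):
    """Forecast next week as the mean of the last `window` weeks."""
    if not history:
        raise ValueError("history is empty")
    recent = history[-window:]
    return sum(recent) / len(recent)

def seasonal_naive_forecast(history, season_length=52):
    """Forecast next week as the value from the same week one season earlier."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    return history[-season_length]


weekly_units = list(range(1, 61))  # 60 weeks of toy data: 1, 2, ..., 60
ma = moving_average_forecast(weekly_units)   # mean of 57, 58, 59, 60
sn = seasonal_naive_forecast(weekly_units)   # the value 52 weeks before next week
```

Scoring these two forecasts with the same metric and holdout window as the model gives the comparison the paragraph above asks for.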

Tightly's Composite Scoring Matrix

Tightly's replenishment model generates a composite score for each SKU by blending four inputs. As PM, you need to understand what each dimension contributes and when to weight them differently — because these weighting decisions should be driven by customer type, product category, and business context.

Velocity Score: Based on the SKU's recent sell-through rate, normalized for product age. High weight for stable, mature SKUs where the trend is stationary.

Trend Score: Rate of change in the velocity — is this SKU accelerating or decelerating? High weight for new product launches or products in a growth phase. Low weight for mature, stable SKUs where trend noise can mislead.

Seasonality Score: Captures repeating seasonal patterns (weekly, monthly, annual). The key technical question is: does the seasonality score generalize to a new season it hasn't seen yet, or does it only replay last year? For brands with year-over-year growth, last year's seasonality curve should be adjusted for the growth trend before being applied.

Supplier Reliability Score: A newer dimension that attempts to model the probability that a PO will arrive on time, given historical supplier performance. A supplier with 60% on-time delivery should generate earlier POs with larger safety stock than one with 95% on-time delivery.

The PM decision point: When a customer switches from footwear (high seasonality, volatile trend) to supplements (stable velocity, predictable reorder), the weighting should shift — more velocity and supplier weight, less trend and seasonality. Customers who can configure these weights (or who have them auto-tuned) will have higher PO acceptance rates.
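One way to picture the weighting decision is a normalized weighted average of the four scores. The lesson does not say how Tightly actually blends them, so the formula, the 0-to-1 score scale, and every weight below are assumptions chosen only to illustrate the shift from a footwear profile to a supplements profile.

```python
DIMENSIONS = ("velocity", "trend", "seasonality", "supplier_reliability")

def composite_score(scores, weights):
    """Weighted average of the four dimension scores (assumed blend).

    scores, weights: dicts keyed by the names in DIMENSIONS.
    Weights are normalized, so they do not need to sum to 1.
    """
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


# Hypothetical scores for one SKU, each on a 0-to-1 scale.
sku_scores = {"velocity": 0.8, "trend": 0.2, "seasonality": 0.4, "supplier_reliability": 0.6}

# Illustrative weight profiles, not Tightly's actual settings.
footwear_weights = {"velocity": 0.2, "trend": 0.3, "seasonality": 0.4, "supplier_reliability": 0.1}
supplement_weights = {"velocity": 0.5, "trend": 0.1, "seasonality": 0.1, "supplier_reliability": 0.3}

footwear_score = composite_score(sku_scores, footwear_weights)
supplement_score = composite_score(sku_scores, supplement_weights)
```

The same SKU scores about 0.44 under the footwear profile and about 0.64 under the supplements profile, which shows how much the weighting choice alone can move a replenishment recommendation.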

A/B Testing in Supply Chain Software

You cannot A/B test a replenishment algorithm the way you test a button color. The unit of randomization is a purchase decision, and purchase decisions happen monthly, not hourly. Interference effects are severe: if you apply model A to some SKUs and model B to others within the same brand, the inventory decisions interact (a stockout on a model-A SKU affects demand for a model-B SKU that's a substitute).

Holdout brand methodology: The cleanest approach is to randomly assign entire brands to model A or model B and measure outcomes (WMAPE, stockout rate, PO acceptance rate) over a 12-16 week window. This eliminates interference but requires a large enough brand base to achieve statistical power — typically 50+ brands per arm.

Holdout SKU methodology (within-brand): Assign a random subset of SKUs within a brand to each model. Faster results but vulnerable to interference. Better suited for testing UI changes (e.g., does showing confidence intervals change PO acceptance rate?) than for testing algorithm changes.

The PM implication: When a data scientist proposes an algorithm change, ask: "How will we validate this in production?" If the answer is "we'll see if MAPE improves" without a controlled holdout, the measurement is unreliable. Push for a holdout brand experiment with pre-registered success metrics — even if the window is 12 weeks, it's worth the patience.
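The holdout brand methodology described above comes down to randomizing at the brand level, so that every SKU in a brand sees the same model. A minimal sketch, assuming brands are identified by strings and using a fixed seed so the assignment is reproducible and can be recorded alongside the pre-registered metrics:

```python
import random

def assign_brands(brand_ids, seed=0):
    """Randomly assign whole brands to arm 'A' or 'B' in near-equal numbers.

    brand_ids: iterable of unique brand identifiers (hypothetical IDs).
    seed: fixed so the same input always yields the same assignment.
    """
    rng = random.Random(seed)
    shuffled = sorted(brand_ids)   # sort first so input order does not matter
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {brand: ("A" if i < half else "B") for i, brand in enumerate(shuffled)}


brands = [f"brand_{n:03d}" for n in range(100)]  # toy brand list
assignment = assign_brands(brands, seed=42)
```

With 100 brands this yields 50 per arm. Outcomes such as WMAPE, stockout rate, and PO acceptance rate would then be compared between the two arms over the experiment window.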

Tightly's data science team reports: 'We improved MAPE from 24% to 19% on our test set.' Before celebrating, what is the most important follow-up question for the PM to ask?