Black Friday without the stomachache — problems to anticipate and how to avoid them

Black Friday isn’t “more of the same.” It demands real-time price/promo/stock, idempotent orders, queues in front of sluggish systems, a hardened perimeter, and observability at every step.

Black Friday exposes weaknesses in integrations. What “works okay” day-to-day breaks when traffic and change rates peak. Below are the most common problems—and how to design them away with a platform in Azure instead of point-to-point connections.

1. Promo prices stuck in batch jobs

When prices must flip at 00:00, a “nightly batch” is too slow. Feeds update late, caches serve stale content, and Google/marketplaces display wrong prices.

Example

  • 00:00 – the price is lowered in the ERP

  • 00:02 – customers search for the product

  • 08:00 – Merchant Center/channels have caught up—eight hours too late

Your competitor already captured the night’s traffic.

Practical solution

  • Publish price/promo as real-time events via a queue (e.g., Service Bus), not as a nightly export.

  • Set freshness SLOs for price/promo (e.g., ≤ 60s end-to-end) and alert on breaches.

  • Use feature flags for exact 00:00 activation, plus a canary (1–5% of traffic) a few minutes before the full release.
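To make the freshness SLO concrete, here is a minimal sketch of how the lag could be measured and checked. The function names, the 60-second target (taken from the example above), and the timestamps are illustrative assumptions, not part of any Azure API.

```python
from datetime import datetime, timedelta, timezone

# Assumed target, taken from the "<= 60s end-to-end" example in the text.
FRESHNESS_SLO = timedelta(seconds=60)


def freshness(changed_at: datetime, visible_at: datetime) -> timedelta:
    """End-to-end lag between the ERP change and the channel showing it."""
    return visible_at - changed_at


def breaches_slo(changed_at: datetime, visible_at: datetime,
                 slo: timedelta = FRESHNESS_SLO) -> bool:
    """True when the price took longer than the SLO to reach the channel."""
    return freshness(changed_at, visible_at) > slo


# Illustrative timestamps only: a price flip at midnight, seen via two paths.
midnight = datetime(2025, 11, 28, 0, 0, tzinfo=timezone.utc)
event_path = midnight + timedelta(seconds=42)   # hypothetical event-driven feed
nightly_batch = midnight + timedelta(hours=8)   # the 08:00 catch-up from the example

print(breaches_slo(midnight, event_path))
print(breaches_slo(midnight, nightly_batch))
```

In practice the two timestamps would come from the price event itself and from whatever confirms that the channel has picked it up; an alert fires when `breaches_slo` is true.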

2. When inventory lies—oversell and returns

High order volume combined with slow stock reconciliation means customers can order items that are already sold out.

Example
A product with 100 units in stock sells 120 in the first 30 minutes because inventory updates lag by more than 5 minutes. Twenty customers end up with backorders or denied orders.

Practical solution

  • Emit a stock event on every change—avoid “poll every 5 minutes.”

  • Use a reservation or quota model in checkout (hard or soft reservations) for hot items.

  • Define an SLO for stock sync (e.g., ≤ 30s) and apply backpressure if the ERP can’t keep up.
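A reservation model can be sketched in a few lines. The class below is a hypothetical in-memory version that replays the example above (100 units, 120 purchase attempts); a real checkout would need a shared store with atomic operations instead of a local lock.

```python
import threading


class StockReservations:
    """In-memory reservation counter for one item (illustrative only)."""

    def __init__(self, available: int) -> None:
        self._available = available
        self._lock = threading.Lock()

    def reserve(self, quantity: int = 1) -> bool:
        """Reserve stock at checkout; refuse instead of overselling."""
        with self._lock:
            if quantity <= 0 or quantity > self._available:
                return False
            self._available -= quantity
            return True

    def release(self, quantity: int = 1) -> None:
        """Give stock back, e.g. when a payment fails or a cart expires."""
        with self._lock:
            self._available += quantity


# Replay of the example: 100 in stock, 120 customers try to buy one unit each.
hot_item = StockReservations(available=100)
accepted = sum(hot_item.reserve() for _ in range(120))
rejected = 120 - accepted

print(accepted, rejected)
```

The twenty customers who would otherwise have received a backorder are told "sold out" at checkout, before an order exists.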

3. Duplicate orders and webhook storms

At peak, users double-click, the payment provider retries, and marketplaces send the same webhook multiple times.

Example
Checkout times out, the customer clicks again, the PSP retries—three calls create three ERP orders.

Practical solution

  • Introduce an idempotency key per order (stored in the platform). Same key = one order.

  • Put an API perimeter (API Management) in front that enforces a signature, time window, and idempotency header.

  • Handle retries in the queue layer with exponential backoff and jitter, and dead-letter messages that exhaust their retries.
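The two ideas above, "same key = one order" and backoff with jitter, can be sketched as follows. All names are hypothetical, the store is a plain dictionary rather than a durable one, and the sketch is single-threaded; it only illustrates the principle.

```python
import random
from typing import Callable, Dict, List


class IdempotentOrders:
    """Remembers the result per idempotency key so repeats create nothing new."""

    def __init__(self, create_order: Callable[[dict], str]) -> None:
        self._create_order = create_order
        self._results: Dict[str, str] = {}

    def submit(self, idempotency_key: str, payload: dict) -> str:
        if idempotency_key in self._results:
            # Repeat call (double-click, PSP retry): return the first result.
            return self._results[idempotency_key]
        order_id = self._create_order(payload)
        self._results[idempotency_key] = order_id
        return order_id


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


# Stand-in for the ERP, used only to count how many orders get created.
erp_orders: List[dict] = []


def fake_erp_create(payload: dict) -> str:
    erp_orders.append(payload)
    return f"ERP-{len(erp_orders)}"


orders = IdempotentOrders(fake_erp_create)
# The timeout scenario from the example: three calls carrying the same key.
ids = [orders.submit("checkout-key-1", {"sku": "A1", "qty": 1}) for _ in range(3)]

print(ids, len(erp_orders))
```

The jitter matters during a webhook storm: without it, all clients that failed together retry together.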

4. ERP bottlenecks—when the system can’t keep up

Direct calls against the ERP hit the ceiling (rate limits/CPU). Without queues, calls are lost or checkout blocks.

Example
ERP tolerates 100 req/min; the peak needs 800 req/min. Direct integration times out → orders/exports drop.

Practical solution

  • Decouple with a queue (Service Bus). Accept at the platform’s pace; process at the ERP’s pace.

  • Scale workers (Functions) to drain the backlog after the peak, so the SLA is met on average even if individual orders wait in the queue.

  • Enable brownout mode: park non-critical flows; prioritize the order flow.
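It helps to do the arithmetic on the example before the night itself. The sketch below uses the 100 and 800 req/min figures from the example; the 10-minute peak and the 50 req/min post-peak rate are assumptions chosen only to illustrate the calculation, and the flow names are hypothetical.

```python
def backlog_after_peak(peak_rate: float, erp_rate: float, peak_minutes: float) -> float:
    """Messages waiting in the queue when the peak ends."""
    return max(0.0, peak_rate - erp_rate) * peak_minutes


def drain_minutes(backlog: float, erp_rate: float, post_peak_rate: float) -> float:
    """Minutes until the queue is empty once traffic has calmed down."""
    spare = erp_rate - post_peak_rate
    if spare <= 0:
        raise ValueError("ERP has no spare capacity; the backlog never drains")
    return backlog / spare


# Assumed set of flows that must keep running during a brownout.
CRITICAL_FLOWS = {"order"}


def admit(flow: str, brownout: bool) -> bool:
    """In brownout mode only critical flows are processed; the rest are parked."""
    return flow in CRITICAL_FLOWS or not brownout


backlog = backlog_after_peak(peak_rate=800, erp_rate=100, peak_minutes=10)
minutes = drain_minutes(backlog, erp_rate=100, post_peak_rate=50)

print(backlog, minutes)
```

Under these assumptions a 10-minute peak leaves 7,000 messages that take 140 minutes to drain, which is why parking non-critical flows during the peak pays off: every request the ERP does not spend on secondary exports goes to orders.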

5. Edge performance—gateway, cache, and protection

Without a clear edge, both legitimate spikes and attacks become dangerous.

Example
A sudden traffic spike passes straight through to the backend, which starts returning random 429/500 responses.

Practical solution

  • Use API Management + Front Door/WAF to control rate limits, geo/IP policies, and mTLS where needed.

  • Cache safe read paths (e.g., static attributes) and keep the write path strict and idempotent.

  • Build a capacity model: load test at least 3× weekday traffic before Black Friday.
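In practice the gateway enforces rate limits for you, but it is useful to understand the mechanism when choosing the numbers. The sketch below is a token bucket, a common rate-limiting algorithm; it is not a description of how API Management implements its policies, and the class and parameters are illustrative.

```python
import time
from typing import Optional


class TokenBucket:
    """Allows short bursts up to `burst`, then a steady `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int,
                 start: Optional[float] = None) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = float(burst)
        self.updated = time.monotonic() if start is None else start

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True if the request may pass; False maps to an HTTP 429."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to elapsed time, never above the burst size.
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Explicit timestamps make the behaviour easy to follow.
bucket = TokenBucket(rate_per_sec=2, burst=5, start=0.0)
burst_results = [bucket.allow(now=0.0) for _ in range(7)]
after_one_second = bucket.allow(now=1.0)

print(burst_results, after_one_second)
```

The first five requests pass, the next two are rejected at the edge, and one second later capacity is available again. The point is that the rejection is deliberate and happens before the backend is touched.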

6. Observability—from “it doesn’t work” to root cause

Without correlation you won’t know which step failed.

Example
“Order export down” turns out to be a new partner payload missing an attribute.

Practical solution

  • Carry a correlation ID through the entire chain (APIM → Functions → Service Bus → ERP).

  • Build dashboards for three SLOs: Order latency, Price/Promo freshness, Stock freshness.

  • Alert on dead letters, signature errors, unusual retry frequency, and spikes in 429s.
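Carrying a correlation ID mostly comes down to three small habits: accept or create the ID at the edge, copy it onto every message, and include it in every log line. A minimal sketch, where the header name and helper functions are assumptions rather than a fixed convention:

```python
import uuid
from typing import Dict, Mapping

# Assumed header name; use whatever your gateway already forwards.
CORRELATION_HEADER = "x-correlation-id"


def ensure_correlation_id(headers: Mapping[str, str]) -> str:
    """Reuse the caller's ID if present (case-insensitive), else create one."""
    for name, value in headers.items():
        if name.lower() == CORRELATION_HEADER and value:
            return value
    return str(uuid.uuid4())


def to_queue_message(body: dict, correlation_id: str) -> Dict[str, object]:
    """Wrap the payload so the ID travels with it through the queue."""
    return {"correlation_id": correlation_id, "body": body}


def log_line(step: str, correlation_id: str, message: str) -> str:
    """One searchable log line per step, always carrying the same ID."""
    return f"step={step} correlation_id={correlation_id} msg={message}"


incoming = {"X-Correlation-Id": "abc-123"}
cid = ensure_correlation_id(incoming)
message = to_queue_message({"order": 42}, cid)
line = log_line("order-export", str(message["correlation_id"]), "payload missing attribute")

print(line)
```

With this in place, "order export down" becomes a search for one ID across all steps instead of a guess.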

7. Changes near peak—without breaking things

Last-minute changes can tip the entire system.

Example
A new promo rule is rolled out broadly at 21:00; the PR review missed a negative edge case.

Practical solution

  • Use feature flags + a staged canary (1% → 5% → 20% → 100%) with automatic rollback on alerts.

  • Enforce a “change freeze” on non-critical parts seven days before Black Friday.

  • Treat configuration as code with mandatory contract tests on partner payloads.
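A staged canary needs two things: a stable way to decide which users are in the current percentage, and a rule for moving between stages. The sketch below hashes the user ID into a bucket from 0 to 99 so that a user who is in the 5% stage stays in at 20%. The function names and the flag name are hypothetical; a feature-flag service would normally do this for you.

```python
import hashlib

ROLLOUT_STAGES = (1, 5, 20, 100)  # percentages from the text


def bucket(user_id: str, flag: str) -> int:
    """Stable bucket 0..99 per user and flag, independent of process or host."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, flag: str, percent: int) -> bool:
    """A user is in the canary when their bucket falls below the percentage."""
    return bucket(user_id, flag) < percent


def next_stage(current: int, alert_firing: bool) -> int:
    """Advance one stage at a time; any alert rolls back to 0%."""
    if alert_firing:
        return 0
    later = [stage for stage in ROLLOUT_STAGES if stage > current]
    return later[0] if later else current


stage = 0
for alert in (False, False, True):
    stage = next_stage(stage, alert_firing=alert)

print(stage)
```

In the loop above the rollout reaches 5% and is then rolled back to 0% when the alert fires, which is the behaviour you want at 21:00 the night before.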

8. Security under peak

Pressure increases—not only from customers.

Example
Leaked SAS keys or open function endpoints enable scraping/DoS—right when you’re most vulnerable.

Practical solution

  • Use Managed Identity everywhere, Key Vault with rotation, and private endpoints.

  • Edge policies: signature validation and IP filters in APIM; geo filters, DDoS protection, and bot rules in Front Door/WAF.

  • Restrict SAS: time-bound, IP-locked, least privilege.
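Signature validation with a time window is easier to reason about with a small example. The sketch below signs a timestamp plus the request body with HMAC-SHA256 and rejects anything that is too old or has been tampered with. The message layout, the 300-second window, and the secret are assumptions; real partners and PSPs each define their own scheme, and the secret would live in Key Vault.

```python
import hashlib
import hmac

MAX_SKEW_SECONDS = 300  # assumed acceptance window


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over "<timestamp>.<body>", hex encoded."""
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes,
           signature: str, now: int) -> bool:
    """Reject stale requests first, then compare signatures in constant time."""
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))


secret = b"demo-secret-not-for-production"
body = b'{"order": 42}'
sent_at = 1_700_000_000
signature = sign(secret, sent_at, body)

print(verify(secret, sent_at, body, signature, now=sent_at + 10))
print(verify(secret, sent_at, body, signature, now=sent_at + 3600))
```

The time window is what stops a captured request from being replayed an hour later; the idempotency key covers replays inside the window.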

Three mini-cases (real outcomes)

  • 00:00 promo: Switched from batch to event-driven price feed + flag-controlled activation. Freshness dropped from ~4h to < 60s. First-hour revenue +14% YoY.

  • ERP bottleneck: Introduced queue + workers, prioritized orders over secondary flows. 0% order loss; backlog cleared 17 minutes after peak.

  • Duplicate orders: Idempotency key + APIM signatures. Duplicates down 99.8%; support tickets per 1,000 orders halved.

Black Friday playbook (short and concrete)

  • Set SLOs: Order ≤ X seconds, Price/Promo freshness ≤ Y seconds, Stock freshness ≤ Z seconds.

  • Test: Load test at 3× weekday traffic; inject chaos (ERP timeouts, PSP retries).

  • Protect the edge: APIM rate limits, signatures, WAF/bot protection.

  • Make it robust: Idempotency, queues, backoff + jitter, brownout mode.

  • Control change: Feature flags, canary, change freeze on non-critical parts.

  • Measure & alert: Correlation IDs, SLO dashboards, alerts on dead letters/429s/signature errors.

Conclusion

Black Friday isn’t “more of the same.” It requires real-time price/promo/stock, idempotent orders, queues in front of slow systems, a hardened perimeter, and measurement at every step.

Build this as a platform in Azure—instead of point-to-point—and you stay in control when it matters most: more conversions, fewer incidents, and a calmer night between 00:00 and 08:00.

Want to start simple? Take the price/promo flow, set a freshness SLO, build the event path, measure before/after—and roll forward from there.