What cross-border signal loss looks like at the row level
Picture a Singapore buyer who opens ChatGPT, asks for the best payment terminal for her event business, clicks an answer that surfaces a US merchant, and converts on a Canadian-fulfilled domain. Three privacy regimes touch that journey. The PDPA consent the buyer gave the Singapore-served ad is not the consent the US merchant collected. The CCPA-mode cookie on the conversion event drops half the parameters the Canadian fulfilment server expected. Every analytics platform in the stack handles the gap differently.
Last-click attribution will credit the converting domain. Data-driven attribution in GA4 will quietly down-weight every touch it cannot model. The platform-reported number on the Meta ad set will overstate the result by a factor of 1.3 to 1.8 in our deployments. None of those reports describe what actually happened.
The honest answer is that the buyer's signal was never one row. It was a probability distribution across several rows that the warehouse stitched imperfectly. The measurement model needs to admit that explicitly.
Why GA4 data-driven attribution falls over at multi-market scale
Three failure modes. First, the privacy regime conflict. GA4's data-driven attribution is a black-box algorithm trained on globally aggregated user behaviour. When 30 percent of the property sits behind a PDPA consent banner that drops third-party cookies, the model trains on the 70 percent and applies its weights to the 30. The error rate is bounded but unmeasured.
Second, sampling at scale. GA4 samples once any single report exceeds the property quota. For a five-market campaign, the cross-market reports hit the quota first and silently sample. The report renders. The decision-maker does not see the sample warning unless they look for it.
Third, the explainability gap. Data-driven attribution does not return a posterior or a confidence interval. It returns a number. Asked to defend the number in front of a board, the analytics lead has nothing to point at.
Bayesian MMM in plain English for CMOs
A Marketing Mix Model is a regression that explains aggregate sales as a function of aggregate spend on each channel, with a few transforms layered on. Adstock captures the carry-over of yesterday's spend into today's sales. Saturation captures the diminishing return of each additional dollar in a channel. Time decay handles seasonality.
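The two transforms are easier to reason about with numbers in hand. The sketch below assumes a geometric adstock and a Hill-shaped saturation curve, which are common choices but not the only ones; the function names and the sample spend series are illustrative and not taken from any library.

```python
from typing import List


def geometric_adstock(spend: List[float], retention: float) -> List[float]:
    """Carry a fraction `retention` of yesterday's adstocked spend into today."""
    if not 0.0 <= retention < 1.0:
        raise ValueError("retention must be in [0, 1)")
    carried = 0.0
    out = []
    for x in spend:
        carried = x + retention * carried
        out.append(carried)
    return out


def hill_saturation(x: float, half_saturation: float, slope: float = 1.0) -> float:
    """Hill curve: 0 at zero spend, 0.5 at `half_saturation`, approaching 1."""
    if x <= 0:
        return 0.0
    return x**slope / (x**slope + half_saturation**slope)


# Illustrative spend: one burst, two dark days, then a smaller burst.
daily_spend = [100.0, 0.0, 0.0, 50.0]
adstocked = geometric_adstock(daily_spend, retention=0.5)
# adstocked == [100.0, 50.0, 25.0, 62.5]
response = [hill_saturation(a, half_saturation=100.0) for a in adstocked]
```

The regression then fits sales against the saturated, adstocked series rather than against raw spend, which is what lets it separate carry-over from diminishing returns.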
The Bayesian version of that regression replaces point estimates with probability distributions. Instead of returning "Meta delivered 1.4x ROAS," it returns "Meta's ROAS is in the range 1.1 to 1.6 with 90 percent probability, calibrated on this prior and this evidence."
The reason the Bayesian framing matters is not statistical purity. It is decision-making honesty. A board paper that says "we are 90 percent confident Meta returned between 1.1 and 1.6" is harder to argue with than one that says "Meta delivered 1.4." The first reads as the work of an operator who has done the analysis. That posterior is also the input that lets AI performance marketing allocate spend on calibrated incrementality rather than platform-reported ROAS, so the optimisation chases the channels that actually caused sales.
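An interval statement of that kind is just a summary of posterior draws. As a minimal sketch, assuming the model has already produced a list of ROAS samples, an equal-tailed interval can be read off the sorted draws. The draws here are synthetic (a seeded normal distribution chosen for illustration), and `credible_interval` is a hypothetical helper, not a Meridian or Robyn function.

```python
import random


def credible_interval(draws, mass=0.90):
    """Equal-tailed interval containing roughly `mass` of the posterior draws."""
    ordered = sorted(draws)
    tail = (1.0 - mass) / 2.0
    lo_idx = round(tail * (len(ordered) - 1))
    hi_idx = round((1.0 - tail) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


# Synthetic stand-in for posterior ROAS samples; a real run would take
# these from the fitted model instead.
rng = random.Random(0)
roas_draws = [rng.gauss(1.35, 0.15) for _ in range(4000)]

low, high = credible_interval(roas_draws, mass=0.90)
print(f"90% credible interval for ROAS: {low:.2f} to {high:.2f}")
```

The point of the exercise is that the width of the interval is reported alongside the centre, so a wide interval is visible to the reader rather than hidden behind a single number.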
Two open-source libraries do most of the heavy lifting. Google's Meridian is the official successor to the older lightweight_mmm (which was archived on 19 January 2026). Meridian is Apache 2.0, ships reach-and-frequency modelling, integrates search query volume, and supports experiment calibration out of the box. Meta's Robyn is MIT-licensed and bundles a hyperparameter-search loop. Meridian builds on TensorFlow Probability; Robyn is written in R, uses Nevergrad for the hyperparameter search, and uses nloptr for budget allocation. Only Meridian is Bayesian in the strict sense: Robyn fits a ridge regression and tunes it with multi-objective optimisation, so its outputs are not posteriors, although it can also be calibrated against experiments. Neither is vendor-locked, and both accept non-Google and non-Meta channels in the same model.
The calibration ritual: geo-holdout incrementality
An MMM in isolation tells you what the model believes. A geo-holdout tells you what reality believes. The two together calibrate.
Pick a matched pair of geographies in the same market. Run the campaign in geography A. Suppress it in geography B. Measure the delta in the outcome metric and back-solve for incremental lift. The lift number becomes a prior on the matching channel in the MMM. Re-estimate. The posterior tightens. The board paper becomes credible.
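One common way to back-solve the lift is a difference-in-differences comparison of the two geographies before and during the campaign. The sketch below assumes that approach; the function names and every number in it are illustrative, and how the resulting figure is passed in as a prior depends on the MMM library and is not shown.

```python
from statistics import mean


def did_lift(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences estimate of incremental daily outcome."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


def incremental_roas(lift_per_day, days, revenue_per_outcome, spend):
    """Incremental revenue over the window divided by the spend that bought it."""
    return lift_per_day * days * revenue_per_outcome / spend


# Illustrative daily outcomes (for example, orders) for geography A
# (campaign on) and geography B (suppressed).
a_pre, a_post = [100.0, 102.0, 98.0], [120.0, 118.0, 122.0]
b_pre, b_post = [80.0, 82.0, 78.0], [84.0, 86.0, 82.0]

lift = did_lift(a_pre, a_post, b_pre, b_post)  # 20 - 4 = 16 per day
roas = incremental_roas(lift, days=42, revenue_per_outcome=50.0, spend=20000.0)
```

Subtracting the control geography's change removes whatever moved both regions at once, such as a holiday or a price change, which is why the pair has to be matched on the metric that matters.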
Three operator details people miss. First, the matched-pair geography must be matched on the metric that matters, not on population. A Klang Valley to Penang pair matches on population but diverges on retail intensity by 40 percent. Pair on retail intensity if the campaign is retail-driven. Second, the holdout window has to be long enough to absorb adstock half-life plus a tail. Six weeks is the minimum for paid social. Third, the suppression has to be media-channel-specific: switch off only the channel under test, so that the measured lift can be attributed to that channel alone.
| Channel | Adstock half-life | Holdout window min | Matched-pair metric |
|---|---|---|---|
| Paid search | 5 to 9 days | 4 weeks | branded query volume |
| Paid social | 10 to 18 days | 6 weeks | site direct traffic |
| Programmatic + CTV | 21 to 35 days | 8 weeks | household reach proxy |
| Out of home | 14 to 28 days | 8 weeks | footfall proxy via geofence |
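The window lengths in the table can be sanity-checked against the half-lives. As a rough sketch, assuming carry-over decays geometrically, the share of a one-off spend's effect still outstanding after a window of a given length follows directly from the half-life. The helper names are hypothetical; the half-lives are the upper bounds from the table.

```python
def daily_retention(half_life_days: float) -> float:
    """Geometric retention rate that halves the effect every `half_life_days`."""
    return 0.5 ** (1.0 / half_life_days)


def residual_carryover(half_life_days: float, window_days: float) -> float:
    """Share of a one-off spend's effect still outstanding after the window."""
    return 0.5 ** (window_days / half_life_days)


# (upper-bound half-life in days, minimum holdout window in days)
channels = {
    "paid_search": (9, 28),
    "paid_social": (18, 42),
    "programmatic_ctv": (35, 56),
    "out_of_home": (28, 56),
}

for name, (half_life, window) in channels.items():
    print(
        f"{name}: retention/day={daily_retention(half_life):.3f}, "
        f"residual after window={residual_carryover(half_life, window):.2f}"
    )
```

Under this assumption a window of 1.6 half-lives, as in the programmatic row at its upper bound, still leaves roughly a third of the effect outstanding, which is a reason to treat the listed windows as minimums rather than targets.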
Three priors that decide everything
Adstock prior. The half-life on Meta in Singapore and Malaysia runs shorter than the global default the library ships with. A two-week half-life lands closer than the four-week default. Set it explicitly. Document the choice in the prior file so the next analyst does not overwrite it.
Saturation prior. The Hill function defaults assume a US-scale media market. APAC media markets saturate faster because audience sizes are smaller and frequency caps bite earlier. Cut the saturation point estimate by 30 to 40 percent on Singapore Meta. The model will return a tighter saturation posterior and a more honest diminishing-return curve.
Time-decay prior. The seasonal model has to know your business cycle. Bank-led financial services in APAC have a Q4 weight unrelated to global ecommerce Q4. Encode the business cycle as a Fourier prior with the right harmonics. Do not let the library guess.
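Encoding a cycle with Fourier terms means giving the model sine and cosine columns at the chosen period and its harmonics, then placing priors on their coefficients. A minimal sketch of the feature construction follows, assuming an annual period; the function name, the period, and the number of harmonics are illustrative choices, and the priors themselves would be set in whichever MMM library is in use.

```python
import math


def fourier_features(day_of_year: float, period: float = 365.25, harmonics: int = 2):
    """Return [sin_1, cos_1, sin_2, cos_2, ...] for one day of the cycle."""
    feats = []
    for k in range(1, harmonics + 1):
        angle = 2.0 * math.pi * k * day_of_year / period
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats


# One row of seasonal regressors per day of the year.
seasonal_matrix = [fourier_features(d, harmonics=3) for d in range(365)]
```

More harmonics let the curve follow sharper peaks, such as a concentrated Q4, at the cost of more coefficients to constrain, so the harmonic count is itself a modelling decision worth recording in the prior file.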
What your data warehouse needs to look like
MMM is a daily-aggregated model. The warehouse needs a daily fact table for spend by channel by market and a daily fact table for outcome by market. The outcome can be sales, applications, or qualified leads. Both tables key on date and market.
Ingestion: daily for spend (vendor APIs), daily for outcome (CRM or commerce platform), weekly for external signals. Governance: every prior decision lives in version control. Every model run produces a hash of the input tables, the prior file, and the model config. If anybody asks how a number was generated, the answer is git log. With that warehouse in place, performance marketing can move budget between markets and channels on a defensible posterior instead of a black-box attribution number, which is the whole point of building the model.
| Table | Grain | Refresh | Source |
|---|---|---|---|
| fact_spend_daily | date × market × channel | daily | Meta + Google + LinkedIn + TikTok APIs |
| fact_outcome_daily | date × market | daily | CRM, commerce, app analytics |
| dim_market | market | static | ISO-3166 + hreflang mapping |
| dim_channel | channel | static | internal taxonomy + vendor mapping |
| fact_external_signals_weekly | iso_week × market | weekly | Trends, weather, sentiment, OOH |
| model_run_log | run_id × hash | every run | internal CI |
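The run hash that feeds model_run_log can be as simple as a SHA-256 over the three inputs. The sketch below is one possible shape, assuming the input tables are available as byte exports and the config as a JSON-serialisable dict; `run_fingerprint` and the sample inputs are hypothetical.

```python
import hashlib
import json


def run_fingerprint(table_bytes: dict, prior_text: str, config: dict) -> str:
    """Hash the input tables, the prior file, and the model config together."""
    h = hashlib.sha256()
    # Sort table names so the fingerprint does not depend on dict order.
    for name in sorted(table_bytes):
        h.update(name.encode("utf-8"))
        h.update(hashlib.sha256(table_bytes[name]).digest())
    h.update(hashlib.sha256(prior_text.encode("utf-8")).digest())
    # sort_keys makes the config serialisation deterministic.
    h.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    return h.hexdigest()


# Illustrative inputs; a real run would read table exports and the prior file.
tables = {
    "fact_spend_daily": b"date,market,channel,spend\n",
    "fact_outcome_daily": b"date,market,outcome\n",
}
priors = "meta_sg_adstock_half_life_days: 14\n"
config = {"chains": 4, "draws": 1000}

fingerprint = run_fingerprint(tables, priors, config)
```

Storing the fingerprint next to the commit that produced the run means any reported number can be traced back to the exact data, priors, and settings behind it.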