A/B Testing Landing Pages: How to Do It Right

Most landing page tests fail for a boring reason. The traffic runs for four days, one version is ahead by a point and a half, someone calls it a winner, and the page gets shipped. Two months later conversion is flat and nobody can explain why. The test was never long enough or big enough to mean anything.

A/B testing works. It is also one of the easiest marketing tools to fool yourself with. The math behind it is unforgiving, the temptation to peek is constant, and the prettier headline often loses to the uglier one. This guide walks through how to run tests that hold up: what to measure, how much traffic you actually need, the traps that produce fake winners, and how to turn results into a steady lift in qualified leads instead of a graveyard of inconclusive experiments.

What A/B testing a landing page actually means

You split incoming traffic between two versions of a page. Version A is your current page, the control. Version B changes one thing, or one bundle of related things, and you measure which version drives more of the outcome you care about. Visitors get bucketed at random and stay in their bucket, so the only systematic difference between the groups is the change you made.

That random split is the whole point. It spreads the noise (device, source, time of day, mood) evenly across both groups, on average, and leaves you with a clean comparison. If version B converts better on a large enough sample, you can be reasonably confident the change caused it, not the weather or a coincidental spike in branded traffic.
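
Most testing tools handle the bucketing for you, but the mechanics are simple enough to sketch. The example below is a minimal illustration, not any particular tool's method: it hashes a visitor ID together with an experiment name so the same visitor always lands in the same variant. The function name, the experiment name, and the idea that the ID comes from a first-party cookie are all assumptions made for the example.

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str, variants=("A", "B")) -> str:
    """Map a visitor to a variant so repeat visits stay in the same bucket."""
    # Hashing the experiment name with the ID keeps buckets independent across tests.
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Turn the first 8 hex characters into an integer and spread it over the variants.
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]

# Hypothetical visitor ID, e.g. read from a first-party cookie.
first_visit = assign_variant("visitor-123", "demo-offer-test")
second_visit = assign_variant("visitor-123", "demo-offer-test")
print(first_visit, second_visit)  # the two values are always identical
```

Because the assignment depends only on the ID and the experiment name, nothing has to be stored to keep a visitor in their bucket, and the split comes out close to even on any reasonable volume.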

For B2B, there is a wrinkle worth saying out loud. Your sales cycle is long, your traffic is thinner than an e-commerce store's, and a "conversion" on the page (a form fill) is several steps removed from revenue. That changes how you test, not whether you should. We will come back to it.

Pick one metric, and pick the right one

Before you touch a headline, decide what counts as winning. This sounds obvious and it is where most teams quietly go wrong.

The default metric is conversion rate on the page: the percentage of visitors who submit the form. Easy to measure, fast to read. The problem is that form fills and good leads are not the same thing. A test that wins on raw form fills can lose on sales-qualified leads if version B pulled in more tire-kickers. A shorter form almost always lifts submissions. It can also flood your sales team with junk.

So you need a primary metric, and ideally one guardrail metric behind it.

Primary: the thing you optimize for. Form submission rate, demo bookings, or trial starts.
Guardrail: the downstream quality check. Lead-to-MQL rate, lead-to-opportunity rate, or sales-accepted leads.

If your CRM tracks leads back to the page variant, you can test on the downstream metric directly. That is the gold standard, and it is also slow, because qualified opportunities are rare events. A page that gets 4,000 visitors a month might produce 30 opportunities. You will wait a long time to see a difference there. Closing the loop between ad click, form fill, and CRM stage is the foundation here, and it is worth getting your B2B conversion tracking right before you run a single test.

A workable compromise: optimize on form fills, watch lead quality as a guardrail, and refuse to ship any winner that tanks the quality number. If version B lifts submissions 18% but its lead-to-opportunity rate drops by a third, that is not a win. That is a leak.
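
The arithmetic behind "that is a leak" is worth seeing once. The sketch below multiplies the two relative changes to get the net change in opportunities per visitor, then applies a simple ship rule. The 10% tolerance on the guardrail is a made-up threshold for illustration, and the rule assumes the primary lift has already passed a significance check.

```python
def net_opportunity_change(primary_lift: float, guardrail_change: float) -> float:
    """Relative change in opportunities per visitor (0.18 means +18%)."""
    return (1 + primary_lift) * (1 + guardrail_change) - 1

def should_ship(primary_lift: float, guardrail_change: float,
                max_guardrail_drop: float = 0.10) -> bool:
    """Ship only if the primary metric is up and the guardrail held.

    max_guardrail_drop is a hypothetical tolerance; pick your own.
    """
    wins_primary = primary_lift > 0
    guardrail_ok = guardrail_change >= -max_guardrail_drop
    return wins_primary and guardrail_ok

# Submissions up 18%, lead-to-opportunity rate down by a third.
change = net_opportunity_change(0.18, -1 / 3)
print(f"Net change in opportunities per visitor: {change:.1%}")
print("Ship it?", should_ship(0.18, -1 / 3))
```

In that scenario the page produces roughly a fifth fewer opportunities per visitor despite the "winning" form-fill number.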

What to test (in order of impact)

Not all changes move the needle equally. Spend your traffic on the elements that decide whether someone stays or bounces, before you fuss with button colors.

Roughly in descending order of typical impact:

The offer. What the visitor gets for filling the form. "Book a demo" versus "Get a free 15-minute funnel audit" is a real test, and often the biggest lever on the page.
The headline and subhead. The first thing read, the thing that decides whether the rest gets read at all.
Form length and fields. Five fields versus two. Every field you remove lifts completion and lowers qualification. That trade-off is exactly why the guardrail metric exists.
Social proof. Logos, a specific testimonial with a name and a number, a case-study stat near the form.
Page structure and hero layout. Where the form sits, what is above the fold, how the value stack is ordered.
Visuals and microcopy. Button text, image versus no image, hero video.

Button color is at the bottom for a reason. It rarely produces a difference you can detect with B2B traffic volumes. If your page has a weak offer and a vague headline, no shade of green will save it. The biggest wins usually come from the message and the offer, which is also why getting the landing page structure right gives you better things to test in the first place.

One discipline: change one variable per test, or one tightly related cluster. If you swap the headline, the hero image, and the form length all at once and version B wins, you have learned that some combination of three things worked. You cannot say which, and you cannot reuse the lesson on the next page.

The math that decides if your test is real

This is the section people skip, and skipping it is why their tests lie to them.

Three inputs govern every A/B test. Treat the figures below as illustrative and plug in your own.

Baseline conversion rate. Where you are now. Say your page converts at 4%.

Minimum detectable effect (MDE). The smallest improvement worth detecting. A 25% relative lift would take 4% to 5%. The smaller the lift you want to catch, the more traffic you need, and the relationship is brutal: halving the MDE roughly quadruples the sample size.

Statistical significance and power. Significance (usually a 95% confidence level, which means accepting a 5% chance of a false positive) is your tolerance for declaring a winner that isn't one. Power (usually 80%) is your odds of catching a real winner when one exists. Most teams obsess over the first and ignore the second, then run underpowered tests that miss real effects.

Here is the part that hurts. To detect a 25% relative lift on a 4% baseline at 95% significance and 80% power, you need roughly 5,000 to 6,000 visitors per variant, so 10,000 to 12,000 total. To detect a 10% lift, you need somewhere north of 30,000 per variant. (These are ballpark figures from a standard sample-size calculator; run your own before you start, because the exact number depends on your baseline.)

Rough traffic needed per variant to detect a lift (4% baseline, 95% significance, 80% power). Illustrative.
Relative lift you want to catch	New rate	Approx. visitors per variant
50%	6.0%	~1,500
25%	5.0%	~5,500
15%	4.6%	~15,000
10%	4.4%	~33,000
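
If you would rather compute these figures than trust a table, the standard normal-approximation formula for comparing two proportions fits in a few lines. This is a sketch of that textbook formula, not the method of any specific calculator, and the function name and defaults are assumptions. It runs a two-sided test by default, which comes out somewhat above the ballpark figures in the table; setting two_sided=False comes out a little below them.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80,
                            two_sided: bool = True) -> int:
    """Visitors needed in each variant to detect the given relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z = NormalDist()
    # alpha = 0.05 corresponds to the usual "95% significance".
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_power = z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

for lift in (0.50, 0.25, 0.15, 0.10):
    n = sample_size_per_variant(0.04, lift)
    print(f"{lift:.0%} relative lift: about {n:,} visitors per variant")
```

With these defaults the 25% row lands near 6,700 per variant two-sided and near 5,300 one-sided, which is why two calculators can disagree on the same inputs. The pattern matters more than the exact number: each time the lift you want to catch halves, the requirement roughly quadruples.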

Now look at your monthly landing page traffic. If it is 3,000 visitors, you cannot reliably detect anything smaller than a big swing, and you should design tests around big swings: bold offer changes, radically different headlines, structural redesigns. Trying to A/B test a button color on 3,000 visitors a month is theater.
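
You can also turn the question around: given the traffic you have, what is the smallest lift you could reliably catch? The sketch below inverts the same normal-approximation sample-size formula with a simple binary search. It assumes a two-sided test at 95% significance and 80% power, and a one-month test on 3,000 visitors split evenly, so 1,500 per variant. The function names are made up for the example.

```python
from statistics import NormalDist

def required_per_variant(p1: float, p2: float,
                         alpha: float = 0.05, power: float = 0.80) -> float:
    """Normal-approximation sample size per variant for rates p1 vs p2."""
    z = NormalDist()
    k = (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2
    return k * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2

def smallest_detectable_lift(baseline: float, visitors_per_variant: int) -> float:
    """Smallest relative lift this much traffic can detect.

    Returns the upper search bound if even that lift is out of reach.
    """
    low, high = 0.001, 0.999 / baseline - 1
    for _ in range(60):
        mid = (low + high) / 2
        if required_per_variant(baseline, baseline * (1 + mid)) > visitors_per_variant:
            low = mid   # not enough traffic for this lift, try a bigger one
        else:
            high = mid  # enough traffic, try a smaller one
    return high

lift = smallest_detectable_lift(baseline=0.04, visitors_per_variant=1500)
print(f"Smallest reliably detectable lift: about {lift:.0%}")
```

On those assumptions the answer falls between a 50% and a 60% relative lift. Running longer raises the visitor count and lowers that floor, but slowly.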

Calculate the required sample before you launch. Decide the duration. Then leave it alone.

The traps that produce fake winners

Peeking and early stopping

The most common mistake, and the most expensive. You check the dashboard daily, see version B pull ahead, and stop the test to declare victory. The trouble is that conversion rates wander early on. With small samples they swing wildly, cross the significance line by chance, and drift back. If you stop the moment you see significance, you will "win" constantly with changes that do nothing.

The fix is mechanical. Decide your sample size and duration in advance. Do not stop until you hit both, no matter how tempting the early lead looks. If you want the freedom to peek, use a tool with sequential testing or a Bayesian approach designed for it, and follow its stopping rule, not your gut.
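
If the peeking problem feels abstract, simulate it. The sketch below runs a batch of A/A tests, where both arms convert at the same 4% so every "winner" is false by construction. It compares two habits: stopping at the first peek that crosses 95% significance, and reading the result once at the end. The run count, sample size, number of peeks, and seed are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def z_stat(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic; 0.0 when there is no variance yet."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se

def simulate_aa_tests(runs=400, visitors_per_arm=2000, peeks=10,
                      rate=0.04, alpha=0.05, seed=7):
    """Return (false-win rate with peeking, false-win rate at the fixed end)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step = visitors_per_arm // peeks
    peeking_wins = fixed_wins = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        crossed_at_any_peek = False
        significant_at_end = False
        for peek in range(1, peeks + 1):
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            n = peek * step
            significant = abs(z_stat(conv_a, n, conv_b, n)) > z_crit
            crossed_at_any_peek = crossed_at_any_peek or significant
            significant_at_end = significant  # overwritten until the final look
        peeking_wins += crossed_at_any_peek
        fixed_wins += significant_at_end
    return peeking_wins / runs, fixed_wins / runs

peeking_rate, fixed_rate = simulate_aa_tests()
print(f"False winners when stopping at the first significant peek: {peeking_rate:.0%}")
print(f"False winners when reading once at the end: {fixed_rate:.0%}")
```

Expect the fixed-horizon number to sit near the nominal 5% and the peeking number to land well above it. Nothing about the pages changed; only the stopping habit did.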

Running too short to catch your weekly cycle

B2B traffic behaves differently on Tuesday than on Saturday. Decision-makers research during work hours. A test that runs Monday to Thursday misses the weekend pattern entirely and bakes a weekday bias into the result. Run for full weeks, minimum one, usually two to four. A test should cover at least one complete business cycle so day-of-week effects average out.
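
Turning a required sample into a run length is simple arithmetic, with one rule: round up to whole weeks. A minimal sketch, assuming an even split across variants and a steady weekly visitor count (both simplifications); the function name and the example numbers are illustrative.

```python
from math import ceil

def weeks_to_run(required_per_variant: int, weekly_visitors: int,
                 variants: int = 2, min_weeks: int = 1) -> int:
    """Whole weeks needed to reach the sample, never fewer than min_weeks."""
    total_needed = required_per_variant * variants
    weeks = ceil(total_needed / weekly_visitors)
    return max(weeks, min_weeks)

# Roughly 5,500 per variant on a page with about 750 visitors a week.
print(weeks_to_run(5500, 750), "weeks")
```

For that example the answer is 15 weeks, which is a useful reality check before you commit to chasing a 25% lift on a 3,000-visitor page.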

The sample ratio mismatch

If you intended a 50/50 split and your tool reports 53/47 across thousands of visitors, something is broken: a redirect bug, a tracking gap, a bot filter hitting one variant harder. A skewed split means the randomization or the tracking failed, and either one invalidates the whole test. Check the split early. If it is off by more than a point or two on decent volume, stop and debug before you trust any number.
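
Whether a given split is suspicious depends on volume, and a chi-square goodness-of-fit test puts a number on it. The sketch below uses only the standard library and relies on the fact that, with two groups, the chi-square p-value can be computed from the complementary error function. Treating p below 0.001 as a mismatch is a commonly used convention, adopted here as an assumption rather than a rule from this guide.

```python
from math import erfc, sqrt

def srm_p_value(visitors_a: int, visitors_b: int,
                expected_share_a: float = 0.5) -> float:
    """P-value for the observed split versus the intended one (1 degree of freedom)."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))

# The same 53/47 split at two very different volumes.
print(f"200 visitors:    p = {srm_p_value(106, 94):.3f}")
print(f"10,000 visitors: p = {srm_p_value(5300, 4700):.2e}")
```

At 200 visitors a 53/47 split is ordinary chance. At 10,000 it is far past any reasonable threshold and worth debugging before you read a single conversion number.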

Testing during a traffic anomaly

A PR spike, a big paid push, a seasonal lull, a competitor outage. Any of these can flood your test with traffic that behaves nothing like your normal mix. The result will not generalize. If you know an anomaly is coming, wait. If one hits mid-test, note it and be ready to discount the affected days.

A repeatable testing process

Tools matter less than discipline. Here is a loop that holds up.

Start from data, not opinion. Before you guess at a test, look at where the page leaks. Scroll maps, session recordings, the GA4 funnel, form-field drop-off. If 70% of visitors never reach the form, your test belongs above the fold, not on the submit button. A structured read of the page's weak points is the front half of any real conversion rate optimization effort, and it tells you what is worth testing.

Write the hypothesis as a sentence. "Because session recordings show visitors hesitating at a five-field form, reducing it to three fields will lift submission rate by at least 15% without hurting lead-to-opportunity rate." A hypothesis with a because, a change, and a predicted effect keeps you honest. A vague "let's try a new headline" does not.

Calculate sample size and set the end date. Do this before launch. Write the number down. This is the commitment that stops you from peeking.

Run it clean. One variable. Full weeks. No mid-flight changes. Watch the sample ratio.

Read the result against both metrics. Primary and guardrail. A winner has to pass both.

Ship, document, and feed the next test. Whatever wins becomes the new control. Write down what you learned, including the losers, because a clear loss ("longer copy hurt us") is a real finding. Then test the next-highest-impact element.
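
For the "read the result" step, most tools report significance for you, but it helps to know what they are computing. Below is a minimal sketch of a pooled two-proportion z-test, one common way to compare two conversion rates; your tool may use a different method. The counts are invented for illustration, and the function assumes both variants have at least one conversion.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> dict:
    """Relative lift of B over A and a two-sided p-value."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"relative_lift": rate_b / rate_a - 1, "z": z, "p_value": p_value}

# Hypothetical counts: 220 of 5,500 on the control, 275 of 5,500 on the variant.
result = two_proportion_test(220, 5500, 275, 5500)
print(f"Lift: {result['relative_lift']:.0%}, p-value: {result['p_value']:.3f}")
```

Run the same check on the guardrail metric. With rare events such as opportunities, expect the guardrail p-value to be far less conclusive than the primary one, which is why the guardrail works better as a veto on clear drops than as a second finish line.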

The compounding is the point. A page that improves 12%, then 9%, then 15% across three clean tests is roughly 40% better than where it started, and you can defend every step.
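
The lifts multiply rather than add, which is where the extra few points come from. A quick check of the arithmetic:

```python
from math import prod

lifts = [0.12, 0.09, 0.15]  # three sequential winning tests

compound = prod(1 + lift for lift in lifts) - 1
print(f"Sum of lifts:  {sum(lifts):.0%}")
print(f"Compound lift: {compound:.1%}")
```

Adding the three gives 36%. Compounding them gives 40.4%, because each win is applied to an already improved page.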

When you do not have enough traffic

Plenty of B2B pages get a few thousand visits a month, not tens of thousands. Strict A/B testing on those pages will rarely reach significance. That does not mean you stop optimizing. It means you change the method.

Test bigger swings, not tweaks. A complete redesign against the old page can show a difference at lower volume because the effect size is larger. Lean harder on qualitative signals: recordings, heatmaps, on-page polls, user testing with five to eight people in your buyer's role. Pool traffic by testing the same change across several similar pages at once. And accept directional reads on a longer horizon rather than demanding 95% significance on every move. A consistent lift over eight weeks, even at 85% confidence, is worth acting on when the alternative is never testing at all. Just label it as the educated bet it is, and keep an eye on the downstream numbers.

FAQ

How long should an A/B test run? Until you hit your pre-calculated sample size, and at least one full week, usually two to four. Full weeks matter because B2B traffic swings by day of week. Do not stop early just because one version is ahead.

How much traffic do I need? It depends on your baseline conversion rate and the size of the lift you want to catch. As a rough guide, detecting a 25% relative lift on a 4% page takes around 5,000 to 6,000 visitors per variant. Smaller lifts need dramatically more. Run a sample-size calculator with your own numbers before you launch.

What should I test first? The offer and the headline, almost always. They decide whether anyone engages at all, and they move the needle far more than button colors or minor copy. Save the small stuff for when the big elements are settled and you have traffic to spare.

Can I test more than two versions at once? Yes, that is an A/B/n test, and it is fine if you have the traffic. Every extra variant splits your audience further, so a three-way test needs at least 50% more total traffic than a two-way one to reach the same confidence, and more again if you correct for comparing several variants against the control. With thin B2B volume, stick to two.

Why did my winning test not improve real results? Usually one of three reasons: the test was underpowered and the "win" was noise, you optimized form fills while lead quality quietly dropped, or you peeked and stopped early. Always check a downstream guardrail metric, and make sure the page change connects to your CRM so you can see the leads it actually produced. Weak funnel conversion rates downstream will swallow a page-level win every time.

Is A/B testing worth it for low-traffic B2B pages? For tiny tweaks, no. For big swings, yes. With a few thousand visits a month, test redesigns and bold offer changes rather than micro-copy, and supplement with heatmaps, recordings, and user testing. Treat the results as directional and watch quality.

The short version

Done right, A/B testing turns landing page guesswork into a compounding asset. Done wrong, it produces a stack of inconclusive tests and false confidence. The difference is discipline, not budget.

Before your next test, check these:

One clear primary metric, plus a quality guardrail that ties back to your CRM.
Sample size and end date calculated and written down before launch.
One variable, full weeks of traffic, no peeking and no early stopping.
A written hypothesis with a predicted effect.
A winner that passes both the primary and the guardrail before it ships.

If your tests keep coming back inconclusive, or you are not sure your page even has the traffic to test the way you are testing, that is usually a sign the measurement or the targeting needs work first. We are happy to take a look: ask Lead The Way for a short audit of your landing page and tracking setup, and we will tell you the two or three tests actually worth running before you spend traffic on the rest.