Robots.txt: Configure It Without Blocking Indexing

A single line in a text file can wipe a site out of Google. A Disallow: / rule left over from a staging environment gets pushed live on launch day, and three weeks later traffic flatlines while everyone blames the new design. It happens more often than agencies admit, usually to companies that have just rebuilt their site.

Robots.txt is one of the smallest files on your domain and one of the easiest to get wrong. The confusing part is that it does not do what most people think it does. It controls crawling. It does not reliably control indexing, and treating it as an off switch for search visibility is where the damage starts.

This guide covers what the file actually does, the rules worth knowing, the mistakes that quietly cost rankings, and how to check your setup before it costs you anything.

What robots.txt controls (and what it does not)

Robots.txt lives at the root of your domain, at https://yoursite.com/robots.txt. Crawlers like Googlebot request it before they crawl anything else. It tells well-behaved bots which paths they may fetch.

Here is the trap. Blocking a URL in robots.txt stops Google from crawling it, but Google can still index that URL if other pages link to it. You end up with a search result that shows the URL and maybe some anchor text, with a description that reads "No information is available for this page." The page is indexed, ranks poorly, and you cannot see why, because you told the crawler to stay out.

So if your goal is to keep a page out of the index entirely, robots.txt is the wrong tool. You want a noindex directive, which lives either in the page's HTML (a robots meta tag) or in an HTTP response header (X-Robots-Tag). Either way, the crawler has to fetch the page to read it. Block the page in robots.txt and Googlebot never sees the noindex. The two instructions work against each other.

Crawl control and index control are separate jobs. Keep them separate in your head and most robots.txt mistakes disappear.
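
If you want to check a page for this conflict yourself, here is a minimal Python sketch. It assumes you already have the page's HTML and response headers in hand and already know whether robots.txt blocks the URL. The names has_noindex and diagnose are made up for illustration, and the noindex test is a loose substring match, not a full parser for robots directives.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the content of <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            self.directives.append(attr.get("content", "").lower())


def has_noindex(html, headers):
    """True if the page asks to stay out of the index via meta tag or X-Robots-Tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value.lower()
    in_meta = any("noindex" in directive for directive in parser.directives)
    return in_meta or "noindex" in header_value


def diagnose(blocked_by_robots, noindex):
    """Describe how a crawl block and a noindex interact (illustrative wording)."""
    if blocked_by_robots and noindex:
        return "conflict: the crawler cannot fetch the page, so it never sees the noindex"
    if blocked_by_robots:
        return "not crawled, but the URL can still be indexed from links"
    if noindex:
        return "crawled and kept out of the index"
    return "crawled and eligible for indexing"
```

The first branch of diagnose is the one to look for in an audit: a URL that is both disallowed and noindexed is sending two instructions that cancel each other out.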

The syntax you actually need

The file is plain text. Each block starts with a User-agent line naming the bot, followed by Disallow and Allow rules. A minimal, sane file for most B2B sites looks like this:

User-agent: *
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /*?sessionid=
Allow: /wp-admin/admin-ajax.php

Sitemap: https://yoursite.com/sitemap.xml

A few things worth knowing about the rules:

  • User-agent: * applies to every crawler. You can target a specific bot with its name, for example User-agent: Googlebot. Google reads the most specific matching group, not all of them combined, so a Googlebot block overrides the * block for Googlebot.
  • Disallow: with an empty value means "allow everything." A blank Disallow is the opposite of Disallow: /. People mix these up under pressure.
  • Paths are matched as prefixes. Disallow: /blog blocks /blog, /blog/, and /blogging-tips too. Add the trailing slash if you mean the folder only.
  • * is a wildcard for any sequence of characters. $ anchors the end of a URL. Disallow: /*.pdf$ blocks PDFs.
  • The Sitemap directive points crawlers at your XML sitemap. It is independent of the User-agent blocks and can sit anywhere in the file. Use the full absolute URL.

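For Google, the order of rules inside a group does not matter; specificity does. When an Allow and a Disallow both match a URL, Google follows the more specific (longer) rule, and if the two are the same length, the less restrictive Allow wins. That is how you carve out admin-ajax.php from a blocked /wp-admin/ folder.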
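
The matching rules above are easier to trust once you have seen them run. Below is a rough Python sketch of longest-match evaluation for a single User-agent group, with * and $ handled as described. It is an approximation written for this article, not Google's parser, and the function names are hypothetical. Note that Python's built-in urllib.robotparser applies rules in file order rather than by length, so it can disagree with Google on files like the one above.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the end
    of the URL. Everything else is treated as a literal prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Decide whether a path may be crawled under one group's rules.

    rules is a list of ("allow" | "disallow", pattern) tuples.
    The longest matching pattern wins, allow wins a tie, and a path
    that matches nothing is allowed.
    """
    best_len, allowed = -1, True
    for kind, pattern in rules:
        if not pattern:  # an empty Disallow: blocks nothing
            continue
        if pattern_to_regex(pattern).match(path):
            is_allow = kind == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed


RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
    ("disallow", "/blog"),
    ("disallow", "/*.pdf$"),
]

for path in ["/wp-admin/admin-ajax.php", "/wp-admin/options.php",
             "/blogging-tips", "/files/report.pdf", "/files/report.pdf?download=1"]:
    print(path, "->", "allowed" if is_allowed(RULES, path) else "blocked")
```

Running it shows the prefix trap from the list above: /blogging-tips is caught by Disallow: /blog, while the PDF with a query string slips past the $ anchor.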

The mistakes that quietly cost rankings

Most robots.txt damage is not dramatic. It is a slow leak. Here are the ones we see most.

Blocking CSS and JavaScript

Years ago people blocked /wp-includes/, /assets/, or /static/ to "save crawl budget." Google renders pages now. If it cannot fetch your stylesheet and scripts, it sees a broken layout and may misjudge mobile-friendliness and content. Let Google crawl your CSS and JS. There is almost no good reason to block them on a normal site.

Using Disallow to hide private content

If a page is sensitive, robots.txt is not security. The file is public, anyone can read it, and listing Disallow: /secret-client-portal/ is a signpost pointing straight at the thing you wanted hidden. Use authentication for private content. Use noindex for pages you want out of search but reachable by people.

The staging file going live

This is the launch-day disaster. Staging sites carry a blanket Disallow: / to keep them out of search while they are being built. When the new site ships, the staging robots.txt sometimes ships with it. Every page blocked. Indexing stalls or reverses. Always diff the production robots.txt against staging before and after a launch, and check it again the day after go-live.
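
A launch checklist item like this is easy to script. The sketch below is one hypothetical way to do it in Python: one function flags a blanket Disallow: / in the User-agent: * group, the other produces a plain diff of the two files. It assumes you have already fetched both files as text. It does not weigh Allow rules against the blanket block, so treat a hit as a prompt to look, not a verdict.

```python
import difflib


def blocks_everything(robots_txt):
    """True if a group that names 'User-agent: *' contains a bare 'Disallow: /'."""
    agents, in_rules = [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a rule line closed the previous group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


def robots_diff(staging, production):
    """Unified diff of two robots.txt files, returned as a list of lines."""
    return list(difflib.unified_diff(
        staging.splitlines(), production.splitlines(),
        fromfile="staging/robots.txt", tofile="production/robots.txt",
        lineterm=""))
```

Wired into a deploy pipeline, a True from blocks_everything on the production file is a reasonable reason to fail the build.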

Blocking URLs you already want deindexed

You noindexed a thin tag archive, then added it to robots.txt for good measure. Now Googlebot cannot crawl the page to see the noindex, so the page lingers in the index. Let Google crawl pages you want removed until they actually drop, then block them if you still want to save crawl activity. Sequence matters.

Crawl budget worry on a small site

Crawl budget is a real concern for sites with hundreds of thousands of URLs. For a typical B2B site with a few hundred pages, it is not your bottleneck. Spending an afternoon trimming robots.txt to save crawl budget on a 200-page site is effort better spent on content or internal linking.

Common goals and the right tool (illustrative)

Your goal                            | Right tool          | Wrong tool
-------------------------------------|---------------------|-----------------------------
Keep a page out of Google            | noindex meta tag    | robots.txt Disallow
Stop crawling of faceted/filter URLs | robots.txt Disallow | noindex (still gets crawled)
Protect private data                 | Password / auth     | robots.txt Disallow
Consolidate duplicate URLs           | Canonical / 301     | robots.txt Disallow

When blocking a path is the right call

Robots.txt earns its keep on URLs you genuinely do not want crawled and that have no business being indexed anyway. Good candidates:

  • Internal search results pages, for example /search?q=, which create endless low-value URLs.
  • Faceted navigation and filter parameters that multiply one product page into thousands of near-duplicates.
  • Cart, checkout, and account pages on a commerce or portal site.
  • Add-to-cart action URLs and other endpoints that should never be a search result.

For these, blocking the crawl saves Googlebot from wandering through a maze of parameter combinations. None of these pages should rank, few of them attract the external links that would get a blocked URL indexed anyway, and keeping crawlers focused on your real pages is a reasonable use of the file.

If you are dealing with duplicate content rather than crawl waste, a canonical tag or a 301 redirect usually serves you better than a Disallow, because those consolidate signals instead of just hiding pages.

How to test before you trust it

Never push a robots.txt change and walk away. Two checks take five minutes.

First, open https://yoursite.com/robots.txt in a browser. Read it. Confirm it is the file you meant to publish and not the staging version. Obvious, and exactly the step people skip.

Second, use the URL Inspection tool in Google Search Console. Paste a URL that should be crawlable and confirm Google reports it as allowed. Then test a URL you meant to block and confirm it reports as blocked by robots.txt. Setting up Search Console properly is the foundation for this and most other technical checks, so if you have not done it, that comes first.

Search Console also flags a "Blocked by robots.txt" status under Page indexing reports. Watch that report after any change. A spike there a few days after a deploy is your early warning that something in the file is too aggressive.

For a fuller picture of how robots.txt fits with rendering, indexing, and site health, it sits inside the wider set of technical SEO checks worth running on any site.

A practical default for a B2B site

You do not need a clever robots.txt. You need a boring, correct one. For most B2B marketing sites running on WordPress or a similar CMS, this is a safe starting point:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /?s=
Disallow: /search/

Sitemap: https://yoursite.com/sitemap.xml

That blocks the admin area and on-site search, lets the AJAX endpoint through, and hands Google your sitemap. Everything else stays crawlable. Adjust the paths to your platform, replace the sitemap URL with your real one, and resist the urge to add more rules than you can explain.

Page speed and crawl efficiency interact, so if crawlers are struggling on your site, the fix is often faster pages rather than more Disallow rules. That is a separate piece of work covered under page speed for SEO, and it usually pays off more than tinkering with this file.

FAQ

Does robots.txt prevent a page from being indexed?

No, and this catches people out. It blocks crawling. A blocked page can still appear in search results if other pages link to it, just without a useful description. To keep a page out of the index, use a noindex meta tag or HTTP header and let Google crawl the page to read it.

Where does the robots.txt file go?

At the root of your domain: https://yoursite.com/robots.txt. It must be at the root. A file at https://yoursite.com/folder/robots.txt is ignored. Each subdomain needs its own file.

Should I block AI crawlers in robots.txt?

You can. User-agent tokens like GPTBot, CCBot, and Google-Extended are honored by their operators, so adding a User-agent block for them stops the compliant ones from collecting your content for AI training. (Google-Extended is a control token that Google's existing crawlers obey, not a separate bot.) It is a business decision, not an SEO one, and blocking them has no effect on your Google Search rankings. Weigh visibility in AI answers against control over your content.
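
To see how a dedicated block for AI crawlers sits alongside your general rules, here is a small Python sketch of group selection. It assumes the simplified behavior described earlier: a crawler that finds a group naming it uses only that group and ignores the * group. The helper names and the sample file are illustrative, and real crawlers match tokens more loosely than the exact, case-insensitive comparison used here.

```python
def parse_groups(robots_txt):
    """Return a list of (user_agents, rules) groups parsed from robots.txt text."""
    groups, agents, rules = [], [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # rules were seen, so this line starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def rules_for(robots_txt, crawler):
    """Pick the rules for one crawler: a group naming it beats the '*' group."""
    crawler = crawler.lower()
    specific, fallback, found = [], [], False
    for agents, rules in parse_groups(robots_txt):
        if crawler in agents:
            found = True
            specific.extend(rules)
        elif "*" in agents:
            fallback.extend(rules)
    return specific if found else fallback


EXAMPLE = """\
User-agent: *
Disallow: /wp-admin/

User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
"""

print(rules_for(EXAMPLE, "GPTBot"))
print(rules_for(EXAMPLE, "Googlebot"))
```

In the sample file, the three AI tokens share one group with a blanket Disallow: /, while Googlebot falls through to the * group and keeps crawling everything except the admin area.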

What happens if I have no robots.txt file at all?

Nothing bad. With no file, crawlers assume everything is allowed and crawl your site normally. A missing robots.txt returns a 404, which Google treats as "crawl everything." You only need the file when you want to block something or point to a sitemap.

Can I have more than one sitemap in robots.txt?

Yes. Add a separate Sitemap: line for each one, or point to a sitemap index file that lists them. Use full absolute URLs on every line. This is handy when you split sitemaps by content type, such as one for posts and one for product pages.
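
If you maintain several Sitemap: lines, a quick lint can catch a relative path before a crawler does. This is a minimal Python sketch under the assumption that you have the file's text available; the function names and the sitemap filenames in any test data are hypothetical.

```python
from urllib.parse import urlsplit


def sitemap_urls(robots_txt):
    """Return every Sitemap: value in the file, in the order it appears."""
    urls = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def relative_sitemaps(robots_txt):
    """Return Sitemap values that are not full absolute http(s) URLs."""
    bad = []
    for url in sitemap_urls(robots_txt):
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            bad.append(url)
    return bad
```

An empty list from relative_sitemaps means every line carries a scheme and a host. It says nothing about whether the sitemap itself exists or parses.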

How long until Google notices a robots.txt change?

Google refetches robots.txt regularly and generally caches it for up to 24 hours, so a fix usually takes effect within a day. The downstream effect on indexing (pages dropping out or coming back) can take days to weeks as Google recrawls the affected URLs.

The short version

Robots.txt is low effort to maintain and high consequence to break. Keep it simple, and keep these in mind:

  • Crawling and indexing are different jobs. Use noindex to control the index, robots.txt to control crawling.
  • Never block CSS and JavaScript on a normal site.
  • Do not use it for private data or as a deindex tool.
  • Diff staging against production around every launch.
  • Test changes in Search Console and watch the blocked-pages report afterward.

If your traffic dropped after a redesign or migration and nobody can explain it, the robots.txt file is one of the first places worth looking, along with redirects and canonical tags. We help B2B teams untangle exactly these issues, so if you want a second set of eyes, ask us for a 20-minute technical review of your site and we will tell you what we find.