robots.txt vs. noindex vs. canonical: Controlling What Google Indexes

Last updated: 2026-06-13

These three tools solve three different problems, and mixing them up is one of the most common technical-SEO mistakes. robots.txt controls crawling, the noindex directive controls indexing, and the canonical tag consolidates duplicates. Use the wrong one (most often, blocking a page in robots.txt to get it out of Google) and you can achieve the opposite of what you intended.

What each one actually does

robots.txt — asks crawlers not to fetch certain paths. It controls crawling, not indexing, and it is advisory, not a security boundary.
noindex (a meta robots tag or X-Robots-Tag header) — tells search engines to keep a page out of the index. This is the correct tool for removing a page from search.
canonical — a hint that names the preferred URL among duplicates or near-duplicates, so ranking signals consolidate onto one version.
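To make the three mechanisms concrete, here is a minimal Python sketch that emits each one in its usual form. The helper names, the /search/ path, and the example.com URLs are illustrative assumptions, not part of any particular tool.

```python
from html import escape


def robots_txt_disallow(paths, user_agent="*"):
    """Build one robots.txt group asking crawlers not to fetch the given paths."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in paths]
    return "\n".join(lines) + "\n"


def noindex_meta():
    """HTML form of noindex; it belongs inside the page's <head>."""
    return '<meta name="robots" content="noindex">'


def noindex_header():
    """HTTP header form of noindex; useful for PDFs and other non-HTML files."""
    return ("X-Robots-Tag", "noindex")


def canonical_link(preferred_url):
    """Canonical hint naming the preferred URL; it also belongs in <head>."""
    return f'<link rel="canonical" href="{escape(preferred_url, quote=True)}">'


print(robots_txt_disallow(["/search/"]))
print(noindex_meta())
print(canonical_link("https://example.com/shoes/"))
```

The first helper only affects what gets fetched, the second and third only affect what gets indexed, and the last one only expresses a preference among duplicates.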

The trap: blocking a page does not remove it

Here is the counterintuitive part. If you Disallow a page in robots.txt, Google cannot crawl it — which means Google cannot see a noindex tag on it either. The page can still appear in results as a bare URL with no description, surfaced from links pointing to it. To actually remove a page, do the opposite of blocking it:

ALLOW the page to be crawled (do not Disallow it in robots.txt).
Add a noindex directive via a meta robots tag or the X-Robots-Tag header.
Wait for Google to recrawl and drop it from the index.
Only after the page has been deindexed should you block it in robots.txt, and only if you also want to save crawl budget.
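The steps above can be checked mechanically before you wait on a recrawl. The sketch below uses Python's standard urllib.robotparser and html.parser to classify a page by whether a crawler could fetch it and whether it would then see a noindex. The function name, the status labels, and the example URLs are assumptions made for illustration. Python's parser also matches rules more simply than Google does (no wildcard support), and this sketch ignores user-agent prefixes inside X-Robots-Tag values, so treat it as a rough pre-check.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def removal_status(robots_txt, url, html, headers=None, user_agent="Googlebot"):
    """Classify a page by whether a crawler can fetch it and see a noindex."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    crawlable = rules.can_fetch(user_agent, url)

    meta = RobotsMetaParser()
    meta.feed(html)
    directives = set(meta.directives)
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            directives.update(t.strip().lower() for t in value.split(","))

    # "none" is shorthand for "noindex, nofollow".
    has_noindex = "noindex" in directives or "none" in directives
    if crawlable and has_noindex:
        return "will-drop-on-recrawl"         # the setup the steps describe
    if has_noindex:
        return "noindex-unreachable"          # the trap: Disallow hides the tag
    if crawlable:
        return "indexable"
    return "blocked-but-still-indexable"      # can surface as a bare URL


blocked = "User-agent: *\nDisallow: /old/\n"
page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(removal_status(blocked, "https://example.com/old/page.html", page))
```

A result of "noindex-unreachable" is the situation described in this section: remove the Disallow rule first, then let the page be recrawled.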


When to use which

Want a page OUT of search results? Use noindex (set the robots directive with the meta tag generator), and keep the page crawlable.
Want to stop crawlers wasting crawl budget on junk paths (faceted filters, internal search, admin)? Use robots.txt Disallow — build it with the robots.txt generator.
Have several URLs serving the same or similar content? Use a canonical tag pointing at the version you want ranked, to consolidate signals.
Serving the same content in multiple languages or regions? That is a job for hreflang, not canonical — see the hreflang tag generator.
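The decision list above is small enough to express as a lookup table. This is only a sketch; the Goal enum and the wording of each recommendation are assumptions chosen to mirror the four cases in the list.

```python
from enum import Enum


class Goal(Enum):
    REMOVE_FROM_RESULTS = "remove from results"
    SAVE_CRAWL_BUDGET = "save crawl budget"
    CONSOLIDATE_DUPLICATES = "consolidate duplicates"
    LANGUAGE_OR_REGION_VARIANTS = "language or region variants"


# One recommended mechanism per goal, following the list above.
RECOMMENDED_TOOL = {
    Goal.REMOVE_FROM_RESULTS: "noindex, with the page left crawlable",
    Goal.SAVE_CRAWL_BUDGET: "robots.txt Disallow",
    Goal.CONSOLIDATE_DUPLICATES: "canonical tag on each duplicate",
    Goal.LANGUAGE_OR_REGION_VARIANTS: "hreflang annotations",
}


def recommend(goal):
    """Return the mechanism suggested for a given goal."""
    return RECOMMENDED_TOOL[goal]


print(recommend(Goal.REMOVE_FROM_RESULTS))
```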

Canonical is a hint, not a command

Unlike noindex, a canonical tag is a suggestion. Google usually honors it but can choose a different canonical if your signals (internal links, sitemaps, redirects) point elsewhere. Keep those signals consistent with your declared canonical so they reinforce rather than contradict it.
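One way to keep those signals consistent is to audit them against the declared canonical. The sketch below assumes you have already collected the sitemap URLs, internal link targets, and redirects for a site. The function name and the conflict messages are hypothetical, and it says nothing about how Google actually weighs each signal.

```python
def canonical_conflicts(declared, duplicates, sitemap_urls,
                        internal_link_targets, redirects=None):
    """List signals that contradict the declared canonical URL.

    declared: the URL named in rel="canonical"
    duplicates: URLs that serve the same content as `declared`
    redirects: optional mapping of source URL to redirect target
    """
    duplicates = set(duplicates) - {declared}
    sitemap_urls = set(sitemap_urls)
    internal_link_targets = set(internal_link_targets)

    conflicts = []
    if declared not in sitemap_urls:
        conflicts.append(f"sitemap omits canonical {declared}")
    for url in sorted(duplicates & sitemap_urls):
        conflicts.append(f"sitemap lists duplicate {url}")
    for url in sorted(duplicates & internal_link_targets):
        conflicts.append(f"internal links point at duplicate {url}")
    target = (redirects or {}).get(declared)
    if target and target != declared:
        conflicts.append(f"canonical {declared} redirects to {target}")
    return conflicts


canonical = "https://example.com/shoes/"
dupes = {"https://example.com/shoes/?sort=price"}
for problem in canonical_conflicts(canonical, dupes,
                                   sitemap_urls=dupes,
                                   internal_link_targets={canonical}):
    print(problem)
```

An empty list means the sitemap, internal links, and redirects all reinforce the declared canonical; each message marks a signal pointing elsewhere.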

What robots.txt is also good for

Beyond crawl control, robots.txt is where you point crawlers to your sitemap and, increasingly, where you state whether AI training crawlers such as GPTBot and CCBot may fetch your site. The robots.txt generator includes a one-click preset for blocking them, though, like every robots.txt rule, a block is a request that well-behaved crawlers honor rather than an enforcement mechanism. Once a page is indexed, confirm how it presents in results with the SERP snippet preview.
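As a rough illustration of such a file, the sketch below builds a robots.txt that disallows GPTBot and CCBot, keeps other crawlers out of a few junk paths, and declares a sitemap, then reads it back with Python's urllib.robotparser. The build_robots_txt helper, the /search/ and /admin/ paths, and the example.com URLs are assumptions. site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

AI_TRAINING_BOTS = ("GPTBot", "CCBot")


def build_robots_txt(sitemap_url, blocked_bots=AI_TRAINING_BOTS, junk_paths=()):
    """Build a robots.txt with per-bot blocks, a default group, and a sitemap."""
    lines = []
    for bot in blocked_bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    lines.append("User-agent: *")
    if junk_paths:
        lines += [f"Disallow: {path}" for path in junk_paths]
    else:
        lines.append("Disallow:")  # an empty Disallow permits everything
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(
    "https://example.com/sitemap.xml",
    junk_paths=("/search/", "/admin/"),
)
print(robots_txt)

# Read the file back the way a compliant crawler would.
rules = RobotFileParser()
rules.parse(robots_txt.splitlines())
print(rules.can_fetch("GPTBot", "https://example.com/"))           # False
print(rules.can_fetch("Googlebot", "https://example.com/"))        # True
print(rules.can_fetch("Googlebot", "https://example.com/admin/"))  # False
print(rules.site_maps())
```

Each blocked bot gets its own group because a crawler follows the most specific group that names it and ignores the rest, so the junk-path rules in the default group do not apply to GPTBot or CCBot.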

Frequently asked questions

Why is a page I blocked in robots.txt still showing in Google?

Blocking in robots.txt stops crawling, not indexing. Google can still index the URL from external links — and because it cannot crawl the page, it cannot see a noindex tag there. Allow crawling and add noindex to remove it.

What is the difference between robots.txt and noindex?

robots.txt controls whether crawlers fetch a page; noindex controls whether it appears in the index. To keep something out of search results, use noindex on a crawlable page.

Does a canonical tag remove duplicate pages from Google?

No. Canonical consolidates ranking signals onto a preferred URL but is only a hint and does not deindex anything. Use noindex if you need a page gone from results.

Can I use robots.txt to hide private content?

No. robots.txt is public and advisory; disallowed paths are visible to anyone who opens the file and can still be indexed via links. Use authentication for content that must not be seen; noindex only keeps a page out of search results and does not restrict access to it.