What Is robots.txt?
The robots.txt file is a simple text file that tells bots what they’re allowed (or not allowed) to access on a website.
It lives at the root of the domain — for example:
https://storephotos.ca/robots.txt
It’s not an authentication layer or a firewall. It’s a polite request, written in plain-ish English, that well-behaved bots are expected to follow.
Basic Structure
A robots.txt file is made up of one or more rulesets, each beginning with a User-agent line and followed by one or more Disallow or Allow lines:
User-agent: *
Disallow: /admin/
Allow: /admin/public/
That example says:
- All bots (*)
- Are disallowed from accessing anything under /admin/
- Except for anything under /admin/public/
The Three Most Important Directives
Here’s what you’ll see most often, with a combined example after the list:
- User-agent
This identifies the bot the rule applies to. You can name a specific bot (like Googlebot):
User-agent: Googlebot
Or you can use * for all bots:
User-agent: *
- Disallow
This blocks access to a path:
Disallow: /private/
A blank Disallow means nothing is disallowed:
Disallow:
- Allow
This explicitly permits access to a path — even if a broader Disallow rule would normally block it.
Allow: /private/help-page/
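To see how these fit together, here’s a rough sketch of a file with two rulesets; the Googlebot group and the /search/ path are purely illustrative, not a recommendation for any particular site:
User-agent: Googlebot
Disallow: /search/
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Well-behaved bots use the ruleset that names them most specifically, so Googlebot follows the first group here while every other bot falls back to the * group.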
How Precedence Works
Precedence trips up a lot of people, and it matters enough that I’ll keep coming back to it throughout these tutorials. Most bots follow the longest match wins rule: a more specific rule overrides a broader one.
User-agent: *
Disallow: /private/
Allow: /private/help-page/
This means:
- /private/secrets/ is blocked, because it only matches Disallow: /private/
- /private/help-page/ is allowed, because Allow: /private/help-page/ is a longer (more specific) match than Disallow: /private/
But not all bots implement this correctly. That’s why it’s good to be explicit and conservative when writing robots.txt.
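One way to stay on the conservative side, shown here as a rough sketch with made-up /private/secrets/ and /private/billing/ paths, is to enumerate the specific paths you want blocked instead of disallowing a whole tree and carving out exceptions with Allow:
User-agent: *
Disallow: /private/secrets/
Disallow: /private/billing/
A bot that ignores Allow or gets precedence wrong still ends up doing what you intended, because nothing in the file depends on one rule overriding another.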
What About Wildcards?
Officially, the standard only supports two wildcard-ish operators:
- * matches any sequence of characters
- $ marks the end of a URL
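As a quick sketch of the two operators working together (the .pdf extension is just an example), crawlers that support them would read this as “block any URL that ends in .pdf”:
User-agent: *
Disallow: /*.pdf$
Without the $, the rule would also match URLs that merely contain .pdf in the middle of the path, such as /report.pdf.html.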