Robots.txt
The robots exclusion protocol (REP), or robots.txt, is a text file webmasters create to instruct robots how to crawl and index pages on their website. If you want to exclude folders or files from search engine crawling, you can use robots.txt.
Block all web crawlers from all content:
User-agent: *
Disallow: /
Block a specific web crawler from a specific folder:
User-agent: Googlebot
Disallow: /no-google/
Block a specific web crawler from a specific web page:
User-agent: Googlebot
Disallow: /no-google/blocked-page.html
Allow a specific web crawler to visit a specific web page:
User-agent: *
Disallow: /no-bots/block-all-bots-except-rogerbot-page.html

User-agent: rogerbot
Allow: /no-bots/block-all-bots-except-rogerbot-page.html
Sitemap Parameter:
User-agent: *
Disallow:
Sitemap: http://www.example.com/none-standard-location/sitemap.xml
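One quick way to sanity-check rules like the ones above is Python's standard-library urllib.robotparser. This is only a sketch for verifying the examples from this article locally; real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this article, combined into one robots.txt.
rules = """
Sitemap: http://www.example.com/none-standard-location/sitemap.xml

User-agent: *
Disallow: /no-bots/block-all-bots-except-rogerbot-page.html

User-agent: rogerbot
Allow: /no-bots/block-all-bots-except-rogerbot-page.html

User-agent: Googlebot
Disallow: /no-google/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot is blocked from /no-google/ but may crawl other paths.
print(rp.can_fetch("Googlebot", "/no-google/blocked-page.html"))  # False
print(rp.can_fetch("Googlebot", "/public/page.html"))             # True

# rogerbot is allowed the page that every other bot is denied.
blocked = "/no-bots/block-all-bots-except-rogerbot-page.html"
print(rp.can_fetch("rogerbot", blocked))       # True
print(rp.can_fetch("SomeOtherBot", blocked))   # False

# The sitemap parameter is exposed too (Python 3.8+).
print(rp.site_maps())
```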
Robots Meta Tag
This works much like robots.txt, but the directives are placed directly in the page itself, usually in its head section. Some of the robots meta tag values are listed below.
Valid meta robots content values:
- NOINDEX – prevents the page from being included in the index.
- NOFOLLOW – prevents Googlebot from following any links on the page. (Note that this is different from the link-level NOFOLLOW attribute, which prevents Googlebot from following an individual link.)
- NOARCHIVE – prevents a cached copy of this page from being available in the search results.
- NOSNIPPET – prevents a description from appearing below the page in the search results, and also prevents caching of the page.
- NOODP – blocks the Open Directory Project description of the page from being used in the description that appears below the page in the search results.
- NONE – equivalent to “NOINDEX, NOFOLLOW”.
If the page contains multiple meta tags of the same type, search engines will aggregate the content values. For instance, they will interpret
<META NAME="ROBOTS" CONTENT="NOINDEX">
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
The same way as:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
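The aggregation described above can be sketched with Python's standard-library html.parser: collect every robots meta tag and merge the comma-separated values into one directive list. This is an illustrative sketch, not how any particular search engine is implemented.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content values of all <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so the
        # uppercase META tags in the article are matched as well.
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives += [d.strip().upper()
                                for d in content.split(",") if d.strip()]

html = """
<html><head>
<META NAME="ROBOTS" CONTENT="NOINDEX">
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
</head><body></body></html>
"""

parser = RobotsMetaParser()
parser.feed(html)
# The two tags aggregate to the same result as CONTENT="NOINDEX, NOFOLLOW".
print(parser.directives)  # ['NOINDEX', 'NOFOLLOW']
```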