Robots.txt and Magento
-
Hi,
I am working on getting my robots.txt up and running, and I'm having lots of problems with the robots.txt my developers generated: www.plasticplace.com/robots.txt
I ran the robots.txt through a syntax-checking tool (http://www.sxw.org.uk/computing/robots/check.html). This is what the tool came back with: http://www.dcs.ed.ac.uk/cgi/sxw/parserobots.pl?site=plasticplace.com There seem to be many errors in the file.
Additionally, I looked at our robots.txt in Webmaster Tools (WMT), which said the crawl was postponed because the robots.txt is inaccessible. What does that mean?
A few questions:
1. Is there a need for all the lines that start with “#”? I don’t think they're necessary, but correct me if I'm wrong.
2. Furthermore, why are we blocking so many things on our website? The robots can’t get past anything that requires a password anyhow, but again, correct me if I'm wrong.
3. Is there a reason why it can't just look like this:
User-agent: *
Disallow: /onepagecheckout/
Disallow: /checkout/cart/
I do understand that Magento has certain folders that you don't want crawled, but is this necessary and why are there so many errors?
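On question 1: in robots.txt, anything after a "#" is a comment, and parsers skip it. Here is a minimal Python sketch of that, using the standard library's urllib.robotparser. The two file bodies are hypothetical (they are not the actual plasticplace.com file), and Python's parser is only a stand-in for how a search engine reads the file.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text):
    """Parse robots.txt text held in memory (nothing is fetched)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Hypothetical file bodies: the first only adds "#" comment lines.
WITH_COMMENTS = """\
# Crawlers Setup
User-agent: *
# Checkout pages
Disallow: /onepagecheckout/
Disallow: /checkout/cart/  # trailing comment
"""

WITHOUT_COMMENTS = """\
User-agent: *
Disallow: /onepagecheckout/
Disallow: /checkout/cart/
"""


def same_decisions(paths):
    """Return True if both files give the same allow/block answer for every path."""
    commented = parse_rules(WITH_COMMENTS)
    plain = parse_rules(WITHOUT_COMMENTS)
    return all(
        commented.can_fetch("*", path) == plain.can_fetch("*", path)
        for path in paths
    )


if __name__ == "__main__":
    # The example paths are assumptions chosen for illustration.
    print(same_decisions(["/", "/checkout/cart/", "/onepagecheckout/", "/some-product.html"]))
```

So the "#" lines are notes for human readers and can be removed without changing what is blocked.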
-
Yes, your short robots.txt idea would create a huge problem.
In your Magento admin, go to Catalog > URL Rewrite Management in the menu.
There you will see the Magento feature that creates all the "pretty URLs", shown as a table. If you take a value from the Target Path column and paste it after your site domain, for example domain.com/value_in_target_path,
you'll see that the page loads fine. You don't want Google to rank those pages under the "messy" URL, so that's why you need all that stuff in your robots.txt.
-
I am a bit confused. Are you saying that technically my Magento site has two different URLs that can both be indexed: one with a (messy) URL and another with a vanity URL? This would create major duplicate content issues! The robots.txt would not solve such a complex issue.
Am I missing something?
-
My developer said they custom-configured it to block the files they needed to block, according to Magento.
Do you think I can simply make it look like this:
User-agent: *
Disallow: /onepagecheckout/
Disallow: /checkout/cart/
and then disable it in WMT?
-
3. Is there a reason why it can't just look like this:
Yes, it would generate a lot of duplicate-content issues. For example, your robots.txt has the following line:
Disallow: /catalog/category/view/ -> That's the "real" category URL. You can access any category on Magento via /catalog/category/view/id or via the "pretty" URL. Because you disallow the "real" URL, only the pretty URL will be available to search engines. The same rule applies to many other parts of the robots.txt.
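To make that concrete, here is a rough Python sketch using the standard library's urllib.robotparser. It compares the short file from the question with the same file plus the category rule. The two example URLs (a category ID of 42 and /trash-bags.html) are made up for illustration, and Python's parser only approximates how a real crawler evaluates the rules.

```python
from urllib.robotparser import RobotFileParser

# The short file proposed in the question.
SHORT_RULES = """\
User-agent: *
Disallow: /onepagecheckout/
Disallow: /checkout/cart/
"""

# The same file plus the category rule quoted above.
LONGER_RULES = SHORT_RULES + "Disallow: /catalog/category/view/\n"

# Hypothetical URLs for one category: the internal route and its rewritten URL.
REAL_URL = "/catalog/category/view/id/42"
PRETTY_URL = "/trash-bags.html"


def allowed(rules_text, path, agent="*"):
    """Return True if the given robots.txt text lets `agent` crawl `path`."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for label, rules in (("short", SHORT_RULES), ("longer", LONGER_RULES)):
        print(label, "real:", allowed(rules, REAL_URL), "pretty:", allowed(rules, PRETTY_URL))
    # short real: True pretty: True
    # longer real: False pretty: True
```

With the short file, both URLs for the same category can be crawled. With the extra rule, only the pretty URL can.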
-
I assume this is a robots.txt that has been automatically created by Magento? Or has it been created by a developer?
I ran it through a tool and it showed 1 error and 10 warnings, so I would say you definitely need to do something about it.
The reason for all those disallows is to try to stop search engines from indexing those paths (whether they would even find them to index if the rules were not there is debatable).
What you could do is set up robots.txt as you have suggested and then stop the search engines from indexing the directories or pages you don't want in the appropriate webmaster tools.
I don't like displaying a lot of 'don't index' paths in robots.txt, as it pretty much tells any hacker or nasty spider where your weak points may be.