Robots.txt
-
Hi all,
Happy New Year!
I want to block certain pages on our site as they are being flagged (according to my Moz Crawl Report) as duplicate content when in fact that isn't strictly true, it is more to do with the problems faced when using a CMS system...
Here are some examples of the pages I want to block and underneath will be what I believe to be the correct robots.txt entry...
Disallow: /forum/index.php?app=core&module=search
http://www.XYZ.com/forum/index.php?app=core&module=reports&rcom=gallery&imageId=980&ctyp=image
Disallow: /forum/index.php?app=core&module=reports
Disallow: /forum/index.php?app=forums&module=post
http://www.XYZ.com/forum/gallery/sizes/182-promenade/small/
http://www.XYZ.com/forum/gallery/sizes/182-promenade/large/
Disallow: /forum/gallery/sizes/
Any help \ advice would be much appreciated.
Many thanks
Andy
-
You may be better off just doing a pattern match if your CMS generates a lot of junk URLs. You could save yourself a lot of time and heartache with the following:
User-agent: *
Disallow: /*?That will block everything with with a ? in the string. So yeah, use with caution - as always.
If you're quite certain you want to block access to the image sizes subdirectory you may use:
User-agent: *
Disallow: /sizes*/
More on all of that fun from Google and SEO Book.
Robots.txt is almost as unforgiving as .htaccess, especially once you start pattern matching. Make sure to test everything thoroughly before you push to a live environment. For serious. You have been warned.
Google WMT and Bing WMT also provide parameter handling tools. Once you tell Bing and/or Google that you want their bots to ignore urls with certain parameter(s) you select. So if you wanted to handle it that way, it looks like ignoring the app= parameter should do the trick for most of your expressed concerns.
Good luck! explosions in the distance XD
-
Thanks DC1611, I will look into the other options but I have hundreds (and I mean hundreds) of examples that I would need to investigate...
Andy
-
You can quite easily check if these filters work - using Google Webmastertools (crawl section > robots.txt tester).
In the test-tool you can enter the criteria & check if they do block Googlebot from indexing these pages. I tried a few of the examples you gave & they seem to work.Apart from updating your robots.txt (which seems quite a radical solution) you could also consider implementing canonical url's for these duplicate url's.
Another alternative is to configure url parameters in Google Webmastertools (also in the crawl section) - where you can indicate which parameters need to be ignored.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Robots blocked by pages webmasters tools
a mistake made in software. How can I solve the problem quickly? help me. XTRjH
Intermediate & Advanced SEO | | mihoreis0 -
Huge increase in server errors and robots.txt
Hi Moz community! Wondering if someone can help? One of my clients (online fashion retailer) has been receiving huge increase in server errors (500's and 503's) over the last 6 weeks and it has got to the point where people cannot access the site because of server errors. The client has recently changed hosting companies to deal with this, and they have just told us they removed the DNS records once the name servers were changed, and they have now fixed this and are waiting for the name servers to propagate again. These errors also correlate with a huge decrease in pages blocked by robots.txt file, which makes me think someone has perhaps changed this and not told anyone... Anyone have any ideas here? It would be greatly appreciated! 🙂 I've been chasing this up with the dev agency and the hosting company for weeks, to no avail. Massive thanks in advance 🙂
Intermediate & Advanced SEO | | labelPR0 -
Robots.txt issue for international websites
In Google.co.uk, our US based (abcd.com) is showing: A description for this result is not available because of this site's robots.txt – learn more But UK website (uk.abcd.com) is working properly. We would like to disappear .com result totally, if possible. How to fix it? Thanks in advance.
Intermediate & Advanced SEO | | JinnatUlHasan0 -
Issue with Robots.txt file blocking meta description
Hi, Can you please tell me why the following error is showing up in the serps for a website that was just re-launched 7 days ago with new pages (301 redirects are built in)? A description for this result is not available because of this site's robots.txt – learn more. Once we noticed it yesterday, we made some changed to the file and removed the amount of items in the disallow list. Here is the current Robots.txt file: # XML Sitemap & Google News Feeds version 4.2 - http://status301.net/wordpress-plugins/xml-sitemap-feed/ Sitemap: http://www.website.com/sitemap.xml Sitemap: http://www.website.com/sitemap-news.xml User-agent: * Disallow: /wp-admin/ Disallow: /wp-includes/ Other notes... the site was developed in WordPress and uses that followign plugins: WooCommerce All-in-One SEO Pack Google Analytics for WordPress XML Sitemap Google News Feeds Currently, in the SERPs, it keeps jumping back and forth between showing the meta description for the www domain and showing the error message (above). Originally, WP Super Cache was installed and has since been deactivated, removed from WP-config.php and deleted permanently. One other thing to note, we noticed yesterday that there was an old xml sitemap still on file, which we have since removed and resubmitted a new one via WMT. Also, the old pages are still showing up in the SERPs. Could it just be that this will take time, to review the new sitemap and re-index the new site? If so, what kind of timeframes are you seeing these days for the new pages to show up in SERPs? Days, weeks? Thanks, Erin ```
Intermediate & Advanced SEO | | HiddenPeak0 -
Robots.txt is blocking Wordpress Pages from Googlebot?
I have a robots.txt file on my server, which I did not develop, it was done by the web designer at the company before me. Then there is a word press plugin that generates a robots.txt file. How Do I unblock all the wordpress pages from googlebot?
Intermediate & Advanced SEO | | ENSO0 -
Search Engine Blocked by robots.txt for Dynamic URLs
Today, I was checking crawl diagnostics for my website. I found warning for search engine blocked by robots.txt I have added following syntax to robots.txt file for all dynamic URLs. Disallow: /*?osCsid Disallow: /*?q= Disallow: /*?dir= Disallow: /*?p= Disallow: /*?limit= Disallow: /*review-form Dynamic URLs are as follow. http://www.vistastores.com/bar-stools?dir=desc&order=position http://www.vistastores.com/bathroom-lighting?p=2 and many more... So, Why should it shows me warning for this? Does it really matter or any other solution for these kind of dynamic URLs.
Intermediate & Advanced SEO | | CommercePundit0 -
Does It Really Matter to Restrict Dynamic URLs by Robots.txt?
Today, I was checking Google webmaster tools and found that, there are 117 dynamic URLs are restrict by Robots.txt. I have added following syntax in my Robots.txt You can get more idea by following excel sheet. #Dynamic URLs Disallow: /?osCsidDisallow: /?q= Disallow: /?dir=Disallow: /?p= Disallow: /*?limit= Disallow: /*review-form I have concern for following kind of pages. Shorting by specification: http://www.vistastores.com/table-lamps?dir=asc&order=name Iterms per page: http://www.vistastores.com/table-lamps?dir=asc&limit=60&order=name Numbering page of products: http://www.vistastores.com/table-lamps?p=2 Will it create resistance in organic performance of my category pages?
Intermediate & Advanced SEO | | CommercePundit0 -
Reciprocal Links and nofollow/noindex/robots.txt
Hypothetical Situations: You get a guest post on another blog and it offers a great link back to your website. You want to tell your readers about it, but linking the post will turn that link into a reciprocal link instead of a one way link, which presumably has more value. Should you nofollow your link to the guest post? My intuition here, and the answer that I expect, is that if it's good for users, the link belongs there, and as such there is no trouble with linking to the post. Is this the right way to think about it? Would grey hats agree? You're working for a small local business and you want to explore some reciprocal link opportunities with other companies in your niche using a "links" page you created on your domain. You decide to get sneaky and either noindex your links page, block the links page with robots.txt, or nofollow the links on the page. What is the best practice? My intuition here, and the answer that I expect, is that this would be a sneaky practice, and could lead to bad blood with the people you're exchanging links with. Would these tactics even be effective in turning a reciprocal link into a one-way link if you could overlook the potential immorality of the practice? Would grey hats agree?
Intermediate & Advanced SEO | | AnthonyMangia0