Question about Syntax in Robots.txt
-
So if I want to block any URL that contains a particular parameter from being indexed, what is the best way to put this in the robots.txt file?
Currently I have:
Disallow: /attachment_id
where "attachment_id" is the parameter. The problem is I still see these URLs indexed, and the rule has been in the robots.txt for over a month now. I am wondering if I should just do
Disallow: attachment_id or Disallow: attachment_id= instead, but I figured I would ask you guys first.
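For anyone finding this later, a sketch of the pattern-matching syntax Google documents for robots.txt (note that the * wildcard is a Google extension, not part of the original robots.txt standard, so other crawlers may ignore it — and the exact parameter name here is just this thread's example):

```
User-agent: *
# Block any URL containing the attachment_id parameter anywhere
Disallow: /*attachment_id=
# Or, more narrowly, only when it appears in a query string
Disallow: /*?attachment_id=
Disallow: /*&attachment_id=
```

A plain "Disallow: /attachment_id" only matches URLs that *start with* /attachment_id, so it would never fire on something like /some-post?attachment_id=42 — which may be why the rule appeared to do nothing.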
Thanks!
-
That's excellent, Chris.
Use the Remove Page function as well - it might help speed things up for you.
-Andy
-
I don't know how, but I completely forgot I could just pop those URLs into GWT and see whether they were blocked, and sure enough, Google says they are. I guess this is just a matter of waiting... Thanks much!
-
I have previously looked at both of those documents, and the issue remains that they don't exactly address how best to block parameters. I could do this through GWT, but I'm also curious about the correct and preferred syntax for robots.txt itself. I guess I could just look at sites like Amazon or other big sites to see what the common practices are. Thanks though!
-
"Problem is I still see these URLs indexed and this has been in the robots now for over a month. I am wondering if I should just do..."
It can take Google some time to remove pages from the index.
The best way to test whether this has worked is to hop into Webmaster Tools and use the robots.txt Tester. If it has blocked the required pages, then you know it's just a case of waiting. You can also remove pages from within Webmaster Tools, although this isn't immediate either.
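You can also sanity-check a rule offline before uploading it. A minimal sketch using Python's standard-library robots.txt parser (the URLs here are made up, and note this parser only does the original prefix-style matching — it does not understand Google's * wildcard extension, so use Google's own tester for wildcard rules):

```python
import urllib.robotparser

# The rule as originally written in this thread
robots_txt = """\
User-agent: *
Disallow: /attachment_id
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())
rp.modified()  # mark the file as "read" so can_fetch() gives real answers

# The URL *path* starts with /attachment_id, so the prefix rule matches
print(rp.can_fetch("*", "https://example.com/attachment_id/photo"))  # False (blocked)

# attachment_id only appears in the query string, so the prefix rule
# never fires -- likely why these URLs kept getting crawled
print(rp.can_fetch("*", "https://example.com/some-post?attachment_id=42"))  # True (allowed)
```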
-Andy
-
Hi there
Take a look at Google's resource on robots.txt, as well as Moz's. You can get all the information you need there. You can also let Google know which URLs to exclude from its crawls via Search Console.
Hope this helps! Good luck!
-
I'm not a robots.txt expert by a long shot, but I found this, which is a little dated, but which explained it to me in terms I could understand.
https://sanzon.wordpress.com/2008/04/29/advanced-usage-of-robotstxt-w-querystrings/
There is also a feature in Google Webmaster Tools called URL Parameters that lets you block URLs with set parameters for all sorts of reasons, e.g. to avoid duplicate content. I haven't used it myself, but it may be worth looking into.