Timely use of robots.txt and meta noindex
-
Hi,
I have been checking every possible resource on content removal, but I am still unsure how to remove content that is already indexed.
When I use robots.txt alone, the URLs remain in the index, although no crawl budget is wasted on them. Still, having e.g. 100,000+ completely identical login pages among the omitted results is probably not a good sign.
When I use meta noindex alone, I keep my index clean, but I also keep Googlebot busy crawling these no-value pages.
When I use robots.txt and meta noindex together for existing content, I ask Google to ignore my content, but at the same time I prevent it from crawling the pages and ever seeing the noindex tag.
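To make the conflict concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt rules, the `/login` path and the `example.com` URLs are assumptions for illustration only, not the real configuration of the site. It shows that a well-behaved crawler is not allowed to fetch a disallowed URL at all, so it never gets to read a noindex tag on that page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the /login path is assumed for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
"""


def crawler_may_fetch(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given user agent is allowed to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# A disallowed page is never fetched, so any noindex tag on it stays unseen.
login_allowed = crawler_may_fetch(ROBOTS_TXT, "http://www.example.com/login?r=%2Fsome-page")
# A page outside the disallowed path can be fetched and its meta tags read.
about_allowed = crawler_may_fetch(ROBOTS_TXT, "http://www.example.com/about")
```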
Robots.txt and URL removal together are still not a good solution, as I have failed to remove directories this way. It seems that only exact URLs can be removed like this.
I need a clear solution that solves both issues (indexing and crawling).
What I am trying to do now is the following:
I remove these directories from robots.txt (one at a time, to test the theory) and at the same time add the meta noindex tag to all pages within the directory. The number of indexed pages should start decreasing (while crawling of useless pages increases), and once the number of indexed pages is low or zero, I would put the directory back into robots.txt and keep the noindex on all pages within it.
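The plan above can be written down as a small decision helper. This is only a sketch of my own reasoning: the function name `next_step`, its parameters and the `threshold` value are hypothetical, and the indexed count would have to come from whatever report you trust (for example a manual site: query).

```python
def next_step(blocked_in_robots: bool, has_noindex: bool,
              indexed_count: int, threshold: int = 0) -> str:
    """Suggest the next action for one directory in the phased clean-up."""
    if not has_noindex:
        # The noindex tag has to be in place before anything else helps.
        return "add meta noindex to every page in the directory"
    if indexed_count > threshold:
        if blocked_in_robots:
            # While blocked, the crawler cannot see the noindex tag.
            return "remove the directory from robots.txt so the noindex can be crawled"
        return "wait for the indexed count to drop"
    if not blocked_in_robots:
        # Index is clean; blocking now only saves crawling.
        return "put the directory back into robots.txt"
    return "done"
```

For a directory that is still blocked and has 100,000 indexed pages with noindex already added, `next_step(True, True, 100000)` suggests unblocking it first.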
Can this work the way I imagine, or do you know a better way of doing it?
Thank you in advance for all your help.
-
Hi Deb,
Thank you for your reply.
I never thought that Google would fetch robots.txt this rarely. I actually read somewhere, and it makes complete sense, that before they start crawling they check the crawl against robots.txt. It is only one file, but basically one of the most important ones.
This came as a shock to me; thank you for drawing my attention to it. Anyway, I have now submitted the file through 'Fetch as Google'.
Regarding your URL suggestion, I do not want them to return 404, at least not all of them. For example, I still want to use the login pages, and the reason we have individual URLs is that we would like our visitors to return to the page they left before we asked them to log in. So status 200 is OK, because we have these pages for customers, but the very same pages are totally useless for Google to crawl or index.
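To show why every login URL is unique, here is a small sketch that decodes the `r` parameter of one of the login URLs quoted elsewhere in this thread. It uses only Python's standard `urllib.parse`; the function name `return_path` is hypothetical, and I am assuming the parameter simply carries the URL-encoded path the visitor came from.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Login URL quoted in this thread.
LOGIN_URL = (
    "http://www.kozelben.hu/login?r=%2Fceg%2Fdrink-island-bufe-whisky-bar-"
    "alkotas-utca-17-1123-budapest-126126%23addComment"
)


def return_path(login_url: str) -> Optional[str]:
    """Extract the decoded return path from the login URL's r parameter."""
    query = parse_qs(urlsplit(login_url).query)
    values = query.get("r")
    return values[0] if values else None


# Decoded value of r: the page (and anchor) the visitor is sent back to.
target = return_path(LOGIN_URL)
```

Every distinct page a visitor can leave produces a different `r` value, which is how one login form turns into a very large number of crawlable URLs.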
I hope this clarifies.
-
It seems that Google has not cached the latest robots.txt file so far. This is what it has –
So you need to use Fetch as Googlebot and submit this robots.txt file to the index to fix this issue as soon as possible.
What concerns me is that defunct URLs like http://www.kozelben.hu/login?r=%2Fceg%2Fdrink-island-bufe-whisky-bar-alkotas-utca-17-1123-budapest-126126%23addComment or http://www.kozelben.hu/supplier/nearby/supplierid/127493/type/geo return a 200 OK server response code, whereas they should return a 404. That would have stopped the problem once and for all.
However, assuming the CMS of your website does not offer any such option (in which case it is a bad CMS), you need to apply the meta noindex tag to these pages and wait patiently for the search engines to pick it up.
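If you want to verify that the tag is really present on the affected pages, a check along these lines may help. It is a minimal sketch using Python's standard `html.parser`; the class name, the function name and the sample markup are assumptions for illustration, and it only looks at `<meta name="robots">` and `<meta name="googlebot">` tags in the HTML you pass in.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a robots/googlebot meta tag contains noindex."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for valueless attributes.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            if "noindex" in directives:
                self.noindex = True


def has_noindex(html: str) -> bool:
    """Return True if the HTML carries a meta noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


# Hypothetical page markup for illustration.
SAMPLE_PAGE = '<html><head><meta name="robots" content="noindex, follow"></head><body>Login</body></html>'
page_is_noindexed = has_noindex(SAMPLE_PAGE)
```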
Can't you fix the 404 thing? Let us know.
-
Really good article, indeed!
I have been thinking about the whole concept over the weekend, and now I have a further idea that is definitely worth considering.
Thank you again, Ryan.
-
Lindsay wrote a great article on the topic which I am sure you will enjoy: http://www.seomoz.org/blog/serious-robotstxt-misuse-high-impact-solutions
-
Thank you for the further info, Ryan.
Although I see your point and accept that there is a lot of truth in it, when I check competitors and even the largest sites on the web, they still use robots.txt (even Google does).
I do, however, accept that noindex is a superior solution to robots.txt, and I will use it for all content I do not want indexed.
I will then see whether, and how, I might need to use robots.txt. I hope it does not hurt to have a noindexed page listed in robots.txt (at a later time, when it is already out of the index).
-
I understand your concern Andras. The two questions I would focus on with respect to crawl budget:
1. Is all your content being indexed properly?
2. Is your content being indexed in a timely manner?
If the answer to the above two questions is yes, I would not spend any more time thinking about crawl budget. Either way, using the "noindex" meta tag is going to be the best way to handle the issue you originally presented.
On a related note, does the content on your "useful" pages change frequently? If so, ensure you are optimizing your links (both internal and external) to these pages. When you demonstrate these are important pages to your site, Google will crawl the pages more frequently.
-
Hi Ryan,
Thank you for your reply.
The only worry I have regarding crawl budget is that I currently have three times more indexed pages than useful pages, due to the issues I mentioned earlier.
It is true that I do not have daily content updates on all of my useful pages; however, I thought that Google allocates an individual crawl budget to each site, based on the value it assigns to that site.
I just want this budget to be spent wisely, so that my useful pages are not crawled less frequently because no-value (but noindexed) content is being crawled instead.
-
Hi Andras,
The first thing to know is a general rule: the best robots.txt file is a blank one. There is almost always a better way to manage a situation than using robots.txt. There are numerous reasons, one of which is that search engines do not always see the robots.txt file.
Regarding the noindex meta tag, that is the proper solution. I understand your concern over crawl budget, but I suggest in this instance, your concerns are not warranted. It is a waste of crawl budget to have search engines spend extra time due to slow servers, bad code, thin content, etc. If you have pages which should not be indexed, adding the noindex tag is likely the best solution.
Without being familiar with your site, it is not possible to offer a definitive answer, but generally speaking this response should be accurate. Keep in mind many sites have millions of pages, and Google has the ability to crawl the entire site each month.
-
Can you show us examples of URLs that are causing you trouble? That would make it easier for us to suggest a solution.