Kill your htaccess file, take the risk to learn a little
-
Last week I was browsing Google's index with "site:www.mydomain.com and wanted to scan over to see what Google had indexed with my site. I came across a URL that was mistakenly indexed. It went something like this
www.mydomain.com/link1/link2/link1/link4/link3
I didn't understand why Google had indexed a page like that of mine when the "link" pages were links that were on my main bar which were site wide links. It seemed to be looping infinitely over and over. So I started trying to see how many of these Google had indexed and I came across about 20 pages. I went through the process of removing the URL's in Webmaster Tools, but then I wanted to know why it was happening. I had discovered that I had mistakenly placed some links on my site in my header in such a manner
If you know HTML you will realize that by not placing the "/" in the front of the link I was telling that page to add that link in addition to the URL that is was currently on. What this did was create an infinite loop of links which is not good
Basically when Google went to www.mydomain.com/link1/ it found the other links which then told Google to add that url to the existing URL and then go to that link.
Something like: www.mydomain.com/links1/link2/...
When you do not add the "/" in front of the directory you are linking too it will do this. The "/" refers to the root so if you place that in front of your directory you are linking too it will always assume that first "/" as the root then the url will follow.
So what did I do?
Even though I was able to find about 20 URL's using the "site:" search method there had to be more out there. Even though I tried to search I was not able to find anymore, but I was not convinced.
The light bulb went on at this point
My .htaccess file contained many 301 redirects in my attempt to try and redirect those pages to a real page, they were not really relevant pages to redirect too. So how could I really find out what Google had indexed out there for me since Webmaster Tools only reports the top 1000 links.
I decided to kill my htaccess file. Knowing that Google is "forgiving" when major changes to your site happen I knew Google would not simply just kill my site for removing my htaccess file immediately.
I waited 3 days then BOOM! Webmaster Tools was reporting to me that it found a ton of 401's on my site. I looked at the Crawl Errors and there they were. All those infinite loop links that I knew had to be more out there, I was able to see.
How many were there?
Google found in the first crawl over 5,000 of them. OMG! Yeah could you imagine the "Low quality" score I was getting on those pages? By seeing all those links I was able to determine about 4 patterns in the links. For example:
Now my issue was I wanted to keep all the URL's that were pointing to www.mydomain.com/link1 but anything after that I needed gone. I went into my Robots.txt file and added this
Disallow: www.mydomain.com/link1/link2/
Disallow: www.mydomain.com/link1/link3/
Disallow: www.mydomain.com/link1/link4/
Disallow: www.mydomain.com/link1/link5/
Now there were many more pages indexed that went deeper into those links but I knew I wanted anything after the 2nd URL gone since it was the start of the loop that I detected. With that I was able to have from what I know at least 5k links if not more.
What did I learn from this?
Kill your htaccess file for a few days and see what comes back in your reports. You might learn something
After doing this I simply replaced my htaccess file and I am on my way to removing a ton of "low quality" links I didn't even know I had.
-
Interesting post. Yeah, the .htaccess file is the most important file out there and it is easy to mess up (as I am sure most everyone has at one time or another).
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Rel Canonical, Follow/No Follow in htaccess?
Very quick question, are rel canonical, follow/no follow tags, etc. written in the htaccess file?
Technical SEO | | moon-boots0 -
How long does it take to reindex the website
Generally speaking, how long does it take for Google to recrawl/reindex an (ecommerce) website? After changing a number of product subcategories from 'noindex' back to 'index', I regenerated the sitemap and have fetched as Google in WMT. This was a couple of weeks ago and no action yet. Second question: Does Google treat these pages as if they're brand new? I 'noindexed' them back in April, and they were ranking ok then. (I had noindexed them on the back of advice from my SEO, due to concerns about these pages being seen as duplicate content). Help!
Technical SEO | | Coraltoes770 -
Google disavow tool ( how long does it take ? )
Hello, I disavowed some of my links about three months but still see them in my link profile, using OSE? How long does it take for Google to make them nofollow. Thanks
Technical SEO | | mezozcorp0 -
Oh no googlebot can not access my robots.txt file
I just receive a n error message from google webmaster Wonder it was something to do with Yoast plugin. Could somebody help me with troubleshooting this? Here's original message Over the last 24 hours, Googlebot encountered 189 errors while attempting to access your robots.txt. To ensure that we didn't crawl any pages listed in that file, we postponed our crawl. Your site's overall robots.txt error rate is 100.0%. Recommended action If the site error rate is 100%: Using a web browser, attempt to access http://www.soobumimphotography.com//robots.txt. If you are able to access it from your browser, then your site may be configured to deny access to googlebot. Check the configuration of your firewall and site to ensure that you are not denying access to googlebot. If your robots.txt is a static page, verify that your web service has proper permissions to access the file. If your robots.txt is dynamically generated, verify that the scripts that generate the robots.txt are properly configured and have permission to run. Check the logs for your website to see if your scripts are failing, and if so attempt to diagnose the cause of the failure. If the site error rate is less than 100%: Using Webmaster Tools, find a day with a high error rate and examine the logs for your web server for that day. Look for errors accessing robots.txt in the logs for that day and fix the causes of those errors. The most likely explanation is that your site is overloaded. Contact your hosting provider and discuss reconfiguring your web server or adding more resources to your website. After you think you've fixed the problem, use Fetch as Google to fetch http://www.soobumimphotography.com//robots.txt to verify that Googlebot can properly access your site.
Technical SEO | | BistosAmerica0 -
How long does it take for traffic to bounce back from and accidental robots.txt disallow of root?
We accidentally uploaded a robots.txt disallow root for all agents last Tuesday and did not catch the error until yesterday.. so 6 days total of exposure. Organic traffic is down 20%. Google has since indexed the correct version of the robots.txt file. However, we're still seeing awful titles/descriptions in the SERPs and traffic is not coming back. GWT shows that not many pages were actually removed from the index but we're still seeing drastic rankings decreases. Anyone been through this? Any sort of timeline for a recovery? Much appreciated!
Technical SEO | | bheard0 -
Any value in external links to image files?
Let's say you have www.example.com. On this website, you have www.example.com/example-image.jpg. When someone links externally to this image - like below... { is < {a href="www.example.com/example-image.jpg"} {img src="www.example.com/example-image.jpg"} {/a} The external site would be using the image hosted on your site, but the image is also linked back to the same image file on your site. Does this have any value even though the link is back to the image file and not the website? Also - how much value do you guys feel image links have in relation to tech links? In terms of passing link juice and adding to a natural link profile. Thanks!
Technical SEO | | qlkasdjfw1 -
Is having a sitemap.xml file still beneficial?
Hi, I'm pretty new to SEO and something I've noticed is that a lot of things become relevant and irrelevant like the weather. I was just wondering if having a sitemap.xml file for Google's use is still a good idea and beneficial? Logically thinking, my websites would get crawled faster by having one. Cheers.
Technical SEO | | davieshussein0