Robots.txt to disallow /index.php/ path
-
Hi SEOmoz,
I have a problem with my Joomla site (yeah - me too!). I get a large amount of /index.php/ urls despite using a program to handle these issues. The URLs cause indexation errors with google (404). Now, I fixed this issue once before, but the problem persist. So I thought, instead of wasting more time, couldnt I just disallow all paths containing /index.php/ ?.
I don't use that extension, but would it cause me any problems from an SEO perspective?
How do I disallow all index.php's? Is it a simple: Disallow: /index.php/
-
Hi Cyrus,
Thanks for your reply!
Unfortunately the problem is yet to be fixed, I hope that my disallow will work shortly.
It seems that most of the index.php links to each other internally (and from old /index.php/ pages that no longer exist), which is super weird. How google found them does not make any sense to me.
I don't beleive that external sources are linking to these pages either - I mean, how would they find these links anyway?.
-
Hi Mikkel,
Like Chris, I spidered your site and couldn't find any links to /index.php files, which probably indicates one of two things:
- You've fixed the problem - Yay!
- Or Google is finding those links from external sources
- Google found those links at one time in the past, and is still trying to crawl them.
In the Crawl Errors report in Google Webmaster Tools, if you click on the link of each 404, there's often a "linked from" source where you can see where Google discovered the broken link. This is really helpful in rooting out the cause.
Regardless, I'm going to go with #1 and optimistically believe that you were able to fix the problem.
-
If I spider your site I'm not seeing any /index.php urls. Does that mean you did get Joomla to cooperate with your rewriting?
Or was your problem that you'd previously had urls indexed with /index.php/ paths and you needed to remove them?
-
Hi Mikkel, I have checked your robots.txt, it looks perfect. If you redirect /index.php to home page that using httaccess file or by using any joomla plugin that would great for you. And its also a permanent solution.
-
Well, I tried the sensible solution and redirecting to the correct URL instead. However the SEF program is quite limited and keep on creating new URLs regardless of my modification. Im looking for a more permanent solution, and the disallow seems at bit simple as I'm not a super programmer.
By the way - thanks for quick replys, kudos to both of you!
-
Sure, the website in question is www.vauni.dk
I don't think that there is any inbound links to the index.php pages. They are not easily found.
-
Couldn't you rewrite those /index.php/ urls to remove the /index.php/?
Like this in .htaccess:
RewriteRule ^(.*)$ /index.php/$1 [L]
Only used Joomla once, but there must be a way to configure joomla to just use "/" instead of "/index.php/"?
Update:
Here's a solution to your /index.php/ issue:
http://www.eprcreations.com/remove-index-php-from-joomla-urls/
Once you've updated that, and have your urls working properly without the /index.php/, you could add this slight modification of the rewrite rule above so that all your old /index.php/ urls would be 301'd to your new ones:
RewriteRule ^(.*)$ /index.php/$1 [R=301,L]
Put it underneath the RewriteBase / line they describe in that post.
-
Hi Mikkel,
Do you inbound link pointing to you index.php pages ? If yes, then it might affect your seo. Disallow: /index.ph/ is perfect but after implementing it don't inter link those index.php pages. Can you share me your website URL so that I can show you with example. How to do it.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Disallow wildcard match in Robots.txt
This is in my robots.txt file, does anyone know what this is supposed to accomplish, it doesn't appear to be blocking URLs with question marks Disallow: /?crawler=1
Technical SEO | | AmandaBridge
Disallow: /?mobile=1 Thank you0 -
Accidental No Index
Hi everyone, We control several client sites at my company. The developers accidentally had a no index robot implemented in the site code when we did the HTTPS upgrade without knowing it (yes it's true). Ten days later we noticed traffic was falling. After a couple days we found the no index tags and removed them and resumbitted the sitemaps. The sites started ranking for their own keywords again within a day or two. The organic traffic is still down considerably and other keywords they are not ranking for in the same spot as they were before or at all. If I look in Google Search console, it says we submitted for example 4,000 URLs and only 160 have been indexed. I feel like maybe Google is taking a long time to re-index to remainder of the sites?? Has anyone has this issue?? We're starting to get very concerned so any input would be appreciate. I read an article on here from 2011 about a company that did the same and they were ranking for their keywords within a week. It's been 8 days since our fix.
Technical SEO | | AliMac260 -
Robots.txt on subdomains
Hi guys! I keep reading conflicting information on this and it's left me a little unsure. Am I right in thinking that a website with a subdomain of shop.sitetitle.com will share the same robots.txt file as the root domain?
Technical SEO | | Whittie0 -
Google showing https:// page in search results but directing to http:// page
We're a bit confused as to why Google shows a secure page https:// URL in the results for some of our pages. This includes our homepage. But when you click through it isn't taking you to the https:// page, just the normal unsecured page. This isn't happening for all of our results, most of our deeper content results are not showing as https://. I thought this might have something to do with Google conducting searches behind secure pages now, but this problem doesn't seem to affect other sites and our competitors. Any ideas as to why this is happening and how we get around it?
Technical SEO | | amiraicaew0 -
Site Indexed but not Cached?
I launched a new website ~2 weeks ago that seems to be indexed but not cached. According to Google Webmaster most of the pages are indexed and I see them appear when I search site:www.xxx.com. However, when I type into the URL - cache:www.xxx.com I get a 404 error page from Google.
Technical SEO | | theLotter
I've checked more established websites and they are cached so I know I am checking correctly here... Why would my site be indexed but not in the cache?0 -
Do I need robots.txt and meta robots?
If I can manage to tell crawlers what I do and don't want them to crawl for my whole site via my robots.txt file, do I still need meta robots instructions?
Technical SEO | | Nola5040 -
Robots.txt usage
Hey Guys, I am about make an important improvement to our site's robots.txt we have large number of properties on our site and we have different views for them. List, gallery and map view. By default list view shows up and user can navigate through gallery view. We donot want gallery pages to get indexed and want to save our crawl budget for more important pages. this is one example of our site: http://www.holiday-rentals.co.uk/France/r31.htm When you click on "gallery view" URL of this site will remain same in your address bar: but when you mouse over the "gallery view" tab it will show you URL with parameter "view=g". there are number of parameters: "view=g, view=l and view=m". http://www.holiday-rentals.co.uk/France/r31.htm?view=l http://www.holiday-rentals.co.uk/France/r31.htm?view=g http://www.holiday-rentals.co.uk/France/r31.htm?view=m Now my question is: I If restrict bots by adding "Disallow: ?view=" in our robots.txt will it effect the list view too? Will be very thankful if yo look into this for us. Many thanks Hassan I will test this on some other site within our network too before putting it to important one's. to measure the impact but will be waiting for your recommendations. Thanks
Technical SEO | | holidayseo0 -
Robots.txt blocking site or not?
Here is the robots.txt from a client site. Am I reading this right --
Technical SEO | | 540SEO
that the robots.txt is saying to ignore the entire site, but the
#'s are saying to ignore the robots.txt command? See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file To ban all spiders from the entire site uncomment the next two lines: User-Agent: * Disallow: /0