What does Disallow: /french-wines/?* actually do - robots.txt
-
Hello Mozzers - Just wondering what this robots.txt instruction means: Disallow: /french-wines/?*
Does it stop Googlebot crawling and indexing URLs in that "French Wines" folder - specifically the URLs that include a question mark?
Would it stop the crawling of deeper folders - e.g. /french-wines/rhone-region/ that include a question mark in their URL?
I think this has been done to block URLs containing query strings.
Thanks, Luke
-
Glad to help, Luke!
-
Thanks Logan for your help with this - much appreciated. Really helpful!
-
Disallow: /?* is the same thing as Disallow:/?, since the asterisk is a wildcard, both of those disallows prevent any URL that begins with /? from being crawled.
And yes, it is incredibly easy to disallow the wrong thing! The robots.txt tester in Search Console (under the Crawl menu) is very helpful for figuring out what a disallow will catch and what it will let by. I highly recommend testing any new disallows there before releasing them into the wild.
-
Thanks again Logan.
What would Disallow: /?* do because that is what the site I am looking at has implemented. Perhaps it works both ways around?
I imagine it's easy to disallow the wrong thing or possibly not disallow the right thing. Ugh.
-
Disallow: /*?
This disallow literally says to crawlers 'if a URL starts with a slash (all URLs) and has a parameter, don't crawl it'. The * is a wildcard that says anything between / and ? is applicable to the disallow.
It's very easy to disallow the wrong this especially in regards to parameters, for this reason I always do these 2 things rather than using robots.txt:
- Set the purpose of each parameter in Search Console - Go to Crawl > URL Parameters to configure for your site
- Self-referring canonicals - most people disallow URLs with parameters in robots.txt to prevent indexing, but this only prevents crawling. A self-referring canonical pointing to the root level of that URL will prevent indexing or URLs with parameters.
Hope that's helpful!
-
Thanks Logan - I was just reading: Disallow: /*? # block any URL that includes a ? (and thus a query string) - do you know why the ? comes before the * in this case?
-
Hi Luke,
You are correct that this was done to block URLs with parameters. However, since there's no wildcard (the asterisk) before the folder name, the URL would have to start with /french-wines/. This disallow is really only preventing crawling on the single URL www.yoursite.com/french-wines/ with any parameters appended.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
I have a metadata issue. My site crawl is coming back with missing descriptions, but all of the pages look like site tags (i.e. /blog/?_sft_tag=call-routing)
I have a metadata issue. My site crawl is coming back with missing descriptions, but all of the pages look like site tags (i.e. /blog/?_sft_tag=call-routing)
Intermediate & Advanced SEO | | amarieyoussef0 -
Looking for opinions on structuring meta title tags/page title/menu title/H1
Hi everyone I am hoping a few of you can share your opinions. I have been having conversations (okay, healthy debates) about how to write/structure meta title tag and how to compliment them with the H1, page title, menu name. To help explain the thought processes I will use a pretend keyword. How about "screwdriver". Case: (I made this up) we are redesigning a website for a construction tools manufacturing company (pretend name: ABC Tools) targeting OEMs who are interested in purchasing large quantities of tools. The product categories (to become main menu items) are Screwdrivers, Nails, Drills, and Hammers. (bear with me .... this is just an example I am making up on the fly) K. Circling back to screwdrivers - let's say we have one landing page (a primary category page and in the main menu) listing products and great details about screwdrivers. Focus keywords are screwdriver manufacturer, screwdriver supplier, construction screwdrivers Below are questions being debated. If you are willing ... how would you address these questions? And, can you explain WHY? QUESTION ONE: How would you structure the meta title tag (feel free to write one of your own) Screwdriver Manufacturer - Construction Screwdriver | ABC Tools ABC Tools - US-based Screwdriver Manufacturer Supplier Near You High-Quality Screwdrivers for Construction with ABC Tools QUESTION TWO: how would you write the H1 on the page? Would it match the meta tag? OR, would you write something different using the primary keyword? QUESTION THREE Remembering this is not a blog post ... it is a primary landing page linked to the main navigation. What would the menu title be? (remember the product categories above are how the main menu items are bucketed) Screwdrivers Screwdriver Manufacturer Typically in WordPress, the H1 and the menu title is auto-populated using the page title (not the title tag)... So, if we use Screwdrivers as the page title but we want the H1 to match the meta title tag, would we manually change the H1? Or, have the page title and title tag match, but manually change the menu item?
Intermediate & Advanced SEO | | Brenda.Haines1 -
No index detected in robots meta tag GSC issue_Help Please
Hi Everyone, We just did a site migration ( URL structure change, site redesign, CMS change). During migration, dev team messed up badly on a few things including SEO. The old site had pages canonicalized and self canonicalized <> New site doesn't have anything (CMS dev error) so we are working retroactively to add canonicalization mechanism The legacy site had URL’s ending with a trailing slash “/” <> new site got redirected to Set of url’s without “/” New site action : All robots are allowed: A new sitemap is submitted to google search console So here is my problem (it been a long 24hr night for me 🙂 ) 1. Now when I look at GSC homepage URL it says that old page is self canonicalized and currently in index (old page with a trailing slash at the end of URL). 2. When I try to perform a live URL test, I get the message "No: 'noindex' detected in 'robots' meta tag" , so indexation cant be done. I have no idea where noindex is coming from. 3. Robots.txt in search console still showing old file ( no noindex there ) I tried to submit new file but old one still coming up. When I click on "See live robots.txt" I get current robots. 4. I see that old page is still canonicalized and attempting to index redirected old page might be confusing google Hope someone can help to get the new page indexed! I really need it 🙂 Please ping me if you need more clarification. Thank you ! Thank you
Intermediate & Advanced SEO | | bgvsiteadmin1 -
URL Parameters Settings in WMT/Search Console
On an large ecommerce site the main navigation links to URLs that include a legacy parameter. The parameter doesn’t actually seem to do anything to change content - it doesn’t narrow or specify content, nor does it currently track sessions. We’ve set the canonical for these URLs to be without the parameter. (We did this when we started seeing that Google was stripping out the parameter in the majority of SERP results themselves.) We’re trying to best strategize on how to set the parameters in WMT (search console). Our options are to set to: 1. No: Doesn’t affect page content’ - and then the Crawl field in WMT is auto-set to ‘Representative URL’. (Note, that it's unclear what ‘Representative URL’ is defined as. Google’s documentation suggests that a representative URL is a canonical URL, and we've specifically set canonicals to be without the parameter so does this contradict? ) OR 2. ‘Yes: Changes, reorders, or narrows page content’ And then it’s a question of how to instruct Googlebot to crawl these pages: 'Let Googlebot decide' OR 'No URLs'. The fundamental issue is whether the parameter settings are an index signal or crawl signal. Google documents them as crawl signals, but if we instruct Google not to crawl our navigation how will it find and pass equity to the canonical URLs? Thoughts? Posted by Susan Schwartz, Kahena Digital staff member
Intermediate & Advanced SEO | | AriNahmani0 -
Does link building through content syndication still actually work?
I stumbled across this old SEOmoz whilteboard http://www.seomoz.org/blog/whiteboard-friday-leveraging-syndicated-content-effectively and was wondering if this is still a valid technique given the Panda & Penguin updates. Is anyone here still doing this (and seeing results)?
Intermediate & Advanced SEO | | nicole.healthline0 -
Robots.txt is blocking Wordpress Pages from Googlebot?
I have a robots.txt file on my server, which I did not develop, it was done by the web designer at the company before me. Then there is a word press plugin that generates a robots.txt file. How Do I unblock all the wordpress pages from googlebot?
Intermediate & Advanced SEO | | ENSO0 -
New server update + wrong robots.txt = lost SERP rankings
Over the weekend, we updated our store to a new server. Before the switch, we had a robots.txt file on the new server that disallowed its contents from being indexed (we didn't want duplicate pages from both old and new servers). When we finally made the switch, we somehow forgot to remove that robots.txt file, so the new pages weren't indexed. We quickly put our good robots.txt in place, and we submitted a request for a re-crawl of the site. The problem is that many of our search rankings have changed. We were ranking #2 for some keywords, and now we're not showing up at all. Is there anything we can do? Google Webmaster Tools says that the next crawl could take up to weeks! Any suggestions will be much appreciated.
Intermediate & Advanced SEO | | 9Studios0 -
Old pages still crawled by SE returning 404s. Better to put 301 or block with robots.txt ?
Hello guys, A client of ours has thousand of pages returning 404 visibile on googl webmaster tools. These are all old pages which don't exist anymore but Google keeps on detecting them. These pages belong to sections of the site which don't exist anymore. They are not linked externally and didn't provide much value even when they existed What do u suggest us to do: (a) do nothing (b) redirect all these URL/folders to the homepage through a 301 (c) block these pages through the robots.txt. Are we inappropriately using part of the crawling budget set by Search Engines by not doing anything ? thx
Intermediate & Advanced SEO | | H-FARM0