Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Partial Match or RegEx in Search Console's URL Parameters Tool?
-
So I currently have approximately 1000 of these URLs indexed, when I only want roughly 100 of them.
Let's say the URL is www.example.com/page.php?par1=ABC123=&par2=DEF456=&par3=GHI789=
All the indexed URLs follow that same kinda format, but I only want to index the URLs that have a par1 of ABC (but that could be ABC123 or ABC456 or whatever). Using URL Parameters tool in Search Console, I can ask Googlebot to only crawl URLs with a specific value. But is there any way to get a partial match, using regex maybe?
Am I wasting my time with Search Console, and should I just disallow any page.php without par1=ABC in robots.txt?
-
No problem
Hope you get it sorted!
-Andy
-
Thank you!
-
Haha, I think the train passed the station on that one. I would have realised eventually... XD
Thanks for your help!
-
Don't forget that . & ? have a specific meaning within regex - if you want to use them for pattern matching you will have to escape them. Also be aware that not all bots are capable of interpreting regex in robots.txt - you might want to be more explicit on the user agent - only using regex for Google bot.
User-agent: Googlebot
#disallowing page.php and any parameters after it
disallow: /page.php
#but leaving anything that starts with par1=ABC
allow: page.php?par1=ABC
Dirk
-
Ah sorry I missed that bit!
-Andy
-
Disallowing them would be my first priority really, before removing from index.
The trouble with this is that if you disallow first, Google won't be able to crawl the page to act on the noindex. If you add a noindex flag, Google won't index them the next time it comes-a-crawling and then you will be good to disallow
I'm not actually sure of the best way for you to get the noindex in to the page header of those pages though.
-Andy
-
Yep, have done. (Briefly mentioned in my previous response.) Doesn't pass
-
I thought so too, but according to Google the trailing wildcard is completely unnecessary, and only needs to be used mid-URL.
-
Hi Andy,
Disallowing them would be my first priority really, before removing from index. Didn't want to remove them before I've blocked Google from crawling them in case they get added back again next time Google comes a-crawling, as has happened before when I've simply removed a URL here and there. Does that make sense or am I getting myself mixed up here?
My other hack of a solution would be to check the URL in the page.php, and if URL includes par1=ABC then insert noindex meta tag. (Not sure if that would work well or not...)
-
My guess would be that this line needs an * at the end.
Allow: /page.php?par1=ABC* -
Sorry Martijn, just to jump in here for a second - Ria, you can test this via the Robots.txt testing tool in search console before going live to make sure it work.
-Andy
-
Hi Martijn, thanks for your response!
I'm currently looking at something like this...
**user-agent: *** #disallowing page.php and any parameters after it
disallow: /page.php #but leaving anything that starts with par1=ABC
allow: /page.php?par1=ABCI would have thought that you could disallow things broadly like that and give an exception, as you can with files in disallowed folders. But it's not passing Google's robots.txt Tester.
One thing that's probably worth mentioning really is that there are only two variables that I want to allow of the par1 parameter. For example's sake, ABC123 and ABC456. So would need to be either a partial match or "this or that" kinda deal, disallowing everything else.
-
Hi Ria,
I have never tried regular expressions in this way, so I can't tell you if this would work or not.
However, If all 1000 of these URL's are already indexed, just disallowing access won't then remove them from Google. You would ideally be able to place a noindex tag on those pages and let Google act on them, then you will be good to disallow. I am pretty sure there is no option to noindex under the URL Parameter Tool.
I hope that makes sense?
-Andy
-
Hi Ria,
What you could do, but it also depends on the rest of your structure is Disallow these urls based on the parameters (what you could do in a worst case scenario is that you would disallow all URLs and then put an exception Allow in there as well to make sure you still have the right URLs being indexed).
Martijn.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
[Very Urgent] More 100 "/search/adult-site-keywords" Crawl errors under Search Console
I just opened my G Search Console and was shocked to see more than 150 Not Found errors under Crawl errors. Mine is a Wordpress site (it's consistently updated too): Here's how they show up: Example 1: URL: www.example.com/search/adult-site-keyword/page2.html/feed/rss2 Linked From: http://an-adult-image-hosting.com/search/adult-site-keyword/page2.html Example 2 (this surprised me the most when I looked at the linked from data): URL: www.example.com/search/adult-site-keyword-2.html/page/3/ Linked From: www.example.com/search/adult-site-keyword-2.html/page/2/ (this is showing as if it's from our own site) http://a-spammy-adult-site.com/search/adult-site-keyword-2.html Example 3: URL: www.example.com/search/adult-site-keyword-3.html Linked From: http://an-adult-image-hosting.com/search/adult-site-keyword-3.html How do I address this issue?
Intermediate & Advanced SEO | | rmehta10 -
What's the best way to noindex pages but still keep backlinks equity?
Hello everyone, Maybe it is a stupid question, but I ask to the experts... What's the best way to noindex pages but still keep backlinks equity from those noindexed pages? For example, let's say I have many pages that look similar to a "main" page which I solely want to appear on Google, so I want to noindex all pages with the exception of that "main" page... but, what if I also want to transfer any possible link equity present on the noindexed pages to the main page? The only solution I have thought is to add a canonical tag pointing to the main page on those noindexed pages... but will that work or cause wreak havoc in some way?
Intermediate & Advanced SEO | | fablau3 -
Should I disallow all URL query strings/parameters in Robots.txt?
Webmaster Tools correctly identifies the query strings/parameters used in my URLs, but still reports duplicate title tags and meta descriptions for the original URL and the versions with parameters. For example, Webmaster Tools would report duplicates for the following URLs, despite it correctly identifying the "cat_id" and "kw" parameters: /Mulligan-Practitioner-CD-ROM
Intermediate & Advanced SEO | | jmorehouse
/Mulligan-Practitioner-CD-ROM?cat_id=87
/Mulligan-Practitioner-CD-ROM?kw=CROM Additionally, theses pages have self-referential canonical tags, so I would think I'd be covered, but I recently read that another Mozzer saw a great improvement after disallowing all query/parameter URLs, despite Webmaster Tools not reporting any errors. As I see it, I have two options: Manually tell Google that these parameters have no effect on page content via the URL Parameters section in Webmaster Tools (in case Google is unable to automatically detect this, and I am being penalized as a result). Add "Disallow: *?" to hide all query/parameter URLs from Google. My concern here is that most backlinks include the parameters, and in some cases these parameter URLs outrank the original. Any thoughts?0 -
Incorrect URL shown in Google search results
Can anyone offer any advice on how Google might get the url which it displays in search results wrong? It currently appears for all pages as: <cite>www.domainname.com › Register › Login</cite> When the real url is nothing like this. It should be: www.domainname.com/product-type/product-name. This could obviously affect clickthroughs. Google has indexed around 3,000 urls on the site and they are all like this. There are links at the top of the page on the website itself which look like this: Register » Login » which presumably could be affecting it? Thanks in advance for any advice or help!
Intermediate & Advanced SEO | | Wagada0 -
What's the deal with significantLinks?
http://schema.org/significantLink Schema.org has a definition for "non-navigation links that are clicked on the most." Presumably this means something like the big green buttons on Moz's homepage. But does anyone know how they affect anything? In http://moz.com/blog/schemaorg-a-new-approach-to-structured-data-for-seo#comment-142936, Jeremy Nelson says " It's quite possible that significant links will pass anchor text as well if a previous link to the page was set in navigation, effictively making obselete the first-link-counts rule, and I am interested in putting that to test." This is a pretty obscure comment but it's one of the only results I could find on the subject. Is this BS? I can't even make out what all of it is saying. So what's the deal with significantLinks and how can we use them to SEO?
Intermediate & Advanced SEO | | NerdsOnCall0 -
What are Soft 404's and are they a problem
Hi, I have some old pages that were coming up in google WMT as a 404. These had links into them so i thought i'd do a 301 back to either the home page or to a relevant category or page. However these are now listed in WMT as soft 404's. I'm not sure what this means and whether google is saying it doesn't like this? Any advice welcomed.
Intermediate & Advanced SEO | | Aikijeff0 -
Do I need to use canonicals if I will be using 301's?
I just took a job about three months and one of the first things I wanted to do was restructure the site. The current structure is solution based but I am moving it toward a product focus. The problem I'm having is the CMS I'm using isn't the greatest (and yes I've brought this up to my CMS provider). It creates multiple URL's for the same page. For example, these two urls are the same page: (note: these aren't the actual urls, I just made them up for demonstration purposes) http://www.website.com/home/meet-us/team-leaders/boss-man/
Intermediate & Advanced SEO | | Omnipress
http://www.website.com/home/meet-us/team-leaders/boss-man/bossman.cmsx (I know this is terrible, and once our contract is up we'll be looking at a different provider) So clearly I need to set up canonical tags for the last two pages that look like this: http://www.omnipress.com/boss-man" /> With the new site restructure, do I need to put a canonical tag on the second page to tell the search engine that it's the same as the first, since I'll be changing the category it's in? For Example: http://www.website.com/home/meet-us/team-leaders/boss-man/ will become http://www.website.com/home/MEET-OUR-TEAM/team-leaders/boss-man My overall question is, do I need to spend the time to run through our entire site and do canonical tags AND 301 redirects to the new page, or can I just simply redirect both of them to the new page? I hope this makes sense. Your help is greatly appreciated!!0 -
How to fix issues regarding URL parameters?
Today, I was reading help article for URL parameters by Google. http://www.google.com/support/webmasters/bin/answer.py?answer=1235687 I come to know that, Google is giving value to URLs which ave parameters that change or determine the content of a page. There are too many pages in my website with similar value for Name, Price and Number of product. But, I have restricted all pages by Robots.txt with following syntax. URLs:
Intermediate & Advanced SEO | | CommercePundit
http://www.vistastores.com/table-lamps?dir=asc&order=name
http://www.vistastores.com/table-lamps?dir=asc&order=price
http://www.vistastores.com/table-lamps?limit=100 Syntax in Robots.txt
Disallow: /?dir=
Disallow: /?p=
Disallow: /*?limit= Now, I am confuse. Which is best solution to get maximum benefits in SEO?0