Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Partial Match or RegEx in Search Console's URL Parameters Tool?
-
So I currently have approximately 1000 of these URLs indexed, when I only want roughly 100 of them.
Let's say the URL is www.example.com/page.php?par1=ABC123=&par2=DEF456=&par3=GHI789=
All the indexed URLs follow that same kinda format, but I only want to index the URLs that have a par1 of ABC (but that could be ABC123 or ABC456 or whatever). Using URL Parameters tool in Search Console, I can ask Googlebot to only crawl URLs with a specific value. But is there any way to get a partial match, using regex maybe?
Am I wasting my time with Search Console, and should I just disallow any page.php without par1=ABC in robots.txt?
-
No problem

Hope you get it sorted!
-Andy
-
Thank you!

-
Haha, I think the train passed the station on that one. I would have realised eventually... XD
Thanks for your help!
-
Don't forget that . & ? have a specific meaning within regex - if you want to use them for pattern matching you will have to escape them. Also be aware that not all bots are capable of interpreting regex in robots.txt - you might want to be more explicit on the user agent - only using regex for Google bot.
User-agent: Googlebot
#disallowing page.php and any parameters after it
disallow: /page.php
#but leaving anything that starts with par1=ABC
allow: page.php?par1=ABC
Dirk
-
Ah sorry I missed that bit!
-Andy
-
Disallowing them would be my first priority really, before removing from index.
The trouble with this is that if you disallow first, Google won't be able to crawl the page to act on the noindex. If you add a noindex flag, Google won't index them the next time it comes-a-crawling and then you will be good to disallow

I'm not actually sure of the best way for you to get the noindex in to the page header of those pages though.
-Andy
-
Yep, have done. (Briefly mentioned in my previous response.) Doesn't pass

-
I thought so too, but according to Google the trailing wildcard is completely unnecessary, and only needs to be used mid-URL.
-
Hi Andy,
Disallowing them would be my first priority really, before removing from index. Didn't want to remove them before I've blocked Google from crawling them in case they get added back again next time Google comes a-crawling, as has happened before when I've simply removed a URL here and there. Does that make sense or am I getting myself mixed up here?
My other hack of a solution would be to check the URL in the page.php, and if URL includes par1=ABC then insert noindex meta tag. (Not sure if that would work well or not...)
-
My guess would be that this line needs an * at the end.
Allow: /page.php?par1=ABC* -
Sorry Martijn, just to jump in here for a second - Ria, you can test this via the Robots.txt testing tool in search console before going live to make sure it work.
-Andy
-
Hi Martijn, thanks for your response!
I'm currently looking at something like this...
**user-agent: *** #disallowing page.php and any parameters after it
disallow: /page.php #but leaving anything that starts with par1=ABC
allow: /page.php?par1=ABCI would have thought that you could disallow things broadly like that and give an exception, as you can with files in disallowed folders. But it's not passing Google's robots.txt Tester.
One thing that's probably worth mentioning really is that there are only two variables that I want to allow of the par1 parameter. For example's sake, ABC123 and ABC456. So would need to be either a partial match or "this or that" kinda deal, disallowing everything else.
-
Hi Ria,
I have never tried regular expressions in this way, so I can't tell you if this would work or not.
However, If all 1000 of these URL's are already indexed, just disallowing access won't then remove them from Google. You would ideally be able to place a noindex tag on those pages and let Google act on them, then you will be good to disallow. I am pretty sure there is no option to noindex under the URL Parameter Tool.
I hope that makes sense?
-Andy
-
Hi Ria,
What you could do, but it also depends on the rest of your structure is Disallow these urls based on the parameters (what you could do in a worst case scenario is that you would disallow all URLs and then put an exception Allow in there as well to make sure you still have the right URLs being indexed).
Martijn.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
URL Injection Hack - What to do with spammy URLs that keep appearing in Google's index?
A website was hacked (URL injection) but the malicious code has been cleaned up and removed from all pages. However, whenever we run a site:domain.com in Google, we keep finding more spammy URLs from the hack. They all lead to a 404 error page since the hack was cleaned up in the code. We have been using the Google WMT Remove URLs tool to have these spammy URLs removed from Google's index but new URLs keep appearing every day. We looked at the cache dates on these URLs and they are vary in dates but none are recent and most are from a month ago when the initial hack occurred. My question is...should we continue to check the index every day and keep submitting these URLs to be removed manually? Or since they all lead to a 404 page will Google eventually remove these spammy URLs from the index automatically? Thanks in advance Moz community for your feedback.
Intermediate & Advanced SEO | | peteboyd0 -
Incorrect URL shown in Google search results
Can anyone offer any advice on how Google might get the url which it displays in search results wrong? It currently appears for all pages as: <cite>www.domainname.com › Register › Login</cite> When the real url is nothing like this. It should be: www.domainname.com/product-type/product-name. This could obviously affect clickthroughs. Google has indexed around 3,000 urls on the site and they are all like this. There are links at the top of the page on the website itself which look like this: Register » Login » which presumably could be affecting it? Thanks in advance for any advice or help!
Intermediate & Advanced SEO | | Wagada0 -
What are partial urls and why this is causing a sitemap error?
Hi mozzers, I have a client that recorded 7 errors when generating Xml sitemap. One of the errors appear to be coming from partial urls and apparently I would need to exclude them from sitemap. What are they exactly and why would they cause an error in the sitemap. Thanks!
Intermediate & Advanced SEO | | Ideas-Money-Art0 -
Why is my client's site not ranking anymore? Like big time!
Ok, I'm reaching out to all of you Moz'rs for some help with this one. My client's site has dropped off the face of google in a real short period of time. It went from page 1 (avg rank 3 to page 6 (avg rank 50) and below in the matter of 2 weeks. Here's some facts: 1. DA is a 22 and homepage PA is a 31. It outranks all other sites in its competitive set. 2. The homepage used to be the page that displays for keyword searches, now its the FAQ page, which has a lower PA of 23. Why has the home page seemingly vaporized? And, why is the FAQ showing as the first result? What should I start checking. I feel paralyzed, not sure where to start. More info: a. There are no alerts present in Webmaster Tools. b. For some reason the homepage (domain.com) was 301'd to domain.com/home.html. Domain.com is indexed by Google, however, domain.com/home.html is not. If this is the issue, what is the best way to handle it? Thanks in advance for your help!
Intermediate & Advanced SEO | | rhoadesjohn1 -
Remove URLs that 301 Redirect from Google's Index
I'm working with a client who has 301 redirected thousands of URLs from their primary subdomain to a new subdomain (these are unimportant pages with regards to link equity). These URLs are still appearing in Google's results under the primary domain, rather than the new subdomain. This is problematic because it's creating an artificial index bloat issue. These URLs make up over 90% of the URLs indexed. My experience has been that URLs that have been 301 redirected are removed from the index over time and replaced by the new destination URL. But it has been several months, close to a year even, and they're still in the index. Any recommendations on how to speed up the process of removing the 301 redirected URLs from Google's index? Will Google, or any search engine for that matter, process a noindex meta tag if the URL's been redirected?
Intermediate & Advanced SEO | | trung.ngo0 -
Other domains hosted on same server showing up in SERP for 1st site's keywords
For the website in question, the first domain alphabetically on the shared hosting space, strange search results are appearing on the SERP for keywords associated with the site. Here is an example: A search for "unique company name" shows the results: www.uniquecompanyname.com as the top result. But on pages 2 and 3, we are getting results for the same content but for domains hosted on the same server. Here are some examples with the domain name replaced: UNIQUE DOMAIN NAME PAGE TITLE
Intermediate & Advanced SEO | | Motava
ftp.DOMAIN2.com/?action=news&id=63
META DESCRIPTION TEXT UNIQUE DOMAIN NAME PAGE TITLE 2
www.DOMAIN3.com/?action=news&id=120
META DESCRIPTION TEXT2 UNIQUE DOMAIN NAME PAGE TITLE 2
www.DOMAIN4.com/?action=news&id=120
META DESCRIPTION TEXT2 UNIQUE DOMAIN NAME PAGE TITLE 3
mail.DOMAIN5.com/?action=category&id=17
META DESCRIPTION TEXT3 ns5.DOMAIN6.com/?action=article&id=27 There are more but those are just some examples. These other domain names being listed are other customer domains on the same VPS shared server. When clicking the result the browser URL still shows the other customer domain name B but the content is usually the 404 page. The page title and meta description on that page is not displayed the same as on the SERP.As far as we can tell, this is the only domain this is occurring for.So far, no crawl errors detected in Webmaster Tools and moz crawl not completed yet.0 -
May know what's the meaning of these parameters in .htaccess?
Begin HackRepair.com Blacklist RewriteEngine on Abuse Agent Blocking RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
Intermediate & Advanced SEO | | esiow2013
RewriteCond %{HTTP_USER_AGENT} ^Bolt\ 0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} CazoodleBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Default\ Browser\ 0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} discobot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ecxi [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [NC,OR]
RewriteCond %{HTTP_USER_AGENT} GT::WWW [NC,OR]
RewriteCond %{HTTP_USER_AGENT} heritrix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTP::Lite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} IDBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} id-search [NC,OR]
RewriteCond %{HTTP_USER_AGENT} id-search.org [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InternetSeer.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} IRLbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ISC\ Systems\ iRc\ Search\ 2.1 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Java [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} libwww [NC,OR]
RewriteCond %{HTTP_USER_AGENT} libwww-perl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Link [NC,OR]
RewriteCond %{HTTP_USER_AGENT} LinksManager.com_bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} linkwalker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} lwp-trivial [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Maxthon$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MFC_Tear_Sample [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^microsoft.url [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Microsoft\ URL\ Control [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Missigua\ Locator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*Indy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.NEWT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MSFrontPage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} panscient.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PECL::HTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PeoplePal [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PHPCrawl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PleaseCrawl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Rippers\ 0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SBIder [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SeaMonkey$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Snoopy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Steeler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Toata\ dragostea\ mea\ pentru\ diavola [NC,OR]
RewriteCond %{HTTP_USER_AGENT} URI::Fetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} urllib [NC,OR]
RewriteCond %{HTTP_USER_AGENT} User-Agent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Web\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} webalta [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCollage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Wells\ Search\ II [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WEP\ Search [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WWW-Mechanize [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} zermelo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.)Zeus.Webster [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ZyBorg [NC]
RewriteRule ^. - [F,L] Abuse bot blocking rule end End HackRepair.com Blacklist1 -
Posing QU's on Google Variables "aclk", "gclid" "cd", "/aclk" "/search", "/url" etc
I've been doing a bit of stats research prompted by read the recent ranking blog http://www.seomoz.org/blog/gettings-rankings-into-ga-using-custom-variables There are a few things that have come up in my research that I'd like to clear up. The below analysis has been done on my "conversions". 1/. What does "/aclk" mean in the Referrer URL? I have noticed a strong correlation between this and "gclid" in the landing page variable. Does it mean "ad click" ?? Although they seem to "closely" correlate they don't exactly, so when I have /aclk in the referrer Url MOSTLY I have gclid in the landing page URL. BUT not always, and the same applies vice versa. It's pretty vital that I know what is the best way to monitor adwords PPC, so what is the best variable to go on? - Currently I am using "gclid", but I have about 25% extra referral URL's with /aclk in that dont have "gclid" in - so am I underestimating my number of PPC conversions? 2/. The use of the variable "cd" is great, but it is not always present. I have noticed that 99% of my google "Referrer URL's" either start with:
Intermediate & Advanced SEO | | James77
/aclk - No cd value
/search - No cd value
/url - Always contains the cd variable. What do I make of this?? Thanks for the help in advance!0