Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we're not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
Partial Match or RegEx in Search Console's URL Parameters Tool?
So I currently have approximately 1,000 of these URLs indexed, when I only want roughly 100 of them. Let's say the URL is www.example.com/page.php?par1=ABC123=&par2=DEF456=&par3=GHI789= and all the indexed URLs follow that same format, but I only want to index the URLs that have a par1 of ABC (which could be ABC123, ABC456, or whatever). Using the URL Parameters tool in Search Console, I can ask Googlebot to only crawl URLs with a specific value, but is there any way to get a partial match, using regex maybe? Am I wasting my time with Search Console, and should I just disallow any page.php without par1=ABC in robots.txt?
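(For reference, a minimal robots.txt sketch of the "disallow everything except par1=ABC" idea from the end of the question; the path and the ABC prefix are just the placeholders used above. Google resolves conflicts in favour of the most specific, i.e. longest, matching rule, so the longer Allow line wins for the par1=ABC URLs, but other crawlers may not honour Allow rules the same way.)
User-agent: Googlebot
# Block page.php and any parameters after it...
Disallow: /page.php
# ...but keep anything whose par1 value starts with ABC crawlable
Allow: /page.php?par1=ABC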
No problem. Hope you get it sorted! -Andy
 Thank you!  
 Haha, I think the train passed the station on that one. I would have realised eventually... XD Thanks for your help! 
Don't forget that . & ? have a specific meaning within regex; if you want to use them for pattern matching you will have to escape them. Also be aware that not all bots are capable of interpreting regex in robots.txt, so you might want to be more explicit about the user agent and only use the pattern rules for Googlebot:
User-agent: Googlebot
# disallowing page.php and any parameters after it
Disallow: /page.php
# but leaving anything that starts with par1=ABC
Allow: /page.php?par1=ABC
Dirk
 Ah sorry I missed that bit! -Andy 
"Disallowing them would be my first priority really, before removing from index." The trouble with this is that if you disallow first, Google won't be able to crawl the pages to act on the noindex. If you add a noindex flag, Google won't index them the next time it comes a-crawling, and then you will be good to disallow. I'm not actually sure of the best way for you to get the noindex into the page header of those pages though. -Andy
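(For reference, the two standard ways to flag that, assuming you can touch either the page template or the server config, are a robots meta tag in the page head or an X-Robots-Tag response header:)
<meta name="robots" content="noindex">   (placed inside the page's <head>)
X-Robots-Tag: noindex                    (sent as an HTTP response header instead)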
Yep, have done. (Briefly mentioned in my previous response.) Doesn't pass.
 I thought so too, but according to Google the trailing wildcard is completely unnecessary, and only needs to be used mid-URL. 
Hi Andy, Disallowing them would be my first priority really, before removing from index. I didn't want to remove them before I've blocked Google from crawling them, in case they get added back again next time Google comes a-crawling, as has happened before when I've simply removed a URL here and there. Does that make sense or am I getting myself mixed up here? My other hack of a solution would be to check the URL in page.php, and if the URL doesn't include par1=ABC then insert a noindex meta tag. (Not sure if that would work well or not...)
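(A rough sketch of that page.php idea, assuming par1 arrives as an ordinary query-string parameter and that the goal is to noindex everything whose par1 value doesn't start with ABC; the variable name is just a placeholder, and the echo would need to land inside the page's <head>:)
<?php
// Hypothetical check near the top of page.php:
// noindex any request whose par1 value does not start with "ABC".
$par1 = isset($_GET['par1']) ? $_GET['par1'] : '';
if (strpos($par1, 'ABC') !== 0) {
    echo '<meta name="robots" content="noindex">';
}
?>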
 My guess would be that this line needs an * at the end. 
 Allow: /page.php?par1=ABC*
Sorry Martijn, just to jump in here for a second. Ria, you can test this via the robots.txt testing tool in Search Console before going live, to make sure it works. -Andy
Hi Martijn, thanks for your response! I'm currently looking at something like this...
User-agent: *
# disallowing page.php and any parameters after it
Disallow: /page.php
# but leaving anything that starts with par1=ABC
Allow: /page.php?par1=ABC
I would have thought that you could disallow things broadly like that and give an exception, as you can with files in disallowed folders, but it's not passing Google's robots.txt Tester. One thing that's probably worth mentioning is that there are only two values of the par1 parameter that I want to allow. For example's sake, ABC123 and ABC456. So it would need to be either a partial match or a "this or that" kind of deal, disallowing everything else.
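(If it really is just those two values, one way to spell that out instead of relying on a partial match is an explicit Allow line per value; ABC123 and ABC456 are the example values from above. In principle Googlebot favours the longest matching rule, so these Allow lines should win over the broader Disallow, but it's worth confirming in the robots.txt Tester, and other crawlers may ignore Allow rules entirely.)
User-agent: *
# block page.php with any parameters...
Disallow: /page.php
# ...except the two par1 values that should stay indexable
Allow: /page.php?par1=ABC123
Allow: /page.php?par1=ABC456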
Hi Ria, I have never tried regular expressions in this way, so I can't tell you if this would work or not. However, if all 1,000 of these URLs are already indexed, just disallowing access won't then remove them from Google. You would ideally place a noindex tag on those pages and let Google act on them; then you will be good to disallow. I am pretty sure there is no option to noindex under the URL Parameters tool. I hope that makes sense? -Andy
Hi Ria, What you could do, although it also depends on the rest of your structure, is disallow these URLs based on the parameters. In a worst-case scenario you could disallow all of these URLs and then put an exception Allow in there as well, to make sure you still have the right URLs being indexed. Martijn.