Google News not indexing /index.html pages
-
Hi all,
we've been asked by a blog to help improve its indexing and ranking on Google News (the site is already included in Google News, but with poor results).
The blog had a chronic URL duplication problem, with each post existing under 3 different URLs:
#1) www.domain.com/post.html (currently set to noindex for editorial reasons, as it shows all the comments)
#2) www.domain.com/post/index.html (currently indexed, showing only the top comments)
#3) www.domain.com/post/ (exactly the same as #2)
We've chosen URL #2 (/index.html) as the canonical URL and added a rel=canonical tag on URL #3 (/) pointing to URL #2.
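For anyone auditing a setup like this, here is a minimal Python sketch of the mapping described above: it folds the three URL variants into the chosen canonical form (#2). The function name `canonical_url` is made up for illustration, and the sketch assumes every post follows the /post.html, /post/, /post/index.html pattern exactly.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Map /post.html, /post/ and /post/index.html to /post/index.html.

    Hypothetical helper: assumes the three-variant URL scheme from this thread.
    Query strings and fragments are dropped.
    """
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/index.html"):
        pass  # already the canonical form (#2)
    elif path.endswith("/"):
        path += "index.html"  # form #3 -> #2
    elif path.endswith(".html"):
        path = path[: -len(".html")] + "/index.html"  # form #1 -> #2
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

Running a list of indexed URLs through a helper like this makes it easy to see which variant of each post actually shows up in a given index.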
Yesterday we also submitted a Google News sitemap that consistently lists the #2 URLs from the last 48 hours. The sitemap has been properly "digested" by Google and shows that all URLs have been submitted and indexed. However, if we use the site:domain.com command on Google News we see something completely different: Google News has actually indexed only some of the news, and specifically only the #3 type URLs (ending with the trailing slash instead of /index.html). Why? What's wrong?
a) Does the Google News bot have problems indexing URLs ending with /index.html? While figuring out what's wrong we found that http://news.google.it/news/search?aq=f&pz=1&cf=all&ned=us&hl=en&q=inurl%3Aindex.html returns no results... it seems that the Google News index does not include any URLs ending with /index.html at all.
b) Does the Google News bot recognise the rel=canonical tag?
c) Is it just a matter of time before Google News picks up the right URLs (/index.html), and/or should we notify the Google News team of the changes?
d) Any suggestions? Or should we do it the other way around, meaning make URL #3 the canonical one?
While Google News is showing these problems, Google Web Search has actually handled the changes well, so we don't know what to do.
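For reference, a News sitemap restricted to the #2 URLs from the last 48 hours, as described above, could be generated roughly like this. This is only a sketch: the function name, the `(url, title, published)` input shape and the publication name are assumptions for illustration, and the element names follow the Google News sitemap schema as I understand it, so check them against the current documentation before relying on it.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"


def build_news_sitemap(articles, publication, language="en", now=None):
    """Build a News sitemap string from (url, title, published) tuples.

    `published` must be a timezone-aware datetime. Only /index.html URLs
    published within the last 48 hours are kept (assumed site convention).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=48)
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("news", NEWS_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, title, published in articles:
        # Skip stale articles and any non-canonical URL variant.
        if published < cutoff or not url.endswith("/index.html"):
            continue
        node = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(node, f"{{{SITEMAP_NS}}}loc").text = url
        news = ET.SubElement(node, f"{{{NEWS_NS}}}news")
        pub = ET.SubElement(news, f"{{{NEWS_NS}}}publication")
        ET.SubElement(pub, f"{{{NEWS_NS}}}name").text = publication
        ET.SubElement(pub, f"{{{NEWS_NS}}}language").text = language
        date = ET.SubElement(news, f"{{{NEWS_NS}}}publication_date")
        date.text = published.isoformat()
        ET.SubElement(news, f"{{{NEWS_NS}}}title").text = title
    return ET.tostring(urlset, encoding="unicode")
```

Filtering on the URL suffix inside the generator is a deliberate choice here: it keeps the sitemap consistent with the rel=canonical target even if a trailing-slash URL slips into the input list.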
Thanks for your help,
Matteo
-
To follow up on this.
Look what I've found in the Google News Forum:
http://www.google.com/support/forum/p/news/thread?tid=248ef4e6fe372e91&hl=en
The problem is almost the same: Google News is not indexing URLs with a trailing index.html.
The only person who answered was a Top Contributor, who suggested contacting the Google News team directly.
-
Hmmm, that is strange! Check a cached version of one of your URLs to make sure the new version is in the index. If it is, maybe you should switch to option 3.
I am not sure what the implications, if any, would be of leaving it the way you have it.
Since these are two different areas of search, I am not sure that duplicate content issues would apply if you just left it as is.
-
hey Roger,
Look, CNN seems to have exactly the same "problem" as we do.
They have the "/" version of the article indexed in Google News and the index.html version in the regular (non-News) Google index. They did exactly what we did, putting a rel=canonical on the "/" version pointing to the "index.html" one. Despite this, the "/" version is still the only one showing up in Google News.
Here is the screenshot just in case
and here the two versions of the same article:
- http://edition.cnn.com/2011/POLITICS/04/22/obama.campaign/
- http://edition.cnn.com/2011/POLITICS/04/22/obama.campaign/index.html
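To check which rel=canonical a page declares (on either of the two versions above, or on your own pages), something like the following works on the page's HTML source. It is a minimal sketch using only Python's standard `html.parser`; the class and function names are made up for illustration, and it only looks at `<link>` tags, not at canonical hints sent in HTTP headers.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens; compare case-insensitively.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels:
            self.canonical = attr.get("href")


def find_canonical(html: str):
    """Return the canonical URL declared in `html`, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

Feed it the source of the "/" version and of the "index.html" version and compare the results; both should report the same index.html URL if the tags are set up as described.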
-
They seem to meet these requirements. The only one that is a problem is requirement #3, but it clearly states that this requirement is waived for News sitemaps, which Matteo said they submitted.
With that said I do like Matteo's option #1 better than the naming convention they chose to go with.
-
It does sound weird, but I am not sure that search operator works in Google News.
Here is a simple test. Search Google News for "Google"
The second story I see is http://phandroid.com/2011/04/22/will-spotify-be-google-musics-savior/
However a Google News search for "inurl:will-spotify-be-google-musics-savior" returns no results.
Clearly the story is indexed!
-
My hunch, and it's only a hunch, is that it relates to their URL requirements, which say that the URL has to be dedicated to an article. An index.html page is usually not a page that would be dedicated to one individual news story. See http://www.google.com/support/news_pub/bin/answer.py?hl=en&answer=68323 for their URL requirements.
-
Hi Roger, and thx for the very insightful answer!
What about the fact that not a single URL ending with index.html is indexed in Google News?
http://news.google.it/news/search?aq=f&pz=1&cf=all&ned=us&hl=en&q=inurl%3Aindex.html
Compare that with the normal Google index:
http://www.google.it/search?q=inurl%3Aindex.html&hl=en&ned=us&tab=nw
Doesn't that sound weird to you?
matteo
-
I had another thought too. Just because Google WMT says the pages are indexed doesn't mean the new content, including the new canonical tags, has been crawled or added to the index yet.
I recently did a similar project adding canonical tags to an ecommerce site. The new URLs are showing up correctly in the search results only maybe 10% of the time, even for pages I know have been crawled and that I submitted a week ago. The important thing is that more URLs are updated each day.
I don't believe they throw out their index the first time they crawl an established page and find that something has changed. I believe the index changes as they continue to crawl: they compare versions and index data based on aggregates of multiple crawls, especially for existing pages that have been in the index for a while. In other words, if they compare 20 recent crawls and only 1 shows a different version, they may not throw out the old version right away; they may wait until they have crawled the page multiple times and seen the new version in, say, 5 or 10 of the most recent 20 crawls. BTW, I don't have any data to back that up; it's just my personal observation/theory.
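To make that theory concrete, here is a toy Python sketch of the idea. It is purely an illustration of the hunch above, not a description of how Google actually works; the function name, the window of 20 crawls and the threshold of 5 are just the example numbers from the paragraph.

```python
def should_switch(recent_crawls, new_version, window=20, threshold=5):
    """Return True once `new_version` has been seen often enough to replace the old one.

    Toy model only: `recent_crawls` is a list of observed versions, oldest first,
    and only the last `window` crawls are considered.
    """
    recent = recent_crawls[-window:]
    return recent.count(new_version) >= threshold
```

Under this model a single crawl showing the new canonical URL changes nothing, and the switch only happens after several consistent crawls, which would match the slow, gradual updates described above.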
-
If you used the rel=canonical tag properly and only submitted the sitemap yesterday, it's just a waiting game. You will get crawled and indexed properly soon.