What tools do you use to find scraped content?
-
This hasn’t been an issue for our company so far, but I like to be proactive. What tools do you use to find sites that may have scraped your content?
Looking forward to your suggestions.
Vic
-
Oh, this belongs to a different thread: http://moz.com/community/q/chinese-site-ranking-for-our-brand-name-possible-hack
-
Is this part of the original conversation, or something else? Which sites are these?
-
I'm not sure we have been scraped as such though, because the site in question has different content.
It looks as though the offending site has hacked another site, which now redirects to the offending site, and the hacked site is ranking for our brand name. Our homepage has lost all the rankings it had (our category and product pages seem fine) and has essentially disappeared.
Can anyone else shed any light?
-
Siteliner (Copyscape's big brother) is really great, and it's what we use first. I also have a bookmarklet for it to make it faster and easier to use.
We also use Linda's method of searching for a bit of content in quotes. It's the easiest way to show an ecommerce client how much work they're going to need: paste three product descriptions into Google, watch the magic, and explain that the same thing would happen across all 15,000 products.
-
I spot check on a regular basis by taking a unique chunk out of a post, putting it in quotes, and doing a Google search on it. It's not comprehensive, but it is free. [And the main problems we have had with scrapers have been with sites that have taken huge portions of our content, not just an article or two, and a spot check roots those out.]
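For anyone who wants to speed up that spot check, here is a minimal sketch of how the quoted search could be built. The function names, the 12-word chunk length, and the `example.com` domain are illustrative assumptions, not part of any tool mentioned in this thread. It only builds the search URL; you still open it in a browser and review the results by hand.

```python
from typing import Optional
from urllib.parse import quote_plus


def pick_chunk(text: str, start: int = 0, length: int = 12) -> str:
    """Return `length` consecutive words from `text`, starting at word index `start`."""
    words = text.split()
    return " ".join(words[start:start + length])


def exact_match_search_url(chunk: str, own_domain: Optional[str] = None) -> str:
    """Build a Google search URL for `chunk` as an exact (quoted) phrase.

    If `own_domain` is given, a -site: operator is appended so that
    your own pages are left out of the results.
    """
    query = f'"{chunk}"'
    if own_domain:
        query += f" -site:{own_domain}"
    return "https://www.google.com/search?q=" + quote_plus(query)


if __name__ == "__main__":
    # Hypothetical post text; replace with a distinctive passage of your own.
    post = "Our hand-stitched leather satchels are cut from a single hide and finished with brass rivets in our Leeds workshop."
    print(exact_match_search_url(pick_chunk(post, start=1), own_domain="example.com"))
```

Picking the chunk from the middle of a post rather than the opening sentence tends to work better, since opening lines are more likely to be quoted legitimately.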
-
Thanks, Chris & Jonathan. I will look into Copyscape. Good stuff!
-
Yep, Copyscape is what I use. I use a WordPress plugin that uses the Copyscape API and just check my main content every month or so with a single click.
-
Copyscape works well for us. You can scan a couple of pages for free, and then it's $0.05/page after that.
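To put that pricing in perspective for a large site, here is a rough back-of-the-envelope sketch. It assumes the $0.05/page figure quoted above still applies and treats "a couple of pages" as exactly two free pages; both are assumptions, so check current pricing before budgeting. The function name is made up for illustration.

```python
def scan_cost_usd(pages: int, free_pages: int = 2, price_cents: int = 5) -> float:
    """Estimate the cost in dollars of scanning `pages` pages.

    `free_pages` and `price_cents` are assumptions taken from the
    figures quoted in this thread, not from an official price list.
    Integer cents are used to avoid floating-point rounding drift.
    """
    billable = max(pages - free_pages, 0)
    return billable * price_cents / 100


if __name__ == "__main__":
    # The 15,000-product ecommerce catalog mentioned earlier in the thread.
    print(scan_cost_usd(15_000))
```

At that rate, a one-off scan of a 15,000-page catalog would run to roughly $750, which is why a free spot check is a sensible first pass.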