URL Canonicalization
A: Sorry that it’s a strange word; that’s what we call it around
Google. Canonicalization is the process of picking the best url when
there are several choices, and it usually refers to home pages. For
example, most people would consider these the same urls:
- www.example.com
- example.com/
- www.example.com/index.html
- example.com/home.asp
But technically all of these urls are different. A web server could return completely different content for all the urls above. When Google “canonicalizes” a url, we try to pick the url that seems like the best representative from that set.
Q: So how do I make sure that Google picks the url that I want?
A: One thing that helps is to pick the url that you want and use that
url consistently across your entire site. For example, don’t make half
of your links go to http://example.com/ and the other half go to
http://www.example.com/ . Instead, pick the url you prefer and always
use that format for your internal links.
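One rough way to check for mixed internal links is to count how often each hostname variant appears. This is only a sketch: the helper name `host_usage`, the sample link list, and the choice of example.com are assumptions for illustration, and you would need to collect the links from your own pages first.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_usage(links, bare_domain="example.com"):
    """Count how often the www and non-www variants of bare_domain appear."""
    variants = {bare_domain, "www." + bare_domain}
    counts = Counter()
    for link in links:
        host = (urlsplit(link).hostname or "").lower()
        if host in variants:
            counts[host] += 1
    return counts


# Hypothetical sample of links found on a site.
links = [
    "http://example.com/",
    "http://www.example.com/about",
    "http://www.example.com/contact",
    "http://other.org/",
]
usage = host_usage(links)
for host, count in usage.items():
    print(host, count)
```

If both variants show up with nonzero counts, the internal links are not yet consistent.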
Q: Is there anything else I can do?
A: Yes. Suppose you want your default url to be http://www.example.com/ .
You can configure your webserver so that if someone requests
http://example.com/, it does a 301 (permanent) redirect to
http://www.example.com/ . That helps Google know which url you prefer
to be canonical. Adding a 301 redirect can be an especially good idea
if your site changes often (e.g. dynamic content, a blog, etc.).
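Most people set this up in their webserver configuration, but the idea can be sketched in a few lines of Python as WSGI middleware. The names `redirect_to_www` and `hello_app` are made up for this example, and it assumes the Host header arrives without a port number.

```python
def redirect_to_www(app, bare_host="example.com"):
    """Wrap a WSGI app so requests for bare_host get a 301 to www.bare_host."""

    def middleware(environ, start_response):
        # Assumption: the Host header carries no port (e.g. not "example.com:8080").
        host = environ.get("HTTP_HOST", "").lower()
        if host == bare_host:
            scheme = environ.get("wsgi.url_scheme", "http")
            path = environ.get("PATH_INFO", "") or "/"
            query = environ.get("QUERY_STRING", "")
            location = scheme + "://www." + bare_host + path
            if query:
                location += "?" + query
            headers = [("Location", location), ("Content-Length", "0")]
            start_response("301 Moved Permanently", headers)
            return [b""]
        # Any other host is passed through to the wrapped app unchanged.
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """Hypothetical stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


wrapped = redirect_to_www(hello_app)
```

The path and query string are carried over, so a request for a deep page on the non-www host lands on the same page on the www host.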
Q: If I want to get rid of domain.com but keep www.domain.com, should I use the url removal tool to remove domain.com?
A: No, definitely don’t do this. If you remove one of the www vs.
non-www hostnames, it can end up removing your whole domain for six
months. Definitely don’t do this. If you did use the url removal tool
to remove your entire domain when you actually only wanted to remove
the www or non-www version of your domain, do a reinclusion request and mention that you removed your entire domain by accident using the url removal tool and that you’d like it reincluded.
Q: I noticed that you don’t do a 301 redirect on your site from the non-www to the www version, Matt. Why not? Are you stupid in the head?
A: Actually, it’s on purpose. I noticed that several months ago but
decided not to change it on my end or ask anyone at Google to fix it. I
may add a 301 eventually, but for now it’s a helpful test case.
Q: So when you say www vs. non-www, you’re talking about a type of canonicalization. Are there other ways that urls get canonicalized?
A: Yes, there can be a lot, but most people never notice (or need to
notice) them. Search engines can do things like keeping or removing
trailing slashes, trying to convert urls with upper case to lower case,
or removing session IDs from bulletin board or other software (many
bulletin board software packages will work fine if you omit the session
ID).
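To make those normalizations concrete, here is a small sketch. It is not how Google does it; the function name `normalize` and the list of session parameter names are assumptions, and lowercasing the path is only safe when the server treats paths case-insensitively.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names; real software varies.
SESSION_PARAMS = {"sid", "phpsessid", "sessionid"}


def normalize(url):
    """Lowercase the url, drop the trailing slash, and strip session IDs."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Remove trailing slashes, but keep a bare "/" for the home page.
    path = parts.path.lower().rstrip("/") or "/"
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(kept), ""))


print(normalize("http://Example.com/Forum/?sid=abc123&t=42"))
# prints http://example.com/forum?t=42
```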
Q: Let’s talk about the inurl: operator. Why does everyone think that if inurl:mydomain.com shows results that aren’t from mydomain.com, it must be hijacked?
A: Many months ago, if you saw
someresult.com/search2.php?url=mydomain.com, that would sometimes have
content from mydomain. That could happen when the someresult.com url
was a 302 redirect to mydomain.com and we decided to show a result from
someresult.com. Since then, we’ve changed our heuristics to make
showing the source url for 302 redirects much more rare. We are moving
to a framework for handling redirects in which we will almost always
show the destination url. Yahoo handles 302 redirects by usually
showing the destination url, and we are in the middle of transitioning
to a similar set of heuristics. Note that Yahoo reserves the right to
have exceptions on redirect handling, and Google does too. Based on our
analysis, we will show the source url for a 302 redirect less than half
a percent of the time (basically, when we have strong reason to think
the source url is correct).
Q: Okay, how about supplemental results. Do supplemental results cause a penalty in Google?
A: Nope.
A: I wouldn’t spend much effort on them. If the pages have moved, I would make sure that there’s a 301 redirect to the new location of pages. If the pages are truly gone, I’d make sure that you serve a 404 on those pages. After that, I wouldn’t put any more effort in. When Google eventually recrawls those pages, it will pick up the changes, but because it can take longer for us to crawl supplemental results, you might not see that update for a while.
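The 301-or-404 rule can be sketched as a small lookup. The paths in `MOVED` and `LIVE` and the function name `respond` are hypothetical; a real site would get these from its own records of what moved where.

```python
# Hypothetical mapping of old paths to their new locations.
MOVED = {"/old-page.html": "/new-page.html"}
# Hypothetical set of paths that still exist.
LIVE = {"/", "/new-page.html"}


def respond(path):
    """Return (status, redirect_target) for a requested path."""
    if path in MOVED:
        return 301, MOVED[path]  # page moved: permanent redirect
    if path in LIVE:
        return 200, None  # page still exists
    return 404, None  # page is truly gone


print(respond("/old-page.html"))
print(respond("/deleted.html"))
```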