A <link rel="canonical" href="{url}"> tag in the <head> of a page specifies the canonical ("main") URL for the current page. When the same content can be accessed at different URLs, it specifies the preferred URL for the page. It was originally introduced for use with search engines.
For example, https://www.smore.com/577k7, https://www.smore.com/577k7-whatever and https://www.smore.com/577k7-the-edtech-monthly all return the same page. That page has a <link rel=canonical> tag which specifies https://www.smore.com/577k7-the-edtech-monthly as the preferred URL.
See also https://en.wikipedia.org/wiki/Canonical_link_element.
When a page has a <link rel=canonical> tag on it:
- The Hypothesis client will use this URL as the main URI for the page, as reported by HTMLIntegration.uri
- This main URI is passed to Hypothesis via the initial /api/search call to fetch annotations for the current page
- The main URI is used as the value of the uri and target.source fields of new annotations created on the page
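The selection rule in the first bullet can be sketched roughly as below. This is an illustration only, not the client's actual code (the client is not written in Python); the names CanonicalLinkParser and main_uri are hypothetical, and the sketch assumes that the first canonical link wins and that relative hrefs are resolved against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Collects the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical_href is not None:
            return
        attr_map = dict(attrs)
        # rel is a space-separated, case-insensitive list of tokens.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonical_href = attr_map["href"]


def main_uri(page_url, html):
    """Return the canonical URL if the page declares one, else the page URL."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical_href is None:
        return page_url
    # Assumption: relative hrefs are resolved against the loaded page URL.
    return urljoin(page_url, parser.canonical_href)
```

With a page loaded from https://www.smore.com/577k7 whose head declares the canonical link described above, main_uri returns the -the-edtech-monthly URL; without a canonical link it falls back to the page URL.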
For example, when the user visits https://www.smore.com/577k7 and activates the client, it is the canonical link that is used when fetching annotations:
GET https://hypothes.is/api/search?limit=50&sort=created&order=asc&_separate_replies=false&group=__world__&uri=https%3A%2F%2Fwww.smore.com%2F577k7-the-edtech-monthly
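To make the shape of that request explicit, here is a minimal sketch that rebuilds the same URL from the main URI. The function name search_url and its api_root parameter are hypothetical; the query parameters are simply those visible in the request above, and the real client may send others.

```python
from urllib.parse import urlencode


def search_url(main_uri, api_root="https://hypothes.is/api"):
    """Build the initial search request URL for a page's main URI."""
    params = {
        "limit": 50,
        "sort": "created",
        "order": "asc",
        "_separate_replies": "false",
        "group": "__world__",
        # The canonical URL, not the address-bar URL, is what gets sent.
        "uri": main_uri,
    }
    return f"{api_root}/search?{urlencode(params)}"
```

Note that the address-bar URL (https://www.smore.com/577k7) does not appear anywhere in the request; only the canonical URL is percent-encoded into the uri parameter.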
When the user creates an annotation on this page, the payload uses the canonical link as the page URL everywhere except for one entry in document.link:
{
  "created": "2022-07-07T10:08:42.741Z",
  "group": "__world__",
  "text": "Quick test",
  "updated": "2022-07-07T10:08:42.741Z",
  "user": "acct:[email protected]",
  "document": {
    "title": "The EdTech Monthly",
    "link": [
      {
        "href": "https://www.smore.com/577k7"
      },
      {
        "href": "https://www.smore.com/577k7-the-edtech-monthly",
        "rel": "canonical",
        "type": ""
      }
    ],
    ...
  },
  "uri": "https://www.smore.com/577k7-the-edtech-monthly",
  "target": [
    {
      "source": "https://www.smore.com/577k7-the-edtech-monthly",
      "selector": ...
    }
  ]
  ...
}
Note the target.source, uri and document.link properties.
In the backend, after validation and preprocessing, the h.storage.create_annotation function gets called with:
{
  "userid": "acct:robert@localhost",
  "target_uri": "https://www.smore.com/577k7-the-edtech-monthly",
  "text": "Quick test",
  "tags": [],
  "groupid": "__world__",
  "references": [],
  "shared": true,
  "target_selectors": [...],
  "document": {
    "document_uri_dicts": [
      {
        "claimant": "https://www.smore.com/577k7-the-edtech-monthly",
        "uri": "https://www.smore.com/577k7",
        "type": "",
        "content_type": ""
      },
      {
        "claimant": "https://www.smore.com/577k7-the-edtech-monthly",
        "uri": "https://www.smore.com/577k7-the-edtech-monthly",
        "type": "rel-canonical",
        "content_type": ""
      },
      {
        "claimant": "https://www.smore.com/577k7-the-edtech-monthly",
        "uri": "https://www.smore.com/577k7-the-edtech-monthly",
        "type": "self-claim",
        "content_type": ""
      }
    ],
    "document_meta_dicts": [
      {
        "type": "title",
        "value": [
          "The EdTech Monthly"
        ],
        "claimant": "https://www.smore.com/577k7-the-edtech-monthly"
      },
      ...
    ]
  }
}
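The relationship between the client's document.link entries and the document_uri_dicts above can be sketched as follows. The mapping is inferred only from the example data on this page and is not taken from h's source; the function name document_uri_dicts_from_links is hypothetical, and real preprocessing may handle further cases not shown here.

```python
def document_uri_dicts_from_links(links, claimant):
    """Turn document.link entries into document_uri-style dicts.

    Assumptions drawn from the example: the claimant is the annotation's
    uri, a link with a rel gets the type "rel-<rel>", a link without one
    gets an empty type, and a "self-claim" entry is appended for the
    claimant itself.
    """
    rows = []
    for link in links:
        rel = link.get("rel")
        rows.append(
            {
                "claimant": claimant,
                "uri": link["href"],
                "type": f"rel-{rel}" if rel else "",
                "content_type": link.get("type", ""),
            }
        )
    rows.append(
        {
            "claimant": claimant,
            "uri": claimant,
            "type": "self-claim",
            "content_type": "",
        }
    )
    return rows
```

Because the claimant is the canonical URL itself, the rel-canonical entry in this example has identical claimant and uri values; the address-bar URL survives only in the entry with the empty type.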
This information is ultimately persisted in the document_uri table in the DB:
postgres=# select * from document_uri where uri like '%smore%';
-[ RECORD 1 ]-------+-----------------------------------------------
created | 2022-07-07 10:26:41.933846
updated | 2022-07-07 10:27:45.288757
id | 263
claimant | https://www.smore.com/577k7-the-edtech-monthly
claimant_normalized | httpx://www.smore.com/577k7-the-edtech-monthly
uri | https://www.smore.com/577k7
uri_normalized | httpx://www.smore.com/577k7
type |
content_type |
document_id | 144
-[ RECORD 2 ]-------+-----------------------------------------------
created | 2022-07-07 10:26:41.933846
updated | 2022-07-07 10:27:45.288757
id | 264
claimant | https://www.smore.com/577k7-the-edtech-monthly
claimant_normalized | httpx://www.smore.com/577k7-the-edtech-monthly
uri | https://www.smore.com/577k7-the-edtech-monthly
uri_normalized | httpx://www.smore.com/577k7-the-edtech-monthly
type | rel-canonical
content_type |
document_id | 144
-[ RECORD 3 ]-------+-----------------------------------------------
created | 2022-07-07 10:26:41.933846
updated | 2022-07-07 10:27:45.288757
id | 262
claimant | https://www.smore.com/577k7-the-edtech-monthly
claimant_normalized | httpx://www.smore.com/577k7-the-edtech-monthly
uri | https://www.smore.com/577k7-the-edtech-monthly
uri_normalized | httpx://www.smore.com/577k7-the-edtech-monthly
type | self-claim
content_type |
document_id | 144
Canonical links are the current way that a page can direct Hypothesis to use a different URL for fetching/creating annotations than the one currently displayed in the address bar.
Some internal uses of this:
- To indicate the original page on a Via-proxied page.
- To indicate the original YouTube video on Docdrop.
Some advice that we give to publishers that involves canonical links:
- https://web.hypothes.is/help/how-to-establish-or-avoid-document-equivalence-in-the-hypothesis-system/
- https://web.hypothes.is/help/how-hypothesis-interacts-with-document-metadata/
Publishers may also have other reasons for wanting to change the URL associated with annotations.
Since canonical links are an official standard and Google makes use of them, they are widely used on real web pages. We have seen various misuses of them:
- Some pages just have wrong canonical URLs (see e.g. https://developers.google.com/search/blog/2013/04/5-common-mistakes-with-relcanonical)
- Some major sites (YouTube, Reddit, Pinterest) plus a host of minor ones fail to update their canonical URLs after a client-side navigation.
If a page has a wrong canonical URL or fails to update it after a client-side navigation, the client can associate annotations with the wrong URL. See e.g. hypothesis/product-backlog#1288.
No. A publisher can fix problems on sites they maintain after we point them out. Such fixes will only affect new annotations. A user trying to annotate a site they do not maintain doesn't have a way to fix this.
Is there a way to repair incorrect URL associations that have been created as a result of incorrect canonical links?
No.
Information about canonical links is captured by the client and relayed to h via the document.link field of annotation create/update operations.
The extracted information is stored in the document_uri table in h. That table has a claimant field which indicates which page the information came from (i.e. where the <link rel=canonical> was seen) and a uri field which holds the URL of the canonical link. There is a type field which indicates the type of <link>, <meta> or other source that the information came from. The value of type for canonical links is rel-canonical.
Here are some statistics derived from the document_uri table in the production h DB:
- Rows in document_uri table: ~3.4M
- Rows with type=rel-canonical: ~1.21M
- Rows where normalized canonical URL is same as page URL: ~1.19M
- Rows where normalized canonical URL is different than page URL: ~20.4K
CSV dump of entries from prod document_uri table with type=rel-canonical and claimant_normalized != uri_normalized: https://hypothes-is.slack.com/archives/C2BLQDKHA/p1657176769197879?thread_ts=1657120584.483809&cid=C2BLQDKHA
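As a rough illustration of how the same/different split could be counted, the sketch below runs the filter against an in-memory SQLite table. It is not the query that produced the figures above: the production DB is Postgres, the table here keeps only three of the real columns, the function name canonical_counts is hypothetical, and the fourth row is invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document_uri ("
    "claimant_normalized TEXT, uri_normalized TEXT, type TEXT)"
)
conn.executemany(
    "INSERT INTO document_uri VALUES (?, ?, ?)",
    [
        # The three rows from the example annotation.
        ("httpx://www.smore.com/577k7-the-edtech-monthly",
         "httpx://www.smore.com/577k7", ""),
        ("httpx://www.smore.com/577k7-the-edtech-monthly",
         "httpx://www.smore.com/577k7-the-edtech-monthly", "rel-canonical"),
        ("httpx://www.smore.com/577k7-the-edtech-monthly",
         "httpx://www.smore.com/577k7-the-edtech-monthly", "self-claim"),
        # Hypothetical row where claimant and canonical URL differ.
        ("httpx://example.com/page?ref=1",
         "httpx://example.com/page", "rel-canonical"),
    ],
)


def canonical_counts(connection):
    """Count rel-canonical rows, split by claimant/uri equality."""
    total, same, different = connection.execute(
        """
        SELECT COUNT(*),
               SUM(CASE WHEN claimant_normalized = uri_normalized
                        THEN 1 ELSE 0 END),
               SUM(CASE WHEN claimant_normalized != uri_normalized
                        THEN 1 ELSE 0 END)
        FROM document_uri
        WHERE type = 'rel-canonical'
        """
    ).fetchone()
    return {"rel_canonical": total, "same": same, "different": different}
```

The CASE expressions are used instead of summing a boolean directly so that the same SQL text should also be acceptable to Postgres.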
In the H client, there could be two new fields which would display the page URI and the canonical URL to be bookmarked. Users would be able to edit them, although this is open to deliberate discrepancies if a user chooses to enter misleading values.
What about using MutationObserver to detect location changes? Upon a location change, the H client could force-reload the page so that a server-side rendered (SSR) version of the page is returned with the correct canonical URL.