robertknight/hypothesis-canonical-links.md

## hypothesis-canonical-links.md

      
Hypothesis and canonical links

What are canonical links?

A <link rel="canonical" href="{url}"> tag in the <head> of a page specifies the canonical/"main" URL for the current page. When the same content can be accessed at different URLs, it specifies the preferred URL for the page. It was originally introduced for use with search engines.
For example, https://www.smore.com/577k7, https://www.smore.com/577k7-whatever and https://www.smore.com/577k7-the-edtech-monthly all return the same page. That page has a <link rel=canonical> tag which specifies https://www.smore.com/577k7-the-edtech-monthly as the preferred URL.
See also https://en.wikipedia.org/wiki/Canonical_link_element.
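As a rough illustration of what reading a canonical link involves, here is a minimal Python sketch using only the standard library. It is not the Hypothesis client's implementation (the client runs in the browser); the names `CanonicalLinkParser` and `find_canonical_url` are made up for this example, and resolving relative hrefs against the page URL is an assumption.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical_href is not None:
            return
        attr_map = dict(attrs)
        # `rel` is a space-separated list of tokens, compared case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonical_href = attr_map["href"]


def find_canonical_url(html: str, page_url: str) -> Optional[str]:
    """Return the absolute canonical URL declared in `html`, or None."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical_href is None:
        return None
    # Assumption: relative hrefs are resolved against the URL the page was loaded from.
    return urljoin(page_url, parser.canonical_href)
```

For the smore.com example above, calling `find_canonical_url(page_html, "https://www.smore.com/577k7")` on a page containing that tag would return the `-the-edtech-monthly` URL.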
How do canonical links on a page affect Hypothesis?

When a page has a <link rel=canonical> tag on it:

The Hypothesis client uses this URL as the main URI for the page, as reported by HTMLIntegration.uri.
This main URI is passed to Hypothesis in the initial /api/search call that fetches annotations for the current page.
The main URI is used as the value of the uri and target.source fields of new annotations created on the page.

For example, when the user visits https://www.smore.com/577k7 and activates the client, the canonical link is the URL used when fetching annotations: GET https://hypothes.is/api/search?limit=50&sort=created&order=asc&_separate_replies=false&group=__world__&uri=https%3A%2F%2Fwww.smore.com%2F577k7-the-edtech-monthly.
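The search request above can be reproduced with a short Python sketch. The parameter names and values are copied from the example request; the helper name `build_search_url` is hypothetical, and this is not how the client itself builds the request.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://hypothes.is/api/search"


def build_search_url(main_uri: str, group: str = "__world__") -> str:
    """Build the initial search URL for a page's main URI."""
    params = {
        "limit": 50,
        "sort": "created",
        "order": "asc",
        "_separate_replies": "false",
        "group": group,
        # The canonical URL, not the address-bar URL, is sent here.
        "uri": main_uri,
    }
    # urlencode percent-encodes ":" and "/" in the uri value.
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


print(build_search_url("https://www.smore.com/577k7-the-edtech-monthly"))
```

Note that the URL the user actually visited (https://www.smore.com/577k7) does not appear anywhere in the request.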
When the user creates an annotation on this page, the payload uses the canonical link as the page URL everywhere except for one entry in document.link:
{
    "created": "2022-07-07T10:08:42.741Z",
    "group": "__world__",
    "text": "Quick test",
    "updated": "2022-07-07T10:08:42.741Z",
    "user": "acct:[email protected]",
    "document": {
        "title": "The EdTech Monthly",
        "link": [
            {
                "href": "https://www.smore.com/577k7"
            },
            {
                "href": "https://www.smore.com/577k7-the-edtech-monthly",
                "rel": "canonical",
                "type": ""
            }
        ],
        ...
    },
    "uri": "https://www.smore.com/577k7-the-edtech-monthly",
    "target": [
        {
            "source": "https://www.smore.com/577k7-the-edtech-monthly",
            "selector": ...
        }
    ]
    ...
}

Note the target.source, uri and document.link properties.
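The way the two URLs end up in the payload can be sketched in Python as follows. The function name `build_annotation_uris` is hypothetical, only the URL-related fields are shown, and falling back to the page URL when no canonical link exists is an assumption that the example above does not cover.

```python
from typing import Any, Dict, Optional


def build_annotation_uris(page_url: str, canonical_url: Optional[str]) -> Dict[str, Any]:
    """Return the URL-related fields of an annotation payload."""
    # Assumption: without a canonical link, the page URL is the main URI.
    main_uri = canonical_url or page_url

    # The address-bar URL survives only as a plain entry in document.link.
    links = [{"href": page_url}]
    if canonical_url:
        links.append({"href": canonical_url, "rel": "canonical", "type": ""})

    return {
        "uri": main_uri,
        "target": [{"source": main_uri}],
        "document": {"link": links},
    }


payload = build_annotation_uris(
    "https://www.smore.com/577k7",
    "https://www.smore.com/577k7-the-edtech-monthly",
)
```

With these inputs, `payload["uri"]` and `payload["target"][0]["source"]` both hold the canonical URL, and `payload["document"]["link"]` matches the two entries in the payload shown above.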
In the backend, after validation and preprocessing, the h.storage.create_annotation function is called with:
{
  "userid": "acct:robert@localhost",
  "target_uri": "https://www.smore.com/577k7-the-edtech-monthly",
  "text": "Quick test",
  "tags": [],
  "groupid": "__world__",
  "references": [],
  "shared": true,
  "target_selectors": [...],
  "document": {
    "document_uri_dicts": [
      {
        "claimant": "https://www.smore.com/577k7-the-edtech-monthly",
        "uri": "https://www.smore.com/577k7",
        "type": "",
        "content_type": ""
      },
      {
        "claimant": "https://www.smore.com/577k7-the-edtech-monthly",
        "uri": "https://www.smore.com/577k7-the-edtech-monthly",
        "type": "rel-canonical",
        "content_type": ""
      },
      {
        "claimant": "https://www.smore.com/577k7-the-edtech-monthly",
        "uri": "https://www.smore.com/577k7-the-edtech-monthly",
        "type": "self-claim",
        "content_type": ""
      }
    ],
    "document_meta_dicts": [
      {
        "type": "title",
        "value": [
          "The EdTech Monthly"
        ],
        "claimant": "https://www.smore.com/577k7-the-edtech-monthly"
      },
      ...
    ]
  }
}
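The relationship between the client's document.link entries and the document_uri_dicts above can be sketched like this. This is not h's actual preprocessing code: the function name is hypothetical, and the rules (prefixing rel with "rel-", copying the link's type into content_type, and appending a self-claim) are inferred from this single example only.

```python
from typing import Dict, List


def document_uri_dicts_from_links(target_uri: str, links: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Map document.link entries to document_uri-style dicts."""
    uri_dicts = []
    for link in links:
        rel = link.get("rel")
        uri_dicts.append(
            {
                # The claimant is the annotation's main URI, i.e. the canonical URL.
                "claimant": target_uri,
                "uri": link["href"],
                # Inferred rule: a link with rel="canonical" becomes type "rel-canonical".
                "type": f"rel-{rel}" if rel else "",
                "content_type": link.get("type", ""),
            }
        )
    # The main URI also claims itself.
    uri_dicts.append(
        {"claimant": target_uri, "uri": target_uri, "type": "self-claim", "content_type": ""}
    )
    return uri_dicts
```

Because the claimant is always the main URI, the address-bar URL (https://www.smore.com/577k7) appears only in the uri field of the untyped entry.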

This information is ultimately persisted in the document_uri table in the DB:
postgres=# select * from document_uri where uri like '%smore%';
-[ RECORD 1 ]-------+-----------------------------------------------
created             | 2022-07-07 10:26:41.933846
updated             | 2022-07-07 10:27:45.288757
id                  | 263
claimant            | https://www.smore.com/577k7-the-edtech-monthly
claimant_normalized | httpx://www.smore.com/577k7-the-edtech-monthly
uri                 | https://www.smore.com/577k7
uri_normalized      | httpx://www.smore.com/577k7
type                |
content_type        |
document_id         | 144
-[ RECORD 2 ]-------+-----------------------------------------------
created             | 2022-07-07 10:26:41.933846
updated             | 2022-07-07 10:27:45.288757
id                  | 264
claimant            | https://www.smore.com/577k7-the-edtech-monthly
claimant_normalized | httpx://www.smore.com/577k7-the-edtech-monthly
uri                 | https://www.smore.com/577k7-the-edtech-monthly
uri_normalized      | httpx://www.smore.com/577k7-the-edtech-monthly
type                | rel-canonical
content_type        |
document_id         | 144
-[ RECORD 3 ]-------+-----------------------------------------------
created             | 2022-07-07 10:26:41.933846
updated             | 2022-07-07 10:27:45.288757
id                  | 262
claimant            | https://www.smore.com/577k7-the-edtech-monthly
claimant_normalized | httpx://www.smore.com/577k7-the-edtech-monthly
uri                 | https://www.smore.com/577k7-the-edtech-monthly
uri_normalized      | httpx://www.smore.com/577k7-the-edtech-monthly
type                | self-claim
content_type        |
document_id         | 144

How does Hypothesis use canonical links on its own sites?

Canonical links are currently the way for a page to direct Hypothesis to fetch and create annotations against a URL other than the one displayed in the address bar.
Some internal uses of this:

To indicate the original page on a Via-proxied page.
To indicate the original YouTube video on Docdrop.

Some advice that we give to publishers that involves canonical links:

https://web.hypothes.is/help/how-to-establish-or-avoid-document-equivalence-in-the-hypothesis-system/
https://web.hypothes.is/help/how-hypothesis-interacts-with-document-metadata/

Publishers may also have other reasons for wanting to change the URL associated with annotations.
What problems do canonical links have?

Since canonical links are an official standard and Google makes use of them, they are widely used on real web pages. We have seen various misuses of them:

Some pages simply have wrong canonical URLs (e.g. see https://developers.google.com/search/blog/2013/04/5-common-mistakes-with-relcanonical).
Some major sites (YouTube, Reddit, Pinterest), plus a host of minor ones, fail to update their canonical URLs after a client-side navigation.

If a page has a wrong canonical URL or fails to update it after a client-side navigation, the client can associate annotations with the wrong URL. See e.g. hypothesis/product-backlog#1288.
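The client-side navigation failure can be illustrated with a toy model in Python. The `SinglePageApp` class and the example.com URLs are invented for this sketch; it only models the rule described earlier, that the canonical link wins over the address-bar URL whenever one is present.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SinglePageApp:
    """Toy model of a page that navigates without a full reload."""

    address_bar_url: str
    canonical_url: Optional[str]

    def navigate_client_side(self, new_url: str, update_canonical: bool) -> None:
        self.address_bar_url = new_url
        # A well-behaved site rewrites its <link rel=canonical> here.
        if update_canonical:
            self.canonical_url = new_url

    def main_uri(self) -> str:
        # The canonical link takes precedence over the address-bar URL.
        return self.canonical_url or self.address_bar_url


app = SinglePageApp("https://example.com/videos/1", "https://example.com/videos/1")
app.navigate_client_side("https://example.com/videos/2", update_canonical=False)

# The user is looking at video 2, but annotations would be attached to video 1.
print(app.address_bar_url, app.main_uri())
```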
Do we have workarounds for the above problems?

No. A publisher can fix problems on sites they maintain after we point them out. Such fixes will only affect new annotations. A user trying to annotate a site they do not maintain doesn't have a way to fix this.
Is there a way to repair incorrect URL associations that have been created as a result of incorrect canonical links?

No.
What can we learn about canonical link usage from the production Hypothesis data?

Information about canonical links is captured by the client and relayed to h via the document.link field of annotation create/update operations.
The extracted information is stored in the document_uri table in h. That table has a claimant field, which indicates which page the information came from (i.e. where the <link rel=canonical> was seen), and a uri field, which is the URL of the canonical link. There is also a type field, which indicates the type of <link>, <meta> or other source that the information came from. The value of type for canonical links is rel-canonical.
Here are some statistics derived from the document_uri table in the production h DB:
Rows in document_uri table: ~3.4M
Rows with type=rel-canonical: ~1.21M
Rows where normalized canonical URL is same as page URL: ~1.19M
Rows where normalized canonical URL is different from page URL: ~20.4K
CSV dump of entries from prod document_uri table with type=rel-canonical and claimant_normalized != uri_normalized: https://hypothes-is.slack.com/archives/C2BLQDKHA/p1657176769197879?thread_ts=1657120584.483809&cid=C2BLQDKHA
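The breakdown above amounts to comparing claimant_normalized with uri_normalized on rows whose type is rel-canonical. A minimal Python sketch of that classification, run over the three example rows shown earlier rather than production data, might look like this (the function name and the result keys are hypothetical):

```python
from typing import Dict, Iterable


def summarize_canonical_rows(rows: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count document_uri rows by how the canonical URL relates to the claimant."""
    rows = list(rows)
    canonical = [r for r in rows if r["type"] == "rel-canonical"]
    same = sum(1 for r in canonical if r["claimant_normalized"] == r["uri_normalized"])
    return {
        "total": len(rows),
        "rel_canonical": len(canonical),
        "same_as_claimant": same,
        "different_from_claimant": len(canonical) - same,
    }


CANONICAL = "httpx://www.smore.com/577k7-the-edtech-monthly"
example_rows = [
    {"claimant_normalized": CANONICAL, "uri_normalized": "httpx://www.smore.com/577k7", "type": ""},
    {"claimant_normalized": CANONICAL, "uri_normalized": CANONICAL, "type": "rel-canonical"},
    {"claimant_normalized": CANONICAL, "uri_normalized": CANONICAL, "type": "self-claim"},
]

print(summarize_canonical_rows(example_rows))
```

In the smore.com example the rel-canonical row counts as "same", because the claimant recorded for the annotation is itself the canonical URL.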