@hollance
Last active September 6, 2024 10:27
Alignment heads for Whisper word-level timestamps with Hugging Face Transformers

To let the Hugging Face version of Whisper predict word-level timestamps, the GenerationConfig object needs a new property, alignment_heads. This is a list of [layer, head] pairs selecting the cross-attention heads that correlate strongly with word-level timing.

If your Whisper checkpoint does not have the alignment_heads property yet, it can be added in two possible ways.

Method 1. Change the model.generation_config property:

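from transformers import WhisperForConditionalGeneration

# load the model
model = WhisperForConditionalGeneration.from_pretrained("your_checkpoint")

# set the new property (these values are for whisper-tiny; see the list below for other sizes)
model.generation_config.alignment_heads = [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]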

Method 2. Add a new line to the generation_config.json file (leave off the trailing comma if it becomes the last entry in the file):

"alignment_heads": [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]],
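If you would rather script this edit than change the file by hand, here is a minimal sketch using only the Python standard library. The helper name add_alignment_heads is made up for illustration, and it assumes generation_config.json already exists locally as a plain JSON object.

```python
import json
from pathlib import Path


def add_alignment_heads(config_path, heads):
    """Write `alignment_heads` into an existing generation_config.json.

    Hypothetical helper, not part of Transformers. All other keys in the
    file are kept as they are.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["alignment_heads"] = heads
    # json.dumps takes care of commas, so there is no trailing-comma problem.
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

A call would look like add_alignment_heads("generation_config.json", [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]), with the values swapped for the ones matching your model size.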

After you're done, use push_to_hub to make these changes permanent:

# `token` replaces the deprecated `use_auth_token` argument in recent Transformers releases
model.push_to_hub("your_pretrained_checkpoint", token="your_token_if_not_logged_in", create_pr=True)

The correct values for alignment_heads depend on the size of the model. Here are the appropriate values for the different Whisper model sizes. These are taken from the OpenAI checkpoints. If you fine-tuned your own checkpoint, you may need to inspect the cross-attention weights to find the appropriate layers and attention heads.

whisper-tiny: [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]

whisper-tiny.en: [[1, 0], [2, 0], [2, 5], [3, 0], [3, 1], [3, 2], [3, 3], [3, 4]]

whisper-base: [[3, 1], [4, 2], [4, 3], [4, 7], [5, 1], [5, 2], [5, 4], [5, 6]]

whisper-base.en: [[3, 3], [4, 7], [5, 1], [5, 5], [5, 7]]

whisper-small: [[5, 3], [5, 9], [8, 0], [8, 4], [8, 7], [8, 8], [9, 0], [9, 7], [9, 9], [10, 5]]

whisper-small.en: [[6, 6], [7, 0], [7, 3], [7, 8], [8, 2], [8, 5], [8, 7], [9, 0], [9, 4], [9, 8], [9, 10], [10, 0], [10, 1], [10, 2], [10, 3], [10, 6], [10, 11], [11, 2], [11, 4]]

whisper-medium: [[13, 15], [15, 4], [15, 15], [16, 1], [20, 0], [23, 4]]

whisper-medium.en: [[11, 4], [14, 1], [14, 12], [14, 14], [15, 4], [16, 0], [16, 4], [16, 9], [17, 12], [17, 14], [18, 7], [18, 10], [18, 15], [20, 0], [20, 3], [20, 9], [20, 14], [21, 12]]

whisper-large-v1: [[9, 19], [11, 2], [11, 4], [11, 17], [22, 7], [22, 11], [22, 17], [23, 2], [23, 15]]

whisper-large-v2: [[10, 12], [13, 17], [16, 11], [16, 12], [16, 13], [17, 15], [17, 16], [18, 4], [18, 11], [18, 19], [19, 11], [21, 2], [21, 3], [22, 3], [22, 9], [22, 12], [23, 5], [23, 7], [23, 13], [25, 5], [26, 1], [26, 12], [27, 15]]

whisper-large: same as large-v2
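Since it is easy to paste the list for the wrong model size, a quick range check can catch obvious mismatches. The sketch below is only an illustration: the helper name is hypothetical, and the decoder shapes in the table are my assumption from the published Whisper architectures rather than something stated in this gist.

```python
# (decoder layers, attention heads per layer) for each model size.
# Assumed from the published Whisper architectures, not taken from this gist.
DECODER_SHAPE = {
    "tiny": (4, 6),
    "base": (6, 8),
    "small": (12, 12),
    "medium": (24, 16),
    "large": (32, 20),
}


def check_alignment_heads(size, heads):
    """Return the [layer, head] pairs that fall outside the decoder's shape."""
    num_layers, num_heads = DECODER_SHAPE[size]
    return [
        [layer, head]
        for layer, head in heads
        if not (0 <= layer < num_layers and 0 <= head < num_heads)
    ]


tiny_heads = [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]
print(check_alignment_heads("tiny", tiny_heads))  # [] because every pair is in range
print(check_alignment_heads("tiny", [[4, 0]]))    # [[4, 0]] because tiny has no layer 4
```

An empty result only shows that the indices exist in the model; it does not show that those heads actually track word timing.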

@hollance
Author

hollance commented Jul 9, 2023

Try them both, see which one works best. ;-) @xenova

@Ar770

Ar770 commented Jul 11, 2023

@hollance thank you for the great work!
Do you know what the right alignment_heads are for a PEFT fine-tuned model (based on large-v2)?
How can I check it?

@hollance
Author

@Ar770 I haven't done it myself but you can look at the (average) cross-attention weights for a test set, and then use the attention heads that give the nicest looking cross-attentions. In other words, when plotted the cross-attention weights should form a diagonal.
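One rough way to put a number on "forms a diagonal" is to check whether each token attends to a later point in the audio than the token before it. The sketch below is a made-up heuristic for illustration only, not OpenAI's method and not something from Transformers. It assumes you have already extracted, for one head, a matrix with one row per decoded token and one column per audio frame.

```python
def centroid(row):
    """Attention-weighted mean frame index of one token's attention row."""
    total = sum(row)
    if total == 0:
        return 0.0
    return sum(index * weight for index, weight in enumerate(row)) / total


def monotonicity_score(attn):
    """Fraction of consecutive tokens whose attention centroid moves forward.

    `attn` is a list of rows (one per token), each a list of weights over
    audio frames. A clean diagonal scores close to 1.0.
    """
    centers = [centroid(row) for row in attn]
    if len(centers) < 2:
        return 0.0
    forward_steps = sum(1 for a, b in zip(centers, centers[1:]) if b > a)
    return forward_steps / (len(centers) - 1)


# Toy matrices, invented for the example.
diagonal = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0],
    [0.0, 0.1, 0.8, 0.1],
    [0.0, 0.0, 0.1, 0.9],
]
flat = [[0.25] * 4 for _ in range(4)]

print(monotonicity_score(diagonal))  # 1.0
print(monotonicity_score(flat))      # 0.0
```

Averaging such a score per head over a test set would give a ranking to start from, though plotting the matrices is still the more reliable check.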

@Ar770

Ar770 commented Jul 12, 2023

@hollance Can you refer to something similar? I have no clue how it is done.

@hollance
Author

@Ar770 I haven't done this myself. Perhaps the OpenAI folks have something they can share, as this is using their method.

@LaurinmyReha

Check out this repo for the best individual alignment heads of the large v2 and v3 variants:

https://github.com/nyrahealth/CrisperWhisper/blob/develop/run_experiments/experiments/head_results.json

The repo also contains useful code to calculate these head results for other models, as long as you have a dataset with high-quality timestamps, such as TIMIT (https://paperswithcode.com/dataset/timit).
The idea is to evaluate the timings that DTW (dynamic time warping) produces for each individual alignment head against ground-truth data and see which heads score highest. CrisperWhisper improves upon this by specifically training the alignment heads.

https://github.com/nyrahealth/CrisperWhisper
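To make the evaluation idea concrete, here is a minimal sketch of the scoring step only. It is not the code from the CrisperWhisper repo: the function names and the numbers are invented, and it assumes you already have, for each (layer, head) pair, the word start times that DTW produced from that head alone.

```python
def mean_abs_error(predicted, reference):
    """Mean absolute difference in seconds between matching word start times."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def rank_heads(per_head_predictions, reference):
    """Return (head, error) pairs sorted from lowest to highest timing error."""
    errors = {
        head: mean_abs_error(times, reference)
        for head, times in per_head_predictions.items()
    }
    return sorted(errors.items(), key=lambda item: item[1])


# Made-up word start times, purely for illustration.
reference = [0.00, 0.50, 1.00, 1.50]
per_head = {
    (3, 7): [0.30, 0.90, 0.70, 2.10],
    (10, 12): [0.02, 0.48, 1.05, 1.49],
}

for head, error in rank_heads(per_head, reference):
    print(head, round(error, 3))
```

The heads at the top of the ranking would be the candidates for alignment_heads. In practice the error would be averaged over a whole dataset rather than a single utterance.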

@hashefa

hashefa commented Sep 6, 2024

@LaurinmyReha this is very interesting work, thank you. Can you please point to where in the code you calculate the heads? I couldn't find it.
