Created February 25, 2024 04:19
LLaVA implementation
```sh
Initial Setup:

+-------------------+     +---------------+
| Text Sequence     |     | Raw Images    |
| [T1, <IMG>, T2,   |     | [Image1,      |
|  T3, <IMG>, T4]   |     |  Image2]      |
+-------------------+     +---------------+

Step 1: Convert Text and <IMG> Tokens to Embeddings

+--------------------------------------------------------------+
|             Text and <IMG> Token Embedding Model             |
|                                                              |
|  [T1, <IMG>, T2, T3, <IMG>, T4]                              |
|                |                                             |
|                V                                             |
|  [T1_emb, IMG_emb, T2_emb, T3_emb, IMG_emb, T4_emb]          |
+--------------------------------------------------------------+

Step 2: Convert Images to Feature Patches Using Vision Model

+--------------------------------------------------------------+
|                         Vision Model                         |
|                                                              |
|  Image1 ---> [I1_1, I1_2, I1_3]                              |
|  Image2 ---> [I2_1, I2_2, I2_3]                              |
+--------------------------------------------------------------+

Step 3: Convert Image Patches to Embeddings

+--------------------------------------------------------------+
|               Image Patch Embedding Conversion               |
|                                                              |
| [I1_1, I1_2, I1_3] ---> [I1_1_embed, I1_2_embed, I1_3_embed] |
| [I2_1, I2_2, I2_3] ---> [I2_1_embed, I2_2_embed, I2_3_embed] |
+--------------------------------------------------------------+

Step 4: Replace IMG_emb with Image Patch Embeddings in Sequence

+--------------------------------------------------------------+
|                 Updated Sequence Embeddings                  |
|                                                              |
| [T1_emb, I1_1_embed, I1_2_embed, I1_3_embed, T2_emb,         |
|  T3_emb, I2_1_embed, I2_2_embed, I2_3_embed, T4_emb]         |
+--------------------------------------------------------------+

Step 5: Feed the Updated Sequence into the LLM

+--------------------------------------------------------------+
|                     Large Language Model                     |
|                                                              |
| Input: [T1_emb, I1_1_embed, I1_2_embed, I1_3_embed,          |
|         T2_emb, T3_emb, I2_1_embed, I2_2_embed, I2_3_embed,  |
|         T4_emb]                                              |
|                                                              |
|                |                                             |
|                V                                             |
| Output: Model Predictions                                    |
+--------------------------------------------------------------+
```
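
Steps 1 to 4 of the diagram can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the gist author's code: the helper names (`embed_tokens`, `project_patches`, `merge_embeddings`), the toy embedding table, and the single linear projection are all hypothetical, and the vision model output in Step 2 is hard-coded rather than computed from real images.

```python
from typing import Dict, List, Sequence

Vector = List[float]

IMG_TOKEN = "<IMG>"  # placeholder token name taken from the diagram


def embed_tokens(tokens: Sequence[str], table: Dict[str, Vector]) -> List[Vector]:
    """Step 1: look up one embedding per token, including <IMG> placeholders."""
    return [table[token] for token in tokens]


def project_patches(patches: Sequence[Vector], weight: Sequence[Vector]) -> List[Vector]:
    """Step 3: map each vision feature vector to the text embedding size.

    `weight` has one row per vision feature and one column per text dimension,
    so each output vector is `patch @ weight`. A single linear map is an
    assumption made to keep the sketch small.
    """
    columns = list(zip(*weight))
    return [[sum(p * w for p, w in zip(patch, col)) for col in columns] for patch in patches]


def merge_embeddings(
    tokens: Sequence[str],
    token_embs: Sequence[Vector],
    image_embs: Sequence[List[Vector]],
) -> List[Vector]:
    """Step 4: replace each <IMG> embedding with that image's patch embeddings."""
    placeholders = sum(token == IMG_TOKEN for token in tokens)
    if placeholders != len(image_embs):
        raise ValueError(f"{placeholders} <IMG> tokens but {len(image_embs)} images")
    images = iter(image_embs)
    merged: List[Vector] = []
    for token, emb in zip(tokens, token_embs):
        if token == IMG_TOKEN:
            merged.extend(next(images))  # one placeholder expands to many patches
        else:
            merged.append(emb)
    return merged


# Toy data: 2-dimensional text embeddings, 3-dimensional vision features.
tokens = ["T1", IMG_TOKEN, "T2", "T3", IMG_TOKEN, "T4"]
table = {
    "T1": [1.0, 0.0], "T2": [2.0, 0.0], "T3": [3.0, 0.0], "T4": [4.0, 0.0],
    IMG_TOKEN: [0.0, 0.0],
}
# Step 2 stand-in: three feature patches per image, as in the diagram.
vision_features = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # Image1
    [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]],  # Image2
]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 vision features -> 2 text dims

token_embs = embed_tokens(tokens, table)
image_embs = [project_patches(patches, weight) for patches in vision_features]
merged = merge_embeddings(tokens, token_embs, image_embs)
print(len(merged))  # 10
```

The merged sequence holds 4 text embeddings plus 2 × 3 patch embeddings, matching the ten entries shown in Step 4. In Step 5 this list would be passed to the language model in place of ordinary token embeddings. The `ValueError` check guards against a prompt whose number of `<IMG>` placeholders differs from the number of images supplied.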