mkshing committed on
Commit afb8483
1 Parent(s): 956763a

add image encoder information

Files changed (1)
README.md +1 -1
README.md CHANGED
@@ -62,7 +62,7 @@ with torch.no_grad():
 	print("Label probs:", text_probs)  # prints: [[1.0, 0.0, 0.0]]
 	```
 # Model architecture
- The model was trained with a ViT-B/16 Transformer architecture as the image encoder and uses a 12-layer RoBERTa as the text encoder. The text encoder was built upon the Japanese pre-trained RoBERTa model [rinna/japanese-roberta-base](https://huggingface.co/rinna/japanese-roberta-base), using the same SentencePiece tokenizer.
+ The model was trained with a ViT-B/16 Transformer architecture as the image encoder and uses a 12-layer RoBERTa as the text encoder. The image encoder was initialized from [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) and the text encoder from the Japanese pre-trained RoBERTa model [rinna/japanese-roberta-base](https://huggingface.co/rinna/japanese-roberta-base), using the same SentencePiece tokenizer.
 
 # Training
 The model was trained on [CC12M](https://github.com/google-research-datasets/conceptual-12m) with its captions translated into Japanese.