Indian Food Classification with Vision Transformer (ViT)

Overview

This model is a Vision Transformer (ViT) fine-tuned to classify images of Indian food. It was trained on the Indian Foods Dataset, available through Hugging Face Datasets.

Dataset

The Indian Foods Dataset contains 4,770 images spanning 15 classes of popular Indian dishes. It is split as follows:

Training: 3,047 images
Validation: 762 images
Testing: 961 images
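As a quick sanity check, the three splits add up to the stated total of 4,770 images, and their shares come to roughly 64% / 16% / 20%. The short Python sketch below recomputes this using only the counts listed in this card.

```python
# Split sizes as reported in this model card.
SPLITS = {"train": 3047, "validation": 762, "test": 961}


def split_shares(splits):
    """Return (total, {split: fraction of total}) for a dict of split sizes."""
    total = sum(splits.values())
    return total, {name: count / total for name, count in splits.items()}


total, shares = split_shares(SPLITS)
print(f"total: {total}")
for name, share in shares.items():
    # Prints each split's share as a percentage with one decimal place.
    print(f"{name:>10}: {share:.1%}")
```

This prints a total of 4770 and shares of about 63.9% (train), 16.0% (validation), and 20.1% (test).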

Model

The base model is the Vision Transformer google/vit-base-patch16-224-in21k. It was fine-tuned on the Indian Foods Dataset for 10 epochs using the AdamW optimizer with a learning rate of 2e-4.
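AdamW differs from plain Adam in that weight decay is applied directly to the weights instead of being added to the gradient. The Python sketch below shows one AdamW update step using this card's learning rate of 2e-4. The betas, epsilon, and weight-decay values are assumptions (they match the defaults of PyTorch's `torch.optim.AdamW`) because the card does not state them. This illustrates the update rule only and is not the training code used for this model.

```python
import math
from dataclasses import dataclass


@dataclass
class AdamWState:
    m: list  # first-moment (mean) estimates, one per parameter
    v: list  # second-moment (uncentered variance) estimates, one per parameter
    t: int = 0  # number of steps taken so far


def adamw_step(params, grads, state, lr=2e-4, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """Return updated parameters after one AdamW step (assumed hyperparameters)."""
    b1, b2 = betas
    state.t += 1
    bias_c1 = 1 - b1 ** state.t  # bias correction for the first moment
    bias_c2 = 1 - b2 ** state.t  # bias correction for the second moment
    updated = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state.m[i] = b1 * state.m[i] + (1 - b1) * g
        state.v[i] = b2 * state.v[i] + (1 - b2) * g * g
        m_hat = state.m[i] / bias_c1
        v_hat = state.v[i] / bias_c2
        # Decoupled weight decay: shrink the weight directly, not via the gradient.
        p = p - lr * weight_decay * p
        # Adam step using the bias-corrected moment estimates.
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        updated.append(p)
    return updated


params = [1.0]
state = AdamWState(m=[0.0], v=[0.0])
params = adamw_step(params, [0.5], state)
print(params)  # approximately [0.999798]
```

On the first step the bias-corrected ratio m_hat / sqrt(v_hat) is close to the sign of the gradient, so each weight moves by roughly the learning rate, plus a small decay term proportional to its value.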

Evaluation

The model was evaluated on the test set and achieved the following metrics:

Accuracy: 0.9667
Precision: 0.9670
Recall: 0.9667
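The card does not say how precision and recall were averaged over the 15 classes. The reported recall (0.9667) equals the accuracy exactly. Support-weighted averaging always produces this, because weighted recall reduces to total correct predictions divided by total samples. The plain-Python sketch below computes accuracy and support-weighted precision and recall under that assumption. The function name `weighted_metrics` is hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return (accuracy, weighted precision, weighted recall) for label lists."""
    n = len(y_true)
    support = Counter(y_true)    # number of true samples per class
    predicted = Counter(y_pred)  # number of predictions per class
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    accuracy = sum(tp.values()) / n
    precision = recall = 0.0
    for c, n_c in support.items():
        weight = n_c / n  # weight each class by its share of true samples
        # A class that was never predicted gets precision 0 in this sketch.
        precision += weight * (tp[c] / predicted[c] if predicted[c] else 0.0)
        recall += weight * tp[c] / n_c
    return accuracy, precision, recall


acc, prec, rec = weighted_metrics(["a", "a", "b", "b", "c"],
                                  ["a", "b", "b", "b", "c"])
print(acc, prec, rec)
```

In this toy example, weighted recall matches accuracy (0.8), while weighted precision (about 0.867) differs. This mirrors the pattern in the reported metrics.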

Usage

You can use this fine-tuned model directly from the Hugging Face Hub under the ID therealcyberlord/vit-indian-food.
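A minimal way to run the model is sketched below. It assumes the `transformers` library (with an image backend such as Pillow) is installed and that the Hub ID above loads as an image-classification checkpoint. The helpers `load_classifier` and `classify` are hypothetical. `pipeline("image-classification", ...)` is the standard `transformers` entry point for this task.

```python
from functools import lru_cache

MODEL_ID = "therealcyberlord/vit-indian-food"


@lru_cache(maxsize=1)
def load_classifier():
    """Build the image-classification pipeline once and reuse it on later calls."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import pipeline
    return pipeline("image-classification", model=MODEL_ID)


def classify(image, top_k=3):
    """Return up to top_k (label, score) pairs for an image path, URL, or PIL image."""
    results = load_classifier()(image, top_k=top_k)
    return [(r["label"], r["score"]) for r in results]
```

For example, `classify("dish.jpg")` would return up to three `(label, score)` pairs for a local file named `dish.jpg` (a placeholder path). Caching the pipeline avoids reloading the model weights on every call.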