How do I test an LLM for my unique needs? If you work in finance, law, or medicine, generic benchmarks are not enough. This blog post uses Argilla, Distilabel, and 🌤️Lighteval to generate an evaluation dataset and evaluate models on it.
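As a rough illustration of the dataset-generation step, here is a minimal Distilabel pipeline that turns domain documents into candidate exam questions. The model id, prompt wording, and document list are placeholders, not the ones used in the blog post; the generated questions would then be reviewed in Argilla and evaluated with Lighteval.

```python
# Minimal sketch: generate domain-specific eval questions with Distilabel.
# All concrete values (model_id, prompt text, documents) are illustrative assumptions.
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

# Each row carries an instruction built from a snippet of domain text.
documents = [
    {"instruction": "Write one exam question about the following passage:\n<domain text here>"},
]

with Pipeline(name="domain-eval-questions") as pipeline:
    load_docs = LoadDataFromDicts(data=documents)
    generate = TextGeneration(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
    )
    load_docs >> generate  # connect the loading step to the generation task

if __name__ == "__main__":
    distiset = pipeline.run()  # returns the generated question dataset
```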
- We found that VLMs can self-improve their reasoning performance through a reflection mechanism, and importantly, this approach scales with test-time compute.
- Evaluations on comprehensive and diverse vision-language reasoning tasks are included!
The cleaning process consists of the following steps (sketched in code below):
- Joining the separate splits together and adding a split column
- Converting string messages into a list of structs
- Removing empty system prompts
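Here is a minimal sketch of those cleaning steps using the Hugging Face `datasets` library. The repository id and the assumption that the `messages` column holds JSON-encoded strings are illustrative, not the exact details of the original pipeline.

```python
# Minimal sketch of the cleaning steps above; the repo id and column layout are assumptions.
import json

from datasets import concatenate_datasets, load_dataset

raw = load_dataset("org/raw-dataset")  # hypothetical repository id

# 1. Join the separate splits and record the origin of each row in a "split" column.
joined = concatenate_datasets(
    [ds.add_column("split", [name] * len(ds)) for name, ds in raw.items()]
)

# 2. Convert string-encoded messages into a list of structs ({"role": ..., "content": ...}).
joined = joined.map(lambda row: {"messages": json.loads(row["messages"])})

# 3. Remove empty system prompts from each conversation.
def drop_empty_system(row):
    row["messages"] = [
        m for m in row["messages"]
        if not (m["role"] == "system" and not m["content"].strip())
    ]
    return row

cleaned = joined.map(drop_empty_system)
```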