Vitaliy Polshkov

cogwheelhead

AI & ML interests

Data-centric AI methods; Experiment design, statistical learning and causal inference approaches for LLM / agentic evaluations; "Proper old-school" RL and meta-learning techniques for LLM training

Recent Activity

upvoted a paper 11 days ago

ProcessBench: Identifying Process Errors in Mathematical Reasoning

commented on a paper 14 days ago

ProcessBench: Identifying Process Errors in Mathematical Reasoning

posted an update 15 days ago

Hey!

My team and I recently released two benchmarks on university-level math: U-MATH (for University-MATH) and μ-MATH (for Meta U-MATH).

We work a lot on complex reasoning for LLMs, and we were particularly interested in evaluating university-curriculum math skills, in topics such as differential calculus and linear algebra, because of their wide applicability and practicality.

We noticed that the benchmarks available at the time were either at or below high-school level, leaned mainly towards Olympiad-style problems, or were synthetically generated from a set of templates or seeds.

We wanted to focus on university curricula and we wanted "organic" variety, so we created our own benchmark using problems sourced from actual teaching materials used at top US universities. That is how U-MATH came to be.

We are also very keen on studying and improving evaluations themselves, which is my primary focus in particular: the standard LLM-as-a-judge approach is known to be noisy and biased, yet this often goes unaccounted for. So we created a U-MATH-derived benchmark for "meta-evaluations", i.e. evaluating the evaluators, which makes it possible to quantify their error rates, study their behaviors and biases, and so on.

I'm super excited to be sharing those publicly!

https://huggingface.co/datasets/toloka/u-math
https://huggingface.co/datasets/toloka/mu-math
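To make "evaluating the evaluators" a bit more concrete, here is a minimal sketch of how a judge's error rates could be quantified. It assumes each sample carries a gold correctness label and the judge's verdict; the names `JudgeErrorRates` and `judge_error_rates` and the toy data are hypothetical illustrations, not taken from the μ-MATH code or data format.

```python
from dataclasses import dataclass


@dataclass
class JudgeErrorRates:
    false_positive_rate: float  # share of incorrect solutions the judge accepted
    false_negative_rate: float  # share of correct solutions the judge rejected


def judge_error_rates(gold: list[bool], judged: list[bool]) -> JudgeErrorRates:
    """Compare judge verdicts against gold labels (hypothetical helper).

    gold[i]   -- True if solution i is actually correct
    judged[i] -- True if the judge marked solution i as correct
    """
    if len(gold) != len(judged):
        raise ValueError("gold and judged must have the same length")

    false_pos = sum(1 for g, j in zip(gold, judged) if not g and j)
    false_neg = sum(1 for g, j in zip(gold, judged) if g and not j)
    negatives = sum(1 for g in gold if not g)
    positives = len(gold) - negatives

    return JudgeErrorRates(
        false_positive_rate=false_pos / negatives if negatives else 0.0,
        false_negative_rate=false_neg / positives if positives else 0.0,
    )


# Toy data, made up for illustration only.
gold = [True, True, True, False, False, False, False, True]
judged = [True, False, True, True, False, False, False, True]
rates = judge_error_rates(gold, judged)
print(rates)
```

Reporting the two rates separately, rather than a single accuracy number, is one way to surface a judge's bias: a judge that is too lenient and one that is too strict can have the same accuracy.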

Papers 1

arxiv:2412.03205

models

None public yet

datasets

None public yet