NumbersStation/nsql-350M · Accuracy on Spider

I wrote code to generate predictions for the 1000 questions in this file:

https://github.com/taoyds/spider/blob/master/evaluation_examples/dev.sql

Then I used the following Python script to evaluate execution accuracy:

https://github.com/taoyds/spider/blob/master/evaluation.py
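
For readers unfamiliar with the metric, here is a rough sketch of what execution accuracy measures: the predicted query counts as correct if it returns the same rows as the gold query on the database. This is a simplified illustration only, not the logic in evaluation.py; the `singer` table, the example queries, and the helper names `run_query` and `execution_match` are made up for the example.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return Counter(rows)


def execution_match(conn, gold_sql, pred_sql):
    """True if the prediction runs and returns the same rows as the gold query."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, pred_sql)
    return gold is not None and gold == pred


# Hypothetical toy database standing in for a Spider SQLite file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER);
INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bob', 25);
""")

# (gold, predicted) pairs, invented for illustration.
pairs = [
    ("SELECT count(*) FROM singer", "SELECT COUNT(id) FROM singer"),
    ("SELECT name FROM singer WHERE age > 26",
     "SELECT name FROM singer WHERE age > 40"),
    ("SELECT name FROM singer", "SELECT nam FROM singer"),  # invalid column
]

matches = [execution_match(conn, gold, pred) for gold, pred in pairs]
accuracy = sum(matches) / len(matches)
print(matches, accuracy)
```

In this sketch a prediction that fails to execute is scored as wrong, and row order is ignored; a real evaluator may treat either case differently, which is one place where reported numbers can diverge.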

NSQL-350M gets 22% execution accuracy on easy questions, which is quite a bit lower than the 51% accuracy on Spider reported in the blog post below.

https://www.numbersstation.ai/post/introducing-nsql-open-source-sql-copilot-foundation-models

What is the reason for this discrepancy?