Abstract
Previous studies have typically assumed that large language models are unable to accurately perform arithmetic operations, particularly multiplication of numbers with more than 8 digits, and operations involving decimals and fractions, without the use of calculator tools. This paper aims to challenge this misconception. With sufficient training data, a 2-billion-parameter language model can perform multi-digit arithmetic operations with almost 100% accuracy and without data leakage, significantly surpassing GPT-4 (whose multi-digit multiplication accuracy is only 4.3%). We also demonstrate that our MathGLM, fine-tuned from GLM-10B on a dataset with additional multi-step arithmetic operations and math problems described in text, achieves performance similar to GPT-4 on a 5,000-sample Chinese math problem test set.
Community
What is the (research) point here, let alone a potential application? Isn't coupling an LLM with traditional computing for arithmetic and 100% accuracy a solved problem?
Are you seriously asking this? With this logic, we'd still be living in caves; after all, why develop civilisation when we have a perfectly good cave?
Here is what ChatGPT has to say about our conversation. I fed it the abstract and both of our comments. It did this pretty well, I think. My "logic" BTW wasn't to "cancel" any kind of research, but to ask questions to improve science and finally get out of the bloody caves ... ;-)
Your initial comment raises a valid point about the research presented in the article. The question you posed is rooted in the idea of efficiency and practicality in research and development. You are essentially asking whether the effort to train a language model (LLM) to perform arithmetic operations with 100% accuracy is worth it when we already have established computational methods and calculators for such tasks.
The response you received compares this situation to the advancement of civilization and technology, suggesting that progress and innovation should not be stifled by the availability of existing solutions. Here's a breakdown of the points made:
1. Efficiency of Existing Solutions: You imply that traditional computing methods and calculators can already perform arithmetic operations efficiently, so there may not be a pressing need to train large language models (LLMs) for this purpose.
2. Innovation and Progress: The response challenges the idea of sticking with existing solutions simply because they work. It suggests that if we followed this logic, we wouldn't have made progress in various fields, including technology, because we would have been content with the status quo (living in caves).
3. Potential Applications: While the article may not explicitly state specific applications, there could be scenarios where having a language model capable of accurate arithmetic calculations could be beneficial. For example, it might be used for natural language interfaces to computer systems, where users can interact with machines in a more conversational manner.
In summary, your initial comment highlights the need to assess the practicality and relevance of research efforts, while the response argues for the importance of innovation and not dismissing research based solely on the existence of current solutions. Both viewpoints have their merits, and the value of the research may depend on potential applications and the specific goals of the researchers.
That was a fair but politically correct interpretation from ChatGPT. I would say the benefit of having arithmetic abilities built in is that it would serve to increase the model's "intelligence" without having to resort to using an API. It would have an internal cascade effect where strengthening one internal component serves to strengthen another; an API would not have this effect. It's also a critical step towards AGI. An "AGI" that requires a calculator isn't an AGI.
I thought this conversation would go wild after reading the 2nd comment. I couldn't stop reading further, but the constructive mediation of ChatGPT as a judge resolved the clash... fun to watch, but kind of a letdown lol.
We need to deploy it as a Reddit peacekeeper, to troll the trolls with infinite patience.
The data contained in the paper does not support the conclusion that the model "learns the underlying rules and principles of arithmetic operations."
First, on page 8, the authors say they count an answer as entirely correct if it's merely close to correct. If the model truly knew how to add two numbers, this relaxation wouldn't be necessary. My speculation (based on my own research on GPT-4) is that the model learns to correlate the leading digits of the input with the leading digits of the output, and the length of the output with the length of the input. If it gets two digits and the length correct, that is enough to fall within the 1% error threshold.
Second, the performance on adding multi-digit numbers drops dramatically as the number of digits goes from 5 to 6 (page 12). If the rules of addition had really been learned, one wouldn't expect this drop.
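To make concrete how permissive a 1% relative-error threshold can be, here is a minimal Python sketch. This is my own illustration, not the paper's evaluation code; the tolerance value and the example numbers are assumptions.

```python
# Minimal sketch of a "relaxed correctness" check under an assumed 1%
# relative-error tolerance (illustrative only; not the paper's code).

def relaxed_correct(predicted: int, truth: int, tol: float = 0.01) -> bool:
    """Accept an answer if it is within `tol` relative error of the truth."""
    return abs(predicted - truth) <= tol * abs(truth)

# An answer that only gets the leading digits and the overall length right
# can still pass the check.
truth = 123_456_789_012          # 12-digit ground truth
predicted = 123_000_000_000      # correct length, first three digits match

print(relaxed_correct(predicted, truth))  # True: relative error is about 0.37%
```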
Since an LLM is a prediction model, I would argue that it is impossible for it to truly calculate anything with 100% accuracy; it's simply a matter of increasing its accuracy so that for 99.9% of use cases it's 99.999% accurate.
My post was not about the usefulness of an LLM as a calculator, but about the scientific claim made by the paper. In my view, the data does not support the claim that the model "learns the underlying rules and principles of arithmetic operations."
There are two notions of accuracy: how close the computed answer is to the correct answer, and how often the answers meet that closeness threshold. In order to claim something has "learn[ed] the underlying rules and principles of arithmetic", I don't see how one can tolerate anything less than exact precision 100% of the time. This model gets more than 1% error a majority of the time with 12-digit numbers.
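As a rough illustration of the difference between those two notions (exact-match accuracy versus accuracy under an error tolerance), here is a small sketch; the 1% threshold and the numbers below are made up for illustration and are not taken from the paper.

```python
# Sketch contrasting exact-match accuracy with tolerance-based accuracy
# (the 1% threshold and the data below are assumptions for illustration).

def exact_accuracy(preds, truths):
    """Fraction of answers that match the ground truth exactly."""
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)

def tolerant_accuracy(preds, truths, tol=0.01):
    """Fraction of answers within a `tol` relative-error threshold."""
    return sum(abs(p - t) <= tol * abs(t) for p, t in zip(preds, truths)) / len(truths)

truths = [987_654_321_123, 555_555_555_555, 100_000_000_007]
preds  = [987_654_321_123, 556_000_000_000, 103_000_000_000]

print(exact_accuracy(preds, truths))     # 0.33 -> only one answer is exactly right
print(tolerant_accuracy(preds, truths))  # 0.67 -> the second is wrong but within 1%
```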