{ "Summary": "The paper proposes a novel layer-wise learning rate strategy for Transformer models to accelerate and enhance the grokking phenomenon in algorithmic learning tasks. The approach differentially adjusts learning rates across the embedding, lower, and higher layers of the Transformer. The experimental results show significant improvements in convergence speed and final model performance on tasks such as modular arithmetic and permutations. The key contributions include a novel learning rate strategy, empirical evidence of its effectiveness, insights into Transformer learning dynamics, and practical strategies for improving training efficiency.", "Strengths": [ "Novel application of layer-wise learning rates to accelerate grokking.", "Significant improvements in convergence speed and final performance.", "Comprehensive experimental validation on algorithmic tasks.", "Practical implications for training efficiency." ], "Weaknesses": [ "Limited justification for the chosen learning rates.", "Experiments are conducted on relatively small models and specific tasks, raising concerns about generalizability.", "Lacks thorough theoretical analysis to explain the observed phenomena.", "Insufficient discussion on potential negative societal impacts and scalability.", "Writing could be clearer and better organized." ], "Originality": 3, "Quality": 3, "Clarity": 3, "Significance": 3, "Questions": [ "Can you provide more detailed justification for the specific learning rates chosen for each layer?", "Have you tested the approach on larger models or more complex tasks to validate its generalizability?", "How do you ensure that the observed improvements are not just due to overfitting to the specific tasks used in the experiments?", "Can you provide a more detailed theoretical analysis to explain why the layer-wise learning rate strategy accelerates grokking?", "How were the learning rates for different layers determined? Was it through grid search or another method?", "Can you provide more insights into the theoretical reasons behind the effectiveness of layer-wise learning rates in facilitating grokking?", "Can you provide more detailed implementation and hyperparameter settings to enhance reproducibility?" ], "Limitations": [ "The optimal learning rate configuration may vary depending on the task and model architecture.", "Results are based on small models and may not generalize to larger models or more complex tasks.", "Careful tuning of learning rates is needed to avoid potential instability, especially in the early stages of training.", "Potential negative societal impacts are not discussed." ], "Ethical Concerns": false, "Soundness": 3, "Presentation": 3, "Contribution": 3, "Overall": 6, "Confidence": 4, "Decision": "Reject" }