{ "Summary": "This paper investigates the impact of weight initialization strategies on the grokking phenomenon in Transformer models, focusing on four arithmetic tasks in finite fields. The study systematically compares five initialization methods (PyTorch default, Xavier, He, Orthogonal, and Kaiming Normal) using a controlled experimental setup with a small Transformer architecture. The results reveal significant differences in convergence speed and generalization capabilities, with Xavier initialization often outperforming the others.", "Strengths": [ "Rigorous empirical analysis with statistical validation.", "Systematic comparison of five widely-used initialization strategies.", "Clear presentation of experimental results with confidence intervals.", "Novel application of weight initialization strategies to study grokking in Transformer models." ], "Weaknesses": [ "Limited scope: the experiments are conducted only on a small Transformer architecture.", "Results may not scale to larger models or more complex tasks.", "Generalizability is limited to arithmetic tasks in finite fields.", "Some details of the experimental setup need further elaboration for reproducibility.", "Lacks a theoretical explanation for the observed phenomena.", "Inadequate discussion of potential negative societal impacts and ethical concerns." 
], "Originality": 3, "Quality": 3, "Clarity": 3, "Significance": 3, "Questions": [ "Can the authors provide more details on how random seeds were chosen and how they affect the results?", "How do the authors ensure that the findings are generalizable to larger models and real-world tasks?", "Can the authors provide more insights into the choice of arithmetic tasks and their relevance to broader applications?", "What is the theoretical explanation for the observed differences in grokking behavior?", "How do the results translate to real-world applications beyond arithmetic tasks?", "Can the authors provide more details on the specific statistical methods used for validation?", "How do the initialization strategies interact with other hyperparameters like learning rate schedules?", "Can the authors provide more detailed explanations or visualizations of the learning dynamics associated with each initialization method?", "Are there any plans to explore adaptive initialization methods that evolve during training?" ], "Limitations": [ "The study is conducted on a small Transformer architecture, so the findings may not generalize to larger models.", "The focus on arithmetic tasks in finite fields may not fully represent the complexity of real-world problems.", "Further investigation is needed to understand the underlying mechanisms of grokking and how weight initialization influences them.", "Potential negative societal impacts are not discussed.", "The paper does not address how the findings could be misused or lead to unintended consequences in real-world applications." ], "Ethical Concerns": false, "Soundness": 3, "Presentation": 3, "Contribution": 3, "Overall": 6, "Confidence": 4, "Decision": "Reject" }