paper_id,Summary,Questions,Limitations,Ethical Concerns,Soundness,Presentation,Contribution,Overall,Confidence,Strengths,Weaknesses,Originality,Quality,Clarity,Significance,Decision
data_augmentation_grokking,"The paper investigates the impact of data augmentation techniques on the grokking phenomenon in neural networks, particularly focusing on mathematical operations. It aims to identify effective augmentation strategies that accelerate grokking and enhance generalization without extensive data collection.","['Can the authors provide detailed experimental results to support their claims?', 'How do the proposed augmentation techniques compare with existing methods in terms of their impact on grokking?', 'Could the authors clarify the exact configurations and parameters used in their experiments?', 'What specific data augmentation techniques were used, and how were they implemented?', 'How does the proposed approach compare to existing methods in terms of performance and efficiency?', 'Can you elaborate on the specific challenges and solutions encountered in applying data augmentation to mathematical operations within neural networks?', 'What are the theoretical reasons behind the efficacy of certain data augmentation techniques in facilitating grokking?', 'How do your findings compare with existing methods in the literature? Please include a more comprehensive comparison in the related work section.']","['The lack of detailed experimental results is a significant limitation.', 'The paper does not sufficiently address potential negative societal impacts or ethical concerns related to the work.', 'There is a lack of detailed discussion on the scalability of the proposed methods to more complex tasks or larger datasets.']",False,2,2,2,3,4,"['Addresses an interesting and relatively unexplored aspect of neural network generalization: the grokking phenomenon.', 'Focuses on the potential of data augmentation to modulate grokking, which could have practical implications for improving training efficiency in data-scarce domains.']","['Lacks rigorous theoretical backing or a clear explanation for why certain augmentation techniques are more effective.', 'Insufficient details on datasets and augmentation methods, making it difficult to reproduce the results.', 'Evaluation metrics and experimental setup are vague, and the results section lacks depth and comprehensive analysis.', ""Significant absence of related work that directly compares this study to existing methods, limiting the understanding of the paper's contributions.""]",2,2,2,2,Reject
sleep_wake_grokking,"The paper introduces a novel sleep-wake training regime aimed at influencing the grokking phenomenon in neural networks. This regime alternates between high-intensity training phases ('wake') and reduced activity phases ('sleep'), hypothesizing that such alternation can stabilize learning dynamics and facilitate the emergence of generalized patterns. The study systematically explores various configurations of sleep-wake cycles using Transformer models on different datasets and compares the results with baseline models trained with constant intensity.","['Can the authors provide more detailed explanations of how the sleep-wake phases were implemented?', 'What was the rationale behind choosing the specific durations and intensities for the sleep-wake phases?', 'How does the computational cost of the sleep-wake regime compare to constant intensity training in practice?', 'Can the authors provide a theoretical justification for why sleep-wake cycles are expected to impact grokking?', 'What are the specific hyperparameters used in the experiments, and how sensitive are the results to these choices?', 'How do the results generalize across different neural network architectures and datasets?', 'How does the sleep-wake regime compare to other adaptive training methods like learning rate annealing or cyclical learning rates?', 'Can the authors test the sleep-wake regime on more complex and diverse datasets to validate generalizability?', 'How do the authors propose to mitigate the increased computational cost and complexity of hyperparameter tuning introduced by the sleep-wake regime?']","['The introduction of sleep-wake phases increases the complexity of hyperparameter tuning and computational cost.', 'The practical applicability of the sleep-wake regime is questionable given the increased complexity.', 'The primary limitation is the lack of a detailed mechanistic explanation for the observed phenomena, which leaves the proposed method somewhat speculative.', 'The computational overhead and increased complexity in hyperparameter tuning introduced by the sleep-wake regime are significant concerns that need to be addressed.', 'The paper does not thoroughly address the limitations related to computational cost and the complexity of hyperparameter tuning. Additionally, the generalizability of the results to more complex datasets remains uncertain.']",False,2,2,2,3,4,"['The concept of alternating high-intensity and reduced activity phases during training is novel and well-motivated.', 'Comprehensive experimental setup covering various datasets and detailed monitoring of training metrics.', 'Potentially significant insights into the dynamics of deep learning and the grokking phenomenon.']","['The methodology section lacks clarity, particularly in detailing the implementation of the sleep-wake cycle and the rationale behind choosing specific hyperparameters.', 'Increased complexity in hyperparameter tuning and computational cost due to the introduction of sleep-wake phases.', 'The results, though promising, are not sufficiently compelling to justify the added complexity of the proposed regime.', 'Lacks a strong theoretical foundation for why sleep-wake cycles should impact grokking.', 'Insufficient statistical analysis comparing the new approach to traditional methods.', 'Limited exploration of hyperparameter sensitivity and robustness across different architectures and datasets.', 'The novelty of the sleep-wake training strategy is limited as it resembles existing methods like learning rate annealing and weight decay regularization.', 'The results may not generalize well to more complex or diverse datasets beyond the ones studied.']",3,2,2,3,Reject
batch_size_grokking,"The paper investigates the impact of dynamic batch sizing on the phenomenon of grokking in neural networks, where models suddenly generalize well after a phase of overfitting. The authors propose that progressively adjusting batch sizes during training can accelerate grokking and enhance validation performance across various datasets.","['Can the authors provide a theoretical explanation or intuition for why dynamic batch sizing influences grokking?', 'Please provide detailed descriptions of the datasets, hyperparameters, and the dynamic batch size adjustment schedule.', 'How does the proposed method compare against other techniques aimed at improving generalization?', 'Can the authors provide more detailed explanations and statistical analyses for the results shown in Figures 1 and 2?', 'How does your approach compare to other regularization techniques or methods for improving generalization?', 'What are the practical implications and scalability of dynamic batch sizing in real-world applications?']","['The paper does not thoroughly assess the potential limitations and broader applicability of the proposed method.', 'Ethical considerations and potential societal impacts of the work are not discussed.', 'The concept of grokking and its significance is not well-established in the literature.', 'The choice of small algorithmic datasets limits the generalizability of the results to more complex, real-world datasets.']",False,2,2,2,3,4,"['Addresses the novel and intriguing phenomenon of grokking.', 'Proposes a practical approach to improve generalization by dynamically adjusting batch sizes.', 'Potentially significant findings that could influence future strategies for neural network training.']","['Lacks a detailed theoretical foundation for the proposed method.', 'Insufficient details on the experimental setup, including hyperparameters and dataset specifics.', 'Does not compare against a comprehensive set of baseline methods.', 'Figures and results are not adequately explained, and lack statistical analysis.', 'The limitations and future work section is too brief and lacks critical assessment.', 'Ethical considerations and potential societal impacts are not addressed.', 'The concept of grokking is not widely accepted or well-defined in the literature, making it difficult to assess the significance of the findings.', 'The paper is not well-organized and contains redundant information, making it difficult to follow.']",2,2,2,2,Reject
training_intensity_grokking,"The paper investigates the impact of varying training intensity schedules on the grokking phenomenon in neural networks, specifically focusing on Transformer models. The authors explore cyclical adjustments to learning rates, batch sizes, and weight update frequencies to optimize model training dynamics and generalization performance. The paper presents extensive experiments to validate the approach and provides insights into how different intensity schedules affect grokking.","['Can the authors provide more detailed descriptions of the implementation of training intensity schedules?', 'What theoretical insights can explain why certain training schedules facilitate grokking?', 'How would the proposed approach perform on other types of neural networks and more complex datasets?', 'How does the proposed method compare with other baseline methods not included in the current experiments?']","[""The study's limited scope to specific datasets and models raises concerns about the generalization of the findings."", 'The theoretical basis for the observed phenomena is not sufficiently explored.', 'Potential increased computational cost of varying training schedules.']",False,2,2,2,4,4,"['Novel exploration of training intensity schedules in the context of grokking.', 'Comprehensive experimental design with multiple datasets and ablation studies.', 'Clear criteria for transitioning between high and low-intensity phases.']","['Lack of clarity in the implementation and adjustment of training intensity schedules.', 'Insufficient theoretical analysis explaining why certain schedules facilitate grokking.', 'Experiments are limited to specific datasets and Transformer models, raising questions about generalizability.', 'Overemphasis on cyclical learning rates without exploring other training strategies.', ""The paper's organization and writing quality need improvement for better readability.""]",3,2,2,3,Reject
skip_connections_grokking,"The paper investigates the impact of skip connections on the grokking phenomenon in Transformer models. Specifically, it evaluates the effects of identity mapping and learned skip connections on the timing and extent of grokking across various datasets.","['Can you provide more theoretical analysis or proofs to explain why skip connections affect grokking?', 'How do the findings of this paper significantly advance the state of the art in Transformer model design or understanding of grokking?', 'Can the authors provide more detailed analysis and substantiation of their claims regarding the impact of skip connections on grokking?', 'What are the potential limitations and negative societal impacts of this work?', 'Can the authors provide a more detailed theoretical background on the grokking phenomenon?', 'How do the authors address potential limitations in their experimental setup?', 'Can the authors include more detailed statistical analysis to support their findings?', 'Can you provide more detailed explanations and implementation details for the experiments?', 'How does your work differ from existing studies on skip connections in Transformer models?', 'Why are the findings significant, and how could they influence future Transformer model designs?', 'What is the rationale behind the choice of datasets and hyperparameters?', 'Can the authors provide more detailed analysis and justification for the experimental results?', 'How do the findings of this paper compare to previous work on skip connections in other neural network architectures?']","['The study is limited by its focus on empirical results without sufficient theoretical backing.', 'The choice of datasets and fixed hyperparameters might limit the generalizability of the findings.', 'The paper does not adequately address the potential limitations and negative societal impacts of their work.']",False,2,2,2,3,4,"['Addresses an interesting and somewhat underexplored phenomenon in machine learning: grokking.', 'Provides empirical evidence on how skip connections can influence grokking in Transformer models.']","['The novelty of the paper is limited as skip connections are not new in neural network literature.', 'Lacks detailed theoretical analysis or rigorous proofs explaining the observed phenomena.', 'The empirical results, while interesting, are not sufficiently groundbreaking or impactful to warrant acceptance in a top-tier conference.', 'The clarity and depth of explanations regarding the experimental setup and results need improvement.', 'The experimental methods and results are not well substantiated.', 'The paper lacks a thorough discussion of limitations and potential negative societal impacts.', 'The quality of the experiments is questionable due to a lack of detailed statistical analysis and robust experimental validation.', 'The significance of the results is limited, as the study does not substantially advance the state-of-the-art or provide actionable insights.', 'The presentation is poor. Figures and tables are not well-integrated or properly captioned, making it hard to follow the data.']",2,2,2,2,Reject
fixed_interval_reset_grokking,"The paper investigates the grokking phenomenon in neural networks and proposes a novel intervention method of fixed interval resets to potentially accelerate the transition from overfitting to generalization. The study provides a systematic analysis of different reset intervals and evaluates their effects on model performance, convergence rates, and the timing of grokking.","['Can the authors provide a more detailed theoretical explanation for the impact of fixed interval resets on grokking?', 'Can the authors expand the experiments to include more diverse datasets and additional baselines for a more comprehensive evaluation?', 'Can the authors justify the choice of reset intervals and other hyperparameters in more detail?', 'Can the authors provide more details on the implementation of the fixed interval reset mechanism?', 'How were the datasets chosen, and why are they suitable for studying grokking?', 'Can the authors provide more statistical analysis to validate their findings?', 'What are the baseline models used for comparison, and how were they implemented?', 'Have the authors considered other types of resets, such as random resets or adaptive intervals, and how do they compare with fixed interval resets?', 'Can the authors provide more details on the Transformer model architecture and hyperparameters used in the experiments?', 'How does the proposed method perform on more complex, real-world datasets compared to the algorithmic tasks used in the paper?']","['The methodology lacks a thorough theoretical foundation and deeper insights into the mechanisms behind the observed phenomena.', 'The experiments are somewhat limited in scope and need to be expanded to include more diverse datasets and baselines.', 'The choice of reset interval is critical and requires careful tuning, which is not adequately addressed in the paper.', 'The method may not be universally applicable to all types of neural network architectures and datasets.', 'The paper does not address the potential negative societal impacts of the proposed method.']",False,2,2,2,3,4,"['The paper addresses an interesting and relatively unexplored phenomenon in neural network training.', 'The idea of using fixed interval resets to potentially manage and accelerate grokking is novel.', 'The experimental results show that specific reset intervals can influence the timing of grokking, offering practical insights.']","['The paper lacks a thorough theoretical explanation for why and how fixed interval resets influence the grokking phenomenon.', 'The experiments are limited in scope and do not include a wide range of datasets or more diverse baselines.', 'The choice of reset intervals and other hyperparameters is not well-justified, and the analysis could be more comprehensive.', 'The results are somewhat preliminary and need further validation.', 'The clarity of the paper is a major concern. The methodology section is not well explained, making it difficult to understand the exact implementation of the proposed reset mechanism.', 'The experimental setup lacks detail, and it is unclear how the datasets were chosen or why they are suitable for studying grokking.', 'The results section does not provide enough statistical analysis to support the claims. The paper lacks confidence intervals, p-values, and other statistical measures to validate the significance of the findings.', 'There is a lack of ablation studies to validate the contribution of each component of the method.']",3,2,2,2,Reject
sparsity_grokking,"The paper investigates the impact of sparsity-inducing regularization techniques (L1 regularization and Dropout) on the grokking phenomenon in Transformer models. The authors aim to identify optimal sparsity levels that facilitate or accelerate grokking, thereby enhancing model generalization. The experiments conducted across diverse datasets reveal that these techniques significantly influence grokking, providing clear indicators of its onset and improving generalization performance.","['Can the authors provide more detailed citations for the foundational papers referenced in the related work section?', 'What specific datasets were used in the experiments, and what were the baseline models for comparison?', 'Can the authors include more visual aids (graphs or charts) to effectively communicate the quantitative results?', 'Can the authors describe the ablation studies in more detail?', 'What is the theoretical explanation for the observed effects of sparsity on grokking?', 'How do you ensure the generalizability of your findings given the limited scope of datasets used?', 'Can you elaborate on the computational cost of applying these regularization techniques in practice?', 'Have the authors considered testing their approach on different neural network architectures other than Transformers?', 'What are the potential limitations of using L1 regularization and Dropout for controlling grokking?', 'Can more qualitative results or visualizations be provided to better understand the mechanisms underlying grokking?']","['The paper does not adequately address the limitations of the study, particularly the lack of detailed experimental setup and theoretical explanation.', 'The computational cost of applying these regularization techniques is not discussed.', 'The paper does not provide a thorough theoretical framework to support its findings.', 'The lack of detailed theoretical grounding weakens the claims made in the paper.', 'The paper lacks a thorough discussion of the limitations and potential negative societal impacts of the proposed approach.', 'More detailed qualitative analysis and potential negative societal impacts, if any, should be discussed.']",False,2,2,2,3,4,"['Addresses an intriguing phenomenon (grokking) in neural networks.', 'Proposes a novel combination of L1 regularization and Dropout to control grokking.', 'Provides empirical evidence supporting the effectiveness of L1 regularization and Dropout in enhancing model generalization.']","['Lack of detailed citations in the related work section, making it difficult to verify the information presented.', 'Insufficient details about the experimental setup, including specific datasets, baseline models, and Transformer configurations.', 'Lacks concrete examples or early results to substantiate the claims made in the introduction.', 'Quantitative results are presented without sufficient visual aids (graphs or charts) to effectively communicate the findings.', 'Ablation studies are mentioned but not described in detail, making it hard to assess their thoroughness and impact.', 'The paper does not provide a deep theoretical explanation for the observed effects of sparsity on grokking.', 'The application of L1 regularization and Dropout to Transformer models is not novel.', 'Poorly organized and lacks clarity in writing and presentation.', 'Does not convincingly demonstrate the practical significance or broader impact of the findings.']",2,2,2,2,Reject
adaptive_lr_grokking,"The paper investigates the impact of adaptive learning rate schedules, specifically Adamax and AdamW, on the grokking phenomenon in Transformer models across various datasets. Through systematic experimentation, it aims to reveal how these schedules influence convergence rates, loss dynamics, and generalization performance. The study focuses on small datasets and evaluates the effectiveness of these adaptive learning rate strategies in inducing grokking.","['Can the authors provide more detailed descriptions of the datasets used in the experiments?', 'What specific hyperparameters were used for the Adamax and AdamW schedulers?', 'Can the authors elaborate on the theoretical foundations of their findings?', 'How do the authors address the potential limitations and negative societal impacts of their work?', 'Why were Adamax and AdamW specifically chosen for this study? What theoretical insights support their selection?', 'Can the authors provide more detailed and complete results, especially addressing the incomplete sections in the current draft?', 'What are the broader implications of these findings for the community? How do these results advance the understanding of grokking in neural networks?']","['The paper does not adequately address its limitations or the potential negative societal impacts of its findings.', 'The clarity and presentation of the paper need significant improvement, with many sections being incomplete or poorly formatted.', 'The experimental setup lacks detailed descriptions, making it difficult to reproduce the results.']",False,2,2,2,3,4,"['The topic of grokking is intriguing and relevant to the field of neural network optimization.', 'The paper explores the use of adaptive learning rate schedules, which is a novel approach to addressing the grokking phenomenon.', 'Comprehensive experiments are conducted across various datasets.']","[""The paper's clarity and organization are poor, with fragmented results sections and incomplete sentences."", 'Critical details about the experiments, such as dataset descriptions and specific hyperparameter settings, are missing or briefly mentioned.', 'The experimental results, particularly the ablation studies, lack depth and do not convincingly demonstrate the advantages of adaptive learning rate schedules.', 'The paper does not provide substantial theoretical backing or detailed analyses to support its claims.', 'The limitations and potential negative societal impacts of the work are not adequately addressed.']",2,2,2,2,Reject
dynamic_update_grokking,"The paper investigates the impact of dynamic update strategies on the grokking phenomenon in neural networks, specifically using Transformer models trained on algorithmic datasets. It proposes dynamic update strategies that adapt based on model performance metrics, aiming to improve generalization performance. The experiments show that these strategies can significantly influence grokking timing and extent.","['Can the authors provide more detailed descriptions of the adaptive strategies and their implementation?', 'What are the theoretical justifications for the observed improvements in grokking timing and generalization performance?', 'Have the authors tested the proposed strategies on other neural network architectures and datasets? If not, how do they anticipate the strategies will generalize?', 'Can the authors provide more detailed pseudocode or a step-by-step explanation of how the dynamic update strategies are implemented?', 'What specific criteria were used to choose the hyperparameters for the Transformer models, and how do these choices impact the results?', 'Are there any limitations or challenges encountered when implementing these dynamic strategies that should be discussed?', 'Can the authors provide a more comprehensive comparison with other baseline methods?', 'What is the computational overhead introduced by dynamic update strategies?']","['The paper acknowledges the need for additional computational resources and potential complications in training processes due to dynamic update strategies.', 'Further experiments are needed to validate the strategies across various architectures and datasets.', 'The theoretical explanation of why dynamic update strategies work is lacking.', 'The generality of the proposed methods is not validated on a wide range of datasets and model architectures.', 'Potential negative societal impacts are not discussed.']",False,2,2,2,3,4,"['The topic is highly relevant and addresses an important issue in neural network training and generalization.', 'The proposed dynamic update strategies are novel and provide a fresh perspective on managing training processes.', 'The experimental results demonstrate significant improvements in grokking timing and generalization performance.']","['The paper lacks clarity in detailing the methodology and implementation of adaptive strategies, making reproducibility difficult.', 'The theoretical grounding for why these dynamic strategies work is not sufficiently explored.', 'The experiments are limited to Transformer models on algorithmic datasets, which may not generalize to other architectures and datasets.', 'The experimental setup and results are not comprehensively analyzed, with insufficient ablation studies and limited comparison with baseline methods.', 'There is no discussion on the potential limitations and negative societal impacts of the proposed method.']",3,2,2,3,Reject
task_complexity_grokking,"The paper investigates the grokking phenomenon in neural networks, particularly how different mathematical operations and task complexities influence this phenomenon in Transformer models. The authors introduce a complexity scale to systematically vary task difficulty and train Transformer models across these tasks, aiming to identify task types and complexities that significantly affect grokking.","['Can the authors provide more details on the implementation of the complexity scale and the selection of hyperparameters?', 'Are there any detailed ablation studies that dissect the contributions of different components of the proposed method?', 'Can the authors provide more in-depth analysis and discussion of the experimental results?', 'What are the limitations of the study, and are there any potential negative societal impacts or ethical considerations that should be addressed?']",['The study does not discuss the limitations or potential negative societal impacts of the work. It would be beneficial for the authors to address these aspects to provide a more balanced view.'],False,2,2,2,4,4,"['The topic is highly relevant and addresses an intriguing phenomenon in neural network training.', 'The introduction of a complexity scale to systematically vary task difficulty is a novel approach.', 'The use of Transformer models on a diverse set of mathematical tasks provides a comprehensive experimental setup.']","['The paper lacks detailed ablation studies to dissect the contributions of different components of the proposed method.', 'The methodology section needs more clarity and detail, particularly regarding the implementation of the complexity scale and the selection of hyperparameters.', 'The experimental results section is somewhat lacking in depth and does not provide enough insight into the observed phenomena.', 'There is no discussion of the limitations of the study or the potential negative societal impacts, which are crucial for a balanced evaluation.', 'The paper does not address ethical considerations, which could be relevant given the unpredictable nature of the grokking phenomenon.']",3,2,2,4,Reject
adaptive_interruption_grokking,"The paper investigates the impact of adaptive periodic training interruptions on the grokking phenomenon in neural networks. It proposes adaptive interruption strategies triggered by performance thresholds, aiming to enhance generalization performance and stabilize training dynamics. The study is validated through extensive experiments on Transformer models with various datasets.","['Can you provide more detailed descriptions of the interruption strategies, including the specific performance thresholds and learning rate adjustments used?', 'How do you ensure the reproducibility of your experiments, given the adaptive nature of the interruption strategies?', 'Have you considered testing your interruption strategies on more diverse datasets or other neural network architectures beyond Transformers?']","['The increased computational cost is mentioned, but potential negative societal impacts such as energy consumption and resource utilization are not discussed.', 'The paper should include more theoretical analysis to support the empirical findings.', 'The methodology section needs more detail to ensure reproducibility.']",False,2,2,2,3,4,"['The topic is novel and addresses an important aspect of neural network training dynamics.', 'The experimental validation is thorough, involving various datasets and configurations.', 'The paper provides a detailed analysis of training dynamics, including convergence rates, loss behavior, and stability over time.']","['Lack of theoretical foundation for the proposed adaptive interruption strategies.', 'Insufficient detail in the methodology and experimental setup, making reproducibility difficult.', 'Limited scope of datasets and model configurations, affecting generalizability.', 'Poor organization and clarity in the presentation of the paper.', 'Overreliance on empirical results without sufficient theoretical underpinning.']",2,2,2,2,Reject
custom_loss_grokking,"The paper investigates the impact of custom loss functions on the grokking phenomenon in neural networks using Transformer models. It proposes a suite of custom loss functions designed to promote sparsity, robustness, and generalization, and evaluates their effects on grokking through extensive experiments on various datasets.","['Can the authors provide more theoretical justification for the proposed custom loss functions? How do they differ from existing loss functions in the literature?', 'Can the authors provide a more detailed analysis of the empirical results, including statistical significance and comparison with baseline methods?', 'How do the proposed loss functions impact the training dynamics of the Transformer model? Are there any side effects or trade-offs observed?', 'Can the authors provide more detailed descriptions of the custom loss functions and their mathematical formulations?', ""How is the 'grokking point' defined and measured in the experiments?"", 'What datasets were used in the experiments, and what are their characteristics?', 'How do the custom loss functions compare to standard loss functions in terms of computational complexity and training stability?', 'What measures were taken to ensure the robustness and generalizability of the experimental results?']","['The paper does not thoroughly discuss the limitations and potential negative impacts of the proposed custom loss functions. It would be beneficial to include an analysis of any trade-offs or side effects observed during the experiments.', 'The potential negative societal impacts of the proposed approach are not discussed.', 'The paper does not adequately address the limitations of its approach, particularly the generalizability of the findings to other datasets and model architectures.']",False,2,2,2,3,4,"['The topic of grokking in neural networks is highly relevant and timely, with significant implications for improving model generalization and training efficiency.', 'The paper proposes a novel approach by combining the concepts of custom loss functions and grokking, which could lead to new insights in neural network training.', 'The use of Transformer models and the exploration of different datasets provide a broad context for understanding the impact of custom loss functions.']","['The clarity and organization of the paper are lacking. The description of the custom loss functions and their theoretical justification are not well articulated.', 'The empirical results are not thoroughly analyzed or discussed, making it difficult to ascertain the true impact of the proposed methods.', 'The novelty of the proposed loss functions is not well established. It is unclear how they differ from existing loss functions or why they are expected to influence grokking.', 'The paper lacks detailed experimental results and thorough analysis. The presentation of results is minimal, and statistical validation is not robustly discussed.', 'The methodology description is vague, particularly regarding the design and implementation of custom loss functions.', 'The paper does not provide sufficient empirical evidence to support its claims, with a lack of comprehensive ablation studies and comparisons to baseline methods.', ""The evaluation metrics are not well justified, and the definition of the 'grokking point' is unclear."", 'The related work section is insufficient and does not adequately cover recent advancements in the field.']",3,2,2,3,Reject
batch_mix_grokking,The paper proposes a mixed batch training regime to address the grokking phenomenon in neural networks by dynamically switching between full-batch and mini-batch training based on performance metrics. The objective is to leverage the benefits of both batch sizes to enhance generalization performance and improve training dynamics. Experiments on synthetic datasets indicate potential improvements in grokking timing and generalization performance compared to a constant batch size baseline.,"['Can the authors provide a more detailed theoretical explanation for why mixed batch sizes would influence grokking?', 'What are the specific performance metrics used to switch between batch sizes, and why were they chosen?', 'Can the authors present more diverse datasets to validate their approach?', 'How does the proposed method perform on larger, real-world datasets?', 'What are the computational costs associated with dynamically adjusting batch sizes?', 'Can you provide more details on the methodology, specifically how the performance metrics are used to switch between batch sizes?', 'What are the exact architectures and hyperparameters used in the experiments?', 'How many runs were averaged to obtain the results presented in the paper?', 'Can you provide a theoretical justification for why the mixed training regime should improve generalization and influence grokking?', 'Can you provide a detailed mathematical formulation or algorithm for the mixed training regime?', 'How do different performance metrics impact the effectiveness of the mixed training regime?']","['The authors have not adequately discussed the limitations and potential negative societal impact of their work. It is important to understand the computational costs and scalability issues associated with dynamically adjusting batch sizes.', 'The paper lacks a thorough theoretical analysis and detailed methodology, which limits the robustness of the experimental results.', 'Potential negative impacts of the mixed training regime on model convergence and stability should be addressed.']",False,2,2,2,3,4,"['The paper addresses a novel application of mixed batch training to influence the grokking phenomenon.', 'The idea of dynamically switching between full-batch and mini-batch training based on performance metrics is novel and could be potentially impactful if validated properly.', 'The experimental results show potential improvements in generalization performance and training dynamics.']","['The theoretical foundation for why mixed batch sizes would influence grokking is not well-explained.', 'The methodology section is underdeveloped and lacks detailed explanations and mathematical formulations.', 'The experimental setup lacks critical details such as specific datasets used, the architecture of the neural networks, and the exact hyperparameters, making it difficult to assess reproducibility and robustness.', 'The results section is sparse and does not provide sufficient quantitative analysis to convincingly demonstrate the benefits of the proposed method.', 'The paper does not provide theoretical insights or justifications for why the mixed training regime should work better.', 'The explanations of results and ablation studies are not comprehensive and leave many questions unanswered.', 'There is a lack of discussion on the potential negative impacts or limitations of the proposed method.']",2,2,2,2,Reject
learning_rate_grokking,"The paper investigates the impact of learning rates and schedules on the grokking phenomenon in neural networks, focusing on Transformer models trained on small algorithmic datasets. The study systematically explores various learning rates and schedules to identify configurations that facilitate or accelerate grokking.","['Can you provide a theoretical explanation for why certain learning rates or schedules work better?', 'How do you ensure that the findings generalize beyond small algorithmic datasets?', 'Can you include more comprehensive ablation studies and test on more diverse datasets?', 'What are the potential limitations and ethical concerns of your approach?', 'Can the authors provide more detailed explanations for the selection of datasets and models?', 'What is the rationale behind the specific learning rates and schedules chosen for the study?', 'Could additional ablation studies be conducted to further validate the findings?', 'How generalizable are the findings to other neural network architectures and datasets?', 'Can the authors provide a deeper theoretical explanation for why certain learning rates and schedules might influence grokking?', 'Can the authors include more detailed and clear visualizations of the experimental results?', 'What specific datasets were used, and how were they chosen?', 'Can the authors conduct more comprehensive ablation studies to isolate the impact of each component of the method?', 'What are the practical implications of these findings for training neural networks in real-world scenarios?']","['The study is limited to specific datasets and models, which may affect the generalizability of the findings.', 'The potential biases introduced by hardware-specific training times and convergence rates are acknowledged but not thoroughly addressed.', 'The paper does not adequately discuss the limitations of the study, particularly in terms of generalizability and potential biases.', 'There is no mention of the ethical implications or broader impacts of the research.']",False,2,2,2,3,4,"['Interesting exploration of the grokking phenomenon, which is not well understood in the literature.', 'Systematic study involving different learning rates and schedules.', 'Use of Transformer models and small algorithmic datasets to isolate the phenomenon.', 'Comprehensive experimental setup with detailed analysis of training dynamics and generalization performance.', 'Well-written and organized, making it easy to follow.']","['Lacks a strong theoretical foundation explaining why certain learning rates or schedules work better.', 'Narrow experimental setup focused only on small algorithmic datasets, which may not generalize well.', 'Results section lacks depth and could benefit from more comprehensive ablation studies and diverse datasets.', 'The novelty of the contributions could be better highlighted.', 'The methodology section lacks detailed explanations for the selection of datasets and models.', 'Theoretical foundation of the study is weak; the paper lacks a deep theoretical analysis of why certain learning rates and schedules might influence grokking.', 'The paper does not sufficiently address potential limitations or biases in the study.', 'There is a lack of discussion on the ethical implications or broader impacts of the research.']",2,2,3,2,Reject
early_stopping_grokking,"The paper investigates the impact of early stopping strategies on the grokking phenomenon in neural networks, specifically in Transformer models. The study aims to enhance neural network optimization by balancing overfitting prevention and the encouragement of grokking. It explores various early stopping strategies, their effects on grokking timing, and generalization performance.","['Can the authors provide more detailed ablation studies to understand the impact of different early stopping parameters?', 'How do the proposed early stopping strategies perform across different neural network architectures beyond Transformers?', 'Can the authors clarify the experimental setup and provide more comprehensive quantitative results?', 'Can you provide a more detailed theoretical explanation for the impact of early stopping on grokking?', 'Why were only Transformer models chosen for this study? Would the findings generalize to other architectures?', 'Are there alternative metrics that could be used for early stopping besides validation accuracy?', 'How do the findings compare to other regularization techniques?']","['The study is limited to Transformer models and specific datasets, which may not generalize to other model architectures and datasets.', 'Early stopping strategies are based solely on validation accuracy, which may not capture all aspects of model performance.', 'The paper does not adequately address the limitations of the proposed early stopping strategies across diverse datasets and model architectures.', 'Potential negative societal impacts are not discussed.']",False,2,3,3,4,4,"['Addresses an interesting and relevant phenomenon in neural network training: grokking.', 'Explores the practical implications of early stopping strategies on model generalization.', 'Provides a systematic exploration of different early stopping parameters.']","['Limited originality as it combines two well-studied areas (grokking and early stopping) without significant innovation.', 'Insufficient experimental validation with limited model architectures and datasets.', 'Lacks comprehensive ablation studies and detailed quantitative results.', 'Theoretical underpinning for the impact of early stopping on grokking is not well-developed.', 'The clarity of the paper is lacking, with some sections and methodological details being insufficiently explained.', 'The significance of the findings is limited, with minimal demonstrated impact on the field.']",2,2,3,3,Reject
targeted_perturbation_grokking,"The paper proposes a novel method for influencing the grokking phenomenon in neural networks using targeted periodic perturbations such as cyclical learning rates and Gaussian noise injection. The method aims to control and observe the grokking phenomenon, potentially improving model generalization. Extensive experiments on Transformer models across various datasets are conducted to evaluate the impact of these perturbations.","['Can the authors provide more detailed explanations and visualizations of the experimental results?', 'What are the theoretical underpinnings for the chosen perturbation strategies?', 'How reproducible are the results across different datasets and model architectures?', 'Can the authors provide more detailed descriptions and pseudocode for the implementation of the cyclical learning rates and Gaussian noise injection?', 'Have the authors considered testing the proposed perturbation strategies on real-world datasets to validate their effectiveness beyond synthetic tasks?', 'How do the computational costs of the proposed perturbation strategies compare to standard training methods?', 'Can the authors provide more details on the implementation of Gaussian noise injection and the specific layers it was applied to?', 'How were the hyperparameters for cyclical learning rate and noise injection tuned, and what were their values?', 'Can the authors provide a more comprehensive comparison with additional baselines and evaluation metrics?']","['The paper mentions the higher computational cost but does not explore other potential limitations or negative impacts in detail.', 'The effectiveness of the proposed method may vary across different datasets and model architectures, indicating the need for further research to generalize the findings.', 'The computational cost of implementing these perturbations is higher than standard training, which may limit their practical application in some scenarios.', 'The paper relies heavily on empirical results and lacks a strong theoretical basis.', 'The results are based on synthetic datasets, which may limit their generalizability.']",False,2,2,2,3,4,"['The paper addresses the interesting and challenging phenomenon of grokking in neural networks.', 'The proposed use of cyclical learning rates and Gaussian noise injection is novel in the context of influencing grokking.', 'The idea has potential implications for improving understanding and control of neural network training dynamics and generalization.']","['The experimental results are vaguely presented, lacking detailed analysis and clear interpretation.', 'The theoretical justification for why these perturbations would influence grokking is insufficient.', 'The paper does not include figures and tables referenced in the text, making it hard to verify claims.', 'The computational cost and limitations of the approach are not thoroughly discussed.', 'The methodology section lacks sufficient detail and clarity, making it difficult to fully understand the implementation and scope of the perturbation strategies.', 'The experiments are primarily conducted on synthetic datasets, which limits the generalizability of the findings to real-world applications.', 'There is no thorough theoretical analysis or explanation of why the perturbations influence grokking, which weakens the scientific contribution of the paper.', 'The practical implications and potential downsides, such as increased computational costs, are not adequately addressed.', 'The paper does not compare the proposed perturbation strategies with other existing methods for controlling learning dynamics in neural networks.']",3,2,2,3,Reject
regularization_grokking,"The paper investigates the impact of different regularization techniques (L1, L2 regularization, dropout, and weight decay) on the grokking phenomenon in neural networks, particularly Transformer models. It aims to understand how these techniques affect the onset and extent of grokking by defining it as the point when models reach 99% validation accuracy.","['Can the authors provide a more detailed theoretical analysis to support their empirical findings?', 'How do the authors plan to address the limitations related to small dataset sizes and limited hyperparameter choices?', 'Can the authors improve the clarity of the presentation, especially in the methodological and results sections?', 'What are the broader implications of these findings, and are there any potential negative societal impacts?', 'Could you provide more detailed visualizations and explanations for the impact of each regularization technique?', 'How do different hyperparameters affect the results, and what is the sensitivity of the findings to these choices?', 'Are there other regularization techniques or combinations that might be more effective in influencing grokking?']","['The study is limited by the use of relatively small datasets and a specific model architecture (Transformers). This may affect the generalizability of the findings.', 'Potential negative societal impacts are not adequately discussed.']",False,2,2,2,3,4,"['The paper addresses an interesting and relatively less explored phenomenon in neural networks known as grokking.', 'It systematically evaluates the effect of various regularization techniques on grokking.', 'The study uses a robust experimental setup with Transformer models and mathematical operation datasets, ensuring a structured approach to the investigation.']","['The results are not particularly novel or insightful, and the paper does not provide a deep understanding of the mechanisms behind grokking.', 'The clarity and presentation of the paper could be significantly improved. The descriptions of the experiments and results are somewhat superficial.', 'The paper lacks detailed analyses and ablation studies beyond the basic evaluation of different regularization techniques.', 'The limitations and potential negative societal impacts are not adequately addressed.']",2,2,2,2,Reject
training_duration_grokking,"The paper investigates the impact of training duration on the grokking phenomenon in neural networks, particularly focusing on Transformer models trained on small algorithmic datasets. The authors modify the training loop to systematically increase training durations and conduct experiments across various training durations, model complexities, and dataset sizes. They aim to identify optimal training durations that facilitate or accelerate grokking, defined as achieving 99% validation accuracy.","['Can the authors provide more detailed experimental results, including more comprehensive metrics and comparisons?', 'What are the theoretical insights underlying the grokking phenomenon observed in the experiments?', 'Can the authors include more ablation studies to validate the relevance of specific components of their methodology?', 'Can the authors provide more clarity on the choice of datasets and the implementation details to ensure reproducibility?', 'Can you provide a comparison with baseline methods and justify the selection of training durations?', 'Can you offer a more detailed analysis of the results and explain the observed phenomena?', 'Can you clarify the experimental setup and the rationale behind the chosen configurations?', 'Do the authors plan to extend their experiments to larger datasets and more complex models in future work?', 'Can the authors provide more detailed visualizations of their experimental results to enhance clarity?', 'How do the findings compare with other established methods for improving generalization in neural networks?']","['The paper does not address the theoretical underpinnings of the grokking phenomenon, which is a significant limitation.', 'The experiments are limited to small-scale datasets and models. Future work should explore larger datasets and more complex models.', 'The major limitation is the lack of detailed methodology and clarity in the presentation. Additionally, the theoretical aspects of grokking are not well explored, which limits the depth of the analysis.']",False,2,2,2,4,4,"['The topic of the paper is interesting and relevant, addressing the grokking phenomenon in neural networks, which is a relatively unexplored area.', 'The authors provide a systematic approach to investigate the impact of training duration on grokking, which could lead to insights into optimal training strategies.', 'The paper addresses a novel and practical problem of understanding the effect of training duration on the grokking phenomenon.']","['The paper lacks depth in experimental analysis. While it covers various training durations, model complexities, and dataset sizes, the results section is not comprehensive enough to draw strong conclusions.', 'Theoretical insights into the grokking phenomenon are missing. The paper does not delve into the underlying reasons for the observed phenomena, which limits its contribution to the field.', 'The paper does not provide sufficient ablation studies to validate the relevance of specific components of the methodology.', 'Some parts of the methodology, particularly the choice of datasets and the implementation details, are not clearly described, making it difficult to reproduce the experiments.', 'The experimental setup lacks comparison with other baseline methods and does not provide a clear justification for the selected training durations or model architectures.', 'Clarity is an issue throughout the paper, with several sections lacking detailed descriptions and clear presentations of experimental setups and outcomes.', 'The significance of the contributions is not well justified, and the results do not convincingly demonstrate the claimed improvements.', 'Potential limitations and ethical concerns are not adequately addressed.']",2,2,2,3,Reject
weight_init_grokking,"The paper investigates the impact of different weight initialization strategies (Xavier, Kaiming, and random normal) on the grokking phenomenon in neural networks. It aims to determine how these strategies affect the timing and extent of the transition from overfitting to generalization. The study uses a Transformer model and evaluates the initialization methods across diverse datasets including modular arithmetic and permutation groups.","['Can you provide more detailed visualizations and discussions on how different initialization methods impact the grokking phenomenon?', 'Have you considered the impact of varying other hyperparameters such as learning rate and batch size alongside initialization methods?', 'Can the authors provide more detailed analysis and discussion of the experimental results?', 'How do different initialization methods interact with other training parameters such as learning rate and batch size?', 'Can the authors expand the scope of datasets to include more complex and varied tasks?', 'Can you provide more detailed experimental results, including specific metrics and comparisons across all tested datasets?', 'Can you elaborate on the theoretical reasoning behind why different initialization methods might affect the grokking phenomenon?', 'What are the hyperparameters and model architectures used for each dataset in your experiments?', 'How can the findings of this study be applied in real-world scenarios or in the development of new training strategies?', 'Can the evaluation metrics be expanded to include stability and performance on out-of-distribution data?', 'How do the findings generalize to other model architectures and types of datasets?', 'Can the authors discuss the limitations and potential negative impacts of their findings?']","['The study does not explore the interplay between initialization methods and other training parameters like learning rate and batch size.', 'The paper does not address potential limitations or negative societal impacts of focusing on grokking and weight initialization strategies.', 'The practical significance of the findings is not fully explored.', 'The study is limited to a specific model architecture and task type.', 'There is a need for a deeper theoretical understanding of the observed phenomena.']",False,2,2,2,3,4,"['Addresses a relatively unexplored area in neural network training dynamics.', 'Uses a systematic approach to compare different initialization methods.', 'Maintains controlled experimental conditions by fixing hyperparameters.']","['Lacks originality and does not introduce new techniques or significant theoretical insights.', 'Limited diversity in model architectures and hyperparameters, which constrains the generalizability of the findings.', 'Results section is insufficiently detailed, missing comprehensive discussions and visualizations.', 'The practical implications of the findings are not well-articulated.', 'The experimental results section lacks depth and detailed analysis, making it difficult to assess the robustness of the conclusions.', 'The methodology section lacks clarity and depth. Key details about the experimental setup, such as the exact hyperparameters and architecture for each dataset, are not sufficiently described.', 'The paper does not provide a thorough theoretical analysis or explanation for why certain initialization methods might affect grokking differently.', 'There is no discussion on the potential limitations or negative impacts of the proposed methods, nor any ethical considerations.']",2,2,2,2,Reject
loss_function_grokking,"The paper investigates the impact of different loss functions on the grokking phenomenon in Transformer models. The study evaluates common loss functions such as Mean Squared Error, Cross-Entropy Loss, Hinge Loss, and several custom designs to understand their influence on neural network generalization.","['Can the authors provide more details about the design of custom loss functions?', 'How do the results compare with other state-of-the-art techniques not covered in the study?', 'How were the datasets selected, and what are their specific characteristics?', 'What measures were taken to ensure the reliability and validity of the experimental results?', 'Why do certain loss functions accelerate the grokking phenomenon? This needs deeper analysis.', 'How do the chosen loss functions influence different stages of the grokking phenomenon?', 'Can the authors clarify the theoretical basis for hypothesizing that different loss functions would impact grokking?', 'How do the authors account for variability in results due to different random seeds?']","['The paper lacks clarity and detail in describing the experimental setup and custom loss functions.', 'The scope of the datasets and loss functions evaluated is limited, reducing the generalizability of the findings.', 'The authors mention limitations but do not adequately address them.', 'There is no discussion on the potential negative societal impacts of their work.']",False,2,2,2,3,4,"['Addresses an important and underexplored phenomenon in neural network training.', 'Performs comprehensive experiments across various datasets and loss functions.']","['Lacks novelty in the approach of evaluating loss functions; no new methodological contributions.', 'Experimental design is not sufficiently diverse to support broad claims.', 'Paper is not well-organized, making it difficult to follow experimental setup and results.', 'Results are not groundbreaking and do not advance the state of the art significantly.', 'Theoretical backing for why certain loss functions might influence grokking is weak.', 'Insufficient details are provided for reproducibility; key parameters and implementation details are missing.', 'Limited discussion on the limitations and potential negative societal impact of the work.']",2,2,2,2,Reject
dynamic_training_grokking,"The paper explores the impact of dynamic training regimes on the grokking phenomenon in neural networks. It introduces a novel training schedule that periodically alternates between various batch sizes, learning rates, and optimization algorithms every 500 steps. The study aims to influence the occurrence and timing of grokking by training Transformer models on algorithmic datasets.","['Can the authors provide a stronger theoretical foundation for why dynamic training regimes would trigger grokking?', 'Can the authors conduct more comprehensive experiments across diverse datasets and model architectures?', 'Can the authors provide more detailed implementation information to facilitate reproducibility?', 'What are the potential limitations and negative societal impacts of this approach?', 'Can you provide a more detailed theoretical explanation for why dynamic training regimes accelerate grokking?', 'How generalizable is this method to other types of models and datasets?', 'Can you provide guidelines for tuning the dynamic regimes?', 'Can the authors improve the clarity and organization of the methodology section, providing more detailed explanations of the dynamic training schedule?', 'Can the authors provide more comprehensive ablation studies to thoroughly explore the impact of individual dynamic parameters?', 'Could you include a more detailed statistical analysis of your results?', 'What are the specific algorithmic datasets used, and why were they chosen?', 'Can the authors provide more detailed hyperparameter settings and code to ensure reproducibility?']","['The paper does not discuss the potential limitations or negative societal impacts of the approach. It would be beneficial for the authors to address these aspects to provide a balanced view of their work.', 'The need for careful tuning of parameters might limit the practical applicability of the proposed method.', 'The limitations include the lack of a theoretical foundation, limited experimental scope, and insufficient detail for reproducibility.', 'The paper does not adequately address the potential limitations and negative societal impacts of the proposed method.', 'The authors should discuss the generalizability of their approach to different datasets and model architectures.']",False,2,2,2,3,4,"['The idea of using dynamic training regimes to influence grokking is novel and could have significant implications for neural network training.', 'Addresses an important and poorly understood phenomenon in neural networks: grokking.', 'The experimental results show that certain dynamic regimes significantly accelerate the onset of grokking and improve generalization performance compared to baseline models.']","['The concept of dynamic training regimes is not entirely new, and the paper does not provide a strong theoretical foundation for why these regimes would specifically trigger grokking.', 'The experimental results are not exhaustive and would benefit from more comprehensive experiments across diverse datasets and model architectures.', 'Implementation details are sparse, making it difficult to reproduce the results.', 'There is no discussion on the potential limitations or negative societal impacts of the approach.', 'The paper lacks clarity in some sections, particularly in the explanation of the dynamic training schedule and the experimental setup.', 'Lacks a thorough theoretical explanation for why the proposed dynamic regimes accelerate grokking.', 'Unclear how generalizable the findings are to 
other types of models and datasets.', 'Requires careful tuning of parameters, which might limit practical applicability.', 'The clarity and organization of the paper are lacking, making it difficult to follow the methodology and results.', 'The significance of the contributions is not convincingly demonstrated, and the performance gains are not well-quantified.', 'The paper lacks sufficient detail to ensure reproducibility, with key hyperparameters and configurations inadequately documented.', 'Results section lacks detailed statistical analysis to robustly support the claims.', 'The experimental setup appears somewhat arbitrary, with choices like changing parameters every 500 steps lacking strong justification.']",2,2,2,3,Reject
cyclical_lr_grokking,"The paper investigates the impact of cyclical learning rates (CLR) on the grokking phenomenon in neural networks, particularly using Transformer models. The authors propose a CLR scheduler with a triangular policy and evaluate its effectiveness using Transformer models on modular arithmetic and permutation datasets. The study aims to identify optimal CLR configurations that accelerate or stabilize grokking.","['Can you provide a detailed theoretical explanation for why CLR would influence grokking?', 'Could you elaborate on the datasets used, including their characteristics and relevance to the study?', 'What specific metrics were used to assess the effectiveness of CLR in facilitating grokking?', 'How do your results compare with other learning rate scheduling methods?', 'Can the authors provide more detailed descriptions of the datasets used and the specific experimental setup?', 'What are the exact metrics used to define and measure grokking in the experiments?', 'Can the authors elaborate on the theoretical rationale behind why CLR might facilitate grokking?', 'What are the broader implications and potential applications of the findings from this study?', 'How would the proposed CLR approach perform on more diverse and common datasets?', 'Is there any specific reason for choosing the triangular policy for CLR?']","[""The paper does not adequately address the theoretical basis for CLR's impact on grokking."", 'The experimental results are not comprehensively analyzed, and additional comparisons with other methods are needed.', 'The datasets used are limited to modular arithmetic and permutation tasks, which do not reflect the diversity of real-world applications.', 'Potential negative societal impacts are not discussed, although they might be minimal in this context.']",False,2,2,2,3,4,"['Addresses an intriguing and relatively novel topic in neural network generalization.', 'Proposes a practical approach with CLR that could be beneficial for training stability and generalization.', 'Includes empirical evaluation using Transformer models on various datasets.']","['Lacks a detailed theoretical explanation for why CLR would influence grokking.', 'Experimental setup and results are not described in sufficient detail, making it difficult to assess the validity and significance of the findings.', 'Repeats certain phrases and lacks depth in the analysis of results.', 'Fails to provide comprehensive comparisons with other learning rate scheduling methods.', 'Overall clarity and organization of the paper are lacking, making it hard to follow and replicate the experiments.']",3,2,2,3,Reject
periodic_reset_grokking,The paper explores the phenomenon of grokking in neural networks by introducing periodic resets of model weights during training. It aims to investigate how these resets can influence the timing and extent of grokking. The authors conduct extensive experiments with Transformer models trained on diverse datasets and compare performance with baseline models.,"['Can you provide a deeper theoretical explanation for why and how periodic resets influence grokking?', 'How do you plan to generalize your findings to other neural network architectures and datasets?', 'What are the potential downsides or trade-offs of using periodic resets in neural network training?', 'What are the specific criteria for selecting reset intervals and checkpoints? How do these choices impact the results?', 'Have you compared your method with other techniques aimed at enhancing generalization, beyond the baselines provided?']","['The paper does not sufficiently address the limitations of the proposed method or its potential ethical concerns.', 'The effectiveness of periodic resets may vary depending on the dataset and model architecture. Further research is needed to optimize reset strategies for different scenarios.']",False,2,2,2,3,4,"['The paper addresses an interesting phenomenon in neural networks known as grokking.', 'The proposed method of periodic resets is novel and could have significant implications for neural network training.', 'Extensive experimentation with Transformer models across diverse datasets.']","['The paper lacks a thorough theoretical explanation for why and how periodic resets influence grokking.', 'The experimental setup is narrow, focusing primarily on Transformer models and specific datasets, limiting the generalizability of the findings.', 'The results are not compellingly presented, lacking deeper analysis of trade-offs and potential downsides.', 'The methodology section is somewhat vague and lacks details on specific implementation choices, making it difficult to reproduce the results.', 'The paper does not sufficiently address potential limitations and ethical concerns.']",3,2,2,3,Reject
systematic_lr_holidays_grokking,The paper proposes a novel learning rate scheduling strategy termed 'learning rate holidays' to enhance the grokking phenomenon in neural networks. The approach involves periodic increases in the learning rate to accelerate and improve generalization performance. The paper validates this method through experiments on Transformer models and small algorithmic datasets.,"['How are the parameters for learning rate holidays systematically varied and optimized?', 'Can the authors provide more comprehensive results on a wider range of datasets and neural network architectures?', 'Can the authors provide a more detailed explanation of the theoretical basis for why learning rate holidays should induce grokking?', 'What specific hyperparameter settings were used in the experiments, and how were they chosen?', 'Can the authors provide more comprehensive ablation studies to isolate the effects of magnitude, duration, and interval of the learning rate holidays?', 'How do the results compare with other advanced learning rate schedules or optimization techniques?']","['The need for dataset-specific tuning and the higher computational cost due to increased learning rates during holidays.', 'The paper does not discuss potential negative societal impacts.']",False,2,2,2,3,4,"['The concept of dynamic learning rate adjustment to induce grokking is intriguing.', 'The paper addresses a relevant issue in neural network training, particularly for small datasets.']","[""The idea of 'learning rate holidays' is not entirely novel; similar concepts like cyclic learning rates have been explored before."", 'The paper lacks clarity in explaining how the learning rate holidays are systematically varied and optimized.', 'The experimental results are limited to a narrow set of datasets and do not explore various neural network architectures.', ""The statistical analysis is weak and does not convincingly demonstrate the method's superiority over traditional learning rate schedules."", 'The paper does not adequately address potential limitations and negative societal impacts of the proposed method.']",2,2,2,2,Reject
activation_function_grokking,"The paper investigates the impact of different activation functions on the grokking phenomenon in neural networks using a Transformer model. It systematically evaluates five common activation functions—ReLU, GELU, Swish, Tanh, and Sigmoid—across diverse datasets to assess their impact on grokking. The study aims to provide practical guidance for selecting activation functions to optimize neural network performance.","['Can the authors provide more insights into why certain activation functions affect grokking differently?', 'How do the results generalize to other neural network architectures and more complex datasets?', 'Can the authors conduct more comprehensive experiments, including different model architectures and a broader range of datasets?', 'Why were the chosen datasets selected, and how well do they represent the diverse applications of Transformer models?', 'Can the authors provide a deeper analysis of the results, explaining why certain activation functions perform better or worse?', 'What are the practical implications and applications of the findings?', 'How do the results generalize to other model architectures?', 'What is the interplay between activation functions and other hyperparameters?', 'Can the authors provide more details on the experimental setup and hyperparameters?', 'Can the authors provide a more detailed analysis of why certain activation functions (e.g., Sigmoid and Tanh) perform better or worse in facilitating grokking?', 'How would the results vary with different model architectures or more complex datasets?', 'Could the authors include more detailed discussions on the limitations of their study and potential future work?', 'Can you provide more details on the specific hyperparameters, dataset characteristics, and model configurations used in the experiments?', 'Why do you think certain activation functions lead to earlier or later grokking? Can you provide a theoretical explanation or hypothesis?', 'Have you considered using cross-validation or other robust statistical methods to validate your findings?', 'Can you elaborate on the practical implications of your results for neural network design and optimization?']","['The paper does not adequately address the limitations of the study or potential negative societal impacts. It would be beneficial to discuss these aspects in more detail.', 'The study is limited by its experimental design, choice of datasets, and depth of analysis. 
These limitations should be addressed to strengthen the paper.', ""The study's scope is limited to a specific Transformer architecture and datasets, which may not generalize to other models or tasks."", 'More detailed analyses and explanations are needed to strengthen the findings.']",False,2,2,2,3,4,"['The paper addresses an interesting and relatively unexplored aspect of neural networks: the influence of activation functions on the grokking phenomenon.', 'The study systematically evaluates five common activation functions, providing a structured comparison methodology.', 'The results offer some practical guidance for selecting activation functions to optimize neural network performance.']","['The methodology lacks depth and does not provide a thorough analysis of why certain activation functions affect grokking differently.', 'The experiments are not sufficiently comprehensive; they are limited to a specific Transformer model and a few datasets, which might not generalize well to other architectures or tasks.', 'The paper does not adequately address the limitations of the study or potential negative societal impacts.', 'The analysis is somewhat superficial and does not provide enough insights into the underlying mechanisms of grokking.', 'The experimental design lacks sufficient detail, making it difficult to assess the robustness and reproducibility of the results.', 'The results are presented without adequate depth and clarity, lacking detailed analysis and insights.', 'There is a lack of theoretical underpinning explaining why different activation functions might influence grokking.', 'The paper does not employ robust statistical methods or cross-validation to validate its findings.', 'The discussion on the implications of the results is superficial and does not provide actionable insights for practitioners.']",2,2,2,2,Reject
refined_dynamic_training_rhythm_grokking,"The paper explores the impact of dynamic training regimes on the grokking phenomenon in neural networks, using Transformer models on small algorithmic datasets. It introduces adaptive training strategies to enhance and accelerate grokking, aiming to improve model generalization.","['Can the authors provide more detailed explanations of the dynamic training regimes used?', 'How do the dynamic regimes compare with other adaptive training methods not considered in the paper?', 'What theoretical insights can the authors provide to explain the observed impact on grokking?']","['The study is limited to small algorithmic datasets, which impacts the robustness and generalizability of the findings.', 'Potential negative impacts on model stability and training efficiency should be considered.']",False,2,2,2,3,4,"['The paper addresses an intriguing and relatively underexplored phenomenon in neural networks: grokking.', 'It proposes novel dynamic training strategies that adapt based on performance metrics.', 'The experimental setup is clear, and the results indicate that dynamic training regimes can influence the grokking process.']","['The explanations of the dynamic training regimes and their implementation are vague and lack sufficient detail.', 'The experimental results are limited to small algorithmic datasets, which restricts the generalizability of the findings.', 'There is a lack of theoretical analysis or deeper insights into why and how the dynamic training regimes impact grokking.', 'The paper is poorly organized, with missing figures and incomplete sentences, significantly hampering readability.', 'The paper does not adequately address potential limitations or the broader applicability of the proposed methods.']",2,2,2,2,Reject
initialization_grokking,"The paper investigates the impact of different initialization strategies on the grokking phenomenon in Transformer models. Various initialization methods, including Xavier and Kaiming, are analyzed for their effects on training dynamics, timing, and generalization performance of grokking. The paper aims to provide practical insights for neural network practitioners through rigorous experimentation and detailed analysis.","['Could the authors provide more detailed visualizations and analysis of the grokking phenomenon under different initialization strategies?', 'How do the findings generalize to other neural network architectures beyond Transformers?', 'What are the theoretical implications of the observed results on the understanding of grokking?', 'What specific characteristics of initialization methods do you hypothesize to influence grokking, and why?', 'How were the datasets chosen, and what specific tasks do they represent?', 'Can you provide more details on the model configurations, hyperparameters, and training procedures?', 'What are the exact metrics used to measure grokking, and how are they computed?', 'Can you compare your findings with more baselines or state-of-the-art methods to validate your results?', 'Can the authors provide more details on the implementation and code to enhance reproducibility?', 'Have you considered additional initialization strategies beyond those mentioned in the paper?', 'Is there any theoretical analysis that supports the empirical findings?', 'How do other factors like learning rate schedules or regularization techniques influence grokking? Can the authors discuss this?']","['The study is limited to Transformer models and a specific set of datasets. The generalizability of the results to other architectures and more complex tasks is not addressed.', 'There is a need for a deeper theoretical exploration of the grokking phenomenon and its dependencies on initialization techniques.', 'Authors need to clearly articulate limitations of their experimental setup and discuss potential negative societal impacts.', 'More comprehensive ablation studies and comparisons with other techniques are necessary to validate the findings.', 'The paper briefly touches on limitations but does not address potential negative societal impacts.', ""The study's limited scope and depth might hinder its generalizability and the overall impact of its findings."", 'The paper could benefit from a broader range of initialization strategies and datasets to provide a more comprehensive understanding.', 'The paper does not explore other potentially influential factors on grokking, such as learning rate schedules or regularization techniques.']",False,2,2,2,3,4,"['Addresses the timely and relevant problem of grokking in neural network training, which is not well understood.', 'Systematic analysis of multiple initialization strategies, which is valuable for practitioners.', 'The experimental setup is well-defined with specific metrics for evaluating the impact of initialization on grokking.']","['Lacks a clear demonstration of how the findings advance the current understanding significantly beyond existing literature.', 'Experiments are limited to a few datasets and Transformer models, which may not generalize to other architectures or more complex datasets.', 'Presentation of results could be improved with more detailed analysis and visualization of the findings.', 'Lacks a clear theoretical framework or hypothesis explaining the influence of initialization 
methods on grokking.', 'Inadequate description of the experimental setup, datasets, model architectures, and hyperparameters.', 'Results section is shallow, with high-level findings and insufficient statistical rigor.', 'No comparison with baseline or state-of-the-art methods beyond random initialization.', 'Poor clarity in writing and organization; difficult to follow experimental procedures and results interpretation.', 'The study lacks novelty as it explores well-known initialization techniques without introducing new methods or significant advancements.', 'The paper does not provide detailed implementation details or code, hindering reproducibility.', 'The contributions are incremental and lack the groundbreaking impact needed for acceptance in a prestigious ML venue.']",2,2,3,2,Reject
model_size_grokking,"The paper investigates the impact of model size on the grokking phenomenon in neural networks, focusing on Transformer models. It systematically varies model dimensions such as embedding size, number of attention heads, and layers to observe changes in grokking behavior. The study uses synthetic datasets and monitors various metrics to identify optimal configurations for grokking.","['Why did you choose to use synthetic datasets instead of real-world datasets?', 'How do you plan to address the limitations related to fixed hyperparameters in future work?', 'How were the synthetic datasets generated, and what specific tasks do they involve?', 'Can the authors provide more details on the hyperparameter settings and the specific configurations of the Transformer models?', 'How do the findings compare to existing literature on model size and generalization?', 'Can the authors provide more details on the implementation to enhance reproducibility?', 'Have the authors considered evaluating the impact of other hyperparameters and different training regimes?', 'Can the authors include comparisons with other model architectures and additional baselines?', 'Can the authors provide more detailed results and analysis to support their conclusions?', 'How do the findings generalize to real-world datasets?', 'Can the authors clarify the experimental setup and controls in more detail?', 'What are the potential limitations of this study, and how can they be addressed?']","['The use of synthetic datasets and fixed hyperparameters limits the generalizability of the findings.', 'The paper lacks a robust discussion on how to address these limitations in future work.', 'Potential negative societal impacts are not discussed.', 'The study is limited by its focus on synthetic datasets and a narrow set of hyperparameters. More diverse datasets and variations in training regimes could provide a more comprehensive understanding of the grokking phenomenon.']",False,2,2,2,3,4,"['Addresses a relevant phenomenon in neural network training.', 'Systematic exploration of model dimensions is useful for understanding grokking.']","['Uses synthetic datasets, limiting the generalizability of findings.', 'The originality of the work is limited, as it builds on existing concepts without significant innovation.', 'Fixed hyperparameters across experiments could lead to biased results.', 'Lacks depth in evaluation metrics and analysis.', 'Clarity could be improved with more detailed explanations and comprehensive visualization.', 'The experimental setup is shallow, with insufficient exploration of hyperparameters and datasets.', 'The paper is poorly organized, lacking key details and clear analysis.', 'Findings are not particularly impactful or novel.', 'There is a lack of detailed analysis on the impact of other hyperparameters and different training regimes.', 'The paper could benefit from additional baselines and comparisons with other model architectures.', 'Insufficient implementation details make reproducibility challenging.', 'The results section is weak and lacks substantial evidence to support the conclusions.', 'The experimental setup and controls are not clearly defined, making it difficult to assess the robustness of the findings.', 'There is a lack of theoretical insights or novel methodological contributions.']",2,2,2,2,Reject
targeted_noise_grokking,"The paper explores the impact of targeted noise injection (dropout and Gaussian noise) on the grokking phenomenon in neural networks. It aims to identify optimal noise conditions that facilitate or accelerate grokking, providing insights into neural network learning dynamics. The study includes experiments on synthetic datasets to evaluate the influence of noise injection on grokking timing and performance.","['Can the authors provide a more detailed theoretical explanation for why specific noise configurations facilitate grokking?', 'How do the introduced noise types affect different layers of the neural networks differently, if at all?', 'What are the computational costs associated with the proposed noise injection strategies?', 'How do the findings generalize to more complex, real-world datasets?', 'Can the authors conduct additional ablation studies to explore different noise configurations and their effects?', 'Can you provide more comprehensive tables and figures to support your results?', 'What are the practical implications of your findings for neural network training in different contexts?', 'What are the potential ethical considerations and societal impacts of this work?']","['The study primarily focuses on synthetic datasets, which may not reflect real-world scenarios.', 'The paper lacks sufficient depth in experimental analysis and visualization.', 'The computational cost of training with noise injection is higher, which may limit its applicability in resource-constrained environments.']",False,2,2,2,3,4,"['Addresses an intriguing and underexplored phenomenon in neural network training.', 'Proposes a novel approach by using noise injection to influence neural network learning dynamics.', 'The experimental setup includes various noise intensities and frequencies, offering a wide spectrum of observations.']","['Lacks a thorough theoretical foundation explaining why the proposed noise strategies work.', 'Experiments are limited to synthetic datasets, raising concerns about the generalizability of the findings to real-world data.', 'Insufficient experimental validation and lack of comprehensive comparison with baseline models or noise-free settings.', 'The paper lacks clarity in describing the methodology and experimental setup, making reproducibility difficult.', 'Figures and visualizations are improperly referenced or missing, affecting the clarity and reproducibility of the results.', 'The paper does not address potential negative societal impacts or ethical considerations.']",2,2,2,3,Reject
intermittent_training_grokking,"The paper investigates the impact of intermittent training and evaluation cycles on the grokking phenomenon in neural networks. It introduces various intermittent training regimes and compares them against continuous training baselines using multiple datasets. The authors claim that certain intermittent training strategies can influence the timing and extent of grokking, providing new insights into neural network generalization.","['Can the authors provide a more detailed theoretical justification for the choice of intermittent training regimes?', 'What other datasets and neural network architectures were considered, and why were they not included?', 'Can the authors provide more rigorous statistical analysis to support their claims?', 'Can the authors provide more details on the experimental setup and justify the choice of hyperparameters?', 'How do the results generalize to other neural network architectures beyond the ones used in the experiments?', 'What are the practical implications of these findings for real-world neural network training?']","['The paper does not discuss the limitations of the proposed approach in enough detail. Specifically, the potential negative impacts of intermittent training on other aspects of neural network performance are not addressed.', 'The computational cost and practical feasibility of implementing such regimes in real-world scenarios are not addressed.']",False,2,2,2,3,4,"['The paper addresses an intriguing and relatively underexplored area in neural network training dynamics.', 'The proposed intermittent training regimes offer a novel perspective on how to potentially influence the generalization behavior of neural networks.', 'The use of multiple datasets to validate the findings adds some robustness to the results.']","['The paper lacks a detailed theoretical foundation for why intermittent training regimes should influence grokking, and how these regimes are designed.', 'The experimental setup is not varied enough; more datasets and different types of neural network architectures should be used to generalize the findings.', 'The statistical analysis of the results is not adequately rigorous. More detailed metrics and statistical tests are needed to confirm the significance of the findings.', 'The paper does not adequately discuss the limitations or potential negative societal impacts of the proposed approach.', 'The clarity of the presentation, particularly in the methodological sections, needs improvement. Some crucial details about the experimental setup are missing or not well-explained.']",2,2,2,2,Reject
structured_interruption_grokking,"The paper explores the impact of structured periodic training interruptions on the grokking phenomenon in neural networks, particularly in Transformer models trained on small algorithmic datasets. The proposed method involves introducing weight resets and gradient-informed adjustments at regular intervals to study their effects on model generalization and training dynamics.","['Can you provide a more detailed theoretical justification for using structured periodic training interruptions to influence grokking?', 'How does the proposed method compare with existing baseline methods in a more rigorous and comprehensive manner?', 'Can you clarify the experimental setup and provide more detailed descriptions of the datasets, hyperparameters, and training procedures used?', 'Have you considered additional ablation studies to isolate the effects of different types of interruptions and their frequencies?', 'What are the potential limitations of the proposed method, and how broadly applicable do you believe it to be across different datasets and neural network architectures?']","['The paper does not provide a thorough discussion of the potential limitations of the proposed method.', 'The broader applicability of the method across different datasets and network architectures is not addressed.']",False,2,2,2,3,4,"['Addresses the important and relatively underexplored phenomenon of grokking in neural networks.', 'Proposes a novel methodological approach involving structured periodic training interruptions.', 'Initial results suggest that the method can influence the timing and extent of grokking.']","['Theoretical justification for the proposed interruptions and their connection to grokking is weak.', 'Experimental validation is limited and lacks rigorous comparative analysis with baseline methods.', 'Clarity of the writing is insufficient, particularly in the explanation of technical details and experimental setup.', 'The paper does not adequately address the potential limitations and broader applicability of the proposed method.', 'Lacks detailed ablation studies and sensitivity analyses for the proposed structured interruptions.']",3,2,2,3,Reject
attention_config_grokking,"The paper investigates the impact of varying attention head configurations on the grokking phenomenon in transformer models, aiming to optimize model performance on small algorithmic datasets. The authors propose a systematic approach to modify transformer models by varying the number and distribution of attention heads across layers. The study includes extensive experiments analyzing the timing and extent of grokking, generalization performance, and training dynamics. The results suggest that specific attention head configurations can accelerate grokking and improve generalization.","['Can the authors provide more detailed experimental results and analysis to support their claims?', 'Can the authors clarify the methodological choices and provide comprehensive visualizations and comparative studies?', 'What are the specific implementation details that would allow others to reproduce the results?', 'How significant are the reported improvements in practical applications?', 'How do the findings generalize to larger datasets or other types of data beyond small algorithmic datasets?', 'What are the potential limitations and negative societal impacts of this work?']","['The paper lacks a comprehensive discussion on the limitations and potential negative societal impacts of the work.', 'The study is limited to small algorithmic datasets, and the generalizability of the findings to other types of data is unclear.']",False,2,2,2,3,4,"['Addresses an interesting problem related to transformer models and the grokking phenomenon.', 'Proposes a systematic approach to varying attention head configurations.', 'Provides extensive experimental analysis across various configurations.']","['The originality of the approach is questionable, as similar studies have been conducted in the past.', 'The methodology lacks rigorous theoretical analysis and justification for the chosen experimental setup.', 'The paper is not clearly written, with missing details in the experimental setup and results sections.', 'The significance of the findings is limited, offering incremental insights rather than groundbreaking contributions.']",2,2,2,2,Reject