paper_id,Summary,Questions,Limitations,Ethical Concerns,Soundness,Presentation,Contribution,Overall,Confidence,Strengths,Weaknesses,Originality,Quality,Clarity,Significance,Decision
model_architecture_grokking,"The paper investigates the impact of Transformer architecture configurations on grokking behavior in arithmetic and permutation tasks. It systematically explores five Transformer configurations with varying layers, dimensions, and attention heads across four tasks: modular addition, subtraction, division, and permutation composition. The study finds that architectural choices significantly influence grokking behavior, with some configurations facilitating faster learning or achieving higher final accuracies on specific tasks.","['Can the authors provide a theoretical explanation for the observed patterns in grokking behavior?', 'How do the results compare with other architectures beyond Transformers?', 'What specific characteristics of the tasks influence the effectiveness of different Transformer configurations?', 'Can the authors provide a deeper analysis of why certain architectures perform better on specific tasks?', 'Can the authors propose new hypotheses based on the results of their study?', 'Can the authors clarify the experimental setup and better integrate the figures into the text?', 'Why were these specific tasks (modular addition, subtraction, division, and permutation composition) chosen for the study?', 'What was the rationale behind the chosen Transformer configurations?', 'How do the findings translate to more complex real-world tasks?', 'How do the authors suggest improving the performance on more complex tasks like permutation composition with Transformer models?', 'What are the potential limitations of the current study and how can they be addressed in future work?']","['The study is limited to specific tasks and a fixed training duration, which might not generalize to more complex real-world problems.', 'The analysis focuses on descriptive results without offering theoretical insights into the observed phenomena.', 'The paper does not address the broader implications of the findings or propose new hypotheses based on the results.', 'The analysis of the results is somewhat superficial, not providing enough insights into why certain architectures perform better on specific tasks.', 'The paper does not explore a wide enough range of architectural configurations, such as different normalization techniques or activation functions. The training duration and tasks are also limited, which may affect the generalizability of the findings.']",False,2,2,2,4,4,"['The paper addresses an interesting and relevant phenomenon in deep learning, namely grokking.', 'The empirical study is comprehensive, covering multiple configurations and tasks.', 'The findings provide practical insights into optimizing Transformer architectures for specific tasks.']","['The originality of the work is limited as it primarily extends existing studies to new tasks and configurations.', 'The paper lacks a strong theoretical foundation and deeper analysis of why certain architectures perform better.', 'Some sections could benefit from clearer explanations and reasoning, particularly on the impact of architectural choices.', 'The significance of the findings is moderate as they largely confirm intuitive expectations rather than providing groundbreaking insights.', 'The analysis of the results is superficial and does not provide enough insights into why certain architectures perform better on specific tasks.', 'The explanation of the experimental setup and the results could be more concise and integrated with the figures.', 'The paper does not delve deeply into the implications of the findings or propose new hypotheses based on the results.', 'The choice of tasks and architectural variations could be more diverse.', 'Details on the training setup (e.g., hardware, specific implementation details) are sparse, affecting reproducibility.']",2,2,2,3,Reject
batch_size_grokking,"The paper investigates the impact of dynamic batch size strategies on the grokking phenomenon in deep learning. The authors propose five dynamic batch size strategies (linear, exponential, logarithmic, cyclic, and cosine annealing) and evaluate them against a fixed batch size baseline across four tasks: modular addition, subtraction, division, and permutation group operations. The results indicate that the linear increase strategy consistently outperforms others, accelerating generalization and improving final model performance.","['Can the authors provide a more detailed description of the dynamic batch size strategies and their implementations?', 'What is the theoretical rationale behind the choice of these dynamic batch size strategies?', 'How do the findings generalize to other types of tasks or more complex datasets?', 'Could the authors explore the interaction between batch size strategies and other hyperparameters, such as learning rate schedules?', 'What are the specific hyperparameters used in the experiments, and how were they chosen?']","['The paper does not thoroughly address the theoretical underpinnings of dynamic batch size strategies, which limits the understanding of why these strategies work.', ""The study's focus on a single model architecture and optimizer limits the generalizability of the findings."", 'The limited scope of tasks used in the evaluation may not provide a comprehensive view of the effectiveness of these strategies.', 'The interaction between batch size strategies and other hyperparameters is not explored.']",False,2,2,2,4,4,"['Addresses a novel and important issue in deep learning, focusing on the grokking phenomenon.', 'Proposes novel dynamic batch size strategies.', 'Comprehensive empirical evaluation across multiple tasks.', 'The linear increase strategy shows promising results in accelerating generalization and improving final model performance.']","['Lacks theoretical grounding for why these dynamic batch size strategies should work.', 'Limited to a specific set of tasks and a single model architecture, limiting generalizability.', 'Insufficient exploration of the interaction between batch size strategies and other hyperparameters.', 'Some important hyperparameters and experimental settings are not clearly explained.', 'The related work section could be more comprehensive.']",3,2,2,3,Reject
data_augmentation_grokking,"The paper investigates the impact of data augmentation on grokking dynamics in mathematical operations, focusing on modular arithmetic. It proposes a novel data augmentation strategy combining operand reversal and negation, applied with varying probabilities to different operations. The experiments show that targeted data augmentation significantly accelerates grokking, reducing steps to 99% validation accuracy by up to 76% for addition, 72% for subtraction, and 66% for division.","['Can the authors provide a more detailed theoretical explanation of why the proposed data augmentation strategies enhance grokking dynamics?', 'Can the authors include more detailed ablation studies to isolate the effects of different components of the proposed method?', 'Can the authors provide clearer visualizations and more detailed explanations of the figures and tables presented in the paper?', 'Can the authors explore additional augmentation techniques beyond operand reversal and negation?', 'How does the proposed method generalize to more complex mathematical operations?', 'Can the authors provide more detailed explanations and visualizations of the grokking dynamics observed in the experiments?', 'Have you considered other augmentation techniques or combinations thereof?', 'Can you provide more experiments with different hyperparameters and model architectures?', 'Could you discuss the potential ethical and societal impacts of your work?', 'Can the authors provide a discussion on the statistical significance of the results and the variance across multiple runs?', 'What are the limitations of the proposed augmentation strategies, and how do the authors plan to address them in future work?', 'Why is accelerating grokking specifically important for modular arithmetic?', 'How were the specific probabilities for augmentation chosen?', 'Can you provide more context or baseline comparison to justify the claimed improvements?']","['The paper does not thoroughly discuss the potential limitations of the proposed method, such as its applicability to other mathematical domains or different types of neural network architectures.', 'The broader implications of the work, including its potential impact on curriculum design for machine learning and educational AI systems, are only briefly mentioned and not explored in depth.', 'The experiments were conducted with a fixed set of hyperparameters and model architecture. Further investigation into the interaction between these factors and the augmentation strategies is needed.', 'The study focuses on modular arithmetic with a prime modulus. Generalizing the findings to other mathematical domains remains an open question.', 'The evaluation methodology lacks rigor, with no discussion on statistical significance or variance across multiple runs.', 'The problem setting and the significance of the results are not well-motivated. Why is accelerating grokking specifically important for modular arithmetic?', ""The methodology section lacks clarity and depth. For instance, it's not clear how the specific probabilities for augmentation were chosen."", 'The evaluation metrics and experimental results are presented without sufficient context or baseline comparison to justify the claimed improvements.', ""The paper's presentation is somewhat disorganized, making it difficult to follow and understand the key contributions and results.""]",False,2,2,2,3,4,"['The paper addresses a relevant problem in the context of deep learning: enhancing grokking dynamics in mathematical reasoning tasks.', 'The proposed data augmentation strategies (operand reversal and negation) are novel and show promising empirical results.', 'The experiments are well-designed, systematically varying augmentation strategies and measuring their effects on grokking dynamics.']","['The paper lacks a thorough theoretical analysis of why the proposed data augmentation strategies work and how they influence grokking dynamics.', 'The presentation of results, particularly the figures and tables, could be clearer and more detailed.', 'The paper does not provide sufficient ablation studies to isolate the effects of different components of the proposed method.', 'Potential limitations and broader implications of the work are not thoroughly discussed.', 'The novelty of the proposed augmentation techniques is limited as operand reversal and negation are relatively simple transformations.', 'The experimental setup lacks depth, particularly in terms of hyperparameter exploration and architectural variations.', 'Ethical and societal impacts are not discussed.']",2,2,2,3,Reject
noisy_grokking,"The paper investigates the impact of input perturbations on the grokking phenomenon in neural networks, focusing on two algorithmic tasks: modular division and permutation composition. The study systematically varies noise levels and application probabilities, tracking clean and noisy validation accuracies throughout the training process. The results reveal a task-dependent, non-linear relationship between noise and grokking, with modular division showing robustness to low noise levels and permutation composition exhibiting improved generalization under moderate noise.","['Can the authors provide more details on the autoencoder aggregator used in their experiments?', 'How do different model architectures and training procedures affect the observed results?', 'Can the authors provide additional visualizations or cases to support their qualitative analysis?', 'How do the findings generalize to more complex, real-world datasets?', 'What are the practical implications of these findings for designing robust training procedures?', 'Can the authors provide more detailed explanations of the noise injection process and its impact on the training dynamics?', 'What are the practical implications of the observed non-monotonic relationship between noise and performance in the permutation composition task?', 'Have you considered testing other algorithmic tasks or different model architectures to see if the findings generalize?', 'Can you provide a more detailed theoretical explanation for the observed non-monotonic behavior and critical transition points in noise sensitivity?', 'How do the authors ensure the reproducibility and validity of their experiments?', 'Are there any additional tasks or datasets that could be included to make the findings more generalizable?', 'Can the authors provide a more thorough analysis of the experimental data, including additional visualizations and statistical significance tests?', 'How do the findings relate to existing literature on noise and generalization in neural networks?']","['The study is limited to two specific algorithmic tasks, which may not generalize to other domains.', 'The experiments use a fixed model architecture and training procedure, which may not be optimal for all noise conditions.', 'The statistical robustness of the results is questionable due to the small number of trials.', 'The experimental methodology is not clearly explained, making it difficult to understand the reproducibility and validity of the experiments.']",False,2,2,2,4,4,"['The paper addresses a relevant and timely problem by studying the impact of noise on the grokking phenomenon.', 'Comprehensive experimental setup, including varying noise types, levels, and probabilities.', 'Detailed analysis of the results, identifying task-specific behaviors and critical transition points in noise tolerance.', 'The topic of the study is novel and extends current research on the grokking phenomenon by introducing the element of noise.']","['The paper lacks clarity in certain sections, making it difficult to follow the experimental setup and results.', 'The impact of different model architectures and training procedures on the results is not explored.', 'The study focuses on only two specific algorithmic tasks, limiting the generalizability of the findings.', 'The number of random seeds (3) used for experiments is relatively small, potentially affecting the robustness of the results.', 'The idea of studying the impact of noise on neural networks is not novel, and the paper does not sufficiently differentiate itself from existing work on noise and robustness.', 'The practical implications of the findings are not fully explored.', 'The experimental design has some gaps, such as the limited range of noise types and the fixed model architecture.', 'There is a lack of discussion on the implications of the findings for practical applications.']",3,2,2,3,Reject
pruning_grokking_dynamics,"The paper investigates the impact of weight pruning on the phenomenon of grokking and generalization performance in transformer models. It systematically varies sparsity levels and examines the effects on models trained on modular division and permutation learning tasks. The results show that moderate sparsity levels can maintain or accelerate grokking in the modular division task, while high sparsity levels lead to poor generalization in the permutation learning task.","['Can you provide a more detailed theoretical analysis to explain the discrepancy between target and achieved sparsity levels?', 'How generalizable are your findings to other tasks and model architectures?', 'Can you improve the clarity of your figures and tables, and provide more detailed explanations?', 'Why does the permutation learning task show severe overfitting with high sparsity levels, while the modular division task does not?', 'Can the authors explore different pruning schedules or techniques to see if the discrepancy in sparsity levels persists?']","['The paper does not sufficiently address the discrepancy between target and achieved sparsity levels.', 'The generalizability of the findings is limited, and the paper does not provide enough discussion on this aspect.', 'The paper lacks a thorough discussion on the limitations and potential negative societal impacts of the work.']",False,2,2,2,3,4,"['The paper addresses an interesting and relevant topic in deep learning, exploring the relationship between sparsity, grokking, and generalization.', 'The experimental design includes systematic variation of sparsity levels and tracks multiple metrics to provide a comprehensive view of the learning dynamics.', 'The findings suggest that moderate sparsity levels can accelerate grokking and improve generalization in some tasks.']","['The theoretical foundation and detailed analysis supporting the claims are weak. The discrepancy between target and achieved sparsity levels is noted but not sufficiently explained.', 'The methodology and experimental setup are fairly standard and do not introduce substantial innovation.', 'The results are limited to two specific tasks, raising questions about the generalizability of the findings.', 'The clarity of the paper is lacking, with confusing presentation of results and insufficient explanation of figures and tables.', 'The discussion around limitations and potential negative societal impacts of the work is minimal.']",3,2,2,3,Reject
invariance_learning_grokking,"The paper investigates the relationship between invariance learning and the grokking phenomenon in neural networks, introducing a novel invariance score metric and using attention visualization techniques. The study focuses on modular arithmetic operations and permutations, demonstrating that invariance learning occurs early in training, preceding the grokking point.","['Why is early invariance learning not sufficient for generalization, especially in the permutation task?', 'Can you provide more detailed explanations and visualizations of the attention patterns observed in the different tasks?', 'How does the proposed invariance score metric compare to existing methods in the literature?', 'Can you provide a more detailed theoretical justification for the invariance score metric?', 'Can you include more experimental results, possibly with different tasks and model architectures, to validate your claims more comprehensively?', 'Can you clarify how the invariance score is computed and how it should be interpreted in the context of your experiments?', 'Can you provide a deeper analysis of why the permutation task fails to generalize despite early invariance learning?', 'What additional factors do you think contribute to the grokking phenomenon?', 'How do you plan to address the limitations mentioned in the paper, such as fixed hyperparameters and limited training time?']","['The study acknowledges the fixed hyperparameters and limited training time as limitations. Future work should explore different model architectures and training regimes to better understand the relationship between invariance learning and grokking.', 'The paper does not adequately address the limitations of the proposed approach and the broader implications of the findings. A discussion on these aspects would be beneficial for understanding the scope and impact of the work.', 'Simplified tasks may not fully represent the complexity of real-world problems.']",False,2,2,2,4,4,"['The paper addresses an intriguing and relevant problem in deep learning, focusing on the grokking phenomenon and invariance learning.', 'Introduction of a novel invariance score metric to quantify symmetry awareness in neural networks.', 'Comprehensive experiments on modular arithmetic operations and permutations, providing empirical evidence for the study.', 'Utilization of attention visualization techniques to gain insights into the learning process.']","['The causal link between invariance learning and grokking is not clearly established. The paper should explore why early invariance learning does not guarantee generalization, especially for complex tasks.', 'The presentation of results is disorganized, with unclear visualizations and insufficient explanations of the findings.', 'The attention patterns discussed are not convincingly linked to the observed phenomena, and the paper could benefit from more detailed analysis and interpretation.', 'The methodology could be more rigorous, with a stronger theoretical foundation and a more thorough comparison with existing methods.', 'The paper lacks a detailed exploration of hyperparameter sensitivity and task-specific adjustments.', 'The practical implications and potential applications of the findings are not thoroughly discussed.', 'The scope of tasks studied is somewhat narrow, limiting the generalizability of the findings.']",3,2,2,3,Reject
initialization_impact_grokking,"The paper investigates the impact of weight initialization methods on the grokking phenomenon in Transformer models across various mathematical tasks. It compares orthogonal and Xavier/Glorot initialization techniques, analyzing their effects on learning trajectories, generalization capabilities, and model performance. The study aims to provide insights into the grokking process by examining weight distributions and attention patterns throughout training.","['Can you provide more theoretical explanations for why certain initialization methods perform better for specific tasks?', 'Can you expand the experimental setup to include a broader range of tasks and model architectures to validate your findings?', 'How do the results generalize to more complex, real-world tasks?', 'What are the potential negative societal impacts or ethical considerations of this work?', 'Can the authors provide more details on the experimental setup and methodology to ensure reproducibility?']","[""The paper's findings may not be generalizable to more complex problems due to the relatively small scale of the tasks considered."", 'The interaction between initialization methods and other training dynamics, such as learning rate schedules, requires further investigation.', 'The paper does not adequately address the limitations and potential negative societal impacts of the work. The authors should provide a more detailed discussion on these aspects.']",False,2,2,2,3,4,"['The paper addresses an important and timely topic in deep learning: the grokking phenomenon in Transformer models.', 'It provides a novel framework for analyzing weight distributions and attention patterns, offering insights into the grokking process.', 'The study demonstrates practical implications for designing more efficient and generalizable Transformer models.']","['The paper lacks theoretical grounding for the observed phenomena. It relies heavily on empirical results without providing sufficient theoretical explanations.', 'The experimental setup could be more rigorous. A broader range of tasks and model architectures is needed to validate the findings.', 'The clarity of the paper is a concern, with several sections lacking detailed explanations, particularly in the methodology and analysis sections.', 'The tasks used for evaluation are relatively simple and may not generalize to more complex, real-world problems.', 'There is limited discussion on the potential negative societal impacts or ethical considerations of the work.']",2,2,2,2,Reject
optimizer_impact_grokking,"The paper investigates the impact of optimization algorithms (SGD and Adam) on the grokking phenomenon in deep learning, focusing on modular arithmetic and permutation tasks. The study empirically compares the performance of SGD and Adam across different learning rates using a transformer model, analyzing training and validation dynamics, generalization performance, and grokking occurrence.","['Can the authors provide a theoretical explanation for the observed differences between SGD and Adam in grokking?', 'How do the chosen tasks (modular arithmetic and permutation) generalize to other types of tasks?', 'Can the authors provide more details on the experimental setup, including hyperparameters and model architectures?', ""How do the authors plan to address the incomplete sections labeled 'BACKGROUND HERE' and 'METHOD HERE'?"", 'What are the statistical significances of the results presented?', 'Can the authors elaborate on the theoretical foundations and contributions of their work?']","['The study is limited by its narrow range of tasks and does not explore the theoretical basis for the observed phenomena.', 'Incomplete sections and insufficient experimental details raise concerns about reproducibility.', 'Potential negative societal impacts of optimizing for generalization in deep learning are not discussed.']",False,2,2,2,3,4,"['Addresses a novel and relevant aspect of deep learning dynamics, specifically the grokking phenomenon.', 'Conducts a comprehensive empirical analysis across multiple tasks and learning rates.', 'Provides detailed metrics to evaluate grokking, including validation accuracy and steps to reach 99% validation accuracy.', 'Offers actionable insights for practitioners regarding optimizer selection and hyperparameter tuning to facilitate grokking.']","['Lack of theoretical insight to explain why different optimizers affect grokking differently.', 'Limited variety of tasks reduces the generalizability of the findings.', 'Incomplete content and placeholders significantly detract from clarity and completeness.', 'Insufficient detail on experimental setup and hyperparameters to ensure reproducibility.', 'Does not adequately address the limitations of the study.', 'Results are presented without sufficient statistical analysis or discussion on their significance.']",2,2,2,3,Reject
proto_lottery_tickets_grokking,"The paper investigates the phenomenon of grokking in neural networks, where models exhibit sudden improvements in generalization after prolonged training. The authors propose a novel method for tracking and analyzing the evolution of active subnetworks during training. Experiments are conducted on modular division and permutation tasks to explore subnetwork dynamics during grokking events.","['Can the authors provide more detailed analysis and visualization of subnetwork dynamics?', 'How do the results compare to baseline methods or existing approaches?', 'Can the authors clarify the methods used for detecting subnetwork emergence points and provide more robust results?', 'How do you address the inconsistency in results, especially for the permutation task?', 'Can you discuss the impact of hyperparameter sensitivity on your findings?', 'How would additional runs or different tasks affect the robustness of your results?']","['The main limitations of the paper include the inconclusive experimental results and the lack of detailed analysis and baseline comparisons.', 'The methods for detecting subnetwork emergence points need further refinement.', 'The influence of fixed hyperparameters on the results is not adequately explored.', 'The potential negative societal impacts of this work are not discussed.']",False,2,2,2,3,4,"['The paper addresses an intriguing and relatively unexplored phenomenon in neural networks known as grokking.', 'The proposed method for tracking and analyzing subnetwork evolution is novel and aims to provide insights into sudden generalization.', 'The experiments on modular division and permutation tasks are well-motivated and demonstrate different levels of task complexity.']","['The experimental results are inconclusive and do not provide strong evidence to support the hypotheses.', 'The methodology lacks clarity, especially in defining and detecting active subnetworks and their stability.', 'The paper lacks baseline comparisons and ablation studies, which are necessary to evaluate the effectiveness of the proposed method.', 'The clarity and organization of the paper could be improved, as some sections are difficult to follow.', 'Only three experimental runs are conducted for each task, which is insufficient for robust statistical analysis.']",3,2,2,3,Reject
input_encoding_grokking,"The paper investigates how different input encoding schemes affect the grokking phenomenon in transformer models learning mathematical operations. It systematically compares four encoding methods: default integer indices, one-hot encoding, binary encoding, and a novel hybrid encoding across four mathematical operations: modular addition, subtraction, division, and permutation composition. The study aims to understand how these encodings influence grokking dynamics and overall model performance. The authors find that input encodings significantly influence grokking dynamics, with varying effectiveness across operations.","['Can you provide a more detailed theoretical explanation for the observed grokking phenomenon?', 'How do you envision the practical implications of your findings in broader machine learning tasks?', 'Can you further elaborate on the impact and potential benefits of the hybrid encoding scheme?', 'Can you provide a clearer definition of grokking and how it is measured in your experiments?', 'What is the rationale behind the design of the hybrid encoding scheme? How does it compare to other encoding schemes in terms of novelty?', 'Can you conduct additional experiments to cover more mathematical operations and different sets of hyperparameters to strengthen your conclusions?', 'How do you justify the choice of hyperparameters in your experiments?', 'Can you provide a more detailed comparison with existing methods addressing similar problems?', 'Why is the related work section incomplete?', 'Could the authors provide a more detailed theoretical explanation for why different encoding schemes affect grokking dynamics?', 'What is the performance of the proposed model when varying the hybrid encoding scheme? Could more ablation studies be conducted?', 'How generalizable are the findings to other mathematical operations and model architectures?']","['The study is limited to specific mathematical operations and a fixed transformer architecture. This may limit the generalizability of the findings to other tasks and models.', 'The choice of hyperparameters is not thoroughly justified, potentially biasing the results.', 'The results for the hybrid encoding scheme are not thoroughly analyzed, and its effectiveness remains unclear.', 'The paper does not address potential negative societal impacts of the work.']",False,2,2,2,3,4,"['The paper addresses an important and underexplored aspect of neural network learning: the influence of input encoding on grokking.', 'A systematic comparison of different input encoding schemes is conducted.', 'Comprehensive experiments are conducted across multiple mathematical operations and encoding schemes.']","['The novelty of the hybrid encoding scheme is questionable and lacks theoretical justification.', 'The experimental setup is limited to specific mathematical operations and a fixed transformer architecture, raising concerns about generalizability.', 'The clarity of the paper could be improved, particularly in the presentation of the hybrid encoding scheme and the definition of grokking.', 'Some critical related works are missing, and the citations are incomplete.', 'The evaluation metrics and analysis of results are not thoroughly explained.']",2,2,2,2,Reject
mdl_grokking_correlation,"This paper investigates the relationship between Minimal Description Length (MDL) and the phenomenon of grokking in neural networks, offering an information-theoretic perspective on sudden generalization. The authors propose a novel MDL estimation technique using weight pruning and apply it to modular arithmetic and permutation tasks. Their experiments reveal a strong correlation between MDL reduction and improved generalization, with MDL transition points often preceding or coinciding with grokking events.","['Can the authors provide more details on the weight pruning technique and its implementation?', 'How do the results generalize to more complex datasets and tasks beyond modular arithmetic and permutation?', 'What specific hyperparameters were used in the experiments, and how was the exact implementation of MDL tracking carried out?', 'Can the authors provide additional ablation studies to isolate the impact of different components in their methodology?', 'How does the choice of pruning threshold (ε) impact the MDL estimation and the observed grokking phenomenon?']","['The focus on relatively simple datasets limits the generalizability of the findings to more complex, real-world tasks.', 'More clarity is needed in the methodology, especially regarding the weight pruning technique and its implementation.', 'The choice of pruning threshold (ε) and its impact on MDL estimation should be discussed in more detail.']",False,3,2,3,5,4,"['The paper addresses an intriguing and under-explored phenomenon in neural networks: grokking.', 'The proposed MDL estimation technique using weight pruning is novel and offers a unique perspective on understanding neural network generalization.', 'The empirical results demonstrate a strong correlation between MDL reduction and grokking across several modular arithmetic tasks.']","['The explanation of the MDL estimation technique and its implementation details could be clearer.', 'The generalizability of the findings to more complex neural network tasks and architectures is not well demonstrated.', 'The experiments are limited to relatively simple tasks, and the paper does not explore more challenging datasets or real-world applications.', 'Visualizations and some experimental results are not clearly presented, making it difficult to fully understand the findings and their implications.']",3,3,2,3,Reject
adversarial_robustness_evolution_grokking,"The paper investigates the relationship between grokking and adversarial robustness in neural networks trained on small algorithmic datasets. It introduces the concept of a 'robustness grokking point' and proposes an experimental framework to track clean accuracy, perturbed accuracy, and their ratio throughout training. The experiments are conducted on transformer models applied to modular arithmetic operations and permutation tasks.","['Why did you choose uniform noise perturbation, and how do you justify its adequacy in capturing adversarial robustness?', 'Can you provide more details on the methodology for detecting the robustness grokking point?', 'Have you considered more sophisticated adversarial attack methods to evaluate robustness?', 'Why was the perturbation strength set at ε = 0.05? Could a different value yield different insights?', 'Can you provide more details on why specific tasks (modular arithmetic and permutation) were chosen?', 'How do you plan to refine the robustness grokking point detection algorithm?']","['The uniform noise perturbation with ε = 0.05 may be overly strong, leading to low perturbed accuracy across tasks.', 'The detection algorithm for robustness grokking needs refinement to capture meaningful changes in adversarial robustness.', 'The permutation task shows poor performance in both clean and perturbed accuracy, suggesting that the current model architecture or training procedure may not be suitable for this more complex task.']",False,2,3,2,3,4,"['The idea of exploring the relationship between grokking and adversarial robustness is novel and interesting.', 'The experimental framework is well-defined, and the use of transformer models is appropriate for the tasks considered.', 'The results highlight the task-dependent variations in grokking onset and a moderate positive correlation between improvements in clean and perturbed accuracy.']","[""The method for identifying the 'robustness grokking point' seems arbitrary and lacks rigorous validation."", 'The choice of uniform noise perturbation is too simplistic and may not adequately capture adversarial robustness.', 'The experimental results do not provide strong evidence or a clear understanding of the relationship between grokking and robustness.', ""The study's limitations, such as the simplistic perturbation method, significantly impact the overall contribution of the paper."", 'The paper lacks detailed explanations for certain choices, such as why specific tasks were chosen and why the perturbation strength was set at ε = 0.05.']",3,2,3,2,Reject
gradient_flow_grokking,"The paper investigates the phenomenon of grokking in neural networks through gradient flow analysis, aiming to uncover the internal dynamics of sudden generalization by examining gradient norms and directions during training. The authors conduct experiments on modular division and permutation group tasks, providing insights into the mechanics of grokking and introducing the 'gradient shift point' as a potential predictor of sudden generalization.","['Can the authors provide more details about the autoencoder aggregator used in the experiments?', ""What is the rationale behind choosing a cosine similarity threshold of 0.5 for defining the 'gradient shift point'?"", 'Have the authors considered other tasks or model architectures to validate the generalizability of their findings?', 'Why were the gradient L2 norm and cosine similarity specifically chosen as metrics for detecting grokking?', 'Can the authors provide more details on how these metrics are calculated and their significance?', 'Can the authors provide a theoretical explanation for why gradient shifts might predict grokking?', 'Can the authors include more visualizations and statistical tests to support the findings, particularly regarding the relationship between gradient shift points and grokking points?']","['The study is limited to two specific tasks, and the findings may not generalize to other types of learning problems.', 'The small scale of the experiments (7,500 training steps) may not capture long-term learning dynamics.', ""The definition of the 'gradient shift point' is somewhat arbitrary and may need further investigation."", 'The paper lacks a deep theoretical analysis to support the empirical findings.']",False,2,2,2,3,4,"['The paper addresses an important and intriguing phenomenon in neural networks: grokking.', 'The proposed gradient flow analysis is a novel approach to understanding sudden generalization.', ""The introduction of the 'gradient shift point' as a metric for predicting grokking is innovative."", 'The experimental setup is thorough, with detailed analysis and visualizations of gradient dynamics.']","[""The clarity of the presentation is lacking, particularly in explaining the autoencoder aggregator and the significance of the 'gradient shift point.'"", 'The paper could benefit from more comprehensive ablation studies to validate the proposed methods further.', 'The study is limited to two specific tasks, which may not generalize to all types of learning problems.', ""The definition of the 'gradient shift point' (cosine similarity threshold of 0.5) appears arbitrary and lacks justification."", 'The methodology lacks depth in explaining why specific metrics like gradient L2 norm and cosine similarity are chosen and how they are calculated.', 'The paper lacks a deeper theoretical grounding or explanation for why gradient shifts predict grokking.']",3,2,2,3,Reject
lottery_ticket_grokking,"The paper investigates the intersection of the lottery ticket hypothesis and the grokking phenomenon using a transformer model trained on modular arithmetic and permutation tasks. The authors explore whether sparse, trainable subnetworks (lottery tickets) emerge during sudden generalization (grokking). They implement one-shot magnitude pruning at various stages of training and analyze the properties of winning tickets across different tasks.","['Can the authors provide more details on why the permutation task failed to produce lottery tickets?', 'How might the findings change with different model architectures or more complex tasks?', 'Can the authors provide comparisons with other pruning techniques?', 'How do the authors plan to generalize their findings to more complex tasks beyond simple arithmetic operations?']","['The paper focuses on simple tasks and a specific transformer architecture, which might limit the generalizability of the findings.', ""The study's focus on simple arithmetic and permutation tasks limits its generalizability."", 'The paper could benefit from exploring a wider range of tasks and neural network architectures.', 'Additional ablation studies and detailed analysis of permutation tasks are needed to strengthen the findings.', 'The chosen methodology, while detailed, lacks justification for certain critical decisions.']",False,3,3,3,4,4,"['The combination of the lottery ticket hypothesis and grokking phenomenon is novel and provides unique insights.', 'The methodology is well-structured, with thorough experiments and clear presentation.', 'The paper includes several visualizations that help in understanding the results and insights.', 'The findings contribute to understanding the relationship between network pruning and sudden generalization, potentially leading to more efficient training strategies.']","['The experiments are limited to simple arithmetic operations and a basic permutation task, which may affect the generalizability of the findings.', 'The study uses a specific transformer architecture, and the results might not hold for other architectures or more complex tasks.', 'The paper does not explore iterative pruning methods, which could provide a more comprehensive understanding of the lottery ticket hypothesis in the context of grokking.', 'The failure of permutation tasks to produce lottery tickets or exhibit grokking behavior is not sufficiently explored.', 'The choice of specific pruning rates and retraining steps is not well-justified, leaving some methodological decisions unclear.', 'The paper could benefit from a more detailed examination of the learning dynamics during grokking, such as gradient flows or loss landscapes.']",3,3,3,3,Reject
cyclic_lr_grokking,"The paper investigates the phenomenon of grokking in deep learning, where models exhibit sudden generalization after prolonged training. The authors explore the use of cyclic learning rate schedules to induce grokking across various algorithmic tasks, such as modular arithmetic and permutation operations. The approach involves systematic exploration of the interplay between cyclic learning rates, gradient dynamics, and generalization in transformer models.","['Can the authors provide ablation studies to isolate the impact of specific components, such as the warmup and decay phases of the cyclic learning rate schedule?', 'What are the gradient norm trends during the transition from memorization to generalization?', 'Can the authors explore alternative regularization techniques in conjunction with cyclic learning rates?', 'What are the theoretical explanations for why cyclic learning rates induce grokking in some tasks but not others?', 'How would varying the configuration of the cyclic learning rate schedule impact the results?']","['The study does not include ablation studies to isolate the impact of specific components of the proposed method.', 'Absence of gradient norm analysis limits the understanding of optimization dynamics and factors influencing grokking.', 'The cyclic learning rate schedules do not consistently lead to grokking behavior across all tasks, particularly struggling with complex tasks like permutations.', 'The results indicate that cyclic learning rates are not a one-size-fits-all solution for inducing grokking across different tasks.']",False,2,3,2,3,4,"['The paper addresses an intriguing and important phenomenon in deep learning, known as grokking.', 'The authors provide a comprehensive empirical study across multiple algorithmic tasks, extending beyond modular addition to include tasks like modular subtraction, division, and permutations.', 'The implementation of a novel cyclic learning rate schedule with warmup and decay is a notable contribution.', 'Detailed analysis of the task-dependent nature of grokking and the limitations of cyclic learning rates in promoting generalization.']","['The effectiveness of cyclic learning rates in inducing grokking varies significantly across tasks, with limited success in more complex tasks like permutations.', 'The study lacks ablation experiments to isolate the impact of specific components of the proposed method.', 'Absence of gradient norm analysis, which could provide deeper insights into the optimization dynamics and the factors influencing grokking.', 'The cyclic learning rate schedules do not consistently lead to full grokking behavior, with none of the tasks achieving 99% validation accuracy within the training steps.', 'The paper does not sufficiently address limitations and potential societal impacts of the work.']",3,2,3,2,Reject
gradient_accumulation_grokking,"The paper investigates the impact of gradient accumulation on the grokking phenomenon and computational efficiency in deep learning models, focusing on tasks involving modular arithmetic and permutation groups using a Transformer-based model. The study explores different combinations of gradient accumulation steps and learning rates to analyze their effects on the speed of grokking, final performance, and computational efficiency.","['Can the authors provide more context in the introduction and related work sections?', 'What are the broader implications or potential applications of studying the grokking phenomenon in deep learning?', 'How do the findings on grokking in simplistic tasks generalize to more complex real-world applications?', 'Can the authors provide a clearer definition of the grokking phenomenon and its implications for deep learning?', 'Why were the experiments run only once for each configuration? Would multiple runs with different seeds provide more robust results?', 'Can the authors provide more theoretical insight into why gradient accumulation impacts grokking and computational efficiency?', 'Can the authors provide more details on the implementation of the autoencoder aggregator? This part of the methodology is not very clear.', 'Have the authors considered running multiple seeds for each configuration to ensure the robustness of the results?', 'Could the authors include additional ablation studies to analyze the impact of different components of the proposed method?', 'Can the authors provide more detailed explanations for the choice of hyperparameters and specific implementation details?', 'How do the authors plan to address the limitation of using single seeds in their experiments?', 'Can the authors discuss the potential generalizability of their findings to more complex, real-world tasks?']","['The study lacks robustness due to the use of a single seed for each configuration, which limits the statistical significance of the findings.', 'The tasks used in the study (modular arithmetic and permutation groups) are relatively simple and may not adequately represent the challenges faced in real-world deep learning applications.', 'The proposed method may not be well-suited for tasks involving more complex algebraic structures, as indicated by the relatively low validation accuracies for the permutation group tasks.']",False,2,2,2,3,4,"['Addresses an interesting and less explored phenomenon in deep learning, namely grokking.', 'Investigates the impact of gradient accumulation, a practical technique used in training neural networks, on learning dynamics and efficiency.', 'Conducts experiments on two types of tasks with varying complexity, providing a controlled environment for analysis.']","['The paper lacks clarity in several sections, particularly in the methodology and experimental setup, making it difficult to reproduce the results.', 'The study does not include ablation experiments or comparisons with related works, limiting the evaluation of the proposed approach.', 'The experimental results are based on a single run for each configuration, which raises concerns about the statistical significance and robustness of the findings.', 'The paper does not adequately address the limitations of the proposed method, particularly in more complex real-world tasks.']",2,2,2,2,Reject
robustness_grokking_correlation,"This paper investigates the relationship between grokking and input robustness in neural networks trained on mathematical operations. Grokking refers to the sudden emergence of generalization capabilities in neural networks after prolonged training. The authors introduce a novel framework that correlates the onset of grokking with improvements in input robustness, using transformer models trained on modular arithmetic and permutation tasks. Extensive experiments are conducted to demonstrate that the grokking point often precedes significant improvements in robustness to input perturbations.","['Can the authors provide a more detailed explanation of the methodology, particularly the training and evaluation process?', 'Have the authors considered more complex tasks or real-world datasets to validate the generalizability of their findings?', 'Is there a theoretical explanation for the observed relationship between grokking and robustness, or is it purely empirical?', 'How do the authors define robustness points, and why do they occur so early in training across all operations?', 'Can the authors provide more detailed ablation studies to understand the influence of different components of their framework?', 'How does the proposed approach compare to other robustness techniques like adversarial training?', 'Can the authors provide more detailed explanations of key concepts, such as the definition of the robustness point?', 'How do you plan to extend your study to more complex tasks and perturbation strategies?', 'Can you improve the clarity of your presentation to make the methodology and results easier to follow?', 'How does the observed relationship between grokking and robustness compare to models trained with adversarial robustness techniques?', 'Can you clarify how the grokking and robustness points are defined and whether these definitions might influence the observed results?', 'Have you considered more complex and realistic tasks beyond mathematical operations to validate if the grokking-robustness relationship holds?', 'Can the authors test robustness with different types of perturbations to provide a more comprehensive analysis?']","['The paper primarily focuses on simple mathematical operations, which may limit the generalizability of the findings to more complex tasks or real-world datasets.', 'The early robustness points observed across all operations suggest that the current definition of robustness points may need refinement to better capture the nuances of robustness development.', 'The perturbation strategy used is simplistic and does not explore various types of input robustness.', 'The study is limited to simple mathematical operations and might not generalize to more complex tasks or real-world datasets.', 'The perturbation method is basic and may not represent robustness against more sophisticated adversarial attacks.']",False,2,2,2,3,4,"['The paper addresses an under-researched area in neural network learning dynamics, particularly the relationship between grokking and input robustness.', 'The idea of correlating grokking with robustness is novel and provides valuable insights into the learning process of neural networks.', 'The authors provide extensive empirical evidence through experiments on various mathematical operations, demonstrating the interplay between generalization and robustness.']","['The clarity of the paper is a concern. Some sections, particularly the methodology, are not clearly explained, making it difficult for readers to fully understand the experimental setup and results.', 'The paper lacks rigorous theoretical analysis to support the empirical findings. The connection between grokking and robustness is demonstrated empirically but not explained theoretically.', 'The generalizability of the results is questionable. The experiments are limited to simple mathematical operations, and it is unclear if the findings would hold for more complex tasks or real-world datasets.', 'The definition of robustness points may need refinement, as the early robustness points observed across all operations suggest that the current definition may not fully capture the nuances of robustness development.', 'Insufficient comparisons with existing robustness techniques, such as adversarial training, to better contextualize the contributions.', 'The perturbation method used is too simplistic and does not explore a variety of perturbation strategies.']",3,2,2,3,Reject
critical_period_grokking,"The paper investigates the impact of component-specific freezing on neural network generalization, focusing on understanding the grokking phenomenon. The study explores how freezing different components of a Transformer-based model at various training stages affects learning dynamics and generalization performance across mathematical tasks.","['Can you provide more details about the implementation of the freezing mechanism?', 'How does the choice of aggregator type affect the performance?', 'Can you provide more qualitative analysis to support the claims about the impact of freezing?', 'Can you provide more theoretical justification for the observed effects of component-specific freezing?', 'How do the authors address the unpredictable effects observed in some tasks, such as the performance degradation in the modular addition task with late freezing?', 'Can the authors clarify the missing figures in the results section?', 'How do you ensure that the findings from mathematical tasks generalize to other domains?', 'Could you provide a more detailed theoretical analysis of why certain freezing schedules work better for specific tasks?', ""Can you provide a clearer definition and significance of the 'grokking' phenomenon?"", 'What theoretical framework supports the observed impact of component-specific freezing?', 'How do the baseline comparisons and control setups rigorously support your conclusions?']","[""The main limitation is the task-dependent nature of the freezing strategy's effectiveness, which limits its practical utility."", 'The paper lacks robust evidence to support some claims, such as the benefits of late-stage freezing.', 'The clarity and organization of the paper need improvement.', 'The study is limited to mathematical tasks, which may not generalize well to other domains.', 'There is a lack of clarity in the implementation details and theoretical justification for the observed phenomena.', 'Experimental results are not always linked to actionable insights.', 'Baseline comparisons and control setups need better definition and explanation.']",False,2,2,2,4,4,"['The concept of component-specific freezing to study critical learning periods is novel and intriguing.', 'The paper addresses an important phenomenon (grokking) in neural networks.', 'The experimental setup is comprehensive, covering various tasks and freezing schedules.', 'The methodology is well-structured, with systematic freezing schedules and thorough experimental analysis across multiple tasks.']","['The results show significant variability across tasks, suggesting that the effectiveness of freezing strategies is highly task-dependent.', 'Some claims, such as the benefits of late-stage freezing, lack robust support.', 'The paper could benefit from clearer explanations of some methodologies, especially regarding the implementation details of the freezing mechanism.', 'The practical utility of the freezing strategy appears limited due to its task-dependent effectiveness.', ""The paper's presentation could be significantly improved. The methodology and experimental setup sections are dense and difficult to follow."", 'The paper lacks a strong theoretical foundation to explain why component-specific freezing should lead to the observed results.', 'The study is limited to mathematical tasks, which may not generalize well to more complex or different types of tasks.', 'The results show some unpredictable and unexplained effects of freezing, calling into question the robustness and reliability of the findings.', 'The results section references figures that are not included in the provided text, making it difficult to fully understand and evaluate the findings.']",3,2,2,3,Reject
adaptive_grokking_critical_periods,The paper investigates the phenomenon of grokking in neural networks by identifying critical learning periods during training. It introduces an AdaptiveFreezeScheduler that dynamically freezes model components based on learning progress. The method is tested on modular arithmetic and permutation tasks using a modified Transformer architecture.,"['Can you provide more details on the component analysis and how it helps in understanding the grokking phenomenon?', 'What are the possible reasons for the poor performance on permutation tasks, and how can the method be improved to handle such tasks better?', 'Can you provide additional ablation studies and sensitivity analyses to better understand the impact of various hyperparameters and design choices?', 'How do you justify the choice of task-specific parameters for the AdaptiveFreezeScheduler?', 'Can you compare your method with other existing methods that address grokking?', 'Can you provide a more rigorous justification for the freezing criteria used in the AdaptiveFreezeScheduler?', 'Have you tested the method on a wider range of tasks to see if the findings generalize beyond modular arithmetic and permutation tasks?', ""Can you provide more details on the specific implementation of the AdaptiveFreezeScheduler and how the 'patience' and 'improvement threshold' parameters were chosen?"", 'Have you considered alternative architectures or modifications that could address the poor performance on permutation tasks?', 'Please discuss the potential societal impacts and ethical considerations of your proposed method.']","['The method struggles with permutation tasks, and the baseline models outperform the proposed method in terms of grokking speed for some tasks. This suggests that the method may not be robust across different types of tasks.', 'The clarity of the component analysis section needs improvement, and additional explanations and justifications are required.', 'The current model architecture struggles with permutation tasks, indicating a need for task-specific architectures.', 'The AdaptiveFreezeScheduler may be too aggressive for complex tasks, suggesting the need for more nuanced freezing strategies.', 'The experiments are limited to small-scale tasks, and the scalability of the approach to larger, more complex problems remains to be investigated.']",False,2,2,2,3,4,"['The paper addresses an intriguing phenomenon in neural networks, grokking, which challenges our understanding of neural network generalization.', 'The introduction of the AdaptiveFreezeScheduler is novel and provides a systematic way to explore the impact of freezing different model components.', 'The experiments are well-structured, and the method is validated on different types of tasks, providing a comprehensive view of its effectiveness.']","['The clarity of the method, particularly the component analysis section, is lacking and needs more detailed explanations and justifications.', 'The results on permutation tasks are significantly weaker, indicating that the method may not generalize well across different types of tasks.', 'The baseline models often perform better in terms of grokking speed, which raises questions about the practical utility of the proposed method.', 'The paper could benefit from additional ablation studies and sensitivity analyses to better understand the impact of various hyperparameters and design choices.', 'The paper lacks a solid comparison with other methods that address grokking, which would help contextualize the contributions better.', 'The justification for task-specific parameters and the choice of architecture could be better explained.', ""The criteria for adaptive freezing are somewhat arbitrary and lack rigorous justification, which might limit the method's generalizability."", 'The scope of tasks is narrow, and the findings might be task-specific, limiting the overall significance of the contributions.', 'The paper lacks a thorough discussion on the potential societal impact and ethical considerations of the proposed method.']",3,2,2,3,Reject | |
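The AdaptiveFreezeScheduler is only described at a high level in this record; below is a minimal sketch of how such a component-freezing scheduler might work, assuming the 'patience' and 'improvement threshold' parameters the review's questions mention. The class name, default values, and stall criterion are illustrative, not the authors' implementation.

```python
import torch.nn as nn

class AdaptiveFreezeScheduler:
    """Hypothetical sketch: freeze a model component once its associated
    validation loss stops improving. Not the paper's actual implementation."""

    def __init__(self, components: dict[str, nn.Module],
                 patience: int = 100, improvement_threshold: float = 1e-3):
        self.components = components            # e.g. {"attn": ..., "mlp": ...}
        self.patience = patience                # eval steps to wait before freezing
        self.threshold = improvement_threshold  # minimum loss improvement that counts
        self.best_loss = {name: float("inf") for name in components}
        self.stall = {name: 0 for name in components}

    def step(self, name: str, val_loss: float) -> None:
        """Call after each evaluation with the current validation loss."""
        if self.best_loss[name] - val_loss > self.threshold:
            self.best_loss[name] = val_loss
            self.stall[name] = 0
        else:
            self.stall[name] += 1
            if self.stall[name] >= self.patience:
                for p in self.components[name].parameters():
                    p.requires_grad = False  # freeze the stalled component
```

The review's criticism that the freezing criteria are "somewhat arbitrary" applies directly to the choice of `patience` and `improvement_threshold` in a scheme like this one.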
mutual_information_grokking,"The paper investigates the grokking phenomenon in neural networks by analyzing mutual information dynamics within transformer models during training. The authors identify an 'information shift point' that precedes and potentially triggers rapid improvements in generalization. They focus on algorithmic tasks such as modular arithmetic operations and permutation composition, providing empirical evidence and statistical analysis to support their findings.","['Can the authors provide more detailed explanations of the key concepts and methods used in the paper?', 'How do the authors plan to generalize their findings to more complex tasks and different model architectures?', ""What are the practical implications of identifying the 'information shift point' for training strategies and model design?"", 'Can the authors provide more details on the mutual information estimation process and its robustness?', ""How do the authors ensure that the observed 'information shift point' is not an artifact of the estimation method?"", 'Can the authors provide more quantitative analysis to support the claims about the correlation between mutual information dynamics and grokking?', 'How do the authors address the potential limitations and confounding factors in their experiments?']","['The scope of the experiments is limited to a few algorithmic tasks, which may not generalize to other domains or more complex tasks.', 'The findings rely on a specific model architecture and set of hyperparameters, potentially limiting their applicability to other configurations.', 'The failure to observe grokking in permutation tasks suggests limitations in the approach that need to be addressed.', 'The paper does not convincingly demonstrate the practical significance of the findings.', 'The paper does not adequately address the limitations of the mutual information estimation method and the potential confounding factors in the experiments.']",False,3,2,2,4,4,"['The paper addresses an intriguing phenomenon in deep learning, known as grokking.', 'The use of mutual information dynamics to study grokking in transformer models is a novel approach.', 'The authors provide extensive empirical evidence and rigorous statistical analysis.', 'The topic of grokking is novel and relevant, addressing a significant gap in our understanding of neural network training dynamics.']","['The originality of the paper is questionable, as mutual information has been used in other contexts to understand neural network training.', 'The experimental scope is limited to a few algorithmic tasks, raising concerns about the generalizability of the findings.', 'The paper lacks clarity in explaining key concepts and methods, which could hinder reproducibility.', 'The practical significance of the findings is not well articulated.', 'The results, while interesting, do not convincingly demonstrate a causal relationship between mutual information dynamics and grokking.', 'The presentation of the paper is cluttered and lacks coherence, making it difficult to follow the narrative.', 'The paper does not adequately address potential limitations or alternative explanations for the observed phenomena.']",3,3,2,3,Reject | |
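The reviews question the robustness of the mutual information estimation; a common histogram-based estimator and a naive shift-point detector are sketched below. Both are assumptions for illustration, since the paper's estimator and shift-point criterion are not described in this record.

```python
import numpy as np

def binned_mutual_information(x: np.ndarray, y: np.ndarray, bins: int = 30) -> float:
    """Histogram-based MI estimate in nats; sensitive to the bin count,
    which is one source of the estimation artifacts the review asks about."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal over x
    py = pxy.sum(axis=0, keepdims=True)   # marginal over y
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

def shift_point(mi_per_step: np.ndarray) -> int:
    """Hypothetical 'information shift point': the training step with the
    largest single-step jump in the MI trajectory."""
    return int(np.argmax(np.diff(mi_per_step)) + 1)
```

Whether such a jump precedes the accuracy jump, rather than being an artifact of binning, is exactly the robustness question the review raises.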
relational_complexity_grokking,"The paper investigates the phenomenon of 'grokking' in neural networks, focusing on how input relational complexity influences learning dynamics across various mathematical operations such as modular arithmetic and permutations. It introduces a novel framework for measuring input relational complexity and conducts extensive experiments to understand the impact of task complexity on learning trajectories and generalization.","['How do you plan to generalize the findings to other types of tasks beyond modular arithmetic and permutations?', 'Can you provide more details on the experimental setups and hyperparameters to improve reproducibility?', 'Have you considered alternative model architectures that might perform better on permutation tasks?', 'How can the complexity measure be improved to better capture the true difficulty of tasks?', 'Can the authors provide more diverse experiments involving different types of tasks?', ""What are potential reasons for the model's poor performance on permutation tasks?"", 'Can the authors provide more detailed analysis or alternative measures for input complexity?', 'What architectural changes or different training strategies could potentially improve performance on more complex tasks like permutations?', 'Can the authors elaborate on the limitations and potential solutions in more detail?', 'Why do the authors believe the current complexity measure is adequate?', 'How do the findings compare to other types of tasks beyond mathematical operations?', 'What are the implications for real-world applications of neural networks?', 'Could you provide more details and validation for the complexity measure?', 'How do the findings of this study significantly advance our understanding beyond existing work?', 'Could you explain the choice of tasks in more detail and why these were selected?']","['The paper does not adequately address the limitations of the complexity measure and model architecture. It is crucial to explore more nuanced complexity measures and alternative models for better generalization.', 'The current complexity measure may not adequately reflect task difficulty.', 'The model architecture may not be optimal for more complex tasks like permutations.', 'The training time and hyperparameters might need adjustment for more challenging tasks.', 'The choice of mathematical operations and specific complexity measures could be questioned.', 'Implications of weak correlations observed in permutation tasks are not fully explored.', 'The paper should address the adequacy and validation of the complexity measure more thoroughly.', 'It should also ensure that the findings offer substantial new insights rather than merely reinforcing existing understanding.']",False,2,2,2,3,4,"['The introduction of a novel framework for quantifying input relational complexity.', 'Comprehensive experiments on different mathematical operations, providing a detailed analysis of grokking behavior.', 'Insightful findings revealing distinct patterns in learning dynamics for different tasks.']","['The complexity measure might be too simplistic and may not fully capture the true difficulty of learning different operations.', 'The study focuses narrowly on specific mathematical operations and may not generalize to other types of tasks.', 'The transformer-based model might not be optimal for the permutation tasks, leading to poor generalization.', 'Lacks sufficient details on some experimental setups and hyperparameters, which could hinder reproducibility.', 'Results are primarily descriptive and do not delve into underlying reasons for observed phenomena.', 'Limited scope and modest insights, with unclear practical implications.']",3,2,3,2,Reject | |
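The reviews repeatedly fault the relational complexity measure for being underspecified; purely for illustration, one simple way such a measure could be instantiated for a binary operation over Z_n is the entropy of its output table. This is a hypothetical placeholder, not the paper's framework, and its simplicity mirrors the reviewers' concern.

```python
import numpy as np
from itertools import product

def output_entropy(op, n: int) -> float:
    """Toy 'relational complexity': Shannon entropy (bits) of the output
    distribution of a binary operation over {0, ..., n-1}."""
    outputs = [op(a, b) for a, b in product(range(n), repeat=2)]
    _, counts = np.unique(outputs, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Modular addition over Z_97 has a perfectly uniform output table,
# so this measure maxes out at log2(97) ~ 6.6 bits.
print(output_entropy(lambda a, b: (a + b) % 97, 97))
```

A measure of this kind cannot distinguish, say, modular addition from permutation composition in any structural sense, which is one way the "too simplistic" criticism could be made concrete.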
simplicity_bias_grokking,"The paper investigates the phenomenon of grokking in neural networks, focusing on the relationship between function complexity and sudden generalization. The key contribution is the introduction of a method to track function complexity using the L1 norm of model weights and analyzing its evolution in relation to generalization performance. The experiments are conducted on modular arithmetic and permutation tasks.","['Have you considered other measures of function complexity besides the L1 norm? If so, what were the findings?', 'Can you provide additional ablation studies to explore the impact of different hyperparameters and model architectures on the relationship between function complexity and grokking?', 'How do you think your findings generalize to more complex tasks beyond modular arithmetic and permutation operations?']","['The experiments are limited to specific tasks, which might affect the generalizability of the findings.', 'The reliance on a single measure of function complexity (L1 norm) could be limiting. Future work should explore other complexity measures.', 'The lack of extensive ablation studies weakens the strength of the claims made in the paper.']",False,2,2,2,3,4,"['Addresses an important and under-explored phenomenon in deep learning: grokking.', 'Provides a novel method to analyze function complexity using the L1 norm of model weights.', 'Demonstrates consistent patterns across multiple tasks, supporting the simplicity bias hypothesis.', 'Comprehensive experiments with clear visualizations of training dynamics.']","['The reliance on L1 norm as the sole measure of function complexity might be limiting. Other complexity measures could provide additional insights.', 'The tasks chosen for experiments (modular arithmetic and permutation) are quite specific, which might limit the generalizability of the findings to other tasks.', 'Limited ablation studies to explore the impact of different hyperparameters, model architectures, and complexity measures.', 'The paper lacks a thorough investigation into the theoretical underpinnings of the observed relationship between function complexity and grokking.']",3,2,3,3,Reject | |
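The paper's core measurement, the L1 norm of model weights tracked during training, is straightforward to reproduce; a minimal PyTorch sketch follows (the function name is ours).

```python
import torch.nn as nn

def weight_l1_norm(model: nn.Module) -> float:
    """Total L1 norm over all trainable parameters: the function-complexity
    proxy tracked alongside validation accuracy during training."""
    return float(sum(p.abs().sum() for p in model.parameters() if p.requires_grad))
```

Logged each epoch, a decline in this quantity that coincides with the jump in validation accuracy is the pattern the simplicity-bias hypothesis predicts; the reviews' point is that the L1 norm is only one of several plausible complexity proxies.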
positional_encoding_grokking,"The paper investigates the impact of positional encoding schemes on the grokking phenomenon in Transformer models, focusing on modular division and permutation group operations. It finds that learned positional encodings facilitate earlier grokking compared to sinusoidal encodings, particularly in modular division tasks. However, neither encoding scheme achieves successful grokking on the more complex permutation group operations task.","['Have you considered other positional encoding schemes?', 'Can you provide more details on the generalization capabilities on longer sequences?', 'How do different hyperparameter settings impact the results?', 'Can the authors provide more detailed explanations and justifications for their experimental setup and choice of tasks?', 'How do the authors plan to address the limitations related to task diversity and training duration in future studies?', 'Can the authors provide more insights into the mechanisms driving the observed differences in grokking dynamics between the two positional encoding schemes?']","['The study is limited by the scope of tasks and the training duration. Additionally, the exploration of positional encoding schemes is not exhaustive, and the paper does not thoroughly analyze the underlying mechanisms driving the grokking phenomenon.', 'The limited scope of positional encoding schemes and potential hyperparameter sensitivity could impact the generalizability of the findings.']",False,2,3,3,3,4,"['Addresses a critical aspect of deep learning generalization by exploring the grokking phenomenon.', 'The comparison between learned and sinusoidal positional encodings is novel and provides valuable insights.', 'Thorough experiments with detailed analysis of learning curves, gradient norms, and attention patterns.', 'Findings have practical implications for optimizing Transformer architectures in algorithmic tasks.']","['Limited to two specific tasks, which may not be representative of all algorithmic problems.', 'Training duration of 7500 steps may not be sufficient for observing grokking in more complex tasks.', 'Only considers two positional encoding schemes, leaving out other potential methods.', 'Hyperparameters are kept constant across all experiments, which might have biased the results.', 'Lacks detailed exploration of generalization capabilities on longer sequences.']",3,2,3,3,Reject | |
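The two encoding schemes under comparison are standard; a minimal sketch of both, assuming the usual sinusoidal formulation and a per-position trainable embedding (dimensions are illustrative):

```python
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal positional encoding (Vaswani et al., 2017).
    Assumes d_model is even."""
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # added to token embeddings, never updated by the optimizer

# Learned positional encoding: one trainable vector per position,
# updated by gradient descent like any other parameter.
learned_pe = nn.Embedding(num_embeddings=8, embedding_dim=128)
```

The paper's finding that the learned variant groks earlier on modular division is consistent with it having extra trainable degrees of freedom, though, as the reviews note, the mechanism is not analyzed.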
effective_capacity_grokking,"The paper investigates the phenomenon of grokking in neural networks by introducing a novel metric called effective capacity, which quantifies the utilization of model parameters during training. The authors hypothesize that changes in effective capacity are closely linked to sudden improvements in generalization performance. They introduce the concept of a 'capacity shift point' and conduct extensive experiments on algorithmic tasks using Transformer models.","['How sensitive are the findings to the choice of the threshold ε for determining significant parameter contributions?', 'Can the authors provide more detailed ablation studies to validate the importance of different components of the proposed method?', 'How do the findings generalize to more complex tasks and larger models? Can the authors provide any preliminary results or insights in this regard?', 'What are the potential limitations or challenges in applying the proposed method to real-world tasks and models?', 'Can you provide a theoretical justification for why effective capacity should correlate with grokking?', 'Can you include more qualitative insights or case studies to illustrate the findings better?']","['The study is limited to relatively simple tasks and small models, and the findings may not generalize to more complex scenarios.', 'The empirical choice of the threshold for significant parameter contributions could affect the generalizability of the results.', 'The causal relationship between effective capacity and grokking is not thoroughly explored.']",False,3,3,3,5,4,"['Addresses the novel and intriguing phenomenon of grokking.', 'Introduces the innovative concept of effective capacity as a metric for parameter utilization.', 'Presents the new idea of a capacity shift point.', 'Conducts thorough experiments with a well-defined methodology.']","['Empirical validation is limited to simple algorithmic tasks and small models, which may not generalize.', 'Threshold for significant parameter contributions is chosen empirically without substantial justification.', 'Lacks deeper theoretical underpinning for the correlation between effective capacity and grokking.', 'Visualization of results could benefit from additional qualitative insights.', 'Does not sufficiently explore the causal relationship between effective capacity and grokking.']",3,3,3,4,Reject | |
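The review asks about sensitivity to the threshold ε; since the exact notion of 'parameter contribution' is not given in this record, the sketch below uses absolute parameter magnitude as a stand-in (an assumption, not the paper's definition).

```python
import torch.nn as nn

def effective_capacity(model: nn.Module, eps: float = 1e-3) -> float:
    """Assumed form of the metric: fraction of parameters whose absolute
    value exceeds eps. The reviews note that any such metric is sensitive
    to the empirically chosen eps."""
    total, active = 0, 0
    for p in model.parameters():
        total += p.numel()
        active += int((p.abs() > eps).sum())
    return active / max(total, 1)
```

Under this reading, a 'capacity shift point' would be a step where this fraction changes sharply, and sweeping `eps` over a few orders of magnitude is the obvious sensitivity check the review requests.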
scheduled_regularization_grokking,"The paper introduces Scheduled Regularization, a novel approach to enhance and control grokking in neural networks by dynamically adjusting regularization strength during training. The method is evaluated on various algebraic and combinatorial tasks, showing some improvements but inconsistent performance across different tasks.","['Can the authors provide a theoretical explanation for why Scheduled Regularization should enhance grokking?', 'How does the method perform on more complex tasks and larger models?', 'Can more figures and detailed explanations be provided to clarify the results and methodology?', 'Can the authors provide a deeper analysis of why Scheduled Regularization works for some tasks but not for others?', 'Are there any additional baselines that could be included to strengthen the evaluation of the proposed method?', 'Could the authors explore additional hyperparameter settings or alternative schedules for regularization to see if these impact the results?']","[""The method's effectiveness appears to be highly task-dependent, and its impact on more complex tasks and larger models is not explored."", 'The lack of a theoretical framework limits the understanding of why the method works or fails.', 'The study does not address the limitations of the current model architecture or training procedure, which may contribute to the inconsistent results.', 'The impact of the selected hyperparameters on the results is not thoroughly discussed.']",False,2,2,2,3,4,"['Addresses the intriguing phenomenon of grokking, contributing to the understanding of deep learning dynamics.', 'Introduces a novel concept of dynamically adjusting regularization strength during training.', 'Provides a comprehensive empirical study across various algebraic and combinatorial tasks.']","['The proposed method does not show consistent improvements across all tasks, with particularly poor results on modular division and permutation tasks.', 'Lacks a rigorous theoretical framework explaining why Scheduled Regularization should work.', 'The tasks chosen are quite specific and may not generalize well to more complex real-world tasks.', 'The impact on more complex architectures and larger-scale tasks is not explored, leaving questions about scalability unanswered.', 'Presentation and clarity need improvement. Some figures are missing, and the explanation of results and methods could be more detailed and precise.']",3,2,3,2,Reject | |
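Scheduled Regularization adjusts regularization strength dynamically during training; the schedule's shape is not specified in this record, so the sketch below assumes a simple linear ramp on weight decay (schedule form and hyperparameters are illustrative, not the authors').

```python
def weight_decay_at(step: int, total_steps: int,
                    wd_start: float = 0.0, wd_end: float = 1.0) -> float:
    """Linearly ramp weight decay from wd_start to wd_end over training:
    one plausible instantiation of a regularization schedule."""
    frac = min(step / max(total_steps, 1), 1.0)
    return wd_start + frac * (wd_end - wd_start)

# Usage with a PyTorch optimizer: update the coefficient in-place each step.
# for group in optimizer.param_groups:
#     group["weight_decay"] = weight_decay_at(step, total_steps)
```

The reviews' request for alternative schedules would amount to swapping this ramp for, e.g., a cosine or step function and comparing grokking onset across tasks.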