# Title: Dual-Expert Denoiser for Improved Mode Capture in Low-Dimensional Diffusion Models

# Experiment description: Modify MLPDenoiser to implement a dual-expert architecture. Create a simple gating network that outputs a single weight (sigmoid output) based on the noisy input and timestep. Implement two expert networks with the same structure as the original denoising network. Combine expert outputs using the gating weight. Train models with both the original and new architecture on all datasets, with particular focus on 'moons' and 'dino'. Compare performance using KL divergence, sample diversity metrics (e.g., number of modes captured), and visual inspection of generated samples. Analyze the specialization of experts across different regions of the data distribution.

## Run 0: Baseline

Results: {'circle': {'training_time': 48.47419357299805, 'eval_loss': 0.4392722546292083, 'inference_time': 0.18316245079040527, 'kl_divergence': 0.35930819035619976}, 'dino': {'training_time': 41.885783672332764, 'eval_loss': 0.6636652672077383, 'inference_time': 0.18297195434570312, 'kl_divergence': 1.060376674621348}, 'line': {'training_time': 38.887343406677246, 'eval_loss': 0.8017848281909132, 'inference_time': 0.17120051383972168, 'kl_divergence': 0.15692256311119815}, 'moons': {'training_time': 38.7231330871582, 'eval_loss': 0.6203141152248968, 'inference_time': 0.1772310733795166, 'kl_divergence': 0.09455949519397541}}

Description: Baseline results using the original MLPDenoiser architecture.

## Run 1: Dual-Expert Denoiser

Results: {'circle': {'training_time': 60.20667552947998, 'eval_loss': 0.4340648788320439, 'inference_time': 0.26030611991882324, 'kl_divergence': 0.3548752521015737}, 'dino': {'training_time': 59.569570779800415, 'eval_loss': 0.6582550479627937, 'inference_time': 0.24830842018127441, 'kl_divergence': 0.873368895698616}, 'line': {'training_time': 57.278900384902954, 'eval_loss': 0.802841300702156, 'inference_time': 0.2616264820098877, 'kl_divergence': 0.16631820218273796}, 'moons': {'training_time': 59.45627760887146, 'eval_loss': 0.614546875743305, 'inference_time': 0.24232029914855957, 'kl_divergence': 0.08688268116023862}}

Description: Implementation of the Dual-Expert Denoiser architecture. This run introduces a gating network and two expert networks within the MLPDenoiser. The gating network determines the weight given to each expert's output based on the noisy input and timestep (a minimal sketch of the architecture follows this run's notes).

Observations:
1. Training time increased across all datasets, which is expected given the larger model.
2. Eval losses improved slightly for 'circle', 'dino', and 'moons', and were essentially unchanged for 'line'.
3. Inference time increased, likely due to the additional computation in the dual-expert architecture.
4. KL divergence improved for 'dino' (0.873 vs 1.060) and 'moons' (0.087 vs 0.095), indicating better capture of the true data distribution.
5. The 'circle' dataset showed a slight improvement in KL divergence (0.355 vs 0.359).
6. The 'line' dataset showed a slight increase in KL divergence (0.166 vs 0.157), possibly because such a simple distribution does not benefit from the added model capacity.

Next steps: To further investigate the effectiveness of the Dual-Expert Denoiser, we should analyze the generated samples visually and examine the gating weights to understand how the experts specialize. We should also consider adjusting the architecture or hyperparameters to potentially improve performance, especially for the 'line' dataset.
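A minimal sketch of this architecture (not the exact training code from this run) is shown below, assuming PyTorch. The class name `DualExpertDenoiser`, the hidden width, the expert depth, and the precomputed timestep embedding `t_emb` are placeholders; the original MLPDenoiser's input/time embeddings are not reproduced here. The key idea matches the description above: a gating MLP maps the concatenated noisy input and timestep embedding to a single sigmoid weight `w`, and the output is `w * expert1(h) + (1 - w) * expert2(h)`.

```python
import torch
import torch.nn as nn


class DualExpertDenoiser(nn.Module):
    """Illustrative dual-expert MLP denoiser: two experts blended by a sigmoid gate."""

    def __init__(self, embedding_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        in_dim = 2 + embedding_dim  # 2D noisy sample concatenated with a timestep embedding

        # Gating network: outputs a single weight in (0, 1) per sample.
        self.gating = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

        # Two experts with the same structure as the original denoising network,
        # each predicting the 2D noise.
        def make_expert() -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(in_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 2),
            )

        self.expert1 = make_expert()
        self.expert2 = make_expert()

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor):
        h = torch.cat([x, t_emb], dim=-1)
        w = self.gating(h)  # (batch, 1)
        # Convex combination of the two expert predictions.
        noise_pred = w * self.expert1(h) + (1.0 - w) * self.expert2(h)
        return noise_pred, w  # w is returned only to simplify gating-weight analysis


```

During sampling only `noise_pred` is needed; returning `w` as well makes the gating-weight analyses proposed in the next steps straightforward without re-running the gate.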
## Run 2: Enhanced Gating Network

Results: {'circle': {'training_time': 62.70881533622742, 'eval_loss': 0.4392700866817513, 'inference_time': 0.27757978439331055, 'kl_divergence': 0.333127618757142}, 'dino': {'training_time': 65.9961109161377, 'eval_loss': 0.6554543292126083, 'inference_time': 0.2801930904388428, 'kl_divergence': 0.8622659948063218}, 'line': {'training_time': 63.58059334754944, 'eval_loss': 0.8071294327831025, 'inference_time': 0.2570970058441162, 'kl_divergence': 0.15626460287380087}, 'moons': {'training_time': 63.43175005912781, 'eval_loss': 0.6130339162581412, 'inference_time': 0.2541923522949219, 'kl_divergence': 0.09756236614068906}}

Description: In this run, we enhanced the gating network of the Dual-Expert Denoiser by increasing its complexity. The gating network now consists of three linear layers with ReLU activations, allowing it to potentially capture more nuanced relationships between the input and the optimal expert weighting.

Observations:
1. Training times increased slightly compared to Run 1, which is expected due to the more complex gating network.
2. Eval losses remained similar to Run 1, with slight improvements for the 'dino' and 'moons' datasets.
3. Inference times increased marginally, reflecting the additional computation in the enhanced gating network.
4. KL divergence improved notably for the 'circle' dataset (0.333 vs 0.355 in Run 1), indicating better capture of the true data distribution.
5. The 'dino' dataset showed a slight improvement in KL divergence (0.862 vs 0.873 in Run 1).
6. The 'line' dataset showed a slight improvement in KL divergence (0.156 vs 0.166 in Run 1), recovering from the increase observed in Run 1.
7. The 'moons' dataset showed a slight increase in KL divergence (0.098 vs 0.087 in Run 1), which is now marginally worse than the baseline (0.095).

Next steps: The enhanced gating network has shown promise, particularly for the 'circle' and 'line' datasets. To further improve the model's performance, we should consider the following:
1. Analyze the generated samples visually to understand the qualitative improvements.
2. Examine the distribution of gating weights to see whether the experts are specializing effectively.
3. Experiment with different architectures for the expert networks, such as increasing their capacity or using different activation functions.
4. Consider implementing a more sophisticated loss function that encourages diversity in the generated samples.

## Run 3: Increased Expert Network Capacity

Results: {'circle': {'training_time': 67.72772169113159, 'eval_loss': 0.44077414045553376, 'inference_time': 0.29411911964416504, 'kl_divergence': 0.3369115398699348}, 'dino': {'training_time': 66.11997985839844, 'eval_loss': 0.6583147108402398, 'inference_time': 0.2786083221435547, 'kl_divergence': 0.7492200172597772}, 'line': {'training_time': 66.70119905471802, 'eval_loss': 0.8060775769641028, 'inference_time': 0.2694664001464844, 'kl_divergence': 0.15416058891406453}, 'moons': {'training_time': 67.89770340919495, 'eval_loss': 0.6156130795131254, 'inference_time': 0.2853279113769531, 'kl_divergence': 0.0915883610864912}}

Description: In this run, we increased the capacity of both expert networks by adding an additional hidden layer before the final output. This modification allows each expert to capture more complex patterns in the data, potentially improving the model's ability to generate diverse and accurate samples. Sketches of the enhanced gating network (Run 2) and the deeper expert (this run) are shown below.
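As a rough sketch of these two changes (the hidden widths, the placement of the extra layer, and the builder-function names below are assumptions; the notes only specify three linear layers with ReLU activations for the gate, and one additional hidden layer before the experts' final output), the modules could replace `self.gating` and `make_expert` in the earlier `DualExpertDenoiser` sketch:

```python
import torch.nn as nn


def make_enhanced_gating(in_dim: int, hidden_dim: int = 64) -> nn.Sequential:
    # Run 2: three linear layers with ReLU activations, ending in a single sigmoid weight.
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, 1),
        nn.Sigmoid(),
    )


def make_deeper_expert(in_dim: int, hidden_dim: int = 256) -> nn.Sequential:
    # Run 3: same expert structure as before, with one extra hidden layer
    # before the final 2D output.
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim),  # additional hidden layer added in Run 3
        nn.ReLU(),
        nn.Linear(hidden_dim, 2),
    )
```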
Observations:
1. Training times increased slightly across all datasets, which is expected due to the increased model complexity.
2. Eval losses remained relatively stable compared to Run 2, with slight variations across datasets.
3. Inference times increased marginally, reflecting the additional computation in the larger expert networks.
4. KL divergence showed mixed results:
   a. The 'circle' dataset regressed slightly (0.337 vs 0.333 in Run 2), though it remains better than the baseline (0.359).
   b. The 'dino' dataset showed significant improvement (0.749 vs 0.862 in Run 2), indicating better capture of the complex data distribution.
   c. The 'line' dataset showed a slight improvement (0.154 vs 0.156 in Run 2).
   d. The 'moons' dataset showed a slight improvement (0.092 vs 0.098 in Run 2).
5. The most notable improvement was observed on the 'dino' dataset, suggesting that the increased expert network capacity is particularly beneficial for more complex data distributions.

Next steps: The increased expert network capacity has shown promising results, especially for the more complex 'dino' dataset. To further improve the model's performance and understand its behavior, we should:
1. Analyze the generated samples visually to assess the qualitative improvements, particularly for the 'dino' dataset.
2. Examine the distribution of gating weights to understand how the experts specialize with the increased capacity.
3. Consider implementing a more sophisticated loss function that encourages diversity in the generated samples, as this may help improve performance across all datasets.
4. Experiment with different activation functions in the expert networks to potentially capture different types of patterns in the data.

## Run 4: Diversity Loss Implementation

Results: {'circle': {'training_time': 72.7797212600708, 'eval_loss': 0.44442242086695893, 'inference_time': 0.2980952262878418, 'kl_divergence': 0.47009555896972094}, 'dino': {'training_time': 75.91083240509033, 'eval_loss': 0.6673849075651535, 'inference_time': 0.29502367973327637, 'kl_divergence': 0.6495770647785007}, 'line': {'training_time': 77.7726686000824, 'eval_loss': 0.8133890747719104, 'inference_time': 0.28652405738830566, 'kl_divergence': 0.2489773415001416}, 'moons': {'training_time': 70.94407176971436, 'eval_loss': 0.6255804364333677, 'inference_time': 0.2740786075592041, 'kl_divergence': 0.11055475645165658}}

Description: In this run, we implemented a more sophisticated loss function to encourage diversity in the generated samples. We added a diversity loss term to the existing MSE loss; it aims to maximize pairwise distances between predictions within a batch, encouraging the model to generate more diverse samples.

Observations:
1. Training times increased across all datasets, likely due to the additional computation required for the diversity loss.
2. Eval losses increased slightly for all datasets, which is expected as the model now optimizes for both accuracy and diversity.
3. Inference times remained relatively stable compared to Run 3.
4. KL divergence results were mixed:
   a. The 'circle' dataset showed a significant increase (0.470 vs 0.337 in Run 3).
   b. The 'dino' dataset improved (0.650 vs 0.749 in Run 3), continuing the trend of better performance on complex distributions.
   c. The 'line' dataset showed a notable increase (0.249 vs 0.154 in Run 3).
   d. The 'moons' dataset showed a slight increase (0.111 vs 0.092 in Run 3).
5. The diversity loss appears to have had a significant impact on the 'dino' dataset, further improving its performance.
6. The increased KL divergence for the simpler datasets ('circle', 'line', 'moons') might indicate that the model is generating more diverse but less accurate samples for these distributions.

Next steps:
1. Analyze the generated samples visually to assess the impact of the diversity loss on sample quality and diversity, particularly for the 'dino' dataset.
2. Examine the distribution of gating weights to understand how the diversity loss affects expert specialization.
3. Consider adjusting the weight of the diversity loss term to find a better balance between accuracy and diversity, especially for the simpler datasets.
4. Experiment with different formulations of the diversity loss, such as using different distance metrics or applying the loss to different intermediate representations.
5. Investigate the impact of batch size on the effectiveness of the diversity loss.

# Plot Descriptions

1. kl_divergence_comparison.png: A bar chart comparing the KL divergence values across different runs and datasets. The x-axis represents the four datasets (circle, dino, line, and moons), while the y-axis shows the KL divergence values. Each run is represented by a different color, allowing for easy comparison of performance across model configurations. Lower KL divergence values indicate better performance, as the generated distribution is closer to the true data distribution. This plot is particularly useful for identifying which configurations perform best on each dataset and how performance varies across datasets.

2. dino_generated_samples.png: A 2x3 grid of scatter plots, each showing the generated samples for the 'dino' dataset from a different run. Each point represents a generated sample, with its x and y coordinates corresponding to the two data dimensions. The color of each point encodes the gating weight assigned by the model, on a scale from cool (low weights) to warm (high weights) colors. This visualization shows how well the generated samples capture the shape of the dino dataset and how the gating mechanism specializes across different regions of the data distribution. It is particularly useful for assessing the quality and diversity of generated samples, as well as understanding the behavior of the dual-expert architecture.

3. dino_train_loss.png: Training loss curves for the 'dino' dataset across different runs. The x-axis represents the training steps, while the y-axis shows the loss value; each run is a different colored line, smoothed to reduce noise and make trends more visible. This plot is crucial for understanding the training dynamics of different model configurations: it allows us to compare the convergence speed, stability, and final loss values achieved by each run. Lower loss values generally indicate better performance, but they should be considered alongside other metrics such as KL divergence and visual inspection of generated samples.

4. dino_gating_weights_histogram.png: A 2x3 grid of histograms, each showing the distribution of gating weights for the 'dino' dataset from a different run. The x-axis of each histogram represents the gating weight values (ranging from 0 to 1), while the y-axis shows the frequency of each weight value. This visualization is essential for understanding how the gating mechanism behaves in the dual-expert architecture.
A bimodal distribution might indicate that the experts are specializing in different aspects of the data, while a uniform distribution might suggest that both experts are being used equally across the dataset. This plot helps in assessing whether the dual-expert approach is working as intended and how its behavior varies across model configurations.

Taken together, these plots provide a comprehensive view of the model's performance, training dynamics, and the effectiveness of the dual-expert architecture. They are crucial for understanding the impact of the different modifications (enhanced gating, increased expert capacity, and the diversity loss) on the model's ability to capture complex data distributions like the 'dino' dataset.
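For reference, a histogram like `dino_gating_weights_histogram.png` can be produced by recording the gating weights over a batch of noisy inputs. The sketch below is illustrative rather than the plotting code actually used: it assumes a model that returns its gating weight alongside the noise prediction (as in the `DualExpertDenoiser` sketch above), collects weights at a single timestep, and uses an arbitrary function name, bin count, and file name.

```python
import matplotlib.pyplot as plt
import torch


@torch.no_grad()
def plot_gating_histogram(model, x_noisy, t_emb, path="gating_weights_histogram.png"):
    # Assumes model(x, t_emb) -> (noise_prediction, gating_weight), as sketched earlier.
    _, w = model(x_noisy, t_emb)  # w: (batch, 1), values in (0, 1)
    plt.hist(w.squeeze(-1).cpu().numpy(), bins=50, range=(0.0, 1.0))
    plt.xlabel("gating weight")   # weights near 0 or 1 mean one expert dominates
    plt.ylabel("frequency")
    plt.savefig(path)
    plt.close()
```

In practice the weights could instead be accumulated over all steps of the reverse process before plotting, if per-timestep behavior is of interest.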