Making LLMs Smaller Without Breaking Them: A GLU-Aware Pruning Approach
TL;DR
Pruning is a key technique for creating Small Language Models, but a successful pruning process requires understanding the structure of the target models.
This article demonstrates how to perform pruning on MLP layers with a Gated Linear Unit (GLU) structure, an approach applicable to many current models such as LLaMA 3.2, Gemma, Mistral, and Qwen.
By preserving the GLU structure during pruning, you can achieve a significant reduction in model size while maintaining coherent output generation and achieving surprisingly strong accuracy on tasks like BoolQ.
Explore the notebook, experiment with the pruned models, create your own, and share your feedback!
Introduction.
As large language models continue to grow in size to achieve greater capabilities, the demand for more efficient, smaller versions has become a pressing need. However, reducing a model's size without losing its core functionality is a delicate balancing act. Techniques such as quantization and pruning are commonly used to decrease size, while methods like knowledge distillation or transfer learning help retain or recover the capabilities lost during the reduction process.
Among these, pruning stands out as one of the most effective strategies for reducing model size. Unlike quantization, which simplifies numerical representations, pruning involves removing specific parts of the model, such as neurons or entire layers. But this effectiveness comes at a cost: pruning is challenging to apply correctly. Not only do you need to identify which part of the model to prune, but you must also carefully select the elements to remove to minimize the impact on the model's capabilities.
This article focuses on structured width pruning, where selected neurons are removed, and demonstrates how to apply it effectively on MLP layers with a Gated Linear Unit (GLU) structure. By following the steps outlined, you’ll see how pruning can significantly reduce model size while preserving its ability to generate coherent outputs and perform well on key benchmarks.
What Is Pruning, and How Does It Affect Models?
As I’ve explained earlier, pruning involves removing parts of the model that are believed to contribute the least to its final output. By carefully selecting these less critical components, pruning aims to create a more efficient model with fewer parameters and reduced computational requirements, without sacrificing its core capabilities.
The primary challenge in pruning lies in deciding which parts of the model to remove. Not all sections of a model impact its performance equally; each serves a distinct purpose.
To illustrate this, let’s examine the structure of the model used in this article: LLaMA 3.2-1B.
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaSdpaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=512, bias=False)
(v_proj): Linear(in_features=2048, out_features=512, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
(up_proj): Linear(in_features=2048, out_features=8192, bias=False)
(down_proj): Linear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
When examining the structure, we can identify three main blocks that can be targets for pruning: the embeddings, the self-attention mechanism, and the MLP layers. To decide which of these should be the focus of the pruning process, it’s essential to understand the potential benefits and the possible impacts on the model.
The first step is to assess how much each of these sections occupies within the model, giving us an idea of the potential reduction in size.
Parameter Distribution Analysis.
- Embeddings and output layer (embed_tokens, lm_head): 128256 × 2048 ≈ 262M parameters per layer, with two layers totaling roughly 524M parameters.
- Self-attention mechanism (self_attn): 16 layers, each containing four projection sub-layers. For each layer, the size is approximately 2048 × (2048 + 512 + 512 + 2048) ≈ 10.5M parameters. Multiplying by 16 layers gives 10.5M × 16 ≈ 168M parameters.
- MLP layers (mlp): similarly, these consist of 16 layers, and since they follow the GLU structure, each layer includes a gate_proj, an up_proj, and a down_proj. The size of each layer is approximately 2048 × 8192 + 2048 × 8192 + 8192 × 2048 ≈ 50M parameters. Multiplying by 16 layers gives 50M × 16 ≈ 805M parameters (the quick arithmetic check right after this list reproduces these figures).
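Here is a small sketch that reproduces these figures directly from the dimensions printed in the model structure above; layer norms and rotary embeddings are omitted because their parameter counts are negligible.

# Rough parameter counts derived from the Llama 3.2-1B structure printed above.
hidden, vocab, intermediate, n_layers = 2048, 128256, 8192, 16

embeddings = 2 * vocab * hidden                             # embed_tokens + lm_head
attention = n_layers * hidden * (2048 + 512 + 512 + 2048)   # q, k, v and o projections
mlp = n_layers * 3 * hidden * intermediate                  # gate_proj + up_proj + down_proj

total = embeddings + attention + mlp
for name, count in [("embeddings", embeddings), ("attention", attention), ("mlp", mlp)]:
    print(f"{name:>10}: {count / 1e6:6.0f}M ({100 * count / total:.0f}% of the total)")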
Impact Analysis.
As we can see, the MLP layers represent more than 50% of the model’s size, making them clear candidates for pruning. However, before making this decision, it’s crucial to understand the contribution of each section to the model’s behavior.
The embedding layers are responsible for transforming the inputs into dense vector representations that the model can process effectively. Pruning the embedding layer can lead to a loss of the model's ability to understand certain words, or at least reduce the capacity to create vectors that correctly capture the semantic meaning of the inputs. If you want to create a highly specific model that only uses a very specific portion of its input vocabulary, for example, a model for financial or medical analysis, pruning this layer could be an option.
The attention mechanism allows the model to focus on the most relevant parts of the input sequence when processing each token. It computes a weighted importance score between every pair of tokens in the input sequence, enabling the model to capture context and concentrate on the relevant information. Pruning this section can reduce the model's ability to perform tasks requiring a broad understanding of the input context, such as text summarization or translation. It also affects the coherence of the generated text.
The MLP layers accompany the attention mechanism and enhance the model's ability to understand complex patterns through a series of data expansions and contractions. Pruning this section can limit the model’s response to unseen data or tasks not covered during training. In other words, it reduces the model's generalization capability and its ability to provide coherent responses to unfamiliar inputs.
Once you've decided which section of the model to target, the next step is to determine whether to perform width pruning, removing individual neurons, or depth pruning, removing entire layers. As you can see, pruning a model is quite a complex process that involves making many decisions. You not only have to evaluate the abilities of the resulting model but also its capacity to be trained. These models are designed with the intention of being fine-tuned, usually for specific tasks, so they can be more effective and efficient than the base model for the tasks they are created to perform.
Characteristics of Gated Linear Units
The Gated Linear Unit (GLU) architecture is commonly used in modern neural networks, including LLaMA and similar large language models. GLU introduces an element-wise gating mechanism that allows the model to selectively filter and control the flow of information. This architecture consists of paired layers, typically gate_proj, up_proj, and down_proj (as seen in the model structure above), that work together to expand and contract data.
This mechanism enables the model to process more complex patterns while maintaining efficiency. However, it also means that the layers within a GLU structure are tightly coupled, and pruning these layers requires careful consideration.
Any operation on one layer (e.g., removing neurons) must be mirrored in its corresponding paired layers. For instance, if a neuron is removed from gate_proj, the same neuron must also be removed from up_proj, and the size of the down_proj layer must be adjusted accordingly. Most importantly, when calculating the importance of neurons to decide which ones to keep, you need to evaluate the pair of neurons together.
Disrupting the balance of these layers can result in degraded performance or even complete model failure, even if only a small percentage of neurons are removed.
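To make this coupling concrete, here is a minimal sketch of the GLU-style feed-forward block used by these models; the layer names mirror the LlamaMLP structure printed earlier, but the class itself is an illustration rather than the transformers implementation.

import torch.nn as nn

class GLUMLP(nn.Module):
    """Simplified GLU feed-forward block mirroring LlamaMLP's layout."""
    def __init__(self, hidden_size=2048, intermediate_size=8192):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x):
        # gate_proj and up_proj outputs are multiplied element-wise, so neuron i of one
        # is meaningless without neuron i of the other: they must be pruned as a pair,
        # and down_proj's input dimension must shrink to match.
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))

The element-wise product in the forward pass is exactly why the importance of a neuron pair has to be evaluated jointly, as we will see in the pruning code below.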
Pruning a Llama 3.2 Model (GLU).
The example will be demonstrated using a Llama model, but the code has also been tested successfully with Gemma and Qwen models.
You can find the complete code in my GitHub repository. In this article, I will only show the code relevant to the pruning process, omitting some support functions. The notebook also includes code for evaluating the models and uploading them to the Hugging Face Hub.
The first step I took with the original model in memory was to run a small prompt and save the result. This gave me a quick, visual way to check whether the model produced by the pruning process remained coherent or, on the contrary, had lost its ability to generate comprehensible text. Let me assure you: in the first attempt, where the GLU structure of the model was not respected, the text produced left no doubt that the pruning process had a fundamental flaw. The prompt is: “Paris is the capital of.” Let’s look at the response from the original model and compare it to the one returned by my first pruning attempt.
Base model:
“Paris is the capital of France and one of the most visited cities in the world. It is a city of art, culture, fashion, and gastronomy. The city has a rich history and is home to many famous landmarks, including the E.”
First attempt, with only 20% pruning:
“Paris is the capital of of France. This is the the the the main the area of. This is the the the the the the the the the the the the the the the the city of the the France of the of the of the of.”
It’s clear that something didn’t work in that first attempt. It might seem trivial, but an empirical check like this can save you quite a few hours.
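If you prefer to script this sanity check rather than run it by hand in the notebook, a minimal sketch using the standard transformers generation API looks like this; the model path and generation settings are illustrative.

from transformers import AutoModelForCausalLM, AutoTokenizer

def quick_coherence_check(model_path, prompt="Paris is the capital of", max_new_tokens=50):
    """Generate a short continuation so we can eyeball whether the model is still coherent."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(quick_coherence_check("meta-llama/Llama-3.2-1B"))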
Implementation details.
Let’s start by looking at the function responsible for calculating the importance of the neurons, which will ultimately decide which neurons remain in the model and which ones are removed.
import torch

def compute_neuron_pair_importance(gate_weight, up_weight):
    """
    Compute neuron pair importance scores (maximum absolute weight).
    Args:
    - gate_weight: Weight matrix from the gate_proj layer.
    - up_weight: Weight matrix from the up_proj layer.
    Returns:
    - importance_scores: Importance scores for each neuron pair.
    """
    # For each neuron (row), add the largest positive weight to the magnitude
    # of the most negative weight, in both gate_proj and up_proj.
    gate_max_abs = torch.max(gate_weight, dim=1).values + torch.abs(torch.min(gate_weight, dim=1).values)
    up_max_abs = torch.max(up_weight, dim=1).values + torch.abs(torch.min(up_weight, dim=1).values)
    importance_scores = gate_max_abs + up_max_abs
    return importance_scores
The function receives the weights of a gate_proj layer and an up_proj layer, which, as I’ve explained, work in pairs. Therefore, the importance of the neurons must be calculated jointly.
The calculation is straightforward: for each neuron, it adds the largest positive weight to the absolute value of the most negative weight. Both positive and negative extremes are considered because, in theory, the neurons with the most extreme weights have a greater impact on the model’s output by significantly altering the values passing through them.
Here, I must thank Mariusz Kurman for their contribution in incorporating the minimum values into the calculation. While the method worked correctly without them, their inclusion has improved the results.
The importance is calculated separately for each layer, but the function returns the combined value.
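As a quick usage illustration, calling the function on random tensors shaped like a single Llama 3.2-1B MLP block returns one score per neuron pair; the weights here are random and serve only to show the expected shapes.

import torch

# Stand-in tensors with the shape of Llama 3.2-1B's gate_proj and up_proj weights:
# [intermediate_size, hidden_size] = [8192, 2048].
gate_weight = torch.randn(8192, 2048)
up_weight = torch.randn(8192, 2048)

scores = compute_neuron_pair_importance(gate_weight, up_weight)
print(scores.shape)  # torch.Size([8192]) -> one importance score per neuron pair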
The next function is responsible for creating the new layers and incorporating them into the model as replacements for the original ones.
# Prunes a specific percentage of neurons from the MLP (feed-forward) layers.
# nn and device are assumed to be defined earlier in the notebook (torch.nn and the target device).
def prune_neuron_pairs(mlp, prune_percent):
    """
    Reduces the dimensions of the gate_proj, up_proj and down_proj
    layers by removing the least important neurons.
    Args:
    - mlp: MLP block to prune.
    - prune_percent: Percentage of neurons to prune.
    Returns:
    - new_gate_proj, new_up_proj, new_down_proj: New pruned layers.
    - k: New intermediate size.
    """
    # Extract the weights from the MLP layers. These weights are used to
    # calculate each neuron's importance score in the next step.
    gate_weight = mlp.gate_proj.weight.data.float()
    up_weight = mlp.up_proj.weight.data.float()

    # Compute importance scores. Neurons with higher importance scores
    # are considered more important and less likely to be pruned.
    importance_scores = compute_neuron_pair_importance(gate_weight, up_weight)

    # Store the original number of neurons in the intermediate layer.
    original_intermediate_size = gate_weight.size(0)

    # Compute the number of neurons to prune.
    num_neuron_pairs_to_prune = min(int(prune_percent * original_intermediate_size), original_intermediate_size - 1)

    # Calculate the number of neurons to keep: the new intermediate size.
    k = original_intermediate_size - num_neuron_pairs_to_prune

    # Sanity check: we can't prune all the neurons.
    if k <= 0:
        raise ValueError(f"Invalid number of neuron pairs to keep: {k}. Adjust the prune_percent.")

    # Select the neurons to keep, by obtaining the indices of the highest scores.
    _, indices_to_keep = torch.topk(importance_scores, k, largest=True, sorted=True)
    indices_to_keep = indices_to_keep.sort().values

    # Create the new, smaller layers.
    new_gate_proj = nn.Linear(mlp.gate_proj.in_features, k, bias=False).to(device)
    new_up_proj = nn.Linear(mlp.up_proj.in_features, k, bias=False).to(device)
    new_down_proj = nn.Linear(k, mlp.down_proj.out_features, bias=False).to(device)

    # Copy the selected weights to the new layers.
    new_gate_proj.weight.data = mlp.gate_proj.weight.data[indices_to_keep, :]
    new_up_proj.weight.data = mlp.up_proj.weight.data[indices_to_keep, :]
    new_down_proj.weight.data = mlp.down_proj.weight.data[:, indices_to_keep]

    # Return the new layers and the new intermediate size.
    return new_gate_proj, new_up_proj, new_down_proj, k
This function is a bit more complex. It takes an MLP block and the pruning percentage to apply and, by calling the compute_neuron_pair_importance function, determines which neurons to keep.
Let’s break it down step by step:
# Extract the weights from the MLP layers
# these weights are used to calculate each neuron's
# importance score in the next step.
gate_weight = mlp.gate_proj.weight.data.float()
up_weight = mlp.up_proj.weight.data.float()
With these two lines, we retrieve the weights of the current layers.
importance_scores = compute_neuron_pair_importance(gate_weight, up_weight)
Now, a tensor is obtained that contains the importance scores calculated for each neuron. These scores reflect each neuron's contribution to the final output, indicating which ones should be kept.
#Store the original number of neurons in the intermediate layer.
original_intermediate_size = gate_weight.size(0)
#Computes the number of neurons to prune.
num_neuron_pairs_to_prune = min(int(prune_percent * original_intermediate_size), original_intermediate_size - 1)
#Calculate the number of neurons to keep. The new intermediate size.
k = original_intermediate_size - num_neuron_pairs_to_prune
The total number of neurons to keep is calculated using the pruning percentage provided as a parameter and the original size of the layers. Since the layers have the same size, there’s no need to store the size of both. Finally, the new size of the intermediate layers is determined.
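For Llama 3.2-1B with 20% pruning, the arithmetic works out as follows (plain numbers, matching the layer sizes printed in the model structure):

prune_percent = 0.2
original_intermediate_size = 8192

num_neuron_pairs_to_prune = min(int(prune_percent * original_intermediate_size),
                                original_intermediate_size - 1)     # int(1638.4) = 1638
k = original_intermediate_size - num_neuron_pairs_to_prune          # 8192 - 1638 = 6554
print(k)  # 6554: the new intermediate size seen in the pruned model later in the article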
#Select the neurons to keep, by obtaining the indices to keep.
_, indices_to_keep = torch.topk(importance_scores, k, largest=True, sorted=True)
indices_to_keep = indices_to_keep.sort().values
These lines are crucial. Here, torch.topk retrieves the indices of the k neurons with the highest importance scores, sorted from most to least important. Because we need the indices in ascending order to slice the weight matrices while preserving the original neuron ordering, they are re-sorted with sort().
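A tiny example with hypothetical scores for a six-neuron layer makes the two-step selection clearer:

import torch

importance_scores = torch.tensor([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
_, indices_to_keep = torch.topk(importance_scores, k=4, largest=True, sorted=True)
print(indices_to_keep)                # tensor([1, 3, 5, 2]) -> ordered by score, descending
print(indices_to_keep.sort().values)  # tensor([1, 2, 3, 5]) -> ascending order, ready for slicing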
Using the calculated indices, the new layers are created.
#create the new layers
new_gate_proj = nn.Linear(mlp.gate_proj.in_features, k, bias=False).to(device)
new_up_proj = nn.Linear(mlp.up_proj.in_features, k, bias=False).to(device)
new_down_proj = nn.Linear(k, mlp.down_proj.out_features, bias=False).to(device)
#copy weights to the new layers.
new_gate_proj.weight.data = mlp.gate_proj.weight.data[indices_to_keep, :]
new_up_proj.weight.data = mlp.up_proj.weight.data[indices_to_keep, :]
new_down_proj.weight.data = mlp.down_proj.weight.data[:, indices_to_keep]
First, three new layers are created with dimensions adjusted based on the selected indices. In new_gate_proj and new_up_proj, the input dimensions are preserved while the output dimensions are reduced. Conversely, in new_down_proj, the input dimensions are adjusted while the output dimensions remain unchanged.
These layers are initialized without weights, and in the final lines, the relevant weights are transferred from the original layers to the new ones, ensuring that only the weights corresponding to the selected neurons are retained.
#return new layers and intermediate size.
return new_gate_proj, new_up_proj, new_down_proj, k
Finally, the new layers are returned.
Now, let’s look at the function responsible for iterating over all the layers and constructing the modified model.
# Iterates through the model layers and applies pruning.
def update_model(model, prune_percent):
    """
    Modifies each MLP layer present in the model to retain only the most
    important neurons, creating new, smaller versions of each pruned layer.
    Args:
    - model: Model to prune.
    - prune_percent: Percentage of neurons to prune.
    Returns:
    - model: New pruned model.
    """
    new_intermediate_size = None

    # Loop over each model layer.
    for idx, layer in enumerate(model.model.layers):
        # Each layer is a LlamaDecoderLayer containing multiple components:
        # attention, MLP and layer norms. We target the MLP component
        # by accessing layer.mlp.
        mlp = layer.mlp

        # Call prune_neuron_pairs with the MLP block and receive the pruned layers.
        new_gate_proj, new_up_proj, new_down_proj, new_size = prune_neuron_pairs(mlp, prune_percent)

        # Replace the original layers with the pruned layers.
        mlp.gate_proj = new_gate_proj
        mlp.up_proj = new_up_proj
        mlp.down_proj = new_down_proj

        # new_intermediate_size only needs to be set once.
        if new_intermediate_size is None:
            new_intermediate_size = new_size

    # Update the model config.
    model.config.intermediate_size = new_intermediate_size

    return model
This function is straightforward. It takes the model and the pruning percentage as inputs, iterates through each layer of the model, extracts the mlp section from each one, calls the prune_neuron_pairs function, and replaces the model’s layers with the ones returned by it.
# Call prune_neuron_pairs with the MLP block and receive the pruned layers.
new_gate_proj, new_up_proj, new_down_proj, new_size = prune_neuron_pairs(mlp, prune_percent)

# Replace the original layers with the pruned layers.
mlp.gate_proj = new_gate_proj
mlp.up_proj = new_up_proj
mlp.down_proj = new_down_proj
Finally, it also updates the model’s configuration, setting intermediate_size to the new value.
# Update the model config.
model.config.intermediate_size = new_intermediate_size
If the configuration is not updated, the model cannot be used after being saved, whether on the Hugging Face Hub or locally. Many libraries, such as Hugging Face's Transformers, rely on model.config to interpret the model's architecture; if the configuration does not match the actual structure, operations like fine-tuning or inference performed through these libraries may fail.
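As a minimal sketch of how to verify this after pruning, the standard save_pretrained / from_pretrained round trip is enough; the output directory is a hypothetical local path, and the pruned model and its tokenizer are assumed to still be in memory.

output_dir = "./llama-3.2-1b-pruned-20"  # hypothetical local path

# Save the pruned model together with its updated config, plus the tokenizer.
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Reloading only works if config.intermediate_size matches the pruned layers.
from transformers import AutoModelForCausalLM
reloaded = AutoModelForCausalLM.from_pretrained(output_dir)
print(reloaded.config.intermediate_size)  # 6554 for the 20% pruned Llama 3.2-1B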
Results Analysis.
With this code, I’ve created several models, which are available on the Hugging Face Hub.
These include:
- Three models derived from Llama-3.2-1b, with 20%, 40%, and 60% of the neurons in the MLP layers pruned.
- One model based on Gemma-2-2B, pruned by 40%.
You can download these models and, in addition to using them, study their architecture and how it has changed compared to the original models they are based on.
Let’s analyze the changes in the architecture after applying 20% pruning to the Llama 3.2-1B model.
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaSdpaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=512, bias=False)
(v_proj): Linear(in_features=2048, out_features=512, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=2048, out_features=6554, bias=False)
(up_proj): Linear(in_features=2048, out_features=6554, bias=False)
(down_proj): Linear(in_features=6554, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
The structure of the model remains unchanged except for the size of the intermediate layers in the MLP blocks. As you can see, the gate_proj and up_proj layers have been reduced from 8192 features to 6554, and the down_proj layer has undergone the same change, but in its input features.
This change is fully aligned with what the code does: it modifies these layers while preserving the neurons most critical to the model’s performance. Keeping 80% of 8192 neurons gives 6553.6, which, once the number of neurons to prune is rounded down, corresponds to the 6554 neurons retained, confirming that the intended percentage was pruned.
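You can also confirm the overall size reduction by counting parameters before and after pruning; in this sketch, original_model and model are assumed to be the unpruned model and the model returned by update_model, respectively.

def count_parameters(model):
    """Total number of parameters in a model."""
    return sum(p.numel() for p in model.parameters())

original_params = count_parameters(original_model)
pruned_params = count_parameters(model)
reduction = 100 * (1 - pruned_params / original_params)
print(f"{original_params / 1e6:.0f}M -> {pruned_params / 1e6:.0f}M ({reduction:.1f}% smaller)")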
Now, let’s see how the pruned model performed with the test prompt:
Paris is the capital of France. It is also one of the most beautiful cities in the world. There is so much to see and do in Paris that it is impossible to cover it all in one day. However, there are some things you
The response isn’t identical to the one from the original model, but it maintains coherence. This suggests that the model retains much of its capabilities, and more importantly, it could potentially recover any losses through a process like knowledge distillation or fine-tuning.
Beyond this empirical check, I’ve also evaluated the model using some of the most common benchmarks. Let’s analyze how different degrees of pruning affect the model’s performance.
As we can see, the effect of pruning has been somewhat asymmetrical. The tasks evaluated by the BoolQ test haven’t experienced significant degradation—only about a 2% drop for a model that lost 40% of the neurons in the MLP layers.
In contrast, the impact on the Lambada test has been remarkable, with a drop in accuracy of over 50%. This indicates that the model retains much of its comprehension ability but struggles with tests requiring more open-ended generation.
BoolQ simply presents the model with a text and a question to be answered with Yes/No. It’s a test focused on measuring the model’s ability to understand relationships within the input text.
Lambada, on the other hand, asks the model to predict the last word of a paragraph, a harder task in which the final word tests the model’s capacity for long-range language modeling.
These results are consistent with the functionality of the MLP layers that were pruned.
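If you want to reproduce this kind of evaluation yourself, one option is EleutherAI's lm-evaluation-harness; the rough sketch below assumes the lm_eval package (version 0.4 or later) and uses an illustrative local checkpoint path.

import lm_eval

# Evaluate a pruned checkpoint on the two benchmarks discussed above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./llama-3.2-1b-pruned-20",  # hypothetical local path
    tasks=["boolq", "lambada_openai"],
    batch_size=8,
)
print(results["results"]["boolq"])
print(results["results"]["lambada_openai"])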
Conclusions.
The pruning process for the models has been a success. This approach to handling GLU layers allows us to perform pruning while retaining a significant portion of the model's capabilities, thereby reducing its size and resource consumption considerably.
It’s important to note that the test results were obtained with the pruned model before undergoing any capability recovery process, such as knowledge distillation or fine-tuning, which is typically done for models that have undergone pruning.
Future Work.
There are many pruning techniques worth exploring. Perhaps the most straightforward is depth pruning, which involves removing layers that contribute the least to the model’s performance.
Another essential area of research would be to subject these pruned models to a knowledge distillation process and evaluate whether they retain the ability to learn new tasks. This could potentially bring their performance closer to that of the base model, particularly in the benchmarks where the pruned model showed the most significant losses.
The development of smaller, more efficient models remains an attractive field, particularly for companies seeking to deploy LLM capabilities without extensive infrastructure requirements. This work provides a foundation for further research in making these powerful models more accessible and deployable.