Abstract
Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
Community
Very nice paper that introduces a new paradigm for LLM quantization (ternary weights {-1, 0, 1} for the linear layers, which removes the need for multiplications in the matmuls, plus int8 activations).
It seems the method cannot be used for post-training quantization; rather, a 1.58-bit model has to be trained from scratch. I believe the code will be shared here: https://github.com/microsoft/unilm/tree/master/bitnet - would be curious to see if the authors will share the quantized models on the Hub!
I also wonder if the lm_head is quantized as well, since leaving the lm_head unquantized helps preserve generation quality for quantized language models.
We would definitely be happy to open-source the models for future research. Please stay tuned!
The lm_head is not quantized because the language models have to use high-precision probabilities to perform sampling, and it only takes a very small proportion of the cost, especially when the model is large.
This is incredible! Like the other commenter here one of my first thoughts goes immediately to existing LLMs and whether they can be converted to 1.58bit LLMs somehow. @shumingma Did you conduct any experiments in this area? Either via some finetuning method or even distillation?
Unfortunately, the conversion or post-training quantization from existing LLMs doesn't help. This is why we train the models from scratch.
Amazing work!
This method is likely compatible with PowerInfer (as long as the activation function is replaced by ReLU or squared ReLU), which would make it even faster on a mixed setup with, for example, 64GB RAM + 24GB VRAM (which would then support a 400B model at decent speeds).
It would also be interesting to see this combined with some of these papers: (I think all of them are compatible with each other)
Fast feedforward doesn't replicate. Worked on that for a few weeks.
Hi, very exciting work!
I have a few questions on the zero-shot performance on the language tasks.
Did you also compute the evaluation with "BitNet b1.58 70B" ? I'm very curious about these results. I'm referring to something like Table 3.
We haven't finished the training of the models beyond 3B as it requires much much more resources. However, we're optimistic about the results because we have verified that BitNet follows a similar performance-parameter scaling law as the full-precision LLMs. We'll update the results on larger models once they're ready.
Really interesting work! Are there any major drawbacks or are we all just starting over using this?
Hi, great work! I wanted to ask how long the 1B or 700M parameter variants took to train? I couldn't see it in the paper.
I would also be interested to know whether you have a sense of how efficient it is to train models with this method compared to a more traditional model.
The trend of perplexity becoming better with a larger parameter count compared to the 700m and 1.3b is... perplexing.
Did you guys study how it impacts very small parameter count models (e.g., 100M)?
Is it reasonable to conclude that "under-parameterized" Transformers tend to use the full precision to better represent individual neurons, but that this property seems to fade with scaling, which makes the technique more effective w.r.t large models?
From what I understand, as models become larger, sparsity emerges, e.g. https://openreview.net/forum?id=TJ2nxciYCk-
This is great news! Could you share the training code so we can experiment with pre-training smaller models?
Or "TritNet" if they prefer to keep with their existing naming scheme.
I wonder if you could quantize layers one by one, with calibration, down to 1 bit. I know that's not the point of this, as the models were all trained from scratch, but it would be pretty interesting. Something similar to LASER.
Very interesting approach! One question I still have is what's the integer layout used to store the third state (0). Since it is not 1-bit, I am guessing that's where the 1.58 comes from, but I am unclear on what the representation is in binary form. Do you use one bit for the sign and another one for the value?
This research direction is starting to remind me of Hyperdimensional Computing / Vector Symbolic Architectures, which also typically use 1-bit or ternary representations, but take the approach of building explicit knowledge structures by combining concept vectors using a set of basic operations.
I wonder if both HDC/VSA and LLMs end up doing ultimately the same things at their core. It would be really cool if they turned out to be special cases of a single unified framework that combined the former's interpretability with the latter's trainability/scalability :-)
Missed an opportunity to name the title "Ternary weights is all you need".
This paper is very surprising to me. I would have thought that you could have a model with {-1, 0, 1} match the capability of an FP model only by being significantly larger than it. You would be making up for the loss of "descriptiveness" of FP by increasing the number of less descriptive weights. However, if I am following correctly, you've found that you actually don't need to scale up the number of weights at all. Do you have any ideas as to why that might be? It kind of shatters my understanding of what weights were even doing in the first place.
I agree with this. I'd be interested if someone has an idea or intuition to offer here. Is it perhaps that at these high dimensions the extra precision in the weights isn't so valuable (ostensibly just another dimension)?
The memory savings and throughput results in the paper are inference right? Are you seeing the same or similar gains during training or are training gains different?
I believe during training the model keeps full-precision master weights, and the low-bit weights are used for the forward and backward computation.
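For readers wondering what that looks like in practice, here is a minimal sketch of quantization-aware training with a straight-through estimator, assuming BitNet b1.58-style absmean weight quantization; BitLinearSketch and ternarize are illustrative names, not the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F

def ternarize(w: torch.Tensor) -> torch.Tensor:
    # Absmean quantization: scale by mean |w|, round to {-1, 0, +1}, rescale back.
    scale = w.abs().mean().clamp(min=1e-5)
    return (w / scale).round().clamp(-1, 1) * scale

class BitLinearSketch(nn.Linear):
    def forward(self, x):
        w = self.weight  # full-precision master weights, updated by the optimizer
        # Straight-through estimator: the forward pass sees ternary weights,
        # the backward pass sends gradients straight to the master weights.
        w_q = w + (ternarize(w) - w).detach()
        return F.linear(x, w_q, self.bias)

The optimizer only ever updates the full-precision weights; the ternary copy is rebuilt on the fly during training and can be materialized once offline for inference.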
This kind of feels too good to be true. Please prove me wrong, I'd be happy if you do so and prove the results are true.
My main concerns:
Why don't you at least train the 7B version of BitNet on 2T tokens so it can be easily compared on the OpenLLM benchmark? It's easy to show that a 7B model performs well in a setting where it's trained on only 100B tokens, as there is a potential maximum information capacity which is far below that of an fp16 alternative.
What is the StableLM 3B trained on 2T tokens that you are talking about? I could not find such a model. Stability has a StableLM 3B trained on 1T tokens and a StableLM 2 1.6B trained on 2T tokens. The benchmark results for both of these models differ from the ones you provide, and are better.
My main concerns:
- Why don't you at least train the 7B version of BitNet on 2T tokens so it can be easily comparable on OpenLLM benchmark?
They said in a previous thread that they hadn't finished training the models larger than 3.9B yet because of the compute involved. I think the numbers in the paper for sizes like 70B are extrapolated from the current trends, but it sounds like they do plan to train them.
- What is the StableLM 3B trained on 2T tokens you are talking about? I could not find such a model. Stability has StableLM 3B trained on 1T tokens and a StableLM 2 1.6B trained on 2T tokens.
It might be a mistake, assuming that the 3B had the same number of tokens as the 1.6B. There are also Zephyr versions of each, though I'm not sure how many more tokens were used for those fine-tunes.
Cool work! I want to know why the model after ternary QAT is only about 4x smaller. Shouldn't it be at least 8x smaller compared to FP16?
If it is only 4x smaller, it looks more like a 4-bit quantized model, and as we all know 4-bit is almost lossless for current LLMs. @shumingma
I think the following explains it. At smaller sizes, the full-precision embedding takes up a larger share of the model. They estimate that at 70B it will take 1/7 the VRAM of a normal 70B model.
"We further scaled up the model size to 7B, 13B, and 70B and evaluated the
cost. Figure 2 illustrates the trends of latency and memory, showing that the speed-up increases as the
model size scales. In particular, BitNet b1.58 70B is 4.1 times faster than the LLaMA LLM baseline.
This is because the time cost for nn.Linear grows with the model size. The memory consumption
follows a similar trend, as the embedding remains full precision and its memory proportion is smaller
for larger models. Both latency and memory were measured with a 2-bit kernel, so there is still room
for optimization to further reduce the cost."
Though keep in mind those are extrapolations since they haven't actually trained above 3.9b yet.
Interesting work, but doesn't the improvement in PPL of the quantized models vs their fp16 counterparts signal that they (the fp16 models) were not properly trained to begin with? (Intuitively, it should be impossible for the 1-bit model to find a point in weight space with lower loss than the point found by fp16, right?)
Exactly that, these models are under-trained for the number of parameters they have.
Would not 2 bits (and quaternary instead of ternary) be more efficient when implemented on a binary processor?
The performance optimization is in the math. When you do matrix multiplication with ternary weights, the multiplications disappear: -1 × anything = a sign flip, 0 × anything = 0, and 1 × anything = the value itself.
In all cases, the answer is almost instantaneous, even without any specialized hardware. It will be great to see how well this runs on regular CPUs.
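To make that concrete, here is a deliberately naive pure-Python/NumPy illustration of a ternary matrix-vector product done only with additions, subtractions, and skips; real kernels would of course vectorize and pack the weights.

import numpy as np

def ternary_matvec(W, x):
    # W has entries in {-1, 0, +1}; x is an integer activation vector.
    y = np.zeros(W.shape[0], dtype=np.int32)
    for i in range(W.shape[0]):
        acc = 0
        for j in range(W.shape[1]):
            w = W[i, j]
            if w == 1:        # +1: just add the activation
                acc += x[j]
            elif w == -1:     # -1: subtract it (a sign flip)
                acc -= x[j]
            # 0: skip entirely, no work at all
        y[i] = acc
    return y

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8)).astype(np.int8)
x = rng.integers(-127, 128, size=8).astype(np.int32)
assert np.array_equal(ternary_matvec(W, x), W.astype(np.int32) @ x)

The final assert just confirms that the add/subtract/skip loop matches an ordinary integer matmul.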
Good point - interested in this
If you do choose to continue training the larger models, could you use the data used to train Phi-2? I imagine it would scale significantly better than standard data. And potentially 5GB of the deduped StarCoder dataset and 5GB of SlimPajama - just a hopeful request!
Also, is there really currently no way to quantize existing models down to 1.58 bits and use a recovery LoRA, kind of like in y'all's "transformer compression" paper?
I'm particularly curious about how the model size is kept consistent in the table. How is the model size of the b1.58 model calculated? From my understanding, if the model size remains consistent, does it imply more parameters, especially compared to quantization? I also noticed that in the paper, BitNet compares models with different bit widths while keeping the model size consistent. Personally, I believe this approach is less promising than quantization because it does not reduce the model size.
By "model size" they just mean the number of parameters in the model, not the physical size on disk. Generally the memory limitation comes from loading all of the model's data into memory, so that is more representative of the size in the sense you mean.
And for that, they didn't make the embeddings smaller, so it makes a bigger difference the larger the model. You can see that by the time it gets up to 70B params, they estimate 1/7 the RAM, so the file size would be around that much smaller (depending on how the trits are actually encoded into bits).
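Some rough weights-only arithmetic illustrates that point; the vocabulary and hidden sizes below are assumptions, the packing is a simple 2 bits per ternary weight, and activations/KV cache are ignored, so this only shows the trend, not the paper's measured numbers.

def est_gb(total_params, vocab, hidden, weight_bits=2):
    embed = 2 * vocab * hidden                      # input embedding + lm_head kept in FP16 (assumption)
    body = total_params - embed                     # everything else is ternary-quantizable
    fp16_gb = total_params * 2 / 1e9
    low_gb = (body * weight_bits / 8 + embed * 2) / 1e9
    return fp16_gb, low_gb

# Assumed LLaMA-like shapes, purely for illustration
for name, n, hidden in [("3B", 3.3e9, 3200), ("70B", 70e9, 8192)]:
    fp16_gb, low_gb = est_gb(n, vocab=32_000, hidden=hidden)
    print(f"{name}: FP16 ~{fp16_gb:.0f} GB vs packed ternary ~{low_gb:.1f} GB (~{fp16_gb / low_gb:.1f}x)")

The larger the model, the smaller the share of the full-precision embedding and lm_head, so the ratio climbs toward the pure 8x (FP16 to 2-bit) limit, which is consistent with the roughly 7x figure quoted above for 70B.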
Will the training code be made public? That would actually be awesome, and then we "GPU poor" will be able to have a true mixture of experts with tens of models trained on trillions of tokens, and hence AGI. Also, have you thought about doing this for pictures and videos, to train models in a similar fashion?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- OneBit: Towards Extremely Low-bit Large Language Models (2024)
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference (2024)
- FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design (2024)
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More (2024)
- Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs (2024)
Hey, I'm just a curious newb, but I'm wondering: could we have a 1-byte Mamba? Also, spiking neural networks are binary-like and capable of real-time learning (that's why they are sometimes called liquid neural nets, right?), and ternary is just binary with negatives... so might there be a way to record the activation of neurons in response to a prompt, do that 3 times with a different seed each time, and use a graph pruning algorithm to help it learn? And likewise use some kind of associative reinforcement algorithm to make new graph connections between concepts that get brought up together in context?
Could we also use this system in a just-bytes/encoderless multimodal model?
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import layers
import numpy as np


class BitNet(tf.keras.Model):
    def __init__(self, num_layers, hidden_size, num_heads, vocab_size):
        super().__init__()
        self.embeddings = tf.keras.layers.Embedding(vocab_size, hidden_size)
        # `layers` is a reserved property on tf.keras.Model, so use a different attribute name
        self.blocks = [
            BitLinearBlock(hidden_size, num_heads)
            for _ in range(num_layers)
        ]
        self.ln = LayerNormalization(hidden_size)
        self.lm_head = tf.keras.layers.Dense(vocab_size, dtype=tf.float32)  # Use higher precision for lm_head

    def call(self, inputs, training=True):
        x = self.embeddings(inputs)
        for block in self.blocks:
            x = block(x, training=training)
        x = self.ln(x)
        return self.lm_head(x)


class BitLinearBlock(tf.keras.layers.Layer):
    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.atten = BitAttention(hidden_size, num_heads)
        # Assuming the implementation of FeedForward is complete
        self.mlp = FeedForward(hidden_size)

    def call(self, inputs, training):
        att = self.atten(inputs, training)
        return self.mlp(att)


class BitAttention(tf.keras.layers.Layer):
    def __init__(self, hidden_size, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.hidden_size = hidden_size
        self.layer_norm = LayerNormalization(hidden_size)

    def build(self, input_shape):
        # Initialize 1-bit weights etc.
        self.q_weight = self.add_weight(
            shape=(input_shape[-1], self.hidden_size),
            initializer=tf.keras.initializers.GlorotNormal(),
            dtype=tf.float32)
        self.kv_weight = self.add_weight(
            shape=(input_shape[-1], 2 * self.hidden_size),
            initializer=tf.keras.initializers.GlorotNormal(),
            dtype=tf.float32)
        # Convert weights to ternary representation
        self.q_weight = tf.sign(self.q_weight)
        self.kv_weight = tf.sign(self.kv_weight)
        # Centralize weights
        self.q_weight_mean = tf.reduce_mean(self.q_weight)
        self.q_weight -= self.q_weight_mean
        self.kv_weight_mean = tf.reduce_mean(self.kv_weight)
        self.kv_weight -= self.kv_weight_mean
        # Scale factor
        self.q_scale = 1 / tf.reduce_sum(tf.cast(tf.abs(self.q_weight), tf.float32))
        self.kv_scale = 1 / tf.reduce_sum(tf.cast(tf.abs(self.kv_weight), tf.float32))

    def call(self, inputs, training):
        # Absmax-quantize activations
        inputs = quantize(inputs)
        # Multi-head attention
        queries = tf.matmul(inputs, self.q_weight * self.q_scale)
        keys = tf.matmul(inputs, self.kv_weight[:, :self.hidden_size] * self.kv_scale)
        values = tf.matmul(inputs, self.kv_weight[:, self.hidden_size:] * self.kv_scale)
        qk_product = tf.matmul(queries, keys, transpose_b=True) / np.sqrt(self.hidden_size)
        attn_weights = tf.nn.softmax(qk_product)
        attn_out = tf.matmul(attn_weights, values)
        # Residual connection
        output = inputs + attn_out
        # Layer normalization
        output = self.layer_norm(output)
        return output

    def backward(self, grad):
        # Manual gradient sketch: Keras does not call this itself, and `inputs` /
        # `attn_weights` would need to be cached from the forward pass.
        grad_queries = tf.matmul(grad, attn_weights, transpose_a=True)
        # Backprop queries
        grad_queries = quantize(grad_queries)
        grad_q_weight = tf.matmul(inputs, grad_queries, transpose_b=True) * self.q_scale
        # Backprop keys
        grad_keys = tf.matmul(attn_weights, grad, transpose_a=True)
        grad_kv_weight = tf.matmul(inputs, grad_keys, transpose_b=True)[:, :self.hidden_size] * self.kv_scale
        # Backprop values
        grad_values = tf.matmul(attn_weights, grad, transpose_b=True)
        grad_kv_weight = tf.concat([grad_kv_weight, tf.matmul(inputs, grad_values, transpose_b=True)],
                                   axis=1) * self.kv_scale
        return grad


class FeedForward(tf.keras.layers.Layer):
    def __init__(self, hidden_size):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(units=hidden_size, activation=tf.nn.relu)
        self.dense2 = tf.keras.layers.Dense(units=hidden_size)

    def call(self, inputs):
        return self.dense2(self.dense1(inputs))


class LayerNormalization(layers.Layer):
    def __init__(self, hidden_size, epsilon=1e-6):
        super().__init__()
        self.gamma = self.add_weight(shape=(hidden_size,), initializer='ones', trainable=True)
        self.beta = self.add_weight(shape=(hidden_size,), initializer='zeros', trainable=True)
        self.epsilon = epsilon

    def call(self, x):
        mean = tf.reduce_mean(x, axis=-1, keepdims=True)
        variance = tf.reduce_mean(tf.square(x - mean), axis=-1, keepdims=True)
        normalized = (x - mean) * tf.math.rsqrt(variance + self.epsilon)
        return self.gamma * normalized + self.beta


# Placeholder for the quantize function
def quantize(x):
    abs_max = tf.math.reduce_max(tf.math.abs(x)) + 1e-8  # avoid division by zero
    quantized = x / abs_max
    return tf.clip_by_value(quantized, -1, 1)


# Assuming ce (cross-entropy) and lr (learning rate) are defined elsewhere
ce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)  # lm_head outputs logits
lr = 0.001

# Instantiate the model and compile
model = BitNet(
    num_layers=12,
    hidden_size=768,
    num_heads=12,
    vocab_size=30000
)
model.compile(optimizer=Adam(lr), loss=ce)


@tf.function
def train_step(inputs, labels):
    with tf.GradientTape() as tape:
        outs = model(inputs, training=True)
        loss = ce(labels, outs)
    grads = tape.gradient(loss, model.trainable_variables)
    model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
In your BitAttention class you use fp32 for the weights. When are these weights converted into the ternary representation of (-1, 0, 1)? I might be blind, but I just can't see it.
Updated BitAttention:
- Maintain both high-precision master weights and quantized low-bit weights.
- For the forward pass, use the low-bit weights for efficiency.
- For the backward pass, calculate gradients with respect to the low-bit weights.
- Then apply the straight-through estimator: directly accumulate those gradients onto the high-precision master weights, bypassing the non-differentiable quantization.
class BitAttention(tf.keras.layers.Layer):
    def __init__(self, hidden_size, num_heads, quantization_bits=1):
        super().__init__()
        self.num_heads = num_heads
        self.hidden_size = hidden_size
        self.quantization_bits = quantization_bits
        self.layer_norm = LayerNormalization(hidden_size)

    def build(self, input_shape):
        # Initialize high-precision master weights
        self.q_weight_master = self.add_weight(
            shape=(input_shape[-1], self.hidden_size),
            initializer=tf.keras.initializers.GlorotNormal(),
            dtype=tf.float32,
            name='q_weight_master')
        self.kv_weight_master = self.add_weight(
            shape=(input_shape[-1], 2 * self.hidden_size),
            initializer=tf.keras.initializers.GlorotNormal(),
            dtype=tf.float32,
            name='kv_weight_master')
        # Initialize low-bit quantized weights (not trained directly)
        self.q_weight = self.add_weight(
            shape=(input_shape[-1], self.hidden_size),
            initializer=tf.keras.initializers.GlorotNormal(),
            dtype=tf.float32,
            trainable=False,
            name='q_weight')
        self.kv_weight = self.add_weight(
            shape=(input_shape[-1], 2 * self.hidden_size),
            initializer=tf.keras.initializers.GlorotNormal(),
            dtype=tf.float32,
            trainable=False,
            name='kv_weight')

    def call(self, inputs, training):
        # Use low-bit weights for the forward pass
        queries = tf.matmul(inputs, self.q_weight)
        keys = tf.matmul(inputs, self.kv_weight[:, :self.hidden_size])
        values = tf.matmul(inputs, self.kv_weight[:, self.hidden_size:])
        qk_product = tf.matmul(queries, keys, transpose_b=True) / np.sqrt(self.hidden_size)
        attn_weights = tf.nn.softmax(qk_product)
        self.attn_weights = attn_weights  # cached for the manual backward sketch below
        attn_out = tf.matmul(attn_weights, values)
        # Residual connection
        output = inputs + attn_out
        # Layer normalization
        output = self.layer_norm(output)
        return output

    def backward(self, grad):
        # Manual gradient sketch: Keras does not call this itself; `inputs`, `training`,
        # SYNC_INTERVAL and a global step counter are assumed to be available here.
        grad_queries = tf.matmul(grad, self.attn_weights, transpose_a=True)
        # Backprop queries
        grad_queries = quantize(grad_queries)
        grad_q_weight = tf.matmul(inputs, grad_queries, transpose_b=True)
        # Backprop keys
        grad_keys = tf.matmul(self.attn_weights, grad, transpose_a=True)
        grad_kv_weight = tf.matmul(inputs, grad_keys, transpose_b=True)[:, :self.hidden_size]
        # Backprop values
        grad_values = tf.matmul(self.attn_weights, grad, transpose_b=True)
        grad_kv_weight = tf.concat([grad_kv_weight, tf.matmul(inputs, grad_values, transpose_b=True)],
                                   axis=1)
        # Use straight-through estimator: accumulate gradients onto the master weights
        self.q_weight_master.assign_add(grad_q_weight)
        self.kv_weight_master.assign_add(grad_kv_weight)
        # Sync quantized weights from masters periodically
        if training and self.quantization_bits < 32:
            if tf.equal(tf.math.mod(tf.train.get_global_step(), SYNC_INTERVAL), 0):
                self.q_weight.assign(quantize(self.q_weight_master))
                self.kv_weight.assign(quantize(self.kv_weight_master))
        return grad
Is this really 1.58 bits, or is it 2 bits with some waste?
Unless future hardware has ternary memory, it's still going to be stored in binary. The simplest encoding would be 2 bits (maybe one bit for sign -1/1 and one for magnitude 0/1), but that's pretty far from 1.58 bits. You could encode 5 ternary digits in 8 binary bits for storage (1.6 bits/weight), but then you need some decoder (like a lookup table; see the sketch below), and I'm not sure if that was factored into the efficiency/power graphs.
So if we assume it's actually 2-bit storage, it raises the question of why not quantize to 4 values instead of just 3. At first glance it may seem that using only 3 is required to avoid the multiplication, but if I understood correctly the activations are int8, so the 4th weight value could have been 0.5 and the hardware could simply right-shift instead of multiply, which is just as "free" as the other 3 values (-1, 0, 1).
Am I missing something here @shumingma ?
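For reference, the 5-trits-per-byte packing mentioned above is easy to sketch: 3^5 = 243 <= 256, so five ternary weights fit in one byte (1.6 bits/weight) at the cost of a small base-3 decode (or a 243-entry lookup table) on load. This is just an illustration, not what the paper's 2-bit measurement kernel does.

def pack5(trits):
    # Pack 5 values from {-1, 0, +1} into one byte, base-3 (3**5 = 243 <= 256).
    assert len(trits) == 5
    b = 0
    for t in reversed(trits):
        b = b * 3 + (t + 1)   # map {-1, 0, 1} -> {0, 1, 2}
    return b                  # 0..242, fits in a uint8

def unpack5(b):
    trits = []
    for _ in range(5):
        trits.append(b % 3 - 1)
        b //= 3
    return trits

w = [1, -1, 0, 0, 1]
assert unpack5(pack5(w)) == w
print(8 / 5, "bits per weight")  # 1.6

A lookup table indexed by the byte turns unpacking into a single load per five weights, which is presumably what a dedicated kernel or hardware decoder would do.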
Noticed a post-training quantization work that seems similar to this: https://huggingface.co/papers/2402.11960
@brandf It doesn't address the packing question, but now that you mention it, with practically free bit shifts one could avoid multiplication up to (-2, -1, 0, 1, 2) with evenly spaced weights, and even (-4, -2, -1, 0, 1, 2, 4) doesn't look too bad.
Any weight that is 1/2^x can also be done with a shift. It doesn't even have to be symmetric, so for example with 3-bit quantization you get 8 values and you could map them to (-1, -0.5, 0, 0.25, 0.5, 1, 2, 4).
When signed integers are represented in the standard two's complement way, the right shifts need to preserve the high-order bit, but again that's free in hardware.
This shift trick doesn't work unless the activations are integers, though; however, there are similar bit-level tricks that can be done to avoid a full multiply.
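A quick illustration of that shift trick on integer activations, using the hypothetical 3-bit codebook from the comment above (-1, -0.5, 0, 0.25, 0.5, 1, 2, 4): every "multiply" becomes at most a sign flip plus an arithmetic shift.

# (sign, shift amount, direction) per 3-bit code; Python's >> is an arithmetic shift
# (it preserves the sign and rounds toward -inf, a detail real kernels would handle).
CODEBOOK = {
    0: (-1, 0, 0),   # -1    -> negate
    1: (-1, 1, -1),  # -0.5  -> negate, then >> 1
    2: ( 0, 0, 0),   #  0    -> zero
    3: ( 1, 2, -1),  #  0.25 -> >> 2
    4: ( 1, 1, -1),  #  0.5  -> >> 1
    5: ( 1, 0, 0),   #  1    -> pass through
    6: ( 1, 1, 1),   #  2    -> << 1
    7: ( 1, 2, 1),   #  4    -> << 2
}

def mul_by_code(x: int, code: int) -> int:
    sign, shift, direction = CODEBOOK[code]
    if sign == 0:
        return 0
    y = x << shift if direction > 0 else x >> shift
    return y if sign > 0 else -y

assert mul_by_code(100, 4) == 50      # 100 * 0.5
assert mul_by_code(100, 1) == -50     # 100 * -0.5
assert mul_by_code(-100, 7) == -400   # -100 * 4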
I'm a beginner college student. When I first saw ReLU, I thought, "Was there really such a simple way?", and I feel similar this time. This weighting looks to me like a ReLU applied to W.
I hope there's code that I can experiment with or recreate.
Thank you so much for your interest in our work! I'm delighted to see such insightful discussions taking place around our 1-bit LLMs. We truly appreciate the engagement from the community.
I'm excited to share that we will be releasing a detailed note paper this week, which will provide in-depth coverage of the implementation details and experiments discussed in the initial paper. Additionally, we plan to address the questions and comments raised here within the note paper itself.
The note paper is expected to be published this week, hopefully as early as tomorrow. We can't wait to continue the discussions and receive further feedback from all of you once the paper is out.
Stay tuned for the upcoming release, and please feel free to keep the insightful questions and comments coming!
I hope there will be good results!!
A new paper providing training details, code, and FAQ is available at https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf
(It's not on arXiv for some inexplicable reason.)
We welcome any questions or comments you may have regarding this paper and the information it covers. Feel free to share your thoughts and inquiries!
Will the toy models trained in the paper (the 3B variants especially) be released on Hugging Face so that llama.cpp and other software can add support for the modified architecture? It would be interesting to see how the community optimizes for / takes advantage of this on current hardware too.
Someone wrote a critical blog post (saw it on HN), but I'm not experienced enough to know whether the criticisms have merit or not: https://huggingface.co/blog/joey00072/experiments-with-bitnet-1-5
The paper says that the discrepancy with FP16 gets reduced when the models are larger.
In the blog, the models are only 15M parameters, so I don't think it proves anything.
But that said, we still don't know what happens when a 70B ternary model is trained on a very large dataset with 4-8T tokens. Perhaps the ternary model's loss will saturate a lot earlier than the FP16 model.
We have successfully reproduced the results shown in the paper! All models are trained with 100B tokens on RedPajama. The weights can be quantized to ternary values offline. We release the 700M, 1.3B, and 3B models and the evaluation results at https://huggingface.co/1bitLLM
That's awesome, can you share some info on the training compute requirements?
Hi all, first of all, what an exciting result @shumingma ! Very excited to see your followup work, plus of course model weights and code. I wrote a blog post about the paper(s) here: https://learning-exhaust.hashnode.dev/are-all-large-language-models-really-in-158-bits
I hope this helps people pick apart the details and understand what may be going on under the hood. @shumingma I would love to hear your feedback on the blog.
Thanks, this was a very nice writeup!
Hello there!
I am excited about the work you have done, congratulations!
I just have a small question. For the PyTorch implementation that you have provided here, you mention that it is necessary to remove the RMSNorm layers that precede the Attention and MLP calculations, because the new BitLinear layer is responsible for performing this operation. Considering that the RMSNorm contains parameters that are learned during training:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(model.model.layers[0].input_layernorm.weight) # RMSNorm example
Parameter containing:
tensor([0.0535, 0.2080, 0.4473, ..., 0.0854, 0.0435, 0.0289],
requires_grad=True)
In the case of the BitLinear layers: do these RMSNorm layers contain such parameters, or are they parameter-free RMSNorm layers? Do we need to change the original forward operation to something along these lines?
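For anyone experimenting along these lines, here is a minimal PyTorch sketch of a BitLinear forward pass in the spirit of the paper and its follow-up note (absmean ternary weights, per-token absmax int8 activations, and a parameter-free RMSNorm folded into the layer). The names and details are illustrative assumptions, not the official implementation, and whether to keep a learnable RMSNorm gain is exactly the open question above.

import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x, eps=1e-6):
    # Parameter-free RMSNorm (no learnable gain), over the last dimension.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def weight_quant(w):
    # Absmean quantization of the weights to {-1, 0, +1}, rescaled to keep magnitudes.
    scale = w.abs().mean().clamp(min=1e-5)
    return (w / scale).round().clamp(-1, 1) * scale

def activation_quant(x):
    # Per-token absmax quantization of the activations to 8 bits, rescaled back.
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127) / scale

class BitLinear(nn.Linear):
    def forward(self, x):
        w = self.weight
        x = rms_norm(x)  # replaces the RMSNorm that used to precede this linear layer
        # Straight-through estimators so gradients reach the full-precision master weights.
        x_q = x + (activation_quant(x) - x).detach()
        w_q = w + (weight_quant(w) - w).detach()
        return F.linear(x_q, w_q, self.bias)

Under these assumptions the learnable RMSNorm weight shown in the transformers snippet above would either be dropped or kept as a separate elementwise gain outside BitLinear.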
Revolutionize LLMs: BitNet b1.58 Brings 1.58-bit Efficiency!
llama.cpp supports running the models reproduced by @1bitLLM !
Any plans to release the 3B model trained with 2T tokens? It would be a step up in model quality!
Hi, has the code to train a 1.58-bit model from scratch been made public yet? If so, I would appreciate it if anyone could share the link.
Hi all,
We have released the inference code for BitNet b1.58 models. The current release is optimized for CPU devices (both x86 and ARM), and will support GPU and NPU in the coming releases.
https://github.com/microsoft/BitNet
Features:
- Seamless support for the 1-bit models on Hugging Face
- Runs a 100B BitNet b1.58 model on a single CPU at speeds comparable to human reading
- Deploys on various platforms (Windows, Linux, Mac, Android, etc.) and different architectures (x86 and ARM)
Have fun!