How much RAM does it need? My 24GB was not enough.
It looks like it is running on the CPU.
24GB should be more than enough for a 6B model... I run the Pygmalion 7B model in full BF16 precision on my 16GB 4080. If it's running on the CPU, it's more likely that you haven't installed one of the required libraries. I would suggest using Oobabooga installed via their installation scripts; here is a link to the Windows version: https://github.com/oobabooga/text-generation-webui/releases/download/installers/oobabooga_windows.zip
The benefit of using their setup script is that it installs everything you need for your hardware. Also, if you had tried using the GPU and the memory was not enough, it would likely just die and not work at all. I don't think it would magically switch to CPU mode without you telling it to, so it sounds more like something isn't set up for it to use the GPU...
Well, I tried to run it in PyCharm using:
from transformers import pipeline
text_generation = pipeline("text-generation", model="PygmalionAI/pygmalion-6b")
generated_text = text_generation("Hello, how are you?")
print(generated_text[0]['generated_text'])
The GPU option also does not work:
from transformers import pipeline
text_generation = pipeline("text-generation",
                           model="PygmalionAI/pygmalion-6b",
                           device=0)  # specify the GPU device number
generated_text = text_generation("Hello, how are you?")
print(generated_text[0]['generated_text'])
and the 24GB of RAM fills up.
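For reference: the pipeline loads the model in FP32 by default, which is roughly 24GB of weights for a 6B model, so both variants above exhaust memory. A minimal sketch of the same call in half precision (torch_dtype is a standard pipeline argument in recent transformers versions; device=0 assumes a single CUDA GPU):

import torch
from transformers import pipeline

# FP16 weights take roughly 12GB instead of the ~24GB needed in FP32
text_generation = pipeline("text-generation",
                           model="PygmalionAI/pygmalion-6b",
                           device=0,                   # first CUDA GPU
                           torch_dtype=torch.float16)  # load in half precision
generated_text = text_generation("Hello, how are you?")
print(generated_text[0]['generated_text'])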
Oh, you're trying to do this in code? I'll pass this over to someone else to support; my suggestion is to start with oobabooga, as it has both example code and installs all the libraries you need. As I said earlier, GPU support requires a whole bunch of extra libraries; it's not going to work if you don't have them installed. If your hardware doesn't support 16-bit, then you might have to load the model in 8-bit mode. Again, check ooba for code/requirement examples. Also, remember that oobabooga provides a Kobold-compatible API and a new streaming text API, so you can connect to it via the API and use it that way as well.
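For the 8-bit route mentioned above, a minimal sketch (this assumes the bitsandbytes and accelerate packages are installed; load_in_8bit was the from_pretrained flag for this at the time, and newer transformers versions spell it through BitsAndBytesConfig):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PygmalionAI/pygmalion-6b")
# int8 weights need roughly 6GB of VRAM for a 6B model;
# device_map="auto" lets accelerate place the layers on the GPU
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b",
                                             load_in_8bit=True,
                                             device_map="auto")

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)  # illustrative length
print(tokenizer.decode(output[0], skip_special_tokens=True))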
Yes, I tried to load it in PyCharm, because I wanted to attach some logic.
I experienced the same thing. It takes 26-27GB, so it's just barely too big for the GPU. Lower precision would help, but I'm not sure where to set that.
Found you can set the precision by calling model.half().
You also need to call that on your input tensor.
Something to note: your CPU might not support all the FP16 operations, so if you use this it will likely only run on the GPU now. So just make sure to call model.cuda().
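A minimal sketch of that advice (assuming a CUDA GPU; the sketch passes torch_dtype=torch.float16 at load time, which is equivalent to calling model.half() afterwards but skips the FP32 intermediate. Note that for text generation the inputs are integer token IDs, so they only need to be moved to the GPU, not converted to FP16):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PygmalionAI/pygmalion-6b")
# Loading directly in FP16 avoids first materialising the FP32 weights,
# which alone would need ~24GB of system RAM
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b",
                                             torch_dtype=torch.float16)
model = model.cuda()  # FP16 ops are much better supported on GPU than CPU

# Token IDs stay integers; they just need to be on the same device as the model
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=50)  # illustrative length
print(tokenizer.decode(output[0], skip_special_tokens=True))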
It's running fine on my 8GB 3060 Ti. Sure, the response time is somewhere between 15 and 25 seconds, but I can live with that.