this model in Ollama
#5 opened by robbiemu
I used Ollama's new integration to run this model directly (no Modelfile needed, hurrah!) -- specifically the Q4_K_M variant. I mostly use it through Open WebUI, in case that matters.
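In case it helps with reproducing, I believe the direct-run invocation looks like `ollama run hf.co/<user>/<repo>:Q4_K_M` (repo path elided; substitute this model's GGUF repo).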
I notice that the time to first token grows (seemingly exponentially) with the size of the context, in a way that is not at all comparable to the official Qwen variants at the same quantization.
Is anyone else seeing time to first token grow into the minutes with a fully loaded context (two 72 KB README.md files, marked to fully load, with the context window set to 32k)?
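For anyone who wants to compare numbers: here is a rough sketch of how I'd measure time to first token against Ollama's streaming `/api/generate` endpoint. The model tag is a placeholder for this model's GGUF repo, and `num_ctx=32768` matches the 32k context window mentioned above.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "hf.co/<user>/<repo>:Q4_K_M"  # placeholder tag; substitute this model's repo

def time_to_first_token(prompt: str, num_ctx: int = 32768) -> float:
    """Return seconds from request start until the first streamed token."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "stream": True,
        "options": {"num_ctx": num_ctx},
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        # Ollama streams newline-delimited JSON chunks; stop at the first
        # chunk that actually carries generated text.
        for line in resp:
            chunk = json.loads(line)
            if chunk.get("response"):
                return time.monotonic() - start
    return time.monotonic() - start

# e.g. paste the two README files into the prompt and compare against a
# short prompt to see how time to first token scales with context size.
```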