
Some info about RAM.

#5
by Flanua - opened

It's been said that this model requires 400 GB of RAM for inference to work swiftly (which might be true if you want it to run swiftly).
But upon loading the model I saw that, like so many other large models, it requires just a bit more RAM than its actual file size.
This particular model at Q8 requires 181920.04 MB (+ 320.00 MB per state) of DDR RAM.
The falcon-180b-chat.Q8_0.gguf file is 177 GB (186,288,380 KB, i.e. 190,759,300,512 bytes on my HDD).
So just to run this model you need around 181920.04 MB (+ 320.00 MB per state) of DDR RAM.
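
As a rough sanity check (just arithmetic on the figures above, assuming the load-time number is essentially the weights plus a small per-sequence overhead):

```python
# Rough sanity check on the numbers quoted above: the Q8_0 weights take
# roughly their on-disk size in RAM, plus a small per-"state" overhead
# (KV cache / scratch buffers) for each concurrent sequence.
file_size_bytes = 190_759_300_512       # falcon-180b-chat.Q8_0.gguf on disk
weights_mb = file_size_bytes / 1024**2  # ~181,922 MB, matching the ~181,920 MB reported
per_state_mb = 320.0                    # the "+ 320.00 MB per state" from the loader output
n_states = 1                            # a single chat session

total_mb = weights_mb + n_states * per_state_mb
print(f"weights ~{weights_mb:,.0f} MB, total ~{total_mb:,.0f} MB")
```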

Yeah, that "400GB" figure is from the source model README, i.e. for the unquantised model.

All these models I've uploaded are quantised, so they will need a lot less, with Q8 being the largest of them, needing roughly half what the original needed, as you showed.
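
For a back-of-the-envelope feel for why Q8 lands around half of the original, here is a small sketch; the bits-per-weight values are approximate nominal figures, and real GGUF files differ a bit because metadata and some higher-precision tensors add overhead:

```python
# Approximate weight sizes for a ~180B-parameter model at different precisions.
# Bits-per-weight values are rough nominal figures, not exact GGUF accounting.
params = 180e9
for name, bits_per_weight in [
    ("FP16 (original)", 16.0),
    ("Q8_0", 8.5),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.9),
]:
    gib = params * bits_per_weight / 8 / 1024**3
    print(f"{name:16s} ~{gib:5.0f} GiB")

# FP16 works out to ~335 GiB of weights alone (hence the ~400 GB figure once
# runtime overhead is added), while Q8_0 comes out near ~178 GiB,
# i.e. roughly the 177 GB file discussed above.
```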

Yeah, I managed to run the Q5_M model, but barely, at the limits of my system (consumer motherboards support at most 128 GB of RAM), and only on CPU (even partial GPU offloading always produces crashes).
My system is not Windows, where the maximum RAM is a hard limit; on Linux there's swap etc. helping to run such models even above the available RAM (I recommend Pika OS to everyone, an Ubuntu clone without the Ubuntu headaches).
Q4_M was also tested and it's fine. Q4_M works at 0.40 tokens/sec max, Q5_M at 0.36 tokens/sec max registered (Intel Xeon, 14 cores / 28 threads, only 23 used).
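
For reference, a CPU-only setup along these lines can be reproduced with the llama-cpp-python bindings (a minimal sketch; the file name, thread count and context size are just illustrative values based on the figures above):

```python
from llama_cpp import Llama

# CPU-only load of a large GGUF quant; relies on mmap plus swap if the
# model doesn't fully fit in physical RAM (values are illustrative).
llm = Llama(
    model_path="falcon-180b-chat.Q5_K_M.gguf",
    n_threads=23,     # threads actually used, per the post above
    n_gpu_layers=0,   # keep everything on the CPU; offloading crashed here
    n_ctx=2048,
)

out = llm("User: Briefly explain what a GGUF file is.\nAssistant:", max_tokens=128)
print(out["choices"][0]["text"])
```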
Q5_M produced some web browser resets, where the page becomes blank, but it doesn't crash everything and can be recovered by refreshing the localhost page, with all chat history saved (maybe it's a problem with lack of RAM or with the browser). In one such instance it produced this error in the logs:
ERROR: Exception in ASGI application

Traceback (most recent call last):
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 247, in run_asgi
    result = await self.app(self.scope, self.asgi_receive, self.asgi_send)
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/fastapi/applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/starlette/middleware/errors.py", line 149, in __call__
    await self.app(scope, receive, send)
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/starlette/middleware/cors.py", line 75, in __call__
    await self.app(scope, receive, send)
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
    raise e
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/starlette/routing.py", line 341, in handle
    await self.app(scope, receive, send)
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/starlette/routing.py", line 82, in app
    await func(session)
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/fastapi/routing.py", line 289, in app
    await dependant.call(**values)
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/gradio/routes.py", line 536, in join_queue
    session_info = await asyncio.wait_for(
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/starlette/websockets.py", line 133, in receive_json
    self._raise_on_disconnect(message)
  File "/home/yui/Downloads/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/starlette/websockets.py", line 105, in _raise_on_disconnect
    raise WebSocketDisconnect(message["code"])
starlette.websockets.WebSocketDisconnect: 1001

But it continues to work after refreshing the web page; the 1001 close code just means the browser's WebSocket went away, so the UI reconnects when the page is reloaded.

About the model itself, I'd call it the most "autistic". After the first prompt, Q4_M starts hallucinating a dialogue with itself: it predicts the next user input, writes it, and then answers itself, producing long self-dialogue novels on any topic. In chat mode it's somewhat more censored than in instruct mode (it declines medical diagnoses and the like, but not in instruct mode). Q5_M isn't fully tested yet, but shows fewer of those self-dialogues, although it also starts one if you click Continue on any finished answer. Leaving Q4_M working overnight with max tokens and an instruction was a failure; it produced mostly textual garbage.
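
A common mitigation for that kind of self-dialogue (not specific to this model) is to pass the user-turn prefix as a stop string, so generation halts as soon as the model starts predicting the next user message. A minimal sketch with the llama-cpp-python bindings, assuming a simple "User:/Assistant:" prompt style (file name and values are illustrative):

```python
from llama_cpp import Llama

llm = Llama(model_path="falcon-180b-chat.Q4_K_M.gguf", n_threads=23, n_ctx=2048)

# Stop generating as soon as the model starts writing the next "User:" turn,
# instead of letting it continue a conversation with itself.
out = llm(
    "User: Give me three facts about falcons.\nAssistant:",
    max_tokens=512,
    stop=["User:", "\nUser"],  # illustrative stop strings for this prompt style
)
print(out["choices"][0]["text"])
```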

It is the easiest model to put into a "stasis" or "cold processing" mode (I've only seen this once before, with Llama 1 65B Q8 GGML). It's a phenomenon where the network is using most of the system's resources and is alive, but isn't producing much heat and runs cool; I see some smartness in this, because it avoids reaching the temperature limits that turn the fans on, so it's very quiet. In all such cases it never freezes the whole operating system and allows other processes to run in parallel (like watching 4K video). I'm writing this during such a mode on Q5_M: it has reserved all available RAM but is working at 20% CPU (rarely crossing 50%) while writing an answer to a prompt. With Llama 1 65B, during stasis it just thinks for hours about who knows what; this Falcon at least writes something during it. (There are no parameters to force this and no user-imposed limits; it puts itself into this mode naturally, and it resets with the next prompt / next token session.)
