Last modified: 2024-10-05 09:18:40
I have an RTX 3060 12GB from eBay. Let's just plug it in and see if it works?
Well, the computer booted back up and into graphics, so it seems to be working at least a little bit.
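For reference, a quick way to confirm the card is at least visible on the PCI bus, independent of any driver:
$ lspci | grep -i nvidia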
Are we using the proprietary driver? Do we have OpenGL? And most importantly can I use CUDA?
$ glxinfo | grep -i vendor
server glx vendor string: SGI
client glx vendor string: Mesa Project and SGI
Vendor: Mesa (0x10de)
OpenGL vendor string: Mesa
Looks like we're not on the proprietary driver.
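Ubuntu has a tool for this; it can list the detected hardware and the driver package it recommends:
$ ubuntu-drivers devices
and then install the recommended one: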
$ sudo ubuntu-drivers install
Seems convincing. And now reboot?
$ glxinfo | grep -i vendor
server glx vendor string: NVIDIA Corporation
client glx vendor string: NVIDIA Corporation
OpenGL vendor string: NVIDIA Corporation
OK cool!
It seems to have buggered up the mouse acceleration; I don't see how that's even linked.
So I think we have the proprietary driver and OpenGL looks to be working. How do I test CUDA? Just run Ollama and see how fast it is?
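A lower-effort check first: the proprietary driver ships nvidia-smi, whose header reports the driver version and the highest CUDA version it supports, plus which processes are on the GPU:
$ nvidia-smi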
Ollama shows up in nvtop with a filename with "cuda" in it, which looks promising. And it's a lot faster than it was before! Great success.
Getting 132 tokens/sec on llama3.2:1b.
nvtop says ollama is using 18% of GPU RAM.
Let's try 3b.
Now 28% of GPU memory and 92 tokens/sec.
llama3.1:8b?
Getting 50 tokens/sec.
I'm measuring token speed with "ollama run {model} --verbose" and giving it the prompt "what is in london?" which generates quite a lot of text.
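Concretely, the invocation is along these lines; --verbose makes ollama print a timing summary after each response, and the "eval rate" line in that summary is the tokens/sec figure:
$ ollama run llama3.1:8b --verbose
>>> what is in london?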
8b is only using 48% of GPU RAM. What's the next size up? Maybe llama3.2:11b? That's a multimodal model, which would be fun, though I don't know how to feed it images using the ollama CLI.
Ah, ollama doesn't have 11b. Only has 1b and 3b for llama3.2.
The next one up after llama3.1:8b is 70b, but if 8b is using 50% of GPU RAM, then 70b is not going to fit.
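(Rough arithmetic: at 4-bit quantization a model takes on the order of half a byte per parameter plus context and overhead, so 8B is roughly 4-5GB of weights and 70B is roughly 35-40GB, hopeless against 12GB of VRAM.)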
phi3:14b is an 8GB download, maybe that would work.
Phi3 gets 30 tokens/sec but gives a much shorter answer. Only using 67% of GPU RAM. I actually think llama3.1:8b is better than Phi3:14b though.
"mixtral:8x7b-instruct-v0.1-q2_K" is a 16GB download, I wonder what happens with that one? Only using 65% of GPU RAM somehow. But only 10 tokens/sec.
OK, back to llama3.1:8b. It is running with --ctx-size 8192 by default, which is not enough. Increasing it to 64000 makes it more capable, but slower.
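For the record, one way to bump the context size, as far as I know, is a two-line Modelfile that sets num_ctx:
FROM llama3.1:8b
PARAMETER num_ctx 64000
and then build and run a derived model from it (llama3.1-64k is just an arbitrary name for the derived model):
$ ollama create llama3.1-64k -f Modelfile
$ ollama run llama3.1-64k --verbose
Typing /set parameter num_ctx 64000 inside an interactive ollama run session should do the same thing for a single session.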
Trying out continue.dev with llama3.1:8b.
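The wiring on the Continue side is roughly this, in ~/.continue/config.json (from memory; check their docs for the current schema):
{
  "models": [
    { "title": "Llama 3.1 8B (local)", "provider": "ollama", "model": "llama3.1:8b" }
  ]
}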
For context it picked config.js, .vscode/extensions.json, and absolutely none of the actual code. Actually, maybe they're the 4 places you most recently looked at rather than arbitrary ones? The code it suggests uses document.getElementById etc. instead of reactivity.