Last modified: 2024-10-05 09:18:40
I have an RTX 3060 12GB from eBay. Let's just plug it in and see if it works?
Well, the computer booted back up and into graphics, so it seems to be working at least a little bit.
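For reference, a quick way to confirm the card is at least visible on the PCI bus, independent of any driver:
$ lspci | grep -i nvidia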
Are we using the proprietary driver? Do we have OpenGL? And most importantly can I use CUDA?
$ glxinfo | grep -i vendor
server glx vendor string: SGI
client glx vendor string: Mesa Project and SGI
Vendor: Mesa (0x10de)
OpenGL vendor string: Mesa
Looks like we're not on the proprietary driver.
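Ubuntu has a tool for this; it can list the detected hardware and the driver package it recommends:
$ ubuntu-drivers devices
and then install the recommended one: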
$ sudo ubuntu-drivers install
Seems convincing. And now reboot?
$ glxinfo | grep -i vendor
server glx vendor string: NVIDIA Corporation
client glx vendor string: NVIDIA Corporation
OpenGL vendor string: NVIDIA Corporation
OK cool!
It seems to have buggered up the mouse acceleration; I don't see how that's even linked.
So I think we have the proprietary driver and OpenGL looks to be working. How do I test CUDA? Just run Ollama and see how fast it is?
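A lower-effort check first: the proprietary driver ships nvidia-smi, whose header reports the driver version and the highest CUDA version it supports, plus which processes are on the GPU:
$ nvidia-smi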
Ollama shows up in nvtop with a filename with "cuda" in it, which looks promising. And it's a lot faster than it was before! Great success.
Getting 132 tokens/sec on llama3.2:1b.
nvtop says ollama is using 18% of GPU RAM.
Let's try 3b.
Now 28% of GPU memory and 92 tokens/sec.
llama3.1:8b?
Getting 50 tokens/sec.
I'm measuring token speed with "ollama run {model} --verbose" and giving it the prompt "what is in london?" which generates quite a lot of text.
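Concretely, the invocation is along these lines; --verbose makes ollama print a timing summary after each response, and the "eval rate" line in that summary is the tokens/sec figure:
$ ollama run llama3.1:8b --verbose
>>> what is in london?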
8b is only using 48% of GPU RAM. What's the next size up? Maybe llama3.2:11b? That's a multimodal model, which would be fun, though I don't know how to feed it images using the ollama CLI.
Ah, ollama doesn't have 11b. Only has 1b and 3b for llama3.2.
The next one up after llama3.1:8b is 70b, but if 8b is using 50% of GPU RAM, then 70b is not going to fit.
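(Rough arithmetic: at 4-bit quantization a model takes on the order of half a byte per parameter plus context and overhead, so 8B is roughly 4-5GB of weights and 70B is roughly 35-40GB, hopeless against 12GB of VRAM.)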
phi3:14b is an 8GB download, maybe that would work.
Phi3 gets 30 tokens/sec but gives a much shorter answer. Only using 67% of GPU RAM. I actually think llama3.1:8b is better than Phi3:14b though.
"mixtral:8x7b-instruct-v0.1-q2_K" is a 16GB download, I wonder what happens with that one? Only using 65% of GPU RAM somehow. But only 10 tokens/sec.
OK, back to llama3.1:8b. It is running with --ctx-size 8192 by default, which is not enough. Increasing it to 64000 makes it more capable, but slower.
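For the record, one way to bump the context size, as far as I know, is a two-line Modelfile that sets num_ctx:
FROM llama3.1:8b
PARAMETER num_ctx 64000
and then build and run a derived model from it (llama3.1-64k is just an arbitrary name for the derived model):
$ ollama create llama3.1-64k -f Modelfile
$ ollama run llama3.1-64k --verbose
Typing /set parameter num_ctx 64000 inside an interactive ollama run session should do the same thing for a single session.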
Trying out continue.dev with llama3.1:8b.
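The wiring on the Continue side is roughly this, in ~/.continue/config.json (from memory; check their docs for the current schema):
{
  "models": [
    { "title": "Llama 3.1 8B (local)", "provider": "ollama", "model": "llama3.1:8b" }
  ]
}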
For context it picked config.js, .vscode/extensions.json, and absolutely none of the actual code. Actually, maybe they're the 4 places you most recently looked at rather than arbitrary ones? The code it suggests uses document.getElementById etc. instead of reactivity.