feat(llama.cpp): Vulkan, Kompute, SYCL #1647
Comments
feat(sycl): Add sycl support (#1647)

* onekit: install without prompts
* set cmake args only in grpc-server
* cleanup
* fixup sycl source env
* Cleanup docs
* ci: runs on self-hosted
* fix typo
* bump llama.cpp
* llama.cpp: update server
* adapt to upstream changes
* adapt to upstream changes
* docs: add sycl

Signed-off-by: Ettore Di Giacinto <[email protected]>
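For context, a rough sketch of what building LocalAI with the SYCL backend from this change might look like. The `BUILD_TYPE` values and the oneAPI install path are assumptions taken from the docs added in this PR, so treat this as a sketch rather than the authoritative steps:

```bash
# Sketch only: build LocalAI's llama.cpp backend with SYCL (Intel oneAPI).
# BUILD_TYPE values and the setvars.sh path are assumptions; adjust to your setup.
git clone https://github.com/mudler/LocalAI && cd LocalAI

# Load the oneAPI environment (compiler, MKL, SYCL runtime).
source /opt/intel/oneapi/setvars.sh

# Build with SYCL in FP16 mode (sycl_f32 would be the FP32 variant).
BUILD_TYPE=sycl_f16 make build
```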
The merge requests linked in this issue appear to be merged upstream. Does that mean LocalAI already supports Vulkan, or are there additional tasks to do before that?
Only kompute is missing as of now.
It looks like kompute has also been merged.
So... what's missing in LocalAI to support Vulkan? Or would compiling the in-tree llama.cpp with Vulkan support be enough to use it?
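For reference, compiling standalone llama.cpp with Vulkan enabled looked roughly like the sketch below at the time of this thread. The `LLAMA_VULKAN` CMake option is the one introduced in ggml-org/llama.cpp#5138 and may have been renamed in later releases, so this is an assumption-laden sketch rather than LocalAI's official build path:

```bash
# Sketch only: upstream llama.cpp build with the Vulkan backend.
# Requires the Vulkan SDK (headers, loader, glslc); the option name is
# the one from the original Vulkan PR and may differ in newer trees.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_VULKAN=ON
cmake --build build --config Release -j
```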
It seems that we need to set …

I have a pretty old GTX 1050 Ti Mobile GPU, which means I can only run some mini models on it, but it is still much faster than running those small models on the CPU. With the ability to switch dynamically, I can run mini models on my GPU while also trying some larger models on my CPU. The reason I do not use CUDA is that Vulkan is much smaller than the CUDA runtime. Although Vulkan is slower than cuBLAS, I think it is acceptable on my old buddy.

My idea is that the Vulkan backend could also be made optional like the CUDA backend, so LocalAI would prefer the Vulkan backend while still being able to fall back to the CPU backend. At the moment, I find that LocalAI cannot fall back to the CPU if a large model fails to load into the GPU when it is built with …
Technically, you can simply not pass ngl (or set it to 0) to avoid using the GPU for offload.
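As a concrete illustration of the suggestion above, a model definition along these lines should keep inference on the CPU. The file name, model name, and gguf path are placeholders, and the `gpu_layers` field is assumed to map to llama.cpp's `-ngl` option, so treat this as a sketch rather than verified syntax:

```bash
# Sketch only: a LocalAI model definition that avoids GPU offload by
# leaving gpu_layers unset (or explicitly 0). Names and paths are
# placeholders; gpu_layers is assumed to correspond to -ngl.
cat > models/tinyllama-cpu.yaml <<'EOF'
name: tinyllama-cpu
parameters:
  model: tinyllama-1.1b-chat.Q4_K_M.gguf
# gpu_layers: 0   # omit or set to 0 so no layers are offloaded to the GPU
EOF
```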
I found a very hacky way to achieve that: build …
Interesting, thanks for that @arenekosreal!
Tracker for: ggml-org/llama.cpp#5138 and also ROCm