feat(llama.cpp): Vulkan, Kompute, SYCL #1647

Open · 3 of 4 tasks
mudler opened this issue Jan 26, 2024 · 8 comments
Labels: enhancement (New feature or request), roadmap

Comments
mudler (Owner) commented Jan 26, 2024

Tracker for: ggml-org/llama.cpp#5138 and also ROCm

mudler added the enhancement label Jan 26, 2024
mudler added a commit that referenced this issue Jan 29, 2024
mudler added a commit that referenced this issue Jan 30, 2024
mudler changed the title from "llama.cpp Vulkan, Kompute, SYCL" to "feat(llama.cpp): Vulkan, Kompute, SYCL" Jan 31, 2024
mudler added the roadmap label Jan 31, 2024
mudler added a commit that referenced this issue Feb 1, 2024
mudler added a commit that referenced this issue Feb 1, 2024
* feat(sycl): Add sycl support (#1647)

* onekit: install without prompts

* set cmake args only in grpc-server

Signed-off-by: Ettore Di Giacinto <[email protected]>

* cleanup

* fixup sycl source env

* Cleanup docs

* ci: runs on self-hosted

* fix typo

* bump llama.cpp

* llama.cpp: update server

* adapt to upstream changes

* adapt to upstream changes

* docs: add sycl

---------

Signed-off-by: Ettore Di Giacinto <[email protected]>
mudler pinned this issue Mar 2, 2024
RiQuY commented Apr 8, 2024

The merge requests linked in this issue appear to be merged upstream. Does that mean LocalAI already supports Vulkan, or are there additional tasks to do before that?

mudler (Owner, Author) commented Jun 24, 2024

> The merge requests linked in this issue appear to be merged upstream. Does that mean LocalAI already supports Vulkan, or are there additional tasks to do before that?

Only Kompute is missing as of now.

jim3692 commented Sep 9, 2024

> Only Kompute is missing as of now.

It looks like Kompute is also merged.

KhazAkar commented

So, what's missing in LocalAI to support Vulkan? Or would compiling the in-tree llama.cpp with Vulkan support be enough to use it?

arenekosreal commented

It seems that we need to set BUILD_TYPE=vulkan to make llama.cpp use Vulkan. But does this mean we have to rebuild the program if we want to use the CPU backend? I do not see any option that lets me switch between them at runtime.

I have a fairly old GTX 1050 Ti Mobile GPU, which means I can only run small models on it, but that is still much faster than running those models on the CPU. With the ability to switch dynamically, I could run small models on my GPU while also trying some larger models on my CPU. The reason I do not use CUDA is that the Vulkan runtime is much smaller than the CUDA runtime. Although Vulkan is slower than cuBLAS, I think it is acceptable on my old buddy.

My idea is to make the Vulkan backend optional, like the CUDA backend, so LocalAI prefers Vulkan but can also fall back to the CPU backend. At the moment, I find that LocalAI built with BUILD_TYPE=vulkan cannot fall back to the CPU when a large model fails to load onto the GPU.
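For reference, a minimal sketch of the rebuild-to-switch workflow described here, assuming LocalAI's standard Makefile build where BUILD_TYPE selects the llama.cpp acceleration:

```sh
# Vulkan-enabled build (sketch; assumes the standard LocalAI Makefile workflow)
git clone https://github.com/mudler/LocalAI && cd LocalAI
BUILD_TYPE=vulkan make build

# Switching to a CPU-only (OpenBLAS) build currently means rebuilding:
BUILD_TYPE=openblas make build
```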

KhazAkar commented

> It seems that we need to set BUILD_TYPE=vulkan to make llama.cpp use Vulkan. But does this mean we have to rebuild the program if we want to use the CPU backend? […]

Technically you can avoid passing ngl, or set it to 0, so that the GPU is not used for offloading.
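A minimal sketch of what that could look like in a model config, assuming LocalAI's `gpu_layers` field maps to llama.cpp's `-ngl`; the file and model names here are illustrative:

```sh
# Hypothetical model definition that disables GPU offload even on a Vulkan build
cat > models/small-model-cpu.yaml <<'EOF'
name: small-model-cpu
gpu_layers: 0            # 0 layers offloaded, i.e. CPU-only inference
parameters:
  model: small-model.Q4_K_M.gguf
EOF
```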

arenekosreal commented

> Technically you can avoid passing ngl, or set it to 0, so that the GPU is not used for offloading.

I found a very hacky way to achieve that: build llama-cpp-fallback with BUILD_TYPE=openblas, while building the other parts with BUILD_TYPE=vulkan. LocalAI will then try Vulkan first and fall back to OpenBLAS. But this means I have to enable other optimizations on llama-cpp-fallback myself. I am not sure whether this usage is valid, because there is no documentation about mixing backend build types.
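A sketch of that mixed build, assuming the Makefile exposes a per-backend target such as `backend-assets/grpc/llama-cpp-fallback` (the target name is an assumption and may differ between releases):

```sh
# 1. Build only the CPU fallback backend with OpenBLAS
BUILD_TYPE=openblas make backend-assets/grpc/llama-cpp-fallback

# 2. Build the rest of LocalAI (default llama.cpp variants included) with Vulkan;
#    the already-built fallback binary is left as-is
BUILD_TYPE=vulkan make build
```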

KhazAkar commented

Interesting, thanks for that @arenekosreal!
Technically llama.cpp can be compiled with multiple backends baked in, so this distinction in LocalAI is interesting to see.
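For comparison, a sketch of a multi-backend upstream llama.cpp build, assuming a recent tree where the `GGML_*` CMake options select backends (the CPU backend is always compiled in):

```sh
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_VULKAN=ON        # add -DGGML_CUDA=ON for a CUDA+Vulkan build
cmake --build build --config Release -j
```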
