Object mask clean #18356
base: master
Conversation
@MikoMikarro I reviewed your pull request (including the previous one) and have also been playing with a similar implementation of my own in parallel. I see you are injecting the SAM model inference in … What if we keep the model inference in …? Regarding the model inference and its duration: SAM model inference happens in 3 steps:
1. Encode the RGB image into image embeddings (the image encoder).
2. Encode the prompts (points/boxes).
3. Decode the mask from the embeddings and the encoded prompts.
The most time-consuming step is step 1. It can take from tens of milliseconds on smaller models to a couple of seconds on bigger full-capacity models, but we only need to run it once when initiating the mask. Steps 2-3 happen each time the prompts (points on the image) change; typically that takes no more than tens of milliseconds. On model implementation: let me remind you that both ONNX and cuDNN do not support MinGW builds on Windows. It might also be challenging to use binaries built with MSVC. One possible option is to run FAST-SAM on OpenCV DNN; as it is a CNN-based implementation, it could potentially run there. The second option is GGML, which already has SAM (the original from Meta) implemented in C++; other variants of SAM would require some work.
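To make the encode-once / decode-per-prompt split concrete, here is a minimal C++ sketch. `sam_encode_image`, `sam_decode_mask`, and the surrounding types are hypothetical placeholders standing in for whichever backend (OpenCV DNN, GGML, ...) ends up being used, not an existing API:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct PromptPoint { float x, y; bool positive; };                    // step 2 input
struct Embeddings  { std::vector<float> data; };                      // step 1 output
struct Mask        { int w = 0, h = 0; std::vector<uint8_t> alpha; }; // step 3 output

// Placeholder stubs: a real implementation would call the chosen backend
// (OpenCV DNN, GGML, ...) here instead of returning empty results.
Embeddings sam_encode_image(const std::vector<float>& /*rgb*/, int /*w*/, int /*h*/) { return {}; }
Mask sam_decode_mask(const Embeddings& /*emb*/, const std::vector<PromptPoint>& /*pts*/) { return {}; }

class MaskSession {
public:
  // Step 1: expensive, run once when the mask is initiated, then cached.
  void init(const std::vector<float>& rgb, int w, int h) { emb_ = sam_encode_image(rgb, w, h); }
  // Steps 2-3: cheap, re-run each time the user adds or moves a prompt point.
  Mask update(const std::vector<PromptPoint>& pts) const { return sam_decode_mask(*emb_, pts); }
private:
  std::optional<Embeddings> emb_;
};
```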
I was about to say the same. This is not a good implementation for final integration. Ok as a prototype, but we should:
Just a general hint: this will need a lot more review anyway.
Hello everyone! Thank you for jumping into the conversation! As commented on the previous pull request discussion, the implementation was more of a "proof of concept" than an implementation request. I'm not well-versed enough in the darktable codebase to even consider how to implement the order of operations you are proposing. What I understand is that @jenshannoschwalm is working on preparing the steps you mentioned. Once he has them, I can continue properly implementing the code. If you know how to do that part and can explain, I can hop on again and start working on a more "proper" integration. Thanks a lot for giving it your time and dedication!
@TurboGit @MikoMikarro yes i started working on that but we will have to agree on the precise data to be available. The idea is basically (for now, as that can be done more or less easily):
@jenshannoschwalm what do you mean by "chosen segment" of the mask? Just to align on how the SAM model works, there are two modes of SAM inference:
1. Promptable segmentation: the user provides prompts (points and/or boxes) and the model returns a mask for the prompted object; the mask is recomputed whenever the prompts change.
2. Automated mask generation: the model segments everything in the image without prompts and produces many candidate masks the user can only pick from.
I don't know anything about AI and models and am simply not able to discuss that at all. In fact i'm not even interested in learning that. Just trying to understand how @MikoMikarro's idea could be integrated into darktable - as i said i would be willing to implement the pipeline/cache stuff as i am pretty familiar with that, to give his - or your - project a better chance for investigation. From reading his comments in #18331 i understood that the algo would generate N masks (one for every detected segment, to be selected later in the UI) from the full image.
Yes, the system will generate N masks.
@MikoMikarro that is not the optimal use of SAM. Points as a prompt are used to compute the mask. The model indeed computes multiple masks - typically 3, with different confidence scores - and we just select the best one. If the user adds points to the input, the mask is recomputed. In real-life scenarios the user will need to define multiple positive and negative points as a prompt to achieve good quality of the final mask, and that is where segment-anything models shine. But keep in mind that FAST-SAM is not a typical implementation of SAM: it uses a more traditional backbone based on a CNN instead of vision transformers.
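A tiny sketch of the "select the best of the ~3 candidates" step described here; the `Candidate` type and its score field are assumptions for illustration, not the actual model output format:

```cpp
#include <cstdint>
#include <vector>

struct Candidate { std::vector<uint8_t> alpha; float score; };  // one decoded mask + its confidence

// The decoder typically returns a few candidate masks per prompt set; keep the best-scoring one.
size_t pick_best(const std::vector<Candidate>& candidates) {
  size_t best = 0;
  for (size_t i = 1; i < candidates.size(); ++i)
    if (candidates[i].score > candidates[best].score) best = i;
  return best;
}
```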
Hi, the current implementation gives from 1 to 21500+ masks, which are filtered by confidence score and with the NMS algorithm. Once all the masks are generated, you can "query" them using the points themselves. I don't know exactly how SAM works, but that's the way FAST-SAM is done: the points themselves are not used as part of the input. Just from a performance and speed point of view, I don't see SAM as the best choice for this problem. I think the models Qualcomm has opened work in a similar way. I don't think this discussion is appropriate now, as the model to use has not been decided.
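For context, a minimal sketch of the greedy NMS-style filtering mentioned above, assuming each candidate mask carries a confidence score and overlap is measured with IoU; all names are illustrative, not the actual FAST-SAM code:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct MaskCandidate {
  std::vector<uint8_t> bitmap;  // 1 = pixel inside the mask, 0 = outside
  float score;                  // confidence from the model
};

// Intersection-over-Union between two binary masks of equal size.
static float iou(const std::vector<uint8_t>& a, const std::vector<uint8_t>& b) {
  size_t inter = 0, uni = 0;
  for (size_t i = 0; i < a.size(); ++i) {
    inter += (a[i] && b[i]);
    uni   += (a[i] || b[i]);
  }
  return uni ? float(inter) / float(uni) : 0.f;
}

// Greedy NMS: keep high-confidence masks, drop ones that overlap a kept mask too much.
std::vector<MaskCandidate> nms(std::vector<MaskCandidate> masks, float min_score, float max_iou) {
  std::sort(masks.begin(), masks.end(),
            [](const MaskCandidate& x, const MaskCandidate& y) { return x.score > y.score; });
  std::vector<MaskCandidate> kept;
  for (const auto& m : masks) {
    if (m.score < min_score) break;  // sorted, so everything after is below threshold too
    bool overlaps = false;
    for (const auto& k : kept)
      if (iou(m.bitmap, k.bitmap) > max_iou) { overlaps = true; break; }
    if (!overlaps) kept.push_back(m);
  }
  return kept;
}
```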
If you are getting N masks (segments), can those overlap? If not, it might be better to keep an array of indexes: per pixel, the mask (segment) index (id)?
They can overlap indeed! Right now, since the point is the input and all the masks containing that point would load simultaneously, only the mask that holds all the "small" masks is loaded. Therefore your idea of the indexing is not bad at all, as the masks with higher confidence could be prioritized. For this prototype we could do what you are proposing; it would work perfectly fine. For future systems, another input the user could give is a bounding box itself. If the BB is given, then the filtering of which masks to load could be done with the Intersection-over-Union (IoU) metric, and therefore different, overlapping masks could be loaded. But for this one I wouldn't worry about that. We can do one mask with values from 0 to 255 and fill it with the index of each mask, so the one that coincides with the point added by the artist gets loaded.
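A minimal sketch of that per-pixel index map idea, assuming one uint8 id per pixel (0 reserved for "no mask") and a simple point lookup; the names and layout are hypothetical:

```cpp
#include <cstdint>
#include <vector>

// One byte per pixel: 0 = background / no segment, 1..255 = mask id.
struct IndexMap {
  int w = 0, h = 0;
  std::vector<uint8_t> ids;

  IndexMap(int width, int height) : w(width), h(height), ids(size_t(width) * height, 0) {}

  // Stamp a binary mask into the map under the given id. Only empty cells are
  // written, so if masks are stamped in descending confidence order, the
  // higher-confidence mask wins where they overlap.
  void stamp(const std::vector<uint8_t>& mask, uint8_t id) {
    for (size_t i = 0; i < ids.size(); ++i)
      if (mask[i] && ids[i] == 0) ids[i] = id;
  }

  // The id of the segment under the point the artist clicked (0 if none).
  uint8_t at(int x, int y) const { return ids[size_t(y) * w + x]; }
};
```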
I agree that the specific implementation of the model is not important right now. What is important is the interface between the AI model and darktable, and it should not be tightly coupled with the specific implementation of FAST-SAM, which would be completely unscalable. I strongly advise implementing the AI mask using some variant of a SAM model based on vision transformers. There are multiple implementations of that, starting from the original SAM from Meta, to efficient models for running on CPU like MobileSAM or EfficientSAM, to a more precise and recent variant called HQ-SAM. They all work the same way and can be easily interchanged. I believe for the proof-of-concept phase we can keep FAST-SAM, but design a flexible SAM-based solution. Or we can start using the SAM model implemented on GGML. I have an initial implementation here https://github.com/andriiryzhkov/ai_mask_prototype. The user flow (with high-level technical details) when using any true SAM model will consist of two steps:
1. When the mask is created, the image is run once through the (expensive) encoder to produce image embeddings.
2. The user then adds positive and negative points; each time the prompts change, the lightweight decoder recomputes the mask from the embeddings, which is fast.
Hi @andriiryzhkov! Thanks for the read, I see now what you mean about the encoding of the image. I see a nice improvement in the resulting mask when a prompt is added. However, SAM also has this "give me all the masks" mode; in that regard I suppose it is similar. Regarding the final implementation: storing the encodings doesn't seem to be much different from storing a mask, so I suppose some part of the pipeline would be reusable. I would keep FAST-SAM for this prototype and iterate over it.
For me it is still unclear whether we can generate all the masks just once for the whole image and have the UI pick the interesting one?
@MikoMikarro you are absolutely right, SAM has an automated mask generation mode which produces multiple masks. But in this mode the model decides what your objects will look like, and the user can only select the mask, not adjust it to his/her needs. This makes that mode not very practical for the darktable use case. Check the demo of HQ-SAM here. Do not focus on which model to use right now, but rather on the user flow of how to mark multiple positive and negative points to improve the quality of the resulting mask. As I previously mentioned, this is exactly the strength of the SAM family of models and we should leverage it. Regarding storing image embeddings, I don't think it makes sense. I'd rather store the final mask as a raster, so that there's no need to recompute the mask each time the image is processed. @jenshannoschwalm, answering your question: if we agree to follow the promptable image segmentation approach, which I strongly recommend, we would generate only ONE mask, and it would be recomputed each time the user adds or changes input points. This mask computation is really fast even with bigger models. See this demo video; at the end of it you'll see the segment-on-hover. The only time-consuming thing you need to do beforehand is embed the RGB image.
@TurboGit, @jenshannoschwalm can we create a dedicated branch for the AI object mask so we all can contribute?
@andriiryzhkov, I perfectly understand the benefits of promptable image segmentation. That part is just implementation that will be done in the blend operator, and I will be responsible for it. As for the part @jenshannoschwalm is now responsible for: we just need a way to store some binary blob that is cached with the image and references the original image RGB, which doesn't change. For the FAST-SAM implementation it is just one mask (if we store the id of the mask it comes from), and for SAM it will be an encoding (a magic mask that can be prompted). Both cases require this "do it just once" thing that can be cached and can be persistent (stored in a file). Due to the ease of "creating" the masks that FAST-SAM gives, I think it will give a similar workflow for the user, so I would prefer not to discuss this in this thread/pull request, as its target is to add this new "cache" field where we can store some data for the mask generation. Once that's done, we can create a new thread about the different implementations or which ML model to use. From what I understand, this "do once" step is common to both, and the only thing that changes is inside the "object_mask" operator itself as a "post-process". For this reason, I think @jenshannoschwalm can continue with the implementation as it's currently understood, and with the insight from his code, we'll be able to propose the required changes for the different models we end up using.
I agree that we need to focus on the darktable part now, not the AI one. But it is essential to have a high-level understanding of, and alignment on, how the model works so we can implement the best approach on the darktable side. I just want to ensure we develop a flexible solution, not one limited to a single possible model implementation. I don't want to advocate for a specific model right now; instead, let's focus on what common features and capabilities they have. Speaking of the latter, storing "embeddings" does not sound like a good idea to me. It again makes things too dependent on a specific model implementation. And the second thing is that from the encoder to the decoder you need to pass the model and its state, which includes the embeddings. So things get a little bit complex here, and it's better to avoid it. I'd do all steps of the mask computation only during mask creation, and store the resulting raster transparency mask in a db/file for mask consumption.
I feel the mask raster is not a terrible idea but lacks flexibility. Each time the image is cropped, rotated, or deformed the mask needs to be recomputed. Also, not storing the embedding would imply that every time you want to create or edit masks after reopening darktable, the long part (creating the embeddings) has to be recomputed before the masks can be edited again. I understand that my happy idea with the embeddings may not be sufficient, as the state of the NN is important, but I don't like the idea of having to keep the model loaded all the time. Hanno also commented on some concerns about the performance of having too many raster masks. My approach is closer to the automatic mask generation also proposed in the SAM repository. We agree that it is not the best way to take advantage of the model, but it gives enormous flexibility for mask selection and reduces the amount of resources needed. Furthermore, it helps the system by not needing to keep the model loaded in RAM or VRAM for each modification one performs during the creation of the mask. In my opinion, this provides greater flexibility (at some cost in quality, which can be refined using the current mask refinement capabilities) and also lower consumption of computation resources. I perfectly understand the advantages you propose, but I feel that system is more complex and an iterative process would be needed to implement more complex models. I'm not saying we shouldn't try in the future, but for now this approach is giving me good-enough results - not perfect, but enough - in the examples I tried in my projects, and as a first step towards improving darktable features it should be implemented as I proposed.
I agree with you regarding embeddings. Maybe we can think about how to serialize and save them. Regarding your approach, well, you can definitely try if you insist on it. That's the beauty of open source. But be aware that ONNX Runtime is not available in the Windows/MinGW environment, so it does not meet the requirement to cover all supported platforms. Using a binary package built with MSVC is not possible either, due to binary incompatibility between MSVC and MinGW. This limits your current approach. For automated mask generation you can use a much simpler approach if you wish to go that way: use a YOLO model to segment objects. It produces multiple masks per image and is very fast. It can also run on OpenCV DNN, which is available on all platforms including MinGW. By the way, FAST-SAM is implemented with YOLO inside. Some photo editing software supports both types of AI mask; for example, Luminar NEO has both tools.
AI friends, i volunteered to help out and implement the pipeline part of this "AI masking stuff" because it seems you are not so fluent with this topic and i would like to support the evaluation of whether this is something we really want as a dt feature. If you want me to continue, i need some more information and we have to agree on the darktable<->AI mask "connection".
Please don't overstress me with models and all those internals - i don't give a shit about those details and would get lost and uninterested pretty soon :-) I would really prefer to get something that fits into dt processing and would support all your ideas. Let's stay focused!
@andriiryzhkov, I insist on giving this "automatic" mask generation a try as a first approach and iterating over it. I understand the embedding system would be powerful, as it would also allow a choice of multiple models and more refined masks, but I can't help with that implementation as it is totally out of my spec. I'm not an AI engineer, I'm an aerospace engineer (in the making) with some computer science background who likes to hack into things, so you would need to be the main contributor to that feature and correctly explain to @jenshannoschwalm the requirements of your implementation. I definitely wouldn't pass up helping, but I would need some time to learn how to work with those systems. Regarding ONNX, I perfectly understand that this is a temporary solution and we'll probably move to OpenCV DNN or GGML in the future. The model we'll be using is not important to discuss now, and I don't really care which one we use - just the one with the fewest hassles. I played with YOLO before and for me it would be a perfect solution too. I feel this conversation is rewarding and interesting, but I would prefer to give this approach a try and then talk about how to improve it. For the requirements, I don't know if you need anything extra from me, @jenshannoschwalm. Your idea of just storing a 256x256 matrix with an id for each cell representing the index of the mask would work wonders with my approach.
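As an illustration of that cached payload, a minimal sketch of writing and reading a fixed 256x256 id matrix to a file; the format and function names are purely hypothetical, not an existing darktable interface:

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Hypothetical cached payload: one mask id per cell of a fixed 256x256 grid.
using MaskIdGrid = std::array<uint8_t, 256 * 256>;

bool save_grid(const char* path, const MaskIdGrid& grid) {
  FILE* f = std::fopen(path, "wb");
  if (!f) return false;
  const bool ok = std::fwrite(grid.data(), 1, grid.size(), f) == grid.size();
  std::fclose(f);
  return ok;
}

bool load_grid(const char* path, MaskIdGrid& grid) {
  FILE* f = std::fopen(path, "rb");
  if (!f) return false;
  const bool ok = std::fread(grid.data(), 1, grid.size(), f) == grid.size();
  std::fclose(f);
  return ok;
}
```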
From the user perspective, I believe it makes sense to have automatic detection of specific photography-focused objects like subject, background, and sky. Just relying on automatic mask generation by FAST-SAM is not something I would personally use.
If you are having trouble building just some modules of OpenCV, I did this some time ago.
Since it takes so long to build all the modules, I found it useful:
https://gitlab.com/RationalPI/paintpp/-/blob/master/easyCV.cmake
then where you need OpenCV:
BuildOpenCV(${CMAKE_BUILD_TYPE} ${CMAKE_CURRENT_SOURCE_DIR}/3rdParty/opencv-4.2.0 ${CMAKE_CURRENT_SOURCE_DIR}/3rdParty/builds/OpenCV)
target_include_directories(PaintPP PUBLIC ${OpenCV_INCLUDE_DIRS})
target_link_libraries(PaintPP ${OpenCV_LIBS})
Hi! Thanks! I'll give it a look, because right now it compiles using my local installation of OpenCV instead of one built by myself. I have problems with the local build as it crashes my graphics driver during inference. I still don't know exactly how we will build OpenCV into darktable, but this looks like a good start. Thanks again!
Removed all dependencies on ONNX Runtime and executed the simplified ONNX model with the OpenCV DNN module.
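For reference, loading and running an ONNX model with OpenCV's DNN module generally looks like the following; the model path, image path, and input size are placeholders, not the actual FAST-SAM configuration used in this PR:

```cpp
#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>

int main() {
  // Placeholder model/image paths and input resolution; they depend on the actual export.
  cv::dnn::Net net = cv::dnn::readNetFromONNX("fastsam_simplified.onnx");
  cv::Mat image = cv::imread("input.jpg");

  // Convert the image to the NCHW float blob the network expects.
  cv::Mat blob = cv::dnn::blobFromImage(image, 1.0 / 255.0, cv::Size(640, 640),
                                        cv::Scalar(), /*swapRB=*/true);
  net.setInput(blob);
  cv::Mat out = net.forward();  // raw output, still needs post-processing into masks
  return 0;
}
```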
Clean version of the previous pull request: #18331