Object mask clean #18356

Open
wants to merge 13 commits into master

Conversation

MikoMikarro

Clean version of the previous pull request: #18331

@andriiryzhkov
Contributor

@MikoMikarro I reviewed your pull request (including the previous one) and have also been playing with a similar implementation of my own in parallel.

I see you are injecting SAM model inference into the pixelpipe. That means the mask will be recalculated every time the image is processed, right? That could be costly, especially considering we might use other, more accurate variants of the SAM model.

What if we keep the model inference in the mask itself? In that case we need to refactor how the mask writes to and reads from history. The main magic happens when saving dt_masks_form_t.points. We could add a new variable to dt_masks_form_t and a column in the database to store it, or alternatively work out how to fold it into the points column while saving.
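
Purely to illustrate the idea (none of these names exist in darktable today; the struct, table, and column names below are placeholders):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical payload kept alongside dt_masks_form_t so the model output
   survives in history. */
typedef struct dt_masks_ai_payload_t
{
  uint32_t version;  /* payload format version */
  size_t size;       /* number of bytes in data */
  uint8_t *data;     /* serialized mask or embedding produced by the model */
} dt_masks_ai_payload_t;

/* ...plus a matching column for persistence, e.g.
   ALTER TABLE main.masks_history ADD COLUMN ai_payload BLOB;  -- table/column names are guesses */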

Regarding the model inference and its duration: SAM model inference happens in three steps:

  1. image encoding,
  2. prompt encoding,
  3. mask decoding.

The most time-consuming step is step 1. It can take from tens of milliseconds on smaller models to a couple of seconds on bigger full-capacity models, but we only need to run it once when initiating the mask. Steps 2-3 happen each time the prompts (points on the image) change; typically that takes no more than tens of milliseconds.
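
To make that split concrete, here is a rough sketch of what the calling side could look like (all type and function names are placeholders, not a real API):

#include <cstdint>
#include <vector>

struct SamPoint   { float x, y; int label; };         // label: 1 = include, 0 = exclude
struct SamContext { std::vector<float> embedding; };  // result of the expensive step 1

// Step 1: run the image encoder once when the mask is initiated
// (tens of ms on small models, up to seconds on full-capacity ones).
SamContext sam_encode_image(const uint8_t *rgb, int width, int height);

// Steps 2+3: re-run prompt encoding + mask decoding whenever the points change
// (typically no more than tens of ms, independent of step 1).
std::vector<uint8_t> sam_decode_mask(const SamContext &ctx,
                                     const std::vector<SamPoint> &prompts,
                                     int width, int height);

Everything upstream only needs to keep the SamContext around; the per-edit cost is then steps 2-3 alone.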

On model implementation: let me remind you that neither ONNX Runtime nor cuDNN supports a MinGW build on Windows, and it may also be challenging to use binaries built with MSVC. One possible option is to run FAST-SAM on OpenCV DNN; as it is a CNN-based implementation, it could potentially run there. The second option is GGML, which already has SAM (the original from Meta) implemented in C++; other variants of SAM would require some work.
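
For reference, loading an exported ONNX model through OpenCV's DNN module would look roughly like this (the file names and input size are placeholders, and the preprocessing is model dependent):

#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

int main()
{
  // "fastsam.onnx" stands in for whichever exported model ends up being used.
  cv::dnn::Net net = cv::dnn::readNetFromONNX("fastsam.onnx");

  cv::Mat image = cv::imread("input.jpg");
  // Pack the image into the NCHW blob the network expects; 1/255 scaling and a
  // 640x640 input are typical for YOLO-style backbones but are model dependent.
  cv::Mat blob = cv::dnn::blobFromImage(image, 1.0 / 255.0, cv::Size(640, 640),
                                        cv::Scalar(), /*swapRB=*/true, /*crop=*/false);
  net.setInput(blob);

  std::vector<cv::Mat> outputs;
  net.forward(outputs, net.getUnconnectedOutLayersNames());
  // outputs now hold the raw detections / mask prototypes to post-process.
  return 0;
}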

@TurboGit
Member

TurboGit commented Feb 4, 2025

I see you are injecting SAM model inference into the pixelpipe. That means the mask will be recalculated every time the image is processed, right?

I was about to say the same. This is not a good implementation for final integration. It's OK as a prototype, but we should:

  • Have the AI mask create raster masks
  • Have a way to select multiple raster masks
  • Have a way to save the raster mask into the XMP

Just a global hint: this will need a lot more review anyway.

@MikoMikarro
Author

Hello everyone! Thank you for jumping into the conversation!

As commented in the previous pull request discussion, the implementation was more of a "proof of concept" than an implementation request. I'm not well-versed enough in the darktable codebase to even consider how to implement the order of operations you are proposing.

What I understand is that @jenshannoschwalm is working on preparing the steps you mentioned. Once he has them, I can continue properly implementing the code.

If you know how to do that part and can explain it, I can hop in again and start working on a more "proper" integration.

Thanks a lot for giving it your time and dedication!

@jenshannoschwalm
Collaborator

@TurboGit @MikoMikarro yes, I started working on that, but we will have to agree on the precise data to be made available.

The idea is basically (for now, as this can be done more or less easily):

  1. Any module requiring an AI mask tells the pipe it wants it; we already do something like that for the details mask.
  2. The demosaic module gets some added functionality. If it has the AI masks available (protected by a hash depending on previous modules in the pipe), everything is good. If not, it a) expands the roi_in to make sure all data are available and b) generates the masks in module->process (whatever they exactly look like; it seems we want to have N masks). See the sketch after this list.
  3. All modules requesting a chosen segment get that specific segment's mask, just as we get a raster or a details mask.
  4. Currently raster masks are very restricted in use and, with many segments, would be very memory costly, by the way.
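
A minimal sketch of the roi expansion in 2a, assuming darktable's existing modify_roi_in() hook; the want_ai_mask flag and the full-extent lookup are placeholders, not existing code:

void modify_roi_in(dt_iop_module_t *self, dt_dev_pixelpipe_iop_t *piece,
                   const dt_iop_roi_t *roi_out, dt_iop_roi_t *roi_in)
{
  *roi_in = *roi_out;
  // Placeholder flag, mirroring how the details mask is requested today.
  if(!piece->pipe->want_ai_mask) return;

  // Grow the request to the full module input so the model sees every pixel once.
  roi_in->x = 0;
  roi_in->y = 0;
  roi_in->width = piece->buf_in.width;
  roi_in->height = piece->buf_in.height;
}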

@andriiryzhkov
Contributor

@jenshannoschwalm what do you mean by "chosen segment" of the mask?

Just to align on how the SAM model works, there are two modes of SAM inference:

  1. Based on prompts. It generates a single mask as output, given a prompt as a list of points (and/or bounding boxes) labeled as 1 (to include) or 0 (to exclude).
  2. Automatic mask generation. In this case there's no input to the model other than the image itself; the output is multiple masks of objects. This mode is not flexible and won't be very useful for the photography use case.

@jenshannoschwalm
Collaborator

@jenshannoschwalm what do you mean by "chosen segment" of the mask?

I don't know anything about AI and models and am simply not able to discuss that at all. In fact I'm not even interested in learning it. I'm just trying to understand how @MikoMikarro's idea could be integrated into darktable - as I said, I would be willing to implement the pipeline/cache stuff, as I am pretty familiar with that, to give his - or your - project a better chance for investigation.

From reading his comments in #18331 I understood that the algo would generate N masks (one for every detected segment, to be later selected via the UI) from the full image.

@MikoMikarro
Author

Yes, the system will generate N masks, and the user will add points that determine which masks to load. But the masks are generated beforehand.

@andriiryzhkov
Contributor

andriiryzhkov commented Feb 4, 2025

@MikoMikarro that is not the optimal use of SAM. Points given as a prompt are used to compute the mask. The model indeed computes multiple masks - typically 3, with different confidence scores - and we just select the best one. If the user adds points to the input, the mask is recomputed.
You can see a small demo here: https://github.com/YavorGIvanov/sam.cpp
This implementation supports only one point as input, but that's a limitation of the inference code, not the model itself.

In real-life scenarios the user will need to define multiple positive and negative points as a prompt to achieve good quality in the final mask, and that is where segment-anything models shine. But keep in mind that FAST-SAM is not a typical implementation of SAM: it uses a more traditional CNN-based backbone instead of vision transformers.

@MikoMikarro
Author

MikoMikarro commented Feb 4, 2025

Hi, the current implementation produces from 1 to 21,500+ masks, which are filtered by confidence score and with the NMS (non-maximum suppression) algorithm.
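
For readers unfamiliar with that step, a generic sketch of confidence filtering plus mask-level NMS (not the code in this PR) could look like this:

#include <opencv2/core.hpp>
#include <algorithm>
#include <vector>

struct Candidate { cv::Mat mask; float score; };  // mask: CV_8U, 0/255

// IoU of two binary masks: shared pixels over combined pixels.
static float mask_iou(const cv::Mat &a, const cv::Mat &b)
{
  const double inter = cv::countNonZero(a & b);
  const double uni   = cv::countNonZero(a | b);
  return uni > 0.0 ? static_cast<float>(inter / uni) : 0.f;
}

// Keep high-score masks; drop any mask that overlaps an already-kept one too much.
std::vector<Candidate> filter_masks(std::vector<Candidate> cands,
                                    float conf_thresh, float iou_thresh)
{
  std::sort(cands.begin(), cands.end(),
            [](const Candidate &x, const Candidate &y) { return x.score > y.score; });
  std::vector<Candidate> kept;
  for(const auto &c : cands)
  {
    if(c.score < conf_thresh) break;  // sorted, so everything after is worse
    bool suppressed = false;
    for(const auto &k : kept)
      if(mask_iou(c.mask, k.mask) > iou_thresh) { suppressed = true; break; }
    if(!suppressed) kept.push_back(c);
  }
  return kept;
}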

Once all the masks are generated, you can "query" them using the points themselves. I don't know exactly how SAM works, but that's the way FAST-SAM works; the points themselves are not used as part of the model input.

Just from a performance and speed point of view, I don't see SAM as the best choice for this problem. I think the models Qualcomm has open-sourced work in a similar way. I don't think this discussion is appropriate right now, as the model to use has not been decided.

@jenshannoschwalm
Collaborator

If you are getting N masks (segments), can those overlap? If not, it might be better to keep an array of indices: per pixel, the mask (segment) index (id)?

@MikoMikarro
Author

They can overlap indeed!

Right now, since the point is the input, all the masks that contain that point are loaded simultaneously, so effectively only the mask that holds all the "small" masks is loaded. Therefore your indexing idea is not bad at all, as the masks with higher confidence could be prioritized.

For this prototype we could do what you are proposing; it would work perfectly fine.

For future systems, another input the user could give is a bounding box. If a BB is given, the filtering of which masks to load could be done with the Intersection-over-Union (IoU) metric, thereby loading different masks that could overlap.

But yeah, for this one I wouldn't worry about that. We can use one mask with values from 0 to 255 and fill it with the index of each mask, so the one that coincides with the point added by the artist gets loaded.
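
A small sketch of that index-map idea (assuming the masks are already sorted by descending confidence so the best mask wins each pixel, and reserving 0 for "no mask"):

#include <opencv2/core.hpp>
#include <vector>

// Collapse up to 255 overlapping binary masks into a single CV_8U index map.
cv::Mat build_index_map(const std::vector<cv::Mat> &masks, int width, int height)
{
  cv::Mat index_map = cv::Mat::zeros(height, width, CV_8U);
  for(size_t i = 0; i < masks.size() && i < 255; ++i)
  {
    for(int y = 0; y < height; ++y)
      for(int x = 0; x < width; ++x)
        if(masks[i].at<uchar>(y, x) && index_map.at<uchar>(y, x) == 0)
          index_map.at<uchar>(y, x) = static_cast<uchar>(i + 1);  // highest-confidence mask wins
  }
  return index_map;
}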

@andriiryzhkov
Contributor

andriiryzhkov commented Feb 4, 2025

I agree that the specific implementation of the model is not important right now. What is important is the interface between the AI model and darktable. And it should not be tightly coupled with the specific implementation of FAST-SAM, which would be completely unscalable.

I strongly advise us to implement the AI mask using some variant of the SAM model based on vision transformers. There are multiple implementations, from the original SAM from Meta, to efficient models for running on CPU like MobileSAM or EfficientSAM, to a more precise and recent variant called HQ-SAM. They all work the same way and can easily be interchanged.
FAST-SAM is different. The authors took the concept of segment-anything but did their best to implement it using older technology in order to eliminate the computationally heavy ViT part. It works fast, but is less accurate. Most important for us right now, FAST-SAM works differently: it doesn't support the full power of prompting (with points and bounding boxes).

I believe for the proof-of-concept phase we can keep FAST-SAM but design a flexible SAM-based solution. Or we can start using the SAM model implemented on GGML. I have an initial implementation here: https://github.com/andriiryzhkov/ai_mask_prototype.

The user flow (with high-level technical details) when using any true SAM model will consist of two steps:

  1. The model is loaded and the image embeddings are calculated. Input: the RGB image. This is the most time-consuming step and is performed when the AI object mask is opened. The actual time strongly depends on the size of the model.
  2. The user defines points as a prompt on the image. Points can be positive or negative, depending on whether we want to keep or remove the annotated object from the mask. After any change to the input points, the prompt is encoded and the output raster mask is generated. This is a fast step, typically taking milliseconds for any model size.

@MikoMikarro
Author

Hi @andriiryzhkov! Thanks for the read; I now see what you mean about the encoding of the image. I see a nice improvement in the resulting mask when a prompt is added.

However, SAM also has this "give me all the masks" mode. In this regard I suppose it is similar.

Regarding the final implementation: storing the encodings doesn't seem much different from storing a mask, so I suppose part of the pipeline would be reusable.

I would keep Fast-SAM for this prototype and iterate over it.

@jenshannoschwalm
Collaborator

For me it is still unclear whether we can generate all the masks just once for the whole image, with the UI then picking the interesting one?
This is a very important question, as it defines how and where we can integrate the mask generation into the darktable pipeline in a performant way.

@andriiryzhkov
Contributor

@MikoMikarro you are absolutely right, SAM has an automated mask generation mode which produces multiple masks. But in this mode the model decides what your objects will look like, and the user can only select, not adjust, the mask to his/her needs. This makes this mode not very practical for the darktable use case.

Check the demo of HQ-SAM here. Do not focus on which model to use right now, but rather on the user flow of marking multiple positive and negative points to improve the quality of the resulting mask. As I previously mentioned, this is exactly the strength of the SAM family of models, and we should leverage it.

Regarding storing image embeddings, I don't think it makes sense. I'd rather store the final mask as a raster, so that there's no need to recompute the mask each time the image is processed.

@jenshannoschwalm, answering your question: if we agree to follow the promptable image segmentation approach, which I strongly recommend, we would generate only ONE mask, and it would be recomputed each time the user adds or changes the input points. This mask computation is really fast even with bigger models. See this demo video; at the end of it you'll see segment-on-hover. The only time-consuming thing you need to do beforehand is embed the RGB image.

@andriiryzhkov
Contributor

@TurboGit, @jenshannoschwalm can we create a dedicated branch for AI object mask so we all can contribute?

jenshannoschwalm marked this pull request as ready for review February 5, 2025 10:09
@MikoMikarro
Author

MikoMikarro commented Feb 5, 2025

@andriiryzhkov, I perfectly understand the benefits of promptable image segmentation.

That part is just implementation that will be done in the blend operator, and I will be responsible for it.

For the part @jenshannoschwalm is now responsible for: we just need a way to store some binary blob that is cached per image and references the original RGB image, which doesn't change. For the FAST-SAM implementation it is just one mask (if we store the id of the mask each pixel comes from), and for SAM it will be an encoding (a magic mask that can be prompted).

Both cases require this "do it just once" thing that can be cached and made persistent (stored in a file). Due to the ease of "creating" the masks that FAST-SAM gives, I think it will give the user a similar workflow, so I would prefer not to discuss this in this thread/pull request, as its target is to add this new "cache" field where we can store some data for mask generation.
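
To make the "cache once, reuse" idea concrete, here is one hypothetical on-disk layout; the header fields and file format are made up for illustration only:

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical layout for the "compute once, reuse" blob: a small header tying
// the payload to the source image, followed by raw bytes (a FAST-SAM index map,
// a SAM embedding, or whatever the chosen model produces).
struct ai_mask_cache_header
{
  int32_t imgid;         // image id the payload belongs to
  uint64_t input_hash;   // hash of the RGB input the model actually saw
  uint64_t payload_size; // bytes that follow the header
};

bool write_ai_mask_cache(const std::string &path, const ai_mask_cache_header &hdr,
                         const std::vector<uint8_t> &payload)
{
  FILE *f = std::fopen(path.c_str(), "wb");
  if(!f) return false;
  const bool ok = std::fwrite(&hdr, sizeof(hdr), 1, f) == 1
               && std::fwrite(payload.data(), 1, payload.size(), f) == payload.size();
  std::fclose(f);
  return ok;
}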

Once that's done, we can create a new thread talking about the different implementations or what ML model to use.

From what I understand, this "do it once" part is common to both, and the only thing that changes is inside the "object_mask" operator itself, as a "post-process".

For this reason, I think @jenshannoschwalm can continue with the implementation as it's currently understood, and with the insight from his code, we'll be able to propose the required changes for the different models we end up using.

@andriiryzhkov
Contributor

I agree that we need to focus on the darktable part now, not the AI one. But it is essential to have a high-level understanding of, and alignment on, how the model works so we can implement the best approach on the darktable side.

I just want to ensure we develop a flexible solution, not one limited to a single possible model implementation. I don't want to advocate for a specific model right now, but instead focus on the common features and capabilities they share.

Speaking of the latter, storing "embeddings" does not sound like a good idea to me. It again makes things too dependent on a specific model implementation. The second thing is that from the encoder to the decoder you need to pass the model and its state, which includes the embeddings. So things get a little complex here, and it's better to avoid it.

I'd do all the steps of mask computation only during mask creation, and store the resulting raster transparency mask in a db/file for mask consumption.

@MikoMikarro
Author

MikoMikarro commented Feb 5, 2025

I feel the raster mask is not a terrible idea, but it lacks flexibility. Each time the image is cropped, rotated, or deformed, the mask needs to be recomputed. Also, not storing the embedding would imply that every time you want to create or edit masks after re-opening darktable, a recomputation (of the long part, creating the embeddings) is needed each time the masks are to be edited again. I understand that my happy idea with the embeddings may not be sufficient, as the state of the NN is important, but I don't like the idea of having to keep the model loaded all the time. Hanno also commented on some concerns about the performance of having too many raster masks.

My approach is closer to the automatic mask generation also proposed in the SAM repository. We agree that it is not the best way to take advantage of the model, but it gives enormous flexibility for mask selection and reduces the amount of resources needed. Furthermore, it helps the system by not needing the model loaded in RAM or VRAM for each modification one performs while creating the mask.

In my opinion, this provides greater flexibility (at some cost in quality, which can be recovered using the current mask refinement capabilities) and also lower consumption of computational resources.

I perfectly understand the advantages you propose, but I feel that system is more complex and an iterative process would be needed to implement more complex models. I don't say we shouldn't try that in the future, but for now this approach is giving me good-enough results (not perfect, but enough) in the examples I tried in my projects, and as a first step towards improving darktable's features it should be implemented as I proposed.

@andriiryzhkov
Contributor

I agree with you regarding embeddings. Maybe we can think about how to serialize and save them.

Regarding your approach, well, you can definitely try it if you insist. That's the beauty of open source.

But you are aware that ONNX Runtime is not available in the Windows/MinGW environment, so it does not meet the requirement to cover all supported platforms. Using a binary package built with MSVC is not possible either, due to binary incompatibility between MSVC and MinGW. This limits your current approach.

For automated mask generation you can use a much simpler approach if you wish to go that way: use a YOLO model to segment objects. It produces multiple masks per image and is very fast. It can also run on OpenCV DNN, which is available on all platforms including MinGW. By the way, FAST-SAM is implemented with YOLO inside.

Some photo editing software supports both types of AI mask. For example, Luminar NEO has both tools.

@jenshannoschwalm
Collaborator

AI friends, I volunteered to help out and implement the pipeline part of this "AI masking stuff" because it seems you are not so fluent with this topic, and I would like to support evaluating whether this is something we really want as a dt feature. If you want me to continue, I need some more information, and we have to agree on the darktable <-> AI mask "connection".

  1. Please be reminded that dt is basically a realtime raw processor, and we all want it to stay that way!
  2. In most processing modules - certainly all those that will be used with any mask - the module sees only part of the image, and only what is presented by the preceding module; we call this the roi (region of interest). Whenever we want a mask to be used in any module - and most people use masks a lot - it should come at the lowest performance cost possible. That's why I think any AI processing while blending the mask should be strictly avoided.
  3. To overcome the ROI "restrictions" - we definitely want them for performance - we should (I would say must) create all AI masks once for the whole image. (We already do that for the details mask, btw.) All modules wanting to use any AI mask should just a) transform an AI mask (which could be binary for every mask) to a float mask and b) do the geometric transformation. That would be fast and, imho, the right way to do it.
  4. I agree that an AI mask being a raster mask would be a less optimal solution, as it would not allow us to do all the blending magic we have, such as feathering, blurring, etc. So the idea proposed by @MikoMikarro here clearly seems the better way to me.

Please don't overstress me with models and all those internals - I don't give a shit about those details and would get lost and uninterested pretty soon :-) I would really prefer to get something that fits into dt processing and would support all your ideas. Let's stay focused!

@MikoMikarro
Author

@andriiryzhkov, I insist on giving this "automatic" mask generation a try as a first approach and iterating on it. I understand the embedding system would be powerful, as it would also allow selecting between multiple models and more refined masks, but I can't help with that implementation as it is totally outside my expertise. I'm not an AI engineer; I'm an aerospace engineer (in the making) with some computer science background who likes to hack on things, so you would need to be the main contributor to that feature and correctly explain to @jenshannoschwalm the requirements of your implementation. I definitely wouldn't opt out of helping, but I would need some time to learn how to work with those systems.

Regarding ONNX, I perfectly understand that this is a temporary solution and we'll probably move to OpenCV DNN or GGML in the future. The model we'll be using is not important to discuss now, and I don't really care which one we use, just the one with the fewest hassles. I've played with YOLO before, and for me it would be a perfect solution too.

I feel this conversation is rewarding and interesting, but I would prefer that we give this approach a try and then talk about how to improve it.

As for the requirements, I don't know if you need anything extra from me, @jenshannoschwalm. Your idea of just storing a 256x256 matrix with an id in each cell representing the index of the mask would work wonders with my approach.
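
A sketch of how the blend side could then pick the mask under the artist's point from such an index map and turn it into a float mask (the function name and coordinate conventions are illustrative only):

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <algorithm>

// index_map: CV_8U (e.g. 256x256); px, py: point in normalized 0..1 coordinates.
cv::Mat mask_for_point(const cv::Mat &index_map, float px, float py,
                       int out_width, int out_height)
{
  const int mx = std::min(index_map.cols - 1, static_cast<int>(px * index_map.cols));
  const int my = std::min(index_map.rows - 1, static_cast<int>(py * index_map.rows));
  const uchar id = index_map.at<uchar>(my, mx);
  // id 0 means "no segment" here; a real implementation would handle that case.

  cv::Mat selected = (index_map == id);             // CV_8U, 255 where the id matches
  cv::Mat mask_f;
  selected.convertTo(mask_f, CV_32F, 1.0 / 255.0);  // 0..1 float mask
  cv::resize(mask_f, mask_f, cv::Size(out_width, out_height), 0, 0, cv::INTER_LINEAR);
  return mask_f;
}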

@andriiryzhkov
Contributor

From the user perspective, I believe it makes sense to have automatic detection of specific photography-focused objects like subject, background, and sky. Just relying on automatic mask generation by FAST-SAM is not something I would personally use.
But again, the main focus right now is on the darktable and GUI part. I wish you good luck with your approach; I will be watching where it goes and will continue researching my approach in parallel. In case you have any questions about the SAM implementation in C, just ping me.

@MikoMikarro
Author

I got FAST-SAM working on OpenCV DNN. It seems to be working fine. Not the fastest, but it seems that in OpenCV 5 we may get significant performance improvements.

Is the use of C++ a problem?



Rational-pi commented Feb 24, 2025

If you are having trouble building just some modules of OpenCV, I did this some time ago. It takes so long to build all the modules that I found it useful:
https://gitlab.com/RationalPI/paintpp/-/blob/master/easyCV.cmake

Then, where you need OpenCV:
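# BuildOpenCV(<build type> <OpenCV source dir> <build output dir>) is provided by easyCV.cmake linked above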
BuildOpenCV(${CMAKE_BUILD_TYPE} ${CMAKE_CURRENT_SOURCE_DIR}/3rdParty/opencv-4.2.0 ${CMAKE_CURRENT_SOURCE_DIR}/3rdParty/builds/OpenCV)
target_include_directories(PaintPP PUBLIC ${OpenCV_INCLUDE_DIRS})
target_link_libraries(PaintPP ${OpenCV_LIBS})

@MikoMikarro
Author

Hi! Thanks! I'll give it a look, because right now it is compiling against my local installation of OpenCV instead of the one I build myself. I have problems with the local build, as it crashes my graphics driver during inference. I still don't know exactly how we will build OpenCV into darktable, but this looks like a good start. Thanks again!

Deleted all dependencies from ONNX and executed the simplified ONNX model on the opencv-dnn module.