Speaker
Description
Class-agnostic object discovery in images is a challenging problem in several computer vision pipelines. To address it, we propose AttentionMask, an efficient object proposal generation system based on a Convolutional Neural Network (CNN). AttentionMask produces pixel-precise object proposals to discover and segment arbitrary objects. Using a CNN backbone, AttentionMask computes a pyramid of feature representations of an image and extracts fixed-size windows across the pyramid levels to discover objects of various sizes. For efficient processing, AttentionMask draws on the concept of visual attention: a lightweight attention module learns to focus the window extraction on relevant parts of the feature pyramid. The learned attention increases efficiency and allows us to equip AttentionMask with a dedicated module that improves the challenging discovery of small objects despite the limited resources of modern GPUs. For each window extracted from the feature pyramid at a promising location, AttentionMask efficiently generates a segmentation mask and an objectness score, yielding high-quality object proposals in the form of segmentation masks. Overall, AttentionMask is an end-to-end trainable system for discovering objects in images; it outperforms previous approaches both on small objects and across all object sizes while also improving efficiency.
To showcase its versatility and robustness, we apply AttentionMask to three challenging real-world applications: (1) airline logo detection under adverse weather conditions, (2) medical instrument segmentation in images acquired during minimally invasive surgeries, and (3) apple localization in complex orchard environments. While the first two applications demand strong robustness to severe image degradations, apple localization is challenging because of the complex scene composition with a substantial amount of clutter. AttentionMask performs strongly in all three applications and reliably discovers objects despite complex, challenging image data.
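
The attention-guided window extraction described in the abstract can be illustrated with a small NumPy sketch. This is not the AttentionMask implementation: the function name `extract_attended_windows`, the `window` and `threshold` parameters, the zero-padding at the borders, and the random toy pyramid are all assumptions made for illustration. In the real system the attention maps are predicted by a learned module, and each extracted window is passed on to the heads that produce a segmentation mask and an objectness score.

```python
import numpy as np


def extract_attended_windows(pyramid, attention, window=4, threshold=0.5):
    """Cut fixed-size windows from a feature pyramid at attended locations.

    pyramid   : list of arrays of shape (C, H_l, W_l), one per pyramid level
    attention : list of arrays of shape (H_l, W_l) with values in [0, 1]
    Returns a list of (level, row, col, patch); patch has shape (C, window, window).
    All names and defaults here are hypothetical, chosen for illustration only.
    """
    half = window // 2
    windows = []
    for level, (feat, att) in enumerate(zip(pyramid, attention)):
        # Zero-pad the spatial axes so windows near the border keep a fixed size.
        padded = np.pad(feat, ((0, 0), (half, half), (half, half)))
        # Only locations the attention map marks as promising are processed;
        # every other location of this level is skipped entirely.
        rows, cols = np.nonzero(att > threshold)
        for r, c in zip(rows, cols):
            patch = padded[:, r:r + window, c:c + window]
            windows.append((level, int(r), int(c), patch))
    return windows


# Toy example: a 3-level pyramid with 8 channels and hand-set attention peaks.
rng = np.random.default_rng(0)
pyramid = [rng.standard_normal((8, 32 // 2**i, 32 // 2**i)) for i in range(3)]
attention = [np.zeros(f.shape[1:]) for f in pyramid]
attention[0][4, 7] = 0.9  # a small object, visible at the finest level
attention[2][1, 1] = 0.8  # a large object, visible at the coarsest level

windows = extract_attended_windows(pyramid, attention, window=4)
for level, r, c, patch in windows:
    print(level, r, c, patch.shape)
```

Because the window size is the same at every level, a window at a coarse level covers a larger image region than one at a fine level, which is how a single fixed-size window can capture objects of different sizes. Skipping unattended locations is what reduces the cost in this sketch: only two windows are extracted here instead of one per pyramid location.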