IMPROVE 2025 Abstracts


Area 1 - Fundamentals

Full Papers
Paper Nr: 11
Title:

A Multi-Modal Approach for Face Anti-Spoofing in Non-Calibrated Systems Using Disparity Maps

Authors:

Ariel Larey, Eyal Rond and Omer Achrack

Abstract: Face recognition technologies are increasingly used in various applications, yet they are vulnerable to face spoofing attacks. These spoofing attacks often involve unique 3D structures, such as printed papers or mobile device screens. Although stereo-depth cameras can detect such attacks effectively, their high cost limits their widespread adoption. Conversely, two-sensor systems without extrinsic calibration offer a cost-effective alternative but are unable to calculate depth using stereo techniques. In this work, we propose a method to overcome this challenge by leveraging facial attributes to derive disparity information and estimate relative depth for anti-spoofing purposes, using non-calibrated systems. We introduce a multi-modal anti-spoofing model, coined the Disparity Model, that incorporates the created disparity maps as a third modality alongside the two original sensor modalities. We demonstrate the effectiveness of the Disparity Model in countering various spoof attacks using a comprehensive dataset collected from the Intel® RealSense™ ID Solution F455. Our method outperformed existing methods in the literature, achieving an Equal Error Rate (EER) of 1.71% and a False Negative Rate (FNR) of 2.77% at a False Positive Rate (FPR) of 1%. These errors are 2.45% and 7.94% lower, respectively, than those of the best comparison method. Additionally, we introduce a model ensemble that addresses 3D spoof attacks as well, achieving an EER of 2.04% and an FNR of 3.83% at an FPR of 1%. Overall, our work provides a state-of-the-art solution for the challenging task of anti-spoofing in non-calibrated systems that lack depth information.
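The abstract does not specify how the disparity maps are constructed, so the following is only a rough illustration of the general idea rather than the authors' pipeline: match facial landmarks across the two uncalibrated sensors and rasterize their normalized horizontal offsets into a coarse map usable as a third input modality. All function names and the grid resolution are hypothetical.

```python
import numpy as np

def sparse_disparity_map(lms_a, lms_b, out_shape=(64, 64)):
    """Rasterize per-landmark disparities into a coarse map.

    lms_a, lms_b: (N, 2) arrays of matching facial landmarks (x, y),
    in pixel coordinates of sensor A and sensor B respectively.
    Without extrinsic calibration we cannot triangulate metric depth,
    but the horizontal offset of each matched landmark still orders
    points by relative depth (a flat spoof yields near-uniform offsets).
    """
    lms_a = np.asarray(lms_a, dtype=np.float32)
    lms_b = np.asarray(lms_b, dtype=np.float32)

    # Normalize by the face scale so the map is distance-invariant.
    scale = np.linalg.norm(lms_a.max(0) - lms_a.min(0)) + 1e-6
    disp = (lms_a[:, 0] - lms_b[:, 0]) / scale

    # Scatter disparities onto a small grid (the third modality).
    grid = np.zeros(out_shape, dtype=np.float32)
    xy = lms_a - lms_a.min(0)
    xy = xy / (xy.max(0) + 1e-6) * (np.array(out_shape[::-1]) - 1)
    for (x, y), d in zip(xy.astype(int), disp):
        grid[y, x] = d
    return grid
```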

Paper Nr: 12
Title:

Multi-Scale Edge-Enhanced ResNet (MSResNet) for RGB-T Image Segmentation

Authors:

Bikram Adhikari, Siyu Lei, Zoran Durić and Duminda Wijesekera

Abstract: Intelligent transportation systems rely heavily on robust scene understanding under varying environmental conditions for decision-making and driver assistance. In this paper, we introduce MSResNet, a novel Multi-Scale Edge-Enhanced ResNet for RGB-thermal image segmentation, which combines RGB and thermal images to address the challenges introduced by varying lighting conditions in scene understanding. MSResNet incorporates multi-scale guided filtering to enhance edge definitions and contrast, and it uses attention-based cross-fusion to dynamically integrate features from the RGB and thermal modalities across their spatial dimensions. In addition, a weighted compound loss function refines the predictions at the region, boundary, and pixel levels. Experimental results on the MF and KAIST datasets suggest that MSResNet performs on par with current state-of-the-art models, with fewer parameters and lower inference time, achieving IoU scores of 59.8 and 50.24, respectively. These results demonstrate the suitability of the model for real-time ADAS applications and scene understanding.
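As a minimal sketch of what attention-based cross-fusion of two modalities can look like (not the paper's exact module; the head count and residual summation are assumptions), each modality's spatial tokens attend over the other's:

```python
import torch
import torch.nn as nn

class CrossFusion(nn.Module):
    """Attention-based cross-fusion of RGB and thermal feature maps.

    Each modality attends over the spatial tokens of the other, and the
    two attended streams are summed. Hyperparameters are illustrative.
    """
    def __init__(self, channels, heads=4):
        super().__init__()
        self.rgb_from_thermal = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.thermal_from_rgb = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, rgb, thermal):            # both (B, C, H, W)
        b, c, h, w = rgb.shape
        r = rgb.flatten(2).transpose(1, 2)      # (B, HW, C) token sequences
        t = thermal.flatten(2).transpose(1, 2)
        r2, _ = self.rgb_from_thermal(r, t, t)  # RGB queries thermal
        t2, _ = self.thermal_from_rgb(t, r, r)  # thermal queries RGB
        fused = (r + r2) + (t + t2)             # residual sum of both streams
        return fused.transpose(1, 2).reshape(b, c, h, w)
```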

Paper Nr: 17
Title:

MVIP - A Dataset and Methods for Application Oriented Multi-View and Multi-Modal Industrial Part Recognition

Authors:

Paul Koch, Marian Schlüter and Jörg Krüger

Abstract: We present MVIP, a novel dataset for multi-modal and multi-view application-oriented industrial part recognition. Here we are the first to combine a calibrated RGBD multi-view dataset with additional object context such as physical properties, natural language, and superclasses. The current portfolio of available datasets offers a wide range of representations to design and benchmark related methods. In contrast to existing classification challenges, industrial recognition applications offer controlled multi-modal environments, but at the same time pose different problems than traditional 2D/3D classification challenges. Frequently, industrial applications have to deal with small amounts (or a slow ramp-up) of training data, visually alike parts, and varying object sizes, while requiring a robust, near-100% top-5 accuracy under cost and time constraints. Current methods tackle such challenges individually, but a direct adoption of these methods within industrial applications is complex and requires further research. Our main goal with MVIP is to study and push the transferability of various state-of-the-art methods within related downstream tasks towards an efficient deployment of industrial classifiers. Additionally, with MVIP we intend to push research regarding several modality fusion topics, (automated) synthetic data generation, and complex data sampling methods, combined in a single application-oriented benchmark.

Paper Nr: 29
Title:

FoodLens: Fine-Grained and Multi-Label Classification of Indian Food Images

Authors:

Narayan Hegde, Jatin Alla, Yashas Samaga, Ashwin Vaswani, Praneeth Netrapalli, Shivani Kapania and Pradeep Kumar

Abstract: India has a rich cultural diversity that is reflected in its variety of food. In recent years, computer vision has played a key role in classifying food images for automated tagging, nutrition profiling, and many other tasks. However, existing state-of-the-art AI-based food classification models trained on global food images have subpar performance on Indian food images. This is due to the lack of representation of Indian food in existing food datasets and to image classification challenges specific to Indian food, such as multiple dishes appearing within a single image and fine-grained regional varieties of dishes. To address these challenges, a dataset of 30K food images consisting of popular dishes from restaurant menus across India was curated and annotated with multi-label and fine-grained labels for each dish in the image. All dishes were mapped onto a hierarchical tree that models a categorical breakdown of Indian food. A custom loss function was tuned to learn from the hierarchical and multi-label information contained in the Indian food images. Augmenting existing methods with our loss gives a 13% improvement in average AUPRC and better classification performance on the Indian food dataset compared to state-of-the-art food classification models, with comparable results on other food benchmark datasets. More than 100k photos of Indian restaurants submitted each day on Google Maps, and many more on social media channels, were utilized for the project.
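The paper's custom loss is not specified in the abstract; the sketch below shows one plausible way to combine multi-label supervision with a category tree (the ancestor-matrix encoding and the mixing weight alpha are assumptions):

```python
import torch.nn.functional as F

def hierarchical_bce(logits, leaf_targets, ancestors, alpha=0.5):
    """Multi-label BCE augmented with hierarchy consistency.

    logits:       (B, C) raw scores, one per node of the food tree.
    leaf_targets: (B, C) 0/1 float multi-labels for dishes in the image.
    ancestors:    (C, C) 0/1 float matrix, ancestors[i, j] = 1 iff node j
                  is an ancestor of node i in the category tree.
    A positive dish also marks every ancestor category positive, so the
    model is rewarded for getting the coarse category right even when
    the fine-grained dish is missed.
    """
    # Propagate positives up the tree: a parent is positive if any child is.
    parent_targets = (leaf_targets @ ancestors).clamp(max=1.0)
    tree_targets = (leaf_targets + parent_targets).clamp(max=1.0)
    leaf_loss = F.binary_cross_entropy_with_logits(logits, leaf_targets)
    tree_loss = F.binary_cross_entropy_with_logits(logits, tree_targets)
    return alpha * leaf_loss + (1 - alpha) * tree_loss
```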

Paper Nr: 37
Title:

Cross-Modality Learning in Ophthalmology: Is There a Need for Increasing Variety in Data?

Authors:

Imen Chakroun and Julien Verplanken

Abstract: The primary focus of our work extends beyond merely enhancing state-of-the-art predictive performance in cross-modal classification tasks. We aim to demonstrate, through AI, the critical necessity of maintaining the current industrial investment in multiple modalities that are complex, costly, and cumbersome in day-to-day clinical usage. To this end, we first analyzed the prediction accuracy gap between single- and multi-modality models. We then assessed whether the increased complexity of multi-modal predictors demands larger datasets compared to their single-modal counterparts. Finally, we explored whether leveraging multi-modal inputs can compensate for poor-quality images while still outperforming uni-modal approaches.

Paper Nr: 40
Title:

Adaptive Resilience Framework Using Dynamic Feature Fusion for Robust Fingerprint Biometrics Against Adversarial Perturbations

Authors:

Arslan Manzoor, Alessandro Ortis and Sebastiano Battiato

Abstract: This paper presents the Adaptive Resilience Fingerprint Defense (ARFD), a novel framework to enhance the robustness of fingerprint biometric systems against adversarial attacks such as FGSM and PGD. ARFD integrates Dynamic Feature Fusion (DFF) for real-time feature weight recalibration and Multi-Scale Feature Ensemble (MFE) for multi-resolution analysis. This two-pronged strategy effectively mitigates adversarial perturbations, achieving superior accuracy and reducing false acceptance and rejection rates. Experimental results demonstrate ARFD’s significant advancements in biometric security, providing an adaptive and resilient defense mechanism.
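For context, FGSM (the threat model ARFD is evaluated against, not ARFD itself) is the standard one-step signed-gradient attack; a minimal PyTorch version for a generic classifier:

```python
import torch

def fgsm_perturb(model, x, y, eps=4 / 255):
    """Fast Gradient Sign Method on fingerprint images x with labels y.

    Perturbs x in the direction that maximally increases the
    classification loss, bounded by eps per pixel.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()       # one signed-gradient step
    return x_adv.clamp(0, 1).detach()     # keep pixels in the valid range
```

PGD iterates this step with a projection back into the eps-ball, which is why defenses robust to FGSM are also stress-tested against it.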

Short Papers
Paper Nr: 21
Title:

Rotation Invariance in Floor Plan Digitization Using Zernike Moments

Authors:

Marius Graumann, Marius Stürmer and Tobias Koch

Abstract: Nowadays, many old floor plans exist in printed form or are stored as scanned raster images. Slight rotations or shifts may occur during scanning. Bringing floor plans of this form into a machine-readable form to enable further use still poses a problem. We therefore propose an end-to-end pipeline that pre-processes the image, leverages a novel approach to create a region adjacency graph (RAG) from the pre-processed image, and predicts its nodes. By incorporating normalization steps into the RAG feature extraction, we significantly improve the rotation invariance of the RAG feature calculation. Moreover, applying our method leads to an improved F1 score and IoU on rotated data. Furthermore, we propose a wall-splitting algorithm for partitioning walls into segments associated with the corresponding rooms.
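Zernike moment magnitudes are rotation-invariant shape descriptors, which is what makes them attractive for slightly rotated scans. As a sketch of how such a descriptor could be computed for one RAG node (using the mahotas library as one available implementation; the helper name and radius heuristic are assumptions):

```python
import numpy as np
import mahotas

def rotation_invariant_descriptor(region_mask, degree=8):
    """Zernike-moment descriptor for one floor-plan region (RAG node).

    The magnitudes of Zernike moments are invariant to rotation, so the
    same room shape scanned at a slight angle yields near-identical
    features. `region_mask` is a 2D boolean array marking the region.
    """
    mask = np.asarray(region_mask).astype(np.uint8)
    # Radius of the disc the moments are computed over: enclose the region.
    ys, xs = np.nonzero(mask)
    radius = max(np.ptp(ys), np.ptp(xs)) / 2 + 1
    return mahotas.features.zernike_moments(mask, radius, degree=degree)
```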

Paper Nr: 23
Title:

Gender Bias Mitigation in Advertisement Videos

Authors:

Thao My Tran Dinh, Thuy Nguyen and Andrew Colarik

Abstract: Gender bias in Artificial Intelligence (AI) has been a concern as AI systems are increasingly employed in real-life applications. Despite efforts to mitigate bias, challenges remain in addressing gender bias embedded in machine learning systems, particularly in automated feature extraction processes. This paper examines the presence and impacts of gender bias in AI within the domain of automated feature extraction in computer vision, focusing on online video advertisements, which inherently reflect societal stereotypes. We highlight the limitations of existing mitigation techniques, emphasizing the need for transparency, comparability, and explainability in addressing bias. By systematically analyzing feature extraction methods and their normative harms, we propose a framework for evaluating gender bias by transforming video data into quantifiable features using pre-trained models and analyzing these features through various dimensions grounded in psychology and marketing research. We will employ a multistage approach including video annotation, automated feature extraction, unsupervised learning techniques, and supervised training models. This work provides actionable insights for reducing gender bias and enhancing fairness in AI systems.

Paper Nr: 24
Title:

Superclass-Guided Hierarchical Learning for Action Anticipation

Authors:

Shin Suzuki, Kazuhiko Sumi and Naoshi Kaneko

Abstract: Action anticipation is crucial for intelligent systems such as autonomous vehicles and AR (Augmented Reality) devices. While existing studies have focused on predicting future actions, they often overlook the hierarchical relationships between human intentions and their resulting behaviors. In this work, we propose "Superclass," a novel approach that leverages hierarchical action labels to enhance action anticipation performance. Our method introduces additional annotations combining verbs, nouns, and actions to capture the complex relationships between different levels of human activity. We evaluate our approach by integrating Superclass with two different base models, AVT and InAViT. Experiments on the EPIC-KITCHENS-100 dataset demonstrate the effectiveness and broad applicability of our method. When applied to InAViT, the current top-performing model on the EPIC-KITCHENS-100 evaluation server, Superclass improved the top-5 class mean accuracy for verbs, nouns, and actions by 0.62%, 3.36%, and 1.95%, respectively.
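One common way to wire hierarchical verb/noun/action supervision onto a base anticipator is with parallel classification heads trained jointly; the sketch below is a generic version of that pattern, not the paper's actual Superclass module, and all names and the unweighted loss sum are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class SuperclassHeads(nn.Module):
    """Verb/noun/action heads on top of a base anticipator's clip feature.

    The base model (e.g. AVT or InAViT) produces `feat`; training on all
    three losses lets coarse intent information shape the shared features.
    """
    def __init__(self, feat_dim, n_verbs, n_nouns, n_actions):
        super().__init__()
        self.verb = nn.Linear(feat_dim, n_verbs)
        self.noun = nn.Linear(feat_dim, n_nouns)
        self.action = nn.Linear(feat_dim, n_actions)

    def forward(self, feat):
        return self.verb(feat), self.noun(feat), self.action(feat)

def superclass_loss(heads_out, verb_y, noun_y, action_y):
    v, n, a = heads_out
    return (F.cross_entropy(v, verb_y) + F.cross_entropy(n, noun_y)
            + F.cross_entropy(a, action_y))
```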

Paper Nr: 25
Title:

Cricket Bowling Action Recognition with Transformer-Based Models

Authors:

Bigyan Subedi, Bishwambhar Dahal, Sirjana Bhatta, Sonish Maharjan and Sushmita Poudel

Abstract: Computer vision-based video action recognition has led to significant advancements in sports analytics, streamlining the previously labour-intensive tasks of sensor-based or manual analysis through automated video processing. This paper focuses on applying transformer-based video action recognition models to classify cricket bowling actions. For this, we created a novel dataset named ActionBowl, designed to support multiple specialized classification schemes. We trained and evaluated state-of-the-art transformer-based action recognition models (ActionCLIP, TimeSformer, and UniFormerV2) on these datasets. This paper aims to highlight the effectiveness of these models in recognizing actions that range from subtle variations to significantly distinct hand movements. Through rigorous evaluation, we provide conclusive evidence of these models' ability to learn and distinguish this unique set of actions effectively. We present a comprehensive analysis of the experiments, results, and insights drawn from the study, highlighting the potential for further advancements in cricket analytics through video-based action recognition.
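Of the three models, TimeSformer has a readily available Hugging Face implementation; a minimal fine-tuning sketch on a custom label set looks as follows (the four-class label count and the random clip are placeholders, not details from the paper):

```python
import torch
from transformers import TimesformerForVideoClassification

# Kinetics-pretrained TimeSformer, re-headed for bowling-action labels.
model = TimesformerForVideoClassification.from_pretrained(
    "facebook/timesformer-base-finetuned-k400",
    num_labels=4,                  # e.g. four bowling-action classes
    ignore_mismatched_sizes=True,  # replace the 400-way Kinetics head
)

clip = torch.randn(1, 8, 3, 224, 224)  # (batch, frames, channels, H, W)
labels = torch.tensor([2])
out = model(pixel_values=clip, labels=labels)
out.loss.backward()                     # one supervised fine-tuning step
```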

Area 2 - Methods and Techniques

Full Papers
Paper Nr: 18
Title:

Assessment of Uncertainty and Variability in Simulation Tools Under Foggy Conditions

Authors:

Pierre Duthon, Mohamed Boudali, Amine Ben Daoued, Rémi Regner, Charlotte Segonne and Frédéric Bernardin

Abstract: Intelligent mobility systems are increasingly making use of AI for various functions, including navigation, sign recognition, road tracking, and obstacle detection. To achieve certification up to SAE Level 3, and beyond in the future, manufacturers must prove that their vehicles maintain adequate safety within their operational design domain through rigorous testing in diverse scenarios. Sensor simulation tools that include degraded weather conditions (physical, numerical, or hybrid) must be employed. In this study, carried out as part of the PRISSMA project, a proof of concept is proposed to characterize and evaluate the protocols and four different kinds of simulation tools that enable AI algorithm certification under degraded weather conditions.

Area 3 - Imaging

Short Papers
Paper Nr: 7
Title:

SuperCrossViT: Integrating Superpixel Segmentation in Vision Transformers for Advanced Medical Image Analysis

Authors:

Ahmed Alqnatri and Wanda Benesova

Abstract: Vision Transformers (ViTs) have revolutionized medical image analysis, yet they face challenges in simultaneously capturing global context and local anatomical details crucial for accurate diagnosis. We present SuperCrossViT, an architecture that enhances the standard CrossViT framework by integrating superpixel segmentation for improved analysis of histopathological images. Our approach leverages superpixels to group pixels into meaningful tissue regions, preserving structural information while maintaining computational efficiency. We evaluate our method on the task of metastatic cancer detection in lymph node Whole Slide Images (WSIs), performing binary classification of tumor versus normal tissue patches. Experimental results demonstrate that SuperCrossViT consistently outperforms baseline ViT and standard CrossViT architectures, achieving superior accuracy in distinguishing cancerous from normal tissues. Our findings suggest that the integration of superpixel segmentation with transformer-based architectures offers a promising direction for enhancing the precision of computer-aided diagnosis in histopathology.
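One plausible way to wire superpixels into a ViT pipeline, shown purely as a sketch (SLIC via scikit-image is one common superpixel algorithm; the pooling scheme and segment count are assumptions, not the SuperCrossViT design):

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_pool(image, patch_tokens, n_segments=196):
    """Pool per-pixel ViT features within SLIC superpixels.

    image:        (H, W, 3) float array in [0, 1], e.g. a WSI patch.
    patch_tokens: (H, W, D) patch features upsampled to pixel resolution.
    Groups pixels into tissue-coherent regions and averages the token
    features inside each region, giving one token per superpixel that
    preserves local structure at reduced token count.
    """
    seg = slic(image, n_segments=n_segments, compactness=10)
    tokens = []
    for label in np.unique(seg):
        mask = seg == label
        tokens.append(patch_tokens[mask].mean(axis=0))
    return np.stack(tokens)  # (n_superpixels, D) tokens for the transformer
```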

Area 4 - Machine Learning

Full Papers
Paper Nr: 39
Title:

Smartphone-Based Detection of Cataract and Pterygium Using MobileNet: A Unified Approach for Anterior Segment Photographed Images

Authors:

W Mimi Diyana Zaki, Laily Azyan Ramlan, Nurul Syahira Mohamad Zamani, Marizuana Mat Daud and Haliza Abdul Mutalib

Abstract: This study explores the application of the MobileNetV2 architecture for detecting cataracts and pterygium using anterior segment photographed images (ASPI) captured via smartphone cameras. Cataracts and pterygium are significant global health concerns, and their early detection is crucial for preventing vision impairment. MobileNetV2’s lightweight and efficient design enables accurate and scalable classification of eye diseases, even with the variable image quality of smartphone cameras. This paper provides an overview of the prevalence of cataracts and pterygium, summarizes prior work, and presents experimental results demonstrating MobileNetV2’s high performance in detecting both diseases. For pterygium classification, MobileNetV2 achieved its best performance with the Adam optimizer and a batch size of 10, delivering 97.37% accuracy, 96.05% sensitivity, and the highest AUC of 99.41%. It also demonstrated exceptional computational efficiency, completing training in just 2.13 minutes with Adam and a batch size of 32, the shortest training time across all configurations. The network exhibited consistent performance, with only minor declines as the batch size increased. For cataract patch classification, MobileNetV2 also performed strongly, achieving 95.44% accuracy, 95.78% sensitivity, and an AUC of 99.19% with Adam and a batch size of 10. Additionally, it completed training in the shortest time of 7 minutes, making it highly efficient for resource-constrained environments. The findings support the integration of smartphone imaging and deep learning as a cost-effective solution for ophthalmological diagnostics.
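The transfer-learning setup described is standard; a minimal torchvision sketch follows. The optimizer matches the paper's best configuration, but the learning rate, head size, and preprocessing are assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained MobileNetV2, re-headed for a binary eye-disease
# task (e.g. pterygium vs. normal ASPI crops).
net = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
net.classifier[1] = nn.Linear(net.last_channel, 2)  # replace 1000-way head

optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
# ...standard training loop over ASPI batches (batch size 10 per the paper)...
```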

Short Papers
Paper Nr: 16
Title:

CerberusDet: Unified Multi-Dataset Object Detection

Authors:

Irina Tolstykh, Mikhail Chernyshov and Maksim Kuprashevich

Abstract: Conventional object detection models are usually limited by the data on which they were trained and by the category logic they define. With the recent rise of language-visual models, new methods have emerged that are not restricted to these fixed categories. Despite their flexibility, such open-vocabulary detection models still fall short in accuracy compared to traditional models with fixed classes. At the same time, more accurate data-specific models face challenges when there is a need to extend classes or merge different datasets for training. The latter often cannot be combined due to different logics or conflicting class definitions, making it difficult to improve a model without compromising its performance. In this paper, we introduce CerberusDet, a framework with a multi-headed model designed for handling multiple object detection tasks. The proposed model is built on the YOLO architecture and efficiently shares visual features from both the backbone and neck components, while maintaining separate task heads. This approach allows CerberusDet to perform very efficiently while still delivering optimal results. We evaluated the model on the PASCAL VOC and Objects365 datasets to demonstrate its abilities. CerberusDet achieved state-of-the-art results with 36% less inference time. The more tasks are trained together, the more efficient the proposed model becomes compared to running individual models sequentially. The training and inference code, as well as the model, are available as open source.
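The shared-backbone/separate-heads structure can be sketched as follows; this is a generic skeleton of the pattern the abstract describes, with placeholder modules standing in for the actual YOLO components:

```python
import torch.nn as nn

class MultiTaskDetector(nn.Module):
    """Skeleton of a CerberusDet-style multi-headed detector.

    One backbone+neck pass is shared per image; each dataset/task keeps
    its own detection head, so conflicting label spaces never have to be
    merged into a single category logic.
    """
    def __init__(self, backbone, neck, heads: dict):
        super().__init__()
        self.backbone = backbone
        self.neck = neck
        self.heads = nn.ModuleDict(heads)  # e.g. {"voc": ..., "objects365": ...}

    def forward(self, images, task=None):
        feats = self.neck(self.backbone(images))       # shared computation
        if task is not None:
            return self.heads[task](feats)             # train one task per batch
        return {name: head(feats) for name, head in self.heads.items()}
```

Running all heads on one shared forward pass is what makes joint inference cheaper than running separate per-dataset models sequentially.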

Paper Nr: 20
Title:

Automated and Explainable Multi-Disease Detection from Retinal Fundus Images

Authors:

Shubha Masti, Tarunya Prasad and Gowri Srinivasa

Abstract: This study explores the explainable detection of three diseases—pathological myopia, glaucoma, and diabetic retinopathy—using retinal fundus images. Both deep learning and feature-based methods are examined for each condition. The deep learning approaches employ transfer learning, while UNet-based models are utilised for feature segmentation. Feature maps are created from segmented features and passed through simple CNNs to detect diseases. Data augmentation techniques are applied across methods to enhance performance, and Grad-CAM/Grad-CAM++ are used to interpret and validate the insights gained from the deep learning models.
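Grad-CAM, the interpretability method the abstract relies on, weights a convolutional layer's activations by the gradients of the target class score; a minimal PyTorch version (hook-based, for a generic classifier):

```python
import torch.nn.functional as F

def grad_cam(model, x, target_layer, class_idx):
    """Minimal Grad-CAM heatmap for input x and class class_idx."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    score = model(x)[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # GAP of gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1))       # (1, h, w)
    return cam / (cam.max() + 1e-8)                      # normalized heatmap
```

Grad-CAM++ refines the channel weights with higher-order gradient terms but follows the same activation-weighting scheme.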

Paper Nr: 38
Title:

Active Learning and the Various Flavors of Supervision for Object Detection

Authors:

Nils Bischoff and Sven Tomforde

Abstract: In an effort to minimize the manual annotation cost for the training of object detectors based on deep learning, we reflect on the role of active learning in object detection when combined with other sources of supervision. In doing so, we highlight the need to harmonize the approaches so that they can develop their full potential. Ultimately, the active learning oracle should only provide supervision for samples that cannot be covered by other, cheaper, forms of supervision.
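The closing claim suggests a selection rule in which the oracle only sees samples that cheaper supervision cannot cover; a toy sketch of that idea (classification-style entropy scoring for simplicity, with all names hypothetical):

```python
import numpy as np

def select_for_oracle(probabilities, covered_mask, budget):
    """Pick the samples the human oracle should annotate.

    probabilities: (N, C) softmax scores of the current model.
    covered_mask:  (N,) True where cheaper supervision (weak labels,
                   pseudo-labels, ...) already covers the sample.
    Only uncovered samples compete for the budget, ranked by predictive
    entropy (higher = more uncertain = more informative).
    """
    entropy = -(probabilities * np.log(probabilities + 1e-12)).sum(axis=1)
    entropy[covered_mask] = -np.inf             # cheaper supervision wins
    return np.argsort(entropy)[::-1][:budget]   # indices sent to the oracle
```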

Paper Nr: 13
Title:

Addressing Class Imbalance in Renal Amyloidosis Classification: A Comparative Study of Few-Shot Learning and Conventional Machine Learning Techniques

Authors:

Alexsandro Silva Santos, Luciano Rebouças de Oliveira, Washington Luis Conrado dos Santos and Angelo Amancio Duarte

Abstract: Class imbalance presents a significant challenge in Computational Pathology, particularly in the classification of rare diseases such as renal amyloidosis. This paper investigates the effectiveness of Few-Shot Learning (FSL), specifically through prototypical networks, alongside conventional methods to enhance the automatic classification of renal glomeruli from biopsy images. A novel multi-stain dataset is introduced, comprising 11,674 annotated images across nine glomerular lesion classes, including amyloidosis, stained with four different dyes. This dataset represents a substantial contribution to the field due to the complexities involved in obtaining and annotating such data. The study involved training baseline models using six pre-trained CNN architectures, both with and without Cost-Sensitive Learning (CSL). The three top-performing architectures were subsequently used to construct an ensemble-based model. FSL models were trained using these architectures with episodic training in a 2-way-30-shot configuration. In the FSL experiments, the cosine similarity distance function outperformed Euclidean distance. Applying CSL to the FSL models resulted in a significant performance boost. The top three FSL models were used to create two ensemble-based models: FSL-Ensemble (without CSL) and FSL-CSL-Ensemble (with CSL). The results indicate that while conventional methods alone may not provide robust classification in this context, their combination with FSL, particularly when applied to Periodic Acid-Schiff (PAS) stained images, significantly enhances performance. The FSL-CSL-Ensemble achieved the highest F1-Score of 93.8%, surpassing the performance of related studies that addressed datasets with less severe imbalance ratios. This study underscores the potential of FSL in classifying renal amyloidosis, especially when combined with CSL, and suggests the possibility of eliminating the need for Congo red staining, the current gold standard for diagnosis. The findings also highlight the necessity of developing innovative approaches like FSL to improve outcomes in medical image analysis, where data scarcity is prevalent. Further investigation is needed into the generalizability of this approach to other glomerular lesions and staining techniques.
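A prototypical-network episode with cosine similarity (the variant the paper found superior to Euclidean distance) reduces to a few lines; this sketch assumes embeddings come from one of the pre-trained CNN backbones and omits the episodic sampling loop:

```python
import torch
import torch.nn.functional as F

def prototypical_logits(support, support_y, query, n_way=2):
    """One prototypical-network episode scored by cosine similarity.

    support:   (n_way * k_shot, D) support embeddings, e.g. k_shot = 30
               for the paper's 2-way-30-shot configuration.
    support_y: (n_way * k_shot,) integer class labels in [0, n_way).
    query:     (Q, D) query embeddings to classify.
    Each class prototype is the mean support embedding; queries are
    scored by cosine similarity to each prototype.
    """
    protos = torch.stack(
        [support[support_y == c].mean(0) for c in range(n_way)])
    sims = F.cosine_similarity(query.unsqueeze(1), protos.unsqueeze(0), dim=-1)
    return sims  # (Q, n_way); feed to cross-entropy for episodic training
```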

Paper Nr: 26
Title:

Deep Learning in Satellite and Aerial-Based Image Processing

Authors:

Alessia Sbriglio and Giovanni B. Palmerini

Abstract: The improvement of Earth Observation (EO) satellite resolutions in recent years, coupled with the increasing demand for higher performance, introduced the need for more advanced detection techniques for satellite image analysis. Modern EO platforms generate vast amounts of data, rich in potentially valuable information but often underutilized, making the adoption of efficient image processing solutions crucial. This information can be essential for accelerating processes such as classification and geo-referencing, ensuring the timely availability of mission products. In this context, the introduction of deep learning techniques, particularly the You Only Look Once (YOLO) algorithm, represents a natural evolution. YOLO is known for its speed, as it analyzes the entire image in a single pass, and for its precision, due to the use of deep convolutional neural networks (CNNs). When applied to satellite images, YOLO has shown promising results, especially for automatic geo-referencing and rapid classification. A first attempt at a comparative analysis between models trained with 60 and 100 epochs, applied to optical Sentinel-2 images targeting Italian lakes under various weather conditions, revealed significant improvements in detection precision and consistency. In particular, the accuracy of boundaries improved as training epochs increased. As the number of epochs grew, the results became more stable, regardless of environmental or lighting conditions, reducing errors and improving overall performance. These advancements suggest that with further development of algorithms and integration of artificial intelligence, the use of satellites and drones in geospatial applications will become increasingly precise and efficient. The use of drone images could further expand datasets, allowing the model to respond with an adaptive approach to specific details or defined elements, such as artificial structures or small areas of interest that satellites may not be able to detect with the same precision, especially due to unfavorable weather conditions. This integrated approach combining satellite and aerial data could further enhance the model’s ability to detect smaller objects or handle more complex environments, increasing the versatility and reliability of automatic detection solutions in real-world contexts.
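An epoch-count comparison like the one described can be run with a few lines of the Ultralytics API; the dataset YAML name and model size below are placeholders, and the paper does not state which YOLO variant was used:

```python
from ultralytics import YOLO

# Compare detector quality at two training lengths (60 vs 100 epochs).
for epochs in (60, 100):
    model = YOLO("yolov8n.pt")  # illustrative model choice
    model.train(data="sentinel2_lakes.yaml", epochs=epochs, imgsz=640)
    metrics = model.val()       # per-run precision/recall/mAP
    print(epochs, metrics.box.map50)
```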

Area 5 - Multimedia Communications

Full Papers
Paper Nr: 22
Title:

Use of Orthogonal Encryption Functions in Commutative Watermarking-Encryption

Authors:

Roland Schmitz and Christos Grecos

Abstract: While commutativity of watermarking and encryption is a desirable feature in many application scenarios, it is hard to find robust watermarking schemes and secure ciphers that are able to commute with each other, because there are no visual features to use for embedding the mark in the encrypted domain. In the present paper we investigate whether orthogonal maps, which form a large subclass of norm-preserving maps, are suitable for image encryption within the framework of a commutative watermarking-encryption (CWE) scheme. Specifically, we show that these maps, if used properly, have a much larger key space and leave a smaller statistical residue in the ciphertext than other norm-preserving maps like sign-bit encryption and permutation ciphers currently being used in CWE.
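The norm-preserving property that enables commutativity is easy to see in a toy version: multiplying each pixel block by a secret orthogonal matrix leaves the block's Euclidean norm unchanged, so a norm-based watermark survives encryption. The sketch below (grayscale image, dimensions divisible by the block size, key handling simplified) is illustrative only, not the paper's scheme:

```python
import numpy as np
from scipy.stats import ortho_group

def orthogonal_encrypt(image, key_seed, block=8):
    """Encrypt an image with a random orthogonal map (norm-preserving).

    Each block, flattened to a vector, is multiplied by a secret
    orthogonal matrix Q (Q.T @ Q = I), so the block's Euclidean norm,
    the feature a norm-based watermark can rely on, is unchanged.
    Assumes a grayscale image whose sides are multiples of `block`.
    """
    q = ortho_group.rvs(block * block, random_state=key_seed)
    h, w = image.shape
    out = np.empty((h, w), dtype=np.float64)
    for i in range(0, h, block):
        for j in range(0, w, block):
            v = image[i:i + block, j:j + block].reshape(-1)
            out[i:i + block, j:j + block] = (q @ v).reshape(block, block)
    return out  # decrypt by applying q.T, since q.T @ q = identity
```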

Area 6 - Applications

Short Papers
Paper Nr: 34
Title:

Automated Detection of Student Emotions for Engagement Verification in Virtual Learning Environments

Authors:

Quoc Minh Quan Nguyen and Sonit Singh

Abstract: There has been a rise in online learning because of its flexibility and the need for lifelong learning. Understanding and improving students' engagement during online learning is pivotal, as it can provide educators with feedback to improve the delivery of content. However, recognising students' emotions using visual data raises ethical issues of individual privacy. In this paper, we build on existing research in the field of emotion detection in virtual learning environments by making use of facial keypoint images, also known as face meshes, which helps to overcome the challenge of working directly on the visual data. We make use of a publicly available emotion dataset, namely RAF-DB, and demonstrate improved classification accuracy using sophisticated facial keypoints. We finally predict student engagement using an "engagement to index" algorithm. This work not only advances the field of educational technology by improving emotion classification accuracy, but also addresses crucial ethical issues, including student permission and data privacy.
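Face meshes of the kind the abstract describes can be extracted with MediaPipe's Face Mesh solution; a minimal sketch of the privacy-preserving step (the function name is hypothetical, and the paper's exact extraction settings are not stated):

```python
import cv2
import mediapipe as mp

# Only landmark coordinates leave this function,
# never the raw pixels of the student's face.
mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=True)

def frame_to_keypoints(bgr_frame):
    """Return 468 normalized (x, y, z) face-mesh keypoints, or None."""
    result = mesh.process(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
    if not result.multi_face_landmarks:
        return None
    landmarks = result.multi_face_landmarks[0].landmark
    return [(p.x, p.y, p.z) for p in landmarks]
```

Downstream emotion classification then operates on these coordinate lists (or images rendered from them) rather than on identifiable video frames.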