Abstract: |
Human pose estimation, the task of localizing skeletal joint positions from visual data, has witnessed significant progress with the advent of machine learning techniques. In this paper, we explore the landscape of deep learning-based methods for human pose estimation and investigate the impact of integrating temporal information into the computational framework. Our comparison covers the evolution from methods based on Convolutional Neural Networks (CNNs) to recurrent architectures and vision transformers. While spatial information alone provides valuable insights, we examine the benefits of incorporating temporal information, which enhances robustness and adaptability to dynamic human movements. The surveyed methods are adapted to fit the requirements of the human pose estimation task and are evaluated on a real, large-scale dataset, focusing on a single-person scenario with 3D point cloud inputs. We present results and insights that showcase the trade-offs between accuracy, memory requirements, and training time for the various approaches. Furthermore, our findings demonstrate that models relying on attention mechanisms can achieve competitive results in human pose estimation with a limited number of trainable parameters. This survey aims to provide a comprehensive overview of machine learning-based human pose estimation techniques, emphasizing the evolution towards temporally aware models and identifying challenges and opportunities in this rapidly evolving field. |