Abstract
This paper presents SAM-Net, a novel approach that improves feature matching in computer vision applications by combining four techniques: LoFTR, knowledge distillation, self-attention, and spatial transformer networks. Our approach uses knowledge distillation to transfer the feature extraction and matching capabilities of PixLoc, while incorporating self-attention and spatial transformer networks to further improve the model's ability to extract and match 2D features. Experiments on both indoor and outdoor datasets demonstrate that SAM-Net outperforms state-of-the-art methods. Furthermore, SAM-Net achieves the top ranking among published methods on two public visual localization benchmarks.