Check out my full publication on Fourier-Attentive Representation Learning

Vision-Language Models (VLMs), such as CLIP, have revolutionized the field of computer vision by learning from vast amounts of image-text pairs on the Internet. A key strength of these models is their strong ability for zero-shot and few-shot transfer to downstream tasks through "prompting", where natural language descriptions are used to describe classes. This paradigm shifts away from traditional supervised learning, which relies on fixed, discrete labels, toward a more flexible, open-vocabulary understanding of visual concepts.

To adapt these foundation models to downstream tasks efficiently, recent research has focused on fine-tuning strategies that avoid updating the entire network. A well-known line of work is prompt learning, pioneered in the vision domain by CoOp. CoOp replaces hand-crafted text templates with a set of learnable, continuous vectors, or "prompts", which are optimized on a few downstream examples. This core concept has been significantly advanced by subsequent works that introduce instance-conditioning, multi-modal deep prompting, or adapter-style modules with shared representation spaces.

However, despite their success, a fundamental limitation persists across these methods: the learned prompts or representation tokens are "black-box" vectors. They entangle high-level semantic features, such as object shape and structure, with low-level, domain-specific statistics, such as texture, color, and lighting. This entanglement makes the model prone to overfitting on the superficial characteristics of the few training samples from base classes, consequently impairing its generalization ability to novel, unseen classes.

To address this feature entanglement and guide the model toward more generalizable representations, we turn to a fundamental principle in signal processing: the Fourier transform. It is a long-established property in vision science that the Fourier phase spectrum of an image preserves high-level semantics, such as object shape and structure, which are largely domain-invariant. In contrast, the amplitude spectrum primarily captures lower-level statistics such as color, texture, and lighting, which are often domain-specific and vary between different environments. This natural decomposition offers a principled way to separate domain-agnostic structural cues from domain-specific stylistic cues.
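The phase/amplitude split described above can be sketched with NumPy's FFT routines. This is a minimal illustration, not the paper's implementation: a single-channel image is transformed, and we reconstruct a phase-only image (unit amplitude, original phase) that preserves structure, and an amplitude-only image (original amplitude, zero phase) that preserves style statistics. Color images would simply be processed per channel.

```python
import numpy as np

def decompose_fourier(image):
    """Split a 2-D image into phase-only and amplitude-only reconstructions."""
    spectrum = np.fft.fft2(image)
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)

    # Phase-only: unit amplitude + original phase -> retains shape/structure.
    phase_only = np.real(np.fft.ifft2(np.exp(1j * phase)))
    # Amplitude-only: original amplitude + zero phase -> retains low-level statistics.
    amplitude_only = np.real(np.fft.ifft2(amplitude))
    return phase_only, amplitude_only

# Demo on a random "image" standing in for a real input.
img = np.random.default_rng(0).random((32, 32))
phase_img, amp_img = decompose_fourier(img)
```

Because the input is real-valued, both reconstructions are real up to numerical error, so taking `np.real` only discards floating-point residue.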

In this paper, we propose Fourier-Attentive Representation Learning (FARL), a novel framework that leverages this Fourier-based insight to guide the learning of disentangled representations for VLM fine-tuning. Our method begins by decomposing each input image into its phase- and amplitude-only components. We then introduce a dual cross-attention mechanism where a set of learnable, modality-agnostic representation tokens are used to separately query the features extracted from the phase and amplitude streams. This process yields two specialized sets of tokens: structure-aware tokens informed by the phase information, and style-aware tokens informed by the amplitude information. These are then fused to create enriched, disentangled representation tokens. A key aspect of our design is an asymmetric injection strategy: the fused, feature-rich tokens are injected into the text encoder to form more descriptive and explicit internal prompts. In contrast, the image encoder is conditioned on the original, more general representation tokens. This forces the model to learn a more sophisticated alignment between a specific descriptive text representation and a general visual representation.


