UNC Chapel Hill’s Textless Vision-Language Transformer: Comparable Performance to Text-Based Approaches but 28x Faster | Synced



Transformer architectures have achieved impressive performance in vision-language (VL) representation learning when trained on text-annotated images or videos. It remains challenging, however, for transformers to learn VL representations without relying on text, i.e. using only low-level visual and acoustic inputs.

In the new paper TVLT: Textless Vision-Language Transformer, researchers from UNC Chapel Hill present the Textless Vision-Language Transformer (TVLT) for vision-and-language representation learning. TVLT uses only raw visual and audio inputs and performs comparably to its text-based counterparts but requires only 1/3 the parameters and achieves 28x faster inference speeds.

TVLT’s main architecture is a transformer comprising a 12-layer encoder and an 8-layer decoder. It takes as input a list of embeddings obtained directly from perception-level video and audio, and it includes no text-specific modules for automatic speech recognition (ASR) or tokenization.

The input embeddings are a combination of 1) a modality embedding, 2) temporal/spatial embeddings for video, 3) temporal/frequency embeddings for audio, and 4) vision/audio patch embeddings.
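For illustration, the following PyTorch sketch shows how such a combined embedding could be assembled; the hidden size, patch sizes, and sequence lengths are assumptions made for the example rather than the paper’s exact configuration.

```python
# Minimal sketch of TVLT-style input embeddings. Patch sizes, sequence lengths,
# and the hidden dimension below are illustrative assumptions.
import torch
import torch.nn as nn

class TVLTInputEmbeddings(nn.Module):
    def __init__(self, d_model=768, num_frames=8, video_patches=196,
                 audio_time=64, audio_freq=8):
        super().__init__()
        # 4) vision/audio patch embeddings: linear projections of flattened patches
        self.video_proj = nn.Linear(16 * 16 * 3, d_model)   # 16x16 RGB patches
        self.audio_proj = nn.Linear(16 * 16, d_model)        # 16x16 spectrogram patches
        # 1) modality embeddings
        self.video_type = nn.Parameter(torch.zeros(1, 1, d_model))
        self.audio_type = nn.Parameter(torch.zeros(1, 1, d_model))
        # 2) temporal/spatial embeddings for video
        self.video_temporal = nn.Parameter(torch.zeros(1, num_frames, 1, d_model))
        self.video_spatial = nn.Parameter(torch.zeros(1, 1, video_patches, d_model))
        # 3) temporal/frequency embeddings for audio
        self.audio_temporal = nn.Parameter(torch.zeros(1, audio_time, 1, d_model))
        self.audio_frequency = nn.Parameter(torch.zeros(1, 1, audio_freq, d_model))

    def forward(self, video_patches, audio_patches):
        # video_patches: (B, frames, patches, 16*16*3)
        # audio_patches: (B, time, freq, 16*16)
        v = self.video_proj(video_patches) + self.video_temporal + self.video_spatial
        a = self.audio_proj(audio_patches) + self.audio_temporal + self.audio_frequency
        v = v.flatten(1, 2) + self.video_type   # (B, frames*patches, d_model)
        a = a.flatten(1, 2) + self.audio_type   # (B, time*freq, d_model)
        # Audio and video tokens form one joint sequence for the 12-layer encoder.
        return torch.cat([a, v], dim=1)
```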

TVLT is pretrained with two objectives: vision-audio matching (VAM) and masked autoencoding (MAE). VAM is employed to learn global cross-modal representations: a linear layer with sigmoid activation is applied to the encoder output to produce a matching probability, and a binary cross-entropy loss is computed between that probability and the ground-truth match label.
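A minimal sketch of such a matching head is shown below; pooling the first encoder token and the label convention are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a vision-audio matching (VAM) head: a linear layer with
# sigmoid activation on the encoder output, trained with binary cross-entropy.
# Pooling the first token is an illustrative assumption.
import torch
import torch.nn as nn

class VAMHead(nn.Module):
    def __init__(self, d_model=768):
        super().__init__()
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, encoder_output, match_labels):
        # encoder_output: (B, seq_len, d_model)
        # match_labels:   (B,) floats, 1.0 for a matched video-audio pair, 0.0 otherwise
        pooled = encoder_output[:, 0]                               # assumed pooled token
        prob = torch.sigmoid(self.classifier(pooled)).squeeze(-1)   # matching probability
        loss = nn.functional.binary_cross_entropy(prob, match_labels)
        return prob, loss
```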

MAE is used to improve unimodal representations by masking random patches of the visual frames and the audio spectrogram and reconstructing the missing inputs. Rather than decoding the joint sequence, the approach slices the encoder output into its audio and video parts and feeds each to the decoder independently, which reduces compute cost and improves finetuning performance.
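The sketch below illustrates this slice-and-decode-separately idea. The masking ratios, the decoder interface (taking only visible-token encodings and returning a full reconstruction), and the loss over all patches are simplifying assumptions rather than the paper’s exact recipe.

```python
# Minimal sketch of masked autoencoding with separate per-modality decoding.
# Masking ratios and the decoder interface are illustrative assumptions.
import torch
import torch.nn as nn

def random_keep(tokens, mask_ratio):
    """Drop a random subset of tokens; return the kept tokens and their indices."""
    batch, length, dim = tokens.shape
    num_keep = int(length * (1.0 - mask_ratio))
    keep_idx = torch.rand(batch, length, device=tokens.device).argsort(dim=1)[:, :num_keep]
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
    return kept, keep_idx

def mae_loss(encoder, decoder, audio_tokens, video_tokens,
             audio_targets, video_targets,
             audio_mask_ratio=0.15, video_mask_ratio=0.75):
    # Mask random audio-spectrogram and video patches, then encode the joint sequence.
    a_kept, _ = random_keep(audio_tokens, audio_mask_ratio)
    v_kept, _ = random_keep(video_tokens, video_mask_ratio)
    encoded = encoder(torch.cat([a_kept, v_kept], dim=1))

    # Slice the encoder output into its audio and video parts ...
    a_enc = encoded[:, :a_kept.size(1)]
    v_enc = encoded[:, a_kept.size(1):]

    # ... and run the decoder on each modality independently rather than on the
    # joint sequence. For brevity the reconstruction loss here covers all patches;
    # MAE-style training typically restricts it to the masked positions.
    a_recon = decoder(a_enc)
    v_recon = decoder(v_enc)
    return nn.functional.mse_loss(a_recon, audio_targets) + \
           nn.functional.mse_loss(v_recon, video_targets)
```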

In their empirical study, the team compared TVLT with text-based counterparts on audio-to-video retrieval, video-based multimodal sentiment analysis, and visual question-answering benchmarks.

In the experiments, TVLT achieved performance competitive with its state-of-the-art text-based counterparts on visual question answering, image retrieval, video retrieval, and multimodal sentiment analysis. Moreover, it required only 1/3 of the parameters, and its inference speed was 28x faster than that of the text-based methods.

Overall, this paper showcases the powerful performance of TVLT and advances the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals without the need for traditional but computationally expensive text modelling.

The code and checkpoints are available on the project’s GitHub. The paper TVLT: Textless Vision-Language Transformer is on arXiv.

Author: Hecate He | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
