
[arXiv 2020] Video Representation Learning with Visual Tempo Consistency

decisionforce/VTHCL


Video representation learning with visual tempo consistency.


  • Full codebase is coming soon

Pretrained Models

For now, we provide the pretrained 3D ResNet-18/50 models (link). This repo supports the action recognition task.

Video Representation Learning with Visual Tempo Consistency

Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition [12, 58]. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning (VTHCL). Specifically, by sampling the same instance at slow and fast frame rates respectively, we obtain slow and fast video clips which share the same semantics but differ in visual tempo. Video representations learned with VTHCL achieve competitive performance under the self-supervision evaluation protocol for action recognition on UCF-101 (82.1%) and HMDB-51 (49.2%). Moreover, comprehensive experiments suggest that the learned representations generalize well to other downstream tasks, including action detection on AVA and action anticipation on Epic-Kitchen. Finally, we use the Instance Correspondence Map (ICM) to visualize the shared semantics captured by contrastive learning. Code and models are available at this link.

1 Introduction

In recent years, representation learning has achieved great success, especially self-supervised learning from images. Visual features obtained in a self-supervised manner are getting very close to those from supervised training on ImageNet [10]. Meanwhile, representing videos in a compact and informative way is also crucial for many analysis tasks, since videos are redundant and noisy in their raw form. However, supervised video representation learning demands a huge number of annotations, which in turn encourages researchers to investigate self-supervised learning schemes that can harvest the massive amount of unlabelled videos.


Videos contain rich motion dynamics along the temporal dimension. Thus, if we can make the best of the underlying consistency as well as the causal dependencies in the activities occurring in videos, we can better leverage the large amount of unlabelled data for representation learning. For instance, previous attempts learn video representations by predicting the correct order of shuffled frames [37], the arrow of time [54], and the frames and motion dynamics in the future [49, 17]. Considering the recent success of exploiting visual tempo in action recognition tasks [12, 58], in this work we aim to explore visual tempo for self-supervised video representation learning.

Visual tempo, which describes how fast an action goes, is an essential variation factor of video semantics. In particular, an action instance can be performed and observed at different tempos due to multiple elements, including the mood and age of the performer and the configuration of the observer; the resulting videos thus vary from case to case. Nonetheless, the same instance observed at different tempos is expected to share high similarity in terms of its discriminative semantics, which is exactly the underlying consistency we exploit for self-supervised representation learning.

While visual tempo could be utilized by directly predicting the tempo of a given action instance as in previous attempts [5, 37, 54, 49], we argue that such a predictive approach may force the learned representations to capture the information that distinguishes the frequencies of visual tempos, which is not necessarily related to the discriminative semantics we are looking for. Therefore, we propose an alternative approach based on contrastive learning [16, 19, 47, 8], which maximizes the mutual information between representations of videos from the same action instance but with different visual tempos. Specifically, we formulate self-supervised video representation learning as a consistency measurement between a pair of videos that contain frames from the same action instance but sampled at a slow and a fast visual tempo respectively. As shown in Fig. 1, learning is conducted by adopting a slow and a fast video encoder, and taking in turn a video from each pair as the query to distinguish its counterpart from a set of negative samples. In this way, the resulting video representations are expected to capture the shared information and better retain its discriminative semantics. Additionally, since the key to such learning is the shared information, we develop the Instance Correspondence Map (ICM) to visualize the shared information captured by contrastive learning.
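To make this idea concrete, the following PyTorch-style sketch pairs a slow clip (query) with its fast counterpart (positive) against a set of negative representations via an InfoNCE-style loss. It is an illustration of the idea rather than the released implementation; the encoder modules `g_s` and `g_f`, the tensor shapes, and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def slow_fast_nce(g_s, g_f, slow_clip, fast_clip, negatives, T=0.07):
    """Contrast a slow clip against its fast counterpart and a bank of negatives.

    slow_clip: (B, C, T_s, H, W), fast_clip: (B, C, T_f, H, W)
    negatives: (N, d) pre-computed representations of other instances.
    Shapes and module names are illustrative, not the paper's exact code.
    """
    q = F.normalize(g_s(slow_clip), dim=1)          # (B, d) query from the slow pathway
    k = F.normalize(g_f(fast_clip), dim=1)          # (B, d) positive key from the fast pathway
    pos = torch.einsum("bd,bd->b", q, k)[:, None]   # (B, 1) similarity to the positive
    neg = q @ negatives.t()                         # (B, N) similarities to negatives
    logits = torch.cat([pos, neg], dim=1) / T       # positive sits at index 0
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)          # InfoNCE-style objective
```

In practice the roles are also swapped, so that the fast clip in turn serves as the query against slow counterparts.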

Since the feature hierarchy inside a video network (e.g. I3D [6]) has been shown to already reflect semantics at various visual tempos [58], we further propose a hierarchical contrastive learning scheme, where we use network features across multiple depths as queries. Such a scheme not only leverages the variation of visual tempo more effectively but also provides a stronger supervision for deeper networks. Evaluated thoroughly on a wide variety of downstream action understanding tasks, including action recognition on UCF-101 [43] and HMDB-51 [31], action detection on AVA [15], and action anticipation on Epic-Kitchen [9], the representations learned by exploiting visual tempo consistency prove highly discriminative and generalizable, leading to competitive performance.

We summarize our contributions as follows: a) We demonstrate that visual tempo can serve as a strong supervision signal for unsupervised video representation learning, and exploit it via the proposed hierarchical contrastive learning scheme. b) We show that our framework achieves competitive performance for action recognition on UCF-101 and HMDB-51, and generalizes well to other downstream tasks such as action detection and action anticipation. c) We propose the Instance Correspondence Map to qualitatively interpret the learned representations, which highlights the informative objects in videos.

2 Related Work

Self-supervised Video Representation Learning. Various pretext tasks have been explored for self-supervised video representation learning, such as modeling the cycle-consistency between two videos of the same category [11], modeling the cycle-consistency of time [53], predicting the temporal order of frames [13, 32, 37, 54], predicting future motion dynamics and frames [49, 17, 40], as well as predicting the color of frames [48]. In this work, we explore a different pretext task, which models the consistency between videos from the same action instance but with different visual tempos. There are also works that learn video representations using not only the videos themselves but also the corresponding text [44, 45, 35] and audio [30, 2, 1, 41]. Besides, concurrent work [18] proposes a co-training scheme to learn representations from RGB and optical flow. In contrast to those, we learn compact representations from RGB frames only.

Contrastive Learning. Due to their promising performances, contrastive learning and its variants [ 3 , 22 , 23 , 40 , 47 , 56 , 19 , 7 ] are considered as an important direction for self-supervised representation learning. Particularly, the most related work is the contrastive multiview coding [ 47 ] , which learns video representations by maximizing the mutual information between RGB and flow data of the same frames. The difference is that in this work we learn video representations via the consistency between videos of the same action instance but with different visual tempos. Moreover, we further introduce a hierarchical scheme to leverage such consistency at different depths of the encoding network, providing a stronger supervision for training deeper networks.

Representation Interpretation. Interpreting what deep neural networks have learned gives insight into the generalization ability and transferability of deep features [38, 59]. In particular, some works [61, 4] developed techniques to study the hidden units. Besides, mapping a given representation back to image space [42, 60, 62, 39, 34] also explains what CNNs actually learn in order to distinguish different categories. However, these techniques cannot be directly applied to representations learned via contrastive learning, since no semantic categories are involved during training. In this work, without resorting to categorical annotations, we develop the Instance Correspondence Map to qualitatively interpret correspondences at the instance level, as a way to reveal the shared information learned by our method.

3 Learning from Visual Tempo Consistency

The goal of self-supervised video representation learning is to learn a video encoder $g$ that produces compact and informative video representations, by regarding the structural knowledge and consistency among a set of unlabelled videos $\{v_1,\dots,v_n\}$ as the self-supervision signal. The discriminative power of $g$ is typically verified on a set of downstream tasks (e.g. action classification, action detection and action anticipation). While various supervision signals have been proposed in previous attempts, in this work we introduce visual tempo consistency, a novel and effective self-supervision signal. We first discuss what visual tempo consistency is and why it is a strong supervision signal, and then introduce its learning process.


3.1 Visual Tempo Consistency as a Self-supervision Signal

Following [58], we refer to visual tempo as how fast an action goes in an action video. As an internal variation factor of these videos, the visual tempo of actions varies greatly across different classes. Previous literature [58, 12] has well explored the benefits of considering this variance in supervised recognition tasks. A question then arises: can such variance also benefit self-supervised learning? With a proper formulation, we show that the variance of visual tempo can serve as an effective and promising self-supervision signal.

The intuition behind such an approach to video representation learning is two-fold. First, learning via the consistency between multiple representations has been shown to be more effective than learning by prediction [47, 19, 8, 7]. Moreover, while previous attempts resort to matching representations of different patches [56] or different views (e.g. RGB and optical flow) [47] of the same instance, the inputs behind these representations intrinsically carry different semantics. On the contrary, the semantics of an instance's fast and slow videos are almost identical, with visual tempo being the only difference. Encouraging representation consistency between videos of the same instance but with different visual tempos thus provides a stronger supervision signal.

3.2 Adopting Visual Tempo Consistency via Contrastive Learning

(1)
(2)
(3)
(4)
(5)

where $h$ is a function measuring the similarity between two representations. $h$ can be calculated by

(6)

Here $T$ is the temperature hyperparameter [56], and $\phi$ is a learnable mapping. As suggested by [7, 8], applying a non-linear mapping function can substantially improve the learned representations.
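For reference, a plausible implementation of $h$ consistent with this description (a learnable non-linear mapping $\phi$ followed by temperature-scaled cosine similarity, as in [56, 47, 7]) is sketched below; the exact layer sizes and architecture of $\phi$ are assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityHead(nn.Module):
    """h(x, y) = exp(phi(x) . phi(y) / T), with a small non-linear projection phi.

    One common choice; the paper's exact phi is not specified here.
    """
    def __init__(self, in_dim=2048, out_dim=128, T=0.07):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(inplace=True),
                                 nn.Linear(in_dim, out_dim))
        self.T = T

    def forward(self, x, y):
        u = F.normalize(self.phi(x), dim=1)   # (B, out_dim) projected, unit-norm
        v = F.normalize(self.phi(y), dim=1)
        return torch.exp((u * v).sum(dim=1) / self.T)   # (B,) similarity scores
```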

Memory bank. It is non-trivial to scale up if we extract the features of all videos at each iteration. Consequently, we reduce the computation overhead by maintaining two memory banks $\mathcal{B}_f$ and $\mathcal{B}_s$ of size $n \times d$ as in [56, 47], where $d$ is the dimension of the representations. $\mathcal{B}_f$ and $\mathcal{B}_s$ store the approximated representations of the fast and slow videos respectively. Representations stored in $\mathcal{B}_f$ and $\mathcal{B}_s$ are accumulated over iterations as

(7)
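As a reference for this accumulation rule, the snippet below sketches the memory-bank update commonly used in [56]: each stored entry is mixed with the freshly computed representation using a momentum coefficient and re-normalized. The momentum value and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_bank(bank, indices, new_feats, momentum=0.5):
    """Momentum update of a memory bank, following the recipe of [56].

    bank: (n, d) stored approximate representations
    indices: (B,) ids of the videos in the current batch
    new_feats: (B, d) freshly computed representations
    """
    old = bank[indices]
    mixed = momentum * old + (1.0 - momentum) * new_feats
    bank[indices] = F.normalize(mixed, dim=1)   # keep bank entries unit-norm
    return bank
```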

3.3 Learning from Visual Tempo via Hierarchical Contrastive Learning

$\mathcal{L}_{\mathrm{total}}=\sum_{k\in\mathcal{K}}\lambda^{k}(\mathcal{L}^{k}_{f}+\mathcal{L}^{k}_{s})$, where

(8)
(9)
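The sketch below illustrates how the per-level losses can be accumulated into $\mathcal{L}_{\mathrm{total}}$; the dictionary-based interface, level names, and the `nce` callable (any InfoNCE-style criterion over a batch of query/key features) are illustrative assumptions.

```python
def hierarchical_loss(feats_s, feats_f, level_heads, level_weights, nce):
    """Sum slow->fast and fast->slow contrastive losses over several depths.

    feats_s / feats_f: dicts mapping a level name (e.g. 'res3') to pooled features
    of the slow / fast encoder; level_heads maps levels to projection heads;
    level_weights maps levels to the lambda^k coefficients.
    """
    total = 0.0
    for level, weight in level_weights.items():
        head = level_heads[level]
        total = total + weight * (nce(feats_s[level], feats_f[level], head)    # slow queries
                                  + nce(feats_f[level], feats_s[level], head)) # fast queries
    return total
```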

4 Instance Correspondence Map


(10)
(11)
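The ICM is defined by Eqs. (10)-(11); as a rough illustration of the general idea only, the sketch below computes a per-location cosine-similarity map between a pooled slow feature and the fast encoder's spatial feature map. All details (pooling, normalization, rescaling) are assumptions and may differ from the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def instance_correspondence_map(feat_slow, feat_fast):
    """Illustrative correspondence map between slow and fast feature maps.

    feat_slow: (C, T1, H, W), feat_fast: (C, T2, H, W) from the last conv stage.
    Returns a (T2, H, W) map highlighting regions shared by the two views.
    """
    q = F.normalize(feat_slow.mean(dim=(1, 2, 3)), dim=0)          # (C,) pooled slow query
    k = F.normalize(feat_fast, dim=0)                              # normalize over channels
    corr = torch.einsum("c,cthw->thw", q, k)                       # cosine similarity per location
    corr = (corr - corr.min()) / (corr.max() - corr.min() + 1e-6)  # rescale to [0, 1] for display
    return corr
```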

5 Experiments

We conduct a series of comprehensive experiments following the standard protocol for evaluating video representations from self-supervised learning. Specifically, we pretrain video encoders with the proposed VTHCL on a large-scale dataset (e.g. Kinetics-400 [6]) and then finetune the encoders on the target dataset of a given downstream task (e.g. UCF-101 and HMDB-51 for action recognition). In practice, we regard the encoder $g_s$ for slow videos as our main encoder for evaluation. To ensure reproducibility, all implementation details are included in Sec. 5.1. Main results on action recognition are presented in Sec. 5.2 with comparisons to prior approaches. Sec. 5.3 includes ablation studies on the components of VTHCL. To further demonstrate the effectiveness of VTHCL and expose the limitations of the current evaluation protocol, we evaluate VTHCL on a diverse set of downstream tasks, including action detection on AVA [15] and action anticipation on Epic-Kitchen [9], in Sec. 5.4.1 and Sec. 5.4.2 respectively. Finally, we interpret the learned representations via ICM in Sec. 5.6. Unless stated otherwise, all experiments are conducted on a single modality (i.e. RGB frames) and evaluated on the corresponding validation set.

5.1 Implementation Details

Backbone. The two pathways of SlowFast [12], without lateral connections, are adopted as $g_f$ and $g_s$; they are modified from 2D ResNet [21] by inflating 2D kernels [6]. The main differences between the two encoders are the network width and the number of inflated blocks. Importantly, after training, only the slow encoder is used for the various downstream tasks.

Table 1: Action recognition results (top-1 accuracy, %) on UCF-101 and HMDB-51.

| Method | Architecture | #Frames | UCF-101 | HMDB-51 |
| --- | --- | --- | --- | --- |
| Random/ImageNet/Kinetics | 3D-ResNet18 | 8 | 68.0/83.0/92.6 | 30.8/48.2/66.7 |
| Random/ImageNet/Kinetics | 3D-ResNet50 | 8 | 61.1/86.2/94.8 | 21.7/51.8/69.3 |
| MotionPred | C3D | 16 | 61.2 | 33.4 |
| RotNet3D | 3D-ResNet18 | 16 | 62.9 | 33.7 |
| ST-Puzzle | 3D-ResNet18 | 16 | 65.8 | 33.7 |
| ClipOrder | R(2+1)D-18 | - | 72.4 | 30.9 |
| DPC | 3D-ResNet34 | - | 75.7 | 35.7 |
| AoT | T-CAM | - | 79.4 | - |
| SpeedNet | I3D | 64 | 66.7 | 43.7 |
| PacePrediction | R(2+1)D-18 | - | 77.1 | 36.6 |
| VTHCL-R18 (Ours) | 3D-ResNet18 | 8 | 80.6 | 48.6 |
| VTHCL-R50 (Ours) | 3D-ResNet50 | 8 | 82.1 | 49.2 |

Training Protocol. Following [47, 19, 56, 7], video encoders in VTHCL are randomly initialized by default. Synchronized SGD serves as our optimizer, with weight decay and momentum set to 0.0001 and 0.9 respectively. The initial learning rate is 0.03 with a total batch size of 256. The half-period cosine schedule [33] is adopted to adjust the learning rate over 200 epochs in total. Following the hyperparameters in [56, 47], the temperature $T$ in Eq. (6) is set to 0.07 and the number of sampled representations $N$ is 16384.
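For concreteness, a minimal sketch of an optimizer and half-period cosine schedule matching these reported hyperparameters is given below (assuming one scheduler step per epoch); it is not the released training code.

```python
import math
import torch

def build_pretrain_optimizer(model, epochs=200, base_lr=0.03):
    """SGD with momentum 0.9, weight decay 1e-4, and a half-period cosine
    learning-rate schedule over `epochs` epochs, per the reported settings."""
    opt = torch.optim.SGD(model.parameters(), lr=base_lr,
                          momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda e: 0.5 * (1.0 + math.cos(math.pi * e / epochs)))
    return opt, sched
```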

Dataset. Kinetics-400 [6] contains around 240k training videos and serves as the large-scale benchmark for self-supervised representation learning. We extract video frames at the raw frame rate (FPS) and sample 64 consecutive frames as a raw clip, which can then be re-sampled into slow and fast clips at strides $\tau$ and $\tau/\alpha$ ($\alpha>1$) respectively. Unless stated otherwise, the sampling stride $\tau$ is 8, i.e. our model takes 8 frames ($64/8=8$) as input.
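The re-sampling step can be summarized by the short sketch below, which turns a 64-frame raw clip into an 8-frame slow clip and a 16-frame fast clip for the default $\tau=8$, $\alpha=2$; the loader interface is an assumption.

```python
import numpy as np

def sample_slow_fast(raw_clip, tau=8, alpha=2):
    """Re-sample a 64-frame raw clip into a slow and a fast clip.

    raw_clip: sequence of 64 decoded frames. The slow clip takes every tau-th
    frame (64 / 8 = 8 frames when tau=8); the fast clip uses stride tau/alpha
    (64 / 4 = 16 frames when alpha=2). A sketch, not the released data loader.
    """
    frames = np.asarray(raw_clip)
    slow = frames[::tau]                    # stride tau       -> 8 frames
    fast = frames[::max(tau // alpha, 1)]   # stride tau/alpha -> 16 frames
    return slow, fast
```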

5.2 Action Recognition

Setup. In order to conduct a fair comparison, following prior works we finetune the learned video encoders of VTHCL on the UCF-101 [43] and HMDB-51 [31] datasets for action recognition. In particular, we obtain the video-level accuracy via the standard protocol [12, 58, 52], i.e. uniformly sampling 10 clips from the whole video and averaging the softmax probabilities of all clips as the final prediction. We train our models for 100 epochs with a total batch size of 64 and an initial learning rate of 0.1, which is reduced by a factor of 10 at epochs 40 and 80 respectively. Moreover, when pre-training on Kinetics-400 [6], a three-level contrastive hierarchy is constructed, i.e. we collect features from the outputs of $\{res_3, res_4, res_5\}$, due to the limitation of GPU resources. Unless stated otherwise, $\alpha$ is set to 2 by default for the fast clips (the sampling stride of the fast encoder $g_f$ is $8/2=4$). Namely, the slow and fast encoders take 8 and 16 frames as input respectively.
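The 10-clip testing protocol amounts to averaging per-clip softmax scores, e.g. as in the sketch below; the `model` interface and tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def video_level_prediction(model, clips):
    """Standard multi-clip testing: average the softmax scores of uniformly
    sampled clips (10 in our protocol) to obtain the video-level prediction.

    clips: (num_clips, C, T, H, W) tensor of sampled clips from one video.
    """
    probs = F.softmax(model(clips), dim=1)    # (num_clips, num_classes)
    return probs.mean(dim=0).argmax().item()  # video-level class index
```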

Main Results. Table 1 compares our approach with previous ones. Here all methods utilize only a single modality and similar architectures. Besides, results using different types of initialization (i.e. random, ImageNet-inflated and Kinetics-pretrained) are also included to serve as lower/upper bounds. In particular, our method equipped with the shallower network (3D-ResNet18) achieves top-1 accuracies of 80.6% and 48.6% on UCF-101 and HMDB-51 respectively, outperforming previous works with a similar setting by large margins. Furthermore, increasing the capacity of the network from 3D-ResNet18 to 3D-ResNet50 brings a consistent improvement, reaching 82.1% and 49.2% top-1 accuracy. Compared to the supervised results of similar backbones trained from random initialization (e.g. 68.0% and 61.1% on UCF-101 [43] for 3D-ResNet18 and 3D-ResNet50), our method significantly narrows the gap between self-supervised and supervised video representation learning.

Effect of Architectures. Beyond the competitive performance, Table 1 also raises awareness of the effect of different backbones. Intuitively, the learned representations should improve as network capacity increases. For example, works in image representation learning [29, 19, 8, 47] confirm that networks with larger capacities can boost the quality of learned representations. As for video representation learning, Table 1 shows that when networks are well initialized (e.g. supervised pretraining on ImageNet or Kinetics, or VTHCL on Kinetics), the one with a larger capacity indeed outperforms its counterpart. However, when randomly initialized, 3D-ResNet50 performs worse on UCF-101 and HMDB-51 than 3D-ResNet18 despite its larger capacity. This indicates that 3D-ResNet50 has too many parameters relative to the scale of UCF-101 and HMDB-51, and thus suffers from overfitting. Therefore, while prior works usually employed a relatively shallow model (e.g. 3D-ResNet18) in the evaluation, it is important to also test a heavy backbone to see whether the proposed methods perform consistently across backbones.

Table 2a: Effect of the relative visual tempo coefficient $\alpha$ (UCF-101/HMDB-51 top-1 accuracy, %).

| Model | $\alpha=1$ | $\alpha=2$ | $\alpha=4$ |
| --- | --- | --- | --- |
| R18 | 78.2/45.2 | 79.5/47.4 | 80.0/48.2 |
| R50 | 80.3/47.3 | 80.9/47.7 | 80.6/48.0 |

Table 2b: Effect of the number of contrastive levels $D$ (UCF-101/HMDB-51 top-1 accuracy, %).

| Model | $D=1$ | $D=2$ | $D=3$ |
| --- | --- | --- | --- |
| R18 | 79.5/47.4 | 80.3/47.9 | 80.6/48.6 |
| R50 | 80.9/47.7 | 81.5/48.5 | 82.1/49.2 |

5.3 Ablation Studies

Here we include the ablation study to investigate the effect of different VTHCL components.

Effect of relative visual tempo difference. Although Table 1 shows that VTHCL obtains competitive results on UCF-101 [43] and HMDB-51 [31], it remains uncertain whether the relative visual tempo difference between slow and fast videos significantly affects the performance of VTHCL. We thus conduct multiple experiments by adjusting the relative coefficient of the sampling stride (i.e. $\alpha=\{1,2,4\}$). Specifically, 8, 16 and 32 frames respectively are fed into the fast encoder $g_f$, while the number of frames for the slow encoder $g_s$ is kept at 8. When $\alpha$ is 1, the input is exactly the same for both the slow and fast encoders; in this case, VTHCL reduces to the instance discrimination task [56], which distinguishes video instances mainly via appearance instead of utilizing visual tempo consistency. This setting thus serves as our baseline to tell whether visual tempo helps learn better video representations. Moreover, to avoid unexpected effects, we do not apply the hierarchical scheme, and only the final features of the two encoders are used as in Sec. 3.2.

Results are included in Table 2a, which suggests that a larger $\alpha$ generally leads to better performance for both 3D-ResNet18 and 3D-ResNet50. This verifies that the visual tempo difference between slow and fast videos indeed enforces the video encoders to learn discriminative semantics by utilizing the underlying consistency. Visual tempo, as a source of supervision, can thus help self-supervised video representation learning.

Effect of hierarchical contrastive learning. We study the effect of the hierarchical contrastive formulation with a varying number of levels. Here $D$ refers to the number of elements in $\mathcal{K}$. For example, when $D=2$ we collect features from $\{res_4, res_5\}$ and build a two-level contrastive formulation. When $D$ is 1, the hierarchical scheme degenerates into the general contrastive formulation of Sec. 3.2. The relative coefficient $\alpha$ is set to 2 for a fair comparison.

Results are included in Table 2b, showing that an increasing number of levels in the contrastive formulation significantly boosts performance, even when the model is quite heavy and tends to overfit. These results verify the effectiveness of utilizing the rich hierarchy inside a deep network, which correlates well with previous studies [58]. Besides, from the perspective of optimization, such a hierarchical scheme provides a stronger supervision, effectively preventing issues such as vanishing gradients [46], especially when a deep network is used as the encoder.

5.4 Evaluation on Other Downstream Tasks

Representations learned via supervised learning on large-scale datasets such as ImageNet [10] and Kinetics-400 [6] have been shown to generalize well to a variety of tasks. While previous methods for unsupervised video representation learning tend to study the quality of learned representations only on the action recognition task, it is important to include other downstream tasks for a comprehensive evaluation, since encoders may overfit to the action recognition benchmarks (i.e. UCF-101 [43] and HMDB-51 [31]). Therefore, we also benchmark VTHCL on other downstream tasks, including action detection on AVA [15] and action anticipation on Epic-Kitchen [9].

Table 3a: Action detection on AVA (mAP).

| Model | Random | ImageNet | Kinetics | Ours |
| --- | --- | --- | --- | --- |
| R18 | 11.1 | 13.4 | 16.6 | 13.9 |
| R50 | 7.9 | 16.8 | 21.4 | 15.0 |

Table 3b: Action anticipation on Epic-Kitchen (top-1 noun/verb accuracy, %).

| Model | Random | ImageNet | Kinetics | Ours |
| --- | --- | --- | --- | --- |
| R18 | 8.9/26.3 | 13.5/28.0 | 14.2/28.8 | 11.2/27.0 |
| R50 | 8.2/26.3 | 15.7/27.8 | 15.8/30.2 | 11.9/27.6 |

5.4.1 Action Detection on AVA

Dataset. AVA [15] provides a benchmark for spatio-temporal localization of actions. Unlike traditional video detection (e.g. the ImageNet VID dataset), where labels are the categories of given bounding boxes, AVA annotations are provided for one frame per second and describe the action over time. AVA [15] contains around 235 training and 64 validation videos, and 80 'atomic' actions.

Setup. We follow the standard setting in [12, 55] for training and validation, i.e. we apply the same pre-processing for region proposals. The slow encoder $g_s$ is employed as the backbone network and takes 8 frames as input. Besides, the spatial stride of $res_5$ is set to 1 with a dilation of 2 to increase the spatial size of the output feature map. The region-of-interest (RoI) features are computed by 3D RoIAlign [20] and then fed into a per-class, sigmoid-based classifier for prediction. The training protocol differs slightly in that we train our model for 24 epochs and decay the learning rate by a factor of 10 at epochs 16 and 22, i.e. the standard 2x schedule in object detection. Note that BatchNorm (BN) layers [25] are not frozen. SGD is adopted as our optimizer with an initial learning rate of 0.1 and a weight decay of $1e^{-7}$.
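As a rough sketch of the detection head described above (omitting the proposal pre-processing and the 3D RoIAlign step), a per-class sigmoid classifier over pooled RoI features could look as follows; the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class SigmoidActionHead(nn.Module):
    """Per-class, sigmoid-based classifier applied to pooled 3D RoI features.

    A sketch of the idea only; the exact head configuration may differ.
    """
    def __init__(self, in_dim=2048, num_classes=80):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_classes)

    def forward(self, roi_feats):
        # roi_feats: (num_rois, in_dim) pooled person-box features
        return torch.sigmoid(self.fc(roi_feats))   # independent probability per action class
```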

Results. Table 3a reports the mean Average Precision (mAP) for several common initializations. A similar observation appears: with a proper initialization (e.g. ImageNet, Kinetics and ours), overfitting is alleviated such that 3D-ResNet50 can make use of its increased capacity and achieve better performance than 3D-ResNet18. It is worth noting that our method with the same backbone (13.9 mAP) beats 3D-ResNet18 trained via supervised learning on ImageNet (13.4 mAP). However, on the action detection task there remains a clear gap between video representations learned by self-supervised and supervised frameworks, even though self-supervised approaches keep improving on action recognition. It is thus beneficial and necessary to include additional downstream tasks when evaluating self-supervised video representation learning.

5.4.2 Action Anticipation on Epic-Kitchen

Dataset. Epic-Kitchen [9] is a large-scale cooking dataset recorded by 32 subjects in 32 kitchens, containing 125 verb and 352 noun categories. Following [9], we randomly select 232 videos (23439 segments) for training and 40 videos (4979 segments) for validation. Action anticipation requires forecasting the category of a future action before it happens, given an observed video clip. Following the original Epic-Kitchen baseline [9], we refer to $\tau_a$ as the anticipation time and $\tau_o$ as the observation time. In our experiments, both $\tau_a$ and $\tau_o$ are set to 1 second.

Setup. In order to evaluate the learned representations themselves, we introduce no reasoning modules as in [27, 36]. Similar to [9], we apply a shared MLP after the backbone network, followed by two separate classification heads for noun and verb prediction. The slow encoder $g_s$ is employed as the backbone network and takes 8 frames as input. Our models are trained for 80 epochs with an initial learning rate of 0.1, which is divided by 10 at epochs 40 and 60 respectively.
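A minimal sketch of such a two-head classifier is given below; the shared-MLP width and the module interface are assumptions, not the exact configuration.

```python
import torch.nn as nn

class AnticipationHead(nn.Module):
    """Shared MLP followed by separate verb / noun classifiers, as described above."""
    def __init__(self, in_dim=2048, hidden=512, num_verbs=125, num_nouns=352):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(inplace=True))
        self.verb = nn.Linear(hidden, num_verbs)
        self.noun = nn.Linear(hidden, num_nouns)

    def forward(self, x):
        h = self.shared(x)                  # x: (B, in_dim) pooled clip features
        return self.verb(h), self.noun(h)   # two separable prediction heads
```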

Results. Top-1 accuracies of noun/verb prediction obtained by various models are presented in Table 3b. Although our method obtains consistent improvements over the randomly initialized baseline, the gap between models learned with self-supervised and supervised schemes indicates that the discriminative quality of the learned representations can be further improved.


5.5 Discussion

Heavy Backbones. Intuitively, heavy backbones are supposed to perform better than the lighter ones due to their increased capacity. However, our results on action recognition, detection and anticipation reveal that heavy backbones are likely to overfit when they are not well initialized. Therefore, when evaluating various methods of video representation learning, we should be more careful about whether they introduce consistent improvements on heavy backbones.

Thorough Evaluation. From our results, we argue that we need a more thorough evaluation of learned video representations across architectures, benchmarks and downstream tasks to study their consistency and generalization ability. The reasons are two-fold. a) Models with large capacities tend to overfit on UCF-101 [43] and HMDB-51 [31] due to their limited scale and diversity, so that augmentation and regularization can sometimes be more important than the representations themselves. b) Evaluating representations for action recognition should not be the only goal. Our study on diverse tasks shows that there remain gaps between video representations learned by self-supervised and supervised learning schemes, especially on action detection and action anticipation. The learned representations should facilitate as many downstream tasks as possible.

5.6 Qualitative Interpretation

In order to investigate the shared information between slow and fast videos captured by contrastive learning, we conduct a qualitative evaluation via the Instance Correspondence Map (ICM) introduced in Sec. 4. In particular, we train our slow and fast encoders via single-level contrastive learning on the Something-Something V1 dataset [14] for a better visualization effect. One fully-convolutional layer is used as the learnable mapping $\phi$. All other settings remain the same.

Figure 4 shows several examples of instance correspondence maps. Although ICMs are obtained without access to any annotations, they show clear semantic consistency. Specifically, it can be observed that our learned encoders try to localize discriminative regions spatially and temporally to distinguish instances. For instances where objects are basically static, ICM can still localize the salient objects (see the last two rows), which makes sense since the shared information between slow and fast videos is closely related to the objects. For videos containing large motions, ICMs appear to capture the moving objects (e.g. the toothpaste and hands in the first two rows), as motion semantics contribute more to instance classification in these cases. Such interpretation suggests that, to some extent, semantics can emerge automatically from the proposed method.

6 Conclusion

In this work, we leverage videos of the same instance but with different visual tempos to learn video representations in a self-supervised manner, adopting the contrastive learning framework and extending it to hierarchical contrastive learning. On a variety of downstream tasks including action recognition, detection and anticipation, we demonstrate the effectiveness of the proposed framework, which obtains competitive results on action recognition, outperforming previous approaches by a clear margin. Moreover, our experiments further suggest that when learning general visual representations of videos, we should evaluate the learned features more thoroughly across network architectures, benchmarks, and tasks. Finally, we visualize the learned representations through the instance correspondence map, showing that contrastive learning on visual tempo captures the informative objects without explicit supervision.

Acknowledgments

We thank Zhirong Wu and Yonglong Tian for their public implementation of previous works.

  • [1] Humam Alwassel, Dhruv Mahajan, Lorenzo Torresani, Bernard Ghanem, and Du Tran. Self-supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667 , 2019.
  • [2] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proc. ICCV , 2017.
  • [3] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems , pages 15509–15519, 2019.
  • [4] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proc. CVPR , 2017.
  • [5] Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T Freeman, Michael Rubinstein, Michal Irani, and Tali Dekel. Speednet: Learning the speediness in videos. Proc. CVPR , 2020.
  • [6] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proc. CVPR , 2017.
  • [7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709 , 2020.
  • [8] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 , 2020.
  • [9] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proc. ECCV , 2018.
  • [10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. CVPR , 2009.
  • [11] Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Temporal cycle-consistency learning. In Proc. CVPR , 2019.
  • [12] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proc. ICCV , 2019.
  • [13] Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE conference on computer vision and pattern recognition , 2017.
  • [14] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proc. ICCV, 2017.
  • [15] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proc. CVPR , 2018.
  • [16] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In Proc. CVPR, 2006.
  • [17] Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In Proceedings of the IEEE International Conference on Computer Vision Workshops , 2019.
  • [18] Tengda Han, Weidi Xie, and Andrew Zisserman. Self-supervised co-training for video representation learning. Advances in Neural Information Processing Systems , 33, 2020.
  • [19] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. Proc. CVPR , 2020.
  • [20] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proc. ICCV , 2017.
  • [21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR , 2016.
  • [22] Olivier J Hénaff, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272 , 2019.
  • [23] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. International Conference on Learning Representations , 2019.
  • [24] De-An Huang, Vignesh Ramanathan, Dhruv Mahajan, Lorenzo Torresani, Manohar Paluri, Li Fei-Fei, and Juan Carlos Niebles. What makes a video a video: Analyzing temporal information in video understanding models and datasets. In Proc. CVPR , 2018.
  • [25] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning , 2015.
  • [26] Longlong Jing and Yingli Tian. Self-supervised spatiotemporal feature learning by video geometric transformations. arXiv preprint arXiv:1811.11387 , 2018.
  • [27] Qiuhong Ke, Mario Fritz, and Bernt Schiele. Time-conditioned action anticipation in one shot. In Proc. CVPR, 2019.
  • [28] Dahun Kim, Donghyeon Cho, and In So Kweon. Self-supervised video representation learning with space-time cubic puzzles. 2019.
  • [29] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In Proc. CVPR , 2019.
  • [30] Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems , 2018.
  • [31] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In Proc. ICCV , 2011.
  • [32] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proc. ICCV , 2017.
  • [33] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. International Conference on Learning Representations , 2017.
  • [34] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Proc. CVPR , 2015.
  • [35] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. arXiv preprint arXiv:1912.06430 , 2019.
  • [36] Antoine Miech, Ivan Laptev, Josef Sivic, Heng Wang, Lorenzo Torresani, and Du Tran. Leveraging the present to anticipate the future in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops , 2019.
  • [37] Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In Proc. ECCV , 2016.
  • [38] Ari S Morcos, David GT Barrett, Neil C Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization. In International Conference on Learning Representations , 2018.
  • [39] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems , 2016.
  • [40] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 , 2018.
  • [41] Mandela Patrick, Yuki M Asano, Ruth Fong, João F Henriques, Geoffrey Zweig, and Andrea Vedaldi. Multi-modal self-supervision from generalized data transformations. arXiv preprint arXiv:2003.04298 , 2020.
  • [42] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In International Conference on Learning Representations Workshop , 2014.
  • [43] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 , 2012.
  • [44] Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743 , 2019.
  • [45] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. In Proc. ICCV , 2019.
  • [46] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proc. CVPR , 2015.
  • [47] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849 , 2019.
  • [48] Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. Tracking emerges by colorizing videos. In Proc. ECCV , 2018.
  • [49] Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, and Wei Liu. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In Proc. CVPR , 2019.
  • [50] Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, and Wei Liu. Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In Proc. CVPR , 2019.
  • [51] Jiangliu Wang, Jianbo Jiao, and Yun-Hui Liu. Self-supervised video representation learning by pace prediction. Proc. ECCV , 2020.
  • [52] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proc. CVPR , 2018.
  • [53] Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. In Proc. CVPR , 2019.
  • [54] Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. Learning and using the arrow of time. In Proc. CVPR , 2018.
  • [55] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krahenbuhl, and Ross Girshick. Long-term feature banks for detailed video understanding. In Proc. CVPR , 2019.
  • [56] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proc. CVPR , 2018.
  • [57] Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In Proc. CVPR , 2019.
  • [58] Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai, and Bolei Zhou. Temporal pyramid network for action recognition. In Proc. CVPR , 2020.
  • [59] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems , 2014.
  • [60] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Proc. ECCV , 2014.
  • [61] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene cnns. In International Conference on Learning Representations , 2015.
  • [62] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 2921–2929, 2016.


Accelerating Diffusion Models via Early Stop of the Diffusion Process

Zhaoyang Lyu, Xudong Xu, Ceyuan Yang, Dahua Lin, Bo Dai.

Guided Diffusion Model for Adversarial Purification

Jinyi Wang, Zhaoyang Lyu, Dahua Lin, Bo Dai , Hongfei Fu.

Unsupervised Landmark Learning from Unpaired Data

Yinghao Xu, Ceyuan Yang, Ziwei Liu, Bo Dai , Bolei Zhou.

Video Representation Learning with Visual Tempo Consistency

Ceyuan Yang, Yinghao Xu, Bo Dai , Bolei Zhou.

Zeroth-Order Supervised Policy Improvement

Hao Sun, Ziping Xu, Yuhang Song, Meng Fang, Jiechao Xiong, Bo Dai , Zhengyou Zhang, Bolei Zhou.

Novel Policy Seeking with Constrained Optimization

Hao Sun, Zhenghao Peng, Bo Dai , Jian Guo, Dahua Lin, Bolei Zhou.

Evolutionary Stochastic Policy Distillation

Hao Sun, Xinyu Pan, Bo Dai , Dahua Lin, Bolei Zhou.

  • We emphasize the time-varying dynamics of the bi-classification task of GAN's discriminator.
  • A on-the-fly adjustment of the discriminator's capacity without incurring any additional computational cost or training objectives.
  • DynamicD is general and effective across different tasks and dataset scales, and is synergistic to other discriminator-improving techniques.

Improving GANs with A Dynamic Discriminator

Ceyuan Yang, Yujun Shen, Yinghao Xu, Deli Zhao, Bo Dai , Bolei Zhou.

Advances in Neural Information Processing Systems (NeurIPS) 2022

  • An early attempt to exploit CLIP for dense prediction tasks.
  • MaskCLIP yields compelling segmentation results on open concepts in the absence of annotations and fine-tuning.
  • With pseudo labeling and self-training, MaskCLIP+ increases mIoU of unseen classes from 35.6 to 86.1.

Extract Free Dense Labels from CLIP       Oral

Chong Zhou, Chen Change Loy, Bo Dai .

European Conference on Computer Vision (ECCV) 2022

  • One of pioneering methods that model city scenes with NeRF.
  • Handling the issue of observations are obtained at drastically different scales.
  • BungeeNeRF follows a progressive growing paradigm to gradually activate high-frequency PE channels for unfolding more complex details.

BungeeNeRF: Progressive Neural Radiance Field for Extreme Multi-scale Scene Rendering

Yuanbo Xiangli, Linning Xu, Xingang Pan, Nanxuan Zhao, Anyi Rao, Christian Theobalt, Bo Dai , Dahua Lin.

  • A transformer-based method for particle-based simulation that models particle interactions in an edge-free manner.
  • The core idea of TIE is to decentralize the computation involving pair-wise particle interactions into per-particle updates.
  • Compared with existing GNN-based methods, without bells and whistles, TIE achieves superior performance and generalization across all these domains.

Transformer with Implicit Edges for Particle-based Physics Simulation

Yidi Shao, Chen Change Loy, Bo Dai .

  • A framework to improve mesh reconstruction via a 3D GAN that synthesizes textured meshes.
  • Reconstruction is achieved by GAN-inversion with respect to the single view observation, and is naturally regularized by the GAN manifold.
  • It generalizes well to meshes that are less commonly seen, such as the extended articulation of deformable objects.

Monocular 3D Object Reconstruction with GAN Inversion

Junzhe Zhang, Daxuan Ren, Zhongang Cai, Chai Kiat Yeo, Bo Dai , Chen Change Loy.

  • A baseline for video-based 2D/3D human pose estimation that achieves 10 times efficiency improvement.
  • DeciWatch only watches sparsely sampled frames, taking advantage of the semantic continuity of human motions.
  • Experiments indicate DeciWatch is able to capture high-level semantic flows beyond simply interpolating estimate poses.

DeciWatch: A Simple Baseline for 10x Efficient 2D and 3D Pose Estimation

Ailing Zeng, Xuan Ju, Lei Yang, Ruiyuan Gao, Xizhou Zhu, Bo Dai , Qiang Xu.

  • A new dataset focusing on breakdancing which features acrobatic moves and tangled postures.
  • Challenges existing datasets that assume strong music-dance correlation, controlled motion data and relatively simple poses and movements.
  • With intricate poses and swift movements, models are forced to reason more effectively about body structure and movements.

BRACE: The Breakdancing Competition Dataset for Dance Motion Synthesis

Davide Moltisanti, Jinyi Wu, Bo Dai , Chen Change Loy.

  • A CNN-based method for skeleton-based action recognition.
  • The proposed method avoids the limitation of GCN-based methods in terms of robustness, interoperability and scalability.
  • With PoseC3D, pose modality can be easily integrated with other modalities, providing a great design space for further improvement.

Revisiting Skeleton-based Action Recognition       Oral

Haodong Duan, Yue Zhao, Kai Chen, Dahua Lin, Bo Dai .

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2022

  • A cross-model pseudo-labeling scheme for semi-supervised action recognition.
  • A lightweight auxiliary network is introduced to the primary backbone, so that they predict pseudo-labels for each other.
  • Two networks provide complementary labels due to their different structural biases, leading to large-margin improvements over baselines.

Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition       Oral

Yinghao Xu, Fangyun Wei, Xiao Sun, Ceyuan Yang, Yujun Shen, Bo Dai , Bolei Zhou, Stephen Lin.

  • A hierarchical framework for synthesizing diverse 3D human motions in the scene.
  • Decomposing the diversity of scene-aware human motions into interaction diversity, path diversity, and motion diversity.
  • The first attempt to comprehensively consider human motion diversity.

Towards Diverse and Natural Scene-aware 3D Human Motion Synthesis

Jingbo Wang, Yu Rong, Jingyuan Liu, Sijie Yan, Dahua Lin, Bo Dai .

  • A dual-space GAN for better controllable facial editing by disentangling style and content.
  • A novel Transformer-based framework to enhance the interaction between two spaces.
  • A new dual-space editing and inversion strategy is also introduced for additional editing flexibility.

TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing

Yanbo Xu, Yueqin Yin, Liming Jiang, Qianyi Wu, Chengyao Zheng, Chen Change Loy, Bo Dai , Wayne Wu.

  • A framework for co-speech gesture generation.
  • Considering the hierarchical structure of speech audios and human gestures with multiple granularities association.
  • A contrastive learning strategy is also introduced to enhance audio-text alignment.

Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation

Xian Liu, Qianyi Wu, Hang Zhou, Yinghao Xu, Rui Qian, Xinyi Lin, Xiaowei Zhou, Wayne Wu, Bo Dai , Bolei Zhou.

  • A novel model for 3D-aware image synthesis that learns compact surfaces.
  • A dedicated transition from the cumulative rendering in radiance fields to rendering with only the surface points.
  • Gradually shrinking the sampling region to a minimal neighboring region around the surface.

Generative Occupancy Fields for 3D Surface-Aware Image Synthesis

Xudong Xu, Xingang Pan, Dahua Lin, Bo Dai .

Advances in Neural Information Processing Systems (NeurIPS) 2021

  • A novel model for 3D-aware image synthesis that learns more accurate shapes.
  • Learning an accurate 3D shape under different lighting conditions.
  • Modeling illumination explicitly and performing shading with various lighting conditions.

A Shading-Guided Generative Implicit Model for Shape-Accurate 3D-Aware Image Synthesis

Xingang Pan, Xudong Xu, Chen Change Loy, Christian Theobalt, Bo Dai .

  • A novel strategy for training GAN with limited data.
  • Encouraging a healthy competition between the generator and the discriminator to avoid overfitting.
  • Employing generated images as real samples to deceive the discriminator and suppress its confidence.

Deceive D: Adaptive Pseudo Augmentation for GAN Training with Limited Data

Liming Jiang, Bo Dai , Wayne Wu, Chen Change Loy.

  • The first generative model for city block synthesis.
  • Its core is a novel representation that reflects block structures with a ring topology and a two-tier graph.
  • It sets the foundation for large-scale city modeling with high fidelity in both geometry and functional semantics.

BlockPlanner: City Block Generation with Vectorized Graph Representation

Linning Xu*, Yuanbo Xiangli*, Anyi Rao, Nanxuan Zhao, Bo Dai , Ziwei Liu, Dahua Lin.

In Proceedings of the IEEE International Conference on Computer Vision (ICCV) 2021 (*=equal contribution)

  • A frequency-domain loss for image reconstruction and synthesis.
  • It allows a model to adaptively focus on frequency components that are hard to synthesize by down-weighting the easy ones.
  • It is complementary to existing spatial losses, improving popular models such as VAE, pix2pix, and SPADE.

Focal Frequency Loss for Image Reconstruction and Synthesis

In Proceedings of the IEEE International Conference on Computer Vision (ICCV) 2021

  • A motion synthesis model emphasizing on the importance of scene context.
  • It factorizes the distribution of human motions into a distribution of movement trajectories and that of body pose dynamics.
  • The ability of synthesizing diverse human motions that will adjust trajectories and body poses with respect to scene context.

Scene-aware Generative Network for Human Motion Synthesis

Jingbo Wang, Sijie Yan, Bo Dai , Dahua Lin.

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2021

  • A GAN-Inversion approach for shape completion.
  • Searching for a latent code of a pre-trained GAN that gives a complete shape best reconstructing the given partial input.
  • Giving robust results for real-world scans and partial inputs of various forms and incompleteness levels.

Unsupervised 3D Shape Completion through GAN Inversion

Junzhe Zhang, Xinyi Chen, Zhongang Cai, Liang Pan, Haiyu Zhao, Shuai Yi, Chai Kiat Yeo, Bo Dai , Chen Change Loy.

  • An effective pipeline that learns to generate binaural audios from mono audios.
  • Pseudo visual-stereo pairs based on spherical harmonic decomposition and head-related impulse response (HRIR).
  • Improved generalization ability without being limited by the scale and variety of real pairs.

Visually Informed Binaural Audio Generation without Binaural Audios

Xudong Xu*, Hang Zhou*, Ziwei Liu, Bo Dai , Xiaogang Wang, Dahua Lin.

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2021 (*=equal contribution)

  • 3D object shape reconstruction using a pretrained 2D GAN without 3D annotations.
  • An unsupervised 3D shape learning method that does not rely on the symmetry assumption of shapes.
  • Photo-realistic 3D-aware image manipulations including rotation and relighting.

Do 2D GANS Know 3D Shape? Unsupervised 3D Shape Reconstruction From 2D Image GANs       Oral

Xingang Pan, Bo Dai , Ziwei Liu, Chen Change Loy, Ping Luo.

International Conference on Learning Representations (ICLR) 2021

  • Precise GAN-inversion by discriminator-guided generator finetuning.
  • A versatile way for high-quality image restoration and manipulation.
  • Stable generalization on out-of-distribution images.

Exploiting Deep Generative Prior for Versatile Image Restoration and Manipulation       Oral

Xingang Pan, Xiaohang Zhan, Bo Dai , Dahua Lin, Chen Change Loy, Ping Luo.

European Conference on Computer Vision (ECCV) 2020

Exploiting Deep Generative Prior for Versatile Image Restoration and Manipulation

IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 2021

  • A new dataset for fine-grained gymnastics action recognition.
  • High-quality visual data collected from professional competitions.
  • High-quality annotations organized hierarchically, acrossing several semantic/temporal granularities.
  • Revealing the gaps between coarse- and fine-grained action recognition.

FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding       Three Strong Accepts Oral

Dian Shao, Yue Zhao, Bo Dai , Dahua Lin.

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020

  • A generic module for action recognition.
  • A feature-level temporal pyramid that fuses temporal dynamics captured at different tempos.
  • Robustly handling actions with varying visual tempos, which means the speed of an action.

Temporal Pyramid Network for Action Recognition

Ceyuan Yang*, Yinghao Xu*, Jianping Shi, Bo Dai , Bolei Zhou.

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020 (*=equal contribution)

  • An self-supervised framework that de-occludes scenes.
  • Recovering orders of occluded objects by deciding their pair-wise relationships.
  • Providing amodal completion and succeeding manipulations based on recovered orders.

Self-Supervised Scene De-occlusion Oral

Xiaohang Zhan, Xingang Pan, Bo Dai , Ziwei Liu, Dahua Lin, Chen Change Loy.

  • Proposing a new task, temporal action parsing, for action understanding.
  • Revealing the internal structures of actions by parsing them into sub-actions without labels.
  • Revealing the connections between different actions via common sub-actions.

Intra- and Inter-Action Understanding via Temporal Action Parsing

  • Extending GAN into a more general form by treating realness as a random variable.
  • Training GAN with the loss of maximizing KL divergence.
  • Training DCGAN at 1024x1024 resolution in one run, without progressive tricks.

Real or not Real, that is the Question       Spotlight

Yuanbo Xiangli*, Yubin Deng*, Bo Dai* , Chen Change Loy, Dahua Lin.

International Conference on Learning Representations (ICLR) 2020 (*=equal contribution)

  • A framework for visual-guided sound separation.
  • Recursively separating out the most salient sound and removing it from the mixture.
  • Preventing less salient sounds sound like noises in the context of salient sounds.

Recursive Visual Sound Separation Using Minus-Plus Net

Xudong Xu, Bo Dai , Dahua Lin.

In Proceedings of the IEEE International Conference on Computer Vision (ICCV) 2019

  • A component for object detection and more.
  • Relying on the consistency between two sets of features, separated by scales, difficulty, etc.
  • Guiding features in the less robust set with features in the reliable one.

Feature Intertwiner for Object Detection

Hongyang Li, Bo Dai, Shaoshuai Shi, Wanli Ouyang, Xiaogang Wang.

  • A nonsequential paradigm that generates captions hierarchically.
  • Recursively composing incomplete phrases into longer phrases, fitting the properties of natural language.
  • Strong cross-dataset generalization.

A Neural Compositional Paradigm for Image Captioning

Bo Dai, Sanja Fidler, Dahua Lin.

  • Systematically studying image captioning, in terms of model structures, evaluation metrics and learning paradigms.

Towards Diverse and Natural Descriptions for Image Captioning

  • Rethinking the representations of latent states in captioning models.
  • Treating latent states as 2D maps, instead of 1D vectors.
  • Visually revealing the dynamics of caption generation and the connections between the visual and linguistic domains.

Rethinking the Form of Latent States in Image Captioning

Bo Dai*, Deming Ye*, Dahua Lin.

  • A framework that generates a descriptive paragraph for a video.
  • Keeping track of the events already covered and the sentences already generated so that the final paragraph is coherent and concise.

Move Forward and Tell: A Progressive Generator of Video Descriptions

Yilei Xiong, Bo Dai, Dahua Lin.

European Conference on Computer Vision (ECCV) 2018

  • A network structure and a loss that approximate the routing mechanism in Capsule Network.
  • Scaling capsule-like networks up to large datasets.
  • Adapting optimal transport as an objective.

Neural Network Encapsulation

Hongyang Li, Xiaoyang Guo, Bo Dai, Wanli Ouyang, Xiaogang Wang.

  • A learning method for image captioning.
  • Encouraging distinctiveness in generated captions.
  • Introducing a reference model, inspired by Noise Contrastive Estimation (an illustrative sketch follows this entry).

Contrastive Learning for Image Captioning

Bo Dai, Dahua Lin.

Advances in Neural Information Processing Systems (NIPS) 2017
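One plausible NCE-style reading of the reference-model idea is sketched below: the target captioner should assign higher likelihood than a frozen reference model to matched image-caption pairs and lower likelihood to mismatched ones. The log-probability interface and the logistic formulation here are illustrative assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_caption_loss(logp_target_pos, logp_ref_pos,
                             logp_target_neg, logp_ref_neg):
    """Illustrative NCE-style objective for distinctive captioning.

    Each argument is a (N,) tensor of caption log-probabilities:
      *_pos : ground-truth (image, caption) pairs
      *_neg : mismatched pairs (caption paired with a different image)
    The target captioner should beat the frozen reference model on positives
    and fall below it on negatives.
    """
    diff_pos = logp_target_pos - logp_ref_pos   # should be large
    diff_neg = logp_target_neg - logp_ref_neg   # should be small
    return -(F.logsigmoid(diff_pos) + F.logsigmoid(-diff_neg)).mean()

# Toy usage with random log-probabilities standing in for model outputs.
n = 8
loss = contrastive_caption_loss(torch.randn(n), torch.randn(n),
                                torch.randn(n), torch.randn(n))
print(float(loss))
```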

  • Jointly learning a caption generator and a caption evaluator.
  • Assessing captions in terms of semantic relatedness, rather than n-gram matching.
  • Generating captions that are more natural and vivid.

Towards Diverse and Natural Image Descriptions via a Conditional GAN       Oral

Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin.

In Proceedings of the IEEE International Conference on Computer Vision (ICCV) 2017

  • A visual relationship detector based on visual appearances, spatial configurations and statistical relations.
  • Its core component, DR-Net, captures the statistical relations among variables.
  • Serving as a component for structurally representing images with scene graphs.

Detecting Visual Relationships with Deep Relational Networks       Oral

Bo Dai, Yuqi Zhang, and Dahua Lin.

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017

Towards an Integrated Framework for Automatic Semantic Analysis of Soccer Games






Video Representation Learning with Visual Tempo Consistency


Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning (VTHCL). Specifically, by sampling the same instance at slow and fast frame rates respectively, we can obtain slow and fast video frames which share the same semantics but contain different visual tempos. Video representations learned from VTHCL achieve competitive performance under the self-supervised evaluation protocol for action recognition on UCF-101 (82.1%) and HMDB-51 (49.2%). Moreover, we show that the learned representations also generalize well to other downstream tasks, including action detection on AVA and action anticipation on Epic-Kitchen. Finally, our empirical analysis suggests that a more thorough evaluation protocol is needed to verify the effectiveness of self-supervised video representations across network structures and downstream tasks.
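To make the training signal concrete, below is a minimal sketch of the core recipe: sample the same video at two frame rates and pull the resulting clip embeddings together with an InfoNCE-style contrastive loss. The sampling strides, the temperature, and the use of plain in-batch negatives are illustrative assumptions, not the released VTHCL implementation; the hierarchical part of the published method (contrasting features at multiple network levels) is omitted here.

```python
import torch
import torch.nn.functional as F

def sample_slow_fast(frames, slow_stride=8, fast_stride=2):
    """Sample two clips of the same video at different frame rates.

    `frames` is a (T, C, H, W) tensor of decoded frames; the strides are
    illustrative choices, not the paper's exact sampling configuration.
    """
    return frames[::slow_stride], frames[::fast_stride]

def info_nce(z_slow, z_fast, temperature=0.07):
    """InfoNCE loss treating the slow/fast clips of the same video as positives.

    z_slow, z_fast: (N, D) clip embeddings; every other video in the batch
    acts as a negative (a memory bank or momentum encoder is not modeled here).
    """
    z_slow = F.normalize(z_slow, dim=1)
    z_fast = F.normalize(z_fast, dim=1)
    logits = z_slow @ z_fast.t() / temperature            # (N, N) similarities
    targets = torch.arange(z_slow.size(0), device=logits.device)
    # Symmetrize so both pathways are pulled toward each other.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: random frames and random embeddings standing in for the encoders.
frames = torch.randn(64, 3, 112, 112)
slow_clip, fast_clip = sample_slow_fast(frames)
print(slow_clip.shape, fast_clip.shape)                   # 8 and 32 frames
z_slow, z_fast = torch.randn(16, 128), torch.randn(16, 128)
print(float(info_nce(z_slow, z_fast)))
```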


Related Research

  • Cycle-Contrast for Self-Supervised Video Representation Learning
  • Learning Representations from Audio-Visual Spatial Alignment
  • STC-Mix: Space, Time, Channel Mixing for Self-Supervised Video Representation
  • CMAE-V: Contrastive Masked Autoencoders for Video Action Recognition
  • Self-Conditioned Probabilistic Learning of Video Rescaling
  • An Information-Theoretic Approach to Unsupervised Keypoint Representation Learning
  • Compressive Visual Representations


  • Corpus ID: 220250229

Video Representation Learning with Visual Tempo Consistency

  • Ceyuan Yang, Yinghao Xu, Bo Dai, Bolei Zhou
  • Published in arXiv.org 28 June 2020
  • Computer Science


82 Citations (selected)

  • Spatiotemporal Contrastive Video Representation Learning
  • Can Temporal Information Help with Contrastive Self-Supervised Learning?
  • Nearest-Neighbor Inter-Intra Contrastive Learning from Unlabeled Videos
  • Visual Tempo Contrastive Learning for Few-Shot Action Recognition
  • How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?
  • Video Contrastive Learning with Global Context
  • Contextualized Relation Predictive Model for Self-Supervised Group Activity Representation Learning
  • CSTRCRL: Cross-View Contrastive Learning Through Gated GCN with Strong Augmentations for Skeleton Recognition
  • Self-Supervised Video Representation Learning in a Heuristic Decoupled Perspective
  • VideoMamba: Spatio-Temporal Selective State Space Model

66 References (selected)

  • Temporal Pyramid Network for Action Recognition
  • Video Representation Learning by Dense Predictive Coding
  • Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
  • End-to-End Learning of Visual Representations from Uncurated Instructional Videos
  • Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
  • VideoBERT: A Joint Model for Video and Language Representation Learning
  • SpeedNet: Learning the Speediness in Videos
  • Temporal Cycle-Consistency Learning
  • Self-Supervised Spatio-Temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
  • Learning and Using the Arrow of Time



Information

Published in.

cover image ACM Conferences

  • General Chairs:

University of Electronic Science&Technology of China, China

Zhejiang University, China

  • Program Chairs:

University of Electronic Science and Technology of China, China

Author Picture

CWI&TU Delft, The Netherlands

FACEBOOK, Inc., USA

University of Texas at Dallas, USA

  • SIGMM: ACM Special Interest Group on Multimedia

Association for Computing Machinery

New York, NY, United States

Publication History

Permissions, check for updates, author tags.

  • contrastive learning
  • graph augmentation
  • self-supervised learning
  • video representation learning
  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • Sichuan Science and Technology Program
  • Fundamental Research Funds for the Central Universities

Acceptance Rates

Upcoming conference, contributors, other metrics, bibliometrics, article metrics.

  • 3 Total Citations View Citations
  • 350 Total Downloads
  • Downloads (Last 12 months) 43
  • Downloads (Last 6 weeks) 1

View Options

Login options.

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

View options.

View or Download as a PDF file.

View online with eReader .

Share this Publication link

Copying failed.

Share on social media

Affiliations, export citations.

  • Please download or close your previous search result export first before starting a new bulk export. Preview is not available. By clicking download, a status dialog will open to start the export process. The process may take a few minutes but once it finishes a file will be downloadable from your browser. You may continue to browse the DL while the export process is in progress. Download
  • Download citation
  • Copy citation

We are preparing your search results for download ...

We will inform you here when the file is ready.

Your file of search results citations is now ready.

Your search export query has expired. Please try again.

IEEE Account

  • Change Username/Password
  • Update Address

Purchase Details

  • Payment Options
  • Order History
  • View Purchased Documents

Profile Information

  • Communications Preferences
  • Profession and Education
  • Technical Interests
  • US & Canada: +1 800 678 4333
  • Worldwide: +1 732 981 0060
  • Contact & Support
  • About IEEE Xplore
  • Accessibility
  • Terms of Use
  • Nondiscrimination Policy
  • Privacy & Opting Out of Cookies

A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity. © Copyright 2024 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.

COMMENTS

  1. Video Representation Learning with Visual Tempo Consistency

    Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning (VTHCL). Specifically, by ...

  2. PDF Video Representation Learning with Visual Tempo Consistency

    3. Learning from Visual Tempo Consistency: The goal of self-supervised video representation learning is to learn a video encoder g that is able to produce compact and informative video representations, by regarding the structural knowledge and the consistency among a set of unlabelled videos {v_1, ..., v_n} as the self-supervision signal.

  3. Video Representation Learning with Visual Tempo Consistency

    In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning (VTHCL). Specifically, by sampling the same instance at slow and fast frame rates ...

  4. Video Representation Learning with Visual Tempo Consistency

    In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations ...

  5. Video Representation Learning with Visual Tempo Consistency

    Table 4: Representation Transfer. Results on action detection and anticipation are reported - "Video Representation Learning with Visual Tempo Consistency"

  6. [arXiv 2020] Video Representation Learning with Visual Tempo Consistency

    @inproceedings{yang2020vthcl, title={Video Representation Learning with Visual Tempo Consistency}, author={Yang, Ceyuan and Xu, Yinghao and Dai, Bo and Zhou, Bolei}, booktitle={arXiv preprint arXiv:2006.15489}, year={2020}, }

  7. Video Representation Learning with Visual Tempo Consistency

    Note that only the top-1 accuracies are reported - "Video Representation Learning with Visual Tempo Consistency"

  8. Video Representation Learning with Visual Tempo Consistency

    It is demonstrated that visual tempo can also serve as a self-supervision signal for video representation learning and is proposed to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning (VTHCL). Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition. In this work, we ...

  9. Video Representation Learning with Visual Tempo Consistency

    3 Learning from Visual Tempo Consistency: The goal of self-supervised video representation learning is to learn a video encoder g that is able to produce compact and informative video representations, by regarding the structural knowledge and the consistency among a set of unlabelled videos {v_1, ..., v_n} as the self-supervision signal.

  10. Video Representation Learning with Visual Tempo Consistency

    Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition [12, 58]. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for v…

  11. Bo Dai's Homepage

    Video Representation Learning with Visual Tempo Consistency. Ceyuan Yang, Yinghao Xu, Bo Dai, Bolei Zhou. ... A baseline for video-based 2D/3D human pose estimation that achieves 10 times efficiency improvement. ... Relying on the consistency between two sets of features, separated by scales, difficulty, etc. ...

  12. PDF Spatiotemporal Contrastive Video Representation Learning

    We discuss other components of the framework as follows: (1) an encoder network maps an input video clip to its representation z, (2) spatiotemporal augmentations to construct positive pairs (z_i, z_i') and the properties they induce, and (3) methods to evaluate the learned representations. 3.2. Video Encoder.

  13. Ceyuan Yang's Homepage

    Video Representation Learning with Visual Tempo Consistency, Ceyuan Yang, Yinghao Xu, Bo Dai, Bolei Zhou. ArXiv pre-print / Paper / Code

  14. PDF Exploring Temporal Concurrency for Video-Language Representation Learning

    temporal consistency implied between different segments of long-form video is ignored. Most recently, sequence alignment for video-language representation learning is explored in TempCLR [52], which discovers the temporal dynamics between different video segments by a contrastive learning framework based on the shuffled segment sequence.

  15. PDF Modeling Video As Stochastic Processes for Fine-Grained Video

    To this end, we propose a new perspective that considers Video as Stochastic Processes (VSP) to explicitly capture the temporal dynamics of videos by exploring process agreement. Figure 1. The evolution of fine-grained video representation learning. (a) Video alignment (e.g., TCC [11], LAV [13]) enforces two videos from the same action aligned ...

  16. Video Representation Learning with Visual Tempo Consistency

    Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations of slow and fast videos via ...

  17. Video Representation Learning with Visual Tempo Consistency

    Table 2: Linear classification on Kinetics-400 [5]. Top-1 accuracy is reported - "Video Representation Learning with Visual Tempo Consistency"

  18. Video Representation Learning with Graph Contrastive Augmentation

    We propose a novel contrastive self-supervised video representation learning framework, termed Graph Contrastive Augmentation (GCA), by constructing a video temporal graph, devising a graph augmentation designed to enhance the correlation across frames, and developing a new view for exploring the temporal structure of videos.

  19. PDF Temporal Cycle-Consistency Learning

    Figure 1: We present a self-supervised representation learning technique called temporal cycle consistency (TCC) learning. It is inspired by the temporal video alignment problem, which refers to the task of finding correspondences across multiple videos despite many factors of variation. The learned representations are useful for fine-grained ...

  20. Video Representation Learning with Visual Tempo Consistency

    Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations of slow and fast videos via ...

  21. Spatiotemporal Contrastive Video Representation Learning

    We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. We study what makes for good data ...

  22. PDF Spatiotemporal Contrastive Video Representation Learning

    ...Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. We study what makes for good data ...

  23. Video Representation Learning with Visual Tempo Consistency

    Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning (VTHCL). Specifically, by ...