Video Applications
Papers
You Lead, We Exceed: Labor-Free Video Concept Learningby Jointly Exploiting Web Videos and Images
- intro: CVPR 2016
 - intro: Lead–Exceed Neural Network (LENN), LSTM
 - paper: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/CVPR16_webly_final.pdf
 
Video Fill in the Blank with Merging LSTMs
- intro: for Large Scale Movie Description and Understanding Challenge (LSMDC) 2016, “Movie fill-in-the-blank” Challenge, UCF_CRCV
 - intro: Video-Fill-in-the-Blank (ViFitB)
 - arxiv: https://arxiv.org/abs/1610.04062
 
Video Pixel Networks
- intro: Google DeepMind
 - arxiv: https://arxiv.org/abs/1610.00527
 
Robust Video Synchronization using Unsupervised Deep Learning
Video Propagation Networks
- intro: CVPR 2017. Max Planck Institute for Intelligent Systems & Bernstein Center for Computational Neuroscience
 - project page: https://varunjampani.github.io/vpn/
 - arxiv: https://arxiv.org/abs/1612.05478
 - github(Caffe): https://github.com/varunjampani/video_prop_networks
 
Video Frame Synthesis using Deep Voxel Flow
- project page: https://liuziwei7.github.io/projects/VoxelFlow.html
 - arxiv: https://arxiv.org/abs/1702.02463
 
Optimizing Deep CNN-Based Queries over Video Streams at Scale
- intro: Stanford InfoLab
 - keywords: NoScope. difference detectors, specialized models
 - arxiv: https://arxiv.org/abs/1703.02529
 - github: https://github.com/stanford-futuredata/noscope
 - github: https://github.com/stanford-futuredata/tensorflow-noscope
 
NoScope: 1000x Faster Deep Learning Queries over Video
http://dawn.cs.stanford.edu/2017/06/22/noscope/
Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos
- intro: CVPR 2017. Stanford University & University of Southern California
 - arxiv: https://arxiv.org/abs/1703.02521
 
ProcNets: Learning to Segment Procedures in Untrimmed and Unconstrained Videos
https://arxiv.org/abs/1703.09788
Unsupervised Learning Layers for Video Analysis
- intro: Baidu Research
 - intro: “The experiments demonstrated the potential applications of UL layers and online learning algorithm to head orientation estimation and moving object localization”
 - arxiv: https://arxiv.org/abs/1705.08918
 
Look, Listen and Learn
- intro: DeepMind
 - intro: “Audio-Visual Correspondence” learning
 - arxiv: https://arxiv.org/abs/1705.08168
 
Video Imagination from a Single Image with Transformation Generation
- intro: Peking University
 - arxiv: https://arxiv.org/abs/1706.04124
 - github: https://github.com/gitpub327/VideoImagination
 
Learning to Learn from Noisy Web Videos
- intro: CVPR 2017. Stanford University & CMU & Simon Fraser University
 - arxiv: https://arxiv.org/abs/1706.02884
 
Convolutional Long Short-Term Memory Networks for Recognizing First Person Interactions
- intro: Accepted on the second International Workshop on Egocentric Perception, Interaction and Computing(EPIC) at International Conference on Computer Vision(ICCV-17)
 - arxiv: https://arxiv.org/abs/1709.06495
 
Learning Binary Residual Representations for Domain-specific Video Streaming
- intro: AAAI 2018
 - project page: http://research.nvidia.com/publication/2018-02_Learning-Binary-Residual
 - arxiv: https://arxiv.org/abs/1712.05087
 
Video Representation Learning Using Discriminative Pooling
- intro: CVPR 2018
 - arxiv: https://arxiv.org/abs/1803.10628
 
Rethinking the Faster R-CNN Architecture for Temporal Action Localization
- intro: CVPR 2018
 - arxiv: https://arxiv.org/abs/1804.07667
 
Deep Keyframe Detection in Human Action Videos
- intro: two-stream ConvNet
 - arxiv: https://arxiv.org/abs/1804.10021
 
FFNet: Video Fast-Forwarding via Reinforcement Learning
- intro: CVPR 2018
 - arxiv: https://arxiv.org/abs/1805.02792
 
Fast forwarding Egocentric Videos by Listening and Watching
https://arxiv.org/abs/1806.04620
Scanner: Efficient Video Analysis at Scale
- intro: CMU
 - arxiv: https://arxiv.org/abs/1805.07339
 
Massively Parallel Video Networks
- intro: DeepMind & University of Oxford
 - arxiv: https://arxiv.org/abs/1806.03863
 
Object Level Visual Reasoning in Videos
- intro: LIRIS & Facebook AI Research
 - arxiv: https://arxiv.org/abs/1806.06157
 
Video Time: Properties, Encoders and Evaluation
- intro: BMVC 2018
 - arxiv: https://arxiv.org/abs/1807.06980
 
Inserting Videos into Videos
- intro: CVPR 2019
 - arxiv: https://arxiv.org/abs/1903.06571
 
Video Classification
Large-scale Video Classification with Convolutional Neural Networks
- intro: CVPR 2014
 - project page: http://cs.stanford.edu/people/karpathy/deepvideo/
 - paper: www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Karpathy_Large-scale_Video_Classification_2014_CVPR_paper.pdf
 
Exploiting Image-trained CNN Architectures for Unconstrained Video Classification
- intro: Video-level event detection. extracting deep features for each frame, averaging frame-level deep features
 - arxiv: http://arxiv.org/abs/1503.04144
 
Beyond Short Snippets: Deep Networks for Video Classification
- intro: CNN + LSTM
 - arxiv: http://arxiv.org/abs/1503.08909
 - demo: http://pan.baidu.com/s/1eQ9zLZk
 
Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification
- intro: ACM Multimedia, 2015
 - arxiv: http://arxiv.org/abs/1504.01561
 
Video Content Recognition with Deep Learning
- author: Zuxuan Wu, Fudan University
 - slides: http://vision.ouc.edu.cn/valse/slides/20160420/Zuxuan%20Wu%20-%20Video%20Content%20Recognition%20with%20Deep%20Learning-Zuxuan%20Wu.pdf
 
Video Content Recognition with Deep Learning
- author: Yu-Gang Jiang, Lab for Big Video Data Analytics (BigVid), Fudan University
 - slides: http://www.yugangjiang.info/slides/DeepVideoTalk-2015.pdf
 
Efficient Large Scale Video Classification
- intro: Google
 - arxiv: http://arxiv.org/abs/1505.06250
 
Fusing Multi-Stream Deep Networks for Video Classification
Learning End-to-end Video Classification with Rank-Pooling
- paper: http://jmlr.org/proceedings/papers/v48/fernando16.html
 - paper: http://jmlr.csail.mit.edu/proceedings/papers/v48/fernando16.pdf
 - summary(by Hugo Larochelle): http://www.shortscience.org/paper?bibtexKey=conf/icml/FernandoG16#hlarochelle
 
Deep Learning for Video Classification and Captioning
Fast Video Classification via Adaptive Cascading of Deep Models
Deep Feature Flow for Video Recognition
- intro: CVPR 2017
 - intro: It provides a simple, fast, accurate, and end-to-end framework for video recognition (e.g., object detection and semantic segmentation in videos)
 - arxiv: https://arxiv.org/abs/1611.07715
 - github(official, MXNet): https://github.com/msracver/Deep-Feature-Flow
 - youtube: https://www.youtube.com/watch?v=J0rMHE6ehGw
 
Large-Scale YouTube-8M Video Understanding with Deep Neural Networks
https://arxiv.org/abs/1706.04488
Deep Learning Methods for Efficient Large Scale Video Labeling
- intro: Solution to the Kaggle’s competition Google Cloud & YouTube-8M Video Understanding Challenge
 - arxiv: https://arxiv.org/abs/1706.04572
 - github: https://github.com/mpekalski/Y8M
 
Learnable pooling with Context Gating for video classification
- intro: CVPR17 Youtube 8M workshop. Kaggle 1st place
 - arxiv: https://arxiv.org/abs/1706.06905
 - github: https://github.com/antoine77340/LOUPE
 
Aggregating Frame-level Features for Large-Scale Video Classification
- intro: Youtube-8M Challenge, 4th place
 - arxiv: https://arxiv.org/abs/1707.00803
 
Tensor-Train Recurrent Neural Networks for Video Classification
https://arxiv.org/abs/1707.01786
Hierarchical Deep Recurrent Architecture for Video Understanding
- intro: Classification Challenge Track paper in CVPR 2017 Workshop on YouTube-8M Large-Scale Video Understanding
 - arxiv: https://arxiv.org/abs/1707.03296
 
Large-scale Video Classification guided by Batch Normalized LSTM Translator
- intro: CVPR2017 Workshop on Youtube-8M Large-scale Video Understanding
 - arxiv: https://arxiv.org/abs/1707.04045
 
UTS submission to Google YouTube-8M Challenge 2017
- intro: CVPR’17 Workshop on YouTube-8M
 - arxiv: https://arxiv.org/abs/1707.04143
 - github: https://github.com/ffmpbgrnn/yt8m
 
A spatiotemporal model with visual attention for video classification
https://arxiv.org/abs/1707.02069
Cultivating DNN Diversity for Large Scale Video Labelling
- intro: CVPR 2017 Youtube-8M Workshop
 - arxiv: https://arxiv.org/abs/1707.04272
 
Attention Transfer from Web Images for Video Recognition
- intro: ACM Multimedia, 2017
 - arxiv: https://arxiv.org/abs/1708.00973
 
Non-local Neural Networks
- intro: CVPR 2018. CMU & Facebook AI Research
 - arxiv: https://arxiv.org/abs/1711.07971
 - github(Caffe2): https://github.com/facebookresearch/video-nonlocal-net
 
Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification
https://arxiv.org/abs/1711.08200
Appearance-and-Relation Networks for Video Classification
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
- intro: ECCV 2018. Google Research & University of California San Diego
 - arxiv: https://arxiv.org/abs/1712.04851
 
Long Activity Video Understanding using Functional Object-Oriented Network
https://arxiv.org/abs/1807.00983
Deep Architectures and Ensembles for Semantic Video Classification
https://arxiv.org/abs/1807.01026
Deep Discriminative Model for Video Classification
- intro: ECCV 2018
 - arxiv: https://arxiv.org/abs/1807.08259
 
Deep Video Color Propagation
- intro: BMVC 2018
 - arxuv: https://arxiv.org/abs/1808.03232
 
Non-local NetVLAD Encoding for Video Classification
- intro: ECCV 2018 workshop on YouTube-8M Large-Scale Video Understanding
 - intro: Tencent AI Lab & Fudan University
 - arxiv: https://arxiv.org/abs/1810.00207
 
Learnable Pooling Methods for Video Classification
- intro: Youtube 8M ECCV18 Workshop
 - arxiv: https://arxiv.org/abs/1810.00530
 - github: https://github.com/pomonam/LearnablePoolingMethods
 
NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification
- intro: ECCV 2018 workshop
 - arxiv: https://arxiv.org/abs/1811.05014
 - github: https://github.com/linrongc/youtube-8m
 
High Order Neural Networks for Video Classification
- intro: Fudan University, Carnegie Mellon University, Qiniu Inc., ByteDance AI Lab
 - arxiv: https://arxiv.org/abs/1811.07519
 
SlowFast Networks for Video Recognition
- intro: Facebook AI Research (FAIR)
 - arxiv: https://arxiv.org/abs/1812.03982
 
Efficient Video Classification Using Fewer Frames
- intro: CVPR 2019
 - arxiv: https://arxiv.org/abs/1902.10640
 
Video Classification with Channel-Separated Convolutional Networks
- intro: Facebook AI
 - arxiv: https://arxiv.org/abs/1904.02811
 
Two-Stream Video Classification with Cross-Modality Attention
https://arxiv.org/abs/1908.00497
Action Detection / Activity Recognition
3d convolutional neural networks for human action recognition
Sequential Deep Learning for Human Action Recognition
Two-stream convolutional networks for action recognition in videos
Finding action tubes
- intro: “built action models from shape and motion cues. They start from the image proposals and select the motion salient subset of them and extract saptio-temporal features to represent the video using the CNNs.”
 - arxiv: http://arxiv.org/abs/1411.6031
 
Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition
Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors
- intro: CVPR 2015. TDD
 - paper: www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Wang_Action_Recognition_With_2015_CVPR_paper.pdf
 - ext: http://www.cv-foundation.org/openaccess/content_cvpr_2015/app/2B_105_ext.pdf
 - poster: https://wanglimin.github.io/papers/WangQT_CVPR15_Poster.pdf
 - github: https://github.com/wanglimin/TDD
 
Action Recognition by Hierarchical Mid-level Action Elements
Contextual Action Recognition with R CNN
Towards Good Practices for Very Deep Two-Stream ConvNets
- arxiv: http://arxiv.org/abs/1507.02159
 - github: https://github.com/yjxiong/caffe
 
Action Recognition using Visual Attention
- intro: LSTM / RNN
 - arxiv: http://arxiv.org/abs/1511.04119
 - project page: http://shikharsharma.com/projects/action-recognition-attention/
 - github(Python/Theano): https://github.com/kracwarlock/action-recognition-visual-attention
 
End-to-end Learning of Action Detection from Frame Glimpses in Videos
- intro: CVPR 2016
 - project page: http://ai.stanford.edu/~syyeung/frameglimpses.html
 - arxiv: http://arxiv.org/abs/1511.06984
 - paper: http://vision.stanford.edu/pdf/yeung2016cvpr.pdf
 
Multi-velocity neural networks for gesture recognition in videos
Active Learning for Online Recognition of Human Activities from Streaming Videos
Convolutional Two-Stream Network Fusion for Video Action Recognition
Deep, Convolutional, and Recurrent Models for Human Activity Recognition using Wearables
Unsupervised Semantic Action Discovery from Video Collections
Anticipating Visual Representations from Unlabeled Video
VideoLSTM Convolves, Attends and Flows for Action Recognition
Hierarchical Attention Network for Action Recognition in Videos (HAN)
Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition
Connectionist Temporal Modeling for Weakly Supervised Action Labeling
CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016
- intro: won the 1st place in the untrimmed video classification task of ActivityNet Challenge 2016. TSN
 - arxiv: http://arxiv.org/abs/1608.00797
 - github: https://github.com/yjxiong/anet2016-cuhk
 
Actionness Estimation Using Hybrid FCNs

- intro: CVPR 2016. H-FCN
 - project page: http://wanglimin.github.io/actionness_hfcn/index.html
 - paper: http://wanglimin.github.io/papers/WangQTV_CVPR16.pdf
 - github: https://github.com/wanglimin/actionness-estimation/
 
Real-time Action Recognition with Enhanced Motion Vector CNNs

- intro: CVPR 2016
 - project page: http://zbwglory.github.io/MV-CNN/index.html
 - paper: http://wanglimin.github.io/papers/ZhangWWQW_CVPR16.pdf
 - github: https://github.com/zbwglory/MV-release
 
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
- intro: ECCV 2016. HMDB51: 69.4%, UCF101: 94.2%
 - arxiv: http://arxiv.org/abs/1608.00859
 - paper: http://wanglimin.github.io/papers/WangXWQLTV_ECCV16.pdf
 - github: https://github.com/yjxiong/temporal-segment-networks
 
Temporal Segment Networks for Action Recognition in Videos
- intro: An extension of submission http://arxiv.org/abs/1608.00859
 - arxiv: https://arxiv.org/abs/1705.02953
 
Hierarchical Attention Network for Action Recognition in Videos
DeepCAMP: Deep Convolutional Action & Attribute Mid-Level Patterns
- intro: CVPR 2016
 - arxiv: http://arxiv.org/abs/1608.03217
 
Depth2Action: Exploring Embedded Depth for Large-Scale Action Recognition
Dynamic Image Networks for Action Recognition
- intro: CVPR 2016
 - arxiv: http://users.cecs.anu.edu.au/~sgould/papers/cvpr16-dynamic_images.pdf
 - github: https://github.com/hbilen/dynamic-image-nets
 
Human Action Recognition without Human
Temporal Convolutional Networks: A Unified Approach to Action Segmentation
- arxiv: http://arxiv.org/abs/1608.08242
 - ECCV 2016 workshop: http://bravenewmotion.github.io/
 
Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks
- intro: Bachelor Thesis Report at ETSETB TelecomBCN
 - project page: https://imatge-upc.github.io/activitynet-2016-cvprw/
 - arxiv: http://arxiv.org/abs/1608.08128
 - github: https://github.com/imatge-upc/activitynet-2016-cvprw
 
Sequential Deep Trajectory Descriptor for Action Recognition with Three-stream CNN
Semi-Coupled Two-Stream Fusion ConvNets for Action Recognition at Extremely Low Resolutions
Spatiotemporal Residual Networks for Video Action Recognition
- intro: NIPS 2016
 - arxiv: https://arxiv.org/abs/1611.02155
 
Action Recognition Based on Joint Trajectory Maps Using Convolutional Neural Networks
Deep Recurrent Neural Network for Mobile Human Activity Recognition with High Throughput
Joint Network based Attention for Action Recognition
Temporal Convolutional Networks for Action Segmentation and Detection
AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos
ActionFlowNet: Learning Motion Representation for Action Recognition
Higher-order Pooling of CNN Features via Kernel Linearization for Action Recognition
- intro: Australian Center for Robotic Vision & Data61/CSIRO
 - arxiv: https://arxiv.org/abs/1701.05432
 
Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos
https://arxiv.org/abs/1703.10664
Temporal Action Detection with Structured Segment Networks
- project page: http://yjxiong.me/others/ssn/
 - arxiv: https://arxiv.org/abs/1704.06228
 - github: https://github.com/yjxiong/action-detection
 
Recurrent Residual Learning for Action Recognition
https://arxiv.org/abs/1706.08807
Hierarchical Multi-scale Attention Networks for Action Recognition
https://arxiv.org/abs/1708.07590
Two-stream Flow-guided Convolutional Attention Networks for Action Recognition
- intro: International Conference of Computer Vision Workshop (ICCVW), 2017
 - arxiv: https://arxiv.org/abs/1708.09268
 
Action Classification and Highlighting in Videos
https://arxiv.org/abs/1708.09522
Real-Time Action Detection in Video Surveillance using Sub-Action Descriptor with Multi-CNN
https://arxiv.org/abs/1710.03383
End-to-end Video-level Representation Learning for Action Recognition
- keywords: Deep networks with Temporal Pyramid Pooling (DTPP)
 - arxiv: https://arxiv.org/abs/1711.04161
 
Fully-Coupled Two-Stream Spatiotemporal Networks for Extremely Low Resolution Action Recognition
- intro: WACV 2018
 - arxiv: https://arxiv.org/abs/1801.03983
 
DiscrimNet: Semi-Supervised Action Recognition from Videos using Generative Adversarial Networks
https://arxiv.org/abs/1801.07230
A Fusion of Appearance based CNNs and Temporal evolution of Skeleton with LSTM for Daily Living Action Recognition
https://arxiv.org/abs/1802.00421
Real-Time End-to-End Action Detection with Two-Stream Networks
https://arxiv.org/abs/1802.08362
A Closer Look at Spatiotemporal Convolutions for Action Recognition
- intro: CVPR 2018. Facebook Research
 - intro: R(2+1)D and Mixed-Convolutions for Action Recognition.
 - project page: https://dutran.github.io/R2Plus1D/
 - arxiv: https://arxiv.org/abs/1711.11248
 - github: https://github.com/facebookresearch/R2Plus1D
 
VideoCapsuleNet: A Simplified Network for Action Detection
https://arxiv.org/abs/1805.08162
Where and When to Look? Spatio-temporal Attention for Action Recognition in Videos
https://arxiv.org/abs/1810.04511
Relational Long Short-Term Memory for Video Action Recognition
https://arxiv.org/abs/1811.07059
Temporal Recurrent Networks for Online Action Detection
https://arxiv.org/abs/1811.073910
Video Action Transformer Network
- intro: Carnegie Mellon University & DeepMind & University of Oxford
 - intro: Ranked first on the AVA (computer vision only) leaderboard of the ActivityNet Challenge 2018
 - project page: https://rohitgirdhar.github.io/ActionTransformer/
 - arxiv: https://arxiv.org/abs/1812.02707
 
D3D: Distilled 3D Networks for Video Action Recognition
- intro: Google & University of Michigan & Princeton University
 - arxiv: https://arxiv.org/abs/1812.08249
 
TACNet: Transition-Aware Context Network for Spatio-Temporal Action Detection
- intro: CVPR 2019
 - arxiv: https://arxiv.org/abs/1905.13417
 
Deformable Tube Network for Action Detection in Videos
https://arxiv.org/abs/1907.01847
You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization
https://arxiv.org/abs/1911.06644
TubeR: Tube-Transformer for Action Detection
- intro: University of Amsterdam & Amazon Web Service
 - arxiv: https://arxiv.org/abs/2104.00969
 
Revisiting Skeleton-based Action Recognition
https://arxiv.org/abs/2104.13586
Projects
A Torch Library for Action Recognition and Detection Using CNNs and LSTMs
- intro: CS231n student project report
 - paper: http://cs231n.stanford.edu/reports2016/221_Report.pdf
 - github: https://github.com/garythung/torch-lrcn
 
2016 ActivityNet action recognition challenge. CNN + LSTM approach. Multi-threaded loading.
LSTM for Human Activity Recognition
- github: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition/
 - github(MXNet): https://github.com/Ldpe2G/DeepLearningForFun/tree/master/Mxnet-Scala/HumanActivityRecognition
 
Scanner: Efficient Video Analysis at Scale
- intro: Locate and recognize faces in a video, Detect shots in a film, Search videos by image
 - github: https://github.com/scanner-research/scanner
 
Charades Starter Code for Activity Classification and Localization
- intro: Activity Recognition Algorithms for the Charades Dataset
 - github: https://github.com/gsig/charades-algorithms
 
NonLocalNetwork and Sequeeze-Excitation Network
- intro: MXNet implementation of Non-Local and Squeeze-Excitation network
 - github: https://github.com/WillSuen/NonLocalandSEnet
 
Event Recognition
TagBook: A Semantic Video Representation without Supervision for Event Detection
AENet: Learning Deep Audio Features for Video Analysis
- arxiv: https://arxiv.org/abs/1701.00599
 - github: https://github.com/znaoya/aenet
 
Event Detection
DevNet: A Deep Event Network for Multimedia Event Detection and Evidence Recounting
- paper: http://120.52.72.47/winsty.net/c3pr90ntcsf0/papers/devnet.pdf
 - paper: http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Gan_DevNet_A_Deep_2015_CVPR_paper.pdf
 
Detecting events and key actors in multi-person videos

- intro: CVPR 2016
 - arxiv: http://arxiv.org/abs/1511.02917
 - paper: www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Ramanathan_Detecting_Events_and_CVPR_2016_paper.pdf
 - paper: http://vision.stanford.edu/pdf/johnson2016cvpr.pdf
 - blog: http://www.leiphone.com/news/201606/l1TKIRFLO3DUFNNu.html
 
Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection
- intro: INTERSPEECH 2016
 - arxiv: https://arxiv.org/abs/1604.07160
 
Efficient Action Detection in Untrimmed Videos via Multi-Task Learning
Joint Event Detection and Description in Continuous Video Streams
- intro: Joint Event Detection and Description Network (JEDDi-Net)
 - arxiv: https://arxiv.org/abs/1802.10250
 
Abnormality / Anomaly Detection
Fully Convolutional Neural Network for Fast Anomaly Detection in Crowded Scenes
Anomaly Detection in Video Using Predictive Convolutional Long Short-Term Memory Networks
- intro: Rochester Institute of Technology
 - arxiv: https://arxiv.org/abs/1612.00390
 
Abnormal Event Detection in Videos using Spatiotemporal Autoencoder
- arxiv: https://arxiv.org/abs/1701.01546
 - github: https://github.com/yshean/abnormal-spatiotemporal-ae
 
Abnormal Event Detection in Videos using Generative Adversarial Nets
- intro: Best Paper / Student Paper Award Finalist, IEEE International Conference on Image Processing (ICIP), 2017
 - arxiv: https://arxiv.org/abs/1708.09644
 
Joint Detection and Recounting of Abnormal Events by Learning Deep Generic Knowledge
- intro: ICCV 2017
 - arxiv: https://arxiv.org/abs/1709.09121
 
An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos
- intro: Uncanny Vision Solutions
 - arxiv: https://arxiv.org/abs/1801.03149
 
STAN: Spatio-Temporal Adversarial Networks for Abnormal Event Detection
- intro: ICASSP 2018
 - arxiv: https://arxiv.org/abs/1804.08381
 
Video Anomaly Detection and Localization via Gaussian Mixture Fully Convolutional Variational Autoencoder
https://arxiv.org/abs/1805.11223
Attentioned Convolutional LSTM InpaintingNetwork for Anomaly Detection in Videos
https://arxiv.org/abs/1811.10228
Video Prediction
Deep multi-scale video prediction beyond mean square error
- intro: ICLR 2016
 - arxiv: http://arxiv.org/abs/1511.05440
 - github: https://github.com/coupriec/VideoPredictionICLR2016
 - github(TensorFlow): https://github.com/dyelax/Adversarial_Video_Generation
 - demo: http://cs.nyu.edu/~mathieu/iclr2016.html
 
Unsupervised Learning for Physical Interaction through Video Prediction
- intro: NIPS 2016
 - arxiv: https://arxiv.org/abs/1605.07157
 - github: https://github.com/tensorflow/models/tree/master/video_prediction
 
Generating Videos with Scene Dynamics
- intro: NIPS 2016
 - intro: The model learns to generate tiny videos using adversarial networks
 - project page: http://web.mit.edu/vondrick/tinyvideo/
 - paper: http://web.mit.edu/vondrick/tinyvideo/paper.pdf
 - github: https://github.com/cvondrick/videogan
 
PredNet
Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning
- project page: https://coxlab.github.io/prednet/
 - arxiv: http://arxiv.org/abs/1605.08104
 - github: https://github.com/coxlab/prednet
 - github: https://github.com/e-lab/torch-prednet
 
Diversity encouraged learning of unsupervised LSTM ensemble for neural activity video prediction
Video Ladder Networks
- inro: NIPS 2016 workshop on ML for Spatiotemporal Forecasting
 - arxiv: https://arxiv.org/abs/1612.01756
 
Unsupervised Learning of Long-Term Motion Dynamics for Videos
- intro: Stanford University
 - arxiv: https://arxiv.org/abs/1701.01821
 
One-Step Time-Dependent Future Video Frame Prediction with a Convolutional Encoder-Decoder Neural Network
- intro: NCCV 2016
 - arxiv: https://arxiv.org/abs/1702.04125
 
Fully Context-Aware Video Prediction
- intro: ETH Zurich & NNAISENSE
 - keywords: unsupervised learning through video prediction, Parallel Multi-Dimensional LSTM
 - project page: https://sites.google.com/view/contextvp
 - arxiv: https://arxiv.org/abs/1710.08518
 
Novel Video Prediction for Large-scale Scene using Optical Flow
- intro: University of Victoria & Tongji University & Horizon Robotics
 - arxiv: https://arxiv.org/abs/1805.12243
 
Video Tagging
Automatic Image and Video Tagging

Tagging YouTube music videos with deep learning - Alexandre Passant
- keywords: Clarifai’s deep learning API
 - blog: http://apassant.net/2015/07/03/tagging-youtube-music-clarifai-deep-learning/
 
Shot Boundary Detection
Large-scale, Fast and Accurate Shot Boundary Detection through Spatio-temporal Convolutional Neural Networks
https://arxiv.org/abs/1705.03281
Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks
- intro: obtains state-of-the-art results while running at an unprecedented speed of more than 120x real-time.
 - arxiv: https://arxiv.org/abs/1705.08214
 
Video Action Segmentation
TricorNet: A Hybrid Temporal Convolutional and Recurrent Network for Video Action Segmentation
- intro: University of Rochester
 - arxiv: https://arxiv.org/abs/1705.07818
 
Video2GIF
Video2GIF: Automatic Generation of Animated GIFs from Video (Robust Deep RankNet)
- intro: 3D CNN, ranking model, Huber loss, 100K GIFs/video sources dataset
 - arxiv: http://arxiv.org/abs/1605.04850
 - github(dataset): https://github.com/gyglim/video2gif_dataset
 - results: http://video2gif.info/
 - demo site: http://people.ee.ethz.ch/~gyglim/work_public/autogif/
 - review: http://motherboard.vice.com/read/these-fire-gifs-were-made-by-artificial-intelligence-yahoo
 
Creating Animated GIFs Automatically from Video
https://yahooresearch.tumblr.com/post/148009705216/creating-animated-gifs-automatically-from-video
GIF2Video: Color Dequantization and Temporal Interpolation of GIF images
- intro: Stony Brook University & Megvii Research USA & UCLA
 - arxiv: https://arxiv.org/abs/1901.02840
 
Video2Speech
Vid2speech: Speech Reconstruction from Silent Video
- intro: ICASSP 2017
 - project page: http://www.vision.huji.ac.il/vid2speech/
 - arxiv: https://arxiv.org/abs/1701.00495
 - github(official): https://github.com/arielephrat/vid2speech
 
Video Captioning
http://handong1587.github.io/deep_learning/2015/10/09/image-video-captioning.html#video-captioning
Video Summarization
Video summarization produces a short summary of a full-length video and ideally encapsulates its most informative parts, alleviates the problem of video browsing, editing and indexing.
Video Summarization with Long Short-term Memory
DeepVideo: Video Summarization using Temporal Sequence Modelling
- intro: CS231n student project report
 - paper: http://cs231n.stanford.edu/reports2016/216_Report.pdf
 
Semantic Video Trailers
Video Summarization using Deep Semantic Features
- inro: ACCV 2016
 - arxiv: http://arxiv.org/abs/1609.08758
 
CNN-Based Prediction of Frame-Level Shot Importance for Video Summarization
- intro: International Conference on new Trends in Computer Sciences (ICTCS), Amman-Jordan, 2017
 - arxiv: https://arxiv.org/abs/1708.07023
 
Video Summarization with Attention-Based Encoder-Decoder Networks
https://arxiv.org/abs/1708.09545
Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward
- intro: AAAI 2018. Chinese Academy of Sciences & Queen Mary University of London
 - project page: https://kaiyangzhou.github.io/project_vsumm_reinforce/index.html
 - arxiv: https://arxiv.org/abs/1801.00054
 - github: https://github.com//KaiyangZhou/vsumm-reinforce
 
Viewpoint-aware Video Summarization
- intro: CVPR 2018
 - arxiv: https://arxiv.org/abs/1804.02843
 
DTR-GAN: Dilated Temporal Relational Adversarial Network for Video Summarization
https://arxiv.org/abs/1804.11228
Learning Video Summarization Using Unpaired Data
https://arxiv.org/abs/1805.12174
Video Summarization Using Fully Convolutional Sequence Networks
https://arxiv.org/abs/1805.10538
Video Summarisation by Classification with Deep Reinforcement Learning
- intro: BMVC 2018
 - arxiv: https://arxiv.org/abs/1807.03089
 
Query-Conditioned Three-Player Adversarial Network for Video Summarization
- intro: BMVC 2018
 - arxiv: https://arxiv.org/abs/1807.06677
 
Discriminative Feature Learning for Unsupervised Video Summarization
- intro: AAAI 2019
 - arxiv: https://arxiv.org/abs/1811.09791
 
Rethinking the Evaluation of Video Summaries
- intro: CVPR 2019 poster
 - arxiv: https://arxiv.org/abs/1903.11328
 
Video Highlight Detection
Unsupervised Extraction of Video Highlights Via Robust Recurrent Auto-encoders
- intro: ICCV 2015
 - intro: rely on an assumption that highlights of an event category are more frequently captured in short videos than non-highlights
 - arxiv: http://arxiv.org/abs/1510.01442
 
Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization
- keywords: wearable device
 - paper: http://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Yao_Highlight_Detection_With_CVPR_2016_paper.pdf
 - paper: http://research.microsoft.com/apps/pubs/default.aspx?id=264919
 
Using Deep Learning to Find Basketball Highlights

- blog: http://public.hudl.com/bits/archives/2015/06/05/highlights/?utm_source=tuicool&utm_medium=referral
 
Real-Time Video Highlights for Yahoo Esports
A Deep Ranking Model for Spatio-Temporal Highlight Detection from a 360 Video
- intro: AAAI 2018
 - arxiv: https://arxiv.org/abs/1801.10312
 
PHD-GIFs: Personalized Highlight Detection for Automatic GIF Creation
- intro: Nanyang Technological University & Google Research, Zurich
 - keywords: personalized highlight detection (PHD)
 - arxiv: https://arxiv.org/abs/1804.06604
 
Video Understanding
Scale Up Video Understandingwith Deep Learning
- intro: 2016, Tsinghua University
 - slides: iiis.tsinghua.edu.cn/~jianli/courses/ATCS2016spring/talk_chuang.pptx
 
Slicing Convolutional Neural Network for Crowd Video Understanding

- intro: CVPR 2016
 - intro: It aims at learning generic spatio-temporal features from crowd videos, especially for long-term temporal learning
 - project page: http://www.ee.cuhk.edu.hk/~jshao/SCNN.html
 - paper: http://www.ee.cuhk.edu.hk/~jshao/papers_jshao/jshao_cvpr16_scnn.pdf
 - github: https://github.com/amandajshao/Slicing-CNN
 
Rethinking Spatiotemporal Feature Learning For Video Understanding
https://arxiv.org/abs/1712.04851
Hierarchical Video Understanding
https://arxiv.org/abs/1809.03316
Challenges
THUMOS Challenge 2014
- homepage: http://crcv.ucf.edu/THUMOS14/home.html
 - download: http://crcv.ucf.edu/THUMOS14/download.html
 
THUMOS Challenge 2015
- homepage: http://www.thumos.info/
 - download: http://www.thumos.info/download.html
 
ActivityNet Challenge 2016

- homepage: http://activity-net.org/challenges/2016/