Call for Papers (Download)
Summary Paper (Download)

Comprehensive video understanding has recently received increasing attention from the computer vision and multimedia communities, with the goal of building machines that can understand videos as humans do. Currently, most work on untrimmed video recognition focuses on isolated, independent problems such as action recognition or scene recognition. While these problems address different aspects of video understanding, there are strong mutual relationships and correlations between action and scene. To achieve accurate, human-level understanding of untrimmed videos, it is important to understand multiple aspects together, such as what the actors are doing and where they are doing it.

This workshop aims to provide a forum for exchanging ideas in comprehensive video understanding, with a particular emphasis on action and scene recognition in untrimmed videos. Papers presented at this workshop have to address one of the two independent video understanding problems (action recognition or scene recognition) or their joint problem.

This workshop consists of two tracks: Regular Track and Challenge Track.

Regular Track

The first track invites papers that address video action and scene recognition problems and related topics. We are soliciting original contributions that address a wide range of theoretical and practical issues, including but not limited to:

  • Action recognition in untrimmed video
  • Scene recognition in untrimmed video
  • Weakly-supervised model for action/scene recognition
  • Weakly-supervised model for action/scene localization

Challenge Track

The second track is the challenge section, which focuses on the evaluation of multi-task action and scene recognition on a new untrimmed video dataset called the Multi-task Action and Scene Recognition Dataset. Details of the challenge can be found here.


The workshop date is 22 October 2018.

Time Talk
09:00 Welcome and Opening Comments
09:05 Invited Keynote Speech ("Deep Video Understanding: Representation Learning, Action Recognition, and Language Generation", Dr. Tao Mei)
10:00 Invited Talk 1 ("Actor and Observer: Joint Modeling of First and Third-Person Videos," Dr. Karteek Alahari)
10:30 Coffee Break
10:45 Invited Talk 2 ("Explore Multi-Step Reasoning in Video Question Answering," Prof. Yahong Han)
11:15 Spotlight Presentation (Regular and Challenge Tracks)
11:40 Announcement of Challenge Winners and Awards Ceremony
11:50 Poster Presentation and Discussion (Regular and Challenge Tracks)
12:30 Closing Remarks

Paper Submission

You are cordially invited to submit papers to the workshop ‘Comprehensive Video Understanding in the Wild – Multi-task Action and Scene Recognition in Untrimmed Video’, held in conjunction with ACM Multimedia 2018 (http://www.acmmm.org/2018). This workshop invites full research papers of 4 to 8 pages, plus additional pages for references. Reference pages do not count toward the 4-to-8-page limit.

All papers must be formatted according to the acm-sigconf template which can be obtained from ACM proceedings style.

Important Dates

June 10, 2018: Abstract submission.
July 28, 2018: Abstract submission. (Extended)
July 8, 2018: Workshop paper submission.
July 28, 2018: Workshop paper submission. (Extended)
Aug 5, 2018: Notification of acceptance.
Aug 19, 2018: Camera-ready paper submission.
Oct 22, 2018: Workshop.

Click the following link to go to the submission site: https://cmt3.research.microsoft.com/ACMMMWORKSHOPS2018/

Note that we also accept the challenge submission file (.csv) as a supplementary file.

Contact Email : coview2018@gmail.com


Multi-task action and scene recognition in untrimmed videos

This challenge aims to explore new approaches and bold ideas for multi-task action and scene recognition in untrimmed videos and to evaluate the ability of the algorithms. The task targets the joint, comprehensive understanding of untrimmed videos, with a particular emphasis on multi-task action and scene recognition. While most recent work on untrimmed video recognition focuses on action or scene alone, there are strong mutual relationships and correlations between the two. For example, each provides valuable prior knowledge for understanding the other.


For the challenge, we built the Multi-task Action and Scene Recognition Dataset, which consists of untrimmed videos sampled from the YouTube-8M dataset, with action and scene class labels annotated for each video. It contains about 90,000 YouTube video URLs (we will provide a feature for each video), split into 84,853 training, 3,000 validation, and 3,000 testing videos. There are 285 action class labels and 29 scene class labels. A video can have an action class label only, a scene class label only, or both.
The video level dataset can be downloaded directly from here.
The frame level dataset can be downloaded directly from here.
Password is "coview".

Evaluation Metric

As the evaluation protocol of the challenge, we will use the top-5 and top-20 Hamming scores of the action and scene results. Let N be the number of videos in the test set, K the number of predictions, L the number of labels, and H(K) the Top-K Hamming score. The Top-K Hamming score is then defined as

$$ H(K) = \frac{1}{N}\sum_{n=1}^{N}\sum_{label=1}^{L}\sum_{k=1}^{K}\frac{\mathrm{AND}\left(k\text{-th prediction}_{label},\ GT_{label}\right)}{L} $$

where AND(A,B)=1 only if A and B have exactly the same label index for action or scene. We will use the Top-K Hamming score as the challenge criterion, and the Top-5 Hamming score will be provided to you as additional information on your prediction results. The number K will soon be set to a reasonable value.
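The score defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the official evaluation kit: it reads L as the number of ground-truth labels attached to a video for one task (action or scene), assumes the ranked predictions of a video are distinct, and lets a video with no ground-truth label for the task contribute 0. The function name `top_k_hamming` and the toy label indices are hypothetical.

```python
from typing import Sequence, Set


def top_k_hamming(predictions: Sequence[Sequence[int]],
                  ground_truth: Sequence[Set[int]],
                  k: int) -> float:
    """Average, over videos, of the fraction of ground-truth labels
    that appear among the top-k ranked predictions."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not predictions:
        return 0.0
    total = 0.0
    for ranked, truth in zip(predictions, ground_truth):
        if not truth:
            # Assumption: a video without labels for this task adds 0.
            continue
        top_k = set(ranked[:k])
        # AND(...) is 1 for each ground-truth label found in the top-k list.
        total += len(top_k & truth) / len(truth)
    return total / len(predictions)


# Toy data: two videos, ranked label indices (best first).
preds = [[3, 7, 1], [2, 5, 9]]
truth = [{7}, {9, 4}]
print(top_k_hamming(preds, truth, 2))  # 0.5
print(top_k_hamming(preds, truth, 3))  # 0.75
```

The same function would be called once for the scene predictions and once for the action predictions; how the two scores are combined is not specified here.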

Submission Format

Please use the following CSV format when submitting your results for the challenge:

The submitted file should contain the header [video_id, scene_01, scene_02, … , scene_19, scene_20, action_01, action_02, … , action_19, action_20], with the prediction results in the rows below it. Each row has 41 fields. The prediction format is [video id, top-1st scene label, top-2nd scene label, … , top-19th scene label, top-20th scene label, top-1st action label, top-2nd action label, … , top-19th action label, top-20th action label]. You can download an evaluation kit here.
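As a rough illustration of this layout, the following Python sketch writes a file with the 41-column header and one row per video. It is not part of the evaluation kit: the helper names `build_header` and `write_submission`, the video id `vid_0001`, and the use of integer label indices are assumptions made for the example.

```python
import csv
import io
from typing import Dict, List, Sequence, TextIO, Tuple

TOP_N = 20  # 20 scene and 20 action predictions per video, as in the header


def build_header() -> List[str]:
    """Return the 41 column names: video_id, scene_01..scene_20, action_01..action_20."""
    scenes = [f"scene_{i:02d}" for i in range(1, TOP_N + 1)]
    actions = [f"action_{i:02d}" for i in range(1, TOP_N + 1)]
    return ["video_id"] + scenes + actions


def write_submission(out: TextIO,
                     results: Dict[str, Tuple[Sequence[int], Sequence[int]]]) -> None:
    """Write one row per video: id, ranked scene labels, ranked action labels."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(build_header())
    for video_id, (scene_labels, action_labels) in results.items():
        if len(scene_labels) != TOP_N or len(action_labels) != TOP_N:
            raise ValueError(f"{video_id}: expected {TOP_N} scene and {TOP_N} action labels")
        writer.writerow([video_id, *scene_labels, *action_labels])


# Hypothetical example: one video with placeholder label indices.
example = {"vid_0001": (list(range(1, 21)), list(range(101, 121)))}
buffer = io.StringIO()
write_submission(buffer, example)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(len(rows), len(rows[0]), len(rows[1]))  # 2 41 41
```

To produce an actual file, the same function can be given a handle opened with `open("submission.csv", "w", newline="")`. Checking the row length before writing catches incomplete predictions early.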

Note that we accept the challenge submission file (.csv) as a supplementary file on the CMT submission site (NOT by email!).

Workshop Organizers

Kwanghoon Sohn

Yonsei University

Ming-Hsuan Yang

University of California at Merced

Invited Speakers

Tao Mei


Program Chairs

Hyeran Byun

Yonsei University

Jongwoo Lim

Hanyang University

Jison Hsu


Stephen Lin

Microsoft Research

Publication Chairs

Euntai Kim

Yonsei University

Seungryong Kim

Yonsei University

Technical Program Committee

Karteek Alahari

INRIA Grenoble

Minsu Cho


Sunyoung Cho


Bumsub Ham

Yonsei University

Jia-bin Huang

Virginia Tech

Ig-Jae Kim


Jiangbo Lu

Shenzhen Cloudream Tech

Tao Mei


Dongbo Min

Ewha Womans University

John See

Multimedia University

Tony Tung


Stefan Winkler


Kuk-jin Yoon


Antoine Miech


Gül Varol



Contact the workshop organizers on:


The workshop is supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (NRF-2017M3C4A7069370).