This webpage distributes UAV video datasets created or assembled by the Aristotle University of Thessaloniki (AUTH) within the MULTIDRONE project.

To access the full datasets, please complete and sign this license agreement, then email it to Prof. Ioannis Pitas to receive FTP credentials for downloading.




If you are granted access, the following datasets are available (NOTE: for datasets assembled from YouTube videos, only links to the videos and the relevant annotation files, if any, are provided).

-DCROWD_VID
A dataset for visual human crowd detection was assembled from YouTube videos, licensed mainly under the Standard YouTube License. It is a collection of 53 videos selected by querying the YouTube search engine with specific keywords describing crowded events (e.g. parade, festival, marathon, protest). Videos without crowds have also been gathered by searching for generic drone videos. No annotation is currently available.

-SHOT_TYPES
A dataset containing 46 professional and semi-professional UAV videos was assembled from YouTube material. Care was taken to include as many UAV framing shot types and UAV/camera motion types as possible, based on the UAV shot type taxonomy defined in the context of the MULTIDRONE project. No annotation is currently available.

-Annotations_boats_Raw
A dataset for boat detection/tracking was assembled, consisting of 13 YouTube videos (resolution: 1280 x 720) at 25 frames per second. Annotations are not exhaustive, i.e., there may be unannotated objects in the given video frames. An annotation file is included along with each video file. The annotations are stored in text files with the following format:

  • frameN
  • #objects
  • x y w h
where x, y indicate the upper left corner of the bounding box and w, h describe its width and height in frame N.
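
As an illustration of how such a file could be read, here is a minimal Python sketch. It assumes that the file holds whitespace-separated numbers in the order listed above (a numeric frame number, the object count, then four values per object); the exact line layout is not specified on this page, and the function name `parse_frame_blocks` is hypothetical, not part of the dataset.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h) in pixels


def parse_frame_blocks(text: str) -> Dict[int, List[Box]]:
    """Parse 'frameN / #objects / x y w h' records into {frame: [boxes]}.

    Assumption: all values are whitespace-separated, so the parser works
    whether each field sits on its own line or a record shares one line.
    """
    tokens = text.split()
    boxes_by_frame: Dict[int, List[Box]] = {}
    i = 0
    while i < len(tokens):
        if i + 1 >= len(tokens):
            raise ValueError("truncated record: missing object count")
        frame, count = int(tokens[i]), int(tokens[i + 1])
        i += 2
        boxes: List[Box] = []
        for _ in range(count):
            values = tokens[i:i + 4]
            if len(values) != 4:
                raise ValueError(f"truncated box in frame {frame}")
            x, y, w, h = (float(v) for v in values)
            boxes.append((x, y, w, h))
            i += 4
        boxes_by_frame[frame] = boxes
    return boxes_by_frame


# Made-up sample: frame 1 has two boxes, frame 2 has none.
sample = "1\n2\n10 20 30 40\n50 60 70 80\n2\n0\n"
annotations = parse_frame_blocks(sample)
```

Since the annotations are not exhaustive, an empty list for a frame should not be read as proof that the frame contains no boats.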


-Annotations_Bicycles_Raw
A dataset for bicycle detection/tracking was assembled, consisting of 7 YouTube videos (resolution: 1920 x 1080) at 25 frames per second. Annotations are not exhaustive, i.e., there may be unannotated objects in the given video frames. An annotation file is included along with each video file. The annotations are stored in the corresponding text files with the following format:

    Channel   frameN   ObjectID   x1   y1   x2   y2   0   ObjectType/View
where x1, y1, x2, y2 refer to the upper left and bottom right corners of the bounding box, ObjectID is a numerical object identifier (not consistent or reliable), frameN is the video frame number, while ObjectType/View (where applicable) labels the object class and categorical pose relative to the camera (“1F” means Front View, “1B” means Back View, “1L” means Left View, “1R” means Right View, “2” means Bicycle Crowd, “5H” means High-Density Human Crowd, “5L” means Low-Density Human Crowd, “0” denotes irrelevant TV graphics).
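
The following Python sketch shows one way a line in this format might be decoded. It assumes whitespace-separated fields and treats the ObjectType/View field as optional, since it is provided only where applicable; the names `BicycleAnnotation`, `LABEL_MEANINGS` and `parse_bicycle_line`, as well as the sample line, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Label codes exactly as listed in the dataset description.
LABEL_MEANINGS = {
    "1F": "Front View",
    "1B": "Back View",
    "1L": "Left View",
    "1R": "Right View",
    "2": "Bicycle Crowd",
    "5H": "High-Density Human Crowd",
    "5L": "Low-Density Human Crowd",
    "0": "irrelevant TV graphics",
}


@dataclass
class BicycleAnnotation:
    channel: str
    frame: int
    object_id: int  # not consistent across frames, so not a track identifier
    x1: float
    y1: float
    x2: float
    y2: float
    label: Optional[str]  # None when no ObjectType/View field is present

    @property
    def size(self) -> Tuple[float, float]:
        """Box width and height derived from the two corners."""
        return (self.x2 - self.x1, self.y2 - self.y1)


def parse_bicycle_line(line: str) -> BicycleAnnotation:
    fields = line.split()  # assumption: whitespace-separated fields
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(fields)}")
    x1, y1, x2, y2 = (float(v) for v in fields[3:7])
    # fields[7] is the constant 0 column and is ignored here.
    label = fields[8] if len(fields) > 8 else None
    return BicycleAnnotation(fields[0], int(fields[1]), int(fields[2]),
                             x1, y1, x2, y2, label)


# Made-up sample line, for illustration only.
ann = parse_bicycle_line("1 15 3 100 200 180 320 0 1F")
meaning = LABEL_MEANINGS.get(ann.label, "unknown")
```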

-Benchmark_RAI
A dataset for bicycle detection/tracking was prepared by processing/editing and annotating material made available by RAI under the “Giro 2017” MULTIDRONE dataset. It consists of two videos (resolutions: 768 x 432 and 960 x 540) at 25 frames per second. The videos are from Giro d’Italia TV coverage provided by RAI. Annotations are exhaustive, i.e., all objects of a certain class present in a given video frame are covered by an annotation. An annotation file is included along with each video file. The annotations are stored in text files with the following format:

  • frameN
  • #objects
  • x y w h
where x, y indicate the upper left corner of the bounding box and w, h describe its width and height in frame N.

-person_detection_UAV
A visual person detection dataset has been prepared, consisting of two UHD videos (2160p - 3840 x 2160) at 25 frames per second. The dataset was shot in AUTH Campus employing AUTH research personnel as actors. The camera was mounted on a DJI Phantom IV UAV and pointed towards the ground. The drone was either hovering or flying at low speed, while the actors were walking in random directions. The total video duration is 4 minutes and 20 seconds. An annotation text file is provided along with each video file. Each line refers to a corresponding video frame in the following format:

    number_of_frame person_id min_x min_y max_x max_y
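
A minimal Python sketch for grouping these annotations by frame is shown here. It assumes one person per line with fields separated by whitespace (commas are tolerated as well); the function name `group_person_boxes` and the sample lines are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (person_id, min_x, min_y, max_x, max_y)
PersonBox = Tuple[int, float, float, float, float]


def group_person_boxes(lines: Iterable[str]) -> Dict[int, List[PersonBox]]:
    """Collect the person boxes of each frame into {frame: [boxes]}."""
    by_frame: Dict[int, List[PersonBox]] = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        # Assumption: fields are separated by whitespace and/or commas.
        fields = line.replace(",", " ").split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 6:
            raise ValueError(
                f"line {line_no}: expected 6 fields, got {len(fields)}")
        frame, person_id = int(fields[0]), int(fields[1])
        min_x, min_y, max_x, max_y = (float(v) for v in fields[2:])
        by_frame[frame].append((person_id, min_x, min_y, max_x, max_y))
    return dict(by_frame)


# Made-up sample lines, for illustration only.
sample_lines = [
    "0 1 100 200 150 300",
    "0 2 400 500 450 620",
    "1 1 102 201 152 301",
]
boxes = group_person_boxes(sample_lines)
persons_in_frame_0 = len(boxes[0])
```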

-AUTHDroneSunday_VID
A dataset for visual human crowd detection was collected, in the form of 6 videos shot inside the AUTH Campus using a DJI Phantom IV UAV. The videos depict a crowd of visitors during an "AUTH at Sundays" event. The video format is UHD 2160p, with a resolution of 4096 x 2160 at a rate of 25 frames per second. There are two scenes, the first containing a sparse crowd that moves near exhibition stands and the second a dense static crowd that watches a presentation given by AUTH students. The second scene has 5 videos that are shot from different view angles. No annotation is currently available.


-uav_detection
A dataset was prepared by AUTH for visual drone detection. It consists of 12 Full HD videos (1080p - 1920 x 1080) filmed using two cameras. The cameras were pointed in the general direction of a flying DJI Phantom IV. The drone is shot against various backgrounds, including the sky, trees, buildings and roads. In 11 out of the 12 videos, the two cameras are at ground level and looking up at the drone, maintaining a bottom view of it. In the last video the camera is at the same or higher elevation than the drone, maintaining mostly side and top views of it. The total video duration is 31 minutes. About 39K video frames were annotated for drone detection, with annotations in the following format:

    frame_number, number_of_drones, x_min, y_min, width, height
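
The sketch below illustrates how such comma-separated lines might be parsed and summarized in Python. It assumes that, when a frame contains several drones, the four box fields are repeated once per drone; this is not stated on this page. The function names `parse_drone_line` and `summarize`, and the sample lines, are hypothetical.

```python
from typing import Iterable, List, Tuple

Box = Tuple[float, ...]  # (x_min, y_min, width, height)


def parse_drone_line(line: str) -> Tuple[int, List[Box]]:
    """Return (frame_number, boxes) for one annotation line."""
    fields = [f.strip() for f in line.split(",") if f.strip()]
    if len(fields) < 2:
        raise ValueError("expected at least frame_number and number_of_drones")
    frame, count = int(fields[0]), int(fields[1])
    values = [float(f) for f in fields[2:]]
    # Assumption: one (x_min, y_min, width, height) group per drone.
    if len(values) != 4 * count:
        raise ValueError(
            f"frame {frame}: {count} drones declared, {len(values)} box values")
    boxes = [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]
    return frame, boxes


def summarize(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of annotated frame lines, total number of drone boxes)."""
    frames = drones = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, boxes = parse_drone_line(line)
        frames += 1
        drones += len(boxes)
    return frames, drones


# Made-up sample lines, for illustration only.
sample_lines = ["12, 1, 640, 360, 48, 32", "13, 0"]
frame_count, drone_count = summarize(sample_lines)
```

Checking the declared count against the number of box values catches truncated or malformed lines early, before they affect training or evaluation.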

-uav_detection_2
A dataset for drone detection was collected using one camera held by a person on the ground, within AUTH campus. In total, 11 Full HD videos were produced, which contain shots of a DJI Phantom IV, shot against various backgrounds and at multiple sizes and views. The total duration of this dataset is 15 minutes, or about 22K frames at 25fps. No annotation is currently available.


-landing_sites
A dataset of videos depicting potential UAV landing sites has also been captured. It consists of 2 videos (at a resolution of 4096 x 2160 pixels and with approximate total duration 5 minutes) captured by a DJI Phantom IV within AUTH campus, containing potential landing sites around a point of interest (POI), or generally in the university campus. The potential landing sites include terrain locations characterized by small terrain slope and no obstacles, so as to maximize the possibility of safe UAV landing. No annotation is currently available.


-AUTHObservatory_VID
A dataset named “AUTHObservatory_VID” was also collected by AUTH for building/Point-of-Interest detection purposes. It consists of two videos shot inside the AUTH Campus using a DJI Phantom IV UAV, containing the building of the observatory with the telescope dome. This is a unique building in the campus that can be considered a Point-of-Interest in the context of the other buildings. The video format is UHD 2160p, with a resolution of 4096 x 2160 at a rate of 25 frames per second. The view angles include a top view and a 360-degree perspective of the building sides from a height of 30m-50m. No annotation is currently available.


-face_deid_UAV
A dataset for face de-identification was prepared, consisting of one 3840 x 2160 video shot from a flying DJI Phantom IV. The drone was flying at a height of about 3-5 meters and its camera was pointed downwards, recording the subjects walking by and occasionally looking directly at it. The total video duration is 45 seconds with a framerate of 25 fps. Each face in the 1124 extracted frames is annotated with a bounding box, given as the pixel coordinates of its top left corner followed by its width and height, also in pixels. The annotation of the dataset is therefore in the following format:

    frame_number, number_of_faces, bounding box for each face in this frame
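
A de-identification step would normally operate on the annotated face region, so the following Python sketch converts each annotated box into a rectangle clamped to the 3840 x 2160 frame. It assumes that each bounding box is written as four values (x, y, width, height) after the face count, separated by commas and/or whitespace; the `margin` parameter, the function name `face_regions` and the sample line are hypothetical additions for illustration.

```python
from typing import List, Tuple

FRAME_WIDTH, FRAME_HEIGHT = 3840, 2160  # resolution stated for this dataset

# (left, top, right, bottom), with right and bottom exclusive
Rect = Tuple[int, int, int, int]


def face_regions(line: str, margin: int = 0) -> Tuple[int, List[Rect]]:
    """Return (frame_number, rectangles) for one annotation line.

    Each rectangle is the face box grown by `margin` pixels on every side
    and clamped so that it stays inside the video frame.
    """
    fields = line.replace(",", " ").split()
    if len(fields) < 2:
        raise ValueError("expected at least frame_number and number_of_faces")
    frame, count = int(fields[0]), int(fields[1])
    values = [int(float(v)) for v in fields[2:]]
    # Assumption: one (x, y, width, height) group per face.
    if len(values) != 4 * count:
        raise ValueError(
            f"frame {frame}: {count} faces declared, {len(values)} box values")
    regions: List[Rect] = []
    for i in range(0, len(values), 4):
        x, y, w, h = values[i:i + 4]
        left = max(0, x - margin)
        top = max(0, y - margin)
        right = min(FRAME_WIDTH, x + w + margin)
        bottom = min(FRAME_HEIGHT, y + h + margin)
        regions.append((left, top, right, bottom))
    return frame, regions


# Made-up sample: a face near the top right corner of the frame.
frame, regions = face_regions("7, 1, 3800, 10, 100, 50", margin=20)
```

The resulting rectangles can be used directly as array slice bounds (rows `top:bottom`, columns `left:right`) by whichever blurring or pixelation method is applied afterwards.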

-face_deidentification_UAV_mult_views
A dataset for face de-identification purposes was collected by a DJI Phantom IV UAV and consists of one 4096 x 2160 video. The UAV was flying at a height of about 3-5 meters, while the subjects were recorded from multiple viewpoints walking by and occasionally looking directly at it. The total duration of the video is 2 minutes and 23 seconds with a framerate of 25 fps. No annotation is currently available.