Register Cloud Data

At least one data integration is required to register cloud data to Encord. Encord can integrate with the following cloud service providers:


AWS S3	Azure blob	GCP storage
Open Telekom Cloud	Wasabi	MinIO
Oracle	Cloudflare	CoreWeave

Any files you upload to Encord must be stored in folders. Click here to learn how to create a folder to store your files.

Register Cloud Data to Files

STEP 1: Create a JSON or CSV File for Registration

Before registering your cloud data to Encord you must first create a JSON or CSV file specifying the files you want to register.

BEST PRACTICE: If you want to use Index or Active with your video data, we STRONGLY RECOMMEND using custom metadata (clientMetadata) to specify key frames, custom metadata, and custom embeddings. For more information go here or here for information on using the SDK.

JSON Format

We provide helpful scripts and examples that automatically generate JSON and CSV files for all the files in a folder or bucket within your cloud storage. This makes importing large datasets easier and more efficient.

The JSON file format is a JSON object with top-level keys specifying the type of data and object URLs of the files you want to upload to Encord. You can add one data type at a time, or combine multiple data types in one JSON. The supported top-level keys are: videos, audio, image_groups, images, dicom_series, nifti, scenes, and data_groups. The details for each data format are given in the sections below.

See our tips for increasing the speed of file registration here.

Add the "skip_duplicate_urls": true flag at the top level to make the uploads idempotent. Skipping URLs can help speed up large upload operations. Since previously processed assets do not have to be uploaded again, you can simply retry the failed operation without editing the upload specification file. The flag’s default value isfalse.

Encord enforces the following upload limits for each JSON file used for file registration:

Up to 1 million URLs
A maximum of 500,000 items (e.g. images, image groups, videos, DICOMs)
URLs can be up to 16 KB in size

Optimal upload chunking can vary depending on your data type and the amount of associated metadata. For tailored recommendations, contact Encord support. We recommend starting with smaller uploads and gradually increasing the size based on how quickly jobs are processed. Generally, smaller chunks result in faster data reflection within the platform.

Videos

Each object in the videos array is a JSON object with the key objectUrl specifying the full URL of where to find the video resource. The title field is optional. If omitted, the video file path and name are used as the default title. For example, if the file is located at https://encord-solutions-bucket.s3.eu-west-2.amazonaws.com/path/to/my/bucket/video23.mp4, the title defaults to /path/to/my/bucket/video23.mp4.

videoMetadata must be specified when a Strict client-only access integration is used. In all other cases, videoMetadata is optional, but including it significantly reduces import times.

Key or Flag	Required?	Default value
”objectUrl”	Yes
”title”	No	the file’s path + title
”videoMetadata”	No
”clientMetadata”	No
”createVideo”	No	false

Keys / Flags that are not required can be omitted from the JSON file entirely.

{
  "videos": [
    {
      "objectUrl": "<object url_1>"
    },
    {
      "objectUrl": "<object url_2>",
      "title": "my-custom-video-title.mp4",
      "videoMetadata": {
        "fps": 23.98,
        "duration": 29.09,
        "width": 1280,
        "height": 720,
        "file_size": 5468354,
        "mime_type": "video/mp4"
      },
      "clientMetadata": { "optional": "metadata" }
    }
  ],
  "skip_duplicate_urls": true
}

Video Metadata

The JSON format allows you to specify videoMetadata for video files. videoMetadata is essential information used by the Label Editor and is crucial for aligning annotations to the correct frame.

When the videoMetadata flag is present in the JSON file, we directly use the supplied metadata without performing any additional validation, and do not store the file on our servers. To guarantee accurate labels, it is crucial that the metadata you provide is accurate.

videoMetadata must be specified when a Strict client-only access integration is used. In all other cases, videoMetadata is optional.

{
  "videos": [
    {
      "objectUrl": "video_file.mp4",
      "videoMetadata": {
        "fps": 23.98,
        "duration": 29.09,
        "width": 1280,
        "height": 720,
        "file_size": 5468354,
        "mime_type": "video/mp4"
      }
    }
  ]
}

fps: Frames per second.
duration: Duration of the video (in seconds).
width / height: Dimensions of the video (in pixels).
file_size: The size of the file (in bytes).
mime_type: Specifies the file type extension according to the MIME standard.

When videos are supplied with video metadata, Encord assumes the metadata to be correct and our servers will neither download nor pre-process your data. This may be a particularly useful feature for customers with strict data compliance concerns.One way to find the necessary metadata is shown below. Run the following commands in your terminal.

ffmpeg -i 'video_title.mp4' to retrieve fps, duration, width, and height - as highlighted below.

ls -l 'video_title.mp4' to retrieve the file size - as highlighted below.

Audio files

Audio Files

Each object in the audio file array is a JSON object with the key objectUrl specifying the full URL of where to find the audio resource. The title field is optional. If omitted, the audio file path and name are used as the default title. For example, if the file is located at https://encord-solutions-bucket.s3.eu-west-2.amazonaws.com/path/to/my/bucket/song23.mp3, the title defaults to /path/to/my/bucket/song23.mp3.

Audio metadata is distinct from client metadata. clientMetadata allows you to add metadata that can be used for filtering your data in Index. You can use text to import transcripts for your audio file.

audioMetadata must be specified when a Strict client-only access integration is used. In all other cases, audioMetadata is optional, but including it significantly reduces import times.

Key or Flag	Required?	Default value
”objectUrl”	Yes
”title”	No	The file’s path + title
”clientMetadata”	No
”audioMetadata”	No

Keys / Flags that are not required can be omitted from the JSON file entirely.

{
  "audio": [
    {
      "objectUrl": "<object url_1>"
    },
    {
      "objectUrl": "<object url_2>",
      "title": "my-custom-audio-file-title.mp3",
      "audioMetadata": {
        "duration": 23.98,
        "file_size": 2900000,
        "mime_type": "audio/mp3",
        "sample_rate": 44100,
        "bit_depth": 24,
        "codec": "mp3",
        "num_channels": 2
      },
      "clientMetadata": { "optional_key_1": "optional_metadata_value_1" }
    }
  ],
  "skip_duplicate_urls": true
}

Audio Metadata

The JSON format allows you to specify audioMetadata for audio files. This is optional information.

When the audioMetadata flag is present in the JSON file, we directly use the supplied metadata without performing any additional validation, and do not store the file on our servers. It is crucial that the metadata you provide is accurate.

{
  "audio": [
    {
      "objectUrl": "audio_file.mp3",
      "audioMetadata": {
        "duration": 23.98,
        "file_size": 2900000,
        "mime_type": "audio/mp3",
        "sample_rate": 44100,
        "bit_depth": 24,
        "codec": "mp3",
        "num_channels": 2
      }
    }
  ]
}

duration_seconds: float - Audio duration in seconds.
file_size: int - Size of the audio file in bytes.
mime_type: str - MIME type of the audio file (for example: audio/mpeg or audio/wav).
sample_rate: int - Sample rate (int) in Hz.
bit_depth: int - Size of each sample (int) in bits.
codec: str - Codec (for example: mp3, pcm).
num_channels: int - Number of channels.

PDFs

Each object in the PDF array is a JSON object with the key objectUrl specifying the full URL of where to find the PDF. The title field is optional. If omitted, the PDF path and name are used as the default title. For example, if the file is located at https://encord-solutions-bucket.s3.eu-west-2.amazonaws.com/path/to/my/bucket/my-document.pdf, the title defaults to /path/to/my/bucket/my-document.pdf.

PDF metadata is distinct from client metadata. clientMetadata allows you to add metadata that can be used for filtering your data in Index.

PDF Metadata must be specified when a Strict client-only access integration is used. In all other cases, pdfMetadata is optional, but including it significantly reduces import times.

Key or Flag	Required?	Default value
”objectUrl”	Yes
”title”	No	The file’s path + title
”clientMetadata”	No
”pdfMetadata”	No

Keys / Flags that are not required can be omitted from the JSON file entirely.

{
  "pdfs": [
    {
      "objectUrl": "<object url_1>",
      "pdfMetadata": {
        "fileSize": 300,
        "numPages": 5
      }
    },
    {
      "objectUrl": "<object url_2>",
      "title": "my-document-02.pdf",
      "clientMetadata": { "optional_key_1": "optional_metadata_value_1" }
    }
  ],
  "skip_duplicate_urls": true
}

PDF Metadata

The JSON format allows you to specify pdfMetadata for documents. This is optional information.

When the pdfMetadata flag is present in the JSON file, we directly use the supplied metadata without performing any additional validation, and do not store the file on our servers. It is crucial that the metadata you provide is accurate.

{
  "pdfs": [
    {
      "objectUrl": "https://storage.cloud.google.com/encord-onboarding-clinton/text/singapore-penal-code-1871.pdf",
      "pdfMetadata": {
        "fileSize": 300,
        "numPages": 5
      }
    }
  ]
}

file_size: int - Size of the pdf file in bytes.
num_pages: int - The number of pages in the PDF document.

Text Files

Each object in the text file array is a JSON object with the key objectUrl specifying the full URL of where to find the text file. The title field is optional. If omitted, the text file path and name are used as the default title. For example, if the file is located at https://encord-solutions-bucket.s3.eu-west-2.amazonaws.com/path/to/my/bucket/my-file.html, the title defaults to /path/to/my/bucket/my-file.html.

Text files include .txt, .html, .md, .xml, and more.

Text metadata is distinct from client metadata. clientMetadata allows you to add metadata that can be used for filtering your data in Index.

textMetadata must be specified when a Strict client-only access integration is used. In all other cases, textMetadata is optional, but including it significantly reduces import times.

Key or Flag	Required?	Default value
”objectUrl”	Yes
”title”	No	The file’s path + title
”clientMetadata”	No
”textMetadata”	No

Keys / Flags that are not required can be omitted from the JSON file entirely.

{
  "text": [
    {
      "objectUrl": "<object url_1>",
      "textMetadata": {
        "fileSize": 200,
        "mime_type": "application/xml"
      }
    },
    {
      "objectUrl": "<object url_2>",
      "title": "my-file.html",
      "clientMetadata": { "optional_key_1": "optional_metadata_value_1" }
    }
  ],
  "skip_duplicate_urls": true
}

Text Metadata

The JSON format allows you to specify textMetadata for documents. This is optional information.

When the textMetadata flag is present in the JSON file, we directly use the supplied metadata without performing any additional validation, and do not store the file on our servers. It is crucial that the metadata you provide is accurate.

{
  "text": [
    {
      "objectUrl": "<object url_1>",
      "textMetadata": {
        "fileSize": 200,
        "mime_type": "application/xml"
      }
    }
  ]
}

file_size: int - Size of the text file in bytes.
mime_type: str - MIME type of the text file (for example: application/xml or text/plain).

Single images

Single Images

The JSON structure for single images parallels that of videos. The title field is optional. If omitted, the image file path and name are used as the default title. For example, if the file is located at https://encord-solutions-bucket.s3.eu-west-2.amazonaws.com/path/to/my/bucket/image23.jpg, the title defaults to /path/to/my/bucket/image23.jpg.

Key or Flag	Required?	Default value
”objectUrl”	Yes
”title”	No	The file’s path + title
”imageMetadata”*	No
”clientMetadata”	No
”createVideo”	No	false

imageMetadata must be specified when a Strict client-only access integration is used. In all other cases, imageMetadata is optional, but including it significantly reduces import times.

Keys / Flags that are not required can be omitted from the JSON file entirely.

{
  "images": [
    {
      "objectUrl": "<object url>"
    },
    {
      "objectUrl": "<object url>",
      "title": "my-custom-image-title.jpeg",
      "imageMetadata": {
        "mimeType": "image/jpg",
        "fileSize": 124,
        "width": 640,
        "height": 480
      },
      "clientMetadata": { "optional": "metadata" }
    }
  ],
  "skip_duplicate_urls": true
}

Image Metadata

The JSON format allows you to specify imageMetadata for image files. imageMetadata contains essential information used by the Label Editor and is crucial for aligning annotations to the correct image properties.

When the imageMetadata flag is present in the JSON file, we directly use the supplied metadata without performing any additional validation and do not store the file on our servers. To guarantee accurate labels, it is crucial that the metadata you provide is accurate.

imageMetadata must be specified when a Strict client-only access integration is used. In all other cases, imageMetadata is optional.

{
  "images": [
    {
      "objectUrl": "s3://my_image.jpg",
      "imageMetadata": {
        "mimeType": "image/jpg",
        "fileSize": 124,
        "width": 640,
        "height": 480
      }
    }
  ]
}

Image groups

Image groups are collections of images that are processed as one annotation task.
Images within image groups remain unaltered, meaning that images of different sizes and resolutions can form an image group without the loss of data.
Image groups do not require ‘write’ permissions to your cloud storage.
Custom client metadata is defined per image group, not per image. See our documentation here to learn how to add clientMetadata to images in an image group.

Key or Flag	Required?	Default value	Note
”objectUrl_“	Yes		is the number the file occupies in the sequence, starting from 0
”title”	No
”clientMetadata”	No
”createVideo”	No	`false`

The position of each image within the sequence needs to be specified in the key - e.g. objectUrl_{position_number} as seen in the example below.

Keys / Flags that are not required can be omitted from the JSON file entirely.

Custom metadata (clientMetadata) can be added to individual frames in an image group. However, the frames must first be imported into Index, after which you can create an image group from the frames using the SDK.

{
  "image_groups": [
    {
      "title": "<title 1>",
      "createVideo": false,
      "objectUrl_0": "<object url>",
      "objectUrl_1": "<object url>",
      "objectUrl_2": "<object url>"
    },
    {
      "title": "<title 2>",
      "createVideo": false,
      "objectUrl_0": "<object url>",
      "objectUrl_1": "<object url>",
      "objectUrl_2": "<object url>",
      "clientMetadata": { "optional": "imageGroupMetadata" }
    }
  ],
  "skip_duplicate_urls": true
}

Image sequences

Image Sequences

Image sequences are collections of images that are processed as one annotation task and represented as a video.
Images within image sequences may be altered as images of varying sizes are resolutions are made to match that of the first image in the sequence.
Creating Image sequences from cloud storage requires ‘write’ permissions, as new files have to be created in order to be read as a video.
Each object in the image_groups array with the createVideo flag set to true represents a single image sequence.
Custom client metadata is defined per image sequence, not per image.

The only difference between adding image groups and image sequences via a JSON is that image sequences require the createVideo flag to be set to true. Both use the key image_groups.

Key or Flag	Required?	Default value
”objectUrl”	Yes
”title”	No
”clientMetadata”	No
”createVideo”	no	`false`

The position of each image within the sequence needs to be specified in the key - e.g objectUrl_{position_number}. See the example below.

Keys / Flags that are not required can be omitted from the JSON file entirely.

{
  "image_groups": [
    {
      "title": "<title 1>",
      "createVideo": true,
      "objectUrl_0": "<object url>"
    },
    {
      "title": "<title 2>",
      "createVideo": true,
      "objectUrl_0": "<object url>",
      "objectUrl_1": "<object url>",
      "objectUrl_2": "<object url>",
      "clientMetadata": { "optional": "metadata" }
    }
  ],
  "skip_duplicate_urls": true
}

DICOM

Each dicom_series element can contain one or more DICOM series.
Each series requires a title and at least one object URL, as shown in the following example.

Key or Flag	Required?	Note
`"objectUrl_{position_number}"`	Yes	`{position_number}` is the number the file occupies in the sequence, starting from 0
`"title"`	Yes
`"clientMetadata"`	No

Keys / Flags that are required, such as clientMetadata, can be omitted from the JSON file entirely. clientMetadata is distinct from patient metadata, which is included in the .dcm file and does not have to be specific during the upload to Encord.

The following is an example JSON for uploading three DICOM series belonging to a study. Each title and object URL correspond to individual DICOM series.

The first series contains only a single object URL, as it is composed of a single file.
The second series contains 3 object URLs, as it is composed of three separate files.
The third series contains 2 object URLs, as it is composed of two separate files.

For each DICOM upload, an additional DicomSeries file is created. This file represents the series file-set. Only DicomSeries are displayed in the Encord application.

{
  "dicom_series": [
    {
      "title": "<series-1>",
      "objectUrl_0": "https://my-bucket/.../study1-series1-file.dcm"
    },
    {
      "title": "<series-2>",
      "objectUrl_0": "https://my-bucket/.../study1-series2-file1.dcm",
      "objectUrl_1": "https://my-bucket/.../study1-series2-file2.dcm",
      "objectUrl_2": "https://my-bucket/.../study1-series2-file3.dcm"
    },
    {
      "title": "<series-3>",
      "objectUrl_0": "https://my-bucket/.../study1-series3-file1.dcm",
      "objectUrl_1": "https://my-bucket/.../study1-series3-file2.dcm"
    }
  ],
  "skip_duplicate_urls": true
}

NIfTI

Each series requires a title and at least one object URL.

Key or Flag	Required?	Default value
”objectUrl”	Yes
”title”	No	The file’s title
”clientMetadata”	No

The following is an example JSON file for uploading two NIfTI files to Encord.

{
  "nifti": [
    {
      "title": "<file-1>",
      "objectUrl": "https://my-bucket/.../nifti-file1.nii"
    },
    {
      "title": "<file-2>",
      "objectUrl": "https://my-bucket/.../nifti-file2.nii.gz"
    }
  ],
  "skip_duplicate_urls": true
}

Data groups

Data Groups

Data groups let you combine multiple data items from the cloud upload into a single annotation task with a multi-pane layout.
The contents of the group are created by specifying titleRef (matching the title field of another item) or urlRef (matching the objectUrl of another item). The referenced items must be defined in the same JSON file.

There are three layout types:

Layout type	Description
`"default-grid"`	Arranges items in an auto-generated grid. `layoutContents` is an array of refs.
`"default-list"`	Arranges items in a carousel. `layoutContents` is an array of refs.
`"custom"`	Uses a custom split-pane layout. `layoutContents` is an object with named keys, and a `layout` tree must be provided.

{
  "images": [
    {
      "objectUrl": "https://my-bucket/image1.png",
      "title": "image-1"
    },
    {
      "objectUrl": "https://my-bucket/image2.png",
      "title": "image-2"
    }
  ],
  "data_groups": [
    {
      "name": "my-grid-group",
      "layoutType": "default-grid",
      "layoutContents": [{ "titleRef": "image-1" }, { "titleRef": "image-2" }]
    }
  ]
}

Custom Layout

The layout field is a custom layout defined with a recursive binary split tree. Each node is one of:

data_unit: A single pane displaying one data item. Requires a key matching a key in layoutContents.
data_unit_list: A carousel pane displaying multiple data items. Requires keys (array of keys from layoutContents). Optional fields: carouselPosition ("left", "right", "top", "bottom"; default "left") and carouselSize (10-70; default 20).
Grid node: Splits the area into two children. Requires direction ("row" or "column"), first, second, and splitPercentage (5-95).

Settings

The optional settings object supports:

Key	Type	Default	Description
`fixedLayout`	bool	`false`	Prevents users from changing the layout in the Label Editor.
`hasMultilayerLabels`	bool	`false`	When `true`, labels are placed on all elements in the data group.
`tileSettings`	object		Per-tile settings keyed by `layoutContents` key. Each tile can have a `title` (string) and `isReadOnly` (bool, default `false`).

Scenes

Scenes are used for point cloud (PCD/LiDAR) data and can combine multiple sensor streams (cameras, LiDAR) into a single annotation task.For full details on the scene JSON format, supported streams, frames of reference, and examples, see the Add Point Cloud Data documentation.

Use a Multi-Region Access Point

When using a Multi-Region Access Point for your AWS S3 buckets the JSON file has to be slightly different from the examples provided. Instead of an object’s URL, objects are specified using the ARN of the Multi-Region Access Point followed by the object name. The example below shows how video files from a Multi-Region Access Point would be specified.

{
  "videos": [
    {
      "objectUrl": "Multi-Region-Access-Point-ARN + <object name_1>"
    },
    {
      "objectUrl": "Multi-Region-Access-Point-ARN + <object name_2>",
      "title": "my-custom-video-title.mp4"
    }
  ],
  "skip_duplicate_urls": true
}

MRAP Example

{
  "videos": [
    {
      "objectUrl": "https://arn:aws:s3::123123123:accesspoint/frf28frarf9.mrap.s3-accesspoint.amazonaws.com/Videos/2022/video_1.mp4"
    },
    {
      "objectUrl": "https://arn:aws:s3::123123123:accesspoint/frf28frarf9.mrap.s3-accesspoint.amazonaws.com/Videos/2022/video_2.mp4",
      "title": "many-cute-cats.mp4"
    }
  ],
  "skip_duplicate_urls": true
}

CSV Format

In the CSV file format, the column headers specify which type of data is being uploaded. You can add and single file format at a time, or combine multiple data types in a single CSV file. Details for each data format are given in the sections below.

Encord supports up to 10,000 entries for upload in the CSV file.

Object URLs can’t contain whitespace.
For backwards compatibility reasons, a single column CSV is supported. A file with the single ObjectUrl column is interpreted as a request for video upload. If your objects are of a different type (for example, images), this error displays: “Expected a video, got a file of type XXX”.

Videos

A CSV file containing videos should contain two columns with the following mandatory column headings:
‘ObjectURL’ and ‘Video title’. All headings are case-insensitive.

The ‘ObjectURL’ column containing the objectUrl. This field is mandatory for each file, as it specifies the full URL of the video resource.
The ‘Video title’ column containing the video_title. If left blank, the original file name is used.

In the example below files 1, 2 and 4 will be assigned the names in the title column, while file 3 will keep its original file name.

ObjectUrl	Video title
path/to/storage-location/frame1.mp4	Video 1
path/to/storage-location/frame2.mp4	Video 2
path/to/storage-location/frame3.mp4
path/to/storage-location/frame4.mp4	Video 3

Single images

A CSV file containing single images should contain two columns with the following mandatory headings:
‘ObjectURL’ and ‘Image title’. All headings are case-insensitive.

The ‘ObjectURL’ column containing the objectUrl. This field is mandatory for each file, as it specifies the full URL of the image resource.
The ‘Image title’ column containing the image_title. If left blank, the original file name is used.

In the example below files 1, 2 and 4 will be assigned the names in the title column, while file 3 will keep its original file name.

ObjectUrl	Image title
path/to/storage-location/frame1.jpg	Image 1
path/to/storage-location/frame2.jpg	Image 2
path/to/storage-location/frame3.jpg
path/to/storage-location/frame4.jpg	Image 3

Image groups

A CSV file containing image groups should contain three columns with the following mandatory headings:
‘ObjectURL’, ‘Image group title’, and ‘Create video’. All three headings are case-insensitive.

The ‘ObjectURL’ column containing the objectUrl. This field is mandatory for each file, as it specifies the full URL of the resource.
The ‘Image group title’ column containing the image_group_title. This field is mandatory, as it determines which image group a file will be assigned to.

In the example below the first two URLs are grouped together into ‘Group 1’, while the following two files are grouped together into ‘Group 2’.

ObjectUrl	Image group title	Create video
path/to/storage-location/frame1.jpg	Group 1	false
path/to/storage-location/frame2.jpg	Group 1	false
path/to/storage-location/frame3.jpg	Group 2	false
path/to/storage-location/frame4.jpg	Group 2	false

Image groups do not require ‘write’ permissions.

Image sequences

A CSV file containing image sequences should contain three columns with the following mandatory headings: ‘ObjectURL’, ‘Image group title’, and ‘Create video’. All three headings are case-insensitive.

The ‘ObjectURL’ column containing the objectUrl. This field is mandatory for each file, as it specifies the full URL of the resource.
The ‘Image group title’ column containing the image_group_title. This field is mandatory, as it determines which image sequence a file will be assigned to. The dimensions of the image sequence are determined by the first file in the sequence.
The ‘Create video’ column. This can be left blank, as the default value is ‘true’.

In the example below the first two URLs are grouped together into ‘Sequence 1’, while the second two files are grouped together into ‘Sequence 2’.

ObjectUrl	Image group title	Create video
path/to/storage-location/frame1.jpg	Sequence 1	true
path/to/storage-location/frame2.jpg	Sequence 1	true
path/to/storage-location/frame3.jpg	Sequence 2	true
path/to/storage-location/frame4.jpg	Sequence 2	true

Image groups and image sequences are only distinguished by the presence of the ‘Create video’ column.

Image sequences require ‘write’ permissions against your storage bucket to save the compressed video.

DICOM

A CSV file containing DICOM files should contain two columns with the following mandatory headings: ‘ObjectURL’ and ‘Dicom title’. Both headings are case-insensitive.

The ‘ObjectURL’ column containing the objectUrl. This field is mandatory for each file, as it specifies the full URL of the resource.
The ‘Series title’ column containing the dicom_title. When two files are given the same title they are grouped into the same DICOM series. If left blank, the original file name is used.

In the example below the first two files are grouped into ‘dicom series 1’, the next two files are grouped into ‘dicom series 2’, while the final file will remain separated as ‘dicom series 3’.

ObjectUrl	Series title
path/to/storage-location/frame1.dcm	dicom series 1
path/to/storage-location/frame2.dcm	dicom series 1
path/to/storage-location/frame3.dcm	dicom series 2
path/to/storage-location/frame4.dcm	dicom series 2
path/to/storage-location/frame5.dcm	dicom series 3

Multiple file types

You can upload multiple file types with a single CSV file by using a new header each time there is a change of file type. Three headings will be required if image sequences are included.

Since the ‘Create video’ column defaults to true all files that are not image sequences must contain the value false

The example below shows a CSV file for the following:

Two image sequences composed of 2 files each.
One image group composed of 2 files.
One single image.
One video.

ObjectUrl	Image group title	Create video
path/to/storage-location/frame1.jpg	Sequence 1	true
path/to/storage-location/frame2.jpg	Sequence 1	true
path/to/storage-location/frame3.jpg	Sequence 2	true
path/to/storage-location/frame4.jpg	Sequence 2	true
path/to/storage-location/frame5.jpg	Group 1	false
path/to/storage-location/frame6.jpg	Group 1	false
ObjectUrl	Image title	Create video
path/to/storage-location/frame1.jpg	Image 1	false
ObjectUrl	Image title	Create video
full/storage/path/video.mp4	Video 1	false

STEP 2: Register Your Cloud Data

To ensure smoother uploads and faster completion times, and avoid hitting absolute file limits, we recommend adding smaller batches of data. Limit uploads to 100 videos or up to 1,000 images at a time. You can also create multiple Datasets, all of which can be linked to a single Project. Familiarize yourself with our limits and best practices for data import/registration before adding data to Encord.

Navigate to Files section of Index in the Encord platform.
Click into a Folder.
Click + Upload files. A dialog appears.

Click Import from cloud data.

We recommend turning on the Ignore individual file errors feature. This ensures that individual file errors do not lead to the whole upload process being aborted.

Click Add JSON or CSV files to add a JSON or CSV file specifying cloud data that is to be added.

You can also register your data directly in the Datasets screen. Click here for instructions.

Custom Metadata

Custom metadata can only be added through JSON uploads in the Encord Platform or using the Encord SDK.

Custom metadata, also known as client metadata, is supplementary information you can add to all data imported into Encord. It is provided in the form of a Python dictionary, as shown in examples. Client metadata serves several key functions:

Filtering and sorting in Index and Active.
Creating custom Label Editor layouts based on metadata.

You can optionally add some custom metadata per data item in the clientMetadata field (examples show how this is done) of your JSON file.

There is no limit on the number of custom metadata fields on a data unit. However, we enforce a 10MB limit on the custom metadata for each data item. Internally, we store custom metadata as a PostgreSQL jsonb type. Read the relevant PostgreSQL documentation about the jsonb type and its behaviors. For example, jsonb type does not preserve key order or duplicate keys.

Metadata Schema

Metadata schemas, including custom embeddings, can only be imported through the Encord SDK.

Based on your Data Discoverability Strategy, you need to create a metadata schema. The schema provides a method of organization for your custom metadata. Encord supports:

Scalars: Methods for filtering.
Enums: Methods with options for filtering.
Embeddings: Method for embedding plot visualization, similarity search, and natural language search.

Metadata Schema keys support letters (a-z, A-Z), numbers (0-9), and the following blank spaces ( ), hyphens (-), underscores (_), and periods (.).

Custom metadata

Custom metadata refers to any additional information you attach to files, allowing for better data curation and management based on your specific needs. It can include any details relevant to your workflow, helping you organize, filter, and retrieve data more efficiently. For example, for a video of a construction site, custom metadata could include fields like "site_location": "Algiers", "project_phase": "foundation", or "weather_conditions": "sunny". This enables more precise tracking and management of your data. Before importing any files with custom metadata to Encord, we recommend that you import a metadata schema. Encord uses metadata schemas to validate custom metadata uploaded to Encord and to instruct Index and Active how to display your metadata.

To handle your custom metadata schema across multiple teams within the same organization, we recommend using namespacing for metadata keys in the schema. This ensures that different teams can define and manage their own metadata schema without conflicts. For example, team A could use video.description, while team B could use audio.description. Another example could be TeamName.MetadataKey. This approach maintains clarity and avoids key collisions across departments.

Metadata schema table

Metadata Schema keys support letters (a-z, A-Z), numbers (0-9), and blank spaces ( ), hyphens (-), underscores (_), and periods (.). Metadata schema keys are case sensitive.

Use add_scalar to add a scalar key to your metadata schema.

Scalar Key	Description	Display Benefits
boolean	Binary data type with values “true” or “false”.	Filtering by binary values
datetime	ISO 8601 formatted date and time.	Filtering by time and date
number	Numeric data type supporting float values.	Filtering by numeric values
uuid	UUIDv4 formatted unique identifier for a data unit.	Filtering by customer specified unique identifier
varchar	Textual data type. Formally `string`. `string` can be used as an alias for `varchar`, but we STRONGLY RECOMMEND that you use `varchar`.	Filtering by string.
text	Text data with unlimited length (example: transcripts for audio). Formally `long_string`. `long_string` can be used as an alias for `text`, but we STRONGLY RECOMMEND that you use `text`.	Storing and filtering large amounts of text.

Use add_enum and add_enum_options to add an enum and enum options to your metadata schema.

Key	Description	Display Benefits
enum	Enumerated type with predefined set of values.	Facilitates categorical filtering and data validation

Use add_embedding to add an embedding to your metadata schema.

Key	Description	Display Benefits
embedding	1 to 4096 for Index. 1 to 2000 for Active.	Filtering by embeddings, similarity search, 2D scatter plot visualization (Coming Soon)

Use add_geospatial to add a new geospatial type to the metadata schema.

Key	Description	Display Benefits
geospatial	Geospatial data that contains latitude and longitude information. Latitude values are between +90 (North) to -90 (South). Longitude values are between +180 (West) to -180 (East).	Filtering data by using a geospatial map.

Geospatial custom metadata is supported only in Index.
Geospatial custom metadata can be applied to all data unit types and on individual frames in videos.

Incorrectly specifying a data type in the schema can cause errors when filtering your data in Index or Active. If you encounter errors while filtering, verify your schema is correct. If your schema has errors, correct the errors, re-import the schema, and then re-sync your Active Project.

Import your metadata schema to Encord

# Import dependencies
from encord import EncordUserClient
from encord.metadata_schema import MetadataSchema

SSH_PATH = "<file-path-to-ssh-private-key>"

# Authenticate with Encord using the path to your private key
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    ssh_private_key_path=SSH_PATH
)

# Create the schema
metadata_schema = user_client.metadata_schema()

# Add various metadata fields
metadata_schema.add_scalar("metadata_1", data_type="boolean")
metadata_schema.add_scalar("metadata_2", data_type="datetime")
metadata_schema.add_scalar("metadata_3", data_type="number")
metadata_schema.add_scalar("metadata_4", data_type="uuid")
metadata_schema.add_scalar("metadata_5", data_type="varchar")
metadata_schema.add_scalar("metadata_6", data_type="text")

# Add an enum field
metadata_schema.add_enum("my-enum", values=["enum-value-01", "enum-value-02", "enum-value-03"])

# Add embedding fields
metadata_schema.add_embedding("my-test-active-embedding", size=512)
metadata_schema.add_embedding("my-test-index-embedding", size=<values-from-1-to-4096>)

# Add geospatial
metadata_schema.add_geospatial("geo-point")

# Save the schema
metadata_schema.save()

# Print the schema for verification
print(metadata_schema)

Verify your schema

After importing your schema to Encord we recommend that you verify that the import is successful. Run the following code to verify your metadata schema imported and that the schema is correct.

# Import dependencies
from encord import EncordUserClient
from encord.metadata_schema import MetadataSchema

SSH_PATH = "<file-path-to-ssh-private-key>"

# Authenticate with Encord using the path to your private key
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    ssh_private_key_path=SSH_PATH
)

# Create the schema
metadata_schema = user_client.metadata_schema()

# Print the schema for verification
print(metadata_schema)

Update Custom Metadata (JSON)

When updating custom metadata using a JSON file, you MUST specify "skip_duplicate_urls": true and "upsert_metadata": true. Specifying the "skip_duplicate_urls": true and "upsert_metadata": true flags in the JSON file does the following:

New files registered with Encord and custom metadata for those files is added.
Existing files have their existing custom metadata overwritten with the custom metadata specified in the JSON file.

To update custom metadata with a JSON file:

Create a registration JSON file with the updated custom metadata. Include the "skip_duplicate_urls": true and "upsert_metadata": true flags.

Custom metadata updates require "skip_duplicate_urls": true to function. It does not work if "skip_duplicate_urls": false.
Only custom metadata for pre-existing files is updated. Any new files present in the JSON are uploaded.

Update custom metadata example

{
  "videos": [
    {
      "objectUrl": "<object url_1>"
    },
    {
      "objectUrl": "<object url_2>",
      "title": "my-custom-video-title.mp4",
      "clientMetadata": {"optional": "metadata"}
    }
  ],
  "skip_duplicate_urls": true,
  "upsert_metadata": true
}

Custom Embeddings

Metadata schemas, including custom embeddings, can only be imported through the Encord SDK.

Encord enables the use of custom embeddings for images, image sequences, image groups, and individual video frames.

To learn how to use custom embeddings in Encord, see our documentation here.

Step 1: Create a New Embedding Type

A key is required in your custom metadata schema for your embeddings. You can use any string as the key for your embeddings. We strongly recommend that you use a string that is meaningful. If you do not include a key in your metadata schema, your imported embeddings are treated as strings.

Embedding key names can contain alphanumeric (a-z, A-Z, 0-1) characters, hyphens, and underscores.

Use add_embedding to add an embedding to your metadata schema.

Key	Description	Display Benefits
embedding	1 to 4096 for Index. 1 to 2000 for Active	Filtering by embeddings, similarity search, 2D scatter plot visualization (Coming Soon)

# Import dependencies
from encord import EncordUserClient
from encord.metadata_schema import MetadataSchema

SSH_PATH = "<file-path-to-ssh-private-key>"

# Authenticate with Encord using the path to your private key
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    ssh_private_key_path=SSH_PATH
)

# Create the schema
metadata_schema = user_client.metadata_schema()


# Add embedding fields
metadata_schema.add_embedding('my-test-active-embedding', size=512)
metadata_schema.add_embedding('my-test-index-embedding', size=<values-from-1-to-4096>)

# Save the schema
metadata_schema.save()

# Print the schema for verification
print(metadata_schema)

Step 2: Upload Embeddings

With the key in the custom metadata schema ready, we can now import our embeddings. Custom embedding sizes are flexible and can be set anywhere between 1 and 4096. You can import embeddings after you have added your data or during your data registration.

Your key frames (frames specified with or without embeddings) always appear in Index, regardless of what sampling rate you specify.

Embedding key names can contain alphanumeric (a-z, A-Z, 0-1) characters, hyphens, and underscores.

If config is not specified, the sampling_rate is 1 frame per second, and the keyframe_mode is frame.

Specifying a sampling_rate of 0 only imports the first frame and all keyframes of your video into Index.

Import while registering videos

Import while importing videos

This JSON file imports embeddings while registering your data with Index from a cloud integration.config is optional when importing your custom embeddings:

"config": {
    "sampling_rate": "<samples-per-second>",
    "keyframe_mode": "frame" or "seconds",
},

If config is not specified, the sampling_rate is 1 frame per second, and the keyframe_mode is frame.

Specifying a sampling_rate of 0 only imports the first frame and all keyframes of your video into Index.

{
    "videos": [
        {
            "objectUrl": "<cloud-file-path-to-your-video-1>",
            "title": "<title-for-your-video-1>",
            "clientMetadata": {
                "$encord": {
                    "config": {
                        "sampling_rate": "<samples-per-second>",
                        "keyframe_mode": "frame" or "seconds",
                    },
                    "frames": {
                        "<frame-number-or-seconds>": {
                            "<my-embedding>": [1.0, 2.0, 3.0]
                        },
                        "<frame-number-or-seconds>": {
                            "<my-embedding>": [1.0, 2.0, 3.0]
                        }
                    }
                }
            }
        },
        {
            "objectUrl": "<cloud-file-path-to-your-video-2>",
            "title": "<title-for-your-vide-2>",
            "clientMetadata": {
                "$encord": {
                    "config": {
                        "sampling_rate": "<frames-per-second>",
                        "keyframe_mode": "frame" or "seconds",
                    },
                    "frames": {
                        "<frame-number-or-seconds>": {
                            "<my-embedding>": [1.0, 2.0, 3.0]
                        },
                        "<frame-number-or-seconds>": {
                            "<my-embedding>": [1.0, 2.0, 3.0]
                        }
                    }
                }
            }
        },
        {
            "objectUrl": "<cloud-path-to-your-video-3>",
            "title": "<title-for-your-video-3>",
            "clientMetadata": {
                "$encord": {
                    "config": {
                        "sampling_rate": "<frames-per-second>",
                        "keyframe_mode": "frame" or "seconds",
                    },
                    "frames": {
                        "<frame-number-or-seconds>": {
                            "<my-embedding>": [1.0, 2.0, 3.0]
                        },
                        "<frame-number-or-seconds>": {
                            "<my-embedding>": [1.0, 2.0, 3.0]
                        }
                    }
                }
            }
        }
    ],
    "skip_duplicate_urls": true
}

Update specific videos

# Import dependencies
from encord import EncordUserClient
from encord.http.bundle import Bundle
from encord.orm.storage import StorageFolder, StorageItem, StorageItemType, FoldersSortBy

# Authentication
SSH_PATH = "<file-path-to-ssh-private-key>"

# Authenticate with Encord using the path to your private key
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    ssh_private_key_path=SSH_PATH,
)

updates = {
    "<data-hash-1>": {
        "$encord": {
            "frames": {
                "<frame-number-1>": {
                    "<my-embedding>": [1.0, 2.0, 3.0],  # custom embedding ("embedding") with float values
                },
                "<frame-number-2>": {
                    "<my-embedding>": [1.0, 2.0, 3.0],  # custom embedding ("embedding") with float values
                }
            }
        }
    },
    "<data-hash-2>": {
        "$encord": {
            "config": {
                "sampling_rate": <samples-per-second>,  # VIDEO ONLY (optional default = 1 sample/second)
                "keyframe_mode": "frame" or "seconds",  # VIDEO ONLY (optional default = "frame")
            },
            "frames": {
                "<frame-number-1>": {
                    "<my-embedding>": [1.0, 2.0, 3.0],  # custom embedding ("embedding") with float values
                },
                "<frame-number-2>": {
                    "<my-embedding>": [1.0, 2.0, 3.0],  # custom embedding ("embedding") with float values
                }
            }
        }
    },
}

# Use the Bundle context manager
with Bundle() as bundle:
    # Update the storage items based on the dictionary
    for item_uuid, metadata_update in updates.items():
        item = user_client.get_storage_item(item_uuid=item_uuid)

        # Make a copy of the current metadata and update it with the new metadata
        curr_metadata = item.client_metadata.copy()
        curr_metadata.update(metadata_update)

        # Update the item with the new metadata and bundle
        item.update(client_metadata=curr_metadata, bundle=bundle)

Import while registering data units

Import while importing data units

This JSON file imports embeddings while registering your data with Index from a cloud integration.

{
  "images": [
    {
      "objectUrl": "file/path/to/data-unit/file-name-01.file-extension",
      "title": "data-unit-title.file-extension",
      "clientMetadata": {"metadata-1": "value", "metadata-2": "value"}
    },
  ],
  "skip_duplicate_urls": true
}

Update specific data units

Import specific data units

The custom embeddings format for images, text files, PDFs, and audio files follows the same format as importing custom metadata.

# Import dependencies
from encord import EncordUserClient
from encord.http.bundle import Bundle

# Authentication
SSH_PATH = "<file-path-to-ssh-private-key>"

# Authenticate with Encord using the path to your private key
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    ssh_private_key_path=SSH_PATH,
)

# Define a dictionary with item UUIDs and their respective metadata updates
updates = {
    "<data-ID-1>": {"<my-embedding>": [1.0, 2.0, 3.0]},
    "<data-ID-2>": {"<my-embedding>": [1.0, 2.0, 3.0]}
}

# Use the Bundle context manager
with Bundle() as bundle:
    # Update the storage items based on the dictionary
    for item_uuid, metadata_update in updates.items():
        item = user_client.get_storage_item(item_uuid=item_uuid)

        # Make a copy of the current metadata and update it with the new metadata
        curr_metadata = item.client_metadata.copy()
        curr_metadata.update(metadata_update)

        # Update the item with the new metadata and bundle
        item.update(client_metadata=curr_metadata, bundle=bundle)

How To Increase File Registration Speed

To speed up file registration with Encord, you can include metadata for each file in the upload JSON. This metadata is used directly without additional validation and is not stored on our servers. Ensuring accuracy in the metadata you provide is essential to maintain precise labels.

The metadata referenced here is distinct from clientMetadata and serves a different purpose. Documentation for clientMetadata can be found here.

imageMetadata for images:
- mimeType: MIME type of the image (e.g., image/jpeg).
- fileSize: Size of the file in bytes.
- width: Width of the image in pixels.
- height: Height of the image in pixels.
audioMetadata for audio files:
- duration_seconds (float): Audio duration in seconds.
- file_size (int): Size of the audio file in bytes.
- mime_type (str): MIME type (e.g., audio/mpeg, audio/wav).
- sample_rate (int): Sample rate in Hz.
- bit_depth (int): Size of each sample in bits.
- codec (str): Codec used (e.g., mp3, pcm).
- num_channels (int): Number of audio channels.
videoMetadata for videos:
- fps: Frames per second.
- duration: Duration in seconds.
- width / height: Dimensions in pixels.
- file_size: File size in bytes.
- mime_type: File type (MIME standard).

{
  "images": [
    {
      "objectUrl": "s3://my_image.jpg",
      "imageMetadata": {
        "mimeType": "image/jpg",
        "fileSize": 124,
        "width": 640,
        "height": 480
      }
    }
  ]
}

Check Data Registration Status

You can check the progress of the processing job by clicking the bell icon in the top right corner of the Encord app.

A spinning progress indicator shows that the processing job is still in progress.
If successful, the processing completes with a green tick icon.
If unsuccessful, there is a red cross icon, as seen below.

If the upload is unsuccessful, ensure that:

Your provider permissions are set correctly
The object data format is supported
The upload JSON or CSV file is correctly formatted.

Check which files failed to upload by clicking the Export icon to download a CSV log file. Every row in the CSV corresponds to a file which failed to be uploaded.

You only see failed uploads if the Ignore individual file errors toggle was not enabled during cloud data registration.

Helpful Scripts and Examples

Use the following examples and helpful scripts to quickly learn how to create JSON and CSV files formatted for uploading cloud data to Encord, by constructing the URLs from the specified path in your private storage.

AWS S3

AWS S3 object URLs can follow a few set patterns:

Virtual-hosted style: https://<bucket-name>.s3.<region>.amazonaws.com/<key-name>
Path-style: https://s3.<region>.amazonaws.com/<bucket-name>/<key-name>
S3 protocol: S3://<bucket-name>/<key-name>
Legacy: those without regions or those with S3-<region> in the URL

AWS best practice is to use Virtual-hosted style. Path-style is planned to be deprecated and the legacy URLs are already deprecated.We support Virtual-hosted style, Path-style and S3 protocol object URLs. We recommend you use Virtual-hosted style object URLs wherever possible.Object URLs can be found in the Properties tab of the object in question. Navigate to AWS S3 > bucket > object > Properties to find the Object URL.

Here’s a python script which creates a JSON file for single images by constructing the URLs from the specified path in a given S3 bucket. You’ll need to configure the following variables to match your setup.

region: the AWS region where your S3 bucket is.
aws_profile: the name of the profile in the AWS ~/.aws/credentials file. See AWS Credentials Documentation to properly set up the credentials file.
bucket_name: the name of your S3 bucket you want to pull files from.
s3_directory: the path to the directory in the S3 bucket where your files are stored.

In this Amazon S3 Virtual-hosted style URLs example, my-bucket is the bucket name, us-west-2 is the region, and images/dogs is the S3 directory:

https://my-bucket.s3.us-west-2.amazonaws.com/images/dogs/puppy.png

And the script itself:

import boto3
import json
from pathlib import Path
region = 'FILL_ME_IN'
aws_profile = 'FILL_ME_IN'
bucket_name = 'FILL_ME_IN'
s3_directory = Path('FILL_ME_IN')
root_url = f'https://{bucket_name}.s3.{region}.amazonaws.com'
session = boto3.Session(profile_name=aws_profile)
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket_name)
images = []
for object_summary in bucket.objects.filter(Prefix=s3_directory.as_posix() + '/'):
    object_file = Path(object_summary.key)
    if object_file.parent == s3_directory:
        object_url = f'{root_url}/{object_summary.key}'
        images.append({'objectUrl': object_url})
data_dict = {'images': images}
output_file = Path('upload_images.json')
output_file.write_text(json.dumps(data_dict, indent=2))

GCP Storage

The following Python scripts generate a JSON file for uploading cloud data to Encord, specifically for a designated GCP Storage bucket. The resulting JSON file includes images, videos, audio, pdfs, and text files.There are two scripts below. The first script only shares the assets’ links and optional client metadata. The second, recommend approach also includes asset metadata for quicker file uploads (as well as preventing ingestion costs from your cloud provider).

To run these scripts, you must have gsutil installed.

Before using the scripts, make sure to:

Specify your bucket name in the bucket_name variable.
Decide which GCP authentication method to use. There are 3 options:
- Option 1: Hard code the service account authentication JSON.
- Option 2: Create a .json file on your computer and provide the path to the service account authentication JSON.
- Option 3: If you are already authenticated with gcloud CLI the script gets the credentials from your environment.
Optionally, modify the name of the output file, currently set to json_upload_for_encord.json.

import json
from pathlib import Path
from typing import TypedDict, Optional
from google.cloud import storage
from google.oauth2 import service_account
import typer
from typer import Argument, Option

app = typer.Typer()

# Authenticate with GCP
# OPTION 1: Get GCP credentials from the JSON
service_account_info = {
    "type": "service_account",
    "project_id": "your_google_service_project_id",
    "private_key_id": "your_google_secrets_private_key_id",
    "private_key": "your_google_secrets_private_key",
    "client_email": "your_client_email",
    "client_id": "your_client_id",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_x509_cert_url": "..."
}


class Entry(TypedDict):
    objectUrl: str
    title: str
    clientMetadata: dict

@app.command()
def main(
    bucket_name: str = Argument(..., help="Name of the GCS bucket"),
    bucket_folder: str = Option("", "--bucket-folder", "-f", help="Folder path within the bucket"),
    output_path: str = Option("json_upload_for_encord.json", "--output", "-o", help="Output JSON filename"),
    output_dir: str = Option(".", "--output-dir", "-d", help="Output directory path"),
    service_account_info: Optional[str] = Option(None, "--service-account-info", help="Service account JSON"),
    service_account_file: Optional[str] = Option(None, "--service-account", "-s", help="Path to service account JSON file"),
    add_custom_metadata: bool = Option(False, "--custom-metadata", help="Add custom metadata (you can modify the function)"),
    skip_duplicates: bool = Option(True, "--skip-duplicates/--no-skip-duplicates", help="Skip duplicate URLs in output")
) -> None:
    """
    Generate JSON upload file for Encord from GCS bucket contents.

    Example usage:
    python create_json_upload.py my-bucket --bucket-folder my/folder
    python create_json_upload.py my-bucket -f data/images -o output.json
    """

    # Convert output_dir string to Path
    outpath = Path(output_dir)

    print(f"Processing bucket: {bucket_name}")
    if bucket_folder:
        print(f"Folder: {bucket_folder}")

    # Get credentials from authentication JSON
    credentials = None
    if service_account_info:
        try:
            credentials = service_account.Credentials.from_service_account_info(json.loads(service_account_info))
            print("Using service account from provided JSON string.")
        except (json.JSONDecodeError, Exception) as e:
            print(f"Failed to load service account from string: {e}")
            raise typer.Exit(code=1)
    elif service_account_file:
        try:
            credentials = service_account.Credentials.from_service_account_file(service_account_file)
            print(f"Using service account from: {service_account_file}")
        except Exception as e:
            print(f"Failed to load service account file: {e}")
            raise typer.Exit(code=1)

    # Instantiate storage client
    try:
        if credentials:
            storage_client = storage.Client(credentials=credentials)
        else:
            storage_client = storage.Client()
            print("Using default credentials")
    except Exception as auth_error:
        print(
            "Authentication failed. Try one of the following:\n"
            "1. Run: gcloud auth application-default login\n"
            "2. Use --service-account flag with path to service account JSON\n"
            "3. Set GOOGLE_APPLICATION_CREDENTIALS environment variable"
        )
        print(f"Error: {auth_error}")
        raise typer.Exit(code=1)

    # Access the bucket
    try:
        bucket = storage_client.get_bucket(bucket_name)
    except Exception as e:
        print(f"Failed to access bucket '{bucket_name}'. Make sure it exists and you have permission.")
        print(f"Error: {e}")
        raise typer.Exit(code=1)

    # Ensure the folder path ends with a slash if it's not empty
    if bucket_folder and not bucket_folder.endswith('/'):
        bucket_folder = f"{bucket_folder}/"

    print(f"Fetching objects from bucket: {bucket_name}, folder: {bucket_folder or '(root)'}")

    # Get all blobs with the specified folder prefix
    blobs = bucket.list_blobs(prefix=bucket_folder)
    outpath.mkdir(parents=True, exist_ok=True)

    videos: list[Entry] = []
    images: list[Entry] = []
    audio: list[Entry] = []
    text: list[Entry] = []
    pdfs: list[Entry] = []

    # File extension mappings
    image_extensions = {'.png', '.jpg', '.jpeg', '.tiff'}
    video_extensions = {'.mp4', '.mov', '.webm'}
    audio_extensions = {'.mp3', '.wav', '.m4a'}
    text_extensions = {'.txt'}
    pdf_extensions = {'.pdf'}

    processed_count = 0
    skipped_count = 0

    for blob in blobs:
        # Skip directory markers
        if blob.name.endswith("/"):
            skipped_count += 1
            continue

        # Skip if the blob is not actually in the specified folder (prevents partial prefix matches)
        if bucket_folder and not blob.name.startswith(bucket_folder):
            skipped_count += 1
            continue

        p = Path(blob.name)

        # Generate client metadata
        client_metadata = {}

        # Add custom metadata based on your needs
        if add_custom_metadata:
            # You can modify this function to add your own metadata
            client_metadata = generate_custom_metadata(blob.name)

        # Build the GCS URL
        gs_url = f"gs://{bucket_name}/{blob.name}"
        file_name = p.name
        suffix_lower = p.suffix.lower()

        entry = Entry(
            objectUrl=gs_url,
            title=file_name,
            clientMetadata=client_metadata
        )

        # Categorize by file type
        if suffix_lower in video_extensions:
            videos.append(entry)
            processed_count += 1
        elif suffix_lower in image_extensions:
            images.append(entry)
            processed_count += 1
        elif suffix_lower in audio_extensions:
            audio.append(entry)
            processed_count += 1
        elif suffix_lower in text_extensions:
            text.append(entry)
            processed_count += 1
        elif suffix_lower in pdf_extensions:
            pdfs.append(entry)
            processed_count += 1
        else:
            print(f"  Unknown file type: {p.suffix} ({file_name})")
            skipped_count += 1

    # Create output JSON
    output_data = {
        "images": images,
        "videos": videos,
        "audio": audio,
        "text": text,
        "pdfs": pdfs,
        "skip_duplicate_urls": skip_duplicates
    }

    # Save results to JSON
    output_file = outpath / output_path
    output_file.write_text(json.dumps(output_data, indent=4))

    print(f"\nJSON file saved to: {output_file.absolute()}")

def generate_custom_metadata(blob_name: str) -> dict:
    """
    Custom function to generate metadata based on blob name.
    Modify this function based on your needs.
    """
    metadata = {}

    # Example: Add file path components as metadata
    p = Path(blob_name)
    if len(p.parts) > 1:
        metadata["folder"] = str(p.parent)

    # Add more custom logic here as needed
    # For example:
    # - Extract date from filename
    # - Add tags based on folder structure
    # - Parse metadata from filename patterns

    return metadata


if __name__ == "__main__":
    app()

Get Started

Set up Integrations

Work with Data

Ontologies, Workflows, and Projects

Annotate & Review

Curate and Explore Data

Export labels

Analytics & Metrics

Evaluate Model Performance

Manage Users and Workspaces

​Register Cloud Data to Files

​STEP 1: Create a JSON or CSV File for Registration

​JSON Format

​Videos

​Video Metadata

​Audio Files

​Audio Metadata

​PDFs

​PDF Metadata

​Text Files

​Text Metadata

​Single Images

​Image Metadata

​Image groups

​Image Sequences

​DICOM

​NIfTI

​Data Groups

​Custom Layout

​Settings

​Scenes

​Use a Multi-Region Access Point

​CSV Format

​Videos

​Image groups

​Image sequences

​Multiple file types

​STEP 2: Register Your Cloud Data

​Custom Metadata

​Metadata Schema

​Custom metadata

​Metadata schema table

​Import your metadata schema to Encord

​Verify your schema

​Update Custom Metadata (JSON)

​Custom Embeddings

​Step 1: Create a New Embedding Type

​Step 2: Upload Embeddings

​Import while importing videos

​Update specific videos

​Import while importing data units

​Import specific data units

​How To Increase File Registration Speed

​Check Data Registration Status

​Helpful Scripts and Examples

Register Cloud Data to Files

STEP 1: Create a JSON or CSV File for Registration

JSON Format

Videos

Video Metadata

Audio Files

Audio Metadata

PDFs

PDF Metadata

Text Files

Text Metadata

Single Images

Image Metadata

Image groups

Image Sequences

DICOM

NIfTI

Data Groups

Custom Layout

Settings

Scenes

Use a Multi-Region Access Point

CSV Format

Videos

Image groups

Image sequences

Multiple file types

STEP 2: Register Your Cloud Data

Custom Metadata

Metadata Schema

Custom metadata

Metadata schema table

Import your metadata schema to Encord

Verify your schema

Update Custom Metadata (JSON)

Custom Embeddings

Step 1: Create a New Embedding Type

Step 2: Upload Embeddings

Import while importing videos

Update specific videos

Import while importing data units

Import specific data units

How To Increase File Registration Speed

Check Data Registration Status

Helpful Scripts and Examples