DICOM files may contain sensitive and personally identifiable information (PII) about patients, including their name, date of birth, and medical record number. In these cases it is essential to anonymize each file to protect patient privacy and to comply with legal and ethical regulations related to healthcare data.
This is a premium feature and running it incurs additional costs. Contact us at support@encord.com to learn about pricing.
In this tutorial you will learn how to anonymize / de-identify DICOM files in two steps:
The Python code below is used to add criteria as well as call the de-identification function.
An SSH public / private key pair is required to use the sample script below. To learn how to generate one, see our documentation here.
Make sure to edit the criteria used to evaluate each file to suit your needs - any number of criteria can be used.
import glob
import os
from typing import List

from encord import EncordUserClient
from encord.objects.common import (
    DeidentifyRedactTextMode,
    SaveDeidentifiedDicomConditionIn,
    SaveDeidentifiedDicomConditionNotSubstr,
)


# Replace s3://EXAMPLE-BUCKET/raw/ with the path to the file storage you're using
def filelist_helper(directory, prefix="s3://EXAMPLE-BUCKET/raw/"):
    # Map every local .dcm file name onto the storage prefix
    pattern = os.path.join(directory, "*.dcm")
    return [prefix + os.path.basename(f) for f in glob.glob(pattern)]


# Add criteria to evaluate each file. See the 'Setting evaluation criteria' section below for more info.
criteria = [
    SaveDeidentifiedDicomConditionNotSubstr("PRIMARY", "ImageType"),
    SaveDeidentifiedDicomConditionIn(["ct", "pt", "nm", "mr", "mg", "pt"], "Modality"),
]


def deidentify(
    integration_title: str,
    dicom_urls: List[str],
) -> List[str]:
    # Authenticate with Encord using the path to your private key
    user_client = EncordUserClient.create_with_ssh_private_key(
        ssh_private_key_path="<private_key_path>"
    )

    # Find integration_hash for requested integration_title
    integration_hash = None
    for integration in user_client.get_cloud_integrations():
        if integration.title == integration_title:
            integration_hash = integration.id
    if not integration_hash:
        raise Exception(f"Integration with integration_title={integration_title} not found")

    deidentified_dicom_urls = []
    # 'dicom_urls' should be a single list containing the URLs of all instances
    # of a series to be de-identified. Splitting a series into multiple lists
    # might lead to inaccurate results and is therefore not recommended.
    deidentified_dicom_url = user_client.deidentify_dicom_files(
        dicom_urls=dicom_urls,
        integration_hash=integration_hash,
        redact_dicom_tags=True,
        redact_pixels_mode=DeidentifyRedactTextMode.REDACT_ALL_TEXT,
        save_conditions=criteria,
        upload_dir="s3://EXAMPLE-BUCKET/output",
    )
    print(f"Deidentified url: {deidentified_dicom_url}")
    deidentified_dicom_urls += deidentified_dicom_url
    return deidentified_dicom_urls


# Replace MY-ENCORD-INTEGRATION with the title of your private cloud integration
_integration_title = "MY-ENCORD-INTEGRATION"
# Assumption: the URL list is built from a local directory whose file names
# mirror the bucket contents. Replace this line with your own list of URLs if
# that does not match your setup.
_dicom_urls = filelist_helper("<local_dicom_dir>")
_deidentified_dicom_urls = deidentify(
    _integration_title,
    _dicom_urls,
)
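To make the shape of the URL list concrete, the sketch below runs a helper equivalent to filelist_helper against a temporary directory. The file names are made up for illustration, the results are sorted only to make the output predictable, and the helper assumes that the local directory mirrors the contents of the bucket. It does not contact Encord or any cloud storage.

```python
import glob
import os
import tempfile


def filelist_helper(directory, prefix="s3://EXAMPLE-BUCKET/raw/"):
    """Map every local .dcm file name onto the storage prefix."""
    pattern = os.path.join(directory, "*.dcm")
    return sorted(prefix + os.path.basename(f) for f in glob.glob(pattern))


def demo_urls():
    # Hypothetical file names, created only for this demonstration.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("slice_001.dcm", "slice_002.dcm", "notes.txt"):
            open(os.path.join(tmp, name), "w").close()
        return filelist_helper(tmp)


# Only the .dcm files are turned into URLs; notes.txt is ignored.
print(demo_urls())
```

All URLs that belong to one series should end up in the same list before it is passed to deidentify.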
Evaluation criteria are conditions that determine whether a file will be de-identified or not. Criteria can take many forms, but will always return either ‘true’ or ‘false’.
All strings and inputs into the criteria functions are case-insensitive.
There are two distinct criteria functions:
“SaveDeidentifiedDicomConditionNotSubstr” returns ‘true’ if the first argument (PRIMARY in the example above) is not contained in the value of the DICOM tag named by the second argument (ImageType in the example above). In plain English, the example above checks whether the file’s ImageType does not contain the word ‘PRIMARY’, and returns ‘true’ if this condition is fulfilled.
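The behaviour described for this criterion can be modelled in a few lines of plain Python. This is only an illustration of the description above, not Encord's implementation: the function name not_substr_condition and the sample ImageType values are hypothetical.

```python
def not_substr_condition(value: str, tag_value: str) -> bool:
    """Return True when `value` does not occur in `tag_value`, ignoring case."""
    return value.lower() not in tag_value.lower()


# Hypothetical ImageType values; DICOM separates multiple values with a backslash.
primary_image = "ORIGINAL\\PRIMARY\\AXIAL"
secondary_image = "DERIVED\\SECONDARY"

print(not_substr_condition("PRIMARY", primary_image))    # False: 'PRIMARY' is present
print(not_substr_condition("primary", secondary_image))  # True: 'PRIMARY' is absent
```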
“SaveDeidentifiedDicomConditionIn” returns ‘true’ if any entry of the first argument ([“ct”,“pt”,“nm”,“mr”,“mg”,“pt”] in the example above) is contained in the value of the DICOM tag named by the second argument (Modality in the example above). In plain English, the example above checks whether any of the strings in the list is contained in the file’s Modality. If any one of them is, the function returns ‘true’.
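As a rough model of this criterion, the sketch below follows the description above literally: it returns true when any string in the list is contained in the tag value, ignoring case. The function name in_condition and the sample Modality values are hypothetical, and the code is not Encord's implementation.

```python
from typing import List


def in_condition(values: List[str], tag_value: str) -> bool:
    """Return True when any entry of `values` occurs in `tag_value`, ignoring case."""
    tag_value = tag_value.lower()
    return any(v.lower() in tag_value for v in values)


modalities = ["ct", "pt", "nm", "mr", "mg", "pt"]

print(in_condition(modalities, "CT"))  # True: 'ct' matches regardless of case
print(in_condition(modalities, "US"))  # False: ultrasound is not in the list
```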