De-Identification of Medical Images Using Amazon Comprehend Medical and Amazon Rekognition


Medical imaging is a crucial component of contemporary healthcare, empowering clinicians to access vital information about patients for accurate diagnoses and treatment plans. The transition to digital medical images has significantly enhanced our capacity to store, share, view, search, and organize these images, greatly benefiting medical professionals. The variety of medical image modalities has also expanded—from CT scans and MRIs to digital pathology and ultrasounds—resulting in extensive collections of medical data stored in image archives.

These medical images play an essential role in medical research as well. With the aid of machine learning, researchers at leading medical institutions can analyze vast numbers of images to gain deeper insights into medical conditions. However, healthcare providers face the challenge of utilizing these images while adhering to regulations such as the Health Insurance Portability and Accountability Act (HIPAA). Often, medical images include Protected Health Information (PHI) embedded as text within the images themselves. Traditionally, the removal of PHI, known as de-identification, required labor-intensive manual inspection and editing, making it a time-consuming and costly process.

In 2017, Amazon Web Services (AWS) introduced a straightforward method for detecting and extracting text from images through the machine learning service Amazon Rekognition. The following year, AWS launched Amazon Comprehend Medical, a Natural Language Processing (NLP) service designed to identify and detect PHI in text. By combining these two services with some Python code, as outlined in this blog post, you can efficiently and economically identify, redact, and de-identify PHI from medical images.

De-Identification Architecture

In this example, we will utilize the Jupyter Notebooks feature of Amazon SageMaker to create an interactive notebook with Python code. Amazon SageMaker is a comprehensive machine learning platform that enables quick preparation of training data and model development using pre-built Jupyter notebooks with algorithms. For this demonstration, we will employ Amazon Rekognition to extract text from images and Amazon Comprehend Medical to identify and detect PHI. All image files will be accessed from and saved to an Amazon Simple Storage Service (Amazon S3) bucket, an object storage service that provides top-tier scalability, data availability, security, and performance.
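The code excerpts later in this post assume that boto3 clients for these services were created at the top of the notebook. A minimal sketch of that setup, using the same variable names (rekognition, comprehendmedical, and s3) that appear in the snippets below, might look like this:

# Create clients for the services used in this walk-through. The notebook is assumed to run
# in a region where both Amazon Rekognition and Amazon Comprehend Medical are available.
import boto3

rekognition = boto3.client('rekognition')
comprehendmedical = boto3.client('comprehendmedical')

# The S3 resource API lets us download individual objects into memory and upload the redacted result.
s3 = boto3.resource('s3')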

When using Amazon Comprehend Medical for PHI detection, it’s important to note that the service provides confidence scores for each identified entity, indicating the accuracy of the detected information. Keep these scores in mind when reviewing identified entities to ensure they are suitable for your needs. For additional details on confidence scores, refer to the Amazon Comprehend Medical documentation.
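Each entity returned by the PHI detection operation carries a Score field alongside the detected text, its category, and its character offsets. The following illustrative snippet (the values shown are invented, not output from a real call) shows the shape of one such entity and how a confidence check might be applied:

# Illustrative example of a single entity from a DetectPHI response (values are made up)
entity = {
    'Text': 'John Doe',
    'Category': 'PROTECTED_HEALTH_INFORMATION',
    'Type': 'NAME',
    'Score': 0.9976,      # model confidence, between 0.0 and 1.0
    'BeginOffset': 12,    # character offsets into the submitted text
    'EndOffset': 20
}

# Entities below a chosen confidence threshold can be set aside for manual review
if entity['Score'] >= 0.90:
    print("{}: {} (score {:.2f})".format(entity['Type'], entity['Text'], entity['Score']))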

Using the Notebook

You can download the Jupyter Notebook associated with this blog post from GitHub. The notebook includes an example chest x-ray image sourced from a dataset made publicly available by the NIH Clinical Center. For more information on the dataset, see the NIH Clinical Center’s CVPR 2017 paper.

At the start of the notebook, you will find five parameters you can modify to control the de-identification process outlined in this example.

  • bucket specifies the Amazon S3 bucket from which images will be read and written.
  • object indicates the specific image you want to de-identify. Supported formats include PNG, JPG, or DICOM. If the image ends with the extension .dcm, it will be treated as a DICOM image, and the ImageMagick utility will convert it to PNG format prior to processing.
  • redacted_box_color sets the color used to obscure identified PHI text within the image.
  • dpi determines the dpi setting for the output image.
  • phi_detection_threshold establishes the threshold for the confidence score (ranging from 0.00 to 1.00). Text identified by Amazon Comprehend Medical must meet or exceed this minimum confidence score to be redacted from the output image. A default value of 0.00 will redact all text detected as PHI, irrespective of the confidence score.
# Define the S3 bucket and object for the medical image we want to analyze. Also define the color used for redaction.
bucket='yourbucket'
object='yourimage.dcm'
redacted_box_color='red'
dpi = 72
phi_detection_threshold = 0.00

Upon configuring these parameters, you can run all the cells in the Jupyter Notebook. The first cell checks if the specified image file is in DICOM format and converts it to PNG if necessary, followed by reading the file from S3 into memory.

# If the image is in DICOM format, convert it to PNG
if object.split(".")[-1] == "dcm":
    local_file = object.split("/")[-1]
    ! aws s3 cp s3://{bucket}/{object} .
    # ImageMagick's convert writes the PNG to an explicit output file name
    ! convert {local_file} {local_file}.png
    ! aws s3 cp {local_file}.png s3://{bucket}/{object}.png
    object = object + '.png'
    print(object)
…
# Download the image from S3 and hold it in memory
img_bucket = s3.Bucket(bucket)
img_object = img_bucket.Object(object)
xray = io.BytesIO()
img_object.download_fileobj(xray)
xray.seek(0)  # rewind the in-memory buffer before reading it back
img = np.array(Image.open(xray), dtype=np.uint8)

Next, the image can be sent to Amazon Rekognition for text detection using the DetectText feature. Amazon Rekognition returns a JSON object that includes a list of detected text blocks along with their bounding box coordinates within the image.

response=rekognition.detect_text(Image={'Bytes':xray.getvalue()})
textDetections=response['TextDetections']
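The text sent to Amazon Comprehend Medical in the next step is built from these detections. One possible way to assemble it, sketched below under the assumption that only LINE-level detections are used (the notebook may structure this differently), is to concatenate the detected lines into a single string while recording which character offsets belong to which bounding box. The offset_map name is introduced here purely for illustration:

# Concatenate detected lines into one text block and remember, for each line,
# the character span it occupies and its bounding box within the image.
textblock = ''
offset_map = []   # list of (begin_offset, end_offset, bounding_box)

for detection in textDetections:
    if detection['Type'] == 'LINE':
        begin = len(textblock)
        textblock += detection['DetectedText'] + ' '
        offset_map.append((begin, len(textblock), detection['Geometry']['BoundingBox']))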

After acquiring all the detected text, we can send it to Amazon Comprehend Medical, using the DetectPHI operation, to identify which text blocks may contain PHI. The service returns a JSON object containing the entities that may be PHI, the category of information detected (such as name, date, address, or ID), and a confidence score for each detection. This information tells us which bounding boxes contain PHI.

philist = comprehendmedical.detect_phi(Text=textblock)
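The entities returned by DetectPHI refer to character offsets within the submitted text, not to image coordinates, so they still need to be mapped back to bounding boxes. Below is a hedged sketch of that mapping, reusing the illustrative offset_map from the earlier sketch together with the phi_detection_threshold parameter to build the phi_boxes_list consumed in the next step:

# Collect the bounding box of every detected line that overlaps a PHI entity
# whose confidence score meets the configured threshold.
phi_boxes_list = []

for entity in philist['Entities']:
    if entity['Score'] < phi_detection_threshold:
        continue
    for begin, end, box in offset_map:
        # Does this detected line overlap the character span of the PHI entity?
        if begin < entity['EndOffset'] and end > entity['BeginOffset']:
            if box not in phi_boxes_list:
                phi_boxes_list.append(box)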

Once we ascertain which areas of the image might include PHI text, we can overlay redaction boxes on those sections.

# The figure and axes are assumed to have been created earlier in the notebook,
# e.g. fig, ax = plt.subplots() followed by ax.imshow(img, cmap='gray').
# Rekognition bounding boxes are ratios of the image dimensions, so multiply by the
# pixel width (img.shape[1]) and height (img.shape[0]) to get pixel coordinates.
for box in phi_boxes_list:
    x = img.shape[1] * box['Left']
    y = img.shape[0] * box['Top']
    width = img.shape[1] * box['Width']
    height = img.shape[0] * box['Height']
    rect = patches.Rectangle((x, y), width, height, linewidth=0,
                             edgecolor=redacted_box_color, facecolor=redacted_box_color)
    ax.add_patch(rect)

The final de-identified image is then saved to the specified S3 bucket in PNG format, with “de-id-” prepended to the original file name.
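One way that final save could be written, assuming the redaction boxes were drawn on a matplotlib figure named fig (as in the snippet above) and with the helper names output and deid_key introduced only for this illustration, is:

# Render the redacted figure to an in-memory PNG and upload it back to S3,
# prepending "de-id-" to the original file name.
output = io.BytesIO()
fig.savefig(output, format='png', dpi=dpi, bbox_inches='tight')
output.seek(0)

deid_key = '/'.join(object.split('/')[:-1] + ['de-id-' + object.split('/')[-1]])
img_bucket.put_object(Key=deid_key, Body=output, ContentType='image/png')
print('Saved de-identified image to s3://{}/{}'.format(bucket, deid_key))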

Conclusion

This blog post has illustrated the capabilities and efficiency of combining Amazon Comprehend Medical and Amazon Rekognition to de-identify medical images.
