Exploring Accessible Audio Descriptions with Amazon Nova


As reported by the World Health Organization, over 2.2 billion individuals worldwide experience some form of vision impairment. To comply with disability laws, such as the Americans with Disabilities Act (ADA) in the U.S., visual media like television shows and movies must offer accessible formats for those with visual disabilities. Typically, this is achieved through audio description tracks that provide narration of significant visual details. However, producing these tracks can be labor-intensive and costly, requiring a range of specialists, including scriptwriters, engineers, and voice actors. According to the International Documentary Association, the cost can exceed $25 per minute. This raises the question: can generative AI tools from Amazon Web Services (AWS) automate this process?

Despite an increase in the number of audio-described shows and films, a significant portion of video content remains inaccessible to visually impaired audiences. The primary obstacle to expanding audio-described content is the high cost. Reducing this barrier through generative AI could lead to a surge in the availability of accessible content.

The Amazon Nova family of foundation models, available through Amazon Bedrock, includes three multimodal models that can facilitate this process:

  • Amazon Nova Lite (GA): An economical multimodal model that quickly processes image, video, and text inputs.
  • Amazon Nova Pro (GA): A robust multimodal model that balances accuracy, speed, and cost across various tasks.
  • Amazon Nova Premier (GA): The most capable model, designed for complex tasks and for use as a teacher in model distillation.

In this article, we illustrate how we utilized Amazon Nova, alongside Amazon Rekognition and Amazon Polly, to automate the generation of accessible audio descriptions for video content. This method can greatly lower the time and expenses associated with making videos accessible for visually impaired viewers. It’s important to note that this article represents an initial exploration into automating audio description creation and doesn’t offer a fully deployable solution.

To showcase the potential, we provide pseudocode snippets and step-by-step guidance, along with comprehensive explanations and resource links. The automated workflow discussed here involves analyzing video content, producing text descriptions, and generating audio using AI voice synthesis. By the conclusion of this article, you will gain insight into the key tools necessary for further experimentation as you develop a production-ready solution tailored to your needs.

Solution Overview

The architecture diagram below illustrates the complete workflow of the proposed solution. Each component will be described in detail later, but you can define the logic within a single script, which can be executed on an Amazon Elastic Compute Cloud (Amazon EC2) instance or your local machine. For this article, we assume you will run the script on an Amazon SageMaker notebook.

As you experiment with audio description generation, keep the architecture’s components in mind. In a production environment, you will also need to account for scaling, security, and storage requirements, so the architecture could grow into a more intricate system than the basic diagram presented here.

Services Used

The services depicted in the architecture include:

  • Amazon S3: Amazon Simple Storage Service (Amazon S3) is an object storage service that offers scalable, durable, and highly available storage. In this scenario, we utilize Amazon S3 to store video files (as input), scene descriptions (text files), and audio descriptions (MP3 files) generated by the solution. The script begins by retrieving the source video from an S3 bucket.
  • Amazon Rekognition: This computer vision service detects technical cues (such as black frames and color bars) and shot boundaries, which the solution uses to identify and extract video segments or scenes. To improve the accuracy of the generated video descriptions, Amazon Rekognition segments the source video into smaller clips before they are passed to Amazon Nova. These video segments can be stored temporarily on your compute environment.
  • Amazon Bedrock: This managed service provides access to large, pre-trained AI models, including the Amazon Nova Pro model, which is employed in this solution to analyze each video segment’s content and produce detailed scene descriptions. You can save these text descriptions in a text file (e.g., video_analysis.txt).
  • Amazon Polly: As a text-to-speech service, Amazon Polly converts the text descriptions generated by the Amazon Nova Pro model into high-quality audio in MP3 format. A brief sketch of the Amazon Bedrock and Amazon Polly calls follows this list.
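The snippet below is a minimal sketch of how the Amazon Bedrock and Amazon Polly calls could look, sending one locally stored video segment to Amazon Nova Pro through the Bedrock Converse API and synthesizing the returned description with Amazon Polly. The file names, prompt text, and voice ID are illustrative assumptions, not part of the solution’s actual code.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
polly = boto3.client("polly", region_name="us-east-1")

# Read a locally stored video segment (the file name is a placeholder)
with open("segment_001.mp4", "rb") as f:
    segment_bytes = f.read()

# Ask Amazon Nova Pro to describe the segment
response = bedrock.converse(
    modelId="amazon.nova-pro-v1:0",
    messages=[{
        "role": "user",
        "content": [
            {"video": {"format": "mp4", "source": {"bytes": segment_bytes}}},
            {"text": "Describe the key visual details of this scene for an audio description track."},
        ],
    }],
)
description = response["output"]["message"]["content"][0]["text"]

# Convert the description to speech and save it as an MP3 file
speech = polly.synthesize_speech(Text=description, OutputFormat="mp3", VoiceId="Joanna")
with open("segment_001_description.mp3", "wb") as f:
    f.write(speech["AudioStream"].read())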

Prerequisites

To follow the solution outlined in this article, ensure you have the following:

  • A video file; for this article, we use a public domain video titled “This is Coffee.”
  • An AWS account with access to the following services:
    • Amazon Rekognition
    • Amazon Nova Pro
    • Amazon S3
    • Amazon Polly
  • AWS credentials configured for your environment, for example by running aws configure with the AWS Command Line Interface (AWS CLI) or by setting environment variables (a quick credential check is sketched after this list).
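If you want to confirm that your credentials are picked up correctly before running the script, one optional check (not part of the original walkthrough) is to call AWS Security Token Service from Python:

import boto3

# Prints the account ID and IAM identity the SDK will use; raises an error if no credentials are found
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])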

To write the script, you’ll need access to an AWS Software Development Kit (AWS SDK) in your preferred programming language. This article assumes you will use the AWS SDK for Python (Boto3); more information is available in the Boto3 Quickstart.

You can use the AWS SDK to create, configure, and manage AWS services. For Boto3, include it at the beginning of your script using import boto3. Additionally, you will need a method for splitting videos; if you’re using Python, we recommend the moviepy library (import moviepy # pip install moviepy).
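As an illustration of what the splitting step could look like, the sketch below downloads the source video from Amazon S3 and cuts a single clip with moviepy. The bucket name, object key, and timestamps are placeholders, and the moviepy 1.x subclip API is assumed; in the full solution, the start and end times would come from the Amazon Rekognition segments described later.

import boto3
from moviepy.editor import VideoFileClip  # moviepy 1.x import style

s3 = boto3.client("s3")

# Retrieve the source video from S3 (bucket and key are placeholders)
s3.download_file("my-video-bucket", "input/this-is-coffee.mp4", "source_video.mp4")

# Cut one clip between two timestamps (in seconds) and write it to disk
with VideoFileClip("source_video.mp4") as clip:
    clip.subclip(12.0, 27.5).write_videofile("segment_001.mp4", audio=True)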

Solution Walkthrough

The solution encompasses the following fundamental steps, which you can use as a basic template and customize as needed for your use case.

  1. Define the requirements for the AWS environment, including specifying the Amazon Nova Pro model for its visual capabilities and the AWS Region you are utilizing. For optimal throughput, we recommend using inference profiles when configuring Amazon Bedrock to invoke the Amazon Nova Pro model.
  2. Initialize a client for Amazon Rekognition, which will assist with segmentation.
import boto3

class VideoAnalyzer:
    def __init__(self):
        self.aws_region = "us-east-1"
        # Amazon Nova Pro model ID; for higher throughput, a cross-region inference profile ID (for example, "us.amazon.nova-pro-v1:0") can be used instead
        self.model_id = "amazon.nova-pro-v1:0"
        self.chunk_delay = 20  # seconds to pause between segment analyses (helps avoid throttling)
        # AWS clients for model inference (Amazon Bedrock) and video segmentation (Amazon Rekognition)
        self.bedrock = boto3.client("bedrock-runtime", region_name=self.aws_region)
        self.rekognition = boto3.client("rekognition", region_name=self.aws_region)
  3. Create a function for detecting segments in the video (a sketch of this function appears after this step). Amazon Rekognition supports segmentation, allowing you to identify and extract different scenes within a video. By utilizing the Amazon Rekognition Segment API, you can perform the following:
    • Detect technical cues such as black frames, color bars, opening and end credits, and studio logos in a video.
    • Identify shot boundaries to determine the start, end, and duration of individual shots within the video.

    The solution employs Amazon Rekognition to divide the video into multiple segments, subsequently performing inference with Amazon Nova Pro on each segment. Finally, you can piece together the audio descriptions into a coherent narrative.
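The following is a minimal sketch of the segment-detection function referenced in step 3, using the Amazon Rekognition StartSegmentDetection and GetSegmentDetection APIs. The function name, the simple polling loop, and the omission of result pagination are simplifying assumptions rather than production-ready code.

import time

def detect_segments(rekognition, bucket, key):
    """Start shot and technical-cue detection on a video in S3 and return the detected segments."""
    job = rekognition.start_segment_detection(
        Video={"S3Object": {"Bucket": bucket, "Name": key}},
        SegmentTypes=["SHOT", "TECHNICAL_CUE"],
    )
    job_id = job["JobId"]

    # Poll until the asynchronous job finishes (in production, prefer an Amazon SNS completion notification)
    while True:
        result = rekognition.get_segment_detection(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(10)

    if result["JobStatus"] == "FAILED":
        raise RuntimeError(f"Segment detection failed: {result.get('StatusMessage', '')}")

    # Each segment carries StartTimestampMillis and EndTimestampMillis, which can drive the clipping step
    return result["Segments"]

Each returned segment’s start and end timestamps can then be converted to seconds and passed to the video-splitting step shown earlier, after which the per-segment descriptions and their audio can be stitched together in order.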
