Organizations in fields such as media and entertainment, advertising, social media, and education increasingly need effective ways to derive insights from video and to apply flexible, policy-driven evaluations. Generative artificial intelligence (AI) is opening new avenues for these applications. In this article, we present the Media Analysis and Policy Evaluation framework, which leverages AWS AI and generative AI services to streamline video extraction and assessment.
Popular Use Cases
Advertising technology firms often manage video content, including promotional materials. Their key priorities in video analysis involve ensuring brand safety, adhering to regulatory standards, and creating engaging content. This solution, built on AWS AI and generative AI services, addresses these needs. Enhanced content moderation helps ensure that advertisements are displayed alongside safe, compliant content, fostering trust among consumers. The solution can also assess videos against content compliance policies and generate compelling headlines and summaries, improving user engagement and ad performance.
Educational technology providers typically have extensive collections of training videos. A streamlined method for video analysis can assist them in evaluating content per industry standards, indexing videos for efficient search capabilities, and performing dynamic detection and redaction tasks, such as blurring student faces in a Zoom recording.
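As an illustration of such a redaction task, the sketch below detects faces in a sampled frame with Amazon Rekognition and blurs each detected region with Pillow. This is a minimal sketch, not the solution's implementation: the helper names and the blur radius are our own, and the Rekognition call requires AWS credentials.

```python
def to_pixel_box(bbox, width, height):
    """Convert a Rekognition relative BoundingBox to pixel coordinates."""
    left = int(bbox["Left"] * width)
    top = int(bbox["Top"] * height)
    right = int((bbox["Left"] + bbox["Width"]) * width)
    bottom = int((bbox["Top"] + bbox["Height"]) * height)
    return (left, top, right, bottom)


def blur_faces(image_path, out_path):
    """Detect faces with Amazon Rekognition and blur each region with Pillow."""
    import boto3  # imported lazily; requires AWS credentials
    from PIL import Image, ImageFilter  # imported lazily: optional dependency

    rekognition = boto3.client("rekognition")
    with open(image_path, "rb") as f:
        response = rekognition.detect_faces(Image={"Bytes": f.read()})

    img = Image.open(image_path)
    for face in response["FaceDetails"]:
        box = to_pixel_box(face["BoundingBox"], img.width, img.height)
        # Blur only the face region, then paste it back into the frame
        region = img.crop(box).filter(ImageFilter.GaussianBlur(radius=12))
        img.paste(region, box)
    img.save(out_path)
```

Rekognition returns bounding boxes as ratios of the image dimensions, which is why the pixel conversion step is needed before cropping.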
The solution is accessible via a GitHub repository and can be deployed to your AWS account using an AWS Cloud Development Kit (AWS CDK) package.
Solution Overview
Media Extraction – Once a video is uploaded, the application initiates preprocessing by extracting image frames. Each frame undergoes analysis via Amazon Rekognition and Amazon Bedrock for metadata extraction, while simultaneously, audio transcription is derived from the uploaded content utilizing Amazon Transcribe.
Policy Evaluation – With the extracted metadata, the system conducts LLM evaluations. This feature allows users to leverage the adaptability of LLMs to assess videos against dynamic policies.
The solution architecture follows a microservice design approach, enabling loosely coupled components that can be deployed together for the video analysis and policy evaluation workflow or independently for integration into existing systems.
The workflow comprises the following steps:
- Users access the frontend static website through Amazon CloudFront distribution, with static content hosted on Amazon Simple Storage Service (Amazon S3).
- Users log into the frontend web application, authenticated by an Amazon Cognito user pool.
- Users upload videos directly to Amazon S3 from their browser using multi-part pre-signed Amazon S3 URLs.
- The frontend UI communicates with the extraction microservice through a RESTful interface provided by Amazon API Gateway, offering CRUD (create, read, update, delete) functionalities for managing video tasks.
- An AWS Step Functions state machine supervises the analysis process, transcribing audio with Amazon Transcribe, sampling image frames with moviepy, and analyzing each image using Anthropic Claude 3 Sonnet for image summarization, while generating text embedding and multimodal embedding at the frame level using Amazon Titan models.
- An Amazon OpenSearch Service cluster stores the extracted video metadata, facilitating users’ search and discovery needs. The UI constructs evaluation prompts and sends them to Amazon Bedrock LLMs, retrieving evaluation results synchronously.
- Users can select existing template prompts via the solution UI, customize them, and initiate policy evaluation utilizing Amazon Bedrock.
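To make the evaluation step concrete, the sketch below assembles extracted metadata into a prompt and invokes a Claude model on Amazon Bedrock synchronously via the Messages API. The prompt wording and helper names here are illustrative assumptions, not the solution's actual templates; the model ID shown is one of the Claude 3 identifiers available on Bedrock, and the call requires AWS credentials and model access.

```python
import json


def build_evaluation_prompt(policy, transcript, frame_summaries):
    """Assemble a policy and extracted metadata into one evaluation prompt.

    The prompt wording is illustrative, not the solution's actual template.
    """
    frames = "\n".join(f"- {s}" for s in frame_summaries)
    return (
        "You are a content-policy reviewer.\n"
        f"Policy:\n{policy}\n\n"
        f"Transcript:\n{transcript}\n\n"
        f"Frame summaries:\n{frames}\n\n"
        "Answer with PASS or FAIL and a one-sentence justification."
    )


def evaluate(policy, transcript, frame_summaries,
             model_id="anthropic.claude-3-sonnet-20240229-v1:0"):
    """Invoke a Claude model on Amazon Bedrock synchronously."""
    import boto3  # imported lazily; requires AWS credentials

    client = boto3.client("bedrock-runtime")
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": build_evaluation_prompt(policy, transcript, frame_summaries),
        }],
    }
    response = client.invoke_model(modelId=model_id, body=json.dumps(body))
    return json.loads(response["body"].read())["content"][0]["text"]
```

Because the evaluation is a single synchronous model invocation over already-extracted metadata, users can edit the policy text and re-run it without reprocessing the video.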
In the subsequent sections, we will delve deeper into the key components and microservices of the solution.
Website UI
The solution includes a website that enables users to browse videos and manage the uploading procedure through a user-friendly interface. It showcases details of the extracted video information and features a lightweight analytics UI for dynamic LLM analysis.
Extracting Information from Videos
The backend extraction service manages the asynchronous extraction of video metadata, encompassing both visual and audio components, such as identifying objects, scenes, text, and human faces. The audio component is particularly vital for videos featuring active narratives, as it often holds valuable insights.
Developing a robust solution for video information extraction presents challenges from both machine learning (ML) and engineering perspectives. From an ML standpoint, our goal is to achieve generalized information extraction that serves as factual data for subsequent analysis. On the engineering side, handling video sampling with concurrency, ensuring high availability, and offering flexible configuration options, along with creating an extendable architecture to accommodate additional ML model plugins, demands considerable effort.
The extraction service employs Amazon Transcribe to convert audio from the video into text in subtitle formats. Various techniques are utilized for visual extraction:
- Frame Sampling – A classic method for analyzing the visual aspect of a video involves capturing screenshots at specific intervals and applying ML models to extract data from each frame. Our solution incorporates sampling with the following considerations:
- The solution supports configurable intervals for fixed sampling rates.
- An advanced smart sampling option leverages the Amazon Titan Multimodal Embeddings model to conduct similarity searches against frames sampled from the same video, identifying similar images and filtering out redundancies to optimize performance and cost.
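A minimal in-memory sketch of this deduplication idea follows, using cosine similarity over frame embeddings. The deployed solution performs the similarity search against the frames indexed for the same video; the 0.95 threshold here is our own illustrative assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def deduplicate_frames(embeddings, threshold=0.95):
    """Keep a frame only if it is not too similar to any frame already kept.

    `embeddings` maps frame index -> embedding vector (e.g. from the Amazon
    Titan Multimodal Embeddings model). Returns the indices of retained frames.
    """
    kept = []
    for idx, emb in embeddings.items():
        if all(cosine_similarity(emb, embeddings[k]) < threshold for k in kept):
            kept.append(idx)
    return kept
```

Dropping near-duplicate frames before the per-frame analysis step reduces both the number of Rekognition calls and the number of Bedrock invocations, which is where the cost savings come from.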
- Extracting Information from Image Frames – The solution processes sampled images concurrently, applying various ML features to extract information from each image:
- Recognize celebrity faces using the Amazon Rekognition celebrity API.
- Detect generic objects and labels using the Amazon Rekognition label detection API.
- Identify text using the Amazon Rekognition text detection API.
- Flag inappropriate content with the Amazon Rekognition moderation API.
- Summarize the image frame using the Anthropic Claude 3 Haiku model.
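The per-frame Rekognition calls above can be sketched as follows. The `filter_labels` helper and the 80 percent confidence threshold are our own illustrative choices; the Claude 3 Haiku summarization call would follow the standard Amazon Bedrock invocation pattern and is omitted here for brevity.

```python
def filter_labels(response, min_confidence=80.0):
    """Keep only label names at or above the confidence threshold."""
    return [label["Name"] for label in response.get("Labels", [])
            if label["Confidence"] >= min_confidence]


def extract_frame_metadata(bucket, key):
    """Run the Rekognition image APIs used by the solution against one frame."""
    import boto3  # imported lazily; requires AWS credentials

    rekognition = boto3.client("rekognition")
    image = {"S3Object": {"Bucket": bucket, "Name": key}}
    return {
        "celebrities": rekognition.recognize_celebrities(Image=image),
        "labels": filter_labels(rekognition.detect_labels(Image=image)),
        "text": rekognition.detect_text(Image=image),
        "moderation": rekognition.detect_moderation_labels(Image=image),
    }
```

Each call operates on the same frame reference in Amazon S3, so the four analyses can also be issued concurrently per frame, matching the concurrent processing described above.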
The extraction service is implemented using Amazon Simple Queue Service (Amazon SQS) and Step Functions to manage concurrent video processing, allowing for configurable settings based on your account’s service quota limits and performance requirements.
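For the audio track, a minimal sketch of the Amazon Transcribe invocation, including the subtitle output the extraction service relies on, might look like the following. The job naming and output location are illustrative, and the call requires AWS credentials.

```python
def media_uri(bucket, key):
    """Build the S3 URI Transcribe expects for the uploaded video."""
    return f"s3://{bucket}/{key}"


def start_transcription(job_name, bucket, key, language="en-US"):
    """Start an Amazon Transcribe job that also emits subtitle files."""
    import boto3  # imported lazily; requires AWS credentials

    transcribe = boto3.client("transcribe")
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": media_uri(bucket, key)},
        LanguageCode=language,
        # Request subtitle output alongside the plain transcript
        Subtitles={"Formats": ["srt", "vtt"]},
        OutputBucketName=bucket,
    )
```

The job is asynchronous; in the solution, the Step Functions state machine polls or waits for completion before the transcript is indexed with the visual metadata.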
Searching Videos
Efficiently locating videos within your inventory is critical, and an effective search capability is essential for successful content management. The solution stores the extracted metadata, including transcripts, frame summaries, and the frame-level text and multimodal embeddings, in Amazon OpenSearch Service, enabling both keyword and semantic search across your video library.
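A frame-level similarity search against the cluster might look like the following sketch, assuming the OpenSearch k-NN plugin is enabled; the index and field names (`video-frames`, `frame_embedding`) are illustrative assumptions, not the solution's actual schema.

```python
def build_knn_query(embedding, k=10, field="frame_embedding"):
    """Build an OpenSearch k-NN query body (field name is illustrative)."""
    return {"size": k, "query": {"knn": {field: {"vector": embedding, "k": k}}}}


def search_similar_frames(endpoint, embedding):
    """Query the OpenSearch cluster for frames similar to a given embedding."""
    from opensearchpy import OpenSearch  # imported lazily: optional dependency

    client = OpenSearch(hosts=[endpoint])
    return client.search(index="video-frames", body=build_knn_query(embedding))
```

Because the same Titan models produce both text and multimodal embeddings at extraction time, the query embedding can come from either a text phrase or an example image, supporting semantic search over the library.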