Amazon Translate Now Supports Office Documents

Chanci Turner

In today’s global marketplace, whether you are a large corporation operating in multiple countries or a small startup aiming for international reach, translating your content into various languages can be a significant challenge. Text data can come in numerous formats, often necessitating multiple tools for processing. Furthermore, different tools may not support the same language pairs, forcing you to convert documents into intermediary formats or even rely on manual translation. These complications can lead to increased costs and hinder the creation of streamlined, automated translation workflows.

Amazon Translate is designed to address these challenges in a straightforward and cost-effective manner. By utilizing the AWS console or a simple API call, Amazon Translate allows AWS customers to translate text accurately and quickly in 55 different languages and variants.

Earlier this year, Amazon Translate unveiled batch translation for plain text and HTML documents. I am excited to share that this feature now extends to Office documents, including .docx, .xlsx, and .pptx files as defined by the Office Open XML standard.

Introducing Amazon Translate for Office Documents

The process is straightforward. As you would expect, source documents must be stored in an Amazon Simple Storage Service (Amazon S3) bucket. Note that each document must not exceed 20 megabytes or contain more than 1 million characters.
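A quick pre-flight check can catch oversized documents before they are uploaded. The sketch below is only an illustration: the function and constant names are hypothetical, and reading "20 megabytes" as 20 * 1024 * 1024 bytes is an assumption.

```python
import os

# Limits quoted in the text above. The exact byte value of "20 megabytes"
# is an assumption (interpreted here as 20 * 1024 * 1024 bytes).
MAX_FILE_BYTES = 20 * 1024 * 1024
MAX_CHARACTERS = 1_000_000


def within_batch_limits(size_bytes, character_count):
    """Return True if a document fits both the size and character limits."""
    return size_bytes <= MAX_FILE_BYTES and character_count <= MAX_CHARACTERS


def file_within_size_limit(path):
    """Check only the on-disk size of a local file."""
    return os.path.getsize(path) <= MAX_FILE_BYTES
```

Counting characters requires extracting the text first, which is why the character count is passed in rather than computed here.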

Each batch translation job handles a specific file type and source language. Therefore, it is advisable to organize your documents systematically in S3, with each file type and language stored under its own prefix.

Using either the AWS console or the StartTextTranslationJob API from one of the AWS language SDKs, you can initiate a translation job by specifying:

  • The input and output location in S3,
  • The file type,
  • The source and target languages.

Once the job is completed, you can access the translated files at the designated output location.
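The parameters listed above map onto the request for StartTextTranslationJob roughly as follows. This is a minimal sketch rather than a reference implementation: the helper name, bucket, prefixes, job name, and role ARN are all hypothetical placeholders.

```python
# Content types for the Office Open XML formats (standard MIME types).
OFFICE_CONTENT_TYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def build_job_request(job_name, input_uri, output_uri, role_arn,
                      extension, source_language, target_language,
                      terminology_names=None):
    """Assemble keyword arguments for StartTextTranslationJob."""
    request = {
        "JobName": job_name,
        "InputDataConfig": {
            "S3Uri": input_uri,
            "ContentType": OFFICE_CONTENT_TYPES[extension],
        },
        "OutputDataConfig": {"S3Uri": output_uri},
        "DataAccessRoleArn": role_arn,
        "SourceLanguageCode": source_language,
        "TargetLanguageCodes": [target_language],
    }
    # Custom terminology is optional.
    if terminology_names:
        request["TerminologyNames"] = list(terminology_names)
    return request


# Hypothetical values, for illustration only.
request = build_job_request(
    job_name="docx-en-to-fr",
    input_uri="s3://my-bucket/input/docx/en/",
    output_uri="s3://my-bucket/output/docx/fr/",
    role_arn="arn:aws:iam::123456789012:role/MyTranslateRole",
    extension=".docx",
    source_language="en",
    target_language="fr",
)
```

With boto3 installed and credentials configured, boto3.client("translate").start_text_translation_job(**request) should submit the job and return its JobId.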

Let’s Walk Through a Quick Demo!

To start, I upload several .docx documents to one of my S3 buckets using the Amazon S3 console.

Next, I navigate to the Translate console to create a new batch translation job, assigning it a name and selecting both the source and target languages.

Then, I specify the location of my documents in Amazon S3, indicating that they are in .docx format. Optionally, I can apply custom terminology to ensure specific words are translated to my preference.

I also set the output location for the translated files, ensuring that the path exists, as Translate will not create it for you.
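One way to make sure the output path exists ahead of time is to create an empty placeholder object whose key ends in a slash, which is what the S3 console does when you create a folder. The helper below is a hypothetical sketch; it assumes you pass in a boto3 S3 client and that a placeholder object is enough for the job to accept the location.

```python
def ensure_prefix(s3_client, bucket, prefix):
    """Create an empty placeholder object so the prefix exists as a 'folder'."""
    # Normalize the prefix so the key always ends with a single trailing slash.
    key = prefix if prefix.endswith("/") else prefix + "/"
    s3_client.put_object(Bucket=bucket, Key=key)
    return key
```

For example, ensure_prefix(boto3.client("s3"), "my-bucket", "output/docx/fr") would create the key output/docx/fr/ in a bucket named my-bucket (both names are placeholders).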

Finally, I assign the AWS Identity and Access Management (IAM) role, which grants my Translate job the necessary permissions to access Amazon S3. I can either use an existing role or allow Translate to create one. After clicking ‘Create job,’ the batch process begins immediately.

Shortly thereafter, the job concludes successfully, with all three documents translated. The translated files are available at the output location, as shown in the S3 console.
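Outside the console, job completion can be checked by polling the DescribeTextTranslationJob API. The following is a minimal sketch under a few assumptions: the helper name, polling interval, and retry count are arbitrary choices, and the client passed in is expected to be a boto3 Translate client.

```python
import time

# Job states after which the status no longer changes.
TERMINAL_STATES = {"COMPLETED", "COMPLETED_WITH_ERROR", "FAILED", "STOPPED"}


def wait_for_job(translate_client, job_id, poll_seconds=30, max_polls=120):
    """Poll DescribeTextTranslationJob until the job reaches a terminal state."""
    for _ in range(max_polls):
        response = translate_client.describe_text_translation_job(JobId=job_id)
        properties = response["TextTranslationJobProperties"]
        if properties["JobStatus"] in TERMINAL_STATES:
            return properties
        time.sleep(poll_seconds)
    raise TimeoutError("Job %s did not finish in time" % job_id)
```

The returned properties include the final JobStatus, so the caller can distinguish a clean completion from a job that finished with errors or failed.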

After downloading one of the translated files, I can open it and compare it to the original version. For small-scale needs, the AWS console makes it easy to translate Office files. For automated workflows, you can use the Translate API instead.

Automating Batch Translation

In a previous article, we demonstrated how to automate batch translation using an AWS Lambda function. You can extend this example with language detection through Amazon Comprehend. For instance, you can combine the DetectDominantLanguage API with the python-docx open-source library to determine the language of .docx files.

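import boto3
from docx import Document

# Read the document and gather the text of its non-empty paragraphs,
# since the first paragraph alone may be empty or too short
document = Document('blog_post.docx')
text = '\n'.join(p.text for p in document.paragraphs if p.text.strip())

# Detect the dominant language; a short sample of the text is enough
comprehend = boto3.client('comprehend')
response = comprehend.detect_dominant_language(Text=text[:1000])

# Each candidate language comes with a confidence score; keep the best one
top_language = max(response['Languages'], key=lambda lang: lang['Score'])
code = top_language['LanguageCode']
score = top_language['Score']
print("%s, %f" % (code, score))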

This process is quite simple! You can also identify the type of each file based on its extension and move it to the correct input location in S3. Additionally, you could schedule a Lambda function with CloudWatch Events to periodically translate files and send notifications via email. For more complex workflows, AWS Step Functions can be employed. The possibilities are endless!
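Routing a file by extension and detected language can be as small as computing its destination key. The sketch below assumes a hypothetical input/<type>/<language>/ prefix layout; the function name and the layout are illustrative, not something prescribed by Amazon Translate.

```python
import os

# Office formats covered by this post.
SUPPORTED_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


def input_key_for(filename, language_code, root="input"):
    """Return the S3 key a file should be stored under, or None if unsupported."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        return None
    # Group files by type, then by source language, e.g. input/docx/en/report.docx
    return "%s/%s/%s/%s" % (
        root,
        extension.lstrip("."),
        language_code,
        os.path.basename(filename),
    )
```

A Lambda function could then copy each incoming object to the returned key, so that every batch job reads from a prefix holding a single file type and source language.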

Getting Started

You can begin translating Office documents today in several regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (London), Europe (Frankfurt), and Asia Pacific (Seoul).

If you haven’t yet tried Amazon Translate, you might be pleased to know that the free tier offers 2 million characters per month for the first 12 months, starting from your first translation request.

We invite you to give it a try and share your feedback with us: please post it on the AWS Forum for Amazon Translate or reach out to your usual AWS support contacts.

– Chanci Turner