Generative language models have demonstrated exceptional capabilities in addressing logical and analytical tasks within natural language processing (NLP), and prompt engineering can significantly boost their effectiveness. For instance, chain-of-thought (CoT) prompting is recognized for enhancing a model’s ability to tackle complex, multi-step challenges. To further improve accuracy on reasoning tasks, an approach called self-consistency prompting has emerged, which replaces greedy decoding with stochastic decoding during language generation.
Amazon Bedrock is a fully managed service that provides access to high-performing foundation models from leading AI companies, including Amazon, through a unified API. It offers a broad set of features for building generative AI applications with security, privacy, and responsible AI in mind. With the batch inference API, users can efficiently run inference with foundation models over large datasets. This article outlines how to implement self-consistency prompting via batch inference on Amazon Bedrock to enhance model performance on arithmetic and multiple-choice reasoning tasks.
Overview of the Solution
The self-consistency prompting technique relies on producing multiple responses that are then synthesized into a final answer. Unlike traditional single-generation approaches like CoT, the self-consistency method generates a variety of model completions, leading to a more reliable solution. This diversity in responses is achievable through a stochastic decoding strategy, rather than a greedy one.
The figure below illustrates how self-consistency distinguishes itself from greedy CoT by generating various reasoning pathways, which are then aggregated to yield the final answer.
Decoding Strategies for Text Generation
Text generated by decoder-only language models unfolds token by token, with each subsequent token predicted from the prior context. The model produces a probability distribution indicating the likelihood of each token appearing next in the sequence, and the decoding process translates these distributions into coherent text. Generation is governed by several inference parameters, which are often hyperparameters of the decoding method. For example, the temperature parameter modulates the probability distribution of the next token and affects the randomness of the model’s output.
Greedy decoding is a deterministic strategy that selects the token with the highest probability at each step. While this method is efficient, it can lead to repetitive outputs as it overlooks the broader probability landscape. Setting the temperature parameter to 0 during inference effectively implements greedy decoding.
In contrast, sampling introduces randomness into the decoding process, allowing for each subsequent token to be selected based on the predicted probability distribution. This randomness enhances the variability of outputs, making stochastic decoding better at capturing a diverse range of potential responses. Higher temperature values introduce greater fluctuations, thereby increasing the creativity of the model’s outputs.
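To make the two strategies concrete, the following sketch invokes a model on Amazon Bedrock first with temperature set to 0 (approximating greedy decoding) and then with a positive temperature (sampling). The model ID, request fields, and response parsing follow our understanding of the Cohere Command schema on Bedrock and are assumptions rather than code from this article.

```python
import json
import boto3

# Bedrock runtime client for on-demand model invocation.
bedrock_runtime = boto3.client("bedrock-runtime")

def generate(prompt, temperature):
    """Invoke Cohere Command with a given temperature.

    temperature=0 approximates greedy decoding (always pick the most likely
    next token); temperature>0 samples from the predicted distribution,
    producing more varied completions.
    """
    body = json.dumps({
        "prompt": prompt,
        "max_tokens": 512,
        "temperature": temperature,
    })
    response = bedrock_runtime.invoke_model(
        modelId="cohere.command-text-v14",  # assumed model ID
        body=body,
    )
    return json.loads(response["body"].read())["generations"][0]["text"]

# Greedy decoding: deterministic, the same completion on every call.
greedy_completion = generate("What is 17 * 24? Think step by step.", temperature=0)

# Stochastic decoding: repeated calls can follow different reasoning paths.
sampled_completions = [
    generate("What is 17 * 24? Think step by step.", temperature=0.7)
    for _ in range(3)
]
```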
Prompting Techniques: CoT and Self-Consistency
The reasoning capabilities of language models can be significantly improved through prompt engineering. CoT prompting has been particularly effective in eliciting reasoning for complex NLP tasks. One way to apply zero-shot CoT is by instructing the model to “think step by step.” Alternatively, providing the model with examples of intermediate reasoning steps in a few-shot prompting manner can also be beneficial. Both approaches typically utilize greedy decoding. CoT has demonstrated considerable performance enhancements compared to basic instruction prompting in arithmetic, commonsense, and symbolic reasoning tasks.
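As an illustration (not taken from this article), the prompts below show the two flavors of CoT on a GSM8K-style question: a zero-shot variant that appends a reasoning trigger, and a few-shot variant that prepends a worked example with intermediate steps.

```python
# Zero-shot CoT: append a reasoning trigger to the question.
zero_shot_cot = (
    "Natalia sold clips to 48 of her friends in April, and then she sold half "
    "as many clips in May. How many clips did Natalia sell altogether in "
    "April and May?\n"
    "Let's think step by step."
)

# Few-shot CoT: prepend a worked example that spells out intermediate steps.
few_shot_cot = (
    "Q: A robe takes 2 bolts of blue fiber and half that much white fiber. "
    "How many bolts in total does it take?\n"
    "A: It takes 2/2 = 1 bolt of white fiber, so 2 + 1 = 3 bolts in total. "
    "The answer is 3.\n\n"
    "Q: Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether "
    "in April and May?\n"
    "A:"
)
```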
Self-consistency prompting is predicated on the idea that diversity in the reasoning process can aid models in converging on the correct answer. This technique employs stochastic decoding in three steps:
- Prompt the language model with CoT examples to elicit reasoning.
- Replace greedy decoding with a sampling strategy to generate a diverse array of reasoning pathways.
- Aggregate the results to identify the most consistent answer from the generated responses.
Research indicates that self-consistency outperforms CoT prompting on arithmetic and commonsense reasoning benchmarks. However, it’s worth noting that this approach incurs a higher computational cost.
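The three steps can be sketched as follows, assuming the generate helper from the decoding example above and a simple last-number heuristic for answer extraction (both are illustrative assumptions, not this article's exact implementation):

```python
import re
from collections import Counter

def extract_answer(completion):
    # Illustrative heuristic: take the last number that appears in the completion.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

def self_consistency(prompt, num_paths=30, temperature=0.7):
    # Step 1: prompt with CoT examples; step 2: sample diverse reasoning paths.
    completions = [generate(prompt, temperature=temperature) for _ in range(num_paths)]
    # Step 3: aggregate by majority vote over the extracted answers.
    answers = [extract_answer(c) for c in completions]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None
```

Majority voting over the sampled answers is what makes the final prediction more robust than any single greedy completion, at the cost of num_paths times as many model calls.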
In this article, we will explore how self-consistency prompting enhances generative language models’ performance on two specific NLP reasoning tasks: arithmetic problem-solving and multiple-choice domain-specific question answering. We will demonstrate this method using batch inference on Amazon Bedrock:
- We will access the Amazon Bedrock Python SDK in a JupyterLab environment on an Amazon SageMaker notebook instance.
- For arithmetic reasoning, we will prompt Cohere Command using the GSM8K dataset, which contains grade school math problems.
- For multiple-choice reasoning, we will prompt AI21 Labs Jurassic-2 Mid with a small sample of questions derived from the AWS Certified Solutions Architect – Associate exam.
Prerequisites
Before proceeding, please ensure you have the following prerequisites:
- An AWS account with an ml.t3.medium notebook instance hosted in SageMaker.
- An AWS Identity and Access Management (IAM) SageMaker execution role with AmazonBedrockFullAccess and iam:PassRole policies attached to run Jupyter inside the SageMaker notebook instance.
- An IAM BedrockBatchInferenceRole for batch inference with Amazon Bedrock that includes Amazon Simple Storage Service (Amazon S3) access and sts:AssumeRole trust policies. For more information, refer to Set up permissions for batch inference.
- Access to models hosted on Amazon Bedrock. Choose Manage model access on the Amazon Bedrock console and select from the available options. We will be using Cohere Command and AI21 Labs Jurassic-2 Mid for this demonstration.
The estimated cost to execute the code demonstrated in this article is around $100, assuming you run self-consistency prompting once with 30 reasoning paths using a single temperature-based sampling value.
Dataset to Probe Arithmetic Reasoning Capabilities
The GSM8K dataset consists of human-written grade school math problems with a high degree of linguistic diversity. Each problem typically requires 2–8 steps to solve and involves a series of basic arithmetic operations. The dataset, which contains 7,473 records, is widely used to benchmark the multi-step arithmetic reasoning capabilities of generative language models. Here is an example:
{"question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.n#### 72"}
Setting Up to Run Batch Inference with Amazon Bedrock
Batch inference allows you to run multiple inference calls to Amazon Bedrock asynchronously, improving throughput when running model inference over larger datasets. Currently, this capability is in preview and available solely through the API. To access batch inference APIs via custom SDKs, refer to Run batch inference.
Once you have downloaded and unzipped the Python SDK within a SageMaker notebook instance, you can install it by executing the following code in a Jupyter notebook cell.
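The original install cell is not reproduced here; a plausible equivalent installs the unzipped wheel files with pip. The wheel filenames depend on the preview SDK build you downloaded, so the names below are placeholders.

```python
# Run in a Jupyter notebook cell; the wheel filenames are placeholders for the
# specific preview SDK build you unzipped.
%pip install --force-reinstall --no-build-isolation botocore-1.31.xx-py3-none-any.whl boto3-1.28.xx-py3-none-any.whl
```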
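Once the SDK is in place, a batch job points the model at a JSONL file of prompts in Amazon S3 and writes the completions back to S3. The sketch below uses the create_model_invocation_job and get_model_invocation_job calls from the generally available boto3 Bedrock client; the preview SDK referenced above may differ, and the job name, role ARN, model ID, and S3 URIs are placeholders.

```python
import boto3

# Control-plane Bedrock client (distinct from the runtime client used for
# on-demand invocation).
bedrock = boto3.client("bedrock")

# Each line of the input JSONL holds a recordId and a modelInput matching the
# chosen model's request schema. All names below are placeholders.
response = bedrock.create_model_invocation_job(
    jobName="self-consistency-gsm8k",
    roleArn="arn:aws:iam::111122223333:role/BedrockBatchInferenceRole",
    modelId="cohere.command-text-v14",
    inputDataConfig={
        "s3InputDataConfig": {"s3Uri": "s3://amzn-s3-demo-bucket/batch-input/records.jsonl"}
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://amzn-s3-demo-bucket/batch-output/"}
    },
)

# Poll the job status; completed outputs land in the S3 output location.
job_arn = response["jobArn"]
status = bedrock.get_model_invocation_job(jobIdentifier=job_arn)["status"]
```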