Tailoring Coding Companions for Organizations

Generative AI models for coding companions primarily rely on publicly available source code and natural language text for training. While the extensive training corpus allows these models to produce code for common functionalities, they lack awareness of private repository code and the specific coding styles enforced during development. As a result, the generated suggestions often need to be rewritten before they can be effectively integrated into an internal codebase.

To bridge this gap and reduce the need for extensive manual modifications, we have developed a customization feature for Amazon CodeWhisperer that integrates knowledge from private repositories with a language model trained on public code. In this article, we will explore two methods for customizing coding companions: retrieval-augmented generation and fine-tuning.

The primary aim of the CodeWhisperer customization feature is to allow organizations to tailor the CodeWhisperer model using their private repositories and libraries, thereby generating organization-specific code recommendations. This not only saves time but also ensures adherence to organizational styles and conventions, reducing the risk of bugs or security vulnerabilities. This approach addresses several challenges in enterprise software development:

  • Limited documentation or information for internal libraries and APIs, forcing developers to sift through previously written code to understand usage.
  • Inconsistencies in implementing organization-specific coding practices, styles, and patterns.
  • Unintentional use of deprecated code and APIs by developers.

By using internal code repositories that have already undergone code review as additional training data, the language model can surface internal APIs and code blocks that address these issues. Because the reference code has already been vetted and meets a high standard, the likelihood of introducing bugs or security vulnerabilities is also reduced. Furthermore, by carefully selecting the source files used for customization, organizations can decrease the use of outdated code.

Design Considerations

Customizing code suggestions based on private repositories presents several intriguing design challenges. Deploying large language models (LLMs) incurs fixed costs for availability and variable costs due to inference, which depend on the number of tokens generated. Having separate customizations for each organization and hosting them individually can lead to prohibitive expenses. Conversely, supporting multiple customizations on the same system requires a multi-tenant infrastructure to safeguard proprietary code.

Additionally, the customization capability should include features that allow for the selection of specific training subsets from internal repositories based on various metrics (e.g., files with a history of fewer bugs or recently committed code). By focusing on high-quality code, the customization can improve the overall quality of code suggestions. Lastly, given the ever-evolving nature of code repositories, the cost associated with customization should be kept minimal to help organizations realize cost savings through increased developer productivity.
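As a concrete (if simplified) illustration of the metric-based selection described above, a recency filter can be derived directly from repository history. The sketch below is hypothetical and uses plain git; a real pipeline would combine several signals, such as bug history or review status:

```python
import subprocess
from datetime import datetime, timedelta

def recently_committed_files(repo_path: str, days: int = 180) -> set[str]:
    """Return the files with at least one commit in the last `days` days."""
    since = (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d")
    # `git log --name-only` lists the files touched by each matching commit.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}",
         "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line for line in log.splitlines() if line.strip()}
```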

A basic approach to building customization could involve pre-training the model on a single dataset that combines existing public pretraining data with private enterprise code. While this method is effective, it necessitates redundant individual pretraining using the public dataset for each enterprise and incurs additional deployment costs associated with hosting a customized model for each client. By decoupling the training of public and private code and deploying the customization on a multi-tenant system, these redundant costs can be eliminated.

Customization Techniques

At a high level, two primary customization techniques can be employed: retrieval-augmented generation (RAG) and fine-tuning (FT).

  • Retrieval-Augmented Generation: RAG identifies matching code snippets within a repository that are similar to a given code fragment (e.g., the code preceding the cursor in the IDE) and enhances the prompt used to query the LLM with these matched snippets. This enriched prompt encourages the model to generate more relevant code. Previous studies have explored various techniques in this area, including Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, REALM, kNN-LM, and RETRO.
  • Fine-Tuning: FT involves taking a pre-trained LLM and further training it on a specific, smaller codebase (relative to the pretraining dataset) to better align it with the unique repository. Fine-tuning adjusts the LLM’s weights based on this additional training, tailoring it for the organization’s distinct needs.
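To make the retrieval-augmented flow concrete, the following sketch shows the augmentation step in isolation. `retrieve_snippets` is a stand-in for whatever retriever is used (BM25 or embedding search, discussed below); none of these names come from an actual CodeWhisperer API:

```python
def build_augmented_prompt(code_before_cursor, retrieve_snippets, max_snippets=3):
    """Prepend repository snippets similar to the current context to the prompt."""
    snippets = retrieve_snippets(code_before_cursor)[:max_snippets]
    context = "\n\n".join(
        f"# From {s['path']}:\n{s['code']}" for s in snippets
    )
    # The model conditions on vetted internal code before the developer's context.
    return f"{context}\n\n{code_before_cursor}"
```

The enriched prompt is then sent to the LLM in place of the raw context, which is what steers the completion toward internal APIs and conventions.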

Both RAG and fine-tuning are effective methods for enhancing LLM-based customization. RAG can quickly adapt to private libraries or APIs with lower training complexity and cost; however, the process of searching for and augmenting retrieved code snippets can introduce latency. On the other hand, fine-tuning eliminates the need for context augmentation since the model is already trained on private libraries and APIs. Nevertheless, it incurs higher training costs and complexities in supporting multiple custom models for various enterprises. These issues can be addressed through further optimization.
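To ground the fine-tuning side of this comparison, the sketch below shows the core loop: standard causal-language-model training on the private corpus. It uses the open-source Hugging Face stack purely for illustration; the model name, data, and hyperparameters are placeholders, and this is not CodeWhisperer's actual training pipeline:

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "bigcode/starcoderbase-1b"  # placeholder; any causal code LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # many code LMs ship without one

# Stand-in for the vetted, admin-selected source files.
private_code = ['def get_internal_client():\n    """Hypothetical internal API."""\n']
dataset = Dataset.from_dict({"text": private_code}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-sketch", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # updates the model's weights toward the private repository
```

Parameter-efficient variants (for example, adapter or LoRA layers) are one way to reduce the per-tenant cost of hosting many such customizations.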

Retrieval-Augmented Generation Steps

RAG involves two main steps:

  1. Indexing: Upon receiving a private repository from the admin, an index is created by segmenting the source code files into manageable chunks. This “chunking” process converts code snippets into digestible pieces that are most informative for the model and easy to retrieve based on the context. The size of a chunk and how it is extracted are crucial design choices that affect the final results (a minimal chunking sketch follows this list).
  2. Contextual Search: At request time, the system uses the few lines of code above the cursor as a query to search the indexed snippets and retrieve the most relevant ones. Several retrieval algorithms can be used, including:
    • Bag of Words (BM25): A retrieval function that ranks code snippets based on query term frequencies and snippet lengths.
    • Semantic Retrieval: This method converts queries and indexed snippets into high-dimensional vectors, ranking them based on semantic similarity, commonly using k-nearest neighbors (KNN) or approximate nearest neighbor (ANN) search.
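Here is a minimal sketch of the chunking and indexing step referenced above, assuming simple fixed-size line windows. Production systems often chunk along syntactic boundaries such as functions or classes, so the sizes and strategy shown are illustrative only:

```python
def chunk_file(source: str, chunk_lines: int = 20, overlap: int = 5) -> list[str]:
    """Split one source file into overlapping windows of `chunk_lines` lines."""
    lines = source.splitlines()
    step = chunk_lines - overlap
    return ["\n".join(lines[i:i + chunk_lines])
            for i in range(0, len(lines), step)]

def build_index(files: dict[str, str]) -> list[dict]:
    """Record each chunk with the file it came from, for later retrieval."""
    return [{"path": path, "code": chunk}
            for path, source in files.items()
            for chunk in chunk_file(source)]
```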

BM25 focuses on lexical matching: replacing “add” with “delete” in a query changes only one term, so the overall score may shift very little even though the retrieved functionality is the opposite of what is required. Semantic retrieval, by contrast, captures what a snippet does, so it remains more robust when variable and API names differ.
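The difference can be seen with two snippets that are lexically almost identical but functionally opposite. The sketch below uses the open-source rank_bm25 and sentence-transformers libraries purely as stand-ins for the two retrieval components; the embedding model name and the data are illustrative, not part of CodeWhisperer:

```python
from rank_bm25 import BM25Okapi                      # common open-source BM25
from sentence_transformers import SentenceTransformer, util

corpus = [
    'def add_item(cart, item): """Add an item to the cart."""',
    'def delete_item(cart, item): """Delete an item from the cart."""',
]
query = "add an item to the cart"

# Lexical retrieval: rank by shared tokens. Because the two snippets share
# most of their vocabulary, their BM25 scores can end up close even though
# one does the opposite of what the query asks.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
print(bm25.get_scores(query.lower().split()))

# Semantic retrieval: rank by embedding similarity, which tends to separate
# "add" from "delete" even when the surrounding tokens are identical.
encoder = SentenceTransformer("all-MiniLM-L6-v2")    # placeholder embedding model
doc_vecs = encoder.encode(corpus, convert_to_tensor=True)
query_vec = encoder.encode(query, convert_to_tensor=True)
print(util.cos_sim(query_vec, doc_vecs))
```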
