Large language models (LLMs) have showcased impressive abilities across a variety of language tasks, but their effectiveness depends heavily on the quality of the data used to train them. This article serves as a primer on preparing your own dataset for LLM training. Whether your aim is to fine-tune a pre-existing model for a particular task or to continue pre-training for specialized applications, a meticulously curated dataset is essential for achieving the best outcomes.
Data Preprocessing
Text data can originate from numerous sources and exist in diverse formats such as PDF, HTML, JSON, and Microsoft Office files like Word, Excel, and PowerPoint. It’s uncommon to find text data that is ready to be processed and input into an LLM for training. Therefore, the initial step in preparing LLM data is to extract and assemble data from these various formats. This process involves reading data from multiple sources and extracting text using tools such as optical character recognition (OCR) for scanned PDFs, HTML parsers for web documents, and specialized libraries for proprietary file formats. Non-text elements, including HTML tags and non-UTF-8 characters, are generally removed or normalized.
The next phase involves filtering out low-quality or unwanted documents. Common filtering strategies include:
- Filtering based on metadata like document names or URLs.
- Content-based filtering to exclude toxic or harmful content and personally identifiable information (PII).
- Regex filters to identify specific character patterns in the text.
- Filtering documents with excessive repetition of sentences or phrases.
- Language identification filters, for example to keep only English documents.
- Additional heuristic quality filters, including document length, average word length, and the ratio of alphabetic to non-alphabetic characters (a minimal sketch of such filters follows this list).
- Model-based quality filtering employing lightweight text classifiers to identify poor-quality documents. For instance, the FineWeb-Edu classifier evaluates the educational value of web pages.
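The snippet below is a minimal sketch of the heuristic filters mentioned above. The thresholds used here (minimum character count, mean word length, alphabetic ratio, repetition ratio) are illustrative assumptions and should be tuned for your corpus.

from collections import Counter

def passes_quality_filters(text, min_chars=200, max_mean_word_len=12.0, min_alpha_ratio=0.6):
    # Drop very short documents
    if len(text) < min_chars:
        return False
    words = text.split()
    if not words:
        return False
    # Drop documents with implausibly long average words (often markup debris)
    if sum(len(w) for w in words) / len(words) > max_mean_word_len:
        return False
    # Drop documents dominated by non-alphabetic characters
    if sum(c.isalpha() for c in text) / len(text) < min_alpha_ratio:
        return False
    # Drop documents in which a single repeated line accounts for
    # more than 30 percent of all lines
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if lines and Counter(lines).most_common(1)[0][1] / len(lines) > 0.3:
        return False
    return True

documents = ["...your extracted documents..."]
clean_documents = [d for d in documents if passes_quality_filters(d)]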
Extracting text from various file formats can be challenging. Thankfully, many high-level libraries can simplify this task. We will illustrate some methods for extracting text and discuss how to scale this process for large document collections later on.
HTML Preprocessing
When handling HTML documents, it’s crucial to eliminate non-text data like markup tags, inline CSS, and JavaScript. Additionally, structured objects such as lists, tables, and code blocks should be converted into markdown format. The trafilatura library offers a command-line interface (CLI) and Python SDK for this purpose. Below is an example illustrating the use of this library to extract and preprocess HTML data from a blog post about fine-tuning Meta Llama 3.1 models using torchtune on Amazon SageMaker.
from trafilatura import fetch_url, extract, html2txt
url = "https://aws.amazon.com/blogs/machine-learning/fine-tune-meta-llama-3-1-models-using-torchtune-on-amazon-sagemaker/"
downloaded = fetch_url(url)
print("RAW HTMLn", downloaded[:250])
all_text = html2txt(downloaded)
print("nALL TEXTn", all_text[:250])
main_text = extract(downloaded)
print("nMAIN TEXTn", main_text[:250])
The trafilatura library provides many functions for handling HTML. In the example above, fetch_url retrieves the raw HTML, html2txt extracts all text content (including navigation links), and extract returns the main body content of the blog post. The output from this code should resemble the following:
RAW HTML
<!doctype html> <html lang="en-US" class="no-js aws-lng-en_US" xmlns="http://www.w3.org/1999/xhtml" ...
ALL TEXT
Skip to Main Content Click here to return to Amazon Web Services homepage ...
MAIN TEXT
AWS Machine Learning Blog Fine-tune Meta Llama 3.1 models using torchtune on Amazon SageMaker...
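If you want structured elements such as lists and tables emitted as markdown, as described earlier, newer releases of trafilatura add a markdown output format. The following is a hedged sketch assuming a version that supports it:

# Requires a trafilatura release with markdown output support
main_markdown = extract(downloaded, output_format="markdown", include_tables=True)
print(main_markdown[:250])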
PDF Processing
PDF is a prevalent format for document storage and distribution in many organizations. Extracting clean text from PDFs can be tricky due to complex layouts that may include columns, images, tables, and figures. PDFs can also contain embedded fonts and graphics that standard libraries can’t process. Unlike HTML, there is no inherent structural information available, making PDF parsing considerably more difficult. Whenever possible, opt for alternative formats like HTML, markdown, or DOCX. If only PDFs are available, libraries such as pdfplumber, pypdf, and pdfminer can assist with text and data extraction. Below is an example of using pdfplumber to extract text from the first page of the 2023 Amazon annual report.
import pdfplumber
pdf_file = "Amazon-com-Inc-2023-Annual-Report.pdf"
with pdfplumber.open(pdf_file) as pdf:
    page = pdf.pages[0]  # pages are zero-indexed, so this is the first page
    print(page.extract_text(x_tolerance=1)[:300])
pdfplumber also exposes bounding box information, which is useful for removing extraneous text such as headers and footers (see the sketch below). However, the library works only with PDFs that contain an embedded text layer, such as digitally authored PDFs. For scanned documents requiring OCR, services like Amazon Textract can be utilized.
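As an illustration of the bounding box approach, the following sketch crops away assumed header and footer bands before extracting text. The band heights are arbitrary assumptions that depend on your document's layout:

import pdfplumber

pdf_file = "Amazon-com-Inc-2023-Annual-Report.pdf"
header_height = 50  # points trimmed from the top (assumed)
footer_height = 50  # points trimmed from the bottom (assumed)
with pdfplumber.open(pdf_file) as pdf:
    page = pdf.pages[0]
    # Keep only the region between the header and footer bands;
    # the bounding box is (x0, top, x1, bottom)
    body = page.crop((0, header_height, page.width, page.height - footer_height))
    print(body.extract_text(x_tolerance=1)[:300])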
Office Document Processing
Documents created with Microsoft Office or compatible software, including DOCX, PPTX, and XLSX files, are also common in organizations. Libraries exist to work with these formats. The following code snippet employs the python-docx library to extract text from a Word document. The code iterates through the document’s paragraphs and combines them into a single string.
from docx import Document
doc_file = "SampleDoc.docx"
doc = Document(doc_file)
full_text = []
for paragraph in doc.paragraphs:
    full_text.append(paragraph.text)
document_text = '\n'.join(full_text)
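Word documents often contain tables that doc.paragraphs does not cover. As a short sketch extending the snippet above, the python-docx table API can be walked to capture cell text as well:

# Append table contents, row by row, after the paragraph text
for table in doc.tables:
    for row in table.rows:
        full_text.append('\t'.join(cell.text for cell in row.cells))
document_text = '\n'.join(full_text)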
Deduplication
Once preprocessing is complete, the data should be refined further by removing duplicates (deduplication) and filtering out low-quality content. Deduplication is essential for creating high-quality pretraining datasets: according to research from CCNet, duplicate training examples are widespread in common natural language processing (NLP) datasets, and they can bias the resulting model.
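Deduplication strategies range from exact matching to fuzzy methods such as MinHash. As a minimal sketch, exact deduplication can be implemented by hashing normalized document text; the normalization used here (lowercasing and collapsing whitespace) is an illustrative assumption:

import hashlib
import re

def normalize(text):
    # Lowercase and collapse whitespace so trivial variants hash identically
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(docs):
    seen = set()
    unique_docs = []
    for doc in docs:
        digest = hashlib.md5(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

docs = ["Hello   World", "hello world", "Another document"]
print(deduplicate(docs))  # the second document is dropped as a duplicate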
This guide has explored essential practices for preparing your dataset for LLM training, from extracting and cleaning text in HTML, PDF, and Office formats to filtering and deduplicating the results.