The K8s YAML dataset

Excited to announce the K8s YAML dataset containing 276,520 valid K8s YAML files.

HuggingFace Dataset: https://huggingface.co/datasets/substratusai/the-stack-yaml-k8s
Source code: https://github.com/substratusai/the-stack-yaml-k8s

Why?

This dataset can be used to fine-tune an LLM directly
New datasets can be created from his dataset such as an K8s instruct dataset (coming soon!)
What's your use case?

How?

Getting a lot of K8s YAML manifests wasn't easy. My initial approach was to use the Kubernetes website and scrape the YAML example files, however the issue was the quantity since I could only scrape about ~250 YAML examples that way.

Luckily, I came across the-stack dataset which is a cleaned dataset of code on GitHub. The dataset is nicely structured by language and I noticed that yaml was one of the languages in the dataset.

Install libraries used in this blog post:

pip3 install datasets kubernetes-validate

Let's load the the-stack dataset but only the YAML files (takes about 200GB of disk space):

from datasets import load_dataset
ds = load_dataset("bigcode/the-stack", data_dir="data/yaml", split="train")

Once loaded there are 13,439,939 YAML files in ds.

You can check the content of one of the files:

print(ds[0]["content"])

You probably notice that this ain't a K8s YAML file, so next we need to filter these 13 million YAML files and only keep the one that have valid K8 YAML.

The approach I took was to use the kubernetes-validate OSS library. It turned out that YAML parsing was too slow so I added a 10x speed improvement by eagerly checking if "Kind or "kind" is not a substring in the YAML file.

Here is the validate function that takes the yaml_content as a string and returns if the content was valid K8s YAML or not:

import kubernetes_validate
import yaml

def validate(yaml_content: str):
    try:
        # Speed optimization to return early without having to load YAML
        if "kind" not in yaml_content and "Kind" not in yaml_content:
            return False
        data = yaml.safe_load(yaml_content)
        kubernetes_validate.validate(data, '1.22', strict=True)
        return True
    except Exception as e:
        return False

validate(ds[0]["content"])

Now all that's needed is to filter out all YAML files that aren't valid:

import os
os.cpu_count()
valid_k8s = ds.filter(lambda batch: [validate(x) for x in batch["content"]],
                      num_proc=os.cpu_count(), batched=True)

There were 276,520 YAML files left in valid_k8s. You can print one again to see:

print(valid_k8s[0]["content"])

You can upload the dataset back to HuggingFace by running:

valid_k8s.push_to_hub("substratusai/the-stack-yaml-k8s")

What's next?

Creating a new dataset called K8s Instruct that also provides a prompt for each YAML file.