Back to blog

The K8s YAML dataset

Sam Stoelinga
Engineer
·3 min read
k8syamldataset
kubectl notebook

Excited to announce the K8s YAML dataset containing 276,520 valid K8s YAML files.

HuggingFace Dataset: https://huggingface.co/datasets/substratusai/the-stack-yaml-k8s
Source code: https://github.com/substratusai/the-stack-yaml-k8s

Why?

  • This dataset can be used to fine-tune an LLM directly
  • New datasets can be created from his dataset such as an K8s instruct dataset (coming soon!)
  • What's your use case?

How?

Getting a lot of K8s YAML manifests wasn't easy. My initial approach was to use the Kubernetes website and scrape the YAML example files, however the issue was the quantity since I could only scrape about ~250 YAML examples that way.

Luckily, I came across the-stack dataset which is a cleaned dataset of code on GitHub. The dataset is nicely structured by language and I noticed that yaml was one of the languages in the dataset.

Install libraries used in this blog post:

pip3 install datasets kubernetes-validate

Let's load the the-stack dataset but only the YAML files (takes about 200GB of disk space):

from datasets import load_dataset
ds = load_dataset("bigcode/the-stack", data_dir="data/yaml", split="train")

Once loaded there are 13,439,939 YAML files in ds.

You can check the content of one of the files:

print(ds[0]["content"])

You probably notice that this ain't a K8s YAML file, so next we need to filter these 13 million YAML files and only keep the one that have valid K8 YAML.

The approach I took was to use the kubernetes-validate OSS library. It turned out that YAML parsing was too slow so I added a 10x speed improvement by eagerly checking if "Kind or "kind" is not a substring in the YAML file.

Here is the validate function that takes the yaml_content as a string and returns if the content was valid K8s YAML or not:

import kubernetes_validate
import yaml

def validate(yaml_content: str):
    try:
        # Speed optimization to return early without having to load YAML
        if "kind" not in yaml_content and "Kind" not in yaml_content:
            return False
        data = yaml.safe_load(yaml_content)
        kubernetes_validate.validate(data, '1.22', strict=True)
        return True
    except Exception as e:
        return False

validate(ds[0]["content"])

Now all that's needed is to filter out all YAML files that aren't valid:

import os
os.cpu_count()
valid_k8s = ds.filter(lambda batch: [validate(x) for x in batch["content"]],
                      num_proc=os.cpu_count(), batched=True)

There were 276,520 YAML files left in valid_k8s. You can print one again to see:

print(valid_k8s[0]["content"])

You can upload the dataset back to HuggingFace by running:

valid_k8s.push_to_hub("substratusai/the-stack-yaml-k8s")

What's next?

Creating a new dataset called K8s Instruct that also provides a prompt for each YAML file.