My fork of karpathy/nanoGPT

Subtitle
I had a great time diving into my fork of nanoGPT, the well-known repository by Andrej Karpathy. I spent some time refactoring, adding new features, and running a few experiments regarding training / inference speed.
Date
Nov 14, 2024
Tags
Machine Learning
 

Andrej Karpathy

Andrej Karpathy is a well-known ML figure. He was the Director of AI at Tesla, a founding member of OpenAI, and now runs Eureka Labs, a startup focused on AI education. Honestly, I’m not entirely sure what made him famous—maybe it’s a mix of everything? I just know he’s brilliant. He’s had a blog, he’s been on some podcasts, and he’s done so much that it’s hard to pin down one thing.
For me, it was his YouTube channel that stood out, especially his technical, hands-on videos about machine learning. In particular, this one: Let's build GPT: from scratch, in code, spelled out. It is easily in my top 3 educational videos of all time (maybe I should restrict that to ML videos). He builds a GPT model from scratch and trains it on a Shakespeare dataset. The architecture is faithful to OpenAI's GPT-2 and uses the exact same tricks. The code is available here: https://github.com/karpathy/nanoGPT.
I spent a couple of days working on a fork of this repo, and this post is about what I did with it.

Refactoring the code

The idea of the nanoGPT repo is to be minimal and educational. It uses almost no functions, classes, or methods, and there are only 3 relevant scripts: model.py with the architecture, train.py with the training loop, and sample.py for inference.
Although I find this repo an incredible educational resource, I tried to make some changes and quickly found out that the code structure is a nightmare if you want to build on it.
In particular, the configuration management made me cry. It's funny because the first line of configurator.py is a comment by Karpathy saying "Probably a terrible idea." It gets worse when we learn that the way to "load" a config is to run exec(open('config.py').read()). My default linter already complains about this (rule S102 in Ruff).
Karpathy's idea was to have no config class. Instead, all configuration parameters live in the global namespace of the script. He also mentions that he does not like to prepend config. to every parameter, which further motivates this approach. Some global configuration variables are defined per file, and some are defined in a specific config file that can be loaded (e.g. config/train_shakespeare.py).
I completely disagree here. Knowing whether a parameter is a config parameter or not greatly helps the readability of your code. Prepending config. is a blessing, not a curse.
But the biggest problem is that this config management makes it practically impossible to reuse code between files, because importing a function from a file drags in that file's configuration defaults. For instance, if we want to reuse the get_batch function from train.py, we implicitly end up with the training defaults instead of our own config parameters.
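To make the coupling concrete, here is a simplified sketch of the failure mode (not the exact nanoGPT code): the configuration lives in module-level globals of train.py, so anything imported from it keeps reading those globals rather than your own settings.

# train.py (simplified)
batch_size = 12                          # module-level "config"
exec(open('configurator.py').read())     # may override the globals above, even at import time

def get_batch(split):
    # reads train.py's global batch_size, whatever it happens to be
    ...

# my_experiment.py (simplified)
batch_size = 64                          # my intended setting...
from train import get_batch              # ...but get_batch still uses train.py's globals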
Anyway, this is why I started by refactoring the code:
  1. I created a config class with some minimal functionality that mimics Karpathy's setup (a sketch follows this list).
  2. I modularized the code, so that parts of a module can be reused in other modules.
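As an illustration of the first point, here is a minimal sketch of what such a class can look like. The exact fields and method names are hypothetical, not a copy of my fork:

from dataclasses import dataclass, fields

@dataclass
class Config:
    # hypothetical subset of the parameters used in the experiments below
    batch_size: int = 12
    block_size: int = 64
    n_layer: int = 4
    n_head: int = 4
    n_embd: int = 128
    dropout: float = 0.0

    def override(self, **kwargs) -> "Config":
        """Return a copy of the config with the given parameters replaced."""
        valid = {f.name for f in fields(self)}
        unknown = set(kwargs) - valid
        if unknown:
            raise ValueError(f"Unknown config parameters: {unknown}")
        return Config(**{**self.__dict__, **kwargs})

# an explicit config object that gets passed around, instead of module-level globals
config = Config().override(n_embd=256, dropout=0.1)

Every function that previously read a global now takes config as an argument, which is what makes reuse across modules possible.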
Moreover, I added a pyproject.toml file that includes the dependencies and allows me to use hatch to build the project. I also set up a ruff linter with the most basic rules.
 

Perplexity

I often see perplexity being used as a metric for evaluating the quality of a language model. It is listed as a TODO in the repo's README, so I decided to implement it in a new evaluate.py script.
I took the definition of perplexity from this Hugging Face blog post. We simply load the validation dataset, compute the perplexity, and print it out.
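For reference, the definition boils down to exponentiating the average negative log-likelihood that the model assigns to the true next tokens:

\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i})\right)

which is exactly what the implementation below does: collect the log-probability of the correct next token for every validation example, average, negate, and exponentiate.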
My implementation:
def compute_perplexity(model: GPT, config: Config, ctx: nullcontext | torch.amp.autocast) -> float:
    """Compute perplexity of the model"""
    X, y = get_validation(config)
    # Calculate number of batches
    n_samples = X.shape[0]
    n_batches = (n_samples + config.batch_size - 1) // config.batch_size
    total_log_probs = []
    with torch.no_grad():
        with ctx:
            for i in tqdm(range(n_batches)):
                # Get batch indices
                start_idx = i * config.batch_size
                end_idx = min((i + 1) * config.batch_size, n_samples)
                # Get batch data
                X_batch = X[start_idx:end_idx]
                y_batch = y[start_idx:end_idx]
                # Forward pass
                logits, _ = model(X_batch)
                logits = logits[:, -1, :]
                # Convert to probabilities
                probs = torch.softmax(logits, dim=-1)
                # Get the predicted probability of the true next token
                pred_probs = probs[torch.arange(len(y_batch)), y_batch.long()]
                # Apply log
                log_probs = torch.log(pred_probs)
                total_log_probs.append(log_probs)
    # Concatenate all log probabilities and compute mean
    all_log_probs = torch.cat(total_log_probs)
    perplexity = torch.exp(-all_log_probs.mean()).item()
    return perplexity

Experiments

I ran some experiments, and I give the details and results here. In principle, nanoGPT can reproduce the results of the GPT-2 paper.
But I have no GPU, just a MacBook, so I focused on the Shakespeare dataset instead. The goal with this dataset is to train a language model that "talks like Shakespeare".
We use tinyshakespeare, a roughly 1 MB text file with about 1.1M characters.
Here are the first lines of the dataset:
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.
We use character-level tokenization (sketched after the token counts below). After splitting the dataset, we have:
  • train.bin has 1,003,854 tokens
  • val.bin has 111,540 tokens
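For reference, the character-level preparation is roughly the following (a simplified sketch of nanoGPT's data/shakespeare_char/prepare.py; the file names and the 90/10 split follow that script):

import numpy as np

with open('input.txt', 'r') as f:
    data = f.read()

# the vocabulary is simply the set of unique characters (65 for tinyshakespeare)
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return ''.join(itos[i] for i in ids)

# 90/10 train/val split, token ids stored as uint16
n = len(data)
train_ids = np.array(encode(data[:int(n * 0.9)]), dtype=np.uint16)
val_ids = np.array(encode(data[int(n * 0.9):]), dtype=np.uint16)
train_ids.tofile('train.bin')
val_ids.tofile('val.bin')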
As mentioned before, the model.py file is a reimplementation of the GPT-2 architecture: multi-headed self-attention, residual connections, layer normalization, dropout...
 
I train with the following default hyperparameters:
  • block_size: 64 (context length)
  • batch_size: 12 (for training)
  • n_layer: 4 (number of layers)
  • n_head: 4 (number of attention heads)
  • n_embd: 128 (embedding dimension)
  • max_iters: 2000 (total number of training iterations)
  • lr_decay_iters: 2000 (learning rate decay iterations; see the schedule sketch after this list)
  • dropout: 0.0 (dropout rate)
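About lr_decay_iters: nanoGPT's training loop uses a linear warmup followed by a cosine decay of the learning rate. A sketch of that schedule (the warmup_iters, learning_rate, and min_lr values here are illustrative, not the exact ones I used):

import math

def get_lr(it: int, learning_rate: float = 1e-3, min_lr: float = 1e-4,
           warmup_iters: int = 100, lr_decay_iters: int = 2000) -> float:
    """Linear warmup, then cosine decay down to min_lr, as in nanoGPT's train.py."""
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)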
This model has 802,944 parameters and takes about 3 minutes to train on my MacBook. As a sanity check, we train with 500, 1000, 2000, and 4000 iterations and plot the loss and perplexity. Everything looks good.
[Figure: train / val loss and perplexity for 500, 1000, 2000, and 4000 training iterations]
 
 
It's worth looking at the generated text. I guess you could say that it looks like the training data, but only superficially: if you actually read it, it's gibberish and there is no sense in it. A small sample:
KING RICHARD II:
And these that he is a blassed man
She be my noble winders.

ISABELLA:
Methom; in the subjurit us,
No be as throat quied, and fear I not be,
I was brother in quiet and thee.
But it's funny!
 

Scaling up

We want to increase the model size. Let's double the embedding dimension (128 -> 256) and the number of heads (4 -> 8). This gives us a model with 3.16M parameters. We train it with the same settings as before, for 2000 iterations.
We scale up one last time by doubling the number of layers (4 -> 8). This gives a model with 6.31M parameters. Again, we train with the same settings as before, for 2000 iterations.
We can visualize the results in the same way as before.
Different model sizes trained with the same training settings: 2000 iterations, same LR schedule.
 
We can see that scaling up continues to improve the overall performance of the model. The plot also hints that the model may be starting to overfit, since the validation loss does not decrease as much as the training loss.
The final test is to train the 6M model for longer, 4000 iterations. This gives train / val losses of 1.14 / 1.52 (confirming the overfitting) and a final perplexity of 4.47, which is the best one by far.
Here is a sample of the generated text. Still mostly nonsense, but even better than before!
PARIS:
Ay, ay, because the flatterer, and leave your hearts
Where you that seems the wanton of a word.

POMPEY:
And be the more of what commons to him.

PAULINA:
If you must prove what now you may not here,
Have your mother is grace, it am beggar
A woman may be made him not only me.

PETER:
Yes, sir; I do not so? What may the vengeance?

LADY CAPULET:
Yes, my lord, in your grace to
---------------
Norfolk, let you have the friar of the bosom,
Takes the depender of my son whiles are shall.

JULIET:
I heard thee a pretting and true.

ROMEO:
O shame! O thou dost be confession,
See that for thy will speed to be beats;
Known though they do not sand I am soldier.

ROMEO:
O, nurse! I'll find thee to falls of a lights
As death dinner and the wars stopp'd, and love
To name when he hath head is not that slow it.
 

Inference speed

One of the final aspects I explored was inference speed and how it is affected by model size, device, and quantization techniques.
Inference speed is measured in tokens per second, and I used perplexity as an evaluation metric to ensure that quality doesn't degrade significantly with the optimizations. I computed the perplexity on a smaller validation set (50k tokens) because runtimes with the full set were too long.
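As an illustration, measuring tokens per second can be as simple as timing nanoGPT's generate method; the prompt and token count below are arbitrary:

import time
import torch

def tokens_per_second(model, prompt_ids: torch.Tensor, max_new_tokens: int = 500) -> float:
    """Time the generation of max_new_tokens tokens and return the throughput.

    prompt_ids is a (1, T) tensor of token ids on the same device as the model.
    """
    start = time.time()
    with torch.no_grad():
        model.generate(prompt_ids, max_new_tokens=max_new_tokens)
    return max_new_tokens / (time.time() - start)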
 
Here are the key configurations I tested:
  1. CPU (float32): Standard inference on the CPU without any optimizations.
  2. CPU with int8 quantization: Dynamically quantized weights to int8 for faster inference. Here, we used the PyTorch quantization API and its torch.ao.quantization.quantize_dynamic method.
  3. CPU with float16 inference: Converted weights to float16 using model.half(), with the promise of faster inference and a reduced model size.
  4. MPS (Metal Performance Shaders, float32): GPU acceleration on macOS using the MPS backend.
  5. MPS with float16 inference: Reduced precision inference on the MPS backend.
Note that int8 quantization was not available for the MPS backend.
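Setting these configurations up is short; here is a sketch of the relevant calls (assuming model is the trained GPT loaded in float32 on the CPU):

import copy
import torch

model.eval()

# CPU, int8: dynamic quantization of the Linear layers (weights stored as int8,
# activations quantized on the fly)
int8_model = torch.ao.quantization.quantize_dynamic(
    copy.deepcopy(model), {torch.nn.Linear}, dtype=torch.qint8
)

# CPU, float16: halve the precision of every parameter
fp16_model = copy.deepcopy(model).half()

# MPS (Apple GPU via Metal): float32 and float16 variants
if torch.backends.mps.is_available():
    mps_model = copy.deepcopy(model).to("mps")
    mps_fp16_model = copy.deepcopy(model).half().to("mps")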
 
The model size was computed as Piotr Bialecki suggests:
param_size = 0
for param in model.parameters():
    param_size += param.nelement() * param.element_size()

buffer_size = 0
for buffer in model.buffers():
    buffer_size += buffer.nelement() * buffer.element_size()

size_all_mb = (param_size + buffer_size) / 1024**2
 
 
The results are visualized below:
[Figure: tokens per second, model size, and perplexity for each inference configuration]
Observations:
  • Model size behaves as expected. float16 reduces the model size significantly with respect to float32, and int8 brings the 6M-parameter model to under 1 MB.
  • MPS performs significantly faster inference than the default CPU. I didn't even know my MacBook had this. The MPS backend in PyTorch is designed for macOS devices and leverages the Metal framework to enable high-performance computation on Apple's GPUs.
  • Perplexity behaves as expected: reducing to float16 precision comes with a slight increase, from 4.111 to 4.113, and the int8 quantization gives the smallest model at the highest perplexity cost: 4.116.
  • The tokens-per-second versus precision trade-off does not behave as expected. Typically, a lower-precision model can use kernels that exploit reduced memory bandwidth and faster arithmetic, which should lead to higher throughput (tokens processed per second). My hypothesis for why this is not the case here is that I am not running on NVIDIA GPUs but on my MacBook's CPU and the fairly recent MPS backend, which does not yet support PyTorch's native quantization.