Training a transformer to predict the gender of German nouns

Subtitle
Der, Die, or Das? German learners often complain that nouns have unintuitive and arbitrary genders. How well can a transformer predict the article just by looking at the word? I compare the difficulty of this machine learning task against the equivalent for Croatian and Catalan.
Date
Jul 31, 2024
Tags
Machine Learning
Note: the implementation of this project is found here: der_die_das (github.com/jsalvasoler)
 

Context

I had wanted to do this for a while. A challenge for all German learners is learning the gender of words. German nouns have three articles: masculine (der), feminine (die), and neuter (das). I will call the latter gender neutrum (the German term) because I had never heard of neuter before.
There are, of course, gendering rules that generally work. Check, for instance, this resource: der - die - das (Duolingo). However, these rules are complex and come with exceptions. Over time, German learners develop some intuition for the article of words and can make more accurate guesses on new words. This intuition is based on a combination of rule memorization (e.g., words ending in -ung are feminine) and similarity guesses (I know that Thema is neutrum → Lemma should also be neutrum).
But look at the next three words:
  • Der Löffel (the spoon) - Masculine
  • Die Gabel (the fork) - Feminine
  • Das Messer (the knife) - Neuter
I know the gender of these words, but relying on my intuition alone, I would have guessed masculine for all three 🙃. And this is what German learners complain about:
💡
German learners complain about the nouns having especially unintuitive and arbitrary genders. They claim the few gendering rules are complex and full of exceptions.
 
So that is what I wanted to study from a Machine Learning perspective.
  • How well can a model learn the German genders?
  • Assuming the same dataset size and the same model architecture: are German genders more difficult to classify than those of other languages?
 

Gathering datasets

I started by finding a German dataset. The one I found seemed nice for my purposes, and I just had to clean up some plural nouns (e.g. the media = die Medien) and some nouns made of two words (e.g. the shorts = die kurze Hose).
I wanted to compare German to other languages, and I chose these two:
  • Catalan: it is my native language, and I was really curious to see how the model would do on it. It only has two genders, though. My hypothesis for Catalan was that its gendering is very intuitive and “machine learnable”. I couldn’t find a dataset similar to the German one, so I translated the German dataset word by word using the Microsoft Translator API. I’d recommend the Python library deep-translator for this.
  • Croatian: it is a perfect comparison to German because it also has three genders. My girlfriend speaks it, which really helped, because I didn’t know that Croatian has genders but no articles (think “I went to store”). The problem was that translating the German dataset to Croatian turns der Apfel → jabuka, so I would lose the gender 😢. My Croatian-native collaborator then suggested the trick of translating each word together with its possessive pronoun, e.g. mein Apfel → moja jabuka. From the pronoun I could extract the gender of the Croatian word: moj → masculine, moja → feminine, and moje → neutrum. Nice. A sketch of this trick follows the list.
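Here is a minimal sketch of the pronoun trick. I use deep-translator's GoogleTranslator as a keyless stand-in (the project used the Microsoft Translator API), and the parsing of the translated phrase is simplified:

from deep_translator import GoogleTranslator

# Croatian possessive pronouns reveal the gender of the noun they modify.
PRONOUN_TO_GENDER = {"moj": "masculine", "moja": "feminine", "moje": "neutrum"}

translator = GoogleTranslator(source="de", target="hr")

def croatian_gender(german_noun: str) -> tuple[str, str] | None:
    # Translate "mein <noun>" so the pronoun carries the gender information.
    translated = translator.translate(f"mein {german_noun}")
    pronoun, _, noun = translated.partition(" ")
    gender = PRONOUN_TO_GENDER.get(pronoun.lower())
    return (noun, gender) if gender else None

print(croatian_gender("Apfel"))  # e.g. ("jabuka", "feminine")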
 
Anyway, a nice byproduct of this project is that it provides "noun" datasets for German, Croatian, and Catalan. These are tables with > 2k nouns and their gender. This can be used to power flashcard-based language learning apps, or to train other models.
  • German: 2539 unique nouns. Find it in data/german_words_processed.csv.
  • Croatian: 2196 unique nouns. Find it in data/croatian_words_processed.csv.
  • Catalan: 2250 unique nouns. Find it in data/catalan_words_processed.csv.
Note: the datasets were not manually curated, so there might be some errors. For training a model, the quality is good enough, but for linguistic purposes, you might want to clean the datasets.
 
 

Training and evaluation setup

For each of the three languages, I split the processed dataset into a training and a test set of sizes 1500 and 500. It is a stratified split on (1) gender and (2) word length. Below I plot the class distributions of the German and Croatian training sets. I do it just for the training set because, thanks to stratification, these plots would look almost identical for the test sets.
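A stratified split like this can be done with scikit-learn. Here is a sketch under the assumption of a pandas DataFrame with word and gender columns (the column names and the length bucketing are my guesses, not the repo's exact code):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/german_words_processed.csv")

# Combine gender and a bucketed word length into one stratification key.
strata = df["gender"] + "_" + pd.cut(df["word"].str.len(), bins=5).astype(str)

train_df, test_df = train_test_split(
    df, train_size=1500, test_size=500, stratify=strata, random_state=42
)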
[Figure: gender distributions of the German and Croatian training sets]
For Catalan, 845 words are masculine (56.33%), and 655 words are feminine (43.67%).
 
Majority class classifier as a baseline
It’s good to keep in mind that a trivial classifier that always predicts the most populated class achieves accuracy equal to that class’s proportion. This gives the baselines to beat (a small helper to compute them follows the list):
  • German: 0.4013
  • Croatian: 0.4800
  • Catalan: 0.5633
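These numbers are simply the majority-class proportions in each training set. A tiny helper to compute them from a list of labels (my own sketch, not from the repo):

from collections import Counter

def majority_baseline_accuracy(labels: list[str]) -> float:
    # Accuracy of always predicting the most frequent class.
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Catalan training set: 845 masculine, 655 feminine
print(majority_baseline_accuracy(["m"] * 845 + ["f"] * 655))  # 0.5633...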
I was also interested in plotting the distribution of word lengths.
[Figure: word length distributions]

Model architecture

As I mention in the title, I trained a transformer classifier.
import torch
import torch.nn as nn


class TransformerClassifier(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        max_length: int,
        num_classes: int,
        embed_dim: int = 128,
        num_heads: int = 8,
        num_layers: int = 2,
    ) -> None:
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.positional_encoding = nn.Parameter(
            torch.zeros(1, max_length, embed_dim)
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.transformer_encoder = nn.TransformerEncoder(
            encoder_layer, num_layers=num_layers
        )
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.embedding(x) + self.positional_encoding[:, : x.size(1), :]
        x = self.transformer_encoder(x)
        x = x.mean(dim=1)  # Global average pooling
        return self.fc(x)
 
The TransformerClassifier model is composed of the following components:
  1. Embedding Layer: nn.Embedding(vocab_size, embed_dim) converts input token indices into dense vector representations of dimension embed_dim, mapping discrete token indices to continuous embeddings.
  2. Positional Encoding: nn.Parameter(torch.zeros(1, max_length, embed_dim)) adds positional information to the token embeddings to capture the order of tokens in the input sequence. Positional encodings are necessary because the transformer architecture does not inherently incorporate token position information.
  3. Transformer Encoder:
      • nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True) defines a single encoder layer that applies multi-head self-attention and a position-wise feed-forward network to the input embeddings. Here, d_model specifies the dimension of the input embeddings, and nhead denotes the number of attention heads.
      • nn.TransformerEncoder(encoder_layer, num_layers=num_layers) stacks multiple TransformerEncoderLayer instances to create a deeper, more expressive model that can capture complex dependencies in the input sequence.
  4. Fully Connected (Linear) Layer: nn.Linear(embed_dim, num_classes) maps the final representation of the sequence to output class scores, translating the aggregated sequence representation into the desired number of output classes.
The forward pass through the model proceeds as follows:
  1. Embedding Lookup: Convert input token indices into dense embeddings using self.embedding(x). The input x is a tensor of token indices with shape (batch_size, sequence_length).
  2. Add Positional Encoding: Incorporate positional information into the embeddings by adding the positional encoding: x = self.embedding(x) + self.positional_encoding[:, : x.size(1), :]. This operation ensures that the model is aware of the token positions within the sequence.
  3. Transformer Encoding: Pass the positionally encoded embeddings through the transformer encoder: x = self.transformer_encoder(x). The encoder processes the sequence to produce a contextualized representation of each token.
  4. Global Average Pooling: Aggregate token representations by averaging over the sequence length: x = x.mean(dim=1). This operation reduces the sequence dimension, resulting in a fixed-size vector representation of shape (batch_size, embed_dim).
  5. Classification: Transform the pooled representation into class scores using the fully connected layer: x = self.fc(x). The final output is a tensor of shape (batch_size, num_classes), representing the predicted class scores for each input sequence.
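To make the shapes concrete, here is a quick sanity check with hypothetical sizes (a character-level vocabulary of 30 symbols, words padded to length 20, and three gender classes):

model = TransformerClassifier(vocab_size=30, max_length=20, num_classes=3)
tokens = torch.randint(0, 30, (32, 20))  # a batch of 32 padded words
logits = model(tokens)
print(logits.shape)  # torch.Size([32, 3]) -- one score per gender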
 
When it comes to the training settings, I used:
  • Loss Function: nn.CrossEntropyLoss() is the most natural choice.
  • Optimizer: optim.Adam(model.parameters(), lr=config.lr) is also a widely used default.
  • Learning Rate Scheduler: optim.lr_scheduler.StepLR(optimizer, step_size=config.step_size, gamma=config.gamma) adjusts the learning rate at regular intervals (step_size) by a factor of gamma. This helps in fine-tuning the learning rate during training to improve convergence.
 
I also provide my training settings:
DEFAULT_SETTINGS = {
    "batch_size": 32,
    "lr": 0.001,
    "step_size": 10,
    "gamma": 0.1,
    "val_size": 0.2,
    "early_stop": None,
}
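Putting the loss, optimizer, scheduler, and settings together, a minimal training-loop sketch could look like this (model, config, train_loader, and num_epochs are assumed names from the setup above, not the repo's exact code):

import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=config.lr)
scheduler = optim.lr_scheduler.StepLR(
    optimizer, step_size=config.step_size, gamma=config.gamma
)

for epoch in range(num_epochs):
    model.train()
    for tokens, labels in train_loader:  # tokens: (batch, seq_len)
        optimizer.zero_grad()
        logits = model(tokens)  # (batch, num_classes)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # multiply the lr by gamma every step_size epochs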
 
I didn’t focus on architecture or hyperparameter tuning; this model seemed like a natural first approach to the task.
Finally, ChatGPT has this to say about my model:
This model architecture leverages the power of transformer-based self-attention mechanisms to capture long-range dependencies and interactions between tokens in a sequence, making it well-suited for various natural language processing tasks such as text classification.
It’s a bit dumb because there are not really any long-range dependencies here (the maximum length of a word is around 15 characters).
 

General results

I trained all models for 20 epochs with the architecture and settings described above.
I want to first give an overview of the results, and then we will dive deep into examples and analysis of the predictions for each language.
Accuracy:
  • german: 0.5860
  • croatian: 0.6240
  • catalan: 0.6540
F1 Score:
  • german: 0.5559
  • croatian: 0.6239
  • catalan: 0.6271
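For reference, these scores can be computed with scikit-learn. The averaging method is my assumption (the near-equal precision and recall values reported below suggest weighted averaging), and y_true / y_pred stand for hypothetical lists of true and predicted labels:

from sklearn.metrics import accuracy_score, f1_score

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="weighted")  # averaging is an assumption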
All models beat the baseline majority class classifier 👏🏻.
Croatian is classified better than German by a large margin, suggesting that Croatian nouns can be gendered more intuitively and that the rules for doing so are more consistent. Catalan shows the highest accuracy, but keep in mind that for this language we have a binary classification problem. Looking at the F1 score, Croatian and Catalan are similar, and both are superior to German.
 

Results - German:

Detailed scores and confusion matrix
Model 20240730210843 (german)
  • Accuracy: 0.5860
  • Precision: 0.5619
  • Recall: 0.5860
  • F1 Score: 0.5559
[Figures: score summary and confusion matrix (German)]
From the confusion matrix, we see that the most common errors of the model are:
  1. Predicting “der” when the correct article is “das”
  2. Predicting “der” when the correct article is “die”
  3. Predicting “die” when the correct article is “der”
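For reference, a confusion matrix like the one above can be computed with scikit-learn (again assuming hypothetical y_true / y_pred lists of articles):

from sklearn.metrics import confusion_matrix

labels = ["der", "die", "das"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
# cm[i][j] counts words with true article labels[i]
# that were predicted as labels[j].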
 
Finally, we do three small case studies analyzing how well the model learned some of the gendering rules of German.
 
Case study 1: Rule of -ung → feminine
There are 37/500 such words in the test set, and all of them are feminine except for two (e.g. der Ursprung). Naturally, our model classifies all 37 words as feminine, and fails on the two exceptions... I guess this is what the German learners complain about!
The model also sometimes wrongly predicts feminine for other words ending in -ng. Some examples are der Flüchtling and der Durchgang (both masculine, both predicted as feminine).
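These case studies boil down to measuring accuracy on test words with a given ending. A hypothetical helper (assuming lists test_words, y_true, and y_pred, which are not the repo's exact names) could look like:

def rule_subset_accuracy(ending: str) -> tuple[int, float]:
    # Accuracy restricted to test words with the given ending.
    idx = [i for i, w in enumerate(test_words) if w.lower().endswith(ending)]
    correct = sum(y_true[i] == y_pred[i] for i in idx)
    return len(idx), correct / len(idx)

n, acc = rule_subset_accuracy("ung")  # e.g. n == 37 for case study 1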
 
Case study 2: Rule of -um → neutrum
There are just five words with this ending in the test set, but two of them are exceptions to the rule! Here is what our model predicted:
der Raum: predicted der → correct
der Saum: predicted der → correct
das Wachstum: predicted die → wrong
das Individuum: predicted das → correct
das Stipendium: predicted das → correct
It’s impressive to see that the model did decently here. A German learner relying on the rule would have had only 60% accuracy, but our model did 80% correctly!
 
Case study 3: Rule of -er → masculine
There are 51/500 words that end in -er, but again, this is a rule full of exceptions: 36 are indeed masculine, but 3 are feminine (e.g. die Feier) and 12 are neutrum (e.g. das Wasser). Will our model manage not to simply “memorize” this rule, and achieve more than 36/51 ≈ 70% accuracy?
Well, no: our model achieves 63% accuracy on these words. In this case, the model agrees with the frustrated German learner: there is no intuition that helps for the -er words, and the rule of guessing masculine is not very effective.
 

Results - Croatian:

Detailed scores and confusion matrix
Model 20240730213637 (croatian)
  • Accuracy: 0.6240
  • Precision: 0.6239
  • Recall: 0.6240
  • F1 Score: 0.6239
[Figures: score summary and confusion matrix (Croatian)]
For Croatian, the two most common errors were mixing up masculine and feminine words. This is simply explained by these two genders being the most populated. I have zero Croatian knowledge, so I didn’t study the results further. But it’s nice to see that there is a language with three genders that are easier to identify than German’s.

Results - Catalan:

Detailed scores and confusion matrix
Model 20240730214109 (catalan)
  • Accuracy: 0.6540
  • Precision: 0.6677
  • Recall: 0.6540
  • F1 Score: 0.6271
[Figures: score summary and confusion matrix (Catalan)]
For Catalan, the most common error was predicting masculine for feminine words. This can be explained by a bias towards predicting the most populated class (masculine).
 
Case study: Rule of the words ending in -a → feminine
This rule also holds in many other languages. In the test dataset, there are 145/500 such words, and about 92% of them are feminine, which still leaves a dozen exceptions. Can the model do well on this subset of the test dataset? Spoiler: no, we get just 25% accuracy…
This suggests that the model architecture might be too complex and overlooks simple signals like the ending of the word. We could easily have added human knowledge to the model architecture, but it makes sense not to, in order to answer our research questions properly. Tuning the architecture to focus more on word endings would work… but it’s not what “learning genders by looking at the word” means.
Anyway, the model is still quite good at predicting the Catalan genders.
 

Lessons learned

  • Hatch is handy: use it as your Python project and environment manager! I also used an automatic code formatter and commit hooks, and I am sure I will always do so from now on.
  • If you have an idea and a bit of time, just grab your laptop and work on it! That is what happened with this project. But don’t forget to push it to GitHub with a nice Readme and to find a way to include it in your portfolio.
  • Translation tools like the Microsoft Translator API, accessed through libraries like deep-translator, can facilitate dataset creation for languages where pre-existing datasets are unavailable. You can get something like 4M words translated for free per month.