A from-scratch implementation of the original Transformer architecture described in Attention Is All You Need (Vaswani et al., 2017), built in PyTorch without relying on high-level abstractions.
## Motivation
The goal of this project is to gain a thorough understanding of the Transformer architecture by studying and replicating each of its building blocks, then assembling them into a working model and training it on real-world data (English-French translation).
## Repository Structure
`transformer-from-scratch.ipynb` contains the block-wise implementation of the Transformer, alongside the personal notes I took while breaking down and digesting the architecture.

`transformer/` contains a cleaned-up, streamlined version of the code from the notebook, which is more verbose for clarity.
```
transformer/
├── model/
│   ├── attention.py     # MultiHeadAttention (self, masked, cross)
│   ├── encoder.py       # EncoderLayer, Encoder
│   ├── decoder.py       # DecoderLayer, Decoder
│   ├── feedforward.py   # Position-wise FeedForwardBlock
│   ├── embedding.py     # PositionalEncoding
│   └── transformer.py   # Seq2Seq (composes encoder + decoder)
├── data/
│   └── tokenizer.py     # Word-level tokenizer with regex splitting
├── train.py             # Training loop
├── main.py              # Entry point
└── config.py            # Hyperparameters and device configuration
```
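As a reference point for what `embedding.py` implements, the paper's sinusoidal positional encoding can be sketched in plain Python (the repo's actual code uses PyTorch tensors; this is just the formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)):

```python
import math

def positional_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Sinusoidal positional encoding from "Attention Is All You Need".
    Returns a seq_len x d_model table of position encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even dimensions: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
# Row 0 alternates sin(0) = 0 and cos(0) = 1 across the embedding dimensions.
```

Because the encoding depends only on position and dimension, it is computed once and added to the token embeddings, injecting order information without any learned parameters.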
## Hyperparameters
| Parameter | Value |
|---|---|
| d_model | 512 |
| d_k / d_v | 64 |
| Heads | 8 |
| Layers | 4 |
| d_ff | 2048 |
| Optimizer | Adam |
| Learning rate | 0.0001 |
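The per-head dimension in the table follows from the others: d_k = d_v = d_model / h = 512 / 8 = 64. A hypothetical config sketch making that constraint explicit (illustrative names, not the repo's `config.py`):

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    """Illustrative hyperparameter container (names are assumptions)."""
    d_model: int = 512
    n_heads: int = 8
    n_layers: int = 4
    d_ff: int = 2048
    lr: float = 1e-4

    @property
    def d_k(self) -> int:
        # Per-head dimension: d_model must split evenly across heads.
        assert self.d_model % self.n_heads == 0
        return self.d_model // self.n_heads

cfg = TransformerConfig()
# cfg.d_k → 64, matching the d_k / d_v row of the table
```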
## Training
Trained on the opus_books English-French dataset. The tokenizer splits words and punctuation into separate tokens using a regex, and builds a shared vocabulary from both the source and target languages.
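A minimal sketch of that splitting scheme (the exact regex and special tokens are assumptions; the repo's `tokenizer.py` may differ in detail):

```python
import re

# Match either a run of word characters or a single non-space punctuation mark.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")

def tokenize(text: str) -> list[str]:
    """Split text into lowercase word and punctuation tokens."""
    return WORD_OR_PUNCT.findall(text.lower())

def build_vocab(sentences: list[str]) -> dict[str, int]:
    """Build one shared vocabulary over source and target sentences,
    reserving low ids for special tokens (token names are assumptions)."""
    vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}
    for sentence in sentences:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

tokens = tokenize("Hello, world!")
# → ['hello', ',', 'world', '!']
```

Sharing one vocabulary between English and French keeps the embedding tables aligned and lets overlapping tokens (punctuation, numerals, cognates) reuse the same ids.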