A from-scratch implementation of the original Transformer architecture described in Attention Is All You Need (Vaswani et al., 2017), built in PyTorch without relying on high-level abstractions.
## Motivation
The goal of this project is to gain a thorough understanding of the Transformer architecture by studying and replicating each of its building blocks, then assembling them into a working model and training it on real-world data (English-French translation).
## Repository Structure
`transformer-from-scratch.ipynb` contains the block-wise implementation of the Transformer, alongside the personal notes I took while breaking down and digesting the architecture.

`transformer/` contains a cleaned-up, streamlined version of the code from the notebook, which is more verbose for clarity.
```
transformer/
├── model/
│   ├── attention.py     # MultiHeadAttention (self, masked, cross)
│   ├── encoder.py       # EncoderLayer, Encoder
│   ├── decoder.py       # DecoderLayer, Decoder
│   ├── feedforward.py   # Position-wise FeedForwardBlock
│   ├── embedding.py     # PositionalEncoding
│   └── transformer.py   # Seq2Seq (composes encoder + decoder)
├── data/
│   └── tokenizer.py     # Word-level tokenizer with regex splitting
├── train.py             # Training loop
├── main.py              # Entry point
└── config.py            # Hyperparameters and device configuration
```
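As a reference point for what `embedding.py` implements, the paper's sinusoidal positional encoding can be sketched in plain Python (the repo's actual code uses PyTorch tensors; this is just the formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)):

```python
import math

def positional_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Sinusoidal positional encoding from "Attention Is All You Need".
    Returns a seq_len x d_model table of position encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even dimensions: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
# Row 0 alternates sin(0) = 0 and cos(0) = 1 across the embedding dimensions.
```

Because the encoding depends only on position and dimension, it is computed once and added to the token embeddings, injecting order information without any learned parameters.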
## Hyperparameters
| Parameter | Value |
|---|---|
| d_model | 512 |
| d_k / d_v | 64 |
| Heads | 8 |
| Layers | 4 |
| d_ff | 2048 |
| Optimizer | Adam |
| Learning rate | 0.0001 |
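The per-head dimension in the table follows from the others: d_k = d_v = d_model / h = 512 / 8 = 64. A hypothetical config sketch making that constraint explicit (illustrative names, not the repo's `config.py`):

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    """Illustrative hyperparameter container (names are assumptions)."""
    d_model: int = 512
    n_heads: int = 8
    n_layers: int = 4
    d_ff: int = 2048
    lr: float = 1e-4

    @property
    def d_k(self) -> int:
        # Per-head dimension: d_model must split evenly across heads.
        assert self.d_model % self.n_heads == 0
        return self.d_model // self.n_heads

cfg = TransformerConfig()
# cfg.d_k → 64, matching the d_k / d_v row of the table
```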
## Training
Trained on the opus_books English-French dataset. The tokenizer splits words and punctuation into separate tokens using a regex, and builds a shared vocabulary from both the source and target languages.
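A minimal sketch of that splitting scheme (the exact regex and special tokens are assumptions; the repo's `tokenizer.py` may differ in detail):

```python
import re

# Match either a run of word characters or a single non-space punctuation mark.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")

def tokenize(text: str) -> list[str]:
    """Split text into lowercase word and punctuation tokens."""
    return WORD_OR_PUNCT.findall(text.lower())

def build_vocab(sentences: list[str]) -> dict[str, int]:
    """Build one shared vocabulary over source and target sentences,
    reserving low ids for special tokens (token names are assumptions)."""
    vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}
    for sentence in sentences:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

tokens = tokenize("Hello, world!")
# → ['hello', ',', 'world', '!']
```

Sharing one vocabulary between English and French keeps the embedding tables aligned and lets overlapping tokens (punctuation, numerals, cognates) reuse the same ids.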