Subtitle: From raw tokens to a functional neural network—how to construct, train, and document every line of code for your custom LLM.
Cross-entropy loss is standard. But for your PDF, emphasize the importance of perplexity (exp(loss)). A perplexity of 50 means the model is as uncertain as choosing uniformly among 50 options.
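The relationship between loss and perplexity can be checked with a few lines of plain Python. This is a minimal sketch, not the training code itself: the helper names `cross_entropy` and `perplexity` are made up for illustration, and the loss is assumed to be measured in nats (natural log), which is what PyTorch's cross-entropy reports.

```python
import math

def cross_entropy(probs_of_target):
    """Mean negative log-likelihood (in nats) of the correct tokens."""
    return -sum(math.log(p) for p in probs_of_target) / len(probs_of_target)

def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss)

# A model that assigns probability 1/50 to every correct token
uniform_loss = cross_entropy([1 / 50] * 8)
print(round(uniform_loss, 4))           # ln(50), roughly 3.912
print(round(perplexity(uniform_loss)))  # 50
```

A loss of about 3.9 therefore corresponds to the "choosing among 50 options" case described above.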
Logging: Every 100 steps, print loss and sample generation with a temperature setting.
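The logging step could look something like the sketch below. It is written in plain Python so it runs without a trained model: `maybe_log`, `sample_token`, and `softmax_with_temperature` are hypothetical helper names, and the logits are a stand-in for what a real model would output at one position.

```python
import math
import random

LOG_EVERY = 100  # matches the "every 100 steps" cadence

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

def maybe_log(step, loss):
    """Print loss and perplexity on logging steps; return the line, or None otherwise."""
    if step % LOG_EVERY != 0:
        return None
    line = f"step {step:5d} | loss {loss:.4f} | ppl {math.exp(loss):.1f}"
    print(line)
    return line

fake_logits = [2.0, 1.0, 0.1]  # stand-in for model output
maybe_log(100, 3.912)
print(sample_token(fake_logits, temperature=0.8))
```

Temperatures below 1.0 make the samples more conservative, while values above 1.0 make them more varied, which is useful when judging by eye whether the model is improving.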
Here is where 80% of hobbyist projects crash. You cannot feed raw text into a neural network. You need a tokenizer.
Your PDF will dedicate an entire chapter to tiktoken (the tokenizer used by OpenAI) or sentencepiece (used by Google).
The core code you will write (in Python/PyTorch):
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 byte-pair encoding
text = "Hello, I am building an LLM."
tokens = enc.encode(text)
print(tokens)  # a list of integer token IDs, beginning 15496 ("Hello"), 11 (",")
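Why this matters: A naive "character-level" tokenizer (treating each letter as a token) needs one step per character, so a 1,000-character paragraph consumes 1,000 positions of the context window. A sub-word tokenizer such as GPT-2's BPE averages roughly four characters per token on English text, which cuts that to about 250 steps.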
The PDF will force you to build the training dataset loader:
You need to chunk your raw text (Project Gutenberg, FineWeb, or TinyStories) into fixed-context windows. If your context length is 256 tokens, you slide a window of 256 tokens across your tokenized dataset. This prepares the input tensors of shape (B, T), where B is the batch size and T is the sequence length.
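The sliding-window idea can be sketched in plain Python before moving it into a PyTorch `Dataset`. The function names `make_windows` and `make_batches` and the `stride` parameter are illustrative assumptions, and the integer list stands in for real token IDs. Each target window is the input window shifted by one token, since the model is trained to predict the next token.

```python
def make_windows(token_ids, context_length, stride=None):
    """Slice a token stream into (input, target) pairs; targets are inputs shifted by one."""
    stride = stride or context_length
    inputs, targets = [], []
    # Stop early enough that every target window is full length
    for start in range(0, len(token_ids) - context_length, stride):
        inputs.append(token_ids[start : start + context_length])
        targets.append(token_ids[start + 1 : start + context_length + 1])
    return inputs, targets

def make_batches(inputs, targets, batch_size):
    """Group windows into batches of shape (B, T); a trailing partial batch is dropped."""
    for i in range(0, len(inputs) - batch_size + 1, batch_size):
        yield inputs[i : i + batch_size], targets[i : i + batch_size]

tokens = list(range(20))  # stand-in for real token IDs
X, Y = make_windows(tokens, context_length=4, stride=4)
print(X[0], Y[0])  # [0, 1, 2, 3] [1, 2, 3, 4]
```

A stride smaller than the context length produces overlapping windows, which yields more training examples from the same text at the cost of redundancy.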
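The sinusoidal positional encoding for position pos and dimension pair index i is:
\[ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]
\[ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]
Add this encoding to the token embeddings.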
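import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len).unsqueeze(1)  # shape (max_len, 1)
        # 1 / 10000^(2i / d_model) for each even dimension index 2i
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer('pe', pe)  # saved with the model, not trained

    def forward(self, x):
        # x: (B, T, d_model); slice the table to the first T positions
        return x + self.pe[:x.size(1)]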
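To sanity-check the class above, the same formula can be evaluated by hand for a single position. The sketch below uses only the standard library. The function name `positional_encoding` is hypothetical, and here the loop variable `k` is the even dimension index (the `2i` of the formula).

```python
import math

def positional_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    pe = [0.0] * d_model
    for k in range(0, d_model, 2):  # k plays the role of 2i
        angle = pos / (10000 ** (k / d_model))
        pe[k] = math.sin(angle)
        if k + 1 < d_model:
            pe[k + 1] = math.cos(angle)
    return pe

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))  # sin(1), cos(1), sin(0.01), cos(0.01)
```

Position 0 always encodes to alternating 0 and 1, and higher dimensions oscillate more slowly, which is a quick way to spot an indexing mistake in a tensor implementation.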
The preprocessed text data is then tokenized into individual words or subwords. The tokens are then embedded into dense vector representations using an embedding layer.
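An embedding layer is, at its core, a lookup table with one learnable row per vocabulary entry. The toy class below illustrates the lookup in plain Python. The class name, the small vocabulary size, and the 0.02 initialization scale are assumptions for illustration only; in PyTorch the equivalent building block is `nn.Embedding`.

```python
import random

class Embedding:
    """Toy embedding table: one d_model-dimensional vector per token ID."""

    def __init__(self, vocab_size, d_model, seed=0):
        rng = random.Random(seed)
        # Small random initial values; a real model learns these during training
        self.weight = [
            [rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(vocab_size)
        ]

    def __call__(self, token_ids):
        # Look up the row for each token ID: output shape is (T, d_model)
        return [self.weight[t] for t in token_ids]

emb = Embedding(vocab_size=10, d_model=4)
vectors = emb([1, 2, 1])
print(len(vectors), len(vectors[0]))  # 3 4
```

The same token ID always maps to the same vector, which is why positional information has to be added separately.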