Language Models: Challenges and Strategies for LLM Engineers

Key Takeaways

1. Large language models (LLMs) are essential for modern AI systems, but their training is complex.

2. Tokenization, attention, and fine-tuning are key concepts for understanding and developing effective LLMs.

3. Transformer architectures, with their attention mechanisms, are at the heart of text data processing in LLMs.

💡Why it matters — Engineers need to master these concepts to design high-performing AI systems that meet current needs.

The Fundamentals of Large Language Models

Large Language Models (LLMs) have become the cornerstone of modern artificial intelligence systems, used in applications as varied as chatbots, copilots, search, programming, and automation. However, for engineers venturing into this field, the learning curve can seem steep and fragmented. Concepts such as tokenization, attention, fine-tuning, and evaluation are often explained in isolation, making it difficult to form a coherent mental model of how everything fits together.

I personally encountered these challenges when transitioning from computer vision to LLMs. In a short time, I had to grasp not only the theory behind transformers but also the practical realities: training trade-offs, inference bottlenecks, alignment challenges, and evaluation pitfalls. This article is designed to bridge that gap. Rather than diving deeply into a single component, it provides a structured map of the LLM engineering landscape, covering the key elements you need to understand to design, train, and deploy LLM systems in the real world.

We will move from the fundamentals of text representation, through model architectures and training strategies, to inference optimization, evaluation, and practical considerations such as prompt engineering and reducing hallucinations.

Converting Letters to Numbers

When we feed data to a model, we cannot simply give it letters or words directly; we need a way to convert text into numbers. Intuitively, we might think of assigning a unique number to each word in the language and feeding those numbers to the model. However, there are hundreds of thousands of words in the English language, and training on such a vast vocabulary would be impractical in terms of memory and efficiency.

What can we do instead? We could try encoding letters, since there are only 26 in the English alphabet. But this would also pose problems: models would struggle to grasp the meaning of words from individual letters, and sequences would become unnecessarily long, making training difficult.

A practical solution is tokenization. Instead of representing language at the word or character level, we break the text into the most frequent and useful sub-word units. These sub-words act as the building blocks of the model's vocabulary: common words appear as whole tokens, while rare words can be represented as combinations of smaller sub-words.

A common algorithm for this is Byte-Pair Encoding (BPE). BPE starts with individual characters as tokens, then repeatedly merges the most frequent pair of adjacent tokens into a new token, gradually building a vocabulary of sub-word units until a desired vocabulary size is reached.
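The merge loop described above can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not a production tokenizer: the corpus is split on whitespace, merges never cross word boundaries, and the function names (`get_pair_counts`, `merge_pair`, `train_bpe`) are hypothetical names chosen for this example.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties go to the first pair seen
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # the learned merge rules, in order
print(words)   # the corpus re-segmented with the learned sub-words
```

In this tiny corpus the frequent word "low" ends up as a single token, while "lower" and "lowest" are represented as "low" followed by smaller pieces, which mirrors the behavior described for common and rare words.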

At this point, each token is assigned a unique number — its ID in the vocabulary.

After tokenizing the data and assigning token IDs, we need to attach semantic meaning to these IDs. This is done through text embeddings — mappings between discrete token IDs and continuous vector spaces. In this space, words or tokens with similar meanings are placed close to each other, and even algebraic operations can capture semantic relationships (for example: embedding(queen) - embedding(woman) + embedding(man) ≈ embedding(king)).

In general, embedding layers are trained to take token IDs as input and produce dense vectors as output. These vectors are optimized jointly with the model's training objective (for example, predicting the next token). Over time, the model learns embeddings that encode both syntactic and semantic information about words, sub-words, or tokens. Well-known embedding models include word2vec and GloVe, which learn one static vector per word, and BERT, which produces context-dependent embeddings.
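To make the lookup concrete, here is a minimal sketch of an embedding table and the analogy arithmetic mentioned earlier. Everything in it is an assumption made for illustration: the four-token vocabulary, the hand-picked 2-D vectors, and the helper names `embed` and `cosine`. Real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import math

# Hypothetical vocabulary: token -> ID.
token_to_id = {"king": 0, "queen": 1, "man": 2, "woman": 3}

# Hypothetical embedding table: row i is the vector for token ID i.
# Hand-picked so that axis 0 loosely means "royalty" and axis 1 "male".
embedding_table = [
    [0.9, 0.8],  # king
    [0.9, 0.2],  # queen
    [0.1, 0.8],  # man
    [0.1, 0.2],  # woman
]

def embed(token):
    """Look up the dense vector for a token via its ID."""
    return embedding_table[token_to_id[token]]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# embedding(queen) - embedding(woman) + embedding(man)
analogy = [q - w + m for q, w, m in zip(embed("queen"), embed("woman"), embed("man"))]

# Find the vocabulary token whose vector is most similar to the result.
nearest = max(token_to_id, key=lambda t: cosine(analogy, embed(t)))
print(nearest)
```

With these toy vectors the nearest token is "king", but only because the values were chosen to make the relationship exact; learned embeddings satisfy such analogies only approximately.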

Positional Encoding

On their own, transformer-based LLMs are not aware of token order: attention treats its input as an unordered set. Natural language, however, is sequential (the order of words matters), and tokens far apart in a sentence can still be strongly related. To capture both local order and long-range dependencies, we inject each token's positional information into its embedding.

There are several common approaches to positional encoding:

Absolute positional encodings: Fixed patterns, such as sine and cosine functions at different frequencies, are added to token embeddings. This is simple and effective but may struggle with very long sequences, as it does not explicitly model relative distances.
Relative positional encodings: These represent the distance between tokens instead of their absolute positions. A popular method is RoPE (Rotary Position Embedding), which rotates query and key vectors by an angle that depends on position, so that attention scores depend on the relative offset between tokens. This approach adapts better to long sequences and captures relationships between distant tokens more naturally.
Learned positional encodings: Instead of relying on fixed mathematical functions, the model learns the position embeddings directly during training. This allows for flexibility but may generalize poorly to sequence lengths not seen during training.
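As an illustration of the first approach, the sketch below computes sine/cosine absolute encodings and adds them to token embeddings. It assumes the formulation from the original transformer paper (base 10000, sine on even dimensions, cosine on odd ones) and an even embedding size; the function names are hypothetical.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Encoding for one position; assumes d_model is even."""
    enc = []
    for i in range(0, d_model, 2):
        # Each pair of dimensions uses a different frequency.
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))  # even dimension
        enc.append(math.cos(angle))  # odd dimension
    return enc

def add_positions(token_embeddings):
    """Add the positional encoding to each token embedding, by position."""
    d_model = len(token_embeddings[0])
    return [
        [x + p for x, p in zip(vec, sinusoidal_encoding(pos, d_model))]
        for pos, vec in enumerate(token_embeddings)
    ]

# Three tokens with made-up 4-dimensional embeddings.
embeddings = [[0.0] * 4 for _ in range(3)]
for row in add_positions(embeddings):
    print([round(v, 3) for v in row])
```

Because the input embeddings here are all zeros, the printed rows are the positional encodings themselves, which makes it easy to see that every position receives a distinct pattern.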

Model Architecture

After the data has been tokenized, embedded, and enriched with positional encodings, it is passed through the model. The state-of-the-art architecture for processing text data is the transformer architecture, which is fundamentally based on the attention mechanism. A transformer generally consists of a stack of transformer blocks, each built from the following components:

Multi-Head Attention: Allows the model to focus on different parts of the input sequence simultaneously, capturing diverse context. It computes Queries (Q), Keys (K), and Values (V) to define relationships between words.
Position-wise Feed-Forward Network (FFN): A fully connected network applied to each position independently, adding non-linearity.
Residual Connections: Shortcut connections that help gradients flow during training, preventing information loss.
Layer Normalization: Normalizes each token's activations across the feature dimension to stabilize training.
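The way residual connections and layer normalization wrap a sub-layer can be sketched for a single token vector as follows. This is a simplified illustration: it omits the learned scale and shift parameters that layer normalization normally has, and it uses the "add, then normalize" ordering of the original transformer, whereas many later models normalize the sub-layer input instead. The names `layer_norm` and `residual_block` are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Apply a sub-layer, add the input back (residual), then normalize."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

# Stand-in sub-layer (an assumption for the demo): doubles every feature.
# In a real block this would be attention or the feed-forward network.
double = lambda vec: [2.0 * v for v in vec]

print(residual_block([1.0, 2.0, 3.0], double))
```

The residual sum `x + sublayer(x)` is what lets the input pass through unchanged if the sub-layer contributes little, which is the shortcut behavior described in the list above.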

The attention mechanism at the core of the transformer, the architecture introduced in the paper "Attention Is All You Need," projects each token into three vectors: a query (what it is looking for), a key (what it offers), and a value (the information it carries). Attention works by comparing queries to keys (via similarity scores) to decide how much of each value to aggregate. This allows the model to dynamically draw in relevant context based on content, rather than position.
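The query/key/value comparison can be sketched in plain Python. This minimal example assumes the scaled dot-product form (dot-product similarity divided by the square root of the key dimension, then softmax), and it takes already-projected query, key, and value vectors as input; the learned projection matrices are left out for brevity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Return (outputs, weights) for scaled dot-product attention."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# One query that matches the first key more closely than the second.
outputs, weights = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
print(weights[0])  # more weight on the first key
print(outputs[0])  # output leans toward the first value
```

The output is a blend of the value vectors, with the blend decided entirely by how well the query matches each key.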

Multi-head attention runs several attention mechanisms in parallel, each with its own learned projections. Think of each "head" as focusing on a different relationship (for example, syntax, coreference, semantics). Combining them gives the model a richer and more nuanced understanding than a single pass of attention.

There are several types of attention mechanisms that vary based on their purpose: self-attention, masked self-attention, and cross-attention.

Self-attention operates within a single sequence, allowing tokens to attend to each other (for example, understanding a sentence).
Masked self-attention is similar to self-attention with one key difference: each token attends only to itself and earlier tokens, never to future ones.
Cross-attention connects two sequences, where one provides queries and the other provides keys/values (for example, a decoder handling an encoded input in translation). The key difference lies in whether the context comes from the same source or an external source.
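The difference between self-attention and masked self-attention comes down to a mask applied to the scores before softmax. The sketch below is one common way to express it, assuming masked positions are set to negative infinity so they receive zero weight; `causal_mask` and `masked_softmax` are hypothetical helper names.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax that gives zero weight to masked-out positions."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask_row)]
    m = max(masked)  # finite, because a token can always attend to itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
for row in mask:
    print(row)

# Equal raw scores for the second token: weight is split over positions
# 0 and 1 only, and the future position 2 gets none.
print(masked_softmax([0.0, 0.0, 0.0], mask[1]))
```

Plain self-attention corresponds to a mask that is `True` everywhere, while cross-attention uses the same weighting step with keys and values taken from the other sequence.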

Standard attention compares each token with every other token, so its cost is quadratic in the sequence length n, O(n²). As sequences get longer, compute and memory usage grow quickly, making very long contexts costly and slow. This is one of the main bottlenecks in scaling LLMs and an active area of research, for example sparse attention schemes that restrict which tokens attend to which others.
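A quick back-of-the-envelope calculation shows what quadratic growth means in practice. The sketch counts the entries of the attention score matrix for one layer; the 2 bytes per entry is an assumption (16-bit floats), and the estimate ignores implementations that avoid materializing the full matrix.

```python
def attention_score_entries(seq_len, num_heads=1):
    """Number of query-key scores computed in one attention layer."""
    return num_heads * seq_len * seq_len

def score_matrix_bytes(seq_len, num_heads=1, bytes_per_entry=2):
    """Memory needed to store all scores, assuming 2 bytes per entry."""
    return attention_score_entries(seq_len, num_heads) * bytes_per_entry

for n in (1_000, 2_000, 4_000):
    print(n, attention_score_entries(n), score_matrix_bytes(n))
```

Each doubling of the sequence length multiplies the number of scores by four, which is why context length is such an expensive dimension to scale.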

Types of Architecture

Language models are typically built on one of the following transformer architectures:

Encoder-only models — Each token can attend to all other tokens in the sequence (bidirectional attention). These models are typically trained with masked language modeling (MLM), where some tokens in the input are masked, and the task is to predict them. This setup is well-suited for classification and comprehension tasks (for example, BERT).
Decoder-only models — Each token can only attend to the tokens that precede it in the sequence (causal or unidirectional attention). These models are trained with causal language modeling, meaning predicting the next token given all previous ones. This setup is ideal for text generation (for example, GPT).
Encoder-decoder models — The input sequence is first processed by the encoder, and the resulting representations are then fed into the decoder via cross-attention layers. The decoder generates an output sequence one token at a time, conditioned on both the encoder's representations and its own previous outputs. This setup is common for sequence-to-sequence tasks like machine translation (for example, T5, BART).

Predicting the Next Token and Output Decoding

Models are trained to predict the next token by producing a probability distribution over all tokens in the vocabulary. The model outputs a vector of logits, one unnormalized score per vocabulary token, which is passed through softmax to obtain the probability of each token being the next one.
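The logits-to-probabilities step looks like this in code. The four-token vocabulary and the logit values are made up for illustration; a real model produces one logit for each of tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "cat", "sat", "mat"]  # hypothetical vocabulary
logits = [2.0, 1.0, 0.5, -1.0]        # made-up model outputs

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

Softmax preserves the ranking of the logits while turning them into non-negative values that sum to one, so they can be read as next-token probabilities.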

In the simplest approach, we could always choose the token with the highest probability (this is called greedy decoding). However, this strategy is often suboptimal, as the locally most probable token does not always lead to the most coherent or natural sentence overall.
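A tiny example shows why the locally best choice can be globally worse. The two-step "model" below is entirely hypothetical: it is just a pair of probability tables with made-up numbers, chosen so that the most probable first token leads to a less probable sequence overall.

```python
# Hypothetical probabilities for the first token.
first_step = {"A": 0.6, "B": 0.4}

# Hypothetical probabilities for the second token, given the first.
second_step = {
    "A": {"x": 0.55, "y": 0.45},
    "B": {"x": 0.9, "y": 0.1},
}

def greedy():
    """Pick the most probable token at each step."""
    t1 = max(first_step, key=first_step.get)
    t2 = max(second_step[t1], key=second_step[t1].get)
    return (t1, t2), first_step[t1] * second_step[t1][t2]

def exhaustive():
    """Score every two-token sequence and keep the most probable one."""
    best = None
    for t1, p1 in first_step.items():
        for t2, p2 in second_step[t1].items():
            candidate = ((t1, t2), p1 * p2)
            if best is None or candidate[1] > best[1]:
                best = candidate
    return best

print("greedy:    ", greedy())
print("exhaustive:", exhaustive())
```

Greedy decoding commits to "A" because it is the most probable first token, yet the sequence starting with "B" has the higher joint probability. Exhaustive search is only feasible for toy cases like this one, since the number of sequences grows exponentially with length.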

To improve generation, we can sample from the probability distribution. This introduces diversity and allows the model to explore different continuations. Additionally, we can branch the generation process by considering multiple candidate tokens and expanding...