Understanding Fine-Tuning Challenges in LLMs (QLoRA, LoRA, and Beyond)

Why read this?
Because the way we fine-tune LLMs (QLoRA/LoRA) might be quietly breaking the very intelligence we paid for... and the fixes aren’t what you think.

TL;DR

  • Your model forgets... on purpose. Fine-tuning carves a deep, narrow canyon in the weight landscape. Great at the new task, worse at everything else. But which models forget more, small or big?
  • LoRA’s “intruder dimensions.” Low-rank adapters don’t just nudge weights, they create new, loud directions that drown out pretraining. Can a simple change in rank and placement defuse this?
  • Quantization’s double whammy. QLoRA starts on a slightly blurry base, then asks tiny adapters to fix both the new task and the blur. When does 4-bit help you and when does it hurt you?
  • Dense vs MoE vs Hybrid (Mamba + MoE). Modular experts can quarantine new skills if you train them right. Are IBM Granite 4 hybrid models inherently safer… or just differently fragile?
  • Hallucinations. Fine-tuning can slash hallucinations or supercharge them, depending on one subtle choice in your pipeline.
  • What actually works. Targeted adapters (Q/V-only), low LR + replay, multi-task recipes, expert expansion, and wild new model-merging tricks that keep both old and new brains intact.

If your “better” model suddenly sounds confident but wrong, it might be LoRA’s hidden geometry and not your data. Want the knobs, ranks, and recipes that stop the damage? Read on.

1. Catastrophic Forgetting in Large Language Models

What it is: Catastrophic forgetting refers to a model losing previously learned abilities or knowledge when fine-tuned on new tasks. In large language models (LLMs), this manifests as a drop in general competency after specialization. The phenomenon is well-documented: even models from 1B up to 7B parameters show noticeable forgetting during continual fine-tuning[1]. In fact, one empirical study found that larger models in this range forget more, likely because they had stronger initial capabilities to lose[1]. Fine-tuning a big model on a narrow dataset can carve a “deep, narrow canyon” in its weight landscape, making it excellent at the new task but worse at everything else.

Why LoRA/QLoRA doesn’t magically prevent it: A common misconception is that parameter-efficient methods (like LoRA or QLoRA) avoid forgetting since the original weights remain unchanged. In reality, the combined model (base + adapters) can still exhibit serious forgetting. LoRA fine-tuning effectively overwrites behavior by adding new weight deltas, and the model’s outputs may no longer access the old knowledge. Recent research confirms that LoRA-tuned models can forget significant parts of the pre-training distribution, sometimes even more than fully fine-tuned models[2]. In short, freezing the base weights doesn’t guarantee preserving the base knowledge: the newly learned vectors can override it during inference.

Evidence of forgetting: A 2023 study asked whether instruction-tuned LLMs forget their general abilities. The answer: yes. They observed clear degradation in domain knowledge, reasoning, and reading comprehension after sequential fine-tuning[3][1]. Surprisingly, in the 1B–7B scale range, the larger the model, the more severe the forgetting[1] (perhaps because larger models had more knowledge to begin with).

Decoder vs. Encoder-Decoder: The same study found decoder-only models (like BLOOMZ) retained more knowledge than encoder-decoder models (like mT0) when fine-tuned on similar data[4]. This suggests architecture can influence forgetting: decoder-only transformers (e.g. LLaMA, Qwen, GPT-style) might be a bit more resilient than sequence-to-sequence models for continual learning, though both can forget without mitigations.

Takeaway: Catastrophic forgetting is a general problem whenever you fine-tune an LLM on a narrow task. Neither model size alone nor adapter methods alone solve it. In fact, if fine-tuning “succeeds” in making a model highly specialized, it usually means the model’s broad knowledge was partially overwritten in the process[1][2].
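
A quick, practical way to spot this before it bites: score the model on a handful of general-knowledge probe texts before and after fine-tuning and compare the loss. Below is a minimal sketch; the model names and probe texts are placeholders, not taken from the cited studies.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tiny "general knowledge" probe set -- in practice use a few hundred held-out
# examples that have nothing to do with your fine-tuning task.
PROBE_TEXTS = [
    "The capital of France is Paris, a city on the banks of the Seine.",
    "Photosynthesis converts carbon dioxide and water into glucose and oxygen.",
]

def mean_probe_loss(model_name: str, texts: list[str]) -> float:
    """Average next-token loss on the probe texts (a large rise after fine-tuning hints at forgetting)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            batch = tok(text, return_tensors="pt").to(model.device)
            losses.append(model(**batch, labels=batch["input_ids"]).loss.item())
    return sum(losses) / len(losses)

base_loss = mean_probe_loss("meta-llama/Llama-2-7b-hf", PROBE_TEXTS)                # placeholder base model
tuned_loss = mean_probe_loss("my-org/llama-2-7b-domain-lora-merged", PROBE_TEXTS)   # placeholder fine-tune
print(f"general-domain loss: base={base_loss:.3f}, fine-tuned={tuned_loss:.3f}")
```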

2. LoRA’s Intruder Dimensions – A Deep Dive

One of the key neural-level insights from recent research is that LoRA fine-tuning changes the model’s weight space in a different way than full fine-tuning. Specifically, LoRA tends to introduce what researchers call “intruder dimensions.”

  • What are intruder dimensions? When you fine-tune a model with LoRA, you add low-rank update matrices to certain weight layers. It turns out that these updates often create new singular vectors in the weight matrix that have large singular values and are almost orthogonal to the original weight space[5]. In other words, LoRA isn’t just nudging existing neurons – it’s activating new directional axes in weight space that the pre-trained model never strongly used. These new axes (the intruder dims) carry the task-specific information.
  • Why is this bad for generalization? The presence of intruder dimensions correlates with the model becoming a worse representation of its pre-training distribution[2]. A fully fine-tuned model tends to stay “spectrally similar” to the original, meaning it reuses the same dominant neural directions but slightly adjusted[6][5]. LoRA models, by contrast, develop these brand-new directions that thrust the model’s representations away from the manifold of general knowledge. Think of the pre-trained model as an orchestra: full fine-tuning retunes the instruments slightly, whereas LoRA can introduce a few entirely new instruments blaring a different tune. The new tune fits the fine-tuning data perfectly, but it doesn’t harmonize with the old songs.
  • Empirical findings: An October 2024 study titled “LoRA vs Full Fine-Tuning: An Illusion of Equivalence” demonstrated this clearly[5][2]. LoRA-trained weight matrices consistently showed gaps when their singular vectors were checked for alignment against the original ones (indicating new orthogonal components)[5]. Full fine-tuning showed no such orthogonal jump; it adjusted magnitudes along existing dimensions instead. Behaviorally, the LoRA models with pronounced intruder dimensions forgot more of the pre-training data and were less robust on out-of-distribution tests, even if their in-task accuracy was as good as a fully fine-tuned model[2]. This suggests LoRA solutions live in a different part of parameter space that is more prone to forgetting and brittleness.
  • Does increasing LoRA rank help? Yes, partially. Higher-rank LoRA (i.e., using larger update matrices) blurs the line between LoRA and full fine-tuning. Researchers observed that with higher ranks, LoRA updates start to behave more like full fine-tuning: intruder dimensions are fewer, and generalization improves[7][8]. In fact, a very high-rank LoRA can match full fine-tuning performance and generality, though at the cost of more trainable parameters (approaching full model tuning)[7]. There’s a trade-off: low-rank LoRA is extremely parameter-efficient but prone to these distortions, while higher-rank LoRA retains more of the model’s breadth[2][9]. (Notably, there are techniques like rank stabilization that help at high ranks to avoid unstable training[8], but that’s an implementation detail.)
  • “Intruders” and forgetting: The appearance of intruder dimensions is essentially a red flag that the model’s representation has drifted away from its original form. It’s strongly linked with the model performing worse on anything outside the fine-tune distribution[2]. Continual learning tests have shown LoRA-tuned models struggle to adapt further without interference, compared to fully tuned models[2]. In practical terms, this means a LoRA-tuned LLaMA or Qwen might do great on your specific dataset, but if you ask it something it knew before but wasn’t in the new data, it may falter or give odd outputs: the knowledge is still somewhere in the frozen weights, but the model can’t access it readily because the new adapters are steering it along new dimensions (a rough diagnostic sketch for spotting this follows the analogy below).

Analogy: In the orchestra analogy, LoRA’s intruder dimensions are like hiring a few loud specialist musicians. They don’t replace the existing orchestra members, but they play so loudly on new songs that the rest of the orchestra’s knowledge gets drowned out. Full fine-tuning is like retraining the whole orchestra subtly: no single instrument dominates, so they can still collectively play the old repertoire (perhaps not perfectly, but in tune). LoRA’s approach is efficient (only a few new players) but risks overpowering the original music.
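
To make the geometry concrete, here is a rough diagnostic sketch, inspired by (but not identical to) the paper’s analysis: take a weight matrix before and after fine-tuning and count how many of the updated matrix’s top singular vectors fail to align with any of the original’s. The value of k and the threshold are illustrative choices, not values from the paper.

```python
import torch

def count_intruder_dimensions(w_base: torch.Tensor, w_tuned: torch.Tensor,
                              k: int = 32, threshold: float = 0.5) -> int:
    """Count top-k left singular vectors of w_tuned that are nearly orthogonal
    to every top-k left singular vector of w_base ("intruder"-like directions)."""
    u_base, _, _ = torch.linalg.svd(w_base.float(), full_matrices=False)
    u_tuned, _, _ = torch.linalg.svd(w_tuned.float(), full_matrices=False)
    # Pairwise |cosine similarity| between the two sets of top-k singular vectors.
    sims = (u_tuned[:, :k].T @ u_base[:, :k]).abs()      # shape (k, k)
    best_match = sims.max(dim=1).values                  # best alignment for each tuned direction
    return int((best_match < threshold).sum().item())

# Toy example: a frozen base matrix plus a small rank-8, LoRA-style update.
w0 = torch.randn(1024, 1024) / 32
delta = (torch.randn(1024, 8) @ torch.randn(8, 1024)) * 0.05
print("intruder-like directions:", count_intruder_dimensions(w0, w0 + delta))
```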

3. The Quantization Factor in QLoRA (“The Q”)

QLoRA adds another twist to this story: quantization. The “Q” in QLoRA means we take the base model’s weights and compress them to a 4-bit format (usually NF4, a “NormalFloat” scheme designed for minimal loss) before applying LoRA. This drastically reduces memory, enabling fine-tuning of 65B+ models on a single GPU[10], but it does introduce approximation error in the model’s foundation.

  • Quantization error: By converting 16-bit or 32-bit floating-point weights down to 4-bit, we inevitably lose some precision. Each weight can only take one of 16 possible values in 4-bit. The QLoRA paper used sophisticated techniques (like quantile-based NF4 quantization and double quantization) to minimize the damage, and they reported that it has minimal impact on final performance[11]. In many benchmarks, a QLoRA fine-tuned model is almost as good as a full 16-bit fine-tuned model, the difference in accuracy is often within a point or two, which is remarkable. In some cases, the added noise can even act like regularization and reduce overfitting slightly[11].
  • Subtle degradation of base knowledge: However, “minimal difference” is not “zero difference.” Quantizing the entire model means that all the pre-trained knowledge is now represented a little less precisely. That could mean, for example, the model’s grasp on a very obscure fact or a subtle linguistic nuance might be weakened. Before fine-tuning even begins, the base model loaded in 4-bit form is a slightly degraded version of itself. It might have slightly less capacity for precise recall, a somewhat blurrier memory of rare facts, and so on. Most users won’t notice this on common tasks, but at a neural level the model’s probability landscape has slightly shifted.
  • QLoRA training dynamics: During QLoRA fine-tuning, the low-rank adapters (which are still usually trained in higher precision) have to do two jobs at once: (1) learn the new task, and (2) if possible, adjust for any quirks introduced by quantization. The adapters are like small, high-precision “corrections” added to the coarse 4-bit base. If the base model outputs are off by a bit due to quantization, the LoRA can in theory learn to counteract that. But the LoRA has limited capacity (being low-rank), so it may prioritize fitting the new data over perfectly restoring every bit of lost generality. In essence, the fine-tuning is happening on a slightly shaky foundation. This could further harm generalization. For instance, if a weight that distinguished two very similar concepts was quantized to the same value, the model might confuse those concepts more readily after QLoRA, unless the LoRA explicitly re-teaches the distinction.
  • Is the effect significant? In practice, QLoRA has been shown to preserve performance very well on the fine-tuning task[11]. The concern is more about out-of-domain generality. There isn’t a large body of literature isolating quantization-caused forgetting, but anecdotal evidence from the community suggests a small quality drop. One guide notes that 4-bit quantization can cause a “small quality drop” but generally “the performance difference is minimal”[11]. It’s likely that quantization alone is a minor factor compared to the larger issue of overfitting to the new task. Still, it means a QLoRA model might be a hair worse on some fringe knowledge than a LoRA model that kept the base at full precision.

Bottom line: Quantization in QLoRA is a brilliant trade-off: huge memory savings for a tiny performance hit. But that tiny hit is a slight uniform erosion of the model’s knowledge fidelity. When combined with a strong fine-tune on a narrow task, it can contribute to that “general ability damage”. It’s as if the model’s knowledge base had a mild blur applied, and then the fine-tuning sharply etched a new image on top of it. The end result is a model highly focused on the new image (task) with the old background a bit faded. A minimal QLoRA setup sketch follows.
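
The sketch uses transformers, peft, and bitsandbytes; the model name is a placeholder, and the hyperparameters are illustrative defaults rather than recommendations from the QLoRA paper.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base model to 4-bit
    bnb_4bit_quant_type="nf4",              # the NormalFloat scheme from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # adapters and activations stay in higher precision
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # gradient checkpointing, dtype fixes for non-quantized params

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections; see 5.2 for a narrower choice
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA matrices train; the 4-bit base stays frozen
```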

4. Does Model Size or Architecture Help? (Dense vs. MoE vs. Hybrid)

Model size and architecture raise an interesting question. Intuitively, a larger model has more parameters (more “brain capacity”), and one might hope it can absorb a new task without forgetting the old. In practice, it’s complicated:

4.1 Model Size and Forgetting

  • Forgetting might increase with model size: As noted, within the range of 1B to 7B, larger models actually showed more forgetting[1]. Why? The authors hypothesize it’s because those larger models simply had much better initial performance, so there was more to lose[12]. When fine-tuned to the same small dataset, the 7B model’s accuracy on original tasks dropped more sharply than a 1.3B model’s did[1]. The big model essentially over-specialized more aggressively because it could model the fine-tune data extremely well (potentially even memorizing it), which is the flip side of having high capacity.
  • Does this trend continue to 70B or 170B? Larger models like LLaMA-65B or GPT-3.5 (175B) start off with even more robust knowledge and abilities. Empirically, they tend to need less relative fine-tuning to achieve good results on a new task (often just a few epochs of instruction tuning). If fine-tuned carefully, a 70B model might retain a lot of its general ability, but if overcooked on a narrow dataset, even a GPT-4-sized model will forget or distort knowledge (it might start answering everything in the style of the fine-tune data, for example). So, model size is not a silver bullet. Larger models have more redundancy, which could help them encode new tasks without erasing old ones, but they also can overfit more parameters if allowed. In the absence of special precautions, bigger models can and do lose generality from narrow fine-tuning, just as smaller ones do.
  • Hallucinations vs. size: Larger models (32B+) generally have lower base hallucination rates because they know more facts. However, one might expect that if a large model does forget facts due to fine-tuning, it could hallucinate even more confidently (since it used to know something and now the wires are crossed). On the flip side, a small model starts out with more limited knowledge (prone to hallucination), but fine-tuning it on a factual domain might actually improve its accuracy on that domain. In summary, size alone doesn’t decide the forgetting or hallucination outcome – it’s size relative to the fine-tune data and how fine-tuning is done.

4.2 Dense vs. Mixture-of-Experts (MoE) Models

Mixture-of-Experts models introduce a different architecture: instead of a single monolithic transformer, they have multiple “experts” (sub-networks) and a routing mechanism that activates only a few experts per input. Notable open MoE LLMs include some of the IBM Granite models. So the question is: are MoE or hybrid models more robust against fine-tuning damage?

  • MoE basics: In an MoE layer, a router learns to send each token (or each input) to, say, 2 out of N expert feed-forward networks. This means at runtime, only a fraction of the model’s parameters are used for any given input[13]. The model capacity (number of parameters) can be huge, but the computational load per token remains manageable because of the sparsity (only the top experts “fire” for that token). For example, IBM’s Granite-3.0 release included MoE variants like Granite-3B-A800M, which has 3B total parameters but only ~800M active per token during inference[14]. The MoE’s advantage is efficiency, you can have specialized experts and scale the model’s knowledge without scaling latency linearly.
  • Forgetting in MoE: Intuitively, MoEs could help mitigate forgetting if used cleverly. Because different experts can specialize in different domains, one idea is to fine-tune only certain experts on new data while leaving others untouched. In an ideal scenario, you could have one expert learn the new task (becoming a specialist) and the router learn to send relevant queries to it, while other experts continue to handle general queries. This way, old knowledge (stored in other experts) isn’t overwritten at all, it’s just bypassed when the new domain is in play. In practice, naive fine-tuning of an MoE doesn’t automatically do this isolation; all experts that get activated by the new data will get tuned, and the router will also adjust, which can cause interference across experts.
  • Research on MoE adaptation: There is promising research on expert-specialized fine-tuning. A 2025 paper introduced Dynamic Expert Specialization (DES-MoE) for multi-domain learning, where they actively isolate domain-specific gradients to specific experts and freeze others[15]. They use an adaptive router and a phased training schedule to minimize cross-domain interference. The result? In multi-domain experiments, they achieved an 89% reduction in forgetting compared to fully fine-tuning an equivalent dense model[16]. Essentially, by letting each expert “stick to what it’s good at” and not constantly overwriting experts, the MoE was able to learn new domains with far less catastrophic forgetting. This validates the idea that MoEs can be made more robust to forgetting, but it requires careful training algorithms. Off-the-shelf MoE models will still forget if you fine-tune them without such strategies.
  • IBM Granite MoE experience: IBM’s Granite 3.x series included both dense and MoE models. Their focus for MoE was primarily on latency and scaling (serving a bigger model in pieces)[14]. There isn’t public data suggesting that, say, Granite-3B-MoE forgets any less than a dense 8B when fine-tuned normally. However, the potential is there: an enterprise could choose to fine-tune only a subset of experts on proprietary data, effectively adding a new expert for new knowledge (some recent works call this “expert expansion” or MoExtend, adding experts for new tasks without touching old ones[17]).

Summary (Dense vs MoE): A dense model (like LLaMA, Qwen, original GPTs) has all weights intermingled, so any weight update can affect any capability – making it prone to entangled forgetting. An MoE model has architectural modularity (separate experts), which gives a chance to localize changes. If fine-tuning is constrained to certain experts or uses domain routing, MoE can significantly reduce forgetting[15]. But if fine-tuning freely updates everything, an MoE is not immune to forgetting either; interference can still occur through the router or shared parts. It’s not inherently more robust unless you leverage its structure. A toy routing sketch follows to make the expert-isolation idea concrete.
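
The layer below is a toy, illustrative top-2 MoE feed-forward block in PyTorch (dimensions and expert count are arbitrary, and this is not IBM Granite’s implementation). The point is structural: each expert is a separate sub-network, so you can freeze all but one of them and confine fine-tuning updates to that single expert, which is the modularity that expert-isolation methods like DES-MoE exploit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Minimal top-2 mixture-of-experts feed-forward layer (illustrative only)."""
    def __init__(self, d_model: int = 256, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (tokens, d_model)
        scores = self.router(x)                                 # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)          # route each token to its top-2 experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                        # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# "Expert isolation" in a few lines: train only the last expert and freeze the rest
# (a real recipe, such as DES-MoE, would also manage the router carefully).
layer = ToyMoELayer()
for i, expert in enumerate(layer.experts):
    for p in expert.parameters():
        p.requires_grad = (i == len(layer.experts) - 1)
```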

4.3 Hybrid Architectures (e.g. IBM Granite with Mamba Layers)

Newer architectures like IBM’s Granite 4.0 introduce a hybrid of transformer and alternative layers. Granite 4 uses a mix of standard self-attention blocks and Mamba-2 state-space blocks[18]. Mamba-2 is a type of structured state-space model (SSM) layer, essentially a fancy recurrent layer that can handle long sequences efficiently (NVIDIA and Princeton researchers have been advancing this, showing SSMs can match transformer quality at smaller scales[19]). In Granite 4, roughly 90% of the layers are Mamba-2 (SSM) and 10% are transformers[20]. They also include MoE in some variants (the “H-Tiny” and “H-Small” models have MoE experts too)[13]. This hybrid approach gives massive efficiency gains (they report 70% less memory usage for long contexts) and very strong performance for its size[21][22].

  • Are hybrid models easier to fine-tune? There isn’t yet public research on whether a hybrid SSM/Transformer forgets less or more than a pure Transformer. In principle, fine-tuning works similarly: you can apply LoRA or other methods to those layers. The Mamba (SSM) layers are linear state-space operators; fine-tuning them will change how the model handles sequential information and intermediate representations. They don’t have an attention mechanism, but they carry long-range context through latent states. It’s conceivable that an SSM layer might store information differently (more distributed over time perhaps) which could make it a bit more robust to small-data overfitting, but that’s speculative. At the end of the day, a hybrid model is still an LLM with a lot of parameters that can be overfitted if one isn’t careful.
  • IBM’s approach with Granite 4: IBM seems aware that architecture alone isn’t enough; they pair these innovations with careful fine-tuning regimens. In Granite 3.0 and 3.1 (dense and MoE models), IBM provided a toolkit called InstructLab to fine-tune models with synthetic data and phased training[23]. The goal was to augment the model’s skills without damaging its core. By Granite 4, IBM explicitly used their LAB (Large-scale Alignment for Bots) method to inject new capabilities without overwriting old ones[24][25]. They generate a taxonomy of skills and systematically create instruction data for knowledge, skills, and compositional skills, then train in stages. Importantly, they use replay buffers during fine-tuning, meaning they reintroduce a small subset of earlier training data at later steps to reinforce what was learned first[26]. They reported that using a low learning rate, an extended warm-up, and replaying earlier data at the end prevented the model from overwriting what it learned before[26]. This is effectively combating catastrophic forgetting via training strategy, not just architecture.
  • MoE + Hybrid in Granite 4: Granite 4’s high-end models (H-Tiny, H-Small) are hybrid + MoE[13][27]. In theory, these could be even more amenable to targeted fine-tuning: you have SSM layers (fast and maybe easier to fine-tune some sequence behavior) and MoE experts (which you could selectively fine-tune or add experts). IBM hasn’t publicly broken down forgetting on these, but they do claim Granite 4 models excel at multi-domain tasks and maintain efficiency[28], which suggests they handle new instructions without losing prior abilities (given their high HELM benchmark scores in many areas[29]). It’s likely that their fine-tuning recipe (synthetic data covering broad skills + cautious training) contributes a lot to this robustness.

Dense vs MoE vs Hybrid Recap: No architecture is inherently immune to the pitfalls of fine-tuning, but architectures with modular components (MoE experts, or even the combination of different layer types) offer hooks that clever training algorithms can use to reduce forgetting. A dense model has no explicit modules, so mitigation has to come from how you fine-tune (or by adding external adapters). An MoE can compartmentalize knowledge if you train it to do so. A hybrid (like Mamba+Transformer+MoE) has the benefits of MoE plus the efficiency of SSMs, but it will still need careful fine-tuning to avoid specializing too hard. Newer architectures (IBM Granite, Alibaba’s Qwen, Meta’s LLaMA series, etc.) all face the same fundamental challenge: without careful fine-tuning, they will likely lose some generality when specialized. The difference is how easily we can apply solutions to them – e.g., adding a new expert to an MoE vs. adding new LoRA modules to a dense model are analogous strategies.

5. Strategies to Mitigate Fine-Tuning Damage

The good news is that the research community (and practitioners) are actively developing solutions for these issues. These range from older, simple tricks to novel 2024–2025 techniques. Let’s survey the landscape:

  • 5.1 Low Learning Rates & Partial Freezing: The simplest safeguard is to fine-tune gently. Use a lower learning rate, shorter training, and possibly freeze some layers. This minimizes how much you overwrite the original weights. For example, many instruction-tuning efforts found that you only need a few epochs with a low LR to align a model; any more and you start degrading quality on other tasks. IBM’s LAB method explicitly uses a low LR and extended warm-up to slowly incorporate new knowledge[26]. Partial freezing is another heuristic – e.g., only fine-tuning the last few layers or only certain weights. However, full freezing can limit how well you adapt. Hence the popularity of Adapters and LoRA in the first place: they let the model learn new things with minimal weight change to the core model.
  • 5.2 Targeted LoRA (Q/V LoRA): A practical tip from open-source fine-tuners: instead of applying LoRA to all weight matrices in a transformer layer, apply it only to the attention’s query and value projections (Q and V) and perhaps key projections, but not to every feed-forward. This “Q/V-only” strategy has been found to preserve general model behavior better for small datasets (see the config sketch after this list). As one practitioner put it, training all projection matrices on a tiny dataset can quickly “dumb down” the model by overwriting its broad knowledge with narrow patterns. But training only Q and V “nudges the model’s attention mechanisms without completely rewiring its core reasoning,” thus preserving its general smartness while still teaching the new task[30]. Essentially, the model learns where to attend differently for the new task, but you don’t overhaul the entire network. Many Axolotl/LoRA config presets in the community now use this trick for small-scale fine-tunes.
  • 5.3 Regularization and Penalty Methods: In classic continual learning, techniques like EWC (Elastic Weight Consolidation) or L2 regularization encourage the model to not stray too far from its original weights on important dimensions. Applying a penalty for deviating from the pre-trained weights can reduce forgetting. Some research has tried this for LLMs, but the scale makes it tricky (you need to know which of the billions of weights are “important”). More feasible are methods like knowledge distillation: you generate outputs from the original model (or a broader model) for some prompts and ensure the fine-tuned model doesn’t deviate on those. For example, you could sample a set of general questions or tasks and periodically train the model to match the base model’s answers on those effectively rehearsing the base knowledge. This is rarely done explicitly (because it’s hard to choose a representative set of “everything else” the model knew), but it’s conceptually sound.
  • 5.4 Data Augmentation and Rehearsal: A very active area now is using synthetic data or mixed data to avoid overfitting solely to the new task. We saw IBM’s approach: they generate a taxonomy of skills and create synthetic instruction data not just for the narrow task, but for all prerequisite and related skills[31][32]. They then fine-tune in phases: first on fundamental knowledge and skills, then on the specific task, and finally they reinject some earlier data at the end to remind the model of the basics[26]. This phased training with replay greatly reduces catastrophic overwriting: essentially, the model rehearses general skills alongside learning the niche ones. Another example is the Self-Synthesized Rehearsal (SSR) framework from a recent ACL paper: it generates synthetic QA pairs from the base model itself to “rehearse” previous knowledge during fine-tuning[33][34]. In SSR, at each step of a multi-step fine-tune, they use the original model (still saved) to produce answers for prompts from the previous task, refine those answers with the current model, and add a selection of these Q&A pairs to the training mix[33][34]. This acted as pseudo-rehearsal data to retain knowledge without needing the original massive pre-training corpus. These methods echo a classic idea: don’t train on the new task alone. Always mix in either some amount of original data or synthetic approximations of it. Even a small percentage of such data can make a big difference in maintaining previous capabilities (a minimal data-mixing sketch follows this list).
  • 5.5 Multi-Task and Broad Instruction Tuning: Another observation is that if a model is first trained to be a general instruction follower (covering a wide variety of tasks), it becomes more resilient to forgetting when later fine-tuned on a specific task. For instance, LLaMA vs Alpaca: Alpaca is LLaMA 7B instruction-tuned on a diverse set of prompts. A study found that Alpaca retained more general knowledge during further fine-tuning than the original LLaMA did[35]. Essentially, the initial broad instruction fine-tune “uses up” some capacity to set the model into a generally aligned state, and that process may mitigate later forgetting (perhaps because the model already had alignment vectors that can accommodate new tasks without shifting everything else). Similarly, some industry practitioners fine-tune on a mix of tasks at once, e.g., combining your domain data with some generic tasks or conversations to ensure the model doesn’t veer too far. This is a bit of an art, since mixing data can dilute performance on the target task if not balanced well.
  • 5.6 Advanced Module Fusion (Superposition): A cutting-edge 2025 approach is “superposition”: essentially merging two models (the base and a fine-tuned specialist) into one without loss. A recent paper demonstrated a technique to blend the hidden states of a base model and a fine-tuned model via learnable B-spline coefficients and autoencoders inserted at intervals[36][37]. The concept is that instead of modifying the base weights, you train small blending networks that can toggle the model between modes (or combinations of modes) on the fly. In their experiment, they merged a GPT-2 with its fine-tuned French version. The result was a single model that could handle both English (original task) and French (new task) without forgetting either[38]. The merged model retained performance on English while gaining French, something that standard fine-tuning could not achieve (normally GPT-2 fine-tuned on French would start doing poorly in English). This “superposition” technique is like having two minds in one skull, and a switch that routes to the appropriate mind as needed. It achieved perplexity and accuracy improvements over naive interpolation of weights, and produced polysemantic neurons that responded to concepts in both languages simultaneously[38][39]. While this is still early research (and computationally heavy, since you need to train autoencoders and keep two model copies), it points to a future where we might merge expert models without catastrophic interference. Think of it as a very principled MoE with just two experts (in general, it could be multiple experts). This could be very useful if you want to fine-tune a base model on multiple domains separately and then combine them into one super-model that encapsulates all domains. It’s arguably one of the more exciting long-term solutions to the specialization vs. generality problem.
  • 5.7 Expert Expansion and Modular Learning: Another approach for MoE-based LLMs is to add new experts for new tasks instead of altering the existing experts. Google’s recent work with Switch Transformers hinted at this; you can train a new expert on new data while freezing the rest, and update the router to include it. There’s also a technique called “MoExtend” (from 2024) that adds experts for new modalities or tasks instead of fine-tuning the whole model[40]. By not touching the original experts, you avoid forgetting, at the expense of a larger model. The router learns to dispatch appropriate inputs to the new expert. This is like adding a new specialist to a team rather than retraining the whole team. The challenge is keeping the router efficient and not blowing up model size too much. But it’s a promising direction for modular lifelong learning.
  • 5.8 Rank-/Space-Stabilization: Within LoRA-based methods, researchers have looked at ways to make LoRA updates less disruptive. One idea is rank stabilization (Kalajdzievski et al., 2023) – basically re-orthogonalizing or constraining the LoRA updates so they don’t introduce those intruder dimensions as much. Another idea is to allow the rank to grow when needed (some adaptive LoRA schemes) so the model can represent new info without twisting existing directions. These are more technical, but the gist is to ensure the solution space of LoRA stays closer to the original model’s space. The previously cited work noted that if you push LoRA to full-rank (i.e. no rank bottleneck), it ends up behaving like full fine-tuning and also will forget if overfit[9]. So pure capacity isn’t the only issue, it’s how you constrain that capacity. The ideal seems to be: use as low a rank as possible to fit the task (for efficiency), but not so low that you spawn entirely new directions in weight space. How to do that optimally is an open research question.
  • 5.9 Community best practices: From HuggingFace forums, GitHub discussions, etc., a few practical tips commonly surface: (a) Don’t fine-tune for too long. Watch validation performance on some broader metrics and stop early if you see general perplexity start rising. The “Does fine-tuning encourage hallucinations?” paper actually showed that the worst hallucination issues emerged in later-stage training when the model had fully fitted new facts[41]. Early stopping, as soon as the model has mostly learned the task but not yet memorized every last detail, can help preserve pre-trained knowledge[42]. (b) Use high-quality, not-too-narrow data. If your fine-tuning data is extremely narrow or all in one format, the model will over-optimize for that format. Sometimes adding just a bit of variety (even unsupervised data or some Q&A from other domains) regularizes the fine-tune. (c) Monitor outputs on general prompts – e.g., after each epoch, test the model on some unrelated questions or a common sense task. If you see a drastic change (like the model starts answering everything in a formulaic way), that’s a warning sign that forgetting is happening. You might then reduce learning rate or mix in some original data for a bit.
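
Below is a hedged sketch of the targeted-LoRA idea from 5.2, using the peft library’s LoraConfig. The module names match LLaMA-style checkpoints (other architectures use different names), and the ranks shown are illustrative rather than recommended values.

```python
from peft import LoraConfig

# Aggressive: adapters on every projection. Fits a tiny dataset quickly, but is
# more likely to overwrite broad behavior (the "dumbing down" effect described above).
all_projections = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Conservative: Q and V only. Nudges where the model attends while leaving the
# feed-forward pathways (and most of its stored "knowledge") untouched.
qv_only = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)
```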
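
And a minimal sketch of the rehearsal/replay idea from 5.4: keep a small pool of general-purpose examples and mix a few percent of them into every training epoch. The 5% default is illustrative, not a figure from the cited papers.

```python
import random

def mix_with_replay(task_examples: list, general_examples: list,
                    replay_fraction: float = 0.05, seed: int = 0) -> list:
    """Return the task data plus a sampled slice of general data, shuffled together."""
    rng = random.Random(seed)
    n_replay = min(int(len(task_examples) * replay_fraction), len(general_examples))
    mixed = task_examples + rng.sample(general_examples, n_replay)
    rng.shuffle(mixed)
    return mixed

# Usage: rebuild the training list each epoch so a fresh replay sample is drawn
# every time, e.g. mix_with_replay(domain_qa_pairs, general_instruction_pairs, seed=epoch).
```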

In summary, while catastrophic forgetting and loss of generalization are real challenges, there is a growing toolkit to tackle them. From straightforward methods like mixing data and limiting what you fine-tune, to sophisticated approaches like model merging and expert routing, the trend is towards fine-tuning strategies that maintain a model’s breadth. It’s an active area of research in 2024–2025, precisely because we want LLMs to be both expert specialists and reliable generalists at the same time.

6. Fine-Tuning and Hallucinations: Does it Help or Hurt?

Hallucination, an LLM’s tendency to produce factually incorrect or unfounded statements, is another crucial aspect, especially as it relates to model size and fine-tuning. The effect of fine-tuning on hallucinations can go both ways, and it often ties into the model’s retention of knowledge (or lack thereof):

  • When fine-tuning reduces hallucination: If you fine-tune on a domain with reliable, factual data (and especially if you use methods like RLHF or direct preference optimization to penalize incorrect outputs), you can significantly reduce hallucinations in that domain. A smaller model that is fine-tuned to be very domain-focused can outperform a larger general model on domain-specific factuality. A notable example: Intel’s Neural Chat 7B model, which is based on Mistral-7B but was fine-tuned with a technique called Direct Preference Optimization (DPO) on high-quality instruction data. It achieved just a 2.8% hallucination rate, slightly lower than even GPT-4’s ~3% on the same evaluation[43]. In Vectara’s factuality leaderboard (HHEM), this 7B model beat models tens of times its size by virtue of targeted fine-tuning[44][43]. This shows that fine-tuning can improve factual accuracy and reduce hallucinations, if the fine-tune data and objective are geared towards that. Essentially, the model was trained to say “I don’t know” or stay within provided context rather than make things up, and it learned that behavior well.
  • When fine-tuning worsens hallucination: On the flip side, a 2024 study (Gekhman et al.) found that introducing new factual knowledge via supervised fine-tuning actually increased the model’s hallucination propensity w.r.t. its original knowledge[45][41]. How so? The controlled experiments showed that when an LLM is fine-tuned on QA data containing facts it didn’t know before (closed-book style), it struggles to incorporate those facts quickly[46][41]. The model initially relies on its existing knowledge, and the new facts are learned slowly and via brute force. Once the model finally does learn those new facts (usually later in training), it tends to start over-generalizing that behavior, effectively hallucinating more. The paper reported a near-linear increase in hallucination rates as the proportion of “unknown knowledge” in the fine-tuning data went up[45]. Importantly, the worst hallucinations happened after the model had fully fitted the new data (they even observed validation accuracy peaking and then declining as the model overfit the new facts)[47][42]. The takeaway was that fine-tuning teaches the model to regurgitate the fine-tune data, even if that means overriding its internal consistency checks. If that data has new info, the model may not truly integrate it with its knowledge graph; it just learns “when asked Q, output X,” and might start outputting fact X even in inappropriate contexts. In other words, the model becomes more willing to say ungrounded facts, because it was literally trained to output things it didn’t previously “believe” (since they were new facts). This supports the view that LLMs mainly acquire factual knowledge during pre-training, and using supervised fine-tuning to add facts is risky[48]. Fine-tuning is better at aligning or formatting knowledge than at adding knowledge; if you push it to add, you might inadvertently train the model that spewing new (unverified) info is okay.
  • Model size and hallucination: As mentioned, larger models usually start with a lower hallucination rate on broad knowledge, because they simply know more and have more parameters to devote to consistency. However, if a large model is fine-tuned on a narrow dataset without regard to factuality, it can still hallucinate, sometimes even more confidently. Smaller models, while generally more prone to hallucination (due to knowledge gaps), can be made highly factual in a narrow domain through fine-tuning (as Intel’s example showed). Interestingly, one blog analysis found that hallucination isn’t strictly correlated with size: small models can achieve hallucination rates comparable to large ones when fine-tuned correctly[44]. For instance, a 7B model may hallucinate less on medical questions than a 70B model if the 7B was fine-tuned on a medical Q&A dataset and the 70B was not.
  • Hallucination vs. forgetting: There’s a connection between a model forgetting something and hallucinating an answer. If you ask a fine-tuned model about something it used to know but that wasn’t reinforced in fine-tuning, it might not recall it perfectly and instead fabricate an answer (hallucinate). If the base knowledge is erased or buried (due to forgetting), the model has no choice but to guess based on patterns, which often yields a hallucination. Therefore, strategies that prevent forgetting (section 5) also help reduce hallucinations on general knowledge, because the model retains factual info and doesn’t need to guess. Conversely, if fine-tuning narrowed the model’s world, anything outside that world invites hallucination.
  • Fine-tuning with context vs. without: It’s worth noting that fine-tuning in a retrieval-augmented setup (where the model is trained to use provided documents) can reduce hallucination, because the model learns to base answers on context. But that’s a different scenario (that’s more like teaching the model to cite sources or to abstain if unsure). Pure fine-tuning without retrieval, as we’re discussing, tends to either instill knowledge or instill style. If it instills a style that favors verbosity or user-pleasing answers, it might actually increase hallucinations (since the model learns to never say “I don’t know”). Some fine-tuning datasets inadvertently do this e.g., if assistants are always required to answer, the model learns to produce an answer even if it’s speculative. Best practice to reduce hallucination is to include examples where the correct behavior is to refuse or indicate uncertainty when appropriate. If your fine-tuning data lacks that, hallucination can indeed worsen.

In summary on hallucinations: Fine-tuning can either ameliorate or exacerbate hallucinations depending on how it’s done. If your fine-tuning focus is factual correctness (and you carefully avoid overriding the model’s base knowledge), you can craft a smaller model that’s very grounded (e.g. fine-tuning on high-quality Q&A with a preference model to discourage made-up answers). But if fine-tuning is used to force-feed new facts or if it over-optimizes on a narrow distribution, the model might lose trust in its own knowledge and hallucinate more whenever it’s outside its fine-tuned comfort zone[48]. Model size doesn’t immunize against this: although larger models have more inherent knowledge, they also have more capacity to hallucinate in complex ways. The key is maintaining the right balance of knowledge and uncertainty in the model. Ensuring that fine-tuning doesn’t destroy the model’s calibrated knowledge (what it knows vs. doesn’t know) is crucial. That’s why techniques like early stopping when adding new knowledge, or filtering fine-tune examples that conflict with the model’s knowledge, have been recommended[49]. For instance, that hallucination study suggests stopping fine-tuning before the model memorizes all new facts, to minimize the hallucination increase[42].
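
To close the loop with something actionable, here is a hedged sketch of that early-stopping advice using the transformers Trainer: evaluate on a validation set that mixes task examples with general-knowledge probes, keep the best checkpoint, and stop once the metric stops improving. Argument names follow recent transformers releases and may differ slightly in older ones; the model and datasets are placeholders (e.g. the QLoRA model from section 3’s sketch).

```python
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=5,
    learning_rate=1e-4,
    eval_strategy="steps",             # evaluate periodically on the mixed validation set
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,       # roll back to the checkpoint before degradation began
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,                       # placeholder: e.g. the PEFT/QLoRA model built earlier
    args=args,
    train_dataset=train_dataset,       # placeholder: task data (plus replay, section 5.4)
    eval_dataset=eval_dataset,         # placeholder: task examples + general-knowledge probes
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 evals without improvement
)
trainer.train()
```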


Conclusion: Fine-tuning LLMs, whether via full model updates or efficient methods like QLoRA, is delicate surgery on a very complex system. At a neural level, we see how low-rank adaptations can twist the model’s representation (intruder dimensions) and how quantization adds a faint haze to its memory. Model size and fancy architectures (dense vs. MoE vs. hybrid) don’t remove the fundamental issue: specializing a model tends to narrow it. Newer hybrid models like IBM’s Granite 4 series introduce tools (MoE experts, state-space layers) that, combined with careful fine-tuning regimes, can alleviate some problems. Indeed, IBM’s recent models leverage synthetic data generation, phased training, and expert routing to achieve strong specialization while preserving general capabilities[26][15]. The state of research in 2025 is encouraging, from advanced blending techniques (superposition) to dynamic expert models, pointing toward a future where we can have “the best of both worlds”: models that learn new tasks without forgetting old ones.

But as of now, the practical advice is: fine-tune with caution. Monitor for forgetting, use the smallest effective change (LoRA rank or number of trainable parameters) needed for your task, and consider mixing in data or using multi-task objectives to keep the model well-rounded. And if you have the resources, keep an eye on the latest papers and library updates, new methods to tackle these issues are coming out almost monthly (many accompanied by GitHub repos or HuggingFace examples), which can help your LLM maintain its general intelligence even as it acquires new expertise.

Sources: The insights above were drawn from a mix of recent research papers, blog analyses, and community discussions. Key references include: the 2024 “LoRA vs Full Fine-Tuning: An Illusion of Equivalence” study[5][2], Luo et al. (2025) on catastrophic forgetting in LLMs[1][12], Li et al. (2025) on MoE adaptation without forgetting[15], IBM’s Granite 4 model announcements and LAB method description[26][23], Gekhman et al. (2024) on fine-tuning and hallucinations[45][41], as well as practical tuning tips from the open-source community[30] and cutting-edge proposals like Superposition for model merging without interference[38]. Each of these sheds light on a piece of the puzzle in maintaining LLM generalization amid specialization. Together, they map out the current landscape of challenges and solutions.


[1] [3] [4] [12] [35] [2308.08747] An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

https://ar5iv.labs.arxiv.org/html/2308.08747v5

[2] [5] [6] [7] [8] [9] LoRA vs Full Fine-tuning: An Illusion of Equivalence

https://arxiv.org/html/2410.21228v1

[10] [2305.14314] QLoRA: Efficient Finetuning of Quantized LLMs - arXiv

https://arxiv.org/abs/2305.14314

[11] Maximizing Efficiency: Fine‑Tuning Large Language Models with LoRA and QLoRA on Runpod

https://www.runpod.io/articles/guides/maximizing-efficiency-fine-tuning-large-language-models-with-lora-and-qlora-on-runpod

[13] [18] [20] [21] [22] [27] [28] [29] IBM Granite 4: Deep Dive Into the Hybrid Mamba/Transformer LLM Family | by Sai Dheeraj Gummadi | Data Science in Your Pocket | Oct, 2025 | Medium

https://medium.com/data-science-in-your-pocket/ibm-granite-4-deep-dive-into-the-hybrid-mamba-transformer-llm-family-c5d01978f27a

[14] [23] IBM Granite 3.0: open, state-of-the-art enterprise models

https://www.ibm.com/new/announcements/ibm-granite-3-0-open-state-of-the-art-enterprise-models

[15] [16] [2509.16882] Dynamic Expert Specialization: Towards Catastrophic Forgetting-Free Multi-Domain MoE Adaptation

https://arxiv.org/abs/2509.16882

[17] Towards Catastrophic Forgetting-Free Multi-Domain MoE Adaptation

https://www.researchgate.net/publication/397423023_Dynamic_Expert_Specialization_Towards_Catastrophic_Forgetting-Free_Multi-Domain_MoE_Adaptation

[19] Transformers are SSMs: Generalized Models and Efficient ... - arXiv

https://arxiv.org/abs/2405.21060

[24] [25] [26] [31] [32] Synthetic training data for LLMs - IBM Research

https://research.ibm.com/blog/LLM-generated-data

[30] LoRA/QLoRA: The most significant training parameters that affect the VRAM (Axolotl) : r/LocalLLaMA

https://www.reddit.com/r/LocalLLaMA/comments/1od0gw9/loraqlora_the_most_significant_training/

[33] [34] Self-Synthesized Rehearsal (SSR), ACL 2024 paper

https://aclanthology.org/2024.acl-long.77.pdf

[36] [37] [38] [39] Forget Catastrophic Forgetting: Superposition is Here to Revolutionize How We Fine-Tune LLMs | by Ayoub Ben Chaliah | Medium

https://benchaliah.medium.com/forget-catastrophic-forgetting-superposition-is-here-to-revolutionize-how-we-fine-tune-llms-b8c38f5e2986

[40] MoExtend: Tuning New Experts for Modality and Task Extension

https://arxiv.org/abs/2408.03511

[41] [42] [45] [46] [47] [48] [49] [2405.05904] Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

https://ar5iv.org/pdf/2405.05904v2

[43] [44] Do Smaller Models Hallucinate More?

https://www.intel.com/content/www/us/en/developer/articles/technical/do-smaller-models-hallucinate-more.html
