trollek / ninjamouse2

![](https://huggingface.co/trollek/NinjaMouse2-2.5B-v0.1/resolve/main/ninjamouse2.jpeg)

This model is a block-expanded danube2, built with the Llama Pro method of training (or fine-tuning) only the expanded blocks. To do this on limited hardware I had to expand by 2 layers per step, from the original 24 to 32. At least, that was the original plan. With the 32-layer model I used BAdam to do a "once over" with most of the datasets I had also used to expand the model. While it is a faux full fine-tune, it isn't really that different from the Llama Pro method, i.e. layerwise insertion of data.
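To make the expansion step a bit more concrete, here is a minimal sketch in plain Python of how one step could be laid out. It assumes the even spacing described in the Llama Pro paper (split the stack into equal groups, append a copy after each group, and start the copy as an identity block by zeroing its output projections). The exact placement used for NinjaMouse2 is not stated here, so `expand_layout`, `trainable_indices`, and `is_trainable` are hypothetical helpers, and the `model.layers.N.` naming is an assumption about the checkpoint's parameter names.

```python
def expand_layout(num_layers: int, num_new: int) -> list[tuple[int, bool]]:
    """Return (source_layer, is_new) for every layer of the expanded stack.

    Assumption: the original layers are split into `num_new` equal groups and
    a copy of the last layer in each group is placed right after that group.
    """
    if num_new <= 0 or num_layers % num_new != 0:
        raise ValueError("num_layers must be divisible by num_new")
    group = num_layers // num_new
    layout = []
    for src in range(num_layers):
        layout.append((src, False))      # original layer, stays frozen
        if (src + 1) % group == 0:
            layout.append((src, True))   # inserted copy, the only part trained
    return layout


def trainable_indices(layout: list[tuple[int, bool]]) -> list[int]:
    """Indices (0-based, in the expanded model) of the inserted layers."""
    return [i for i, (_, is_new) in enumerate(layout) if is_new]


def is_trainable(param_name: str, new_layers: set[int]) -> bool:
    """True if a parameter such as 'model.layers.12.mlp.up_proj.weight'
    belongs to one of the inserted layers. Everything else stays frozen."""
    parts = param_name.split(".")
    if "layers" not in parts:
        return False                     # embeddings, final norm, lm_head
    pos = parts.index("layers") + 1
    return pos < len(parts) and parts[pos].isdigit() and int(parts[pos]) in new_layers


# One expansion step: 24 -> 26 layers.
layout = expand_layout(24, 2)
new_layers = set(trainable_indices(layout))
```

Repeating the step on the result (26, then 28, then 30 layers) reaches 32. In a real training script the `is_trainable` check would decide which parameters get `requires_grad = True`.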

I have a feeling that Llama3 and other well-trained models feel better because of markdown (formatting), personality (friendliness), and prompt compliance (preferenceness.. I guess). Thus I have used Llama3 8B, [WizardLM2](https://huggingface.co/bartowski/WizardLM-2-7B-GGUF), and [Hermes 2 Pro Mistral](https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B) to generate training data for this model.

To ensure that the full 8k context window could be utilised this time, I filtered openhermes, Synthia, LongAlpaca, and MathInstruct for entries with a token count between 2k and 8k, and used them to DoRA, QLoRA, and BAdam the context window into submission. One of those runs even had `lm_head` as an additional target, and two had `embed_tokens`.
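The length filter could look roughly like the sketch below. It is not the script used for this model: the field names (`instruction`, `output`), the bounds 2048 and 8192 as readings of "2k" and "8k", and the whitespace counter are all assumptions. In practice `count_tokens` would wrap the model's own tokenizer.

```python
from typing import Callable, Iterable, Iterator


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def filter_by_length(
    records: Iterable[dict],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    min_tokens: int = 2048,   # assumed meaning of "2k"
    max_tokens: int = 8192,   # assumed meaning of "8k"
    fields: tuple[str, ...] = ("instruction", "output"),  # hypothetical field names
) -> Iterator[dict]:
    """Yield only the records whose combined field length is within bounds."""
    for record in records:
        total = sum(count_tokens(str(record.get(field, ""))) for field in fields)
        if min_tokens <= total <= max_tokens:
            yield record


# Tiny synthetic example: one entry too short, one in range, one too long.
samples = [
    {"instruction": "hi", "output": "word " * 10},
    {"instruction": "summarise this", "output": "word " * 3000},
    {"instruction": "summarise that", "output": "word " * 9000},
]
kept = list(filter_by_length(samples))
```

Passing the counter in as a function keeps the filter independent of any particular tokenizer library.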

The astute among you may notice the extra special tokens, like the FIM and thought tokens. NinjaMouse has not been trained to use those... yet! Also: this is actually 34 layers. Surprise!

Here's the thing with the 2 extra layers compared to my first model. When I trained NinjaMouse2 with 32 layers I noticed that the `grad_norm` value would behave strangely on layers 3 and 27. Layer 27 used to be the last layer before the expansion, while 3 is a mystery. I decided to use [mergekit](https://github.com/arcee-ai/mergekit) to copy layer 3 and insert it beside the original, and to copy layer 27 and insert it at the end, or top, depending on your perspective (the new layer 33; all indices are 0-based).
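The index bookkeeping for that duplication is easy to get wrong, so here is a small sketch of it in plain Python. It does not call mergekit; `duplicate_layers` and `to_ranges` are hypothetical helpers that only work out which source layer ends up at which position, and which half-open layer ranges a passthrough-style merge would need to stitch together.

```python
def duplicate_layers(num_layers: int, beside: set[int], at_end: list[int]) -> list[int]:
    """Return the source layer index for each position of the new stack.

    Layers in `beside` get a copy directly after the original; layers in
    `at_end` get a copy appended after the last layer.
    """
    order = []
    for i in range(num_layers):
        order.append(i)
        if i in beside:
            order.append(i)   # the copy sits right next to the original
    order.extend(at_end)      # copies placed at the end (or top) of the stack
    return order


def to_ranges(order: list[int]) -> list[tuple[int, int]]:
    """Compress a non-empty source order into half-open (start, end) runs."""
    ranges = []
    start = prev = order[0]
    for src in order[1:]:
        if src != prev + 1:   # a run of consecutive layers ends here
            ranges.append((start, prev + 1))
            start = src
        prev = src
    ranges.append((start, prev + 1))
    return ranges


order = duplicate_layers(32, beside={3}, at_end=[27])
ranges = to_ranges(order)
```

With these inputs the copy of layer 3 lands at index 4 and the copy of layer 27 at index 33, for 34 layers in total. As far as I understand mergekit's passthrough merges, each tuple in `ranges` would correspond to one `layer_range` entry in the config, but the actual config used for this model is not shown here.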
