A tiny language model based on h2o-danube2-1.8b-chat. Block expanded.

3B

This model is a block-expanded danube2, built with the Llama Pro method of training (or fine-tuning) only the expanded blocks. To do this on limited hardware I had to expand by 2 layers per step, from the original 24 to 32. At least, that was the original plan. With the 32-layer model I used BAdam to do a “once over” with most of the datasets I also used to expand the model. While it is a faux full fine-tune, it isn’t really that different from the Llama Pro method, i.e. layerwise insertion of data.
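To make the arithmetic of the stepwise expansion concrete, here is a minimal sketch in plain Python. The function names are hypothetical, and so is the placement of the new blocks (appended at the end of each step); the README does not say where they were actually inserted. It only shows the layer counts per step and which blocks would be left trainable while the rest stay frozen.

```python
def expansion_schedule(start_layers=24, target_layers=32, step=2):
    """Return the layer count after each expansion step (start included)."""
    if (target_layers - start_layers) % step != 0:
        raise ValueError("target must be reachable in whole steps")
    return list(range(start_layers, target_layers + 1, step))


def trainable_mask(n_layers, new_indices):
    """True for freshly added blocks (trained), False for original blocks (frozen)."""
    new = set(new_indices)
    return [i in new for i in range(n_layers)]


schedule = expansion_schedule()

# Assumption for illustration only: the first step appends its two new blocks
# at the end of the stack, giving indices 24 and 25 in a 26-layer model.
mask = trainable_mask(26, new_indices=[24, 25])

print(schedule)
print(sum(mask), "of", len(mask), "blocks trainable")
```

In a real training run the mask would decide which blocks have gradients enabled; everything else stays untouched, which is what keeps the memory footprint small enough for limited hardware.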

I have a feeling that Llama3 and other well-trained models feel better because of markdown (formatting), personality (friendliness), and prompt compliance (preferenceness.. I guess). Thus I have used Llama3 8B, WizardLM2, and Hermes 2 Pro Mistral to generate training data for this model.

To ensure that the full 8k context window could be utilised this time, I filtered openhermes, Synthia, LongAlpaca, and MathInstruct for entries with a token count between 2k and 8k, to DoRA, QLoRA, and BAdam the context window into submission. Once, elsewhere, even with lm_head as an additional target, and twice with embed_tokens.
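The length filter described above could look roughly like this. It is a sketch, not the script that was actually used: the function names are made up, the bounds assume "2k" and "8k" mean 2048 and 8192 tokens, and the whitespace counter is only a stand-in for the model's real tokenizer.

```python
def filter_by_token_count(entries, count_tokens, min_tokens=2048, max_tokens=8192):
    """Keep entries whose token count falls within [min_tokens, max_tokens]."""
    return [e for e in entries if min_tokens <= count_tokens(e) <= max_tokens]


def whitespace_token_count(text):
    # Stand-in only: a real run would count tokens with the model's own tokenizer,
    # which will give different numbers than splitting on whitespace.
    return len(text.split())


samples = [
    "short entry",   # far below the lower bound
    "word " * 3000,  # inside the window
    "word " * 9000,  # above the upper bound
]

kept = filter_by_token_count(samples, whitespace_token_count)
print(len(kept))
```

Passing the counter in as an argument keeps the filter itself independent of any particular tokenizer library.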

The astute among you may notice the extra special tokens like the fim and thought tokens. NinjaMouse has not been trained to use those... yet! Also: this is actually 34 layers. Surprise!

Here’s the thing with the 2 extra layers compared to my first model. When I trained NinjaMouse2 with 32 layers I noticed that the grad_norm value would behave strangely on layers 3 and 27. The last layer before the expansion used to be 27, while 3 is a mystery. I decided to use mergekit to copy layer 3 and insert it beside the original, and to copy layer 27 and insert it at the end, or top, depending on your perspective (the new layer 33, all 0-indexed).
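The resulting layer order can be sketched in a few lines of plain Python. This is not the mergekit configuration itself, and the function name is hypothetical; it also assumes that "layer 3" and "layer 27" refer to 0-indexed positions in the 32-layer model.

```python
def duplicate_layers(n_layers=32, beside=3, at_end=27):
    """Return source-layer indices after copying one layer in place and one to the end."""
    order = list(range(n_layers))
    # Insert a copy of `beside` right after the original.
    order.insert(beside + 1, beside)
    # Append a copy of `at_end` as the new final layer.
    order.append(at_end)
    return order


order = duplicate_layers()

print(len(order))     # total number of layers after both copies
print(order[:6])      # the start of the stack, including the duplicated layer
print(order[-2:])     # the old last layer followed by the appended copy
```

Each entry in `order` is the index of the source layer that ends up at that position, so the list reads bottom to top through the merged model.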