A tiny language model based on h2o-danube2-1.8b-chat. Block-expanded.
This model is a block-expanded danube2, using the Llama Pro method of only training (or fine-tuning) the expanded blocks. To do this on limited hardware I had to expand by 2 layers per step, from the original 24 to 32. At least, that was the original plan. With the 32-layer model I used BAdam to do a “once over” with most of the datasets I also used to expand the model. While it is a faux full fine-tune, it isn’t really that different from the Llama Pro method, i.e. layerwise insertion of data.
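For illustration, here is a minimal sketch of the “train only the expanded blocks” idea in plain transformers: freeze everything except the inserted decoder blocks before handing the model to a trainer. The checkpoint path and layer indices are placeholders, not the ones actually used here.

```python
# Minimal sketch, assuming the block-expanded checkpoint already exists:
# freeze every parameter except those in the newly inserted decoder blocks.
# The checkpoint path and NEW_LAYER_IDXS are placeholders, not the real ones.
from transformers import AutoModelForCausalLM

NEW_LAYER_IDXS = {3, 7, 11, 15, 19, 23, 27, 31}  # hypothetical positions of the added blocks

model = AutoModelForCausalLM.from_pretrained("path/to/block-expanded-danube2")

for name, param in model.named_parameters():
    # Decoder block parameters are named like "model.layers.<idx>.self_attn.q_proj.weight"
    param.requires_grad = any(f"model.layers.{i}." in name for i in NEW_LAYER_IDXS)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```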
I have a feeling that Llama3 and other well-trained models feel better because of markdown (formatting), personality (friendliness), and prompt compliance (preference-ness... I guess). Thus I have used Llama3 8B, WizardLM2, and Hermes 2 Pro Mistral to generate training data for this model.
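As a rough sketch of that data-generation loop (not the actual pipeline), something like the following works with the Ollama Python client; the model tag, prompt, and output format are just examples.

```python
# A tiny sketch of the synthetic-data idea using the Ollama Python client:
# ask a stronger model for a response and store the pair as a training sample.
# The model tag, prompt, and output format are examples, not the actual pipeline.
import json

import ollama

prompts = ["Explain, with nice markdown formatting, how LoRA adapters work."]

with open("synthetic_data.jsonl", "w") as f:
    for prompt in prompts:
        reply = ollama.chat(model="llama3:8b",
                            messages=[{"role": "user", "content": prompt}])
        f.write(json.dumps({"instruction": prompt,
                            "response": reply["message"]["content"]}) + "\n")
```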
To ensure that the full 8k context window could be utilised this time, I filtered OpenHermes, Synthia, LongAlpaca, and MathInstruct for entries with a token count between 2k and 8k, then used DoRA, QLoRA, and BAdam to beat the context window into submission. One time, in a separate run, even with `lm_head` as an additional target, and twice with `embed_tokens`.
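Something along these lines is what the filtering and adapter setup could look like; the dataset, column layout, rank, and alpha below are assumptions rather than the exact setup used.

```python
# Illustrative only: keep samples whose token count lands between 2k and 8k,
# and set up a DoRA adapter with `lm_head`/`embed_tokens` as extra targets.
# Dataset name, column layout, rank, and alpha are assumptions, not the exact setup.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("h2oai/h2o-danube2-1.8b-chat")

def token_count(example):
    text = "\n".join(turn["value"] for turn in example["conversations"])
    return len(tokenizer(text).input_ids)

dataset = load_dataset("teknium/OpenHermes-2.5", split="train")
long_only = dataset.filter(lambda ex: 2048 <= token_count(ex) <= 8192)

peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    use_dora=True,  # DoRA instead of plain LoRA
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",
                    "lm_head", "embed_tokens"],  # the additional targets mentioned above
    task_type="CAUSAL_LM",
)
```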
The astute among you may notice the extra special tokens, like the FIM and thought tokens. NinjaMouse has not been trained to use those... yet! Also: this model actually has 34 layers. Surprise!
Here’s the thing with the 2 extra layers compared to my first model. When I trained NinjaMouse2 with 32 layers I noticed that the `grad_norm` value would behave strangely on layers 3 and 27. Layer 27 used to be the last layer before the expansion, while layer 3 is a mystery. I decided to use mergekit to copy layer 3 and insert it beside the original, and to copy layer 27 and insert it at the end, or top (the new layer 33, all 0-indexed), depending on your perspective.
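A passthrough merge in mergekit can express that layer surgery. The sketch below is a guess at the shape of such a config, with a placeholder model path; `layer_range` values are end-exclusive.

```python
# A guess at the shape of the mergekit "passthrough" config for that layer surgery:
# duplicate layer 3 in place and append a copy of layer 27, turning 32 layers into 34.
# The model path is a placeholder; layer_range values are end-exclusive.
import pathlib
import subprocess

MODEL = "./ninjamouse-32-layer"  # hypothetical local checkpoint

config = f"""\
merge_method: passthrough
dtype: bfloat16
slices:
  - sources:
      - model: {MODEL}
        layer_range: [0, 4]    # layers 0-3
  - sources:
      - model: {MODEL}
        layer_range: [3, 4]    # copy of layer 3, inserted beside the original
  - sources:
      - model: {MODEL}
        layer_range: [4, 32]   # layers 4-31
  - sources:
      - model: {MODEL}
        layer_range: [27, 28]  # copy of layer 27, appended as the new layer 33
"""

pathlib.Path("expand.yml").write_text(config)
subprocess.run(["mergekit-yaml", "expand.yml", "./ninjamouse-34-layer"], check=True)
```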