Ever since I released my first Qwen2-based model several weeks ago, I've taken what I learned and attempted to create a new model that is pre-trained more thoroughly and on a more diverse dataset. I settled on the unfiltered version of the English subset of C4, with entries shuffled in batches of 1000 to break up continuous streams of related training data. For fine-tuning I initially chose agentlans/multiturn-chat because it contains far more examples than databricks/databricks-dolly-15k, but I reverted to dolly-15k: the conversations in multiturn-chat are too verbose to suit a short 1024-token-context model.
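For anyone curious what "shuffled in batches of 1000" looks like in practice, a buffered shuffle over the streaming dataset gives roughly that behavior. This is a minimal sketch using the Hugging Face datasets library, which I assume here for illustration; it is not the exact script used for this model.

```python
from datasets import load_dataset

# Stream the unfiltered English subset of C4 so the full corpus never has
# to be materialized on disk at once.
c4 = load_dataset("allenai/c4", "en.noblocklist", split="train", streaming=True)

# A shuffle buffer of 1000 examples approximates shuffling in batches of 1000:
# consecutive documents from the same crawl are much less likely to appear
# back-to-back in the training stream.
c4 = c4.shuffle(seed=42, buffer_size=1000)

# Peek at a few shuffled entries.
for example in c4.take(3):
    print(example["text"][:80])
```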
Like my previous model, this one was pre-trained on the AllenAI C4 English dataset, the key difference being that I used the "en.noblocklist" subset for more diversity. Instead of creating my own tokenizer, I reused GPT-2's tokenizer, which saved a lot of extra computation and has proven effective in real-world use. The model was pre-trained for 280,000 steps with a 1024-token context, a per-device training batch size of 4, and 4 gradient accumulation steps. Pre-training took about 60 hours with the GPU overclocked to its maximum capacity. Post-training consisted of 5 epochs on databricks/databricks-dolly-15k formatted in ChatML.
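To make the setup above concrete, here is a hedged sketch of the stated hyperparameters (GPT-2 tokenizer, 1024-token context, per-device batch size 4, 4 gradient-accumulation steps, 280k steps) expressed as a Hugging Face Trainer configuration. Everything not mentioned in the description is left at library defaults and is an assumption on my part, not the actual training script.

```python
from transformers import AutoTokenizer, TrainingArguments

# GPT-2's byte-level BPE tokenizer, reused as-is instead of training a new one.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

MAX_CONTEXT = 1024  # token context length used when chunking C4 into blocks

# Hyperparameters as described above; all other settings are defaults and
# may differ from the real run.
args = TrainingArguments(
    output_dir="bootstrap-llm-pretrain",
    max_steps=280_000,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch of 16 sequences per step
    logging_steps=500,
    save_steps=10_000,
)
```

For the post-training data, a simple way to render dolly-15k in ChatML is to concatenate the instruction (and optional context) into a user turn and the response into an assistant turn. The field names below follow databricks/databricks-dolly-15k; whether the original run folded the context field into the prompt this way is an assumption.

```python
from datasets import load_dataset

def to_chatml(example):
    """Render one dolly-15k record as a ChatML conversation string."""
    user_turn = example["instruction"]
    if example["context"]:
        # Prepend the optional reference context to the user turn (assumed).
        user_turn = f"{example['context']}\n\n{user_turn}"
    text = (
        "<|im_start|>user\n" + user_turn + "<|im_end|>\n"
        "<|im_start|>assistant\n" + example["response"] + "<|im_end|>\n"
    )
    return {"text": text}

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
dolly = dolly.map(to_chatml, remove_columns=dolly.column_names)
```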
HuggingFace URL: https://huggingface.co/TheOneWhoWill/Bootstrap-LLM