mannix/alchemistcoder-7b

Updated 2 years ago

b555c33496cc · 4.1GB

model

arch: llama · parameters: 6.74B · quantization: Q4_K_M · 4.1GB

params

{ "stop": [ "[INST]", "[/INST]", "<<SYS>>", "<</SYS>>" ] }

91B
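To illustrate what the stop list in the params block does, here is a minimal Python sketch. The helper name `truncate_at_stop` is hypothetical, and it works as post-processing on a finished string, whereas a runtime would normally halt generation as soon as a stop sequence appears. It only shows the intended effect: output ends before any of the four markers.

```python
# Stop sequences copied from the params block above.
STOP_SEQUENCES = ("[INST]", "[/INST]", "<<SYS>>", "<</SYS>>")


def truncate_at_stop(text: str, stops: tuple[str, ...] = STOP_SEQUENCES) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence.

    Hypothetical helper for illustration only; not part of any library.
    """
    # Collect the index of every stop sequence that actually occurs.
    positions = [text.find(s) for s in stops if s in text]
    # Keep everything before the earliest one, or the whole text if none occur.
    return text[: min(positions)] if positions else text
```

With this list, a completion that starts to emit a new `[INST]` turn is cut off before that marker, so the model does not continue into an imagined follow-up instruction.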

template

[INST] <<SYS>>{{ .System }}<</SYS>> {{ .Prompt }} [/INST]

59B
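The template above is a Go-style template with two fields, `.System` and `.Prompt`. As a rough sketch of how it expands, the Python below mirrors it with `str.format` placeholders; `TEMPLATE` and `render_prompt` are illustrative names, not part of any tool. Because the template has no conditional around the system block, an empty system message still yields an empty `<<SYS>><</SYS>>` pair.

```python
# Python mirror of the template above; {system} and {prompt} stand in for
# the Go template fields {{ .System }} and {{ .Prompt }}.
TEMPLATE = "[INST] <<SYS>>{system}<</SYS>> {prompt} [/INST]"


def render_prompt(prompt: str, system: str = "") -> str:
    """Fill the template with a user prompt and an optional system message.

    Hypothetical helper for illustration only.
    """
    return TEMPLATE.format(system=system, prompt=prompt)
```

For example, `render_prompt("Write a function that reverses a string.", "You are a careful Python programmer.")` produces a single `[INST] ... [/INST]` turn with the system message wrapped in `<<SYS>>` markers.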

Model Summary: AlchemistCoder is a series of coding models by InternLM. This model is tuned from Llama 2 and is intended for code generation and other coding-related tasks.
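As a hedged sketch of how one might send a coding prompt to this model, the Python below builds a request for Ollama's `/api/generate` endpoint using only the standard library. It assumes an Ollama server running locally on its default port (11434) with this model already pulled; `build_request` and `generate` are illustrative helper names.

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "mannix/alchemistcoder-7b"


def build_request(prompt: str, system: str = "") -> urllib.request.Request:
    """Build a non-streaming generate request for the model."""
    payload = {"model": MODEL, "prompt": prompt, "stream": False}
    if system:
        # Only send a system message when one is provided.
        payload["system"] = system
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, system: str = "") -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, system)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Write a Python function that checks whether a number is prime.")` would return the model's completion as a string, provided the server is reachable.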

Highlights

Abstract: Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may insufficiently elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data. To achieve this, we pioneer to unveil inherent conflicts among the various styles and qualities in multi-source code corpora and introduce data-specific prompts with hindsight relabeling, termed AlchemistPrompts, to harmonize different data sources and instruction-response pairs. Additionally, we propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review. Extensive experiments demonstrate that AlchemistCoder holds a clear lead among all models of the same size (6.7B/7B) and rivals or even surpasses larger models (15B/33B/70B), showcasing the efficacy of our method in refining instruction-following capabilities and advancing the boundaries of code intelligence.

AlchemistPrompts: Designed as data-specific prompts for harmonizing inherent conflicts in multi-source data and mitigating instruction/response misalignment at a fine-grained level.

Code Comprehension Tasks: Sourced from the process of data construction, consisting of instruction evolution, data filtering, and code review.

Harmonized Multi-source Data: Instruction-tuned on 200M tokens, including 6 types of high-quality data.

Superior Model Performance: Surpassing all open-source models of the same size (6.7B/7B), and rivaling or even beating larger models (15B/33B/70B/ChatGPT) on 6 code benchmarks.

Advanced Generic Capabilities: Demonstrated by significant improvements on MMLU, BBH, and GSM8K.
