
Quantized version of Qwen2.5-32B optimized for tool usage with Cline / Roo Code and complex problem solving.


Qwen/Qwen2.5-Coder-32B-Instruct

https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct

Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). Qwen2.5-Coder currently covers six mainstream model sizes (0.5, 1.5, 3, 7, 14, and 32 billion parameters) to meet the needs of different developers. Qwen2.5-Coder brings the following improvements over CodeQwen1.5:

Significant improvements in code generation, code reasoning, and code fixing. Building on the strong Qwen2.5, we scaled the training tokens up to 5.5 trillion, including source code, text-code grounding data, synthetic data, and more. Qwen2.5-Coder-32B has become the current state-of-the-art open-source code LLM, with coding abilities matching those of GPT-4o.

A more comprehensive foundation for real-world applications such as Code Agents, not only enhancing coding capabilities but also maintaining strengths in mathematics and general competencies.

Long-context support of up to 128K tokens.


Optimizing for Tool Calling

num_ctx 65536: Absolutely crucial. Tool calling often involves complex workflows that span multiple turns. A large context window enables the model to remember previous instructions, tool responses, and the overall goal. 128k or 200k is even better if available.

temperature 0.15: Significantly reduced from the coding configuration (0.25). Tool calling prioritizes accuracy and reliability above all else. The model needs to confidently select the correct tool and format its call precisely. Low temperature minimizes the risk of incorrect tool selection or poorly formatted calls.

top_p 0.7: A more restrained exploration. While a little creativity can be helpful in some cases, we don’t want the model to drastically deviate from the most likely tool choices.

repeat_penalty 1.2: Aggressively penalizes repetition. This is particularly important in tool calling to prevent the model from getting stuck in loops and repeatedly calling the same tool.


num_keep 1024: Significantly increased from the standard coding configuration (512). Tool calling benefits enormously from retaining a longer history. This allows the model to better understand the context of the task, remember the tools available, and track the progress of the workflow.

min_p 0.03: A slightly higher min_p than the standard coding configuration (0.02). Tool calling often requires creative problem-solving and exploring different tool combinations. While we still want grounded responses, a bit more exploration is beneficial. Caveat: Monitor for nonsensical tool choices or unexpected behavior. Reduce if needed.
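These sampling parameters can also be supplied per request through the options field of Ollama's REST API, instead of baking them into the model. A minimal sketch of building such a request (the model name and the user message are placeholders):

```python
import json

# Per-request overrides mirroring the tool-calling settings above.
# Send this payload as JSON to Ollama's /api/chat endpoint,
# e.g. POST http://localhost:11434/api/chat.
payload = {
    "model": "qwen2.5-coder:32b",  # placeholder model name
    "messages": [
        {"role": "user", "content": "List the files in the repository."}
    ],
    "options": {
        "num_ctx": 65536,
        "temperature": 0.15,
        "top_p": 0.7,
        "repeat_penalty": 1.2,
        "num_keep": 1024,
        "min_p": 0.03,
    },
    "stream": False,
}

body = json.dumps(payload)
```

Per-request options are handy for experimentation: you can lower temperature further for a strict tool-routing step and relax it again for a summarization step, without rebuilding the model.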

num_predict 16384 - 32768: num_predict (or the more common max_tokens in many LLM APIs) sets the maximum number of tokens the model is allowed to generate in a single response. It’s a safeguard to prevent runaway generation and control costs. However, in tool-calling scenarios, it can be quite tricky to set correctly.

Reference Starting Points

GPT-4 (32k context): A reasonable starting point is num_predict = 8192 or 16384. The 32k context window provides significant headroom, but be mindful of cost.

GPT-4 (8k context): Start with num_predict = 2048 or 4096.

Claude 2 (100k context): You can often start with num_predict = 16384 or 32768, but closely monitor token usage.

Smaller Models: Start with a smaller num_predict (e.g., 512 or 1024) and increase it as needed.
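The recommendations above can be collected into a single Ollama Modelfile. A sketch, assuming the qwen2.5-coder:32b base tag (substitute whichever base model you are running):

```
# Modelfile sketch applying the tool-calling settings discussed above.
FROM qwen2.5-coder:32b

PARAMETER num_ctx 65536
PARAMETER temperature 0.15
PARAMETER top_p 0.7
PARAMETER repeat_penalty 1.2
PARAMETER num_keep 1024
PARAMETER min_p 0.03
PARAMETER num_predict 16384
```

Build it with, for example, `ollama create my-tool-model -f Modelfile` and it will apply these defaults on every run.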

Example Scenario:

Let’s say you have a tool that generates a JSON response, and the average JSON response size is 500 tokens. Your model needs around 100 tokens to formulate the tool call. You anticipate the tool might occasionally return a larger response of up to 1000 tokens. In this case, a good starting point would be:

num_predict = 100 (reasoning) + 1000 (max tool response) + 200 (buffer) = 1300 tokens
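This budgeting rule can be sketched as a small helper. The token counts below are the illustrative figures from the scenario, not measured values:

```python
def num_predict_budget(reasoning_tokens: int,
                       max_tool_response: int,
                       buffer: int = 200) -> int:
    """Return a starting num_predict value: room for the model's
    reasoning, the largest expected tool response, and a safety buffer."""
    return reasoning_tokens + max_tool_response + buffer

# Scenario figures: ~100 tokens to formulate the tool call,
# up to 1000 tokens for the largest tool response, 200-token buffer.
print(num_predict_budget(100, 1000))  # 1300
```

If responses are routinely truncated, raise the buffer or the max-response estimate rather than setting num_predict to the context limit, which defeats its purpose as a runaway-generation safeguard.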