Built with Llama
Llama 3.3 is licensed under the Llama 3.3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
Licensed by NVIDIA Corporation under the NVIDIA Open Model License
Recommended for agentic AI systems.
Apparently relatively decent long-horizon and complex problem-solving ability.
Significant track record.
IQ2_XXS quantization - apparently adequate quality. Compatible with ~16GB VRAM.
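Before pulling, a quick check of available VRAM can confirm the ~16GB assumption. A sketch, assuming an NVIDIA GPU with nvidia-smi installed:
nvidia-smi --query-gpu=name,memory.total --format=csv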
# Pull a model from the mirage335 namespace, retag it locally under the short name, and drop the namespaced tag.
ollama_pull_virtuoso() {
    ollama pull mirage335/"$1"
    ollama cp mirage335/"$1" "$1"
    ollama rm mirage335/"$1"
}
ollama_pull_virtuoso Llama-3_3-Nemotron-Super-49B-v1_5-virtuoso
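After pulling, a quick listing can confirm the retagged local copy exists (sketch):
ollama list | grep Llama-3_3-Nemotron-Super-49B-v1_5-virtuoso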
echo "FROM Llama-3_3-Nemotron-Super-49B-v1_5-virtuoso" > Modelfile-12k
echo "PARAMETER num_ctx 12288" >> Modelfile-12k
echo "PARAMETER num_keep 12288" >> Modelfile-12k
echo "PARAMETER num_predict 24576" >> Modelfile-12k
echo "PARAMETER num_gpu 999" >> Modelfile-12k
ollama create Llama-3_3-Nemotron-Super-49B-v1_5-12k-virtuoso -f Modelfile-12k
rm -f Modelfile-12k
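A minimal smoke test for the 12k variant (the prompt is only an example):
ollama run Llama-3_3-Nemotron-Super-49B-v1_5-12k-virtuoso "Summarize the tradeoffs of q4_0 KV cache quantization in one paragraph."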
# 24GB VRAM, OLLAMA_KV_CACHE_TYPE=q4_0
# 2x16GB=32GB VRAM, OLLAMA_KV_CACHE_TYPE=q4_0
# 32GB VRAM, OLLAMA_KV_CACHE_TYPE=q4_0
# Other configuration may be necessary.
# Do NOT expect that adding another 16GB VRAM GPU will allow any larger context.
echo "FROM Llama-3_3-Nemotron-Super-49B-v1_5-virtuoso" > Modelfile-40k
echo "PARAMETER num_ctx 40960" >> Modelfile-40k
echo "PARAMETER num_keep 40960" >> Modelfile-40k
echo "PARAMETER num_predict 49152" >> Modelfile-40k
echo "PARAMETER num_gpu 999" >> Modelfile-40k
ollama create Llama-3_3-Nemotron-Super-49B-v1_5-40k-virtuoso -f Modelfile-40k
rm -f Modelfile-40k
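To confirm the 40k variant fits within VRAM under the configurations above, one sketch is to load it and inspect the processor split (PROCESSOR should report 100% GPU):
ollama run Llama-3_3-Nemotron-Super-49B-v1_5-40k-virtuoso "ok" >/dev/null
ollama ps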
# Agentic server use cases should call this variant unless 24GB VRAM GPUs are definitely not available.
#
# 2x24GB=48GB VRAM, OLLAMA_KV_CACHE_TYPE=q4_0
# Other configuration may be necessary.
# Calling this model guarantees >80k context. A larger context limit (80k, 96k, or 128k) may be set if it is not known to exceed VRAM limits in the 2x24GB=48GB VRAM situation.
# As usual, intended to ensure input prompt processing happens very quickly (~300 t/s or better) with 2x24GB=48GB VRAM.
# Candidate num_ctx values: 81920 , 98304 , 131072
echo "FROM Llama-3_3-Nemotron-Super-49B-v1_5-virtuoso" > Modelfile-80k
echo "PARAMETER num_ctx 131072" >> Modelfile-80k
echo "PARAMETER num_keep 131072" >> Modelfile-80k
echo "PARAMETER num_predict 49152" >> Modelfile-80k
echo "PARAMETER num_gpu 999" >> Modelfile-80k
ollama create Llama-3_3-Nemotron-Super-49B-v1_5-80k-virtuoso -f Modelfile-80k
rm -f Modelfile-80k
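A sketch of calling the 80k variant through the default Ollama HTTP API (assumes localhost:11434); a per-request num_ctx option is one way the 80k/96k/128k limits mentioned above could be selected without rebuilding the Modelfile:
curl -s http://localhost:11434/api/generate -d '{
  "model": "Llama-3_3-Nemotron-Super-49B-v1_5-80k-virtuoso",
  "prompt": "ok",
  "stream": false,
  "options": { "num_ctx": 98304 }
}'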
# WARNING: Experimental. Runs CPU-only to minimize thermal load for background processing of indeterminate duration.
# May be deprecated.
echo "FROM Llama-3_3-Nemotron-Super-49B-v1_5-virtuoso" > Modelfile-128k
echo "PARAMETER num_ctx 131072" >> Modelfile-128k
echo "PARAMETER num_keep 131072" >> Modelfile-128k
echo "PARAMETER num_predict 49152" >> Modelfile-128k
echo "PARAMETER num_gpu 0" >> Modelfile-128k
ollama create Llama-3_3-Nemotron-Super-49B-v1_5-128k-virtuoso -f Modelfile-128k
rm -f Modelfile-128k
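For the CPU-only 128k variant, ollama ps can confirm that no layers were offloaded to the GPU (sketch):
ollama run Llama-3_3-Nemotron-Super-49B-v1_5-128k-virtuoso "ok" >/dev/null
ollama ps
# PROCESSOR should report 100% CPU for this variant.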
Recommended environment variables. KV cache quantization “q4_0” in particular is RECOMMENDED, unless “q8_0” is needed (e.g., by Qwen-2_5-VL-7B-Instruct-virtuoso).
export OLLAMA_NUM_THREADS=18
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE="q4_0"
export OLLAMA_NEW_ENGINE=true
export OLLAMA_NOHISTORY=true
export OLLAMA_NUM_PARALLEL=1
export OLLAMA_MAX_LOADED_MODELS=1
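If ollama runs as a systemd service rather than from an interactive shell, set these variables on the service instead. A minimal sketch, assuming the standard ollama.service unit:
sudo systemctl edit ollama
# In the override editor, add:
# [Service]
# Environment="OLLAMA_FLASH_ATTENTION=1"
# Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
sudo systemctl restart ollama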
Adjust OLLAMA_NUM_THREADS and/or disable HyperThreading, etc., to prevent a crippling performance loss.
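One way to choose OLLAMA_NUM_THREADS is to match the physical core count rather than the hardware thread count. A sketch, assuming lscpu is available:
# Count unique (core,socket) pairs, i.e. physical cores.
lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l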
Pulling the model this way relies on the ollama repository and, more generally, on the reliability of internet services, which has proven rather fragile.
If possible, use the “Llama-3-virtuoso” project instead, which automatically caches an installable backup copy.