TowerInstruct is an Open Multilingual Large Language Model for Translation-Related Tasks by Unbabel.

2,523 6 months ago

Readme

TowerInstruct-7B is a language model that results from fine-tuning TowerBase on the TowerBlocks supervised fine-tuning dataset. TowerInstruct-7B-v0.1 is the first model in the series. The model is trained to handle several translation-related tasks, such as general machine translation (e.g., sentence- and paragraph/document-level translation, terminology-aware translation, context-aware translation), automatic post edition, named-entity recognition, grammatical error correction, and paraphrase generation. We will release more details in the upcoming technical report. For now, you can check the results obtained with the model here.

  • Developed by: Unbabel, Instituto Superior Técnico, CentraleSupélec University of Paris-Saclay
  • Model type: A 7B parameter model fine-tuned on a mix of publicly available, synthetic datasets on translation-related tasks, as well as conversational datasets and code instructions.
  • Language(s) (NLP): English, Portuguese, Spanish, French, German, Dutch, Italian, Korean, Chinese, Russian
  • License: CC-BY-NC-4.0, Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
  • Finetuned from model: TowerBase

Update: TowerInstruct-7B-v0.2 has more reliable document-level translation capabilities in comparison with TowerInstruct-7B-v0.1. The new version of TowerBlocks used to train v0.2 is also available in the Tower collection.

Note: TowerInstruct-v0.2 was trained using the ChatML prompt templates without any system prompts.

Intended uses & limitations

The model was initially fine-tuned on a filtered and preprocessed supervised fine-tuning dataset (TowerBlocks), which contains a diverse range of data sources:

  • Translation (sentence and paragraph-level)
  • Automatic Post Edition
  • Machine Translation Evaluation
  • Context-aware Translation
  • Terminology-aware Translation
  • Multi-reference Translation
  • Named-entity Recognition
  • Paraphrase Generation
  • Synthetic Chat data
  • Code instructions