Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero- and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalization: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats: PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks, but it is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.