Which Model Is Better? What LLM Benchmark Metrics Mean
It has been a while since I last looked at Dolly. Back then I came across the metrics in the table below (a partial set), had no idea what they meant, and am now going back for a closer look.
| model | openbookqa | arc_easy | winogrande | hellaswag | arc_challenge | piqa | boolq | gmean |
|---|---|---|---|---|---|---|---|---|
| EleutherAI/pythia-2.8b | 0.348 | 0.585859 | 0.589582 | 0.591217 | 0.323379 | 0.73395 | 0.638226 | 0.523431 |
| EleutherAI/pythia-6.9b | 0.368 | 0.604798 | 0.608524 | 0.631548 | 0.343857 | 0.761153 | 0.6263 | 0.543567 |
| databricks/dolly-v2-12b | 0.408 | 0.63931 | 0.616417 | 0.707927 | 0.388225 | 0.757889 | 0.568196 | 0.56781 |
| databricks/dolly-v2-7b | 0.392 | 0.633838 | 0.607735 | 0.686517 | 0.406997 | 0.750816 | 0.644037 | 0.573487 |
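The last column, gmean, is the geometric mean of the seven per-task scores; recomputing it from a row reproduces the value in the table. A minimal sketch in Python (the scores are copied from the EleutherAI/pythia-2.8b row above):

```python
import math

# Per-task accuracies for EleutherAI/pythia-2.8b, copied from the table above
scores = [0.348, 0.585859, 0.589582, 0.591217, 0.323379, 0.73395, 0.638226]

# Geometric mean = n-th root of the product, computed in log space for stability
# (equivalently: statistics.geometric_mean(scores) on Python 3.8+)
gmean = math.exp(sum(math.log(s) for s in scores) / len(scores))
print(f"{gmean:.6f}")  # 0.523431, matching the gmean column for this row
```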
The conclusion: these metrics are simply raw accuracy.
For example, openbookqa is a set of a few thousand multiple-choice questions; since each question has a gold answer, the model can be scored directly by the fraction it gets right.
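How does a plain language model get scored on a multiple-choice question in the first place? The usual approach (roughly what EleutherAI's lm-evaluation-harness does, which is where numbers like these come from) is to compute the log-likelihood the model assigns to each candidate answer given the question, pick the highest-scoring choice, and count it as correct if it matches the gold label. Below is a minimal sketch of that idea with a Hugging Face causal LM; the prompt format, tokenizer boundary handling, and helper names are my simplifications, not the harness's actual code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the table above was produced with the pythia/dolly models
model_name = "EleutherAI/pythia-2.8b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def choice_logprob(context: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to `choice` following `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token given everything before it
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the tokens belonging to the choice, not the question
    n_ctx = ctx_ids.shape[1]
    return token_lp[0, n_ctx - 1:].sum().item()

def predict(stem: str, choices: list[str]) -> int:
    """Index of the choice the model considers most likely."""
    return max(range(len(choices)), key=lambda i: choice_logprob(stem, choices[i]))

def accuracy(items: list[dict]) -> float:
    """Plain accuracy over items shaped like the openbookqa JSON below."""
    correct = 0
    for item in items:
        q = item["question"]
        texts = [c["text"] for c in q["choices"]]
        gold = next(i for i, c in enumerate(q["choices"]) if c["label"] == item["answerKey"])
        correct += int(predict(q["stem"], texts) == gold)
    return correct / len(items)
```

A sample openbookqa item looks like this (each item also carries an answerKey field with the gold label):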
{
"id": "7-980",
"question":
"stem": "The sun is responsible for",
"choices": [
{"text": "puppies learning new tricks", "label": "A"},
{"text": "children growing up and getting old", "label": "B"},
{"text": "flowers wilting in a vase", "label": "C"},