Which Model Is Better? What LLM Benchmark Metrics Mean

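The last time I looked at Dolly, I came across the metrics in the table below (an excerpt). I did not know what they meant back then, so now I am going back for a closer look.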

| model | openbookqa | arc_easy | winogrande | hellaswag | arc_challenge | piqa | boolq | gmean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EleutherAI/pythia-2.8b | 0.348 | 0.585859 | 0.589582 | 0.591217 | 0.323379 | 0.73395 | 0.638226 | 0.523431 |
| EleutherAI/pythia-6.9b | 0.368 | 0.604798 | 0.608524 | 0.631548 | 0.343857 | 0.761153 | 0.6263 | 0.543567 |
| databricks/dolly-v2-12b | 0.408 | 0.63931 | 0.616417 | 0.707927 | 0.388225 | 0.757889 | 0.568196 | 0.56781 |
| databricks/dolly-v2-7b | 0.392 | 0.633838 | 0.607735 | 0.686517 | 0.406997 | 0.750816 | 0.644037 | 0.573487 |
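
The table does not say how the gmean column is computed. A quick check, assuming it is the geometric mean of the seven per-task scores in the same row, can be sketched as follows. The helper name `geometric_mean` and the `SCORES` dictionary are made up for this example; the numbers are copied from two rows of the table.

```python
import math

# Per-task scores copied from the table, in column order:
# openbookqa, arc_easy, winogrande, hellaswag, arc_challenge, piqa, boolq
SCORES = {
    "EleutherAI/pythia-2.8b": [0.348, 0.585859, 0.589582, 0.591217, 0.323379, 0.73395, 0.638226],
    "databricks/dolly-v2-12b": [0.408, 0.63931, 0.616417, 0.707927, 0.388225, 0.757889, 0.568196],
}


def geometric_mean(values):
    """Return the n-th root of the product of n positive values."""
    # Averaging the logs and exponentiating is equivalent to taking
    # the n-th root of the product (assumed to be what gmean means here).
    return math.exp(math.fsum(math.log(v) for v in values) / len(values))


for model, scores in SCORES.items():
    print(model, round(geometric_mean(scores), 6))
```

Under that assumption the two results should land close to the 0.523431 and 0.56781 listed in the gmean column. A geometric mean penalizes a single weak task more than an arithmetic mean does, which may be why a 7b model can edge out a 12b one in that column.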

The short answer: each per-task score is plain accuracy.
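
As a rough sketch of what plain accuracy means for a benchmark with fixed answer keys: count the questions where the chosen label matches the key, then divide by the total. Everything in the block is an assumption for illustration: the toy questions are invented (they are not benchmark data), and `pick_answer` stands in for whatever procedure a real evaluation uses to get a label out of a model.

```python
from typing import Callable, Dict, List

# Toy items invented for this sketch; they are not taken from any benchmark.
ITEMS: List[Dict] = [
    {"stem": "2 + 2 equals", "choices": {"A": "3", "B": "4", "C": "5"}, "answerKey": "B"},
    {"stem": "Ice is frozen", "choices": {"A": "water", "B": "sand", "C": "air"}, "answerKey": "A"},
    {"stem": "The opposite of hot is", "choices": {"A": "warm", "B": "dry", "C": "cold"}, "answerKey": "C"},
    {"stem": "A triangle has this many sides", "choices": {"A": "2", "B": "3", "C": "4"}, "answerKey": "B"},
]


def accuracy(items: List[Dict], pick_answer: Callable[[Dict], str]) -> float:
    """Fraction of items where the picked label equals the answer key."""
    correct = sum(1 for item in items if pick_answer(item) == item["answerKey"])
    return correct / len(items)


def always_b(item: Dict) -> str:
    """Hypothetical stand-in for a model: ignores the question, answers B."""
    return "B"


print(accuracy(ITEMS, always_b))  # 2 of 4 keys are "B", so this prints 0.5
```

A score such as 0.348 in the table would then read as roughly 35 out of every 100 questions answered correctly.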

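Take openbookqa: it consists of a few thousand multiple-choice questions. Because every multiple-choice question has a standard answer, a model can be judged directly by the fraction of questions it gets right.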

{
    "id": "7-980", 
    "question": 
            "stem": "The sun is responsible for", 
            "choices": [
                {"text": "puppies learning new tricks", "label": "A"}, 
                {"text": "children growing up and getting old", "label": "B"}, 
                {"text": "flowers wilting in a vase", "label": "C"},