Armenian Unified Test Exams

This benchmark contains results of various Language Models on Armenian Unified Test Exams for Armenian language and literature, Armenian history and mathematics. The scoring system is a 20-point scale, where 0-8 is a Fail, 8-18 is a Pass, and 18-20 is a Distinction.

Rank	Model	Average	Armenian language and literature	Armenian history	Mathematics
1	Meta-Llama-3.3-70B-Instruct	11.0833	10.5	7.75	12.75

Rank	Model	Average	Armenian language and literature	Armenian history	Mathematics
1	claude-3-7-sonnet-20250219	11.0833	10.5	7.75	15
2	claude-3-5-sonnet-20241022	10.6667	10	9.25	12.75
3	gemini-2.0-flash	9.8333	5.5	6.75	17.25
4	gpt-4o	8.9167	6.75	6.75	13.25
5	qwen-max-2025-01-25	8.6667	7.25	4.5	14.25
6	gemini-1.5-flash	7.8333	4.75	3.75	15
7	DeepSeek-V3	7.5	5.25	5	12.25
8	Meta-Llama-3.3-70B-Instruct	7.0833	4.5	5.25	11.5
9	claude-3-5-haiku-20241022	6.5	5	3.75	10.75

Select Column to Plot

Plot

MMLU-Pro Translated to Armenian (MMLU-Pro-Hy)

This benchmark contains results of various Language Models on the MMLU-Pro benchmark, translated into Armenian. MMLU-Pro is a massive multi-task test in MCQA format. The scores represent accuracy.

Rank	Model	Average	Biology	Business	Chemistry	Computer Science	Economics	Engineering	Health	History	Law	Math	Philosophy	Physics	Psychology	Other
1	Meta-Llama-3.3-70B-Instruct	0.7247	0.8667	0.8182	0.7895	0.7353	0.8169	0.5625	0.6618	0.5517	0.5281	0.8673	0.6429	0.7982	0.7612	0.6364

Rank	Model	Average	Biology	Business	Chemistry	Computer Science	Economics	Engineering	Health	History	Law	Math	Philosophy	Physics	Psychology	Other
1	gemini-2.0-flash	0.7247	0.85	0.8182	0.7895	0.7353	0.8169	0.6	0.75	0.5517	0.5281	0.8673	0.6429	0.7982	0.7612	0.6364
2	claude-3-5-sonnet-20241022	0.6958	0.8667	0.803	0.7579	0.7059	0.7887	0.5625	0.6618	0.6552	0.4944	0.7788	0.5476	0.7523	0.7164	0.6494
3	gpt-4o	0.6758	0.8667	0.7424	0.6842	0.6176	0.7887	0.5625	0.7794	0.5517	0.5393	0.7788	0.5476	0.6881	0.7164	0.5974
4	DeepSeek-V3	0.6633	0.8167	0.8182	0.6947	0.7353	0.7887	0.5875	0.6471	0.4828	0.3596	0.8584	0.5476	0.6881	0.7164	0.5455
5	gemini-1.5-flash	0.5592	0.75	0.7121	0.6947	0.5	0.7183	0.4	0.5	0.4483	0.2584	0.8319	0.3571	0.6514	0.6567	0.3506
6	claude-3-5-haiku-20241022	0.5198	0.75	0.5758	0.5579	0.4412	0.6901	0.4125	0.5882	0.5172	0.2472	0.6018	0.4048	0.5596	0.5672	0.3636
7	Meta-Llama-3.3-70B-Instruct	0.5139	0.7333	0.5303	0.5895	0.3824	0.6338	0.4875	0.5735	0.4138	0.3146	0.6018	0.4524	0.5321	0.6119	0.3377

Select Column to Plot

Plot

Overview

This benchmark is developed and maintained by Metric. It evaluates the capabilities of Large Language Models on Armenian-specific tasks, including Armenian Unified Test Exams and a translated version of the MMLU-Pro benchmark (MMLU-Pro-Hy). It is designed to measure the models' understanding and generation capabilities in the Armenian language.

Dataset

Armenian Unified Exams: collection of High School graduation test exams used in 2025 in Armenia. The highest achievable score per test is 20. The data is extracted from PDFs and manually prepared for LLM evaluation.
MMLU-Pro-Hy: a massive multi-task test in MCQA format, inspired by the original MMLU benchmark, adapted for the Armenian language. Currently, a stratified sample is sued for evaluation summing up to 1000 questions in total. The Armenian version is generated through machine-translation. Resulting dataset went extensive post-processing to ensure high quality subsample is selected for evaluation..

Submission Guide

To submit a model for evaluation, please follow these steps:

Evaluate your model:
- Follow the evaluation script provided here: https://github.com/Metricam/ArmBench-LLM
- For more details about the evaluation and submission process, read the README in the ArmBench-LLM GitHub repository.

Format your submission file:

After evaluation, you will get a results.json file. Ensure the file follows this format:

    {
        "mmlu_results": [
            {
                "category": "category_name",
                "score": score_value
            },
            ...
        ],
        "unified_exam_results": [
            {
                "category": "category_name",
                "score": score_value
            },
            ...
        ]
    }

Submit your model:
- Add the ArmBench-LLM tag and the results.json file to your model card.
- Click on the "Refresh Data" button in this app, and you will see your model's results.

Contributing

You can contribute to this benchmark in several ways:

Provide API credits for evaluating additional API-based models.
Citing our work in your research and publications.
Contributing to the development of the benchmark itself with data or with evaluation results.

About Metric

Metric is an AI Research Lab based in Yerevan, Armenia. It is specialized in training custom embedding and generation models for use cases such as Document AI or low-represented languages. If you are interested in our research or advisory services, drop an email to info@metric.am.