TangoIA
Performance
Benchmark Results
Evaluation of the Tango-70b model compared with other language models
Comparison with other models on la-leaderboard
Model | Average | AQuAS | Belebele Spa | ClinDiagnosES | ClinTreatES | COPA_es | Crows Pairs Spanish | EsCoLA | Fake News ES | HumorQA | MGSM_es | NoticIA | OffendES | OpenBookQA_es | PAWS-X_es | RagQuAS | SpaLawEx | TELEIA | WNLI ES | XL-Sum_es | XNLI_es | XQuAD_es | xStoryCloze_es | Precision |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Tango-70b | 59.90 | 75.78 | 92.00 | 65.72 | 63.43 | 89.60 | 55.96 | 71.79 | 76.57 | 25.49 | 32.40 | 0.86 | 72.64 | 34.80 | 70.95 | 79.87 | 51.26 | 61.90 | 77.46 | 19.71 | 52.37 | 75.16 | 74.72 | bfloat4 |
google/gemma-2-9b-it | 33.62 | 85.93 | 86.22 | 83.19 | 81.42 | 78.80 | 17.96 | 34.52 | 62.94 | 45.10 | 0 | 34.11 | 64.52 | 9.33 | 27.60 | 88.01 | 30.53 | 35.72 | 52.11 | 0 | 24.28 | 62.29 | 35.01 | bfloat16 |
google/gemma-2-9b | 32.97 | 83.02 | 83.26 | 77.77 | 80.93 | 68.80 | 13.59 | 28.79 | 16.00 | 45.10 | 4.80 | 0.23 | 66.33 | 12.00 | 24.70 | 86.79 | 5.88 | 35.72 | 4.23 | 0 | 29.76 | 75.33 | 47.98 | float32 |
meta-llama/Meta-Llama-3.1-8B-Instruct | 30.23 | 85.31 | 83.56 | 81.75 | 73.40 | 72.00 | 6.03 | 24.24 | 60.14 | 37.25 | 0 | 28.71 | 57.00 | 12.00 | 33.20 | 88.62 | 19.33 | 21.43 | 32.39 | 0 | 25.30 | 69.94 | 35.54 | bfloat16 |
Qwen/Qwen2.5-7B | 27.61 | 85.37 | 84.89 | 79.25 | 81.90 | 62.00 | 8.81 | 20.72 | 42.66 | 45.10 | 5.20 | 3.93 | 67.03 | 10.67 | 29.60 | 90.43 | 19.33 | 14.29 | 40.85 | 0 | 25.30 | 80.05 | 38.19 | bfloat16 |
meta-llama/Meta-Llama-3.1-8B | 27.04 | 83.02 | 74.52 | 80.71 | 81.21 | 62.00 | 0 | 11.53 | 19.58 | 45.10 | 1.60 | 2.60 | 66.23 | 13.07 | 30.10 | 90.69 | 5.88 | 0 | 1.41 | 0 | 28.86 | 74.38 | 41.63 | bfloat16 |
utter-project/EuroLLM-9B | 25.87 | 83.10 | 67.70 | 72.24 | 74.52 | 70.40 | 3.25 | 18.29 | 7.34 | 42.48 | 3.60 | 0.19 | 70.26 | 17.07 | 31.00 | 83.11 | 5.88 | 14.29 | 7.04 | 0 | 27.71 | 76.92 | 44.01 | bfloat16 |
BSC-LT/salamandra-7b-instruct | 25.13 | 84.13 | 57.33 | 80.38 | 82.03 | 62.00 | 10.67 | 7.68 | 8.74 | 0 | 0 | 19.38 | 67.83 | 14.93 | 19.50 | 88.78 | 18.21 | 21.43 | 9.86 | 0 | 24.28 | 58.31 | 30.38 | bfloat16 |
utter-project/EuroLLM-9B-Instruct | 24.46 | 84.81 | 69.78 | 80.90 | 77.76 | 72.40 | 11.20 | 24.57 | 38.11 | 26.80 | 0 | 26.80 | 61.91 | 13.60 | 26.10 | 90.79 | 13.73 | 21.43 | 29.58 | 0 | 24.82 | 58.48 | 33.69 | bfloat16 |
CohereForAI/aya-expanse-8b | 24.30 | 83.45 | 77.78 | 78.88 | 72.24 | 68.00 | 9.21 | 15.53 | 19.58 | 0 | 0 | 0.46 | 62.23 | 8.53 | 33.90 | 89.02 | 13.73 | 50.00 | 38.03 | 0 | 15.79 | 77.98 | 34.08 | float16 |
BSC-LT/salamandra-7b | 24.04 | 81.93 | 22.07 | 74.68 | 78.11 | 62.80 | 5.37 | 21.46 | 19.58 | 45.10 | 2.40 | 0.17 | 57.27 | 10.40 | 18.60 | 87.78 | 5.88 | 0 | 15.49 | 0 | 26.15 | 69.21 | 46.92 |
Notes:
- Overall average: unweighted mean of all valid metrics across the 23 evaluated tasks (46 values in total)
- Results for the other models are taken from la-leaderboard
- Tango-70b outperforms the second-best model (google/gemma-2-9b-it, 33.62) by 26.28 percentage points
- Tango-70b stands out in particular on Belebele Spa (92.00), COPA_es (89.60), RagQuAS (79.87), WNLI ES (77.46), EsCoLA (71.79), and XQuAD_es (75.16)
Reproducing the results
📁 Repository: sandbox-ai/tango-evals
Create and activate a Python ≥ 3.9 virtual environment:
python -m venv .venv
source .venv/bin/activate
Install the dependencies and the harness in editable mode:
pip install -r requirements.txt
pip install -e .
Log in to Hugging Face:
huggingface-cli login
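If you prefer to authenticate programmatically rather than through the CLI (for example, from a notebook), the huggingface_hub library exposes an equivalent login helper. A minimal sketch, assuming your token is available in an HF_TOKEN environment variable:

```python
import os
from huggingface_hub import login

# Assumption: HF_TOKEN was exported beforehand; calling login() with no
# arguments instead prompts interactively for the token.
login(token=os.environ["HF_TOKEN"])
```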
Run the evaluation script:
chmod +x run_laleaderboard_es.sh
./run_laleaderboard_es.sh
Run the results aggregation script:
python aggregate_laleaderboard_es_acc.py
The aggregate_laleaderboard_es_acc.py script reads all the results_*.json files in tango-evals/ and computes:
- The mean of the accuracy metrics only
- The mean of all metrics (the first metric of each task)
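For reference, the aggregation boils down to averaging metric values collected from the per-task JSON files. The sketch below is not the repository's script but a minimal approximation of that logic, assuming each results_*.json follows the lm-evaluation-harness layout (a top-level "results" dict mapping task names to metric/value pairs); the actual script may select metrics differently.

```python
import glob
import json

# Minimal sketch of the aggregation step (not the repository's script).
# Assumption: each results_*.json has a top-level "results" dict mapping
# task names to {metric_name: value} entries.
acc_values = []           # accuracy-style metrics only
first_metric_values = []  # first reported metric of each task

for path in sorted(glob.glob("results_*.json")):
    with open(path) as f:
        task_results = json.load(f)["results"]
    for task, metrics in task_results.items():
        # Keep numeric entries only and drop stderr values.
        numeric = {
            name: value
            for name, value in metrics.items()
            if isinstance(value, (int, float)) and "stderr" not in name
        }
        if not numeric:
            continue
        acc_values.extend(v for name, v in numeric.items() if name.startswith("acc"))
        first_metric_values.append(next(iter(numeric.values())))

if acc_values:
    print(f"Accuracy-only mean: {sum(acc_values) / len(acc_values):.4f}")
if first_metric_values:
    print(f"All-metrics mean (first metric per task): {sum(first_metric_values) / len(first_metric_values):.4f}")
```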