LLM Benchmarking Automation Suite
A tool that automates LLM/agentic benchmarking, removing the manual work of wiring up models, harnesses, and evaluation suites. It offers pre-built harnesses, standardized datasets, and automated metric collection, cutting the time and resources needed for evaluation.
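As an illustration of the core loop such a tool would automate, here is a minimal sketch of a benchmark harness in the proposed stack (Python + SQLite). The function names, schema, and exact-match scoring are assumptions for illustration, not part of the project spec:

```python
import sqlite3
import time

def run_benchmark(model_fn, dataset, db_path=":memory:"):
    """Run model_fn over (prompt, expected) pairs and record per-item metrics in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "prompt TEXT, expected TEXT, output TEXT, correct INTEGER, latency_s REAL)"
    )
    for prompt, expected in dataset:
        start = time.perf_counter()
        output = model_fn(prompt)  # model_fn is any callable: API client, local model, agent
        latency = time.perf_counter() - start
        conn.execute(
            "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
            (prompt, expected, output, int(output.strip() == expected), latency),
        )
    conn.commit()
    # Aggregate metrics straight from the results table.
    correct, total = conn.execute(
        "SELECT SUM(correct), COUNT(*) FROM results"
    ).fetchone()
    return {"accuracy": correct / total, "total": total}

if __name__ == "__main__":
    # Stub "model" standing in for a real LLM call.
    dataset = [("2+2=", "4"), ("capital of France?", "Paris")]
    stub = lambda p: {"2+2=": "4"}.get(p, "unknown")
    print(run_benchmark(stub, dataset))
```

A real suite would swap the exact-match check for pluggable metrics and persist results to a file-backed database for comparison across runs.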
MVP estimate: 120h
Viability grade: 7.2
Views: 6
Technology stack: Python, SQLite
Medium
Inspired by: Frameworks For Supporting LLM/Agentic Benchmarking