LLM Benchmarking Automation Suite
A tool that automates LLM and agentic benchmarking, replacing the manual work of creating models, harnesses, and suites for each evaluation. It would offer pre-built harnesses, standardized datasets, and automated metric collection, reducing the time and resources an evaluation requires.
mvp estimate
120h
viability grade
7.2
views
19
technology stack
Python
SQLite
Medium
inspired by
Frameworks For Supporting LLM/Agentic Benchmarking
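
To make the idea concrete, here is a minimal sketch of how a pre-built harness with automated metric collection might look on the listed stack (Python and SQLite). Everything in it is an illustrative assumption rather than part of the proposal: the `results` table layout, the `run_benchmark` function, the `TOY_DATASET` examples, the `echo_model` stand-in for a real LLM call, and the choice of exact-match accuracy and wall-clock latency as the collected metrics.

```python
import sqlite3
import time
from typing import Callable, Iterable, Tuple

# Hypothetical toy dataset: (prompt, expected answer) pairs.
TOY_DATASET = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
]


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM call; answers from a fixed lookup table."""
    canned = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "unknown")


def run_benchmark(
    conn: sqlite3.Connection,
    suite: str,
    model_name: str,
    model: Callable[[str], str],
    dataset: Iterable[Tuple[str, str]],
) -> float:
    """Run every example through the model, store per-example results in
    SQLite, and return exact-match accuracy for this suite/model pair."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "suite TEXT, model TEXT, prompt TEXT, expected TEXT, "
        "output TEXT, correct INTEGER, latency_s REAL)"
    )
    rows = []
    for prompt, expected in dataset:
        start = time.perf_counter()
        output = model(prompt)
        latency = time.perf_counter() - start
        # Exact match after trimming whitespace and lowercasing.
        correct = int(output.strip().lower() == expected.strip().lower())
        rows.append((suite, model_name, prompt, expected, output, correct, latency))
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    # Accuracy is aggregated over all stored rows for this suite and model,
    # so repeated runs on the same connection are pooled together.
    (accuracy,) = conn.execute(
        "SELECT AVG(correct) FROM results WHERE suite = ? AND model = ?",
        (suite, model_name),
    ).fetchone()
    return accuracy if accuracy is not None else 0.0


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    score = run_benchmark(connection, "toy-suite", "echo", echo_model, TOY_DATASET)
    print(f"exact-match accuracy: {score:.2f}")
```

Keeping one row per example, rather than only the aggregate score, means new metrics can later be computed with a query instead of a rerun. A real tool would likely also record a run identifier and timestamp so separate runs stay distinguishable.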