Data Contamination Scanner
Inspired by discussions of data curation for pre-training alignment, this tool automatically scans large text or image datasets for undesirable content, such as violence or deception, and suggests targeted replacements. It uses language models to flag potentially harmful patterns so that dataset quality can be improved before training.
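The scan-flag-suggest loop described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `Finding`, `classify`, `suggest_replacement`, and `scan` are hypothetical, the keyword patterns stand in for a language-model classifier, and masking the matched word stands in for a model-generated replacement. It covers text records only.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional

# Hypothetical keyword patterns. In the tool as described, a language model
# would make this judgment; regexes are only a stand-in for the sketch.
PATTERNS = {
    "violence": re.compile(r"\b(kill|killed|stab|stabbed|shoot|shot)\b", re.IGNORECASE),
    "deception": re.compile(r"\b(lie|lied|lies|lying|deceive|deceived)\b", re.IGNORECASE),
}


@dataclass
class Finding:
    index: int        # position of the record in the input
    category: str     # which undesirable category was flagged
    text: str         # the original record
    suggestion: str   # proposed replacement text


def classify(text: str) -> Optional[str]:
    """Return the first matching category name, or None if the text looks clean."""
    for category, pattern in PATTERNS.items():
        if pattern.search(text):
            return category
    return None


def suggest_replacement(text: str, category: str) -> str:
    """Mask flagged words (assumption: a real tool would ask a model for a rewrite)."""
    pattern = PATTERNS.get(category)
    return pattern.sub("[removed]", text) if pattern else text


def scan(
    records: Iterable[str],
    classifier: Callable[[str], Optional[str]] = classify,
    suggester: Callable[[str, str], str] = suggest_replacement,
) -> Iterator[Finding]:
    """Yield one Finding per flagged record; clean records are skipped."""
    for index, text in enumerate(records):
        category = classifier(text)
        if category is not None:
            yield Finding(index, category, text, suggester(text, category))


if __name__ == "__main__":
    sample = [
        "The cat sat on the mat.",
        "He threatened to stab the guard.",
        "She lied about the results.",
    ]
    for finding in scan(sample):
        print(finding.index, finding.category, "->", finding.suggestion)
```

Passing the classifier and suggester in as callables keeps the scanning loop unchanged if the keyword stand-ins are later swapped for model calls. Findings could then be written to a PostgreSQL table for review, though that step is not shown here.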
300h
mvp estimate
8.1
viability grade
technology stack
Python
PostgreSQL
Difficult
inspired by
Addressing undesirable data in ML training