Data Contamination Scanner
Inspired by discussions of data curation for pre-training alignment, this tool scans large text or image datasets for undesirable content, such as violence or deception, and suggests targeted replacements. It uses language models to flag potentially harmful patterns so that dataset quality can be improved before training.
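The scan-and-suggest flow described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual design: the names `Finding`, `pattern_classifier`, and `scan` are hypothetical, and a small regex table stands in for the language-model classifier that the description mentions. The `classify` and `suggest` parameters are left pluggable so that a model-backed function could be swapped in.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Tuple

# Hypothetical stand-in for a language-model classifier:
# one word-boundary regex per undesirable category.
PATTERNS = {
    "violence": re.compile(r"\b(attack|kill|assault)\w*", re.IGNORECASE),
    "deception": re.compile(r"\b(deceiv|trick|mislead)\w*", re.IGNORECASE),
}


@dataclass
class Finding:
    """One flagged record, with an optional replacement suggestion."""
    record_id: int
    category: str
    text: str
    suggestion: Optional[str] = None


def pattern_classifier(text: str) -> Optional[str]:
    """Return the first matching category name, or None if the text is clean."""
    for category, pattern in PATTERNS.items():
        if pattern.search(text):
            return category
    return None


def scan(
    records: Iterable[Tuple[int, str]],
    classify: Callable[[str], Optional[str]] = pattern_classifier,
    suggest: Optional[Callable[[str, str], str]] = None,
) -> Iterator[Finding]:
    """Yield a Finding for every (record_id, text) pair the classifier flags."""
    for record_id, text in records:
        category = classify(text)
        if category is None:
            continue  # clean record, nothing to report
        # Only ask for a replacement when a suggester was supplied.
        suggestion = suggest(text, category) if suggest else None
        yield Finding(record_id, category, text, suggestion)


if __name__ == "__main__":
    sample = [
        (1, "The cat sat on the mat."),
        (2, "He planned to deceive his friend."),
    ]
    for finding in scan(sample):
        print(finding)
```

Because `scan` is a generator, records are processed one at a time, which suits large datasets; findings could then be written to a store such as PostgreSQL for review, though that step is not shown here.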
MVP estimate: 300h
Viability grade: 8.1
Technology stack: Python, PostgreSQL
Difficult
Inspired by: Addressing undesirable data in ML training