Article 10 of the AI Act governs the use of data and sets out the data governance and management obligations that apply to organizations subject to the Act.
All training, validation, and testing datasets will be subject to the relevant data governance and management practices. These practices will concern:
- Appropriate design choices;
- Overall data collection processes, the origin of the data, and, in the case of personal data, the original purpose of the data collection;
- The formulation of assumptions, in particular with respect to the information that the data are supposed to measure and represent;
- Relevant data-preparation processes such as annotation, labeling, cleaning, updating, enrichment, and aggregation;
- Appropriate assessment of the availability, quantity, and suitability of the required datasets;
- Evaluation of possible biases that are likely to affect the health and safety of natural persons, have a negative impact on fundamental rights, or lead to discrimination prohibited under Union law;
- Appropriate measures adopted to detect, prevent, and mitigate possible identified biases;
- Appropriate identification of data gaps or shortcomings that may prevent compliance with the AI Act's provisions, and how those gaps and shortcomings can be addressed.
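The bias-evaluation and mitigation steps above are process obligations rather than a prescribed method, but a minimal sketch can illustrate one common starting point: comparing positive-label rates across a protected attribute in a training dataset. The column names (`gender`, `label`) and the 0.2 threshold are illustrative assumptions, not values set by the Act.

```python
# Hypothetical sketch of a first-pass bias screen on training data.
# Field names and the disparity threshold are illustrative only.

from collections import defaultdict

def positive_rates(rows, group_key, label_key):
    """Return the share of positive labels per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for row in rows:
        group = row[group_key]
        counts[group][0] += row[label_key]
        counts[group][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}

def disparity(rates):
    """Largest gap in positive rates between any two groups."""
    values = list(rates.values())
    return max(values) - min(values)

# Toy dataset standing in for real training records.
data = [
    {"gender": "F", "label": 1}, {"gender": "F", "label": 0},
    {"gender": "F", "label": 0}, {"gender": "F", "label": 0},
    {"gender": "M", "label": 1}, {"gender": "M", "label": 1},
    {"gender": "M", "label": 1}, {"gender": "M", "label": 0},
]

rates = positive_rates(data, "gender", "label")
gap = disparity(rates)
if gap > 0.2:  # illustrative tolerance, not a legal standard
    print(f"Potential bias: positive-rate gap of {gap:.2f} across groups")
```

A flag raised by a screen like this would feed the "detect, prevent, and mitigate" step, e.g. by prompting rebalancing or relabeling before training continues.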
All training, validation, and testing datasets must be relevant, sufficiently representative, free of errors, and complete so that the outputs of systems trained and evaluated on them are fit for the intended purpose. The datasets must also reflect the specific geographical, contextual, behavioral, or functional setting within which the high-risk AI system is intended to be used.
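As a rough illustration of the representativeness requirement, a provider might compare the geographic mix of a dataset against the expected mix of the deployment setting. The region codes, expected shares, and 0.1 tolerance below are illustrative assumptions for the sketch, not figures from the Act.

```python
# Hypothetical sketch: does the dataset's geographic distribution
# roughly match the setting where the system will be deployed?
# Region shares and the tolerance are illustrative only.

from collections import Counter

def distribution(samples):
    """Empirical share of each category in a list of samples."""
    counts = Counter(samples)
    total = len(samples)
    return {k: v / total for k, v in counts.items()}

def max_deviation(observed, expected):
    """Largest absolute gap between observed and expected shares."""
    regions = set(observed) | set(expected)
    return max(abs(observed.get(r, 0.0) - expected.get(r, 0.0)) for r in regions)

# Toy dataset: 70% DE, 20% FR, 10% ES.
dataset_regions = ["DE"] * 70 + ["FR"] * 20 + ["ES"] * 10
# Assumed deployment-setting mix (illustrative).
expected_mix = {"DE": 0.40, "FR": 0.35, "ES": 0.25}

observed = distribution(dataset_regions)
deviation = max_deviation(observed, expected_mix)
if deviation > 0.1:  # illustrative tolerance
    print(f"Dataset may under-represent the deployment setting "
          f"(max share deviation {deviation:.2f})")
```

Similar comparisons can be run for contextual, behavioral, or functional attributes; the point is to document the check, not any particular threshold.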
In cases where processing is necessary for the purposes of bias detection and correction, providers of high-risk AI systems may exceptionally process special categories of personal data, subject to adequate protections for the fundamental rights and freedoms of individuals. However, in addition to the relevant requirements for such processing under the GDPR, Directive (EU) 2016/680, and Regulation (EU) 2018/1725, all of the following conditions must be met:
- The bias detection and correction cannot be carried out with synthetic or anonymized data;
- The special categories of personal data are subject to technical limitations in terms of re-use of the personal data as well as several privacy-related measures;
- The special categories of personal data are subject to appropriate measures to ensure all data processed is secured, protected, and authenticated via suitable safeguards;
- The special categories of personal data will not be transmitted, transferred, or accessible to other parties;
- The special categories of personal data must be deleted once the bias has been corrected or the data have reached the end of their retention period, whichever comes first;
- The record of processing activities (RoPA) should contain the reasons why processing special categories of personal data was necessary to detect and correct biases and why alternative data could not achieve this objective.