Article 10 EU AI Act: data governance guide

Most teams hear “data governance” and think about data catalogues, retention rules, and ownership charts. Under the EU AI Act, Article 10 is more concrete and more demanding than that. It is about whether a provider of a high-risk AI system can actually defend the quality of the data that shaped the system.

That matters because many AI failures do not start at the model layer. They start earlier, inside the assumptions behind the data, inside the gaps nobody documented, and inside the bias everyone hoped would average out later. Article 10 is the part of the AI Act that says that is not good enough.

For providers of high-risk AI systems, Article 10 is not a side requirement. It is one of the operational foundations of the whole compliance stack, alongside Article 9 on risk management, Article 13 on transparency and information for deployers, and Article 14 on human oversight.

What Article 10 actually requires

The legal text of Article 10 contains six paragraphs, and the structure matters.

Paragraph 1 sets the main rule. If a high-risk AI system uses techniques involving the training of AI models with data, the system must be developed on the basis of training, validation, and test data sets that meet the quality criteria in paragraphs 2 to 5.

Paragraph 2 is the real engine room. It requires data governance and management practices appropriate to the intended purpose of the high-risk AI system. The article then lists eight specific areas providers must control: design choices, data collection and origin, data preparation, assumptions, availability and suitability, bias examination, bias mitigation, and relevant gaps or shortcomings.

Paragraph 3 adds the quality threshold. Training, validation, and test data must be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete in view of the intended purpose.

Paragraph 4 adds context. To the extent required by the intended purpose, the data must take into account the characteristics particular to the specific geographical, contextual, behavioural, or functional setting in which the system is intended to be used.

Paragraph 5 creates a narrow and heavily conditioned route for providers to process special categories of personal data when that is strictly necessary for bias detection and correction.

Paragraph 6 makes one final clarification. If the high-risk AI system does not use training techniques, paragraphs 2 to 5 still apply, but only to the test data.

That last point matters more than many providers assume. Article 10 is not only about foundation models or machine learning pipelines. It also reaches systems where testing data is the critical validation layer.

Paragraph 2 is where compliance becomes operational

Most Article 10 work lives inside paragraph 2. The law is not asking whether you have “good data” in the abstract. It is asking whether you can explain, document, and defend how that data was chosen and handled.

Design choices and data origin

Providers need to document the design logic behind their data strategy. Why these sources? Why these labels? Why these inclusion and exclusion criteria? Why this balance between real-world and synthetic data?

That means you should be able to answer questions such as:

  1. What population or environment is the system meant to work in?
  2. Which data sources were used to reflect that environment?
  3. Which sources were rejected, and why?
  4. Where does personal data come from, and what was the original purpose of collection?
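
One way to keep those answers auditable is a structured register with one entry per data source. Below is a minimal sketch in Python; the field names and the sample entry are illustrative assumptions, not terminology taken from the Act.

```python
from dataclasses import dataclass, field

@dataclass
class DataSourceRecord:
    """One entry in a data-origin register for a training, validation, or test set."""
    name: str
    origin: str                         # operational system, vendor, public dataset, scraped content, ...
    original_collection_purpose: str    # relevant for personal data (GDPR purpose limitation)
    contains_personal_data: bool
    inclusion_rationale: str            # why this source reflects the intended population or environment
    known_limitations: list[str] = field(default_factory=list)

register = [
    DataSourceRecord(
        name="claims_history_2019_2023",
        origin="legacy claims-handling system",
        original_collection_purpose="claims administration",
        contains_personal_data=True,
        inclusion_rationale="covers the Member State and product line the system is intended for",
        known_limitations=["outcome labels inconsistent before 2021"],
    ),
]
```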

This is where Article 10 starts to overlap with GDPR and governance reality. If your training data came from legacy operational systems, external vendors, public datasets, or scraped content, the origin story matters. Not later, now.

Data preparation and assumptions

Article 10 explicitly calls out annotation, labelling, cleaning, updating, enrichment, and aggregation. That is a useful signal. The AI Act is not focused only on raw data collection. It is focused on the full chain of transformation.

Providers often under-document the human judgments built into that chain. But those judgments shape the model.

If an HR model is trained on CV screening outcomes, someone decided what counts as a positive outcome. If a fraud model is trained on historic investigations, someone decided which past cases were “confirmed.” If a public sector model is trained on intervention data, someone decided what the system is supposed to measure in the first place.

Article 10(2)(d) is especially important here because it requires the formulation of assumptions. That is a direct challenge to a common bad habit in AI projects: turning proxies into facts without saying so. Cost is not the same as need. Past intervention is not the same as actual risk. Historical hiring is not the same as merit.
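
A lightweight way to act on Article 10(2)(d) is to record each proxy explicitly, together with what it is assumed to measure and how it could fail. A minimal sketch follows; the datasets, labels, and proxy-risk descriptions are invented examples.

```python
# Illustrative assumption register; entries are invented examples, not wording from the Act.
assumptions = [
    {
        "dataset": "cv_screening_outcomes",
        "label": "advanced_to_interview",
        "assumed_to_measure": "candidate suitability",
        "proxy_risk": "encodes past recruiter decisions, not merit; may reproduce historical hiring bias",
    },
    {
        "dataset": "fraud_investigations",
        "label": "case_confirmed",
        "assumed_to_measure": "actual fraud",
        "proxy_risk": "only investigated cases can be confirmed, so uninvestigated fraud stays invisible",
    },
]

for a in assumptions:
    print(f"{a['dataset']}: '{a['label']}' stands in for {a['assumed_to_measure']} -- risk: {a['proxy_risk']}")
```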

Bias examination, mitigation, and gaps

Article 10 does not stop at identifying bias. It requires examination of bias, appropriate measures to detect, prevent, and mitigate it, and explicit identification of data gaps or shortcomings.

That is stricter than many providers are ready for.

A provider cannot credibly say, “we know the data is imperfect, but the model performs well overall.” Article 10 pushes you to ask a harder question: performs well for whom, in which context, under which assumptions, and with which residual weaknesses?

This is where Article 9 and Article 10 belong together. If the risk management system identifies discrimination, representativeness, or data drift risks, Article 10 is one of the places where those risks must actually be addressed.
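
In practice, that harder question usually turns into a per-subgroup evaluation rather than a single aggregate score. Here is a minimal sketch, assuming a binary classifier and a logged group attribute; the metric choices and the toy records are illustrative, not requirements from the Act.

```python
from collections import defaultdict

def subgroup_rates(records):
    """records: dicts with 'group', 'y_true', 'y_pred' (0/1); returns per-group error rates."""
    stats = defaultdict(lambda: {"n": 0, "fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for r in records:
        s = stats[r["group"]]
        s["n"] += 1
        if r["y_true"] == 1:
            s["pos"] += 1
            s["fn"] += (r["y_pred"] == 0)
        else:
            s["neg"] += 1
            s["fp"] += (r["y_pred"] == 1)
    return {
        g: {
            "false_positive_rate": s["fp"] / s["neg"] if s["neg"] else None,
            "false_negative_rate": s["fn"] / s["pos"] if s["pos"] else None,
            "support": s["n"],
        }
        for g, s in stats.items()
    }

sample = [
    {"group": "A", "y_true": 0, "y_pred": 1},
    {"group": "A", "y_true": 1, "y_pred": 1},
    {"group": "B", "y_true": 0, "y_pred": 0},
    {"group": "B", "y_true": 1, "y_pred": 0},
]
print(subgroup_rates(sample))
```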

Representativeness is contextual, not generic

Paragraphs 3 and 4 are easy to underestimate if you read them too fast.

The law does not ask for some universal notion of representative data. It asks for data that is representative in view of the intended purpose and in light of the real setting where the system will be used.

That changes the practical analysis.

A creditworthiness model intended for consumers in one Member State cannot be defended with a vague claim that the data is large. The provider needs to consider whether the data reflects the actual population, legal context, behavioural patterns, and product environment relevant to that use case.

A recruitment tool intended for public-sector hiring cannot rely on training data shaped mainly by private-sector hiring histories and then assume the difference does not matter.

A municipal risk scoring system cannot rely on data that ignores local socio-economic context and then claim the model is neutral because the code is neutral.

That is why Article 10(4) is so useful. It cuts through the lazy defense that “the dataset is industry standard.” Industry standard is not the legal standard. Context fit is.
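
One concrete way to test context fit is to compare the composition of the training data against the composition expected in the intended deployment setting, dimension by dimension. A minimal sketch; the region names and shares are hypothetical placeholders, not real statistics.

```python
# Training-set shares vs. shares expected in the intended deployment context (hypothetical numbers).
training_share   = {"region_north": 0.62, "region_south": 0.30, "region_islands": 0.08}
deployment_share = {"region_north": 0.41, "region_south": 0.47, "region_islands": 0.12}

gaps = {
    k: round(training_share.get(k, 0.0) - deployment_share[k], 3)
    for k in deployment_share
}
print(gaps)  # e.g. {'region_north': 0.21, ...} -> over- or under-representation to justify or remediate
```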

If you are still unsure whether your use case falls into the high-risk perimeter, use the risk assessment tool and cross-check the relevant categories in Annex III.

Article 10 is not a free pass to process sensitive data

Paragraph 5 is one of the most misunderstood parts of the article.

Yes, the AI Act allows providers, in exceptional cases, to process special categories of personal data for bias detection and correction. But the provision is narrow on purpose. It only applies where this is strictly necessary, and only where the objective cannot be effectively achieved with other data, including synthetic or anonymised data.

The paragraph attaches six conditions in total. Beyond that necessity test, state-of-the-art security and privacy-preserving measures must be in place, including technical limits on re-use of the data. Access must be strictly controlled and documented. The data must not be transmitted, transferred, or otherwise accessed by other parties. It must be deleted once the bias has been corrected or its retention period ends. And the records of processing activities must explain why using these special categories was strictly necessary.

So the practical message is simple: paragraph 5 is an exception, not a convenience clause.

If you need sensitive data to test whether your system disadvantages certain groups, document that necessity carefully. If you do not need it, do not reach for it casually. Article 10 is trying to make bias correction possible without creating a back door for sloppy or excessive processing.
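
If a provider does rely on paragraph 5, those conditions translate naturally into a record that travels with the data. A minimal sketch; the field names and the example values are assumptions for illustration, not wording from the Act.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class SpecialCategoryProcessingRecord:
    purpose: str                        # must be bias detection and correction
    necessity_justification: str        # why synthetic or anonymised data was not sufficient
    access_roles: list[str]             # who may access the data, under which controls
    transmitted_to_third_parties: bool  # must remain False
    deletion_due: date                  # delete once the bias-correction purpose is achieved

record = SpecialCategoryProcessingRecord(
    purpose="detect disparate false-positive rates across protected groups",
    necessity_justification="no anonymised or synthetic proxy reproduces the relevant subgroup structure",
    access_roles=["bias-audit lead", "DPO"],
    transmitted_to_third_parties=False,
    deletion_due=date(2026, 6, 30),
)
print(record)
```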

Post-market learning changes the Article 10 conversation

Article 10 is written as a data governance provision, but in practice it cannot stay frozen at the development stage.

Why? Because once a high-risk system is deployed, real-world operation reveals things lab conditions miss. Populations shift. Behaviour changes. Inputs become noisier. Users rely on outputs in unexpected ways. Feedback loops appear.

That is why the strongest providers do not treat Article 10 as a one-off dataset memo. They connect it to:

  1. Article 9 risk management
  2. Article 12 record-keeping
  3. Article 13 information for deployers
  4. Article 72 post-market monitoring

If post-market evidence shows that certain groups are underrepresented, certain inputs are unstable, or certain deployment contexts create worse outcomes than expected, the provider should reopen the Article 10 logic. Data governance is not finished just because the first model version shipped.
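
A simple way to operationalise that reopening is to compare the input distributions the model was trained on with what the deployed system actually sees. Here is a minimal drift sketch using the population stability index; the bins, shares, and review threshold are illustrative choices, not requirements from the Act.

```python
import math

def population_stability_index(expected, actual):
    """expected/actual: dicts of bin -> share (each summing to roughly 1.0)."""
    psi = 0.0
    for b in expected:
        e = max(expected[b], 1e-6)
        a = max(actual.get(b, 0.0), 1e-6)
        psi += (a - e) * math.log(a / e)
    return psi

train_bins = {"18-30": 0.35, "31-50": 0.45, "51+": 0.20}  # shares at training time
live_bins  = {"18-30": 0.22, "31-50": 0.48, "51+": 0.30}  # shares observed in production

psi = population_stability_index(train_bins, live_bins)
print(f"PSI = {psi:.3f}")  # a common rule of thumb flags values above roughly 0.2 for review
```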

What providers should do now

If you are placing a high-risk AI system on the market, a practical Article 10 program usually includes six workstreams.

  1. Map the full data chain. Document origin, collection logic, preparation steps, and ownership for training, validation, and test data.
  2. Document assumptions explicitly. Write down what each dataset is supposed to measure, where the proxies are, and where those proxies could fail.
  3. Test representativeness against the actual deployment context. Not a generic benchmark, the real intended purpose, user group, geography, and operating environment.
  4. Run structured bias analysis. Look for discrimination risks, subgroup weaknesses, and feedback loops, especially where outputs may influence future inputs.
  5. Track gaps and remediation. Article 10 expects providers to identify data shortcomings and explain how they will be addressed.
  6. Connect Article 10 to lifecycle governance. If your Article 10 file is disconnected from monitoring, incident handling, or version review, it will age badly.
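
Pulled together, those workstreams can live in a single Article 10 evidence file that is versioned alongside the model. A minimal outline, sketched as a Python dictionary; the section names are an illustrative structure, not a template mandated by the Act.

```python
# Illustrative outline of an Article 10 evidence file; section names are an assumption.
article_10_file = {
    "system": {"name": "example-high-risk-system", "version": "1.2.0"},
    "data_chain": ["source register", "collection logic", "preparation steps", "ownership"],
    "assumptions": ["what each dataset measures", "proxies used", "known failure modes"],
    "representativeness": ["intended purpose", "user group", "geography", "operating environment"],
    "bias_analysis": ["subgroup metrics", "detection measures", "mitigation measures", "feedback loops"],
    "gaps_and_remediation": ["identified shortcomings", "planned fixes", "owners", "deadlines"],
    "lifecycle_links": ["post-market monitoring", "incident handling", "version review triggers"],
}

for section, contents in article_10_file.items():
    print(f"{section}: {contents}")
```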

That provider-side discipline also makes deployer conversations easier. A deployer trying to meet Article 26 or perform a fundamental rights impact assessment (FRIA) will need clear provider documentation. Thin Article 10 discipline usually produces thin Article 13 documentation, which then becomes a downstream compliance problem for everyone.

Where organizations usually get Article 10 wrong

The first mistake is treating Article 10 like a data quality slogan. The law is asking for documented governance, not vague confidence.

The second mistake is focusing only on training data. Validation data and test data matter too, and for non-training systems testing data may be the central legal hook.

The third mistake is confusing scale with representativeness. A huge dataset can still be badly misaligned with the intended purpose.

The fourth mistake is talking about fairness at model level while ignoring bias introduced by annotation, proxy design, or historical labels.

The fifth mistake is assuming synthetic data solves everything. Sometimes it helps. Sometimes it hides the problem. Article 10 does not let providers stop the analysis there.
