Step 1: Define an objective (purpose) for the AI system
The principle
An AI system based on the exploitation of personal data must be developed with a "purpose", i.e. a well-defined objective.
This makes it possible to frame and limit the personal data that can be used for training, so as not to store and process unnecessary data.
This objective must be determined, i.e. established as soon as the project is defined. It must also be explicit, i.e. known and understandable. Finally, it must be legitimate, i.e. compatible with the organization's tasks. It is sometimes objected that the requirement to define a purpose is incompatible with the training of AI models, which may develop unanticipated characteristics. The CNIL considers that this is not the case, and that the requirement to define a purpose must be adapted to the context of AI without, however, disappearing, as the following examples show.
In practice
There are three types of situations:
You clearly know what the operational use of your AI system will be
In this case, this objective will be the purpose of the development phase as well as the deployment phase.
Example:
An organisation sets up a database of photographs of train wagons in service – i.e. with persons present – in order to train an algorithm to measure the crowding and occupancy of trains in stations. The purpose in the development phase is determined, explicit and legitimate in relation to the identified operational use.
However, this is more complex when you are developing a general purpose AI system that can be used in various contexts and applications, or when your system is developed for scientific research purposes.
For general purpose AI systems
Example:
An organisation may set up a dataset for training an image classification model (persons, vehicles, food, etc.) and make it publicly available, without any specific operational use being foreseen when developing the model.
You cannot define the purpose too broadly as, for example, "the development and improvement of an AI system". You will need to be more precise and refer to:
the "type" of system developed, such as, for example, the development of a large language model, a computer vision system or a generative AI system for images, videos, sounds, computer code, etc.;
the technically feasible functionalities and capabilities.
Good practice:
You can give even more details about the objective pursued, for example by determining:
the foreseeable capacities most at risk;
the functionalities excluded by design;
the conditions of use of the AI system: the known use cases of the solution or the conditions of use (dissemination of the model in open source, marketing, availability through SaaS or API, etc.).
For AI systems developed for scientific research purposes
Example:
The development of an AI system for a proof of concept intended to demonstrate the robustness of machine learning requiring less training data, in a documented scientific approach intended for publication, could be regarded as pursuing scientific research purposes.
You can define a less detailed objective, given the difficulties of defining it precisely from the beginning of your work. You can then provide additional information to clarify this goal as your project progresses.
For more information, see how-to sheet 2.
Step 2: Determine your responsibilities
The principle
If you use personal data for the development of AI systems, you need to determine your role and responsibilities within the meaning of the GDPR. You can be:
controller: you determine the purposes and means, i.e. you decide on the "why" and "how" of the use of personal data. If one or more other bodies decide with you on these elements, you will be joint controllers for the processing and will have to define your respective obligations (e.g. through a contract).
processor (subcontractor): you process data on behalf of an actor who is the "controller". In this case, the latter must ensure that you comply with the GDPR and that you process the data only on its instructions: the law then provides for the conclusion of a data processing contract.
In practice
The European AI Act defines several roles:
an AI system provider, developing or having developed a system and placing it on the market or putting it into service under its own name or trademark, whether for a fee or free of charge;
importers, distributors and users (also known as deployers) of these systems.
Your degree of responsibility depends on a case-by-case analysis. For example:
If you are a provider at the initiative of the development of an AI system and you build the training dataset from data you have selected on your own behalf, you can be qualified as a controller.
If you are building the training dataset of an AI system with other controllers, for a purpose that you have defined together, you can be qualified as joint controllers.
If you are an AI system provider, you can be a processor if you are developing a system on behalf of one of your customers. The customer will be the controller if it determines not only the purpose but also the means, i.e. the techniques to be used. If it only gives you a goal to achieve and you design the AI system yourself, you are the controller.
If you are an AI system provider, you can use a service provider to collect and process the data according to your instructions. The service provider will then be your processor. This is the case, for example, of a provider that has to set up a training dataset for an AI system provider that tells it precisely how it has to be built.
For more information, see how-to sheet 3.
For the rest:
If you are a controller, all the following steps are of direct concern to you, and you are responsible for ensuring compliance.
If you are a processor, your main obligations are as follows:
Ensure that a data processing contract has been concluded and that it complies with the regulations;
Strictly follow the instructions of the controller and do not use the personal data for anything else;
Strictly ensure the security of the data processed;
Assess compliance with the GDPR at your level (see next steps) and alert the controller if you feel there is a problem.
Step 3: Define the "legal basis" that allows you to process personal data
The principle
The development of AI systems using personal data will need to rely on a legal basis that allows you to process this data. The GDPR lists six possible legal bases: consent, compliance with a legal obligation, the performance of a contract, the performance of a task carried out in the public interest, the safeguarding of vital interests, and the pursuit of a legitimate interest.
Depending on the legal basis chosen, your obligations and the rights of individuals may vary, which is why it is important to determine it upstream and indicate it in the data privacy policy.
In practice
You need to ask yourself which legal basis is the most appropriate for your situation.
If you collect data directly from individuals and they are free to accept or refuse without suffering harm (such as having to give up the service), consent is often the most appropriate legal basis. According to the law, it must be freely given, specific, informed and unambiguous.
Gathering consent, however, is often impossible in practice for dataset creation. For example, when you collect data accessible online or reuse an open source database, without direct contact with the data subjects, other legal bases will generally be more suitable:
Private actors will have to analyse whether they comply with the conditions in order to rely on legitimate interest. To do so, they must justify three conditions:
the interest pursued is legitimate, that is to say lawful, precisely and genuinely defined;
it must be possible to establish that the personal data are really necessary for the training of the system, because it is not possible to use only data which do not relate to natural persons or anonymised data;
the use of such personal data must not lead to a "disproportionate interference" with the privacy of individuals. This is assessed on a case-by-case basis, depending on what is revealed by the data used, which may be more or less private or sensitive, and on what is done with the data.
Please note: a how-to sheet specific to the legal basis of legitimate interest will be published shortly.
Public actors must verify whether the processing is in line with their public interest mission as provided for by a text (e.g. a law, decree, etc.) and whether it contributes to it in a relevant and appropriate way.
Example: the French pôle d'expertise de la régulation numérique (PEReN) is authorised on this basis to reuse publicly available data to carry out experiments aimed in particular at designing technical tools for the regulation of operators of online platforms.
The legal bases of the contract and of the legal obligation may be used more exceptionally, if you demonstrate that your processing is necessary for the performance of a contract or of pre-contractual measures, or for compliance with a (sufficiently precise) legal obligation to which you are subject.
For more information, see how-to sheet 4.
Step 4: Check if I can re-use certain personal data
The principle
If you plan to re-use a dataset that contains personal data, make sure it is lawful. That depends on the method of collection and the source of the data in question. You, as a controller (see "Step 2: Determine your responsibilities"), must carry out certain additional checks to ensure that such use is lawful.
In practice
The rules will depend on the situation:
The provider reuses data that it has already collected itself
You may want to re-use the data you originally collected for another purpose. In this case, if you had not foreseen and informed the data subjects about this re-use, you should check that this new use is compatible with the original purpose, unless you are authorised by the data subjects (they have consented) or by a text (e.g. a law, decree, etc.).
You must carry out what is known as a "compatibility test", which must take into account:
the existence of a link between the initial objective and that of building a dataset for training an AI system;
the context in which the personal data were collected;
the type and nature of the data;
the possible consequences for the persons concerned;
the existence of appropriate safeguards (e.g. pseudonymisation of data).
Please note: if you wish to re-use data for statistical or scientific research purposes, the processing is presumed to be compatible with the original purpose. No compatibility test is therefore necessary in this case.
The provider re-uses publicly available data (open source)
In this case, you need to make sure that you are not re-using a dataset whose constitution was manifestly unlawful (e.g. resulting from a data leak). A case-by-case analysis must be carried out.
The CNIL recommends that re-users check and document (for example, in the data protection impact assessment) the following:
the description of the dataset mentions its source;
the establishment or dissemination of the dataset is not manifestly the result of a crime or misdemeanour, and has not been the subject of a public conviction or sanction by a competent authority involving a removal or prohibition of exploitation;
there is no glaring doubt about the lawfulness of the dataset, ensuring in particular that the conditions for data collection are sufficiently documented;
the dataset does not contain sensitive data (e.g. health data or data revealing political opinions) or data relating to criminal offences or, if it does, it is recommended to carry out additional checks to ensure that such processing is lawful.
The body that uploaded the dataset is supposed to have ensured that the publication complied with the GDPR, and is responsible for it. However, you do not have to verify that the bodies that set up and disseminated the dataset have complied with all the obligations laid down in the GDPR: the CNIL considers that the four verifications mentioned above are generally sufficient to allow the re-use of the dataset for the training of an AI system, provided that the other CNIL recommendations are complied with. If you receive information, especially from people whose data is contained in the dataset, that highlights problems with the lawfulness of the data used, you will need to investigate further.
The provider reuses data acquired from a third party (data brokers, etc.)
For the third party sharing personal data, sometimes for remuneration, there are two types of situations:
Either the third party collected the data for the purpose of building a dataset for AI system training. It must then ensure that the data transmission processing complies with the GDPR (definition of an explicit and legitimate objective, requirement of a legal basis, information to individuals and management of the exercise of their rights, etc.).
Or the third party did not initially collect the data for that purpose. It must then ensure that the transmission of those data pursues an objective compatible with the one that justified their collection. It will therefore have to carry out the "compatibility test" described above.
The re-user of the data has several obligations:
It must ensure that it is not re-using a manifestly unlawful dataset, by carrying out the same checks as those set out in the section above. The conclusion of an agreement between the original data holder and the re-user is recommended in order to facilitate these verifications.
In addition to those checks, it must ensure its own compliance with the GDPR in the processing of that data.
For more information, see how-to sheet 4.
Step 5: Minimize the personal data I use
The principle
The personal data collected and used must be adequate, relevant and limited to what is necessary in the light of the objective defined: this is the principle of data minimisation. You must respect this principle and apply it rigorously when the data processed is sensitive (data concerning health, data concerning sex life, religious beliefs or political opinions, etc.).
In practice
The method to be used
You should focus on the technique that achieves the desired result (or a result of the same order) using as little personal data as possible. In particular, the use of deep learning should therefore not be systematic.
The choice of the learning protocol used may, for example, make it possible to limit access to the data to authorised persons only, or to give access only to encrypted data.
Selection of strictly necessary data
The principle of minimisation does not prohibit the training of an algorithm with very large volumes of data, but it implies:
reflecting upstream in order to use only the personal data that are useful for the development of the system; and
subsequently implementing the technical means to collect only those data.
The validity of design choices
In order to validate the design choices, it is recommended as a good practice to:
conduct a pilot study, i.e. carry out a small-scale experiment. Fictitious, synthetic or anonymised data may be used for this purpose (see the sketch after this list);
consult an ethics committee (or an "ethical advisor"). This committee must ensure that ethical issues and the protection of the rights and freedoms of individuals are properly taken into account. It can thus issue opinions on all or part of the organisation's projects, tools, products, etc. likely to raise ethical issues.
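As an illustration of the pilot-study recommendation above, the following minimal sketch (not taken from the CNIL guidance; the column names, sizes and libraries are assumptions) generates purely fictitious data with which a candidate training pipeline can be validated before any personal data are collected.

```python
# Minimal sketch: fictitious data for a small-scale pilot study, so that design
# choices can be validated without touching personal data. Illustrative only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n_samples = 500  # deliberately small: a pilot, not the final training set

pilot_df = pd.DataFrame({
    # purely synthetic features standing in for the real (personal) attributes
    "feature_1": rng.normal(loc=0.0, scale=1.0, size=n_samples),
    "feature_2": rng.integers(low=0, high=5, size=n_samples),
    # synthetic target used to check that the training pipeline runs end to end
    "label": rng.integers(low=0, high=2, size=n_samples),
})

# The pilot dataset can now be fed to the candidate training pipeline to compare
# techniques (e.g. a simple model vs. deep learning) before deciding which
# personal data, if any, are strictly necessary.
print(pilot_df.describe())
```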
The organisation of the collection
You must ensure that the data collected are relevant in view of the objectives pursued. Several steps are strongly recommended:
Data cleaning: this step allows you to build a quality dataset and thus strengthen the integrity and relevance of the data by reducing inconsistencies, as well as the cost of learning (see the sketch after this list).
Identification of the relevant data: this step aims to optimise system performance while avoiding under- and over-fitting. In practice, it allows you to make sure that classes or categories that are unnecessary for the task at hand are not represented, that the proportions between the different classes of interest are well balanced, etc. This procedure also aims to identify data that are not relevant for learning (which will then have to be removed from the dataset).
Implementation of measures to incorporate the principles of personal data protection by design: this step allows you to apply data transformations (such as generalisation and/or randomisation measures, data anonymisation, etc.) to limit the impact on people.
Monitoring and updating of data: minimisation measures may become obsolete over time. The data collected could lose their accurate, relevant, adequate and limited character, due to a possible drift of the data, an update of the data or technical developments. You will therefore have to conduct a regular analysis to ensure the follow-up of the constituted dataset.
Documentation of the data used for the development of an AI system: this allows you to guarantee the traceability of the datasets used, which their large size can make difficult. You must keep this documentation up to date as the dataset evolves. The CNIL provides a model of documentation here.
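To make the first steps of this list more concrete, here is a minimal sketch of data cleaning, relevance filtering and pseudonymisation applied to a hypothetical tabular dataset before training. It is an illustration, not the CNIL's model: the column names ("user_email", "age", "label") and class names are assumptions made for the example.

```python
# Minimal sketch of a few minimisation steps on a hypothetical tabular dataset.
import hashlib
import pandas as pd

def prepare_training_data(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # 1. Data cleaning: drop duplicates and rows with missing values.
    df = df.drop_duplicates().dropna(subset=["age", "label"])

    # 2. Relevance: keep only the classes actually needed for the objective pursued.
    df = df[df["label"].isin(["class_a", "class_b"])]

    # 3. Privacy by design: pseudonymise the direct identifier before training.
    #    (Hashing alone is pseudonymisation, not anonymisation.)
    df["user_id"] = df["user_email"].apply(
        lambda v: hashlib.sha256(v.encode("utf-8")).hexdigest()
    )
    df = df.drop(columns=["user_email"])

    # 4. Check the balance between the classes of interest (to be documented).
    print(df["label"].value_counts(normalize=True))
    return df
```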
For more information, see how-to sheets 6 and 7.
Step 6: Set a retention period
The principle
Personal data cannot be kept indefinitely. The GDPR requires you to define a period of time after which data must be deleted or, in some cases, archived. You must determine this retention period according to the purpose that led to the processing of these data.
In practice
You must set a retention period for the data used for the development of the AI system:
For the development phase: data retention needs to be pre-planned and monitored over time (see the sketch at the end of this step). Data subjects must be informed of the data retention period (e.g. in the information notices);
For the maintenance or improvement of the product: where the data no longer need to be accessible for the day-to-day tasks of the persons in charge of the development of the AI system, they should in principle be deleted. However, they can be kept for the maintenance of the product or its improvement if guarantees are implemented (partitioned storage, restriction of access to authorised persons only, etc.).
Please note: the retention of training data can allow audits to be carried out and facilitate the measurement of certain biases. In such cases, prolonged retention of data may be justified, unless the retention of general information on the data is sufficient (e.g. documentation as proposed in the Documentation section, or information on the statistical distribution of the data). Such retention must be limited to the necessary data and be accompanied by enhanced security measures.
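One way of pre-planning and monitoring retention, sketched below as an assumption rather than a CNIL requirement, is to tag each record with its collection date and purge records once the period defined for the development phase has elapsed. The 24-month period and column name are illustrative.

```python
# Minimal sketch: purging records whose retention period has expired.
from datetime import datetime, timedelta, timezone
from typing import Optional
import pandas as pd

RETENTION = timedelta(days=365 * 2)  # illustrative period chosen for the development phase

def purge_expired(df: pd.DataFrame, now: Optional[datetime] = None) -> pd.DataFrame:
    """Remove rows whose 'collected_at' timestamp is older than the retention period."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - RETENTION
    kept = df[pd.to_datetime(df["collected_at"], utc=True) >= cutoff]
    # Log how many records were deleted so the purge can be monitored over time.
    print(f"purged {len(df) - len(kept)} expired records out of {len(df)}")
    return kept
```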
For more information, see how-to sheet 7.
Step 7: Carry out a Data Protection Impact Assessment (DPIA)
The principle
The Data Protection Impact Assessment (DPIA) is an approach that allows you to map and assess the risks of processing on personal data protection, and to establish an action plan to reduce them to an acceptable level. In particular, it will lead you to define the security measures needed to protect the data.
In practice
Carrying out a DPIA for the development of AI systems
It is strongly recommended to carry out a DPIA for the development of your AI system, especially when two of the following criteria are met:
sensitive data are collected;
personal data are collected on a large scale;
data of vulnerable persons (minors, persons with disabilities, etc.) are collected;
datasets are crossed or combined;
new technological solutions are implemented or innovative use is made of them.
In addition, if significant risks exist (e.g. data misuse, data breach or discrimination), a DPIA must be carried out even if two of the above criteria are not met.
To help carry out a DPIA, the CNIL makes available its dedicated open source PIA software.
The risk criteria introduced by the European AI Act
The CNIL considers that, for the development of high-risk systems covered by the AI Act and involving personal data, the performance of a DPIA is in principle necessary.
Please note: completion of the DPIA may be based on the documentation required by the AI Act, provided that it includes the elements provided for in the GDPR (Article 35 GDPR).
The scope of the DPIA
There are two types of situations for the provider of an AI system, depending on the purpose of the AI system (see "Step 1: Define an objective (purpose) for the AI system").
You clearly know what the operational use of your AI system will be
It is recommended to carry out a general DPIA for the whole life cycle, which includes the development and deployment phases. Please note that if you are not the user/deployer of the AI system, it is this actor that will be responsible for carrying out the DPIA for the deployment phase (although it may be based on the DPIA model you have proposed).
If you are developing a general purpose AI system
You will only be able to carry out a DPIA during the development phase. This DPIA should be provided to the users of your AI system to enable them to conduct their own analysis.
AI risks to be considered in a DPIA
Processing of personal data based on AI systems presents specific risks that you must take into account:
the risks related to the confidentiality of the data that can be extracted from the AI system;
the risks to data subjects linked to misuse of the data contained in the training dataset (by your employees who have access to it, or in the event of a data breach);
the risk of automated discrimination caused by a bias in the AI system introduced during development;
the risk of producing false or fictitious content about a real person, in particular in the case of generative AI systems;
the risk of automated decision-making when the staff member using the system is unable to verify its performance in real conditions or to take a decision contrary to the one provided by the system without detriment (due to hierarchical pressure, for example);
the risk of users losing control over their data published and freely accessible online;
the risks related to known attacks specific to AI systems (e.g. data poisoning attacks);
the systemic and serious ethical risks related to the deployment of the system.
Actions to be taken based on the results of the DPIA
Once the level of risk has been determined, your DPIA must provide for a set of measures to reduce it and keep it at an acceptable level, for example:
security measures (e.g. homomorphic encryption or the use of a secure execution environment);
minimisation measures (e.g. use of synthetic data);
anonymisation or pseudonymisation measures (e.g. differential privacy; see the sketch below);
data protection by design measures (e.g. federated learning);
measures facilitating the exercise of rights or redress for individuals (e.g. machine unlearning techniques, explainability and traceability measures regarding the outputs of the system, etc.);
audit and validation measures (e.g. fictitious attacks).
Other, more generic measures may also be applied: organisational measures (management and limitation of access to the datasets which may allow a modification of the AI system, etc.), governance measures (establishment of an ethics committee, etc.), and measures for the traceability of actions or of internal documentation (information charter, etc.).
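As one illustration of the anonymisation or pseudonymisation measures mentioned above, the following minimal sketch applies the Laplace mechanism from differential privacy to a counting query over training data. The epsilon value, the query and the data are assumptions made for the example, not part of the CNIL guidance.

```python
# Minimal sketch: the Laplace mechanism (differential privacy) for a counting query.
import numpy as np

def dp_count(values: np.ndarray, predicate, epsilon: float = 1.0) -> float:
    """Differentially private count of records satisfying `predicate`.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = float(np.sum(predicate(values)))
    noise = np.random.default_rng().laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: publish how many training records have an age over 60 without
# revealing whether any single individual is present in the dataset.
ages = np.array([23, 45, 67, 71, 34, 58, 62])
print(dp_count(ages, lambda a: a > 60, epsilon=0.5))
```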
For more information, see how-to sheet 5.
The CNIL is continuing its work to help providers of AI systems.
It will soon publish new how-to sheets explaining how to design and train models in compliance with the GDPR: retrieval of data on the internet (web scraping); use of legitimate interest as a legal basis; exercise of the rights of access, rectification and erasure; whether or not to use open licences, etc.
These how-to sheets will be subject to public consultation.
Read more
The full CNIL recommendation on the deployment of AI systems
All CNIL content on artificial intelligence