What the RBI FREE-AI Framework Actually Asks of You — And What Most Lenders Are Getting Wrong

When the RBI released the FREE-AI Committee report on 13 August 2025 — seven Sutras, six strategic pillars, twenty-six recommendations, chaired by Dr. Pushpak Bhattacharyya — most lenders filed it the way they file every framework: as a document to be summarised, mapped to a policy register, and reported as "addressed" in the next board pack. That instinct is the first mistake, and it is the one this piece is about.

A clarification before we go further, because it matters for how you read this. FREE-AI's official architecture organises its work under six strategic pillars — Infrastructure, Policy, Capacity under innovation enablement; Governance, Protection, Assurance under risk mitigation — operationalised through twenty-six recommendations. The six dimensions this article is structured around — Fairness, Reliability, Explainability, Ethics, Accountability, Inclusivity — are the responsible-AI obligations the framework embeds through its Sutras and its Governance and Assurance pillars. They are what the framework actually asks of a credit decisioning function in operational terms. If your reading of FREE-AI stopped at the pillar names, you have read the table of contents, not the requirement.

The thesis is simple and uncomfortable: FREE-AI is not a policy checklist. It is an operational governance requirement that demands demonstrable, auditable, continuously monitored controls. The recommendations are not yet binding regulation, but they telegraph the supervisory posture you will be examined against. "We have a policy" is the answer that fails. "Here is the control, here is the evidence it operated, here is the immutable record" is the answer that passes. Below is what each dimension actually requires, where institutions are getting it wrong, and what technically adequate looks like.

1. Fairness — Stop Reporting Portfolio-Level Approval Rates

What FREE-AI requires. The framework asks that AI systems be designed and tested to promote fairness and equity. In a credit context, this is not a sentiment — it is a measurement obligation. Fairness has to be demonstrable at the level where disparate outcomes actually occur, which is the segment, not the portfolio.

The common gap. Almost every lender reports fairness as a portfolio-level approval rate: "we approved 62% of applications this quarter." This number passes no fairness test whatsoever. A 62% blended approval rate is mathematically consistent with approving 85% of one population segment and 30% of another at identical risk. Aggregate KPIs are designed to hide exactly the variance that a fairness assessment is supposed to surface. Reporting them as evidence of fairness is not weak evidence — it is the absence of evidence dressed as its presence.

What adequate looks like. Cluster-level outcome monitoring with threshold analysis. Segment the applicant base using an unsupervised method — a Self-Organising Map followed by K-Means is a defensible choice because it produces stable, interpretable clusters independent of the label. Within each cluster, compute the selection rate, then the adverse impact ratio against the most-favoured cluster, and flag any ratio below 0.8 (the four-fifths threshold). That is necessary but not sufficient. The harder requirement is threshold analysis: demonstrate that the score-to-decision cutoff is calibrated within each segment, so that an applicant scoring 0.7 in cluster A and an applicant scoring 0.7 in cluster B carry the same true default probability and receive the same decision. Miscalibration across segments produces disparate impact even when the model and the threshold look uniform. You are auditing the decision boundary's behaviour per segment, not the model's average accuracy.
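The cluster-level check described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes cluster assignments already exist (the SOM and K-Means step is not shown), and the function names, cluster labels, and sample counts are hypothetical. The sample data is chosen so that the blended approval rate is exactly 62% while one cluster is approved at 85% and the other at 30%.

```python
from collections import defaultdict

FOUR_FIFTHS = 0.8  # adverse-impact threshold named in the text


def selection_rates(decisions):
    """Approval rate per cluster from (cluster_id, approved) pairs."""
    approved, total = defaultdict(int), defaultdict(int)
    for cluster, ok in decisions:
        total[cluster] += 1
        approved[cluster] += int(ok)
    return {c: approved[c] / total[c] for c in total}


def adverse_impact_ratios(rates):
    """Each cluster's selection rate divided by the most-favoured cluster's."""
    best = max(rates.values())
    if best == 0:
        raise ValueError("no approvals in any cluster; ratio undefined")
    return {c: r / best for c, r in rates.items()}


def flagged_clusters(rates, threshold=FOUR_FIFTHS):
    """Clusters whose adverse impact ratio falls below the threshold."""
    ratios = adverse_impact_ratios(rates)
    return sorted(c for c, r in ratios.items() if r < threshold)


def default_rate_in_band(outcomes, lo, hi):
    """Observed default rate per cluster for scores in [lo, hi).

    outcomes: (cluster_id, score, defaulted) triples for booked loans.
    A large gap between clusters in the same score band is a sign that
    the score is not calibrated equally across segments.
    """
    bad, n = defaultdict(int), defaultdict(int)
    for cluster, score, defaulted in outcomes:
        if lo <= score < hi:
            n[cluster] += 1
            bad[cluster] += int(defaulted)
    return {c: bad[c] / n[c] for c in n}


# Illustrative data: 320 applicants in A (272 approved), 230 in B (69 approved).
decisions = (
    [("A", True)] * 272 + [("A", False)] * 48
    + [("B", True)] * 69 + [("B", False)] * 161
)
rates = selection_rates(decisions)
blended = sum(ok for _, ok in decisions) / len(decisions)
flags = flagged_clusters(rates)
```

With this data the portfolio figure looks unremarkable, yet cluster B's ratio against cluster A is roughly 0.35 and is flagged. The band-level default-rate function is only a first-pass calibration probe; a fuller check would compare calibration curves per cluster across the whole score range.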

2. Reliability — Governance Drift Is Not Model Drift

What FREE-AI requires. The Sutra on safety and resilience asks that AI systems behave reliably over time. Lenders read "reliability" as "model monitoring" and stop there. That reading misses half the failure surface.

The common gap. Institutions monitor model drift — Population Stability Index on input features, the Kolmogorov-Smirnov statistic on the score distribution — and treat a stable PSI/KS as proof of reliability. It is not. A model can be statistically pristine while the decision policy it serves has quietly diverged from what was approved.

Define the missing concept. Governance drift is the divergence between the intended decision policy and the actual threshold behaviour in production over time. Formally: let the approved policy specify a decision function with threshold τ* for a given segment. The realised operating threshold τ_t in production evolves through manual override patterns, segment-specific cutoffs introduced by operations, champion-challenger swaps, hardcoded exceptions, and undocumented "temporary" relaxations that never get reverted. Governance drift is the accumulated distance between τ* and τ_t, measured per segment over time. It is structurally invisible to PSI and KS, because those statistics describe the data and the score, not the policy applied to the score. Your features can be perfectly stable, your score distribution unchanged, and your effective approval threshold for a particular segment can have moved 40 basis points of score because of override creep. PSI will read flat. The policy has failed.
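One way to write this definition down, as a sketch: index segments by s and monitoring periods by t, take the per-period gap between realised and approved threshold, and accumulate its magnitude. The use of an absolute difference and a plain sum are assumptions made here for illustration; the definition above specifies only an accumulated distance.

```latex
\delta_{s,t} = \tau_{s,t} - \tau^{*}_{s},
\qquad
D_{s}(T) = \sum_{t=1}^{T} \left| \delta_{s,t} \right|
```

Neither quantity depends on the feature distribution or the score distribution, which is why PSI and KS can stay flat while D_s(T) grows.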

What adequate looks like. Track the effective realised threshold per segment as a monitored time series, reconcile it against the approved configuration on every cycle, and alert when the divergence exceeds tolerance. Reliability monitoring under FREE-AI means watching both the model and the governance layer — and treating an unexplained gap between intended and executed policy as a reliability incident, not an ops footnote.
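A minimal sketch of that reconciliation follows. It assumes a higher score is better and that the policy approves when the score is at or above the cutoff; the estimator for the realised threshold (the observed score that leaves the fewest decisions on the wrong side) is one simple choice among several, and all names and figures are hypothetical.

```python
def realised_threshold(decisions):
    """Estimate the cutoff actually applied to a batch of decisions.

    decisions: list of (score, approved) pairs for one segment and period.
    Assumes approve-if-score >= cutoff. Returns the observed score that,
    used as a cutoff, leaves the fewest decisions on the wrong side.
    """
    candidates = sorted({score for score, _ in decisions})

    def misplaced(cutoff):
        return sum(
            (approved and score < cutoff) or (not approved and score >= cutoff)
            for score, approved in decisions
        )

    return min(candidates, key=misplaced)


def governance_drift(approved_threshold, history, tolerance):
    """Compare realised and approved thresholds period by period.

    history: {period_label: [(score, approved), ...]} for one segment.
    Returns a list of (period, realised, drift, breached) tuples.
    """
    report = []
    for period in sorted(history):
        realised = realised_threshold(history[period])
        drift = realised - approved_threshold
        report.append((period, realised, drift, abs(drift) > tolerance))
    return report


# Illustrative segment: approved cutoff 0.60, override creep by mid-year.
history = {
    "2025-01": [(0.50, False), (0.55, False), (0.60, True), (0.65, True), (0.70, True)],
    "2025-06": [(0.50, False), (0.55, True), (0.58, True), (0.60, True), (0.65, True)],
}
report = governance_drift(approved_threshold=0.60, history=history, tolerance=0.02)
```

In this example the score values themselves barely change between periods, so a distribution statistic would stay quiet, while the second period's realised cutoff of 0.55 breaches the tolerance. The quadratic scan over candidates is fine for a sketch but would need replacing for production volumes.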

3. Explainability — Documentation Is Not Decision-Level Explanation

What FREE-AI requires. The "Understandable by Design" Sutra, read alongside the RBI digital lending grievance-redressal expectations, requires two distinct things that lenders routinely conflate: model-level understanding and decision-level understanding.

The common gap. Institutions produce a model card, a development methodology, and a validation report, and treat that documentation as explainability. It is not. Documentation explains the model. It does not explain this applicant's decline. When a grievance arrives, a model card is useless.

What adequate looks like. Both global and local explainability, deployed for different audit scenarios. Global XAI — mean absolute SHAP feature importance across the population — answers "what drives this model in general." It is the right instrument for model validation, fairness diagnostics, and model risk management review. Local XAI — per-applicant SHAP attribution — answers "why was this applicant declined," decomposing the individual prediction into signed contributions per feature. It is the only instrument that supports adverse-action reason codes, grievance redressal, and recourse. FREE-AI requires both because the audit scenarios are different: a model validation examination interrogates global behaviour; a consumer grievance examination interrogates a single local explanation. A lender holding only global SHAP can defend its model and not a single one of its decisions. A lender generating local SHAP on demand — recomputed against the current model rather than the one that made the decision — produces explanations that are confidently wrong. Both attributions must be produced, and the local one must be captured at decision time.
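The sketch below illustrates the global/local split and the capture-at-decision-time step. To stay self-contained it does not call the shap library; it uses a hypothetical linear score, for which SHAP values (with features treated as independent) reduce to weight times the difference between the applicant's value and the population mean. The weights, baseline, feature names, and helper functions are all assumptions made for illustration.

```python
import hashlib
import json

# Hypothetical linear score; weights, intercept and baseline are illustrative.
WEIGHTS = {"income": 0.4, "utilisation": -0.5, "tenure": 0.2}
INTERCEPT = 0.35
BASELINE = {"income": 0.5, "utilisation": 0.3, "tenure": 0.4}  # population means


def score(x):
    return INTERCEPT + sum(w * x[f] for f, w in WEIGHTS.items())


def local_attribution(x):
    """Signed per-feature contribution relative to the population baseline."""
    return {f: round(w * (x[f] - BASELINE[f]), 6) for f, w in WEIGHTS.items()}


def global_importance(population):
    """Mean absolute local attribution per feature across applicants."""
    n = len(population)
    return {
        f: sum(abs(local_attribution(x)[f]) for x in population) / n
        for f in WEIGHTS
    }


def reason_codes(attribution, top_n=2):
    """Features that pushed the score down the most, worst first."""
    negative = [(v, f) for f, v in attribution.items() if v < 0]
    return [f for v, f in sorted(negative)[:top_n]]


def explanation_hash(attribution):
    """SHA-256 over a canonical JSON encoding, stored at decision time."""
    payload = json.dumps(attribution, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


applicant = {"income": 0.2, "utilisation": 0.8, "tenure": 0.1}
attribution = local_attribution(applicant)   # persist this with the decision
codes = reason_codes(attribution)
digest = explanation_hash(attribution)       # persist this alongside it
```

The design point is the last two lines: the attribution and its hash are written when the decision is made. Recomputing `local_attribution` later against retrained weights would yield a different digest, which is exactly how a regenerated explanation gets exposed.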

4. Ethics — "We Have a Governance Policy" Fails the Audit

What FREE-AI requires. The framework's foundation Sutra is that trust must be earned, and trust in an examination is established by demonstrable controls operating, not by the existence of intent.

The common gap. The single most common failure in an ethics audit is the sentence "we have a governance policy." A policy is a statement of intent. An ethics audit does not test intent; it tests whether the control described by the policy actually operated, continuously, on the decisions in question. A policy with no operating evidence is, for examination purposes, indistinguishable from no policy.

What adequate looks like. "Demonstrable controls" has a precise technical meaning. First, version-controlled threshold configurations: every cutoff, every segment-specific rule, every override authority is stored as code, with a full history of who changed what value, when, and under whose approval. Second, automated alerts for cluster-level decision divergence: when a segment's outcome distribution moves outside tolerance, the system raises an incident without waiting for a human to notice. Third, immutable decision records: an append-only, tamper-evident store (write-once or hash-chained) that preserves what was decided and on what basis, so that no one can edit history retroactively. These three together let you answer the only question an ethics examiner cares about: not "what is your policy," but "show me the control firing on the decisions I select."
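The hash-chained variant of the third control can be illustrated with a short in-memory sketch. The class and field names are hypothetical, and a real deployment would sit on write-once storage with the chain head anchored somewhere the application cannot rewrite; this only shows why editing a past entry becomes detectable.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder previous-hash for the first entry


class HashChainedLog:
    """Append-only log in which each entry commits to the one before it."""

    def __init__(self):
        self._entries = []

    def append(self, record):
        prev = self._entries[-1]["entry_hash"] if self._entries else GENESIS
        body = json.dumps(record, sort_keys=True, separators=(",", ":"))
        entry_hash = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        self._entries.append(
            {"prev_hash": prev, "body": body, "entry_hash": entry_hash}
        )
        return entry_hash

    def verify(self):
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev = GENESIS
        for entry in self._entries:
            expected = hashlib.sha256(
                (prev + entry["body"]).encode("utf-8")
            ).hexdigest()
            if entry["prev_hash"] != prev or entry["entry_hash"] != expected:
                return False
            prev = entry["entry_hash"]
        return True


log = HashChainedLog()
log.append({"decision_id": "d-1", "outcome": "approve", "threshold": 0.60})
log.append({"decision_id": "d-2", "outcome": "decline", "threshold": 0.60})
log.append({"decision_id": "d-3", "outcome": "approve", "threshold": 0.60})
intact = log.verify()
```

Changing the stored body of the second entry after the fact, for example flipping "decline" to "approve", makes `verify()` return False unless every later hash is also recomputed, which is what the external anchor is there to prevent.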

5. Accountability — Every Decision Must Be Attributable

What FREE-AI requires. The Accountability Sutra holds the regulated entity responsible for every AI decision regardless of the system's autonomy. Operationally, that is a decision-attributability requirement: any single decision must be reconstructable from a complete, self-contained record.

The common gap. Fragmented, team-level records. The data science team holds model versions in a registry. Credit operations holds thresholds in a spreadsheet. The application platform holds the logs. Explanations are regenerated on request. Each team can answer for its own artifact, and no one can assemble the whole. When an examiner asks "reconstruct this decision," the institution discovers there is no join key linking the pieces, different retention windows have already destroyed some of them, and the regenerated explanation no longer matches what the applicant was shown. Distributed custody is not accountability; it is the structural guarantee of unaccountability.

What adequate looks like. A single immutable decision record carrying the full attribution tuple for every decision: (model_id, model_version, threshold_value, feature_vector_hash, explanation_hash, policy_id) — plus decision_id, timestamp, and consent reference. The hashes matter as much as the IDs: feature_vector_hash lets you prove the inputs were exactly those used, and explanation_hash lets you prove the explanation produced at decision time is byte-for-byte the one on record. With this tuple, any decision is attributable to the precise model version, the exact cutoff in force for that segment at that moment, the verified inputs, the verified explanation, and the governing policy. Without it, accountability is an aspiration the evidence cannot support.
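As a sketch of what such a record might look like in code, the example below uses a frozen dataclass carrying the fields named above. The field types, the canonical-JSON hashing scheme, and every identifier value are assumptions for illustration; the point is only that the hashes are computed once, at decision time, from the exact inputs and explanation used.

```python
import hashlib
import json
from dataclasses import dataclass


def canonical_hash(obj):
    """SHA-256 of a canonical JSON encoding (sorted keys, no whitespace)."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class DecisionRecord:
    decision_id: str
    timestamp: str            # ISO 8601, UTC
    model_id: str
    model_version: str
    threshold_value: float
    feature_vector_hash: str
    explanation_hash: str
    policy_id: str
    consent_ref: str


def build_record(decision_id, timestamp, model_id, model_version,
                 threshold_value, features, explanation, policy_id, consent_ref):
    """Create the record at decision time from the live inputs."""
    return DecisionRecord(
        decision_id=decision_id,
        timestamp=timestamp,
        model_id=model_id,
        model_version=model_version,
        threshold_value=threshold_value,
        feature_vector_hash=canonical_hash(features),
        explanation_hash=canonical_hash(explanation),
        policy_id=policy_id,
        consent_ref=consent_ref,
    )


def matches(record, features, explanation):
    """True if retrieved inputs and explanation are those on record."""
    return (
        canonical_hash(features) == record.feature_vector_hash
        and canonical_hash(explanation) == record.explanation_hash
    )


features = {"income": 0.2, "utilisation": 0.8, "tenure": 0.1}
explanation = {"income": -0.12, "utilisation": -0.25, "tenure": -0.06}
record = build_record(
    "d-000123", "2024-06-10T09:30:00Z", "pd-model", "3.2.1",
    0.60, features, explanation, "policy-17", "consent-8841",
)
```

At audit time, `matches(record, stored_features, stored_explanation)` either confirms the retrieved artifacts are the originals or shows that something was recomputed. The feature snapshot and explanation themselves still have to be retained; the hashes prove integrity, they do not replace storage.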

6. Inclusivity — Governed Multi-Signal, Not Score Supplementation

What FREE-AI requires. The framework treats inclusion as a first-order objective, but a governed one — extending credit reach without importing ungoverned risk.

The common gap. Credit score supplementation: bolting alternative signals onto the decision to lift approvals, with no governance attached. Adding telco, device, or bank-statement signals raises the approval rate, the inclusion KPI looks good, and the new bias vectors, consent provenance, and recourse gaps go unexamined. Adding signals is not the same as including people responsibly.

What adequate looks like. Governed multi-signal decision-making. Signals sourced through the Account Aggregator framework arrive with consent artifacts that are captured and linked to the decision record. Each new signal class is subjected to the same segment-level fairness and divergence monitoring as the core model — CFDI-style monitoring of decision divergence introduced by the augmented signals. And every alternative-data-driven decline carries a recourse path that is tracked: can the applicant contest the signal, correct it, and have the decision revisited? The distinction is governance. Score supplementation adds inputs and hopes. Governed multi-signal adds inputs, consent provenance, divergence monitoring, and recourse tracking — and can prove all four.
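The consent-provenance part of this can be expressed as a simple gate, sketched below. The `Consent` structure and its fields are hypothetical and are not the Account Aggregator consent-artifact schema; divergence monitoring and recourse tracking are not shown. The sketch only checks that every signal class used in a decision had a consent valid at the moment the decision was made.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Consent:
    consent_id: str
    signal_class: str      # e.g. "bank_statement", "telco"
    valid_from: datetime
    valid_to: datetime


def consent_gaps(signal_classes, consents, decided_at):
    """Signal classes used in a decision with no consent valid at that time."""
    covered = {
        c.signal_class
        for c in consents
        if c.valid_from <= decided_at <= c.valid_to
    }
    return sorted(set(signal_classes) - covered)


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


consents = [
    Consent("c-001", "bank_statement", utc(2024, 1, 1), utc(2024, 12, 31)),
    Consent("c-002", "telco", utc(2023, 1, 1), utc(2023, 12, 31)),  # expired
]
gaps = consent_gaps(
    ["bank_statement", "telco", "device"], consents, utc(2024, 6, 10)
)
```

Here the telco consent has expired and no device consent exists, so both are reported. A governed pipeline would refuse to score with those signals, or at minimum record the gap against the decision, rather than discovering it during an examination.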

The 14-Month Audit Scenario

Here is how this actually plays out. An examiner does not ask for your governance policy. The examiner selects one declined applicant from fourteen months ago and says: reconstruct this decision.

Fourteen months is not arbitrary. It crosses a model retraining cycle (the model live then is not the model live now), a threshold change (the cutoff for that segment was revised in month nine), and very likely a consent-regime change under the Digital Personal Data Protection (DPDP) Act. The evidence chain you must produce, in order: the decision_id; the model_id and the specific model_version that was live fourteen months ago, not the current champion; the exact feature vector, verified against feature_vector_hash to prove the inputs have not been recomputed from since-drifted data; the threshold_value in force for that applicant's segment at that timestamp, retrieved from version-controlled configuration history; the local explanation as generated at decision time, verified against explanation_hash; the policy_id that governed the decision; the consent artifact authorising the data used; and the grievance or recourse record if one exists.
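One link in that chain, retrieving the threshold that was in force for a segment at a past timestamp, can be sketched as a point-in-time lookup over an append-only change history. The class, the segment label, and the dates are hypothetical; in practice the history would come from the version-controlled configuration itself rather than an in-memory structure.

```python
from bisect import bisect_right
from datetime import datetime, timezone


class ThresholdHistory:
    """Append-only history of threshold changes, queried by timestamp."""

    def __init__(self):
        self._changes = {}  # segment -> (effective_from list, value list)

    def record_change(self, segment, effective_from, value):
        times, values = self._changes.setdefault(segment, ([], []))
        if times and effective_from <= times[-1]:
            raise ValueError("changes must be appended in chronological order")
        times.append(effective_from)
        values.append(value)

    def in_force(self, segment, at):
        """Threshold whose effective date is the latest one not after `at`."""
        times, values = self._changes[segment]
        i = bisect_right(times, at)
        if i == 0:
            raise LookupError("no threshold was in force at that time")
        return values[i - 1]


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


history = ThresholdHistory()
history.record_change("segment_7", utc(2024, 1, 1), 0.60)
history.record_change("segment_7", utc(2025, 3, 1), 0.55)  # revised in month nine

decided_at = utc(2024, 6, 10)
threshold_then = history.in_force("segment_7", decided_at)
threshold_now = history.in_force("segment_7", utc(2025, 8, 10))
```

The lookup returns 0.60 for the original decision date and 0.55 for today, which is the distinction a mutable spreadsheet cannot make. The same pattern applies to model versions and policy IDs: store the effective-from date with every change and never overwrite.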

The technical failures that create regulatory exposure are predictable. The model registry kept only the latest version — the deciding model is gone. Thresholds lived in a mutable spreadsheet that was overwritten — you cannot prove the cutoff. Explanations are regenerated against the current model — they do not match what the applicant saw, and the hash mismatch proves it. No feature snapshot was stored — features get recomputed from current data, and the reconstruction is fiction. Decision logs were rotated before fourteen months elapsed — the record does not exist. Each failure converts directly into exposure: the inability to demonstrate that a specific decision was fair, explainable, and accountable, on demand, for any applicant, for as long as the obligation runs.

An institution that can produce the full chain in minutes has built FREE-AI into its architecture. An institution that needs three teams and three weeks to half-assemble it has a policy, not a control.

What Lenders Think FREE-AI Requires vs What It Actually Requires

Dimension	What Lenders Think It Requires	What FREE-AI Actually Requires
Fairness	Portfolio-level approval-rate reporting	Cluster-level selection-rate and adverse-impact monitoring with segment-specific threshold calibration
Reliability	PSI/KS model-drift monitoring	Model-drift and governance-drift monitoring, including realised versus intended thresholds across segments over time
Explainability	A model card and validation report	Global SHAP for model validation and local SHAP explanations captured at decision time for every applicant
Ethics	A documented governance policy	Version-controlled thresholds, automated divergence alerts, and immutable decision records with demonstrable operational controls
Accountability	Each team retains its own artifacts	One immutable decision record containing model ID, model version, threshold value, feature-vector hash, explanation hash, and policy ID
Inclusivity	Add alternative-data signals to increase approvals	Governed multi-signal lending using Account Aggregator consent artifacts, CFDI-style divergence monitoring, and tracked borrower recourse

The Bottom Line

FREE-AI rewards institutions that can produce evidence and penalises institutions that can produce intent. The gap between the two is not a documentation gap — it is an architecture gap. Every requirement above resolves to the same engineering reality: decisions must be monitored at the segment level, the governance layer must be versioned and watched as closely as the model, explanations must be captured at decision time, and every decision must be reconstructable from a single immutable record long after the model and the thresholds that produced it have changed. Build that, and an examination is a query. Skip it, and an examination is an excavation that comes up empty.