Why You Should Redact Identifiable Information Before Using LLMs — And How To Do It
- Maria Sergeeva

- Feb 8
- 3 min read
(And why uploading raw medical records is the worst possible approach)
Large Language Models (LLMs) like ChatGPT and other GenAI tools are increasingly used to summarise, analyse, translate, and reason over health data. Many doctors and patients are already past the "should we?" stage. Encouragingly, some of them have moved on to asking "how do we use this safely?" rather than blindly dropping in every file they have.
Yet one critical point is still widely misunderstood:
You should never upload raw medical records containing identifiable information into LLMs. And more importantly — you don’t need to.
In fact, uploading unredacted files actively reduces reliability, privacy, and usefulness.
This article explains:
what counts as PHI,
why redacting sensitive information matters,
how to remove identifiable data safely,
and why LLM-ready, structured, redacted files outperform originals every time.
What Is Protected Health Information?
Protected Health Information (PHI) includes any data that can identify a person and relates to their health. This is broader than most people realise.
Examples include:
Name, address, email, phone number
NHS number, insurance ID, policy numbers
Date of birth (even without a name)
Hospital numbers, MRNs
Clinician names when linked to a specific patient
Appointment dates tied to identity
Scans, PDFs, or letters containing metadata
Even a single page can be enough to identify someone when combined with context.
This is why “remove identifiable information” is not a cosmetic step — it’s a foundational privacy requirement.
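To make the breadth of PHI concrete, here is a minimal sketch of how even simple pattern matching flags several of the identifiers listed above. The patterns are illustrative assumptions, not a complete rule set — real PHI detection needs context-aware tooling, not a handful of regexes:

```python
import re

# Illustrative patterns only: these are assumptions for demonstration,
# not a complete or production-grade PHI rule set.
PHI_PATTERNS = {
    "nhs_number": re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b"),  # e.g. 943 476 5919
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "dob": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def find_phi(text: str) -> list[tuple[str, str]]:
    """Return (category, matched_text) pairs for identifiers found in text."""
    hits = []
    for label, pattern in PHI_PATTERNS.items():
        hits.extend((label, match) for match in pattern.findall(text))
    return hits
```

Note what such simple patterns miss: clinician names, hospital numbers in unusual formats, and identifiers embedded in file metadata — which is exactly why automated, context-aware redaction matters.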
The Hidden Risk of Uploading Raw Medical Records to LLMs
Uploading original documents into ChatGPT or other GenAI tools feels convenient, but it creates multiple problems:
1. Privacy & compliance risk
Once uploaded, you must trust:
where the data is stored,
how long it’s retained,
whether it’s used for training,
and what happens if the provider changes policy, ownership, or jurisdiction.
For health data, that’s an unacceptable level of ambiguity.
2. Poor parsing & data loss
Medical PDFs, scans, lab reports, and discharge letters are not structured for LLMs.
Common issues:
tables flattened into nonsense,
only half a page parsed,
footnotes merged into diagnoses,
headers interpreted as clinical facts,
handwritten or scanned sections ignored entirely.
As a result, we get confident-sounding answers based on incomplete or misread data.
3. Over-sharing by default
Most people upload everything, even when the task only requires:
lab values,
timelines,
medications,
or symptom patterns.
This is the opposite of data minimisation.
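Data minimisation in practice can be as simple as selecting only the fields a question needs. In this sketch, the record structure and field names are hypothetical; the point is that a lab-trend question needs only the lab values, not the whole record:

```python
# Hypothetical record structure, for illustration only.
record = {
    "patient_name": "REDACTED",
    "medications": ["metformin 500mg"],
    "labs": [
        {"test": "HbA1c", "value": 48, "unit": "mmol/mol", "date": "2024-03"},
        {"test": "HbA1c", "value": 42, "unit": "mmol/mol", "date": "2024-09"},
    ],
    "free_text_notes": "(not needed for a lab-trend question)",
}

# If the question is "how is my HbA1c trending?", the prompt payload
# is just the lab values -- nothing else leaves your machine.
prompt_payload = record["labs"]
```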
Takeaway message:
You Should Not Upload Your Original Documents at All
If your goal is data-informed answers, uploading raw files is unnecessary — and counterproductive.
What to do instead?
The safest and most effective approach is:
Use a version of the file that is structured specifically for LLMs and stripped of identifiable information.
Why this works better:
the text is fully captured (no missed sections),
the structure is predictable,
entities are clearly separated,
and sensitive identifiers are removed by design.
This guarantees:
reliable parsing
repeatable results
privacy by default
What “Editing Sensitive Information for LLMs” Actually Means
Redaction isn’t just blacking out names.
A proper patient record redaction tool should:
detect personal identifiers automatically,
remove or replace them consistently,
preserve medical meaning and timelines,
and keep the document readable by machines.
Manual redaction fails because it is:
slow,
inconsistent,
error-prone,
and almost always incomplete.
Effective workflows for removing sensitive information from health data are automated, deterministic, and auditable.
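"Deterministic" replacement means the same identifier always maps to the same token, so timelines and cross-references survive redaction. A minimal sketch, assuming a salted hash (the salt value and token format here are illustrative, not how any particular tool works):

```python
import hashlib

def pseudonym(value: str, salt: str = "document-specific-secret") -> str:
    """Deterministically map an identifier to a stable token.
    The same input always yields the same pseudonym, so a clinician
    mentioned on page 1 and page 9 stays linkable without being named.
    Salt and token format are illustrative assumptions."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()[:8]
    return f"[PERSON-{digest}]"
```

Because the mapping is consistent, the LLM can still reason about who did what and when — it just never sees the real name.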
The Secure Way of Removing Personal Information from Files
A secure workflow follows three principles:
No raw uploads to consumer AI tools
Automated, context-aware redaction
Export in an LLM-optimised format
This ensures:
no identifiable data leaves your control,
the LLM only sees what it needs,
and outputs are based on complete, correctly parsed information.
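The three principles above can be sketched as a single function that runs locally, before anything reaches an AI tool. The detection step here is a regex placeholder standing in for a real context-aware redactor, and the function names are illustrative:

```python
import re

def detect_identifiers(text: str) -> list[tuple[str, str]]:
    # Placeholder detector: a production tool would use context-aware
    # models, not two regexes.
    patterns = {
        "email": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
        "phone": r"\b(?:\+44\s?|0)\d{4}\s?\d{6}\b",
    }
    return [(label, m) for label, rx in patterns.items()
            for m in re.findall(rx, text)]

def prepare_for_llm(raw_text: str) -> str:
    """Sketch of the three-step workflow:
    1) runs locally, so no raw upload;
    2) redacts detected identifiers automatically and consistently;
    3) exports in a predictable, LLM-friendly structure."""
    redacted = raw_text
    for label, value in detect_identifiers(raw_text):
        redacted = redacted.replace(value, f"[{label.upper()}]")
    return "## Clinical summary (redacted)\n\n" + redacted
```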
This is especially critical for:
chronic patients,
carers managing multiple records,
health tourists,
immigrants and cross-border patients,
and anyone working across languages or systems.
How This Works in Health Data Avatar
Standalone redaction tools are often separate platforms costing around £2 per page. In HDA, this approach is built into the platform itself.
For our early testers group, a redaction and LLM-ready export tool is already live.
Users can download a version of their file that is safe to use with GenAI tools.
The exported file:
has sensitive information removed,
and is optimised for reliable LLM interpretation.
Why This Matters (Beyond Privacy)
Redaction isn’t just about safety — it’s about better outcomes.
LLMs perform best when:
inputs are clean,
structure is predictable,
and noise is removed.
A redacted, structured file will always outperform a raw scan or PDF.
This flips the usual assumption on its head:
Less data, structured correctly, produces better answers than more data uploaded blindly.
You can join our private beta, or our early testers group to be among the first to try HDA's latest tools.