What is Robust?

Basic pseudonymisation can be done very simply and quickly with minimal technical effort.

However, experience shows that to remain effective across a variety of data sources and recipients, and to stand the test of time, pseudonymisation must be designed in a more sophisticated way.

Pitfalls and Challenges

How to Pseudonymise: Lookup table Vs Hash algorithm

In the de-identification process, reliable privacy results depend on the method used to create the pseudonyms. Pseudonymising by encrypting the original sensitive data on a one-to-one basis is a relatively easy process to set up. However, using encryption this way results in a mathematical relationship between the original patient information and the de-identified data.

Working with a set of encryption-based pseudonyms, an outside hacker could work out the mathematical relationship being used to generate pseudonyms and re-identify the patient data. Worse yet, there would be no way to know that the data had been breached even after the damage was done.
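As a rough illustration of why a purely mathematical relationship is risky, the sketch below uses an unsalted SHA-256 hash as the pseudonym function. This is an assumption made for the example (the text above describes encryption, and the function names and the four-digit identifier space are hypothetical), but the weakness is the same in spirit: anyone who knows or guesses the function can enumerate candidate identifiers offline and match them against the pseudonyms.

```python
import hashlib
from typing import Iterable, Optional


def naive_pseudonym(identifier: str) -> str:
    """Deterministic pseudonym: an unsalted SHA-256 digest (illustrative only)."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def reidentify_by_enumeration(pseudonym: str, candidates: Iterable[str]) -> Optional[str]:
    """Try every candidate identifier until one produces the given pseudonym."""
    for candidate in candidates:
        if naive_pseudonym(candidate) == pseudonym:
            return candidate
    return None


# Hypothetical four-digit identifier space, kept small so the demo runs instantly.
leaked = naive_pseudonym("4821")
candidates = (f"{n:04d}" for n in range(10_000))
recovered = reidentify_by_enumeration(leaked, candidates)
print(recovered)
```

Nothing in this attack touches the system that produced the pseudonyms, which is why such a breach would leave no trace.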

Another issue with relying on encryption is key management, which is labour-intensive; this is generally discovered only after a period of use. It can also become a security weakness if users are not conscientious about how they store the keys (e.g. the “key on a sticky note on the monitor” problem). Automating the occasional need for re-identification will also uncover problems with an encryption-based design, which lacks the flexibility to accommodate new requirements.

When pseudonyms are generated arbitrarily, with the pairs of sensitive data and pseudonyms kept in “Lookup” tables residing on a protected server, the information is fundamentally more secure. A hacker cannot sit at home and work at leisure to crack the code: access to the Lookup table itself (e.g. by copying it off the server where it lives) is necessary to link the pseudonyms with the “Sensitives” (e.g. name, NHS number, postcode) and thus expose the patient information. Protecting data stored on local servers is a well-understood problem, and with monitoring in place it will be clear if a breach has occurred.
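A minimal sketch of the lookup-table idea, assuming an in-memory table and random hexadecimal pseudonyms. The class name, method names and pseudonym length are hypothetical choices for illustration, not a description of any particular product; a real table would live in a protected, monitored data store.

```python
import secrets


class LookupTable:
    """Pairs each sensitive value with an arbitrary, randomly generated pseudonym."""

    def __init__(self):
        self._forward = {}  # sensitive value -> pseudonym
        self._reverse = {}  # pseudonym -> sensitive value

    def pseudonymise(self, sensitive: str) -> str:
        existing = self._forward.get(sensitive)
        if existing is not None:
            return existing  # the same input always yields the same pseudonym
        pseudonym = secrets.token_hex(8)
        while pseudonym in self._reverse:  # regenerate on the (unlikely) collision
            pseudonym = secrets.token_hex(8)
        self._forward[sensitive] = pseudonym
        self._reverse[pseudonym] = sensitive
        return pseudonym


table = LookupTable()
first = table.pseudonymise("9434765919")   # illustrative identifier
again = table.pseudonymise("9434765919")
other = table.pseudonymise("SW1A 1AA")
```

Because the pseudonym is drawn at random, it carries no information about the original value; the only link between the two is the table itself.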

The challenge in designing a lookup-based pseudonymisation process lies in ensuring accuracy and correctness in pseudonym assignment and re-identification.

Accuracy for patient safety

Accuracy is vital for trustworthy de-identification. Researchers, auditors, managers and especially patient-facing clinical personnel must be able to rely on the accuracy of de-identified data since it can’t be visually spot checked: the sensitive data is no longer accessible.

In the event that pseudonymised data is used to determine pathways and take actions involving direct patient care, the accuracy of the pseudonymisation process is vital to avoid introducing risks to patient safety.

Slightly more technical but no less critical: even the slightest error (two or more pseudonyms linked to one patient, or one pseudonym linked to two or more patients) will result in a “nightmare scenario”. This entails purging all relevant databases, trying to fix the problem without knowing how far back the first error occurred, reloading all the data and re-running every report produced during the period in question.

To ensure perfect accuracy, stringent attention must be paid to correctness when designing a pseudonymisation solution. If possible, some means of verifying output is desirable.
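One possible way to verify output is to scan the assigned pairs for the two error cases described above. The sketch below assumes the pairs are available as `(patient_id, pseudonym)` tuples; the function name and the input shape are hypothetical.

```python
from collections import defaultdict


def find_mapping_errors(pairs):
    """Return (patients with 2+ pseudonyms, pseudonyms shared by 2+ patients)."""
    pseudonyms_by_patient = defaultdict(set)
    patients_by_pseudonym = defaultdict(set)
    for patient_id, pseudonym in pairs:
        pseudonyms_by_patient[patient_id].add(pseudonym)
        patients_by_pseudonym[pseudonym].add(patient_id)

    split_patients = {
        patient: sorted(found)
        for patient, found in pseudonyms_by_patient.items()
        if len(found) > 1
    }
    shared_pseudonyms = {
        pseudonym: sorted(found)
        for pseudonym, found in patients_by_pseudonym.items()
        if len(found) > 1
    }
    return split_patients, shared_pseudonyms


pairs = [("A", "p1"), ("A", "p2"), ("B", "p3"), ("C", "p3"), ("D", "p4")]
split_patients, shared_pseudonyms = find_mapping_errors(pairs)
```

Running a check like this on every load, rather than after a problem is reported, limits how far back an error can reach before it is noticed.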

Correctness: managing changes over time

Databases that store information collected over several years are subject to changes in information. These infrequent but significant changes will affect the correctness of pseudonyms assigned at an earlier date. Generally speaking, encryption-based pseudonymisation solutions will not be able to accommodate these inevitable changes.

As an example, assume patient X is admitted into hospital without identification and gets assigned a temporary NHS number. In the database, these records will appear to be for a separate patient with a unique pseudonym. An effective pseudonymisation system must be able to map the temporary NHS number alongside the permanent NHS number (once known) in order to fully connect that patient’s history henceforth.

The ability to map slowly changing sensitive data to the same pseudonyms is vital to maintaining patient history when linking data sets over time.
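The temporary-number example can be sketched as a lookup in which several identifiers share one pseudonym. This is only an illustration under simple assumptions: the class and method names are hypothetical, the table is held in memory, and a conflict (the permanent number already has a different pseudonym) is rejected rather than merged.

```python
import secrets


class AliasingLookup:
    """Lookup in which a temporary and a permanent identifier can share a pseudonym."""

    def __init__(self):
        self._pseudonym_by_identifier = {}

    def pseudonymise(self, identifier: str) -> str:
        if identifier not in self._pseudonym_by_identifier:
            self._pseudonym_by_identifier[identifier] = secrets.token_hex(8)
        return self._pseudonym_by_identifier[identifier]

    def link(self, temporary_id: str, permanent_id: str) -> None:
        """Record that permanent_id belongs to the same patient as temporary_id."""
        pseudonym = self.pseudonymise(temporary_id)
        existing = self._pseudonym_by_identifier.get(permanent_id)
        if existing is not None and existing != pseudonym:
            # Two histories already exist; merging them is left to a reviewed process.
            raise ValueError("identifiers already have different pseudonyms")
        self._pseudonym_by_identifier[permanent_id] = pseudonym


lookup = AliasingLookup()
temp_pseudonym = lookup.pseudonymise("TEMP-0001")   # admitted without identification
lookup.link("TEMP-0001", "9434765919")              # permanent number now known
permanent_pseudonym = lookup.pseudonymise("9434765919")
```

Records loaded under either identifier now resolve to the same pseudonym, so the patient's history stays connected.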

Other key features

Automated Re-identification by permission

In addition to pseudonymising sensitive data, an effective de-identification solution must also offer the option to re-identify (uncover the original sensitive data) on a case-by-case, permissioned basis. Whilst re-identification is by comparison an infrequent activity, it can pose both a security risk and a resource-intensive bottleneck if not automated. In particular, encryption-based pseudonymisation solutions will be subject to key management issues once re-identification becomes a regular requirement.
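A minimal sketch of automated, permissioned re-identification, assuming a reverse lookup held in memory and a fixed set of authorised users. All names here are hypothetical; the point is that every request, whether granted or refused, is checked and logged without manual key handling.

```python
from datetime import datetime, timezone


class ReidentificationService:
    """Re-identifies pseudonyms for authorised users and logs every request."""

    def __init__(self, reverse_lookup, authorised_users):
        self._reverse_lookup = dict(reverse_lookup)   # pseudonym -> sensitive value
        self._authorised_users = set(authorised_users)
        self.audit_log = []

    def reidentify(self, user: str, pseudonym: str, reason: str) -> str:
        allowed = user in self._authorised_users
        # Log before acting, so refused attempts are recorded as well.
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "pseudonym": pseudonym,
            "reason": reason,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{user} is not authorised to re-identify")
        return self._reverse_lookup[pseudonym]


service = ReidentificationService(
    reverse_lookup={"a1b2c3d4": "9434765919"},
    authorised_users={"caldicott.guardian"},
)
original = service.reidentify("caldicott.guardian", "a1b2c3d4", "clinical follow-up")
```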


The ability to successfully and consistently link pseudonymised data requires either that the keys or algorithms used to pseudonymise be shared securely or that the pseudonymisation process is centralised. In addition, appropriate cleansing is required to make the data consistent prior to pseudonymising.
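The cleansing step might look something like the following sketch, which normalises two of the sensitive fields mentioned earlier so that differently formatted copies of the same value receive the same pseudonym. The function names and the specific rules (strip non-digits from a ten-digit NHS number; upper-case a postcode and put a single space before its final three characters) are assumptions for illustration.

```python
import re


def clean_nhs_number(raw: str) -> str:
    """Keep digits only and require exactly ten of them."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {len(digits)}")
    return digits


def clean_postcode(raw: str) -> str:
    """Upper-case, drop all whitespace, then re-insert one space before the last three characters."""
    compact = re.sub(r"\s+", "", raw).upper()
    if len(compact) < 5:
        raise ValueError("postcode too short")
    return compact[:-3] + " " + compact[-3:]


print(clean_nhs_number("943 476 5919"))
print(clean_postcode("sw1a1aa"))
```

Without a step like this, "943 476 5919" and "9434765919" would be treated as two different inputs and the resulting records could not be linked.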


Safeguards must also be considered against a compromised systems administrator and against collusion between departments.

Easy Integration

An effective pseudonymisation solution must be able to accommodate new and varied data sources quickly and with ease. It should be reusable for different applications. The ability to shape pseudonyms is also necessary in order to interoperate with the unique needs of local data management environments. To integrate directly with existing IT investments, pseudonyms must meet data format restrictions such as permitted punctuation and maximum character limits.
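Shaping a pseudonym to fit a target field could be sketched as below, assuming the constraints are a fixed total length, a permitted alphabet and an optional prefix. The function name and parameters are hypothetical; uniqueness would still need to be checked against the lookup table before a pseudonym is assigned.

```python
import secrets
import string


def shaped_pseudonym(length: int,
                     alphabet: str = string.ascii_uppercase + string.digits,
                     prefix: str = "") -> str:
    """Random pseudonym of exactly `length` characters drawn from `alphabet`."""
    if length <= len(prefix):
        raise ValueError("length must exceed the prefix length")
    body = "".join(secrets.choice(alphabet) for _ in range(length - len(prefix)))
    return prefix + body


# A ten-character, digits-only pseudonym for a column that accepts only numerals.
numeric = shaped_pseudonym(10, alphabet=string.digits)
# A twelve-character alphanumeric pseudonym with a recognisable prefix.
prefixed = shaped_pseudonym(12, prefix="PSN")
```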


As with any solution that is depended upon for regular use, accessibility, speed and responsiveness are important to ensure that the solution doesn’t become a troublesome bottleneck as demand grows. The ability to control when de-identification occurs with an in-house solution provides more flexibility of data linking and enables more variety of data enrichment. With a shared resource, if customisation is required each time the demands on the solution grow and change…

All of Sapior’s solutions provide the most robust NHS-grade pseudonymisation and are based on our de facto national standard Pseudonymisation engine, as used by the English NHS Spine Secondary Uses Service (SUS).

“an example of best practice”

– PIAG (now ECC-Ethics and Confidentiality Committee)