Use normalize() to deal with non-English strings in JavaScript

Frankie
Jan 10, 2022

For an international website, we may need to handle different languages. In some cases users need to type in their own language, and we cannot control what they type: they may, maliciously or accidentally, enter something we do not expect.

One famous example:

const name1 = '\u0041\u006d\u00e9\u006c\u0069\u0065';
const name2 = '\u0041\u006d\u0065\u0301\u006c\u0069\u0065';
console.log(`${name1}, ${name2}`);
// expected output: "Amélie, Amélie"
console.log(name1 === name2);
// expected output: false
console.log(name1.length === name2.length);
// expected output: false

To the eye there is no difference: both strings render identically. Yet they compare as unequal and even have different lengths (6 and 7), which looks very weird.

Here is the reason:

Unicode assigns a unique numerical value, called a code point, to each character. For example, the code point for "A" is given as U+0041. However, sometimes more than one code point, or sequence of code points, can represent the same abstract character — the character "ñ" for example can be represented by either of:

The single code point U+00F1.

The code point for "n" (U+006E) followed by the code point for the combining tilde (U+0303).
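To make the two representations visible, here is a small sketch that prints the code points of each form of "ñ". The helper name codePoints is made up for this illustration; it only relies on the standard Array.from and String.prototype.codePointAt.

```javascript
// Two ways to write "ñ": one precomposed code point,
// or "n" followed by a combining tilde.
const precomposed = '\u00F1';
const decomposed = '\u006E\u0303';

// Hypothetical helper: list the code points of a string as "U+XXXX".
function codePoints(str) {
  // Array.from iterates a string by code point, not by UTF-16 unit.
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

console.log(codePoints(precomposed));
// expected output: [ 'U+00F1' ]
console.log(codePoints(decomposed));
// expected output: [ 'U+006E', 'U+0303' ]
console.log(precomposed === decomposed);
// expected output: false
```

Both strings render as the same character, but the comparison works on the underlying code units, so they are not equal.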

How to fix it?

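Use normalize(), which returns the Unicode Normalization Form of the string. With 'NFC', combining sequences are composed into single code points where possible, so both names end up identical.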

const name1NFC = name1.normalize('NFC');
const name2NFC = name2.normalize('NFC');
console.log(`${name1NFC}, ${name2NFC}`);
// expected output: "Amélie, Amélie"
console.log(name1NFC === name2NFC);
// expected output: true
console.log(name1NFC.length === name2NFC.length);
// expected output: true
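NFC is not the only form that normalize() accepts. As a rough sketch, the snippet below compares NFC (composed) with NFD (decomposed) on the same name. The variable names are chosen just for this example. Either form works for comparison, as long as both strings go through the same one.

```javascript
// The same name written with a precomposed "é" and with "e" + combining acute.
const composedName = 'Am\u00e9lie';
const decomposedName = 'Ame\u0301lie';

// NFC composes: both strings become 6 code units long.
console.log(decomposedName.normalize('NFC').length);
// expected output: 6

// NFD decomposes: both strings become 7 code units long.
console.log(composedName.normalize('NFD').length);
// expected output: 7

// Calling normalize() without an argument defaults to NFC.
console.log(composedName.normalize() === decomposedName.normalize());
// expected output: true

// The strings also match when both are normalized to NFD.
console.log(composedName.normalize('NFD') === decomposedName.normalize('NFD'));
// expected output: true
```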

Why is this important?

As the example above shows, two strings that look identical can have different lengths, so a length check may pass for one and fail for the other.
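A minimal sketch of such a length check is shown here. The limit MAX_NAME_LENGTH and the function fitsLimit are hypothetical, invented only to illustrate the point; a real form would have its own rules.

```javascript
// Hypothetical limit for this example only.
const MAX_NAME_LENGTH = 6;

// Hypothetical validator: normalize to NFC before measuring the length.
function fitsLimit(input, max = MAX_NAME_LENGTH) {
  return input.normalize('NFC').length <= max;
}

// "Amélie" typed with "e" + combining acute accent (7 code units).
const typedName = 'Ame\u0301lie';

console.log(typedName.length <= MAX_NAME_LENGTH);
// expected output: false
console.log(fitsLimit(typedName));
// expected output: true
```

Without normalization the user is rejected for a name that visibly has six characters.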

Also, imagine a user who wants to register a username that has already been taken: they can use this trick to circumvent the uniqueness check.

Therefore, we should normalize strings before performing any checks on them.
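Here is one way the username case could look, as a sketch under simple assumptions: the in-memory Set named takenUsernames stands in for a real database, and register and canonicalUsername are made-up names for this example.

```javascript
// Stand-in for a database of registered usernames (assumption for this sketch).
const takenUsernames = new Set();

// Hypothetical helper: the form of the username that gets stored and compared.
function canonicalUsername(raw) {
  return raw.normalize('NFC');
}

// Hypothetical registration: returns true on success, false if already taken.
function register(raw) {
  const key = canonicalUsername(raw);
  if (takenUsernames.has(key)) {
    return false;
  }
  takenUsernames.add(key);
  return true;
}

console.log(register('Am\u00e9lie'));
// expected output: true
console.log(register('Ame\u0301lie'));
// expected output: false
```

Because both the stored value and the incoming value pass through the same normalization step, the second look-alike registration is rejected.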
