Wednesday, April 3, 2024

Take just a single visible character?

A fairly simple requirement: take the first letter of a profile description and show it together with a link. You get the idea: if I have two profiles, Foo and Bar, I want two links labeled F and B respectively.
The first version of the code (not even shown here) was just something like: if the string has at least one character, take the first character in uppercase.
This seemingly simple approach completely ignores emoji. C# strings are UTF-16, and most emoji lie outside the Basic Multilingual Plane, so they are stored as a surrogate pair, that is, two consecutive chars. The second version of the code was then:
public static string ToShortDescription( this string source )
{
	var description = source?.Trim();

	if ( !string.IsNullOrWhiteSpace( description ) && description.Length >= 1 )
	{
		if ( char.IsSurrogatePair( description, 0 ) )
		{
			return description.Substring( 0, 2 ).ToUpper();
		}
		else
		{
			return description.Substring( 0, 1 ).ToUpper();
		}
	}

	return "?";
}
This is better, much better. It no longer takes just the first char of the string: when the string starts with a surrogate pair, it takes a substring of length 2. This simple approach correctly handles many two-char emoji, like the base mage emoji, 🧙, encoded as &#x1F9D9; (in UTF-16, the surrogate pair D83E DDD9).
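To make the "two chars" point concrete, here is a minimal sketch. The class and method names (`SurrogateDemo`, `FirstCodePointLength`) are made up for this illustration; only `char.IsSurrogatePair` comes from the framework.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: number of UTF-16 chars (1 or 2) occupied by
    // the first code point of a non-empty string.
    public static int FirstCodePointLength(string text)
    {
        if (string.IsNullOrEmpty(text))
            throw new ArgumentException("text must not be empty", nameof(text));

        // IsSurrogatePair(string, int) returns false when index is the last
        // position, so a one-char string is safe here.
        return char.IsSurrogatePair(text, 0) ? 2 : 1;
    }
}
```

For "Foo" this gives 1, and for a string starting with the mage emoji (written as "\U0001F9D9" to avoid source encoding issues) it gives 2.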
Unfortunately, it's just the beginning of the story. It turns out some emoji are combined from other emoji. Let's take the mage emoji. Its female version, 🧙‍♀️, is encoded as the base mage emoji followed by additional characters that indicate the female version (&#x1F9D9;&#x200D;&#x2640;&#xFE0F;). The special character used to glue emoji together is the Zero-Width Joiner (ZWJ), U+200D.
Take a C# string that starts with the female mage emoji. This time it's not 2 chars that should be taken from it, now it's 5! The two-char mage emoji, the ZWJ, the female sign (U+2640) and a variation selector (U+FE0F)!
Let this sink in - in order to have a single visible character on the screen, we need to take the first 5 chars of the C# string!
And as you can expect, the above version of the code correctly detects the first surrogate pair but knows nothing about the ZWJ, so it cuts the sequence off after 2 chars.
There's even a discussion on Stack Overflow on how to detect this.
My current approach is:
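The count of 5 can be checked by listing the code points of the string. This is only a sketch: `CodePointDump` and `ToCodePoints` are names invented here, and the code assumes well-formed UTF-16 (`char.ConvertToUtf32` throws on a lone surrogate).

```csharp
using System;
using System.Collections.Generic;

public static class CodePointDump
{
    // Hypothetical helper: splits a UTF-16 string into code points,
    // each formatted as "U+XXXX".
    public static List<string> ToCodePoints(string text)
    {
        var result = new List<string>();

        for (int i = 0; i < text.Length; i++)
        {
            // Reads a surrogate pair at i as a single code point;
            // throws ArgumentException on an unpaired surrogate.
            int codePoint = char.ConvertToUtf32(text, i);
            result.Add("U+" + codePoint.ToString("X4"));

            // Skip the low surrogate that was already consumed.
            if (char.IsSurrogatePair(text, i))
            {
                i++;
            }
        }

        return result;
    }
}
```

For the female mage, written as "\U0001F9D9\u200D\u2640\uFE0F", `Length` is 5 and the list is U+1F9D9, U+200D, U+2640, U+FE0F: four code points, of which only the first needs two chars.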
public static string ToShortDescription( this string source, bool autoUpper = true )
{
	var description = source?.Trim();

	if ( !string.IsNullOrWhiteSpace( description ) && description.Length >= 1 )
	{
		// take consecutive chars according to the following rules:
		// * an ordinary char - take it and stop
		// * a ZWJ - take it and continue
		// * a surrogate pair - take both chars and continue
		// use the trimmed description, not the raw source, so leading whitespace is never returned
		char[] sourceChars = description.ToCharArray();
		List<char> destChars = new List<char>();

		var index = 0;
		bool takeAgain;

		do
		{
			takeAgain = false;

			// is there a char here and one more after it (a two-char sequence)?
			if ( index < sourceChars.Length - 1 )
			{
				// surrogate pair
				if ( char.IsSurrogatePair( sourceChars[index], sourceChars[index + 1] ) )
				{
					destChars.AddRange( new[] { sourceChars[index], sourceChars[index + 1] } );

					index    += 2;
					takeAgain = true;
				}
			}

			if ( index < sourceChars.Length - 2 )
			{
				// ZWJ (U+200D) - glues two emoji together
				if ( sourceChars[index] == (char)8205 )
				{
					destChars.AddRange( new[] { sourceChars[index], sourceChars[index + 1], sourceChars[index + 2] } );

					index    += 3;
					takeAgain = true;
				}

			}


		} while ( takeAgain && index < sourceChars.Length );

		// take one more char if nothing has been taken yet
		if ( !takeAgain && 
			 index <= sourceChars.Length-1 &&
			 destChars.Count == 0 
			)
		{
			destChars.Add( sourceChars[index] );
		}

		string _result = new string( destChars.ToArray() );

		return autoUpper ? _result.ToUpper() : _result;

	}

	return "?";
}
This passes some important unit tests. Namely, it correctly handles the England flag emoji, 🏴󠁧󠁢󠁥󠁮󠁧󠁿 (&#x1F3F4;&#xE0067;&#xE0062;&#xE0065;&#xE006E;&#xE0067;&#xE007F;), which is still a single visible sign, but in this extreme case it takes the first 14 chars of the C# string (7 code points, each stored as a surrogate pair)! I believe there's still room for improvement (and possibly other strange cases that I still miss).
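One case the loop above seems to miss: it keeps consuming consecutive surrogate pairs, so two independent emoji in a row (say, two mages) would both be returned. A possible alternative, sketched below, is to let the framework find the first grapheme cluster with `StringInfo.GetNextTextElement`. This is not the code I use, and it rests on an assumption: on .NET 5 or later, `StringInfo` follows the Unicode extended grapheme cluster rules (UAX #29), which cover ZWJ sequences and tag sequences, while older .NET Framework versions split them differently. `FirstGrapheme` and `Take` are names invented for the example, and the uppercasing step is left out.

```csharp
using System.Globalization;

public static class FirstGrapheme
{
    // Hypothetical helper: returns the first text element (grapheme cluster)
    // of the trimmed input, or "?" when there is nothing to show.
    public static string Take(string source)
    {
        var description = source?.Trim();

        if (string.IsNullOrEmpty(description))
        {
            return "?";
        }

        // Assumption: running on .NET 5+, where text elements are
        // extended grapheme clusters as defined by UAX #29.
        return StringInfo.GetNextTextElement(description, 0);
    }
}
```

Under that assumption, a string starting with the female mage should yield a 5-char result, the England flag a 14-char one, and two mages in a row just the first mage (2 chars).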
