Tuesday, April 23, 2024

Hobby, reloaded

Monday, April 15, 2024

Rust/WebAssembly vs JavaScript performance ... reloaded

Last year I blogged about a tiny contest where Rust/WASM and JavaScript were used to implement a simple Mandelbrot animation. I've polished the code and put it in a GitHub repo, rust-vs-js. I've also added a third contestant, a gpu.js-accelerated version, which of course beats the heck out of the other two (JavaScript and Rust) since it's heavily parallelized.
Anyway, jump to the repository and enjoy the code.

Wednesday, April 3, 2024

Take just a single visible character?

Fairly simple requirement - get the first letter of a profile description and present it together with a link. You get the idea: if I have two profiles, Foo and Bar, I want two links with F and B respectively.
The first version of the code (not even shown here) was just something like: if the string has at least one character, take the uppercase of its first character.
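A minimal sketch of what such a first version might look like. This is a reconstruction for illustration only, since the original code isn't shown; the class and method names are made up:

```csharp
using System;

public static class NaiveShortDescription
{
    // Hypothetical reconstruction: always take exactly one UTF-16 code unit.
    public static string FirstCharUpper( string source )
    {
        var description = source?.Trim();

        if ( string.IsNullOrEmpty( description ) )
        {
            return "?";
        }

        return description.Substring( 0, 1 ).ToUpper();
    }

    public static void Main()
    {
        Console.WriteLine( FirstCharUpper( "foo" ) ); // F

        // A string starting with U+1F9D9 (outside the Basic Multilingual Plane):
        // the result is a lone high surrogate, which is not a valid character.
        string broken = FirstCharUpper( "\U0001F9D9 wizard" );
        Console.WriteLine( broken.Length );                     // 1
        Console.WriteLine( char.IsHighSurrogate( broken[0] ) ); // True
    }
}
```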
This seemingly simple approach completely ignores emojis, many of which are stored in C# strings as a surrogate pair, i.e. two consecutive chars. The second version of the code was then:
public static string ToShortDescription( this string source )
{
	var description = source?.Trim();

	if ( !string.IsNullOrWhiteSpace( description ) && description.Length >= 1 )
	{
		if ( char.IsSurrogatePair( description, 0 ) )
		{
			return description.Substring( 0, 2 ).ToUpper();
		}
		else
		{
			return description.Substring( 0, 1 ).ToUpper();
		}
	}

	return "?";
}
This is better, much better. It's no longer just the first char of the string: when the string starts with a surrogate pair, it's the substring of length 2. This simple approach correctly handles many two-char emojis, like the mage emoji, 🧙, encoded as U+1F9D9 (the surrogate pair D83E DDD9).
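To see the surrogate pair at work, here is a small sketch that only uses `char.IsSurrogatePair` and `char.ConvertToUtf32` from the base class library. The helper name `FirstCodePointLength` is hypothetical:

```csharp
using System;

public static class SurrogateDemo
{
    // Number of UTF-16 code units (C# chars) occupied by the first code point.
    public static int FirstCodePointLength( string s )
    {
        if ( string.IsNullOrEmpty( s ) )
        {
            return 0;
        }

        return char.IsSurrogatePair( s, 0 ) ? 2 : 1;
    }

    public static void Main()
    {
        string mage = "\U0001F9D9"; // a single code point, U+1F9D9

        Console.WriteLine( mage.Length );                                    // 2
        Console.WriteLine( FirstCodePointLength( mage ) );                   // 2
        Console.WriteLine( FirstCodePointLength( "Foo" ) );                  // 1
        Console.WriteLine( char.ConvertToUtf32( mage, 0 ).ToString( "X" ) ); // 1F9D9
    }
}
```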
Unfortunately, it's just the beginning of the story. It turns out some emojis are composed of other emojis. Let's take the mage emoji. Its female version, 🧙‍♀️, is encoded as the base mage emoji followed by additional characters that indicate the female variant (U+1F9D9 U+200D U+2640 U+FE0F). The special character used to glue emojis together is the Zero-Width Joiner (ZWJ, U+200D).
Take a C# string that starts with the female mage emoji. This time it's not 2 characters that should be taken from it, now it's 5! The two-char emoji, the ZWJ, the female sign (U+2640) and a variation selector (U+FE0F)!
Let this sink in - in order to have a single visible character on the screen, we need to take the first 5 characters of the C# string!
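The count of five is easy to check by dumping the UTF-16 code units of the sequence. A short sketch (the helper `ToHexUnits` is a made-up name): the five chars are the mage's surrogate pair, the ZWJ (U+200D), the female sign (U+2640) and the variation selector (U+FE0F).

```csharp
using System;
using System.Linq;

public static class CodeUnitDump
{
    // Formats every UTF-16 code unit of the string as four hex digits.
    public static string[] ToHexUnits( string s )
    {
        return s.Select( c => ( (int)c ).ToString( "X4" ) ).ToArray();
    }

    public static void Main()
    {
        // female mage: U+1F9D9 U+200D U+2640 U+FE0F
        string femaleMage = "\U0001F9D9\u200D\u2640\uFE0F";

        Console.WriteLine( femaleMage.Length );                            // 5
        Console.WriteLine( string.Join( " ", ToHexUnits( femaleMage ) ) ); // D83E DDD9 200D 2640 FE0F
    }
}
```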
And as you can expect, the above version of the code correctly detects the first surrogate pair but fails to detect the ZWJ.
There's even a discussion on Stack Overflow on how to detect this.
My current approach is:
public static string ToShortDescription( this string source, bool autoUpper = true )
{
	var description = source?.Trim();

	if ( !string.IsNullOrWhiteSpace( description ) && description.Length >= 1 )
	{
		// take consecutive chars according to the following rules:
		// * a regular char - take it and stop
		// * a ZWJ - take it and continue
		// * a surrogate pair - take both chars and continue
		// note: work on the trimmed description, not the raw source,
		// so that leading whitespace is never returned
		// (List<char> requires: using System.Collections.Generic;)
		char[] sourceChars = description.ToCharArray();
		List<char> destChars = new List<char>();

		var index = 0;
		bool takeAgain;

		do
		{
			takeAgain = false;

			// is there a char here with another one right after it (a two-char sequence)?
			if ( index < sourceChars.Length - 1 )
			{
				// surrogate pair
				if ( char.IsSurrogatePair( sourceChars[index], sourceChars[index + 1] ) )
				{
					destChars.AddRange( new[] { sourceChars[index], sourceChars[index + 1] } );

					index    += 2;
					takeAgain = true;
				}
			}

			if ( index < sourceChars.Length - 2 )
			{
				// ZWJ (U+200D) - glues two emojis together; take it with the two chars that follow
				if ( sourceChars[index] == '\u200D' )
				{
					destChars.AddRange( new[] { sourceChars[index], sourceChars[index + 1], sourceChars[index + 2] } );

					index    += 3;
					takeAgain = true;
				}

			}


		} while ( takeAgain && index < sourceChars.Length );

		// take one plain char if nothing has been taken yet
		if ( !takeAgain && 
			 index <= sourceChars.Length-1 &&
			 destChars.Count == 0 
			)
		{
			destChars.Add( sourceChars[index] );
		}

		string _result = new string( destChars.ToArray() );

		return autoUpper ? _result.ToUpper() : _result;

	}

	return "?";
}
This passes some important unit tests. Namely, it correctly handles the England flag emoji, 🏴󠁧󠁢󠁥󠁮󠁧󠁿 (U+1F3F4 U+E0067 U+E0062 U+E0065 U+E006E U+E0067 U+E007F), which is still a single visible glyph, but in this extreme case it takes up the first 14 characters of the C# string! I believe there's still room for improvement (and possibly other strange cases that I still miss). For instance, because every surrogate pair continues the loop, two adjacent emojis come back together.
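One possible direction, sketched under the assumption that the code runs on .NET 5 or later: there, `StringInfo.GetNextTextElement` follows the Unicode extended grapheme cluster rules, so it should return a whole ZWJ sequence or tag sequence in one call. On .NET Framework and .NET Core 3.x the same call splits such sequences, so this is not a drop-in replacement everywhere. `FirstTextElement` is a hypothetical helper name, not the `ToShortDescription` method above:

```csharp
using System;
using System.Globalization;

public static class FirstGrapheme
{
    // Returns the first text element (grapheme cluster) of the trimmed input,
    // or "?" when there is nothing to show. Assumes .NET 5+ behavior.
    public static string FirstTextElement( string source )
    {
        var description = source?.Trim();

        if ( string.IsNullOrEmpty( description ) )
        {
            return "?";
        }

        return StringInfo.GetNextTextElement( description, 0 );
    }

    public static void Main()
    {
        string femaleMage  = "\U0001F9D9\u200D\u2640\uFE0F";
        string englandFlag = "\U0001F3F4\U000E0067\U000E0062\U000E0065\U000E006E\U000E0067\U000E007F";

        Console.WriteLine( FirstTextElement( "Foo" ) );                         // F
        Console.WriteLine( FirstTextElement( femaleMage + " witch" ).Length );  // 5 on .NET 5+
        Console.WriteLine( FirstTextElement( englandFlag + " team" ).Length );  // 14 on .NET 5+
    }
}
```

The sketch leaves out the uppercasing step; calling `ToUpper()` on the returned element would restore it.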