Serializing strings, Unicode, and randomized testing using ScalaCheck

While implementing a simple event store for an example application, I needed to serialize JSON data to byte arrays and turn those bytes back into the original JSON. Obviously, that’s a piece of cake!

// try1.scala
def serialize(string: String) = string.getBytes
def deserialize(bytes: Array[Byte]) = new String(bytes)
def test(string: String): Unit = {
  val actual = deserialize(serialize(string))
  println(if (string == actual) "OK <%s>".format(string)
          else "FAIL expected <%s> got <%s>".format(string, actual))
}

test("")
test("foo")
test("the quick brown fox jumps over the lazy dog")
test("\u2192")

Save this to a file (try1.scala) or get it from GitHub at erikrozendaal/scalacheck-blog and run it with the Scala “interpreter”:

$ scala try1.scala
OK <>
OK <foo>
OK <the quick brown fox jumps over the lazy dog>
FAIL expected <?> got <?>

Oops, the last test fails on my laptop: FAIL expected <?> got <?>. The error message is not very helpful, but it turns out that the default encoding used by Java on Mac OS X is MacRoman, and String.getBytes silently replaces any character the encoding cannot represent with a question mark. Silent corruption may make things easier, but it is normally unwanted, especially when building an event store!
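The replacement behaviour can be reproduced on any machine, whatever its default encoding. The sketch below assumes US-ASCII as a stand-in for MacRoman (an assumption, chosen because every JVM ships it and it also lacks U+2192); lossyRoundTrip is a hypothetical helper name.

```scala
import java.nio.charset.StandardCharsets

// Assumption: US-ASCII stands in for MacRoman; neither can represent U+2192.
def lossyRoundTrip(string: String): String = {
  // getBytes never fails: unmappable characters are replaced by the charset's '?' byte.
  val bytes = string.getBytes(StandardCharsets.US_ASCII)
  new String(bytes, StandardCharsets.US_ASCII)
}

val arrow = "\u2192"                   // RIGHTWARDS ARROW
val corrupted = lossyRoundTrip(arrow)  // "?"
val untouched = lossyRoundTrip("foo")  // "foo"
```

No exception is raised at any point, which is exactly what makes this kind of corruption easy to miss.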

Let’s try again. This time we explicitly use the UTF-8 encoding:

// try2.scala
def serialize(string: String) = string.getBytes("UTF-8")
def deserialize(bytes: Array[Byte]) = new String(bytes, "UTF-8")

// [... test code unchanged ...]

Running this example succeeds, even with the default encoding set to MacRoman. But these little interactions showed us that serializing a string to an array of bytes may not be as easy as originally expected. Are there any other strings that fail to serialize and deserialize? And how can we find these?

Fortunately, there is a powerful tool available for this: ScalaCheck. ScalaCheck generates random test examples and uses these to test programmer-specified propositions. This sounds more complicated than it is in practice. So let’s try an example:

{% highlight scala linenos %}
// try3.scala
import org.scalacheck._

def serialize(string: String) = string.getBytes("UTF-8")
def deserialize(bytes: Array[Byte]) = new String(bytes, "UTF-8")

val prop_serializes_all_strings = Prop.forAll { s: String =>
  s == deserialize(serialize(s))
}

prop_serializes_all_strings.check
{% endhighlight %}

Here we reuse the same string serialization code from the previous example, but instead of manually thinking of examples to test we use ScalaCheck’s Prop.forAll method to create a proposition. You can basically read lines 7 to 9 as: “For all strings s, s must be equal to the result of serializing and deserializing s”. On the last line we tell ScalaCheck to check this proposition. ScalaCheck does so by generating random strings and passing these to our proposition. Download the ScalaCheck jar and try it:

$ scala -cp scalacheck_2.8.1-1.8.jar try3.scala
! Falsified after 15 passed tests.
> ARG_0: ?

Not good. It only took 15 tries for ScalaCheck to find a counterexample, which ScalaCheck prints on line 3 above. Unfortunately, due to the limits of current font technology, unknown characters are not clearly printed. So we know there is at least one counterexample, but we do not know yet what it is.
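The idea behind this kind of randomized checking can be sketched by hand. The following is not how ScalaCheck is implemented; it is a rough stand-in built on scala.util.Random, where randomString and findCounterexample are hypothetical helpers and the seed, string length, and number of tries are arbitrary choices.

```scala
import scala.util.Random

def serialize(string: String): Array[Byte] = string.getBytes("UTF-8")
def deserialize(bytes: Array[Byte]): String = new String(bytes, "UTF-8")

// Hypothetical helper: a string of up to maxLength arbitrary 16-bit chars.
def randomString(random: Random, maxLength: Int): String = {
  val length = random.nextInt(maxLength + 1)
  new String(Array.fill(length)(random.nextInt(0x10000).toChar))
}

// Returns the first generated string that does not survive the round trip, if any.
def findCounterexample(tries: Int, seed: Long): Option[String] = {
  val random = new Random(seed)
  Iterator.continually(randomString(random, 20))
    .take(tries)
    .find(s => s != deserialize(serialize(s)))
}

val counterexample = findCounterexample(tries = 1000, seed = 42L)
```

ScalaCheck adds a great deal on top of this, such as generators for many types and configurable test parameters. Like ScalaCheck, though, this sketch only tells us that a counterexample exists; printing the string itself is just as unhelpful.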

To help us with this, ScalaCheck allows us to label a test case. Labeling our proposition with the integral values of the string contents should provide us with more information:

// try4.scala
import org.scalacheck._, Prop._

def serialize(string: String) = string.getBytes("UTF-8")
def deserialize(bytes: Array[Byte]) = new String(bytes, "UTF-8")

def decode(string: String) = "characters = " + string.map(_.toInt).mkString(",")

val prop_serializes_all_strings = Prop.forAll { s: String =>
  decode(s) |: s == deserialize(serialize(s))
}

prop_serializes_all_strings.check

Here a call to the new function decode (line 7) is attached as a label to the prop_serializes_all_strings proposition using ScalaCheck’s |: operator (line 10).

Running this version a couple of times gives us some ideas about what is going wrong:

$ scala -cp scalacheck_2.8.1-1.8.jar try4.scala
! Falsified after 11 passed tests.
> Labels of failing property:
characters = 56465
> ARG_0: ?

It turns out our proposition fails for strings containing characters in the range \uD800 to \uDFFF. According to the Unicode reference, these are the leading- and trailing-surrogate ranges reserved for the UTF-16 encoding, which is the encoding Java Strings use internally. A surrogate is only meaningful as one half of a leading/trailing pair, so it looks like ScalaCheck generates Strings with unpaired surrogates, which are not valid UTF-16, and java.lang.String happily accepts them. [NOTE: ScalaCheck 1.9 no longer generates strings containing leading or trailing surrogates. Generated Strings will therefore never contain characters from the Supplementary Planes, but at least all generated strings are valid UTF-16.]
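The failure can be reproduced without ScalaCheck by building the reported counterexample directly. This is a small sketch using only the JDK; the value names are illustrative.

```scala
def serialize(string: String): Array[Byte] = string.getBytes("UTF-8")
def deserialize(bytes: Array[Byte]): String = new String(bytes, "UTF-8")

// The counterexample reported above: a single char with value 56465 (0xDC91).
val loneSurrogate: String = 56465.toChar.toString

// 0xDC91 lies in the trailing (low) surrogate range 0xDC00 to 0xDFFF.
val isTrailingSurrogate = Character.isLowSurrogate(loneSurrogate.charAt(0))

// getBytes cannot encode an unpaired surrogate and substitutes a single '?' byte.
val encoded = serialize(loneSurrogate)
val survives = deserialize(encoded) == loneSurrogate // false
```

The String itself is constructed without complaint; the damage only happens at encoding time.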

In our case, we’re only interested in serializing valid Unicode strings. But since we should never silently corrupt data, it would be better to fail with an exception when an illegal string is passed to our serialize implementation. The Java String class does not provide a method for this, so we have to use a CharsetEncoder instead.

Unfortunately, this requires digging into some of the lower-level APIs provided by Java NIO. Here’s the end result:

import java.nio.{ByteBuffer, CharBuffer}
import java.nio.charset.Charset
import org.scalacheck._, Prop._

def encoder = Charset.forName("UTF-8").newEncoder
def decoder = Charset.forName("UTF-8").newDecoder

def serialize(string: String) = {
  val bytes = encoder.encode(CharBuffer.wrap(string))
  bytes.array.slice(bytes.position, bytes.limit)
}
def deserialize(bytes: Array[Byte]) = decoder.decode(ByteBuffer.wrap(bytes)).toString

def decode(string: String) = "characters = " + string.map(_.toInt).mkString(",")

val prop_serializes_all_strings = Prop.forAll { s: String =>
  encoder.canEncode(s) ==> (decode(s) |: s == deserialize(serialize(s)))
}

prop_serializes_all_strings.check

The serialization code has been changed to directly use the Charset encoder and decoder and the proposition is changed to include a guard condition using the ScalaCheck ==> operator: the proposition only holds for strings that can be encoded. Let’s run this:

$ scala -cp scalacheck_2.8.1-1.8.jar try5.scala
+ OK, passed 100 tests.

Finally, success! Well, mostly. Occasionally ScalaCheck will give up:

$ scala -Dfile.encoding=UTF-8 -cp scalacheck_2.8.1-1.8.jar try5.scala
! Gave up after only 98 passed tests. 500 tests were discarded.

This happens because our guard condition discards strings which cannot be encoded. By default, ScalaCheck will try to generate 100 passing examples, but will not discard more than 500 examples when trying to achieve this. One way to fix this is to add some parameters to the check invocation:

prop_serializes_all_strings.check(Test.Params(minSuccessfulTests = 50, maxDiscardedTests = 1000))

Now it becomes very unlikely that ScalaCheck fails to find enough successful tests.

Another approach is to define a custom generator that only generates valid Unicode strings. Generating valid UTF-16 Unicode strings is a little tricky, so the code is a bit long:

{% highlight scala linenos %}
// [... imports same as before ...]

val UnicodeLeadingSurrogate = '\uD800' to '\uDBFF'
val UnicodeTrailingSurrogate = '\uDC00' to '\uDFFF'
val UnicodeBasicMultilingualPlane = ('\u0000' to '\uFFFF').diff(UnicodeLeadingSurrogate).diff(UnicodeTrailingSurrogate)

val unicodeCharacterBasicMultilingualPlane: Gen[String] = Gen.oneOf(UnicodeBasicMultilingualPlane).map(_.toString)
val unicodeCharacterSupplementaryPlane: Gen[String] =
  for {
    c1 <- Gen.oneOf(UnicodeLeadingSurrogate)
    c2 <- Gen.oneOf(UnicodeTrailingSurrogate)
  } yield { c1.toString + c2.toString }

val unicodeCharacter = Gen.frequency(
  9 -> unicodeCharacterBasicMultilingualPlane,
  1 -> unicodeCharacterSupplementaryPlane)

val unicodeString = Gen.listOf(unicodeCharacter).map(_.mkString)

// [... serialization code same as before ...]

val prop_serializes_all_strings = Prop.forAll(unicodeString) { s: String =>
  decode(s) |: s == deserialize(serialize(s))
}

prop_serializes_all_strings.check
{% endhighlight %}

On lines 3-5 we define three character collections that will be used to construct correct Unicode data.

On line 7 we define a generator for generating a single Unicode character from the Basic Multilingual Plane (BMP). It basically picks a random element from the UnicodeBasicMultilingualPlane collection and converts it into a (single character) String.

Lines 8-12 define a generator for Unicode characters from the supplementary planes. It generates a two character string with the first character taken from the leading surrogate range and the second character taken from the trailing surrogate range. The code uses a for-comprehension since generators are allowed to fail (but in our case will never do so).

Lines 14-16 combine these two generators to produce a new generator that generates a BMP character 9 times out of 10, and a supplementary plane character otherwise.

Finally, line 18 uses the previous generator to create a list of random but valid Unicode characters and concatenates this list into a single string. This is the generator we want to use in our proposition. This is shown on lines 22 to 24. Instead of guarding our proposition we now explicitly pass in the generator to use in our call to Prop.forAll. Our proposition now passes without needing to discard any examples, so we can remove the check parameters.
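To see why a leading surrogate followed by a trailing surrogate counts as one valid character, a small check using only the JDK may help. The value names are illustrative, and the pair \uD800 \uDC00 is picked because it encodes U+10000, the first code point outside the BMP.

```scala
import java.nio.charset.StandardCharsets

val leading = '\uD800'   // first char of the leading surrogate range
val trailing = '\uDC00'  // first char of the trailing surrogate range
val pair = new String(Array(leading, trailing))

// Two UTF-16 chars, but a single Unicode code point.
val codePoint = Character.toCodePoint(leading, trailing)  // 0x10000
val codePoints = pair.codePointCount(0, pair.length)      // 1

// Supplementary code points take four bytes in UTF-8 and survive the round trip.
val utf8 = pair.getBytes(StandardCharsets.UTF_8)
val roundTrips = new String(utf8, StandardCharsets.UTF_8) == pair
```

Any combination of one leading and one trailing surrogate, in that order, is well formed in the same way, which is why the generator can pick the two halves independently.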

Conclusion

With traditional unit tests it is hard to come up with good data that actually exposes bugs, especially the unexpected kind of bugs that only seem to crop up in production. Using ScalaCheck we can let the computer come up with random data to test our code. ScalaCheck includes standard generators for the usual suspects such as Int, String, etc. It’s also easy to add and use custom generators for existing or new data types.

ScalaCheck is not a replacement for traditional tests. Explicit examples are still useful as documentation and we should also add a test to the above code to ensure an exception is thrown when we try to serialize a String that is not valid Unicode, as our proposition only tests the behavior for valid Strings. But ScalaCheck based testing can help us find those unexpected bugs (really, the most common kind) before they get into production code. Trying to find the silent corruption bug of the first few serialization attempts would be a lot harder in a production environment.
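The explicit example test mentioned above could look roughly like this. It is a sketch that does not assume any particular test framework; rejects is a hypothetical helper, and serialize repeats the CharsetEncoder-based version from earlier with an explicit return type.

```scala
import java.nio.CharBuffer
import java.nio.charset.{CharacterCodingException, StandardCharsets}
import scala.util.Try

def serialize(string: String): Array[Byte] = {
  // A fresh encoder per call; by default it reports malformed input instead of replacing it.
  val bytes = StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(string))
  bytes.array().slice(bytes.position(), bytes.limit())
}

// Hypothetical helper: true when serialize fails with a CharacterCodingException.
def rejects(string: String): Boolean =
  Try(serialize(string)).failed.toOption.exists(_.isInstanceOf[CharacterCodingException])

val unpairedLeading = 0xD800.toChar.toString
val unpairedTrailing = 0xDC00.toChar.toString
val swappedPair = unpairedTrailing + unpairedLeading  // wrong order, still invalid

val invalidInputsRejected = Seq(unpairedLeading, unpairedTrailing, swappedPair).forall(rejects)
val validInputsAccepted = !rejects("foo") && !rejects(unpairedLeading + unpairedTrailing)
```

Both values should be true; in a real test suite each of these inputs would become its own named example.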
