Thursday, December 17, 2009

Reading UTF-8 encoded files faster on Android

I created an Android application that reads some text files from a raw resource. The text files are encoded in UTF-8, so I started with the straightforward code to convert bytes from the file into characters:

InputStreamReader in = new InputStreamReader(new BufferedInputStream(resources.openRawResource(R.raw.textfile)));
int c = in.read(); // read a character, and so on.

But reading a 10 KB file takes almost a minute on the Android 1.5 emulator! I wondered what made it so slow: on my Nokia phone, the same program written in Java ME takes less than a second to do the same thing.

Using Traceview, I found that most of the time is spent on UTF-8 decoding from bytes to characters. Android's Java implementation uses IBM ICU for character encoding, which seems to be overkill for just decoding UTF-8. Hence, the solution is to write my own UTF-8 decoder. (Some concepts are taken from the Go source, minus the error-checking overhead, and it only handles characters of at most 16 bits, i.e. sequences of up to 3 bytes.)

public class Utf8Reader implements Closeable {
    private InputStream in_;
    public static final char replacementChar = 0xFFFD;

    public Utf8Reader(InputStream in) {
        in_ = in;
    }

    public int read() throws IOException {
        int c0 = in_.read();

        if (c0 == -1) {
            // EOF
            return -1;
        }

        if (c0 < 0x80) {
            // input 1 byte, output 7 bit
            return c0;
        }

        int c1 = in_.read();

        if (c1 == -1) {
            // partial EOF
            return -1;
        }

        if (c0 < 0xe0) {
            // input 2 byte, output 5+6 = 11 bit
            return ((c0 & 0x1f) << 6) | (c1 & 0x3f);
        }

        int c2 = in_.read();

        if (c2 == -1) {
            // partial EOF
            return -1;
        }

        // input 3 byte, output 4+6+6 = 16 bit (4-byte sequences are not handled)
        return ((c0 & 0x0f) << 12) | ((c1 & 0x3f) << 6) | (c2 & 0x3f);
    }

    @Override
    public void close() throws IOException {
        in_.close();
    }
}

(Add the required imports yourself: java.io.Closeable, java.io.IOException, and java.io.InputStream.) The result is satisfying: the 10 KB file now loads in about 1 second on the emulator, and almost instantly on the device.
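
As a quick sanity check of the bit arithmetic, here is a minimal standalone sketch. The class name Utf8Demo and the helpers decodeOne and readAll are hypothetical names made up for this example; decodeOne simply repeats the three branches of Utf8Reader.read() so the snippet runs without the class above. It decodes "A", "é" (C3 A9) and "€" (E2 82 AC) from a byte array and prints the resulting code points.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class Utf8Demo {
    // Hypothetical helper: same three branches as Utf8Reader.read(),
    // repeated here only so that this sketch is self-contained.
    public static int decodeOne(InputStream in) throws IOException {
        int c0 = in.read();
        if (c0 == -1 || c0 < 0x80) {
            return c0; // EOF, or a 1-byte (ASCII) character
        }
        int c1 = in.read();
        if (c1 == -1) {
            return -1; // stream ended in the middle of a sequence
        }
        if (c0 < 0xe0) {
            // 2-byte sequence: 5 bits from c0, 6 bits from c1
            return ((c0 & 0x1f) << 6) | (c1 & 0x3f);
        }
        int c2 = in.read();
        if (c2 == -1) {
            return -1; // stream ended in the middle of a sequence
        }
        // 3-byte sequence: 4 + 6 + 6 bits
        return ((c0 & 0x0f) << 12) | ((c1 & 0x3f) << 6) | (c2 & 0x3f);
    }

    // Hypothetical helper: read until EOF and collect the characters.
    public static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = decodeOne(in)) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = {0x41, (byte) 0xC3, (byte) 0xA9,
                        (byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        String s = readAll(new ByteArrayInputStream(bytes));
        for (int i = 0; i < s.length(); i++) {
            // prints U+0041, U+00E9, U+20AC (one per line)
            System.out.printf("U+%04X%n", (int) s.charAt(i));
        }
    }
}
```

In the application itself, the same loop would call Utf8Reader.read() on the raw resource stream instead. Since the decoder pulls one byte at a time, it is probably worth keeping the BufferedInputStream wrapper from the first snippet around the resource stream.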