Counting the characters in a Unicode string

//by Richard Russell, March 2010//

//BBC BASIC for Windows// provides native support for the Unicode Basic Multilingual Plane, allowing you to work with, output and print a wide range of foreign-language and other character sets with very little extra effort. The main Help documentation describes how to enable Unicode support.

The Unicode encoding used by //BBC BASIC for Windows// is UTF-8. This is used in preference to other encodings (for example UTF-16) for the following reasons:
 * UTF-8 is represented as a //byte stream//, which is compatible with BBC BASIC's string variables, functions and operators.
 * Regular 7-bit ASCII text is represented identically in UTF-8 and ANSI, making it extremely easy to work with such text.
 * UTF-8 is compatible with BBC BASIC's VDU codes; you can mix UTF-8 text and VDU sequences in the same string and PRINT them.
 * You can embed UTF-8 text within a program as string constants or DATA statements (although they will not display as expected in the program editor); UTF-16 cannot be used in this way.
 * UTF-8 has only one version, whereas UTF-16 is byte-order dependent (it has little-endian and big-endian versions).
 * UTF-8 is the preferred Unicode encoding for emails and web pages.

UTF-8 has only one significant disadvantage compared with UTF-16: it is a variable-length encoding. That means you cannot determine the number of characters in a string using the **LEN** function (it returns the length in bytes, not in characters). Similarly, the **COUNT** function and features that depend on it (i.e. the **WIDTH** statement and the **TAB(x)** function) won't necessarily work as expected. Note that in any case COUNT, WIDTH and TAB(x) aren't generally useful when a **proportionally spaced** font is in use.
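
As a small illustration of the difference between bytes and characters, the following sketch (not part of the original article) builds the four-character word "café" from its UTF-8 bytes; the "é" (U+00E9) occupies the two bytes &C3 and &A9. The variable name U$ is chosen purely for illustration:
code format="bb4w"
        REM "café" as UTF-8: three ASCII bytes followed by the two-byte sequence for é
        U$ = "caf" + CHR$(&C3) + CHR$(&A9)
        REM LEN counts bytes, so this prints 5 although the string has 4 characters
        PRINT LEN(U$)
code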

To overcome this disadvantage the function **FNulen** is listed below. It takes a Unicode (UTF-8) string as its parameter and returns the length of the string in characters:
code format="bb4w"
        DEF FNulen(U$)
        LOCAL L%
        REM Windows code page identifier for UTF-8 (a global, also used by PROCuextent)
        CP_UTF8 = 65001
        REM With no output buffer supplied, the call returns the number of wide characters needed
        SYS "MultiByteToWideChar", CP_UTF8, 0, U$, LEN(U$), 0, 0 TO L%
        = L%
code
If passed a string containing only 7-bit ASCII text, the function will return the same value as **LEN(U$)**.
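
For comparison, the same count can be sketched in plain BASIC without calling the Windows API. The function below is a hypothetical alternative, not part of the original article, and the name **FNulen_basic** is invented here. It assumes the string is well-formed UTF-8 and counts every byte that is not a continuation byte (continuation bytes have the binary form 10xxxxxx):
code format="bb4w"
        DEF FNulen_basic(U$)
        LOCAL I%, N%
        REM A FOR loop always runs at least once, so handle the empty string first
        IF U$ = "" THEN = 0
        FOR I% = 1 TO LEN(U$)
          REM Count lead bytes and ASCII bytes; skip continuation bytes (&80 to &BF)
          IF (ASC(MID$(U$, I%, 1)) AND &C0) <> &80 THEN N% += 1
        NEXT
        = N%
code
For the five-byte string "caf" + CHR$(&C3) + CHR$(&A9) this returns 4. Unlike **FNulen** it performs no validation, so malformed input may give a different result from the API-based version.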

If you need to know the **extent** (that is, the physical width and height) of a Unicode (UTF-8) string, for example to centre it on the screen or a printout, you can use the following procedure:
code format="bb4w"
        DEF PROCuextent(hdc%, U$, size{})
        LOCAL L%, U%
        L% = FNulen(U$)
        REM Reserve a buffer for L% 16-bit characters and align it to an even address
        DIM U% LOCAL 2*L%
        U% = (U% + 1) AND -2
        REM Convert the UTF-8 string to UTF-16, then measure the converted text
        SYS "MultiByteToWideChar", CP_UTF8, 0, U$, LEN(U$), U%, L%
        SYS "GetTextExtentPoint32W", hdc%, U%, L%, size{}
        ENDPROC
code
If passed a string containing only 7-bit ASCII text, the procedure will return the same size as **GetTextExtentPoint32**.
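
A possible way of calling the procedure is sketched below; it is not part of the original article. It assumes that Unicode support has already been enabled as described in the Help documentation, that the structure passed as size{} has two 32-bit members (named cx% and cy% here, matching the Windows SIZE structure), that **@memhdc%** is the device context of the output window, and that **@vdu%!208** holds the window width in pixels. The variable names U$ and X% are illustrative only:
code format="bb4w"
        REM Structure to receive the width (cx%) and height (cy%) in pixels
        DIM size{cx%, cy%}
        REM "café" as UTF-8 bytes
        U$ = "caf" + CHR$(&C3) + CHR$(&A9)
        PROCuextent(@memhdc%, U$, size{})
        PRINT "Width in pixels:  "; size.cx%
        PRINT "Height in pixels: "; size.cy%
        REM Left edge, in pixels, that would centre the string (assumes @vdu%!208 is the window width)
        X% = (@vdu%!208 - size.cx%) DIV 2
code
The result depends on the font currently selected into the device context, so the measurement should be made after any font change.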