This library defines a String class that stores a non-null-terminated style character code array in UTF-8. Inventing a new string class may look like wasted effort, and you may not want to learn another one, but we found several security concerns in the design of the ISO C++ std::string class. The String class in this library zero-clears the memory used in string operations to reduce the risk of sensitive data being read from memory (for example, by dumping memory blocks).
If you need to keep sensitive information (such as passwords) in memory for a relatively long time, use SecureString instead of String. SecureString encrypts the string to keep it secure.
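The zero-clearing idea can be sketched in standard C++. This is only an illustration under stated assumptions, not this library's implementation: `secureZero` is a hypothetical helper, and writing through a `volatile` pointer is one common way to discourage the compiler from removing the wipe as a dead store.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of this library): overwrite a buffer with zeros.
// Writing through a volatile pointer makes each store an observable access,
// so the compiler is not free to drop the loop as a dead store.
void secureZero(void* buffer, std::size_t size) {
    assert(buffer != nullptr || size == 0);
    volatile unsigned char* p = static_cast<volatile unsigned char*>(buffer);
    for (std::size_t i = 0; i < size; ++i) {
        p[i] = 0;
    }
}
```

Where available, platform facilities such as SecureZeroMemory on Windows or explicit_bzero on some Unix-like systems serve the same purpose.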
The String class has several constructors and can be initialized with a char or wchar_t string. The actual charset and encoding of a char string usually depend on the locale and the platform; when a String instance is initialized with a char string, the string is automatically converted into UTF-8 (of UCS-4) using the system function.
Similarly, the charset and encoding of wchar_t are environment dependent. On Windows, a wchar_t string uses UTF-16, and you can convert a wchar_t string into a String instance as in the following code:
Since wchar_t is not identical to UTF-16 in general, you should not use wchar_t in non-Windows environments. Use UChar2 for UTF-16 (of UCS-4) strings and UChar4 for UTF-32 (of UCS-4) strings. You can assume that UChar2 is always 16 bits and UChar4 is always 32 bits (regardless of the endianness). The code below illustrates how to use them:
By using UChar2 and UChar4, you can embed Unicode strings without depending on the platform-dependent encoding or charset.
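To see why a 16-bit unit and a 32-bit unit are not interchangeable, the following sketch encodes one code point as UTF-16. It uses the standard types char16_t and char32_t as stand-ins; treating them as analogous to UChar2 and UChar4 is an assumption, and `encodeUtf16` is a hypothetical helper rather than a function of this library.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode a single Unicode code point as UTF-16 code units.
// Code points above U+FFFF need two 16-bit units (a surrogate pair), while a
// 32-bit unit can always hold the code point directly.
std::u16string encodeUtf16(char32_t codePoint) {
    // Precondition: a valid scalar value (in range, not a surrogate).
    assert(codePoint <= 0x10FFFF);
    assert(codePoint < 0xD800 || codePoint > 0xDFFF);

    std::u16string out;
    if (codePoint < 0x10000) {
        out.push_back(static_cast<char16_t>(codePoint));
    } else {
        const char32_t offset = codePoint - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF))); // low surrogate
    }
    return out;
}
```

For example, U+0041 fits in one 16-bit unit, whereas U+1F600 becomes the pair 0xD83D 0xDE00.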
There is also a String constructor that receives a UTF-8 string. Since a UTF-8 string is also expressed as a char string, the UTF-8 constructor must be called explicitly using the utf8s proxy object:
We define the NULL_STRING enumeration type and the NullString value to call a String constructor that initializes the instance with an empty string. The following code is a sample use of them:
You can also use NullString as the default value for function parameters:
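The general pattern of selecting a constructor with an enumeration value can be sketched as follows. `EmptyTag`, `Empty`, `Text` and `labelLength` are hypothetical names chosen for illustration; they only mimic the roles the text gives to NULL_STRING, NullString and String, and are not this library's declarations.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for illustration only.
enum EmptyTag { Empty };

class Text {
public:
    // Tag constructor: builds an empty value without any character data.
    Text(EmptyTag) {}

    // Ordinary constructor from a terminated char string.
    Text(const char* s) {
        assert(s != nullptr);
        value_ = s;
    }

    std::size_t size() const { return value_.size(); }

private:
    std::string value_;
};

// The tag value doubles as a default argument meaning "no text given".
std::size_t labelLength(const Text& label = Empty) {
    return label.size();
}
```

Calling `labelLength()` with no argument then behaves as if an empty string had been passed.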
You can also initialize a String from a string that is not terminated by '\0' or from a portion of a terminated string. The following code illustrates this:
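As a standard C++ analogy (using std::string, not this library's String), a pointer plus an explicit byte count is enough to build a string from a buffer that has no terminator, or from the leading part of a terminated one. `fromBuffer` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: copy exactly `length` bytes starting at `data`.
// No terminating '\0' is required because the length is given explicitly.
std::string fromBuffer(const char* data, std::size_t length) {
    assert(data != nullptr || length == 0);
    return length == 0 ? std::string() : std::string(data, length);
}
```

With this helper, a three-byte array holding 'a', 'b', 'c' yields "abc", and the first five bytes of "hello world" yield "hello".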
To initialize a String from a portion of another String, use the String::substring or String::substringByChar function:
The difference between String::substring and String::substringByChar is discussed in Character Count vs. String Length.
Concatenating several String instances is easy: just use the + operator:
You can also use traditional printf syntax with String through the format or format_utf8 function. The format function treats its parameters as normal char strings, and format_utf8 treats them as UTF-8 strings.
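For readers unfamiliar with printf-style formatting into a string object, here is a minimal standard C++ sketch. `formatText` is a hypothetical function built on std::vsnprintf; it does not reproduce the signatures or the charset handling of this library's format or format_utf8.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: printf-style formatting that returns a std::string.
std::string formatText(const char* fmt, ...) {
    assert(fmt != nullptr);

    va_list args;
    va_start(args, fmt);

    // First pass: ask vsnprintf how many characters are needed.
    va_list probe;
    va_copy(probe, args);
    const int needed = std::vsnprintf(nullptr, 0, fmt, probe);
    va_end(probe);
    if (needed < 0) {          // encoding error reported by vsnprintf
        va_end(args);
        return std::string();
    }

    // Second pass: format into a buffer with room for the terminator.
    std::vector<char> buffer(static_cast<std::size_t>(needed) + 1);
    std::vsnprintf(buffer.data(), buffer.size(), fmt, args);
    va_end(args);

    return std::string(buffer.data(), static_cast<std::size_t>(needed));
}
```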
The byte size of the string is obtained with the String::getLength function, and the character count with the String::getNumOfChars function. Neither the length nor the character count includes the terminating '\0' character.
The difference between the length and the character count is discussed in Character Count vs. String Length.
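The distinction can be made concrete with a small standard C++ sketch. `countCodePoints` is a hypothetical helper that assumes well-formed UTF-8; it counts every byte that is not a continuation byte (bit pattern 10xxxxxx), which gives the number of encoded characters, while `size()` gives the number of bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in a well-formed UTF-8 string.
std::size_t countCodePoints(const std::string& utf8) {
    // Precondition: the string starts on a character boundary.
    assert(utf8.empty() ||
           (static_cast<unsigned char>(utf8[0]) & 0xC0) != 0x80);

    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {  // lead byte or ASCII byte
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 bytes of "héllo" (where é is encoded as 0xC3 0xA9), the byte length is 6 and the character count is 5.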
String has several comparison operators, so you can easily compare two String instances:
There are also String::compare (case sensitive) and String::compareI (case insensitive) functions.
You can modify String instances with the [] operator:
The [] operator returns a reference to, or the value of, the entry at the specified offset address in the string. The drawback of writing a character through this operator is that, if the string is referenced by several instances, the whole string is duplicated in order to modify that portion. If you want to modify the whole string efficiently, see Allocate Character Array.
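The duplication cost described above is typical of copy-on-write sharing. The sketch below shows that general mechanism in standard C++; it is an assumption about how such behavior can arise, not a description of this library's internals, and `CowText` is a hypothetical class.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical copy-on-write text: copies share one buffer until a write occurs.
class CowText {
public:
    explicit CowText(const std::string& s)
        : data_(std::make_shared<std::string>(s)) {}

    char get(std::size_t offset) const {
        assert(offset < data_->size());
        return (*data_)[offset];
    }

    void set(std::size_t offset, char value) {
        assert(offset < data_->size());
        if (data_.use_count() > 1) {
            // Another instance shares the buffer: duplicate the whole string
            // before changing a single entry.
            data_ = std::make_shared<std::string>(*data_);
        }
        (*data_)[offset] = value;
    }

    bool sharesBufferWith(const CowText& other) const {
        return data_ == other.data_;
    }

    const std::string& str() const { return *data_; }

private:
    std::shared_ptr<std::string> data_;
};
```

Copying a `CowText` is cheap, but the first write to a shared instance pays for a full copy, which is why per-character writes can be expensive.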
There is also a way to get the character code at a specified character position. The () operator gets the UCS-4 character code at the specified UCS-4 character index:
The difference between offset address and UCS-4 character position is discussed in Character Count vs. String Length.
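Looking up a character by index in UTF-8 requires walking the string, because characters occupy one to four bytes. The following standard C++ sketch shows one way to do it; `sequenceLength` and `codePointAt` are hypothetical helpers that assume well-formed UTF-8 and are not this library's operators.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: byte length of a UTF-8 sequence, judged by its lead byte.
std::size_t sequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;  // assumes well-formed input, i.e. (lead & 0xF8) == 0xF0
}

// Hypothetical helper: decode the code point at a given character index.
char32_t codePointAt(const std::string& utf8, std::size_t charIndex) {
    // Skip whole characters until the requested index is reached.
    std::size_t pos = 0;
    for (std::size_t skipped = 0; skipped < charIndex; ++skipped) {
        assert(pos < utf8.size());
        pos += sequenceLength(static_cast<unsigned char>(utf8[pos]));
    }
    assert(pos < utf8.size());

    const unsigned char lead = static_cast<unsigned char>(utf8[pos]);
    const std::size_t length = sequenceLength(lead);
    assert(pos + length <= utf8.size());
    if (length == 1) return lead;

    // Keep the payload bits of the lead byte, then append 6 bits per
    // continuation byte.
    char32_t codePoint = lead & (0xFF >> (length + 1));
    for (std::size_t k = 1; k < length; ++k) {
        const unsigned char next = static_cast<unsigned char>(utf8[pos + k]);
        codePoint = (codePoint << 6) | (next & 0x3F);
    }
    return codePoint;
}
```

In the byte sequence for "a", "é", "€", character index 2 refers to U+20AC even though that character starts at byte offset 3.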
You can get a pointer to the raw UTF-8 string using String::c_str function:
The pointer is valid until you call any function that accesses the String instance.
There are several functions that eliminate unnecessary white space characters (spaces, tabs, and line feeds) from String instances:
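The effect of such trimming can be illustrated with std::string. `trimBoth` is a hypothetical helper, and its whitespace set (space, tab, carriage return, line feed) is an assumption rather than this library's definition.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: remove leading and trailing whitespace.
std::string trimBoth(const std::string& text) {
    const char* const whitespace = " \t\r\n";

    const std::size_t first = text.find_first_not_of(whitespace);
    if (first == std::string::npos) {
        return std::string();  // empty or whitespace only
    }
    const std::size_t last = text.find_last_not_of(whitespace);
    assert(last >= first);     // a non-whitespace character exists

    return text.substr(first, last - first + 1);
}
```

Interior whitespace is preserved; only the two ends are trimmed.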
If your string manipulation code works directly on a traditional character code array, you can use the allocate function. The following code illustrates how to use the function:
In this case, you must manipulate the UTF-8 string yourself.
There are also several String functions.
You can also use regular expressions for searching and matching strings. For more information, see Regular Expression.
Since the String class adopts UTF-8 as its intermediate charset, strings sometimes need to be converted into the platform-native charsets. This library provides the UtfConverter class and convenient String methods (String::toMbs, String::toWcs, String::toUcs2 and String::toUcs4) that wrap UtfConverter.
String::toMbs converts the String instance into a multibyte string pointed to by const char*.
String::toWcs converts the String instance into a wide character string pointed to by const wchar_t*.
String::toUcs2 converts the String instance into a UCS-2 character string pointed to by const UChar2*.
String::toUcs4 converts the String instance into a UCS-4 character string pointed to by const UChar4*.
On Windows, most programs can work with String::toMbs and/or String::toWcs. But if you use tchar.h for TCHAR, you can use TO_TCS to call the String::toMbs and String::toWcs methods indirectly. It behaves as String::toWcs if the UNICODE macro is defined, and as String::toMbs otherwise.
The following code illustrates how to use these macros:
In this library, there is a difference between character count and string length.
The character count is the number of UCS-4 characters in the string. The string length is the number of UChar1 entries in the string. Usually, neither includes the terminating null character ('\0').
In a 7-bit ASCII string, the difference does not matter because one UChar1 entry can hold the UCS-4 character code that corresponds to a 7-bit ASCII character.
The String::substring function extracts a substring based on a position and length counted in UChar1 entries, and the String::substringByChar function does the same based on a position and count in UCS-4 characters. Likewise, String::getLength returns the number of UChar1 entries and String::getNumOfChars returns the number of UCS-4 characters in the string (neither includes the terminating null).
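The two ways of addressing a substring can be compared in a standard C++ sketch. Here `std::string::substr` plays the role of byte-based extraction, and `sliceByCodePoints` is a hypothetical helper for character-based extraction; both assume well-formed UTF-8 and neither is this library's function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset at which the given character index starts.
// An index equal to the character count maps to the end of the string.
std::size_t byteOffsetOfChar(const std::string& utf8, std::size_t charIndex) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80) {  // start of a character
            if (seen == charIndex) return i;
            ++seen;
        }
    }
    assert(seen == charIndex);  // index must not exceed the character count
    return utf8.size();
}

// Hypothetical helper: extract `charCount` characters starting at `firstChar`.
std::string sliceByCodePoints(const std::string& utf8,
                              std::size_t firstChar,
                              std::size_t charCount) {
    const std::size_t begin = byteOffsetOfChar(utf8, firstChar);
    const std::size_t end = byteOffsetOfChar(utf8, firstChar + charCount);
    return utf8.substr(begin, end - begin);
}
```

For the UTF-8 bytes of "héllo", taking 2 bytes from offset 1 yields only "é", while taking 2 characters from character index 1 yields "él".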