   #PHP Manual Function Reference tanh mb_convert_case

   PHP Manual
   Prev  Next
   ______________________________________________________________________

LII. Multi-Byte String Functions

Introduction

   There are many languages in which all characters can be expressed by
   single byte. Multi-byte character codes are used to express many
   characters for many languages. mbstring is developed to handle
   Japanese characters. However, many mbstring functions are able to
   handle character encoding other than Japanese.

   A multi-byte character encoding represents single character with
   consecutive bytes. Some character encoding has shift(escape) sequences
   to start/end multi-byte character strings. Therefore, a multi-byte
   character string may be destroyed when it is divided and/or counted
   unless multi-byte character encoding safe method is used. This module
   provides multi-byte character safe string functions and other utility
   functions such as conversion functions.

   Since PHP is basically designed for ISO-8859-1, some multi-byte
   character encoding does not work well with PHP. Therefore, it is
   important to set mbstring.language to appropriate language (i.e.
   "Japanese" for Japanese) and mbstring.internal_encoding to a character
   encoding that works with PHP.

   PHP 4 Character Encoding Requirements

     * Per byte encoding
     * Single byte characters in range of 00h-7fh which is compatible
       with ASCII
     * Multi-byte characters without 00h-7fh

   These are examples of internal character encoding that works with PHP
   and does NOT work with PHP.

Character encodings work with PHP:
ISO-8859-*, EUC-JP, UTF-8

Character encodings do NOT work with PHP:
JIS, SJIS

   Character encoding, that does not work with PHP, may be converted with
   mbstring's HTTP input/output conversion feature/function.

     Note: SJIS should not be used for internal encoding unless the
     reader is familiar with parser/compiler, character encoding and
     character encoding issues.

     Note: If you use databases with PHP, it is recommended that you use
     the same character encoding for both database and internal encoding
     for ease of use and better performance.

     If you are using PostgreSQL, it supports character encoding that is
     different from backend character encoding. See the PostgreSQL
     manual for details.

Installation

   mbstring is an extended module. You must enable the module with the
   configure script. Refer to the Install section for details.

   The following configure options are related to the mbstring module.

     * --enable-mbstring=LANG: Enable mbstring functions. This option is
       required to use mbstring functions.
       As of PHP 4.3.0, mbstring extension provides enhanced support for
       Simplified Chinese, Traditional Chinese, Korean, and Russian in
       addition to Japanese. To enable that feature, you will have to
       supply either one of the following options to the LANG parameter;
       --enable-mbstring=cn for Simplified Chinese support,
       --enable-mbstring=tw for Traditional Chinese support,
       --enable-mbstring=kr for Korean support, --enable-mbstring=ru for
       Russian support, and --enable-mbstring=ja for Japanese support.
       Also --enable-mbstring=all is convenient for you to enable all the
       supported languages listed above.

     Note: Japanese language support is also enabled by
     --enable-mbstring without any options for the sake of backwards
     compatibility.
     * --enable-mbstr-enc-trans : Enable HTTP input character encoding
       conversion using mbstring conversion engine. If this feature is
       enabled, HTTP input character encoding may be converted to
       mbstring.internal_encoding automatically.

     Note: As of PHP 4.3.0, the option --enable-mbstr-enc-trans will be
     eliminated and replaced with mbstring.encoding_translation. HTTP
     input character encoding conversion is enabled when this is set to
     On (the default is Off).
     * --enable-mbregex: Enable regular expression functions with
       multibyte character support.

Runtime Configuration

   The behaviour of these functions is affected by settings in php.ini.

   Table 1. Multi-Byte String configuration options
   Name                          Default Changeable
   mbstring.language             NULL    PHP_INI_ALL
   mbstring.detect_order         NULL    PHP_INI_ALL
   mbstring.http_input           NULL    PHP_INI_ALL
   mbstring.http_output          NULL    PHP_INI_ALL
   mbstring.internal_encoding    NULL    PHP_INI_ALL
   mbstring.script_encoding      NULL    PHP_INI_ALL
   mbstring.substitute_character NULL    PHP_INI_ALL
   mbstring.func_overload        "0"     PHP_INI_SYSTEM
   mbstring.encoding_translation "0"     PHP_INI_ALL
   For further details and definition of the PHP_INI_* constants see
   ini_set().

   Here's a short explanation of the configuration directives.

     * mbstring.language defines default language used in mbstring. Note
       that this option defines mbstring.internal_encoding and
       mbstring.internal_encoding should be placed after
       mbstring.language in php.ini
     * mbstring.encoding_translation enables HTTP input character
       encoding detection and translation into internal character
       encoding.
     * mbstring.internal_encoding defines default internal character
       encoding.
     * mbstring.http_input defines default HTTP input character encoding.
     * mbstring.http_output defines default HTTP output character
       encoding.
     * mbstring.detect_order defines default character code detection
       order. See also mb_detect_order().
     * mbstring.substitute_character defines character to substitute for
       invalid character encoding.
     * mbstring.func_overloadoverload(replace) single byte functions by
       mbstring functions. mail(), ereg(), etc. are overloaded by
       mb_send_mail(), mb_ereg(), etc. Possible values are 0, 1, 2, 4 or
       a combination of them. For example, 7 for overload everything. 0:
       No overload, 1: Overload mail() function, 2: Overload str*()
       functions, 4: Overload ereg*() functions.

   Web Browsers are supposed to use the same character encoding when
   submitting form. However, browsers may not use the same character
   encoding. See mb_http_input() to detect character encoding used by
   browsers.

   If enctype is set to multipart/form-data in HTML forms, mbstring does
   not convert character encoding in POST data. The user must convert
   them in the script, if conversion is needed.

   Although, browsers are smart enough to detect character encoding in
   HTML. charset is better to be set in HTTP header. Change
   default_charset according to character encoding.

   Example 1. php.ini setting example
; Set default language
mbstring.language        = Neutral; Set default language to Neutral(UTF-8) (def
ault)
mbstring.language        = English; Set default language to English
mbstring.language        = Japanese; Set default language to Japanese

;; Set default internal encoding
;; Note: Make sure to use character encoding works with PHP
mbstring.internal_encoding    = UTF-8  ; Set internal encoding to UTF-8

;; HTTP input encoding translation is enabled.
mbstring.encoding_translation = On

;; Set default HTTP input character encoding
;; Note: Script cannot change http_input setting.
mbstring.http_input           = pass    ; No conversion.
mbstring.http_input           = auto    ; Set HTTP input to auto
                                ; "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP
,SJIS"
mbstring.http_input           = SJIS    ; Set HTTP2 input to  SJIS
mbstring.http_input           = UTF-8,SJIS,EUC-JP ; Specify order

;; Set default HTTP output character encoding
mbstring.http_output          = pass    ; No conversion
mbstring.http_output          = UTF-8   ; Set HTTP output encoding to UTF-8

;; Set default character encoding detection order
mbstring.detect_order         = auto    ; Set detect order to auto
mbstring.detect_order         = ASCII,JIS,UTF-8,SJIS,EUC-JP ; Specify order

;; Set default substitute character
mbstring.substitute_character = 12307   ; Specify Unicode value
mbstring.substitute_character = none    ; Do not print character
mbstring.substitute_character = long    ; Long Example: U+3000,JIS+7E7E

   Example 2. php.ini setting for EUC-JP users
;; Disable Output Buffering
output_buffering      = Off

;; Set HTTP header charset
default_charset       = EUC-JP

;; Set default language to Japanese
mbstring.language = Japanese

;; HTTP input encoding translation is enabled.
mbstring.encoding_translation = On

;; Set HTTP input encoding conversion to auto
mbstring.http_input   = auto

;; Convert HTTP output to EUC-JP
mbstring.http_output  = EUC-JP

;; Set internal encoding to EUC-JP
mbstring.internal_encoding = EUC-JP

;; Do not print invalid characters
mbstring.substitute_character = none

   Example 3. php.ini setting for SJIS users
;; Enable Output Buffering
output_buffering     = On

;; Set mb_output_handler to enable output conversion
output_handler       = mb_output_handler

;; Set HTTP header charset
default_charset      = Shift_JIS

;; Set default language to Japanese
mbstring.language = Japanese

;; Set http input encoding conversion to auto
mbstring.http_input  = auto

;; Convert to SJIS
mbstring.http_output = SJIS

;; Set internal encoding to EUC-JP
mbstring.internal_encoding = EUC-JP

;; Do not print invalid characters
mbstring.substitute_character = none

Resource Types

   This extension has no resource types defined.

Predefined Constants

   The constants below are defined by this extension, and will only be
   available when the extension has either been compiled into PHP or
   dynamically loaded at runtime.

   MB_OVERLOAD_MAIL (integer)

   MB_OVERLOAD_STRING (integer)

   MB_OVERLOAD_REGEX (integer)

HTTP Input and Output

   HTTP input/output character encoding conversion may convert binary
   data also. Users are supposed to control character encoding conversion
   if binary data is used for HTTP input/output.

     Note: For PHP 4.3.2 or earlier, if enctype for HTML form is set to
     multipart/form-data, mbstring does not convert character encoding
     in POST data. If it is the case, strings are needed to be converted
     to internal character encoding.

     Note: Since PHP 4.3.3, if enctype for HTML form is set to
     multipart/form-data, and, mbstring.encoding_translation is set to
     On in php.ini POST variables and uploaded filename will be
     converted to internal character encoding. But, characters specified
     in 'name' of HTML form will not be converted.

     * HTTP Input
       There is no way to control HTTP input character conversion from
       PHP script. To disable HTTP input character conversion, it has to
       be done in php.ini.

   Example 4. Disable HTTP input conversion in php.ini
   ;; Disable HTTP Input conversion
   mbstring.http_input = pass
   ;; Disable HTTP Input conversion (PHP 4.3.0 or higher)
   mbstring.encoding_translation = Off
       When using PHP as an Apache module, it is possible to override PHP
       ini setting per Virtual Host in httpd.conf or per directory with
       .htaccess. Refer to the Configuration section and Apache Manual
       for details.
     * HTTP Output
       There are several ways to enable output character encoding
       conversion. One is using php.ini, another is using ob_start() with
       mb_output_handler() as ob_start callback function.

     Note: For PHP3-i18n users, mbstring's output conversion differs
     from PHP3-i18n. Character encoding is converted using output
     buffer.

   Example 5. php.ini setting example
;; Enable output character encoding conversion for all PHP pages

;; Enable Output Buffering
output_buffering    = On

;; Set mb_output_handler to enable output conversion
output_handler      = mb_output_handler

   Example 6. Script example
   <?php
   // Enable output character encoding conversion only for this page
   // Set HTTP output character encoding to SJIS
   mb_http_output('SJIS');
   // Start buffering and specify "mb_output_handler" as
   // callback function
   ob_start('mb_output_handler');
   ?>

Supported Character Encodings

   Currently, the following character encoding is supported by the
   mbstring module. Character encoding may be specified for mbstring
   functions' encoding parameter.

   The following character encoding is supported in this PHP extension:

   UCS-4, UCS-4BE, UCS-4LE, UCS-2, UCS-2BE, UCS-2LE, UTF-32, UTF-32BE,
   UTF-32LE, UCS-2LE, UTF-16, UTF-16BE, UTF-16LE, UTF-8, UTF-7, ASCII,
   EUC-JP, SJIS, eucJP-win, SJIS-win, ISO-2022-JP, JIS, ISO-8859-1,
   ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6,
   ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-13,
   ISO-8859-14, ISO-8859-15, byte2be, byte2le, byte4be, byte4le, BASE64,
   7bit, 8bit and UTF7-IMAP.

   As of PHP 4.3.0, the following character encoding support will be
   added experimentally : EUC-CN, CP936, HZ, EUC-TW, CP950, BIG-5,
   EUC-KR, UHC (CP949), ISO-2022-KR, Windows-1251 (CP1251), Windows-1252
   (CP1252), CP866, KOI8-R.

   php.ini entry, which accepts encoding name, accepts "auto" and "pass"
   also. mbstring functions, which accepts encoding name, and accepts
   "auto".

   If "pass" is set, no character encoding conversion is performed.

   If "auto" is set, it is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS".

   See also mb_detect_order()

     Note: "Supported character encoding" does not mean that it works as
     internal character code.

Overloading PHP string functions with multi byte string functions

   Because almost PHP application written for language using single-byte
   character encoding, there are some difficulties for multibyte string
   handling including Japanese. Almost PHP string functions such as
   substr() do not support multibyte string.

   Multibyte extension (mbstring) has some PHP string functions with
   multibyte support (ex. substr() supports mb_substr()).

   Multibyte extension (mbstring) also supports 'function overloading' to
   add multibyte string functionality without code modification. Using
   function overloading, some PHP string functions will be overloaded
   multibyte string functions. For example, mb_substr() is called instead
   of substr() if function overloading is enabled. Function overload
   makes easy to port application supporting only single-byte encoding
   for multibyte application.

   mbstring.func_overload in php.ini should be set some positive value to
   use function overloading. The value should specify the category of
   overloading functions, should be set 1 to enable mail function
   overloading. 2 to enable string functions, 4 to regular expression
   functions. For example, if is set for 7, mail, strings, regex
   functions should be overloaded. The list of overloaded functions are
   shown in below.

   Table 2. Functions to be overloaded
   value of mbstring.func_overload original function overloaded function
   1                               mail()            mb_send_mail()
   2                               strlen()          mb_strlen()
   2                               strpos()          mb_strpos()
   2                               strrpos()         mb_strrpos()
   2                               substr()          mb_substr()
   2                               strtolower()      mb_strtolower()
   2                               strtoupper()      mb_strtoupper()
   2                               substr_count()    mb_substr_count()
   4                               ereg()            mb_ereg()
   4                               eregi()           mb_eregi()
   4                               ereg_replace()    mb_ereg_replace()
   4                               eregi_replace()   mb_eregi_replace()
   4                               split()           mb_split()

Basics of Japanese multi-byte characters

   Most Japanese characters need more than 1 byte per character. In
   addition, several character encoding schemes are used under a Japanese
   environment. There are EUC-JP, Shift_JIS(SJIS) and ISO-2022-JP(JIS)
   character encoding. As Unicode becomes popular, UTF-8 is used also. To
   develop Web applications for a Japanese environment, it is important
   to use the character set for the task in hand, whether HTTP
   input/output, RDBMS and E-mail.

     * Storage for a character can be up to six bytes
     * A multi-byte character is usually twice of the width compared to
       single-byte characters. Wider characters are called "zen-kaku" -
       meaning full width, narrower characters are called "han-kaku" -
       meaning half width. "zen-kaku" characters are usually fixed width.
     * Some character encoding defines shift(escape) sequence for
       entering/exiting multi-byte character strings.
     * ISO-2022-JP must be used for SMTP/NNTP.
     * "i-mode" web site is supposed to use SJIS.

References

   Multi-byte character encoding and its related issues are very complex.
   It is impossible to cover in sufficient detail here. Please refer to
   the following URLs and other resources for further readings.

     * Unicode/UTF/UCS/etc
       http://www.unicode.org/
     * Japanese/Korean/Chinese character information
       ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf

   Table of Contents
   mb_convert_case -- Perform case folding on a string
   mb_convert_encoding -- Convert character encoding
   mb_convert_kana --  Convert "kana" one from another ("zen-kaku",
          "han-kaku" and more)

   mb_convert_variables -- Convert character code in variable(s)
   mb_decode_mimeheader -- Decode string in MIME header field
   mb_decode_numericentity --  Decode HTML numeric string reference to
          character

   mb_detect_encoding -- Detect character encoding
   mb_detect_order --  Set/Get character encoding detection order
   mb_encode_mimeheader -- Encode string for MIME header
   mb_encode_numericentity --  Encode character to HTML numeric string
          reference

   mb_ereg_match --  Regular expression match for multibyte string
   mb_ereg_replace -- Replace regular expression with multibyte support
   mb_ereg_search_getpos --  Returns start point for next regular
          expression match

   mb_ereg_search_getregs --  Retrieve the result from the last multibyte
          regular expression match

   mb_ereg_search_init --  Setup string and regular expression for
          multibyte regular expression match

   mb_ereg_search_pos --  Return position and length of matched part of
          multibyte regular expression for predefined multibyte string

   mb_ereg_search_regs --  Returns the matched part of multibyte regular
          expression

   mb_ereg_search_setpos --  Set start point of next regular expression
          match

   mb_ereg_search --  Multibyte regular expression match for predefined
          multibyte string

   mb_ereg -- Regular expression match with multibyte support
   mb_eregi_replace --  Replace regular expression with multibyte support
          ignoring case

   mb_eregi --  Regular expression match ignoring case with multibyte
          support

   mb_get_info -- Get internal settings of mbstring
   mb_http_input -- Detect HTTP input character encoding
   mb_http_output -- Set/Get HTTP output character encoding
   mb_internal_encoding --  Set/Get internal character encoding
   mb_language --  Set/Get current language
   mb_output_handler --  Callback function converts character encoding in
          output buffer

   mb_parse_str --  Parse GET/POST/COOKIE data and set global variable
   mb_preferred_mime_name -- Get MIME charset string
   mb_regex_encoding --  Returns current encoding for multibyte regex as
          string

   mb_regex_set_options --  Set/Get the default options for mbregex
          functions

   mb_send_mail --  Send encoded mail.
   mb_split -- Split multibyte string using regular expression
   mb_strcut -- Get part of string
   mb_strimwidth -- Get truncated string with specified width
   mb_strlen -- Get string length
   mb_strpos --  Find position of first occurrence of string in a string
   mb_strrpos --  Find position of last occurrence of a string in a
          string

   mb_strtolower -- Make a string lowercase
   mb_strtoupper -- Make a string uppercase
   mb_strwidth -- Return width of string
   mb_substitute_character -- Set/Get substitution character
   mb_substr_count -- Count the number of substring occurrences
   mb_substr -- Get part of string
   ______________________________________________________________________

   Prev Home            Next
   tanh  Up  mb_convert_case
