The simplest character set used on computers is ASCII, which uses a 7-bit numerical value (normally stored in one 8-bit byte) to represent the letters of the English alphabet, digits, punctuation, and control characters. Accented characters, such as those seen in French writing, are not part of ASCII itself; they come from 8-bit extensions such as ISO 8859-1 (Latin-1), which use the remaining values of the byte. Almost all modern operating systems, and many older computer systems, support ASCII.
Unicode takes this concept of a numerical value representing a character and extends it to cover the alphabets of (virtually) all the known languages in the world. This is more than 100,000 characters, so UTF-8, the most widely used Unicode encoding, uses 1-, 2-, 3- and even 4-byte sequences to represent them:
- 1-byte sequences cover the ASCII characters, including the simple English alphabet;
- 2-byte sequences cover the more common alphabets, including Arabic, Armenian, Cyrillic, Greek, Hebrew, accented Latin, and Syriac;
- 3-byte sequences cover most other scripts in everyday use, including Chinese, Japanese, and Korean;
- 4-byte sequences cover rarer and historic scripts, so they are not used often.
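The byte lengths in the list above can be checked from any POSIX shell. The following is a minimal sketch, not part of the original setup: utf8_bytes is a hypothetical helper name, and the four sample characters (A, é, €, and the musical G clef) are chosen here only as examples of each length. The characters are written as octal escapes so the result does not depend on the terminal's current encoding.

```sh
# Hypothetical helper: print how many bytes a printf-style string occupies.
# wc -c counts bytes, not characters; the arithmetic expansion strips the
# leading padding that some wc implementations (including BSD's) add.
utf8_bytes() {
    echo "$(( $(printf "$1" | wc -c) ))"
}

utf8_bytes 'A'                  # U+0041, plain ASCII
utf8_bytes '\303\251'           # U+00E9, e with acute accent
utf8_bytes '\342\202\254'       # U+20AC, euro sign
utf8_bytes '\360\235\204\236'   # U+1D11E, musical G clef
```

Run as written, the four calls print 1, 2, 3 and 4.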
In addition to the characters themselves, the standard also defines directionality, that is, which way the text flows. Western languages are typically written left-to-right (as per the text on this page), while others, typically Middle Eastern languages such as Arabic and Hebrew, are written right-to-left.
ASCII uses one byte per character, so a 100-letter document would (theoretically) be 100 bytes on disk. A Unicode document of the same length could be 2, 3 or 4 times that size, depending on the encoding used and the characters it contains. UTF-8 is backwards compatible with ASCII: text consisting only of ASCII characters is encoded as the same single bytes, so a 100-letter ASCII document is still 100 bytes in UTF-8.
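The size comparison can be reproduced with a short shell sketch. The helper name repeated_size is hypothetical, and é (written as the octal escape of its two UTF-8 bytes) is just one example of a non-ASCII letter; a document made of 3- or 4-byte characters would grow accordingly.

```sh
# Hypothetical helper: repeat a printf-style string N times and print the
# total size in bytes, as the text would occupy on disk in UTF-8.
repeated_size() {
    count=$(
        i=0
        while [ "$i" -lt "$2" ]; do
            printf "$1"
            i=$((i + 1))
        done | wc -c
    )
    # Arithmetic expansion removes any padding around the wc output.
    echo "$(( $count ))"
}

repeated_size 'a' 100          # 100 ASCII letters
repeated_size '\303\251' 100   # 100 copies of U+00E9 (e with acute accent)
```

The first call prints 100 and the second prints 200: the ASCII text costs nothing extra in UTF-8, while each accented letter takes two bytes.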
Using UTF-8 in FreeBSD
First, find out which locales support UTF-8, since the LANG and LC_ALL variables must be set to one of them.
cd /usr/share/locale/; ls -d *UTF-8
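Add the following environment variables to the appropriate startup file for your shell, such as ~/.profile or ~/.bashrc. The export syntax below is for sh-style shells; ~/.login is read by csh and tcsh, which use setenv instead (for example, setenv LANG sv_SE.UTF-8).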
export LANG=sv_SE.UTF-8
export LC_ALL=sv_SE.UTF-8
Now log out and log back in for the changes to take effect.
To make xterm play nice, I added the locale resource to ~/.Xdefaults:
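Once the variables are set, a quick check can confirm that the shell really sees a UTF-8 locale. This is only a sketch: is_utf8_locale is a hypothetical helper, and the name patterns it accepts (the .UTF-8 spelling used by FreeBSD, plus the .utf8 spelling seen on some other systems) are an assumption about how locales are named.

```sh
# Hypothetical helper: succeed if a locale name carries a UTF-8 codeset.
is_utf8_locale() {
    case "$1" in
        *.UTF-8 | *.UTF-8@* | *.utf8 | *.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}

# LC_ALL overrides LC_CTYPE, which overrides LANG; if none is set, use "C".
effective=${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}

if is_utf8_locale "$effective"; then
    echo "UTF-8 locale in effect: $effective"
else
    echo "not a UTF-8 locale: $effective"
fi
```

The locale charmap command gives a similar answer directly, printing the character encoding of the current locale.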
echo "xterm*locale: UTF-8" >> ~/.Xdefaults
irssi + screen
If you're like me and don't want to restart irssi, use the following command inside the running screen session; otherwise, a freshly started screen should pick up the new locale settings.
Press Ctrl-a : (colon), then type 'encoding UTF-8 UTF-8'.
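This configuration lets irssi send ISO8859-1 by default while the terminal itself uses UTF-8. The last line makes a single channel (#utf8channel is only an example name) use UTF-8 instead.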
/set term_charset UTF-8
/set recode_out_default_charset ISO8859-1
/set recode yes
/set recode_autodetect_utf8 no
/set recode_fallback ISO8859-1
/set recode_transliterate no
/recode add #utf8channel UTF-8