Unicode

From FreeBSDwiki

Revision as of 10:50, 27 September 2007

Unicode

The simplest character set in common use on computers is ASCII, the American Standard Code for Information Interchange. It uses a 7-bit numerical value, normally stored in one 8-bit byte, to represent the unaccented letters of the English alphabet, the digits, punctuation and a set of control codes. Accented characters (as seen in French writing, for example) are not part of ASCII itself; they are provided by 8-bit extensions such as ISO 8859-1 (Latin-1). Almost all modern operating systems, as well as many older computer systems, use ASCII or a superset of it.

Unicode takes this concept of a numerical value representing a character and extends it to cover the alphabets of (virtually) all the known languages in the world. This amounts to around 100,000 characters, so UTF-8, the most widely used encoding of Unicode, uses 1-, 2-, 3- or even 4-byte sequences to represent them:

  • 1-byte sequences cover the ASCII range (U+0000 to U+007F), which includes the simple English alphabet;
  • 2-byte sequences cover the more common alphabets (U+0080 to U+07FF), including Arabic, Armenian, Cyrillic, Greek, Hebrew, accented Latin letters, and Syriac;
  • 3-byte sequences cover the remaining alphabets of the Basic Multilingual Plane (U+0800 to U+FFFF), including most Chinese, Japanese and Korean characters;
  • 4-byte sequences cover the characters beyond U+FFFF, mostly rarer scripts and symbols, and so are not used often.
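
The byte lengths in the list above can be checked from a shell. What follows is only a minimal sketch: utf8_bytes is a made-up helper name, and the four sample characters (A, e with acute accent, the euro sign and the musical G clef) were picked for illustration. The characters are written as octal escapes of their UTF-8 bytes so that the script itself stays plain ASCII.

```shell
#!/bin/sh
# utf8_bytes is a hypothetical helper: print the number of bytes in its argument.
utf8_bytes() {
    # wc -c counts bytes, not characters; tr removes the padding some wc versions print.
    printf '%s' "$1" | wc -c | tr -d ' '
}

utf8_bytes "$(printf 'A')"                  # U+0041, 1-byte sequence
utf8_bytes "$(printf '\303\251')"           # U+00E9, 2-byte sequence
utf8_bytes "$(printf '\342\202\254')"       # U+20AC, 3-byte sequence
utf8_bytes "$(printf '\360\235\204\236')"   # U+1D11E, 4-byte sequence
```

Each call prints the length of one encoded character, so the four lines of output should read 1, 2, 3 and 4.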

In addition to the characters themselves, the standard also defines "handedness" (directionality), that is, which way the text flows. Western languages are typically written left-to-right (as is the text on this page), while others, typically Middle Eastern languages such as Arabic and Hebrew, are written right-to-left.

ASCII uses one byte per character, so a 100-letter document would (theoretically) be 100 bytes on disk. A Unicode document with the same number of characters could be 2, 3 or 4 times that size, depending on the encoding used and on the characters it contains. UTF-8 is backwards compatible with ASCII: every ASCII character is encoded as the same single byte, so a pure ASCII file is already valid UTF-8.
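
To see the effect of the encoding on size, a 100-letter ASCII document can be converted with iconv and measured. This is a rough sketch that assumes your iconv accepts the encoding names UTF-16LE and UTF-32LE; encoded_size is a made-up helper name.

```shell
#!/bin/sh
# encoded_size is a hypothetical helper: read UTF-8 text on stdin and print
# how many bytes it occupies once converted to the encoding named in $1.
encoded_size() {
    iconv -f UTF-8 -t "$1" | wc -c | tr -d ' '
}

# Build a 100-letter ASCII document: 100 zeros from printf, turned into 'a's.
doc=$(printf '%0100d' 0 | tr 0 a)

printf '%s' "$doc" | encoded_size UTF-8      # one byte per letter
printf '%s' "$doc" | encoded_size UTF-16LE   # two bytes per letter
printf '%s' "$doc" | encoded_size UTF-32LE   # four bytes per letter
```

With an ASCII-only document the three sizes should come out as 100, 200 and 400 bytes, which is why UTF-8 is usually the cheapest choice for mostly English text.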

There is another character set typically found on older mainframes, most notably from IBM, called EBCDIC, the Extended Binary-Coded Decimal Interchange Code. A Unicode encoding called UTF-EBCDIC exists for these systems, so that Unicode text can be used within legacy EBCDIC applications.

Using UTF-8 in FreeBSD

First we need to set the LANG and LC_ALL variables. To find out which locales support UTF-8, list the matching locale directories:

cd /usr/share/locale/; ls -d *UTF-8
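
If you would rather not depend on the directory layout, the locale utility can be asked for the same list. A small sketch, assuming locale -a is available on your system; filter_utf8 is a made-up helper name.

```shell
#!/bin/sh
# filter_utf8 is a hypothetical helper: keep only the locale names on stdin
# that mention UTF-8, spelled either "UTF-8" or "utf8".
filter_utf8() {
    grep -i 'utf-*8'
}

# List every installed locale and keep the UTF-8 ones.
locale -a | filter_utf8 || echo 'no UTF-8 locales reported' >&2
```

Any name printed here, such as sv_SE.UTF-8, is a candidate value for LANG and LC_ALL.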

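Add the following environment variables to the startup file for your shell: ~/.profile for sh, or ~/.bashrc for bash. If you use csh or tcsh, put them in ~/.login using setenv syntax instead (for example, setenv LANG sv_SE.UTF-8).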

export LANG=sv_SE.UTF-8
export LC_ALL=sv_SE.UTF-8

Now log out and log back in for the changes to take effect.
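
After logging back in, you can check whether the new settings were picked up. The sketch below assumes the standard locale utility; is_utf8_charmap is a made-up helper name, and the list of spellings it accepts is a guess at what different systems might report.

```shell
#!/bin/sh
# is_utf8_charmap is a hypothetical helper: succeed if the given character
# map name is one of the usual spellings of UTF-8.
is_utf8_charmap() {
    case "$1" in
        UTF-8|utf-8|UTF8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# "locale charmap" prints the character encoding of the current locale.
if is_utf8_charmap "$(locale charmap 2>/dev/null)"; then
    echo "UTF-8 locale is active: LANG=$LANG"
else
    echo "UTF-8 locale is NOT active: LANG=$LANG" >&2
fi
```

If the second message appears, the variables were probably added to a file that your login shell does not read.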

Applications

xterm

To make xterm play nicely, I appended a locale resource to ~/.Xdefaults:

echo "xterm*locale: UTF-8" >> ~/.Xdefaults
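
Running the echo line more than once leaves duplicate entries in the file. A possible way around that is sketched below; add_line_once is a made-up helper name, not a standard tool.

```shell
#!/bin/sh
# add_line_once is a hypothetical helper: append line $2 to file $1 unless
# the file already contains exactly that line.
add_line_once() {
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

add_line_once "$HOME/.Xdefaults" 'xterm*locale: UTF-8'
```

The fixed-string match matters here because the * in the resource name would otherwise be read as a regular expression operator.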

irssi + screen

If, like me, you don't want to restart an irssi that is already running inside screen, use the following command in the running screen session. A newly started screen should pick up the locale settings by itself.

Press Ctrl-a : (colon), then type 'encoding UTF-8 UTF-8'

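The following settings tell irssi that the terminal uses UTF-8 while still sending ISO8859-1 by default. The last line overrides that default for a single channel, so that UTF-8 is sent there.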

/set term_charset UTF-8
/set recode_out_default_charset ISO8859-1
/set recode yes
/set recode_autodetect_utf8 no
/set recode_fallback ISO8859-1
/set recode_transliterate no
/recode add #utf8channel UTF-8
