
Unicode

From FreeBSDwiki
(Difference between revisions)
(Unicode)
 
(5 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 
== Unicode ==
 
== Unicode ==
The simplest form of character set used on computers uses an 8-bit (one byte) numerical value to represent a letter from the English and Latin alphabets and certain accented characters (normally seen in French writing).  This system is called [http://en.wikipedia.org/wiki/ASCII ASCII] and almost all modern day operating systems use it as well as many older computer systems.
+
The simplest form of character set used on computers uses an 8-bit (one byte) numerical value to represent a letter from the English and Latin alphabets and certain accented characters (normally seen in French writing).  This system is called [http://en.wikipedia.org/wiki/ASCII ASCII], the American Standard Code for Information Interchange.  Almost all modern day operating systems use it as well as many older computer systems.
  
Unicode, it particular the UTF-8 standard, takes this concept of a numerical value representing a character and extends it to host the alphabets of (virtually) all the known languages in the world.  This is around 100,000 characters and as such UTF-8 can use 1-, 2-, 3- and even 4-byte values to represent them:
+
Unicode, in particular the UTF-8 standard, takes this concept of a numerical value representing a character and extends it to host the alphabets of (virtually) all the known languages in the world.  This is around 100,000 characters and as such UTF-8 can use 1-, 2-, 3- and even 4-byte values to represent them:
  
 
* the 1-byte character set is used to cover the simple English alphabet;
 
* the 1-byte character set is used to cover the simple English alphabet;
 
* the 2-byte character set is used to cover the more common alphabets, including Arabic, Armenian, Cyrillic, Greek, Hebrew, Latin, and Syriac;
 
* the 2-byte character set is used to cover the more common alphabets, including Arabic, Armenian, Cyrillic, Greek, Hebrew, Latin, and Syriac;
* the 3-byte character set is used to cover other language alphabets;
+
* the 3-byte character set is used to cover additional language alphabets;
* the 4-byte character set is used to cover rarer language alphabets, as such it is not used often.
+
* the 4-byte character set is used to cover additional, but rarer, language alphabets, as such it is not used often.
  
In addition to the characters used the standard also defines "handedness", as in which way the text flows.  Typically Western languages are written left-to-right (as per the text on this page) while other, typically middle-Eastern languages, write from right-to-left.
+
In addition to the character sets used the standard also defines "handedness", as in which way the text flows.  Typically Western languages are written left-to-right (as per the text on this page) while other, typically middle-Eastern languages, write from right-to-left.
  
 
While ASCII uses one character-per-byte and so a 100 letter document would be (theoretically) 100 bytes on disk a Unicode document could be 2, 3 or 4 times that size, depending on the encoding used.  The Unicode standard is backwards compatible with ASCII when used in 1-byte character set.
 
While ASCII uses one character-per-byte and so a 100 letter document would be (theoretically) 100 bytes on disk a Unicode document could be 2, 3 or 4 times that size, depending on the encoding used.  The Unicode standard is backwards compatible with ASCII when used in 1-byte character set.
 +
 +
There is another character set typically found on older mainframes, most notably from IBM, called EBCDIC, the Extended Binary-Coded Decimal Interchange Code.  There is a variation called [http://en.wikipedia.org/wiki/UTF-EBCDIC UTF-EBCDIC] to enable legacy applications running on these systems to utilise Unicode.
  
 
== Using UTF-8 in FreeBSD ==
 
== Using UTF-8 in FreeBSD ==
 
First we need to set the LC_ALL and LANG variables, find out which locales can support UTF-8.
 
First we need to set the LC_ALL and LANG variables, find out which locales can support UTF-8.
  cd /usr/share/locale/; ls *UTF-8 -d
+
  $ cd /usr/share/locale/; ls *UTF-8 -d
  
Add the following environment variables to the appropriate file, ~/.profile or ~/.login or ~/.bashrc.
+
Add the following environment variable to the appropriate file, ~/.profile or ~/.login or ~/.bashrc.
export LANG=sv_SE.UTF-8
+
 
  export LC_ALL=sv_SE.UTF-8
 
  export LC_ALL=sv_SE.UTF-8
  
 
Now login and logout to have the effects apply.
 
Now login and logout to have the effects apply.
 +
After that you should enable UTF-8 support in your terminal, see the application section for this.
 +
 +
=== Converting files ===
 +
Now you're ready to convert some files, this is done with the command iconv, install it if you don't already have it.
 +
# pkg_add -r libiconv
 +
 +
Then use the following to convert a file.
 +
$ iconv -f iso8859-1 -t utf-8 file > file.new
 +
 +
This is a small script that converts a bunch of files and creates a backup of them in another directory.
  
 
== Applications ==
 
== Applications ==
 
=== xterm ===
 
=== xterm ===
To make xterm play nice i added  
+
To make xterm play nice I added  
  echo "xterm*locale: UTF-8" >> ~/.Xdefaults
+
  $ echo "xterm*locale: UTF-8" >> ~/.Xdefaults
  
=== irssi + screen ===
+
It could also be necessary to change the font see Unicode support on FreeBSD.
If you're like me and don't want to restart your irssi you use the following line, otherwise screen should use the locales.
+
Ctrl-a : (colon) then write 'encoding UTF-8 UTF-8'
+
  
This config will enable you to send ISO8859-1 by default in irssi.  
+
=== irssi + screen ===
 +
Unfortunately I haven't found any way to get irssi+screen+FiSH to work with out a restart of irssi.
 +
So restart screen with the new locales, this config will enable you to send ISO8859-1 by default in irssi.
 +
 
  /set term_charset UTF-8
 
  /set term_charset UTF-8
 
  /set recode_out_default_charset ISO8859-1
 
  /set recode_out_default_charset ISO8859-1
Line 40: Line 52:
 
  /set recode_transliterate no
 
  /set recode_transliterate no
 
  /recode add #utf8channel UTF-8
 
  /recode add #utf8channel UTF-8
 +
 +
For use with FiSH (an IRC encryption module [http://fish.sekure.us/]) some more adjustment are needed.
 +
Read instructions an apply patches from [http://iiice.net/~ice/programs/FiSH/ http://iiice.net/~ice/programs/FiSH/]
 
   
 
   
 
== External Links ==
 
== External Links ==
 
* [http://opal.com/freebsd/unicode.html Unicode support on FreeBSD]
 
* [http://opal.com/freebsd/unicode.html Unicode support on FreeBSD]
 +
 +
[[Category: FreeBSD Terminology]] [[Category: Common Tasks]]

Latest revision as of 08:07, 3 January 2009


Unicode

The simplest form of character set used on computers is ASCII, the American Standard Code for Information Interchange. It uses a 7-bit numerical value (normally stored in one 8-bit byte) to represent the letters of the English alphabet, the digits, punctuation and a set of control codes. Accented characters (such as those seen in French writing) are not part of ASCII itself; they were added by 8-bit extensions such as ISO 8859-1. Almost all modern operating systems, as well as many older computer systems, use ASCII or a character set built on it.

Unicode takes this concept of a numerical value representing a character and extends it to cover the alphabets of (virtually) all the known languages in the world. This is around 100,000 characters, far more than fit in one byte, so UTF-8, the Unicode encoding discussed here, uses 1-, 2-, 3- and even 4-byte sequences to represent them:

  • 1-byte sequences cover the ASCII characters, which include the simple English alphabet;
  • 2-byte sequences cover the more common alphabets, including Arabic, Armenian, Cyrillic, Greek, Hebrew, accented Latin, and Syriac;
  • 3-byte sequences cover most other scripts in everyday use, including Chinese, Japanese and Korean;
  • 4-byte sequences cover additional, but rarer, scripts and symbols, and as such are not used often.
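
The byte lengths listed above can be checked from the shell. The following is a minimal sketch, assuming the script file itself is saved as UTF-8; the function name utf8_bytes and the four sample characters are chosen here for illustration only.

```sh
# Print the number of bytes each sample character occupies in UTF-8.
# Assumes this script is itself saved as UTF-8.
utf8_bytes() {
    # wc -c counts bytes, not characters; tr strips the padding BSD wc adds
    printf '%s' "$1" | wc -c | tr -d ' '
}

for ch in 'A' 'å' '€' '𝄞'; do
    printf '%s: %s byte(s)\n' "$ch" "$(utf8_bytes "$ch")"
done
```

The four samples should report 1, 2, 3 and 4 bytes respectively, one for each sequence length.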

In addition to the character sets, the standard also defines "handedness", that is, the direction in which the text flows. Western languages are typically written left-to-right (as per the text on this page), while others, typically Middle Eastern languages such as Arabic and Hebrew, are written right-to-left.

ASCII uses one byte per character, so a 100-letter document is (theoretically) 100 bytes on disk. A Unicode document can be 2, 3 or 4 times that size, depending on the encoding used and the characters it contains. UTF-8 is backwards compatible with ASCII: every ASCII character is encoded as the same single byte, so a pure ASCII file is already valid UTF-8.
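
As a rough illustration of the size difference, the sketch below pushes a short ISO 8859-1 string through iconv and counts the bytes before and after. The sample word and the helper name latin1_sample are assumptions made for this example; the accented letters are written as octal escapes so the snippet works regardless of how the script file is encoded.

```sh
# "smörgåsbord" in ISO 8859-1: ö is the single byte octal 366, å is octal 345
latin1_sample() {
    printf 'sm\366rg\345sbord'
}

# Size in the original 8-bit encoding: 11 bytes, one per letter
latin1_sample | wc -c

# Size after conversion to UTF-8: 13 bytes, as ö and å now take two bytes each
latin1_sample | iconv -f ISO-8859-1 -t UTF-8 | wc -c

# Pure ASCII text passes through byte-for-byte unchanged
printf 'hello\n' | iconv -f ISO-8859-1 -t UTF-8
```

Only the two accented letters grow, which is why mostly-English text barely changes size in UTF-8.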

There is another character set typically found on older mainframes, most notably from IBM, called EBCDIC, the Extended Binary-Coded Decimal Interchange Code. There is a variation called UTF-EBCDIC to enable legacy applications running on these systems to utilise Unicode.

Using UTF-8 in FreeBSD

First we need to set the LC_ALL variable, which overrides LANG and the other LC_* variables. To find out which locales support UTF-8:

$ cd /usr/share/locale/; ls -d *UTF-8
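
An alternative that does not depend on the directory layout is to ask the locale utility for its list and filter it. This is only a sketch; the helper name utf8_only is made up for the example, and the pattern is deliberately loose so that it also matches the utf8 spelling used on some other systems.

```sh
# Keep only locale names whose codeset is UTF-8 (spelled UTF-8 or utf8)
utf8_only() {
    grep -i 'utf-\{0,1\}8'
}

# locale -a prints every locale known to the system, one per line
locale -a | utf8_only || echo "no UTF-8 locales found" >&2
```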

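Add the following environment variable to the appropriate startup file for your shell: ~/.profile for sh, or ~/.bashrc for bash. The export syntax shown is for sh-style shells; csh and tcsh users would instead put setenv LC_ALL sv_SE.UTF-8 in ~/.login.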

export LC_ALL=sv_SE.UTF-8

Now log out and log back in for the change to take effect. After that you should enable UTF-8 support in your terminal; see the Applications section for this.
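
After logging back in, it can be worth confirming that the setting was picked up. The function below is a small sketch (the name is_utf8_locale is an assumption of this example) that only inspects the environment variables, following the usual precedence of LC_ALL, then LC_CTYPE, then LANG.

```sh
# Succeeds if the effective character-type locale names a UTF-8 codeset.
# Precedence: LC_ALL first, then LC_CTYPE, then LANG.
is_utf8_locale() {
    effective=${LC_ALL:-${LC_CTYPE:-${LANG:-}}}
    case $effective in
        *.UTF-8|*.UTF-8@*|*.utf8|*.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_utf8_locale; then
    echo "UTF-8 locale active: $effective"
else
    echo "not a UTF-8 locale: ${effective:-unset}"
fi
```

Running locale charmap is a complementary check, since it reports the codeset the C library actually selected.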

Converting files

Now you're ready to convert some files. This is done with the iconv command; install it if you don't already have it.

# pkg_add -r libiconv

Then use the following to convert a file.

$ iconv -f iso8859-1 -t utf-8 file > file.new
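
Running that conversion on a file that is already UTF-8 would encode it a second time and garble the accented characters. One possible guard, sketched here with made-up helper names (is_valid_utf8, to_utf8), is to let iconv itself test whether the file decodes as UTF-8 before converting. The check is a heuristic: an ISO 8859-1 file could by coincidence also be valid UTF-8, although this is rare for ordinary text.

```sh
# Succeeds if the file decodes cleanly as UTF-8 (plain ASCII also passes)
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Convert to file.new only when the file is not already valid UTF-8
to_utf8() {
    if is_valid_utf8 "$1"; then
        echo "$1: already valid UTF-8, left alone"
    else
        iconv -f ISO-8859-1 -t UTF-8 "$1" > "$1.new"
    fi
}
```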

This is a small script that converts a bunch of files and creates a backup of them in another directory.
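
The script itself does not appear on this page. The following is not the original, only a minimal sketch of what such a script might look like, assuming the files are ISO 8859-1 text files ending in .txt; the function name convert_dir and the directory arguments are hypothetical.

```sh
# Convert every *.txt file in a directory from ISO 8859-1 to UTF-8,
# keeping an untouched copy of each file in a backup directory.
# Usage: convert_dir source_directory backup_directory
convert_dir() {
    src_dir=$1
    backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    for f in "$src_dir"/*.txt; do
        [ -f "$f" ] || continue              # skip directories and empty globs
        cp -p "$f" "$backup_dir/" || return 1
        if iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.new"; then
            mv "$f.new" "$f"                 # replace only after success
        else
            rm -f "$f.new"
            echo "failed to convert $f" >&2
        fi
    done
}

# Example (hypothetical paths): convert_dir "$HOME/notes" "$HOME/notes.backup"
```

Each file is copied before it is touched, and the original is replaced only once iconv has finished without error, so an interrupted run should leave either the old or the new version in place.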

Applications

xterm

To make xterm play nice I added

$ echo "xterm*locale: UTF-8" >> ~/.Xdefaults
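
Because >> appends every time it is run, repeating the command leaves duplicate lines in the file. A small sketch of an idempotent variant follows; the helper name add_resource is an assumption of this example.

```sh
# Append a resource line to a file only if that exact line is not there yet.
# Usage: add_resource 'resource line' file
add_resource() {
    # -x matches whole lines only, -F treats the text as a fixed string
    grep -qxF "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

# Example: add_resource 'xterm*locale: UTF-8' "$HOME/.Xdefaults"
```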

It may also be necessary to change the font; see Unicode support on FreeBSD (http://opal.com/freebsd/unicode.html).

irssi + screen

Unfortunately I haven't found any way to get irssi+screen+FiSH to work without a restart of irssi. So restart screen with the new locale settings. The following configuration will enable you to send ISO8859-1 by default in irssi.

/set term_charset UTF-8
/set recode_out_default_charset ISO8859-1
/set recode yes
/set recode_autodetect_utf8 no
/set recode_fallback ISO8859-1
/set recode_transliterate no
/recode add #utf8channel UTF-8
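
With these settings the terminal side is UTF-8 while outgoing text defaults to ISO8859-1. The sketch below only mimics that kind of conversion with iconv to show what it implies; it is not irssi's own code, and it assumes that a strict conversion without transliteration behaves like a plain iconv call.

```sh
# "å" typed in a UTF-8 terminal is two bytes (octal 303 245);
# after conversion to ISO 8859-1 it is a single byte
printf '\303\245' | iconv -f UTF-8 -t ISO-8859-1 | wc -c

# "€" (octal 342 202 254) has no ISO 8859-1 equivalent,
# so a strict conversion without transliteration fails
if ! printf '\342\202\254' | iconv -f UTF-8 -t ISO-8859-1 > /dev/null 2>&1; then
    echo "cannot represent this character in ISO8859-1"
fi
```

Characters outside ISO8859-1 are the reason for adding a per-channel UTF-8 entry such as the #utf8channel line above.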

For use with FiSH (an IRC encryption module, http://fish.sekure.us/), some more adjustments are needed. Read the instructions and apply the patches from http://iiice.net/~ice/programs/FiSH/

External Links

  • Unicode support on FreeBSD (http://opal.com/freebsd/unicode.html)
