Monday, December 12, 2011

Detecting and changing the encoding of text files.

SkyHi @ Monday, December 12, 2011
When you receive text files that use characters beyond those of plain English, you may have to deal with different character encodings. The problem is particularly noticeable on websites: if the browser tries to interpret a text file with an encoding that differs from the one the file actually uses, strange symbols appear where these characters were supposed to be. It is not limited to websites, though; any program that is made to work with languages other than English may show a similar problem if encodings are not handled appropriately.
In the case of HTML files, many people, and several programs by default, opt to replace these characters with either HTML entities (e.g. &aacute; to place an á) or numeric character references based on the ISO Latin-1 code (e.g. &#225; to place an á). The truth is that nowadays every modern (and not so modern) browser can successfully handle encodings such as iso-8859-1 or utf-8. All we have to do is choose an encoding, use that same encoding for all files to avoid conflicts, and tell the browser which encoding we are using. Personally, I prefer utf-8, as I consider it a much more flexible and complete character set, and unless otherwise required I have standardized on utf-8 in all my projects and in my systems in general.
To detect the encoding used within a file, we can use the command "file". This command tries to autodetect the encoding that a file is using. If no special characters are detected inside the text file, "file" will tell us that the encoding is us-ascii, and our editor can use whatever character encoding it is set to use by default. Of course, I set my editors to work with utf-8 by default.
file --mime-encoding file.txt
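Since "file" only guesses, it can help to cross-check its answer. The sketch below is not part of the "file" command; it uses a different technique, asking iconv to decode a file as UTF-8 and treating a decoding error as "not UTF-8". The function names is_valid_utf8 and report_non_utf8 are hypothetical helpers made up for this illustration.

```shell
# Succeed (exit status 0) if the file decodes cleanly as UTF-8.
# is_valid_utf8 is a hypothetical helper name, not a standard command.
is_valid_utf8() {
    iconv --from-code=utf-8 --to-code=utf-8 "$1" > /dev/null 2>&1
}

# Print the name of every .txt file in the current directory
# that does NOT decode as UTF-8 (also a hypothetical helper).
report_non_utf8() {
    for f in *.txt; do
        [ -e "$f" ] || continue          # the glob matched nothing
        is_valid_utf8 "$f" || printf '%s\n' "$f"
    done
}
```

Pure ASCII files pass this check too, because ASCII is a subset of UTF-8. A file that fails the check still has to be identified by other means ("file", or knowledge of where it came from).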
Once we know the encoding of the file, we can convert it to a different character encoding if necessary, by using:
iconv --from-code=iso-8859-1 --to-code=utf-8 file.txt > file.txt.utf8
mv file.txt.utf8 file.txt

Changing the character encoding of multiple files

When we need to change the character encoding of one file, more often than not we have to change the character encoding of other files as well. To do this operation on several files at once we can use:
for old in *.txt; do
iconv --from-code=iso-8859-1 --to-code=utf-8 "$old" > "$old.utf8";
done
Once this is done, we can copy each converted file over the file it was generated from, in effect replacing the original with the re-encoded version:
for old in *.utf8; do
cp "$old" "$(basename "$old" .utf8)";
done
basename gives us the name of the file minus the ".utf8" part. If everything is ok, we can remove the temporary files that we created.
rm *.utf8
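The three steps (convert, copy back, clean up) can also be folded into a single loop. The following is only a sketch under one loud assumption: every file that is not already valid UTF-8 is encoded in iso-8859-1. The function name convert_all_txt is hypothetical. Files that already decode as UTF-8 are skipped, because converting them a second time would garble their accented characters.

```shell
# convert_all_txt DIR
# Hypothetical helper: convert every *.txt file in DIR to UTF-8.
# ASSUMPTION: any file that is not valid UTF-8 is iso-8859-1.
convert_all_txt() {
    dir=$1
    status=0
    for old in "$dir"/*.txt; do
        [ -f "$old" ] || continue        # the glob matched nothing
        # Skip files that already decode as UTF-8 (plain ASCII included).
        if iconv --from-code=utf-8 --to-code=utf-8 "$old" > /dev/null 2>&1; then
            continue
        fi
        if iconv --from-code=iso-8859-1 --to-code=utf-8 "$old" > "$old.utf8"; then
            mv "$old.utf8" "$old"        # replace the original
        else
            rm -f "$old.utf8"            # leave the original untouched
            printf 'failed: %s\n' "$old" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   convert_all_txt .
```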

How to detect file encoding and convert given files from one encoding to another on GNU/Linux and FreeBSD

I wanted to convert the character encoding of an HTML document to UTF-8. To do that,
of course, I first needed to determine which character encoding was used when the
file was created.
The first thing I tried was:
hipo@noah:~/Desktop/test$ file File-Whole.htm
File-Whole.htm: HTML document text
As you can see, that’s useless, because for some reason the MIME encoding is not printed by file.
The next thing I tried was:
hipo@noah:~/Desktop/test$ file --mime File-Whole.htm
File-Whole.htm: text/html; charset=unknown-8bit
Here you see that the character encoding is reported as charset=unknown-8bit, which
ain’t cool at all: it is of no use and prompts an error if I try it in iconv.
That is exactly why I needed to determine which character set my file uses: to be able
to convert it later with iconv.
To achieve my goal, after consulting Mr. Google, I found
out about enca, a tool to detect and convert the encoding of text files.
It’s obviously my lucky day, because the good guys from Debian have packaged enca, so everything came down to
apt-getting it.
# apt-get install enca
On FreeBSD an enca port is available, so installing it simply comes down to building it from the ports tree.
Here is how:
pcfreak# cd /usr/ports/converters/enca
pcfreak# make install clean
Now I tried launching enca directly without any program parameters, but I was unlucky:
hipo@noah:~/Desktop/test$ enca File-Whole.htm
enca: Cannot determine (or understand) your language preferences.
Please use `-L language', or `-L none' if your language is not supported
(only a few multibyte encodings can be recognized then).
Run `enca --list languages' to get a list of supported languages.
I gave it another try, following the prescribed usage, though I first checked which
languages I could pass to enca’s -L parameter.
Knowing in advance that my text is in Bulgarian, it wasn’t a big deal
to determine the required language:
hipo@noah:~/Desktop/test$ enca -L bulgarian File-Whole.htm
transformation format 8 bits; CP1251
Knowing my character set, all that was left for me was to do the conversion to UTF-8 to make the text
much more accessible.
hipo@noah:~/Desktop/test$ iconv --from-code=CP1251 --to-code=UTF-8 File-Whole.htm >
hipo@noah:~/Desktop/test$ mv File-Whole.htm
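The whole conversion step can be sketched as a small function. This is an illustration only: to_utf8_copy is a hypothetical helper name, and the output file name in the usage line (File-Whole.utf8.htm) is made up for the example, since the original commands do not show one. It writes the converted text to a separate file, then checks that the result really decodes as UTF-8.

```shell
# to_utf8_copy FROM SRC DST
# Hypothetical helper: write a UTF-8 copy of SRC (encoded in FROM) to DST.
to_utf8_copy() {
    from=$1 src=$2 dst=$3
    if [ "$src" = "$dst" ]; then
        # Redirecting onto the input would empty it before iconv reads it.
        printf 'refusing to overwrite %s\n' "$src" >&2
        return 1
    fi
    if ! iconv --from-code="$from" --to-code=UTF-8 "$src" > "$dst"; then
        rm -f "$dst"                     # drop the partial output
        return 1
    fi
    # Sanity check: the result must decode as UTF-8.
    iconv --from-code=UTF-8 --to-code=UTF-8 "$dst" > /dev/null
}

# Usage (the output name is an assumption):
#   to_utf8_copy CP1251 File-Whole.htm File-Whole.utf8.htm
```

Keeping the original until the copy has been checked means a wrong guess about the source encoding costs nothing.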