Convert Arabic Text From (ISO-8859-1 or WINDOWS-1256) to UTF-8


 

If you're maintaining legacy code and you come across Arabic or other non-English text in an old encoding that you want to convert to UTF-8, here is how you can do it:


First of all, you need to understand how things work so you know what action to take, so let's separate the process into 2 steps:

1- Reading the content of the file. Whether you open the file from the terminal or in an IDE, the viewer loads it using your default encoding, the encoding declared in the file (if there is one), or whatever encoding your IDE guesses. Up to now nothing has been saved, so what you see in the view is the content of the file displayed in 1 particular encoding. You can re-open the file in a different encoding in many ways; I will mention 2:
A: on Linux you can use iconv to read the file content in a specified encoding as follows:
iconv -f ISO-8859-1 file.htm

B: Sublime:
File -> Reopen with Encoding -> UTF-8, for example

The encoding you choose when loading the file affects how the characters are displayed; damaged output means that the chosen encoding was wrong.
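If you are not sure which encoding a file is in, one way to find out is to look at it under a few candidates and see which one reads correctly. Below is a minimal sketch of that idea; the function name preview_encodings and the list of candidate encodings are my own assumptions, so adjust them to your files. Nothing is written back to the file.

```shell
# preview_encodings FILE: print the first lines of FILE decoded under a few
# candidate encodings (an assumed list), so you can see which one reads right.
preview_encodings() {
  file="$1"
  for enc in WINDOWS-1256 ISO-8859-1 UTF-8; do
    printf '== %s ==\n' "$enc"
    # -f is the encoding we assume the file is in; output goes to the terminal only.
    # A decoding error for one candidate should not stop the loop.
    iconv -f "$enc" -t UTF-8 "$file" 2>/dev/null | head -n 5 || true
  done
}
```

Run it as preview_encodings file.htm and pick the section whose text looks like real Arabic.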


2- Saving the content to a file. By default, saving the current file writes the loaded content using the currently chosen encoding, regardless of the original encoding. The encoding of the file changes only if you decide to change it from your IDE or from the command line, and here is how you can do that in both:

A: Linux:
iconv -f WINDOWS-1256 -t UTF-8 source-file.htm -o output-file.htm

B: Sublime:
File -> Save with Encoding -> UTF-8, for example
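After saving, it can help to confirm that the file really is valid UTF-8. A small sketch, assuming iconv is available: decoding the file as UTF-8 fails on any invalid byte sequence, so the exit status tells you the answer. The helper name is_utf8 is hypothetical. Note that a pure ASCII file also counts as valid UTF-8.

```shell
# is_utf8 FILE: exit 0 if FILE decodes cleanly as UTF-8, non-zero otherwise.
is_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}
```

For example: if is_utf8 output-file.htm; then echo "ok"; else echo "not UTF-8"; fi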


Now that you know how to open a file in any encoding and how to save it in any encoding, we are ready to convert the file.

As you may already know, converting straight from ISO to UTF-8 tends to go wrong (an error or garbled text), especially if the file has non-English letters such as Arabic. The workaround is to do it in 3 steps:

1- Open the file in the encoding that displays its content correctly

2- Copy the content to the clipboard or to a new window, "but don't save it"

3- Create a new file with the desired encoding, paste the content into the new file, and save the file

You can repeat these steps for all the files that you need to fix. Remember to back up all your files just in case, and test 1 file before updating all the files to make sure that your solution works.

If you're trying to update many files, you might need to write a bash script to do that. First let's write the command that will convert a file:

iconv -f WINDOWS-1256 -t UTF-8 file.htm | xclip && rm file.htm && xclip -o >> file.htm
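The command above goes through the clipboard, so it needs xclip and a running X session, and the original file is removed before the new content is written. As an alternative, here is a rough sketch that uses a temporary file instead and only touches the original if iconv succeeded. The function name convert_in_place is hypothetical, and WINDOWS-1256 as the default source encoding is an assumption taken from the example above.

```shell
# convert_in_place FILE [FROM]: re-encode FILE to UTF-8 through a temp file.
# FROM defaults to WINDOWS-1256 (assumed source encoding).
convert_in_place() {
  file="$1"
  from="${2:-WINDOWS-1256}"
  tmp="$(mktemp)" || return 1
  # Overwrite the original only if the conversion succeeded.
  if iconv -f "$from" -t UTF-8 "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # rewrite the content, keeping the file's permissions
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Usage would be convert_in_place file.htm, or convert_in_place file.htm ISO-8859-1 for a different source encoding.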

After testing it and confirming that it works, we will wrap it in a loop with a condition to make sure that we only convert the files that need to be converted.

Below is a shell script that you can create (by saving the script into an executable file) and run in the root folder of your project.

Note: back up your project before you run the script and check the result before saving.

https://github.com/salehawal/script_bash_arabic_to_utf8/blob/main/convert_to_utf8.sh
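To give an idea of what the loop and condition can look like, here is a short sketch. It is not the linked script, just an illustration under my own assumptions: the function name convert_tree is hypothetical, it only looks at *.htm files, it assumes the source encoding is WINDOWS-1256, and it treats "already valid UTF-8" as the condition for skipping a file. File names containing newlines are not handled.

```shell
# convert_tree DIR: convert every *.htm file under DIR from WINDOWS-1256
# (assumed) to UTF-8, skipping files that already decode as valid UTF-8.
convert_tree() {
  find "$1" -type f -name '*.htm' | while IFS= read -r file; do
    # Condition: a file that already reads as UTF-8 needs no conversion.
    if iconv -f UTF-8 -t UTF-8 "$file" >/dev/null 2>&1; then
      echo "skip: $file"
      continue
    fi
    tmp="$(mktemp)" || exit 1
    if iconv -f WINDOWS-1256 -t UTF-8 "$file" > "$tmp"; then
      cat "$tmp" > "$file"   # replace the content only after a clean conversion
      echo "converted: $file"
    else
      echo "failed: $file" >&2
    fi
    rm -f "$tmp"
  done
}
```

Running convert_tree . from the project root a second time should report every file as skipped, which is a quick way to check that the first run did its job.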
