UTF character data, encoding of text

Objective and Background:  You have text data that is UTF encoded and need SAS or R to read and write datasets with that encoding.  If you have ever printed or viewed text information and seen something like Giuffr?Ÿ’e?ƒe?Ÿƒ?ÿ?›ƒ?ªƒ?›?Ÿ’e›ƒ?ª­?Ÿƒeee, then you have run into this encoding issue.  Computers store text as numbers, with each number assigned to a particular character.  See https://en.wikipedia.org/wiki/ASCII to find that the character & is stored as 38 (hexadecimal 26) in the ASCII encoding.  Unicode is popular internationally because it covers special characters such as accented letters, and UTF-8 is its most widely used encoding (https://en.wikipedia.org/wiki/UTF-8).  UTF-8 stores the plain ASCII characters exactly as ASCII does, so & is still 38, but it stores accented letters and other non-ASCII characters as sequences of two or more bytes.  The jumbled example above arises when software reads those multi-byte sequences under a different encoding and shows each byte as a separate character.
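To see how such garbling arises, here is a minimal R sketch.  The name used is only an illustration (a shortened form of the example above), and Latin-1 is assumed as the "wrong" encoding; your own files may involve a different one.

```r
# An accented name, written with a Unicode escape so this script itself is pure ASCII
x <- "Giuffr\u00e8"

# Its UTF-8 bytes: the accented letter takes two bytes (c3 a8), the rest one each
bytes <- charToRaw(enc2utf8(x))
print(bytes)

# The ampersand is the single byte 38 (hex 26) in both ASCII and UTF-8
amp <- as.integer(charToRaw("&"))
print(amp)

# Misread the same bytes as Latin-1: every byte becomes its own character,
# so the one accented letter turns into two unrelated characters
garbled <- iconv(rawToChar(bytes), from = "latin1", to = "UTF-8")
print(nchar(x))        # number of characters in the original
print(nchar(garbled))  # one more character than the original
```

The stored bytes never change in this sketch; only the assumption about which encoding they are in changes, and that is enough to corrupt the displayed text.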

Solution 1:  Use options to request that individual datasets be read and written in a particular encoding.  In SAS, specify the encoding= option on the relevant file statements (infile, file, filename), for example:
data imdb;
 infile 'movies.csv' dsd lrecl=2056 pad encoding="utf-8" firstobs=2;
 length Type $ 32;
 length Title Genre $ 500;
 input Type Title Genre;  /* names must match the length statements above */
run;
In R, follow the advice here http://people.fas.harvard.edu/~izahn/posts/reading-data-with-non-native-encoding-in-r/ 
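As a rough R counterpart to the SAS step above, the sketch below writes a tiny UTF-8 file and reads it back.  The file contents and the variable names tmp and imdb are made up for illustration.  It uses the encoding= argument of read.csv, which marks the strings as UTF-8; the fileEncoding= argument is an alternative that re-encodes the file into your session's native encoding while reading, and write.csv accepts fileEncoding= as well.

```r
# Create a small UTF-8 encoded CSV in a temporary file (illustrative data only)
tmp <- tempfile(fileext = ".csv")
csv_lines <- enc2utf8(c("Type,Title,Genre", "movie,Am\u00e9lie,Comedy"))
writeLines(csv_lines, tmp, useBytes = TRUE)   # write the UTF-8 bytes unchanged

# Read it back, telling R that the text in the file is UTF-8
imdb <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)
print(imdb)

# The accented letter is still a single character stored as two UTF-8 bytes
print(charToRaw(imdb$Title[1]))
```

If the title prints with stray symbols in place of the accented letter, the encoding given to read.csv does not match the encoding the file was actually saved in.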

Solution 2:  If all your files are always encoded UTF-8, then create a SAS session that uses UTF-8 by default:
1)  Create a shortcut that points to the supplied UTF-8 config file.  In Windows, right-click on the shortcut, choose Properties, and under Target you will see something like
"C:\Program Files\SASHome\SASFoundation\9.4\sas.exe" -CONFIG "C:\Program Files\SASHome\SASFoundation\9.4\nls\en\sasv9.cfg"
Change the CONFIG path to point to \nls\u8\ instead of \nls\en\.  The supplied config file there includes specific settings like
-DBCS
-LOCALE en_US
-ENCODING UTF-8
as well as other tweaks that set the default encoding, so there is no need for encoding= options within your programs.  Another advantage is that .sas program files will then be stored in UTF-8, so comments and explicit data are retained correctly.
R is even easier: just edit your Rprofile file to include options(encoding="utf-8")
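To check what that R setting does in a running session, a small sketch along these lines may help.  It sets the option by hand rather than through the Rprofile file, and restores the previous value at the end; the variable names are only for illustration.

```r
# Set the default encoding for file connections, as the Rprofile line would
old <- options(encoding = "UTF-8")

# Confirm the value now in effect (R's built-in default is "native.enc")
current <- getOption("encoding")
print(current)

# l10n_info() reports whether the session locale itself is UTF-8
locale_is_utf8 <- l10n_info()[["UTF-8"]]
print(locale_is_utf8)

# Restore the previous value so this check leaves the session unchanged
options(old)
```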

Possibility 3:  You may just want your .sas program file to be UTF-8, while all datasets stay in whatever default encoding is used for your country.  This is easily done within the File-Open and File-Save windows, using the encoding menu below the filename box.  When doing this, if you ever see garbled text, you know you made an error with the encoding choice.  Be careful!
