Issue with native2unicode and windows-1252 encoding

Hi all,
I'm trying to encode some bytes into a character set using the windows-1252 encoding and I've checked that native2unicode

1 Comment

Rik on 14 Jan 2022
Most of your question seems to be missing.


Answers (3)

Walter Roberson on 14 Jan 2022

source = char(0:511)
source =
' !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿǀǁǂǃDŽDždžLJLjljNJNjnjǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰDZDzdzǴǵǶǷǸǹǺǻǼǽǾǿ'
bytes = unicode2native(source, 'windows-1252')
bytes = 1×512
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
backport = char(bytes)
backport =
' !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
whichdiffer = find(source(1:256) ~= backport(1:256) )
whichdiffer = 1×27
129 131 132 133 134 135 136 137 138 139 140 141 143 146 147 148 149 150 151 152 153 154 155 156 157 159 160
source(whichdiffer)
ans = ''
bytes(whichdiffer)
ans = 1×27
26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26
backport(whichdiffer)
ans = ''
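What this is telling us is that most of the Unicode code points from 128 to 159 cannot be represented in Windows-1252, so unicode2native substitutes byte 26 for them. Note that whichdiffer holds 1-based indices into source, so each code point is the index minus 1. Going the other way, from bytes to Unicode: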
bytes2 = uint8(129:141)
bytes2 = 1×13
129 130 131 132 133 134 135 136 137 138 139 140 141
encodes_as = native2unicode(bytes2, 'windows-1252')
encodes_as = '‚ƒ„…†‡ˆ‰Š‹Œ'
double(encodes_as)
ans = 1×13
129 8218 402 8222 8230 8224 8225 710 8240 352 8249 338 141
Looks about right.
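
To see at a glance which byte values in the 128 to 159 range decode to themselves and which are remapped, a minimal sketch along these lines may help. The variable names are illustrative and not from the post above; only native2unicode and the 'windows-1252' encoding name are taken from it.

```matlab
% Decode every byte in 128..159 and compare each byte value with its code point.
byteVals = uint8(128:159);
decoded  = native2unicode(byteVals, 'windows-1252');
codes    = double(decoded);
same     = (codes == double(byteVals));

% Bytes whose code point equals the byte value (Windows-1252 leaves these undefined).
unchanged = double(byteVals(same))

% Bytes that decode to a different code point: row 1 is the byte, row 2 the code point.
remapped = [double(byteVals(~same)); codes(~same)]
```

Going by the output above, 129 and 141 should show up in unchanged, while 130 to 140 should appear in remapped (for example 130 paired with 8218).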

2 Comments

Borja Heriz on 17 Jan 2022
Thanks for the answer.
But what about Unicode 26 and 157? These are also encoded with the square symbol in Windows-1252.
Thanks
Walter Roberson on 17 Jan 2022
Code point 26 is the standard value to substitute for code points that cannot be represented:
https://en.m.wikipedia.org/wiki/Substitute_character
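
A small sketch of that substitution, assuming the behaviour shown in the answer above (unrepresentable characters come back as byte 26). The test characters and variable names are chosen here for illustration only.

```matlab
% Byte 26 (hex 1A) is the ASCII SUB control character.
SUB = uint8(26);

% U+20AC (euro sign) exists in Windows-1252; U+0100 does not.
euroBytes  = unicode2native(char(8364), 'windows-1252')
otherBytes = unicode2native(char(256),  'windows-1252')

% Flag characters that were replaced during encoding, ignoring any
% character that was already code point 26 in the input.
txt      = ['A', char(256), 'B'];
encoded  = unicode2native(txt, 'windows-1252');
replaced = (encoded == SUB) & (double(txt) ~= 26)
```

A byte value of 26 in the encoded output is therefore ambiguous on its own: it can be a genuine SUB character or a replacement, which is why the sketch compares against the input.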


Borja Heriz on 17 Jan 2022

Hi,
Sorry for not having completed the post...
My question is about why native2unicode returns the same symbol for different numerical values.
native2unicode(26,'windows-1252')
native2unicode(157,'windows-1252')
native2unicode(129,'windows-1252')
All of them return the square symbol in R2020b.

Borja Heriz on 17 Jan 2022

Hi there,
Definitely, there must be something I'm missing. I don't understand why the numeric values 153 and 156 come out as the same symbol regardless of the method I use.
char(153)
ans = ''
char(156)
ans = ''
native2unicode(153,'ISO-8859-1')
ans = ''
native2unicode(156,'ISO-8859-1')
ans = ''
native2unicode(153,'utf-8')
ans = '�'
native2unicode(156,'utf-8')
ans = '�'
native2unicode(153,'US-ASCII')
ans = '�'
native2unicode(156,'US-ASCII')
ans = '�'
native2unicode(153,'latin1')
ans = ''
native2unicode(156,'latin1')
ans = ''
What am I doing wrong?
Thanks,

1 Comment

Rik on 17 Jan 2022
This is an answer, but it looks like a comment. Please use the comment sections to post comments. The order of answers can change, which will make reading back confusing.
Please post this as a comment and delete the answer.
When you do, I (or Walter) will post something along these lines:
Why do you think 153 and 156 are encoded as the same character? They are displayed as the same character, but that is probably due to a limitation in the display, as this could very well encode a control character without a proper symbol.
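
One way to check this is to compare the numeric code points rather than the displayed glyphs. The following is only a sketch with made-up variable names; it assumes the same native2unicode calls used in the question.

```matlab
% Compare code points instead of relying on how the characters are drawn.
a = native2unicode(uint8(153), 'ISO-8859-1');
b = native2unicode(uint8(156), 'ISO-8859-1');
codeA = double(a)
codeB = double(b)
sameChar = isequal(a, b)

% The same two bytes under Windows-1252, where the code page defines
% 153 as the trade mark sign (U+2122) and 156 as the oe ligature (U+0153).
c = double(native2unicode(uint8([153 156]), 'windows-1252'))
```

If sameChar is false, the two results are different characters that merely share a placeholder glyph in the display.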


Category: Data Type Conversion

Release: R2020b

Asked: 14 Jan 2022

Commented: 17 Jan 2022
