How to read a UTF-8 encoded text file as a single character vector including white spaces and unicode special characters?

Question

Deepu George Kurian, 14 May 2020

https://jp.mathworks.com/matlabcentral/answers/525358-how-to-read-a-utf-8-encoded-text-file-as-a-single-character-vector-including-white-spaces-and-unicod

Edited: Rik, 19 February 2021

I am trying to read a UTF-8 encoded text file, "data.txt", that contains content like this:

<title>
Fate/kaleid liner Prisma☆Illya (Fate/Kaleid Liner Prisma Illya) - MyAnimeList.net
</title>

If I try:

data = fileread('data.txt');

Sample read data:

<title>
Fate/kaleid liner Prismaâ˜†Illya (Fate/Kaleid Liner Prisma Illya) - MyAnimeList.net
</title>

I lose the UTF-8 encoded special characters. Here, '☆' is misread as 'â˜†'.

If I try:

file = fopen('data.txt','r','n','UTF-8');
data = fscanf(file, '%s');
fclose(file);

Sample read data:

<title>Fate/kaleidlinerPrisma☆Illya(Fate/KaleidLinerPrismaIllya)-MyAnimeList.net</title>

This retains the Unicode characters but loses all the whitespace characters.

If I try:

file = fopen('data.txt','r','n','UTF-8');
data = textscan(file, '%s');
fclose(file);

Sample read data:

 11×1 cell array
    {'<title>'           }
    {'Fate/kaleid'       }
    {'liner'             }
    {'Prisma☆Illya'      }
    {'(Fate/Kaleid'      }
    ......

The result is a cell array split at every whitespace character, even though the Unicode characters are read correctly.

Can you suggest a way to overcome this issue?

Answer 1 (accepted answer)

Walter Roberson, 14 May 2020

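% Open with an explicit UTF-8 encoding so multibyte characters decode correctly.
file = fopen('data.txt','r','n','UTF-8');
% Read the whole file, whitespace included, as a 1-by-N char vector.
data = fread(file, [1 inf], '*char');
fclose(file);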
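
One way to check that this approach keeps both whitespace and non-ASCII characters is a round trip through a temporary file. The sketch below is illustrative only: the sample text and the temporary file are assumptions standing in for the asker's data.txt.

```matlab
% Build a small sample containing spaces, newlines, and the star (U+2606).
star   = char(9734);
sample = sprintf('<title>\nPrisma%sIllya (demo) - Example\n</title>\n', star);

% Write the sample as UTF-8 to a temporary file (hypothetical stand-in for data.txt).
tmpFile = [tempname '.txt'];
fid = fopen(tmpFile, 'w', 'n', 'UTF-8');
fprintf(fid, '%s', sample);
fclose(fid);

% Read it back as one 1-by-N char vector, decoding as UTF-8.
fid  = fopen(tmpFile, 'r', 'n', 'UTF-8');
data = fread(fid, [1 inf], '*char');
fclose(fid);
delete(tmpFile);

% Whitespace and the star should both survive the round trip.
roundTripOk = isequal(data, sample);
```

If roundTripOk is false on a given setup, the encoding argument passed to fopen is the first thing to check.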

Walter Roberson, 14 May 2020

You have a problem: 11577.txt and New.txt are ISO-8859-1 (Latin-1) encoded, but 14829.txt is UTF-8 encoded.

It is sometimes possible to tell the difference between the two encodings, but there is no provided routine for doing that.

If you were using R2020a or later, then fileread() would be enough: R2020a improved encoding detection and automatic use of encodings.
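
For older releases, a rough heuristic for telling the two encodings apart is to decode the raw bytes as UTF-8, re-encode them, and see whether the bytes come back unchanged. This is only a sketch: the helper names roundTrip and looksLikeUtf8 are hypothetical, and the check cannot distinguish pure ASCII (valid in both encodings) or Latin-1 text that happens to form valid UTF-8 sequences.

```matlab
% 'Gintama°' encoded two ways (the degree sign is U+00B0).
latin1Bytes = uint8([71 105 110 116 97 109 97 176]);      % ISO-8859-1: one byte, 176
utf8Bytes   = uint8([71 105 110 116 97 109 97 194 176]);  % UTF-8: two bytes, 194 176

% Hypothetical helpers: decode as UTF-8, re-encode, and compare with the input.
roundTrip     = @(b) unicode2native(native2unicode(b, 'UTF-8'), 'UTF-8');
looksLikeUtf8 = @(b) isequal(roundTrip(b), b);

isUtf8A = looksLikeUtf8(utf8Bytes);    % valid UTF-8 input
isUtf8B = looksLikeUtf8(latin1Bytes);  % lone byte 176 is not valid UTF-8

% Pick the decoder based on the check.
if looksLikeUtf8(latin1Bytes)
    decoded = native2unicode(latin1Bytes, 'UTF-8');
else
    decoded = native2unicode(latin1Bytes, 'ISO-8859-1');
end
```

For a real file, the byte vector would come from fread(fid, [1 inf], '*uint8') on a file opened without an encoding argument.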

Deepu George Kurian, 15 May 2020

Ahh....... You just reminded me of a stupid mistake I made while acquiring those crude data. Thanks man. I corrected it and it works perfectly now.

Answer 2

Rik, 14 May 2020

Edited: Rik, 14 May 2020

I wrote the readfile function for this purpose. It returns a cell array, but you can concatenate the cells back into one long char array if you prefer.

data=cell2mat(readfile('data.txt'));

Note: this removes all newlines. You can replace them with spaces like this:

data=readfile('data.txt');
data(2,:)={' '};
data=data(:)';
data=cell2mat(data);
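
As an alternative sketch, strjoin can do the same joining in one call. The cell array fileLines below is a hypothetical stand-in for the output of readfile, assumed here to be a 1-by-N cell with one line per cell. Note the small difference: the interleaving approach leaves a trailing space, while strjoin does not.

```matlab
% Hypothetical stand-in for the readfile output: a 1-by-N cell, one line per cell.
fileLines = {'<title>', 'Fate/kaleid liner', '</title>'};

% Join with single spaces; no separator is appended after the last line.
joined = strjoin(fileLines, ' ');

% Join with newline characters instead, to keep the original line structure.
withNewlines = strjoin(fileLines, newline);

% The interleaving approach from the answer, for comparison.
padded = fileLines;
padded(2,:) = {' '};                  % second row holds the separators
interleaved = cell2mat(padded(:)');   % column-major order alternates line, space
```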

Deepu George Kurian, 14 May 2020

I am sorry. It's working now. I guess I made a mistake the first time.

But I have encountered another issue with this. The character ° is not read correctly.

I have added a file 'New.txt' which contains this character in the link I shared before.

You can run this code on the extracted character vector to check whether the output is correct.

title = char(extractBetween(data, '<title>', ' - MyAnime'));

Required Output:

' Gintama°: Aizome Kaori-hen'

Observed Output:

' Gintama�: Aizome Kaori-hen'

Deepu George Kurian, 15 May 2020

Sorry man. As Rik pointed out here, that last error was due to my stupidity while acquiring those crude data. Both your answers work perfectly now. I am choosing his, just because I can have one extra function less there. But thanks a lot

Answer 3

MathWorks Support Team, 19 February 2021

As of MATLAB R2020a, fileread accomplishes the desired task.

Rik, 19 February 2021

Edited: Rik, 19 February 2021

This is not quite true, as it doesn't work on all Unicode characters:

fid=fopen('foo.txt','w','n','UTF-8');
fprintf(fid,'%s','😀');
fclose(fid);
fid=fopen('foo.txt','rb');
fread(fid).' % display raw bytes
fclose(fid);
ans = 1×4
   240   159   152   128
fileread('foo.txt') % show fileread result
ans = ''
