Why the headerlines are not always properly detected by readtable?

Question

pietro on 10 May 2018


Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/400119-why-the-headerlines-are-not-always-properly-detected-by-readtable

Edited: Stephen23 on 7 Sep 2023

Hi all,

I have many .csv files to import into MATLAB. The files are exported automatically from Scopus. Most of them import without problems, but for some, like the one you can download from this link, the headers are completely wrong: MATLAB skips the first line.

With other files, like this one, MATLAB returns the following error:

Error using readtable (line 198)
Reading failed at line 3. All lines of a text file must have the same
number of delimiters. Line 3 has 1164 delimiters, while preceding
lines have 365.
    Note: readtable detected the following parameters:
    'Delimiter', ' ', 'HeaderLines', 1, 'ReadVariableNames', false,
    'Format',
    '%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%f%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%f%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%f%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%f%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%f%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%f%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%f%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%f%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%q%f%q%q%q'

Here is the code I used for both files:

M = readtable('TestFile.csv','Encoding','UTF-8');

How can I solve both problems?

Thanks.

Best regards,

pietro


Answer 1

Guillaume on 10 May 2018


Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/400119-why-the-headerlines-are-not-always-properly-detected-by-readtable#answer_319626

Edited: Guillaume on 10 May 2018

There is actually an invisible character sequence at the start of the file. It is a UTF-8 BOM (byte order mark), the bytes EF BB BF. The Unicode standard does not recommend using a BOM with UTF-8.
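One way to check for the BOM is to look at the first three bytes of the file directly. The following is only a minimal sketch: the demo file and its contents are made up for illustration and are not the Scopus export from the question.

```matlab
% Create a small demo file that starts with a UTF-8 BOM (hypothetical data).
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fwrite(fid, [239 187 191, double(sprintf('Authors,Year\nSmith,2018\n'))]);
fclose(fid);

% Read the first three bytes as raw numbers.
fid = fopen(fname, 'r');
firstBytes = fread(fid, 3);                  % column vector of doubles
fclose(fid);

% Compare with EF BB BF (239 187 191 in decimal).
hasBOM = isequal(firstBytes, [239; 187; 191]);
hexBytes = cellstr(dec2hex(firstBytes))';    % hex view of the same bytes

delete(fname);                               % clean up the demo file
```

Here `hasBOM` is true and `hexBytes` holds the three bytes in hexadecimal, which makes it easy to compare against what a hex editor shows.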

Note that regardless of the marker, MATLAB R2018a imports the file correctly. It only slightly mangles the Authors header, because it does not know how to interpret the BOM: the header becomes x__Authors. The rest is as it should be.
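If only the first variable name is affected, it could also be repaired after import instead of touching the file. This is a sketch under the assumption that the mangled name looks like x__Authors, as described above; the table here is built by hand for illustration rather than read from a real file.

```matlab
% Stand-in for a table returned by readtable with a mangled first header
% (hypothetical data).
T = table({'Smith'; 'Jones'}, [2018; 2017], ...
          'VariableNames', {'x__Authors', 'Year'});

% Strip the leading 'x' plus underscores that replaced the BOM characters.
names = T.Properties.VariableNames;
names{1} = regexprep(names{1}, '^x_+', '');
T.Properties.VariableNames = names;
```

After this, the first variable is named Authors again. The pattern `^x_+` is an assumption about how the name is mangled, so check the actual name first.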

Edit: As far as I know, there is nothing you can do at the readtable level, but you could always check the files beforehand and remove the BOM:

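files = {....};  %list of files
folder = 'C:\somewhere';
bom = [239; 187; 191];  %UTF-8 BOM bytes EF BB BF, as returned by fread
for fileidx = 1:numel(files)
   filepath = fullfile(folder, files{fileidx});
   fid = fopen(filepath);
   content = fread(fid);  %read the whole file as raw bytes
   fclose(fid);
   if numel(content) >= 3 && isequal(content(1:3), bom)
      fid = fopen(filepath, 'w');  %overwrite the file without the BOM
      fwrite(fid, content(4:end));
      fclose(fid);
   end
end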


jmgoldba on 6 Sep 2023

Misdetecting the delimiter solved an issue I had. You'd think the .csv would give Matlab a clue what to use, but apparently not.

Stephen23 on 7 Sep 2023

Edited: Stephen23 on 7 Sep 2023

"You'd think the .csv would give Matlab a clue what to use, but apparently not."

There is no universally accepted standard for CSV files. In common usage, CSV files use comma, semicolon, or tab delimiters, mostly depending on the locale of the users who create them (e.g. see the Windows locale settings). The Wikipedia page for CSV files clearly explains this: "Delimiter-separated files are often given a ".csv" extension even when the field separator is not a comma. Many applications or libraries that consume or produce CSV files have options to specify an alternative delimiter."

https://en.wikipedia.org/wiki/Comma-separated_values
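Since the extension alone does not determine the separator, one option is to state the delimiter explicitly instead of relying on detection. The following is a minimal sketch using readtable's 'Delimiter' option on a made-up semicolon-delimited file; the file name and contents are assumptions for illustration only.

```matlab
% Write a small semicolon-delimited demo file (hypothetical data).
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'Authors;Year\nSmith;2018\nJones;2017\n');
fclose(fid);

% State the delimiter explicitly rather than relying on detection.
T = readtable(fname, 'Delimiter', ';', 'ReadVariableNames', true);

delete(fname);   % clean up the demo file
```

For the Scopus files from the question, the same idea would presumably mean passing 'Delimiter', ',' so that readtable does not fall back to a space.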

Please tell me the exact algorithm that you apparently think MATLAB could use that will perfectly detect the delimiter for every single CSV file on the entire planet without any mistake.

Then try your perfect algorithm on this very simple one-line CSV file:

123.456,789

and tell me what the numeric values are. Surely that perfect algorithm that you apparently think exists won't have any problems with that very very simple line, so I look forward to your reply soon!
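To make the ambiguity concrete, here is a small sketch of two equally plausible readings of that line. The variable names are made up for illustration, and neither reading is "the" correct one without knowing how the file was written.

```matlab
txt = '123.456,789';

% Reading 1: comma is the delimiter, dot is the decimal separator,
% so the line holds two fields.
asTwoFields = str2double(strsplit(txt, ','));

% Reading 2: comma is the decimal separator and dot groups thousands,
% so the line holds a single number.
asOneNumber = str2double(strrep(strrep(txt, '.', ''), ',', '.'));
```

The first reading gives the two values 123.456 and 789; the second gives the single value 123456.789.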
