Check that *.txt file is really a TXT formatted file?

Marco on 19 May 2015
Commented: Walter Roberson on 23 Aug 2016
Hello!
How can I detect whether the content of a *.txt file is really text before processing the file with my data import parser? I search folders for all files with the extension TXT in order to work with the data stored in each of them. In principle this works without problems. But sometimes a file has wrongly been stored with a *.TXT name although its content is not text but some binary format (e.g. it should have been named *.XLS).

Accepted Answer

Guillaume on 19 May 2015
It all depends on what you call a text file.
If it's an ASCII file, then the code values of the characters are limited to 0-127, so you could test whether any character has a value > 127. The presence of code values in the range 0-31, with the exception of 9 (tab), 10 (line feed) and 13 (carriage return), would also be a strong indication that the content is not meant to be read as text. It's not a guarantee though.
If it's an extended ASCII file, then the whole range 0-255 is used. Other than semantics, there's nothing distinguishing a text file from a binary file. Again characters in the range [0-8, 11-12, 14-31] would be an indication.
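A minimal sketch of the control-code heuristic described above. The demo file, the 4096-byte read limit and the variable names are assumptions for illustration only; a positive result is a hint, not proof, that the file is binary.

```matlab
% Hypothetical demo file: "Hello" followed by NUL, two control codes and 255.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fwrite(fid, uint8([72 101 108 108 111 0 1 2 255]), 'uint8');
fclose(fid);

% Read at most the first 4096 bytes as raw uint8 values (no text decoding).
fid = fopen(fname, 'r');
bytes = fread(fid, [1 4096], '*uint8');
fclose(fid);
delete(fname);

% Control codes other than tab (9), line feed (10) and carriage return (13).
suspicious = [0:8, 11:12, 14:31];
hasControl = any(ismember(bytes, suspicious));

% Any byte above 127 rules out strict 7-bit ASCII, but is normal
% in extended ASCII or UTF-8 text.
hasHighBit = any(bytes > 127);

fprintf('control codes: %d, bytes > 127: %d\n', hasControl, hasHighBit);
```

Reading only a bounded chunk keeps the check cheap even when the file turns out to be a large binary.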
If it's a UTF-8 file, there are some byte combinations that are not allowed and you could try to detect them. Again, control codes in the range [0-31] (other than tab and the newline characters) are an indication that it's not meant to be text.
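As a rough illustration of detecting disallowed UTF-8 combinations, the sketch below checks only the lead-byte and continuation-byte structure. It is a simplified check under stated assumptions: it does not reject every overlong 3- and 4-byte form or encoded surrogates, and the example bytes are made up.

```matlab
% Example input: "H", then "é" (C3 A9), then a lone continuation byte (80).
bytes = uint8([72 195 169 128]);

isValid = true;
k = 1;
n = numel(bytes);
while k <= n
    b = bytes(k);
    if b < 128
        extra = 0;                 % single-byte (ASCII) code point
    elseif b >= 194 && b <= 223
        extra = 1;                 % lead byte of a 2-byte sequence (C2..DF)
    elseif b >= 224 && b <= 239
        extra = 2;                 % lead byte of a 3-byte sequence (E0..EF)
    elseif b >= 240 && b <= 244
        extra = 3;                 % lead byte of a 4-byte sequence (F0..F4)
    else
        isValid = false;           % 80..C1 and F5..FF can never start a sequence
        break
    end
    % Every continuation byte must lie in the range 80..BF (128..191).
    if k + extra > n || any(bytes(k+1:k+extra) < 128 | bytes(k+1:k+extra) > 191)
        isValid = false;
        break
    end
    k = k + extra + 1;
end

fprintf('structurally valid UTF-8: %d\n', isValid);
```

If only the first part of a file is examined, a multi-byte sequence cut off at the end of the chunk will be reported as invalid, so the last few bytes of a truncated read deserve some tolerance.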
Perhaps, instead of trying to discriminate text files against binary, what you should be discriminating is files conforming to the format your code expects and those that don't?
  5 Comments
Guillaume on 19 May 2015
Edited: Guillaume on 19 May 2015
@Walter,
Can MATLAB decode UTF-16? It's certainly not listed as an option for the encoding of fopen.
Also,
filestart = char(fread(fid, numel(expectedstart)))';
%or
filestart = char(fread(fid, [1 numel(expectedstart)]));
%or
filestart = fread(fid, [1 numel(expectedstart)], '*char');
would be more akin to fscanf. But fread only works if the characters are ASCII (or more precisely, just one byte per code point).
UTF-8 is identical to ASCII for code points < 128. Anything above that uses more than one byte per character.
Walter Roberson on 23 Aug 2016
Yes, MATLAB can decode UTF-16, both little endian and big endian. It can also decode UTF-32 little endian and big endian. For any of these MATLAB will issue a warning when you fopen() the file about the encoding not being supported, but really what that means is that MATLAB does not support writing files in those formats.
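One way to guess which of these encodings a file uses is to look for a byte order mark (BOM) at its start. The sketch below is an illustration only: the demo file and the hasPrefix helper are hypothetical, and many text files carry no BOM at all, in which case nothing can be concluded.

```matlab
% Hypothetical demo file: UTF-16 little-endian BOM (FF FE) followed by "A".
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fwrite(fid, uint8([255 254 65 0]), 'uint8');
fclose(fid);

% Read at most the first four bytes as raw values.
fid = fopen(fname, 'r');
head = fread(fid, [1 4], '*uint8');
fclose(fid);
delete(fname);

% Helper (hypothetical name): true if h starts with the byte sequence sig.
hasPrefix = @(h, sig) numel(h) >= numel(sig) && isequal(h(1:numel(sig)), sig);

% UTF-32LE must be tested before UTF-16LE because its BOM starts with FF FE.
if hasPrefix(head, [239 187 191])
    encoding = 'UTF-8';
elseif hasPrefix(head, [255 254 0 0])
    encoding = 'UTF-32LE';
elseif hasPrefix(head, [0 0 254 255])
    encoding = 'UTF-32BE';
elseif hasPrefix(head, [255 254])
    encoding = 'UTF-16LE';
elseif hasPrefix(head, [254 255])
    encoding = 'UTF-16BE';
else
    encoding = '';   % no BOM found; the encoding remains unknown
end

fprintf('BOM suggests: %s\n', encoding);
```

Whether the detected name can then be passed as the encoding argument of fopen depends on the MATLAB release and platform, so that step is left out here.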


More Answers (1)

Stephen23 on 19 May 2015
Edited: Stephen23 on 19 May 2015
It is important to note that files themselves have no semantic meaning: they are merely lots of bits that can be interpreted in a particular way, given a known encoding. To answer your question you really need to answer this question: What exactly is a text file?
Here are two methods that you could try:
  • Read the file data, and check that all of the "characters" are within the expected character range (e.g. alphanumeric, punctuation, spaces, etc). This would work best when the data is of a limited kind (e.g. numeric data) and uses only a small character set (e.g. ASCII). It also depends on the character encoding/format and several other factors, so it is very fragile in practice.
  • Read the first few bytes and check if it matches any known file signature. This is also fragile in practice, as it would miss formats not covered by the list of signatures.
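The second method could be sketched as follows. The signature list is deliberately short and illustrative, not exhaustive, and the demo file and variable names are assumptions; a file matching none of the entries may still be binary.

```matlab
% A few well-known file signatures ("magic numbers") as byte values.
signatures = struct( ...
    'name',  {'OLE2 compound file (e.g. legacy XLS/DOC)', ...
              'ZIP container (e.g. XLSX)', 'PDF', 'PNG'}, ...
    'bytes', {[208 207 17 224 161 177 26 225], ...
              [80 75 3 4], [37 80 68 70], [137 80 78 71 13 10 26 10]});

% Hypothetical demo file: OLE2 header bytes saved under a .txt name.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fwrite(fid, uint8([208 207 17 224 161 177 26 225 0 0 0 0]), 'uint8');
fclose(fid);

% Read only the first 8 bytes, the length of the longest signature above.
fid = fopen(fname, 'r');
head = fread(fid, [1 8], '*uint8');
fclose(fid);
delete(fname);

% Compare the start of the file against each signature in turn.
detected = '';
for s = signatures
    n = numel(s.bytes);
    if numel(head) >= n && isequal(head(1:n), s.bytes)
        detected = s.name;
        break
    end
end

if isempty(detected)
    disp('No known binary signature found.');
else
    fprintf('Looks like: %s\n', detected);
end
```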
  3 Comments
Stephen23 on 19 May 2015
Edited: Stephen23 on 21 May 2015
It won't crash, but don't use fgetl: it reads up to the next newline character, and in a binary file there may be no combination of bits that looks like a newline. The "line" could then end up being 5 GiB of random data... or however big the file might be.
A better solution would be to use fscanf, as Guillaume explained, and to read just the number of characters that you need to identify the file. You can find more useful file reading functions here:
And because you already know the first characters, you can simply check that these are what the file contains.
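A minimal sketch of that check, using a bounded fread as in Guillaume's comment rather than fgetl. The header string and the demo file are made up for illustration; expectedstart stands for whatever your own format begins with.

```matlab
% Hypothetical header that every valid data file is assumed to start with.
expectedstart = '# MyLogger v1';

% Hypothetical demo file that conforms to this format.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n1 2 3\n', expectedstart);
fclose(fid);

% Read exactly as many characters as the expected header has, never more,
% so a huge binary file costs no more than a small text file.
fid = fopen(fname, 'r');
filestart = fread(fid, [1 numel(expectedstart)], '*char');
fclose(fid);
delete(fname);

% strcmp is false if the file is shorter than the header or differs from it.
isExpected = strcmp(filestart, expectedstart);

if isExpected
    disp('Header matches; safe to hand the file to the parser.');
else
    disp('Header does not match; skipping file.');
end
```

This tests conformance to the expected format directly, which is usually a more reliable filter than trying to tell "text" from "binary" in general.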
Marco on 19 May 2015
Thanks a lot, really helpful! As I could only accept one answer, I at least gave you my vote.
