Efficient way to read variable column number data from a mixed-format text file?

Question

Carson Purnell on 6 Sep 2022


https://jp.mathworks.com/matlabcentral/answers/1796435-efficient-way-to-read-variable-column-number-data-from-a-mixed-format-text-file

Commented: Carson Purnell on 21 Oct 2022

I'm trying to read specific data from .cif files, which have an unfortunate text format. A relevant section is below: it starts with a list of property identifiers, one per column, followed by many rows that each contain that number of properties, and then the section is closed. A file can contain multiple sections. I already have ways to find each of them and handle them separately, as the rest of the file contents are irrelevant and differently formatted.

loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.auth_asym_id
_atom_site.auth_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.pdbx_PDB_model_num
ATOM 1 N N . ASP A 1 1 624.249 268.361 303.253 A 2 ? 0.00 0.00 1
ATOM 2 C CA . ASP A 1 1 625.516 268.284 302.473 A 2 ? 0.00 0.00 1
ATOM 3 C C . ASP A 1 1 626.767 268.479 303.343 A 2 ? 0.00 0.00 1
ATOM 4 O O . ASP A 1 1 627.026 269.597 303.785 A 2 ? 0.00 0.00 1
ATOM 5 C CB . ASP A 1 1 625.533 269.354 301.363 A 2 ? 0.00 0.00 1

The problem is that the number of properties can vary, and the width of each property is not fixed, so I cannot directly parse the data block. Text-reading functions like textscan aren't working because the leading data is formatted entirely differently, and as far as I can see they won't operate on extracted strings of cleaned data.

Is there some sneaky way to make a table with a list of headers transposed like that? I'm trying to avoid a very slow loop to parse each line individually, especially as I only need select columns of data.

1 Comment

dpb on 6 Sep 2022

"...trying to avoid a very slow loop to parse each line individually"

Actually, that generally is NOT that slow as long as the output variable(s) have been preallocated so you're not dynamically reallocating on every pass through the loop.
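As a rough illustration of the preallocation point, here is a minimal sketch of a per-line loop that fills an array sized once up front. The sample rows are copied from the question; the variable names and the assumption that the Cartn_x/y/z fields sit at positions 10 to 12 are only for this example and would need to be derived from the header list in a real file.

```matlab
% Sample rows copied from the question, held in memory as a string array.
rows = ["ATOM 1 N N . ASP A 1 1 624.249 268.361 303.253 A 2 ? 0.00 0.00 1"
        "ATOM 2 C CA . ASP A 1 1 625.516 268.284 302.473 A 2 ? 0.00 0.00 1"
        "ATOM 3 C C . ASP A 1 1 626.767 268.479 303.343 A 2 ? 0.00 0.00 1"];

xyzCols = [10 11 12];           % assumed positions of Cartn_x/y/z in this block
xyz = zeros(numel(rows), 3);    % preallocate once, outside the loop

for k = 1:numel(rows)
    fields = split(strip(rows(k)));               % fields of one row (column vector)
    xyz(k, :) = str2double(fields(xyzCols)).';    % fill the preallocated row
end
```

Whether this is fast enough for millions of lines is something only a timing run on the real data can settle; the sketch just shows the loop shape being described.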

"...way to make a table with a list of headers transposed like that?"

What list transposed like what? Lost me here, sorry; don't see what would correspond to that statement in looking at the data given.

You can certainly use textscan on in-memory data. I don't know that it has been extended to string arrays yet, however, which may be where you ran into an issue, depending on how you read the file.
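For what it's worth, a small sketch of textscan on in-memory text: the string array is joined into a single character vector first, which sidesteps the string-array question. The sample rows come from the question; the field count of 18 and the choice to read every field as text are assumptions for this example.

```matlab
% Sample rows copied from the question; in practice they might come from readlines().
rows = ["ATOM 1 N N . ASP A 1 1 624.249 268.361 303.253 A 2 ? 0.00 0.00 1"
        "ATOM 2 C CA . ASP A 1 1 625.516 268.284 302.473 A 2 ? 0.00 0.00 1"];

txt = char(join(rows, newline));   % one char vector, rows separated by newlines

nCols = 18;                        % assumed field count for this block
fmt = repmat('%s', 1, nCols);      % read every field as text
C = textscan(txt, fmt);            % whitespace-delimited by default

x = str2double(C{10});             % field 10 is Cartn_x in this example
```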

BUT, my recommendation is to provide a pertinent data file as an attachment so folks can access it and then describe explicitly what it is that is wanted/needed from the file and undoubtedly the file(s) can be read.


Answer 1

chrisw23 on 8 Sep 2022


Try string operations for performance (I saved your example as a text file):

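% Read the file as a string array, one element per line
rawLines = readlines("example.txt");
% Header lines start with an underscore
headerId = rawLines.startsWith("_");
varNames = rawLines(headerId);
% Data rows start with ATOM
dataId = rawLines.startsWith("ATOM");
% Collapse runs of whitespace so split() yields one field per column
pat = whitespacePattern(1,inf);
dataRows = rawLines(dataId).strip.replace(pat," ").split();
% Build the table and label the columns with the header names
resTbl = array2table(dataRows);
resTbl.Properties.VariableNames = varNames;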
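Since only select columns are needed, one possible follow-up is to pick them by header name and convert just those to numbers. The sketch below uses a shortened, hypothetical header list and an in-memory string array in place of readlines so that it runs on its own; it assumes a MATLAB release that accepts arbitrary table variable names (R2019b or later) and whitespacePattern (R2020b or later).

```matlab
% Hypothetical, shortened stand-in for the file contents.
rawLines = ["loop_"
            "_atom_site.group_PDB"
            "_atom_site.id"
            "_atom_site.Cartn_x"
            "_atom_site.Cartn_y"
            "_atom_site.Cartn_z"
            "ATOM 1   624.249 268.361 303.253"
            "ATOM 2   625.516 268.284 302.473"];

% Same steps as above: header names, then whitespace-normalized data rows.
varNames = rawLines(startsWith(rawLines, "_"));
dataLines = strip(rawLines(startsWith(rawLines, "ATOM")));
dataRows = split(replace(dataLines, whitespacePattern(1, inf), " "));

resTbl = array2table(dataRows);
resTbl.Properties.VariableNames = varNames;

% Select the wanted columns by name and convert only those to double.
wanted = "_atom_site.Cartn_" + ["x" "y" "z"];
xyz = str2double(resTbl{:, wanted});     % numeric N-by-3 matrix
```

Selecting by name means the code does not care how many other properties a given section carries or in what order they appear.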

1 Comment

Carson Purnell on 21 Oct 2022

This general strategy ended up working. Getting the header block into a table made it possible to pick out the target information up front, without needing to regex-sort the headers themselves or anything like that.


Answer 2

Walter Roberson on 6 Sep 2022


Edited: Walter Roberson on 6 Sep 2022


I suggest using fileread() and text processing. For example,

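S = fileread('YourFile.cif');
% 1st pattern: drop everything before the first ATOM line; 2nd: drop from the first non-ATOM line onward
S = regexprep(S, {'^.*?(?=^ATOM)', '^(?!ATOM).*$'}, {'', ''}, 'lineanchors');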

If I got the pattern right, this should first delete everything in the file before the first line that starts with ATOM, and then delete everything from the first remaining line that does not start with ATOM onward.

What remains would be a character vector that you could pass as the first parameter to textscan()

3 Comments

Carson Purnell on 6 Sep 2022

Regexprep is much too slow - I need to parse millions of lines of strings. str2double has the same problem, because it won't operate on arrays in a useful way, and a looped str2double is (for this) far slower than a single vectorized str2num solution.

There's also one or more data blocks per file, so clearing lines before and after the first block is not useful. I can already extract the relevant lines for each block, I just can't figure out how to parse it without an incredibly slow set of loops because the columns are variable.

Walter Roberson on 6 Sep 2022


When you currently extract the relevant lines for each block, what form are you extracting them into? Are you extracting them all first and post-processing?

If you have a file identifier fid positioned to loop_ then

textscan(fid, '_%*[^\n]', 'headerlines', 1)

should read through all of the _ lines, leaving you positioned at ATOM. At that point you can ftell() and record the position. Then fgetl() and analyze that one line to count fields, figure out which columns are character and which are number. With that information on hand, you can generate a format to read such lines. fseek() to go back to the beginning of the line and textscan() with that format.

However, this approach would be weak if the columns are fixed width and there are some empty columns -- for example if you had a situation where alt_id was empty if the alt_id was the same as the atom_id .
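The ftell / fgetl / fseek / textscan sequence described above could look roughly like the sketch below. It writes a tiny hypothetical block to a temporary file so that it runs on its own, and it swaps in a plain fgetl loop to skip the header lines instead of the textscan call. Treating a field as numeric whenever str2double succeeds on the first data row is an assumption that may misclassify columns in real files.

```matlab
% Write a small hypothetical block to a temporary file (illustration only).
fname = [tempname '.cif'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', 'loop_', '_atom_site.group_PDB', '_atom_site.id', ...
    '_atom_site.Cartn_x', 'ATOM 1 624.249', 'ATOM 2 625.516');
fclose(fid);

fid = fopen(fname, 'r');
pos = ftell(fid);                    % position of the line about to be read
txtLine = fgetl(fid);
while ischar(txtLine) && ~startsWith(txtLine, 'ATOM')
    pos = ftell(fid);                % remember where the next line starts
    txtLine = fgetl(fid);
end

% Analyze the first data row: count fields and guess which are numeric.
fields = split(string(strtrim(txtLine)));
isNum = ~isnan(str2double(fields));
spec = repmat("%s", numel(fields), 1);
spec(isNum) = "%f";
fmt = char(join(spec, ""));          % '%s%f%f' for this sample

fseek(fid, pos, 'bof');              % back to the start of the first ATOM line
C = textscan(fid, fmt);              % reads rows until the format stops matching
fclose(fid);
delete(fname);
```

Columns that are not needed could be skipped by emitting %*s instead of %s for them when building fmt.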
