Read specific data columns from a text file based on header name requested by user

22 ビュー (過去 30 日間)
Hello,
I have the matlab version 2018a.
I'm trying to extract specific columns of a text file based on the header name of the column. I have tried couple of different methods such as readtable, textscanf, etc. but, none of them exactly worked as I expected.
I have attached the text file itself. I'm trying to make sure the code I'm writing is not slow because there are 1000's of these files that I need to look into in a for-loop possibly.
The structure never changes but, the header columns can be in different positions and that's the reason why I want the code to find the header name no matter which position the column is in.
Here is a sample from the text file:
As it can be seen, the same dates are repeated below with different headers (information) and it is repeated 3-4 times in the actual text file. If I know how to pick up "WOPR - PROD1", "WOPR-PROD2", and "FOPT" columns and put them into a matrix in this order [WOPR-PROD1; WOPR-PROD2; FOPT] I can figure out the rest I believe. I prefer not to modify the text file itself if possible.
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"SUMMARY OF RUN Original_1
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"DATE ""YEARS ""FOPR ""FWPR ""FGPR ""FOPT ""FGPT ""FWPT ""FWCT ""FWIR "
" ""YEARS ""STB/DAY ""STB/DAY ""MSCF/DAY ""STB ""MSCF ""STB "" ""STB/DAY "
" "" "" "" "" "" "" "" "" "" "
" "" "" "" "" "" "" "" "" "" "
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
" 1JAN2009" 0 0 0 0 0 0 0 0 0
" 1FEB2009" 0.084873 0 0 0 0 0 0 0 0
" 1MAR2009" 0.161533 2000.000 65.16867 1360.000 56000.00 38080.00 1824.723 0.031556 0
" 1APR2009" 0.246407 2000.000 67.93040 1360.000 118000.0 80240.00 3906.001 0.032849 0
" 1MAY2009" 0.328542 2449.850 53.91752 1665.898 191495.5 130216.9 5523.527 0.021535 0
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"SUMMARY OF RUN Original_1 "
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
"DATE ""FWIT ""FGOR ""FOIP ""FWIP ""FGIP ""FPR ""WOPR ""WOPR ""WOPR "
" ""STB ""MSCF/STB ""STB ""STB ""MSCF ""PSIA ""STB/DAY ""STB/DAY ""STB/DAY "
" "" "" ""*10**3 ""*10**3 ""*10**3 "" "" "" "" "
" "" "" "" "" "" "" ""PROD1 ""PROD2 ""PROD3 "
" "" "" "" "" "" "" "" "" "" "
"--------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------""-----------"
" 1JAN2009" 0 0 31190.54 645456.1 21209.57 6553.930 0 0 0
" 1FEB2009" 0 0 31190.54 645456.1 21209.57 6553.922 0 0 0
" 1MAR2009" 0 0.680000 31134.54 645454.2 21171.49 6473.267 0 0 0
" 1APR2009" 0 0.680000 31072.54 645452.2 21129.33 6394.598 0 0 0
" 1MAY2009" 0 0.680000 30999.18 645450.7 21079.44 6296.722 0 1675.190 0
Any help is appreciated. Thank you.
  7 件のコメント
Stephen23
Stephen23 2019 年 6 月 1 日
編集済み: Stephen23 2019 年 6 月 1 日
I would use textscan, something like this:
  • read the very first line using fgetl
  • identify the number of columns (regexp, count delimiters, or whatever).
  • generate textscan format strings, for header and data.
  • in a while loop import the file data using textscan.
  • post-process the headers to generate unique valid fieldnames / table variable names.
  • convert to structure or table.
  • now the order of the columns is irrelevant: simply access the structure fields / table variables.
You should read this (EDIT: unfortunately TMW seem to have removed the very useful example from this page showing how to use while to read blocks of data, but note how the repeated textscan calls could be within a while loop):
Yildirim Kocoglu
Yildirim Kocoglu 2019 年 6 月 1 日
編集済み: Yildirim Kocoglu 2019 年 6 月 1 日
Thank you for your suggestions Stephen. I always appreciate your input.
I'll look into the links you have provided. I'm still working on the code and I will post the answer once I have it ready. I was able to create a new file without all the [", "--...", "SUMMARY..."] lines. I might be doing some things the long way but, it's a learning curve and I'm not sure if this is the path I should follow yet.
Here is the code I have used with the previous text file that generated this newfile.txt that I have attached. I don't know how to combine all the expressions into one line if they are doing pretty much the same thing like the one regexprep that uses dotexceptnewline or if it's possible to write a one line code that does all of this. That is not important though for now.
I'm thinking if I use a " , " delimiter replacing spaces and add a space horizontally between the pages of new info (where dates start) and read them seperately, that might do the trick (maybe).
content = fileread('Original_1.txt');
filebyline = regexprep(content, '"SUMMARY\s.*$', '', 'lineanchors', 'dotexceptnewline');
filebyline = regexprep(filebyline, '"-.*$', '', 'lineanchors', 'dotexceptnewline');
newfile = regexprep(filebyline, '["]', ' ');
fid = fopen('newfile.txt', 'w');
fprintf(fid, '%s', newfile);
fclose(fid);

サインインしてコメントする。

採用された回答

Yildirim Kocoglu
Yildirim Kocoglu 2019 年 6 月 5 日
Before I continue, I want to Thank @Bob Nbob and @Stephen Cobeldick for their work, suggestions and help.
I really appreciate everything you guys are doing for this community.
Whoever is interested in this post and was waiting for an answer.
It took me a while but, I have finally got it to work correctly. The code is a little long and I did not exactly choose good variable names or probably wrote comments detailed enough.
If I can write a more efficient code and anyone has suggestions I'll consider them.
I don't know if 0.027222 seconds (from beginning to end) is efficient enough for this kind of task: read 68 columns of info from a text file full of "pages" of columns with no delimiters between columns.
The output is a 1X68 column cell array called "storage" with 99X1 or 100X1 cell arrays inside each cell of 68 cells. The output can be changed to numeric value later after removing some of the "." at the very end of some numbers (look at the number "8713996." FOPT in first cell towards the last rows - there is no "0" after the decimal).
The output is also a cell array and can be converted to a different type of array by using cell2mat,etc. functions (which I have not used in this code).
%% Read txt file
% Reset all the variables
clear;
clc;
% Read the content of the text file into memory
content = fileread('Original_1.txt');
% Declare desired string occurences (columns) to create the storage cell
% (initially with an unknown size) and store specified columns
desired_string = ["FOPT", "FGPT", "FWPT", "WOPR", "WWPR", "WOPT", "WWPT", "WGPT", "WGPR", "WBHP", "WGOR" ];
% Create cell array to store the columns desired
count_string = count(content,desired_string );
storage = cell(1,count_string);
% Delete unnecessary strings and special characters for readibility (should
% be left with 130 characters "per line"-content variable in workspace is a
% character vector of size 1*154485). Warning: New line and carriage return
% characters also need to be deleted to get 130 characters "per line".
new_content = regexprep(content, '"SUMMARY\s.*$|"-.*$|SUMMARY\s.*$|^\s+|\n|\r', '', 'lineanchors', 'dotexceptnewline');
% If first 13 characters containts the string 'DATE' insert '!' to beginning
% as a delimiter to separate into pages
search_string = {'"','DATE'};
first_13 = new_content(1:13);
if contains(first_13,search_string(1,1))
new_str = insertBefore(new_content, '"DATE', '!');
final_str = strsplit(new_str, '!');
elseif contains(first_13, search_string(1,2))
new_str = insertBefore(new_content, 'DATE', '!');
final_str = strsplit(new_str, '!');
end
% Search for a specific string and read all the columns
pages = length(final_str);
add = 13;
count = 0;
next_column = 0;
% For loop for reading each page of the content
for ii = 2:pages
% Convert cell into character vector
new_char_vec = char(final_str(1,ii));
% Length of each character vector in each cell
long = length(char(final_str(1,ii)));
% Column number of each page!
row_no = long/130;
% Read first 130 characters, 13 characters at a time 10 times and
% match the desired string
for i = 1:10
row = 1 + count;
column = 13 + count;
testing = new_char_vec(1, (row:column));
count = count + add;
% if contains 'FOPT' %read all the columns related
if contains(testing,desired_string)
% Move to next column of storage cell when string is found
next_column = next_column + 1;
% Create cell array inside each column of storage cell array
% row_no varies
storage{1,next_column} = cell(row_no,1);
% read_rows and read_columns are reset to row and column
% everytime the desired_string is found
read_rows = row;
read_columns = column;
% Store the the desired string column's first row
storage{1,next_column}{1,1} = new_char_vec(1, (read_rows:read_columns));
% Store each row of the desired string column (starting from 2nd row) in a for loop
for jj = 2:(row_no)
read_rows = read_rows + 130;
read_columns = read_columns + 130;
storage{1,next_column}{jj,1} = new_char_vec(1, (read_rows:read_columns));
end
end
end
% Reset count
count = 0;
end
% Clear all the unnecessary variables
clear add column content count count_string desired_string final_str first_13 i ii jj long new_char_vec new_content new_str next_column pages read_rows read_columns row row_no search_string testing

その他の回答 (1 件)

Bob Thompson
Bob Thompson 2019 年 5 月 31 日
I started to work on this, and realized that with my knowledge the code was going to be pretty ugly. If anybody knows a way to convert the text into an array that would help a lot, but I do not know how to do that.
With that in mind, I abandoned regexp, as it got too complex and ugly with the different layers of cells needed to break everything. Instead I just used fgetl and parsing to look through the file for the key words you want.
I am going to assume that the matrix layouts are going to remain the same for all of your text files. For example, in the posted sample you have WOPR in the first row, with PROD1, PROD2, PROD3 in the fourth row. The sample code I wrote is looking for these specific locations. They can be in any of the matrices, and don't all need to be in the same matrix, but WOPR is assumes to be in the first row, and PROD# in the fourth. You will need to adjust things if these references are not consistent.
flist = dir('ORIGINAL_*.txt');
for n = 3:length(flist) % dir usually picks up a '.' and '..' listing as the first two elements. If this is not the case just start i = 1
% Read in file
A = fopen(flist(n).name);
% Initialize everything
line = fgetl(A);
c = 1;
i = 0;
j = 0;
ifpt = [];
iwpr = {};
FPT = {};
WPR = cell(2,1);
% Loop through lines of file
while ~isnumeric(line)
tmp = strsplit(line); % Split line at white space, to find proper column
strt = strfind(tmp,'SUMMARY'); % Look for beginning of matrix
strt = strt(~cellfun('isempty',strt)); % Remove negative results from check
% Record data from previous matrix
if ~isempty(strt)
if i > 0 % First matrix check
FOPT(:,1,i) = cellfun(@(x) str2num(x),FPT(5:9)); % Record FOPT data in array
FPT = {}; % Reset FPT var
end
if j > 0
WOPR(:,1,i) = cellfun(@(x) str2num(x),WPR{1}(3:7)); % Record WOPR-PROD1 in first column
WOPR(:,2,i) = cellfun(@(x) str2num(x),WPR{2}(3:7)); % Record WOPR-PROD2 in second column
WPR = {}; % Reset
end
ifpt = []; % Reset
iwpr = {}; % Reset
end
% Check for desired columns
fpt = strfind(tmp,'FOPT'); % Look for FOPT
fpt = fpt(~cellfun('isempty',fpt)); % Remove negatives
if ~isempty(fpt)
ifpt = find(contains(tmp,'FOPT')); % Get index of positive result
i = i + 1; % Advance FOPT results array index
end
wpr = strfind(tmp,'WOPR'); % Look for WOPR
wpr = wpr(~cellfun('isempty',wpr));
if ~isempty(wpr)
for k = 1:6 % Skip lines to find PROD
line = fgetl(A); c = c + 3;
end
% NOTE: The above will ruin the indexing of FOPT array if both
% occur in the same matrix
tmp = strsplit(line);
iwpr{1} = find(contains(tmp,'PROD1')); % Check which WOPR is PROD1
iwpr{2} = find(contains(tmp,'PROD2')); % Check which WOPR is PROD2
j = j + 1; % Advance WOPR results array index
end
% Capture data
if ~isempty(ifpt)&size(tmp,2)>1 % FOPT exists in matrix, and isn't blank line
FPT = vertcat(FPT, tmp(ifpt));
end
if ~isempty(iwpr)&size(tmp,2)>1 % WOPR exists in matrix, and isn't blank line
WPR{1} = vertcat(WPR{1}, tmp(iwpr{1}+1));
WPR{2} = vertcat(WPR{2}, tmp(iwpr{2}+1));
end
% Advance through file
line = fgetl(A);
c = c + 1;
end
% Record results from last matrix
if ~isempty(FPT) % FOPT existed in last matrix
if i > 0
FOPT(:,1,i) = cellfun(@(x) str2num(x),FPT(5:9));
FPT = {};
end
end
if ~isempty(WPR{1}) % WOPR existed in last matrix
if j > 0
WOPR(:,1,i) = cellfun(@(x) str2num(x),WPR{1}(3:7));
WOPR(:,2,i) = cellfun(@(x) str2num(x),WPR{2}(3:7));
WPR = {};
end
end
end
  2 件のコメント
Yildirim Kocoglu
Yildirim Kocoglu 2019 年 5 月 31 日
編集済み: Yildirim Kocoglu 2019 年 6 月 1 日
Thank you @Bob Nbob. This code must have taken some time to write... I really appreciate everything you have done.
I have not tested your code yet. I'll let get back to you after testing the code however, I also want you and anyone else reading this post be aware of some new things I have discovered which might be very useful in finding an easier solution for this problem.
I realize that I should have updated you earlier about the pattern discoveries I made yesterday night in the text file and I apologize for not letting you know sooner.
I noticed these patterns in text file:
  • Every single column of info is made up of 13 characters which includes [", _, space] characters
Example: Read from "1JAN2009"-"DATE" to the beginning of the next value-"YEARS" in the same row and you will see that it is 13 characters not including the next value -"FOPR" and when you start reading from the next value "FOPR" to the beginning of the next value-"FWPR" you will again see that it is 13 characters not including the next value (FWPR) and this pattern is consistent until the end of the line.
I have tested this theory with fileread() function: There are 10 columns = 13*10=130 characters written in each line and a new line starts at 131 st character. These 130 characters does not necessarily include 10 columns of info, it could be 9 columns of info and the rest of the characters are space characters but, each line is always made up of 130 characters regardless. If you start from the beginning of the line and read to the right-downwards at 132nd character you will get the letter "S" of "SUMMARY" after ' " ' character as 131st character.
  • The string "SUMMARY..." in the text file marks the beginning of new information = Dates,etc.
  • The last line of the page does not have the string "SUMMARY...".
  • One more important realization: The line with string "SUMMARY..." is an exception to the 130 characters per line rule. That seems to be the only line that goes way beyond 130 characters.
  • Last realization (hopefully): lines that start with "_ _ _..." also are an exception to the 130 per line rule.
I have a plan of my own for solving this issue (approximate solution):
  • Replace [",-] characters with space character so that the structure of the text file=13 characters per column stays in tact.
  • If a line is made of all space characters, remove/skip that line so that I can only work with the info I need: read from the string "SUMMARY..." to the next string "SUMMARY..." and work with that info but, do not include the string "SUMMARY...".
  • Store everything after the string "SUMMARY..." that is not a date or a number into one cell as Header no matter how many rows it takes to read since the number of rows can be inconsistent for each header (4 above and 5 below including PROD1,etc.).
  • After seeing a date or a number, divide everything into 10 x 13 character columns until the next string "SUMMARY..." or end of page since the last info will not have the string "SUMMARY..." at the end of it.
  • Make sure each header is correctly related to the column info and call the column data based on the header and get rid of the rest of the information.
  • There is also a possibility that the last columns might not equal 10 columns of info: 10th column could be made up of all space characters so, somehow skip these.
With my current knowledge it will not be that easy for me to put this into code however, I think this will solve the problem. There might also be a better solution to this that I don't know about.
Thank you again @Bob Nbob.
Yildirim Kocoglu
Yildirim Kocoglu 2019 年 5 月 31 日
編集済み: Yildirim Kocoglu 2019 年 6 月 1 日
I have tested the code without any modifications and it did not produce the results I wanted.
The line with "SUMMARY..." string has the character ' " ' at the very end (need to scroll all the way) and it has the same character in every line of "SUMMARY..." string.
The results it produced were:
Capture.PNG
The structure has name, path, date of text file, bytes, isdir=0, and datenum=7.3757e+05.
I might have to look into it more but, regardless thank you for the time and effort you have put into this. At least your code will give me more ideas and I will learn/remember faster. It's been a while since the last time I used matlab.

サインインしてコメントする。

カテゴリ

Help Center および File ExchangeText Files についてさらに検索

製品


リリース

R2018a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by