Pull out strings and their values from a text file.

Hi
Please find the attached *.txt file. I want to analyze the whole text file.
Thanks
-Sriram

Accepted Answer

Guillaume on 11 Jun 2019

1 vote

Hi Sriram, sorry I was away last week.
Parsing the first part of each message (date, level, source) is trivial. It's the part after that which is difficult, due to the variations in format. I don't fully understand the algorithm you've written and I don't think you can use : indiscriminately as a delimiter. For example, on line 2 it's part of https://www....
Here is how I would start the parsing:
filecontent = string(fileread('File.txt')); %read whole file as STRING (for easier text comparison later)
messages = regexp(filecontent, '^(?<date>[^ ]+) (?<level>[^ ]+) (?<source>[^:]+):\s+(?<content>[^\r\n]+)', 'names', 'lineanchors'); %parse all lines according to common format
dates = num2cell(datetime([messages.date], 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ss.SSSSSSZZZZZ', 'TimeZone', 'UTC')); %decode date
[messages.date] = dates{:}; %and put back into structure
%parsing of kernel messages
iskernel = [messages.source] == "kernel";
parsedkernel = regexp([messages(iskernel).content], '\[\s*(?<cputime>[^\]]+)]\s+(?<message>.*)', 'names'); %parse kernel messages. Not sure of the rule
parsedkernel = [parsedkernel{:}]; %convert into structure array
cputime = num2cell(str2double([parsedkernel.cputime])); %convert cputime to numeric
[parsedkernel.cputime] = cputime{:}; %and put back into structure
parsedkernel = num2cell(parsedkernel); %convert to cell array to put back into messages structure
[messages(iskernel).content] = parsedkernel{:};
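
A quick way to check the expressions is to run them on a few lines pasted from the question rather than on the whole file. The sketch below uses the char-array form, so it should also run on releases without the string type. The variable names sample, pattern and kernel are illustrative only and are not part of the code above.

```matlab
% Three sample lines copied from the question, each terminated by a newline
sample = sprintf('%s\n', ...
    '2019-05-10T21:41:40.054631+00:00 INFO kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] type 16', ...
    '2019-05-10T21:41:40.054649+00:00 DEBUG kernel: [ 0.000009] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved', ...
    '2019-05-10T21:41:40.054785+00:00 NOTICE kernel: [ 0.101170] random: get_random_bytes called from start_kernel+0x8d/0x429 with crng_init=0');

% Common part of every line: date, level, source, then the free-form content
pattern = '^(?<date>[^ ]+) (?<level>[^ ]+) (?<source>[^:]+):\s+(?<content>[^\r\n]+)';
messages = regexp(sample, pattern, 'names', 'lineanchors');   % 1x3 structure array

% Kernel-specific part: number between [] followed by the message text
kernel = regexp({messages.content}, '\[\s*(?<cputime>[^\]]+)\]\s+(?<message>.*)', 'names');
kernel = [kernel{:}];                        % cell array of structs -> structure array
cputime = str2double({kernel.cputime});      % numeric row vector, one value per line

disp({messages.level});
disp(cputime);
```

Lines whose content does not match the kernel expression would produce empty results and need separate handling, which is why the rule for those messages still has to be defined.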

6 Comments

Life is Wonderful on 14 Jun 2019
Edited: Guillaume on 14 Jun 2019
Hi
Thanks a lot. I changed the code to use char instead of string and ran the script.
I see this error:
"
Error using datetime (line 598)
Unable to parse date/time string
'2019-05-10T21:41:40.053993+00:002019-05-10T21:41:40.054122+00:002019-05-10T21:41:40.054614+00:002019-05-10T21:41:40.054618+00:002019-05-10T21:41:40.054622+00:002019-05-10T21:41:40.054623+00:002019-05-10T21:41:40.054196+00:002019-05-10T21:41:40.054625+00:002019-05-10T21:41:40.054626+00:002019-05-10T21:41:40.054627+00:002019-05-10T21:41:40.054627+00:002019-05-10T21:41:40.054230+00:002019-05-10T21:41:40.054628+00:002019-05-10T21:41:40.054629+00:002019-05-10T21:41:40.054629+00:002019-05-10T21:41:40.054255+00:002019-05-10T21:41:40.054631+00:002019-05-10T21:41:40.054632+00:002019-05-10T21:41:40.054632+00:00....'
using the format 'yyyy-MM-dd'T'HH:mm:ss.SSSSSSZZZZZ'.
Please help me.
edited by Guillaume to shorten wall of text
Walter Roberson on 14 Jun 2019
"I made the change instead of string - I used char"
Are you using R2016a or before? If so that is important to know.
Stephen23 on 14 Jun 2019
sriram shastry's "Answer" moved here:
I am using a release before R2016a.
Guillaume on 14 Jun 2019
Edited: Guillaume on 14 Jun 2019
So which version?
Same code to work with char arrays instead of strings:
filecontent = fileread('File.txt'); %read whole file as a char vector
messages = regexp(filecontent, '^(?<date>[^ ]+) (?<level>[^ ]+) (?<source>[^:]+):\s+(?<content>[^\r\n]+)', 'names', 'lineanchors'); %parse all lines according to common format
dates = num2cell(datetime({messages.date}, 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ss.SSSSSSZZZZZ', 'TimeZone', 'UTC')); %decode date
[messages.date] = dates{:}; %and put back into structure
%parsing of kernel messages
iskernel = strcmp({messages.source}, 'kernel');
parsedkernel = regexp({messages(iskernel).content}, '\[\s*(?<cputime>[^\]]+)]\s+(?<message>.*)', 'names'); %parse kernel messages. Not sure of the rule
parsedkernel = [parsedkernel{:}]; %convert into structure array
cputime = num2cell(str2double({parsedkernel.cputime})); %convert cputime to numeric
[parsedkernel.cputime] = cputime{:};
parsedkernel = num2cell(parsedkernel); %convert to cell array to put back into messages structure
[messages(iskernel).content] = parsedkernel{:};
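
For anyone wondering why the original code failed with char data: when the structure fields hold char vectors, [messages.date] concatenates them into one long char vector, which is the run-together text shown in the datetime error message, whereas {messages.date} keeps one timestamp per cell. A minimal sketch with two timestamps taken from that error message (the names s, joined and listed are illustrative only):

```matlab
% Two-element structure array whose date fields are char vectors
s = struct('date', {'2019-05-10T21:41:40.053993+00:00', '2019-05-10T21:41:40.054122+00:00'});

joined = [s.date];   % single char vector: both timestamps run together
listed = {s.date};   % 1x2 cell array: one timestamp per cell

fprintf('joined has %d characters, listed has %d cells\n', numel(joined), numel(listed));

% datetime can decode the cell array element by element
dates = datetime(listed, 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ss.SSSSSSZZZZZ', 'TimeZone', 'UTC');
```

With string fields the square brackets build a string array instead, which is why the first version of the code could use [messages.date].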
Guillaume on 14 Jun 2019
Sriram's comment mistakenly posted as an answer (please use comments!):
Thanks a lot. It works.
Guillaume on 14 Jun 2019
Then consider changing your accepted answer, particularly after all the hard work that has gone into getting you there.


More Answers (1)

Dimitar Georgiev on 26 May 2019

0 votes

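data = readcell('filename.xlsx','Range','......'); %read the range into a cell array (named data because a variable called cell would shadow the built-in function)
stringname = '......';
variable = strcmp(stringname,data); %logical array, true where a cell matches stringname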

12 Comments

Life is Wonderful on 26 May 2019
Edited: Life is Wonderful on 29 May 2019
I have accepted the answer as it's a pointer on how to use the function, but it is not solving the issue.
It is NOT helping me.
I have a text file with lots of text and numbers associated with it.
Example :
2019-05-10T21:41:40.054631+00:00 INFO kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] type 16
2019-05-10T21:41:40.054632+00:00 INFO kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009ffff] usable
2019-05-10T21:41:40.054632+00:00 INFO kernel: [ 0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
2019-05-10T21:41:40.054633+00:00 INFO kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000089afdfff] usable
I need each string AND its associated number. Next I want to convert the cells into meaningful data for plotting as well.
Life is Wonderful on 28 May 2019
Edited: Life is Wonderful on 29 May 2019
I want to parse the full file using the textscan function.
  • Identify the matching pattern with string and values
cac = textscan(fid,'%s%s%s%*s[^\r\n]','Delimiter','');
[~] = fclose( fid );
n1 = cac{1}; % new
n2 = cac{2}; % new
n3 = cac{3}; % new
n4 = cac{4}; % new
  • Convert the cell using cellfun
I need help
  • Plot the data
I need help
Guillaume on 29 May 2019
"I want to parse the full file using textscan"
Why the insistence on using textscan? The modern readtable, readcell, etc. can usually figure out the format for you and if not usually do it after being given a few hints. And they output something more useful than textscan.
One thing that you should never do is create numbered variables. Instead of embedding an index in the variable name, use proper indexing.
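
As a rough illustration of that last point, the sketch below keeps the columns in the cell array and indexes it instead of copying each column to n1, n2, and so on. The content of cac is made-up stand-in data, not real textscan output, and the field names are assumptions chosen for this example.

```matlab
% Stand-in for what textscan returns: one cell per column (made-up data)
cac = {{'a'; 'b'}, {'INFO'; 'DEBUG'}, {'kernel'; 'kernel'}};

% Instead of n1 = cac{1}; n2 = cac{2}; ... loop over the cell array
for k = 1:numel(cac)
    column = cac{k};
    fprintf('column %d has %d rows\n', k, numel(column));
end

% If names are wanted, name all columns in one step (field names are assumed)
logdata = cell2struct(cac, {'date', 'level', 'source'}, 2);
disp(logdata.level);
```

Either form scales to any number of columns without editing variable names by hand.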
Walter Roberson on 29 May 2019
Looking at the sample file, I would say it is too irregular for textscan or readtable to be useful. I would instead fileread() and use regexp() to pull it apart.
Guillaume on 29 May 2019
Oh yes, I didn't look at the file, since the accepted answer uses readcell to read an Excel file. Any file that is similar to an Excel file (i.e. tabulated) can easily be read without using textscan.
Having now looked at the file, I would agree that readxxx would be completely unsuitable and textscan would struggle. Indeed regexp or a dedicated parser would be the way to go.
Life is Wonderful on 30 May 2019
Can you please suggest a sample to work on?
My requirements are:
  • Get the pattern string and values in a structure variable.
  • Convert the cell array into MATLAB variables and values
  • Use cellfun
  • Plot the data
Guillaume on 30 May 2019
Most of your requirements are requirements on how the code should be implemented instead of on what it should do. This is not how you design code. You first specify what result you want, then you use whichever implementation gives you these results efficiently.
Whether structures, cell arrays, cellfun, etc. are useful is unknown because you haven't specified what you want other than some sort of name/value pair (what does pattern string and value refer to?) and a plot of something (what is the data?)
Life is Wonderful on 30 May 2019
Edited: Life is Wonderful on 30 May 2019
Understood. Here are my requirements. An example snippet from the attached file is:
2019-05-10T21:41:40.054631+00:00 INFO kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] type 16
2019-05-10T21:41:40.054649+00:00 DEBUG kernel: [ 0.000009] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
2019-05-10T21:41:40.054785+00:00 NOTICE kernel: [ 0.101170] random: get_random_bytes called from start_kernel+0x8d/0x429 with crng_init=0
s_TimeStamp(idx,1) = 21:41:40.054631+00:00
MsgLib.Kernel.String(idx,1) = INFO kernel
MsgLib.Kernel.String(idx,1) = DEBUG kernel
MsgLib.Kernel.String(idx,1) = NOTICE kernel
MsgLib.Kernel.val(idx,1) = 0.000000
Msg.Kernel.BIOS.SubStr(idx,1) = BIOS-e820
Msg.Kernel.BIOS.SubVal(idx,1) = [mem 0x0000000000000000-0x0000000000000fff]
Msg.Kernel.BIOS.Substr.type(idx,1) = type
Msg.Kernel.BIOS.SubVal.type(idx,1) = 16
figure;
subplot(3,1,1);plot(MsgLib.Kernel.String,MsgLib.Kernel.val );legend(sprintf('%s,%d','MsgLib.Kernel.String',MsgLib.Kernel.val));
subplot(3,1,2);plot(Msg.Kernel.BIOS.SubStr,Msg.Kernel.BIOS.SubVal)
subplot(3,1,3);plot(Msg.Kernel.BIOS.Substr.type,Msg.Kernel.BIOS.SubVal.type );
I want to analyze the full file like this. If you can help me with sample code, I will write the rest myself.
Thanks
Guillaume on 30 May 2019
OK, so you want to parse each line of the file and split the lines into various components.
Once again, you're also giving an implementation. I'm not convinced that the structure you outline is a good idea, but that's not important right now.
The first thing you have to do, before we can even think how to implement it, is define exactly the parsing rules for the lines of the file. The start of the rule is going to be:
  • extract all the characters up to the first space and decode that as time
  • then, extract the characters up to the colon. That's the log source (I assume)
After that I'm not sure. It looks like the rule may vary according to the log source. If the log source is ANYTHING kernel, then the next step is
  • Extract the number between [] as the log value
Then it gets very murky: you get different types of messages after the [xxx] with different formatting. You will have to establish the rules for how these should be decoded.
If the log source is not the kernel, you get a completely different format of message. Again, you need to specify the rules for decoding these.
So, I'm afraid, the task is back onto you. You first need to define rules (there's going to be several due to the complex formatting of the lines) on how to split a line into various components. Only once you've done that can we think about writing the code to do it.
I suggest you continue this bullet point list:
For each line:
  • extract the text up to the first space as the logtime
  • then extract the text up to the colon as the logsource
  • if logsource ends with kernel
      • extract the number between the [] as logvalue
      • ????
  • if logsource is ????
      • ????
      • ????
  • if ???
      • ????
      • ????
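
To make the shape of such a rule list concrete, here is a minimal sketch of the three rules that are already defined, applied to one line from the question. Everything after the bracketed number is left in a remainder field because its rules are still unknown. The field names and the result structure are assumptions made for this illustration, not an agreed design.

```matlab
line = '2019-05-10T21:41:40.054631+00:00 INFO kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] type 16';

% Rule 1: the text up to the first space is the log time
[logtime, rest] = strtok(line, ' ');
% Rule 2: the text up to the first colon is the log source (here 'INFO kernel')
[logsource, rest] = strtok(strtrim(rest), ':');
rest = strtrim(rest(2:end));              % drop the colon and surrounding blanks

result = struct('logtime', logtime, 'logsource', logsource, ...
                'logvalue', NaN, 'remainder', rest);

% Rule 3: if the source ends with 'kernel', the number between [] is the log value
if ~isempty(regexp(logsource, 'kernel$', 'once'))
    tok = regexp(rest, '^\[\s*([^\]]+)\]\s*(.*)$', 'tokens', 'once');
    if ~isempty(tok)
        result.logvalue = str2double(tok{1});
        result.remainder = tok{2};        % still to be decoded by further rules
    end
end

disp(result);
```

Each additional rule would become another branch that consumes part of the remainder, which is why the rules need to be written down before any more code is added.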
Life is Wonderful on 30 May 2019
Thanks! Yes, I agree with your approach to the algorithm. Can you please give me some sample code?
Guillaume on 1 Jun 2019
As I wrote:
"So, I'm afraid, the task is back onto you. You first need to define rules (there's going to be several due to the complex formatting of the lines) on how to split a line into various components. Only once you've done that can we think about writing the code to do it."
Life is Wonderful on 6 Jun 2019
Any feedback here?
Thanks

