Incomplete reading of MS Word file

Luca Scagnellato on 12 Apr 2023
Commented: Walter Roberson on 12 Apr 2023
At work I have to read some VERY long Word documents (~300 pages) and analyze the text. However, if I use the commands suggested in https://fr.mathworks.com/matlabcentral/answers/348737-how-to-read-ms-word-file-doc-docx :
word = actxserver('Word.Application');
wdoc = word.Documents.Open(filePath);
text = wdoc.Content.text;
wdoc.Close; % close document
word.Quit; % end application
the resulting "text" variable (1x158745 char) only contains ~25% of the document.
How can I read the whole document using this method? I saw that newer releases have dedicated functions/toolboxes for reading Word documents, but I don't have access to them, as my company only provides R2020b and a limited set of toolboxes.

Answers (1)

Oguz Kaan Hancioglu on 12 Apr 2023
I haven't tried this on such a large file, but could you try opening the Word document with fopen and reading the whole text with fread(fid, '*char')? Maybe it will work.
  1 Comment
Walter Roberson on 12 Apr 2023
That will not work in the form stated. .docx files are zip files that contain a directory of mostly XML files.
You can unzip the .docx file and go through the directory and try to extract things from the XML files; the XML files would be text files.
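A minimal sketch of the unzip approach, assuming the body text lives in word/document.xml inside the archive and that pulling the contents of the <w:t> elements is good enough for text analysis. The function names docxToText and documentXmlToText are hypothetical helpers, not built-in MATLAB functions. Headers, footers and footnotes are stored in other XML parts of the archive and are not read here.

```matlab
function txt = docxToText(docxPath)
% docxToText  Hypothetical helper: extract body text from a .docx file.
% Assumes the main text is stored in word/document.xml (UTF-8 encoded).
    tmpDir = tempname;                            % unique scratch folder name
    mkdir(tmpDir);
    cleanup = onCleanup(@() rmdir(tmpDir, 's'));  % delete scratch folder on exit
    unzip(docxPath, tmpDir);                      % a .docx file is a zip archive

    % Here fopen/fread does work, because the extracted XML is plain text.
    fid = fopen(fullfile(tmpDir, 'word', 'document.xml'), 'r', 'n', 'UTF-8');
    if fid == -1
        error('docxToText:noDocumentXml', 'word/document.xml not found.');
    end
    xml = fread(fid, '*char').';                  % row vector of characters
    fclose(fid);

    txt = documentXmlToText(xml);
end

function txt = documentXmlToText(xml)
% Concatenate the contents of all <w:t> elements, one line per paragraph.
    % Turn each paragraph end into a text run holding a newline.
    xml = strrep(xml, '</w:p>', ['<w:t>' newline '</w:t>']);

    % Match <w:t> or <w:t attr="..."> but not <w:tab/>, <w:tbl>, <w:tc>, ...
    tokens = regexp(xml, '<w:t(?:\s[^>]*)?>([^<]*)</w:t>', 'tokens');
    if isempty(tokens)
        txt = '';
        return
    end
    pieces = cellfun(@(c) c{1}, tokens, 'UniformOutput', false);
    txt = strjoin(pieces, '');

    % Decode the predefined XML entities (&amp; last, to avoid double decoding).
    txt = strrep(txt, '&lt;', '<');
    txt = strrep(txt, '&gt;', '>');
    txt = strrep(txt, '&quot;', '"');
    txt = strrep(txt, '&apos;', '''');
    txt = strrep(txt, '&amp;', '&');
end
```

Saved as docxToText.m, it could be called as text = docxToText(filePath); comparing numel(text) against the 158745 characters returned by the COM approach would show whether more of the document comes through this way.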
