MSER と OCR を使用したテキストの自動検出と自動認識

ライブスクリプトを開く

この例では、maximally stable extremal regions (MSER) 特徴検出器を使用して、イメージ内でテキストが含まれる領域を検出する方法を示します。これは構造化されていないシーンに対してよく実行されるタスクです。構造化されていないシーンとは、不確定のシナリオやランダムなシナリオを含むイメージです。たとえば、キャプチャしたビデオからテキストを自動的に検出して認識し、ドライバーに交通標識について注意を促すことができます。これは、テキストの位置が事前に分かっている場合の既知のシナリオが含まれている、構造化されたシーンとは異なります。

構造化されていないシーンからテキストをセグメント化すると、光学式文字認識 (OCR) などの追加のタスクに非常に役立ちます。この例における自動テキスト検出アルゴリズムでは、数多くのテキスト領域候補を検出し、テキストを含む確率が低いものを徐々に削除していきます。

手順 1: MSER を使用したテキスト領域候補の検出

MSER 特徴検出器はテキスト領域の検出に効果的です [1]。テキストの一貫した色と高いコントラストから安定した強度プロファイルが得られるため、うまく機能します。

関数 detectMSERFeatures を使用してイメージ内のすべての領域を見つけ、結果をプロットします。テキスト領域とともにテキスト以外の領域も多く検出されることに注意してください。

colorImage = imread("handicapSign.jpg");
I = im2gray(colorImage);

% Detect MSER regions.
[mserRegions, mserConnComp] = detectMSERFeatures(I, ... 
    "RegionAreaRange",[200 8000],"ThresholdDelta",4);

figure
imshow(I)
hold on
plot(mserRegions, "showPixelList", true,"showEllipses",false)
title("MSER regions")
hold off

Figure contains an axes object. The axes object with title MSER regions contains 1120 objects of type image, line.

手順 2: 基本的な幾何学的特性に基づくテキスト以外の領域の削除

MSER アルゴリズムでは大半のテキストが見つかりますが、イメージ内にある、テキストではない他の安定した領域も多く検出されます。ルールベースの方法を使用すると、テキスト以外の領域を削除できます。たとえば、テキストの幾何学的特性を利用して、簡単なしきい値に基づきテキスト以外の領域を除外できます。あるいは、機械学習の方法を使用して、テキストとテキスト以外を区別する分類器に学習させることもできます。通常は、これら 2 つの方法を組み合わせるとよりよい結果が得られます [4]。この例では、簡単なルールベースの方法を使用して、幾何学的特性を基にテキスト以外の領域をフィルターで除外します。

テキスト領域とテキスト以外の領域の区別に有効な幾何学的特性には、以下を含むいくつかがあります [2,3]。

縦横比
離心率
オイラー数
割合
固体度

regionprops を使用して上記特性のいくつかを測定し、それらの特性の値に基づいて領域を削除します。

% Use regionprops to measure MSER properties
mserStats = regionprops(mserConnComp, "BoundingBox", "Eccentricity", ...
    "Solidity", "Extent", "Euler", "Image");

% Compute the aspect ratio using bounding box data.
bbox = vertcat(mserStats.BoundingBox);
w = bbox(:,3);
h = bbox(:,4);
aspectRatio = w./h;

% Threshold the data to determine which regions to remove. These thresholds
% may need to be tuned for other images.
filterIdx = aspectRatio' > 3; 
filterIdx = filterIdx | [mserStats.Eccentricity] > .995 ;
filterIdx = filterIdx | [mserStats.Solidity] < .3;
filterIdx = filterIdx | [mserStats.Extent] < 0.2 | [mserStats.Extent] > 0.9;
filterIdx = filterIdx | [mserStats.EulerNumber] < -4;

% Remove regions
mserStats(filterIdx) = [];
mserRegions(filterIdx) = [];

% Show remaining regions
figure
imshow(I)
hold on
plot(mserRegions, "showPixelList", true,"showEllipses",false)
title("After Removing Non-Text Regions Based On Geometric Properties")
hold off

Figure contains an axes object. The axes object with title After Removing Non-Text Regions Based On Geometric Properties contains 1042 objects of type image, line.

手順 3: 線幅のばらつきに基づくテキスト以外の領域の削除

テキストとテキスト以外を区別するもう 1 つの測定基準として、線幅がよく使用されます。"線幅" は、文字を構成する曲線や直線の幅の測定値です。テキスト領域では線幅はあまり変化しない傾向にあり、テキスト以外の領域では線幅のばらつきが大きくなる傾向があります。

線幅を使用してテキスト以外の領域を削除する方法を理解するために、検出されたいずれかの MSER 領域の線幅を推定してみます。これを行うには、距離変換およびバイナリ細線化操作を使用します [3]。

% Get a binary image of the a region, and pad it to avoid boundary effects
% during the stroke width computation.
regionImage = mserStats(6).Image;
regionImage = padarray(regionImage, [1 1]);

% Compute the stroke width image.
distanceImage = bwdist(~regionImage); 
skeletonImage = bwmorph(regionImage, "thin", inf);

strokeWidthImage = distanceImage;
strokeWidthImage(~skeletonImage) = 0;

% Show the region image alongside the stroke width image. 
figure
subplot(1,2,1)
imagesc(regionImage)
title("Region Image")

subplot(1,2,2)
imagesc(strokeWidthImage)
title("Stroke Width Image")

Figure contains 2 axes objects. Axes object 1 with title Region Image contains an object of type image. Axes object 2 with title Stroke Width Image contains an object of type image.

上記のイメージにおいて、線幅のイメージでは領域のほぼ全体にわたり変化がほとんどないことに注意してください。これは、この領域がテキスト領域である可能性が大きいことを示しています。領域を構成するすべての直線と曲線の幅が類似しているのは、人間が読み取れるテキストの一般的な特性だからです。

線幅のばらつきを使って、しきい値によりテキスト以外の領域を削除するには、領域全体におけるばらつきを 1 つのメトリクスとして次のように定量化しなければなりません。

% Compute the stroke width variation metric 
strokeWidthValues = distanceImage(skeletonImage);   
strokeWidthMetric = std(strokeWidthValues)/mean(strokeWidthValues);

そのうえで、しきい値を適用してテキスト以外の領域を削除します。ただし、異なったフォントスタイルをもつイメージでは、このしきい値の調整が必要となる場合があります。

% Threshold the stroke width variation metric
strokeWidthThreshold = 0.4;
strokeWidthFilterIdx = strokeWidthMetric > strokeWidthThreshold;

上記の手続きは、検出された各 MSER 領域に別々に適用しなければなりません。次の for ループでは、すべての領域を処理した後、線幅のばらつきを使ってテキスト以外の領域を削除した結果が表示されます。

% Process the remaining regions
for j = 1:numel(mserStats)
    
    regionImage = mserStats(j).Image;
    regionImage = padarray(regionImage, [1 1], 0);
    
    distanceImage = bwdist(~regionImage);
    skeletonImage = bwmorph(regionImage, "thin", inf);
    
    strokeWidthValues = distanceImage(skeletonImage);
    
    strokeWidthMetric = std(strokeWidthValues)/mean(strokeWidthValues);
    
    strokeWidthFilterIdx(j) = strokeWidthMetric > strokeWidthThreshold;
    
end

% Remove regions based on the stroke width variation
mserRegions(strokeWidthFilterIdx) = [];
mserStats(strokeWidthFilterIdx) = [];

% Show remaining regions
figure
imshow(I)
hold on
plot(mserRegions, "showPixelList", true,"showEllipses",false)
title("After Removing Non-Text Regions Based On Stroke Width Variation")
hold off

Figure contains an axes object. The axes object with title After Removing Non-Text Regions Based On Stroke Width Variation contains 1032 objects of type image, line.

手順 4: テキスト領域のマージによる最終検出結果の取得

この時点では、すべての検出結果が個々のテキスト文字で構成されています。この結果を OCR などの認識タスクで使用するには、個々のテキストの文字をマージして単語やテキスト行にしなければなりません。これにより、個々の文字よりも意味のある情報をもつ、イメージ内の実際の単語を認識できるようになります。たとえば、'EXIT' という文字列を認識する場合と、{'X','E','T','I'} という個々の文字のセットを認識する場合を比較すると、後者では文字の順序が正しくないため単語の意味が失われます。

個々のテキスト領域をマージして単語やテキスト行にする 1 つの方法として、まず隣接するテキスト領域を検出してから、これらの領域の周りに境界ボックスを形成します。隣接する領域を見つけるには、あらかじめ regionprops で計算してあった境界ボックスを拡張します。これにより隣接するテキスト領域の境界ボックスが重ね合わされ、同じ単語やテキスト行を構成する複数のテキスト領域により、一連のオーバーラップする境界ボックスが形成されます。

% Get bounding boxes for all the regions
bboxes = vertcat(mserStats.BoundingBox);

% Convert from the [x y width height] bounding box format to the [xmin ymin
% xmax ymax] format for convenience.
xmin = bboxes(:,1);
ymin = bboxes(:,2);
xmax = xmin + bboxes(:,3) - 1;
ymax = ymin + bboxes(:,4) - 1;

% Expand the bounding boxes by a small amount.
expansionAmount = 0.02;
xmin = (1-expansionAmount) * xmin;
ymin = (1-expansionAmount) * ymin;
xmax = (1+expansionAmount) * xmax;
ymax = (1+expansionAmount) * ymax;

% Clip the bounding boxes to be within the image bounds
xmin = max(xmin, 1);
ymin = max(ymin, 1);
xmax = min(xmax, size(I,2));
ymax = min(ymax, size(I,1));

% Show the expanded bounding boxes
expandedBBoxes = [xmin ymin xmax-xmin+1 ymax-ymin+1];
IExpandedBBoxes = insertShape(colorImage,"rectangle",expandedBBoxes,"LineWidth",3);

figure
imshow(IExpandedBBoxes)
title("Expanded Bounding Boxes Text")

Figure contains an axes object. The axes object with title Expanded Bounding Boxes Text contains an object of type image.

次に、オーバーラップする境界ボックスをマージして、個々の単語またはテキスト行の周りに 1 つの境界ボックスを形成します。これを行うには、すべての境界ボックスのペア間でオーバーラップ率を計算します。これによってテキスト領域のすべてのペア間の距離が定量化され、非ゼロのオーバーラップ率を検索することによって隣接するテキスト領域のグループを見つけることが可能になります。ペアごとのオーバーラップ率を計算した後、graph を使用して、非ゼロのオーバーラップ率で「連結された」すべてのテキスト領域を見つけます。

関数 bboxOverlapRatio を使用してすべての拡張された境界ボックスについてペアごとのオーバーラップ率を計算し、graph を使用してすべての連結された領域を見つけます。

% Compute the overlap ratio
overlapRatio = bboxOverlapRatio(expandedBBoxes, expandedBBoxes);

% Set the overlap ratio between a bounding box and itself to zero to
% simplify the graph representation.
n = size(overlapRatio,1); 
overlapRatio(1:n+1:n^2) = 0;

% Create the graph
g = graph(overlapRatio);

% Find the connected text regions within the graph
componentIndices = conncomp(g);

conncomp の出力は、各境界ボックスが属している連結テキスト領域のインデックスです。これらのインデックスを使用して、各連結要素を構成する個々の境界ボックスの最小値と最大値を計算することにより、隣接する複数の境界ボックスを 1 つの境界ボックスにマージします。

% Merge the boxes based on the minimum and maximum dimensions.
xmin = accumarray(componentIndices', xmin, [], @min);
ymin = accumarray(componentIndices', ymin, [], @min);
xmax = accumarray(componentIndices', xmax, [], @max);
ymax = accumarray(componentIndices', ymax, [], @max);

% Compose the merged bounding boxes using the [x y width height] format.
textBBoxes = [xmin ymin xmax-xmin+1 ymax-ymin+1];

最後に、検出の最終結果を表示する前に、1 つのテキスト領域のみで構成される境界ボックスを削除して、誤検出されたテキストを非表示にします。テキストは、通常はグループ (単語や文) として出現するので、これにより実際にテキストである確率の低い、孤立した領域が削除されます。

% Remove bounding boxes that only contain one text region
numRegionsInGroup = histcounts(componentIndices);
textBBoxes(numRegionsInGroup == 1, :) = [];

% Show the final text detection result.
ITextRegion = insertShape(colorImage, "rectangle", textBBoxes,"LineWidth",3);

figure
imshow(ITextRegion)
title("Detected Text")

Figure contains an axes object. The axes object with title Detected Text contains an object of type image.

手順 5: 検出したテキストの OCR による認識

テキスト領域を検出した後、関数 ocr を使用して各境界ボックス内のテキストを認識します。最初にテキスト領域の検出を行わない場合、関数 ocr の出力に含まれるノイズが大幅に増えることに注意してください。

ocrtxt = ocr(I, textBBoxes);
[ocrtxt.Text]

ans = 
    'HANDICAPPED
     PARKING
     SPECIAL PLATE
     REQUIRED
     UNAUTHORIZED
     VEHICLES
     MAY BE TOWED
     AT OWNERS
     EXPENSE
     
     ie os i uu
     
     '

この例では、MSER 特徴検出器を使用してまずテキスト領域の候補を見つけ、幾何学的測定値を使ってテキスト以外の領域をすべて削除することにより、イメージ内のテキストを検出する方法を説明しました。この例のコードは、よりロバストなテキスト検出アルゴリズムを開発するための適切な開始点となります。また、この例に改善を加えないままでも、たとえば posters.jpg や licensePlates.jpg など、他のさまざまなイメージで妥当な結果が得られます。

参考文献

[1] Chen, Huizhong, et al. "Robust Text Detection in Natural Images with Edge-Enhanced Maximally Stable Extremal Regions."Image Processing (ICIP), 2011 18th IEEE International Conference on.IEEE, 2011.

[2] Gonzalez, Alvaro, et al. "Text location in complex images." Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012.

[3] Li, Yao, and Huchuan Lu. "Scene text detection via stroke width." Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012.

[4] Neumann, Lukas, and Jiri Matas. "Real-time scene text localization and recognition."Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.IEEE, 2012.

参照

[1] Chen, Huizhong, et al. "Robust Text Detection in Natural Images with Edge-Enhanced Maximally Stable Extremal Regions." Image Processing (ICIP), 2011 18th IEEE International Conference on. IEEE, 2011.