matlab.io.datastore.Partitionable class

Namespace: matlab.io.datastore

Add parallelization support to datastore

Description

matlab.io.datastore.Partitionable is an abstract mixin class that adds parallelization support to your custom datastore for use with Parallel Computing Toolbox™ and MATLAB® Parallel Server™.

To use this mixin class, you must inherit from the matlab.io.datastore.Partitionable class in addition to inheriting from the matlab.io.Datastore base class. Type the following syntax as the first line of your class definition file:

classdef MyDatastore < matlab.io.Datastore & ...
                       matlab.io.datastore.Partitionable
    ...
end

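To add support for parallel processing to your custom datastore, you must:

Inherit from the additional class matlab.io.datastore.Partitionable.
Define the additional methods maxpartitions and partition.

For more details and the steps to create a custom datastore with parallel processing support, see Develop Custom Datastore.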

Methods

`maxpartitions`	Maximum number of partitions possible
`numpartitions`	Default number of partitions
`partition`	Partition a datastore
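
As a rough illustration of how these methods are called, the following sketch partitions a built-in datastore instead of a custom one. It assumes arrayDatastore (available in R2020b and later) as a stand-in, and the partition count n = 3 is an arbitrary choice for the example. The sketch reads every partition and checks that the parts, taken together, contain the same rows as the original data.

```matlab
% Stand-in data and datastore (assumption: arrayDatastore, R2020b or later).
data = (1:10)';
ds = arrayDatastore(data,'OutputType','same');

n = 3;                          % arbitrary number of partitions for this sketch
parts = cell(n,1);
for ii = 1:n
    subds = partition(ds,n,ii); % datastore covering only partition ii of n
    parts{ii} = readall(subds); % all rows that belong to this partition
end

% Taken together, the partitions should contain every row exactly once.
recombined = vertcat(parts{:});
sameRows = isequal(sort(recombined),data);
```

A custom datastore that inherits from matlab.io.datastore.Partitionable is used through the same partition call once it defines maxpartitions and partition.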

Attributes

Sealed false

For information on class attributes, see Class Attributes.

Examples

Build Datastore with Parallel Processing Support

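Build a datastore with parallel processing support and use it to bring your custom or proprietary data into MATLAB®. Then, process the data in a parallel pool.

Create a .m class definition file that contains the code implementing your custom datastore. You must save this file in your working folder or in a folder on the MATLAB® path. The name of the .m file must be the same as the name of your object constructor function. For example, if you want your constructor function to have the name MyDatastorePar, then the name of the .m file must be MyDatastorePar.m. The .m class definition file must contain the following steps:

Step 1: Inherit from the datastore classes.
Step 2: Define the constructor and the required methods.
Step 3: Define your custom file reading function.

In addition to these steps, define any other properties or methods that you need to process and analyze your data.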

%% STEP 1: INHERIT FROM DATASTORE CLASSES
classdef MyDatastorePar < matlab.io.Datastore & ...
        matlab.io.datastore.Partitionable
   
    properties(Access = private)
        CurrentFileIndex double
        FileSet matlab.io.datastore.DsFileSet
    end
    
    % Property to support saving, loading, and processing of
    % datastore on different file system machines or clusters.
    % In addition, define the methods get.AlternateFileSystemRoots()
    % and set.AlternateFileSystemRoots() in the methods section. 
    properties(Dependent)
        AlternateFileSystemRoots
    end
    
%% STEP 2: DEFINE THE CONSTRUCTOR AND THE REQUIRED METHODS
    methods
        % Define your datastore constructor
        function myds = MyDatastorePar(location,altRoots)
            myds.FileSet = matlab.io.datastore.DsFileSet(location,...
                'FileExtensions','.bin', ...
                'FileSplitSize',8*1024);
            myds.CurrentFileIndex = 1;
             
            if nargin == 2
                 myds.AlternateFileSystemRoots = altRoots;
            end
            
            reset(myds);
        end
        
        % Define the hasdata method
        function tf = hasdata(myds)
            % Return true if more data is available
            tf = hasfile(myds.FileSet);
        end
        
        % Define the read method
        function [data,info] = read(myds)
            % Read data and information about the extracted data
            % See also: MyFileReader()
            if ~hasdata(myds)
                msgII = ['Use the reset method to reset the datastore ',... 
                         'to the start of the data.']; 
                msgIII = ['Before calling the read method, ',...
                          'check if data is available to read ',...
                          'by using the hasdata method.'];
                error('No more data to read.\n%s\n%s',msgII,msgIII);
            end
            
            fileInfoTbl = nextfile(myds.FileSet);
            data = MyFileReader(fileInfoTbl);
            info.Size = size(data);
            info.FileName = fileInfoTbl.FileName;
            info.Offset = fileInfoTbl.Offset;
            
            % Update CurrentFileIndex for tracking progress
            if fileInfoTbl.Offset + fileInfoTbl.SplitSize >= ...
                    fileInfoTbl.FileSize
                myds.CurrentFileIndex = myds.CurrentFileIndex + 1 ;
            end
        end
        
        % Define the reset method
        function reset(myds)
            % Reset to the start of the data
            reset(myds.FileSet);
            myds.CurrentFileIndex = 1;
        end

        % Define the partition method
        function subds = partition(myds,n,ii)
            subds = copy(myds);
            subds.FileSet = partition(myds.FileSet,n,ii);
            reset(subds);
        end
        
        % Getter for AlternateFileSystemRoots property
        function altRoots = get.AlternateFileSystemRoots(myds)
            altRoots = myds.FileSet.AlternateFileSystemRoots;
        end

        % Setter for AlternateFileSystemRoots property
        function set.AlternateFileSystemRoots(myds,altRoots)
            try
              % The DsFileSet object manages AlternateFileSystemRoots
              % for your datastore
              myds.FileSet.AlternateFileSystemRoots = altRoots;

              % Reset the datastore
              reset(myds);  
            catch ME
              throw(ME);
            end
        end
      
    end
    
    methods (Hidden = true)          
        % Define the progress method
        function frac = progress(myds)
            % Determine the fraction (0 to 1) of data read from the datastore
            if hasdata(myds) 
               frac = (myds.CurrentFileIndex-1)/...
                             myds.FileSet.NumFiles; 
            else 
               frac = 1;  
            end 
        end
    end
    
    methods(Access = protected)
        % If you use the  FileSet property in the datastore,
        % then you must define the copyElement method. The
        % copyElement method allows methods such as readall
        % and preview to remain stateless 
        function dscopy = copyElement(ds)
            dscopy = copyElement@matlab.mixin.Copyable(ds);
            dscopy.FileSet = copy(ds.FileSet);
        end
        
        % Define the maxpartitions method
        function n = maxpartitions(myds)
            n = maxpartitions(myds.FileSet);
        end
    end
end

%% STEP 3: IMPLEMENT YOUR CUSTOM FILE READING FUNCTION
function data = MyFileReader(fileInfoTbl)
% create a reader object using FileName
reader = matlab.io.datastore.DsFileReader(fileInfoTbl.FileName);

% seek to the offset
seek(reader,fileInfoTbl.Offset,'Origin','start-of-file');

% read fileInfoTbl.SplitSize amount of data
data = read(reader,fileInfoTbl.SplitSize);

end

Your custom datastore is now ready. Use your custom datastore to read and process the data in a parallel pool.
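
The constructor above passes a FileSplitSize of 8*1024 bytes to DsFileSet, so a file larger than that is read in more than one call to read. The following arithmetic is only a sketch of that idea. It assumes that each file is cut into consecutive chunks of at most FileSplitSize bytes, with the final chunk holding the remainder, which is an assumption about DsFileSet rather than something stated on this page. The 10000-byte file size is a hypothetical value chosen for illustration.

```matlab
fileSize  = 10000;      % hypothetical file size in bytes
splitSize = 8*1024;     % FileSplitSize used in the MyDatastorePar constructor

% Assumed layout: consecutive chunks, last chunk holds the remainder.
offsets = 0:splitSize:fileSize-1;            % byte offset where each split starts
sizes   = min(splitSize,fileSize - offsets); % number of bytes in each split

% The read method advances CurrentFileIndex when a split reaches the end
% of the file. The same test, applied to each split:
isLastSplit = (offsets + sizes) >= fileSize;
```

Under these assumptions the file yields two splits, and only the second one satisfies the end-of-file test that read uses to update CurrentFileIndex.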

Read Data with Custom Datastore and Process It in Parallel Pool

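Use a custom datastore to preview your proprietary data and read it into MATLAB for parallel processing.

This example uses a simple data set to illustrate a workflow with your custom datastore. The data set is a collection of 15 binary (.bin) files, where each file contains 1 column (1 variable) and 10000 rows (records) of unsigned integers.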

dir('*.bin')

binary_data01.bin  binary_data02.bin  binary_data03.bin  binary_data04.bin  binary_data05.bin  binary_data06.bin  binary_data07.bin  binary_data08.bin  binary_data09.bin  binary_data10.bin  binary_data11.bin  binary_data12.bin  binary_data13.bin  binary_data14.bin  binary_data15.bin
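
The sample files themselves are not reproduced here. To try the workflow, stand-in files with the same names and shape can be generated along the following lines. This is a sketch under stated assumptions: the values are random uint8 numbers rather than the original data, and the output folder binary_data_demo under tempdir is a hypothetical location chosen for the example.

```matlab
% Hypothetical stand-in data: 15 files, each with 10000 random uint8 records.
outDir = fullfile(tempdir,'binary_data_demo');   % assumed output folder
if ~exist(outDir,'dir')
    mkdir(outDir);
end

numFiles   = 15;
numRecords = 10000;
for k = 1:numFiles
    fname = fullfile(outDir,sprintf('binary_data%02d.bin',k));
    fid = fopen(fname,'w');
    if fid == -1
        error('Could not open %s for writing.',fname);
    end
    % One byte per record, written as raw unsigned 8-bit integers.
    fwrite(fid,randi([0 255],numRecords,1,'uint8'),'uint8');
    fclose(fid);
end

listing = dir(fullfile(outDir,'binary_data*.bin'));
```

To use these files with the steps that follow, change to the output folder or pass its location to the datastore constructor.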

Create a datastore object using the MyDatastorePar function. For details on the implementation of MyDatastorePar, see the example Build Datastore with Parallel Processing Support.

folder = fullfile('*.bin'); 
ds = MyDatastorePar(folder);

Preview the data from the datastore.

preview(ds)

ans = 8x1 uint8 column vector

   113
   180
   251
    91
    29
    66
   254
   214

Identify the number of partitions for your datastore. If you have Parallel Computing Toolbox (PCT), then you can use n = numpartitions(ds,myPool), where myPool is gcp or parpool.

n = numpartitions(ds);

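Partition the datastore into n parts and process each part in one iteration of a parfor loop, so that the parts are distributed across the workers in the parallel pool.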

parfor ii = 1:n
    subds = partition(ds,n,ii);
    while hasdata(subds)
        data = read(subds);
        % do something
    end
end
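
The "do something" step often combines the results of all partitions into a single value. The following sketch shows one way to do that with a parfor reduction variable. It is an illustration, not part of the original example: it assumes arrayDatastore (R2020b and later) as a stand-in for MyDatastorePar so that it runs without the sample files, and the data values and ReadSize are arbitrary. Without Parallel Computing Toolbox, the parfor loop runs serially.

```matlab
% Stand-in for the custom datastore (assumption: arrayDatastore, R2020b+).
values = uint8((1:200)');
ds = arrayDatastore(values,'OutputType','same','ReadSize',50);

n = numpartitions(ds);          % default number of partitions
total = 0;                      % reduction variable shared by all iterations
parfor ii = 1:n
    subds = partition(ds,n,ii); % this iteration's share of the data
    partial = 0;
    while hasdata(subds)
        data = read(subds);
        partial = partial + sum(double(data));  % accumulate within the partition
    end
    total = total + partial;    % parfor combines the per-partition sums
end
```

Accumulating into a local variable first and updating the reduction variable once per partition keeps the loop body in the form that parfor recognizes as a reduction.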

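Process Datastore on Different Platforms

To process your datastore with parallel and distributed computing that involves different platforms, such as cloud or cluster machines, you must predefine the 'AlternateFileSystemRoots' parameter. For example, create a datastore on your local machine and analyze a small portion of the data. Then, scale up the analysis to the entire data set using Parallel Computing Toolbox and MATLAB Parallel Server.

Create a datastore using MyDatastorePar and assign a value to the 'AlternateFileSystemRoots' property. For details on the implementation of MyDatastorePar, see the example Build Datastore with Parallel Processing Support.

To set the value for the 'AlternateFileSystemRoots' property, identify the root paths for your data on the different platforms. The root paths differ based on the machine or file system. For example, suppose that you access your data using these root paths:

"Z:\DataSet" from a Windows® machine.
"/nfs-bldg001/DataSet" from a MATLAB Parallel Server Linux® cluster.

Then, associate these root paths by using the AlternateFileSystemRoots property.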

altRoots = ["Z:\DataSet","/nfs-bldg001/DataSet"];
ds = MyDatastorePar('Z:\DataSet',altRoots);

Analyze a small portion of the data on your local machine. For instance, get a partitioned subset of the data and clean it by removing any missing entries. Then, examine a plot of the variables.

tt = tall(partition(ds,100,1)); 
summary(tt); 
% analyze your data                        
tt = rmmissing(tt);               
plot(tt.MyVar1,tt.MyVar2)

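Scale up your analysis to the entire data set by using a MATLAB Parallel Server cluster (Linux cluster). For instance, start a worker pool using the cluster profile, and then perform the analysis on the entire data set by using parallel and distributed computing capabilities.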

parpool('MyMjsProfile') 
tt = tall(ds);          
summary(tt);
% analyze your data
tt = rmmissing(tt);               
plot(tt.MyVar1,tt.MyVar2)

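Tips

For your custom datastore implementation, a best practice is not to implement the numpartitions method. The class already provides numpartitions (see Methods), so a custom datastore only needs to define maxpartitions and partition.

Version History

Introduced in R2017b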

See Also

mapreduce | datastore | matlab.io.datastore.HadoopLocationBased | matlab.io.Datastore

Topics

Develop Custom Datastore
Tall Arrays for Out-of-Memory Data
Partition a Datastore in Parallel (Parallel Computing Toolbox)