matlab.io.datastore.HadoopFileBased Class

Namespace: matlab.io.datastore

(Not recommended) Add Hadoop file support to datastore

matlab.io.datastore.HadoopFileBased is not recommended. Use matlab.io.datastore.HadoopLocationBased instead.

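Description

matlab.io.datastore.HadoopFileBased is an abstract mixin class that adds Hadoop® support to a custom datastore.

To use this mixin class, your class must inherit from matlab.io.datastore.HadoopFileBased in addition to inheriting from the matlab.io.Datastore base class. Enter the following syntax as the first line of your class definition file: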

classdef MyDatastore < matlab.io.Datastore & ...
                             matlab.io.datastore.HadoopFileBased 
    ...
end

To add Hadoop support together with parallel processing support, use these lines in your class definition file:

classdef MyDatastore < matlab.io.Datastore & ...
                             matlab.io.datastore.Partitionable & ...
                             matlab.io.datastore.HadoopFileBased 
    ...
end

To add Hadoop support to your custom datastore, you must:

Inherit from the additional class matlab.io.datastore.HadoopFileBased.
Define the additional methods getLocation, initializeDatastore, and isfullfile.

For the steps and details of creating a custom datastore with Hadoop support, see Develop Custom Datastore.

Methods

`getLocation`	(Not recommended) Location of files in Hadoop
`initializeDatastore`	(Not recommended) Initialize datastore with information from Hadoop
`isfullfile`	(Not recommended) Check whether datastore reads full files
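
As a rough illustration of what isfullfile reports, the sketch below compares the FileSplitSize property of two DsFileSet objects against 'file', the same kind of comparison an isfullfile implementation can use. The scratch folder, the file name sample.bin, and the variable names are assumptions made only for this sketch.

```matlab
% Sketch only: create a scratch folder with one small .bin file so that
% DsFileSet has something to index. Folder and file names are made up.
folder = tempname;
mkdir(folder);
fid = fopen(fullfile(folder,'sample.bin'),'w');
fwrite(fid,uint8(1:100),'uint8');
fclose(fid);

% 'file' asks DsFileSet to treat each file as a single unit.
fsWhole = matlab.io.datastore.DsFileSet(folder, ...
    'FileExtensions','.bin','FileSplitSize','file');

% A numeric value asks for fixed-size splits, given in bytes.
fsSplit = matlab.io.datastore.DsFileSet(folder, ...
    'FileExtensions','.bin','FileSplitSize',8*1024);

% The comparison an isfullfile method can return.
readsWholeFiles = isequal(fsWhole.FileSplitSize,'file');
readsSplits = ~isequal(fsSplit.FileSplitSize,'file');

% Clean up the scratch folder.
rmdir(folder,'s');
```

A datastore built on fsWhole would answer true from isfullfile, while one built on fsSplit would answer false.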

Attributes

Sealed false

For information on class attributes, see Class Attributes.

Examples

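Create a Datastore with Hadoop Support

Implement a datastore with parallel processing and Hadoop support, and use it to bring data from a Hadoop server into MATLAB®. Then use the tall and gather functions on this data.

Create a new .m class definition file that contains the code implementing your custom datastore. You must save this file in your working folder or in a folder on the MATLAB path. The name of the .m file must match the name of your object constructor function. For example, if you want the constructor function to be named MyDatastoreHadoop, the class definition file must be named MyDatastoreHadoop.m. The .m class definition file must contain the following steps:

Step 1: Inherit from the datastore classes.
Step 2: Define the constructor and the required methods.
Step 3: Define your custom file reading function.

The following code shows the three steps in a sample implementation of a custom datastore that can read binary files from a Hadoop server.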

%% STEP 1: INHERIT FROM DATASTORE CLASSES
classdef MyDatastoreHadoop < matlab.io.Datastore & ...
        matlab.io.datastore.Partitionable & ...
        matlab.io.datastore.HadoopFileBased
    
    properties (Access = private)
        CurrentFileIndex double
        FileSet matlab.io.datastore.DsFileSet
    end

         
%% STEP 2: DEFINE THE CONSTRUCTOR AND THE REQUIRED METHODS
    methods
        % Define your datastore constructor
        function myds = MyDatastoreHadoop(location,altRoots)
            myds.FileSet = matlab.io.datastore.DsFileSet(location,...
                'FileExtensions','.bin', ...
                'FileSplitSize',8*1024);
            myds.CurrentFileIndex = 1;
             
            if nargin == 2
                 myds.AlternateFileSystemRoots = altRoots;
            end
            
            reset(myds);
        end
        
        % Define the hasdata method
        function tf = hasdata(myds)
            % Return true if more data is available
            tf = hasfile(myds.FileSet);
        end
        
        % Define the read method
        function [data,info] = read(myds)
            % Read data and information about the extracted data
            % See also: MyFileReader()
            if ~hasdata(myds)
                error(sprintf(['No more data to read.\nUse the reset ',... 
                     'method to reset the datastore to the start of ' ,...
                     'the data. \nBefore calling the read method, ',...
                     'check if data is available to read ',...
                     'by using the hasdata method.'])) 
            end
            
            fileInfoTbl = nextfile(myds.FileSet);
            data = MyFileReader(fileInfoTbl);
            info.Size = size(data);
            info.FileName = fileInfoTbl.FileName;
            info.Offset = fileInfoTbl.Offset;
            
            % Update CurrentFileIndex for tracking progress
            if fileInfoTbl.Offset + fileInfoTbl.SplitSize >= ...
                    fileInfoTbl.FileSize
                myds.CurrentFileIndex = myds.CurrentFileIndex + 1 ;
            end
        end
        
        % Define the reset method
        function reset(myds)
            % Reset to the start of the data
            reset(myds.FileSet);
            myds.CurrentFileIndex = 1;
        end
        
        
        % Define the partition method
        function subds = partition(myds,n,ii)
            subds = copy(myds);
            subds.FileSet = partition(myds.FileSet,n,ii);
            reset(subds);
        end
    end      

     
    methods (Hidden = true)   

        % Define the progress method
        function frac = progress(myds)
            % Determine the fraction (0 to 1) of data read from the datastore
            if hasdata(myds) 
               frac = (myds.CurrentFileIndex-1)/...
                             myds.FileSet.NumFiles; 
            else 
               frac = 1;  
            end 
        end
 
        % Define the initializeDatastore method
        function initializeDatastore(myds,hadoopInfo)
            import matlab.io.datastore.DsFileSet;
            myds.FileSet = DsFileSet(hadoopInfo,...
                'FileSplitSize',myds.FileSet.FileSplitSize,...
                'IncludeSubfolders',true, ...
                'FileExtensions','.bin');
            reset(myds);
        end
        
        % Define the getLocation method
        function loc = getLocation(myds)
            loc = myds.FileSet;
        end
        
        % Define the isfullfile method
        function tf = isfullfile(myds)
            % Return true when each file is read as a single unit
            tf = isequal(myds.FileSet.FileSplitSize,'file'); 
        end

    end
        
    methods (Access = protected)
        % If you use the  FileSet property in the datastore,
        % then you must define the copyElement method. The
        % copyElement method allows methods such as readall
        % and preview to remain stateless 
        function dscopy = copyElement(ds)
            dscopy = copyElement@matlab.mixin.Copyable(ds);
            dscopy.FileSet = copy(ds.FileSet);
        end
        
        % Define the maxpartitions method
        function n = maxpartitions(myds)
            n = maxpartitions(myds.FileSet);
        end
    end
end

%% STEP 3: IMPLEMENT YOUR CUSTOM FILE READING FUNCTION
function data = MyFileReader(fileInfoTbl)
% create a reader object using FileName
reader = matlab.io.datastore.DsFileReader(fileInfoTbl.FileName);

% seek to the offset
seek(reader,fileInfoTbl.Offset,'Origin','start-of-file');

% read fileInfoTbl.SplitSize amount of data
data = read(reader,fileInfoTbl.SplitSize);
end

These steps complete the implementation of your custom datastore.
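
To see how the FileSplitSize value used in the constructor shapes what each read call receives, the following sketch iterates over a DsFileSet directly, outside of any datastore class. The scratch folder, the 20 KiB test file sample.bin, and the variable names are assumptions made only for this illustration; no Hadoop installation is involved.

```matlab
% Sketch only: write a 20 KiB binary test file into a fresh scratch folder.
folder = tempname;
mkdir(folder);
fid = fopen(fullfile(folder,'sample.bin'),'w');
fwrite(fid,zeros(20*1024,1,'uint8'),'uint8');
fclose(fid);

% Same name-value options as the MyDatastoreHadoop constructor.
fs = matlab.io.datastore.DsFileSet(folder, ...
    'FileExtensions','.bin', ...
    'FileSplitSize',8*1024);

% Walk the splits the same way the read method does with nextfile.
offsets = [];
sizes = [];
while hasfile(fs)
    info = nextfile(fs);              % one-row table describing a split
    offsets(end+1) = info.Offset;     %#ok<SAGROW>
    sizes(end+1) = info.SplitSize;    %#ok<SAGROW>
end

% Show the byte offset and size of each split.
disp(table(offsets(:),sizes(:),'VariableNames',{'Offset','SplitSize'}));

% Clean up the scratch folder.
rmdir(folder,'s');
```

Each row corresponds to one call to read in the datastore above, which is why the read method compares Offset + SplitSize against FileSize to decide when a file is finished.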

Next, create a datastore object using your custom datastore constructor. If your data is located at hdfs:///path_to_files, you can use this code:

setenv('HADOOP_HOME','/path/to/hadoop/install');
ds = MyDatastoreHadoop('hdfs:///path_to_files');

To use tall arrays and the gather function on Apache® Spark™ with a parallel cluster configuration, set up mapreducer and attach MyDatastoreHadoop.m to the cluster.

mr = mapreducer(cluster);
mr.Cluster.AttachedFiles = 'MyDatastoreHadoop.m';

Create a tall array from the datastore.

t = tall(ds);

Gather the head of the tall array.

hd = gather(head(t));

Version History

Introduced in R2017b

See Also

mapreduce | matlab.io.datastore.Partitionable | matlab.io.Datastore | matlab.io.datastore.DsFileSet | tall

Topics

Add Support for Hadoop
Use Tall Arrays on a Spark Cluster (Parallel Computing Toolbox)
Big Data Workflow Using Tall Arrays and Datastores (Parallel Computing Toolbox)