fastBERTtokens: Tokenizing for BERT in parallel

This function divides your text into batches and tokenizes them in parallel, which provides a significant speed-up.

Ralf Elsas

Version 1.0.0 (1.43 KB)

Downloads: 23

2023/2/24

Function to use the MATLAB BERT tokenizer in parallel

This function divides your text into batches and tokenizes them in parallel. Because the MATLAB tokenizer is very slow on large data when run on a single processor, this provides a significant speed-up. On an i7-10875H laptop with 8 logical units, tokenizing 76k sentences takes about 100 seconds.

Note that you must pass in the MATLAB BERT model, because different BERT models use different encodings for the special BERT tokens such as [SEP].
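The batching idea described above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in this submission: the function name `parallelTokenizeSketch`, its arguments, and the batch-splitting rule are all hypothetical, and the tokenizer is passed in as a function handle so that the sketch makes no claim about any particular tokenizer API. `parfor` requires Parallel Computing Toolbox; without it the loop runs serially.

```matlab
function tokens = parallelTokenizeSketch(textData, tokenizeFcn, numBatches)
% Hypothetical sketch of batch-parallel tokenization (not the submission's code).
%   textData    : string array, one sentence per element
%   tokenizeFcn : function handle that maps a string array to a cell array
%                 with one token sequence per input string (assumption)
%   numBatches  : number of batches to split the text into

textData = textData(:);                        % force a column
n = numel(textData);
numBatches = max(1, min(numBatches, n));       % at most one batch per sentence

% Split the indices 1..n into contiguous, nearly equal batches.
edges = round(linspace(0, n, numBatches + 1));
batches = cell(numBatches, 1);
for k = 1:numBatches
    batches{k} = textData(edges(k) + 1 : edges(k + 1));
end

% Tokenize each batch on a separate worker.
results = cell(numBatches, 1);
parfor k = 1:numBatches
    out = tokenizeFcn(batches{k});
    results{k} = out(:);                       % keep results as column cells
end

% Concatenate in batch order so the output lines up with the input.
tokens = vertcat(results{:});
end
```

With the Transformer Models package on the path, a call might look like `mdl = bert; tokens = parallelTokenizeSketch(sentences, @(s) encode(mdl.Tokenizer, s), 8);`. Here `bert` and the tokenizer's `encode` method are assumed from that package rather than taken from this page. The handle captures the tokenizer of the specific model, which is why the special-token encodings stay consistent with that model.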

Cite As

Ralf Elsas (2026). fastBERTtokens: Tokenizing for BERT in parallel (https://jp.mathworks.com/matlabcentral/fileexchange/125295-fastberttokens-tokenizing-for-bert-in-parallel), MATLAB Central File Exchange. Retrieved August 1, 2026.

Acknowledgements

Inspired by: Transformer Models

MATLAB Release Compatibility

Compatible with R2021a and later releases

Platform Compatibility

Windows
macOS
Linux

Version	Published	Release Notes
1.0.0	2023/2/24	