Cosine Similarity using BERT
I am using BERT to calculate similarities for question answering. I have encoded my question data using
data.Tokens = encode(mdl.Tokenizer,data.Questions), which returns a cell array.
Next, I encoded new text to test its similarity with the already encoded questions in the database: testTokens = encode(mdl.Tokenizer,text)
However, I am unable to call cosineSimilarity(data.Tokens,testTokens); I receive this error:
Input must be a matrix, a tokenizedDocument array, a bagOfWords model, a bagOfNgrams model, a string array of words, or a cell array of character vectors.
Do I need to pad or reshape the vectors in my cell array?
Accepted Answer
Divyam Gupta
30 Jun 2021
Hi Nicholas, I understand that you are running into an error while computing cosine similarity on the output of a text encoder. According to the documentation at https://www.mathworks.com/help/textanalytics/ref/cosinesimilarity.html#d123e8335, the cosineSimilarity function takes a matrix to compute the similarity between two documents, so it cannot be given a cell array of numeric vectors directly.
Because the encoded vectors have a different length for each question, constructing a matrix might be difficult. Instead, you can do a pairwise comparison between data.Tokens and testTokens: run a nested loop over the two cell arrays and store each similarity score as you go.
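The nested loop described above could look roughly like the sketch below. It is only an illustration under stated assumptions: dataTokens and testTokens are made-up stand-ins for data.Tokens and testTokens, and zero-padding each pair to a common length is my assumption for making the vectors comparable, not something the documentation prescribes. The similarity itself is computed with base MATLAB functions (dot and norm).

```matlab
% Hypothetical stand-ins for data.Tokens and testTokens: cell arrays of
% token-code vectors with differing lengths (the values are made up).
dataTokens = {[101 2054 2003 102], [101 2129 2079 1045 102], [101 2073 102]};
testTokens = {[101 2054 2003 102]};

% Assumption: zero-pad every vector to the longest length so that any
% two vectors can be compared element by element.
allTokens = [dataTokens(:); testTokens(:)];
maxLen = max(cellfun(@numel, allTokens));
padRow = @(v) [double(v(:).'), zeros(1, maxLen - numel(v))];

numData = numel(dataTokens);
numTest = numel(testTokens);
scores = zeros(numData, numTest);   % scores(i,j): question i vs. test text j

for i = 1:numData
    a = padRow(dataTokens{i});
    for j = 1:numTest
        b = padRow(testTokens{j});
        % Cosine similarity: dot product divided by the product of norms.
        scores(i,j) = dot(a, b) / (norm(a) * norm(b));
    end
end

% For each test text, pick the stored question with the highest score.
[bestScore, bestIdx] = max(scores, [], 1);
```

Once all rows are padded to the same length, they could also be stacked into matrices and passed to cosineSimilarity(M1,M2) instead of looping, since that function compares the rows of two matrices.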
Hope this helps.