
How to find the similarity between two text documents

Jothi on 19 Dec 2012
Commented: info info on 20 Mar 2020
I have two text documents.
For example, a.txt contains 'Hai How R U'
and b.txt contains 'Hai How are U'.
How can I calculate the cosine similarity or the Euclidean distance between these two documents (text files)?
Thanks in advance.
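One possible way to approach this, as a minimal sketch in Python rather than MATLAB: turn each text into a word-count vector over the combined vocabulary of both texts, then compare the two vectors. The lowercase word tokenizer and the helper names (tokenize, count_vectors, and so on) are assumptions made for illustration, and the two strings stand in for the contents of a.txt and b.txt.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_vectors(text_a, text_b):
    """Build two word-count vectors over the shared vocabulary of both texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    vocabulary = sorted(set(counts_a) | set(counts_b))
    vec_a = [counts_a[word] for word in vocabulary]
    vec_b = [counts_b[word] for word in vocabulary]
    return vocabulary, vec_a, vec_b


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(vec_a, vec_b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(vec_a, vec_b)))


# Stand-ins for the contents of a.txt and b.txt from the question.
doc_a = "Hai How R U"
doc_b = "Hai How are U"

vocabulary, vec_a, vec_b = count_vectors(doc_a, doc_b)
print(vocabulary)                         # ['are', 'hai', 'how', 'r', 'u']
print(cosine_similarity(vec_a, vec_b))    # 0.75
print(euclidean_distance(vec_a, vec_b))   # about 1.414
```

With these word counts, "R" and "are" are treated as unrelated words, so the two texts share three of their four words. Reading the real files would replace the two string literals with the file contents.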
  2 Comments
Jan on 19 Dec 2012
The Euclidean distance requires vectors of the same size. There are different edit distances, but I do not know the cosine distance. Perhaps it would be better for you to explain the details than for us to search Wikipedia.
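As an illustration of one such edit distance, here is a minimal Python sketch of the Levenshtein distance, which works directly on strings of different lengths. The choice of Levenshtein (rather than another edit distance) and the function name are assumptions; the comment above does not name a specific one.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


# The two example texts from the question (leading space omitted).
doc_a = "Hai How R U"
doc_b = "Hai How are U"

print(levenshtein(doc_a, doc_b))                  # 3 (case-sensitive: 'R' differs from 'r')
print(levenshtein(doc_a.lower(), doc_b.lower()))  # 2 (only 'a' and 'e' are inserted)
```

A smaller value means more similar texts. Unlike the Euclidean distance, no vectors of equal size are needed.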
info info on 20 Mar 2020
I think the best way to measure text similarity is "shingling".
Shingling is a common technique for representing documents as sets. Given a document, its k-shingles are all the possible consecutive substrings of length k found within it (in the example here, the units are words rather than characters). An example with k = 3 is given below:
## $Original
## [1] "The sky is blue and the sun is bright."
##
## $Shingled
## [1] "the sky is" "sky is blue" "is blue and" "blue and the"
## [5] "and the sun" "the sun is" "sun is bright"
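The shingled output above can be reproduced with a short Python sketch. The lowercase word tokenizer and the name word_shingles are assumptions for illustration; the output shown above comes from R, not from this code.

```python
import re


def word_shingles(text, k=3):
    """Return every run of k consecutive words in text, in order of appearance."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # A text with fewer than k words produces an empty list.
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


shingles = word_shingles("The sky is blue and the sun is bright.")
for shingle in shingles:
    print(shingle)
```

For the nine-word example sentence this yields the same seven shingles as listed above, from "the sky is" to "sun is bright".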
Then we check which shingles occur in each of our texts:
##                 doc_1 doc_2 doc_3
## the sky is         1     1     1
## sky is blue        1     0     1
## is blue and        1     0     0
## blue and the       1     0     0
## and the sun        1     0     0
## the sun is         1     0     0
## sun is bright      1     0     1
## the sun in         0     1     0
## sun in the         0     1     0
## in the sky         0     1     0
## sky is bright      0     1     0
## we can see         0     0     1
## can see sun        0     0     1
## see sun is         0     0     1
## is bright the      0     0     1
## bright the sky     0     0     1
Then calculate the similarity for each pair of documents and take the largest value.
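The comment does not say which measure to calculate. A common choice for shingle sets is the Jaccard similarity (size of the intersection divided by size of the union), so the sketch below assumes that. The shingle sets are copied from the columns of the table above; the names jaccard, scores and best_pair are hypothetical.

```python
from itertools import combinations

# Shingle sets read off the table above: a shingle belongs to a
# document when its column holds a 1.
shingle_sets = {
    "doc_1": {"the sky is", "sky is blue", "is blue and", "blue and the",
              "and the sun", "the sun is", "sun is bright"},
    "doc_2": {"the sky is", "the sun in", "sun in the", "in the sky",
              "sky is bright"},
    "doc_3": {"the sky is", "sky is blue", "sun is bright", "we can see",
              "can see sun", "see sun is", "is bright the", "bright the sky"},
}


def jaccard(set_a, set_b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Score every pair of documents, then pick the pair with the largest score.
scores = {
    (name_a, name_b): jaccard(shingle_sets[name_a], shingle_sets[name_b])
    for name_a, name_b in combinations(shingle_sets, 2)
}
best_pair = max(scores, key=scores.get)

for pair, score in scores.items():
    print(pair, round(score, 3))
print("most similar:", best_pair)   # ('doc_1', 'doc_3') with a score of 0.25
```

Under this assumption doc_1 and doc_3 are the most similar pair: they share 3 of 12 distinct shingles, while each of the other two pairs shares only one.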


Answers (1)

Jan on 19 Dec 2012
