PHP实现Levenshtein Distance算法

    技术2022-05-19  24

    Levenshtein Distance算法可用于计算2个文本的近似度,PHP提供了levenshtein的字符串函数,但在实际使用过程中发现,该实现有2个缺点:

     

    1.不支持多字节编码

    2.字符串不能超过255个字节

     

    自己实现如下:

     

    <?php function levenshtein2($str1, $str2) { mb_internal_encoding('UTF-8'); $len1 = mb_strlen($str1); $len2 = mb_strlen($str2); // strip the common prefix $i = 0; while($len1 > 0 && $len2 > 0) { if(mb_substr($str1, $i, 1) != mb_substr($str2, $i, 1)) { break; } $i++; $len1--; $len2--; } if ($i > 0) { $str1 = mb_substr($str1, $i); $str2 = mb_substr($str2, $i); } // strip the common surfix while($len1 > 0 && $len2 > 0) { if (mb_substr($str1, $len1 - 1, 1) != mb_substr($str2, $len2 - 1, 1)) { break; } $i++; $len1--; $len2--; } if ($i++) { $str1 = mb_substr($str1, 0, $len1); $str2 = mb_substr($str1, 0, $len2); } if ($len1 == 0 ) return $len2; if ($len2 == 0 ) return $len1; for($i = 0; $i <= $len1; $i++) { $distance[0][$i] = $i; } for($i = 0; $i <= $len2; $i++) { $distance[$i][0] = $i; } for($i = 1; $i <= $len1; $i++) { $chr1 = mb_substr($str1, $i - 1, 1); for($j = 1; $j <= $len2; $j++) { $chr2 = mb_substr($str2, $j - 1, 1); if ($chr1 === $chr2) { $cost = 0; } else { $cost = 1; } $dis1 = $distance[$j][$i - 1] + 1; $dis2 = $distance[$j - 1][$i] + 1; $dis3 = $distance[$j - 1][$i - 1] + $cost; $distance[$j][$i] = min(array($dis1, $dis2, $dis3)); } } return $distance[$len2][$len1]; }

     

    在这个基础上,我们就可以很容易计算2篇文章的近似度了

     

    <?php function getSimilar($str1, $str2) { mb_internal_encoding('UTF-8'); $len1 = mb_strlen($str1); $len2 = mb_strlen($str2); $lev = levenshtein2($str1, $str2); $rate = (float)$lev/max(array($len1, $len2)); printf('The similarity rate is %.2f%%', (1 - $rate) * 100); }

     

    参考:

    http://www.merriampark.com/ld.htm

    http://cn2.php.net/manual/en/function.levenshtein.php#101084


    最新回复(0)