|
 |
levenshtein (PHP 3 >= 3.0.17, PHP 4 >= 4.0.1, PHP 5) levenshtein --
Вычисляет расстояние Левенштейна между двумя строками
Описаниеint levenshtein ( string str1, string str2 ) int levenshtein ( string str1, string str2, int cost_ins, int cost_rep, int cost_del ) int levenshtein ( string str1, string str2, function cost )
Функция возвращает расстояние Левенштейна между двумя строками, или
-1, если хотя бы одна из строк длиннее 255 символов (этого более чем
достаточно для сравнения имен или поиска по словарю, а проводить
генетический анализ на PHP просто несерьезно).
Расстояние Левенштейна - это минимальное количество вставок, замен и
удалений символов, необходимое для преобразования
str1 в str2.
Сложность алгоритма равна O(m*n),
где n и m - длины строк
str1 и str2 (неплохо по
сравнению с similar_text(), имеющей сложность O(max(n,m)**3),
но все же довольно много).
В простейшей форме функция принимает в качестве аргументов две строки
и возвращает минимальное количество вставок, замен и
удалений символов, необходимое для преобразования
str1 в str2.
Второй вариант принимает три дополнительных аргумента, задающих
стоимость операций вставки, замены и удаления. Этот вариант
универсальнее первого, но не так эффективен.
Третий вариант (который еще не реализован) будет наиболее
универсальным, но и самым медленным. Он будет принимать в качестве
третьего аргумента пользовательскую функцию, которая будет вычислять
стоимость каждой возможной операции.
Пользовательская функция будет иметь следующие аргументы:
тип операции: 'I', 'R' or 'D'
текущий символ в строке 1
текущий символ в строке 2
текущая позиция символа в строке 1
текущая позиция символа в строке 2
количество символов, оставшихся в строке 1
количество символов, оставшихся в строке 2
Пользовательская функция должна возвращать положительное целое,
определяющее стоимость конкретной операции.
Использование пользовательской функции позволяет учитывать различия
между символами и даже контекст символов при вычислении стоимости
операций вставки, замены и удаления, но ценой потери скорости по
сравнению с двумя первыми вариантами.
См. также описание функций soundex(),
similar_text() и
metaphone().
levenshtein
dinesh AT dinsoft DOT net
17-Mar-2006 03:18
Here is a string resynch function:
<?php
function str_resynch($a,$b,$l=32,$s=2048) {
$r=array();
for($i=0,$c=strlen($a),$cc=strlen($b),$ii=0,$z=$s-1,$z2=($z<<1)+1; $i<$c; $i++) {
$d=$i-$z;
$d=($d<$ii)?substr($b,$ii,$z2-$ii+$d):substr($b,$d,$z2);
$p=strpos($d,$a{$i});
$n=0;
while ($p!==FALSE) {
$m=1;
$bi=$i;
$bp=$p;
$p+=$ii;
while ((++$i<$c) && (++$p<$cc)) {
if ($a{$i}!=$b{$p}) break;
$m++;
}
if ($m<$l) {
$i=$bi;
$n=$bp+1;
$p=@strpos($d,$a{$i},$n);
}
else {
$i--;
$r[]=array($bi,$bp+$ii,$m); $ii=$p;
break;
}
}
}
if (!count($r)) return ($cc)?array('/',0,$c,0,$cc):array(array('+',0,$c,0,0));
$o=array();
$bi=0;
$bp=0;
for($i=0,$m=count($r);$i<$m;$i++) {
if ($r[$i][0]!=$bi) {
if ($r[$i][1]!=$bp) {
$o[]=array('/',$bi,$r[$i][0]-$bi,$bp,$r[$i][1]-$bp);
$bi=$r[$i][0];
$bp=$r[$i][1];
}
else {
$o[]=array('+',$bi,$r[$i][0]-$bi,$bp,0);
$bi=$r[$i][0];
}
}
elseif ($r[$i][1]!=$bp) {
$o[]=array('-',$bi,0,$bp,$r[$i][1]-$bp);
$bp=$r[$i][1];
}
$o[]=array('=',$r[$i][0],$r[$i][2],$r[$i][1],$r[$i][2]);
$bi+=$r[$i][2];
$bp+=$r[$i][2];
}
if ($c!=$bi) {
if ($cc!=$bp) $o[]=array('/',$bi,$c-$bi,$bp,$cc-$bp);
else $o[]=array('+',$bi,$c-$bi,$bp,0);
}
elseif ($cc!=$bp) $o[]=array('-',$bi,0,$bp,$cc-$bp);
return $o;
}
?>
june05 at tilo-hauke dot de
06-Jun-2005 12:44
//levenshtein for arrays
function array_levenshtein($array1,$array2)
{ $aliases= array_flip(array_values(array_unique(array_merge($array1,$array2))));
if(count($aliases)>255) return -1;
$stringA=''; $stringB='';
foreach($array1 as $entry) $stringA.=chr($aliases[$entry]);
foreach($array2 as $entry) $stringB.=chr($aliases[$entry]);
return levenshtein($stringA,$stringB);
}
// e.g. use array_levenshtein to detect special expressions in unser-inputs
echo array_levenshtein(split(" ", "my name is xxx"), split(" ","my name is levenshtein"));
//output: 1
justin at visunet dot ie
05-Apr-2005 07:46
<?php
function btlfsa($word1,$word2)
{
$score = 0;
$remainder = preg_replace("/[".preg_replace("/[^A-Za-z0-9\']/",' ',$word1)."]/i",'',$word2);
$remainder .= preg_replace("/[".preg_replace("/[^A-Za-z0-9\']/",' ',$word2)."]/i",'',$word1);
$score = strlen($remainder)*2;
$w1_len = strlen($word1);
$w2_len = strlen($word2);
$score += $w1_len > $w2_len ? $w1_len - $w2_len : $w2_len - $w1_len;
$w1 = $w1_len > $w2_len ? $word1 : $word2;
$w2 = $w1_len > $w2_len ? $word2 : $word1;
for($i=0; $i < strlen($w1); $i++)
{
if ( !isset($w2[$i]) || $w1[$i] != $w2[$i] )
{
$score++;
}
}
return $score;
}
$misspelled = 'haert';
$suggestions = array('herat', 'haart', 'heart', 'harte');
$levenshtein_ordered = array();
foreach ( $suggestions as $suggestion )
{
$levenshtein_ordered[$suggestion] = levenshtein($misspelled,$suggestion);
}
asort($levenshtein_ordered, SORT_NUMERIC );
print "<b>Suggestions as ordered by levenshtein...</b><ul><pre>";
print_r($levenshtein_ordered);
print "</pre></ul>";
$btlfsa_ordered = array();
foreach ( $suggestions as $suggestion )
{
$btlfsa_ordered[$suggestion] = btlfsa($misspelled,$suggestion);
}
asort($btlfsa_ordered, SORT_NUMERIC );
print "<b>Suggestions as ordered by btlfsa...</b><ul><pre>";
print_r($btlfsa_ordered);
print "</pre></ul>";
?>
mcreuzer at r-world dot com
07-Mar-2005 09:01
I am using the Levenshtein distance to SORT my search results.
I have a search page for peoples names. I do a SOUNDEX() search on the name in mysql. MySQL SOUNDEX() will perform the "fuzzy" search for me.
I then calculate the Levenshtein distance between the search term and the actual name found by the SOUNDEX() search. This will give me a score on how close my results are to the search string.
I can the sort my results for display listing the closest results first.
<?php
usort($searchresults, "finallevenshteinsortfunction");
function finallevenshteinsortfunction($a, $b)
{
if(($a['levenshtein'] > $b['levenshtein']) || ( $a['levenshtein'] == $b['levenshtein'] && strnatcasecmp( $a['Last_Name'], $b['Last_Name']) >= 1) ){ return $a['levenshtein'];} elseif($a['levenshtein'] == $b['levenshtein']){ return '0';} elseif($a['levenshtein'] < $b['levenshtein']){ return -$a['levenshtein'];}
else{die("<!-- a horrable death -->");}
}
?>
silas at silaspalmer dot com
28-Feb-2004 02:52
Some code for a basic fuzzy search.
<?php
$needle = "Youser Nein stitched tymes hand";
$haystack = "
Did you know that a stitch in time can save nine. Isn't that amazing! I think so
The user-supplied function has to return the loser, from the sand. Describe the cost for this particular operation.
Eweser may decide to use only some of the supplied nans. And pay attention. This isn't hard...
";
$hwords = preg_split("/[\s\W]+/", $haystack);
$nwords = preg_split("/[\s\W]+/", $needle);
echo "You searched for $needle<br>";
echo "I found...<br>";
foreach ($hwords as $hkey => $hayword) {
$hmp = metaphone ($hayword);
foreach ($nwords as $nkey => $needword) {
$nfirst = strtolower(substr($needword, 0, 1));
$nlast = strtolower(substr($needword, -1));
$hfirst = strtolower(substr($hayword, 0, 1));
$hlast = strtolower(substr($hayword, -1));
if (($hfirst == $nfirst) or ($hlast == $nlast)) {
$nmp = metaphone ($needword);
$distance = levenshtein ($hmp, $nmp);
$n_len = strlen($nmp);
$per = round(($distance/$n_len)*1000);
if ($per < 335) {
$haystack = str_replace($hayword, "<b>$hayword</b>", $haystack);
$haystack = str_replace("<b><b>", "<b>", $haystack);
$haystack = str_replace("</b></b>", "</b>", $haystack);
}
}
}
}
echo $haystack;
?>
gzink at zinkconsulting dot com
03-Dec-2003 03:03
Try combining this with metaphone() for a truly amazing fuzzy search function. Play with it a bit, the results can be plain scary (users thinking the computer is almost telepathic) when implemented properly. I wish spell checkers worked as well as the code I've written.
I would release my complete code if reasonable, but it's not, due to copyright issues. I just hope that somebody can learn from this little tip!
genialbrainmachine at dot IHATESPAM dot tiscali dot it
26-Oct-2003 11:55
I wrote this function to have an "intelligent" comparison between data to be written in a DB
and already existent data. Not ony calculating distances but also balancing distances for
each field.
<?php
function search_similar($record, $weights, $compared, $precision=2) {
$field_names = array_keys($record);
foreach ($field_names as $field_key) {
$record_weight += strlen($record[$field_key]) * $weights[$field_key];
$weighted_distance += levenshtein($record[$field_key],$compared[$field_key]) * $weights[$field_key];
}
if ($record_weight) {
return round(($weighted_distance / $record_weight * 100),$precision);
} elseif ((strlen(implode("",$record)) == 0) && (strlen(implode("",$compared)) == 0)) { return round(0,$precision);
} elseif (array_sum($weights) == 0) { return round(0,$precision);
} else {
return false;
}
}
?>
jlsalinas at gmx dot net
25-Oct-2003 07:28
Regarding the post by fgilles on April 26th 2001, I suggest not to use levenshtein() function to test for over-uppercasing unless you've got plenty of time to waste in your host. ;) Anyhow, I think it's a useful feature, as I get really annoyed when reading whole messages in uppercase.
PHP's levenshtein() function can only handle up to 255 characters, which is not realistic for user input (only the first paragraph oh this post has 285 characters). If you choose to use a custom function able to handle more than 255 characters, efficiency is an important issue.
I use this function, specific for this case, but much faster:
function ucase_percent ($str) {
$str2 = strtolower ($str);
$l = strlen ($str);
$ucase = 0;
for ($i = 0; $i < $l; $i++) {
if ($str{$i} != $str2{$i}) {
$ucase++;
}
}
return $ucase / $l * 100.0;
}
I think 10% is enough for written English (maybe other languages like German, which use more capital letters, need more). With some sentencies in uppercase (everybody has the right to shout occasionally), 20% would be enough; so I use a threshold of 30%. When exceeded, I lowercase the whole message.
Hope you find it useful and it helps keeping the web free of ill-mannered people.
"inerte" is my hotmail.com username
11-Jul-2003 10:22
I am using this function to avoid duplicate information on my client's database.
After retrieving a series of rows and assigning the results to an array values, I loop it with foreach comparing its levenshtein() with the user supplied string.
It helps to avoid people re-registering "John Smith", "Jon Smith" or "Jon Smit".
Of course, I can't block the operation if the user really wants to, but a suggestion is displayed along the lines of: "There's a similar client with this name.", followed by the list of the similar strings.
asseg at ukl dot uni-feiburg dot de
09-May-2003 06:05
for the unserious guys how might do dna sequence analysis
with php. Here's the function that can handle long strings.
to get an output of the matrix uncomment the
printf-section. handle with care iteration are approximately
2x2^strlen ;)
------------------------------------------------------------
function levdis($s,$t){
$n=strlen($s);
$m=strlen($t);
$matrix=array(range(0,$n+1),range(0,$m+1));
$ret=0;
if ($n==0){
return $m;
}
elseif ($m==0){
return $n;
}
for ($i=0;$i<=$n;$i++) {
$matrix[$i][0]=$i;
}
for ($j=0;$j<=$m;$j++) {
$matrix[0][$j]=$j;
}
for ($i=1;$i<=$n;$i++) {
for ($j=1;$j<=$m;$j++) {
if ($s[$i-1]==$t[$j-1]) {
$cost=0;
}else{
$cost=1;
}
$matrix[$i][$j]=min($matrix[$i-1][$j]+1,
$matrix[$i][$j-1]+1,
$matrix[$i-1][$j-1]+$cost);
}
}
/* for ($j=0;$j<=$m;$j++) {
for ($i=0;$i<=$n;$i++) {
printf(" %02d",$matrix[$i][$j]);
}
echo "\n";
}*/
return $matrix[$n][$m];
}
----------------------------------------------
bisqwit at iki dot fi
12-Aug-2002 07:43
At the time of this manual note the user defined thing
in levenshtein() is not implemented yet. I wanted something
like that, so I wrote my own function. Note that this
doesn't return levenshtein() difference, but instead
an array of operations to transform a string to another.
Please note that the difference finding part (resync)
may be extremely slow on long strings.
<?php
function matchlen(&$a, &$b)
{
$c=0;
$alen = strlen($a);
$blen = strlen($b);
$d = min($alen, $blen);
while($a[$c] == $b[$c] && $c < $d)
$c++;
return $c;
}
function calcdiffer($a, $b)
{
$alen = strlen($a);
$blen = strlen($b);
$aptr = 0;
$bptr = 0;
$ops = array();
while($aptr < $alen && $bptr < $blen)
{
$matchlen = matchlen(substr($a, $aptr), substr($b, $bptr));
if($matchlen)
{
$ops[] = array('=', substr($a, $aptr, $matchlen));
$aptr += $matchlen;
$bptr += $matchlen;
continue;
}
$bestlen=0;
$bestpos=array(0,0);
for($atmp = $aptr; $atmp < $alen; $atmp++)
{
for($btmp = $bptr; $btmp < $blen; $btmp++)
{
$matchlen = matchlen(substr($a, $atmp), substr($b, $btmp));
if($matchlen>$bestlen)
{
$bestlen=$matchlen;
$bestpos=array($atmp,$btmp);
}
if($matchlen >= $blen-$btmp)break;
}
}
if(!$bestlen)break;
$adifflen = $bestpos[0] - $aptr;
$bdifflen = $bestpos[1] - $bptr;
if($adifflen)
{
$ops[] = array('-', substr($a, $aptr, $adifflen));
$aptr += $adifflen;
}
if($bdifflen)
{
$ops[] = array('+', substr($b, $bptr, $bdifflen));
$bptr += $bdifflen;
}
$ops[] = array('=', substr($a, $aptr, $bestlen));
$aptr += $bestlen;
$bptr += $bestlen;
}
if($aptr < $alen)
{
$ops[] = array('-', substr($a, $aptr));
}
if($bptr < $blen)
{
$ops[] = array('+', substr($b, $bptr));
}
return $ops;
}
Example:
$tab = calcdiffer('Tm on jonkinlainen testi',
'Tm ei ole minknlainen testi.');
$ops = array('='=>'Ok', '-'=>'Remove', '+'=>'Add');
foreach($tab as $k)
echo $ops[$k[0]], " '", $k[1], "'\n";
Example output:
Ok 'Tm '
Remove 'on jonki'
Add 'ei ole mink'
Ok 'nlainen testi'
Add '.'
11-Apr-2002 06:57
For spell checking applications, delay could be tolerable if you assume the typist got the first two or three chars of each word right. Then you'd only need to calc distances for a small segment of the dictionary. This is a compromise but one I think a lot of spell checkers make.
For an example of site search using this function look at the PHP manual search button on this page. It appears to be doing this for the PHP function list.
fgilles at free dot fr
26-Apr-2001 01:32
exemple of use for a forum: users can't post messages too much uppercased
if ((strlen($subject)>10) and ( ( levenshtein ($subject, strtolower ($subject) / strlen ($subject) ) > .3 ) ){
$subject = strtolower($subject);
}
hholzgra at php dot net
11-Aug-2000 01:20
well, think of it as a sort of typo counter
by the way we might introduce a feature similar to something that the jikes compiler for java does using this function:
every time you try to access an undefined funtion you will get a suggestion of a function that's named
similar to what you requested
e.g.:
Undefined function 'rest()'!
Where you looking for 'reset()' instead?
dschultz at protonic dot com
10-Aug-2000 09:01
It's also useful if you want to make some sort of registration page and you want to make sure that people who register don't pick usernames that are very similar to their passwords.
Mathijs Brands <mathijs at ilse dot nl>
06-Aug-2000 05:54
This could come in handy if you were to implement a simple searchengine in PHP. It would allow for simple fuzzy searching.
ad1n at dc dot uba dot ar
04-Aug-2000 05:01
One application of this is when you want to look for a similar match instead of an exact one. You can sort the results of checking the distances of a word to a dictionary and sort them to see which were the more similar ones. Of course it will be a quite resourse consuming task anyway.
| |