Case- and accent-insensitive regular expression?

Discussion:

Giulio Mastrosanti

2008-07-12 07:36:50 UTC

Hi,
I have a PHP page that asks the user for a key ( or a list of keys ) and
then shows a list of items matching the query.

Every item in the list shows its data, and the list of keys it has ( a
list of comma-separated words ).

I would like to highlight, in the list of keys shown for every item,
the words matching the query.

This can be easily achieved with a search and replace: for every
search word, I search for it in the key list and replace it, adding a
style tag to highlight it, for example to show it in red:

if ( @stripos($keylist, $keysearch) !== false ) {
$keylist = str_ireplace($keysearch, '<span style="color: #FF0000">'.
$keysearch.'</span>', $keylist);
}

But I have a problem with accented characters:

I have MySQL with character encoding utf8, and all the PHP pages are
declared as utf8.

MySQL is configured to perform queries in a case- and accent-insensitive
way.
This means that if you search for the word 'cafe', the returned rows
include those whose keyword list contains 'cafe', but also 'café' with
the accent. ( I think it has to do with 'collation' settings, but I'm
not investigating at the moment because the way it works is OK for
me ).

Now my problem is to find a way ( I imagine with some kind of regular
expression ) to do an accent-insensitive search and replace in PHP,
so that I can find the word 'cafe' in a string even if it
is 'café', or 'CAFÉ', or 'CAFE', and vice versa.

hope the problem is clear and well-explained in english,

thank you for any tip,

Giulio

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

tedd

2008-07-12 14:29:22 UTC

Giulio:

Three things:

1. Your English is fine.

2. Try using mb_ereg_replace()

http://www.php.net/mb_ereg_replace

Place the accents you want to change in that and
change them to whatever you want.

3. Change:

'<span style="color: #FF0000">'.$keysearch.'</span>'

to

'<span class="keysearch">'.$keysearch.'</span>'

and add

.keysearch
{
color: #FF0000;
}

to your css.

Cheers,

tedd

--
-------
http://sperling.com http://ancientstones.com http://earthstones.com

Andrew Ballard

2008-07-13 12:31:28 UTC

I may be mistaken (and if I am, then just ignore this as ignorant
rambling), but I don't think he's wanting to replace the accented
characters in the original string. I think he's just wanting the
pattern to find all variations of the same string and highlight them
without changing them. For example, his last paragraph would look like
this:

[quote]
now my problem is to find a way ( I imagine with some kind of regular
expression ) to achieve in php a search and replace
accent-insensitive, so that i can find the word 'cafe' in a string also if it is 'café', or 'CAFÉ', or 'CAFE', and vice-versa.
[/quote]

The best I can think of right now is something like this:

<?php

function highlight_search_terms($word, $string) {
// Escape regex metacharacters, including the '/' delimiter.
$search = preg_quote($word, '/');

$search = str_replace('a', '[aàáâãäå]', $search);
$search = str_replace('e', '[eèéêë]', $search);
/* repeat for each possible accented character */

// The 'u' modifier makes PCRE read pattern and subject as UTF-8, so an
// accented letter is one character and 'i' also matches 'É' against 'é'.
return preg_replace('/\b' . $search . '\b/iu', '<span class="keysearch">$0</span>', $string);

}

$string = "now my problem is to find a way ( I imagine with some kind
of regular expression ) to achieve in php a search and replace
accent-insensitive, so that i can find the word 'cafe' in a string
also if it is 'café', or 'CAFÉ', or 'CAFE', and vice-versa.";

echo highlight_search_terms('cafe', $string);
?>

Andrew

tedd

2008-07-13 12:50:27 UTC

Andrew:

You may be right -- it's ambiguous now that I
review it again. He does say search and replace
but I'm not sure if that's what he really wants.
It looks more like search with one string and
highlight all like-strings.

Cheers,

tedd


Giulio Mastrosanti

2008-07-14 15:06:14 UTC

First of all thank you all for your answers, and thank you for your time

and yes Tedd, my question was quite ambiguous on that point.

Andrew is right, I don't want to change in any way the list of keys I
show in the result, I just want to find a way to highlight the
matching words, regardless of their accent variations.

So I think Andrew's suggestion could be a good solution, and I'll
try it ASAP...

Let me see if I understood correctly:

$search = preg_quote($word); -- quotes chars that could be interpreted
as regex special chars

$search = str_replace('e', '[eèéêë]', $search); -- transforms e.g.
cafe into caf[eèéêë], so it matches all the accented variations

return preg_replace('/\b' ... -- replaces all the occurrences, adding
the tags; you use \b as word boundary, right?

It seems a fine solution to the problem!

The only thing I must add, before calling highlight_search_terms, is to
'normalize' the word string ( the word used for the search ) by
replacing the accented versions of the chars:

$word = preg_replace('/[èé]/u', 'e', $word);
$word = preg_replace('/[à]/u', 'a', $word);

That is because the search string could also contain an accented char,
and this way I avoid performing str_replace in the
highlight_search_terms function for every combination of accented chars.
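
A minimal sketch of that normalize-then-expand idea, assuming a small hand-written map. The names fold_accents and accent_pattern and the map contents are illustrative only and do not come from the thread; the map would need extending for real data.

```php
<?php
// Hypothetical helper: fold a few accented letters to their base letter.
// The map is deliberately tiny; extend it for the characters you expect.
function fold_accents($word) {
    $map = array(
        'à' => 'a', 'á' => 'a', 'â' => 'a',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'À' => 'A', 'Á' => 'A', 'Â' => 'A',
        'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E',
    );
    // strtr() with an array replaces whole UTF-8 byte sequences.
    return strtr($word, $map);
}

// Build the accent-tolerant pattern from the normalized, lower-cased word,
// so only the plain letters 'a' and 'e' need to be expanded.
function accent_pattern($word) {
    $search = preg_quote(strtolower(fold_accents($word)), '/');
    $search = str_replace('a', '[aàáâ]', $search);
    $search = str_replace('e', '[eèéêë]', $search);
    return '/' . $search . '/iu';
}

echo accent_pattern('café'), "\n";                          // /c[aàáâ]f[eèéêë]/iu
echo preg_match(accent_pattern('CAFÉ'), 'un café'), "\n";   // 1
```

Because the word is folded first, the expansion step only ever sees unaccented letters, which is the saving described above.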

well, i think I'm on the good way now, unfortunately I have some other
urgent work and can't try it immediately, but I'll let you know :)

thank you!

Giulio


Andrew Ballard

2008-07-14 16:04:06 UTC

On Mon, Jul 14, 2008 at 11:06 AM, Giulio Mastrosanti

Post by Giulio Mastrosanti
First of all thank you all for your answers, and thank you for your time
and yes Tedd, my question was quite ambiguous in that point.
Andrew is right, i don't want to change in any way the list of keys I show
in the result, I just want to find the way to higlight the matching words,
regardless of their accent variations.
So I think his Andrew's suggestion could be a good solution, and I'll try it
ASAP...
$search = preg_quote($word); -- quotes chars that could be intrepreted like
regex special chars
$search = str_replace('e', '[eèéêë]', $search); -- trasforms i.e. cafe in
caf[eèéêë], so matches all the accented variations
return preg_replace('/\b' ... -- replaces all the occurences adding the
tags, you use \b as word boundary, right?

Yes, yes, and yes. :-)

Post by Giulio Mastrosanti
it seems a fine soultion to the problem!
the only thing i must add is, befor calling highlight_search_terms, to
'normalize' the word string ( the word used for the search) to transform it
$word = preg_replace('/[èé]/u', 'e', $word);
$word = preg_replace('/[à]/u', 'a', $word);
that because also the search string could contain an accented char, and this
way I avoid to perform str_replace in the highlight_search_terms function
for every combination of accented chars

I was intrigued by your example, so I played around with it some more
this morning. My own quick web search yielded a lot of results for
highlighting search terms, but none that I found did what you're
after. (I admit I didn't look very deep.) I was up to something like
this before your reply came in. It's still by no means complete. It
even handles simple English plurals (words ending in 's' or 'es'), but
not variations that require changing the word base (like 'daisy' to
'daisies').

<?php
function highlight_search_terms($phrase, $string) {
$non_letter_chars = '/[^\pL]/iu';
$words = preg_split($non_letter_chars, $phrase);

$search_words = array();
foreach ($words as $word) {
if (strlen($word) > 2 && !preg_match($non_letter_chars, $word)) {
$search_words[] = $word;
}
}

$search_words = array_unique($search_words);

foreach ($search_words as $word) {
$search = preg_quote($word, '/');

/* repeat for each possible accented character */
$search = preg_replace('/(ae|æ|ǽ)/iu', '(ae|æ|ǽ)', $search);
$search = preg_replace('/(oe|œ)/iu', '(oe|œ)', $search);
$search = preg_replace('/[aàáâãäåǻāăą](?!e)/iu',
'[aàáâãäåǻāăą]', $search);
$search = preg_replace('/[cçćĉċč]/iu', '[cçćĉċč]', $search);
$search = preg_replace('/[dďđ]/iu', '[dďđ]', $search);
$search = preg_replace('/(?<![ao])[eèéêëēĕėęě]/iu',
'[eèéêëēĕėęě]', $search);
$search = preg_replace('/[gĝğġģ]/iu', '[gĝğġģ]', $search);
$search = preg_replace('/[hĥħ]/iu', '[hĥħ]', $search);
$search = preg_replace('/[iìíîïĩīĭįı]/iu', '[iìíîïĩīĭįı]', $search);
$search = preg_replace('/[jĵ]/iu', '[jĵ]', $search);
$search = preg_replace('/[kķĸ]/iu', '[kķĸ]', $search);
$search = preg_replace('/[lĺļľŀł]/iu', '[lĺļľŀł]', $search);
$search = preg_replace('/[nñńņňŉŋ]/iu', '[nñńņňŉŋ]', $search);
$search = preg_replace('/[oòóôõöōŏőǿơ](?!e)/iu',
'[oòóôõöōŏőǿơ]', $search);
$search = preg_replace('/[rŕŗř]/iu', '[rŕŗř]', $search);
$search = preg_replace('/[sśŝşš]/iu', '[sśŝşš]', $search);
$search = preg_replace('/[tţťŧ]/iu', '[tţťŧ]', $search);
$search = preg_replace('/[uùúûüũūŭůűųǔǖǘǚǜ]/iu',
'[uùúûüũūŭůűųǔǖǘǚǜ]', $search);
$search = preg_replace('/[wŵ]/iu', '[wŵ]', $search);
$search = preg_replace('/[yýÿŷ]/iu', '[yýÿŷ]', $search);
$search = preg_replace('/[zźżž]/iu', '[zźżž]', $search);

$string = preg_replace('/\b' . $search . '(e?s)?\b/iu', '<span class="keysearch">$0</span>', $string);
}

return $string;

}
?>

I still can't help feeling there must be some better way, though.
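
One possible refinement, offered as a sketch rather than a tested replacement: when the replacement wraps matches in markup, looping over the words one at a time lets a later word match inside markup inserted for an earlier one. Building a single alternation and highlighting in one pass avoids that. The function names, the reduced letter table and the keysearch class below are assumptions made for illustration.

```php
<?php
// Turn one word into an accent-tolerant pattern fragment.
function accent_class_pattern($word) {
    // Illustrative subset of letters; extend as needed.
    $classes = array('a' => '[aàáâãäå]', 'e' => '[eèéêë]', 'o' => '[oòóôõö]');
    $out = '';
    // Split into UTF-8 characters; PREG_SPLIT_NO_EMPTY drops the empty edges.
    foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        $key = strtolower($ch);
        $out .= isset($classes[$key]) ? $classes[$key] : preg_quote($ch, '/');
    }
    return $out;
}

// Highlight every word in a single preg_replace() call, so inserted
// markup is never scanned again for the next word.
function highlight_all($words, $string) {
    $parts = array_map('accent_class_pattern', $words);
    if (!$parts) {
        return $string;
    }
    // (?<!\pL) and (?!\pL) act as Unicode-aware word boundaries.
    $pattern = '/(?<!\pL)(?:' . implode('|', $parts) . ')(?!\pL)/iu';
    return preg_replace($pattern, '<span class="keysearch">$0</span>', $string);
}

echo highlight_all(array('cafe', 'span'), "café, CAFE, span"), "\n";
```

Searching for 'span' here highlights only the word in the text, not the tag names added around 'café'. Accented letters in the search words themselves are matched literally in this sketch, so the words would still need normalizing first.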


Giulio Mastrosanti

2008-07-14 17:35:26 UTC

Brilliant !!!

so you replace every occurrence of every accent variation with all the
accent variations...

OK, that's it!

only some more doubts ( regexes are still a headache for me... )

preg_replace('/[iìíîïĩīĭįı]/iu',... -- what's the meaning of
iu after the match string?

preg_replace('/[aàáâãäåǻāăą](?!e)/iu',... what's (?!e) for?
-- every occurrence of aàáâãäåǻāăą NOT followed by e?

Many thanks again for your effort,

I'm definitely on the good way

Giulio


Andrew Ballard

2008-07-14 18:20:54 UTC

On Mon, Jul 14, 2008 at 1:35 PM, Giulio Mastrosanti

Post by Giulio Mastrosanti
Brilliant !!!
so you replace every occurence of every accent variation with all the accent
variations...
OK, that's it!
only some more doubts ( regex are still an headhache for me... )
preg_replace('/[iìíîïĩīĭįı]/iu',... -- what's the meaning of iu after the
match string?

This page explains them both.
http://us.php.net/manual/en/reference.pcre.pattern.modifiers.php
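
For quick reference, a small sketch of what the two modifiers change; the sample string is made up for illustration. 'i' makes matching case-insensitive, and 'u' tells PCRE to read pattern and subject as UTF-8, so an accented letter counts as one character rather than two bytes.

```php
<?php
// 'É' is two bytes in UTF-8 (0xC3 0x89).
$subject = 'CAFÉ';

// Without 'u', '.' consumes a single byte, so it cannot cover 'É'.
$bytewise = preg_match('/^CAF.$/', $subject);

// With 'u', '.' consumes one whole UTF-8 character.
$utf8 = preg_match('/^CAF.$/u', $subject);

// With both modifiers, case folding also applies to accented letters.
$caseless = preg_match('/^café$/iu', $subject);

var_dump($bytewise, $utf8, $caseless); // int(0) int(1) int(1)
```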

Post by Giulio Mastrosanti
preg_replace('/[aàáâãäåǻāăą](?!e)/iu',... whats (?!e) for? -- every
occurence of aàáâãäåǻāăą NOT followed by e?

Yes. It matches any character based on the Latin 'a' that is not
followed by an 'e'. It keeps the pattern from matching the 'a' when it
immediately precedes an 'e', so that 'ae' can be handled as a unit
together with the 'æ' ligature, for words like these:
http://en.wikipedia.org/wiki/List_of_words_that_may_be_spelled_with_a_ligature
(However, that may cause problems with words that have other variants
of 'ae' in them. I'll leave that to you to resolve.)
http://us.php.net/manual/en/regexp.reference.php
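
The lookahead can be seen in isolation with a reduced example. The helper name expand_a and the shortened character class are illustrative only.

```php
<?php
// Expand every 'a' that is NOT followed by 'e' into an accent-tolerant class.
// The lookahead (?!e) only inspects the next character; it is not consumed.
function expand_a($word) {
    return preg_replace('/a(?!e)/iu', '[aàáâãäå]', $word);
}

echo expand_a('cafe'), "\n";   // c[aàáâãäå]fe
echo expand_a('aether'), "\n"; // aether (left for a separate ae/æ rule)
```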


Yeti

2008-07-15 09:38:17 UTC

I don't think using all these regular expressions is a very efficient way to
do this. As I previously pointed out, many users have had a similar
problem, as can be seen at:

http://it.php.net/manual/en/function.strtr.php

One of my favourites is what derernst at gmx dot ch used.

derernst at gmx dot ch
wrote on 20-Sep-2005 07:29
This works for me to remove accents for some characters of Latin-1, Latin-2
and Turkish in a UTF-8 environment, where the htmlentities-based solutions
fail:

<?php
function remove_accents($string, $german=false) {

// Single letters

$single_fr = explode(" ", "À Á Â Ã Ä Å Ą Ă Ç Ć Č Ď Đ Ð È É Ê Ë Ę Ě Ğ Ì Í Î Ï İ Ł Ľ "
. "Ĺ Ñ Ń Ň Ò Ó Ô Õ Ö Ø Ő Ŕ Ř Š Ś Ş Ť Ţ Ù Ú Û Ü Ů Ű Ý Ž Ź Ż "
. "à á â ã ä å ą ă ç ć č ď đ è é ê ë ę ě ğ ì í î ï ı ł ľ "
. "ĺ ñ ń ň ð ò ó ô õ ö ø ő ŕ ř ś š ş ť ţ ù ú û ü ů ű ý ÿ ž ź ż");

$single_to = explode(" ", "A A A A A A A A C C C D D D E E E E E E G I I I I I L L L "
. "N N N O O O O O O O R R S S S T T U U U U U U Y Z Z Z "
. "a a a a a a a a c c c d d e e e e e e g i i i i i l l l "
. "n n n o o o o o o o o r r s s s t t u u u u u u y y z z z");

$single = array();

for ($i=0; $i<count($single_fr); $i++) {

$single[$single_fr[$i]] = $single_to[$i];

}

// Ligatures

$ligatures = array("Æ"=>"Ae", "æ"=>"ae", "Œ"=>"Oe", "œ"=>"oe", "ß"=>"ss");

// German umlauts

$umlauts = array("Ä"=>"Ae", "ä"=>"ae", "Ö"=>"Oe", "ö"=>"oe", "Ü"=>"Ue",
"ü"=>"ue");

// Replace

$replacements = array_merge($single, $ligatures);

if ($german) $replacements = array_merge($replacements, $umlauts);

$string = strtr($string, $replacements);

return $string;

}

?>

I would change this function a bit ...

<?php
// echo rawurlencode("áàéèíìóòúùÁÀÉÈÍÌÓÒÚÙ"); // RFC 1738 codes
// NOTE: this assumes the document itself is saved as UTF-8
function remove_accents($string) {
$string = rawurlencode($string);
$replacements = array(
'%C3%A1' => 'a',
'%C3%A0' => 'a',
'%C3%A9' => 'e',
'%C3%A8' => 'e',
'%C3%AD' => 'i',
'%C3%AC' => 'i',
'%C3%B3' => 'o',
'%C3%B2' => 'o',
'%C3%BA' => 'u',
'%C3%B9' => 'u',
'%C3%81' => 'A',
'%C3%80' => 'A',
'%C3%89' => 'E',
'%C3%88' => 'E',
'%C3%8D' => 'I',
'%C3%8C' => 'I',
'%C3%93' => 'O',
'%C3%92' => 'O',
'%C3%9A' => 'U',
'%C3%99' => 'U'
);
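// Decode again so that spaces, punctuation and unmapped characters come
// back as themselves instead of staying percent-encoded.
return rawurldecode(strtr($string, $replacements));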
}
// echo remove_accents("CÁfé"); // I know it's not spelled right
echo remove_accents("áàéèíìóòúùÁÀÉÈÍÌÓÒÚÙ"); // OUTPUT (again: I used UTF-8
// for the document): aaeeiioouuAAEEIIOOUU
?>
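
Since the original goal was to highlight the keys without changing them, an accent-removal function like this could be used for comparison only. The following is a rough sketch under that assumption: fold_key and highlight_keys are hypothetical names, the map is a tiny subset, and the keysearch class is taken from the CSS suggestion earlier in the thread. It only matches whole keys in a comma-separated list.

```php
<?php
// Hypothetical fold: lower-case and strip a few accents, for comparison only.
function fold_key($s) {
    $map = array(
        'á' => 'a', 'à' => 'a', 'é' => 'e', 'è' => 'e',
        'Á' => 'a', 'À' => 'a', 'É' => 'e', 'È' => 'e',
    );
    return strtolower(strtr($s, $map));
}

// Wrap keys that equal the search word after folding; the key text itself
// is emitted unchanged, accents included.
function highlight_keys($keylist, $keysearch) {
    $needle = fold_key(trim($keysearch));
    $out = array();
    foreach (explode(',', $keylist) as $key) {
        $key = trim($key);
        $out[] = (fold_key($key) === $needle)
            ? '<span class="keysearch">' . $key . '</span>'
            : $key;
    }
    return implode(', ', $out);
}

echo highlight_keys('tea, café, CAFÉ, cafeteria', 'cafe'), "\n";
```

No regular expression is involved, at the cost of normalizing the separators to ', ' and not finding a word inside a longer key.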

Ciao

Yeti


Yeti

2008-07-15 09:44:24 UTC

Oh, and i forgot about this one ...

jorge at seisbits dot com
wrote on 11-Jul-2008 09:04
If you try to use strtr on unusual characters when you are in a UTF-8
environment, you can do this:

function normaliza ($string){
$string = utf8_decode($string);
$string = strtr($string, utf8_decode(" ÁÉÍÓÚ"), "-AEIOU");
$string = strtolower($string);
return $string;
}
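
The utf8_decode() call is there because the three-argument form of strtr() maps single bytes, so each accented letter has to be one Latin-1 byte first. A short sketch of what happens otherwise, and of the array form that works on UTF-8 directly (the sample character is arbitrary):

```php
<?php
// Three-argument strtr() maps single bytes, so a two-byte UTF-8 character
// in the "from" string is split: only its first byte is paired with 'e'.
$broken = strtr('é', 'é', 'e');
echo bin2hex($broken), "\n"; // 65a9: 'e' followed by a stray 0xA9 byte

// The array form replaces whole byte sequences and is safe for UTF-8.
$ok = strtr('é', array('é' => 'e'));
echo $ok, "\n"; // e
```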


Andrew Ballard

2008-07-15 13:46:18 UTC

Post by Andrew Ballard
On Mon, Jul 14, 2008 at 1:35 PM, Giulio Mastrosanti

This page explains them both.
http://us.php.net/manual/en/reference.pcre.pattern.modifiers.php

Post by Giulio Mastrosanti
preg_replace('/[aàáâãäåǻāăą](?!e)/iu',... what's (?!e) for? -- every
occurrence of aàáâãäåǻāăą NOT followed by e?

Yes. It matches any character based on the latin 'a' that is not
followed by an 'e'. It keeps the pattern from matching the 'a' when it
immediately precedes an 'e', so that the digraph 'ae' can be handled
separately in words like these:
http://en.wikipedia.org/wiki/List_of_words_that_may_be_spelled_with_a_ligature
(However, that may cause problems with words that have other variants
of 'ae' in them. I'll leave that to you to resolve.)
http://us.php.net/manual/en/regexp.reference.php
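
To make the effect of the negative lookahead concrete, here is a minimal sketch. The helper name mark_a_not_before_e and the "[A]" marker are invented for the illustration, and the character class is shortened; the source file is assumed to be saved as UTF-8.

```php
<?php
// (?!e) is a negative lookahead: it checks that the next character is
// not an 'e' without consuming that character.
function mark_a_not_before_e(string $text): string
{
    return preg_replace('/[aàáâãäå](?!e)/iu', '[A]', $text);
}

// The 'a' in "cafe" is followed by 'f' and is replaced;
// the 'a' in "maestro" is followed by 'e' and is left alone.
echo mark_a_not_before_e('cafe maestro'), "\n"; // c[A]fe maestro
```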

Post by Giulio Mastrosanti
Many thanks again for your effort,
I'm definitely on the right track
Giulio

Post by Andrew Ballard
I was intrigued by your example, so I played around with it some more
this morning. My own quick web search yielded a lot of results for
highlighting search terms, but none that I found did what you're
after. (I admit I didn't look very deep.) I was up to something like
this before your reply came in. It's still by no means complete. It
even handles simple English plurals (words ending in 's' or 'es'), but
not variations that require changing the word base (like 'daisy' to
'daisies').
<?php
function highlight_search_terms($phrase, $string) {
    $non_letter_chars = '/[^\pL]/iu';
    $words = preg_split($non_letter_chars, $phrase);
    $search_words = array();
    foreach ($words as $word) {
        if (strlen($word) > 2 && !preg_match($non_letter_chars, $word)) {
            $search_words[] = $word;
        }
    }
    $search_words = array_unique($search_words);
    foreach ($search_words as $word) {
        $search = preg_quote($word, '/');
        /* repeat for each possible accented character */
        $search = preg_replace('/(ae|æ|ǽ)/iu', '(ae|æ|ǽ)', $search);
        $search = preg_replace('/(oe|œ)/iu', '(oe|œ)', $search);
        $search = preg_replace('/[aàáâãäåǻāăą](?!e)/iu', '[aàáâãäåǻāăą]', $search);
        $search = preg_replace('/[cçćĉċč]/iu', '[cçćĉċč]', $search);
        $search = preg_replace('/[dďđ]/iu', '[dďđ]', $search);
        $search = preg_replace('/(?<![ao])[eèéêëēĕėęě]/iu', '[eèéêëēĕėęě]', $search);
        $search = preg_replace('/[gĝğġģ]/iu', '[gĝğġģ]', $search);
        $search = preg_replace('/[hĥħ]/iu', '[hĥħ]', $search);
        $search = preg_replace('/[iìíîïĩīĭįı]/iu', '[iìíîïĩīĭįı]', $search);
        $search = preg_replace('/[jĵ]/iu', '[jĵ]', $search);
        $search = preg_replace('/[kķĸ]/iu', '[kķĸ]', $search);
        $search = preg_replace('/[lĺļľŀł]/iu', '[lĺļľŀł]', $search);
        $search = preg_replace('/[nñńņňŉŋ]/iu', '[nñńņňŉŋ]', $search);
        $search = preg_replace('/[oòóôõöōŏőǿơ](?!e)/iu', '[oòóôõöōŏőǿơ]', $search);
        $search = preg_replace('/[rŕŗř]/iu', '[rŕŗř]', $search);
        $search = preg_replace('/[sśŝşš]/iu', '[sśŝşš]', $search);
        $search = preg_replace('/[tţťŧ]/iu', '[tţťŧ]', $search);
        $search = preg_replace('/[uùúûüũūŭůűųǔǖǘǚǜ]/iu', '[uùúûüũūŭůűųǔǖǘǚǜ]', $search);
        $search = preg_replace('/[wŵ]/iu', '[wŵ]', $search);
        $search = preg_replace('/[yýÿŷ]/iu', '[yýÿŷ]', $search);
        $search = preg_replace('/[zźżž]/iu', '[zźżž]', $search);
        // The <b> wrapper is an assumed placeholder; substitute your own highlight markup.
        $string = preg_replace('/\b' . $search . '(e?s)?\b/iu', '<b>$0</b>', $string);
    }
    return $string;
}
?>
I still can't help feeling there must be some better way, though.

Post by Giulio Mastrosanti
Well, I think I'm on the right track now. Unfortunately I have some other
urgent work and can't try it immediately, but I'll let you know :)
Thank you!
Giulio

Andrew

I agree it doesn't seem very efficient to me, but I haven't come up
with anything better. The problem with what you posted is that the OP
was looking to preserve the accented characters, NOT replace them. All
he wants to do is wrap some tags around the search terms so that they
are highlighted. I guess he could use your function to replace all the
accented characters with regular ones in a copy of the original
string, then scan that copy with strpos() or similar to find the
index of each occurrence that needs to be replaced in the original
string. This seems even less efficient than the regular expressions.
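
The copy-and-scan idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the mbstring extension is available, fold_accents() and highlight_by_offsets() are hypothetical names, the folding table is deliberately tiny, and the <b> tags are placeholder markup. It relies on the folded copy having exactly as many characters as the original, so one-to-many replacements such as 'æ' to 'ae' would throw the offsets off.

```php
<?php
// Hypothetical one-character-to-one-character folding table (far from complete).
function fold_accents(string $s): string
{
    return strtr($s, array(
        'á' => 'a', 'à' => 'a', 'é' => 'e', 'è' => 'e', 'í' => 'i',
        'ì' => 'i', 'ó' => 'o', 'ò' => 'o', 'ú' => 'u', 'ù' => 'u',
        'Á' => 'A', 'À' => 'A', 'É' => 'E', 'È' => 'E', 'Í' => 'I',
        'Ì' => 'I', 'Ó' => 'O', 'Ò' => 'O', 'Ú' => 'U', 'Ù' => 'U',
    ));
}

// Search in a folded, lower-cased copy; cut the matches out of the original.
function highlight_by_offsets(string $needle, string $haystack): string
{
    $folded = mb_strtolower(fold_accents($haystack), 'UTF-8');
    $needle = mb_strtolower(fold_accents($needle), 'UTF-8');
    $len    = mb_strlen($needle, 'UTF-8');
    $total  = mb_strlen($folded, 'UTF-8');
    if ($len === 0) {
        return $haystack;
    }
    $out = '';
    $pos = 0; // character offset of the first not-yet-copied character
    while ($pos < $total
        && ($hit = mb_strpos($folded, $needle, $pos, 'UTF-8')) !== false) {
        // Copy the untouched text before the match, then the wrapped match.
        $out .= mb_substr($haystack, $pos, $hit - $pos, 'UTF-8');
        $out .= '<b>' . mb_substr($haystack, $hit, $len, 'UTF-8') . '</b>';
        $pos = $hit + $len;
    }
    return $out . mb_substr($haystack, $pos, null, 'UTF-8');
}

echo highlight_by_offsets('cafe', 'Un café, un CAFÉ'), "\n";
```

The accented spelling survives because the replacement text is taken from the original string, not from the folded copy.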

Andrew Ballard

2008-07-15 14:15:32 UTC

Well, OK, I can think of one optimization. This takes advantage of the
fact that preg_replace can accept arrays as parameters. In a couple
very quick tests this version is roughly 30% faster than my previous
version:

<?php

function highlight_search_terms2($phrase, $string) {
    $non_letter_chars = '/[^\pL]/iu';
    $words = preg_split($non_letter_chars, $phrase);

    $search_words = array();
    foreach ($words as $word) {
        if (strlen($word) > 2 && !preg_match($non_letter_chars, $word)) {
            $search_words[] = $word;
        }
    }

    $search_words = array_unique($search_words);

    $patterns = array(
        /* repeat for each possible accented character */
        '/(ae|æ|ǽ)/iu' => '(ae|æ|ǽ)',
        '/(oe|œ)/iu' => '(oe|œ)',
        '/[aàáâãäåǻāăą](?!e)/iu' => '[aàáâãäåǻāăą]',
        '/[cçćĉċč]/iu' => '[cçćĉċč]',
        '/[dďđ]/iu' => '[dďđ]',
        '/(?<![ao])[eèéêëēĕėęě]/iu' => '[eèéêëēĕėęě]',
        '/[gĝğġģ]/iu' => '[gĝğġģ]',
        '/[hĥħ]/iu' => '[hĥħ]',
        '/[iìíîïĩīĭįı]/iu' => '[iìíîïĩīĭįı]',
        '/[jĵ]/iu' => '[jĵ]',
        '/[kķĸ]/iu' => '[kķĸ]',
        '/[lĺļľŀł]/iu' => '[lĺļľŀł]',
        '/[nñńņňŉŋ]/iu' => '[nñńņňŉŋ]',
        '/[oòóôõöōŏőǿơ](?!e)/iu' => '[oòóôõöōŏőǿơ]',
        '/[rŕŗř]/iu' => '[rŕŗř]',
        '/[sśŝşš]/iu' => '[sśŝşš]',
        '/[tţťŧ]/iu' => '[tţťŧ]',
        '/[uùúûüũūŭůűųǔǖǘǚǜ]/iu' => '[uùúûüũūŭůűųǔǖǘǚǜ]',
        '/[wŵ]/iu' => '[wŵ]',
        '/[yýÿŷ]/iu' => '[yýÿŷ]',
        '/[zźżž]/iu' => '[zźżž]',
    );

    foreach ($search_words as $word) {
        $search = preg_quote($word, '/');

        // Keys are the patterns, values the replacements; both are applied in array order.
        $search = preg_replace(array_keys($patterns), $patterns, $search);

        // The <b> wrapper is an assumed placeholder; substitute your own highlight markup.
        $string = preg_replace('/\b' . $search . '(e?s)?\b/iu', '<b>$0</b>', $string);
    }

    return $string;
}
?>
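
One detail worth spelling out: when preg_replace() is given arrays, it applies the patterns one after another, each to the result of the previous one, so the order of the entries matters (the digraph rules have to come before the single-letter rules). Below is a cut-down sketch with only three rules; the function name accent_insensitive_pattern is invented for the example and the file is assumed to be UTF-8.

```php
<?php
// Turn a plain word into a regex fragment that also matches accented spellings.
function accent_insensitive_pattern(string $word): string
{
    $patterns = array(
        '/(ae|æ)/iu',          // 1. digraph first ...
        '/[aàá](?!e)/iu',      // 2. ... then 'a' not followed by 'e'
        '/(?<![ao])[eèé]/iu',  // 3. ... then 'e' not preceded by 'a' or 'o'
    );
    $replacements = array('(ae|æ)', '[aàá]', '[eèé]');
    return preg_replace($patterns, $replacements, preg_quote($word, '/'));
}

echo accent_insensitive_pattern('cafe'), "\n";    // c[aàá]f[eèé]
echo accent_insensitive_pattern('maestro'), "\n"; // m(ae|æ)stro
var_dump(preg_match('/^' . accent_insensitive_pattern('cafe') . '$/iu', 'CAFÉ')); // int(1)
```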

tedd

2008-07-15 16:30:03 UTC

Post by Andrew Ballard
Well, OK, I can think of one optimization. This takes advantage of the
fact that preg_replace can accept arrays as parameters. In a couple
very quick tests this version is roughly 30% faster than my previous

-snip-

Hey, when you finally get finished with that function, please let me
know. I would like to copy it. :-)

Cheers,

tedd

--
-------
http://sperling.com http://ancientstones.com http://earthstones.com

Andrew Ballard

2008-07-15 17:17:15 UTC

Post by tedd

Post by Andrew Ballard
Well, OK, I can think of one optimization. This takes advantage of the
fact that preg_replace can accept arrays as parameters. In a couple
very quick tests this version is roughly 30% faster than my previous

-snip-
Hey, when you finally get finished with that function, please let me know. I
would like to copy it. :-)
Cheers,
tedd

All yours. I figure I'm done with it. (At least until I actually need
to use it for something and then I have to test it for real. :-) )

Andrew


Yeti

2008-07-15 18:07:19 UTC

The original problem was

User X submits a character string A.

A PHP script uses A to search for its occurrences in a DB, ignoring special
characters.

The result of the search is a list of character strings, M-LIST, with matches.

This list is output to user X, but before that every matching substring
should be wrapped in highlighting markup.

If I understood the OP correctly, he is using MySQL to perform the search.

I guess he is doing it with MATCH. So MySQL has already found the match, and
in PHP it has to be done again ...

eg.

The table has 2 entries, string1 and string2 ..

string1 = 'Thís ís an éxámplè stríng wíth áccénts.'

string2 = 'This is an example string without accents.'

Now the user searches for "ample":

search = 'ample'

Both strings match due to accent-insensitivity (AI). Now the result is
output with highlighting ..

*Thís ís an éx**ámplè** stríng wíth áccénts.*

*This is an ex**ample** string without accents.*

So since MySQL has already done the job, why not get the occurrences from it?

I'm not a MySQL expert, but I know Google and found something called string
functions. A "locate" function especially caught my interest.

http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_locate

Now shouldn't it be possible to create a query that searches the DB for
matches and additionally uses the string function?

I have no idea, but maybe some MySQL expert out there does ...

Yeti

Andrew Ballard

2008-07-15 19:16:51 UTC

Post by Yeti
The original problem was
User X submits a character string A.
A PHP script uses A to search for its occurrences in a DB, ignoring special
characters.
The result of the search is a list of character strings, M-LIST, with matches.
This list is output to user X, but before that every matching substring
should be wrapped in highlighting markup.
If I understood the OP correctly, he is using MySQL to perform the search.
I guess he is doing it with MATCH. So MySQL has already found the match, and
in PHP it has to be done again ...
eg.
The table has 2 entries, string1 and string2 ..
string1 = 'Thís ís an éxámplè stríng wíth áccénts.'
string2 = 'This is an example string without accents.'
search = 'ample'
Both strings match due to accent-insensitivity (AI). Now the result is
output with highlighting ..
Thís ís an éxámplè stríng wíth áccénts.
This is an example string without accents.

Correct.

Post by Yeti
So since MySQL has already done the job, why not get the occurrences from it?
I'm not a MySQL expert, but I know Google and found something called string
functions. A "locate" function especially caught my interest.
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_locate
Now shouldn't it be possible to create a query that searches the DB for
matches and additionally uses the string function?
I have no idea, but maybe some MySQL expert out there does ...
Yeti

There are definitely possibilities there. Personally, I tend to be
biased against using the database to format output for presentation,
so I'd rather not push the task off there. Still, I know lots of
developers do not share this bias, so I'll address a couple other
issues I see with this approach:

1) If the search word appears multiple times, LOCATE() will only find
it once. I'd probably use REPLACE() instead. This leads to the next
problem:

2) I'm not sure if the OP wants this or not, but if he wants to
highlight each of multiple search terms the way many sites do, he
would have to split the terms and build a SQL expression like this
(there are probably other approaches available in MySQL to do the same
thing):

-- search phrase 'quaint french cafe'
-- (the <b> tags are placeholders for whatever highlight markup is wanted)
SELECT REPLACE(REPLACE(REPLACE(`my_column`, 'quaint', '<b>quaint</b>'), 'french', '<b>french</b>'), 'cafe', '<b>cafe</b>') FROM ...

In this case, he should get all instances of each word highlighted,
but the accented characters would again be replaced with a particular
style. (Not to mention the size and complexity of the query being
passed from PHP to the database or the potential size of the result
being passed from the database to PHP since it now could have lots of
formatting text
