Discussion:
case and accent - insensitive regular expression?
Giulio Mastrosanti
2008-07-12 07:36:50 UTC
Hi,
I have a php page that asks user for a key ( or a list of keys ) and
then shows a list of items matching the query.

every item in the list shows its data, and the list of keys it has ( a
list of comma-separated words )

I would like to highlight, in the list of keys shown for every item,
the words matching the query.

This can be easily achieved with a search and replace: for every
search word, I search for it in the key list and replace it, adding a
style tag to highlight it, for example to show it in red:

if ( stripos($keylist, $keysearch) !== false ) {
    $keylist = str_ireplace($keysearch, '<span style="color: #FF0000">' .
        $keysearch . '</span>', $keylist);
}

but I have a problem with accented characters:

I have MySQL with character encoding utf8, and all the PHP pages are
declared as utf8.

MySQL is configured to perform queries in a case- and
accent-insensitive way.
This means that if you search for the word 'cafe', the returned rows
include those whose keyword list contains 'cafe', but also 'café' with
the accent. ( I think it has to do with 'collation' settings, but I'm
not investigating at the moment because the way it works is OK for
me ).

Now my problem is to find a way ( I imagine with some kind of regular
expression ) to achieve an accent-insensitive search and replace in
PHP, so that I can find the word 'cafe' in a string also if it
is 'café', or 'CAFÉ', or 'CAFE', and vice-versa.

hope the problem is clear and well-explained in english,

thank you for any tip,

Giulio
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
tedd
2008-07-12 14:29:22 UTC
Giulio:

Three things:

1. Your English is fine.

2. Try using mb_ereg_replace()

http://www.php.net/mb_ereg_replace

Place the accents you want to change in that and
change them to whatever you want.

3. Change:

<span style="color: #FF0000">'.$keysearch.'</span>'

to

<span class="keysearch">'.$keysearch.'</span>'

and add

.keysearch
{
color: #FF0000;
}

to your css.

Cheers,

tedd
--
-------
http://sperling.com http://ancientstones.com http://earthstones.com
Andrew Ballard
2008-07-13 12:31:28 UTC
I may be mistaken (and if I am, then just ignore this as ignorant
rambling), but I don't think he's wanting to replace the accented
characters in the original string. I think he's just wanting the
pattern to find all variations of the same string and highlight them
without changing them. For example, his last paragraph would look like
this:

[quote]
now my problem is to find a way ( I imagine with some kind of regular
expression ) to achieve in php a search and replace
accent-insensitive, so that i can find the word '<span
class="keysearch">cafe</span>' in a string also if it is '<span
class="keysearch">café</span>', or '<span
class="keysearch">CAFÉ</span>', or '<span
class="keysearch">CAFE</span>', and vice-versa.
[/quote]

The best I can think of right now is something like this:

<?php

function highlight_search_terms($word, $string) {
    // Escape regex metacharacters, including the '/' delimiter.
    $search = preg_quote($word, '/');

    $search = str_replace('a', '[aàáâãäå]', $search);
    $search = str_replace('e', '[eèéêë]', $search);
    /* repeat for each possible accented character */

    // The 'u' modifier makes PCRE treat pattern and subject as UTF-8.
    return preg_replace('/\b' . $search . '\b/iu',
        '<span class="keysearch">$0</span>', $string);
}

$string = "now my problem is to find a way ( I imagine with some kind
of regular expression ) to achieve in php a search and replace
accent-insensitive, so that i can find the word 'cafe' in a string
also if it is 'café', or 'CAFÉ', or 'CAFE', and vice-versa.";

echo highlight_search_terms('cafe', $string);
?>

Andrew
tedd
2008-07-13 12:50:27 UTC
Andrew:

You may be right -- it's ambiguous now that I
review it again. He does say search and replace
but I'm not sure if that's what he really wants.
It looks more like search with one string and
highlight all like-strings.

Cheers,

tedd
Giulio Mastrosanti
2008-07-14 15:06:14 UTC
First of all, thank you all for your answers and for your time.

And yes Tedd, my question was quite ambiguous on that point.

Andrew is right, I don't want to change in any way the list of keys I
show in the result, I just want to find a way to highlight the
matching words, regardless of their accent variations.

So I think Andrew's suggestion could be a good solution, and I'll
try it ASAP...

Let me see if I understood correctly:

$search = preg_quote($word); -- quotes chars that could be interpreted
as regex special chars

$search = str_replace('e', '[eèéêë]', $search); -- transforms e.g.
cafe into caf[eèéêë], so it matches all the accented variations

return preg_replace('/\b' ... -- replaces all the occurrences, adding
the tags; you use \b as a word boundary, right?

It seems a fine solution to the problem!

The only thing I must add is, before calling highlight_search_terms, to
'normalize' the word string ( the word used for the search ) by
replacing the accented versions of the chars with their base letters:

$word = preg_replace('/[èé]/u', 'e', $word);
$word = preg_replace('/[à]/u', 'a', $word);

That is because the search string could also contain an accented char,
and this way I avoid performing str_replace in the
highlight_search_terms function for every combination of accented chars.

Well, I think I'm on the right track now; unfortunately I have some other
urgent work and can't try it immediately, but I'll let you know :)

thank you!

Giulio
Andrew Ballard
2008-07-14 16:04:06 UTC
On Mon, Jul 14, 2008 at 11:06 AM, Giulio Mastrosanti
Post by Giulio Mastrosanti
$search = preg_quote($word); -- quotes chars that could be interpreted as
regex special chars
$search = str_replace('e', '[eèéêë]', $search); -- transforms e.g. cafe into
caf[eèéêë], so it matches all the accented variations
return preg_replace('/\b' ... -- replaces all the occurrences, adding the
tags; you use \b as a word boundary, right?
Yes, yes, and yes. :-)
Post by Giulio Mastrosanti
It seems a fine solution to the problem!
The only thing I must add is, before calling highlight_search_terms, to
'normalize' the word string ( the word used for the search ) to transform it:
$word = preg_replace('/[èé]/u', 'e', $word);
$word = preg_replace('/[à]/u', 'a', $word);
That is because the search string could also contain an accented char, and this
way I avoid performing str_replace in the highlight_search_terms function
for every combination of accented chars.
I was intrigued by your example, so I played around with it some more
this morning. My own quick web search yielded a lot of results for
highlighting search terms, but none that I found did what you're
after. (I admit I didn't look very deep.) I was up to something like
this before your reply came in. It's still by no means complete. It
even handles simple English plurals (words ending in 's' or 'es'), but
not variations that require changing the word base (like 'daisy' to
'daisies').

<?php
function highlight_search_terms($phrase, $string) {
    $non_letter_chars = '/[^\pL]/iu';
    $words = preg_split($non_letter_chars, $phrase);

    $search_words = array();
    foreach ($words as $word) {
        if (strlen($word) > 2 && !preg_match($non_letter_chars, $word)) {
            $search_words[] = $word;
        }
    }

    $search_words = array_unique($search_words);

    foreach ($search_words as $word) {
        $search = preg_quote($word);

        /* repeat for each possible accented character */
        $search = preg_replace('/(ae|æ|ǽ)/iu', '(ae|æ|ǽ)', $search);
        $search = preg_replace('/(oe|œ)/iu', '(oe|œ)', $search);
        $search = preg_replace('/[aàáâãäåǻāăą](?!e)/iu', '[aàáâãäåǻāăą]', $search);
        $search = preg_replace('/[cçćĉċč]/iu', '[cçćĉċč]', $search);
        $search = preg_replace('/[dďđ]/iu', '[dďđ]', $search);
        $search = preg_replace('/(?<![ao])[eèéêëēĕėęě]/iu', '[eèéêëēĕėęě]', $search);
        $search = preg_replace('/[gĝğġģ]/iu', '[gĝğġģ]', $search);
        $search = preg_replace('/[hĥħ]/iu', '[hĥħ]', $search);
        $search = preg_replace('/[iìíîïĩīĭįı]/iu', '[iìíîïĩīĭįı]', $search);
        $search = preg_replace('/[jĵ]/iu', '[jĵ]', $search);
        $search = preg_replace('/[kķĸ]/iu', '[kķĸ]', $search);
        $search = preg_replace('/[lĺļľŀł]/iu', '[lĺļľŀł]', $search);
        $search = preg_replace('/[nñńņňŉŋ]/iu', '[nñńņňŉŋ]', $search);
        $search = preg_replace('/[oòóôõöōŏőǿơ](?!e)/iu', '[oòóôõöōŏőǿơ]', $search);
        $search = preg_replace('/[rŕŗř]/iu', '[rŕŗř]', $search);
        $search = preg_replace('/[sśŝşš]/iu', '[sśŝşš]', $search);
        $search = preg_replace('/[tţťŧ]/iu', '[tţťŧ]', $search);
        $search = preg_replace('/[uùúûüũūŭůűųǔǖǘǚǜ]/iu', '[uùúûüũūŭůűųǔǖǘǚǜ]', $search);
        $search = preg_replace('/[wŵ]/iu', '[wŵ]', $search);
        $search = preg_replace('/[yýÿŷ]/iu', '[yýÿŷ]', $search);
        $search = preg_replace('/[zźżž]/iu', '[zźżž]', $search);

        $string = preg_replace('/\b' . $search . '(e?s)?\b/iu',
            '<span class="keysearch">$0</span>', $string);
    }

    return $string;
}
?>

I still can't help feeling there must be some better way, though.
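One possible tidier arrangement of the same idea, offered only as a sketch: keep the letter variants in a single table and build the pattern in one pass over the search word. The function names accent_insensitive_pattern() and highlight_word() and the short variant table are assumptions made for illustration, not part of the code above, and the explicit letter/digit lookarounds stand in for \b.

```php
<?php
// Hypothetical helper: turns a search word into an accent-insensitive pattern.
function accent_insensitive_pattern(string $word): string
{
    // Illustrative subset only; extend with whatever letters the data needs.
    $variants = ['aàáâãäå', 'cç', 'eèéêë', 'iìíîï', 'nñ', 'oòóôõö', 'uùúûü'];

    $pattern = '';
    // Walk the word one UTF-8 character at a time.
    foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $piece = preg_quote($char, '/');
        foreach ($variants as $class) {
            // Case-insensitive test: is this character one of the variants?
            if (preg_match('/^[' . $class . ']$/iu', $char)) {
                $piece = '[' . $class . ']';
                break;
            }
        }
        $pattern .= $piece;
    }

    // Require a non-letter, non-digit (or the string edge) on both sides.
    return '/(?<![\pL\pN])' . $pattern . '(?![\pL\pN])/iu';
}

// Hypothetical wrapper: highlights every accent/case variant of $word.
function highlight_word(string $word, string $text): string
{
    return preg_replace(
        accent_insensitive_pattern($word),
        '<span class="keysearch">$0</span>',
        $text
    );
}

echo highlight_word('café', 'cafe, CAFÉ, cafeteria'), "\n";
```

Because the search word itself is mapped through the table, it does not need to be stripped of accents beforehand: 'cafe' and 'café' produce the same pattern.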
Post by Giulio Mastrosanti
well, i think I'm on the good way now, unfortunately I have some other
urgent work and can't try it immediately, but I'll let you know :)
thank you!
Giulio
Andrew
Giulio Mastrosanti
2008-07-14 17:35:26 UTC
Brilliant !!!

So you replace every occurrence of every accent variation with all the
accent variations...

OK, that's it!

Just a few more doubts ( regexes are still a headache for me... )

preg_replace('/[iìíîïĩīĭįı]/iu',... -- what's the meaning of
iu after the match string?

preg_replace('/[aàáâãäåǻāăą](?!e)/iu',... -- what's (?!e) for?
Every occurrence of aàáâãäåǻāăą NOT followed by e?

Many thanks again for your effort,

I'm definitely on the right track

Giulio
Andrew Ballard
2008-07-14 18:20:54 UTC
On Mon, Jul 14, 2008 at 1:35 PM, Giulio Mastrosanti
Post by Giulio Mastrosanti
Brilliant !!!
So you replace every occurrence of every accent variation with all the accent
variations...
OK, that's it!
Just a few more doubts ( regexes are still a headache for me... )
preg_replace('/[iìíîïĩīĭįı]/iu',... -- what's the meaning of iu after the
match string?
This page explains them both.
http://us.php.net/manual/en/reference.pcre.pattern.modifiers.php
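As a rough illustration rather than a substitute for the manual page: "i" makes the match case-insensitive and "u" tells PCRE to treat both pattern and subject as UTF-8. The variable names below are made up for the example.

```php
<?php
$subject = 'CAFÉ';

// Without "u", PCRE works byte by byte: "É" is two bytes in UTF-8,
// so "." finds five matches in this four-letter word.
$bytes = preg_match_all('/./', $subject);

// With "u", "." matches whole UTF-8 characters.
$chars = preg_match_all('/./u', $subject);

// "i" ignores case; together with "u" that also covers é/É.
$case_sensitive   = preg_match('/café/u', $subject);
$case_insensitive = preg_match('/café/iu', $subject);

var_dump($bytes, $chars, $case_sensitive, $case_insensitive);
```

This is why every pattern in the function above carries both modifiers: the character classes contain multi-byte letters, and the search has to ignore case.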
Post by Giulio Mastrosanti
preg_replace('/[aàáâãäåǻāăą](?!e)/iu',... -- what's (?!e) for? Every
occurrence of aàáâãäåǻāăą NOT followed by e?
Yes. It matches any character based on the Latin 'a' that is not
followed by an 'e'. It keeps the pattern from matching the 'a' when it
immediately precedes an 'e', so that 'ae' can be treated as the single
character 'æ' in words like these:
http://en.wikipedia.org/wiki/List_of_words_that_may_be_spelled_with_a_ligature
(However, that may cause problems with words that have other variants
of 'ae' in them. I'll leave that to you to resolve.)
http://us.php.net/manual/en/regexp.reference.php
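A small sketch of the two assertions at work, using shortened character classes and a made-up input string purely for illustration:

```php
<?php
$input = 'aesthetic cafe';

// Negative lookahead: expand "a" only when the next character is not "e".
$step1 = preg_replace('/a(?!e)/u', '[aàá]', $input);

// Negative lookbehind: expand "e" only when it is not preceded by "a" or "o".
$step2 = preg_replace('/(?<![ao])e/u', '[eèé]', $step1);

echo $step1, "\n"; // aesthetic c[aàá]fe
echo $step2, "\n"; // aesth[eèé]tic c[aàá]f[eèé]
```

The leading "ae" is left untouched by both steps, so a separate alternation such as (ae|æ) can deal with it as a unit.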
Yeti
2008-07-15 09:38:17 UTC
I don't think using all these regular expressions is a very efficient way to
do this. As I previously pointed out, many users have had a similar
problem, as can be seen in the comments at:

http://it.php.net/manual/en/function.strtr.php

One of my favourites is what derernst at gmx dot ch used.

derernst at gmx dot ch
wrote on 20-Sep-2005 07:29
This works for me to remove accents for some characters of Latin-1, Latin-2
and Turkish in a UTF-8 environment, where the htmlentities-based solutions
fail:

<?php
function remove_accents($string, $german=false) {
    // Single letters
    $single_fr = explode(" ",
        "À Á Â Ã Ä Å Ą Ă Ç Ć Č Ď Đ Ð È É Ê Ë Ę Ě Ğ Ì Í Î Ï İ Ł Ľ Ĺ Ñ Ń Ň " .
        "Ò Ó Ô Õ Ö Ø Ő Ŕ Ř Š Ś Ş Ť Ţ Ù Ú Û Ü Ů Ű Ý Ž Ź Ż " .
        "à á â ã ä å ą ă ç ć č ď đ è é ê ë ę ě ğ ì í î ï ı ł ľ ĺ ñ ń ň " .
        "ð ò ó ô õ ö ø ő ŕ ř ś š ş ť ţ ù ú û ü ů ű ý ÿ ž ź ż");

    $single_to = explode(" ",
        "A A A A A A A A C C C D D D E E E E E E G I I I I I L L L N N N " .
        "O O O O O O O R R S S S T T U U U U U U Y Z Z Z " .
        "a a a a a a a a c c c d d e e e e e e g i i i i i l l l n n n " .
        "o o o o o o o o r r s s s t t u u u u u u y y z z z");

    $single = array();
    for ($i=0; $i<count($single_fr); $i++) {
        $single[$single_fr[$i]] = $single_to[$i];
    }

    // Ligatures
    $ligatures = array("Æ"=>"Ae", "æ"=>"ae", "Œ"=>"Oe", "œ"=>"oe", "ß"=>"ss");

    // German umlauts
    $umlauts = array("Ä"=>"Ae", "ä"=>"ae", "Ö"=>"Oe", "ö"=>"oe", "Ü"=>"Ue",
        "ü"=>"ue");

    // Replace
    $replacements = array_merge($single, $ligatures);
    if ($german) $replacements = array_merge($replacements, $umlauts);
    $string = strtr($string, $replacements);

    return $string;
}
?>

I would change this function a bit ...

<?php
// echo rawurlencode("áàéèíìóòúùÁÀÉÈÍÌÓÒÚÙ"); // RFC 1738 codes; NOTE: use
// UTF-8 as this document's encoding
function remove_accents($string) {
    $string = rawurlencode($string);
    $replacements = array(
        '%C3%A1' => 'a',
        '%C3%A0' => 'a',
        '%C3%A9' => 'e',
        '%C3%A8' => 'e',
        '%C3%AD' => 'i',
        '%C3%AC' => 'i',
        '%C3%B3' => 'o',
        '%C3%B2' => 'o',
        '%C3%BA' => 'u',
        '%C3%B9' => 'u',
        '%C3%81' => 'A',
        '%C3%80' => 'A',
        '%C3%89' => 'E',
        '%C3%88' => 'E',
        '%C3%8D' => 'I',
        '%C3%8C' => 'I',
        '%C3%93' => 'O',
        '%C3%92' => 'O',
        '%C3%9A' => 'U',
        '%C3%99' => 'U'
    );
    // Decode again so that spaces and characters not listed above come
    // back unchanged instead of staying percent-encoded.
    return rawurldecode(strtr($string, $replacements));
}
// echo remove_accents("Cäfé"); // I know it's not spelled right
echo remove_accents("áàéèíìóòúùÁÀÉÈÍÌÓÒÚÙ"); // OUTPUT (again: I used UTF-8
// for the document): aaeeiioouuAAEEIIOOUU
?>

Ciao

Yeti
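Since the key list in the original question is a comma-separated list of words, accent stripping can also be used only for the comparison, leaving the displayed key untouched. The following is a minimal sketch under that assumption: fold_accents() and highlight_keys() are hypothetical names, the map is a small illustrative subset, and only whole keys are matched.

```php
<?php
// Hypothetical helper: lower-cases and strips accents for comparison only.
function fold_accents(string $text): string
{
    // Illustrative subset; accented capitals map straight to lower case
    // so that strtolower() only ever has to handle ASCII letters.
    static $map = [
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a',
        'À' => 'a', 'Á' => 'a', 'Â' => 'a', 'Ä' => 'a',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'È' => 'e', 'É' => 'e', 'Ê' => 'e', 'Ë' => 'e',
    ];
    return strtolower(strtr($text, $map));
}

// Hypothetical helper: wraps every key that equals the search word once
// case and accents are ignored; the key itself is printed unchanged.
function highlight_keys(string $keylist, string $search): string
{
    $needle = fold_accents(trim($search));
    $out = [];
    foreach (explode(',', $keylist) as $key) {
        $key = trim($key);
        // Escape for HTML output; é and friends are left as they are.
        $safe = htmlspecialchars($key, ENT_QUOTES, 'UTF-8');
        $out[] = ($needle !== '' && fold_accents($key) === $needle)
            ? '<span class="keysearch">' . $safe . '</span>'
            : $safe;
    }
    return implode(', ', $out);
}

echo highlight_keys('café, CAFÉ, tea, cafeteria', 'cafe'), "\n";
```

Comparing folded copies avoids regular expressions altogether, at the price of matching whole keys rather than words inside longer text.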
Yeti
2008-07-15 09:44:24 UTC
Oh, and i forgot about this one ...

jorge at seisbits dot com
wrote on 11-Jul-2008 09:04
If you want to run strtr on unusual characters in a UTF-8
environment, you can do this:

function normaliza ($string){
$string = utf8_decode($string);
$string = strtr($string, utf8_decode(" ÂÊÎÔÛÀ"), "-AEIOU");
$string = strtolower($string);
return $string;
}
Andrew Ballard
2008-07-15 13:46:18 UTC
Permalink
Post by Yeti
I don't think using all these regular expressions is a very efficient way to
do so. As I previously pointed out, there are many users who had a similar
http://it.php.net/manual/en/function.strtr.php
One of my favourites is what derernst at gmx dot ch used.
derernst at gmx dot ch
wrote on 20-Sep-2005 07:29
This works for me to remove accents for some characters of Latin-1, Latin-2
and Turkish in a UTF-8 environment, where the htmlentities-based solutions
<?php
function remove_accents($string, $german=false) {
// Single letters
$single_fr = explode(" ", "� � � � � � &#260; &#258; � &#262; &#268;
&#270; &#272; � � � � � &#280; &#282; &#286; � � � � &#304; &#321; &#317;
&#313; � &#323; &#327; � � � � � � &#336; &#340; &#344; � &#346; &#350;
&#356; &#354; � � � � &#366; &#368; � � &#377; &#379; � � � � � � &#261;
&#259; � &#263; &#269; &#271; &#273; � � � � &#281; &#283; &#287; � � � �
&#305; &#322; &#318; &#314; � &#324; &#328; � � � � � � � &#337; &#341;
&#345; &#347; � &#351; &#357; &#355; � � � � &#367; &#369; � � � &#378;
&#380;");
$single_to = explode(" ", "A A A A A A A A C C C D D D E E E E E E G I I I
I I L L L N N N O O O O O O O R R S S S T T U U U U U U Y Z Z Z a a a a a a
a a c c c d d e e e e e e g i i i i i l l l n n n o o o o o o o o r r s s s
t t u u u u u u y y z z z");
$single = array();
for ($i=0; $i<count($single_fr); $i++) {
$single[$single_fr[$i]] = $single_to[$i];
}
// Ligatures
$ligatures = array("�"=>"Ae", "�"=>"ae", "�"=>"Oe", "�"=>"oe", "�"=>"ss");
// German umlauts
$umlauts = array("�"=>"Ae", "�"=>"ae", "�"=>"Oe", "�"=>"oe", "�"=>"Ue",
"�"=>"ue");
// Replace
$replacements = array_merge($single, $ligatures);
if ($german) $replacements = array_merge($replacements, $umlauts);
$string = strtr($string, $replacements);
return $string;
}
?>
I would change this function a bit ...
<?php
//echo rawurlencode("áàéèíìóòúùÁÀÉÈÍÌÓÒÚÙ"); // RFC 1738 codes
// NOTE: use UTF-8 as this document's encoding
function remove_accents($string) {
$string = rawurlencode($string);
$replacements = array(
'%C3%A1' => 'a',
'%C3%A0' => 'a',
'%C3%A9' => 'e',
'%C3%A8' => 'e',
'%C3%AD' => 'i',
'%C3%AC' => 'i',
'%C3%B3' => 'o',
'%C3%B2' => 'o',
'%C3%BA' => 'u',
'%C3%B9' => 'u',
'%C3%81' => 'A',
'%C3%80' => 'A',
'%C3%89' => 'E',
'%C3%88' => 'E',
'%C3%8D' => 'I',
'%C3%8C' => 'I',
'%C3%93' => 'O',
'%C3%92' => 'O',
'%C3%9A' => 'U',
'%C3%99' => 'U'
);
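// Decode again so that untouched characters (spaces, punctuation, other
// letters) do not come back percent-encoded
return rawurldecode(strtr($string, $replacements));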
}
//echo remove_accents("CÀfé"); // I know it's not spelled right
echo remove_accents("áàéèíìóòúùÁÀÉÈÍÌÓÒÚÙ");
// OUTPUT (again: I used UTF-8 for the document): aaeeiioouuAAEEIIOOUU
?>
Ciao
Yeti
Post by Andrew Ballard
On Mon, Jul 14, 2008 at 1:35 PM, Giulio Mastrosanti
Post by Giulio Mastrosanti
Brilliant !!!
so you replace every occurrence of every accent variation with all the accent
variations...
OK, that's it!
only some more doubts ( regexes are still a headache for me... )
preg_replace('/[iìíîïĩīĭįı]/iu',... -- what's the meaning of iu after the
match string?
This page explains them both.
http://us.php.net/manual/en/reference.pcre.pattern.modifiers.php
Post by Giulio Mastrosanti
preg_replace('/[aàáâãäåǻāăą](?!e)/iu',... what's (?!e) for? -- every
occurrence of aàáâãäåǻāăą NOT followed by e?
Yes. It matches any character based on the latin 'a' that is not
followed by an 'e'. It keeps the pattern from matching the 'a' when it
immediately precedes an 'e' for the character 'ae' for words like
http://en.wikipedia.org/wiki/List_of_words_that_may_be_spelled_with_a_ligature
(However, that may cause problems with words that have other variants
of 'ae' in them. I'll leave that to you to resolve.)
http://us.php.net/manual/en/regexp.reference.php
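To see the negative lookahead in isolation, here is a minimal sketch. The shortened character class and the sample strings are made up for illustration; they are not taken from the function in this thread.

```php
<?php
// Match a plain or accented 'a' only when it is NOT followed by 'e'.
// 'i' = case-insensitive, 'u' = treat pattern and subject as UTF-8.
$pattern = '/[aàáâãäå](?!e)/iu';

// The 'a' in "Cae" is followed by 'e', so it is skipped;
// every other 'a' is replaced.
echo preg_replace($pattern, '*', 'Caesar salad'), "\n";

// The accented 'à' is followed by a space, so it is replaced.
echo preg_replace($pattern, '*', 'déjà vu'), "\n";
```

The lookahead only checks the next character; it does not consume it, so the 'e' stays available for the separate ae-ligature pattern.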
Post by Giulio Mastrosanti
Many thanks again for your effort,
I'm definitely on the good way
Giulio
Post by Andrew Ballard
I was intrigued by your example, so I played around with it some more
this morning. My own quick web search yielded a lot of results for
highlighting search terms, but none that I found did what you're
after. (I admit I didn't look very deep.) I was up to something like
this before your reply came in. It's still by no means complete. It
even handles simple English plurals (words ending in 's' or 'es'), but
not variations that require changing the word base (like 'daisy' to
'daisies').
<?php
function highlight_search_terms($phrase, $string) {
$non_letter_chars = '/[^\pL]/iu';
$words = preg_split($non_letter_chars, $phrase);
$search_words = array();
foreach ($words as $word) {
if (strlen($word) > 2 && !preg_match($non_letter_chars, $word)) {
$search_words[] = $word;
}
}
$search_words = array_unique($search_words);
foreach ($search_words as $word) {
$search = preg_quote($word);
/* repeat for each possible accented character */
$search = preg_replace('/(ae|æ|ǽ)/iu', '(ae|æ|ǽ)', $search);
$search = preg_replace('/(oe|œ)/iu', '(oe|œ)', $search);
$search = preg_replace('/[aàáâãäåǻāăą](?!e)/iu',
'[aàáâãäåǻāăą]', $search);
$search = preg_replace('/[cçćĉċč]/iu', '[cçćĉċč]', $search);
$search = preg_replace('/[dďđ]/iu', '[dďđ]', $search);
$search = preg_replace('/(?<![ao])[eèéêëēĕėęě]/iu',
'[eèéêëēĕėęě]', $search);
$search = preg_replace('/[gĝğġģ]/iu', '[gĝğġģ]', $search);
$search = preg_replace('/[hĥħ]/iu', '[hĥħ]', $search);
$search = preg_replace('/[iìíîïĩīĭįı]/iu', '[iìíîïĩīĭįı]',
$search);
$search = preg_replace('/[jĵ]/iu', '[jĵ]', $search);
$search = preg_replace('/[kķĸ]/iu', '[kķĸ]', $search);
$search = preg_replace('/[lĺļľŀł]/iu', '[lĺļľŀł]', $search);
$search = preg_replace('/[nñńņňʼnŋ]/iu', '[nñńņňʼnŋ]', $search);
$search = preg_replace('/[oòóôõöōŏőǿơ](?!e)/iu',
'[oòóôõöōŏőǿơ]', $search);
$search = preg_replace('/[rŕŗř]/iu', '[rŕŗř]', $search);
$search = preg_replace('/[sśŝşš]/iu', '[sśŝşš]', $search);
$search = preg_replace('/[tţťŧ]/iu', '[tţťŧ]', $search);
$search = preg_replace('/[uùúûüũūŭůűųǔǖǘǚǜ]/iu',
'[uùúûüũūŭůűųǔǖǘǚǜ]', $search);
$search = preg_replace('/[wŵ]/iu', '[wŵ]', $search);
$search = preg_replace('/[yýÿŷ]/iu', '[yýÿŷ]', $search);
$search = preg_replace('/[zźżž]/iu', '[zźżž]', $search);
$string = preg_replace('/\b' . $search . '(e?s)?\b/iu', '<span
class="keysearch">$0</span>', $string);
}
return $string;
}
?>
I still can't help feeling there must be some better way, though.
Post by Giulio Mastrosanti
well, i think I'm on the good way now, unfortunately I have some other
urgent work and can't try it immediately, but I'll let you know :)
thank you!
Giulio
Andrew
I agree it doesn't seem very efficient to me, but I haven't come up
with anything better. The problem with what you posted is that the OP
was looking to preserve the accented characters, NOT replace them. All
he wants to do is wrap some tags around the search terms so that they
are highlighted. I guess he could use your function to replace all the
accented characters with regular ones in a copy of the original
string, and then scan that copy using strpos() or similar to find the
index of each occurrence that needs to be replaced in the original
string. This seems even less efficient than the regular expressions,
to me.
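For what it's worth, the "search a de-accented copy, then wrap the original" idea could look roughly like the sketch below. It assumes the mbstring extension is available and that every accented letter folds to exactly one plain letter, so that character offsets in the copy line up with the original. The names ACCENT_MAP, fold() and highlight_by_offsets() are hypothetical, and the map is deliberately tiny.

```php
<?php
// Illustrative map only; a real table would cover many more letters.
const ACCENT_MAP = [
    'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a',
    'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
    'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
    'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'ö' => 'o',
    'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
];

// Lower-case and strip accents, one character in, one character out
// (for the letters in the map above).
function fold(string $s): string {
    return strtr(mb_strtolower($s, 'UTF-8'), ACCENT_MAP);
}

function highlight_by_offsets(string $needle, string $haystack): string {
    $foldedHaystack = fold($haystack);
    $foldedNeedle   = fold($needle);
    $len = mb_strlen($foldedNeedle, 'UTF-8');
    if ($len === 0) {
        return $haystack;
    }
    $out = '';
    $pos = 0;
    // Search in the folded copy, but always copy text from the original.
    while (($hit = mb_strpos($foldedHaystack, $foldedNeedle, $pos, 'UTF-8')) !== false) {
        $out .= mb_substr($haystack, $pos, $hit - $pos, 'UTF-8');
        $out .= '<span class="keysearch">'
              . mb_substr($haystack, $hit, $len, 'UTF-8')
              . '</span>';
        $pos = $hit + $len;
    }
    return $out . mb_substr($haystack, $pos, null, 'UTF-8');
}

echo highlight_by_offsets('cafe', 'Un café, due CAFÉ'), "\n";
```

The original spelling ('café', 'CAFÉ') survives inside the span because only the copy is normalized. Ligatures such as 'æ' would break the one-to-one offset assumption and would need separate handling.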
Andrew Ballard
2008-07-15 14:15:32 UTC
Permalink
Well, OK, I can think of one optimization. This takes advantage of the
fact that preg_replace can accept arrays as parameters. In a couple
very quick tests this version is roughly 30% faster than my previous
version:

<?php

function highlight_search_terms2($phrase, $string) {
$non_letter_chars = '/[^\pL]/iu';
$words = preg_split($non_letter_chars, $phrase);

$search_words = array();
foreach ($words as $word) {
if (strlen($word) > 2 && !preg_match($non_letter_chars, $word)) {
$search_words[] = $word;
}
}

$search_words = array_unique($search_words);

$patterns = array(
/* repeat for each possible accented character */
'/(ae|æ|ǽ)/iu' => '(ae|æ|ǽ)',
'/(oe|œ)/iu' => '(oe|œ)',
'/[aàáâãäåǻāăą](?!e)/iu' => '[aàáâãäåǻāăą]',
'/[cçćĉċč]/iu' => '[cçćĉċč]',
'/[dďđ]/iu' => '[dďđ]',
'/(?<![ao])[eèéêëēĕėęě]/iu' => '[eèéêëēĕėęě]',
'/[gĝğġģ]/iu' => '[gĝğġģ]',
'/[hĥħ]/iu' => '[hĥħ]',
'/[iìíîïĩīĭįı]/iu' => '[iìíîïĩīĭįı]',
'/[jĵ]/iu' => '[jĵ]',
'/[kķĸ]/iu' => '[kķĸ]',
'/[lĺļľŀł]/iu' => '[lĺļľŀł]',
'/[nñńņňʼnŋ]/iu' => '[nñńņňʼnŋ]',
'/[oòóôõöōŏőǿơ](?!e)/iu' => '[oòóôõöōŏőǿơ]',
'/[rŕŗř]/iu' => '[rŕŗř]',
'/[sśŝşš]/iu' => '[sśŝşš]',
'/[tţťŧ]/iu' => '[tţťŧ]',
'/[uùúûüũūŭůűųǔǖǘǚǜ]/iu' => '[uùúûüũūŭůűųǔǖǘǚǜ]',
'/[wŵ]/iu' => '[wŵ]',
'/[yýÿŷ]/iu' => '[yýÿŷ]',
'/[zźżž]/iu' => '[zźżž]',
);

foreach ($search_words as $word) {
$search = preg_quote($word);

$search = preg_replace(array_keys($patterns), $patterns, $search);

$string = preg_replace('/\b' . $search . '(e?s)?\b/iu', '<span
class="keysearch">$0</span>', $string);
}

return $string;
}
?>
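As a reduced illustration of the array form of preg_replace(), the sketch below uses only three letter groups (an arbitrary subset chosen for brevity) and a made-up sample sentence. Each pattern is applied in turn to the result of the previous one, which is why the order and the lookarounds matter.

```php
<?php
// Pattern => replacement; a small illustrative subset of the full table.
$patterns = [
    '/[aàáâãäå](?!e)/iu'   => '[aàáâãäå]',
    '/[cç]/iu'             => '[cç]',
    '/(?<![ao])[eèéêë]/iu' => '[eèéêë]',
];

// Expand each letter of the search word into a character class.
$search = preg_replace(
    array_keys($patterns),
    array_values($patterns),
    preg_quote('cafe', '/')
);
echo $search, "\n"; // [cç][aàáâãäå]f[eèéêë]

// Use the expanded pattern to wrap matches while keeping their spelling.
$highlighted = preg_replace(
    '/\b' . $search . '\b/iu',
    '<em>$0</em>',
    'Le Café est ouvert'
);
echo $highlighted, "\n";
```

Because the replacement text contains only variants of the letter that was just matched, a later pattern never rewrites the inside of a class produced by an earlier one.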
tedd
2008-07-15 16:30:03 UTC
Permalink
Post by Andrew Ballard
Well, OK, I can think of one optimization. This takes advantage of the
fact that preg_replace can accept arrays as parameters. In a couple
very quick tests this version is roughly 30% faster than my previous
-snip-

Hey, when you finally get finished with that function, please let me
know. I would like to copy it. :-)

Cheers,

tedd
--
-------
http://sperling.com http://ancientstones.com http://earthstones.com
Andrew Ballard
2008-07-15 17:17:15 UTC
Permalink
Post by tedd
Hey, when you finally get finished with that function, please let me know I
would like to copy it. :-)
Cheers,
tedd
All yours. I figure I'm done with it. (At least until I actually need
to use it for something and then I have to test it for real. :-) )

Andrew
Yeti
2008-07-15 18:07:19 UTC
Permalink
The original problem was:

User X submits a character string A.

A PHP script uses A to search for its occurrences in a DB, ignoring special
characters.

The result of the search is a list of character strings M-LIST with matches.

This list gets output to user X, but before that every matching substring
should be wrapped in '<span style="color: #FF0000">'..'</span>'

If I understood the OP correctly, he is using MySQL to perform the search.

I guess he is doing it with MATCH. So MySQL already found the match and in
PHP it has to be done again ...

e.g.

The table has 2 entries, string1 and string2 ..

string1 = 'Thís ís an éxámplè stríng wíth áccénts.'

string2 = 'This is an example string without accents.'

Now the user searches for "ample":

search = 'ample'

Both strings have matches due to accent-insensitivity (AI). Now the result is
output with highlighting ..

Thís ís an éx<span style="color: #FF0000">ámplè</span> stríng wíth
áccénts.

This is an ex<span style="color: #FF0000">ample</span> string without
accents.

So since MySQL already did the job, why not get the occurrences from it?

I'm not a MySQL expert, but I know Google and found something called string
functions. Especially a "locate" function got my interest.

http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_locate

Now shouldn't it be possible to create a query that searches the db for
matches and additionally uses the string function?

I have no idea, but maybe some MySQL expert out there has ...

Yeti
Andrew Ballard
2008-07-15 19:16:51 UTC
Permalink
Post by Yeti
The original problem was
User X submits a character string A.
A PHP script uses A to search for its occurrences in a DB, ignoring special
characters.
The result of the search is a list of character strings M-LIST with matches.
This list gets output to user X, but before that every matching substring
should be wrapped in '<span style="color: #FF0000">'..'</span>'
If I understood the OP correctly, he is using MySQL to perform the search.
I guess he is doing it with MATCH. So MySQL already found the match and in
PHP it has to be done again ...
e.g.
The table has 2 entries, string1 and string2 ..
string1 = 'Thís ís an éxámplè stríng wíth áccénts.'
string2 = 'This is an example string without accents.'
search = 'ample'
Both strings have matches due to accent-insensitivity (AI). Now the result is
output with highlighting ..
Thís ís an éx<span style="color: #FF0000">ámplè</span> stríng wíth áccénts.
This is an ex<span style="color: #FF0000">ample</span> string without
accents.
Correct.
Post by Yeti
So since MySQL already did the job, why not get the occurrences from it?
I'm not a MySQL expert, but I know Google and found something called string
functions. Especially a "locate" function got my interest.
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_locate
Now shouldn't it be possible to create a query that searches the db for
matches and additionally uses the string function?
I have no idea, but maybe some MySQL expert out there has ...
Yeti
There are definitely possibilities there. Personally, I tend to be
biased against using the database to format output for presentation,
so I'd rather not push the task off there. Still, I know lots of
developers do not share this bias, so I'll address a couple other
issues I see with this approach:

1) If the search word appears multiple times, LOCATE() will only find
it once. I'd probably use REPLACE() instead. This leads to the next
problem:

2) I'm not sure if the OP wants this or not, but if he wants to
highlight each of multiple search terms the way many sites do, he
would have to split the terms and build a SQL expression like this
(there are probably other approaches available in MySQL to do the same
thing):

-- search phrase 'quaint french cafe'
SELECT REPLACE(REPLACE(REPLACE(`my_column`,
  'quaint', '<span class="keysearch">quaint</span>'),
  'french', '<span class="keysearch">french</span>'),
  'cafe', '<span class="keysearch">cafe</span>') FROM ...
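A nested expression like the one above could be generated from PHP along these lines. This is only a sketch: build_highlight_sql() is a hypothetical helper, the column name is a placeholder, and the values are meant to be bound through a prepared statement rather than pasted into the SQL. Note also that the MySQL manual documents REPLACE() as performing a case-sensitive match, so this route would presumably highlight exact spellings only.

```php
<?php
// Hypothetical helper: builds a nested REPLACE() expression with one
// placeholder pair per search term, plus the values to bind to them.
function build_highlight_sql(string $column, array $terms): array {
    // Quote the column name as a MySQL identifier.
    $expr   = '`' . str_replace('`', '``', $column) . '`';
    $params = [];
    foreach ($terms as $term) {
        // Wrap the expression built so far; the first term ends up innermost.
        $expr     = "REPLACE($expr, ?, ?)";
        $params[] = $term;
        $params[] = '<span class="keysearch">' . htmlspecialchars($term) . '</span>';
    }
    return [$expr, $params];
}

list($expr, $params) = build_highlight_sql('my_column', ['quaint', 'french', 'cafe']);
echo "SELECT $expr FROM ...", "\n";
```

The placeholders appear in the SQL text from the innermost call outwards, which is the same order in which the values are appended to $params.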

In this case, he should get all instances of each word highlighted,
but the accented characters would again be replaced with a particular
style. (Not to mention the size and complexity of the query being
passed from PHP to the database or the potential size of the result
being passed from the database to PHP since it now could have lots of
formatting text
