Discussion:
Help needed with mb_convert_encoding()
Alain Williams
2014-05-28 09:03:03 UTC
I am trying to use this to validate input that is supposed to be UTF-8 and to
replace any bad characters with something - '?' would do.

I have the test program below. No matter what I give as an argument to
mb_substitute_character(), it always removes the bad input sequence; I would like
to replace it instead.

Thanks in advance

<?php
mb_internal_encoding("UTF-8");

// I have tried many lines like the 2 below
// (comment out one or the other)
mb_substitute_character((int)0x3013);
mb_substitute_character((int)63); // '?' is ascii 63

// \xC0\xBC is invalid UTF-8: an overlong encoding of '<', which should be \x3C
$input = "a bad angle bracket \xC0\xBC here";
$valid = mb_convert_encoding($input, "UTF-8", "UTF-8");

// I always find 2 spaces between 'bracket' and 'here'
echo "valid='$valid'\n";
--
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256 http://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: http://www.phcomp.co.uk/contact.php
#include <std_disclaimer.h>
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
Nicolas Grekas
2014-05-28 09:24:27 UTC
Hi Alain,

I should advertise patchwork/utf8 <https://github.com/nicolas-grekas/Patchwork-UTF8> here :)

Post by Alain Williams
I am trying to use this to validate input that is supposed to be UTF-8 and to
replace any bad characters with something - '?' would do.
I'd personally go another way: ill-formed UTF-8 sequences do not occur
in non-buggy applications. What is much more common is an application that
sends something other than UTF-8. If you don't know what your input charset
encoding is, then HTML5 tells us that CP1252 is a good default fallback.

Thus, I'd recommend using this snippet (preg_match is the quickest way to
check for utf-8 in PHP):

$input = "...";

if (!preg_match('//u', $input)) {
    $input = iconv('CP1252', 'UTF-8', $input);
}

// here, $input is always a UTF-8 string
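
The same idea can be wrapped in a small function. This is only a sketch: the name ensure_utf8() and the exception on failure are assumptions added for illustration, not part of the snippet above. It also checks the return value of iconv(), since the conversion can fail on some platforms for bytes that CP1252 leaves undefined.

```php
<?php
// Hypothetical helper; the name ensure_utf8() is not from the thread.
function ensure_utf8($input, $fallbackCharset = 'CP1252')
{
    // With the /u modifier PCRE rejects subjects that are not valid UTF-8,
    // so preg_match() returns false instead of 1 for such input.
    if (preg_match('//u', $input) === 1) {
        return $input;
    }

    // Assumption: anything that is not UTF-8 is in the fallback charset.
    $converted = @iconv($fallbackCharset, 'UTF-8', $input);
    if ($converted === false) {
        throw new InvalidArgumentException(
            'Input is neither UTF-8 nor ' . $fallbackCharset
        );
    }

    return $converted;
}
```

Note that this reinterprets the bad bytes rather than flagging them: with the CP1252 fallback, "\xC0\xBC" comes out as two ordinary characters, not as a marker.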

Regards,
Nicolas
Flavio Kenji Yanai
2014-05-28 09:56:48 UTC
I haven't tested it ...

$utf8_str = utf8_decode($original_str);

// Caveat: utf8_decode() converts UTF-8 to ISO-8859-1, so the two strings
// only compare equal when the input is plain ASCII.
if ($utf8_str === $original_str) {
    echo "equal, valid utf8";
} else {
    echo "not equal , non valid utf8 input";
}
--
Flávio Kenji Yanai (Toccos)
Alain Williams
2014-05-28 11:23:50 UTC
Post by Flavio Kenji Yanai
I don't test it ...
Sorry, maybe I did not explain myself well enough. I want to be able to provide
feedback to the user to say where something is wrong, so it would be nice to
say, with the example that I gave, something like:

Bad input detected, invalid character(s) replaced by '?':

a bad angle bracket ? here

I suppose that you could take the point of view that bad character encoding is the
result of someone trying to break the PHP script & so you do not need to be
nice. But maybe it is the result of an innocent error somewhere.

With a bit of work I can find the first difference & replace by '?', but as far
as I can see mb_convert_encoding() should make it easy.
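
For reference, the manual route might look like the sketch below. It does not use mbstring at all, so it should not depend on the behaviour of mb_substitute_character(). The function name replace_invalid_utf8() is hypothetical, and the byte ranges follow the usual definition of well-formed UTF-8. Each run of consecutive invalid bytes is collapsed into a single substitute, which matches the output wanted above.

```php
<?php
// Hypothetical helper; replace_invalid_utf8() is not a PHP built-in.
function replace_invalid_utf8($input, $substitute = '?')
{
    // Byte-level pattern (no /u modifier) for one well-formed UTF-8 sequence.
    // Overlong forms, surrogates and values above U+10FFFF are excluded.
    $wellFormed = '[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}';

    $inBadRun = false;

    // Group 1 matches a well-formed sequence; group 2 matches any other byte.
    return preg_replace_callback(
        '/(' . $wellFormed . ')|(.)/s',
        function ($m) use (&$inBadRun, $substitute) {
            if ($m[1] !== '') {
                $inBadRun = false;
                return $m[1];          // keep valid sequences untouched
            }
            if ($inBadRun) {
                return '';             // later bytes of the same bad run
            }
            $inBadRun = true;
            return $substitute;        // first byte of a bad run
        },
        $input
    );
}

echo replace_invalid_utf8("a bad angle bracket \xC0\xBC here"), "\n";
// prints: a bad angle bracket ? here
```

Closures need PHP 5.3 or later. Comparing the result with the original string tells you whether anything was replaced.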
Christoph Becker
2014-05-28 12:12:32 UTC
Post by Alain Williams
I am trying to use this to validate input that is supposed to be UTF-8 and to
replace any bad characters with something - '?' would do.
I have the test program below. No matter what I try to give as an argument to
mb_substitute_character() it always removes the bad input sequence, I would like
to replace it.
Have you considered using htmlspecialchars($input, ENT_SUBSTITUTE,
'UTF-8') instead of mb_substitute_character()?
--
Christoph M. Becker
Alain Williams
2014-05-28 13:46:35 UTC
Post by Christoph Becker
Have you considered using htmlspecialchars($input, ENT_SUBSTITUTE,
'UTF-8') instead of mb_substitute_character()?
OK-ish -- thanks.

* ENT_SUBSTITUTE is only available from PHP 5.4 - my production machine is PHP 5.3.3 (CentOS)

* It also munges & < > -- but I can undo that with htmlspecialchars_decode()

* I need to replace the Unicode Replacement Character ("\xEF\xBF\xBD") with a '?' (easy)

* If I give it an overlong character encoding (I tested "\xC0\xBC") it replaces
each byte with a '?' - so I get two of them.

It would be nice to get mb_substitute_character() working.
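
Put together, the workaround listed above might look like the following sketch. It assumes PHP 5.4 or later for ENT_SUBSTITUTE, and the function name substitute_via_htmlspecialchars() is made up for illustration. The flags are passed explicitly so that escaping and unescaping stay symmetric whatever the defaults of the PHP version are.

```php
<?php
// Hypothetical helper built from the steps listed above; needs PHP >= 5.4.
function substitute_via_htmlspecialchars($input)
{
    // Invalid byte sequences become U+FFFD instead of yielding an empty string.
    $escaped = htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Undo the escaping of & < > and quotes done by the previous step.
    $restored = htmlspecialchars_decode($escaped, ENT_QUOTES);

    // Swap the Unicode Replacement Character (U+FFFD) for a plain '?'.
    return str_replace("\xEF\xBF\xBD", '?', $restored);
}
```

One caveat with this approach: a U+FFFD that was legitimately present in the input is turned into '?' as well.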
Christoph Becker
2014-05-28 14:04:43 UTC
Post by Alain Williams
It would be nice to get mb_substitute_character() working.
You might be out of luck with PHP 5.3, see <http://3v4l.org/I3mEd>.
Alain Williams
2014-05-28 15:04:28 UTC
Post by Christoph Becker
Post by Alain Williams
It would be nice to get mb_substitute_character() working.
You might be out of luck with PHP 5.3, see <http://3v4l.org/I3mEd>.
Brilliant! Many thanks. I have been bashing my head on the table trying to
figure out what I was doing wrong. This has told me that I was trying to do the
right thing -- I can accept that there is a bug in the version of PHP that I
have.

Thanks again - I will have to remember that site.
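
For anyone on a PHP version where the substitution does work, the original approach might be packaged as in the sketch below. It assumes the mbstring extension is loaded; the function name scrub_utf8() is hypothetical. How many substitutes appear per bad sequence may vary between versions, so nothing here relies on it.

```php
<?php
// Hypothetical helper; assumes mbstring and a PHP version in which
// mb_convert_encoding() honours the substitute character.
function scrub_utf8($input, $substitute = '?')
{
    // Remember the current setting so the change stays local to this call.
    $previous = mb_substitute_character();

    // mb_substitute_character() takes a code point, e.g. 63 for '?'.
    mb_substitute_character(ord($substitute));
    $clean = mb_convert_encoding($input, 'UTF-8', 'UTF-8');

    // Restore whatever was configured before.
    mb_substitute_character($previous);

    return $clean;
}
```

Because ord() is used, the substitute in this sketch has to be a single ASCII character.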