Discussion:
Removing empty paragraphs from HTML file using simple_html_dom.php
Geoffrey van Wyk
2010-09-18 06:21:15 UTC
Hi All,

I want to remove empty paragraphs from an HTML document using simple_html_dom.php. I know how to do it with the DOMDocument class, but because the HTML files I work with are prepared in MS Word, DOMDocument's loadHTMLFile() method reports this error: "Namespaces are not defined".

This is the code I use with the DOMDocument object for HTML files not prepared in MS Word:

<?php
/* Using the DOMDocument class */

/* Create a new DOMDocument object. */
$html = new DOMDocument("1.0", "UTF-8");

/* Load HTML code from an HTML file into the DOMDocument. */
$html->loadHTMLFile("HTML File With Empty Paragraphs.html");

/* Assign all the <p> elements into the $pars DOMNodeList object. */
$pars = $html->getElementsByTagName("p");

echo "The initial number of paragraphs is " . $pars->length . ".<br />";

/* The trim() function is used to remove leading and trailing spaces as well as
 * newline characters. */
for ($i = 0; $i < $pars->length; $i++) {
    if (trim($pars->item($i)->textContent == "")) {
        $pars->item($i)->parentNode->removeChild($pars->item($i));
        $i--;
    }
}

echo "The final number of paragraphs is " . $pars->length . ".<br />";

// Write the HTML code back into an HTML file.
$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
?>

This is the code I use with the simple_html_dom.php module for HTML files prepared in MS Word:

<?php
/* Using simple_html_dom.php */

include("simple_html_dom.php");

$html = file_get_html("HTML File With Empty Paragraphs.html");

$pars = $html->find("p");

for ($i = 0; $i < count($pars); $i++) {
    if (trim($pars[$i]->plaintext == "")) {
        unset($pars[$i]);
        $i--;
    }
}

$html->save("HTML File without Empty Paragraphs.html");
?>

It is almost the same, except that the $pars variable is a DOMNodeList when using DOMDocument and an array when using simple_html_dom.php. But this code does not work. First it runs for two minutes, and then it reports these errors: "Undefined offset: 1" and "Trying to get property of non-object" for this line: "if (trim($pars[$i]->plaintext == "")) {".

Does anyone know how I can fix this?

Thank you.

Geoffrey van Wyk
Simon J Welsh
2010-09-18 08:24:53 UTC
Personally, I'd just use regex to do it. Something like preg_replace('#<p[^>]*?>\s*</p>#m', '', $html) should do it.
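
To make the regex suggestion concrete, here is a minimal sketch that applies the same pattern to a plain string. The `$markup` sample is made up for illustration; with a real file, the string would presumably come from something like file_get_contents(), since preg_replace() works on strings rather than on a parsed document object.

```php
<?php
// Hypothetical sample markup (not from the thread): two real paragraphs,
// one whitespace-only paragraph with an attribute, and one fully empty paragraph.
$markup = "<p>First</p>\n<p class=\"MsoNormal\">  </p>\n<p></p>\n<p>Second</p>";

// Same pattern as suggested above: an opening <p ...> tag, optional
// whitespace, then the closing </p>.
$cleaned = preg_replace('#<p[^>]*?>\s*</p>#m', '', $markup);

// Count the paragraphs that survive.
$remaining = substr_count($cleaned, '<p');

echo $cleaned, "\n";
echo "Paragraphs left: ", $remaining, "\n";
```

One caveat with this approach: the pattern only matches paragraphs whose content is literally empty or whitespace. A paragraph that holds only `&nbsp;` or an empty nested tag would not match and would need a broader pattern.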

Otherwise, you've got trim($pars[$i]->plaintext == "") instead of trim($pars[$i]->plaintext) == "".
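
The parenthesis issue, and a second pitfall that likely explains the "Undefined offset" error, can be shown without any HTML library. In this sketch, plain strings stand in for each paragraph's plaintext; the `$texts` array and the variable names are made up for illustration.

```php
<?php
// Stand-in data (hypothetical): each string plays the role of one
// paragraph's plaintext, including an empty and a whitespace-only entry.
$texts = ["Intro", "", "  \n", "Body"];

// Misplaced parenthesis: the comparison runs first, so trim() receives a
// boolean, and whitespace-only text is not treated as empty.
$misplaced = (bool) trim($texts[2] == "");

// Corrected: trim the text first, then compare the result.
$corrected = trim($texts[2]) == "";

// unset() leaves a gap in the array instead of reindexing it, so a counting
// loop that does $i-- after unset() keeps revisiting a missing offset.
// Building a new list of the kept items avoids that problem.
$kept = [];
foreach ($texts as $text) {
    if (trim($text) !== "") {
        $kept[] = $text;
    }
}

var_dump($misplaced, $corrected, $kept);
```

Note also that unset() on the array returned by find() only drops the entry from that array; it does not change the parsed document, so save() would write the paragraphs back out. As far as I know, the simple_html_dom manual describes removing an element by assigning an empty string to its outertext property, but treat that as an assumption to check against the version of the library in use.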

---
Simon Welsh
Admin of http://simon.geek.nz/

Who said Microsoft never created a bug-free program? The blue screen never, ever crashes!

http://www.thinkgeek.com/brain/gimme.cgi?wid=81d520e5e
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php