#StackBounty: #java #xml #xml-parsing #xml-namespaces #dtd Java XML parser adding unnecessary xmlns and xml:space attributes

Bounty: 50

I’m using Java 11 (AdoptOpenJDK 11.0.5 2019-10-15) on Windows 10. I’m parsing some legacy XHTML 1.1 files, which take the following general form:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/MarkUp/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>XHTML 1.1 Skeleton</title>
</head>
<body>
</body>
</html>

I’m using a simple non-validating parser:

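import java.io.BufferedInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Namespace-aware DOM parser; validation is off by default.
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
final Document document;
try (InputStream inputStream = new BufferedInputStream(getClass().getResourceAsStream("xhtml-1.1-test.xhtml"))) {
  document = documentBuilder.parse(inputStream);
}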

For some reason it’s adding extra attributes such as xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" and xml:space="preserve" all over the place:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" version="-//W3C//DTD XHTML 1.1//EN" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="en">
<head xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <title xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">XHTML 1.1 Skeleton</title>
</head>
<body xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:space="preserve"></body>
</html>

I know that DTDs can provide default attribute values, but I don’t understand why the xmlns:xsi attribute was added, when there appear to be no elements or attributes in that namespace. Furthermore xml:space="preserve" seems to be incorrect altogether; only elements like <pre> should have xml:space="preserve" set, I would think. (Note the version="-//W3C//DTD XHTML 1.1//EN" as well; that’s something I don’t need or want.)
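
To check which attributes were injected by the DTD rather than written in the file, the DOM `Attr.getSpecified()` flag can be inspected. The following is only a minimal sketch: the class name `DefaultedAttributes`, its helper methods, and the tiny inline DTD in `SAMPLE` are hypothetical stand-ins (the inline DTD is used so that nothing has to be fetched from w3.org), not the actual XHTML 1.1 DTD.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.SAXException;

final class DefaultedAttributes {

    // Hypothetical miniature document: the internal DTD subset declares a
    // fixed xml:space value, while "lang" is written in the document itself.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE doc [\n"
        + "  <!ELEMENT doc (#PCDATA)>\n"
        + "  <!ATTLIST doc xml:space (default|preserve) #FIXED 'preserve'\n"
        + "                lang CDATA #IMPLIED>\n"
        + "]>\n"
        + "<doc lang=\"en\">text</doc>";

    // Parses with the same namespace-aware, non-validating setup as in the question.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse sample", e);
        }
    }

    // Returns the names of attributes that were not given in the document text,
    // i.e. those for which Attr.getSpecified() is false.
    static List<String> defaultedAttributeNames(Element element) {
        List<String> names = new ArrayList<>();
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Attr attr = (Attr) attributes.item(i);
            if (!attr.getSpecified()) {
                names.add(attr.getName());
            }
        }
        return names;
    }
}
```

Calling `DefaultedAttributes.defaultedAttributeNames(DefaultedAttributes.parse(DefaultedAttributes.SAMPLE).getDocumentElement())` should report only `xml:space`, since `lang` is present in the document text. Running the same check over the parsed XHTML 1.1 tree would show whether `xmlns:xsi`, `version` and `xml:space` are DTD defaults.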

Am I doing something wrong? Is there a way I can configure the parser not to include this unnecessary cruft?
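
One configuration that may be worth trying is to stop the parser from loading the external DTD at all. The sketch below assumes the JAXP implementation in use is Xerces-based (as the JDK's built-in parser is) and therefore recognizes the Xerces feature `http://apache.org/xml/features/nonvalidating/load-external-dtd`; the class name `NoExternalDtdParser` is hypothetical. The trade-off is that entities declared only in the DTD, such as `&nbsp;`, may then no longer be resolved.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

final class NoExternalDtdParser {

    // Xerces-specific feature id. Assumption: the active JAXP implementation
    // recognizes it; otherwise setFeature throws ParserConfigurationException.
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Parses the stream without reading the external DTD subset, so no
    // attribute defaults declared there can be added to the tree.
    static Document parse(InputStream inputStream) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(false);
            factory.setFeature(LOAD_EXTERNAL_DTD, false);
            return factory.newDocumentBuilder().parse(inputStream);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }
}
```

With the stream from the question, this would be called as `NoExternalDtdParser.parse(inputStream)` in place of `documentBuilder.parse(inputStream)`. The DOCTYPE declaration itself is still kept in the resulting document; only the content of the external DTD is skipped.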

Interestingly, this is not a problem with XHTML 1.0 Strict:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>XHTML 1.0 Skeleton</title>
</head>
<body>
</body>
</html>

When parsed, that yields what one would expect:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>XHTML 1.0 Skeleton</title>
</head>
<body>
</body>
</html>

It is, however, also a problem with the -//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN DTD, so this seems to be specific to the XHTML 1.1 family of DTDs.

Update: I have some potentially helpful news: if I create a new document without a DTD and import the entire document tree into it, all this cruft (which apparently comes from attribute default values declared in the DTD) goes away, because the destination document doesn’t have a DTD at all. (See How to force removal of attributes with implied default values from DTD in Java XML DOM.) But this is very inefficient; it would be nice to turn this behavior off altogether when parsing.
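
The import workaround from the update could look roughly like the following. This is a sketch only: the class name `DtdFreeCopy` and its methods are hypothetical, and it relies on the DOM rule that `Document.importNode` does not copy attributes whose values were merely defaulted. The document type node is skipped because it cannot be imported.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;

final class DtdFreeCopy {

    // Helper for trying the sketch on an in-memory document.
    static Document parse(String xml) {
        try {
            return newFactory().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    // Copies every top-level node except the DOCTYPE into a fresh document.
    // Defaulted (non-specified) attributes are not carried over by importNode.
    static Document copyWithoutDtd(Document source) {
        try {
            Document target = newFactory().newDocumentBuilder().newDocument();
            for (Node child = source.getFirstChild(); child != null; child = child.getNextSibling()) {
                if (child.getNodeType() == Node.DOCUMENT_TYPE_NODE) {
                    continue; // a DocumentType node cannot be imported
                }
                target.appendChild(target.importNode(child, true));
            }
            return target;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create target document", e);
        }
    }

    private static DocumentBuilderFactory newFactory() {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory;
    }
}
```

This still builds the whole tree twice, which is the inefficiency mentioned above, so it only makes sense as a fallback when the parser cannot be configured to skip the defaults.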

