XML: The Extensible Markup Language That Won't Retire

Published Mar 19, 2026 | 10 min read | By ChangeThisFile Team

Quick Answer

XML (Extensible Markup Language) is a self-describing, schema-validatable markup language descended from SGML. Despite losing the API format war to JSON, XML remains irreplaceable for document markup (DOCX, SVG), enterprise integration (SOAP, SAML), data validation (XSD), and transformation (XSLT). It's verbose, but that verbosity buys power no other format matches.

XML is the most hated format that everyone still uses. Developers mock its verbosity — <name>John</name> versus JSON's "name": "John" — and then spend their days working with DOCX files (ZIP archives of XML), SVG images (XML), RSS feeds (XML), Android layouts (XML), Maven builds (XML), and SAML authentication (XML).

The reason XML survives is not momentum or legacy compatibility, though both help. XML survives because it does things no other common format can do: schema validation at the parser level, namespace-based composition of multiple vocabularies, mixed content (text interleaved with markup), and declarative transformation via XSLT. These capabilities are irreplaceable in their domains.

This guide covers XML's actual strengths, its real costs, and the specific scenarios where no other format will do.

From SGML to XML: A Brief History

XML was born from SGML (Standard Generalized Markup Language), an ISO standard from 1986 that defined a framework for creating markup languages. SGML was powerful but monstrously complex — the specification ran to 500+ pages, and few implementations fully complied with it. HTML was defined as an SGML application, but browsers never actually parsed it as strict SGML.

In 1996, the W3C set out to create a simplified subset of SGML suitable for the web. The result was XML 1.0, published as a W3C Recommendation in February 1998. The design goals were explicit: XML should be straightforwardly usable over the internet, support a wide variety of applications, be compatible with SGML, make it easy to write programs that process XML documents, keep optional features to an absolute minimum (ideally zero), and produce documents that are human-legible and reasonably clear.

XML achieved most of these goals. The "minimum of optional features" goal resulted in strict syntax rules: every opening tag must have a closing tag (or be self-closing), attribute values must be quoted, elements must be properly nested. This strictness was intentional — it made parsers simpler and documents unambiguous.

Anatomy of an XML Document

The example below shows the main parts of a well-formed XML document:

<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://example.com/catalog">
  <product id="P001" status="active">
    <name>Widget</name>
    <price currency="USD">29.99</price>
    <description><![CDATA[Price < $30 & ships free]]></description>
  </product>
</catalog>

Key structural concepts:

Prolog (<?xml ... ?>): Declares XML version and encoding. Optional but recommended.
Elements: The building blocks. Can contain text, other elements, or both (mixed content). Case-sensitive.
Attributes: Key-value metadata on elements. Values must be quoted. id="P001" and status="active" above.
CDATA sections: <![CDATA[...]]> blocks where the parser treats content as raw text. No need to escape <, >, or &. Essential for embedding code, HTML, or mathematical expressions.
Namespaces: The xmlns attribute identifies which vocabulary an element belongs to, preventing name collisions when combining schemas.
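
To see how these pieces surface in code, here is a minimal sketch that parses the catalog example with Python's standard-library ElementTree. The prefix "c" in the NS mapping is an arbitrary choice for this illustration, not something the document defines.

```python
import xml.etree.ElementTree as ET

CATALOG_XML = """<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://example.com/catalog">
  <product id="P001" status="active">
    <name>Widget</name>
    <price currency="USD">29.99</price>
    <description><![CDATA[Price < $30 & ships free]]></description>
  </product>
</catalog>
"""

# Map an arbitrary prefix to the document's default namespace URI.
NS = {"c": "http://example.com/catalog"}

root = ET.fromstring(CATALOG_XML.encode("utf-8"))
product = root.find("c:product", NS)

root_tag = root.tag                 # namespace URI in braces + local name
attributes = product.attrib         # attributes arrive as a plain dict
name = product.find("c:name", NS).text
currency = product.find("c:price", NS).get("currency")
description = product.find("c:description", NS).text  # CDATA arrives as ordinary text

print(root_tag, attributes, name, currency, description, sep="\n")
```

Note that the parser hands back the CDATA content as a normal string, and that every element lookup has to account for the default namespace.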

Schema Validation: XSD, DTD, and RelaxNG

XML's killer feature is schema validation: the ability to define exactly what a valid document looks like and have a validating parser enforce it before your application code ever runs. JSON has no equivalent built into its parsers; validation there (for example with JSON Schema) is a separate, application-level step.

XSD (XML Schema Definition)

XSD is the most powerful and most widely used XML schema language. An XSD schema defines:

Which elements and attributes are allowed
The data type of each element/attribute (string, integer, decimal, date, boolean, and ~40 more built-in types)
Cardinality constraints (minOccurs, maxOccurs)
Element ordering (sequence, choice, all)
Pattern restrictions (regex validation on string content)
Enumeration constraints (value must be one of a defined set)

XSD schemas are themselves XML documents, which means they can be parsed, generated, and transformed with the same XML tools. This is powerful but also means XSD schemas are verbose — a schema definition is often longer than the documents it validates.

XSD is used by SOAP web services (WSDL files contain XSD schemas), XBRL (financial reporting), HL7 FHIR (healthcare), and most government data exchange standards.

DTD (Document Type Definition)

DTD is the original SGML schema language, inherited by XML. It's simpler than XSD but less powerful — DTDs can define elements and attributes but have limited type support (everything is essentially a string) and can't express complex constraints. DTDs use their own non-XML syntax, making them harder to process with standard XML tools.

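DTDs survive mainly in two places. The <!DOCTYPE html> declaration in HTML5 is a vestige of the DTD reference that older HTML versions required (it names no DTD and exists only to trigger standards mode in browsers), and DTDs are still the mechanism for defining entities (character shortcuts like &copy; for ©). For new schemas, XSD or RelaxNG are preferred.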

RelaxNG

RelaxNG is a schema language designed to be simpler than XSD while remaining expressive. It comes in two syntaxes: an XML syntax and a compact non-XML syntax that's more human-readable. RelaxNG is technically more expressive than XSD in some areas (like supporting unordered content models) and is easier to learn.

RelaxNG is used by OpenDocument Format (ODF), DocBook, and several OASIS standards. It's less widely supported by tools than XSD but is often praised by developers who have used both.

Namespaces: Solving the Name Collision Problem

Namespaces are XML's answer to a practical problem: what happens when two different schemas both define an element called <title>? A book's title and an HTML page's title are different things, but they share a name.

XML namespaces use URIs (usually URLs, though they don't need to resolve) to uniquely identify vocabularies:

<invoice xmlns:cust="http://example.com/customer"
         xmlns:ship="http://shipper.com/schema">
  <cust:name>Acme Corp</cust:name>
  <ship:name>FedEx Ground</ship:name>
</invoice>

Both <name> elements coexist without ambiguity because they're in different namespaces. The prefix (cust:, ship:) is just a shorthand — the actual namespace is the URI.

Namespaces are powerful but confusing. Default namespaces (xmlns="..." without a prefix) apply to the element and all its descendants, which can cause unexpected behavior when nesting elements from different schemas. Namespace-unaware tools may break when encountering prefixed elements. This complexity is the single biggest complaint about XML from developers working with it for the first time.
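
Two of these points can be checked with a short sketch using Python's standard-library ElementTree: the prefix is only shorthand for the URI, and a default namespace silently applies to descendants. The second prefix "c" and the URI http://example.com/invoice are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Same namespace URI, different prefixes: the parser sees identical names.
doc_a = '<cust:name xmlns:cust="http://example.com/customer">Acme Corp</cust:name>'
doc_b = '<c:name xmlns:c="http://example.com/customer">Acme Corp</c:name>'
tag_a = ET.fromstring(doc_a).tag
tag_b = ET.fromstring(doc_b).tag

# A default namespace on the root also applies to the unprefixed child.
invoice = ET.fromstring(
    '<invoice xmlns="http://example.com/invoice"><total>10</total></invoice>'
)
child_tag = invoice[0].tag

# A namespace-unaware lookup finds nothing; the qualified name is required.
unqualified = invoice.find("total")
qualified = invoice.find("{http://example.com/invoice}total")

print(tag_a == tag_b, child_tag, unqualified, qualified.text)
```

The failed unqualified lookup is the typical first encounter with namespace confusion: the element is visibly there in the text, but its real name includes the URI.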

No other common data format has namespaces. JSON, YAML, TOML, and CSV all assume a single vocabulary per document. This is fine for most applications but breaks down when you need to combine data from multiple independent schemas in one document — which is exactly what enterprise integration requires.

XSLT and XPath: Transformation and Query

XPath is a query language for selecting nodes in an XML document. It uses path expressions similar to file system paths: /catalog/product/name selects all <name> elements that are children of <product> elements that are children of the root <catalog>. XPath supports predicates (//product[@status='active']), functions (count(), string-length(), sum()), and axes (parent, ancestor, following-sibling, descendant). XPath is used by XSLT, XSD, and most XML processing libraries.
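
As a rough sketch, the path and predicate expressions above can be tried with Python's standard-library ElementTree, which implements only a subset of XPath (paths and simple predicates, but no functions such as count()). The sample catalog here is invented for the example and deliberately has no namespace, so the paths match without qualification.

```python
import xml.etree.ElementTree as ET

catalog = ET.fromstring("""
<catalog>
  <product id="P001" status="active"><name>Widget</name></product>
  <product id="P002" status="retired"><name>Gadget</name></product>
  <product id="P003" status="active"><name>Gizmo</name></product>
</catalog>
""")

# Paths are relative to the element they are called on, so the equivalent
# of /catalog/product/name from the root element is ./product/name.
all_names = [n.text for n in catalog.findall("./product/name")]

# Predicate on an attribute value, searching at any depth.
active_ids = [p.get("id") for p in catalog.findall(".//product[@status='active']")]

# ElementTree has no XPath count(); len() on the result list stands in for it.
active_count = len(active_ids)

print(all_names, active_ids, active_count)
```

Full XPath support, including functions and all axes, generally requires a third-party library or an XSLT processor.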

XSLT (Extensible Stylesheet Language Transformations) is a declarative language for transforming XML documents. An XSLT stylesheet defines template rules that match XPath patterns and produce output. XSLT can transform XML into different XML, HTML, plain text, or any text format. It's Turing-complete, meaning it can compute anything computable — though using XSLT for general computation is widely regarded as a war crime against readability.

Practical XSLT uses: transforming XML data feeds into HTML pages, converting between XML schemas (mapping one industry standard to another), generating reports from XML data, and converting XML to CSV or JSON. XSLT 3.0 (2017) added JSON support, streaming for large documents, and higher-order functions.

Where XML Remains Irreplaceable

XML lost the API format war to JSON around 2012-2015. New REST APIs return JSON. New databases store JSON. New config files use YAML or TOML. But XML dominates specific domains where its unique features are required:

Domain	Format	Why XML
Office documents	DOCX, XLSX, PPTX (OOXML)	Mixed content (text + formatting). Document structure requires markup.
Vector graphics	SVG	Hierarchical element structure with attributes for styling/geometry.
Web feeds	RSS 2.0, Atom	Established standard. Self-describing with namespaces for extensions.
Enterprise APIs	SOAP + WSDL	Schema validation, namespace composition, formal contracts.
Authentication	SAML	XML Signature for cryptographic signing of assertions.
Financial reporting	XBRL	Extensible taxonomies via namespaces. Regulatory requirement.
Healthcare	HL7 FHIR, CDA	Complex data models with strict validation requirements.
Build systems	Maven (pom.xml), Ant, MSBuild	Established ecosystem, schema validation for configuration.
Android	Layouts, manifests, resources	Hierarchical UI structure with attribute-based properties.
Configuration	Java/.NET app configs, Spring	Schema-validated configuration with IDE support.

The Verbosity Cost — and When It Doesn't Matter

XML is often cited as roughly 2-3x larger than equivalent JSON. Even a tiny record shows the overhead: the JSON object {"name": "John", "age": 30} is 27 bytes, while the XML equivalent <person><name>John</name><age>30</age></person> is 47 bytes, about 1.7 times as large, and the gap widens with longer tag names and deeper nesting. At scale, this means more bandwidth, more storage, and slower parsing.

But verbosity is often irrelevant:

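Compressed transfer: XML and JSON often compress to similar sizes with gzip because XML's repetitive tag names compress extremely well. As a rough illustration, a 100KB XML file and a 50KB JSON file holding the same data can end up much closer in size once compressed than their raw sizes suggest. The exact figures depend on the data, so measure a representative sample.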
Document formats: DOCX files are already ZIP-compressed. The internal XML is never transferred raw.
Enterprise integration: When processing a $10M financial transaction, nobody cares that the XBRL message is 3KB instead of 1.5KB.
Developer time: If XML's schema validation catches a malformed message before it enters your system, the bytes saved by using JSON are meaningless compared to the debugging hours saved.
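
The compression point is easy to measure rather than take on faith. The following sketch builds the same synthetic records as JSON and as XML and compares raw and gzipped sizes. The record shape and the count of 1000 are arbitrary choices for illustration, and real data will give different ratios.

```python
import gzip
import json

# Synthetic data: 1000 small records with a name and an age.
records = [{"name": f"user{i}", "age": 20 + i % 50} for i in range(1000)]

json_bytes = json.dumps(records).encode("utf-8")
xml_bytes = (
    "<people>"
    + "".join(
        f"<person><name>{r['name']}</name><age>{r['age']}</age></person>"
        for r in records
    )
    + "</people>"
).encode("utf-8")

# How many times larger XML is than JSON, before and after gzip.
raw_ratio = len(xml_bytes) / len(json_bytes)
gzip_ratio = len(gzip.compress(xml_bytes)) / len(gzip.compress(json_bytes))

print(f"raw:  XML is {raw_ratio:.2f}x the size of JSON")
print(f"gzip: XML is {gzip_ratio:.2f}x the size of JSON")
```

Running the same comparison on a sample of your own payloads is a more reliable guide than any general multiplier.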

Converting XML to Other Formats

XML conversions are uniquely lossy because XML has features without equivalents in most target formats:

Conversion	What's Lost
XML to JSON	Attributes vs. elements distinction, namespaces, processing instructions, comments, CDATA markers, mixed content
XML to CSV	All hierarchy, attributes, types. Only works for flat repeated elements.
XML to YAML	Same as JSON (YAML is a JSON superset). Attributes become special keys.
XML to TOML	Deep nesting, mixed content, attributes. Only works for simple config-like XML.

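The reverse direction (JSON to XML, CSV to XML) is generally safe for the data itself because XML can represent everything these simpler formats contain, just more verbosely. The generated XML won't have attributes (JSON has no attribute concept) or namespaces, and the converter needs a convention for JSON details that XML text doesn't carry natively, such as the difference between the number 30 and the string "30" or between a one-item array and a single value.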
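
To make the XML to JSON losses concrete, here is a deliberately naive converter sketch. The function element_to_dict and its "@name" / "#text" key conventions are hypothetical choices for this example; real converters use a variety of conventions and handle more cases.

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Naive mapping: attributes become '@name' keys, text becomes '#text'."""
    node = {f"@{key}": value for key, value in elem.attrib.items()}
    for child in elem:
        node.setdefault(child.tag, []).append(element_to_dict(child))
    text = (elem.text or "").strip()
    if text:
        node["#text"] = text
    return node

# Attributes and child elements both end up as plain JSON keys.
product = ET.fromstring('<product id="P001"><name>Widget</name></product>')
product_json = json.dumps(element_to_dict(product))

# Mixed content: the text after </b> (the "tail") is silently dropped,
# and the order of text and markup is no longer recoverable.
paragraph = ET.fromstring("<p>Hello <b>big</b> world</p>")
paragraph_json = json.dumps(element_to_dict(paragraph))

print(product_json)
print(paragraph_json)
```

A more careful converter could keep the trailing text, but it would still need an invented convention to do so, which is why two converters rarely produce the same JSON from the same XML.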

XML isn't going anywhere. The domains where it dominates (document formats, enterprise integration, regulated industries, vector graphics) chose XML for features that no alternative provides in one package. Parser-level schema validation, namespaces, mixed content, and XSLT have no built-in equivalents in JSON, YAML, or any other lightweight format. Until they do, XML will remain the backbone of these ecosystems.

The practical takeaway: don't use XML for new APIs (JSON won that battle), new config files (YAML or TOML are simpler), or simple data exchange (CSV or JSON). But when you encounter XML in the wild — and you will — understand that it's there for a reason. The verbosity you're paying for buys validation, composition, and transformation capabilities that save engineering time downstream.

Key Takeaways

XML is a simplified subset of SGML, standardized by the W3C in 1998. Its strictness (mandatory closing tags, quoted attributes, proper nesting) was a deliberate design choice.
Schema validation (XSD/DTD/RelaxNG) is XML's killer feature. No other common format provides parser-level structure validation.
Namespaces solve name collisions when combining multiple vocabularies in one document. No other common data format has this.
XSLT enables declarative document transformation — converting between XML schemas, generating HTML, or producing reports without application code.
XML is 2-3x larger than JSON for the same data, but compresses to similar sizes and the verbosity rarely matters in practice.
XML is irreplaceable for document markup (DOCX/SVG), enterprise integration (SOAP/SAML), and regulated industries (XBRL, HL7).
Converting XML to JSON is inherently lossy. The attribute/element distinction, namespaces, and mixed content have no JSON equivalents.

Frequently Asked Questions

Is XML still relevant in 2026?

Absolutely. XML is less visible than it was in 2005 because JSON replaced it for APIs and web development. But XML remains dominant in document formats (every DOCX, XLSX, and PPTX file is a ZIP of XML), vector graphics (SVG), enterprise integration (SOAP, SAML), financial reporting (XBRL), healthcare (HL7 FHIR), and Android development. These domains chose XML for features JSON doesn't have.

What's the difference between XML attributes and elements?

An attribute is metadata on an element: <product id="P001">. An element is a structural component: <name>Widget</name>. The general rule: use elements for data and attributes for metadata about that data. But the distinction is often subjective, and different XML schemas make different choices. When converting XML to JSON, this distinction is lost because JSON has no concept of attributes — converters must map both to JSON keys using conventions like @id for attributes.

What is CDATA and when should I use it?

CDATA (Character Data) sections tell the XML parser to treat enclosed content as raw text, not markup. Inside <![CDATA[...]]>, characters like <, >, and & don't need escaping. Use CDATA when embedding HTML, JavaScript, SQL, or any content that contains characters XML would normally interpret as markup. Without CDATA, you'd need to escape every < as &lt; and every & as &amp;.
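
A quick sketch with Python's standard-library ElementTree shows that CDATA is purely an authoring convenience: the parser returns the same text whether the content was wrapped in CDATA or escaped by hand, and the CDATA marker itself does not survive re-serialization with this library.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<description><![CDATA[Price < $30 & ships free]]></description>"
)
with_escapes = ET.fromstring(
    "<description>Price &lt; $30 &amp; ships free</description>"
)

# Both forms parse to the identical string.
same_text = with_cdata.text == with_escapes.text

# Writing the element back out uses entity escapes, not a CDATA section.
serialized = ET.tostring(with_cdata, encoding="unicode")

print(same_text)
print(serialized)
```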

Why are XML namespaces so confusing?

Namespaces solve a real problem (name collisions between schemas) but introduce significant complexity. The confusion comes from: default namespaces that silently apply to descendants, the distinction between namespace URIs and prefixes, namespace-unaware tools that break on prefixed elements, and the fact that the same namespace can have different prefixes in different documents. Most developers only need namespaces when working with SOAP, RSS extensions, or combining schemas.

Should I use XSD or JSON Schema for validation?

Use whichever matches your data format. If your data is XML, use XSD: it's mature, widely supported, and deeply integrated with XML tools. If your data is JSON, use JSON Schema. XSD is more powerful (40+ built-in types, ordering constraints, and, in XSD 1.1, cross-field assertions) but more complex. JSON Schema is simpler and growing in adoption through OpenAPI. Don't convert your data format just for schema validation.

How does XML compare to JSON in file size and parse speed?

XML is typically 2-3x larger than JSON for equivalent data due to closing tags and verbose syntax, and JSON generally parses faster (2-5x is a commonly quoted range, though results vary by parser and data). However, both compress to similar sizes with gzip (XML's repetitive tags compress very efficiently), and for files under 10MB the speed difference is usually negligible. XML's SAX and StAX streaming parsers can process arbitrarily large files with roughly constant memory. Most default JSON parsers load the whole document instead, although streaming JSON parsers do exist.
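
SAX and StAX are Java-world names, but the same streaming idea can be sketched in Python with the standard library's iterparse. The function name count_active_products and the tiny in-memory sample are illustrative assumptions; in practice the source would be a large file on disk.

```python
import io
import xml.etree.ElementTree as ET

def count_active_products(source):
    """Count <product status="active"> elements without keeping their content."""
    active = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "product":
            if elem.get("status") == "active":
                active += 1
            # Drop the element's text, attributes, and children once handled.
            # The emptied element stays attached to the root, so memory use is
            # small per record rather than strictly constant.
            elem.clear()
    return active

sample = io.BytesIO(
    b"<catalog>"
    + b'<product status="active"><name>A</name></product>' * 3
    + b'<product status="retired"><name>B</name></product>' * 2
    + b"</catalog>"
)
active_total = count_active_products(sample)
print(active_total)
```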

Can I convert XML to JSON without losing information?

Not completely. XML has features with no JSON equivalent: the attribute/element distinction, namespaces, processing instructions, comments, CDATA sections, and mixed content (text interleaved with child elements). Different converters handle these differently, and round-tripping (XML to JSON to XML) will produce different XML. For simple, attribute-free XML the conversion is nearly lossless, but complex XML with namespaces and mixed content will lose structural information.

What's the difference between well-formed and valid XML?

Well-formed XML follows basic syntax rules: proper nesting, quoted attributes, matching tags. Any XML parser can check well-formedness. Valid XML is well-formed AND conforms to a schema (XSD, DTD, or RelaxNG). Validation requires a schema definition and a validating parser. All valid XML is well-formed, but not all well-formed XML is valid. Most XML in the wild is well-formed but not validated against any schema.
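
The well-formedness half of this distinction can be checked with any parser. The sketch below uses Python's standard-library ElementTree, which checks well-formedness only and does not validate against a schema; the helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML, False on a syntax error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

properly_nested = is_well_formed("<a><b></b></a>")
badly_nested = is_well_formed("<a><b></a></b>")      # tags overlap
unquoted_attribute = is_well_formed("<a id=1/>")     # attribute value not quoted

# Well-formed, but a catalog schema would likely reject the unknown element.
# Only a validating parser with that schema loaded could tell.
unexpected_content = is_well_formed("<catalog><unexpected/></catalog>")

print(properly_nested, badly_nested, unquoted_attribute, unexpected_content)
```

Checking validity requires a schema-aware library on top of this, since the standard library has no XSD, DTD, or RelaxNG validator.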
