Here's how I use it:
xsltproc --stringparam filter XML atomFilter.xsl All.atom > XML.atom
Of course, if you use a different XSL tool than xsltproc, simply adjust the parameters for passing the 'filter' value in. Bear in mind that the value is expected to be a string, so it needs to be quoted properly; for example:
java org.apache.xalan.xslt.Process -IN All.atom -XSL atomFilter.xsl -OUT XML.atom -PARAM filter "'XML'"
Note that the value is an XPath expression, so the string needs to be in single quotes; your shell will probably require you to wrap that value in double quotes.
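If you drive either tool from a script, the quoting difference can be sketched like this. It's only an illustration in Python: the helper names (xpath_string_literal, xsltproc_args, xalan_args) are made up for this example, and the file names are the ones from the commands above.

```python
import shlex


def xpath_string_literal(value):
    """Wrap a plain string in single quotes so it reads as an XPath string literal."""
    if "'" in value:
        # Keep the sketch simple: embedded single quotes are not handled here.
        raise ValueError("value must not contain a single quote")
    return "'" + value + "'"


def xsltproc_args(stylesheet, source, value):
    # --stringparam already treats the value as a string, so no XPath quotes.
    return ["xsltproc", "--stringparam", "filter", value, stylesheet, source]


def xalan_args(stylesheet, source, output, value):
    # -PARAM is given an expression here, so the string carries its own quotes.
    return [
        "java", "org.apache.xalan.xslt.Process",
        "-IN", source, "-XSL", stylesheet, "-OUT", output,
        "-PARAM", "filter", xpath_string_literal(value),
    ]


if __name__ == "__main__":
    # shlex.join shows how a POSIX shell would need each argument quoted.
    print(shlex.join(xsltproc_args("atomFilter.xsl", "All.atom", "XML")))
    print(shlex.join(xalan_args("atomFilter.xsl", "All.atom", "XML.atom", "XML")))
```

Passing the list straight to subprocess.run avoids the shell layer of quoting altogether, which leaves only the XPath quotes to worry about.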
For example, the XML-ised version of HTML, called XHTML, can be represented using the MIME types text/html, application/xhtml+xml, as well as text/xml and application/xml. At least they've got their own file extension, .xhtml, that can be used to denote files.
Sidenote: MIME stands for Multipurpose Internet Mail Extensions, and was originally used to define what kinds of documents were being attached to mail messages between systems that may not know about extension types. Most webservers have pre-defined mappings between file types and extensions, and the MIME type is recorded with that mapping as well; in /etc/mime.types on Unix systems, and in the 'File Types' settings visible in Windows. MIME types are officially defined in RFC 2046, which defines the initial top-level types as:
- text: textual information
- image: image data
- audio: audio data
- video: video data
- application: any application-specific data that doesn't fall into the above categories
- multipart: an encoding that allows multiple items, potentially of different types, to be concatenated together (this is how mail messages with attachments are sent)
- message: an e-mail message, mostly used with the rfc822 subtype
Each of these top-level types has a number of subtypes, such as text/html, text/xml and text/plain, that are dependent on the top type. The authors note that "It should be noted that the list of media type values given here may be augmented in time, via the mechanisms described above, and that the set of subtypes is expected to grow substantially."
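To make the extension mapping and the type/subtype split concrete, here's a minimal sketch using Python's standard library. The .xhtml mapping is registered by hand so the result doesn't depend on whatever your system's mime.types happens to contain, and split_media_type is just a helper name invented for this example.

```python
import mimetypes
from email.message import Message


def split_media_type(content_type):
    """Return (top-level type, subtype) for a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_maintype(), msg.get_content_subtype()


# A private registry, so the sketch does not alter the process-wide tables.
registry = mimetypes.MimeTypes()
registry.add_type("application/xhtml+xml", ".xhtml")

# Extension -> MIME type, the same idea as a line in /etc/mime.types.
guessed, _encoding = registry.guess_type("index.xhtml")

if __name__ == "__main__":
    print(guessed)
    print(split_media_type(guessed))
    print(split_media_type("text/html; charset=utf-8"))
```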
Coming back to XML, even for XHTML documents, there are still a number of potential MIME types that can be used (mostly to do with backwards compatibility). And this has started a disturbing trend for XML documents to have either xml+ or +xml in their MIME type. As a result, you have a number of different types of XML document, such as image/svg+xml and application/xml+html. Thus, without prior knowledge, there's no easy way to tell whether a document is an XML one (and thus should be in text) or a binary one.
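Here's roughly what a processor has to do today to guess whether it's about to receive XML. This is a sketch, not a complete implementation: is_xml_media_type is a name made up for this example, and it only knows about the two generic XML types and the +xml suffix convention.

```python
def is_xml_media_type(media_type):
    """Best-effort guess at whether a MIME type denotes an XML document."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    essence = media_type.split(";", 1)[0].strip().lower()
    if essence in ("text/xml", "application/xml"):
        return True
    # Otherwise the only hint is a "+xml" suffix on the subtype.
    _top, _slash, subtype = essence.partition("/")
    return subtype.endswith("+xml")


if __name__ == "__main__":
    for candidate in ("image/svg+xml", "application/xhtml+xml",
                      "text/html", "application/xml+html"):
        print(candidate, is_xml_media_type(candidate))
```

Note that XHTML served as text/html slips straight through, and so does anything that puts the xml+ at the front; the check is string matching on a naming convention, nothing more.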
Even though RFC 3023 explicitly disses the possibility, it would make far more sense to define a top-level MIME type to encapsulate XML-encoded documents. This is at least as sensible as breaking down documents into 'text', 'image', 'video', 'audio' and 'everything else', where 'everything else' seems to get used by pretty much everything. If we had an 'xml' major type, we could allow processors to know that they were about to receive XML, and do things like character set negotiation (though UTF-8 would be the default) and structural checking, even if the actual validation of the DTD/schema may not be done. Part of the argument for not adding an 'xml' type, "because it would break existing stuff", isn't tremendously valid; that is a recipe for killing innovation, and the originators of RFC 2046 explicitly intended for future top-level types to be created. Their second point -- that the MIME type describes the document type, not its syntax -- isn't tremendously relevant either; after all, we can have plain ASCII images (which are currently described as text/plain instead of image/ASCII) -- there's even a Star Wars video in ASCII. And without the top-level type, everything is just dumped in the application/ top-level type anyway.
So here's an example where having a top-level type would help; maybe specific cases (like image/svg+xml) would not necessarily fit into the xml/ top-level type, but there's really very little data that couldn't fit there. Indeed, the thousands of other types that are created daily by business would have an ideal fit in the newly created top-level type.
This brings me to my final rant about XML documents. Why do they all end in .xml? I mean, XML is the encoding type, much like the character set; imagine a whole bunch of documents being labelled .ascii. For example, most RSS feeds end in .xml (and even have the audacity to believe that they are the only type of XML to have their own image), despite the fact that they're RSS documents. At least some of the early adopters of Atom end theirs with .atom (though www.blogger.com only generates feeds called atom.xml). And don't get me started on Ant files. If they were all called build.ant instead of build.xml, then it would be trivial to search for a build file (or even have many different build files) which otherwise might not be trivially distinguishable from other .xml files. This is especially true of other files, though at least .xsl has the decency to use its own file type, even if it does mix the XSLT and XSLFO types.
So, I propose a new MIME type and approach to naming file documents. Specifically, I would ban the use of .xml as an extension, and any xml+ or +xml MIME types. Instead, we would have:
Application | File extension | MIME type |
---|---|---|
Ant | .ant | xml/ant |
Atom | .atom | xml/atom |
Docbook | .docbook | xml/docbook |
RSS | .rss | xml/rss |
SVG | .svg | xml/svg |
XSLT | .xslt | xml/xslt |
XSL Formatting Objects | .xslfo | xml/xslfo |
XHTML | .xhtml | xml/html |

Note: no-one uses these file types. I wish they would. But if you've googled and come up with this list, it's not actually standardised (yet).
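As a sketch of what this would buy a processor, here's the proposed mapping as a lookup table. To be loud about it: none of these xml/ types is registered or standardised, they're the proposal from this post, and PROPOSED_XML_TYPES, proposed_type and is_xml are names invented for the example.

```python
import os.path

# Hypothetical mapping taken from the proposal; not registered MIME types.
PROPOSED_XML_TYPES = {
    ".ant": "xml/ant",
    ".atom": "xml/atom",
    ".docbook": "xml/docbook",
    ".rss": "xml/rss",
    ".svg": "xml/svg",
    ".xslt": "xml/xslt",
    ".xslfo": "xml/xslfo",
    ".xhtml": "xml/html",
}


def proposed_type(filename):
    """Look up the proposed type for a file name, or None if there is none."""
    extension = os.path.splitext(filename)[1].lower()
    return PROPOSED_XML_TYPES.get(extension)


def is_xml(media_type):
    """With an xml/ top-level type, detecting XML is one comparison."""
    return media_type.split("/", 1)[0].lower() == "xml"


if __name__ == "__main__":
    print(proposed_type("build.ant"))
    print(proposed_type("build.xml"))  # .xml is banned, so there is no entry
    print(is_xml(proposed_type("feed.atom")))
```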
Backwards compatibility is good for most things. But backwards compatibility needs to be tempered; restricting stuff to only be backwardly compatible for ever stifles innovation.
Where XML succeeds is where there's a defined format for structuring the data in the first place. "XML is a data format. It is NOT a serialization of programming language structures so don't treat it as such." Specifically, design the XML around the data, not the objects. This is why the XMLEncoder is designed to fail; it tightly couples the data structure with the object structure. The result? The data changes, and the object fails. Even accepting that you can abstract list implementations away with the same XML data, there are some structural changes (like refactoring out a set of common features like year-month-day into a single Date object instead of a single string) that aren't going to be compatible.
When you've defined the XML structure up front, and then written your tools to process that data structure, it's much more likely you've got a structure that's efficient for the data that you need. For example, if you're designing data that requires a date/time combo, hopefully you're going to realise that you need to separate out the year/month/day components as attributes, and possibly even their own element. You can then write code to process that data afterwards.
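A small sketch of that ordering, data first and code second. The element and attribute names (event, date, year, month, day) are invented for the example; the point is only that the code reads a structure that was chosen for the data, and builds whatever object it likes from it.

```python
import datetime
import xml.etree.ElementTree as ET

# The data format comes first: the date is its own element with the
# year, month and day as separate attributes.
DOCUMENT = """<event name="release">
  <date year="2004" month="7" day="15"/>
</event>"""


def read_date(event):
    """Build a date object from the <date> child of an <event> element."""
    date = event.find("date")
    return datetime.date(
        int(date.get("year")),
        int(date.get("month")),
        int(date.get("day")),
    )


if __name__ == "__main__":
    event = ET.fromstring(DOCUMENT)
    print(read_date(event).isoformat())
```

If the program later swaps datetime.date for some other class, the XML doesn't change; only read_date does.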
The post also brings up another blindingly obvious but often unappreciated statement: "Use XML tools as much as possible. XPath and XSL-T are enormously powerful tools for working with XML." XPath should really be used for all places where XML data is being used for reading purposes. That way, if you ever need to evolve your XML data structure, it's a case of redefining some XPath expressions. The only downside to using XPath is that it typically operates best on an in-memory model, and that tends to limit its use with large documents.
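For instance, here's a sketch of keeping the path expressions in one place, using the limited XPath subset that Python's xml.etree.ElementTree supports. The feed below is a made-up structure for illustration (it isn't real Atom), and it's parsed fully into memory, which is exactly the limitation mentioned above.

```python
import xml.etree.ElementTree as ET

# An invented feed structure, not real Atom.
FEED = """<feed>
  <entry category="XML"><title>Naming XML files</title></entry>
  <entry category="Java"><title>Ant tips</title></entry>
  <entry category="XML"><title>XPath everywhere</title></entry>
</feed>"""

# If the structure evolves, this expression is the only thing to redefine.
XML_ENTRY_TITLES = "./entry[@category='XML']/title"


def titles(root, path):
    """Return the text of every element matched by the path expression."""
    return [element.text for element in root.findall(path)]


if __name__ == "__main__":
    root = ET.fromstring(FEED)
    print(titles(root, XML_ENTRY_TITLES))
```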
In short, a great summary and a must-read for developers wishing to use XML as a data structure.