Thoughts on XML and the Web

People often comment on the hype around XML and say it's not all it's cracked up to be: too simple in structure, too complex to manipulate and too expensive to pass over networks. Others see it as some kind of panacea.

Personally, I think XML is a huge advance, an idea that acts as a mental laxative allowing other ideas to flow. I've had many disconnected thoughts on XML and the web that I keep on meaning to bring together into some grand thesis. This will probably never happen, so I've decided to collect together a lot of these thoughts in a single article, without feeling the need to produce something which is completely polished and coherent. I hope it makes some sense.

The Web Trinity

The Web Trinity is composed of three key standards: URLs, HTTP and HTML. There are many others which these are built on or can use, e.g. FTP, SSL, TCP/IP and XML, but for simplicity's sake we'll concentrate on these three.

URLs identify each resource on the network uniquely. More interestingly, though, URLs can also represent queries. URLs, therefore, can represent both static and dynamic resources.

HTTP allows us to access those resources and even upload new ones. These can be in any format and HTTP uses the MIME standard to identify the type of the resource.
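To make this concrete, here's a small Python sketch. The address and query string are invented; it simply fetches a query-style URL and looks at the MIME type that HTTP reports for the result.

    # A URL representing a query, fetched over HTTP. The host and query are invented.
    from urllib.request import urlopen

    url = "http://example.com/directory?surname=Fairman"   # a query expressed as a URL
    with urlopen(url) as response:
        mime_type = response.headers.get("Content-Type")   # HTTP names the MIME type of the result
        body = response.read()

    print(mime_type)   # e.g. "text/html" or "text/xml"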

Together, URLs and HTTP give us the "World Wide" aspect of the World Wide Web. It is HTML that turns it into a Web.

When a user downloads an HTML page they are downloading a resource that contains links to other resources. HTML is the language designed to let humans navigate internet-based resources: HTML pages are displayed to a user, and the user decides which link in the page to follow next.

It's worth remembering, as XML eclipses HTML and other technologies gain ground, that almost everything that makes the web so powerful was there at the beginning.

Beyond the Last Mile

If the Internet is about anything I'd say it was about connecting anything to anything - all are welcome. I can send emails between Macs, PCs, Unix and Linux workstations, almost any kind of handheld computer and some mobile phones. Web browsers run on almost as many devices - and often web servers do as well. The Internet is about interconnection. But one shortcoming of all this is that Internet technologies are mainly used to connect you along what I call "The Last Mile". By this I mean that if you have a group of interconnected computers and a set of users of those computers, the emphasis of things like the web and email is on communication from user to remote computer, not from computer to remote computer. HTML is a last mile technology - it's designed to be displayed.

Imagine though that I'm searching a directory through a web page. I type in the name of the person I'm looking for, click on a button and a page is returned with that person's details. HTML is brilliant for the communication between the user's browser and the web server, but what about the communication between the web server and the database that contains the real information? How is that data represented? Normally we would use some middleware like JDBC or ODBC, or even CORBA or DCOM, but these are not what I'd call internet technologies. They may run over internet-style networks but they don't embrace the Zen of the Internet. This is why we need XML. XML allows us to use technologies such as HTTP for almost all our connectivity, not just the last mile.
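This is easiest to see with a sketch. In the Python below, the lookup URL and the <people>/<person> reply format are my own invention, not any real service; the point is just that the web server asks the directory for XML over plain HTTP and picks the details out of the reply.

    # Hypothetical sketch: the web server queries a directory service that speaks
    # XML over HTTP instead of going through JDBC/ODBC, CORBA or DCOM.
    # The URL and the reply format <people><person><name/>...</person></people>
    # are assumptions made for the example.
    import xml.etree.ElementTree as ET
    from urllib.parse import quote
    from urllib.request import urlopen

    def lookup_person(name):
        url = "http://directory.example.com/people?name=" + quote(name)
        with urlopen(url) as response:
            doc = ET.parse(response)      # the reply is just an XML document
        person = doc.find("person")       # first <person> in the assumed reply
        return {
            "name":  person.findtext("name"),
            "email": person.findtext("email"),
            "phone": person.findtext("phone"),
        }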

The beauty of this is that we can build n-tier systems without a big mix of technologies. From a human perspective it makes these systems easier to understand for their designers, implementers and maintainers. It also means these systems are easier to build as Internet technologies are simpler and more flexible than conventional middleware.

Worse is Better

Richard Gabriel, a writer on software design, has a phrase: "Worse is Better". What he means is that when you're designing a system for the first time you should design it with the minimal functionality you can get away with and the simplest design. You shouldn't worry about doing things "properly" or satisfying a long list of requirements. The idea of this philosophy is that you can get things out the door quickly, see if people like them and, if they do, get feedback on how they can be improved. Your "requirements" come from people actually using your product, not from people guessing what they might require in the future. This doesn't mean your code can be full of bugs, just that it isn't full of useless features. This philosophy has now been embraced by the extreme programming movement.

The web is a brilliant example of the success of "Worse is Better". You can learn basic HTML in a couple of hours. Anyone who can write a program that uses the ubiquitous TCP/IP protocol, or who knows how to use the Telnet application, can understand HTTP and use it. Because of this, HTML and HTTP have been incorporated into hundreds of programs, many of them free, and web sites have sprung up everywhere. Now there are lots of people who will tell you how warped HTML has become over the years and how inefficient HTTP is, but the bottom line is that "better" systems for doing distributed hypertext have all fallen by the wayside.

I'll give you an example. For the last thirty years Ted Nelson has been shepherding a distributed hypertext system called Xanadu. Only recently (1999) has any software actually been released. It doesn't really matter how good Xanadu is and how much better it is than the web. Xanadu will become a footnote in the history of the web and at best an object of academic study. Bits of its design may find their way into the Web infrastructure but it will never replace the Web.

XML Engines

Part of the success of Unix comes from the large number of tools available for processing plain text files. It is often a reasonably easy task for a seasoned Unix hacker to process and extract information from a large number of files just by writing a short shell script. You could say that it's these tools that have contributed to the popularity of Unix and Linux amongst programmers and system administrators.

In the same way, it will be the availability of tools and libraries for manipulating XML that contributes to its success. In a sense XML is the new plain text, and in the same way that simple command-line tools in Unix allow complex processing to be performed, complex manipulation of structured data in XML is possible using a similar set of simple tools.

There are currently two popular APIs for manipulating XML: DOM and SAX. DOM, the Document Object Model, is geared towards in-memory document processing. SAX, the Simple API for XML, is an event-driven model that treats the document as a stream of data and is better suited to processing larger documents. We should be aware that these APIs are quite low-level. Using them requires the programmer to do a lot of work, and some might think this could be an Achilles' heel for XML, but this isn't necessarily so. It's highly likely that most XML libraries, as they mature, will have support for standards such as XPath and XML query languages. XPath would allow the programmer to pick out any part of the document relatively easily. You could say that XPath is to XML as regular expressions are to plain text files. XML query languages, which will initially be applied to XML databases, can just as easily be applied to the smaller data sets that ordinary applications often manipulate.
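To make the contrast concrete, here's a rough Python sketch of the same job done three ways - DOM, SAX and an XPath-style query. The <people> document is invented for the example, and Python's ElementTree only supports a subset of XPath.

    # Three views of the same (invented) document: DOM, SAX and an XPath-style query.
    import xml.dom.minidom
    import xml.sax
    import xml.etree.ElementTree as ET

    XML = "<people><person><name>Ian</name></person><person><name>Ted</name></person></people>"

    # DOM: the whole document becomes an in-memory tree.
    dom = xml.dom.minidom.parseString(XML)
    names = [n.firstChild.data for n in dom.getElementsByTagName("name")]

    # SAX: the document arrives as a stream of events - better for very large files.
    class NameCounter(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self.count = 0
        def startElement(self, tag, attrs):
            if tag == "name":
                self.count += 1

    counter = NameCounter()
    xml.sax.parseString(XML.encode("utf-8"), counter)

    # XPath-style: pick out the interesting parts declaratively.
    tree = ET.fromstring(XML)
    names_again = [e.text for e in tree.findall(".//person/name")]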

Also, there is no reason why languages such as XSLT, which allow us to perform XML-to-XML transformations, should only be used by end-user tools such as web browsers. Application programmers would also gain tremendous benefit from a library that allowed them to use XSLT.
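Here's a sketch of that idea in application code, assuming the third-party lxml library (Python's standard library doesn't ship an XSLT processor); the document and stylesheet are invented for the example.

    # Sketch of using XSLT from application code rather than in a browser.
    # Assumes the third-party lxml library; document and stylesheet are invented.
    from lxml import etree

    doc = etree.XML("<people><person><name>Ian</name></person></people>")

    stylesheet = etree.XML("""
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/people">
        <names><xsl:apply-templates select="person/name"/></names>
      </xsl:template>
      <xsl:template match="name">
        <n><xsl:value-of select="."/></n>
      </xsl:template>
    </xsl:stylesheet>
    """)

    transform = etree.XSLT(stylesheet)
    result = transform(doc)               # an XML-to-XML transformation, in-process
    print(str(result))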

To push this point even further, we should also note that XML linking languages will allow us to define XML documents as being constructed from parts of other documents, a process called transclusion.
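One concrete form this is taking is XInclude. Below is a rough sketch using the standard library's ElementInclude module; the included document is held in memory and its name is invented for the example.

    # Sketch of transclusion with XInclude, using xml.etree.ElementInclude.
    # The documents are invented and held in memory rather than on disk.
    import xml.etree.ElementTree as ET
    from xml.etree import ElementInclude

    DOCUMENTS = {
        "chapter1.xml": "<chapter><title>The Web Trinity</title></chapter>",
    }

    def loader(href, parse, encoding=None):
        # Resolve includes from the in-memory table above instead of the filesystem.
        if parse == "xml":
            return ET.fromstring(DOCUMENTS[href])
        return DOCUMENTS[href]

    book = ET.fromstring(
        '<book xmlns:xi="http://www.w3.org/2001/XInclude">'
        '<xi:include href="chapter1.xml"/>'
        '</book>'
    )
    ElementInclude.include(book, loader=loader)   # <chapter> is now part of <book>
    print(ET.tostring(book, encoding="unicode"))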

There is a fundamental principle here: XML libraries will soon provide us with powerful, generic, declarative engines for the manipulation (through transformation and combination) and interrogation of structured data. It's easy to think of XML as being more trouble than it's worth for today's application programmer, but in a few years' time that same programmer will be able to use XML libraries, plus a "shell" of application-specific code, to produce applications an order of magnitude more complex than they could otherwise achieve.

One idea that may become central to this is the model-view-controller (MVC) pattern. In this design pattern we have some data structure in memory (the model), some separate logic which manipulates this data (the controllers) and other components that present this data in a variety of ways (the views). It is easy to see how we could produce applications that hold XML in models, with view objects configured using something such as XSLT which are themselves models for other views. A controller that manipulates the model will have any changes it makes ripple through what is effectively a powerful dependency graph. All the application has to supply is the logic that displays selected views to the user, perhaps extracting data using XPath.
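Here's a very rough sketch of that shape in Python; the model document, the view and the "controller" code are all invented for the example, and a fuller version would configure the views with XSLT rather than hand-written extraction.

    # Rough sketch of MVC with an XML model: views re-derive their output from
    # the model whenever a controller changes it. Everything here is invented.
    import xml.etree.ElementTree as ET

    class XmlModel:
        def __init__(self, xml):
            self.root = ET.fromstring(xml)
            self.views = []
        def changed(self):
            # Let the change ripple through to every dependent view.
            for view in self.views:
                view.render(self.root)

    class NameListView:
        # A view that extracts just the names, XPath-style.
        def render(self, root):
            print("names:", [e.text for e in root.findall(".//person/name")])

    model = XmlModel("<people><person><name>Ian</name></person></people>")
    model.views.append(NameListView())

    # A "controller" manipulates the model; the views are then refreshed.
    new_person = ET.SubElement(model.root, "person")
    ET.SubElement(new_person, "name").text = "Ted"
    model.changed()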

CORBA & SOAP

CORBA blazed the trail for distributed objects, but CORBA had to do everything from scratch, even down to saying how data is marshalled before being sent and received. It was also created by large companies for their own needs. It's a heavyweight solution. SOAP, on the other hand, makes good use of other technologies: XML for structuring requests, HTTP for transporting them, SSL for encryption and URLs for locating remote objects. And that's just the beginning. SOAP will be able to use a whole load of other web technologies very easily. CORBA is also in a phase where every new feature gives diminishing returns. SOAP picks up a whole host of extra features from new web-based technologies, even though the creators of those technologies probably weren't even thinking of SOAP when they created them. And SOAP is just one example of how XML and HTTP can be used together.
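To illustrate, here's a sketch of a SOAP 1.1-style call - just an XML envelope POSTed over HTTP. The endpoint URL, the SOAPAction value and the GetPersonDetails operation are all made up for the example.

    # Sketch of a SOAP 1.1-style call: an XML envelope POSTed over HTTP.
    # The endpoint, SOAPAction and GetPersonDetails operation are invented.
    from urllib.request import Request, urlopen

    envelope = """<?xml version="1.0"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <GetPersonDetails xmlns="http://example.com/directory">
          <name>Ian Fairman</name>
        </GetPersonDetails>
      </soap:Body>
    </soap:Envelope>"""

    request = Request(
        "http://example.com/soap/directory",
        data=envelope.encode("utf-8"),
        headers={
            "Content-Type": "text/xml; charset=utf-8",
            "SOAPAction": "http://example.com/directory#GetPersonDetails",
        },
    )
    with urlopen(request) as response:
        reply = response.read()           # another XML document, ready for DOM, SAX or XPath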

Whether SOAP will take off is a matter of debate, but what is clear to me is that the writing is on the wall for systems like CORBA and that within five years they will be considered obsolete, even though at the moment they offer a lot of functionality that web-based solutions don't.


History

Date                 Version  Comments
23rd January, 2001   1.0      Trinity, Last Mile, Worse, Engines, CORBA.

Copyright Ian Fairman 2001 - ifairman@yahoo.com