MuXML, An XML Multiplexor
MuXML is a prototype Perl module that implements configurable multiplexing of XML document streams accessed via the LWPng module and parsed using the XML::Parser module. It returns its results using the XML::Grove module. Its primary purpose is to serve as a demonstration of the use of non-blocking design approaches with XML and Perl. If it also ends up being a useful tool then that's gravy :-).
You can find the source code here if you are interested in browsing.
One way to think about the problem domain that MuXML (and the underlying sub-systems) is addressing is as the generalization of stream-oriented document processing to the handling of more than one document stream at a time (let's call them a stream set). It turns out that in a single-threaded environment, you don't even need more than one document in order to need the kinds of approaches used by LWPng and MuXML.
The various streams in a stream set may have very different throughput levels. This means that you may need to throttle the fast streams so as not to overwhelm the slow ones. Single-stream processing frameworks like those of SAX-based XML parsers do not generalize to use with stream sets. The main reason is that single-stream frameworks do not support flow control except in a cumbersome, roundabout way. MuXML can be viewed as a framework for processing stream sets with explicit flow control built in.
The most obvious use of MuXML is to multiplex record oriented XML document streams. I'll try to provide some demonstration data generated with something like XFlat/XML Convert in the next release.
One possible application of MuXML would be the emulation of the UNIX sort command's merge capability. Let's call this sample application MergeML. You would pass MergeML a list of XML documents, each of which was already sorted. It would output an XML document that interleaved the contents of the input documents based on the same sorting criteria used for the internal sort.
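The merge that MergeML would perform is a k-way merge of already-sorted inputs. Here is a minimal sketch of that idea in Python. It is not MuXML's Perl API; the record layout (dictionaries with an `id` sort key) and the name `merge_sorted_streams` are assumptions made purely for illustration.

```python
import heapq
from operator import itemgetter

def merge_sorted_streams(streams, key):
    """Lazily interleave already-sorted record streams into one sorted stream."""
    # heapq.merge pulls one record at a time from each input, so no stream
    # is read further ahead than the merge actually needs.
    return heapq.merge(*streams, key=key)

# Hypothetical records standing in for fragments of three sorted XML documents.
doc_a = [{"id": 1, "src": "a"}, {"id": 4, "src": "a"}]
doc_b = [{"id": 2, "src": "b"}, {"id": 3, "src": "b"}]
doc_c = [{"id": 5, "src": "c"}]

merged = list(merge_sorted_streams([doc_a, doc_b, doc_c], key=itemgetter("id")))
merged_ids = [rec["id"] for rec in merged]
```

Because the merge only ever looks at the head of each input, it fits naturally with a multiplexor that hands over one fragment per stream at a time.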
Another sample application would be aggregating information from a distributed logging system. Let's say that you have a site that is replicated across multiple distinct servers. Each server is stand-alone and does its own logging. The servers create a nightly ordered listing of page accesses (in XML, for argument's sake :-). Your MuXML application would access the hit logs and process them in parallel. It would output the aggregate access count across all the sites.
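To make the aggregation step concrete, here is a small Python sketch. The log format (a `hits` element containing `page` elements with `url` and `count` attributes) is invented for the example, and the sketch reads the logs one after another rather than in parallel, so it shows only the summing, not the multiplexing.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical nightly hit logs from two replicated servers.
LOG_SERVER_1 = (
    '<hits><page url="/index.html" count="10"/>'
    '<page url="/news.html" count="4"/></hits>'
)
LOG_SERVER_2 = (
    '<hits><page url="/about.html" count="2"/>'
    '<page url="/index.html" count="7"/></hits>'
)

def aggregate_hits(logs):
    """Sum the per-page access counts across all server logs."""
    totals = Counter()
    for log in logs:
        for page in ET.fromstring(log).iter("page"):
            totals[page.get("url")] += int(page.get("count"))
    return totals

totals = aggregate_hits([LOG_SERVER_1, LOG_SERVER_2])
```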
Usage
The MuXML conceptual model is that of a filtering smart multiplexor. You hand MuXML a list of URLs. It accesses and parses them incrementally based on a blockSize that you specify. MuXML uses HTTP partial GET requests to allow flow control to be propagated to the resource servers. It will only get as much data from the server as it needs in order to generate a fragment. Once it has one or more fragments queued for a particular stream, it will only get more data once the application has consumed the current fragments.
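The flow control described above can be sketched as follows. This is an illustration in Python, not MuXML's implementation: the `StreamCursor` class, its fields, and the assumption that blockSize is measured in bytes are all mine. The only part taken from HTTP itself is the `Range: bytes=first-last` header form used for partial GETs.

```python
def range_header(offset, block_size):
    """Build the Range header value for one block starting at `offset`."""
    # HTTP byte ranges are inclusive at both ends.
    return f"bytes={offset}-{offset + block_size - 1}"

class StreamCursor:
    """Tracks how much of one document has been requested so far (hypothetical)."""

    def __init__(self, url, block_size):
        self.url = url
        self.block_size = block_size
        self.offset = 0
        self.pending_fragments = 0  # fragments queued but not yet consumed

    def next_request(self):
        """Return headers for the next partial GET, or None while throttled."""
        if self.pending_fragments > 0:
            # The application has not consumed the queued fragments yet,
            # so no more data is requested for this stream.
            return None
        headers = {"Range": range_header(self.offset, self.block_size)}
        self.offset += self.block_size
        return headers

cursor = StreamCursor("http://example.com/a.xml", block_size=1024)
first = cursor.next_request()       # first block is requested
cursor.pending_fragments = 1        # a fragment is now queued
throttled = cursor.next_request()   # no request while the fragment is pending
cursor.pending_fragments = 0        # the application consumed it
second = cursor.next_request()      # fetching resumes with the next block
```

Because nothing is requested while a stream has unconsumed fragments, a fast server simply sits idle until the application catches up.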
It filters out fragments of the incoming documents using a fragment recognizer you give it. Whenever it has a complete fragment from any of the document streams that it is processing, it calls the fragment multiplexor that you have provided. Below are descriptions of these callbacks.
fragRecognizer
The fragRecognizer is invoked whenever a start tag is parsed AND we are not already inside a fragment for that stream. It returns a boolean value indicating whether this element should be made the root of a new fragment. It is passed the following parameters:
tag
The tag (GI) of the element.
attrs
An array of the attr/value pairs.
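As an illustration of the contract, here is what a recognizer might look like, written in Python rather than Perl. The element name `record`, the `type` attribute, and the assumption that attrs is a flat list alternating names and values are all hypothetical; only the two parameters and the boolean result come from the description above.

```python
def attrs_to_dict(attrs):
    """Turn a flat [name, value, name, value, ...] list into a dict."""
    # Assumption: attrs alternates attribute names and values.
    return dict(zip(attrs[0::2], attrs[1::2]))

def frag_recognizer(tag, attrs):
    """Return True if this element should become the root of a new fragment."""
    if tag != "record":
        return False
    # Skip records explicitly marked as headers (hypothetical attribute).
    return attrs_to_dict(attrs).get("type") != "header"

is_data_record = frag_recognizer("record", ["id", "1"])
is_header = frag_recognizer("record", ["type", "header"])
is_wrapper = frag_recognizer("log", [])
```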
fragMux
fragMux is called when the end tag of a recognized fragment in one of the document streams has been processed. MuXML and the fragMux communicate through arrays that contain an entry for each document stream in the stream set. For example, if you passed MuXML three URLs, each array would contain three entries.
fragMux is passed two arrays, one containing the state of the stream set on the previous call to fragMux and the other containing the current state. Each entry in either array can have one of three values:
fragment An XML::Grove::Element.
0 This indicates that there is no fragment available for this stream.
-1 This indicates that this stream has reached EOF.
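The two state arrays can be compared entry by entry to see what changed since the last call. The following Python sketch shows one way to read them. The helper names are made up, and plain strings stand in for XML::Grove::Element objects; only the three entry values (a fragment, 0, and -1) come from the description above.

```python
NO_FRAGMENT = 0   # no fragment available for this stream
EOF = -1          # this stream has reached EOF

def is_fragment(entry):
    """True if the entry holds a fragment rather than one of the marker values."""
    return entry not in (NO_FRAGMENT, EOF)

def newly_arrived(previous, current):
    """Indexes of streams that hold a fragment now but did not on the previous call."""
    return [
        i for i, (old, new) in enumerate(zip(previous, current))
        if is_fragment(new) and not is_fragment(old)
    ]

def all_done(current):
    """True once every stream in the stream set has reached EOF."""
    return all(entry == EOF for entry in current)

# Stream set of three documents; strings stand in for fragment objects.
previous = ["<rec id='1'/>", NO_FRAGMENT, EOF]
current = ["<rec id='1'/>", "<rec id='2'/>", EOF]

arrived = newly_arrived(previous, current)
finished = all_done(current)
```

Here stream 1 is the one that triggered the call, stream 0 still holds a fragment the application has not consumed, and stream 2 is already exhausted.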