How to cite this paper
From GitHub to GitHub with XProc: An approach to automate documentation for an open source project with XProc
and the GitHub Web API
Balisage: The Markup Conference 2016
August 2 - 5, 2016
Since we published our transpect framework under an Open Source license in 2013, it had become evident that we
need to improve documentation to facilitate the use of our software for other developers. At this time no common
standards for documentation existed in our XML team. Each developer followed it's own approach towards
documentation up to doubtful “paradigms” that the code would be documentation enough. Generating a documentation
seemed to be impractical due to different coding practices. These issues are
Before we moved to XProc, our XSLT stylesheets were executed from Make and Ruby. These scripts were also used to perform non-XML data processing, for example basic file operations, running command-line utilities or to query databases. Although these technologies fullfilled their purpose, they came with some disadvantages.
The code was difficult to deploy on other systems because it required typically an Unix-based environment and a number of preinstalled software tools. Reusability was very limited as many scripts expected resources at specific locations on the file system.
XProc helped to overcome these deficiencies. XProc is platform-neutral and provides a declarative XML vocabulary to specify XML workflows. As a positive side effect, the XML syntax allows us to analyze and transform XProc pipelines with Schematron and XSLT. XProc allows to re-use pipelines in other pipelines with
p:import statements. For file operations and other tasks which are not part of the XProc standard, we use build-in extension steps of the XProc processor XML Calabash. For specific tasks such as unzipping archives and image analysis, we developed own extension steps.
Transpect shares some basic principles with the DAISY Pipeline project, Romain Deltour presented at XML Prague 2013kraet01: To avoid explicit file system paths in XSLT or XProc import statements, the components are identified by canonical URIs. The idea is to identify external resources by virtual addresses and to have a catalog resolver to rewrite the virtual addresses to physical locations.
In practice, each transpect module includes an XML catalog which contains its base URI. If an module is used in a project, the module catalog needs to be referenced in the project catalog as well.
<?xml version="1.0" encoding="UTF-8"?>
<!-- base URI of the project -->
<rewriteURI uriStartString="http://transpect.github.io/" rewritePrefix="../"/>
<!-- references to the transpect modules -->
Due to the modular nature of transpect, it was necessary to develop a reference which depicts not only the properties of the modules but also their dependencies. The complexity of this issue was reinforced by the heterogeneous ways in which the code was stored.
Unfortunately, there was no standard how code should be organized. Transpect modules were stored in several publicly available Subversion (SVN) repositories. Some included various modules in subdirectories, other repositories just consisted of one module. Sometimes, the repository or directory names did not match the base URI in the XML catalog.
For this reason, it was not easy to derive the dependencies of a certain module. Some repositories implemented the concept of a “project root” with three subdirectories
tags. Others just provided the code at the root directory. Inside the repository the code was either organized by language, functional logic or for historical reasons. Hence, it could not be assumed that stylesheets or pipelines which belonged logically together were also stored at the same location.
Following a consistent set of naming conventions not only improves the usability of a framework but also facilitates the building of tools to analyze the code. Previously, naming conventions differed among transpect modules. Directories and files which served the same purpose were named differently. Three namespaces existed to represent core transpect modules.
Lack of Documentation
Just a few transpect modules provided an extensive documentation. Although there was a verbose setup guide laid out in DocBookkraet02, a lot of other modules were poorly documented: Even for complex templates and functions, comments were missing. Some variable and function names were not self-explanatory and Readme files were rare.
In addition, there was no standard on how a transpect module should be described. The setup guide mentioned earlier was laid out in DocBook, other projects included plaintext Readme files or larger XML comments at the top of an XML document.
It became obvious that the lack is that the lack of an extensive documentation was connected to the lack of common development standards. So we identified a number of major issues and created standards addressing these issues.
Creating a Styleguide
First, we included some basic guidelines on how to write transpect code in a styleguidekraet03. These conventions provide guidance on encoding, indent style, whitespace, new lines and some recommendations on writing XProc and XSLT code.
Moving to GitHub
To facilitate the ways of using and contributing to transpect, we moved its entire codebase from our public SNV server to GitHub. Only customer-specific projects are still stored in SVN repositories including the GitHub transpect modules as SVN externals. We created a GitHub organization and stored each transpect module in a particular repository. To minimize dependency management, we tried to combine modules with two-way dependencies into one module. GitHub already provides built-in functions how to create branches, releases and tags.
Restructuring the Code Base
Taking into account XProc and XSLT only, transpect counts about 50.000 lines of code. The code was reviewed and namespaces, attributes and functions were renamed. Canonical URIs were unified and common directories e.g. for XML catalogs were named consistently. With our continuous integration system, it could be ensured our customer projects still produce the same output.
After the migration to GitHub, the repository layout for each module follows the same guidelines. The code is stored separated by its language. XML catalogs can always be found at the same location.
Figure 2: Typical directory structure of a transpect module
GitHub already provides an HTML rendering for Markdown-based Readme files. Therefore, a Readme file named
README.md must be stored in a directory of the repository. When opening this directory with GitHub's directory browser, the file is rendered at the bottom of the page. Since moving to GitHub, we have been used frequently this method to provide a brief documentation to developers.
However, it's not feasible to include all technical aspects of a module in a Readme. For example, an XProc pipeline may consist of input and output ports, a set of options and various import statements. Incorporating this information within a separate Readme for every transpect module wouldn't be particularly practicable. So we came up with the idea of generating the documentation directly from the source code.
XProc already features a
p:documentation element intended for adding documentation. This element can be used anywhere in the pipeline and doesn't affect its execution.
p:documentation can be used as child element of
p:input to describe which kind of document is expected by this port. Moreover, you can add the HTML namespace to use HTML markup in your documentation. This method proved to be useful to include markup code blocks or use hyperlinks to reference other resources.
Figure 3: Inline documentation in XProc
<?xml version="1.0" encoding="UTF-8"?>
The documentation may include <a href="https://en.wikipedia.org/wiki/HTML">HTML</a> markup as well.
Our first project which employed
p:documentation tags was transpectdoc.
Transpectdoc is basically an XProc pipeline which generates an HTML documentation for all XProc pipelines used
in a project. Transpectdoc starts from a frontend XProc pipeline and parse all imported pipelines and libraries.
For each of the pipelines’ input and output ports, a linked list will be generated that points to ports in other
pipelines that are connected to these ports.
With transpectdoc it's possible to easily create a technical documentation for a specific project. However,
it was not intended to replace tutorials or an user guide. Nevertheless, transpectdoc works for projects with
one front-end pipeline only. It was not supposed to generate a reference for the entire transpect
After various efforts had been made to develop common standards, we identified major requirements for a global documentation of transpect.
The documentation should be comprehensive and extensible. In this sense, it should include all necessary
resources for each module. Furthermore, it should be possible to easily integrate new repositories.
Currently, the entire transpect framework includes 52 repositories on GitHub. It would not be convenient
to clone each directory to your local machine to generate a documentation. Only a minimum set of data should
be downloaded. Furthermore, it should be possible to easily update and extend the documentation.
The automatically generated reference should be accompanied by other kinds of documentation, such as
The documentation should be both online and offline readable.
Requesting GitHub Repositories
The entire process starting with requesting the transpect repositories up to generating the documentation is implemented in XProc. The documentation is written in XML and hosted as GitHub page at http://transpect.io
Initially, an method had to be established to retrieve the repository information hosted on GitHub.
Fortunately, GitHub provides a public HTTP APIkraet04. The data is sent and received as JSON.
So we decided to use XProc's
p:http-request to request the GitHub API. The process comprises three
stages: First, an XProc step requests a list of all repositories of our GitHub organization. Second, lists the
content of each repository recursivelykraet05. Third, XML Catalogs, XSLT stylesheets and XProc
pipelines are downloaded from GitHub.
To convert JSON to XML, we use an extension of XML Calabash, called the “transparent JSON”kraet06. We may replace this later with a
p:xslt step that implements the function
parse-json() introduced in XSLT 3.0. Below you can find an XML output from a HTTP request of a GitHub repository by using XProc and XML Calabash.
Figure 4: Output of an XProc HTTP-request as transparent JSON
<?xml version="1.0" encoding="UTF-8"?>
<j:item xmlns:j="http://marklogic.com/json" type="object">
<j:description type="string">Libraries to implement a transpect cascade configuration</j:description>
Generating the Documentation
All the information is combined together with XProc pipelines, XSLT stylesheets and XML catalogs in one XML document. Then an XSLT stylesheet is used to convert this intermediate format to DocBook to provide a more descriptive reference.
For each transpect module a DocBook chapter is generated. Every chapter includes general information about the repository and sections for all XProc steps that can be imported by other pipelines. Within the documentation, an XProc step is specified by its
@type attribute and its canonical import URI. If the step imports
steps from other repositories, references to the repositories are specified as dependencies.
If an XProc step contains a
p:documentation tag following below the root element, its content is taken as a brief description of the step and rendered as well. Furthermore, input and output ports, options and their default values are specified. Currently just a few steps provide
p:documentation tags within port and
option declarations. If their number is increasing in the future, we think of integrating these contents, too.
Despite the automatically generated reference, there are other kinds of documentation such as setup guides and tutorials. These documents are authored in DocBook and stored together with the module reference within the same directory. They are merged later with XProc's
p:xinclude step. After the merge, the XML document is split into chunks, each of which is converted to a single HTML document later.
The generated HTML chunks just include a basic HTML markup and are injected into a global HTML template which provides the layout information. The entries of the navigation bar are also generated and inserted into the template.
In this way, it's possible to change the general appearance without changing the content. The previous XSLTs just rely on some specific
id attributes in the template to add the content. However, if you want to use specific CSS features, such as a multi-column layout or individual colors, you have to add the corresponding HTML
class values in your DocBook
The process ends with the generation of the HTML snippets of the documentation. From this point, the only
thing left to do is to commit and push the changed HTML files to GitHub.
Figure 5: Figure 1: Example of the Reference page.
Discussion and Outlook
There are various benefits in generating vital parts of the reference directly from the code base. The use of inline documentation enriched with HTML markup facilitates adding larger portions of text. Redundant information like port declarations and dependencies is automatically generated. It's no problem to keep the documentation up-to-date if new repositories were added or code has changed.
Naturally, the generated documentation is just as good as the quality of the inline documentation. But this approach is also suitable to expose poor documentation which was previously hidden. Furthermore, a generated technical reference cannot replace other more descriptive types of user documentation such as tutorials, FAQs or How-tos and it's not convenient to incorporate complex documents in
p:documentation tags. But it seems reasonable to combine these different kinds of documentation in a single XML source that can then be published in a variety of formats.
It seems like a natural choice to use GitHub services to both generate and host documentation for code which is already hosted on GitHub. But it also involves the risk of beeing too dependent on a specific vendor. If it would be necessary to move away from GitHub, just the pipeline which requests the repositories needs to be changed. The source of the documentation is stored as XML, so another pipeline just had to generate the expected XML structure.
The project is still in a early state. As of now, the initial requirements have been implemented and
additional features such as EPUB output and a graphical representation of the XProc steps are planned. Expected
data types and XML schemas at XProc ports could provide a deeper level of information. Sometimes it would be
convenient to have a brief overview of the structure of the pipeline. In this sense, the main focus is just on
adding more informative documentation.