Kraetke, Martin. “From GitHub to GitHub with XProc: An approach to automate documentation for an open source project with XProc and the GitHub Web API.” Presented at Balisage: The Markup Conference 2016, Washington, DC, August 2 - 5, 2016. In Proceedings of Balisage: The Markup Conference 2016. Balisage Series on Markup Technologies, vol. 17 (2016). https://doi.org/10.4242/BalisageVol17.Kraetke01.
Balisage: The Markup Conference 2016 August 2 - 5, 2016
Balisage Paper: From GitHub to GitHub with XProc: An approach to automate documentation for an open source project with XProc and the GitHub Web API
le-tex publishing services
Martin Kraetke works as Lead Content Engineer at le-tex. He introduced XProc at le-tex
pipelines for the Open Source framework transpect. From time to time, he posts articles
for his blog
For all too many developers, documentation comes last. The move from Subversion to
Git provided the impetus
to rethink storage of both code and documentation. It also provided an opportunity
to use XSLT to harvest input
for documentation from the code itself: XProc pipelines, XSLT transforms, XML Catalogs,
and other sources. The
process also allows the integration of human-generated documentation, such as tutorials
written in DocBook.
The final documentation is produced by further XSLT, which generates HTML pages that are
committed with the source
code to the GitHub repository.
Since we published our transpect framework under an Open Source license in 2013, it
has become evident that we
need to improve the documentation to make our software easier for other developers to use.
At that time, no common
standards for documentation existed in our XML team. Each developer followed their
own approach to
documentation, up to the doubtful “paradigm” that the code itself would be documentation enough.
Generating documentation
seemed impractical because of the differing coding practices. These issues are
Before we moved to XProc, our XSLT stylesheets were executed from Make and Ruby. These
scripts were also used to perform non-XML data processing, for example basic file
operations, running command-line utilities or querying databases. Although these technologies
fulfilled their purpose, they came with some disadvantages.
The code was difficult to deploy on other systems because it typically required a
Unix-based environment and a number of preinstalled software tools. Reusability was
very limited as many scripts expected resources at specific locations on the file
XProc helped to overcome these deficiencies. XProc is platform-neutral and provides
a declarative XML vocabulary to specify XML workflows. As a positive side effect,
the XML syntax allows us to analyze and transform XProc pipelines with Schematron
and XSLT. XProc also allows pipelines to be reused in other pipelines by means of p:import statements. For file operations and other tasks which are not part of the XProc standard,
we use built-in extension steps of the XProc processor XML Calabash. For specific
tasks such as unzipping archives and image analysis, we developed our own extension steps.
Transpect shares some basic principles with the DAISY Pipeline project, which Romain Deltour
presented at XML Prague 2013 [kraet01]: To avoid explicit file system paths in XSLT or XProc import statements, the components
are identified by canonical URIs. The idea is to identify external resources by virtual
addresses and to have a catalog resolver to rewrite the virtual addresses to physical
In practice, each transpect module includes an XML catalog which contains its base
URI. If a module is used in a project, the module catalog needs to be referenced
in the project catalog as well.
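The resolution of a canonical URI to a physical address can be illustrated with a minimal Python sketch. It assumes that a module catalog expresses its base URI as an OASIS `rewriteURI` entry; the canonical URI, the local path and the function names are hypothetical and not taken from transpect. The sketch also ignores how a project catalog references the module catalogs.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# Hypothetical module catalog; the canonical URI and the local path are invented.
MODULE_CATALOG = """\
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteURI uriStartString="http://example.org/modules/demo/"
              rewritePrefix="file:///opt/project/demo/"/>
</catalog>
"""


def load_rewrites(catalog_xml: str) -> List[Tuple[str, str]]:
    """Collect (canonical prefix, physical prefix) pairs from a catalog."""
    root = ET.fromstring(catalog_xml)
    return [
        (entry.get("uriStartString"), entry.get("rewritePrefix"))
        for entry in root.iter(f"{{{CATALOG_NS}}}rewriteURI")
    ]


def resolve(uri: str, rewrites: List[Tuple[str, str]]) -> str:
    """Rewrite a canonical URI; the longest matching prefix wins."""
    matches = [(start, prefix) for start, prefix in rewrites if uri.startswith(start)]
    if not matches:
        return uri  # no catalog entry: leave the URI untouched
    start, prefix = max(matches, key=lambda pair: len(pair[0]))
    return prefix + uri[len(start):]


rewrites = load_rewrites(MODULE_CATALOG)
print(resolve("http://example.org/modules/demo/xpl/step.xpl", rewrites))
```

An import statement can then refer to the canonical URI only, and the catalog decides where the file actually lives on a given system.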
Due to the modular nature of transpect, it was necessary to develop a reference which
depicts not only the properties of the modules but also their dependencies. The complexity
of this issue was reinforced by the heterogeneous ways in which the code was stored.
Unfortunately, there was no standard for how code should be organized. Transpect modules
were stored in several publicly available Subversion (SVN) repositories. Some included
various modules in subdirectories, while other repositories consisted of just one module.
Sometimes, the repository or directory names did not match the base URI in the XML
For this reason, it was not easy to derive the dependencies of a certain module. Some
repositories implemented the concept of a “project root” with three subdirectories
trunk, branches, and tags. Others just provided the code at the root directory. Inside the repository the code
was either organized by language, functional logic or for historical reasons. Hence,
it could not be assumed that stylesheets or pipelines which belonged logically together
were also stored at the same location.
Following a consistent set of naming conventions not only improves the usability of
a framework but also facilitates the building of tools to analyze the code. Previously,
naming conventions differed among transpect modules. Directories and files which served
the same purpose were named differently. Three namespaces existed to represent core
Lack of Documentation
Just a few transpect modules provided extensive documentation. Although there was
a verbose setup guide laid out in DocBook [kraet02], a lot of other modules were poorly documented: even for complex templates and functions,
comments were missing. Some variable and function names were not self-explanatory,
and Readme files were rare.
In addition, there was no standard on how a transpect module should be described.
The setup guide mentioned earlier was laid out in DocBook, while other projects included
plaintext Readme files or larger XML comments at the top of an XML document.
It became obvious that the lack of extensive documentation was
connected to the lack of common development standards. So we identified a number of
major issues and created standards addressing these issues.
Creating a Styleguide
First, we included some basic guidelines on how to write transpect code in a styleguide [kraet03]. These conventions provide guidance on encoding, indent style, whitespace, new lines
and some recommendations on writing XProc and XSLT code.
Moving to GitHub
To make it easier to use and contribute to transpect, we moved its entire
codebase from our public SVN server to GitHub. Only customer-specific projects are
still stored in SVN repositories, which include the GitHub transpect modules as SVN externals.
We created a GitHub organization and stored each transpect module in its own
repository. To minimize dependency management, we tried to combine modules with two-way
dependencies into one module. GitHub already provides built-in functions for creating
branches, releases and tags.
Restructuring the Code Base
Taking into account XProc and XSLT only, transpect comprises about 50,000 lines of code.
The code was reviewed and namespaces, attributes and functions were renamed. Canonical
URIs were unified and common directories e.g. for XML catalogs were named consistently.
With our continuous integration system, we could ensure that our customer projects
still produced the same output.
After the migration to GitHub, the repository layout for each module follows the same
guidelines. The code is stored separately by language. XML catalogs can always
be found at the same location.
GitHub already provides an HTML rendering for Markdown-based Readme files. To use it,
a Readme file named README.md must be stored in a directory of the repository. When this directory is opened in
GitHub's directory browser, the file is rendered at the bottom of the page. Since
moving to GitHub, we have frequently used this method to provide brief documentation
However, it's not feasible to include all technical aspects of a module in a Readme.
For example, an XProc pipeline may consist of input and output ports, a set of options
and various import statements. Incorporating this information within a separate Readme
for every transpect module wouldn't be particularly practicable. So we came up with
the idea of generating the documentation directly from the source code.
XProc already features a p:documentation element intended for adding documentation. This element can be used anywhere in the
pipeline and doesn't affect its execution. For example, p:documentation can be used as a child element of p:input to describe which kind of document is expected by this port. Moreover, you can add
the HTML namespace to use HTML markup in your documentation. This method proved to
be useful for including code blocks or hyperlinks that reference other resources.
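How such inline documentation can be harvested is sketched below in Python. This is only an illustration under stated assumptions, not the XSLT used in transpect: the pipeline fragment, the port names and the function `port_docs` are hypothetical, while the XProc and XHTML namespace URIs are the standard ones.

```python
import xml.etree.ElementTree as ET
from typing import Dict

NS = {"p": "http://www.w3.org/ns/xproc"}

# Hypothetical pipeline; p:documentation carries HTML markup in the XHTML namespace.
PIPELINE = """\
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <p:input port="source">
    <p:documentation xmlns="http://www.w3.org/1999/xhtml">
      <p>Expects a <code>DocBook</code> document.</p>
    </p:documentation>
  </p:input>
  <p:output port="result"/>
  <p:identity/>
</p:declare-step>
"""


def port_docs(pipeline_xml: str) -> Dict[str, str]:
    """Map each declared port name to the plain text of its p:documentation."""
    root = ET.fromstring(pipeline_xml)
    docs = {}
    for kind in ("input", "output"):
        for port in root.findall(f"p:{kind}", NS):
            doc = port.find("p:documentation", NS)
            # Flatten the HTML markup to text and normalize whitespace.
            text = " ".join("".join(doc.itertext()).split()) if doc is not None else ""
            docs[port.get("port")] = text
    return docs


for name, text in port_docs(PIPELINE).items():
    print(f"{name}: {text or '(undocumented)'}")
```

Ports without a p:documentation child are reported as undocumented, which is one simple way to spot gaps in the inline documentation.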
Our first project which employed p:documentation tags was transpectdoc.
Transpectdoc is basically an XProc pipeline which generates HTML documentation
for all XProc pipelines used
in a project. Transpectdoc starts from a front-end XProc pipeline and parses all imported
pipelines and libraries.
For each of the pipelines’ input and output ports, a list of links is generated
that points to the ports in other
pipelines that are connected to them.
With transpectdoc it's possible to easily create a technical documentation for a specific
it was not intended to replace tutorials or a user guide. Nevertheless, transpectdoc
works for projects with
one front-end pipeline only. It was not supposed to generate a reference for the entire
After various efforts had been made to develop common standards, we identified major
requirements for a global documentation of transpect.
The documentation should be comprehensive and extensible. In this sense, it should
include all necessary
resources for each module. Furthermore, it should be possible to easily integrate
Currently, the entire transpect framework includes 52 repositories on GitHub. It would
not be convenient
to clone each repository to your local machine to generate the documentation. Only a
minimum set of data should
be downloaded. Furthermore, it should be possible to easily update and extend the
The automatically generated reference should be accompanied by other kinds of documentation,
The documentation should be both online and offline readable.
Requesting GitHub Repositories
The entire process starting with requesting the transpect repositories up to generating
the documentation is implemented in XProc. The documentation is written in XML and
hosted as a GitHub page at http://transpect.io.
Initially, a method had to be established to retrieve the repository information
hosted on GitHub.
Fortunately, GitHub provides a public HTTP API [kraet04]. The data is sent and received as JSON.
So we decided to use XProc's p:http-request to query the GitHub API. The process comprises three
stages: First, an XProc step requests a list of all repositories of our GitHub organization.
Second, it lists the
content of each repository recursively [kraet05]. Third, XML Catalogs, XSLT stylesheets and XProc
pipelines are downloaded from GitHub.
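The three stages can be sketched in Python as follows. This is a rough illustration rather than the XProc implementation described here, and it rests on several assumptions: that the organization's repositories are listed via `/orgs/{org}/repos`, that the recursive listing uses the Git trees endpoint with a branch name in place of a tree SHA, that files are fetched from `raw.githubusercontent.com`, and that the relevant files can be recognized by the names in `is_wanted`. Authentication, paging and rate limits are ignored, and all function names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, List

API = "https://api.github.com"  # root of the public GitHub HTTP API


def fetch_json_http(url: str) -> Any:
    """Fetch a URL and decode the JSON response (no authentication, no paging)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def is_wanted(path: str) -> bool:
    """Assumed naming rules for pipelines, stylesheets and XML catalogs."""
    name = path.rsplit("/", 1)[-1]
    return name.endswith((".xpl", ".xsl")) or name == "catalog.xml"


def harvest_paths(
    org: str,
    fetch_json: Callable[[str], Any] = fetch_json_http,
    branch: str = "master",
) -> Dict[str, List[str]]:
    """Stages 1 and 2: list the repositories, then list each one recursively."""
    result = {}
    for repo in fetch_json(f"{API}/orgs/{org}/repos"):
        name = repo["name"]
        tree = fetch_json(f"{API}/repos/{org}/{name}/git/trees/{branch}?recursive=1")
        result[name] = [
            entry["path"]
            for entry in tree["tree"]
            if entry["type"] == "blob" and is_wanted(entry["path"])
        ]
    return result


def raw_url(org: str, repo: str, branch: str, path: str) -> str:
    """Stage 3: build the download address of a single file."""
    return f"https://raw.githubusercontent.com/{org}/{repo}/{branch}/{path}"
```

Passing the fetch function as a parameter keeps the traversal logic independent of the HTTP layer, so it can be exercised with canned responses instead of live requests.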
To convert JSON to XML, we use an extension of XML Calabash, called the “transparent
JSON” [kraet06]. We may replace this later with a p:xslt step that uses the function parse-json() introduced in XSLT 3.0. Below you can find an XML output from an HTTP request of a
GitHub repository by using XProc and XML Calabash.
Generating the Documentation
All the information is combined together with XProc pipelines, XSLT stylesheets and
XML catalogs in one XML document. Then an XSLT stylesheet is used to convert this
intermediate format to DocBook to provide a more descriptive reference.
For each transpect module a DocBook chapter is generated. Every chapter includes general
information about the repository and sections for all XProc steps that can be imported
by other pipelines. Within the documentation, an XProc step is specified by its @type attribute and its canonical import URI. If the step imports
steps from other repositories, references to the repositories are specified as dependencies.
If an XProc step contains a p:documentation tag directly below the root element, its content is taken as a brief description
of the step and rendered as well. Furthermore, input and output ports, options and
their default values are specified. Currently just a few steps provide p:documentation tags within port and
option declarations. If their number increases in the future, we plan to integrate
this content, too.
Besides the automatically generated reference, there are other kinds of documentation
such as setup guides and tutorials. These documents are authored in DocBook and stored
together with the module reference within the same directory. They are merged later
with XProc's p:xinclude step. After the merge, the XML document is split into chunks, each of which is converted
to a single HTML document later.
The generated HTML chunks contain only basic HTML markup and are injected into a
global HTML template which provides the layout information. The entries of the navigation
bar are also generated and inserted into the template.
In this way, it's possible to change the general appearance without changing the content.
The preceding XSLT stylesheets rely only on some specific id attributes in the template to add the content. However, if you want to use specific
CSS features, such as a multi-column layout or individual colors, you have to add
the corresponding HTML class values in your DocBook role attributes.
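The injection of a chunk into the global template can be sketched in Python as shown below. The actual implementation uses XSLT; the template, the id values `nav` and `content`, and the function `inject` are assumptions made for this illustration only.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical global template; only the two id attributes matter to the injection.
TEMPLATE = """\
<html><head><title>Documentation</title></head>
<body><ul id="nav"/><div id="content"/></body></html>
"""


def inject(template_xml: str, chunk_xml: str, nav_entries: List[Tuple[str, str]]) -> ET.Element:
    """Insert one HTML chunk and the navigation entries into the template."""
    page = ET.fromstring(template_xml)
    nav = page.find(".//*[@id='nav']")
    content = page.find(".//*[@id='content']")
    if nav is None or content is None:
        raise ValueError("template lacks the expected id attributes")
    for title, href in nav_entries:
        item = ET.SubElement(nav, "li")
        link = ET.SubElement(item, "a", href=href)
        link.text = title
    content.append(ET.fromstring(chunk_xml))
    return page


page = inject(TEMPLATE, "<section><h1>Demo module</h1></section>", [("Demo module", "demo.html")])
print(ET.tostring(page, encoding="unicode", method="html"))
```

Because the chunk only needs the two anchor elements, the rest of the template can be redesigned freely without touching the generated content.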
The process ends with the generation of the HTML snippets of the documentation. From
this point, the only
thing left to do is to commit and push the changed HTML files to GitHub.
Discussion and Outlook
There are various benefits in generating vital parts of the reference directly from
the code base. The use of inline documentation enriched with HTML markup facilitates
adding larger portions of text. Redundant information like port declarations and dependencies
is automatically generated. The documentation is easy to keep up to date when
new repositories are added or code changes.
Naturally, the generated documentation is only as good as the inline
documentation. But this approach is also suited to exposing poor documentation which
was previously hidden. Furthermore, a generated technical reference cannot replace
other more descriptive types of user documentation such as tutorials, FAQs or How-tos
and it's not convenient to incorporate complex documents in p:documentation tags. But it seems reasonable to combine these different kinds of documentation in
a single XML source that can then be published in a variety of formats.
It seems like a natural choice to use GitHub services to both generate and host documentation
for code which is already hosted on GitHub. But it also involves the risk of being
too dependent on a specific vendor. If it became necessary to move away from GitHub,
only the pipeline which requests the repositories would need to be changed. The source
of the documentation is stored as XML, so another pipeline would just have to generate the
expected XML structure.
The project is still in an early state. As of now, the initial requirements have been
additional features such as EPUB output and a graphical representation of the XProc
steps are planned. Expected
data types and XML schemas at XProc ports could provide a deeper level of information.
Sometimes it would be
convenient to have a brief overview of the structure of the pipeline. In this sense,
the main focus is just on
adding more informative documentation.