From GitHub to GitHub with XProc: An approach to automate documentation for an open source project with XProc
    and the GitHub Web API

Martin Kraetke

Abstract

For all too many developers, documentation comes last. The move from Subversion to Git provided the impetus to rethink storage of both code and documentation. It also provided an opportunity to use XSLT to harvest input for documentation from the code itself: XProc pipelines, XSLT transforms, XML Catalogs, and other sources. The process also allows the integration of human-generated documentation, such as tutorials written in DocBook. Final documentation is generated by more XSLT, which generates HTML pages which are committed with the source code to the GitHub repository.

Background

Since we published our transpect framework under an Open Source license in 2013, it had become evident that we need to improve documentation to facilitate the use of our software for other developers. At this time no common standards for documentation existed in our XML team. Each developer followed it's own approach towards documentation up to doubtful “paradigms” that the code would be documentation enough. Generating a documentation seemed to be impractical due to different coding practices. These issues are outlined below.

Component Model

Before we moved to XProc, our XSLT stylesheets were executed from Make and Ruby. These scripts were also used to perform non-XML data processing, for example basic file operations, running command-line utilities or to query databases. Although these technologies fullfilled their purpose, they came with some disadvantages.

The code was difficult to deploy on other systems because it required typically an Unix-based environment and a number of preinstalled software tools. Reusability was very limited as many scripts expected resources at specific locations on the file system.

XProc helped to overcome these deficiencies. XProc is platform-neutral and provides a declarative XML vocabulary to specify XML workflows. As a positive side effect, the XML syntax allows us to analyze and transform XProc pipelines with Schematron and XSLT. XProc allows to re-use pipelines in other pipelines with p:import statements. For file operations and other tasks which are not part of the XProc standard, we use build-in extension steps of the XProc processor XML Calabash. For specific tasks such as unzipping archives and image analysis, we developed own extension steps.

Transpect shares some basic principles with the DAISY Pipeline project, Romain Deltour presented at XML Prague 2013kraet01: To avoid explicit file system paths in XSLT or XProc import statements, the components are identified by canonical URIs. The idea is to identify external resources by virtual addresses and to have a catalog resolver to rewrite the virtual addresses to physical locations.

In practice, each transpect module includes an XML catalog which contains its base URI. If an module is used in a project, the module catalog needs to be referenced in the project catalog as well.

Figure 1

<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">  
  <!-- base URI of the project -->
  <rewriteURI uriStartString="http://transpect.github.io/" rewritePrefix="../"/>
  <!-- references to the transpect modules -->  
  <nextCatalog catalog="../xslt-util/xmlcatalog/catalog.xml"/>
  <nextCatalog catalog="../github-api/xmlcatalog/catalog.xml"/>
</catalog>

Due to the modular nature of transpect, it was necessary to develop a reference which depicts not only the properties of the modules but also their dependencies. The complexity of this issue was reinforced by the heterogeneous ways in which the code was stored.

Repository Layout

Unfortunately, there was no standard how code should be organized. Transpect modules were stored in several publicly available Subversion (SVN) repositories. Some included various modules in subdirectories, other repositories just consisted of one module. Sometimes, the repository or directory names did not match the base URI in the XML catalog.

For this reason, it was not easy to derive the dependencies of a certain module. Some repositories implemented the concept of a “project root” with three subdirectories trunk, branches, and tags. Others just provided the code at the root directory. Inside the repository the code was either organized by language, functional logic or for historical reasons. Hence, it could not be assumed that stylesheets or pipelines which belonged logically together were also stored at the same location.

Naming Conventions

Following a consistent set of naming conventions not only improves the usability of a framework but also facilitates the building of tools to analyze the code. Previously, naming conventions differed among transpect modules. Directories and files which served the same purpose were named differently. Three namespaces existed to represent core transpect modules.

Lack of Documentation

Just a few transpect modules provided an extensive documentation. Although there was a verbose setup guide laid out in DocBookkraet02, a lot of other modules were poorly documented: Even for complex templates and functions, comments were missing. Some variable and function names were not self-explanatory and Readme files were rare.

In addition, there was no standard on how a transpect module should be described. The setup guide mentioned earlier was laid out in DocBook, other projects included plaintext Readme files or larger XML comments at the top of an XML document.

Establishing Standards

It became obvious that the lack is that the lack of an extensive documentation was connected to the lack of common development standards. So we identified a number of major issues and created standards addressing these issues.

Creating a Styleguide

First, we included some basic guidelines on how to write transpect code in a styleguidekraet03. These conventions provide guidance on encoding, indent style, whitespace, new lines and some recommendations on writing XProc and XSLT code.

Moving to GitHub

To facilitate the ways of using and contributing to transpect, we moved its entire codebase from our public SNV server to GitHub. Only customer-specific projects are still stored in SVN repositories including the GitHub transpect modules as SVN externals. We created a GitHub organization and stored each transpect module in a particular repository. To minimize dependency management, we tried to combine modules with two-way dependencies into one module. GitHub already provides built-in functions how to create branches, releases and tags.

Restructuring the Code Base

Taking into account XProc and XSLT only, transpect counts about 50.000 lines of code. The code was reviewed and namespaces, attributes and functions were renamed. Canonical URIs were unified and common directories e.g. for XML catalogs were named consistently. With our continuous integration system, it could be ensured our customer projects still produce the same output.

After the migration to GitHub, the repository layout for each module follows the same guidelines. The code is stored separated by its language. XML catalogs can always be found at the same location.

Figure 2: Typical directory structure of a transpect module

MyProject/
  |--xmlcatalog/
  |  |--catalog.xml
  |--moduleXY
  |--css/
  |  |--stylesheet.css
  |--xpl/
  |  |--pipeline.xpl
  |--xsl/
  |  |--transform.xsl

Creating Documentation

GitHub already provides an HTML rendering for Markdown-based Readme files. Therefore, a Readme file named README.md must be stored in a directory of the repository. When opening this directory with GitHub's directory browser, the file is rendered at the bottom of the page. Since moving to GitHub, we have been used frequently this method to provide a brief documentation to developers.

However, it's not feasible to include all technical aspects of a module in a Readme. For example, an XProc pipeline may consist of input and output ports, a set of options and various import statements. Incorporating this information within a separate Readme for every transpect module wouldn't be particularly practicable. So we came up with the idea of generating the documentation directly from the source code.

XProc already features a p:documentation element intended for adding documentation. This element can be used anywhere in the pipeline and doesn't affect its execution. p:documentation can be used as child element of p:input to describe which kind of document is expected by this port. Moreover, you can add the HTML namespace to use HTML markup in your documentation. This method proved to be useful to include markup code blocks or use hyperlinks to reference other resources.

Figure 3: Inline documentation in XProc

<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:c="http://www.w3.org/ns/xproc-step" version="1.0">
  
  <p:documentation xmlns="http://www.w3.org/1999/xhtml">
    The documentation may include <a href="https://en.wikipedia.org/wiki/HTML">HTML</a> markup as well.
  </p:documentation>
  
  <p:input port="source">
    <p:inline>
      <doc>Hello world!</doc>
    </p:inline>
  </p:input>
  <p:output port="result"/>
  <p:identity/>
  
</p:declare-step>

Our first project which employed p:documentation tags was transpectdoc. Transpectdoc is basically an XProc pipeline which generates an HTML documentation for all XProc pipelines used in a project. Transpectdoc starts from a frontend XProc pipeline and parse all imported pipelines and libraries. For each of the pipelines’ input and output ports, a linked list will be generated that points to ports in other pipelines that are connected to these ports.

With transpectdoc it's possible to easily create a technical documentation for a specific project. However, it was not intended to replace tutorials or an user guide. Nevertheless, transpectdoc works for projects with one front-end pipeline only. It was not supposed to generate a reference for the entire transpect framework.

Requirements

After various efforts had been made to develop common standards, we identified major requirements for a global documentation of transpect.

The documentation should be comprehensive and extensible. In this sense, it should include all necessary resources for each module. Furthermore, it should be possible to easily integrate new repositories.
Currently, the entire transpect framework includes 52 repositories on GitHub. It would not be convenient to clone each directory to your local machine to generate a documentation. Only a minimum set of data should be downloaded. Furthermore, it should be possible to easily update and extend the documentation.
The automatically generated reference should be accompanied by other kinds of documentation, such as tutorials.
The documentation should be both online and offline readable.

Implementation

Requesting GitHub Repositories

The entire process starting with requesting the transpect repositories up to generating the documentation is implemented in XProc. The documentation is written in XML and hosted as GitHub page at http://transpect.io

Initially, an method had to be established to retrieve the repository information hosted on GitHub. Fortunately, GitHub provides a public HTTP APIkraet04. The data is sent and received as JSON. So we decided to use XProc's p:http-request to request the GitHub API. The process comprises three stages: First, an XProc step requests a list of all repositories of our GitHub organization. Second, lists the content of each repository recursivelykraet05. Third, XML Catalogs, XSLT stylesheets and XProc pipelines are downloaded from GitHub.

To convert JSON to XML, we use an extension of XML Calabash, called the “transparent JSON”kraet06. We may replace this later with a p:xslt step that implements the function parse-json() introduced in XSLT 3.0. Below you can find an XML output from a HTTP request of a GitHub repository by using XProc and XML Calabash.

Figure 4: Output of an XProc HTTP-request as transparent JSON

<?xml version="1.0" encoding="UTF-8"?>
  <j:item xmlns:j="http://marklogic.com/json" type="object">
  <j:language type="string">XProc</j:language>
  <j:svn_005furl type="string">https://github.com/transpect/cascade</j:svn_005furl>
  <j:forks type="number">0</j:forks>
  <j:ssh_005furl type="string">git@github.com:transpect/cascade.git</j:ssh_005furl>
  <j:full_005fname type="string">transpect/cascade</j:full_005fname>
  <j:clone_005furl type="string">https://github.com/transpect/cascade.git</j:clone_005furl>
  <j:default_005fbranch type="string">master</j:default_005fbranch>
  <j:private type="boolean">false</j:private>
  <j:open_005fissues_005fcount type="number">0</j:open_005fissues_005fcount>
  <j:description type="string">Libraries to implement a transpect cascade configuration</j:description>
  <j:git_005furl type="string">git://github.com/transpect/cascade.git</j:git_005furl>
  <j:has_005fissues type="boolean">true</j:has_005fissues>
  <j:contents_005furl type="string">https://api.github.com/repos/transpect/cascade/contents/{+path}</j:contents_005furl>  
</j:item>

Generating the Documentation

All the information is combined together with XProc pipelines, XSLT stylesheets and XML catalogs in one XML document. Then an XSLT stylesheet is used to convert this intermediate format to DocBook to provide a more descriptive reference.

For each transpect module a DocBook chapter is generated. Every chapter includes general information about the repository and sections for all XProc steps that can be imported by other pipelines. Within the documentation, an XProc step is specified by its @type attribute and its canonical import URI. If the step imports steps from other repositories, references to the repositories are specified as dependencies.

If an XProc step contains a p:documentation tag following below the root element, its content is taken as a brief description of the step and rendered as well. Furthermore, input and output ports, options and their default values are specified. Currently just a few steps provide p:documentation tags within port and option declarations. If their number is increasing in the future, we think of integrating these contents, too.

Despite the automatically generated reference, there are other kinds of documentation such as setup guides and tutorials. These documents are authored in DocBook and stored together with the module reference within the same directory. They are merged later with XProc's p:xinclude step. After the merge, the XML document is split into chunks, each of which is converted to a single HTML document later.

The generated HTML chunks just include a basic HTML markup and are injected into a global HTML template which provides the layout information. The entries of the navigation bar are also generated and inserted into the template.

In this way, it's possible to change the general appearance without changing the content. The previous XSLTs just rely on some specific id attributes in the template to add the content. However, if you want to use specific CSS features, such as a multi-column layout or individual colors, you have to add the corresponding HTML class values in your DocBook role attributes.

The process ends with the generation of the HTML snippets of the documentation. From this point, the only thing left to do is to commit and push the changed HTML files to GitHub.

Discussion and Outlook

There are various benefits in generating vital parts of the reference directly from the code base. The use of inline documentation enriched with HTML markup facilitates adding larger portions of text. Redundant information like port declarations and dependencies is automatically generated. It's no problem to keep the documentation up-to-date if new repositories were added or code has changed.

Naturally, the generated documentation is just as good as the quality of the inline documentation. But this approach is also suitable to expose poor documentation which was previously hidden. Furthermore, a generated technical reference cannot replace other more descriptive types of user documentation such as tutorials, FAQs or How-tos and it's not convenient to incorporate complex documents in p:documentation tags. But it seems reasonable to combine these different kinds of documentation in a single XML source that can then be published in a variety of formats.

It seems like a natural choice to use GitHub services to both generate and host documentation for code which is already hosted on GitHub. But it also involves the risk of beeing too dependent on a specific vendor. If it would be necessary to move away from GitHub, just the pipeline which requests the repositories needs to be changed. The source of the documentation is stored as XML, so another pipeline just had to generate the expected XML structure.

The project is still in a early state. As of now, the initial requirements have been implemented and additional features such as EPUB output and a graphical representation of the XProc steps are planned. Expected data types and XML schemas at XProc ports could provide a deeper level of information. Sometimes it would be convenient to have a brief overview of the structure of the pipeline. In this sense, the main focus is just on adding more informative documentation.

References

[kraet01] Deltour, Romain: XProc at the heart of an ebook production framework. The approach of the DAISY Pipeline project. XML Prague 2013 Conference Proceedings. http://archive.xmlprague.cz/2013/files/xmlprague-2013-proceedings.pdf [cited 19 Apr 2016]

[kraet02] le-tex publishing services: transpect Setup Manual. https://subversion.le-tex.de/common/transpect-demo/content/le-tex/setup-manual/en/transpect-setup.xml

[kraet03] le-tex publishing services: Styleguide. Guidelines for writing XProc, XSLT, XPath… http://transpect.github.io/styleguide.html [cited 19 Apr 2016]

[kraet04] GitHub. API Overview. https://developer.github.com/v3/ [accessed 19 Apr 2016]

[kraet05] transpect.io. GitHub-API. XProc steps that implement the GitHub API. https://github.com/transpect/github-api [accessed 19 Apr 2016]

[kraet06] Norman Walsh. Language Extensions. In XML Calabash Reference. http://xmlcalabash.com/docs/reference/langext.html [accessed 19 Apr 2016]

GitHub. API Contents. https://developer.github.com/v3/repos/contents/ [accessed 19 Apr 2016]

Martin Kraetke

le-tex publishing services

Martin Kraetke works as Lead Content Engineer at le-tex. He introduced XProc at le-tex and writes pipelines for the Open Source framework transpect. From time to time, he posts articles for his blog xporc.net.

BalisageThe Markup Conference

Balisage Paper: From GitHub to GitHub with XProc: An approach to automate documentation for an open source project with XProc and the GitHub Web API