PreTeXt [PreTeXt] is an XML vocabulary to describe the content and structure of a scholarly textbook, monograph, or article. The aim of the project is to make it easy for (academic) authors to create highly-capable versions of their work, ideally provided with open licenses. The initial focus has been on the particular demands of mathematics, with its specialized layouts and symbols, and other sciences with similar requirements. We are also interested in embedding computational examples, and producing output for use within computational systems, so documents addressing computer science and technology are well-supported. Not surprisingly, the most interest has come from authors of openly-licensed (or low-cost) undergraduate mathematics textbooks, such as [Boelkins 2018]. However, there are also textbooks authored in PreTeXt on physics, computer science [Plantz 2018], music theory [Hutchinson 2018], and even a handbook on writing for entering university students [Chun 2018].
Why not use an existing markup vocabulary? The short answer is that we want to walk a fine line between making it easy for authors to write, while also capturing the complete structure of a scholarly document. We looked hard at vocabularies such as DocBook [DocBook], TEI [TEI], DITA [DITA], BITS [BITS], and JATS [JATS], but were unsuccessful in grafting on the features needed for disciplines like mathematics and computer science in a work as big as an entire textbook. Maybe today, with greater skills working with XSLT, we would be able to get further. But still, DocBook has an emphasis on documentation, and TEI has an emphasis on textual analysis, and neither is consistent with our purpose. No system seems to have robust enough support for mathematics, such as numbering multi-line displays of equations. Some have explicit tags for presentation, such as bold or italic fonts, which we do not want to make available to authors.
According to the Scholarly Publishing and Academic Resources Coalition (SPARC) [SPARC 2017], mathematics is the discipline with the greatest
open educational resources traction. LaTeX [LaTeX] has been the de-facto markup language for mathematics for 35 years. It was designed
for print and its base system, TeX [TeX], contains a Turing-complete language. LaTeX has some shortcomings in describing
document structure, though TeX itself does a very good job of typesetting mathematical
expressions. Conversions to PDF [PDF] have evolved nicely, but conversion to HTML [HTML] has been problematic.
TeX4ht [TeX4ht] is one good system, as it uses the TeX executable, other attempts include [TtH] and [LaTeX2HTML]. The web is full of nicely-produced lecture notes, authored in LaTeX, and distributed
as one PDF per chapter. Authored in PreTeXt, it would take very little effort to
collect these notes into book form, and make available in web browsers with capabilities
well beyond those offered by PDF.
With markup that rigorously captures structure and intent, without allowing any preference for presentation, it becomes possible for software to create output formats with disparate technical formats, making encouraging multiple styles where supported (e.g. CSS for HTML). As a demonstration of the power and utility of these ideas, we have as targets five output formats:
PDF for print-on-demand publishing
PDF for electronic distribution and reading
HTML with semantic CSS
EPUB for offline electronic use
Jupyter notebooks for
laboratory notebookswith embedded code [Jupyter]
Why would an academic author abandon a tool like LaTeX, or a WYSIWYG word-processor for a scholarly writing project? Stated simply, the HTML output is so far superior to the electronic PDF output, while also a faithful and accurate representation of the print version. PreTeXt authors who come from LaTeX say that PreTeXt is no more complicated conceptually, it is just a new language to learn.
Discussions within a diverse community have been critical for developing a system that can meet the competing demands of a variety of authors, disciplines, and document styles.
PreTeXt was initiated in May 2013. There are now roughly 35 books authored in PreTeXt.
The RELAX-NG schema and XSLT stylesheets are distributed as a
git repository with an open license (GPL), so development is an open and transparent
process. There are three Google Groups for discussion and announcements [Announce, Support, Dev], involving 200 members, and there is an issue tracker on GitHub [GitHub]. There have been 21 contributors to the code and 70 forks.
More important than numerical measures of interest, and reports of bugs, the discussion groups have seen lively and civil discussions ranging from pedagogical aspects of textbook design and use, though to the nitty-gritty of what XML and XSLT can accomplish, and how. The evolution of the PreTeXt schema has been strongly influenced by these discussions involving authors and teachers with interests in technology, while also maintaining the consistency that ultimately comes from a single designer.
Only recently, we realized that academic authors of openly-licensed materials often play three roles simultaneously: author, publisher, and instructor. This realization led to a split of the documentation into an Author's Guide [Author's Guide] and a Publisher's Guide [Publisher's Guide]. This also clarifies certain technical decisions: should an option be hard-coded into an author's source as a result of a decision that properly rests with the author, or should it be a command-line switch that is a publisher's decision about the form of a particular output or rendering of the author's content? Work to support instructor's needs (such as indicating which exercises are part of collected homework) is just starting. So the discussion of language features includes decisions about which actor makes the decision.
Below we discuss the features, design decisions, and technical decisions, that have resulted from this community process.
In this section, we describe a number of key features of PreTeXt.
Content versus Presentation
Every attempt is made to rigorously separate content from presentation. This can
be a hard argument to make to authors from a WYSIWYG background, and can even be hard
to make for authors experienced with LaTeX, since they think LaTeX already does this
(it could, but there are few facilities to enforce it). The obvious payoff is the
ability to produce many (dissimilar) output formats from one source, with no editing.
And it has been a long time since anybody asked for a
Multiple Output Formats
PreTeXt has excellent support, via XSLT stylesheets, for conversion to the following output formats based on other markup languages: online HTML, PDF for viewing on a screen, and PDF for print-on-demand physical versions. There are good, but not complete, conversions to EPUB and Jupyter notebooks. The EPUB conversion renders well in iBooks, and work is in-progress to convert further to Kindle format. Jupyter notebooks are online computational notebooks [Jupyter], allowing editing in a web browser with embedded code executed by connections to modular computational kernels. Their underlying format is JSON [JSON] holding a mix of Markdown [Markdown] and HTML.
ID/IDREF system provides the basis for a robust and unified system for cross-referencing.
Many elements in PreTeXt can be the target of a cross-reference, and the cross-reference
produced can take many forms (display number, display number plus type-prefix, title,
etc.). By contrast LaTeX uses
\label to mark objects, and then
\ref produces display numbers for cross-references, but there is also
\pageref to generate page numbers,
\cite for citations, and
\eqref for equations. Further, certain natural parts of a document are not easy to cross-reference
in LaTeX, such as a proof, or a case within a proof, since they do not earn a display
number. In HTML output a cross-reference can be implemented with a knowl [Knowl], which is a form of transclusion [Wikipedia]. So when a reader clicks on a cross-reference to definition, the page splits and
a box opens containing the text of the definition. Clicking again on the knowl puts
the box away. Readers rapidly come to expect a citation to open up with the bibliographic
information, rather than spiriting them away to a division of the back matter. Rather
than having page numbers in the index, a knowl works much better. Every knowl contains
in-context link if the reader wishes to migrate away to another location.
Extensive and Flexible Numbering
A textbook or monograph is not a novel. Readers may sample, authors will cross-reference
previous material, and a textbook may later serve as reference material. So items
that need to be identified unambiguously are automatically assigned numbers (which
are, of course, the same in each output format). There are six numbering schemes
in use (divisions, blocks/environments, equations, exercises, citations, and footnotes)
and the form of the number can be configured to follow the document hierarchy to a
specified level. So the same theorem in the source could be
Theorem 7.3.4 or
Theorem 7.17 or
Theorem 105, depending on an author's document-wide choice. Almost everything that is not unique
(e.g. the Acknowledgments division) is automatically numbered, with no provision for
turning it off, either document-wide nor one-off. Equations are the one notable exception,
allowing numbers optionally.
Online Output (HTML)
Our HTML output is the most capable. We can embed interactive components such as
video, and activities expecting more reader interaction through sliders, checkboxes
and text fields. Anything that can live on a standard web page becomes a candidate
for a component of a book, including current support for HTML5
led to CSS designed to meet the needs of serious readers. Rich navigational devices,
knowls (information hiding and cross-referencing), integrated search, and other features
further aid the reader.
We aim to find the best mix of the old and new. Centuries of book design have given us a hierarchical organization, an optimal number of characters per line, and navigational devices like an index. Web browsers allow extensive hyper-linking, a copious margin for tangential material, and integrated search. We are careful to not throw out the baby with the bathwater.
The accessibility of online materials for students with physical limitations is a huge concern at many universities, and enhancing the opportunities for those with disabilities should be part of the promise of electronic formats. Because we do not presume electronic outputs are housed in proprietary systems, we can employ open standards and best practices wherever possible. For example, the MathJax library [MathJax] gives every mathematical expression a textual representation that a screen reader can read in sync with the expression tree. Presently, the HTML markup is being fine-tuned to validate completely and roles from the Accessible Rich Internet Applications [ARIA] standard will then be added.
Diagrams Authored in Source
Graphic images, created with other tools, often get disassociated from their final resting place in a finished document, and then reside only in uneditable formats. We encourage authors to use graphics languages, such as LaTeX's tikz, or Asymptote, to include the source code of graphics within their PreTeXt source. An XSLT stylesheet and a Python script, with other executables, then manage a conversion from source to final image format. SVG is the ideal for online output, since it scales uniformly along with surrounding text.
As mentioned earlier, we have an Author's Guide, which explains the elements and attributes, and partially serves as an implementer's guide. The Publisher's Guide is more concerned with aspects of particular conversions, and topics like choosing an open license.
We have formatted a style guide for source, so that contributors will feel comfortable working with another author's material, and so material can be shared and remixed across different projects. This should remind Python programmers of the infamous PEP 8 [PEP8].
In this section, we describe fundamental decisions which have influenced the construction of the PreTeXt vocabulary. Many of these decisions are important to an author, but are not always readily apparent to a reader.
A printed book delivers information in a format that has evolved over hundreds of years. Some aspects of a book fail miserably in an electronic medium. For example, page numbers. We believe others are still important, like a hierarchical organization expressed through chapters, sections, subsections, and so on. Not unlike this article. So a decision to use XML syntax is natural.
There are many more examples of our fourth Principle:
PreTeXt respects the good design practices which have been developed over the past
centuries. Such as, we limit the width of text in HTML output to fall within the optimal number
of characters for the mechanical and mental aspects of reading.
LaTeX and MathJax
Mathematics is complicated. And complicated typographically. The simplest fraction becomes a two-dimensional exercise rather than a linear exercise (which is not to really imply that regular typography is not two-dimensional). A sequence of complicated expressions (aligned on an equal sign) or a matrix with three rows and five columns, where entries contain symbols, quickly becomes challenging.
MathML would seem an obvious choice. However we want authors to be able to write
and edit their source directly at this point. So requiring MathML would suggest another
in LaTeX syntax. It could be said that MathJax really is the enabling technology
for this entire project. TeX is
great between the dollar signs (its math delimiter), and MathJax supports a broad enough subset of symbols and constructions
to satisfy almost any technical author.
Lists and Display Mathematics
The logical placement of lists and display mathematics (runs of centered equations) was a difficult decision. Rightly or wrongly, we decided that these items are parts of paragraphs (rather than siblings of paragraphs). A fortunate consequence is that the conversion of deeply-nested lists became less difficult. An unfortunate consequence is that this decision is contrary to the HTML specification, so we must frequently explode a PreTeXt paragraph into a sequence of mixed HTML paragraphs, lists, and display mathematics.
Print Representation of Interactive Content
Including interactive components in the online version of a text begs the question of what to do for a print representation. Our solution is to include in print a screenshot and a link, accessible through a matrix barcode, specifically a Quick Response (QR) code.
Normal text is linear across the page as a sequence of characters, but is then stacked vertically as a sequence of lines, paragraphs, figures, lists, etc. as you move down the page or screen. But sometimes you want to go horizontal. Maybe you have three images from a science experiment, and a comparison will work best if you line them up side-by-side. Or, because width always seems to be at a premium, you do not want to leave large amounts of whitespace, and so want to place a table and some text next to each other.
We accomplish this with a container named
sidebyside, which suggests that it is specifying layout. Carrying crude controls for widths,
margins, and vertical alignment, it probably is a layout device. But it is incredibly
useful, and important enough to get an annual refactor every summer.
Short Names, Long Names, Attributes, IDs
XML does not have to look like it was machine-generated. We have tried very hard
to make the vocabulary easy to remember, easy to use, intuitive to read, and not overly
verbose. To this end, we use full names (not abbreviations) for elements used infrequently,
subsection. For frequently used elements, we have kept the names short, often borrowing from
HTML, such as
li. And we have added our own, such as
q (quotes), and
sq (single quotes).
We have tried to keep attributes to a minimum, especially by providing sensible default
behavior in their absence. Boolean attributes are named to work well with only the
no (not true/false, nor 0/1). The
xml:id attribute is used as more of an author-provided
short-identifier. It is the basis of the cross-referencing system, and migrates to other identifiers,
such as URLs in the HTML output.
We try to recycle element names as much as possible, when resulting behavior will
be predictable. The best example is
title, which is ubiquitous. Sometimes this backfires and makes constructing XSL templates
difficult. Some of the places an
introduction can appear are in a
chapter, prior to a list of
objectives, at the start of a
project, and at the start of an
exercisegroup. This works well for authors, but requires some care in processing.
In this section we describe some technical aspects of PreTeXt, and some common templates used in every conversion. Generally these considerations are not immediately apparent to an author, though they do have some effect on their work.
Escaping Escape-Character Hell
With many different markup languages in play there can be considerable confusion moving from one to another. Thankfully, XML has very few special characters. LaTeX has at least ten, JSON has several, Markdown has a multitude, including combinations. In technical documents there is often a need to express every single character literally. PreTeXt aims to insulate an author from all this. An interesting exercise is to give examples of a markup language, authored using that same language. But of course, for us, the author needs to write in XML!
Consider the ampersand, &. (How was that just authored?) In LaTeX this symbol is
used to separate cells of a table, or separate entries of a matrix, or provide
alignment points in a sequence of displayed equations. And in XML is the the escape character. So
PreTeXt provides an
ampersand element for normal text use, which can be output properly into a given markup language.
For use in mathematics, PreTeXt makes available the
\amp TeX macro (which is actually built-in as part of MathJax, and a simple addition for
LaTeX output itself). For verbatim text (such as a logical or bit-wise and-operator
in a programming listing) we provide education and advise authoring & (how was
that just authored?). The conversion then handles literal characters according to
the expectations of the target language. We very assiduously avoid instructing authors
in the use of a
Coordinating LaTeX and XSLT Numbering
LaTeX provides automatic hierarchical numbering of divisions (chapters, etc.) and their contents (theorems, figures, etc.) through a system of counters which increment as used and reset as boundaries are crossed. The depth of the numbering, the number of different counters, and the boundaries where they reset, can all be configured by a knowledgeable user of LaTeX.
In PreTeXt we implement a significant and flexible subset of this behavior through
the use of the
xsl:number instruction. Remarkably, we are able to provide authors with a wide range of options
for customizing numbering, write hard-coded numbers into HTML output, create preamble
instructions for LaTeX, and have the automatic numbers produced by LaTeX match the
numbers created by counting nodes in the XML tree.
LaTeX is fairly permissive about whitespace, though not as much so as HTML. For example,
two consecutive newlines indicates the start of a new paragraph. And there are some
instances where a blank line will cause a fatal error (e.g. the
align environment for multi-line mathematics). Sometimes excess space at the end of a
line will cause subtle additions of unwanted vertical space.
Authors have been conditioned to expect to a lot of freedom in using whitespace liberally.
Further, authors accustomed to programming and revision control like to think of text
in 80-character chunks, so want to be very liberal with newlines. Fortunately,
xsl:strip-space solves the problem of whitespace interspersed with the elements of structured content,
but mixed content proved more delicate. At least we know where that is, and where
it should go. So we can effectively sanitize normal text and LaTeX mathematics without
doing any harm and meeting the expectations of the output format.
A happy consequence of sanitizing the mixed content is that we have a good handle on what that version looks like. We can then provide services like the following. Recall that we decided that display mathematics belongs in a paragraph. When several equations in a row are displayed, and end a sentence, then the period should be rendered in the display, at the end of the last equation (which could have a number pushed to the right margin). The author places the period after the closing tag of the element containing the mathematics, where it logically belongs, and the stylesheet wraps it properly for its automatic insertion into the display at the right place.
Conversion of Legacy LaTeX
There is a wealth of valuable material authored in LaTeX. There is no converter that reliably does a good job converting it to a structured form, since the extensibility of LaTeX and the numerous consequent add-on packages, give an author freedom to accomplish many variations. (Remember that Tex has a Turing-complete language.) However, David Farmer maintains a Python script, based on the use of regular expressions, to break apart a LaTeX document, throw away formatting information, deduce intent and purpose based on previous experience, and then reassemble the content in a structured way, using PreTeXt as the language. His script is not perfect, and never will be, but with each conversion it improves. An author needs to do a careful review, and a little bit of tedious editing, but quickly gains new output formats beyond a PDF.
Some content migrates to new places, like titles going to a Table of Contents. Other content gets duplicated, like the statement of an exercise being reprised in the back of a textbook. We implement cross-references to all but the largest chunks of a book via knowls, which duplicate content.
So it is important to recognize when content is being
born, and so is the original version, and should have a unique identifier, such as an
id attribute, or when it is a duplicate version and none of its components should have
a repeated version of an identifier. This is simply a matter of careful programming,
and passing a flag (original versus duplicate) to templates.
When writing about computer code, either for a text in computer science, or in a discipline that is enhanced by computation, there is the need to use verbatim text that can display any character (including reserved, special, or escape characters). And a typical presentation is a fixed-width font of slightly heavier weight, meant to remind one of typewriters, teletypes, and primitive consoles.
This simply becomes a tedious and tricky exercise to let authors write without concern
for the various output formats and let the stylesheets and templates perform the necessary
manipulations. We have empty elements for many such characters. Some are necessary,
ampersand described above, others are a convenience, such as
copyright for the circled-c symbol.
We do not believe the size of images should be given in explicit units by an author,
nor do we think it is the job of PreTeXt to manipulate the aspect ratio. But the
natural size of an image is not a reliable indicator. So we have authors specify desired width as a percentage of available width. This works well, especially since width is a scarce resource. However, it lacks
precise control, and some authors desire to have the font-size of their diagrams exactly
match the font-size of the surrounding text, in all possible formats. We do need
to make this approach more robust, but do not yet have a concrete plan.
Styling and Themes
The CSS used in our HTML relies on extensive use of semantic class names, with sensible HTML elements for fallback styling when the CSS is not available. We have begun to improve the CSS by refactoring to provide greater modularity and extensibility. So providing alternative styling through additional CSS should see excellent support soon.
A strong suit of LaTeX is its customizability, though that is also one of its weaknesses. We have some good ideas on how to mimic the versatility of CSS to style LaTeX output via a system of hooks in existing templates and simple imported stylesheets.
Example PreTeXt Markup
We finish with a small, but representative, example of PreTeXt markup. When shown
to mathematicians with no prior experience with XML, they are quick to understand
the content of this snippet, to the point where they may even comment on the novelty of the proof.
There is a cross-reference to a previous theorem, the Product Rule, which is not shown,
but would have an
xml:id attribute with value
product-rule. Note the inline mathematics in LaTeX syntax, authored with the
m element, and the single displayed equation in LaTeX syntax, authored with the
PreTeXt has demonstrated that technically-minded authors, and even those from less technical disciplines, can, and will, author using a carefully-designed vocabulary based on an underlying XML syntax. Long-term, PreTeXt may attract front-ends perceived to be more friendly. However, for now, the utility of highly-capable output formats, obtained without additional editing, is enough to win over new authors and projects. The benefits, once properly understood, are hard to deny.
The PreTeXt language is fairly stable today, to where deprecations are infrequent, and planned new elements are few. A robust system for bibliographies, built on Citation Style Language (CSL) [CSL] is the only glaring omission for serious scholarly writing. Work will continue on the vocabulary and schema, while existing and new output formats also improve in parallel.
Partial support for this work was provided by the National Science Foundation's Improving Undergraduate STEM Education (IUSE) program under Award No. 1626455. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
[Plantz 2018] Platz, Bob.
Introduction to Computer Organization: ARM Assembly Language Using the Raspberry Pi. [online]. [cited 22 Apr 2018]. http://bob.cs.sonoma.edu/IntroCompOrg-RPi/intro-co-rpi.html.
[Hutchinson 2018] Hutchinson, Rob.
Music Theory for the 21st-Century Classroom. [online]. [cited 22 Apr 2018]. http://musictheory.pugetsound.edu/mt21c/MusicTheory.html.
[BITS] Book Interchange Tag Suite (BITS). [online]. [cited 3 Jul 2018]. https://www.loc.gov/preservation/digital/formats/fdd/fdd000453.shtml.
[SPARC 2017] Scholarly Publishing and Academic Resources Coalition.
Connect OER Report, 2016-2017. [online]. [cited 22 Apr 2018]. http://sparcopen.org/wp-content/uploads/2017/09/Connect-OER-2016-2017-Annual-Report.pdf.
Adobe Portable Document Format (PDF). [online]. [cited 22 Apr 2018]. https://acrobat.adobe.com/us/en/acrobat/about-adobe-pdf.html.
[Announce] Google Group. pretext-announce. [online]. [cited 22 Apr 2018]. https://groups.google.com/forum/#!forum/pretext-support.
[Support] Google Group. pretext-support. [online]. [cited 22 Apr 2018]. https://groups.google.com/forum/#!forum/pretext-announce.
[Dev] Google Group. pretext-dev. [online]. [cited 22 Apr 2018]. https://groups.google.com/forum/#!forum/pretext-dev.
PreTeXt Author's Guide. [online]. [cited 22 Apr 2018]. http://mathbook.pugetsound.edu/doc/publisher-guide/html/.
[Markdown] Markdown. [online]. [cited 22 Apr 2018]. https://daringfireball.net/projects/markdown/syntax.