Introduction
HTML, XML, and many other systems provide ways to mark the natural language of text portions, such as by defining language, lang, or xml:lang attributes. The values are typically expected to be natural language codes, often with a region-code suffix to distinguish variants (see ISO 639 and RFC 5646). Typically, such codes consist of a 2- or 3-letter language code, sometimes followed by a 2-letter region code. For example:
<div lang="en-GB">Put the portmanteau in the boot.</div>
Such identification is useful for many kinds of processing:
-
switching spelling, grammar, or style checkers to a language-appropriate configuration
-
applying language-specific font or layout preferences
-
triggering special treatment of punctuation, ligatures, or other details whose customs vary by language.
Technical documentation, tutorials, and other documents commonly contain code as well as natural-language text, such as for examples. Those content portions have similar needs for language-dependent processing:
-
Spelling and other language checking should typically be turned off within program code, or at least operate differently. One might wish to substitute a programming-language-specific checker such as lint.
-
Syntax highlighting is useful in some contexts, but it too is language-specific.
-
Some special treatment commonly applicable to natural-language text should not apply, because it is actively harmful to program code. For example, many editors automatically change straight quotes to curly quotes, but code rarely benefits from such changes. Likewise for converting two hyphens to an em dash, composing ligatures, or re-wrapping lines using prose rather than code conventions.
Language specification methods (most obviously ISO 639) have excellent coverage of natural languages, including ancient ones and even extending to constructed languages such as Esperanto and Klingon. However, they do not provide for programming languages, making it harder to gain similar benefits. There are reasons for this: programming languages are invented far more often than natural languages; they are used in different ways by different communities; and they differ typologically.
Despite many differences, there are reasons that natural and programming languages are both called "languages". Both are generative systems that allow expressing an unbounded range of information in interpretable ways. More relevant for the current purpose, text in both kinds of languages frequently occurs in documents, and knowing which kind and which specific language a part is in affects many aspects of how it should typically be processed.
To gain benefits similar to those of labelling the natural language of content, it would be useful to enable labelling the programming language of content where applicable. Labelling data formats might be of similar value when data in those formats is embedded in documents.
A given piece of content is normally only in one language (whether natural or program), barring unusual edge cases such as code carefully crafted to conform simultaneously to multiple syntaxes, or poetry carefully crafted to also be executable. This proposal does not address such cases. The usual case, that content is generally not in a natural language if it is program code and vice versa, means that the very same labelling mechanism can be used in both cases so long as the specific codes do not collide. Indeed, this complementary distribution suggests that doing so is desirable.
Therefore this proposal does not suggest any new attributes or other constructs, but an extension to the set of permissible values for existing ones. That is, we propose using constructs such as lang and xml:lang in the usual manner but with additional values, to identify document portions such as program code. This seems clearly better than leaving such portions unidentified, which leads to poor results from spelling and grammar checkers, inappropriate formatting, and sometimes active corruption of code by applying language-inappropriate "corrections" or "enhancements". It also seems better than the epub solution of labelling them merely zxx or und (see below).
Leveraging existing language indicators has the advantage that inheritance works as expected: for example, if a chapter in Japanese contains a code snippet in Java, setting the language to Java locally overrides the containing language. If a separate indicator were used instead, it would suggest that the two are orthogonal, which they are not. Even if that Java code uniformly uses Japanese identifiers, the code remains in Java, not in the natural language Japanese: its syntax, even down to what strings are permissible identifiers, is defined by the rules of Java, not of Japanese per se. Once back in the non-code text, the rules of Japanese apply again.
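The inheritance behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names (chapter, para, code) and the helper effective_languages are hypothetical, and the sketch simply walks an ElementTree and carries the nearest enclosing xml:lang value down to each element.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=None):
    """Yield (tag, effective language) for an element and its descendants.

    Hypothetical helper: a local xml:lang overrides the inherited value,
    and the inherited value applies again outside that element.
    """
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from effective_languages(child, lang)


# Hypothetical document: a Japanese chapter containing a Java snippet.
doc = ET.fromstring(
    '<chapter xml:lang="ja">'
    "<para>text</para>"
    '<code xml:lang="qpr-x-java">int x = 1;</code>'
    "<para>more text</para>"
    "</chapter>"
)
langs = list(effective_languages(doc))
for tag, lang in langs:
    print(tag, lang)
```

The code element reports qpr-x-java, while the paragraphs before and after it report ja, with no second attribute needed.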
The lang attribute in HTML and xml:lang in XML provide standardized mechanisms for language identification, but they and most other mechanisms use language codes that cover only natural languages and certain variants (plus a few edge cases such as constructed languages). Obvious approaches to providing normative names for specific programming or other languages would be to add them to the ISO 639 space or to separately define non-conflicting extensions. However, because programming languages occupy a different conceptual space, are used by a smaller community, and have a much higher rate of additions, the high level of co-ordination required for such approaches seems impractical.
Instead, this proposal takes a lesson from how ISBNs were incorporated as a sub-space within product codes. Barcodes for retail products begin with a country code. ISBNs were inserted as a single new "country" called "Bookland", with code 978 (see Barcoding Guidelines). This approach neatly separated concerns: books could do their own thing within their own space (and what they do is use ISBNs), and not trip over other product codes (or vice versa).
Similarly, we propose a single top-level "language" code, qpr, to cover the space of all programming languages. ISO 639 reserves the codes qaa through qtz for local use, and qpr falls within that range. The particular programming language is then specified by an additional code field, using the already familiar rules of RFC 5646. Because programming languages rarely have "regional" dialects or variants per se, this shift in the "region" semantics seems fair. Grouping programming languages under one code makes it trivial to tell whether given content is natural language text or program code. Some processing, such as turning off spelling correction or fancy quotes, may only need that distinction, while other processing such as syntax highlighting would leverage the more specific sub-codes in the remainder.
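As a rough illustration of that two-level use, the following Python sketch separates the coarse decision (natural language or program code) from the finer one (which programming language). The function names split_language_tag and wants_spellcheck are hypothetical, and the sketch assumes the sub-code is simply the first field after qpr, skipping an optional x.

```python
def split_language_tag(tag):
    """Return (kind, subcode), where kind is "program" or "natural".

    Hypothetical helper: anything whose first field is qpr counts as
    program code; the sub-code is the next field, ignoring a leading x.
    """
    parts = tag.lower().split("-")
    if parts[0] != "qpr":
        return "natural", None
    rest = parts[1:]
    if rest and rest[0] == "x":
        rest = rest[1:]
    return "program", (rest[0] if rest else None)


def wants_spellcheck(tag):
    """Coarse decision: only natural-language content is spell-checked."""
    return split_language_tag(tag)[0] == "natural"


for tag in ("en-GB", "qpr-x-py", "QPR-py", "qpr"):
    print(tag, split_language_tag(tag), wants_spellcheck(tag))
```

A spelling checker would need only wants_spellcheck, while a syntax highlighter would look at the sub-code returned by split_language_tag.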
There does not appear to be a normative registry of programming languages, although many have language definitions promulgated via ISO, IETF, W3C, or other organizations. This proposal does not require creating a registry (though it also has no objections should one arise). Rather, it suggests using the widely used de facto codes embodied in file extensions: py for Python, lisp for LISP, and so on. For practical reasons these tend to be nonconflicting. However, there are exceptions such as dot and m. In such cases more is needed. This proposal thus also includes means for declaring unambiguous mapping of language codes to URLs, as is typically desired for Web vocabularies. We illustrate a mechanism for this in XML that leverages already-existing mechanisms and requires no additions to parsers, DOM implementations, etc. We also give possible approaches to achieve the same kind of mapping in HTML. Similar approaches could be applied to any document format or processing system that supports language identification codes, from markup languages to content management systems and beyond.
Background and Related Work
Current Practice
There are practices for identifying entire files as being in one language or another, such as shebang lines, hidden file type metadata, registries, and file extensions. However, most technical documentation systems either ignore programming language identification altogether or achieve it through implementation-specific mechanisms:
-
GitHub Flavored Markdown: Uses fenced code blocks with language identifiers (```python)
-
HTML/CSS: Relies on class attributes (<code class="language-python">)
-
DocBook: Uses the language attribute on programlisting elements (very similar to, and compatible with, the current proposal)
These approaches are useful, but share the limitation of providing no mechanism for resolving language identifiers to authoritative specifications.
Details of RFC 5646
RFC 5646 defines the syntax of language codes as a series of hyphen-separated ASCII alphanumeric tokens, which ignore case and are each limited to 8 characters. It trades heavily on fixed token lengths, but does leave room for extensions.
The forms supported include a langcode which contains the parts described below (with all but the first part optional):
-
a language, specified by the shortest 2- or 3-letter ISO 639 code (thus en rather than eng for English), or a 4-letter value (those are reserved), or a 5- to 8-letter code registered with IANA. The language may also include from 1 to 3 hyphen-separated 3-letter ISO 639 codes comprising an extlang suffix. Preferably all lower case.
-
a 4-character script (orthography) code drawn from ISO 15924. Preferably initial-cap.[1]
-
a 2-letter or 3-digit region code drawn from ISO 3166-1 or UN M.49, respectively (these mainly enumerate countries, although some languages have regional variants that are not so conveniently bounded). Preferably all-cap.
-
a 5- to 8-letter registered variant code.
-
an extension code consisting of a single alphanumeric other than x, a hyphen, and 2-8 alphanumerics.
-
a private use code consisting of x, a hyphen, and 1-8 alphanumerics.
There are also certain "grandfathered" strings, and x- plus a private-use token. To fully conform to RFC 5646, programming language or data format codes must begin with x- (the qpr language code already conforms to ISO 639, so does not need similar treatment). This proposal recommends that practice for all contexts where full conformance to the RFC definition is required or helpful. However, this proposal also recommends that software recognize qpr subtags without regard to whether they begin with x-, just as they should accept them without regard to case.
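The structure of an RFC 5646 tag can be made concrete with a regular expression. The Python sketch below is a simplification written for this illustration, not the normative grammar: it omits grandfathered tags and tags consisting only of private-use fields, and the names LANGTAG and parse_langtag are hypothetical.

```python
import re

# Simplified sketch of the RFC 5646 tag structure (not the full ABNF).
LANGTAG = re.compile(
    r"""
    ^(?P<language>[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})
    (?:-(?P<script>[a-z]{4}))?
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)
    (?:-(?P<privateuse>x(?:-[a-z0-9]{1,8})+))?$
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_langtag(tag):
    """Return the non-empty parts of a tag as a dict, or None if malformed."""
    match = LANGTAG.match(tag)
    if match is None:
        return None
    # Drop absent parts and the hyphen that leads repeated groups.
    return {k: v.lstrip("-") for k, v in match.groupdict().items() if v}


for tag in ("en-GB", "sr-Cyrl", "qpr-x-py", "qpr-py"):
    print(tag, parse_langtag(tag))
```

Under these rules qpr-x-py parses as language qpr plus a private-use part x-py, whereas qpr-py still parses but py lands in the region slot. That is one way to see why the x- form is the fully conformant one and why the shorter form needs explicit support from receiving software.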
Document Language Identification
Both HTML and XML provide built-in language identification through lang and xml:lang attributes respectively, following BCP 47 language tag conventions. Many specific XML schemas just use xml:lang, but some make similar provisions of their own (such as DocBook's language, ODF's fo:language, and OOXML's w:lang). Despite such minor variations this general approach offers several advantages:
-
Standardized inheritance semantics
-
Ease of integration with CSS language selectors and XSLT processing
-
Support for hierarchical language switching
-
Compatibility with accessibility tools and screen readers
-
Usefulness for language-specific spell checking and grammar validation
Nearly all such specifications defer to ISO 639, RFC 5646, and BCP 47. Some make a few additions:
-
HTML adds "" (the empty string) and a private-use x- prefix
-
epub adds zxx for "no linguistic content" and und for "undetermined language" (these special-purpose codes name no particular language, but they at least allow systems to avoid inappropriate processing such as spelling correction or quote "education". They do not, however, help toward programming-language-specific treatments).
-
TEI recommends such standardized codes but allows others.
-
OOXML and ODF mainly use standard codes but make some additions.
-
DocBook specifically distinguishes programming languages by offering the language attribute on elements such as programlisting and code, but xml:lang elsewhere. This distinction is intuitive, but does add another name despite complementary distribution of values, and assumes special treatment of inheritance (for example, a code element conceptually should not inherit the value of xml:lang from its parent in the usual fashion).
-
JATS notes the distinction but does not seem to provide a specific mechanism.
Language code values generally focus on natural languages, and while some specifications acknowledge or even provide generically for programming languages, none seems to provide or even recommend particular codes.
Proposed Approach
Language Code Structure
We propose using qpr as a base language code to cover the space of
programming languages, followed by specific programming language identifiers as
suffixes. For example:
<code xml:lang="qpr-x-py">
def fibonacci(n):
    return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)
</code>
The qpr code is in the range that ISO 639 reserves for local use, which avoids conflict with any natural language codes while clearly indicating programming content. The x- accords with RFC 5646 rules, so that py is treated as a private-use subtag. This proposal recommends that such codes be drawn from established file extensions, to leverage existing developer knowledge.
File Extension Alignment
Programming language suffixes align with conventional file extensions to minimize cognitive overhead:
-
qpr-x-py ↔ .py (Python)
-
qpr-x-js ↔ .js (JavaScript)
-
qpr-x-css ↔ .css (CSS)
-
qpr-x-html ↔ .html (HTML)
-
qpr-x-xml ↔ .xml (XML)
-
qpr-x-sql ↔ .sql (SQL)
-
qpr-x-rs ↔ .rs (Rust)
-
qpr-x-go ↔ .go (Go)
Where conflicts exist (e.g., .m used by Objective-C, MATLAB, and Mathematica), explicit names resolve ambiguity: qpr-x-objc, qpr-x-matlab, qpr-x-mathemat (the last is shortened because "Mathematica" is longer than the 8-character limit for RFC 5646 subtags). Redundant cases also arise, such as HTML using both html and htm extensions; these may be annoying but are not very harmful since it is easy to recognize both.
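The mapping from file names to codes, including the ambiguous and redundant cases just mentioned, might be sketched as follows in Python. The function code_for_filename, its hint parameter, and the two small tables are hypothetical, and truncating long extensions to 8 characters is an assumption modelled on the mathemat example rather than a rule stated by this proposal.

```python
import os

# Extensions with several well-known meanings need an explicit name.
AMBIGUOUS = {"m": ("objc", "matlab", "mathemat")}
# Redundant spellings are folded onto a single code.
ALIASES = {"htm": "html"}


def code_for_filename(filename, hint=None):
    """Derive a qpr-x- code from a file name (hypothetical helper)."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if not (ext.isascii() and ext.isalnum()):
        raise ValueError(f"no usable extension in {filename!r}")
    if ext in AMBIGUOUS:
        if hint not in AMBIGUOUS[ext]:
            raise ValueError(f".{ext} is ambiguous; pass one of {AMBIGUOUS[ext]}")
        return "qpr-x-" + hint
    ext = ALIASES.get(ext, ext)
    # Assumption: overly long extensions are cut to the 8-character limit.
    return "qpr-x-" + ext[:8]


print(code_for_filename("hello.py"))
print(code_for_filename("index.HTM"))
print(code_for_filename("solver.m", hint="matlab"))
```

Calling code_for_filename("solver.m") without a hint raises ValueError, which makes the ambiguity explicit instead of silently picking one meaning.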
As noted above, this proposal strongly recommends that application code receiving a language code beginning with qpr- recognize and accept the remainder whether or not it begins with x-. For example, while qpr-x-py is the full correct code and is what should be generated, receiving programs should also accept qpr-py as synonymous.
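A receiving program could apply that recommendation by normalizing every incoming code to the full form before comparing it. The Python sketch below shows one way to do so; canonical_qpr is a hypothetical name, and lower-casing the result is an assumption based on the case-insensitivity of language tags.

```python
def canonical_qpr(tag):
    """Normalize a qpr code to the qpr-x-... form (hypothetical helper).

    Tags that do not start with qpr are returned unchanged.
    """
    parts = tag.strip().lower().split("-")
    if parts[0] != "qpr":
        return tag
    rest = parts[1:]
    if rest and rest[0] == "x":
        rest = rest[1:]
    if not rest:
        return "qpr"
    return "-".join(["qpr", "x"] + rest)


for tag in ("qpr-py", "QPR-X-Py", "qpr-x-py", "en-GB"):
    print(tag, "->", canonical_qpr(tag))
```

With this in place, qpr-py, QPR-X-Py, and qpr-x-py all compare equal after normalization, while natural-language tags pass through untouched.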
Hierarchical Encoding Support
Sometimes formats are stacked. One example is the common use of tar + gzip, leading to names like archive.tar.gz. A different case includes metalanguages such as XML or BNF, within which many specific languages may be defined. RFC 3023 addressed the latter case with the +xml suffix convention for MIME types, as in application/docbook+xml. Similarly, this proposal permits chaining encodings using - separators for scenarios like these:
<data xml:lang="qpr-png-base64">
iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8...
</data>
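One possible reading of such a chained code is that the last field names the outermost encoding, as with archive.tar.gz. The Python sketch below works under that assumption: it peels off the encodings it knows how to undo and reports what remains. The function unwrap, the DECODERS table, and the gz subtag are hypothetical.

```python
import base64
import gzip

# Hypothetical table of encodings this sketch knows how to undo.
DECODERS = {"base64": base64.b64decode, "gz": gzip.decompress}


def unwrap(tag, payload):
    """Undo the outer encodings named by a chained qpr code.

    Returns (remaining_fields, data). Assumes the last field is the
    outermost layer, by analogy with archive.tar.gz.
    """
    parts = tag.lower().split("-")
    if parts[0] != "qpr":
        raise ValueError(f"not a qpr tag: {tag!r}")
    layers = parts[1:]
    if layers[:1] == ["x"]:
        layers = layers[1:]
    data = payload
    while layers and layers[-1] in DECODERS:
        data = DECODERS[layers.pop()](data)
    return layers, data


PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
remaining, raw = unwrap("qpr-png-base64", base64.b64encode(PNG_SIGNATURE))
print(remaining, raw == PNG_SIGNATURE)
```

After unwrapping, the remaining field (png here) tells the application which format handler should receive the decoded bytes.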
Resolving Codes to Definitions
An important characteristic of names on the Web is that they eventually map to URLs. This can be accomplished through various mechanisms: normative registries (like ISO 639 for languages, ISO 15924 for scripts, and ISO 3166 for country codes), local declarations (like XML namespace declarations), de facto conventions (like file extensions), or other means.
This proposal defines a single top-level language code qpr to conform to ISO 639, then creates an extension space for particular programming languages. This subspace does not, at present, have a formal registry: our research found no existing ISO or ANSI enumeration of programming language codes, despite the extensive work of ISO/IEC JTC 1/SC 22 on individual language specifications.
However, this proposal provides two mechanisms for associating codes with definitive specifications: (a) a strong recommendation to use existing file extensions with their established meanings, and (b) declaration-based association of codes with authoritative URLs. The proposal does not itself require use of such declarations (of course, any particular use of the specification could add such a requirement if desired in its own context).
XML NOTATION Declarations
In the case of XML, NOTATION is an already-defined mechanism for
declaring external data formats and associating them with public identifiers and
system identifiers (such as URIs):
<!NOTATION png SYSTEM "https://www.w3.org/TR/png-3/">
Defined notations may then be used to qualify declarations of external entities, or
on attributes of type NOTATION.
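<!ENTITY figure_12 SYSTEM "pix/fig12.png" NDATA png>
<!-- figure is a hypothetical element; unparsed entities can be referenced only from ENTITY-typed attributes -->
<!ATTLIST figure
    src ENTITY #REQUIRED>
<!ATTLIST figlink
    href CDATA #REQUIRED
    fmt NOTATION (png) #REQUIRED
    alt CDATA #REQUIRED>
<figure src="figure_12"/>
<figlink href="http://example.com/globe.png" fmt="png" alt="A dog."/>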
Notations are most often used to identify data formats such as for included images, videos, etc. However, they are not defined so as to limit them in that way. Nothing prevents declaring programming language codes as XML NOTATIONs, or using them on NDATA qualifiers or NOTATION attributes:
<!NOTATION qpr-x-py PUBLIC "-//Python Software Foundation//Python 3//EN"
    "https://docs.python.org/3/reference/grammar.html">
<!NOTATION qpr-x-rs PUBLIC "-//Mozilla Foundation//Rust Language//EN"
    "https://doc.rust-lang.org/reference/">
<!NOTATION qpr-x-js PUBLIC "-//ECMA International//ECMAScript 2023//EN"
    "https://tc39.es/ecma262/">
<!ENTITY example_1 SYSTEM "snippets/ex1.py" NDATA qpr-x-py>
<!-- snippet is a hypothetical element; unparsed entities can be referenced only from ENTITY-typed attributes -->
<!ATTLIST snippet
    src ENTITY #REQUIRED>
<!ATTLIST pre
    lang NOTATION (qpr-x-py | qpr-x-rs | qpr-x-js) #IMPLIED>
<snippet src="example_1"/>
<pre lang="qpr-x-py">
import sys
print(sys.version)
</pre>
This leverages the notation mechanism to map names to URIs and to formalize the identification of programming languages. It requires no extensions to existing XML tools and processing. Applications that wish to leverage the information, however, can do so in a clear and reasonably familiar way. This provides a principled way to disambiguate ambiguous extensions, and/or to add new names on a per-document or per-schema basis.
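To show that existing tooling already exposes this information, here is a small Python sketch that reads NOTATION declarations from a document's internal DTD subset using the standard library's expat binding. The function notation_uris and the sample document are hypothetical; the sketch assumes the declarations sit in the internal subset, since external subsets are not fetched here.

```python
import xml.parsers.expat


def notation_uris(xml_text):
    """Map each declared notation name to its public and system identifiers."""
    notations = {}

    def on_notation(name, base, system_id, public_id):
        notations[name] = {"public": public_id, "system": system_id}

    parser = xml.parsers.expat.ParserCreate()
    parser.NotationDeclHandler = on_notation
    parser.Parse(xml_text, True)
    return notations


# Hypothetical document with two notation declarations.
DOC = """<!DOCTYPE doc [
<!NOTATION qpr-x-py PUBLIC "-//Python Software Foundation//Python 3//EN"
    "https://docs.python.org/3/reference/grammar.html">
<!NOTATION qpr-x-js SYSTEM "https://tc39.es/ecma262/">
<!ELEMENT doc (#PCDATA)>
]>
<doc/>"""

for name, ids in notation_uris(DOC).items():
    print(name, ids["system"])
```

An application could consult the resulting table whenever it meets a lang or xml:lang value, for instance to link a code sample to the governing specification.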
HTML meta Declarations
HTML can simply use the existing lang attribute and applications can
leverage it as needed. For example, a spelling checker can easily skip any parts
where lang starts with qpr; an editor can easily extract
Python code from an HTML document by looking for lang set to
qpr-x-py, and so on.
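The extraction case can be sketched with Python's standard html.parser module. The class CodeExtractor and the sample page are hypothetical, and the sketch makes simplifying assumptions: it matches the lang value exactly (after lower-casing) and tracks inheritance with a plain stack of open elements.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class CodeExtractor(HTMLParser):
    """Collect text whose effective lang equals the wanted code."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted.lower()
        self.stack = []   # (tag, effective lang) for each open element
        self.chunks = []

    def current_lang(self):
        return self.stack[-1][1] if self.stack else ""

    def handle_starttag(self, tag, attrs):
        lang = dict(attrs).get("lang")
        if lang is None:
            lang = self.current_lang()  # inherit from the parent
        if tag not in VOID:
            self.stack.append((tag, lang.lower()))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.current_lang() == self.wanted:
            self.chunks.append(data)


page = ('<html lang="en"><body><p>Run this:</p>'
        '<pre lang="qpr-x-py">import sys\nprint(sys.version)</pre>'
        '<p>Done.</p></body></html>')
extractor = CodeExtractor("qpr-x-py")
extractor.feed(page)
extractor.close()
code = "".join(extractor.chunks)
print(code)
```

A spelling checker could reuse the same stack and simply skip data whenever current_lang() starts with qpr.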
If desired, HTML could define an alternative mechanism to associate language codes
with specifications. One possibility would be using meta elements. HTML
could of course also choose to provide or reference a set of applicable codes.
<meta name="lang-qpr-x-py" content="https://docs.python.org/3/reference/">
<meta name="lang-qpr-x-js" content="https://tc39.es/ecma262/">
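If such meta elements were adopted, reading them would be straightforward. The Python sketch below collects them into a table from language code to specification URL. It is purely illustrative: the lang- naming convention is only the possibility floated above, not an existing HTML feature, and the class LangMeta is hypothetical.

```python
from html.parser import HTMLParser


class LangMeta(HTMLParser):
    """Collect <meta name="lang-..."> declarations into a dict."""

    def __init__(self):
        super().__init__()
        self.specs = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name.startswith("lang-") and content:
            self.specs[name[len("lang-"):]] = content


head = ('<meta name="lang-qpr-x-py" content="https://docs.python.org/3/reference/">'
        '<meta name="lang-qpr-x-js" content="https://tc39.es/ecma262/">'
        '<meta name="viewport" content="width=device-width">')
reader = LangMeta()
reader.feed(head)
reader.close()
print(reader.specs)
```

Unrelated meta elements, such as the viewport declaration in the sample, are ignored.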
The HTML Working Group would have the ultimate say, but users could of course use the available attribute in the meantime if desired, trusting to the near-standardization of extensions in common practice. CSS can easily refer to attribute values within selectors, so per-language rendering could be managed. For example, if <pre> were used both for Python code and for English poetry, the cases could be distinguished like this:
<style type="text/css">
pre[lang="qpr-x-py"] { font-family:"Courier", monospace; }
pre[lang="en-us"] { font-family:"Garamond"; }
</style>
Implementation Considerations
Backward Compatibility
The approach requires no changes to XML processing infrastructure. Existing
parsers handle qpr-* language codes as ordinary attribute values. If
any XML or other applications actively check language code values, they should
accept the forms defined here, since they are valid (other than the suggested
support for omitting x-). NOTATION declarations are parsed
and stored in the document information set without affecting core processing (except
when applied to external entities, in which case they do the right thing).
Legacy tools typically ignore unknown language codes gracefully. CSS can leverage the codes for formatting distinctions. Enhanced processors could leverage codes and/or URIs for syntax highlighting, validation, and/or cross-reference generation.
Validation
Validation works unchanged. Syntactic validity of codes need not be addressed by XML, JSON, or other validation tools, just as grammatical checking of natural language text is not. However, such checking is possible once particular languages are identified. Also, if desired, language codes can be constrained through enumerated types or pattern restrictions in schemas:
<!ATTLIST code xml:lang (en|fr|qpr-x-py|qpr-x-js|qpr-x-rs) #IMPLIED>
Example language codes
Programming languages
These are just the TIOBE top 20 languages as of this writing.
| Language | qpr-x- | Extension | Possible NOTATION URI |
|---|---|---|---|
| Python | py | .py | https://docs.python.org/3/reference/ |
| C++ | cpp | .cpp | https://isocpp.org/std/the-standard |
| C | c | .c | https://www.iso.org/standard/74528.html |
| Java | java | .java | https://docs.oracle.com/javase/specs/ |
| C# | cs | .cs | https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/ |
| JavaScript | js | .js | https://tc39.es/ecma262/ |
| Go | go | .go | https://go.dev/ref/spec |
| Visual Basic | vb | .vb | https://docs.microsoft.com/en-us/dotnet/visual-basic/language-reference/ |
| Delphi/Object Pascal | pas | .pas | https://www.freepascal.org/docs.html |
| SQL | sql | .sql | https://www.iso.org/standard/76583.html |
| Fortran | f90 | .f90 | https://www.iso.org/standard/72320.html |
| Scratch | sb3 | .sb3 | https://scratch.mit.edu/developers |
| PHP | php | .php | https://www.php.net/manual/en/langref.php |
| R | r | .r | https://cran.r-project.org/doc/manuals/r-release/R-lang.html |
| Ada | ada | .adb | https://www.iso.org/standard/61507.html |
| MATLAB[2] | matlab | .m | https://www.mathworks.com/help/matlab/language-fundamentals.html |
| Assembly | asm | .asm | https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html |
| Rust | rs | .rs | https://doc.rust-lang.org/reference/ |
| Perl | pl | .pl | https://perldoc.perl.org/perlsyn |
| COBOL | cob | .cob | https://www.iso.org/standard/74527.html |
A few data formats
| Format | qpr-x- | Extension | Possible NOTATION URI |
|---|---|---|---|
| SQL | sql | .sql | |
| YAML | yaml | .yaml | |
| JSON | json | .json | |
| Regex | regex | .regex | |
| Base64 | b64 | .b64 | |
Shell and scripting
| Shell | qpr-x- | Extension | Possible NOTATION URI |
|---|---|---|---|
| Bash | bash | .bash | |
| Zsh | zsh | .zsh | |
| Fish | fish | .fish | |
| PowerShell | ps1 | .ps1 | |
Usage examples
Technical documentation
<article xml:lang="en">
  <title>Getting Started with Rust</title>
  <para>Here's a simple Rust program:</para>
  <programlisting xml:lang="qpr-x-rs">
fn main() {
    println!("Hello, world!");
}
  </programlisting>
  <para>To compile it, run:</para>
  <screen xml:lang="qpr-x-bash">
rustc hello.rs
./hello
  </screen>
</article>
API Documentation
<function xml:lang="en">
  <funcsynopsis><funcdef>authenticate</funcdef></funcsynopsis>
  <para>Authenticates a user via JWT token</para>
  <example xml:lang="qpr-x-py">
import jwt
token = jwt.encode({"user": "alice"}, secret, algorithm="HS256")
  </example>
  <example xml:lang="qpr-x-js">
const jwt = require('jsonwebtoken');
const token = jwt.sign({user: 'alice'}, secret, {algorithm: 'HS256'});
  </example>
</function>
Configuration Examples
<section>
  <title>Configuration</title>
  <programlisting xml:lang="qpr-x-docker">
FROM python:3.9
COPY requirements.txt .
RUN pip install -r requirements.txt
  </programlisting>
  <programlisting xml:lang="qpr-x-yaml">
version: '3.8'
services:
  web:
    build: .
    ports:
      - "8000:8000"
  </programlisting>
</section>
Remaining issues
In some cases it is important to specify a particular version number for programming
languages. This could be done by defining separate extension codes (say,
f66 vs. f90 above), or perhaps better, appending a field
specifically for the version number, for example qpr-py-3.11.
Data formats such as CSV, PNG, and countless others, as well as templating, configuration, and other systems, may be considered one or more further categories of language. So considered, they are largely complementary to natural and programming languages. Similar declaration functionality might be useful for them as well. In that case, the same approach could potentially be applied (say, with prefix qda).
The boundaries between programming languages, data representation languages, and markup languages are imprecise. Some interesting boundary cases include XSLT, PostScript, and regexes.
XSD does support the NOTATION datatype for attributes. However, it does
not provide an equivalent for NOTATION declarations.
Conclusion
The approach requires no formal standardization or updating of applications for immediate use or adoption. Organizations or individuals can begin using the conventions immediately while building consensus around language code assignments.
This paper presents a backward-compatible approach to programming language identification in documents that leverages existing standards (natural language tags and XML NOTATION) to provide both machine-readable identification and dereferenceable specifications.
The approach offers several advantages:
-
Zero breaking changes to existing document processing
-
Broad coverage of programming languages (and potentially data formats)
-
URI-based grounding through NOTATION declarations
-
Tool integration pathway for enhanced processing
-
Extensible framework for future language additions
By building on existing language identification infrastructure rather than introducing new mechanisms, the approach provides a path toward better programming language support in technical documentation while maintaining full compatibility with existing toolchains. Using the same mechanism also properly captures the complementary use of natural vs. programming language syntax.
The combination of familiar file extension conventions with formal NOTATION declarations strikes a good balance between developer usability and semantic precision, enabling both human authors and automated tools to work more effectively with diverse technical content.
Acknowledgments
The author thanks Claude (Anthropic) for collaborative development of these ideas and significant contributions to the analysis and presentation in this paper.
References
[RFC 3986] Berners-Lee, T., R. Fielding, L. Masinter. Uniform Resource Identifier (URI): Generic Syntax. RFC 3986. January 2005.
[Barcoding Guidelines] BISG. Barcoding Guidelines for the U.S. Book Industry.
Retrieved 2025-06-04. https://www.bisg.org/barcoding-guidelines-for-the-us-book-industry
[XML 1.0] Bray, T., J. Paoli, C.M. Sperberg-McQueen, E.
Maler, F. Yergeau. Extensible Markup Language (XML) 1.0 (Fifth Edition).
W3C Recommendation. November 2008.
[ISO 15924] International Organization for Standardization. 2004. ISO 15924: Codes for the representation of names of scripts.
See also https://en.wikipedia.org/wiki/ISO_15924
[RFC 5646] Phillips, A., M. Davis (eds). RFC 5646:
Tags for Identifying Languages.
September 2009. https://www.rfc-editor.org/info/rfc5646
[HTML] WHATWG. HTML Standard, Section 3.2.6.2: The lang and xml:lang attributes.
https://html.spec.whatwg.org/multipage/dom.html#the-lang-and-xml:lang-attributes
[ISBN] Wikipedia. International Standard Book
Number.
https://en.wikipedia.org/wiki/International_Standard_Book_Number
[RFC 3023] Murata, M., S. St. Laurent, D. Kohn. XML Media Types. RFC 3023. January 2001.
[TEI P5] Text Encoding Initiative. TEI P5:
Guidelines for Electronic Text Encoding and Interchange.
TEI Consortium.
Version 4.6.0. April 2023.
[DocBook 5.0] Walsh, N. The DocBook Schema Version 5.0.
14 Mar 2008. https://docbook.org/specs/docbook-5.0-spec-cd-03.html
[1] This token is rarely included except when the language in question is written using multiple orthographies (such as Serbian, which might use either sr-Cyrl or sr-Latn), or when transliteration is used (such as el-Latn).
[2] MATLAB is given as an exception to the usual rules, because .m has other conflicting uses.