How to cite this paper

DeRose, Steven J. “Language Identification for Program Code in Documents.” Presented at Balisage: The Markup Conference 2025, Washington, DC, August 4 - 8, 2025. In Proceedings of Balisage: The Markup Conference 2025. Balisage Series on Markup Technologies, vol. 30 (2025). https://doi.org/10.4242/BalisageVol30.DeRose02.

Balisage: The Markup Conference 2025
August 4 - 8, 2025

Balisage Paper: Language Identification for Program Code in Documents

Steven J. DeRose

Consultant

`<sderose@acm.org>`

ORCID ID: https://orcid.org/0000-0003-4216-548X

Steve DeRose has been working with electronic document and hypertext systems since 1979. He holds degrees in Computer Science and in Linguistics and a Ph.D. in Computational Linguistics from Brown University.

He co-founded Electronic Book Technologies in 1989 to build the first SGML browser and retrieval system, DynaText, and has been deeply involved in document standards including XML, TEI, HyTime, HTML 4, XPath, XPointer, EAD, Open eBook, OSIS, NLM, and others. He has served as adjunct faculty at Brown and Calvin Universities and has written many papers, two books, and fifteen patents. Most recently he has been working as a consultant in text analytics.

Abstract

This paper proposes a simple approach to labelling the language of programming code embedded in structured documents, using standard language attributes (such as xml:lang and HTML lang) with a single reserved language code, qpr, and with specific programming languages and/or data formats distinguished by the subtags that follow it. This approach facilitates language-specific processing including syntax highlighting, spell checking, and validation while maintaining backward compatibility with existing document processing systems.

Introduction

Background and Related Work

Current Practice
Details of RFC 5646
Document Language Identification

Proposed Approach

Language Code Structure
File Extension Alignment
Hierarchical Encoding Support

Resolving Codes to Definitions

XML NOTATION Declarations
HTML meta Declarations

Implementation Considerations

Backward Compatibility
Validation

Example language codes

Programming languages
A few data formats
Shell and scripting

Usage examples

Technical documentation
API Documentation
Configuration Examples

Remaining issues

Conclusion

Acknowledgments

Introduction

HTML, XML, and many other systems provide ways to mark the natural language of text portions, such as by defining language, lang, or xml:lang attributes. The values are typically expected to be natural language codes, often with a region-code suffix to distinguish variants (see ISO 639 and RFC 5646). Typically, such codes consist of a 2- or 3-letter language code, plus sometimes a 2-letter region code. For example:

<div lang="en-GB">Put the portmanteau in the boot.</div>

Such identification is useful for many kinds of processing:

switching spelling, grammar, or style checkers to a language-appropriate configuration
applying language-specific font or layout preferences
triggering special treatment of punctuation, ligatures, or other details whose customs vary by language.

Technical documentation, tutorials, and other documents commonly contain code as well as natural-language text, such as for examples. Those content portions have similar needs for language-dependent processing:

spelling and other language checking should typically be turned off within program code, or at least operate differently. One might wish to substitute a programming-language-specific checker such as lint.
In some contexts syntax highlighting is useful, but again is language-specific.
some special treatment commonly applicable to natural-language text should not apply, because it is actively harmful to program code. For example, many editors automatically change straight quotes to curly quotes, but code rarely benefits from such changes. Likewise for converting two hyphens to an em dash, composing ligatures, or re-wrapping lines using prose rather than code conventions.

Language specification methods (most obviously ISO 639) have excellent coverage of natural languages, including ancient ones and even extending to constructed languages such as Esperanto and Klingon. However, they do not provide for programming languages, making it harder to gain similar benefits. There are reasons for this: programming languages are invented far more often than natural languages; they are used in different ways by different communities; and they differ typologically.

Despite many differences, there are reasons that natural and programming languages are both called languages. Both are generative systems that allow expressing an unbounded range of information in interpretable ways. More relevant for the current purpose, text in both kinds of languages frequently occurs in documents, and knowing which kind and which specific language a part is in affects many aspects of how it should typically be processed.

To gain benefits similar to those of labelling the natural language of content, it would be useful to be able to label the programming language of content where applicable. Labelling data formats might be similarly valuable when data in those formats is embedded in documents.

A given piece of content is normally only in one language (whether natural or program), barring unusual edge cases such as code carefully crafted to conform simultaneously to multiple syntaxes, or poetry carefully crafted to also be executable. This proposal does not address such cases. The usual case, that content is generally not in a natural language if it is program code and vice versa, means that the very same labeling mechanism can be used in both cases so long as the specific codes do not collide. Indeed, this complementary distribution suggests that doing so is desirable.

Therefore this proposal does not suggest any new attributes or other constructs, but an extension to the set of permissible values for existing ones. That is, we propose using lang, xml:lang, and similar constructs in the usual manner but with additional values, to identify document portions such as program code. This seems clearly better than leaving such portions unidentified, which leads to poor results from spelling and grammar checkers, inappropriate formatting, and sometimes active corruption of code by applying language-inappropriate corrections or enhancements. It also seems better than the EPUB solution of labelling them merely zxx or und (see below).

Leveraging existing language indicators has the advantage that inheritance works as expected: for example, if a chapter in Japanese contains a code snippet in Java, setting the language to Java locally overrides the containing language. If a separate indicator were used instead, it would suggest that the two are orthogonal, which they are not. Even if that Java code uses Japanese identifiers throughout, the code remains in Java, not in the natural language Japanese: its syntax, even down to what strings are permissible identifiers, is defined by the rules of Java, not of Japanese per se. Once back in the non-code text, the rules of Japanese apply again.
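The inheritance behaviour described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name effective_languages and the sample chapter are hypothetical, and the code tag qpr-x-java follows the convention proposed later in this paper.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predeclared XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(elem, inherited=None):
    """Yield (tag, effective language) for elem and all of its descendants.

    An element without its own xml:lang inherits the value of its parent.
    """
    lang = elem.get(XML_LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from effective_languages(child, lang)


# Hypothetical sample: a Japanese chapter containing one Java snippet.
chapter = ET.fromstring(
    '<chapter xml:lang="ja">'
    "<para>prose</para>"
    '<code xml:lang="qpr-x-java">int x = 1;</code>'
    "<para>more prose</para>"
    "</chapter>"
)
result = list(effective_languages(chapter))
```

In this sketch the code element overrides the chapter's language locally, and the paragraph after it falls back to the inherited value without any extra markup.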

The lang attribute in HTML and xml:lang in XML provide standardized mechanisms for language identification, but they and most other mechanisms use language codes that cover only natural languages and certain variants (plus a few edge cases such as constructed languages). Obvious approaches to having normative names for specific programming or other languages would be to add them to the ISO 639 space or to separately define non-conflicting extensions. However, because programming languages occupy a different conceptual space, are used by a smaller community, and have a much higher rate of additions, the high level of co-ordination required for such approaches seems impractical.

Instead, this proposal takes a lesson from how ISBNs were incorporated as a sub-space within product codes. Barcodes for retail products begin with a country code. ISBNs were inserted as a single new country called Bookland, with code 978 (see Barcoding Guidelines). This approach neatly separated concerns: Books could do their own thing within their own space (and what they do is use ISBNs), and not trip over other product codes (or vice versa).

Similarly, we propose a single top-level language code, qpr, to cover the space of all programming languages. ISO 639-2 reserves the codes qaa through qtz for local use, so qpr cannot collide with any assigned natural-language code. The particular programming language is then specified by an additional code field, using the already familiar rules of RFC 5646. Because programming languages rarely have regional dialects or variants per se, this shift in the region semantics seems fair. Grouping programming languages under one code makes it trivial to tell whether given content is natural language text or program code. Some processing such as turning off spelling correction or fancy quotes may only need that distinction, while other processing such as syntax highlighting would leverage the more specific sub-codes in the remainder.

There does not appear to be a normative registry of programming languages, although many have language definitions promulgated via ISO, IETF, W3C, or other organizations. This proposal does not require creating a registry (though it also has no objections should one arise). Rather, it suggests using the widely-used de facto codes embodied in file extensions: py for Python, lisp for LISP, and so on.

For practical reasons these tend to be nonconflicting. However, there are exceptions such as dot and m. In such cases more is needed. This proposal thus also includes means for declaring unambiguous mapping of language codes to URLs, as is typically desired for Web vocabularies. We illustrate a mechanism for this in XML that leverages already-existing mechanisms and requires no additions to parsers, DOM implementations, etc. We also give possible approaches to achieve the same kind of mapping in HTML. Similar approaches could be applied to any document format or processing system that supports language identification codes, from markup languages to content management systems and beyond.

Background and Related Work

Current Practice

There are practices for identifying entire files as being in one language or another, such as shebang lines, hidden file type metadata, registries, and file extensions. However, most technical documentation systems either ignore programming language identification altogether or achieve it through implementation-specific mechanisms:

GitHub Flavored Markdown: Uses fenced code blocks with language identifiers (```python)
HTML/CSS: Relies on class attributes (<code class="language-python">)
DocBook: Uses the language attribute on programlisting elements (very similar to, and compatible with, the current proposal)

These approaches are useful, but share the limitation of providing no mechanism for resolving language identifiers to authoritative specifications.

Details of RFC 5646

RFC 5646 defines the syntax of language codes as a series of hyphen-separated ASCII alphanumeric tokens, which are case-insensitive and at most 8 characters long. It trades heavily on fixed token lengths, but does leave room for extensions.

The forms supported include a langcode which contains the parts described below (with all but the first part optional):

a 2- or 3-letter language, specified by the 2- or 3-letter shortest ISO 639 code (thus en rather than eng for English), or a 4-letter value (those are reserved), or a 5- to 8-letter code registered with IANA. The language may also include from 1 to 3 hyphen-separated 3-letter ISO 639 codes comprising an extlang suffix. Preferably all lower case.
a 4-character script (orthography) code drawn from ISO 15924. Preferably initial-cap.^[1]
a 2-letter or 3-digit region code drawn from ISO 3166-1 or UN M.49, respectively (these mainly enumerate countries, although some languages have regional variants that are not so conveniently bounded). Preferably all-cap.
a 5- to 8-letter registered variant code.
an extension code consisting of a single alphanumeric other than x, a hyphen, and 2-8 alphanumerics.
a private use code consisting of x, a hyphen, and 1-8 alphanumerics.

There are also certain grandfathered strings, and x- plus a private-use token. To fully conform to RFC 5646, programming language or data format codes must begin with x- (the qpr language code already conforms to ISO 639, so does not need similar treatment). This proposal recommends that practice for all contexts where full conformance to the RFC definition is required or helpful. However, this proposal also recommends that software recognize qpr subtags without regard to whether they begin with x-, just as they should accept them without regard to case.
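As a rough illustration of the strictly conforming form, the following Python sketch checks that a tag consists of qpr, the private-use singleton x, and one or more subtags of 1 to 8 ASCII alphanumerics, ignoring case. The name is_strict_qpr is hypothetical, and the pattern covers only this proposal's tags, not the full RFC 5646 grammar.

```python
import re

# Simplified pattern (an assumption for illustration): "qpr", then "x",
# then one or more hyphen-separated subtags of 1-8 ASCII alphanumerics.
QPR_STRICT = re.compile(r"qpr-x(-[A-Za-z0-9]{1,8})+", re.IGNORECASE)


def is_strict_qpr(tag: str) -> bool:
    """Return True if tag is a qpr tag in the strictly conforming x- form."""
    return QPR_STRICT.fullmatch(tag) is not None


checks = {
    "qpr-x-py": is_strict_qpr("qpr-x-py"),
    "QPR-X-Py": is_strict_qpr("QPR-X-Py"),      # case is ignored
    "qpr-py": is_strict_qpr("qpr-py"),          # lacks the x- singleton
    "qpr-x-mathematica": is_strict_qpr("qpr-x-mathematica"),  # subtag too long
}
```

A lenient consumer would accept more than this check does; the strict form is what generators should emit.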

Document Language Identification

Both HTML and XML provide built-in language identification through lang and xml:lang attributes respectively, following BCP 47 language tag conventions. Many specific XML schemas just use xml:lang, but some make similar provisions of their own (such as DocBook’s language, ODF’s fo:language, and OOXML’s w:lang). Despite such minor variations this general approach offers several advantages:

Standardized inheritance semantics
Ease of integration with CSS language selectors and XSLT processing
Support for hierarchical language switching
Compatibility with accessibility tools and screen readers
Usefulness for language-specific spell checking and grammar validation

Nearly all such specifications defer to ISO 639, RFC 5646, and BCP 47. Some make a few additions:

HTML adds the empty string ("") to indicate that the language is unknown, and a private-use x- prefix
EPUB uses zxx for no linguistic content and und for undetermined language. Both are special-purpose codes defined in ISO 639-2; they at least allow systems to avoid inappropriate processing such as spelling correction or quote education, but they do not help with programming-language-specific treatment.
TEI recommends such standardized codes but allows others.
OOXML and ODF mainly use standard codes but make some additions.
DocBook specifically distinguishes programming languages by offering the language attribute on elements such as programlisting and code, but xml:lang elsewhere. This distinction is intuitive, but does add another name despite complementary distribution of values, and assumes special treatment of inheritance (for example, a code element conceptually should not inherit the value of xml:lang from its parent in the usual fashion).
JATS notes the distinction but does not seem to provide a specific mechanism.

Language code values generally focus on natural languages, and while some specifications acknowledge or even provide generically for programming languages, none seems to provide or even recommend particular codes.

Proposed Approach

Language Code Structure

We propose using qpr as a base language code to cover the space of programming languages, followed by specific programming language identifiers as suffixes. For example:

                     
<code xml:lang="qpr-x-py">
def fibonacci(n):
    return n if n &lt;= 1 else fibonacci(n-1) + fibonacci(n-2)
</code>

The qpr code is in the range that ISO 639 reserves for local (private) use, which avoids conflict with any natural language codes while clearly indicating programming content. The x- accords with RFC 5646 rules, under which py is treated as a private-use subtag. This proposal recommends that such codes be drawn from established file extensions, to leverage existing developer knowledge.

File Extension Alignment

Programming language suffixes align with conventional file extensions to minimize cognitive overhead:

qpr-x-py ↔ .py (Python)
qpr-x-js ↔ .js (JavaScript)
qpr-x-css ↔ .css (CSS)
qpr-x-html ↔ .html (HTML)
qpr-x-xml ↔ .xml (XML)
qpr-x-sql ↔ .sql (SQL)
qpr-x-rs ↔ .rs (Rust)
qpr-x-go ↔ .go (Go)

Where conflicts exist (e.g., .m used by Objective-C, MATLAB, and Mathematica), explicit names resolve ambiguity: qpr-x-objc, qpr-x-matlab, qpr-x-mathemat (the last is shortened because "Mathematica" is longer than the 8-character limit for RFC 5646 subtags). Redundant cases also arise, such as HTML using both htm and html extensions; these may be annoying but are not very harmful since it is easy to recognize both.
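One way a tool might apply these rules is sketched below in Python. The lookup tables and the function name codes_for_filename are hypothetical and cover only a handful of extensions; an ambiguous extension such as .m yields several candidates rather than a guess.

```python
from pathlib import PurePosixPath

# Hypothetical lookup tables for illustration only.
UNAMBIGUOUS = {
    ".py": "py",
    ".js": "js",
    ".rs": "rs",
    ".html": "html",
    ".htm": "html",   # redundant extension folded onto one code
}
AMBIGUOUS = {
    ".m": ["objc", "matlab", "mathemat"],
}


def codes_for_filename(name: str) -> list:
    """Return the candidate qpr-x- codes for a file name (possibly none)."""
    ext = PurePosixPath(name).suffix.lower()
    if ext in UNAMBIGUOUS:
        return ["qpr-x-" + UNAMBIGUOUS[ext]]
    return ["qpr-x-" + subtag for subtag in AMBIGUOUS.get(ext, [])]


single = codes_for_filename("hello.rs")
several = codes_for_filename("solver.m")
```

Returning every candidate leaves the final choice to the author or to a declaration in the document, which is where the disambiguation belongs.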

As noted above, this proposal strongly recommends that application code receiving a language code beginning with qpr- recognize and accept the remainder whether or not it begins with x-. For example, while qpr-x-py is the full correct code and is what should be generated, receiving programs should also accept qpr-py as synonymous.
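A receiving program could implement this leniency with a small normalization step, sketched here in Python. The function name normalize_qpr and its return conventions (None for non-program tags, bare qpr when no specific language is given) are assumptions made for the example.

```python
def normalize_qpr(tag: str):
    """Return the canonical lower-case 'qpr-x-...' form of a program-code tag.

    Returns None if the tag is not under qpr at all, and plain 'qpr' if it
    names no specific language. The x- singleton is optional on input.
    """
    parts = tag.lower().split("-")
    if parts[0] != "qpr":
        return None
    rest = parts[1:]
    if rest and rest[0] == "x":
        rest = rest[1:]
    if not rest:
        return "qpr"
    return "-".join(["qpr", "x"] + rest)


canonical = normalize_qpr("qpr-py")
```

Comparing normalized forms lets qpr-py, QPR-X-Py, and qpr-x-py all select the same processing.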

Hierarchical Encoding Support

Sometimes formats are stacked. One example is the common use of tar + gzip, leading to names like archive.tar.gz. A different case includes metalanguages such as XML or BNF, within which many specific languages may be defined. RFC 3023 proposed a naming convention addressing the latter case, in which XML-based media types carry a +xml suffix (as in image/svg+xml). Similarly, this proposal permits chaining encodings using - separators for scenarios like these:

                     
<data xml:lang="qpr-x-png-base64">
iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8...
</data>
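A consumer of chained codes might peel off the outer encodings it understands and hand the rest to a format-specific tool. The Python sketch below assumes that subtags are ordered as in stacked file names, innermost format first and outermost encoding last; the subtags csv and gz, the decoder table, and the function names are hypothetical.

```python
import base64
import gzip

# Hypothetical decoder table: subtag -> function from bytes to bytes.
DECODERS = {
    "base64": base64.b64decode,
    "gz": gzip.decompress,
}


def split_layers(tag: str) -> list:
    """Split a qpr tag into its chained subtags, dropping an optional x-."""
    parts = tag.lower().split("-")
    if parts[0] != "qpr":
        raise ValueError("not a qpr tag: " + tag)
    rest = parts[1:]
    if rest and rest[0] == "x":
        rest = rest[1:]
    return rest


def unwrap(tag: str, payload: bytes):
    """Undo outer encodings while a decoder is known.

    Returns the remaining (inner) layers and the decoded payload.
    """
    layers = split_layers(tag)
    while layers and layers[-1] in DECODERS:
        payload = DECODERS[layers.pop()](payload)
    return layers, payload


# Sample payload built for the example: CSV text, gzipped, then base64-encoded.
wrapped = base64.b64encode(gzip.compress(b"a,b\n1,2\n"))
layers, data = unwrap("qpr-x-csv-gz-base64", wrapped)
```

Here the two outer layers are removed and the remaining csv layer identifies what the recovered bytes contain.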

Resolving Codes to Definitions

An important characteristic of names on the Web is that they eventually map to URLs. This can be accomplished through various mechanisms: normative registries (like ISO 639 for languages, ISO 15924 for scripts, and ISO 3166 for country codes), local declarations (like XML namespace declarations), de facto conventions (like file extensions), or other means.

This proposal defines a single top-level language code qpr to conform to ISO 639, then creates an extension space for particular programming languages. This subspace does not, at present, have a formal registry — our research found no existing ISO or ANSI enumeration of programming language codes, despite the extensive work of ISO/IEC JTC 1/SC 22 on individual language specifications.

This proposal nevertheless provides two mechanisms for associating codes with definitive specifications: (a) a strong recommendation to use existing file extensions with their established meanings, and (b) declaration-based association of codes with authoritative URLs. It does not itself require use of such declarations (of course, any particular use of the specification could add such a requirement if desired in its own context).

XML `NOTATION` Declarations

In the case of XML, NOTATION is an already-defined mechanism for declaring external data formats and associating them with public identifiers and system identifiers (such as URIs):

                     
<!NOTATION png SYSTEM "https://www.w3.org/TR/png-3/">

Defined notations may then be used to qualify declarations of external entities, or on attributes of type NOTATION.

                     
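<!ENTITY figure_12 SYSTEM "pix/fig12.png" NDATA png>

<!ATTLIST figlink
    href    CDATA          #REQUIRED
    fmt     NOTATION (png) #REQUIRED
    alt     CDATA          #REQUIRED>

<!-- An unparsed (NDATA) entity is referenced by name from an attribute of
     type ENTITY, not with &figure_12; in content. The figure element and
     its src attribute are illustrative names only. -->
<!ATTLIST figure
    src     ENTITY         #REQUIRED>

<figure src="figure_12"/>

<figlink href="http://example.com/globe.png" fmt="png" alt="A dog."/>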

Notations are most often used to identify data formats such as for included images, videos, etc. However, they are not defined so as to limit them in that way. Nothing prevents declaring programming language codes as XML NOTATIONs, or using them on NDATA qualifiers or NOTATION attributes:

                     
<!NOTATION qpr-x-py PUBLIC "-//Python Software Foundation//Python 3//EN"
                    "https://docs.python.org/3/reference/grammar.html">
<!NOTATION qpr-x-rs PUBLIC "-//Mozilla Foundation//Rust Language//EN"
                    "https://doc.rust-lang.org/reference/">
<!NOTATION qpr-x-js PUBLIC "-//ECMA International//ECMAScript 2023//EN"
                    "https://tc39.es/ecma262/">

<!ENTITY example_1 SYSTEM "snippets/ex1.py" NDATA qpr-x-py>

<!ATTLIST pre
    lang    NOTATION (qpr-x-py | qpr-x-rs | qpr-x-js) #IMPLIED>

<!-- As an unparsed entity, example_1 is referenced from an ENTITY-typed
     attribute; the include element is an illustrative name only. -->
<!ATTLIST include
    src     ENTITY   #REQUIRED>

<include src="example_1"/>

<pre lang="qpr-x-py">
    import sys
    print(sys.version)
</pre>

This leverages the notation mechanism to map names to URIs and to formalize the identification of programming languages. It requires no extensions to existing XML tools and processing. Applications that wish to leverage the information, however, can do so in a clear and reasonably familiar way. This provides a principled way to disambiguate ambiguous extensions, and/or to add new names on a per-document or per-schema basis.
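To suggest how an application might pick up such declarations with existing tools, the Python sketch below uses the standard library's expat binding, whose NotationDeclHandler callback reports each NOTATION declaration. The function name notation_uris and the sample document are assumptions for the example; a real application would also decide how to treat public identifiers.

```python
import xml.parsers.expat


def notation_uris(xml_text: str) -> dict:
    """Map each declared notation name to its system identifier."""
    found = {}

    def on_notation(name, base, system_id, public_id):
        # expat calls this once per <!NOTATION ...> declaration.
        found[name] = system_id

    parser = xml.parsers.expat.ParserCreate()
    parser.NotationDeclHandler = on_notation
    parser.Parse(xml_text, True)
    return found


# Hypothetical document with two declarations in its internal subset.
SAMPLE = """<!DOCTYPE doc [
<!NOTATION qpr-x-py PUBLIC "-//Python Software Foundation//Python 3//EN"
                    "https://docs.python.org/3/reference/grammar.html">
<!NOTATION qpr-x-js PUBLIC "-//ECMA International//ECMAScript 2023//EN"
                    "https://tc39.es/ecma262/">
]>
<doc/>"""

uris = notation_uris(SAMPLE)
```

The resulting dictionary gives a per-document mapping from language codes to URIs, which a highlighter or validator could consult before falling back to file-extension conventions.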

HTML `meta` Declarations

HTML can simply use the existing lang attribute and applications can leverage it as needed. For example, a spelling checker can easily skip any parts where lang starts with qpr; an editor can easily extract Python code from an HTML document by looking for lang set to qpr-x-py, and so on.
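The extraction case can be sketched with Python's standard html.parser module. This is a simplified illustration, not a robust tool: the class name CodeExtractor is hypothetical, and the sketch assumes every start tag has a matching end tag (so unclosed void elements such as a bare br would confuse it).

```python
from html.parser import HTMLParser


class CodeExtractor(HTMLParser):
    """Collect text whose effective lang equals the wanted code."""

    def __init__(self, wanted: str):
        super().__init__()
        self.wanted = wanted.lower()
        self.stack = []    # effective lang of each open element
        self.chunks = []   # text collected so far

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        inherited = self.stack[-1] if self.stack else None
        # An explicit lang overrides the inherited value.
        self.stack.append(attrs["lang"] if "lang" in attrs else inherited)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        lang = self.stack[-1] if self.stack else None
        if lang and lang.lower() == self.wanted:
            self.chunks.append(data)


# Hypothetical page fragment for the example.
page = (
    '<div lang="en"><p>Run this:</p>'
    '<pre lang="qpr-x-py">print("hi")</pre>'
    "<p>Done.</p></div>"
)
extractor = CodeExtractor("qpr-x-py")
extractor.feed(page)
extractor.close()
python_code = "".join(extractor.chunks)
```

A spelling checker could use the same traversal in reverse, skipping any text whose effective language starts with qpr.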

If desired, HTML could define an alternative mechanism to associate language codes with specifications. One possibility would be using meta elements. HTML could of course also choose to provide or reference a set of applicable codes.

                     
<meta name="lang-qpr-x-py" content="https://docs.python.org/3/reference/">
<meta name="lang-qpr-x-js" content="https://tc39.es/ecma262/">

The HTML Working Group would have the ultimate say, but users could of course use the available attribute in the meantime if desired, trusting to the near-standardization of extensions in common practice. CSS can easily refer to attribute values within selectors, so per-language rendering could be managed. For example, if <pre> were used both for Python code and for English poetry the cases could be distinguished like this:

                     
<style type="text/css">
    pre[lang="qpr-x-py"] { font-family:"Courier", monospace; }
    pre[lang="en-us"]    { font-family:"Garamond"; }
</style>

Implementation Considerations

Backward Compatibility

The approach requires no changes to XML processing infrastructure. Existing parsers handle qpr-* language codes as ordinary attribute values. If any XML or other applications actively check language code values, they should accept the forms defined here, since they are valid (other than the suggested support for omitting x-). NOTATION declarations are parsed and stored in the document information set without affecting core processing (except when applied to external entities, in which case they do the right thing).

Legacy tools typically ignore unknown language codes gracefully. CSS can leverage the codes for formatting distinctions. Enhanced processors could leverage codes and/or URIs for syntax highlighting, validation, and/or cross-reference generation.

Validation

Validation works unchanged. Syntactic validity of the embedded program code need not be addressed by XML, JSON, or other validation tools, just as grammatical checking of natural language text is not. However, such checking is possible once particular languages are identified. Also, if desired, language codes can be constrained through enumerated types or pattern restrictions in schemas:

                     
<!ATTLIST code xml:lang (en|fr|qpr-x-py|qpr-x-js|qpr-x-rs) #IMPLIED>

Example language codes

Programming languages

These are just the TIOBE top 20 languages as of this writing.

Language	Subtag (after qpr-x-)	Extension	Possible NOTATION URI
Python	py	.py	https://docs.python.org/3/reference/
C++	cpp	.cpp	https://isocpp.org/std/the-standard
C	c	.c	https://www.iso.org/standard/74528.html
Java	java	.java	https://docs.oracle.com/javase/specs/
C#	cs	.cs	https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/
JavaScript	js	.js	https://tc39.es/ecma262/
Go	go	.go	https://go.dev/ref/spec
Visual Basic	vb	.vb	https://docs.microsoft.com/en-us/dotnet/visual-basic/language-reference/
Delphi/Object Pascal	pas	.pas	https://www.freepascal.org/docs.html
SQL	sql	.sql	https://www.iso.org/standard/76583.html
Fortran	f90	.f90	https://www.iso.org/standard/72320.html
Scratch	sb3	.sb3	https://scratch.mit.edu/developers
PHP	php	.php	https://www.php.net/manual/en/langref.php
R	r	.r	https://cran.r-project.org/doc/manuals/r-release/R-lang.html
Ada	ada	.adb	https://www.iso.org/standard/61507.html
MATLAB^[2]	matlab	.m	https://www.mathworks.com/help/matlab/language-fundamentals.html
Assembly	asm	.asm	https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html
Rust	rs	.rs	https://doc.rust-lang.org/reference/
Perl	pl	.pl	https://perldoc.perl.org/perlsyn
COBOL	cob	.cob	https://www.iso.org/standard/74527.html

A few data formats

Format	Subtag (after qpr-x-)	Extension
SQL	`sql`	`.sql`
YAML	`yaml`	`.yaml`
JSON	`json`	`.json`
Regex	`regex`	`.regex`
Base64	`b64`	`.b64`

Shell and scripting

Shell	Subtag (after qpr-x-)	Extension
Bash	`bash`	`.bash`
Zsh	`zsh`	`.zsh`
Fish	`fish`	`.fish`
PowerShell	`ps1`	`.ps1`

Usage examples

Technical documentation

                     
<article xml:lang="en">
  <title>Getting Started with Rust</title>
  <para>Here's a simple Rust program:</para>
  <programlisting xml:lang="qpr-x-rs">
fn main() {
    println!("Hello, world!");
}
  </programlisting>
  
  <para>To compile it, run:</para>
  <screen xml:lang="qpr-x-bash">
rustc hello.rs
./hello
  </screen>
</article>

API Documentation

                     
<function xml:lang="en">
  <funcsynopsis><funcdef>authenticate</funcdef></funcsynopsis>
  <para>Authenticates a user via JWT token</para>
  <example xml:lang="qpr-x-py">
import jwt
token = jwt.encode({"user": "alice"}, secret, algorithm="HS256")
  </example>
  <example xml:lang="qpr-x-js">
const jwt = require('jsonwebtoken');
const token = jwt.sign({user: 'alice'}, secret, {algorithm: 'HS256'});
  </example>
</function>

Configuration Examples

                     
<section>
  <title>Configuration</title>
  <programlisting xml:lang="qpr-x-docker">
FROM python:3.9
COPY requirements.txt .
RUN pip install -r requirements.txt
  </programlisting>
  
  <programlisting xml:lang="qpr-x-yaml">
version: '3.8'
services:
  web:
    build: .
    ports:
      - "8000:8000"
  </programlisting>
</section>

Remaining issues

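In some cases it is important to specify a particular version number for programming languages. This could be done by defining separate extension codes (say, f66 vs. f90 above), or perhaps better, by appending a field specifically for the version number, for example qpr-py-3.11. Because RFC 5646 subtags are limited to ASCII alphanumerics, a strictly conforming form would need to drop or replace the period, for example qpr-x-py-311.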

Data formats such as CSV and PNG, along with countless others, as well as templating, configuration, and other systems, may be considered one or more further categories of language. So considered, they are largely complementary to natural and programming languages. Similar declaration functionality might be useful for them as well. In that case, the same approach could potentially be applied (say, with the prefix qda).

The boundaries between programming languages, data representation languages, and markup languages are imprecise. Some interesting boundary cases include XSLT, PostScript, and regexes.

XSD supports the NOTATION datatype for attributes, and its xs:notation element declares notations with public and system identifiers, so the NOTATION-based mapping described above can also be expressed in XSD-based schemas.

Conclusion

The approach requires no formal standardization or updating of applications for immediate use or adoption. Organizations or individuals can begin using the conventions immediately while building consensus around language code assignments.

This paper presents a backward-compatible approach to programming language identification in documents that leverages existing standards (natural language tags and XML NOTATION) to provide both machine-readable identification and dereferenceable specifications.

The approach offers several advantages:

Zero breaking changes to existing document processing
Broad coverage of programming languages (and potentially data formats)
URI-based grounding through NOTATION declarations
Tool integration pathway for enhanced processing
Extensible framework for future language additions

By building on existing language identification infrastructure rather than introducing new mechanisms, the approach provides a path toward better programming language support in technical documentation while maintaining full compatibility with existing toolchains. Using the same mechanism also properly captures the complementary use of natural vs. programming language syntax.

The combination of familiar file extension conventions with formal NOTATION declarations strikes a good balance between developer usability and semantic precision, enabling both human authors and automated tools to work more effectively with diverse technical content.

Acknowledgments

The author thanks Claude (Anthropic) for collaborative development of these ideas and significant contributions to the analysis and presentation in this paper.

References

[RFC 3986] Berners-Lee, T., R. Fielding, L. Masinter. Uniform Resource Identifiers (URI): Generic Syntax. RFC 3986. January 2005.

[Barcoding Guidelines] BISG. Barcoding Guidelines for the U.S. Book Industry. Retrieved 2025-06-04. https://www.bisg.org/barcoding-guidelines-for-the-us-book-industry

[XML 1.0] Bray, T., J. Paoli, C.M. Sperberg-McQueen, E. Maler, F. Yergeau. Extensible Markup Language (XML) 1.0 (Fifth Edition). W3C Recommendation. November 2008.

[ISO 15924] International Organisation for Standardization. 2004. ISO 15924: Codes for the representation of names of scripts. See also https://en.wikipedia.org/wiki/ISO_15924

[RFC 5646] Phillips, A., M. Davis (eds). RFC 5646: Tags for Identifying Languages. September 2009. https://www.rfc-editor.org/info/rfc5646

[HTML] WHATWG. HTML Standard, Section 3.2.6.2: The lang and xml:lang attributes. https://html.spec.whatwg.org/multipage/dom.html#the-lang-and-xml:lang-attributes

[ISBN] Wikipedia. International Standard Book Number. https://en.wikipedia.org/wiki/International_Standard_Book_Number

[RFC 3023] Murata, M. XML Media Types. RFC 3023. January 2001.

[TEI P5] Text Encoding Initiative. TEI P5: Guidelines for Electronic Text Encoding and Interchange. TEI Consortium. Version 4.6.0. April 2023.

[DocBook 5.0] Walsh, N. The DocBook Schema Version 5.0. 14 Mar 2008. https://docbook.org/specs/docbook-5.0-spec-cd-03.html

^[1] This token is rarely included except when the language in question is written using multiple orthographies (such as Serbian, which might use either sr-Cyrl or sr-Latn), or when transliteration is used (such as el-Latn).

^[2] MATLAB is given as an exception to the usual rules, because .m has other conflicting uses.

