SCEP: | 106 |
---|---|
Title: | Document formats suitable for “source” documents |
Version: | 72553b43ff9b2176cc8f50197af23e8cd3c5eac5 |
Last modified: | 2015-05-14 12:58:18 UTC (Thu, 14 May 2015) |
Author: | Raphael ‘kena’ Poss |
Status: | Draft |
Type: | Informational |
Created: | 2014-06-23 |
Source: | scep0106.rst (fp:jzfQBa0-Owi93TfYOPsYrx1RrShWAEVDl-Lgmd5FCMZ-Uw) |
The Structured Commons model [1] depends heavily on a consensus among authors and readers about what constitutes the "source" of a published document: the object fingerprint [2] used for inter-document citations should identify the "essence" of a scientific work, as independently as possible from its representation in various formats.
This SCEP provides guidelines and rationales for users of written documents, in particular scholarly authors, to choose source formats according to their compatibility with the Structured Commons vision and other requirements.
The content of the following sections can be summarized as follows:
This SCEP is only applicable to Structured Commons objects that primarily consist of written text, i.e. NOT data sets, images, program source code, program executables, virtual machine images, etc.
It is possible to integrate printable PDFs in the Structured Commons network directly, i.e. to compute fingerprints of PDF files directly and/or cite works via their PDF fingerprints. However, the Structured Commons model strongly encourages authors to publish their document sources as well.
This requirement is already prevalent in online document libraries, either from established academic publishers or in open repositories like arXiv [3]. Moreover, once authors get into the habit of publishing document sources alongside other presentation formats, it becomes possible to make fingerprints independent of document representation.
This in turn enables authors to (re-)generate alternate representations of a document after it has been published, without breaking the existing fingerprint-based citations from other works.
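The property described above can be illustrated with a toy sketch. It assumes a plain SHA-256 digest as a stand-in for the fingerprint; the actual Structured Commons fingerprint algorithm is defined in SCEP 101 [2] and is not reproduced here. The function name `toy_fingerprint` and the sample byte strings are hypothetical placeholders, not real documents.

```python
import hashlib


def toy_fingerprint(data: bytes) -> str:
    """Stand-in digest (plain SHA-256); NOT the SCEP 101 fingerprint algorithm."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical document source (rST) and two placeholder renderings of it,
# standing in for PDFs generated at different times with different tools.
source = b"Title\n=====\n\nSome *emphasized* text.\n"
rendering_old = b"placeholder bytes of a first PDF rendering"
rendering_new = b"placeholder bytes of a regenerated PDF rendering"

# A citation keyed on the source keeps resolving after re-rendering,
# because the source bytes did not change.
citations = {toy_fingerprint(source): "cited work"}
resolves_via_source = toy_fingerprint(source) in citations

# A citation keyed on one rendering does not match a regenerated rendering.
rendering_key_stable = toy_fingerprint(rendering_old) == toy_fingerprint(rendering_new)
```

In this sketch `resolves_via_source` is true and `rendering_key_stable` is false, which is the reason citing a work by the fingerprint of its source is more robust than citing it by the fingerprint of one particular rendering.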
There currently exist multiple workflows and tools used by scientific authors to prepare documents prior to publication. Anecdotally, this diversity is maintained and usually polarised by conflicting requirements between the authors’ desire for a WYSIWYG editing interface and the field’s requirement for high-quality print typesetting and long-term portability of document formats; the conflict is epitomized by this common question from graduate students worldwide: "should I use Word or LaTeX to write my thesis?"
For various reasons, some of which are detailed below, this controversy may soon be resolved for scientific works by a common shift away from word processors, towards standard-based and document-centric workflows using multiple editing tools simultaneously, including but not limited to LaTeX and newer "lightweight" markup formats like rST or Markdown.
Nevertheless, this SCEP acknowledges that both technology and user preferences will continue to evolve over time, and thus that the Structured Commons model should not restrict users to a single source format or technology.
This SCEP recommends the following prioritization of criteria when considering multiple candidate markup languages for new Structured Commons documents, in decreasing priority order:
Criterion #1 promotes all standard-based workflows and formats (e.g. LaTeX, rST, Markdown, HTML, etc.) over implementation-based workflows and formats (e.g. OOXML, OXF, etc.), because program-centric environments have only poorly or partially standardized interchange formats, and it is thus unlikely that documents can be recovered from sources after current implementations fall out of use.
Criterion #2 promotes pre-structured markup languages like LaTeX, rST, Markdown or HTML compared to general markup languages like XML, where markup tags can be inscrutable without access to an externally provided schema, or print-oriented typesetting languages like Troff, where markup tags specify layout and typography instead of semantics.
Criterion #3 promotes "transparent" markup languages like rST, Markdown or Org-mode, where the source form of a document is usually also conveniently readable, compared to command-based or tag-based languages like LaTeX, Texinfo or HTML, which require preprocessing or interpretation to become conveniently readable.
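The prioritization described above amounts to a lexicographic ordering over the three criteria. The following minimal sketch shows one way to express it; the `Candidate` type and `rank` function are hypothetical names introduced for illustration. The classification of LaTeX, rST, Markdown and HTML follows the three paragraphs above, while the values for OOXML under criteria #2 and #3 are assumptions made only so that the example is complete.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    standard_based: bool   # criterion #1
    pre_structured: bool   # criterion #2
    transparent: bool      # criterion #3


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates by the three criteria, in decreasing priority.

    Each key component is negated so that candidates meeting a criterion
    sort first; sorted() is stable, so ties keep their input order.
    """
    return sorted(
        candidates,
        key=lambda c: (not c.standard_based, not c.pre_structured, not c.transparent),
    )


candidates = [
    # OOXML: criterion #1 per the text; #2 and #3 are assumed False here.
    Candidate("OOXML", False, False, False),
    Candidate("LaTeX", True, True, False),
    Candidate("HTML", True, True, False),
    Candidate("rST", True, True, True),
    Candidate("Markdown", True, True, True),
]

ranked_names = [c.name for c in rank(candidates)]
```

Because the criteria are compared in priority order, a format that fails criterion #1 always ranks below one that meets it, whatever its standing on criteria #2 and #3. Candidates that are tied on all three criteria are left for authors to discriminate by other means.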
Other criteria to further discriminate between alternatives are intentionally not covered by this SCEP, in order to:
See also Example markup languages below.
Historically, the following requirements have motivated major technology shifts by authors, i.e. situations where authors deliberately decided to adapt their workflow and working style and to adopt new tools and technology for source documents, sometimes even at the cost of a partial feature loss relative to their existing habits and expectations:
Requirement | Advent period | Origin | Historical motivation and shift | Casualties / compromises |
---|---|---|---|---|
sep: Ability to specify content and layout separately, to facilitate collaboration and reuse | 1960-1990 | Authors | As authors started using personal computers and collaborating with peers using digital formats, implementers were forced to provide more features to enable separation of form and content, which in turn stimulated more and more new authors to learn and use these features from the get-go. | Reduced expectation/use of fine-grained, per-character control over typography and print. |
multi: High-quality and high-fidelity support for multiple reading environments, in particular web and print | 1995-2005 | Readers | This requirement from the advent of the World Wide Web forced authors to adopt tools with extensive support for multiple output formats, with output quality becoming a higher priority requirement when selecting editor programs than user interfaces. | Reduced expectation/use of WYSIWYG editing. |
long: Long-term durability, ability to continue working with a document long after it was created, even after the original editor program has been obsoleted, updated, etc. | 2000-2010 | Authors | This requirement emerged in the early 2000s as the majority of word processor users realized that new software eventually drops compatibility with old documents. It stimulated the development and general adoption of standard-based document languages independent of the particular programs used to edit them. | Longer time between the definition of new editing features and general availability in authoring and reader software. |
reflow: Ability for readers/viewers to recompute a presentation layout without access to the author’s editing environment | 2000-2010 | Readers | This requirement from users of portable document readers and smartphones stimulated acceptance of source delivery, i.e. of publication channels where readers/viewers have access to part or whole of the "source" document format and can recompute renderings, at will, using standards-based technology. | Reduced expectation/use of workflows where authors decide the final appearance of documents. |
trans: Transparent/human-friendly source language that enables fast adoption, and fast reading and interpretation by humans without prior processing | 2005-2015 | Authors and Readers | This requirement from users who mostly communicate online with peers using lightweight client interfaces (chat, web forms, mobile apps) stimulated the creation and adoption of markup languages where the source definition of a document is also an adequate text-only rendering, comfortable to read and reuse in "simple" interfaces with limited or no support for formatting. | Steeper learning curve when authors start seeking more control over rendering than provided by the markup language. |
The following table illustrates how technology has evolved to respond to the requirements stated above over time:
Editing environments / source formats | | | Features vs. requirements | | | | |
---|---|---|---|---|---|---|---|
Group | Flavor | Examples | sep | multi | long | reflow | trans |
Word processors | Print-oriented | Word, LibreOffice | yes [4] | no | no | no | no |
Word processors | Online-oriented | Dreamweaver, WordPress, Google Docs | yes [4] | no | no | yes | no |
Markup languages | Print-oriented | Troff, TeX, LaTeX | yes | yes | yes | no [5] | no |
Markup languages | Online-oriented | HTML | yes | no [6] | yes | yes | no |
Markup languages | Hybrid, tag-based markup | Texinfo, SGML, DocBook XML, POD | yes | yes | yes | yes | no |
Markup languages | Hybrid, punctuation and layout-based markup | rST, Markdown, Wiki markup, Org-mode, AsciiDoc | yes | yes | yes | yes | yes |
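The matrix above can also be queried mechanically. The sketch below transcribes it into a dictionary and filters the flavors that meet a chosen set of requirements; the names `MATRIX` and `satisfying` are hypothetical, and the footnoted cells are simplified to plain yes/no ("yes [4]" counts as yes, "no [5]" and "no [6]" count as no).

```python
from typing import Iterable, List, Tuple

REQUIREMENTS = ("sep", "multi", "long", "reflow", "trans")

# (group, flavor) -> requirements marked "yes" in the table above.
MATRIX = {
    ("Word processors", "Print-oriented"): {"sep"},
    ("Word processors", "Online-oriented"): {"sep", "reflow"},
    ("Markup languages", "Print-oriented"): {"sep", "multi", "long"},
    ("Markup languages", "Online-oriented"): {"sep", "long", "reflow"},
    ("Markup languages", "Hybrid, tag-based"): {"sep", "multi", "long", "reflow"},
    ("Markup languages", "Hybrid, punctuation and layout-based"): set(REQUIREMENTS),
}


def satisfying(required: Iterable[str]) -> List[Tuple[str, str]]:
    """Return the (group, flavor) pairs that meet every requirement in `required`."""
    required = set(required)
    unknown = required - set(REQUIREMENTS)
    if unknown:
        raise ValueError(f"unknown requirement(s): {sorted(unknown)}")
    # Dictionaries preserve insertion order, so results follow the table order.
    return [key for key, features in MATRIX.items() if required <= features]


durable_and_reflowable = satisfying({"long", "reflow"})
all_five = satisfying(REQUIREMENTS)
```

Under this simplification, only the punctuation and layout-based hybrid flavor meets all five requirements, and no word processor flavor meets the long-term durability requirement.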
At the time of this writing, word processors are going out of fashion for scientific works in favor of markup languages, with LaTeX historically prevalent in mathematics, logic and computer science.
LaTeX is commonly advertised to new scientific scholars as the go-to markup language suitable for academic publishing. LaTeX contrasts with most word processing software through its long history of technical stability, reliability and typeset output quality, and these differences are commonly used as a "selling point".
However, all users, including new authors, teachers of LaTeX and existing LaTeX users, should consider how LaTeX may not fully cater for recent requirements from both authors and readers:
In contrast, the new generation of "lightweight markup formats" pioneered by Wikipedia (Wiki markup), Web fora (Markdown) and inline source code documentation (rST, AsciiDoc) is tailored to these new requirements without sacrificing the other advantages of LaTeX compared to word processors.
In short, this SCEP recommends that scientific authors consider alternate source markup languages for new works, tailored to contemporary user expectations, without sacrificing the Structured Commons vision: long-term document durability.
The following table summarizes a few markup languages in common use at the time of this writing (2014). This table is provided for informational use only. This SCEP does not endorse nor promote any of these languages and associated technologies.
Name | Status | Origins / motivation | Strong support for print | Strong support for math | Strong support for tables | Links |
---|---|---|---|---|---|---|
LaTeX | Actively used & coherently maintained | Scientific publishing | yes | yes | yes | user manual, example online editor |
rST | Actively used & coherently maintained | Technical documentation | yes | yes | yes | user manual, example online editor, alternate online editor |
AsciiDoc | Actively used & coherently maintained | Technical documentation | yes | yes | yes | user manual |
Wiki markup | Actively used, coherently maintained | Knowledge preservation | yes | yes | yes | user manual, example online editor |
LilyPond | Actively used, coherently maintained | Music engraving | yes | user manuals, example online editor | ||
Markdown | Actively used & fragmented implementations | Web authoring | [7] | [7] | [7] | user manual, manifesto |
Org-mode | Actively used, coherently maintained | Productivity enhancements | yes | user manuals | ||
Textile | Somewhat less actively used, fragmented implementations | Web authoring | yes | user manual & example online editor |
In the table above,
[1] | SCEP 100. "Structured Commons Model Overview" (http://www.structured-commons.org/scep0100.html) |
[2] | SCEP 101. "Structured Commons Object Model and Fingerprints" (http://www.structured-commons.org/scep0101.html) |
[3] | ArXiv.org: "Why Submit the TeX/LaTeX Source?" (http://arxiv.org/help/faq/whytex) |
[4] | Support for separation of content and presentation is present but is usually opt-in by authors. |
[5] | Support for client-side reflowing is partially available via conversion to another markup language, typically HTML, but the conversion tools may not support all the markup used by authors. |
[6] | Implementations focus on rendering by web browsers; alternate styling/presentation for print or e-book readers is possible but rarely or only partially supported by tools. |
[7] | Control over print formatting, math and tables for Markdown is not provided by the main Markdown implementation; it is commonly provided by third-party tools that convert to other markup formats. |
This document has been placed in the public domain.