SCEP: | 101 |
---|---|
Title: | Structured Commons Object Model and Fingerprints |
Version: | f66b1e00c3a59d2c1820e322d4a6f10670fd16a6 |
Last modified: | 2014-06-16 10:17:28 UTC (Mon, 16 June 2014) |
Author: | Raphael ‘kena’ Poss |
Status: | Draft |
Type: | Standards Track |
Created: | 2014-06-16 |
Source: | scep0101.rst (fp:Py491rKIVazfq54w5IEAYe1I6uNamwgTKn95SEp0oZRXTg) |
This SCEP describes the lowest level of the Structured Commons model and infrastructure: how document and digital objects are structured and fingerprinted.
A Structured Commons object is either:
A name is an array of Unicode [2] characters (code points):
Other SCEPs and Structured Commons materials and technology can use the following terminology:
An object A is said to contain an object B if either of the following is true:
An object A is said to refer to an object B if either of the following is true:
By extension, an object A is said to refer to a fingerprint F if it refers to an object B whose fingerprint is F.
Object fingerprints are fixed size 32-byte arrays computed from any object as follows:
serialize the object onto an array of bytes as follows:
first the byte code 115 ("s"), followed by the object length in big endian ASCII [1] decimal (codes 48-57), followed by a NUL byte, followed by the object’s byte data.
compute the SHA-256 [4] checksum of the serialization in #1. The fingerprint is the SHA-256 checksum.
Note
Although fingerprints can be represented as an array of bytes, a fingerprint and a file object containing the fingerprint’s byte representation are separate entities within this specification.
Note
For rereference, the fingerprint of the empty file object is b39a4820-77f7da28-95347fde-04604c5e-d95784c6-bb748df0-f4a06bbc-767ebf53; and the fingerprint of the empty dictionary object is 0d7f33e1-3e14f31b-3195494a-c7d21f1d-88ee5ade-c4d392ab-1a3fe336-ab9df24b.
There are different possible representations for a fingerprint, suitable for different contexts:
Name | Format / Encoding | Target use |
---|---|---|
binary | 32 bytes (256 bits), no encoding | Binary storage, network protocols |
compact | 46 characters, Base64 + checksum | Print and hypertext media |
long | 55 characters, Base32 + checksum | Mouth-to-ear, analog phone/radio |
hex | 64 characters, hexadecimal | Databases w/o proper support for binary |
The binary representation is formed by the 32 bytes output by the hashing algorithm, unchanged; the hex representation is the same value printed in hexadecimal.
Both the compact and long representations are intended for use by people: they contain a checksum so that accidental character transpositions, substitutions, insertions or deletions during the human-to-human transmission of fingerprints can be detected automatically. See Computing and verifying fingerprint checksums below for details.
Implementations may choose either lowercase or upper case digits for hex and long; as well as adding hyphens in the middle of a hex or long representation to ease reading by humans.
For example, the following three representations are equivalent:
fp:s5pIIHf32iiVNH_eBGBMXtlXhMa7dI3w9KBrvHZ-v1NRAA (compact) fp::WONE-QIDX-67NC-RFJU-P7PA-IYCM-L3MV-PBGG-XN2I-34HU-UBV3-Y5T6-X5JV-CAA (long) b39a4820-77f7da28-95347fde-04604c5e-d95784c6-bb748df0-f4a06bbc-767ebf53 (hex)
Caution!
The hex, long and compact representations are not unique, due to a possible ambiguity of upper vs. lowercase letter digits for hex and long, and the padding in final position for long and compact. Fingerprints MUST be normalized (eg. to binary representation) prior to comparison.
To compute the compact or long representation, proceed as follows:
To verify that a compact or long representation is valid, proceed as follows:
Each object may have multiple syntactic representations. The mapping from semantic to syntactic representation and back again is identified by the name of a representation method.
Any representation method must follow the following common requirements:
New methods can be added over time via new method names; it is expected that Structured Common tools will support common archival formats as representation methods (eg. tgz, tbz, zip, etc), as long as public, open implementations are guaranteed to remain available in the future.
Example code in Python is provided separately:
[1] | RFC 20. "ASCII format for network interchange". (https://tools.ietf.org/html/rfc20) |
[2] | https://en.wikipedia.org/wiki/Unicode |
[3] | RFC 3629. "UTF-8, a transformation format of ISO 10646". (https://tools.ietf.org/html/rfc3629) |
[4] | RFC 4634. "US Secure Hash Algorithms (SHA and SHA-based HMAC and HKDF)". (https://tools.ietf.org/html/rfc6234; see also https://en.wikipedia.org/wiki/SHA-2) |
[5] | Fletcher, J. G."An Arithmetic Checksum for Serial Transmissions". 1982. IEEE Trans. Comm., COM-30(1):247–252. (https://en.wikipedia.org/wiki/Fletcher%27s_checksum) |
[6] | RFC 4648. "The Base16, Base32, and Base64 Data Encodings". (https://tools.ietf.org/html/rfc4648) |
This document has been placed in the public domain.