Process properties specification, version 1

So, I have arrived at the point where I should define what process properties are: this mysterious database which defines every property of a process with respect to every system service it has to deal with. Should it be based on XML ? SQL ? Architecture-specific binary blobs ? In the end, I have settled for a humble home-made text file format, inspired by the syntax of the programming languages I use and my earlier work on Hashish’s configuration files. Read on to find out what I have come up with.

Design

Centralized or decentralized ?

First, I had to define what this specification was for. In essence, every “insulator” of the system that contributes to the process abstraction in a significant way has a set of internal data structures about each program, specifying what its properties are (such as : which files it has access to, whether it can print, etc…). These data structures have to be saved somewhere, and for this there are basically two options :

  • Either one single file per program, stored somewhere in a system folder, which summarizes all there is to know about the associated process. Its content is then sliced and spread across insulators.
  • Or a decentralized database, where it is the job of each individual insulator to store and retrieve per-process properties.

The first option has many attractive sides to it with respect to the second one. First, it allows one to load every process property in one single burst of mass storage access, which is faster than loading lots of tiny files on pretty much every nonvolatile medium in existence today. Second, it does not explicitly require insulators to access the filesystem, which is very useful in some scenarios (such as during OS boot, when the filesystem is not yet initialized). Third, it makes it easier to cleanly uninstall a program or adjust its properties (just remove or edit the aforementioned file in the system folder). Fourth, it allows a process to only request the help of some insulators, without eating up space in the internal database of others. And finally, it allows a program to specify which security permissions it requests by offering a single file which is identical in structure and can be easily diff’d against the system’s copy when new security permissions are requested (e.g. following an update).

On its side, the second option does have some advantages, such as making it easier for some insulators to have a completely different config file format or scaling better when some insulators start to request huge config files. However, I am not sure that either of these should be supported use cases, and still tend to prefer the first option.

What could be stored there ?

A process property database should list every insulator which the process requests access to. For each insulator, the database should specify an array of (property, value) pairs, which specify each property of the process with respect to this insulator. Property naming would follow simple rules akin to the variable naming rules of modern programming languages (alphanumeric characters + underscores, case sensitive).

On their side, values should in principle be able to contain any kind of computer data that fits in a regular text file. But, in practice, this goal is not achievable without an extremely complex (and slow) configuration file parser. So what we propose instead is to explicitly support the most common and useful value types out there, and leave it to insulators to support other sorts of value if they need to. In practice, the value types which I would be thinking about are…

  • 32-bit integer numbers (64-bit is not portable across all target architectures yet, although parsers can internally work with them on architectures that support them)
  • Boolean values (can be stored in integer, but config files are much cleaner when keywords like “true” and “false” are used)
  • 32-bit floating-point numbers (following the IEEE 754 single-precision standard)
  • Strings (initially 7-bit ASCII, with the aim to support Unicode in the future, once we are able to display it)
  • Pointers to any other process property within the same insulator, and empty (“NULL”) pointers.
  • “Custom” values (data is left unparsed by the process property parser between two reserved delimiters, allowing extra data types to be supported as needed)
  • Structures (contains a set of other values, in a C-like fashion, useful for organizing large sets of properties)
  • 1D arrays of those.

I believe that with this, most basic and advanced needs of insulator developers are covered already.
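
To make the list above more concrete, here is a minimal Python sketch of how a parser might represent these value types in memory. The names Pointer, Custom, Value, type_name and is_homogeneous are purely illustrative assumptions, not part of the specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class Pointer:
    """Reference to another property of the same insulator (hypothetical class)."""
    target: Optional[str]  # None stands for the NULL pointer


@dataclass(frozen=True)
class Custom:
    """Raw text found between the custom-value delimiters, left unparsed."""
    raw: str


# A value is a primitive, a pointer, a custom blob, a structure or a 1D array.
Value = Union[int, bool, float, str, Pointer, Custom,
              Dict[str, "Value"], List["Value"]]


def type_name(value: Value) -> str:
    """Return the name of the specification-level type of a value."""
    # bool must be tested before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if isinstance(value, Pointer):
        return "pointer"
    if isinstance(value, Custom):
        return "custom"
    if isinstance(value, dict):
        return "structure"
    if isinstance(value, list):
        return "array"
    raise TypeError(f"unsupported value: {value!r}")


def is_homogeneous(array: List[Value]) -> bool:
    """Arrays must hold values of a single type."""
    return len({type_name(element) for element in array}) <= 1
```

With this representation, a structure is simply a dictionary of named values and an array is a list, so nesting comes for free.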

How should it be stored ?

The goal of this specification is to design configuration files that can be parsed with reasonably good performance, read and edited by hand (when developers write software without using an IDE, or when debugging the OS), and easily compared with each other using a computer program. This specification should also be future-proof, allowing features to be added or removed later.

Non-goals include lowering HDD footprint (when even low-end flash memories can store gigabytes of data, no one cares about whether all config files combined weigh 32 KB or 13 KB), allowing use of extremely compact notations (you do not edit this file on a cellphone every morning, so there is no need to be clever), and offering perfect 1:1 mapping between the process property names used by the configuration file and the actual variables that these property names map to inside of the insulator code (an insulator rewrite should preferably not mean a compatibility breakage with respect to all existing programs).

With this in mind, I propose…

  • A header, specifying which revision of this spec is being used
  • A partially syntax-enforced use of line feeds and indentation, to separate semantically unrelated stuff and improve readability
  • A comment system (which, once again, is useful for hand-editing of files)

Specification

Header and specification revisions

The header is a single-line string, which identifies the text file as a database of process properties and specifies which revision of the specification is being used. Headers must follow the pattern “*** Process properties vN ***”, where N is to be replaced with the integer revision of the specification (currently 1), and be followed by one or more blank lines.

Whenever possible, effort will be spent to express process properties within the framework of the current specification. However, sometimes, changes may need to be carried out, such as adding new value types, or reducing the constraints on existing ones as hardware and system software evolve (e.g. 64-bit integers, double-precision floating point numbers, Unicode strings…). As a rule of thumb, specification-compliant parsers can safely attempt to parse older revisions of the specification, but should preferably abort and ask users to update them when facing newer revisions of it.
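
As an illustration of the header rules above, here is a small Python sketch of a header check. The function name parse_header and the constant SUPPORTED_REVISION are assumptions made for this example, and the sketch only verifies that the line right after the header is blank.

```python
import re

SUPPORTED_REVISION = 1  # revision implemented by this hypothetical parser
HEADER_RE = re.compile(r"^\*\*\* Process properties v(\d+) \*\*\*$")


def parse_header(text: str) -> int:
    """Validate the header line and return the revision it announces."""
    lines = text.split("\n")
    match = HEADER_RE.match(lines[0].rstrip())
    if match is None:
        raise ValueError("this is not a process properties file")
    if len(lines) < 2 or lines[1].strip() != "":
        raise ValueError("the header must be followed by a blank line")
    revision = int(match.group(1))
    if revision > SUPPORTED_REVISION:
        # Newer revisions may use syntax we do not know: abort rather than guess.
        raise ValueError(f"revision {revision} is too recent, please update the parser")
    return revision
```

Older revisions are accepted and reported to the caller, which can then decide how to handle them.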

Insulator declaration

Before being able to perform any kind of trade with an insulator, that is, a system component managing a part of the process abstraction, programs must express their intent to deal with it in their process properties. If this is not done, the insulator will not create management structures associated with the running program’s PID, and any attempt from the program to request something from it will result in either failure or instant process termination.

Declaring which insulators a process deals with is simply done in the process properties by specifying the name of the insulator of interest, followed by a colon and a line feed. After that, (property name, value) pairs related to this insulator shall sequentially be specified, in an indented fashion, using the syntax that is described later in this specification. Once the last property is specified, one or more blank lines shall be left before the next insulator is declared.

Insulator names must not feature any blank space, nor any of the characters = # " { } [ ] < >, and must be at least one character long.

Example :

*** Process properties v1 ***

Insulator1:
    property1 = false
    property2 = true

Insulator2:
    property1 = 34
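
The declaration rules above can be sketched in Python as follows. This is only a rough illustration: the helper names is_valid_name and split_insulators are hypothetical, and the code assumes that the header and the comments have already been removed from the text it receives.

```python
FORBIDDEN = set('=#"{}[]<>')


def is_valid_name(name: str) -> bool:
    """Names need at least one character, no blank space and no reserved character."""
    return len(name) > 0 and not any(c.isspace() or c in FORBIDDEN for c in name)


def split_insulators(body: str) -> dict:
    """Map each declared insulator name to the list of its property lines."""
    sections = {}
    current = None
    for line in body.splitlines():
        if not line.strip():
            continue  # blank lines only separate insulators
        if not line[0].isspace() and line.rstrip().endswith(":"):
            # Unindented line ending with a colon: this is an insulator declaration.
            name = line.rstrip()[:-1]
            if not is_valid_name(name):
                raise ValueError(f"invalid insulator name: {name!r}")
            current = sections.setdefault(name, [])
        elif current is None:
            raise ValueError("property found before any insulator declaration")
        else:
            current.append(line.strip())
    return sections
```

Applied to the body of the example above, this yields one entry for Insulator1 with two property lines and one entry for Insulator2 with a single line.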

Property names, assignment, and basic value types

Insulators can specify any number of properties which define a process under their jurisdiction, and may add to or remove from this list as functionality is added and deprecated. Consequently, files which store process properties should not enumerate every single one of them, and should instead only specify the ones that matter at the time when the program is written, leaving other properties to their default values. Compliance with this recommendation leads to simpler configuration files, faster parsing, and better future-proofing. For this to work, parsers should make sure that default values are properly set, and only emit warnings when some unknown extra process properties are specified (as opposed to when some properties are left unspecified).

The basic syntax for setting the value of a process property is to specify the property name, followed by the “=” sign, the property’s value, and a line feed character as a terminator. Primitive value types are specified by roughly following the conventions of the C++ programming language :

  • Decimal integers are just typed directly. It is also possible to write numbers in hexadecimal or binary, using the syntax “0xH” and “0bB” respectively, where H and B are to be replaced by the hexadecimal or binary digits of the number, most significant digit first.
  • Boolean values are specified using the “true” and “false” keywords, without quotes.
  • Floating-point values use dots as a radix point. Values will not be recognized as a real number unless said dot is present : “1.” is a floating-point number, but “1” is an integer.
  • Strings are put between quotes. The backslash character “\” is used to put special characters within the string, such as line feeds (\n), quotes (\"), tabulations (\t), or backslashes (\\).

Property names must not feature any blank space, nor any of the characters = # " { } [ ] < >, and must be at least one character long.

Here is an example showcasing the use of these various data types :

ExampleInsulator:
    dec_property = 34
    hex_property = 0x10
    bool_property = true
    float_property = 1.5
    string_property = "My first C program used to write \"Hello World !\\n\\tSucker :P\"..."
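
Here is a hedged Python sketch of how the primitive value types described above could be recognized. The function names are illustrative, and two points are assumptions of this sketch rather than rules of the specification: integers are accepted as long as they fit in 32 unsigned bits, and floating-point values have no exponent notation.

```python
import re

ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}
FLOAT_RE = re.compile(r"^-?(\d+\.\d*|\.\d+)$")  # the radix point is mandatory


def parse_string(token: str) -> str:
    """Decode a quoted string, resolving backslash escapes."""
    if len(token) < 2 or token[0] != '"' or token[-1] != '"':
        raise ValueError(f"malformed string: {token}")
    out = []
    chars = iter(token[1:-1])
    for c in chars:
        if c == "\\":
            escaped = next(chars, None)
            if escaped not in ESCAPES:
                raise ValueError(f"unknown escape sequence in {token}")
            out.append(ESCAPES[escaped])
        elif c == '"':
            raise ValueError(f"unescaped quote in {token}")
        else:
            out.append(c)
    return "".join(out)


def parse_integer(token: str) -> int:
    """Decode a decimal, hexadecimal (0x) or binary (0b) integer."""
    lowered = token.lower()
    if lowered.startswith("0x"):
        digits, base = lowered[2:], 16
    elif lowered.startswith("0b"):
        digits, base = lowered[2:], 2
    else:
        digits, base = lowered, 10
    allowed = "0123456789abcdef"[:base]
    if not digits or any(d not in allowed for d in digits):
        raise ValueError(f"malformed integer: {token}")
    value = int(digits, base)
    if value > 0xFFFFFFFF:
        # Report the problem instead of silently overflowing.
        raise ValueError(f"{token} does not fit in 32 bits")
    return value


def parse_primitive(token: str):
    """Pick the right decoder from the shape of the token."""
    if token == "true":
        return True
    if token == "false":
        return False
    if token.startswith('"'):
        return parse_string(token)
    if FLOAT_RE.match(token):
        return float(token)
    return parse_integer(token)
```

Because the type is inferred from the syntax alone, a checker built this way can report a malformed value before any insulator asks for it.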

Advanced value types

Beyond the property values presented earlier, which represent raw information that can be directly parsed by a system-wide parser, there are times when fancier properties are needed. For these use cases, pointers, custom values, structures and arrays come in handy.

Pointers point to another process property within the same insulator. They can be used for a wide range of things, from keeping the value of two variables in sync to implementing linked lists. In configuration files, pointer values are specified by using the “&” character, followed by the name of the property which the pointer points to. The “NULL” special value can also be used to create a pointer which points to nothing. Implementations of this specification must provide recursion detection in their pointer implementations, in order to prevent a pointer that points to itself, directly or indirectly, from causing crashes.

    int_value = 30
    int_ptr = &int_value
    null_ptr = NULL
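
One possible way to implement the recursion detection required above is to follow pointer chains while remembering which properties have already been visited. The Python sketch below assumes a flat dictionary of already-parsed properties; the Pointer class and the resolve function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pointer:
    target: Optional[str]  # None stands for the NULL pointer


def resolve(properties: dict, name: str):
    """Follow pointers starting at `name` until a non-pointer value is reached."""
    seen = set()
    current = name
    while True:
        if current in seen:
            # We came back to a property we already visited: this is a cycle.
            raise ValueError(f"pointer cycle going through '{current}'")
        seen.add(current)
        value = properties[current]
        if not isinstance(value, Pointer):
            return value
        if value.target is None:
            return None  # NULL pointer, nothing to follow
        current = value.target
```

Since each property is visited at most once, the loop always terminates, whether the chain ends on a value, on NULL, or on a cycle.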

Custom values allow insulators to extend the set of primitive value types presented earlier with other kinds of values, which shall be parsed by the insulator itself. They work by starting a property’s value with the special character “<” and ending it with the special character “>” and the customary line feed. Anything in-between these two characters, including line feeds, shall be left unparsed by compliant parser implementations and handed to the insulator as is.

    custom_value = <hexbyte[16] - 546865204F537C70 6572696D656E7400>

Structures, just like C structs (which they take inspiration from), are used to organize large sets of process properties by slicing them into heterogeneous groups. They start with the “{” character, followed by a line feed and a list of process properties that uses the same syntax as above but with an extra level of indentation. After the last property that is to be put inside of the structure is specified, a “}” character closes the structure, and after a line feed one can move on to describe the next process property. Structures can be nested indefinitely, and to tell the parser to access a specific structure element in a request or a pointer value, one uses the “structure_name.property_name” syntax, where structure_name is the name of the structure and property_name is the name of the requested property inside of it.

    structure = {
        member1 = 34
        member2 = true
        this_ptr = &structure
        member1_ptr = &structure.member1
    }
    something_else = 1.5
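
The “structure_name.property_name” syntax can be resolved with a simple walk through nested dictionaries. The following Python sketch assumes that structures have been parsed into dictionaries; the lookup function is a hypothetical helper, not something mandated by the specification.

```python
def lookup(properties: dict, path: str):
    """Return the value designated by a dotted path such as 'structure.member1'."""
    node = properties
    for part in path.split("."):
        # Each path component must name a member of the current structure.
        if not isinstance(node, dict) or part not in node:
            raise KeyError(path)
        node = node[part]
    return node


properties = {
    "structure": {"member1": 34, "member2": True},
    "something_else": 1.5,
}
member = lookup(properties, "structure.member1")
```

The same walk works for any nesting depth, since each path component just descends one level.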

Arrays are lists of values of an identical type, separated by the “,” sign. They are identified by a “[]” at the end of their property name. Line feeds between array values are allowed for large data sets, as the trailing comma provides parsers with a hint that the array is continuing on the next line. Array contents are accessed in pointers and parser requests by stating the name of the array, followed by a “[” character, the index of the targeted element within the array (first element being located at index 0), and a “]” character. Parsers shall also provide an easy way to know the size of an array, and an efficient method to copy the full contents of an array in a memory buffer. As arrays must contain identical elements, it is optional to specify the names of structure members beyond the first element of a structure array.

    int_array[] = 1, 2, 3, 4, 5, 6
    full_array_ptr = &int_array[]
    array_elt_ptr = &int_array[3]
    linked_list[] = {
        name = "Toto"
        age = 3
        next_item = &linked_list[1]
    }, {
        "Tata"
        30
        &linked_list[2]
    }, {
        "Titi"
        0x30
        NULL
    }

As shown in the “linked list” example above, pointers may target properties which are specified after them in the configuration file. Parser implementations shall be careful to allow this use case, which as shown above has some sensible uses.
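
The optional member names in structure arrays could be handled by reusing the names of the first element, as in the Python sketch below. It assumes that a previous parsing stage produced a dictionary for elements with named members and a plain list for elements without names; the function expand_struct_array is hypothetical, and pointers are kept as raw text here for brevity.

```python
def expand_struct_array(elements: list) -> list:
    """Give every element of a structure array the member names of the first one."""
    if not elements or not isinstance(elements[0], dict):
        raise ValueError("the first element must name its members")
    names = list(elements[0])  # dictionaries preserve declaration order
    expanded = []
    for index, element in enumerate(elements):
        if isinstance(element, dict):
            if list(element) != names:
                raise ValueError(f"element {index} has different members")
            expanded.append(dict(element))
        else:
            if len(element) != len(names):
                raise ValueError(f"element {index} has the wrong number of members")
            # Positional values are matched with the names of the first element.
            expanded.append(dict(zip(names, element)))
    return expanded


linked_list = expand_struct_array([
    {"name": "Toto", "age": 3, "next_item": "&linked_list[1]"},
    ["Tata", 30, "&linked_list[2]"],
    ["Titi", 0x30, None],
])
```

Pointer targets such as &linked_list[1] would only be resolved once the whole array is known, which is what makes forward references possible.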

Comments and ignored characters

Process properties may be annotated with single-line comments. A “#”, along with anything that follows it on the same line of text, shall be ditched by a compliant parser, except of course if it is located inside of a string or custom value.

    # This is a comment, it is ditched by the parser
    int_value = 3   #This should also be removed
    str_value = "#This is a string value, which is not to be ignored"
    custom_value = <#This is a custom value, so again we leave it there>

White spaces, tabulations, and multiple line feeds that are located outside of strings and custom values should be ignored by the parser, though their absence can optionally trigger stylistic warnings during the debugging of software.
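
A comment remover that respects strings and custom values can be written as a small state machine. The Python sketch below is one possible approach, with strip_comments being a hypothetical function name; it keeps line feeds in place so that line numbers in later error messages stay meaningful.

```python
def strip_comments(text: str) -> str:
    """Remove '#' comments, except inside strings and custom values."""
    out = []
    state = "code"  # one of "code", "string", "custom", "comment"
    i = 0
    while i < len(text):
        c = text[i]
        if state == "code":
            if c == "#":
                state = "comment"
            else:
                out.append(c)
                if c == '"':
                    state = "string"
                elif c == "<":
                    state = "custom"
        elif state == "string":
            out.append(c)
            if c == "\\" and i + 1 < len(text):
                # Copy the escaped character so that \" does not end the string.
                out.append(text[i + 1])
                i += 1
            elif c == '"':
                state = "code"
        elif state == "custom":
            out.append(c)
            if c == ">":
                state = "code"
        else:  # inside a comment: drop everything up to the line feed
            if c == "\n":
                out.append(c)
                state = "code"
        i += 1
    return "".join(out)
```

Spacing left in front of a removed comment is kept by this sketch, and can be trimmed in a later pass.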

Conclusions

I believe this is a sufficiently detailed and complete specification to store any kind of process properties which I can think of right now. I have architecture-agnostic integers, booleans, floating-point numbers, and strings. I have pointers, which I can use to easily serialize linked lists. I have custom values for arch-specific data that requires specific parsing attention. I can organize large heaps of data in structures and arrays. I have comments to annotate files which I am working on. I can’t think of anything else that I might need right now, so I’m going to implement a first parser and see how well it goes.

Appendix 1 : Test vector

A parser which fully complies with the aforementioned specification should not have issues when handling the following file.

*** Process properties v1 ***

#Simple comment test
BasicInsulator: #Comment and white space after insulator title
    int_property = 0123456789
    hex_property = 0x89abcdef
    bin_property = 0b11111111111111111111111111111110 #Comment after a variable declaration
    invalid_hex_property = 0x100000000 #Although a parser implementation could be able to read this 64-bit
                                       #number, it should warn users about its non-compliant nature
    bool_property = false
    str_property = "This is a basic string"
    str_property2 = "This string uses\tmore advanced formatting\n\"Or does it ?\" \\o/"
    str_property3 = "#Not a comment"

AdvancedInsulator:
    null_ptr = NULL
    invalid_ptr = &invalid_ptr #This pointer is invalid and should trigger an error
    invalid_ptr_cycle1 = &invalid_ptr_cycle2  #These pointers are invalid too.
    invalid_ptr_cycle2 = &invalid_ptr_cycle1  #Again, the parser shall not hang.
    custom_value = <Absolutely Random Stuff @^#_ # No comment on that !>
    basic_struct = {
        int_field = 1, #Should trigger an error because it creates an inhomogeneous array
        bool_field = true
    }
    struct_ptr = &basic_struct.int_field
    basic_array[] = 1, 2, 3, 4, 5, 6, 7, 8
    array_ptr = &basic_array[]  #Points to the whole array
    array_elt_ptr = &basic_array[3]  #Points to one element in the array
    linked_list[] = {
        name = "Element 1"
        content = 1
        next_item = &linked_list[1]
    }, {
        "Element 2"
        2
        &linked_list[2]
    }, {
        "Element 3"
        3
        NULL
    }

 #Trailing lines at end of files should not be a problem either

Appendix 2 : Sample parsing algorithm

Initialization
  1. Check the header and spec revision
  2. Remove comments and spacing (be careful within strings or custom values)
  3. Check that every opening bracket or quote is properly closed (ditto)
  4. Check that insulator declarations are well-formed (e.g. no double colon)
  5. Check that each property is of a recognized type, and has a valid value of that type
Insulator parsing
  1. Parse the file, looking for insulator names and their trailing colon
  2. Generate a list of all insulators which are being described in the file, along with their locations inside of it
  3. ProcessManager may then add the newly created process to each insulator, in the proper order
Property parsing
  1. Proceed sequentially, line-by-line
  2. Check the nature of the current line of text
  3. If it is a property-value assignment, locate the relevant property in a string list
  4. If the property has a simple type, directly set the relevant value
  5. If not, proceed with more complex parsing
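
As an example of the initialization checks listed above, the third step (every opening bracket or quote is properly closed) could be sketched in Python as follows. The function check_balanced is a hypothetical name, and the sketch assumes that comments have already been removed by the previous step.

```python
def check_balanced(text: str) -> None:
    """Raise ValueError if a quote, custom value, brace or bracket is left open."""
    pairs = {"{": "}", "[": "]"}
    stack = []  # closing characters that we still expect, innermost last
    i = 0
    while i < len(text):
        c = text[i]
        if c == '"':
            # Skip the string contents, stepping over escaped characters.
            i += 1
            while i < len(text) and text[i] != '"':
                i += 2 if text[i] == "\\" else 1
            if i >= len(text):
                raise ValueError("unterminated string")
        elif c == "<":
            # Custom values are opaque: jump straight to their closing delimiter.
            i = text.find(">", i)
            if i == -1:
                raise ValueError("unterminated custom value")
        elif c in pairs:
            stack.append(pairs[c])
        elif c in pairs.values():
            if not stack or stack.pop() != c:
                raise ValueError(f"unexpected '{c}'")
        i += 1
    if stack:
        raise ValueError(f"missing '{stack[-1]}'")
```

Running this check before the actual property parsing means that later stages can assume well-formed nesting.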

5 thoughts on “Process properties specification, version 1”

  1. Rudla Kudla April 9, 2012 / 10:00 pm

    Hi, just some small notes:

    1. In text, you talk about C style comments using double slash, but examples use hash.
    2. The parser should correctly handle integers with too many digits than it can handle

    It seems the types of properties are known before parsing. In such case, it is not necessary to explicitly define the type of property by syntax. If we know, that the value is string, there is no need to enclose it in quotes.

  2. Hadrien April 9, 2012 / 10:37 pm

    Thanks for your comments !

    1. In text, you talk about C style comments using double slash, but examples use hash.

    Yup, I have finally settled for hash because it is easier to parse. However, I had to submit the associated post rewrite several times due to internet connectivity issues, and it seems I forgot some things in the last version. Now, that should be fixed.

    2. The parser should correctly handle integers with too many digits than it can handle

    What do you mean by that ?

    It seems the types of properties are known before parsing. In such case, it is not necessary to explicitly define the type of property by syntax. If we know, that the value is string, there is no need to enclose it in quotes.

    My thoughts about this were that if a parser knows about property types, it can more easily detect and report errors when parsing a file, because it can discriminate issues that are specific to each type (example : the ‘y’ and ‘h’ chars have nothing to do in an integer value, but they are at the right place in the middle of a string).
    Of course, it is still possible to detect errors at run time, when the parser is asked to read a property of a given type. And this is by definition what happens for “custom” properties. But I tend to think that it would be preferable to be able to detect errors just at the beginning of the process loading procedure, whenever possible, rather than in the middle of it. Or even to be able to check if a file containing process properties contains errors without having to load the process for that, during software development.

  3. Rudla Kudla April 18, 2012 / 9:10 pm

    Sorry it took me so long to respond, I was kind of busy lately.

    ad 2) I mean the parser should report error, when parsing number like 4294967296 if it only uses 4 byte integer, as it would not fit in there. May be kind of obvious, but many parsers will just overflow and produce incorrect number.

    I’m not sure about exact workflow you have in mind. But consider this:

    int_property = "xxxxx"

    may be syntactically correct, but there would still be an error, if the property is supposed to be an int.
    So you are in fact unable to perform the necessary checks.
    Still, you can use the same (or similar) language at other places in your OS and then it would be reasonable to use the same language.

    However if you do not need to share the language, I would opt for strings without quotes.

  4. Hadrien April 20, 2012 / 8:42 am

    I agree for the large numbers. I think I had already put something in there about it.

    For strings, I tend to agree with you too. After some thought, my only remaining issue is that removing quotes would make removing spaces from the parsed file copy before parsing quite complicated (how do I differentiate a string beginning with a space from a space used as formatting, as an example ?), so I would likely have to enforce stricter formatting rules, such as “one space before the equality sign and one space after, nothing more and nothing less” and “no trailing space” in the end. But it is not necessarily a very big deal.
