jes notes Index Gallery . 4th axis Blog ideas Micro-machining CNC milling machine Paint colours CNC Router Shaft passers Software ideas Stepper motor clock Toy ideas Watchmaking

2024-04-30

Last modified: 2024-04-30 19:56:03

< 2024-04-29 2024-05-01 >

Numeric form value encoding

Let's say you have a form like the Oscillating Engine Simulator one:

And you want to convert the contents of the form into a string that users can copy/paste to save/restore the values, share in DMs, spreadsheet columns, etc.

Also I really want to maintain the string value of each number (i.e. distinguish "2.5" from "2.50000" etc.), to faithfully recreate the actual contents.

Essentially I want to encode numeric-valued form contents in a short string that is easily copy/pastable.

The first thing that springs to mind is to encode the form in JSON, with numbers as strings. That has the advantage that it is easy to do and easy to hand-edit, but it is long, and double-click-to-select-all doesn't work because of all the word boundaries.

{"bore":"15","stroke":"30","gap":"4.75","pistonrod":"63","pivotsep":"67.5","portswing":"12","inletdia":"2.5","exhaustdia":"2.5","cyldia":"2","inletangle":"-14.5","exhaustangle":"14.5","flywheeldia":"68","flywheelmoi":"0.000203","atmos":"101.325","supply":"50","reservoir":"12000","fillport":"0.7","static":"0.0136","speeddep":"-0.00003347","sqrspeeddep":"0.0000000588","load":"0"}

That's 380 bytes.

(Without loss of generality I'm only encoding the numeric values; other values can be passed in broadly the same way).

If we're loading/saving the same form every time then we can drop the field names and just make it an array of values:

["15","30","4.75","63","67.5","12","2.5","2.5","2","-14.5","14.5","68","0.000203","101.325","50","12000","0.7","0.0136","-0.00003347","0.0000000588","0"]

153 bytes.

Since we know that we're making an array of strings, we can drop the square brackets and quote marks, and just transfer comma-separated values:

15,30,4.75,63,67.5,12,2.5,2.5,2,-14.5,14.5,68,0.000203,101.325,50,12000,0.7,0.0136,-0.00003347,0.0000000588,0

109 bytes.

But it is longer than necessary, and we still have the issue that double-click-to-select-all doesn't work. We could fix the double-clicking issue without making it any longer by (for example) replacing commas with "A", decimal points with "B", and minus signs with "C":

15A30A4B75A63A67B5A12A2B5A2B5A2AC14B5A14B5A68A0B000203A101B325A50A12000A0B7A0B0136AC0B00003347A0B0000000588A0

So we're still at 109 bytes.

My idea is let's encode these values with a sort of "base11" which is like the normal base 10, plus a 10th fake digit that encodes the comma. We split each number into 2 values: one before the decimal point and one after.

And the rest is along the same lines as https://github.com/jes/chess-steg - form this big value as a bignum, and then write it out in base64. And in forming the bignum we can use the "variable base" idea to insert a sign bit in front of every integer part.

So how long does that end up being?

The number we get is:

522035207404143755986838511646149911090601406833074548504285605901409306612626322367767620165551368371428668549753190300443098832819140940011788

That's 59 bytes of actual information, but takes up 144 bytes encoded in ASCII.

DBkIEazg/UACBeGwIx6j3cDFu846MT7VubzCWQDRpKgW1TFRa1ENaTiCGfOGlLWj3vcyMstGkBdHRs8q

80 bytes after the raw value is base64'd, which I think is the best we're going to do.

But I don't like the "+" and "/" characters in base64, I prefer base58. That gets us:

8A1s1KCJbCamByaY7xE6eUb2GHRobWv1RNQhtPPCgA9hATgkLZgWsqjhBCEX6jeswKd8SaZuhQaK85iecS

Which is 82 bytes and no special characters.

Out of curiosity I want to know how large all the other intermediate stages would be after compressing with zlib (which I understand to be gzip minus header?) and encoding with base58:

So compressing a CSV actually gets you pretty close! 82 bytes for my scheme vs 96 bytes (+17%) for compressed CSV. I expected compressed CSV to do much worse on such a short string. Obviously compressed CSV will win out eventually as the string gets longer, but the example case I'm using here is already long for the kind of use-case I have in mind. Most of my forms don't have that many fields.

Next up:

Instead of base58 I would actually rather use base62. Just all of [a-zA-Z0-9].

I realised my script was wrong - I used a factor of 11 for the sign bit instead of 2, fixing that means 82 bytes comes down to 73.

Need to mention the idea of adding a "magic" number to your collection of fields, so you can validate that you are getting the right stuff. Also maybe add a checksum (does starting with a single "1" already do this? probably not because there's a 1/11 chance of landing on a 1 anyway?). Also maybe add a way for library users to add boolean fields, multiple choice, and arbitrary strings.

It is possible for later values to be different based on an earlier boolean, I want to allow usage like:

let formpack = new FormPack("example");
formpack.encodeNum(fooValue);
formpack.encodeBool(longFormat);
if (longFormat) {
    formpack.encodeNum(extraValue);
}
formpack.encodeMulti(size, ["small", "medium", "large"]);

and then on decoding:

let formpack = new FormUnpack("example", "A67sdf90324908sfdfs2w7934sdhgf...");
fooValue = formpack.decodeNum();
longFormat = formpack.decodeBool();
if (longFormat) {
    extraValue = formpack.decodeNum();
}
size = formpack.decodeMulti(["small", "medium", "large"]);

A "bool" is a base-2 digit.

A "multi" is a single variable-base digit, where the base is the number of elements.

A "string" would be maybe capped at 256 characters, with an initial base-256 digit to say the length, and then a series of base-256 digits containing the bytes.

The argument to FormPack() is a string that will be hashed to create the magic number for validation.

One downside of an API like this is you have to duplicate the specification between the encoding and decoding phases, and almost always you won't actually want the thing where the behaviour is different based on one of the earlier bits. In fact, maybe the solution is if you have something that wants to alter the decoding, you should look at that ahead of time and use it to pick a different magic string, and then on decoding you try each decoder and see which one works?

Will including the magic string be enough of a checksum? Maybe include it both at the start and the end.

So if we have a "fixed" encoding spec, and we name the fields and make it operate on hashes instead of arrays, then maybe the API would be more like:

let formpack = new Formpack("example");
formpack.numField("fooValue");
formpack.boolField("longFormat");
formpack.numField("extraValue");
formpack.multiField("size", ["small", "medium", "large"]);
formpack.stringField("name", 256);

let vals = {
    fooValue: 1234.56,
    longFormat: true,
    extraValue: 56.78,
    size: "medium",
    name: "jes",
};
let str = formpack.encode(vals);
let vals = formpack.decode(str);

The second argument to stringField() says what the maximum length of the string is. We'll use exceptions for error reporting.

Great success, works: https://github.com/jes/formpack/blob/master/formpack.js

Supports numbers, bools, strings, multiple choice, has half a chance of detecting errors because it puts the nameHash both at the start and the end.

Needs error handling, a proper checksum instead of duplicating the nameHash, documentation, examples.

OK, we've got error handling, a proper checksum, we're using a hash of the field spec instead of a nameHash, there's documentation and an example in https://github.com/jes/formpacker and I have published my first ever npm package at https://www.npmjs.com/package/formpacker

And I'm using it in https://incoherency.co.uk/oscillating-engine/

Todo

< 2024-04-29 2024-05-01 >