<- PreviousBack to post index

Stred Part 4: Numbers Overhaul

2025-05-07

This post is the fourth in a series starting here. It might not make much sense without reading the earlier parts first. This series is about developing an idea for a version of sed that works on JSON.

Stred and numbers

I take serious issue with the way numbers are currently handled in subexes. To recap, the + operator in subex will consume the output of the subexes before it, and output their sum (e.g. ..+ would add two values together). -, *, and / also have vaguely reverse polish notation functionality.

Unfortunately, this idea is really bad and has been calling for something better since just after I implemented it. Some of the biggest problems with it are...

Convinced that I could come up with something better, I spent over a year experimenting. A year!? you might say, why did it take you that long? Well I kept having bad ideas. I could just write up the good ideas, but I think it'd be more fun to take you on a journey through my pain. Here we go!

Quick aside: I play very fast and loose with syntax in this post because I didn't sit with most of ideas long enough to properly set syntax for them. I know the syntax in the examples is inconsistent and nonsensical, I just don't care.

Idea 1: Represent the numbers as strings

Subexes are really good at matching and transforming strings, right? So, if numbers could just be strings, then subexes would now be really good at matching and transforming numbers! Look how easy it is to multiply a number by 10:

.{-0} \.$_ . `.` .{-0}

Eh, at least it's possible. The syntax of subexes is undocumented, extremely hard to read, and changes regularly. To save you from madness, do not bother trying to understand the subex examples. I will do my best explain them with words, as I have done above. The overall effect of that subex is to swap the decimal point in the number with the digit after it (e.g. replace 107.94 with 1079.4) which multiplies the number by 10!

Uh oh, a problem

This is already impractical for multiplying by 10, which is probably the easiest thing possible with this system. Doing anything more complicated (even just incrementing an integer) is difficult enough to not be worth doing. Fortunately I did not waste much time considering this option.

Idea 2: Represent the number in binary

Very similar to the last idea, not really any better but I'll mention it anyway. A number is represented as a string of binary 0s or 1s. Again, subex already has all the tools needed to work with these strings.

What did I expect?

This doesn't even fix the problem from the last idea, and still introduces a new one! No one thinks in binary, imagine if before you could add 150 to something, you had to convert it to binary, no good at all.

Idea 3: Represent the number in unary

I really like this idea, and I wish it worked. We pick some character (e.g. a) that counts as 1, and then a number is just a string of that many as in a row. For example, 12 is represented as aaaaaaaaaaaa. Addition becomes concatenation, which subexes can already do easily. Multiplication is just repeated addition, and subexes already have repetition operators. For example, to double a number, just replace every a with aa:

(a`a`){-0}

If only it were that simple

This works pretty well... for non-negative integers. Due to some cosmic misfortune however, numbers like -5 and 3.2 also exist. You try writing a 3.2 times.

Idea 4: Expression Strings

At this point I started searching for any other obscure ways of expressing numbers as strings. Any base system is out, as is unary. Tallies and Roman numerals are similar enough to the other ideas to be discounted too.

I started thinking about a system where you can dynamically change the base the number is represented in, which might've helped. I was thinking about how base 10 is just a fancy way to express a sum of products, like 123 is 1 * 100 + 2 * 10 + 3, when suddenly it hit me, that expression is a string!

If we represent 50 as the expression 25 * 2, then suddenly halving it using string operations is really easy!

. (* 2)$_
  1. . Copy a value from input to output unchanged
  2. (* 2)$_ Remove the string "* 2"

We could also do the reverse to double a number. The big question is, how do we decide which expression to use for a particular number? Any number could be represented with infinitely many expressions, some of them will make certain operations really easy and some will make them extremely difficult. I don't think there's any canonical way to formulate a number as an expression that will enable all of the operations we need.

Who cares? Just do all of them!

Subexes are translated to non-deterministic state machines, which means they effectively can look into the future. This could work by effectively trying every possible expression, but discarding all the ones that will later fail.

There might be too many expressions

The current implementation of subexes uses an array of every possible state at each point while it is evaluating a subex against an input. The number of possible states at once and the length of time it can run for is bounded depending on the length of the input (finite) and the length of the subex itself (also finite). This does not hold for numeric expressions. Consider:

0 (+ 1|- 1){-0}

Given the input 5, these are the expressions that it could be written as that would match the above subex:

See the problem? It is always possible to match another pair of values, as 1 and -1 cancel each other out. {-0} specifies to match the longest possible string, so anything close to the existing algorithm would just keep matching forever and never terminate.

For a while I tried to write some kind of algorithm that could evaluate these by deducing an equation from them and intelligently computing limits. This rapidly spiralled out of control, and became far more complex than is suitable for a "simple" tool.

Idea 5: Unary... again?

I realised that writing a subex to match only a particular value (say 31) using the unary representation could actually just as easily be applied to a non-natural number (like -3.2):

Now we can represent any number we want.

Deja vu

Look at this subex, seem familiar?

(a|a{-1}){-0}

It's basically the same problem as the last idea. The a and the a{-1} (a negative a) can keep being matched in pairs forever.

Idea 6: Number mode

It seems pretty clear that the string operations (concatenation, repetition, etc.) don't fit very well with the needs of an arithmetic system. Fortunately, as long as the parser can tell when a subex is operating on numbers, we could redefine the existing syntax to mean something else. This could give us much more useful operations without needing to introduce loads of extra syntax.

The big issue so far has been the combination of alternation (the | operator) and repetition (the {-0} operator). If we redefine one or both of them, we could avoid this issue.

I mentioned earlier that repetion for multiplication is a natural fit because multiplication is kinda just repeated addition. This is the closest fit to how the string subexes work, but if we make a small change we could maybe fix the last idea.

{-0} is now literally multiplication/division

The repetition operator is now interpreted as division by some number on the input, and multiplication by it on the output. This is a hard idea to get across, so I'll show a couple of examples:

Example 1: Match numbers between 4 and 10

a{4-10}

With the number 6 as input, the interpreter divides it by 6 (as it needs the result to be 1), matches with 1, and then multiplies by 6 again. Therefore it matches it and makes no changes.

With the number 2 as input, the interpreter deduces that it cannot divide it by a number between 4 and 10 to get 1, so it rejects it. Therefore the subex matches numbers from 4 to 10.

Example 2: Multiply numbers between 0 and 20 by 1.5

(aa`a`){0-10}

With 5 as an input, the interpreter first divides it by 2.5 (which is between 0 and 10) in order to pass 2 to the inner subex. That subex matches it and outputs 3, which is then multiplied by 2.5 again to finally produce 7.5. Overall, the subex will match numbers between 0 and 20 and multiply them by 1.5.

Example 3: That thing that ruined my last idea

(a|a{-1}){-}

The last idea failed here, but not this one! With 5 as an input, the interpreter could divide by 5 or -5 as the inner subex requires 1 or -1. As with the other operators, we just pick a way to break the tie. Let's say it always picks the largest number, in this case, 5. The inner subex matches and outputs 1, which is multiplied again by 5, finally outputting 5. The subex is now well defined!

Example 4: Pain

a{-0}{-0}

Ok, with 5 as input the interpreter needs to work out what to divide it by to get a positive number... well, there are quite a few options. Any positive number would work actually. The tiebreaker tells us to pick the largest positive number, but there isn't one. We could come up with another tiebreaker, lowest, closest to 0, etc. They all have very similar examples that break them.

There might be other ways to redefine the meaning of the alternation or repetition to make this work. For it to work, they really ought to be fairly close in use to how they work with strings though, and I haven't found anything that fits that.

Idea 7: Special placeholders

At this point, I realised that a single mechanism that can do both matching and transformation, while using the same basic operations used for strings, is unlikely to exist. Let's relax the conditions a bit then, and consider some ideas that don't meet all three of the criteria.

This idea focuses just on matching numbers. We already have characters like . that match more than one possible value, so we could add a symbol that matches all integers, or one for all reals etc. Additionally, if we know while parsing whether a particular subex will be reading characters from a string or whole values, we can reuse characters like i for integers.

I want a number class for even numbers between 14 and 30

There are also too many classes of numbers that might be needed for a system this simple to be useful. We can't dedicate a character to every possible set of numbers, we don't have that many characters! Fortunately this can be improved with...

Idea 8: Number ranges

Regexes already have a [] syntax for ranges of characters. Subexes use it too for characters, but could also use it for numbers as the parser knows whether a given expression will be applied to values or runes. The obvious choice here is to allow ranges of numbers like 0-5 or something. The - syntax won't work as it's needed for negative numbers, and we'll want some way to distinguish between the various interpretations of such a range:

We might pick _ for real numbers since a line is a continuous set of points, and .. for discreet steps (which should default to integers) as the dots have a gap between them. !_ could indicate that the left number is excluded but the right included, and similarly for _! and !_!.

For discreet steps, we could allow numbers to be listed before the .. to show the correct spacing of the values, for example ..0,2.. to encode even numbers. By allowing two examples, we can effectively match any element of a linear sequence.

Idea 9: Extrapolating Sequences

Let's keep going! We could allow more than two examples in the sequence of numbers to allow matching more complex patterns like square numbers (with 0,1,4...) or powers of two (with 1,2,4...) for example.

That's not how maths works

For any finite list of values, there are infinitely many polynomial functions that produce a sequence including that list, as well as infinitely many non-polynomial functions that could also produce it. 0,2... could be even numbers, but could also be interpreted as "Match numbers that are double square numbers". Obviously we could assume that the user is intending for it to mean even numbers in this instance, as it's the "simpler" and "more obvious" option, but other examples are much less clear cut.

Maybe it could be how maths works?

We can constrain which discreet steps are allowed. Linear sequences (an + b) are very easy to spot and compute, and quite common, so those should be allowed. Polynomials with only one term and exponentials are also fairly common ((n+a)^b and b^(n+a) respectively). Many lists of two numbers can be matched to all three of these (e.g. 1,4 could be counting up in threes, square numbers, or powers of 4) so we should assume that if two numbers are given, the result is always linear (which is the only one that is always possible).

At this point, I went down a rabbit hole of trying to prove that 3 terms in a sequence is enough to disambiguate between quadratic and exponential functions, and coming up with algorithms to deduce these functions. After a couple of hours I realised something:

Subexes do not need to easily match members of quadratic or exponential sequences

No one will care if they cannot easily write a subex to recognise triangle numbers! Stred is supposed to be simple. It won't have trigonometric functions, complex numbers, or pattern matching niche numeric sets. It would be kinda cool, but it's just too much.

Idea 10: Number Sets

The last idea would've required deriving an expression for a sequence based on examples, but perhaps it would be better to just have the user write that expression themselves. Building on Idea 7: Special Placeholders we could allow those placeholders to be used in the [] subexes to produce more complex patterns using standard expression syntax. For example:

This lets us express a whole lot of sets of numbers using combinations of not that many placeholders and numerical operators.

Now can I match even numbers between 14 and 30?

Well, 2 * p + 12 gets us to even numbers that are at least 14. Then, erm, hmm, well... no.

Unfortunately, there is a frustrating limitation that with few exceptions (e.g. multiplying by 0), applying numerical operators will not change the size of the set of numbers matched. There are 9 even numbers between 14 and 30, so we know we will have to start with a number class that matches exactly 9 different numbers.

Easy, use an integer parameter

This one isn't so bad. We can introduce another placeholder that has an integer parameter to specify how many values it should match. For example, c5 could match integers from 0 to 4 (5 possible values). This is a little clunky, but does solve the problem in a pretty simple way. Now 2 * c9 + 14 matches even numbers between 14 and 30!

Think of all the cool things I can match now!

(p + 1) * (p + 1)

You may have recognised that the above subex will match any non-prime number that is at least 4. You may also have recognised this as a famously difficult problem to compute, so difficult that a lot of cryptography literally relies on it being difficult. We probably should not be able to write a subex for it in 11 characters.

Well now you can't

Another simple but slightly clunky solution! We disallow using more than one placeholder in a single pattern. This means the non-prime pattern above would be rejected by the parser before any attempt to evaluate it was made.

Idea 11: Special Slots

Subexes already have slots which can be used to reference values matched earlier. There are two ways of interacting with a slot:

  1. Storing value/s in the slot
  2. Retrieving value/s from the slot

So far, there is the assumption that when you retrieve values from a slot, you get exactly the values that were stored, but this doesn't need to be the case. If we designate a particular slot to be used for addition (the + slot makes sense), we can say that retrieving values from that slot produces the sum of the values that were stored. This would be used in a very similar way to the reverse polish notation system already in use:

(..)$+`$+`

We store two values into the +, but when we load them, there is only one value there, the sum of the two original values.

The big advantage this idea has over what is currently used is that the arithmetic operators are used in the slots rather than being their own syntax, which means + can still be used for repetition when it's not referring to a slot. If we introduce some additional syntax to mean "store and immediately load a slot" (for example, %) then the above example becomes (..)%+, which is barely longer than the current system.

Idea 12: Standard Infix Notation

More specifically, in the context where we are directly inserting values (i.e. inside `s) the arithmetic operators are currently unused, so we can just have them mean exactly what you'd expect. This means `2+3` would be equivalent to `5`. Naturally, we can also use slots. .$a.$b`$a+$b` would read two numbers, and output the sum of them.

This is a little bit of extra complexity, and does nothing to help with matching, but is extremely intuitive and the extra syntax isn't already in use.

Idea 13: Combine ideas 8 and 12

Already the syntax for ranges of characters in strings ([a-z] which I borrowed from regexes) has a feature to map between characters (e.g. [a-z=A-Z] to capitalise English letters). Idea 8 (and 9 and 10) give a similar pattern matching feature for numbers, so we could easily support a similar mapping feature.

The string version works by lining up the arrays of characters on both sides, finding the character on the left and outputting the corresponding character on the right. That same strategy should work with numbers:

Already this idea has a lot of potential use, but we could also incorporate arithmetic operations into the replacement side of these subexes to make them even more powerful. Let's use & to represent the matched value.

Conclusion

TLDR: Idea 13 is the most cohesive idea that provides the required functionality without extra complexity.

This leaves ideas 8, 11, 12, and 13. Recall that idea 13 is a combination of ideas 8 and 12, so really we just have ideas 11 and 13.

Idea 11 addresses only a small part of the problem by itself, while idea 13 stands well by itself. With all that in mind...

Idea 13 is the natural choice

The examples I gave with idea 13 encompass a very wide range of uses and features, but (I think) it's still simple enough to be relatively easy to implement.

The biggest problem I can think of is that operations like summing a list of numbers can't be done in one subex, meaning a loop will be needed. This isn't that bad, and I think could easily be covered by adding a new loop shorthand to subex that repeatedly runs a subex until it doesn't match. A subex like .$a.$b`$a+$b`.* will replace the first two numbers in an input with their sum, but will reject an input with only one value. This means if repeated on a list of numbers until rejecting, you'll end up with the sum of all those numbers.

Remaining Problems

My last post also included "subex may be unnecessarily complex" on this list. I've since decided that it's not unnecessarily complex. It helps that I think it's actually got simpler since I first wrote that as a disadvantage, though I don't think it was too complicated to begin with.

That is a bit vague, so I've broken it down into more specific problems:

As always if you've found this interesting do send me an email, (shtanton at shtanton dot xyz) I love to chat more about this!

I'm pretty happy with the semantics of stred now, so next will probably be overhauling the syntax to make it more user friendly.