PowerShell and nushell take Unix piping beyond raw streams of text to structured or typed data. Is it possible to keep a traditional shell like bash and still get typed pipes?
I think it is possible, and I'm now surprised no one seems to have done it yet. This is a fairly detailed design for how to do it. I've not implemented it yet. RFC.
Let's start with a command called typed. You can use it in a pipeline like this:
typed foo | typed bar | typed baz
What typed does is discover the types of the commands to its left and its
right, while communicating the type of the command it runs back to them.
Then it checks if the types match, and runs the command, communicating the
type information to it. Pipes are unidirectional, so it may seem hard
to discover the type to the right, but I'll explain how it can be done
in a minute.
Now suppose that foo generates json, and bar filters structured data of a
variety of types, and baz consumes csv and pretty-prints a table. Then bar
will be informed that its input is supposed to be json, and that its output
should be csv. If bar didn't support json, typed foo and typed bar would
both fail with a type error.
Writing "typed" in front of everything is annoying. But it can be made a
shell alias like "t". It's also possible to wrap programs using typed:
cat >~/bin/foo <<EOF
#!/usr/bin/typed /usr/bin/foo
EOF
chmod +x ~/bin/foo
Or a program could import a library that uses typed, so it
natively supports being used in typed pipelines. I'll explain one way to
make such a library later on, once some more details are clear.
Which gets us back to a nice simple pipeline, now automatically typed.
foo | bar | baz
If one of the commands is not actually typed, the other ones in the pipe will treat it as having a raw stream of text as input or output. Which will sometimes result in a type error (yay, I love type errors!), but in other cases can do something useful.
find | bar | baz
# type error, bar expected json or csv
foo | bar | less
# less displays csv
So how does typed discover the types of the commands to the left and
right? That's the hard part. It has to start by finding the pids to its
left and right. There is no really good way to do that, but on Linux, it
can be done: Look at what /proc/self/fd/0 and /proc/self/fd/1 link to,
which contains the unique identifiers of the pipes. Then look at other
processes' fd/0 and fd/1 to find matching pipe identifiers. (It's also
possible to do this on OSX, I believe. I don't know about BSDs.)
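Here is a rough Python sketch of that lookup, assuming a Linux /proc where a pipe's link target looks like pipe:[12345]. The function names (pipe_id, fd_pipe_id, find_neighbours) are made up for illustration and are not part of any existing typed tool.

```python
import os
import re

# On Linux, readlink on a pipe fd yields something like "pipe:[12345]".
PIPE_RE = re.compile(r"^pipe:\[(\d+)\]$")


def pipe_id(link_target):
    """Return the pipe's inode number, or None if the target is not a pipe."""
    match = PIPE_RE.match(link_target)
    return int(match.group(1)) if match else None


def fd_pipe_id(pid, fd):
    """Pipe inode behind /proc/<pid>/fd/<fd>, or None if unreadable or not a pipe."""
    try:
        return pipe_id(os.readlink(f"/proc/{pid}/fd/{fd}"))
    except OSError:
        return None


def find_neighbours(candidate_pids):
    """Return (left_pid, right_pid) for this process among the candidates.

    The left neighbour writes to the pipe we read from (its fd 1 matches
    our fd 0); the right neighbour reads from the pipe we write to.
    """
    my_in = fd_pipe_id("self", 0)
    my_out = fd_pipe_id("self", 1)
    me = os.getpid()
    left = right = None
    for pid in candidate_pids:
        if pid == me:
            continue
        if my_in is not None and fd_pipe_id(pid, 1) == my_in:
            left = pid
        if my_out is not None and fd_pipe_id(pid, 0) == my_out:
            right = pid
    return left, right
```

Reading another user's /proc/$pid/fd fails with a permission error, which this sketch simply treats as "no match".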
Searching through all processes would be a bit expensive (around 15 ms with
an average number of processes), but there's a nice optimisation:
The shell will have started the processes close together in time, so the
pids are probably nearby. So look at the previous pid, and the next
pid, and fan outward. Also, check isatty to detect the beginning and end
of the pipeline and avoid scanning all the processes in those cases.
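The search order could look something like this in Python. It's only a sketch: fan_out and sides_to_search are hypothetical names, and the default upper bound is the largest pid_max a 64-bit Linux kernel allows (the real limit is in /proc/sys/kernel/pid_max).

```python
import os
from itertools import count


def fan_out(start, lowest=1, highest=4194304):
    """Yield pids in the order start-1, start+1, start-2, start+2, ...

    Stops once both directions have run past their bounds.
    """
    for distance in count(1):
        below, above = start - distance, start + distance
        if below < lowest and above > highest:
            return
        if below >= lowest:
            yield below
        if above <= highest:
            yield above


def sides_to_search():
    """Return (search_left, search_right).

    If stdin is a terminal there is no command to the left, and if stdout
    is a terminal there is none to the right, so that side can be skipped.
    """
    return (not os.isatty(0), not os.isatty(1))
```

A caller would feed fan_out(os.getpid()) into the pipe-matching step and stop as soon as every side that needs a neighbour has found one.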
To indicate the type of the command it will run, typed simply opens
a file with an extension of ".typed". The file can be located
anywhere, and can be an already existing file, or can be created as needed
(eg in /run). Once it discovers the pid at the other end of a
pipe, typed first looks at /proc/$pid/cmdline to see if it's
also running typed. If it is, it looks at that process's open file handles
to find the first ".typed" file. It may need to wait for the file handle
to get opened, which is why it needs to verify the pid is running typed.
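A minimal Python sketch of that check, again assuming Linux /proc. The helper names are invented, and "first" is read here as "lowest-numbered file descriptor", which is my assumption rather than something the design pins down.

```python
import os


def is_typed_cmdline(raw):
    """True if the NUL-separated cmdline's argv[0] has the basename 'typed'."""
    argv0 = raw.split(b"\0")[0]
    return os.path.basename(argv0) == b"typed"


def runs_typed(pid):
    """True if /proc/<pid>/cmdline shows the process running typed."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as handle:
            return is_typed_cmdline(handle.read())
    except OSError:
        return False


def first_typed_file(pid):
    """Path of the lowest-numbered open fd ending in '.typed', or None.

    None can also mean the file has not been opened yet, so a caller that
    already knows the peer runs typed would retry after a short sleep.
    """
    fd_dir = f"/proc/{pid}/fd"
    try:
        fds = sorted(int(name) for name in os.listdir(fd_dir))
    except OSError:
        return None
    for fd in fds:
        try:
            target = os.readlink(f"{fd_dir}/{fd}")
        except OSError:
            continue  # the fd was closed while we were looking
        if target.endswith(".typed"):
            return target
    return None
```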
There also needs to be a way for typed to learn the type of the command
it will run. Reading /usr/share/typed/$command.typed is one way.
Or it can be specified at the command line, which is useful for wrapper scripts:
cat >~/bin/bar <<EOF
#!/usr/bin/typed --type="JSON | CSV" --output-type="JSON | CSV" /usr/bin/bar
EOF
And typed communicates the type information to the command that it runs.
This way a command like bar can know what format its input should be in,
and what format to use as output. This might be done with environment
variables, eg INPUT_TYPE=JSON and OUTPUT_TYPE=CSV.
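To make that concrete, here's a sketch of how a command like bar might use those variables, written in Python. It assumes the data is a flat list of records and that the type names are exactly JSON and CSV; read_records, write_records and run are hypothetical helpers.

```python
import csv
import io
import json


def read_records(text, input_type):
    """Parse the input into a list of dicts according to INPUT_TYPE."""
    if input_type == "JSON":
        return json.loads(text)
    if input_type == "CSV":
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError(f"unsupported input type: {input_type!r}")


def write_records(records, output_type):
    """Serialise a list of dicts according to OUTPUT_TYPE."""
    if output_type == "JSON":
        return json.dumps(records)
    if output_type == "CSV":
        out = io.StringIO()
        fields = list(records[0]) if records else []
        writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return out.getvalue()
    raise ValueError(f"unsupported output type: {output_type!r}")


def run(environ, stdin, stdout):
    """Convert stdin to stdout using the types that typed put in the environment."""
    records = read_records(stdin.read(), environ.get("INPUT_TYPE"))
    stdout.write(write_records(records, environ.get("OUTPUT_TYPE")))
```

A real command would call run(os.environ, sys.stdin, sys.stdout) at startup.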
I think that's everything typed needs, except for the syntax of types and
how the type checking works. Which I should probably not try to think up
off the cuff. I used Haskell ADT syntax in the example above, but don't
think that's necessarily the right choice.
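Just to show the shape of the problem, here is the most naive reading of the "JSON | CSV" notation in Python: a type is a set of alternatives, and two ends of a pipe match if the sets overlap. This is purely an illustrative assumption, not a proposal for the real type syntax, and parse_type and negotiate are made-up names.

```python
def parse_type(spec):
    """Split a spec like 'JSON | CSV' into the set {'JSON', 'CSV'}."""
    return {alt.strip() for alt in spec.split("|") if alt.strip()}


def negotiate(producer_output, consumer_input):
    """Return the formats both sides accept, sorted, or raise TypeError."""
    common = parse_type(producer_output) & parse_type(consumer_input)
    if not common:
        raise TypeError(
            f"type error: producer offers {producer_output!r}, "
            f"consumer expects {consumer_input!r}"
        )
    return sorted(common)
```

With the earlier example, negotiate("JSON", "JSON | CSV") settles foo and bar on JSON, and negotiate("JSON | CSV", "CSV") settles bar and baz on CSV.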
Finally, here's how to make a library that lets a program natively support
being used in a typed pipeline. It's a bit tricky, because it has to run
typed, because typed checks /proc/$pid/cmdline as detailed above. So,
check an environment variable. When not set yet, set it, and exec typed,
passing it the path to the program, which it will re-exec. This should
be done before the program does anything else.
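In Python, such a library might look roughly like this. The variable name TYPED_WRAPPED and the /usr/bin/typed path are assumptions for the sake of the sketch, and it relies on the program being directly executable so that typed can re-exec it.

```python
import os
import sys

GUARD = "TYPED_WRAPPED"  # hypothetical marker variable, not an established name


def reexec_argv(environ, program, args, typed="/usr/bin/typed"):
    """Return the argv for exec'ing typed, or None if already running under it."""
    if environ.get(GUARD):
        return None
    return [typed, program, *args]


def ensure_typed(typed="/usr/bin/typed"):
    """Call first thing in main(): re-run this program under typed if needed."""
    program = os.path.abspath(sys.argv[0])
    argv = reexec_argv(os.environ, program, sys.argv[1:], typed)
    if argv is None:
        return  # second time through: typed has already re-exec'd us
    os.environ[GUARD] = "1"  # inherited by typed and by the re-exec'd program
    os.execv(typed, argv)  # replaces this process and does not return
```

The guard variable is what stops the re-exec'd program from exec'ing typed again in an endless loop.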
This work was sponsored by Mark Reidenbach on Patreon.
Wow, this is really clever!
I think the /proc dancing is a strong argument to implement "typed" as a shell command (in the shell itself), because otherwise performance will probably drop in shell scripts with a lot of simple calls.
Hacker News thread with some prior art.
The shell will have put all the processes of the pipeline into a single process group, so this can be sped up a bit more by calling getpgid() on a process before examining its fds.
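That filter is cheap to sketch in Python with os.getpgid, which exists on Unix. The helper names are hypothetical, and this assumes the shell really did place the whole pipeline in one process group, as interactive shells with job control do.

```python
import os


def same_process_group(pid):
    """True if pid belongs to the same process group as the caller."""
    try:
        return os.getpgid(pid) == os.getpgid(0)
    except OSError:
        return False  # the process vanished or may not be inspected


def pipeline_candidates(pids):
    """Keep only the pids worth the more expensive /proc/<pid>/fd lookups."""
    return [pid for pid in pids if same_process_group(pid)]
```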