“What turns out to be crucial to the success of the evolutionary algorithm is how the candidate solutions are represented as data structures. This is known as the ‘representation problem’ […]” (Wagner and Altenberg 1996)
What data structure should we use for our genome? Dawkins’ Biomorphs used a fixed-length array of integers. Other ways of generating genotype-phenotype spaces make different choices. Genomes can be binary strings that directly encode candidate structures (Holland?), numerical weights in neural networks [XXX], or formal grammars that generate syntactic forms [YYY]. I am going to treat the genome as a program.
There are good engineering reasons for choosing a program. John Koza argued that what should vary are not just numerical parameters within a model, but the procedures themselves (Koza 1992). On this view, the genome is not a data structure handed off to a fixed developmental process—it is the process itself.
It is easy to see how this could be useful. The Biomorph genome was an array of integers, but Dawkins experimented with variations of the algorithm: modifying the code to relax symmetry constraints, introduce segmentation, or alter how segments scaled (Dawkins 1989). Each of these changes allowed for new kinds of variation in the resulting forms. But in each case, the change came from outside the system. The modeler intervened; the model itself did not adapt.
Koza’s proposal eliminates that boundary. By allowing the procedure to vary, models gain access to a much broader design space—one in which the rules of construction themselves are open to transformation.
But here is a problem. Koza’s objective was engineering-focused. He wanted better problem-solving—more flexibility. The aim here is different.
Our aim here is not solving optimisation problems, but gaining insight into biological processes. So our choice of representation cannot be guided solely by how well it works. This difference in goals explains why, despite hopes to the contrary, XXXX.
A model that treats the genome as a program may succeed in generating novelty, but unless it captures something about how these systems work, it risks becoming a toy: expressive, but uninformative. And this matters because the idea of a genome as a program has been widely criticized and often dismissed.
Reactive Programming
The criticisms of genetic programs are not without merit. Many appeals to them rest on vague metaphors, rarely clarified or justified. [CITE LATEST STUFF] As a result, a broad consensus has formed: despite its persistent common usage, the genome is not really a program—not in any sense worth defending.
The consensus suffers from some problems, however. The first is that what critics have rejected is a particular conception of a program—what Harel calls a transformative program, and I have called a batch program. This kind of program transforms an input into a final output.
But not all programs are like this. Some are designed not to produce a final result, but to regulate behaviour over time. These are reactive systems (Harel’s term) or interactive systems (in my terminology): they run continuously, respond to signals, and remain tightly coupled to their environment. Their behaviour is not fixed in advance—it emerges through ongoing interaction. TODO: Stopping behaviour.
Obvious versions of this are robotic systems, like the Roomba. But many familiar systems we use daily—phones, operating systems, word processors—work this way. They are not executing a plan from start to finish. They are waiting and responding to input—adapting their behaviour whilst coupled to some external environment (we are the environment in many cases).
Two recent papers revisit the genome-as-program metaphor. In A Roomful of Robovacs (Calcott 2020), I argue that genetic programs can be understood in terms of decentralized, real-time control. Capraru makes a complementary case in Making Sense of ‘Genetic Programs’ (Capraru 2024), where he shows that regulatory genes match closely to a programming model known as Post-Newell Production Systems. Capraru’s paper connects neatly with the idea of reactive programming, as the control flow in his model mimics the publish-subscribe pattern found in many reactive systems (Calcott, Balcan, and Hohenlohe 2008). Publish-subscribe systems are a key example of reactive programming: components communicate through events rather than direct calls.
These connections make it clear that we can model the genome as a program, but one that is reactive rather than transformative. So in the rest of this chapter, I will introduce a simple programming language called AND (Absolutely Not DNA), my attempt to join the nerdy tradition of self-referential naming in computer science.
Given the change in focus from producing something—like building an organism—to controlling behaviour, we arrive at a much simpler and more tractable claim. Rather than imagine the genome as executing a global construction plan, we can think of it as regulating what cells do, in context, over time.
This is the claim I want to explore.
The genome encodes a program for controlling cellular behaviour.
The language AND is my tool for doing so.
Representing behaviour
Before introducing the programming language, I want to talk about what the program does. The program controls behaviour, so we need a simple way to think about behaviour.
A dog hears a bell and salivates; a bacterium runs low on energy and begins tumbling; a Roomba arrives at the top of the stairs and backs away. These all have the same form: a stimulus, followed by a response. In each case, some change in the environment is perceived by an agent1, and this modifies what it does next. The examples above are one-shot events: a stimulus is presented, and the organism responds. If we allow behaviour to unfold over multiple time steps, we can capture much more complex and interesting patterns. An agent might respond differently to a sequence of stimuli depending on the order in which they are presented. Behaviour, then, is a sequence of stimuli and responses.
To model this, we need something more abstract than the sights, sounds, and actions that make up actual stimuli and responses. I will represent stimuli and responses as capital letters. For example, a dog hears a bell, which we represent as A, and salivates, which we represent as C.
```python
from minimal_epigenesis.table import CellTableBuilder, ActivityTableBuilder
import wdd2 as wd

st = wd.Stimulus.from_string("A#1")
gn = wd.Genome.from_string("BVBab+BVBac->C UNBca->D",
                           element_def=wd.ElementDef(2, 1, 1))
ctb = ActivityTableBuilder.from_stimulus(st, gn, 4)
ctb.to_markdown()
```
| time     | 0 | 1  | 2 | 3 |
|----------|---|----|---|---|
| stimulus | A | CD | C | C |
| response | A | CD | C | C |
For example, we might create a world that has just four elements {A, B, C, D}, where {A, B} are stimuli, and {C, D} are responses. The behaviour of any organism in this world is defined by how it responds to the possible stimuli. The possible stimuli are the subsets of {A, B}: {}, {A}, {B}, {A,B}. Similarly for responses, we would have {}, {C}, {D}, {C,D}.
The definitions are sets of elements, but it will help to simplify the notation. I will write sets as ordered letters: AB, rather than {A, B}, and use / for the empty set {}. We can define our set of elements thus: AB|CD, showing that the entire set of elements is partitioned into two disjoint (non-overlapping) subsets of stimulus and response.
Given any definition of elements, we can define behaviour in terms of stimuli and responses. A dog hears nothing /, and does nothing /. A dog hears a bell A, and responds by salivating, C.
The response of an organism is a subset of elements, disjoint from the stimulus set. So, given a stimulus set ABC, and a response set DEF, our organism might respond to AB with E, and to A with / (the empty set).
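These definitions translate directly into code. Here is a minimal Python sketch of that example world (the names and layout are my own illustration, not part of the AND language): stimuli and responses are frozensets, and an organism is just a lookup table from stimulus sets to response sets.

```python
# Illustrative sketch only: the ABC|DEF example world from the text.
STIMULI = frozenset("ABC")
RESPONSES = frozenset("DEF")

def parse(s):
    """Parse ordered-letter notation: 'AB' -> {A, B}; '/' -> empty set."""
    return frozenset() if s == "/" else frozenset(s)

def show(els):
    """Render a set of elements back to ordered-letter notation."""
    return "".join(sorted(els)) or "/"

# One possible organism, as a lookup table from stimulus sets to
# response sets (unlisted stimuli map to the empty response).
organism = {
    parse("AB"): parse("E"),
    parse("A"): parse("/"),
}

def respond(stimulus):
    assert stimulus <= STIMULI
    response = organism.get(stimulus, frozenset())
    # Responses are drawn from a set disjoint from the stimuli.
    assert response <= RESPONSES and response.isdisjoint(stimulus)
    return response

print(show(respond(parse("AB"))))  # E
print(show(respond(parse("A"))))   # /
```

The disjointness assertion enforces the constraint stated above: an organism's response set never overlaps its stimulus set.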
With this much, we can already represent the behaviour of an organism that knows how to integrate information from multiple stimuli.
But this is still a limited representation of behaviour, as it treats behaviour as a one-shot activity. If we extend it in time, allowing for sequences of stimuli and sequences of responses, we can capture behaviour that is far more interesting.
A stimulus sequence, then, is an ordered sequence of stimuli. For example, S_n = [A,A,AB,AB], represents the stimuli present over four time steps. It will sometimes be useful to refer to an individual entry by its time index, thus S_n{t=3} = AB.
A response sequence is, similarly, an ordered sequence of responses. But, in our simple world of discrete time, a response must follow a stimulus. An organism responds to the stimuli at time \(t\), by producing a response at \(t+1\).
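The one-step delay can be made concrete. In this sketch (again my own illustration, using a memoryless organism with the rule "respond C to any stimulus containing A"), the response at index t+1 is computed from the stimulus at index t, so a sequence of n stimuli yields n+1 response slots:

```python
def parse(s):
    return frozenset() if s == "/" else frozenset(s)

def show(els):
    return "".join(sorted(els)) or "/"

# A memoryless organism: respond C to any stimulus containing A.
def respond(stimulus):
    return frozenset("C") if "A" in stimulus else frozenset()

def run(stimulus_seq):
    """R[t+1] answers S[t]; R[0] is empty, as nothing has happened yet."""
    responses = [frozenset()]
    for s in stimulus_seq:
        responses.append(respond(s))
    return responses

S = [parse(x) for x in "A,A,B,B".split(",")]
print([show(r) for r in run(S)])  # ['/', 'C', 'C', '/', '/']
```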
Thus, our world is Markovian. Does this matter?
As the series above suggests, in this model the organism's response takes place between time steps. If we see a response at some time step, its cause must lie in a previous time step.
It doesn’t look like we’ve added much: we have simply put all of our stimulus–response pairs in a line. This changes, however, if we assume our organism has some kind of memory—that it retains information about previous time steps.
This opens up the possibility of streams of stimuli that elicit different responses, depending on what information has been retained. This is the beginning of learning. We can subject our organism to multiple stimulus sequences and see how it responds. Imagine we have trained our dog to salivate when it hears a bell followed by a horn. If the horn sounds first, followed by the bell, it does not salivate. Here A = bell and B = horn.
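The bell-and-horn example can be sketched as a tiny stateful organism (my own illustration, with salivation arbitrarily written as the response C). The single bit of memory records whether the previous stimulus contained the bell (A); the organism only salivates when the horn (B) follows the bell:

```python
def run(stimulus_seq):
    """A one-bit-memory organism: respond C when B follows A."""
    responses = ["/"]   # nothing at t=0; responses lag stimuli by one step
    heard_bell = False
    for s in stimulus_seq:
        responses.append("C" if ("B" in s and heard_bell) else "/")
        heard_bell = "A" in s   # remember only the previous step
    return responses

print(run(["A", "B"]))  # ['/', '/', 'C']  bell then horn: salivate
print(run(["B", "A"]))  # ['/', '/', '/']  horn then bell: nothing
```

The same stimuli in a different order produce different behaviour, which is exactly what a memoryless lookup table cannot do.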
Activity Sequences
1. /,A,B,C
2. /,B,A,/
The first input stream has four values, separated by commas. Each value is a set showing what is present at that time, where each element is written as a single capital Roman letter. In the first step, nothing is present (indicated by the symbol for the empty set, /); in the next step, A is present; then B; and finally C. Streams of input can be of various (even infinite) length.
We want our organism to respond to these stimuli by producing some kind of output. A response, like the stimulus, will consist of a stream of time-indexed responses to the incoming stimuli. For example, our organism might respond to the stimuli in the following way:
S = A,A,B,B,AB,AB
R = /,/,/,/,C,C
Rather than tracking stimulus and response separately, we can combine the two, as they come from disjoint sets. I’ll call this combination the Activity—the set of all elements present at a particular time step. An Activity stream captures much of what is happening in our tiny world.
Notice that the response comes in the time step after the stimulus. So when we look at the total set of things that are active in any time step, we have the stimuli from the world, and the response to the stimulus in the previous time step.
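Combining the two streams is then just a per-step union. A sketch (my own notation, reusing the S and R streams just given): because stimulus and response sets are disjoint, the union loses no information.

```python
def parse(s):
    return frozenset() if s == "/" else frozenset(s)

def show(els):
    return "".join(sorted(els)) or "/"

S = [parse(x) for x in "A,A,B,B,AB,AB".split(",")]
R = [parse(x) for x in "/,/,/,/,C,C".split(",")]

# Activity at each step: the stimuli arriving now, plus the response
# (if any) to the stimuli of the previous step.
activity = [show(s | r) for s, r in zip(S, R)]
print(",".join(activity))  # A,A,B,B,ABC,ABC
```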
We can now capture the behaviour of an organism: it is one or more activity sequences. Each contains both the stimuli presented and how the organism responds to them.
With a little work, this gives us the beginnings of a way to judge how well an organism performs a particular task: we prescribe the responses we want to see. First, however, we need a way to encode an organism's behaviour.
Encoding Behaviour
TODO: Sidebar here on gene regulation. Or perhaps something in the appendix. Or maybe a nutshell.
So far, I’ve described a world that consists of a series of activities, where the activity at any time step is a subset of all possible stimuli (from the environment) and all possible responses (from the organism).
I haven’t said how our organism generates a response to a stimulus. This is where we need an encoding. Recall what an encoding is: it is not a complete description of how something works. Rather, it is a way of externalising control over a system, so that the system can be manipulated. So we need to provide an encoding, and a way to interpret that encoding, so that it produces the behaviour described in the last section. We’ll start simple, and build up to the full encoding.
This will take a few steps. The process may seem long-winded, but there are several distinctions that are important, and building the encoding slowly will allow me to talk about each of them.
Consider, first, what our encoding must do: it needs to describe how stimuli from the world are wired up to particular responses. And this description needs to be manipulable, so that we can change how this “wiring up” is done. Specifically:
it must capture which stimuli the organism is sensitive to, including both their presence and their absence,
it must be able to designate which response is produced, and
it must be able to integrate multiple stimuli.
The last condition is crucial. An organism needs to respond to more than one stimulus, and it also needs to integrate the status of these multiple stimuli.
What does our encoding look like? Here, I will draw inspiration from simple representations of gene regulatory interactions (cite Buchler). A gene is transcribed, and a protein produced, in the presence of certain other factors. The regulatory regions consist of several cis-regulatory modules.
The schema for behaviour I presented above is very general—the elements could be behaviour at any level. But now we can give a simple gloss to the elements: they are molecular structures (mostly proteins), that are present for one time step. Our “organism” is a single cell, responding to the local environment. The environment contains certain molecules (A, B), and our cell responds to them by producing other molecules (C, D).
Our encoding is going to be a string of symbols. You can think of it as “DNA” of our tiny world. Here is a first try at describing a gene.
```
ab->C
```
Notice the lower-case a. You can think of it as a detector of the element A. Or, keeping to our gene regulatory context, a is a cis-regulatory motif: a stretch of DNA that the element A binds to.
We can tell a story about how this works. If element A is present at time \(t\), it binds to the motif a; likewise, B binds to b. This switches on the gene, and the element C gets transcribed in the next time step \(t+1\).
Thus, for this particular gene to be switched on, proteins A and B must be present. They bind cooperatively, switching the gene on, and when it is on, the gene expresses a third protein, C.
What is missing is the logic of how the binding at these two motifs interacts. We need to encode that too. There are a few obvious possibilities: C is transcribed when both A and B are present, when either A or B is present, or when A is present but B is not. Many of these combinations have an obvious physico-chemical basis: one bound factor might, for example, be knocked off by another.
The actual molecular interactions are incredibly complex. We’re not going to try and model that, though there is evidence that these interactions can be summarized usefully as simpler Boolean interactions (see xxx, yyy, but also the one that shows it is more complex). Rather, I will introduce a simple code for the binding interaction: a three-letter mnemonic. Here is how we write our “DNA” now:
```
BVB ab -> C
BNB ab -> C
BNX ab -> C
```
TODO: Explain this.
With only two elements, there are four possible input states: {/, A, B, AB}. So there are a total of sixteen ways that these motifs could interact.
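The count of sixteen is easy to verify: a binding operation assigns on/off to each of the four input states, giving 2^4 = 16 possible operations. A quick enumeration:

```python
from itertools import product

# The four input states for two motifs: (is a bound?, is b bound?).
states = list(product([False, True], repeat=2))

# A binding operation is any assignment of on/off to those four states.
operations = list(product([False, True], repeat=len(states)))

print(len(states), len(operations))  # 4 16
```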
```
BNBab->C
|  ||
|  |+- right motif (b)
|  +-- left motif (a)
+----- three-character code for the binding operation
```
Multiple Regulatory Modules
So far, our encoding allows a gene to be controlled by two elements. This is limiting, and actual genetic control can involve many more upstream inputs. One way to achieve this is to have more complex binding interactions. But another way it occurs is through multiple cis-regulatory modules. A module is a binding site, but multiple modules interact not at the binding site itself, but with one another, and they can be distant from each other.
TODO: Reproduce a picture of this kind of binding interaction.
We can extend the model in a similar way, with multiple modules controlling a single gene. Here is how we represent a single gene with multiple modules.
```
BNBab BVBcd -> E
```
We now encounter the same issue we did with multiple binding sites: how do these modules interact? A simplified reading is that modules are either independent or cooperative. I represent these two kinds of interaction with a + for independent and a * for cooperative. Our complete syntax for encoding a gene is now:
```
BNBab + BVBcd * UNBac -> E
```
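To make the syntax concrete, here is a sketch of a parser and evaluator for strings of this shape. Two things are assumptions of mine, not definitions from the model: the Boolean meanings of the three-letter codes (I read BVB as "either motif bound", BNB as "left bound, right not", UNB as "neither bound", purely as placeholders), and the rule that * (cooperative) binds tighter than + (independent), by analogy with arithmetic.

```python
# Hypothetical semantics for the binding codes (placeholders only).
OPS = {
    "BVB": lambda a, b: a or b,            # either motif bound
    "BNB": lambda a, b: a and not b,       # left bound, right not
    "UNB": lambda a, b: not a and not b,   # neither bound
}

def eval_gene(gene, present):
    """Evaluate e.g. 'BNBab + BVBcd * UNBac -> E' against present elements.

    Assumes '*' binds tighter than '+': AND within a term, OR across terms."""
    lhs, product = (part.strip() for part in gene.split("->"))

    def module_on(module):
        code, motifs = module[:3], module[3:]
        bound = [m.upper() in present for m in motifs]
        return OPS[code](*bound)

    on = any(
        all(module_on(m.strip()) for m in term.split("*"))
        for term in lhs.split("+")
    )
    return product if on else None

# With A and D present, the BNBab module fires (a bound, b not bound),
# so the gene produces E.
print(eval_gene("BNBab + BVBcd * UNBac -> E", {"A", "D"}))  # E
```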
Gene Regulation vs Regulatory Genes
The encoding described so far is capable of generating some interesting behaviour. Simply by allowing multiple regulatory modules, we can produce a gene that is capable of integrating large numbers of XX. The following gene, for example:
complex example.
XXXXX
But there is a crucial piece missing from our model. For although we have gene regulation, we do not yet have regulatory genes. That is, we have genes that produce responses, and they do so in response to incoming stimuli from the world. But we do not yet have genes whose product is a signal to another gene. The distinction is an important one, and I will return to it presently, once it is clear how we can fix the model.
Recall how I defined the world. There are stimuli that come from the environment, and there are responses that our organism can generate. I defined this by partitioning the elements in our tiny world into two subsets: AB|CD, for example. Our encoding defines genes that produce responses of both kinds: thus XXXXX.
To fix this, we have to introduce a new category of elements, changing how we define our world. Before, we had two types of elements, divided into the sets AB|CD. Stimuli are generated by the environment and can affect how an organism behaves; responses are generated by the organism. We now introduce a third type of element: a regulatory element.
The next step is to turn this set of genes into an adaptive response to an environment. To begin with, we'll divide our genes into two types: regulatory and structural. Correspondingly, we get two kinds of products: regulatory products and structural products. Some genes produce proteins whose only role is to turn other genes on or off; they are built to bind to the DNA. But that isn't all that genes produce. Other proteins go on to do work. The distinction is between control and work. It is a bit like a car: the engine and the wheels are essential for driving, but much of the machinery exists to make sure these energetic tasks are done at the right time and in the right way.
The importance of regulatory genes
The fact that the model separates the two is not because I think these things come apart so cleanly in the real world. But the distinction makes a difference to how we think about genes.
Finally, let us add one further thing: a product that can regulate a gene, but is neither structural nor regulatory. Rather, it comes from some external source.
I am telling a small lie here; this is another simplification. Genes are affected by the presence of various external factors other than gene products, including co-factors. These don't bind DNA directly, but they do bind transcription factors, changing their shape. The model captures the “spirit” of this interaction rather than its precise nature. That is, the presence or absence of certain environmental factors makes up part of the combinatorial equation that turns a gene on or off.
We now have a very natural way of thinking about our set of genes: they support an input/output device. Environmental products are inputs to the system. Structural products are outputs from the system. Regulatory products are intermediaries that serve to control the system.
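This three-way division suggests a simple simulation loop. In the sketch below (all names and the particular wiring are my own illustration, using a deliberately simplified gene format of "required inputs -> product" rather than the full module syntax), environmental elements arrive from outside at each step, regulatory products feed back into the next step's inputs, and structural products are the visible output:

```python
# Toy element classes: E* environmental, R* regulatory, S* structural.
# Genes here are (required_inputs, product) pairs -- a simplification.
GENES = [
    ({"E1"}, "R1"),        # an environmental signal switches on a regulator
    ({"R1"}, "S1"),        # the regulator switches on a structural gene
    ({"E2", "R1"}, "S2"),  # a structural gene needing signal + regulator
]

def step(active):
    """One time step: every gene whose inputs are all active fires."""
    return {product for needs, product in GENES if needs <= active}

def run(env_stream):
    """Feed regulatory products back in; report structural output per step."""
    internal = set()
    outputs = []
    for env in env_stream:
        produced = step(env | internal)
        internal = {p for p in produced if p.startswith("R")}
        outputs.append(sorted(p for p in produced if p.startswith("S")))
    return outputs

# E1 at t=0 produces the regulator R1, which (with E2 at t=1) yields
# the structural products S1 and S2 one step later.
print(run([{"E1"}, {"E2"}, set()]))  # [[], ['S1', 'S2'], []]
```

Note that the environment only ever touches the outputs via the regulatory layer: exactly the intermediary role described above.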
Importantly, we now have an input/output relationship that we can target. Rather than targeting the output of all genes, the regulatory genes are the means to achieve some output; what they do in between doesn't matter. This separation of ends and means is key, for it opens up a path to differentiating implementation from interface. The implementation can change over time.
OBJECTION: This is just a rehash of boolean nets, or some other kind of network, like Kauffman nets.
Yes, it is similar to boolean nets. But there are some important differences.
Any Relation to Real Genes?
Real gene regulation is far more complex than this. The shape of the regulatory proteins themselves matters, and this may be changed elsewhere (cite Payne Wagner). The state of the DNA matters too (whether it is unravelled, and so on). A number of transcripts can be produced from one gene, and different proteins made from these same transcripts. And the whole thing is not deterministic: binding increases the probability of transcription.
We are going to presume that control over a gene is encoded in its stretch of DNA. This is, in good part, true. The binding regions tell us what proteins can affect transcription. The relative location of these regions encodes how the binding interactions occur. And the coding region tells us what protein will be produced when the gene is turned on.
We have reasons to believe that some of these simplifications are okay. Deterministic models can approximate the behaviour (cite), and there is some evidence that such models can reproduce and even predict it.
But the model retains some interesting features. What we have is more explicit, and even with this simple start, we can capture:
what proteins are able to bind to the regulatory region of the gene,
what combinations of those proteins determine whether the gene switches on or off,
what protein is subsequently produced when the gene is switched on, and
how mutations to the regulatory regions (the first two points) might change when the gene is switched on.
It also introduces a time scale that is inherent in the encoding, rather than imposed from above.
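The last point in the list, mutation to regulatory regions, can be illustrated directly on the string encoding. A sketch (my own, using an assumed simplified reading in which a gene like 'ab->C' fires only when all its listed motifs are bound):

```python
def fires(gene, present):
    """Simplified reading: 'ab->C' produces C when both A and B are present."""
    motifs, product = gene.split("->")
    return product if all(m.upper() in present for m in motifs) else None

original = "ab->C"
mutant = original.replace("b", "d")  # a point mutation in a regulatory motif

env = {"A", "B"}
print(fires(original, env))  # C    -- the wild type responds to A and B
print(fires(mutant, env))    # None -- the mutant now requires D, not B
```

A single-character change to the encoding alters when the gene switches on, without touching what it produces.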
Implications of the Model
There is a lot we can do. But even this much challenges many of the common assumptions in how we think about genes.
Causation
Developmental systems – the phenotype as the life cycle.
Dichotomous thinking.
Connections to existing ideas.
Causation
Here I want to describe how this model changes the way we think about genes, environments, and the way that genes have causal control.
First, the model describes a simple causal mechanism that relates a set of causes, A and B, to an effect, C. The equation tells us what would happen to the effect under any changes to the causes. This appeal to causal structure is implicit in work on actual gene regulatory networks. Understanding the conditions that make genes turn on and off is essential to constructing the kinds of networks that play a role in many explanations of development, and much of the work done in the field is, in effect, doing this [cite article by davidson].
Revisiting some assumptions
Let us see how this changes our thinking about causal patterns. What are genes? What do they produce? What does our DNA encode?
A chief difference from standard boolean nets is that we get multiple outputs. This is nothing special; plenty of models do this. Nonetheless, I want to emphasise that an important shift in our thinking takes place when we move away from seeing genes as generators of a particular outcome. These are conditional, plastic systems that can react to multiple conditions by expressing different products. And they can evolve to do this.
Let’s look at this pattern of explanation. What is being encoded? It is more than the shape of a protein. It is the timing and conditional onset of internal reactions.
How can computers learn to solve problems without being explicitly programmed? In other words, how can computers be made to do what is needed to be done, without being told exactly how to do it? One impediment to getting computers to solve problems without being explicitly programmed is that existing methods of machine learning, artificial intelligence, self-improving systems, self-organizing systems, neural networks, and induction do not seek solutions in the form of computer programs. Instead, existing paradigms involve specialized structures which are nothing like computer programs (e.g., weight vectors for neural networks, decision trees, formal grammars, frames, conceptual clusters, coefficients for polynomials, production rules, chromosome strings in the conventional genetic algorithm, and concept sets). Each of these specialized structures can facilitate the solution of certain problems, and many of them facilitate mathematical analysis that might not otherwise be possible. However, these specialized structures are an unnatural and constraining way of getting computers to solve problems without being explicitly programmed. Human programmers do not regard these specialized structures as having the flexibility necessary for programming computers, as evidenced by the fact that computers are not commonly programmed in the language of weight vectors, decision trees, formal grammars, frames, schemata, conceptual clusters, polynomial coefficients, production rules, chromosome strings, or concept sets. (Koza 1992, p1)
Dawkins’ Biomorph model was inspired by multicellular forms. The phenotypes were two-dimensional drawings, producing analogs of animal shapes. And Dawkins’ description of the process was “development”. Other clues: Weismann. His own text about recursive local stuff.
The generation and evolution of multicellularity is precisely what this website is about. But I want to take a step back in time before attempting this. The history of evolution is largely about the evolution of single cells. Cells don’t “develop”, but they still have phenotypes — traits that can vary.
We’re heading toward a model of gene regulation, but at this point, I want to state this case very generally, for it will help connect these very basic ideas about behaviour to broader, more complex thinking about “cognition” at higher levels.
Following the previous chapter, this encoding will show how behaviour across generations is both stable and labile. It will do more than that, as we shall see. We will get fine-grained control over behaviour. This has important implications for how we think about the difference between environments as causes and genes as causes. All of this will be clearer with the model in hand.
References
Calcott, Brett. 2020. “A Roomful of Robovacs: How to Think About Genetic Programs.” In Philosophical Perspectives on the Engineering Approach in Biology: Living Machines?, edited by Sune Holm and Maria Serban. Routledge. https://doi.org/10.4324/9781351212243.
Capraru, Mihnea. 2024. “Making Sense of ‘Genetic Programs’: Biomolecular Post–Newell Production Systems.” Biology & Philosophy 39 (2): 6. https://doi.org/10.1007/s10539-024-09943-3.
Dawkins, Richard. 1989. “The Evolution of Evolvability.” In Artificial Life. Routledge.
Koza, John R. 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Complex Adaptive Systems. Cambridge, Mass: MIT Press.
Wagner, Gunter P., and Lee Altenberg. 1996. “Complex Adaptations and the Evolution of Evolvability.” Evolution 50 (3): 967–76. https://doi.org/10.2307/2410639.
Footnotes
“Agent” here is a neutral way of capturing something XXX without presuming it is an organism. I’m not yet↩︎