computers and Artificial Intelligence, we had established institutions designed to reason systematically about human behavior — the court. The legal system is one of humanity’s oldest reasoning engines, where facts and evidence are taken as input, relevant laws are used as reasoning rules, and verdicts are the system’s output. The laws, however, have been continuously evolving since the very beginning of human civilization. The earliest codified law – the Code of Hammurabi (circa 1750 BCE) – represents one of the first large-scale attempts to formalize moral and social reasoning into explicit symbolic rules. Its elegance lies in clarity and uniformity — yet it is also rigid, incapable of adapting to context. Centuries later, Common Law traditions, like those shaped by the case of Donoghue v Stevenson (1932), introduced the opposite philosophy: reasoning grounded in precedent and cases. Today’s legal systems, as we know, are usually a blend of both, though the proportions vary across countries.
In contrast to the cohesive blend found in legal systems, a similar pair of paradigms in AI — Symbolism and Connectionism — seems considerably harder to unite. The latter has dominated the recent surge of AI development, where everything is learned implicitly from enormous amounts of data and compute and encoded across the parameters of neural networks. And this path has indeed proven very effective in terms of benchmark performance. So, do we really need a symbolic component in our AI systems?
Symbolic Systems vs. Neural Networks: A Knowledge Compression Perspective
To answer the question above, we need to take a closer look at both systems. From a computational standpoint, both symbolic systems and neural networks can be viewed as compression machines — they reduce the vast complexity of the world into compact representations that enable reasoning, prediction, and control. Yet they do so through fundamentally different mechanisms, guided by opposite philosophies of what it means to “understand”.
In essence, both paradigms can be imagined as filters applied to raw reality. Given input \(X\), each learns or defines a transformation \(H(\cdot)\) that yields a compressed representation \(Y = H(X)\), preserving the information it considers meaningful and discarding the rest. But the shape of this filtering is different. Generally speaking, symbolic systems behave like high-pass filters — they extract the sharp, rule-defining contours of the world while ignoring its smooth gradients. Neural networks, by contrast, resemble low-pass filters, smoothing out local fluctuations to capture global structure. The difference is not in what they see, but in what they choose to forget.
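As a toy illustration of this filtering analogy (a purely illustrative numpy sketch, not a claim about either paradigm's internals), consider a signal containing both a slow trend and sharp steps: a moving-average "low-pass" view keeps the trend, while the residual "high-pass" view keeps the step boundaries.

```python
import numpy as np

# Toy signal: a slow trend (what a "low-pass" learner keeps)
# plus sharp rule-like steps (what a "high-pass" learner keeps).
t = np.linspace(0, 1, 200)
trend = np.sin(2 * np.pi * t)                               # smooth global structure
steps = (t > 0.3).astype(float) - (t > 0.7).astype(float)   # crisp boundaries
x = trend + steps

def low_pass(x, k=15):
    """Moving average: smooths local fluctuations, keeps global shape."""
    kernel = np.ones(k) / k
    return np.convolve(x, kernel, mode="same")

y_low = low_pass(x)    # "neural" view: blurry but defined everywhere
y_high = x - y_low     # "symbolic" view: sharp contours, little else

print("energy kept by low-pass :", round(float(np.sum(y_low**2) / np.sum(x**2)), 3))
print("energy kept by high-pass:", round(float(np.sum(y_high**2) / np.sum(x**2)), 3))
```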
Symbolic systems compress by discretization. They carve the continuous fabric of experience into distinct categories, relations, and rules: a legal code, a grammar, or an ontology. Each symbol acts as a crisp boundary, a handle for manipulation within a pre-defined schema. The process resembles projecting a noisy signal onto a set of human-designed basis vectors — a space spanned by concepts such as Entity and Relation. A knowledge graph, for instance, might read the sentence “UIUC is a great university and I love it”, and retain only (UIUC, is_a, Institution), discarding everything that falls outside its schema. The result is clarity and composability, but also rigidity: meaning outside the ontological frame simply evaporates.
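A minimal sketch of this kind of schema filtering (the schema, matcher, and sentence handling below are hypothetical and deliberately naive) might look like:

```python
from dataclasses import dataclass

# Hypothetical, hand-written schema: the only things this "filter" can keep.
SCHEMA_TYPES = {"UIUC": "Institution", "Hammurabi": "Person"}

@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str

def extract_triples(sentence: str) -> list:
    """Project a sentence onto the schema: keep typed entities, drop the rest."""
    triples = []
    for token in sentence.replace(",", " ").split():
        if token in SCHEMA_TYPES:
            triples.append(Triple(token, "is_a", SCHEMA_TYPES[token]))
    return triples

print(extract_triples("UIUC is a great university and I love it"))
# -> [Triple(head='UIUC', relation='is_a', tail='Institution')]
# The sentiment ("I love it") falls outside the schema and simply evaporates.
```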
Neural networks, in contrast, compress by smoothing. They forgo discrete categories in favor of smooth manifolds where nearby inputs yield similar activations (usually bounded by some Lipschitz constant in modern LLMs). Rather than mapping data to predefined coordinates, they learn a latent geometry that encodes correlations implicitly. The world, in this view, is not a set of rules but a field of gradients. This makes neural representations remarkably adaptive: they can interpolate, analogize, and generalize across unseen examples. But the same smoothness that grants flexibility also breeds opacity. Knowledge becomes entangled, semantics become distributed, and interpretability is lost in the very act of generalization.
| Property | Symbolic Systems | Neural Networks |
|---|---|---|
| Surviving Knowledge | Discrete, schema-defined facts | Common, continuous statistical patterns |
| Source of Abstraction | Human-defined ontology | Data-driven manifold |
| Robustness | Brittle at rule edges | Locally robust but globally fuzzy |
| Error Mode | Missed facts (coverage gaps) | Smoothed facts (hallucinations) |
| Interpretability | High | Low |
In conclusion, we can summarize the difference between the two systems from the knowledge compression perspective in a single sentence: “Neural networks are blurry pictures of the world, whereas symbolic systems are high-resolution images with missing patches.” This also indicates why neuro-symbolic systems are an art of compromise: they can harness knowledge from both paradigms by using them collaboratively at different scales, with neural networks providing a global, low-resolution backbone and symbolic components supplying high-resolution local details.
The Challenge of Scalability
Although it is very tempting to add symbolic components to neural networks to harness the benefits of both, scalability is a major obstacle standing in the way of such attempts, especially in the era of Foundation Models. Traditional neuro-symbolic systems rely on a set of expert-defined ontologies / schemas / symbols, which is assumed to cover all possible input cases. This is acceptable for domain-specific systems (for example, a pizza-ordering chatbot); however, you cannot apply similar approaches to open-domain systems, where you would need experts to construct trillions of symbols and their relations.
A natural response is to go fully data-driven: instead of asking humans to handcraft an ontology, we let the model induce its own “symbols” from internal activations. Sparse autoencoders (SAEs) are a prominent incarnation of this idea. By factorizing hidden states into a large set of sparse features, they appear to give us a dictionary of neural concepts: each feature fires on a particular pattern, is (often) human-interpretable, and behaves like a discrete unit that can be turned on or off. At first glance, this looks like a perfect escape from the expert bottleneck: we no longer design the symbol set; we learn it.
A typical SAE objective takes the form
\[
\mathcal{L}(h) = \lVert h - D z \rVert_2^2 + \lambda \lVert z \rVert_1, \qquad z = \mathrm{ReLU}(W_{\mathrm{enc}} h + b_{\mathrm{enc}}).
\]
Here \(D\) is called the dictionary matrix, where each column stores a semantically meaningful concept; the first term is the reconstruction loss of the hidden state \(h\), while the second is a sparsity penalty encouraging as few activated neurons in the code \(z\) as possible.
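A minimal PyTorch sketch of this objective (an illustrative implementation with arbitrary sizes, not the exact architecture or hyperparameters of the referenced SAE work) could look like:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: encode a hidden state into a sparse code z, decode with dictionary D."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)               # W_enc, b_enc
        self.decoder = nn.Linear(d_dict, d_model, bias=False)   # columns of D

    def forward(self, h: torch.Tensor):
        z = torch.relu(self.encoder(h))   # sparse (mostly zero) concept code
        h_hat = self.decoder(z)           # reconstruction D z
        return h_hat, z

def sae_loss(h, h_hat, z, lam: float = 1e-3):
    recon = (h - h_hat).pow(2).sum(dim=-1).mean()   # || h - D z ||^2
    sparsity = z.abs().sum(dim=-1).mean()           # || z ||_1
    return recon + lam * sparsity

# Usage: hidden states of width 512, an 8x overcomplete dictionary.
sae = SparseAutoencoder(d_model=512, d_dict=4096)
h = torch.randn(32, 512)                 # a batch of residual-stream activations
h_hat, z = sae(h)
loss = sae_loss(h, h_hat, z)
loss.backward()
```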

However, an SAE-only approach runs into two fundamental issues. The first is computational: using SAEs as a live symbolic layer would require multiplying every hidden state by an enormous dictionary matrix, paying a dense computation cost even when the resulting code is sparse. This makes them impractical to deploy at Foundation Model scale. The second is conceptual: SAE features are symbol-like representations, but they are not a symbolic system — they lack an explicit formal language, compositional operators, and executable rules. They tell us what concepts exist in the model’s latent space, but not how to reason with them.
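To get a rough sense of the computational point (back-of-the-envelope arithmetic with hypothetical sizes, not measurements of any particular model):

```python
# Back-of-the-envelope cost of using an SAE as a live symbolic layer.
# Hypothetical sizes, chosen only for illustration.
d_model = 4096          # hidden state width of the base model
d_dict = 16 * d_model   # a 16x overcomplete dictionary (65,536 features)
n_layers = 32           # if an SAE is attached at every layer
seq_len = 2048          # tokens processed per forward pass

flops_per_token_per_layer = 2 * d_model * d_dict      # dense encoder matmul
total_flops = flops_per_token_per_layer * n_layers * seq_len
print(f"{total_flops / 1e12:.1f} TFLOPs per forward pass just for SAE encoding")
# ~35 TFLOPs here, paid even though the resulting code is ~99% zeros.
```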
This doesn’t mean we should abandon SAEs altogether — they provide ingredients, not a finished meal. Rather than asking SAEs to be the symbolic system, we can treat them as a bridge between the model’s internal concept space and the many symbolic artefacts we already have: knowledge graphs, ontologies, rule bases, and taxonomies, where reasoning can happen by definition. A high-quality SAE trained on a large model’s hidden states then becomes a shared “concept coordinate system”: different symbolic systems can be aligned within this coordinate system by associating their symbols with the SAE features that are consistently activated when those symbols are invoked in context.
Doing this has several advantages over simply placing symbolic systems side by side and querying them independently. First, it enables symbol merging and aliasing across systems: if two symbols from different formalisms repeatedly light up almost the same set of SAE features, we have strong evidence that they correspond to the same underlying neural concept, and they can be linked or even unified. Second, it supports cross-system relation discovery: symbols that are far apart in our hand-designed schemas but consistently close in SAE space point to bridges we didn’t encode — new relations, abstractions, or mappings between domains. Third, SAE activations give us a model-centric notion of salience: symbols that never find a clear counterpart in the neural concept space are candidates for pruning or refactoring, while strong SAE features with no matching symbol in any system highlight blind spots shared by all of our existing abstractions.
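A sketch of how this alignment could be implemented (the `sae_encode` stub, probing sentences, and similarity threshold are all hypothetical; a real pipeline would encode actual hidden states with a trained SAE):

```python
import numpy as np

def sae_encode(text: str) -> np.ndarray:
    """Hypothetical helper: run the base model on `text`, encode its hidden state
    with a trained SAE, and return the sparse code.
    (Stubbed with pseudo-random codes so the sketch runs standalone.)"""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    z = rng.random(4096)
    z[z < 0.99] = 0.0                      # keep roughly 1% of features active
    return z

def symbol_signature(contexts: list) -> np.ndarray:
    """Aggregate SAE codes over contexts where a symbol is 'in play'."""
    return np.mean([sae_encode(c) for c in contexts], axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Symbols from two different formalisms, each probed with its own contexts.
legal_duty_of_care = symbol_signature([
    "The manufacturer owed the consumer a duty of care.",
    "A duty of care arises between doctor and patient.",
])
ontology_negligence = symbol_signature([
    "The driver's negligence caused the injury.",
    "Negligence requires a breach of a reasonable standard of conduct.",
])

# High overlap in SAE space is evidence the two symbols alias the same concept.
if cosine(legal_duty_of_care, ontology_negligence) > 0.8:   # hypothetical threshold
    print("Candidate for merging / aliasing across systems")
```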
Crucially, this use of SAEs remains scalable. The expensive SAE is trained offline, and the symbolic systems themselves don’t need to grow to “Foundation Model size” — they can remain as small or as large as their respective tasks require. At inference time, the neural network continues to do the heavy lifting in its continuous latent space; the symbolic artefacts only shape, constrain, or audit behaviour at the points where explicit structure and accountability are most valuable. SAEs help by tying all these heterogeneous symbolic views back to a single learned conceptual map of the model, making it possible to compare, merge, and improve them without ever constructing a monolithic, expert-designed symbolic twin.
When Can an SAE Serve as a Symbolic Bridge?
The picture above quietly assumes that our SAE is “good enough” to serve as a meaningful coordinate system. What does that actually require? We don’t need perfection, nor do we need the SAE to outperform human symbolic systems on every axis. Instead, we need a few more modest but crucial properties:
– Semantic Continuity: Inputs that express the same underlying concept should induce similar support patterns in the sparse code: the same subset of SAE features should tend to be non-zero, rather than flickering on and off under small paraphrases or context shifts. In other words, semantic equivalence should be reflected in a stable pattern of active concepts.
– Partial Interpretability: We don’t need to understand every feature, but a nontrivial fraction of them should admit robust human descriptions, so that merging and debugging are possible at the concept level.
– Behavioral Relevance: The features that the SAE discovers must actually matter for the model’s outputs: intervening on them, or conditioning on their presence, should change or predict the model’s decisions in systematic ways.
– Capacity and Grounding: An SAE can only refactor whatever structure already exists in the base model; it cannot conjure rich concepts out of a weak backbone. For the “concept coordinate system” picture to make sense, the base model itself must be large and well-trained enough that its hidden states already encode a diverse, non-trivial set of abstractions. Meanwhile, the SAE must have sufficient dimensionality and overcompleteness: if the code space is too small, many distinct concepts will be forced to share the same features, leading to entangled and unstable representations.
Now we discuss the first three properties in detail.
Semantic Continuity
At the level of pure function approximation, a deep neural network with ReLU- or GELU-type activations implements a Lipschitz-continuous map: small perturbations of the input cannot cause arbitrarily large jumps in the output logits. But this kind of continuity is very different from what we need in a sparse autoencoder. For the base model, a few neurons flipping on or off can easily be absorbed by downstream layers and redundancy; as long as the final logits change smoothly, we are happy.
In an SAE, by contrast, we are not just looking at a smooth output — we are treating the support pattern of the sparse code reconstructed over the residual stream as a proto-symbolic object. A “concept” is identified with a particular subset of the code being active. That makes the geometry much more brittle: if a small change in the underlying representation pushes a pre-activation across the ReLU threshold in the SAE layer, a neuron in the code will suddenly flip from off to on (or vice versa), and from the symbolic standpoint the concept has appeared or disappeared. There is no downstream network to average this out; the code itself is the representation we care about.
The sparsity penalty used in constructing the SAE exacerbates this further. The standard SAE objective combines a reconstruction loss with an \(\ell_1\) penalty on the activations, which explicitly encourages most neuron values to be as close to zero as possible. As a result, even many useful neurons end up sitting near the activation boundary: just above zero when they are needed, just below zero when they are not — this is known as “activation shrinkage” in SAEs. This is bad for semantic continuity at the support-pattern level: tiny perturbations of the input can change which neurons are non-zero, even when the underlying meaning has barely changed. Therefore, Lipschitz continuity of the base model does not automatically give us a stable non-zero subset of the code in SAE space, and support-level stability needs to be treated as a separate design goal and evaluated explicitly.
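One way to evaluate support-level stability explicitly (a sketch reusing the same kind of hypothetical `sae_encode` stub as above; the paraphrase set is illustrative) is to measure the overlap of active-feature sets across paraphrases:

```python
import numpy as np

def sae_encode(text: str) -> np.ndarray:
    """Hypothetical SAE encoder stub (stands in for a trained SAE on real activations)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    z = rng.random(4096)
    z[z < 0.99] = 0.0
    return z

def support(z: np.ndarray, eps: float = 1e-6) -> set:
    """Indices of active features: the proto-symbolic 'support pattern'."""
    return set(np.flatnonzero(np.abs(z) > eps).tolist())

def jaccard(a: set, b: set) -> float:
    return len(a & b) / max(len(a | b), 1)

paraphrases = [
    "The manufacturer owed the consumer a duty of care.",
    "A duty of care was owed by the manufacturer to the consumer.",
    "Manufacturers must take reasonable care toward consumers.",
]

# Pairwise overlap of support sets: values near 1.0 mean the concept's active
# feature set is stable; low values mean the code flickers under paraphrase.
supports = [support(sae_encode(s)) for s in paraphrases]
pairs = [(i, j) for i in range(len(supports)) for j in range(i + 1, len(supports))]
scores = [jaccard(supports[i], supports[j]) for i, j in pairs]
print("mean support-pattern stability:", round(sum(scores) / len(scores), 3))
```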
Partial Interpretability
An SAE defines an overcomplete dictionary to store possible features learned from data. Therefore, we only need a subset of those dictionary entries to be interpretable features. Even for that subset, the meanings of the features are only required to be approximately accurate. When we align existing symbols to the SAE space, it is the activation patterns in the SAE layer that we rely on: we probe the model in contexts where a symbol is “in play”, record the resulting sparse codes, and use the aggregated code as an embedding for that symbol. Symbols from different systems whose embeddings are close can be linked or merged, even if we never assign human-readable semantics to every individual feature.
Interpretable features then play a more focused role: they provide human-facing anchors within this activation geometry. If a particular feature has a reasonably accurate description, all symbols that load heavily on it inherit a shared semantic hint (e.g. “these are all duty-of-care-like concerns”), making it easier to inspect, debug, and organize the merged symbolic space. In other words, we don’t need a perfect, fully named dictionary. We need (i) enough capacity so that important concepts can get their own directions, and (ii) a sizeable, behaviorally relevant subset of features whose approximate meanings are stable enough to serve as anchors. The rest of the overcomplete code can remain anonymous background; it still contributes to distances and clusters in the SAE space, even if we never name it.
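For example, given a symbol's aggregated signature (as produced in the alignment sketch above), one could surface whichever labeled features it loads on most heavily; the feature labels below are entirely hypothetical:

```python
import numpy as np

# Hypothetical: only a small fraction of the 4096 features have human descriptions.
FEATURE_LABELS = {
    1812: "duty-of-care-like obligations",
    2951: "physical injury / harm",
    407:  "commercial transactions",
}

def semantic_hints(signature: np.ndarray, top_k: int = 20) -> list:
    """Descriptions of the labeled features a symbol loads on most heavily.
    Unlabeled features still shape distances and clusters; they simply stay anonymous."""
    top = np.argsort(signature)[::-1][:top_k]
    return [FEATURE_LABELS[int(i)] for i in top if int(i) in FEATURE_LABELS]

# `signature` would be the aggregated sparse code of a symbol (e.g. from symbol_signature()).
signature = np.zeros(4096)
signature[[1812, 2951, 93]] = [0.9, 0.4, 0.7]       # toy loading pattern
print(semantic_hints(signature, top_k=3))
# -> ['duty-of-care-like obligations', 'physical injury / harm']
```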
Behavioral Relevance via Counterfactuals
A feature is only interesting, as part of a bridge, if it actually influences the model’s behavior — not just if it correlates with a pattern in the data. In causal terms, we care about whether the feature lies on a causal path in the network’s computation from input to output: if we perturb the feature while holding everything else fixed, does the model’s behaviour change in the way that its believed meaning would predict?
Formally, changing a feature is similar to an intervention of the form \(\text{do}(z = c)\) in the causal sense, where we overwrite that internal variable and rerun the computation. But unlike classical causal-inference modeling, we don’t really need Pearl’s do-calculus to identify \(P(y \mid \text{do}(z))\). The neural network is a fully observable and intervenable system, so we can simply execute the intervention on the internal nodes and observe the new output. In this sense, neural networks give us the luxury of performing idealized interventions that are impossible in most real-world social or economic systems.
Intervening on SAE features is conceptually similar but implemented differently. We typically do not know the meaning of an arbitrary value in the feature space, so the hard intervention mentioned above may not be meaningful. Instead, we amplify or suppress the magnitude of an existing feature, which behaves more like a soft intervention: the structural graph is left untouched, but the feature’s effective influence is changed. Because the SAE reconstructs hidden activations as a linear combination of a small number of semantically meaningful features, we can change the coefficients of those features to implement meaningful, localized interventions without affecting other features.
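A minimal sketch of such a soft intervention (reusing the `SparseAutoencoder` class from the earlier sketch; the feature index and scaling factor are hypothetical, and a real setup would hook this into the model's forward pass):

```python
import torch

def soft_intervention(h, sae, feature_idx: int, scale: float = 3.0):
    """Amplify (scale > 1) or suppress (scale < 1) one SAE feature in a hidden state,
    then return the edited hidden state to feed back into the base model."""
    z = torch.relu(sae.encoder(h))                    # sparse concept code for h
    delta = (scale - 1.0) * z[..., feature_idx]       # change in that coefficient
    direction = sae.decoder.weight[:, feature_idx]    # dictionary column d_i
    # Edit only along the chosen concept direction; other features are untouched.
    return h + delta.unsqueeze(-1) * direction

# Usage with the SparseAutoencoder sketched earlier (feature 123 is hypothetical):
sae = SparseAutoencoder(d_model=512, d_dict=4096)
h = torch.randn(1, 512)
h_edited = soft_intervention(h, sae, feature_idx=123, scale=0.0)  # suppress the feature
```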

Symbolic-System-Based Compression as an Alignment Process
Now let’s take a slightly different view. While neural networks compress the world into highly abstract, continuous manifolds, symbolic systems compress it into a human-defined space with semantically meaningful axes along which the system’s behavior can be judged. From this perspective, compressing information into the symbolic space is an alignment process, where a messy, high-dimensional world is projected onto a space whose coordinates reflect human concepts, interests, and values.
When we introduce symbols like “duty of care”, “threat of violence”, or “protected attribute” into a symbolic system, we are not just inventing labels. This compression process does three things at once:
– It selects which aspects of the world the system is obliged to care about (and which it is supposed to ignore).
– It creates a shared vocabulary so that different stakeholders can reliably point to “the same thing” in disputes and audits.
– It turns those symbols into commitment points: once written down, they can be cited, challenged, and reinterpreted, but not quietly erased.
By contrast, a purely neural compression lives entirely inside the model. Its latent axes are unnamed, its geometry is private, and its content can drift as training data or fine-tuning objectives change. Such a representation is excellent for generalization, but poor as a locus of obligation. It is hard to say, in that space alone, what the system owes to anyone, or which distinctions it is supposed to treat as invariant. In other words, neural compression serves prediction, while symbolic compression serves alignment with a human normative frame.
Once you see symbolic systems as alignment maps rather than mere rule lists, the connection to accountability becomes direct. To say “the model must not discriminate on protected attributes”, or “the model must apply a duty-of-care standard”, is to insist that certain symbolic distinctions be reflected, in a stable way, inside its internal concept space — and that we have the ability to locate, probe, and, if necessary, correct those reflections. And this accountability is usually desired, even at the cost of compromising part of the model’s capability.
From Hidden Law to Shared Symbols
In the Zuo Zhuan, the Jin statesman Shu-Xiang once wrote to Zi-Chan of Zheng: “When punishment is unknown, deterrence becomes unfathomable.” For centuries, the ruling class maintained order through secrecy, believing that fear thrived where understanding ended. That is why it became a milestone in ancient Chinese history when Zi-Chan shattered that tradition, cast the criminal code onto bronze tripods, and displayed it publicly in 536 BCE. AI systems now face a similar problem. Who will be the next Zi-Chan?
References
- Bloom, J., Elhage, N., Nanda, N., Heimersheim, S., & Ngo, R. (2024). Scaling monosemanticity: Sparse autoencoders and language models. Anthropic.
- Garcez, A. d’Avila, Gori, M., Lamb, L. C., Serafini, L., Spranger, M., & Tran, S. N. (2019). Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning. FLAIRS Conference Proceedings, 32, 1–6.
- Gao, L., Dupré la Tour, T., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., & Wu, J. (2024). Scaling and evaluating sparse autoencoders.
- Bartlett, P. L., Foster, D. J., & Telgarsky, M. (2017). Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 6241–6250.
- Chiang, T. (2023, February 9). ChatGPT is a blurry JPEG of the Web. The New Yorker.
- Pearl, J. (2009). Causality: Models, reasoning, and inference (2nd ed.). Cambridge University Press.
- Donoghue v Stevenson [1932] AC 562 (HL).

