I am a physicist interested in Machine Learning, particularly the part that integrates and makes sense of what has been learned and puts it to good use. I have invented a new type of recurrent neural network (RNN) that learns in a completely different way. It does not use synaptic weights or backpropagation. Instead, it simulates a fundamental mechanism that exists in nature and is based on a universal logic known as Causal Mathematical Logic (CML), which all systems in the universe must obey. This includes all RNNs, all AI machines, brains, robots, even that idea you have in mind and are sure will work.
The new RNN has many unusual features and is very different from anything ever tried in Machine Learning (ML). It needs no training and does not expect any samples to be presented to it. Instead, it goes after its own samples. It does so by using sensors, whatever sensors it may have, ranging from cameras to microphones to radar to scientific instruments and detectors, to observe and measure its environment and learn from whatever information it can catch. It can learn in multiple domains, even at the same time, because it defines its own domains and does not depend on pre-specified domains. It learns indefinitely and continuously, and grows as it learns while resources last. It learns "so much from so little so quickly," as babies do. The machine does not use probabilities or statistics; it just applies CML. It makes sense of what it learns by finding invariants with a physical meaning in the information it acquires and recursively arranging them into hierarchies. This is how information becomes knowledge.
To do all that, the machine needs to keep information stored in a good data structure, one that can associate and integrate it and support the core features of self-organization, intelligence and adaptation. This basic data structure is the causal set: a collection of cause/effect pairs. CML works on, and only on, a causal set. It is important to stress the following: causal sets are not chosen because they are effective or "better" in some sense than other options, or simply because I am a very smart person and was able to choose the best. The use of causal sets is an unavoidable consequence of something else: causal sets formalize the fundamental principle of causality that governs the universe, as well as the laws of Thermodynamics. The principle and the laws are the source of the power and generality of CML, and the principle is where causal sets find the unique algebraic properties that support intelligence and adaptation and make the new RNN possible. There is really no data in the usual sense. The machine does not approximate a function; it optimizes an algorithm, a program, which is precisely what RNNs do. A causal set is an executable algorithm.
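As a rough illustration of what such a data structure might look like in code, here is a minimal Python sketch of a causal set stored as a collection of cause/effect pairs, with a check that no effect precedes its own cause. The class and method names are mine and are not taken from the CML publications; the sketch shows only the structure, not the CML algorithm.

    # A minimal sketch: a causal set stored as a set of cause/effect pairs.
    # Names are illustrative only, not part of the CML publications.
    from itertools import product

    class CausalSet:
        def __init__(self, pairs):
            self.pairs = set(pairs)          # each pair (a, b) means "a causes b"

        def elements(self):
            return {x for pair in self.pairs for x in pair}

        def is_acyclic(self):
            """A valid causal set has no causal loops (no element causes itself)."""
            reach = set(self.pairs)          # build the transitive closure
            changed = True
            while changed:
                changed = False
                for (a, b), (c, d) in product(list(reach), list(reach)):
                    if b == c and (a, d) not in reach:
                        reach.add((a, d))
                        changed = True
            return all(a != b for (a, b) in reach)

    # Example: "switch pressed" causes "light on", which causes "room visible".
    cs = CausalSet({("switch pressed", "light on"), ("light on", "room visible")})
    print(cs.elements(), cs.is_acyclic())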
Hence, the new RNN consists of (1) a hidden network that represents a causal set and has no prescribed width or number of layers; (2) an input vector that talks to the sensors; and (3) an output vector that talks to the actuators and may eventually feed back to the input vector for tasks such as sensory-motor integration, sensor/actuator calibration, or action/reaction situations. In a typical scenario, the causal machine learns something at a coarse granularity and becomes immediately ready to use what it just learned and react to it, for example in a situation where an imminent threat is present. As the machine continues to learn, it refines its granularity and improves, becoming capable of reacting and reasoning with more accuracy and fewer errors. The machine can continue doing this indefinitely, learn in other domains at the same time, and refine the granularity each time it learns something new. It can also learn requests from users, for example from a company executive who needs to add new features to the current implementation of the company's software, in which case it may have to restructure the program completely. It does so automatically and with no human intervention.
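The following structural sketch, in Python and with names of my own choosing, is only meant to show the three components and the optional feedback from output to input. The "learning" rule in it is deliberately trivial; CML prescribes the real update.

    # Structural sketch only. The rule used here (record successive readings as
    # cause/effect pairs) is a placeholder of my own, not the CML update.
    class CausalMachine:
        def __init__(self):
            self.causal_set = set()       # (1) hidden network: cause/effect pairs
            self.input_vector = None      # (2) talks to the sensors
            self.output_vector = None     # (3) talks to the actuators

        def step(self, sensor_reading, feedback=False):
            previous = self.input_vector
            self.input_vector = sensor_reading
            if previous is not None:
                self.causal_set.add((previous, sensor_reading))   # record cause -> effect
            self.output_vector = sensor_reading                   # placeholder actuation
            if feedback:                                          # action/reaction loop
                self.input_vector = self.output_vector
            return self.output_vector

    m = CausalMachine()
    for reading in ("dark", "bright", "dark"):
        m.step(reading)
    print(m.causal_set)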
The features of the new machine are exceptional. They are detailed in several publications that appeared over the course of several years. These features are: (to be completed)
There is good literature about RNN theory and practice. My personal preference is The Unreasonable Effectiveness of Recurrent Neural Networks.
Where is the substance of my invention?
The substance is in the idea of simulating the fundamental laws of nature directly, rather than simulating their consequences, such as biological implementations, evolution, statistics or engineering in general, or simply ideas that emerge from our own brains, which in turn obey those same fundamental laws.
Building the machine
The objective of my project is to develop a new type of adaptive, scalable and autonomous AI system with the following features: learns perpetually and without interruption while resources last; improves as it learns; learns directly from the environment without a need for training examples or pre-programming; adapts to the circumstances by using existing knowledge and new knowledge but without pre-programming; updates itself as needed and without human intervention or a need for reprogramming; learns new tasks, even when the new tasks had not been anticipated; develops complex capabilities based on the acquired knowledge; finds meaningful invariants in the learned information and updates them as it learns more; partitions and compresses the acquired knowledge by organizing it into hierarchies of meaningful invariants, with a theorem that guarantees convergence; creates associative memories that keep context, so they are and remain self-explanatory. Many of these features are not available when synaptic weights and back propagation are used for convergence.
The planned machine will operate on the basis of the fundamental principles of nature, as explained next: the fundamental principle of self-organization, which states that a system that has a symmetry also has a corresponding conservation law; the fundamental principle of causality, which states that effects follow their causes, and where causality breaks a perfect symmetry and triggers the onset of self-organization; and the laws of thermodynamics, under which a complex system experiences binding and self-organization when its entropy is minimized. It is very important to stress that the principles, i.e. physics and not engineering, determine the core function of the machine. The machine is universal, because the principles are, and supports general AI. There is only one such conceptual machine; any other machine that supports general AI must be a causal machine. A causal machine is a new type of neural network, a new type of deep learning network, and a new type of RNN, but none of these by themselves is sufficient to support general AI.
It is surprisingly easy to simulate these principles in software, provided the causal set is used as the basic data structure, as explained above. I have built a machine as described that runs on my desktop, and have tested it in several examples with complete success so far. It is certainly far easier to do this than to carry out the grandiose AI projects that big companies pursue. In fact, it is so easy that no special technical knowledge is required, although some programming experience will help.
Human/Machine collaboration and the role of Adaptation
There are three categories of human/machine collaboration, depending on the intensity of the actual interactions, and adaptation plays a crucial role in all of them. The first, weak-interaction category covers tasks where interactions are relatively weak or infrequent, and includes tasks such as driving a vehicle or building one. The strong-interaction category includes tasks such as a hospital robot negotiating its way along a crowded hospital corridor, or a personal assistant; in this category the machine takes full charge of its task. The intermediate category covers the cases that fall between the two.
Adaptation is about optimizing the use of the resources the machine has, whatever they are and whatever situation the machine finds itself in. Adaptation does not remove the need for good engineering. The computer or robot must be engineered in the best possible way, and all necessary resources and features must be provided for the purpose. The machine's ability to adapt will then optimize the use of all resources in each particular situation, even in situations that had not been anticipated. Responses will not always be perfect, but they will be the best they can be given the real-world constraints. The limitations in the responses will shed valuable light on possible future improvements of the machine.
Humans are accustomed to working with adaptive animals, but have never before worked with adaptive machines. AI/human collaboration is a completely new situation that will have to be carefully examined and perfected in the near future.
A simple but very instructive example
This is a simple but very instructive example intended to explain basic concepts of general AI: causality, invariance, meaning, compression, and the CML algorithm. I will explain the same example three times, each time in a completely different way. I will then expand the example to great generality. We will assume an object is moving at a constant speed and in a fixed direction in front of a fixed black-and-white camera trained on the object. This is the first explanation, and it is easy to understand: it has meaning, and it is very short.
The lens of the camera creates an image of the object on the area where the light-sensitive pixels are located. As the object moves, the pixels constantly change their color between black and white in a complex manner. These changes are recorded in terms of time and pixel ID and sent to a file in binary notation. The file is very large, and you are not supposed to understand it. If you don't know about the moving object, it would be very difficult for you to make sense of the blinking pixels written in binary notation. This is the second explanation, and it does not have meaning for you.
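To make the second explanation concrete, here is a toy Python simulation, under my own simplifying assumptions (an 8x8 camera and a one-pixel object crossing it at constant speed), of the kind of record stream the camera would write to the file.

    # Toy simulation (my simplifying assumptions: 8x8 camera, a single white object
    # one pixel tall crossing the frame at constant speed). Each record is
    # (time, pixel_id, new_value), the kind of raw data described above.
    WIDTH, HEIGHT = 8, 8
    events = []
    row = 3                                  # the object moves along this row
    for t in range(WIDTH):
        col = t                              # constant speed: one pixel per time step
        pixel_id = row * WIDTH + col
        events.append((t, pixel_id, 1))      # pixel turns white as the object arrives
        if t > 0:
            events.append((t, row * WIDTH + (col - 1), 0))   # previous pixel turns black
    for record in events[:6]:
        print(record)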
Here comes the third explanation. An AI computer connected to the camera receives the output file and is expected to "discover" what is causing the pixels to blink the way they do. The computer is assumed to have an adjacency table for the pixels; I will explain this table below. The first thing the computer does is to convert the information in the file to a causal set. The causal set will expose the symmetries existing in the information. The symmetries are necessary to enable the self-organization guaranteed by the principle of self-organization. The CML algorithm removes heat from the causal set, lowering its temperature. At this point several things happen, and they look almost magical when put together.
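As a sketch of the conversion step just mentioned, here is one plausible way to turn the event file into cause/effect pairs; the causal rule (an event causes the next event on an adjacent pixel) is my own assumption, not necessarily the rule CML prescribes.

    # One plausible conversion of the event file into a causal set (the rule is my
    # own assumption): an event causes the next event on an adjacent (8-connected) pixel.
    def neighbors(pixel_id, width=8, height=8):
        r, c = divmod(pixel_id, width)
        return {nr * width + nc
                for nr in (r - 1, r, r + 1) for nc in (c - 1, c, c + 1)
                if 0 <= nr < height and 0 <= nc < width and (nr, nc) != (r, c)}

    def events_to_causal_set(events):
        pairs = set()
        for (t1, p1, v1) in events:
            for (t2, p2, v2) in events:
                if t2 == t1 + 1 and p2 in neighbors(p1):
                    pairs.add(((t1, p1, v1), (t2, p2, v2)))   # earlier event causes later one
        return pairs

    # Works on the 'events' list from the camera simulation above; a tiny demo:
    sample_events = [(0, 27, 1), (1, 28, 1), (1, 27, 0), (2, 29, 1), (2, 28, 0)]
    print(sorted(events_to_causal_set(sample_events)))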
The Carnot cycle tells us that any system that loses heat also loses entropy. When entropy is low and symmetries are present, binding takes place: parts of the system bind together in a permanent manner. We don't know in advance which parts will bind, but we can calculate them using the causal set model. Binding leads to the formation of permanent structures, known as invariants. While everything else keeps changing because of the active CML algorithm, the invariant structures remain fixed and do not change. This is the process of self-organization: as we remove heat from the system's model, the causal set, its entropy drops and the system suddenly and unexpectedly self-organizes.
Because the structures do not change, we can observe them, measure them, make them part of our physical world and use them to build our understanding of the behavior of the system. By including them in our mental process we attach a meaning to them. The expression "invariants with a physical meaning" is well known in physics, but ignored in AI. Everything we find around us, from atoms to furniture to stars and planets, is an invariant, and we build our world in terms of the invariants we find.
Here is a very important point: the new structures form a causal set as well, and this causal set is smaller than the original one. Two important consequences follow. The first is recursion. We can cool the new causal set, cause its entropy to drop, and create a second level of invariant structures. We can continue the recursion and create more and more, progressively smaller, levels until we reach a trivial result. We create a pyramid, a fractal hierarchy of meaningful invariants.
The second consequence is compression of the information. There is an obvious theorem of convergence: because the successive levels are smaller and smaller, they must converge to something trivial that we cannot compress anymore (the Kolmogorov limit? I don't know). In a typical scenario, we will have many hierarchies. If we label the tops, we will be able to refer to them by the labels. For example, when we say "AI" we are in fact referring to a label at the top of a very large hierarchy. Many other things happen; I'll mention some of them briefly. The process just described is one of machine learning, because the information acquired by the camera is compressed, organized, and made sense of. The process is also one of logic, because the hierarchies and their invariants are new knowledge obtained from the acquired knowledge. The memory rearranged by the hierarchies is associative, and it naturally keeps context because of the associations, thus solving the frame problem. The property of scalability comes from fractal recursion, and locality and massive parallelism come from the action functional.
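The following toy is not the CML algorithm; it only illustrates the convergence argument in code. Each level replaces a repeated pattern with a new label, so every level is strictly smaller than the previous one, and the recursion must stop at something that no longer compresses.

    # NOT the CML algorithm; only a toy (pair-replacement style) illustration of why
    # a hierarchy of strictly smaller levels must terminate.
    from collections import Counter

    def build_hierarchy(sequence):
        levels, next_label = [list(sequence)], 0
        while True:
            seq = levels[-1]
            pair_counts = Counter(zip(seq, seq[1:]))
            if not pair_counts:
                break
            pair, count = pair_counts.most_common(1)[0]
            if count < 2:
                break                                    # nothing repeats: incompressible
            label = f"B{next_label}"; next_label += 1    # the repeated pattern gets a label
            new_seq, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                    new_seq.append(label); i += 2
                else:
                    new_seq.append(seq[i]); i += 1
            levels.append(new_seq)
        return levels

    for level in build_hierarchy("abababcdcdcd"):
        print(len(level), level)           # levels shrink: 12, 9, 6, 5, 4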
So what do you have to do to get all those complicated things working for you? Here is the only explicit action you have to take: once you have the causal set, minimize the action functional. That's it. Everything else will happen automatically because of the algebraic properties of causal sets. This is how nature works; it is a fact. We don't have to engineer anything; it is already engineered for us and ready for us to apply. Of course, once you have completed the minimization, you have to identify the new invariants and take them as output.
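Here is a brute-force sketch of that single explicit step for a toy causal set. I am assuming a simple form for the action, the sum of the distances between each cause and its effect in the chosen ordering; the functional actually used by CML may be defined differently.

    # Brute-force sketch: minimize an assumed action functional over the orderings of a
    # small causal set. The form of the action below is my own assumption.
    from itertools import permutations

    pairs = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")}   # toy causal set
    elements = sorted({x for p in pairs for x in p})

    def respects_causality(order):
        pos = {e: i for i, e in enumerate(order)}
        return all(pos[a] < pos[b] for (a, b) in pairs)         # causes before effects

    def action(order):
        pos = {e: i for i, e in enumerate(order)}
        return sum(pos[b] - pos[a] for (a, b) in pairs)         # assumed action functional

    valid = [p for p in permutations(elements) if respects_causality(p)]
    best = min(valid, key=action)
    print(best, action(best))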
Above I promised to explain the pixel adjacency table that the AI computer must have. To understand why the table is necessary, consider what would happen without it. Without the table, each pixel is independent of everything else. There is one pixel sending a signal, there is another sending another signal, and so on, without any connection between the pixels. The adjacency table identifies the near neighbors of each pixel and connects them all. With that information the AI computer can follow, and later predict, the expected changes in pixel color. As it builds the hierarchy starting from the lower levels, the machine will notice that there is a certain correlation between changes in a pixel and changes in neighboring pixels. At higher levels, it will begin to notice a certain degree of invariance among the correlations of neighboring pixels, and finally, at the highest levels, it will discover that all changes can be explained by an object painted in a certain combination of black and white passing in front of the camera. This is the point where object recognition takes place. There is no need to program any tensors or anisotropies, or their operations, or to be concerned about edge or feature detection. All we need is already there, in the causal set, and it is complete.
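For concreteness, here is a short sketch of such a table for a rectangular camera; the choice of 4-connected neighbors is mine.

    # Sketch of the adjacency table mentioned above: for each pixel of a width x height
    # camera, list the IDs of its nearest neighbors (4-connected; that choice is mine).
    def build_adjacency_table(width, height):
        table = {}
        for r in range(height):
            for c in range(width):
                pid = r * width + c
                table[pid] = [nr * width + nc
                              for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                              if 0 <= nr < height and 0 <= nc < width]
        return table

    adjacency = build_adjacency_table(8, 8)
    print(adjacency[0])    # corner pixel: two neighbors
    print(adjacency[27])   # interior pixel: four neighbors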
The pixel adjacency table must be learned by the machine, not just supplied, and the machine must remain capable of recalibrating the table; otherwise it is not adaptive and features are lost.
The table allows the machine to automatically adjust vision to the peculiarities of the camera, just like humans do. We do not see our blind spot, even though we know it is there. If our eyes get damaged, for example by a spot of blood on the retina, the brain works its way around the problem in just a few days. It does so by recalibrating the adjacency table. A little baby lying in the crib constantly shakes her arms and legs, or stares at objects such as her mother's face. What is the baby doing? She is calibrating her brain's adjacency table, and at the same time creating a map of the space around her, even in depth, and calibrating it in terms of the muscle activation signals needed to move her arms and legs, her neck and head, her eyes, and so on. She is creating the complex mechanisms she will use all her life for seeing, grabbing, understanding motion, and many other tasks fundamental for survival.
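One way such learning and recalibration could be sketched in code, with a counting rule and threshold that are my own assumptions rather than the published procedure: pixels whose changes repeatedly occur within one time step of each other are recorded as neighbors.

    # Sketch of learning the adjacency table from the data itself. The co-occurrence
    # rule and the threshold are my own assumptions.
    from collections import defaultdict

    def learn_adjacency(events, min_count=2):
        cooccurrence = defaultdict(int)
        for (t1, p1, _) in events:
            for (t2, p2, _) in events:
                if p1 != p2 and abs(t1 - t2) <= 1:
                    cooccurrence[(p1, p2)] += 1
        learned = defaultdict(set)
        for (p1, p2), count in cooccurrence.items():
            if count >= min_count:
                learned[p1].add(p2)        # frequently co-changing pixels become neighbors
        return learned

    # Tiny demo; the same call works on the event stream from the camera simulation above.
    demo = [(0, 5, 1), (1, 6, 1), (1, 5, 0), (2, 7, 1), (2, 6, 0)]
    print(dict(learn_adjacency(demo)))     # links only the pixels along the object's path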
The example I provided can grow in complexity in an extraordinary manner. Motion is of the essence for this growth. The object in front of the camera may be moving in a more complicated manner. It may spin, move in different directions, approach the camera or get farther away, even change shape or color, or there may be multiple objects in sight. There may be more than one camera capturing the scene, and again, if this is the case, "stitching" will be automatic and there will be no need to write code. The cameras may be moving under the control of actuators, or mounted on a vehicle, and there may be sensors of different types. In every case, CML will be effective. It will create a compact description of everything that is happening and identify all objects and each of the various processes that are taking place.
The Fundamental Principles of Nature
A number of principles are used in Physics to describe nature, but the following four are fundamental:
Causality. Causality is the subject of the preceding Sections. A more detailed article will be published soon.
Self-organization. A symmetry of a physical system is any property or feature of the system that remains unchanged when a certain transformation is applied to the system. A system can have many different symmetries. The principle states that, if a dynamical system has a symmetry, then there exists a corresponding law that establishes the conservation of a certain quantity. That "quantity" can be anything that pertains to the system: a value, a structure, a law, a property. The principle states that a conserved quantity does exist, but does not specify what it is or how to obtain it.
Causal sets, being partially ordered, have a symmetry: a causal set remains the same if its elements are permuted in any way that does not violate the partial order. To this symmetry there corresponds a conserved quantity that pertains to the causal set. However, unlike the principle, the theory of causality does specify a precise way of obtaining the corresponding conserved quantity: the conserved quantity corresponding to the symmetry is a block system. The block system is also a causal set, usually smaller than the original one. It has its own symmetry and its own conserved block system. Iterating, a hierarchy of block systems on the given set is obtained. These facts are rich in significance, universal in nature, and of central importance in the theory of causality, particularly when combined with the power causal sets have to represent nature.
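A small computational check of the symmetry just described, for a toy causal set: enumerate the permutations of its elements and keep those that do not violate the partial order.

    # The symmetry in action for a toy causal set: count the orderings of its elements
    # that respect every cause-before-effect relation.
    from itertools import permutations

    pairs = {(1, 2), (1, 3), (2, 4), (3, 4)}          # 1 causes 2 and 3; 2 and 3 cause 4
    elements = {x for p in pairs for x in p}

    def preserves_order(perm):
        position = {e: i for i, e in enumerate(perm)}
        return all(position[a] < position[b] for (a, b) in pairs)

    admissible = [p for p in permutations(sorted(elements)) if preserves_order(p)]
    print(len(admissible), "admissible orderings out of 24")
    for p in admissible:
        print(p)    # (1, 2, 3, 4) and (1, 3, 2, 4): the freedom to swap 2 and 3 is the symmetry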
By contrast, in Computer Science, the fundamental building blocks are strings of characters. But strings are sets with a total order. They have no symmetries, no conserved quantities, and no structures. An unfortunate fact, indeed.
Least-action. It is customary to describe a dynamical system by means of a set of independent variables, each of them a function of time. It is also customary to define a multi-dimensional state space, the coordinates of which are the variables. One single point in state space describes the complete state of the system at a particular instant of time. As the system evolves in time as a result of its dynamics, the point moves in state space and describes a trajectory. According to the principle of least action, if the system travels between two given points in state space, then it does so along a path of stationary action. Fermat's principle, that light travels from one point to another along the path that takes the least time, is a popular example of the application of the principle to this particular case.
Different theories of Physics describe action differently. It is therefore better, at this introductory level, to explain action in a more intuitive way. Think of action as traveling energy, energy that travels from one point to another, or as an energy density that travels on a path. Think of a causal set represented as a directed graph, where energy travels from one vertex to the next along a directed edge. The dimensions of action are energy times time, so the action for the transfer would be the product of the amount of energy transferred and the time it takes to transfer it. In a slightly more precise definition, action is a functional that assigns a number to any given trajectory in state space. If a pair of points is given in state space, then there can be many different trajectories between them, and the action will take a different value for each trajectory. Then, the principle states that the trajectory actually followed by the system is one with a stationary value of the action.
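A small numerical illustration of the principle, using a discretized free particle (a toy setup of my own): among paths with the same endpoints and the same total time, the straight, uniform-velocity path has the smallest discrete action.

    # Free particle from x=0 at t=0 to x=1 at t=1, discretized into N steps.
    # Discrete action: S = sum over steps of (m/2) * (dx/dt)^2 * dt.
    import random

    N, m, dt = 10, 1.0, 0.1

    def action(path):                      # path = positions at the N+1 time points
        return sum(0.5 * m * ((path[i + 1] - path[i]) / dt) ** 2 * dt for i in range(N))

    straight = [i / N for i in range(N + 1)]                       # uniform velocity
    print("straight path action:", round(action(straight), 4))     # 0.5, the minimum

    random.seed(0)
    for _ in range(3):
        wiggly = [0.0] + [i / N + random.uniform(-0.1, 0.1) for i in range(1, N)] + [1.0]
        print("wiggly path action:  ", round(action(wiggly), 4))   # always larger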
The principle of self-organization, and the principle of least action, are known to be closely tied. This fact is not only confirmed by the theory of causality, but also essential for its development.
Entropy. The classical and more familiar approach to Thermodynamics is to define entropy in terms of two measurable quantities: heat exchanges between bodies, and temperature. If a system at temperature T gains an amount of heat ΔQ, then its entropy increases by an amount ΔS = ΔQ/T. It follows that, if heat in the amount ΔQ passes from a "hot" system at temperature T1 to a "cold" system at temperature T2, where T1 > T2, without performing any work, then the net change of entropy is −ΔQ/T1 + ΔQ/T2, which is a positive number, meaning that the combined entropy of the two systems has increased. Thus, when heat flows from hot to cold, it is also flowing in the direction in which the total entropy increases. This definition originated from studies of the Carnot cycle, and as a response to the quest for perpetual motion machines.
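A quick numerical check of that bookkeeping:

    # Heat flowing from a hot body to a cold body increases the combined entropy.
    dQ = 100.0                       # joules of heat transferred
    T_hot, T_cold = 400.0, 300.0     # kelvin
    dS_hot = -dQ / T_hot             # the hot body loses entropy
    dS_cold = +dQ / T_cold           # the cold body gains more entropy than the hot one lost
    print(dS_hot + dS_cold)          # about +0.083 J/K > 0: total entropy increases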
The modern approach considers the internal energy E and the entropy S of a system as independent state variables, separately defined, and externally controlled. Recalling that a state variable is one that depends on the state of the system, the definition means that both E and S can be calculated if the state of the system is given. In this approach, temperature is defined as the derivative of the energy with respect to the entropy: T = dE/dS. The classical definition above is a particular case of the modern definition. For, if a system at temperature T gains an amount of heat ΔQ, then by conservation of energy its internal energy must increase by the same amount, ΔE = ΔQ. The increase of entropy would then be ΔS = ΔE/T. This can be written as T = ΔE/ΔS, and the definition of T is obtained by taking the limit.
In the modern approach, the entropy of a system is defined as a measure of uncertainty of the system's dynamics. As explained above, a system is described by a set of variables that can have different values at different instants of time, and the state of the system at a certain instant of time is the particular combination of values of the variables at that instant. The dynamics of the system is the set of rules that determine how the system transitions from one state to the next. Say that the system is in a certain state A, and a transition to some destination state is about to take place. The rules of the dynamics specify the conditions for a transition to be possible, but in general there can be many destination states that satisfy the conditions, and the rules do not give preference to any one of them. A transition will take place from A to one of the candidate destinations, but there is an uncertainty as to which one. The entropy of state A is a measure of that uncertainty. The classical example is a die. If state A is "die in my hand" and I throw it, it can land in any of 6 possible states. Note that the entropy is a property of the state, which is why we say the entropy is a function of state.
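One standard way to put a number on that uncertainty is the Shannon entropy; for the fair die it is log2(6), about 2.58 bits:

    # Shannon entropy of the "die in my hand" state: six equally likely destinations.
    import math
    outcomes = 6
    probabilities = [1 / outcomes] * outcomes
    entropy_bits = -sum(p * math.log2(p) for p in probabilities)
    print(entropy_bits)    # about 2.585 bits; a loaded die would have less uncertainty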
Entropy and Information. The first lesson to be learned about information is that information is a property of a physical system. Information does not exist by itself. There is always a physical system or medium that carries the information. It can be an optical disk, a computer's memory, a brain's memory, a beam of light or gamma rays, a fiber-optic cable that carries television signals. Radiation coming from the stars carries with it information about the history of the universe. Astronomy is the art of reading that information.
In this Section, I treat information as a physical system and study its physical properties. Information has long been known to have physical properties of its own, independent of the medium that carries it. But it wasn't until very recently, in March 2012, that a direct measurement of the energy of information was completed. The amount of heat released by the act of erasing one bit of information was experimentally measured to be about 3 × 10^-21 joules. This release of heat is real. It is a limitation for modern-day computing, and, even more importantly, it confirms the physical nature of information.
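That figure is consistent with the Landauer limit kT·ln(2), the minimum heat released when one bit is erased at temperature T; at room temperature:

    # Landauer limit at room temperature, for comparison with the measured ~3e-21 J.
    import math
    k = 1.380649e-23             # Boltzmann constant, J/K
    T = 300.0                    # kelvin
    print(k * T * math.log(2))   # about 2.87e-21 J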
If information is viewed as a physical system, then it must also obey the four fundamental principles of nature, have physical properties such as energy and entropy, and, of course, it must also have a state and a dynamics. And the entropy must be a measure of the uncertainty in the information.
Complex Dynamical Systems and Complex Adaptive Systems.
A complex dynamical system (CDS) is any physical system that has a dynamics and can be considered as composed of many interacting parts. The system can be open if it interacts with its environment, or closed if it does not. A complex adaptive system (CAS) is an open CDS that can find regularities in its interactions with the environment, and use the regularities to adjust its behavior based on the prediction of possible outcomes. There is a very large volume of literature on these subjects, but much of it deals with applications to particular systems. For general-purpose basic information and definitions, see the information-theoretic primer developed by Prokopenko et al. in 2009. For more fundamental concepts and profound insights, consider The Quark and the Jaguar by Physics Nobel Prize winner Murray Gell-Mann.
We shall soon see that complex systems do not need to be that complex. In fact, systems with as few as a single-digit number of parts can exhibit many of the features usually attributed to systems with a very large number of parts, such as attractors and deterministic chaos.
Causal sets. Are they powerful enough to represent physical systems?
Causal sets are a particular case of partially ordered sets. Anything said about partially ordered sets applies to causal sets. But the converse is not true, and the differences are very important. Partially ordered sets can be finite or infinite, and nearly all of the attention in that field is focused on infinite partially ordered sets. But causal sets are always finite. The study and applications of causal sets are very different from those of partially ordered sets.
Under certain conditions, any algorithm or computer program that runs on a computer and halts is a causal set. That happens because algorithms and computer programs that halt satisfy the definition of a causal set. But the fundamental reason is that a real-world computer program running on a computer is a physical system, one that really exists in the physical world, and causality applies in the physical world. It can equally well be said that my research interests are in the properties and transformations of computer programs.
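As a tiny illustration of the claim, consider a three-statement program and take "statement A causes statement B" to mean that B reads a value that A wrote (this reading of causality between statements is my own simplification); the data dependencies then form a small causal set.

    # A three-statement program and its causal set of data dependencies:
    #   s1:  x = 2
    #   s2:  y = 3
    #   s3:  z = x * y
    writes = {"s1": {"x"}, "s2": {"y"}, "s3": {"z"}}
    reads  = {"s1": set(), "s2": set(), "s3": {"x", "y"}}

    causal_pairs = {(a, b)
                    for a in writes for b in reads
                    if a != b and writes[a] & reads[b]}   # b reads what a wrote
    print(causal_pairs)   # {('s1', 's3'), ('s2', 's3')}: the program's causal set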
Because computer programs have been used to simulate practically anything we can think of, one can say that causal sets have been used to simulate practically anything we can think of, and that causal sets have an unparalleled ability to represent our knowledge about the world that surrounds us.
There is a big difference, though. Causal sets allow one to deal with causality and knowledge in a mathematical way. Causal sets allow transformations to be applied and consequences to be drawn from the transformations. This would be very difficult to do with programs, because the notation used in programming languages is not mathematically friendly.
Saying that my research interest is in causality also means that my research interest is in computer programs and their properties and transformations, and that I work with computer programs. I do not write them, I transform them. And the results are fascinating. It further means that the scope of my research is very wide. It relates to several disciplines, among them Computer Science, Complex Systems Science, Artificial Intelligence, and Physics.
This website contains a number of short articles describing different aspects of my work. Currently (July 2012), only a few of the articles are posted, but I am working on many more. There is a great deal of material here. I hope you will find it interesting. Check back frequently; the site keeps being updated.
Sergio Pissanetzky
Who is Sergio Pissanetzky?