
You may have heard about the small world experiment performed by Stanley Milgram in the 1960s. He devised an experiment in which he gave a letter to a volunteer in the US and instructed them to forward the letter to a personal contact who was most likely to know another specific person in the same country (the target). The recipient of the letter was then asked to forward the letter again, until it reached the intended recipient. While most letters never reached their targets, those that did (survivor bias at work!) averaged about 6 hops. The "six degrees of separation" has become a cultural reference to society's close interconnectedness.

The idea that a chain of roughly ten people would actually follow through still amazes me: through only a handful of hops, your contacts can connect you with people far outside your own network.

How is that possible? Heuristics.

Suppose you are asked to send a letter to a target person in Finland.1

Unfortunately, I don't have any connections in Finland. On the other hand, I know someone who lived in Sweden for a few years. Perhaps she knows some Finnish people. If not, she probably still has connections in Sweden, which neighbors Finland. She is my best bet to get the letter closer to the target person. The point is that although I don't know the topology of the social network beyond my own personal connections, I can use some rules of thumb to route the letter in the right direction.

Hello Finland! Photo by Ilya Panasenko on Unsplash.

Adopting the perspective of a node in the network (a human involved in the experiment), the action that a node can perform is to forward a message (a letter) along one of its outgoing edges (to a personal contact). This problem of sending messages in the right direction gives an opportunity to have some fun with machine learning.

Nodes are not aware of the entire network topology. We can set up an environment that rewards routing messages along low-cost paths while still incentivizing the exploration of suboptimal candidate paths. That sounds like a great use case for reinforcement learning, right?

If you are interested in running the code, please visit the repository here.

Problem

We are given a directed graph with sparse edges between nodes. This means that the average number of outgoing edges per node is significantly lower than the number of nodes. Moreover, the edges have associated costs. This addition generalizes the small-world experiment, where each hop of a letter counts as a cost of 1.

The problem we consider is to design a reinforcement learning algorithm that finds a path from any starting node to any target node in a sparse directed graph, if such a path exists, at the lowest possible cost. There is a deterministic solution to this problem: for example, Dijkstra's algorithm finds the shortest paths from a starting node to all other nodes in a directed graph. This is useful for evaluating the results of the reinforcement learning algorithm, which does not necessarily find the optimal solution.
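As a point of reference for later comparisons, here is a minimal sketch of such a deterministic baseline, assuming the graph is stored as an adjacency dictionary mapping each node to a list of (neighbor, edge cost) pairs; the function name and data layout are illustrative and not taken from the repository.

import heapq

def dijkstra_costs(adjacency, start_node):
    # Minimum cost from start_node to every reachable node.
    # adjacency maps each node to a list of (neighbor, edge_cost) pairs (illustrative layout).
    best_cost = {start_node: 0.0}
    frontier = [(0.0, start_node)]  # priority queue of (cost so far, node)
    while frontier:
        cost_so_far, node = heapq.heappop(frontier)
        if cost_so_far > best_cost.get(node, float('inf')):
            continue  # stale queue entry
        for neighbor, edge_cost in adjacency.get(node, []):
            candidate = cost_so_far + edge_cost
            if candidate < best_cost.get(neighbor, float('inf')):
                best_cost[neighbor] = candidate
                heapq.heappush(frontier, (candidate, neighbor))
    return best_cost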

Q-learning

Q-learning is a reinforcement learning technique in which the agent maintains a table of state-action pairs associated with expected discounted cumulative rewards, i.e., qualities; hence the name Q-learning. Through iterative experiments, the table is updated until a stopping criterion is met. After training, the agent can choose, for a given state (row of the Q matrix), the action (column of the Q matrix) that corresponds to the highest quality.

The update rule, given a trial action a_j that causes a transition from state s_i to state s_k with reward r, and given the best estimate of the quality of state s_k, is:

\[ Q(i, j) \leftarrow (1 - \alpha) Q(i, j) + \alpha \left( r + \gamma \max_{l} Q(k, l) \right) \]

Equation 1: The Q-learning update rule.

In Equation 1:

  • α is the learning rate, which controls how quickly new outcomes overwrite previous quality estimates.
  • γ is the discount factor, which controls how much weight future rewards carry compared to the immediate reward.
  • Q is the quality matrix. The row index i is the index of the origin state, and the column index j is the index of the chosen action.

In other words, Equation 1 states that the quality of a (state, action) pair should be partially updated with a new quality value, consisting of the sum of the immediate reward and the discounted estimate of the next state's maximum quality over the possible actions.
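As a quick illustration of Equation 1, a plain tabular update could look like the following sketch; the function and variable names are generic placeholders, not the classes used later in this post.

import numpy as np

def q_update(Q, state, action, reward, next_state, alpha, gamma):
    # Apply the update of Equation 1 to a tabular Q matrix of shape [number of states, number of actions]
    best_next_quality = np.max(Q[next_state, :])  # max over the actions available in the next state
    Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * best_next_quality)
    return Q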

For our problem statement, a possible formulation of the state is the pair (current node, target node), and the set of actions is the set of nodes. The state set then contains N² values and the action set contains N values, where N is the number of nodes. However, because the graph is sparse, a given origin node has only a small subset of nodes as outgoing edges. With this formulation, most of the N³ entries of the Q-matrix are never visited, consuming memory unnecessarily.

Distributed agents

To use resources more effectively, we distribute the agents: each node can be regarded as an agent. Since the agent's state is the target node, its Q-matrix has N rows and Nout columns (the number of outgoing edges for this particular node). With N agents, the total number of matrix entries is N²·Nout, which is lower than N³.
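As an illustration with made-up numbers: for N = 1,000 nodes and about 10 outgoing edges per node, the single-agent formulation would need on the order of N³ = 10⁹ Q-entries, whereas the distributed formulation needs about N²·Nout = 10⁷, a hundred times fewer.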

In summary:

  • We train N agents, one for each node in the graph.
  • Each agent learns a Q-matrix of dimension [N x Nout]. The number of outgoing edges (Nout) may vary from node to node. For loosely connected graphs, Nout << N.
  • The row index of the Q-matrix corresponds to the state of the agent, that is, the target node.
  • The column index of the Q-matrix corresponds to the outgoing edge chosen by the agent to route the message toward the target node.
  • Q[i, j] represents the estimated quality of forwarding the message along the jth outgoing edge when the target node is i.
  • When a node receives a message, the message includes the target node. The sender of the previous hop is not included in the message, because it plays no role in the routing decision for the next hop.

Code

The core class, representing a node, is called QNode:

import random

import numpy as np

class QNode:
    def __init__(self, number_of_nodes=0, connectivity_average=0, connectivity_std_dev=0, Q_arr=None, neighbor_nodes=None,
                 state_dict=None):
        if state_dict is not None:
            self.Q = state_dict['Q']
            self.number_of_nodes = state_dict['number_of_nodes']
            self.neighbor_nodes = state_dict['neighbor_nodes']
        else:  # state_dict is None
            if Q_arr is None:
                self.number_of_nodes = number_of_nodes
                number_of_neighbors = connectivity_average + connectivity_std_dev * np.random.randn()
                number_of_neighbors = round(number_of_neighbors)
                number_of_neighbors = max(number_of_neighbors, 2)  # At least two out-connections
                number_of_neighbors = min(number_of_neighbors, self.number_of_nodes)  # No more than N connections
                self.neighbor_nodes = random.sample(range(self.number_of_nodes), number_of_neighbors)  # [1, 4, 5, ...]
                self.Q = np.zeros((self.number_of_nodes, number_of_neighbors))  # Optimistic initialization: all rewards will be negative
                # q = self.Q[state, action]: state = target node; action = chosen neighbor node (converted to a column index) to route the message to

            else:  # state_dict is None and Q_arr is not None
                self.Q = Q_arr
                self.number_of_nodes = self.Q.shape[0]
                self.neighbor_nodes = neighbor_nodes
The QNode class contains the number of nodes in the graph, the list of outgoing edges, and the Q-matrix. The Q-matrix is initialized with zeros. The rewards received from the environment are the negative of the edge costs, so all quality values are negative. Therefore, initialization with zeros is an optimistic initialization.

When a message arrives, a QNode object selects one of its outgoing edges with the epsilon-greedy algorithm. When ε is small, the epsilon-greedy algorithm selects the outgoing edge with the highest Q-value most of the time. Occasionally, it selects an outgoing edge at random.

    def epsilon_greedy(self, target_node, epsilon):
        rdm_nbr = random.random()
        if rdm_nbr < epsilon:  # Random choice
            random_choice = random.choice(self.neighbor_nodes)
            return random_choice
        else:  # Greedy choice
            neighbor_columns = np.where(self.Q[target_node, :] == self.Q[target_node, :].max())[0]  # Columns that reach the maximum Q-value, e.g. [1, 4, 5]
            neighbor_column = random.choice(neighbor_columns)
            neighbor_node = self.neighbor_node(neighbor_column)  # Convert the column index back to a node index
            return neighbor_node

The other class represents the graph: QGraph.

class QGraph:
    def __init__(self, number_of_nodes=10, connectivity_average=3, connectivity_std_dev=0, cost_range=[0.0, 1.0],
                 maximum_hops=100, maximum_hops_penalty=1.0):
        self.number_of_nodes = number_of_nodes
        self.connectivity_average = connectivity_average
        self.connectivity_std_dev = connectivity_std_dev
        self.cost_range = cost_range
        self.maximum_hops = maximum_hops
        self.maximum_hops_penalty = maximum_hops_penalty
        self.QNodes = []
        for node in range(self.number_of_nodes):
            self.QNodes.append(QNode(self.number_of_nodes, self.connectivity_average, self.connectivity_std_dev))

        self.cost_arr = cost_range[0] + (cost_range[1] - cost_range[0]) * np.random.random((self.number_of_nodes, self.number_of_nodes))

Its main fields are the list of nodes and the array of edge costs. The actual edges are stored in the QNode objects, as lists of outgoing nodes.

To generate a path from a starting node to a target node, we call the QGraph.trajectory() method, which relies on the QNode.epsilon_greedy() method:

    def trajectory(self, start_node, target_node, epsilon):
        visited_nodes = [start_node]
        costs = []
        if start_node == target_node:
            return visited_nodes, costs
        current_node = start_node
        while len(visited_nodes) < self.maximum_hops + 1:
            next_node = self.QNodes[current_node].epsilon_greedy(target_node, epsilon)
            cost = float(self.cost_arr[current_node, next_node])
            visited_nodes.append(next_node)
            costs.append(cost)
            current_node = next_node
            if current_node == target_node:
                return visited_nodes, costs
        # We reached the maximum number of hops
        return visited_nodes, costs

The trajectory() method returns the list of nodes visited along the path and the list of costs associated with the edges used.
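For example, a call could look like the following sketch; the constructor arguments, node indices, and epsilon value are illustrative only.

graph = QGraph(number_of_nodes=12, connectivity_average=3)
visited_nodes, costs = graph.trajectory(start_node=4, target_node=11, epsilon=0.1)
print(visited_nodes)  # the list of visited nodes, starting with node 4
print(sum(costs))     # total undiscounted cost of the edges used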

The last missing piece is the update rule of Equation 1, implemented by the QGraph.update_Q() method:

    def update_Q(self, start_node, neighbor_node, alpha, gamma, target_node):
        cost = self.cost_arr[start_node, neighbor_node]
        reward = -cost
        # Q_orig(target, dest) <- (1 - alpha) * Q_orig(target, dest) + alpha * (r + gamma * max_neigh' Q_dest(target, neigh'))
        Q_orig_target_dest = self.QNodes[start_node].Q[target_node, self.QNodes[start_node].neighbor_column(neighbor_node)]
        max_neigh_Q_dest_target_neigh = np.max(self.QNodes[neighbor_node].Q[target_node, :])
        updated_Q = (1 - alpha) * Q_orig_target_dest + alpha * (reward + gamma * max_neigh_Q_dest_target_neigh)
        self.QNodes[start_node].Q[target_node, self.QNodes[start_node].neighbor_column(neighbor_node)] = updated_Q

To train the agents, we repeatedly loop through (start_node, target_node) pairs, with an inner loop over the nodes adjacent to start_node, and we call update_Q().
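A minimal version of that training loop could look like the following sketch; the number of epochs and the values of alpha and gamma are arbitrary illustrative choices, and the repository may organize the loop differently.

graph = QGraph(number_of_nodes=12, connectivity_average=3)
alpha, gamma = 0.1, 0.9

for epoch in range(1000):
    for start_node in range(graph.number_of_nodes):
        for target_node in range(graph.number_of_nodes):
            if target_node == start_node:
                continue
            # Inner loop over the neighbors of start_node, as described above
            for neighbor_node in graph.QNodes[start_node].neighbor_nodes:
                graph.update_Q(start_node, neighbor_node, alpha, gamma, target_node)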

Experiments and results

Let's start with a simple graph of 12 nodes with directed weighted edges.

Figure 1: The 12-node graph. Image by the author.

From Figure 1, we can see that the only incoming edge to node 1 comes from node 7, and the only incoming edge to node 7 comes from node 1. Therefore, nodes other than these two cannot reach node 1 or node 7. If another node is tasked with sending a message to node 1 or node 7, the message bounces around the graph until the maximum number of hops is reached. We expect a significant negative impact on the Q-values in these cases.

When we train on the graph, we collect statistics on the cost and the number of hops as a function of the epoch, as shown in Figure 2.

Figure 2: Typical evolution of the cost and the path length (number of hops) as a function of the epoch. Image by the author.

This is what the Q-matrix of node 4 looks like after training:

Table 1: The Q-matrix of node 4. Image by the author.

The trajectory from node 4 to node 11 can be obtained by calling the trajectory() method with epsilon=0 for a greedy, deterministic solution: [4, 3, 5, 11], with a total undiscounted cost of 0.9 + 0.9 + 0.3 = 2.1. Dijkstra's algorithm returns the same path.

In rare cases, the best path could not be found. For example, to go from node 6 to node 9, a particular instance of the trained Q-graph returned [6, 11, 0, 4, 10, 2, 9], with a total undiscounted cost of 3.5, whereas Dijkstra's algorithm returned [6, 0, 4, 10, 2, 9], with a total undiscounted cost of 3.4. As mentioned before, this is what you would expect from an iterative algorithm.

Conclusion

We formulated the small-world experiment as the problem of finding minimum-cost paths between pairs of nodes in a sparse directed graph with weighted edges. We implemented each node as a Q-learning agent that learns through the update rule and takes the least costly action given a target node.

A simple graph showed that training settles on a solution close to the optimal one.

Thank you for your time. Feel free to try out the code. If you have ideas for fun applications of Q-learning, let us know!


1 OK, I should go beyond the original small world experiment and call this the small country experiment.

References

Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, MIT Press, 1998.
