*Bounty: 50*

*Bounty: 50*

**Motivation**

The most common model for a random graph is the Erdős–Rényi model. However, it does not guarantee the connectedness of the graph. Instead, let’s consider the following algorithm (in python-style pseudocode) for generating a random connected graph with $n$ nodes:

```
g = empty graph
g.add_nodes_from(range(n))
while not g.is_connected:
i, j = random combination of two (distinct) nodes in range(n)
if {i, j} not in g.edges:
g.add_edge(i, j)
return g
```

The graph generated this way is guaranteed to be connected. Now, my intuition tells me that its expected number of edges is of the order $ O(n log n) $, and I want to test my hypothesis in Python. I don’t intend to do a rigorous mathematical proof or a comprehensive statistical inference, just some basic graph plotting.

**The Codes**

In order to know whether a graph is connected, we need a partition structure (i.e. union-find). I first wrote a `Partition`

class in the module `partition.py`

. It uses path compression and union by weights:

```
# partition.py
class Partition:
"""Implement a partition of a set of items to disjoint subsets (groups) as
a forest of trees, in which each tree represents a separate group.
Two trees represent the same group if and only if they have the same root.
Support union operation of two groups.
"""
def __init__(self, items):
items = list(items)
# parents of every node in the forest
self._parents = {item: item for item in items}
# the sizes of the subtrees
self._weights = {item: 1 for item in items}
def __len__(self):
return len(self._parents)
def __contains__(self, item):
return item in self._parents
def __iter__(self):
yield from self._parents
def find(self, item):
"""Return the root of the group containing the given item.
Also reset the parents of all nodes along the path to the root.
"""
if self._parents[item] == item:
return item
else:
# find the root and recursively set all parents to it
root = self.find(self._parents[item])
self._parents[item] = root
return root
def union(self, item1, item2):
"""Merge the two groups (if they are disjoint) containing
the two given items.
"""
root1 = self.find(item1)
root2 = self.find(item2)
if root1 != root2:
if self._weights[root1] < self._weights[root2]:
# swap two roots so that root1 becomes heavier
root1, root2 = root2, root1
# root1 is heavier, reset parent of root2 to root1
# also update the weight of the tree at root1
self._parents[root2] = root1
self._weights[root1] += self._weights[root2]
@property
def is_single_group(self):
"""Return true if all items are contained in a single group."""
# we just need one item, any item is ok
item = next(iter(self))
# group size is the weight of the root
group_size = self._weights[self.find(item)]
return group_size == len(self)
```

Next, since we are only interested in the number of edges, we don’t actually need to explicitly construct any graph object. The following function implicitly generates a random connected graph and return its number of edges:

```
import random
from partition import Partition
def connected_edge_count(n):
"""Implicitly generate a random connected graph and return its number of edges."""
edges = set()
forest = Partition(range(n))
# each time we join two nodes we merge the two groups containing them
# the graph is connected iff the forest of nodes form a single group
while not forest.is_single_group:
start = random.randrange(n)
end = random.randrange(n)
# we don't bother to check whether the edge already exists
if start != end:
forest.union(start, end)
edge = frozenset({start, end})
edges.add(edge)
return len(edges)
```

We then estimate the expected number of edges for a given $n$:

```
def mean_edge_count(n, sample_size):
"""Compute the sample mean of numbers of edges in a sample of
random connected graphs with n nodes.
"""
total = sum(connected_edge_count(n) for _ in range(sample_size))
return total / sample_size
```

Now, we can plot the expected numbers of edges against $ n log n $ for different values of $n$:

```
from math import log
import matplotlib.pyplot as plt
def plt_mean_vs_nlogn(nlist, sample_size):
"""Plot the expected numbers of edges against n * log(n) for
a given list of values of n, where n is the number of nodes.
"""
x_values = [n * log(n) for n in nlist]
y_values = [mean_edge_count(n, sample_size) for n in nlist]
plt.plot(x_values, y_values, '.')
```

Finally, when we called `plt_mean_vs_nlogn(range(10, 1001, 10), sample_size=100)`

, we got:

The plot seems very close to a straight line, supporting my hypothesis.

**Questions and ideas for future work**

- My program is slow! It took me 90 seconds to run
`plt_mean_vs_nlogn(range(10, 1001, 10), sample_size=100)`

. How can I improve the performance? - What other improvement can I make on my codes?
- An idea for future work: do a linear regression on the data. A high coefficient of determination would support my hypothesis. Also find out the coefficient of $ n log n $.
- Any other idea for testing my hypothesis programmatically?