#StackBounty: #python-3.x #deployment #virtualenv Best practice of python virtual environment deployment

Bounty: 50

Sorry if this was answered in another post.

I am very new to Python and learning about virtual environments. I understand that I am supposed to install all the libraries in the virtual environment and create a requirements.txt, so others can install from it. However, I am not sure what the best practice is for deploying to production.

The reason I ask is that no one is supposed to have access to the production environment. Deployment goes through a predefined pipeline, and my understanding is that it zips all my code and deploys it to production; no one is supposed to go into production to do any manual work. I can try to get the pipeline to run a script that installs all the libraries based on the requirements.txt, but I am not sure whether the firewall settings are the same. Should I package the libraries as well?

Also, how should I trigger the Python script? Should I have a wrapper script that activates the venv before calling the Python script and deactivates it after? Or is there an easier way?


Get this bounty!!!

#StackBounty: #python-3.x #django #image #google-image-search How can I refine my Python reverse image search to limit to a specific do…

Bounty: 200

I’m using Python 3.8. I have the following script for launching a Google reverse image search …

    import requests
    import webbrowser

    filePath = '/tmp/cat_and_dog.webp'
    searchUrl = 'http://www.google.hr/searchbyimage/upload'
    multipart = {'encoded_image': (filePath, open(filePath, 'rb')), 'image_content': ''}
    response = requests.post(searchUrl, files=multipart, allow_redirects=False)
    fetchUrl = response.headers['Location']
    webbrowser.open(fetchUrl)

Does anyone know how, if it is possible, I can refine the search to a specific domain?


Get this bounty!!!

#StackBounty: #python #python-3.x #machine-learning #autocomplete #artificial-intelligence Is there a way to reset Kite Autocomplete&#3…

Bounty: 50

I’m wondering if I can reset the training of Kite’s AI that uses my code. I want to do this because I want to change my code style and there is some stuff that I quit doing.

Take xrange, for example; it no longer exists in Python 3 (I’m a Python coder). So I want to reset all of the data Kite learned from me, as if I had just got it again. I don’t want to uninstall and reinstall it.

Would uninstalling the Sublime Text/Atom plugins and reinstalling them do the trick? Or is it not possible?

As for the specs: I’m on macOS Catalina (10.15.5 (19F96)), using the non-pro version of Kite with no account, Kite version 0.20200609.2.

I really want to know if there’s an official way, not some file removing magic.

But if some file removing magic is necessary, then I’m fine.

Also, I wonder if just removing and reinstalling the plugins for editors would do the trick…


Someone set a bounty on this; I don’t wanna.


Get this bounty!!!

#StackBounty: #python #python-3.x #pandas #dataframe #data-science Better way to iterate over dataset and change a feature value for ot…

Bounty: 50

I have a dataset of velocities registered by sensors on highways, and I’m changing the label values to the avg5 (average velocity over a 5-minute timestamp) from 2 hours in the future (normally it is 30 minutes: the current label value is the observed avg5 from 30 minutes in the future).

My dataset has the following features and values:

[image: head of the dataset, showing the features and values]

[image: the expanded dataset]

And I’m doing this switch of values in the following way:

import datetime

hours_added = datetime.timedelta(hours = 2)

for index in data_copy.index:

  hours_ahead = data.loc[index, "timestamp5"] + hours_added
  result = data_copy[((data_copy["timestamp5"] == hours_ahead) & (data_copy["sensor_id"] == data_copy["sensor_id"].loc[index]))]

  if len(result) == 1:
    data_copy.at[index, "label"] = result["avg5"]

  if(index % 50 == 0):
    print(f"Index: {index}")

The code queries 2 hours ahead and catches the result for the same sensor_id that I’m currently iterating over. I only change the value of my label if the query brings back something (len(result) == 1).

My dataframe has 2950521 rows, and at the moment I’m publishing this question the kernel has been running for more than 24 hours and has only reached index 371650.

So I started thinking that either I’m doing something wrong or there is a better way of changing these values that doesn’t take so long.

Updates
The desired behavior is to assign the avg5 of the same sensor_id from 2 hours in the future to the label 2 hours earlier.
Let’s take as an example the two images from this question and suppose that, instead of 2 hours, I want to assign the avg5 from 10 minutes later in the future (the sensor_id values in this example are the same).

So the label of the row with index 0, instead of being 50.79, should be 51.59 (the avg5 value of the row with index 2).
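To make the desired behavior concrete, here is a tiny runnable sketch of a merge-based (non-iterative) alternative I have been considering. The column names (sensor_id, timestamp5, avg5, label) match my data, but the values below are made up and the 10-minute shift mirrors the small example above:

import pandas as pd

# Toy data only; the real dataset has millions of rows.
data_copy = pd.DataFrame({
    "sensor_id": [1, 1, 1],
    "timestamp5": pd.to_datetime(
        ["2020-01-01 10:00", "2020-01-01 10:05", "2020-01-01 10:10"]),
    "avg5": [50.79, 51.20, 51.59],
    "label": [50.79, 51.20, 51.59],
})

shift = pd.Timedelta(minutes=10)  # would be hours=2 on the real data

# Shift the future readings back so they line up with the rows they should label,
# then look them up with a single self-merge instead of one query per row.
future = data_copy[["sensor_id", "timestamp5", "avg5"]].copy()
future["timestamp5"] = future["timestamp5"] - shift
merged = data_copy.merge(
    future, on=["sensor_id", "timestamp5"], how="left", suffixes=("", "_future"))

# Only overwrite the label where a matching future value exists.
data_copy["label"] = merged["avg5_future"].fillna(data_copy["label"]).values

print(data_copy)  # the label of index 0 becomes 51.59, as described above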


Get this bounty!!!

#StackBounty: #python-3.x #pandas #dataframe #formatting dataframe: transform row-based transaction data into aggregates per date

Bounty: 50

I retrieve data from an SQLite database (and transform it into a pandas dataframe) in the following format:

Driver | Date loading | Date unloading | Loading Adress | Unloading Address
Peter  | 02.05.2020   | 03.05.2020     | 12342, Berlin  | 14221, Utrecht
Peter  | 03.05.2020   | 04.05.2020     | 14221, Utrecht | 13222, Amsterdam
Franz  | 03.05.2020   | 03.05.2020     | 11111, Somewher| 11221, Somewhere2
Franz  | 03.05.2020   | 05.05.2020     | 11223, Upsalla | 14231, Berlin

The date range can be specified for the query, so that it gives an overview over which driver has which transports to deliver within the specified date range, ordered by date.

The goal of the transformation is a weekly plan for each driver, with the dates from the range, in order, as the columns. So for the data above, this would look like the following:

Driver | 02.05.2020           | 03.05.2020            | 04.05.2020         | 05.05.2020      |
Peter  | Loading:             | Unloading:              Unloading:
         12342, Berlin          14221, Utrecht          13222, Amsterdam
                                Loading:
                                14221, Utrecht

Franz  |                      | Loading:              |                    | Unloading:
                                11111, Somewher                              14231, Berlin
                                Unloading:
                                11221, Somewhere2
                                Loading:
                                11223, Upsalla

Is there any way to achieve the described output with dataframe operations? Within a single date column I need to keep the order: loading first, unloading second, and then on to the next data row if the date is the same.
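For illustration, here is a rough, untested sketch of the kind of reshaping I imagine might work: turn each transport row into separate loading/unloading events, then group and pivot by driver and date. The column names are taken from the sample above, and the cell formatting is only approximate:

import pandas as pd

df = pd.DataFrame({
    "Driver": ["Peter", "Peter", "Franz", "Franz"],
    "Date loading": ["02.05.2020", "03.05.2020", "03.05.2020", "03.05.2020"],
    "Date unloading": ["03.05.2020", "04.05.2020", "03.05.2020", "05.05.2020"],
    "Loading Adress": ["12342, Berlin", "14221, Utrecht", "11111, Somewher", "11223, Upsalla"],
    "Unloading Address": ["14221, Utrecht", "13222, Amsterdam", "11221, Somewhere2", "14231, Berlin"],
})

# Remember the original row so the required order (loading before unloading,
# row by row) survives the reshaping.
df = df.reset_index().rename(columns={"index": "row"})

loading = df[["Driver", "row", "Date loading", "Loading Adress"]].rename(
    columns={"Date loading": "Date", "Loading Adress": "Text"})
loading["Text"] = "Loading:\n" + loading["Text"]
loading["order"] = 0

unloading = df[["Driver", "row", "Date unloading", "Unloading Address"]].rename(
    columns={"Date unloading": "Date", "Unloading Address": "Text"})
unloading["Text"] = "Unloading:\n" + unloading["Text"]
unloading["order"] = 1

events = pd.concat([loading, unloading]).sort_values(["Driver", "row", "order"])

# One cell per driver and date, with the dates as columns.
plan = (events.groupby(["Driver", "Date"])["Text"]
              .agg("\n".join)
              .unstack("Date"))
print(plan)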


Get this bounty!!!

#StackBounty: #python #python-3.x #random #graph Test the hypothesis that the expected number of edges of a random connected graph is …

Bounty: 50

Motivation

The most common model for a random graph is the Erdős–Rényi model. However, it does not guarantee the connectedness of the graph. Instead, let’s consider the following algorithm (in python-style pseudocode) for generating a random connected graph with $n$ nodes:

g = empty graph
g.add_nodes_from(range(n))

while not g.is_connected:
    i, j = random combination of two (distinct) nodes in range(n)
    if {i, j} not in g.edges:
        g.add_edge(i, j)

return g

The graph generated this way is guaranteed to be connected. Now, my intuition tells me that its expected number of edges is of the order $O(n \log n)$, and I want to test my hypothesis in Python. I don’t intend to do a rigorous mathematical proof or a comprehensive statistical inference, just some basic graph plotting.
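
For concreteness, here is one way the pseudocode above could be written with networkx (purely illustrative; the test code below deliberately avoids building a graph object at all):

import random
import networkx as nx

def random_connected_graph(n):
    """Add uniformly random edges until the graph becomes connected."""
    g = nx.Graph()
    g.add_nodes_from(range(n))

    while not nx.is_connected(g):
        # a random combination of two distinct nodes
        i, j = random.sample(range(n), 2)
        if not g.has_edge(i, j):
            g.add_edge(i, j)

    return g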

The Codes

In order to know whether a graph is connected, we need a partition structure (i.e. union-find). I first wrote a Partition class in the module partition.py. It uses path compression and union by weights:

# partition.py

class Partition:
    """Implement a partition of a set of items to disjoint subsets (groups) as
    a forest of trees, in which each tree represents a separate group.
    Two trees represent the same group if and only if they have the same root.
    Support union operation of two groups.
    """

    def __init__(self, items):
        items = list(items)

        # parents of every node in the forest
        self._parents = {item: item for item in items}

        # the sizes of the subtrees
        self._weights = {item: 1 for item in items}

    def __len__(self):
        return len(self._parents)

    def __contains__(self, item):
        return item in self._parents

    def __iter__(self):
        yield from self._parents

    def find(self, item):
        """Return the root of the group containing the given item.
        Also reset the parents of all nodes along the path to the root.
        """
        if self._parents[item] == item:
            return item
        else:
            # find the root and recursively set all parents to it
            root = self.find(self._parents[item])
            self._parents[item] = root
            return root

    def union(self, item1, item2):
        """Merge the two groups (if they are disjoint) containing
        the two given items.
        """
        root1 = self.find(item1)
        root2 = self.find(item2)

        if root1 != root2:
            if self._weights[root1] < self._weights[root2]:
                # swap two roots so that root1 becomes heavier
                root1, root2 = root2, root1

            # root1 is heavier, reset parent of root2 to root1
            # also update the weight of the tree at root1
            self._parents[root2] = root1
            self._weights[root1] += self._weights[root2]

    @property
    def is_single_group(self):
        """Return true if all items are contained in a single group."""
        # we just need one item, any item is ok
        item = next(iter(self))

        # group size is the weight of the root
        group_size = self._weights[self.find(item)]
        return group_size == len(self)
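
As a quick sanity check of how the class is meant to behave (this snippet is not part of the module itself):

from partition import Partition

parts = Partition(range(4))
print(parts.is_single_group)           # False: four singleton groups

parts.union(0, 1)
parts.union(2, 3)
print(parts.find(0) == parts.find(1))  # True: 0 and 1 share a root
print(parts.is_single_group)           # False: {0, 1} and {2, 3}

parts.union(1, 3)
print(parts.is_single_group)           # True: everything is merged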

Next, since we are only interested in the number of edges, we don’t actually need to explicitly construct any graph object. The following function implicitly generates a random connected graph and returns its number of edges:

import random
from partition import Partition

def connected_edge_count(n):
    """Implicitly generate a random connected graph and return its number of edges."""
    edges = set()
    forest = Partition(range(n))

    # each time we join two nodes we merge the two groups containing them
    # the graph is connected iff the forest of nodes form a single group
    while not forest.is_single_group:
        start = random.randrange(n)
        end = random.randrange(n)

        # we don't bother to check whether the edge already exists
        if start != end:
            forest.union(start, end)
            edge = frozenset({start, end})
            edges.add(edge)

    return len(edges)

We then estimate the expected number of edges for a given $n$:

def mean_edge_count(n, sample_size):
    """Compute the sample mean of numbers of edges in a sample of
    random connected graphs with n nodes.
    """
    total = sum(connected_edge_count(n) for _ in range(sample_size))
    return total / sample_size

Now, we can plot the expected numbers of edges against $n \log n$ for different values of $n$:

from math import log
import matplotlib.pyplot as plt

def plt_mean_vs_nlogn(nlist, sample_size):
    """Plot the expected numbers of edges against n * log(n) for
    a given list of values of n, where n is the number of nodes.
    """
    x_values = [n * log(n) for n in nlist]
    y_values = [mean_edge_count(n, sample_size) for n in nlist]
    plt.plot(x_values, y_values, '.')

Finally, when we called plt_mean_vs_nlogn(range(10, 1001, 10), sample_size=100), we got:

[plot: sample mean of the edge counts plotted against n log(n)]

The plot seems very close to a straight line, supporting my hypothesis.

Questions and ideas for future work

  1. My program is slow! It took me 90 seconds to run plt_mean_vs_nlogn(range(10, 1001, 10), sample_size=100). How can I improve the performance?
  2. What other improvements can I make to my code?
  3. An idea for future work: do a linear regression on the data. A high coefficient of determination would support my hypothesis. Also find out the coefficient of $n \log n$ (a rough sketch of this follows the list).
  4. Any other idea for testing my hypothesis programmatically?
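
For item 3, this is roughly what I have in mind, assuming numpy is available and reusing mean_edge_count from above:

import numpy as np
from math import log

def fit_nlogn(nlist, sample_size):
    """Fit the mean edge counts against n * log(n); return slope, intercept, R^2."""
    x = np.array([n * log(n) for n in nlist])
    y = np.array([mean_edge_count(n, sample_size) for n in nlist])

    slope, intercept = np.polyfit(x, y, 1)

    # coefficient of determination of the straight-line fit
    residuals = y - (slope * x + intercept)
    r_squared = 1 - residuals.var() / y.var()
    return slope, intercept, r_squared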


Get this bounty!!!
