#StackBounty: #python #performance #python-3.x #xml #memory-optimization Build embedded XML based on string containment

Bounty: 50

I have a list of strings that represent a path. These are unique, unsorted values that look like this:

li = ['1/', '1/2/4/', '1/23/4/', '1/2/', '1/1/3/', '1/2/3/4/', '1/1/', '1/23/', '1/2/3/']

It can also be assumed that every path has a parent, e.g. 1/2/ has parent 1/.

The goal is to organise these items in a structured XML format, where the parent-relation is clear, i.e. where deeper elements are part of their parents, mimicking a directory tree. Desired output for the list above would be:

<root>
  <include path="1/">
    <include path="1/1/">
      <include path="1/1/3/"/>
    </include>
    <include path="1/2/">
      <include path="1/2/3/">
        <include path="1/2/3/4/"/>
      </include>
      <include path="1/2/4/"/>
    </include>
    <include path="1/23/">
      <include path="1/23/4/"/>
    </include>
  </include>
</root>

The following works (run it here), but I am not sure about efficiency. The actual list can be thousands of values long. I am also curious about memory management, because I am writing all XML nodes in-memory. The eventual goal is to print the final XML to a file.

from xml.etree import ElementTree as ET
import re

li = ['1/', '1/2/4/', '1/23/4/', '1/2/', '1/1/3/', '1/2/3/4/', '1/1/', '1/23/', '1/2/3/']
li = sorted(li)
# In Python >= 3.6 dicts maintain order
# Sets don't? So use best of both worlds, dict: speed and order
di = {l: False for l in li}

root = ET.Element('root')

# First build the root nodes, remove them from dict
for p in list(di.keys()):
  if len(list(filter(None, p.split('/')))) == 1:
    del di[p]
    root.append(ET.Element('include', {'path': p}))

# Because dict (and so keys()) are ordered, we can assume that the nodes higher 
# in the include tree are created before the more in-depth ones
for p in list(di.keys()):
  parent_path = re.sub('d+/$', '', p)
  try:
    parent = root.find(f'.//include[@path="{parent_path}"]')
    parent.append(ET.Element('include', {'path': p}))
  except Exception:
    print(f"parent not found for {p}")

print(ET.tostring(root))

This indeed returns the expected output (not pretty-printed). I am wondering if there is a better option and whether there are flaws in my approach. I’ll only be using Python >= 3.6, which is quite important for the order of the dictionary.


Update

I noticed a bug with the sorting of the list/dict because actually the sorting was done on the strings which isn’t what we want. (e.g. 1001/ would be before 134/) I rewrote part of the code, and use sets assuming these are faster.

from xml.etree import ElementTree as ET
import re


def get_xml(paths):
    root = ET.Element('paths')

    root_paths = set()
    for p in paths:
        # If only one item, use as first-level node - e.g. 1/
        if len(list(filter(None, p.split('/')))) == 1:
            root_paths.add(p)
            root.append(ET.Element('include', {'path': p}))

    # Remove root_paths from set
    paths = paths.difference(root_paths)

    # We can't use regular sort because that'll sort by string
    # Instead, sort by array of ints representation. E.g. `12/1001/14/` -> [12, 1001, 14]
    sorted_paths = sorted(paths, key=lambda item: [int(n) for n in list(filter(None, item.split('/')))])
    for p in sorted_paths:
        # Find parent_path by removing last part of string
        parent_path = re.sub('d+/$', '', p)
        try:
            parent = root.find(f'.//include[@path="{parent_path}"]')
            parent.append(ET.Element('include', {'path': p}))
        except AttributeError:
            print(f"parent {parent_path} not found for {p}", flush=True)

    return root

s = {'1/', '1/2/4/', '1/23/4/', '1/2/', '1/1/3/', '1/2/3/4/', '1/1/', '1/23/', '1/2/3/'}

xml = get_xml(s)

print(ET.tostring(xml))


Get this bounty!!!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.