#StackBounty: #python #python-2.7 Python re for custom sequence type

Bounty: 50

I have a custom sequence-like object, s, that inherits from collections.Sequence and implements custom __len__ and __getitem__. It represents a large blob of strings (>4GB) and is lazily loaded (I can’t afford to load it all into memory).
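To make the setup concrete, here is a minimal sketch of such an object, assuming the blob is stored as several smaller chunks. The class name ConcatView and its internals are hypothetical and only illustrate the __len__/__getitem__ contract; plain strings stand in for the lazily loaded data.

```python
import bisect

try:
    from collections.abc import Sequence  # Python 3
except ImportError:
    from collections import Sequence      # Python 2.7


class ConcatView(Sequence):
    """Hypothetical read-only view presenting several chunks as one sequence."""

    def __init__(self, chunks):
        # Empty chunks are dropped so every start offset is unique.
        self._chunks = [c for c in chunks if len(c) > 0]
        self._starts = []
        total = 0
        for c in self._chunks:
            self._starts.append(total)
            total += len(c)
        self._len = total

    def __len__(self):
        return self._len

    def __getitem__(self, item):
        if isinstance(item, slice):
            # Only the requested range is touched, never the whole blob.
            start, stop, step = item.indices(self._len)
            return ''.join(self[i] for i in range(start, stop, step))
        if item < 0:
            item += self._len
        if not 0 <= item < self._len:
            raise IndexError('ConcatView index out of range')
        # Find the chunk whose start offset is the last one <= item.
        k = bisect.bisect_right(self._starts, item) - 1
        return self._chunks[k][item - self._starts[k]]
```

For example, ConcatView(['hel', 'lo wor', 'ld'])[2:7] touches only the first two chunks. An object like this satisfies the Sequence interface, but that alone is not enough for re.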

I’d like to run a regular-expression match on it, re.compile('some-pattern').match(s), but it fails with TypeError: expected string or buffer.

In practice, the pattern is not something like '.*' that requires the entire s to be loaded; it usually needs only the first few tens of bytes to match. However, I can’t tell the exact number of bytes beforehand and I want to keep it general, so I don’t want to do something like re.compile('some-pattern').match(s[:1000]).

Any suggestions on how to create a str-like object that is accepted by re?

The following code illustrates my unsuccessful attempt. Inheriting from str does not work either.

In [1]: import re, collections

In [2]: class MyStr(collections.Sequence):
   ...:     def __len__(self): return len('hello')
   ...:     def __getitem__(self, item): return 'hello'[item]
   ...:

In [3]: print(re.compile('h.*o').match(MyStr()))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-df08913b19d7> in <module>()
----> 1 print(re.compile('h.*o').match(MyStr()))

TypeError: expected string or buffer
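For contrast, the sketch below shows which kinds of objects re does accept. As far as I understand, re needs either a real string or an object that exposes its memory through the buffer protocol; implementing the Sequence interface in Python is not enough. The helper is_accepted is a hypothetical name used only for this illustration, and the import fallback lets the snippet run on both Python 2.7 and Python 3 (where the error message reads slightly differently).

```python
import re

try:
    from collections.abc import Sequence  # Python 3
except ImportError:
    from collections import Sequence      # Python 2.7


class MyStr(Sequence):
    def __len__(self): return len('hello')
    def __getitem__(self, item): return 'hello'[item]


# A bytes pattern, so that it can be matched against bytes-like objects.
pattern = re.compile(b'h.*o')


def is_accepted(obj):
    """Hypothetical helper: True if re accepts obj, False if it raises TypeError."""
    try:
        pattern.match(obj)
        return True
    except TypeError:
        return False


print(is_accepted(b'hello'))             # a real byte string
print(is_accepted(bytearray(b'hello')))  # exposes a buffer
print(is_accepted(MyStr()))              # pure-Python sequence, no buffer
```

The first two calls succeed and the last one is rejected, which is the same failure as in the traceback above.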

If the big blob of strings came from a single big file, I could use mmap and it should work. However, my case is more complicated. I have multiple big files; I mmapped each of them and have a custom class that is a concatenated view of them. I actually want to perform the RE match starting from any given position in the view. I omitted these details from the original question, but I think they might be helpful to someone who wants to understand why I have such a weird requirement.
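For reference, the single-file case can be sketched as follows. It relies on mmap objects being accepted by re and on the optional pos argument of a compiled pattern's match method; the function name match_in_file is hypothetical, and the file is assumed to be non-empty (an empty file cannot be mapped). It does not cover the concatenated multi-file view.

```python
import mmap
import re


def match_in_file(path, pattern, pos=0):
    """Match a bytes pattern against a memory-mapped file, starting at pos.

    Returns the matched bytes, or None if there is no match at that position.
    """
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; pages are loaded lazily by the OS.
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            m = re.compile(pattern).match(buf, pos)
            # Copy the matched bytes out before the mapping is closed.
            return m.group(0) if m else None
        finally:
            buf.close()
```

Passing pos avoids slicing the mapping, which would copy the data from pos onward.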

