#StackBounty: #python #performance #regex #natural-language-proc #cython Using lots of regex substitutions to tokenize text

Bounty: 50

I authored a piece of code that was merged into the nltk codebase. It is full of regex substitutions:

import re
from six import text_type

from nltk.tokenize.api import TokenizerI

class ToktokTokenizer(TokenizerI):
    """
    This is a Python port of the tok-tok.pl tokenizer from
    https://github.com/jonsafari/tok-tok

    >>> toktok = ToktokTokenizer()
    >>> text = u'Is 9.5 or 525,600 my favorite number?'
    >>> print (toktok.tokenize(text, return_str=True))
    Is 9.5 or 525,600 my favorite number ?
    >>> text = u'The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things'
    >>> print (toktok.tokenize(text, return_str=True))
    The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things
    >>> text = u'\xa1This, is a sentence with weird\xbb symbols\u2026 appearing everywhere\xbf'
    >>> expected = u'\xa1 This , is a sentence with weird \xbb symbols \u2026 appearing everywhere \xbf'
    >>> assert toktok.tokenize(text, return_str=True) == expected
    >>> toktok.tokenize(text) == [u'\xa1', u'This', u',', u'is', u'a', u'sentence', u'with', u'weird', u'\xbb', u'symbols', u'\u2026', u'appearing', u'everywhere', u'\xbf']
    True
    """

    # Replace non-breaking spaces with normal spaces.
    NON_BREAKING = re.compile(u"\u00A0"), " "

    # Pad some funky punctuation.
    FUNKY_PUNCT_1 = re.compile(u'([،;؛¿!"\]\}»›”؟¡%٪°±©®।॥…])'), r" \1 "
    # Pad more funky punctuation.
    FUNKY_PUNCT_2 = re.compile(u'([({\[“‘„‚«‹「『])'), r" \1 "
    # Pad En dash and em dash
    EN_EM_DASHES = re.compile(u'([–—])'), r" \1 "

    # Replace problematic character with numeric character reference.
    AMPERCENT = re.compile('& '), '&amp; '
    TAB = re.compile('\t'), ' &#9; '
    PIPE = re.compile(r'\|'), ' &#124; '

    # Pad numbers with commas to keep them from further tokenization. 
    COMMA_IN_NUM = re.compile(r'(?<!,)([,،])(?![,\d])'), r' \1 '

    # Just pad problematic (often neurotic) hyphen/single quote, etc.
    PROB_SINGLE_QUOTES = re.compile(r"(['’`])"), r' \1 '
    # Group ` ` stupid quotes ' ' into a single token.
    STUPID_QUOTES_1 = re.compile(r" ` ` "), r" `` "
    STUPID_QUOTES_2 = re.compile(r" ' ' "), r" '' "

    # Don't tokenize period unless it ends the line and that it isn't
    # preceded by another period, e.g.
    # "something ..." -> "something ..."
    # "something." -> "something ."
    FINAL_PERIOD_1 = re.compile(r"(?<!\.)\.$"), r" ."
    # Don't tokenize period unless it ends the line eg.
    # " ... stuff." ->  "... stuff ."
    FINAL_PERIOD_2 = re.compile(r"""(?<!\.)\.\s*(["'’»›”]) *$"""), r" . \1"

    # Treat continuous commas as fake German,Czech, etc.: „
    MULTI_COMMAS = re.compile(r'(,{2,})'), r' \1 '
    # Treat continuous dashes as fake en-dash, etc.
    MULTI_DASHES = re.compile(r'(-{2,})'), r' \1 '
    # Treat multiple periods as a thing (eg. ellipsis)
    MULTI_DOTS = re.compile(r'(\.{2,})'), r' \1 '

    # This is the \p{Open_Punctuation} from Perl's perluniprops
    # see http://perldoc.perl.org/perluniprops.html
    OPEN_PUNCT = text_type(u'([{\u0f3a\u0f3c\u169b\u201a\u201e\u2045\u207d')  # (abridged here)
    # This is the \p{Close_Punctuation} from Perl's perluniprops
    CLOSE_PUNCT = text_type(u')]}\u0f3b\u0f3d\u169c\u2046\u207e\u208e\u232a')  # (abridged here)
    # This is the \p{Currency_Symbol} from Perl's perluniprops
    CURRENCY_SYM = text_type(u'$\xa2\xa3\xa4\xa5\u058f\u060b\u09f2\u09f3\u09fb')  # (abridged here)

    # Pad spaces after opening punctuations.
    OPEN_PUNCT_RE = re.compile(u'([{}])'.format(OPEN_PUNCT)), r'\1 '
    # Pad spaces before closing punctuations.
    CLOSE_PUNCT_RE = re.compile(u'([{}])'.format(CLOSE_PUNCT)), r'\1 '
    # Pad spaces after currency symbols.
    CURRENCY_SYM_RE = re.compile(u'([{}])'.format(CURRENCY_SYM)), r'\1 '

    # Use for tokenizing URL-unfriendly characters: [:/?#]
    URL_FOE_1 = re.compile(r':(?!//)'), r' : '   # in perl s{:(?!//)}{ : }g;
    URL_FOE_2 = re.compile(r'\?(?!\S)'), r' ? '  # in perl s{\?(?!\S)}{ ? }g;
    # in perl: m{://} or m{\S+\.\S+/\S+} or s{/}{ / }g;
    URL_FOE_3 = re.compile(r'(:\/\/)[\S+\.\S+\/\S+][\/]'), ' / '
    URL_FOE_4 = re.compile(r' /'), r' / '        # s{ /}{ / }g;

    # Left/Right strip, i.e. remove heading/trailing spaces.
    # These strip regexes should NOT be used,
    # instead use str.lstrip(), str.rstrip() or str.strip() 
    # (They are kept for reference purposes to the original toktok.pl code)  
    LSTRIP = re.compile(r'^ +'), ''
    RSTRIP = re.compile(r'\s+$'), '\n'
    # Merge multiple spaces.
    ONE_SPACE = re.compile(r' {2,}'), ' '

    TOKTOK_REGEXES = [NON_BREAKING, FUNKY_PUNCT_1,
                      URL_FOE_1, URL_FOE_2, URL_FOE_3, URL_FOE_4,
                      AMPERCENT, TAB, PIPE,
                      OPEN_PUNCT_RE, CLOSE_PUNCT_RE,
                      MULTI_COMMAS, COMMA_IN_NUM, FINAL_PERIOD_2,
                      PROB_SINGLE_QUOTES, STUPID_QUOTES_1, STUPID_QUOTES_2,
                      CURRENCY_SYM_RE, EN_EM_DASHES, MULTI_DASHES, MULTI_DOTS,
                      FINAL_PERIOD_1, FINAL_PERIOD_2, ONE_SPACE]

    def tokenize(self, text, return_str=False):
        text = text_type(text)  # Converts input string into unicode.
        for regexp, substitution in self.TOKTOK_REGEXES:
            text = regexp.sub(substitution, text)
        # Finally, strips heading and trailing spaces
        # and converts output string into unicode.
        text = text_type(text.strip())
        return text if return_str else text.split()

Is there a way to make the substitutions faster? E.g.

  • Combine the chain of regexes into one super regex.
  • Combine some of the regexes
  • Coding it in Cython (but Cython regexes are slow, no?)
  • Running the regex substitution in Julia and wrapping Julia code in Python
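One concrete way to try the "super regex" idea is to merge the simple literal-replacement rules into a single alternation with a callback, so the text is scanned once instead of once per rule. This is only a sketch under the assumption that the merged rules are plain literal substitutions; look-around rules like COMMA_IN_NUM would still need their own passes:

```python
import re

# Hypothetical subset of the tokenizer's literal rules, merged into one pass.
rules = {
    '|': ' | ',    # PIPE
    '\t': ' \t ',  # TAB
    '–': ' – ',    # EN_EM_DASHES
    '—': ' — ',
}
combined = re.compile('|'.join(re.escape(k) for k in rules))

def sub_all(text):
    # One scan over the text; the lambda looks up the replacement per match.
    return combined.sub(lambda m: rules[m.group()], text)

print(sub_all('a|b\tc–d'))
```

Whether this wins in practice depends on how many of the chained rules can be expressed as literal replacements; profiling against the original chain is the only reliable check.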

The tokenize() function is usually called on a single input, but if the same function is called 1,000,000,000 times it's rather slow, and because of the GIL a single process handles one sentence at a time.
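For throughput rather than per-call speed, the GIL can be sidestepped with multiprocessing: each worker is a separate process with its own interpreter, so regex-heavy tokenization scales across cores. A sketch using a stand-in tokenizer (any picklable, module-level function, such as a wrapper around ToktokTokenizer().tokenize, would work the same way):

```python
from multiprocessing import Pool
import re

PUNCT = re.compile(r"([,!?])")

def simple_tokenize(text):
    # Stand-in for ToktokTokenizer().tokenize: any picklable, module-level
    # function can be mapped across worker processes.
    return PUNCT.sub(r" \1 ", text).split()

def tokenize_all(sentences, processes=4, chunksize=1000):
    # Each worker process has its own interpreter and GIL, so the regex
    # work runs on multiple cores; chunksize amortizes IPC overhead.
    with Pool(processes) as pool:
        return pool.map(simple_tokenize, sentences, chunksize=chunksize)

if __name__ == '__main__':
    print(tokenize_all(["Is 9.5 or 525,600 my favorite number?"], processes=2))
```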

The aim of the question is to ask for ways to speed up Python code that's made up of regex substitutions, especially when the tokenize() function is run 1,000,000,000+ times.

If Cython/Julia or any faster language + wrapper is suggested, it would be good if you could give one example of how a regex would be written in Cython/Julia/others and a suggestion of how the wrapper would look.

Get this bounty!!!

#StackBounty: #javascript #python #selenium Python: Unable to download with selenium in webpage

Bounty: 50

My purpose is to download a zip file from https://www.shareinvestor.com/prices/price_download_zip_file.zip?type=history_all&market=bursa

It is a link on this webpage https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa. I then want to save it into the directory "/home/vinvin/shKLSE/" (I am using PythonAnywhere), unzip it, and extract the csv file into that directory.

The code runs to the end with no error, but nothing is downloaded.
The zip file is downloaded automatically when https://www.shareinvestor.com/prices/price_download_zip_file.zip?type=history_all&market=bursa is clicked manually.

My code uses a working username and password. The real username and password are included so that it is easier to understand the problem.

    print "hello from python 2"

    import urllib2
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    from pyvirtualdisplay import Display
    import requests, zipfile, os    

    display = Display(visible=0, size=(800, 600))

    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)
    profile.set_preference('browser.download.manager.showWhenStarting', False)
    profile.set_preference('browser.download.dir', "/home/vinvin/shKLSE/")
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk', '/zip')

    for retry in range(5):
            browser = webdriver.Firefox(profile)
            print "firefox"

    login_main = browser.find_element_by_xpath("//*[@href='/user/login.html']").click()
    print browser.current_url
    username = browser.find_element_by_id("sic_login_header_username")
    password = browser.find_element_by_id("sic_login_header_password")
    print "find id done"
    print "log in done"
    login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
    print browser.current_url
    dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()


   zip_ref = zipfile.ZipFile("/home/vinvin/sh/KLSE", 'r')

HTML snippet:

<li><a href="/prices/price_download_zip_file.zip?type=history_all&amp;market=bursa">All Historical Data</a> <span>About 220 MB</span></li>

Note that &amp; is shown when I copy the snippet. It was hidden from view-source, so I guess it is written in JavaScript.

Observation I found

  1. The directory /home/vinvin/shKLSE/ does not get created even though I run the code with no error

  2. I tried to download a much smaller zip file, which can be completed in a second, but it still did not download after a wait of 30s. dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_daily&date=20170519&market=bursa']").click()
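One thing worth double-checking (an assumption on my part, not something confirmed in the post): Firefox's browser.helperApps.neverAsk.saveToDisk preference expects full MIME types, so '/zip' would not match anything. A sketch of the preference values commonly used for silent zip downloads:

```python
# Hypothetical preference map; in the original code these are applied with
# profile.set_preference(key, value).
download_prefs = {
    'browser.download.folderList': 2,  # 2 = use the custom directory below
    'browser.download.dir': '/home/vinvin/shKLSE/',
    # Full MIME types are required here; '/zip' alone matches nothing.
    'browser.helperApps.neverAsk.saveToDisk':
        'application/zip,application/octet-stream',
}
```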


Get this bounty!!!

#StackBounty: #python #python-3.x #inheritance #gal List all attributes which are inherited by a class

Bounty: 50

Hi, I have the code below, with which I am trying to pull data from Outlook, using code obtained on StackOverflow.

Using the first loop, I am trying to gather all attributes available to the object.

Whilst running it I noticed the absence of Name, which is later called in the 2nd loop; I assume this is due to inheritance. Please can you assist me in finding all attributes available to a class?

import win32com.client,sys

o = win32com.client.gencache.EnsureDispatch("Outlook.Application")
ns = o.GetNamespace("MAPI")

adrLi = ns.AddressLists.Item("Global Address List")
contacts = adrLi.AddressEntries
numEntries = adrLi.AddressEntries.Count
nameAliasDict = {}
attrs_ = dir(contacts)
for i in range(len(attrs_)):

for j in contacts:

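For a plain Python class, dir() already includes inherited attributes; what it doesn't tell you is which class defines each one. Walking the MRO with the inspect module shows that (a sketch with toy classes rather than the win32com objects from the post, though gencache-generated classes can be inspected the same way):

```python
import inspect

class Base:
    def name(self):
        return "base"

class Child(Base):
    def extra(self):
        return "extra"

def attributes_by_class(cls):
    # Map each attribute to the first class in the MRO that defines it.
    found = {}
    for klass in inspect.getmro(cls):
        for attr in vars(klass):
            found.setdefault(attr, klass.__name__)
    return found

owners = attributes_by_class(Child)
```

owners maps each attribute name to the class that supplies it, so inherited names (like Name in the Outlook case) should show up under a base class.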

Get this bounty!!!

#StackBounty: #scripting #python #ssh-tunneling Advance autologin via 2 jumphost/passphrase

Bounty: 50

I need to find a way to auto-login to a remote machine. There are several ways to do this, but this one is a little tricky for me.

Auto-login to a remote machine, execute a command or script, and redirect the output to a file on the local system.

ssh remote-host < ./script >> storageinfo_$date.txt

But the hard part is that we can't directly connect to the remote host; we need to first connect to Jumphost1 –> Jumphost2 –> and then –> remote-host

The Jumphosts have SSH keys set up, but the key has a passphrase, e.g. userpass.
The remote-host does not use key-based login and needs a password, e.g. remotepass.

We used to do this with the .ssh/config file in the manner below. This was successful so far in the test env, but we are not supposed to install expect in the live env.

# cat .ssh/config

Host jump1-*
    User ldap-user
    IdentityFile ~/.ssh/id_rsa
    ForwardAgent yes
    ServerAliveInterval 60
    ServerAliveCountMax 12

Host jump1-centos01-temporary 
    Port 2222

Host jump1-centos01        
    Port 22
    ProxyCommand ssh -W %h:%p jump1-centos01-temporary

Host remote-host
    ProxyCommand ssh -W %h:%p jump1-centos01
    User root
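As a sketch of how the same two-hop chain could be driven from Python without expect (assuming a reasonably recent OpenSSH client, whose -J/ProxyJump flag is shorthand for nested ProxyCommand entries):

```python
import subprocess

def build_ssh_command(host, jumps):
    # -J chains jump hosts left to right, equivalent to the nested
    # ProxyCommand entries in .ssh/config above.
    return ['ssh', '-J', ','.join(jumps), host]

def run_remote(script_path, host='remote-host',
               jumps=('jump1-centos01-temporary', 'jump1-centos01')):
    # Hypothetical helper: feed a local script to the remote shell and
    # capture its output. Key passphrases are better handled by ssh-agent
    # than by expect.
    with open(script_path, 'rb') as f:
        return subprocess.run(build_ssh_command(host, list(jumps)),
                              stdin=f, capture_output=True)
```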

SSH connection with expect and send patterns:

# cat expect.sh 

#!/usr/bin/env expect
set timeout 7
set date [exec date "+%d-%B-%Y"]

spawn sh -c "ssh va1ap-vsns0001n < ./isi.py > storageinfo_$date.txt"
expect "Enter passphrase for key '/root/.ssh/id_rsa':"
send "\r"
expect "Enter passphrase for key '/root/.ssh/id_rsa':"
send "userpass\r"
expect "Enter passphrase for key '/root/.ssh/id_rsa':"
send "userpass\r"
expect "Password:"
send "remotepass\r"

Get this bounty!!!

#StackBounty: #machine-learning #python #ranking From pairwise comparisons to ranking – python

Bounty: 50

I have to solve a ranking ML issue. To start with, I have successfully applied the pointwise ranking approach.

Now, I’m playing around with pairwise ranking algorithms. I’ve created the pairwise probabilities (i.e. the probability of item i being ranked above item j), but I’m not sure how I can transform these into a ranking.

For the historical data (let’s assume these are queries), I have their pairwise probabilities AND the actual (ideal) ranking. I want a solution that will provide a ranking for a new query as well (i.e. the ideal ranking is what I’m looking for here).

Any python package that has, at least partially, the functionality I’m looking for?
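One simple baseline (my suggestion, not something from the post) is a Borda-style score: rank items by the sum of their pairwise win probabilities. It won't recover the optimal ranking in every case, but it gives a concrete starting point:

```python
import numpy as np

def rank_from_pairwise(P):
    # P[i, j] = probability that item i is ranked above item j.
    scores = P.sum(axis=1)                        # total "wins" per item
    return [int(i) for i in np.argsort(-scores)]  # best first

# Toy example: item 0 beats everyone, item 1 beats item 2.
P = np.array([[0.0, 0.8, 0.9],
              [0.2, 0.0, 0.6],
              [0.1, 0.4, 0.0]])
print(rank_from_pairwise(P))
```

For something more principled, Bradley-Terry style models fit latent item scores to the pairwise probabilities and then sort by those scores.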

Get this bounty!!!

#StackBounty: #python #similarities #tf-idf #latent-semantic-indexing #bag-of-words Online Document Similarity (LSI/WMD)

Bounty: 100

I’m running a gensim-based LSI similarity model, which needs to be rebuilt every time a new entry is added to the corpus. Since these additions are fairly common (the target is to reach multiple additions per minute) I would like to explore online options.

Is there an option in LSI for an incremental learning model? Would using WMD and replacing new entries into the dictionary be more efficient? Right now my issue is that WMD takes a lot of memory, but I’m willing to sacrifice upfront cost if I can get better per-query performance, as I ultimately aim to include this in a fast-responding API.

Currently building as (and please excuse the naming conventions):

def build_cache(self):
    self.MODEL_CACHE = {
        'urls': [],
        'texts': []
    }

    all_articles = [retrieve from database]
    for article in all_articles:
        # self.preprocess(text) runs nltk's word_tokenize and [if not in stop_words]
        self.MODEL_CACHE['texts'].append(self.preprocess(article))

    # all the imports below are from gensim, gensim.models, gensim.similarities
    self.MODEL_CACHE['dictionary'] = corpora.HashDictionary(self.MODEL_CACHE['texts'])
    self.MODEL_CACHE['corpus_gensim'] = [self.MODEL_CACHE['dictionary'].doc2bow(doc) for doc in self.MODEL_CACHE['texts']]

    self.MODEL_CACHE['corpus_tfidf'] = TfidfModel(self.MODEL_CACHE['corpus_gensim'])[self.MODEL_CACHE['corpus_gensim']]

    self.MODEL_CACHE['lsi'] = LsiModel(self.MODEL_CACHE['corpus_tfidf'], id2word=self.MODEL_CACHE['dictionary'], num_topics=100)
    self.MODEL_CACHE['lsi_index'] = MatrixSimilarity(self.MODEL_CACHE['lsi'][self.MODEL_CACHE['corpus_tfidf']])

    self.MODEL_CACHE['results'] = [self.MODEL_CACHE['lsi_index'][self.MODEL_CACHE['lsi'][self.MODEL_CACHE['corpus_tfidf'][i]]]
                                   for i in range(len(self.MODEL_CACHE['texts']))]

Most of what I’m doing is pretty close and inspired by:


If there’s a more efficient high-performance docsim implementation out there, I’d love some pointers, I haven’t had much luck with Keras or its two backends.

Get this bounty!!!

#StackBounty: #java #python #raspberry-pi #video #opencv Streaming H264 video from PiCamera to a JavaFX ImageView

Bounty: 50

I’m currently working on a robotics application where a video feed is being displayed from a Raspberry Pi 3.

I’ve been working on a way to stream the video directly into JavaFX (the rest of the UI is created in this); however, my knowledge of video streaming is very limited. The goal for the video system is to maintain decent video quality and FPS while reducing latency as much as possible (looking for sub 100 ms). H264 video was chosen as the format for its speed, but I hear that sending raw video could be faster as there is no compression (I could not get raw video to work well at all).

Running my code, I am able to stream from a Pi camera with about 120-130 ms of latency at ~48 frames per second. I would like to continue reducing the latency of this application, and I want to make sure that I’m making decisions for the correct reasons.

The largest issue I have so far is start-up time; it takes about 15-20 seconds for the video to initially launch and catch up to the latest frame.


The following code is an MCVE of the video system. If anyone is interested in reproducing this, you can get it running on a Raspberry Pi (mine is a Raspberry Pi 3) with python-picamera installed. You’ll also need a Java Client with JavaCV installed. My version info is org.bytedeco:javacv-platform:1.3.2.

Python side:

We decided to use a Python library to control the video stream because it provides a nice wrapper around the camera's command-line tools. The output from the video is sent over a TCP connection and received by a Java client. (The way we remotely launch this application has been left out of the review because I just wanted this post to focus on the video aspects.)

import picamera
import socket
import signal
import sys

with picamera.PiCamera() as camera:
    camera.resolution = (1296, 720)
    camera.framerate = 48

    soc = socket.socket()
    soc.connect((sys.argv[1], int(sys.argv[2])))
    file = soc.makefile('wb')
    camera.start_recording(file, format='h264', intra_period=0,
                           quality=0, bitrate=25000000)

    while True:
        camera.wait_recording(1)
Why did I choose these values?:

  • camera.resolution = (1296, 720), camera.framerate = 48 were the largest images I could output at a frame-rate fast enough to reduce latency.
  • intra_period=0 Wanted the images to remain small, and by setting the intra_period to zero, no I frames/full frames (apart from the first frame) will be sent; reducing the time between frames
  • quality=0 from the docstring: Quality 0 is special and seems to be a "reasonable quality" default
  • bitrate=25000000 Wanted to set the bitrate as high as possible to not slow down video transfer when lots of changes in the frames (when P frames/partial frames become large)

Java side:

The Java decoder was written using JavaCV and sends the TCP H264 stream into an FFmpegFrameGrabber. The decoder then converts the Frame into a BufferedImage, and then into a WritableImage for JavaFX.

public class FFmpegFXImageDecoder {
    private FFmpegFXImageDecoder() { }

    public static void streamToImageView(
        final ImageView view,
        final int port,
        final int socketBacklog,
        final String format,
        final double frameRate,
        final int bitrate,
        final String preset,
        final int numBuffers
    ) {
        try (final ServerSocket server = new ServerSocket(port, socketBacklog);
             final Socket clientSocket = server.accept();
             final FrameGrabber grabber = new FFmpegFrameGrabber(clientSocket.getInputStream())
        ) {
            final Java2DFrameConverter converter = new Java2DFrameConverter();
            grabber.setFormat(format);
            grabber.setFrameRate(frameRate);
            grabber.setVideoBitrate(bitrate);
            grabber.setVideoOption("preset", preset);
            grabber.setNumBuffers(numBuffers);
            grabber.start();
            while (!Thread.interrupted()) {
                final Frame frame = grabber.grab();
                if (frame != null) {
                    final BufferedImage bufferedImage = converter.convert(frame);
                    if (bufferedImage != null) {
                        Platform.runLater(() ->
                            view.setImage(SwingFXUtils.toFXImage(bufferedImage, null)));
                    }
                }
            }
        } catch (final IOException e) {
            e.printStackTrace();
        }
    }
}
This can then be placed into a JavaFX view like below:

public class TestApplication extends Application {

    static final int WIDTH = 1296;

    static final int HEIGHT = 720;

    @Override
    public void start(final Stage primaryStage) throws Exception {
        final ImageView imageView = new ImageView();
        final BorderPane borderPane = new BorderPane();

        borderPane.setCenter(imageView);
        borderPane.setPrefSize(WIDTH, HEIGHT);

        final Scene scene = new Scene(borderPane);
        primaryStage.setScene(scene);
        primaryStage.show();

        new Thread(() -> FFmpegFXImageDecoder.streamToImageView(
            imageView, 12345, 100, "h264", 96, 25000000, "ultrafast", 0)
        ).start();
    }
}
Why did I choose these values?:

  • frameRate=96 Wanted the framerate of the Client to be twice the speed of the stream such that I’m not waiting on frames
  • bitrate=25000000 to match the stream
  • VideoOption preset="ultrafast" To try and reduce the startup time for the stream.

Final Questions:

What are some ways I improve the latency of this system?

How can I reduce the start-up time of this stream? It currently takes about 15 seconds to launch and catch up.

Are the parameters chosen for JavaCV and PiCamera logical? Is my understanding of them correct?

Get this bounty!!!

#StackBounty: #python #pdf #pdfminer #pdf-parsing PDFminer empty output

Bounty: 100

While processing a file with pdfminer (pdf2txt.py) I received empty output:

dan@work:~/project$ pdf2txt.py  docs/homericaeast.pdf 


Can anybody say what is wrong with this file and what I can do to get data from it?

Here’s dumppdf.py docs/homericaeast.pdf output:

<dict size="4">
<value><ref id="2" /></value>
<value><ref id="1" /></value>
<value><list size="2">
<string size="16">on
<string size="16">on

<dict size="4">
<value><ref id="2" /></value>
<value><ref id="1" /></value>
<value><list size="2">
<string size="16">on
<string size="16">on

Get this bounty!!!

#StackBounty: #python #indexing #tuples #pairwise Finding index of pairwise elements

Bounty: 50

Given the target ('b', 'a') and the inputs:

x0 = ('b', 'a', 'z', 'z')
x1 = ('b', 'a', 'z', 'z')
x2 = ('z', 'z', 'a', 'a')
x3 = ('z', 'b', 'a', 'a')

The aim is to find the location of the consecutive ('b', 'a') elements and get the output:

>>> find_ba(x0)
0
>>> find_ba(x1)
0
>>> find_ba(x2)
>>> find_ba(x3)
1

Using the pairwise recipe:

from itertools import tee
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

I could do this to get the desired output:

def find_ba(x, target=('b', 'a')):
    try:
        return next(i for i, pair in enumerate(pairwise(x)) if pair == target)
    except StopIteration:
        return None

But that would require me to loop through all the pairs of characters until I find the first instance. Is there a way to find the index of pairwise elements without looping through all the characters?
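One way to avoid the Python-level loop over every pair (a hypothetical helper, not from the post) is to let tuple.index do the scanning in C and only check the element that follows each candidate:

```python
def find_ba_index(x, target=('b', 'a')):
    first, second = target
    start = 0
    while True:
        try:
            # tuple.index scans in C, skipping non-matching elements quickly.
            i = x.index(first, start)
        except ValueError:
            return None  # no more candidates
        if i + 1 < len(x) and x[i + 1] == second:
            return i
        start = i + 1  # candidate didn't pair up; keep looking
```

This doesn't apply directly to generators, since index needs random access; the pairwise approach remains the generator-friendly fallback.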

Answering @MatthiasFripp’s question in the comments:

Are your elements in lists or types (as shown) or in a generator (e.g. reading from a file handle)?

The x* are all tuples of strings, so they can be accessed through their indices. But if the answer/solution can work for both tuples and generators, that’ll be great!

Can you say about how many lists you have to search and about how long they are? That would help for suggesting a search strategy.

The lengths of the tuples are not fixed. They can be of size > 2.

Get this bounty!!!

#StackBounty: #python #python-3.6 ModuleNotFoundError: No module named x

Bounty: 50

This is the first time I’ve really sat down and tried Python 3, and I seem to be failing miserably. I have the following two files:


config.py has a few functions defined in it as well as a few variables. I’ve stripped it down to the following:

(The stripped-down config.py was shown as an image in the original post.)

However, I’m getting the following error:

ModuleNotFoundError: No module named ‘config’

I’m aware that the py3 convention is to use explicit relative imports: from . import config. However, this leads to the following error:

ImportError: cannot import name ‘config’

So I’m at a loss as to what to do here… Any help is greatly appreciated. 🙂
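Since the stripped-down files aren't visible here, a generic sketch of what's going on: import config succeeds only if the directory containing config.py is on sys.path, and from . import config only works when the file is imported as part of a package rather than run directly. The sys.path side can be demonstrated in isolation:

```python
import os
import sys
import tempfile

# Sketch: the interpreter resolves 'import config' by searching sys.path.
# If the directory holding config.py is not on it, you get
# ModuleNotFoundError; adding the directory fixes the lookup.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, 'config.py'), 'w') as f:
    f.write('DEBUG = True\n')

sys.path.insert(0, tmp)  # now 'config' is importable
import config
print(config.DEBUG)
```

When a script is run directly, its own directory is prepended to sys.path, which is why a plain import config usually works for two files side by side; inside a package imported from elsewhere, the relative form from . import config is needed instead.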

Get this bounty!!!