#StackBounty: #l3regex Understanding assertion matching in l3regex

Bounty: 50

This question is a bit specific and beyond the official interface of the l3regex library, but I’m trying to understand how exactly the library works here.

Given the following example document:



cs_set:Npn prepare_regex:N #1 {
    cs_set_nopar:Npn l_tmpa_regex {
        __regex_branch:n {
            __regex_assertion:Nn c_true_bool { char`^ __regex_anchor:N l__regex_min_pos_int }
            __regex_class:NnnnN  c_true_bool {__regex_item_caseful_equal:n {97}}{1}{-1} #1
            __regex_assertion:Nn c_true_bool { X __regex_break_true:w }
            __regex_class:NnnnN  c_true_bool { use:c{__regex_prop_.:} }{0}{-1} c_false_bool
            __regex_assertion:Nn c_true_bool { char`$ __regex_anchor:N l__regex_max_pos_int }

cs_set:Npn test_match:n #1 {
    regex_extract_once:NnNTF l_tmpa_regex {#1} l_tmpa_seq {}{}

greedy~match: par
prepare_regex:N c_false_bool
test_match:n { ab }
test_match:n { aab }
test_match:n { aaab }
test_match:n { aaaa }

lazy~match: par
prepare_regex:N c_true_bool
test_match:n { ab }
test_match:n { aab }
test_match:n { aaab }
test_match:n { aaaa }


prepare_regex:N just assigns a regex to l_tmpa_regex that is equivalent to the compiled form of ^a+.*$ or ^a+?.*$, depending on the passed parameter (greedy or lazy matching). Additionally, an extra assertion that always succeeds was added before the .* part. To see the order in that assertions are executed internally, I added some characters that will be output when the matching happens. The output is

enter image description here

There are several things here, I don’t understand:

  • Why are all three assertions matched over and over again, when the ^ assertion, in my opinion, should only be executed at the beginning, and $ only at the end of the matching process. Perhaps $ would be executed several times to determine the end of the match, but ^ definitely not.
  • Why isn’t the X assertion executed only once per matching when the “boundary” from the series of as is crossed to the bs? It would make some kind of sense, if backtracking (or whatever equivalent l3regex uses for that) would happen, but that’s not the case, neither in the greedy nor in the lazy version.
  • Why is there a single ^$ at the end of all matchings? This makes no sense at all to me, because the regex doesn’t even allow ^ followed by $ without an X assertion in between.
  • The lazy matching adds an X for each additional a in the string, but strangely not for the last one. Why?

Luckily l3regex provides some debug output that lets us have a closer look at what’s going on. If we load expl3 with [enable-debug] and use the following snippet with my above example

prepare_regex:N c_false_bool
int_set:Nn g__regex_trace_regex_int { 100 }
% Recall this tries to match "^a+.*$" greedily
test_match:n { ab }
int_set:Nn g__regex_trace_regex_int { 0 }

we get a debug output similar to the following one (also note the typeouts added to the regex):

Trace: entering __regex_build:N
Trace: regex new state L=0-> R=0-> M=1-> 2
Trace: regex new state L=0-> R=1-> M=2-> 3
Trace: entering __regex_group_aux:nnnnN
Trace: regex new state L=1-> R=2-> M=3-> 4
Trace: entering __regex_branch:n
Trace: regex new state L=2-> R=3-> M=4-> 5
Trace: regex new state L=2-> R=4-> M=5-> 6
Trace: regex new state L=4-> R=5-> M=6-> 7
Trace: regex new state L=5-> R=6-> M=7-> 8
Trace: regex new state L=6-> R=7-> M=8-> 9
Trace: regex new state L=7-> R=8-> M=9-> 10
Trace: regex new state L=8-> R=9-> M=10-> 11
Trace: leaving __regex_branch:n
Trace: regex new state L=2-> R=3-> M=11-> 12
Trace: leaving __regex_group_aux:nnnnN
Trace: toks1={__regex_action_start_wildcard: }
Trace: toks2={__regex_action_submatch:n {0<}__regex_action_free:n {2}}
Trace: toks3={__regex_action_submatch:n {0>}__regex_action_free:n {8}}
Trace: toks4={typeout {^}char `^__regex_anchor:N l__regex_min_pos_int
               __regex_break_point:TF {__regex_action_free:n {1}}{}}
Trace: toks5={__regex_item_caseful_equal:n {97}
               __regex_break_point:TF {__regex_action_cost:n {1}}{}}
Trace: toks6={__regex_action_free:n {-1}__regex_action_free:n {1}}
Trace: toks7={typeout {X}X__regex_break_true:w
               __regex_break_point:TF {__regex_action_free:n {1}}{}}
Trace: toks8={use:c {__regex_prop_.:}
               __regex_break_point:TF {__regex_action_cost:n {0}}{}
               __regex_action_free:n {1}}
Trace: toks9={typeout {$}char `$__regex_anchor:N l__regex_max_pos_int
               __regex_break_point:TF {__regex_action_free:n {1}}{}}
Trace: toks10={__regex_action_free:n {-7}}
Trace: toks11={__regex_action_success: }

Trace: state 1
Trace: state 2
Trace: state 4
Trace: state 5
Trace: state 6
Trace: state 5
Trace: state 7
Trace: state 8
Trace: state 9
$                <-- up to this point the output makes perfect sense to me
Trace: state 1
Trace: state 2
Trace: state 4
Trace: state 8
Trace: state 9
Trace: state 10
Trace: state 3
Trace: state 11

I do not understand how to parse the start or the output with those L, R and M state mappings. The trace of the NFA execution makes sense to me at the beginning. First it runs up to state 6 where it (because of the greedy matching) tries to go back to state 5 to match another a, which fails and thus brings us to state 9 where the end of the string has been reached.

Now comes the thing I don’t understand. State 9 ends with

__regex_break_point:TF {__regex_action_free:n {1}}{}

meaning “if the assertion matches, go one state further”, i.e. state 10, then 3, then 11 and stop. But stangely it jumps from state 9 back to state 1, checks the ^ assertion again, skips the a matching (!), checks $ again, and only then goes into the final phase.

I’m sure this is just due to my lack of understanding how the regex matching works in detail. But to me this seems at least inefficient, if not wrong, because the regex is defined to find a greedy match and should be able to do that in a single pass. I’m still puzzled here.

Get this bounty!!!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.