#StackBounty: #algorithms #sorting #strings #binary Partitioning through block moves to the end

Bounty: 50

Suppose we have a binary string $s$. We wish to partition this string in a series of $0$s followed by $1$s (alternatively: we wish to sort), using only one operation: moving three consecutive elements to the end.

E.g. if our string is $ABCDEFGH$ we can do our operation at index $1$ to get $DEFGHABC$. If we did it at index $4$ we’d get $ABCGHDEF$.

I’m interested in optimal solutions – that is with the least amount of moves. I have a working algorithm using IDA* with heuristic “# of groups of ones with length $leq 3$ before the final zero”, with the logic that each such a group requires at least one move to fix. An example optimal solution (where ^ indicates where a block was chosen to move to the end):

1101101011101011010010010011                               
                       ^                                   
1101101011101011010010011100                               
                     ^                                     
1101101011101011010011100001                               
              ^                                            
1101101011101010011100001110                               
           ^                                               
1101101011110011100001110010                               
                        ^                                  
1101101011110011100001110001                               
     ^                                                     
1101111110011100001110001010                               
  ^                                                        
1111110011100001110001010011                               
                     ^                                     
1111110011100001110000011101                               
                       ^                                   
1111110011100001110000001111                               
               ^                                           
1111110011100000000001111111                               
        ^                                                  
1111110000000000001111111111                               
   ^                                                       
1110000000000001111111111111                               
^                                                          
0000000000001111111111111111 

However, this algorithm is exponential and not feasible for larger strings. After studying quite some optimal solutions, especially tough ones like above, I can’t immediately think of an optimal algorithm. Moves can be quite non-trivial.

Is there a feasible optimal algorithm? Or is this problem hard?


Get this bounty!!!

#StackBounty: #algorithms #strings #string-metrics #edit-distance Edit distance of list with unique elements

Bounty: 50

Levenshtein-Distance edit distance between lists
is a well studied problem.
But I can’t find much on possible improvements if
it is known that no element does occurs more than once in each list.

Let’s also assume that the elements are comparable/sortable
(but the lists to compare are not sorted to begin with).

In particular I am interested
if the uniqueness of elements makes it possible to improve upon
Ukkonen’s algorithm for edit distance
which has time complexity $O(min(m,n)s)$ and
space complexity $O(min(s,m,n)s)$,
where $s$ is the minimum cost of the editing steps.

More formally,

how efficiently can we compute the edit distance between two given strings
$s,t in Sigma^*$
with the promise that they don’t have any repeated letters?

$Sigma$ is a very large alphabet.


Get this bounty!!!

#StackBounty: #c #strings #io fgets() Alternative

Bounty: 100

size_t readline_tostring(char * restrict dest, size_t size, FILE * restrict stream)

fgets() is OK yet it has some short-comings that the following readline_tostring() addresses for reading a line:

  1. When the buffer is insufficient, the rest of the line is consumed (and lost). An error is indicated.

  2. In C, a line of input is up to and including the 'n‘ C11 7.21.2 2. When the streams ends with something other than a new-line, how that is handled is implementation defined behavior. J.3.12. This code treats a 'n‘ and end-of-file as the same. In both cases, a 'n' is not include in the saved buffer.

  3. If code reads a '', that is not practical to discern with fgets(). This code returns the size of the space used in dest, which includes an appended null character.

  4. Lesser issues include fgets() handling of NULL augments, small buffer size, undefined buffer state on ferror() and use of int vs size_t. The below code also clearly – I hope – handles that.

  5. An alternative allocate memory: Allocating memory per external input can lead to abuse. This allows external forces to overwhelm memory allocation. The following does not use memory allocation like getline substitute that will enforce 'n' as limit of characters read or getline.
    Another alternative could use limited allocation, but that is not done here.

Primary review requests (of the non-test code)

Portability concerns: Might a common or rare case fail on some select systems?

Handling of exceptional/error cases: Any suggested alternates?

Performance concerns are appreciated when they are backed with real measurements.

General comments (on any code).


The code below is listed as one file for code review convenience, yet would usually would be is separate .h, .c files.

/////////////////////////////////////////////////////////////////
// Header info, usually in some *.h file

/*
 * Read a _line_ of text. Save characters up to a limit, and form a _string_.
 * The string saved in `dest` never contains a 'n'.
 * A null character is always appended ***1.
 * Reading only attempted in non-pathological cases.  
 * Otherwise the end-of-file flag and error flags are cleared before reading.
 *
 * Normal: The return value is greater than  0 and represents the _size_ of `dest` used.
 *     This includes all non-'n' characters read and an appended null character. ***2
 *     Reading text "abcn" forms string "abc" and return 4
 *
 * Exceptional cases:
 *   In these cases, the return value is 0 and `dest[0] = ''` except as noted.
 *   1: Pathological: Buffer invalid for string.
 *     `dest == NULL` or `size == 0` (No data is written into `dest`)
 *   2: Pathological: Stream invalid.
 *     `stream == NULL`
 *   3: End-of-file occurs and no data read.
 *     Typical end-of-file: `feof(stream)` will return true.
 *   4: Input error.
 *     `ferror(stream)` will return true.
 *     strlen(dest) is number of characters successfully read. ***3
 *   5: Buffer is too small.
 *     First `size-1` non-'n' characters are saved in `dest[]`.
 *     Additional characters are read up to and including 'n'.  These are not saved.
 *     The end-of-file flag and error flags are cleared again.
 *     strlen(dest) is number of characters successfully save. ***3
 *
 * ***1 Except when `dest == NULL` or `size == 0`
 * ***2 If code reads a null character, it is treated like any non-'n' character.
 * ***3 strlen(dest) does not reflect the number of characters in `dest` 
 *       if a null character was read and saved.
 *
 */

#include <stdio.h>
#include <stdlib.h>
size_t readline_tostring(char * restrict dest, size_t size,
    FILE * restrict stream);

/////////////////////////////////////////////////////////////////
// Code, usually in some *.c file

size_t readline_tostring(char * restrict dest, size_t size,
    FILE * restrict stream) {
  // Handle pathological cases
  if (dest == NULL || size == 0) {
    return 0;
  }
  if (stream == NULL) {
    dest[0] = '';
    return 0;
  }
  clearerr(stream);

  size_t i = 0;
  int ch;
  while ((ch = fgetc(stream)) != 'n' && ch != EOF) {
    if (i < size) {
      dest[i++] = (char) ch;
    }
  }

  // Add null character termination - always
  // If too many were read
  if (i >= size) {
    dest[size - 1] = '';
    clearerr(stream);
    return 0;
  }
  dest[i] = '';

  if ((ch == EOF) && (i == 0 || ferror(stream))) { // end-of-file or error
    return 0;
  }

  clearerr(stream);
  return i + 1;
}

/////////////////////////////////////////////////////////////////
// Test code

#include <string.h>
#include <ctype.h>

// Sample usage showing how to discern the results.
void sample(char * restrict dest, size_t size, FILE * restrict stream) {
  size_t sz;
  while ((sz = readline_tostring(dest, size, stream)) > 0) {
    printf("Size:%zu string:"%s"n", sz, dest);
  }

  // Well formed code need not perform this 1st test
  if (dest == NULL || size == 0 || stream == NULL) {
    puts("Pathological case");
  } else if (feof(stream)) {
    puts("End of file");
  } else if (ferror(stream)) {
    puts("Input error");
  } else {
    printf("Line too long: begins with <%s>n", dest);
  }
  puts("");
}

void test4(const char *s) {
  FILE *stream = fopen("tmp.bin", "wb");
  size_t len = strlen(s);
  fwrite(s, 1, len, stream);
  fclose(stream);
  for (size_t i = 0; i < len; i++) {
    printf(isprint((unsigned char)s[i]) ? "%c" : "<%d>", s[i]);
  }
  puts("");

  stream = fopen("tmp.bin", "r");
  char buf[4];
  sample(buf, sizeof buf, stream);
  fclose(stream);
  fflush(stdout);
}

int main(void) {
  test4("12nABn");
  test4("123nABCn");
  test4("1234nABCDn");
  test4("");
  test4("1");
  test4("12");
  test4("123");
  test4("1234");
  return 0;
}

Output

12<10>AB<10>
Size:3 string:"12"
Size:3 string:"AB"
End of file

123<10>ABC<10>
Size:4 string:"123"
Size:4 string:"ABC"
End of file

1234<10>ABCD<10>
Line too long: begins with <123>


End of file

1
Size:2 string:"1"
End of file

12
Size:3 string:"12"
End of file

123
Size:4 string:"123"
End of file

1234
Line too long: begins with <123>


Get this bounty!!!

#StackBounty: #python #performance #algorithm #strings #search Naive implementation of KMP algorithm

Bounty: 50

After reading this answer to the question “High execution time to count overlapping substrings”, I decided to implement the suggested Knuth-Morris-Pratt (KMP) algorithm. I used the pseudo-code listed on Wikipedia for the functions kmp_table and kmp_search.

However, when running it on some corner-cases, I have observed that it is a lot slower than the standard str.find, which apparently uses a modified Boyer-Moore-Horspool algorithm and should thus have worse worst-case performance.

The specific case I looked at is:

$ ipython -i kmp.py
In [1]: text = "A"*1000000 + "B"
In [2]: word = "A"*100 + "B"
In [3]: %timeit kmp_search(text, word)
1 loop, best of 3: 410 ms per loop
In [4}: %timeit text.find(word)
1000 loops, best of 3: 703 µs per loop

So the difference is about a factor 1000 for this input. This is probably due to the fact that the native one is written in C and this is written in Python, but I still wanted to see if I did anything stupid here or missed any obvious optimization.

def kmp_table(word):
    table = [0] * len(word)
    position, candidate = 2, 0
    table[0] = -1

    while position < len(word):
        if word[position - 1] == word[candidate]:
            table[position] = candidate + 1
            candidate += 1
            position += 1
        elif candidate > 0:
            candidate = table[candidate]
        else:
            table[position] = 0
            position += 1
    return table


def kmp_search(text, word):
    m, i = 0, 0
    table = kmp_table(word)
    while m + i < len(text):
        if word[i] == text[m + i]:
            if i == len(word) - 1:
                return m
            i += 1
        else:
            if table[i] > -1:
                m += i - table[i]
                i = table[i]
            else:
                m += 1
                i = 0
    return len(text)


Get this bounty!!!

Convert Comma separated String to Rows in Oracle SQL

Many times we need to convert a comma separated list of terms in a single string and convert it rows in SQL query.

for example

 India, USA, Russia, Malaysia, Mexico

Needs to be converted to:

 Country
 India
 USA
 Russia
 Malaysia
 Mexico

The following SQL script can help in this:

just replace the required values with your string and your delimiter.

HackerRank: Repeated String

Problem

Lilah has a string, s, of lowercase English letters that she repeated infinitely many times.

Given an integer, n, find and print the number of letter a‘s in the first letters of Lilah’s infinite string.

Input Format

The first line contains a single string, s.
The second line contains an integer, n.

Constraints

  • 1<=|s|<=100
  • 1<=|n|<=10^12
  • For 25% of the test cases, n <= 10^6

Output Format

Print a single integer denoting the number of letter a’s in the first letters of the infinite string created by repeating infinitely many times.

Sample Input 0

aba
10

Sample Output 0

7

Explanation 0

The first n = 10 letters of the infinite string are abaabaabaa. Because there are 7 a‘s, we print on a new line.

Sample Input 1

a
1000000000000

Sample Output 1

1000000000000

Explanation 1

Because all of the first n=1000000000000 letters of the infinite string are a, we print 1000000000000 on a new line.

Solution

CodeEval: Penultimate Word

Challenge Description:

Write a program which finds the next-to-last word in a string.

Input Sample:

Your program should accept as its first argument a path to a filename. Input example is the following

some line with text
another line

Each line has more than one word.

Output Sample:

Print the next-to-last word in the following way.

with
another

Solution:

 

CodeEval: Shortest Repetition

Challenge Description:

Write a program to determine the shortest repetition in a string.
A string is said to have period p if it can be formed by concatenating one or more repetitions of another string of length p. For example, the string “xyzxyzxyzxyz” has period 3, since it is formed by 4 repetitions of the string “xyz”. It also has periods 6 (two repetitions of “xyzxyz”) and 12 (one repetition of “xyzxyzxyzxyz”).

Input Sample:

Your program should accept as its first argument a path to a filename. Each line will contain a string of up to 80 non-blank characters. E.g.

abcabcabcabc
bcbcbcbcbcbcbcbcbcbcbcbcbcbc
dddddddddddddddddddd
adcdefg

Output Sample:

Print out the smallest period of the input string. E.g.

3
2
1
7

Solution:

 

HackerRank: Alternating Characters

Problem

Shashank likes strings in which consecutive characters are different. For example, he likes ABABA, while he doesn’t like ABAA. Given a string containing characters A and B only, he wants to change it into a string he likes. To do this, he is allowed to delete the characters in the string.

Your task is to find the minimum number of required deletions.

Input Format

The first line contains an integer T, i.e. the number of test cases.
The next T lines contain a string each.

Output Format

For each test case, print the minimum number of deletions required.

Constraints

1T10
1≤ length of string 10^5

Sample Input

5
AAAA
BBBBB
ABABABAB
BABABA
AAABBB

Sample Output

3
4
0
0
4

Explanation

AAAA A, 3 deletions
BBBBB B, 4 deletions
ABABABAB ABABABAB, 0 deletions
BABABA BABABA, 0 deletions
AAABBB AB, 4 deletions because to convert it to AB we need to delete 2 A’s and 2 B’s

Solution

Combinations of a String

Problem:
Write an algorithm to print all possible combinations of characters in a string.

Solution:
Since we need to generate combinations, we can start with a single character and then continue to add a character to combinations we have seen so far.

Let’s use “XYZ” as an example.

public void buildTree(String input, StringBuffer output, int k)
{
    for (int i = k; i < input.length(); i++)
    {
        output.append(input.charAt(i));
        buildTree(input, output, i + 1);
        output.deleteCharAt(output.length() - 1);
    }
} 

public static void main(String[] args){
     buildTree("XYZ", new StringBuffer(), 0);
}

Dry Run just to give an idea:

X–>Y–>Z
–>Z–>Y

Y–>X–>Z
–>Z–>X

Z–>Y–>X
–>X–>Y