#StackBounty: #python #python-3.x #file #pdf How to rename PDF files using their title?

Bounty: 50

I have thousands of PDF files on my computer, named a0001.pdf through a3621.pdf. Each file contains a title, e.g. “aluminum carbonate” in a0001.pdf and “aluminum nitrate” in a0002.pdf, which I’d like to extract and use to rename the files.

I use this program to rename a file:

import os

path = r"C:\Users\YANN\Desktop..."

old = 'string 1'
new = 'string 2'

def rename(path, old, new):
    # Replace `old` with `new` in the name of every file in `path`
    for f in os.listdir(path):
        os.rename(os.path.join(path, f), os.path.join(path, f.replace(old, new)))

rename(path, old, new)

Is there a way to extract the title embedded in each PDF file and use it to rename the file?
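
Whichever library ends up reading the title, the renaming step itself has to cope with characters that are not allowed in file names and with two PDFs that share a title. The following is a minimal sketch of that step only, under stated assumptions: `title_to_filename`, `rename_by_title` and the `get_title` callable are hypothetical names introduced here, and how the title is actually read from the PDF is left to the caller.

```python
import os
import re

def title_to_filename(title, ext=".pdf", max_len=100):
    """Turn a document title into a safe file name (hypothetical helper)."""
    # Drop characters that Windows forbids in file names, then collapse whitespace
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "", title)
    cleaned = re.sub(r"\s+", " ", cleaned)
    cleaned = cleaned[:max_len].strip(" .")
    return (cleaned or "untitled") + ext

def rename_by_title(folder, get_title):
    """Rename every PDF in `folder` using the title returned by `get_title(path)`.

    `get_title` is an assumed callable supplied by the caller; files for which
    it returns None or an empty string are left untouched.
    """
    renamed = {}
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".pdf"):
            continue
        src = os.path.join(folder, name)
        title = get_title(src)
        if not title:
            continue
        target = title_to_filename(title)
        stem, ext = os.path.splitext(target)
        dst = os.path.join(folder, target)
        # Avoid overwriting when two files share a title: add " (2)", " (3)", ...
        n = 2
        while os.path.exists(dst) and dst != src:
            dst = os.path.join(folder, f"{stem} ({n}){ext}")
            n += 1
        if dst != src:
            os.rename(src, dst)
        renamed[name] = os.path.basename(dst)
    return renamed
```

For `get_title`, one option (assuming the third-party pypdf package is installed and the files store their title in the document metadata) is to read `PdfReader(path).metadata` and return its `title` attribute when the metadata is present. If the title only appears as text on the first page, it would have to come from text extraction instead.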



#StackBounty: #java #file #http #logging Scanning through logs (tail -f fashion) parsing and sending to a remote server

Bounty: 100

I have a task at hand to build a utility which

  1. Scans through a log file.

  2. Rolls over if a log file is reset.

  3. Scans through each line of the log file.

  4. Each line is sent to an executor service, where checks are performed. These include looking for a particular word in the line; if a match is found, the line is forwarded for further processing, which splits it up and forms JSON.

  5. This JSON is sent to a server using a CloseableHttpClient with connection keep-alive and the ServiceUnavailableRetryStrategy pattern.
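
Steps 1 to 4 above can be sketched compactly. This is a minimal Python sketch under stated assumptions, not the Java implementation that follows: `read_new_lines`, `to_json` and the shortened `FIELDS` list are hypothetical names for illustration, the keyword `JioEvents` and the `|` separator are taken from the code below, and a file that has shrunk is simply re-read from the start.

```python
import json
import os

# Hypothetical, shortened field list; the Java code below maps 14 fields
FIELDS = ["customerID", "mobileNumber", "eventID"]

def read_new_lines(path, offset):
    """Return (complete lines appended since `offset`, new offset).

    If the file is now shorter than `offset`, it is assumed to have been
    reset, and reading restarts from the beginning.
    """
    if os.path.getsize(path) < offset:
        offset = 0  # file was truncated or replaced
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Consume only up to the last newline so a half-written line is retried later
    end = data.rfind(b"\n") + 1
    text = data[:end].decode("utf-8", errors="replace")
    lines = [line.rstrip("\r") for line in text.split("\n")[:-1]]
    return lines, offset + end

def to_json(line, keyword="JioEvents"):
    """Return a JSON string for a matching line, or None if it does not qualify."""
    pos = line.find(keyword)
    if pos == -1:
        return None
    parts = line[pos:].split("|")
    if len(parts) != len(FIELDS) + 1:
        return None
    # parts[0] is the keyword itself; the remaining parts are the field values
    return json.dumps(dict(zip(FIELDS, parts[1:])))
```

A polling loop would call `read_new_lines` with the offset returned by the previous call, pass each line through `to_json`, and hand the non-None results to whatever sends them to the server.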

Entry point, FileTailReader (started from Main):

   public class FileTailReader implements Runnable {

    private final File file;
    private long filePointer;
    private String url;
    private static volatile boolean keepLooping = true; // TODO move to main class
    private static final Logger logger = LogManager.getLogger(Main.class);
    private ExecutorService executor;
    private List<Future<?>> futures;


    public FileTailReader(File file, String url, ExecutorService executor, List<Future<?>> futures) {
        this.file = file;
        this.url = url;
        this.executor = executor;
        this.futures = futures;

    }

    private HttpPost getPost() {
        HttpPost httpPost = new HttpPost(url);
        httpPost.setHeader("Accept", "application/json");
        httpPost.setHeader("Content-type", "application/json");
        return httpPost;
    }

    @Override
    public void run() {
        long updateInterval = 100;
        try {
            ArrayList<String> batchArray = new ArrayList<>();
            HttpPost httpPost = getPost();
            CloseableHttpAsyncClient closeableHttpClient = getCloseableClient();
            Path path = Paths.get(file.toURI());
            BasicFileAttributes basicFileAttributes = Files.readAttributes(path, BasicFileAttributes.class);
            Object fileKey = basicFileAttributes.fileKey();
            String iNode = fileKey.toString();  // iNode is common during file roll
            long startTime = System.nanoTime();
            while (keepLooping) {

                Thread.sleep(updateInterval);
                long len = file.length();

                if (len < filePointer) {

                    // Log must have been rolled
                    // We can spawn a new thread here to read the remaining part of the rolled file.
                    // Compare the iNode of the file in tail with every file in the dir, if a match is found
                    // - we have the rolled file
                    // This scenario will occur only if our reader lags behind the writer - No worry

                    RolledFileReader rolledFileReader = new RolledFileReader(iNode, file, filePointer, executor,
                            closeableHttpClient, httpPost, futures);
                    new Thread(rolledFileReader).start();

                    logger.info("Log file was reset. Restarting logging from start of file.");
                    this.appendMessage("Log file was reset. Restarting logging from start of file.");
                    filePointer = len;
                } else if (len > filePointer) {
                    // File must have had something added to it!
                    RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r");
                    randomAccessFile.seek(filePointer);
                    FileInputStream fileInputStream = new FileInputStream(randomAccessFile.getFD());
                    BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(fileInputStream));
                    String bLine;
                    while ((bLine = bufferedReader.readLine()) != null) {
                        // We will use an array to hold 1000 lines, so that we can batch process in a
                        // single thread
                        batchArray.add(bLine);
                        switch (batchArray.size()) {

                            case 1000:
                                appendLine((ArrayList<String>) batchArray.clone(), closeableHttpClient, httpPost);
                                batchArray.clear();
                                break;
                        }
                    }

                    if (batchArray.size() > 0) {
                        appendLine((ArrayList<String>) batchArray.clone(), closeableHttpClient, httpPost);
                        batchArray.clear(); // avoid re-sending these lines on the next poll
                    }

                    filePointer = randomAccessFile.getFilePointer();
                    randomAccessFile.close();
                    fileInputStream.close();
                    bufferedReader.close();
                   // logger.info("Total time taken: " + ((System.nanoTime() - startTime) / 1e9));

                }

                //boolean allDone = checkIfAllExecuted();
               // logger.info("isAllDone" + allDone + futures.size());

            }
            executor.shutdown();
        } catch (Exception e) {
            e.printStackTrace();
            this.appendMessage("Fatal error reading log file, log tailing has stopped.");
        }
    }

    private void appendMessage(String line) {
        System.out.println(line.trim());
    }

    private void appendLine(ArrayList<String> batchArray, CloseableHttpAsyncClient client, HttpPost httpPost) {
        Future<?> future = executor.submit(new LocalThreadPoolExecutor(batchArray, client, httpPost));
        futures.add(future);

    }

    private boolean checkIfAllExecuted() {
        boolean allDone = true;
        for (Future<?> future : futures) {
            allDone &= future.isDone(); // check if future is done
        }
        return allDone;
    }

    //Reusable connection
    private RequestConfig getConnConfig() {
        return RequestConfig.custom()
                .setConnectionRequestTimeout(5 * 1000)
                .setConnectTimeout(5 * 1000)
                .setSocketTimeout(5 * 1000).build();
    }

    private PoolingNHttpClientConnectionManager getPoolingConnManager() throws IOReactorException {
        ConnectingIOReactor ioReactor = new DefaultConnectingIOReactor();
        PoolingNHttpClientConnectionManager cm = new PoolingNHttpClientConnectionManager(ioReactor);
        cm.setMaxTotal(1000);
        cm.setDefaultMaxPerRoute(1000);

        return cm;
    }

    private CloseableHttpAsyncClient getCloseableClient() throws IOReactorException {
        CloseableHttpAsyncClient httpAsyncClient = HttpAsyncClientBuilder.create()
                .setDefaultRequestConfig(getConnConfig())
                .setConnectionManager(getPoolingConnManager()).build();

        httpAsyncClient.start();

        return httpAsyncClient;


                /*.setServiceUnavailableRetryStrategy(new ServiceUnavailableRetryStrategy() {
                    @Override
                    public boolean retryRequest(
                            final HttpResponse response, final int executionCount, final HttpContext context) {
                        int statusCode = response.getStatusLine().getStatusCode();
                        return statusCode != HttpURLConnection.HTTP_OK && executionCount < 5;
                    }

                    @Override
                    public long getRetryInterval() {
                        return 0;
                    }
                }).build();*/
    }


}

I am using an implementation of Rabin Karp for string find:

public class RabinKarp {
    private final String pat;      // the pattern  // needed only for Las Vegas
    private long patHash;    // pattern hash value
    private int m;           // pattern length
    private long q;          // a large prime, small enough to avoid long overflow
    private final int R;           // radix
    private long RM;         // R^(M-1) % Q

    /**
     * Preprocesses the pattern string.
     *
     * @param pattern the pattern string
     * @param R       the alphabet size
     */
    public RabinKarp(char[] pattern, int R) {
        this.pat = String.valueOf(pattern);
        this.R = R;
        throw new UnsupportedOperationException("Operation not supported yet");
    }

    /**
     * Preprocesses the pattern string.
     *
     * @param pat the pattern string
     */
    public RabinKarp(String pat) {
        this.pat = pat;      // save pattern (needed only for Las Vegas)
        R = 256;
        m = pat.length();
        q = longRandomPrime();

        // precompute R^(m-1) % q for use in removing leading digit
        RM = 1;
        for (int i = 1; i <= m - 1; i++)
            RM = (R * RM) % q;
        patHash = hash(pat, m);
    }

    // Compute hash for key[0..m-1].
    private long hash(String key, int m) {
        long h = 0;
        for (int j = 0; j < m; j++)
            h = (R * h + key.charAt(j)) % q;
        return h;
    }

    // Las Vegas version: does pat[] match txt[i..i+m-1] ?
    private boolean check(String txt, int i) {
        for (int j = 0; j < m; j++)
            if (pat.charAt(j) != txt.charAt(i + j))
                return false;
        return true;
    }

    // Monte Carlo version: always return true
    // private boolean check(int i) {
    //    return true;
    //}

    /**
     * Returns the index of the first occurrence of the pattern string
     * in the text string.
     *
     * @param txt the text string
     * @return the index of the first occurrence of the pattern string
     * in the text string; -1 if no such match
     */
    public int search(String txt) {
        int n = txt.length();
        if (n < m) return -1; // text shorter than pattern: no match, same as the final return
        long txtHash = hash(txt, m);

        // check for match at offset 0
        if ((patHash == txtHash) && check(txt, 0))
            return 0;

        // check for hash match; if hash match, check for exact match
        for (int i = m; i < n; i++) {
            // Remove leading digit, add trailing digit, check for match.
            txtHash = (txtHash + q - RM * txt.charAt(i - m) % q) % q;
            txtHash = (txtHash * R + txt.charAt(i)) % q;

            // match
            int offset = i - m + 1;
            if ((patHash == txtHash) && check(txt, offset))
                return offset;
        }

        // no match
        return -1;
    }


    // a random 31-bit prime
    private static long longRandomPrime() {
        BigInteger prime = BigInteger.probablePrime(31, new Random());
        return prime.longValue();
    }
}

Here is my RolledFileReader

public class RolledFileReader implements Runnable {

    private static final Logger logger = LogManager.getLogger(RolledFileReader.class);

    private String iNode;
    private File tailedFile;
    private long filePointer;
    private ExecutorService executor;
    private CloseableHttpAsyncClient client;
    private HttpPost httpPost;
    List<Future<?>> futures;

    public RolledFileReader(String iNode, File tailedFile, long filePointer, ExecutorService executor,
                            CloseableHttpAsyncClient client, HttpPost httpPost, List<Future<?>> futures) {
        this.iNode = iNode;
        this.tailedFile = tailedFile;
        this.filePointer = filePointer;
        this.executor = executor;
        this.client = client;
        this.httpPost = httpPost;
        this.futures = futures;
    }

    @Override
    public void run() {
        try {
            inodeReader();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }


    public void inodeReader() throws Exception {
        String fParent = tailedFile.getParentFile().toString();
        File[] files = new File(fParent).listFiles();
        if (files != null) {
            Arrays.sort(files, Collections.reverseOrder()); // Probability of finding the file at top increases
            for (File file : files) {
                if (file.isFile()) {
                    Path path = Paths.get(file.toURI());
                    BasicFileAttributes basicFileAttributes = Files.readAttributes(path, BasicFileAttributes.class);
                    Object fileKey = basicFileAttributes.fileKey();
                    String matchInode = fileKey.toString();
                    if (matchInode.equalsIgnoreCase(iNode) && file.length() > filePointer) {
                        //We found a match - now process the remaining file - we are in a separate thread
                        readRolledFile(file, filePointer);

                    }
                }
            }

        }
    }


    public void readRolledFile(File rolledFile, long filePointer) throws Exception {
        ArrayList<String> batchArray = new ArrayList<>();
        RandomAccessFile randomAccessFile = new RandomAccessFile(rolledFile, "r");
        randomAccessFile.seek(filePointer);
        FileInputStream fileInputStream = new FileInputStream(randomAccessFile.getFD());
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(fileInputStream));
        String bLine;
        while ((bLine = bufferedReader.readLine()) != null) {

            batchArray.add(bLine);
            switch (batchArray.size()) {
                case 1000:
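                    executor.execute(new LocalThreadPoolExecutor((ArrayList<String>) batchArray.clone(), client, httpPost));
                    batchArray.clear(); // start a fresh batch so lines are not submitted twice
                    break;
            }
        }

        if (batchArray.size() > 0) {
            executor.execute(new LocalThreadPoolExecutor((ArrayList<String>) batchArray.clone(), client, httpPost));
        }
        bufferedReader.close(); // release the rolled file once it has been read
        randomAccessFile.close();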
    }


}

And my executor service LocalThreadPoolExecutor:

   public class LocalThreadPoolExecutor implements Runnable {
    private static final Logger logger = LogManager.getLogger(Main.class);

    private final ArrayList<String> payload;
    private final CloseableHttpAsyncClient client;
    private final HttpPost httpPost;
    private HttpContext context;
    private final RabinKarp searcher = new RabinKarp("JioEvents");

    public LocalThreadPoolExecutor(ArrayList<String> payload, CloseableHttpAsyncClient client,
                                   HttpPost httpPost) {
        this.payload = payload;
        this.client = client;
        this.httpPost = httpPost;
    }

    @Override
    public void run() {
        try {
            for (String line : payload) {
                int offset = searcher.search(line);
                switch (offset) {
                    case -1:
                        break;
                    default:
                        String zeroIn = line.substring(offset).toLowerCase();
                        String postPayload = processLogs(zeroIn);
                        if (null != postPayload) {
                            postData(postPayload, client, httpPost);
                        }
                }
            }
       // logger.info("Processed a batch of: "+payload.size());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    private String processLogs(String line) {
        String[] jsonElements = line.split("\\|");
        switch (jsonElements.length) {
            case 15:
                JSONObject jsonObject = new JSONObject();
                jsonObject.put("customerID", jsonElements[1]);
                jsonObject.put("mobileNumber", jsonElements[2]);
                jsonObject.put("eventID", jsonElements[3]);
                jsonObject.put("eventType", jsonElements[4]);
                jsonObject.put("eventDateTime", jsonElements[5]);
                jsonObject.put("eventResponseCode", jsonElements[6]);
                jsonObject.put("sourceSystem", jsonElements[7]);
                jsonObject.put("clientID", jsonElements[8]);
                jsonObject.put("serverHostName", jsonElements[9]);
                jsonObject.put("serverIPAddress", jsonElements[10]);
                jsonObject.put("serverSessionID", jsonElements[11]);
                jsonObject.put("softwareVersion", jsonElements[12]);
                jsonObject.put("deviceInfo", jsonElements[13]);
                jsonObject.put("userAgent", jsonElements[14]);
                return jsonObject.toString();
        }
        return null;
    }

    private void postData(String data, CloseableHttpAsyncClient client, HttpPost httpPost) throws Exception {

        StringEntity entity = new StringEntity(data);
        httpPost.setEntity(entity);
        Future<HttpResponse> future = client.execute(httpPost, context, null);
     //   HttpResponse response = future.get();
     //   logger.info("Resp is: "+response.getStatusLine().getStatusCode());

    }

}

And finally the Main class:

public class Main {
    private static final Logger logger = LogManager.getLogger(Main.class);
    private static final ExecutorService executor = Executors.newFixedThreadPool(25);
    private static final List<Future<?>> futures = new ArrayList<>();

    private static void usage() {
        System.out.println("Invalid usage");
    }

    public static void main(String[] args) {

        if (args.length < 2) {
            usage();
            System.exit(0);
        }
        String url = args[0];
        String fPath = args[1];

        File log = new File(fPath);
        FileTailReader fileTailReader = new FileTailReader(log, url, executor, futures);

        new Thread(fileTailReader).start(); // Can issue multiple threads with an executor like so, for multiple files


    }

}

The purpose of declaring member variables in Main is that I can later on add ShutdownHooks.

I would like to know how I can make this code faster. Right now I am getting a throughput of 300,000 lines in 8,876 ms, which is not going down well with my peers.

Edit:

I changed the way RandomAccessFile is reading from the file and I have observed a considerable increase in speed, however I am still looking for fresh pointers to enhance and optimize this utility:

else if (len > filePointer) {
                    // File must have had something added to it!
                    long startTime = System.nanoTime();
                    RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r");
                    randomAccessFile.seek(filePointer);
                    FileInputStream fileInputStream = new FileInputStream(randomAccessFile.getFD());
                    BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(fileInputStream));
                    String bLine;
                    logger.info("Pointer: "+filePointer+" fileLength: "+len);
                    while ((bLine = bufferedReader.readLine()) != null) {
                        this.appendLine(bLine, httpclient, httpPost);
                    }
                    logger.info("Total time taken: " + ((System.nanoTime() - startTime) / 1e9));
                    filePointer = randomAccessFile.getFilePointer();
                    logger.info("FilePointer reset to: "+filePointer);
                    randomAccessFile.close();
                    fileInputStream.close();
                    bufferedReader.close();
                }

I also added some batch processing to the snippet above (the FileTailReader code has been edited to show this, in particular the addition of batchArray, which is a list). I see an improvement of 10 seconds; the program now executes in a little over 21 milliseconds.



#StackBounty: #python #multithreading #sorting #file #audio Speech Recognition Part 2: Classifying Data

Bounty: 50

Now that I have generated training data, I need to classify each example with a label to train a TensorFlow neural net (first building a suitable dataset). To streamline the process, I wrote this little Python script to help me. Any suggestions for improvement?


classify.py:

# Builtin modules
import glob
import sys
import os
import shutil
import wave
import time
import re
from threading import Thread

# 3rd party modules
import scipy.io.wavfile
import pyaudio

DATA_DIR = 'raw_data'
LABELED_DIR = 'labeled_data'
answer = None

def classify_files():
    global answer
    # instantiate PyAudio
    p = pyaudio.PyAudio()

    for filename in glob.glob('{}/*.wav'.format(DATA_DIR)):
        # define stream chunk
        chunk = 1024

        #open a wav format music
        wf = wave.open(filename, 'rb')
        #open stream
        stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                        channels=wf.getnchannels(),
                        rate=wf.getframerate(),
                        output=True)
        #read data
        data = wf.readframes(chunk)

        #play stream
        while answer is None:
            stream.write(data)
            data = wf.readframes(chunk)
            if data == b'': # if file is over then rewind
                wf.rewind()
                time.sleep(1)
                data = wf.readframes(chunk)

        # don't know how to classify, skip sample
        if answer == '.':
            answer = None
            # release the audio stream before moving on to the next file
            stream.stop_stream()
            stream.close()
            continue

        # sort spectrogram based on input
        spec_filename = 'spec{}.jpeg'.format(str(re.findall(r'\d+', filename)[0]))
        os.makedirs('{}/{}'.format(LABELED_DIR, answer), exist_ok=True)
        shutil.copyfile('{}/{}'.format(DATA_DIR, spec_filename), '{}/{}/{}'.format(LABELED_DIR, answer, spec_filename))

        # reset answer field
        answer = None

        #stop stream
        stream.stop_stream()
        stream.close()

    #close PyAudio
    p.terminate()

if __name__ == '__main__':
    try:
        # exclude file from glob
        os.remove('{}/ALL.wav'.format(DATA_DIR))

        num_files = len(glob.glob('{}/*.wav'.format(DATA_DIR)))
        Thread(target = classify_files).start()
        for i in range(0, num_files):
            answer = input("Enter letter of sound heard: ")
    except KeyboardInterrupt:
        sys.exit()



Java Code to Zip all folders in a particular folder.

A small utility to create a separate zip file for each folder inside a particular folder.

For example:

- c:/path/to/folder
    -> folder 1
    -> folder 2
    -> folder 3
    -> folder 4

Output:

- c:/path/to/folder
    -> folder 1
    -> folder 2
    -> folder 3
    -> folder 4
    -> folder 1.zip
    -> folder 2.zip
    -> folder 3.zip
    -> folder 4.zip

Original source: https://goo.gl/sp0bqr
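
The Java code itself lives at the link above. As an illustration of the behavior described (one archive per immediate subfolder, written next to it), here is a minimal Python sketch; it is not a port of the linked source, and the function name `zip_each_subfolder` is an assumption made for this example.

```python
import os
import shutil

def zip_each_subfolder(parent):
    """Create <name>.zip next to every immediate subfolder of `parent`."""
    created = []
    for name in sorted(os.listdir(parent)):
        folder = os.path.join(parent, name)
        if not os.path.isdir(folder):
            continue  # plain files (including earlier .zip files) are skipped
        # make_archive appends ".zip" itself; root_dir makes the paths inside
        # the archive relative to the folder being zipped
        archive = shutil.make_archive(os.path.join(parent, name), "zip", root_dir=folder)
        created.append(archive)
    return created
```

Calling `zip_each_subfolder("c:/path/to/folder")` on the layout shown above would leave the four folders in place and add one `.zip` file per folder beside them.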

How to get current datetime on Windows command line for using in a filename?

Let's say you want a .bat file that zips up a directory into an archive with the current date and time as part of the name, for example, Code_2008-10-14_2257.zip.

In a Windows console (CMD), the following commands give you what you need. If you save them in a .bat file, double the percent signs on the loop variables (%%a, %%b, %%c):

@echo off
For /f "tokens=2-4 delims=/ " %a in ('date /t') do (set mydate=%c-%a-%b)
For /f "tokens=1-2 delims=/:" %a in ('time /t') do (set mytime=%a-%b)
echo %mydate%_%mytime%
@echo on

echo "Testing 123" >> "testFile_%mydate%_%mytime%.txt"

This creates a file whose name ends with the date and time in the desired format.
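
Note that the output of `date /t` and `time /t` depends on the regional settings of the machine, so the token positions used above may need adjusting on other systems. Where a scripting language is available, the same kind of name can be built without parsing locale-dependent text. A minimal Python sketch (the function name `timestamped_name` is an assumption for illustration):

```python
from datetime import datetime

def timestamped_name(prefix, ext, now=None):
    """Build a name such as Code_2008-10-14_2257.zip from a prefix and extension."""
    # `now` can be passed in explicitly; by default the current local time is used
    now = now or datetime.now()
    return f"{prefix}_{now:%Y-%m-%d_%H%M}.{ext}"
```

For instance, `timestamped_name("Code", "zip")` returns a name in the `Code_YYYY-MM-DD_HHMM.zip` pattern for the current local time.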


How to read a file in Java

This kind of function is usually not recommended for reading huge files, because Java may not be able to allocate that much contiguous memory.

As far as possible, avoid using this function for large inputs.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public static String getFile(String filepath)
{
        StringBuilder output = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader bfr = new BufferedReader(new FileReader(filepath)))
        {
            String line;
            while ((line = bfr.readLine()) != null)
            {
                output.append(line).append("\n");
            }
        }
        catch (IOException e)
        {
            // Also covers FileNotFoundException, which is a subclass of IOException
            e.printStackTrace();
        }
        return output.toString();
}
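
For large files, the usual alternative is to process one line at a time instead of accumulating the whole content in memory. The idea is language-independent; a minimal sketch in Python, where the function name `count_matching_lines` and the counting task are assumptions chosen only to illustrate the pattern:

```python
def count_matching_lines(filepath, needle, encoding="utf-8"):
    """Count the lines containing `needle` without loading the whole file."""
    count = 0
    # Iterating over the file object yields one line at a time, so memory use
    # stays proportional to the longest line rather than to the file size
    with open(filepath, encoding=encoding) as f:
        for line in f:
            if needle in line:
                count += 1
    return count
```

The same structure applies in Java: keep the `readLine()` loop, but handle each line inside the loop rather than appending it to a `StringBuilder`.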