#StackBounty: #python Approaches to pre-processing huge but organised text data, with and without generators

Bounty: 50

I have a huge text file, so I read it line by line, apply some basic cleaning, and write X and Y to two separate CSV files. I then create three directories for each CSV (train, val and test) and write each line as its own CSV file into the appropriate directory. This makes it convenient to use the fit_generator() method, which can read these files one at a time to train the model.
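For concreteness, the first step (streaming the file and splitting it into X and Y) might look roughly like the sketch below. The tab-separated `input<TAB>target` layout, the `clean` helper and the `split_xy` name are all assumptions made for illustration; the real file format and cleaning rules may differ.

```python
import csv


def clean(text):
    # Hypothetical cleaning step: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def split_xy(in_path, x_path, y_path, sep="\t"):
    """Stream in_path line by line, writing cleaned X and Y to two CSV files.

    Assumes each non-blank line is 'input<sep>target'. Returns the number of
    rows written.
    """
    written = 0
    with open(in_path, encoding="utf-8") as src, \
         open(x_path, "w", newline="", encoding="utf-8") as fx, \
         open(y_path, "w", newline="", encoding="utf-8") as fy:
        x_writer, y_writer = csv.writer(fx), csv.writer(fy)
        for line in src:  # only one line is held in memory at a time
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            x, _, y = line.partition(sep)
            x_writer.writerow([clean(x)])
            y_writer.writerow([clean(y)])
            written += 1
    return written
```

Because the file object is iterated lazily, memory use stays flat regardless of the input size.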

My concern is that I have pre-processing steps to run before training, and applying them to this many files, one file at a time, doesn't seem practical: it won't be time-efficient because the operations aren't vectorized, and there will be a lot of disk reads and writes, since each processed file also has to be stored. Are there other approaches for dealing with such scenarios? What are the best practices? Are custom generator functions the only way? I'd appreciate any help.
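To make the "custom generator" option concrete, here is a minimal sketch of one that reads the two CSV files directly in batches, so that pre-processing runs once per batch rather than once per single-line file. The `batches` name, the `preprocess` callback and the one-value-per-row layout are assumptions for illustration, not a recommended or definitive design.

```python
import csv
from itertools import islice


def batches(x_path, y_path, batch_size=32, preprocess=None):
    """Yield (X_batch, Y_batch) lists read lazily from two aligned CSV files.

    Assumes row i of x_path corresponds to row i of y_path, one value per
    row, with no blank rows. `preprocess` is a hypothetical callback applied
    to a whole batch of X values at once.
    """
    with open(x_path, newline="", encoding="utf-8") as fx, \
         open(y_path, newline="", encoding="utf-8") as fy:
        rows = zip(csv.reader(fx), csv.reader(fy))
        while True:
            chunk = list(islice(rows, batch_size))  # at most batch_size rows
            if not chunk:
                break  # both files exhausted
            xs = [x_row[0] for x_row, _ in chunk]
            ys = [y_row[0] for _, y_row in chunk]
            if preprocess is not None:
                xs = preprocess(xs)  # one call per batch, not per line
            yield xs, ys
```

This makes a single pass over the data. fit_generator() expects a generator that loops over its data indefinitely and yields arrays, so a wrapper that restarts the pass and converts each batch would still be needed; that part is left out here.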
