#StackBounty: #machine-learning #python #scikit-learn #nlp Pass 2 different kinds of X training data to ML model simultaneously

Bounty: 50

I’m trying to classify if a book is fiction/nonfiction based on title and summary.

These are two distinct types of information. Is there a way to keep title and summary as separate inputs when feeding them to a model, rather than concatenating them?

For example:

Title: "such a long journey"

Summary: "it is bombay in 1971, the year india went to..."

Label: "fiction" (where fiction = 1)

Current procedure:

What I’ve been doing until now is concatenating the two fields, so the above becomes:

example = "such a long journey it is bombay in 1971, the year india went to..."
label = 1

Then the usual setup, something like,

X.append(example)
y.append(label)
...
X = lemmatize(X)
...
X_train, X_test, y_train, y_test = split_data(X,y)

vectorizer = TfidfVectorizer(...)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

classifier.fit(X_train, y_train)
y_predict = classifier.predict(X_test)

But feeding the data concatenated feels intuitively wrong. Is there a better way to do this?

If this is only possible with a library other than sklearn (keras, tensorflow), I’d also be open to hearing about that.
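One possible way to keep the two fields separate, sketched here under assumptions rather than offered as a confirmed solution: fit one TfidfVectorizer per field and stack the resulting sparse matrices side by side with scipy.sparse.hstack, so title terms and summary terms occupy different columns. Only the first row comes from the example above; the other rows, the labels, and the choice of LogisticRegression are hypothetical placeholders.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Only the first title/summary pair is from the question; the rest are
# made-up rows so that the sketch has both classes to train on.
titles = [
    "such a long journey",
    "an introduction to tax law",
    "the dragon of the northern hills",
    "gardening basics for small spaces",
]
summaries = [
    "it is bombay in 1971, the year india went to...",
    "a reference guide to income tax rules and filing",
    "a young knight sets out to face a dragon",
    "practical advice on soil, light and watering",
]
labels = [1, 0, 1, 0]  # 1 = fiction, 0 = nonfiction

# One vectorizer per field, so each field gets its own vocabulary and weights.
title_vectorizer = TfidfVectorizer()
summary_vectorizer = TfidfVectorizer()

X_title = title_vectorizer.fit_transform(titles)
X_summary = summary_vectorizer.fit_transform(summaries)

# Stack the two feature blocks horizontally: [title features | summary features].
X_combined = hstack([X_title, X_summary]).tocsr()

classifier = LogisticRegression()  # assumed classifier, any sklearn estimator would do
classifier.fit(X_combined, labels)
predictions = classifier.predict(X_combined)
```

At test time the same two fitted vectorizers would be applied with transform (not fit_transform) before stacking, mirroring the train/test handling in the current procedure.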


UPDATE

Going from,

X = [['two'],['two'],['four'],['two'],['four'],['four']]
y = ['human','human','dog','human','dog','dog']

to,

X = [['two','hello'],['two','hello'],['four','bark'],['two','hi'],['four','bark'],['four','woof']]
y = ['human','human','dog','human','dog','dog']

causes errors to be thrown.

The error is "'list' object has no attribute 'lower'" if X is a list, and "'numpy.ndarray' object has no attribute 'lower'" if X is an array.

The error is thrown when I call,

X_train = vectorizer.fit_transform(X_train)

Is it possible to pass in a vector of features?
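The `lower` error arises because TfidfVectorizer expects each sample to be a single string, whereas here each sample is a list of two strings. A minimal sketch of one way around it, assuming sklearn's ColumnTransformer is available (scikit-learn 0.20 or later): give each column its own vectorizer and let the transformer concatenate the outputs. The step names "first" and "second" and the use of LogisticRegression are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = np.array([['two', 'hello'], ['two', 'hello'], ['four', 'bark'],
              ['two', 'hi'], ['four', 'bark'], ['four', 'woof']])
y = ['human', 'human', 'dog', 'human', 'dog', 'dog']

# A scalar column index (0, 1) hands each vectorizer a 1-D array of strings,
# which is the input shape TfidfVectorizer expects.
preprocess = ColumnTransformer([
    ("first", TfidfVectorizer(), 0),   # hypothetical step name
    ("second", TfidfVectorizer(), 1),  # hypothetical step name
])

# Chain the per-column vectorizers with an (assumed) classifier.
model = make_pipeline(preprocess, LogisticRegression())
model.fit(X, y)

prediction = model.predict(np.array([['four', 'woof']]))
```

Because the vectorizers live inside the pipeline, calling model.fit on the training split and model.predict on the test split replaces the separate fit_transform and transform calls.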

