I'm trying to classify whether a book is fiction or nonfiction based on its title and summary.
These are two distinct types of information. Is there a way to keep the title and
summary separate before feeding them to a model, rather than concatenating them?
Title: "such a long journey"
Summary: "it is bombay in 1971, the year india went to..."
Label: "fiction" (where fiction = 1)
What I’ve been doing until now is concatenating the information, so the above becomes,
example = "such a long journey it is bombay in 1971, the year india went to..."
label = 1
Then the usual setup, something like,
X.append(example)
y.append(label)
...
X = lemmatize(X)
...
X_train, X_test, y_train, y_test = split_data(X, y)
vectorizer = TfidfVectorizer(...)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
classifier.fit(X_train, y_train)
y_predict = classifier.predict(X_test)
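For anyone who wants to run the concatenated setup end to end, here is a minimal self-contained sketch of it. The toy books, the choice of LogisticRegression as the classifier, and the use of train_test_split in place of my split_data helper are all assumptions made for illustration, and the lemmatization step is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy (title, summary, label) rows invented for illustration.
# 1 = fiction, 0 = nonfiction.
books = [
    ("the lost kingdom", "a young prince fights a dragon to save his people", 1),
    ("the silent garden", "a widow falls in love with a mysterious stranger", 1),
    ("midnight train", "two detectives chase a killer across the city", 1),
    ("the last wizard", "an orphan discovers she has magical powers", 1),
    ("a history of rome", "an account of the roman empire from its founding", 0),
    ("cooking basics", "a practical guide to everyday recipes and techniques", 0),
    ("the human brain", "an introduction to neuroscience and how memory works", 0),
    ("personal finance", "a guide to budgeting saving and investing money", 0),
]

# Concatenate title and summary into a single string per book.
X = [f"{title} {summary}" for title, summary, _ in books]
y = [label for _, _, label in books]

# Stand-in for split_data: hold out 2 of the 8 books, keeping classes balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the vocabulary on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Assumed classifier; any sklearn estimator with fit/predict would slot in here.
classifier = LogisticRegression()
classifier.fit(X_train_vec, y_train)
y_predict = classifier.predict(X_test_vec)
```

With this setup the model sees one bag of words per book, so a word contributes the same feature whether it came from the title or the summary.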
But feeding the data concatenated feels intuitively wrong. Is there a better way to do this?
If for some reason it's possible with a library other than sklearn (keras, tensorflow), I'd also be open to hearing about that.
X = [['two'],['two'],['four'],['two'],['four'],['four']]
y = ['human','human','dog','human','dog','dog']
X = [['two','hello'],['two','hello'],['four','bark'],['two','hi'],['four','bark'],['four','woof']]
y = ['human','human','dog','human','dog','dog']
causes errors to be thrown.
'list' object has no attribute 'lower' if X is a list, and
'numpy.ndarray' object has no attribute 'lower' if X is an array.
The error is thrown when I call,
X_train = vectorizer.fit_transform(X_train)
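A minimal reproduction of the failure, using the two-feature X from above. As far as I can tell, TfidfVectorizer expects each sample to be a single string and lowercases it during preprocessing, which is where a list-valued sample breaks. The variable name error_message is just for this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each sample is a list of two features rather than a single string.
X = [['two', 'hello'], ['two', 'hello'], ['four', 'bark'],
     ['two', 'hi'], ['four', 'bark'], ['four', 'woof']]

vectorizer = TfidfVectorizer()

try:
    vectorizer.fit_transform(X)
    error_message = None
except AttributeError as exc:
    # The vectorizer calls .lower() on each sample, which a list does not have.
    error_message = str(exc)

print(error_message)
```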
Is it possible to pass in a vector of features?
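One direction that looks relevant is sklearn's ColumnTransformer, which can apply a separate TfidfVectorizer to each text column and stack the results side by side. The sketch below is only my guess at how that would look: the toy DataFrame, the column names, the step names, and the LogisticRegression classifier are assumptions for illustration, and I don't know whether this is the recommended approach.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data invented for illustration; 1 = fiction, 0 = nonfiction.
df = pd.DataFrame({
    "title": ["the lost kingdom", "midnight train",
              "a history of rome", "cooking basics"],
    "summary": ["a young prince fights a dragon to save his people",
                "two detectives chase a killer across the city",
                "an account of the roman empire from its founding",
                "a practical guide to everyday recipes and techniques"],
    "label": [1, 1, 0, 0],
})

# Each text column gets its own vectorizer and therefore its own vocabulary.
# The column is given as a plain string (not a list) so the vectorizer
# receives a 1-D sequence of strings.
features = ColumnTransformer([
    ("title_tfidf", TfidfVectorizer(), "title"),
    ("summary_tfidf", TfidfVectorizer(), "summary"),
])

model = Pipeline([
    ("features", features),
    ("classifier", LogisticRegression()),  # assumed classifier
])

model.fit(df[["title", "summary"]], df["label"])
predictions = model.predict(df[["title", "summary"]])
```

With this layout a word in the title and the same word in the summary become two different features, so the classifier can weight them independently.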