I want to train a model that classifies the breed of a dog from video input. My dataset contains 10 classes with 30 videos in each class. The problem is that the dog is not present for the full duration of any of these videos. Here are two example videos from the dataset:

Video 1: Video of backyard (first 5 seconds) -> Dog appears (15 seconds) -> Video of surrounding buildings (3 seconds)

Video 2: Video of grass (first 8 seconds) -> Dog appears (3 seconds) -> Video of nearby people (4 seconds)
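
To make the imbalance concrete, here is a small sketch that computes how much of each example video actually shows the dog. The segment durations are the ones listed above; the names `videos` and `dog_fraction` are just illustrative.

```python
# Segment durations in seconds, taken from the two example videos above.
videos = {
    "Video 1": [("backyard", 5), ("dog", 15), ("buildings", 3)],
    "Video 2": [("grass", 8), ("dog", 3), ("people", 4)],
}


def dog_fraction(segments):
    """Fraction of the total running time in which the dog is visible."""
    total = sum(duration for _, duration in segments)
    dog = sum(duration for label, duration in segments if label == "dog")
    return dog / total


for name, segments in videos.items():
    print(f"{name}: dog visible {dog_fraction(segments):.0%} of the time")
```

So the dog is on screen for roughly 65% of Video 1 but only 20% of Video 2.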

I presume that if I trained on the videos as they are, my CNN would learn irrelevant features from the frames without the dog (backyard, grass, buildings, people) and hence give incorrect outputs. Do I need to manually trim each of the 300 videos so that they show only the part where the dog appears, or is there an easier way to approach this problem?
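
One alternative to manual trimming that I have considered, as a rough sketch only: if I could get a per-frame "dog present" flag from some off-the-shelf detector (which I have not chosen or tested, so it is purely an assumption here), the trimming itself would reduce to finding runs of flagged frames. The names `dog_segments`, `flags`, `fps`, and `min_seconds` below are hypothetical.

```python
def dog_segments(flags, fps, min_seconds=1.0):
    """Return (start_s, end_s) time ranges covering runs of flagged frames.

    flags: one boolean per frame, True where a dog was detected
           (assumed to come from some external per-frame detector).
    fps: frames per second of the video.
    min_seconds: runs shorter than this are dropped as likely noise.
    """
    runs = []
    start = None
    for i, present in enumerate(flags):
        if present and start is None:
            start = i                      # a run of dog frames begins
        elif not present and start is not None:
            runs.append((start, i))        # the run ended at frame i (exclusive)
            start = None
    if start is not None:
        runs.append((start, len(flags)))   # the run lasts until the final frame

    # Convert frame indices to seconds and drop very short runs.
    return [(s / fps, e / fps) for s, e in runs if (e - s) / fps >= min_seconds]


# Video 1 from above at a hypothetical 1 frame per second:
flags = [False] * 5 + [True] * 15 + [False] * 3
print(dog_segments(flags, fps=1))
```

I do not know whether per-frame detection would be reliable enough on my data, which is part of what I am asking.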

