I’m using Spark to load JSON files from Amazon S3. I would like to remove duplicates based on two columns of the DataFrame, retaining the newest record (I have a timestamp column). What would be the best way to do it? Please note that the duplicates may be spread across partitions. Can I remove duplicates, retaining the last record, without shuffling? I’m dealing with about 1 TB of data.