#StackBounty: #pyspark #apache-spark-sql How to remove duplicates from a spark data frame while retaining the latest?

Bounty: 50

I’m using Spark to load JSON files from Amazon S3. I would like to remove duplicates based on two columns of the DataFrame, retaining the newest record (I have a timestamp column). What would be the best way to do this? Please note that the duplicates may be spread across partitions. Can I remove duplicates while retaining the last record without shuffling? I’m dealing with 1 TB of data.
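A common way to approach this (not taken from the question itself, just a hedged sketch) is to use a window function: partition by the two key columns, order by the timestamp descending, and keep only the first row in each group. The column names `user_id`, `event_type`, and `ts`, and the S3 path, are placeholders for illustration.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup-latest").getOrCreate()

# Hypothetical input: key columns "user_id" and "event_type", timestamp column "ts"
df = spark.read.json("s3://my-bucket/path/to/json/")

# Rank rows within each duplicate group, newest timestamp first
w = Window.partitionBy("user_id", "event_type").orderBy(F.col("ts").desc())

deduped = (
    df.withColumn("rn", F.row_number().over(w))  # 1 = newest row per key pair
      .filter(F.col("rn") == 1)
      .drop("rn")
)
```

As for avoiding a shuffle: since duplicates may live in different partitions, Spark generally has to repartition the data by the key columns to compare them, so some shuffle is unavoidable; the window approach above (or an equivalent `groupBy` aggregation) incurs that cost once.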
