Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Learn more about Collectives

Teams

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Learn more about Teams

I've playing streaming data in Spark 2.

I want to duplicate record with dropDuplicates method.

I've found on Spark site that I can use dropDuplicates with watermark .

This is my code with watermark without dropDuplicates method:

parsed = parsed_opc \
    .withWatermark("sourceTimeStamp", "10 minutes") \
    .groupBy(
        window(parsed_opc.sourceTimeStamp, "4 seconds"),
        parsed_opc.id
    .agg({"value": "avg"}) \
    .withColumnRenamed("avg(value)", "avg")\
    .orderBy("avg", ascending=True)

This code works. But whn I want to add dropDuplicates like this:

parsed = parsed_opc \
    .withWatermark("sourceTimeStamp", "10 minutes") \
    .dropDuplicates("id", "sourceTimeStamp") \
    .groupBy(
        window(parsed_opc.sourceTimeStamp, "4 seconds"),
        parsed_opc.id
    .agg({"value": "avg"}) \
    .withColumnRenamed("avg(value)", "avg")\
    .orderBy("avg", ascending=True)

It throws an error : TypeError: dropDuplicates() takes from 1 to 2 positional arguments but 3 were given.

I don't understand why this error throwns. This usage is also in Spark site with like this. What is the reason of this error?

You need to use brackets to declare more than one column in your dropDuplicates() method.

Like this:

parsed = parsed_opc \
    .withWatermark("sourceTimeStamp", "10 minutes") \
    .dropDuplicates(["id", "sourceTimeStamp"]) \
    .groupBy(
        window(parsed_opc.sourceTimeStamp, "4 seconds"),
        parsed_opc.id
    .agg({"value": "avg"}) \
    .withColumnRenamed("avg(value)", "avg")\
    .orderBy("avg", ascending=True)
        

Thanks for contributing an answer to Stack Overflow!

  • Please be sure to answer the question. Provide details and share your research!

But avoid

  • Asking for help, clarification, or responding to other answers.
  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.