#StackBounty: #apache-spark #pyspark #spark-streaming pyspark streaming how to set ConnectionPool

Bounty: 50

I have a task: I want to read data from Kafka, process it with Spark Streaming, and then send the data to HBase.

In the official Spark documentation, I found:

```python
def sendPartition(iter):
    # ConnectionPool is a static, lazily initialized pool of connections
    connection = ConnectionPool.getConnection()
    for record in iter:
        connection.send(record)
    # return to the pool for future reuse
    ConnectionPool.returnConnection(connection)

dstream.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))
```

But I couldn't find any clue on how to set up such a ConnectionPool for HBase using PySpark.
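My best guess so far is something like the sketch below, using the third-party happybase library (which talks to HBase through its Thrift gateway, so a Thrift server would have to be running). The host, the table name, and the assumed record shape of (row_key, {column: value}) pairs are just placeholders:

```python
import happybase

# Module-level pool: module globals live in the Python worker process,
# so this should be created at most once per worker, not once per task.
_pool = None

def get_pool():
    global _pool
    if _pool is None:
        # 'hbase-thrift-host' is a placeholder for a real Thrift gateway
        _pool = happybase.ConnectionPool(size=4, host='hbase-thrift-host')
    return _pool

def send_partition(records):
    # Borrow a connection from the per-worker pool for this partition
    with get_pool().connection() as connection:
        table = connection.table('my_table')
        with table.batch() as batch:
            for row_key, data in records:
                batch.put(row_key, data)

dstream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))
```

Is this lazy module-global pattern the right way to get a per-worker pool, or is there a more idiomatic approach?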

I also don't fully understand how the streaming execution works.
The code uses foreachPartition; I want to be clear whether those partitions run inside the same Spark executor or not.

Are all the variables in the closure reset for each partition of each RDD?

Is there a way I can set a variable at the worker level?

Is globals() scoped at the worker level, or at the cluster level?
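To probe this, I wrote a small experiment (tag_partition and _worker_state are names I made up) that tags every record with the process id of the Python worker that handles it:

```python
import os

_worker_state = None  # module-level global, lives in the Python worker process

def get_worker_state():
    # Lazily initialized: runs at most once per (reused) worker process,
    # not once per partition or per task.
    global _worker_state
    if _worker_state is None:
        _worker_state = {"pid": os.getpid(), "partitions_seen": 0}
    return _worker_state

def tag_partition(records):
    state = get_worker_state()
    state["partitions_seen"] += 1
    for record in records:
        yield (state["pid"], state["partitions_seen"], record)

# rdd.mapPartitions(tag_partition).collect() would show which partitions
# shared a worker: the same pid with an increasing counter would mean the
# module-level global survived across partitions.
```

If globals() were cluster-wide I would expect a single pid everywhere, and if everything were reset per partition the counter would always be 1. My expectation is that each Python worker process keeps its own copy (I understand workers may be reused across tasks when spark.python.worker.reuse is enabled, which I believe is the default), but I would like confirmation.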

