I have a task: I want to read data from Kafka, process it with Spark Streaming, and send the results to HBase.
In the official Spark documentation, I found:
```python
def sendPartition(iter):
    # ConnectionPool is a static, lazily initialized pool of connections
    connection = ConnectionPool.getConnection()
    for record in iter:
        connection.send(record)
    # return to the pool for future reuse
    ConnectionPool.returnConnection(connection)

dstream.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))
```
But I couldn't find any clue about how to set up such a ConnectionPool for HBase using PySpark.
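For example, is something like the following sketch the right direction? (This assumes the happybase library is available on the workers; the host, pool size, table name, and record format are all placeholders I made up.)

```python
import happybase

# One pool per Python worker process: module-level and lazily
# initialized, so it is created at most once per process and then
# reused by every partition that process handles.
_pool = None

def get_pool():
    global _pool
    if _pool is None:
        # 'hbase-host' and the pool size are placeholder values
        _pool = happybase.ConnectionPool(size=4, host='hbase-host')
    return _pool

def sendPartition(records):
    pool = get_pool()
    # Borrow a connection from the pool for the whole partition,
    # then return it automatically when the block exits
    with pool.connection() as connection:
        table = connection.table('my_table')  # placeholder table name
        for key, data in records:  # assumes (row_key, {column: value}) records
            table.put(key, data)

dstream.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))
```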
I also don't fully understand how the streaming execution works.
The code uses foreachPartition, and I want to be clear: do those partitions run in the same Spark container (executor) or not?
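To make the question concrete, I was thinking of checking it with a diagnostic like this (the print output should end up in the executor logs, not on the driver):

```python
import os
import socket

def show_partition_location(records):
    # Report which executor host and which Python worker process
    # is handling this partition
    n = sum(1 for _ in records)
    print("host=%s pid=%d records=%d" % (socket.gethostname(), os.getpid(), n))

dstream.foreachRDD(lambda rdd: rdd.foreachPartition(show_partition_location))
```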
Are all variables in the closure reset for each partition of each RDD?
Is there a way I can set a variable at the worker level?
Is globals() at the worker level, or at the cluster level?
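For example, would a module-level variable like the one below behave as a per-worker variable? (My understanding is that with spark.python.worker.reuse enabled, which is the default, the same Python process handles many partitions, but please correct me if that's wrong.)

```python
import os

# Module-level state: one copy per Python worker process on each
# executor -- not shared across the cluster, and not reset per
# partition as long as the worker process is reused.
seen = 0

def count_partition(records):
    global seen
    seen += sum(1 for _ in records)
    # Appears in the executor logs, not on the driver
    print("pid=%d total records seen by this worker: %d" % (os.getpid(), seen))

dstream.foreachRDD(lambda rdd: rdd.foreachPartition(count_partition))
```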