#StackBounty: #python #apache-spark #pyspark Use parallelize function over python objects

Bounty: 50

Is it possible in pyspark to use the parallelize function over Python objects? I want to run in parallel over a list of objects, modify each one with a function, and then print the results.

from pyspark.sql import SparkSession

def init_spark(appname):
    spark = SparkSession.builder.appName(appname).getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def run_on_configs_spark(object_list):
    spark, sc = init_spark(appname="analysis")
    # distribute the objects, transform each one, and print the results
    p_configs_RDD = sc.parallelize(object_list)
    p_configs_RDD = p_configs_RDD.map(func)
    p_configs_RDD.foreach(print)

def func(obj):
    return do_something(obj)  # do_something is a placeholder for the real per-object work

When I run the above code, I get the error "AttributeError: Can't get attribute 'Object' on <module 'pyspark.daemon' from …>". How can I solve it?
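
My guess (I am not certain, since the full traceback is cut off above) is that the workers cannot unpickle the instances: pyspark pickles each element of the RDD and ships it to the worker processes, and those processes must be able to import the Object class in order to rebuild the instances. A class defined only in the driver script or an interactive session is not importable there. A minimal sketch of that idea, using the hypothetical module name my_objects.py for a separate file that holds the class and is shipped to the workers:

# my_objects.py -- hypothetical separate module holding the class
class Object:
    def __init__(self, value):
        self.value = value

# driver script
from pyspark.sql import SparkSession
from my_objects import Object  # importable on the driver...

spark = SparkSession.builder.appName("analysis").getOrCreate()
sc = spark.sparkContext
sc.addPyFile("my_objects.py")  # ...and shipped to the workers too
rdd = sc.parallelize([Object(1), Object(2)]).map(lambda o: o.value)
print(rdd.collect())  # [1, 2]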

In the meantime I did the following workaround, but I don't think it is a good solution in general, and it assumes I can change the object's constructor.

I converted each object into a dictionary and then reconstructed the object from that dictionary:

def init_spark(appname):
    spark = SparkSession.builder.appName(appname).getOrCreate()
    sc = spark.sparkContext
    return spark, sc

def run_on_configs_spark(object_list):
    spark, sc = init_spark(appname="analysis")
    # __dict__ is an attribute, not a method, so no call parentheses
    p_configs_RDD = sc.parallelize([x.__dict__ for x in object_list])
    p_configs_RDD = p_configs_RDD.map(func)
    p_configs_RDD.foreach(print)

def func(d):
    # rebuild the object from its dict representation on the worker
    obj = Object(create_from_dict=True, dictionary=d)
    return do_something(obj)

In the constructor of the Object class:

class Object:
    def __init__(self, create_from_dict=False, dictionary=None, **other_params):
        if create_from_dict:
            self.__dict__.update(dictionary)
            return
        # ... normal construction from other_params ...
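
For illustration, the dict round trip with this constructor looks like the following sketch, where some_object stands for any instance that was built the normal way:

# flatten an existing instance into a plain, picklable dict...
as_dict = some_object.__dict__
# ...and rebuild an equivalent instance from it on the other side
rebuilt = Object(create_from_dict=True, dictionary=as_dict)
assert rebuilt.__dict__ == some_object.__dict__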

Are there any better solutions?

