

(py)Spark Basics

focalpoint 2022. 10. 6. 09:44

In this post, I will review the basics of Spark, especially pySpark.

Spark is a framework for handling big data, and its great strength is distributed processing across multiple nodes.

To install pyspark, simply run 'pip install pyspark'.

 

For demonstration, I will use 'heart.csv' dataset from https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset.

 

Now, let's get down to the code.

In pyspark, we can easily start a Spark session with 'SparkSession'. The getOrCreate() method creates a new session with a given name, or retrieves an existing one with that name.

* 'local[*]' in master means that we are running Spark on a local machine, using all available cores.

 

Dealing with a Spark dataframe is pretty similar to working with pandas.

You can read csv data like below.

df = spark.read.csv('heart.csv', header=True, inferSchema=True)
df

 

We can check the data's schema with the following.

 

If you are familiar with pandas, you will discover that many methods in pyspark are actually pretty similar to it.

 

To generate a partial dataframe, or to 'select' columns, use the following syntax.

 

To add columns, use the 'withColumn' method.

 

To drop columns, use 'drop'.

 

To drop null values, or to fill them with specific values, use 'na.drop' and 'na.fill'.

 

Or you can impute values with "Imputer" like below.

 

Like pandas or SQL databases, you can filter your dataframe with 'filter'.

 

You can go with either 'age <= 50' or df['age'] <= 50. Plus, don't forget to put parentheses around each condition when combining multiple conditions!

 

"groupBy" is so useful when you're up to aggregate values with respect to particular categories.Use aggregate functions (max, min, mean, sum, count, ...) with groupBy.

 

* One thing is for sure: men are more vulnerable to heart disease.

 

pySpark provides so many valuable tools for handling your data. On top of that, it also provides basic but handy machine learning models!

 

Let's get back to our dataset and see how it looks.

Seems like there are 11 features and 1 label (HeartDisease).

But for now, we'll use 'Age', 'Cholesterol', and 'MaxHR' in our model. This is just a random selection of features that have 'int' data type. If you decide to use all the features, you'll first have to apply one-hot encoding / ordinal encoding to the features that have 'string' data type.

 

With 'VectorAssembler', we can easily make a vector out of features!

 

Now the training part: we will train the simplest model in ML, linear regression. Yes, I know it feels a bit lame, but you should always start with a simple model!

 

MSE is pretty large, but it's fair because we only used a few features in our model.

That's all for this post. Please refer to the attached .ipynb file for details!

Notebook1.ipynb