PySpark (Create DataFrame, StructType, StructField)
Let’s talk about DataFrame 101. When you do data analysis, the first and often most painful step is getting a well-structured data set. Today, I will show how to create a DataFrame and insert data into it.
In PySpark, StructType is a class that represents the schema, or structure, of a DataFrame. It is used to define the data type of each column. A StructType is a collection of StructFields, where each StructField represents one field of the DataFrame.
A StructField is a class that represents a single field, or column, of a DataFrame. Its three main properties are name, dataType, and nullable.
- name: A string that specifies the name of the field or column.
- dataType: A DataType object that specifies the data type of the field. PySpark supports various data types such as StringType, IntegerType, DoubleType, BooleanType, ArrayType, MapType, StructType, and more.
- nullable: A boolean value that specifies whether the field can contain null values or not.
Here’s an example of how to create a StructType schema and define StructFields for a DataFrame:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define the schema
my_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])