PySpark (Create DataFrame, StructType, StructField)

Park Sehun
2 min read · Apr 22, 2023


Let’s talk about DataFrame 101. In data analysis, the most basic and often most troublesome task is getting a well-structured data set. Today, I will show how to create a DataFrame and insert data into it.

In PySpark, StructType is a class that represents a schema or a structure of a DataFrame. It is used to define the data type of each column in the DataFrame. StructType is a collection of StructFields, where each StructField represents a field of the DataFrame.

A StructField is a class that represents a single field, or column, of a DataFrame. It has three main properties: name, dataType, and nullable.

  • name: A string that specifies the name of the field or column.
  • dataType: A DataType object that specifies the data type of the field. PySpark supports various data types such as StringType, IntegerType, DoubleType, BooleanType, ArrayType, MapType, StructType, and more.
  • nullable: A boolean value that specifies whether the field can contain null values or not.

Here’s an example of how to create a StructType schema and define StructFields for a DataFrame:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the schema
my_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])
