Introduction
- Typical painful workflow
- Productionalizing advanced analytics models is hard
- Why Scala?
- Who should read this book?
- Is this book for data engineers or data scientists?
- Beautiful Spark philosophy
- DataFrames vs. RDDs
- Spark streaming
- Machine learning
- The “coalesce test” for evaluating learning resources
- Will we cover the entire Spark SQL API?
- How this book is organized
- Spark programming levels
- Note about Spark versions
Running Spark Locally
- Starting the console
- Running Scala code in the console
- Accessing the SparkSession in the console
- Console commands
Databricks Community
- Creating a notebook and cluster
- Running some code
- Next steps
Introduction to DataFrames
- Creating DataFrames
- Adding columns
- Filtering rows
- More on schemas
- Creating DataFrames with createDataFrame()
- Next Steps
Working with CSV files
- Reading a CSV file into a DataFrame
- Writing a DataFrame to disk
- Reading CSV files in Databricks Notebooks
Just Enough Scala for Spark Programmers
- Scala function basics
- Currying functions
- object
- trait
- package
- Implicit classes
- Next steps
Column Methods
- A simple example
- Instantiating Column objects
- gt
- substr
- + operator
- lit
- isNull
- isNotNull
- when / otherwise
- Next steps
Introduction to Spark SQL functions
- High level review
- lit() function
- when() and otherwise() functions
- Writing your own SQL function
- Next steps
User Defined Functions (UDFs)
- Simple UDF example
- Using Column Functions
- Conclusion
Chaining Custom DataFrame Transformations in Spark
- Dataset Transform Method
- Transform Method with Arguments
Whitespace data munging with Spark
- trim(), ltrim(), and rtrim()
- singleSpace()
- removeAllWhitespace()
- Conclusion
Defining DataFrame Schemas with StructField and StructType
- Defining a schema to create a DataFrame
- StructField
- Defining schemas with the :: operator
- Defining schemas with the add() method
- Common errors
- LongType
- Next steps
Different approaches to manually create Spark DataFrames
- toDF
- createDataFrame
- createDF
- How we’ll create DataFrames in this book
Dealing with null in Spark
- What is null?
- Spark uses null by default sometimes
- nullable Columns
- Native Spark code
- Scala null Conventions
- User Defined Functions
- Spark Rules for Dealing with null
Using JAR Files Locally
- Starting the console with a JAR file
- Adding JAR file to an existing console session
- Attaching JARs to Databricks clusters
- Review
Working with Spark ArrayType columns
- Scala collections
- Splitting a string into an ArrayType column
- Directly creating an ArrayType column
- array_contains
- explode
- collect_list
- Single column array functions
- Generic single column array functions
- Multiple column array functions
- Split array column into multiple columns
- Closing thoughts
Working with Spark MapType Columns
- Scala maps
- Creating MapType columns
- Fetching values from maps with element_at()
- Appending MapType columns
- Creating MapType columns from two ArrayType columns
- Converting Arrays to Maps with Scala
- Merging maps with map_concat()
- Using StructType columns instead of MapType columns
- Writing MapType columns to disk
- Conclusion
Adding StructType columns to DataFrames
- StructType overview
- Appending StructType columns
- Using StructTypes to eliminate order dependencies
- Order dependencies can be a big problem in large Spark codebases
Working with dates and times
- Creating DateType columns
- year(), month(), dayofmonth()
- minute(), second()
- datediff()
- date_add()
- Next steps
Performing operations on multiple columns with foldLeft
- foldLeft review in Scala
- Eliminating whitespace from multiple columns
- snake_case all columns in a DataFrame
- Wrapping foldLeft operations in custom transformations
- Next steps
Equality Operators
- ===
Introduction to Spark Broadcast Joins
- Conceptual overview
- Simple example
- Analyzing physical plans of joins
- Eliminating the duplicate city column
- Diving deeper into explain()
- Next steps
Partitioning Data in Memory
- Intro to partitions
- coalesce
- Increasing partitions
- repartition
- Differences between coalesce and repartition
- Real World Example
Partitioning on Disk with partitionBy
- Memory partitioning vs. disk partitioning
- Simple example
- partitionBy with repartition(5)
- partitionBy with repartition(1)
- Partitioning datasets with a max number of files per partition
- Partitioning datasets with max rows per file
- Partitioning datasets with max rows per file pre Spark 2.2
- Small file problem
- Conclusion
Fast Filtering with Spark PartitionFilters and PushedFilters
- Normal DataFrame filter
- partitionBy()
- PartitionFilters
- PushedFilters
- Partitioning in memory vs. partitioning on disk
- Disk partitioning with skewed columns
- Next steps
Scala Text Editing
- Syntax highlighting
- Import reminders
- Import hints
- Argument type checking
- Flagging unnecessary imports
- When to use text editors and Databricks notebooks?
Structuring Spark Projects
- Project name
- Package naming convention
- Typical library structure
- Applications
Introduction to SBT
- Sample code
- Running SBT commands
- build.sbt
- libraryDependencies
- sbt test
- sbt doc
- sbt console
- sbt package / sbt assembly
- sbt clean
- Next steps
Managing the SparkSession, The DataFrame Entry Point
- Accessing the SparkSession
- Example of using the SparkSession
- Creating a DataFrame
- Reading a DataFrame
- Creating a SparkSession
- Reusing the SparkSession in the test suite
- SparkContext
- Conclusion
Testing Spark Applications
- Hello World Example
- Testing a User Defined Function
- A Real Test
- How Testing Improves Your Codebase
- Running a Single Test File
Environment Specific Config in Spark Scala Projects
- Basic use case
- Environment specific code antipattern
- Overriding config
- Setting the PROJECT_ENV variable for test runs
- Other implementations
- Next steps
Building Spark JAR Files with SBT
- JAR File Basics
- Building a Thin JAR File
- Building a Fat JAR File
- Next Steps
Shading Dependencies in Spark Projects with SBT
- When shading is useful
- How to shade the spark-daria dependency
- Conclusion
Dependency Injection with Spark
- Code with a dependency
- Injecting a path
- Injecting an entire DataFrame
- Conclusion
Broadcasting Maps
- Simple example
- Refactored code
- Building Maps from data files
- Conclusion
Validating Spark DataFrame Schemas
- Custom Transformations Refresher
- A Custom Transformation Making a Bad Assumption
- Column Presence Validation
- Full Schema Validation
- Documenting DataFrame Assumptions is Especially Important for Chained DataFrame Transformations
- Conclusion