Introduction
- Messy data
- Nightmare deploys
- Empower refactoring
- Tests encourage code that doesn’t have side effects
- Identifying code bottlenecks
- Test suites document behavior
- Technologies used
Testing Scala with Scalatest
- Writing a simple test
- Directory organization
- build.sbt
- More tests
- Running tests and configuring output
- assertThrows
- assertDoesNotCompile
- Other assertions
- Other test formats
- Test library alternatives
- Testing Spark applications
- Next steps
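As a taste of the Scalatest chapter above, here is a minimal sketch of a test suite, assuming ScalaTest 3.x and its AnyFunSuite style (the class and test names are illustrative):

```scala
// Requires a ScalaTest 3.x dependency in build.sbt,
// e.g. "org.scalatest" %% "scalatest" % "<version>" % Test
import org.scalatest.funsuite.AnyFunSuite

class CalculatorSpec extends AnyFunSuite {

  test("addition works as expected") {
    assert(2 + 2 === 4)
  }

  test("integer division by zero throws") {
    assertThrows[ArithmeticException] {
      42 / 0
    }
  }
}
```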
Column Equality Tests
- Custom DataFrame Transformation Refresher
- Spark project setup
- assertColumnEquality with spark-fast-tests
- Conclusion
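A sketch of the column-equality style this chapter describes, assuming spark-fast-tests' ColumnComparer trait (package path and version vary by release; the data is hand-built):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}
import org.scalatest.funsuite.AnyFunSuite
import com.github.mrpowers.spark.fast.tests.ColumnComparer

class GreetingSpec extends AnyFunSuite with ColumnComparer {

  lazy val spark: SparkSession =
    SparkSession.builder().master("local").appName("spec").getOrCreate()
  import spark.implicits._

  test("the greeting column matches the expected greeting") {
    val df = Seq(
      ("alice", "hello alice"),
      ("bob", "hello bob")
    ).toDF("name", "expected_greeting")
      .withColumn("greeting", concat(lit("hello "), col("name")))

    // Compares the two columns row by row and reports any mismatches
    assertColumnEquality(df, "greeting", "expected_greeting")
  }
}
```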
Quieting Test Output
- Customizing test suite output
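One common way to quiet the console, as a sketch: lower the log level on the shared SparkSession (editing log4j.properties is the other usual approach):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("spec").getOrCreate()
// Only show errors while the tests run, instead of pages of INFO logs
spark.sparkContext.setLogLevel("ERROR")
```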
Creating DataFrames for Tests
- toDF
- createDataFrame
- createDF
- Including spark-daria in your projects
- Next steps
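A REPL-style sketch contrasting the three approaches this chapter walks through; the createDF call assumes the spark-daria library is on the classpath:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local").appName("example").getOrCreate()
import spark.implicits._

// toDF: terse, but column types and nullability are inferred
val df1 = Seq((1, "a"), (2, "b")).toDF("num", "letter")

// createDataFrame: explicit schema, but verbose
val df2 = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b"))),
  StructType(List(
    StructField("num", IntegerType, true),
    StructField("letter", StringType, true)
  ))
)

// createDF (spark-daria): explicit schema with far less boilerplate
// import com.github.mrpowers.spark.daria.sql.SparkSessionExt._
// val df3 = spark.createDF(
//   List((1, "a"), (2, "b")),
//   List(("num", IntegerType, true), ("letter", StringType, true))
// )
```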
DataFrame Equality Tests
- Simple example
- assertSmallDataFrameEquality error messages
- Next steps
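A sketch of the simple example this chapter builds, assuming spark-fast-tests' DataFrameComparer trait:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite
import com.github.mrpowers.spark.fast.tests.DataFrameComparer

class SortSpec extends AnyFunSuite with DataFrameComparer {

  lazy val spark: SparkSession =
    SparkSession.builder().master("local").appName("spec").getOrCreate()
  import spark.implicits._

  test("sorting by word returns the expected DataFrame") {
    val actualDF = Seq("banana", "apple").toDF("word").sort("word")
    val expectedDF = Seq("apple", "banana").toDF("word")

    // Collects both DataFrames to the driver, so only use it on small test data
    assertSmallDataFrameEquality(actualDF, expectedDF)
  }
}
```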
Running Tests
- Running from the SBT console
- Running a single test file
- Running a single test
- Best workflow
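The sbt console commands this chapter covers look roughly like this (suite and test names are placeholders):

```
sbt                                    # open the sbt console once and keep it open
> test                                 # run every suite
> testOnly *GreetingSpec               # run a single test file
> testOnly *GreetingSpec -- -z "null"  # run only tests whose name contains "null"
> ~testOnly *GreetingSpec              # rerun the file whenever a source file changes
```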
Approximate Equality
- Difference between double, float and decimal
- When assertColumnEquality falls short
- assertFloatTypeColumnEquality to the rescue
- assertApproximateDataFrameEquality
- Conclusion
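The gist of this chapter in a sketch: exact comparison of floating-point columns is brittle, so spark-fast-tests offers approximate variants. The method names come from the section titles above; the exact signatures (notably the precision arguments) are assumptions, so check the library version you use:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite
import com.github.mrpowers.spark.fast.tests.{ColumnComparer, DataFrameComparer}

class ApproxSpec extends AnyFunSuite with ColumnComparer with DataFrameComparer {

  lazy val spark: SparkSession =
    SparkSession.builder().master("local").appName("spec").getOrCreate()
  import spark.implicits._

  test("float columns are compared with a precision, not exactly") {
    val df = Seq(
      (1.3f, 1.3000001f),
      (5.0f, 5.0f)
    ).toDF("num", "expected_num")

    // The last argument is the tolerated difference between the two columns
    assertFloatTypeColumnEquality(df, "num", "expected_num", 0.0001f)
  }

  test("whole DataFrames can be compared approximately too") {
    val actualDF = Seq(1.3000001, 5.0).toDF("num")
    val expectedDF = Seq(1.3, 5.0).toDF("num")
    assertApproximateDataFrameEquality(actualDF, expectedDF, 0.01)
  }
}
```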
Testing User Defined Functions
- Creating a UDF
- Testing a UDF
- Check the UDF fails with null input
- The billion dollar mistake
- Verifying test failure in the test suite
- Next steps
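A sketch of a null-safe UDF and the kind of test this chapter describes (the function names are illustrative). The Option guard is what protects against null, "the billion dollar mistake":

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import org.scalatest.funsuite.AnyFunSuite
import com.github.mrpowers.spark.fast.tests.ColumnComparer

class LowerCaseUdfSpec extends AnyFunSuite with ColumnComparer {

  lazy val spark: SparkSession =
    SparkSession.builder().master("local").appName("spec").getOrCreate()
  import spark.implicits._

  // Option(s) converts a null input to None instead of blowing up
  // with a NullPointerException.
  def lowerCase(s: String): Option[String] = Option(s).map(_.toLowerCase)
  val lowerCaseUdf = udf(lowerCase _)

  test("lowerCaseUdf downcases strings and passes null through") {
    val df = Seq(
      ("HeLLo", "hello"),
      (null, null)
    ).toDF("word", "expected_word")
      .withColumn("lower_word", lowerCaseUdf(col("word")))

    assertColumnEquality(df, "lower_word", "expected_word")
  }
}
```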
Testing Spark Column Functions
- Simple example
- How Spark functions handle null
- Important takeaway
- Why print DataFrames from the test suite?
- Next steps
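A sketch of a reusable column function (the name is illustrative). Unlike a naive UDF, expressions built from native Spark functions propagate null on their own, without an explicit Option guard:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lower, trim}

// Composed entirely of built-in functions, so null input simply yields null output
def standardizedName(name: Column): Column =
  lower(trim(name))

// In a test: df.withColumn("clean_name", standardizedName(col("name")))
// followed by assertColumnEquality(df, "clean_name", "expected_clean_name")
```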
Testing Filesystem Reads
- Untestable code
- Setting the path as a param
- Testing with the config pattern
- Elegant testing with dependency injection
- Abstracting the custom transformation to a separate function
- Next steps
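The dependency-injection idea from this chapter in a sketch (transformation, path, and column names are hypothetical): instead of hard-coding a filesystem read inside the transformation, pass the extra DataFrame in, so a test can supply a small hand-built one:

```scala
import org.apache.spark.sql.DataFrame

// Hard to test: the production path is baked into the transformation
// def withCountryData()(df: DataFrame): DataFrame =
//   df.join(spark.read.parquet("s3a://some-prod-bucket/countries"), Seq("country_id"))

// Easy to test: the caller injects whatever DataFrame it wants
def withCountryData(countryDF: DataFrame)(df: DataFrame): DataFrame =
  df.join(countryDF, Seq("country_id"))

// Production: df.transform(withCountryData(spark.read.parquet(prodPath)))
// Tests:      df.transform(withCountryData(tinyHandBuiltCountryDF))
```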
Testing Filesystem Writes
- Simple example
- Rude tests leave garbage behind
- Performance considerations
- Next steps
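A sketch of a polite write test: write to a temporary path, read it back, assert, then delete the output so the test does not leave garbage behind (the path is illustrative; cleanup uses the Hadoop FileSystem API that ships with Spark):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class WriterSpec extends AnyFunSuite {

  lazy val spark: SparkSession =
    SparkSession.builder().master("local").appName("spec").getOrCreate()
  import spark.implicits._

  test("writing and re-reading Parquet round-trips the data") {
    val outputPath = "tmp/writer_spec_output"
    try {
      Seq((1, "a"), (2, "b")).toDF("num", "letter")
        .write.mode("overwrite").parquet(outputPath)

      assert(spark.read.parquet(outputPath).count() === 2)
    } finally {
      // Clean up so repeated runs start from a blank slate
      val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
      fs.delete(new Path(outputPath), true)
    }
  }
}
```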
Identifying Bottlenecks
- Let’s find the bottleneck
- Benchmarking individual transformations
- Contrived but representative
- Conclusion
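A simple timing helper of the kind this chapter uses to benchmark individual transformations (the helper name is an assumption; an action such as count() is needed to force Spark's lazy evaluation):

```scala
// Times an arbitrary block and prints how long it took
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"$label took $elapsedMs%.1f ms")
  result
}

// Usage sketch: benchmark each transformation separately
// time("withGreeting")   { df.transform(withGreeting()).count() }
// time("withCleanNames") { df.transform(withCleanNames()).count() }
```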
Organizing Tests
- Some pure Scala
- Poorly organized Spark tests
- Test suite rules to follow
- Good Spark test organization
- Tests should be descriptive and document behavior
- Quantifying performance difference
- Next steps
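One organizational pattern this chapter's rules point to, sketched here: every suite mixes in a trait that lazily creates a single shared SparkSession, so the expensive session is built once for the whole run (the trait name is conventional, not mandated):

```scala
import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {

  // lazy + shared: the session is created once, the first time any suite touches it
  lazy val spark: SparkSession = {
    SparkSession
      .builder()
      .master("local")
      .appName("spark test suite")
      .getOrCreate()
  }
}

// Each suite then starts like:
// class SomethingSpec extends AnyFunSuite with SparkSessionTestWrapper with ColumnComparer
```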
Test Suite Configuration
- Shuffle partitions
- javaOptions
- Conclusion
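A sketch of the two knobs this chapter covers: sbt settings in build.sbt (fork the test JVM and give it memory via javaOptions) and fewer shuffle partitions on the shared session, since the default of 200 is overkill for tiny test data. The exact values are assumptions:

```scala
// build.sbt
Test / fork := true
Test / javaOptions ++= Seq("-Xms512M", "-Xmx2048M")

// In the shared SparkSession builder
// SparkSession.builder()
//   .master("local")
//   .config("spark.sql.shuffle.partitions", "1")
//   .getOrCreate()
```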
Testing Aggregations
- groupBy refresher
- groupBy with two columns
- groupBy with filters
- Conclusion
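As a refresher for this last chapter, a sketch of an aggregation wrapped in a function so it can be tested like any other transformation (the names are illustrative):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, sum}

// Wrapping the groupBy in a function makes it easy to feed a tiny
// hand-built DataFrame in a test and compare against an expected result.
def goalsPerPlayer(df: DataFrame): DataFrame =
  df.groupBy("player")
    .agg(
      sum("goals").as("total_goals"),
      avg("goals").as("avg_goals")
    )

// In a test:
// assertSmallDataFrameEquality(matchesDF.transform(goalsPerPlayer).sort("player"), expectedDF)
```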