The Data Science Bootstrap Notes
- Where I think you, the reader, are coming from
- Things you’ll learn
- Apply these ideas just-in-time
- Changes from the first edition
- Ways to support the project
Philosophies
- How these philosophies work together
You should know your computing stack
- See this philosophy in action
Automate and standardize everywhere possible
- See this philosophy in action
You should always know the source of truth
- See this philosophy in action
Categorize everything that you can
- See this philosophy in action
Putting it all together
- How you’ll see these philosophies in action
Set up your machine
- Why this is important
- What you’ll learn in this section
Configure your shell
- Install Starship
- Configure environment variables
- Create shell aliases
- Troubleshooting
- Quick Reference
Install and configure system-wide software
- Install package managers
- Install software
- Configure your PATH
- TL;DR: Quick installation commands
- Troubleshooting
- Quick Reference
Install and configure Git on your machine
- Why do we need Git
- How to install Git
- How to configure Git with basic information
- How to configure Git with fancy features
- Troubleshooting
- Productivity Tip: Shell Aliases
- Quick Reference
Install uv to manage and install Python-based command line tools
- Further reading
Install Homebrew on your Mac (fallback package manager)
- Why install Homebrew?
- When to use Homebrew
- How to install Homebrew
- Using Homebrew on Linux
- See also
Install and configure direnv for environment management
- Why we need direnv
- How to install direnv
- How to configure direnv
- Loading .env files automatically (direnv >= 2.31.0)
- Troubleshooting
Leverage dotfiles to get your machine configured quickly
- Why create a dotfiles repository
- How to structure a dotfiles repository
- Examples and resources
Configure VSCode for maximum productivity
- How do I access VSCode settings?
- What built-in settings have transformed my workflow?
- What extensions have actually improved my productivity?
- What keyboard shortcuts do I actually use?
- How do I handle project-specific settings?
- What about collaborative coding?
- Remember: start simple, grow gradually
Master your shell for data science productivity
- Why shell mastery matters for data scientists
- What you’ll learn in this section
Take full control of your shell environment variables
- Why control your environment variables
- How do I control my environment variables
Create shell command aliases for your commonly used commands
- Why create shell aliases
- How to create aliases
- Where to store these aliases
- Useful aliases to get started
- Git aliases cheat sheet
- Port management aliases
- Enhancing built-in commands with functions
Shell commands cheat sheet
- Basic navigation and file operations
- File permissions and ownership
- Process management
- Network and system info
- Archive and compression
- Git shortcuts
- Text processing
- Advanced patterns
- Time-saving tips
Shell-based text editors
- Why should I care about shell editors?
- What do I actually need to learn?
- My recommendation: start with nano
- What about vim and emacs?
- My philosophy on shell editors
- Getting started
Manage and configure your projects
- Follow the 1:1:1:1… rule
- When can we break this rule
Start with a sane repository structure
- How to structure a standard repository
- Automate the scaffolding of new projects
Use pixi for maximally ergonomic and reproducible environments
- The Cheat Sheet of pixi commands
- Long-term reproducibility through lock files
- Composable multi-environment projects
Structure your source code repository sanely
- Phase 1: Initial Exploration
- Phase 2: Emerging Patterns
- Phase 3: One-off Scripts
- Phase 4: Production Structure
- Leveraging AI Assistants
- Core Development Principles
Store your project documentation in your project repository
- Introduction to the Diataxis framework
- When to add documentation
- Code comments as documentation
- AI assistance in documentation
- Reference documentation
- Automating documentation with CI/CD
Use CI/CD to automate tasks
- Key Concepts of CI/CD
- Environment Considerations
- Configuration and Environment Variables
- Leveraging Pixi Environment
- Practical Examples with GitHub Actions
Use data catalogs to manage data
- What are data catalogs?
- Traditional data catalog examples
- Modern ML data storage with xarray and zarr
- What are the advantages of data catalogs?
- When should you use data catalogs?
Choose your data formats wisely
- The binary vs text format decision
- The hybrid approach: Binary source, text derivatives
- High-dimensional data: The xarray advantage
- Format-specific recommendations
- How to implement this in practice
- The bottom line
Take advantage of uv for one-off projects
- Using PEP 723 for script dependencies
- Self-contained notebooks
- Benefits of this approach
- Real-world examples
- Conclusion
Configuration files guide
- Why do we even need all these config files?
- What are the core configuration files you should know about?
- What about documentation configuration?
- How do I handle environment variables?
- What’s the quick reference for which tools use which files?
- What are my best practices for configuration files?
- How do I get started with all this?
Set environment variables in a .env file
- Why configure environment variables per project
- How to configure environment variables for your project
Name things consistently
- What constitutes a “sane” name?
Skills for Effective Data Science
- Core Technical Skills
- Effective Ways of Working
How to write software tests
- Why should I bother writing tests?
- How do I actually write tests?
- What do real tests look like?
- What about testing data assumptions?
- Don’t forget to test error conditions
- How do I actually run these tests?
- What are some advanced patterns worth knowing?
- How do I organize my tests?
- How do I know if I’m testing enough?
- How does this fit into my development workflow?
- What mistakes should I avoid?
- How do I get started?
Refactor code
Working with AI tools
- The speed of thought
- The right kind of lazy
- Effective patterns for AI interaction
- Beyond code generation
- Moving forward
Collaborating on Data Science Projects
- The power of pair programming
- Leveraging AI in collaborative work
- The science in data science projects
- Effective work distribution
- Handling merge conflicts without the drama
- The art of managing unproductive patches
Use notebooks effectively
- Choose Marimo over Jupyter for reactivity
- Notebooks as prototyping tools, not production code
- Data access best practices
- Scratch pad vs. Report-style notebooks
- Refactor with the help of AI
- Publish notebooks and strip outputs before committing
- Jupyter hygiene (if you must use Jupyter)
Looking back, moving forward
- What we’ve covered together
- The journey ahead
- A personal note
- Where to go from here
- Final thoughts