Data Science Bootstrap [Leanpub PDF/iPad/Kindle]

Hey there, data scientist! Have you found yourself struggling with your compute environment: package conflicts, the cryptic term "environment variables", the strange concept of "containers", and what feels like an entire stack of technologies to learn? Or have you been lost among your many projects, unable to get organized? This book, a linearized form of my freely available knowledge base, is a concise and practical guide to getting up to speed and organized, so you can do your best data science work.

The content in this book has been battle-tested since 2013, when I first switched from bench science to computational science. You'll benefit from my distilled experience: I learned the hard lessons that come from not applying "best data science practices" to my projects, so you can avoid the same mistakes. You'll also benefit from the many times I have guided colleagues and data practitioners who were new to data science through getting their systems set up. You'll see the backing philosophies, the "whys" that explain why we ought to do things a certain way, as well as concrete manifestations of those "whys".

End the struggle with getting your computer to do what you want. Gain full control over it instead. Come and learn how!

The Data Science Bootstrap Notes

Where I think you, the reader, are coming from
Things you’ll learn
Apply these ideas just-in-time
Changes from the first edition
Ways to support the project

Philosophies

How these philosophies work together

You should know your computing stack

See this philosophy in action

Automate and standardize everywhere possible

See this philosophy in action

You should always know the source of truth

See this philosophy in action

Categorize everything that you can

See this philosophy in action

Putting it all together

How you’ll see these philosophies in action

Set up your machine

Why this is important
What you’ll learn in this section

Configure your shell

Install Starship
Configure environment variables
Create shell aliases
Troubleshooting
Quick Reference

Install and configure system-wide software

Install package managers
Install software
Configure your PATH
TL;DR: Quick installation commands
Troubleshooting
Quick Reference

Install and configure Git on your machine

Why do we need Git
How to install Git
How to configure Git with basic information
How to configure Git with fancy features
Troubleshooting
Productivity Tip: Shell Aliases
Quick Reference

Install `uv` to manage and install Python-based command line tools

Further reading

Install Homebrew on your Mac (fallback package manager)

Why install Homebrew?
When to use Homebrew
How to install Homebrew
Using Homebrew on Linux
See also

Install and configure `direnv` for environment management

Why we need direnv
How to install direnv
How to configure direnv
Loading .env files automatically (direnv >= 2.31.0)
Troubleshooting

Leverage dotfiles to get your machine configured quickly

Why create a dotfiles repository
How to structure a dotfiles repository
Examples and resources

Configure VSCode for maximum productivity

How do I access VSCode settings?
What built-in settings have transformed my workflow?
What extensions have actually improved my productivity?
What keyboard shortcuts do I actually use?
How do I handle project-specific settings?
What about collaborative coding?
Remember: start simple, grow gradually

Master your shell for data science productivity

Why shell mastery matters for data scientists
What you’ll learn in this section

Take full control of your shell environment variables

Why control your environment variables
How do I control my environment variables

Create shell command aliases for your commonly used commands

Why create shell aliases
How to create aliases
Where to store these aliases
Useful aliases to get started
Git aliases cheat sheet
Port management aliases
Enhancing built-in commands with functions

Shell commands cheat sheet

Basic navigation and file operations
File permissions and ownership
Process management
Network and system info
Archive and compression
Git shortcuts
Text processing
Advanced patterns
Time-saving tips

Shell-based text editors

Why should I care about shell editors?
What do I actually need to learn?
My recommendation: start with nano
What about vim and emacs?
My philosophy on shell editors
Getting started

Manage and configure your projects

Follow the 1:1:1:1… rule
When can we break this rule

Start with a sane repository structure

How to structure a standard repository structure
Automate the scaffolding of new projects

Use pixi for maximally ergonomic and reproducible environments

The Cheat Sheet of pixi commands
Long-term reproducibility through lock files
Composable multi-environment projects

Structure your source code repository sanely

Phase 1: Initial Exploration
Phase 2: Emerging Patterns
Phase 3: One-off Scripts
Phase 4: Production Structure
Leveraging AI Assistants
Core Development Principles

Store your project documentation in your project repository

Introduction to the Diataxis framework
When to add documentation
Code comments as documentation
AI assistance in documentation
Reference documentation
Automating documentation with CI/CD

Use CI/CD to automate tasks

Key Concepts of CI/CD
Environment Considerations
Configuration and Environment Variables
Leveraging Pixi Environment
Practical Examples with GitHub Actions

Use data catalogs to manage data

What are data catalogs?
Traditional data catalog examples
Modern ML data storage with xarray and zarr
What are the advantages of data catalogs?
When should you use data catalogs?

Choose your data formats wisely

The binary vs text format decision
The hybrid approach: Binary source, text derivatives
High-dimensional data: The xarray advantage
Format-specific recommendations
How to implement this in practice
The bottom line

Take advantage of `uv` for one-off projects

Using PEP723 for script dependencies
Self-contained notebooks
Benefits of this approach
Real-world examples
Conclusion

Configuration files guide

Why do we even need all these config files?
What are the core configuration files you should know about?
What about documentation configuration?
How do I handle environment variables?
What’s the quick reference for which tools use which files?
What are my best practices for configuration files?
How do I get started with all this?

Set environment variables in a `.env` file

Why configure environment variables per project
How to configure environment variables for your project

Name things consistently

What constitutes a “sane” name?

Skills for Effective Data Science

Core Technical Skills
Effective Ways of Working

How to write software tests

Why should I bother writing tests?
How do I actually write tests?
What do real tests look like?
What about testing data assumptions?
Don’t forget to test error conditions
How do I actually run these tests?
What are some advanced patterns worth knowing?
How do I organize my tests?
How do I know if I’m testing enough?
How does this fit into my development workflow?
What mistakes should I avoid?
How do I get started?

Refactor code

Working with AI tools

The speed of thought
The right kind of lazy
Effective patterns for AI interaction
Beyond code generation
Moving forward

Collaborating on Data Science Projects

The power of pair programming
Leveraging AI in collaborative work
The science in data science projects
Effective work distribution
Handling merge conflicts without the drama
The art of managing unproductive patches

Use notebooks effectively

Choose Marimo over Jupyter for reactivity
Notebooks as prototyping tools, not production code
Data access best practices
Scratch pad vs. Report-style notebooks
Refactor with the help of AI
Publish notebooks and strip outputs before committing
Jupyter hygiene (if you must use Jupyter)

Looking back, moving forward

What we’ve covered together
The journey ahead
A personal note
Where to go from here
Final thoughts
