1. Introduction
- Your own robot
- A book about not reading books
- I’m not a programmer
- PS: This isn’t a book
2. Scraper #1: Start scraping in 5 minutes
- How it works: functions and parameters
- What are the parameters? Strings and indexes
- Tables and lists?
- Recap
- Tests
3. Scraper #2: What happens when the data isn’t in a table?
- Strong structure: XML
- Scraping XML
- Recap
- Tests
4. Scraper #3: Looking for structure in HTML
- Detour: Introduction to HTML and the LIFO rule
- Attributes and values
- Classifying sections of content: div, span, classes and ids
- Back to Scraper #3: Scraping a <div> in an HTML webpage
- Recap
- Tests
5. Scraper #4: Finding more structure in webpages: XPath
- Recap
- Tests
6. Scraper #5: Scraping multiple pages with Google Drive
- Recap
- Tests
7. Scraper #6: Structure in URLs - using Open Refine
- Looking for structure in URLs
- Assembling the ingredients
- Grabbing ID codes or URLs from a website
- Using Open Refine as a scraper
- Grabbing the HTML for each page
- Extracting data from the raw HTML with parseHTML
- Using CSS selectors in scraping
- Understanding the results
- Recap
- Tests
8. Scraper #7: Scraping multiple pages with ‘next’ links using OutWit Hub
- Creating a basic scraper in OutWit Hub
- Scraping a series of pages
- Customised scrapers in OutWit
- Trying it out on a search for questions about health
- Recap
- Tests
9. Scraper #8: Poorly formatted webpages - solving problems with OutWit
- Identifying what structure there is
- Repeating a heading or other piece of data for each part within it
- Splitting a larger piece of data into bits: using separators
- Recap
- Tests
10. Scraper #9: Scraping uglier HTML and ‘regular expressions’ in an OutWit scraper
- Introducing Regex
- Using regex to specify a range of possible matches
- Catching the regular expression too
- I want any character: the wildcard and quantifiers
- Matching zero, one or more characters - quantifiers
- 3 questions: What characters, how many, where?
- Using regex on an ugly page
- What’s the pattern?
- Matching non-textual characters
- What if my data contains full stops, forward slashes or other special characters?
- ‘Anything but that!’ - negative matches
- This or that - looking for more than one regular expression at the same time
- Only here - specifying location
- Back to the scraper: grabbing the rest of the data
- Which dash? Negative matches in practice
- Recap
- Tests
11. Scrapers #10 and #11: Scraping hidden and ‘invisible’ data on a webpage: icons and ‘reveals’
- Scraping accessibility data on Olympic venues
- Hidden HTML
- Recap
- Tests
12. Scraper #12: An introduction to Python: adapting scraper code
- Python - you already know some of it
- The ScraperWiki Classic Archive
- Forking a scraper
- Introducing Morph.io
- Finding a scraper to clone
- How to copy a scraper into Morph.io
- Adapting the code
- How to copy a scraper into QuickCode
- Recap
- Tests
13. Scraper #13: Tracing the code - libraries and functions, and documentation in ScraperWiki
- Parent/child relationships
- Parameters (again)
- Detour: Variables
- Scraping tip #2: read code from right to left
- Back to Scraper #9
- Recap
- Tests
14. Scraper #13 continued: ScraperWiki’s tutorial scraper 2
- Scraping tip #3: follow the variables
- What are those variables?
- Detour: loops (for and while)
- Back to scraper #13: Storing the data
- Detour: Unique keys, primary keys, and databases
- A unique key can’t be empty: fixing the error
- Summing up the scraper
- Recap
- Tests
15. Scraper #14: Adapting the code to scrape a different webpage
- Dealing with errors
- Recap
- Tests
16. Scraper #15: Scraping multiple cells and pages
- Colour coding
- Creating your own functions: def
- If statements - asking a question
- Numbers in square brackets? Indexes again!
- Attributes
- scraperwiki.datastore generates an error
- Scraping tip #4: follow the functions
- Scraping tip #5: Read from bottom to top
- Recap
- Tests
17. Scraper #16: Adapting your third scraper: creating more than one column of data
- Recap
- Tests
18. Scraper #17: Scraping a list of pages
- Iterating
- Tip: creating a list of items using the JOIN function in a spreadsheet
- Recap
- Tests
19. Scraper #18: Scraping a page - and the pages linked (badly) from it
- Using ranges to avoid errors
- Using len to test lists
- Problems with URLs
- Methods for changing text
- A second way of fixing bad URLs
- Other workarounds
- Recap
- Scraper tip: a checklist for understanding someone else’s code
20. Scraper #19: Scraping scattered data from multiple websites that share the same CMS
- Finding websites using the same content management system (CMS)
- Writing the scraper: looking at HTML structure
- Using if statements to avoid errors when data doesn’t exist
- Recap
- Tests
21. Scraper #20: Automating database searches (forms)
- Understanding URLs: queries and parameters
- When the URL doesn’t change
- Solving the cookie problem: Mechanize
- Recap
- Tests
22. Scraper #21: Storing the results of a search
- Recap
- Scraper tip: using print to monitor progress
- Tests
23. Scraper #22: Scraping PDFs part 1
- Detour: indexes and slicing shortcuts
- Back to the scraper
- Detour: operators
- Back to the scraper (again)
- Detour: the % sign explained
- Back to the scraper (again) (again)
- Running the code on a working URL
- Borrowing some code
- The first test run
- Fixing a unicode error
- Where’s the ‘view source’ on a PDF?
- Recap
- Tests
24. Scraper #23: Scraping PDFs part 2
- Scraping speed camera PDFs - welcome back to XPath
- Ifs and buts: measuring and matching data
- Recap
- Tests
25. Scraper #24: Scraping multiple PDFs
- The code
- Tasks 1 and 2: Find a pattern in the HTML and grab the links within
- XPath contains…
- The code: scraping more than one PDF
- The wrong kind of data: calculations with strings
- Putting square pegs in square holes: saving data based on properties
- Recap
- Tests
26. Scraper #25: Text, not tables, in PDFs - regex
- Starting the code: importing a regex library
- Scraping each PDF
- The UnicodeDecodeError and the AttributeError
- Storing the first pieces of information
- Why it’s a good idea to store line numbers
- re: Python’s regex library
- Other functions from the re library
- Back to the code
- Joining lists of items into a single string
- Finding all the links to PDF reports on a particular webpage
- Finding the PDF link on a page
- Detour: global variables and local variables
- The code in full
- Recap
- Tests
27. Scraper #26: Scraping CSV files
- The CSV library
- Process of elimination 1: putting blind spots in the code
- Process of elimination 2: amending the source data
- Encoding, decoding, extracting
- Removing the header row
- Ready to scrape multiple sheets
- Combining CSV files on your computer
- Recap
- Tests
28. Scraper #27: Scraping Excel spreadsheets part 1
- A library for scraping spreadsheets
- What can you learn from a broken scraper?
- Applying the scraper to a different spreadsheet
- But what is the scraper doing?
- Recap
- Tests
29. Scraper #28: Scraping Excel spreadsheets part 2: scraping one sheet
- Testing on one sheet of a spreadsheet
- Recap
- Tests
30. Scraper #28 continued: Scraping Excel spreadsheets part 3: scraping multiple sheets
- One dataset, or multiple ones
- Using header row values as keys
- Recap
- Tests
31. Scraper #28 continued: Scraping Excel spreadsheets part 4: dealing with dates in spreadsheets
- More string formatting: replacing bad characters
- Scraping multiple spreadsheets
- Loops within loops
- Scraper tip: creating a sandbox
- Recap
- Tests
32. Scraper #29: Writing scrapers for JSON and APIs
- If you’re API and you know it
- Dealing with JSON
- Adding our own code
- Storing the results
- Querying your own API
- Recap
- Tests
33. The final chapter: where do you go from here?
- The map is not the territory
- Recommended reading and viewing
- End != End

