Scraping for Journalists (2nd edition)

How to grab information from hundreds of sources, put it in data you can interrogate - and still hit deadlines

Learn how to grab information from dozens or even hundreds of webpages, documents or spreadsheets using a range of techniques, from simple web-based tools to full-on programming.

Available in PDF, EPUB and web formats · 1,638 readers · 272 pages · 89,066 words
About the Book

Scraping - getting a computer to capture information from online sources - is one of the most powerful techniques for data-savvy journalists who want to get to the story first, or find exclusives that no one else has spotted. Faster than FOI and more detailed than advanced search techniques, scraping also allows you to grab data that organisations would rather you didn’t have - and put it into a form that lets you get answers.

Scraping for Journalists introduces you to a range of scraping techniques - from very simple ones, no more complicated than a spreadsheet formula, to more complex challenges such as scraping databases or hundreds of documents. At every stage you'll see results - but you'll also be building towards more ambitious and powerful tools.

You’ll be scraping within 5 minutes of reading the first chapter - but more importantly you'll be learning key principles and techniques for dealing with scraping problems.

Unlike general books about programming languages, everything in this book has a direct application for journalism, and each principle of programming is related to its application in scraping for newsgathering. And unlike standalone guides and blog posts that cover particular tools or techniques, this book aims to give you skills that you can apply in new situations and with new tools.
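
To give a sense of that first five-minute scraper: the simplest technique in the book really is just a spreadsheet formula. As a sketch (assuming Google Sheets, which the book’s Google Drive chapter uses; the URL below is a placeholder rather than an example from the book), the IMPORTHTML function pulls a table or list from a webpage straight into a spreadsheet:

    =IMPORTHTML("http://example.com/data.html", "table", 1)

Its three parameters preview the ideas covered in chapter 2: IMPORTHTML is a function; the first parameter is a string containing the page’s URL; the second is a string saying whether to grab a “table” or a “list”; and the third is an index, a number identifying which table or list on that page you want.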


About the Author

Paul Bradshaw

Paul Bradshaw runs the MA in Data Journalism and the MA in Multiplatform and Mobile Journalism at Birmingham City University, where he is an associate professor. He publishes the Online Journalism Blog, and is the founder of the investigative journalism website HelpMeInvestigate. He has written for the Guardian and Telegraph’s data blogs, journalism.co.uk, Press Gazette, InPublishing, Nieman Reports and the Poynter Institute in the US. Formerly a Visiting Professor at City University’s School of Journalism in London, he is the author of the Online Journalism Handbook (now in its second edition), Magazine Editing (3rd edition) with John Morrish, and Mobile-First Journalism with Steve Hill. Other books Bradshaw has contributed to include Investigative Journalism (second edition); Web Journalism: A New Form of Citizenship; and Citizen Journalism: Global Perspectives.

His books on Leanpub include Scraping for Journalists, Finding Stories in Spreadsheets, the Data Journalism Heist, Snapchat for Journalists, and 8000 Holes: How the 2012 Olympic Torch Relay Lost its Way.

Bradshaw has been listed among Journalism.co.uk’s leading innovators in journalism and media, and among Poynter’s most influential people in social media. In 2010 he was shortlisted for Multimedia Publisher of the Year, and in 2016 he was part of a team that won the CNN MultiChoice African Journalist Awards.

In addition to teaching and writing, Paul acts as a consultant and trainer to a number of organisations on social media and data journalism. You can find him on Twitter at @paulbradshaw.

Leanpub Podcast, Episode 11: An Interview with Paul Bradshaw

Table of Contents

1. Introduction

  1. Your own robot
  2. A book about not reading books
  3. I’m not a programmer
  4. PS: This isn’t a book

2. Scraper #1: Start scraping in 5 minutes

  1. How it works: functions and parameters
  2. What are the parameters? Strings and indexes
  3. Tables and lists?
  4. Recap
  5. Tests

3. Scraper #2: What happens when the data isn’t in a table?

  1. Strong structure: XML
  2. Scraping XML
  3. Recap
  4. Tests

4. Scraper #3: Looking for structure in HTML

  1. Detour: Introduction to HTML and the LIFO rule
  2. Attributes and values
  3. Classifying sections of content: div, span, classes and ids
  4. Back to Scraper #3: Scraping a <div> in an HTML webpage
  5. Recap
  6. Tests

5. Scraper #4: Finding more structure in webpages: XPath

  1. Recap
  2. Tests

6. Scraper #5: Scraping multiple pages with Google Drive

  1. Recap
  2. Tests

7. Scraper #6: Structure in URLs - using Open Refine

  1. Looking for structure in URLs
  2. Assembling the ingredients
  3. Grabbing ID codes or URLs from a website
  4. Using Open Refine as a scraper
  5. Grabbing the HTML for each page
  6. Extracting data from the raw HTML with parseHTML
  7. Using CSS selectors in scraping
  8. Understanding the results
  9. Recap
  10. Tests

8. Scraper #7: Scraping multiple pages with ‘next’ links using OutWit Hub

  1. Creating a basic scraper in OutWit Hub
  2. Scraping a series of pages
  3. Customised scrapers in OutWit
  4. Trying it out on a search for questions about health
  5. Recap
  6. Tests

9. Scraper #8: Poorly formatted webpages - solving problems with OutWit

  1. Identifying what structure there is
  2. Repeating a heading or other piece of data for each part within it
  3. Splitting a larger piece of data into bits: using separators
  4. Recap
  5. Tests

10. Scraper #9: Scraping uglier HTML and ‘regular expressions’ in an OutWit scraper

  1. Introducing Regex
  2. Using regex to specify a range of possible matches
  3. Catching the regular expression too
  4. I want any character: the wildcard and quantifiers
  5. Matching zero, one or more characters - quantifiers
  6. 3 questions: What characters, how many, where?
  7. Using regex on an ugly page
  8. What’s the pattern?
  9. Matching non-textual characters
  10. What if my data contains full stops, forward slashes or other special characters?
  11. ‘Anything but that!’ - negative matches
  12. This or that - looking for more than one regular expression at the same time
  13. Only here - specifying location
  14. Back to the scraper: grabbing the rest of the data
  15. Which dash? Negative matches in practice.
  16. Recap
  17. Tests

11. Scrapers #10 and #11: Scraping hidden and ‘invisible’ data on a webpage: icons and ‘reveals’

  1. Scraping accessibility data on Olympic venues
  2. Hidden HTML
  3. Recap
  4. Tests

12. Scraper #12: An introduction to Python: adapting scraper code

  1. Python - you already know some of it
  2. The ScraperWiki Classic Archive
  3. Forking a scraper
  4. Introducing Morph.io
  5. Finding a scraper to clone
  6. How to copy a scraper into Morph.io
  7. Adapting the code
  8. How to copy a scraper into QuickCode
  9. Recap
  10. Tests

13. Scraper #13: Tracing the code - libraries and functions, and documentation in ScraperWiki

  1. Parent/child relationships
  2. Parameters (again)
  3. Detour: Variables
  4. Scraping tip #2: read code from right to left
  5. Back to Scraper #9
  6. Recap
  7. Tests

14. Scraper #13 continued: ScraperWiki’s tutorial scraper 2

  1. Scraping tip #3: follow the variables
  2. What are those variables?
  3. Detour: loops (for and while)
  4. Back to scraper #13: Storing the data
  5. Detour: Unique keys, primary keys, and databases
  6. A unique key can’t be empty: fixing the error
  7. Summing up the scraper
  8. Recap
  9. Tests

15. Scraper #14: Adapting the code to scrape a different webpage

  1. Dealing with errors
  2. Recap
  3. Tests

16. Scraper #15: Scraping multiple cells and pages

  1. Colour coding
  2. Creating your own functions: def
  3. If statements - asking a question
  4. Numbers in square brackets? Indexes again!
  5. Attributes
  6. scraperwiki.datastore generates an error
  7. Scraping tip #4: follow the functions
  8. Scraping tip #5: Read from bottom to top
  9. Recap
  10. Tests

17. Scraper #16: Adapting your third scraper: creating more than one column of data

  1. Recap
  2. Tests

18. Scraper #17: Scraping a list of pages

  1. Iterating
  2. Tip: creating a list of items using the JOIN function in a spreadsheet
  3. Recap
  4. Tests

19. Scraper #18: Scraping a page - and the pages linked (badly) from it

  1. Using ranges to avoid errors
  2. Using len to test lists
  3. Problems with URLs
  4. Methods for changing text
  5. A second way of fixing bad URLs
  6. Other workarounds
  7. Recap
  8. Scraper tip: a checklist for understanding someone else’s code

20. Scraper #19: Scraping scattered data from multiple websites that share the same CMS

  1. Finding websites using the same content management system (CMS)
  2. Writing the scraper: looking at HTML structure
  3. Using if statements to avoid errors when data doesn’t exist
  4. Recap
  5. Tests

21. Scraper #20: Automating database searches (forms)

  1. Understanding URLs: queries and parameters
  2. When the URL doesn’t change
  3. Solving the cookie problem: Mechanize
  4. Recap
  5. Tests

22. Scraper #21: Storing the results of a search

  1. Recap
  2. Scraper tip: using print to monitor progress
  3. Tests

23. Scraper #22: Scraping PDFs part 1

  1. Detour: indexes and slicing shortcuts
  2. Back to the scraper
  3. Detour: operators
  4. Back to the scraper (again)
  5. Detour: the % sign explained
  6. Back to the scraper (again) (again)
  7. Running the code on a working URL
  8. Borrowing some code
  9. The first test run
  10. Fixing a unicode error
  11. Where’s the ‘view source’ on a PDF?
  12. Recap
  13. Tests

24. Scraper #23: Scraping PDFs part 2

  1. Scraping speed camera PDFs - welcome back to XPath
  2. Ifs and buts: measuring and matching data
  3. Recap
  4. Tests

25. Scraper #24: Scraping multiple PDFs

  1. The code
  2. Tasks 1 and 2: Find a pattern in the HTML and grab the links within
  3. XPath contains…
  4. The code: scraping more than one PDF
  5. The wrong kind of data: calculations with strings
  6. Putting square pegs in square holes: saving data based on properties
  7. Recap
  8. Tests

26. Scraper #25: Text, not tables, in PDFs - regex

  1. Starting the code: importing a regex library
  2. Scraping each PDF
  3. The UnicodeDecodeError and the AttributeError
  4. Storing the first pieces of information
  5. Why it’s a good idea to store line numbers
  6. Re: Python’s regex library
  7. Other functions from the re library
  8. Back to the code
  9. Joining lists of items into a single string
  10. Finding all the links to PDF reports on a particular webpage
  11. Finding the PDF link on a page
  12. Detour: global variables and local variables
  13. The code in full
  14. Recap
  15. Tests

27. Scraper #26: Scraping CSV files

  1. The CSV library
  2. Process of elimination 1: putting blind spots in the code
  3. Process of elimination 2: amending the source data
  4. Encoding, decoding, extracting
  5. Removing the header row
  6. Ready to scrape multiple sheets
  7. Combining CSV files on your computer
  8. Recap
  9. Tests

28. Scraper #27: Scraping Excel spreadsheets part 1

  1. A library for scraping spreadsheets
  2. What can you learn from a broken scraper?
  3. Applying the scraper to a different spreadsheet
  4. But what is the scraper doing?
  5. Recap
  6. Tests

29. Scraper #28: Scraping Excel spreadsheets part 2: scraping one sheet

  1. Testing on one sheet of a spreadsheet
  2. Recap
  3. Tests

30. Scraper #28 continued: Scraping Excel spreadsheets part 3: scraping multiple sheets

  1. One dataset, or multiple ones
  2. Using header row values as keys
  3. Recap
  4. Tests

31. Scraper #28 continued: Scraping Excel spreadsheets part 4: dealing with dates in spreadsheets

  1. More string formatting: replacing bad characters
  2. Scraping multiple spreadsheets
  3. Loops within loops
  4. Scraper tip: creating a sandbox
  5. Recap
  6. Tests

32. Scraper #29: Writing scrapers for JSON and APIs

  1. If you’re API and you know it
  2. Dealing with JSON
  3. Adding our own code
  4. Storing the results
  5. Querying your own API
  6. Recap
  7. Tests

33. The final chapter: where do you go from here?

  1. The map is not the territory
  2. Recommended reading and viewing
  3. End != End

34. Acknowledgements

35. Glossary
