Downloading Hyperallergic Crosswords

Following are the scripts I use to download .puz files from the Hyperallergic crosswords archive. I cobbled these together using my randomly acquired, woefully patchy programming skills, an A“I” chatbot, and the amazing xword-dl command-line puzzle downloading tool by the even more awesome Parker Higgins. So I know it can be done better, and I’m always delighted to learn more if you want to share (gentle) corrections and advice. But in the absence of human contributions, the quick-and-dirty approach (putting this together took about 10 minutes) is good enough!

First, the Python script that grabs links from the crossword archive page(s):

Python [get-hyperallergic-crosswords-pages.py]
import requests
from bs4 import BeautifulSoup
import os
from urllib.parse import urljoin

START_URL = "https://hyperallergic.com/category/crosswords/"
OUTPUT_FILE = "hyperallergic-crosswords-links.txt"
HEADERS = {"User-Agent": "Mozilla/5.0"}


def load_existing_links(path):
    """Return a set of URLs already in the output file."""
    if not os.path.exists(path):
        return set()
    with open(path, "r") as f:
        return set(line.strip() for line in f if line.strip())


def fetch_page(url):
    """Fetch a page and return a BeautifulSoup object."""
    print(f"Fetching: {url}")
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


def extract_links(soup):
    """
    Extract post links from:  h2.entry-title a
    """
    links = []
    for a in soup.select("h2.entry-title a"):
        href = a.get("href")
        if href:
            links.append(href)
    return links


def find_next_page(soup, current_url):
    """
    Find the 'Older posts' link using markup like:

    <nav class="navigation pagination">
       ...
       <a class="next page-numbers" href=".../page/2/">
    """
    nav = soup.select_one("nav.navigation.pagination")
    if not nav:
        return None

    next_link = nav.select_one("a.next.page-numbers")
    if not next_link:
        return None

    href = next_link.get("href")
    if not href:
        return None

    return urljoin(current_url, href)


def main():
    existing = load_existing_links(OUTPUT_FILE)
    print(f"Loaded {len(existing)} existing links.")

    new_links = []
    url = START_URL
    stop = False

    while url and not stop:
        soup = fetch_page(url)
        page_links = extract_links(soup)

        if not page_links:
            print("No links found on this page, stopping.")
            break

        for link in page_links:
            if link in existing:
                print(f"Stopping: reached already-known link → {link}")
                stop = True
                break
            if link not in new_links:
                new_links.append(link)

        if stop:
            break

        url = find_next_page(soup, url)
        if not url:
            print("No more 'Older posts' pages found.")
            break

    if not new_links:
        print("No new links found.")
        return

    # Read old content
    old_content = ""
    if os.path.exists(OUTPUT_FILE):
        with open(OUTPUT_FILE, "r") as f:
            old_content = f.read()

    # Prepend new links
    with open(OUTPUT_FILE, "w") as f:
        for link in new_links:
            f.write(link + "\n")
        f.write(old_content)

    print(f"Added {len(new_links)} new links to {OUTPUT_FILE}.")


if __name__ == "__main__":
    main()
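
If you want to run it yourself, something like this should do it; requests and beautifulsoup4 are the PyPI names for the imports above, and the virtual environment is optional (just a habit):

Zsh
python3 -m venv .venv && source .venv/bin/activate   # optional: keep the dependencies contained
pip install requests beautifulsoup4                  # the packages behind the imports above
python3 get-hyperallergic-crosswords-pages.py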

Then the Zsh script (Big Apple commandeth me to leave the Bash behind) that uses xword-dl to download the puzzles from the links in the file created by the earlier script, with a random pause in between requests to be nice to the server:

Zsh [dl-hyperallergic-crosswords.zsh]
#!/usr/bin/env zsh

# Exit if the links file does not exist
if [[ ! -f "hyperallergic-crosswords-links.txt" ]]; then
    echo "Error: hyperallergic-crosswords-links.txt not found."
    exit 1
fi

# Read the file line-by-line
while IFS= read -r url; do
    # Skip empty lines or lines starting with '#'
    [[ -z "$url" || "$url" == \#* ]] && continue

    echo "Downloading: $url"

    # It should be possible to point this at a development branch instead,
    # but the computer's time running this costs less than my time figuring
    # that out right now.
    uvx --from git+https://github.com/thisisparker/xword-dl.git@main xword-dl "$url"

    # Random sleep between 2 and 5 seconds
    delay=$(( (RANDOM % 4) + 2 ))
    echo "Sleeping for $delay seconds…"
    sleep $delay

done < hyperallergic-crosswords-links.txt
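
After that, downloading is just a matter of running the Zsh script from whatever directory the puzzles should land in (as far as I can tell, xword-dl saves to the directory it’s run from), with uv installed so the uvx call works:

Zsh
chmod +x dl-hyperallergic-crosswords.zsh   # once, to make it executable
./dl-hyperallergic-crosswords.zsh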

Nothing earth-shattering, obviously, but it does the job!
