Following are the scripts I use to download .puz files from the Hyperallergic crosswords archive. I cobbled these together using my randomly acquired, woefully patchy programming skills, an A“I” chatbot, and the amazing xword-dl command-line puzzle-downloading tool by the even more awesome Parker Higgins. So I know it can be done better, and I’m always delighted to learn more if you want to share (gentle) corrections and advice. But in the absence of human contributions, the quick-and-dirty approach (putting this together took about 10 minutes) is good enough!
First, the Python script that grabs links from the crossword archive page(s):
import requests
from bs4 import BeautifulSoup
import os
from urllib.parse import urljoin
START_URL = "https://hyperallergic.com/category/crosswords/"
OUTPUT_FILE = "hyperallergic-crosswords-links.txt"
HEADERS = {"User-Agent": "Mozilla/5.0"}
def load_existing_links(path):
    """Return a set of URLs already in the output file."""
    if not os.path.exists(path):
        return set()
    with open(path, "r") as f:
        return set(line.strip() for line in f if line.strip())
def fetch_page(url):
    """Fetch a page and return a BeautifulSoup object."""
    print(f"Fetching: {url}")
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
def extract_links(soup):
    """
    Extract post links from: h2.entry-title a
    """
    links = []
    for a in soup.select("h2.entry-title a"):
        href = a.get("href")
        if href:
            links.append(href)
    return links
def find_next_page(soup, current_url):
    """
    Find the 'Older posts' link using markup like:
      <nav class="navigation pagination">
        ...
        <a class="next page-numbers" href=".../page/2/">
    """
    nav = soup.select_one("nav.navigation.pagination")
    if not nav:
        return None
    next_link = nav.select_one("a.next.page-numbers")
    if not next_link:
        return None
    href = next_link.get("href")
    if not href:
        return None
    return urljoin(current_url, href)
def main():
    existing = load_existing_links(OUTPUT_FILE)
    print(f"Loaded {len(existing)} existing links.")

    new_links = []
    url = START_URL
    stop = False

    # Walk the archive newest-to-oldest and stop as soon as we hit a link
    # that's already in the output file from a previous run.
    while url and not stop:
        soup = fetch_page(url)
        page_links = extract_links(soup)

        if not page_links:
            print("No links found on this page, stopping.")
            break

        for link in page_links:
            if link in existing:
                print(f"Stopping: reached already-known link → {link}")
                stop = True
                break
            if link not in new_links:
                new_links.append(link)

        if stop:
            break

        url = find_next_page(soup, url)
        if not url:
            print("No more 'Older posts' pages found.")
            break

    if not new_links:
        print("No new links found.")
        return

    # Read old content
    old_content = ""
    if os.path.exists(OUTPUT_FILE):
        with open(OUTPUT_FILE, "r") as f:
            old_content = f.read()

    # Prepend new links
    with open(OUTPUT_FILE, "w") as f:
        for link in new_links:
            f.write(link + "\n")
        f.write(old_content)

    print(f"Added {len(new_links)} new links to {OUTPUT_FILE}.")


if __name__ == "__main__":
    main()
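Running it is just something like the following (the script filename here is a placeholder for whatever you save it as; the output filename comes from OUTPUT_FILE above):

# Placeholder script name — use whatever you saved the Python code as.
# Each run creates or updates hyperallergic-crosswords-links.txt in the
# current directory, one post URL per line, newest first.
python3 hyperallergic-links.py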
Then the Zsh script (Big Apple commandeth me to leave the Bash behind) that uses xword-dl to download the puzzles from the links in the file created by the earlier script, with a random pause in between requests to be nice to the server:
#!/usr/bin/env zsh
# Exit if the links file does not exist
if [[ ! -f "hyperallergic-crosswords-links.txt" ]]; then
    echo "Error: hyperallergic-crosswords-links.txt not found."
    exit 1
fi
# Read the file line-by-line
while IFS= read -r url; do
    # Skip empty lines or lines starting with '#'
    [[ -z "$url" || "$url" == \#* ]] && continue

    echo "Downloading: $url"

    # It must be possible to point this at the development branch, but the
    # computer's time spent running this costs less than my time figuring
    # it out right now.
    uvx --from git+https://github.com/thisisparker/xword-dl.git@main xword-dl "$url"

    # Random sleep between 2 and 5 seconds
    delay=$(( (RANDOM % 4) + 2 ))
    echo "Sleeping for $delay seconds…"
    sleep $delay
done < hyperallergic-crosswords-links.txt
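To grab the latest puzzles, the two scripts run back to back, something like this (the filenames are placeholders for whatever you've saved them as; as far as I can tell, xword-dl drops the .puz files into the directory you run it from):

# Placeholder filenames — substitute your own.
python3 hyperallergic-links.py && zsh hyperallergic-download.zsh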
Nothing earth-shattering, obviously, but it does the job!