Source Scrapers
Horizon fetches content from four source types. All scrapers inherit from BaseScraper, share an async HTTP client, and implement a fetch(since) method that returns a list of ContentItem objects. Sources are fetched concurrently via asyncio.gather.
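The contract looks roughly like the sketch below. BaseScraper, ContentItem, fetch(since), and asyncio.gather come from the description above; the httpx client, the ContentItem fields, and the fetch_all helper are illustrative assumptions, not the actual code.

import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime

import httpx


@dataclass
class ContentItem:
    source: str
    title: str
    url: str
    published_at: datetime


class BaseScraper(ABC):
    def __init__(self, client: httpx.AsyncClient):
        self.client = client  # shared async HTTP client

    @abstractmethod
    async def fetch(self, since: datetime) -> list[ContentItem]:
        """Return items published after `since`."""


async def fetch_all(scrapers: list[BaseScraper], since: datetime) -> list[ContentItem]:
    # All sources are fetched concurrently.
    results = await asyncio.gather(*(s.fetch(since) for s in scrapers))
    return [item for batch in results for item in batch]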
Hacker News
File: src/scrapers/hackernews.py
Uses the Firebase HN API:
GET /topstories.json → fetches top story IDs
GET /item/{id}.json → fetches story/comment details
Stories and their comments are fetched concurrently. For each story, the top 5 comments are included (deleted/dead comments excluded, HTML stripped, truncated at 500 chars).
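A minimal sketch of that flow, assuming an httpx client; the helper names and the tag-stripping regex are illustrative, only the endpoints, the 5-comment limit, and the 500-char truncation come from the description above.

import asyncio
import html
import re

import httpx

HN_API = "https://hacker-news.firebaseio.com/v0"


def clean_comment(text: str) -> str:
    # Strip HTML tags/entities and truncate at 500 characters.
    text = html.unescape(re.sub(r"<[^>]+>", " ", text))
    return text[:500]


async def fetch_story(client: httpx.AsyncClient, story_id: int) -> dict:
    story = (await client.get(f"{HN_API}/item/{story_id}.json")).json()
    kid_ids = story.get("kids", [])[:5]  # top 5 comment IDs
    responses = await asyncio.gather(
        *(client.get(f"{HN_API}/item/{cid}.json") for cid in kid_ids)
    )
    comments = [r.json() for r in responses]
    story["top_comments"] = [
        clean_comment(c.get("text", ""))
        for c in comments
        if c and not c.get("deleted") and not c.get("dead")
    ]
    return story


async def fetch_top(client: httpx.AsyncClient, limit: int) -> list[dict]:
    ids = (await client.get(f"{HN_API}/topstories.json")).json()[:limit]
    # Stories (and their comments) are fetched concurrently.
    return list(await asyncio.gather(*(fetch_story(client, i) for i in ids)))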
Config (sources.hackernews):
{
"enabled": true,
"fetch_top_stories": 30,
"min_score": 100
}
fetch_top_stories → number of top story IDs to fetch
min_score → minimum HN points to include a story
Extracted data: title, URL (falls back to HN discussion URL), author, score, comment count, and top comment text.
GitHub
File: src/scrapers/github.py
Uses the GitHub REST API:
GET /users/{username}/events/public → user activity events
GET /repos/{owner}/{repo}/releases → repository releases
Two source types are supported:
user_events → tracks push, create, release, public, and watch events for a user
repo_releases → tracks new releases for a specific repository
Config (sources.github, list of entries):
{
"type": "user_events",
"username": "torvalds",
"enabled": true
}
{
"type": "repo_releases",
"owner": "golang",
"repo": "go",
"enabled": true
}
Authentication: Set GITHUB_TOKEN in your environment for higher rate limits (5000 req/hr vs 60 without).
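A rough sketch of both calls with optional token auth; the function names and the event-type filter set are assumptions based on the description above, not the scraper's actual code.

import os

import httpx

API = "https://api.github.com"


def _headers() -> dict:
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        # Authenticated requests get the 5000 req/hr rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return headers


async def fetch_user_events(client: httpx.AsyncClient, username: str) -> list[dict]:
    resp = await client.get(f"{API}/users/{username}/events/public", headers=_headers())
    resp.raise_for_status()
    # Keep only the event types the scraper tracks.
    tracked = {"PushEvent", "CreateEvent", "ReleaseEvent", "PublicEvent", "WatchEvent"}
    return [e for e in resp.json() if e["type"] in tracked]


async def fetch_repo_releases(client: httpx.AsyncClient, owner: str, repo: str) -> list[dict]:
    resp = await client.get(f"{API}/repos/{owner}/{repo}/releases", headers=_headers())
    resp.raise_for_status()
    return resp.json()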
RSS
File: src/scrapers/rss.py
Fetches any Atom/RSS feed using the feedparser library. Tries multiple date fields (published, updated, created) with fallback parsing.
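The fallback might look like this sketch; the *_parsed fields are feedparser's parsed date attributes, while the helper name is an assumption.

from calendar import timegm
from datetime import datetime, timezone

import feedparser


def entry_date(entry) -> datetime | None:
    # Try published, then updated, then created, as described above.
    for field in ("published_parsed", "updated_parsed", "created_parsed"):
        parsed = entry.get(field)
        if parsed:
            return datetime.fromtimestamp(timegm(parsed), tz=timezone.utc)
    return None


feed = feedparser.parse("https://simonwillison.net/atom/everything/")
for entry in feed.entries:
    print(entry.get("title"), entry_date(entry))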
Config (sources.rss, list of entries):
{
"name": "Simon Willison",
"url": "https://simonwillison.net/atom/everything/",
"enabled": true,
"category": "ai-tools"
}
category → optional tag for grouping (e.g., "programming", "microblog")
Extracted data: title, URL, author, content (from summary/description/content fields), feed name, category, and entry tags.
Reddit
File: src/scrapers/reddit.py
Uses Reddit's public JSON API (www.reddit.com):
GET /r/{subreddit}/{sort}.json → subreddit posts
GET /user/{username}/submitted.json → user submissions
GET /r/{subreddit}/comments/{post_id}.json → post comments
Subreddits and users are fetched concurrently. Comments are sorted by score, limited to the configured count, and exclude moderator-distinguished comments. Self-text is truncated at 1500 chars, comments at 500 chars.
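A sketch of the subreddit and comment fetches, assuming an httpx client; the response handling follows Reddit's public listing format, while the helper names and the User-Agent string are illustrative.

import httpx

UA = {"User-Agent": "horizon-scraper/0.1 (personal news digest)"}


async def fetch_subreddit(client: httpx.AsyncClient, sub: str, sort: str, limit: int) -> list[dict]:
    url = f"https://www.reddit.com/r/{sub}/{sort}.json"
    resp = await client.get(url, params={"limit": limit}, headers=UA)
    resp.raise_for_status()
    return [child["data"] for child in resp.json()["data"]["children"]]


async def fetch_top_comments(client: httpx.AsyncClient, sub: str, post_id: str, count: int) -> list[str]:
    url = f"https://www.reddit.com/r/{sub}/comments/{post_id}.json"
    resp = await client.get(url, headers=UA)
    resp.raise_for_status()
    # The comments payload is the second listing in the response.
    comments = [c["data"] for c in resp.json()[1]["data"]["children"] if c["kind"] == "t1"]
    # Drop moderator-distinguished comments, sort by score, truncate bodies at 500 chars.
    comments = [c for c in comments if c.get("distinguished") != "moderator"]
    comments.sort(key=lambda c: c.get("score", 0), reverse=True)
    return [c["body"][:500] for c in comments[:count]]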
Config (sources.reddit):
{
"enabled": true,
"fetch_comments": 5,
"subreddits": [
{
"subreddit": "MachineLearning",
"sort": "hot",
"fetch_limit": 25,
"min_score": 10
}
],
"users": [
{
"username": "spez",
"sort": "new",
"fetch_limit": 10
}
]
}
sort → hot, new, top, or rising (subreddits); hot or new (users)
time_filter → for top/rising sorts: hour, day, week, month, year, all
min_score → minimum post score (subreddits only)
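How a listing URL might be built from one subreddit entry, as a sketch; the t query parameter is Reddit's time filter for sorted listings, and the helper name is an assumption.

def listing_params(entry: dict) -> tuple[str, dict]:
    sort = entry.get("sort", "hot")
    params = {"limit": entry.get("fetch_limit", 25)}
    if sort in ("top", "rising") and entry.get("time_filter"):
        params["t"] = entry["time_filter"]  # hour, day, week, month, year, all
    url = f"https://www.reddit.com/r/{entry['subreddit']}/{sort}.json"
    return url, params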
Rate limiting: Detects HTTP 429 responses, reads the Retry-After header, waits, and retries once. Uses a descriptive User-Agent as required by Reddit's API guidelines.
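The retry could look like this minimal sketch (assumes httpx; the helper name and the 5-second default wait are illustrative).

import asyncio

import httpx


async def get_with_retry(client: httpx.AsyncClient, url: str, **kwargs) -> httpx.Response:
    resp = await client.get(url, **kwargs)
    if resp.status_code == 429:
        # Honour the Retry-After header, then retry exactly once.
        wait = float(resp.headers.get("Retry-After", 5))
        await asyncio.sleep(wait)
        resp = await client.get(url, **kwargs)
    resp.raise_for_status()
    return resp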
Extracted data: title, URL, author, score, upvote ratio, comment count, subreddit, flair, self-text, and top comments.