Monitor how article titles are changed over time on news websites. You can find the scraped data here: https://git.nolog.cz/NoLog.cz/headline-exports

Headline

Monitor how article titles are changed over time on news websites.


This tool is probably not production-ready because it was written in two afternoons by an amateur (I'm not a professional programmer). If you want to run it, at least put a reverse proxy between it and the public network, or run it locally.

I didn't do any research on the legality of analysing RSS feeds, and it's possible you could run into legal issues by presenting the results publicly.


Architecture

The "processor" script fetches the RSS feeds configured in processor/config.yaml every 5 minutes (the interval is set in processor/crontab), stores the articles in Redis, and compares new and old articles to find changes in the title. When a change is found, it generates a visual diff and stores it with other information (detection time, article link, new/old title, etc.) in a permanent database (sqlite3 for now).
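The compare-and-store step can be sketched roughly like this. It is only an illustration, not the actual processor code: a plain dict stands in for Redis, the `changes` table layout is an assumption, and the names `word_diff`, `init_db` and `process_entry` are made up for this example.

```python
import difflib
import sqlite3
from datetime import datetime, timezone


def word_diff(old: str, new: str) -> str:
    """Mark removed words as [-word-] and added words as {+word+}."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.extend(old_words[i1:i2])
            continue
        if tag in ("delete", "replace"):
            parts.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            parts.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(parts)


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema; the real database layout may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS changes ("
        "article_id TEXT, link TEXT, detected_at TEXT, "
        "old_title TEXT, new_title TEXT, diff TEXT)"
    )


def process_entry(cache: dict, conn: sqlite3.Connection,
                  article_id: str, link: str, title: str) -> bool:
    """Remember the latest title; record a change if it differs from the cached one."""
    old_title = cache.get(article_id)
    cache[article_id] = title  # the dict plays the role of Redis here
    if old_title is None or old_title == title:
        return False
    conn.execute(
        "INSERT INTO changes VALUES (?, ?, ?, ?, ?, ?)",
        (article_id, link, datetime.now(timezone.utc).isoformat(),
         old_title, title, word_diff(old_title, title)),
    )
    conn.commit()
    return True
```

Calling `process_entry` once per feed entry on every run is enough for this sketch: the first sighting of an article only fills the cache, and later runs write a row whenever the title differs from the cached one.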

The "view" script reads data from the permanent database (sqlite3) and presents it to the user.
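Reading the stored changes could look something like the sketch below. The `changes` table and its columns are assumptions made for this example (the real schema may differ), and `latest_changes` and `make_demo_db` are hypothetical helpers, not functions from the view code.

```python
import sqlite3


def latest_changes(conn: sqlite3.Connection, limit: int = 20) -> list[dict]:
    """Return the most recently detected title changes, newest first."""
    conn.row_factory = sqlite3.Row  # lets us turn rows into dicts by column name
    rows = conn.execute(
        "SELECT article_id, link, detected_at, old_title, new_title, diff "
        "FROM changes ORDER BY detected_at DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [dict(row) for row in rows]


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two example rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE changes ("
        "article_id TEXT, link TEXT, detected_at TEXT, "
        "old_title TEXT, new_title TEXT, diff TEXT)"
    )
    conn.executemany(
        "INSERT INTO changes VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("a", "https://example.org/a", "2023-08-16T10:00:00+00:00",
             "Old A", "New A", "[-Old-] {+New+} A"),
            ("b", "https://example.org/b", "2023-08-17T10:00:00+00:00",
             "Old B", "New B", "[-Old-] {+New+} B"),
        ],
    )
    return conn


if __name__ == "__main__":
    for change in latest_changes(make_demo_db(), limit=10):
        print(change["detected_at"], change["link"], change["diff"])
```

Sorting by `detected_at` as text works here only because the timestamps are stored in ISO 8601 format with the same offset.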

Installation

Run docker-compose up -d and everything should start. You can change ./processor/config.yaml to edit the RSS sources. After the first start, you have to wait about 5 minutes for the "processor" to create the first empty database. The web server will throw an error until then.

to-do

  • Collect the creation time of the original/new article, write it to permanent storage (sqlite3 for now) and display it.
  • Write a better readme and a few more docs.
  • Create a view with more info and stats (list of feeds, articles in Redis, etc.)
  • IDEA: Figure out how to monitor changes in the article description (maybe just compare hashes?) and how to present them. (Right now, the code can store descriptions in Redis, but does nothing else with them.)
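
The hash comparison from the last idea might be sketched as follows. Nothing like this exists in the code yet; the dict stands in for Redis, and `description_fingerprint` and `description_changed` are hypothetical names.

```python
import hashlib


def description_fingerprint(description: str) -> str:
    """Hash the description after collapsing whitespace, so reflowed text does not count as a change."""
    normalized = " ".join(description.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def description_changed(seen: dict, article_id: str, description: str) -> bool:
    """Store the newest fingerprint and report whether it differs from the previous one."""
    fingerprint = description_fingerprint(description)
    previous = seen.get(article_id)
    seen[article_id] = fingerprint  # the dict plays the role of Redis here
    return previous is not None and previous != fingerprint
```

A hash only says that something changed, not what. To present the change as a diff, the old description text would still have to be kept somewhere.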