mirror of https://git.nolog.cz/NoLog.cz/headline.git synced 2025-01-31 11:53:35 +01:00

Monitor how article titles are changed over time on news websites. You can find the scraped data here: https://git.nolog.cz/NoLog.cz/headline-exports


mdivecky dee801a193 Merge pull request 'Merge forgotten changes from new-frontend' (#2 ) from new-frontend into main Reviewed-on: https://git.nolog.cz/mdivecky/headline/pulls/2		2023-08-17 12:26:15 +02:00
.vscode	Setup associations for vscode	2023-08-16 18:31:10 +02:00
data	update feeds	2022-08-27 12:46:18 +02:00
misc	give every article ID to enable grouping changes by article	2023-08-17 11:19:12 +02:00
processor	Use WAL in sqlite3 to avoid database locks that prevent reads	2023-08-17 12:25:01 +02:00
view	Merge pull request 'Merge forgotten changes from new-frontend' (#2 ) from new-frontend into main	2023-08-17 12:26:15 +02:00
.editorconfig	Update editorconfig	2023-08-16 23:01:31 +02:00
.gitignore	flask UI and dockerization:) Sorry.	2022-08-25 15:10:08 +02:00
docker-compose.yml	Add article detail page	2023-08-16 23:01:45 +02:00
README.md	add expire value to redis keys (7 days)	2022-08-27 13:45:25 +02:00

Headline

Monitor how article titles are changed over time on news websites.

This tool is probably not production-ready because it was written in two afternoons by an amateur (I'm not a professional programmer). If you want to run it, at least put a reverse proxy between it and the public network, or run it locally.

I didn't do any research on the legality of analysing RSS feeds, and it's possible that you could run into legal issues by presenting the results publicly.

Architecture

The "processor" script fetches the RSS feeds configured in processor/config.yaml every 5 minutes (the interval is set in processor/crontab), stores the articles in Redis, and compares new and old articles to find changes in their titles. When a change is found, it generates a nice visual diff and stores it together with other information (detection time, article link, new/old title, etc.) in a permanent database (sqlite3 for now).
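The compare-and-diff step described above can be sketched roughly as follows. This is a minimal illustration and not the processor's actual code: a plain dict stands in for Redis, the function names `title_diff` and `detect_change` are hypothetical, and the `<del>`/`<ins>` markup is only one assumed way to render a visual diff.

```python
import difflib
import html


def title_diff(old: str, new: str) -> str:
    """Return an HTML snippet marking words removed from `old` and added in `new`."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    parts = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            parts.append(html.escape(" ".join(old_words[i1:i2])))
        if op in ("delete", "replace"):
            parts.append("<del>" + html.escape(" ".join(old_words[i1:i2])) + "</del>")
        if op in ("insert", "replace"):
            parts.append("<ins>" + html.escape(" ".join(new_words[j1:j2])) + "</ins>")
    return " ".join(parts)


def detect_change(cache: dict, article_id: str, title: str):
    """Compare `title` with the cached one; return a change record or None."""
    previous = cache.get(article_id)
    cache[article_id] = title  # hypothetical stand-in for a Redis SET
    if previous is None or previous == title:
        return None  # first sighting, or the title is unchanged
    return {
        "article_id": article_id,
        "old": previous,
        "new": title,
        "diff": title_diff(previous, title),
    }
```

Calling `detect_change` twice for the same article ID with different titles returns a record on the second call; that record is what would be written to the permanent database.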

The "view" script reads data from the permanent database (sqlite3) and presents it to the user.
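Reading from the database while the processor writes to it could look something like the sketch below. The commit history mentions using WAL in sqlite3 to avoid locks that block reads, so the sketch enables it; the table name `diffs`, its columns, and the helper names are assumptions made for illustration and may not match the real schema.

```python
import sqlite3


def open_db(path: str) -> sqlite3.Connection:
    """Open the database and switch it to write-ahead logging."""
    conn = sqlite3.connect(path)
    # WAL lets readers keep reading while another process is writing.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.row_factory = sqlite3.Row  # access columns by name
    return conn


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical table for detected title changes."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS diffs ("
        "id INTEGER PRIMARY KEY, article_id TEXT, detected_at TEXT, "
        "link TEXT, old_title TEXT, new_title TEXT, diff_html TEXT)"
    )
    conn.commit()


def latest_changes(conn: sqlite3.Connection, limit: int = 20):
    """Return the most recently detected changes, newest first."""
    return conn.execute(
        "SELECT * FROM diffs ORDER BY detected_at DESC, id DESC LIMIT ?",
        (limit,),
    ).fetchall()
```

The journal mode is stored in the database file itself, so it is enough for one side (processor or view) to set it once.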

Installation

Run docker-compose up -d and everything should start. You can change ./processor/config.yaml to edit the RSS sources. After the first start, you have to wait about 5 minutes for the "processor" to create the first empty database. The webserver will throw an error until then.

to-do

Collect the creation time of the original/new article, write it to permanent storage (sqlite3 for now) and display it.
Write a better readme and a little more docs.
Create a view with more info and stats (list of feeds, articles in Redis, etc.).
IDEA: Figure out how to monitor changes in article descriptions (maybe just compare hashes?) and how to present them. (Right now, the code can store descriptions in Redis, but does nothing else with them.)
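
The hash-comparison idea from the last item might be sketched like this. Nothing here exists in the code yet: the function names are hypothetical, a dict again stands in for Redis, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib


def description_fingerprint(description: str) -> str:
    """Hash a description after collapsing whitespace, so reflowed text is not flagged."""
    normalized = " ".join(description.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def description_changed(cache: dict, article_id: str, description: str) -> bool:
    """Return True if the description differs from the previously seen one."""
    new_hash = description_fingerprint(description)
    old_hash = cache.get(article_id)
    cache[article_id] = new_hash  # hypothetical stand-in for a Redis SET
    return old_hash is not None and old_hash != new_hash
```

A hash only reveals that something changed, not what changed. Presenting a diff would still require keeping the previous description text, so the hash is mainly useful as a cheap first check.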