Commit Graph

429 Commits (b42599278324949c5e12989f7ad3f7a62954cdba)

Author SHA1 Message Date
pictuga b425992783 morss: don't follow alt=rss with custom feeds
To get the same page as with :get=page and to avoid low-quality feeds
2020-05-02 19:18:58 +02:00
pictuga 271ac8f80f crawler: comment code a bit 2020-05-02 19:18:01 +02:00
pictuga 64e41b807d crawler: handle http:/ (single slash)
Fixing one more corner case! malayalam.oneindia.com
2020-05-02 19:17:15 +02:00
pictuga 27a42c47aa morss: use final request url
Code is not very elegant...
2020-04-28 22:30:21 +02:00
pictuga c27c38f7c7 crawler: return dict instead of tuple 2020-04-28 22:29:07 +02:00
pictuga a1dc96cb50 feeds: remove mimetype from function call as no longer used 2020-04-28 22:07:25 +02:00
pictuga 749acc87fc Centralize url clean up in crawler.py 2020-04-28 22:03:49 +02:00
pictuga cb69e3167f crawler: accept non-ascii urls
Covering one more corner case!
2020-04-28 14:47:23 +02:00
pictuga c3f06da947 morss: process(): specify encoding for clarity 2020-04-28 14:45:00 +02:00
pictuga 44a3e0edc4 readabilite: specify in- and out-going encoding 2020-04-28 14:44:35 +02:00
pictuga 818cdaaa9b Make it possible to call sub-libs in non-interactive mode
Run `python -m morss.feeds http://lemonde.fr` and so on
2020-04-27 18:00:14 +02:00
pictuga 2806c64326 Make it possible to directly run sub-libs (feeds, crawler, readabilite)
Run `python -im morss.feeds http://website.sample/rss.xml` and so on
2020-04-27 17:19:31 +02:00
pictuga f6bc23927f readabilite: drop dangerous tags (script, style) 2020-04-25 12:25:02 +02:00
pictuga c86572374e readabilite: minimum score requirement 2020-04-25 12:24:36 +02:00
pictuga 59ef5af9e2 feeds: fix bug when deleting attr in html 2020-04-24 22:12:05 +02:00
pictuga 6a0531ca03 crawler: randomize user agent 2020-04-24 11:28:39 +02:00
pictuga 8187876a06 crawler: stop at first alternative link
Should save a few ms and the first one is usually (?) the most relevant/generic
2020-04-23 11:23:45 +02:00
pictuga 325a373e3e feeds: add SyntaxError catch 2020-04-20 16:15:15 +02:00
pictuga 2719bd6776 crawler: fix Chinese encoding 2020-04-20 16:14:55 +02:00
pictuga ec8edb02f1 Various small bug fixes 2020-04-19 12:54:02 +02:00
pictuga d01b943597 Remove leftover threading var 2020-04-19 12:51:11 +02:00
pictuga b361aa2867 Add timeout to :get 2020-04-19 12:50:26 +02:00
pictuga 4ce3c7cb32 Small code clean ups 2020-04-19 12:50:05 +02:00
pictuga 7e45b2611d Disable multi-threading
Impact was mostly negative due to locks
2020-04-19 12:29:52 +02:00
pictuga 036e5190f1 crawler: remove unused code 2020-04-18 21:40:02 +02:00
pictuga e99c5b3b71 morss: more sensible default MAX/LIM values 2020-04-18 17:21:45 +02:00
pictuga 7375adce33 sheet.xsl: fix & improve 2020-04-15 23:34:28 +02:00
pictuga fe82b19c91 Merge .xsl & html template
Turns out they somehow serve a similar purpose
2020-04-15 22:30:45 +02:00
pictuga 0b31e97492 morss: remove debug code in http file handler 2020-04-14 23:20:03 +02:00
pictuga b0ad7c259d Add README & LICENSE to data_files 2020-04-14 19:34:12 +02:00
pictuga 59139272fd Auto-detect the location of www/
Either ../www or /usr/share/morss
Adapted README accordingly
2020-04-14 18:07:19 +02:00
pictuga e6b7c0eb33 Fix app definition for uwsgi 2020-04-13 15:30:09 +02:00
pictuga 67c096ad5b feeds: add fake path to default html parser
Without it, some websites were accidentally matching it (false positives)
2020-04-12 13:00:56 +02:00
pictuga f018437544 crawler: make mysql backend thread safe 2020-04-12 12:53:05 +02:00
pictuga 8e5e8d24a4 Timezone fixes 2020-04-10 20:33:59 +02:00
pictuga ee78a7875a morss: focus on the most recent feed items 2020-04-10 16:08:13 +02:00
pictuga 9e7b9d95ee feeds: properly use html template 2020-04-09 20:00:51 +02:00
pictuga 987a719c4e feeds: try all parsers regardless of content type
Turns out some websites send the wrong content type (json for html, html for xml, etc.)
2020-04-09 19:17:51 +02:00
pictuga 47b33f4baa morss: specify server output encoding 2020-04-09 19:10:45 +02:00
pictuga 3c7f512583 feeds: handle several errors 2020-04-09 19:09:10 +02:00
pictuga a32f5a8536 readabilite: add debug option (also used by :get) 2020-04-09 19:08:13 +02:00
pictuga 63a06524b7 morss: various encoding fixes 2020-04-09 19:06:51 +02:00
pictuga b0f80c6d3c morss: fix csv output encoding 2020-04-09 19:05:50 +02:00
pictuga 78cea10ead morss: replace :getpage with :get
Also provides readabilite debugging
2020-04-09 18:43:20 +02:00
pictuga e5a82ff1f4 crawler: drop auto-referer
It solved some issues but created even more.
2020-04-07 10:39:21 +02:00
pictuga f3d1f92b39 Detect encoding every time 2020-04-07 10:38:36 +02:00
pictuga 7691df5257 Use wrapper for http calls 2020-04-07 10:30:17 +02:00
pictuga f1d0431e68 morss: drop :html, replaced with :reader
README updated accordingly
2020-04-07 09:23:29 +02:00
pictuga a09831415f feeds: fix bug when mimetype matches nothing 2020-04-06 18:53:07 +02:00
pictuga bfad6b7a4a readabilite: clean before counting
To remove links which are not kept anyway
2020-04-06 16:55:39 +02:00