Commit Graph

871 Commits (71d9c7a027835da8edc6502391224f909ea65ec5)
 

Author SHA1 Message Date
pictuga 2fe3e0b8ee feeds: clean up other stylesheets before putting ours 2020-05-26 19:26:36 +02:00
pictuga ad3ba9de1a sheet.xsl: add <select/> to use :firstlink 2020-05-13 12:33:12 +02:00
pictuga 68c46a1823 morss: remove deprecated twitter/fb link handling 2020-05-13 12:31:09 +02:00
pictuga 91be2d229e morss: ability to use first link from desc instead of default link 2020-05-13 12:29:53 +02:00
pictuga 038f267ea2 Rename :theforce into :force 2020-05-13 11:49:15 +02:00
pictuga 22005065e8 Use etree.tostring 'method' arg
Gives appropriately formatted html code.
Some pages might otherwise be rendered as blank.
2020-05-13 11:44:34 +02:00
pictuga 7d0d416610 morss: cache articles for 24hrs
Also make it possible to refetch articles, regardless of cache
2020-05-12 21:10:31 +02:00
pictuga 5dac4c69a1 crawler: more code comments 2020-05-12 20:44:25 +02:00
pictuga 36e2a1c3fd crawler: increase size limit from 100KiB to 500KiB
I'm looking at you, worldbankgroup.csod.com/ats/careersite/search.aspx
2020-05-12 19:34:16 +02:00
pictuga 83dd2925d3 readabilite: better parsing
Keeping blank_text keeps the tree closer to as-is, making the final output closer to expectations
2020-05-12 14:15:53 +02:00
pictuga e09d0abf54 morss: remove deprecated piece of code 2020-05-07 16:05:30 +02:00
pictuga ff26a560cb Shift Safari workaround to morss.py 2020-05-07 16:04:54 +02:00
pictuga 74d7a1eca2 sheet.xsl: fix word wrap 2020-05-06 16:58:28 +02:00
pictuga eba295cba8 sheet.xsl: fixes for Safari 2020-05-06 12:01:27 +02:00
pictuga f27631954e .htaccess: bypass Safari RSS detection 2020-05-06 11:47:24 +02:00
pictuga c74abfa2f4 sheet.xsl: use CDATA for js code 2020-05-06 11:46:38 +02:00
pictuga 1d5272c299 sheet.xsl: allow zooming on mobile 2020-05-04 14:44:43 +02:00
pictuga f685139137 crawler: use UPSERT statements
Avoid potential race conditions
2020-05-03 21:27:45 +02:00
pictuga 73b477665e morss: separate :clip with <hr> instead of stars 2020-05-02 19:19:54 +02:00
pictuga b425992783 morss: don't follow alt=rss with custom feeds
To have the same page as with :get=page and to avoid shitty feeds
2020-05-02 19:18:58 +02:00
pictuga 271ac8f80f crawler: comment code a bit 2020-05-02 19:18:01 +02:00
pictuga 64e41b807d crawler: handle http:/ (single slash)
Fixing one more corner case! malayalam.oneindia.com
2020-05-02 19:17:15 +02:00
pictuga a2c4691090 sheet.xsl: dir=auto for RTL languages (Arabic, etc.) 2020-04-29 15:01:33 +02:00
pictuga b6000923bc README: clean up deprecated code 2020-04-28 22:31:11 +02:00
pictuga 27a42c47aa morss: use final request url
Code is not very elegant...
2020-04-28 22:30:21 +02:00
pictuga c27c38f7c7 crawler: return dict instead of tuple 2020-04-28 22:29:07 +02:00
pictuga a1dc96cb50 feeds: remove mimetype from function call as no longer used 2020-04-28 22:07:25 +02:00
pictuga 749acc87fc Centralize url clean up in crawler.py 2020-04-28 22:03:49 +02:00
pictuga c186188557 README: warning about lxml installation 2020-04-28 21:58:26 +02:00
pictuga cb69e3167f crawler: accept non-ascii urls
Covering one more corner case!
2020-04-28 14:47:23 +02:00
pictuga c3f06da947 morss: process(): specify encoding for clarity 2020-04-28 14:45:00 +02:00
pictuga 44a3e0edc4 readabilite: specify in- and out-going encoding 2020-04-28 14:44:35 +02:00
pictuga 4a9b505499 README: update python lib instructions 2020-04-27 18:12:14 +02:00
pictuga 818cdaaa9b Make it possible to call sub-libs in non-interactive mode
Run `python -m morss.feeds http://lemonde.fr` and so on
2020-04-27 18:00:14 +02:00
pictuga 2806c64326 Make it possible to directly run sub-libs (feeds, crawler, readabilite)
Run `python -im morss.feeds http://website.sample/rss.xml` and so on
2020-04-27 17:19:31 +02:00
pictuga d39d7bb19d sheet.xsl: limit overflow 2020-04-25 15:27:49 +02:00
pictuga e5e3746fc6 sheet.xsl: show plain url 2020-04-25 15:27:13 +02:00
pictuga 960c9d10d6 sheet.xsl: customize output feed form 2020-04-25 15:26:47 +02:00
pictuga 0e7a5b9780 sheet.xsl: wrap header in <header> 2020-04-25 15:24:57 +02:00
pictuga 186bedcf62 sheet.xsl: smarter html reparser 2020-04-25 15:22:25 +02:00
pictuga 5847e18e42 sheet: improved feed address output (w/ c/c) 2020-04-25 15:21:47 +02:00
pictuga f6bc23927f readabilite: drop dangerous tags (script, style) 2020-04-25 12:25:02 +02:00
pictuga c86572374e readabilite: minimum score requirement 2020-04-25 12:24:36 +02:00
pictuga 59ef5af9e2 feeds: fix bug when deleting attr in html 2020-04-24 22:12:05 +02:00
pictuga 6a0531ca03 crawler: randomize user agent 2020-04-24 11:28:39 +02:00
pictuga 8187876a06 crawler: stop at first alternative link
Should save a few ms and the first one is usually (?) the most relevant/generic
2020-04-23 11:23:45 +02:00
pictuga 325a373e3e feeds: add SyntaxError catch 2020-04-20 16:15:15 +02:00
pictuga 2719bd6776 crawler: fix Chinese encoding 2020-04-20 16:14:55 +02:00
pictuga 285e1e5f42 docker: pip install local 2020-04-19 13:25:53 +02:00
pictuga 41a63900c2 README: improve docker instructions 2020-04-19 13:01:08 +02:00