Commit Graph

324 Commits (2704e91a3d400101fe51719320ad2e14ab05ef69)

Author SHA1 Message Date
pictuga 386bafd391 readabilite: write_all use "node" instead of "item" 2017-07-17 00:13:15 +02:00
pictuga a61b259792 readabilite: easy option to highlight the nodes 2017-07-17 00:11:49 +02:00
pictuga c52b47616d readabilite: always return common of 2 best nodes
Better results. Less is not more
2017-07-17 00:10:58 +02:00
pictuga bfdda18b9c readabilite: better explain lowest_common output 2017-07-17 00:08:00 +02:00
pictuga 2afea497a3 readabilite: br2p use "node" instead of "item"
Confusing with rss items otherwise
2017-07-17 00:06:39 +02:00
pictuga 843dc97fbf readabilite: change scoring algorithm
Use 3 groups of keywords instead
2017-07-17 00:01:44 +02:00
pictuga df22396838 Only use chardet on 2k letters
Takes forever otherwise
2017-07-16 23:59:06 +02:00
pictuga 6f0efd5802 crawler: add cookies support
Somehow got dropped when splitting the big handler
2017-03-25 19:51:42 -10:00
pictuga d3bc2926fc Remove :hungry
Mostly useless. If you need it, you might as well not need to use morss in the first place...
2017-03-25 13:52:58 -10:00
pictuga 505b02d70d crawler: remove debugging print() 2017-03-25 13:45:12 -10:00
pictuga 3ca6ed5bb0 readabilite: add author/about to black list 2017-03-24 22:02:41 -10:00
pictuga 4aa25bf3d8 readabilite: clean_html before scoring
Surprisingly efficient
2017-03-24 21:50:46 -10:00
pictuga bfefa8d599 readabilite: add tags to black list 2017-03-24 21:50:26 -10:00
pictuga 91da0f36dc readabilite: comment the clean_html function 2017-03-24 21:50:01 -10:00
pictuga 67889a1d14 readabilite: drop useless tags
This extra cluster actually jams the algorithm
2017-03-24 21:49:14 -10:00
pictuga 167e3e4a15 feedify: accept xpath rules passed as parameters 2017-03-20 20:56:48 -10:00
pictuga bf3ef586c2 feedify: remove unused downloader 2017-03-20 20:53:52 -10:00
pictuga 08f08ef704 improve morss url detection regex 2017-03-20 20:51:13 -10:00
pictuga 1b4341f741 accept query_string in morss cgi 2017-03-20 20:50:04 -10:00
pictuga f965566054 feedify: make function use clearer 2017-03-20 20:19:08 -10:00
pictuga d6882e0a6a readabilite: (try to) improve detection
Kinda hopeless
2017-03-19 02:00:31 -10:00
pictuga 79a8ada9f4 readabilite: add tags to score 2017-03-19 01:57:54 -10:00
pictuga 4a5150e030 readabilite: fix iter while iterating 2017-03-19 01:56:33 -10:00
pictuga e65c88abf8 readabilite: fix re.match 2017-03-19 01:55:40 -10:00
pictuga 9c331300eb crawler: move UAHandler to basic
Fuck u feedburner
2017-03-19 01:49:17 -10:00
pictuga 5e61686373 Only use full feed for articles & feedify
Sometimes using referrer and/or useragent makes some dumb websites return different content (hello feedburner)
2017-03-18 23:43:28 -10:00
pictuga 0b6e553054 Move iTunes code to feedify.py 2017-03-18 23:41:37 -10:00
pictuga d4937812a8 Remove HTTPError code
Only used to look nice but useless (inherits from IOError anyway)
2017-03-18 23:39:32 -10:00
pictuga 99f3c519f2 crawler: fix accept code 2017-03-18 23:37:51 -10:00
pictuga 67f5a21019 Move build_opener to crawler
Forgotten
2017-03-18 23:03:04 -10:00
pictuga f7d570d4c8 crawler: add some broken as rss mimetype
Seen out there
2017-03-18 23:00:13 -10:00
pictuga 2003e2760b Move custom_handler to crawler
Makes more sense. Easier to reuse. Also cleaned up the code a bit
2017-03-18 22:51:27 -10:00
pictuga e1a13a623c crawler: remove inefficient feedburner-specific code 2017-03-18 22:31:03 -10:00
pictuga 367f86987d readabilite: spread score to all ancestors
Instead of just parents and grandparents
2017-03-18 22:24:38 -10:00
pictuga e3ab3c6823 crawler: use fewer ternary operators
Inherited from fork
2017-03-18 22:23:39 -10:00
pictuga 65055290d4 crawler: better use of chardet
Scan whole doc since beginning of html pages tends to be too regular. Ignore ASCII detection for the same reason.
2017-03-18 22:19:54 -10:00
pictuga 9ee6ff60e1 crawler: 301 http code doesn't respect headers
More or less according to the specs
2017-03-18 22:18:10 -10:00
pictuga f4abc4e8a4 Detect encoding (using crawler) before readabilite 2017-03-11 02:30:57 -10:00
pictuga c952b85d92 crawler: cache 301 HTTP code, for a week 2017-03-09 09:37:05 -10:00
pictuga e8023e4336 crawler: remove unused NotInCache error-class 2017-03-09 09:35:40 -10:00
pictuga 385f9eb39a morss: use crawler strict accept for feed 2017-03-08 19:05:48 -10:00
Florian Muenchbach 993ac638a3 Added override for auto-detected character encoding of parsed pages. 2017-03-08 18:45:20 -10:00
pictuga 627163abff Make cache settings in morss nicer 2017-03-08 18:09:24 -10:00
pictuga e5f8e43659 Shifted the <link rel='alternate'/> redirect to crawler
Now using MIMETYPE var from crawler within morss.py
2017-03-08 18:03:34 -10:00
pictuga fb8825b410 crawler: parse html to get http-equiv
For sure slower, but way cleaner (and probably more stable)
2017-03-08 17:50:57 -10:00
pictuga f4f6a86147 feeds: make wheezy.template mandatory
Cleaner code. Less confusing.
2017-03-08 15:38:59 -10:00
pictuga ad9bf946ec crawler: use chardet again
Always nice in case no encoding is specified. Somehow got dropped with commit 245ba99. Most probably by accident
2017-03-08 11:37:12 -10:00
pictuga 3fc89d5359 readabilite: improve score for <p>
Helps a lot with bbc, le monde. Might backfire on other websites tho...
2017-03-01 18:02:45 -10:00
pictuga a8ac2ed1ca Turn FeedBefore/After into ItemBefore/After
To reduce the number of loops
2017-02-28 23:24:32 -10:00
pictuga fcc5e8a076 Add "Feed/Item" in functions name
To make it instantly clearer what they work on
2017-02-28 23:23:15 -10:00
pictuga 60e3311e97 Use readabilite properly
Not thru some weird wrapper anymore
2017-02-28 22:45:26 -10:00
pictuga dc8423550f Support xml starting with \s 2017-02-25 19:04:32 -10:00
pictuga e0f533ca31 readabilite: test to replace <br/> with div 2017-02-25 18:16:15 -10:00
pictuga c6c113b8a8 readabilite: function to clean up the html code 2017-02-25 18:15:33 -10:00
pictuga 58d9f65735 readabilite: explain the use of .tail 2017-02-25 18:14:13 -10:00
pictuga a5aec8c7a6 readability: more keywords to the filter list
Also fixed indentation
2017-02-25 18:13:15 -10:00
pictuga 026903ce73 crawler: change http header after uncompressing
Change content-encoding to "identity"
2017-02-25 18:10:43 -10:00
pictuga e71fc967ce readabilite: shift "good" tags to a var (list)
So that this list can later be re-used
2017-02-25 18:07:28 -10:00
pictuga b14381f575 Use internal readability fork
Much simpler, doesn't clean the html, probably less efficient, but much faster
2016-05-31 02:50:03 +02:00
pictuga 2b9bfb47e5 Remove :smart and etag headers
Dirty code, not very useful. Use simple cache-control instead.
2016-05-31 02:47:49 +02:00
pictuga 4ff80cec86 Check argv length before using it 2016-05-31 02:46:28 +02:00
pictuga 466d8e47d6 Also make buriy's readability port compatible
Should be faster, and it now supports py3
2015-08-29 18:33:12 +02:00
pictuga 95d9d847e9 :proxy implies :keep 2015-08-29 17:48:07 +02:00
pictuga 8a1c00abf0 Typo in python version check 2015-08-28 19:29:09 +02:00
pictuga 624fa47f4f Allow CLI change of the www/ path 2015-08-28 19:22:55 +02:00
pictuga 31fc939d52 Allow CLI change of the http server port 2015-08-28 19:22:23 +02:00
pictuga 4f9000beed Comment code of launching modes 2015-08-28 19:18:09 +02:00
pictuga 5e87b56a03 Return error code in plain text in file server 2015-08-28 19:16:15 +02:00
pictuga ffda3fac7e Improve file detection in web server 2015-08-28 19:15:40 +02:00
pictuga 6741a408dd Remove now-useless ca-cert file path 2015-08-28 19:13:54 +02:00
Massimo Vannucci 8656e53b84 Correct Python version check 2015-08-05 23:36:11 +02:00
Massimo Vannucci 098a306c91 Fixed typo 2015-08-05 23:24:44 +02:00
pictuga 5c2151ffd6 Improve widely feedsportal url decoder 2015-06-14 20:32:47 +08:00
pictuga 8418212475 Use good path for html template access 2015-05-04 22:26:31 +08:00
pictuga 931fd53da6 Fix 304-cache handling
To make sure that the cached request also gets processed (by GZip and stuff)
2015-05-04 22:25:26 +08:00
pictuga ae062ebe90 Remove deprecated https error catch 2015-04-07 18:59:37 +08:00
pictuga 7a3b257328 Make :mono use basic loop
Makes profiling easier
2015-04-07 18:16:08 +08:00
pictuga 2f86a2a44b Remove useless obscure cgi code 2015-04-07 09:49:44 +08:00
pictuga 131ba09207 Change :cache mode behavior
Makes underlying code way cleaner
2015-04-07 09:38:22 +08:00
pictuga cafb87d561 Fix sqlite relative path in cgi 2015-04-07 09:37:25 +08:00
pictuga decb3f15f6 Move the mod_cgi files to /cgi/ 2015-04-07 09:36:00 +08:00
pictuga b267791199 Remove hashbang from __init__.py 2015-04-07 09:34:22 +08:00
pictuga acae47dc79 2to3: fix cli_app string print 2015-04-06 23:27:15 +08:00
pictuga 32aa96afa7 Cache HTTP content using a custom Handler
Much much cleaner. Nothing comparable
2015-04-06 23:26:12 +08:00
pictuga 006478d451 2to3: fix feeds.py string handling
Use bytes strings
2015-04-06 23:13:46 +08:00
pictuga a35225a234 2to3: fix feedify string handling 2015-04-06 23:12:50 +08:00
pictuga 1b4fc88ad0 Replace MetaRedirect handler with two cleaner ones
One for <meta http-equiv> and one for HTTP 'refresh' header
2015-04-06 23:03:17 +08:00
pictuga f2fe4fc364 Drop HTTPS SSL certificate verification
Breaks everything with python 3. Now built-in in recent python 2.7.9 and python 3.4-ish
2015-04-06 22:54:59 +08:00
pictuga 88af80e817 feeds: no need to decode xml strings
It even makes python3 lxml get angry
2015-04-06 22:37:33 +08:00
pictuga 1335b3fdda feedify: use better relative path for the .ini 2015-04-06 22:19:13 +08:00
pictuga c41c0761b6 feedify: don't insert useless url when none is found 2015-04-06 22:15:59 +08:00
pictuga dbc92068f0 feedify: explanation of methods' purpose
Kinda messy when reading code after a year
2015-04-06 22:11:31 +08:00
pictuga 9d64c31947 Feeds: use crawler.py encoding detection 2015-03-24 23:23:40 +08:00
pictuga 29d9e4702f Force enc det to return utf-8 rather than nothing 2015-03-24 23:22:56 +08:00
pictuga 2e3b766a0a http-server port as a var, print port on startup 2015-03-24 23:20:06 +08:00
pictuga b3572e143d New way of calling the program
python -m morss, python morss/main.py
2015-03-11 14:23:14 +08:00
pictuga 656b29e0ef 2to3: using unicode/str to please py3 2015-03-11 01:05:02 +08:00
pictuga cbeb01e555 2to3: fix urllib header retrieval 2015-03-11 01:03:16 +08:00
pictuga 6ae60d0343 2to3: py3-compatible readability fork 2015-03-03 01:03:03 +08:00
pictuga 28bb4b8647 2to3: csv (with if python 3) 2015-03-03 00:59:33 +08:00
pictuga 2f542005d1 2to3: urllib host 2015-03-03 00:59:00 +08:00
pictuga 9bc5b0c7f7 2to3: ordereddict fallback was for python2.6 2015-03-03 00:57:09 +08:00
pictuga dbb3883516 2to3: urllib mimetype 2015-03-03 00:55:58 +08:00
pictuga 7bd448789d 2to3: first attempt to fix strings 2015-02-26 00:50:23 +08:00
pictuga 071288015b 2to3: morss.py port xrange 2015-02-25 18:41:49 +08:00
pictuga 803d6e37c4 2to3: morss.py port most default libs 2015-02-25 18:36:27 +08:00
pictuga 327b8504c4 2to3: feeds.py port urllib2 2015-02-25 18:22:38 +08:00
pictuga 4f6f8bd41b 2to3: feedify.py port http-related lib 2015-02-25 18:16:35 +08:00
pictuga a0f2e0d995 2to3: crawler.py improve except 2015-02-25 18:07:09 +08:00
pictuga 6a06b742f9 2to3: crawler.py port try as 2015-02-25 18:03:54 +08:00
pictuga c2d85e2bf9 2to3: crawler.py port httplib 2015-02-25 18:02:29 +08:00
pictuga 4f224888d8 2to3: crawler.py port urllib2 and StringIO 2015-02-25 17:53:36 +08:00
pictuga 27cf8f6498 2to3: (iter)items to list 2015-02-25 12:02:53 +08:00
pictuga 3fb90cb7b4 2to3: local import 2015-02-25 11:57:10 +08:00
pictuga 47c8a511ff 2to3: print's 2015-02-25 11:57:10 +08:00
pictuga 604b03e2ba Delete desc when :keep=False
Still needed for Firefox, cause empty <desc/> still show up instead of content in feed preview
2015-02-24 00:38:34 +08:00
pictuga 83ed440e67 Fix issue when desc and content empty
Wouldn't put fetched article in feed
2015-02-24 00:38:02 +08:00
pictuga 5c23f90f0b Disable options filtering by default
But still provide sample code
2015-02-21 02:01:32 +08:00
pictuga 149117029c Improve logging of fetching errors 2015-02-21 01:58:45 +08:00
pictuga d5269964fc Make :theforce also bypass http errors 2015-02-21 01:58:16 +08:00
pictuga f0dcb9912e Fix cached errors handling 2015-02-21 01:57:33 +08:00
pictuga f62aedda12 Double HTTP timeout
Better slow than nothing (especially when running on a personal computer)
2015-02-21 01:55:53 +08:00
pictuga 76c4211a04 Make :hungry more useful 2015-02-21 01:55:25 +08:00
pictuga 446dd9fb3f Fix typo in FeedListDescriptor
Thanks @tehsphinx. Fixes #4.
2015-02-20 17:41:14 +08:00
pictuga ef946c0712 XML pretty-print in separate option
Who reads plain XML anyway?
2015-02-20 17:38:39 +08:00
pictuga fcf4197801 Populate __init__.py 2015-02-19 13:05:59 +08:00
pictuga ec5f5b865f Make it easy to restrict available options 2014-11-21 22:01:03 +01:00
pictuga 105ca67744 Move facebook token to own script
To a PHP script actually. Not sure why PHP. Keeps morss' code cleaner. This piece of code had nothing to do in there, and didn't bring any advantage.
2014-11-19 20:09:27 +01:00
pictuga a9654ea578 Fix encoding detection in feedify 2014-11-19 12:25:18 +01:00
pictuga 8131ea2244 HTTPS SSL certificate validation
Specific error message added
2014-11-19 11:59:59 +01:00
pictuga 1b26c5f0e3 Split SimpleDownload in a lot of Handlers
Cleaner code, easier to edit, more flexibility. Paves the way to SSL certificates validation.
Still have to clean up the code of AcceptHeadersHandler.
2014-11-19 11:57:40 +01:00
pictuga f46576168a Add :mono to disable multithreading
Convenient to have linear logging
2014-11-10 23:14:54 +01:00
pictuga 5dd262139d Add HTTP error code to download error message 2014-11-09 15:45:01 +01:00
pictuga 6d5bb2b3c5 Print error message in wgi mode 2014-11-09 15:44:42 +01:00
pictuga a820cf6812 Run :strip in After
Makes more sense
2014-11-09 15:01:50 +01:00
pictuga 607df4b123 Fix Twitter
They changed the html structure of the profile pages
2014-11-09 15:00:38 +01:00
pictuga 5eefe2c916 Log more when using wgi 2014-11-08 21:22:34 +01:00
pictuga 6f2061ff37 Fix :smart
Wasn't using the right way
2014-11-08 21:22:07 +01:00
pictuga 40834eeb93 Split After into Before/After
Needed since a bunch of options needed to be run before the actual fetching (cause no-one needs to fetch the articles of to-be-dropped items)
2014-11-08 20:31:29 +01:00
pictuga f20fb9cdf6 Use more stable loop-over-list in Gather 2014-11-08 20:30:36 +01:00
pictuga 6a40731248 Return output when DEBUG is on
Much more convenient to actually debug
2014-11-07 18:44:59 +01:00
pictuga d3eb2dd88d Implement :smart to save bandwidth 2014-11-07 18:40:44 +01:00
pictuga 67fc5f06f8 Run "After" even when debug mode is on 2014-11-06 21:15:16 +01:00
pictuga ad2673f474 Add :emtpy to remove all items
This is completely useless...
2014-11-06 21:14:41 +01:00
pictuga ecfda1d05a Add :strip to remove desc and content 2014-11-06 21:14:20 +01:00
pictuga 1a8ee716f3 Add "search" option
PLEASE NOTE that this is case sensitive and does really basic research ("is xyz in the title?"). Don't use this for fine filtering.
Also fixed an issue with After(), due to the fact that some functions were removing items from the feed while looping over the feed items, creating some annoying item-skipping issues.
2014-11-06 21:11:23 +01:00
pictuga 690bf43977 reader: show desc if no content is available 2014-10-26 19:22:57 +01:00
pictuga 0e22bb4316 Cache: catch json parse errors 2014-09-28 12:03:58 +02:00
pictuga 5f8288eecb Add :hungry to fill feeds with long intros 2014-06-28 01:43:31 +02:00
pictuga ac69b28f1b Pass options to Fill 2014-06-28 01:43:09 +02:00