pictuga
386bafd391
readabilite: write_all use "node" instead of "item"
2017-07-17 00:13:15 +02:00
pictuga
a61b259792
readabilite: easy option to highlight the nodes
2017-07-17 00:11:49 +02:00
pictuga
c52b47616d
readabilite: always return common of 2 best nodes
...
Better results. Less is not more
2017-07-17 00:10:58 +02:00
pictuga
bfdda18b9c
readabilite: better explain lowest_common output
2017-07-17 00:08:00 +02:00
pictuga
2afea497a3
readabilite: br2p use "node" instead of "item"
...
Confusing with rss items otherwise
2017-07-17 00:06:39 +02:00
pictuga
843dc97fbf
readabilite: change scoring algorithm
...
Use 3 groups of keywords instead
2017-07-17 00:01:44 +02:00
pictuga
df22396838
Only use chardet on 2k letters
...
Takes forever otherwise
2017-07-16 23:59:06 +02:00
pictuga
6f0efd5802
crawler: add cookies support
...
Somehow got dropped when splitting the big handler
2017-03-25 19:51:42 -10:00
pictuga
d3bc2926fc
Remove :hungry
...
Mostly useless. If you need it, you might as well not need to use morss in the first place...
2017-03-25 13:52:58 -10:00
pictuga
505b02d70d
crawler: remove debugging print()
2017-03-25 13:45:12 -10:00
pictuga
3ca6ed5bb0
readabilite: add author/about to black list
2017-03-24 22:02:41 -10:00
pictuga
4aa25bf3d8
readabilite: clean_html before scoring
...
Surprisingly efficient
2017-03-24 21:50:46 -10:00
pictuga
bfefa8d599
readabilite: add tags to black list
2017-03-24 21:50:26 -10:00
pictuga
91da0f36dc
readabilite: comment the clean_html function
2017-03-24 21:50:01 -10:00
pictuga
67889a1d14
readabilite: drop useless tags
...
This extra cluster actually jams the algorithm
2017-03-24 21:49:14 -10:00
pictuga
167e3e4a15
feedify: accept xpath rules passed as parameters
2017-03-20 20:56:48 -10:00
pictuga
bf3ef586c2
feedify: remove unused downloader
2017-03-20 20:53:52 -10:00
pictuga
08f08ef704
improve morss url detection regex
2017-03-20 20:51:13 -10:00
pictuga
1b4341f741
accept query_string in morss cgi
2017-03-20 20:50:04 -10:00
pictuga
f965566054
feedify: make function use clearer
2017-03-20 20:19:08 -10:00
pictuga
d6882e0a6a
readabilite: (try to) improve detection
...
Kinda hopeless
2017-03-19 02:00:31 -10:00
pictuga
79a8ada9f4
readabilite: add tags to score
2017-03-19 01:57:54 -10:00
pictuga
4a5150e030
readabilite: fix iter while iterating
2017-03-19 01:56:33 -10:00
pictuga
e65c88abf8
readabilite: fix re.match
2017-03-19 01:55:40 -10:00
pictuga
9c331300eb
crawler: move UAHandler to basic
...
Fuck u feedburner
2017-03-19 01:49:17 -10:00
pictuga
5e61686373
Only use full feed for articles & feedify
...
Sometimes using referrer and/or useragent makes some dumb websites return different content (hello feedburner)
2017-03-18 23:43:28 -10:00
pictuga
0b6e553054
Move iTunes code to feedify.py
2017-03-18 23:41:37 -10:00
pictuga
d4937812a8
Remove HTTPError code
...
Only used to look nice but useless (inherits from IOError anyway)
2017-03-18 23:39:32 -10:00
pictuga
99f3c519f2
crawler: fix accept code
2017-03-18 23:37:51 -10:00
pictuga
67f5a21019
Move build_opener to crawler
...
Forgotten
2017-03-18 23:03:04 -10:00
pictuga
f7d570d4c8
crawler: add some broken as rss mimetype
...
Seen out there
2017-03-18 23:00:13 -10:00
pictuga
2003e2760b
Move custom_handler to crawler
...
Makes more sense. Easier to reuse. Also cleaned up the code a bit
2017-03-18 22:51:27 -10:00
pictuga
e1a13a623c
crawler: remove inefficient feedburner-specific code
2017-03-18 22:31:03 -10:00
pictuga
367f86987d
readabilite: spread score to all ancestors
...
Instead of just parents and grandparents
2017-03-18 22:24:38 -10:00
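The idea of spreading a node's score to all of its ancestors can be sketched as follows. This is only an illustration, not the readabilite code: it uses the standard library's ElementTree instead of lxml, the helper name `spread_scores` is hypothetical, and it assumes the score is passed up unchanged (the real algorithm may weight it by depth).

```python
import xml.etree.ElementTree as ET


def spread_scores(root, base_scores):
    # ElementTree nodes have no parent pointer, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    totals = {}
    for node, score in base_scores.items():
        # The node keeps its own score...
        totals[node] = totals.get(node, 0) + score
        # ...and every ancestor up to the root receives it as well.
        ancestor = parents.get(node)
        while ancestor is not None:
            totals[ancestor] = totals.get(ancestor, 0) + score
            ancestor = parents.get(ancestor)
    return totals


doc = ET.fromstring("<html><body><div><p>a</p><p>b</p></div></body></html>")
paragraphs = doc.findall(".//p")
totals = spread_scores(doc, {p: 1 for p in paragraphs})
div = doc.find(".//div")
```

Here each `<p>` is given a score of 1, so the enclosing `<div>`, `<body>` and `<html>` each end up with 2.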
pictuga
e3ab3c6823
crawler: use fewer ternary operators
...
Inherited from fork
2017-03-18 22:23:39 -10:00
pictuga
65055290d4
crawler: better use of chardet
...
Scan whole doc since beginning of html pages tends to be too regular. Ignore ASCII detection for the same reason.
2017-03-18 22:19:54 -10:00
pictuga
9ee6ff60e1
crawler: 301 http code doesn't respect headers
...
More or less according to the specs
2017-03-18 22:18:10 -10:00
pictuga
f4abc4e8a4
Detect encoding (using crawler) before readabilite
2017-03-11 02:30:57 -10:00
pictuga
c952b85d92
crawler: cache 301 HTTP code, for a week
2017-03-09 09:37:05 -10:00
pictuga
e8023e4336
crawler: remove unused NotInCache error-class
2017-03-09 09:35:40 -10:00
pictuga
385f9eb39a
morss: use crawler strict accept for feed
2017-03-08 19:05:48 -10:00
Florian Muenchbach
993ac638a3
Added override for auto-detected character encoding of parsed pages.
2017-03-08 18:45:20 -10:00
pictuga
627163abff
Make cache settings in morss nicer
2017-03-08 18:09:24 -10:00
pictuga
e5f8e43659
Shifted the <link rel='alternate'/> redirect to crawler
...
Now using MIMETYPE var from crawler within morss.py
2017-03-08 18:03:34 -10:00
pictuga
fb8825b410
crawler: parse html to get http-equiv
...
For sure slower, but way cleaner (and probably more stable)
2017-03-08 17:50:57 -10:00
pictuga
f4f6a86147
feeds: make wheezy.template mandatory
...
Cleaner code. Less confusing.
2017-03-08 15:38:59 -10:00
pictuga
ad9bf946ec
crawler: use chardet again
...
Always nice in case no encoding is specified. Somehow got dropped with commit 245ba99. Most probably by accident
2017-03-08 11:37:12 -10:00
pictuga
3fc89d5359
readabilite: improve score for <p>
...
Helps a lot with bbc, le monde. Might backfire on other websites tho...
2017-03-01 18:02:45 -10:00
pictuga
a8ac2ed1ca
Turn FeedBefore/After into ItemBefore/After
...
To reduce the number of loops
2017-02-28 23:24:32 -10:00
pictuga
fcc5e8a076
Add "Feed/Item" in function names
...
To make it instantly clearer what they work on
2017-02-28 23:23:15 -10:00
pictuga
60e3311e97
Use readabilite properly
...
Not thru some weird wrapper anymore
2017-02-28 22:45:26 -10:00
pictuga
dc8423550f
Support xml starting with \s
2017-02-25 19:04:32 -10:00
pictuga
e0f533ca31
readabilite: test to replace <br/> with div
2017-02-25 18:16:15 -10:00
pictuga
c6c113b8a8
readabilite: function to clean up the html code
2017-02-25 18:15:33 -10:00
pictuga
58d9f65735
readabilite: explain the use of .tail
2017-02-25 18:14:13 -10:00
pictuga
a5aec8c7a6
readability: more keywords to the filter list
...
Also fixed indentation
2017-02-25 18:13:15 -10:00
pictuga
026903ce73
crawler: change http header after uncompressing
...
Change content-encoding to "identity"
2017-02-25 18:10:43 -10:00
pictuga
e71fc967ce
readabilite: shift "good" tags to a var (list)
...
So that this list can later be re-used
2017-02-25 18:07:28 -10:00
pictuga
b14381f575
Use internal readability fork
...
Much simpler, doesn't clean the html, probably less efficient, but much faster
2016-05-31 02:50:03 +02:00
pictuga
2b9bfb47e5
Remove :smart and etag headers
...
Dirty code, not very useful. Use simple cache-control instead.
2016-05-31 02:47:49 +02:00
pictuga
4ff80cec86
Check argv length before using it
2016-05-31 02:46:28 +02:00
pictuga
466d8e47d6
Also make buriy's readability port compatible
...
Should be faster, and it now supports py3
2015-08-29 18:33:12 +02:00
pictuga
95d9d847e9
:proxy implies :keep
2015-08-29 17:48:07 +02:00
pictuga
8a1c00abf0
Typo in python version check
2015-08-28 19:29:09 +02:00
pictuga
624fa47f4f
Allow CLI change of the www/ path
2015-08-28 19:22:55 +02:00
pictuga
31fc939d52
Allow CLI change of the http server port
2015-08-28 19:22:23 +02:00
pictuga
4f9000beed
Comment code of launching modes
2015-08-28 19:18:09 +02:00
pictuga
5e87b56a03
Return error code in plain text in file server
2015-08-28 19:16:15 +02:00
pictuga
ffda3fac7e
Improve file detection in web server
2015-08-28 19:15:40 +02:00
pictuga
6741a408dd
Remove now-useless ca-cert file path
2015-08-28 19:13:54 +02:00
Massimo Vannucci
8656e53b84
Correct Python version check
2015-08-05 23:36:11 +02:00
Massimo Vannucci
098a306c91
Fixed typo
2015-08-05 23:24:44 +02:00
pictuga
5c2151ffd6
Greatly improve feedsportal url decoder
2015-06-14 20:32:47 +08:00
pictuga
8418212475
Use good path for html template access
2015-05-04 22:26:31 +08:00
pictuga
931fd53da6
Fix 304-cache handling
...
To make sure that the cached request also gets processed (by GZip and stuff)
2015-05-04 22:25:26 +08:00
pictuga
ae062ebe90
Remove deprecated https error catch
2015-04-07 18:59:37 +08:00
pictuga
7a3b257328
Make :mono use basic loop
...
Makes profiling easier
2015-04-07 18:16:08 +08:00
pictuga
2f86a2a44b
Remove useless obscure cgi code
2015-04-07 09:49:44 +08:00
pictuga
131ba09207
Change :cache mode behavior
...
Makes underlying code way cleaner
2015-04-07 09:38:22 +08:00
pictuga
cafb87d561
Fix sqlite relative path in cgi
2015-04-07 09:37:25 +08:00
pictuga
decb3f15f6
Move the mod_cgi files to /cgi/
2015-04-07 09:36:00 +08:00
pictuga
b267791199
Remove hashbang from __init__.py
2015-04-07 09:34:22 +08:00
pictuga
acae47dc79
2to3: fix cli_app string print
2015-04-06 23:27:15 +08:00
pictuga
32aa96afa7
Cache HTTP content using a custom Handler
...
Much much cleaner. Nothing comparable
2015-04-06 23:26:12 +08:00
pictuga
006478d451
2to3: fix feeds.py string handling
...
Use bytes strings
2015-04-06 23:13:46 +08:00
pictuga
a35225a234
2to3: fix feedify string handling
2015-04-06 23:12:50 +08:00
pictuga
1b4fc88ad0
Replace MetaRedirect handler with two cleaner ones
...
One for <meta http-equiv> and one for HTTP 'refresh' header
2015-04-06 23:03:17 +08:00
pictuga
f2fe4fc364
Drop HTTPS SSL certificate verification
...
Breaks everything with python 3. Now built-in in recent python 2.7.9 and python 3.4-ish
2015-04-06 22:54:59 +08:00
pictuga
88af80e817
feeds: no need to decode xml strings
...
It even makes python3 lxml get angry
2015-04-06 22:37:33 +08:00
pictuga
1335b3fdda
feedify: use better relative path for the .ini
2015-04-06 22:19:13 +08:00
pictuga
c41c0761b6
feedify: don't insert useless url when none is found
2015-04-06 22:15:59 +08:00
pictuga
dbc92068f0
feedify: explanation of methods' purpose
...
Kinda messy when reading code after a year
2015-04-06 22:11:31 +08:00
pictuga
9d64c31947
Feeds: use crawler.py encoding detection
2015-03-24 23:23:40 +08:00
pictuga
29d9e4702f
Force enc det to return utf-8 rather than nothing
2015-03-24 23:22:56 +08:00
pictuga
2e3b766a0a
http-server port as a var, print port on startup
2015-03-24 23:20:06 +08:00
pictuga
b3572e143d
New way of calling the program
...
python -m morss, python morss/main.py
2015-03-11 14:23:14 +08:00
pictuga
656b29e0ef
2to3: using unicode/str to please py3
2015-03-11 01:05:02 +08:00
pictuga
cbeb01e555
2to3: fix urllib header retrieval
2015-03-11 01:03:16 +08:00
pictuga
6ae60d0343
2to3: py3-compatible readability fork
2015-03-03 01:03:03 +08:00
pictuga
28bb4b8647
2to3: csv (with if python 3)
2015-03-03 00:59:33 +08:00
pictuga
2f542005d1
2to3: urllib host
2015-03-03 00:59:00 +08:00
pictuga
9bc5b0c7f7
2to3: ordereddict fallback was for python2.6
2015-03-03 00:57:09 +08:00
pictuga
dbb3883516
2to3: urllib mimetype
2015-03-03 00:55:58 +08:00
pictuga
7bd448789d
2to3: first attempt to fix strings
2015-02-26 00:50:23 +08:00
pictuga
071288015b
2to3: morss.py port xrange
2015-02-25 18:41:49 +08:00
pictuga
803d6e37c4
2to3: morss.py port most default libs
2015-02-25 18:36:27 +08:00
pictuga
327b8504c4
2to3: feeds.py port urllib2
2015-02-25 18:22:38 +08:00
pictuga
4f6f8bd41b
2to3: feedify.py port http-related lib
2015-02-25 18:16:35 +08:00
pictuga
a0f2e0d995
2to3: crawler.py improve except
2015-02-25 18:07:09 +08:00
pictuga
6a06b742f9
2to3: crawler.py port try as
2015-02-25 18:03:54 +08:00
pictuga
c2d85e2bf9
2to3: crawler.py port httplib
2015-02-25 18:02:29 +08:00
pictuga
4f224888d8
2to3: crawler.py port urllib2 and StringIO
2015-02-25 17:53:36 +08:00
pictuga
27cf8f6498
2to3: (iter)items to list
2015-02-25 12:02:53 +08:00
pictuga
3fb90cb7b4
2to3: local import
2015-02-25 11:57:10 +08:00
pictuga
47c8a511ff
2to3: print's
2015-02-25 11:57:10 +08:00
pictuga
604b03e2ba
Delete desc when :keep=False
...
Still needed for Firefox, cause empty <desc/> still show up instead of content in feed preview
2015-02-24 00:38:34 +08:00
pictuga
83ed440e67
Fix issue when desc and content empty
...
Wouldn't put fetched article in feed
2015-02-24 00:38:02 +08:00
pictuga
5c23f90f0b
Disable options filtering by default
...
But still provide sample code
2015-02-21 02:01:32 +08:00
pictuga
149117029c
Improve logging of fetching errors
2015-02-21 01:58:45 +08:00
pictuga
d5269964fc
Make :theforce also bypass http errors
2015-02-21 01:58:16 +08:00
pictuga
f0dcb9912e
Fix cached errors handling
2015-02-21 01:57:33 +08:00
pictuga
f62aedda12
Double HTTP timeout
...
Better slow than nothing (especially when running on a personal computer)
2015-02-21 01:55:53 +08:00
pictuga
76c4211a04
Make :hungry more useful
2015-02-21 01:55:25 +08:00
pictuga
446dd9fb3f
Fix typo in FeedListDescriptor
...
Thanks @tehsphinx. Fixes #4.
2015-02-20 17:41:14 +08:00
pictuga
ef946c0712
XML pretty-print in separate option
...
Who reads plain XML anyway?
2015-02-20 17:38:39 +08:00
pictuga
fcf4197801
Populate __init__.py
2015-02-19 13:05:59 +08:00
pictuga
ec5f5b865f
Make it easy to restrict available options
2014-11-21 22:01:03 +01:00
pictuga
105ca67744
Move facebook token to own script
...
To a PHP script actually. Not sure why PHP. Keeps morss' code cleaner. This piece of code had nothing to do in there, and didn't bring any advantage.
2014-11-19 20:09:27 +01:00
pictuga
a9654ea578
Fix encoding detection in feedify
2014-11-19 12:25:18 +01:00
pictuga
8131ea2244
HTTPS SSL certificate validation
...
Specific error message added
2014-11-19 11:59:59 +01:00
pictuga
1b26c5f0e3
Split SimpleDownload in a lot of Handlers
...
Cleaner code, easier to edit, more flexibility. Paves the way to SSL certificates validation.
Still have to clean up the code of AcceptHeadersHandler.
2014-11-19 11:57:40 +01:00
pictuga
f46576168a
Add :mono to disable multithreading
...
Convenient to have linear logging
2014-11-10 23:14:54 +01:00
pictuga
5dd262139d
Add HTTP error code to download error message
2014-11-09 15:45:01 +01:00
pictuga
6d5bb2b3c5
Print error message in wsgi mode
2014-11-09 15:44:42 +01:00
pictuga
a820cf6812
Run :strip in After
...
Makes more sense
2014-11-09 15:01:50 +01:00
pictuga
607df4b123
Fix Twitter
...
They changed the html structure of the profile pages
2014-11-09 15:00:38 +01:00
pictuga
5eefe2c916
Log more when using wsgi
2014-11-08 21:22:34 +01:00
pictuga
6f2061ff37
Fix :smart
...
Wasn't using the right way
2014-11-08 21:22:07 +01:00
pictuga
40834eeb93
Split After into Before/After
...
Needed since a bunch of options needed to be run before the actual fetching (cause no-one needs to fetch the articles of to-be-dropped items)
2014-11-08 20:31:29 +01:00
pictuga
f20fb9cdf6
Use more stable loop-over-list in Gather
2014-11-08 20:30:36 +01:00
pictuga
6a40731248
Return output when DEBUG is on
...
Much more convenient to actually debug
2014-11-07 18:44:59 +01:00
pictuga
d3eb2dd88d
Implement :smart to save bandwidth
2014-11-07 18:40:44 +01:00
pictuga
67fc5f06f8
Run "After" even when debug mode is on
2014-11-06 21:15:16 +01:00
pictuga
ad2673f474
Add :empty to remove all items
...
This is completely useless...
2014-11-06 21:14:41 +01:00
pictuga
ecfda1d05a
Add :strip to remove desc and content
2014-11-06 21:14:20 +01:00
pictuga
1a8ee716f3
Add "search" option
...
PLEASE NOTE that this is case sensitive and performs a really basic search ("is xyz in the title?"). Don't use this for fine filtering.
Also fixed an issue with After(): some functions were removing items from the feed while looping over the feed items, which caused some annoying item-skipping issues.
2014-11-06 21:11:23 +01:00
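The item-skipping issue described in this commit is the usual result of removing elements from a list while iterating over it. A minimal sketch, assuming a plain Python list of titles rather than the morss feed objects (the names `drop_buggy` and `drop_safe` are hypothetical and do not come from morss):

```python
def drop_buggy(items, word):
    # Removing from the list being iterated shifts later elements left,
    # so the element right after each removed one is never visited.
    for item in items:
        if word in item:
            items.remove(item)
    return items


def drop_safe(items, word):
    # Iterating over a shallow copy keeps the loop stable while the
    # original list is being modified.
    for item in list(items):
        if word in item:
            items.remove(item)
    return items


titles = ["ad one", "ad two", "news", "ad three"]
buggy = drop_buggy(list(titles), "ad")  # "ad two" survives by accident
safe = drop_safe(list(titles), "ad")    # only "news" is left
```

The check `word in item` is also case sensitive, like the "search" option described above.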
pictuga
690bf43977
reader: show desc if no content is available
2014-10-26 19:22:57 +01:00
pictuga
0e22bb4316
Cache: catch json parse errors
2014-09-28 12:03:58 +02:00
pictuga
5f8288eecb
Add :hungry to fill feeds with long intros
2014-06-28 01:43:31 +02:00
pictuga
ac69b28f1b
Pass options to Fill
2014-06-28 01:43:09 +02:00