Commit Graph

581 Commits (master)

Author SHA1 Message Date
pictuga e5f8e43659 Shifted the <link rel='alternate'/> redirect to crawler
Now using MIMETYPE var from crawler within morss.py
2017-03-08 18:03:34 -10:00
pictuga fb8825b410 crawler: parse html to get http-equiv
For sure slower, but way cleaner (and probably more stable)
2017-03-08 17:50:57 -10:00
pictuga f4f6a86147 feeds: make wheezy.template mandatory
Cleaner code. Less confusing.
2017-03-08 15:38:59 -10:00
pictuga ad9bf946ec crawler: use chardet again
Always nice in case no encoding is specified. Somehow got dropped with commit 245ba99. Most probably by accident
2017-03-08 11:37:12 -10:00
pictuga 3fc89d5359 readabilite: improve score for <p>
Helps a lot with bbc, le monde. Might backfire on other websites tho...
2017-03-01 18:02:45 -10:00
pictuga a8ac2ed1ca Turn FeedBefore/After into ItemBefore/After
To reduce the number of loops
2017-02-28 23:24:32 -10:00
pictuga fcc5e8a076 Add "Feed/Item" in functions name
To make it instantly clearer what they work on
2017-02-28 23:23:15 -10:00
pictuga 60e3311e97 Use readabilite properly
Not thru some weird wrapper anymore
2017-02-28 22:45:26 -10:00
pictuga dc8423550f Support xml starting with \s 2017-02-25 19:04:32 -10:00
pictuga e0f533ca31 readabilite: test to replace <br/> with div 2017-02-25 18:16:15 -10:00
pictuga c6c113b8a8 readabilite: function to clean up the html code 2017-02-25 18:15:33 -10:00
pictuga 58d9f65735 readabilite: explain the use of .tail 2017-02-25 18:14:13 -10:00
pictuga a5aec8c7a6 readability: more keywords to the filter list
Also fixed indentation
2017-02-25 18:13:15 -10:00
pictuga 026903ce73 crawler: change http header after uncompressing
Change content-encoding to "identity"
2017-02-25 18:10:43 -10:00
pictuga e71fc967ce readabilite: shift "good" tags to a var (list)
So that this list can later be re-used
2017-02-25 18:07:28 -10:00
pictuga b14381f575 Use internal readability fork
Much simpler, doesn't clean the html, probably less efficient, but much faster
2016-05-31 02:50:03 +02:00
pictuga 2b9bfb47e5 Remove :smart and etag headers
Dirty code, not very useful. Use simple cache-control instead.
2016-05-31 02:47:49 +02:00
pictuga 4ff80cec86 Check argv length before using it 2016-05-31 02:46:28 +02:00
pictuga 466d8e47d6 Also make buriy's readability port compatible
Should be faster, and it now supports py3
2015-08-29 18:33:12 +02:00
pictuga 95d9d847e9 :proxy implies :keep 2015-08-29 17:48:07 +02:00
pictuga 8a1c00abf0 Typo in python version check 2015-08-28 19:29:09 +02:00
pictuga 624fa47f4f Allow CLI change of the www/ path 2015-08-28 19:22:55 +02:00
pictuga 31fc939d52 Allow CLI change of the http server port 2015-08-28 19:22:23 +02:00
pictuga 4f9000beed Comment code of launching modes 2015-08-28 19:18:09 +02:00
pictuga 5e87b56a03 Return error code in plain text in file server 2015-08-28 19:16:15 +02:00
pictuga ffda3fac7e Improve file detection in web server 2015-08-28 19:15:40 +02:00
pictuga 6741a408dd Remove now-useless ca-cert file path 2015-08-28 19:13:54 +02:00
Massimo Vannucci 8656e53b84 Correct Python version check 2015-08-05 23:36:11 +02:00
Massimo Vannucci 098a306c91 Fixed typo 2015-08-05 23:24:44 +02:00
pictuga 5c2151ffd6 Improve widely feedsportal url decoder 2015-06-14 20:32:47 +08:00
pictuga 8418212475 Use good path for html template access 2015-05-04 22:26:31 +08:00
pictuga 931fd53da6 Fix 304-cache handling
To make sure that the cached request also gets processed (by GZip and stuff)
2015-05-04 22:25:26 +08:00
pictuga ae062ebe90 Remove deprecated https error catch 2015-04-07 18:59:37 +08:00
pictuga 7a3b257328 Make :mono use basic loop
Makes profiling easier
2015-04-07 18:16:08 +08:00
pictuga 2f86a2a44b Remove useless obscure cgi code 2015-04-07 09:49:44 +08:00
pictuga 131ba09207 Change :cache mode behavior
Makes underlying code way cleaner
2015-04-07 09:38:22 +08:00
pictuga cafb87d561 Fix sqlite relative path in cgi 2015-04-07 09:37:25 +08:00
pictuga decb3f15f6 Move the mod_cgi files to /cgi/ 2015-04-07 09:36:00 +08:00
pictuga b267791199 Remove hashbang from __init__.py 2015-04-07 09:34:22 +08:00
pictuga acae47dc79 2to3: fix cli_app string print 2015-04-06 23:27:15 +08:00
pictuga 32aa96afa7 Cache HTTP content using a custom Handler
Much much cleaner. Nothing comparable
2015-04-06 23:26:12 +08:00
pictuga 006478d451 2to3: fix feeds.py string handling
Use bytes strings
2015-04-06 23:13:46 +08:00
pictuga a35225a234 2to3: fix feedify string handling 2015-04-06 23:12:50 +08:00
pictuga 1b4fc88ad0 Replace MetaRedirect handler with two cleaner ones
One for <meta http-equiv> and one for HTTP 'refresh' header
2015-04-06 23:03:17 +08:00
pictuga f2fe4fc364 Drop HTTPS SSL certificate verification
Breaks everything with python 3. Now built-in in recent python 2.7.9 and python 3.4-ish
2015-04-06 22:54:59 +08:00
pictuga 88af80e817 feeds: no need to decode xml strings
It even makes python3 lxml get angry
2015-04-06 22:37:33 +08:00
pictuga 1335b3fdda feedify: use better relative path for the .ini 2015-04-06 22:19:13 +08:00
pictuga c41c0761b6 feedify: don't insert useless url when none is found 2015-04-06 22:15:59 +08:00
pictuga dbc92068f0 feedify: explanation of methods' purpose
Kinda messy when reading code after a year
2015-04-06 22:11:31 +08:00
pictuga 9d64c31947 Feeds: use crawler.py encoding detection 2015-03-24 23:23:40 +08:00
pictuga 29d9e4702f Force enc det to return utf-8 rather than nothing 2015-03-24 23:22:56 +08:00
pictuga 2e3b766a0a http-server port as a var, print port on startup 2015-03-24 23:20:06 +08:00
pictuga b3572e143d New way of calling the program
python -m morss, python morss/main.py
2015-03-11 14:23:14 +08:00
pictuga 656b29e0ef 2to3: using unicode/str to please py3 2015-03-11 01:05:02 +08:00
pictuga cbeb01e555 2to3: fix urllib header retrieval 2015-03-11 01:03:16 +08:00
pictuga 6ae60d0343 2to3: py3-compatible readability fork 2015-03-03 01:03:03 +08:00
pictuga 28bb4b8647 2to3: csv (with if python 3) 2015-03-03 00:59:33 +08:00
pictuga 2f542005d1 2to3: urllib host 2015-03-03 00:59:00 +08:00
pictuga 9bc5b0c7f7 2to3: ordereddict fallback was for python2.6 2015-03-03 00:57:09 +08:00
pictuga dbb3883516 2to3: urllib mimetype 2015-03-03 00:55:58 +08:00
pictuga 7bd448789d 2to3: first attempt to fix strings 2015-02-26 00:50:23 +08:00
pictuga 071288015b 2to3: morss.py port xrange 2015-02-25 18:41:49 +08:00
pictuga 803d6e37c4 2to3: morss.py port most default libs 2015-02-25 18:36:27 +08:00
pictuga 327b8504c4 2to3: feeds.py port urllib2 2015-02-25 18:22:38 +08:00
pictuga 4f6f8bd41b 2to3: feedify.py port http-related lib 2015-02-25 18:16:35 +08:00
pictuga a0f2e0d995 2to3: crawler.py improve except 2015-02-25 18:07:09 +08:00
pictuga 6a06b742f9 2to3: crawler.py port try as 2015-02-25 18:03:54 +08:00
pictuga c2d85e2bf9 2to3: crawler.py port httplib 2015-02-25 18:02:29 +08:00
pictuga 4f224888d8 2to3: crawler.py port urllib2 and StringIO 2015-02-25 17:53:36 +08:00
pictuga 27cf8f6498 2to3: (iter)items to list 2015-02-25 12:02:53 +08:00
pictuga 3fb90cb7b4 2to3: local import 2015-02-25 11:57:10 +08:00
pictuga 47c8a511ff 2to3: print's 2015-02-25 11:57:10 +08:00
pictuga 604b03e2ba Delete desc when :keep=False
Still needed for Firefox, cause empty <desc/> still show up instead of content in feed preview
2015-02-24 00:38:34 +08:00
pictuga 83ed440e67 Fix issue when desc and content empty
Wouldn't put fetched article in feed
2015-02-24 00:38:02 +08:00
pictuga 5c23f90f0b Disable options filtering by default
But still provide sample code
2015-02-21 02:01:32 +08:00
pictuga 149117029c Improve logging of fetching errors 2015-02-21 01:58:45 +08:00
pictuga d5269964fc Make :theforce also bypass http errors 2015-02-21 01:58:16 +08:00
pictuga f0dcb9912e Fix cached errors handling 2015-02-21 01:57:33 +08:00
pictuga f62aedda12 Double HTTP timeout
Better slow than nothing (especially when running on a personal computer)
2015-02-21 01:55:53 +08:00
pictuga 76c4211a04 Make :hungry more useful 2015-02-21 01:55:25 +08:00
pictuga 446dd9fb3f Fix typo in FeedListDescriptor
Thanks @tehsphinx. Fixes #4.
2015-02-20 17:41:14 +08:00
pictuga ef946c0712 XML pretty-print in separate option
Who reads plain XML anyway?
2015-02-20 17:38:39 +08:00
pictuga fcf4197801 Populate __init__.py 2015-02-19 13:05:59 +08:00
pictuga ec5f5b865f Make it easy to restrict available options 2014-11-21 22:01:03 +01:00
pictuga 105ca67744 Move facebook token to own script
To a PHP script actually. Not sure why PHP. Keeps morss' code cleaner. This piece of code had nothing to do in there, and didn't bring any advantage.
2014-11-19 20:09:27 +01:00
pictuga a9654ea578 Fix encoding detection in feedify 2014-11-19 12:25:18 +01:00
pictuga 8131ea2244 HTTPS SSL certificate validation
Specific error message added
2014-11-19 11:59:59 +01:00
pictuga 1b26c5f0e3 Split SimpleDownload in a lot of Handlers
Cleaner code, easier to edit, more flexibility. Paves the way to SSL certificates validation.
Still have to clean up the code of AcceptHeadersHandler.
2014-11-19 11:57:40 +01:00
pictuga f46576168a Add :mono to disable multithreading
Convenient to have linear logging
2014-11-10 23:14:54 +01:00
pictuga 5dd262139d Add HTTP error code to download error message 2014-11-09 15:45:01 +01:00
pictuga 6d5bb2b3c5 Print error message in wsgi mode 2014-11-09 15:44:42 +01:00
pictuga a820cf6812 Run :strip in After
Makes more sense
2014-11-09 15:01:50 +01:00
pictuga 607df4b123 Fix Twitter
They changed the html structure of the profile pages
2014-11-09 15:00:38 +01:00
pictuga 5eefe2c916 Log more when using wsgi 2014-11-08 21:22:34 +01:00
pictuga 6f2061ff37 Fix :smart
Wasn't using the right way
2014-11-08 21:22:07 +01:00
pictuga 40834eeb93 Split After into Before/After
Needed since a bunch of options needed to be run before the actual fetching (cause no-one needs to fetch the articles of to-be-dropped items)
2014-11-08 20:31:29 +01:00
pictuga f20fb9cdf6 Use more stable loop-over-list in Gather 2014-11-08 20:30:36 +01:00
pictuga 6a40731248 Return output when DEBUG is on
Much more convenient to actually debug
2014-11-07 18:44:59 +01:00
pictuga d3eb2dd88d Implement :smart to save bandwidth 2014-11-07 18:40:44 +01:00
pictuga 67fc5f06f8 Run "After" even when debug mode is on 2014-11-06 21:15:16 +01:00
pictuga ad2673f474 Add :empty to remove all items
This is completely useless...
2014-11-06 21:14:41 +01:00
pictuga ecfda1d05a Add :strip to remove desc and content 2014-11-06 21:14:20 +01:00
pictuga 1a8ee716f3 Add "search" option
PLEASE NOTE that this is case sensitive and does a really basic search ("is xyz in the title?"). Don't use this for fine filtering.
Also fixed an issue with After(), due to the fact that some functions were removing items from the feed while looping over the feed items, creating some annoying item-skipping issues.
2014-11-06 21:11:23 +01:00
pictuga 690bf43977 reader: show desc if no content is available 2014-10-26 19:22:57 +01:00
pictuga 0e22bb4316 Cache: catch json parse errors 2014-09-28 12:03:58 +02:00
pictuga 5f8288eecb Add :hungry to fill feeds with long intros 2014-06-28 01:43:31 +02:00
pictuga ac69b28f1b Pass options to Fill 2014-06-28 01:43:09 +02:00
pictuga 6cc3e7eb93 Fix :callback and add content-type 2014-06-28 01:20:47 +02:00
pictuga 0ec7c2f3e6 Fix :callback crash 2014-06-28 01:13:29 +02:00
pictuga 484432d804 Add :callback for JSONP calls 2014-06-28 00:59:57 +02:00
pictuga 226441d821 Add :cors for cross-domain XHR (with README update) 2014-06-28 00:59:13 +02:00
pictuga 230659a34b Reenable args with values 2014-06-28 00:58:37 +02:00
pictuga 38b90e0e4c Fix template syntax 2014-06-22 20:23:32 +02:00
pictuga d877e856d3 Fix feed.items.append since pep8
The underscore naming convention was not yet applied in that function
2014-06-22 20:13:36 +02:00
pictuga ee3b2590d0 Remove useless line-break (pep8) 2014-06-22 20:00:44 +02:00
pictuga 5a0084c7cc Fix isPermaLink in feedify 2014-06-22 19:54:13 +02:00
pictuga e991d356f4 Fix duckduckgo layout in .ini 2014-06-22 19:53:53 +02:00
pictuga ecabbc0175 Replace <a> with <span> in reader with :noref 2014-06-22 19:42:52 +02:00
pictuga 6352ef28a9 Use pep8-like layout for .ini 2014-06-22 02:14:11 +02:00
pictuga 3ca5dbaf31 Raise ImportError when missing dependency for call 2014-06-22 02:04:14 +02:00
pictuga 9f51448160 Use xrange where applicable (faster) 2014-06-22 02:02:43 +02:00
pictuga f01efb7334 Make most of the code pep8-compliant
Thanks a lot to github.com/SamuelMarks for his nice work
2014-06-22 01:59:01 +02:00
pictuga da0a8feadd Replace TABS with FOUR SPACES in .py
(you might want to use: git diff -w)
2014-06-21 18:35:59 +02:00
pictuga da857f8bb2 Remove useless odata var in morss/morss.py 2014-06-21 18:25:50 +02:00
pictuga 286b90ab8e Fix typo in error raising message 2014-06-21 16:29:05 +02:00
pictuga cc27483143 Remove unused imports 2014-06-21 16:13:54 +02:00
pictuga 1cf959ce5b Fix item.link deletion 2014-06-21 16:08:37 +02:00
pictuga de5b75162c Add :ad mode (as an example)
Not really useful, but shows how to quickly add/remove items from the feed
2014-06-16 14:07:59 +02:00
pictuga 850d574424 Add one comment
Was waiting to be committed for months...
2014-06-16 14:07:23 +02:00
pictuga 45478b592e Remove cache-redirect
Some kind of no-longer-working code left-over
2014-06-16 14:06:42 +02:00
pictuga 8270685ac6 Use longer timeout for xml fetching 2014-06-16 14:03:24 +02:00
pictuga 0e3751c712 Remove useless comment 2014-06-16 14:02:54 +02:00
pictuga 862fe3cae4 Use more recent user-agent 2014-06-16 14:01:01 +02:00
pictuga 7211093cc5 Add :smart :noref modes, update README 2014-06-16 14:00:02 +02:00
pictuga f991802d9e Try to use less server-specific code for FB tokens 2014-06-16 13:57:53 +02:00
pictuga 9285525256 Unify internal/external errors 2014-06-16 13:55:59 +02:00
pictuga cdef40fbbe Fix Cache saving crash
Because was deleting values of a dict while looping over its values...
2014-06-07 19:14:31 +02:00
pictuga f90958149e Add :reader
Uses wheezy.template, which is said to be fast and light. Provided template file is really basic, custom css suggested.
2014-05-29 14:12:16 +02:00
pictuga b66ac2bc5e Make it possible not to use caching 2014-05-24 19:13:41 +02:00
pictuga 25fdca4bf0 Add do-it-all function
For quick lib use
2014-05-24 19:02:22 +02:00
pictuga 26c91070f5 Time-based Cache
Solves the :proxy issue for good. More convenient, more flexible
2014-05-24 19:01:21 +02:00
pictuga 5e64696031 Fix '/morss.py/' url fixer 2014-05-22 22:53:36 +02:00
pictuga 364fbc4ba6 Remove apparent limit
Cause no longer works, cause of all-bool args introduced earlier
2014-05-22 22:52:49 +02:00
pictuga b03d865b7b Get rid of ParseOptions()
That thing wasn't nice, and depended too much on the various use case. The new approach is to turn morss into a library and turn the use cases into some pre-implemented lib usages
2014-05-22 22:44:59 +02:00
pictuga 3c48c58127 Remove useless HOLD var
Was needed in DEBUG at some point
2014-05-21 12:19:49 +02:00
pictuga e8e7f170a6 Include super dumb http file server
For index.html, other files can be added, but everything has to be hard-coded (mimetype included)
2014-05-18 12:34:23 +02:00
pictuga c41a1fe226 Support for wikipedia featured articles feed
Should work with most wikipedias
2014-05-18 12:17:14 +02:00
pictuga d8a3c4e9af Add support for Google News 2014-05-18 11:58:45 +02:00
pictuga bbf1ffbb15 Remove 'persistent' and 'dic' arg in Cache
'dic' was mostly intended for facebook now-bygone advanced buggy token storage. 'persistent' was needed by fb and 'proxy' mode, but a small workaround was found for the proxy mode (basically making sure the cache object is always at least 5-item long)
2014-05-15 00:54:40 +02:00
pictuga 76e7f1ea00 Try to use more generic 302/303 redirections
Still far from being great, but at least I can use it on both morss.it and test.morss.it now
2014-05-14 15:05:14 +02:00
pictuga 031b67a8db Remove some useless options
progress and an accidentally-disclosed one, cause useless
2014-05-14 15:03:40 +02:00
pictuga 974bad7974 Fix and strip down facebook
Remove unstable non-working facebook semi-automatic token renewal (a simple warning on morss.it should be enough). Also committed some forgotten stuff.
2014-05-14 15:01:41 +02:00
pictuga b7136f2056 Pull iTunes raw feed out of iTunes url
This iTunes thingy somehow qualifies as yet-another-apple-tech-rape: just some old tech behind iron curtains…
2014-05-12 23:15:51 +02:00
pictuga d8074d6b6d Redirect google translate links to original link
Cause anyway Google Translate isn't scrapable. So it's better to have at least some content.
2014-03-22 20:53:33 +01:00
pictuga a4cf5e0daa Google link cleaner now works on all .dot versions 2014-03-22 20:52:25 +01:00
pictuga c94ef92131 Fix Facebook support
Now token is grabbed directly by the server, and sent back by means of a cookie. This does unify token "creation" and renewal.
2014-02-21 14:36:06 +01:00
pictuga a1f5c3db3a Have .csv files be downloaded
So that users can open it in LibreOffice/OpenOffice/Word without having to save it to disk beforehand
2014-02-05 00:37:12 +01:00
pictuga 6c33bb6e1c Safer Cache saving
Create tmp file and then move it to destination. Avoids corrupt files during write
2014-01-29 20:36:45 +01:00
pictuga 6eaec96af7 Keep "dic" param in Cache.new 2014-01-22 15:56:08 +01:00
pictuga 4e549dc88a Change lim/max settings only for current "run" 2014-01-19 23:36:41 +01:00
pictuga 0f7bc568e4 Send CGI HTTP headers earlier
So that browsers show that sth is going on
2014-01-15 21:02:47 +01:00
pictuga 4d6ef92504 Separate function for output. Add csv 2014-01-13 00:10:57 +01:00
pictuga 7fbe728f93 Feeds: allow json, csv export
Uses OrderedDict
2014-01-13 00:08:03 +01:00
pictuga ec55f5e856 Use smarter order for RSS.dict 2014-01-13 00:07:04 +01:00
pictuga 3d78cfb638 Fix HTTP bug when returning empty page 2014-01-11 18:21:37 +01:00
pictuga 840b0b1ded Remove yet another silly log message 2014-01-11 18:18:02 +01:00
pictuga 8209f243bb Fix rss-redirection code
And add log, which was lost when splitting functions (which made this fix needed)
2014-01-11 18:15:36 +01:00
pictuga 3b3ac4c8a6 Remove batch of useless imports 2014-01-11 17:31:27 +01:00
pictuga 5feb061bf7 First attempt at decent folder structure
Use setup.py, subfolder for code.
2014-01-11 17:11:57 +01:00
pictuga 851dacdfbc Renamed to .py. 2013-04-04 18:17:12 +02:00
pictuga 6783bbf992 Improved shebang. 2013-04-04 17:56:37 +02:00
pictuga 82084c2c75 Move to OOP.
This is a huge commit. The whole code is ported to Object-Oriented Programming. This makes the code cleaner, which became required to deal with all the different cases, for example with encoding detection. Encoding detection now works better, and uses 3 different methods. HTML pages with an xml declaration are now supported. Feed urls with parameters (eg. "index.php?option=par") are also supported. Cache is now smarter, since it no longer grows indefinitely, since only in-use pages are kept in the cache. Caching is now mandatory. urllib (not urllib2) is no longer needed. Solved a possible crash with log function (when passing list of str with non-unicode encoding).
README is also updated.
2013-04-04 17:43:30 +02:00
pictuga 05b5bc7783 Catch extra errors (timeout). 2013-03-29 20:06:31 +01:00
pictuga 6f6c5fbaad Faster xml cleaning 2013-03-01 14:26:51 +01:00
pictuga e305f387ab Hopefully fixed encoding issues
with the dirtiest trick out there...
2013-02-27 15:12:32 +01:00
pictuga ed8a45875c Default to "//h1/.." since most websites use it
because it is said to be good for SEO. Debug now requires env variable "DEBUG" to be set to something else than "".
2013-02-25 21:36:02 +01:00
pictuga d39604c453 Support for cookies added
NYT needs them
2013-02-25 20:53:59 +01:00
pictuga d6179a734f Clearer debug info 2013-02-25 20:53:22 +01:00
pictuga eb63ce3f4f Handle more errors 2013-02-25 18:32:23 +01:00
pictuga b63f91a151 Added cache, easier debug 2013-02-25 18:01:59 +01:00
pictuga 51fe6ce81b First commit 2013-02-25 15:50:32 +01:00