123 Commits (e1ed33f3207612ee9b96e60ac7697089ad2a896a)

Author SHA1 Message Date
pictuga cb69e3167f crawler: accept non-ascii urls
Covering one more corner case!
2020-04-28 14:47:23 +02:00
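A minimal sketch of one way to make a non-ASCII URL safe to request, assuming the usual approach of IDNA-encoding the host and percent-encoding the rest. The helper name `to_ascii_url` is hypothetical and is not taken from the commit itself; the actual change in crawler may work differently.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Return an ASCII-only version of `url` (hypothetical helper).

    Note: for brevity this sketch drops any user:password@ part.
    """
    parts = urlsplit(url)
    netloc = parts.netloc
    if parts.hostname:
        # Internationalised host names are encoded with IDNA (punycode)
        host = parts.hostname.encode("idna").decode("ascii")
        netloc = host if parts.port is None else "%s:%d" % (host, parts.port)
    # '%' is kept safe so already-escaped sequences are not escaped twice
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


ascii_path = to_ascii_url("http://example.com/café?q=é")
ascii_host = to_ascii_url("http://bücher.example/")
```

URLs that are already plain ASCII pass through unchanged, so the helper can be applied unconditionally before opening a connection.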
pictuga 818cdaaa9b Make it possible to call sub-libs in non-interactive mode
Run `python -m morss.feeds http://lemonde.fr` and so on
2020-04-27 18:00:14 +02:00
pictuga 2806c64326 Make it possible to directly run sub-libs (feeds, crawler, readabilite)
Run `python -im morss.feeds http://website.sample/rss.xml` and so on
2020-04-27 17:19:31 +02:00
pictuga 6a0531ca03 crawler: randomize user agent 2020-04-24 11:28:39 +02:00
pictuga 8187876a06 crawler: stop at first alternative link
Should save a few ms and the first one is usually (?) the most relevant/generic
2020-04-23 11:23:45 +02:00
pictuga 2719bd6776 crawler: fix Chinese encoding 2020-04-20 16:14:55 +02:00
pictuga ec8edb02f1 Various small bug fixes 2020-04-19 12:54:02 +02:00
pictuga 4ce3c7cb32 Small code clean ups 2020-04-19 12:50:05 +02:00
pictuga 036e5190f1 crawler: remove unused code 2020-04-18 21:40:02 +02:00
pictuga f018437544 crawler: make mysql backend thread safe 2020-04-12 12:53:05 +02:00
pictuga e5a82ff1f4 crawler: drop auto-referer
Was solving some issues. But creating even more issues.
2020-04-07 10:39:21 +02:00
pictuga 7691df5257 Use wrapper for http calls 2020-04-07 10:30:17 +02:00
pictuga eeac630855 crawler: add more "realistic" headers 2020-04-05 21:11:57 +02:00
pictuga 99461ea185 crawler: fix var name issues (private_cache) 2020-04-05 16:11:36 +02:00
pictuga bf86c1e962 crawler: make AutoUA match http(s) type 2020-04-05 16:07:51 +02:00
pictuga d20f6237bd crawler: replace ContentNegoHandler with AlternateHandler
More basic. Sends the same headers no matter what. Make requests more "replicable".
Also, drop "text/xml" from RSS contenttype, too broad, matches garbage
2020-04-05 16:05:59 +02:00
pictuga 8a4d68d72c crawler: drop 'basic' toggle
Can't even remember the use case
2020-04-05 16:03:06 +02:00
pictuga 5288cc8796 Clean up unused imports 2020-03-19 15:09:53 +01:00
pictuga 90110a4661 crawler: reduce max file size 2018-10-25 01:15:09 +02:00
pictuga 91a084e5ed crawler: make py2/3 code distinction clearer 2018-10-25 01:14:46 +02:00
pictuga 945e0dceab crawler: typo in comment 2018-09-30 21:59:50 +02:00
pictuga f9217102f3 crawler: fix sqlite/binary issue 2017-11-25 19:58:14 +01:00
pictuga 21480f90de Move from gzip to zlib to decompress data
Faster on incomplete files
2017-11-25 19:57:41 +01:00
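To illustrate why zlib copes better with incomplete files, here is a hedged sketch (not the project's actual code): a `zlib.decompressobj` returns whatever it has managed to decode so far, whereas the `gzip` module typically raises an error on a truncated stream. The function name `decompress_partial` and the sample payload are assumptions made for the example.

```python
import gzip
import zlib


def decompress_partial(data):
    """Decompress as much of a (possibly truncated) gzip stream as possible."""
    # MAX_WBITS | 32 asks zlib to auto-detect a gzip or zlib header
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 32)
    return decomp.decompress(data)


# Sample payload, large and varied enough to compress to a few KiB
payload = "\n".join("item %d: some text" % i for i in range(2000)).encode("ascii")
full = gzip.compress(payload)
truncated = full[: len(full) // 2]  # simulate an interrupted download

recovered = decompress_partial(truncated)
```

Here `recovered` holds the leading part of `payload`, which is often enough to parse the start of a feed or page.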
pictuga d091e74d56 crawler: add MySQL backend
With extra dependency
2017-11-04 14:51:41 +01:00
pictuga f29a107a09 crawler: make SQLiteCache inherit from BaseCache
Saves some time for other cache backends
2017-11-04 14:48:00 +01:00
pictuga b7db78f631 crawler: use BLOB in sqlite and drop "buffer"
Can't really remember why "buffer" was introduced in the first place
2017-11-04 13:54:40 +01:00
pictuga 194465544a crawler: separate CacheHandler and actual caching
Default cache is now just an in-memory {}
2017-11-04 12:41:56 +01:00
pictuga 523b250907 crawler: SQL request in CAPS for readability 2017-11-04 12:36:58 +01:00
pictuga a8c2df7f41 crawler: fix truncated gzip reader
For python 3
2017-11-04 12:07:08 +01:00
pictuga d39d0f4cae crawler: properly define default sqlite file 2017-11-02 22:50:40 +01:00
pictuga 0df6409b0e crawler: use `with con` to commit, journal WAL for perf 2017-10-28 01:28:47 +02:00
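As a sketch of the pattern this commit names, the standard `sqlite3` connection can be used as a context manager: the block commits on success and rolls back if an exception escapes. The table layout and file name below are assumptions for illustration only, not the project's schema.

```python
import os
import sqlite3
import tempfile

# Hypothetical cache file in a scratch directory
path = os.path.join(tempfile.mkdtemp(), "cache.db")
con = sqlite3.connect(path)

# Write-ahead logging usually lets readers and a writer work concurrently
journal_mode = con.execute("PRAGMA journal_mode = WAL").fetchone()[0]
con.execute("CREATE TABLE IF NOT EXISTS data (url TEXT PRIMARY KEY, body BLOB)")

# `with con` commits automatically when the block ends without error
with con:
    con.execute("INSERT INTO data VALUES (?, ?)", ("http://example.com/", b"payload"))

# ... and rolls back if an exception is raised inside the block
try:
    with con:
        con.execute("INSERT INTO data VALUES (?, ?)", ("http://example.com/2", b"x"))
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

rows = con.execute("SELECT url FROM data").fetchall()
con.close()
```

Only the first insert survives, since the second transaction was rolled back.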
pictuga 7b85f692a0 crawler: fix encoding detection 2017-10-27 23:14:08 +02:00
pictuga 840842d246 crawler: limit download to 500KiB
More can only be linked to a fraudulent/incorrect use of the service
2017-10-27 23:12:40 +02:00
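A size cap like this can be sketched by passing a byte limit to `read()` on the response object. The names `MAX_SIZE` and `read_capped` are hypothetical, and an in-memory stream stands in for a real HTTP response.

```python
import io

MAX_SIZE = 500 * 1024  # 500 KiB, the limit named in the commit message


def read_capped(stream, limit=MAX_SIZE):
    """Read at most `limit` bytes from a file-like object."""
    return stream.read(limit)


# Stand-in for an oversized HTTP response body
oversized = io.BytesIO(b"x" * (MAX_SIZE + 10))
body = read_capped(oversized)
```

Anything beyond the limit is simply left unread, so an oversized document is truncated rather than rejected.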
pictuga fbe811384a crawler: add (unused) DebugHandler to output headers sent/received
Saves a lot of time when debugging
2017-10-27 23:10:03 +02:00
pictuga df22396838 Only use chardet on 2k letters
Takes forever otherwise
2017-07-16 23:59:06 +02:00
pictuga 6f0efd5802 crawler: add cookies support
Somehow got dropped when splitting the big handler
2017-03-25 19:51:42 -10:00
pictuga 505b02d70d crawler: remove debugging print() 2017-03-25 13:45:12 -10:00
pictuga 9c331300eb crawler: move UAHandler to basic
Fuck u feedburner
2017-03-19 01:49:17 -10:00
pictuga 99f3c519f2 crawler: fix accept code 2017-03-18 23:37:51 -10:00
pictuga 67f5a21019 Move build_opener to crawler
Forgotten
2017-03-18 23:03:04 -10:00
pictuga f7d570d4c8 crawler: accept some broken mimetypes as rss
Seen out there
2017-03-18 23:00:13 -10:00
pictuga 2003e2760b Move custom_handler to crawler
Makes more sense. Easier to reuse. Also cleaned up the code a bit
2017-03-18 22:51:27 -10:00
pictuga e1a13a623c crawler: remove inefficient feedburner-specific code 2017-03-18 22:31:03 -10:00
pictuga e3ab3c6823 crawler: use fewer ternary operators
Inherited from fork
2017-03-18 22:23:39 -10:00
pictuga 65055290d4 crawler: better use of chardet
Scan whole doc since beginning of html pages tends to be too regular. Ignore ASCII detection for the same reason.
2017-03-18 22:19:54 -10:00
pictuga 9ee6ff60e1 crawler: 301 http code doesn't respect headers
More or less according to the specs
2017-03-18 22:18:10 -10:00
pictuga c952b85d92 crawler: cache 301 HTTP code, for a week 2017-03-09 09:37:05 -10:00
pictuga e8023e4336 crawler: remove unused NotInCache error-class 2017-03-09 09:35:40 -10:00
Florian Muenchbach 993ac638a3 Added override for auto-detected character encoding of parsed pages. 2017-03-08 18:45:20 -10:00
pictuga e5f8e43659 Shifted the <link rel='alternate'/> redirect to crawler
Now using MIMETYPE var from crawler within morss.py
2017-03-08 18:03:34 -10:00
pictuga fb8825b410 crawler: parse html to get http-equiv
For sure slower, but way cleaner (and probably more stable)
2017-03-08 17:50:57 -10:00
pictuga ad9bf946ec crawler: use chardet again
Always nice in case no encoding is specified. Somehow got dropped with commit 245ba99. Most probably by accident
2017-03-08 11:37:12 -10:00
pictuga 026903ce73 crawler: change http header after uncompressing
Change content-encoding to "identity"
2017-02-25 18:10:43 -10:00
pictuga 8a1c00abf0 Typo in python version check 2015-08-28 19:29:09 +02:00
Massimo Vannucci 8656e53b84 Correct Python version check 2015-08-05 23:36:11 +02:00
pictuga 931fd53da6 Fix 304-cache handling
To make sure that the cached request also gets processed (by GZip and stuff)
2015-05-04 22:25:26 +08:00
pictuga 131ba09207 Change :cache mode behavior
Makes underlying code way cleaner
2015-04-07 09:38:22 +08:00
pictuga 32aa96afa7 Cache HTTP content using a custom Handler
Much much cleaner. Nothing comparable
2015-04-06 23:26:12 +08:00
pictuga 1b4fc88ad0 Replace MetaRedirect handler with two cleaner ones
One for <meta http-equiv> and one for HTTP 'refresh' header
2015-04-06 23:03:17 +08:00
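Both redirect styles carry a value of the form `0; url=...`, either in a `<meta http-equiv="refresh">` tag or in an HTTP `Refresh` header. The sketch below shows one way to extract the target with the standard library only; `parse_refresh` and `MetaRefreshParser` are hypothetical names, not the handlers introduced by this commit, and the simple regex would cut off a URL that itself contains a semicolon.

```python
import re
from html.parser import HTMLParser


def parse_refresh(value):
    """Extract the target URL from a refresh value such as '0; url=http://...'."""
    match = re.search(r"url\s*=\s*['\"]?([^'\";]+)", value, re.I)
    return match.group(1).strip() if match else None


class MetaRefreshParser(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() == "refresh":
            self.target = parse_refresh(attrs.get("content") or "")


# Case 1: value of an HTTP 'Refresh' header
header_target = parse_refresh("0; url=http://example.com/feed")

# Case 2: <meta http-equiv> inside an HTML page
parser = MetaRefreshParser()
parser.feed(
    "<html><head>"
    "<meta http-equiv=\"refresh\" content=\"0; URL='http://example.com/new'\">"
    "</head></html>"
)
meta_target = parser.target
```

Keeping the value parsing in one function lets the header case and the HTML case share the same logic, which matches the idea of two small handlers instead of one large one.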
pictuga f2fe4fc364 Drop HTTPS SSL certificate verification
Breaks everything with python 3. Now built-in in recent python 2.7.9 and python 3.4-ish
2015-04-06 22:54:59 +08:00
pictuga 29d9e4702f Force enc det to return utf-8 rather than nothing 2015-03-24 23:22:56 +08:00
pictuga 656b29e0ef 2to3: using unicode/str to please py3 2015-03-11 01:05:02 +08:00
pictuga cbeb01e555 2to3: fix urllib header retrieval 2015-03-11 01:03:16 +08:00
pictuga 2f542005d1 2to3: urllib host 2015-03-03 00:59:00 +08:00
pictuga dbb3883516 2to3: urllib mimetype 2015-03-03 00:55:58 +08:00
pictuga 7bd448789d 2to3: first attempt to fix strings 2015-02-26 00:50:23 +08:00
pictuga a0f2e0d995 2to3: crawler.py improve except 2015-02-25 18:07:09 +08:00
pictuga 6a06b742f9 2to3: crawler.py port try as 2015-02-25 18:03:54 +08:00
pictuga c2d85e2bf9 2to3: crawler.py port httplib 2015-02-25 18:02:29 +08:00
pictuga 4f224888d8 2to3: crawler.py port urllib2 and StringIO 2015-02-25 17:53:36 +08:00
pictuga 27cf8f6498 2to3: (iter)items to list 2015-02-25 12:02:53 +08:00
pictuga 8131ea2244 HTTPS SSL certificate validation
Specific error message added
2014-11-19 11:59:59 +01:00
pictuga 1b26c5f0e3 Split SimpleDownload in a lot of Handlers
Cleaner code, easier to edit, more flexibility. Paves the way to SSL certificates validation.
Still have to clean up the code of AcceptHeadersHandler.
2014-11-19 11:57:40 +01:00