morss

Commit Graph

Author	SHA1	Message	Date
pictuga	cb69e3167f	crawler: accept non-ascii urls Covering one more corner case!	2020-04-28 14:47:23 +02:00
pictuga	818cdaaa9b	Make it possible to call sub-libs in non interactive mode Run `python -m morss.feeds http://lemonde.fr` and so on	2020-04-27 18:00:14 +02:00
pictuga	2806c64326	Make it possible to directly run sub-libs (feeds, crawler, readabilite) Run `python -im morss.feeds http://website.sample/rss.xml` and so on	2020-04-27 17:19:31 +02:00
pictuga	6a0531ca03	crawler: randomize user agent	2020-04-24 11:28:39 +02:00
pictuga	8187876a06	crawler: stop at first alternative link Should save a few ms and the first one is usually (?) the most relevant/generic	2020-04-23 11:23:45 +02:00
pictuga	2719bd6776	crawler: fix chinese encoding	2020-04-20 16:14:55 +02:00
pictuga	ec8edb02f1	Various small bug fixes	2020-04-19 12:54:02 +02:00
pictuga	4ce3c7cb32	Small code clean ups	2020-04-19 12:50:05 +02:00
pictuga	036e5190f1	crawler: remove unused code	2020-04-18 21:40:02 +02:00
pictuga	f018437544	crawler: make mysql backend thread safe	2020-04-12 12:53:05 +02:00
pictuga	e5a82ff1f4	crawler: drop auto-referer Was solving some issues. But creating even more issues.	2020-04-07 10:39:21 +02:00
pictuga	7691df5257	Use wrapper for http calls	2020-04-07 10:30:17 +02:00
pictuga	eeac630855	crawler: add more "realistic" headers	2020-04-05 21:11:57 +02:00
pictuga	99461ea185	crawler: fix var name issues (private_cache)	2020-04-05 16:11:36 +02:00
pictuga	bf86c1e962	crawler: make AutoUA match http(s) type	2020-04-05 16:07:51 +02:00
pictuga	d20f6237bd	crawler: replace ContentNegoHandler with AlternateHandler More basic. Sends the same headers no matter what. Make requests more "replicable". Also, drop "text/xml" from RSS contenttype, too broad, matches garbage	2020-04-05 16:05:59 +02:00
pictuga	8a4d68d72c	crawler: drop 'basic' toggle Can't even remember the use case	2020-04-05 16:03:06 +02:00
pictuga	5288cc8796	Clean up unused imports	2020-03-19 15:09:53 +01:00
pictuga	90110a4661	crawler: reduce max file size	2018-10-25 01:15:09 +02:00
pictuga	91a084e5ed	crawler: make py2/3 code distinction clearer	2018-10-25 01:14:46 +02:00
pictuga	945e0dceab	crawler: typo in comment	2018-09-30 21:59:50 +02:00
pictuga	f9217102f3	crawler: fix sqlite/binary issue	2017-11-25 19:58:14 +01:00
pictuga	21480f90de	Move from gzip to zlib to decompress data Faster on incomplete files	2017-11-25 19:57:41 +01:00
pictuga	d091e74d56	crawler: add MySQL backend With extra dependency	2017-11-04 14:51:41 +01:00
pictuga	f29a107a09	crawler: make SQLiteCache inherit from BaseCache Saves some time for other cache backends	2017-11-04 14:48:00 +01:00
pictuga	b7db78f631	crawler: use BLOB in sqlite and drop "buffer" Can't really remember why "buffer" was introduced in the first place	2017-11-04 13:54:40 +01:00
pictuga	194465544a	crawler: separate CacheHandler and actual caching Default cache is now just an in-memory {}	2017-11-04 12:41:56 +01:00
pictuga	523b250907	crawler: SQL request in CAPS for readability	2017-11-04 12:36:58 +01:00
pictuga	a8c2df7f41	crawler: fix truncated gzip reader For python 3	2017-11-04 12:07:08 +01:00
pictuga	d39d0f4cae	crawler: properly define default sqlite file	2017-11-02 22:50:40 +01:00
pictuga	0df6409b0e	crawler: use `with con` to commit, journal WAL for perf	2017-10-28 01:28:47 +02:00
pictuga	7b85f692a0	crawler: fix encoding detection	2017-10-27 23:14:08 +02:00
pictuga	840842d246	crawler: limit download to 500KiB More can only be linked to a fraudulent/incorrect use of the service	2017-10-27 23:12:40 +02:00
pictuga	fbe811384a	crawler: add (unused) DebugHandler to output headers sent/received Saves a lot of time when debugging	2017-10-27 23:10:03 +02:00
pictuga	df22396838	Only use chardet on 2k letters Takes forever otherwise	2017-07-16 23:59:06 +02:00
pictuga	6f0efd5802	crawler: add cookies support Somehow got dropped when splitting the big handler	2017-03-25 19:51:42 -10:00
pictuga	505b02d70d	crawler: remove debugging print()	2017-03-25 13:45:12 -10:00
pictuga	9c331300eb	crawler: move UAHandler to basic Fuck u feedburner	2017-03-19 01:49:17 -10:00
pictuga	99f3c519f2	crawler: fix accept code	2017-03-18 23:37:51 -10:00
pictuga	67f5a21019	Move build_opener to crawler Forgotten	2017-03-18 23:03:04 -10:00
pictuga	f7d570d4c8	crawler: add some broken as rss mimetype Seen out there	2017-03-18 23:00:13 -10:00
pictuga	2003e2760b	Move custom_handler to crawler Makes more sense. Easier to reuse. Also cleaned up a bit the code	2017-03-18 22:51:27 -10:00
pictuga	e1a13a623c	crawler: remove inefficient feedburner-specific code	2017-03-18 22:31:03 -10:00
pictuga	e3ab3c6823	crawler: use less ternary operator Inherited from fork	2017-03-18 22:23:39 -10:00
pictuga	65055290d4	crawler: better use of chardet Scan whole doc since beginning of html pages tends to be too regular. Ignore ASCII detection for the same reason.	2017-03-18 22:19:54 -10:00
pictuga	9ee6ff60e1	crawler: 301 http code doesn't respect headers More or less according to the specs	2017-03-18 22:18:10 -10:00
pictuga	c952b85d92	crawler: cache 301 HTTP code, for a week	2017-03-09 09:37:05 -10:00
pictuga	e8023e4336	crawler: remove unused NotInCache error-class	2017-03-09 09:35:40 -10:00
Florian Muenchbach	993ac638a3	Added override for auto-detected character encoding of parsed pages.	2017-03-08 18:45:20 -10:00
pictuga	e5f8e43659	Shifted the <link rel='alternate'/> redirect to crawler Now using MIMETYPE var from crawler within morss.py	2017-03-08 18:03:34 -10:00
pictuga	fb8825b410	crawler: parse html to get http-equiv For sure slower, but way cleaner (and probably more stable)	2017-03-08 17:50:57 -10:00
pictuga	ad9bf946ec	crawler: use chardet again Always nice in case no encoding is specified. Somehow got dropped with commit `245ba99`. Most probably by accident	2017-03-08 11:37:12 -10:00
pictuga	026903ce73	crawler: change http header after uncompressing Change content-encoding to "identity"	2017-02-25 18:10:43 -10:00
pictuga	8a1c00abf0	Typo in python version check	2015-08-28 19:29:09 +02:00
Massimo Vannucci	8656e53b84	Correct Python version check	2015-08-05 23:36:11 +02:00
pictuga	931fd53da6	Fix 304-cache handling To make sure that the cached request also gets processed (by GZip and stuff)	2015-05-04 22:25:26 +08:00
pictuga	131ba09207	Change :cache mode behavior Makes underlying code way cleaner	2015-04-07 09:38:22 +08:00
pictuga	32aa96afa7	Cache HTTP content using a custom Handler Much much cleaner. Nothing comparable	2015-04-06 23:26:12 +08:00
pictuga	1b4fc88ad0	Replace MetaRedirect handler with two cleaner ones One for <meta http-equiv> and one for HTTP 'refresh' header	2015-04-06 23:03:17 +08:00
pictuga	f2fe4fc364	Drop HTTPS SSL certificate verification Breaks everything with python 3. Now built-in in recent python 2.7.9 and python 3.4-ish	2015-04-06 22:54:59 +08:00
pictuga	29d9e4702f	Force enc det to return utf-8 rather than nothing	2015-03-24 23:22:56 +08:00
pictuga	656b29e0ef	2to3: using unicode/str to please py3	2015-03-11 01:05:02 +08:00
pictuga	cbeb01e555	2to3: fix urllib header retrieval	2015-03-11 01:03:16 +08:00
pictuga	2f542005d1	2to3: urllib host	2015-03-03 00:59:00 +08:00
pictuga	dbb3883516	2to3: urllib mimetype	2015-03-03 00:55:58 +08:00
pictuga	7bd448789d	2to3: first attempt to fix strings	2015-02-26 00:50:23 +08:00
pictuga	a0f2e0d995	2to3: crawler.py improve except	2015-02-25 18:07:09 +08:00
pictuga	6a06b742f9	2to3: crawler.py port try as	2015-02-25 18:03:54 +08:00
pictuga	c2d85e2bf9	2to3: crawler.py port httplib	2015-02-25 18:02:29 +08:00
pictuga	4f224888d8	2to3: crawler.py port urllib2 and StringIO	2015-02-25 17:53:36 +08:00
pictuga	27cf8f6498	2to3: (iter)items to list	2015-02-25 12:02:53 +08:00
pictuga	8131ea2244	HTTPS SSL certificate validation Specific error message added	2014-11-19 11:59:59 +01:00
pictuga	1b26c5f0e3	Split SimpleDownload in a lot of Handlers Cleaner code, easier to edit, more flexibility. Paves the way to SSL certificates validation. Still have to clean up the code of AcceptHeadersHandler.	2014-11-19 11:57:40 +01:00

123 Commits (e1ed33f3207612ee9b96e60ac7697089ad2a896a)
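
Commit 21480f90de above moves decompression from gzip to zlib because it is "faster on incomplete files". The sketch below illustrates that general idea only, assuming the response body is gzip-encoded and may have been cut off by a download limit. It is not taken from the morss source, and the helper name `decompress_partial` and the sample data are hypothetical. Python's `gzip.decompress` rejects truncated input with `EOFError`, whereas a `zlib` decompression object returns whatever it was able to decode.

```python
import gzip
import zlib


def decompress_partial(data):
    """Decode as much of a gzip stream as possible, even if it is truncated.

    Hypothetical helper for illustration, not the morss implementation.
    """
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    return decompressor.decompress(data)


# Made-up payload standing in for a downloaded feed
original = b"".join(b"<item>%d</item>" % i for i in range(2000))
compressed = gzip.compress(original)

# Simulate a download that stopped halfway through
truncated = compressed[: len(compressed) // 2]

partial = decompress_partial(truncated)
print(len(partial), "of", len(original), "bytes recovered")
```

The recovered bytes are always a prefix of the original data, which is usually enough for a parser that tolerates a missing tail.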