pictuga
0b3e6d7749
Apply isort
2021-09-21 08:04:23 +02:00
pictuga
06e0ada95b
Allow POST requests
2021-09-08 20:43:21 +02:00
pictuga
8f24214915
crawler: better name for custom fns
2021-08-29 00:22:40 +02:00
pictuga
5582fbef31
crawler: comment
2021-08-29 00:18:50 +02:00
pictuga
6880a443e0
crawler: improve CacheHandler code
2021-03-25 23:54:08 +01:00
pictuga
7342ab26d2
crawler: comment on how urllib works
2021-03-25 23:49:58 +01:00
pictuga
981da9e66a
crawler: SQLITE_PATH point to .db file instead of folder
2021-03-25 23:48:21 +01:00
pictuga
3e886caaab
crawler: drop encoding setting
2020-10-30 22:41:16 +01:00
pictuga
ad927e03a7
crawler: use regex instead of lxml
Less reliable but should be faster
2020-10-30 22:21:19 +01:00
pictuga
0efb096fa7
crawler: shift gzip & encoding-fix to intermediary handler
2020-10-30 22:16:51 +01:00
pictuga
9ab2e488ef
crawler: add intermediary handlers
2020-10-30 22:15:35 +01:00
pictuga
b525ab0d26
crawler: fix typo
2020-10-30 22:12:43 +01:00
pictuga
bd0bca69fc
crawler: ignore ssl via env var
2020-10-03 19:57:08 +02:00
pictuga
8abd951d40
More sensible default values for cache autotrim (1k entries, 1min)
2020-10-03 19:55:57 +02:00
pictuga
056a1b143f
crawler: autotrim: make ctrl+c work
2020-10-01 00:04:36 +02:00
pictuga
eed949736a
crawler: add ability to limit cache size
2020-09-30 23:59:55 +02:00
pictuga
d9f46b23a6
crawler: default value for MYSQL_HOST (localhost)
2020-09-30 13:17:02 +02:00
pictuga
bbada0436a
Quick guide to ignore SSL certs
2020-09-27 16:48:22 +02:00
pictuga
0f33db248a
Add license info in each file
2020-08-26 20:08:22 +02:00
pictuga
bd0efb1529
crawler: missing os import
2020-08-23 18:45:44 +02:00
pictuga
4dfebe78f7
Pick caching backend via env vars
2020-08-23 18:43:18 +02:00
pictuga
64af86c11e
crawler: catch html parsing errors
2020-07-06 12:25:38 +02:00
pictuga
ce4cf01aa6
crawler: clean up encoding detection code
2020-05-27 21:35:24 +02:00
pictuga
dcfdb75a15
crawler: fix Chinese encoding support
2020-05-27 21:34:43 +02:00
pictuga
4ccc0dafcd
Basic help for sub-lib interactive use
2020-05-26 19:34:20 +02:00
pictuga
5dac4c69a1
crawler: more code comments
2020-05-12 20:44:25 +02:00
pictuga
36e2a1c3fd
crawler: increase size limit from 100KiB to 500
I'm looking at you, worldbankgroup.csod.com/ats/careersite/search.aspx
2020-05-12 19:34:16 +02:00
pictuga
f685139137
crawler: use UPSERT statements
Avoid potential race conditions
2020-05-03 21:27:45 +02:00
pictuga
271ac8f80f
crawler: comment code a bit
2020-05-02 19:18:01 +02:00
pictuga
64e41b807d
crawler: handle http:/ (single slash)
Fixing one more corner case! malayalam.oneindia.com
2020-05-02 19:17:15 +02:00
pictuga
c27c38f7c7
crawler: return dict instead of tuple
2020-04-28 22:29:07 +02:00
pictuga
749acc87fc
Centralize url clean up in crawler.py
2020-04-28 22:03:49 +02:00
pictuga
cb69e3167f
crawler: accept non-ascii urls
Covering one more corner case!
2020-04-28 14:47:23 +02:00
pictuga
818cdaaa9b
Make it possible to call sub-libs in non-interactive mode
Run `python -m morss.feeds http://lemonde.fr` and so on
2020-04-27 18:00:14 +02:00
pictuga
2806c64326
Make it possible to directly run sub-libs (feeds, crawler, readabilite)
Run `python -im morss.feeds http://website.sample/rss.xml` and so on
2020-04-27 17:19:31 +02:00
pictuga
6a0531ca03
crawler: randomize user agent
2020-04-24 11:28:39 +02:00
pictuga
8187876a06
crawler: stop at first alternative link
Should save a few ms and the first one is usually (?) the most relevant/generic
2020-04-23 11:23:45 +02:00
pictuga
2719bd6776
crawler: fix Chinese encoding
2020-04-20 16:14:55 +02:00
pictuga
ec8edb02f1
Various small bug fixes
2020-04-19 12:54:02 +02:00
pictuga
4ce3c7cb32
Small code clean ups
2020-04-19 12:50:05 +02:00
pictuga
036e5190f1
crawler: remove unused code
2020-04-18 21:40:02 +02:00
pictuga
f018437544
crawler: make mysql backend thread safe
2020-04-12 12:53:05 +02:00
pictuga
e5a82ff1f4
crawler: drop auto-referer
Was solving some issues, but creating even more.
2020-04-07 10:39:21 +02:00
pictuga
7691df5257
Use wrapper for http calls
2020-04-07 10:30:17 +02:00
pictuga
eeac630855
crawler: add more "realistic" headers
2020-04-05 21:11:57 +02:00
pictuga
99461ea185
crawler: fix var name issues (private_cache)
2020-04-05 16:11:36 +02:00
pictuga
bf86c1e962
crawler: make AutoUA match http(s) type
2020-04-05 16:07:51 +02:00
pictuga
d20f6237bd
crawler: replace ContentNegoHandler with AlternateHandler
More basic. Sends the same headers no matter what. Makes requests more "replicable".
Also, drop "text/xml" from RSS contenttype, too broad, matches garbage
2020-04-05 16:05:59 +02:00
pictuga
8a4d68d72c
crawler: drop 'basic' toggle
Can't even remember the use case
2020-04-05 16:03:06 +02:00
pictuga
5288cc8796
Clean up unused imports
2020-03-19 15:09:53 +01:00