pictuga
7375adce33
sheet.xsl: fix & improve
2020-04-15 23:34:28 +02:00
pictuga
663212de0a
sheet.xsl: various cosmetic improvements
2020-04-15 23:22:45 +02:00
pictuga
4a2ea1bce9
README: add gunicorn instructions
2020-04-15 22:31:21 +02:00
pictuga
fe82b19c91
Merge .xsl & html template
...
Turns out they somehow serve a similar purpose
2020-04-15 22:30:45 +02:00
pictuga
0b31e97492
morss: remove debug code in http file handler
2020-04-14 23:20:03 +02:00
pictuga
b0ad7c259d
Add README & LICENSE to data_files
2020-04-14 19:34:12 +02:00
pictuga
bffb23f884
README: how to use cli
2020-04-14 18:21:32 +02:00
pictuga
59139272fd
Auto-detect the location of www/
...
Either ../www or /usr/share/morss
Adapted README accordingly
2020-04-14 18:07:19 +02:00
pictuga
39b0a1d7cc
setup.py: fix deps & files
2020-04-14 17:36:42 +02:00
pictuga
65803b328d
New git url and updated date in provided index.html
2020-04-13 15:30:32 +02:00
pictuga
e6b7c0eb33
Fix app definition for uwsgi
2020-04-13 15:30:09 +02:00
pictuga
67c096ad5b
feeds: add fake path to default html parser
...
Without it, some websites were accidentally matching the default parser (false positives)
2020-04-12 13:00:56 +02:00
pictuga
f018437544
crawler: make mysql backend thread safe
2020-04-12 12:53:05 +02:00
pictuga
8e5e8d24a4
Timezone fixes
2020-04-10 20:33:59 +02:00
pictuga
ee78a7875a
morss: focus on the most recent feed items
2020-04-10 16:08:13 +02:00
pictuga
9e7b9d95ee
feeds: properly use html template
2020-04-09 20:00:51 +02:00
pictuga
987a719c4e
feeds: try all parsers regardless of contenttype
...
Turns out some websites send the wrong contenttype (json for html, html for xml, etc.)
2020-04-09 19:17:51 +02:00
pictuga
47b33f4baa
morss: specify server output encoding
2020-04-09 19:10:45 +02:00
pictuga
3c7f512583
feeds: handle several errors
2020-04-09 19:09:10 +02:00
pictuga
a32f5a8536
readabilite: add debug option (also used by :get)
2020-04-09 19:08:13 +02:00
pictuga
63a06524b7
morss: various encoding fixes
2020-04-09 19:06:51 +02:00
pictuga
b0f80c6d3c
morss: fix csv output encoding
2020-04-09 19:05:50 +02:00
pictuga
78cea10ead
morss: replace :getpage with :get
...
Also provides readabilite debugging
2020-04-09 18:43:20 +02:00
pictuga
e5a82ff1f4
crawler: drop auto-referer
...
Was solving some issues, but creating even more.
2020-04-07 10:39:21 +02:00
pictuga
f3d1f92b39
Detect encoding every time
2020-04-07 10:38:36 +02:00
pictuga
7691df5257
Use wrapper for http calls
2020-04-07 10:30:17 +02:00
pictuga
0ae0dbc175
README: mention csv output
2020-04-07 09:24:32 +02:00
pictuga
f1d0431e68
morss: drop :html, replaced with :reader
...
README updated accordingly
2020-04-07 09:23:29 +02:00
pictuga
a09831415f
feeds: fix bug when mimetype matches nothing
2020-04-06 18:53:07 +02:00
pictuga
bfad6b7a4a
readabilite: clean before counting
...
To remove links which are not kept anyway
2020-04-06 16:55:39 +02:00
pictuga
6b8c3e51e7
readabilite: fix threshold feature
...
Awkward typo...
2020-04-06 16:52:06 +02:00
pictuga
dc9e425247
readabilite: don't clean out the top 10% of nodes
...
Loosen up the code once again to limit overkill
2020-04-06 14:26:28 +02:00
pictuga
2f48e18bb1
readabilite: put scores directly in html node
...
Probably slower but makes code somewhat cleaner...
2020-04-06 14:21:41 +02:00
pictuga
31cac921c7
README: remove ref to iTunes
2020-04-05 22:20:33 +02:00
pictuga
a82ec96eb7
Delete feedify.py leftover code
...
iTunes integration untested, unreliable and not working...
2020-04-05 22:16:52 +02:00
pictuga
aad2398e69
feeds: turns out lxml.etree doesn't have drop_tag
2020-04-05 21:50:38 +02:00
pictuga
eeac630855
crawler: add more "realistic" headers
2020-04-05 21:11:57 +02:00
pictuga
e136b0feb2
readabilite: loosen the slayer
...
Previous impl. led to too many empty results
2020-04-05 20:47:30 +02:00
pictuga
6cf32af6c0
readabilite: also use BS
2020-04-05 20:46:42 +02:00
pictuga
568e7d7dd2
feeds: make BS's output bytes for lxml's sake
2020-04-05 20:46:04 +02:00
pictuga
3617f86e9d
morss: make cgi_encore more robust
2020-04-05 16:43:11 +02:00
pictuga
d90756b337
morss: drop 'keep' option
...
Because the Firefox behaviour it is working around is no longer in use
2020-04-05 16:37:27 +02:00
pictuga
40c69f17d2
feeds: parse html with BS
...
More robust & to make it consistent with :getpage
2020-04-05 16:12:41 +02:00
pictuga
99461ea185
crawler: fix var name issues (private_cache)
2020-04-05 16:11:36 +02:00
pictuga
bf86c1e962
crawler: make AutoUA match http(s) type
2020-04-05 16:07:51 +02:00
pictuga
d20f6237bd
crawler: replace ContentNegoHandler with AlternateHandler
...
More basic. Sends the same headers no matter what. Makes requests more "replicable".
Also, drop "text/xml" from RSS contenttype: too broad, matches garbage
2020-04-05 16:05:59 +02:00
pictuga
8a4d68d72c
crawler: drop 'basic' toggle
...
Can't even remember the use case
2020-04-05 16:03:06 +02:00
pictuga
e6811138fd
morss: use redirected url in :getpage
...
Still have to find how to do the same thing with feeds...
2020-04-04 20:04:57 +02:00
pictuga
35b702fffd
morss: default values for feed creation
2020-04-04 19:39:32 +02:00
pictuga
4a88886767
morss: get_page to act as a basic proxy (for iframes)
2020-04-04 16:37:15 +02:00