72 Commits (master)

Author SHA1 Message Date
pictuga e81f6b173f readabilite: remove duplicate code 2022-01-23 11:41:32 +01:00
pictuga fb643f5ef1 readabilite: remove unneeded reference to `features` (overridden by `builder`)
2022-01-03 18:01:12 +00:00
pictuga dbdca910d8 readabilite: fix new parser code & drop PIs
2022-01-03 17:51:49 +00:00
pictuga 9eb19fac04 readabilite: use custom html parser within bs4's lxml parser
Solves the following obscure error:
ValueError: Invalid PI name 'b'xml''
2022-01-03 16:26:17 +00:00
pictuga d424e394d1 readabilite: use lxml bs4 parser for speed
2022-01-01 14:52:48 +01:00
pictuga 3f92787b38 readabilite: limit html comments related issues
2022-01-01 13:58:42 +01:00
pictuga afc31eb6e9 readabilite: avoid double parsing of html
2022-01-01 12:51:30 +01:00
pictuga d8cc07223e readabilite: fix bug when nothing above threshold
2021-11-23 20:53:00 +01:00
pictuga 6ec3fb47d1 readabilite: .strip() first to save time
2021-11-15 21:54:07 +01:00
pictuga 0365232a73 readabilite: custom xpath for article detection
2021-09-21 08:04:45 +02:00
pictuga 52c48b899f readability: better var names 2021-09-21 08:04:45 +02:00
pictuga 4fd730b983 Further isort implementation 2021-09-21 08:04:45 +02:00
pictuga 0f33db248a Add license info in each file 2020-08-26 20:08:22 +02:00
pictuga b5b355aa6e readabilite: increase penalty for high link density 2020-08-21 23:55:04 +02:00
pictuga c6d3a0eb53 readabilite: clean up code 2020-07-15 00:49:34 +02:00
pictuga 4ccc0dafcd Basic help for sub-lib interactive use 2020-05-26 19:34:20 +02:00
pictuga 22005065e8 Use etree.tostring 'method' arg
Gives appropriately formatted html code.
Some pages might otherwise be rendered as blank.
2020-05-13 11:44:34 +02:00
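
A minimal sketch of what the `method` argument changes. It uses the standard library's `xml.etree.ElementTree` as a stand-in, since its `tostring` accepts the same `method` argument as lxml's `etree.tostring`; the sample markup and variable names are illustrative, not taken from the project. The link to blank pages is an assumption: a self-closed `<script />` is a likely cause, because HTML parsers treat everything after it as script content.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an empty <script> element followed by real content.
root = ET.fromstring('<div><script src="a.js"></script><p>Hello</p></div>')

# XML serialization collapses the empty element into a self-closing tag.
xml_out = ET.tostring(root, encoding="unicode", method="xml")

# HTML serialization keeps an explicit closing tag for non-void elements.
html_out = ET.tostring(root, encoding="unicode", method="html")

print(xml_out)
print(html_out)
```

With lxml the equivalent call would be `etree.tostring(node, method='html')`.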
pictuga 83dd2925d3 readabilite: better parsing
Keeping blank_text keeps the tree more as-is, making the final output closer to expectations
2020-05-12 14:15:53 +02:00
pictuga c27c38f7c7 crawler: return dict instead of tuple 2020-04-28 22:29:07 +02:00
pictuga 44a3e0edc4 readabilite: specify in- and out-going encoding 2020-04-28 14:44:35 +02:00
pictuga 818cdaaa9b Make it possible to call sub-libs in non-interactive mode
Run `python -m morss.feeds http://lemonde.fr` and so on
2020-04-27 18:00:14 +02:00
pictuga 2806c64326 Make it possible to directly run sub-libs (feeds, crawler, readabilite)
Run `python -im morss.feeds http://website.sample/rss.xml` and so on
2020-04-27 17:19:31 +02:00
pictuga f6bc23927f readabilite: drop dangerous tags (script, style) 2020-04-25 12:25:02 +02:00
pictuga c86572374e readabilite: minimum score requirement 2020-04-25 12:24:36 +02:00
pictuga ec8edb02f1 Various small bug fixes 2020-04-19 12:54:02 +02:00
pictuga a32f5a8536 readabilite: add debug option (also used by :get) 2020-04-09 19:08:13 +02:00
pictuga f3d1f92b39 Detect encoding every time 2020-04-07 10:38:36 +02:00
pictuga bfad6b7a4a readabilite: clean before counting
To remove links which are not kept anyway
2020-04-06 16:55:39 +02:00
pictuga 6b8c3e51e7 readabilite: fix threshold feature
Awkward typo...
2020-04-06 16:52:06 +02:00
pictuga dc9e425247 readabilite: don't clean-out the top 10% nodes
Loosen up the code once again to limit over-kill
2020-04-06 14:26:28 +02:00
pictuga 2f48e18bb1 readabilite: put scores directly in html node
Probably slower but makes code somewhat cleaner...
2020-04-06 14:21:41 +02:00
pictuga e136b0feb2 readabilite: loosen the slayer
Previous impl. led to too many empty results
2020-04-05 20:47:30 +02:00
pictuga 6cf32af6c0 readabilite: also use BS 2020-04-05 20:46:42 +02:00
pictuga a7b01ee85e readabilite: further html processing instructions fix 2020-03-21 17:23:50 +01:00
pictuga 2704e91a3d readabilite: handle more weird html stuff 2020-03-19 10:24:09 +01:00
pictuga 5d93d68f62 readabilite: add some function descriptions 2018-10-25 01:12:42 +02:00
pictuga 8d7e1811fd readabilite: update lists
Some code was also meant to be committed earlier
2018-10-25 01:12:08 +02:00
pictuga 72d03f21fe readabilite: forgot count_content
Was meant to be in an earlier commit
2018-10-25 01:11:29 +02:00
pictuga 1d6d0b8ff1 readabilite: move br2p in the cleaning code 2018-10-25 01:09:15 +02:00
pictuga 7d005e9a65 readabilite: run the new cleaning code 2018-10-25 01:08:25 +02:00
pictuga 58fe5243af readabilite: improve cleaning code 2018-10-25 01:07:25 +02:00
pictuga f044c242ef readabilite: simplify scoring loop
For performance
2018-10-25 00:59:39 +02:00
pictuga a6befad136 readabilite: change scoring 2018-10-25 00:57:43 +02:00
pictuga 9e71de8d40 readabilite: improve output 2018-10-24 23:49:16 +02:00
pictuga 787d90fac0 readabilite: some technical improvements for score
Linear, removed misplaced debugging code
2018-10-24 23:47:37 +02:00
pictuga 040d2cb889 readabilite: improve word count 2018-10-23 00:09:34 +02:00
pictuga f563040809 readabilite: threshold to detect if it contains an article
Useful for video/image-based pages
2017-10-28 01:30:21 +02:00
pictuga 3bfad54add readabilite: change cleaning & code structure
Kinda struggled to make some "nice" code
2017-07-17 00:27:41 +02:00
pictuga 386bafd391 readabilite: write_all use "node" instead of "item" 2017-07-17 00:13:15 +02:00
pictuga a61b259792 readabilite: easy option to highlight the nodes 2017-07-17 00:11:49 +02:00