pictuga
|
4ccc0dafcd
|
Basic help for sub-lib interactive use
|
2020-05-26 19:34:20 +02:00 |
pictuga
|
22005065e8
|
Use etree.tostring 'method' arg
Gives appropriately formatted html code.
Some pages might otherwise be rendered as blank.
|
2020-05-13 11:44:34 +02:00 |
pictuga
|
83dd2925d3
|
readabilite: better parsing
Keeping blank_text keeps the tree more as-is, making the final output closer to expectations
|
2020-05-12 14:15:53 +02:00 |
pictuga
|
c27c38f7c7
|
crawler: return dict instead of tuple
|
2020-04-28 22:29:07 +02:00 |
pictuga
|
44a3e0edc4
|
readabilite: specify in- and out-going encoding
|
2020-04-28 14:44:35 +02:00 |
pictuga
|
818cdaaa9b
|
Make it possible to call sub-libs in non-interactive mode
Run `python -m morss.feeds http://lemonde.fr` and so on
|
2020-04-27 18:00:14 +02:00 |
pictuga
|
2806c64326
|
Make it possible to directly run sub-libs (feeds, crawler, readabilite)
Run `python -im morss.feeds http://website.sample/rss.xml` and so on
|
2020-04-27 17:19:31 +02:00 |
pictuga
|
f6bc23927f
|
readabilite: drop dangerous tags (script, style)
|
2020-04-25 12:25:02 +02:00 |
pictuga
|
c86572374e
|
readabilite: minimum score requirement
|
2020-04-25 12:24:36 +02:00 |
pictuga
|
ec8edb02f1
|
Various small bug fixes
|
2020-04-19 12:54:02 +02:00 |
pictuga
|
a32f5a8536
|
readabilite: add debug option (also used by :get)
|
2020-04-09 19:08:13 +02:00 |
pictuga
|
f3d1f92b39
|
Detect encoding every time
|
2020-04-07 10:38:36 +02:00 |
pictuga
|
bfad6b7a4a
|
readabilite: clean before counting
To remove links which are not kept anyway
|
2020-04-06 16:55:39 +02:00 |
pictuga
|
6b8c3e51e7
|
readabilite: fix threshold feature
Awkward typo...
|
2020-04-06 16:52:06 +02:00 |
pictuga
|
dc9e425247
|
readabilite: don't clean out the top 10% of nodes
Loosen up the code once again to limit overkill
|
2020-04-06 14:26:28 +02:00 |
pictuga
|
2f48e18bb1
|
readabilite: put scores directly in html node
Probably slower but makes code somewhat cleaner...
|
2020-04-06 14:21:41 +02:00 |
pictuga
|
e136b0feb2
|
readabilite: loosen the slayer
Previous impl. led to too many empty results
|
2020-04-05 20:47:30 +02:00 |
pictuga
|
6cf32af6c0
|
readabilite: also use BS
|
2020-04-05 20:46:42 +02:00 |
pictuga
|
a7b01ee85e
|
readabilite: further html processing instructions fix
|
2020-03-21 17:23:50 +01:00 |
pictuga
|
2704e91a3d
|
readabilite: handle more weird html stuff
|
2020-03-19 10:24:09 +01:00 |
pictuga
|
5d93d68f62
|
readabilite: add some function descriptions
|
2018-10-25 01:12:42 +02:00 |
pictuga
|
8d7e1811fd
|
readabilite: update lists
Some code was also meant to be committed earlier
|
2018-10-25 01:12:08 +02:00 |
pictuga
|
72d03f21fe
|
readabilite: forgot count_content
Was meant to be in an earlier commit
|
2018-10-25 01:11:29 +02:00 |
pictuga
|
1d6d0b8ff1
|
readabilite: move br2p in the cleaning code
|
2018-10-25 01:09:15 +02:00 |
pictuga
|
7d005e9a65
|
readabilite: run the new cleaning code
|
2018-10-25 01:08:25 +02:00 |
pictuga
|
58fe5243af
|
readabilite: improve cleaning code
|
2018-10-25 01:07:25 +02:00 |
pictuga
|
f044c242ef
|
readabilite: simplify scoring loop
For performance
|
2018-10-25 00:59:39 +02:00 |
pictuga
|
a6befad136
|
readabilite: change scoring
|
2018-10-25 00:57:43 +02:00 |
pictuga
|
9e71de8d40
|
readabilite: improve output
|
2018-10-24 23:49:16 +02:00 |
pictuga
|
787d90fac0
|
readabilite: some technical improvements for score
Linear, removed misplaced debugging code
|
2018-10-24 23:47:37 +02:00 |
pictuga
|
040d2cb889
|
readabilite: improve word count
|
2018-10-23 00:09:34 +02:00 |
pictuga
|
f563040809
|
readabilite: threshold to detect if it contains an article
Useful for video/image-based pages
|
2017-10-28 01:30:21 +02:00 |
pictuga
|
3bfad54add
|
readabilite: change cleaning & code structure
Kinda struggled to make some "nice" code
|
2017-07-17 00:27:41 +02:00 |
pictuga
|
386bafd391
|
readabilite: write_all use "node" instead of "item"
|
2017-07-17 00:13:15 +02:00 |
pictuga
|
a61b259792
|
readabilite: easy option to highlight the nodes
|
2017-07-17 00:11:49 +02:00 |
pictuga
|
c52b47616d
|
readabilite: always return common of 2 best nodes
Better results. Less is not more
|
2017-07-17 00:10:58 +02:00 |
pictuga
|
bfdda18b9c
|
readabilite: better explain lowest_common output
|
2017-07-17 00:08:00 +02:00 |
pictuga
|
2afea497a3
|
readabilite: br2p use "node" instead of "item"
Confusing with rss items otherwise
|
2017-07-17 00:06:39 +02:00 |
pictuga
|
843dc97fbf
|
readabilite: change scoring algorithm
Use 3 groups of keywords instead
|
2017-07-17 00:01:44 +02:00 |
pictuga
|
3ca6ed5bb0
|
readabilite: add author/about to black list
|
2017-03-24 22:02:41 -10:00 |
pictuga
|
4aa25bf3d8
|
readabilite: clean_html before scoring
Surprisingly efficient
|
2017-03-24 21:50:46 -10:00 |
pictuga
|
bfefa8d599
|
readabilite: add tags to black list
|
2017-03-24 21:50:26 -10:00 |
pictuga
|
91da0f36dc
|
readabilite: comment the clean_html function
|
2017-03-24 21:50:01 -10:00 |
pictuga
|
67889a1d14
|
readabilite: drop useless tags
This extra cluster actually jams the algorithm
|
2017-03-24 21:49:14 -10:00 |
pictuga
|
d6882e0a6a
|
readabilite: (try to) improve detection
Kinda hopeless
|
2017-03-19 02:00:31 -10:00 |
pictuga
|
79a8ada9f4
|
readabilite: add tags to score
|
2017-03-19 01:57:54 -10:00 |
pictuga
|
4a5150e030
|
readabilite: fix iter while iterating
|
2017-03-19 01:56:33 -10:00 |
pictuga
|
e65c88abf8
|
readabilite: fix re.match
|
2017-03-19 01:55:40 -10:00 |
pictuga
|
367f86987d
|
readabilite: spread score to all ancestors
Instead of just parents and grandparents
|
2017-03-18 22:24:38 -10:00 |
Florian Muenchbach
|
993ac638a3
|
Added override for auto-detected character encoding of parsed pages.
|
2017-03-08 18:45:20 -10:00 |