Commit Graph

40 Commits (6cf32af6c070f16c53d80c8884dda05039bbe66a)

Author SHA1 Message Date
pictuga 6cf32af6c0 readabilite: also use BS 2020-04-05 20:46:42 +02:00
pictuga a7b01ee85e readabilite: further html processing instructions fix 2020-03-21 17:23:50 +01:00
pictuga 2704e91a3d readabilite: handle another weird html case 2020-03-19 10:24:09 +01:00
pictuga 5d93d68f62 readabilite: add some function descriptions 2018-10-25 01:12:42 +02:00
pictuga 8d7e1811fd readabilite: update lists
Some code was also meant to be committed earlier
2018-10-25 01:12:08 +02:00
pictuga 72d03f21fe readabilite: forgot count_content
Was meant to be in an earlier commit
2018-10-25 01:11:29 +02:00
pictuga 1d6d0b8ff1 readabilite: move br2p in the cleaning code 2018-10-25 01:09:15 +02:00
pictuga 7d005e9a65 readabilite: run the new cleaning code 2018-10-25 01:08:25 +02:00
pictuga 58fe5243af readabilite: improve cleaning code 2018-10-25 01:07:25 +02:00
pictuga f044c242ef readabilite: simplify scoring loop
For performance
2018-10-25 00:59:39 +02:00
pictuga a6befad136 readabilite: change scoring 2018-10-25 00:57:43 +02:00
pictuga 9e71de8d40 readabilite: improve output 2018-10-24 23:49:16 +02:00
pictuga 787d90fac0 readabilite: some technical improvements for score
Linear, removed misplaced debugging code
2018-10-24 23:47:37 +02:00
pictuga 040d2cb889 readabilite: improve word count 2018-10-23 00:09:34 +02:00
pictuga f563040809 readabilite: threshold to detect if it contains an article
Useful for video/image-based pages
2017-10-28 01:30:21 +02:00
pictuga 3bfad54add readabilite: change cleaning & code structure
Kinda struggled to make some "nice" code
2017-07-17 00:27:41 +02:00
pictuga 386bafd391 readabilite: write_all use "node" instead of "item" 2017-07-17 00:13:15 +02:00
pictuga a61b259792 readabilite: easy option to highlight the nodes 2017-07-17 00:11:49 +02:00
pictuga c52b47616d readabilite: always return common of 2 best nodes
Better results. Less is not more
2017-07-17 00:10:58 +02:00
pictuga bfdda18b9c readabilite: better explain lowest_common output 2017-07-17 00:08:00 +02:00
pictuga 2afea497a3 readabilite: br2p use "node" instead of "item"
Confusing with rss items otherwise
2017-07-17 00:06:39 +02:00
pictuga 843dc97fbf readabilite: change scoring algorithm
Use 3 groups of keywords instead
2017-07-17 00:01:44 +02:00
pictuga 3ca6ed5bb0 readabilite: add author/about to black list 2017-03-24 22:02:41 -10:00
pictuga 4aa25bf3d8 readabilite: clean_html before scoring
Surprisingly efficient
2017-03-24 21:50:46 -10:00
pictuga bfefa8d599 readabilite: add tags to black list 2017-03-24 21:50:26 -10:00
pictuga 91da0f36dc readabilite: comment the clean_html function 2017-03-24 21:50:01 -10:00
pictuga 67889a1d14 readabilite: drop useless tags
This extra cluster actually jams the algorithm
2017-03-24 21:49:14 -10:00
pictuga d6882e0a6a readabilite: (try to) improve detection
Kinda hopeless
2017-03-19 02:00:31 -10:00
pictuga 79a8ada9f4 readabilite: add tags to score 2017-03-19 01:57:54 -10:00
pictuga 4a5150e030 readabilite: fix iter while iterating 2017-03-19 01:56:33 -10:00
pictuga e65c88abf8 readabilite: fix re.match 2017-03-19 01:55:40 -10:00
pictuga 367f86987d readabilite: spread score to all ancestors
Instead of just parents and grandparents
2017-03-18 22:24:38 -10:00
Florian Muenchbach 993ac638a3 Added override for auto-detected character encoding of parsed pages. 2017-03-08 18:45:20 -10:00
pictuga 3fc89d5359 readabilite: improve score for <p>
Helps a lot with bbc, le monde. Might backfire on other websites tho...
2017-03-01 18:02:45 -10:00
pictuga e0f533ca31 readabilite: test to replace <br/> with div 2017-02-25 18:16:15 -10:00
pictuga c6c113b8a8 readabilite: function to clean up the html code 2017-02-25 18:15:33 -10:00
pictuga 58d9f65735 readabilite: explain the use of .tail 2017-02-25 18:14:13 -10:00
pictuga a5aec8c7a6 readability: more keywords to the filter list
Also fixed indentation
2017-02-25 18:13:15 -10:00
pictuga e71fc967ce readabilite: shift "good" tags to a var (list)
So that this list can later be re-used
2017-02-25 18:07:28 -10:00
pictuga b14381f575 Use internal readability fork
Much simpler, doesn't clean the html, probably less efficient, but much faster
2016-05-31 02:50:03 +02:00