| dcfdb75a15 | crawler: fix chinese encoding support | 2020-05-27 21:34:43 +02:00 |
| 4ccc0dafcd | Basic help for sub-lib interactive use | 2020-05-26 19:34:20 +02:00 |
| 2fe3e0b8ee | feeds: clean up other stylesheets before putting ours | 2020-05-26 19:26:36 +02:00 |
| ad3ba9de1a | sheet.xsl: add <select/> to use :firstlink | 2020-05-13 12:33:12 +02:00 |
| 68c46a1823 | morss: remove deprecated twitter/fb link handling | 2020-05-13 12:31:09 +02:00 |
| 91be2d229e | morss: ability to use first link from desc instead of default link | 2020-05-13 12:29:53 +02:00 |
| 038f267ea2 | Rename :theforce into :force | 2020-05-13 11:49:15 +02:00 |
| 22005065e8 | Use etree.tostring 'method' arg. Gives appropriately formatted html code. Some pages might otherwise be rendered as blank. | 2020-05-13 11:44:34 +02:00 |
| 7d0d416610 | morss: cache articles for 24hrs. Also make it possible to refetch articles, regardless of cache | 2020-05-12 21:10:31 +02:00 |
| 5dac4c69a1 | crawler: more code comments | 2020-05-12 20:44:25 +02:00 |
| 36e2a1c3fd | crawler: increase size limit from 100KiB to 500. I'm looking at you, worldbankgroup.csod.com/ats/careersite/search.aspx | 2020-05-12 19:34:16 +02:00 |
| 83dd2925d3 | readabilite: better parsing. Keeping blank_text keeps the tree more as-is, making the final output closer to expectations | 2020-05-12 14:15:53 +02:00 |
| e09d0abf54 | morss: remove deprecated piece of code | 2020-05-07 16:05:30 +02:00 |
| ff26a560cb | Shift safari work around to morss.py | 2020-05-07 16:04:54 +02:00 |
| 74d7a1eca2 | sheet.xsl: fix word wrap | 2020-05-06 16:58:28 +02:00 |
| eba295cba8 | sheet.xsl: fixes for safari | 2020-05-06 12:01:27 +02:00 |
| f27631954e | .htaccess: bypass Safari RSS detection | 2020-05-06 11:47:24 +02:00 |
| c74abfa2f4 | sheet.xsl: use CDATA for js code | 2020-05-06 11:46:38 +02:00 |
| 1d5272c299 | sheet.xsl: allow zooming on mobile | 2020-05-04 14:44:43 +02:00 |
| f685139137 | crawler: use UPSERT statements. Avoid potential race conditions | 2020-05-03 21:27:45 +02:00 |
| 73b477665e | morss: separate :clip with <hr> instead of stars | 2020-05-02 19:19:54 +02:00 |
| b425992783 | morss: don't follow alt=rss with custom feeds. To have the same page as with :get=page and to avoid shitty feeds | 2020-05-02 19:18:58 +02:00 |
| 271ac8f80f | crawler: comment code a bit | 2020-05-02 19:18:01 +02:00 |
| 64e41b807d | crawler: handle http:/ (single slash). Fixing one more corner case! malayalam.oneindia.com | 2020-05-02 19:17:15 +02:00 |
| a2c4691090 | sheet.xsl: dir=auto for rtl languages (arabic, etc.) | 2020-04-29 15:01:33 +02:00 |
| b6000923bc | README: clean up deprecated code | 2020-04-28 22:31:11 +02:00 |
| 27a42c47aa | morss: use final request url. Code is not very elegant... | 2020-04-28 22:30:21 +02:00 |
| c27c38f7c7 | crawler: return dict instead of tuple | 2020-04-28 22:29:07 +02:00 |
| a1dc96cb50 | feeds: remove mimetype from function call as no longer used | 2020-04-28 22:07:25 +02:00 |
| 749acc87fc | Centralize url clean up in crawler.py | 2020-04-28 22:03:49 +02:00 |
| c186188557 | README: warning about lxml installation | 2020-04-28 21:58:26 +02:00 |
| cb69e3167f | crawler: accept non-ascii urls. Covering one more corner case! | 2020-04-28 14:47:23 +02:00 |
| c3f06da947 | morss: process(): specify encoding for clarity | 2020-04-28 14:45:00 +02:00 |
| 44a3e0edc4 | readabilite: specify in- and out-going encoding | 2020-04-28 14:44:35 +02:00 |
| 4a9b505499 | README: update python lib instructions | 2020-04-27 18:12:14 +02:00 |
| 818cdaaa9b | Make it possible to call sub-libs in non interactive mode. Run `python -m morss.feeds http://lemonde.fr` and so on | 2020-04-27 18:00:14 +02:00 |
| 2806c64326 | Make it possible to directly run sub-libs (feeds, crawler, readabilite). Run `python -im morss.feeds http://website.sample/rss.xml` and so on | 2020-04-27 17:19:31 +02:00 |
| d39d7bb19d | sheet.xsl: limit overflow | 2020-04-25 15:27:49 +02:00 |
| e5e3746fc6 | sheet.xsl: show plain url | 2020-04-25 15:27:13 +02:00 |
| 960c9d10d6 | sheet.xsl: customize output feed form | 2020-04-25 15:26:47 +02:00 |
| 0e7a5b9780 | sheet.xsl: wrap header in <header> | 2020-04-25 15:24:57 +02:00 |
| 186bedcf62 | sheet.xsl: smarter html reparser | 2020-04-25 15:22:25 +02:00 |
| 5847e18e42 | sheet: improved feed address output (w/ c/c) | 2020-04-25 15:21:47 +02:00 |
| f6bc23927f | readabilite: drop dangerous tags (script, style) | 2020-04-25 12:25:02 +02:00 |
| c86572374e | readabilite: minimum score requirement | 2020-04-25 12:24:36 +02:00 |
| 59ef5af9e2 | feeds: fix bug when deleting attr in html | 2020-04-24 22:12:05 +02:00 |
| 6a0531ca03 | crawler: randomize user agent | 2020-04-24 11:28:39 +02:00 |
| 8187876a06 | crawler: stop at first alternative link. Should save a few ms and the first one is usually (?) the most relevant/generic | 2020-04-23 11:23:45 +02:00 |
| 325a373e3e | feeds: add SyntaxError catch | 2020-04-20 16:15:15 +02:00 |
| 2719bd6776 | crawler: fix chinese encoding | 2020-04-20 16:14:55 +02:00 |