Apache Nutch 1.14 has been released. Nutch is a mature, production-ready web crawler. Nutch 1.x allows fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.

Changes:

Bug fixes
[NUTCH-2071] - A parser failure on a single document may fail crawling job
[NUTCH-2235] - Classpath discrepancy with protocol-selenium in deploy mode
[NUTCH-2269] - Clean not working after crawl
[NUTCH-2295] - Nutch master docker container broken
[NUTCH-2297] - CrawlDbReader -stats wrong values for earliest fetch time and shortest interval
[NUTCH-2316] - Library conflict with Parser-Tika Plugin and Lib Folder

Improvements
[NUTCH-1763] - Improving comments on the Injector Class
[NUTCH-2034] - CrawlDB filtered documents counter.
[NUTCH-2035] - Regex filter using case sensitive rules.
[NUTCH-2046] - The crawl script should be able to skip an initial injection.
[NUTCH-2135] - Ant Eclipse build does not include protocol-interactiveselenium
[NUTCH-2193] - Upgrade feed parser plugin to use rome 1.5

See the release notes for the full list of changes.

Download: http://nutch.apache.org/downloads.html