Scrapy 1.1.0 has been released. Scrapy is an asynchronous crawling framework built on Twisted and implemented in pure Python; by customizing just a few modules you can easily build a spider to scrape web pages and images. The changes are as follows:

Scrapy 1.1 has beta Python 3 support (requires Twisted >= 15.5). See :ref:`news_betapy3` for more details and some limitations.

Hot new features:

- Item loaders now support nested loaders (:issue:`1467`).
- FormRequest.from_response improvements (:issue:`1382`, :issue:`1137`).
- Added setting :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` and improved AutoThrottle docs (:issue:`1324`).
- Added response.text to get the body as unicode (:issue:`1730`).
- Anonymous S3 connections (:issue:`1358`).
- Deferreds in downloader middlewares (:issue:`1473`). This enables better robots.txt handling (:issue:`1471`).
- HTTP caching now follows RFC 2616 more closely; added settings :setting:`HTTPCACHE_ALWAYS_STORE` and :setting:`HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS` (:issue:`1151`).
- Selectors were extracted to the parsel library (:issue:`1409`). This means you can use Scrapy Selectors without Scrapy, and also upgrade the selector engine without needing to upgrade Scrapy.
- The HTTPS downloader now does TLS protocol negotiation by default, instead of forcing TLS 1.0. You can also set the SSL/TLS method using the new setting :setting:`DOWNLOADER_CLIENT_TLS_METHOD`.

These bug fixes may require your attention:

- Bad requests (HTTP 400) are no longer retried by default (:issue:`1289`). If you need the old behavior, add 400 to :setting:`RETRY_HTTP_CODES`.
- Fixed shell file-argument handling (:issue:`1710`, :issue:`1550`). If you run scrapy shell index.html it will try to load the URL http://index.html; use scrapy shell ./index.html to load a local file.
- Robots.txt compliance is now enabled by default for newly created projects (:issue:`1724`). Scrapy will also wait for robots.txt to be downloaded before proceeding with the crawl (:issue:`1735`). If you want to disable this behavior, update :setting:`ROBOTSTXT_OBEY` in the settings.py file after creating a new project.
- Exporters now work on unicode instead of bytes by default (:issue:`1080`). If you use PythonItemExporter, you may want to update your code to disable binary mode, which is now deprecated.
- XML node names containing dots are now accepted as valid (:issue:`1533`).
- When uploading files or images to S3 (with FilesPipeline or ImagesPipeline), the default ACL policy is now "private" instead of "public". Warning: backwards incompatible! You can use :setting:`FILES_STORE_S3_ACL` to change it.
- canonicalize_url() has been reimplemented for more correct output, especially for URLs with non-ASCII characters (:issue:`1947`). This could change link extractor output compared to previous Scrapy versions, and may also invalidate some cache entries left over from pre-1.1 runs. Warning: backwards incompatible!

Download:

- Source code (zip)
- Source code (tar.gz)
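For projects that depended on the old public ACL when uploading to S3, the new :setting:`FILES_STORE_S3_ACL` setting can restore it. A settings.py sketch (the bucket name is a placeholder, not from the release notes):

```python
# settings.py -- adjusting the S3 ACL for uploaded files after Scrapy 1.1.
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}

# Placeholder bucket/path for illustration only.
FILES_STORE = 's3://my-bucket/files/'

# Scrapy 1.1 defaults to "private"; set an explicit ACL here if you need
# the old publicly readable behavior back.
FILES_STORE_S3_ACL = 'public-read'
```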