
News: Apache Tika 1.11 released, a content extraction toolkit

Posted by 漂亮的石头 on 2015-10-27. Forum: Software News

    Apache Tika 1.11 has been released. This version includes a large number of improvements and bug fixes:

    * Java7 API support for allowing java.nio.file.Path as method arguments
    was added to Tika and to ParsingReader, TikaFileTypeDetector, and to
    Tika Config (TIKA-1745, TIKA-1746, TIKA-1751).

    * MIME support was added for WebVTT: The Web Video Text Tracks Format
    files (TIKA-1772).

    * MIME magic improved to ensure emails are detected as message/rfc822
    (TIKA-1771).

    * Upgrade to Jackcess Encrypt 2.1.1 to avoid binary incompatibility
    with Bouncy Castle (TIKA-1736).

    * Make div and other markup more consistent between PPT and
    PPTX (TIKA-1755).

    * Parse multiple authors from MSOffice's semi-colon delimited
    author field (TIKA-1765).

    * Include CTAKESConfig.properties within tika-parsers resources
    by default (TIKA-1741).

    * Prevent infinite recursion when processing inline images
    in PDF files by limiting extraction of duplicate images
    within the same page (TIKA-1742).

    * Upgrade to POI 3.13-final (via Andreas Beeker) (TIKA-1707).

    * Upgraded tika-batch to use Path throughout (TIKA-1747 and
    TIKA-1754).

    * Upgraded to Path in TikaInputStream (via Yaniv Kunda) (TIKA-1744).

    * Changed default content handler type for "/rmeta" in tika-server
    to "xml" to align with "-J" option in tika-app.
    Clients can now specify handler types via PathParam. (TIKA-1716).

    * GROBID (GeneRation Of BIbliographic Data), a machine learning
    library for extracting bibliographic data from PDF files, is now
    integrated as a Tika parser (TIKA-1699, TIKA-1712).

    * The ability to specify the Tesseract Config Path was added
    to the OCR Parser (TIKA-1703).

    * Upgraded to ASM 5.0.4 (TIKA-1705).

    * Corrected Tika Config XML detector definition explicit loading
    of MimeTypes (TIKA-1708).

    * In Tika Parsers, Batch, Server, App and Examples, use Apache
    Commons IO instead of inlined ex-Commons classes, and the Java 7
    Standard Charset definitions (TIKA-1710).

    * Upgraded to Commons Compress 1.10, which enables zlib compressed
    archives support (TIKA-1718).
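
To make one of the items above concrete: the multi-author change (TIKA-1765) comes down to splitting a semicolon-delimited author field into separate values. The following is a minimal sketch of that idea using only the Java standard library. The names `AuthorFieldSplitter` and `splitAuthors` are hypothetical, and this is not Tika's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: illustrates splitting an
// MSOffice-style author field such as "Alice; Bob;Carol" into names.
final class AuthorFieldSplitter {

    private AuthorFieldSplitter() {
    }

    // Splits on semicolons, trims whitespace, and skips empty entries.
    static List<String> splitAuthors(String authorField) {
        List<String> authors = new ArrayList<>();
        if (authorField == null) {
            return authors;
        }
        for (String part : authorField.split(";")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                authors.add(name);
            }
        }
        return authors;
    }

    public static void main(String[] args) {
        System.out.println(splitAuthors("Alice; Bob;Carol"));
    }
}
```

Skipping empty entries means a trailing semicolon or a doubled separator does not produce a blank author.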

    For the full list of changes, see:

    http://www.apache.org/dist/tika/CHANGES-1.11.txt

    Download: http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.11-src.zip

    Apache Tika is also available through Maven:

    http://repo1.maven.org/maven2/org/apache/tika/

    For more details, see the release notes.

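    Tika is a toolkit for content extraction (a toolkit for text extracting). It integrates POI and PDFBox and provides a unified interface for text extraction work. Tika also offers a convenient extension API for adding support for third-party file formats.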

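    As of the 0.2-SNAPSHOT version, Tika supported the following file formats:

    • PDF - via PDFBox
    • MS-* - via POI
    • HTML - uses NekoHTML to clean up malformed HTML into XHTML
    • OpenOffice formats - handled by Tika itself
    • Archive - zip, tar, gzip, bzip, etc.
    • RTF - handled by Tika itself
    • Java class - class parsing is done by ASM
    • Image - metadata extraction only
    • XML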

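    Tika's API is very convenient. At its core is the Parser interface, which defines a parse method:
    public void parse(InputStream stream, ContentHandler handler, Metadata metadata)
    The stream parameter carries the file to be parsed, the text content is passed to handler, and the metadata is written to metadata.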
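
The contract described above can be sketched without any dependencies. The example below is only an illustration: `SketchParser`, `PlainTextSketchParser`, `TextCollector`, and `ParserSketchDemo` are hypothetical names, and a plain `Map` stands in for Tika's `Metadata` class. Only `org.xml.sax.ContentHandler` is the same type the real interface uses. In Tika 1.x releases the real `parse` method also takes a `ParseContext` argument, which this sketch leaves out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Stand-in for the Parser contract: stream in, text out through SAX
// events, metadata filled in as a side effect.
interface SketchParser {
    void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException;
}

// Hypothetical parser for UTF-8 plain text. The caller owns the stream,
// so the parser does not close it.
final class PlainTextSketchParser implements SketchParser {
    @Override
    public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        metadata.put("Content-Type", "text/plain");
        Reader reader = new InputStreamReader(stream, StandardCharsets.UTF_8);
        char[] buffer = new char[4096];
        long total = 0;
        int read;
        handler.startDocument();
        while ((read = reader.read(buffer)) != -1) {
            handler.characters(buffer, 0, read);
            total += read;
        }
        handler.endDocument();
        metadata.put("character-count", Long.toString(total));
    }
}

// Collects character events into a string, as a caller's handler might.
final class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }
}

final class ParserSketchDemo {
    // Runs the sketch parser over an in-memory document.
    static String extract(String document, Map<String, String> metadata) {
        TextCollector collector = new TextCollector();
        InputStream stream = new ByteArrayInputStream(document.getBytes(StandardCharsets.UTF_8));
        try {
            new PlainTextSketchParser().parse(stream, collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        System.out.println(extract("hello tika", metadata));
        System.out.println(metadata);
    }
}
```

Passing the handler and metadata in as arguments lets the caller decide what to do with the text (collect it, index it, discard it) while every parser shares the same method shape.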

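    You can use Tika's ParserUtils helper to obtain a suitable Parser based on a file's MIME type. Alternatively, Tika provides AutoDetectParser, which looks for a suitable Parser based on the characteristic byte patterns of a binary file (for example, its magic code).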
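
To illustrate what detection by magic code means, here is a minimal sketch that inspects the first bytes of a stream and maps them to a MIME type. It is a heavily simplified stand-in, not how AutoDetectParser is implemented: the class name `MagicSketch` is hypothetical, and only two well-known signatures (`%PDF-` for PDF, `PK\x03\x04` for zip) are checked.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified detector. A real detector uses a much larger
// table of magic patterns and additional hints such as file names.
final class MagicSketch {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};
    private static final int HEAD_SIZE = 8;

    private MagicSketch() {
    }

    // Peeks at the leading bytes, then resets the stream so that a parser
    // can still read it from the start. Requires mark/reset support.
    static String detect(InputStream stream) throws IOException {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        stream.mark(HEAD_SIZE);
        byte[] head = new byte[HEAD_SIZE];
        int filled = 0;
        while (filled < head.length) {
            int read = stream.read(head, filled, head.length - filled);
            if (read == -1) {
                break;
            }
            filled += read;
        }
        stream.reset();

        if (startsWith(head, filled, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(head, filled, ZIP_MAGIC)) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    // Convenience overload for in-memory data.
    static String detect(byte[] data) {
        try {
            return detect(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static boolean startsWith(byte[] data, int length, byte[] prefix) {
        if (length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHeader));
    }
}
```

The detected type would then serve as the key for choosing a parser, which is the role the MIME type plays in the lookup described above.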