This week has been an exciting time for those in the Search and Content space, with two product releases from the Apache Software Foundation: Apache Solr 4.10.0 and Apache Tika 1.6.

Both products provide a range of new features and fixes so we thought it would be good to highlight a couple to ensure you can deliver great search/content solutions and services.

Apache Tika 1.6

For those of you who don’t know about Apache Tika, it is a toolkit that detects and extracts metadata and text content from various documents, from PPT to CSV to PDF, using existing parser libraries. It unifies these parsers under a single interface with type detection, allowing you to easily parse over a thousand different file types without needing to know the file type in advance. This is great for tasks such as search engine indexing, content analysis, translation, and much more.

The Apache Tika 1.6 release provides lots of cool new enhancements, but the two I would like to highlight are the Tika Server and the Tika Translator.

Tika Server

The Tika Server is a standalone JAX-RS server that allows users to access Tika via a friendly REST-based API. This is great for users who need to access detection or parsing from many different applications, as well as from languages for which a Tika binding does not currently exist.

Whilst this feature has been around for a while, the results of a Tika Hack at ApacheCon have been finalised in the 1.6 release, making it easier to discover which API methods are available and tidying up how you use them. For example, simply browsing to the root URL of the server Tika is deployed on will give you a list of supported endpoints, which you can then use to discover which parsers are configured or to process your requests.

To get started you can simply download the Tika Server JAR and run it locally:

java -jar tika-server-1.6.jar

By default the server will bind to port 9998, which can be overridden with the -p flag. Once the server has started, you can browse to http://localhost:9998 to see the documentation.
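If you start the server from a script, it can help to wait until it is actually accepting requests before sending it any documents. The following is a minimal sketch and not part of Tika itself: the function names start_tika and wait_for_tika, the TIKA_JAR variable and the retry count are assumptions made for illustration. Only the -p flag and the default port of 9998 come from the description above.

```shell
#!/bin/sh
# Path to the downloaded server JAR (assumed to be in the current directory).
TIKA_JAR="${TIKA_JAR:-tika-server-1.6.jar}"

# Start Tika Server on the given port, falling back to the default of 9998.
start_tika() {
    java -jar "$TIKA_JAR" -p "${1:-9998}"
}

# Poll a URL once per second until it responds or the attempts run out.
# Returns 0 when the server answered, 1 otherwise.
wait_for_tika() {
    url="$1"
    tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        # curl exits non-zero while the connection is still refused.
        if curl -s -o /dev/null "$url"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

For example, running start_tika 9999 & followed by wait_for_tika http://localhost:9999 would start the server on port 9999 in the background and return once the root URL responds.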

The most commonly used endpoints are content parsing (/tika) and metadata extraction (/meta), which you can try out using the following commands:

curl -T <file> http://localhost:9998/tika
curl -T <file> http://localhost:9998/meta
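When you have more than a handful of files, the two curl calls above can be wrapped in small shell functions. This is only a sketch under a few assumptions: the function names, the TIKA_URL variable and the .txt/.meta output naming are made up for illustration, and -sS and -f are standard curl options added here to hide the progress meter and to fail on HTTP errors.

```shell
#!/bin/sh
# Base URL of the Tika Server; 9998 is the default port.
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Upload one file with HTTP PUT (curl -T) and print the extracted text.
tika_text() {
    curl -sS -f -T "$1" "$TIKA_URL/tika"
}

# Upload one file with HTTP PUT and print its metadata.
tika_meta() {
    curl -sS -f -T "$1" "$TIKA_URL/meta"
}

# Extract text and metadata for every file given as an argument and
# write the results next to the source file (hypothetical naming scheme).
# Stops at the first file that fails.
tika_batch() {
    for f in "$@"; do
        tika_text "$f" > "$f.txt" || return 1
        tika_meta "$f" > "$f.meta" || return 1
    done
}
```

With the server running, tika_batch report.pdf slides.ppt would leave report.pdf.txt, report.pdf.meta, slides.ppt.txt and slides.ppt.meta in the same directory as the inputs.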

For those of you looking to dive straight in and check whether Tika Server meets your needs, LogicalSpark has put together an Apache Tika Server OpenShift Cartridge which can help you get an environment up and running in minutes. It is already deployed on OpenShift under our account, so if you would like to try out the Apache Tika Server now you can simply use the following URL:

http://tikaserver-logicalspark.rhcloud.com/

You can read more about the Tika JAX-RS server and its commands on the TikaJAXRS page of the Apache Tika wiki.

Tika Translator

The NASA Jet Propulsion Lab has been working with DARPA on adding translation support to Apache Tika, which has now landed in 1.6. This functionality allows you to send content from Tika for machine translation via a variety of services and engines, including:

  • Google Translate API
  • Lingo24’s Premium Machine Translation API
  • Microsoft Translator API
  • Joshua Machine Translation Engine
  • Moses Machine Translation Engine

Using a Machine Translation Engine (Joshua or Moses) requires you to install the relevant software and have working models, which is a significant undertaking for most users. Access to the three translation APIs, by contrast, is as simple as adding the relevant API access/user key to a property file on the classpath.

At this stage, translation is accessed via the Tika class, which uses the default translator configured for the org.apache.tika.language.translate.Translator service.

Apache Solr 4.10.0

Apache Solr is the popular open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document handling (e.g. Word, PDF), and geospatial search.

Like Apache Tika, the 4.10.0 release of Apache Solr provides lots of great new enhancements, but there is one change we would like to highlight to all Solr users: an update to Solr Cell’s (contrib/extraction) dependency on Apache POI to mitigate two security vulnerabilities.

We would also like to announce the update of our Apache Solr OpenShift QuickStart to 4.10.0.

Solr Cell Update for Security Vulnerability

Apache Solr versions 4.8.0, 4.8.1 and 4.9.0 bundled Apache POI 3.10-beta2 with their binary release tarballs. This version of Apache POI (and all previous ones) is vulnerable to the following issues:

  • CVE-2014-3529: XML External Entity (XXE) problem in Apache POI’s OpenXML parser
  • CVE-2014-3574: XML Entity Expansion (XEE) problem in Apache POI’s OpenXML parser

You will be affected by this if you have enabled the “Apache Solr Content Extraction Library (Solr Cell)” contrib module from the folder “contrib/extraction” of the release tarball. The recommendation from the Apache Solr project, and from LogicalSpark, is to replace the affected libraries or to update to Solr 4.10.0.

You can read more about the vulnerabilities and find instructions on how to replace the libraries here.

If you are not sure whether you are affected, or are worried about upgrading, feel free to contact us and we can help you make sure you are covered.

Apache Solr OpenShift QuickStart

Similar to the Apache Tika Cartridge discussed above, we have also updated our Apache Solr OpenShift QuickStart to use version 4.10.0.

Created by LogicalSpark for a talk at JBUG Scotland, the QuickStart helps users of OpenShift get Apache Solr up and running in their environment with minimal effort. All the code is hosted on GitHub, allowing you to adapt or amend your configuration.

For those looking to get a Solr instance up and running, you can simply create a new OpenShift application using the JBOSSEWS cartridge and clone in the repository.