Tika in Action

I finally finished the book "Tika in Action".

It was a free read, thanks to my friend West.

I don't need to master Tika yet, but the book serves as a nice window into the world of search and the Apache projects in this area.

It covers a lot of topics, though none of them in much detail.

Put simply, Tika is a parser, and a powerful one.

Spun off from the Lucene search engine project, it is specialized for search use cases.

It can detect file types, encodings and even languages out of the box.
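To make the type detection idea concrete, here is a minimal sketch of my own in Python. It is not Tika's implementation (Tika is a Java library with a far larger signature database); the table `MAGIC_SIGNATURES` and the function `detect_type` are hypothetical names, and the UTF-8 fallback is just a simplifying assumption.

```python
# Hypothetical, tiny signature table: (leading bytes, media type).
# A real detector knows many more formats and also looks at names and content.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def detect_type(header: bytes) -> str:
    """Guess a media type from the first bytes of a file."""
    for magic, media_type in MAGIC_SIGNATURES:
        if header.startswith(magic):
            return media_type
    # Assumed fallback: bytes that decode as UTF-8 are treated as plain text.
    try:
        header.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"
```

Calling `detect_type(open(path, "rb").read(16))` on a file would be the typical usage: only the first few bytes are needed, which is why this kind of detection is cheap.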

Through a uniform parse API, it delegates to various parser libraries, such as PDFBox for PDF, to extract and analyze documents and capture any metadata.

This information can easily be fed into a search engine indexer.
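The "one entry point, many parsers" idea can be sketched roughly as follows. This is my own toy illustration in Python, not Tika's API: `PARSERS`, `register`, `parse` and the metadata keys `line-count` and `row-count` are all made-up names, and the "index" is just a dictionary.

```python
import csv
import io
from typing import Callable, Dict, Tuple

# A parse result is (extracted text, metadata), whatever the input format.
ParseResult = Tuple[str, Dict[str, str]]
PARSERS: Dict[str, Callable[[bytes], ParseResult]] = {}


def register(media_type: str):
    """Decorator that plugs a format-specific parser into the registry."""
    def wrap(fn):
        PARSERS[media_type] = fn
        return fn
    return wrap


@register("text/plain")
def parse_text(data: bytes) -> ParseResult:
    text = data.decode("utf-8", errors="replace")
    return text, {"Content-Type": "text/plain",
                  "line-count": str(len(text.splitlines()))}


@register("text/csv")
def parse_csv(data: bytes) -> ParseResult:
    # Delegate the format details to a specialised library (here: csv).
    rows = list(csv.reader(io.StringIO(data.decode("utf-8", errors="replace"))))
    text = " ".join(cell for row in rows for cell in row)
    return text, {"Content-Type": "text/csv", "row-count": str(len(rows))}


def parse(data: bytes, media_type: str) -> ParseResult:
    """Uniform entry point: callers never touch the individual parsers."""
    if media_type not in PARSERS:
        raise ValueError("no parser registered for " + media_type)
    return PARSERS[media_type](data)


# Feeding a toy "index": every document ends up in the same shape.
index = {}
text, metadata = parse(b"name,city\nWest,Hong Kong\n", "text/csv")
index["doc-1"] = {"text": text, **metadata}
```

The point of the design is that the indexer only ever sees text plus a metadata dictionary, so supporting a new format means registering one more parser and nothing else changes.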

It takes care of the special handling that such a parser component needs: incremental extraction, detection of type, encoding and language, using random access to read metadata before parsing, reading data into memory only when needed, and more.

Currently it forms a lower-level component in the Apache search ecosystem, as the book puts it.

For this, it can also be packaged as a modular OSGi bundle.

Apache Search Ecosystem

In the ecosystem there are numerous Apache projects that link to each other, which I will explore.

For the typical search engine use case, one can combine Tika, Lucene and Solr to do faceted search and host it in a web server,

or use Nutch, an Apache version of Google, which uses Tika together with Hadoop for indexing with MapReduce, then uses Gora for data storage in BigTable-like stores such as Apache Accumulo.

I also discovered Jackrabbit for content repository management, an existing project that is relevant to my work.

Mahout, for machine learning algorithms, also seems useful for my research.

Another benefit is that the book also introduces some metadata models like Dublin Core, general file handling concepts, language detection algorithms and so on. This gives me a better understanding of the field I am researching and a better idea of what my topic actually is.
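As a reminder to myself of how language detection can work, here is a small Python sketch based on character trigram profiles, a common approach to the problem. It is a toy under loud assumptions: the two sample sentences in `SAMPLES` stand in for real training corpora, and `trigram_profile`, `cosine` and `guess_language` are hypothetical names of mine, not anything from Tika.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text: str) -> Counter:
    """Count overlapping 3-character sequences in normalised text."""
    text = " " + " ".join(text.lower().split()) + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Assumption: one sentence per language is enough for a demo only.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der hund und die katze laufen durch den wald und der fuchs schaut zu",
}
PROFILES = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}


def guess_language(text: str) -> str:
    """Return the language whose profile is most similar to the text."""
    query = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))
```

With profiles built from real corpora the same comparison works on short snippets surprisingly well, because frequent letter sequences differ a lot between languages.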

 

 

I made many notes, but I don't want to copy the book here. I will try to share them together with how I use Tika in my project.

The book always follows an introduction-content-summary structure, which is good for memory but leaves the actual content relatively superficial and short once that boilerplate is taken out.
IMO this is not very good if you are skimming for insights. Sometimes I wanted to skip ahead but found I had almost missed some important concepts.
