Finally finished the book "Tika in Action",
a free read thanks to my friend West.
I don’t need to master Tika yet, but the book serves as a nice window into the world of search and the Apache projects in this area.
It covers a lot of topics, though none of them in much detail.
To put it simply, Tika is a parser, and a powerful one.
Spawned from the search engine project Lucene, it is specialized for search use cases.
It can detect file types, encodings and even languages out of the box.
Behind a uniform parse API, it delegates to various parser libraries, such as PDFBox for PDF, to extract and analyze documents and capture any metadata.
This information can easily be fed into a search engine indexer.
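To make the detect-then-delegate idea concrete, here is a minimal Python sketch of my own. It is not Tika's API or implementation: the magic-byte table, the function names (`detect_type`, `parse`) and the stub PDF parser are all hypothetical, and a real detector knows far more signatures and also looks at file names and container contents.

```python
from typing import Callable, Dict, Tuple

# Hypothetical magic-byte table, for illustration only.
MAGIC = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),
]

ParseResult = Tuple[str, Dict[str, str]]  # (extracted text, metadata)


def detect_type(data: bytes) -> str:
    """Guess a media type from the leading bytes of a document."""
    for prefix, media_type in MAGIC:
        if data.startswith(prefix):
            return media_type
    try:
        data.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"


def parse_plain_text(data: bytes) -> ParseResult:
    return data.decode("utf-8"), {"Content-Encoding": "UTF-8"}


def parse_pdf_stub(data: bytes) -> ParseResult:
    # Placeholder: a real parser would delegate to a PDF library here.
    return "", {"note": "PDF extraction not implemented in this sketch"}


# One entry point, many format-specific parsers behind it.
PARSERS: Dict[str, Callable[[bytes], ParseResult]] = {
    "text/plain": parse_plain_text,
    "application/pdf": parse_pdf_stub,
}


def parse(data: bytes) -> ParseResult:
    """Detect the type, pick a parser, and always report the detected type."""
    media_type = detect_type(data)
    parser = PARSERS.get(media_type)
    if parser is None:
        return "", {"Content-Type": media_type}
    text, metadata = parser(data)
    metadata["Content-Type"] = media_type
    return text, metadata
```

The caller only ever sees `parse`, and the text plus metadata dictionary it returns is the kind of record an indexer would consume.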
It takes care of the special handling such a parser component needs: incremental extraction, detection of type, encoding and language, using random access to read metadata before parsing,
loading content into memory only when needed, and more.
Currently it forms a lower-level component in the Apache search ecosystem, as the book puts it.
To that end, it can also be packaged as a modular OSGi bundle.
Apache Search Ecosystem
In this ecosystem there are numerous Apache projects that link to each other, which I will explore.
For the typical search engine use case, one can combine Tika, Lucene and Solr to provide faceted search hosted in a web server,
or use Nutch, a kind of Apache version of Google, which uses Tika together with Hadoop to index with MapReduce, then uses Gora for BigTable-like data storage (e.g. Apache Accumulo).
I also discovered Jackrabbit, for content repository management, which is the existing project most relevant to mine.
Mahout, for machine learning algorithms, also seems useful for my research.
Another benefit is that the book also introduces some metadata models such as Dublin Core, general file handling concepts, language detection algorithms, etc. This helped me understand more about the field I am researching and gave me a better idea of what my topic actually is.
I made many notes, but I don’t want to copy the book here. I will try to share them along with how I use Tika in my project.
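As a small illustration of what a language detection algorithm can look like, here is a sketch of one common approach: comparing character trigram profiles. This is my own toy version, not Tika's implementation. The two sample sentences, the cosine similarity measure and the function names are assumptions made for the example, and a usable detector would be trained on far more text.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text: str) -> Counter:
    """Count overlapping character trigrams in lower-cased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# Toy training text per language (hypothetical, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def detect_language(text: str) -> str:
    """Return the language whose profile is most similar to the input's."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

Calling `detect_language("the dog and the fox")` picks the English profile because trigrams like "the" and "and" dominate the input. Short inputs and closely related languages are where this kind of approach starts to struggle.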
The book always follows an introduction-content-summary pattern, which is good for memory but leaves the actual content relatively superficial and short once that boilerplate is set aside.
IMO this is not very good if you are skimming for insights. Sometimes I wanted to skip ahead, but found I had almost missed some important concepts.