Sunday, April 14, 2013

Apache Solr Installation


Hi Friends,
In the last post we discussed Apache Solr and its features. In this post we will cover its setup.

Setup

As the very first step, you should follow the official tutorial, which covers the basic aspects of any search use case:
  • Indexing - get data of any form (e.g. JSON, XML, CSV or rows from a SQL database) into Solr. This step builds the inverted index, i.e. it links every term to the documents that contain it.
  • Querying - ask Solr to return the most relevant documents for the user's query.
To follow the official tutorial you'll have to download Java and the latest version of Solr. More information about the installation is available in the official documentation.
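If you'd rather drive these two steps from code than from the command line, a minimal SolrJ (4.x) sketch for indexing one document and querying it back could look roughly like the one below. It assumes a local Solr instance on the default port with the example core collection1 and the example schema's id and title fields; adjust the URL and field names for your setup.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrRoundTrip {
    public static void main(String[] args) throws Exception {
        // Assumed URL of the example core started by the official tutorial
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Indexing: create a document, send it to Solr and commit so it becomes searchable
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");
        doc.addField("title", "Superman Returns");
        solr.add(doc);
        solr.commit();

        // Querying: ask Solr for the most relevant documents for the term "superman"
        QueryResponse response = solr.query(new SolrQuery("title:superman"));
        System.out.println("Hits: " + response.getResults().getNumFound());

        solr.shutdown();
    }
}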

Next you'll have to decide which servlet container you want to run Solr in. The official tutorial uses Jetty, but you can also use Tomcat or JBoss.
Indexing
If you've followed the official tutorial, you have already pushed some XML files into the Solr index. This process is called indexing or feeding. There are many more ways to get data into Solr:
  • Using the Data Import Handler (DIH) is a really powerful, language-neutral option. It allows you to read from a SQL database, CSV or XML files, RSS feeds, emails, etc. without any Java knowledge. DIH handles full imports as well as delta imports, which are useful when only a small number of documents have been added, updated or deleted.
  • The HTTP interface is what the post tool uses - you have already used it in the official tutorial to index XML files. A small sketch of feeding Solr over plain HTTP from code follows below this list.
  • Client libraries in different languages also exist, e.g. for Java (SolrJ) or Python.
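Because the update interface is plain HTTP, any language that can make an HTTP POST can feed Solr. As a rough illustration (not the post tool itself), the sketch below sends one document as an XML add command from plain Java without SolrJ; the URL, core name and field names are assumptions you'd adapt to your setup.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HttpIndexing {
    public static void main(String[] args) throws Exception {
        // XML add command understood by Solr's /update handler; field names are assumptions
        String xml = "<add><doc>"
                + "<field name=\"id\">example-2</field>"
                + "<field name=\"title\">Batman Begins</field>"
                + "</doc></add>";

        // commit=true makes the document searchable immediately (fine for a demo, not for bulk loads)
        URL url = new URL("http://localhost:8983/solr/collection1/update?commit=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");

        try (OutputStream out = conn.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Solr answered with HTTP " + conn.getResponseCode());
    }
}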
Before indexing you'll have to decide which data fields should be searchable and how the fields should be indexed. For example, when a field contains HTML, you can strip irrelevant characters, tokenize the text into 'searchable terms', lower-case the terms and finally stem them. In contrast, if a field contains text that should not be interpreted (e.g. URLs), you shouldn't tokenize it and should use the default field type string. Please refer to the official documentation about field and field type definitions in the schema.xml file.
When designing an index, keep in mind the advice from Mauricio: "The document is what you will search for." For example, if you have tweets and you want to search for similar users, you'll need to set up a user index created from the tweets; then every document is a user. If you want to search for tweets, set up a tweet index instead; then every document is a tweet. Of course, you can set up both indices with Solr's multi-index options.
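To make this concrete, here is a rough SolrJ sketch that builds such a user index from tweets, with one document per user; the Tweet class, the users core and the field names are invented for the example.

import java.util.*;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class UserIndexFeeder {
    // Minimal stand-in for a tweet; in a real project this would come from your data source
    static class Tweet {
        final String user;
        final String text;
        Tweet(String user, String text) { this.user = user; this.text = text; }
    }

    public static void main(String[] args) throws Exception {
        List<Tweet> tweets = Arrays.asList(
                new Tweet("clark", "Just flew over Metropolis"),
                new Tweet("clark", "Glasses are a great disguise"),
                new Tweet("bruce", "Another quiet night in Gotham"));

        // Group the tweets by user: each user becomes one searchable document
        Map<String, SolrInputDocument> users = new HashMap<>();
        for (Tweet t : tweets) {
            SolrInputDocument doc = users.get(t.user);
            if (doc == null) {
                doc = new SolrInputDocument();
                doc.addField("id", t.user);
                users.put(t.user, doc);
            }
            doc.addField("tweet_text", t.text); // multi-valued field collecting the user's tweets
        }

        // Assumed URL of a separate "users" core
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/users");
        solr.add(users.values());
        solr.commit();
        solr.shutdown();
    }
}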
Please also note that there is a project called Solr Cell which lets you extract the relevant information out of several different document types with the help of Tika.
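With SolrJ, sending a file through Solr Cell can look roughly like the sketch below; it assumes the /update/extract handler is enabled in solrconfig.xml, and the file name and literal id are placeholders.

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class SolrCellExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Ask the extracting request handler (Solr Cell / Tika) to parse the file
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("whitepaper.pdf"), "application/pdf");
        req.setParam("literal.id", "whitepaper-1"); // id under which the extracted document is stored
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        solr.request(req);
        solr.shutdown();
    }
}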

Querying

For debugging it is very convenient to use the HTTP interface from a browser to query Solr and get back XML; Firefox, for example, displays the XML response nicely. A request like http://localhost:8983/solr/collection1/select?q=superman&debugQuery=true (assuming the example core collection1) returns the matching documents together with scoring details.
You can also do a lot more; one other concept is boosting. In Solr you can boost at index time and at query time. To prefer matches of the term in the title, write:
q=title:superman^2 subject:superman
When using the dismax request handler write:
q=superman&qf=title^2 subject
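From SolrJ, the same dismax query with a boosted title field can be sent roughly like this (the core URL and field names are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class BoostedQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery query = new SolrQuery("superman");
        query.set("defType", "dismax");      // use the dismax query parser
        query.set("qf", "title^2 subject");  // search title and subject, boosting hits in the title

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("title"));
        }
        solr.shutdown();
    }
}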
Check out all the various query options like fuzzy search, spellchecking of the query input, facets, collapsing and suffix query support.

Hope this will help!!



Apache Solr - Open Source Search Engine


Apache Solr



Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is the most popular enterprise search engine, and Solr 4 adds NoSQL features.
Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Apache Tomcat, JBoss or Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it usable from most popular programming languages. Solr's powerful external configuration allows it to be tailored to many types of application without Java coding, and it has a plugin architecture to support more advanced customization.
Apache Lucene and Apache Solr have both been produced by the same Apache Software Foundation development team since the two projects merged in 2010. It is common to refer to the technology or products as Lucene/Solr or Solr/Lucene.
One advantage of Solr in enterprise projects is that you don't need to write any Java code, although Java itself has to be installed. If you are unsure when to use Solr and when to use Lucene, these answers could help. If you need to build your Solr index from websites, take a look at the open source crawler Apache Nutch before creating your own solution.

To be convinced that Solr is actually used in a lot of enterprise projects, take a look at the impressive list of public projects powered by Solr. If you run into problems, the mailing list or Stack Overflow will help you.

Features


  • Uses the Lucene library for full-text search
  • Faceted navigation
  • Hit highlighting
  • Query language supports structured as well as textual search
  • JSON, XML, PHP, Ruby, Python, XSLT, Velocity and custom Java binary output formats over HTTP
  • HTML administration interface
  • Replication to other Solr servers - enables scaling QPS
  • Distributed Search through Sharding - enables scaling content volume
  • Search results clustering based on Carrot2
  • Extensible through plugins
  • Flexible relevance - boost through function queries
  • Caching - queries, filters, and documents
  • Embeddable in a Java Application
  • Geo-spatial search
  • Automated management of large clusters through ZooKeeper
  • More function queries
  • Field Collapsing 
  • A new auto-suggest component
Stay tuned for the installation and configuration part!!

Thanks!!
Kuldeep
