Apache Solr vs Elasticsearch: The Feature Smackdown

API

Feature Solr 5.3.0 ElasticSearch 2.0
Format XML,CSV,JSON JSON
HTTP REST API
Binary API SolrJ TransportClient, Thrift (through a plugin)
JMX support ES specific stats are exposed through the REST API
Official client libraries Java Java, Groovy, PHP, Ruby, Perl, Python, .NET, Javascript (official list of clients)
Community client libraries PHP, Ruby, Perl, Scala, Python, .NET, Javascript, Go, Erlang, Clojure Clojure, Cold Fusion, Erlang, Go, Groovy, Haskell, Java, JavaScript, .NET, OCaml, Perl, PHP, Python, R, Ruby, Scala, Smalltalk, Vert.x (complete list)
3rd-party product integration (open-source) Drupal, Magento, Django, ColdFusion, WordPress, OpenCMS, Plone, Typo3, ez Publish, Symfony2, Riak (via Yokozuna) Drupal, Django, Symfony2, WordPress, CouchBase
3rd-party product integration (commercial) DataStax Enterprise Search, Cloudera Search, Hortonworks Data Platform, MapR SearchBlox, Hortonworks Data Platform, MapR, etc. (complete list)
Output JSON, XML, PHP, Python, Ruby, CSV, Velocity, XSLT, native Java JSON, XML/HTML (via plugin)

Infrastructure

Feature Solr 5.3.0 ElasticSearch 2.0
Master-slave replication Only in non-SolrCloud. In SolrCloud, behaves identically to ES. Not an issue because shards are replicated across nodes.
Integrated snapshot and restore Filesystem Filesystem, AWS Cloud Plugin for S3 repositories, HDFS Plugin for Hadoop environments, Azure Cloud Plugin for Azure storage repositories

Indexing

Feature Solr 5.3.0 ElasticSearch 2.0
Data Import DataImportHandler – JDBC, CSV, XML, Tika, URL, Flat File [DEPRECATED in 2.x] Rivers modules – ActiveMQ, Amazon SQS, CouchDB, Dropbox, DynamoDB, FileSystem, Git, GitHub, Hazelcast, JDBC, JMS, Kafka, LDAP, MongoDB, neo4j, OAI, RabbitMQ, Redis, RSS, Sofa, Solr, St9, Subversion, Twitter, Wikipedia
ID field for updates and deduplication
DocValues
Partial Doc Updates with stored fields with _source field
Custom Analyzers and Tokenizers
Per-field analyzer chain
Per-doc/query analyzer chain
Synonyms Supports Solr and Wordnet synonym format
Multiple indexes
Near-Realtime Search/Indexing
Complex documents
Schemaless 4.4+
Multiple document types per schema One set of fields per schema, one schema per core
Online schema changes Schemaless mode or via dynamic fields. Only backward-compatible changes.
Apache Tika integration
Dynamic fields
Field copying via multi-fields
Hash-based deduplication Murmur plugin or ER plugin

Searching

Feature Solr 5.3.0 ElasticSearch 2.0
Lucene Query parsing
Structured Query DSL Need to programmatically create queries if going beyond Lucene query syntax.
Span queries via SOLR-2703
Spatial/geo search
Multi-point spatial search
Faceting Top N term accuracy can be controlled with shard_size
Advanced Faceting New JSON faceting API blog post
Geo-distance Faceting
Pivot Facets
More Like This
Boosting by functions
Boosting using scripting languages
Push Queries JIRA issue Percolation. Distributed percolation supported in 1.0
Field collapsing/Results grouping
Spellcheck Suggest API
Autocomplete
Query elevation workaround
Joins Joined index has to be single-shard and replicated across all nodes. via has_child and top_children queries
Resultset Scrolling New to 4.7.0 via scan search type
Filter queries also supports filtering by native scripts
Filter execution order local params and cache property
Alternative QueryParsers DisMax, eDisMax query_string, dis_max, match, multi_match etc
Negative boosting but awkward. Involves positively boosting the inverse set of negatively-boosted documents.
Search across multiple indexes it can search across multiple compatible collections
Result highlighting
Custom Similarity
Searcher warming on index reload Warmers API
Term Vectors API

Customizability

Feature Solr 5.3.0 ElasticSearch 2.0
Pluggable API endpoints
Pluggable search workflow via SearchComponents
Pluggable update workflow
Pluggable Analyzers/Tokenizers
Pluggable Field Types
Pluggable Function queries
Pluggable scoring scripts
Pluggable hashing
Pluggable webapps site plugin
Automated plugin installation Installable from GitHub, maven, sonatype or elasticsearch.org

 

Full article

System Design Interview Prep Material

System design is a very broad topic. Even a software engineer with many years of working experience at a top IT company may not be an expert on system design. If you want to become an expert, you need to read many books and articles and solve real large-scale system design problems. This repository only teaches you how to handle the system design interview with a systematic approach in a short time. You can dive into each topic if you have time. You are, of course, welcome to add your thoughts!

System Design Interview Tips:

  • Clarify the constraints and identify the use cases: Spend a few minutes questioning the interviewer and agreeing on the scope of the system. Make sure you uncover the requirements the interviewer didn’t tell you about at the beginning. Use cases indicate the main functions of the system, and constraints describe its scale, such as requests per second, request types, data written per second, and data read per second.
  • High-level architecture design: Sketch the important components and the connections between them, but don’t go into too much detail. Usually, a scalable system includes a web server (load balancer), services (service partitions), and a database (master/slave database cluster plus cache).
  • Component design: Write the specific APIs for each component. You may need to finish the detailed object-oriented design (OOD) for a particular function, and you may also need to design the database schema.
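The scale constraints mentioned in the first tip are usually turned into per-second rates with simple arithmetic. The sketch below shows one way to do that in Java. The class name, method names, and all traffic figures are hypothetical and are used only for illustration; real numbers come from the interviewer.

```java
// Back-of-the-envelope conversion from daily volumes to per-second rates.
// CapacityEstimate and every figure in main() are hypothetical.
class CapacityEstimate {
    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    // Average events per second for a given daily volume.
    static double perSecond(long perDay) {
        return (double) perDay / SECONDS_PER_DAY;
    }

    // Average bytes written per second, given daily writes and payload size.
    static double bytesPerSecond(long writesPerDay, long bytesPerWrite) {
        return perSecond(writesPerDay) * bytesPerWrite;
    }

    public static void main(String[] args) {
        long readsPerDay = 86_400_000L;   // assumed daily read requests
        long writesPerDay = 8_640_000L;   // assumed daily write requests
        long bytesPerWrite = 1024L;       // assumed payload size per write

        System.out.println("reads/s:  " + perSecond(readsPerDay));
        System.out.println("writes/s: " + perSecond(writesPerDay));
        System.out.println("bytes/s:  " + bytesPerSecond(writesPerDay, bytesPerWrite));
    }
}
```

These are averages; peak traffic is typically higher, so it is worth asking the interviewer about peak-to-average ratios as well.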

Basic Knowledge about System Design:

Here are some articles about system design related topics.

Of course, if you want to dive into system-related topics, here is a good reading list about services engineering, and a good collection of material about distributed systems.

Company Engineering Blogs:

If you are going to have an onsite with a company, you should read their engineering blog.

Products and Systems:

The following papers/articles/slides can help you to understand the general design idea of different real products and systems.

Hot Questions and Reference:

There are some good references for each question. The references here are slides and articles.
Design a CDN network Reference:

Design a Google document system Reference:

Design a random ID generation system Reference:

Design a key-value database Reference:

Design the Facebook news feed function Reference:

Design the Facebook timeline function Reference:

Design a function to return the top k requests during a past time interval Reference:

Design an online multiplayer card game Reference:

Design a graph search function Reference:

Design a picture sharing system Reference:

Design a search engine Reference:

Design a recommendation system Reference:

Design a tinyurl system Reference:

Design a garbage collection system Reference:

Design a scalable web crawling system Reference:

Design the Facebook chat function Reference:

Design a trending topic system Reference:

Design a cache system Reference:

Good Books:

Object Oriented Design:

Tips for OOD Interview

Clarify the scenario and write out the use cases: A use case is a description of a sequence of events that, taken together, lead to a system doing something useful. Ask who is going to use the system and how they are going to use it. The system may be very simple or very complicated. Note any special system requirements, such as multi-threading or whether the system is read- or write-oriented.
Define objects: Map identities to classes: one class per scenario, and one class for each core object in that scenario. Consider the relationships among classes: a certain class must have a unique instance, one object has many other objects (composition), one object is another object (inheritance). Identify the attributes of each class: turn nouns into variables and actions into methods. Use design patterns so that the design can be reused in multiple applications.
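The relationships described above (has-many and is-a) and the noun-to-attribute, action-to-method mapping can be sketched in a few lines of Java. The parking-lot scenario and all class names here are hypothetical, picked only because they are a common interview exercise.

```java
import java.util.ArrayList;
import java.util.List;

// All class names are hypothetical and only illustrate the relationships.
abstract class Vehicle {
    private final String plate;            // noun -> attribute

    Vehicle(String plate) { this.plate = plate; }

    String getPlate() { return plate; }

    abstract int spotsNeeded();
}

class Car extends Vehicle {                // inheritance: a Car is a Vehicle
    Car(String plate) { super(plate); }

    @Override
    int spotsNeeded() { return 1; }
}

class Bus extends Vehicle {                // inheritance: a Bus is a Vehicle
    Bus(String plate) { super(plate); }

    @Override
    int spotsNeeded() { return 3; }
}

class ParkingLot {                         // has-many: a ParkingLot holds many Vehicles
    private final int capacity;
    private final List<Vehicle> parked = new ArrayList<>();
    private int usedSpots;

    ParkingLot(int capacity) { this.capacity = capacity; }

    // action -> method: returns false when the vehicle does not fit
    boolean park(Vehicle vehicle) {
        if (usedSpots + vehicle.spotsNeeded() > capacity) {
            return false;
        }
        parked.add(vehicle);
        usedSpots += vehicle.spotsNeeded();
        return true;
    }

    int freeSpots() { return capacity - usedSpots; }

    public static void main(String[] args) {
        ParkingLot lot = new ParkingLot(4);
        System.out.println(lot.park(new Car("CAR-1")) + ", free: " + lot.freeSpots());
        System.out.println(lot.park(new Bus("BUS-1")) + ", free: " + lot.freeSpots());
        System.out.println(lot.park(new Car("CAR-2")) + ", free: " + lot.freeSpots());
    }
}
```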

Useful Websites

Original Source

Clustering J2EE Applications

All mission-critical applications need to have high availability and scalability built in. Any incident or outage in such applications can have huge implications, such as business loss or legal issues for the company. Clustering helps with application scalability (through load balancing) as well as high availability (through failover).


Never assume that a stand-alone application can be moved transparently to a clustered environment.
Most of the leading application server vendors, like IBM (WebSphere) and Oracle (BEA WebLogic), support clustering their servers and provide built-in load balancing among them.




It might seem that clustering is a deployment or server-related activity that we, as ordinary Java developers, need not be concerned with. This is not true. There are various aspects that Java developers need to take care of during design, coding and testing in order to create cluster-aware, scalable J2EE applications.


Figure 1: Visualize Clustering



Application Design and Coding Considerations
Below are some basic points to consider while developing and maintaining applications – which will ultimately be deployed in a clustered environment in production.


User Session
The HttpSession object is responsible for maintaining the user session across requests. The session object is tracked by a unique session id.


To provide transparent failover in a cluster, application servers need to ‘replicate’ session state on multiple servers. As a result, depending on the vendor implementation, the session id might vary across client requests.

  • The ‘session id’ should not be used in an application as a substitute for the user id, and it should not be used to mark any transaction, because it may not remain the same throughout the user session. For such purposes, the user id is a more appropriate choice.
  • All objects stored in the session must be made Serializable. If any Java object, say a Hashtable or any user-defined object, is to be kept in the session, then that object must implement the java.io.Serializable interface.
  • Object serialization and de-serialization are very costly in performance, especially with the database persistence approach. In a clustered environment, storing large or numerous objects in the session should be avoided.
  • Updating data in HttpSession – When updating any session attribute, call HttpSession.setAttribute() explicitly. This call makes sure that the updated object is reflected in all session copies maintained on different servers by the application server.
  • Setting and getting values in HttpSession – The deprecated APIs getValue(), putValue() and removeValue() should not be used. Use getAttribute(), setAttribute() and removeAttribute() instead.
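The serialization requirement above can be checked in a unit test before the application ever reaches a cluster. The following is a minimal sketch using only java.io; ShoppingCart and SessionReplicationCheck are hypothetical names, and the round trip only approximates what an application server does when it copies session state, since the actual replication mechanism is vendor-specific.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical session attribute; the name and fields are illustrative only.
class ShoppingCart implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String userId;   // keyed by user id, not by session id
    private final List<String> items = new ArrayList<>();

    ShoppingCart(String userId) { this.userId = userId; }

    void add(String item) { items.add(item); }

    String getUserId() { return userId; }

    List<String> getItems() { return items; }
}

class SessionReplicationCheck {
    // Serializes and deserializes a value, roughly as a container might
    // when it copies session state to another server.
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            // A NotSerializableException from a nested field ends up here.
            throw new IllegalStateException("Value is not safely serializable", e);
        }
    }

    public static void main(String[] args) {
        ShoppingCart cart = new ShoppingCart("user-42");
        cart.add("book");
        ShoppingCart copy = roundTrip(cart);
        System.out.println(copy.getUserId() + " " + copy.getItems());
    }
}
```

In a servlet, the same object would be stored with session.setAttribute("cart", cart), and setAttribute() would be called again after every change to the cart so that the container knows the attribute needs to be replicated.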

Static Variables and Singletons

  • Do not use static variables for any logic or data that is application-wide. A static variable’s value is local to one JVM, so it is not visible to the JVMs on other machines, and the next client request can go to any server.
  • Avoid Singleton classes that encapsulate logic for the whole application (across the JVMs in a cluster). A singleton works within one JVM on a single machine, but in a cluster each JVM creates its own instance, so a single application-wide instance is not possible.

Data Caching

  • An application may cache data in memory from secondary storage such as a database. Reference data and data that doesn’t change frequently are much faster to access from memory.
  • Synchronize Cache – In a clustered environment, each JVM instance maintains its own copy of the cache, which must be synchronized with the others to provide a consistent state across all server instances. Sometimes this synchronization costs more in performance than not caching at all.
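When full cache synchronization is too expensive, one common alternative (not described in the text above, and named here as a substitute technique) is to let each JVM expire its entries after a fixed time, which bounds how stale any node can be without cross-node messaging. The sketch below assumes that slightly stale reference data is acceptable; ExpiringCache is a hypothetical class, and the loader stands in for a database lookup.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical per-JVM cache that reloads an entry once it is older than the TTL.
class ExpiringCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long loadedAtMillis;

        Entry(T value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // stands in for a database lookup
    private final long ttlMillis;
    private final LongSupplier clock;      // injectable so the cache can be tested

    ExpiringCache(Function<K, V> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    V get(K key) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.get(key);
        if (entry == null || now - entry.loadedAtMillis >= ttlMillis) {
            // Missing or expired: reload from the backing store.
            entry = new Entry<>(loader.apply(key), now);
            entries.put(key, entry);
        }
        return entry.value;
    }

    public static void main(String[] args) {
        AtomicLong now = new AtomicLong(0);
        ExpiringCache<String, String> countries =
            new ExpiringCache<>(code -> "loaded " + code + " at " + now.get(), 1000, now::get);

        System.out.println(countries.get("US"));   // first call loads
        now.set(500);
        System.out.println(countries.get("US"));   // still cached
        now.set(1000);
        System.out.println(countries.get("US"));   // TTL reached, reloads
    }
}
```

Two concurrent callers may both reload the same expired key; that is harmless for read-only reference data but would need tightening for anything expensive to load.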

File Access


Although not recommended by the J2EE specification, external file I/O operations are still used for various purposes.

  • Since components will be deployed across machines in a cluster, the file system must be accessible in a uniform way to all the machines.
  • To achieve this, either common shared storage (like a SAN) could be mounted on every machine, or
  • the files could be replicated on all the machines in the cluster.
  • Alternatively, instead of using a file, maintain the required information in the database.
  • Application logging – If the application logs messages to a file, evaluate whether a single file per cluster or a separate file per server is required.

Other Services

  • Some functionalities and services make sense in stand-alone mode only, for example timers/schedulers and email notification services. These services are triggered by events instead of requests and should be executed only once, which makes them hard to migrate to a cluster environment.
  • Some products provide support for such services. For example, JBoss uses a “clustered singleton facility” to coordinate all the instances and guarantee that these services are executed once and only once. The Quartz scheduler (commonly used through Spring) also provides clustering support.
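The idea behind such facilities can be illustrated with a shared lock: every node's timer fires, but only the node that claims the run executes the task. This is a simplified sketch, not how JBoss or Quartz implement it. ClusterLock, InMemoryClusterLock and RunOnceJob are hypothetical names, and the in-memory map stands in for storage that all nodes share, such as a database table with a unique key.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical abstraction over storage that every node in the cluster can reach.
interface ClusterLock {
    // Returns true only for the first node that claims the given run.
    boolean tryAcquire(String jobRunKey, String nodeId);
}

// Stand-in for shared storage; a real cluster would use a database or similar.
class InMemoryClusterLock implements ClusterLock {
    private final Map<String, String> owners = new ConcurrentHashMap<>();

    @Override
    public boolean tryAcquire(String jobRunKey, String nodeId) {
        return owners.putIfAbsent(jobRunKey, nodeId) == null;
    }
}

class RunOnceJob {
    private final ClusterLock lock;
    private final String nodeId;
    private final Runnable task;

    RunOnceJob(ClusterLock lock, String nodeId, Runnable task) {
        this.lock = lock;
        this.nodeId = nodeId;
        this.task = task;
    }

    // Called by the local timer on every node; returns true if this node ran the task.
    boolean fire(String jobRunKey) {
        if (!lock.tryAcquire(jobRunKey, nodeId)) {
            return false;   // another node already claimed this run
        }
        task.run();
        return true;
    }

    public static void main(String[] args) {
        ClusterLock shared = new InMemoryClusterLock();
        AtomicInteger emailsSent = new AtomicInteger();
        RunOnceJob nodeA = new RunOnceJob(shared, "node-a", emailsSent::incrementAndGet);
        RunOnceJob nodeB = new RunOnceJob(shared, "node-b", emailsSent::incrementAndGet);

        // Both nodes fire for the same scheduled run; only one executes.
        nodeA.fire("daily-report/2024-01-01");
        nodeB.fire("daily-report/2024-01-01");
        System.out.println("emails sent: " + emailsSent.get());
    }
}
```

This gives at-most-once execution per run key: if the claiming node crashes before finishing, the task is not retried, so a production design would also need lock expiry or completion tracking.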

Third Party Software


An enterprise application usually relies on other software and tools such as rule engines, O/R mappers, third-party cache managers, messaging software and logging components. Some of this third-party software may not support clustering. This should be checked at design time, before these tools are selected.


Application Testing Considerations


Test plans should be created to test the application in a cluster environment. Test cases should cover the cluster objectives, which are load balancing, scalability and failover. Running these test cases may require creating test stubs. Test cases should also cover any application-specific and cluster-sensitive services such as file access, static data caching, and singletons.


Brief Terminology


Scalability – Scalability refers to a system’s ability to support a fast-growing number of users. One way to achieve it is to add resources (memory, CPU or hard disk) to a single server. Clustering instead allows a group of servers to share the heavy tasks and operate logically as a single server.
High Availability – The single-server approach to scalability (adding more resources) is not robust because the server is a single point of failure. Mission-critical services must be accessible with reasonable, predictable response times at any time. Clustering achieves this kind of high availability by providing redundant servers that take over when one server fails to provide service.
Load balancing – Load balancing is one of the key technologies behind clustering. It obtains high availability and better performance by dispatching incoming requests to different servers. In addition, the load balancer should perform other important tasks such as “session stickiness”, which keeps a user session entirely on one server, and “health checks”, which prevent requests from being dispatched to a failing server.
Fault Tolerance – Highly available data is not necessarily strictly correct data. A fault-tolerant service always guarantees strictly correct behavior despite a certain number of faults.
Failover – Failover is a key technology behind clustering for achieving fault tolerance. When the original node fails, processing continues on another node chosen from the cluster.
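Load balancing, health checks and failover as defined above can be combined in a small round-robin dispatcher. The sketch below is illustrative only: RoundRobinBalancer and the server names are hypothetical, health status is set manually instead of by a real probe, and session stickiness is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin dispatcher that skips servers marked as down.
class RoundRobinBalancer {
    private final List<String> servers;
    private final Set<String> down = ConcurrentHashMap.newKeySet();
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinBalancer(List<String> servers) {
        this.servers = new ArrayList<>(servers);
    }

    // In a real system these would be driven by periodic health checks.
    void markDown(String server) { down.add(server); }

    void markUp(String server) { down.remove(server); }

    // Picks the next healthy server; fails over past unhealthy ones.
    String choose() {
        for (int attempt = 0; attempt < servers.size(); attempt++) {
            int index = Math.floorMod(next.getAndIncrement(), servers.size());
            String candidate = servers.get(index);
            if (!down.contains(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("No healthy server available");
    }

    public static void main(String[] args) {
        RoundRobinBalancer balancer =
            new RoundRobinBalancer(Arrays.asList("app-1", "app-2", "app-3"));

        System.out.println(balancer.choose());
        System.out.println(balancer.choose());

        balancer.markDown("app-3");   // simulate a failed health check
        System.out.println(balancer.choose());
    }
}
```

Math.floorMod keeps the index valid even if the counter eventually overflows to a negative value.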

