Solr needs opinions, because the Solr documentation gives you way too many options, and it's hard to navigate the best practices for using Solr. Here are some of my opinions, based on dozens of Solr projects :)
Schema files are a good thing. They are declarative, and not letting them change at runtime prevents all kinds of security issues. Further, the classic schema / solrconfig support all of Solr's functionality and are well documented, with tons of examples online in blog articles and on Stack Overflow. Using the managed schema or the Config API takes a lot of experimentation.
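To give a flavor of how declarative this is, here's a minimal sketch of a classic schema.xml (the field and type names are illustrative, not a recommendation):

```xml
<schema name="example" version="1.6">
  <!-- Fields are declared up front, not created on the fly at runtime -->
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
  <uniqueKey>id</uniqueKey>

  <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
</schema>
```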
Static configurations can also be easily version controlled. As I've learned as a long-time Elasticsearch user, this is one of Solr's advantages. Having an API for changing every underlying config option of your index means finding the code that made a given change is rather time-consuming.
Static configuration is also good separation of concerns: you cleanly separate Solr configuration from your application. Having worked with Elasticsearch, I've seen how clever Elasticsearch libraries that manipulate your index in weird ways (such as [searchkick](https://github.com/ankane/searchkick)) make working directly with Elasticsearch difficult, because you don't know what searchkick has done to your index.
In SolrCloud mode, this means upconfiging your configset to Zookeeper. In standalone mode, it means having the configset set up in the configsets folder.
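Concretely, something like this (collection/config names and paths are placeholders):

```sh
# SolrCloud: push the configset up to Zookeeper
bin/solr zk upconfig -z localhost:2181 -n mycollection -d ./mycollection/conf

# Standalone: drop the configset into Solr's configsets folder
cp -r ./mycollection/conf $SOLR_HOME/configsets/mycollection/conf
```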
Here's a fallacy I see on a lot of search teams: they keep the search engine at arm's length. They prefetch thousands of search results and then run some kind of complicated model or process on top of those. This can lead to a very slow and complex search application, and it forces application code to take on a lot of the search engine's responsibilities (pagination, faceting, highlighting). Don't be shy about pushing your work into Solr!
Solr wraps Lucene, and its biggest strength is the underlying Lucene library. Lucene is so extensible and fast for search and language-like problems that you do yourself a disservice keeping it at arm's length.
Lucene has access to term statistics in a fast index. Many information retrieval models use document frequency or total term frequency, which can be quickly accessed via the inverted index.
Lucene also increasingly uses columnar doc values for fast retrieval of numeric attributes. Indeed, this is how a lot of vector scoring is done.
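As a rough sketch of how cheap both kinds of lookups are from inside a plugin (the field names and term here are hypothetical; it assumes you already have a Lucene `IndexReader` / `LeafReader` in hand, as you would in a custom Solr component):

```java
import java.io.IOException;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.Term;

public class IndexStatsExample {

  /** Term statistics come straight off the inverted index -- no document scan. */
  static void printTermStats(IndexReader reader) throws IOException {
    Term term = new Term("body", "solr");   // hypothetical field and term
    long df = reader.docFreq(term);         // how many docs contain the term
    long ttf = reader.totalTermFreq(term);  // total occurrences across the index
    System.out.println("docFreq=" + df + " totalTermFreq=" + ttf);
  }

  /** Doc values give columnar, per-document access to stored numerics. */
  static void printDocValue(LeafReader leafReader, int docId) throws IOException {
    NumericDocValues popularity = DocValues.getNumeric(leafReader, "popularity");
    if (popularity.advanceExact(docId)) {   // position the iterator on this doc
      System.out.println("popularity=" + popularity.longValue());
    }
  }
}
```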
You shouldn't be shy about pushing work to Solr, but it IS important to push the right work to Solr. You should push the smallest amount of functionality that enables your application. A good plugin:
(a) helps you avoid negative patterns like prefetching thousands of results, or asking the search engine for every term's document frequency, etc. These smell like "I'm recreating the search engine in application logic"
(b) can be configured to solve your problem
(c) fits cleanly into the existing extension points (like analyzers, custom Lucene queries, query parsers, etc)
(d) doesn't implement application logic / decisions in the plugin, unless the application logic is genuinely inseparable from the search engine problem
(e) is a solution to a somewhat generic IR problem, like something you wish were part of Solr more generally
Deciding the right set of responsibilities for the search engine vs. application code can be subjective. Here are some examples of good plugins:

- Replacing a regex that removes a specified suffix (like 'js' from Angularjs). Just removing the suffix is much faster than a regex char replace factory (see the sketch after this list).
- A query parser that quotes specified, unique phrases, such as collocations.
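Here's roughly what that first example looks like as a Lucene TokenFilter (the class name and hard-coded suffix are illustrative; a real version would take the suffix as an argument to a TokenFilterFactory so it could be registered in the schema):

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Strips a fixed suffix off each token: "angularjs" -> "angular". */
public final class StripSuffixFilter extends TokenFilter {
  private static final String SUFFIX = "js"; // hypothetical; make configurable
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public StripSuffixFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    int len = termAtt.length();
    int sufLen = SUFFIX.length();
    if (len > sufLen) {
      boolean endsWith = true;
      for (int i = 0; i < sufLen; i++) {
        if (termAtt.charAt(len - sufLen + i) != SUFFIX.charAt(i)) {
          endsWith = false;
          break;
        }
      }
      if (endsWith) {
        termAtt.setLength(len - sufLen); // truncate in place -- no regex, no copying
      }
    }
    return true;
  }
}
```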
Let's say you encountered a problem like code search and wanted to implement it in Solr. How should you go about thinking of a 'plugin' for this problem? You could:

- Go "all Solr" and build programming language parsing, etc. into Solr itself.
- Go "Solr at arm's length" and barely use Solr, perhaps building a completely different index yourself, maybe reserving Solr just for natural language problems.
- Analyze the problem of searching programming languages and build some supporting Solr functionality. For example, trigram indexing might be a need that comes out of code search, and you could build Solr functionality that analyzes, indexes, and scores using trigram indexes more efficiently for your use case (see the sketch below).
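The analysis side of that third approach could start from Solr's stock pieces. A minimal sketch of a trigram field type (the name is illustrative, and the custom scoring work is where the real plugin would live):

```xml
<fieldType name="code_trigram" class="solr.TextField">
  <analyzer>
    <!-- Emit overlapping 3-character grams: "for(" -> "for", "or(" -->
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="3"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```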
Solr has myriad options for where to store plugins: several places on the node's filesystem, Zookeeper's blob store, etc.
From a security perspective, uploading runtime libs just seems like a bad idea. I mean, look at all the security requirements involved.
With modern deployment infrastructure, it's not hard to create a Solr container that places a plugin in the right spot.
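A minimal sketch, assuming the official solr image (the jar name and target directory are illustrative; you'd point solrconfig.xml's `<lib dir="..."/>` directive at that directory):

```dockerfile
FROM solr:9

# Bake the plugin into the image at build time rather than uploading it at runtime
COPY myplugin.jar /opt/solr/lib/myplugin.jar
```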
Zookeeper's blob store has also had issues dealing with plugins.
Solr can be the Wild West. An Apache project like this is best suited for active organizations that can deal with warts but are comfortable getting under the hood. Don't expect a consistent user experience, but DO expect pluggability and power.
Solr doesn't have a single vendor like Elastic. The upside is any organization can be as core to the community as any other. The downside is it's a confederation of lots of voices and opinions, for better or worse.
If you want to passively consume something with solid opinions, think about whether you should use Elasticsearch or another technology.
Think carefully about what your search document should be (see Relevant Search, chapter 5). Search engines are flat; the nested document features don't work well, so think hard about whether you truly need them.
Solr is a very "bazaar" project (pun intended). It can be a fun, organic free-for-all where new ideas come in, but often in half-completed form, not ready for prime time. Don't expect new Solr features to be production-ready :)