Setting Up Solr On Nginx With Let's Encrypt

I want to enable search on my blog, so I started looking into different solutions. I began with Elasticsearch but ran into too many setup issues, so I paused that and moved on to Solr. I fully intend to work with Elasticsearch and Kibana soon.

Setting up Solr comes with its own set of challenges: some were blatantly obvious mistakes on my part, while others required a bit of digging.

I configured my solution on a virtual server running Fedora 25.

Install Solr

Solr is an open source enterprise search platform, written in Java, from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document handling.

Missing dependency

The Solr installation guide is quite straightforward, but I hit an error because of a missing dependency.

It wanted the lsof package, which "lists open files": it reports all open files and the processes that opened them.

sudo dnf install lsof

Install package

I downloaded the package referenced in the installation guide using curl:

curl http://mirror.za.web4africa.net/apache/lucene/solr/7.6.0/solr-7.6.0.tgz -o solr-7.6.0.tgz

tar xzf solr-7.6.0.tgz solr-7.6.0/bin/install_solr_service.sh --strip-components=2

./install_solr_service.sh solr-7.6.0.tgz

The installation script creates a solr user, which owns /opt/solr and /var/solr. Once the script completes, Solr is installed as a service and running in the background on port 8983.

sudo service solr status
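As an extra sanity check, the listening port can be probed from code rather than through the service manager. A minimal sketch (the port_open helper is my own, assuming Solr's default port 8983):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# After a successful install, this should report True:
# print(port_open("localhost", 8983))
```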

Open firewall

I temporarily opened port 8983 while working on the project. The port had to be opened both in my server's firewall and in my hosting provider's firewall, via their admin interface.

First, I need to ensure that the firewall is enabled on my server:

sudo firewall-cmd --state

I need to add the Solr port to the firewall:

sudo firewall-cmd --zone=public --permanent --add-port=8983/tcp

Once I have made a change to the firewall, I need to reload it for the change to take effect:

sudo firewall-cmd --reload

Verify that my changes took effect:

sudo firewall-cmd --zone=public --list-ports

Test connection

Open a browser and browse to http://localhost:8983/solr or curl http://localhost:8983/solr. Test it remotely by accessing it with your public IP address.

Reflection

My connections were timing out even though I had double-checked my configuration. After some Googling, the obvious answer hit me: the firewall. I had made the changes but forgot to reload the firewall for them to take effect.

Add a Solr core

A Solr Core is a running instance of a Lucene index that contains all the Solr configuration files required to use it. We need to create a Solr Core to perform operations like indexing and analyzing. A Solr application may contain one or multiple cores.

Using the solr user created during the installation, I create a new core.

su solr
cd /opt/solr/bin
./solr create -c collection_name

The core is now visible in the web interface, ready to index some data: http://localhost:8983/solr/#/~cores/collection_name

Side note: you can delete a core using the delete command: ./solr delete -c collection_name

Curve ball: I received a warning about my ulimit settings

*** [WARN] *** Your open file limit is currently 1024.
 It should be set to 65000 to avoid operational disruption.
 If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh

I set it to the suggested amount. Note that ulimit -n only changes the limit for the current shell session; to make the change permanent, the limit typically needs to be raised in /etc/security/limits.conf (or via the service manager for the Solr service).

ulimit -a
ulimit -n 65000
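To confirm what limit a process actually sees, rather than trusting the shell, Python's standard resource module can read it. A small Unix-only sketch:

```python
import resource

# RLIMIT_NOFILE is the maximum number of open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)

# Solr wants the soft limit at 65000 or higher
if soft < 65000:
    print("Warning: soft limit is below Solr's recommended 65000")
```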

Reflection

Don't be scared of creating, deleting and recreating things, especially in the beginning, while learning. Break it, fix it, understand it, learn it.

Scrape Solr

Solr needed some data, and I found a really useful Python tutorial for creating a crawler for my blog. The crawler runs on my Fedora server, which is not the server hosting the blog itself.

Missing dependencies

While setting up, I came across the following missing dependencies:

sudo dnf install python-devel (the libraries and header files needed for Python development)

pip install twisted (an extensible framework for Python programming, with a special focus on event-based network programming and multiprotocol integration)

Install Scrapy

Prepare to run Scrapy in a Python virtualenv:

PROJECT_DIR=~/projects/scrapy
mkdir $PROJECT_DIR
cd $PROJECT_DIR
virtualenv scrapyenv
source scrapyenv/bin/activate
pip install scrapy

Create a Scrapy application:

scrapy startproject blog
cd blog

Edit blog/items.py to indicate what needs to be indexed:

from scrapy.item import Item, Field

class BlogItem(Item):
    title = Field()
    url = Field()
    text = Field()

Create a spider to crawl my blog:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from blog.items import BlogItem
from scrapy.item import Item
from urlparse import urljoin
import re

class BlogSpider(BaseSpider):
    name = 'blog'
    allowed_domains = ['curiousprogrammer.io']
    start_urls = ['https://curiousprogrammer.io/']

    seen = set()

    def parse(self, response):
        if response.url in self.seen:
            self.log('already seen  %s' % response.url)
        else:
            self.log('parsing  %s' % response.url)
            self.seen.add(response.url)

        hxs = HtmlXPathSelector(response)
        if re.match(r'https://curiousprogrammer.io/blog/\S.*$', response.url):
            item = BlogItem()
            item['title'] = hxs.select('//title/text()').extract()
            item['url'] = response.url
            item['text'] = hxs.select('//div[@id="post"]//child::node()/text()').extract()
            self.log("yielding item " + response.url)
            yield item

        for url in hxs.select('//a/@href').extract():
            url = urljoin(response.url, url)
            if url not in self.seen and not re.search(r'\.(pdf|zip|jar|gif|png|jpg)$', url):
                self.log("yielding request " + url)
                yield Request(url, callback=self.parse)
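The spider's de-duplication and extension filtering is the part worth internalising. A standalone sketch of that decision logic (the should_crawl helper is my own naming, not from the tutorial):

```python
import re

# Skip binary assets the spider should never follow
SKIP_EXTENSIONS = re.compile(r'\.(pdf|zip|jar|gif|png|jpg)$')

def should_crawl(url, seen):
    """Crawl a URL only if it is new and not a binary asset."""
    return url not in seen and not SKIP_EXTENSIONS.search(url)

seen = {'https://curiousprogrammer.io/'}
print(should_crawl('https://curiousprogrammer.io/blog/post-1', seen))  # True
print(should_crawl('https://curiousprogrammer.io/', seen))             # False: already seen
print(should_crawl('https://curiousprogrammer.io/cv.pdf', seen))       # False: binary asset
```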

Crawl the blog. An items.json file is generated. Note that each crawl appends to the same file; it is not overwritten.

scrapy crawl blog -o items.json -t json

The tutorial showcases an indexing Python script using pysolr, but it didn't work for me, so I indexed the data directly through the Solr API using curl:

curl "http://localhost:8983/solr/collection_name/update/json/docs?commit=true" -H "Content-type:application/json" --data-binary @items.json
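Before posting items.json to Solr, it's worth checking that every scraped item carries the fields declared in items.py. A small validation sketch (the validate_items helper and the sample data are mine, standing in for a real crawl output):

```python
import json

# The fields declared in blog/items.py
REQUIRED_FIELDS = {'title', 'url', 'text'}

def validate_items(raw_json):
    """Return (total item count, list of items missing required fields)."""
    items = json.loads(raw_json)
    bad = [item for item in items if not REQUIRED_FIELDS <= set(item)]
    return len(items), bad

# A minimal stand-in for the items.json produced by the crawl:
sample = json.dumps([
    {"title": ["A post"],
     "url": "https://curiousprogrammer.io/blog/a-post",
     "text": ["body text"]},
])
total, bad = validate_items(sample)
print(total, len(bad))  # 1 0
```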

I set up a daily cron job to index the data. First, create the script via vim ~/projects/scrapy/blog/crawl-and-index:

#!/bin/bash

echo "Delete entries from Solr"
curl http://localhost:8983/solr/collection_name/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
curl http://localhost:8983/solr/collection_name/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'

echo "Remove existing scraped database"
cd /root/projects/scrapy/blog
rm items.json

echo "Enter the scrapy virtualenv"
source ../scrapyenv/bin/activate

echo "Start crawling your blog..."
scrapy crawl blog -o items.json -t json

echo "Index Solr with crawled database"
curl "http://localhost:8983/solr/collection_name/update/json/docs?commit=true" -H "Content-type:application/json" --data-binary @items.json

echo "Bye!"

Then register the script with crontab -e and schedule it to run daily:

crontab -e
@daily ~/projects/scrapy/blog/crawl-and-index

Set up your hosting environment

I no longer wanted to access the Solr API publicly via its port. To achieve this, I had to configure a reverse proxy. A great benefit of this approach is that it enables SSL. To get SSL working, I first needed a domain name.

Get a domain name

You can get a free domain name at Freenom or from other similar services.

To release your inner geek, you can update your host's name:

sudo hostname new_host_name

sudo vim /etc/hostname

Update your /etc/hosts to look something like this:

127.0.0.1    new_host_name

Add the nameservers that your domain registrar provides:

vim /etc/resolv.conf

# Generated by NetworkManager
search new_host_name
nameserver <IP>
nameserver <IP>
nameserver 8.8.8.8 #Google

Create a webserver with Nginx

NGINX is a high-performance HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. NGINX is known for its high performance, stability, rich feature set, simple configuration, and low resource consumption.

sudo dnf install nginx
# Start NGINX when system boots
sudo systemctl enable nginx
# Start NGINX
sudo systemctl start nginx
# Check NGINX Status
sudo systemctl status nginx

Create an SSL Certificate with Let's Encrypt

Install certbot, configure it, and then create a cron job that attempts renewal daily. Let's Encrypt certificates are valid for 90 days, and certbot renew only replaces certificates that are close to expiry, so a daily check is safe:

sudo dnf install certbot-nginx
crontab -e
0 12 * * * /usr/bin/certbot renew --quiet

Redirects

Configure server on port 80 to redirect all traffic to SSL:

server {
    listen       80 default_server;
    listen       [::]:80 default_server;
    server_name  *.example.com;
    root         /usr/share/nginx/html;

    # Load configuration files for the default server block.
    include /etc/nginx/default.d/*.conf;

    if ($scheme != "https") {
        return 301 https://$host$request_uri;
    }
}
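The if block boils down to a simple rule: any non-HTTPS request is answered with a 301 to the same host and URI. A sketch of that mapping as a pure function (illustrative only, not how Nginx works internally):

```python
def redirect(scheme, host, request_uri):
    """Mirror the nginx rule: non-https requests get a 301 to https."""
    if scheme != 'https':
        return 301, 'https://' + host + request_uri
    return None  # request is served as-is

print(redirect('http', 'example.com', '/solr/admin'))
# (301, 'https://example.com/solr/admin')
print(redirect('https', 'example.com', '/solr/admin'))
# None
```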

Configure the server block on port 443:

server {
    listen       443 ssl http2 default_server;
    listen       [::]:443 ssl http2 default_server;
    server_name  *.example.com;
    root         /usr/share/nginx/html;

    ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem; # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem; # managed by Certbot
    include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot

    # Load configuration files for the default server block.
    include /etc/nginx/default.d/*.conf;

    location / {
    }
}

Then add hosted applications underneath the last location statement. In this case I am directing all incoming /solr traffic to localhost:8983 so that I can run Solr on HTTPS.

# This is our Solr instance
# We will access it through SSL instead of using the port directly
location /solr {
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_pass "http://localhost:8983";
}

error_page 404 /404.html;
location = /404.html {
}

error_page 500 502 503 504 /50x.html;
location = /50x.html {
}

Test this in a browser or with a curl command. If connections time out, double-check the firewall rules, this time making sure port 443 is open on both the host and the hosting provider, and that the firewall has been reloaded.

Dropping access to ports

I no longer need to expose Solr's port so I can drop it from the firewall.

sudo firewall-cmd --zone=public --permanent --remove-port=8983/tcp
sudo firewall-cmd --reload
sudo firewall-cmd --zone=public --list-ports

Consume the API

When you are ready to consume the API from a JavaScript application, you will most likely encounter a Cross-Origin Resource Sharing error when making calls to the remote server, because neither the reverse proxy nor the Solr application has been explicitly told to serve resources to your origin.

Cross-Origin Resource Sharing (CORS) is a mechanism that uses additional HTTP headers to tell a browser to let a web application running at one origin (domain) have permission to access selected resources from a server at a different origin. A web application makes a cross-origin HTTP request when it requests a resource that has a different origin (domain, protocol, and port) than its own origin.

The error goes along the lines of Access to XMLHttpRequest at 'https://example.com/solr/collection_name/select' from origin 'http://localhost:8081' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.

Reverse proxy

To enable CORS on the reverse proxy we need to edit the /etc/nginx/nginx.conf file. In this example I configure Nginx CORS to support the reverse proxied Solr API.

location /solr {
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_pass "http://localhost:8983";

    set $cors '';
    if ($http_origin ~ '^https?://(localhost|www\.curiousprogrammer\.io)') {
            set $cors 'true';
    }

    if ($cors = 'true') {
            add_header 'Access-Control-Allow-Origin' "$http_origin" always;
            add_header 'Access-Control-Allow-Credentials' 'true' always;
            add_header 'Access-Control-Allow-Methods' 'GET, POST, PUT, DELETE, OPTIONS' always;
            add_header 'Access-Control-Allow-Headers' 'Accept,Authorization,Cache-Control,Content-Type,DNT,If-Modified-Since,Keep-Alive,Origin,User-Agent,X-Requested-With' always;
            add_header 'Access-Control-Expose-Headers' 'Authorization' always;
    }

    if ($request_method = 'OPTIONS') {
            # Tell client that this pre-flight info is valid for 20 days
            add_header 'Access-Control-Max-Age' 1728000;
            add_header 'Content-Type' 'text/plain charset=UTF-8';
            add_header 'Content-Length' 0;
            return 204;
    }
}
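The heart of this config is the origin whitelist: the CORS headers are only added when the Origin header matches the pattern. The same check sketched in Python, mirroring the config's regex:

```python
import re

# Same whitelist as the nginx $cors check
ALLOWED = re.compile(r'^https?://(localhost|www\.curiousprogrammer\.io)')

def cors_allowed(origin):
    """Return True if this Origin header should receive CORS headers."""
    return bool(origin and ALLOWED.match(origin))

print(cors_allowed('http://localhost:8081'))             # True
print(cors_allowed('https://www.curiousprogrammer.io'))  # True
print(cors_allowed('https://evil.example'))              # False
```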

Application layer

CORS can also be enabled at the Solr application layer, since Solr ships with the Jetty servlet engine.

cd /opt/solr/server/solr-webapp/webapp/WEB-INF/lib

curl "http://search.maven.org/remotecontent?filepath=org/eclipse/jetty/jetty-servlets/8.1.14.v20131031/jetty-servlets-8.1.14.v20131031.jar" -o jetty-servlets-8.1.14.v20131031.jar

curl "http://search.maven.org/remotecontent?filepath=org/eclipse/jetty/jetty-util/8.1.14.v20131031/jetty-util-8.1.14.v20131031.jar" -o jetty-util-8.1.14.v20131031.jar

Edit server/solr-webapp/webapp/WEB-INF/web.xml

 <filter>
   <filter-name>cross-origin</filter-name>
   <filter-class>org.eclipse.jetty.servlets.CrossOriginFilter</filter-class>
   <init-param>
     <param-name>allowedOrigins</param-name>
     <param-value>http://localhost:8081,https://curiousprogrammer.io</param-value>
   </init-param>
   <init-param>
     <param-name>allowedMethods</param-name>
     <param-value>GET,POST,OPTIONS,DELETE,PUT,HEAD</param-value>
   </init-param>
   <init-param>
     <param-name>allowedHeaders</param-name>
     <param-value>origin, content-type, accept</param-value>
   </init-param>
 </filter>

 <filter-mapping>
   <filter-name>cross-origin</filter-name>
   <url-pattern>/*</url-pattern>
 </filter-mapping>

My final thoughts

This exercise took me a day and a half to figure out. It took me another day to make sense of it all to write about it. I made a lot of silly mistakes that could have been avoided had I just taken a few extra breaks to clear my head.

When something goes wrong and I am highly uncertain about the technology, domain or intricacies of the problem, it becomes extremely challenging to translate that into search terms. This is sometimes overwhelming. The best advice I can give is to break down the problem as much as possible and unpack logically how each component fits together and what could possibly be causing an issue. This could extract more useful search terms or actually help solve the problem.

If it isn't that simple and you have error messages, copy them verbatim and read and re-read the errors and solutions presented to you, even if some of them might not seem related. Something might stand out or spark a thought.

If you don't have error messages, spitball with the terms you do know you are working with and the problems you think you may be experiencing.

Most importantly, make notes of what you are doing; you never know when you might need them again. You could also write them up as a blog post to help yourself and others experiencing the same challenges, however large or small.


As for Solr, it's early days. Looking back at the actual steps I had to take, the setup and installation are really simple. Finding that information was the tricky part, because there is a lot of information out there.

The blog crawler was an interesting find, and I am glad it is written in Python, because that's a language I feel is worth learning. The crawler does a dandy job, and exactly what I need it to do for now.

I enjoy working with Nginx, as I am now accustomed to it, and Let's Encrypt was fairly straightforward to configure once the domain name was set up correctly.

Next is actually configuring this solution into a React component on my Gatsby website. A tale for another day.
