Nutch 2.2 with ElasticSearch 1.x and HBase

This document describes how to install and run Nutch 2.2.1 with HBase 0.90.4 and ElasticSearch 1.1.1 on Ubuntu 14.04.

Prerequisites

Make sure the Java 7 JDK is installed:


$ sudo apt-get install openjdk-7-jdk

Then set JAVA_HOME by adding the following line at the bottom of ~/.bashrc:


export JAVA_HOME=/usr/lib/jvm/java-7-openjdk

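(The exact JDK directory may differ on your system, for example java-7-openjdk-amd64; check the contents of /usr/lib/jvm.)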

To load the change, either open a new terminal session or run:


$ source ~/.bashrc
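
If you are not sure which directory JAVA_HOME should point to, a small sketch like the following can suggest a value. It assumes that javac is on the PATH and that readlink supports -f (as it does on Ubuntu); java_home_from_javac is a hypothetical helper name, not part of any tool.

```shell
# Derive a JAVA_HOME candidate from the full path of a javac binary,
# e.g. /usr/lib/jvm/java-7-openjdk-amd64/bin/javac.
# java_home_from_javac is a hypothetical helper, not a standard tool.
java_home_from_javac() {
    # Taking the parent directory twice strips the trailing /bin/javac.
    dirname "$(dirname "$1")"
}

# Only attempt the lookup when javac is actually installed.
if command -v javac >/dev/null 2>&1; then
    # readlink -f follows the symlink chain to the real binary.
    real_javac="$(readlink -f "$(command -v javac)")"
    echo "Suggested JAVA_HOME: $(java_home_from_javac "$real_javac")"
else
    echo "javac not found; install the JDK first" >&2
fi
```

The printed path is only a suggestion; compare it with the directories in /usr/lib/jvm before putting it into ~/.bashrc.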

Download Nutch 2.2.x

Download release 2.2.1 (or the latest 2.2.x release) from:
https://nutch.apache.org/downloads.html

Unpack it and follow the steps described in the tutorial:
http://wiki.apache.org/nutch/Nutch2Tutorial

Download HBase

This setup is proven to work with HBase 0.90.4. That version is quite old (2011); you can try newer versions, but Nutch doesn’t support them. Hopefully there will be an upgrade soon.

http://archive.apache.org/dist/hbase/hbase-0.90.4/

Download ElasticSearch

Download and unpack ElasticSearch 1.x from:

http://www.elasticsearch.org/overview/elkdownloads/

To run ElasticSearch with the default configuration just go to ES_HOME and type:


$ bin/elasticsearch

Install HBase

Install HBase according to:
http://hbase.apache.org/book/quickstart.html

If you’re running on Ubuntu, you need to edit the file /etc/hosts.
Due to problems in old versions of HBase with loopback IP addresses, localhost must resolve to 127.0.0.1.
On Ubuntu, localhost or the machine’s hostname is sometimes mapped to 127.0.1.1; change every such entry to 127.0.0.1.
Apparently this is fixed in newer versions of HBase, but you cannot use them with Nutch yet.
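
To see whether your /etc/hosts is affected, a read-only check along these lines may help. This is a sketch: list_loopback_alias_entries is a hypothetical helper name, and the script only reports lines, it does not edit the file.

```shell
# Print every active (non-comment) line of a hosts file that uses 127.0.1.1.
# list_loopback_alias_entries is a hypothetical helper name.
list_loopback_alias_entries() {
    grep -E '^[[:space:]]*127\.0\.1\.1[[:space:]]' "$1"
}

# Report offending lines without modifying the file.
if [ -r /etc/hosts ] && list_loopback_alias_entries /etc/hosts; then
    echo "Change the entries above to 127.0.0.1"
else
    echo "No 127.0.1.1 entries found"
fi
```

Editing /etc/hosts itself requires root privileges, for example with sudo and your editor of choice.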

Now edit the configuration in HBASE_HOME/conf/hbase-site.xml.
HBase and ZooKeeper need directories to store their data in. The default location is under /tmp, which is cleared when the computer restarts.
So create two folders, one for HBase and one for ZooKeeper, and point the configuration at them:


<property>
<name>hbase.rootdir</name>
<value>file:///DIRECTORY/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/DIRECTORY/zookeeper</value>
</property>

Replace DIRECTORY with a folder of your choice. Don’t forget the file:// prefix in front of your hbase.rootdir value.
It points HBase at a location on your local filesystem, which is what you need for running HBase in standalone mode (without HDFS).
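
The two folders and the matching configuration can be prepared together, roughly as sketched here. prepare_hbase_dirs is a hypothetical helper, and the base directory in the example comment is an assumption; use any absolute path you own.

```shell
# Create the two data directories under an absolute base directory and
# print the matching <property> elements for hbase-site.xml.
# prepare_hbase_dirs is a hypothetical helper name.
prepare_hbase_dirs() {
    local base_dir="$1"
    mkdir -p "$base_dir/hbase" "$base_dir/zookeeper"
    cat <<EOF
<property>
<name>hbase.rootdir</name>
<value>file://$base_dir/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>$base_dir/zookeeper</value>
</property>
EOF
}

# Example (the directory name is an assumption):
# prepare_hbase_dirs "$HOME/nutch-data"
```

Paste the printed elements inside the <configuration> element of hbase-site.xml.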

Now start HBase by running the following in HBASE_HOME:


$ ./bin/start-hbase.sh

You can now check the logs, which are written to HBASE_HOME/logs by default.

Next, open the HBase shell to test your installation:


$ ./bin/hbase shell

You should be able to create a table:


hbase> create 'test', 'ab'

Expected output:


0 row(s) in 1.2200 seconds

The scan command lists the entire content of the created table:


hbase> scan 'test'

If there are no errors, your HBase installation should be set up correctly.

Setting up Nutch to work with HBase and ElasticSearch 1.x

Go to your NUTCH_HOME and edit conf/nutch-site.xml.
Enable HBase as the backend database:


<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>

In the same file, also set the agent name that your crawler uses to identify itself:

 

<property>
<name>http.agent.name</name>
<value>My Private Spider Bot</value>
</property>

<property>
<name>http.robots.agents</name>
<value>My Private Spider Bot</value>
</property>

Now set the dependency versions in NUTCH_HOME/ivy/ivy.xml, the configuration file of the Ivy dependency manager. First make sure the gora-hbase dependency is uncommented:


<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

To make sure that the correct version of ElasticSearch is used, also change the rev attribute of the elasticsearch dependency from its default to the version you want to use:


<dependency org="org.elasticsearch" name="elasticsearch" rev="1.1.1" conf="*->default"/>
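
If you prefer to script this change instead of editing ivy.xml by hand, something like the following sketch could work. set_ivy_rev is a hypothetical helper, and it assumes the dependency is written on one line with the name attribute directly followed by rev, as in the line above.

```shell
# Set the rev attribute of one dependency in an Ivy file.
# set_ivy_rev is a hypothetical helper; a .bak backup of the file is kept.
set_ivy_rev() {
    local ivy_file="$1"
    local dep_name="$2"
    local new_rev="$3"
    sed -i.bak \
        "s/name=\"$dep_name\" rev=\"[^\"]*\"/name=\"$dep_name\" rev=\"$new_rev\"/" \
        "$ivy_file"
}

# Example (run from NUTCH_HOME):
# set_ivy_rev ivy/ivy.xml elasticsearch 1.1.1
```

Check the result with a diff against the .bak file before building.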

Now you need to edit one line of Java source code in:
NUTCH_HOME/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
The ElasticSearch API changed after the version that Nutch uses by default, so the line that calls item.failed() must call item.isFailed() instead:


if (item.isFailed()) {...}
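
The same one-line change can be applied from the command line, for instance with sed. This is only a sketch: patch_elastic_writer is a hypothetical helper, and it assumes the old call appears in the file literally as item.failed().

```shell
# Replace the old item.failed() call with item.isFailed() in one file.
# patch_elastic_writer is a hypothetical helper; a .bak backup is kept
# next to the original file.
patch_elastic_writer() {
    sed -i.bak 's/item\.failed()/item.isFailed()/g' "$1"
}

# Example (run from NUTCH_HOME):
# patch_elastic_writer src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
```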

Now edit NUTCH_HOME/conf/gora.properties and set HBase as the default datastore:


gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

Compile Nutch

Just go to your NUTCH_HOME directory and run:


$ ant runtime

Once the build has succeeded, you can start working.

Make sure HBase is running!

Now you can start crawling a website

Create a folder called, for example, ‘urls’ in NUTCH_HOME/runtime.
Create a file called seed.txt inside it and add all the URLs that you want to crawl, one per line.
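
As a sketch, the folder and seed file can also be created from the shell. write_seed_file is a hypothetical helper, and the URL in the example comment is just a placeholder.

```shell
# Write the given URLs, one per line, to <dir>/seed.txt.
# write_seed_file is a hypothetical helper name.
write_seed_file() {
    local seed_dir="$1"
    shift
    mkdir -p "$seed_dir"
    printf '%s\n' "$@" > "$seed_dir/seed.txt"
}

# Example (run from NUTCH_HOME/runtime; the URL is a placeholder):
# write_seed_file urls http://nutch.apache.org/
```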

For standalone mode (not using Hadoop), go to NUTCH_HOME/runtime/local.

There you need to execute a pipeline of commands, all starting with bin/nutch. The available options are listed at:
http://wiki.apache.org/nutch/CommandLineOptions


1. $ bin/nutch inject <seed-url-dir>
2. $ bin/nutch generate -topN <n>
3. $ bin/nutch fetch -all
4. $ bin/nutch parse -all
5. $ bin/nutch updatedb
6. $ bin/nutch elasticindex <clustername> -all

To check whether everything worked, you can look at the table in HBase (via the HBase shell):


hbase> scan 'webpage'

Then repeat steps 2-5 as often as you want, and finally write everything to the index (step 6).
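
The whole cycle can be wrapped in a small script, sketched here. run_crawl and the NUTCH variable are hypothetical names, and the arguments in the example comment (three rounds, topN of 50, cluster name "elasticsearch") are assumptions to adapt to your setup.

```shell
# Command used to invoke Nutch; override NUTCH to point elsewhere.
NUTCH="${NUTCH:-bin/nutch}"

# run_crawl is a hypothetical wrapper around the six steps listed earlier.
# Arguments: seed directory, number of rounds, topN value, cluster name.
run_crawl() {
    local seed_dir="$1"
    local rounds="$2"
    local top_n="$3"
    local cluster="$4"
    local i=1

    "$NUTCH" inject "$seed_dir" || return 1            # step 1
    while [ "$i" -le "$rounds" ]; do
        "$NUTCH" generate -topN "$top_n" || return 1   # step 2
        "$NUTCH" fetch -all || return 1                # step 3
        "$NUTCH" parse -all || return 1                # step 4
        "$NUTCH" updatedb || return 1                  # step 5
        i=$((i + 1))
    done
    "$NUTCH" elasticindex "$cluster" -all              # step 6
}

# Example (run from NUTCH_HOME/runtime/local; all values are assumptions):
# run_crawl ../urls 3 50 elasticsearch
```

The function stops at the first failing step, so a broken fetch does not lead to indexing incomplete data.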

To check whether something has been written to the ElasticSearch index just execute:


$ curl -XGET 'http://localhost:9200/index/_search?q=*&pretty=true'

There you should see the crawled and downloaded documents with the raw text and all the metadata in JSON format.

By default, Nutch exports everything from the HBase table ‘webpage’ to an ElasticSearch index called ‘index’, with each document stored under the type ‘doc’.
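
To query only documents of that type, the type name can be added to the search path. The sketch below builds such a URL; es_search_url and the ES_HOST variable are hypothetical names, and the host is assumed to be the local default from the command above.

```shell
# Base URL of the ElasticSearch node (assumed to be the local default).
ES_HOST="${ES_HOST:-http://localhost:9200}"

# Build the search URL for a given index, type and query string.
# es_search_url is a hypothetical helper name.
es_search_url() {
    printf '%s/%s/%s/_search?q=%s&pretty=true\n' "$ES_HOST" "$1" "$2" "$3"
}

# Example:
# curl -XGET "$(es_search_url index doc '*')"
```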

Useful Links:

http://www.sigpwned.com/content/nutch-2-and-elasticsearch
http://etechnologytips.com/create-web-crawler-data-miner/
http://wiki.apache.org/nutch/CommandLineOptions
http://de.slideshare.net/digitalpebble/j-nioche-lucenerevoeu2013
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-16/nutch-search-engine

 

 

 

Install Elasticsearch in 5 Minutes

This is a short tutorial to install Elasticsearch in 5 minutes on Ubuntu in a Digital Ocean droplet.

I’ve been working with WordPress for a long time and what really got me hooked in the early days was the “Famous 5-Minute Install”. I’m extending that same concept to one of my new favorite tools – Elasticsearch. It’s a super fast search service built on Lucene that has an embedded RESTful JSON API.

Since it’s native JSON, any object you have in your code, whether it be a JavaScript object or a C# object, can be serialized and inserted into an Elasticsearch index. So technically you can use it as a NoSQL database. It clusters and does a lot of other fancy stuff but that’s not the point of this article. Anyway, you probably already know what it is if you stumbled on this post, so let’s get your very own Elasticsearch sandbox up and running…

Step 1: Get A Server

In order to get this done in 5 minutes we’re going to use Digital Ocean to spin up a cloud server. Why? Because it’s awesome and your server will be ready in 55 seconds… It’s cheap to run and free to get started if you use one of their many promo codes. If this doesn’t sound awesome to you, feel free to spend an hour or so setting up a Linux virtual machine. Either way, this tutorial assumes you are going to run Elasticsearch on Linux, specifically Ubuntu.

So after you sign up for Digital Ocean, set up a free Ubuntu Droplet (more info than you need is here). They’ll email you the root password, and you should then be able to access the Linux console from their website.


Note: there are a bunch of other things you’ll want to do if you run this server in production, like setting up SSH, disabling root login, and other things. Follow the tutorial ‘Initial Server Setup With Ubuntu’ for more details.

Step 2: Install Elasticsearch

Now you are ready to install Elasticsearch. Fortunately that’s the easy part. Run the shell script in this gist to get up and running.

 

cd ~
sudo apt-get update
sudo apt-get install openjdk-7-jre-headless -y
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.2.2.deb
sudo dpkg -i elasticsearch-1.2.2.deb
sudo service elasticsearch start
#curl http://localhost:9200

 

Aaaand you’re done.

Want to make sure it’s running? Run a curl in your console, hitting port 9200.

curl http://localhost:9200

You should see a short JSON response containing some metadata about your Elasticsearch instance.
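
Right after starting the service, the node may need a few seconds before it answers, so the first curl can fail. A retry loop like this sketch waits for it; wait_until is a hypothetical helper, and the attempt count and delay in the example comment are arbitrary choices.

```shell
# Retry a command until it succeeds or the attempts run out.
# Arguments: number of attempts, delay in seconds, then the command.
# wait_until is a hypothetical helper name.
wait_until() {
    local attempts="$1"
    local delay="$2"
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # Discard the command's output; only its exit status matters.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Example: poll port 9200 up to 30 times, one second apart.
# wait_until 30 1 curl -s http://localhost:9200 && echo "Elasticsearch is up"
```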

 


If I had DNS set up for this hostname, you would be able to reach Elasticsearch externally at http://elastic.brudtkuhl.com:9200, but for now you can just use the public IP address that Digital Ocean provides.

This is the first in a series of posts on my experiences working with Elasticsearch. Do you have any questions on how to install Elasticsearch?

Now onto your next step: Securing Elasticsearch.