{"id":1407,"date":"2015-01-18T13:49:13","date_gmt":"2015-01-18T10:19:13","guid":{"rendered":"http:\/\/vua.nadiran.com\/?p=1407"},"modified":"2015-01-18T17:15:39","modified_gmt":"2015-01-18T13:45:39","slug":"%d9%be%db%8c%d8%a7%d8%af%d9%87-%d8%b3%d8%a7%d8%b2%db%8c-%d9%85%d9%88%d8%aa%d9%88%d8%b1-%d8%ac%d8%b3%d8%aa%d8%ac%d9%88-%d8%a8%d8%a7-hbase-nutch-elasticsearch","status":"publish","type":"post","link":"https:\/\/vua.nadiran.com\/?p=1407","title":{"rendered":"Implementing a Search Engine with HBase, Nutch, and Elasticsearch"},"content":{"rendered":"<div style=\"text-align: left; direction: ltr;\" role=\"main\">\n<article id=\"post-6832\">\n<header>\n<h1>Nutch 2.2 with ElasticSearch 1.x and HBase<\/h1>\n<\/header>\n<div>\n<p>This document describes how to install and run Nutch 2.2.1 with HBase 0.90.4 and ElasticSearch 1.1.1 on Ubuntu 14.04.<\/p>\n<h4>Prerequisites<\/h4>\n<p>Make sure the Java 7 JDK is installed:<\/p>\n<pre><code>\r\n$ sudo apt-get install openjdk-7-jdk\r\n<\/code><\/pre>\n<p>Then set JAVA_HOME in your .bashrc.<br \/>\nAdd the following line at the bottom of $HOME\/.bashrc:<\/p>\n<pre><code>\r\nexport JAVA_HOME=\/usr\/lib\/jvm\/java-7-openjdk\r\n<\/code><\/pre>\n<p>(the JDK path might differ on your system)<\/p>\n<p>To load the changes in that file, either reconnect with your terminal or type:<\/p>\n<pre><code>\r\n$ source ~\/.bashrc\r\n<\/code><\/pre>\n<h4>Download Nutch 2.2.x<\/h4>\n<p>Download Nutch 2.2.1 (or the latest 2.2.x release) from:<br \/>\n<a href=\"https:\/\/nutch.apache.org\/downloads.html\" target=\"_blank\">https:\/\/nutch.apache.org\/downloads.html<\/a><\/p>\n<p>Unpack it and follow the steps described in the tutorial:<br \/>\n<a href=\"http:\/\/wiki.apache.org\/nutch\/Nutch2Tutorial\" target=\"_blank\">http:\/\/wiki.apache.org\/nutch\/Nutch2Tutorial<\/a><\/p>\n<h4>Download HBase<\/h4>\n<p>This setup is proven to work with HBase 0.90.4.
This version is quite old (2011). You might be tempted to try a newer version, but Nutch doesn\u2019t support them. Hopefully there will be an upgrade soon.<\/p>\n<p><a href=\"http:\/\/archive.apache.org\/dist\/hbase\/hbase-0.90.4\/\" target=\"_blank\">http:\/\/archive.apache.org\/dist\/hbase\/hbase-0.90.4\/<\/a><\/p>\n<h4>Download ElasticSearch<\/h4>\n<p>Download and unpack ElasticSearch 1.x from:<\/p>\n<p><a href=\"http:\/\/www.elasticsearch.org\/overview\/elkdownloads\/\" target=\"_blank\">http:\/\/www.elasticsearch.org\/overview\/elkdownloads\/<\/a><\/p>\n<p>To run ElasticSearch with the default configuration, go to ES_HOME and type:<\/p>\n<pre><code>\r\n$ bin\/elasticsearch\r\n<\/code><\/pre>\n<h4>Install HBase<\/h4>\n<p>Install HBase according to:<br \/>\n<a href=\"http:\/\/hbase.apache.org\/book\/quickstart.html\" target=\"_blank\">http:\/\/hbase.apache.org\/book\/quickstart.html<\/a><\/p>\n<p>If you\u2019re running on Ubuntu, you need to change the file \/etc\/hosts.<br \/>\nBecause of internal problems in old versions of HBase with loopback IP addresses, localhost must be specified as 127.0.0.1.<br \/>\nChange every localhost IP address to that format. Sometimes (on Ubuntu) localhost is 127.0.1.1.<br \/>\nApparently this is fixed in newer versions of HBase, but you cannot use them yet.<\/p>\n<p>Now you have to change the configuration in HBASE_HOME\/conf\/hbase-site.xml.<br \/>\nHBase and ZooKeeper need directories in which to store their data.
The default location is under \/tmp, which is cleared when the computer restarts.<br \/>\nSo create two folders, one for HBase and one for ZooKeeper, where they can save their data.<\/p>\n<pre><code>\r\n&lt;property&gt;\r\n&lt;name&gt;hbase.rootdir&lt;\/name&gt;\r\n&lt;value&gt;file:\/\/\/DIRECTORY\/hbase&lt;\/value&gt;\r\n&lt;\/property&gt;\r\n&lt;property&gt;\r\n&lt;name&gt;hbase.zookeeper.property.dataDir&lt;\/name&gt;\r\n&lt;value&gt;\/DIRECTORY\/zookeeper&lt;\/value&gt;\r\n&lt;\/property&gt;\r\n<\/code><\/pre>\n<p>Replace DIRECTORY with a folder of your choice. Don\u2019t forget the file:\/\/ prefix in front of your hbase.rootdir value.<br \/>\nTo run HBase in standalone mode (without HDFS), you need to specify a location on your local filesystem.<\/p>\n<p>Now start HBase by running the following in HBASE_HOME:<\/p>\n<pre><code>\r\n$ .\/bin\/start-hbase.sh\r\n<\/code><\/pre>\n<p>Now you can check the logs at the specified location.<\/p>\n<p>Next, open the HBase shell and test your installation:<\/p>\n<pre><code>\r\n$ .\/bin\/hbase shell\r\n<\/code><\/pre>\n<p>Inside the shell, you should be able to create a table:<\/p>\n<pre><code>\r\n&gt; create 'test', 'ab'\r\n<\/code><\/pre>\n<p>Expected output:<\/p>\n<pre><code>\r\n0 row(s) in 1.2200 seconds\r\n<\/code><\/pre>\n<p>The scan command lists all the content of the created table:<\/p>\n<pre><code>\r\n&gt; scan 'test'\r\n<\/code><\/pre>\n<p>If there are no errors, your HBase should be set up correctly.<\/p>\n<h4>Setting up Nutch to work with HBase and ElasticSearch 1.x<\/h4>\n<p>Go to your NUTCH_HOME and edit conf\/nutch-site.xml.<br \/>\nEnable HBase as the backend database:<\/p>\n<pre><code>\r\n&lt;property&gt;\r\n&lt;name&gt;storage.data.store.class&lt;\/name&gt;\r\n&lt;value&gt;org.apache.gora.hbase.store.HBaseStore&lt;\/value&gt;\r\n&lt;description&gt;Default class for storing data&lt;\/description&gt;\r\n&lt;\/property&gt;<\/code><\/pre>\n<p>&nbsp;<\/p>\n<p>&lt;property&gt;<br \/>\n&lt;name&gt;http.agent.name&lt;\/name&gt;<br 
\/>\n&lt;value&gt;My Private Spider Bot&lt;\/value&gt;<br \/>\n&lt;\/property&gt;<\/p>\n<p><code>&lt;property&gt;<br \/>\n&lt;name&gt;http.robots.agents&lt;\/name&gt;<br \/>\n&lt;value&gt;My Private Spider Bot&lt;\/value&gt;<br \/>\n&lt;\/property&gt;<br \/>\n<\/code><\/p>\n<p>Now set the dependency versions in the Ivy file NUTCH_HOME\/ivy\/ivy.xml. Uncomment the gora-hbase dependency:<\/p>\n<pre><code>\r\n&lt;!-- Uncomment this to use HBase as Gora backend. --&gt;\r\n&lt;dependency org=\"org.apache.gora\" name=\"gora-hbase\" rev=\"0.3\" conf=\"*-&gt;default\" \/&gt;\r\n<\/code><\/pre>\n<p>To make sure that the correct version of ElasticSearch is used, also change the default ElasticSearch version to the one you want to use:<\/p>\n<pre><code>\r\n&lt;dependency org=\"org.elasticsearch\" name=\"elasticsearch\" rev=\"1.1.1\" conf=\"*-&gt;default\"\/&gt;\r\n<\/code><\/pre>\n<p>Now you need to edit one line of Java source code in:<br \/>\nNUTCH_HOME\/src\/java\/org\/apache\/nutch\/indexer\/elastic\/ElasticWriter.java<br \/>\nThe line that calls item.failed() must call item.isFailed() instead, because the ElasticSearch API changed after the version that Nutch uses by default.<\/p>\n<pre><code>\r\nif (item.isFailed()) {...}\r\n<\/code><\/pre>\n<p>Now edit conf\/gora.properties<br \/>\nand enable HBase as the default datastore:<\/p>\n<pre><code>\r\ngora.datastore.default=org.apache.gora.hbase.store.HBaseStore\r\n<\/code><\/pre>\n<h4>Compile Nutch<\/h4>\n<p>Go to your NUTCH_HOME directory and run:<\/p>\n<pre><code>\r\n$ ant runtime\r\n<\/code><\/pre>\n<p>Once the build succeeds, you can start working.<\/p>\n<h4>Make sure HBase is running!<\/h4>\n<h4>Now you can start crawling a website<\/h4>\n<p>Create a folder called e.g.
\u2018urls\u2019 in NUTCH_HOME\/runtime.<br \/>\nCreate a file called seed.txt inside it and add, one per line, all the URLs that you want to crawl.<\/p>\n<p>For standalone mode (not using Hadoop), go to NUTCH_HOME\/runtime\/local.<\/p>\n<p>Now you need to execute a pipeline of commands, all starting with bin\/nutch:<br \/>\n<a href=\"http:\/\/wiki.apache.org\/nutch\/CommandLineOptions\" target=\"_blank\">http:\/\/wiki.apache.org\/nutch\/CommandLineOptions<\/a><\/p>\n<pre><code>\r\n1 $ bin\/nutch inject &lt;seed-url-dir&gt;\r\n2 $ bin\/nutch generate -topN &lt;n&gt;\r\n3 $ bin\/nutch fetch -all\r\n4 $ bin\/nutch parse -all\r\n5 $ bin\/nutch updatedb\r\n6 $ bin\/nutch elasticindex &lt;clustername&gt; -all\r\n<\/code><\/pre>\n<p>To check whether everything worked, you can look at HBase (via the HBase shell):<\/p>\n<pre><code>\r\n&gt; scan 'webpage'\r\n<\/code><\/pre>\n<p>Then repeat steps 2-5 as often as you want, and finally write everything to the index (step 6).<\/p>\n<p>To check whether something has been written to the ElasticSearch index, execute:<\/p>\n<pre><code>\r\n$ curl -XGET 'http:\/\/localhost:9200\/index\/_search?q=*&amp;pretty=true'\r\n<\/code><\/pre>\n<p>There you should see the crawled and downloaded documents with the raw text and all the metadata in JSON format.<\/p>\n<p>By default, Nutch writes everything from the HBase table \u2018webpage\u2019 to an ElasticSearch index called \u2018index\u2019 and exports all documents with the type \u2018doc\u2019.<\/p>\n<p>Useful Links:<\/p>\n<p><a href=\"http:\/\/www.sigpwned.com\/content\/nutch-2-and-elasticsearch\" target=\"_blank\">http:\/\/www.sigpwned.com\/content\/nutch-2-and-elasticsearch<\/a><br \/>\n<a href=\"http:\/\/etechnologytips.com\/create-web-crawler-data-miner\/\" target=\"_blank\">http:\/\/etechnologytips.com\/create-web-crawler-data-miner\/<\/a><br \/>\n<a href=\"http:\/\/wiki.apache.org\/nutch\/CommandLineOptions\" 
target=\"_blank\">http:\/\/wiki.apache.org\/nutch\/CommandLineOptions<\/a><br \/>\n<a href=\"http:\/\/de.slideshare.net\/digitalpebble\/j-nioche-lucenerevoeu2013\" target=\"_blank\">http:\/\/de.slideshare.net\/digitalpebble\/j-nioche-lucenerevoeu2013<\/a><br \/>\n<a href=\"https:\/\/www.inkling.com\/read\/hadoop-definitive-guide-tom-white-3rd\/chapter-16\/nutch-search-engine\" target=\"_blank\">https:\/\/www.inkling.com\/read\/hadoop-definitive-guide-tom-white-3rd\/chapter-16\/nutch-search-engine<\/a><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p style=\"direction: rtl; text-align: right;\">Installing ElasticSearch in 5 minutes:<\/p>\n<p style=\"direction: rtl; text-align: right;\">\n<header>\n<h1 itemprop=\"name\">Install Elasticsearch in 5 Minutes<\/h1>\n<\/header>\n<div itemprop=\"mainContentOfPage\">\n<p>This is a short tutorial to install Elasticsearch in 5 minutes on Ubuntu in a Digital Ocean droplet.<\/p>\n<p>I\u2019ve been working with\u00a0WordPress for a long time and what really got me hooked in the early days was the\u00a0<a href=\"http:\/\/codex.wordpress.org\/Installing_WordPress#Famous_5-Minute_Install\" target=\"_blank\">\u201cFamous 5-Minute Install\u201d<\/a>.\u00a0I\u2019m extending that same concept to one of my new favorite tools\u00a0\u2013\u00a0<a title=\"Elastic Search\" href=\"http:\/\/www.elasticsearch.org\/\" target=\"_blank\">Elasticsearch.<\/a>\u00a0It\u2019s a super fast search service built on Lucene that has\u00a0an embedded RESTful JSON API.<\/p>\n<p>Since it\u2019s native\u00a0JSON, any object you have in your code \u2013 whether it be a JavaScript object or a C# object \u2013 can be serialized and inserted into an Elasticsearch index. So technically you can use it as a NoSQL database. It clusters and does a lot of other fancy stuff, but that\u2019s not the point of this article.
Anyway, you probably already know what it is if you stumbled on this post, so\u00a0let\u2019s get your very own Elasticsearch sandbox up and running\u2026<\/p>\n<h2><strong>Step 1: Get A Server<\/strong><\/h2>\n<p><img loading=\"lazy\" decoding=\"async\" alt=\"Screen Shot 2014-05-23 at 1.23.28 PM\" src=\"http:\/\/brudtkuhl.com\/wp-content\/uploads\/2014\/05\/Screen-Shot-2014-05-23-at-1.23.28-PM-150x150.png\" width=\"150\" height=\"150\" \/>In order to get this done in 5 minutes we\u2019re going to use <a href=\"https:\/\/www.digitalocean.com\/?refcode=b0e2617efc3c\" target=\"_blank\">Digital Ocean<\/a>\u00a0to spin up a cloud server. Why? Because it\u2019s awesome and your server will be ready in 55 seconds\u2026 It\u2019s cheap to run and free to get started if you use one of\u00a0<a title=\"Digital Ocean Promo Codes\" href=\"https:\/\/twitter.com\/search?q=from%3Adigitalocean%20promo&amp;src=typd\" target=\"_blank\">their many promo codes<\/a>. If this doesn\u2019t sound awesome to you, feel free to spend an hour or so setting up a Linux virtual machine. Either way, this tutorial assumes you are going to run Elasticsearch on Linux,\u00a0specifically Ubuntu.<\/p>\n<p>So after you\u00a0<a href=\"https:\/\/www.digitalocean.com\/?refcode=b0e2617efc3c\" target=\"_blank\">sign up for Digital Ocean<\/a>, set up a free\u00a0Ubuntu Droplet (<a title=\"Install Ubuntu On Digital Ocean\" href=\"https:\/\/www.digitalocean.com\/community\/articles\/initial-server-setup-with-ubuntu-14-04\" target=\"_blank\">more info than you need is here<\/a>).
They\u2019ll email you the root password and you should be good to go to access the Linux console from their website.<\/p>\n<p><a href=\"http:\/\/brudtkuhl.com\/wp-content\/uploads\/2014\/05\/Screen-Shot-2014-05-23-at-1.31.38-PM.png\"><img loading=\"lazy\" decoding=\"async\" alt=\"Screen Shot 2014-05-23 at 1.31.38 PM\" src=\"http:\/\/brudtkuhl.com\/wp-content\/uploads\/2014\/05\/Screen-Shot-2014-05-23-at-1.31.38-PM-1024x462.png\" width=\"700\" height=\"315\" \/><\/a><\/p>\n<p><em>Note: there are a bunch of other things you\u2019ll want to do if you run this server in production \u2013 like setting up SSH, disabling root login, and other things. Follow this tutorial for \u2018<a title=\"Initial Server Setup with Ubuntu 14.04\" href=\"https:\/\/www.digitalocean.com\/community\/articles\/initial-server-setup-with-ubuntu-14-04\" target=\"_blank\">Initial Server Setup With Ubuntu<\/a>\u2018 for more details.<\/em><\/p>\n<h2><strong>Step 2: Install Elasticsearch<\/strong><\/h2>\n<p>Now you are ready to install Elasticsearch. Fortunately that\u2019s the easy part. 
Run the shell script in\u00a0<a title=\"Install Elastic Search On Ubuntu\" href=\"https:\/\/gist.github.com\/abrudtkuhl\/9983753\" target=\"_blank\">this gist<\/a>\u00a0to get up and running.<\/p>\n<p>&nbsp;<\/p>\n<div id=\"gist9983753\">\n<div>\n<div>\n<div>\n<table cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td>\n<div id=\"file-es-sh-LC1\">cd ~<\/div>\n<div id=\"file-es-sh-LC2\">sudo apt-get update<\/div>\n<div id=\"file-es-sh-LC3\">sudo apt-get install openjdk-7-jre-headless -y<\/div>\n<div id=\"file-es-sh-LC4\"><\/div>\n<div id=\"file-es-sh-LC5\">wget https:\/\/download.elasticsearch.org\/elasticsearch\/elasticsearch\/elasticsearch-1.2.2.deb<\/div>\n<div id=\"file-es-sh-LC6\">sudo dpkg -i elasticsearch-1.2.2.deb<\/div>\n<div id=\"file-es-sh-LC7\">sudo service elasticsearch start<\/div>\n<div id=\"file-es-sh-LC8\"><\/div>\n<div id=\"file-es-sh-LC9\">#curl http:\/\/localhost:9200<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Aaaand you\u2019re done.<\/p>\n<p>Want to make sure it\u2019s running?
Run curl in your console against port 9200.<\/p>\n<pre>curl http:\/\/localhost:9200<\/pre>\n<p>You should see something like this, giving you some metadata about your Elasticsearch instance.<\/p>\n<p>&nbsp;<\/p>\n<p><a href=\"http:\/\/brudtkuhl.com\/wp-content\/uploads\/2014\/05\/Screen-Shot-2014-05-29-at-3.10.02-PM.png\"><img loading=\"lazy\" decoding=\"async\" alt=\"Install ElasticSearch In 5 Minutes\" src=\"http:\/\/brudtkuhl.com\/wp-content\/uploads\/2014\/05\/Screen-Shot-2014-05-29-at-3.10.02-PM.png\" width=\"745\" height=\"421\" \/><\/a><\/p>\n<p>If I had DNS set up for this hostname, you could reach Elasticsearch externally at http:\/\/elastic.brudtkuhl.com:9200, but for now you can just use the public IP address that Digital Ocean provides.<\/p>\n<p>This is the first in a series of posts on my experiences working with Elasticsearch. Do you have any questions on how to install Elasticsearch?<\/p>\n<p>Now on to your next step:\u00a0<a title=\"Securing Elasticsearch\" href=\"http:\/\/brudtkuhl.com\/securing-elasticsearch\/\">Securing Elasticsearch<\/a>.<\/p>\n<p><small><em>\u00a0<\/em><\/small><\/p>\n<\/div>\n<p>&nbsp;<\/p>\n<\/div>\n<footer>\u00a0<\/footer>\n<\/article>\n<\/div>\n<aside role=\"complementary\">\n<div id=\"tag_cloud-2\"><\/div>\n<\/aside>\n","protected":false},"excerpt":{"rendered":"<p>Nutch 2.2 with ElasticSearch 1.x and HBase This document describes how to install and run Nutch 2.2.1 with HBase 0.90.4 and ElasticSearch 1.1.1 on Ubuntu 14.04 Prerequisites Make sure you installed the Java-SDK 7.
$ sudo apt-get install openjdk-7-jdk And you set JAVA_HOME in your .bashrc: Add the following line at the bottom of HOME\/.bashrc: <a href='https:\/\/vua.nadiran.com\/?p=1407' class='excerpt-more'>[&#8230;]<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[34],"tags":[],"class_list":["post-1407","post","type-post","status-publish","format-standard","hentry","category-34","category-34-id","post-seq-1","post-parity-odd","meta-position-corners","fix"],"_links":{"self":[{"href":"https:\/\/vua.nadiran.com\/index.php?rest_route=\/wp\/v2\/posts\/1407"}],"collection":[{"href":"https:\/\/vua.nadiran.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/vua.nadiran.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/vua.nadiran.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/vua.nadiran.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1407"}],"version-history":[{"count":5,"href":"https:\/\/vua.nadiran.com\/index.php?rest_route=\/wp\/v2\/posts\/1407\/revisions"}],"predecessor-version":[{"id":1409,"href":"https:\/\/vua.nadiran.com\/index.php?rest_route=\/wp\/v2\/posts\/1407\/revisions\/1409"}],"wp:attachment":[{"href":"https:\/\/vua.nadiran.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1407"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/vua.nadiran.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1407"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/vua.nadiran.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1407"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}