Summary of the search-engine word suggestion project (jamsheed.ir)

Nov 13, 2016

First, a domain named jamsheed.ir was registered, and I placed the keywords with the highest search volume on Google on its front page.

These keywords were linked to pages that displayed a large amount of content from the Parsijoo web service.

As multi-word search queries arrived and were recorded in Google Analytics, I stored them in a database.

A program splits each multi-word query into single words, and every word is stored in a table that holds the word and its occurrence count.

This method yielded the word counts for a specific time period.

For example, the word “دانلود” (download) occurred 50,000 times and the word “آهنگ” (song) occurred 15,000 times.

In the next step, a program I wrote extracted two-word combinations into a separate table and stored their occurrence counts as well.

For example, combinations such as “ضمن+خدمت” (in-service) with 2,700 occurrences and “فیش+حقوقی” (pay slip) with 1,700 had the highest cohesion coefficients.

At this stage some unusual combinations also appeared, which showed that to raise the accuracy of the predictor I also needed the cohesion coefficient of three consecutive words.

For example, combinations such as “ضمن+خدمت+فرهنگیان” (in-service training for teachers) with 3,900 occurrences, “آموزش+و+پرورش” (education) with 3,300, and “دانلود+فیلتر+شکن” (download an anti-filter tool) with 2,400 were among the most frequent.

These gave acceptable results for a statistical simulation of a word suggester.
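The counting steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `ngram_counts` and `suggest` and the sample queries are hypothetical, and the project's real program and data are not shown in the post.

```python
from collections import Counter

def ngram_counts(queries, n):
    """Count every run of n consecutive words across a list of search queries."""
    counts = Counter()
    for query in queries:
        words = query.split()
        for i in range(len(words) - n + 1):
            counts["+".join(words[i:i + n])] += 1
    return counts

def suggest(prefix_words, counts, limit=3):
    """Return the most frequent n-grams that start with the given words."""
    prefix = "+".join(prefix_words) + "+"
    matches = [(gram, c) for gram, c in counts.items() if gram.startswith(prefix)]
    # Highest count first; ties are broken alphabetically for stable output.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in matches[:limit]]

# Hypothetical sample queries, not the project's data.
queries = [
    "download free music",
    "download free movie",
    "download free music player",
    "pay slip download",
]
unigrams = ngram_counts(queries, 1)
bigrams = ngram_counts(queries, 2)
trigrams = ngram_counts(queries, 3)
print(suggest(["download", "free"], trigrams))
```

Storing one table per n-gram length, as the post does, keeps each lookup a simple prefix match on the words typed so far.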

Web crawler and search engine based on Nutch + Hadoop + HBase + ElasticSearch

Jan 19, 2015

The web crawler architecture is built on Nutch + Hadoop, a typical distributed offline batch-processing architecture. It has very good crawling throughput and offers many configuration options for customization. Because the crawler is responsible only for crawling web resources, a distributed search engine is needed to index and search the crawled resources in near real time.

The search engine architecture is built on ElasticSearch, a typical distributed online real-time interactive query architecture with almost no single point of failure, high scalability, and high availability. It can index and search large volumes of data in near real time, quickly search billions of documents and PB-scale data, and offers comprehensive options so that almost every aspect of the engine can be customized. It supports a RESTful API: you can call its functions, including search, analytics, and monitoring, using JSON over HTTP. Native client libraries are also available for Java, PHP, Perl, Python, Ruby, and other languages.

After the crawler fetches a page, it extracts structured data and submits it to the search engine for indexing, so that the data can be queried and analyzed. Because the search engine is designed for near-real-time complex interactive queries, its index does not store the original page content, so a near-real-time distributed database is needed to store the original pages.

The distributed database architecture is built on HBase + Hadoop, a typical distributed online real-time random-access architecture. It scales to tables with billions of rows and millions of columns, lets the crawler write data in real time, and gives the search engine real-time access to the data behind its search results.

The web crawler, the distributed database, and the search engine all run on clusters of ordinary commodity hardware. The clusters use a distributed architecture that can scale to thousands of machines and includes fault-tolerance mechanisms: the failure of some nodes neither loses data nor causes computing tasks to fail. In addition to high availability (failover is fast when a node fails), the clusters are highly scalable: simply adding machines expands them linearly, increasing storage capacity and computing speed.

The relationships among the web crawler, the distributed database, and the search engine:

1. After the crawler has fetched and parsed an HTML page, it puts the parsed data into a buffer queue. Two other threads process the data: one saves it to the distributed database, and the other submits it to the search engine for indexing.

2. The search engine processes the user's search criteria and returns the results to the user. If the user views a page snapshot, the original page content is fetched from the distributed database.
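The hand-off in step 1 can be sketched as follows. This is a minimal sketch, not the article's implementation: the dictionaries stand in for HBase and ElasticSearch, `run_pipeline` and `worker` are hypothetical names, and giving each worker its own queue is an assumption made here so that both threads see every page (the article only mentions a buffer queue).

```python
import queue
import threading

STOP = object()  # sentinel that tells a worker to exit

def worker(inbox, handle):
    """Drain one queue, passing every item to handle() until STOP arrives."""
    while True:
        item = inbox.get()
        if item is STOP:
            break
        handle(item)

def run_pipeline(pages):
    """Fan parsed pages out to a storage thread and an indexing thread."""
    store_q, index_q = queue.Queue(), queue.Queue()
    database = {}  # stand-in for the distributed database (original content)
    index = {}     # stand-in for the search index (extracted fields only)

    def save(page):
        database[page["url"]] = page["html"]

    def submit(page):
        index[page["url"]] = page["title"]

    threads = [
        threading.Thread(target=worker, args=(store_q, save)),
        threading.Thread(target=worker, args=(index_q, submit)),
    ]
    for t in threads:
        t.start()
    for page in pages:  # crawler side: enqueue each parsed page
        store_q.put(page)
        index_q.put(page)
    for q in (store_q, index_q):
        q.put(STOP)
    for t in threads:
        t.join()
    return database, index

pages = [
    {"url": "http://example.com/a", "title": "A", "html": "<html>A</html>"},
    {"url": "http://example.com/b", "title": "B", "html": "<html>B</html>"},
]
database, index = run_pipeline(pages)
```

Keeping the original HTML only in the database stand-in mirrors step 2: the index answers queries, and a snapshot request goes back to the database.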

The overall architecture is shown below:

Physically, the crawler cluster, the distributed database cluster, and the search engine cluster can be deployed on the same hardware cluster or separately, forming one to three hardware clusters.

The crawler cluster has a dedicated web crawler configuration management system that configures and manages the crawlers, as shown below:

The search engine achieves high performance, scalability, and high availability through sharding (shards) and replication (replicas). Sharding supports massively parallel indexing and searching, which greatly improves indexing and search performance as well as scalability. Replication keeps redundant copies of the data, so the failure of some machines does not affect normal use of the system and the system stays available.

The structure of an index with 2 shards and 3 copies is as follows:

A complete index is split into two independent parts, 0 and 1, and each part has two copies, shown as the gray parts below.

In a production environment, as the data grows, simply add hardware nodes: the search engine automatically redistributes the shards across the larger number of machines. When some nodes are retired, it automatically redistributes the shards across the smaller number of machines. The number of replicas can also be changed at any time to match changes in hardware reliability and storage capacity. All of this is dynamic and requires no cluster restart, which is another important guarantee of high availability.
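To make the shard and replica arithmetic concrete, here is a small sketch. `number_of_shards` and `number_of_replicas` are real Elasticsearch index settings; the helper names are hypothetical, and reading the example above as 2 primary shards with 2 replicas each is an assumption.

```python
import json

def index_settings(primary_shards, replicas):
    """Settings body for creating an index; the shard count is fixed at creation."""
    return {"settings": {"number_of_shards": primary_shards,
                         "number_of_replicas": replicas}}

def replica_update(replicas):
    """Settings body for changing the replica count on a live index."""
    return {"index": {"number_of_replicas": replicas}}

def total_shard_copies(primary_shards, replicas):
    """Every primary shard plus each of its replicas is one stored copy."""
    return primary_shards * (1 + replicas)

# Assumed reading of the example above: 2 primaries, 2 replicas of each.
print(json.dumps(index_settings(2, 2)))
print(total_shard_copies(2, 2))
```

The first body would be sent when the index is created, the second to the index's `_settings` endpoint later; only the replica count can be changed without rebuilding the index.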

Source: http://my.oschina.net/apdplat/blog/308396

Integrating Nutch 1.7 with ElasticSearch

Jan 19, 2015

Nutch 1.7 introduced the ability to integrate with ElasticSearch.
Setting up the integration is well worth the effort.

This guide can serve as a good set of instructions for people who have already worked with Nutch and ElasticSearch.

Nutch does the crawling, fetching, and parsing needed for indexing remarkably well, but the integration does not work out of the box.

What we do is change the nutch-site.xml file in the conf directory of the Nutch installation.
First of all, we need to enable the indexer plugin, which is done with the following setting:

<property>
<name>plugin.includes</name>
<description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>

The items added here are for the Elastic indexer.
In the second step, we need to change the following settings in nutch-site.xml:

<!-- Elasticsearch properties -->

<property>
<name>elastic.host</name>
<value>localhost</value>
<description>The hostname to send documents to using TransportClient. Either host and port must be defined or cluster.</description>
</property>

<property>
<name>elastic.port</name>
<value>9300</value>
</property>

<property>
<name>elastic.cluster</name>
<value>elasticsearch</value>
<description>The cluster name to discover. Either host and port must be defined or cluster.</description>
</property>

<property>
<name>elastic.index</name>
<value>nutch</value>
<description>Default index to send documents to.</description>
</property>

<property>
<name>elastic.max.bulk.docs</name>
<description>Maximum size of the bulk in number of documents.</description>
</property>

<property>
<name>elastic.max.bulk.size</name>
<description>Maximum size of the bulk in bytes.</description>
</property>

In my case ElasticSearch is installed on the same machine, so elastic.host is localhost.

Another important setting is the elastic.cluster name. If you do not know it, you can find it in the elasticsearch.yml file in the ElasticSearch configuration directory.

elastic.port defaults to 9300 for the transport interface (the web interface uses port 9200, which does not work for the Nutch integration).
Finally, set elastic.index to the name of the index to create in ElasticSearch.

There is no longer any need to change conf/elasticsearch.conf or to upgrade to Nutch 2.x.
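Before starting a crawl it can help to confirm that the settings discussed above are actually present in nutch-site.xml. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the sample XML is made up, and treating all four elastic.* names as expected reflects this walkthrough rather than a rule enforced by Nutch.

```python
import xml.etree.ElementTree as ET

# The four settings this walkthrough sets explicitly.
EXPECTED = ("elastic.host", "elastic.port", "elastic.cluster", "elastic.index")

def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props

def missing_elastic_settings(xml_text):
    """List the expected elastic.* properties that are absent or empty."""
    props = read_properties(xml_text)
    return [name for name in EXPECTED if not props.get(name)]

# Hypothetical configuration with elastic.index left out on purpose.
sample = """<configuration>
  <property><name>elastic.host</name><value>localhost</value></property>
  <property><name>elastic.port</name><value>9300</value></property>
  <property><name>elastic.cluster</name><value>elasticsearch</value></property>
</configuration>"""
missing = missing_elastic_settings(sample)
print(missing)
```

To check a real installation, read the file's text from the conf directory and pass it to `missing_elastic_settings`.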

Translated by: نادی سنجانی

Source: https://www.mind-it.info/integrating-nutch-1-7-elasticsearch

About Elasticsearch, Nutch, and HBase

Jan 18, 2015

Elasticsearch

A text indexing engine that supports complex queries and many kinds of requests in near real time and stores its data as JSON. Elasticsearch should probably not be relied on as a database, because it was not designed for that purpose and its developers still do not recommend it as a primary database. It is better used as a secondary store for advanced text search, so that if its indices are damaged the data can be restored from the primary database.

So for search, related news, channels, discussion forums, questions and answers, user comments, and similar features that need text queries with various filters, it is an ideal choice, on the condition that the original entries and the user's channels are also stored in a separate database. If the indices ever need to be rebuilt for any reason, that remains possible.

Source

Elasticsearch is a distributed search engine developed as an open-source Java project. It is built on Lucene, the popular Java search library. Its main characteristics are scalability (it remains usable when the volume of data is large) and real-time indexing and searching of data.

Apache Solr is a popular, fast search engine from the Apache Software Foundation that lets you search the words and content of your site quickly. It is implemented in Java. Advantages of using Solr:

Full-text search, hit highlighting, fast searching, dynamic clustering, database integration

The most important advantage of Solr is that the search load can be moved to a separate server, so that searching does not slow down the portal.

Source

Elasticsearch is a flexible, fast, open-source, and powerful engine for real-time site search. Many well-known sites use it, such as Sony, WordPress, GitHub, پت, Mozilla, Stack Overflow, and others. Using this engine increases the speed and performance of a website.

Source

Basic definitions:

Apache Hadoop

Hadoop is an open-source software framework designed for data-intensive distributed applications. It was developed by Doug Cutting so that it could work with the open-source search engine Nutch. To take advantage of multi-machine processing on the Nutch hardware platform, Cutting used a distributed file system and the MapReduce technique, which together formed Hadoop. Hadoop is the name of his son's toy elephant. Through MapReduce, Hadoop places big data in smaller pieces on the nodes of the network. This technology is now the most popular platform for storing large structured, semi-structured, and unstructured data. Hadoop is released under the Apache License 2.0.

ElasticSearch

Shay Banon released Elasticsearch under the Apache license. This search software is fully REST-based and can return search results in real time without special configuration. Many companies, including Mozilla and StumbleUpon, use Elasticsearch.

Apache HBase

HBase, written in Java, is modeled on Google's BigTable. This distributed, non-relational, column-oriented database can run on top of the Hadoop file system. Fault-tolerant storage and retrieval and access to large amounts of sparse data are among its capabilities. HBase is one of several NoSQL data stores developed in recent years. In 2010, Facebook adopted HBase for its messaging service.

Differences between ElasticSearch and Solr

https://www.youtube.com/embed/mkt3f-lgizQ

Setting up a search engine in PHP with ElasticSearch

https://www.youtube.com/embed/3xb1dHLg-Lk

Download nutch-elasticsearch

https://github.com/duffj/nutch-elasticsearch/archive/master.zip

What is Elasticsearch?

Elasticsearch is a flexible, powerful, open-source, distributed, real-time search and analytics engine.

Features:

Real-time analytics
Distributed
High availability
Multi-tenant architecture
Full text
Document oriented
Schema free
RESTful API
Per-operation persistence

Distributed:
Start small and scale horizontally out of the box.
For more capacity, just add nodes and let the cluster reorganize itself.

High availability

ElasticSearch clusters detect and remove failed nodes and reorganize themselves.

Multi-tenancy

A cluster can host multiple indices, which can be queried independently or as a group.

Document oriented

Store complex real-world entities in Elasticsearch as structured JSON documents.

RESTful API

Almost any operation can be performed through a RESTful interface using JSON over HTTP.

curl -X GET
curl -X PUT
curl -X POST
curl -X DELETE

Apache Lucene

ElasticSearch is built on top of Apache Lucene. Lucene is a high-performance, full-featured information retrieval library written in Java.

ElasticSearch terminology

In ElasticSearch, everything is stored as a document. Documents can be addressed and retrieved by querying their attributes.

Document types
Document types let us specify document properties so that we can tell the objects apart. (PDF, PPT, Doc)

Shard

Each shard is a separate native Lucene index.

This lets us overcome the limits of RAM and hard disk capacity.

Source: http://xinh.org/5el/#/5

Implementing a search engine with HBase, Nutch, and Elasticsearch

Jan 18, 2015

Nutch 2.2 with ElasticSearch 1.x and HBase

This document describes how to install and run Nutch 2.2.1 with HBase 0.90.4 and ElasticSearch 1.1.1 on Ubuntu 14.04

Prerequisites

Make sure the Java 7 JDK is installed.


$ sudo apt-get install openjdk-7-jdk

Then set JAVA_HOME in your .bashrc by adding the following line at the bottom of ~/.bashrc:


export JAVA_HOME=/usr/lib/jvm/java-7-openjdk

(the JDK path may differ on your system)

Now either reconnect with your terminal or run the following to load the changes in that file:


$ source ~/.bashrc

Download Nutch 2.2.x

Download the latest release or 2.2.1 from:
https://nutch.apache.org/downloads.html

Unpack it and follow the steps described in the tutorial:
http://wiki.apache.org/nutch/Nutch2Tutorial

Download HBase

This setup is proven to work with HBase 0.90.4. That version is quite old (2011), so you may be tempted to try a newer one, but Nutch 2.2.1 does not support newer versions. Hopefully there will be an upgrade soon.

http://archive.apache.org/dist/hbase/hbase-0.90.4/

Download ElasticSearch

Download and unpack ElasticSearch 1.x from:

http://www.elasticsearch.org/overview/elkdownloads/

To run ElasticSearch with the default configuration just go to ES_HOME and type:


$ bin/elasticsearch

Install HBase

Install HBase according to:
http://hbase.apache.org/book/quickstart.html

If you’re running on Ubuntu you need to change the file /etc/hosts.
Because of internal problems in old versions of HBase with loopback IP addresses, localhost must be specified as 127.0.0.1.
On Ubuntu, localhost entries sometimes use 127.0.1.1; change all of them to 127.0.0.1.
Apparently this is fixed in newer versions of HBase, but you cannot use them yet.

Now change the configuration in HBASE_HOME/conf/hbase-site.xml.
HBase and ZooKeeper need directories to save their data in. The default is under /tmp, which is cleared when the computer restarts.
So create two folders, one for HBase and one for ZooKeeper, where they can save their data.


<property>
<name>hbase.rootdir</name>
<value>file:///DIRECTORY/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/DIRECTORY/zookeeper</value>
</property>

Just replace DIRECTORY with a folder of your choice. Don’t forget file:// in front of your hbase.rootdir.
You need to specify a location on your local filesystem to run HBase in standalone mode (without HDFS).

Now start HBase by running the following in HBASE_HOME:


$ ./bin/start-hbase.sh

Now you can check the logs at the specified location.

Now please use the shell and test your HBase installation.


$ ./bin/hbase shell

You should be able to create a table:


hbase> create 'test', 'ab'

Expected output:


0 row(s) in 1.2200 seconds

The scan command lists all the content of the created table:


hbase> scan 'test'

If there are no errors, your HBase should be set up correctly.

Setting up Nutch to work with HBase and ElasticSearch 1.x

Go to your NUTCH_HOME and edit conf/nutch-site.xml:
Enable HBase as backend-database:


<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>

<property>
<name>http.agent.name</name>
<value>My Private Spider Bot</value>
</property>

<property>
<name>http.robots.agents</name>
<value>My Private Spider Bot</value>
</property>

Now set the versions in your dependency-manager in NUTCH_HOME/ivy/ivy.xml:


<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

To make sure that the correct version of ElasticSearch is used you also need to change the default version to the one you want to use:


<dependency org="org.elasticsearch" name="elasticsearch" rev="1.1.1" conf="*->default"/>

Now you need to edit one line of Java source code in:
NUTCH_HOME/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
The line that calls item.failed() must call item.isFailed() instead, because the ElasticSearch API changed after the version that Nutch uses by default.


if (item.isFailed()) {...}

Now edit gora.properties and enable HBase as the default datastore:


gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

Compile Nutch

Just go to your NUTCH_HOME directory and run:


$ ant runtime

When the build is successful you can start working.

Make sure HBase is running!

Now you can start crawling a website

Create a folder called, for example, ‘urls’ in NUTCH_HOME/runtime.
Create a file called seed.txt inside it and add all the URLs that you want to crawl, one per line.

For standalone mode (not using Hadoop), go to NUTCH_HOME/runtime/local.

Now execute a pipeline of commands, all starting with bin/nutch:
http://wiki.apache.org/nutch/CommandLineOptions


1. $ bin/nutch inject <seed-url-dir>
2. $ bin/nutch generate -topN <n>
3. $ bin/nutch fetch -all
4. $ bin/nutch parse -all
5. $ bin/nutch updatedb
6. $ bin/nutch elasticindex <clustername> -all

To check whether everything worked, look at the webpage table in HBase (via the HBase shell):


hbase> scan 'webpage'

Then repeat steps 2-5 as often as you want, and finally write everything to the index (step 6).
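The repeat-then-index loop described above can be laid out programmatically. This is a sketch under assumptions: `crawl_commands` is a hypothetical helper, the seed directory, topN value, and cluster name are placeholders, and the commands are only built and printed here, not executed.

```python
def crawl_commands(seed_dir, top_n, cluster, rounds):
    """Build the bin/nutch calls: one inject, `rounds` crawl cycles, one index step."""
    commands = [["bin/nutch", "inject", seed_dir]]
    for _ in range(rounds):
        # Steps 2-5 of the pipeline, repeated once per crawl round.
        commands += [
            ["bin/nutch", "generate", "-topN", str(top_n)],
            ["bin/nutch", "fetch", "-all"],
            ["bin/nutch", "parse", "-all"],
            ["bin/nutch", "updatedb"],
        ]
    # Step 6: push everything to ElasticSearch at the end.
    commands.append(["bin/nutch", "elasticindex", cluster, "-all"])
    return commands

plan = crawl_commands("urls", 50, "elasticsearch", rounds=2)
for argv in plan:
    print(" ".join(argv))
```

To run the plan for real, each list could be passed to `subprocess.run(argv, cwd=..., check=True)` from NUTCH_HOME/runtime/local, so that a failing step stops the loop.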

To check whether something has been written to the ElasticSearch index just execute:


$ curl -XGET 'http://localhost:9200/index/_search?q=*&pretty=true'

There you should see the crawled and downloaded documents with the raw text and all the metadata in json-format.

By default, Nutch writes everything from the HBase ‘webpage’ table to an index called ‘index’ and exports all documents to ElasticSearch with the type ‘doc’.

Useful Links:

http://www.sigpwned.com/content/nutch-2-and-elasticsearch
http://etechnologytips.com/create-web-crawler-data-miner/
http://wiki.apache.org/nutch/CommandLineOptions
http://de.slideshare.net/digitalpebble/j-nioche-lucenerevoeu2013
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-16/nutch-search-engine

Install Elasticsearch in 5 Minutes

This is a short tutorial to install Elasticsearch in 5 minutes on Ubuntu in a Digital Ocean droplet.

I’ve been working with WordPress for a long time and what really got me hooked in the early days was the “Famous 5-Minute Install”. I’m extending that same concept to one of my new favorite tools – Elasticsearch. It’s a super fast search service built on Lucene that has an embedded RESTful JSON API.

Since it’s native JSON, any object you have in your code, whether it is a JavaScript object or a C# object, can be serialized and inserted into an Elasticsearch index. So technically you can use it as a NoSQL database. It clusters and does a lot of other fancy stuff, but that’s not the point of this article. Anyway, you probably already know what it is if you stumbled on this post, so let’s get your very own Elasticsearch sandbox up and running…

Step 1: Get A Server

In order to get this done in 5 minutes we’re going to use Digital Ocean to spin up a cloud server. Why? Because it’s awesome and your server will be ready in 55 seconds… It’s cheap to run and free to get started if you use one of their many promo codes. If this doesn’t sound awesome to you, feel free to spend an hour or so setting up a Linux virtual machine. Either way, this tutorial assumes you are going to run Elasticsearch on Linux, specifically Ubuntu.

So after you sign up for Digital Ocean, set up a free Ubuntu Droplet (more info than you need is here). They’ll email you the root password and you should be good to go to access the Linux console from their website.

Note: there are a bunch of other things you’ll want to do if you run this server in production – like setting up SSH, disabling root login, and other things. Follow this tutorial for ‘Initial Server Setup With Ubuntu‘ for more details.

Step 2: Install Elasticsearch

Now you are ready to install Elasticsearch. Fortunately that’s the easy part. Run the shell script in this gist to get up and running.

cd ~
sudo apt-get update
sudo apt-get install openjdk-7-jre-headless -y
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.2.2.deb
sudo dpkg -i elasticsearch-1.2.2.deb
sudo service elasticsearch start
# curl http://localhost:9200

Aaaand you’re done.

Want to make sure it’s running? Run a curl in your console, hitting port 9200.

curl http://localhost:9200

You should see a JSON response with some metadata about your Elasticsearch instance.

If DNS is set up for your hostname, you can reach Elasticsearch externally at an address such as http://elastic.brudtkuhl.com:9200, but for now you can just use the public IP address that Digital Ocean provides.

This is the first in a series of posts on my experiences working with Elasticsearch. Do you have any questions on how to install Elasticsearch?

Now onto your next step: Securing Elasticsearch.

Introducing OpenSearchServer

Jan 18, 2015

OpenSearchServer features

Unlimited:
• A full set of search functions
• Build your own indexing strategy
• A fully integrated solution
• Parsers extract full-text data
• The crawlers can index everything

Search:
• Full-text, Boolean and phonetic search
• Outer and inner join
• Clusters with faceting and collapsing
• Filtered search (date, distance)
• Geolocation using square or radius
• Several spell-checking algorithms
• Relevance customization
• Suggestion (auto-completion)

Indexing:
• 17 language options
• Special analysis for each language
• Numerous filters: n-gram, lemmatization, shingle, elisions, stripping diacritic, etc.
• Automatic language detection
• Named entity recognition
• Synonyms (word and multi-terms)
• Automatic classifications

Integration:
• REST API and SOAP Web Service
• Monitoring module
• Index replication
• Scheduling for periodic tasks
• Scripting feature powered by Selenium®
• Multiple client implementations: PHP, Ruby, Perl, C#, etc.

Parsing:
• Office® documents (Word®, Excel®, PowerPoint®, Visio®, Publisher®)
• OpenOffice® documents
• Adobe PDF® (with OCR)
• Web pages (HTML), RTF, plain text
• Audio files metadata
• Images (metadata and OCR)
• MAPI® messages
• Etc.

Crawlers:
• The web crawler includes inclusion or exclusion filters with wildcards, HTTP authentication, screenshot, sitemap, etc.
• The REST crawler indexes Web Services data
• The file system crawler browses SMB/CIFS, FTP(S), SWIFT
• The database crawler supports all databases (JDBC)

Introduction
Added by Naveen, last edited by Emmanuel Keller on September 3, 2012.
OpenSearchServer is search engine software developed under the GPL v3 open-source license. It offers a set of robust full-text search algorithms and is built using the best open-source technologies available.
Open source
It is open-source software developed under the GPL v3 license.
Built-in web crawler, file crawler, database crawler, and XML-based structure
You can index web content or file content by running the web crawler or the file crawler, respectively. Database content can easily be indexed with the database crawler, and so can XML structures.
Indexing in multiple languages
Web content and documents can be indexed in the seventeen languages listed below: Arabic, Chinese, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
Multilingual analysis
OpenSearchServer provides several kinds of analyzers that split sentences into words and then run language-specific algorithms on the words (singular/plural, gender, conjugated verbs, etc.).
Support for various document formats
The following formats are supported: XML, XHTML/HTML, MS Office (Word, PowerPoint, Excel), Adobe PDF, OpenOffice, RTF, plain text, audio (OGG, MP3, WAV, Torrent), images (JPEG, GIF, PNG).
Advanced functions
Faceting, synonyms, spell checking, collapsing, stopwords, auto-completion, joined queries, OCR, screenshot.
Easy configuration
OpenSearchServer is easily configured through XML files or through a rich web interface, which includes the definition of fields and indexing options.
Fast integration
OpenSearchServer can be integrated quickly through:
• REST API (XML or JSON)
• SOAP web services
• Client libraries: PHP, .NET
• Drupal and WordPress plugins
OpenSearchServer is supported on the following operating systems:
Windows 20xx/XP/Vista/7/8
Linux
MacOS X
Solaris

Source: http://www.opensearchserver.com

http://www.opensearchserver.com/documentation/README.md

ElasticWho?

Jan 17, 2015

ElasticSearch is a flexible and powerful open source, distributed real-time search and analytics engine.

Features

Real time analytics
Distributed
High availability
Multi tenant architecture
Full text
Document oriented
Schema free
RESTful API
Per-operation persistence

Distributed

Start small and scale horizontally out of the box. For more capacity, just add more nodes and let the cluster reorganize itself.

High Availability

ElasticSearch clusters detect and remove failed nodes, and reorganize themselves.

Multi Tenancy


$ curl -XPUT http://localhost:9200/people

$ curl -XPUT http://localhost:9200/gems

$ curl -XPUT http://localhost:9200/gems/document/pry-0.5.9

$ curl -XGET http://localhost:9200/gems/document/pry-0.5.9

A cluster can host multiple indices which can be queried independently, or as a group.
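Querying indices independently or as a group comes down to how the request path is written. The sketch below builds URI-search URLs for one index and for a comma-separated group. `search_url` is a hypothetical helper, the host and the `people` and `gems` index names come from the curl lines above, and the query string `name:pry` is an assumed example.

```python
from urllib.parse import quote, urlencode

def search_url(indices, query, base="http://localhost:9200"):
    """Build a URI-search URL for one index or a comma-separated group of indices."""
    if isinstance(indices, str):
        indices = [indices]
    # Index names are escaped individually and joined with commas.
    path = ",".join(quote(name, safe="") for name in indices)
    return "{}/{}/_search?{}".format(base, path, urlencode({"q": query}))

one = search_url("gems", "name:pry")
group = search_url(["gems", "people"], "name:pry")
print(one)
print(group)
```

Either URL can be passed to curl -XGET in the same way as the document requests shown above.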

Document Oriented


{
    "_id": "pry-0.5.9", 
    "_index": "gems", 
    "_source": {
        "authors": [
            "John Mair (banisterfiend)"
        ], 
        "autorequire": null, 
        "bindir": "bin", 
        "cert_chain": [], 
        "date": "Sun Feb 20 11:00:00 UTC 2011", 
        "default_executable": null, 
        "description": "attach an irb-like session to any object at runtime", 
        "email": "jrmair@gmail.com"
    }
}

Store complex real world entities in Elasticsearch as structured JSON documents.

RESTful API

Almost any operation can be performed using a simple RESTful interface using JSON over HTTP.

curl -X GET
curl -X PUT
curl -X POST
curl -X DELETE

Apache Lucene

ElasticSearch is built on top of Apache Lucene. Lucene is a high performance, full-featured Information Retrieval library, written in Java.

ElasticSearch Terminology

Document

$ curl -XGET http://localhost:9200/gems/document/pry-0.5.9


{
    "_id": "pry-0.5.9", 
    "_index": "gems", 
    "_source": {
        "authors": [
            "John Mair (banisterfiend)"
        ], 
        "autorequire": null, 
        "bindir": "bin", 
        "cert_chain": [], 
        "date": "Sun Feb 20 11:00:00 UTC 2011", 
        "default_executable": null, 
        "description": "attach an irb-like session to any object at runtime", 
        "email": "jrmair@gmail.com", 
        "executables": [
            "pry"
        ], 
        "extensions": [], 
        "extra_rdoc_files": [], 
        "files": [
            "lib/pry/commands.rb", 
            "lib/pry/command_base.rb", 
            "lib/pry/completion.rb", 
            "lib/pry/core_extensions.rb", 
            "lib/pry/hooks.rb", 
            "lib/pry/print.rb", 
            "lib/pry/prompts.rb", 
            "lib/pry/pry_class.rb", 
            "lib/pry/pry_instance.rb", 
            "lib/pry/version.rb", 
            "lib/pry.rb", 
            "examples/example_basic.rb", 
            "examples/example_commands.rb", 
            "examples/example_command_override.rb", 
            "examples/example_hooks.rb", 
            "examples/example_image_edit.rb", 
            "examples/example_input.rb", 
            "examples/example_input2.rb", 
            "examples/example_output.rb", 
            "examples/example_print.rb", 
            "examples/example_prompt.rb", 
            "test/test.rb", 
            "test/test_helper.rb", 
            "CHANGELOG", 
            "LICENSE", 
            "README.markdown", 
            "Rakefile", 
            ".gemtest", 
            "bin/pry"
        ], 
        "has_rdoc": true, 
        "homepage": "http://banisterfiend.wordpress.com", 
        "id": "pry-0.5.9", 
        "licenses": [], 
        "name": "pry", 
        "platform": "ruby", 
        "post_install_message": null, 
        "rdoc_options": [], 
        "require_paths": [
            "lib"
        ], 
        "requirements": [], 
        "rubyforge_project": null, 
        "rubygems_version": "1.5.2", 
        "signing_key": null, 
        "specification_version": 3, 
        "summary": "attach an irb-like session to any object at runtime", 
        "test_files": [], 
        "version": {
            "prerelease": null, 
            "version": "0.5.9"
        }
    }, 
    "_type": "document", 
    "_version": 1, 
    "exists": true
}

In ElasticSearch, everything is stored as a Document. Documents can be addressed and retrieved by querying their attributes.

Document Types

Document Types let us specify document properties, so we can tell different kinds of objects apart.

Shard

Each Shard is a separate native Lucene Index. Sharding lets us overcome the RAM and hard disk capacity limits of a single machine.

Replica

An exact copy of a primary Shard. Helps in setting up HA and increases query throughput.

Index

ElasticSearch stores its data in logical Indices. Think of a table, collection or a database.

ElasticSearch Index

An Index has at least 1 primary Shard, and 0 or more Replicas.
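As a rough illustration of the rule above, the number of shard copies an index occupies can be computed from its two settings. This is only a sketch: the function name `total_shard_copies` is made up for this example, and it assumes the replica setting counts copies per primary shard.

```python
def total_shard_copies(primary_shards: int, replicas: int) -> int:
    """Return the primaries plus all replica copies for one index."""
    if primary_shards < 1:
        raise ValueError("an index needs at least 1 primary shard")
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    # Every primary shard has `replicas` additional copies.
    return primary_shards * (1 + replicas)


print(total_shard_copies(5, 1))  # 5 primaries, 1 replica each
print(total_shard_copies(6, 0))  # 6 primaries, no replicas
```

With 0 replicas there is no redundancy, so losing a node means losing the shards it held.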

Cluster

A collection of cooperating ElasticSearch nodes. Gives better availability and performance via Index Sharding and Replicas.

ElasticSearch Workshop

Download and start

Download ElasticSearch from http://www.elasticsearch.org/download


# service elasticsearch start


# /etc/init.d/elasticsearch start


# ./bin/elasticsearch -f

ElasticSearch Plugins

A site plugin to view the contents of an ElasticSearch cluster.


# cd /usr/share/elasticsearch
# ./bin/plugin -install mobz/elasticsearch-head


# cd /opt/elasticsearch-0.90.2
# ./bin/plugin -install mobz/elasticsearch-head

Restart ElasticSearch. Plugins are detected and loaded on service startup.

elasticsearch-head

RESTful interface


$ curl -XGET 'http://localhost:9200/'


{
  "ok" : true,
  "status" : 200,
  "name" : "Drake, Frank",
  "version" : {
    "number" : "0.90.2",
    "snapshot_build" : false,
    "lucene_version" : "4.3.1"
  },
  "tagline" : "You Know, for Search"
}

Create Index


$ curl -XPUT 'http://localhost:9200/gems'


{
  "ok":true,
  "acknowledged":true
}

Cluster status


$ curl -XGET 'localhost:9200/_status'


{"ok":true,"_shards":{"total":20,"successful":10,"failed":0},
"indices":{"gems":{"index":{"primary_size":"495b","primary_size_in_bytes":495,
"size":"495b","size_in_bytes":495},"translog":{"operations":0},
"docs":{"num_docs":0,"max_doc":0,"deleted_docs":0},"merges":
{"current":0,"current_docs":0,"current_size":"0b","current_size_in_bytes":0,
"total":0,"total_time":"0s","total_time_in_millis":0,"total_docs":0,
"total_size":"0b","total_size_in_bytes":0},
...
...
...

Pretty Output


$ curl -XGET 'localhost:9200/_status?pretty'


$ curl -XGET 'localhost:9200/_status' | python -mjson.tool


$ curl -XGET 'localhost:9200/_status' | json_reformat


{
    "ok": true,
    "_shards": {
        "total": 20,
        "successful": 10,
        "failed": 0
    },
    "indices": {
        "gems": {
            "index": {
                "primary_size": "495b",
                "primary_size_in_bytes": 495,
                "size": "495b",
                "size_in_bytes": 495
            },
...

Delete Index


$ curl -XDELETE 'http://localhost:9200/gems'


{
  "ok":true,
  "acknowledged":true
}

Create custom Index


{
    "settings" : {
        "index" : {
            "number_of_shards" : 6,
            "number_of_replicas" : 0
        }
    }
}


$ curl -XPUT 'http://localhost:9200/gems' -d @body.json


{
  "ok":true,
  "acknowledged":true
}
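The same request can be prepared from a script instead of a `body.json` file. The sketch below only builds the request and does not send it; `ES_URL` and the helper name `build_create_index_request` are assumptions for this example, not part of the workshop material.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local node, as in the curl examples


def build_create_index_request(index, shards, replicas):
    """Build (but do not send) a PUT request that creates an index."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(f"{ES_URL}/{index}", data=data, method="PUT")


req = build_create_index_request("gems", 6, 0)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it (needs a running node): urllib.request.urlopen(req)
```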

Index a document


{
  "name": "pry", 
  "platform": "ruby", 
  "rubygems_version": "1.5.2", 
  "description": "attach an irb-like session to any object at runtime", 
  "email": "anurag@example.com", 
  "has_rdoc": true, 
  "homepage": "http://banisterfiend.wordpress.com"
}


$ curl -XPOST 'http://localhost:9200/gems/test/' -d @body.json


{
  "ok":true,
  "_index":"gems",
  "_type":"test",
  "_id":"lsJgxiwET6eg",
  "_version":1
}

Get document


$ curl -XGET 'http://localhost:9200/gems/test/lsJgxiwET6eg' | python -mjson.tool


{
    "_id": "lsJgxiwET6eg", 
    "_index": "gems", 
    "_source": {
        "description": "attach an irb-like session to any object at runtime", 
        "email": "anurag@example.com", 
        "has_rdoc": true, 
        "homepage": "http://banisterfiend.wordpress.com", 
        "name": "pry", 
        "platform": "ruby", 
        "rubygems_version": "1.5.2"
    }, 
    "_type": "test", 
    "_version": 1, 
    "exists": true
}

Index another document


{
  "name": "grit", 
  "platform": "jruby", 
  "rubygems_version": "2.5.0", 
  "description": "Ruby library for extracting information from a git repository.", 
  "email": "mojombo@github.com", 
  "has_rdoc": false,
  "homepage": "http://github.com/mojombo/grit"
}


$ curl -XPOST 'http://localhost:9200/gems/test/' -d @body.json


{
  "ok":true,
  "_index":"gems",
  "_type":"test",
  "_id":"ijUOHi2cQc2",
  "_version":1
}

Custom Document IDs


{
  "name": "grit", 
  "platform": "jruby", 
  "rubygems_version": "2.5.1", 
  "description": "Ruby library for extracting information from a git repository.", 
  "email": "mojombo@github.com", 
  "has_rdoc": false,
  "homepage": "http://github.com/mojombo/grit"
}


$ curl -XPUT 'http://localhost:9200/gems/test/grit-2.5.1' -d @body.json


{
  "ok":true,
  "_index":"gems",
  "_type":"test",
  "_id":"grit-2.5.1",
  "_version":1
}

IDs are unique within an Index. A document's unique identifier is composed of its DocumentType and ID.

Document Versions


$ curl -XPUT 'http://localhost:9200/gems/test/grit-2.5.1' -d @body.json


{
  "ok":true,
  "_index":"gems",
  "_type":"test",
  "_id":"grit-2.5.1",
  "_version":2
}
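The behaviour above can be pictured with a toy in-memory model: documents are keyed by type and ID together, and every write to the same key bumps the version. `ToyIndex` is a hypothetical class written for this illustration and says nothing about how ElasticSearch stores data internally.

```python
class ToyIndex:
    """In-memory stand-in for one index, for illustration only."""

    def __init__(self):
        self._docs = {}  # (doc_type, doc_id) -> (version, source)

    def put(self, doc_type, doc_id, source):
        # A key that was never written starts from version 0.
        version, _ = self._docs.get((doc_type, doc_id), (0, None))
        self._docs[(doc_type, doc_id)] = (version + 1, dict(source))
        return {"_type": doc_type, "_id": doc_id, "_version": version + 1}


index = ToyIndex()
first = index.put("test", "grit-2.5.1", {"name": "grit"})
second = index.put("test", "grit-2.5.1", {"name": "grit"})
other = index.put("gem", "grit-2.5.1", {"name": "grit"})
print(first["_version"], second["_version"], other["_version"])
```

The third write uses the same ID under a different type, so in this model it is a separate document and starts again at version 1.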

Searching Documents


{
  "query": {
    "term": {"name": "pry"}
  }
}


$ curl -XPOST http://localhost:9200/gems/_search -d @body.json | python -mjson.tool


{
  "_shards": {
    "failed": 0, 
    "successful": 6, 
    "total": 6
  },
  "hits": {
    "hits": [
      {
        "_id": "MWkKgzsMRgK", 
        "_index": "gems", 
        "_score": 1.4054651, 
        "_source": {
          "description": "attach an irb-like session to any object at runtime", 
          "email": "anurag@example.com", 
          "has_rdoc": true, 
          "homepage": "http://banisterfiend.wordpress.com", 
          "name": "pry", 
          "platform": "ruby", 
          "rubygems_version": "1.5.2"
        }, 
        "_type": "test"
      }
    ], 
    "max_score": 1.4054651, 
    "total": 1
  }, 
  "timed_out": false, 
  "took": 2
}
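When the response is consumed by a program rather than read by eye, the matching documents sit under `hits.hits[]._source`. A minimal sketch of pulling them out, using a trimmed copy of the response above (the helper name `hit_sources` is made up for this example):

```python
import json


def hit_sources(response_text):
    """Return the _source of every hit in a search response."""
    response = json.loads(response_text)
    return [hit["_source"] for hit in response["hits"]["hits"]]


# Trimmed copy of the search response shown above.
sample = """
{
  "hits": {
    "hits": [
      {"_id": "MWkKgzsMRgK", "_index": "gems", "_type": "test",
       "_score": 1.4054651,
       "_source": {"name": "pry", "platform": "ruby"}}
    ],
    "max_score": 1.4054651,
    "total": 1
  },
  "timed_out": false,
  "took": 2
}
"""

for source in hit_sources(sample):
    print(source["name"], source["platform"])
```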

Counting Documents


{
  "term": {"name": "pry"}
}


$ curl -XGET http://localhost:9200/gems/test/_count -d @body.json


{
    "_shards": {
        "failed": 0, 
        "successful": 6, 
        "total": 6
    }, 
    "count": 1
}

Update a Document


{
  "doc": {
   "platform": "macruby" 
  }
}


$ curl -XPOST http://localhost:9200/gems/test/grit-2.5.1/_update -d @body.json


{
  "ok":true,
  "_index":"gems",
  "_type":"test",
  "_id":"grit-2.5.1",
  "_version":4
}

The partial document is merged into the existing document using a simple recursive merge.
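To make "recursive merge" concrete, here is a small sketch of the idea: keys from the partial document overwrite keys in the stored one, except that nested objects are merged key by key. This is an illustration, not ElasticSearch's own implementation, and `recursive_merge` is a name chosen for this example.

```python
def recursive_merge(existing, partial):
    """Return a copy of `existing` with `partial` merged into it."""
    merged = dict(existing)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are objects: merge them key by key.
            merged[key] = recursive_merge(merged[key], value)
        else:
            # Scalars, lists and new keys simply replace the old value.
            merged[key] = value
    return merged


doc = {"name": "grit", "platform": "jruby", "has_rdoc": False}
updated = recursive_merge(doc, {"platform": "macruby"})
print(updated)
```

Only `platform` changes; the fields that the partial document does not mention are left as they were.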

Update via Script


{
    "script" : "ctx._source.platform = vm_name",
    "params" : {
        "vm_name" : "rubinius"
    }
}


$ curl -XPOST http://localhost:9200/gems/test/grit-2.5.1/_update -d @body.json


{
  "ok":true,
  "_index":"gems",
  "_type":"test",
  "_id":"grit-2.5.1",
  "_version":5
}

Delete Document


$ curl -XDELETE 'http://localhost:9200/gems/test/grit-2.5.1'


{
  "ok":true, 
  "found":true,
  "_index":"gems",
  "_type":"test",
  "_id":"grit-2.5.1",
  "_version":6
}

Put Mapping


{
  "gem" : {
    "properties" : {
      "name" :        {"type" : "string", "index": "not_analyzed"},
      "platform" :    {"type" : "string", "index": "not_analyzed"},
      "rubygems_version" : {"type" : "string", "index": "not_analyzed"},
      "description" : {"type" : "string", "store" : "yes"},
      "has_rdoc" :    {"type" : "boolean"}      
    }
  }
}


$ curl -XPUT 'http://localhost:9200/gems/gem/_mapping' -d @body.json


$ curl -XGET 'http://localhost:9200/gems/_mapping' | python -mjson.tool

Index Document with Mapping


{
  "name": "grit", 
  "platform": "ruby", 
  "rubygems_version": "2.5.1", 
  "description": "Ruby library for extracting information from a git repository.", 
  "email": "mojombo@github.com", 
  "has_rdoc": false,
  "homepage": "http://github.com/mojombo/grit"
}


$ curl -XPUT 'http://localhost:9200/gems/gem/grit-2.5.1' -d @body.json


{
  "ok":true,
  "_index":"gems",
  "_type":"gem",
  "_id":"grit-2.5.1",
  "_version":1
}

Matching documents


{
  "query": {
    "match" : {
        "description" : "git repository"
    }
  }
}


$ curl -XPOST http://localhost:9200/gems/gem/_search -d @body.json
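The difference between a `match` on an analyzed field and a `term` lookup on a `not_analyzed` field can be sketched with a toy analyzer. The real analysis is done by Lucene analyzers; the `analyze` function below is a crude stand-in (lowercase, split on non-alphanumerics) and all three function names are made up for this illustration.

```python
import re


def analyze(text):
    """Crude stand-in for an analyzer: lowercase and split into terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_match(field_value, query_text):
    """Analyzed field: a hit if any query term appears in the field (OR)."""
    doc_terms = set(analyze(field_value))
    return any(term in doc_terms for term in analyze(query_text))


def toy_term(field_value, term):
    """not_analyzed field: the whole value is one term, compared exactly."""
    return field_value == term


description = "Ruby library for extracting information from a git repository."
print(toy_match(description, "git repository"))
print(toy_term("grit", "grit"))
print(toy_term("grit", "Grit"))
```

The last call does not match because a `not_analyzed` value is never lowercased or split.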

Highlighting


{
  "query": {
    "match" : {
        "description" : "git repository"
    }
  },
  "highlight" : {
        "fields" : {
            "description" : {}
        }
    }
}


$ curl -XPOST http://localhost:9200/gems/gem/_search -d @body.json


"highlight": {
  "description": [
    "Ruby library for extracting information from a git repository."
  ]
}

Search Facets


{
  "query": { "match_all" : {} },
  "facets" : {
    "gem_names" : {
      "terms" : { "field": "name" }
    }
  }
}


$ curl -XPOST http://localhost:9200/gems/_search -d @body.json


...
  "facets": {
    "gem_names": {
      "_type": "terms", 
      "missing": 0, 
      "other": 0, 
      "terms": [
        {
          "count": 2, 
          "term": "pry"
        }, 
        {
          "count": 2, 
          "term": "grit"
        }, 
        {
          "count": 1, 
          "term": "abc"
        }
      ], 
      "total": 5
    }
  },
  "hits": {
    "hits": [
...
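Conceptually, a terms facet is a count of how often each value of a field occurs among the matching documents. The sketch below reproduces the shape of the response above over a handful of in-memory documents; the helper `terms_facet` and the sample list are assumptions for this example, and the real facet is computed per shard by ElasticSearch.

```python
from collections import Counter


def terms_facet(docs, field):
    """Count the values of `field` across docs, facet-style."""
    values = [doc[field] for doc in docs if field in doc]
    counts = Counter(values)
    return {
        "_type": "terms",
        "missing": len(docs) - len(values),  # docs without the field
        "total": len(values),
        "other": 0,  # nothing is cut off, since every term is returned
        "terms": [{"term": t, "count": c} for t, c in counts.most_common()],
    }


docs = [{"name": "pry"}, {"name": "pry"}, {"name": "grit"},
        {"name": "grit"}, {"name": "abc"}]
facet = terms_facet(docs, "name")
print(facet["total"], facet["terms"])
```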

(Lab)

Analyzing Aadhaar’s Datasets

Download Public Dataset

Download from Aadhaar Public Data Portal at https://data.uidai.gov.in

Download Tools

$ git clone https://github.com/gnurag/aadhaar

Prepare Data & Configure


# gem install yajl-ruby tire activesupport

$ git clone https://github.com/gnurag/aadhaar
$ cd aadhaar/data
$ unzip UIDAI-ENR-DETAIL-20121001.zip
$ cd ../bin
$ vi aadhaar.rb

Configuration


AADHAAR_DATA_DIR = "/path/to/aadhaar/data"
ES_URL           = "http://localhost:9200"
ES_INDEX         = 'aadhaar'
ES_TYPE          = "UID"
BATCH_SIZE       = 1000

Index

$ ruby aadhaar.rb
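The `BATCH_SIZE` setting suggests that records are indexed in batches rather than one request per record. The sketch below shows one way such batching could look with the bulk API's newline-delimited format. It is not taken from `aadhaar.rb`; the helper names and the `row` field are hypothetical.

```python
import json


def batches(records, batch_size):
    """Yield consecutive slices of at most `batch_size` records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def bulk_body(index, doc_type, docs):
    """Build a bulk payload: one action line, then one source line, per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    # The bulk API expects the payload to end with a newline.
    return "\n".join(lines) + "\n"


records = [{"row": n} for n in range(2500)]  # hypothetical sample records
sizes = [len(batch) for batch in batches(records, 1000)]
print(sizes)
print(bulk_body("aadhaar", "UID", records[:2]))
```

Each batch would then be POSTed to the `_bulk` endpoint in a single request.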

Running Examples

$ curl -XPOST http://localhost:9200/aadhaar/UID/_search -d @template.json | python -mjson.tool

Additional Notes

Index Aliases

Group multiple Indexes, and query them together.


curl -XPOST 'http://localhost:9200/_aliases' -d '
{
    "actions" : [
        { "add" : { "index" : "index1", "alias" : "master-alias" } },
        { "add" : { "index" : "index2", "alias" : "master-alias" } }
    ]
}'


curl -XPOST 'http://localhost:9200/_aliases' -d '
{
    "actions" : [
        { "remove" : { "index" : "index2", "alias" : "master-alias" } }
    ]
}'
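Hand-written action lists are easy to get subtly wrong, so it can help to generate the request body. A minimal sketch, assuming the same `add`/`remove` action format as the curl examples (the helper `alias_actions` is made up for this example):

```python
import json


def alias_actions(alias, add=(), remove=()):
    """Build the JSON body for the _aliases endpoint."""
    actions = [{"add": {"index": name, "alias": alias}} for name in add]
    actions += [{"remove": {"index": name, "alias": alias}} for name in remove]
    return json.dumps({"actions": actions}, indent=4)


print(alias_actions("master-alias", add=["index1", "index2"]))
print(alias_actions("master-alias", remove=["index2"]))
```

The output can be saved to a file and passed to curl with `-d @body.json`, as in the earlier examples.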

Document Routing

Control which Shard a document is placed on and queried from.
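The idea behind routing is that a routing value (by default the document ID) is hashed and taken modulo the number of primary shards. The sketch below uses CRC32 purely as a stand-in hash; ElasticSearch uses its own hash function, so the shard numbers here will not match a real cluster. The name `pick_shard` is hypothetical.

```python
import zlib


def pick_shard(routing_value, number_of_shards):
    """Map a routing value to a shard number (illustrative hash only)."""
    digest = zlib.crc32(routing_value.encode("utf-8"))  # stand-in hash
    return digest % number_of_shards


for doc_id in ["pry-0.5.9", "grit-2.5.1"]:
    print(doc_id, "-> shard", pick_shard(doc_id, 6))
```

Documents that share a routing value always land on the same shard, so a query that supplies that value only needs to visit one shard.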

Parents & Children


$ curl -XPUT 'http://localhost:9200/gems/gem/roxml?parent=rexml' -d '{
    "tag" : "something"
}'

Custom Analyzers

Boosting Search Results

ElasticSearch Ecosystem

A wide range of site plugins, analyzers, and river plugins is available from the community.


THE END