23.11.2014 Views

Search Engine - YaCy


Decentralised Web Search

Web Search Engine Software For Everyone:

We can remove dependency from (a) large search engine provider

FOSS ASIA - Ho Chi Minh City, Vietnam 2010

Web Search Engine For Everyone

Michael Christen

http://yacy.net


Topics

Tech / Demo / Dev

• Search Engine Technology
how large-scale search engines are made available for everyone using peer-to-peer technology

• Demonstration:
what you can do in just five minutes: installation, crawling, searching, monitoring, scheduling

• System Components and Development:
Details about search appliance components such as the scheduler, document parser, administration and visualization.
Easy integration into a web page.
APIs for external index queries and external index feeding components.



Search Engine Components

Retrieval, Indexing, Storage and Search Components

[Diagram labels: Crawler (Start-URL; Depth = 0, Depth = 1, Depth = 2; URL Crawl Stack; Double Link Check; filtering, parsing); Text Analysis (links; words; Stop words Check); Indexing (Reverse Word Index: Word, URL References); Search Interface (ranking, verification, visualisation)]
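The crawler part of the diagram (start URL, crawl depth, crawl stack, double-link check) can be sketched roughly as follows. This is only an illustration, not YaCy's code: the function name `crawl`, the `links` dictionary that stands in for fetching and parsing pages, and the example URLs are all assumptions made for the sketch.

```python
from collections import deque


def crawl(start_url, links, max_depth=2):
    """Breadth-first crawl over an in-memory link graph.

    `links` maps a URL to the URLs found on that page; it replaces
    real page loading and parsing in this sketch.
    """
    seen = {start_url}                # double-link check: queue each URL only once
    stack = deque([(start_url, 0)])   # crawl stack of (url, depth) pairs
    visited = []
    while stack:
        url, depth = stack.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue                  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                stack.append((target, depth + 1))
    return visited


# Hypothetical link graph for illustration.
graph = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://d.example/": ["http://e.example/"],
}
result = crawl("http://a.example/", graph, max_depth=2)
```

With `max_depth=2` the page `http://e.example/` is never visited, because it would sit at depth 3, and the back link from `b` to `a` is dropped by the double-link check.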

YaCy has an integrated NoSQL database. The database stores a Reverse Word Index, metadata and the source documents.
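To make the term "Reverse Word Index" concrete, here is a minimal sketch of one: a mapping from each word to the URLs that contain it, with a stop-word check before indexing. The stop-word list, the tokenizer and the function names are assumptions for illustration and say nothing about how YaCy's database is implemented.

```python
import re
from collections import defaultdict

# Illustrative stop-word list only.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "for"}


def tokenize(text):
    """Split text into lower-case alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build word -> set of URLs from a dict of url -> text."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:   # stop-word check
                index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every non-stop word of the query."""
    words = [w for w in tokenize(query) if w not in STOP_WORDS]
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    "http://a.example/": "YaCy is a peer-to-peer search engine",
    "http://b.example/": "A search engine for the intranet",
}
index = build_index(docs)
```

A query is answered by intersecting the URL sets of its words, so `search(index, "peer search")` only returns the first document.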



Large Search Cluster: Model

Efficient search engines are constructed using a matrix of many small search engines.

[Diagram: Search Engine Cluster, drawn as a grid of Search Engine nodes. Vertical scaling: more queries per second. Horizontal scaling: more documents.]
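One common way to read such a matrix is sketched below, under the assumption that each column holds a disjoint part of the documents and each row is a full copy that can answer queries on its own. Adding columns then makes room for more documents, and adding rows allows more queries per second. The class `SearchMatrix`, the CRC32 column assignment and the round-robin row selection are choices made for this sketch, not details taken from the slide.

```python
import zlib


class SearchMatrix:
    """A rows x columns grid of tiny search engines (word -> set of URLs)."""

    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = columns
        self.cells = [[{} for _ in range(columns)] for _ in range(rows)]
        self._next_row = 0

    def column_for(self, url):
        # Assumed partitioning rule: hash of the URL picks the column.
        return zlib.crc32(url.encode("utf-8")) % self.columns

    def add_document(self, url, words):
        column = self.column_for(url)
        for row in range(self.rows):          # every row stores a copy
            cell = self.cells[row][column]
            for word in words:
                cell.setdefault(word, set()).add(url)

    def query(self, word):
        row = self._next_row                  # round-robin over the rows
        self._next_row = (self._next_row + 1) % self.rows
        hits = set()
        for column in range(self.columns):    # ask every column of that row
            hits |= self.cells[row][column].get(word, set())
        return row, hits


matrix = SearchMatrix(rows=2, columns=3)
matrix.add_document("http://a.example/", ["search", "engine"])
matrix.add_document("http://b.example/", ["search", "peer"])
```

Two consecutive calls to `matrix.query("search")` are served by different rows but return the same set of hits.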



Large Search Cluster in Data Center

Usually such search engine clusters are hosted by one organization in a data center.

[Diagram: a grid of Search Engine nodes.]



Large Search Cluster: Decentralised

Imagine you could take the software out of the data center and connect the peers in a decentralised way.

[Diagram: the Search Engine nodes spread out as separate peers.]



Decentralised Search with YaCy 1/3

YaCy is a search engine appliance that can be used either in a data center or as a decentralised network of private peers.

[Diagram: a grid of peers.]

How can a search matrix be distributed?

The peers are ordered by their peer hashes. The hash ordering is closed at the end, so the last peer is followed by the first and the resulting network can be drawn as a circle...
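A closed ordering on peer hashes can be sketched as a sorted list with wrap-around lookup. The hash function (SHA-256 truncated to 64 bits), the class name `PeerRing` and the peer names are assumptions for this sketch; the slide does not say which hash YaCy uses.

```python
import hashlib
from bisect import bisect_left


def ring_position(name):
    """Map a name to a position on the ring (SHA-256 is an assumption here)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class PeerRing:
    def __init__(self, peer_names):
        # Peers sorted by hash form the ordering.
        self.peers = sorted((ring_position(n), n) for n in peer_names)
        self.positions = [position for position, _ in self.peers]

    def successor(self, position):
        """First peer at or after `position`, wrapping to the start of the list."""
        i = bisect_left(self.positions, position)
        return self.peers[i % len(self.peers)][1]   # closing the order makes a circle

    def responsible_peer(self, key):
        return self.successor(ring_position(key))


ring = PeerRing(["peer-a", "peer-b", "peer-c", "peer-d", "peer-e"])
owner = ring.responsible_peer("foaf")
```

The modulo in `successor` is what closes the ordering: a position beyond the last peer is assigned to the first one.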



Decentralised Search with YaCy 2/3

A 'Folded' Search Matrix

[Diagram: peers arranged in a circle, with two annotations.
DHT-Store: This peer (as an example) fetches some Web pages and distributes index fragments to other peers.
DHT-Read: A peer which searches for information can directly access the peers holding the corresponding index.]

YaCy peers store index fragments according to a 'folded' ordering on word hashes and URL hashes in a distributed hash table (DHT). The index is distributed redundantly so that it is preserved when some peers are not available. The redundancy also helps to increase search performance.
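The effect of redundant storage can be illustrated with a small in-memory DHT. This sketch places each word on several consecutive peers of a hash ring and reads from the first one that is reachable. It does not reproduce YaCy's 'folded' ordering on word and URL hashes; the hash function, the redundancy factor of 3 and all names are assumptions.

```python
import hashlib
from bisect import bisect_left


def h(text):
    """Position on the ring (truncated SHA-256, chosen for the sketch)."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


class Dht:
    def __init__(self, peer_names, redundancy=3):
        self.ring = sorted((h(name), name) for name in peer_names)
        self.redundancy = min(redundancy, len(self.ring))
        self.storage = {name: {} for name in peer_names}  # peer -> {word: URLs}
        self.offline = set()

    def targets(self, word):
        """The `redundancy` consecutive peers responsible for a word."""
        positions = [position for position, _ in self.ring]
        start = bisect_left(positions, h(word))
        return [self.ring[(start + k) % len(self.ring)][1]
                for k in range(self.redundancy)]

    def store(self, word, url):          # DHT-Store
        for peer in self.targets(word):
            self.storage[peer].setdefault(word, set()).add(url)

    def read(self, word):                # DHT-Read
        for peer in self.targets(word):
            if peer not in self.offline:
                return self.storage[peer].get(word, set())
        return set()                     # every copy is unreachable


dht = Dht(["peer-a", "peer-b", "peer-c", "peer-d", "peer-e"], redundancy=3)
dht.store("foaf", "http://a.example/foaf.rdf")
```

As long as at least one of the three target peers is online, `dht.read("foaf")` still returns the stored URL.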



Decentralised Search with YaCy 3/3

The 'default' YaCy Search Engine Network

Peer Types:
Junior: behind firewall or router
Senior: has open server port
Principal: publishes seed-lists

[Diagram: network of peers with DHT-Store and DHT-Read connections.]



YaCy Search Cluster in a Data Center

'Sciencenet': Search Engine for scientific content in the Karlsruhe Institute of Technology: 34 computers running YaCy in its own network

http://sciencenet.fzk.de

300 million documents



Decentralised Search for Everyone

Search Engine @Home

> 1 Billion Documents

People run their own YaCy search peer at home and create independent search for everyone



Benefits

Impact of running your own search engine:

• become independent
from large search engine operators

• keep company secrets
search tracks can reveal industrial research targets

• your personal relevance
you can create a ranking method for your personal needs

• same rights for all people
everyone can run a search engine



Demo: Users

linuxtag.org

linux-club.de

geoclub.de

fsfe.org



Use Cases

• Decentralised Peer-to-Peer Web Search
search engines for everyone

• High-Performance Search Clusters
generic search portals for any need

• Internet Search Portal for a project
combining wikis, blogs, forums and portal pages

• Alert-Service for News using RSS
create a News-Feed using recent search results for a specific topic

• Intranet Search Appliance
search in local web servers and file shares



Search Interface

SRU: the API for search results is RSS (Opensearch) and JSON

Facets: Domains, Authors

Every link is verified before it is displayed: the content is loaded, parsed and used to generate a search snippet.

Standards, APIs, Tools:
Opensearch (search results with RSS), JSON, AJAX tools
search widget, ready-to-use code snippets to embed search everywhere



Document Retrieval and Parser

A Search Engine should support people in the search for documents in unstructured formats: this needs a kind of 'understanding' of content.

Connection
load and crawl from: HTTP, HTTPS, FTP, filesystem, SMB-shares
Import from: Dublin Core / XML files, OAI-PMH, Wikimedia dumps, SQL databases

Parsing
read document formats: HTML, XHTML, RSS, RDF, XHTML+RDFa, FOAF, vCard, Flash, PDF, PS, Word, Excel, Visio, PowerPoint, OpenOffice, RTF, csv, gzip, zip, tar, rar, bzip2, 7zip, images (EXIF), torrent files

Interpretation
find metadata (headline, author, date, locations)
find links of different kinds (text, images, movies etc.)
store statistical data for search suggestions



Search Result Ranking

a prototype discussion about ranking

[Speech bubbles:]

"do you use the same ranking as G**gle?"

"no. PR is difficult and sometimes useless (e.g. in intranets)"

"then you cannot be better?"

"that's what lucene has"

"similar to G**gle PR"

"in YaCy, you can combine many weighted attributes"

"we have many ranking criteria and users can mix them."

"but is this better?"

"what is 'better'? G**gle defines 'better' as: 'most people like it'"

"I have an idea: ...."

"suddenly people think about their personal relevance requirements....."

"then what is the best ranking?"

"do experiments! If you run your own search engine, then you may need your own ranking. Different contents may need different rankings."

"every peer?"

"when doing a remote search, the remote peer uses your own ranking too!"
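Combining many weighted attributes can be sketched as a weighted sum over per-result attribute values. The attribute names (`word_in_title`, `recency`, `inbound_links`), the values and the two weight profiles below are invented for illustration; they are not YaCy's ranking criteria.

```python
def score(attributes, weights):
    """Weighted sum of the attributes; attributes without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in attributes.items())


def rank(results, weights):
    """Order results by descending score under the given weights."""
    return sorted(results, key=lambda r: score(r["attributes"], weights), reverse=True)


# Hypothetical results with attribute values normalised to 0..1.
results = [
    {"url": "http://a.example/",
     "attributes": {"word_in_title": 1.0, "recency": 0.2, "inbound_links": 0.9}},
    {"url": "http://b.example/",
     "attributes": {"word_in_title": 0.0, "recency": 0.9, "inbound_links": 0.1}},
]

# Two users mixing the same criteria differently.
news_weights = {"recency": 3.0, "word_in_title": 1.0}
reference_weights = {"word_in_title": 2.0, "inbound_links": 1.0}

news_order = [r["url"] for r in rank(results, news_weights)]
reference_order = [r["url"] for r in rank(results, reference_weights)]
```

The same two results come out in opposite order under the two profiles. Because the weights are plain data, a profile like this could in principle be sent along with a query so that a remote peer ranks with the asker's preferences.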



Parts of a Search Appliance

Search Engine
retrieval, indexing, storage and search components

Data Visualisation
index creation process, system load, link structure, p2p net configuration

Scheduler and Steering
automatic scheduled re-indexing and back-up of search appliance set-up

Database Administration
crawl queues, robots.txt, rss feeds, scheduler data, p2p connections, network messages



Search Interface Integration

How to integrate a YaCy Search Portal:
Just copy and paste the code snippet into your web page source code.

Code Snippet Example #1: a search window in an iframe


External Index Retrieval

> curl "http://localhost:8080/yacysearch.rss?query=foaf&maximumRecords=10"
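The same query can be issued from a program. The sketch below builds the URL shown in the curl call and extracts title and link from each result. It assumes that the response is plain RSS 2.0 with un-namespaced `item`, `title` and `link` elements and that a peer is listening on `localhost:8080`; the function names are made up for the example.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def search_url(query, maximum_records=10, base="http://localhost:8080"):
    """Build the RSS search URL with properly encoded parameters."""
    params = urllib.parse.urlencode({"query": query, "maximumRecords": maximum_records})
    return f"{base}/yacysearch.rss?{params}"


def parse_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS document."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link")) for item in root.iter("item")]


def search(query, maximum_records=10):
    """Query the local peer; needs a running peer, so it is not called here."""
    with urllib.request.urlopen(search_url(query, maximum_records)) as response:
        return parse_items(response.read())


# Hand-written sample response, used instead of a live peer.
sample = """<rss version="2.0"><channel>
<title>Search for foaf</title>
<item><title>FOAF project</title><link>http://a.example/foaf</link></item>
<item><title>FOAF spec</title><link>http://b.example/spec</link></item>
</channel></rss>"""
items = parse_items(sample)
```

Keeping URL construction and parsing separate from the network call makes both parts usable without a running peer.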


External Index Feeding

[XML listing; surviving field values:]
http://de.wikipedia.org/wiki/Alan_Smithee
de
2009-04-14T00:00:00Z

Standards:
YaCy can import standard Dublin Core Metadata XML files as input for indexing

How to import Dublin Core Files:
just place the xml files into a hand-over directory at DATA/SURROGATES/in/

The Dublin Core XML File Standard:
http://dublincore.org/documents/dc-xml-guidelines/
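A Dublin Core record could be generated along the following lines. The XML on the slide did not survive, so this sketch relies on several assumptions: the three surviving values are mapped to the standard Dublin Core elements `identifier`, `language` and `date`, and the wrapper elements `records` and `record` are placeholders, so the layout YaCy expects should be checked against its documentation. The example writes to a temporary directory that stands in for DATA/SURROGATES/in/.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"   # standard Dublin Core element namespace
ET.register_namespace("dc", DC)


def dublin_core_record(identifier, language, date, title=None):
    """Build one record; the <record> wrapper name is a placeholder."""
    record = ET.Element("record")
    fields = (("title", title), ("identifier", identifier),
              ("language", language), ("date", date))
    for name, value in fields:
        if value is not None:
            ET.SubElement(record, f"{{{DC}}}{name}").text = value
    return record


def write_records(records, directory, filename="import.xml"):
    """Write the records into one XML file and return its path."""
    root = ET.Element("records")          # placeholder root, not taken from the slide
    root.extend(records)
    path = os.path.join(directory, filename)
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    return path


out_dir = tempfile.mkdtemp()              # stand-in for DATA/SURROGATES/in/
path = write_records(
    [dublin_core_record("http://de.wikipedia.org/wiki/Alan_Smithee",
                        "de", "2009-04-14T00:00:00Z")],
    out_dir,
)
```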



Installation

License: GPL, Free Software

• Download from http://yacy.net
YaCy for Windows, YaCy for Mac, YaCy for Debian, YaCy for Linux / generic (tar.gz)

• Just extract the package, then run the start script
There are simple installers for Windows and Mac and a Debian release, but it is easy to just install the generic release because it contains everything that is needed.

• Administration using the Web Interface
YaCy is a web application. The administration can be done completely using the built-in web interface with your web browser. Just open http://localhost:8080
The main configuration takes just two clicks: you select your use case (Distributed P2P Web Search, Portal Search, Intranet Search).

• Support
We have a web forum: http://forum.yacy.de
Some information can be found at the wiki: http://wiki.yacy.de
...or contact me: mc@yacy.net



what you can do

• learn about search engine technology and teach other people

• create your own search portal

• be creative! -- we listen to your ideas

• help -- make a translation of the administration interface!
