Decentralised Web <strong>Search</strong><br />
Web <strong>Search</strong> <strong>Engine</strong> Software For Everyone:<br />
We can remove the dependency<br />
on a large search engine provider<br />
FOSS ASIA - Ho Chi Minh City, Vietnam 2010<br />
Web <strong>Search</strong> <strong>Engine</strong> For Everyone<br />
Michael Christen<br />
http://yacy.net
Topics<br />
Tech<br />
Demo<br />
Dev<br />
•<strong>Search</strong> <strong>Engine</strong> Technology<br />
how large-scale search engines are made available for everyone<br />
using peer-to-peer technology<br />
•Demonstration:<br />
what you can do in just five minutes:<br />
installation, crawling, searching, monitoring, scheduling<br />
•System Components and Development:<br />
Details about search appliance components such as the scheduler, document<br />
parser, administration and visualization.<br />
Easy integration into a web page.<br />
APIs for external index queries and external index feeding components.<br />
<strong>Search</strong> <strong>Engine</strong> Components<br />
Retrieval, Indexing, Storage and <strong>Search</strong> Components<br />
[Diagram: data flow through the search engine components]<br />
Crawler: begins at the Start-URL (Depth = 0) and follows links to Depth = 1 and Depth = 2; new URLs pass a Double Link Check before they enter the URL Crawl Stack<br />
Filtering, parsing: extracts links and words from each loaded document<br />
Text Analysis: words pass a Stop words Check<br />
Indexing: builds the Reverse Word Index (each Word points to its URL References)<br />
<strong>Search</strong> Interface: ranking, verification, visualisation<br />
Database: <strong>YaCy</strong> has an integrated NoSQL database. The database stores a Reverse Word Index, Metadata and the source documents.<br />
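The component chain above can be illustrated with a minimal sketch. It is not YaCy's implementation: the in-memory `PAGES` table stands in for real HTTP loading, and the names `crawl_and_index`, `search` and `STOP_WORDS` are made up for this example. It shows a depth-limited crawl with a double link check, a stop word check, and a reverse word index that maps each word to its URL references.

```python
import re
from collections import defaultdict, deque

# Hypothetical stand-in for the web: URL -> (page text, outgoing links).
PAGES = {
    "http://example.org/": ("the search engine for everyone",
                            ["http://example.org/a", "http://example.org/b"]),
    "http://example.org/a": ("peer to peer search",
                             ["http://example.org/", "http://example.org/b"]),
    "http://example.org/b": ("the index of words", ["http://example.org/c"]),
    "http://example.org/c": ("too deep to reach", []),
}
STOP_WORDS = {"the", "for", "of", "to"}  # assumed stop word list


def crawl_and_index(start_url, max_depth):
    """Crawl from start_url down to max_depth and build a reverse word index."""
    index = defaultdict(set)           # reverse word index: word -> set of URLs
    seen = {start_url}                 # double link check: URLs already stacked
    stack = deque([(start_url, 0)])    # crawl stack of (url, depth)
    while stack:
        url, depth = stack.popleft()
        if url not in PAGES:
            continue
        text, links = PAGES[url]
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOP_WORDS:  # stop word check
                index[word].add(url)
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    stack.append((link, depth + 1))
    return index


def search(index, word):
    """Return the URL references stored for one word, sorted for stable output."""
    return sorted(index.get(word.lower(), ()))
```

Calling `crawl_and_index("http://example.org/", 2)` mirrors the three depth levels on the slide; a smaller `max_depth` leaves the deeper pages out of the index.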
Large <strong>Search</strong> Cluster: Model<br />
Efficient search engines are constructed using a matrix of many small search engines.<br />
[Diagram: <strong>Search</strong> <strong>Engine</strong> Cluster, a matrix of 15 <strong>Search</strong> <strong>Engine</strong> nodes]<br />
vertical scaling (along the vertical axis of the matrix): more queries per second<br />
horizontal scaling (along the horizontal axis of the matrix): more documents<br />
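A small sketch may help to read the matrix model. It assumes that the columns of the matrix hold disjoint parts of the document set and that the rows are identical copies of those parts; the slide does not spell this out, and the class name `SearchMatrix` and the SHA-1 based partitioning are illustrative choices only.

```python
import hashlib
from itertools import count


class SearchMatrix:
    """Matrix of small search engines: rows are copies, columns are partitions."""

    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = columns
        # cells[row][column] is one small search engine: word -> set of doc ids
        self.cells = [[{} for _ in range(columns)] for _ in range(rows)]
        self._query_counter = count()

    def _column_for(self, doc_id):
        # Assumed partitioning rule: hash of the document id modulo column count.
        digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % self.columns

    def add_document(self, doc_id, text):
        """Store the document in one column, once per row."""
        column = self._column_for(doc_id)
        for row in range(self.rows):
            cell = self.cells[row][column]
            for word in text.lower().split():
                cell.setdefault(word, set()).add(doc_id)

    def search(self, word):
        """Pick one row in round-robin order and ask every column in that row."""
        row = next(self._query_counter) % self.rows
        hits = set()
        for cell in self.cells[row]:
            hits |= cell.get(word.lower(), set())
        return row, sorted(hits)
```

In this sketch, adding columns lets the cluster hold more documents, while adding rows spreads queries over more machines.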
Large <strong>Search</strong> Cluster in Data Center<br />
Usually such search engine clusters are hosted by one organization in a data center.<br />
[Diagram: matrix of 15 <strong>Search</strong> <strong>Engine</strong> nodes in a data center]<br />
Large <strong>Search</strong> Cluster: Decentralised<br />
Imagine you could take the software out of the data center<br />
and connect the peers in a decentralised way.<br />
[Diagram: the 15 <strong>Search</strong> <strong>Engine</strong> nodes as decentralised peers]<br />
Decentralised <strong>Search</strong> with <strong>YaCy</strong> 1/3<br />
<strong>YaCy</strong> is a search engine appliance that can be used either<br />
in a data center or as a decentralised network of private peers<br />
Peer Peer Peer Peer Peer<br />
Peer Peer Peer Peer Peer<br />
Peer Peer Peer Peer Peer<br />
How can a search matrix be distributed?<br />
The peers are ordered by their peer hashes. The hash ordering wraps around at its<br />
end, so the resulting network can be drawn as a circle...<br />
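The circular ordering can be sketched as follows. The slide does not say which hash function YaCy uses for peer hashes, so SHA-1 truncated to 32 bits is an assumption, and `PeerRing` and `ring_position` are names invented for this example.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a name to a position on the ring (assumption: SHA-1, first 32 bits)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class PeerRing:
    """Peers sorted by hash; lookups wrap around at the end of the ordering."""

    def __init__(self, peer_names):
        self._ring = sorted((ring_position(name), name) for name in peer_names)
        self._positions = [position for position, _ in self._ring]

    def ordered_peers(self):
        """Peer names in hash order."""
        return [name for _, name in self._ring]

    def successor(self, position):
        """First peer at or after the position; past the end, wrap to the start."""
        i = bisect.bisect_left(self._positions, position) % len(self._ring)
        return self._ring[i][1]
```

The modulo in `successor` is what closes the ordering into a circle: a position beyond the last peer is served by the first peer.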
Decentralised <strong>Search</strong> with <strong>YaCy</strong> 2/3<br />
A 'folded' <strong>Search</strong> Matrix<br />
[Diagram: peers of the folded search matrix with DHT-Store and DHT-Read arrows]<br />
DHT-Store: this peer (as an example) fetches some web pages and distributes index fragments to other peers.<br />
DHT-Read: a peer that searches for information can directly access the peers holding the corresponding index.<br />
<strong>YaCy</strong> peers store index fragments according to a 'folded' ordering on word hashes and URL hashes in a<br />
distributed hash table (DHT). The index is distributed redundantly so that it stays available when some peers are<br />
offline. The redundancy also helps to increase search performance.<br />
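The store and read paths with redundancy might look like the sketch below. It does not model the 'folded' ordering or the URL hashes, only a plain ring of peers keyed by word hash. The redundancy factor of 3, the hash function and all names (`Peer`, `target_peers`, `dht_store`, `dht_read`) are assumptions for illustration.

```python
import hashlib

REDUNDANCY = 3  # assumed number of peers that receive each index fragment


def ring_hash(value):
    """Assumed hash: first 64 bits of SHA-1."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


class Peer:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.index = {}  # word -> set of URLs held by this peer


def target_peers(peers, word, redundancy=REDUNDANCY):
    """Peers responsible for a word: its successor on the ring and the next ones."""
    ring = sorted(peers, key=lambda peer: ring_hash(peer.name))
    word_hash = ring_hash(word)
    start = next((i for i, peer in enumerate(ring)
                  if ring_hash(peer.name) >= word_hash), 0)
    copies = min(redundancy, len(ring))
    return [ring[(start + k) % len(ring)] for k in range(copies)]


def dht_store(peers, word, url):
    """DHT-Store: hand the index entry to every responsible peer."""
    for peer in target_peers(peers, word):
        peer.index.setdefault(word, set()).add(url)


def dht_read(peers, word):
    """DHT-Read: ask responsible peers in order until one is online."""
    for peer in target_peers(peers, word):
        if peer.online:
            return peer.name, sorted(peer.index.get(word, ()))
    return None, []
```

With three copies, a read still succeeds after one or two of the responsible peers go offline, which is the availability argument made on the slide.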
Decentralised <strong>Search</strong> with <strong>YaCy</strong> 3/3<br />
The 'default' <strong>YaCy</strong> <strong>Search</strong> <strong>Engine</strong> Network<br />
Peer Types:<br />
Junior: behind firewall or router<br />
Senior: has open server port<br />
Principal: publishes seed-lists<br />
[Diagram: network graph with DHT-Store and DHT-Read connections]<br />
<strong>YaCy</strong> <strong>Search</strong> Cluster in a Data Center<br />
http://sciencenet.fzk.de<br />
300 million documents<br />
'Sciencenet': <strong>Search</strong> <strong>Engine</strong> for scientific content at<br />
the Karlsruhe Institute of Technology:<br />
34 computers running <strong>YaCy</strong> in its own network<br />
Decentralised <strong>Search</strong> for Everyone<br />
<strong>Search</strong> <strong>Engine</strong> @Home<br />
> 1 Billion Documents<br />
People run their own <strong>YaCy</strong> search peer at home<br />
and create independent search for everyone<br />
Benefits<br />
Impact of running your own search engine:<br />
•become independent<br />
from large search engine operators<br />
•keep company secrets<br />
search tracks can reveal industrial research targets<br />
•your personal relevance<br />
you can create a ranking method for your personal needs<br />
•same rights for all people<br />
everyone can run a search engine<br />
Demo: Users<br />
linuxtag.org<br />
linux-club.de<br />
geoclub.de<br />
fsfe.org<br />
Use Cases<br />
•Decentralised Peer-to-Peer Web <strong>Search</strong><br />
search engines for everyone<br />
•High-Performance <strong>Search</strong> Clusters<br />
generic search portals for any need<br />
•Internet <strong>Search</strong> Portal for a project<br />
combining wikis, blogs, forums and portal pages<br />
•Alert-Service for News using RSS<br />
create a News-Feed using recent search results for<br />
a specific topic<br />
•Intranet <strong>Search</strong> Appliance<br />
search in local web servers and file shares<br />
<strong>Search</strong> Interface<br />
SRU: the API delivers search results as RSS (Opensearch) and JSON<br />
Facets: Domains, Authors<br />
Verification: every link is verified before it is displayed. The content is loaded, parsed and used to<br />
generate the search snippet.<br />
Standards, APIs, Tools:<br />
Opensearch (search results with RSS), JSON, AJAX tools<br />
search widget, ready-to-use code snippets to embed search everywhere<br />
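The verification step can be pictured with a short sketch. It assumes that the document text has already been loaded and parsed, and that a result is only shown when every query word still occurs in that text; the function name `make_snippet` and the 60 character window are made up for the example and say nothing about how YaCy builds its snippets.

```python
def make_snippet(text, query, width=60):
    """Return a text window around the query words, or None if verification fails."""
    words = query.lower().split()
    lowered = text.lower()
    positions = [lowered.find(word) for word in words]
    if not words or min(positions) < 0:
        return None  # a query word is missing: do not display this link
    start = max(0, min(positions) - width // 2)
    end = min(len(text), start + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end].strip() + suffix
```

A caller would drop every search hit for which `make_snippet` returns `None`, so stale index entries never reach the result page.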
Document Retrieval and Parser<br />
A <strong>Search</strong> <strong>Engine</strong> should support people in the search for documents in unstructured<br />
formats: this needs a kind of 'understanding' of content<br />
Connection<br />
load and crawl from: HTTP, HTTPS, FTP, filesystem, SMB-shares<br />
import from: Dublin Core / XML files, OAI-PMH, Wikimedia dumps, SQL databases<br />
Parsing<br />
read document formats: HTML, XHTML, RSS, RDF, XHTML+RDFa, FOAF, vCard, Flash, PDF, PS, Word, Excel, Visio, PowerPoint, OpenOffice, RTF, csv, gzip, zip, tar, rar, bzip2, 7zip, images (EXIF), torrent files<br />
Interpretation<br />
find metadata (headline, author, date, locations)<br />
find links of different kinds (text, images, movies etc.)<br />
store statistical data for search suggestions<br />
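For the interpretation step, a minimal sketch for HTML documents is shown here. It uses Python's standard `html.parser` and only looks at `<title>`, `<meta name="author">`, `<meta name="date">`, `<a href>` and `<img src>`; the class name `DocumentInterpreter` and the sample page are invented, and a real parser would cover many more formats and fields.

```python
from html.parser import HTMLParser

# Made-up sample document for the usage example.
SAMPLE_HTML = """<html><head><title>Alan Smithee</title>
<meta name="author" content="Jane Doe">
<meta name="date" content="2009-04-14">
</head><body><a href="http://yacy.net">YaCy</a><img src="logo.png"/></body></html>"""


class DocumentInterpreter(HTMLParser):
    """Collects simple metadata and links of different kinds from HTML."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self.links = {"text": [], "image": []}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attributes.get("name")
            if name in ("author", "date") and attributes.get("content"):
                self.metadata[name] = attributes["content"]
        elif tag == "a" and attributes.get("href"):
            self.links["text"].append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.links["image"].append(attributes["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # The title may arrive in several chunks, so append.
            self.metadata["headline"] = self.metadata.get("headline", "") + data


def interpret(html_text):
    """Parse one HTML document and return (metadata, links)."""
    parser = DocumentInterpreter()
    parser.feed(html_text)
    parser.close()
    return parser.metadata, parser.links
```

`interpret(SAMPLE_HTML)` returns the headline, author and date as metadata and separates text links from image links.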
<strong>Search</strong> Result Ranking<br />
a prototype discussion about ranking<br />
do you use the same ranking as G**gle?<br />
no. PR (PageRank) is difficult and sometimes useless (e.g. in intranets)<br />
then you cannot be better?<br />
we have many ranking criteria and users can mix them.<br />
[Screenshot callouts: "that's what lucene has", "similar to G**gle PR", "in <strong>YaCy</strong>, you can combine many weighted attributes"]<br />
but is this better?<br />
what is 'better'? G**gle defines 'better' as: 'most people like it'<br />
I have an idea: ....<br />
suddenly people think about their personal relevance requirements.....<br />
then what is the best ranking?<br />
do experiments! If you run your own search engine, then you may need your own ranking. Different contents may need different rankings.<br />
every peer?<br />
when doing a remote search, the remote peer uses your own ranking too!<br />
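The idea of combining many weighted attributes can be sketched as a weighted sum. The attribute names (`term_frequency`, `link_popularity`, `recency`) and the two weight profiles are hypothetical; they are not YaCy's ranking attributes.

```python
def score(attributes, weights):
    """Weighted sum of the attribute values; attributes without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in attributes.items())


def rank(results, weights):
    """Order search results by descending score under one weight profile."""
    return sorted(results, key=lambda result: score(result["attributes"], weights),
                  reverse=True)


# Hypothetical results with attribute values normalised to the range 0..1.
RESULTS = [
    {"url": "http://example.org/a",
     "attributes": {"term_frequency": 0.9, "link_popularity": 0.1, "recency": 0.2}},
    {"url": "http://example.org/b",
     "attributes": {"term_frequency": 0.3, "link_popularity": 0.9, "recency": 0.5}},
]

# Two made-up profiles: one for an intranet, one closer to a link-based ranking.
INTRANET_PROFILE = {"term_frequency": 1.0, "recency": 0.5}
LINK_PROFILE = {"link_popularity": 1.0}
```

Ranking `RESULTS` with `INTRANET_PROFILE` puts the first URL on top, while `LINK_PROFILE` prefers the second, which is the point of the dialogue: different contents and users may need different rankings.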
Parts of a <strong>Search</strong> Appliance<br />
<strong>Search</strong> <strong>Engine</strong>: retrieval, indexing, storage and search components<br />
Data Visualisation: index creation process, system load, link structure, p2p net configuration<br />
Scheduler and Steering: automatic scheduled re-indexing and back-up of search appliance set-up<br />
Database Administration: crawl queues, robots.txt, rss feeds, scheduler data, p2p connections, network messages<br />
<strong>Search</strong> Interface Integration<br />
How to integrate a<br />
<strong>YaCy</strong> <strong>Search</strong> Portal:<br />
Just copy and paste the code snippet<br />
into your web page source code.<br />
Code Snippet Example #1: a search window in an iframe<br />
External Index Retrieval<br />
> curl "http://localhost:8080/yacysearch.rss?query=foaf&maximumRecords=10"<br />
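The same request can be made from a program. The sketch below uses only the endpoint and the two parameters from the curl call above; it assumes that the response is an RSS 2.0 document with `item`, `title` and `link` elements, as the OpenSearch-with-RSS format suggests. The function names and the sample feed are invented for the example.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Made-up RSS fragment used to show the parsing step without a running peer.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Search for foaf</title>
<item><title>FOAF project</title><link>http://example.org/foaf</link></item>
<item><title>FOAF vocabulary</title><link>http://example.org/spec</link></item>
</channel></rss>"""


def build_search_url(query, maximum_records=10,
                     base="http://localhost:8080/yacysearch.rss"):
    """Build the query URL with properly encoded parameters."""
    parameters = urllib.parse.urlencode(
        {"query": query, "maximumRecords": maximum_records})
    return base + "?" + parameters


def parse_rss_items(rss_text):
    """Extract title and link of every item in an RSS document."""
    root = ET.fromstring(rss_text)
    return [
        {"title": item.findtext("title", default=""),
         "link": item.findtext("link", default="")}
        for item in root.iter("item")
    ]


def search(query, maximum_records=10):
    """Query a locally running peer; needs a peer listening on localhost:8080."""
    url = build_search_url(query, maximum_records)
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_rss_items(response.read())
```

`parse_rss_items(SAMPLE_RSS)` can be tried without a peer; `search("foaf")` performs the live request.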
External Index Feeding<br />
[XML example record with the values http://de.wikipedia.org/wiki/Alan_Smithee, de and 2009-04-14T00:00:00Z]<br />
Standards:<br />
<strong>YaCy</strong> can import standard Dublin Core Metadata XML files as input for indexing<br />
How to import Dublin Core Files:<br />
just place the XML files into the hand-over directory at DATA/SURROGATES/in/<br />
The Dublin Core XML File Standard:<br />
http://dublincore.org/documents/dc-xml-guidelines/<br />
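An external feeder could write such a file with a few lines of code. In the sketch below only the Dublin Core namespace URI and the hand-over directory come from established sources; the wrapper elements `surrogates` and `record`, the lowercase element names and the file name are assumptions, so the exact layout should be checked against the guidelines linked above and the YaCy documentation before use.

```python
import os
import xml.etree.ElementTree as ET

DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set
ET.register_namespace("dc", DC_NAMESPACE)


def write_dublin_core_file(directory, filename, records):
    """Write records (dicts of DC element name -> value) as one XML file."""
    root = ET.Element("surrogates")           # assumed wrapper element
    for record in records:
        node = ET.SubElement(root, "record")  # assumed record element
        for element_name, value in record.items():
            child = ET.SubElement(node, "{%s}%s" % (DC_NAMESPACE, element_name))
            child.text = value
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    return path


# Values taken from the example record on the slide.
EXAMPLE_RECORD = {
    "identifier": "http://de.wikipedia.org/wiki/Alan_Smithee",
    "language": "de",
    "date": "2009-04-14T00:00:00Z",
}
```

Called as `write_dublin_core_file("DATA/SURROGATES/in", "example.xml", [EXAMPLE_RECORD])` from the YaCy installation directory, this would place one file into the hand-over directory.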
Installation<br />
License: GPL<br />
Free Software<br />
•Download from http://yacy.net<br />
<strong>YaCy</strong> for Windows <strong>YaCy</strong> for Mac <strong>YaCy</strong> for Debian <strong>YaCy</strong> for Linux / generic (tar.gz)<br />
•Just extract the package, then run the start script<br />
There are simple installers for Windows and Mac and a Debian release, but it is easy<br />
to install the generic release because it contains everything that is needed.<br />
•Administration using the Web Interface<br />
<strong>YaCy</strong> is a Web Application. The administration can be done completely using the built-in<br />
web interface with your web browser. Just open http://localhost:8080<br />
The main configuration takes just two clicks: you select your use case (Distributed P2P Web<br />
<strong>Search</strong>, Portal <strong>Search</strong>, Intranet <strong>Search</strong>).<br />
•Support<br />
We have a web forum: http://forum.yacy.de<br />
Some information can be found at the wiki: http://wiki.yacy.de<br />
...or contact me: mc@yacy.net<br />
what you can do<br />
• learn about search engine technology and<br />
teach other people<br />
• create your own search portal<br />
• be creative! -- we listen to your ideas<br />
• help -- make a translation of the<br />
administration interface!<br />