Thoughts on DIY Search Engines

Prelude

More and more often, I read that people are looking for ways to
build their own search engines.

Even if more and more “advanced” features are integrated into search
engines, they are mostly based on human grunt work. A semantic search
engine, whatever “semantic” means to you, is in fact merely a couple,
not more than a dozen, of tricks. I like to say that much of Google’s
search engine is good old human labor. If you still doubt it, here it
is again: Google results are not only biased, they are also
editorialized. Whether algorithms, and their bugs, play a part in
that is irrelevant.

My point is: it is human made, not some necessarily advanced alien
tool.

The only thing preventing you from having your own search engine is
that there is no readily available software. Why? Because there is no
cheap hardware.

This might have sounded like a crazy idea five or ten years ago, but
with the advent of the AMD Threadripper, i.e. cost gravity at play,
putting together a personal search engine is, if not a necessity, at
least a possibility.

The most common complaint I read about Google is that it yields
irrelevant texts “that do not even contain my search terms”. That is
not a bug, that is a feature! It seems to me the subtext is that you
cannot easily customize Google or whatever search algorithm since it
is proprietary. Even retrieving Google search results for further
processing, if not forbidden, is at least difficult.

Getting started

Let’s start at the beginning: what is a search engine? A search
engine is software that crawls the Internet, indexes it, i.e. stores
the text in a particular format, and that users can query to receive
in return the most relevant texts.

Let’s dive into what “relevant text” means. What is relevant text?

  1. A text that contains the search term in my query
  2. A text that has the same topic as my query
  3. A text that gives an answer to my query

The correct answer is “it depends”. That is why queries have grown
from keyword matching, like a book index, to boolean queries,
e.g. "Apple" -bible, up to so-called semantic search, which boils
down to considering one-way or two-way synonyms and other lexical
features.
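
To make that concrete, here is a minimal sketch of parsing such a
boolean query into positive and negative terms. parse-query is a
hypothetical helper, and the query is assumed to be already
lowercased and split into whitespace-separated tokens:

;; Split a tokenized query such as ("apple" "-bible") into a pair of
;; (positive-terms . negative-terms).  A token starting with #\- is
;; a negative term; everything else is a positive term.  Tokens are
;; assumed to be non-empty.
(define (parse-query tokens)
  (let loop ((tokens tokens) (positive '()) (negative '()))
    (if (null? tokens)
        (cons (reverse positive) (reverse negative))
        (let ((token (car tokens)))
          (if (char=? (string-ref token 0) #\-)
              (loop (cdr tokens)
                    positive
                    (cons (substring token 1 (string-length token))
                          negative))
              (loop (cdr tokens) (cons token positive) negative))))))

;; (parse-query '("apple" "-bible")) => (("apple") "bible")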

So far, I have failed to build a crawler that works. Also, much of
the text I am looking for is in Wikipedia or StackOverflow, for which
there are flat releases that are much easier to get started with than
putting together your own crawler. Still, there are some crawlers
around, so you can use one or learn from it. I will not dive into the
crawler part because it still hurts when I think about robots.txt,
throttling, text encoding etc… booooh!

So, we will imagine that you have a dataset of plain text, for
instance Wikipedia vital articles. It helps if you know the content
of the dataset. Querying random news articles is not very easy to
grasp because you have to read (!) the texts to figure out when and
how they are relevant.

Before querying, you need to store the text, but to know how to
store the text you need to know which query features you want. To get
started I will only consider positive keywords and negative keywords,
like apple -bible. So, we need to find out which texts contain the
word apple but not the word bible. Looking up everything that does
not contain bible is difficult: you would need to scan the whole
database to find those documents. Instead we will look for documents
that contain the word apple. So the following document contains the
word we are looking for:

(define doc1 "apple is looking for a search engine.")
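
Purely for illustration, let us also assume a second document, one
that contains the negative term, so the -bible filter has something
to reject later:

(define doc2 "the apple in the bible is not named.")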

That is the moment where the most advanced technology of our current
century makes its appearance: the inverted index. What is an inverted
index? It reverses the relationship between the document and the
word. Instead of saying “this document contains the word apple” it
says “apple is contained in this document”. So we might have a
procedure that returns the identifiers of the documents that contain
apple, like:

(assert (contains? (inverted-index-lookup "apple") 1))
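
For reference, here is a minimal sketch of such an inverted index,
assuming SRFI 69 hash tables and Guile's string-split. index! is a
hypothetical helper; a real indexer would also lowercase, strip
punctuation, and maybe stem or lemmatize:

;; Each word maps to the list of identifiers of the documents that
;; contain it.
(define index (make-hash-table))

(define (index! uid text)
  (for-each
    (lambda (word)
      (hash-table-update!/default index word
                                  (lambda (uids) (cons uid uids))
                                  '()))
    (string-split text #\space)))

(define (inverted-index-lookup word)
  (hash-table-ref/default index word '()))

(index! 1 doc1)
(index! 2 doc2)
;; (inverted-index-lookup "apple") => (2 1)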

inverted-index-lookup returns a list, and that list contains the
identifier 1 of the first document. That list might be big, so you
want to start from the least common word of the query. I call that
step candidates selection. Also, you might want to convert the
positive terms into lemmas or stems to go toward semantic search,
which means you need to store lemmas or stems at index time.
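
Here is a small sketch of that candidates selection step, reusing
the hypothetical inverted-index-lookup above: the document frequency
of a term is simply the length of its posting list, so pick the
positive term with the shortest one:

;; Candidates selection: among the positive terms (assumed to be a
;; non-empty list), return the posting list of the term with the
;; fewest matching documents, so the later filter-and-score pass has
;; as little work as possible.
(define (candidates positive-terms)
  (let loop ((terms (cdr positive-terms))
             (best (inverted-index-lookup (car positive-terms))))
    (if (null? terms)
        best
        (let ((uids (inverted-index-lookup (car terms))))
          (loop (cdr terms)
                (if (< (length uids) (length best)) uids best))))))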

Anyway, the next step, given the list of documents that contain the
least common word or term or lemma or stem, is to filter and score
them according to the full query. In the above case, that means
removing the documents that contain bible. You can do that step
serially, but it will necessarily take time. The trick is to use
for-each-par-map, a cousin of map-reduce that executes the map
procedure in parallel.

For instance something like:

(let ((hits '()))
  ;; score each candidate document in parallel, keep those that match
  (for-each-par-map
    (lambda (uid score) (when score (set! hits (cons (cons uid score) hits))))
    score (inverted-index-lookup "apple"))
  ;; then sort by descending score and keep at most ten documents
  (let ((sorted (sort hits (lambda (a b) (> (cdr a) (cdr b))))))
    (if (< (length sorted) 10) sorted (take sorted 10))))

The score function is interesting. I think going the Aho-Corasick
route, with a finite-state machine (FSM), is the best option because
it makes it easy to implement proximity scoring, “phrase matching”,
or really anything I can think of.

⚠ That last paragraph is really the most important part of this
post.
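
I will not attempt Aho-Corasick here. Just to make the contract of
the earlier snippet concrete, here is a deliberately naive score
sketch built on plain substring search, assuming SRFI 13's
string-contains and SRFI 1's any and fold. document-ref,
positive-terms and negative-terms are hypothetical, standing for a
document store and the parsed query:

;; Count non-overlapping occurrences of term in text.
;; string-contains returns the index of the match, or #f.
(define (occurrences text term)
  (let loop ((start 0) (count 0))
    (let ((hit (string-contains text term start)))
      (if hit
          (loop (+ hit (string-length term)) (+ count 1))
          count))))

;; Naive scoring: #f when any negative term occurs in the document,
;; otherwise the total number of occurrences of the positive terms.
;; document-ref is a hypothetical accessor from identifier to text.
(define (score uid)
  (let ((text (document-ref uid)))
    (if (any (lambda (term) (string-contains text term)) negative-terms)
        #f
        (fold (lambda (term total) (+ total (occurrences text term)))
              0
              positive-terms))))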

Conclusion

There is a gigantic leap coming in search, thanks to hardware
availability and free software with readable source, i.e. the only
thing that makes human progress possible: knowledge sharing.

Malfunction! Need input!
