Блог пользователя teja349

Автор teja349, история, 8 лет назад, По-английски

Assume there are 2 patterns P1 and P2 such that P1 is a substring of P2 which is nether prefix nor suffix.Then how does a KMP type on string T work
Eg:
P1=of
P2=sofa
T=sofa
in this if we run aho corasick .will it detect string "of" or not

And also any good tutorial link on it will be apreciated
PS:I havent implemented it yet.I was studying it and I got this doubt as I read about its implementation

  • Проголосовать: нравится
  • +8
  • Проголосовать: не нравится

»
8 лет назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

Auto comment: topic has been updated by teja349 (previous revision, new revision, compare).

»
8 лет назад, # |
  Проголосовать: нравится +3 Проголосовать: не нравится

Yes, it will. In Aho-Corasick algorithm you create a trie nodes of which correspond to a prefix of some pattern. And you have suffix links for each node (similar to failure function in KMP). To find all the patterns which can end at a particular position in the string, you keep moving up using the suffix links as long as possible from the node you are currently at.

When traversing the text "sofa", when you reach the third position, the corresponding node in the trie corresponds to "sof". But this node has a suffix link to the node for "of". So when you move up the suffix link from the former to the latter, you will detect pattern P1. :)

  • »
    »
    8 лет назад, # ^ |
    Rev. 2   Проголосовать: нравится +3 Проголосовать: не нравится

    But,isnt it like only when there is a mismatch we go to suffix link
    so after reaching "sof".when we see "a" is present in continuation "sof" in trie .Then why will we see the suffix link.We only take suffix link at "sofa",am I wrong??

    • »
      »
      »
      8 лет назад, # ^ |
        Проголосовать: нравится +3 Проголосовать: не нравится

      No, you always traverse along suffix links to get all the patterns which end at that position, not only when there is a mismatch. I think what you are referring to (going to suffix link when there is a mismatch), that happens when you want to go the corresponding node in the trie after traversing the next character of the text.

      • »
        »
        »
        »
        8 лет назад, # ^ |
          Проголосовать: нравится +6 Проголосовать: не нравится

        I am not able to underastand what you are saying.i dont get by what u mean traversing all suffix links

        Can you provide ur implementation for aho corasick

  • »
    »
    8 лет назад, # ^ |
      Проголосовать: нравится +3 Проголосовать: не нравится

    You should build two different types of upward links: failure links (this always goes to the closest node that is also a valid prefix) and output links. Traversing output links from a node to the root should visit a subset of the nodes that are visited by doing the same with failure links. The sequence of output links should only visit those nodes in which we found an occurrence of a pattern.

»
8 лет назад, # |
Rev. 2   Проголосовать: нравится +14 Проголосовать: не нравится

Hey, I think the best "tutorial" on Aho-Corasick algorithm is the original paper. Seriously, it is really well written, thorough in explanations, and has detailed pseudocode. Look for "Efficient string matching: An aid to bibliographic search". (This link works for me.)

»
7 лет назад, # |
Rev. 2   Проголосовать: нравится +10 Проголосовать: не нравится

I apologize for the bump (and I'm sure OP has figured out his issues by now). However, when I was trying to implement Aho-Corasick I had the same concern as OP, ran across this post, and couldn't find an implementation I liked. I was using the description here, but they don't explain how to actually match on the text (which is the cause of confusion for the OP), nor do they have code for jumping to leafs (which is necessary for proper complexity).

So for the sake of what me from 12 hours ago would have wanted, here's my implementation (for 475D, extending upon the emaxx code: https://ideone.com/YIbEn5

It contains the missing parts from emaxx that are needed for a full implementation.

Aho-Corasick just by itself can be found here: https://gist.github.com/Chillee/69141ce899e9bc65291f8b120ee57b90

»
6 лет назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

Here is my implementation of aho-corasick: https://ideone.com/N3hMtK

I learnt it from here: https://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf

Implementation part and some optimizations is well described here: https://e-maxx-eng.appspot.com/string/aho_corasick.html.