LCP from suffix array

By cpp11, 10 years ago, in English

I am learning suffix arrays. I understand the O(n log n) construction of the suffix array, but I am not able to understand the LCP calculation. Could someone explain how to calculate the LCP array from a suffix array? Thanks in advance.

lcp, suffix array

Comments (26)

adamant

10 years ago

Kasai's algorithm is pretty easy and works in O(n).

Let's look at two adjacent suffixes in the suffix array. Let their indices in the suffix array be i₁ and i₁ + 1. If their lcp > 0, we can delete the first letter from both of them. The new strings keep the same relative order, and their lcp is exactly lcp - 1.

Now look at the string we get from suffix i₁ by deleting its first character. It is also a suffix of the string; let its index in the suffix array be i₂. Consider the lcp of suffixes i₂ and i₂ + 1. It is at least the already mentioned lcp - 1: the shortened second suffix lies somewhere after i₂ in the suffix array, so suffix i₂ + 1 sits between i₂ and it. This follows from a property of the lcp array, namely lcp(i, j) = min(lcp[i], lcp[i + 1], ..., lcp[j - 1]).
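The inequality in this step can be checked by brute force on small strings. The sketch below is only an illustration and not part of the original comment: `build_sa_naive`, `common_prefix` and `invariant_holds` are hypothetical helper names, and the suffix array is built by plain sorting, which is only reasonable for tiny inputs.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper: suffix array by direct sorting (tiny inputs only).
std::vector<int> build_sa_naive(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// Length of the common prefix of the suffixes starting at a and b.
int common_prefix(const std::string& s, int a, int b) {
    int n = s.size(), k = 0;
    while (a + k < n && b + k < n && s[a + k] == s[b + k]) ++k;
    return k;
}

// Checks: if suffix i and its successor in sorted order share h >= 2
// characters, then suffix i+1 and its successor share at least h-1.
bool invariant_holds(const std::string& s) {
    int n = s.size();
    std::vector<int> sa = build_sa_naive(s), rank(n);
    for (int r = 0; r < n; ++r) rank[sa[r]] = r;
    for (int i = 0; i + 1 < n; ++i) {
        if (rank[i] == n - 1) continue;          // no successor in sorted order
        int h = common_prefix(s, i, sa[rank[i] + 1]);
        if (h < 2) continue;                     // the claim is trivial for h <= 1
        if (rank[i + 1] == n - 1) return false;  // a successor must exist here
        if (common_prefix(s, i + 1, sa[rank[i + 1] + 1]) < h - 1) return false;
    }
    return true;
}
```

Calling `invariant_holds` on strings such as "banana" or "aaaaaaa" should return true; a false result would point to a bug in the helpers rather than in the argument.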

Finally, let's build the algorithm on these observations. We need an additional array rank[n], which holds the index in the suffix array of the suffix starting at position i. First we compute the lcp for the suffix with index rank[0]. Then we iterate over the suffixes in the order in which they appear in the string and compute lcp[rank[i]] naively, BUT starting from lcp[rank[i - 1]] - 1. This gives an O(n) algorithm, because at each step the lcp decreases by at most 1 (except when rank[i] = n - 1), so the total number of character comparisons is linear.

Implementation:

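#include <string>
#include <vector>
using namespace std;

// lcp[r] = LCP of the suffixes at positions r and r+1 of the suffix array; lcp[n-1] = 0.
vector<int> kasai(const string& s, const vector<int>& sa)
{
    int n=s.size(),k=0;
    vector<int> lcp(n,0);
    vector<int> rank(n,0);

    // rank is the inverse of sa: rank[sa[i]] = i
    for(int i=0; i<n; i++) rank[sa[i]]=i;

    // visit suffixes in text order; k carries over, reduced by 1 per step
    for(int i=0; i<n; i++, k?k--:0)
    {
        // the last suffix in sorted order has no successor
        if(rank[i]==n-1) {k=0; continue;}
        int j=sa[rank[i]+1];
        while(i+k<n && j+k<n && s[i+k]==s[j+k]) k++;
        lcp[rank[i]]=k;
    }
    return lcp;
}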
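
One way to sanity-check an implementation like the one above is to compare it with a brute-force LCP array on small inputs. This is a minimal sketch that assumes the same convention (lcp[r] compares the suffixes at positions r and r + 1 of the suffix array, and the last entry is 0); `lcp_naive` is a hypothetical name, and it runs in O(n²).

```cpp
#include <string>
#include <vector>

// Brute-force reference: compare each pair of neighbours in the suffix array.
std::vector<int> lcp_naive(const std::string& s, const std::vector<int>& sa) {
    int n = s.size();
    std::vector<int> lcp(n, 0);
    for (int r = 0; r + 1 < n; ++r) {
        int a = sa[r], b = sa[r + 1], k = 0;
        while (a + k < n && b + k < n && s[a + k] == s[b + k]) ++k;
        lcp[r] = k;
    }
    return lcp;
}
```

For s = "banana" the suffix array is {5, 3, 1, 0, 4, 2} (a, ana, anana, banana, na, nana), and the resulting array is {1, 3, 0, 0, 2, 0}.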

adamant

10 years ago

There is also a way to build it in O(n log n) with a segment tree, described on e-maxx, but in my opinion it is much harder and slower.

Lord_F

10 years ago

And there's another O(N log N) algorithm which is much more intuitive: find the LCP of each pair of consecutive suffixes using binary search and hashes. However, I'm not really sure which is easier and faster to implement: this method or Kasai's (and several other guys').
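
The binary-search idea can be sketched as follows. This is an illustration under stated assumptions, not code from the thread: `PrefixHash`, the modulus 1000000007 and the base 131 are arbitrary choices, and a single fixed modulus can produce collisions, so the answer is only probabilistically correct.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Polynomial prefix hashes modulo one prime (assumed constants, collision-prone).
struct PrefixHash {
    static constexpr long long MOD = 1000000007LL;
    static constexpr long long BASE = 131;
    std::vector<long long> pre, pw;
    int n;

    explicit PrefixHash(const std::string& s)
        : pre(s.size() + 1, 0), pw(s.size() + 1, 1), n(s.size()) {
        for (int i = 0; i < n; ++i) {
            pre[i + 1] = (pre[i] * BASE + (unsigned char)s[i]) % MOD;
            pw[i + 1] = pw[i] * BASE % MOD;
        }
    }

    // Hash of the substring s[l, l + len).
    long long get(int l, int len) const {
        return ((pre[l + len] - pre[l] * pw[len]) % MOD + MOD) % MOD;
    }

    // LCP of the suffixes starting at i and j: the largest len whose hashes match.
    int lcp(int i, int j) const {
        int lo = 0, hi = n - std::max(i, j);
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (get(i, mid) == get(j, mid)) lo = mid;
            else hi = mid - 1;
        }
        return lo;
    }
};
```

Calling `lcp(sa[r], sa[r + 1])` for every r gives the whole LCP array in O(n log n) once the suffix array is known.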

adamant

10 years ago

Yes, but hashes are evil, we don't want to use them :)

Xellos

10 years ago

Why exactly? Due to anti-hash tests? Try hashing mod 2^64 and a randomly chosen reasonably small prime.

adamant

10 years ago

I just dislike hashes and try to avoid them almost always when I have the opportunity. Also, double hashing is quite slow :(

Xellos

10 years ago

That's why 2^64: what makes double hashing slow is mainly the modulo operation. If you use just a 64-bit integer with natural overflow (unsigned long long in C++), it's fast, but it's easy to make anti-hash tests against it. That is what the other part (mod a smaller prime) takes care of, while retaining decent runtime.
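
A possible shape for this combination is sketched below. The struct `DualHash`, its field names and the default base 131 are assumptions made for illustration; the prime is passed in by the caller, so choosing it at random, as suggested above, is left outside the sketch.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Two hashes per prefix: one with natural 64-bit overflow (mod 2^64),
// one modulo a small prime supplied by the caller.
struct DualHash {
    std::vector<std::uint64_t> h64, p64;  // mod 2^64 part
    std::vector<std::uint64_t> hp, pp;    // mod prime part
    std::uint64_t prime;

    // Assumes base < small_prime < 2^32 so that products fit in 64 bits.
    DualHash(const std::string& s, std::uint64_t small_prime, std::uint64_t base = 131)
        : h64(s.size() + 1, 0), p64(s.size() + 1, 1),
          hp(s.size() + 1, 0), pp(s.size() + 1, 1), prime(small_prime) {
        for (std::size_t i = 0; i < s.size(); ++i) {
            std::uint64_t c = (unsigned char)s[i];
            h64[i + 1] = h64[i] * base + c;          // wraps around mod 2^64
            p64[i + 1] = p64[i] * base;
            hp[i + 1] = (hp[i] * base + c) % prime;  // the only explicit modulo
            pp[i + 1] = pp[i] * base % prime;
        }
    }

    // True if s[a, a+len) and s[b, b+len) agree under both hashes.
    bool equal(int a, int b, int len) const {
        std::uint64_t x64 = h64[a + len] - h64[a] * p64[len];
        std::uint64_t y64 = h64[b + len] - h64[b] * p64[len];
        std::uint64_t xp = (hp[a + len] + prime - hp[a] * pp[len] % prime) % prime;
        std::uint64_t yp = (hp[b + len] + prime - hp[b] * pp[len] % prime) % prime;
        return x64 == y64 && xp == yp;
    }
};
```

Only the prime part pays for a modulo operation; the mod 2^64 part relies on unsigned wraparound, which is well defined in C++.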

adamant

10 years ago

Interesting trick. But I still dislike hashes :)

cpp11

10 years ago

That was helpful. I got the idea. Thanks a lot. But could you explain why the property lcp(i, j) = min(lcp[i], lcp[i + 1], ..., lcp[j - 1]) is true?

k790alex

10 years ago

Write out a suffix array and its lcp array on paper, and you will notice that property.

adamant

10 years ago

For example, suppose we know lcp(i, j - 1). If lcp[j - 1] < lcp(i, j - 1), then lcp(i, j) = lcp[j - 1]; otherwise lcp(i, j) = lcp(i, j - 1). In other words, lcp(i, j) = min(lcp(i, j - 1), lcp[j - 1]). Now we can expand lcp(i, j - 1) in this formula in the same way, and repeating this gives the full minimum.
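
The property can also be checked mechanically on a small example. In this sketch the function names are hypothetical, i and j are positions in the suffix array with i < j, and lcp[r] is assumed to compare positions r and r + 1.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// lcp(i, j) for suffix-array positions i < j, computed as a range minimum.
int lcp_by_min(const std::vector<int>& lcp, int i, int j) {
    return *std::min_element(lcp.begin() + i, lcp.begin() + j);
}

// The same value computed directly from the suffixes starting at a and b.
int lcp_direct(const std::string& s, int a, int b) {
    int n = s.size(), k = 0;
    while (a + k < n && b + k < n && s[a + k] == s[b + k]) ++k;
    return k;
}

// Compares both computations for every pair of positions i < j.
bool min_property_holds(const std::string& s, const std::vector<int>& sa,
                        const std::vector<int>& lcp) {
    int n = sa.size();
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (lcp_by_min(lcp, i, j) != lcp_direct(s, sa[i], sa[j])) return false;
    return true;
}
```

For "banana" with sa = {5, 3, 1, 0, 4, 2} and lcp = {1, 3, 0, 0, 2, 0}, positions 0 and 2 hold the suffixes "a" and "anana", and min(lcp[0], lcp[1]) = min(1, 3) = 1 matches their common prefix "a".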

cpp11

10 years ago

It is clear now. Thanks :)

rahulnagurtha

9 years ago

Do you mean lcp(1,4) in abcabcd = min(lcp(1,2),lcp(2,3),lcp(3,4)) = min(0,0,0) = 0 ?

shyambs

9 years ago

Here lcp(i, j) means lcp(suffix starting at sa[i], suffix starting at sa[j]), where sa is the suffix array.

Heisenberg_333

7 years ago

What is the rank array? Please explain a little more.

Mahilewets

7 years ago

The rank array is just the inverse of the suffix array: rank[sa[i]] = i.
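
As a small illustration (the function name `inverse_of` is made up for this sketch), the inverse permutation can be built in one pass:

```cpp
#include <vector>

// rank[p] = position in the suffix array of the suffix that starts at p.
std::vector<int> inverse_of(const std::vector<int>& sa) {
    std::vector<int> rank(sa.size());
    for (int i = 0; i < (int)sa.size(); ++i) rank[sa[i]] = i;
    return rank;
}
```

For sa = {5, 3, 1, 0, 4, 2} (the suffix array of "banana") this gives rank = {3, 2, 5, 1, 4, 0}, so sa[rank[p]] == p for every p.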

Heisenberg_333

7 years ago

What if we used j = sa[i] + 1?

Mahilewets

7 years ago

The idea is the following. Let s = 'abcdefghi'. Then LCP(0) = lcp(abcdefghi, bcdefghi) = |bcdefghi| = 8.

Then we cut one character from the left of each string, move one position forward and calculate LCP(1). It would be LCP(1) = |cdefghi| = 7 = LCP(0) - 1.

So if j = sa[i] + 1, we can't say that LCP(i) = k - 1; we would have to check the prefix fully.

Len

7 years ago

lcp(abcdefghi, bcdefghi) =|bcdefghi|=8
wat

Mahilewets

7 years ago

Postfix increment is bad!

adamant

7 years ago

You're bad!

Mahilewets

7 years ago

I saw a postfix increment implementation in GCC C++.
It was like this: [](int &x) { int y = x; ++x; return y; } So postfix uses the prefix form as a subroutine and is therefore slower.

UPD: Postfix increment does unnecessary work. So it consumes additional energy without a real purpose. Energy is not infinite, you know. You are bringing us closer to the heat death of the Universe.

UPD2: I understand now. You are talking about compiler optimization. It is almost certain that the compiler will remove the postfix form and put the prefix form in its place.
Nonetheless, it requires additional time, additional energy and thus additional heating.
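
For reference, the conventional way to write both operators for a user-defined type looks like the sketch below, where `Counter` is a made-up type. For a plain int whose result is unused, as in the loop header of the Kasai code, compilers typically emit the same instructions for both forms.

```cpp
// Hypothetical type showing the usual prefix/postfix pairing.
struct Counter {
    int value = 0;

    // Prefix: increment, then return the updated object.
    Counter& operator++() {
        ++value;
        return *this;
    }

    // Postfix: copy the old state, reuse the prefix form, return the copy.
    Counter operator++(int) {
        Counter old = *this;
        ++(*this);
        return old;
    }
};
```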
