📘 Finding duplicate texts using Perl 6

Find duplicate fragments in the same text.

This task was dictated by the practical need when I realised that I used the same phrases in different parts of the text of this book. Some of them, like Hello, World!, are unavoidable, but it would be a great help to find the rest.

Here is the full solution, which scans the text from standard input and finds the sequences of Nwords which appear more than once in the text.

my $text = $*IN.slurp;
$text .= lc;
$text ~~ s:g/\W+/ /;
my $length = $text.chars;

my %phrases;
my $start = 0;
while $text ~~ m:c($start)/(<< [\w+] ** 5 %% \s >>) .+ $0/ {
    $start = $0.from + 1;
    %phrases{$0}++;

    print (100 * $start / $length).fmt('%i%% ');
    say $0;
}

say "\nDuplicated strings:";

for %phrases.keys.sort({%phrases{$^b} <=> %phrases{$^a}}) {
    say "$_ = " ~ %phrases{$_} + 1;
}

The program is relatively complicated, so let us examine it bit by bit.

First, the program reads the input using the $*IN.slurp call that returns the whole input text. It reads all the lines and creates a single string variable out of it. The .=lc method, called on the $text variable, makes the string lowercase and also assigns it back to the variable.

With a substitution s/\W+/ /, all non-alphanumeric sequences are replaced with a space. Thus, we eliminate all the punctuation, for example.

The last step of preparatory work is to save the length of the text in a variable so that we use it later in the program directly, instead of calling the charsmethod (see Task 3, String length).

Now, the main loop starts. Its goal is to take all the five-word sequences that occur at least twice in the text and place them in the %phrases hash. Each time another copy of the phrase is found, the value in the %phrases hash is incremented. At the end of the loop, the hash contains the number of occurrences for each such five-word sequence.

Look at the regex that finds the repetitions:

m:c($start)/(<< [\w+] ** 5 %% \s >>) .+ $0/

The main part of it, << [\w+] ** 5 %% \s >>, finds five words separated by a space. The << and >> anchors stick to word boundaries, [\w+ ** 5] is a sequence of five words, and the separator is mentioned in the %% clause: %% \s. The regex then needs a copy of the just matched phrase, and this is the job of the $0 variable inside the regex.

Finally, the :c adverb with a parameter—the $start value—makes the regex match against the string starting the $start position. This counter is incremented in the loop body based on the location of the first found phrase: $start = $0.from + 1.

The rest of the program prints the result as a table. It sorts the found phrases and displays the most frequent first.

📘 Finding the longest palindrome using Perl 6

Find the longest palindromic substring in the given string.

The main idea behind the solution is to scan the string with a window of varying width. In other words, starting from a given character, test all the substrings of any length possible at that position.

For the string $string, this is how the loops can be organized:

for 0 .. $length -> $start {
    for $start .. $length - 1 -> $end {
        . . .
    }
}

Now, extract the substring using the substrmethod, which is defined for the objects of the Strtype, and do the check similar to the solution of Task 16, Palindrome test. Here, we have to be careful to check the palindrome without taking into account the non-letter characters but saving the result as part of the original string. For this, a copy of the substring is made.

my $substring = $string.substr($start, $length - $end);
my $test = $substring;
$test ~~ s:g/\W+//;
$test .= lc;
if $test eq $test.flip && $substring.chars > $found.chars {
    $found = $substring;
}

The temporary result is saved in the $found variable. The algorithm keeps track of the first longest palindromic substring. If there are more than one such substrings of the same length, they are ignored.

Here is the complete code of the program.

my $string = prompt('Enter string> ');
my $length = $string.chars;
my $found = '';

for 0 .. $length -> $start {
    for $start .. $length - 1 -> $end {
        my $substring = 
           $string.substr($start, $length - $end);
        my $test = $substring;
        $test ~~ s:g/\W+//;
        $test .= lc;
        if $test eq $test.flip && 
          $substring.chars > $found.chars {
           $found = $substring;
        }
    }
}

if $found {
    say "The longest substring is '$found'.";
}
else {
    say "No palindromic substrings found.";
}

Run the program and see how it works.

Enter string> Hello, World!
The longest substring is 'o, Wo'.

As homework, modify the code so that it can track more than one palindromic substrings of the same length. It may, for example, keep the candidates in an array and re-initialize it if a longer palindrome is found.

📘 Palindrome test using Perl 6

Check if the entered string is palindromic.

A palindrome is a string that can be read from both ends: left to right or right to left. First, start with the simple case when the string contains only letters. (Thus, spaces and punctuation do not affect anything.)

In Task 5, Reverse a string, the flipmethod is used to reverse a string. To check whether it is a palindrome, compare the string with its flipped version.

my $string = prompt('Enter a string: ');
my $is_palindrome = $string eq $string.flip;

say 'The string is ' ~ 
   (!$is_palindrome ?? 'not ' !! '') ~ 
   'palindromic.';

This code works well with single words like ABBA or madam or kayak.

Let us take the next step and teach the program to work with sentences that contain spaces and punctuation characters. For removing all the non-letter characters, regexes are a good choice:

$string ~~ s:g/\W+//;

The \W+ regex matches with all non-word characters. All occurrences of them are removed from the string (replaced with nothing). The :g adverb tells the regex to repeatedly scan the whole string.

Additionally, the string should be lowercased:

$string .= lc;

The lc method is called on $string, and the result is assigned back to the same variable. This construction is equivalent to the following:

$string = $string.lc;

Add these two lines to the program:

my $string = prompt('Enter a string: ');

$string ~~ s:g/\W+//;
$string .= lc;

my $is_palindrome = $string eq $string.flip;

say 'The string is ' ~ 
   (!$is_palindrome ?? 'not ' !! '') ~ 
    'palindromic.';

Check the modified program against a few random sentences and a few palindromes:

Never odd or even.

Was it a rat I saw?

Mr. Owl ate my metal worm.

That’s all. As an additional stroke, it is a good thing to simplify the concatenated string a bit and use interpolation:

my $string = prompt('Enter a string: ');
$string ~~ s:g/\W+//;
$string .= lc;my $is_palindrome = $string eq $string.flip;
my $not = $is_palindrome ?? '' !! ' not';
say "The string is$not palindromic.";

📘 Anagram test using Perl 6

Tell if the two words are anagrams of each other.

Anagrams are words or phrases that are built out of the same letters. We start with checking words only.

my $a = prompt('First word > ');
my $b = prompt('Second word > ');

say normalize($a) eq normalize($b) 
    ?? 'Anagrams.' !! 'Not anagrams.';

sub normalize($word) {
    return $word.split('').sort.join('');
}

The words, stored in the $a and $b variables, are passed through the normalize function, which converts a word into a string, where all the letters are alphabetically sorted. For example, the ‘hello’ string becomes ‘ehllo’. If both words can be normalised to the same form, they are anagrams.

To make the program accept phrases, let’s modify the normalize function so that it removes the spaces from the phrase and makes all the letters lowercase:

sub normalize($word) {
    return $word.lc.split('').sort.join('').trans(' ' => '');
}

There are two additions to the above chain of method calls: lc converts the string to the lowercase version, and the trans method replaces all the spaces with an empty string. After these changes, the ‘Hello World’ phrase becomes ‘dehllloorw’.

📘 Finding the longest common substring using Perl 6

Find the longest common substring in the given two strings.

Let us limit ourselves with finding only the first longest substring. If there are more common substrings of the same length, then the rest are ignored. There are two loops (see also Task 17, The longest palindrome) over the first string ($a), and they use the indexmethod to search for the substring in the second string ($b).

my $a = 'the quick brown fox jumps over the lazy dog';
my $b = 'what does the fox say?';

my $common = '';
for 0 .. $a.chars -> $start {
    for $start .. $a.chars - 1 -> $end {
        my $s = $a.substr($start, $a.chars - $end);
        if $s.chars > $common.chars && $b.index($s).defined {
           $common = $s;
        }
    }
}

say $common 
    ?? "The longest common substring is '$common'." 
    !! 'There are no common substrings.';

The index method returns the position of the substring $s if it is found in the string $b. It is a little bit tricky to check if the substring is found because when it is found at the beginning of the string, then the returned value is 0 (as 0 is the position of the substring). If the substring is not found, then Nil is returned. Nil is not a numeric value; thus, it cannot be compared using the == or != operators. Use the defined method to check if index returns a value, not Nil$b.index($s).defined.

📘 Finding the most frequent word using Perl 6

Find the most frequent word in the given text.

To find the most frequent word, you need first to find all the words in the text. 

This can be done via the global regex m:g/(\w+)/ or by using the comb method. The method returns a list of all the matched substrings. In the following example of solving the task, the regex matching is placed inside the for loop, which immediately updates the %count hash, which keeps the number of occurrences of each found word. To allow counting words case-insensitively, the $text value is first lower-cased with the help of the lc string method.

my $text = prompt('Text> ');
my %count;
%count{$_}++ for $text.lc.comb(/\w+/);
say (sort {%count{$^b} <=> %count{$^a}}, %count.keys)[0];

The sort function sorts the hash using the word frequency as the sorting parameter. Then, the first element, [0], is taken and printed.

Test the program on different texts to see how it works. What you may notice is that the program always prints only one word, even if there are other words with the same number of occurrences. To solve the problem, extract the number of repetitions and filter the %count hash to find all the words that match this condition.

my $max = %count{(sort {%count{$^b} <=> %count{$^a}},
                 count.keys)[0]};
.say for %count.keys.grep({%count{$_} == $max});

This program prints all the words having the maximum values in %count.

📘 Plural endings using Perl 6

Put a noun in the correct form—singular or plural—depending on the number next to it.

In program outputs, it is often required to print some number followed by a noun, for example:

10 files copied

If there is only one file, then the phrase should be ‘1 file copied’ instead. Let’s see how Perl 6 can help.

Of course, it is quite easy to print a noun separately using string concatenation to make the whole phrase:

for 1, 2, 3 -> $n {                     # Program output:
    my $word = 'file';                  # 1 file found
    $word ~= 's' if $n > 1;             # 2 files found
    say "$n $word found";               # 3 files found
}

It is also a good practice to interpolate the word choice into the string itself to avoid additional lines of code and to get rid of temporary variables.

The following program generates the same output:

for 1, 2, 3 -> $n {
    say "$n file{'s' if $n > 1} found";
}

A code block in curly braces inside the string contains a regular Perl 6 code. It returns 's' if the number $n is greater than one. Notice that there is no need to use a ternary operator here; a postfix if looks very self-explaining.