Wednesday, April 13, 2011

Comparing MARC fields and identifying similar contents / deduplication

When automatically enriching library MARC records, you run into the problem of merging field information from different records into one record, where the fields range from completely unique to similar or nearly identical. The last two cases are the hardest to handle programmatically.

Take these three examples of MARC 21 520 fields (504 in DanMARC2, which we normally use) that were wrongly merged into one of our own library records.

520 00 $a This book is a linguistic-cultural study of the emergence of the Jewish ghettos during the Holocaust. It traces the origins and uses of the term 'ghetto' in European discourse from the sixteenth century to the Nazi regime. It examines with a magnifying glass both the actual establishment of and the discourse of the Nazis and their allies on ghettos from 1939 to 1944. With conclusions that oppose all existing explanations and cursory examinations of the ghetto, the book impacts overall understanding of the anti-Jewish policies of Nazi Germany.

520 00 $a Summary: "This book is a linguistic-cultural study of the emergence of the Jewish ghettos during the Holocaust. It traces the origins and uses of the term 'ghetto' in European discourse from the sixteenth century to the Nazi regime. It examines with a magnifying glass both the actual establishment of and the discourse of the Nazis and their allies on ghettos from 1939 to 1944. With conclusions that oppose all existing explanations and cursory examinations of the ghetto, the book impacts overall understanding of the anti-Jewish policies of Nazi Germany"--Provided by publisher

520 00 $a Summary: "This book is a linguistic-cultural study of the emergence of the Jewish ghettos during the Holocausta It traces the origins and uses of the term ,@221A@UFAghetto,@221A@UF9 in European discourse from the sixteenth century to the Nazi regime. It examines with a magnifying glass both the actual establishment of and the discourse of the Nazis and their allies on ghettos from 1939 to 1944. With conclusions that oppose all existing explanations and cursory examinations of the ghetto, the book impacts overall understanding of the anti-Jewish policies of Nazi Germany"--Provided by publisher

The first is the original 520 field provided by our supplier. The second and third are from WorldCat, and one of them shows mangled MARC-8 encoding.

To the naked eye these are quickly identified as the same, but a computer needs some method of comparing mostly identical contents. A simple string comparison isn't enough.
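To illustrate why exact matching falls short, here is a minimal sketch using shortened excerpts of the three fields. The `normalize` helper is hypothetical (not part of any module) and only lowercases and tidies whitespace; even after that, all three excerpts still count as distinct.

```perl
use strict;
use warnings;

# Shortened excerpts of the three 520 fields above (truncated for brevity).
my @fields = (
    q{This book is a linguistic-cultural study of the emergence of the Jewish ghettos during the Holocaust.},
    q{Summary: "This book is a linguistic-cultural study of the emergence of the Jewish ghettos during the Holocaust.},
    q{Summary: "This book is a linguistic-cultural study of the emergence of the Jewish ghettos during the Holocausta},
);

# Hypothetical helper: lowercase, collapse runs of whitespace, trim the ends.
sub normalize {
    my ($text) = @_;
    $text = lc $text;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

# Count how many distinct strings remain when only exact matches are merged.
my %seen;
$seen{ normalize($_) }++ for @fields;
my $distinct = scalar keys %seen;

print "distinct after exact matching: $distinct\n";
```

The added "Summary:" prefix and the encoding damage are enough to defeat any equality test, which is why a similarity score is needed.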

CPAN came to the rescue: I found the Perl module String::Compare, which is described as "A module to see how much two strings are alike". Yes! That's it!

By writing a small demo program (proof of concept) I can test the three fields above against a fourth, unrelated one:

use strict;
use warnings;
use String::Compare;

my $m520_1 = '520 00 $a This book is a linguistic-cultural study of the emergence of the Jewish ghettos during the Holocaust. It traces the origins and uses of the term \'ghetto\' in European discourse from the sixteenth century to the Nazi regime. It examines with a magnifying glass both the actual establishment of and the discourse of the Nazis and their allies on ghettos from 1939 to 1944. With conclusions that oppose all existing explanations and cursory examinations of the ghetto, the book impacts overall understanding of the anti-Jewish policies of Nazi Germany.';
my $m520_2 = '520 00 $a Summary: "This book is a linguistic-cultural study of the emergence of the Jewish ghettos during the Holocaust. It traces the origins and uses of the term \'ghetto\' in European discourse from the sixteenth century to the Nazi regime. It examines with a magnifying glass both the actual establishment of and the discourse of the Nazis and their allies on ghettos from 1939 to 1944. With conclusions that oppose all existing explanations and cursory examinations of the ghetto, the book impacts overall understanding of the anti-Jewish policies of Nazi Germany"--Provided by publisher';
my $m520_3 = '520 00 $a Summary: "This book is a linguistic-cultural study of the emergence of the Jewish ghettos during the Holocausta It traces the origins and uses of the term ,@221A@UFAghetto,@221A@UF9 in European discourse from the sixteenth century to the Nazi regime. It examines with a magnifying glass both the actual establishment of and the discourse of the Nazis and their allies on ghettos from 1939 to 1944. With conclusions that oppose all existing explanations and cursory examinations of the ghetto, the book impacts overall understanding of the anti-Jewish policies of Nazi Germany"--Provided by publisher';
my $m520_4 = '520 00 $a Summary: "This critical book explores the cosmic dimensions of the brain\'s inner theater, including film, theatre and television. In all eras and media, supernatural figures express the brain\'s anatomical features as humans transform their natural environment into cosmic and theological spaces in order to grapple with their vulnerability in the world"--Provided by publisher';

print "1 against 2: " . String::Compare::word_by_word ( $m520_1, $m520_2 ) . " (same)\n";
print "1 against 3: " . String::Compare::word_by_word ( $m520_1, $m520_3 ) . " (same)\n";
print "1 against 4: " . String::Compare::word_by_word ( $m520_1, $m520_4 ) . " (different)\n";
print "2 against 3: " . String::Compare::word_by_word ( $m520_2, $m520_3 ) . " (same)\n";
print "2 against 4: " . String::Compare::word_by_word ( $m520_2, $m520_4 ) . " (different)\n";
print "3 against 4: " . String::Compare::word_by_word ( $m520_3, $m520_4 ) . " (different)\n";

The output of the program is:

1 against 2: 0.940553071700613 (same)
1 against 3: 0.930495081438478 (same)
1 against 4: 0.503246509907556 (different)
2 against 3: 0.953972868217055 (same)
2 against 4: 0.530890287454307 (different)
3 against 4: 0.515555101066912 (different)

As you can see, the matching fields score above 90% with the String::Compare word_by_word comparison (this method proved the best in this example), while the unrelated field scores only around 50%, making it easy to discard the redundant data.
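A threshold like that can be turned into a small deduplication pass. The sketch below is only an illustration: `dedupe_fields` and `word_overlap` are hypothetical names, and `word_overlap` is a simple stand-in (the share of distinct words two strings have in common) so that the example runs with core Perl alone. It is not the algorithm String::Compare uses, and the 0.7 threshold is an assumption chosen for these short sample strings.

```perl
use strict;
use warnings;

# Stand-in similarity: share of distinct words two strings have in common
# (Jaccard index). This is NOT the algorithm used by String::Compare.
sub word_overlap {
    my ($left, $right) = @_;
    my %left   = map { $_ => 1 } split ' ', lc $left;
    my %right  = map { $_ => 1 } split ' ', lc $right;
    my $shared = grep { $right{$_} } keys %left;
    my %union  = (%left, %right);
    my $total  = keys %union;
    return $total ? $shared / $total : 1;
}

# Keep a field only if it is not too similar to one that was already kept.
sub dedupe_fields {
    my ($threshold, $similarity, @fields) = @_;
    my @kept;
    FIELD: for my $field (@fields) {
        for my $existing (@kept) {
            next FIELD if $similarity->($existing, $field) >= $threshold;
        }
        push @kept, $field;
    }
    return @kept;
}

# Short made-up sample strings, not real 520 fields.
my @candidates = (
    'a linguistic-cultural study of the emergence of the Jewish ghettos',
    'a linguistic-cultural study of the emergence of the Jewish ghettoes',
    'the cosmic dimensions of the inner theater of the brain',
);

my @unique = dedupe_fields(0.7, \&word_overlap, @candidates);
print scalar(@unique), " fields kept\n";
```

Because the similarity function is passed in as a code reference, `\&String::Compare::word_by_word` with a threshold near 0.9 should slot in the same way, assuming it takes two strings and returns a score as in the demo program.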
