Wednesday, April 13, 2011

Comparing MARC fields and identifying similar contents / deduplication

When automatically enriching library MARC records you'll run into the problem of merging field information from different records into one record, where the fields range from completely unique to similar or almost identical. The last two cases are the most difficult to handle programmatically.

Take these three examples of MARC 21 520 fields (504 in DanMARC2, which we normally use) that were wrongly merged together into one of our own library records.

520 00 $a This book is a linguistic-cultural study of the emergence of the Jewish ghettos during the Holocaust. It traces the origins and uses of the term 'ghetto' in European discourse from the sixteenth century to the Nazi regime. It examines with a magnifying glass both the actual establishment of and the discourse of the Nazis and their allies on ghettos from 1939 to 1944. With conclusions that oppose all existing explanations and cursory examinations of the ghetto, the book impacts overall understanding of the anti-Jewish policies of Nazi Germany.

520 00 $a Summary: "This book is a linguistic-cultural study of the emergence of the Jewish ghettos during the Holocaust. It traces the origins and uses of the term 'ghetto' in European discourse from the sixteenth century to the Nazi regime. It examines with a magnifying glass both the actual establishment of and the discourse of the Nazis and their allies on ghettos from 1939 to 1944. With conclusions that oppose all existing explanations and cursory examinations of the ghetto, the book impacts overall understanding of the anti-Jewish policies of Nazi Germany"--Provided by publisher

520 00 $a Summary: "This book is a linguistic-cultural study of the emergence of the Jewish ghettos during the Holocausta It traces the origins and uses of the term ,@221A@UFAghetto,@221A@UF9 in European discourse from the sixteenth century to the Nazi regime. It examines with a magnifying glass both the actual establishment of and the discourse of the Nazis and their allies on ghettos from 1939 to 1944. With conclusions that oppose all existing explanations and cursory examinations of the ghetto, the book impacts overall understanding of the anti-Jewish policies of Nazi Germany"--Provided by publisher

The first is the original 520 field provided by our supplier. The second and third are from WorldCat, one of them showing some mangled MARC-8 encoding.

To the naked eye these are quickly identified as equal, but a computer needs some method of comparing the mostly identical contents. A simple string comparison isn't enough.

CPAN came to the rescue: I found the Perl module String::Compare, which is described as "A module to see how much two strings are alike". Yes! That's it!

By writing a small demo program (proof of concept) I can test the three fields above against a fourth, unrelated one:

use String::Compare;

my $m520_1 = '520 00 $a This book is a linguistic-cultural study of the emergence of the Jewish ghettos during the Holocaust. It traces the origins and uses of the term \'ghetto\' in European discourse from the sixteenth century to the Nazi regime.  It examines with a magnifying glass both the actual establishment of and the discourse of the Nazis and their allies on ghettos from 1939 to 1944. With conclusions that oppose all existing explanations and cursory examinations of the ghetto, the book impacts overall understanding of the anti-Jewish policies of Nazi Germany.';
my $m520_2 = '520 00 $a Summary: "This book is a linguistic-cultural study of the emergence of the Jewish ghettos during the Holocaust. It traces the origins and uses of the term \'ghetto\' in European discourse from the sixteenth century to the Nazi regime. It examines with a magnifying glass both the actual establishment of and the discourse of the Nazis and their allies on ghettos from 1939 to 1944.  With conclusions that oppose all existing explanations and cursory examinations of the ghetto, the book impacts overall understanding of the anti-Jewish policies of Nazi Germany"--Provided by publisher';
my $m520_3 = '520 00 $a Summary: "This book is a linguistic-cultural study of the emergence of the Jewish ghettos during the Holocausta It traces the origins and uses of the term ,@221A@UFAghetto,@221A@UF9 in European discourse from the sixteenth century to the Nazi regime. It examines with a magnifying glass both the actual establishment of and the discourse of the Nazis and their allies on ghettos from 1939 to 1944. With conclusions that oppose all existing explanations and cursory examinations of the ghetto, the book impacts overall understanding of the anti-Jewish policies of Nazi Germany"--Provided by publisher';
my $m520_4 = '520 00 $a Summary: "This critical book explores the cosmic dimensions of the brain\'s inner theater, including film, theatre and television. In all eras and media, supernatural figures express the brain\'s anatomical features as humans transform their natural environment into cosmic and theological spaces in order to grapple with their vulnerability in the world"--Provided by publisher';

print "1 against 2: " . String::Compare::word_by_word ( $m520_1, $m520_2 ) . " (same)\n";
print "1 against 3: " . String::Compare::word_by_word ( $m520_1, $m520_3 ) . " (same)\n";
print "1 against 4: " . String::Compare::word_by_word ( $m520_1, $m520_4 ) . " (different)\n";
print "2 against 3: " . String::Compare::word_by_word ( $m520_2, $m520_3 ) . " (same)\n";
print "2 against 4: " . String::Compare::word_by_word ( $m520_2, $m520_4 ) . " (different)\n";
print "3 against 4: " . String::Compare::word_by_word ( $m520_3, $m520_4 ) . " (different)\n";

The output of the program is:

1 against 2: 0.940553071700613 (same)
1 against 3: 0.930495081438478 (same)
1 against 4: 0.503246509907556 (different)
2 against 3: 0.953972868217055 (same)
2 against 4: 0.530890287454307 (different)
3 against 4: 0.515555101066912 (different)

As you can see, the near-identical fields score above 90% with String::Compare's word_by_word comparison (this method proved the best in this example), making it easy to discard the redundant data.
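For readers without Perl at hand, the same idea can be sketched in Python. This is not the String::Compare algorithm: it swaps in the standard library's difflib.SequenceMatcher on word lists, so the scores will differ from the ones above. The function names, the punctuation-stripping tokenizer and the 0.8 threshold are my own assumptions for illustration, and the sample strings are shortened stand-ins for the 520 fields.

```python
import re
from difflib import SequenceMatcher


def word_similarity(a, b):
    """Score two strings between 0.0 and 1.0 by comparing their word sequences.

    Assumption: words are lowercased and punctuation is ignored, so a
    'Summary: "..."' wrapper costs only a few extra words.
    """
    words_a = re.findall(r"\w+", a.lower())
    words_b = re.findall(r"\w+", b.lower())
    return SequenceMatcher(None, words_a, words_b).ratio()


def drop_near_duplicates(fields, threshold=0.8):
    """Keep a field only if it is not too similar to one already kept.

    The threshold is a hypothetical value; tune it on real records.
    """
    kept = []
    for field in fields:
        if all(word_similarity(field, other) < threshold for other in kept):
            kept.append(field)
    return kept


# Shortened stand-ins for the 520 fields discussed above.
supplier = "This book is a study of the Jewish ghettos"
worldcat = 'Summary: "This book is a study of the Jewish ghettos"--Provided by publisher'
unrelated = "This critical book explores the cosmic dimensions of the brain"

for name, other in (("worldcat", worldcat), ("unrelated", unrelated)):
    print("supplier against %s: %.3f" % (name, word_similarity(supplier, other)))
print(drop_near_duplicates([supplier, worldcat, unrelated]))
```

The first field seen wins, so feed the list in order of trust (for instance the supplier's record before the WorldCat ones).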

Thursday, December 9, 2010

ludelete unable to remove boot environment

At least a couple of times I have run into issues with ludelete and Solaris Live Upgrade.

This is one of those cases:


# ludelete bootenv2
System has findroot enabled GRUB
Checking if last BE on any disk...
ERROR: Read-only file system: cannot create mount point
ERROR: failed to create mount point for file system
ERROR: unmounting partially mounted boot environment file systems
ERROR: umount: warning: /dev/dsk/c1t0d0s3 not in mnttab
umount: /dev/dsk/c1t0d0s3 not mounted
ERROR: cannot unmount
ERROR: cannot mount boot environment by name
ERROR: Failed to mount BE .
ERROR: Failed to mount BE .
cat: cannot open /tmp/.lulib.luclb.dsk.5186.bootenv2
ERROR: This boot environment is the last BE on the above disk.
ERROR: Deleting this BE may make it impossible to boot from this disk.
ERROR: However you may still boot solaris if you have BE(s) on other disks.
ERROR: You *may* have to change boot-device order in the BIOS to accomplish this.
ERROR: If you still want to delete this BE , please use the force option (-f).
Unable to delete boot environment.


The solution to this problem was actually quite simple, but why did it happen? Well, the partitioning had changed after the boot environment was created: the mount point /storage1/db was referenced in this old boot environment but no longer existed.

For some reason ludelete requires the file systems to be as they were when the boot environment was created, so it refuses to delete the environment if you have changed the partitioning since.

The solution was to find the stale reference, in this case in /etc/lu/ICF.2, and remove the line:


bootenv2:/storage1/db:storage1/db:zfs:0


Problem was solved:


# ludelete bootenv2
System has findroot enabled GRUB
Checking if last BE on any disk...
BE is not the last BE on any disk.
No entry for BE in GRUB menu
Determining the devices to be marked free.
Updating boot environment configuration database.
Updating boot environment description database on all BEs.
Updating all boot environment configuration databases.
Boot environment deleted.
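
The manual edit of the ICF file can also be sketched as a small Python filter. It assumes the colon-separated layout suggested by the line removed above (boot environment name first, mount point second); the function name is hypothetical, and the two extra sample lines are made up purely for the demonstration. Work on a copy of the file and keep a backup.

```python
def drop_icf_entry(icf_text, be_name, mount_point):
    """Return icf_text without the lines for be_name mounted at mount_point.

    Assumption: each line is colon-separated with the boot environment
    name in field 1 and the mount point in field 2.
    """
    kept = []
    for line in icf_text.splitlines():
        fields = line.split(":")
        if len(fields) >= 2 and fields[0] == be_name and fields[1] == mount_point:
            continue  # this is the stale reference, skip it
        kept.append(line)
    return "".join(line + "\n" for line in kept)


# Only the last line is taken from the post; the first two are invented.
sample_icf = (
    "bootenv2:-:/dev/dsk/c1t0d0s1:swap:4194304\n"
    "bootenv2:/:/dev/dsk/c1t0d0s3:ufs:20971520\n"
    "bootenv2:/storage1/db:storage1/db:zfs:0\n"
)

print(drop_icf_entry(sample_icf, "bootenv2", "/storage1/db"), end="")
```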

Monday, July 12, 2010

Download the contents of a podcast (one liner)

I just needed a quick'n'dirty one-liner for downloading the contents of a podcast. The XML file had over 100 entries and I didn't want to fetch them individually or use a dedicated program.

wget -q -O - http://something/podcast.xml | \
     grep "enclosure url"  | awk '{print $2}' | \
     sed 's/url=//' | xargs wget

This can definitely be squeezed down to something smaller and more elegant (you are welcome to do so) but that was not the objective. It just gets the job done.
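
For comparison, here is a rough Python sketch of the same job that parses the feed as XML instead of grepping it, so it does not depend on how the enclosure attributes are laid out on each line. The function names are my own, and deriving the local file name from the URL path is an assumption about what you want on disk.

```python
import os
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def enclosure_urls(feed_xml):
    """Return the url attribute of every <enclosure> element in the feed."""
    root = ET.fromstring(feed_xml)
    return [e.get("url") for e in root.iter("enclosure") if e.get("url")]


def download_all(feed_url, target_dir="."):
    """Fetch the feed and save each enclosure into target_dir."""
    with urllib.request.urlopen(feed_url) as response:
        feed_xml = response.read()
    for url in enclosure_urls(feed_xml):
        # Assumption: the last path component is a usable file name.
        name = os.path.basename(urllib.parse.urlparse(url).path) or "download"
        urllib.request.urlretrieve(url, os.path.join(target_dir, name))
```

Calling download_all("http://something/podcast.xml") would then fetch everything into the current directory.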

Wednesday, February 24, 2010

Adding mime types for Microsoft Office 2007 file types in Apache

When opening uploaded Microsoft Office 2007 documents in MediaWiki (and some other web applications as well), they might turn up as zip files or fail to start the appropriate Office application. This is because Office Open XML (OOXML) is a zip-based file format and Apache 1 and 2 don't yet come with updated MIME type definitions (or IANA hasn't registered them).

You can add these missing definitions yourself without much trouble. First locate the Apache MIME types table (mime.types), then add the following and restart your Apache server.

application/vnd.ms-word.document.macroEnabled.12 docm
application/vnd.openxmlformats-officedocument.wordprocessingml.document docx
application/vnd.openxmlformats-officedocument.wordprocessingml.template dotx
application/vnd.ms-powerpoint.template.macroEnabled.12 potm
application/vnd.openxmlformats-officedocument.presentationml.template potx
application/vnd.ms-powerpoint.addin.macroEnabled.12 ppam
application/vnd.ms-powerpoint.slideshow.macroEnabled.12 ppsm
application/vnd.openxmlformats-officedocument.presentationml.slideshow ppsx
application/vnd.ms-powerpoint.presentation.macroEnabled.12 pptm
application/vnd.openxmlformats-officedocument.presentationml.presentation pptx
application/vnd.ms-excel.addin.macroEnabled.12 xlam
application/vnd.ms-excel.sheet.binary.macroEnabled.12 xlsb
application/vnd.ms-excel.sheet.macroEnabled.12 xlsm
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet xlsx
application/vnd.ms-excel.template.macroEnabled.12 xltm
application/vnd.openxmlformats-officedocument.spreadsheetml.template xltx

For MediaWiki to identify the correct mime type (during upload) you can also add the above unmodified to includes/mime.types.

The MediaWiki file includes/mime.info has no practical use currently but you could add the following to it for completeness:

application/vnd.ms-word.document.macroEnabled.12 [OFFICE]
application/vnd.openxmlformats-officedocument.wordprocessingml.document [OFFICE]
application/vnd.openxmlformats-officedocument.wordprocessingml.template [OFFICE]
application/vnd.ms-powerpoint.template.macroEnabled.12 [OFFICE]
application/vnd.openxmlformats-officedocument.presentationml.template [OFFICE]
application/vnd.ms-powerpoint.addin.macroEnabled.12 [OFFICE]
application/vnd.ms-powerpoint.slideshow.macroEnabled.12 [OFFICE]
application/vnd.openxmlformats-officedocument.presentationml.slideshow [OFFICE]
application/vnd.ms-powerpoint.presentation.macroEnabled.12 [OFFICE]
application/vnd.openxmlformats-officedocument.presentationml.presentation [OFFICE]
application/vnd.ms-excel.addin.macroEnabled.12 [OFFICE]
application/vnd.ms-excel.sheet.binary.macroEnabled.12 [OFFICE]
application/vnd.ms-excel.sheet.macroEnabled.12 [OFFICE]
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet [OFFICE]
application/vnd.ms-excel.template.macroEnabled.12 [OFFICE]
application/vnd.openxmlformats-officedocument.spreadsheetml.template [OFFICE]

Finally remember to add the file types to the variable $wgFileExtensions in LocalSettings.php.
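
Before editing, it can be handy to check which of these extensions a given mime.types file already knows. The following Python sketch does that, assuming the usual mime.types layout of a MIME type followed by whitespace-separated extensions, with # starting a comment. The function names and the sample text are my own.

```python
def parse_mime_types(text):
    """Map each extension found in mime.types-style text to its MIME type."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        mime_type, *extensions = line.split()
        for ext in extensions:
            mapping[ext.lower()] = mime_type
    return mapping


def missing_extensions(text, wanted):
    """Return the wanted extensions that the text does not define, sorted."""
    known = parse_mime_types(text)
    return sorted(ext for ext in wanted if ext.lower() not in known)


# Made-up excerpt of a mime.types file for demonstration.
sample = """\
# MIME type                 extensions
application/zip             zip
application/msword          doc dot
application/vnd.openxmlformats-officedocument.wordprocessingml.document docx
"""

print(missing_extensions(sample, ["docx", "xlsx", "pptx"]))
```

To run it against a real file, read its contents with open(path).read() and pass in the sixteen extensions from the table above.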

Tuesday, February 23, 2010

Duplex in LaTeX revisited

Back in the 90s when I studied at the university I wrote all my documents in LaTeX, and I was really fond of it. I still think it is the best layout system / engine ever invented. Back then I actually wrote some styles myself that were used quite a lot at the university. One of them made it easy to enable duplex printing on PostScript printers via dvips.

Now my in-depth knowledge of LaTeX is long gone, but as a homage to it I will post the style right here on this blog. It should still work, though I have not tested it:

% Definitions for enabling duplex print on PostScript printers (dvips)
% (c) Copyright 1998 Kasper Løvschall

\NeedsTeXFormat{LaTeX2e}
\ProvidesPackage{psduplex}[1998/03/23 v1.1 Duplex print on PostScript device (KL)]

% Defining duplex options

\DeclareOption{longedge}{\special{!userdict begin /start-hook{
  1 dict dup /Duplex true put setpagedevice
  1 dict dup /Tumble false put setpagedevice
  }def end}\PackageInfo{psduplex}{Duplex Long Edge Binding is active}}
\DeclareOption{shortedge}{\special{!userdict begin /start-hook{
  1 dict dup /Duplex true put setpagedevice
  1 dict dup /Tumble true put setpagedevice
  }def end}\PackageInfo{psduplex}{Duplex Short Edge Binding is active}}
\DeclareOption{none}{\special{!userdict begin /start-hook{
  1 dict dup /Duplex false put setpagedevice
  1 dict dup /Tumble false put setpagedevice
  }def end}\PackageInfo{psduplex}{Duplex None is active}}

% Defining media position

\DeclareOption{tray2}{\special{!userdict begin /start-hook{
  1 dict dup /DeferredMediaSelection true put setpagedevice
  1 dict dup /MediaPosition 0 put setpagedevice
  }def end}\PackageInfo{psduplex}{Tray 2 media selected}}
\DeclareOption{tray3}{\special{!userdict begin /start-hook{
  1 dict dup /DeferredMediaSelection true put setpagedevice
  1 dict dup /MediaPosition 1 put setpagedevice
  }def end}\PackageInfo{psduplex}{Tray 3 media selected}}
\DeclareOption{envelope}{\special{!userdict begin /start-hook{
  1 dict dup /DeferredMediaSelection true put setpagedevice
  1 dict dup /MediaPosition 2 put setpagedevice
  }def end}\PackageInfo{psduplex}{Envelope media selected}}
\DeclareOption{tray1}{\special{!userdict begin /start-hook{
  1 dict dup /DeferredMediaSelection true put setpagedevice
  1 dict dup /MediaPosition 3 put setpagedevice
  }def end}\PackageInfo{psduplex}{Tray 1 media selected}}
\DeclareOption{tray4}{\special{!userdict begin /start-hook{
  1 dict dup /DeferredMediaSelection true put setpagedevice
  1 dict dup /MediaPosition 4 put setpagedevice
  }def end}\PackageInfo{psduplex}{Tray 4 media selected}}

% Beginning-Of-Page hooks

\DeclareOption{kopi}{\special{!userdict begin /bop-hook{gsave 200 30
  translate 65 rotate /Courier-Bold findfont 300 scalefont setfont
  50 -10 moveto 0.8 setgray (KOPI) show grestore}def end}}
\DeclareOption{kladde}{\special{!userdict begin /bop-hook{gsave 200 30
  translate 65 rotate /Courier-Bold findfont 200 scalefont setfont
  50 20 moveto 0.8 setgray (KLADDE) show grestore}def end}}
\DeclareOption{udkast}{\special{!userdict begin /bop-hook{gsave 200 30
  translate 65 rotate /Courier-Bold findfont 200 scalefont setfont
  40 20 moveto 0.8 setgray (UDKAST) show grestore}def end}}
\DeclareOption{hemmeligt}{\special{!userdict begin /bop-hook{gsave 200 30
  translate 65 rotate /Courier-Bold findfont 135 scalefont setfont
  40 40 moveto 0.8 setgray (HEMMELIGT) show grestore}def end}}

% Error message if invalid argument

\DeclareOption*{%
  \PackageWarning{psduplex}{Unknown argument "\CurrentOption"}}
\ProcessOptions*\relax

\endinput %EOF

In your preamble you can now add:

\usepackage[longedge]{psduplex}

to enable duplex on the long edge. Other options include: shortedge, tray1, tray2, tray3, tray4, envelope and none. They should be self-explanatory. Some backdrop texts, nowadays usually provided by the printer driver, are also included (and need to be translated, if you like): kopi (copy), kladde (draft), udkast (outline) as well as hemmeligt (secret).

If I find some spare time I will take a look at the LyX project, http://www.lyx.org, which looks rather interesting.

Tuesday, January 26, 2010

NFS static ports and firewalls

NFS makes use of the portmapper to assign ports between the server and client. This makes firewall configuration rather difficult.

On Solaris and Linux nfsd is assigned to port 2049, but the ports of the supporting protocols are handed out by the portmapper and are therefore rather unpredictable.

Linux makes it possible to assign static ports to all of the NFS services, which makes firewalling a lot easier.

You just edit /etc/sysconfig/nfs and add your preferred ports:

# Port rquotad should listen on.
RQUOTAD_PORT=875
# TCP port rpc.lockd should listen on.
LOCKD_TCPPORT=32803
# UDP port rpc.lockd should listen on.
LOCKD_UDPPORT=32769
# Port rpc.mountd should listen on.
MOUNTD_PORT=892
# Port rpc.statd should listen on.
STATD_PORT=662

Then you can edit your firewall settings in /etc/sysconfig/iptables adding the static ports:

# nfsd
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 2049 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 2049 -j ACCEPT
# rquotad
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 875 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 875 -j ACCEPT
# lockd
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 32803 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 32769 -j ACCEPT
# mountd
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 892 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 892 -j ACCEPT
# statd
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 662 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 662 -j ACCEPT
# portmapper
-A RH-Firewall-1-INPUT -s 127.0.0.1 -p tcp --dport 111 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 111 -j ACCEPT
-A RH-Firewall-1-INPUT -p udp -m udp --dport 111 -j ACCEPT

You can check the configuration after restarting the nfs service:

# rpcinfo -p localhost
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100024    1   udp    662  status
    100024    1   tcp    662  status
    100011    1   udp    875  rquotad
    100011    2   udp    875  rquotad
    100011    1   tcp    875  rquotad
    100011    2   tcp    875  rquotad
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100021    1   udp  32769  nlockmgr
    100021    3   udp  32769  nlockmgr
    100021    4   udp  32769  nlockmgr
    100021    1   tcp  32803  nlockmgr
    100021    3   tcp  32803  nlockmgr
    100021    4   tcp  32803  nlockmgr
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100005    1   udp    892  mountd
    100005    1   tcp    892  mountd
    100005    2   udp    892  mountd
    100005    2   tcp    892  mountd
    100005    3   udp    892  mountd
    100005    3   tcp    892  mountd

Everything's fine and running on the specified ports.
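
If you want to automate that check, a small Python sketch can parse the rpcinfo listing and report any service that is not on its configured port. The service names and ports are taken from the output above; the function names and the structure of the EXPECTED table are my own assumptions.

```python
# (service, protocol) -> port, as configured in /etc/sysconfig/nfs above.
EXPECTED = {
    ("portmapper", "tcp"): 111, ("portmapper", "udp"): 111,
    ("status", "tcp"): 662, ("status", "udp"): 662,
    ("rquotad", "tcp"): 875, ("rquotad", "udp"): 875,
    ("nfs", "tcp"): 2049, ("nfs", "udp"): 2049,
    ("nlockmgr", "tcp"): 32803, ("nlockmgr", "udp"): 32769,
    ("mountd", "tcp"): 892, ("mountd", "udp"): 892,
}


def parse_rpcinfo(output):
    """Return a set of (service, proto, port) tuples from `rpcinfo -p` text."""
    found = set()
    for line in output.splitlines():
        parts = line.split()
        # Data rows have five columns and start with the program number.
        if len(parts) != 5 or not parts[0].isdigit():
            continue
        _program, _version, proto, port, service = parts
        found.add((service, proto, int(port)))
    return found


def wrong_ports(output, expected=EXPECTED):
    """List (service, proto, port) entries that differ from the expected port."""
    return sorted(
        entry for entry in parse_rpcinfo(output)
        if expected.get((entry[0], entry[1])) != entry[2]
    )
```

Feed it the text of `rpcinfo -p localhost`, for instance via subprocess.run(..., capture_output=True, text=True).stdout; an empty list means every registered service sits on a port the firewall allows.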

Sunday, January 24, 2010

NFS interoperability between Linux server and Solaris client

I've been running NFS version 4 on Solaris 10 for many years but as a part of a hardware upgrade I moved the file server to Linux (CentOS 5).

I've never had any problems connecting Linux clients to Solaris NFS servers, but I quickly discovered that interoperability the other way round was a bit problematic.

From: http://www.novell.com/coolsolutions/feature/17581.html
In the case of NFSv3 and NFSv4 clients simultaneously accessing the same server, one must be aware that two different file systems are used: there is no backward support to NFSv3 by the NFSv4 server. 

If you install NFS on Linux and follow the general recommendations, you cannot connect from your Solaris machine unless you fall back to an NFSv3 mount:

mount -o vers=3 linux_server:/u1 /mnt

This is very impractical if you use the automounter, but you can completely disable NFSv4 mounts from Solaris clients by editing /etc/default/nfs:

# Sets the maximum version of the NFS protocol that will be used by
# the NFS client.  Can be overridden by the "vers=" NFS mount option.
# If "vers=" is not specified for an NFS mount, this is the version
# that will be attempted first.  The default is 4.
NFS_CLIENT_VERSMAX=3

Well, I didn't want to add that to all Solaris clients, and reading several blogs on the problem didn't lead me to an optimal solution. So I decided to skip NFSv4 and fall back to NFSv3 on the server.

To do that, I edited /etc/sysconfig/nfs:

# Turn off v4 protocol support
RPCNFSDARGS="-N 4"

Then interoperability was bliss. The differences between NFS version 3 and 4 matter next to nothing in our rather small environment, so the drawbacks were nonexistent.

Further reading:
Solaris NFSv4 client mount from a Linux Server (blogbert)
Introducing NFS Fundamentals for the Solaris OS