Tuesday, December 16, 2008

Vandalising Words

With Python

One less cloudy evening, I got a little idea: how about posting song lyrics so that, when I hover the mouse over a word, I get its translation as tooltip text? Then the idea grew a little: given the lyrics and a file of word-translation pairs, how about automating the process? The next logical steps would be: how about looking words up in one or another web dictionary instead of a file of pairs, and how about stemming? Super!

Yet those logical steps weren't taken. I like instant gratification, so I just set out to get my tooltip text without the lookup, the stemming, and the fancy ribbons. Anyway, with a goal in mind, it's Python time!

The idea is simple: open a lyrics file and break it down into a list of words. Open a diction file (the one with the word-translation pairs), break it down into pairs, and put them into a dictionary. For every word in the lyrics word list, check whether the dictionary has an entry with that word as its key. If yes, wrap the word in an acronym tag and write it to the output file; if no, just write the word to the output and move along.
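In miniature, and working on plain strings instead of files, that idea might look something like the sketch below. It is only an illustration: the function name annotate() and the two sample pairs are made up for this example, and joining lines with a line break tag plus a newline is an assumption about what the output should look like.

```python
def annotate(lyrics, pairs):
    """Wrap every word of `lyrics` that has a translation in an acronym tag."""
    # Build the lookup table: the first token of each entry is the word,
    # the rest of the entry is its translation.
    table = {}
    for entry in pairs.split('\n'):
        parts = entry.split(None, 1)
        if len(parts) > 1:
            table[parts[0]] = parts[1]

    out = []
    for line in lyrics.split('\n'):
        words = []
        for word in line.split():
            # Look the word up in lowercase, but keep its original spelling.
            meaning = table.get(word.lower())
            if meaning is not None:
                words.append('<acronym title="%s">%s</acronym>' % (meaning, word))
            else:
                words.append(word)
        out.append(' '.join(words))
    # Line break tag plus a newline, to keep the generated HTML readable.
    return '<br />\n'.join(out)


# Made-up sample data, only for illustration.
sample_pairs = "hana flower\nfune ship, boat"
print(annotate("Hana ga saku\nfune wa yuku", sample_pairs))
```

Words without an entry, such as "ga" or "saku" here, pass through untouched, which is exactly the "move along" branch.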

Now, at the risk of getting ridiculed for how sloppy it is, the code:

#everything here is put into a function,
#so the module can be imported nicely in command line

#head() is the main function.
#Then we have conditioner() for input conditioning,
#translator() for word lookup and writing output,
#and finally acronyser() that just fills acronym tags with the contents

def head():
  #first get an input romaji and diction file, and decide a name for the output
  inp = raw_input("input file: ")
  dic = raw_input("diction file: ")
  out = raw_input("output file: ")
  #try our best to open it and...
  try:
      inpfile = open(inp, 'r')
      dicfile = open(dic, 'r')
      outfile = open(out, 'w')
  #cry when something goes wrong
  except IOError:
      print "Something went wrong with opening files :("
      return
  #condition the data (see conditioner() for details, contains no shampoo)
  inpdata, dicdata = conditioner(inpfile, dicfile)
  #we don't need the input files anymore, might as well close them
  inpfile.close()
  dicfile.close()
  #now the fun part, translation and writing output
  translator(inpdata, dicdata, outfile)
  #done! go look at the output and be satisfied

def conditioner(inpfile, dicfile):
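  #break the lyrics into lines, then every line into a list of words
  inpdata = inpfile.read().split('\n')
  i = 0
  while i < len(inpdata):
      inpdata[i] = inpdata[i].split()
      i += 1
  #one pair per line: first token is the word, the rest is its translation
  diclist = dicfile.read().split('\n')
  dicdata = {}
  for line in diclist:
      line = line.split(None, 1)
      if len(line) > 1:
          dicdata[line[0]] = line[1]
  return (inpdata, dicdata)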

def translator(inpdata, dicdata, outfile):
  for line in inpdata:
      for word in line:
          if dicdata.has_key(word.lower()):
              outfile.write(acronyser(word, dicdata[word.lower()]))
          else:
              outfile.write(word+' ')
      outfile.write('<br />')
  outfile.close()
  return

def acronyser(word, title):
  return '<acronym title="%s">%s</acronym> ' % (title, word)

A little sample of the output from L'Arc~en~Ciel's Anemone:

azayaka na kisetsu aa hana ga saku no matsu koto naku fune wa yuku mada minu basho e shizuka ni moeru honoo wa dare ni mo kese wa shinai kara

Regardless of whether it was such a good idea in the first place, I had fun doing it.
