
Python Programming Question

Description

This homework asks you to do basic textual analysis to study the n-gram distribution of different languages, and examine a “mystery” text to determine what language it is in.

You will then perform a TF-IDF analysis on a set of documents to find their most distinctive n-grams.

Goals

In this assignment you will:

  • Write functions to perform a basic n-gram analysis of texts
  • Visualize the n-gram distribution of different texts
  • Use this information to analyze a new text.
  • Compute TF-IDF scores for a set of documents, then find the most distinctive words.

Background

N-grams

In this week’s lecture we discussed n-grams: a way of breaking up text into smaller chunks to analyze their distribution. n-grams are defined over either the words in a sentence (so the 3-grams in “words in a sentence” are “words in a” and “in a sentence”) or the characters in a sentence (so the 3-grams in “sentence” are “sen”, “ent”, “nte”, “ten”, “enc”, “nce”). In this homework, we will use characters.

It is common to “pad” strings with dummy characters (e.g., _) to produce more useful n-grams (to correctly capture, for example, that ‘s’ is the most common letter to start words through the 3-gram __s). The easiest way to do this is to alter the sentence that you’re building n-grams over by adding _s to the beginning and end.
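For example, a minimal sketch of extracting padded character n-grams from one line might look like the following (the function name and the amount of padding, n-1 underscores on each side, are illustrative assumptions, not requirements of the starter code):

<code>def char_ngrams(line, n):
    # Illustrative sketch: lowercase and strip the line, pad both ends with
    # n-1 underscores, then slide a window of width n across the result.
    padded = '_' * (n - 1) + line.lower().strip() + '_' * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams('Sentence', 3))
# ['__s', '_se', 'sen', 'ent', 'nte', 'ten', 'enc', 'nce', 'ce_', 'e__']
</code>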

An interesting feature of n-grams is that different languages have different distributions of n-grams. (In the simplest case, the most common letter in English is e, but the most common letter in Spanish is a; the “1-grams” for English and Spanish have different distributions.) In this homework you will build n-gram distributions for texts in different languages for various values of n, and you will find the pair of languages that are the most similar by analyzing their n-gram distributions. You will also see how the similarity value changes as n increases.

TF-IDF

We also discussed TF-IDF as a metric for finding the “most distinctive” words in documents. However, TF-IDF can be used to analyze other parts of documents as well! In this homework, we will compute TF-IDF scores for a set of documents and use them to determine the most distinctive n-grams in each document. You can refer to the class notes on Brightspace for help with this part of the homework. You may reuse functions written for Problem 1 to complete this part.

NLTK

NLTK is the Natural Language Toolkit, a set of common text-processing tools for Python. You can install NLTK using:

<code>> python3 -m pip install --user nltk
</code>

Note: if you are using Jupyter Notebook on Scholar to do your assignments, you will want to run this command from a terminal window on Scholar (by SSHing to Scholar, or by opening a terminal window through ThinLinc). Some students have had trouble installing modules through Jupyter Notebook’s built-in terminal.

We will use NLTK to clean the documents before processing them. To clean each document, we will:

  1. Remove stop words from the document
  2. Remove punctuation from the document (you may use the remove_punc helper method in helper.py to help with this)
  3. Make the words lower case
  4. Remove all spaces between the words

There are examples of steps 1 and 3 in the nltk_tutorial.ipynb notebook shown in class (available on Brightspace).
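As a rough sketch of those cleaning steps (this assumes NLTK’s English stopword list and uses string.punctuation as a stand-in for the remove_punc helper in helper.py; follow the numbered steps and helper.py for the actual assignment):

<code>import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def clean_sketch(text):
    # Rough sketch of the four cleaning steps described above.
    stops = set(stopwords.words('english'))
    words = [w for w in word_tokenize(text) if w.lower() not in stops]   # 1. remove stop words
    no_punc = [w.translate(str.maketrans('', '', string.punctuation))    # 2. remove punctuation
               for w in words]                                           #    (stand-in for remove_punc)
    lowered = [w.lower() for w in no_punc if w]                          # 3. lowercase
    return ''.join(lowered)                                              # 4. no spaces between words
</code>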

Instructions

0) Set up your repository for this homework.

Use the link on Piazza to set up Homework 8.

The repository should contain several files:

  1. This README
  2. Two starter files with function stubs, called hw8_1.py and hw8_2.py
  3. A helper file called helper.py (this contains code to remove punctuation from a string)
  4. Seven translations of the Universal Declaration of Human Rights in different languages, in the subdirectory ngrams/: english.txt, french.txt, german.txt, italian.txt, portuguese.txt, spanish.txt, swahili.txt.
  5. A directory, lecs/, that contains 14 text files; these are the documents you will process for Problem 2.

1) Homework Problem 1: Identifying a Language from n-gram Distributions

Step 1:

First, fill in the functions get_formatted_text, get_ngrams, get_dict, and top_N_common, as described in hw8_1.py, to read an input file into a list of its lines, process those lines to construct a dictionary of n-grams, and finally output a list of tuples containing the N largest (n-gram, count) pairs in decreasing order of count. get_ngrams, get_dict, and top_N_common also take in n as an argument, which is the length of each n-gram. For example:

<code>>>> top_N_common('ngrams/english.txt',10,3)
[('the', 149), (' th', 142), (' an', 129), ('he ', 121), ('nd ', 113), ('and', 111), ('ion', 102), (' of', 93), ('of ', 89), ('tio', 88)]
</code>

It is recommended that you review the material on dictionaries, sets (where needed), and how to strip whitespace and newlines using the strip() method.

The dictionaries you will create will map an n-gram (the dictionary key) to the number of times that the n-gram appears (the dictionary value). Pay careful attention to the processing we want you to do on each line (pad it out with _s, make all the text lowercase) and how we want you to do the processing.
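One common pattern for accumulating the counts looks roughly like this (a sketch only; lines and the arguments here stand for whatever your get_formatted_text and get_ngrams return):

<code># Sketch: build a dictionary mapping each n-gram to its number of occurrences.
counts = {}
for line in lines:                      # e.g., the list from get_formatted_text
    for gram in get_ngrams(line, n):    # the n-grams of one processed line
        counts[gram] = counts.get(gram, 0) + 1
</code>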

When writing top_N_common, you may find the following StackOverflow post helpful: https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value. You may also find the dictionary method popitem() to be useful.
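For instance, one way (not the only way) to pull the N most common entries out of such a dictionary:

<code># Sketch: sort (n-gram, count) pairs by count, largest first, and keep the top N.
top_n = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:N]
</code>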

There will be a line of comment symbols in hw8_1.py; once you have filled in the functions up to this point, you may run hw8_1.py to test that what you have so far works. (It should print the output shown above for top_N_common, followed by a possibly empty list and an empty string that will change as you fill in the functions in the later steps.)

Step 2:

Next, fill in the code for get_all_dicts, dict_union, and get_all_ngrams. get_all_dicts takes in a list of filepath strings and returns a list of the n-gram dictionaries for all the files. dict_union takes in a list of n-gram dictionaries for a group of files and creates one large, alphabetically sorted list of all the n-grams across all the input dictionaries. Note that this larger output list does not include the counts, only the n-grams. It is highly recommended that you look into the set data type and its methods such as union and update, as well as the dictionary method keys. get_all_ngrams takes in a list of filepaths and returns one list of all n-grams across all the files. Big hint: consider the functions you have already written in this step. Note that get_all_dicts and get_all_ngrams also take n as an argument, which defines the length of the n-gram.
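A sketch of the union step, assuming dicts is the list returned by get_all_dicts:

<code># Sketch: collect the distinct n-grams from every dictionary, then sort them.
all_grams = set()
for d in dicts:               # dicts as returned by get_all_dicts
    all_grams.update(d.keys())
union_list = sorted(all_grams)
</code>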

You may test your code at this point to see if you get a large list of alphabetized n-grams in addition to the output from step 1. This list is omitted here to avoid cluttering this README; however, the beginning of it should look something like the following for n=3:

<code>[' a ', ' ab', ' ac', ' ad', ' af', ' ag', ' ah', ' ai', ' aj', ' ak', ' al', ' am', ' an', ' ao', ' ap', ' aq', ' ar', ' as', ' at', ' au', ' av', ' ay', ' az', ' ba', ' be', ' bi', ' bo', ' br', ' bu', ' by', ' ca', ' ce', ' ch', ' ci', ' cl', ' co', ' cr', ' cu', ' da', ' de', ' dh', ' di', ' do', ' dr', ' du', ' e ', ' ea', ' ec', ' ed', ' ef', ' eg', ' eh', ' ei', ' ej', ' el', ' em', ' en', ' ep', ' eq', ' er', ' es', ' et', ' eu', ' ev', ' ex', ' fa', ' fe', ' fi', ' fo', ' fr', ' fu', ' ga', ' ge', ' gi', ' gl', ' go', ' gr', ' gu', ' ha', ' he', ' hi', ' ho', ' hu', ' i ', ' id', ' ie', ' if', ' ig', ' ih', ' ik', ' il', ' im', ' in', ' io', ' ir', ' is', ' it', ' iw', ' ja', ' je', ' ji', ' jo', ' ju', ' ka', ' ke', …
</code>

Step 3:

Finally, fill in the code for compare_langs. compare_langs takes in a group of files for different languages to be compared against each other, and a value N giving how many of the top n-grams to compare between each pair of language files. It should find the intersection of the top N n-grams for each pair of language files and determine which pair has the largest intersection. Be careful not to compare a file with itself, which would definitely affect your output!

Your output should be a tuple whose first and second elements are the filenames of the two text files. The third element of the tuple should be a normalized similarity score, which you can calculate by dividing the size of the intersection by the number of top n-grams compared. An example output tuple would be ('ngrams\french.txt', 'ngrams\spanish.txt', 0.469). What happens as you increase n?
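One possible shape for the pairwise comparison, sketched with itertools.combinations so that a file is never compared with itself (file_list, N, and n are illustrative names, and the return format must still match the description above):

<code>from itertools import combinations

# Sketch: intersect the top-N n-gram sets of each distinct pair of files and
# keep the pair with the largest normalized overlap.
best = None
for f1, f2 in combinations(file_list, 2):
    top1 = set(tup[0] for tup in top_N_common(f1, N, n))
    top2 = set(tup[0] for tup in top_N_common(f2, N, n))
    score = len(top1 & top2) / N
    if best is None or score > best[2]:
        best = (f1, f2, score)
</code>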

Please record similarity scores of all pairs of files and compare your results with the following test cases to make sure you’re on the right track.

For n=3: ('ngrams\french.txt', 'ngrams\spanish.txt', 0.469)
For n=4: ('ngrams\french.txt', 'ngrams\spanish.txt', 0.19)
For n=5: ('ngrams\french.txt', 'ngrams\spanish.txt', 0.154)

Some notes:

  • Once again, consider using the “set” data type and, for this situation, its method intersection().
  • The line myst_top_ngrams = set([tup[0] for tup in top_N_common(test_file, N, n)]) should give you a starting point for constructing a set of just the top N n-grams, discarding the count values, for the mystery language file. You’ll have to do something similar for the language files to compare them.
  • The list method index() may be useful.

Running hw8_1.py now should output something for steps 1, 2, and 3.

Submit the filled-in version of hw8_1.py.

2) Homework Problem 2: TF-IDF

For this problem, we will compute TF-IDF scores for the set of n-grams in each document in the lecs/ folder.

Fill in the missing functions in hw8_2.py according to their specifications. You can use any functions written for Problem 1 to aid you in this problem. The lecture notes on Brightspace may be helpful for thinking about the format of the doc-word matrix, and the notebooks shown in class may be helpful for thinking about how to construct one.

Note that even after installing nltk and importing it, you may need to add the following below your nltk import statement:

<code>nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
</code>

Step 1:

Fill in the code for read_and_clean_doc. This function takes in a file path (you may assume that it is a valid filepath) and outputs a cleaned string generated from the text in the file at the filepath. You should read the file into a single string, remove punctuation (see helper.py), remove stopwords, and remove all whitespace from the string such that it is a sequence of only characters (numbers are ok). This function should return the cleaned string after you have completed the above processing steps.

Fill in the code for build_doc_word_matrix. This function takes in a list of filepaths and a number n equal to the number of characters in an n-gram. You should generate a doc-word matrix with columns representing n-grams and rows representing each document in the given list. You may use functions previously written in hw8_1.py to complete this part, if you wish.
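As a rough sketch of the idea (the function name here is illustrative, numpy is assumed, and the real build_doc_word_matrix should follow the spec in hw8_2.py):

<code>import numpy as np

def doc_word_sketch(filepaths, n):
    # Sketch: rows are documents, columns are all n-grams seen anywhere in the
    # collection, and each cell holds the raw count of that n-gram in that doc.
    docs = [read_and_clean_doc(p) for p in filepaths]
    ngrams = sorted(set(doc[i:i + n]
                        for doc in docs
                        for i in range(len(doc) - n + 1)))
    col = {g: j for j, g in enumerate(ngrams)}
    mat = np.zeros((len(docs), len(ngrams)))
    for row, doc in enumerate(docs):
        for i in range(len(doc) - n + 1):
            mat[row, col[doc[i:i + n]]] += 1
    return mat, ngrams
</code>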

Step 2:

Fill in the code for build_tf_matrix and build_idf_matrix. In addition to the instructions given in hw8_2.py, you may want to refer to the lecture notes on Brightspace for how to complete these functions.

Fill in the code for build_tfidf_matrix.
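As a rough guide, one common formulation of these three matrices looks like the sketch below (term frequency normalized by document length, and a base-10 log for the IDF); treat this as an assumption and defer to hw8_2.py and the lecture notes for the exact definitions used in this course:

<code>import numpy as np

def tfidf_sketch(doc_word):
    # Sketch of one common TF-IDF formulation:
    #   tf[d, w] = count of w in d / total n-gram count in d
    #   idf[w]   = log10(number of documents / number of documents containing w)
    #   tfidf    = tf * idf (elementwise)
    tf = doc_word / doc_word.sum(axis=1, keepdims=True)
    df = (doc_word > 0).sum(axis=0)
    idf = np.log10(doc_word.shape[0] / df)
    return tf * idf
</code>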

Step 3:

Fill in the code for find_distinctive_ngrams. This function calculates and outputs the top 3 most distinctive n-grams found in each document (note: the most distinctive, not the most common). This function takes in a doc-word matrix, a list of n-grams, and a list of filepaths. You should return a dictionary in which each filepath is a key, and the value for each key is a list of the top 3 most distinctive n-grams in that file.
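One possible shape for this function, sketched with numpy and assuming the matrix passed in already holds TF-IDF scores:

<code>import numpy as np

def distinctive_sketch(tfidf_matrix, ngrams, filepaths):
    # Sketch: for each document (row), take the three n-grams with the
    # highest TF-IDF scores.
    result = {}
    for i, path in enumerate(filepaths):
        top_idx = np.argsort(tfidf_matrix[i])[::-1][:3]
        result[path] = [ngrams[j] for j in top_idx]
    return result
</code>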

While we will test your code using our own tests, at the end of hw8_2.py there are some good test cases to check your code as you go. Feel free to uncomment and re-comment them as convenient while you write the different functions. The output of those tests should be:

<code>*** Testing read_and_clean_doc ***
letslookwifisbackgro
*** Testing build_doc_word_matrix ***
(2, 1904)
1904
[ 1.  1.  4.  1. 16.  0.  0.  1.  1.  1.]
['003', '009', '00m', '013', '021', '02c', '02f', '02p', '038', '098']
[0. 0. 0. 0. 2. 1. 1. 2. 0. 0.]
*** Testing build_tf_matrix ***
[0.00024789 0.00024789 0.00099157 0.00024789 0.00396629 0.
 0.         0.00024789 0.00024789 0.00024789]
[0.         0.         0.         0.         0.0007837  0.00039185
 0.00039185 0.0007837  0.         0.        ]
[1. 1.]
*** Testing build_idf_matrix ***
[0.30103 0.30103 0.30103 0.30103 0.      0.30103 0.30103 0.      0.30103
 0.30103]
*** Testing build_tfidf_matrix ***
(2, 1904)
[7.46232017e-05 7.46232017e-05 2.98492807e-04 7.46232017e-05
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 7.46232017e-05 7.46232017e-05]
[0.         0.         0.         0.         0.         0.00011796
 0.00011796 0.         0.         0.        ]
*** Testing find_distinctive_words ***
{'lecs/1_vidText.txt': ['sec', 'tsp', 'ega'], 'lecs/2_vidText.txt': ['ute', 'poi', 'asi']}
</code>

Note: You may get an output like {'lecs\1_vidText.txt': ['sec', 'tsp', 'ega'], 'lecs\2_vidText.txt': ['ute', 'poi', 'asi']} when testing find_distinctive_words. This is also accepted. Different operating systems have different filesystem naming conventions, hence the \ vs / difference.

Submit your filled-in version of hw8_2.py.

What you need to submit

Each of the homework problems specify what file(s) to generate and submit for that problem.

