An automatic spelling corrector in Python

Autocorrect


Spelling corrector in Python. It currently supports English, Polish, Turkish, Russian, Ukrainian, Czech, Portuguese, Greek, Italian, Vietnamese, French and Spanish, but you can easily add new languages.

Based on: https://github.com/phatpiglet/autocorrect and Peter Norvig’s spelling corrector.

Installation
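Assuming the package is published on PyPI under the project name shown above:

```shell
pip install autocorrect
```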

Examples

>>> from autocorrect import Speller
>>> spell = Speller()
>>> spell("I'm not sleapy and tehre is no place I'm giong to.")
"I'm not sleepy and there is no place I'm going to."

>>> spell = Speller('pl')
>>> spell('ptaaki latatją kluczmm')
'ptaki latają kluczem'

Speed

%timeit spell("I'm not sleapy and tehre is no place I'm giong to.")
373 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit spell("There is no comin to consiousnes without pain.")
150 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As you can see, correcting some words can take ~150 ms. If speed is important for your use case (e.g. a chatbot), you may want to use the option 'fast':

spell = Speller(fast=True)
%timeit spell("There is no comin to consiousnes without pain.")
344 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now the correction should always run in microseconds, but words with double typos (like 'consiousnes') won't be corrected.

OCR

When cleaning up OCR output, replacements account for the large majority of errors. If this is your case, you may want to use the option 'only_replacements':

spell = Speller(only_replacements=True)

Custom word sets

If you wish to use your own set of words for autocorrection, you can pass an nlp_data argument:

spell = Speller(nlp_data=your_word_frequency_dict)

where your_word_frequency_dict is a dictionary that maps words to their average frequencies in your text. If you only want to tweak the default word set a bit, you can edit the spell.nlp_data attribute after spell has been initialized.
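For illustration only (the words and counts below are made up), the expected shape is a plain word-to-count mapping, and a small tweak to such a word set is just a dict edit:

```python
# Hypothetical frequency dict of the shape Speller(nlp_data=...) expects;
# all words and counts here are invented for illustration.
your_word_frequency_dict = {"the": 1_000_000, "grammar": 12_000, "corrector": 300}

# Tweaking the word set "a bit" amounts to editing the mapping:
your_word_frequency_dict["colour"] = 9_000   # teach it an extra word
your_word_frequency_dict.pop("corrector")    # or drop one

print(sorted(your_word_frequency_dict))  # ['colour', 'grammar', 'the']
```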

Adding new languages

First, define special letters, by adding entries in word_regexes and alphabets dicts in autocorrect/constants.py.

Now you need a bunch of text. The easiest way is to download Wikipedia.
For example, for Russian you would go to:
https://dumps.wikimedia.org/ruwiki/latest/
and download ruwiki-latest-pages-articles.xml.bz2, then decompress it:

bzip2 -d ruwiki-latest-pages-articles.xml.bz2

After that:

>>> from autocorrect.word_count import count_words
>>> count_words('ruwiki-latest-pages-articles.xml', 'ru')

Then pack the resulting word_count.json:

tar -zcvf autocorrect/data/ru.tar.gz word_count.json

For the correction to work well, you need to cut out rarely used words. First, in test_all.py, write test words for your language and add them to optional_language_tests, the same way as it's done for the other languages. It's good to have at least 30 words. Now run:

python test_all.py find_threshold ru

and see which threshold value produces the fewest badly corrected words. After that, manually delete all words with fewer occurrences than the threshold you found from the file in ru.tar.gz (it's already sorted, so this should be easy).
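The trimming step can also be scripted; a hypothetical sketch (the words, counts, and threshold below are placeholders) that drops rare words from a word_count.json-style mapping:

```python
# Toy stand-in for the real word_count.json contents; counts are invented.
word_count = {"кот": 1200, "коты": 310, "котт": 3, "собака": 800, "сабака": 7}
threshold = 10  # the value found with `python test_all.py find_threshold ru`

# Keep only words that occur at least `threshold` times.
trimmed = {word: n for word, n in word_count.items() if n >= threshold}
print(len(trimmed))  # 3
```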

To distribute this language support to others, upload your tar.gz file to IPFS (for example with Pinata, which will pin the file so it doesn't disappear), and then add its path to ipfs_paths in constants.py. (Tip: put the file inside a folder first and upload the folder to IPFS, so that the downloaded file has the correct filename.)

Good luck!

Even a very literate person can make a typo or slip up in a word, and such mistakes do not always get caught on re-reading. Specialized tools can ensure the correctness of texts without direct human involvement.

Let's look at using the Python module pyenchant to detect misspelled words and suggest corrections.

When preparing various text documents (contracts, reports, and so on), correct spelling matters. Current software tools, in particular MS Office Word, highlight words containing errors. This is convenient and, importantly, visual.

But we may need to automate error detection when those tools are unavailable, or, when they are available, to do it without opening the document (or many documents). Or the text in question may simply be very long, and checking it by hand would take too much time.

This is where Python and the pyenchant module come to the rescue: the module not only checks the spelling of words but also suggests corrections.

The module is installed with the standard command:

pip install pyenchant

The code for checking a word's spelling is quite simple:

import enchant  # note: the import name is enchant, not pyenchant
dictionary = enchant.Dict("en_US")
print(dictionary.check("driver"))

Output: True

Let's deliberately misspell the word being checked:

print(dictionary.check("draiver"))

Output: False

We can print a list of suggested corrections for the word:

print(dictionary.suggest("draiver"))

Output: ['driver', 'drainer', 'Rivera']

The reader will probably wonder whether the module can also check the spelling of Russian words, and the answer is yes. However, this is not available by default: we need a dictionary. It can be found, for example, in the LibreOffice package under its installation path:

"…\LibreOffice\share\extensions\dict-ru"

Here we need two files: "ru_RU.aff" and "ru_RU.dic". They must be placed in the enchant module's folder where the dictionaries for the other languages are stored, at:

C:\…\Python\Python36\site-packages\enchant\data\mingw64\share\enchant\hunspell

Now it is enough to pass the string "ru_RU" when creating the Dict object, and we can work with Russian words.

Let's return to our example with the misspelled word driver. Using the suggest() method we obtained a list of possible corrections, and manually we could of course easily pick the right one.

But what if we want to automate this step as well?

Let's use the Python module difflib, which compares string sequences, and try to pick the word "driver" from the list:

import enchant
import difflib

woi = "draiver"  # the word of interest
sim = dict()

dictionary = enchant.Dict("en_US")
suggestions = set(dictionary.suggest(woi))

for word in suggestions:
    measure = difflib.SequenceMatcher(None, woi, word).ratio()
    sim[measure] = word

print("Correct word is:", sim[max(sim.keys())])

A few comments on the code. The sim dictionary stores the similarity scores (ranging from 0 to 1) between the words suggested by the Dict class's suggest() method and the word being checked ("draiver"). We compute these scores in the loop by calling the ratio() method of the SequenceMatcher class and record them in the dictionary. At the end we retrieve the word that is closest to the one being checked.

Output: Correct word is: driver
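The same selection logic can be tried without enchant installed; in this self-contained sketch the candidate list is hardcoded to stand in for dictionary.suggest("draiver"):

```python
import difflib

# Hardcoded stand-in for dictionary.suggest("draiver")
candidates = ["driver", "drainer", "Rivera"]

# Pick the candidate with the highest similarity ratio to the misspelled word.
best = max(candidates,
           key=lambda w: difflib.SequenceMatcher(None, "draiver", w).ratio())
print("Correct word is:", best)  # Correct word is: driver
```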

Above we worked with individual words, but it is also useful to know how to handle whole blocks of text. For that task we use the SpellChecker class:

from enchant.checker import SpellChecker

checker = SpellChecker("en_US")
checker.set_text("I have got a new kar and it is ameizing.")
print([i.word for i in checker])

Output: ['kar', 'ameizing']

As you can see, this is no harder than working with individual words. In addition, the SpellChecker class supports filters that ignore special sequences which are not actually errors, for example an email address. To use them, import the filter class (or classes, if there are several) and pass a list of filters to the filters parameter of SpellChecker:

from enchant.checker import SpellChecker
from enchant.tokenize import EmailFilter, URLFilter

checker_with_filters = SpellChecker("en_US", filters=[EmailFilter])
checker_with_filters.set_text("Hi! My neim is John and thiz is my email: johnnyhatesjazz@gmail.com.")
print([i.word for i in checker_with_filters])

Output: ['neim', 'thiz']

As you can see, the email address was not reported as a sequence containing spelling mistakes.

Thus, by combining the capabilities of the enchant and difflib modules, we get a genuinely powerful tool that not only detects errors but also picks correction candidates with fairly high accuracy, and can apply those corrections to the text.
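As a closing illustration, here is a self-contained sketch of that combined pipeline, with the standard library's difflib.get_close_matches and a toy vocabulary standing in for enchant's dictionary and suggestions:

```python
import difflib
import re

# Toy vocabulary standing in for an enchant dictionary.
vocab = ["i", "have", "got", "a", "new", "car", "and", "it", "is", "amazing"]

def correct_text(text):
    def fix(match):
        word = match.group(0)
        if word.lower() in vocab:
            return word  # already spelled correctly
        close = difflib.get_close_matches(word.lower(), vocab, n=1)
        return close[0] if close else word  # best suggestion, or leave as-is
    return re.sub(r"[A-Za-z]+", fix, text)

print(correct_text("I have got a new kar and it is ameizing"))
# I have got a new car and it is amazing
```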

Introduction

Spelling mistakes are common, and most people are used to software indicating if a mistake was made. From autocorrect on our phones, to red underlining in text editors, spell checking is an essential feature for many different products.

The first program to implement spell checking was written in 1971 for the DEC PDP-10. Called SPELL, it was capable of performing only simple comparisons of words and detecting one or two letter differences. As hardware and software advanced, so have spell checkers. Modern spell checkers are capable of handling morphology and using statistics to improve suggestions.

Python offers many modules for this purpose, which makes writing a simple spell checker an easy 20-minute task.

One of these libraries is TextBlob, a natural language processing library that provides an intuitive API to work with.

In this article we’ll take a look at how to implement spelling correction in Python with TextBlob.

Installation

First, we’ll need to install TextBlob, since it doesn’t come preinstalled. Open up a console and install it using pip:

$ pip install textblob

This should install everything we need for this project. Upon finishing the installation, the console output should include something like:

Successfully installed click-7.1.2 joblib-0.17.0 nltk-3.5 regex-2020.11.13 textblob-0.15.3

TextBlob is built on top of NLTK, so NLTK is installed along with it.

The correct() Function

The most straightforward way to correct input text is to use the correct() method. The example text we'll be using is a paragraph from Charles Darwin's "On the Origin of Species", which is in the public domain, packed into a file called text.txt.

In addition, we’ll add some deliberate spelling mistakes:

As far as I am abl to judg, after long attnding to the sbject, the condiions of lfe apear to act in two ways—directly on the whle organsaton or on certin parts alne and indirectly by afcting the reproducte sstem. Wit respct to te dirct action, we mst bea in mid tht in every cse, as Profesor Weismann hs latly insistd, and as I have inidently shwn in my wrk on "Variatin undr Domesticcation," thcere arae two factrs: namly, the natre of the orgnism and the natture of the condiions. The frmer sems to be much th mre importannt; foor nealy siimilar variations sometimes aris under, as far as we cn juddge, disimilar conditios; annd, on te oter hannd, disssimilar variatioons arise undder conditions which aappear to be nnearly uniiform. The efffects on tthe offspring arre ieither definnite or in definite. They maay be considdered as definnite whhen allc or neearly all thhe ofefspring off inadividuals exnposed tco ceertain conditionas duriing seveal ggenerations aree moodified in te saame maner.

It’s full of spelling mistakes, in almost every word. Let’s write up a simple script, using TextBlob, to correct these mistakes and print them back to the console:

from textblob import TextBlob

with open("text.txt", "r") as f:        # Opening the test file with the intention to read
    text = f.read()                     # Reading the file
    textBlb = TextBlob(text)            # Making our first textblob
    textCorrected = textBlb.correct()   # Correcting the text
    print(textCorrected)

If you’ve worked with TextBlob before, this flow will look familiar to you. We’ve read the file and the contents inside of it, and constructed a TextBlob instance by passing the contents to the constructor.

Then, we run the correct() function on that instance to perform spelling correction.

After running the script above, you should get an output similar to:

Is far as I am all to judge, after long attending to the subject, the conditions of life appear to act in two ways—directly on the while organisation or on certain parts alone and indirectly by acting the reproduce system. It respect to te direct action, we must be in mid the in every case, as Professor Weismann he lately insisted, and as I have evidently shown in my work on "Variation under Domesticcation," there are two facts: namely, the nature of the organism and the nature of the conditions. The former seems to be much th are important; for nearly similar variations sometimes arms under, as far as we in judge, similar condition; and, on te other hand, disssimilar variations arise under conditions which appear to be nearly uniform. The effects on the offspring are either definite or in definite. They may be considered as definite when all or nearly all the offspring off individuals exposed to certain conditions during several generations are modified in te same manner.

How Correct is TextBlob’s Spelling Correction?

As we can see, the text still has some spelling errors. Words like "abl" were supposed to become "able", not "all". Still, even with these, it's better than the original.

Now comes the question, how much better is it?

The following code snippet is a simple script that tests how well TextBlob corrects errors, based on this example:

from textblob import TextBlob

# A function that compares two texts word by word and returns
# the number of matches and differences
def compare(text1, text2):
    l1 = text1.split()
    l2 = text2.split()
    good = 0
    bad = 0
    for w1, w2 in zip(l1, l2):  # zip stops at the shorter text
        if w1 != w2:
            bad += 1
        else:
            good += 1
    return (good, bad)

# Helper function to calculate the percentage of misspelled words
def percentageOfBad(x):
    return (x[1] / (x[0] + x[1])) * 100
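As a quick arithmetic check of the helper, feeding it the (matches, differences) pair reported for the test text reproduces the percentage printed in the output below:

```python
# Same logic as percentageOfBad above, repeated so this snippet is self-contained.
def percentage_of_bad(counts):
    good, bad = counts
    return (bad / (good + bad)) * 100

print(percentage_of_bad((126, 194)))  # 60.62499999999999, i.e. ~60.6 %
```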

Now, with those two functions, let’s run a quick analysis:

with open("test.txt", "r") as f1: # test.txt contains the same typo-filled text from the last example 
    t1 = f1.read()

with open("original.txt", "r") as f2: # original.txt contains the text from the actual book 
    t2 = f2.read()

t3 = TextBlob(t1).correct()

mistakesCompOriginal = compare(t1, t2)
originalCompCorrected = compare(t2, t3)
mistakesCompCorrected = compare(t1, t3)

print("Mistakes compared to original ", mistakesCompOriginal)
print("Original compared to corrected ", originalCompCorrected)
print("Mistakes compared to corrected ", mistakesCompCorrected, "\n")

print("Percentage of mistakes in the test: ", percentageOfBad(mistakesCompOriginal), "%")
print("Percentage of mistakes in the corrected: ", percentageOfBad(originalCompCorrected), "%")
print("Percentage of fixed mistakes: ", percentageOfBad(mistakesCompCorrected), "%", "\n")


Running it will print out:

Mistakes compared to original  (126, 194)
Original compared to corrected  (269, 51)
Mistakes compared to corrected  (145, 175) 

Percentage of mistakes in the test:  60.62499999999999 %
Percentage of mistakes in the corrected:  15.937499999999998 %
Percentage of fixed mistakes:  54.6875 % 

As we can see, the correct() method managed to bring our spelling-mistake percentage down from 60.6% to 15.9%, which is pretty decent. However, there's a bit of a catch: it corrected 54.7% of the words, so why is there still a 15.9% mistake rate?

The answer is overcorrection. Sometimes it changes a word that was spelled correctly, like the first word in our example text, where "As" was corrected to "Is". At other times it simply doesn't have enough information about the word and its context to tell which word the user intended to type, so it guesses "while" as the replacement for "whl" instead of "whole".

There is no perfect spelling corrector because so much of spoken language is contextual, so keep that in mind. In most use cases, there are way fewer mistakes than in our example, so TextBlob should be able to work well enough for the average user.

Training TextBlob with Custom Datasets

What if you want to spellcheck another language which isn’t supported by TextBlob out of the box? Or maybe you want to get just a little bit more precise? Well, there might be a way to achieve this. It all comes down to the way spell checking works in TextBlob.

TextBlob uses statistics of word usage in English to make smart suggestions on which words to correct. It keeps these statistics in a file called en-spelling.txt, but it also allows you to make your very own word usage statistics file.

Let's try to make one for our Darwin example. We'll use all the words in "On the Origin of Species" for training. You can use any text; just make sure it has enough words that are relevant to the text you wish to correct.

In our case, the rest of the book will provide great context and additional information that TextBlob would need to be more accurate in the correction.

Let’s rewrite the script:

from textblob.en import Spelling        
import re

textToLower = ""

with open("originOfSpecies.txt","r") as f1:           # Open our source file
    text = f1.read()                                  # Read the file                 
    textToLower = text.lower()                        # Lower all the capital letters

words = re.findall("[a-z]+", textToLower)             # Find all the words and place them into a list    
oneString = " ".join(words)                           # Join them into one string

pathToFile = "train.txt"                              # The path we want to store our stats file at
spelling = Spelling(path = pathToFile)                # Connect the path to the Spelling object
spelling.train(oneString, pathToFile)                 # Train

If we look into the train.txt file, we’ll see:

a 3389
abdomen 3
aberrant 9
aberration 5
abhorrent 1
abilities 1
ability 4
abjectly 1
able 54
ably 5
abnormal 17
abnormally 2
abodes 2
...

This indicates that the word "a" shows up 3389 times, while "ably" shows up only 5 times. To test this trained model, we'll use suggest(text) instead of correct(text); it returns a list of (word, confidence) tuples. The first element in the list is the word the model is most confident about, so we can access it via suggest(text)[0][0].

Note that this approach can be slower, so spell-check word by word; dumping huge amounts of data into it at once can result in a crash:

from textblob.en import Spelling        
from textblob import TextBlob

pathToFile = "train.txt" 
spelling = Spelling(path = pathToFile)
text = " "

with open("test.txt", "r") as f: 
    text = f.read()

words = text.split()
corrected = ""
for word in words:
    corrected = corrected + " " + spelling.suggest(word)[0][0]  # spell-check word by word

print(corrected)

And now, this will result in:

As far as I am all to judge after long attending to the subject the conditions of life appear to act in two ways—directly on the whole organisation or on certain parts alone and indirectly by acting the reproduce system It respect to the direct action we most be in mid the in every case as Professor Weismann as lately insisted and as I have incidently shown in my work on "Variatin under Domesticcation," there are two facts namely the nature of the organism and the nature of the conditions The former seems to be much th are important for nearly similar variations sometimes arise under as far as we in judge dissimilar conditions and on the other hand dissimilar variations arise under conditions which appear to be nearly uniform The effects on the offspring are either definite or in definite They may be considered as definite when all or nearly all the offspring off individuals exposed to certain conditions during several generations are modified in the same manner.

This fixes around 2 out of 3 misspelled words, which is pretty good considering it ran without much context.

Conclusion

In this article we used TextBlob to implement a basic spelling corrector, both with the stock prediction model and with a custom one.

Correcting man-made spelling errors has become a common task for software developers. Even though it has become easier and more efficient via data mining, many spelling mistakes need context to be corrected.

In conclusion, proofreaders are probably not going to get automated out of work any time soon, though some basic corrections can be automated to save time and effort.

Орфографические ошибки являются обычным явлением, и большинство людей привыкло к программному обеспечению, указывающему, была ли ошибка допущена. От автокоррекции на наших телефонах до красного подчеркивания в текстовых редакторах — проверка орфографии является важной функцией для многих различных продуктов.

Первая программа, реализующая проверку орфографии, была написана в 1971 году для DEC PDP-10. Названный SPELL, он был способен выполнять только простые сравнения слов и обнаруживать различия в одной или двух буквах. По мере развития аппаратного и программного обеспечения появляются и средства проверки орфографии. Современные средства проверки правописания способны обрабатывать морфологию и использовать статистику для улучшения предложений.

Python предлагает множество модулей для этих целей, что делает написание простой проверки орфографии легким 20-минутным испытанием.

Одной из этих библиотек является TextBlob, которая используется для обработки естественного языка и предоставляет интуитивно понятный API для работы.

В этой статье мы рассмотрим, как реализовать исправление орфографии в Python с помощью TextBlob.

Установка

Во-первых, нам нужно установить TextBlob, поскольку он не предустановлен. Откройте консоль и установите его с помощью pip:

Это должно установить все, что нам нужно для этого проекта. По окончании установки вывод консоли должен включать что-то вроде:

Successfully installed click-7.1.2 joblib-0.17.0 nltk-3.5 regex-2020.11.13 textblob-0.15.3

TextBlob построен на основе NLTK, поэтому он также поставляется с установкой.

Функция correct()

Самый простой способ исправить вводимый текст — использовать метод correct(). В качестве примера мы будем использовать абзац из книги Чарльза Дарвина «О происхождении видов», которая является частью общественного достояния и упакована в файл с именем text.txt.

Кроме того, мы добавим несколько умышленных орфографических ошибок:

As far as I am abl to judg, after long attnding to the sbject, the condiions of lfe apear to act in two ways—directly on the whle organsaton or on certin parts alne and indirectly by afcting the reproducte sstem. Wit respct to te dirct action, we mst bea in mid tht in every cse, as Profesor Weismann hs latly insistd, and as I have inidently shwn in my wrk on "Variatin undr Domesticcation," thcere arae two factrs: namly, the natre of the orgnism and the natture of the condiions. The frmer sems to be much th mre importannt; foor nealy siimilar variations sometimes aris under, as far as we cn juddge, disimilar conditios; annd, on te oter hannd, disssimilar variatioons arise undder conditions which aappear to be nnearly uniiform. The efffects on tthe offspring arre ieither definnite or in definite. They maay be considdered as definnite whhen allc or neearly all thhe ofefspring off inadividuals exnposed tco ceertain conditionas duriing seveal ggenerations aree moodified in te saame maner.

Это полный орфографических ошибок текст, почти в каждом слове. Давайте напишем простой скрипт, используя TextBlob, чтобы исправить эти ошибки и распечатать их обратно в консоль:

from textblob import TextBlob

with open("text.txt", "r") as f:        # Opening the test file with the intention to read
    text = f.read()                     # Reading the file
    textBlb = TextBlob(text)            # Making our first textblob
    textCorrected = textBlb.correct()   # Correcting the text
    print(textCorrected)

Если вы раньше работали с TextBlob, этот алгоритм будет вам знаком. Мы прочитали файл и его содержимое и создали экземпляр TextBlob, передав содержимое конструктору.

Затем мы запускаем функцию correct() в этом экземпляре для исправления орфографии.

После запуска приведенного выше сценария вы должны получить примерно такой результат:

Is far as I am all to judge, after long attending to the subject, the conditions of life appear to act in two ways—directly on the while organisation or on certain parts alone and indirectly by acting the reproduce system. It respect to te direct action, we must be in mid the in every case, as Professor Weismann he lately insisted, and as I have evidently shown in my work on "Variation under Domesticcation," there are two facts: namely, the nature of the organism and the nature of the conditions. The former seems to be much th are important; for nearly similar variations sometimes arms under, as far as we in judge, similar condition; and, on te other hand, disssimilar variations arise under conditions which appear to be nearly uniform. The effects on the offspring are either definite or in definite. They may be considered as definite when all or nearly all the offspring off individuals exposed to certain conditions during several generations are modified in te same manner.

Насколько верна коррекция орфографии TextBlob?

Как видим, в тексте все еще есть орфографические ошибки. Слова вроде "abl" должны были быть "able", а не "all". Хотя даже с ними все равно лучше оригинала.

Теперь возникает вопрос, насколько это лучше?

Следующий фрагмент кода представляет собой простой сценарий, который проверяет, насколько хорошо TextBlob исправляет ошибки, на основе этого примера:

from textblob import TextBlob

# A function that compares two texts and returns 
# the number of matches and differences
def compare(text1, text2):  
    l1 = text1.split()
    l2 = text2.split()
    good = 0
    bad = 0
    for i in range(0, len(l1)):
        if l1[i] != l2[i]:
            bad += 1
        else:
            good += 1
    return (good, bad)

# Helper function to calculate the percentage of misspelled words
def percentageOfBad(x):
    return (x[1] / (x[0] + x[1])) * 100

Теперь, используя эти две функции, давайте проведем быстрый анализ:

with open("test.txt", "r") as f1: # test.txt contains the same typo-filled text from the last example 
    t1 = f1.read()

with open("original.txt", "r") as f2: # original.txt contains the text from the actual book 
    t2 = f2.read()

t3 = TextBlob(t1).correct()

mistakesCompOriginal = compare(t1, t2)
originalCompCorrected = compare(t2, t3)
mistakesCompCorrected = compare(t1, t3)

print("Mistakes compared to original ", mistakesCompOriginal)
print("Original compared to corrected ", originalCompCorrected)
print("Mistakes compared to corrected ", mistakesCompCorrected, "\n")

print("Percentage of mistakes in the test: ", percentageOfBad(mistakesCompOriginal), "%")
print("Percentage of mistakes in the corrected: ", percentageOfBad(originalCompCorrected), "%")
print("Percentage of fixed mistakes: ", percentageOfBad(mistakesCompCorrected), "%", "\n")

Запустив его, вы распечатаете:

Mistakes compared to original  (126, 194)
Original compared to corrected  (269, 51)
Mistakes compared to corrected  (145, 175) 

Percentage of mistakes in the test:  60.62499999999999 %
Percentage of mistakes in the corrected:  15.937499999999998 %
Percentage of fixed mistakes:  54.6875 % 

Как мы видим, методу correct удалось уменьшить процент орфографических ошибок с 60,6% до 15,9%, что довольно неплохо, однако есть небольшая загвоздка. Он исправил 54,7% слов, так почему все еще остается 15,9% ошибок?

Ответ — чрезмерное исправление. Иногда он может изменить слово, которое написано правильно, например, первое слово в нашем примере текста, где "As" было исправлено "Is". В других случаях ему просто не хватает информации о слове и контексте, чтобы сказать, какое слово пользователь намеревался ввести, поэтому он догадывается, что его следует заменить "whl" на "while" вместо "whole".

Не существует идеального корректора орфографии, потому что большая часть разговорной речи зависит от контекста, так что имейте это в виду. В большинстве случаев ошибок гораздо меньше, чем в нашем примере, поэтому TextBlob должен работать достаточно хорошо для обычного пользователя.

Обучающий TextBlob с настраиваемыми наборами данных

Что, если вы хотите проверить орфографию на другом языке, который не поддерживается TextBlob из коробки? Или, может быть, вы хотите быть немного точнее? Что ж, может быть способ добиться этого. Все сводится к тому, как работает проверка орфографии в TextBlob.

TextBlob использует статистику использования слов на английском языке, чтобы делать разумные предложения по поводу того, какие слова следует исправить. Он хранит эту статистику в файле с именем en-spelling.txt, но также позволяет вам создать свой собственный файл статистики использования слов.

Попробуем сделать такой для нашего примера Дарвина. Мы будем использовать все слова из «Происхождения видов» для обучения. Вы можете использовать любой текст, просто убедитесь, что в нем достаточно слов, имеющих отношение к тексту, который вы хотите исправить.

В нашем случае остальная часть книги предоставит отличный контекст и дополнительную информацию, которая потребуется TextBlob для более точного исправления.

Перепишем скрипт:

from textblob.en import Spelling        
import re

textToLower = ""

with open("originOfSpecies.txt","r") as f1:           # Open our source file
	text = f1.read()                                  # Read the file                 
	textToLower = text.lower()                        # Lower all the capital letters

words = re.findall("[a-z]+", textToLower)             # Find all the words and place them into a list    
oneString = " ".join(words)                           # Join them into one string

pathToFile = "train.txt"                              # The path we want to store our stats file at
spelling = Spelling(path = pathToFile)                # Connect the path to the Spelling object
spelling.train(oneString, pathToFile)                 # Train

Если мы заглянем в файл train.txt, то увидим:

a 3389
abdomen 3
aberrant 9
aberration 5
abhorrent 1
abilities 1
ability 4
abjectly 1
able 54
ably 5
abnormal 17
abnormally 2
abodes 2
...

Это означает, что слово "a" отображается как слово 3389 раз, а "ably"только 5 раз. Чтобы проверить эту обученную модель, мы будем использовать suggest(text) вместо correct(text), который представляет собой список кортежей доверия слов. Первым элементом в списке будет слово, в котором он уверен, поэтому мы можем получить к нему доступ через suggest(text)[0][0].

Обратите внимание, что это может быть медленнее, поэтому проверяйте орфографию слово за словом, так как сброс огромных объемов данных может привести к сбою:

from textblob.en import Spelling        
from textblob import TextBlob

pathToFile = "train.txt" 
spelling = Spelling(path = pathToFile)
text = " "

with open("test.txt", "r") as f: 
	text = f.read()

words = text.split()
corrected = " "
for i in words :
    corrected = corrected +" "+ spelling.suggest(i)[0][0] # Spell checking word by word

print(corrected)

И теперь это приведет к:

As far as I am all to judge after long attending to the subject the conditions of life appear to act in two ways—directly on the whole organisation or on certain parts alone and indirectly by acting the reproduce system It respect to the direct action we most be in mid the in every case as Professor Weismann as lately insisted and as I have incidently shown in my work on "Variatin under Domesticcation," there are two facts namely the nature of the organism and the nature of the conditions The former seems to be much th are important for nearly similar variations sometimes arise under as far as we in judge dissimilar conditions and on the other hand dissimilar variations arise under conditions which appear to be nearly uniform The effects on the offspring are either definite or in definite They may be considered as definite when all or nearly all the offspring off individuals exposed to certain conditions during several generations are modified in the same manner.

This corrects roughly 2 out of 3 misspelled words, which is quite good considering it runs without much context.

Using Python, you can check whether a sentence is correct in an automated manner and get suggestions for fixing bad spelling and grammar. Processing text in this way falls under the category of natural language processing (NLP). This article documents how to use Sapling as a Python spelling and grammar checker and compares it to a few other open-source Python libraries that may be more suitable for individual, non-commercial, or open-source projects. For each library, check out the installation guide as well as some sample quick-start Python code that demonstrates how to use each SDK.

Sapling

Sapling offers a deep neural network language model trained on millions of sentences. Outside of English, it also supports more than 10 different other languages in its language model and regular spell checking for more than 30 other languages. This style of automated proofreading can identify fluency improvements as well as areas where a correct English word was used but would be considered incorrect in the context of the sentence.

For use cases that have security, privacy, or regulatory requirements, Sapling is HIPAA compliant, SOC2 compliant, and offers options for no-data retention or on-premise/self-hosted models. The on-premise version allows users to host the Sapling service in their own cloud or infrastructure so that processed data will stay in a specific geographical region or compute environment.

Get an API key for free and use it for testing or personal use. The free API key comes with limits on usage. The paid version of Sapling’s API has no throttling limits and costs money based on usage.

Sapling’s Python grammar checker is licensed under Apache 2.0: there are no restrictions on how you can use it. This license makes Sapling compatible with commercial software products that want to keep their code proprietary. An alternative JavaScript library also exists for backend applications that use a JavaScript runtime environment like Node.js, or for applications that have an HTML or web-based front end and process text from textareas and content editables. Sapling also has an HTTP API that can be called directly from other languages like PHP or Ruby (or any scripting language that supports HTTP POST and GET requests).

Installing Sapling

  • Visit Sapling.ai to register an account.
  • Visit the dashboard to generate an API key.
  • Install Sapling’s SDK:
python -m pip install sapling-py

If you don’t have pip, you can follow the instructions here to install it: https://pip.pypa.io/en/stable/installation/

Sapling Usage

from sapling import SaplingClient

api_key = '<API_KEY>'
client = SaplingClient(api_key=api_key)
edits = client.edits('Lets get started!', session_id='test_session')

''' returns -> 
[{
  "id": "aa5ee291-a073-5146-8ebc-c9c899d01278",
  "sentence": "Lets get started!",
  "sentence_start": 0,
  "start": 0,
  "end": 4,
  "replacement": "Let's",
  "error_type": "R:OTHER",
  "general_error_type": "Other",
}]
''' 
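Each edit carries `sentence_start`, `start`, and `end` offsets plus a `replacement` string. As a sketch (this helper is not part of the Sapling SDK), edits in this shape can be applied to the original text by processing them in reverse offset order, so that earlier offsets stay valid while the string changes length:

```python
def apply_edits(text, edits):
    """Apply Sapling-style edits to text.

    Each edit has 'sentence_start', 'start', 'end' (character offsets)
    and 'replacement'. Applying edits from last to first keeps earlier
    offsets valid as the string changes length.
    """
    ordered = sorted(edits,
                     key=lambda e: e["sentence_start"] + e["start"],
                     reverse=True)
    for edit in ordered:
        start = edit["sentence_start"] + edit["start"]
        end = edit["sentence_start"] + edit["end"]
        text = text[:start] + edit["replacement"] + text[end:]
    return text

edits = [{
    "sentence": "Lets get started!",
    "sentence_start": 0,
    "start": 0,
    "end": 4,
    "replacement": "Let's",
}]
print(apply_edits("Lets get started!", edits))  # Let's get started!
```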

You can read more in:

  • Python Docs: https://sapling.readthedocs.io/en/latest/api.html
  • Sapling Developer Docs: https://sapling.ai/docs

Open Source Libraries and Licenses

Before we discuss the open-source grammar checkers in the next section, here is a quick overview of their licenses. If you are already familiar with open-source software licenses, you can skip this section.

For developers producing non-commercial products (like personal or research projects), open-source libraries may be a good choice. These are free and configurable. The trade-off between free and better performance or support may be an obvious one for those with budget constraints; however, it is also important to understand the restrictions.

Most open-source software licenses give users permission to modify and distribute the library in question. However, the copyleft licenses used by some of the Python grammar checkers on this list require modifications to the original code to be released publicly under the same license.

  • GNU Lesser General Public License (LGPL): Programs that incorporate LGPL code also need to be LGPL. You can get around this limitation by dynamically linking to LGPL code. If the LGPL code is ever distributed to an end user, the user needs to be able to re-link the application to their own version of the LGPL library. This can work on platforms that allow for library changes, like Windows, MacOS, Linux, but is not possible for others, like iOS. When building an internal tool, or a purely server-based SaaS tool, the distribution clause does not apply.
  • Mozilla Public License (MPL): MPL is more permissive and allows for static linking of libraries. There are no re-linking requirements. This permissive license is easier to integrate into a commercial software product compared to GPL and LGPL.
  • BSD, MIT, Apache: These licenses are permissive and grant use, distribution and relicensing rights, making them the easiest to use with commercial products.

LanguageTool

LanguageTool is an open-source (LGPL) rules-based grammar checker. It is available as a cloud HTTP endpoint hosted by the LanguageTool company. This version has a free offering with usage limits (20 requests per minute) and correction limits (30 misspelled words), as well as a paid offering with fewer restrictions. The cloud offering is currently neither SOC2 nor HIPAA compliant. You can also run the Java backend yourself and call it through Python bindings; however, having to maintain and run a separate Java server or process alongside the Python grammar checker client makes maintenance more complicated.

LanguageTool comes with a database of community-curated grammar rules for different languages. Keep in mind that some of the other languages may not have grammar-rule coverage as good as English's.

Installing LanguageTool Backend

Local hosting of the backend is optional but can help keep text processing local for privacy and security reasons.

  1. Download the Java executable: https://languagetool.org/download/LanguageTool-stable.zip
  2. Install Java: https://www.java.com/en/download/help/download_options.html
  3. Run the LanguageTool Backend:
java -cp languagetool-server.jar org.languagetool.server.HTTPServer --port 8081 --allow-origin

Installing LanguageTool Python Client

pip install language-tool-python

LanguageTool Usage

import language_tool_python

tool = language_tool_python.LanguageTool('en-US')  # use a local server
tool = language_tool_python.LanguageToolPublicAPI('en-US') # or use public API

tool.correct('A sentence with a error in the Hitchhiker’s Guide tot he Galaxy')
# returns -> 'A sentence with an error in the Hitchhiker’s Guide to the Galaxy'

If you are looking for an alternative open-source Python grammar checker that utilizes the LanguageTool API:

  • language-check: https://github.com/myint/language-check/
  • pyLanguagetool: https://github.com/Findus23/pyLanguagetool

Hunspell

Hunspell is a popular open-source spell checker that you have likely come across before, because it is integrated by default into Firefox, Chrome, and LibreOffice. It has extended Unicode support and handles language peculiarities like compounding and complex morphology. The name of the library comes from the fact that it is based on MySpell and works with MySpell dictionaries; one of the first languages supported was Hungarian. This is a good spell checker to integrate if you value a library that is widely used and actively maintained.

Hunspell is written in C++ but you can use it in Python as a spell checker through Cython bindings. Hunspell is licensed under 3 separate licenses: GPL/LGPL/MPL. The MPL license makes Hunspell more permissive and easier to integrate into commercial products compared to Aspell, another spell checker which we will describe later.

Installing Hunspell

sudo apt install autoconf automake autopoint libtool

git clone https://github.com/hunspell/hunspell.git
cd hunspell
autoreconf -vfi
./configure
make
sudo make install
sudo ldconfig

Installing Hunspell Python Bindings

pip install cyhunspell

Hunspell Usage

from hunspell import Hunspell
h = Hunspell()

h.spell('correct') # True
h.spell('incorect') # False

Aspell

Aspell is an open-source spell checker that performs slightly better than Hunspell. In addition to spell checking, Aspell also has built-in functionality to suggest alternatives to words, even if they exist in the dictionary. These suggestions can be used to catch cases where a dictionary word is written but may not be the intended word or is incorrect in context. Keep in mind, though, that Aspell does not do full grammar checking. While Aspell is a C++ library, you can use it as a Python spell checker through C++ bindings.

The wider adoption of Hunspell over Aspell is most likely due to Aspell being licensed under LGPL, which is less permissive than MPL. If you are building a non-commercial or backend Python application, Aspell is likely a better choice than Hunspell.

Install Aspell

git clone https://github.com/GNUAspell/aspell.git
cd aspell
./autogen
./configure --disable-static --enable-32-bit-hash-fun
make
make install

Install an Aspell Dictionary

# download dictionary from here https://ftp.gnu.org/gnu/aspell/dict/
cd aspell6-en-2019.10.06-0
./configure
make
make install

Install Aspell Python Bindings

pip install aspell-python-py3

Aspell Usage

import aspell

s = aspell.Speller(('lang', 'en_US'))
s.check('word') # correct word -> returns True
s.check('wrod') # incorrect -> returns False
s.suggest('wrod') # -> return suggestions for input

Building Your Own

Grammar checkers are more complex to build from the ground up: they require either maintaining a database of rules to match against or enough data to train an effective machine-learning language model. Nowadays the most effective models are based on neural nets, but statistical models can also be trained. Both the training and maintenance of your own grammar checker can be expensive. This path is preferable only if you want to invest in your or your team's expertise in natural language processing.

Building a Spelling Checker

Building a spell checker in Python that takes text and suggests spelling corrections can be done in fewer than 50 lines of code. Starting with a dictionary or a list of words, the algorithm looks up each word in the sentence. For words that are not in the dictionary, suggestions are generated based on edit distance (the number of characters that need to change) compared to dictionary words. Suggestions are ranked on the assumption that candidates with a lower edit distance are more likely to be the intended word.
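A minimal sketch of this candidate-generation-and-ranking idea, in the spirit of Peter Norvig's corrector; the tiny word-frequency dictionary here is invented purely for illustration:

```python
import string

# Toy word-frequency dictionary; a real checker would build this from a corpus.
WORDS = {"the": 100, "there": 30, "their": 25, "sleepy": 5, "going": 40}

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    """Pick the known candidate with the highest frequency, preferring the
    word itself, then edit distance 1, then edit distance 2."""
    candidates = (
        {word} & WORDS.keys()
        or edits1(word) & WORDS.keys()
        or {e2 for e1 in edits1(word) for e2 in edits1(e1)} & WORDS.keys()
        or {word}
    )
    return max(candidates, key=lambda w: WORDS.get(w, 0))

print(correction("tehre"))   # there
print(correction("sleapy"))  # sleepy
```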

An example of this algorithm and the relevant Python code has been posted by Peter Norvig, a prominent AI computer scientist and co-author of the popular AI textbook “Artificial Intelligence: A Modern Approach”. You can read about his approach in “How to Write a Spelling Corrector”.

Building a Statistical based Grammar Checker

Statistics-based grammar checkers share an architecture very similar to statistical machine translation. They break down words and phrases into statistical likelihoods and use those to predict whether sentences are correct or incorrect. If replacement words or phrases are deemed statistically more likely, corrections can be suggested.
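A toy illustration of the statistical idea: score a sentence by how many of its bigrams (adjacent word pairs) were seen in training data. The tiny corpus here is invented for illustration; a real system would train on millions of sentences and use smoothed probabilities rather than raw counts.

```python
from collections import Counter

# Tiny training corpus; a real model would use millions of sentences.
corpus = [
    "there is no place like home",
    "there is no time",
    "no place is safe",
]

bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    bigrams.update(zip(words, words[1:]))

def score(sentence):
    """Fraction of the sentence's bigrams seen in training.
    Low scores hint that the sentence may be ungrammatical or unusual."""
    words = sentence.split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if bigrams[p] > 0) / len(pairs)

print(score("there is no place"))  # 1.0 -> every bigram was seen
print(score("place no is there"))  # 0.0 -> no bigram was seen
```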

Symspell is an MIT-licensed spell correction and fuzzy search library. The original library is written in C#, but various Python ports exist; some of them are linked from the original repository: https://github.com/wolfgarbe/SymSpell. This library can be trained as a statistical model on your text and then used as a spell checker.

Building a Neural Network based Grammar Checker

A neural-network-based grammar checker shares the same architecture as neural machine translation. The steps required to build such a library from scratch are outside the scope of this blog post, but some Python frameworks can be used with pre-trained models; an example is happy-transformer: https://github.com/EricFillion/happy-transformer. Other frameworks like PyTorch and TensorFlow can also be used to train your own language models.

The Best Python Spelling and Grammar Checker

Finding the optimal Python spelling and grammar checker will depend on your project requirements. Python’s support for HTTP POST and GET operations means that you can also use a non-Python HTTP API for this purpose. Popular grammar-checking services like Grammarly that do not have a Python or HTTP API were also not included. Likewise, we excluded spelling and grammar check APIs that do not provide Python support from this overview. You can visit this page for a comparison of JavaScript spelling and grammar checkers.

| Library | Pros | Cons |
| --- | --- | --- |
| Sapling | Serverless; multiple language support; neural network grammar checking | Costs money |
| LanguageTool | Multiple language support | Costs money, or hosting resources |
| Hunspell | Serverless; multiple language support | No grammar checking |
| Aspell | Serverless; multiple language support; more performant than Hunspell | LGPL license is more restrictive; no grammar checking |
| Build your own | ML expertise as a competitive advantage | Engineering cost of training and maintenance |
