Metadata-Version: 2.1
Name: snowballstemmer
Version: 2.2.0
Summary: This package provides 29 stemmers for 28 languages generated from Snowball algorithms.
Home-page: https://github.com/snowballstem/snowball
Author: Snowball Developers
Author-email: snowball-discuss@lists.tartarus.org
License: BSD-3-Clause
Keywords: stemmer
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: Arabic
Classifier: Natural Language :: Basque
Classifier: Natural Language :: Catalan
Classifier: Natural Language :: Danish
Classifier: Natural Language :: Dutch
Classifier: Natural Language :: English
Classifier: Natural Language :: Finnish
Classifier: Natural Language :: French
Classifier: Natural Language :: German
Classifier: Natural Language :: Greek
Classifier: Natural Language :: Hindi
Classifier: Natural Language :: Hungarian
Classifier: Natural Language :: Indonesian
Classifier: Natural Language :: Irish
Classifier: Natural Language :: Italian
Classifier: Natural Language :: Lithuanian
Classifier: Natural Language :: Nepali
Classifier: Natural Language :: Norwegian
Classifier: Natural Language :: Portuguese
Classifier: Natural Language :: Romanian
Classifier: Natural Language :: Russian
Classifier: Natural Language :: Serbian
Classifier: Natural Language :: Spanish
Classifier: Natural Language :: Swedish
Classifier: Natural Language :: Tamil
Classifier: Natural Language :: Turkish
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Database
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Text Processing :: Indexing
Classifier: Topic :: Text Processing :: Linguistic
Description-Content-Type: text/x-rst
License-File: COPYING

Snowball stemming library collection for Python
===============================================

Python 3 (>= 3.3) is supported. We no longer actively support Python 2 as
the Python developers stopped supporting it at the start of 2020. Snowball
2.1.0 was the last release to officially support Python 2.

What is Stemming?
-----------------

Stemming maps different forms of the same word to a common "stem" - for
example, the English stemmer maps *connection*, *connections*, *connective*,
*connected*, and *connecting* to *connect*. So searching for *connected*
would also find documents which only have the other forms.

This stem form is often a word itself, but not always, because text search
systems (the intended field of use) do not require it. We also aim to
conflate words with the same meaning, rather than all words with a common
linguistic root (so *awe* and *awful* don't have the same stem).
Over-stemming is more problematic than under-stemming, so we tend not to
stem in cases that are hard to resolve. If you want to always reduce words
to a root form and/or get a root form which is itself a word, then Snowball's
stemming algorithms likely aren't the right answer.

How to use the library
----------------------

The ``snowballstemmer`` module has two functions.

The ``snowballstemmer.algorithms`` function returns a list of available
algorithm names.

The ``snowballstemmer.stemmer`` function takes an algorithm name and returns a
``Stemmer`` object.

``Stemmer`` objects have a ``Stemmer.stemWord(word)`` method and a
``Stemmer.stemWords(word[])`` method.

.. code-block:: python

   import snowballstemmer

   # Look up the English stemmer, then stem each word of the sentence.
   stemmer = snowballstemmer.stemmer('english')
   print(stemmer.stemWords("We are the world".split()))

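As a rough illustration of the two module functions and the ``stemWord``
method, the sketch below lists the available algorithm names and stems the
word forms from the example in "What is Stemming?". The variable names are
illustrative only, and the exact contents of the algorithm list may differ
between releases.

.. code-block:: python

   import snowballstemmer

   # List the algorithm names this installation knows about.
   names = snowballstemmer.algorithms()
   print(sorted(names))

   # Stem individual words with stemWord().
   stemmer = snowballstemmer.stemmer('english')
   forms = ["connection", "connections", "connective", "connected", "connecting"]
   stems = [stemmer.stemWord(form) for form in forms]
   print(stems)
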
Automatic Acceleration
----------------------

`PyStemmer <https://pypi.org/project/PyStemmer/>`_ is a wrapper module for
Snowball's ``libstemmer_c`` and should provide results 100% compatible with
**snowballstemmer**.

**PyStemmer** is faster because it wraps generated C versions of the stemmers;
**snowballstemmer** uses generated Python code and is slower but offers a pure
Python solution.

If PyStemmer is installed, ``snowballstemmer.stemmer`` returns a ``PyStemmer``
``Stemmer`` object which provides the same ``Stemmer.stemWord()`` and
``Stemmer.stemWords()`` methods.

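One way to see which implementation you ended up with is to inspect the
object that ``snowballstemmer.stemmer`` returns. This is only a sketch: it
assumes PyStemmer is importable under the module name ``Stemmer``, and it uses
generic Python introspection rather than any dedicated **snowballstemmer**
function.

.. code-block:: python

   import importlib.util

   import snowballstemmer

   # Assumption: PyStemmer installs a top-level module called "Stemmer".
   has_pystemmer = importlib.util.find_spec("Stemmer") is not None
   print("PyStemmer importable:", has_pystemmer)

   stemmer = snowballstemmer.stemmer('english')

   # Report the module and class of the returned object.
   cls = type(stemmer)
   print("Stemmer class:", cls.__module__ + "." + cls.__name__)

   # The calling code is the same whichever implementation is in use.
   stem = stemmer.stemWord("connections")
   print(stem)
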
Benchmark
~~~~~~~~~

This is a crude benchmark which measures the time for running each stemmer on
every word in its sample vocabulary (10,787,583 words over 26 languages). It's
not a realistic test of normal use as a real application would do much more
than just stemming. It's also skewed towards the stemmers which do more work
per word and towards those with larger sample vocabularies.

* Python 2.7 + **snowballstemmer** : 13m00s (15.0 * PyStemmer)
* Python 3.7 + **snowballstemmer** : 12m19s (14.2 * PyStemmer)
* PyPy 7.1.1 (Python 2.7.13) + **snowballstemmer** : 2m14s (2.6 * PyStemmer)
* PyPy 7.1.1 (Python 3.6.1) + **snowballstemmer** : 1m46s (2.0 * PyStemmer)
* Python 2.7 + **PyStemmer** : 52s

For reference, the equivalent test for C runs in 9 seconds.

These results are for Snowball 2.0.0. They're likely to evolve over time as
the code Snowball generates for both Python and C continues to improve. For
comparison, in a much older test over a different set of stemmers using
Python 2.7, **snowballstemmer** was 30 times slower than **PyStemmer**, or
9 times slower with **PyPy**.

The message to take away is that if you're stemming a lot of words you should
either install **PyStemmer** (which **snowballstemmer** will then automatically
use for you as described above) or use PyPy.

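If you want a rough feel for the speed on your own setup, a timing sketch
along the following lines may help. It is not the benchmark described above:
the word list is a small made-up sample, and the numbers it prints depend on
your interpreter and on whether PyStemmer is installed.

.. code-block:: python

   import time

   import snowballstemmer

   # Hypothetical sample vocabulary; replace it with your own word list.
   words = ["connection", "connections", "connective",
            "connected", "connecting"] * 1000

   stemmer = snowballstemmer.stemmer('english')

   # Time a single stemWords() call over the whole list.
   start = time.perf_counter()
   stems = stemmer.stemWords(words)
   elapsed = time.perf_counter() - start

   print("stemmed %d words in %.3f s" % (len(stems), elapsed))
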
The TestApp example
-------------------

The ``testapp.py`` example program allows you to run any of the stemmers
on a sample vocabulary.

Usage::

   testapp.py <algorithm> "sentences ... "

.. code-block:: bash

   $ python testapp.py English "sentences... "