spaCy - v2.1.0-a0


🌙 This is an alpha pre-release of spaCy v2.1.0. It is available on pip as spacy-nightly and is not intended for production use.

pip install -U spacy-nightly

If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.

✨ New features and improvements

Tagger, Parser & NER

  • NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
  • Make parser, tagger and NER faster, through better hyperparameters.
  • Fix bugs in beam-search training objective.
  • Remove document length limit during training, by implementing faster Levenshtein alignment.
  • Use Thinc v6.11, which defaults to single-threaded execution with a fast OpenBLAS kernel. Parallelisation should be performed at the task level, e.g. by running more containers.
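
The Levenshtein alignment mentioned above matches the tokens produced by the tokenizer against the gold-standard tokens. As a rough illustration of the idea only, and not spaCy's actual implementation, here is a textbook dynamic-programming sketch in plain Python. The function name `levenshtein_align` and its return format are assumptions made for this example.

```python
def levenshtein_align(cand, ref):
    """Align two token sequences with unit-cost edit operations.

    Returns (cost, pairs), where pairs is a list of (i, j) index tuples.
    None on the left marks a token present only in ref, and None on the
    right marks a token present only in cand.
    """
    n, m = len(cand), len(ref)
    # dist[i][j] is the edit distance between cand[:i] and ref[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = dist[i - 1][j - 1] + (cand[i - 1] != ref[j - 1])
            delete = dist[i - 1][j] + 1
            insert = dist[i][j - 1] + 1
            dist[i][j] = min(substitute, delete, insert)

    # Walk back from the bottom-right cell to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dist[i][j] == dist[i - 1][j - 1] + (cand[i - 1] != ref[j - 1])):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


cost, pairs = levenshtein_align(
    ["New", "York", "is", "big"],
    ["New", "York", "is", "very", "big"],
)
print(cost, pairs)
```

This version fills the full table, so it needs time and memory proportional to the product of the two sequence lengths. That cost is why a faster alignment matters for long training documents.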

Models & Language Data

  • NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
  • NEW: The English and German models are now available under the MIT license.

CLI

  • NEW: Add ud-train command to train and evaluate using the CoNLL 2017 shared task data.
  • Check if model is already installed before downloading it via spacy download.
  • Pass additional arguments of download command to pip to customise installation.
  • Improve train command by letting GoldCorpus stream data, instead of loading into memory.
  • Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.

Other

  • NEW: Doc.retokenize context manager for merging tokens more efficiently.
  • NEW: Add support for custom pipeline component factories via entry points (#2348).
  • NEW: Implement fastText vectors with subword features.
  • Add warnings if .similarity method is called with empty vectors or without word vectors.
  • Improve the rule-based Matcher. Matcher.pipe now accepts a return_matches keyword argument to yield (doc, matches) tuples instead of only Doc objects, and an as_tuples argument to add context to the Doc objects.
  • Make stop word checks via Token.is_stop and Lexeme.is_stop case-insensitive.
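
As a minimal sketch of the new Doc.retokenize context manager, the snippet below merges a two-token span into a single token. It assumes spaCy v2.1 or later is installed and uses a blank English pipeline so that no model download is required. The helper name `merge_tokens` is hypothetical and exists only for this example.

```python
import spacy


def merge_tokens(text, start, end):
    """Merge doc[start:end] into one token and return the token texts."""
    # A blank pipeline has a tokenizer but no statistical model.
    nlp = spacy.blank("en")
    doc = nlp(text)
    # Merges are collected inside the block and applied when it exits.
    with doc.retokenize() as retokenizer:
        retokenizer.merge(doc[start:end])
    return [token.text for token in doc]


tokens = merge_tokens("New York is big", 0, 2)
print(tokens)
```

Because the changes are applied together when the block exits, several spans can be merged in one pass without recomputing the document after each merge.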

🚧 Under construction

This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.

  • Enhanced pattern API for rule-based Matcher (see #1971).
  • Built-in rule-based NER component to add entities based on match patterns (see #2513).
  • Improve tokenizer performance (see #1642).
  • Allow retokenizer to update Lexeme attributes on merge (see #2390).
  • md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.

🔴 Bug fixes

  • Fix issue #1487: Add Doc.retokenize() context manager.
  • Fix issue #1574: Make sure stop words are available in medium and large English models.
  • Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
  • Fix issue #1865: Correct licensing of it_core_news_sm model.
  • Fix issue #1889: Make stop words case-insensitive.
  • Fix issue #1903: Add relcl dependency label to symbols.
  • Fix issue #2014: Make Token.pos_ writeable.
  • Fix issue #2369: Respect pre-defined warning filters.
  • Fix serialization of custom tokenizer if not all functions are defined.

⚠️ Backwards incompatibilities

  • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
  • If you've been training your own models, you'll need to retrain them with the new version.
  • While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
  • Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.

📈 Benchmarks

| Model | Version | UAS | LAS | POS | NER F | Vec | Size |
| --- | ---: | ---: | ---: | ---: | ---: | :---: | ---: |
| en_core_web_sm | 2.1.0a0 | 91.8 | 90.0 | 96.8 | 85.6 | 𐄂 | 28 MB |
| en_core_web_md | 2.1.0a0 | 92.0 | 90.2 | 97.0 | 86.2 | ✓ | 107 MB |
| en_core_web_lg | 2.1.0a0 | 92.1 | 90.3 | 97.0 | 86.2 | ✓ | 805 MB |
| de_core_news_sm | 2.1.0a0 | 92.0 | 90.1 | 97.2 | 83.8 | 𐄂 | 26 MB |
| de_core_news_md | 2.1.0a0 | 92.4 | 90.7 | 97.4 | 84.2 | ✓ | 228 MB |
| es_core_news_sm | 2.1.0a0 | 90.1 | 87.2 | 96.9 | 89.4 | 𐄂 | 28 MB |
| es_core_news_md | 2.1.0a0 | 90.7 | 88.0 | 97.2 | 89.5 | ✓ | 88 MB |
| pt_core_news_sm | 2.1.0a0 | 89.4 | 86.3 | 80.1 | 82.7 | 𐄂 | 29 MB |
| fr_core_news_sm | 2.1.0a0 | 88.8 | 85.7 | 94.4 | 67.3 ¹ | 𐄂 | 32 MB |
| fr_core_news_md | 2.1.0a0 | 88.7 | 86.0 | 95.0 | 70.4 ¹ | ✓ | 100 MB |
| it_core_news_sm | 2.1.0a0 | 90.7 | 87.1 | 96.1 | 81.3 | 𐄂 | 27 MB |
| nl_core_news_sm | 2.1.0a0 | 83.5 | 77.6 | 91.5 | 87.3 | 𐄂 | 27 MB |
| xx_ent_wiki_sm | 2.1.0a0 | - | - | - | 83.8 | 𐄂 | 9 MB |

¹ We're currently investigating this, as the results are anomalously low.

💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
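
To make the two parser metrics concrete, here is a small sketch of how UAS and LAS are conventionally computed from per-token heads and labels. It is an illustration under simplifying assumptions, not the evaluation code used to produce the table: the function name `attachment_scores` and the (head, label) input format are made up for this example, and every token is counted, whereas real evaluation scripts may exclude some tokens such as punctuation.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) as percentages.

    gold and pred are equal-length lists of (head_index, label) tuples,
    one per token. UAS counts tokens whose head is correct; LAS counts
    tokens whose head and dependency label are both correct.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty parse")
    correct_heads = 0
    correct_labelled = 0
    for (gold_head, gold_label), (pred_head, pred_label) in zip(gold, pred):
        if gold_head == pred_head:
            correct_heads += 1
            if gold_label == pred_label:
                correct_labelled += 1
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_labelled / total


# Toy four-token parse: one wrong head (token 2) and one wrong label (token 3).
gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj")]
pred = [(1, "nsubj"), (1, "ROOT"), (1, "det"), (1, "obj")]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```

LAS can never exceed UAS, since a token only counts towards LAS when its head is already correct. The UAS and LAS columns in the table are consistent with this.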

📖 Documentation and examples

  • Fix various typos and inconsistencies.

👥 Contributors

Thanks to @DuyguA for the pull requests and contributions.


Details

Date: July 21, 2018, 2:14 p.m.
Name: v2.1.0a0: New models, joint word segmentation and parsing, better Matcher, bug fixes & more
Type: Pre-release