Getting Started with Weaviate Python Library
A full tutorial on the new Machine Learning based vector search engine, Weaviate
0. New updates.
1. What is Weaviate?
2. Where can it be used?
3. What are the advantages?
4. What is Weaviate Python Client?
5. How to use the Weaviate Python Client with a Weaviate cluster?
- 5.0. Create a Weaviate instance/cluster.
- 5.1. Connect to the cluster.
- 5.2. Get Data and Analyze it.
- 5.3. Create appropriate data types.
- 5.4. Load data.
- 5.5. Query data.
0. New updates.
Weaviate-client version 3.0.0 brings some changes; check the official documentation for all the details here.
The article now shows examples for both the old and the new version wherever the two differ.
1. What is Weaviate?
Weaviate is an open-source, cloud-native, modular, real-time vector search engine. It is built to scale your machine learning models. Because Weaviate is modular, you can use it with any machine learning model that does data encoding. Weaviate comes with optional modules for text, images and other media types, which can be chosen based on your task and data. You can also use more than one module, depending on the variety of your data. More information here.
In this article we are going to use the text module to see the most important functionalities and capabilities of Weaviate. The text module, also called text2vec-contextionary, captures the semantic meaning of text objects and places them in a concept hyper-space. This makes semantic search possible, in contrast to the ‘word matching search’ that other search engines do.
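To make the difference between semantic search and word matching more concrete, here is a minimal sketch. It does not use Weaviate or the text2vec-contextionary module; the three-dimensional vectors are made-up values chosen only for illustration, and the helper names are hypothetical. It shows that two words sharing no characters can still be close in a vector space.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made "embeddings" (illustrative values only, not produced by any model).
toy_vectors = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.95],
}


def most_similar(query, vectors):
    """Return the other words ordered from most to least similar to `query`."""
    others = [word for word in vectors if word != query]
    return sorted(
        others,
        key=lambda word: cosine_similarity(vectors[query], vectors[word]),
        reverse=True,
    )


ranking = most_similar("car", toy_vectors)
print(ranking)
```

A word-matching engine would find nothing in common between "car" and "automobile", while the vector comparison ranks them as neighbours.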
For more information about Weaviate and SeMI Technologies, the company that builds Weaviate, visit the official website.
2. Where can it be used?
At the moment Weaviate is used in such cases as:
- semantic search,
- similarity search,
- image search,
- power recommendation engines,
- e-commerce search,
- cybersecurity threat analysis,
- automated data harmonization,
- anomaly detection,
- data classification in ERP systems,
and many more.
3. What are the advantages?
To understand Weaviate's advantages, ask yourself these questions:
- Is the quality of the results your current search engine gives you good enough?
- Is it too much work to bring your machine learning models to scale?
- Do you need to classify large datasets fast and near-real time?
- Do you need to scale your machine learning models to production size?
Weaviate is designed to address all of these concerns.
4. What is Weaviate Python Client?
The Weaviate Python Client is a Python package that allows you to connect to and interact with a Weaviate instance. The Python client is NOT a Weaviate instance, but you can use it to create one on the Weaviate Cloud Service. It provides an API for importing data, creating schemas, doing classification, querying data and more. We are going to go through most of these and explain how and when one could use them.
The package is published to PyPI (link). Also, a CLI tool is available on PyPI (link).
5. How to use the python-client with a Weaviate cluster?
In this section we are going to go through the process of creating a Weaviate instance, connecting to it and exploring some of its functionality.
(The jupyter-notebook can be found here.)
5.0. Create a Weaviate instance/cluster.
Creating a Weaviate instance can be done in multiple ways. It can be done using a docker-compose.yaml file that can be generated here. For this option you need to have docker and docker-compose installed, and enough free space on your drive.
Another option is to create an account on Weaviate Cloud Service console (WCS console) and create a cluster there. There are different options for clusters you can choose from. If you do not have an account go ahead and create one.
In this tutorial we are going to create a cluster on WCS directly from python (you will only need your WCS credentials).
The first thing we have to do now is install the Weaviate Python Client. This can be done with the pip command.
>>> import sys
>>> !{sys.executable} -m pip install weaviate-client==2.5.0
UPDATE for version 3.0.0.
>>> import sys
>>> !{sys.executable} -m pip install weaviate-client==3.0.0
Now let's import the package and create a cluster on WCS.
>>> from getpass import getpass # hide password
>>> import weaviate # to communicate to the Weaviate instance
>>> from weaviate.tools import WCS
UPDATE for version 3.0.0.
>>> from getpass import getpass # hide password
>>> import weaviate # to communicate to the Weaviate instance
>>> from weaviate.wcs import WCS
In order to authenticate to WCS or to a Weaviate instance (if the Weaviate instance has authentication enabled), we need to create an authentication object. At the moment it supports two types of authentication credentials:
- Password credentials:
weaviate.auth.AuthClientPassword(username='WCS_ACCOUNT_EMAIL', password='WCS_ACCOUNT_PASSWORD')
- Token credentials
weaviate.auth.AuthClientCredentials(client_secret=YOUR_SECRET_TOKEN)
For WCS we will use the Password credentials.
>>> my_credentials = weaviate.auth.AuthClientPassword(username=input("User name: "), password=getpass('Password: '))
User name: WCS_ACCOUNT_EMAIL
Password: ········
The my_credentials object contains your credentials, so be careful not to make it public.
>>> my_wcs = WCS(my_credentials)
Now that we are connected to WCS, we can use the create, delete, get_clusters and get_cluster_config methods, and check the status of a cluster with the is_ready method.
Here is the prototype of the create method:
my_wcs.create(cluster_name:str=None,
cluster_type:str='sandbox',
config:dict=None,
wait_for_completion:bool=True) -> str
The return value is the URL of the created cluster.
NOTE: WCS names must be globally unique, as they are used to create public URLs to access the instance later. Thus, make sure to pick a unique name for cluster_name.
If you want to check the prototype and docstring of any method in a notebook, run this command: object.method?. You can also use the help() function.
Ex: WCS.is_ready?, my_wcs.is_ready? or help(WCS.is_ready).
>>> cluster_name = 'my-first-weaviate-instance'
>>> weaviate_url = my_wcs.create(cluster_name=cluster_name)
100%|██████████| 100.0/100 [00:56<00:00, 1.78it/s]
>>> weaviate_url
'https://my-first-weaviate-instance.semi.network'
>>> my_wcs.is_ready(cluster_name)
True
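Since cluster names have to be globally unique, one possible way to reduce the chance of a name clash is to append a random suffix to a readable prefix. The sketch below is only an illustration: make_cluster_name is a hypothetical helper, and the rule it applies (lowercase letters, digits and hyphens only) is a conservative assumption, not a documented WCS requirement.

```python
import re
import uuid


def make_cluster_name(prefix: str, suffix_length: int = 8) -> str:
    """Build a lowercase, hyphen-separated name that ends with a random suffix."""
    # Replace every run of characters outside a-z, 0-9 and '-' with one hyphen
    # (an assumed, conservative naming rule).
    cleaned = re.sub(r"[^a-z0-9-]+", "-", prefix.lower()).strip("-")
    suffix = uuid.uuid4().hex[:suffix_length]
    return f"{cleaned}-{suffix}"


cluster_name = make_cluster_name("My First Weaviate Instance")
print(cluster_name)
```

The resulting string could then be passed as cluster_name to the create method shown above.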
5.1. Connect to the cluster.
Now we can connect to the created Weaviate instance with the Client object. The constructor looks like this:
weaviate.Client(
url:str,
auth_client_secret:weaviate.auth.AuthCredentials=None,
timeout_config:Union[Tuple[int, int], NoneType]=None,
)
The constructor has only one required argument, url, and two optional ones: auth_client_secret, used if the Weaviate instance has authentication enabled, and timeout_config, which sets the REST timeout configuration and is a tuple (retries, time out seconds). For more information about the arguments, look at the docstring.
>>> client = weaviate.Client(weaviate_url)
Being connected to Weaviate does not necessarily mean that it is all set up. It might still be running some setup processes in the background. We can check the health of the Weaviate instance by calling the .is_live method, and check whether Weaviate is ready for requests by calling .is_ready.
>>> client.is_ready()
True
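If the instance is still starting up, a single readiness check may return False. A small polling loop is one way to wait for it. The sketch below is an assumption-laden illustration: wait_until_ready is a hypothetical helper, not part of weaviate-client, and it only relies on the object it receives having an is_ready() method that returns a boolean.

```python
import time


def wait_until_ready(client, timeout_seconds: float = 60.0, poll_interval: float = 2.0) -> bool:
    """Poll `client.is_ready()` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            if client.is_ready():
                return True
        except Exception:
            # Defensive: treat any error during start-up as "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

With the client created above, this could be called as wait_until_ready(client) before sending any request.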
5.2. Get Data and Analyse it.
Now that the Weaviate instance is set up, connected and ready for requests, we can take a step back, get some data and analyze it.
As with any machine learning project, this step is the most important one. Here we have to decide what is relevant, what is important and which data structures/types to use.
In this example we are going to use news articles to construct the Weaviate data. For this we are going to need the newspaper3k package.
>>> !{sys.executable} -m pip install newspaper3k
NOTE: If none of the articles were downloaded, it might be because nltk punkt tools were not downloaded. To fix it please run the cell below.
>>> import nltk # it is a dependency of newspaper3k
>>> nltk.download('punkt')
Thanks to the GitHub user @gosha1128 for finding this bug, which is suppressed in the try/except block in the get_articles_from_newspaper function below.
>>> import newspaper
>>> import uuid
>>> import json
>>> from tqdm import tqdm
>>> def get_articles_from_newspaper(
...         news_url: str,
...         max_articles: int=100
...     ) -> list:
...     """
...     Download newspaper articles and return them as a list of dictionaries.
...     Parameters
...     ----------
...     news_url : str
...         URL of the newspaper.
...     max_articles : int
...         Maximum number of articles to collect.
...     """
...
...     objects = []
...
...     # Build the actual newspaper
...     news_builder = newspaper.build(news_url, memoize_articles=False)
...
...     if max_articles > news_builder.size():
...         max_articles = news_builder.size()
...     pbar = tqdm(total=max_articles)
...     pbar.set_description(f"{news_url}")
...     i = 0
...     while len(objects) < max_articles and i < news_builder.size():
...         article = news_builder.articles[i]
...         try:
...             article.download()
...             article.parse()
...             article.nlp()
...             # keep only articles that have a title, a summary and at least one author
...             if article.title and article.summary and article.authors:
...
...                 # create a UUID for the article using its URL
...                 article_id = uuid.uuid3(uuid.NAMESPACE_DNS, article.url)
...
...                 # create the object
...                 objects.append({
...                     'id': str(article_id),
...                     'title': article.title,
...                     'summary': article.summary,
...                     'authors': article.authors
...                 })
...
...                 pbar.update(1)
...
...         except Exception:
...             # something went wrong with getting the article, ignore it
...             pass
...         i += 1
...     pbar.close()
...     return objects
>>> data = []
>>> data += get_articles_from_newspaper('https://www.theguardian.com/international')
>>> data += get_articles_from_newspaper('http://cnn.com')
https://www.theguardian.com/international: 100%|██████████| 100/100 [00:34<00:00, 2.90it/s]
http://cnn.com: 100%|██████████| 100/100 [02:11<00:00, 1.32s/it]
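The function above derives each article's UUID from its URL with uuid.uuid3, which is name-based and therefore deterministic: the same URL always gives the same UUID. The sketch below illustrates that property and uses it to drop articles that were downloaded twice. The URLs are hypothetical placeholders, and article_uuid and deduplicate are illustrative helpers that are not part of the original notebook.

```python
import uuid


def article_uuid(url: str) -> str:
    """Name-based UUID derived from the URL, using the same recipe as above."""
    return str(uuid.uuid3(uuid.NAMESPACE_DNS, url))


def deduplicate(objects: list) -> list:
    """Keep the first object seen for each 'id', preserving the original order."""
    seen = set()
    unique = []
    for obj in objects:
        if obj["id"] not in seen:
            seen.add(obj["id"])
            unique.append(obj)
    return unique


# Hypothetical URLs, used only to illustrate the behaviour.
url_a = "https://example.com/news/article-1"
url_b = "https://example.com/news/article-2"
sample = [
    {"id": article_uuid(url_a), "title": "First"},
    {"id": article_uuid(url_b), "title": "Second"},
    {"id": article_uuid(url_a), "title": "First (downloaded again)"},
]
print(len(deduplicate(sample)))
```

Because the identifier depends only on the URL, importing the same article in a later session would target the same UUID instead of creating a second copy under a new one.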
5.3. Create appropriate data types.
In the function get_articles_from_newspaper we keep the title, summary and authors of the article. We also compute a UUID (Universally Unique IDentifier) for each article. All of these fields can be seen in the cell above.
With this information at hand we can already define a schema, that is, a data structure for each object type and how they are related. The schema is a nested dictionary.
So let's create the Article class schema. We know that the article has a title, summary and authors.
More about schemas and how to create them can be found here and here.
>>> article_class_schema = {
...     # name of the class
...     "class": "Article",
...     # a description of what this class represents
...     "description": "An Article class to store the article summary and its authors",
...     # class properties
...     "properties": [
...         {
...             "name": "title",
...             "dataType": ["string"],
...             "description": "The title of the article",
...         },
...         {
...             "name": "summary",
...             "dataType": ["text"],
...             "description": "The summary of the article",
...         },
...         {
...             "name": "hasAuthors",
...             "dataType": ["Author"],
...             "description": "The authors this article has",
...         }
...     ]
... }
In the class schema above we create a class named Article with the description An Article class to store the article summary and its authors. The description is there to explain to the user what this class is about.
We also define 3 properties: title - the title of the article, of data type string (case sensitive); summary - the summary of the article, of data type text (case insensitive); and hasAuthors - the authors of the article, of data type Author. Author is NOT a primitive data type; it is another class that we have to define. The list of primitive data types can be found here.
NOTE 1: Property names should always be in camelCase format and start with a lowercase word.
NOTE 2: The property data type is always a list because it can accept more than one data type.
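The naming convention from NOTE 1 can be checked locally before a schema is sent to Weaviate. The patterns below are a simplified assumption (letters and digits only, with the first letter deciding the kind of name); they are stricter than, and not a copy of, whatever rule Weaviate itself applies. The helper names are hypothetical.

```python
import re

# Assumed, simplified patterns: letters and digits only.
PROPERTY_NAME = re.compile(r"^[a-z][A-Za-z0-9]*$")  # camelCase, lowercase first
CLASS_NAME = re.compile(r"^[A-Z][A-Za-z0-9]*$")     # uppercase first letter


def looks_like_property_name(name: str) -> bool:
    """True if `name` follows the camelCase convention described in NOTE 1."""
    return PROPERTY_NAME.match(name) is not None


def looks_like_class_name(name: str) -> bool:
    """True if `name` starts with an uppercase letter, like Article or Author."""
    return CLASS_NAME.match(name) is not None


for candidate in ["hasAuthors", "wroteArticles", "HasAuthors", "has_authors"]:
    print(candidate, looks_like_property_name(candidate))
```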
Specifying another class as a data type is called cross-referencing. This way you can link your data objects to each other and create a relation graph.
Now let's create the Author class schema in the same manner, but with the properties name and wroteArticles.
>>> author_class_schema = {
...     "class": "Author",
...     "description": "An Author class to store the author information",
...     "properties": [
...         {
...             "name": "name",
...             "dataType": ["string"],
...             "description": "The name of the author",
...         },
...         {
...             "name": "wroteArticles",
...             "dataType": ["Article"],
...             "description": "The articles of the author",
...         }
...     ]
... }
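The two classes reference each other, so neither can be created with all of its properties while the other one is still missing. A quick local check can list which referenced classes a class schema depends on. The sketch below assumes that a data type starting with an uppercase letter is a class name and that primitive types start with a lowercase letter; reference_targets and missing_targets are hypothetical helpers, and the schemas are trimmed-down copies of the ones above.

```python
def reference_targets(class_schema: dict) -> set:
    """Collect the data types that start with an uppercase letter, i.e. class names."""
    targets = set()
    for prop in class_schema.get("properties", []):
        for data_type in prop["dataType"]:
            if data_type[:1].isupper():
                targets.add(data_type)
    return targets


def missing_targets(class_schema: dict, existing_classes: set) -> set:
    """Referenced classes that are neither already created nor the class itself."""
    return reference_targets(class_schema) - existing_classes - {class_schema["class"]}


# Trimmed-down copies of the two schemas defined above.
article = {"class": "Article", "properties": [
    {"name": "title", "dataType": ["string"]},
    {"name": "hasAuthors", "dataType": ["Author"]},
]}
author = {"class": "Author", "properties": [
    {"name": "name", "dataType": ["string"]},
    {"name": "wroteArticles", "dataType": ["Article"]},
]}

print(missing_targets(article, existing_classes=set()))
print(missing_targets(author, existing_classes={"Article"}))
```

A non-empty result for a class means that at least one of its reference properties points to a class that does not exist yet.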
Now that we have decided on the data structure, we can tell Weaviate what kind of data we will import. This can be done by accessing the schema attribute of the client.
The schema can be created in two different ways:
- using the .create_class() method; this option creates only one class per call.
- using the .create() method; this option creates multiple classes at once (useful if you have the whole schema).
We can also check whether a schema is present, or whether a particular class schema is present, with the .contains() method.
For more about schema methods, click here.
Because we defined each class separately, we should use the .create_class() method.
client.schema.create_class(schema_class:Union[dict, str]) -> None
It also accepts file paths or URLs to a class definition file.
>>> client.schema.create_class(article_class_schema)
---------------------------------------------------------------------------
UnexpectedStatusCodeException Traceback (most recent call last)
<ipython-input-12-6d56a74d9293> in <module>
----> 1 client.schema.create_class(article_class_schema)
~/miniconda3/envs/test/lib/python3.6/site-packages/weaviate/schema/crud_schema.py in create_class(self, schema_class)
138 check_class(loaded_schema_class)
139 self._create_class_with_premitives(loaded_schema_class)
--> 140 self._create_complex_properties_from_class(loaded_schema_class)
141
142 def delete_class(self, class_name: str) -> None:
~/miniconda3/envs/test/lib/python3.6/site-packages/weaviate/schema/crud_schema.py in _create_complex_properties_from_class(self, schema_class)
352 raise type(conn_err)(message).with_traceback(sys.exc_info()[2])
353 if response.status_code != 200:
--> 354 raise UnexpectedStatusCodeException("Add properties to classes", response)
355
356 def _create_complex_properties_from_classes(self, schema_classes_list: list) -> None:
UnexpectedStatusCodeException: Add properties to classes! Unexpected status code: 422, with response body: {'error': [{'message': "Data type of property 'hasAuthors' is invalid; SingleRef class name 'Author' does not exist"}]}
As we can see, we cannot create a class property that references a non-existing data type. This does not mean that the class Article was not created at all. Let's get the schema from Weaviate and look at what was created.
>>> # helper function
>>> def prettify(json_dict):
...     print(json.dumps(json_dict, indent=2))
>>> prettify(client.schema.get())
{
"classes": [
{
"class": "Article",
"description": "An Article class to store the article summary and its authors",
"invertedIndexConfig": {
"cleanupIntervalSeconds": 60
},
"properties": [
{
"dataType": [
"string"
],
"description": "The title of the article",
"name": "title"
},
{
"dataType": [
"text"
],
"description": "The summary of the article",
"name": "summary"
}
],
"vectorIndexConfig": {
"cleanupIntervalSeconds": 300,
"maxConnections": 64,
"efConstruction": 128,
"vectorCacheMaxObjects": 500000
},
"vectorIndexType": "hnsw",
"vectorizer": "text2vec-contextionary"
}
]
}
The configurations we did not specify are not mandatory and were set to the default values.
As we can see, only the hasAuthors property was not created. So let's create the Author class.
>>> client.schema.create_class(author_class_schema)
>>> prettify(client.schema.get())
{
"classes": [
{
"class": "Article",
"description": "An Article class to store the article summary and its authors",
"invertedIndexConfig": {
"cleanupIntervalSeconds": 60
},
"properties": [
{
"dataType": [
"string"
],
"description": "The title of the article",
"name": "title"
},
{
"dataType": [
"text"
],
"description": "The summary of the article",
"name": "summary"
}
],
"vectorIndexConfig": {
"cleanupIntervalSeconds": 300,
"maxConnections": 64,
"efConstruction": 128,
"vectorCacheMaxObjects": 500000
},
"vectorIndexType": "hnsw",
"vectorizer": "text2vec-contextionary"
},
{
"class": "Author",
"description": "An Author class to store the author information",
"invertedIndexConfig": {
"cleanupIntervalSeconds": 60
},
"properties": [
{
"dataType": [
"string"
],
"description": "The name of the author",
"name": "name"
},
{
"dataType": [
"Article"
],
"description": "The articles of the author",
"name": "wroteArticles"
}
],
"vectorIndexConfig": {
"cleanupIntervalSeconds": 300,
"maxConnections": 64,
"efConstruction": 128,
"vectorCacheMaxObjects": 500000
},
"vectorIndexType": "hnsw",
"vectorizer": "text2vec-contextionary"
}
]
}
Now we have both classes created, but we still do not have the hasAuthors property. No worries, it can be created at any time using the schema's property attribute and its create method.
client.schema.property.create(schema_class_name:str, schema_property:dict) -> None
>>> client.schema.property.create('Article', article_class_schema['properties'][2])
Now let's get the schema and see if it is what we expect it to be.
>>> prettify(client.schema.get())
{
"classes": [
{
"class": "Article",
"description": "An Article class to store the article summary and its authors",
"invertedIndexConfig": {
"cleanupIntervalSeconds": 60
},
"properties": [
{
"dataType": [
"string"
],
"description": "The title of the article",
"name": "title"
},
{
"dataType": [
"text"
],
"description": "The summary of the article",
"name": "summary"
},
{
"dataType": [
"Author"
],
"description": "The authors this article has",
"name": "hasAuthors"
}
],
"vectorIndexConfig": {
"cleanupIntervalSeconds": 300,
"maxConnections": 64,
"efConstruction": 128,
"vectorCacheMaxObjects": 500000
},
"vectorIndexType": "hnsw",
"vectorizer": "text2vec-contextionary"
},
{
"class": "Author",
"description": "An Author class to store the author information",
"invertedIndexConfig": {
"cleanupIntervalSeconds": 60
},
"properties": [
{
"dataType": [
"string"
],
"description": "The name of the author",
"name": "name"
},
{
"dataType": [
"Article"
],
"description": "The articles of the author",
"name": "wroteArticles"
}
],
"vectorIndexConfig": {
"cleanupIntervalSeconds": 300,
"maxConnections": 64,
"efConstruction": 128,
"vectorCacheMaxObjects": 500000
},
"vectorIndexType": "hnsw",
"vectorizer": "text2vec-contextionary"
}
]
}
Everything is exactly as we intended.
If you do not want to think about which class was created when, and which properties might fail because they reference classes that do not exist yet, there is a solution: create the whole schema at once with the create method. So let's delete the schema from Weaviate and see how it works.
>>> schema = client.schema.get() # save schema
>>> client.schema.delete_all() # delete all classes
>>> prettify(client.schema.get())
{
"classes": []
}
Note that if we delete the schema or a class, we also delete all the objects associated with it.
Now let's create it from the saved schema.
>>> client.schema.create(schema)
>>> prettify(client.schema.get())
{
"classes": [
{
"class": "Article",
"description": "An Article class to store the article summary and its authors",
"invertedIndexConfig": {
"cleanupIntervalSeconds": 60
},
"properties": [
{
"dataType": [
"string"
],
"description": "The title of the article",
"name": "title"
},
{
"dataType": [
"text"
],
"description": "The summary of the article",
"name": "summary"
},
{
"dataType": [
"Author"
],
"description": "The authors this article has",
"name": "hasAuthors"
}
],
"vectorIndexConfig": {
"cleanupIntervalSeconds": 300,
"maxConnections": 64,
"efConstruction": 128,
"vectorCacheMaxObjects": 500000
},
"vectorIndexType": "hnsw",
"vectorizer": "text2vec-contextionary"
},
{
"class": "Author",
"description": "An Author class to store the author information",
"invertedIndexConfig": {
"cleanupIntervalSeconds": 60
},
"properties": [
{
"dataType": [
"string"
],
"description": "The name of the author",
"name": "name"
},
{
"dataType": [
"Article"
],
"description": "The articles of the author",
"name": "wroteArticles"
}
],
"vectorIndexConfig": {
"cleanupIntervalSeconds": 300,
"maxConnections": 64,
"efConstruction": 128,
"vectorCacheMaxObjects": 500000
},
"vectorIndexType": "hnsw",
"vectorizer": "text2vec-contextionary"
}
]
}
This looks exactly like the schema we created class by class and property by property. We can now save the schema to a file and, in the next session, import it directly by providing the file path.
# save schema to file
with open('schema.json', 'w') as outfile:
    json.dump(schema, outfile)
# remove current schema from Weaviate, removes all the data too
client.schema.delete_all()
# import schema using file path
client.schema.create('schema.json')
# print schema
print(json.dumps(client.schema.get(), indent=2))
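The whole-schema dictionary does not have to come from client.schema.get(). Assuming the same top-level layout seen in the printed outputs (a "classes" key holding a list of class definitions), it could also be assembled by hand from individual class schemas. build_schema is a hypothetical helper and the two class definitions are shortened for the example.

```python
import json


def build_schema(*class_schemas: dict) -> dict:
    """Wrap individual class definitions in a top-level {"classes": [...]} layout."""
    names = [class_schema["class"] for class_schema in class_schemas]
    duplicates = {name for name in names if names.count(name) > 1}
    if duplicates:
        raise ValueError(f"duplicate class names: {sorted(duplicates)}")
    return {"classes": list(class_schemas)}


# Shortened class definitions, for illustration only.
article = {"class": "Article", "properties": [{"name": "title", "dataType": ["string"]}]}
author = {"class": "Author", "properties": [{"name": "name", "dataType": ["string"]}]}

full_schema = build_schema(article, author)
print(json.dumps(full_schema, indent=2))
```

A dictionary built this way has the same shape as the one that was saved to schema.json above.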
5.4. Load data.
Now that we have our data ready, and Weaviate is aware of what kind of data we have, we can add the Articles and Authors to the Weaviate instance.
Importing data to Weaviate can be done in 4 different ways:
- Adding object by object iteratively. This can be done using the data_object object attribute of the client.
- In batches. This can be done by creating an appropriate batch request object and submitting it using the batch object attribute of the client. (Only in weaviate-client version <3.0.0.)
- Using a Batcher object from the weaviate.tools module. (Only in weaviate-client version <3.0.0.)
- Using the new Batch class introduced in weaviate-client version 3.0.0.
We are going to see all of them in action, but first let's underline the differences between them.
- Option 1 is the safest method for adding data objects and creating references because it validates each object before creating it, whereas importing data in batches skips most of the validation in favor of speed. This option requires one REST request per object and is therefore slower than importing data in batches. It is recommended if you are not sure whether your data is valid.
- Option 2, as mentioned above, skips most of the data validation and requires only one REST request per BATCH. For this method you add as much data as you want to a batch request (there are 2 types: ReferenceBatchRequest and ObjectsBatchRequest), then submit it using the batch object attribute of the client. This option requires you to first import the data objects and then the references (make sure that the objects used in a reference are already imported before creating the reference). (Only in weaviate-client version <3.0.0.)
- Option 3 relies on the batch requests from Option 2, but with a Batcher you do not have to submit any batch requests yourself; it does so automatically when it is full. (Only in weaviate-client version <3.0.0.)
- Option 4 is the new Batch class introduced in weaviate-client version 3.0.0. The new Batch object does not require you to use the BatchRequests from Option 2, but uses them internally. The new class also supports 3 different ways of loading data in batches: a) manually - the user has absolute control over when and how to add and create batches; b) auto-create batches when full; c) auto-create batches using dynamic batching, i.e. the batch size is adjusted every time a batch is created to avoid any Timeout errors.
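The "submit automatically when full" idea behind Option 3 and case b) of Option 4 can be illustrated without Weaviate at all. The class below is a generic sketch of that pattern, not the implementation of Batcher or Batch: AutoFlushBuffer and its submit callback are hypothetical names, and the callback stands in for whatever actually sends one REST request per batch.

```python
from typing import Callable, List


class AutoFlushBuffer:
    """Collect items and hand them to `submit` whenever `batch_size` is reached."""

    def __init__(self, batch_size: int, submit: Callable[[List[dict]], None]):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size
        self.submit = submit
        self._items: List[dict] = []

    def add(self, item: dict) -> None:
        self._items.append(item)
        if len(self._items) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Submit whatever is buffered; safe to call on an empty buffer."""
        if self._items:
            self.submit(self._items)
            self._items = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Send the last, possibly partial, batch when leaving the `with` block.
        self.flush()
        return False


submitted = []  # stands in for "one REST request per batch"
with AutoFlushBuffer(batch_size=3, submit=submitted.append) as buffer:
    for i in range(7):
        buffer.add({"title": f"article {i}"})

print([len(batch) for batch in submitted])
```

Seven items with a batch size of three result in two full batches and one final partial batch that is sent when the with block ends.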
5.4.1 Load data using data_object attribute
For this case let's take only one article (data[0]) and import it to Weaviate using the data_object attribute.
The way to do it is to first create the objects and then the references that link them.
Run client.data_object.create? in a notebook to get more info about the method, or help(client.data_object.create) in IDLE.
Each data object should have the same format as defined in the schema.
>>> prettify(data[0])
{
"id": "df2a2d1c-9c87-3b4b-9df3-d7aed6bb6a27",
"title": "Coronavirus live news: Pfizer says jab 100% effective in 12- to 15-year-olds; Macron could announce lockdown tonight",
"summary": "11:08Surge testing is being deployed in Bolton, Greater Manchester after one case of the South African variant of coronavirus has been identified.\nDr Helen Lowey, Bolton\u2019s director of public health, said that \u201cthe risk of any onward spread is low\u201d and there was no evidence that the variant caused more severe illness.\nPublic Health England identified the case in the area of Wingates Industrial Estate.\nDr Matthieu Pegorie from Public Health England North West said that there was no link to international travel, therefore suggesting that there are some cases in the community.\nThe Department of Health says that enhanced contact tracing will be deployed where a positive case of a variant is found.",
"authors": [
"Helen Sullivan",
"Yohannes Lowe",
"Martin Belam",
"Maya Wolfe-Robinson",
"Melissa Davey",
"Jessica Glenza",
"Jon Henley",
"Peter Beaumont"
]
}
>>> article_object = {
...     'title': data[0]['title'],
...     'summary': data[0]['summary'].replace('\n', '') # remove newline character
...     # we leave out the `hasAuthors` because it is a reference and will be created after we create the Authors
... }
>>> article_id = data[0]['id']
>>> # validate the object
>>> result = client.data_object.validate(
...     data_object=article_object,
...     class_name='Article',
...     uuid=article_id
... )
>>> prettify(result)
{
"error": null,
"valid": true
}
The object passed the validation test, so it is now safe to create/import it.
>>> # create the object
>>> client.data_object.create(
...     data_object=article_object,
...     class_name='Article',
...     uuid=article_id # if not specified, weaviate is going to create a UUID for you.
... )
'df2a2d1c-9c87-3b4b-9df3-d7aed6bb6a27'
client.data_object.create returns the UUID of the object. If you specified one, that same UUID is returned; if not, Weaviate generates one for you and returns it.
Congratulations, we have added our first object to Weaviate!
Now we can actually “get” this object from Weaviate by its UUID using the get_by_id or get method. (get without a UUID returns the first 100 objects.)
>>> prettify(client.data_object.get(article_id, with_vector=False))
{
"additional": {},
"class": "Article",
"creationTimeUnix": 1617191563170,
"id": "df2a2d1c-9c87-3b4b-9df3-d7aed6bb6a27",
"lastUpdateTimeUnix": 1617191563170,
"properties": {
"summary": "11:08Surge testing is being deployed in Bolton, Greater Manchester after one case of the South African variant of coronavirus has been identified.Dr Helen Lowey, Bolton\u2019s director of public health, said that \u201cthe risk of any onward spread is low\u201d and there was no evidence that the variant caused more severe illness.Public Health England identified the case in the area of Wingates Industrial Estate.Dr Matthieu Pegorie from Public Health England North West said that there was no link to international travel, therefore suggesting that there are some cases in the community.The Department of Health says that enhanced contact tracing will be deployed where a positive case of a variant is found.",
"title": "Coronavirus live news: Pfizer says jab 100% effective in 12- to 15-year-olds; Macron could announce lockdown tonight"
},
"vectorWeights": null
}
Now let's create the authors and the cross-references between the Article and the Authors.
References are added in the same manner, but using the client.data_object.reference.add method.
>>> # keep track of the authors already imported/created and their respective UUID
>>> # because the same author can write more than one article.
>>> created_authors = {}
>>> for author in data[0]['authors']:
...     # create Author
...     author_object = {
...         'name': author,
...         # we leave out the `wroteArticles` because it is a reference and will be created after we create the Author
...     }
...     author_id = client.data_object.create(
...         data_object=author_object,
...         class_name='Author'
...     )
...
...     # add author to the created_authors
...     created_authors[author] = author_id
...
...     # add references
...     ## Author -> Article
...     client.data_object.reference.add(
...         from_uuid=author_id,
...         from_property_name='wroteArticles',
...         to_uuid=article_id
...     )
...     ## Article -> Author
...     client.data_object.reference.add(
...         from_uuid=article_id,
...         from_property_name='hasAuthors',
...         to_uuid=author_id
...     )
In the cell above we iterate through all authors of the article. In each iteration we first create the Author and then add the references: the reference from Author to Article, linked via the wroteArticles property of the Author, and the reference from Article to Author, through the hasAuthors property of the Article.
Note that it is not required to have bi-directional references.
Now let's get the object and take a look at it.
>>> prettify(client.data_object.get(article_id, with_vector=False))
{
"additional": {},
"class": "Article",
"creationTimeUnix": 1617191563170,
"id": "df2a2d1c-9c87-3b4b-9df3-d7aed6bb6a27",
"lastUpdateTimeUnix": 1617191563170,
"properties": {
"hasAuthors": [
{
"beacon": "weaviate://localhost/1d0d3242-1fc2-4bba-adbe-ab9ef9a97dfe",
"href": "/v1/objects/1d0d3242-1fc2-4bba-adbe-ab9ef9a97dfe"
},
{
"beacon": "weaviate://localhost/c1c8afce-adb6-4b3c-bbe0-2414d55b0c8e",
"href": "/v1/objects/c1c8afce-adb6-4b3c-bbe0-2414d55b0c8e"
},
{
"beacon": "weaviate://localhost/b851f6fc-a02b-4a63-9b53-c3a8764c82c1",
"href": "/v1/objects/b851f6fc-a02b-4a63-9b53-c3a8764c82c1"
},
{
"beacon": "weaviate://localhost/e6b6c991-5d7a-447f-89c8-e6e01730f88f",
"href": "/v1/objects/e6b6c991-5d7a-447f-89c8-e6e01730f88f"
},
{
"beacon": "weaviate://localhost/d03f9353-d4fc-465d-babe-f116a29ccaf5",
"href": "/v1/objects/d03f9353-d4fc-465d-babe-f116a29ccaf5"
},
{
"beacon": "weaviate://localhost/8ab84df5-c92b-49ac-95dd-bf65f53a38cc",
"href": "/v1/objects/8ab84df5-c92b-49ac-95dd-bf65f53a38cc"
},
{
"beacon": "weaviate://localhost/e667d7c9-0c9b-48fe-b671-864cbfc84962",
"href": "/v1/objects/e667d7c9-0c9b-48fe-b671-864cbfc84962"
},
{
"beacon": "weaviate://localhost/9d094f60-3f58-46dc-b7fd-40495be2dd69",
"href": "/v1/objects/9d094f60-3f58-46dc-b7fd-40495be2dd69"
}
],
"summary": "11:08Surge testing is being deployed in Bolton, Greater Manchester after one case of the South African variant of coronavirus has been identified.Dr Helen Lowey, Bolton\u2019s director of public health, said that \u201cthe risk of any onward spread is low\u201d and there was no evidence that the variant caused more severe illness.Public Health England identified the case in the area of Wingates Industrial Estate.Dr Matthieu Pegorie from Public Health England North West said that there was no link to international travel, therefore suggesting that there are some cases in the community.The Department of Health says that enhanced contact tracing will be deployed where a positive case of a variant is found.",
"title": "Coronavirus live news: Pfizer says jab 100% effective in 12- to 15-year-olds; Macron could announce lockdown tonight"
},
"vectorWeights": null
}
As we can see, each reference is set as a beacon and an href. We cannot see the Author's data by getting the Article object from Weaviate. We can do it by querying data (see section 5.5 Query data.) or by getting the object by its UUID (or beacon, or href).
>>> from weaviate.util import get_valid_uuid # extract UUID from URL (beacon or href)
>>> # extract authors references, let's take only the first one as an example (the article might have only one)
>>> author = client.data_object.get(article_id, with_vector=False)['properties']['hasAuthors'][0]
>>> # get and print data object by providing the 'beacon'
>>> author_uuid = get_valid_uuid(author['beacon']) # can be 'href' too
>>> prettify(client.data_object.get(author_uuid, with_vector=False))
{
"additional": {},
"class": "Author",
"creationTimeUnix": 1617191569894,
"id": "1d0d3242-1fc2-4bba-adbe-ab9ef9a97dfe",
"lastUpdateTimeUnix": 1617191569894,
"properties": {
"name": "Helen Sullivan",
"wroteArticles": [
{
"beacon": "weaviate://localhost/df2a2d1c-9c87-3b4b-9df3-d7aed6bb6a27",
"href": "/v1/objects/df2a2d1c-9c87-3b4b-9df3-d7aed6bb6a27"
}
]
},
"vectorWeights": null
}
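To make the beacon and href formats concrete, here is a minimal standard-library sketch of pulling the UUID out of either form. The function name uuid_from_reference is hypothetical, and this is not how get_valid_uuid is implemented; it only assumes, as in the outputs above, that the UUID is the last path segment.

```python
import uuid


def uuid_from_reference(reference: str) -> str:
    """Return the UUID at the end of a beacon or href string.

    Hypothetical helper for illustration only; it assumes the UUID is
    the last '/'-separated segment, as in the outputs shown above.
    """
    candidate = reference.rstrip('/').split('/')[-1]
    # uuid.UUID raises ValueError if the segment is not a valid UUID
    return str(uuid.UUID(candidate))


beacon = 'weaviate://localhost/9d094f60-3f58-46dc-b7fd-40495be2dd69'
href = '/v1/objects/9d094f60-3f58-46dc-b7fd-40495be2dd69'
print(uuid_from_reference(beacon) == uuid_from_reference(href))  # True
```

Both strings point at the same object, which is why either one can be used to fetch it.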
So Far, so Good (… So What!)
There are more methods for Data Objects (client.data_object): .delete, .exists, .replace and .update.
There are also methods for references (client.data_object.reference): .add, .delete and .update.
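As a small illustration of two of these methods, the sketch below combines .exists and .delete. The helper name delete_if_exists is hypothetical; it assumes both methods accept the object's UUID as their first argument, so check the documentation of your client version before relying on it.

```python
def delete_if_exists(client, uuid: str) -> bool:
    """Delete the object with the given UUID if Weaviate has it.

    Hypothetical helper: assumes client.data_object.exists(uuid) and
    client.data_object.delete(uuid) take the UUID as first argument.
    Returns True if an object was deleted, False if there was none.
    """
    if client.data_object.exists(uuid):
        client.data_object.delete(uuid)
        return True
    return False
```

With a connected client this could be called as delete_if_exists(client, author_uuid).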
5.4.2 Load data using batches
(Only in weaviate-client version < 3.0.0.)
Importing data in batches is very similar to adding object by object.
The first thing we have to do is to create a BatchRequest object for each object type: DataObject and Reference. They are named accordingly: ObjectsBatchRequest and ReferenceBatchRequest.
Let's create an object of each batch type and import the next 99 articles to Weaviate.
NOTE: I want to bring to your attention again that importing/creating data in batches skips some validation steps and might lead to a corrupted graph.
>>> from weaviate import ObjectsBatchRequest, ReferenceBatchRequest
Let's create a function that adds a single article to the batch request.
>>> def add_article(batch: ObjectsBatchRequest, article_data: dict) -> str:
...
...     article_object = {
...         'title': article_data['title'],
...         'summary': article_data['summary'].replace('\n', '') # remove newline character
...     }
...     article_id = article_data['id']
...
...     # add article to the object batch request
...     batch.add(
...         data_object=article_object,
...         class_name='Article',
...         uuid=article_id
...     )
...
...     return article_id
Let's now create a function that adds a single author to the batch request, if the author was not already created.
>>> def add_author(batch: ObjectsBatchRequest, author_name: str, created_authors: dict) -> str:
...
...     if author_name in created_authors:
...         # return author UUID
...         return created_authors[author_name]
...
...     # generate a UUID for the Author from its name
...     author_id = generate_uuid(author_name)
...
...     # add author to the object batch request
...     batch.add(
...         data_object={'name': author_name},
...         class_name='Author',
...         uuid=author_id
...     )
...
...     created_authors[author_name] = author_id
...     return author_id
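The created_authors dictionary works because the UUID is derived from the author's name, so the same name always maps to the same ID. The following standard-library sketch shows that idea; name_to_uuid is a hypothetical stand-in and is not guaranteed to produce the same IDs as the weaviate helper.

```python
import uuid


def name_to_uuid(name: str) -> str:
    """Derive a repeatable UUID from a name (hypothetical stand-in,
    not guaranteed to match the IDs produced by weaviate's helper)."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, name))


created_authors = {}
for author_name in ['Helen Sullivan', 'Rhonda Garelick', 'Helen Sullivan']:
    if author_name not in created_authors:
        created_authors[author_name] = name_to_uuid(author_name)

print(len(created_authors))  # 2: the repeated name is stored only once
```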
And the last function for adding cross references.
>>> def add_references(batch: ReferenceBatchRequest, article_id: str, author_id: str) -> None:
...     # add references to the reference batch request
...     ## Author -> Article
...     batch.add(
...         from_object_uuid=author_id,
...         from_object_class_name='Author',
...         from_property_name='wroteArticles',
...         to_object_uuid=article_id
...     )
...
...     ## Article -> Author
...     batch.add(
...         from_object_uuid=article_id,
...         from_object_class_name='Article',
...         from_property_name='hasAuthors',
...         to_object_uuid=author_id
...     )
Now we can iterate through the data and import data using batches.
>>> from weaviate.tools import generate_uuid
>>> from tqdm import trange
>>> objects_batch = ObjectsBatchRequest()
>>> reference_batch = ReferenceBatchRequest()
>>> for i in trange(1, 100):
...
...     # add article to batch request
...     article_id = add_article(objects_batch, data[i])
...
...     for author in data[i]['authors']:
...
...         # add author to batch request
...         author_id = add_author(objects_batch, author, created_authors)
...
...         # add cross references to the reference batch
...         add_references(reference_batch, article_id=article_id, author_id=author_id)
...
...     if i % 20 == 0:
...         # submit the object batch request to weaviate, can be done with method '.create_objects'
...         client.batch.create(objects_batch)
...
...         # submit the reference batch request to weaviate, can be done with method '.create_references'
...         client.batch.create(reference_batch)
...
...         # batch requests are not reusable, so we create new ones
...         objects_batch = ObjectsBatchRequest()
...         reference_batch = ReferenceBatchRequest()
>>> # submit any objects that are left
>>> status_objects = client.batch.create(objects_batch)
>>> status_references = client.batch.create(reference_batch)
0%| | 0/99 [00:00<?, ?it/s]
In order to import data in batches we should create a BatchRequest object for the data object type we want to import. A batch request object does not have a size limit, so you decide how many objects to add before submitting it. (Keep in mind that a batch with too many objects might result in a TimeOut error, so keep it to a reasonable size that your Weaviate instance can process.) We also keep track of the authors we already created so we do not create the same author over and over again.
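The "submit every N objects, then submit what is left" pattern used in the loop above does not depend on Weaviate at all. Here is a minimal plain-Python sketch of it; the helper name chunked is an assumption made for this illustration.

```python
from typing import Iterable, Iterator, List


def chunked(items: Iterable, size: int) -> Iterator[List]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError('size must be at least 1')
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # do not forget the leftovers
        yield chunk


for part in chunked(range(1, 8), 3):
    print(part)
# [1, 2, 3]
# [4, 5, 6]
# [7]
```

Each yielded list corresponds to one batch submission, and the final, shorter list plays the role of the "submit what is left" step.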
The call to client.batch.create returns the status of each object that was created. Check it if you want to be sure that everything worked just fine. Also NOTE that a successful batch submission does not mean every object was created: Weaviate can fail to create individual objects even though the submission itself succeeded. For more information read the documentation of client.batch.create.
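One possible way to check the returned status is sketched here. It assumes that each entry of the returned list may carry its errors under result -> errors -> error as a list of dicts with a 'message' key; this shape is an assumption, so compare it with what your own client version returns. The function name and the sample data are made up for the illustration.

```python
def collect_batch_errors(results):
    """Return (index, messages) pairs for entries that report errors.

    Assumes each entry may carry result -> errors -> error, a list of
    dicts with a 'message' key. Verify this against your own output.
    """
    failed = []
    for index, entry in enumerate(results or []):
        errors = (entry.get('result') or {}).get('errors') or {}
        messages = [e.get('message') for e in errors.get('error', [])]
        if messages:
            failed.append((index, messages))
    return failed


# made-up sample in the assumed shape, not real Weaviate output
sample = [
    {'id': 'a', 'result': {}},
    {'id': 'b', 'result': {'errors': {'error': [{'message': 'invalid property'}]}}},
]
print(collect_batch_errors(sample))  # [(1, ['invalid property'])]
```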
5.4.3 Load data using a Batcher object.
(Only in weaviate-client version < 3.0.0.)
The Batcher is a class that automatically submits objects to Weaviate, both DataObjects and References. The Batcher can be found in the weaviate.tools module, and has the following constructor prototype:
Batcher(
    client : weaviate.client.Client,
    batch_size : int=512,
    verbose : bool=False,
    auto_commit_timeout : float=-1.0,
    max_backoff_time : int=300,
    max_request_retries : int=4,
    return_values_callback : Callable=None,
)
See the documentation for an explanation of each argument.
Let's see how it works in action for the rest of the objects from the data we extracted.
>>> from weaviate.tools import Batcher
For a Batcher we only need to add the objects we want to import to Weaviate. The Batcher has a dedicated method to add objects (batcher.add_data_object) and a dedicated method to add references (batcher.add_reference). It also provides a batcher.add method that takes keyword arguments and detects what kind of data you are trying to add. The batcher.add method makes it possible to reuse the add_article, add_author and add_references functions we defined above.
NOTE: Batcher.add was introduced in weaviate-client version 2.3.0.
Let's use the batcher to add the remaining articles and authors from data. Because the Batcher automatically submits objects to Weaviate, we need to ALWAYS .close() it after we are done to make sure we submit whatever remains in the Batcher.
If you are like me, and sometimes forget to close objects, the Batcher can be used as a context manager, i.e. used with the with statement. Let's see how it works with the context manager.
>>> # we still need the 'created_authors' so we do not add the same author twice
>>> with Batcher(client, 30, True) as batcher:
...     for i in trange(100, 200):
...
...         # add article to batcher
...         article_id = add_article(batcher, data[i]) # NOTE the 'batcher' object instead of 'objects_batch'
...
...         for author in data[i]['authors']:
...
...             # add author to batcher
...             author_id = add_author(batcher, author, created_authors) # NOTE the 'batcher' object instead of 'objects_batch'
...
...             # add cross references to the batcher
...             add_references(batcher, article_id=article_id, author_id=author_id) # NOTE the 'batcher' object instead of 'reference_batch'
Batcher object created!
0%| | 0/100 [00:00<?, ?it/s]
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Updated object batch successfully
Updated reference batch successfully
Batcher object closed!
That is it for a Batcher.
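To show why the context manager makes it safe to forget .close(), here is a toy buffer that submits when it is full and once more when the with block ends. AutoFlushBuffer is a hypothetical class written only for this illustration; it is not the weaviate Batcher and leaves out retries, timeouts and references.

```python
class AutoFlushBuffer:
    """Toy buffer: submits when full and once more on exit.

    Hypothetical illustration only, not weaviate's Batcher.
    """

    def __init__(self, submit, batch_size):
        self._submit = submit          # callable that receives a list
        self._batch_size = batch_size
        self._items = []

    def add(self, item):
        self._items.append(item)
        if len(self._items) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._items:
            self._submit(self._items)
            self._items = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.flush()   # submit whatever is still buffered
        return False   # do not swallow exceptions


submitted = []
with AutoFlushBuffer(submitted.append, batch_size=3) as buffer:
    for number in range(7):
        buffer.add(number)

print(submitted)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The last, partial batch is only submitted because the exit hook calls flush, which is the same reason the Batcher has to be closed.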
These are the 3 ways to import data to Weaviate. Choose the one that is appropriate for you and your project.
5.4.4 New Batch object
(Only in weaviate-client version >=3.0.0.)
The new Batch object is accessible the same way: client.batch. As described in section 5.4, this class can be used in 3 different ways, so let's see how exactly we can do it and what the difference between them is.
The first thing we need to do is to re-define the following functions: add_article, add_author and add_references.
>>> from weaviate.batch import Batch # for the typing purposes
>>> from weaviate.util import generate_uuid5 # old way was from weaviate.tools import generate_uuid
>>> def add_article(batch: Batch, article_data: dict) -> str:
...
...     article_object = {
...         'title': article_data['title'],
...         'summary': article_data['summary'].replace('\n', '') # remove newline character
...     }
...     article_id = article_data['id']
...
...     # add article to the object batch request
...     batch.add_data_object( # old way was batch.add(...)
...         data_object=article_object,
...         class_name='Article',
...         uuid=article_id
...     )
...
...     return article_id
>>> def add_author(batch: Batch, author_name: str, created_authors: dict) -> str:
...
...     if author_name in created_authors:
...         # return author UUID
...         return created_authors[author_name]
...
...     # generate a UUID for the Author from its name
...     author_id = generate_uuid5(author_name)
...
...     # add author to the object batch request
...     batch.add_data_object( # old way was batch.add(...)
...         data_object={'name': author_name},
...         class_name='Author',
...         uuid=author_id
...     )
...
...     created_authors[author_name] = author_id
...     return author_id
>>> def add_references(batch: Batch, article_id: str, author_id: str) -> None:
...     # add references to the reference batch request
...     ## Author -> Article
...     batch.add_reference( # old way was batch.add(...)
...         from_object_uuid=author_id,
...         from_object_class_name='Author',
...         from_property_name='wroteArticles',
...         to_object_uuid=article_id
...     )
...
...     ## Article -> Author
...     batch.add_reference( # old way was batch.add(...)
...         from_object_uuid=article_id,
...         from_object_class_name='Article',
...         from_property_name='hasAuthors',
...         to_object_uuid=author_id
...     )
Now that we have changed the above functions to be compatible with the new Batch object, let's see them in action.
a) Manually
This method gives the user full control over when and how to add and create batches. It is very similar to the BatchRequests method of weaviate-client version < 3.0.0, so let's take a look at how exactly we can use it (or migrate to it from the old version).
The context manager variant of the same method is shown after the following code cell. (Do not run both cells!)
>>> from tqdm import trange
>>> for i in trange(1, 100):
...
...     # add article to the batch
...     article_id = add_article(client.batch, data[i])
...
...     for author in data[i]['authors']:
...
...         # add author to the batch
...         author_id = add_author(client.batch, author, created_authors)
...
...         # add cross references to the batch
...         add_references(client.batch, article_id=article_id, author_id=author_id)
...
...     if i % 20 == 0:
...         # submit the objects from the batch to weaviate
...         client.batch.create_objects()
...
...         # submit the references from the batch to weaviate
...         client.batch.create_references()
>>> # submit any objects that are left
>>> status_objects = client.batch.create_objects()
>>> status_references = client.batch.create_references()
>>> # if there is no need for the output from batch creation, one could flush both
>>> # object and references with one call
>>> client.batch.flush()
Alternatively, we could use the Batch instance as a context manager, which calls the flush() method when it exits the context.
>>> from tqdm import trange
>>> with client.batch as batch:
...     for i in trange(1, 100):
...
...         # add article to the batch
...         article_id = add_article(batch, data[i])
...
...         for author in data[i]['authors']:
...
...             # add author to the batch
...             author_id = add_author(batch, author, created_authors)
...
...             # add cross references to the batch
...             add_references(batch, article_id=article_id, author_id=author_id)
...
...         if i % 20 == 0:
...             # submit the objects from the batch to weaviate
...             batch.create_objects()
...
...             # submit the references from the batch to weaviate
...             batch.create_references()
b) Auto-create batches when full
This method is similar to the Batcher object of weaviate-client version < 3.0.0. Let's see how it works.
>>> # we still need the 'created_authors' so we do not add the same author twice
>>> client.batch.configure(
...     batch_size=30,
...     callback=None, # use this argument to set a callback function on the batch creation results
... )
>>> for i in trange(100, 200):
...
...     # add article to the batch
...     article_id = add_article(client.batch, data[i])
...
...     for author in data[i]['authors']:
...
...         # add author to the batch
...         author_id = add_author(client.batch, author, created_authors)
...
...         # add cross references to the batch
...         add_references(client.batch, article_id=article_id, author_id=author_id)
>>> client.batch.flush()
Of course we could use a context manager here too.
>>> # we still need the 'created_authors' so we do not add the same author twice
>>> client.batch.configure(
...     batch_size=30,
...     callback=None, # use this argument to set a callback function on the batch creation results
... )
>>> with client.batch(batch_size=30) as batch: # the client.batch(batch_size=30) is the same as client.batch.configure(batch_size=30)
...     for i in trange(100, 200):
...
...         # add article to the batch
...         article_id = add_article(batch, data[i])
...
...         for author in data[i]['authors']:
...
...             # add author to the batch
...             author_id = add_author(batch, author, created_authors)
...
...             # add cross references to the batch
...             add_references(batch, article_id=article_id, author_id=author_id)
c) Auto-create batches using dynamic batching, i.e. the batch size is adjusted every time a batch is created to avoid any Timeout errors.
This method works in the same manner as the method described in b). We are not going to run any cell with it, but to enable dynamic batching, all one needs to do is provide another argument to the configure/__call__ method.
Example:
client.batch.configure(
    batch_size=30,
    dynamic=True
)
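To give an intuition for what "adjusting the batch size" can mean, here is a toy rule that scales the size by how long the previous batch took compared to a target duration. This is a hypothetical sketch: the function, its parameters and the rule itself are assumptions made for illustration, and the client's own adjustment logic may differ.

```python
def next_batch_size(current_size, elapsed_seconds, target_seconds,
                    min_size=1, max_size=1000):
    """Scale the batch size so the next request takes about target_seconds.

    Hypothetical rule for illustration; the client's own adjustment
    logic may differ.
    """
    if elapsed_seconds <= 0:
        return current_size
    scaled = round(current_size * target_seconds / elapsed_seconds)
    return max(min_size, min(max_size, scaled))


print(next_batch_size(30, elapsed_seconds=6.0, target_seconds=3.0))  # 15
print(next_batch_size(30, elapsed_seconds=1.0, target_seconds=3.0))  # 90
```

A slow batch shrinks the next one and a fast batch lets it grow, which is the general idea behind avoiding timeouts.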
To see the full capabilities of this new Batch object, see the full documentation here or execute the help function on Batch and/or any Batch method, like this: help(Batch)
5.5. Query data.
Now we have the data imported and ready to be queried. Data can be queried by using the query attribute of the client object (client.query).
The data is queried using GraphQL syntax, and can be done in three different ways:
- GET: query that gets objects from Weaviate. More information here.
Use client.query.get(class_name, properties).OTHER_OPTIONAL_FILTERS.do()
- AGGREGATE: query that aggregates data. More information here.
Use client.query.aggregate(class_name, properties).OTHER_OPTIONAL_FILTERS.do()
- Or use a GraphQL query represented as a str.
Use client.query.raw()
NOTE: Both .get and .aggregate require the call of the .do() method to run the query. .raw() does NOT.
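For the third option, the query is just a string. As a minimal sketch, a simple Get query could be assembled like this; build_get_query is a hypothetical helper written for this article, not part of the client.

```python
def build_get_query(class_name, properties, limit=None):
    """Build a simple GraphQL Get query string (illustrative helper)."""
    arguments = f'(limit: {limit})' if limit is not None else ''
    fields = ' '.join(properties)
    return f'{{ Get {{ {class_name}{arguments} {{ {fields} }} }} }}'


query = build_get_query('Article', ['title'], limit=2)
print(query)  # { Get { Article(limit: 2) { title } } }
```

A string like this is what would be passed to client.query.raw().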
Let's now get the Article objects, requesting only their titles.
5.5.1 GET
>>> result = client.query.get(class_name='Article', properties="title")\
...     .do()
>>> print(f"Number of articles returned: {len(result['data']['Get']['Article'])}")
Number of articles returned: 100
>>> result
{'data': {'Get': {'Article': [{'title': "The soft power impact of Ruth Bader Ginsburg's decorative collars"},
{'title': 'After centuries in the ground, these French oaks will soon form part of the new spire at Notre Dame'},
{'title': 'With tradition and new tech, these Japanese designers are crafting more sustainably made clothing'},
{'title': "LEGO won't make modern war machines, but others are picking up the pieces"},
{'title': 'Remember when Jane Fonda revolutionized exercise in a leotard and leg warmers?'},
{'title': 'Be brave, Taylor tells Manchester City Women before Barcelona return leg'},
{'title': "'In the middle of a war zone': thousands flee as Venezuela troops and Colombia rebels clash"},
{'title': "Destruction of world's forests increased sharply in 2020"},
{'title': "Climate crisis 'likely cause' of early cherry blossom in Japan"},
{'title': "What's in a vaccine and what does it do to your body?"},
{'title': 'Zunar and Fahmi Reza: the cartoonists who helped take down Najib Razak'},
{'title': 'Downing Street suggests UK should be seen as model of racial equality'},
{'title': "'It's hard, we're neighbours': the coalmine polluting friendships on Poland's borders"},
{'title': 'Why we are all attracted to conspiracy theories – video'},
{'title': 'UK criticised for ignoring Paris climate goals in infrastructure decisions'},
{'title': 'GameStop raids Amazon for another heavy-hitter exec'},
{'title': 'The World Economic Forum says it will take an extra 36 years to close the gender gap'},
{'title': 'Share a story with the Guardian'},
{'title': "Paddleboarding and a released alligator: Tuesday's best photos"},
{'title': "Ballerina Chloé Lopes Gomes alleged racism at her company. Now she says it's time for change"},
{'title': 'How ancient Egyptian cosmetics influenced our beauty rituals'},
{'title': 'Why is Australia trying to regulate Google and Facebook – video explainer'},
{'title': 'Back in the swing and the swim: England returns to outdoor sport – in pictures'},
{'title': "Biden's tariffs threat shows how far Brexit Britain is from controlling its own destiny"},
{'title': 'Our cities may never look the same again after the pandemic'},
{'title': "World's first digital NFT house sells for $500,000"},
{'title': "The untold story of Ann Lowe, the Black designer behind Jackie Kennedy's wedding dress"},
{'title': 'Deliveroo shares slump on stock market debut'},
{'title': 'Hundreds of people missing after fire in Rohingya refugee camp in Bangladesh – video'},
{'title': "Why Beijing's Serpentine Pavilion signals a new age for Chinese architecture"},
{'title': 'My Brother’s Keeper: a former Guantánamo detainee, his guard and their unlikely friendship - video'},
{'title': 'Has your family been affected by the Brazilian or South African variants of Covid-19?'},
{'title': 'New Zealand raises minimum wage and increases taxes on the rich'},
{'title': 'Deliveroo shares plunge on market debut - business live'},
{'title': "Seoul's burgeoning drag scene confronts conservative attitudes"},
{'title': "David Hockney at 80: An encounter with the world's most popular artist"},
{'title': 'Una avanzada licuadora Nutribullet al mejor precio'},
{'title': "The 'fox eye' beauty trend continues to spread online. But critics insist it's racist"},
{'title': 'Lupita: the indigenous activist leading a new generation of Mexican women – video'},
{'title': "Hong Kong's vast $3.8 billion rain-tunnel network"},
{'title': 'See how tattoo art has changed since the 18th century'},
{'title': "Hong Kong Disneyland's new castle is an architectural vision of diversity"},
{'title': 'Rosamund Pike in "I Care a Lot" and six more recommendations if you love an antiheroine'},
{'title': 'How NFTs are fueling a digital art boom'},
{'title': "'We’re not little kids': leading agents ready for war with Fifa over new rules"},
{'title': 'A photographic history of men in love'},
{'title': "Hedge fund meltdown: Elizabeth Warren suggests regulators should've seen it coming"},
{'title': 'Palau to welcome first tourists in a year with presidential escort'},
{'title': 'Coronavirus: how wealthy nations are creating a ‘vaccine apartheid’'},
{'title': 'UK economy poised to recover after Covid-19 second wave'},
{'title': 'Missed it by that much: Australia falls 3.4m doses short of 4m vaccination target by end of March'},
{'title': "Meet North Korea's art dealer to the West"},
{'title': 'Why Australia remains confident in AstraZeneca vaccine as two countries put rollout on ice'},
{'title': 'Exclusive: Jamie Dimon speaks out on voting rights even as many CEOs remain silent'},
{'title': "Green investing 'is definitely not going to work’, says ex-BlackRock executive"},
{'title': 'European commission says AstraZeneca not obliged to prioritise vaccines for UK'},
{'title': '‘Honey, I forgot to duck’: the attempt to assassinate Ronald Reagan, 40 years on'},
{'title': 'Graba tus próximas aventuras con esta GoPro de oferta'},
{'title': "Europe’s 'baby bust': can paying for pregnancies save Greece? - video"},
{'title': "After the deluge: NSW's flood disaster victims begin cleanup – in pictures"},
{'title': "Wolfsburg v Chelsea: Women's Champions League quarter-final – live!"},
{'title': "'A parallel universe': the rickety pleasures of America's backroads - in pictures"},
{'title': "'Nomadland': Chloé Zhao and crew reveal how they made one of the year's best films"},
{'title': 'Seaspiracy: Netflix documentary accused of misrepresentation by participants'},
{'title': 'Unblocking the Suez canal – podcast'},
{'title': 'About half of people in UK now have antibodies against coronavirus'},
{'title': 'Top 10 books about New York | Craig Taylor'},
{'title': 'Is Moldova ready to embrace an unmarried, childfree president? | Europe’s baby bust – video'},
{'title': 'This woman left North Korea 70 years ago. Now virtual reality has helped her return'},
{'title': "'Immediate and drastic.' The climate crisis is seriously spooking economists"},
{'title': "Under Xi's rule, what is China's image of the 'ideal' man?"},
{'title': "'Hamlet' in the skies? The story behind Taiwan's newest airline, STARLUX"},
{'title': 'Pokémon at 25: How 151 fictional species took over the world'},
{'title': '‘Similar to having a baby, the euphoria’: rediscovery of rare gecko delights experts'},
{'title': 'How Budapest became a fine dining force to be reckoned with'},
{'title': "Teen who filmed killing tells court George Floyd was 'begging for his life'"},
{'title': 'Empowering, alluring, degenerate? The evolution of red lipstick'},
{'title': 'Real-world locations straight out of a Wes Anderson movie'},
{'title': 'The club kid designer dressing the most powerful women in US politics'},
{'title': "Fashion gaffes are a reflection of the industry's diversity problem"},
{'title': 'Our colorful clothes are killing the environment'},
{'title': 'Hazte con un Roku SE a mitad de precio'},
{'title': 'Why does Bollywood use the offensive practice of brownface in movies?'},
{'title': "Why Washington Football Team may stick with their 'so bad it’s good' name"},
{'title': "Graphic novel on the Tiananmen Massacre shows medium's power to capture history"},
{'title': 'Could a Norway boycott of the Qatar World Cup change the future of football?'},
{'title': 'Las 5 cosas que debes saber este 31 de marzo: Así es una instalación que alberga a menores migrantes'},
{'title': 'Multiplica el alcance de tu Wi-Fi con este repetidor rebajado un 50%'},
{'title': "'Lack of perspective': why Ursula von der Leyen's EU vaccine strategy is failing"},
{'title': 'Redder Days by Sue Rainsford review – waiting for the end of the world'},
{'title': 'Dita Von Teese and Winnie Harlow star in a star-studded fashion film'},
{'title': 'The Suez fiasco shows why ever bigger container ships are a problem'},
{'title': 'The most anticipated buildings set to shape the world in 2020'},
{'title': "Elite minority of frequent flyers 'cause most of aviation's climate damage'"},
{'title': "I love my boyfriend – but I really don't want to have sex with him"},
{'title': 'Amazon-backed Deliveroo crashes in London IPO'},
{'title': 'People are calling for museums to be abolished. Can whitewashed American history be rewritten?'},
{'title': 'Ruby Rose on gender, bullying and breaking free: ‘I had a problem with authority’'},
{'title': 'Merkel, Macron and Putin in talks on using Sputnik V jab in Europe, says Kremlin'},
{'title': 'After fighting cancer, Tracey Emin returns to the art world with raw, emotional works'}]}},
'errors': None}
As we can see, the result contains only 100 articles; this is due to the default limit of 100. Let's change it.
>>> result = client.query.get(class_name='Article', properties="title")\
...     .with_limit(200)\
...     .do()
>>> print(f"Number of articles returned: {len(result['data']['Get']['Article'])}")
Number of articles returned: 200
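The returned dictionary is nested under data -> Get -> class name, so a small helper can make it easier to work with. The sketch below uses a trimmed-down sample in the same shape as the output above; the helper name extract_property is an assumption made for this illustration.

```python
def extract_property(result, class_name, property_name):
    """Pull one property out of every object in a Get result."""
    if result.get('errors'):
        raise RuntimeError(f"query failed: {result['errors']}")
    objects = result['data']['Get'][class_name]
    return [obj[property_name] for obj in objects]


# trimmed-down sample in the same shape as the output above
sample = {
    'data': {'Get': {'Article': [
        {'title': 'Share a story with the Guardian'},
        {'title': 'A photographic history of men in love'},
    ]}},
    'errors': None,
}
print(extract_property(sample, 'Article', 'title'))
```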
We can do much more by stacking multiple methods. The available methods for .get are:
- .with_limit - set another limit of returned objects.
- .with_near_object - get objects that are similar to the object passed to this method.
- .with_near_text - get objects that are similar to the text passed to this method.
- .with_near_vector - get objects that are similar to the vector passed to this method.
- .with_where - get objects that are filtered using the Where filter, see this link for examples and explanation.
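As an illustration of what a Where filter can look like, here is a sketch that builds one as a plain dictionary. The key names ('path', 'operator', 'valueString') and the 'Like' operator are taken from the Where filter documentation as I understand it for this Weaviate version; treat them as assumptions and check the linked documentation. The helper name like_filter is hypothetical.

```python
def like_filter(property_name, pattern):
    """Build a Where filter dict matching a string property by pattern.

    The key names ('path', 'operator', 'valueString') are assumed from
    the Where filter documentation of that period.
    """
    return {
        'path': [property_name],
        'operator': 'Like',
        'valueString': pattern,
    }


where = like_filter('title', '*vaccine*')
print(where)
```

A dictionary like this would then be passed to .with_where(where) before calling .do().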
Also, instead of .do() one can use the .build() method, which returns the GraphQL query as a string. This string can be passed to the .raw() method.
NOTE: Only one .with_near_* can be used per query.
>>> client.query.get(class_name='Article', properties="title")\
...     .with_limit(5)\
...     .with_near_text({'concepts': ['Fashion']})\
...     .do()
{'data': {'Get': {'Article': [{'title': 'Dita Von Teese and Winnie Harlow star in a star-studded fashion film'},
{'title': "Fashion gaffes are a reflection of the industry's diversity problem"},
{'title': "Bottega Veneta ditches Instagram to set up 'digital journal'"},
{'title': 'Heir to O? Drew Barrymore launches lifestyle magazine'},
{'title': 'Our colorful clothes are killing the environment'}]}},
'errors': None}
With Get we can see the cross references of each object. We are going to use the .raw() method for this since it is not possible with any existing .with_* method.
>>> query = """
... {
...   Get {
...     Article(limit: 2) {
...       title
...
...       hasAuthors { # the reference
...         ... on Author { # you always set the destination class
...           name # the property related to target class
...         }
...       }
...     }
...   }
... }
... """
>>> prettify(client.query.raw(query)['data']['Get']['Article'])
[
{
"hasAuthors": [
{
"name": "Rhonda Garelick"
}
],
"title": "The soft power impact of Ruth Bader Ginsburg's decorative collars"
},
{
"hasAuthors": [
{
"name": "Saskya Vandoorne"
}
],
"title": "After centuries in the ground, these French oaks will soon form part of the new spire at Notre Dame"
}
]
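Once the cross references are part of the result, they can be flattened with ordinary Python. A minimal sketch, using the two articles from the output above as sample data (the helper name authors_by_title is made up for this illustration):

```python
def authors_by_title(articles):
    """Map each article title to the list of its author names."""
    return {
        article['title']: [a['name'] for a in article.get('hasAuthors') or []]
        for article in articles
    }


# sample copied from the output above
articles = [
    {'hasAuthors': [{'name': 'Rhonda Garelick'}],
     'title': "The soft power impact of Ruth Bader Ginsburg's decorative collars"},
    {'hasAuthors': [{'name': 'Saskya Vandoorne'}],
     'title': 'After centuries in the ground, these French oaks will soon '
              'form part of the new spire at Notre Dame'},
]
for title, names in authors_by_title(articles).items():
    print(names, '->', title)
```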
5.5.2 AGGREGATE
We can use .aggregate to count the number of objects that satisfy a specific condition.
>>> # no filter, count all objects of class Article
>>> client.query.aggregate(class_name='Article')\
...     .with_meta_count()\
...     .do()
{'data': {'Aggregate': {'Article': [{'meta': {'count': 200}}]}},
'errors': None}
>>> # no filter, count all objects of class Author
>>> client.query.aggregate(class_name='Author')\
...     .with_meta_count()\
...     .do()
{'data': {'Aggregate': {'Author': [{'meta': {'count': 258}}]}}, 'errors': None}
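The count sits a few levels deep in the returned dictionary. A small sketch for reading it out, using the two results shown above as input (meta_count is a hypothetical helper name):

```python
def meta_count(result, class_name):
    """Read the object count out of an Aggregate result."""
    groups = result['data']['Aggregate'][class_name]
    return groups[0]['meta']['count']


# the two results shown above
articles = {'data': {'Aggregate': {'Article': [{'meta': {'count': 200}}]}}, 'errors': None}
authors = {'data': {'Aggregate': {'Author': [{'meta': {'count': 258}}]}}, 'errors': None}
print(meta_count(articles, 'Article'), meta_count(authors, 'Author'))  # 200 258
```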
Here are the methods that are supported by .aggregate:
- .with_meta_count - sets meta count to True. Used to count objects per filtered group.
- .with_fields - fields to return by the aggregated query.
- .with_group_by_filter - set a GroupBy filter. See this link for more information about the filter.
- .with_where - aggregate objects using a Where filter. See this link for examples and explanation.
Of course when it comes to querying data, the possibilities are endless. Have fun experimenting with these capabilities.
(The jupyter-notebook can be found here.)
Feel free to check out and contribute to weaviate-client on GitHub.